At Skytap, we face the same challenges as our customers in developing and releasing high-quality software quickly.
Often, the most painful stages of the software development lifecycle are the integration phase (when independently developed components run together for the first time) and the release phase (when new features face production workloads for the first time). Like other organizations, we’ve evolved a variety of DevOps techniques for mitigating these problems.
We currently use continuous integration (CI) and continuous delivery (CD) techniques, along with several of the unique features of the Skytap platform itself, to create entire environments that resemble the production platform. Engineers start copies of these ready-to-use environments on demand and use them to test code under production-like conditions, including a variety of (possibly destructive) test scenarios. This frees developers to test their code more thoroughly, which reduces the QA burden and speeds up the release process.
This article outlines the continuous delivery workflow that we use to produce these environments and discusses how this workflow has improved our release velocity and reliability. Part 2 takes a deeper dive into how CI and CD are implemented with Skytap.
Our Problem Scope
The challenges that we face in scaling our software and organization aren’t unique. As with many organizations, as Skytap grows, the breadth and complexity of our platform necessarily increase. Additionally, as we add more engineers, we are able to develop features faster than they can be tested and released under a strictly linear, gated process. This introduces a bottleneck, which may leave developers in a wait state during the test and release phases. It’s counterproductive to have chunks of an engineering organization idling while they wait to ship their work, and it’s expensive to coordinate complex scenarios where more and more features are added to the release.
Our Solution
To remove bottlenecks and improve the flow of code, we reviewed our integration and delivery processes. We decided that ideally, developers should strive to check in code frequently and that code should be built immediately (continuous integration). The result of that build—the artifact—should pass some test criteria and should be automatically packaged as a discrete piece of software. This packaged artifact then becomes a deployment candidate (continuous delivery).
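The flow described above (check in, build immediately, gate on tests, package a deployment candidate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual build system: the names `Artifact`, `DeploymentCandidate`, `build`, `package`, and `on_check_in` are hypothetical, and the "build" step simply encodes the source text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional
import hashlib


@dataclass(frozen=True)
class Artifact:
    """Output of a build: a name, the commit it came from, and its content."""
    name: str
    commit: str
    payload: bytes


@dataclass(frozen=True)
class DeploymentCandidate:
    """A tested artifact, packaged with a checksum so it can be traced."""
    artifact: Artifact
    checksum: str


def build(name: str, commit: str, source: str) -> Artifact:
    # Hypothetical stand-in for a real build step: just encode the source.
    return Artifact(name=name, commit=commit, payload=source.encode("utf-8"))


def package(artifact: Artifact) -> DeploymentCandidate:
    # Packaging here means attaching a SHA-256 checksum of the payload.
    checksum = hashlib.sha256(artifact.payload).hexdigest()
    return DeploymentCandidate(artifact=artifact, checksum=checksum)


def on_check_in(
    name: str,
    commit: str,
    source: str,
    tests: Iterable[Callable[[Artifact], bool]],
) -> Optional[DeploymentCandidate]:
    """Continuous integration: build on every check-in and run the tests.
    Continuous delivery: only a passing artifact becomes a candidate."""
    artifact = build(name, commit, source)
    if not all(test(artifact) for test in tests):
        return None  # the build is rejected; nothing is packaged
    return package(artifact)
```

Calling `on_check_in("svc", "abc123", "print('hi')", [lambda a: len(a.payload) > 0])` returns a `DeploymentCandidate`, while a check-in that fails any test returns `None` and never reaches packaging.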
We decided that to truly leverage the benefits of CI and CD, we needed to provide developers and testers with easy, safe access to their own clone of the production environment. We expected that this would enable engineers to:
Improve code quality
- Comparative testing produces fewer false positive/negative results; Skytap environments, while virtualized, very closely mimic the behavior of physical environments, and automated configuration management decreases drift between production and pre-prod environments
- Access to and visibility into all affected portions of the stack allow engineers to better understand the system-wide impact of their changes
- Engineers are able to run more extensive automated tests, test against other features, run simulated production workloads, and perform dangerous or destructive experiments, among other things
- Code is checked in to main project branches more frequently; smaller incremental changes have fewer compounding issues
- When problems do occur, they are visible earlier in the process and are smaller in scope
Enhance cross-functional collaboration
- Test burden is shifted, in part, to the developers, who are likely to be most familiar with the causes of and solutions to problems that appear with their code; this frees up QA teams to spend more time developing robust test scenarios and to advise development on effective test techniques
- Frequent delivery of discrete, fully functional environments with continuously integrated changes allows teams to use each other’s work quickly, instead of waiting for delivery to a shared integration environment
- The self-service nature of pre-packaged environments allows developers and operations to focus on the most important interactions between the platform and infrastructure. This helps to identify potential issues well before code reaches the production environment, and reduces the operational load inherent in maintaining multiple pre-prod environments
Increase release speed
- Front-loading the effort of addressing bugs makes them cheaper and faster to fix. This reduces the QA time spent on final integration testing and pre-release verification
- Releases are less risky, as changes have already been tested in the context of the platform at large. The development of smaller viable changes is simpler; a small change can be tested in a production environment clone by smaller teams, with simpler cross-team collaboration
Conceptually, this sounds great. To achieve it, we needed a set of tools and a process that could scale with us, and that process had to be largely automatic and easy to replicate.
Our Tool Set
Like most software companies, we heavily leverage distributed source control (Mercurial and Git in our case) and configurable build servers (Jenkins). We manage our build jobs with configuration and a job construction tool (Jenkins Job Builder). We make use of configuration management tools (like Ansible and Puppet), and we modularize our platform services with containerization tools (Docker, Kubernetes).
We already had many of the pieces in place to begin delivering full environments to engineers. To pull everything together, we needed to integrate these tools with our internally developed automated environment construction tool (Jenga) and add the real secret sauce: Skytap Templates.
With these tools and the CI/CD techniques we’ll explore in-depth in part two, we are now able to produce several nightly caches of our full stack — including the supporting infrastructure for each — and save these as Skytap Templates.
Each template contains a production environment clone with (currently) anywhere from around 40 VMs and three networks, up to around 200 VMs with six networks, and each captures a fully functional snapshot of the Skytap platform (which, conveniently, also runs on the Skytap platform; we’re testing all the way down!). We’ve abstracted one step further away from continuous delivery of software artifacts; we’re instead delivering entire environments running the full platform.
What Does All of That Get Us?
If you’re a Skytap engineer, you simply need to copy one of those Jenga-constructed Skytap templates and run it. This provides an advantage over using provisioning tools to produce environments on the fly because provisioning environments is a slow, complex process, and a lot can go wrong. Engineers should have fast, easy access to production environment clones, and should be able to treat them as disposable when problems inevitably appear—otherwise, you’re losing one of the primary benefits of virtualization.
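To make the "copy, use, throw away" idea concrete, here is a minimal Python sketch of a disposable environment. It deliberately does not use the Skytap API: `FakeCloud`, `copy_template`, `destroy`, and `disposable_environment` are hypothetical names backed by an in-memory stand-in, and the template name and VM count in the usage note are illustrative only. The point is the pattern: the environment is always discarded, even when a test breaks it.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator


@dataclass
class FakeCloud:
    """In-memory stand-in for an environment service (NOT the Skytap API)."""
    templates: Dict[str, int]  # template name -> number of VMs it contains
    running: Dict[int, str] = field(default_factory=dict)
    _ids: Iterator[int] = field(default_factory=lambda: count(1))

    def copy_template(self, template: str) -> int:
        # Copying a prebuilt template is modeled as a cheap bookkeeping step.
        if template not in self.templates:
            raise KeyError(f"unknown template: {template}")
        env_id = next(self._ids)
        self.running[env_id] = template
        return env_id

    def destroy(self, env_id: int) -> None:
        # Destroying an environment that is already gone is not an error.
        self.running.pop(env_id, None)


@contextmanager
def disposable_environment(cloud: FakeCloud, template: str) -> Iterator[int]:
    """Start an environment from a template and always throw it away after."""
    env_id = cloud.copy_template(template)
    try:
        yield env_id
    finally:
        cloud.destroy(env_id)
```

With `cloud = FakeCloud(templates={"nightly-full-stack": 40})`, a block such as `with disposable_environment(cloud, "nightly-full-stack") as env_id: ...` gives the test its own environment, and `cloud.running` is empty again afterwards whether the block succeeded or raised.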
In just a few minutes, our engineers have an environment running a fully functional instance of Skytap. This is an incredible productivity boost for many common activities:
- Comparing two releases to understand regressions
- Producing a development environment that matches production
- Testing release scenarios
- Destructive testing and experimentation: if your cost to produce a new environment is nearly zero, you can break whatever you want!
- Integration testing
- Platform exploration: onboarding new engineers is simplified because they can be trained in disposable environments
In Part 2 of this series, we’ll dive deeper into how our build system creates these nightly templates.
The Results
Ultimately, we’ve been able to increase our release cadence and reliability, without sacrificing the ability of discrete teams to work independently of each other. Continuous Integration at the team level allows changes to be immediately merged into mainline development, and this, in turn, surfaces problems while the change is an active work item. Raising problems early in development simplifies resolution because the context surrounding the issue is still fresh in everyone’s mind.
Decoupling code check-ins and integration from the release process has given us ancillary benefits in code reviews. Our reviews are focused on professional growth and code quality, rather than being a gate that blocks our check-in. It’s always unfortunate when you make a mistake and break a build, but if your builds are cheap and frequent with fast and visible feedback, developers can safely treat check-in and integration as a separate activity from review. For us, this has been a boon to our culture of collaboration: code reviews are about feedback and growth, rather than being release-oriented transactions or gates.
Automating environment construction and making access to environments a cheap self-service task has allowed us to significantly reduce the load on our operations team. Without these tools, operations might be required to service requests to create and maintain dev/test environments. Additionally, it’s much simpler for our dev and ops teams to collaborate.
With functional environments that can break without impacting operational integrity, operations can advise development without worry that this advice will be misapplied to the production environment, and without the burden of resolving problems in shared dev/test environments when something goes wrong (again, we can just throw away the environment and start fresh). Constant operational support of non-prod environments doesn’t scale well and can lead to hostility between development and operations. DevOps is about the opposite!
Continuously delivering changes from each individual line of development into freshly constructed environments each day has helped us surface deployment and service integration issues more quickly. Your project’s CI process may complete successfully and the code may pass review, but you’re still likely to discover problems that only occur when you plug your service into the platform. By continuously exercising these changes together, we’re now able to see both the adverse effects and the added value of current work very quickly.
Additionally, disconnecting the running environment from the process used to produce working environments has allowed us to easily clone environment state. These clones simplify A/B comparison (e.g., “did this problem exist before, or is it new?”). Guaranteeing the same state in various clones also makes it simple to do destructive testing without impacting other teams. If you’ve ever had dev and test stalled because a shared integration environment was down, you’ll understand how much we like being able to let individuals and teams create and destroy environments at will. Plus, we’re able to run automated system or acceptance testing in single-purpose environments, without tests being ruined by activity in shared environments!
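The "did this problem exist before, or is it new?" question reduces to running the same check against two clones and comparing the outcomes. The Python sketch below illustrates that reasoning only; it assumes each environment can be modeled as a mapping from service name to deployed version, and `Verdict` and `compare_clones` are hypothetical names rather than part of any Skytap tooling.

```python
from enum import Enum
from typing import Callable, Dict

# Hypothetical model of an environment: service name -> deployed version.
Environment = Dict[str, str]


class Verdict(Enum):
    HEALTHY = "check passes in both clones"
    REGRESSION = "problem is new in the candidate"
    PRE_EXISTING = "problem already existed in the baseline"
    FIXED = "problem was fixed by the candidate"


def compare_clones(
    check: Callable[[Environment], bool],
    baseline: Environment,
    candidate: Environment,
) -> Verdict:
    """Run one check against two clones and classify the difference."""
    baseline_ok = check(baseline)
    candidate_ok = check(candidate)
    if baseline_ok and candidate_ok:
        return Verdict.HEALTHY
    if baseline_ok:
        return Verdict.REGRESSION   # passed before, fails now
    if candidate_ok:
        return Verdict.FIXED        # failed before, passes now
    return Verdict.PRE_EXISTING     # fails in both clones
```

Because each clone starts from the same known state, the check itself is the only variable, which is what makes the verdict trustworthy. In a shared environment, activity from other teams could change the result between the two runs.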
Finally, by leveraging Skytap Templates to continuously deliver fully functional snapshots of the current and upcoming platform, we’ve significantly reduced the amount of time it takes to get a functional clone of most environments. Even with powerful tools like Jenga, producing an environment from scratch often requires a lot of expertise about the toolchain (dealing with Puppet errors, for example), and a lot of knowledge about the infrastructure.
While slogging through these problems can be an instructive exercise, it’s time-consuming and can place a heavy support load on our infrastructure and provisioning experts. Delivering functional templates has reduced the time it takes to get a working dev stack for an individual from days or weeks, down to about an hour. Our engineers are free to spend time developing features, not wrangling their bespoke environments – and since they’re using automatically constructed clones, their environmental assumptions are far more likely to match the reality of integration and production environments when it comes time to release their changes.
Building and maintaining all of this CI and CD infrastructure takes time and effort, of course, but it pays strong dividends in the form of increased individual productivity, increased ability for teams to work in parallel without creating integration headaches, and increased confidence and predictability of our releases. One of the great things about being a cloud provider is that we frequently face the same challenges as our customers. It’s a constant pleasure to know that we’re able to use the very tools we provide as key components in our solutions to these problems.
We encourage you to check out Part 2 of this story to explore some of these solutions in more detail, with the hope that our success will be your success as well!