Knocking Down Our Own Skytap Stack with “Jenga”
Today’s column falls firmly under “eating our own dog food” at Skytap: specifically, how our own development and test teams use Skytap APIs to quickly automate the creation, provisioning and teardown of instant, production-like Skytap lab implementations, on Skytap – in conjunction with Puppet.
Skytap engineers call the project “Jenga” which conveys the throwaway nature of the game itself where you set up and knock down a stack of reusable blocks over and over again. Like in the game, in developing new features, making a mistake and knocking over the stack is an expected failure. There really should be no costly property damage or harm to anyone else, nor an extensive labor effort to reset the stack.
I met up with Skytap engineers Lara Martin and Petr Novodvorskiy to talk about how we use Jenga in our software build and release process, after seeing Lara present the best practice at our last Customer Advisory Board meeting.
“We needed a way to create pre-production environments for Skytap itself – deployments we could all use in a consistent, repeatable way so we could use it for testing deployments, discovering failures and having different versions that we can use for sandboxing new features,” said Lara. “We don’t want to spend time fixing a stack, and we want to be able to share a specific version of Skytap with other team members.”
It stands to reason that we’d want to build a Skytap-on-Skytap-for-Skytap solution and follow the best practices of Infrastructure-as-Code (IaC) while taking advantage of faster automation and collaboration in Skytap. Our developers build as many Jenga stacks as they need from scratch using Puppet-defined configurations for specific Skytap versions, and those stacks can then be easily cloned or shared with others.
“It helps us keep track of the changes that happen within our version control,” says Lara. “We can say that this is what the Skytap stack looks like today, and here’s what changed, and these errors happened in this version, and these errors no longer occur.”
Any Skytap engineer or PM can install a current Skytap development stack on their laptop or in a VM in a Skytap lab.
Here’s how it works, in three steps:
1. Jenga can pull base templates of virtual machines, networks, etc. from a local library.
2. Then a Puppet script configures the environment for automated provisioning.
3. Last, the Skytap stack is bootstrapped and brought up and running – databases, network and all.
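To make the three steps concrete, here is a minimal dry-run sketch in Python. It is not Jenga’s actual code: the names StackPlan and plan_stack, the step labels and the template names are all hypothetical, and nothing here calls Skytap or Puppet. It only records the order in which such a tool might act.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StackPlan:
    """Ordered actions for one throwaway stack (hypothetical structure)."""
    version: str
    steps: List[str] = field(default_factory=list)


def plan_stack(version: str, templates: List[str]) -> StackPlan:
    """Build a dry-run plan; the steps are recorded, never executed."""
    plan = StackPlan(version=version)
    # Step 1: pull base templates (VMs, networks, etc.) from a local library.
    for name in templates:
        plan.steps.append(f"pull-template {name}")
    # Step 2: apply the Puppet-defined configuration for this version.
    plan.steps.append(f"apply-puppet {version}")
    # Step 3: bootstrap the stack so it comes up running.
    plan.steps.append("bootstrap-stack")
    return plan


if __name__ == "__main__":
    # Illustrative template names only; real template names are not given here.
    for step in plan_stack("dev", ["base-vm", "base-network"]).steps:
        print(step)
```

Recording the plan before executing anything mirrors the idea of a dry run: the ordering can be reviewed, or compared between versions, without touching any infrastructure.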
The resulting dev stack is pretty impressive in scale – 58 VMs, 126 SVMs and 2.5TB by a recent count (we won’t show every detail of it here, though; that’s out of scope).
Lara demonstrated a dry run for me in a couple of minutes, and it looks like any other shell process as it finds the environment definition, builds out nodes, etc. If you were to select a complete pre-prod run, that batch process would take a couple of hours to reach a stack of that size. Like any active enterprise software company, Skytap is always adding new capabilities and new service components to the solution, owned by different individuals and teams, and that complexity needs to be represented in the labs our dev/test teams are using.
“The bigger Skytap gets, and the more complicated it gets, the fewer people we can actually expect to know how to set up our whole stack,” said Lara. “Now they don’t need to know how to set up everyone else’s components, they can just build their own. We used to spend a lot of time setting up environments, especially for new developers, which could take days. Now they can get it the first day.”
“We have around 150 services in our system, and numerous ways to deploy them, and we had numerous pages in our intranet describing these processes, and that information could also get out of date,” said Petr. “There was no consistent way to know how to deploy everything before – but now with builds every night, we automatically know if a service breaks that there’s a deployment problem.”
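The nightly check Petr describes can be pictured with a small Python sketch. This assumes each service exposes some deploy callable that raises on failure; the function name nightly_deploy_check and the service names in the usage lines are hypothetical and are not Skytap’s real tooling.

```python
from typing import Callable, Dict


def nightly_deploy_check(deployers: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every service's deploy step and return {service: error message}."""
    failures: Dict[str, str] = {}
    for name, deploy in deployers.items():
        try:
            deploy()
        except Exception as exc:  # any failure counts as a deployment problem
            failures[name] = str(exc)
    return failures


if __name__ == "__main__":
    def deploy_ok() -> None:
        pass  # stands in for a deploy that succeeds

    def deploy_broken() -> None:
        raise RuntimeError("missing config")  # stands in for a broken deploy

    # Hypothetical service names, for illustration only.
    print(nightly_deploy_check({"web": deploy_ok, "load-balancer": deploy_broken}))
```

Because every service is attempted even after one fails, a single night’s run reports all broken deployments at once rather than stopping at the first.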
Novel ways to use Skytap’s Jenga environments include our “Fail Fridays” testing exercise where environment components are changed, pushed past their limits or purposefully failed to experiment with failure conditions and potential outcomes.
“Our old Fail Friday would itself fail when we tried to use synthetic or stubbed out services and data,” says Petr. “Now we’ve increased our success rate, and spend more time working on features and testing them, and less time on fixing the stack.”
“Especially for Skytap features that must cross services,” said Lara. “If the web has to talk to the configuration manager, or the load balancer, you won’t have all that on your local machine normally, and you need that in place to test the stack out.”
As new features are continuously developed and integrated into the Skytap stack, our engineering teams can build new stacks from scratch, or copy and share existing ones for collaboration purposes, and run regression suites to ensure that old issues do not resurface.
Like any software organization, we are focused on delivering capabilities to market and talking about those, so it is easy to take the corresponding evolution of our SDLC over the last few years for granted. The ability to set up and knock down the Skytap cloud services stack from exact definitions is now an essential part of our agile software delivery pipeline.