In previous blog posts, we covered Jenga (our environment construction tool) and did a deep dive into how we use Jenga as a component in delivering fully functional clones of our production environment.
We’re very happy with this delivery pipeline and the test environments it produces. Our engineers always have up-to-date templates for development or testing work, and the disposable nature of pre-built copies of environments encourages a culture of productive experimentation. There are significant opportunities for engineers to explore components of the platform that are peripheral to their specialization, CI/CD increases confidence in our releases, and easy duplication of continuously delivered templates ensures that everyone has access to the most recent work done by our various internal teams.
Even with automatic delivery, however, these test environments still had some rough edges, particularly in our two primary categories of concern: test environments should be easy to use, and they should match production as closely as possible.
We had a few high-level problems to address:
- Each host in each environment should have a hostname and FQDN that can be resolved by DNS — including any virtualized hosts that are invisible to the automatic network so that we can avoid configuration that relies on NAT IP addresses.
- Each host should be able to resolve domain names for records that exist only on our internal corporate DNS servers — the automatic network does not use our internal DNS as its upstream provider.
- Environments should have something like the ‘organic’ data that comes about with real use of the platform.
- Environments should have a TLS Termination Proxy instead of using workarounds to allow clients to communicate in the clear, and to avoid using self-signed certificates.
- Containerized services should use load-balancing pools for Kubernetes pods as we do in production, instead of directly addressing a socket on a Kubernetes node.
- Every environment should be accessible through a single, static domain; this eliminates the need for end-users to make any local configuration changes to access their environments.
- End-users should be able to start up an environment, with all of the above problems solved, in a single step.
Breaking Up the Problem Space
While deciding where to apply effort to solve each of these problems, we took a look at how templates for test environments are constructed. We determined that some solutions made sense to apply early in the build process, and others made sense to apply later — and finally that some of the problems required external infrastructure, along with automation that should be applied when the cloned environment is launched from a template.
We broke the build process down into four phases. Phases 1-3 are ‘construction’ or delivery oriented; the last (launching) relies on load-balancing infrastructure and will be discussed in detail in an upcoming article, Dynamically Multiplexing Disposable Prod Clones.
The four phases of a test environment:
- Provisioning—base templates are cloned, VMs are launched, and hosts are configured with Puppet. Automatic network settings are replaced with environment-specific static networking configuration.
- Bootstrapping—services are launched and configured according to the platform configuration. Additional virtualized infrastructure (hypervisors, virtual networking, etc) is configured and started within the test environment.
- Post-boot configuration—we adjust settings in the environment using its now-functional API and install a startup script that the end-user will use to restart the environment from a template. At the end of this phase, we shut down the services, shut down the VMs, and create a template from the stopped environment.
- Launching—in this phase, the pre-constructed template is copied and started by the end-user. This is where we want single-step operation: the end-user should be able to run a single script, and then be ready to go.
Enhancements at Each Phase
During provisioning (phase 1) we address the problems of resolving domain names within the test environment’s sandbox, and of enabling upstream DNS resolution for hosts within the Skytap Cloud corporate network.
During this phase, VMs are launched and their initial network configuration is bootstrapped from the automatic network provided by the Skytap Cloud platform. Puppet then configures each host depending on its type, and the Jenga build process applies additional configuration that allows the platform to operate inside of the sandbox of a Skytap Cloud Environment.
Once provisioning for each node is complete, we pull a switcheroo and replace the automatic network (DHCP) with a static, custom configuration. We provision our puppetmaster host to include a DNSMasq server, configured to use the static IP of our internal DNS as its upstream provider, which solves our problem of resolving corporate-network domains from within the test environment sandbox. Then, for each host in the test environment, we add a record to the DNSMasq configuration, disable DHCP on the host, and update its /etc/resolv.conf to use the puppetmaster host as its DNS provider.
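To make this concrete, here’s a minimal sketch in Python (standing in for our actual Puppet/Jenga tooling) of how the per-host DNSMasq records and the replacement /etc/resolv.conf might be generated; the domain, hostnames, and IP addresses are hypothetical examples.

```python
# Minimal sketch, not our actual tooling: render a dnsmasq.conf fragment with
# the upstream corporate DNS and one record per test-environment host, plus the
# resolv.conf installed on every other host. All names/addresses are hypothetical.

UPSTREAM_DNS = "10.0.0.53"           # hypothetical internal corporate DNS server
PUPPETMASTER_IP = "192.168.100.10"   # hypothetical static IP of the puppetmaster/DNSMasq host
DOMAIN = "testenv.example.internal"  # hypothetical sandbox domain

HOSTS = {                            # hypothetical static IP assignments
    "puppetmaster": PUPPETMASTER_IP,
    "www-1": "192.168.100.20",
    "db-1": "192.168.100.30",
}

def dnsmasq_conf() -> str:
    """One upstream resolver, plus an A record for every host in the environment."""
    lines = ["no-resolv", f"server={UPSTREAM_DNS}"]
    lines += [f"address=/{name}.{DOMAIN}/{ip}" for name, ip in HOSTS.items()]
    return "\n".join(lines) + "\n"

def resolv_conf() -> str:
    """The /etc/resolv.conf pushed to each host: all lookups go to the puppetmaster."""
    return f"search {DOMAIN}\nnameserver {PUPPETMASTER_IP}\n"

if __name__ == "__main__":
    print(dnsmasq_conf())
    print(resolv_conf())
```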
The bootstrap phase (phase 2) happens after all of the hosts are provisioned and the network is reconfigured. Services haven’t started up in the platform yet, so this is a convenient time to address our need to have usable platform data in the environment.
Our approach is to create database fixtures describing customers and users and inject them into the clean database. The fixtures should appear reasonably organic.
To achieve this effect, we gather a list of our Engineering users from LDAP, then feed this to a Rake task that creates the expected customers and users in the database. Additional Rake tasks configure a customer with appropriate administrative privileges and provide a functional configuration for the fixture customer so that environments, templates, VMs, networks, and storage will have usable quotas set for each environment cloned from this build.
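The real fixtures are created by Rake tasks inside the platform; the Python sketch below is only meant to show the shape of the data we inject, with invented customer names, fields, and quota values.

```python
# Rough illustration only: the actual work is done by Rake tasks in the platform.
# This sketch shows the kind of fixture data injected into the clean database,
# with hypothetical names, fields, and quota values.

from dataclasses import dataclass, field

@dataclass
class FixtureUser:
    login: str
    email: str
    admin: bool = False

@dataclass
class FixtureCustomer:
    name: str
    users: list = field(default_factory=list)
    quotas: dict = field(default_factory=lambda: {
        "environments": 20, "templates": 50, "vms": 200,
        "networks": 40, "storage_gb": 2048,   # hypothetical quota values
    })

def build_fixtures(ldap_entries):
    """Turn LDAP-derived (login, email) pairs into one fixture customer with an admin."""
    customer = FixtureCustomer(name="engineering-fixture")   # hypothetical customer name
    for i, (login, email) in enumerate(ldap_entries):
        customer.users.append(FixtureUser(login, email, admin=(i == 0)))
    return customer

if __name__ == "__main__":
    print(build_fixtures([("adeveloper", "adeveloper@example.com"),
                          ("btester", "btester@example.com")]))
```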
Post-bootstrap configuration (phase 3) occurs after all of the platform services have started up; at this point, the environment has a functional API. We have an opportunity to do some real work with the proto-platform before we shut everything down and create a template, and to exercise the API to ensure that basic functionality is intact. So, we use the API to set up VM import jobs, transfer VM images to the FTP server in the environment with LFTP, and run those import jobs. At the end of this stage, the fresh test environment is decked out with usable VM templates that can be launched and used like any other VM, the only difference being that they exist inside the sandbox of a test environment.
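As a rough illustration of this import flow (not the platform’s actual API), here’s a hedged Python sketch; the endpoint paths, field names, hostnames, and credentials are all hypothetical stand-ins.

```python
# Hedged sketch of the post-boot import step: create a VM import job through the
# environment's now-functional REST API, push the image to the environment's FTP
# server with lftp, then start the job. Endpoints, fields, and hosts are hypothetical.

import subprocess
import requests

API = "https://www.testenv.example.internal/api"   # hypothetical in-sandbox API endpoint
AUTH = ("fixture-admin", "fixture-password")       # hypothetical fixture credentials

def import_vm_image(image_path: str) -> None:
    # 1. Create the import job and learn where to upload the image.
    job = requests.post(f"{API}/imports",
                        json={"name": "base-linux-template"}, auth=AUTH).json()

    # 2. Transfer the VM image to the environment's FTP server with lftp.
    subprocess.run(["lftp", "-e", f"put {image_path}; bye", job["ftp_url"]],
                   check=True)

    # 3. Kick off the import; the platform turns the image into a usable template.
    requests.put(f"{API}/imports/{job['id']}",
                 json={"status": "processing"}, auth=AUTH)
```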
We also configure projects and users that are appropriate for automatic testing. We can prepare the environment to run a suite of automated functional tests — so hey, why not go ahead and run that test suite and generate a report?
Finally, we install an environment startup script (and supporting modules) into the environment, then shut everything down and produce a template. We discussed this template delivery pipeline in detail in previous posts.
Test Environment Startup (phase 4) is distinct from the test environment template construction that occurs in phases 1-3. This is where our ‘single step startup’ problem is solved. In this phase, the end-user creates a copy of the previously constructed environment and runs the startup script to safely and automatically start up services.
The startup script also attaches the new environment to the test environment support infrastructure (an external load balancer). A few additional niceties are applied: hosts in the environment are given an .hgrc containing the environment owner’s Mercurial username, and the ‘suspend on idle timeout’ period is adjusted.
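As a simplified, hypothetical sketch of what that startup script does (the real one is installed into the template during phase 3), the flow looks roughly like this; the service names, load-balancer endpoint, and file contents below are stand-ins.

```python
# Simplified, self-contained sketch of the phase-4 startup flow. Service names,
# the load-balancer registration endpoint, and paths are hypothetical stand-ins
# for what the real startup script does.

import subprocess
from pathlib import Path

import requests

LB_API = "https://lb.test_envs.internal.skytap.com/register"  # hypothetical endpoint

def start_environment(config_id: str, owner_username: str) -> None:
    # 1. Bring platform services up in dependency order (hypothetical unit names).
    for service in ["platform-db", "platform-api", "platform-web"]:
        subprocess.run(["systemctl", "start", service], check=True)

    # 2. Attach this environment to the external load balancer so that
    #    https://www-<config_id>.test_envs.internal.skytap.com reaches it.
    requests.post(LB_API, json={"config_id": config_id})

    # 3. Niceties: write the owner's Mercurial username into ~/.hgrc.
    #    (The 'suspend on idle timeout' adjustment is omitted from this sketch.)
    Path.home().joinpath(".hgrc").write_text(
        f"[ui]\nusername = {owner_username}\n"
    )
```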
At the end of the startup phase, the environment is usable just like the production environment, either through our REST API or through the web UI. A web browser (or REST client) pointed at https://www-1234.test_envs.internal.skytap.com from within the corporate network will function like the production platform, assuming that ‘1234’ is the configuration ID of the newly-started test environment.
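For example, a quick smoke test from a workstation on the corporate network might look like the snippet below; the path and response shape are hypothetical.

```python
# Hypothetical smoke test: the freshly started environment answers on its own
# www-<config_id> hostname just like production would.

import requests

resp = requests.get("https://www-1234.test_envs.internal.skytap.com/api/v1/status")
resp.raise_for_status()
print(resp.json())
```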
Phase four is simple for our end-user — but there’s a lot of stuff that needs to happen to make this function transparently for all of the concurrently running test environments. In the upcoming blog entry, Dynamically Multiplexing Disposable Prod Clones, we’ll discuss the details of how this works.
Did this really make life easier?
In short: yes, and it continues to improve.
The usability enhancements have been delivered iteratively over the lifecycle of the Jenga project, so our users haven’t always enjoyed ease of use in all of the problem areas we outlined. Automated setup of a basic customer account arrived relatively early in the project, for example, but injecting fixtures for user data didn’t happen until much later. With each milestone, however, ease of use and production parity have improved.
Until recently (early 2017), test environment owners were expected to configure local hosts files to use the web UI and REST API for their environment and to whitelist self-signed SSL certificates. This was tedious, and made people reluctant to discard their ‘pet’ environment in favor of launching from a fresh template; enhancements in the environment startup phase have greatly improved this experience, and consequently, engineers are more likely to be running an up-to-date environment.
Prior to our post-bootstrap configuration enhancements, testing things like VM hosting (which, unsurprisingly, is a primary function of the Skytap Cloud platform) required the environment owner to import a VM image. This is time-consuming. Being able to launch VMs and use the platform to manage them without any preliminary work has greatly improved usability, and has given us additional confidence that the platform in the freshly delivered templates is at least minimally functional.
Without organic user data in environments, functional tests would require either manual administrative action, or some external test driver to establish fixture data—baking this into each template eliminates the need to repeat this for every environment, and again, improves usability.
Finally, by swapping out the automatic network that’s available during provisioning with a static configuration and sandboxed DNS, we gained the flexibility to prototype new components within the environment; we can, for example, provide an FQDN to VMs that are hosted within the test environment, or establish custom routing for things like testing the effects of intermittent network failures.
With each of these accommodations in place, we’ve seen two behaviors emerge for test environments: first, they’re used much more extensively (to the point where our efforts are shifting away from usability and toward capacity management); and second, problems are being identified and corrected very early in the lifecycle of a given release.
Even within Skytap Cloud, we continue to be impressed by the flexibility that our platform can deliver. Acting as our own customer, and continuously producing and using complex environments on the Skytap Cloud platform itself has afforded us opportunities to understand how to build and deliver software with Skytap Cloud as a cornerstone. Enhanced usability for the product that we’re running in Skytap Cloud has, in turn, exposed us to the sorts of challenges that our customers may face (and solve) with our platform.
We look forward to hearing more about the amazing things our customers build using Skytap Cloud!