Dynamically Multiplexing Disposable Production Environment Clones

We recently discussed a set of usability enhancements for our internal test environments. During this work, we discovered a subset of problems that couldn’t be solved with environment configuration alone. Specifically, we require external infrastructure to provide a single domain as an endpoint for all test environments, to provide an SSL certificate signed by a trusted third party, and to route traffic to Kubernetes microservices in a production-like way.

Where we started

In times of yore, using the web UI in a test environment required that the environment owner configure a local hosts file so that the placeholder hostnames for the environment could be resolved to IP addresses.

This was a bit cumbersome (the hosts file needed to be reconfigured each time a new environment was created), and it also caused redirect logic and TLS termination to behave differently in test environments than in production.

We maintained a collection of specific workarounds to allow local hosts files to function with services that rely on URL redirects, like the SRA browser, or those that rely on the existence of our production load balancer — such as containerized services managed with our Kubernetes Microservice Architecture.

This custom configuration violated both of our cardinal rules for internal test environments: it wasn’t easy, and it wasn’t prod-like. Furthermore, it reduced the test environment’s ‘disposability,’ because environment owners were reluctant to discard an environment that needed custom configuration … even when it was a small amount of configuration.

To be easy and prod-like, we needed these things to be true:

  • All test environments must have a single base domain.
  • HTTPS connections to services in a test environment must not use self-signed certificates and should use a TLS termination proxy.
  • Many test environments must work simultaneously through this system — unlike production, which only needs to support one platform.
  • Services must present the same API endpoints regardless of whether they’re running directly on VMs or in the Kubernetes infrastructure; migrating a legacy service into the Kubernetes infrastructure must not require changes to how the service is accessed by clients.
  • We must not rely on hand-configured ‘hosts’ files or other local workarounds to access services in the test environment.

Ultimately, we realized that all of these things could be best addressed by introducing a load balancer and proxy to act as a mediator between client requests and the services that ultimately respond to those requests.

Planning

Early on, we considered adding a load balancer to each test environment. There were two routes we could take: a custom solution with an open-source load balancer and proxy (such as HAProxy), or an off-the-shelf product like an F5 BigIP device.

Using a custom solution would have had a higher development cost and would have reduced our similarity to production; licensing a BigIP device for each (temporary) environment would have been very expensive. In both cases, the complexity of each discrete environment would have increased.

In addition to cost and complexity concerns, DNS records would have to be dynamically added and removed each time an environment was launched or destroyed. This would introduce frequent updates into a system that usually has a low rate of change, and TTLs would need to decrease significantly to make this feature usable in something approaching real-time. Shorter TTLs mean higher load on DNS servers, and with it, at least a slightly greater need to consider the sort of systems-level architecture problems that our operations team excels at. Now, our Ops team is great, and there’s no doubt we could have handled increased traffic and complexity inherent in dynamic DNS … but one of the best ways to thank them for their greatness is to avoid burdening them with unnecessary work, right? So, dynamic DNS was a non-starter.

For all of these reasons, we decided on a central load-balancing architecture based on an off-the-shelf product. The load balancer would have a single DNS record, and would be responsible for two primary tasks:

  • Routing client requests to the correct destination environment
  • Routing client requests to the correct service within each environment

Our rough plan was to use a single base domain, and then use a service/environment ID pair in the least significant portion of the URL; the load balancer would direct traffic to the correct service pool, in the correct environment, based on this string.
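Concretely, the plan amounts to a hostname format like this (using the base domain and the ‘www’/‘123’ example that appear later in this post):

https://<service>-<environment_id>.test_envs.internal.skytap.com

So the ‘www’ service in environment ‘123’ becomes https://www-123.test_envs.internal.skytap.com.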

This load balancer would need to manage a very small volume of traffic compared to production workloads. As we mentioned previously, we put a high premium on trying to reduce operational overhead for internal tools — much DevOps, very wow! So, it made sense to run our load balancer infrastructure in a Skytap Cloud VM instead of installing a physical appliance. For parity with production (and to leverage institutional knowledge), we opted to do this with the F5 BigIP LTM (virtual edition).

Building it Out

After licensing and installing the F5 device in a Skytap Cloud VM, we configured our ‘test’ base domain in our internal DNS provider. All test environments share the base domain *.test_envs.internal.skytap.com, which directs traffic to the F5 load balancer. Having settled on a domain, we were then able to set up our SSL certificate. We decided that a wildcard cert was preferable here because of the transient nature of the test sub-domains; a specific test environment might only live for a couple of hours, so creating a certificate for every environment isn’t a great use of resources.

With a certificate registered and configured on the F5 device, we could move on to eliminating the HTTPS workarounds that existed when we used self-signed certificates without a proxy.

Some HTTP services in production expect a TLS termination proxy to decrypt incoming traffic at the load balancer before it’s directed to the service endpoint within the environment. Prior to adding a TLS termination proxy to the test environment ecosystem, we needed two workarounds: services were modified to handle HTTPS natively (unlike their prod counterparts, which received decrypted traffic behind the proxy), and self-signed certificates needed to be configured for each environment.


The configuration for TLS termination was pretty straightforward; we followed the same process as outlined in this Lullabot article. Aren’t engineering blogs just the best?

Now, the load balancer needs to know how to direct the unencrypted traffic to the correct environment and service.  

A (very simplified) example of our situation looks like this:

[Diagram: test environments]

Requests are received at the load balancer. They’re forwarded to one of many test environments; each test environment runs the Skytap Cloud platform and its constituent services. A request should hit a single endpoint at the top, but be routed to exactly one of the hosts responsible for services, in exactly one of the test environments. This is what we call the multiplexing trick: many signals (API endpoints) are accessible over a single communications channel (the F5 load balancer).

We accomplish this trick with two components. First, from the least-significant portion of the FQDN, we identify the specific instance of a service that should receive the request. This segment packs two pieces of information into a single string: the service name and the ID for the environment, separated by a dash (-). Second, we configure a collection of load balancer pools for each service we route to, in each environment.

The pools for each environment are collected into an administrative partition that’s named using the environment’s unique environment ID. We use an iRule to split the service-environmentID string into routable components, and then direct the traffic to the correct partition and pool.  

The simplified workflow for a request to the web UI looks like this:

  1. HTTPS request from a browser: https://www-123.test_envs.internal.skytap.com
     1a. The service is ‘www’, and the environment ID is ‘123’
  2. Skytap internal DNS resolves the hostname to the load balancer
  3. The load balancer terminates TLS and decrypts the HTTPS request
  4. An iRule on the load balancer splits the packed service-environment string into tokens; ‘www-123’ becomes the tokens <www> and <123>
  5. The iRule directs the request to a pool within the ‘test-123’ partition, called ‘www-123’
  6. The full proxy architecture of the load balancer allows redirects to function transparently between the browser and the destination Kubernetes pod
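
The token splitting in steps 4 and 5 is the heart of the multiplexing trick. In production it lives in the iRule on the F5, but the string handling itself is simple; here’s a minimal Python sketch of the equivalent mapping (an illustration only, not the iRule itself; hostnames and naming follow the example above):

# Illustration: the same split the iRule performs on the request's FQDN.
def route_for(host):
    """Map a test-environment hostname to a (partition, pool) pair."""
    packed = host.split(".")[0]               # 'www-123.test_envs...' -> 'www-123'
    service, env_id = packed.rsplit("-", 1)   # 'www-123' -> ('www', '123')
    partition = "test-%s" % env_id            # administrative partition, e.g. 'test-123'
    pool = "%s-%s" % (service, env_id)        # service pool, e.g. 'www-123'
    return partition, pool

assert route_for("www-123.test_envs.internal.skytap.com") == ("test-123", "www-123")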

Setting aside the minutiae of the static load balancer configuration (an F5 topic), the dynamic nature of test environments leaves us with three remaining difficulties.

  1. Service Reconfiguration: How do we adjust the configuration of the services in the environment so that they can handle HTTP(S) redirects transparently? Specifically: URLs configured for each service need to deal with the dynamic portion of test environment URLs (the environment ID).
  2. Networking Concerns: How do we deal with the fact that networks in Skytap Cloud environments are isolated from each other? Specifically, how do we let a test environment communicate with the load balancer (which is running in another Skytap Cloud environment), and let the load balancer communicate with the test environment?
  3. Dynamic Load Balancer Reconfiguration: How do we reconfigure the load balancer to add and remove the necessary nodes, pools, and partitions when a new environment is launched or an old environment is deleted?

As with any non-trivial project, there’s a bit of complexity under the hood … but conceptually, all three of these things were pretty easy to solve with existing Skytap Cloud features and platform components.

Service reconfiguration

To handle the dynamic portion of URLs in service configuration, we do two things: we treat some of the lines in the configuration files as template objects rather than literal values, and we populate the value for the template (the environment ID) with data from the VM Metadata Service.

The services that read this configuration understand ERB, so it’s convenient to use that for templating. With our hosts provisioned to include jq, a command line JSON parser, this becomes very simple. A parameterized config snippet looks like this:

# <%config_id =`(curl -s http://gw/skytap | jq .configuration_url | xargs --no-run-if-empty basename) 2>/dev/null`.strip%>
# <%config_id = "invalid-config-id" if config_id.empty?%>
web_url: 'https://www-<%=config_id%>.test_envs.internal.skytap.com'

The ERB lets us use a mixture of inline bash and Ruby to extract the config_id (the unique ID for this environment) as a single string, with a second step setting an obviously incorrect value in case it fails. When the value of ‘web_url’ is read, we use another piece of ERB to insert the config_id. The environment configuration is now aware of its own configuration ID in places it needs to be.
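
For illustration, suppose this environment’s configuration_url ends in ‘123’ (a made-up ID; the exact URL shape is unimportant, only that its last path segment is the environment ID):

$ curl -s http://gw/skytap | jq .configuration_url
"https://cloud.skytap.com/configurations/123"

The xargs/basename step strips the quoting and the path, leaving ‘123’, and the rendered configuration line becomes:

web_url: 'https://www-123.test_envs.internal.skytap.com'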

Networking Concerns

Skytap Cloud environments are isolated from each other unless specific actions are taken to route between them. We use the Inter-Configuration Network Routing (ICNR) feature of Skytap Cloud to establish connectivity between the load balancer and its constituent test environments.  

The load balancer has two interfaces that deal with test environment traffic — external, for … well, traffic that comes from the external (corporate side) network, and internal for (can you guess?) the assorted internal environment-side networks.

Interactions with the F5 API happen across the ‘internal’ interface, and all proxy traffic between the load balancer and the test environment traverses this interface. The network associated with this interface on the load balancer VM is configured with a NAT pool; this means each host in the test environment has an IP that can be configured as a node on the load balancer.

Traffic from the corporate network traverses the external network, and this network is configured with a connection to the corporate VPN so that the load balancer is accessible from within Skytap Cloud. The DNS record for *.test_envs.internal.skytap.com points to the IP address on the external network.

Our setup looks like this:

[Diagram: new test environments]

One complexity introduced by the NAT/ICNR setup is that services expect to be able to redirect from one URL to another. In production, this is no problem because anything that references the ‘cloud.skytap.com’ domain is world-resolvable — in the case of test environments, however, the ‘*.test_envs.internal.skytap.com’ domain is only resolvable on the corporate network. Furthermore, multiplexing multiple environments requires the load balancer to handle the HTTP request so that it can be routed to the correct service pool.

We experimented with a few ways to distinguish requests intended to be ‘internal’ (Network 2 in the diagram) from those intended as ‘external’ (Network 1), but ultimately this reduced the prod-like behavior of services and greatly increased load balancer complexity. Everything should be as transparent and production-like as possible, remember!

Ultimately our solution was pretty simple: we establish a second ICNR tunnel between the test environment and the load balancer’s external network. As discussed in How Skytap Cloud Makes Creating Production Clones Ridiculously Easy, every test environment runs its own DNS server. This proves very convenient in the URL redirect case; we adjusted the DNSMasq configuration for test environments to resolve any requests for *.test_envs.internal.skytap.com to the ‘external’ (Network 1) interface on the load balancer. Voilà; HTTP/HTTPS redirects work like they would in production — despite the fact that the environment is nested and virtualized.
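
As a rough sketch (the interface address shown is hypothetical), the DNSMasq override is a single directive on each environment’s DNS server:

# Resolve the test-envs domain and all of its subdomains to the load
# balancer's 'external' (Network 1) interface instead of forwarding upstream.
address=/test_envs.internal.skytap.com/10.0.0.10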

[Diagram: Skytap test environments]

Load balancer configuration

Here, we reuse code from our production Kubernetes infrastructure to manipulate configuration objects on the F5 load balancer, using the load balancer’s REST API. An environment startup script is written to each test environment during the bootstrap phase of the build. After an environment is created, the user runs the script to configure the environment automatically. This script:

  • Jiggles the ICNR cables (as discussed in ‘Networking Concerns’ above).
  • Queries the Skytap Cloud API to get the NAT IPs for service-providing hosts — because of details in the way test environment networks are configured, these IPs will be different than the host IPs.
  • Uses the F5 API and our adapter module to configure the partition for the new environment, add nodes that correspond to the services, and configure pools to balance traffic across those nodes.
  • Runs a garbage collection process to remove unused configuration objects on the load balancer.

There’s a fair bit of code involved in those steps, but the process is conceptually simple. For the end-user, this script runs in a single step, making it easy to set up an environment.
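
We can’t reproduce the adapter module here, but a compressed Python sketch gives the flavor of the F5 calls. The endpoint paths follow the public iControl REST conventions; the hostnames, credentials, and addresses below are placeholders rather than our real configuration:

import requests

F5_API = "https://f5-mgmt.test_envs.internal.skytap.com"  # placeholder management address
session = requests.Session()
session.auth = ("admin", "********")                       # placeholder credentials
session.verify = False                                     # internal management cert in this sketch

env_id = "123"
partition = "test-%s" % env_id

# 1. Create an administrative partition for the new environment.
session.post(F5_API + "/mgmt/tm/auth/partition", json={"name": partition})

# 2. Add a node for each service-providing host, using its NAT IP.
session.post(F5_API + "/mgmt/tm/ltm/node",
             json={"name": "www-host-1", "partition": partition, "address": "10.0.2.21"})

# 3. Create a pool per service and attach the nodes as members.
session.post(F5_API + "/mgmt/tm/ltm/pool",
             json={"name": "www-%s" % env_id, "partition": partition,
                   "members": [{"name": "www-host-1:443"}]})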

But will it blend?

What? No, it’s software. Of course it won’t blend—but the system works out pretty well in practice!

We’ve accomplished our high-level goals: test environment end-users can start everything up by running a single script. All of the environments have a single endpoint, so no custom adjustment of hosts files is required, and everything operates in a prod-like manner with TLS Termination and appropriate routing to Kubernetes pods.

At this point, the test environment ecosystem cycles through about 4000 VMs each week. The load balancer/multiplexer is supporting about 30 concurrent test environments (with ~100 VMs apiece).  We’re working towards reducing the footprint for developer stacks and widening the footprint for multi-region stacks, with an aim towards replacing our shared integration and pre-production environments.  

We completed the multiplexing portion of the test environment usability enhancements in early 2017. Since then, we’ve seen a dramatic increase in internal use of these environments for a variety of purposes. They’re used heavily enough at this point that our focus is shifting to operational concerns like capacity management.

Running the entire ecosystem inside of Skytap Cloud has allowed us a lot of flexibility. We have the ability to manage inter-configuration routing and don’t need to deal with the overhead of physical hardware. Using our product to build our own internal tooling, of course, gives us a sort of built-in customer perspective: our customers are likely to encounter the same benefits and difficulties that we face in building such systems.

We, of course, enjoy building software in Skytap Cloud — we wouldn’t be able to write these glowing articles if we didn’t — and we hope that you do, too!
