Dynamically Multiplexing Disposable Production Environment Clones

We recently discussed a set of usability enhancements for our internal test environments. During this work, we discovered a subset of problems that couldn’t be solved with environment configuration alone. Specifically, we require external infrastructure to provide a single domain as an endpoint for all test environments, to provide an SSL certificate signed by a trusted third party, and to route traffic to Kubernetes microservices in a production-like way.

Where we started

In times of yore, using the web UI in a test environment required that the environment owner configure a local hosts file so that the placeholder hostnames for the environment could be resolved to IP addresses.

This was a bit cumbersome (the hosts file needed to be reconfigured each time a new environment was created), and it also caused redirect logic and TLS termination to behave differently in test environments than in production.

We maintained a collection of specific workarounds to allow local hosts files to function with services that rely on URL redirects, like the SRA browser, or those that rely on the existence of our production load balancer — such as containerized services managed with our Kubernetes Microservice Architecture.

This custom configuration violated both of our cardinal rules for internal test environments: it wasn’t easy, and it wasn’t prod-like. Furthermore, it reduced the test environment’s ‘disposability,’ because environment owners were reluctant to discard an environment that needed custom configuration … even when it was a small amount of configuration.

To be easy and prod-like, we needed these things to be true:

  • All test environments must have a single base domain.
  • HTTPS connections to services in a test environment must not use self-signed certificates and should use a TLS termination proxy.
  • Many test environments must work simultaneously through this system — unlike production, which only needs to support one platform.
  • Services must present the same API endpoints regardless of whether they’re running directly on VMs or in the Kubernetes infrastructure; migrating a legacy service into the Kubernetes infrastructure must not require changes to how the service is accessed by clients.
  • We must not rely on hand-configured ‘hosts’ files or other local workarounds to access services in the test environment.

Ultimately, we realized that all of these things could be best addressed by introducing a load balancer and proxy to act as a mediator between client requests and the services that ultimately respond to those requests.

Planning

Early on, we considered adding a load balancer to each test environment. There were two routes we could take: a custom solution with an open-source load balancer and proxy (such as HAProxy), or an off-the-shelf product like an F5 BigIP device.

Using a custom solution would have had a higher development cost and would have reduced our similarity to production; licensing a BigIP device for each (temporary) environment would have been very expensive. In both cases, the complexity of each discrete environment would have increased.

In addition to cost and complexity concerns, DNS records would have to be dynamically added and removed each time an environment was launched or destroyed. This would introduce frequent updates into a system that usually has a low rate of change, and TTLs would need to decrease significantly to make this feature usable in something approaching real-time. Shorter TTLs mean higher load on DNS servers, and with it, at least a slightly greater need to consider the sort of systems-level architecture problems that our operations team excels at. Now, our Ops team is great, and there’s no doubt we could have handled increased traffic and complexity inherent in dynamic DNS … but one of the best ways to thank them for their greatness is to avoid burdening them with unnecessary work, right? So, dynamic DNS was a non-starter.

For all of these reasons, we decided on a central load-balancing architecture based on an off-the-shelf product. The load balancer would have a single DNS record, and would be responsible for two primary tasks:

  • Routing client requests to the correct destination environment
  • Routing client requests to the correct service within each environment

Our rough plan was to use a single base domain, and then use a service/environment ID pair in the least significant portion of the URL; the load balancer would direct traffic to the correct service pool, in the correct environment, based on this string.
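Concretely, the plan amounts to a hostname format like this (using the base domain and the ‘www’/‘123’ example that appear later in this post):

https://<service>-<environment_id>.test_envs.internal.skytap.com

So the ‘www’ service in environment ‘123’ becomes https://www-123.test_envs.internal.skytap.com.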

This load balancer would need to manage a very small volume of traffic compared to production workloads. As we mentioned previously, we put a high premium on trying to reduce operational overhead for internal tools — much DevOps, very wow! So, it made sense to run our load balancer infrastructure in a Skytap Cloud VM instead of installing a physical appliance. For parity with production (and to leverage institutional knowledge), we opted to do this with the F5 BigIP LTM (virtual edition).

Building it Out

After licensing and installing the F5 device in a Skytap Cloud VM, we configured our ‘test’ base domain in our internal DNS provider. All test environments share the base domain *.test_envs.internal.skytap.com, which directs traffic to the F5 load balancer. Having settled on a domain, we were then able to set up our SSL certificate. We decided that a wildcard cert was preferable here because of the transient nature of the test sub-domains; a specific test environment might only live for a couple of hours, so creating a certificate for every environment isn’t a great use of resources.

With a certificate registered and configured on the F5 device, we could move on to eliminating the HTTPS workarounds that existed when we used self-signed certificates without a proxy.

Some HTTP services in production expect a TLS termination proxy to decrypt incoming traffic at the load balancer before it’s directed to the service endpoint within the environment. Prior to adding a TLS termination proxy to the test environment ecosystem, we needed two workarounds: services were modified to handle HTTPS natively (unlike their prod counterparts, which received decrypted traffic behind the proxy), and self-signed certificates needed to be configured for each environment.


The configuration for TLS termination was pretty straightforward; we followed the same process as outlined in this Lullabot article. Aren’t engineering blogs just the best?

Now, the load balancer needs to know how to direct the unencrypted traffic to the correct environment and service.  

A (very simplified) example of our situation looks like this:

[Diagram: test environments]

Requests are received at the load balancer. They’re forwarded to one of many test environments; each test environment runs the Skytap Cloud platform and its constituent services. A request should hit a single endpoint at the top, but be routed to exactly one of the hosts responsible for services, in exactly one of the test environments. This is what we call the multiplexing trick: many signals (API endpoints) are accessible over a single communications channel (the F5 load balancer).

We accomplish this trick with two components. First, from the least-significant portion of the FQDN, we identify the specific instance of a service that should receive the request. This segment packs two pieces of information into a single string: the service name and the ID for the environment, separated by a dash (-). Second, we configure a collection of load balancer pools for each service we route to, in each environment.

The pools for each environment are collected into an administrative partition that’s named using the environment’s unique environment ID. We use an iRule to split the service-environmentID string into routable components, and then direct the traffic to the correct partition and pool.  

The simplified workflow for a request to the web UI looks like this:

  1. HTTPS request from a browser: https://www-123.test_envs.internal.skytap.com
     1a. The service is ‘www’, and the environment ID is ‘123’
  2. Skytap internal DNS resolves the hostname to the load balancer
  3. The load balancer terminates TLS and decrypts the HTTPS request
  4. An iRule on the load balancer splits the packed service-environment string into tokens; ‘www-123’ becomes the tokens <www> and <123>
  5. The iRule directs the request to a pool within the ‘test-123’ partition, called ‘www-123’
  6. The full proxy architecture of the load balancer allows redirects to function transparently between the browser and the destination Kubernetes pod
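
The token splitting in steps 4 and 5 is the heart of the multiplexing trick. In production it lives in the iRule on the F5, but the string handling itself is simple; here’s a minimal Python sketch of the equivalent mapping (an illustration only, not the iRule itself; hostnames and naming follow the example above):

# Illustration: the same split the iRule performs on the request's FQDN.
def route_for(host):
    """Map a test-environment hostname to a (partition, pool) pair."""
    packed = host.split(".")[0]               # 'www-123.test_envs...' -> 'www-123'
    service, env_id = packed.rsplit("-", 1)   # 'www-123' -> ('www', '123')
    partition = "test-%s" % env_id            # administrative partition, e.g. 'test-123'
    pool = "%s-%s" % (service, env_id)        # service pool, e.g. 'www-123'
    return partition, pool

assert route_for("www-123.test_envs.internal.skytap.com") == ("test-123", "www-123")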

Setting aside the minutiae of the static load balancer configuration (an F5 topic), the dynamic nature of test environments leaves us with three remaining difficulties.

  1. Service Reconfiguration: How do we adjust the configuration of the services in the environment so that they can handle HTTP(S) redirects transparently? Specifically: URLs configured for each service need to deal with the dynamic portion of test environment URLs (the environment ID).
  2. Networking Concerns: How do we deal with the fact that networks in Skytap Cloud environments are isolated from each other? Specifically, how do we let a test environment communicate with the load balancer (which is running in another Skytap Cloud environment), and let the load balancer communicate with the test environment?
  3. Dynamic Load Balancer Reconfiguration: How do we reconfigure the load balancer to add and remove the necessary nodes, pools, and partitions when a new environment is launched or an old environment is deleted?

As with any non-trivial project, there’s a bit of complexity under the hood … but conceptually, all three of these things were pretty easy to solve with existing Skytap Cloud features and platform components.

Service reconfiguration

To handle the dynamic portion of URLs in service configuration, we do two things: we treat some of the lines in the configuration files as template objects rather than literal values, and we populate the value for the template (the environment ID) with data from the VM Metadata Service.

The services that read this configuration understand ERB, so it’s convenient to use that for templating. With our hosts provisioned to include jq, a command line JSON parser, this becomes very simple. A parameterized config snippet looks like this:

# <%config_id =`(curl -s http://gw/skytap | jq .configuration_url | xargs --no-run-if-empty basename) 2>/dev/null`.strip%>
# <%config_id = "invalid-config-id" if config_id.empty?%>
web_url: 'https://www-<%=config_id%>.test_envs.internal.skytap.com'

The ERB lets us use a mixture of inline bash and Ruby to extract the config_id (the unique ID for this environment) as a single string, with a second step setting an obviously incorrect value in case it fails. When the value of ‘web_url’ is read, we use another piece of ERB to insert the config_id. The environment configuration is now aware of its own configuration ID in places it needs to be.
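
For illustration, suppose this environment’s configuration_url ends in ‘123’ (a made-up ID; the exact URL shape is unimportant, only that its last path segment is the environment ID):

$ curl -s http://gw/skytap | jq .configuration_url
"https://cloud.skytap.com/configurations/123"

The xargs/basename step strips the quoting and the path, leaving ‘123’, and the rendered configuration line becomes:

web_url: 'https://www-123.test_envs.internal.skytap.com'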

Networking Concerns

Skytap Cloud environments are isolated from each other unless specific actions are taken to route between them. We use the Inter-Configuration Network Routing (ICNR) feature of Skytap Cloud to establish connectivity between the load balancer and its constituent test environments.  

The load balancer has two interfaces that deal with test environment traffic — external, for … well, traffic that comes from the external (corporate side) network, and internal for (can you guess?) the assorted internal environment-side networks.

Interactions with the F5 API happen across the ‘internal’ interface, and all proxy traffic between the load balancer and the test environment traverses this interface. The network associated with this interface on the load balancer VM is configured with a NAT pool; this means each host in the test environment has an IP that can be configured as a node on the load balancer.

Traffic from the corporate network traverses the external network, and this network is configured with a connection to the corporate VPN so that the load balancer is accessible from within Skytap Cloud. The DNS record for *.test_envs.internal.skytap.com points to the IP address on the external network.

Our setup looks like this:

[Diagram: new test environments]

One complexity introduced by the NAT/ICNR setup is that services expect to be able to redirect from one URL to another. In production, this is no problem because anything that references the ‘cloud.skytap.com’ domain is world-resolvable — in the case of test environments, however, the ‘*.test_envs.internal.skytap.com’ domain is only resolvable on the corporate network. Furthermore, multiplexing multiple environments requires the load balancer to handle the HTTP request so that it can be routed to the correct service pool.

We experimented with a few ways to distinguish requests intended to be ‘internal’ (Network 2 in the diagram) from those intended as ‘external’ (Network 1), but ultimately this reduced the prod-like behavior of services and greatly increased load balancer complexity. Everything should be as transparent and production-like as possible, remember!

Ultimately our solution was pretty simple: we establish a second ICNR tunnel between the test environment and the load balancer’s external network. As discussed in How Skytap Cloud Makes Creating Production Clones Ridiculously Easy, every test environment runs its own DNS server. This proves very convenient in the URL redirect case; we adjusted the DNSMasq configuration for test environments to resolve any requests for *.test_envs.internal.skytap.com to the ‘external’ (Network 1) interface on the load balancer. Voilà; HTTP/HTTPS redirects work like they would in production — despite the fact that the environment is nested and virtualized.
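
As a rough sketch (the interface address shown is hypothetical), the DNSMasq override is a single directive on each environment’s DNS server:

# Resolve the test-envs domain and all of its subdomains to the load
# balancer's 'external' (Network 1) interface instead of forwarding upstream.
address=/test_envs.internal.skytap.com/10.0.0.10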

[Diagram: Skytap test environments]

Load balancer configuration

Here, we reuse code from our production Kubernetes infrastructure to manipulate configuration objects on the F5 load balancer, using the load balancer’s REST API. An environment startup script is written to each test environment during the bootstrap phase of the build. After an environment is created, the user runs the script to configure the environment automatically. This script:

  • Jiggles the ICNR cables (as discussed in ‘Networking Concerns’ above).
  • Queries the Skytap Cloud API to get the NAT IPs for service-providing hosts — because of details in the way test environment networks are configured, these IPs will be different than the host IPs.
  • Uses the F5 API and our adapter module to configure the partition for the new environment, add nodes that correspond to the services, and configure pools to balance traffic across those nodes.
  • Runs a garbage collection process to remove unused configuration objects on the load balancer.

There’s a fair bit of code involved in those steps, but the process is conceptually simple. For the end-user, this script runs in a single step, making it easy to set up an environment.
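
We can’t reproduce the adapter module here, but a compressed Python sketch gives the flavor of the F5 calls. The endpoint paths follow the public iControl REST conventions; the hostnames, credentials, and addresses below are placeholders rather than our real configuration:

import requests

F5_API = "https://f5-mgmt.test_envs.internal.skytap.com"  # placeholder management address
session = requests.Session()
session.auth = ("admin", "********")                       # placeholder credentials
session.verify = False                                     # internal management cert in this sketch

env_id = "123"
partition = "test-%s" % env_id

# 1. Create an administrative partition for the new environment.
session.post(F5_API + "/mgmt/tm/auth/partition", json={"name": partition})

# 2. Add a node for each service-providing host, using its NAT IP.
session.post(F5_API + "/mgmt/tm/ltm/node",
             json={"name": "www-host-1", "partition": partition, "address": "10.0.2.21"})

# 3. Create a pool per service and attach the nodes as members.
session.post(F5_API + "/mgmt/tm/ltm/pool",
             json={"name": "www-%s" % env_id, "partition": partition,
                   "members": [{"name": "www-host-1:443"}]})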

But will it blend?

What? No, it’s software. Of course it won’t blend—but the system works out pretty well in practice!

We’ve accomplished our high-level goals: test environment end-users can start everything up by running a single script. All of the environments have a single endpoint, so no custom adjustment of hosts files is required, and everything operates in a prod-like manner with TLS Termination and appropriate routing to Kubernetes pods.

At this point, the test environment ecosystem cycles through about 4000 VMs each week. The load balancer/multiplexer is supporting about 30 concurrent test environments (with ~100 VMs apiece).  We’re working towards reducing the footprint for developer stacks and widening the footprint for multi-region stacks, with an aim towards replacing our shared integration and pre-production environments.  

We completed the multiplexing portion of the test environment usability enhancements in early 2017. Since then, we’ve seen a dramatic increase in internal use of these environments for a variety of purposes. They’re used heavily enough at this point that our focus is shifting to operational concerns like capacity management.

Running the entire ecosystem inside of Skytap Cloud has allowed us a lot of flexibility. We have the ability to manage inter-configuration routing and don’t need to deal with the overhead of physical hardware. Using our product to build our own internal tooling, of course, gives us a sort of built-in customer perspective: our customers are likely to encounter the same benefits and difficulties that we face in building such systems.

We, of course, enjoy building software in Skytap Cloud — we wouldn’t be able to write these glowing articles if we didn’t — and we hope that you do, too!
