My colleague Jonathan recently kicked off a recurring blog series about our internal adoption and use of Docker containers and Kubernetes. His post gave a brief history of how we came to introduce these technologies as part of ongoing modernization, and detailed how we addressed people and process changes along the way.
In this post, I’m going to take things a step (or two) further with a deeper look into how we went from cloud services running on traditional VMs to containerized services orchestrated by Kubernetes in production. It was a major undertaking for our whole engineering team, with the goal of improving observability, manageability, and reliability as our cloud platform grew.
The familiar challenges of VM-based services
Like so many of our enterprise customers, we originally ran our services on a traditional VM-based infrastructure, with individual service owners responsible for the infrastructure powering their service. This approach worked when we were small, but complexity spread as our cloud service expanded and customers began consuming services en masse.
In our previous architecture, building a new service first required an engineer to specify hardware requirements, then engage our infrastructure team to request the necessary VMs. It was not an unusual process, but gauging the requirements of a service still in development was challenging, and engineers would often overestimate them. Not only did this waste dollars and hours, but the resources were also difficult to reclaim once deployed. The challenges didn’t stop there, however. Once the VMs were identified, we had to configure the prerequisites to run a given service, which introduced another scaling issue: VM provisioning trended toward a superset of all packages and dependencies necessary for a wide range of services, and we couldn’t easily track or audit which service drove which requirement.
This challenge became untenable as our customer base went global, and simply provisioning infrastructure and calculating new hardware requirements became a recurring struggle. We built alerts for critical services to help us monitor the state of our cloud and notify us when an error occurred. Still, our cloud kept growing and manual service remediation became unwieldy. For instance, when a service experienced an out-of-memory kill, we first had to restart it manually. Then, if multiple services shared that VM, we had to investigate each one by hand to determine the root cause.
Moving to containerized services
As we evaluated Docker containers and Kubernetes (the latter being pre-1.0 at the time) as an alternative to traditional VMs, it became clear that such a dramatic shift would require major changes. While we were confident in the long-term benefits, we were also apprehensive about the journey ahead, from reskilling teams to the unforeseen issues that were almost certain to arise.
We decided a measured, iterative approach was the best way forward, and formed a small group of engineers to lead the effort. Their first job was to consolidate a set of best practices for modifying existing services to run in containers, and for monitoring and testing these new containerized services. We also built our own internal tools to:
- Ease the adoption path for service teams
- Orchestrate and operationalize the use of containers
- Update our existing deployment tools to support containerized deploys
We were confident that containers orchestrated with Kubernetes would improve deployment, observability, manageability, and reliability for our cloud service. Achieving these goals meant working cross-functionally using DevOps methods. For us, that simply meant the infrastructure team could write scripts and create manifests, while developers took on greater infrastructure responsibility, sharing the operational load and proactively managing the growth of our cloud platform.
Before we containerized anything, we evaluated existing services to identify those that were less critical and required minimal refactoring. We used these services, which were primarily Python-based and communicated asynchronously, as a proving ground. Container manifests provided an auditable, consistent, and repeatable template for every build. However, as we containerized more complex services, significant administration was still required due to the unique nature of our cloud.
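To make the idea of a manifest as a repeatable template more concrete, here is a minimal Python sketch. It is an illustration only: the helper names `build_deployment` and `render`, the service name, the image path, and the resource values are all hypothetical, and the structure follows the current `apps/v1` Deployment schema rather than anything from the pre-1.0 era. The point is that the resource requirements an engineer once had to estimate in a VM request are declared alongside the service itself, where they can be reviewed and versioned.

```python
import json


def build_deployment(name, image, replicas=2, cpu="250m", memory="256Mi"):
    """Return a Kubernetes Deployment manifest as a plain dict.

    All argument values are illustrative defaults, not recommendations.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Requirements are declared per container, so they
                            # can be audited and changed in one place.
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory},
                                "limits": {"cpu": cpu, "memory": memory},
                            },
                        }
                    ]
                },
            },
        },
    }


def render(manifest):
    """Serialize deterministically so identical inputs give identical text."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical service name and image, for illustration only.
    print(render(build_deployment("example-svc", "registry.example.com/example-svc:1.0")))
```

Because the output is deterministic, two builds from the same inputs produce byte-identical manifests, which makes differences easy to spot in review. Kubernetes accepts manifests in JSON as well as YAML, so the rendered text could be applied as-is.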
In-depth service monitoring
As our container adoption and approach matured, so did our team. Service owners now had a range of sophisticated analytics and monitoring available to them, giving them greater responsibility for their service profiles. Meanwhile, the infrastructure team now had much more visibility into resources thanks to service-level monitoring and a custom-built assessor we created to calculate when a service would need more capacity — all leading to dramatic efficiency gains.
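This post does not go into how the assessor works, so the following is only a sketch of one way such a calculation could be approached, not a description of our tool. It assumes daily usage samples for a single resource, fits a straight line through them with ordinary least squares, and projects when usage would cross a chosen fraction of capacity. The function names, the 80% threshold, and the linear-growth assumption are all hypothetical.

```python
def fit_line(samples):
    """Least-squares fit of usage = slope * day + intercept.

    `samples` is a list of (day, usage) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one day")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def days_until_threshold(samples, capacity, threshold=0.8):
    """Days after the last sample until the fitted trend reaches
    threshold * capacity.

    Returns 0.0 if the trend is already at or above the limit, and
    None if usage is flat or shrinking (no crossing is projected).
    """
    slope, intercept = fit_line(samples)
    limit = capacity * threshold
    last_day = max(x for x, _ in samples)
    if slope * last_day + intercept >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - intercept) / slope - last_day
```

For example, with four daily samples rising steadily from 100 to 130 units against a capacity of 200, `days_until_threshold` projects that the 80% mark is reached three days after the last sample. Real usage is rarely this linear, so a production version would likely need to account for seasonality and bursts.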
In the next post we’ll share more details on our service-level monitoring tools and processes. The combination of what Kubernetes provides “out of the box” and our homegrown tooling has delivered dramatic efficiency gains we hope others can achieve, too. Stay tuned!