Noel: Hello, this is Noel Wurst with Skytap, and I am speaking with Gene Kim today. At Skytap, a lot of us are big fans of Gene and his work, and we just finished reading The Phoenix Project, which we will talk about, as well as The DevOps Cookbook. I’m curious to know how that’s coming along, and when we can expect to read it. Gene, how are you today?
Gene: I’m doing great, and it’s great being here with you.
Noel: So, to get started, being the DevOps aficionado that you are, we recently had a contest on our blog where we asked people to submit their best definition of DevOps, and we knew we would get a bunch of different definitions.
What I thought was the best definition had a line about how “communication was the responsibility of both Dev and Ops.” I know that sounds really simple, but if you extend that definition of “communication,” I thought it did a good job of showing that DevOps is absolutely both departments’ responsibility. It’s not just relying on one team to change culturally. I’d love to get your short definition of DevOps today.
Gene: This is my personal definition: I would define DevOps by the outcomes. In my mind, DevOps is that set of cultural norms and technology practices that enable the fast flow of planned work from, among others, development, through test, into operations, while preserving world-class reliability, operation, and security.
DevOps is not about what you do, but what your outcomes are. So many things that we associate with DevOps fit underneath this very broad umbrella of beliefs and practices, of which communication and culture are, of course, a part. I love this notion that it takes more than just a bunch of nice people communicating well to achieve DevOps outcomes.
For organizations to be able to do tens, hundreds of thousands of deploys per day, while preserving great reliability, security, availability, it takes amazing technical practices, as well.
Otherwise, we’re just a bunch of nice people getting suboptimal outcomes.
Noel: Right. That reminds me of the way we talk about Agile sometimes. There’s so much focus on culture. Obviously culture is very important, but you can have the greatest culture in the world, but without the technology to enable you to put all that communication and culture into play—it’s not going to get you more than having a bunch of nice people at work.
Gene: Exactly. In fact, one of my heroes is Elizabeth Hendrickson. She was one of the speakers at the DevOps Enterprise a couple months ago. We were having this exact same conversation at Agile 2014: some of our best friends are Agile coaches and Agile practitioners, but a talk about a success story that doesn’t involve the actual technical practices is in some ways unsatisfactory.
Exactly what you said, it takes more than just people communicating, and it takes more than just trust. Those are necessary, but those set the stage for the really hard things, which is how do you get a level of productivity and reliability that we didn’t even think possible five years ago.
Noel: Exactly. That really leads to a great quote from The Phoenix Project. It’s a quote from Erik, who has so many great lines, but there’s a line that says:
Allspaw taught us that Dev and Ops working together, along with QA and the business are a super tribe that can achieve amazing things.
He did this by ensuring environments were always available when they were needed. He automated the build and deployment process, recognizing that infrastructure could be treated as code, and says that enabled him to create a “one step environment creation and deploy procedure.”
Here at Skytap, we’ve seen firsthand that the concept of a one-step environment creation and deployment procedure is not something that everyone is even aware exists. What do you think will make that technology common sense for everyone? Is there something that will come along where it won’t be so surprising for people to find out they have access to these things?
Gene: This is one of my favorite lines in the book, because it’s actually one of my own personal “ah-ha” moments. What we see in high performers are several things that substantiate how important this capability is. One, this is in the DevOps Survey of Practice that I did with Jez Humble and the great folks at Puppet Labs. We found that certain behaviors were not only strong predictors of performance, but in some cases, I would elevate them to be prerequisites of performance.
One of them was that there is an automated environment creation process that can build the dev, test, and production environments all at the same time. This is so amazing because it eliminates a whole category of errors that most of us grew up with, which is that those environments don’t actually match, and you only find out during the deploy.
A second piece is that they have an automated deployment process, as opposed to a six-week, thirteen-hundred-step deploy process where all the steps are manual and error-prone.
The third is, where do these environments come from? There are amazing technologies like Skytap, but I think what’s important is that the same build mechanism that creates these environments is under some sort of version control, so we can repeatedly, reliably build these environments. I think the constituency that would benefit from this most is Developers.
That means that Developers can now write and run their code in production-like environments, even at the earliest stage of the process.
That is such a game changer. Because it eliminates the whole phenomenon around, “hey, it worked on my laptop, why doesn’t it work in production?” I think so much of our experience has been, “hey, I would like a test environment” and the answer is, “I can get one for you in thirty weeks.” The whole notion that you can get one on-demand in minutes changes everything.
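To make the idea of one version-controlled build mechanism concrete, here is a minimal sketch that is not from the interview. It assumes a hypothetical `EnvironmentSpec` record checked into source control, with illustrative field names and values, and derives dev, test, and prod from that single definition so they cannot drift apart. It does not represent the API of Skytap or any other real tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical description of an environment, kept in version control."""
    name: str
    os_image: str
    app_version: str
    packages: tuple


# One checked-in definition; every environment is derived from it.
# All values below are illustrative assumptions.
BASE = EnvironmentSpec(
    name="base",
    os_image="linux-base-1.0",
    app_version="2.3.1",
    packages=("nginx", "postgresql"),
)


def build_environments(base, names=("dev", "test", "prod")):
    """Derive all environments from the same spec in one step."""
    return {n: replace(base, name=n) for n in names}


def drift(a, b):
    """Return the fields (other than name) where two environments differ."""
    fields = ("os_image", "app_version", "packages")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


envs = build_environments(BASE)
print(drift(envs["dev"], envs["prod"]))  # prints []
```

Because every environment comes from the same spec, a change to the base definition reaches dev, test, and prod together, which is the property that removes the “it worked on my laptop” class of errors.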
Noel: There’s a great survey that was recently done by voke, Inc., and voke’s founder Teresa Lanowitz uses the phrase “lifecycle virtualization” to describe the benefits of extending virtualization into dev and test, and shifting everything left. I’ve seen her speak on this a couple times and she talks about the fact that it’s 2014. It shouldn’t be so mind blowing that you have the ability to do these things.
We’re not sitting around waiting on environments and provisioning and hardware anymore. There’s just no need.
Gene: John Willis said something I found startling. He said these capabilities around virtualization have been around for nearly a decade. I think the magic was virtualization, where you could spin up environments in seconds or minutes. The problem was that the provisioning and code deployment process was still manual, so the whole notion of “infrastructure as code” has elevated our sights on not only what’s possible, but what is actually required to achieve world-class performance.
The notion of virtualization, and being able to spin up environments, it’s not just good for Ops. The fact you can shift everything left dramatically improves productivity for developers. Jez Humble has this great quote. He said the best proxy measure, for him, of developer productivity, is lead time.
Lead time would be measured by, “how quickly can we go from code committed, through the test cycle where everything is running in a production-like environment, to being deployed into production?”
This is such a provocative statement. Development productivity is not about how many features can you complete, it’s about how quickly can you go from, it runs on your laptop, to where you have confidence that it’s going to run in production, and then it’s actually running in production. I love that quote.
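As a rough illustration of lead time as described here, the sketch below computes the time from “code committed” to “running in production” for a few changes and reports the median. The timestamps are made up, and the choice of the median as the summary statistic is an assumption, not something stated in the interview.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (time committed, time running in production).
changes = [
    (datetime(2014, 11, 3, 9, 0), datetime(2014, 11, 3, 15, 30)),
    (datetime(2014, 11, 4, 10, 0), datetime(2014, 11, 6, 10, 0)),
    (datetime(2014, 11, 5, 14, 0), datetime(2014, 11, 5, 16, 0)),
]


def lead_time_hours(committed, deployed):
    """Hours from code committed to code running in production."""
    return (deployed - committed).total_seconds() / 3600


lead_times = [lead_time_hours(c, d) for c, d in changes]
median_lead_time = median(lead_times)
print(f"median lead time: {median_lead_time:.1f} hours")  # median lead time: 6.5 hours
```

Note that nothing here counts features completed; the only inputs are when a change was committed and when it was actually in production.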
Noel: That leads to another quote from The Phoenix Project. “Until code is in production, there’s no value actually being generated.”
Gene: Right. In fact, there’s a saying from the lean manufacturing community that I think about so often. There’s this deeply held belief that lead time is the most accurate predictor of customer satisfaction, quality, and employee happiness. This is one of the most cherished expressions.
As we benchmarked over 14,000 organizations in the last two years, we found that was absolutely true. Lead time correlates with change success rate, mean time to repair, deployment frequency, employee happiness, and even organizational performance, meaning how likely the organization is to exceed productivity, profitability, and market share goals.
They’re twice as likely to exceed profitability, market share and productivity goals.
Noel: That’s awesome.
Gene: It’s true of manufacturing, and apparently true in DevOps as well.
Noel: That’s very cool. So, talk about The DevOps Cookbook for a little while. I noticed on your website it says that your intent behind it was, “To catalog what the high performing DevOps organizations all have in common. Attempting to describe all the necessary and sufficient steps to create the culture of values, processes, procedures, and daily work behind their transformations.”
That sounds like the most massive undertaking. It’s so much to cover. How long have you been compiling that research and then how many organizations did you collect that information from?
Gene: The DevOps Cookbook was actually supposed to come out before The Phoenix Project. Arguably, it’s about two years late. I’ve promised everybody and myself that a first draft manuscript will be complete this year.
I would say in some ways I’m very glad that this is coming out now as opposed to two years ago.
Noel: I think it works well.
Gene: Yeah, up to now we’ve been able to benchmark 14,000 organizations and really prove things to a level of confidence that we wouldn’t have been able to reach had we not done that research.
It’s now been at least two years in the making.
Noel: That sounds like a pretty short time for the number of things you’re attempting to show in there. Did you find that many of those who were “high performing,” as you call them, were “doing DevOps” in a similar way? Or, kind of like Agile, in all its ambiguity and all the definitions of DevOps, did you find people doing DevOps in their own unique way, yet still reaching that same goal?
Gene: No, I would say our experience has been the opposite. There’s a certain uniformity about “the emergent behavior.” It’s as if all these different organizations have simultaneously and independently derived the same sort of systems of work of how exactly does dev, test, InfoSec, and Ops work together to get these amazing outcomes.
We now know through the research that there are seven predictors of performance.
First off is the use of version control by operations. In other words, not only do you have a build mechanism that can build the Dev, test, and prod environments all at the same time, but this is under version control. There are continuous integration and continuous testing practices.
Developers check code into trunk every day, and there are automated test suites to validate that it’s actually working as designed. There are automated deployment processes, there’s proactive monitoring of the production environment, and there’s a high-trust culture. The presence of a high level of trust seems to be a prerequisite for their performance, as opposed to a culture of fear or a culture of bureaucracy.
Those seem to be the emergent behaviors and beliefs that drive all organizations, whether it’s Google, Amazon, Etsy, or whether it’s Disney, Raytheon, or the U.S. Department of Homeland Security.
It’s always been very cool to be able to show that it’s not an amorphous bunch of nice people doing random things. Instead, it’s a common philosophy, a common set of practices that lead to high performance.
Noel: It’s almost a relief to find out that, one, there are 14,000 that are doing something similar to this. You look at the track record of some of those bigger names you were talking about and to know it’s a set of things that are hard to argue against.
It almost takes a little bit of the weight off of, “can we do this, will this work here, are we too big, are we too small?” Those are all things that would be great for everyone to be doing.
Gene: The type of research we do, where you benchmark 14,000 organizations, is called “a cross-population study.” That’s the type of research they used to link smoking with early morbidity.
You cast a very wide net, and then you try to figure out what is the link between early death, behaviors, and environmental factors and so forth.
One of the reasons that one does cross-population studies is to negate objections. People say, “Is it really true that large organizations can’t be high performers?” No, it turns out that that’s not true.
How about, “Does it really matter if you have a DevOps team or not?” Jez thought yes, I thought no. We tested it, and it turned out it didn’t matter. Does it matter who does your deploys? We measured change success rate. Jez thought it would be better if Dev does deploys, I thought it was better if Ops does deploys, and it turns out it didn’t matter. The two populations have statistically identical performance.
These are the kind of awesome things you can do when you can do benchmark studies like this.
Noel: That’s awesome. To move to another bit of writing that you’ve done, you have a whitepaper on your site that I really enjoyed and would suggest other people check out. It’s, “The top 11 things you need to know about DevOps.”
One of your favorite DevOps patterns resonated with me a lot. You talk about the need for an automated environment creation process, to go back to that again.
You said that it makes environments available not just early, but perhaps before a project even begins. Like these things you just listed, your seven principles, this automated creation process has become almost essential at this point.
There’s not a lot of room for saying, “Our wait time isn’t that bad, or we’re not being that hampered.” They should be available before you’ve even started.
Gene: These are certainly, out of the set of about seven things we found, among the strongest predictors of performance. I think it’s hard to argue that you can get high performance if Developers are building their own environments: environments that are different from the production environments, or that take a long time to get, or that lag behind what the real production environment is. It’s one of these sort of logical, satisfying conclusions. Absolutely, yes.
I think it matches most people’s experiences. I love this phrase that someone said to me. He said, “The goal of science is to explain the greatest number of observable phenomena with the fewest number of principles, to confirm deeply held intuitions, and to reveal surprising insights.”
I think this is one of those. It makes a lot of sense.
Noel: Absolutely. I’ve got one last question for you. In that same list of your eleven things to know about DevOps, number eight asks a question: “How do InfoSec and QA integrate into the DevOps work team?” I think this is a great question and whenever you’ve been talking about it in this conversation, you’ve always mentioned reliability and security as well.
But, you mentioned something I hadn’t heard before, and it was:
When software is delivered as a service, and defects can be fixed very quickly, then QA can reduce its reliance on testing and instead rely more on production monitoring to detect defects in production, as long as they can be quickly fixed.
I just wanted to get clear on that. I’d always thought about testers wanting to test more often and the whole move to test earlier, but I wanted to find out the risks or the rewards of being able to detect defects in production and not sooner.
Gene: There’s an adage in the DevOps community of, “Architect not for mean time between failures, but instead architect for mean time to repair.”
In other words, instead of spending our time trying to create the perfect system that never fails, let’s instead create systems that are designed so that we can detect problems early and fix them fast.
I guess what I might have imprecisely stated in that sentence was, if you adopt this philosophy, it means that we don’t have to find and prevent all the issues before we release, as long as we can have confidence that we’ll find issues fast enough, that it’s safe to make changes, and that service then restores quickly. Let’s consider the opposite of that: we have this fragile, big monolithic application that takes two weeks to deploy, and whenever something goes wrong, three hundred people have to drop what they’re doing to nurse a quick hot fix into production.
In that case, the desire is to do more testing to try to prevent future failures. This whole notion of an environment where it’s safe to make changes, and where we can rely upon production monitoring to find issues quickly, alleviates a huge burden for developers, quality engineers, and information security.
I think that’s another game changer in technology.
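One way to see why the adage about mean time to repair holds up is the standard steady-state availability formula from reliability engineering, availability = MTBF / (MTBF + MTTR). The formula is not mentioned in the interview, and the numbers below are hypothetical; this is only a sketch comparing a system that rarely fails but is slow to fix with one that fails daily but recovers in minutes.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system that fails about once a month but takes a day to repair.
rarely_fails = availability(mtbf_hours=720, mttr_hours=24)

# Hypothetical system that fails about once a day but recovers in six minutes.
fast_repair = availability(mtbf_hours=24, mttr_hours=0.1)

print(f"rarely fails, slow repair: {rarely_fails:.4f}")  # 0.9677
print(f"fails often, fast repair:  {fast_repair:.4f}")   # 0.9959
```

Under these assumed numbers, the system designed for fast detection and repair is available more of the time, even though it fails thirty times as often.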
Noel: That reminds me of that video of the Jeep.
Gene: Yeah!
Noel: They didn’t just break it, they broke it into every single part imaginable, but like you just said, they rebuilt it on the road.
It didn’t have to go back to the body shop or back to the testing team. They rebuilt it on the road in just as quick a time as it took to break. It was back up and running in no time.
Gene: Exactly right. Imagine if they had to take it back to the shop and go through a two-month certification process in order to be back on the road. What we’re talking about is the exact opposite.
Noel: Exactly. Awesome. That’s about all I had for you today, thank you so much for your time. And if anyone wants to learn more about Gene, to learn more about The Phoenix Project, or The DevOps Cookbook, I recommend doing all three.
Gene: Hey, no problem. And if anyone wants to subscribe to updates about the DevOps Cookbook release and receive an advance draft of the manuscript, you can sign up here.
Gene is a multiple award winning CTO, researcher and author. He was founder and CTO of Tripwire for 13 years. He has written three books, including “The Visible Ops Handbook” and “The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win.” Gene is a huge fan of IT operations, and how it can enable developers to maximize throughput of features from “code complete” to “in production,” without causing chaos and disruption to the IT environment.