GCP and AWS: What’s the Difference?

AWS has long been leading the charge when it comes to public cloud providers. I believe this is largely attributed to Bezos’ mandate of “APIs everywhere” in the early days of Amazon, which in turn allowed them to be one of the first major players in the space. Google, on the other hand, has a very different DNA. In contrast to Amazon’s laser-focused product mindset, their approach to cloud has broadly been to spin out services based on internal systems backing Google’s core business. When put in the context of the very different leadership styles and cultures of the two companies, this actually starts to make a lot of sense. But which approach is better, and what does this mean for those trying to settle on a cloud provider?

I think GCP gets a bad rap for three reasons: historically, their support has been pretty terrible, there’s the massive gap in offerings between GCP and AWS, and Google tends to be very opaque with its product roadmaps and commitments. It is nearly impossible now to keep track of all the services AWS offers (which seems to continue to grow at a staggering rate), while GCP’s list of services remains fairly modest in comparison. Naively, it would seem AWS is the obvious “better” choice purely due to the number of services. Of course, there’s much more to the story. This article is less of a comparison of the two cloud providers (for that, there is a plethora of analyses) and more of a look at their differing philosophies and legacies.

Philosophies

AWS and GCP are working toward the same goal from completely opposite ends. AWS is the ops engineer’s cloud. It provides all of the low-level primitives ops folks love like network management, granular identity and access management (IAM), load balancers, placement groups for controlling how instances are placed on underlying hardware, and so forth. You need an ops team just to manage all of these things. It’s not entirely different from a traditional on-prem build-out, just in someone else’s data center. This is why ops folks tend to gravitate toward AWS—it’s familiar and provides the control and flexibility they like.

GCP is approaching it from the angle of providing the best managed services of any cloud. It is the software engineer’s cloud. In many cases, you don’t need a traditional ops team, or at least very minimal staffing in that area. The trade-off is it’s more opinionated. This is apparent when you consider GCP was launched in 2008 with the release of Google App Engine. Other key GCP offerings (and acquisitions) bear this out further, such as Google Kubernetes Engine (GKE), Cloud Spanner, Firebase, and Stackdriver.

Platform

A client recently asked me why more companies aren’t using Heroku. I have nothing personal against Heroku, but the reality is I have not personally run into a company of any size using it. I’m sure they exist, but looking at the customer list on their website, it’s mostly small startups. For greenfield initiatives, larger enterprises are simply apprehensive to use it (and PaaS offerings in general). But I think GCP has a pretty compelling story for managed services with a nice spectrum of control from fully managed “NoOps” type services to straight VMs:

Firebase, Cloud Functions → App Engine → App Engine Flex → GKE → GCE

With a typical PaaS like Heroku, you start to lose that ability to “drop down” a level. Even if a company can get by with a fully managed PaaS, they feel more comfortable having the escape hatch, whether it’s justified or not. App Engine Flexible Environment helps with this by providing a container as a service solution, making it much easier to jump to GKE.

I read an article recently on the good, bad, and ugly of GCP. It does a nice job of telling the same story in a slightly different way. It shows the byzantine nature of the IAM model in AWS and GCP’s much simpler permissioning system. It describes the dozens of compute-instance types AWS has and the four GCP has (micro, standard, highmem, and highcpu—with the ability to combine whatever combination of CPU and memory that makes sense for your workload). It also touches on the differences in product philosophy. In particular, when GCP releases new services or features into general availability (GA), they are usually very high quality. In contrast, when AWS releases something, the quality and production-readiness varies greatly. The common saying is “Google’s Beta is like AWS’s GA.” The flipside is GCP’s services often stay in Beta for a very long time.

GCP also does a better job of integrating their different services together, providing a much smaller set of core primitives that are global and work well for many use cases. The article points out Cloud Pub/Sub as a good example. In AWS, you have SQS, SNS, Amazon MQ, Kinesis Data Streams, Kinesis Data Firehose, DynamoDB Streams, and the list seems to only grow over time. GCP has Pub/Sub. It’s flexible enough to fit many (but not all) of the same use cases. The downside of this is Google engineers tend to be pretty opinionated about how problems should be solved.

This difference in philosophy usually means AWS is shipping more services, faster. I think a big part of this is because there isn’t much of a cohesive “platform” story. AWS has lots of disparate pieces—building blocks—many of which are low-level components or more or less hosted versions of existing tech at varying degrees of ready come GA. This becomes apparent when you have to trudge through their hodgepodge of clunky service dashboards which often have a wildly different look and feel than the others. That’s not to say there aren’t integrations between products, it just feels less consistent than GCP. The other reason for this, I suspect, is Amazon’s pervasive service-oriented culture.

For example, AWS took ActiveMQ and stood it up as a managed service called Amazon MQ. This is something Google is unlikely to do. It’s just not in their DNA. It’s also one reason why they are so far behind. GCP tends to be more on the side of shipping homegrown services, but the tech is usually good and ready for primetime when it’s released. Often they spin out internal services by rewriting them for public consumption. This has made them much slower than AWS.

Part of Amazon’s problem, too, is that they are—in a sense—victims of their own success. They got a much earlier head start. The AWS platform launched in 2002 and made its public debut in 2004 with SQS, shortly followed by S3 and EC2. As a result, there’s more legacy and cruft that has built up over time. Google just started a lot later.

More recently, Google has become much more strategic about embracing open APIs. The obvious case is what it has done with Kubernetes—first by open sourcing it, then rallying the community around it, and finally making a massive strategic investment in GKE and the surrounding ecosystem with pieces like Istio. And it has paid off. GKE is, by far and away, the best managed Kubernetes experience currently available. Amazon, who historically has shied away from open APIs (Google has too), had their hand forced, finally making Elastic Container Service for Kubernetes (EKS) generally available last month—probably a bit prematurely. For a long time, Amazon held firm on ECS as the way to run container workloads in AWS. The community spoke, however, and Amazon reluctantly gave in. Other lower-profile cases of Google embracing open APIs include Cloud Dataflow (Apache Beam) and Cloud ML (TensorFlow). As an aside, machine learning and data is another area GCP is leading the charge with its ML and other services like BigQuery, which is arguably a better product than Amazon Redshift.

There are some other implications with the respective approaches of GCP and AWS, one of which is compliance. AWS usually hits certifications faster, but it’s typically on a region-by-region basis. There’s also GovCloud for FedRAMP, which is an entirely separate region. GCP usually takes longer on compliance, but when it happens, it certifies everything. On the same note, services and features in AWS are usually rolled out by region, which often precludes organizations from taking advantage of them immediately. In GCP, resources are usually global, and the console shows things for the entire cloud project. In AWS, the console UIs are usually regional or zonal.

Billing and Support

For a long time, billing has been a rough spot for GCP. They basically gave you a monthly toy spreadsheet with your spend, which was nearly useless for larger operations. There also was not a good way to forecast spend and track it throughout the month. You could only alert on actual spend and not estimated usage. The situation has improved a bit more recently with better reporting, integration with Data Studio, and the recently announced forecasting feature, but it’s still not on par with AWS’s built-in dashboarding. That said, AWS’s billing is so complicated and difficult to manage, there is a small cottage industry just around managing your AWS bill.

Related to billing, GCP has a simpler pricing model. With AWS, you can purchase Reserved Instances to reduce compute spend, which effectively allows you to rent VMs upfront at a considerable discount. This can be really nice if you have stable and predictable workloads. GCP offers sustained use discounts, which are automatic discounts that get applied when running GCE instances for a significant portion of the billing month. If you run a standard instance for more than 25% of a month, Google automatically discounts your bill. The discount increases when you run for a larger portion of the month. They also do what they call inferred instances, which is bin-packing partial instance usage into a single instance to prevent you from losing your discount if you replace instances. Still, GCP has a direct answer to Amazon’s Reserved Instances called committed use discounts. This allows you to purchase a specific amount of vCPUs and memory for a discount in return for committing to a usage term of one or three years. Committed use discounts are automatically applied to the instances you run, and sustained use discounts are applied to anything on top of that.

Support has still been a touchy point for GCP, though they are working to improve it. In my experience, Google has become more committed to helping customers of all sizes be successful on GCP, primarily because AWS has eaten their lunch for a long time. They are much more willing to assign named account reps to customers regardless of size, while AWS won’t give you the time of day if you’re a smaller shop. Their Customer Reliability Engineering program is also one example of how they are trying to differentiate in the support area.

Outcomes

Something interesting that was pointed out to me by a friend and former AWS engineer was that, while GCP and AWS are converging on the same point from opposite ends, they also have completely opposite organizational structures and practices.

Google relies heavily on SREs and service error budgets for operations and support. SREs will manage the operations of a service, but if it exceeds its error budget too frequently, the pager gets handed back to the engineering team. Amazon support falls more on the engineers. This org structure likely influences the way Google and Amazon approach their services, i.e. Conway’s Law. AWS does less to separate development from operations and, as a result, the systems reflect that.

Suffice to say, there are compelling reasons to go with both AWS and GCP. Sufficiently large organizations will likely end up building out on both. You can use either provider to build the same thing, but how you get there depends heavily on the kinds of teams and skill sets your organization has, what your goals are operationally, and other nuances like compliance and workload shapes. If you have significant ops investment, AWS might be a better fit. If you have lots of software engineers, GCP might be. Pricing is often a point of discussion as well, but the truth is you will end up spending more in some areas and less in others. Moreover, all providers are essentially in a race to the bottom anyway as they commoditize more and more. Where it becomes interesting is how they differentiate with value-added services. This is where “multi-cloud” becomes truly meaningful.

Real Kinetic has extensive experience leveraging both AWS and GCP. Learn more about working with us.

Scaling DevOps and the Revival of Operations

Operations is going through a renaissance right now. With the move to cloud, the increasing amount of automation, and the increasing importance of automation, Ops as we know it is reinventing itself out of necessity. Infrastructure is becoming more and more sophisticated—and commoditized—and practices are just now starting to grow up around that. So while some worry about robots taking our jobs, the reality is more about how automation will help augment us to build better software and focus on higher-value things. It’s not so much about the distant future—whatever that may hold—so much as it is about the next five to ten years, what Operations looks like in that timeframe, and why I think it has to retool.

When we think about traditional Operations, we probably think about hardware and servers, managing networks and databases, application servers and runtimes, disaster recovery, Nagios checks, as well as the business side—vendor management, procurement, and so on. Finally, we have applications built on top by development teams.

We have a nice, clean separation—developers focus on building features and products, and Ops focuses on making sure the lights stay on. Of course, we know the reality is this separation also creates a lot of problems, so DevOps was borne out of this as a way to bring these two groups into alignment by improving communication and feedback loops.

Now, with the move to cloud, many of these traditional Ops functions are effectively being outsourced to cloud providers, i.e. the idea of NoOps. We get unprecedented elasticity and on-demand compute with far less overhead than we ever had before—shrinking procurement time from days or weeks to seconds or minutes.

What this leaves is a thin but important slice between Google or Amazon and those products built by developers—the glue, essentially, between cloud and product. I call this NewOps (which I use facetiously in reference to NoSQL/NewSQL), and it’s the future of Ops. This encompasses infrastructure automation, deployment automation, configuration management, logging, monitoring, and many other things. When Marc Andreessen said software is eating the world, he really meant it. The future of Ops—and many other things—is software. It’s killing the boring, repetitive things we really don’t want to be doing anyway and letting us shift our focus elsewhere.

Certainly, automation is nothing new and is, I think, an important part of DevOps, so I’m going to explain what I mean by NewOps and why I’m distinguishing it. I also don’t want to mischaracterize by having these neatly delineated Ops models. The truth is, your company doesn’t just one day graduate and gets its DevOps diploma. Instead, it might evolve through various manifestations of these different models. DevOps is a journey, not a destination in and of itself.

I like to think of a DevOps scale of automation, from manual provisioning all the way to fully self-service. Next, I add a second dimension, org size, from the smallest startups to the biggest enterprises.

Scaling DevOps

Scaling a business is probably one of the hardest things a company has to go through. In particular, dealing with the problem of silos. They happen at every company as it grows, but why is it that silos form in the first place?

Many companies start with a “DevOps” approach, often out of necessity more than anything. As a small startup, we can’t afford to have dedicated developers, QA, Ops, and security people. We just have people, and those people wear many different hats. Developers might be pushing their own code to production. They might even be managing the infrastructure that code runs on. There’s probably not a lot of stability, probably a lot of risk, and probably not a whole lot of thought towards controlling costs.

But as the product scales, we specialize. And as the business scales, we add various safety checks, controls, and processes. Developers write code, Ops people run it, QA gets blamed for defects, security blocks everything, and management wonders why nothing gets shipped.

And so we end up in the top left-hand quadrant with Ops as gatekeepers. Ops is fighting for stability and, at the same time, devs are basically fighting for change. More or less, we have a stable, cost-controlled, risk-averse environment—hopefully. But we also have a significant delivery and innovation bottleneck.

Specialization is good! But misalignment is not good. The question is, then, how do we scale specialization? Cross-functional teams come to mind. After all, DevOps encourages cooperation! We add an Ops engineer to each team, and maybe a reliability engineer, and perhaps a few extra for on-call backup, and of course a QA engineer too. Problem solved, right?

But hold on. What if we have 40 development teams? And all those teams are doing microservices. And, of course, all of those microservices are special snowflakes each with their own stacks, infrastructure, databases, and so on. This quickly gets out of control, but moreover, that’s a lot of teams and specialized roles on those teams. That’s a lot of headcount which equates to a lot of hiring and a lot of time and money. If you’re Google and you can just throw money at the problem, this might work out okay. For the rest of us, it might not be such a realistic option.

We go back to the drawing board and again ask ourselves how do we scale specialization? My thought to how we do this is with vision and product.

A vision is simply a mental image of what the future could be like. It enables independent decision making and alignment. Vision allows all of those teams, and the people on those teams, to make decisions without having to constantly coordinate with each other. Without vision, you’re just iterating to nowhere fast.

But vision without execution is just hallucination. Products are how we scale execution. Specifically, this idea of Operations through the lens of product, which I’ll describe after showing the parallel with what’s happening in QA.

In a lot of engineering organizations, many QA roles have been quietly disappearing. I think what’s happening is this evolution of QA, particularly, this shift from being test-focused to tools-focused.

We can look at companies like Amazon and Microsoft who popularized the SDET (Software Development Engineer in Test) model. These companies recognized that having a separate QA and development group causes a lot of problems, just like how having a separate Ops group does. We end up with SDEs (Software Development Engineers) who still focus on the development aspects of building software and SDETs who focus on the quality aspects, but rather than having two wholly separate groups, we just have development teams with SDETs embedded in them.

More recently, Microsoft moved to what they call a “Combined Engineering” model—effectively combining the SDE and SDET roles into a single role called a Software Engineer. Software Engineers write the product code, test code, and tools code needed to deliver their service. They are responsible for everything. Quality is a core concern of software development anyway.

Software Engineers write the code, unit tests, and integration tests. Those tests run in CI. The code moves through a CD pipeline before finally going out to production in some fashion. QA teams are shrinking, but what’s growing are the teams building the tools—the CI environments, the CD pipelines, the automated testing frameworks, the production tooling and automation, etc. The same is becoming true of Ops.

This is what I mean by “Operations through the lens of product.” The build, release, deploy automation, configuration management, infrastructure automation, logging, monitoring—these are all products.

Constraints often make problems easier. At Workiva, as we were struggling through that scaling phase, we placed a constraint on ourselves. We capped our infrastructure engineering headcount at 15% of R&D. This forced us to solve the problem using technology, and technical problems tend to be easier than people problems. In effect, this required us to productize our infrastructure. In doing so, we scaled. We controlled costs. We kept our headcount in check. We reduced risk. We accelerated development. Ultimately, we delivered value to customers faster, going from about three to four releases per year to multiple releases per day. In the end, this is really the goal of DevOps—to deliver value to customers continuously and to do it rapidly and reliably.

Rethinking Ops

It’s time we start to rethink Operations because clearly this model of Ops as cluster or infrastructure admins does not scale. Developers will always out-demand their capacity to supply. Either your headcount is out of control or your ability to innovate and deliver is severely hamstrung. Operations becomes this interrupt-driven thing where we’re just fighting fires as they happen. Ops as masters of production usually devolves to Ops becoming human incident routers, trying to figure out what team or person can help resolve problems because, being responsible for everything, they don’t have the insight to fix it themselves.

Another path that many companies take is Platform as a Service. Workiva is an example of this. For a very long time, Workiva didn’t have a traditional Ops team because the Ops team was Google. The first product was built on Google App Engine. This helped immensely to deliver value to customers quickly. We could just focus on the product and not the surrounding operational aspects, but there is a very real innovation bottleneck that comes with this.

The idea of “Ops lock-in” can be a major problem, whether it’s a PaaS like App Engine locking you in or your own Ops team who just isn’t able to support the kind of innovation that you’re trying to do.

My vision for the future of Operations is taking Combined Engineering to its logical conclusion. Just like with QA, Ops capabilities should be embedded within development teams. The reality is you can’t be an effective software engineer today without some Ops skills, and I think every role should be working towards automating itself out of a job. Specifically, my vision is enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services.

The knee-jerk reaction to this idea is usually fully embracing Infrastructure as a Service, infrastructure as code, and giving developers freedom—and usually the consequences are dire. The point here is that the pendulum can swing too far in the other direction. This was a problem for a brief period of time at Workiva. As we were building new products off of App Engine, developers had this newfound freedom, so teams all went different directions introducing new tech, new infrastructure, new services, and so forth. It was a free-for-all, an explosion of stuff, and the cost explosion that comes with it.

There has to be some control around that, so we tweak the vision statement a bit: enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services…with minimal Ops intervention. We have to have some checks and balances in place.

With this, Ops become force multipliers. We move away from the reactive, interrupt-driven model where Ops are masters of production responsible for everything. Instead, we make dev teams responsible for their services but provide the tools they need to actually own their systems end-to-end—from the code on their laptops to operating it in production.

Enabling developers to self-service through tooling and automation means treating Ops as a product team. The infrastructure automation, deployment automation, configuration management, logging, monitoring, and production tools—these are all products. It’s these products that allow teams to fully own their services. This leads to empowerment.

I have this theory that all engineering organizations operate in this fashion which I call pain-driven development. As a company grows, it starts to develop limbs—teams or silos. Each of these limbs has its own pain receptors. Teams operate in a way that minimizes the amount of pain that they feel, it’s human instinct. We make locally optimal decisions to minimize pain and end up following a path of least resistance.

Silos promote pain displacement, which results in a “bulkhead” effect. Product development feels the pain of building software, QA feels the pain of testing software, and Ops feels the pain of running software. This creates broken feedback loops. For instance, developers aren’t feeling the pain Ops is feeling trying to run their software. We just throw things over the wall and it becomes an empathy problem.

This leads to misaligned incentives because each team will optimize for the pain that they feel. How do you expect developers to care about quality if they’re not on the hook? Similarly, how do you expect them to care about operability if they’re not on the hook? Developers won’t build truly reliable software until they are on-call for it and directly responsible. However, responsibility requires empowerment. You can’t have one without the other. You can’t ask someone to care about something and fix it without also giving them the power to do so. Most Ops teams simply haven’t done enough to empower and offload responsibility onto development teams.

Products enable ownership. We move away from Ops as masters of production responsible for everything and push that responsibility onto dev teams. They are the experts for their services. They are best equipped to deal with problems that arise. But we provide the tools they need to diagnose and resolve those problems on their own.

Products maintain control through enablement—enabling teams to follow best practices for builds, testing, deploys, support, and compliance. Compliance and other SDLC requirements have to be encoded into the tools and processes. These are things developers won’t empathize with or simply won’t understand. Rather than giving them a long list of things they have to do, we take as many of those things as we can and bake them into our products. If you use these tools or follow these processes, you’ll get a lot of this stuff for free. This reduces risk and accelerates development.

Similarly, we can’t allow all of the special snowflakes to happen. We have to control that explosion of stuff. To do this, we use pain-driven development to our advantage by creating paths of least resistance. Using standardized patterns, application shapes, and infrastructure services, we can setup “paths” to both make it easier to reach production and meet the goals of the business. As a developer, if you follow this path, your life will be a lot easier and you’ll feel less pain. If you deviate from that path, things get much harder—and painful.

We end up with a set “menu” of standard application shapes and infrastructure. If teams want to deviate and go off-menu, it’s on them to make a case for it. For example, if I want to introduce Erlang into our stack, it’s on my team and me to present the case for that. Part of this might mean we help build and maintain the tools needed to support that. If there is a compelling enough case or enough teams are making similar asks, we can start to standardize new shapes.

Note that we aren’t necessarily mandating technologies, but we’re leveraging pain-driven development to work in our favor.

Products in Practice

Next, I’m going to look at this idea of Operations through the lens of product in a bit more detail. We’ll see what this might actually look like in practice, again using Workiva as a bit of a case study.

Below is the high-level flow that I think about, from code on laptop to code in production.

Starting with the Build and continuous integration stage, this workflow tends to look something like the following. A developer pushes a change to a branch in a code repository, e.g. GitHub. This triggers a few things to happen. First, the build process, which runs unit/integration tests and builds artifacts. This, in turn, might trigger a QA and/or compliance process. At the same time, we have code reviews happening. All of these processes provide feedback to the developer to quickly iterate.

Workiva has a lot of automated processes built into the developer workflow, some off-the-shelf and some built in-house. For example, when a PR is opened, a security scanner runs which does static analysis and looks for various security vulnerabilities. This can flag a security review when a closer look is needed. Likewise, there is code coverage, automated builds, unit tests, and integration tests, Docker image builds, and compliance checks. The screenshots below come from an open-source repo showing some of these products in practice.

For compliance reasons, Workiva requires at least one other person sign-off on code changes. GitHub provides pretty good support for this. Code reviewers provide their feedback, developers work through that feedback, and, once satisfied, reviewers give their “plus one.”

The screenshot below shows some of the automated processes Workiva relies on in the developer workflow: Travis CI, Codecov, Smithy (which is Workiva’s internal build system), Skynet (automated testing), Rosie (automated compliance controls, e.g. do you have code reviews, security reviews, other SDLC compliance requirements?), and Aviary (the security scanner). Once all of these have passed, the PR is automatically labeled with “Merge Requirements Met” and the change can be merged into master.

There are a couple things worth pointing out with this workflow. First, the build plan is part of the code and not baked into some build tool. This allows dev teams to fully control their builds. Second, you noticed that Workiva has very deep integration with GitHub. This has allowed them to build automated controls into the development process, which speeds up the developer’s workflow while reducing risk.

Next, we move on to the Release stage. This flow looks something like the following:

The developer tags a branch for release, which triggers a build process for creating the artifact. This may have a QA process which then promotes the artifact to a development artifact repository. As you may have noticed, Workiva has a lot of compliance requirements since they deal with companies’ pre-financial data, so there is typically a sign-off process at various stages involving different parties like Release Management, QA, Security, etc. Depending on your compliance controls, this might just be clicking a button to promote an artifact to a production repository. From there, it can actually be deployed to a production environment.

With this workflow, artifact tagging, building, and promotion is all automated. It’s also important we have processes around security. Container and machine image auditing is automated as well as security patching for OS updates, etc. For example, this workflow might use something like Packer to automate AMI building. Finally, the artifact sign-off is streamlined for the various parties involved, if not fully automated.

Now we’re ready to actually deploy our application. This is a key part of self-service and “owning” a product. This allows a team to configure their application and, ideally, deploy it themselves to production. Initially, this might be handled by a Release Management team who actually clicks the deploy button, but as you become more confident in your processes and your tools become more mature, more of this responsibility can be pushed onto the development teams.

This is also where control comes into play. For instance, I may be allowed to configure my application to use 1GB of RAM, but if I need 1TB, I may need to get additional sign-off.

Self-service deploys and self-service configuration—with guard rails—are an important part of continuous deployment. Additionally, infrastructure provisioning should be automated. No more submitting tickets for a nameless Ops person to provision and configure servers, VMs, or other resources—no ticket-driven development.

I’ve been deliberate about not prescribing particular solutions for some of these problems. You might be using Kubernetes or ECS to orchestrate containers, it doesn’t really matter. These should mostly be implementation details. What does matter, though, is having good abstractions around certain implementation details. For example, Workiva was meticulous about building some layers around workload scheduling. This allowed them at one point to switch from using Fleet to ECS to manage containers with virtually no impact to developers. With the amount of churn that happens in tech, it’s important not to tie yourself too heavily to any one implementation. Instead, think about the APIs you expose for your infrastructure and consider those the deliverable.

Finally, we need to operate our service in production, another important part of ownership. There are a lot of products here, so we’ll just look at a cross section.

Logging is arguably the most important part of how we figure out what is happening in our systems. For this reason, Workiva built structured logging and metrics specs and language libraries implementing these specs. As a developer, this made it easy to simply pull in the library for your language and get structured, contextual logging for free. The other half to this was building out a data pipeline. Basically all metadata at Workiva went into Amazon Kinesis, including logs, metrics, and traces. First, this allowed us to reuse the same infrastructure for all of this data, from the agents running on the machines to the pipeline itself. Second, it allowed us to fan this data out to different backend systems—Splunk, SumoLogic, Datadog, Stackdriver, BigQuery, as well as various internal tools. This is probably one of the most important things you can do with your infrastructure.

Other continuous operations tools include telemetry, tracing, health checks, alerting, and more sophisticated production tools like canary deploys, A/B testing, and traffic shadowing. Some might refer to these as tools for testing in production. Realistically, once you reach a certain scale, testing in production is the only real alternative to the proliferation of deployment environments.

It’s worth mentioning that you do not need to build all of these products yourself. In fact, you shouldn’t. Many off-the-shelf solutions just need glued together. However, I’ve also come to realize that it’s often the “glue” that is important. That is to say, taking some large, commercial off-the-shelf solution and introducing it into a company is frequently rife with headaches. It’s like Jira, a big Frankenstein product that attempts to solve everyone’s problems and, in doing so, solves none of them particularly well. This is why I tend to favor small, modular solutions that can be composed. But it also highlights why there is a cultural aspect to this.

If you think the solution to your ailments is some magical product—maybe a CI/CD pipeline or Kubernetes or something else—you’re misguided. If anything, most problems are cultural, not technical in nature. Technology will not fix your broken culture! The products are not the endgame, they are a means to an end. And the products need to fit the company, its culture, its architecture, and its constraints. It’s tempting to take something you see on Hacker News and introduce it into your stack, but you have to be careful.

Likewise, it’s tempting to dive straight into the deep-end, automate everything, and build out a highly sophisticated infrastructure. But it’s important to start small and evolve over time. My approach to this is get the workflow correct, start manual, then automate more and more over time.

Wrapping Up

Specialization leads to misalignment and broken feedback loops, but it’s an important part of scaling a business. The question is: how do we specialize?

We know the traditional Ops model does not scale—devs will always out-demand capacity in this reactive model. Not only this, the siloing creates an empathy problem. DevOps attempts to help with this by tightening feedback loops and building empathy. NewOps takes this further by empowering teams and providing autonomy. It’s not a replacement for DevOps, it’s an evolution of it. It’s applying a product mindset to the traditional Ops model.

The future of Ops is taking Combined Engineering to its logical conclusion. As such, Ops teams should be redefining their vision from being masters of production to enablers of production. Just like with QA, Ops capabilities need to be embedded within dev teams, but the caveat is they need to be enabled! This is the direction Operations is headed. Software is eating the world, which means both up and down the stack. NewOps treats Ops like a product team whose product, effectively, is infrastructure. It’s creating guard rails, not walls—taking SDLC and compliance controls and encoding them into products rather than giving devs a laundry list of things, having them run the gauntlet through a long, drawn-out development process, and having a gatekeeper at the end.

Offloading responsibility helps correct and scale feedback loops. In my opinion, this is how we scale specialization. Operations isn’t going away, it’s just getting a product manager.

More Environments Will Not Make Things Easier

Microservices are hard. They require extreme discipline. They require a lot more upfront thinking. They introduce integration challenges and complexity that you otherwise wouldn’t have with a monolith, but service-oriented design is an important part of scaling organization structure. Hundreds of engineers all working on the same codebase will only lead to angst and the inability to be nimble.

This requires a pretty significant change in the way we think about things. We’re creatures of habit, so if we’re not careful, we’ll just keep on applying the same practices we used before we did services. And that will end in frustration.

How can we possibly build working software that comprises dozens of services owned by dozens of teams? Instinct tells us full-scale integration. That’s how we did things before, right? We ran integration tests. We run all of the services we depend on and develop our service against that. But it turns out, these dozen or so services I depend on also have their own dependencies! This problem is not linear.

Okay, so we can’t run everything on our laptop. Instead, let’s just have a development environment that is a facsimile of production with everything deployed. This way, teams can develop their products against real, deployed services. The trade-off is teams need to provide a high level of stability for these “development” services since other teams are relying on them for their own development. If nothing works, development is hamstrung. Personally, I think this is a pretty reasonable trade-off because if we’re disciplined enough, it shouldn’t be hard to provide stable APIs. In fact, if we’re disciplined, it should be a requirement. This is why upfront thinking is critical. Designing your APIs is the most important thing you do. Service-oriented architecture necessitates API-driven development. Literally nothing else matters but the APIs. It reminds me of the famous Jeff Bezos mandate:

  1. All teams will henceforth expose their data and functionality through service interfaces.

  2. Teams must communicate with each other through these interfaces.

  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

  4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols – doesn’t matter. Bezos doesn’t care.

  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

  6. Anyone who doesn’t do this will be fired.

  7. Thank you; have a nice day!

If we’re not disciplined, maintaining stability in a development environment becomes too difficult. So naturally, the solution becomes doubling down—we just need more environments. If every team just gets its own full-scale environment to develop against, no more stability problems. We get to develop our distributed monolith happily in our own little world. That sound you hear is every CFO collectively losing their shit, but whatever, they’re nerds and we’ve gotta get this feature to production!

Besides the obvious cost implications to this approach, perhaps the more insidious problem is it will cause teams to develop in a vacuum. In and of itself, this is not an issue, but for the undisciplined team who is not practicing rigorous API-driven development, it will create moving goalposts. A team will spend months developing its product against static dependencies only to find a massive integration headache come production time. It’s pain deferral, plain and simple. That pain isn’t being avoided or managed, you’re just neglecting to deal with instability and integration to a point where it is even more difficult. It is the opposite of the “fail-fast” mindset. It’s failing slowly and drawn out.

“We need to run everything with this particular configuration to test this, and if anyone so much as sneezes my service becomes unstable.” Good luck with that. I’ve got a dirty little secret: if you’re not disciplined, no amount of environments will make things easier. If you can’t keep your service running in an integration environment, production isn’t going to be any easier.

Similarly, massive end-to-end integration tests spanning numerous services  are an anti-pattern. Another dirty little secret: integrated tests are a scam. With a big enough system, you cannot reasonably expect to write meaningful large-scale tests in any tractable way.

What are we to do then? With respect to development, get it out of your head that you can run a facsimile of production to build features against. If you need local development, the only sane and cost-effective option is to stub. Stub everything. If you have a consistent RPC layer—discipline—this shouldn’t be too difficult. You might even be able to generate portions of stubs.

We used Google App Engine heavily at Workiva, which is a PaaS encompassing numerous services—app server, datastore, task queues, memcache, blobstore, cron, mail—all managed by Google. We were doing serverless before serverless was even a thing. App Engine provides an SDK for developing applications locally on your machine. Numerous times I overheard someone who thought the SDK was just running a facsimile of App Engine on their laptop. In reality, it was running a bunch of stubs!

If you need a full-scale deployed environment, keep in mind that stability is the cost of entry. Otherwise, you’re just delaying problems. In either case, you need stable APIs.

With respect to integration testing, the only tractable solution that doesn’t lull you into a false sense of security is consumer-driven contract testing. We run our tests against a stub, but these tests are also included in a consumer-driven contract. An API provider runs consumer-driven contract tests against its service to ensure it’s not breaking any downstream services.

All of this aside, the broader issue is ensuring a highly disciplined engineering organization. Without this, the rest becomes much more difficult as pain-driven development takes hold. Discipline is a key part of doing service-oriented design and preventing things from getting out of control as a company scales. Moving to microservices means using the right tools and processes, not just applying the old ones in a new context.

Plant Trees Before You Need the Shade

Like humans, companies go through phases. There’s the early seed and development phase. Founders are so preoccupied with a problem they go crazy. They consider solutions and the feasibility of a business. There’s the startup phase, when a business is actually born, and it stumbles towards product/market fit. There’s the growth and scaling phase, as we try to close more and more deals while, at the same time, hiring the right people. If we’re lucky, we reach the later stages. There’s the expansion phase, as we attempt to land and expand or attack new verticals or geographies. This is when things get really interesting—and hard. Who are the right people to hire? What are the right products to build? The formula that got us here almost certainly won’t get us there. Lastly, there’s maturity, which is when the business has really hit its stride. Maybe there’s an exit, and very likely there’s new leadership involved.

Consistent in all of this are two things: culture and capabilities. Culture is the invisible hand inside your organization. It’s your company’s autopilot. Specifically, culture is the unique combination of processes and values within an organization. These processes and values are what enable us to replicate our success. They allow people to make decisions which are in alignment with the goals of the company without having to constantly coordinate with one another.

This also means your culture is derived from your capabilities, what your organization can and cannot do. Clayton Christensen groups these factors into three buckets: resources, processes, and values. Resources are the (mostly) tangible things a company has—people, capital, brands, intellectual property, relationships with customers, manufacturers, distributors, and so forth. Processes are what we do with resources to accomplish the organization’s goals, such as developing products, developing employees, hiring, firing, doing market research, and allocating resources. They take in resources and produce value. Processes help us protect and scale our values by providing a means of documenting and codifying them. These are predominantly intangible things. Finally, values define how a company makes decisions. What goes to the top of the list, and what gets ignored. What gets investment, and what doesn’t. These are our priorities that guide us.

There are a few problems with how leadership tends to view capabilities. There is typically an overemphasis on resources. This happens because in the startup phase, success is largely governed by resources. Notably, people. This is especially true of software startups. I quote this section from The Innovator’s Dilemma frequently:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

In the beginning stages, people drive success. Early engineers and founders (the two are not mutually exclusive) have an itch to scratch that they are so passionate about, they evangelize this crazy grand vision and people get excited. The company is small, focused, passionate, and everyone is working closely together to solve a customer problem they are obsessed with. And, sometimes, it pays off.

Next, leadership starts asking itself, “OK, we shipped this crazy successful product, now how do we grow?” This is where the wheels start to come off. There are three problems that occur.

First, the focus shifts away from solving a problem you’re obsessed with to finding the next big product. These companies fail to find a new product because they are searching for one without passion. They are seeking top-line revenue growth, not pursuing a vision. They are looking at markets through the lens of “here’s where we can make money” and not “here’s where we can solve problems,” and in doing so, they lose sight of the customer. At this stage, they basically have forgotten what got them here.

Second, their processes get in their own way. This seems contradictory given that processes, by their very nature, are meant to facilitate repeatability. If we have good processes in place, we should be able to apply them time and time again, even with different people, and end up with consistent results. But things break down when we apply the same processes to different problems. In essence, they try to find success using what brought them success to begin with, but with a warped perspective. You can look at every startup and they will all have wildly different stories about how they found success, but the one thing they will all have in common is that relentless itch. When we attack a new market, we need to do a reset on the processes and values, not a recycle. Christensen suggests if your company is so deeply entrenched in its processes that this isn’t viable, as is often the case with large enterprises, spin it out into a new venture. Processes are as dynamic as the company itself. They are not a one-size-fits-all deal. Christensen calls this the migration of capabilities.

The factors that define an organization’s capabilities and disabilities evolve over time—they start in resources; then move to visible, articulated processes and values; and migrate finally to culture. As long as the organization continues to face the same sorts of problems that its processes and values were designed to address, managing the organization can be straightforward. But because those factors also define what an organization cannot do, they constitute disabilities when the problems facing the company change fundamentally.

Third, they don’t fully appreciate the nature of dynamic priorities. Software startups in particular often mistakenly attribute their initial success to technology when, in reality, it’s because of people and timing. Aside from the passion, you need people early on who can ship and ship often. These might not be the most technically capable engineers, but they get things done when it matters. Later, you need people who can still ship but while cleaning things up. Lastly, you need maintainers—people who can refine without breaking anything. That’s not to say you have these archetypes exclusively at each stage—you want a balance of people—but it’s similar to how companies commonly need a different type of CTO at different phases or an Interim CTO.

While resources are an essential part of early success, it’s processes and values that will sustain you. However, we tend to overfit on resources because we become biased from that success—investing heavily in technology and innovation and grounding the company’s success in a few key individuals. We also overfit because resources are more visible and measurable. Being deliberate about establishing processes and values—which are derived from vision—helps to overcome this bias, but it needs to happen early and be continually reinforced. We also need to ensure our processes adapt to new problems. The larger and more complex an organization becomes, the harder this gets.

Moreover, we need to be conscientious about which processes matter to us the most. Often the most important capabilities aren’t reflected by the most visible processes—product development or customer service, for example—but in the less visible, background processes that support decisions about where to invest resources. These might include determining how market research is done, how financial and sales projections are drawn from this analysis, how products are conceived, how planning and budgets are negotiated, and so on. These processes are where many companies get their inability to cope with change.

And this is where the breakdown happens: a company has highly capable people—people who have helped shape its success as a startup—but arms them with the wrong processes and values. The result is often a boom followed by a bunch of fizzles as they try to catch lightning in a bottle once more. A compelling vision plants a seed. Strong processes and clear values help that seed to grow. But the shade produced by that seed—our capabilities—is stationary, so when we approach a new challenge, we need to recognize when to start tilling.

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!