Multi-Cloud Is a Trap

It comes up in a lot of conversations with clients. We want to be cloud-agnostic. We need to avoid vendor lock-in. We want to be able to shift workloads seamlessly between cloud providers. Let me say it again: multi-cloud is a trap. Outside of appeasing a few major retailers who might not be too keen on stuff running in Amazon data centers, I can think of few reasons why multi-cloud should be a priority for organizations of any scale.

A multi-cloud strategy looks great on paper, but it creates unneeded constraints and results in a wild-goose chase. For most, it ends up being a distraction, creating more problems than it solves and costing more money than it’s worth. I’m going to caveat that claim in just a bit because it’s a bold blanket statement, but bear with me. For now, just know that when I say “multi-cloud,” I’m referring to the idea of running the same services across vendors or designing applications in a way that allows them to move between providers effortlessly. I’m not speaking to the notion of leveraging the best parts of each cloud provider or using higher-level, value-added services across vendors.

Multi-cloud rears its head for a number of reasons, but they can largely be grouped into the following points: disaster recovery (DR), vendor lock-in, and pricing. I’m going to speak to each of these and then discuss where multi-cloud actually does come into play.

Disaster Recovery

Multi-cloud gets pushed as a means to implement DR. When discussing DR, it’s important to have a clear understanding of how cloud providers work. Public cloud providers like AWS, GCP, and Azure have a concept of regions and availability zones (n.b. Azure only recently launched availability zones in select regions, which they’ve learned the hard way is a good idea). A region is a collection of data centers within a specific geographic area. An availability zone (AZ) is one or more data centers within a region. Each AZ is isolated with dedicated network connections and power backups, and AZs in a region are connected by low-latency links. AZs might be located in the same building (with independent compute, power, cooling, etc.) or completely separated, potentially by hundreds of miles.

Region-wide outages are highly unusual. When they happen, it’s a high-profile event since it usually means half the Internet is broken. Since AZs themselves are geographically isolated to an extent, a natural disaster taking down an entire region would basically be the equivalent of a meteorite wiping out the state of Virginia. The more common cause of region failures are misconfigurations and other operator mistakes. While rare, they do happen. However, regions are highly isolated, and providers perform maintenance on them in staggered windows to avoid multi-region failures.

That’s not to say a multi-region failure is out of the realm of possibility (any more than a meteorite wiping out half the continental United States or some bizarre cascading failure). Some backbone infrastructure services might span regions, which can lead to larger-scale incidents. But while having a presence in multiple cloud providers is obviously safer than a multi-region strategy within a single provider, there are significant costs to this. DR is an incredibly nuanced topic that I think goes underappreciated, and I think cloud portability does little to minimize those costs in practice. You don’t need to be multi-cloud to have a robust DR strategy—unless, perhaps, you’re operating at Google or Amazon scale. After all, Amazon.com is one of the world’s largest retailers, so if your DR strategy can match theirs, you’re probably in pretty good shape.

Vendor Lock-In

Vendor lock-in and the related fear, uncertainty, and doubt therein is another frequently cited reason for a multi-cloud strategy. Beau hits on this in Stop Wasting Your Beer Money:

The cloud. DevOps. Serverless. These are all movements and markets created to commoditize the common needs. They may not be the perfect solution. And yes, you may end up “locked in.” But I believe that’s a risk worth taking. It’s not as bad as it sounds. Tim O’Reilly has a quote that sums this up:

“Lock-in” comes because others depend on the benefit from your services, not because you’re completely in control.

We are locked-in because we benefit from this service. First off, this means that we’re leveraging the full value from this service. And, as a group of consumers, we have more leverage than we realize. Those providers are going to do what is necessary to continue to provide value that we benefit from. That is what drives their revenue. As O’Reilly points out, the provider actually has less control than you think. They’re going to build the system they believe benefits the largest portion of their market. They will focus on what we, a player in the market, value.

Competition is the other key piece of leverage. As strong as a provider like AWS is, there are plenty of competing cloud providers. And while competitors attempt to provide differentiated solutions to what they view as gaps in the market they also need to meet the basic needs. This is why we see so many common services across these providers. This is all for our benefit. We should take advantage of this leverage being provided to us. And yes, there will still be costs to move from one provider to another but I believe those costs are actually significantly less than the costs of going from on-premise to the cloud in the first place. Once you’re actually on the cloud you gain agility.

The mental gymnastics I see companies go through to avoid vendor lock-in and “reasons” for multi-cloud always astound me. It’s baffling the amount of money companies are willing to spend on things that do not differentiate them in any way whatsoever and, in fact, forces them to divert resources from business-differentiating things.

I think there are a couple reasons for this. First, as Beau points out, we have a tendency to overvalue our own abilities and undervalue our costs. This causes us to miscalculate the build versus buy decision. This is also closely related to the IKEA effect, in which consumers place a disproportionately high value on products they partially created. Second, as the power and influence in organizations has shifted from IT to the business—and especially with the adoption of product mindset—it strikes me as another attempt by IT operations to retain control and relevance.

Being cloud-agnostic should not be an important enough goal that it drives key decisions. If that’s your starting point, you’re severely limiting your ability to fully reap the benefits of cloud. You’re just renting compute. Platforms like Pivotal Cloud Foundry and Red Hat OpenShift tout the ability to run on every major private and public cloud, but doing so—by definition—necessitates an abstraction layer that abstracts away all the differentiating features of each cloud platform. When you abstract away the differentiating features to avoid lock-in, you also abstract away the value. You end up with vendor “lock-out,” which basically means you aren’t leveraging the full value of services. Either the abstraction reduces things to a common interface or it doesn’t. If it does, it’s unclear how it can leverage differentiated provider features and remain cloud-agnostic. If it doesn’t, it’s unclear what the value of it is or how it can be truly multi-cloud.

Not to pick on PCF or Red Hat too much, but as the major cloud providers continue to unbundle their own platforms and rebundle them in a more democratized way, the value proposition of these multi-cloud platforms begins to diminish. In the pre-Kubernetes and containers era—aka the heyday of Platform as a Service (PaaS)—there was a compelling story. Now, with the prevalence of containers, Kubernetes, and especially things like Google’s GKE and GKE On-Prem (and equivalents in other providers), that story is getting harder to tell. Interestingly, the recently announced Knative was built in close partnership with, among others, both Pivotal and Red Hat, which seems to be a play to capture some of the value from enterprise adoption of serverless computing using the momentum of Kubernetes.

But someone needs to run these multi-cloud platforms as a service, and therein lies the rub. That responsibility is usually dumped on an operations or shared-services team who now needs to run it in multiple clouds—and probably subscribe to a services contract with the vendor.

A multi-cloud deployment requires expertise for multiple cloud platforms. A PaaS might abstract that away from developers, but it’s pushed down onto operations staff. And we’re not even getting in to the security and compliance implications of certifying multiple platforms. For some companies who are just now looking to move to the cloud, this will seriously derail things. Once we get past the airy-fairy marketing speak, we really get into the hairy details of what it means to be multi-cloud.

There’s just less room today for running a PaaS that is not managed for you. It’s simply not strategic to any business. I also like to point out that revenues for companies like Pivotal and Red Hat are largely driven by services. These platforms act as a way to drive professional services revenue.

Generally speaking, the risk posed to businesses by vendor lock-in of non-strategic systems is low. For example, a database stores data. Whether it’s Amazon DynamoDB, Google Cloud Datastore, or Azure Cosmos DB—there might be technical differences like NoSQL, relational, ANSI-compliant SQL, proprietary, and so on—fundamentally, they just put data in and get data out. There may be engineering effort involved in moving between them, but it’s not insurmountable and that cost is often far outweighed by the benefits we get using them. Where vendor lock-in can become a problem is when relying on core strategic systems. These might be systems which perform actual business logic or are otherwise key enablers of a company’s business. As Joel Spolsky says, “If it’s a core business function—do it yourself, no matter what. Pick your core business competencies and goals, and do those in house.”

Pricing

Price competitiveness might be the weakest argument of all for multi-cloud. The reality is, as they commoditize more and more, all providers are in a race to the bottom when it comes to cost. Between providers, you will end up spending more in some areas and less in others. Multi-cloud price arbitrage is not a thing, it’s just something people pretend is a thing. For one, it’s wildly impractical. For another, it fails to account for volume discounts. As I mentioned in my comparison of AWS and GCP, it really comes down more to where you want to invest your resources when picking a cloud provider due to their differing philosophies.

And to Beau’s point earlier, the lock-in angle on pricing, i.e. a vendor locking you in and then driving up prices, just doesn’t make sense. First, that’s not how economies of scale work. And once you’re in the cloud, the cost of moving from one provider to another is dramatically less than when you were on-premise, so this simply would not be in providers’ best interest. They will do what’s necessary to capture the largest portion of the market and competitive forces will drive Infrastructure as a Service (IaaS) costs down. Because of the competitive environment and desire to capture market share, pricing is likely to converge.  For cloud providers to increase margins, they will need to move further up the stack toward Software as a Service (SaaS) and value-added services.

Additionally, most public cloud providers offer volume discounts. For instance, AWS offers Reserved Instances with significant discounts up to 75% for EC2. Other AWS services also have volume discounts, and Amazon uses consolidated billing to combine usage from all the accounts in an organization to give you a lower overall price when possible. GCP offers sustained use discounts, which are automatic discounts that get applied when running GCE instances for a significant portion of the billing month. They also implement what they call inferred instances, which is bin-packing partial instance usage into a single instance to prevent you from losing your discount if you replace instances. Finally, GCP likewise has an equivalent to Amazon’s Reserved Instances called committed use discounts. If resources are spread across multiple cloud providers, it becomes more difficult to qualify for many of these discounts.

Where Multi-Cloud Makes Sense

I said I would caveat my claim and here it is. Yes, multi-cloud can be—and usually is—a distraction for most organizations. If you are a company that is just now starting to look at cloud, it will serve no purpose but to divert you from what’s really important. It will slow things down and plant seeds of FUD.

Some companies try to do build-outs on multiple providers at the same time in an attempt to hedge the risk of going all in on one. I think this is counterproductive and actually increases the risk of an unsuccessful outcome. For smaller shops, pick a provider and focus efforts on productionizing it. Leverage managed services where you can, and don’t use multi-cloud as a reason not to. For larger companies, it’s not unreasonable to have build-outs on multiple providers, but it should be done through controlled experimentation. And that’s one of the benefits of cloud, we can make limited investments and experiment without big up-front expenditures—watch out for that with the multi-cloud PaaS offerings and service contracts.

But no, that doesn’t mean multi-cloud doesn’t have a place. Things are never that cut and dry. For large enterprises with multiple business units, multi-cloud is an inevitability. This can be a result of product teams at varying levels of maturity, corporate IT infrastructure, and certainly through mergers and acquisitions. The main value of multi-cloud, and I think one of the few arguments for it, is leveraging the strengths of each cloud where they make sense. This gets back to providers moving up the stack. As they attempt to differentiate with value-added services, multi-cloud starts to become a lot more meaningful. Secondarily, there might be a case for multi-cloud due to data-sovereignty reasons, but I think this is becoming less and less of a concern with the prevalence of regions and availability zones. However, some services, such as Google’s Cloud Spanner, might forgo AZ-granularity due to being “globally available” services, so this is something to be aware of when dealing with regulations like GDPR. Finally, for enterprises with colocation facilities, hybrid cloud will always be a reality, though this gets complicated when extending those out to multiple cloud providers.

If you’re just beginning to dip your toe into cloud, a multi-cloud strategy should not be at the forefront of your mind. It definitely should not be your guiding objective and something that drives core decisions or strategic items for the business. It has a time and place, but outside of that, it’s just a fool’s errand—a distraction from what’s truly important.

The Observability Pipeline

The rise of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. It has also led to new tools and services created to help us support our systems.

Many of the clients we work with at Real Kinetic are trying to navigate their way through this transformation and struggle to figure out where to begin with these solutions. Beau Lyddon, one of our partners, recently gave a talk on exactly this called What is Happening: Attempting to Understand Our Systems (as an aside, Honeycomb’s Charity Majors live-blogged the talk which is worth a read). In this post, I’m going to attempt to summarize some of the key ideas from Beau’s talk and introduce the concept of an observability pipeline, which we think is an essential component in today’s cloud-native, product-oriented world.

Observability Explosion

With traditional static deployments and monolithic architectures, monitoring is not too challenging (that’s not to say it’s easy, but, in relative terms, it’s uncomplicated). This is where tools like Nagios became very popular. When we have only a handful of servers and/or a single, monolithic application, it’s relatively straightforward to determine the health of the system and to correlate system behavior to actual customer or business impact. It’s also feasible to “see inside the box” and get meaningful code-level instrumentation. Once again, tools like AppDynamics and Dynatrace became popular here.

With cloud-native and container-based systems, instances tend to be highly elastic and ephemeral, and what used to comprise a single, monolithic application might now consist of dozens of different microservices and even different instances running different versions of the same service. Simply put, systems are more distributed, more dynamic, and more complex now than ever before—and users have even more expectations. This means many of the tools that were well-suited before might not be adequate now.

For example, the ability to “see inside the box” with intra-process, code-level tracing becomes largely impractical in a highly dynamic cloud environment. By the time you are debugging an issue, the container is gone. This is only exacerbated by the serverless or functions as a service (FaaS) movement. Similarly, it’s much more difficult to correlate the behavior of a single service to the user’s experience since partial failure becomes more of an everyday thing. Thus, many of these tools end up being better suited to static infrastructures where there is a small set of long-lived VMs with a limited number of services. That’s where most of them originated from anyway. Instead, service-level distributed tracing becomes a key part of microservice observability, as does structured logging. With this shift in how we build systems, there has been an explosion in new terms, new tools, and new services.

Of course, in addition to tools, there are also the cultural aspects of monitoring and incident response. Many companies traditionally rely on an operations team to monitor, triage, and—in some cases—even resolve issues. This model quickly becomes untenable as the number of services increases. A single operations team will not be able to maintain enough context for a non-trivial amount of services and systems to do this effectively. This model also leads to ineffective feedback loops if engineers are not on-call and responsible for the operation of their services—something I’ve talked about ad nauseum. My advice is to push ownership of systems onto the teams who built them. This includes on-call duty and general operational responsibilities. However, in order for development teams to take on this responsibility, they need to be empowered to act on it. With this model, which I’ve come to facetiously call NewOps, the operations team becomes responsible for providing the tools and data teams need to adequately operate their services. Some organizations take this even further with dedicated observability teams.

Observability” is a term that has emerged recently within the industry as a more nuanced take on traditional monitoring. While monitoring tends to focus more on the overall health of systems and business metrics, observability aims to provide more granular insights into the behavior of systems along with rich context useful for debugging and business purposes. Put another way, monitoring is about known-unknowns and actionable alerts; observability is about unknown-unknowns and empowering teams to interrogate their systems.

In a sense, observability encompasses all of the telemetry needed to gain insight into the behavior and state of a running system. This includes items like application logs, system logs, audit logs, application metrics, and distributed-tracing data. These are all valuable signals for diagnosing and debugging production issues, especially in a microservice environment where containers are largely ephemeral. In this environment, it is no longer practical to SSH into a machine to debug a problem or tail a log file. Distributed tracing becomes particularly important since a single application transaction may invoke multiple service functions.

Observability Pipeline

It’s important that you can really own your data and prevent it from being locked up inside a single vendor’s solution. Likewise, it’s important that data can be made available to the entire enterprise (or, in some cases, made not available to the entire enterprise). Since the number of tools and products can be quite large, tool and data needs vary from team to team, and the overall amount of data can be overwhelming, I suggest a decoupled approach. By building an observability pipeline, we can decouple the collection of this data from the ingestion of it into a variety of systems.

To illustrate, if we have log data going to Splunk, metrics and traces going to Datadog, client events going to Google Analytics and BigQuery, and everything going to Amazon Glacier for cold storage, the number of integrations quickly becomes large and grows for every additional service we add. It also probably means we are running an agent for many of these services on each host, and if any of these services are unavailable or behind, our application either blocks or we lose critical observability data. With the amount of data we end up collecting, it’s not uncommon to spend more time collecting it than actually performing business logic unless we find a way to efficiently get it out of the critical path.

Finally, as vendors in this space converge on features (which they are), differentiating capabilities are released (which they will need), or licensing/pricing issues arise (which they do), it’s likely that the business will need to add or remove SaaS solutions over time. If these are tightly integrated, this can be difficult to do. An observability pipeline, as we will later see, allows us to evaluate multiple solutions simultaneously or replace solutions transparently to applications and infrastructure. For example, perhaps we need to switch from Splunk to Sumo Logic or Datadog to New Relic or evaluate Honeycomb in addition to New Relic. How big of a lift would this be for your organization today? How easy is it to experiment with a new tool or service?

With an observability pipeline, we decouple the data sources from the destinations and provide a buffer. This makes the observability data easily consumable. We no longer have to figure out what data to send from containers, VMs, and infrastructure, where to send it, and how to send it. Rather, all the data is sent to the pipeline, which handles filtering it and getting it to the right places. This also gives us greater flexibility in terms of adding or removing data sinks, and it provides a buffer between data producers and consumers.

There are a few components to this pipeline which I will cover below. Many of the components can be implemented with existing open source tools or off-the-shelf services, so those I will touch on only briefly. Other parts require more involvement and some up-front thinking, so I’ll speak to them in more detail.

Data Specifications

Structured logging is hugely important to aiding debuggability. Anyone who’s shipped production code has been in the situation where they’re frantically trying to regex logs to pull out the information they need to debug a problem. It’s even worse when we’re debugging a request going through a series of microservices with haphazard logging. But structured logging isn’t just about creating better logs, it’s about creating a data pipeline that can feed the many tools you’ll need to leverage to understand, debug, and optimize complex systems, meet security and compliance requirements, and provide critical business intelligence.

In order to monitor systems, debug problems, make decisions, or automate processes, we need data. And we need the systems to give us data to provide necessary context. Aside from structured logging, one piece of advice we give every client is to pass a context object to basically everything. This context includes all of the important metadata flowing through a system—usually IDs that allow you to correlate events and piece together a story of what’s happening inside your system: user ID, account ID, trace ID, request ID, parent ID, and so on. What we want to avoid is the sort of murder-mystery debugging that often happens. A lone error log is the equivalent of finding a body. We know a crime occurred, but how do we piece together the clues to tell the right story? Observability—that is, being able to ask questions of your systems and truly explore them—requires access to pre-aggregate, raw data and support for high-cardinality dimensions.

The way to decide what goes on the context is to think about the data you wish you had while debugging an issue (this also highlights the importance of developers supporting their own systems). What is the data that would change the behavior of the system? Some examples include the user (or company), their license, time, machine stats (e.g. CPU and memory), software version, configuration data, the incoming request, downstream requests, etc. Of these, what can we get for “free” and what do we need to pass along? “Free” in this case would be things which are machine-provided, such as memory and CPU. The data we can’t get for free should go on the context, typically data that is request-specific. This context should be included on every log message.

This brings us back to the importance of structuring your data. To do this, I encourage creating standard specifications for each data type collected—logs, metrics, traces, events, etc. You can take this as far as you’d like—highly structured with a type system and rigid specification—but at a minimum, get logs into a standard format with property tags. JSON is fine for the actual structure, but be sure to version the spec so that it can evolve. For application events, one pattern that can work well is to create an inheritance structure with a base spec that applies across services (e.g. user context and tracing information are the same) and specialized specs that can be defined by services if needed. Just be careful not to leak sensitive data here—this is one area where code reviews are vital.

Specification Libraries

A key part of empowering developers is providing tools that align the “easy” path with the “right” path. If these aren’t aligned, pain-driven development creates problems. In order for developers to take advantage of structured data, specifications aren’t enough. We need libraries which implement the specs and make it easy for engineers to actually instrument their systems. For logging, there are many existing libraries. Just Google “structured logs” and your language of choice. For tracing and metrics, there are APIs like OpenTracing and OpenCensus. In practice, implementing the spec might be a combination of libraries and transformations made by the data collector described below.

Data Collector

This component is responsible for collecting data from hosts, containers, or other sources and writing it to the data pipeline. It may also perform transformations or filtering of data. A couple popular open source solutions for this are Fluentd and Logstash. Typically this runs as a sidecar or agent on the host, and data is written to stdout/stderr or a Unix domain socket, which it then pushes to the pipeline.

Data Pipeline

This component is a highly scalable data stream which can handle the firehose of observability data being generated and has high availability. This also provides a buffer for the data and decouples producers from consumers. Off-the-shelf solutions include Apache Kafka, Google Cloud Pub/Sub, Amazon Kinesis Data Streams, and Liftbridge.

Data Router

This component consumes data from the pipeline, performs filtering, and writes it to the appropriate backends. It may perform some transformations and processing of the data as well, but generally any heavy processing should be the responsibility of a backend system (e.g. alerting or aggregations). This is where the data specifications come into play. The data type will determine how routers handle incoming data, e.g. routing log data to Splunk and cold storage, routing traces to Google Stackdriver, and routing metrics and APM data to New Relic.

Like the specifications and libraries, this is a component that requires some more involvement. The downside of moving away from agent-based data collection is we now have to handle routing that data ourselves. The upside is most vendors provide good APIs and client libraries which make this easier.

Since this is typically a stateless service, it’s a good fit for “serverless” solutions like Google Cloud Functions or AWS Lambda.

Piecing It All Together

Putting all of these pieces together, the observability pipeline looks something like the following:

One caveat I want to point out is that this is not something you need to build out from day one. At most of the companies where we’ve implemented this, it was something that evolved over time. For instance, with some of the clients we work with who are attempting to move to the cloud and adopt DevOps practices, we typically would not advise making a significant upfront investment to architect this pipeline. This is an ideal goal to work towards that will become increasingly important as the amount of services, traffic, and data scales. Instead, architect your systems from the beginning to be able to adopt this approach more easily—use structured logging, keep collection out-of-process, and use a centralized logging system.

For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. Unlocking this data can also be a huge win for the business. It provides a layer of abstraction that allows you to get the data everywhere it needs to be without impacting developers and the core system. Lastly, it allows you to change backing data systems easily or test multiple in parallel. With the amount of data and the number of tools modern systems demand these days, the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.

Scaling DevOps and the Revival of Operations

Operations is going through a renaissance right now. With the move to cloud, the increasing amount of automation, and the increasing importance of automation, Ops as we know it is reinventing itself out of necessity. Infrastructure is becoming more and more sophisticated—and commoditized—and practices are just now starting to grow up around that. So while some worry about robots taking our jobs, the reality is more about how automation will help augment us to build better software and focus on higher-value things. It’s not so much about the distant future—whatever that may hold—so much as it is about the next five to ten years, what Operations looks like in that timeframe, and why I think it has to retool.

When we think about traditional Operations, we probably think about hardware and servers, managing networks and databases, application servers and runtimes, disaster recovery, Nagios checks, as well as the business side—vendor management, procurement, and so on. Finally, we have applications built on top by development teams.

We have a nice, clean separation—developers focus on building features and products, and Ops focuses on making sure the lights stay on. Of course, we know the reality is this separation also creates a lot of problems, so DevOps was borne out of this as a way to bring these two groups into alignment by improving communication and feedback loops.

Now, with the move to cloud, many of these traditional Ops functions are effectively being outsourced to cloud providers, i.e. the idea of NoOps. We get unprecedented elasticity and on-demand compute with far less overhead than we ever had before—shrinking procurement time from days or weeks to seconds or minutes.

What this leaves is a thin but important slice between Google or Amazon and those products built by developers—the glue, essentially, between cloud and product. I call this NewOps (which I use facetiously in reference to NoSQL/NewSQL), and it’s the future of Ops. This encompasses infrastructure automation, deployment automation, configuration management, logging, monitoring, and many other things. When Marc Andreessen said software is eating the world, he really meant it. The future of Ops—and many other things—is software. It’s killing the boring, repetitive things we really don’t want to be doing anyway and letting us shift our focus elsewhere.

Certainly, automation is nothing new and is, I think, an important part of DevOps, so I’m going to explain what I mean by NewOps and why I’m distinguishing it. I also don’t want to mischaracterize by having these neatly delineated Ops models. The truth is, your company doesn’t just one day graduate and gets its DevOps diploma. Instead, it might evolve through various manifestations of these different models. DevOps is a journey, not a destination in and of itself.

I like to think of a DevOps scale of automation, from manual provisioning all the way to fully self-service. Next, I add a second dimension, org size, from the smallest startups to the biggest enterprises.

Scaling DevOps

Scaling a business is probably one of the hardest things a company has to go through. In particular, dealing with the problem of silos. They happen at every company as it grows, but why is it that silos form in the first place?

Many companies start with a “DevOps” approach, often out of necessity more than anything. As a small startup, we can’t afford to have dedicated developers, QA, Ops, and security people. We just have people, and those people wear many different hats. Developers might be pushing their own code to production. They might even be managing the infrastructure that code runs on. There’s probably not a lot of stability, probably a lot of risk, and probably not a whole lot of thought towards controlling costs.

But as the product scales, we specialize. And as the business scales, we add various safety checks, controls, and processes. Developers write code, Ops people run it, QA gets blamed for defects, security blocks everything, and management wonders why nothing gets shipped.

And so we end up in the top left-hand quadrant with Ops as gatekeepers. Ops is fighting for stability and, at the same time, devs are basically fighting for change. More or less, we have a stable, cost-controlled, risk-averse environment—hopefully. But we also have a significant delivery and innovation bottleneck.

Specialization is good! But misalignment is not good. The question is, then, how do we scale specialization? Cross-functional teams come to mind. After all, DevOps encourages cooperation! We add an Ops engineer to each team, and maybe a reliability engineer, and perhaps a few extra for on-call backup, and of course a QA engineer too. Problem solved, right?

But hold on. What if we have 40 development teams? And all those teams are doing microservices. And, of course, all of those microservices are special snowflakes each with their own stacks, infrastructure, databases, and so on. This quickly gets out of control, but moreover, that’s a lot of teams and specialized roles on those teams. That’s a lot of headcount which equates to a lot of hiring and a lot of time and money. If you’re Google and you can just throw money at the problem, this might work out okay. For the rest of us, it might not be such a realistic option.

We go back to the drawing board and again ask ourselves how do we scale specialization? My thought to how we do this is with vision and product.

A vision is simply a mental image of what the future could be like. It enables independent decision making and alignment. Vision allows all of those teams, and the people on those teams, to make decisions without having to constantly coordinate with each other. Without vision, you’re just iterating to nowhere fast.

But vision without execution is just hallucination. Products are how we scale execution. Specifically, this idea of Operations through the lens of product, which I’ll describe after showing the parallel with what’s happening in QA.

In a lot of engineering organizations, many QA roles have been quietly disappearing. I think what’s happening is this evolution of QA, particularly, this shift from being test-focused to tools-focused.

We can look at companies like Amazon and Microsoft who popularized the SDET (Software Development Engineer in Test) model. These companies recognized that having a separate QA and development group causes a lot of problems, just like how having a separate Ops group does. We end up with SDEs (Software Development Engineers) who still focus on the development aspects of building software and SDETs who focus on the quality aspects, but rather than having two wholly separate groups, we just have development teams with SDETs embedded in them.

More recently, Microsoft moved to what they call a “Combined Engineering” model—effectively combining the SDE and SDET roles into a single role called a Software Engineer. Software Engineers write the product code, test code, and tools code needed to deliver their service. They are responsible for everything. Quality is a core concern of software development anyway.

Software Engineers write the code, unit tests, and integration tests. Those tests run in CI. The code moves through a CD pipeline before finally going out to production in some fashion. QA teams are shrinking, but what’s growing are the teams building the tools—the CI environments, the CD pipelines, the automated testing frameworks, the production tooling and automation, etc. The same is becoming true of Ops.

This is what I mean by “Operations through the lens of product.” The build, release, deploy automation, configuration management, infrastructure automation, logging, monitoring—these are all products.

Constraints often make problems easier. At Workiva, as we were struggling through that scaling phase, we placed a constraint on ourselves. We capped our infrastructure engineering headcount at 15% of R&D. This forced us to solve the problem using technology, and technical problems tend to be easier than people problems. In effect, this required us to productize our infrastructure. In doing so, we scaled. We controlled costs. We kept our headcount in check. We reduced risk. We accelerated development. Ultimately, we delivered value to customers faster, going from about three to four releases per year to multiple releases per day. In the end, this is really the goal of DevOps—to deliver value to customers continuously and to do it rapidly and reliably.

Rethinking Ops

It’s time we start to rethink Operations because clearly this model of Ops as cluster or infrastructure admins does not scale. Developers will always out-demand their capacity to supply. Either your headcount is out of control or your ability to innovate and deliver is severely hamstrung. Operations becomes this interrupt-driven thing where we’re just fighting fires as they happen. Ops as masters of production usually devolves to Ops becoming human incident routers, trying to figure out what team or person can help resolve problems because, being responsible for everything, they don’t have the insight to fix it themselves.

Another path that many companies take is Platform as a Service. Workiva is an example of this. For a very long time, Workiva didn’t have a traditional Ops team because the Ops team was Google. The first product was built on Google App Engine. This helped immensely to deliver value to customers quickly. We could just focus on the product and not the surrounding operational aspects, but there is a very real innovation bottleneck that comes with this.

The idea of “Ops lock-in” can be a major problem, whether it’s a PaaS like App Engine locking you in or your own Ops team who just isn’t able to support the kind of innovation that you’re trying to do.

My vision for the future of Operations is taking Combined Engineering to its logical conclusion. Just like with QA, Ops capabilities should be embedded within development teams. The reality is you can’t be an effective software engineer today without some Ops skills, and I think every role should be working towards automating itself out of a job. Specifically, my vision is enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services.

The knee-jerk reaction to this idea is usually fully embracing Infrastructure as a Service, infrastructure as code, and giving developers freedom—and usually the consequences are dire. The point here is that the pendulum can swing too far in the other direction. This was a problem for a brief period of time at Workiva. As we were building new products off of App Engine, developers had this newfound freedom, so teams all went different directions introducing new tech, new infrastructure, new services, and so forth. It was a free-for-all, an explosion of stuff, and the cost explosion that comes with it.

There has to be some control around that, so we tweak the vision statement a bit: enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services…with minimal Ops intervention. We have to have some checks and balances in place.

With this, Ops become force multipliers. We move away from the reactive, interrupt-driven model where Ops are masters of production responsible for everything. Instead, we make dev teams responsible for their services but provide the tools they need to actually own their systems end-to-end—from the code on their laptops to operating it in production.

Enabling developers to self-service through tooling and automation means treating Ops as a product team. The infrastructure automation, deployment automation, configuration management, logging, monitoring, and production tools—these are all products. It’s these products that allow teams to fully own their services. This leads to empowerment.

I have this theory that all engineering organizations operate in this fashion which I call pain-driven development. As a company grows, it starts to develop limbs—teams or silos. Each of these limbs has its own pain receptors. Teams operate in a way that minimizes the amount of pain that they feel, it’s human instinct. We make locally optimal decisions to minimize pain and end up following a path of least resistance.

Silos promote pain displacement, which results in a “bulkhead” effect. Product development feels the pain of building software, QA feels the pain of testing software, and Ops feels the pain of running software. This creates broken feedback loops. For instance, developers aren’t feeling the pain Ops is feeling trying to run their software. We just throw things over the wall and it becomes an empathy problem.

This leads to misaligned incentives because each team will optimize for the pain that they feel. How do you expect developers to care about quality if they’re not on the hook? Similarly, how do you expect them to care about operability if they’re not on the hook? Developers won’t build truly reliable software until they are on-call for it and directly responsible. However, responsibility requires empowerment. You can’t have one without the other. You can’t ask someone to care about something and fix it without also giving them the power to do so. Most Ops teams simply haven’t done enough to empower and offload responsibility onto development teams.

Products enable ownership. We move away from Ops as masters of production responsible for everything and push that responsibility onto dev teams. They are the experts for their services. They are best equipped to deal with problems that arise. But we provide the tools they need to diagnose and resolve those problems on their own.

Products maintain control through enablement—enabling teams to follow best practices for builds, testing, deploys, support, and compliance. Compliance and other SDLC requirements have to be encoded into the tools and processes. These are things developers won’t empathize with or simply won’t understand. Rather than giving them a long list of things they have to do, we take as many of those things as we can and bake them into our products. If you use these tools or follow these processes, you’ll get a lot of this stuff for free. This reduces risk and accelerates development.

Similarly, we can’t allow all of the special snowflakes to happen. We have to control that explosion of stuff. To do this, we use pain-driven development to our advantage by creating paths of least resistance. Using standardized patterns, application shapes, and infrastructure services, we can setup “paths” to both make it easier to reach production and meet the goals of the business. As a developer, if you follow this path, your life will be a lot easier and you’ll feel less pain. If you deviate from that path, things get much harder—and painful.

We end up with a set “menu” of standard application shapes and infrastructure. If teams want to deviate and go off-menu, it’s on them to make a case for it. For example, if I want to introduce Erlang into our stack, it’s on my team and me to present the case for that. Part of this might mean we help build and maintain the tools needed to support that. If there is a compelling enough case or enough teams are making similar asks, we can start to standardize new shapes.

Note that we aren’t necessarily mandating technologies, but we’re leveraging pain-driven development to work in our favor.

Products in Practice

Next, I’m going to look at this idea of Operations through the lens of product in a bit more detail. We’ll see what this might actually look like in practice, again using Workiva as a bit of a case study.

Below is the high-level flow that I think about, from code on laptop to code in production.

Starting with the Build and continuous integration stage, this workflow tends to look something like the following. A developer pushes a change to a branch in a code repository, e.g. GitHub. This triggers a few things to happen. First, the build process, which runs unit/integration tests and builds artifacts. This, in turn, might trigger a QA and/or compliance process. At the same time, we have code reviews happening. All of these processes provide feedback to the developer to quickly iterate.

Workiva has a lot of automated processes built into the developer workflow, some off-the-shelf and some built in-house. For example, when a PR is opened, a security scanner runs which does static analysis and looks for various security vulnerabilities. This can flag a security review when a closer look is needed. Likewise, there is code coverage, automated builds, unit tests, and integration tests, Docker image builds, and compliance checks. The screenshots below come from an open-source repo showing some of these products in practice.

For compliance reasons, Workiva requires at least one other person sign-off on code changes. GitHub provides pretty good support for this. Code reviewers provide their feedback, developers work through that feedback, and, once satisfied, reviewers give their “plus one.”

The screenshot below shows some of the automated processes Workiva relies on in the developer workflow: Travis CI, Codecov, Smithy (which is Workiva’s internal build system), Skynet (automated testing), Rosie (automated compliance controls, e.g. do you have code reviews, security reviews, other SDLC compliance requirements?), and Aviary (the security scanner). Once all of these have passed, the PR is automatically labeled with “Merge Requirements Met” and the change can be merged into master.

There are a couple things worth pointing out with this workflow. First, the build plan is part of the code and not baked into some build tool. This allows dev teams to fully control their builds. Second, you noticed that Workiva has very deep integration with GitHub. This has allowed them to build automated controls into the development process, which speeds up the developer’s workflow while reducing risk.

Next, we move on to the Release stage. This flow looks something like the following:

The developer tags a branch for release, which triggers a build process for creating the artifact. This may have a QA process which then promotes the artifact to a development artifact repository. As you may have noticed, Workiva has a lot of compliance requirements since they deal with companies’ pre-financial data, so there is typically a sign-off process at various stages involving different parties like Release Management, QA, Security, etc. Depending on your compliance controls, this might just be clicking a button to promote an artifact to a production repository. From there, it can actually be deployed to a production environment.

With this workflow, artifact tagging, building, and promotion is all automated. It’s also important we have processes around security. Container and machine image auditing is automated as well as security patching for OS updates, etc. For example, this workflow might use something like Packer to automate AMI building. Finally, the artifact sign-off is streamlined for the various parties involved, if not fully automated.

Now we’re ready to actually deploy our application. This is a key part of self-service and “owning” a product. This allows a team to configure their application and, ideally, deploy it themselves to production. Initially, this might be handled by a Release Management team who actually clicks the deploy button, but as you become more confident in your processes and your tools become more mature, more of this responsibility can be pushed onto the development teams.

This is also where control comes into play. For instance, I may be allowed to configure my application to use 1GB of RAM, but if I need 1TB, I may need to get additional sign-off.

Self-service deploys and self-service configuration—with guard rails—are an important part of continuous deployment. Additionally, infrastructure provisioning should be automated. No more submitting tickets for a nameless Ops person to provision and configure servers, VMs, or other resources—no ticket-driven development.

I’ve been deliberate about not prescribing particular solutions for some of these problems. You might be using Kubernetes or ECS to orchestrate containers, it doesn’t really matter. These should mostly be implementation details. What does matter, though, is having good abstractions around certain implementation details. For example, Workiva was meticulous about building some layers around workload scheduling. This allowed them at one point to switch from using Fleet to ECS to manage containers with virtually no impact to developers. With the amount of churn that happens in tech, it’s important not to tie yourself too heavily to any one implementation. Instead, think about the APIs you expose for your infrastructure and consider those the deliverable.

Finally, we need to operate our service in production, another important part of ownership. There are a lot of products here, so we’ll just look at a cross section.

Logging is arguably the most important part of how we figure out what is happening in our systems. For this reason, Workiva built structured logging and metrics specs and language libraries implementing these specs. As a developer, this made it easy to simply pull in the library for your language and get structured, contextual logging for free. The other half to this was building out a data pipeline. Basically all metadata at Workiva went into Amazon Kinesis, including logs, metrics, and traces. First, this allowed us to reuse the same infrastructure for all of this data, from the agents running on the machines to the pipeline itself. Second, it allowed us to fan this data out to different backend systems—Splunk, SumoLogic, Datadog, Stackdriver, BigQuery, as well as various internal tools. This is probably one of the most important things you can do with your infrastructure.

Other continuous operations tools include telemetry, tracing, health checks, alerting, and more sophisticated production tools like canary deploys, A/B testing, and traffic shadowing. Some might refer to these as tools for testing in production. Realistically, once you reach a certain scale, testing in production is the only real alternative to the proliferation of deployment environments.

It’s worth mentioning that you do not need to build all of these products yourself. In fact, you shouldn’t. Many off-the-shelf solutions just need glued together. However, I’ve also come to realize that it’s often the “glue” that is important. That is to say, taking some large, commercial off-the-shelf solution and introducing it into a company is frequently rife with headaches. It’s like Jira, a big Frankenstein product that attempts to solve everyone’s problems and, in doing so, solves none of them particularly well. This is why I tend to favor small, modular solutions that can be composed. But it also highlights why there is a cultural aspect to this.

If you think the solution to your ailments is some magical product—maybe a CI/CD pipeline or Kubernetes or something else—you’re misguided. If anything, most problems are cultural, not technical in nature. Technology will not fix your broken culture! The products are not the endgame, they are a means to an end. And the products need to fit the company, its culture, its architecture, and its constraints. It’s tempting to take something you see on Hacker News and introduce it into your stack, but you have to be careful.

Likewise, it’s tempting to dive straight into the deep-end, automate everything, and build out a highly sophisticated infrastructure. But it’s important to start small and evolve over time. My approach to this is get the workflow correct, start manual, then automate more and more over time.

Wrapping Up

Specialization leads to misalignment and broken feedback loops, but it’s an important part of scaling a business. The question is: how do we specialize?

We know the traditional Ops model does not scale—devs will always out-demand capacity in this reactive model. Not only this, the siloing creates an empathy problem. DevOps attempts to help with this by tightening feedback loops and building empathy. NewOps takes this further by empowering teams and providing autonomy. It’s not a replacement for DevOps, it’s an evolution of it. It’s applying a product mindset to the traditional Ops model.

The future of Ops is taking Combined Engineering to its logical conclusion. As such, Ops teams should be redefining their vision from being masters of production to enablers of production. Just like with QA, Ops capabilities need to be embedded within dev teams, but the caveat is they need to be enabled! This is the direction Operations is headed. Software is eating the world, which means both up and down the stack. NewOps treats Ops like a product team whose product, effectively, is infrastructure. It’s creating guard rails, not walls—taking SDLC and compliance controls and encoding them into products rather than giving devs a laundry list of things, having them run the gauntlet through a long, drawn-out development process, and having a gatekeeper at the end.

Offloading responsibility helps correct and scale feedback loops. In my opinion, this is how we scale specialization. Operations isn’t going away, it’s just getting a product manager.

More Environments Will Not Make Things Easier

Microservices are hard. They require extreme discipline. They require a lot more upfront thinking. They introduce integration challenges and complexity that you otherwise wouldn’t have with a monolith, but service-oriented design is an important part of scaling organization structure. Hundreds of engineers all working on the same codebase will only lead to angst and the inability to be nimble.

This requires a pretty significant change in the way we think about things. We’re creatures of habit, so if we’re not careful, we’ll just keep on applying the same practices we used before we did services. And that will end in frustration.

How can we possibly build working software that comprises dozens of services owned by dozens of teams? Instinct tells us full-scale integration. That’s how we did things before, right? We ran integration tests. We run all of the services we depend on and develop our service against that. But it turns out, these dozen or so services I depend on also have their own dependencies! This problem is not linear.

Okay, so we can’t run everything on our laptop. Instead, let’s just have a development environment that is a facsimile of production with everything deployed. This way, teams can develop their products against real, deployed services. The trade-off is teams need to provide a high level of stability for these “development” services since other teams are relying on them for their own development. If nothing works, development is hamstrung. Personally, I think this is a pretty reasonable trade-off because if we’re disciplined enough, it shouldn’t be hard to provide stable APIs. In fact, if we’re disciplined, it should be a requirement. This is why upfront thinking is critical. Designing your APIs is the most important thing you do. Service-oriented architecture necessitates API-driven development. Literally nothing else matters but the APIs. It reminds me of the famous Jeff Bezos mandate:

  1. All teams will henceforth expose their data and functionality through service interfaces.

  2. Teams must communicate with each other through these interfaces.

  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

  4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols – doesn’t matter. Bezos doesn’t care.

  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

  6. Anyone who doesn’t do this will be fired.

  7. Thank you; have a nice day!

If we’re not disciplined, maintaining stability in a development environment becomes too difficult. So naturally, the solution becomes doubling down—we just need more environments. If every team just gets its own full-scale environment to develop against, no more stability problems. We get to develop our distributed monolith happily in our own little world. That sound you hear is every CFO collectively losing their shit, but whatever, they’re nerds and we’ve gotta get this feature to production!

Besides the obvious cost implications to this approach, perhaps the more insidious problem is it will cause teams to develop in a vacuum. In and of itself, this is not an issue, but for the undisciplined team who is not practicing rigorous API-driven development, it will create moving goalposts. A team will spend months developing its product against static dependencies only to find a massive integration headache come production time. It’s pain deferral, plain and simple. That pain isn’t being avoided or managed, you’re just neglecting to deal with instability and integration to a point where it is even more difficult. It is the opposite of the “fail-fast” mindset. It’s failing slowly and drawn out.

“We need to run everything with this particular configuration to test this, and if anyone so much as sneezes my service becomes unstable.” Good luck with that. I’ve got a dirty little secret: if you’re not disciplined, no amount of environments will make things easier. If you can’t keep your service running in an integration environment, production isn’t going to be any easier.

Similarly, massive end-to-end integration tests spanning numerous services  are an anti-pattern. Another dirty little secret: integrated tests are a scam. With a big enough system, you cannot reasonably expect to write meaningful large-scale tests in any tractable way.

What are we to do then? With respect to development, get it out of your head that you can run a facsimile of production to build features against. If you need local development, the only sane and cost-effective option is to stub. Stub everything. If you have a consistent RPC layer—discipline—this shouldn’t be too difficult. You might even be able to generate portions of stubs.

We used Google App Engine heavily at Workiva, which is a PaaS encompassing numerous services—app server, datastore, task queues, memcache, blobstore, cron, mail—all managed by Google. We were doing serverless before serverless was even a thing. App Engine provides an SDK for developing applications locally on your machine. Numerous times I overheard someone who thought the SDK was just running a facsimile of App Engine on their laptop. In reality, it was running a bunch of stubs!

If you need a full-scale deployed environment, keep in mind that stability is the cost of entry. Otherwise, you’re just delaying problems. In either case, you need stable APIs.

With respect to integration testing, the only tractable solution that doesn’t lull you into a false sense of security is consumer-driven contract testing. We run our tests against a stub, but these tests are also included in a consumer-driven contract. An API provider runs consumer-driven contract tests against its service to ensure it’s not breaking any downstream services.

All of this aside, the broader issue is ensuring a highly disciplined engineering organization. Without this, the rest becomes much more difficult as pain-driven development takes hold. Discipline is a key part of doing service-oriented design and preventing things from getting out of control as a company scales. Moving to microservices means using the right tools and processes, not just applying the old ones in a new context.

There and Back Again: Why PaaS Is Passé (And Why It’s Not)

In 10 years nobody will be talking about Kubernetes. Not because people stopped using it or because it fell out of favor, but because it became utility. Containers, Kubernetes, service meshes—they’ll all be there, the same way VMs, hypervisors, and switches will be. Compute is a commodity, and I don’t care how my workload runs so long as it meets my business’s SLOs and other requirements. Within AWS alone, there are now innumerable ways to run a compute workload.

This was the promise of Platform as a Service (PaaS): provide a pre-built runtime where you simply plug in your application and the rest—compute, networking, storage—is handled for you. Heroku (2007), Google App Engine (2008), OpenShift (2011), and Cloud Foundry (2011) all come to mind. But PaaS has, in many ways, become a sort of taboo in recent years. As a consultant working with companies either in the cloud or looking to move to the cloud, I’ve found PaaS to almost be a trigger word; the wince from clients upon its utterance is almost palpable. It’s hard to pin down exactly why this is the case, but I think there are a number of reasons which range from entirely legit to outright FUD.

There is often a funny cognitive dissonance with these companies who recoil at the mention of PaaS. After unequivocally rejecting the idea for reasons like vendor lock-in and runtime restrictions (again, some of these are legitimate concerns), they will describe, in piecemeal fashion, their own half-baked idea of a PaaS. “Well, we’ll use Kubernetes to handle compute, ELK stack for logging, Prometheus for metrics, OpenTracing for distributed tracing, Redis for caching…”, and so the list goes on. Not to mention there tends to be a bias on build over buy. And we need to somehow provide all of these things as a self-service platform to developers.

While there are ongoing efforts to democratize the cloud and provide reference architectures of sorts, the fact is there are no standards and the proliferation of tools and technologies continues to expand at a rapid pace. On the other hand, as certain tools emerge, such as Kubernetes, the patterns and practices around them have naturally lagged behind. The serverless movement bears this out further. Serverless is the microservice equivalent for PaaS but with a lot less tooling and operations maturity. This is an exciting time, but the cloud has become—without a doubt—an unnavigable wasteland. Even with all the things at your disposal today, it’s still a ton of work to build and operate what is essentially your own PaaS.

But technology is cyclical and the cloud is no different. This evolution, in some sense, parallels what happened with the NoSQL movement. Eric Brewer discusses this in his RICON 2012 talk. When you cut through the hype, NoSQL was about giving developers more control at the expense of less pre-packaged functionality, but it was not intended to be the end game or an alternative to SQL. It’s about two different, equally valid world views: top-down and bottom-up. The top-down view is looking at a model and its semantics and then figuring out what you need to do to implement it. With a relational database, this is using SQL to declaratively construct our model. The bottom-up view is about the layering of primitive components into something more complex. For example, modern databases like CockroachDB present a SQL abstraction on top of a transactional layer on top of a replication layer on top of a simple key-value-store layer. NoSQL gives us a reusable storage component with a lot of flexibility and, over time, as we add more and more pieces on top, we get something that looks more like a database. We start with low-level layers, but the end goal is still the same: nice, user-friendly semantics. I would argue the same thing is happening with PaaS.

What the major cloud providers are doing is unbundling the PaaS. We have our compute, our cluster scheduler, our databases and caches, our message queues, and other components. What’s missing is the glue—the standards and tools that tie these things together into a coherent, manageable unit—a PaaS. Everything old is new again. What we will see is the rebundling of these components gradually happen over time as those standards and tools emerge. Tools like AWS Fargate and Google App Engine Flexible Environment are a step in that direction (Google really screwed up by calling it App Engine Flex because of all the PaaS baggage associated with the App Engine name). The container is just the interface. However, that’s only the start.

PaaS and serverless are great because they truly accelerate application development and reduce operations overhead. However, the trade-off is: we become constrained. For example, with App Engine, we were initially constrained to certain Google Cloud APIs, such as Cloud Datastore and Task Queues, and specific language runtimes. Over time, this has improved, notably, with Cloud SQL, and now today we can use custom runtimes. Similarly, PaaS gives us service autoscaling, high availability, and critical security patches for free, but we lose a degree of control over compute characteristics and workload-processing patterns.

In a sense, what a PaaS offers is an opinionated framework for running applications. Opinionated is good if you want to be productive, but it’s limiting once you have a mature product. What we want are the benefits of PaaS with a bit more flexibility. A PaaS provides us a top-down template from which we can start, but we want to be able to tweak that to our needs. Kubernetes is a key part of that template, but it’s ultimately just a means to an end.

This is why I think no one will be talking about Kubernetes in 10 years. Hopefully by then it’s just not that interesting. If it still is, we’re not done yet.