How Konfig provides an enterprise platform with GitLab and Google Cloud

In a previous post, I explained the fundamental competing priorities that companies have when building software: security and governance, maintainability, and speed to production. These three concerns are all in constant tension with each other. For companies either migrating to the cloud or beginning a modernization effort, addressing them can be a major challenge. When you’re unfamiliar with the cloud, building systems that are both secure and maintainable is difficult because you’re not in a position to make decisions that have long-lasting and significant impact—you just don’t know what you don’t know. One small misstep can result in a major security incident. A bad decision can take years to manifest as a problem. As a result, these migration and modernization efforts often stall out as analysis paralysis takes hold.

This is where Real Kinetic usually steps in: to get a stuck project moving again, to provide guard rails, and to help companies avoid the hidden landmines by offering our expertise and experience. We’ve been there before, so we help navigate our clients through the foundational decision making, design, and execution of large-scale cloud migrations. We’ve helped migrate systems generating billions of dollars in revenue and hundreds of millions in cloud spend. We’ve also helped customers save tens of millions in cloud spend by guiding them through more cost-effective solution architectures. And while we’ve had a lot of success helping our clients operationalize the cloud, they still routinely ask us: why is it so damn difficult? The truth is it doesn’t have to be if you’re willing to take just a slightly more opinionated stance.

Recently, we introduced Konfig, our solution for this exact problem. Konfig packages up our expertise and years of experience operationalizing and building software in the cloud. More concretely, it’s an enterprise integration of GitLab and Google Cloud that addresses those three competing priorities I mentioned earlier. The reason it’s so difficult for organizations to operationalize GitLab and GCP is that they are robust and flexible platforms that address a broad set of customer needs. As a result, they do not take an opinionated stance on pretty much anything. This leaves a gap unaddressed, and customers are left having to put together their own opinionation that meets their needs—except they usually aren’t in a position to do this. Thus, they stall.

Konfig gives you a functioning, enterprise-ready GitLab and GCP environment that is secure by default, has strong governance and best practices built-in, and scales with your organization. The best part? You can start deploying production workloads in a matter of minutes. It does this by taking an opinionated stance on some things. It bridges the gap that is unaddressed by Google and GitLab. Those opinions are the recommendations, guidance, and best practices we share with clients when they are operationalizing the cloud.

Perhaps the most obvious opinion is that Konfig is specific to GCP and GitLab. We could extend this model to other platforms like AWS and GitHub, but we chose to focus on building a white-glove experience with GCP and GitLab first because they work together so well. GCP has first-class managed services and serverless offerings which lend themselves to providing a platform that is secure, maintainable, and has a great developer experience. GitLab’s CI/CD is better designed than GitHub Actions and its hierarchical structure maps well to GCP’s resource hierarchy.

Moreover, Konfig embraces service-oriented architecture and domain-driven design, which drive how we structure folders and projects in GCP and groups in GitLab. This structure gives us a powerful way to map access management and governance, which we’ll explore later. It’s a best practice that makes systems more maintainable and evolvable. We’ll discuss Konfig’s opinions and their rationale in more depth in a future post. For now, I want to explain how Konfig provides an enterprise platform by addressing each of the three concerns in the software development triangle: security and governance, maintainable infrastructure, and speed to production.

Security and Governance

Access Management

Konfig relies on a hierarchy consisting of control plane > platforms > domains > workloads. The control plane is the top-level container which is responsible for managing all of the resources contained within it. Platforms are used to group different lines of business, product lines, or other organizational units. Domains are a way to group related workloads or services.

The Konfig hierarchy

This structure provides several benefits. First, we can map it to hierarchies in both GitLab and GCP, shown in the image below. A platform maps to a group in GitLab and a folder in GCP. A domain maps to a subgroup in GitLab and a nested folder along with a project per environment in GCP.

Konfig synchronizes structure and permissions between GitLab and GCP

This hierarchy lets us manage permissions cleanly because we can assign access at the control plane, platform, and domain levels. These permissions are synced to GitLab in the form of SAML group links and to GCP in the form of IAM roles. When a user has “dev” access, they get the Developer role for the respective group in GitLab. In GCP, they get the Editor role for dev environment projects and Viewer for higher environments. “Maintainer” has slightly more elevated access, and “owner” effectively provides root access to allow for a “break-glass” scenario. The hierarchy means these permissions can be inherited by setting them at different levels. This access management is shown in the groups sections of the platform.yaml and domain.yaml examples below.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  groups:
    dev: [ecommerce-devs@example.com]
    maintainer: [ecommerce-maintainers@example.com]
    owner: [ecommerce-owners@example.com]

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  groups:
    dev: [payment-devs@example.com]
    maintainer: [payment-maintainers@example.com]
    owner: [payment-owners@example.com]

domain.yaml

Authentication and Authorization

There are three different authentication and authorization concerns in Konfig. First, GitLab needs to authenticate with GCP such that pipelines can deploy to the Konfig control plane. Second, the control plane, which runs in a privileged customer GCP project, needs to authenticate with GCP such that it can create and manage cloud resources in the respective customer projects. Third, customer workloads need to be able to authenticate with GCP such that they can correctly access their resource dependencies, such as a database or Pub/Sub topic. The configuration for all of this authentication as well as the proper authorization settings is managed by Konfig. Not only that, but none of these authentication patterns involve any kind of long-lived credentials or keys.

GitLab to GCP authentication is implemented using Workload Identity Federation, which uses OpenID Connect to map a GitLab identity to a GCP service account. We scope this identity mapping so that the GitLab pipeline can only deploy to its respective control plane namespace. For instance, the Payment Processing team can’t deploy to the Fulfillment team’s namespace and vice versa.
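Konfig configures all of this for you, but to make the mechanics concrete, here is a minimal sketch of what a pipeline job using this federation might look like. The pool, provider, project number, and service account below are hypothetical placeholders, not Konfig’s actual configuration.

deploy:
  id_tokens:
    GCP_ID_TOKEN:
      # Audience must match what is configured on the Workload Identity
      # Federation provider in GCP.
      aud: https://gitlab.com
  script:
    - echo "$GCP_ID_TOKEN" > .ci_token
    # Exchange the short-lived GitLab OIDC token for short-lived GCP
    # credentials; no long-lived keys are stored anywhere.
    - >-
      gcloud iam workload-identity-pools create-cred-config
      projects/123456789/locations/global/workloadIdentityPools/gitlab-pool/providers/gitlab-provider
      --service-account=konfig-deployer@example-project.iam.gserviceaccount.com
      --credential-source-file=.ci_token
      --output-file=gcp_creds.json
    - gcloud auth login --cred-file=gcp_creds.json

.gitlab-ci.yml (illustrative sketch)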

Control plane to GCP authentication relies on domain-level service accounts that map a control plane namespace for a domain (let’s say Payment Processing) to a set of GCP projects for the domain (e.g. Payment Processing Dev, Payment Processing Stage, and Payment Processing Prod).

Finally, workloads also rely on service accounts to authenticate and access their resource dependencies. Konfig creates a service account for each workload and sets the proper roles on it needed to access resources. We’ll look at this in more detail next.

This approach to authentication and authorization means there is very little attack surface area. There are no keys to compromise and even if an attacker were to somehow compromise GitLab, such as by hijacking a developer’s account, the blast radius is minimal.

Least-Privilege Access

Konfig is centered around declaratively modeling workloads and their infrastructure dependencies. This is done with the workload.yaml. This lets us spec out all of the resources our service needs like databases, storage buckets, caches, etc. Konfig then handles provisioning and managing these resources. It also handles creating a service account for each workload that has roles that are scoped to only the resources specified by the workload. Let’s take a look at an example.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Workload
metadata:
  name: order-api
spec:
  region: us-central1
  runtime:
    kind: RunService
    parameters:
      template:
        containers:
          - image: order-api
  resources:
    - kind: StorageBucket
      name: receipts
    - kind: SQLInstance
      name: order-store
    - kind: PubSubTopic
      name: order-events

workload.yaml

Here we have a simple workload definition for a service called “order-api”. This workload is a Cloud Run service that has three resource dependencies: a Cloud Storage bucket called “receipts”, a Cloud SQL instance called “order-store”, and a Pub/Sub topic called “order-events”. When this YAML definition gets applied by the GitLab pipeline, Konfig will handle spinning up these resources as well as the Cloud Run service itself and a service account for order-api. This service account will have the Pub/Sub Publisher role scoped only to the order-events topic and the Storage Object User role scoped to the receipts bucket. Konfig will also create a SQL user on the Cloud SQL instance whose credentials will be securely stored in Secret Manager and accessible only to the order-api service account. The Konfig UI shows this workload, all of its dependencies, and each resource’s status.

Konfig workload UI
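Under the hood, this amounts to a handful of narrowly scoped IAM bindings. As an illustration of the scoping (not necessarily Konfig’s internal representation), here is roughly what the Pub/Sub binding for order-api might look like expressed as a Config Connector resource; the project and service account names are hypothetical.

apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: order-api-publish-order-events
spec:
  # Hypothetical service account created for the order-api workload
  member: serviceAccount:order-api@payment-processing-dev.iam.gserviceaccount.com
  role: roles/pubsub.publisher
  resourceRef:
    apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
    kind: PubSubTopic
    # The role is bound to this specific topic rather than the whole project
    name: order-events

An illustrative scoped IAM binding for order-api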

Enforcing Enterprise Standards

After looking at the example workload definition above, you may be wondering: there’s a lot more to creating a storage bucket, Cloud SQL database, or Pub/Sub topic than just specifying its name. Where’s the rest? This is a good segue into how Konfig provides sane defaults and enforces organizational standards around how resources are configured.

Konfig uses templates to allow an organization to manage either default or required settings on resources. This lets a platform team centrally manage how things like databases, storage buckets, or caches are configured. For instance, our organization might enforce a particular version of PostgreSQL, high availability mode, private IP only, and a customer-managed encryption key. For non-production environments, we may use a non-HA configuration to reduce costs. Just like our platform, domain, and workload definitions, these templates are also defined in YAML and managed via GitOps.
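Konfig’s actual template schema is a topic for another post, but conceptually a template might look something like the hypothetical sketch below: required settings that cannot be overridden, defaults that can, and per-environment variations. The kind and field names here are illustrative only.

apiVersion: konfig.realkinetic.com/v1beta1
# Hypothetical kind and schema, shown purely for illustration
kind: ResourceTemplate
metadata:
  name: sql-instance-standards
  namespace: konfig-control-plane
spec:
  targetKind: SQLInstance
  required:
    databaseVersion: POSTGRES_15      # enforce a particular PostgreSQL version
    privateIpOnly: true               # no public IP allowed
    encryption: customer-managed-key  # CMEK required
  defaults:
    availabilityType: REGIONAL        # HA by default
  environmentOverrides:
    dev:
      availabilityType: ZONAL         # non-HA in dev to reduce costs

A hypothetical resource template sketch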

We can also take this further and even manage what cloud APIs or services are available for developers to use. Like access management, this is also configured at the control plane, platform, and domain levels. We can specify what services are enabled by default at the platform level which will inherit across domains. We can also disable certain services, for example, at the domain level. The example platform and domain definitions below illustrate this. We enable several services on the Ecommerce platform and restrict Pub/Sub, Memorystore (Redis), and Firestore on the Payment Processing domain.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    services:
      defaults:
        - cloud-run
        - cloud-sql
        - cloud-storage
        - secret-manager
        - cloud-kms
        - pubsub
        - redis
        - firestore

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    services:
      disabled:
        - pubsub
        - redis
        - firestore

domain.yaml

This model provides a means for companies to enforce a “golden path,” an opinionated and supported way of building something within your organization. It’s also a critical component for organizations dealing with regulatory or compliance requirements such as PCI DSS. Even for organizations that favor developer autonomy, it improves productivity by setting good defaults so that developers can focus less on infrastructure configuration and more on product or feature development.

SDLC Integration

It’s important to have an SDLC that enables developer efficiency while also providing a sound governance story. Konfig fits into existing SDLCs by following a GitOps model. It allows your infrastructure to follow the same SDLC as your application code. Both rely on a trunk-based development model. Since everything from platforms and domains to workloads is managed declaratively, in code, we can apply typical SDLC practices like protected branches, short-lived feature branches, merge requests, and code reviews.

Even when we create resources from the Konfig UI, they are backed by this declarative configuration. This is something we call “Visual IaC.” Teams who are more comfortable working with a UI can still define and manage their infrastructure using IaC without even having to directly write any IaC. We often encounter organizations who have teams like data analytics, data science, or ETL which are not equipped to deal with managing cloud infrastructure. This approach allows these teams to be just as productive—and empowered—as teams with seasoned infrastructure engineers while still meeting an organization’s SDLC requirements.

Creating a resource in the Konfig workload UI

Cost Management

Another key part of governance is having good cost visibility. This can be challenging for organizations because it heavily depends on how workloads and resources are structured in a customer’s cloud environment. If things are structured incorrectly, it can be difficult, if not impossible, to correctly allocate costs across different business units or product areas.

The Konfig hierarchy of platforms > domains > workloads solves this problem altogether because related workloads are grouped into domains and related domains are grouped into platforms. A domain maps to a set of projects, one per environment, which makes it trivial to see what a particular domain costs. Similarly, we can easily see an aggregate cost for an entire platform because of this grouping. The GCP billing account ID is set at the platform level and all projects within a platform are automatically linked to this account. Konfig makes it easy to implement an IT chargeback or showback policy for cloud resource consumption within a large organization.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    billingAccountId: "123ABC-456DEF-789GHI"

platform.yaml

Maintainable Infrastructure

Opinionated Model

We’ve talked about opinionation quite a bit already, but I want to speak to this directly. The reason companies so often struggle to operationalize their cloud environment is that the platforms themselves are unwilling to take an opinionated stance on how customers should solve problems. Instead, they aim to be as flexible and accommodating as possible so they can meet as many customers as possible where they are. But we frequently hear from clients: “just tell me how to do it” or even “can you do it for me?” Many of them don’t want the flexibility; they just want a preassembled solution that has the best practices already implemented. It’s the difference between a pile of Legos with no instructions and an already-assembled Lego factory. Sure, it’s fun to build something yourself and express your creativity, but this is not where most businesses want creativity. They want creativity in the things that generate revenue.

Konfig is that preassembled Lego factory. Does that mean you get to customize and change all the little details of the platform? No, but it means your organization can focus its energy and creativity on the things that actually matter to your customers. With Konfig, we’ve codified the best practices and patterns into a turnkey solution. This more opinionated approach allows us to provide a good developer experience that results in maintainable infrastructure. The absence of creative constraints tends to lead to highly bespoke architectures and solutions that are difficult to maintain, especially at scale. It leads to a great deal of inefficiency and complexity.

Architectural Standards

Earlier we saw how Konfig provides a powerful means for enforcing enterprise standards and sane defaults for infrastructure as well as how we can restrict the use of certain services. While we looked at this in the context of governance, it’s also a key ingredient for maintainable infrastructure. Organizational standards around infrastructure and architecture improve efficiency and maintainability for the same reason the opinionation we discussed above does. Konfig’s templating model and approach to platforms and domains effectively allows organizations to codify their own internal opinions.

Automatic Reconciliation

There are a number of challenges with traditional IaC tools like Terraform. One such challenge is the problem of state management and drift. A resource managed by Terraform might be modified outside of Terraform, which introduces a state inconsistency. This can range from something simple like a single field on a resource to something very complex, such as an entire application stack. Resolving drift can sometimes be quite problematic. Terraform works by recording the state of your infrastructure in a state file. Aside from the problem that the state file often contains sensitive information like passwords and credentials, Terraform is applied in a “one-off” fashion. That is to say, when the terraform apply command is run, the current configuration is applied to the environment and the state file is updated. At this point, Terraform is no longer involved until the next time it is applied. It could be hours, days, weeks, or longer between applies.

Konfig uses a very different model. In particular, it regularly reconciles the infrastructure state automatically. This solves the issue of state drift altogether since infrastructure is no longer applied as “one-off” events. Instead, it treats infrastructure the way it actually is—a living, breathing thing—rather than a single, point-in-time snapshot.

Speed to Production

Turnkey Setup

Our goal with Konfig is to provide a fully turnkey experience, meaning customers have a complete and enterprise-grade platform with little-to-no setup. This includes setup of the platform itself, but also setup of new workloads within Konfig. We want to make it as easy and frictionless as possible for organizations to start shipping workloads to production. It’s common for a team to build a service that is code complete but getting it deployed to various environments takes weeks or even months due to the different organizational machinations that need to occur first. With Konfig, we start with a workload deploying to an environment. You can use our workload template in GitLab to create a new workload project and deploy it to a real environment in a matter of minutes. The CI/CD pipeline is already configured for you. Then you can work backwards and start adding your code and infrastructure resources. We call this “Deployment-Driven Development.” 
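To give a sense of how little pipeline configuration a new workload needs, the project’s .gitlab-ci.yml can be as small as including a shared pipeline definition. The include path and file name below are hypothetical placeholders, not Konfig’s actual template location.

# Hypothetical .gitlab-ci.yml for a new workload project; the project path
# and file name are illustrative.
include:
  - project: my-org/konfig/pipeline-templates
    file: /workload-pipeline.yml

.gitlab-ci.yml (illustrative sketch)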

Konfig works by using a control plane which lives in a customer GCP project. The setup of this control plane is fully automated using the Konfig CLI. When you run the CLI bootstrap command, it will run through a guided wizard which sets up the necessary resources in both GitLab and GCP. After this runs, you’ll have a fully functioning enterprise platform.

Konfig CLI

Workload Autowiring

We saw earlier how workloads declaratively specify their infrastructure resources (something we call resource claims) and how Konfig manages a service account with the correctly scoped, minimal set of permissions to access those resources. For resources that use credentials, such as Cloud SQL database users, Konfig will manage these secrets by storing them in GCP’s Secret Manager. Only the workload’s service account will be able to access this secret, which gets automatically mounted onto the workload. Resource references, such as storage bucket names, Pub/Sub topics, or Cloud SQL connections, will also be injected into the workload to make it simple for developers to start consuming these resources.
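To illustrate what this autowiring might look like from the workload’s point of view, here is a hypothetical sketch of the environment the order-api container could receive, shown in Cloud Run’s Knative-style YAML. The variable and secret names are assumptions for illustration, not Konfig’s documented contract.

# Hypothetical environment injected into the order-api container
env:
  - name: RECEIPTS_BUCKET           # reference to the claimed storage bucket
    value: receipts-payment-processing-dev
  - name: ORDER_EVENTS_TOPIC        # reference to the claimed Pub/Sub topic
    value: projects/payment-processing-dev/topics/order-events
  - name: ORDER_STORE_DB_PASSWORD   # SQL user credential held in Secret Manager
    valueFrom:
      secretKeyRef:
        name: order-store-db-user   # readable only by order-api's service account
        key: latest

Illustrative injected configuration for order-api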

API Ingress and Path-Based Routing

Konfig makes it easy to control the ingress of services. We can set a service such that it is only accessible within a domain, within a platform, or within a control plane. We can even control which domains can access an API. Alternatively, we can expose a service to the internet. Konfig uses a path-based routing scheme which maps to the platform > domain > workload hierarchy. Let’s take a look at an example platform, domain, and workload configuration.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    api:
      path: /ecommerce

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    api:
      path: /payment

domain.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Workload
metadata:
  name: authorization-service
spec:
  region: us-central1
  runtime:
    kind: RunService
    parameters:
      template:
        containers:
          - image: authorization-service
  api:
    path: /auth

workload.yaml

Note the API path component in the above configurations. Our ecommerce platform specifies /ecommerce as its path, the payment-processing domain specifies /payment, and the authorization-service workload specifies /auth. The full route to hit the authorization-service would then be /ecommerce/payment/auth. We’ll explore API ingress and routing in more detail in a later post.

An Enterprise-Ready Workload Delivery Platform

We’ve looked at a few of the ways Konfig provides a compelling enterprise integration of GitLab and Google Cloud. It addresses a gap these products leave by not offering strong opinions to customers. Konfig allows us to package up the best practices and patterns for implementing a production-ready workload delivery platform and provide that missing opinionation. It tackles three competing priorities that arise when building software: security and governance, maintainable infrastructure, and speed to production. Konfig plays a strategic role in reducing the cost and improving the efficiency of cloud migration, modernization, and greenfield efforts. Reach out to learn more about Konfig or schedule a demo.

Security, Maintainability, Velocity: Choose One

There are three competing priorities that companies have as it relates to software development: security, maintainability, and velocity. I’ll elaborate on what I mean by each of these in just a bit. When I originally started thinking about this, I thought of it in the context of the “good, fast, cheap: choose two” project management triangle. But after thinking about it for more than a couple minutes, and as I related it to my own experience and observations at other companies, I realized that in practice it’s much worse. For most organizations building software, it’s more like security, maintainability, velocity: choose one.

The Software Development Triangle

Of course, most organizations are not explicitly making these trade-offs. Instead, the internal preferences and culture of the company reveal them. I believe many organizations, consciously or not, accept this trade-off as an immovable constraint. More risk-averse groups might even welcome it. Though the triangle most often results in a “choose one” sort of compromise, it’s not some innate law. You can, in fact, have all three with a little bit of careful thought and consideration. And while reality is always more nuanced than what this simple triangle suggests, I find looking at the extremes helps to ground the conversation. It emphasizes the natural tension between these different concerns. Bringing that tension to the forefront allows us to be more intentional about how we manage it.

It wasn’t until recently that I distilled down these trade-offs and mapped them into the triangle shown above, but we’ve been helping clients navigate this exact set of competing priorities for over six years at Real Kinetic. We built Konfig as a direct response to this since it was such a common challenge for organizations. We’re excited to offer a solution which is the culmination of years of consulting and which allows organizations to no longer compromise, but first let’s explore the trade-offs I’m talking about.

Security

Companies, especially mid- to large-sized organizations, care a great deal about security (and rightfully so!). That’s not to say startups don’t care about it, but the stakes are just much higher for enterprises. They are terrified of being the next big name in the headlines after a major data breach or ransomware attack. I call this priority security for brevity, but it actually consists of two things which I think are closely aligned: security and governance.

Governance directly supports security in addition to a number of other concerns like reliability, risk management, and compliance. This is sometimes referred to as Governance, Risk, and Compliance or GRC. Enterprises need control over, and visibility into, all of the pieces that go into building and delivering software. This is where things like SDLC, separation of duties, and access management come into play. Startups may play it more fast and loose, but more mature organizations frequently have compliance or regulatory obligations like SOC 2 Type II, PCI DSS, FINRA, FedRAMP, and so forth. Even if they don’t have regulatory constraints, they usually have a reputation that needs to be protected, which typically means more rigid processes and internal controls. This is where things can go sideways for larger organizations as it usually leads to practices like change review boards, enterprise (ivory tower) architecture programs, and SAFe. Enterprises tend to be pretty good at governance, but it comes at a cost.

It should come as no surprise that security and governance are in conflict with speed, but they are often in contention with well-architected and maintainable systems as well. When organizations enforce strong governance and security practices, it can often lead to developers following bad practices. Let me give an example I have seen firsthand at an organization.

A company has been experiencing stability and reliability issues with its software systems. This has caused several high-profile, revenue-impacting outages which have gotten executives’ attention. The response is to implement a series of process improvements to effectively slow down the release of changes to production. This includes a change review board to sign off on changes going to production and a gating process that new workloads must pass through before they can be released. The hope is that these process changes will reduce defects and improve the reliability of systems in production. At this point, we are wittingly trading off velocity.

What actually happened is that developers began batching up more and more changes to get through the change review board, which resulted in “big bang” releases. This caused even more stability issues because now large sets of changes were being released which were increasingly complex, difficult to QA, and harder to troubleshoot. Rollbacks became difficult, if not impossible, due to the size and complexity of releases, increasing the impact of defects. Release backlogs quickly grew, prompting developers to move on to more work rather than sit idle, which further compounded the issue and led to context switching. Decreasing the frequency of deployments only exacerbated these problems. Counterintuitively, slowing down actually increased risk.

To avoid the production gating process, developers began adding functionality to existing services which, architecturally speaking, should have gone into new services. Services became bloated grab bags of miscellaneous functionality since it was easier to piggyback features onto workloads already in production than it was to run the gauntlet of getting a new service to production. These processes were directly and unwittingly impacting system architecture and maintainability. In economics, this is called a “negative externality.” We may have security and governance, but we’ve traded off velocity and maintainability. Adding insult to injury, the processes were not even accomplishing the original goal of improving reliability, they were making it worse!

Maintainability

It’s critical that software systems are not just built to purpose, but also built to last. This means they need to be reliable, scalable, and evolvable. They need to be conducive to finding and correcting bugs. They need to support changing requirements such that new features and functionality can be delivered rapidly. They need to be efficient and cost effective. More generally, software needs to be built in a way that maximizes its useful life.

We simply call this priority maintainability. While it covers a lot, it can basically be summarized as: is the system architected and implemented well? Is it following best practices? Is there a lot of tech debt? How much thought and care has been put into design and implementation? Much of this comes down to gut feel, but an experienced engineer can usually intuit pretty quickly whether or not a system is maintainable. Good proxies are often the change failure rate, mean time to recovery, and lead time for implementing new features.

Maintainability’s benefits are more of a long tail. A maintainable system is easier to extend and add new features to later, easier to identify and fix bugs in, and generally experiences fewer defects. However, the cost of that long-term speed is basically frontloaded. It usually means moving slower towards the beginning while reaping the rewards later. Conversely, it’s easy to go fast if you’re just hacking something together without much concern for maintainability, but you will likely pay the cost later. Companies can become crippled by tech debt and unmaintained legacy systems to the point of “bankruptcy” in which they are completely stuck. This usually leads to major refactors or rewrites which have their own set of problems.

Additionally, building systems that are both maintainable and secure can be surprisingly difficult, especially in more dynamic cloud environments. If you’ve ever dealt with IAM, for example, you know exactly what I mean. Scoping identities with the right roles or permissions, securely managing credentials and secrets, configuring resources correctly, ensuring proper data protections are in place, etc. Misconfigurations are frequently the cause of the major security breaches you see in the headlines. The unfortunate reality is security practices and tooling lag in the industry, and security is routinely treated as an afterthought. Often it’s a matter of “we’ll get it working and then we’ll come back later and fix up the security stuff,” but later never happens. Instead, an IAM principal is left with overly broad access or a resource is configured improperly. This becomes 10x worse when you are unfamiliar with the cloud, which is where many of our clients tend to find themselves.

Velocity

The last competing priority is simply speed to production or velocity. This one probably requires the least explanation, but it’s consistently the priority that is sacrificed the most. In fact, many organizations may even view it as the enemy of the first two priorities. They might equate moving fast with being reckless. Nonetheless, companies are feeling the pressure to deliver faster now more than ever, but it’s much more than just shipping quickly. It’s about developing the ability to adapt and respond to changing market conditions fast and fluidly. Big companies are constantly on the lookout for smaller, more nimble players who might disrupt their business. This is in part why more and more of these companies are prioritizing the move to cloud. The data center has long been their moat and castle as it relates to security and governance, however, and the cloud presents a new and serious risk for them in this space. As a result, velocity typically pays the price.

As I mentioned earlier, velocity is commonly in tension with maintainability as well; it’s usually just a matter of whether that premium is frontloaded or backloaded. More often than not, we can choose to move quickly up front but pay a penalty later on, or vice versa. Truthfully though, if you’ve followed the DORA State of DevOps Reports, you know that a lot of companies neither frontload nor backload their velocity premium—they are just slow all around. These are usually more legacy-minded IT shops and organizations that treat software development as an IT cost center. These are also usually the groups that bias more towards security and governance, but they’re probably the most susceptible to disruption. “Move fast and break things” is not a phrase you will hear permeating these organizations, yet they all desire to modernize and accelerate. We regularly watch these companies’ teams spend months configuring infrastructure, and what they construct is often complex, fragile, and insecure.

Choose Three

Businesses today are demanding strong security and governance, well-structured and maintainable infrastructure, and faster speed to production. The reality, however, is that these three priorities are competing with each other, and companies often end up with one of the priorities dominating the others. If we can acknowledge these trade-offs, we can work to better understand and address them.

We built Konfig as a solution that tackles this head-on by providing an opinionated configuration of Google Cloud Platform and GitLab. Most organizations start from a position where they must assemble the building blocks in a way that allows them to deliver software effectively, but their own biases result in a solution that skews one way or the other. Konfig instead provides a turnkey experience that minimizes time-to-production, is secure by default, and has governance and best practices built in from the start. Rather than having to choose one of security, maintainability, and velocity, don’t compromise—have all three. In a follow-up post I’ll explain how Konfig addresses concerns like security and governance, infrastructure maintainability, and speed to production in a “by default” way. We’ll see how IAM can be securely managed for us, how we can enforce architecture standards and patterns, and how we can enable developers to ship production workloads quickly by providing autonomy with guardrails and stable infrastructure.

Introducing Konfig: GitLab and Google Cloud preconfigured for startups and enterprises

Real Kinetic helps businesses transform how they build and deliver software in the cloud. This encompasses legacy migrations, app modernization, and greenfield development. We work with companies ranging from startups to Fortune 500s and everything in between. Most recently, we finished helping Panera Bread migrate their e-commerce platform to Google Cloud from on-prem and led their transition to GitLab. In doing this type of work over the years, we’ve noticed a problem organizations consistently hit that causes them to stumble with these cloud transformations. Products like GCP, GitLab, and Terraform are quite flexible and capable, but they are sort of like piles of Legos.

These products by nature are mostly unopinionated, which means customers need to put the pieces together in a way that works for their unique situation. This makes it difficult to get started, but it’s also difficult to assemble them in a way that works well for 1 team or 100 teams. Startups require a solution that allows them to focus on product development and accelerate delivery while ideally adhering to best practices that scale with their growth. Larger organizations require something that enables them to transform how they deliver software and innovate, but they need it to address enterprise concerns like security and governance. Yet, when you’re just getting started, you know the least and are in the worst position to make decisions that will have a potentially long-lasting impact. The outcome is that companies attempting a cloud migration or app modernization effort fail to even get off the starting blocks.

It’s easy enough to cobble together something that works, but doing it in a way that is actually enterprise-ready, scalable, and secure is not an insignificant undertaking. In fact, it’s quite literally what we have made a business of helping customers do. What’s worse is that this is undifferentiated work. Companies are spending countless engineering hours building and maintaining their own bespoke “cloud assembly lines”—or Internal Developer Platforms (IDPs)—which are all attempting to address the same types of problems. That engineering time would be better spent on things that actually matter to customers and the business.

This is what prompted us to start thinking about solutions. GitLab and GCP don’t offer strong opinions because they address a broad set of customer needs. This creates a need for an opinionated configuration or distribution of these tools. The solution we arrived at is Konfig. The idea is to provide this distribution through what we call “Platform as Code.” Where Infrastructure as Code (IaC) is about configuring the individual resource-level building blocks, Platform as Code is one level higher. It’s something that can assemble these discrete products in a coherent way—almost as if they were natively integrated. The result is a turnkey experience that minimizes time-to-production in a way that will scale, is secure by default, and has best practices built in from the start. A Linux distro delivers a ready-to-use operating system by providing a preconfigured assembly of kernel, system libraries, and applications. In the same way, Konfig delivers a ready-to-use platform for shipping software by providing a preconfigured assembly of source control, CI/CD, and cloud provider. Whether it’s legacy migration, modernization, or greenfield, Konfig provides your packaged onramp to GCP and GitLab.

Platform as Code

Central to Konfig is the notion of a Platform. In this context, a Platform is a way to segment or group parts of a business. This might be different product lines, business units, or verticals. How these Platforms are scoped and how many there are will differ for every organization, depending on how the business is structured. A small company or startup might consist of a single Platform. A large organization might have dozens or more.

A Platform is then further subdivided into Domains, a concept we borrow from Domain-Driven Design. A Domain is a bounded context which encompasses the business logic, rules, and processes for a particular area or problem space. Simply put, it’s a way to logically group related services and workloads that make up a larger system. For example, a business providing online retail might have an E-commerce Platform with the following Domains: Product Catalog, Customer Management, Order Management, Payment Processing, and Fulfillment. Each of these domains might contain on the order of 5 to 10 services.

This structure provides a convenient and natural way for us to map access management and governance onto our infrastructure and workloads because it is modeled after the organization structure itself. Teams can have ownership or elevated access within their respective Domains. We can also specify which cloud services and APIs are available at the Platform level and further restrict them at the Domain level where necessary. This hierarchy facilitates a powerful way to enforce enterprise standards for a large organization while allowing for a high degree of flexibility and autonomy for a small organization. Basically, it allows for governance when you need it (and autonomy when you don’t). This is particularly valuable for organizations with regulatory or compliance requirements, but it’s equally valuable for companies wanting to enforce a “golden path”—that is, an opinionated and supported way of building something within your organization. Finally, Domains provide clear cost visibility because cloud resources are grouped into Domain projects. This makes it easy to see what “Fulfillment” costs versus “Payment Processing” in our E-commerce Platform, for example.

“Platform as Code” means these abstractions are modeled declaratively in YAML configuration and managed via GitOps. The definitions of Platforms and Domains consist of a small amount of metadata, shown below, but that small amount of metadata ends up doing a lot of heavy lifting in the background.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gitlab:
    parentGroupId: 82224252
  gcp:
    billingAccountId: "123ABC-456DEF-789GHI"
    parentFolderId: "1080778227704"
    defaultEnvs:
      - dev
      - stage
      - prod
    services:
      defaults:
        - cloud-run
        - cloud-sql
        - cloud-storage
        - secret-manager
        - cloud-kms
        - pubsub
        - redis
        - firestore
    api:
      path: /ecommerce

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    services:
      disabled:
        - pubsub
        - redis
        - firestore
    api:
      path: /payment
  groups:
    dev: [payment-devs@example.com]
    maintainer: [payment-maintainers@example.com]
    owner: [payment-owners@example.com]

domain.yaml

The Control Plane

Platforms, Domains, and all of the resources contained within them are managed by the Konfig control plane. The control plane consumes these YAML definitions and does whatever is needed in GitLab and GCP to make the “real world” reflect the desired state specified in the configuration.

The control plane manages the structure of groups and projects in GitLab and synchronizes this structure with GCP. This includes a number of other resources behind the scenes as well: configuring OpenID Connect to allow GitLab pipelines to authenticate with GCP, creating IAM resources like service accounts and role bindings, managing SAML group links to sync user permissions between GCP and GitLab, and enabling service APIs on the cloud projects. The Platform/Domain model allows the control plane to specify fine-grained permissions and scope access to only the things that need it. In fact, there are no credentials exposed to developers at all. It also allows us to manage what cloud services are available to developers and what level of access they have across the different environments. This governance is managed centrally but federated across both GitLab and GCP.

The net result is a configuration- and standards-driven foundation for your cloud development platform that spans your source control, CI/CD, and cloud provider environments. This foundation provides a golden path that makes it easy for developers to build and deliver software while meeting an organization’s internal controls, standards, or regulatory requirements. Now we’re ready to start delivering workloads to our enterprise cloud environment.

Managing Workloads and Infrastructure

The Konfig control plane establishes an enterprise cloud environment in which we could use traditional IaC tools such as Terraform to manage our application infrastructure. However, the control plane is capable of much more than just managing the foundation. It can also manage the workloads that get deployed to this cloud environment. This is because Konfig actually consists of two components: Konfig Platform, which configures and manages our cloud platform comprising GitLab and GCP, and Konfig Workloads, which configures and manages application workloads and their respective infrastructure resources.

Using the Lego analogy, think of Konfig Platform as providing a pre-built factory and Konfig Workloads as providing pre-built assembly lines within the factory. You can use both in combination to get a complete, turnkey experience or just use Konfig Platform and “bring your own assembly line” such as Terraform.

Konfig Workloads provides an IaC alternative to Terraform where resources are managed by the control plane. Similar to how the platform-level components like GitLab and GCP are managed, this works by using an operator that runs in the control plane cluster. This operator runs on a control loop which is constantly comparing the desired state of the system with the current state and performs whatever actions are necessary to reconcile the two. A simple example of this is the thermostat in your house. You set the temperature—the desired state—and the thermostat works to bring the actual room temperature—the current state—closer to the desired state by turning your furnace or air conditioner on and off. This model removes the potential for state drift, where the actual state diverges from the configured state, which can be a major headache with tools like Terraform where state is managed with backends.

The Konfig UI provides a visual representation of the state of your system. This is useful for getting a quick understanding of a particular Platform, Domain, or workload versus reading through YAML that could be scattered across multiple files or repos (and which may not even be representative of what’s actually running in your environment). With this UI, we can easily see what resources a workload has configured and can access, the state of these resources (whether they are ready, still provisioning, or in an error state), and how the workload is configured across different environments. We can even use the UI itself to provision new resources like a database or storage bucket that are scoped automatically to the workload. This works by generating a merge request in GitLab with the desired changes, so while we can use the UI to configure resources, everything is still managed declaratively through IaC and GitOps. This is something we call “Visual IaC.”

Your Packaged Onramp to GCP and GitLab

The current cloud landscape offers powerful tools, but assembling them efficiently, securely, and at scale remains a challenge. This “undifferentiated work” consumes valuable engineering resources that could be better spent on core business needs, and it often prevents organizations from even getting off the starting line when beginning their cloud journey. Konfig, built around the principles of Platform as Code and standards-driven development, addresses this very gap. We built it to help our clients move quicker through operationalizing the cloud so that they can focus on delivering business value to their customers. Whether you’re migrating to the cloud, modernizing, or starting from scratch, Konfig provides a preconfigured and opinionated integration of GitLab, GCP, and Infrastructure as Code which gives you:

  • Faster time-to-production: Streamlined setup minimizes infrastructure headaches and allows developers to focus on building and delivering software.
  • Enterprise-grade security: Built-in security best practices and fine-grained access controls ensure your cloud environment remains secure.
  • Governance: Platforms and Domains provide a flexible model that balances enterprise standards with team autonomy.
  • Scalability: Designed to scale with your business, easily accommodating growth without compromising performance or efficiency.
  • Great developer UX: Designed to provide a great user experience for developers shipping applications and services.

Konfig functions like an operating system for your development organization to deliver software to the cloud. It’s an opinionated IDP specializing in cloud migrations and app modernization. This allows you to focus on what truly matters—building innovative software products and delivering exceptional customer experiences.

We’ve been leveraging these patterns and tools for years to help clients ship with confidence, and we’re excited to finally offer a solution that packages them up. Please reach out if you’d like to learn more and see a demo. If you’re undertaking a modernization or cloud migration effort, we want to help make it a success. We’re looking for a few organizations to partner with to develop Konfig into a robust solution.

Choosing Good SLIs

Transitioning from an on-prem environment to a cloud environment involves a lot of major shifts for organizations. One of those shifts is often around how we monitor the overall health of systems. The typical way to measure things like the availability, reliability, and performance of systems is with SLIs or Service Level Indicators. SLIs are a valuable tool both on-prem and in the cloud, but when it comes to the latter, I often see organizations carrying over some operational anti-patterns from their data center environment.

Unlike public clouds, data centers are often resource-constrained. Services run on dedicated sets of VMs and it can take days or weeks for new physical servers to be provisioned. Consequently, it’s common for organizations to closely monitor metrics such as CPU utilization, memory consumption, disk space, and so forth since these are all precious resources within a data center.

Often what happens is that ops teams get really good at identifying and pattern-matching the common issues that arise in their on-prem environment. For instance, certain applications may be prone to latency issues. Each time we dig into a latency issue we find that the problem is due to excessive garbage collection pauses. As a result, we define a metric around garbage collection because it is often an indicator of performance problems in the application. In practice, this becomes an SLI, whether it’s explicitly defined as such or not, because there is some sort of threshold beyond which garbage collection is considered “excessive.” We begin watching this metric closely to gauge whether the service is healthy or not and alerting on it.

The cloud is a very different environment than on-prem. Whether we’re using an orchestrator such as Kubernetes or a serverless platform, containers are usually ephemeral and instances autoscale up and down. If an instance runs out of memory, it will just get recycled. This is why we sometimes say you can “pay your way out” of a problem in these environments because autoscaling and autohealing can hide a lot of application issues such as a slow memory leak. In an on-prem environment, these can be significantly more impactful. The performance profile of applications often looks quite different in the cloud than on-prem as well. Underlying hardware, tenancy, and networking characteristics differ considerably. All this is to say, things look and behave quite differently between the two environments, so it’s important to reevaluate operational practices as well. With SLIs and monitoring, it’s easy to bias toward specific indicators from on-prem, but they might not translate to more cloud-native environments.

User-centric monitoring

So how do we choose good SLIs? The key question to ask is: what is the customer’s experience like? Everything should be driven from this. Is the application responding slowly? Is it returning errors to the user? Is it returning bad or incorrect results? These are all things that directly impact the customer’s experience. Conversely, questions such as the service’s CPU utilization, its memory consumption, or its rate of garbage collection cycles do not directly capture the customer’s experience. These are all things that could impact the customer’s experience, but without actually looking from the user’s perspective, we simply don’t know whether they are or not. Rather, they are diagnostic tools that—once an issue is identified—can help us to better understand the underlying cause.

Take, for example, the CPU and memory utilization of processes on your computer. Most people probably are not constantly watching the Activity Monitor on their MacBook. Instead, they might open it up when they notice their machine is responding slowly to see what might be causing the slowness.

Three key metrics

When it comes to monitoring services, there are really three key metrics that matter: traffic rate, error rate, and latency. These three things all directly impact the user’s experience.

Traffic Rate

Traffic rate, which is usually measured in requests or queries per second (qps), is important because it tells us if something is wrong upstream of us. For instance, our service might not be throwing any errors, but if it’s suddenly handling 0 qps when it ordinarily is handling 80-100 qps, then something happened upstream that we should know about. Perhaps there is a misconfiguration that is preventing traffic from reaching our service, which almost certainly impacts the user experience.

Traffic rate or qps for a service
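To make this concrete, here is a minimal sketch of a traffic-drop alert, assuming a Prometheus-style setup where the service exports a standard http_requests_total counter; the metric labels and threshold are assumptions for the sake of example.

groups:
  - name: order-api-traffic
    rules:
      - alert: TrafficDropped
        # Fire if throughput falls far below the usual 80-100 qps baseline
        expr: sum(rate(http_requests_total{service="order-api"}[5m])) < 10
        for: 10m

An illustrative traffic-rate alerting rule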

Error Rate

Error rate simply tells us the rate at which the service is returning errors to the client. If our service normally returns 200 responses but suddenly starts returning 500 errors, we know something is wrong. This requires good status code hygiene to be effective. I’ve encountered codebases where various types of error codes are used to indicate non-error conditions, which can add a lot of noise to this type of SLI. Additionally, this metric might be more fine-grained than just “error” or “not error”, since—depending on the application—we might care about the rate of specific 2xx, 4xx, or 5xx responses, for example.

It’s common for teams to rely on certain error logs rather than response status codes for monitoring. This can provide even more granularity around types of error conditions, but in my experience, it usually works better to rely on fairly coarse-grained signals such as HTTP status codes for the purposes of aggregate monitoring and SLIs. Instead, use this logging for diagnostics and troubleshooting once you have identified there is a problem (I am, however, a fan of structured logging and log-based metrics for instrumentation, but that is for another blog post).

Response codes for a service
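Continuing the same hypothetical Prometheus setup, an aggregate error-rate alert built on coarse-grained status codes might look like the sketch below, firing when more than 1% of responses over five minutes are 5xx errors.

groups:
  - name: order-api-errors
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{service="order-api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="order-api"}[5m])) > 0.01
        for: 5m

An illustrative error-rate alerting rule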

Latency

Combined with error rate, latency tells us what the customer’s experience is really like. This is an important metric for synchronous, user-facing APIs but might be less critical for asynchronous processes such as services that consume events from a message queue. It’s important to point out that when looking at latency, you cannot use averages. This is a common trap I see ops teams and engineers fall into. Latency rarely follows a normal distribution, so relying on averages or medians to provide a summarized view of how a system is performing is folly.

Instead, we have to look at percentiles to get a better understanding of what the latency distribution looks like. Similarly, you cannot average percentiles either. It mathematically makes no sense, meaning you can’t, for instance, look at the average 90th percentile over some period of time. To summarize latency, we can plot multiple percentiles on a graph. Alternatively, heatmaps can be an effective way to visualize latency because they can reveal useful details like distribution modes and outliers. For example, the heatmap below shows that the latency for this service is actually bimodal. Requests usually either respond in approximately 10 milliseconds or 1 second. This modality is not apparent in the line chart above the heatmap where we are only plotting the 50th, 95th, and 99th percentiles. The line chart does, however, show that latency ticked up a tiny bit around 10:10 AM following a severe spike in tail latency where the 99th percentile momentarily jumped over 4 seconds…curious.

Latency distribution for a service as percentiles
Latency distribution for a service as a heatmap
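
One practical implication is that monitoring systems can usually align a latency distribution directly to a percentile, so alerts should target a percentile rather than a mean. A minimal sketch of a tail-latency alert policy, again assuming a Cloud Run service (the metric name and aligner are real; the threshold and service name are illustrative):

# Sketch: alert when p99 latency exceeds 2 seconds for five minutes.
displayName: my-service p99 latency
combiner: OR
conditions:
  - displayName: p99 latency above 2s
    conditionThreshold:
      filter: >
        metric.type="run.googleapis.com/request_latencies" AND
        resource.type="cloud_run_revision" AND
        resource.labels.service_name="my-service"
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_PERCENTILE_99  # a percentile of the distribution, never an average
      comparison: COMPARISON_GT
      thresholdValue: 2000  # request_latencies is reported in milliseconds
      duration: 300s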

Identifying other SLIs

While these three metrics are what I consider the critical baseline metrics, there may be other SLIs that are important to a service. For example, if our service is a cache, we might care about the freshness of data we’re serving as something that impacts the customer experience. If our service is queue-based, we might care about the time messages spend sitting in the queue.

Heatmap showing the age distribution of data retrieved from a cache
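
For the queue-based case, the same pattern applies. With Pub/Sub, for instance, one could watch the age of the oldest unacknowledged message. A hedged sketch (the metric and resource types are real; the subscription name and thresholds are illustrative):

# Sketch: alert when the oldest unacked message is older than five minutes.
displayName: my-subscription message age
combiner: OR
conditions:
  - displayName: Oldest unacked message older than 5 minutes
    conditionThreshold:
      filter: >
        metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age" AND
        resource.type="pubsub_subscription" AND
        resource.labels.subscription_id="my-subscription"
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MAX
      comparison: COMPARISON_GT
      thresholdValue: 300  # seconds
      duration: 300s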

Whatever the SLIs are, they should be things that directly matter to the user’s experience. If they aren’t, then at best they are a useful diagnostic or debugging tool, and at worst they are just dashboard window dressing. Usually, though, they’re of no use for proactive monitoring because they add too much noise, and of little use for reactive debugging because the data is typically pre-aggregated.

What’s worse is that when we focus on the wrong SLIs, it can lead us to take steps that actively harm the customer’s experience or simply waste our own time. A real-world example: I once saw a team actively monitoring garbage collection time for a service. They noticed that one instance in particular appeared to be running more garbage collections than the others. Although there were no obvious indicators of latency issues, timeouts, or out-of-memory errors that would actually impact the client, the team decided to redeploy the service in order to force instances to be recycled. This redeploy ended up having a much greater impact on the user experience than any of the garbage collection behavior ever did. The team also spent a considerable amount of time tuning various JVM parameters and other runtime settings, which ultimately had minimal impact.

Where lower-level metrics can provide value is in optimizing resource utilization and cloud spend. While the elastic nature of the cloud may allow us to pay our way out of certain types of problems, such as a memory leak, this can lead to inefficiency and waste over the long term. If we see that our service only utilizes 20% of its allocated CPU, we are likely overprovisioned and could save money. If we notice memory consumption consistently creeping up before hitting an out-of-memory error, we likely have a memory leak. However, it’s important to understand the distinction in use cases: SLIs are about gauging customer experience, while these system metrics are for identifying optimizations and understanding the long-term resource characteristics of your system. At any rate, I think it’s preferable to get a system to production with good monitoring in place, put real traffic on it, and then fine-tune its performance and resource utilization, rather than trying to optimize it beforehand through synthetic means.

Transitioning from an on-prem environment to the cloud necessitates a shift in how we monitor the health of systems. It’s essential to recognize and discard operational anti-patterns from traditional data center environments, where resource constraints often led to a focus on low-level metrics and behaviors; carrying those habits forward frequently results in a sort of “overfitting” when monitoring cloud-based systems. The key to choosing good SLIs is aligning them with the customer’s experience. Metrics such as traffic rate, error rate, and latency directly impact the user and provide meaningful insights into the health of services. By emphasizing these critical baseline metrics and avoiding distractions from irrelevant indicators, organizations can proactively monitor and improve the customer experience. Focusing on the right SLIs ensures that effort is directed toward resolving issues that actually matter to users, avoiding pitfalls that inadvertently harm the user experience or waste valuable time. As organizations navigate the complexities of migrating to a cloud-native environment, a user-centric approach to monitoring remains fundamental to successful and efficient operations.

Need help making the transition?

Real Kinetic helps organizations with their cloud migrations and implementing effective operations. If you have questions or need help getting started, we’d love to hear from you. These emails come directly to us, and we respond to every one.

Cloud without Kubernetes

I think it’s safe to say Kubernetes has “won” the cloud mindshare game. If you look at the CNCF Cloud Native landscape (and manage not to go cross-eyed), it seems like most of the projects are somehow related to Kubernetes. KubeCon is one of the fastest-growing industry events. Companies we talk to at Real Kinetic who are either preparing for or currently executing migrations to the cloud are centering their strategies around Kubernetes. Those already in the cloud are investing heavily in platform-izing their Kubernetes environment. Kubernetes competitors like Nomad, Pivotal Cloud Foundry, OpenShift, and Rancher have largely faded into the background (or simply pivoted to Kubernetes). In many ways, “cloud native” has come to be equated with “Kubernetes”.

All this is to say, the industry has coalesced around Kubernetes as the way to do cloud. But after working with enough companies doing cloud, watching their experiences, and understanding their business problems, I can’t help but wonder: should it be? Or rather, is Kubernetes actually the right level of abstraction?

Going k8sless

While we’ve worked with a lot of companies doing Kubernetes, we’ve also worked with some that are deliberately not. Instead, they leaned heavily into serverless, or as I like to call it, they’ve gone k8sless. These are not small companies or startups; they are name brands you would recognize.

At first, we were skeptical. Our team came from a company that made it all the way to IPO using Google App Engine, one of the earliest serverless platforms available. We have regularly espoused the benefits of serverless. We’ve talked to clients about how they should consider it for their own workloads (often to great skepticism). But using only serverless? For once, we were the serverless skeptics. One client in particular was beginning a migration of their e-commerce platform to Google Cloud and wanted to do it completely serverless. We gave our feedback and recommendations based on similar migrations we’d performed:

“There are workloads that aren’t a good fit.”

“It would require major re-architecting.”

“It will be expensive once fully migrated.”

“You’ll have better cost efficiency bin packing lots of services into VMs with Kubernetes.”

We articulated all the usual arguments made by the serverless doubters. Even Google was skeptical, echoing our sentiments to the customer. “Serious companies doing online retail like The Home Depot or Target are using Google Kubernetes Engine,” was more or less the message. We have a team of serverless experts at Real Kinetic though, so we forged ahead and helped execute the migration.

Fast forward nearly three years and we will happily admit it: we were wrong. You can run a multibillion-dollar e-commerce platform without a single VM. You don’t have to do a full rewrite or major re-architecting. It can be cost-effective. It doesn’t require proprietary APIs or constraints that result in vendor lock-in. It might sound like an exaggeration, but it’s not.

Container as the interface

Over the last several years, Google’s serverless offerings have evolved far beyond App Engine. They’ve reached the point where it’s now viable to run a wide variety of workloads without much issue. In particular, Cloud Run offers many of the same benefits as a PaaS like App Engine without the constraints. If your code can run in a container, there’s a very good chance it will run on Cloud Run with little to no modification.

In fact, other than using the gcloud CLI to deploy a service, there’s nothing really Google- or Cloud Run-specific needed to get a functioning application. This is because Cloud Run uses Knative, an open-source Kubernetes-based platform, as its deployment interface. And while Cloud Run is a Google-managed backend for the Knative interface, we could just as well switch the backend to GKE or our own Kubernetes cluster. When we implement our Cloud Run services, we actually implement them using a Kubernetes Deployment manifest, shown below, and right before deploying, we swap Deployment for Knative’s Service manifest.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    cloud.googleapis.com/location: us-central1
    service: my-service
  name: my-service
spec:
  # A selector with matching template labels is required for a valid Deployment.
  selector:
    matchLabels:
      service: my-service
  template:
    metadata:
      labels:
        service: my-service
    spec:
      containers:
        - image: us.gcr.io/my-project/my-service:v1
          name: my-service
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 2
              memory: 1024Mi
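
For reference, here’s a minimal sketch of what that swapped Knative Service manifest might look like for the same service. Cloud Run accepts this format (e.g., via gcloud run services replace), though treat the details here as illustrative:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    spec:
      containers:
        - image: us.gcr.io/my-project/my-service:v1
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: "2"
              memory: 1024Mi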

This means we can deploy to Kubernetes without Knative at all, which we often do during development using the combination of Skaffold and K3s to perform local testing. It also allows us to use Kubernetes native tooling such as Kustomize to manage configuration. Think of Cloud Run as a Kubernetes Deployment as a service (though really more like Deployment and Service…as a service).
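
As an illustration of that local workflow, a minimal skaffold.yaml might look something like the following; the schema version and manifest paths are assumptions, so adjust them to your project:

# Sketch: build the service image and deploy its manifests to the current
# cluster (e.g., a local K3s) with `skaffold dev`.
apiVersion: skaffold/v2beta29
kind: Config
build:
  artifacts:
    - image: us.gcr.io/my-project/my-service
deploy:
  kubectl:
    manifests:
      - k8s/*.yaml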

“Normal” businesses versus internet-scale businesses

What about cost? Yes, the unit cost in terms of compute is higher with serverless. If you execute enough CPU cycles to fill the capacity of a VM, you are better off renting the whole VM as opposed to effectively renting timeshares of it. But here’s the thing: most “normal” businesses tend to have highly cyclical traffic patterns throughout the day and their scale is generally modest.

What do I mean by “normal” businesses? These are primarily non-internet-scale companies in industries such as insurance, fast food, car rental, construction, or financial services, not Google, Netflix, or Amazon. As a result, these companies can benefit greatly from pay-per-use pricing, and those in the retail space also benefit from the elasticity of this model during periods like Black Friday or promotional campaigns. Brick-and-mortar businesses have traffic that generally follows their operating hours; during off-hours, they can often scale quite literally to zero.

Many of these businesses, for better or worse, treat software development as an IT cost center to be managed. They don’t need—or for that matter, want—the costs and overheads associated with platform-izing Kubernetes. A lot of the companies we interact with fall into this category of “normal” businesses, and I suspect most companies outside of tech do as well.

BYOP—Bring Your Own Platform

I’ve asked it before: is Kubernetes really the end-game abstraction? In my opinion, it’s an implementation detail. I don’t think I’m alone in that opinion. Some companies put a tremendous amount of investment into abstracting Kubernetes from their developers. This is what I mean by “platform-izing” Kubernetes. It typically involves significant and ongoing OpEx investment. The industry has started to coalesce around two concepts that encapsulate this: Platform Engineering and Internal Developer Platform. So while Kubernetes may have become the default container orchestrator, the higher-level pieces—the pieces constituting the Internal Developer Platform—are still very much bespoke. Kelsey Hightower said it best: the majority of people managing infrastructure just want a PaaS. The only requirement: it has to be built by them. That’s a problem.

Imagine having a Kubernetes cluster per Deployment. Full blast radius isolation, complete cost traceability, granular yet simple permissioning. It sounds like a maintenance nightmare though, right? Now imagine those clusters just being hidden from you completely and the Deployment itself is the only thing you interact with and maintain. You just provide your container (or group of containers), configure your CPU and memory requirements, specify the network and resource access, and deploy it. The Deployment manages your load balancing and ingress, automatically scales the pods up and down or canaries traffic, and gives you aggregated logs and metrics out of the box. You only pay for the resources consumed while processing a request. Just a few years ago, this was a futuristic-sounding fantasy.

The platform Kelsey describes above does now exist. In my experience, it’s a nearly ideal solution for those “normal” businesses who are looking to minimize complexity and operational costs and avoid having to bring (more like build) their own platform. I realize GCP is a distant third when it comes to public cloud market share, so this will largely fall on deaf ears, but for those who are still listening: stop wasting time on Kubernetes and just use Cloud Run. Let me expand on the reasons why.

  1. Easily and quickly get started with the cloud. Many of the companies we work with that are still in the midst of migrating to the cloud get stuck in analysis paralysis. Cloud Run isn’t a perfect solution for everything, but it’s good enough for the majority of cases. The rest can be handled as exceptions.

  2. Minimize complexity of cloud environments. Cloud Run does not eliminate the need for infrastructure (there are still caches, queues, databases, and so forth), but it greatly simplifies it. Using managed services for the remaining infrastructure pieces simplifies it further.

  3. Increase the efficiency of your developers and reduce operational costs. Rather than spending most of their time dealing with infrastructure concerns, allow your developers to focus on delivering business value. For most businesses, infrastructure is undifferentiated commodity work. By “outsourcing” large parts of your undifferentiated Internal Developer Platform, you can reallocate developers to product or feature development and reduce operational costs. This allows you to get the benefits of Platform Engineering with a fraction of the maintenance and overhead. Lastly, if you are a “normal” business that doesn’t operate at internet scale and has fairly cyclical traffic, it’s entirely likely Cloud Run will be cheaper than VM-based platforms.

  4. Maintain the flexibility to evolve to a more complex solution over time if needed. This is where traditional serverless platforms and PaaS solutions fall short. Again, with Cloud Run there is no actual vendor lock-in; it’s just a Kubernetes Deployment as a Service. Even without Knative, we can take that Deployment and run it in any Kubernetes cluster. This is a very different paradigm from, say, App Engine, where you wrote your application using App Engine APIs and deployed your service to the App Engine runtime. In this new paradigm, the artifact is a Plain Old Container. There are cases where Cloud Run is not a good fit, such as certain kinds of stateful legacy applications or services with sustained, non-cyclical traffic. We don’t want to be painted into a corner in these situations, so having flexibility is important.

There are similar analogs to Cloud Run on other cloud platforms. For example, AWS has App Runner. However, in my experience these fall short in terms of developer experience, whether because of a lack of investment from the cloud provider or because of environment complexity (as I would argue is the case for AWS). Managed services like Cloud Run are one of the areas where GCP truly excels and differentiates itself.

Just use Cloud Run, seriously

I realize not everyone will be convinced. The gravitational pull of Kubernetes is strong, and as a platform, it’s a safe bet. However, operationalizing Kubernetes properly—whether it’s a managed offering like GKE or not—requires some kind of platform team and ongoing investment. We’ve seen it attempted without this: developers are given clusters, or allowed to spin them up, and left to fend for themselves. This quickly becomes untenable because standards are non-existent, security and compliance are unmanageable, and developer time is split between managing infrastructure and actual feature development.

If your organization is unable or unwilling to make this investment, I urge you to consider Cloud Run. There’s still work needed on the periphery to properly operationalize it, such as implementing CI/CD pipelines and managing accessory infrastructure, but it’s a much lower investment. Additionally, it provides an escape hatch: unlike App Engine or traditional PaaS solutions, there is no real switching cost if you need to move to Kubernetes in the future. With Cloud Run, serverless has finally reached a tipping point where it’s viable for the majority of workloads rather than a niche subset. Unlike Kubernetes, it provides the right level of abstraction for most businesses building software. In my opinion, serverless is still not taken seriously because of preconceived notions, but it’s time to start reevaluating those notions.

Agree? Disagree? I’d love to hear your thoughts. If you’re an organization that would like to do cloud differently or are looking for the playbook to operationalize Google Cloud Platform, please get in touch.