Serverless on GCP

Like many other marketing buzzwords, the concept of “serverless” has taken on a life of its own, which can make it difficult to understand what serverless actually means. What it really means is that the cloud provider fully manages server infrastructure all the way up to the application layer. For example, GCE isn’t serverless because, while Google manages the physical server infrastructure, we still have to deal with patching operating systems, managing load balancers, configuring firewall rules, and so on. Serverless means we merely worry about our application code and business logic and nothing else. This concept extends beyond pure compute though, including things like databases, message queues, stream processing, machine learning, and other types of systems.

There are several benefits to the serverless model. First, it allows us to focus on building products, not managing infrastructure. These operations-related tasks, while important, are not generally things that differentiate a business. It’s just work that has to be done to support the rest of the business. With cloud—and serverless in particular—many of these tasks are becoming commoditized, freeing us up to focus on things that matter to the business.

Another benefit related to the first is that serverless systems provide automatic scaling and fault-tolerance across multiple data centers or, in some cases, even globally. When we leverage GCP’s serverless products, we also leverage Google’s operational expertise and the experience of an army of SREs. That’s a lot of leverage. Few companies are able to match the kind of investment cloud providers like Google or Amazon are able to make in infrastructure and operations, nor should they. If it’s not your core business, leverage economies of scale.

Finally, serverless allows us to pay only for what we use. This is quite a bit different from what traditional IT companies are used to where it’s more common to spend several millions of dollars on a large solution with a contract. It’s also different from what many cloud-based companies are used to where you typically provision some baseline capacity and pay for bursts of additional capacity as needed. With serverless, VMs are eschewed and we pay only for the resources we use to serve the traffic we have. This means no more worrying about over-provisioning or under-provisioning.

GCP’s Compute Options

GCP has a comprehensive set of compute options ranging from minimally managed VMs all the way to highly managed serverless backends. Below is the full spectrum of GCP’s compute services at the time of this writing. I’ll provide a brief overview of each of these services just to get the lay of the land. We’ll start from the highest level of abstraction and work our way down, and then we’ll home in on the serverless solutions.

GCP compute platforms

Firebase is Google’s managed Backend as a Service (BaaS) platform. This is the highest level of abstraction that GCP offers (short of SaaS like G Suite) and allows you to build mobile and web applications quickly and with minimal server-side code. For example, it can implement things like user authentication and offline data syncing for you. The “backend as a service” label reflects the fact that you write little or no server code yourself. The trade-off is you have less control over the system, but it can be a great fit for quickly prototyping applications or building a proof of concept with minimal investment. The primary advantage is that you can focus most of your development effort on client-side application code and user experience. Note that some components of Firebase can be used outside of the Firebase platform, such as Cloud Firestore and Firebase Authentication.

Cloud Functions is a serverless Functions as a Service (FaaS) offering from GCP. You upload your function code and Cloud Functions handles running it. Because it’s a sandboxed environment, there are some restrictions on the runtime, but it’s a great choice for building event-driven services and connecting systems together. Because it is a serverless platform, Cloud Functions are highly elastic and have minimal operational overhead. They are an excellent choice for dynamic, event-driven plumbing such as moving data between services or reacting to log events. They also work well for basic user-facing APIs, but the operational tooling is not sufficient for complex systems, and things can rapidly become operationally complex with more than a few endpoints.

App Engine is Google’s Platform as a Service (PaaS). Like Cloud Functions, it’s an opinionated but fully managed runtime that lets you upload your application code while handling the operational aspects such as autoscaling and fault-tolerance. App Engine has two modes: Standard, which is the opinionated PaaS runtime, and Flexible, which allows providing a custom runtime using a container—this is colloquially referred to as a Container as a Service (CaaS). For stateless applications with quick instance start-up times, it is often an excellent choice. It offers many of the benefits of Cloud Functions but simplifies operational aspects since larger components are easy to deploy and manage. App Engine allows developers to focus most of their effort on business logic. Standard is a great fit for greenfield applications where server-side processing and logic is required. Flex can be easier for migrating existing workloads because it is less opinionated.

Cloud Run is a new offering in GCP that provides a managed compute platform for stateless containers. Essentially, Google manages the underlying compute infrastructure and all you have to do is provide them an application container. Like App Engine, they handle scaling instances up and down, load balancing, and fault-tolerance. Cloud Run actually has two modes: the Google-managed version, which runs your containers on Google’s internal compute infrastructure known as Borg, and the GKE version, which allows running workloads on your own GKE cluster. This is because Cloud Run is built on an open source Kubernetes platform for serverless workloads called Knative.

Cloud Run and App Engine Flex are similar to each other, but there are some nuanced differences. One key difference is Cloud Run has very fast instance start-up time due to its reliance on the gVisor container runtime. Flex instances, on the other hand, usually take minutes to start because they involve provisioning GCE instances, load balancers, and other GCP-managed infrastructure. Flex is also more feature-rich than Cloud Run, supporting things like traffic splitting, deployment rollbacks, WebSocket connections, and VPC connections.

Kubernetes Engine, or GKE, is Google’s managed Kubernetes service. GKE effectively adds a container orchestration layer on top of GCE, putting it somewhere between IaaS (Infrastructure as a Service) and CaaS. This is typically the lowest level of abstraction most modern applications should require. There is still a lot of operational overhead involved with using a managed Kubernetes service.

Lastly, Compute Engine, or GCE, is Google’s VM offering. GCE VMs are usually run on multi-tenant hosts, but GCP also offers sole-tenant nodes where a physical Compute Engine server is dedicated to hosting a single customer’s VMs. This is the lowest level of infrastructure that GCP offers and the lowest common denominator generally available in the public clouds, usually referred to as IaaS. This means there are a lot of operational responsibilities that come with using it. There are generally few use cases that demand a bare VM.

Choosing a Serverless Option

Now that we have an overview of GCP’s compute services, we can focus in on the serverless options.

GCP serverless compute platforms

GCP currently has four serverless compute options (emphasis on compute because there are other serverless offerings for things like databases, queues, and so forth, but these are out of scope for this discussion).

  • Cloud Run: serverless containers (CaaS)
  • App Engine: serverless platforms (PaaS)
  • Cloud Functions: serverless functions (FaaS)
  • Firebase: serverless applications (BaaS)

With four different serverless options to choose from, how do we decide which one is right? The first thing to point out is that we don’t necessarily need to choose a single solution. We might end up using a combination of these services when building a system. However, I’ve provided some criteria below on selecting solutions for different types of problems.

Firebase

If you’re looking to quickly prototype an application or focus only on writing code, Firebase can be a good fit. This is especially true if you want to focus most of your investment and time on the client-side application code and user experience. Likewise, if you want to build a mobile-ready application and don’t want to implement things like user authentication, it’s a good option.

Firebase is obviously the most restrictive and opinionated solution, but it’s great for rapid prototyping and accelerating development of an MVP. You can also complement it with services like App Engine or Cloud Functions for situations that require server-side compute.

Good Fit Characteristics

  • Mobile-first (or ready) applications
  • Rapidly prototyping applications
  • Applications where most of the logic is (or can be) client-side
  • Using Firebase components on other platforms, such as using Cloud Firestore or Firebase Authentication on App Engine, to minimize investment in non-differentiating work

Bad Fit Characteristics

  • Applications requiring complex server-side logic or architectures
  • Applications which require control over the runtime

Cloud Functions

If you’re looking to react to real-time events, glue systems together, or build a simple API, Cloud Functions are a good choice provided you’re able to use one of the supported runtimes (Node.js, Python, and Go). If the runtime is a limitation, check out Cloud Run.

Good Fit Characteristics

  • Event-driven applications and systems
  • “Glueing” systems together
  • Deploying simple APIs

Bad Fit Characteristics

  • Highly stateful systems
  • Deploying large, complex APIs
  • Systems that require a high level of control or need custom runtimes or binaries

App Engine

If you’re looking to deploy a full application or complex API, App Engine is worth looking at. Standard is good for greenfield applications which are able to fit within the constraints of the runtime. It can scale to zero and deploys take seconds. Flexible is easier for existing applications where you’re unwilling or unable to make changes fitting them into Standard. Deploys to Flex can take minutes, and you must have a minimum of one instance running at all times.

Good Fit Characteristics

  • Stateless applications
  • Rapidly developing CRUD-heavy applications
  • Applications composed of a few services
  • Deploying complex APIs

Bad Fit Characteristics

  • Stateful applications that require lots of in-memory state to meet performance or functional requirements
  • Applications built with large or opinionated frameworks or applications that have slow start-up times (this can be okay with Flex)
  • Systems that require protocols other than HTTP

Cloud Run

If you’re looking to react to real-time events but need custom runtimes or binaries not supported by Cloud Functions, Cloud Run is a good choice. It’s also a good option for building stateless HTTP-based web services. It’s trimmed down compared to App Engine Flex, which means it has fewer features, but it also has faster instance start-up times, can scale to zero, and is billed only by actual request-processing time rather than instance time. 

Good Fit Characteristics

  • Stateless services that are easily containerized
  • Event-driven applications and systems
  • Applications that require custom system and language dependencies

Bad Fit Characteristics

  • Highly stateful systems or systems that require protocols other than HTTP
  • Compliance requirements that demand strict controls over the low-level environment and infrastructure (might be okay with the Knative GKE mode)

Finally, Google also provides a decision tree for choosing a serverless compute platform.

* App Engine standard environment supports Node.js, Python, Java, Go, PHP
* Cloud Functions supports Node.js, Python, Go

Summary

Going serverless can provide a lot of efficiencies by freeing up resources and investment to focus on things that are more strategic and differentiating for a business rather than commodity infrastructure. There are trade-offs when using managed services and serverless solutions. We lose some control and visibility. At certain usage levels there can be a premium, so eventually renting VMs might be the more cost-effective solution once you crack that barrier. However, it’s important to consider not just operational costs involved in managing infrastructure, but also opportunity costs. These trade-offs have to be weighed carefully against the benefits they bring to the business.

One thing worth pointing out is that it’s often easier to move down a level of abstraction than up. That is, there’s typically less friction involved in moving from a more opinionated platform to a less opinionated one than vice versa. This is why we usually suggest starting with the highest level of abstraction possible and dropping down if and when needed.

Security by Happenstance

Key rotation, auditing, and secure CI/CD

Companies often require employees to regularly change their passwords for security purposes. PCI compliance, for example, requires that passwords be changed every 90 days. However, NIST, whose guidelines commonly become the foundation for security best practices across countless organizations, recently revised its recommendations around password security. Its Digital Identity Guidelines (NIST 800-63-3) now recommend removing periodic password-change requirements due to a growing body of research suggesting that frequent password changes actually make security worse. This is because these requirements encourage the use of passwords which are more susceptible to cracking (e.g. incrementing a number or altering a single character) or result in people writing their passwords down.

Unfortunately, many companies have now adapted these requirements to other parts of their IT infrastructure. This is largely due to legacy holdover practices which have crept into modern systems (or simply lingered in older ones), i.e. it’s tech debt. Specifically, I’m talking about practices like using username/password credentials that applications or systems use to access resources instead of individual end users. These special credentials may even provide a system free rein within a network much like a user might have, especially if the network isn’t segmented (often these companies have adopted a perimeter-security model, relying on a strong outer wall to protect their network). As a result, because they are passwords just like a normal user would have, they are subject to the usual 90-day rotation policy or whatever the case may be.

Today, I think we can say with certainty that—along with the perimeter-security model—relying on usernames and passwords for system credentials is a security anti-pattern (and really, user credentials should be relying on multi-factor authentication). With protocols like OAuth2 and OpenID Connect, we can replace these system credentials with cryptographically strong keys. But because these keys, in a way, act like username/passwords, there is a tendency to apply the same 90-day rotation policy to them as well. This is a misguided practice for several reasons and is actually quite risky.

First, changing a user’s password is far less risky than rotating an access key for a live, production system. If we’re changing keys for production systems frequently, there is a potential for prolonged outages. The more you’re touching these keys, the more exposure and opportunity for mistakes there is. For a user, the worst case is they get temporarily locked out. For a system, the worst case is a critical user-facing application goes down. Second, cryptographically strong keys are not “guessable” like a password frequently is. Since they are generated by an algorithm and not intended to be input by a human, they are long and complex. And unlike passwords, keys are not generally susceptible to social engineering. Lastly, if we are requiring keys to be rotated every 90 days, this means an attacker can still have up to 89 days to do whatever they want in the event of a key being compromised. From a security perspective, this frankly isn’t good enough to me. It’s security by happenstance. The Twitter thread below describes a sequence of events that occurred after an AWS key was accidentally leaked to a public code repository which illustrates this point.

To recap that thread, here’s a timeline of what happened:

  1. AWS credentials are pushed to a public repository on GitHub.
  2. 55 seconds later, an email is received from AWS telling the user that their account is compromised and a support ticket is automatically opened.
  3. A minute later (2 minutes after the push), an attacker attempts to use the credentials to list IAM access keys in order to perform a privilege escalation. Since the IAM role attached to the credentials is insufficient, the attempt fails and an event is logged in CloudTrail.
  4. The user disables the key 5 minutes and 58 seconds after the push.
  5. 24 minutes and 58 seconds after the push, GuardDuty fires a notification indicating anomalous behavior: “APIs commonly used to discover the users, groups, policies and permissions in an account, was invoked by IAM principal some_user under unusual circumstances. Such activity is not typically seen from this principal.”

Given this timeline, rotating access keys every 90 days would do absolutely no good. If anything, it would provide a false sense of security. An attack was made a mere 2 minutes after the key was compromised. It makes no difference if it’s rotated every 90 days or every 9 minutes.

If 90-day key rotation isn’t the answer, what is? The timeline above already hits on it. System credentials, i.e. service accounts, should have very limited permissions following the principle of least privilege. For instance, a CI server which builds artifacts should have a service account which only allows it to push artifacts to a storage bucket and nothing else. This idea should be applied to every part of your system.

For things running inside the cloud, such as AWS or GCP, we can usually avoid the need for access keys altogether. With GCP, we rely on service accounts with GCP-managed keys. The keys for these service accounts are not exposed to users at all and are, in fact, rotated approximately every two weeks (Google is able to do this because they own all of the infrastructure involved and have mature automation). With AWS, we rely on Identity and Access Management (IAM) users and roles. The role can then be assumed by the environment without having to deal with a token or key. This situation is ideal because we can avoid key exposure by never having explicit keys in the first place.

For things running outside the cloud, it’s a bit more involved. In these cases, we must deal with credentials somehow. Ideally, we can limit the lifetime of these credentials, such as with AWS’ Security Token Service (STS) or GCP’s short-lived service account credentials. However, in some situations, we may need longer-lived credentials. In either case, the critical piece is using limited-privilege credentials such that if a key is compromised, the scope of the damage is narrow.

The other key component of this is auditing. Both AWS and GCP offer extensive audit logs for governance, compliance, operational auditing, and risk auditing of your cloud resources. With this, we can audit service account usage, detect anomalous behavior, and immediately take action—such as revoking the credential—rather than waiting up to 90 days to rotate it. Amazon also has GuardDuty which provides intelligent threat detection and continuous monitoring which can identify unauthorized activity as seen in the scenario above. Additionally, access credentials and other secrets should never be stored in source code, but tools like git-secrets, GitGuardian, and truffleHog can help detect when it does happen.

Let’s look at a hypothetical CI/CD pipeline as an example which ties these ideas together. Below is the first pass of our proposed pipeline. In this case, we’re targeting GCP, but the same ideas apply to other environments.

CircleCI is a SaaS-based CI/CD solution. Because it’s deploying to GCP, it will need a service account with the appropriate IAM roles. CircleCI has support for storing secret environment variables, which is how we would store the service account’s credentials. However, there are some downsides to this approach.

First, the service account that Circle needs in order to make deploys could require a fairly wide set of privileges, like accessing a container registry and deploying to a runtime. Because it lives outside of GCP, this service account has a user-managed key. While we could use a KMS to encrypt it or a vault that provides short-lived credentials, we ultimately will need some kind of credential that allows Circle to access these services, so at best we end up with a weird Russian-doll situation. If we’re rotating keys, we might wind up having to do so recursively, and the value of all this indirection starts to come into question. Second, these credentials—or any other application secrets—could easily be dumped out as part of the build script. This isn’t good if we wanted Circle to deploy to a locked-down production environment. Developers could potentially dump out the production service account credentials and now they would be able to make deploys to that environment, circumventing our pipeline.

This is why splitting out Continuous Integration (CI) from Continuous Delivery (CD) is important. If, instead, Circle was only responsible for CI and we introduced a separate component for CD, such as Spinnaker, we can solve this problem. Using this approach, now Circle only needs the ability to push an artifact to a Google Cloud Storage bucket or Container Registry. Outside of the service account credentials needed to do this, it doesn’t need to deal with secrets at all. This means there’s no way to dump out secrets in the build because they will be injected later by Spinnaker. The value of the service account credentials is also much more limited. If compromised, it only allows someone to push artifacts to a repository. Spinnaker, which would run in GCP, would then pull secrets from a vault (e.g. Hashicorp’s Vault) and deploy the artifact relying on credentials assumed from the environment. Thus, Spinnaker only needs permissions to pull artifacts and secrets and deploy to the runtime. This pipeline now looks something like the following:

With this pipeline, we now have traceability from code commit and pull request (PR) to deploy. We can then scan audit logs to detect anomalous behavior—a push to an artifact repository that is not associated with the CircleCI service account or a deployment that does not originate from Spinnaker, for example. Likewise, we can ensure these processes correlate back to an actual GitHub PR or CircleCI build. If they don’t, we know something fishy is going on.
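As a rough illustration of the audit check described above, the sketch below flags artifact pushes and deployments that were made by an unexpected principal. The record shape (`action`, `principal`), the action names, and the service account emails are all hypothetical simplifications; real Cloud Audit Logs entries are nested differently and would need to be mapped into this form first.

```python
# Hypothetical mapping of sensitive actions to the only principal that
# should ever perform them in the pipeline described above.
EXPECTED_PRINCIPALS = {
    "artifact.push": "circleci@example-project.iam.gserviceaccount.com",
    "runtime.deploy": "spinnaker@example-project.iam.gserviceaccount.com",
}


def find_anomalies(entries, expected=EXPECTED_PRINCIPALS):
    """Return entries whose action was performed by an unexpected principal."""
    anomalies = []
    for entry in entries:
        allowed = expected.get(entry["action"])
        # Only actions we have an expectation for are checked.
        if allowed is not None and entry["principal"] != allowed:
            anomalies.append(entry)
    return anomalies


# Simplified, made-up audit records for illustration only.
entries = [
    {"action": "artifact.push",
     "principal": "circleci@example-project.iam.gserviceaccount.com"},
    {"action": "runtime.deploy",
     "principal": "dev-laptop@example-project.iam.gserviceaccount.com"},
]
suspicious = find_anomalies(entries)
```

Here the second record would be flagged, since the deployment did not originate from the Spinnaker service account. In practice a check like this would feed an alert or trigger revocation of the credential.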

To summarize, requiring frequent rotations of access keys is an outdated practice. It’s a remnant of password policies which themselves have been increasingly rejected by security experts. While similar in some ways, keys are fundamentally different from a username and password, particularly in the case of a service account with fine-grained permissions. Without mature practices and automation, rotating these keys frequently is an inherently risky operation that opens up the opportunity for downtime.

Instead, it’s better to rely on tightly scoped (and, if possible, short-lived) service accounts and usage auditing to detect abnormal behavior. This allows us to take action immediately rather than waiting for some arbitrary period to rotate keys where an attacker may have an unspecified amount of time to do as they please. With end-to-end traceability and evidence collection, we can more easily identify suspicious actions and perform forensic analysis.

Note that this does not mean we should never rotate access keys. Rather, we can turn to NIST for its guidance on key management. NIST 800-57 recommends cryptoperiods of 1-2 years for asymmetric authentication keys in order to maximize operational efficiency. Beyond these particular cryptoperiods, the value of rotating keys regularly is in having the confidence you can, in fact, rotate them without incident. The time interval itself is mostly immaterial, but developing this confidence is important in the event of a key actually being compromised. In this case, you want to know you can act swiftly and revoke access without causing outages.

The funny thing about compliance is that, unless you’re going after actual regulatory standards such as FedRAMP or PCI compliance, controls are generally created by the company itself. Compliance auditors mostly ensure the company is following its own controls. So if you hear, “it’s a compliance requirement” or “that’s the way it’s always been done,” try to dig deeper to understand what risk the control is actually trying to mitigate. This allows you to have a dialog with InfoSec or compliance folks and possibly come to the table with better alternatives.

Authenticating Stackdriver Uptime Checks for Identity-Aware Proxy

Google Stackdriver provides a set of tools for monitoring and managing services running in GCP, AWS, or on-prem infrastructure. One feature Stackdriver has is “uptime checks,” which enable you to verify the availability of your service and track response latencies over time from up to six different geographic locations around the world. While Stackdriver uptime checks are not as feature-rich as other similar products such as Pingdom, they are also completely free. For GCP users, this provides a great starting point for quickly setting up health checks and alerting for your applications.

Last week I looked at implementing authentication and authorization for APIs in GCP using Cloud Identity-Aware Proxy (IAP). IAP provides an easy way to implement identity and access management (IAM) for applications and APIs in a centralized place. However, one thing you will bump into when using Stackdriver uptime checks in combination with IAP is authentication. For App Engine in particular, this can be a problem since there is no way to bypass IAP. All traffic, both internal and external to GCP, goes through it. Until Cloud IAM Conditions is released and generally available, there’s no way to—for example—open up a health-check endpoint with IAP.

While uptime checks have support for Basic HTTP authentication, there is no way to script more sophisticated request flows (e.g. to implement the OpenID Connect (OIDC) authentication flow for IAP-protected resources) or implement fine-grained IAM policies (as hinted at above, this is coming with IAP Context-Aware Access and IAM Conditions). So are we relegated to using Nagios or some other more complicated monitoring tool? Not necessarily. In this post, I’ll present a workaround solution for authenticating Stackdriver uptime checks for systems protected by IAP using Google Cloud Functions.

The Solution

The general strategy is to use a Cloud Function which can authenticate with IAP using a service account to proxy uptime checks to the application. Essentially, the proxy takes a request from a client, looks for a header containing a host, forwards the request to that host after performing the necessary authentication, and then forwards the response back to the client. The general architecture of this is shown below.

There are some trade-offs with this approach. The benefit is we get to rely on health checks that are fully managed by GCP and free of charge. Since Cloud Functions are also managed by GCP, there’s no operations involved beyond deploying the proxy and setting it up. The first two million invocations per month are free for Cloud Functions. If we have an uptime check running every five minutes from six different locations, that’s approximately 52,560 invocations per month. This means we could run roughly 38 different uptime checks without exceeding the free tier for invocations. In addition to invocations, the free tier offers 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month. Using the GCP pricing calculator, we can estimate the cost for our uptime check. It generally won’t come close to exceeding the free tier.
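The invocation estimate above can be reproduced with a few lines of arithmetic. This sketch assumes an average month of 365 / 12 days, which is what yields the 52,560 figure; the other numbers come straight from the text.

```python
# Figures taken from the text: one check every five minutes from six
# locations, and two million free invocations per month.
CHECK_INTERVAL_MINUTES = 5
LOCATIONS = 6
FREE_INVOCATIONS_PER_MONTH = 2_000_000

# Assumption: an average month is 365 / 12 days long (43,800 minutes).
MINUTES_PER_MONTH = 365 * 24 * 60 // 12

# Each location triggers one invocation of the proxy per interval.
invocations_per_check = MINUTES_PER_MONTH // CHECK_INTERVAL_MINUTES * LOCATIONS

# Number of distinct uptime checks that fit within the free invocations.
max_free_checks = FREE_INVOCATIONS_PER_MONTH // invocations_per_check
```

With these inputs, `invocations_per_check` is 52,560 and `max_free_checks` is 38. Note this only covers the invocation count, not the compute time or egress portions of the free tier.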

The downside to this approach is the check is no longer validating availability from the perspective of an end user. Because the actual service request is originating from Google’s infrastructure by way of a Cloud Function as opposed to Stackdriver itself, it’s not quite the same as a true end-to-end check. That said, both Cloud Functions and App Engine rely on the same Google Front End (GFE) infrastructure, so as long as both the proxy and App Engine application are located in the same region, this is probably not all that important. Besides, for App Engine at least, the value of the uptime check is really more around performing a full-stack probe of the application and its dependencies than monitoring the health of Google’s own infrastructure. That is one of the goals behind using managed services after all. The bigger downside is that the latency reported by the uptime check no longer accurately represents the application. It can still be useful for monitoring aggregate trends nonetheless.

The Implementation Setup

I’ve built an open-source implementation of the proxy as a Cloud Function in Python called gcp-oidc-proxy. It’s runnable out of the box without any modification. We’ll assume you have an IAP-protected application you want to set up a Stackdriver uptime check for. To deploy the proxy Cloud Function, first clone the repository to your machine, then from there run the following gcloud command:

$ gcloud functions deploy gcp-oidc-proxy \
   --runtime python37 \
   --entry-point handle_request \
   --trigger-http

This will deploy a new Cloud Function called gcp-oidc-proxy to your configured cloud project. It will assume the project’s default service account. Ordinarily, I would suggest creating a separate service account to limit scopes. This can be configured on the Cloud Function with the --service-account flag, which is under gcloud beta functions deploy at the time of this writing. However, we’ll omit this step for brevity.

Next, we need to add the “Service Account Actor” IAM role to the Cloud Function’s service account since it will need it to sign JWTs (more on this later). In the GCP console, go to IAM & admin, locate the appropriate service account (in this case, the default service account), and add the respective role.

The Cloud Function’s service account must also be added as a member to the IAP with the “IAP-secured Web App User” role in order to properly authenticate. Navigate to Identity-Aware Proxy in the GCP console, select the resource you wish to add the service account to, then click Add Member.

Find the OAuth2 client ID for the IAP by clicking on the options menu next to the IAP resource and selecting “Edit OAuth client.” Copy the client ID on the next page and then navigate to the newly deployed gcp-oidc-proxy Cloud Function. We need to configure a few environment variables, so click Edit and then expand More at the bottom of the page. We’ll add four environment variables: CLIENT_ID, WHITELIST, AUTH_USERNAME, and AUTH_PASSWORD.

CLIENT_ID contains the OAuth2 client ID we copied for the IAP. WHITELIST contains a comma-separated list of URL paths to make accessible or * for everything (I’m using /ping in my example application), and AUTH_USERNAME and AUTH_PASSWORD set up Basic authentication for the Cloud Function. If these are omitted, authentication is disabled.
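To make the role of these four variables concrete, here is a minimal sketch of how a function might read them. It is not taken from gcp-oidc-proxy; the `load_config` helper and the returned dictionary layout are hypothetical, and treating authentication as disabled unless both credentials are set is an assumption.

```python
import os


def load_config(environ=os.environ):
    """Read the proxy settings described above from environment variables."""
    raw_whitelist = environ.get("WHITELIST", "")
    # Comma-separated paths, ignoring blanks and surrounding whitespace.
    whitelist = [p.strip() for p in raw_whitelist.split(",") if p.strip()]
    username = environ.get("AUTH_USERNAME")
    password = environ.get("AUTH_PASSWORD")
    return {
        "client_id": environ.get("CLIENT_ID"),
        "whitelist": whitelist,
        # Assumption: Basic auth is only enforced when both values are set.
        "auth": (username, password) if username and password else None,
    }


# Example with a plain dict standing in for the real environment.
config = load_config({"CLIENT_ID": "example-client-id", "WHITELIST": "/ping, /health"})
```

Passing the environment in as a parameter keeps the helper easy to exercise without touching real process state.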

Save the changes to redeploy the function with the new environment variables. Next, we’ll set up a Stackdriver uptime check that uses the proxy to call our service. In the GCP console, navigate to Monitoring then Create Check from the Stackdriver UI. Skip any suggestions for creating a new uptime check. For the hostname, use the Cloud Function host. For the path, use /gcp-oidc-proxy/<your-endpoint>. The proxy will use the path to make a request to the protected resource.

Expand Advanced Options to set the Forward-Host to the host protected by IAP. The proxy uses this header to forward requests. Lastly, we’ll set the authentication username and password that we configured on the Cloud Function.

Click “Test” to ensure our configuration works and the check passes.

The Implementation Details

The remainder of this post will walk you through the implementation details of the proxy. The implementation closely resembles what we did to authenticate API consumers using a service account. We use a header called Forward-Host to allow the client to specify the IAP-authenticated host to forward requests to. If the header is not present, we just return a 400 error. We then use this host and the path of the original request to construct the proxy request and retain the HTTP method and headers (with the exception of the Host header, if present, since this can cause problems).
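The request-construction step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from gcp-oidc-proxy: build_proxy_request is a hypothetical helper, and the https scheme is assumed because IAP fronts HTTPS resources. A return value of None stands for the case where the caller would answer with a 400.

```python
def build_proxy_request(method, path, headers, query_string=""):
    """Return (method, url, headers) for the upstream call.

    Returns None when the Forward-Host header is missing, in which case
    the caller would respond with a 400 error.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    forward_host = lowered.get("forward-host")
    if not forward_host:
        return None
    # Assumption: the protected resource is always reached over HTTPS.
    url = "https://{}{}".format(forward_host, path)
    if query_string:
        url += "?" + query_string
    # Retain the original headers, except Host, which would otherwise
    # still name the Cloud Function rather than the protected service.
    forwarded = {
        name: value
        for name, value in headers.items()
        if name.lower() != "host"
    }
    return method, url, forwarded
```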

Before sending the request, we perform the authentication process by generating a JWT signed by the service account and exchange it for a Google-signed OIDC token.

We can cache this token and renew it only once it expires. Then we set the Authorization header with the OIDC token and send the request.
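A small cache along these lines is one way to renew the token only when needed. The TokenCache class and the fetch_token callable are hypothetical names. fetch_token is assumed to perform the JWT exchange and return the token together with its expiry as a Unix timestamp. The 60-second leeway is an arbitrary choice.

```python
import time


class TokenCache:
    """Hold one OIDC token and refresh it only when it is about to expire."""

    def __init__(self, fetch_token, leeway_seconds=60, clock=time.time):
        # fetch_token is a hypothetical callable returning (token, expires_at).
        self._fetch_token = fetch_token
        self._leeway = leeway_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh slightly early so a token does not expire mid-request.
        if self._token is None or self._clock() >= self._expires_at - self._leeway:
            self._token, self._expires_at = self._fetch_token()
        return self._token

    def authorization_header(self):
        """Return the header to attach to the proxied request."""
        return {"Authorization": "Bearer {}".format(self.get())}
```

Because Cloud Function instances can be reused between invocations, a module-level TokenCache would let warm instances skip the exchange. Cold starts would still fetch a fresh token.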

We simply forward on the resulting content body, status code, and headers. We strip HTTP/1.1 “hop-by-hop” headers since these are unsupported by WSGI and Python Cloud Functions rely on Flask. We also strip any Content-Encoding header since this can also cause problems.
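The header filtering can be illustrated with the standard library, which already knows the hop-by-hop header names. This is a sketch rather than the proxy’s actual code, and filter_response_headers is a hypothetical name.

```python
from wsgiref.util import is_hop_by_hop


def filter_response_headers(headers):
    """Drop headers that should not be relayed back to the caller.

    headers is an iterable of (name, value) pairs from the upstream response.
    """
    return [
        (name, value)
        for name, value in headers
        # Hop-by-hop headers (Connection, Transfer-Encoding, ...) apply to a
        # single connection only. Content-Encoding is dropped as well, as
        # described above.
        if not is_hop_by_hop(name) and name.lower() != "content-encoding"
    ]
```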

Because this proxy allows clients to call into endpoints unauthenticated, we also implement a whitelist to expose only certain endpoints. The whitelist is a list of allowed paths passed in from an environment variable. Alternatively, we can whitelist * to allow all paths. Wildcarding could be implemented to make this even more flexible. We also implement a Basic auth decorator which is configured with environment variables since we can set up uptime checks with a username and password in Stackdriver.
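The two checks might look roughly like this. Both function names are hypothetical, and the Basic auth check is written as a plain function rather than the decorator the proxy uses, to keep the sketch free of Flask. Exact path matching is assumed, since wildcarding is not implemented.

```python
import base64
import hmac


def is_path_allowed(path, whitelist):
    """True if the path is listed exactly, or the whitelist contains '*'."""
    return "*" in whitelist or path in whitelist


def check_basic_auth(authorization_header, username, password):
    """Validate the value of an 'Authorization: Basic ...' header."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        encoded = authorization_header[len("Basic "):]
        decoded = base64.b64decode(encoded).decode("utf-8")
    except ValueError:
        # Covers malformed base64 as well as undecodable bytes.
        return False
    supplied_user, separator, supplied_password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking the match position through timing.
    user_ok = hmac.compare_digest(supplied_user.encode(), username.encode())
    pass_ok = hmac.compare_digest(supplied_password.encode(), password.encode())
    return user_ok and pass_ok
```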

The only other code worth looking at in detail is how we set up the service account credentials and IAM Signer. A Cloud Function has a service account attached to it which allows it to assume the roles of that account. Cloud Functions rely on the Google Compute Engine metadata server which stores service account information among other things. However, the metadata server doesn’t expose the service account key used to sign the JWT, so instead we must use the IAM signBlob API to sign JWTs.
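Since the private key never leaves Google’s infrastructure, the JWT has to be assembled locally and only the signature obtained remotely. The sketch below shows that split. assemble_jwt and sign_bytes are hypothetical names, and sign_bytes stands in for whatever wraps the IAM signBlob call. The remote call itself is deliberately left out.

```python
import base64
import json


def _b64url(data):
    # JWTs use URL-safe base64 with the trailing padding removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def assemble_jwt(claims, sign_bytes):
    """Build a compact RS256 JWT, delegating the signature to sign_bytes.

    sign_bytes is a hypothetical callable (bytes -> bytes). In a Cloud
    Function it would wrap a call to the IAM signBlob API. With a local
    key file it would sign with the private key directly.
    """
    header = {"alg": "RS256", "typ": "JWT"}
    signing_input = b".".join([
        _b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        _b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8")),
    ])
    # Only these bytes are sent for signing. The key itself is never seen.
    signature = sign_bytes(signing_input)
    return (signing_input + b"." + _b64url(signature)).decode("ascii")
```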

Conclusion

It’s not a particularly simple solution, but it gets the job done. The setup of the Cloud Function could definitely be scripted as well. Once IAM Conditions is generally available, it should be possible to expose certain endpoints in a way that is accessible to Stackdriver without the need for the OIDC proxy. That said, it’s not clear if there is a way to implement uptime checks without exposing an endpoint at all since there is currently no way to assign a service account to a check. Ideally, we would be able to assign a service account and use that with IAP Context-Aware Access to allow the uptime check to access protected endpoints.

API Authentication with GCP Identity-Aware Proxy

Cloud Identity-Aware Proxy (Cloud IAP) is a free service which can be used to implement authentication and authorization for applications running in Google Cloud Platform (GCP). This includes Google App Engine applications as well as workloads running on Compute Engine (GCE) VMs and Google Kubernetes Engine (GKE) by way of Google Cloud Load Balancers.

When enabled, IAP requires users accessing a web application to log in using their Google account and ensures they have the appropriate role to access the resource. This can be used to provide secure access to web applications without the need for a VPN. This is part of what Google now calls BeyondCorp, which is an enterprise security model designed to enable employees to work from untrusted networks without a VPN. At Real Kinetic, we frequently bump into companies practicing Death-Star security, which is basically relying on a hard outer shell to protect a soft, gooey interior. It’s simple and easy to administer, but it’s also vulnerable. That’s why we always approach security from a perspective of defense in depth.

However, in this post I want to explore how we can use Cloud IAP to implement authentication and authorization for APIs in GCP. Specifically, I will use App Engine, but the same applies to resources behind an HTTPS load balancer. The goal is to provide a way to securely expose APIs in GCP which can be accessed programmatically.

Configuring Identity-Aware Proxy

Cloud IAP supports authenticating service accounts using OpenID Connect (OIDC). A service account belongs to an application instead of an individual user. You authenticate a service account when you want to allow an application to access your IAP-secured resources. A GCP service account can either have GCP-managed keys (for systems that reside within GCP) or user-managed keys (for systems that reside outside of GCP). GCP-managed keys cannot be downloaded and are automatically rotated and used for signing for a maximum of two weeks. User-managed keys are created, downloaded, and managed by users and expire 10 years from creation. As such, key rotation must be managed by the user as appropriate. In either case, access using a service account can be revoked either by revoking a particular key or removing the service account itself.

An IAP is associated with an App Engine application or HTTPS Load Balancer. One or more service accounts can then be added to an IAP to allow programmatic authentication. When the IAP is off, the resource is accessible to anyone with the URL. When it’s on, it’s only accessible to members who have been granted access. This can include specific Google accounts, groups, service accounts, or a general G Suite domain.

IAP will create an OAuth2 client ID for OIDC authentication which can be used by service accounts. But in order to access our API using a service account, we first need to add it to IAP with the appropriate role. We’ll add it as an IAP-secured Web App User, which allows access to HTTPS resources protected by IAP. In this case, my service account is called “IAP Auth Test,” and the email associated with it is iap-auth-test@rk-playground.iam.gserviceaccount.com.

As you can see, both the service account and my user account are IAP-secured Web App Users. This means I can access the application using my Google login or using the service account credentials. Next, we’ll look at how to properly authenticate using the service account.

Authenticating API Consumers

When you create a service account key in the GCP console, it downloads a JSON credentials file to your machine. The API consumer needs the service account credentials to authenticate. The diagram below illustrates the general architecture of how IAP authenticates API calls to App Engine services using service accounts.

In order to make a request to the IAP-authenticated resource, the consumer generates a JWT signed using the service account credentials. The JWT contains an additional target_audience claim containing the OAuth2 client ID from the IAP. To find the client ID, click on the options menu next to the IAP resource and select “Edit OAuth client.” The client ID will be listed on the resulting page. My code to generate this JWT looks like the following:
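As a stand-in illustration (a Python sketch, not the author’s original listing), the claim set for such a JWT could be built as below. build_jwt_claims is a hypothetical helper. The iss, sub, aud, iat, and exp claims are assumed from the usual service-account JWT flow, while target_audience is the claim described above. The resulting dictionary would still need to be signed with the service account’s private key.

```python
import time

# Token endpoint named in this post. It also serves as the JWT audience.
TOKEN_URL = "https://www.googleapis.com/oauth2/v4/token"


def build_jwt_claims(service_account_email, iap_client_id, now=None,
                     lifetime_seconds=3600):
    """Return the claims for a JWT to be exchanged for an OIDC token."""
    issued_at = int(time.time() if now is None else now)
    return {
        "iss": service_account_email,
        "sub": service_account_email,
        "aud": TOKEN_URL,
        "iat": issued_at,
        "exp": issued_at + lifetime_seconds,
        # The OAuth2 client ID of the IAP the token is meant for.
        "target_audience": iap_client_id,
    }
```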

This assumes you have access to the service account’s private key. If you don’t have access to the private key, e.g. because you’re running on GCE or Cloud Functions and using a service account from the metadata server, you’ll have to use the IAM signBlob API. We’ll cover this in a follow-up post.

This JWT is then exchanged for a Google-signed OIDC token for the client ID specified in the JWT claims. This token has a one-hour expiration and must be renewed by the consumer as needed. To retrieve a Google-signed token, we make a POST request containing the JWT and grant type to https://www.googleapis.com/oauth2/v4/token.
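The exchange request can be sketched with the standard library alone. build_token_request is a hypothetical helper that prepares the POST without sending it. The grant type shown is the standard OAuth2 JWT bearer grant, which is assumed to be the one this endpoint expects.

```python
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.googleapis.com/oauth2/v4/token"
# Standard OAuth2 grant type for presenting a signed JWT as an assertion.
JWT_BEARER_GRANT = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_token_request(signed_jwt):
    """Prepare (but do not send) the POST that trades the JWT for an OIDC token."""
    body = urllib.parse.urlencode({
        "grant_type": JWT_BEARER_GRANT,
        "assertion": signed_jwt,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Sending the prepared request with urllib.request.urlopen should yield a JSON document. To the best of my knowledge the OIDC token is returned in its id_token field, but treat that field name as an assumption to verify against the response you actually receive.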

This returns a Google-signed JWT which is good for about an hour. The “exp” claim can be used to check the expiration of the token. Authenticated requests are then made by setting the bearer token in the Authorization header of the HTTP request:
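Checking the expiration only requires decoding the middle segment of the token. A minimal sketch, with hypothetical function names and an arbitrary 60-second leeway, might look like the following. Note that it reads the claim without verifying the signature, which is reasonable for deciding when to renew our own token but not for trusting a token received from someone else.

```python
import base64
import json
import time


def token_expiry(jwt_token):
    """Return the 'exp' claim (Unix seconds) without verifying the signature."""
    payload_segment = jwt_token.split(".")[1]
    # Restore the base64 padding that JWT encoding strips.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))["exp"]


def is_expired(jwt_token, now=None, leeway_seconds=60):
    """True if the token has expired or will expire within the leeway."""
    current = time.time() if now is None else now
    return current >= token_expiry(jwt_token) - leeway_seconds
```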

Authorization: Bearer <token>

Below is a sequence diagram showing the process of making an OIDC-authenticated request to an IAP-protected resource.

Because this is quite a bit of code and complexity, I’ve implemented the process flow in Java as a Spring RestTemplate interceptor. This transparently authenticates API calls, caches the OIDC token, and handles automatically renewing it. Google has also provided examples of authenticating from a service account for other languages.

With IAP, we’re able to authenticate and authorize requests at the edge before they even reach our application. And with Cloud Audit Logging, we can monitor who is accessing protected resources. Be aware, however, that if you’re using GCE or GKE, users who can access the application-serving port of the VM can bypass IAP authentication. GCE and GKE firewall rules can’t protect against access from processes running on the same VM as the IAP-secured application. They can protect against access from another VM, but only if properly configured. This does not apply for App Engine since all traffic goes through the IAP infrastructure.

Alternative Solutions

There are some alternatives to IAP for implementing authentication and authorization for APIs. Apigee is one option, which Google acquired not too long ago. This is a more robust API-management solution which will do a lot more than just secure APIs, but it’s also more expensive. Another option is Google Cloud Endpoints, which is an NGINX-based proxy that provides mechanisms to secure and monitor APIs. This is free up to two million API calls per month.

Lastly, you can also simply implement authentication and authorization directly in your application instead of with an API proxy, e.g. using OAuth2. This has downsides in that it can introduce complexity and room for mistakes, but it gives you full control over your application’s security. Following our model of defense in depth, we often encourage clients to implement authentication both at the edge (e.g. by ensuring requests have a valid token) and in the application (e.g. by validating the token on a request). This way, we avoid implementing a Death-Star security model.


Operations in the World of Developer Enablement

NewOps is not a replacement for DevOps; it’s an evolution of it that looks at Operations through the lens of product. It’s what I’ve come to call “Developer Enablement” because the goal is to shift the focus of Ops teams from being masters of production to enablers of production. Through Developer Enablement, teams are enabled to control their own destiny, and tasked with the responsibility of doing so. This extends far beyond just the responsibility of building products. It includes how we build, test, secure, deploy, monitor, and operate systems.

For some, this might come naturally. Many startups don’t have the privilege of siloing up their organizations (although you’d be surprised!). For others, this can be a major shift in how we build software. Especially in large, established organizations with more specialized roles, responsibilities can be so siloed people aren’t even aware they’re happening. Basic “ilities” like scalability, reliability, and even security become someone else’s responsibility. “Good Operations” means no one even knows you’re there, unless something goes wrong.

So when this is turned on its ear, and these responsibilities are placed on the dev team’s shoulders, how do they adapt? In many cases, teams are eager to take on these new responsibilities but also blissfully unaware of what that actually entails. DBAs are a good example of this. Often a staple of enterprise IT Ops, DBAs are tasked with—among other things—installing and patching DBMSs, performing backups, managing HA and DR strategies, balancing database workloads, managing resources, tuning performance, configuring security settings, and monitoring systems. Many of these responsibilities are invisible to developers.

With cloud and Developer Enablement, this can change in profound ways. However, in a typical lift-and-shift, the role of DBAs is largely unchanged. In this case, we’re just running the same stuff in someone else’s data center. There are still databases to be patched, replication to be managed, backups to be made, and so on. But pure lift-and-shifts, at least as an end goal, are largely a misstep. You throw away all that institutional memory (the knowledge and experience you have managing your own data center) for more expensive compute that you have less experience administering. Things change when we start to rely on managed cloud services. We no longer run our own databases on VMs but instead rely on cloud-managed ones. This is where things become much more grey, but also much more interesting.

Developer Enablement in the Cloud

First, a quick aside. There are two different concepts we’re talking about here: cloud and Developer Enablement (DevOps for brevity). These are two distinct but related concepts. We can “do” DevOps on-prem, just as we can in the cloud. Likewise, we can also do traditional Operations in the cloud, just as we can on-prem. One of the benefits of cloud is it allows us to focus more investment on business-differentiating things, but it also makes implementing DevOps easier for two reasons. First, the cloud provider takes on more operational responsibilities (the stuff that supports—but doesn’t directly contribute to—business value). Second, it provides a lower barrier to self-service infrastructure. This means developers can, of their own accord, provision and manage supporting infrastructure like databases, caches, queues, and other things without a go-between or the customary “throw-it-over-the-wall” approach. This is a key part of Developer Enablement.

In the world of Developer Enablement in the cloud, what is the role of a DBA, or any other Ops person for that matter? When you start to map who is accountable for what, you quickly realize there is far too much nuance to cleanly map responsibilities. Which cloud provider are we talking about? Within that cloud provider, which database offering? Proprietary NoSQL databases like Google’s Cloud Datastore? Relational databases like Amazon’s RDS? Globally-distributed databases like Spanner? How we handle things like HA and DR varies drastically depending on the service and service provider. In some cases, the vendor is entirely responsible, e.g. because the database has built-in replication. In other cases, the customer is. Sometimes it’s a combination of both, such as a database that has automated backups which must first be enabled. It’s not as cut and dried as it used to be.

As we push more responsibility onto developers, how do we ensure they are actually tackling all of those responsibilities, especially the ones they might not even know about? How do we implement DevOps responsibly?

The goal of Developer Enablement is not to enable developers by giving them total control and free rein. Instead, it’s to empower them in a way that is “safe” for the business. People often misconstrue DevOps and automation as things that reduce lead times and increase deployment frequencies by simply pulling security out of the process. This is categorically not the purpose of DevOps. In fact, the intention is to improve security by integrating it more deeply and earlier into the process in a more reliable and repeatable way, i.e. “shift left.” Developer Enablement is about providing the tools, automation, services, and standards teams need to do just this.

So when we say we want to implement DevOps and Developer Enablement, we’re not saying we want to hand developers the keys to production with a pat on the back. We’re saying we want to pave a path to production which allows developers to release software in a way that is safe and secure with greater autonomy—because autonomy enables building more reliable software faster. In this world, Operations teams become increasingly Developer Enablement teams because there is simply less stuff to operate. It becomes more about supporting development teams and organizing around products than acting purely as a gatekeeper or service provider. It’s pretty amazing how things start to improve when you align yourself this way.

Responsibilities of Developer Enablement

Those Operations teams still have extremely valuable skill sets, however. It’s just that they start to act more in an advisory role than the assembly-line-worker role converting Jira tickets into outputs. For instance, DBAs have deep expertise on the intricacies and operations of various database systems, but when Amazon is now responsible for installing the database, patching it, scaling it, monitoring it, performing backups, managing replication and failovers, and handling encryption and security, what do the DBAs do? They become domain experts and developer advocates. They make sure teams aren’t shooting themselves (or the company) in the foot and provide domain expertise and tooling in a supporting role. When a developer complains about a slow query, they are the ones who can help them identify, understand, and fix the problem. “It’s doing a full-table scan since you’re missing an index,” or “You have a hot partition because you’re using a timestamp as the partition key. Try using a more uniform ID to distribute workloads evenly.” These folks can often help developers better structure their data to improve application performance and scalability.

In addition to this supporting role, these Developer Enablement teams also help ensure dev teams are thinking about all the things they need to be considering. In the case of data, how is encryption handled? HA? DR? Data migrations? Rollbacks? Not that all of these things need to be handled by the teams themselves—again, often the cloud provider has it covered—but simply ensuring that they have been considered and can be spoken to is important. It’s vital to start this conversation early in the development process.

The Three Phases of Development

There are basically three phases of development to consider. There’s the “playground” phase, which is when teams are essentially exploring different technologies. At this stage, there can be little-to-no oversight outside of controlling cloud spend (which is important for when your intern accidentally starts a task bomb before leaving for the weekend). Teams are free to try out new ideas without worrying about production. Often this work happens in a separate “experimentation” cloud project.

Next, there’s the “green-light” phase. The thing being built is going to production, it’s part of the company’s strategic plan, people are talking about it, etc. At this point, we start an ongoing dialogue with the team and provide them with a list of the key things to be thinking about. This should not be a 10-page document. It should be a one-page document hitting the main areas. An example portion of this might look like the following:

  • How do you plan to implement HA?
  • What classifications of data will this system handle and how do you plan to secure that data in transit and at rest?
  • How much traffic do you expect the system to handle and how will you scale it?
  • How will the system handle authentication and authorization?
  • What are the integration points?
  • Who will support the system in production?
  • What is the CI/CD story for the system?
  • What is the testing strategy?

Depending on your company’s culture, this can sometimes be seen as an affront or threat to teams if they’re used to Ops or InfoSec groups gatekeeping. That is not the goal; the list is intended to serve in an advisory capacity. This ends up having a couple of benefits. First, it gets teams thinking about and planning for key operational items, and second, it uncovers any major gaps early in the process. The number of times I’ve heard someone ask, “What’s HA?” after reading this list is non-zero. The purpose of this isn’t to shame anyone, just to provide a way to start critical discussions between the team and Developer Enablement groups.

Finally, there’s the “ready-for-production” phase. The team is ready to ship what they’ve been building. This is where things get real. Typically, there are a few things that should happen here. When launching a new service or product, there should be a comprehensive review of the system. The team will sit down with a group of their peers, architects, and security engineers and walk them through the system. People hate the dreaded architecture review, so we call it a product technical walkthrough instead.

Operational Readiness and Change Management

About a month or so prior to the walkthrough, the team should be working through an “operational-readiness checklist” which is used to guide the walkthrough. This checklist is much more detailed than the previous one, enumerating items like what the deploy process consists of, configuration management, API versioning, incident-response procedures, system observability, etc. The checklist we commonly use with clients at Real Kinetic is about seven pages long and covers 10 areas: Deployment, Testing, Reliability/Failover, Architecture, Costs, Security, CI/CD, Infrastructure, Capacity/Performance Estimates, and Operations and Support. This checklist is used to probe different areas. If certain areas feel a little weak, this can lead to deeper discussions depending on the importance or severity. If a system is particularly critical to the business or high-risk, this process can veto a release. Having a sign-off process like this makes some people nervous, but it’s important to point out that this should only apply to new launches. It is not a general change-management process. It’s really about helping teams learn about running systems in production and understanding what that takes.

In addition to the product technical walkthrough, we also recommend doing a security assessment for new services. This usually encompasses a vulnerability and threat assessment, risk assessment, pen testing, the whole nine yards. I usually also like to see some sort of load profiling done on the service before putting it in production (though load and chaos testing should ideally be part of the normal development process, not saved for the very end).

When it comes to infrastructure, there’s also the question of how to manage changes. This is where infrastructure as code (IaC) becomes hugely important as it not only provides a way to automate infrastructure changes, but also a means to review those changes. We can treat infrastructure changes in the same way we treat application changes—storing them in source control, doing code reviews on them, running them through static analysis tools, and so forth. Infrastructure changes, like all changes, should go through a code review process. It cannot be overstated how essential code reviews are and how much they benefit your organization. And once again, this is where Developer Enablement comes into play. I recommend IaC changes be reviewed by a Developer Enablement team member. This provides a touchpoint where they can provide domain expertise and ensure changes are within acceptable parameters. If a developer is requesting a change which falls outside those parameters, such as a database instance with 1TB of RAM for example, it requires a conversation and sign-off process.

Conclusion

With Developer Enablement, what used to be Operations becomes primarily a product and advisory team. “Product” in the sense of providing systems and tools that help developers take on more responsibility, from day-to-day development to operations and support. “Advisory” in the sense of offering domain expertise and guidance. Through this approach, we get better alignment by giving engineers end-to-end ownership from development to on-call and improve efficiency by reducing handoffs. This also lets us scale more effectively. Through products and reduced hand-offs, a Developer Enablement group can empower far more engineers than any conventional Ops team could.