Zero-Trust Security on GCP With Context-Aware Access

A lot of our clients at Real Kinetic leverage serverless on GCP to quickly build applications with minimal operations overhead. Serverless is one of the things that truly differentiates GCP from other cloud providers, and App Engine is a big component of this. Many of these companies come from an on-prem world and, as a result, tend to favor perimeter-based security models. They rely heavily on things like IP and network restrictions, VPNs, corporate intranets, and so forth. Unfortunately, this type of security model doesn’t always fit nicely with serverless due to the elastic and dynamic nature of serverless systems.

Recently, I worked with a client who was building an application for internal support staff on App Engine. They were using Identity-Aware Proxy (IAP) to authenticate users and authorize access to the application. IAP provides a fully managed solution for implementing a zero-trust access model for App Engine and Compute Engine. In this case, their G Suite user directory was backed by Active Directory, which allowed them to manage access to the application using Single Sign-On and AD groups.

Everything was great until the team hit a bit of a snag when they went through their application vulnerability assessment. Because it was for internal users, the security team requested the application be restricted to the corporate network. While I’m deeply skeptical of the value this adds in terms of security—the application was already protected by SSO and two-factor authentication and IAP cannot be bypassed with App Engine—I shared my concerns and started evaluating options. Sometimes that’s just the way things go in a larger, older organization. Culture shifts are hard and take time.

App Engine has firewall rules built in which allow you to secure incoming traffic to your application with allow/deny rules based on IP, so it seemed like an easy fix. The team would be in production in no time!

App Engine firewall rules

Unfortunately, there are some issues with how these firewall rules work depending on the application architecture. All traffic to App Engine goes through Google Front End (GFE) servers. This provides numerous benefits including TLS termination, DDoS protection, DNS, load balancing, firewall, and integration with IAP. It can present problems, however, if you have multiple App Engine services that communicate with each other internally. For example, imagine you have a frontend service which talks to a backend service.

App Engine does not provide a static IP address and instead relies on a large, dynamic pool of IP addresses. Two sequential outbound calls from the same application can appear to originate from two different IP addresses. One option is to allow all possible App Engine IPs, but this is riddled with issues. For one, Google uses netblocks that dynamically change and are encoded in Sender Policy Framework (SPF) records. To determine all of the IPs App Engine is currently using, you need to recursively perform DNS lookups by fetching the current set of netblocks and then doing a DNS lookup for each netblock. These results are not static, meaning you would need to do the lookups and update firewall rules continually. Worse yet, allowing all possible App Engine IPs would be self-defeating since it would be trivial for an attacker to work around by setting up their own App Engine application to gain access, assuming there isn’t any additional security beyond the firewall.
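
The recursive lookup described above can be sketched roughly as follows. This is a minimal illustration rather than a production tool: `lookup_txt` is a hypothetical callable standing in for a real DNS TXT query, and the record names and address ranges in `SAMPLE_RECORDS` are made-up placeholders (documentation ranges), not Google's actual netblocks.

```python
from typing import Callable, Dict, List, Optional, Set


def collect_netblocks(
    domain: str,
    lookup_txt: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Recursively expand SPF "include:" entries into a flat list of CIDR ranges."""
    seen = set() if seen is None else seen
    if domain in seen:  # guard against include loops
        return []
    seen.add(domain)

    cidrs: List[str] = []
    for token in lookup_txt(domain).split():
        if token.startswith("include:"):
            # Each include names another record that must be looked up in turn.
            cidrs.extend(collect_netblocks(token[len("include:"):], lookup_txt, seen))
        elif token.startswith(("ip4:", "ip6:")):
            # Keep everything after the first colon, so IPv6 ranges stay intact.
            cidrs.append(token.split(":", 1)[1])
    return cidrs


# Made-up records for illustration only (hypothetical names, documentation IP ranges).
SAMPLE_RECORDS: Dict[str, str] = {
    "_netblocks.example.test": (
        "v=spf1 include:_netblocks1.example.test "
        "include:_netblocks2.example.test ?all"
    ),
    "_netblocks1.example.test": "v=spf1 ip4:203.0.113.0/24 ip4:198.51.100.0/24 ?all",
    "_netblocks2.example.test": "v=spf1 ip6:2001:db8::/32 ?all",
}

blocks = collect_netblocks("_netblocks.example.test", lambda d: SAMPLE_RECORDS[d])
print(blocks)
```

Even with a script like this, the output is only a snapshot. It would have to be re-run and the firewall rules updated every time the records change, which is exactly the maintenance burden described above.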

Another, slightly better option is to set up a proxy on Compute Engine in the same region as your App Engine application. With this, you get a static IP address. The downside here is that it’s an additional piece of infrastructure that must be managed, which isn’t great when you’re shooting for a serverless architecture.

Luckily, there is a better solution—one that fits our serverless model and enables us to control external traffic while allowing App Engine services to securely communicate internally. IAP supports context-aware access, which allows enforcing granular access controls for web applications, VMs, and GCP APIs based on an end-user’s identity and request context. Essentially, context-aware access brings a richer zero-trust model to App Engine and other GCP services.

To set up a network firewall in IAP, we first need to create an Access Level in the Access Context Manager. Access Levels are a way to add an extra level of security based on request attributes such as IP address, region, time of day, or device. In the client’s case, they can create an Access Level to only allow access from their corporate network.

GCP Access Context Manager
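
Conceptually, an IP-based Access Level boils down to a CIDR membership check on the caller's address. The sketch below illustrates that idea only; it is not how IAP or Access Context Manager is implemented, and `CORPORATE_CIDRS` is a hypothetical corporate range.

```python
import ipaddress
from typing import Iterable

# Hypothetical corporate egress range (a documentation-only block).
CORPORATE_CIDRS = ["203.0.113.0/24"]


def satisfies_ip_access_level(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


print(satisfies_ip_access_level("203.0.113.7", CORPORATE_CIDRS))
print(satisfies_ip_access_level("198.51.100.7", CORPORATE_CIDRS))
```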

We can then add the Access Level to roles that are assigned to users or groups in IAP. This means even if users are authenticated, they must be on the corporate network to access the application.

Cloud Identity-Aware Proxy roles

To allow App Engine services to communicate freely, we simply need to assign the IAP-secured Web App User role without the Access Level to the App Engine default service account. Services will then authenticate as usual using OpenID Connect without the added network restriction. The default service account is managed by GCP and there are no associated credentials, so this provides a solid security posture.
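
To make the effect of the two bindings concrete, here is a simplified model of the resulting authorization decision. It is a sketch under stated assumptions, not IAP's actual evaluation logic: the group address, project name, and Access Level name are hypothetical, and group membership is treated as a direct match for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Role ID for "IAP-secured Web App User".
IAP_ROLE = "roles/iap.httpsResourceAccessor"


@dataclass
class Binding:
    member: str
    role: str
    access_level: Optional[str] = None  # None means the binding is unconditional


# Hypothetical bindings: staff need the Access Level, the service account does not.
BINDINGS = [
    Binding("group:support-staff@example.com", IAP_ROLE, access_level="corp_network"),
    Binding("serviceAccount:example-project@appspot.gserviceaccount.com", IAP_ROLE),
]


def is_allowed(member: str, bindings: List[Binding], satisfied_levels: Set[str]) -> bool:
    """Allow if any matching binding is unconditional or its Access Level is met."""
    return any(
        b.member == member
        and b.role == IAP_ROLE
        and (b.access_level is None or b.access_level in satisfied_levels)
        for b in bindings
    )
```

With these bindings, a support user is allowed only when the request satisfies `corp_network`, while the App Engine default service account is allowed regardless of where the request originates.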

Now, at this point, we’ve solved the IP firewall problem, but that’s not really in the spirit of zero-trust, right? Zero-trust is a security principle holding that organizations should not inherently trust anything inside or outside of their perimeters and instead should verify anything trying to connect to their systems. Having to connect to a VPN in order to access an application in the cloud is kind of a bummer, especially when the corporate VPN goes down. COVID-19 has made a lot of organizations feel this pain. Fortunately, Access Levels can be a lot smarter than providing simple lists of approved IP addresses. With the Cloud IAM Conditions Framework, we can even write custom rules to allow access based on URL path, resource type, or other request attributes.

At this point, I talked the client through the Endpoint Verification process and how we can shift away from a perimeter-based security model to a defense-in-depth, zero-trust model. Rather than requiring the end-user to be signed in from the corporate network, we can require them to be signed in from a trusted, corporate-owned device from anywhere. We can require that the device has a screen lock and is encrypted or has a minimum OS version.
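
A device-based Access Level can be thought of as a predicate over device attributes instead of network location. The following sketch shows one possible combination of the requirements mentioned above. The `Device` fields and the minimum OS version are hypothetical and do not mirror the actual Endpoint Verification schema.

```python
from dataclasses import dataclass
from typing import Tuple

MIN_OS_VERSION = (10, 15)  # hypothetical minimum, as (major, minor)


@dataclass
class Device:
    company_owned: bool
    screen_lock: bool
    encrypted: bool
    os_version: Tuple[int, ...]


def satisfies_device_level(
    device: Device, min_os_version: Tuple[int, ...] = MIN_OS_VERSION
) -> bool:
    """Require a corporate-owned, locked, encrypted device on a recent enough OS."""
    return (
        device.company_owned
        and device.screen_lock
        and device.encrypted
        and device.os_version >= min_os_version  # tuples compare element by element
    )
```

Note that nothing in this check refers to an IP address: the same laptop passes from the office, from home, or from a coffee shop.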

With IAP and context-aware access, we can build layered security on top of applications and resources without the need for a VPN, while still centrally managing access. This can even extend beyond GCP to applications hosted on-prem or in other cloud platforms like AWS and Azure. Enterprises don’t have to move away from more traditional security models all at once. This pattern allows you to gradually shift by adding and removing Access Levels and attributes over time. Zero-trust becomes much easier to implement within large organizations when they don’t have to flip a switch.

What’s Going on with GKE and Anthos?

GCP’s Slippery Slide into Enterprise

When former Oracle exec Thomas Kurian took over for Diane Greene as Google Cloud’s CEO, a lot of people expressed concern about what this meant for the future of GCP. Vendor lock-in is already at the forefront of the minds of many cloud adopters, and Oracle is notorious for locking customers into expensive and prolonged contracts. However, I thought the move was smart on Google’s part.

Google has never been a customer-first company. While it has always been a technology leader, it struggles immensely with enterprise sales and support. It continues to have issues dogfooding its own products (Google’s products are typically built on internal versions of services not available to customers, then there are the external GCP versions that their customers actually use). This means its engineers don’t feel the same pain points that its customers experience and their products lose out on a critical feedback loop (contrast this with Amazon where AWS is treated as a separate company to Amazon.com, and there is a mandate to build with the same services Amazon’s customers use). Customer empathy matters.

Now, most people probably wouldn’t characterize Oracle as a customer-first company, but it knows how to meet customers where they are and to sell in a way that resonates with enterprise decision makers. Historically, Google has approached sales engineering in a way that has failed to resonate with customers by attempting to map its superior technology offerings onto actual customer problems. Nothing could be more off-putting to a decision maker with a round hole than a sales engineer with a square peg telling them their hole is wrong.

Thomas Kurian was brought in to address these glaring issues for Google Cloud. Through restructuring and growing its sales organization, key leadership hires, and strategic acquisitions and partnerships, it’s clear he’s serious about fixing Google Cloud’s enterprise perception problem. Slowly but surely, Google is attempting to shift its culture from being technology-obsessed to customer-obsessed. And while Oracle is notorious when it comes to vendor lock-in, all signs thus far have pointed to Google more strategically embracing open APIs with things like GKE (Kubernetes), Traffic Director (Istio), ML Engine (TensorFlow), and Dataflow (Apache Beam). They are also starting to meet customers where they are with things like Dataproc (Apache Spark and Hadoop), Memorystore (Redis), and Cloud SQL (MySQL, PostgreSQL, and Microsoft SQL Server). Hell, they’ll even run Microsoft Active Directory for you now! Who says Google can’t do enterprise? So the future is bright for GCP, right? Maybe. What follows is speculation based on my own observations and anecdotal information.

There’s one thing that could change the outlook on all of this: Anthos. Anthos is GCP’s answer to hybrid-cloud solutions like Pivotal Cloud Foundry (PCF), AWS Outposts, or Azure Stack. It allows organizations to build and manage workloads across public clouds and on-prem by extending GKE. If multi-cloud is your thing and you hate money, these platforms all sound like pretty good things. But here’s the disconcerting thing about Anthos in particular: it’s becoming clear that GCP is deliberately blurring the lines between Anthos and GKE.

I received an email yesterday from GCP announcing that Binary Authorization is now generally available (GA). Binary Authorization is a neat security feature that ensures only trusted container images can be deployed to GKE. It’s been in beta for some time and now it’s GA with a six-month free trial starting today. Great! How much will it cost after the trial? Contact your sales representative. Wait, what? That’s because starting on March 16, 2020, GKE clusters will need to be part of an Anthos-subscribed organization to enable Binary Authorization. If you choose not to upgrade to Anthos, starting March 16, 2020, you will not be able to turn on Binary Authorization on new clusters.

This is a slippery slope for GCP. I can already foresee other features requiring an Anthos subscription just to use them in GKE, where GKE basically becomes an Anthos subscription funnel. Which features go into Anthos and which go into GKE? Now this is something I’d come to expect from Oracle. If GCP starts to roll differentiating features into Anthos instead of GKE, it could mark the beginning of the end.

While the lines between Anthos and GKE are becoming increasingly fuzzy, Google is clear about this particular feature:

Binary Authorization is a feature of the Anthos platform and use of Binary Authorization is included in the Anthos subscription.

That wasn’t clear, however, when I started using it with GKE and started to advise clients to use it there, completely irrespective of Anthos. This sets a very dangerous precedent.

What’s more alarming is that the marketing and product language for a number of GCP services and features has quietly replaced “GKE” with “Anthos” or, worse yet, “Anthos GKE.” For example, Cloud Run (which is still in beta) now says it can “run stateless containers on a fully managed environment or on Anthos.” Will I need an Anthos subscription to use Cloud Run with GKE once it goes GA? Based on the Binary Authorization move and the language updates, it seems likely. And looking at the GKE cluster setup wizard, it appears managed Istio might require one as well.

Anthos features listed in GKE cluster setup wizard

Which of these features is going to require a subscription next? We know Binary Authorization already does.

Security features listed in GKE cluster setup wizard

And how much does Anthos even cost? Contact sales. Not a good look for Kurian’s vision of openness and customer choice. As AWS CEO Andy Jassy puts it, no longer does the process of buying technology involve the purchase of heavy proprietary software with multi-year contracts that include annual maintenance fees. Now it’s about choice and ease of use, including letting customers turn things off if they’re not working. But choice also means not bundling all of your differentiating features into a massive contract. List prices for Anthos start at $10,000 per month per 100 virtual CPUs with a minimum one-year commitment. This is just for the software layer. It doesn’t include any of the underlying GCP infrastructure. Again, fine for organizations willing to throw similar sums of money at things like PCF or Outposts, but are plain old GKE users really going to get roped in to this nonsense? Are they going to lose out on value-added features?
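
To put the list price in perspective, here is the arithmetic as a small sketch. It uses only the figures quoted above ($10,000 per month per 100 vCPUs, one-year minimum). Rounding up to whole 100-vCPU blocks is my assumption, since actual terms sit behind the sales contact form.

```python
import math


def anthos_list_price(
    vcpus: int,
    months: int = 12,
    price_per_block: int = 10_000,  # USD per month per block, as quoted above
    block_size: int = 100,          # vCPUs per block, as quoted above
    min_months: int = 12,           # one-year minimum commitment
) -> int:
    """Estimate the software-only list price, excluding GCP infrastructure."""
    blocks = math.ceil(vcpus / block_size)  # assumption: billed in whole blocks
    return blocks * price_per_block * max(months, min_months)


print(anthos_list_price(100))
print(anthos_list_price(250))
```

Under these assumptions, even the smallest footprint commits an organization to $120,000 for the year before any infrastructure costs.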

Either GCP has a well-thought-out strategy for GKE and Anthos (which, given Google’s history, is frankly unlikely) and is simply tone deaf to how it would be perceived by people already skittish about a former Oracle exec taking the reins as CEO, or this will end in disaster. It’s entirely possible this is all just a misunderstanding and they are, in a misguided fashion, rebranding GKE to Anthos (it’s been renamed once already and GCP has a history of rebranding existing products), but requiring a subscription hidden behind a sales contact form in order to use basic features is spooky.

My hope is that there is some longer-term strategy at play and GCP is not moving to an enterprise-subscription model for what should be GKE features. Best case, Google is just muddying the waters as they’ve done in the past. Worst case, they’re steamrolling their entire platform strategy to make way for enterprise sales. That would be tragic for Google given GKE is still far and away the best managed Kubernetes service available. So what’s going on with GKE and Anthos?

Serverless on GCP

Like many other marketing buzzwords, the concept of “serverless” has taken on a life of its own, which can make it difficult to understand what serverless actually means. What it really means is that the cloud provider fully manages server infrastructure all the way up to the application layer. For example, GCE isn’t serverless because, while Google manages the physical server infrastructure, we still have to deal with patching operating systems, managing load balancers, configuring firewall rules, and so on. Serverless means we merely worry about our application code and business logic and nothing else. This concept extends beyond pure compute though, including things like databases, message queues, stream processing, machine learning, and other types of systems.

There are several benefits to the serverless model. First, it allows us to focus on building products, not managing infrastructure. These operations-related tasks, while important, are not generally things that differentiate a business. It’s just work that has to be done to support the rest of the business. With cloud—and serverless in particular—many of these tasks are becoming commoditized, freeing us up to focus on things that matter to the business.

Another benefit related to the first is that serverless systems provide automatic scaling and fault-tolerance across multiple data centers or, in some cases, even globally. When we leverage GCP’s serverless products, we also leverage Google’s operational expertise and the experience of an army of SREs. That’s a lot of leverage. Few companies are able to match the kind of investment cloud providers like Google or Amazon are able to make in infrastructure and operations, nor should they. If it’s not your core business, leverage economies of scale.

Finally, serverless allows us to pay only for what we use. This is quite a bit different from what traditional IT companies are used to where it’s more common to spend several millions of dollars on a large solution with a contract. It’s also different from what many cloud-based companies are used to where you typically provision some baseline capacity and pay for bursts of additional capacity as needed. With serverless, VMs are eschewed and we pay only for the resources we use to serve the traffic we have. This means no more worrying about over-provisioning or under-provisioning.

GCP’s Compute Options

GCP has a comprehensive set of compute options ranging from minimally managed VMs all the way to highly managed serverless backends. Below is the full spectrum of GCP’s compute services at the time of this writing. I’ll provide a brief overview of each of these services just to get the lay of the land. We’ll start from the highest level of abstraction and work our way down, and then we’ll home in on the serverless solutions.

GCP compute platforms

Firebase is Google’s managed Backend as a Service (BaaS) platform. This is the highest level of abstraction that GCP offers (short of SaaS like G Suite) and allows you to build mobile and web applications quickly and with minimal server-side code. For example, it can implement things like user authentication and offline data syncing for you. This is often referred to as a “backend as a service” because there is no server code. The trade-off is you have less control over the system, but it can be a great fit for quickly prototyping applications or building a proof of concept with minimal investment. The primary advantage is that you can focus most of your development effort on client-side application code and user experience. Note that some components of Firebase can be used outside of the Firebase platform, such as Cloud Firestore and Firebase Authentication.

Cloud Functions is a serverless Functions as a Service (FaaS) offering from GCP. You upload your function code and Cloud Functions handles running it. Because it’s a sandboxed environment, there are some restrictions on the runtime, but it’s a great choice for building event-driven services and connecting systems together. While you can develop basic user-facing APIs, the operational tooling is not sufficient for complex systems. The benefit is that Cloud Functions is highly elastic and has minimal operational overhead since it is a serverless platform. Functions are an excellent choice for dynamic, event-driven plumbing such as moving data between services or reacting to log events. They work well for basic APIs but can rapidly become operationally complex beyond a few endpoints.

App Engine is Google’s Platform as a Service (PaaS). Like Cloud Functions, it’s an opinionated but fully managed runtime that lets you upload your application code while handling the operational aspects such as autoscaling and fault-tolerance. App Engine has two modes: Standard, which is the opinionated PaaS runtime, and Flexible, which allows providing a custom runtime using a container—this is colloquially referred to as a Container as a Service (CaaS). For stateless applications with quick instance start-up times, it is often an excellent choice. It offers many of the benefits of Cloud Functions but simplifies operational aspects since larger components are easy to deploy and manage. App Engine allows developers to focus most of their effort on business logic. Standard is a great fit for greenfield applications where server-side processing and logic is required. Flex can be easier for migrating existing workloads because it is less opinionated.

Cloud Run is a new offering in GCP that provides a managed compute platform for stateless containers. Essentially, Google manages the underlying compute infrastructure and all you have to do is provide an application container. Like App Engine, it handles scaling instances up and down, load balancing, and fault-tolerance. Cloud Run actually has two modes: the Google-managed version, which runs your containers on Google’s internal compute infrastructure known as Borg, and the GKE version, which allows running workloads on your own GKE cluster. This is because Cloud Run is built on an open source Kubernetes platform for serverless workloads called Knative.

Cloud Run and App Engine Flex are similar to each other, but there are some nuanced differences. One key difference is Cloud Run has very fast instance start-up time due to its reliance on the gVisor container runtime. Flex instances, on the other hand, usually take minutes to start because they involve provisioning GCE instances, load balancers, and other GCP-managed infrastructure. Flex is also more feature-rich than Cloud Run, supporting things like traffic splitting, deployment rollbacks, WebSocket connections, and VPC connections.

Kubernetes Engine, or GKE, is Google’s managed Kubernetes service. GKE effectively adds a container orchestration layer on top of GCE, putting it somewhere between IaaS (Infrastructure as a Service) and CaaS. This is typically the lowest level of abstraction most modern applications should require. There is still a lot of operational overhead involved with using a managed Kubernetes service.

Lastly, Compute Engine, or GCE, is Google’s VM offering. GCE VMs are usually run on multi-tenant hosts, but GCP also offers sole-tenant nodes where a physical Compute Engine server is dedicated to hosting a single customer’s VMs. This is the lowest level of infrastructure that GCP offers and the lowest common denominator generally available in the public clouds, usually referred to as IaaS. This means there are a lot of operational responsibilities that come with using it. There are generally few use cases that demand a bare VM.

Choosing a Serverless Option

Now that we have an overview of GCP’s compute services, we can focus in on the serverless options.

GCP serverless compute platforms

GCP currently has four serverless compute options (emphasis on compute because there are other serverless offerings for things like databases, queues, and so forth, but these are out of scope for this discussion).

  • Cloud Run: serverless containers (CaaS)
  • App Engine: serverless platforms (PaaS)
  • Cloud Functions: serverless functions (FaaS)
  • Firebase: serverless applications (BaaS)

With four different serverless options to choose from, how do we decide which one is right? The first thing to point out is that we don’t necessarily need to choose a single solution. We might end up using a combination of these services when building a system. However, I’ve provided some criteria below on selecting solutions for different types of problems.

Firebase

If you’re looking to quickly prototype an application or focus only on writing code, Firebase can be a good fit. This is especially true if you want to focus most of your investment and time on the client-side application code and user experience. Likewise, if you want to build a mobile-ready application and don’t want to implement things like user authentication, it’s a good option.

Firebase is obviously the most restrictive and opinionated solution, but it’s great for rapid prototyping and accelerating development of an MVP. You can also complement it with services like App Engine or Cloud Functions for situations that require server-side compute.

Good Fit Characteristics

  • Mobile-first (or ready) applications
  • Rapidly prototyping applications
  • Applications where most of the logic is (or can be) client-side
  • Using Firebase components on other platforms, such as using Cloud Firestore or Firebase Authentication on App Engine, to minimize investment in non-differentiating work

Bad Fit Characteristics

  • Applications requiring complex server-side logic or architectures
  • Applications which require control over the runtime

Cloud Functions

If you’re looking to react to real-time events, glue systems together, or build a simple API, Cloud Functions are a good choice provided you’re able to use one of the supported runtimes (Node.js, Python, and Go). If the runtime is a limitation, check out Cloud Run.

Good Fit Characteristics

  • Event-driven applications and systems
  • “Gluing” systems together
  • Deploying simple APIs

Bad Fit Characteristics

  • Highly stateful systems
  • Deploying large, complex APIs
  • Systems that require a high level of control or need custom runtimes or binaries

App Engine

If you’re looking to deploy a full application or complex API, App Engine is worth looking at. Standard is good for greenfield applications which are able to fit within the constraints of the runtime. It can scale to zero and deploys take seconds. Flexible is easier for existing applications where you’re unwilling or unable to make the changes needed to fit them into Standard. Deploys to Flex can take minutes, and you must have a minimum of one instance running at all times.

Good Fit Characteristics

  • Stateless applications
  • Rapidly developing CRUD-heavy applications
  • Applications composed of a few services
  • Deploying complex APIs

Bad Fit Characteristics

  • Stateful applications that require lots of in-memory state to meet performance or functional requirements
  • Applications built with large or opinionated frameworks or applications that have slow start-up times (this can be okay with Flex)
  • Systems that require protocols other than HTTP

Cloud Run

If you’re looking to react to real-time events but need custom runtimes or binaries not supported by Cloud Functions, Cloud Run is a good choice. It’s also a good option for building stateless HTTP-based web services. It’s trimmed down compared to App Engine Flex, which means it has fewer features, but it also has faster instance start-up times, can scale to zero, and is billed only by actual request-processing time rather than instance time. 

Good Fit Characteristics

  • Stateless services that are easily containerized
  • Event-driven applications and systems
  • Applications that require custom system and language dependencies

Bad Fit Characteristics

  • Highly stateful systems or systems that require protocols other than HTTP
  • Compliance requirements that demand strict controls over the low-level environment and infrastructure (might be okay with the Knative GKE mode)

Finally, Google also provides a decision tree for choosing a serverless compute platform.

* App Engine standard environment supports Node.js, Python, Java, Go, PHP
* Cloud Functions supports Node.js, Python, Go
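
The selection criteria from the sections above can be condensed into a rough sketch. This encodes my reading of the good-fit and bad-fit lists in this post, not Google's decision tree. The parameter names and the order of the checks are illustrative assumptions, and real systems often combine several of these services.

```python
def suggest_platform(
    mostly_client_side: bool = False,
    needs_custom_runtime: bool = False,
    full_app_or_complex_api: bool = False,
    highly_stateful: bool = False,
    non_http_protocols: bool = False,
) -> str:
    """Return a starting-point suggestion based on the fit lists above."""
    if highly_stateful or non_http_protocols:
        # Listed as a bad fit for the serverless options; drop down a level.
        return "GKE or Compute Engine"
    if mostly_client_side:
        return "Firebase"
    if needs_custom_runtime:
        return "Cloud Run"
    if full_app_or_complex_api:
        return "App Engine"
    # Event-driven glue and simple APIs on a supported runtime.
    return "Cloud Functions"


print(suggest_platform(mostly_client_side=True))
print(suggest_platform(needs_custom_runtime=True))
print(suggest_platform())
```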

Summary

Going serverless can provide a lot of efficiencies by freeing up resources and investment to focus on things that are more strategic and differentiating for a business rather than commodity infrastructure. There are trade-offs when using managed services and serverless solutions. We lose some control and visibility. At certain usage levels there can be a premium, so eventually renting VMs might be the more cost-effective solution once you crack that barrier. However, it’s important to consider not just operational costs involved in managing infrastructure, but also opportunity costs. These trade-offs have to be weighed carefully against the benefits they bring to the business.

One thing worth pointing out is that it’s often easier to move down a level of abstraction than up. That is, there’s typically less friction involved in moving from a more opinionated platform to a less opinionated one than vice versa. This is why we usually suggest starting with the highest level of abstraction possible and dropping down if and when needed.

Security by Happenstance

Key rotation, auditing, and secure CI/CD

Companies often require employees to regularly change their passwords for security purposes. PCI compliance, for example, requires that passwords be changed every 90 days. However, NIST, whose guidelines commonly become the foundation for security best practices across countless organizations, recently revised its recommendations around password security. Its Digital Identity Guidelines (NIST 800-63-3) now recommends removing periodic password-change requirements due to a growing body of research suggesting that frequent password changes actually make security worse. This is because these requirements encourage the use of passwords which are more susceptible to cracking (e.g. incrementing a number or altering a single character) or result in people writing their passwords down.

Unfortunately, many companies have now adapted these requirements to other parts of their IT infrastructure. This is largely due to legacy holdover practices which have crept into modern systems (or simply lingered in older ones), i.e. it’s tech debt. Specifically, I’m talking about practices like issuing username/password credentials to applications or systems, rather than to individual end users, so they can access resources. These special credentials may even provide a system free rein within a network much like a user might have, especially if the network isn’t segmented (often these companies have adopted a perimeter-security model, relying on a strong outer wall to protect their network). As a result, because they are passwords just like a normal user would have, they are subject to the usual 90-day rotation policy or whatever the case may be.

Today, I think we can say with certainty that—along with the perimeter-security model—relying on usernames and passwords for system credentials is a security anti-pattern (and really, user credentials should be relying on multi-factor authentication). With protocols like OAuth2 and OpenID Connect, we can replace these system credentials with cryptographically strong keys. But because these keys, in a way, act like username/passwords, there is a tendency to apply the same 90-day rotation policy to them as well. This is a misguided practice for several reasons and is actually quite risky.

First, changing a user’s password is far less risky than rotating an access key for a live, production system. If we’re changing keys for production systems frequently, there is a potential for prolonged outages. The more you’re touching these keys, the more exposure and opportunity for mistakes there is. For a user, the worst case is they get temporarily locked out. For a system, the worst case is a critical user-facing application goes down. Second, cryptographically strong keys are not “guessable” like a password frequently is. Since they are generated by an algorithm and not intended to be input by a human, they are long and complex. And unlike passwords, keys are not generally susceptible to social engineering. Lastly, if we are requiring keys to be rotated every 90 days, this means an attacker can still have up to 89 days to do whatever they want in the event of a key being compromised. From a security perspective, this frankly isn’t good enough to me. It’s security by happenstance. The Twitter thread below describes a sequence of events that occurred after an AWS key was accidentally leaked to a public code repository which illustrates this point.

To recap that thread, here’s a timeline of what happened:

  1. AWS credentials are pushed to a public repository on GitHub.
  2. 55 seconds later, an email is received from AWS telling the user that their account is compromised and a support ticket is automatically opened.
  3. A minute later (2 minutes after the push), an attacker attempts to use the credentials to list IAM access keys in order to perform a privilege escalation. Since the IAM role attached to the credentials is insufficient, the attempt fails and an event is logged in CloudTrail.
  4. The user disables the key 5 minutes and 58 seconds after the push.
  5. 24 minutes and 58 seconds after the push, GuardDuty fires a notification indicating anomalous behavior: “APIs commonly used to discover the users, groups, policies and permissions in an account, was invoked by IAM principal some_user under unusual circumstances. Such activity is not typically seen from this principal.”

Given this timeline, rotating access keys every 90 days would do absolutely no good. If anything, it would provide a false sense of security. An attack was made a mere 2 minutes after the key was compromised. It makes no difference if it’s rotated every 90 days or every 9 minutes.

If 90-day key rotation isn’t the answer, what is? The timeline above already hits on it. System credentials, i.e. service accounts, should have very limited permissions following the principle of least privilege. For instance, a CI server which builds artifacts should have a service account which only allows it to push artifacts to a storage bucket and nothing else. This idea should be applied to every part of your system.
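As a rough illustration of what checking for least privilege might look like, here is a minimal sketch that lists any roles granted to a CI service account beyond what it should need. The policy uses the familiar bindings/role/members layout of a GCP IAM policy, but the service account name, the project, and the choice of roles/storage.objectCreator as the only allowed role are assumptions made up for this example.

```python
# Assumption: the CI server should only ever be able to create objects in a bucket.
ALLOWED_CI_ROLES = {"roles/storage.objectCreator"}

# Hypothetical service account used only for this illustration.
CI_MEMBER = "serviceAccount:ci@example-project.iam.gserviceaccount.com"


def excess_roles(policy, member, allowed_roles):
    """Return the roles granted to `member` that fall outside `allowed_roles`."""
    granted = {
        binding["role"]
        for binding in policy["bindings"]
        if member in binding["members"]
    }
    return sorted(granted - allowed_roles)


# A made-up policy in which the CI account has been granted too much.
policy = {
    "bindings": [
        {"role": "roles/storage.objectCreator", "members": [CI_MEMBER]},
        {"role": "roles/editor", "members": [CI_MEMBER]},
    ]
}

violations = excess_roles(policy, CI_MEMBER, ALLOWED_CI_ROLES)
```

A check like this could run periodically against an exported policy so that privilege creep is caught early rather than discovered after a key leaks.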

For things running inside the cloud, such as AWS or GCP, we can usually avoid the need for access keys altogether. With GCP, we rely on service accounts with GCP-managed keys. The keys for these service accounts are not exposed to users at all and are, in fact, rotated approximately every two weeks (Google is able to do this because they own all of the infrastructure involved and have mature automation). With AWS, we rely on Identity and Access Management (IAM) users and roles. The role can then be assumed by the environment without having to deal with a token or key. This situation is ideal because we can avoid key exposure by never having explicit keys in the first place.

For things running outside the cloud, it’s a bit more involved. In these cases, we must deal with credentials somehow. Ideally, we can limit the lifetime of these credentials, such as with AWS’ Security Token Service (STS) or GCP’s short-lived service account credentials. However, in some situations, we may need longer-lived credentials. In either case, the critical piece is using limited-privilege credentials such that if a key is compromised, the scope of the damage is narrow.

The other key component of this is auditing. Both AWS and GCP offer extensive audit logs for governance, compliance, operational auditing, and risk auditing of your cloud resources. With this, we can audit service account usage, detect anomalous behavior, and immediately take action—such as revoking the credential—rather than waiting up to 90 days to rotate it. Amazon also has GuardDuty which provides intelligent threat detection and continuous monitoring which can identify unauthorized activity as seen in the scenario above. Additionally, access credentials and other secrets should never be stored in source code, but tools like git-secrets, GitGuardian, and truffleHog can help detect when it does happen.

Let’s look at a hypothetical CI/CD pipeline as an example which ties these ideas together. Below is the first pass of our proposed pipeline. In this case, we’re targeting GCP, but the same ideas apply to other environments.

CircleCI is a SaaS-based CI/CD solution. Because it’s deploying to GCP, it will need a service account with the appropriate IAM roles. CircleCI has support for storing secret environment variables, which is how we would store the service account’s credentials. However, there are some downsides to this approach.

First, the service account that Circle needs in order to make deploys could require a fairly wide set of privileges, like accessing a container registry and deploying to a runtime. Because it lives outside of GCP, this service account has a user-managed key. While we could use a KMS to encrypt it or a vault that provides short-lived credentials, we ultimately will need some kind of credential that allows Circle to access these services, so at best we end up with a weird Russian-doll situation. If we’re rotating keys, we might wind up having to do so recursively, and the value of all this indirection starts to come into question. Second, these credentials—or any other application secrets—could easily be dumped out as part of the build script. This isn’t good if we wanted Circle to deploy to a locked-down production environment. Developers could potentially dump out the production service account credentials and now they would be able to make deploys to that environment, circumventing our pipeline.

This is why splitting out Continuous Integration (CI) from Continuous Delivery (CD) is important. If, instead, Circle were only responsible for CI and we introduced a separate component for CD, such as Spinnaker, we could solve this problem. Using this approach, Circle only needs the ability to push an artifact to a Google Cloud Storage bucket or Container Registry. Outside of the service account credentials needed to do this, it doesn’t need to deal with secrets at all. This means there’s no way to dump out secrets in the build because they will be injected later by Spinnaker. The value of the service account credentials is also much more limited. If compromised, they only allow someone to push artifacts to a repository. Spinnaker, which would run in GCP, would then pull secrets from a vault (e.g. HashiCorp’s Vault) and deploy the artifact relying on credentials assumed from the environment. Thus, Spinnaker only needs permissions to pull artifacts and secrets and deploy to the runtime. This pipeline now looks something like the following:

With this pipeline, we now have traceability from code commit and pull request (PR) to deploy. We can then scan audit logs to detect anomalous behavior—a push to an artifact repository that is not associated with the CircleCI service account or a deployment that does not originate from Spinnaker, for example. Likewise, we can ensure these processes correlate back to an actual GitHub PR or CircleCI build. If they don’t, we know something fishy is going on.
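To make the audit idea a little more concrete, here is a minimal sketch of scanning log entries for actions performed by an unexpected principal. The entry shape (action and principal keys), the action names, and the service account emails are all hypothetical simplifications; real audit log entries carry far more detail and would need to be mapped into something like this first.

```python
# Assumption: each sensitive action should only ever be performed by one principal.
EXPECTED_ACTORS = {
    "artifact.push": "circleci@example-project.iam.gserviceaccount.com",
    "deploy": "spinnaker@example-project.iam.gserviceaccount.com",
}


def find_anomalies(entries, expected_actors):
    """Return entries whose principal does not match the expected actor."""
    anomalies = []
    for entry in entries:
        expected = expected_actors.get(entry["action"])
        # Actions we have no expectation for are ignored in this sketch.
        if expected is not None and entry["principal"] != expected:
            anomalies.append(entry)
    return anomalies


# Made-up log entries: the second one is a deploy that bypassed Spinnaker.
entries = [
    {"action": "artifact.push",
     "principal": "circleci@example-project.iam.gserviceaccount.com"},
    {"action": "deploy", "principal": "developer@example.com"},
    {"action": "deploy",
     "principal": "spinnaker@example-project.iam.gserviceaccount.com"},
]

suspicious = find_anomalies(entries, EXPECTED_ACTORS)
```

The same pattern extends to correlating each push or deploy with a known PR or build identifier, flagging anything that cannot be traced back.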

To summarize, requiring frequent rotations of access keys is an outdated practice. It’s a remnant of password policies which themselves have been increasingly rejected by security experts. While similar in some ways, keys are fundamentally different than a username and password, particularly in the case of a service account with fine-grained permissions. Without mature practices and automation, rotating these keys frequently is an inherently risky operation that opens up the opportunity for downtime.

Instead, it’s better to rely on tightly scoped (and, if possible, short-lived) service accounts and usage auditing to detect abnormal behavior. This allows us to take action immediately rather than waiting for some arbitrary period to rotate keys where an attacker may have an unspecified amount of time to do as they please. With end-to-end traceability and evidence collection, we can more easily identify suspicious actions and perform forensic analysis.

Note that this does not mean we should never rotate access keys. Rather, we can turn to NIST for its guidance on key management. NIST 800-57 recommends cryptoperiods of 1-2 years for asymmetric authentication keys in order to maximize operational efficiency. Beyond these particular cryptoperiods, the value of rotating keys regularly is in having the confidence you can, in fact, rotate them without incident. The time interval itself is mostly immaterial, but developing this confidence is important in the event of a key actually being compromised. In this case, you want to know you can act swiftly and revoke access without causing outages.

The funny thing about compliance is that, unless you’re going after actual regulatory standards such as FedRAMP or PCI compliance, controls are generally created by the company itself. Compliance auditors mostly ensure the company is following its own controls. So if you hear, “it’s a compliance requirement” or “that’s the way it’s always been done,” try to dig deeper to understand what risk the control is actually trying to mitigate. This allows you to have a dialog with InfoSec or compliance folks and possibly come to the table with better alternatives.

Authenticating Stackdriver Uptime Checks for Identity-Aware Proxy

Google Stackdriver provides a set of tools for monitoring and managing services running in GCP, AWS, or on-prem infrastructure. One feature Stackdriver has is “uptime checks,” which enable you to verify the availability of your service and track response latencies over time from up to six different geographic locations around the world. While Stackdriver uptime checks are not as feature-rich as other similar products such as Pingdom, they are also completely free. For GCP users, this provides a great starting point for quickly setting up health checks and alerting for your applications.

Last week I looked at implementing authentication and authorization for APIs in GCP using Cloud Identity-Aware Proxy (IAP). IAP provides an easy way to implement identity and access management (IAM) for applications and APIs in a centralized place. However, one thing you will bump into when using Stackdriver uptime checks in combination with IAP is authentication. For App Engine in particular, this can be a problem since there is no way to bypass IAP. All traffic, both internal and external to GCP, goes through it. Until Cloud IAM Conditions is released and generally available, there’s no way to—for example—open up a health-check endpoint with IAP.

While uptime checks have support for Basic HTTP authentication, there is no way to script more sophisticated request flows (e.g. to implement the OpenID Connect (OIDC) authentication flow for IAP-protected resources) or implement fine-grained IAM policies (as hinted at above, this is coming with IAP Context-Aware Access and IAM Conditions). So are we relegated to using Nagios or some other more complicated monitoring tool? Not necessarily. In this post, I’ll present a workaround solution for authenticating Stackdriver uptime checks for systems protected by IAP using Google Cloud Functions.

The Solution

The general strategy is to use a Cloud Function which can authenticate with IAP using a service account to proxy uptime checks to the application. Essentially, the proxy takes a request from a client, looks for a header containing a host, forwards the request to that host after performing the necessary authentication, and then forwards the response back to the client. The general architecture of this is shown below.

There are some trade-offs with this approach. The benefit is we get to rely on health checks that are fully managed by GCP and free of charge. Since Cloud Functions are also managed by GCP, there’s no operations involved beyond deploying the proxy and setting it up. The first two million invocations per month are free for Cloud Functions. If we have an uptime check running every five minutes from six different locations, that’s approximately 52,560 invocations per month. This means we could run roughly 38 different uptime checks without exceeding the free tier for invocations. In addition to invocations, the free tier offers 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month. Using the GCP pricing calculator, we can estimate the cost for our uptime check. It generally won’t come close to exceeding the free tier.
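The invocation arithmetic above can be reproduced with a few lines of Python. This is just a worked version of the figures already quoted, averaging a year over twelve months; it only covers invocations and ignores the compute and egress limits.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
CHECK_INTERVAL_MINUTES = 5
LOCATIONS = 6
FREE_INVOCATIONS_PER_MONTH = 2_000_000

# Each location calls the proxy once per interval; average the year over 12 months.
invocations_per_month = (
    MINUTES_PER_YEAR // CHECK_INTERVAL_MINUTES * LOCATIONS // 12
)

# Number of such uptime checks that fit within the free invocation quota.
checks_within_free_tier = FREE_INVOCATIONS_PER_MONTH // invocations_per_month

print(invocations_per_month, checks_within_free_tier)
```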

The downside to this approach is the check is no longer validating availability from the perspective of an end user. Because the actual service request is originating from Google’s infrastructure by way of a Cloud Function as opposed to Stackdriver itself, it’s not quite the same as a true end-to-end check. That said, both Cloud Functions and App Engine rely on the same Google Front End (GFE) infrastructure, so as long as both the proxy and App Engine application are located in the same region, this is probably not all that important. Besides, for App Engine at least, the value of the uptime check is really more around performing a full-stack probe of the application and its dependencies than monitoring the health of Google’s own infrastructure. That is one of the goals behind using managed services after all. The bigger downside is that the latency reported by the uptime check no longer accurately represents the application. It can still be useful for monitoring aggregate trends nonetheless.

The Implementation Setup

I’ve built an open-source implementation of the proxy as a Cloud Function in Python called gcp-oidc-proxy. It’s runnable out of the box without any modification. We’ll assume you have an IAP-protected application you want to set up a Stackdriver uptime check for. To deploy the proxy Cloud Function, first clone the repository to your machine, then from there run the following gcloud command:

$ gcloud functions deploy gcp-oidc-proxy \
   --runtime python37 \
   --entry-point handle_request \
   --trigger-http

This will deploy a new Cloud Function called gcp-oidc-proxy to your configured cloud project. It will assume the project’s default service account. Ordinarily, I would suggest creating a separate service account to limit scopes. This can be configured on the Cloud Function with the --service-account flag, which is under gcloud beta functions deploy at the time of this writing. We’ll omit this step for brevity, however.

Next, we need to add the “Service Account Actor” IAM role to the Cloud Function’s service account since it will need it to sign JWTs (more on this later). In the GCP console, go to IAM & admin, locate the appropriate service account (in this case, the default service account), and add the respective role.

The Cloud Function’s service account must also be added as a member to the IAP with the “IAP-secured Web App User” role in order to properly authenticate. Navigate to Identity-Aware Proxy in the GCP console, select the resource you wish to add the service account to, then click Add Member.

Find the OAuth2 client ID for the IAP by clicking on the options menu next to the IAP resource and select “Edit OAuth client.” Copy the client ID on the next page and then navigate to the newly deployed gcp-oidc-proxy Cloud Function. We need to configure a few environment variables, so click edit and then expand more at the bottom of the page. We’ll add four environment variables: CLIENT_ID, WHITELIST, AUTH_USERNAME, and AUTH_PASSWORD.

CLIENT_ID contains the OAuth2 client ID we copied for the IAP. WHITELIST contains a comma-separated list of URL paths to make accessible or * for everything (I’m using /ping in my example application), and AUTH_USERNAME and AUTH_PASSWORD set up Basic authentication for the Cloud Function. If these are omitted, authentication is disabled.

Save the changes to redeploy the function with the new environment variables. Next, we’ll set up a Stackdriver uptime check that uses the proxy to call our service. In the GCP console, navigate to Monitoring then Create Check from the Stackdriver UI. Skip any suggestions for creating a new uptime check. For the hostname, use the Cloud Function host. For the path, use /gcp-oidc-proxy/<your-endpoint>. The proxy will use the path to make a request to the protected resource.

Expand Advanced Options to set the Forward-Host to the host protected by IAP. The proxy uses this header to forward requests. Lastly, we’ll set the authentication username and password that we configured on the Cloud Function.

Click “Test” to ensure our configuration works and the check passes.

The Implementation Details

The remainder of this post will walk you through the implementation details of the proxy. The implementation closely resembles what we did to authenticate API consumers using a service account. We use a header called Forward-Host to allow the client to specify the IAP-authenticated host to forward requests to. If the header is not present, we just return a 400 error. We then use this host and the path of the original request to construct the proxy request and retain the HTTP method and headers (with the exception of the Host header, if present, since this can cause problems).
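The request-building step described above might look roughly like the following. This is a simplified sketch rather than the actual gcp-oidc-proxy code: the function name, the dictionary it returns, and the assumption that the target is always reached over https are mine.

```python
def build_proxy_request(method, path, headers):
    """Build the outgoing request, or return None if Forward-Host is missing."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    forward_host = lowered.get("forward-host")
    if not forward_host:
        return None  # the caller should respond with a 400 error

    # Retain the original headers except Host, which can cause problems.
    forwarded = {
        name: value for name, value in headers.items() if name.lower() != "host"
    }
    return {
        "method": method,
        "url": "https://{}{}".format(forward_host, path),  # https is assumed
        "headers": forwarded,
    }


request = build_proxy_request(
    "GET",
    "/ping",
    {"Host": "fn.example.com", "Forward-Host": "app.example.com", "Accept": "*/*"},
)
```

Returning None for a missing header keeps the HTTP response handling in the caller, which is where the 400 would be produced.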

Before sending the request, we perform the authentication process by generating a JWT signed by the service account and exchange it for a Google-signed OIDC token.

We can cache this token and renew it only once it expires. Then we set the Authorization header with the OIDC token and send the request.
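Caching the token could be sketched as below. The class and its fetch_token callable are hypothetical stand-ins: in the real proxy the fetch step would perform the JWT-for-OIDC-token exchange, while here it is injected so the caching logic can be shown on its own.

```python
import time


class TokenCache:
    """Hold a token and fetch a new one only once the current one has expired."""

    def __init__(self, fetch_token, clock=time.time):
        # fetch_token is assumed to return (token, expires_at_epoch_seconds).
        self._fetch_token = fetch_token
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        if self._token is None or self._clock() >= self._expires_at:
            self._token, self._expires_at = self._fetch_token()
        return self._token
```

In practice it may be worth renewing slightly before the expiry time so that a token does not expire while a request is in flight. Since a Cloud Function instance can be reused across invocations, a module-level cache like this can save a round trip on most requests.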

We simply forward on the resulting content body, status code, and headers. We strip HTTP/1.1 “hop-by-hop” headers since these are unsupported by WSGI and Python Cloud Functions rely on Flask. We also strip any Content-Encoding header since this can also cause problems.
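Filtering the response headers can be sketched as follows. The hop-by-hop names are the eight listed in the HTTP/1.1 specification (RFC 2616, section 13.5.1); the function name is an assumption for illustration.

```python
# Hop-by-hop headers defined by HTTP/1.1 (RFC 2616, section 13.5.1).
HOP_BY_HOP_HEADERS = frozenset([
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailers",
    "transfer-encoding",
    "upgrade",
])

# Content-Encoding is not hop-by-hop, but we drop it as well.
STRIPPED_HEADERS = HOP_BY_HOP_HEADERS | {"content-encoding"}


def filter_response_headers(headers):
    """Return a copy of `headers` without the headers the proxy must not forward."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in STRIPPED_HEADERS
    }
```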

Because this proxy allows clients to call into endpoints unauthenticated, we also implement a whitelist to expose only certain endpoints. The whitelist is a list of allowed paths passed in from an environment variable. Alternatively, we can whitelist * to allow all paths. Wildcarding could be implemented to make this even more flexible. We also implement a Basic auth decorator which is configured with environment variables since we can set up uptime checks with a username and password in Stackdriver.
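Here is a minimal sketch of the whitelist and Basic auth checks. It uses plain functions rather than a decorator to keep the example short, and the function names are hypothetical. The raw whitelist string and the credentials are assumed to come from the WHITELIST, AUTH_USERNAME, and AUTH_PASSWORD environment variables.

```python
import base64
import hmac


def parse_whitelist(raw):
    """Split a comma-separated WHITELIST value into a list of paths."""
    return [path.strip() for path in raw.split(",") if path.strip()]


def is_allowed(path, whitelist):
    """Allow exact path matches, or everything when * is whitelisted."""
    return "*" in whitelist or path in whitelist


def check_basic_auth(authorization, username, password):
    """Validate an Authorization header value against configured credentials."""
    if not username and not password:
        return True  # credentials omitted, so authentication is disabled
    if not authorization or not authorization.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(authorization[len("Basic "):]).decode("utf-8")
    except ValueError:  # covers both invalid base64 and invalid UTF-8
        return False
    expected = "{}:{}".format(username or "", password or "")
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))
```

Exact-match paths keep the exposed surface easy to reason about; wildcard support would need care so that a pattern does not open up more than intended.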

The only other code worth looking at in detail is how we set up the service account credentials and IAM Signer. A Cloud Function has a service account attached to it which allows it to assume the roles of that account. Cloud Functions rely on the Google Compute Engine metadata server which stores service account information among other things. However, the metadata server doesn’t expose the service account key used to sign the JWT, so instead we must use the IAM signBlob API to sign JWTs.
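To show how a JWT can be assembled when the signing key is not available locally, here is a hedged sketch in which signing is delegated to a callable. The sign_bytes parameter is a hypothetical stand-in for a call to the IAM signBlob API. The claim names and token endpoint reflect my understanding of Google's service account flow for IAP and should be treated as assumptions rather than a copy of what gcp-oidc-proxy does.

```python
import base64
import json
import time

# Assumed token endpoint used as the JWT audience in the exchange step.
TOKEN_URI = "https://www.googleapis.com/oauth2/v4/token"


def _b64url(data):
    """Base64url-encode bytes without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def build_signed_jwt(service_account_email, client_id, sign_bytes, now=None):
    """Assemble a JWT whose signature is produced by `sign_bytes`."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": service_account_email,
        "aud": TOKEN_URI,
        "target_audience": client_id,  # the IAP OAuth2 client ID
        "iat": issued_at,
        "exp": issued_at + 3600,
    }
    signing_input = b".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    # In the proxy this would be a remote signing call, not a local key operation.
    signature = sign_bytes(signing_input)
    return (signing_input + b"." + _b64url(signature)).decode("ascii")
```

Because the signer is just a function from bytes to bytes, the same code works whether the signature comes from a remote API or, in tests, from a stub.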

Conclusion

It’s not a particularly simple solution, but it gets the job done. The setup of the Cloud Function could definitely be scripted as well. Once IAM Conditions is generally available, it should be possible to expose certain endpoints in a way that is accessible to Stackdriver without the need for the OIDC proxy. That said, it’s not clear if there is a way to implement uptime checks without exposing an endpoint at all since there is currently no way to assign a service account to a check. Ideally, we would be able to assign a service account and use that with IAP Context-Aware Access to allow the uptime check to access protected endpoints.