Security, Maintainability, Velocity: Choose One

Companies have three competing priorities when it comes to software development: security, maintainability, and velocity. I’ll elaborate on what I mean by each of these in just a bit. When I originally started thinking about this, I thought of it in the context of the “good, fast, cheap: choose two” project management triangle. But after thinking about it for more than a couple minutes, and as I related it to my own experience and observations at other companies, I realized that in practice it’s much worse. For most organizations building software, it’s more like security, maintainability, velocity: choose one.

The Software Development Triangle

Of course, most organizations are not explicitly making these trade-offs. Instead, the internal preferences and culture of the company reveal them. I believe many organizations, consciously or not, accept this trade-off as an immovable constraint. More risk-averse groups might even welcome it. Though the triangle most often results in a “choose one” sort of compromise, it’s not some innate law. You can, in fact, have all three with a little bit of careful thought and consideration. And while reality is always more nuanced than what this simple triangle suggests, I find looking at the extremes helps to ground the conversation. It emphasizes the natural tension between these different concerns. Bringing that tension to the forefront allows us to be more intentional about how we manage it.

It wasn’t until recently that I distilled down these trade-offs and mapped them into the triangle shown above, but we’ve been helping clients navigate this exact set of competing priorities for over six years at Real Kinetic. We built Konfig as a direct response to this since it was such a common challenge for organizations. We’re excited to offer a solution which is the culmination of years of consulting and which allows organizations to no longer compromise. But first, let’s explore the trade-offs I’m talking about.

Security

Companies, especially mid- to large-sized organizations, care a great deal about security (and rightfully so!). That’s not to say startups don’t care about it, but the stakes are just much higher for enterprises. They are terrified of being the next big name in the headlines after a major data breach or ransomware attack. I call this priority security for brevity, but it actually consists of two things which I think are closely aligned: security and governance.

Governance directly supports security in addition to a number of other concerns like reliability, risk management, and compliance. This is sometimes referred to as Governance, Risk, and Compliance or GRC. Enterprises need control over, and visibility into, all of the pieces that go into building and delivering software. This is where things like SDLC, separation of duties, and access management come into play. Startups may play it more fast and loose, but more mature organizations frequently have compliance or regulatory obligations like SOC 2 Type II, PCI DSS, FINRA, FedRAMP, and so forth. Even if they don’t have regulatory constraints, they usually have a reputation that needs to be protected, which typically means more rigid processes and internal controls. This is where things can go sideways for larger organizations as it usually leads to practices like change review boards, enterprise (ivory tower) architecture programs, and SAFe. Enterprises tend to be pretty good at governance, but it comes at a cost.

It should come as no surprise that security and governance are in conflict with speed, but they are often in contention with well-architected and maintainable systems as well. When organizations enforce strict governance and security processes, developers often respond by adopting bad practices. Let me give an example I have seen firsthand at an organization.

A company has been experiencing stability and reliability issues with its software systems. This has caused several high-profile, revenue-impacting outages which have gotten executives’ attention. The response is to implement a series of process improvements to effectively slow down the release of changes to production. This includes a change review board to sign off on changes going to production and a production gating process which new workloads must pass before they can be released to production. The hope is that these process changes will reduce defects and improve reliability of systems in production. At this point, we are wittingly trading off velocity.

What actually happened is that developers began batching up more and more changes to get through the change review board which resulted in “big bang” releases. This caused even more stability issues because now large sets of changes were being released which were increasingly complex, difficult to QA, and harder to troubleshoot. Rollbacks became difficult to impossible due to the size and complexity of releases, increasing the impact of defects. Release backlogs quickly grew, prompting developers to move on to more work rather than sit idle, which further compounded the issue and led to context switching. Decreasing the frequency of deployments only exacerbated these problems. Counterintuitively, slowing down actually increased risk.

To avoid the production gating process, developers began adding functionality to existing services which, architecturally speaking, should have gone into new services. Services became bloated grab bags of miscellaneous functionality since it was easier to piggyback features onto workloads already in production than it was to run the gauntlet of getting a new service to production. These processes were directly and unwittingly impacting system architecture and maintainability. In economics, this is called a “negative externality.” We may have security and governance, but we’ve traded off velocity and maintainability. Adding insult to injury, the processes were not even accomplishing the original goal of improving reliability, they were making it worse!

Maintainability

It’s critical that software systems are not just built to purpose, but also built to last. This means they need to be reliable, scalable, and evolvable. They need to be conducive to finding and correcting bugs. They need to support changing requirements such that new features and functionality can be delivered rapidly. They need to be efficient and cost effective. More generally, software needs to be built in a way that maximizes its useful life.

We simply call this priority maintainability. While it covers a lot, it can basically be summarized as: is the system architected and implemented well? Is it following best practices? Is there a lot of tech debt? How much thought and care has been put into design and implementation? Much of this comes down to gut feel, but an experienced engineer can usually intuit whether or not a system is maintainable pretty quickly. Good proxies are often the change failure rate, mean time to recovery, and lead time for implementing new features.

Maintainability’s benefits are more of a long tail. A maintainable system is easier to extend and add new features later, easier to identify and fix bugs, and generally experiences fewer defects. However, the cost for that speed is basically frontloaded. It usually means moving slower towards the beginning while reaping the rewards later. Conversely, it’s easy to go fast if you’re just hacking something together without much concern for maintainability, but you will likely pay the cost later. Companies can become crippled by tech debt and unmaintained legacy systems to the point of “bankruptcy” in which they are completely stuck. This usually leads to major refactors or rewrites which have their own set of problems.

Additionally, building systems that are both maintainable and secure can be surprisingly difficult, especially in more dynamic cloud environments. If you’ve ever dealt with IAM, for example, you know exactly what I mean: scoping identities with the right roles or permissions, securely managing credentials and secrets, configuring resources correctly, ensuring proper data protections are in place, and so on. Misconfigurations are frequently the cause of the major security breaches you see in the headlines. The unfortunate reality is that security practices and tooling lag in the industry, and security is routinely treated as an afterthought. Often it’s a matter of “we’ll get it working and then we’ll come back later and fix up the security stuff,” but later never happens. Instead, an IAM principal is left with overly broad access or a resource is configured improperly. This becomes 10x worse when you are unfamiliar with the cloud, which is where many of our clients tend to find themselves.

Velocity

The last competing priority is simply speed to production or velocity. This one probably requires the least explanation, but it’s consistently the priority that is sacrificed the most. In fact, many organizations may even view it as the enemy of the first two priorities. They might equate moving fast with being reckless. Nonetheless, companies are feeling the pressure to deliver faster now more than ever, but it’s much more than just shipping quickly. It’s about developing the ability to adapt and respond to changing market conditions fast and fluidly. Big companies are constantly on the lookout for smaller, more nimble players who might disrupt their business. This is in part why more and more of these companies are prioritizing the move to cloud. The data center has long been their moat and castle when it comes to security and governance, however, and the cloud presents a new and serious risk for them in this space. As a result, velocity typically pays the price.

As I mentioned earlier, velocity is commonly in tension with maintainability as well; it’s usually just a matter of whether that premium is frontloaded or backloaded. More often than not, we can choose to move quickly up front but pay a penalty later on, or vice versa. Truthfully though, if you’ve followed the DORA State of DevOps Reports, you know that a lot of companies neither frontload nor backload their velocity premium—they are just slow all around. These are usually more legacy-minded IT shops and organizations that treat software development as an IT cost center. These are also usually the groups that bias more towards security and governance, but they’re probably the most susceptible to disruption. “Move fast and break things” is not a phrase you will hear permeating these organizations, yet they all desire to modernize and accelerate. We regularly watch these companies’ teams spend months configuring infrastructure, and what they construct is often complex, fragile, and insecure.

Choose Three

Businesses today are demanding strong security and governance, well-structured and maintainable infrastructure, and faster speed to production. The reality, however, is that these three priorities are competing with each other, and companies often end up with one of the priorities dominating the others. If we can acknowledge these trade-offs, we can work to better understand and address them.

We built Konfig as a solution that tackles this head-on by providing an opinionated configuration of Google Cloud Platform and GitLab. Most organizations start from a position where they must assemble the building blocks in a way that allows them to deliver software effectively, but their own biases result in a solution that skews one way or the other. Konfig instead provides a turnkey experience that minimizes time-to-production, is secure by default, and has governance and best practices built in from the start. Rather than having to choose one of security, maintainability, and velocity, don’t compromise—have all three. In a follow-up post I’ll explain how Konfig addresses concerns like security and governance, infrastructure maintainability, and speed to production in a “by default” way. We’ll see how IAM can be securely managed for us, how we can enforce architecture standards and patterns, and how we can enable developers to ship production workloads quickly by providing autonomy with guardrails and stable infrastructure.

Using Google-Managed Certificates and Identity-Aware Proxy With GKE

Ingress on Google Kubernetes Engine (GKE) uses a Google Cloud Load Balancer (GCLB). GCLB provides a single anycast IP that fronts all of your backend compute instances along with a lot of other rich features. In order to create a GCLB that uses HTTPS, an SSL certificate needs to be associated with the ingress resource. This certificate can either be self-managed or Google-managed. The benefit of using Google-managed certificates is that they are provisioned, renewed, and managed for your domain names by Google. These managed certificates can also be configured directly with GKE, meaning we can configure our certificates the same way we declaratively configure our other Kubernetes resources such as deployments, services, and ingresses.

GKE also supports Identity-Aware Proxy (IAP), which is a fully managed solution for implementing a zero-trust security model for applications and VMs. With IAP, we can secure workloads in GCP using identity and context. For example, this might be based on attributes like user identity, device security status, region, or IP address. This allows users to access applications securely from untrusted networks without the need for a VPN. IAP is a powerful way to implement authentication and authorization for corporate applications that are run internally on GKE, Google Compute Engine (GCE), or App Engine. This might be applications such as Jira, GitLab, Jenkins, or production-support portals.

IAP works in relation to GCLB in order to secure GKE workloads. In this tutorial, I’ll walk through deploying a workload to a GKE cluster, setting up GCLB ingress for it with a global static IP address, configuring a Google-managed SSL certificate to support HTTPS traffic, and enabling IAP to secure access to the application. In order to follow along, you’ll need a GKE cluster and domain name to use for the application. In case you want to skip ahead, all of the Kubernetes configuration for this tutorial is available here.

Deploying an Application Behind GCLB With a Managed Certificate

First, let’s deploy our application to GKE. We’ll use a Hello World application to test this out. Our application will consist of a Kubernetes deployment and service. Below is the configuration for these:

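The exact manifests will vary, but a minimal sketch might look like the following (the hello-app sample image, resource names, and port are illustrative assumptions; the service is exposed as a NodePort so the GCLB created by the ingress can route to it):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: gcr.io/google-samples/hello-app:1.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
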
Apply these with kubectl:

$ kubectl apply -f .

At this point, our application is not yet accessible from outside the cluster since we haven’t set up an ingress. Before we do that, we need to create a static IP address using the following command:

$ gcloud compute addresses create web-static-ip --global

The above will reserve a static external IP called “web-static-ip.” We now can create an ingress resource using this IP address. Note the “kubernetes.io/ingress.global-static-ip-name” annotation in the configuration:

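A sketch of that ingress might look like the following, assuming the service name from the example above (the backend port must match the service’s port):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
spec:
  defaultBackend:
    service:
      name: web
      port:
        number: 80
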
Applying this with kubectl will provision a GCLB that will route traffic into our service. It can take a few minutes for the load balancer to become active and health checks to begin working. Traffic won’t be served until that happens, so use the following command to check that traffic is healthy:

$ curl -i http://<web-static-ip>

You can find <web-static-ip> with:

$ gcloud compute addresses describe web-static-ip --global

Once you start getting a successful response, update your DNS to point your domain name to the static IP address. Wait until the DNS change has propagated and your domain name points to the application running in GKE. This could take 30 minutes or so.

After DNS has been updated, we’ll configure HTTPS. To do this, we need to create a Google-managed SSL certificate. This can be managed by GKE using the following configuration:

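A sketch of the ManagedCertificate resource might look like this (the certificate name is an illustrative assumption):

apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: web-certificate
spec:
  domains:
  - example.com
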
Ensure that “example.com” is replaced with the domain name you’re using.

We now need to update our ingress to use the new managed certificate. This is done using the “networking.gke.io/managed-certificates” annotation.

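The updated ingress might look something like this sketch, assuming the same resource names as before:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
    networking.gke.io/managed-certificates: web-certificate
spec:
  defaultBackend:
    service:
      name: web
      port:
        number: 80
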
We’ll need to wait a bit for the certificate to finish provisioning. This can take up to 15 minutes. Once it’s done, we should see HTTPS traffic flowing correctly:

$ curl -i https://example.com

We now have a working example of an application running in GKE behind a GCLB with a static IP address and domain name secured with TLS. Now we’ll finish up by enabling IAP to control access to the application.

Securing the Application With Identity-Aware Proxy

If you’re enabling IAP for the first time, you’ll need to configure your project’s OAuth consent screen. The steps here will walk through how to do that. This consent screen is what users will see when they attempt to access the application before logging in.

Once IAP is enabled and the OAuth consent screen has been configured, there should be an OAuth 2 client ID created in your GCP project. You can find this under “OAuth 2.0 Client IDs” in the “APIs & Services” > “Credentials” section of the cloud console. When you click on this credential, you’ll find a client ID and client secret. These need to be provided to Kubernetes as a secret so they can be used by a BackendConfig for configuring IAP. Create the secret with the following command, replacing “xxx” with the respective credentials:

$ kubectl create secret generic iap-oauth-client-id \
--from-literal=client_id=xxx \
--from-literal=client_secret=xxx

BackendConfig is a Kubernetes custom resource used to configure ingress in GKE. This includes features such as IAP, Cloud CDN, Cloud Armor, and others. Apply the following BackendConfig configuration using kubectl, which will enable IAP and associate it with your OAuth client credentials:

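A sketch of what that might look like, referencing the secret created above (the name config-default is an assumption that matches the annotation used in the next step):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: config-default
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: iap-oauth-client-id
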
We also need to ensure there are service ports associated with the BackendConfig in order to trigger turning on IAP. One way to do this is to make all ports for the service default to the BackendConfig, which is done by setting the “beta.cloud.google.com/backend-config” annotation to {"default": "config-default"} in the service resource. See below for the updated service configuration.

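Continuing with the assumed service from earlier, the annotated service might look like this sketch:

apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    beta.cloud.google.com/backend-config: '{"default": "config-default"}'
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
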
Once you’ve applied the annotation to the service, wait a couple minutes for the infrastructure to settle. IAP should now be working. You’ll need to assign the “IAP-secured Web App User” role in IAP to any users or groups who should have access to the application. Upon accessing the application, you should now be greeted with a login screen.

Your Kubernetes workload is now secured by IAP! Do note that VPC firewall rules can be configured to bypass IAP, such as rules that allow traffic internal to your VPC or GKE cluster. IAP will provide a warning indicating which firewall rules allow bypassing it.

For an extra layer of security, IAP sets signed headers on inbound requests which can be verified by the application. This is helpful in the event that IAP is accidentally disabled or misconfigured or if firewall rules are improperly set.

Together with GCLB and GCP-managed certificates, IAP provides a great solution for serving and securing internal applications that can be accessed anywhere without the need for a VPN.

Zero-Trust Security on GCP With Context-Aware Access

A lot of our clients at Real Kinetic leverage serverless on GCP to quickly build applications with minimal operations overhead. Serverless is one of the things that truly differentiates GCP from other cloud providers, and App Engine is a big component of this. Many of these companies come from an on-prem world and, as a result, tend to favor perimeter-based security models. They rely heavily on things like IP and network restrictions, VPNs, corporate intranets, and so forth. Unfortunately, this type of security model doesn’t always fit nicely with serverless due to the elastic and dynamic nature of serverless systems.

Recently, I worked with a client who was building an application for internal support staff on App Engine. They were using Identity-Aware Proxy (IAP) to authenticate users and authorize access to the application. IAP provides a fully managed solution for implementing a zero-trust access model for App Engine and Compute Engine. In this case, their G Suite user directory was backed by Active Directory, which allowed them to manage access to the application using Single Sign-On and AD groups.

Everything was great until the team hit a bit of a snag when they went through their application vulnerability assessment. Because it was for internal users, the security team requested the application be restricted to the corporate network. While I’m deeply skeptical of the value this adds in terms of security—the application was already protected by SSO and two-factor authentication and IAP cannot be bypassed with App Engine—I shared my concerns and started evaluating options. Sometimes that’s just the way things go in a larger, older organization. Culture shifts are hard and take time.

App Engine has firewall rules built in which allow you to secure incoming traffic to your application with allow/deny rules based on IP, so it seemed like an easy fix. The team would be in production in no time!

App Engine firewall rules

Unfortunately, there are some issues with how these firewall rules work depending on the application architecture. All traffic to App Engine goes through Google Front End (GFE) servers. This provides numerous benefits including TLS termination, DDoS protection, DNS, load balancing, firewall, and integration with IAP. It can present problems, however, if you have multiple App Engine services that communicate with each other internally. For example, imagine you have a frontend service which talks to a backend service.

App Engine does not provide a static IP address and instead relies on a large, dynamic pool of IP addresses. Two sequential outbound calls from the same application can appear to originate from two different IP addresses. One option is to allow all possible App Engine IPs, but this is riddled with issues. For one, Google uses netblocks that dynamically change and are encoded in Sender Policy Framework (SPF) records. To determine all of the IPs App Engine is currently using, you need to recursively perform DNS lookups by fetching the current set of netblocks and then doing a DNS lookup for each netblock. These results are not static, meaning you would need to do the lookups and update firewall rules continually. Worse yet, allowing all possible App Engine IPs would be self-defeating since it would be trivial for an attacker to work around by setting up their own App Engine application to gain access, assuming there isn’t any additional security beyond the firewall.

Another, slightly better option is to set up a proxy on Compute Engine in the same region as your App Engine application. With this, you get a static IP address. The downside here is that it’s an additional piece of infrastructure that must be managed, which isn’t great when you’re shooting for a serverless architecture.

Luckily, there is a better solution—one that fits our serverless model and enables us to control external traffic while allowing App Engine services to securely communicate internally. IAP supports context-aware access, which allows enforcing granular access controls for web applications, VMs, and GCP APIs based on an end-user’s identity and request context. Essentially, context-aware access brings a richer zero-trust model to App Engine and other GCP services.

To set up a network firewall in IAP, we first need to create an Access Level in the Access Context Manager. Access Levels are a way to add an extra level of security based on request attributes such as IP address, region, time of day, or device. In the client’s case, they can create an Access Level to only allow access from their corporate network.

GCP Access Context Manager

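As a rough sketch, one way to express such a condition is a basic level spec file passed to gcloud access-context-manager levels create (the CIDR block below is a placeholder for the corporate egress range):

- ipSubnetworks:
  - 203.0.113.0/24
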
We can then add the Access Level to roles that are assigned to users or groups in IAP. This means even if users are authenticated, they must be on the corporate network to access the application.

Cloud Identity-Aware Proxy roles

To allow App Engine services to communicate freely, we simply need to assign the IAP-secured Web App User role without the Access Level to the App Engine default service account. Services will then authenticate as usual using OpenID Connect without the added network restriction. The default service account is managed by GCP and there are no associated credentials, so this provides a solid security posture.

Now, at this point, we’ve solved the IP firewall problem, but that’s not really in the spirit of zero-trust, right? Zero-trust is a security model based on the principle that organizations should not inherently trust anything inside or outside of their perimeter and should instead verify anything trying to connect to their systems. Having to connect to a VPN in order to access an application in the cloud is kind of a bummer, especially when the corporate VPN goes down. COVID-19 has made a lot of organizations feel this pain. Fortunately, Access Levels can be a lot smarter than providing simple lists of approved IP addresses. With the Cloud IAM Conditions Framework, we can even write custom rules to allow access based on URL path, resource type, or other request attributes.

At this point, I talked the client through the Endpoint Verification process and how we can shift away from a perimeter-based security model to a defense-in-depth, zero-trust model. Rather than requiring the end-user to be signed in from the corporate network, we can require them to be signed in from a trusted, corporate-owned device from anywhere. We can require that the device has a screen lock and is encrypted or has a minimum OS version.

With IAP and context-aware access, we can build layered security on top of applications and resources without the need for a VPN, while still centrally managing access. This can even extend beyond GCP to applications hosted on-prem or in other cloud platforms like AWS and Azure. Enterprises don’t have to move away from more traditional security models all at once. This pattern allows you to gradually shift by adding and removing Access Levels and attributes over time. Zero-trust becomes much easier to implement within large organizations when they don’t have to flip a switch.

Security by Happenstance

Key rotation, auditing, and secure CI/CD

Companies often require employees to regularly change their passwords for security purposes. PCI compliance, for example, requires that passwords be changed every 90 days. However, NIST, whose guidelines commonly become the foundation for security best practices across countless organizations, recently revised its recommendations around password security. Its Digital Identity Guidelines (NIST 800-63-3) now recommend removing periodic password-change requirements due to a growing body of research suggesting that frequent password changes actually make security worse. This is because these requirements encourage the use of passwords which are more susceptible to cracking (e.g. incrementing a number or altering a single character) or result in people writing their passwords down.

Unfortunately, many companies have now adapted these requirements to other parts of their IT infrastructure. This is largely due to legacy holdover practices which have crept into modern systems (or simply lingered in older ones), i.e. it’s tech debt. Specifically, I’m talking about practices like using username/password credentials for applications or systems to access resources, rather than for individual end users. These special credentials may even provide a system free rein within a network much like a user might have, especially if the network isn’t segmented (often these companies have adopted a perimeter-security model, relying on a strong outer wall to protect their network). As a result, because they are passwords just like a normal user would have, they are subject to the usual 90-day rotation policy or whatever the case may be.

Today, I think we can say with certainty that—along with the perimeter-security model—relying on usernames and passwords for system credentials is a security anti-pattern (and really, user credentials should be relying on multi-factor authentication). With protocols like OAuth2 and OpenID Connect, we can replace these system credentials with cryptographically strong keys. But because these keys, in a way, act like username/passwords, there is a tendency to apply the same 90-day rotation policy to them as well. This is a misguided practice for several reasons and is actually quite risky.

First, changing a user’s password is far less risky than rotating an access key for a live, production system. If we’re changing keys for production systems frequently, there is potential for prolonged outages. The more you’re touching these keys, the more exposure and opportunity for mistakes there is. For a user, the worst case is they get temporarily locked out. For a system, the worst case is a critical user-facing application goes down. Second, cryptographically strong keys are not “guessable” like a password frequently is. Since they are generated by an algorithm and not intended to be input by a human, they are long and complex. And unlike passwords, keys are not generally susceptible to social engineering. Lastly, if we are requiring keys to be rotated every 90 days, an attacker can still have up to 89 days to do whatever they want in the event of a key being compromised. From a security perspective, this frankly isn’t good enough for me. It’s security by happenstance. The Twitter thread below describes a sequence of events that occurred after an AWS key was accidentally leaked to a public code repository which illustrates this point.

To recap that thread, here’s a timeline of what happened:

  1. AWS credentials are pushed to a public repository on GitHub.
  2. 55 seconds later, an email is received from AWS telling the user that their account is compromised and a support ticket is automatically opened.
  3. A minute later (2 minutes after the push), an attacker attempts to use the credentials to list IAM access keys in order to perform a privilege escalation. Since the IAM role attached to the credentials is insufficient, the attempt fails and an event is logged in CloudTrail.
  4. The user disables the key 5 minutes and 58 seconds after the push.
  5. 24 minutes and 58 seconds after the push, GuardDuty fires a notification indicating anomalous behavior: “APIs commonly used to discover the users, groups, policies and permissions in an account, was invoked by IAM principal some_user under unusual circumstances. Such activity is not typically seen from this principal.”

Given this timeline, rotating access keys every 90 days would do absolutely no good. If anything, it would provide a false sense of security. An attack was made a mere 2 minutes after the key was compromised. It makes no difference if it’s rotated every 90 days or every 9 minutes.

If 90-day key rotation isn’t the answer, what is? The timeline above already hits on it. System credentials, i.e. service accounts, should have very limited permissions following the principle of least privilege. For instance, a CI server which builds artifacts should have a service account which only allows it to push artifacts to a storage bucket and nothing else. This idea should be applied to every part of your system.
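
For illustration, the relevant binding in the artifact bucket’s IAM policy might look something like the following sketch (the service account and project are hypothetical; roles/storage.objectCreator allows creating objects but not viewing, overwriting, or deleting them):

bindings:
- role: roles/storage.objectCreator
  members:
  - serviceAccount:ci-builder@my-project.iam.gserviceaccount.com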

For things running inside the cloud, such as AWS or GCP, we can usually avoid the need for access keys altogether. With GCP, we rely on service accounts with GCP-managed keys. The keys for these service accounts are not exposed to users at all and are, in fact, rotated approximately every two weeks (Google is able to do this because they own all of the infrastructure involved and have mature automation). With AWS, we rely on Identity and Access Management (IAM) roles, which can be assumed by the environment without having to deal with a token or key. This situation is ideal because we can avoid key exposure by never having explicit keys in the first place.

For things running outside the cloud, it’s a bit more involved. In these cases, we must deal with credentials somehow. Ideally, we can limit the lifetime of these credentials, such as with AWS’ Security Token Service (STS) or GCP’s short-lived service account credentials. However, in some situations, we may need longer-lived credentials. In either case, the critical piece is using limited-privilege credentials such that if a key is compromised, the scope of the damage is narrow.

The other key component of this is auditing. Both AWS and GCP offer extensive audit logs for governance, compliance, operational auditing, and risk auditing of your cloud resources. With this, we can audit service account usage, detect anomalous behavior, and immediately take action—such as revoking the credential—rather than waiting up to 90 days to rotate it. Amazon also has GuardDuty which provides intelligent threat detection and continuous monitoring which can identify unauthorized activity as seen in the scenario above. Additionally, access credentials and other secrets should never be stored in source code, but tools like git-secrets, GitGuardian, and truffleHog can help detect when it does happen.

Let’s look at a hypothetical CI/CD pipeline as an example which ties these ideas together. Below is the first pass of our proposed pipeline. In this case, we’re targeting GCP, but the same ideas apply to other environments.

CircleCI is a SaaS-based CI/CD solution. Because it’s deploying to GCP, it will need a service account with the appropriate IAM roles. CircleCI has support for storing secret environment variables, which is how we would store the service account’s credentials. However, there are some downsides to this approach.

First, the service account that Circle needs in order to make deploys could require a fairly wide set of privileges, like accessing a container registry and deploying to a runtime. Because it lives outside of GCP, this service account has a user-managed key. While we could use a KMS to encrypt it or a vault that provides short-lived credentials, we ultimately will need some kind of credential that allows Circle to access these services, so at best we end up with a weird Russian-doll situation. If we’re rotating keys, we might wind up having to do so recursively, and the value of all this indirection starts to come into question. Second, these credentials—or any other application secrets—could easily be dumped out as part of the build script. This isn’t good if we wanted Circle to deploy to a locked-down production environment. Developers could potentially dump out the production service account credentials and now they would be able to make deploys to that environment, circumventing our pipeline.

This is why splitting out Continuous Integration (CI) from Continuous Delivery (CD) is important. If, instead, Circle was only responsible for CI and we introduced a separate component for CD, such as Spinnaker, we can solve this problem. Using this approach, now Circle only needs the ability to push an artifact to a Google Cloud Storage bucket or Container Registry. Outside of the service account credentials needed to do this, it doesn’t need to deal with secrets at all. This means there’s no way to dump out secrets in the build because they will be injected later by Spinnaker. The value of the service account credentials is also much more limited. If compromised, it only allows someone to push artifacts to a repository. Spinnaker, which would run in GCP, would then pull secrets from a vault (e.g. Hashicorp’s Vault) and deploy the artifact relying on credentials assumed from the environment. Thus, Spinnaker only needs permissions to pull artifacts and secrets and deploy to the runtime. This pipeline now looks something like the following:

With this pipeline, we now have traceability from code commit and pull request (PR) to deploy. We can then scan audit logs to detect anomalous behavior—a push to an artifact repository that is not associated with the CircleCI service account or a deployment that does not originate from Spinnaker, for example. Likewise, we can ensure these processes correlate back to an actual GitHub PR or CircleCI build. If they don’t, we know something fishy is going on.

To summarize, requiring frequent rotations of access keys is an outdated practice. It’s a remnant of password policies, which themselves have increasingly been repudiated by security experts. While similar in some ways, keys are fundamentally different from a username and password, particularly in the case of a service account with fine-grained permissions. Without mature practices and automation, rotating these keys frequently is an inherently risky operation that opens up the opportunity for downtime.

Instead, it’s better to rely on tightly scoped (and, if possible, short-lived) service accounts and usage auditing to detect abnormal behavior. This allows us to take action immediately rather than waiting for some arbitrary period to rotate keys where an attacker may have an unspecified amount of time to do as they please. With end-to-end traceability and evidence collection, we can more easily identify suspicious actions and perform forensic analysis.

Note that this does not mean we should never rotate access keys. Rather, we can turn to NIST for its guidance on key management. NIST 800-57 recommends cryptoperiods of 1-2 years for asymmetric authentication keys in order to maximize operational efficiency. Beyond these particular cryptoperiods, the value of rotating keys regularly is in having the confidence you can, in fact, rotate them without incident. The time interval itself is mostly immaterial, but developing this confidence is important in the event of a key actually being compromised. In this case, you want to know you can act swiftly and revoke access without causing outages.

The funny thing about compliance is that, unless you’re going after actual regulatory standards such as FedRAMP or PCI compliance, controls are generally created by the company itself. Compliance auditors mostly ensure the company is following its own controls. So if you hear, “it’s a compliance requirement” or “that’s the way it’s always been done,” try to dig deeper to understand what risk the control is actually trying to mitigate. This allows you to have a dialog with InfoSec or compliance folks and possibly come to the table with better alternatives.

Authenticating Stackdriver Uptime Checks for Identity-Aware Proxy

Google Stackdriver provides a set of tools for monitoring and managing services running in GCP, AWS, or on-prem infrastructure. One feature Stackdriver has is “uptime checks,” which enable you to verify the availability of your service and track response latencies over time from up to six different geographic locations around the world. While Stackdriver uptime checks are not as feature-rich as other similar products such as Pingdom, they are also completely free. For GCP users, this provides a great starting point for quickly setting up health checks and alerting for your applications.

Last week I looked at implementing authentication and authorization for APIs in GCP using Cloud Identity-Aware Proxy (IAP). IAP provides an easy way to implement identity and access management (IAM) for applications and APIs in a centralized place. However, one thing you will bump into when using Stackdriver uptime checks in combination with IAP is authentication. For App Engine in particular, this can be a problem since there is no way to bypass IAP. All traffic, both internal and external to GCP, goes through it. Until Cloud IAM Conditions is released and generally available, there’s no way to—for example—open up a health-check endpoint with IAP.

While uptime checks have support for Basic HTTP authentication, there is no way to script more sophisticated request flows (e.g. to implement the OpenID Connect (OIDC) authentication flow for IAP-protected resources) or implement fine-grained IAM policies (as hinted at above, this is coming with IAP Context-Aware Access and IAM Conditions). So are we relegated to using Nagios or some other more complicated monitoring tool? Not necessarily. In this post, I’ll present a workaround solution for authenticating Stackdriver uptime checks for systems protected by IAP using Google Cloud Functions.

The Solution

The general strategy is to use a Cloud Function which can authenticate with IAP using a service account to proxy uptime checks to the application. Essentially, the proxy takes a request from a client, looks for a header containing a host, forwards the request to that host after performing the necessary authentication, and then forwards the response back to the client. The general architecture of this is shown below.

There are some trade-offs with this approach. The benefit is we get to rely on health checks that are fully managed by GCP and free of charge. Since Cloud Functions are also managed by GCP, there’s no operational overhead beyond deploying the proxy and setting it up. The first two million invocations per month are free for Cloud Functions. If we have an uptime check running every five minutes from six different locations, that’s approximately 52,560 invocations per month. This means we could run roughly 38 different uptime checks without exceeding the free tier for invocations. In addition to invocations, the free tier offers 400,000 GB-seconds and 200,000 GHz-seconds of compute time and 5 GB of Internet egress traffic per month. Using the GCP pricing calculator, we can estimate the cost for our uptime check. It generally won’t come close to exceeding the free tier.

The downside to this approach is the check is no longer validating availability from the perspective of an end user. Because the actual service request is originating from Google’s infrastructure by way of a Cloud Function as opposed to Stackdriver itself, it’s not quite the same as a true end-to-end check. That said, both Cloud Functions and App Engine rely on the same Google Front End (GFE) infrastructure, so as long as both the proxy and App Engine application are located in the same region, this is probably not all that important. Besides, for App Engine at least, the value of the uptime check is really more around performing a full-stack probe of the application and its dependencies than monitoring the health of Google’s own infrastructure. That is one of the goals behind using managed services after all. The bigger downside is that the latency reported by the uptime check no longer accurately represents the application. It can still be useful for monitoring aggregate trends nonetheless.

The Implementation Setup

I’ve built an open-source implementation of the proxy as a Cloud Function in Python called gcp-oidc-proxy. It’s runnable out of the box without any modification. We’ll assume you have an IAP-protected application you want to set up a Stackdriver uptime check for. To deploy the proxy Cloud Function, first clone the repository to your machine, then from there run the following gcloud command:

$ gcloud functions deploy gcp-oidc-proxy \
   --runtime python37 \
   --entry-point handle_request \
   --trigger-http

This will deploy a new Cloud Function called gcp-oidc-proxy to your configured cloud project. It will assume the project’s default service account. Ordinarily, I would suggest creating a separate service account to limit scopes. This can be configured on the Cloud Function with the --service-account flag, which is under gcloud beta functions deploy at the time of this writing. We’ll omit this step for brevity, however.

Next, we need to add the “Service Account Actor” IAM role to the Cloud Function’s service account since it will need it to sign JWTs (more on this later). In the GCP console, go to IAM & admin, locate the appropriate service account (in this case, the default service account), and add the respective role.

The Cloud Function’s service account must also be added as a member to the IAP with the “IAP-secured Web App User” role in order to properly authenticate. Navigate to Identity-Aware Proxy in the GCP console, select the resource you wish to add the service account to, then click Add Member.

Find the OAuth2 client ID for the IAP by clicking on the options menu next to the IAP resource and select “Edit OAuth client.” Copy the client ID on the next page and then navigate to the newly deployed gcp-oidc-proxy Cloud Function. We need to configure a few environment variables, so click edit and then expand more at the bottom of the page. We’ll add four environment variables: CLIENT_ID, WHITELIST, AUTH_USERNAME, and AUTH_PASSWORD.

CLIENT_ID contains the OAuth2 client ID we copied for the IAP. WHITELIST contains a comma-separated list of URL paths to make accessible or * for everything (I’m using /ping in my example application), and AUTH_USERNAME and AUTH_PASSWORD set up Basic authentication for the Cloud Function. If these are omitted, authentication is disabled.
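
If you’d rather not set these through the console, one option is gcloud’s --env-vars-file flag, which takes a YAML file of key/value pairs. A sketch (the values shown are placeholders):

CLIENT_ID: <your-oauth2-client-id>
WHITELIST: /ping
AUTH_USERNAME: <username>
AUTH_PASSWORD: <password>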

Save the changes to redeploy the function with the new environment variables. Next, we’ll set up a Stackdriver uptime check that uses the proxy to call our service. In the GCP console, navigate to Monitoring then Create Check from the Stackdriver UI. Skip any suggestions for creating a new uptime check. For the hostname, use the Cloud Function host. For the path, use /gcp-oidc-proxy/<your-endpoint>. The proxy will use the path to make a request to the protected resource.

Expand Advanced Options to set the Forward-Host header to the host protected by IAP. The proxy uses this header to forward requests. Lastly, we’ll set the authentication username and password that we configured on the Cloud Function.

Click “Test” to ensure our configuration works and the check passes.

The Implementation Details

The remainder of this post will walk you through the implementation details of the proxy. The implementation closely resembles what we did to authenticate API consumers using a service account. We use a header called Forward-Host to allow the client to specify the IAP-authenticated host to forward requests to. If the header is not present, we just return a 400 error. We then use this host and the path of the original request to construct the proxy request and retain the HTTP method and headers (with the exception of the Host header, if present, since this can cause problems).

Before sending the request, we perform the authentication process by generating a JWT signed by the service account and exchanging it for a Google-signed OIDC token.

We can cache this token and renew it only once it expires. Then we set the Authorization header with the OIDC token and send the request.

We simply forward on the resulting content body, status code, and headers. We strip HTTP/1.1 “hop-by-hop” headers since these are unsupported by WSGI and Python Cloud Functions rely on Flask. We also strip any Content-Encoding header since this can also cause problems.

Because this proxy allows clients to call into endpoints unauthenticated, we also implement a whitelist to expose only certain endpoints. The whitelist is a list of allowed paths passed in from an environment variable. Alternatively, we can whitelist * to allow all paths. Wildcarding could be implemented to make this even more flexible. We also implement a Basic auth decorator which is configured with environment variables since we can set up uptime checks with a username and password in Stackdriver.

The only other code worth looking at in detail is how we set up the service account credentials and IAM Signer. A Cloud Function has a service account attached to it which allows it to assume the roles of that account. Cloud Functions rely on the Google Compute Engine metadata server which stores service account information among other things. However, the metadata server doesn’t expose the service account key used to sign the JWT, so instead we must use the IAM signBlob API to sign JWTs.

Conclusion

It’s not a particularly simple solution, but it gets the job done. The setup of the Cloud Function could definitely be scripted as well. Once IAM Conditions is generally available, it should be possible to expose certain endpoints in a way that is accessible to Stackdriver without the need for the OIDC proxy. That said, it’s not clear if there is a way to implement uptime checks without exposing an endpoint at all since there is currently no way to assign a service account to a check. Ideally, we would be able to assign a service account and use that with IAP Context-Aware Access to allow the uptime check to access protected endpoints.