API Authentication with GCP Identity-Aware Proxy

Cloud Identity-Aware Proxy (Cloud IAP) is a free service which can be used to implement authentication and authorization for applications running in Google Cloud Platform (GCP). This includes Google App Engine applications as well as workloads running on Compute Engine (GCE) VMs and Google Kubernetes Engine (GKE) by way of Google Cloud Load Balancers.

When enabled, IAP requires users accessing a web application to log in with their Google account and verifies they have the appropriate role to access the resource. This provides secure access to web applications without the need for a VPN. It’s part of what Google now calls BeyondCorp, an enterprise security model designed to enable employees to work from untrusted networks without a VPN. At Real Kinetic, we frequently bump into companies practicing Death-Star security, which is basically relying on a hard outer shell to protect a soft, gooey interior. It’s simple and easy to administer, but it’s also vulnerable. That’s why we always approach security from a perspective of defense in depth.

However, in this post I want to explore how we can use Cloud IAP to implement authentication and authorization for APIs in GCP. Specifically, I will use App Engine, but the same applies to resources behind an HTTPS load balancer. The goal is to provide a way to securely expose APIs in GCP which can be accessed programmatically.

Configuring Identity-Aware Proxy

Cloud IAP supports authenticating service accounts using OpenID Connect (OIDC). A service account belongs to an application instead of an individual user. You authenticate a service account when you want to allow an application to access your IAP-secured resources. A GCP service account can either have GCP-managed keys (for systems that reside within GCP) or user-managed keys (for systems that reside outside of GCP). GCP-managed keys cannot be downloaded and are automatically rotated and used for signing for a maximum of two weeks. User-managed keys are created, downloaded, and managed by users and expire 10 years from creation. As such, key rotation must be managed by the user as appropriate. In either case, access using a service account can be revoked either by revoking a particular key or removing the service account itself.

An IAP is associated with an App Engine application or HTTPS Load Balancer. One or more service accounts can then be added to an IAP to allow programmatic authentication. When the IAP is off, the resource is accessible to anyone with the URL. When it’s on, it’s only accessible to members who have been granted access. This can include specific Google accounts, groups, service accounts, or a general G Suite domain.

IAP will create an OAuth2 client ID for OIDC authentication which can be used by service accounts. But in order to access our API using a service account, we first need to add it to IAP with the appropriate role. We’ll add it as an IAP-secured Web App User, which allows access to HTTPS resources protected by IAP. In this case, my service account is called “IAP Auth Test,” and the email associated with it is iap-auth-test@rk-playground.iam.gserviceaccount.com.

As you can see, both the service account and my user account are IAP-secured Web App Users. This means I can access the application using my Google login or using the service account credentials. Next, we’ll look at how to properly authenticate using the service account.

Authenticating API Consumers

When you create a service account key in the GCP console, it downloads a JSON credentials file to your machine. The API consumer needs the service account credentials to authenticate. The diagram below illustrates the general architecture of how IAP authenticates API calls to App Engine services using service accounts.

To make a request to the IAP-protected resource, the consumer generates a JWT signed with the service account credentials. The JWT contains an additional target_audience claim set to the OAuth2 client ID from the IAP. To find the client ID, click the options menu next to the IAP resource and select “Edit OAuth client.” The client ID will be listed on the resulting page. The code to generate this JWT boils down to the following.
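Here is a minimal sketch in Python, assuming the PyJWT library for RS256 signing; the key-file path and client ID are placeholders for your own values:

import json
import time

import jwt  # PyJWT, with the cryptography package installed for RS256

OAUTH_TOKEN_URI = "https://www.googleapis.com/oauth2/v4/token"

def generate_jwt(sa_keyfile, iap_client_id, expiry_seconds=3600):
    # Load the downloaded service account credentials.
    with open(sa_keyfile) as f:
        sa = json.load(f)

    now = int(time.time())
    claims = {
        "iss": sa["client_email"],         # issued by the service account
        "sub": sa["client_email"],
        "aud": OAUTH_TOKEN_URI,            # audience is the token endpoint
        "iat": now,
        "exp": now + expiry_seconds,
        "target_audience": iap_client_id,  # the IAP OAuth2 client ID
    }
    # Sign the claims with the service account's private key.
    return jwt.encode(claims, sa["private_key"], algorithm="RS256")

signed_jwt = generate_jwt(
    "iap-auth-test-credentials.json",       # placeholder key file
    "XXXXXXXX.apps.googleusercontent.com",  # placeholder client ID
)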

This assumes you have access to the service account’s private key. If you don’t have access to the private key, e.g. because you’re running on GCE or Cloud Functions and using a service account from the metadata server, you’ll have to use the IAM signBlob API. We’ll cover this in a follow-up post.

This JWT is then exchanged for a Google-signed OIDC token for the client ID specified in the JWT claims. This token has a one-hour expiration and must be renewed by the consumer as needed. To retrieve a Google-signed token, we make a POST request containing the JWT and grant type to https://www.googleapis.com/oauth2/v4/token.
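Continuing the Python sketch, the exchange is a single form-encoded POST using the standard jwt-bearer grant type (the requests library is assumed):

import requests

def fetch_google_signed_token(signed_jwt):
    # Exchange the self-signed JWT for a Google-signed OIDC token.
    resp = requests.post(
        OAUTH_TOKEN_URI,
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
            "assertion": signed_jwt,
        },
    )
    resp.raise_for_status()
    return resp.json()["id_token"]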

This returns a Google-signed JWT which is good for about an hour. The “exp” claim can be used to check the expiration of the token. Authenticated requests are then made by setting the bearer token in the Authorization header of the HTTP request:

Authorization: Bearer <token>
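With the Python requests library, for example (the URL is a placeholder for an IAP-protected service), that looks like:

token = fetch_google_signed_token(signed_jwt)
resp = requests.get(
    "https://my-iap-protected-app.appspot.com/api/things",  # placeholder URL
    headers={"Authorization": "Bearer {}".format(token)},
)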

Below is a sequence diagram showing the process of making an OIDC-authenticated request to an IAP-protected resource.

Because this is quite a bit of code and complexity, I’ve implemented the process flow in Java as a Spring RestTemplate interceptor. This transparently authenticates API calls, caches the OIDC token, and handles automatically renewing it. Google has also provided examples of authenticating from a service account for other languages.
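The caching and renewal logic is straightforward to reproduce elsewhere. A rough Python equivalent (not the Java interceptor itself, and building on the helpers sketched above) can read the exp claim out of the cached token and refresh it shortly before it lapses:

import base64

class CachedToken(object):
    def __init__(self, sa_keyfile, iap_client_id, skew_seconds=300):
        self.sa_keyfile = sa_keyfile
        self.iap_client_id = iap_client_id
        self.skew = skew_seconds
        self._token = None
        self._exp = 0

    def get(self):
        # Renew the token a few minutes before its exp claim is reached.
        if self._token is None or time.time() > self._exp - self.skew:
            signed_jwt = generate_jwt(self.sa_keyfile, self.iap_client_id)
            self._token = fetch_google_signed_token(signed_jwt)
            self._exp = self._expiry(self._token)
        return self._token

    @staticmethod
    def _expiry(token):
        # Decode the JWT payload (middle segment) to read the exp claim.
        payload = token.split(".")[1]
        payload += "=" * (-len(payload) % 4)  # restore base64 padding
        claims = json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))
        return claims["exp"]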

With IAP, we’re able to authenticate and authorize requests at the edge before they even reach our application. And with Cloud Audit Logging, we can monitor who is accessing protected resources. Be aware, however, that if you’re using GCE or GKE, users who can access the application-serving port of the VM can bypass IAP authentication. GCE and GKE firewall rules can’t protect against access from processes running on the same VM as the IAP-secured application. They can protect against access from another VM, but only if properly configured. This does not apply to App Engine, since all traffic goes through the IAP infrastructure.

Alternative Solutions

There are some alternatives to IAP for implementing authentication and authorization for APIs. One option is Apigee, which Google acquired not too long ago. It’s a more robust API-management solution that does a lot more than just secure APIs, but it’s also more expensive. Another option is Google Cloud Endpoints, an NGINX-based proxy that provides mechanisms to secure and monitor APIs. It’s free up to two million API calls per month.

Lastly, you can also simply implement authentication and authorization directly in your application instead of with an API proxy, e.g. using OAuth2. This has downsides in that it can introduce complexity and room for mistakes, but it gives you full control over your application’s security. Following our model of defense in depth, we often encourage clients to implement authentication both at the edge (e.g. by ensuring requests have a valid token) and in the application (e.g. by validating the token on a request). This way, we avoid implementing a Death-Star security model.


Multi-Cloud Is a Trap

It comes up in a lot of conversations with clients: “We want to be cloud-agnostic.” “We need to avoid vendor lock-in.” “We want to be able to shift workloads seamlessly between cloud providers.” Let me say it again: multi-cloud is a trap. Outside of appeasing a few major retailers who might not be too keen on stuff running in Amazon data centers, I can think of few reasons why multi-cloud should be a priority for organizations of any scale.

A multi-cloud strategy looks great on paper, but it creates unneeded constraints and results in a wild-goose chase. For most, it ends up being a distraction, creating more problems than it solves and costing more money than it’s worth. I’m going to caveat that claim in just a bit because it’s a bold blanket statement, but bear with me. For now, just know that when I say “multi-cloud,” I’m referring to the idea of running the same services across vendors or designing applications in a way that allows them to move between providers effortlessly. I’m not speaking to the notion of leveraging the best parts of each cloud provider or using higher-level, value-added services across vendors.

Multi-cloud rears its head for a number of reasons, but they can largely be grouped into the following points: disaster recovery (DR), vendor lock-in, and pricing. I’m going to speak to each of these and then discuss where multi-cloud actually does come into play.

Disaster Recovery

Multi-cloud gets pushed as a means to implement DR. When discussing DR, it’s important to have a clear understanding of how cloud providers work. Public cloud providers like AWS, GCP, and Azure have a concept of regions and availability zones (n.b. Azure only recently launched availability zones in select regions, which they’ve learned the hard way is a good idea). A region is a collection of data centers within a specific geographic area. An availability zone (AZ) is one or more data centers within a region. Each AZ is isolated with dedicated network connections and power backups, and AZs in a region are connected by low-latency links. AZs might be located in the same building (with independent compute, power, cooling, etc.) or completely separated, potentially by hundreds of miles.

Region-wide outages are highly unusual. When they do happen, it’s a high-profile event since it usually means half the Internet is broken. Since AZs themselves are geographically isolated to an extent, a natural disaster taking down an entire region would basically be the equivalent of a meteorite wiping out the state of Virginia. The more common causes of region failures are misconfigurations and other operator mistakes. While rare, they do happen. However, regions are highly isolated, and providers perform maintenance on them in staggered windows to avoid multi-region failures.

That’s not to say a multi-region failure is out of the realm of possibility (any more than a meteorite wiping out half the continental United States or some bizarre cascading failure). Some backbone infrastructure services might span regions, which can lead to larger-scale incidents. But while having a presence in multiple cloud providers is obviously safer than a multi-region strategy within a single provider, it comes at a significant cost. DR is an incredibly nuanced topic that I think goes underappreciated, and cloud portability does little to minimize those costs in practice. You don’t need to be multi-cloud to have a robust DR strategy. After all, Amazon.com, one of the world’s largest retailers, runs in a single cloud, so if your DR strategy can match theirs, you’re probably in pretty good shape.

Vendor Lock-In

Vendor lock-in, and the fear, uncertainty, and doubt surrounding it, is another frequently cited reason for a multi-cloud strategy. Beau hits on this in Stop Wasting Your Beer Money:

The cloud. DevOps. Serverless. These are all movements and markets created to commoditize the common needs. They may not be the perfect solution. And yes, you may end up “locked in.” But I believe that’s a risk worth taking. It’s not as bad as it sounds. Tim O’Reilly has a quote that sums this up:

“Lock-in” comes because others depend on the benefit from your services, not because you’re completely in control.

We are locked-in because we benefit from this service. First off, this means that we’re leveraging the full value from this service. And, as a group of consumers, we have more leverage than we realize. Those providers are going to do what is necessary to continue to provide value that we benefit from. That is what drives their revenue. As O’Reilly points out, the provider actually has less control than you think. They’re going to build the system they believe benefits the largest portion of their market. They will focus on what we, a player in the market, value.

Competition is the other key piece of leverage. As strong as a provider like AWS is, there are plenty of competing cloud providers. And while competitors attempt to provide differentiated solutions to what they view as gaps in the market they also need to meet the basic needs. This is why we see so many common services across these providers. This is all for our benefit. We should take advantage of this leverage being provided to us. And yes, there will still be costs to move from one provider to another but I believe those costs are actually significantly less than the costs of going from on-premise to the cloud in the first place. Once you’re actually on the cloud you gain agility.

The mental gymnastics I see companies go through to avoid vendor lock-in and justify multi-cloud always astound me. It’s baffling how much money companies are willing to spend on things that do not differentiate them in any way whatsoever and that, in fact, force them to divert resources from the things that do.

I think there are a couple of reasons for this. First, as Beau points out, we have a tendency to overvalue our own abilities and undervalue our costs. This causes us to miscalculate the build versus buy decision. It’s also closely related to the IKEA effect, in which consumers place a disproportionately high value on products they partially created. Second, as power and influence in organizations have shifted from IT to the business, especially with the adoption of a product mindset, the push for multi-cloud strikes me as another attempt by IT operations to retain control and relevance.

Being cloud-agnostic should not be an important enough goal that it drives key decisions. If that’s your starting point, you’re severely limiting your ability to fully reap the benefits of cloud. You’re just renting compute. Platforms like Pivotal Cloud Foundry and Red Hat OpenShift tout the ability to run on every major private and public cloud, but doing so—by definition—necessitates an abstraction layer that abstracts away all the differentiating features of each cloud platform. When you abstract away the differentiating features to avoid lock-in, you also abstract away the value. You end up with vendor “lock-out,” which basically means you aren’t leveraging the full value of services. Either the abstraction reduces things to a common interface or it doesn’t. If it does, it’s unclear how it can leverage differentiated provider features and remain cloud-agnostic. If it doesn’t, it’s unclear what the value of it is or how it can be truly multi-cloud.

Not to pick on PCF or Red Hat too much, but as the major cloud providers continue to unbundle their own platforms and rebundle them in a more democratized way, the value proposition of these multi-cloud platforms begins to diminish. In the pre-Kubernetes and containers era—aka the heyday of Platform as a Service (PaaS)—there was a compelling story. Now, with the prevalence of containers, Kubernetes, and especially things like Google’s GKE and GKE On-Prem (and equivalents in other providers), that story is getting harder to tell. Interestingly, the recently announced Knative was built in close partnership with, among others, both Pivotal and Red Hat, which seems to be a play to capture some of the value from enterprise adoption of serverless computing using the momentum of Kubernetes.

But someone needs to run these multi-cloud platforms as a service, and therein lies the rub. That responsibility is usually dumped on an operations or shared-services team who now needs to run it in multiple clouds—and probably subscribe to a services contract with the vendor.

A multi-cloud deployment requires expertise for multiple cloud platforms. A PaaS might abstract that away from developers, but it’s pushed down onto operations staff. And we’re not even getting into the security and compliance implications of certifying multiple platforms. For companies that are just now looking to move to the cloud, this will seriously derail things. Once we get past the airy-fairy marketing speak, we really get into the hairy details of what it means to be multi-cloud.

There’s just less room today for running a PaaS that is not managed for you. It’s simply not strategic to any business. I also like to point out that revenues for companies like Pivotal and Red Hat are largely driven by services. These platforms act as a way to drive professional services revenue.

Generally speaking, the risk posed to businesses by vendor lock-in of non-strategic systems is low. For example, a database stores data. Whether it’s Amazon DynamoDB, Google Cloud Datastore, or Azure Cosmos DB—there might be technical differences like NoSQL, relational, ANSI-compliant SQL, proprietary, and so on—fundamentally, they just put data in and get data out. There may be engineering effort involved in moving between them, but it’s not insurmountable and that cost is often far outweighed by the benefits we get using them. Where vendor lock-in can become a problem is when relying on core strategic systems. These might be systems which perform actual business logic or are otherwise key enablers of a company’s business. As Joel Spolsky says, “If it’s a core business function—do it yourself, no matter what. Pick your core business competencies and goals, and do those in house.”

Pricing

Price competitiveness might be the weakest argument of all for multi-cloud. The reality is that, as they commoditize more and more, all providers are in a race to the bottom when it comes to cost. Between providers, you will end up spending more in some areas and less in others. Multi-cloud price arbitrage is not a thing; it’s just something people pretend is a thing. For one, it’s wildly impractical. For another, it fails to account for volume discounts. As I mentioned in my comparison of AWS and GCP, given the providers’ differing philosophies, picking one really comes down to where you want to invest your resources.

And to Beau’s point earlier, the lock-in angle on pricing, i.e., a vendor locking you in and then driving up prices, just doesn’t make sense. First, that’s not how economies of scale work. Second, once you’re in the cloud, the cost of moving from one provider to another is dramatically less than the cost of moving from on-premise to the cloud in the first place, so this simply would not be in providers’ best interest. They will do what’s necessary to capture the largest portion of the market, and competitive forces will drive Infrastructure as a Service (IaaS) costs down; given that competition and the race for market share, pricing is likely to converge. To increase margins, cloud providers will need to move further up the stack toward Software as a Service (SaaS) and value-added services.

Additionally, most public cloud providers offer volume discounts. For instance, AWS offers Reserved Instances with significant discounts up to 75% for EC2. Other AWS services also have volume discounts, and Amazon uses consolidated billing to combine usage from all the accounts in an organization to give you a lower overall price when possible. GCP offers sustained use discounts, which are automatic discounts that get applied when running GCE instances for a significant portion of the billing month. They also implement what they call inferred instances, which is bin-packing partial instance usage into a single instance to prevent you from losing your discount if you replace instances. Finally, GCP likewise has an equivalent to Amazon’s Reserved Instances called committed use discounts. If resources are spread across multiple cloud providers, it becomes more difficult to qualify for many of these discounts.

Where Multi-Cloud Makes Sense

I said I would caveat my claim and here it is. Yes, multi-cloud can be—and usually is—a distraction for most organizations. If you are a company that is just now starting to look at cloud, it will serve no purpose but to divert you from what’s really important. It will slow things down and plant seeds of FUD.

Some companies try to do build-outs on multiple providers at the same time in an attempt to hedge the risk of going all in on one. I think this is counterproductive and actually increases the risk of an unsuccessful outcome. For smaller shops, pick a provider and focus efforts on productionizing it. Leverage managed services where you can, and don’t use multi-cloud as a reason not to. For larger companies, it’s not unreasonable to have build-outs on multiple providers, but it should be done through controlled experimentation. And that’s one of the benefits of cloud: we can make limited investments and experiment without big up-front expenditures (watch out for that with the multi-cloud PaaS offerings and service contracts).

But no, that doesn’t mean multi-cloud doesn’t have a place. Things are never that cut and dried. For large enterprises with multiple business units, multi-cloud is an inevitability. It can be a result of product teams at varying levels of maturity, corporate IT infrastructure, and, certainly, mergers and acquisitions. The main value of multi-cloud, and I think one of the few arguments for it, is leveraging the strengths of each cloud where they make sense. This gets back to providers moving up the stack. As they attempt to differentiate with value-added services, multi-cloud starts to become a lot more meaningful. Secondarily, there might be a case for multi-cloud for data-sovereignty reasons, but I think this is becoming less and less of a concern with the prevalence of regions and availability zones. However, some services, such as Google’s Cloud Spanner, might forgo AZ granularity because they are “globally available” services, so this is something to be aware of when dealing with regulations like GDPR. Finally, for enterprises with colocation facilities, hybrid cloud will always be a reality, though this gets complicated when extending those facilities out to multiple cloud providers.

If you’re just beginning to dip your toe into cloud, a multi-cloud strategy should not be at the forefront of your mind. It definitely should not be your guiding objective and something that drives core decisions or strategic items for the business. It has a time and place, but outside of that, it’s just a fool’s errand—a distraction from what’s truly important.

GCP and AWS: What’s the Difference?

AWS has long been leading the charge when it comes to public cloud providers. I believe this is largely attributed to Bezos’ mandate of “APIs everywhere” in the early days of Amazon, which in turn allowed them to be one of the first major players in the space. Google, on the other hand, has a very different DNA. In contrast to Amazon’s laser-focused product mindset, their approach to cloud has broadly been to spin out services based on internal systems backing Google’s core business. When put in the context of the very different leadership styles and cultures of the two companies, this actually starts to make a lot of sense. But which approach is better, and what does this mean for those trying to settle on a cloud provider?

I think GCP gets a bad rap for three reasons: historically, its support has been pretty terrible; there’s a massive gap in offerings between GCP and AWS; and Google tends to be very opaque about its product roadmaps and commitments. It is nearly impossible now to keep track of all the services AWS offers (a list that seems to grow at a staggering rate), while GCP’s list of services remains fairly modest in comparison. Naively, it would seem AWS is the obvious “better” choice purely due to the number of services. Of course, there’s much more to the story. This article is less of a comparison of the two cloud providers (for that, there is a plethora of analyses) and more of a look at their differing philosophies and legacies.

Philosophies

AWS and GCP are working toward the same goal from completely opposite ends. AWS is the ops engineer’s cloud. It provides all of the low-level primitives ops folks love like network management, granular identity and access management (IAM), load balancers, placement groups for controlling how instances are placed on underlying hardware, and so forth. You need an ops team just to manage all of these things. It’s not entirely different from a traditional on-prem build-out, just in someone else’s data center. This is why ops folks tend to gravitate toward AWS—it’s familiar and provides the control and flexibility they like.

GCP is approaching it from the angle of providing the best managed services of any cloud. It is the software engineer’s cloud. In many cases, you don’t need a traditional ops team, or at least very minimal staffing in that area. The trade-off is it’s more opinionated. This is apparent when you consider GCP was launched in 2008 with the release of Google App Engine. Other key GCP offerings (and acquisitions) bear this out further, such as Google Kubernetes Engine (GKE), Cloud Spanner, Firebase, and Stackdriver.

Platform

A client recently asked me why more companies aren’t using Heroku. I have nothing against Heroku, but the reality is I have not personally run into a company of any size using it. I’m sure they exist, but looking at the customer list on their website, it’s mostly small startups. For greenfield initiatives, larger enterprises are simply apprehensive about using it (and PaaS offerings in general). But I think GCP has a pretty compelling story for managed services, with a nice spectrum of control from fully managed “NoOps”-style services down to straight VMs:

Firebase, Cloud Functions → App Engine → App Engine Flex → GKE → GCE

With a typical PaaS like Heroku, you start to lose that ability to “drop down” a level. Even if a company can get by with a fully managed PaaS, they feel more comfortable having the escape hatch, whether it’s justified or not. App Engine Flexible Environment helps with this by providing a container as a service solution, making it much easier to jump to GKE.

I read an article recently on the good, bad, and ugly of GCP. It does a nice job of telling the same story in a slightly different way. It shows the byzantine nature of the IAM model in AWS versus GCP’s much simpler permissioning system. It describes the dozens of compute-instance types AWS has and the four GCP has (micro, standard, highmem, and highcpu, plus the ability to configure whatever combination of CPU and memory makes sense for your workload). It also touches on the differences in product philosophy. In particular, when GCP releases new services or features into general availability (GA), they are usually very high quality. In contrast, when AWS releases something, the quality and production-readiness vary greatly. The common saying is “Google’s Beta is like AWS’s GA.” The flip side is that GCP’s services often stay in Beta for a very long time.

GCP also does a better job of integrating their different services together, providing a much smaller set of core primitives that are global and work well for many use cases. The article points out Cloud Pub/Sub as a good example. In AWS, you have SQS, SNS, Amazon MQ, Kinesis Data Streams, Kinesis Data Firehose, DynamoDB Streams, and the list seems to only grow over time. GCP has Pub/Sub. It’s flexible enough to fit many (but not all) of the same use cases. The downside of this is Google engineers tend to be pretty opinionated about how problems should be solved.

This difference in philosophy usually means AWS is shipping more services, faster. I think a big part of this is that there isn’t much of a cohesive “platform” story. AWS has lots of disparate pieces (building blocks), many of which are low-level components or more or less hosted versions of existing tech at varying degrees of readiness come GA. This becomes apparent when you have to trudge through their hodgepodge of clunky service dashboards, which often have a wildly different look and feel from one another. That’s not to say there aren’t integrations between products; it just feels less consistent than GCP. The other reason for this, I suspect, is Amazon’s pervasive service-oriented culture.

For example, AWS took ActiveMQ and stood it up as a managed service called Amazon MQ. This is something Google is unlikely to do. It’s just not in their DNA. It’s also one reason why Google is so far behind on service count. GCP tends to be more on the side of shipping homegrown services, but the tech is usually good and ready for primetime when it’s released. Often they spin out internal services by rewriting them for public consumption. This has made them much slower than AWS.

Part of Amazon’s problem, too, is that they are, in a sense, victims of their own success. They got a much earlier head start. The AWS platform launched in 2002 and made its public debut in 2004 with SQS, followed by S3 and EC2 in 2006. As a result, more legacy and cruft has built up over time. Google just started a lot later.

More recently, Google has become much more strategic about embracing open APIs. The obvious case is what it has done with Kubernetes: first by open sourcing it, then by rallying the community around it, and finally by making a massive strategic investment in GKE and the surrounding ecosystem with pieces like Istio. And it has paid off. GKE is far and away the best managed Kubernetes experience currently available. Amazon, which has historically shied away from open APIs (as has Google), had its hand forced, finally making Elastic Container Service for Kubernetes (EKS) generally available last month, probably a bit prematurely. For a long time, Amazon held firm on ECS as the way to run container workloads in AWS. The community spoke, however, and Amazon reluctantly gave in. Other lower-profile cases of Google embracing open APIs include Cloud Dataflow (Apache Beam) and Cloud ML (TensorFlow). As an aside, machine learning and data is another area where GCP is leading the charge, with its ML services and offerings like BigQuery, which is arguably a better product than Amazon Redshift.

There are some other implications of the respective approaches of GCP and AWS, one of which is compliance. AWS usually hits certifications faster, but it’s typically on a region-by-region basis. There’s also GovCloud for FedRAMP, which is an entirely separate region. GCP usually takes longer on compliance, but when it happens, it certifies everything. On the same note, services and features in AWS are usually rolled out region by region, which often precludes organizations from taking advantage of them immediately. In GCP, resources are usually global, and the console shows things for the entire cloud project. In AWS, the console UIs are usually regional or zonal.

Billing and Support

For a long time, billing has been a rough spot for GCP. They basically gave you a monthly toy spreadsheet with your spend, which was nearly useless for larger operations. There also was not a good way to forecast spend and track it throughout the month. You could only alert on actual spend and not estimated usage. The situation has improved a bit more recently with better reporting, integration with Data Studio, and the recently announced forecasting feature, but it’s still not on par with AWS’s built-in dashboarding. That said, AWS’s billing is so complicated and difficult to manage, there is a small cottage industry just around managing your AWS bill.

Related to billing, GCP has a simpler pricing model. With AWS, you can purchase Reserved Instances to reduce compute spend, which effectively allows you to rent VMs upfront at a considerable discount. This can be really nice if you have stable and predictable workloads. GCP offers sustained use discounts, which are automatic discounts that get applied when running GCE instances for a significant portion of the billing month. If you run a standard instance for more than 25% of a month, Google automatically discounts your bill, and the discount grows the larger the portion of the month you run. They also do what they call inferred instances, which is bin-packing partial instance usage into a single instance to prevent you from losing your discount if you replace instances. GCP also has a direct answer to Amazon’s Reserved Instances called committed use discounts. These allow you to purchase a specific number of vCPUs and amount of memory at a discount in return for committing to a usage term of one or three years. Committed use discounts are automatically applied to the instances you run, and sustained use discounts are applied to anything on top of that.
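To make the sustained use math concrete, here’s a small illustration in Python. The tier rates reflect the schedule Google has published for standard machine types, where each quarter of the month is billed at a progressively lower rate; treat them as illustrative rather than a pricing reference:

# Illustrative tier rates; check GCP's pricing docs for current values.
TIERS = [1.00, 0.80, 0.60, 0.40]

def effective_rate(fraction_of_month):
    # Bill each quartile of usage at its tier rate, then average the result.
    billed, remaining = 0.0, fraction_of_month
    for rate in TIERS:
        portion = min(remaining, 0.25)
        billed += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    return billed / fraction_of_month

print(effective_rate(0.25))  # 1.00 -> no discount at 25% of the month
print(effective_rate(0.50))  # 0.90 -> a 10% discount at half the month
print(effective_rate(1.00))  # 0.70 -> the full 30% discount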

Support is still a touchy point for GCP, though they are working to improve it. In my experience, Google has become more committed to helping customers of all sizes be successful on GCP, primarily because AWS has eaten their lunch for a long time. They are much more willing to assign named account reps to customers regardless of size, while AWS won’t give you the time of day if you’re a smaller shop. Their Customer Reliability Engineering program is also one example of how they are trying to differentiate in the support area.

Outcomes

Something interesting that was pointed out to me by a friend and former AWS engineer was that, while GCP and AWS are converging on the same point from opposite ends, they also have completely opposite organizational structures and practices.

Google relies heavily on SREs and service error budgets for operations and support. SREs will manage the operations of a service, but if it exceeds its error budget too frequently, the pager gets handed back to the engineering team. Amazon support falls more on the engineers. This org structure likely influences the way Google and Amazon approach their services, i.e. Conway’s Law. AWS does less to separate development from operations and, as a result, the systems reflect that.

Suffice it to say, there are compelling reasons to go with both AWS and GCP. Sufficiently large organizations will likely end up building out on both. You can use either provider to build the same thing, but how you get there depends heavily on the kinds of teams and skill sets your organization has, what your goals are operationally, and other nuances like compliance and workload shapes. If you have significant ops investment, AWS might be a better fit. If you have lots of software engineers, GCP might be. Pricing is often a point of discussion as well, but the truth is you will end up spending more in some areas and less in others. Moreover, all providers are essentially in a race to the bottom anyway as they commoditize more and more. Where it becomes interesting is how they differentiate with value-added services. This is where “multi-cloud” becomes truly meaningful.

Real Kinetic has extensive experience leveraging both AWS and GCP. Learn more about working with us.