gcp – Brave New Geek

April 4, 2024

Introducing Konfig: GitLab and Google Cloud preconfigured for startups and enterprises

Real Kinetic helps businesses transform how they build and deliver software in the cloud. This encompasses legacy migrations, app modernization, and greenfield development. We work with companies ranging from startups to Fortune 500s and everything in between. Most recently, we finished helping Panera Bread migrate their e-commerce platform to Google Cloud from on-prem and led their transition to GitLab. In doing this type of work over the years, we’ve noticed a problem organizations consistently hit that causes them to stumble with these cloud transformations. Products like GCP, GitLab, and Terraform are quite flexible and capable, but they are sort of like the piles of Legos below.

These products by nature are mostly unopinionated, which means customers need to put the pieces together in a way that works for their unique situation. This makes it difficult to get started, but it’s also difficult to assemble them in a way that works well for 1 team or 100 teams. Startups require a solution that allows them to focus on product development and accelerate delivery, but ideally adhere to best practices that scale with their growth. Larger organizations require something that enables them to transform how they deliver software and innovate, but they need it to address enterprise concerns like security and governance. Yet, when you’re just getting started, you know the least and are in the worst position to make decisions that will have a potentially long-lasting impact. The outcome is companies attempting a cloud migration or app modernization effort fail to even get off the starting blocks.

It’s easy enough to cobble together something that works, but doing it in a way that is actually enterprise-ready, scalable, and secure is not an insignificant undertaking. In fact, it’s quite literally what we have made a business of helping customers do. What’s worse is that this is undifferentiated work. Companies are spending countless engineering hours building and maintaining their own bespoke “cloud assembly line”—or Internal Developer Platform (IDP)—which are all attempting to address the same types of problems. That engineering time would be better spent on things that actually matter to customers and the business.

This is what prompted us to start thinking about solutions. GitLab and GCP don’t offer strong opinions because they address a broad set of customer needs. This creates a need for an opinionated configuration or distribution of these tools. The solution we arrived at is Konfig. The idea is to provide this distribution through what we call “Platform as Code.” Where Infrastructure as Code (IAC) is about configuring the individual resource-level building blocks, Platform as Code is one level higher. It’s something that can assemble these discrete products in a coherent way—almost as if they were natively integrated. The result is a turnkey experience that minimizes time-to-production in a way that will scale, is secure by default, and has best practices built in from the start. A Linux distro delivers a ready-to-use operating system by providing a preconfigured kernel, system library, and application assembly. In the same way, Konfig delivers a ready-to-use platform for shipping software by providing a preconfigured source control, CI/CD, and cloud provider assembly. Whether it’s legacy migration, modernization, or greenfield, Konfig provides your packaged onramp to GCP and GitLab.

Platform as Code

Central to Konfig is the notion of a Platform. In this context, a Platform is a way to segment or group parts of a business. This might be different product lines, business units, or verticals. How these Platforms are scoped and how many there are is different for every organization and depends on how the business is structured. A small company or startup might consist of a single Platform. A large organization might have dozens or more.

A Platform is then further subdivided into Domains, a concept we borrow from Domain-Driven Design. A Domain is a bounded context which encompasses the business logic, rules, and processes for a particular area or problem space. Simply put, it’s a way to logically group related services and workloads that make up a larger system. For example, a business providing online retail might have an E-commerce Platform with the following Domains: Product Catalog, Customer Management, Order Management, Payment Processing, and Fulfillment. Each of these domains might contain on the order of 5 to 10 services.

This structure provides a convenient and natural way for us to map access management and governance onto our infrastructure and workloads because it is modeled after the organization structure itself. Teams can have ownership or elevated access within their respective Domains. We can also specify which cloud services and APIs are available at the Platform level and further restrict them at the Domain level where necessary. This hierarchy facilitates a powerful way to enforce enterprise standards for a large organization while allowing for a high degree of flexibility and autonomy for a small organization. Basically, it allows for governance when you need it (and autonomy when you don’t). This is particularly valuable for organizations with regulatory or compliance requirements, but it’s equally valuable for companies wanting to enforce a “golden path”—that is, an opinionated and supported way of building something within your organization. Finally, Domains provide clear cost visibility because cloud resources are grouped into Domain projects. This makes it easy to see what “Fulfillment” costs versus “Payment Processing” in our E-commerce Platform, for example.

“Platform as Code” means these abstractions are modeled declaratively in YAML configuration and managed via GitOps. The definitions of Platforms and Domains consist of a small amount of metadata, shown below, but that small amount of metadata ends up doing a lot of heavy lifting in the background.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gitlab:
    parentGroupId: 82224252
  gcp:
    billingAccountId: "123ABC-456DEF-789GHI"
    parentFolderId: "1080778227704"
    defaultEnvs:
      - dev
      - stage
      - prod
    services:
      defaults:
        - cloud-run
        - cloud-sql
        - cloud-storage
        - secret-manager
        - cloud-kms
        - pubsub
        - redis
        - firestore
    api:
      path: /ecommerce

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    services:
      disabled:
        - pubsub
        - redis
        - firestore
    api:
      path: /payment
  groups:
    dev: [payment-devs@example.com]
    maintainer: [payment-maintainers@example.com]
    owner: [gitlab-owners@example.com]

domain.yaml

The Control Plane

Platforms, Domains, and all of the resources contained within them are managed by the Konfig control plane. The control plane consumes these YAML definitions and does whatever is needed in GitLab and GCP to make the “real world” reflect the desired state specified in the configuration.

The control plane manages the structure of groups and projects in GitLab and synchronizes this structure with GCP. This includes a number of other resources behind the scenes as well: configuring OpenID Connect to allow GitLab pipelines to authenticate with GCP, IAM resources like service accounts and role bindings, managing SAML group links to sync user permissions between GCP and GitLab, and enabling service APIs on the cloud projects. The Platform/Domain model allows the control plane to specify fine-grained permissions and scope access to only the things that need it. In fact, there are no credentials exposed to developers at all. It also allows us to manage what cloud services are available to developers and what level of access they have across the different environments. This governance is managed centrally but federated across both GitLab and GCP.

The net result is a configuration- and standards-driven foundation for your cloud development platform that spans your source control, CI/CD, and cloud provider environments. This foundation provides a golden path that makes it easy for developers to build and deliver software while meeting an organization’s internal controls, standards, or regulatory requirements. Now we’re ready to start delivering workloads to our enterprise cloud environment.

Managing Workloads and Infrastructure

The Konfig control plane establishes an enterprise cloud environment in which we could use traditional IAC tools such as Terraform to manage our application infrastructure. However, the control plane is capable of much more than just managing the foundation. It can also manage the workloads that get deployed to this cloud environment. This is because Konfig actually consists of two components: Konfig Platform, which configures and manages our cloud platform comprising GitLab and GCP, and Konfig Workloads, which configures and manages application workloads and their respective infrastructure resources.

Using the Lego analogy, think of Konfig Platform as providing a pre-built factory and Konfig Workloads as providing pre-built assembly lines within the factory. You can use both in combination to get a complete, turnkey experience or just use Konfig Platform and “bring your own assembly line” such as Terraform.

Konfig Workloads provides an IAC alternative to Terraform where resources are managed by the control plane. Similar to how the platform-level components like GitLab and GCP are managed, this works by using an operator that runs in the control plane cluster. This operator runs on a control loop which is constantly comparing the desired state of the system with the current state and performs whatever actions are necessary to reconcile the two. A simple example of this is the thermostat in your house. You set the temperature—the desired state—and the thermostat works to bring the actual room temperature—the current state—closer to the desired state by turning your furnace or air conditioner on and off. This model removes potential for state drift, where the actual state diverges from the configured state, which can be a major headache with tools like Terraform where state is managed with backends.

The Konfig UI provides a visual representation of the state of your system. This is useful for getting a quick understanding of a particular Platform, Domain, or workload versus reading through YAML that could be scattered across multiple files or repos (and which may not even be representative of what’s actually running in your environment). With this UI, we can easily see what resources a workload has configured and can access, the state of these resources (whether they are ready, still provisioning, or in an error state), and how the workload is configured across different environments. We can even use the UI itself to provision new resources like a database or storage bucket that are scoped automatically to the workload. This works by generating a merge request in GitLab with the desired changes, so while we can use the UI to configure resources, everything is still managed declaratively through IAC and GitOps. This is something we call “Visual IAC.”

Your Packaged Onramp to GCP and GitLab

The current cloud landscape offers powerful tools, but assembling them efficiently, securely, and at scale remains a challenge. This “undifferentiated work” consumes valuable engineering resources that could be better spent on core business needs, and it often prevents organizations from even getting off the starting line when beginning their cloud journey. Konfig, built around the principles of Platform as Code and standards-driven development, addresses this very gap. We built it to help our clients move quicker through operationalizing the cloud so that they can focus on delivering business value to their customers. Whether you’re migrating to the cloud, modernizing, or starting from scratch, Konfig provides a preconfigured and opinionated integration of GitLab, GCP, and Infrastructure as Code which gives you:

Faster time-to-production: Streamlined setup minimizes infrastructure headaches and allows developers to focus on building and delivering software.
Enterprise-grade security: Built-in security best practices and fine-grained access controls ensure your cloud environment remains secure.
Governance: Platforms and Domains provide a flexible model that balances enterprise standards with team autonomy.
Scalability: Designed to scale with your business, easily accommodating growth without compromising performance or efficiency.
Great developer UX: Designed to provide a great user experience for developers shipping applications and services.

Konfig functions like an operating system for your development organization to deliver software to the cloud. It’s an opinionated IDP specializing in cloud migrations and app modernization. This allows you to focus on what truly matters—building innovative software products and delivering exceptional customer experiences.

We’ve been leveraging these patterns and tools for years to help clients ship with confidence, and we’re excited to finally offer a solution that packages them up. Please reach out if you’d like to learn more and see a demo. If you’re undertaking a modernization or cloud migration effort, we want to help make it a success. We’re looking for a few organizations to partner with to develop Konfig into a robust solution.

Follow @tyler_treat

February 12, 2024

Cloud without Kubernetes

I think it’s safe to say Kubernetes has “won” the cloud mindshare game. If you look at the CNCF Cloud Native landscape (and manage to not go cross eyed), it seems like most of the projects are somehow related to Kubernetes. KubeCon is one of the fastest-growing industry events. Companies we talk to at Real Kinetic who are either preparing for or currently executing migrations to the cloud are centering their strategies around Kubernetes. Those already in the cloud are investing heavily in platform-izing their Kubernetes environment. Kubernetes competitors like Nomad, Pivotal Cloud Foundry, OpenShift, and Rancher have sort of just faded to the background (or simply pivoted to Kubernetes). In many ways, “cloud native” seems to be equated with “Kubernetes”.

All this is to say, the industry has coalesced around Kubernetes as the way to do cloud. But after working with enough companies doing cloud, watching their experiences, and understanding their business problems, I can’t help but wonder: should it be? Or rather, is Kubernetes actually the right level of abstraction?

Going k8sless

While we’ve worked with a lot of companies doing Kubernetes, we’ve also worked with some that are deliberately not. Instead, they leaned into serverless—heavily—or as I like to call it, they’ve gone k8sless. These are not small companies or startups, they are name brands you would recognize.

At first, we were skeptical. Our team came from a company that made it all the way to IPO using Google App Engine, one of the earliest serverless platforms available. We have regularly espoused the benefits of serverless. We’ve talked to clients about how they should consider it for their own workloads (often to great skepticism). But using only serverless? For once, we were the serverless skeptics. One client in particular was beginning a migration of their e-commerce platform to Google Cloud. They wanted to do it completely serverless. We gave our feedback and recommendations based on similar migrations we’ve performed:

“There are workloads that aren’t a good fit.”

“It would require major re-architecting.”

“It will be expensive once fully migrated.”

“You’ll have better cost efficiency bin packing lots of services into VMs with Kubernetes.”

We articulated all the usual arguments made by the serverless doubters. Even Google was skeptical, echoing our sentiments to the customer. “Serious companies doing online retail like The Home Depot or Target are using Google Kubernetes Engine,” was more or less the message. We have a team of serverless experts at Real Kinetic though, so we forged ahead and helped execute the migration.

Fast forward nearly three years later and we will happily admit it: we were wrong. You can run a multibillion-dollar e-commerce platform without a single VM. You don’t have to do a full rewrite or major re-architecting. It can be cost-effective. It doesn’t require proprietary APIs or constraints that result in vendor lock-in. It might sound like an exaggeration, but it’s not.

Container as the interface

Over the last several years, Google’s serverless offerings have evolved far beyond App Engine. It has reached the point where it’s now viable to run a wide variety of workloads without much issue. In particular, Cloud Run offers many of the same benefits of a PaaS like App Engine without the constraints. If your code can run in a container, there’s a very good chance it will run on Cloud Run with little to no modification.

In fact, other than using the gcloud CLI to deploy a service, there’s nothing really Google- or Cloud Run-specific needed to get a functioning application. This is because Cloud Run uses Knative, an open-source Kubernetes-based platform, as its deployment interface. And while Cloud Run is a Google-managed backend for the Knative interface, we could just as well switch the backend to GKE or our own Kubernetes cluster. When we implement our Cloud Run services, we actually implement them using a Kubernetes Deployment manifest, shown below, and right before deploying, we swap Deployment for Knative’s Service manifest.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    cloud.googleapis.com/location: us-central1
    service: my-service
  name: my-service
spec:
  template:
    spec:
      containers:
        - image: us.gcr.io/my-project/my-service:v1
          name: my-service
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 2
              memory: 1024Mi

This means we can deploy to Kubernetes without Knative at all, which we often do during development using the combination of Skaffold and K3s to perform local testing. It also allows us to use Kubernetes native tooling such as Kustomize to manage configuration. Think of Cloud Run as a Kubernetes Deployment as a service (though really more like Deployment and Service…as a service).

“Normal” businesses versus internet-scale businesses

What about cost? Yes, the unit cost in terms of compute is higher with serverless. If you execute enough CPU cycles to fill the capacity of a VM, you are better off renting the whole VM as opposed to effectively renting timeshares of it. But here’s the thing: most “normal” businesses tend to have highly cyclical traffic patterns throughout the day and their scale is generally modest.

What do I mean by “normal” businesses? These are primarily non-internet-scale companies such as insurance, fast food, car rental, construction, or financial services, not Google, Netflix, or Amazon. As a result, these companies can benefit greatly from pay-per-use, and those in the retail space also benefit greatly from the elasticity of this model during periods like Black Friday or promotional campaigns. Businesses with brick-and-mortar have traffic that generally follows their operating hours. During off-hours, they can often scale quite literally to zero.

Many of these businesses, for better or worse, treat software development as an IT cost center to be managed. They don’t need—or for that matter, want—the costs and overheads associated with platform-izing Kubernetes. A lot of the companies we interact with fall into this category of “normal” businesses, and I suspect most companies outside of tech do as well.

BYOP—Bring Your Own Platform

I’ve asked it before: is Kubernetes really the end-game abstraction? In my opinion, it’s an implementation detail. I don’t think I’m alone in that opinion. Some companies put a tremendous amount of investment into abstracting Kubernetes from their developers. This is what I mean by “platform-izing” Kubernetes. It typically involves significant and ongoing OpEx investment. The industry has started to coalesce around two concepts that encapsulate this: Platform Engineering and Internal Developer Platform. So while Kubernetes may have become the default container orchestrator, the higher-level pieces—the pieces constituting the Internal Developer Platform—are still very much bespoke. Kelsey Hightower said it best: the majority of people managing infrastructure just want a PaaS. The only requirement: it has to be built by them. That’s a problem.

Imagine having a Kubernetes cluster per Deployment. Full blast radius isolation, complete cost traceability, granular yet simple permissioning. It sounds like a maintenance nightmare though, right? Now imagine those clusters just being hidden from you completely and the Deployment itself is the only thing you interact with and maintain. You just provide your container (or group of containers), configure your CPU and memory requirements, specify the network and resource access, and deploy it. The Deployment manages your load balancing and ingress, automatically scales the pods up and down or canaries traffic, and gives you aggregated logs and metrics out of the box. You only pay for the resources consumed while processing a request. Just a few years ago, this was a futuristic-sounding fantasy.

The platform Kelsey describes above does now exist. From my experience, it’s a nearly ideal solution for those “normal” businesses who are looking to minimize complexity and operational costs and avoid having to bring (more like build) their own platform. I realize GCP is a distant third when it comes to public cloud market share so this will largely fall on deaf ears, but for those who are still listening: stop wasting time on Kubernetes and just use Cloud Run. Let me expand on the reasons why.

Easily and quickly get started with the cloud. Many of the companies we work with who are still in the midst of migrating to the cloud get hung up with analysis paralysis. Cloud Run isn’t a perfect solution for everything, but it’s good enough for the majority of cases. The rest can be handled as exceptions.
Minimize complexity of cloud environments. Cloud Run does not eliminate the need for infrastructure (there are still caches, queues, databases, and so forth), but it greatly simplifies it. Using managed services for the remaining infrastructure pieces simplifies it further.
Increase the efficiency of your developers and reduce operational costs. Rather than spending most of their time dealing with infrastructure concerns, allow your developers to focus on delivering business value. For most businesses, infrastructure is undifferentiated commodity work. By “outsourcing” large parts of your undifferentiated Internal Developer Platform, you can reallocate developers to product or feature development and reduce operational costs. This allows you to get the benefits of Platform Engineering with a fraction of the maintenance and overhead. Lastly, if you are a “normal” business that doesn’t operate at internet scale and has fairly cyclical traffic, it’s entirely likely Cloud Run will be cheaper than VM-based platforms.
Maintain the flexibility to evolve to a more complex solution over time if needed. This is where traditional serverless platforms and PaaS solutions fall short. Again, with Cloud Run there is no actual vendor lock-in, it’s just a Kubernetes Deployment as a Service. Even without Knative, we can take that Deployment and run it in any Kubernetes cluster. This is a very different paradigm from, say, App Engine where you wrote your application using App Engine APIs and deployed your service to the App Engine runtime. In this new paradigm, the artifact is a Plain Old Container. There are cases where Cloud Run is not a good fit, such as certain kinds of stateful legacy applications or services with sustained, non-cyclical traffic. We don’t want to be painted into a corner with these types of situations so having flexibility is important.

There are similar analogs to Cloud Run on other cloud platforms. For example, AWS has AppRunner. However, in my experience these fall short in terms of developer experience because of either lack of investment from the cloud provider or environment complexity (as I would argue is the case for AWS). Managed services like Cloud Run are one of the areas that GCP truly excels and differentiates itself.

Just use Cloud Run, seriously

I realize not everyone will be convinced. The gravitational pull of Kubernetes is strong and as a platform, it’s a safe bet. However, operationalizing Kubernetes properly—whether it’s a managed offering like GKE or not—requires some kind of platform team and ongoing investment. We’ve seen it approached without this where developers are given clusters or allowed to spin them up and fend for themselves. This quickly becomes untenable because standards are non-existent, security and compliance is unmanageable, and developer time is split between managing infrastructure and actual feature development.

If your organization is unable or unwilling to make this investment, I urge you to consider Cloud Run. There’s still work needed on the periphery to properly operationalize it, such as implementing CI/CD pipelines and managing accessory infrastructure, but it’s a much lower investment. Additionally, it provides an escape hatch—unlike App Engine or traditional PaaS solutions, there is no real switching cost in moving to Kubernetes if you need to in the future. With Cloud Run, serverless has finally reached a tipping point where it’s now viable for a majority of workloads rather than a niche subset. Unlike Kubernetes, it provides the right level of abstraction for most businesses building software. In my opinion, serverless is still not taken seriously due to preconceived notions, but it’s time to start reevaluating those notions.

Agree? Disagree? I’d love to hear your thoughts. If you’re an organization that would like to do cloud differently or are looking for the playbook to operationalize Google Cloud Platform, please get in touch.

Follow @tyler_treat

July 15, 2020July 15, 2020

Implementing ETL on GCP

ETL (Extract-Transform-Load) processes are an essential component of any data analytics program. This typically involves loading data from disparate sources, transforming or enriching it, and storing the curated data in a data warehouse for consumption by different users or systems. An example of this would be taking customer data from operational databases, joining it with data from Salesforce and Google Analytics, and writing it to an OLAP database or BI engine.

In this post, we’ll take an honest look at building an ETL pipeline on GCP using Google-managed services. This will primarily be geared towards people who may be familiar with SQL but may feel less comfortable writing code or building a solution that requires a significant amount of engineering effort. This might include data analysts, data scientists, or perhaps more technical-oriented business roles. That is to say, we’re mainly looking at low-code/no-code solutions, but we’ll also touch briefly on more code-heavy options towards the end. Specifically, we’ll compare and contrast Data Fusion and Cloud Dataprep. As part of this, we will walk through the high-level architecture of an ETL pipeline and discuss common patterns like data lakes and data warehouses.

General Architecture

It makes sense to approach ETL in two phases. First, we need a place to land raw, unprocessed data. This is commonly referred to as a data lake. The data lake’s job is to serve as a landing zone for all of our business data, even if the purpose of some of that data is not yet clear. The data lake is also where we can de-identify or redact sensitive data before it moves further downstream.

The second phase is processing the raw data and storing it for particular use cases. This is referred to as a data warehouse. The data here feeds end-user queries and reports for business analysts, BI tools, dashboards, spreadsheets, ML models, and other business activities. The data warehouse structures the data in a way suitable for these specific needs.

On GCP, our data lake is implemented using Cloud Storage, a low-cost, exabyte-scale object store. This is an ideal place to land massive amounts of raw data. We can also use Cloud Data Loss Prevention (DLP) to alert on or redact any sensitive data such as PII or PHI. Once use cases have been identified for the data, we then transform it and move it into our curated data warehouse implemented with BigQuery.

At a high level, our analytics pipeline architecture looks something like the following. The components in green are pieces implemented on GCP.

We won’t cover how data gets ingested into the data warehouse. This might be a data-integration tool like Mulesoft or Informatica if we’re moving data from on-prem. It might be an automated batch process using gsutil, a Python script, or Transfer Service. Alternatively, it might be a more real-time push process that streams data in via Cloud Pub/Sub. Either way, we’ll assume we have some kind of mechanism to load our data into Cloud Storage.

We will focus our time discussing the “Transform Process” step in the diagram above. This is where Data Fusion and Cloud Dataprep fit in.

Data Fusion

Data Fusion is a code-free data integration tool that runs on top of Hadoop. The user is intended to define ETL pipelines using a graphical plug-and-play UI with preconfigured connectors and transformations. Data Fusion is actually a managed version of an open source system called Cask Data Analytics Platform (CDAP) which Google acquired in 2018. It’s a relatively new product in GCP, and it shows. The UX is rough and there are a lot of sharp edges. For example, when an instance starts up, you can occasionally hit cryptic errors because the instance has not actually initialized fully. Case in point, try deciphering what this error means:

The theory of letting users with no programming experience implement and run ETL pipelines is appealing. However, the reality is that you will end up trying to understand Hadoop debug logs and opaque error messages when things go wrong, which happens frequently.

The pipelines created in Data Fusion run on Cloud Dataproc. This means every time you run a pipeline, you first need to wait for a Dataproc cluster to spin up—which is slow. Google’s recommendation to speed this up is to configure a runtime profile that uses a pre-existing Dataproc cluster. This has several downsides, one of which is simply the cost of keeping a Dataproc cluster running in addition to your Data Fusion instance. But what is the point of keeping a cluster running that only gets used for nightly batch processes or ad hoc pipeline development? The other is the technical and operations overhead required to configure and manage a cluster. This requires provisioning an appropriately sized cluster, creating an SSH key for it, and adding the key to the cluster so that Data Fusion can connect to it. For a product designed to allow relatively non-technical people to build out pipelines, this is a tall order. You’ll also quickly see how rough the UX is when walking through these steps.

The other downside of Data Fusion is that it’s actually pretty expensive. CDAP consists of a whole bunch of components. When you start a Data Fusion instance, it creates an internal GKE cluster to run all of these components. In addition to this, it relies on Cloud Storage, Cloud SQL, Persistent Disks, Elasticsearch, and Cloud KMS. The net result is that instances take approximately 10-20 minutes to start (now closer to 10 with recent improvements) and, for many, they’re not something you run and forget about.

A Basic Edition instance costs about $1,100 per month, while an Enterprise Edition instance costs $3,000 per month. For larger organizations, that might be a nominal cost, but it stings a bit when you realize that is just the cost to run the pipeline editor. The pipelines themselves run on Dataproc, which is an entirely separate—and significant—line item. What’s worse is that you have to keep the Data Fusion instance running in order to actually execute the ETL pipelines you develop in it. Additionally, the Basic Edition will only let you run pipelines on demand. In order to schedule pipelines or trigger them in a more streaming fashion, you have to use the Enterprise Edition. As a result, I often encounter teams wanting to schedule startup and shutdown for both the Dataproc clusters and Data Fusion instances to avoid unnecessary spend. This has to be done with code.

Pipelines are immutable, which means every time you need to tweak a pipeline, you first have to make a copy of it. Immutability sounds nice in theory, but in practice it means you end up with dozens of pipeline iterations as you build out your process. And in order to save your pipeline when a Data Fusion instance is deleted—say because you’re shutting it down nightly to save on costs—you have to export it to a file and then import it to the new instance. Recycling instances will still lose the job information for previous pipeline runs, however. There is no way to “pause” an instance, which makes pipeline management a pain.

Data Fusion itself is fairly robust in what you can do with it. It can extract data from a broad set of sources, including Cloud Storage, perform a variety of transformations, and load results into an assortment of destinations such as BigQuery. That said, I’m still a bit skeptical about no-code solutions for non-technical users. I still often find myself dropping in a JavaScript transform in order to actually do the manipulations on the data that I need versus trying to do it with a combination of preconfigured drag-and-drop widgets. Most of the analysts I’ve seen using it also just want to use SQL to do their transformations. Trying to join two data sources using a UI is frankly just more difficult than writing a SQL join. The data wrangler uses a goofy scripting language called JEXL that is poorly documented and inconsistently implemented. To put it bluntly, the UI and UX in Data Fusion (technically CDAP) is painful, and I often find myself wishing I could just write some Python. It just feels like an open source product that doesn’t see much investment.

Data Fusion is a bit of an oddball when viewed in the context of how GCP normally approaches services until you realize it was an acquisition of a company built around an open source framework. In that light, it feels very similar to Cloud Composer, another product built around an open source framework, Apache Airflow, which feels equally kludgy. Most of Google’s data products are highly refined with an emphasis on serverless and developer experience. Services like BigQuery, Dataflow, and Cloud Pub/Sub come to mind here. Data Fusion is the polar opposite. It’s clunky, the CDAP infrastructure is heavy and expensive, and it still requires low-level operations like when you’re configuring a Dataproc cluster.

Dataproc itself feels like a service for handling legacy Hadoop workloads since it has a lot of operations overhead. For newer workloads, I would target Dataflow which is closer to a “serverless” experience like BigQuery and is evidently on the roadmap as a runtime target for Data Fusion.

The CDAP UX is quirky, confusing, inconsistent, and generally unpleasant. The moment anything goes awry, which is often and unwittingly the case, you’re thrust into the world of Hadoop to divine what went wrong. I’m a raving fan of much of GCP’s managed services. On the whole, I find them to be better engineered, better thought-out, and better from a developer experience perspective compared to other cloud platforms. Data Fusion ain’t it.

Cloud Dataprep

Cloud Dataprep is actually a third-party application offered by Trifacta through GCP. In fact, it’s really just a GCP-specific SKU of Trifacta’s Wrangler product. The downside of this is that you have to agree to a third-party vendor’s terms and conditions. For some, this will likely trigger a whole separate sourcing process. This is a challenge for a lot of enterprise organizations.

If you can get past the procurement conundrum, you’ll find Dataprep to be a highly polished and refined product. In comparison to Data Fusion, it’s a breath of fresh air and is superior in nearly every aspect. The UI is pleasant, the UX is—for the most part—coherent and intuitive, it’s cheaper, and it’s a proper serverless product. Dataprep feels like what I would expect from a first-class managed service on GCP.

Dataprep is similar to Data Fusion in the sense that it allows you to build out pipelines with a graphical interface which then target an underlying runtime. In the case of Dataprep, it targets Dataflow rather than Dataproc. This means we benefit from the features of Dataflow, namely auto-provisioning and scaling of infrastructure. Jobs tend to run much more quickly and reliably than with Data Fusion. Another key difference is that, unlike Data Fusion, Dataprep doesn’t require an “instance” to develop pipelines. It is more like a SaaS application that relies on Dataflow. Today, using the app to develop pipelines is free of charge. You only incur charges from Dataflow resource usage. Unfortunately, this is changing as Trifacta is switching to a tiered monthly subscription model later this year. This will put base costs more in-line with Data Fusion, but I suspect the reliance on Dataflow will bring overall costs down.

The pipeline management in Dataprep is simpler than in Data Fusion. Pipelines in Dataprep are called “flows.” These are mutable and private by default but can be shared with other users. Because Dataprep is a SaaS product, you don’t need to worry about exporting and persisting your pipelines, and job data from previous flow executions is retained.

Dataprep has some drawbacks though. Broadly speaking, it’s not as feature-rich as Data Fusion. It can only integrate with Cloud Storage and BigQuery, while Data Fusion supports a wide array of data sources and sinks. You can do more with Data Fusion, while with Dataprep, you’re more or less confined to the wrangler. Because of this, Dataprep is well-suited to lighter weight processes and data cleansing—joining data sources, standardizing formats, identifying missing or mismatched values, deduplicating rows, and other things like that. It also works well for data exploration and slicing and dicing.

I often find teams using both Data Fusion and Dataprep. Data Fusion gets used for more advanced ETL processes and Dataprep for, well, data preparation. If it’s available to them, teams usually start with Dataprep and then switch to Data Fusion if they hit a wall with what it can do.

Alternatives

Data Fusion and Dataprep attempt to provide a managed solution that lets users with little-to-no programming experience build out ETL pipelines. Dataprep definitely comes closer to realizing that goal due to its more refined UX and reliance on Dataflow rather than Dataproc. However, I tend to dislike managed “workflow engines” like these. Cloud Composer and AWS Glue, which is Amazon’s managed ETL service, are other examples that fall under this category.

These types of services usually sit in a weird in-between position of trying to provide low-code solutions with GUIs but needing to understand how to debug complex and sophisticated distributed computing systems. It seems like every time you try something to make building systems easier, you wind up needing to understand the “easier” thing plus the “hard” stuff it was trying to make easy. This is what Joel Spolsky refers to as the Law of Leaky Abstractions. It’s why I prefer to write code to solve problems versus relying on low-code interfaces. The abstractions can work okay in some cases, but it’s when things go wrong or you need a little bit more flexibility where you run into problems. It can be a touchy subject, but I’ve found that the most effective data programs within organizations are the ones that have software engineers or significant programming and systems development skill sets. This is especially true if you’re on AWS where there’s more operations and networking knowledge required.

With that said, there are some alternative approaches to implementing ETL processes on GCP that move away from the more low/no-code options. If your team consists mostly of software engineers or folks with a development background, these might be a better option.

My go-to for building data processing pipelines is Cloud Dataflow, which is a serverless system for implementing stream and batch pipelines. With Dataflow, you don’t need to think about capacity and resource provisioning and, unlike Data Fusion and Dataproc, you don’t need to keep a standby cluster running as there is no “cluster.” The compute is automatically provisioned and autoscaled for you based on the job. You can use code to do your transformations or use SQL to join different data sources.

For batch ETL, I like a combination of Cloud Scheduler, Cloud Functions, and Dataflow. Cloud Scheduler can kick off the ETL process by hitting a Cloud Function which can then trigger your Dataflow template. Alternatively, you could use a streaming Dataflow pipeline in combination with Cloud Scheduler and Pub/Sub to launch your batch ETL pipelines. Google has an example of this here.

For streaming ETL, data can be fed into a streaming Dataflow pipeline from Cloud Pub/Sub and processed as usual. This data can even be joined with files in Cloud Storage or tables in BigQuery using SQL. This is what I found myself and many of the clients I’ve worked with wanting to do in Data Fusion and Dataprep. Sometimes you just want to write SQL, which leads to another solution.

BigQuery provides a good mechanism for ELT—that is extracting the data from its sources, loading it into BigQuery, and then performing the transformations on it. This is a good option if you’re dealing with primarily batch-driven processes and you have a SQL-heavy team as the transformations are expressed purely through SQL. The transformation queries can either be scheduled directly in BigQuery or triggered in an automated way using the API, such as running the transformations after data loading completes.

I mentioned earlier that I’m not a huge fan of managed workflow engines. This is speaking to high-level abstractions and heavy, monolithic frameworks specifically. However, I am a fan of lightweight, composable abstractions that make it easy to build scalable and fault-tolerant workflows. Examples of this include AWS Step Functions and Google Cloud Tasks. On GCP, Cloud Tasks can be a great alternative to Dataflow for building more code-heavy ETL processes if you’re not tied in to Apache Beam. In combination with Cloud Run, you can build out highly elastic workflows that are entirely serverless. While it’s not the obvious choice for implementing ETL on GCP, it’s definitely worth a mention.

Conclusion

There are several options when it comes to implementing ETL processes on GCP. What the right fit is depends on your team’s skill set, the use cases, and your affinity for certain tools. Cost and operational complexity are also important considerations. In practice, however, it’s likely you’ll end up using a combination of different solutions.

For low/no-code solutions, Data Fusion and Cloud Dataprep are your only real options. While Data Fusion is rough from a usability perspective and generally more expensive, it’s likely where Google is putting significant investment. Dataprep is more refined and cost-effective but limited in capability, and it brings a third-party vendor into the mix. Using BigQuery itself for ELT is also an option for SQL-minded teams. But for teams with a strong engineering background, my recommended starting point is Cloud Dataflow or even Cloud Tasks for certain types of processing work.

Together with Cloud Pub/Sub, Cloud Data Loss Prevention, Cloud Storage, BigQuery, and GCP’s other managed services, these solutions provide a great way to implement analytics pipelines that require minimal operations investment.

Follow @tyler_treat

June 24, 2020

Using Google-Managed Certificates and Identity-Aware Proxy With GKE

Ingress on Google Kubernetes Engine (GKE) uses a Google Cloud Load Balancer (GCLB). GCLB provides a single anycast IP that fronts all of your backend compute instances along with a lot of other rich features. In order to create a GCLB that uses HTTPS, an SSL certificate needs to be associated with the ingress resource. This certificate can either be self-managed or Google-managed. The benefit of using a Google-managed certificate is that they are provisioned, renewed, and managed for your domain names by Google. These managed certificates can also be configured directly with GKE, meaning we can configure our certificates the same way we declaratively configure our other Kubernetes resources such as deployments, services, and ingresses.

GKE also supports Identity-Aware Proxy (IAP), which is a fully managed solution for implementing a zero-trust security model for applications and VMs. With IAP, we can secure workloads in GCP using identity and context. For example, this might be based on attributes like user identity, device security status, region, or IP address. This allows users to access applications securely from untrusted networks without the need for a VPN. IAP is a powerful way to implement authentication and authorization for corporate applications that are run internally on GKE, Google Compute Engine (GCE), or App Engine. This might be applications such as Jira, GitLab, Jenkins, or production-support portals.

IAP works in relation to GCLB in order to secure GKE workloads. In this tutorial, I’ll walk through deploying a workload to a GKE cluster, setting up GCLB ingress for it with a global static IP address, configuring a Google-managed SSL certificate to support HTTPS traffic, and enabling IAP to secure access to the application. In order to follow along, you’ll need a GKE cluster and domain name to use for the application. In case you want to skip ahead, all of the Kubernetes configuration for this tutorial is available here.

Deploying an Application Behind GCLB With a Managed Certificate

First, let’s deploy our application to GKE. We’ll use a Hello World application to test this out. Our application will consist of a Kubernetes deployment and service. Below is the configuration for these:

Apply these with kubectl:

$ kubectl apply -f .

At this point, our application is not yet accessible from outside the cluster since we haven’t set up an ingress. Before we do that, we need to create a static IP address using the following command:

$ gcloud compute addresses create web-static-ip --global

The above will reserve a static external IP called “web-static-ip.” We now can create an ingress resource using this IP address. Note the “kubernetes.io/ingress.global-static-ip-name” annotation in the configuration:

Applying this with kubectl will provision a GCLB that will route traffic into our service. It can take a few minutes for the load balancer to become active and health checks to begin working. Traffic won’t be served until that happens, so use the following command to check that traffic is healthy:

$ curl -i http://<web-static-ip>

You can find <web-static-ip> with:

$ gcloud compute addresses describe web-static-ip --global

Once you start getting a successful response, update your DNS to point your domain name to the static IP address. Wait until the DNS change is propagated and your domain name now points to the application running in GKE. This could take 30 minutes or so.

After DNS has been updated, we’ll configure HTTPS. To do this, we need to create a Google-managed SSL certificate. This can be managed by GKE using the following configuration:

Ensure that “example.com” is replaced with the domain name you’re using.

We now need to update our ingress to use the new managed certificate. This is done using the “networking.gke.io/managed-certificates” annotation.

We’ll need to wait a bit for the certificate to finish provisioning. This can take up to 15 minutes. Once it’s done, we should see HTTPS traffic flowing correctly:

$ curl -i https://example.com

We now have a working example of an application running in GKE behind a GCLB with a static IP address and domain name secured with TLS. Now we’ll finish up by enabling IAP to control access to the application.

Securing the Application With Identity-Aware Proxy

If you’re enabling IAP for the first time, you’ll need to configure your project’s OAuth consent screen. The steps here will walk through how to do that. This consent screen is what users will see when they attempt to access the application before logging in.

Once IAP is enabled and the OAuth consent screen has been configured, there should be an OAuth 2 client ID created in your GCP project. You can find this under “OAuth 2.0 Client IDs” in the “APIs & Services” > “Credentials” section of the cloud console. When you click on this credential, you’ll find a client ID and client secret. These need to be provided to Kubernetes as secrets so they can be used by a BackendConfig for configuring IAP. Apply the secrets to Kubernetes with the following command, replacing “xxx” with the respective credentials:

$ kubectl create secret generic iap-oauth-client-id \
--from-literal=client_id=xxx \
--from-literal=client_secret=xxx

BackendConfig is a Kubernetes custom resource used to configure ingress in GKE. This includes features such as IAP, Cloud CDN, Cloud Armor, and others. Apply the following BackendConfig configuration using kubectl, which will enable IAP and associate it with your OAuth client credentials:

We also need to ensure there are service ports associated with the BackendConfig in order to trigger turning on IAP. One way to do this is to make all ports for the service default to the BackendConfig, which is done by setting the “beta.cloud.google.com/backend-config” annotation to “{“default”: “config-default”}” in the service resource. See below for the updated service configuration.

Once you’ve applied the annotation to the service, wait a couple minutes for the infrastructure to settle. IAP should now be working. You’ll need to assign the “IAP-secured Web App User” role in IAP to any users or groups who should have access to the application. Upon accessing the application, you should now be greeted with a login screen.

Your Kubernetes workload is now secured by IAP! Do note that VPC firewall rules can be configured to bypass IAP, such as rules that allow traffic internal to your VPC or GKE cluster. IAP will provide a warning indicating which firewall rules allow bypassing it.

For an extra layer of security, IAP sets signed headers on inbound requests which can be verified by the application. This is helpful in the event that IAP is accidentally disabled or misconfigured or if firewall rules are improperly set.

Together with GCLB and GCP-managed certificates, IAP provides a great solution for serving and securing internal applications that can be accessed anywhere without the need for a VPN.

Follow @tyler_treat

June 22, 2020June 22, 2020

Zero-Trust Security on GCP With Context-Aware Access

A lot of our clients at Real Kinetic leverage serverless on GCP to quickly build applications with minimal operations overhead. Serverless is one of the things that truly differentiates GCP from other cloud providers, and App Engine is a big component of this. Many of these companies come from an on-prem world and, as a result, tend to favor perimeter-based security models. They rely heavily on things like IP and network restrictions, VPNs, corporate intranets, and so forth. Unfortunately, this type of security model doesn’t always fit nicely with serverless due to the elastic and dynamic nature of serverless systems.

Recently, I worked with a client who was building an application for internal support staff on App Engine. They were using Identity-Aware Proxy (IAP) to authenticate users and authorize access to the application. IAP provides a fully managed solution for implementing a zero-trust access model for App Engine and Compute Engine. In this case, their G Suite user directory was backed by Active Directory, which allowed them to manage access to the application using Single Sign-On and AD groups.

Everything was great until the team hit a bit of a snag when they went through their application vulnerability assessment. Because it was for internal users, the security team requested the application be restricted to the corporate network. While I’m deeply skeptical of the value this adds in terms of security—the application was already protected by SSO and two-factor authentication and IAP cannot be bypassed with App Engine—I shared my concerns and started evaluating options. Sometimes that’s just the way things go in a larger, older organization. Culture shifts are hard and take time.

App Engine has firewall rules built in which allow you to secure incoming traffic to your application with allow/deny rules based on IP, so it seemed like an easy fix. The team would be in production in no time!

Unfortunately, there are some issues with how these firewall rules work depending on the application architecture. All traffic to App Engine goes through Google Front End (GFE) servers. This provides numerous benefits including TLS termination, DDoS protection, DNS, load balancing, firewall, and integration with IAP. It can present problems, however, if you have multiple App Engine services that communicate with each other internally. For example, imagine you have a frontend service which talks to a backend service.

App Engine does not provide a static IP address and instead relies on a large, dynamic pool of IP addresses. Two sequential outbound calls from the same application can appear to originate from two different IP addresses. One option is to allow all possible App Engine IPs, but this is riddled with issues. For one, Google uses netblocks that dynamically change and are encoded in Sender Policy Framework (SPF) records. To determine all of the IPs App Engine is currently using, you need to recursively perform DNS lookups by fetching the current set of netblocks and then doing a DNS lookup for each netblock. These results are not static, meaning you would need to do the lookups and update firewall rules continually. Worse yet, allowing all possible App Engine IPs would be self-defeating since it would be trivial for an attacker to work around by setting up their own App Engine application to gain access, assuming there isn’t any additional security beyond the firewall.

Another, slightly better option is to set up a proxy on Compute Engine in the same region as your App Engine application. With this, you get a static IP address. The downside here is that it’s an additional piece of infrastructure that must be managed, which isn’t great when you’re shooting for a serverless architecture.

Luckily, there is a better solution—one that fits our serverless model and enables us to control external traffic while allowing App Engine services to securely communicate internally. IAP supports context-aware access, which allows enforcing granular access controls for web applications, VMs, and GCP APIs based on an end-user’s identity and request context. Essentially, context-aware access brings a richer zero-trust model to App Engine and other GCP services.

To set up a network firewall in IAP, we first need to create an Access Level in the Access Context Manager. Access Levels are a way to add an extra level of security based on request attributes such as IP address, region, time of day, or device. In the client’s case, they can create an Access Level to only allow access from their corporate network.

We can then add the Access Level to roles that are assigned to users or groups in IAP. This means even if users are authenticated, they must be on the corporate network to access the application.

To allow App Engine services to communicate freely, we simply need to assign the IAP-secured Web App User role without the Access Level to the App Engine default service account. Services will then authenticate as usual using OpenID Connect without the added network restriction. The default service account is managed by GCP and there are no associated credentials, so this provides a solid security posture.

Now, at this point, we’ve solved the IP firewall problem, but that’s not really in the spirit of zero-trust, right? Zero-trust is a security principle believing that organizations should not inherently trust anything inside or outside of their perimeters and instead should verify anything trying to connect to their systems. Having to connect to a VPN in order to access an application in the cloud is kind of a bummer, especially when the corporate VPN goes down. COVID-19 has made a lot of organizations feel this pain. Fortunately, Access Levels can be a lot smarter than providing simple lists of approved IP addresses. With the Cloud IAM Conditions Framework, we can even write custom rules to allow access based on URL path, resource type, or other request attributes.

At this point, I talked the client through the Endpoint Verification process and how we can shift away from a perimeter-based security model to a defense-in-depth, zero-trust model. Rather than requiring the end-user to be signed in from the corporate network, we can require them to be signed in from a trusted, corporate-owned device from anywhere. We can require that the device has a screen lock and is encrypted or has a minimum OS version.

With IAP and context-aware access, we can build layered security on top of applications and resources without the need for a VPN, while still centrally managing access. This can even extend beyond GCP to applications hosted on-prem or in other cloud platforms like AWS and Azure. Enterprises don’t have to move away from more traditional security models all at once. This pattern allows you to gradually shift by adding and removing Access Levels and attributes over time. Zero-trust becomes much easier to implement within large organizations when they don’t have to flip a switch.

Follow @tyler_treat