Real Kinetic – Brave New Geek

November 14, 2024

Platform Engineering as a Service

Like most industry jargon, “DevOps” means a lot of things to a lot of different people. While many folks view it as specific to certain tooling or practices, such as CI/CD or Infrastructure as Code (IaC), I’ve always viewed it as an organizational model for how software is built and delivered. In particular, my interpretation is that DevOps is about shifting more responsibilities “left” onto developers, moving away from the more traditional “throw it over the wall” approach to IT operations. No doubt this encompasses tooling or practices like CI/CD and IaC, which are responsibilities that developers now shoulder, perhaps with the support of dev tools, productivity, or enablement teams—some companies just call this the “DevOps” team.

While many organizations still operate with the traditional silos, DevOps has established itself as an industry norm. But as organizations push the boundaries of software development, the limitations of DevOps are becoming increasingly apparent. The problem is that DevOps, in its pursuit of speed and autonomy, often results in chaos and inefficiency. Teams end up reinventing the wheel, creating bespoke solutions for the same problems, and struggling with inconsistent tooling and practices across the organization. The outcome? Technical debt, fragmented processes, and wasted effort. Many of the teams we work with at Real Kinetic spend significantly more time on the “DevOps work” than they do on actual product work.

The Rise of Platform Engineering

This is where Platform Engineering comes in. Rather than having each development team own their entire infrastructure stack, platform engineering provides a centralized, productized approach to infrastructure and developer tools. It’s about creating reusable, self-service platforms that development teams can leverage to build, deploy, and scale their applications efficiently. These platforms abstract away the complexities of cloud infrastructure, CI/CD pipelines, and security, enabling developers to focus on writing code rather than managing infrastructure or “glue”.

Platform engineering brings structure to the chaos of DevOps by creating a standardized, cohesive platform that empowers development teams while maintaining best practices and governance. It’s a solution to the growing complexity and sprawl that comes with scaling software delivery and scaling DevOps. Platform engineering is very much in its infancy as DevOps was circa 2012, but there’s growing interest in it as organizations hit the ceiling of DevOps.

*Google Trends for “Platform Engineering”*

But There’s a Catch: The Investment Barrier

Implementing platform engineering isn’t without its challenges. Building a robust, scalable platform requires significant time, resources, and expertise. It demands a deep understanding of your organization’s technology stack, development workflows, and business objectives. And importantly, it diverts valuable resources away from core product development efforts.

Many organizations are hesitant to make this level of investment, especially if it’s not their core competency. They either end up doing it poorly—leading to a half-baked platform that doesn’t deliver the promised efficiencies—or they avoid it altogether, sticking to the DevOps status quo. This often leaves them with the worst of both worlds: the overhead of DevOps without the benefits of a streamlined, developer-friendly platform.

What we most often see are dev tools teams masquerading as platform engineering. As Camille Fournier puts it, they build scripts or tools around configuration management and infrastructure provisioning, not products. Usually it’s because they either don’t want to have skin in the game or they don’t have a mandate from leadership. “Not having skin in the game” means some combination of these things: a) they don’t want to build their own software, b) they don’t want to be on the hook for operations, or c) they don’t want to be in the critical path for production or become a bottleneck. Instead, they provide “blueprints” for these things and the burden and responsibility ultimately falls on the product teams—this is just DevOps.

Another issue is that organizations don’t want to allocate the headcount to do real platform engineering. They’re not wrong to be hesitant because it takes real investment to actually do it. As a result, however, they take half measures. We frequently see companies take an InnerSource approach as an attempt to basically socialize platform engineering. I have never seen this approach work well in practice unless there’s clear ownership and the team has a clear mandate. And just as before, this approach pushes scripts, not products. Without ownership and directive, it just reverts back to DevOps which leads to inefficiency and sprawl.

The Solution: Platform Engineering as a Service

This is where Platform Engineering as a Service (PEaaS) comes in. Unlike traditional Platform as a Service (PaaS) offerings, which provide a rigid, one-size-fits-all platform that abstracts away the underlying infrastructure, PEaaS is designed to be flexible and tailored to your unique requirements. It doesn’t hide the infrastructure but rather empowers your teams by providing the tools, automation, and best practices needed to build and operate cloud-native applications efficiently for your organization.

Instead of building and maintaining a custom platform internally, organizations can partner with experts who specialize in platform engineering and bring deep, hands-on experience to the table. With PEaaS, you get all the benefits of a mature, scalable platform without the heavy upfront investment or the distraction from your core product development. This means that a robust, enterprise-grade platform can be implemented in a fraction of the time, and managed for a fraction of the cost. What typically takes companies 6 months or more to build can be accomplished in days or weeks. And, what typically takes a team of 5 – 10 engineers working full-time to manage can be handled by 1 engineer, often on a part-time basis.

At Real Kinetic, we’ve been helping organizations accelerate their software delivery for years. In fact, we’ve been doing platform engineering long before it was called platform engineering. We bring our extensive expertise in cloud infrastructure, CI/CD, and developer enablement to build platforms that align with your organization’s unique needs and technology stack. By leveraging our Platform Engineering as a Service, you can stay focused on what you do best—building great products—while we take care of the complexities of infrastructure, automation, and developer tooling.

Why Real Kinetic?

Why should you trust us with your platform engineering needs? Because we’ve done it before, time and time again. Real Kinetic has helped numerous organizations—from startups to large enterprises—modernize their software delivery practices, improve developer productivity, and accelerate time to market. Our approach is rooted in real-world experience, not theory. We understand the challenges of scaling platforms because we’ve been there ourselves.

When you partner with Real Kinetic, you’re not just getting a service provider—you’re getting a team of experts who are invested in your success and have skin in the game. We’re here to build a platform that scales with your business, optimizes your development workflows, and ultimately drives more value for your customers.

Ready to Level Up Your Software Delivery?

If you’re tired of the inefficiencies of DevOps and ready to embrace the power of platform engineering, let’s talk. Real Kinetic’s Platform Engineering as a Service is your fast track to a scalable, efficient platform that empowers your developers and accelerates your time to market. And if you’re using AWS or GCP, we’re also looking for a few companies to pilot our batteries-included platform engineering product Konfigurate.

Follow @tyler_treat

November 5, 2024November 15, 2024

Automating Infrastructure as Code with Vertex AI

A lot of companies are trying to figure out how AI can be used to improve their business. Most of them are struggling to not just implement AI, but to even find use cases that aren’t contrived and actually add value to their customers. We recently discovered a compelling use case for AI integration in our Konfigurate platform, and we found that implementing generative AI doesn’t require a great deal of complexity. I’m going to walk you through what we learned about integrating an AI assistant into our production system. There’s a ton of noise out there about what you “need” to integrate AI into your product. The good news? You don’t need much. The bad news? It took too much time sifting through nonsense to find what actually helps deliver value with AI.

We’ll show you how to leverage Google’s Vertex AI with Gemini 1.5 to implement multimodal input for automating the creation of infrastructure as code. We’ll see how to make our AI assistant context-aware, how to configure output to be well-structured, how to tune the output without needing actual model tuning, and how to test the model.

Our Use Case

The Context

Konfigurate takes a modern approach to infrastructure as code (IaC) that shifts more concerns left into the development process such as security, compliance, and architecture standardization. In addition to managing your cloud infrastructure, it also integrates with GitHub or GitLab to manage your organization’s repository structure and CI/CD.

Workloads are organized into Platforms and Domains, creating a structured environment that connects GitHub/GitLab with your cloud platform for seamless application and infrastructure management. Everything in Konfigurate—Platforms, Domains, Workloads, Resources—is GitOps-driven and implemented through YAML configuration. Below is an example showing the configuration for an “Ecommerce” Platform:

apiVersion: konfig.realkinetic.com/v1alpha8
kind: Platform
metadata:
  name: ecommerce
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce
  gitlab:
    parentGroupId: 88474985
  gcp:
    billingAccountId: "XXXXXX-XXXXXX-XXXXXX"
    parentFolderId: "38822600023"
    defaultEnvs:
      - label: dev
    services:
      defaults:
        - cloud-run
        - cloud-sql
        - pubsub
        - firestore
        - redis
  groups:
    dev:
      - ecomm-devs@realkinetic.com
    maintainer:
      - ecomm-maintainers@realkinetic.com
    owner:
      - sre@realkinetic.com

Example Konfigurate Platform YAML

The Problem

The Konfigurate objects like Platforms, Domains, and Workloads are a well-structured problem. We have technical specifications for them defined in a way that’s easily interpretable by programs. In fact, as you can probably tell from the example above, they are simply Kubernetes CRDs, meaning they are—quite literally—well-defined APIs. And as you can tell from the example, these YAML configurations are fairly straightforward, but they can still be tedious to write by hand. Instead, usually what happens, which also happens with every other IaC tool, is definitions get copy/pasted and proliferated. We saw an opportunity for AI due to the structured nature of the system and definition of the problem space.

The Solution

Our idea was to create an AI assistant that could generate Konfigurate IaC definitions based on flexible user input. Users could interact with the system in a couple different ways:

Text Description: users could describe their desired system architecture using natural language, e.g. “Add a new analytics domain to the ecommerce platform and within it I need a new ETL pipeline that will pull data from the orders database, process it in Cloud Run, and write the transformed data to BigQuery.”
Architecture Diagram: users could provide an image of their architecture diagram.

While we only introduced support for natural language and image-based inputs, we also validated that it worked with audio-based descriptions of the architecture as well with no additional effort. We tested this by recording ourselves describing the infrastructure and then providing an M4A file to the model. We decided not to include this mode of input since, while cool, it seemed not particularly practical.

The Value

This multimodal approach not only saves developers hours of time spent on boilerplate code but also accommodates different working styles and preferences. Whether a team uses visual tools for architecture design or prefers text-based planning, our system can adapt, getting them up and running with minimal mental effort. Developers would still be responsible for verifying system behavior and testing, but the initial setup time could be drastically reduced across various input methods.

Critically, we found this feature makes IaC more accessible and productive for a much broader set of roles and skill sets. For instance, we’ve worked with mainframe COBOL engineers, data analysts, and developers with no cloud experience who are now able to more effectively implement cloud infrastructure and systems. It doesn’t hide the IaC from them, it just gives them a reliable starting point to work from that is actually grounded to their environment and problem space. What we have found with our AI-assisted infrastructure and our more general approach to Visual IaC is that developers spend more time focusing on their actual product and less time on undifferentiated work.

The Technology

Our team has a lot of GCP experience, so we decided to use the Vertex AI platform and the Gemini-1.5-Flash-002 model for this project. It was a no-brainer for us. We know the ins and outs of GCP, and Vertex AI offers an all-in-one managed solution that makes it easy to get going. This particular model is fast and most importantly it’s cost-effective. As I am sure this will ring true for many of you, we didn’t want to mess around with setting up our own infrastructure or dealing with the headaches of managing our own AI models. The Vertex AI Studio made it really easy to start developing and iterating prompts as well as trying different models.

No, You Don’t Need RAG (At Least, We Didn’t)

Great, you’ve got your fancy AI setup, but don’t you need some complex retrieval system to make it context-aware? Sure, RAG (Retrieval Augmented Generation) is often touted as essential for creating context-aware AI agents. Our experience took us down a different path.

When researching how to create a context-aware GPT agent, you’ll inevitably encounter RAG. This typically involves:

Vector databases for efficient similarity search
Complex indexing and retrieval systems
Additional infrastructure for training and fine-tuning models

Our Initial Approach

We started by preparing JSONL-formatted data thinking we’d feed it into a RAG system. The plan was to have our AI model learn from this structured data to understand our Konfigurate specifications like Platforms and Domains. As we experimented, we found that going the RAG route wasn’t giving us the consistent, high-quality outputs we needed, so we pivoted.

The Big Prompt Solution

Instead of relying on RAG, we leaned heavily into prompt engineering. Here’s what we did:

Long-Context Prompts: we crafted detailed prompts that provided the necessary context about our Konfigurate system, its components, and how they interact.
Example IaC: as part of the prompt, we included numerous example definitions for Konfigurate objects such as Platforms and Domains.
Example Prompts: we also included example prompts and their corresponding correct outputs, essentially “showing” the AI what we expected.
Error Handling Prompts: we even included prompting that guided the AI on how to handle errors or edge cases.

Why This Worked Better

Consistency: by explicitly stating our requirements in the prompts, we got more consistent outputs.
Flexibility: it was easier to tweak and refine our prompts than to restructure a RAG system.
Control: we had more direct control over how the AI interpreted and used our domain-specific knowledge.
Simplicity: no need for additional infrastructure or complex retrieval systems—instead, it’s just a single API call.

The Takeaway

While RAG has its place, don’t assume it’s always necessary. For our use case, well-crafted prompts proved more effective than a sophisticated retrieval system. I believe this was a better fit because of the well-structured nature of our problem space. We can trivially validate the results output by the model because they are data structures with specifications. As a result, we got our context-aware AI assistant up and running faster, with better results, and without the overhead or complexity of RAG. Remember, in the world of technology, most times the simplest solution is the most elegant.

Prompt Engineering: The Secret Sauce

While prompt engineering has become a bit of a meme, it turned out to be the most crucial part of this whole process. When you’re working with these AI models, everything boils down to how you craft your prompts. It’s where the magic happens—or doesn’t.

Let’s break down what this looks like in practice. We’re using the Vertex AI API with Node.js , so we started with their boilerplate code. The key player is the getGenerativeModel() function. Here’s a stripped-down version of what we’re feeding it:

const generativeModel = vertexAi.preview.getGenerativeModel({
  model: "gemini-1.5-flash-002",
  generationConfig: {
    maxOutputTokens: 4096,
    temperature: 0.2,
    topP: 0.95,
  },
  safetySettings: [
    {
      category: HarmCategory.HARM_CATEGORY_HATE_SPEECH,
      threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
      threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
      threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
    {
      category: HarmCategory.HARM_CATEGORY_HARASSMENT,
      threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
  ],
  systemInstruction: {
    role: "system",
    parts: [
      // Removed for brevity (detailed below)
    ],
  },
});

Gemini 1.5 model initialization

Model: We’re using the latest version of Gemini 1.5 Flash, which is a lightweight and cost-effective model that excels at multimodal tasks and processing large amounts of text.
Generation Config: This is where we control things like the max output length as well as the “temperature” of the model. Temperature controls the randomness in token selection for the output. Gemini 1.5 Flash has a temperature range of 0 to 2 with 1 being the default. A lower temperature is good when you’re looking for a “true or correct” response, while a higher temperature can result in more diverse or unexpected results. This can be good for use cases that require a more “creative” model, but since our use case requires quite a bit of precision, we opted for a low temperature value.
Safety Settings: These are Google’s defaults. Refer to their documentation for customization.
System Instruction: This is the real meat of prompt engineering. It’s where you prime the model, giving it context and setting its role. I’ve omitted this from the example above to go into more depth on this below since it’s a critical part of the solution.

The Art and Science of Prompting

Here’s the thing: prompt engineering is a fine line between science and art. We spent a non-trivial amount of time crafting our prompts to get consistent, useful outputs. It’s not just about dumping information, it’s about structuring it in a way that guides the AI to give you what you actually need. Remember, these models will do exactly what you tell them to do, not necessarily what you want them to do. Sound familiar? It’s like debugging code, but instead of fixing logic errors, you’re fine-tuning language.

Fair warning, this is probably where you’ll spend most of your engineering time. It’s tempting to think the AI will just “get it,” but that’s not how this works. You need to be painfully clear and specific in your instructions. We went through many iterations, tweaking words here and there, restructuring our prompts, and sometimes completely overhauling our approach. But each iteration got us closer to that sweet spot where the model consistently churned out exactly what we needed. In the end, nailing your prompt engineering is what separates a frustrating, inconsistent AI experience from one that feels like you’ve just added a new team member to your crew.

The System Instructions mentioned above provide a way to inform the model how it should behave, provide it context, tell it how to structure output, and so forth. Though this information is separate from the actual user-provided prompt, they are still technically part of the overall prompt sent into the model. Effectively, System Instructions provide a way to factor out common prompt components from the user-provided prompt. I won’t show all of our System Instructions because there are quite a few, but I’ll show several examples below to give you an idea. Again, this is about being painstakingly explicit and clear about what you want the model to do.

“Konfigurate is a system that manages cloud infrastructure in AWS or Google Cloud Platform. It uses Kubernetes YAML files in order to specify the configuration. Konfigurate makes it easy for developers to quickly and safely configure and deploy cloud resources within a company’s standards. You are a Platform Engineer who’s job is to help Application Software Engineers author their Konfigurate YAML specifications.”
“I am going to provide some example Konfigurate YAML files for your reference. Never output this example YAML directly. Rather, when providing examples in your output, generate new examples with different names and so forth.”
“Please provide the complete YAML output without any explanations or markdown formatting.”
“If the user asks about something other than Konfigurate or if you are unable to produce Konfigurate YAML for their prompt, tell them you cannot help with that (this is the one case to return something other than YAML). Specifically, respond with the following: ‘Sorry, I’m unable to help with that.’”

Controlling Output and Context-Awareness

The example System Instructions above hint at this but it’s something worth going into more detail. First, our AI assistant has a very specific task: generate Konfigurate IaC YAML for users. For this reason, we never want it to output anything other than Konfigurate YAML to users, nor do we want it to respond to any prompts that are not directly related to Konfigurate. We handle this purely through prompting. To help the model understand Konfigurate IaC, we provide it with an extensive set of examples and tell it to only ever output complete YAML without any explanations or markdown formatting.

However, the output is actually more involved than this for our situation. That’s because we don’t just want to support generating new IaC, but also modify existing resources as well. This means the model doesn’t just need to be context-aware, it also needs to understand the distinction between “this is a new resource” and “this is an existing resource being modified.” This is important because Konfigurate is GitOps-driven, meaning the IaC resources are created in a branch and then a pull request is created for the changes. We need to know which resources are being created or modified, and if the latter, where those resources live.

To make the model context-aware, we feed it the definitions for the existing resources in the user’s environment. This needs to happen at “prompt time”, so this information is not included as part of the System Instructions. Instead, we fetch this information on demand when a user prompt is submitted and augment their prompt with it. Additionally, we provide the UI context in which the user is submitting the prompt from. For example, if they submit a prompt to create a new Domain while within the Ecommerce Platform, we can infer that they wish to create a new Domain within this specific Platform. It may seem obvious to us, but the model is completely unaware of this and so we need to provide it with this context. Below is the full code showing how this works and how the prompt is constructed.

export const generateYaml = async (
  context: AIContext,
  prompt: string,
  fileData?: FileData,
) => {
  const k8sApi = kc.makeApiClient(k8s.CustomObjectsApi);
  const { controlPlaneProjectId, defaultBranch } = await getOrSetGitlabContext(k8sApi);

  // Get user's environment information from the control plane
  const [placeholders, konfigObjects] = await Promise.all([
    getPlaceHolders(),
    getKonfigObjectsYAML(controlPlaneProjectId, defaultBranch),
  ]);

  const parts: Part[] = [];
  if (fileData) {
    parts.push({
      fileData,
    });
  }
  if (prompt) {
    parts.push({
      text: prompt,
    });
  }

  // Add user's environment context to the prompt
  parts.push(
    {
      text:
        "Replace the placeholders with the following values if they should be present " +
        "in the output YAML unless the prompt is referring to actual YAMLs from the " + 
        "user's environment, in which case use the YAML as is without replacing " +
        "values: " + JSON.stringify(placeholders, null, 2) + ".",
    },
    {
      text:
        'Following the "---" below are all existing Konfigurate YAMLs for the ' +
        "user\'s environment should they be needed either to reference or modify " +
        "and provide as output based on the prompt. Don't forget to never output " +
        "the example YAML exactly as is without modifications. Only output " +
        "Konfigurate object YAML and no other YAML structures. Infer appropriate " +
        "emails for dev, maintainer, and owner groups based on those in the " +
        "provided YAML below if possible.\n" +
        "---\n" +
        konfigObjects +
        "\n---\n",
    },
  );

  // Add user's UI context to the prompt
  if (context) {
    let contextPrompt = "";
    let contextSet = false;
    if (context.platform && context.domain && context.workload) {
      contextPrompt = `The user is operating within the context of the ${context.workload} Workload which is in the ${context.domain} Domain of the ${context.platform} Platform.`;
      contextSet = true;
    } else if (context.platform && context.domain) {
      contextPrompt = `The user is operating within the context of the ${context.domain} Domain of the ${context.platform} Platform.`;
      contextSet = true;
    } else if (context.platform) {
      contextPrompt = `The user is operating within the context of the ${context.platform} Platform.`;
      contextSet = true;
    }

    if (contextSet) {
      contextPrompt +=
        " Use this context to infer where output objects should go should " +
        "the user not provide explicit instructions in the prompt.";
      parts.push({
        text: contextPrompt,
      });
    }
  }

  const contents: Content[] = [
    {
      role: "user",
      parts,
    },
  ];

  const req: GenerateContentRequest = {
    contents,
  };

  const resp = await makeVertexRequest(req);
  return { error: resp === errorResponseMessage, content: resp };
};

This prompt manipulation makes the model smart enough to understand the user’s environment and the context in which they are operating within. Feeding it all of this information is possible due to Gemini 1.5’s context window. The context window acts like a short-term memory, allowing the model to recall information as part of its output generation. While a person’s short-term memory is generally quite limited both in terms of the amount of information and recall accuracy, generative models like Gemini 1.5 can have massive context windows and near-perfect recall. Gemini 1.5 Flash in particular has a 1-million-token context window, and Gemini 1.5 Pro has a 2-million-token context window. For reference, 1 million tokens is the equivalent of 50,000 lines of code (with standard 80 characters per line) or 8 average-length English novels. This is called “long context”, and it allows us to provide the model with massive prompts while it is still able to find a “needle in a haystack.”

Long context has allowed us to make the model context-aware with minimal effort, but there’s still a question we have not yet addressed: how can the model also output metadata along with the generated IaC YAML? Specifically, we need to know the file path for each respective Konfigurate object so that we create new resources in the right place or we modify the correct existing resources. The answer, of course, is more prompt engineering. To solve this problem, we instructed the model to include metadata YAML with each Konfigurate object. This metadata contains the file path for the object and whether or not it’s an existing resource. Here’s an example:

apiVersion: konfig.realkinetic.com/v1alpha8
kind: Domain
metadata:
  name: dashboard
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: internal-services
spec:
  domainName: Dashboards
---
filePath: konfig/internal-services/dashboard-domain.yaml
isExisting: false

We did this by providing the model with several examples. Here is the System Instruction prompt we used:

{
  text:
    "For each Konfigurate YAML you output, include the following metadata, " +
    "also in YAML format, following the Konfigurate object itself: " +
    "filePath, isExisting. Here are some examples:\n" + metadataExample,
}

It seems simple, but it was surprisingly effective and reliable.

Model Stability and Testing

Working with LLMs is a bit like describing a problem to someone else who writes the code to solve it—but without seeing the code, making it impossible to debug when issues arise. Worse yet, subtle changes in the description of the problem is akin to the other person starting over fully from scratch each time, so you might get consistent results or it could be completely different. There are also cases where no matter how explicit you are in your prompting, the model just doesn’t do the right thing. For example, with Gemini-1.5-Flash-001, I had problems preventing the AI from outputting the examples verbatim. I told it, in a variety of ways, to generate new examples using the provided ones as reference for the overall structure of resources, but it simply wouldn’t do it—until I upgraded to Gemini-1.5-Flash-002.

What we saw is that something as simple as just changing the model version could result in wildly different output. This is a nascent area but it’s a major challenge for companies attempting to leverage generative AI within their products or, worse, as a core component of their product. The only solution I can think of is to have a battery of test prompts you feed your AI and compare the results. But even this is problematic as the output content might be the same but the structure may have slight variations. In our case because we are generating YAML, it’s easy for us to validate output, but for use cases that are less structured, this seems like a major concern. Another solution is to feed results into a different model, but this feels equally precarious.

In addition to model stability, we had some challenges with “jailbreaking” the model. While we were never able to jailbreak the model to operate outside the context of Konfigurate, we were on occasion able to get it to provide Konfigurate output that was outside the bounds of our prompting. We did not invest a ton of time into this area as it felt like there wasn’t great ROI and it wasn’t really a concern within our product, but it’s certainly a concern when building with LLMs.

Patterns That Worked For Us: Prompt Engineering Pro Tips

You have stuck with us this far and now it’s time for some concrete strategies that consistently improved our AI’s performance. Here’s what we learned:

Be Specific About Output: tell the model exactly what you want and how you want it. For us, that meant specifying YAML as the output format. Don’t leave room for interpretation—the clearer you are, the better the results.
Show, Don’t Just Tell: give the model examples of what good output looks like. We explicitly prompted our model to reference our example resource specifications. It’s like training a new team member—show them what success looks like.
Use Placeholders: providing examples to the model worked great, except when it would use specific field values from the examples in the user’s output. To address this we used sentinel placeholder values in the examples and then had a step that told the model to replace the placeholders with values from the user’s environment at prompt time.
Error Handling is Key: just like you’d build error handling into your code, build it into your prompts. Give the model clear instructions on how to respond when it encounters ambiguous or out-of-scope requests. This keeps the user experience smooth, even when things go sideways.
The Anti-Hallucination Trick: it sounds silly but it helps to explicitly tell the model not to hallucinate and to only respond within the context you’ve provided. It’s not foolproof, but we’ve seen a significant reduction in made-up information, especially when you’ve fine-tuned the temperature.

Remember, prompt engineering is an iterative process. What works for one use case might not work for another. Keep experimenting, keep refining, and don’t be afraid to start from scratch if something’s not clicking. The goal is to find that sweet spot where your AI becomes a reliable, consistent part of your workflow.

Wrapping It Up

There you have it, our journey into integrating AI into the Konfigurate platform. We started thinking we needed all sorts of fancy tech only to find that sometimes, simpler is better. The big takeaways?

You don’t always need complex systems like RAG. A well-crafted prompt can often do the job just as well, if not better. Gemini 1.5’s long context and near-perfect recall makes it quite adept at the “needle-in-a-haystack” problem, and it enables pretty sophisticated use cases through complex prompting.
Prompt engineering isn’t just a buzzword or meme. It’s where the real work happens, and it’s worth investing your time to get it right.
LLMs are well-suited to structured problems because they are good at pattern matching. They’re also good at creative problems, but it’s less clear to us how to integrate something like this into a product versus a structured problem.
The AI landscape is constantly evolving. What works today might not be the best approach tomorrow. Stay flexible and keep experimenting.

We hope sharing our experience saves you some time and headaches. Remember, there’s no one-size-fits-all solution in AI integration. What worked for us might need tweaking for your specific use case. The key is to start simple, iterate often, and don’t be afraid to challenge conventional wisdom. You might just find that the “must-have” tools aren’t so must-have after all.

Now, go forth and build something cool!

Follow @tyler_treat

June 11, 2024October 9, 2024

Understanding Konfig’s Opinionation

In my last post, I talked about the benefits of an opinionated platform. An opinionated platform allows your engineers to focus on things that matter to your business, such as shipping and improving customer-facing products and services. This is in contrast to engineers spending substantial time on non-differentiating work like platform infrastructure. Rather than infrastructure architecture, developers can focus more on the product architecture. Konfig is an opinionated platform which provides two key value drivers: 1) reducing the investment and total cost of ownership needed to have an enterprise cloud platform and 2) minimizing the time to deliver new software products.

Konfig provides an out-of-the-box, enterprise-grade platform which is built with security and governance at its heart. Building this type of platform normally requires a sizable team of platform engineers which takes constant care, maintenance, and ongoing investment. With Konfig, we can now reallocate these resources to higher-value work, and we only need a small team to manage resource templates used by developers and implement business-specific components. It enables this small team to provide a robust platform within their company with organizational standards and opinions built in. This, in turn, allows an organization’s developers to self-service with a high degree of autonomy while ensuring they work within the bounds of our organization’s standards.

For many organizations, bringing a new software product to market can be a monumental undertaking. Even when the code is written, it can take some companies six months to a year just to get the system to production. Konfig reduces this sunk cost by accelerating the time-to-production. This is possible because it provides an opinionated platform that solves many of the common problems involved with building software in a way that codifies industry best practices. Konfig’s approach encourages deploying to production-like environments from day-1, something we call Deployment-Driven Development.

So what are Konfig’s opinions? What is the motivation behind each of them? And does an opinionated platform mean an organization is constrained or locked in as is often the case with a PaaS? Let’s explore each of these questions.

Opinions and their benefits

GitLab and GCP

Perhaps the most obvious of Konfig’s opinions is that it is centered around Google Cloud Platform and GitLab. While we are actively exploring support for GitHub and AWS, we chose to start with building a white-glove experience around GCP and GitLab for a few reasons.

First, GCP has best-in-class serverless offerings and managed services which lend themselves well to Konfig’s model. This is something we’ve written about extensively before. Services like Cloud Run, Google Kubernetes Engine (GKE), BigQuery, Firestore, and Dataflow are truly differentiators for Google Cloud.

Second, GCP’s Config Connector operator provides, we argue, a better alternative to Terraform for managing infrastructure. We’ll discuss this in more detail later.

Third, we believe GitLab’s CI/CD system is better designed than GitHub Actions. This is something worthy of its own blog post, but it’s a key factor in providing a platform that is both secure and has a great developer experience.

Lastly, GitLab uses a hierarchical structure with groups, subgroups, and projects which maps perfectly to Konfig’s own control plane, platform, domain hierarchy as well as GCP’s resource hierarchy with organizations, folders, and projects. This is a critical component in how Konfig manages governance.

The Konfig model works with any combination of cloud platform and DevOps tooling. We just chose to start with GCP and GitLab because they work so well together. With Konfig, they almost feel as if they are natively integrated.

Service-oriented architecture and domain-driven design

Konfig has a notion of platforms and domains. A platform is intended to map to a coarse-grained organizational boundary such as a product line or business unit. Platforms are further subdivided into domains, which are groupings of related services. This is loosely borrowed from the concept of domain-driven design. While Konfig does not take a particularly strong stance on DDD or how you structure workloads, it does encourage the use of APIs to connect services versus sharing databases between them.

This is a best practice that Konfig embraces because it reduces coupling which makes it easier to evolve services independently. A key aspect of this grouping will make certain tasks harder (though not necessarily impossible), on purpose, such as sharing a database between domains. It also promotes more durable teams who own different parts of a system, which is an organizational best practice we routinely encourage with our clients.

Structuring GitLab and GCP

Konfig maintains a consistent structure between GitLab and GCP based around the control plane, platform, domain hierarchy. Control planes, platforms, and domains are all declaratively defined in YAML. This hierarchy is central to Konfig because it allows it to enforce best practices for access management and cloud governance. It also provides powerful cost visibility because we can easily see the cloud spend and forecasted spend for platforms and domains. This means there is a rigid opinionation to the tiered structuring of folders and projects in GCP and subgroups and projects in GitLab.

In GCP, a control plane maps to a folder within your GCP organization (either specified by the user during setup or created by the Konfig CLI). Within this folder, there is a control plane project which houses the control plane Kubernetes cluster and a folder for each platform. Within each platform folder, there are folders for each domain which contain a project for each environment (e.g. dev, stage, and prod).

*Konfig hierarchy in GCP for a retail business*

In GitLab, a control plane maps to a subgroup within your organization’s top-level group (again, either specified by the user during setup or created by the CLI). Within this subgroup, there is a control plane project which houses the definitions for the control plane itself as well as the platforms and domains it manages. In addition to the control plane project, the control plane subgroup contains a child subgroup for each platform. Like the GCP structure, these platform subgroups in turn contain subgroups for each of the platform’s domains. It’s in these domain subgroups that our actual workload projects go. Konfig provides a GitLab template for creating new workload projects that includes a fully functional CI/CD pipeline and workload definition for configuring service settings and infrastructure resources.

*Corresponding Konfig hierarchy in GitLab*

This structure is critical because it allows Konfig to manage access and permissioning for both users as well as service accounts. This enables the platform to enforce strong isolation and security boundaries. The hierarchy also allows us to cascade permissions and governance cleanly.

Group-based access management

Konfig leverages groups to manage permissioning. Groups are managed using a customer’s identity provider, such as Google Cloud Identity or Microsoft Entra ID (formerly Azure AD). These groups are then synced into GitLab using SAML group links and into GCP with Cloud Identity (if using an external IdP). In the control plane, platform, and domain YAML definitions, we can specify what GitLab and GCP permissions a group should have from a single configuration source.

This opinionated model provides a single source of truth for both identity (by relying on a customer’s existing IdP) and access management (via Konfig’s YAML definitions). We’ve often seen organizations assign roles to individual users which is an anti-pattern, so Konfig relies on groups as a best practice. This model also lets us apply SDLC practices to access management.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: order-management
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce
spec:
  domainName: Order Management
  gcp:
    createProjects: true
    enableWorkloadIdentity: true
    envs: [dev, stage, prod]
    manageFolder: true
  gitlab:
    manageCIVars: true
    manageGroup: true
  groups:
    dev: [order-management-devs@widgets-4-all.biz]
    maintainer: [order-management-maintainers@widgets-4-all.biz]
    owner: [sre-admins@widgets-4-all.biz]

Example domain.yaml showing group permissions

Workload identity and least-privilege access

One of the benefits of Konfig is that it automatically manages IAM for workloads. This means you just specify what resources your application needs, such as databases, storage buckets, or caches, and it not only provisions those resources but also configures the application’s service account to have the minimal set of permissions needed to access them for the role specified.

In Konfig, every workload gets a dedicated service account. This is a best practice that ensures we don’t have overly broad access defined for services. Often what happens otherwise is service accounts get reused across applications, resulting in workload identities that accrue more and more roles. Another common anti-pattern is using the Compute Engine default service account which many GCP services use if a service account is not specified. This service account usually has the Editor role, which is a privileged role that grants broad access. Konfig disables this default service account altogether, preventing this from happening.

Decoupling IAM for developers, CI/CD, control planes, and workloads

There are four main groups of identities in Konfig: users, CI/CD pipelines, control planes, and workloads. Konfig takes the position that human users should generally not require elevated permissions. Similarly, all modifications to environments should occur via CI/CD pipelines rather than manually or through “ClickOps.” For this reason, Konfig enforces strong separation of developer user accounts, CI/CD service accounts, control plane service accounts, and workload service accounts.

Earlier we saw how Konfig relies on group-based access management rather than assigning roles to individual users. This provides a more uniform approach to access management. These roles provide a limited set of permissions. Instead, developers interact with Konfig through GitLab pipelines. Note, however, that the “owner” group permission, which is illustrated in the example domain.yaml above, provides break-glass access that can be set at the domain, platform, and control plane level to support situations that require emergency remediation.

Konfig maps GCP service accounts to CI/CD pipelines in GitLab using Workload Identity Federation. The service accounts used by the CI/CD system have limited permissions that are scoped by GCP IAM and Kubernetes RBAC. This means that a pipeline for a specific workload in a domain can only apply modifications to its own control plane namespace. This greatly reduces the blast radius of a compromised GitLab credential and also prevents teams from modifying environments that they don’t own.

Additionally, because Konfig relies on Workload Identity Federation, there are no long-lived credentials to begin with. Workload Identity Federation uses OpenID Connect to allow a GitLab pipeline to authenticate with GCP and use a short-lived token to act as a GCP service account. This is in contrast to using service account keys for authenticating between GitLab and GCP, which is a security anti-pattern because it involves long-lived credentials that often do not get rotated. These keys are a common source of security breaches. And because the Konfig control plane is responsible for orchestrating resources, this CI/CD service account needs a very minimal set of permissions. Basically, it just needs permissions to apply Konfig definitions to its control plane namespace. The control plane handles the actual heavy lifting from that point on.

Each domain-environment pair gets its own namespace in the Konfig control plane. This namespace has its own service account that is scoped only to the GCP project associated with this domain environment. This allows the control plane to provision workloads and resources within a domain while having strong isolation between different domains and different environments.

We already discussed how Konfig manages IAM for workloads and implements least-privilege access. It’s important to note that, like the CI/CD service accounts, these workload service accounts have no keys associated with them and thus are never exposed to humans or to the CI/CD system. This means they are fully decoupled and easier to audit and monitor.

GitOps, branching, and release strategies

Konfig uses a GitOps model for managing platforms, domains, workloads, and infrastructure resource templates. These constructs are defined in YAML and deployed or promoted using Git-based workflows like merging a branch into main or tagging a release. This model is a best practice that provides a declarative single source of truth for our infrastructure (which is normally referred to as Infrastructure as Code or IaC), our GitLab and GCP implementations, and our organizational standards (by way of resource templates). For instance, we saw earlier how these declarative configurations are used to manage permissions in GitLab and GCP. This allows us to apply the same SDLC we use for application source code to our enterprise platform. This more comprehensive approach to managing infrastructure, source control, CI/CD, and cloud environment is something we call Platform as Code.

Konfig promotes a trunk-based development workflow with merge requests and code reviews. Releases are done by creating a tag. This GitOps model provides a clear audit trail and approval flow that most developers are already familiar with. This not only lends itself to providing a better developer experience but also a strong governance story. Infrastructure configuration is treated as data stored in source control. This makes it easy to backup and restore, but we’ve also chosen a format that is widely supported and makes writing custom tooling or integrating with existing tools easy.

Image promotion and single container artifact

Workload repositories can only contain a single deployable artifact. This means monorepos are not supported in Konfig, and repositories may only have a single Dockerfile that gets built and deployed. It also means workloads need to be containerized. This allows Konfig to make certain assumptions about CI/CD and SDLC that further improve the developer experience, security, and governance.

A problem we see regularly at companies is images getting rebuilt for different environments. Konfig’s image promotion model ensures the images used for testing are what is deployed to production without rebuilding containers or copying artifacts from development environments. The use of releases and environments in GitLab ensures there is a clear auditing and tracking of artifacts so that you know exactly what is deployed, by whom, and when.

*GitLab’s environment view shows the history of all deployments to an environment*

Managed services and serverless over other options when possible

We’ve spoken at length about the benefits of serverless and how, for many businesses, it may be a better fit than Kubernetes. Previously, I mentioned that one of the reasons we chose GCP initially to build an enterprise-ready platform was because of its emphasis on serverless and managed services. In particular, services like Cloud Run, GKE, and Dataflow are truly industry-leading container platforms. For this reason, Konfig supports these runtimes natively. We recommend Cloud Run as the default workload runtime but provide GKE as a supported engine for cases where Cloud Run is just not a good fit. Dataflow provides a fully managed execution environment for unified batch and stream data processing.

Leveraging managed services and serverless greatly reduces operational burden, improves security posture, allows developers to focus on product and feature development, and reduces production lead times. By supporting a smaller set of services, we can provide a great developer experience and security posture by automatically configuring service account permissions, autowiring environment variables in application containers, managing secrets, and a number of other benefits. Reducing options also simplifies architecture, improves maintainability and supportability, and reduces infrastructure sprawl. We’ve said before, we believe organizations should invest their engineers’ creativity and time into differentiating their customer-facing products and services, not infrastructure and other non-differentiating work.

This is similar in nature to PaaS, but where Konfig differs is it provides an “escape hatch.” That is to say, it provides well-supported paths for both working around the platform’s opinions and constraints and for moving off of the platform if needed. I’ll touch on these a bit later.

Resource templates over bespoke configurations

Infrastructure as code is often quite complicated because infrastructure is complicated. If you’ve ever worked with a large Terraform configuration you’ve probably experienced the challenges and pain points. It can be tedious to maintain and every implementation of it is different from company to company or even team to team. Konfig takes a different approach that provides an improved developer experience and stronger governance model. Workload definitions specify important metadata about a service such as the runtime engine, CPU and memory settings, infrastructure resources, and dependencies on other services. These definitions provide a declarative and holistic view of a workload that sits alongside the source code and follows the same SDLC.

Below is a simple workload.yaml for a service with three infrastructure resources, a Cloud Storage bucket, a Cloud SQL database, and a Pub/Sub topic. If you recall, the Konfig control plane will handle provisioning these resources and configuring the service account with properly scoped roles. It will also inject environment variables into the container so that the service can “discover” these resources at runtime.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Workload
metadata:
  name: payment-api
spec:
  region: us-central1
  runtime:
    kind: RunService
    parameters:
      template:
        containers:
          - image: payment-api
    resources:
      - kind: StorageBucket
        name: receipts
      - kind: SQLInstance
        name: payment-db
      - kind: PubSubTopic
        name: payment-authorized-events

Example workload.yaml showing resource dependencies

As you can imagine, there’s a lot more to configuring these resources than simply specifying their name. This is where Konfig’s notion of resource templates comes into play. Konfig relies on resource templates to abstract the complexity of configuring cloud resources and provide a means to implement and enforce organizational standards. For instance, we might enforce a specific version of PostgreSQL, high availability mode, and customer-managed encryption keys. For non-production environments, we may use a non-HA configuration to reduce costs.

This model allows a platform engineering or SRE team to centrally manage default or required configurations for resources. Now organizations can enforce a “golden path” or a standardized way of building something within their organization. Rather than relying on external policy scanners like Checkov that work reactively, we can build our policies directly into the platform and hide most of the complexity from developers, allowing them to focus on what matters: product and feature work. An organization can choose the right balance between autonomy and standardization for their unique situation, and we can eliminate infrastructure and architecture fragmentation.

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: high-availability
  namespace: konfig-templates
  labels:
    annotations:
      konfig.realkinetic.com/extra-fields: "settings.tier,settings.diskSize"
spec:
  databaseVersion: "POSTGRES_15"
  settings:
    tier: "db-custom-1-3840"
    diskSize: 25
    availabilityType: REGIONAL
    diskType: PD_SSD
    backupConfiguration:
      binaryLogEnabled: true
      enabled: true

Example resource template for Cloud SQL

API ingress and path-based routing

Standardizing API ingress and routing is another key part of reducing architecture fragmentation and improving developer productivity. Konfig takes an opinionated stance on how workloads interact with each other and how external traffic interacts with a workload. By default, workloads are only accessible to other workloads in the same domain. However, we can also expose workloads to other workloads in the same platform or even across platforms if they are within the same control plane. Lastly, we can expose a workload to external traffic from the internet or from other control planes.

Konfig manages load balancers to make this ingress seamless and straightforward. It also utilizes path-based routing that maps to the platform, domain, workload hierarchy to provide a clean way of exposing APIs. Path-based routing is a best practice we promote because, compared to host-based routing, there’s less infrastructure to maintain, it removes cross-origin resource sharing (CORS) as a concern, and there are significantly fewer DNS records involved. A common challenge for SaaS companies is getting customers to whitelist hostnames. This is often a major headache for enterprise customers where whitelisting hostnames can be difficult. Path-based routing eliminates this problem by exposing services under the same domain.

Config Connector for IaC rather than Terraform

As mentioned earlier, one of the reasons we chose GCP as the initial cloud platform supported in Konfig is Config Connector. Config Connector is a Kubernetes operator that lets you manage GCP resources the same way you manage Kubernetes applications, and it’s a model that many other cloud platforms and infrastructure providers are adopting as well. Config Connector offers a compelling alternative to Terraform for managing IaC with a number of advantages.

First, because Config Connector is specific to GCP, it offers a more native integration with the platform. This includes more fine-grained status reporting that tells us the state of individual resources. It also lets us benefit from Kubernetes events for improved visibility. This allows us to provide greater visibility into what is happening with your infrastructure which results in a better overall developer experience.

Second, it enables us to use a combination of Kubernetes RBAC and GCP IAM for managing access control. With this model, we can have strong security boundaries and reduced blast radius as it relates to infrastructure.

Third, Config Connector provides an improved model for state management and reconciliation. With Terraform, managing state drift and environment promotions can be challenging. Additionally, the Terraform state file often contains secrets stored in clear text which is a security risk and requires the state file itself to be treated as highly sensitive data. Yet, Terraform does not support client-side state encryption but rather relies on at-rest encryption of the state backends (this is one of the areas Terraform and OpenTofu have diverged). It’s also common to hit race conditions in Terraform, where certain resources need to spin up before others can successfully apply. Sometimes this doesn’t happen and the deploy fails, requiring the apply to be re-run.

Config Connector takes a very different approach which solves these issues. Instead, it relies on a control loop which periodically reconciles resources to automatically correct drift. Think of it almost like “terraform apply” running regularly to ensure your desired state and actual state are in constant lockstep. The other benefit is it decouples dependent resources so we avoid race conditions and sequencing problems. This is possible because Config Connector models infrastructure as an eventually consistent state. This means resources can provision independently, even if their dependencies are not ready—no more re-running jobs to get a failed apply to work. Lastly, it can rely on Kubernetes secrets or GCP Secret Manager secrets to store sensitive information such as passwords or credentials. Sensitive data is properly isolated from our infrastructure configuration.

Finally, it’s possible to both bulk import existing resources into Config Connector and export resources to Terraform. This makes it possible to migrate existing infrastructure into Config Connector or migrate from Config Connector to Terraform. Config Connector can also reference resources that it does not manage using external references. This provides a means to gradually migrate to or from Config Connector.

Dealing with constraints and vendor lock-in

Konfig is different from typical PaaS offerings in that it really is an opinionated bundling of components rather than a proprietary, monolithic platform or “walled garden.” Normally, with PaaS systems like Google App Engine or Heroku, you relied on proprietary APIs to build applications which made it very difficult to migrate off when the time came. It also meant when you hit the limits of the platform, that’s it. There’s nothing you can do to work around them. At the other end of the spectrum are products like Crossplane, which are geared towards helping you build your own internal cloud platform. This puts you squarely back into the realm of staffing a team of highly paid platform engineers to build and maintain such a platform. For some companies, this might be a strategic place to invest. For most, it’s not.

And while there are proprietary value-add components to Konfig we have built, the core of it is products you’re already using—namely GitLab and GCP—and the open source Config Connector. The value of Konfig is that it is an opinionated implementation of these pieces that reduces an organization’s total cost of ownership for an enterprise cloud platform and shortens the delivery time for new software products. What this means, though, is that there really isn’t any vendor lock-in beyond whatever lock-in GitLab and GCP already have. Because it’s built around Config Connector, you can just as well use Config Connector to manage your resources directly, you’ll just lose the benefits of Konfig like GitLab and GCP integration and governance, automatic resource provisioning and IAM, ingress management, and the Konfig UI.

Config Connector provides us a powerful escape hatch, either in the case of needing to remove Konfig or needing to step outside Konfig’s opinions. If we want to remove Konfig, we have a couple options. We can export the Config Connector resource definitions that Konfig manages and import them into a new Config Connector instance. This Config Connector instance can be run either as a GKE add-on which is fully managed by Google, or it can be installed into a GKE cluster manually. Alternatively, we can export the resources managed by Konfig to Terraform.

If we need to step outside Konfig’s opinions, for example to provision a resource not currently supported by Konfig such as a VM, we can use Config Connector directly which supports a broad set of resources. We can manage these resources using the same GitOps model we use for Konfig workloads. With this, we have an opinionated platform that can support an organization’s most common needs but, unlike a PaaS, can grow and evolve with your organization.

Konfig reduces your cloud platform engineering costs and the time to deliver new software products. Reach out to learn more about Konfig or schedule a demo.

Follow @tyler_treat

May 23, 2024October 9, 2024

No assembly required: the benefits of an opinionated platform

When you talk to a doctor about a medical issue they will often present you with all of the options but shy away from providing an unambiguous recommendation. When you talk to a lawyer about a legal matter they frequently do the same. While it’s important to understand your options and their trade-offs or associated risks, when you go to these specialists you are likely seeking the counsel of an experienced and knowledgeable expert in their field who can help you make an informed decision. What most people are probably looking for is the answer to “what would you, someone who knows a lot about this stuff, do if you were in this situation?” After all, many of us are probably capable of finding the options ourselves, but the difficult part is determining what the right course of action is for a particular situation.

One can guess that the reason these professions shy away from making clear recommendations is because they don’t want to be accused of malpractice. The result is that those of us who are less litigiously inclined can have a hard time getting an expert opinion. The truth is we often do this ourselves with our consulting at Real Kinetic. A client will ask us a question and we will present them with the options and their various trade-offs. In my opinion, this isn’t even because we’re afraid of being sued. It’s more because when you don’t truly have skin in the game, it’s easy to go into “engineer mode” where you provide the analysis and decision tree without actually offering a decision. We remove ourselves from the situation because we feel it’s not really our place to be involved, nor do we want to be liable. I think this happens with other professions too.

Almost always what our clients are looking for is a concrete recommendation, the answer to “what would you do in this situation?” Sometimes we have clear, actionable recommendations. Other times we don’t have a strong opinion, but we help them make a decision by putting ourselves in their shoes. A seasoned engineer knows the answer is often “it depends”, but when they have skin in the game, they also have to make a choice at the end of the day.

This is very much the case when it comes to advising clients on operationalizing cloud platforms for their organization. When you look at a platform like AWS or GCP, it’s just a pile of Legos. No picture of what you’re building, no assembly instructions, just a collection of infinite possibilities—and millions of decisions. Some people see a pile of Legos and can start building. Others, myself included, suddenly become incapable of making decisions. This is where most of our clients find themselves. They look to the vendor, but the vendor is just like the lawyer offering all of the possible legal maneuvers one could perform. It’s not entirely helpful unless they’re willing to go out on a limb. Most will stick to generic guidance from my experience.

These cloud platforms are inherently unopinionated because they must meet customers—all customers—where they are. Just like the attorney providing all of your possible legal options, these platforms provide all of the possible building blocks you might need to construct something. How do you assemble them? Well, that’s up to you. The vendor doesn’t want to be in the business of having skin in the game.

Our team has a lot of experience putting the Legos together. Over the years, we’ve identified the common patterns, the things that work well, the things that don’t, and the pitfalls to avoid. Clients hire us to help them do the same, but they don’t usually want us to enumerate all of the possible configurations they could assemble. They want the “what would you do in this situation?” And while every company and situation is unique, the reality is most companies would be perfectly fine with a common, opinionated platform that implements best practices and provides just the right amount of flexibility. Those best practices are the opinions and recommendations we offer our clients through our consulting. This is what led us to create Konfig, an opinionated workload delivery platform built specifically around GCP and GitLab. It’s something that lets us codify those opinions into a product customers can install. No assembly required.

In previous blog posts, I’ve referenced our team’s experience at Workiva, a company that went from startup to IPO on GCP. When Workiva went public, it had just two ops people who were primarily responsible for managing a small set of VMs on AWS. This was possible because Workiva leveraged Google App Engine, which provided an integrated platform that allowed its developers to focus on product and feature development. At the time, App Engine was about as opinionated as a platform could be. It felt like a grievous constraint which ultimately precipitated moving to AWS, but in fact it was a major boon to the company. It meant Workiva’s engineers allocated almost all of their time towards things that made the company money. Moving to AWS resulted in a multi-year effort to effectively recreate what we had with App Engine, just with much more headcount to support and evolve it.

We’ve seen firsthand the power of an integrated, opinionated platform. Through our consulting, we’ve also seen companies struggle tremendously with things like DevOps, Platform Engineering, and just generally operationalizing the cloud. GCP has evolved a lot since App Engine. Both GitLab and GCP are highly flexible platforms, but they lack any opinionation because they are designed to address a broad set of customer needs. This leaves a void where customers are left having to assemble the Legos to provide their own opinionation, which is where we see customers struggle the most. This is what prompted us to build Konfig as a means to provide that missing layer of opinionation.

PaaS has become a bit of a taboo now. Instead, organizations are investing significantly in developing their own Internal Developer Platform (IDP), which is basically just a PaaS you have to care for and maintain with your headcount instead of another company’s headcount. It’s not entirely obvious if this is strategically beneficial for companies that build software. In my opinion, many of these companies are better off shifting that investment towards things that differentiate their business. Companies should not be expressing their creativity on software architecture and platform infrastructure but rather on their customer-facing products and services (sometimes this necessitates innovating with infrastructure, but this tends to be more with internet-scale companies versus ordinary businesses). What an opinionated platform does is delete this discussion altogether. Now, with Konfig, we can get the same types of benefits we saw with App Engine over 10 years ago without the same constraints. And unlike App Engine, we have a means to customize the platform when needed without losing the benefits. You can have reasonable defaults and opinions but can evolve and grow as your needs or understanding change.

There’s a reason IKEA furniture comes with detailed instructions: while some people relish the challenge of figuring things out themselves, most just want to get on with using the finished product. The same is true for cloud platforms. While the flexibility of GCP, GitLab, and similar platforms is undeniable, it can lead to decision paralysis and wasted resources spent on building infrastructure that already exists.

This is where the benefits of an opinionated platform come in. By offering a pre-configured solution built around best practices, it eliminates the need for endless customization in order to get companies up and running faster. This frees up valuable engineering resources to focus on what truly matters, differentiation and innovation.

In my next post, I want to dive into exactly what opinions Konfig has and the reasoning behind each. We’ll also look at the escape hatch available to us so that when we do hit a constraint, we can easily move it out of the way.

Konfig reduces your cloud platform engineering costs and the time to deliver new software products. Reach out to learn more about Konfig or schedule a demo.

Follow @tyler_treat

April 29, 2024October 9, 2024

How Konfig provides an enterprise platform with GitLab and Google Cloud

In a previous post, I explained the fundamental competing priorities that companies have when building software: security and governance, maintainability, and speed to production. These three concerns are all in constant tension with each other. For companies either migrating to the cloud or beginning a modernization effort, addressing them can be a major challenge. When you’re unfamiliar with the cloud, building systems that are both secure and maintainable is difficult because you’re not in a position to make decisions that have long-lasting and significant impact—you just don’t know what you don’t know. One small misstep can result in a major security incident. A bad decision can take years to manifest a problem. As a result, these migration and modernization efforts often stall out as analysis paralysis takes hold.

This is where Real Kinetic usually steps in: to get a stuck project moving again, to provide guard rails, and to help companies avoid the hidden landmines by offering our expertise and experience. We’ve been there before, so we help navigate our clients through the foundational decision making, design, and execution of large-scale cloud migrations. We’ve helped migrate systems generating billions of dollars in revenue and hundreds of millions in cloud spend. We’ve also helped customers save tens of millions in cloud spend by guiding them through more cost-effective solution architectures. And while we’ve had a lot of success helping our clients operationalize the cloud, they still routinely ask us: why is it so damn difficult? The truth is it doesn’t have to be if you’re willing to take just a slightly more opinionated stance.

Recently, we introduced Konfig, our solution for this exact problem. Konfig packages up our expertise and years of experience operationalizing and building software in the cloud. More concretely, it’s an enterprise integration of GitLab and Google Cloud that addresses those three competing priorities I mentioned earlier. The reason it’s so difficult for organizations to operationalize GitLab and GCP is because they are robust and flexible platforms that address a broad set of customer needs. As a result, they do not take an opinionated stance on pretty much anything. This leaves a gap unaddressed, and customers are left having to put together their own opinionation that meets their needs—except, they usually aren’t in a position to do this. Thus, they stall.

Konfig gives you a functioning, enterprise-ready GitLab and GCP environment that is secure by default, has strong governance and best practices built-in, and scales with your organization. The best part? You can start deploying production workloads in a matter of minutes. It does this by taking an opinionated stance on some things. It bridges the gap that is unaddressed by Google and GitLab. Those opinions are the recommendations, guidance, and best practices we share with clients when they are operationalizing the cloud.

Perhaps the most obvious opinion is that Konfig is specific to GCP and GitLab. We could extend this model to other platforms like AWS and GitHub, but we chose to focus on building a white-glove experience with GCP and GitLab first because they work together so well. GCP has first-class managed services and serverless offerings which lend themselves to providing a platform that is secure, maintainable, and has a great developer experience. GitLab’s CI/CD is better designed than GitHub Actions and its hierarchical structure maps well to GCP’s resource hierarchy.

Moreover, Konfig embraces service-oriented architecture and domain-driven design which drives how we structure folders and projects in GCP and groups in GitLab. This structure gives us a powerful way to map access management and governance, which we’ll explore later. It’s a best practice that makes systems more maintainable and evolvable. We’ll discuss Konfig’s opinions and their rationale in more depth in a future post. For now, I want to explain how Konfig provides an enterprise platform by addressing each of the three concerns in the software development triangle: security and governance, maintainable infrastructure, and speed to production.

Security and Governance

Access Management

Konfig relies on a hierarchy consisting of control plane > platforms > domains > workloads. The control plane is the top-level container which is responsible for managing all of the resources contained within it. Platforms are used to group different lines of business, product lines, or other organizational units. Domains are a way to group related workloads or services.

This structure provides several benefits. First, we can map it to hierarchies in both GitLab and GCP, shown in the image below. A platform maps to a group in GitLab and a folder in GCP. A domain maps to a subgroup in GitLab and a nested folder along with a project per environment in GCP.

Konfig synchronizes structure and permissions between GitLab and GCP

This hierarchy lets us manage permissions cleanly because we can assign access at the control plane, platform, and domain levels. These permissions will be synced to GitLab in the form of SAML group links and to GCP in the form of IAM roles. When a user has “dev” access, they get the Developer role for the respective group in GitLab. In GCP, they get the Editor role for dev environment projects and Viewer for higher environments. “Maintainer” has slightly more elevated access, and “owner” effectively provides root access to allow for a “break-glass” scenario. The hierarchy means these permissions can be inherited by setting them at different levels. This access management is shown in the platform.yaml and domain.yaml examples below highlighted in bold.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  groups:
    dev: [ecommerce-devs@example.com]
    maintainer: [ecommerce-maintainers@example.com]
    owner: [ecommerce-owners@example.com]

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  groups:
    dev: [payment-devs@example.com]
    maintainer: [payment-maintainers@example.com]
    owner: [payment-owners@example.com]

domain.yaml

Authentication and Authorization

There are three different authentication and authorization concerns in Konfig. First, GitLab needs to authenticate with GCP such that pipelines can deploy to the Konfig control plane. Second, the control plane, which runs in a privileged customer GCP project, needs to authenticate with GCP such that it can create and manage cloud resources in the respective customer projects. Third, customer workloads need to be able to authenticate with GCP such that they can correctly access their resource dependencies, such as a database or Pub/Sub topic. The configuration for all of this authentication as well as the proper authorization settings is managed by Konfig. Not only that, but none of these authentication patterns involve any kind of long-lived credentials or keys.

GitLab to GCP authentication is implemented using Workload Identity Federation, which uses OpenID Connect to map a GitLab identity to a GCP service account. We scope this identity mapping so that the GitLab pipeline can only deploy to its respective control plane namespace. For instance, the Payment Processing team can’t deploy to the Fulfillment team’s namespace and vice versa.

Control plane to GCP authentication relies on domain-level service accounts that map a control plane namespace for a domain (let’s say Payment Processing) to a set of GCP projects for the domain (e.g. Payment Processing Dev, Payment Processing Stage, and Payment Processing Prod).

Finally, workloads also rely on service accounts to authenticate and access their resource dependencies. Konfig creates a service account for each workload and sets the proper roles on it needed to access resources. We’ll look at this in more detail next.

This approach to authentication and authorization means there is very little attack surface area. There are no keys to compromise and even if an attacker were to somehow compromise GitLab, such as by hijacking a developer’s account, the blast radius is minimal.

Least-Privilege Access

Konfig is centered around declaratively modeling workloads and their infrastructure dependencies. This is done with the workload.yaml. This lets us spec out all of the resources our service needs like databases, storage buckets, caches, etc. Konfig then handles provisioning and managing these resources. It also handles creating a service account for each workload that has roles that are scoped to only the resources specified by the workload. Let’s take a look at an example.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Workload
metadata:
  name: order-api
spec:
  region: us-central1
  runtime:
    kind: RunService
    parameters:
      template:
        containers:
          - image: order-api
  resources:
    - kind: StorageBucket
      name: receipts
    - kind: SQLInstance
      name: order-store
    - kind: PubSubTopic
      name: order-events

workload.yaml

Here we have a simple workload definition for a service called “order-api”. This workload is a Cloud Run service that has three resource dependencies: a Cloud Storage bucket called “receipts”, a Cloud SQL instance called “order-store”, and a Pub/Sub topic called “order-events”. When this YAML definition gets applied by the GitLab pipeline, Konfig will handle spinning up these resources as well as the Cloud Run service itself and a service account for order-api. This service account will have the Pub/Sub Publisher role scoped only to the order-events topic and the Storage Object User role scoped to the receipts bucket. Konfig will also create a SQL user on the Cloud SQL instance whose credentials will be securely stored in Secret Manager and accessible only to the order-api service account. The Konfig UI shows this workload, all of its dependencies, and each resource’s status.

Enforcing Enterprise Standards

After looking at the example workload definition above, you may be wondering: there’s a lot more to creating a storage bucket, Cloud SQL database, or Pub/Sub topic than just specifying its name. Where’s the rest? It’s a good segue into how Konfig offers a means for providing sane defaults and enforcing organizational standards around how resources are configured.

Konfig uses templates to allow an organization to manage either default or required settings on resources. This lets a platform team centrally manage how things like databases, storage buckets, or caches are configured. For instance, our organization might enforce a particular version of PostgreSQL, high availability mode, private IP only, and customer-managed encryption key. For non-production environments, we may use a non-HA configuration to reduce costs. Just like our platform, domain, and workload definitions, these templates are also defined in YAML and managed via GitOps.

We can also take this further and even manage what cloud APIs or services are available for developers to use. Like access management, this is also configured at the control plane, platform, and domain levels. We can specify what services are enabled by default at the platform level which will inherit across domains. We can also disable certain services, for example, at the domain level. The example platform and domain definitions below illustrate this. We enable several services on the Ecommerce platform and restrict Pub/Sub, Memorystore (Redis), and Firestore on the Payment Processing domain.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    services:
      defaults:
        - cloud-run
        - cloud-sql
        - cloud-storage
        - secret-manager
        - cloud-kms
        - pubsub
        - redis
        - firestore

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    services:
      disabled:
        - pubsub
        - redis
        - firestore

domain.yaml

This model provides a means for companies to enforce a “golden path” or an opinionated and supported way of building something within your organization. It’s also a critical component for organizations dealing with regulatory or compliance requirements such as PCI DSS. Even for organizations which prefer to favor developer autonomy, it allows them to improve productivity by setting good defaults so that developers can focus less on infrastructure configuration and more on product or feature development.

SDLC Integration

It’s important to have an SDLC that enables developer efficiency while also providing a sound governance story. Konfig fits into existing SDLCs by following a GitOps model. It allows your infrastructure to follow the same SDLC as your application code. Both rely on a trunk-based development model. Since everything from platforms and domains to workloads is managed declaratively, in code, we can apply typical SDLC practices like protected branches, short-lived feature branches, merge requests, and code reviews.

Even when we create resources from the Konfig UI, they are backed by this declarative configuration. This is something we call “Visual IaC.” Teams who are more comfortable working with a UI can still define and manage their infrastructure using IaC without even having to directly write any IaC. We often encounter organizations who have teams like data analytics, data science, or ETL which are not equipped to deal with managing cloud infrastructure. This approach allows these teams to be just as productive—and empowered—as teams with seasoned infrastructure engineers while still meeting an organization’s SDLC requirements.

Creating a resource in the Konfig workload UI

Cost Management

Another key part of governance is having good cost visibility. This can be challenging for organizations because it heavily depends on how workloads and resources are structured in a customer’s cloud environment. If things are structured incorrectly, it can be difficult to impossible to correctly allocate costs across different business units or product areas.

The Konfig hierarchy of platforms > domains > workloads solves this problem altogether because related workloads are grouped into domains and related domains are grouped into platforms. A domain maps to a set of projects, one per environment, which makes it trivial to see what a particular domain costs. Similarly, we can easily see an aggregate cost for an entire platform because of this grouping. The GCP billing account ID is set at the platform level and all projects within a platform are automatically linked to this account. Konfig makes it easy to implement an IT chargeback or showback policy for cloud resource consumption within a large organization.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    billingAccountId: "123ABC-456DEF-789GHI"

platform.yaml

Maintainable Infrastructure

Opinionated Model

We’ve talked about opinionation quite a bit already, but I want to speak to this directly. The reason companies so often struggle to operationalize their cloud environment is because the platforms themselves are unwilling to take an opinionated stance on how customers should solve problems. Instead, they aim to be as flexible and accommodating as possible so they can meet as many customers where they are as possible. But we frequently hear from clients: “just tell me how to do it” or even “can you do it for me?” Many of them don’t want the flexibility, they just want a preassembled solution that has the best practices already implemented. It’s the difference between a pile of Legos with no instructions and an already-assembled Lego factory. Sure, it’s fun to build something yourself and express your creativity, but this is not where most businesses want creativity. They want creativity in the things that generate revenue.

Konfig is that preassembled Lego factory. Does that mean you get to customize and change all the little details of the platform? No, but it means your organization can focus its energy and creativity on the things that actually matter to your customers. With Konfig, we’ve codified the best practices and patterns into a turnkey solution. This more opinionated approach allows us to provide a good developer experience that results in maintainable infrastructure. The absence of creative constraints tends to lead to highly bespoke architectures and solutions that are difficult to maintain, especially at scale. It leads to a great deal of inefficiency and complexity.

Architectural Standards

Earlier we saw how Konfig provides a powerful means for enforcing enterprise standards and sane defaults for infrastructure as well as how we can restrict the use of certain services. While we looked at this in the context of governance, it’s also a key ingredient for maintainable infrastructure. Organizational standards around infrastructure and architecture improve efficiency and maintainability for the same reason the opinionation we discussed above does. Konfig’s templating model and approach to platforms and domains effectively allows organizations to codify their own internal opinions.

Automatic Reconciliation

There are a number of challenges with traditional IaC tools like Terraform. One such challenge is the problem of state management and drift. A resource managed by Terraform might be modified outside of Terraform which introduces a state inconsistency. This can range from something simple like a single field on a resource to something very complex, such as an entire application stack. Resolving drift can sometimes be quite problematic. Terraform works by storing its configuration in a state file. Aside from the problem that the state file often contains sensitive information like passwords and credentials, the Terraform state is applied in a “one-off” fashion. That is to say, when the Terraform apply command is run, the current state configuration is applied to the environment. At this point, Terraform is no longer involved until the next time the state is applied. It could be hours, days, weeks, or longer between applies.

Konfig uses a very different model. In particular, it regularly reconciles the infrastructure state automatically. This solves the issue of state drift altogether since infrastructure is no longer applied as “one-off” events. Instead, it treats infrastructure the way it actually is—a living, breathing thing—rather than a single, point-in-time snapshot.

Speed to Production

Turnkey Setup

Our goal with Konfig is to provide a fully turnkey experience, meaning customers have a complete and enterprise-grade platform with little-to-no setup. This includes setup of the platform itself, but also setup of new workloads within Konfig. We want to make it as easy and frictionless as possible for organizations to start shipping workloads to production. It’s common for a team to build a service that is code complete but getting it deployed to various environments takes weeks or even months due to the different organizational machinations that need to occur first. With Konfig, we start with a workload deploying to an environment. You can use our workload template in GitLab to create a new workload project and deploy it to a real environment in a matter of minutes. The CI/CD pipeline is already configured for you. Then you can work backwards and start adding your code and infrastructure resources. We call this “Deployment-Driven Development.”

Konfig works by using a control plane which lives in a customer GCP project. The setup of this control plane is fully automated using the Konfig CLI. When you run the CLI bootstrap command, it will run through a guided wizard which sets up the necessary resources in both GitLab and GCP. After this runs, you’ll have a fully functioning enterprise platform.

Workload Autowiring

We saw earlier how workloads declaratively specify their infrastructure resources (something we call resource claims) and how Konfig manages a service account with the correctly scoped, minimal set of permissions to access those resources. For resources that use credentials, such as Cloud SQL database users, Konfig will manage these secrets by storing them in GCP’s Secret Manager. Only the workload’s service account will be able to access this. This secret gets automatically mounted onto the workload. Resource references, such as storage bucket names, Pub/Sub topics, or Cloud SQL connections, will also be injected into the workload to make it simple for developers to start consuming these resources.

API Ingress and Path-Based Routing

Konfig makes it easy to control the ingress of services. We can set a service such that it is only accessible within a domain, within a platform, or within a control plane. We can even control which domains can access an API. Alternatively, we can expose a service to the internet. Konfig uses a path-based routing scheme which maps to the platform > domain > workload hierarchy. Let’s take a look at an example platform, domain, and workload configuration.

apiVersion: konfig.realkinetic.com/v1beta1
kind: Platform
metadata:
  name: ecommerce-platform
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/control-plane: konfig-control-plane
spec:
  platformName: Ecommerce Platform
  gcp:
    api:
      path: /ecommerce

platform.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Domain
metadata:
  name: payment-processing
  namespace: konfig-control-plane
  labels:
    konfig.realkinetic.com/platform: ecommerce-platform
spec:
  domainName: Payment Processing
  gcp:
    api:
      path: /payment

domain.yaml

apiVersion: konfig.realkinetic.com/v1beta1
kind: Workload
metadata:
  name: authorization-service
spec:
  region: us-central1
  runtime:
    kind: RunService
    parameters:
      template:
        containers:
          - image: authorization-service
  api:
    path: /auth

workload.yaml

Note the API path component in the above configurations. Our ecommerce platform specifies /ecommerce as its path, the payment-processing domain specifies /payment, and the authorization-service workload specifies /auth. The full route to hit the authorization-service would then be /ecommerce/payment/auth. We’ll explore API ingress and routing in more detail in a later post.

An Enterprise-Ready Workload Delivery Platform

We’ve looked at a few of the ways Konfig provides a compelling enterprise integration of GitLab and Google Cloud. It addresses a gap these products leave by not offering strong opinions to customers. Konfig allows us to package up the best practices and patterns for implementing a production-ready workload delivery platform and provide that missing opinionation. It tackles three competing priorities that arise when building software: security and governance, maintainable infrastructure, and speed to production. Konfig plays a strategic role in reducing the cost and improving the efficiency of cloud migration, modernization, and greenfield efforts. Reach out to learn more about Konfig or schedule a demo.

Follow @tyler_treat