Zero-Trust Security on GCP With Context-Aware Access

A lot of our clients at Real Kinetic leverage serverless on GCP to quickly build applications with minimal operations overhead. Serverless is one of the things that truly differentiates GCP from other cloud providers, and App Engine is a big component of this. Many of these companies come from an on-prem world and, as a result, tend to favor perimeter-based security models. They rely heavily on things like IP and network restrictions, VPNs, corporate intranets, and so forth. Unfortunately, this type of security model doesn’t always fit nicely with serverless due to the elastic and dynamic nature of serverless systems.

Recently, I worked with a client who was building an application for internal support staff on App Engine. They were using Identity-Aware Proxy (IAP) to authenticate users and authorize access to the application. IAP provides a fully managed solution for implementing a zero-trust access model for App Engine and Compute Engine. In this case, their G Suite user directory was backed by Active Directory, which allowed them to manage access to the application using Single Sign-On and AD groups.

Everything was great until the team hit a bit of a snag when they went through their application vulnerability assessment. Because it was for internal users, the security team requested the application be restricted to the corporate network. While I’m deeply skeptical of the value this adds in terms of security—the application was already protected by SSO and two-factor authentication and IAP cannot be bypassed with App Engine—I shared my concerns and started evaluating options. Sometimes that’s just the way things go in a larger, older organization. Culture shifts are hard and take time.

App Engine has firewall rules built in which allow you to secure incoming traffic to your application with allow/deny rules based on IP, so it seemed like an easy fix. The team would be in production in no time!

App Engine firewall rules

Unfortunately, there are some issues with how these firewall rules work depending on the application architecture. All traffic to App Engine goes through Google Front End (GFE) servers. This provides numerous benefits including TLS termination, DDoS protection, DNS, load balancing, firewall, and integration with IAP. It can present problems, however, if you have multiple App Engine services that communicate with each other internally. For example, imagine you have a frontend service which talks to a backend service.

App Engine does not provide a static IP address and instead relies on a large, dynamic pool of IP addresses. Two sequential outbound calls from the same application can appear to originate from two different IP addresses. One option is to allow all possible App Engine IPs, but this is riddled with issues. For one, Google uses netblocks that dynamically change and are encoded in Sender Policy Framework (SPF) records. To determine all of the IPs App Engine is currently using, you need to recursively perform DNS lookups by fetching the current set of netblocks and then doing a DNS lookup for each netblock. These results are not static, meaning you would need to do the lookups and update firewall rules continually. Worse yet, allowing all possible App Engine IPs would be self-defeating since it would be trivial for an attacker to work around by setting up their own App Engine application to gain access, assuming there isn’t any additional security beyond the firewall.
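To get a sense of how unwieldy this is, here is a rough sketch of what that recursive SPF crawl might look like in Python, assuming the dnspython package and the _cloud-netblocks.googleusercontent.com records Google publishes for this purpose. The results change over time, so something like this would have to run continually and push updated firewall rules each time.

import dns.resolver

def spf_tokens(name):
    """Return the include: targets and ip4:/ip6: ranges from an SPF TXT record."""
    includes, ranges = [], []
    for rdata in dns.resolver.resolve(name, "TXT"):
        txt = " ".join(part.decode() for part in rdata.strings)
        for token in txt.split():
            if token.startswith("include:"):
                includes.append(token[len("include:"):])
            elif token.startswith(("ip4:", "ip6:")):
                ranges.append(token.split(":", 1)[1])
    return includes, ranges

def crawl_netblocks(root="_cloud-netblocks.googleusercontent.com"):
    pending, seen, all_ranges = [root], set(), []
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        includes, ranges = spf_tokens(name)
        pending.extend(includes)
        all_ranges.extend(ranges)
    return all_ranges

print(crawl_netblocks())  # a moving target, which is exactly the problem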

Another, slightly better option is to set up a proxy on Compute Engine in the same region as your App Engine application. With this, you get a static IP address. The downside here is that it’s an additional piece of infrastructure that must be managed, which isn’t great when you’re shooting for a serverless architecture.

Luckily, there is a better solution—one that fits our serverless model and enables us to control external traffic while allowing App Engine services to securely communicate internally. IAP supports context-aware access, which allows enforcing granular access controls for web applications, VMs, and GCP APIs based on an end-user’s identity and request context. Essentially, context-aware access brings a richer zero-trust model to App Engine and other GCP services.

To restrict access by network with IAP, we first need to create an Access Level in the Access Context Manager. Access Levels are a way to add an extra layer of security based on request attributes such as IP address, region, time of day, or device. In the client’s case, they can create an Access Level that only allows access from their corporate network.

GCP Access Context Manager

We can then add the Access Level to roles that are assigned to users or groups in IAP. This means even if users are authenticated, they must be on the corporate network to access the application.

Cloud Identity-Aware Proxy roles

To allow App Engine services to communicate freely, we simply need to assign the IAP-secured Web App User role without the Access Level to the App Engine default service account. Services will then authenticate as usual using OpenID Connect without the added network restriction. The default service account is managed by GCP and there are no associated credentials, so this provides a solid security posture.
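To illustrate the service-to-service piece, here is a minimal sketch of a frontend service calling an IAP-protected backend, assuming the google-auth and requests libraries. The IAP client ID and backend URL are placeholders; in practice you would pull them from configuration.

import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

# Placeholders: the OAuth client ID shown on the backend's IAP settings page
# and the backend service's URL.
IAP_CLIENT_ID = "1234567890-example.apps.googleusercontent.com"
BACKEND_URL = "https://backend-dot-example-project.appspot.com/api/tickets"

def call_backend():
    # On App Engine, this mints an OIDC ID token for the default service
    # account with the IAP client ID as the audience.
    token = id_token.fetch_id_token(Request(), IAP_CLIENT_ID)
    resp = requests.get(BACKEND_URL,
                        headers={"Authorization": "Bearer {}".format(token)})
    resp.raise_for_status()
    return resp.json()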

Now, at this point, we’ve solved the IP firewall problem, but that’s not really in the spirit of zero-trust, right? Zero-trust is a security model built on the principle that organizations should not inherently trust anything inside or outside of their perimeters and should instead verify anything trying to connect to their systems. Having to connect to a VPN in order to access an application in the cloud is kind of a bummer, especially when the corporate VPN goes down. COVID-19 has made a lot of organizations feel this pain. Fortunately, Access Levels can be a lot smarter than simple lists of approved IP addresses. With the Cloud IAM Conditions Framework, we can even write custom rules to allow access based on URL path, resource type, or other request attributes.

At this point, I talked the client through the Endpoint Verification process and how we can shift away from a perimeter-based security model to a defense-in-depth, zero-trust model. Rather than requiring the end-user to be signed in from the corporate network, we can require them to be signed in from a trusted, corporate-owned device from anywhere. We can require that the device has a screen lock and is encrypted or has a minimum OS version.

With IAP and context-aware access, we can build layered security on top of applications and resources without the need for a VPN, while still centrally managing access. This can even extend beyond GCP to applications hosted on-prem or in other cloud platforms like AWS and Azure. Enterprises don’t have to move away from more traditional security models all at once. This pattern allows you to gradually shift by adding and removing Access Levels and attributes over time. Zero-trust becomes much easier to implement within large organizations when they don’t have to flip a switch.

Liftbridge 1.0

Liftbridge has evolved a lot since making the first commit in October 2017, but the vision has remained the same: provide a message-streaming solution with a focus on simplicity and usability. This is demonstrated through many of the design and implementation decisions. A few examples include the use of NATS as the messaging backbone, avoiding heavy dependencies on runtimes like the JVM and external coordination systems like ZooKeeper, compiling down to a small, single static binary, opting for a gRPC-based API, and relying on plain YAML configuration. Liftbridge is written in Go, and the code is structured with the hopes that it’s relatively easy for someone to hop in and contribute to the project.

The goal of Liftbridge is to bridge the gap between sophisticated but complex log-based messaging systems like Apache Kafka and Apache Pulsar and simpler, cloud-native solutions. If you’re not familiar with the project, the introduction post sheds some light. It’s been nearly two years since I open-sourced Liftbridge, and I’m pleased to announce the project has now reached a 1.0 release. In practical terms, what this means is that the API has reached a point of stability suitable for production use and will provide a backward-compatibility commitment going forward. Liftbridge will continue to follow a semantic versioning scheme.

A lot of great features have landed since the project was first conceived in 2016 and started in 2017—replication, log compaction and retention rules, stream partitioning, activity events, and stream pausing to name a few. An official Java client has been implemented and is quickly evolving. Python will follow shortly after. There’s also a lot of exciting stuff on the roadmap ahead including auto-pausing of sparsely used partitions, durable and fault-tolerant consumer groups, a better stream re-partitioning story, and broader client support.

If you’re already using Liftbridge today or are thinking about using it, I’d love to hear from you. Be sure to follow Liftbridge on Twitter and join the community Slack channel to stay up-to-date on the latest developments.

Digitally Transformed: Becoming a Technology Product Company

More and more established businesses are attempting to reinvent themselves as technology companies. At the heart of this is the digital transformation, a journey many organizations are undertaking in order to better compete and serve their customers. As a result, companies are pouring tons of cash into digital transformation strategies. For some, this means broader adoption of agile or DevOps practices. For others, it’s modernizing product offerings or moving to the cloud. Regardless of the changes, many are struggling to find success transforming themselves due to low throughput, quality issues, or failing to deliver the right thing at the right time. In a few cases, digital transformation has ended in outright disaster.

What is it that these companies are really after? To solve new problems in new ways through innovation? To more rapidly adapt to the changing market? To protect existing revenue? Any leader worth their salt will say all of these are important outcomes, so how do you even begin to make a “digital transformation” actionable? What are we transforming to? How do we know when we’ve arrived?

The reason so many digital transformations fail has to do with how IT is usually positioned within mature, established businesses. I believe what these companies are really after is not a digital transformation—whatever that might be—but rather an organizational one that radically changes the way the business operates. One that redefines what IT means in the context of building software. The technology is incidental to this cultural shift which involves the intersection of people, processes, and innovation. In order to be successful, these organizations need to become technology product companies.

The Genesis of IT

There is an inertia within organizations toward overvaluing tactics and undervaluing strategy. This is true not just of mature, established businesses but of all businesses, startups included. In fact, it’s this exact reason most startups fail. A lack of clear strategy and guiding vision precludes even the best execution from delivering success outside of the odd unicorn (after all, someone has to win the Powerball). Established businesses, however, already have a reliable cash flow engine to fall back on. There is much more margin for error when it comes to both strategy and execution, but this peacetime mentality is what leaves them open to disruption. Many leaders have begun to recognize this and act on it, falling right back to what they know best—tactics.

Why do companies and managers tend to bias towards tactics over strategy in software development? It comes back to the genesis of IT. Historically, IT was about managing computers, networks, email, phone systems, and other technical areas of the business. While this is still true today, the result of software eating the world has caused that scope to broaden significantly. But for mature, established businesses, IT has long been viewed as a cost center, and the mandate for an IT leader is cost minimization. This is in spite of the fact that the business has shifted away from humans, paper forms, and telephones to automation and software-based solutions. IT has always existed to support business operations, first by managing the technology the business depended on, now by building it. The only real change was IT transforming from a servant of the business to a partner of it.

Consequently, there are two key directives for a traditional IT organization: carry out the orders of the business and minimize cost. These goals inherently lead to a project mindset that is output- and task-oriented. Thus, IT has always been tactical and execution-minded in nature.

A Spotter’s Guide to Project-Minded IT

There are three ways to identify a project-minded IT organization. First, if both software engineers and more traditional IT roles like hardware support or help desk report up to a CIO, it’s likely a project-minded organization. In this case, it’s all just lumped into one group called “IT.”

This contrasts with product-minded companies, which place IT responsibilities under a CIO, whose directive is still cost minimization, and product development responsibilities under a CTO and/or CPO (Chief Product Officer), whose directive is strategic investment. There are two distinct groups, IT and Product Development or R&D. It’s more common to see CTOs or CPOs at newer, technology-first companies than it is at mature, established businesses since this requires a major realignment. This lack of realignment, however, is why we see many of the execution issues at companies attempting to “digitally transform” themselves.

Second, if there is a clear separation between IT or development and the business, there’s a good chance it’s a project-minded organization. This might be signaled by business partners, business analysts, or product owners who provide teams with implementation requirements and act as a backlog administrator. Developers might not have a good understanding of who their customers are or they view the business partner as the customer. This can also be signaled by frequently changing priorities, an ever-growing backlog of tasks, or unaddressed tech debt piling up. The team is typically not cross-functional, consisting only of developers and a business partner. Marty Cagan refers to these as delivery teams, and they are purely output-driven.

Alternatively, the team may be cross-functional with some form of designer (often oriented more towards UI than UX) and product manager, but it’s still governed by outputs. The product manager’s role is closer to that of a project manager armed with a product roadmap, and the closest thing developers have to product discovery is design and usability testing. Cagan refers to these as feature teams. Both delivery and feature teams exist to serve the business. These are the teams you’ll find at most companies building software.

At product-minded companies, teams are cross-functional with designers, UX, engineers, and product, and they are measured by outcomes, not outputs. This focus on outcomes means that the team is empowered to figure out the best way to solve the problems they’ve been asked to solve rather than being fed a list of features to build. These teams have an intimate understanding of their customers and interact with them regularly to perform product discovery and validate solutions. These are product teams in the truest sense but also quite rare.

The last way to spot a project-minded organization might be the most obvious. If the roadmap has a clear end point, it’s a project. Here, an IT organization treats building a software solution the same way it treats installing a new phone system. When the project is completed, teams or resources are reallocated to new projects and one of two things happens: it’s either dumped on another team to maintain and extend or no one sticks around to support it. The finished project languishes, or former developers are told to context switch to it reactively and at the whims of the business. Engineers are treated as interchangeable, and teams are not particularly durable or mission-driven but rather task-driven.

Product-minded companies instead embrace the virtues of minimum viable product, shipping incremental value, validating ideas, and iteration. The product manager provides a vision that unites the team in a common mission. Products are not “completed,” rather they grow and evolve. There is an emphasis on business outcomes over task outputs. Managers understand that teams are composed of people with diverse skills who are not easily fungible but who might be better suited to different phases of a product’s lifecycle. Members of a team might shift focus to other areas and priorities over time, but always in support of the team’s mission.

The Philosophical Dilemma of the Stoplight

A tactics-first mindset results in a propensity to treat software development like an assembly line. We can see this with the recent adoption of ideas from the Toyota Production System and lean manufacturing as it’s applied to software development. This emphasis on tactics causes managers to view product development as an optimization problem—if we just optimize the right set of tactics and practices, we can significantly improve throughput and quality at scale. This has led to the rise in packaged frameworks and processes like SAFe, LeSS, DAD, and Nexus as well as tactics like agile, pair programming, and test-driven development at large organizations.

The assembly-line mindset aims to take developers of arbitrary skill and background, run them through a prescribed process, and get high-quality, high-output results on the other end. I’ve never seen this deliver the desired outcomes in practice, at least not to the degree most leaders hope.

On the surface, mass production and software development share a lot of similarities. Both require quality standards, collaboration between groups of specialized workers, and repeatability. However, the reality is they are quite different from each other. A manufacturing assembly line is optimized to produce the exact same product over and over again, efficiently and reliably. Software products, especially Software as a Service, are heterogeneous. While we seek a process that produces consistent results, each product and situation is unique. Too prescriptive, and we end up with a rigid process that yields poor results and low throughput. Too unstructured, and we end up with inconsistent and unreliable output.

Our Head of Client Experience Mike Taylor refers to this as the Stoplight Problem. To demonstrate, ask a roomful of people what to do at each phase of a stoplight. On green, everyone says “Go.” On red, “Stop.” And on yellow? The answers vary—even more so with the introduction of flashing yellow lights. How close are we to the light? How fast are we traveling? Are the roads icy? What are the cars in front or behind us doing? What happens at a yellow light is entirely context-dependent and situational. It comes down to making informed choices in the moment without an authoritative, black-and-white determination.

Execution and delivery issues invariably come down to one thing: the yellow light. The green and red lights are binary indicators. There are clear right and wrong actions to take. These are things that can be taught and learned—where tactics matter—but the yellow light comes down to making good decisions. This is something organizations struggle with at scale: how do you trust your teams to make good decisions? Unable to answer that question, they end up making those decisions top-down in a command-and-control or assembly-line fashion. This is how organizations end up with delivery and feature teams. What’s needed is a sort of meta-process: a process for encouraging good decision making.

Empowered Product Teams

The emphasis on tactics isn’t limited to traditional project-minded IT organizations. Tactics are more visible and measurable. To a manager, tactics feel like work is happening, but they are rarely the difference maker for a company.

To illustrate, imagine handing out a bunch of axes to a group of people and telling them to go collect some wood. You might even teach them the proper technique for chopping down a tree. What happens next? Chaos. Confusion. A general sense of wandering in the woods. What kind of timber do we need? How much? What is it used for? How do we move it? Watching an army of people swinging axes is going to look like a lot of work is going on, but is it work that matters? You might follow people around, directing them where to go, which trees to cut down, and where to move them, but this won’t scale very well.

Without a guiding vision, we’re left with a bunch of people wandering in the woods swinging axes. Work happens, things get done—maybe even things that matter—but it’s haphazard and inefficient. More often than not, though, we’re always two weeks from completion because there isn’t clarity on where we’re trying to be. In agile terminology, we’re iterating to nowhere.

Our response might be to micromanage or implement the assembly-line process, turning our teams into feature factories. In my experience, this creates new challenges. In the first case, by grinding throughput to a halt, and in the second case, by failing to address the Stoplight Problem. The solution is a combination of vision, strategy, and execution.

A vision is a mental image of what the future could be like. It’s a grand and idealistic state, not something that can be achieved in a short amount of time. A shared vision empowers teams to make better decisions independently.

Strategy consists of a plan with decreasing fidelity. Some organizations attempt to plan 12 to 18 months out in a very waterfall-like fashion, and unless you’re sending a rocket into space, it just doesn’t work. A strategy is really a series of goals that get progressively fuzzier the further you go out. While a vision usually isn’t directly actionable, goals are both actionable and attainable in support of the overarching vision. We can break our strategy down into sets of three-month goals, which allows us to adjust course as needed. This is important since our goals are increasingly fuzzy. The key here is that strategy and goals are not dictated to teams. There needs to be give and take and dialog. OKRs can be a good tool for facilitating this.

At Real Kinetic, we hold quarterly leadership offsites to revisit our vision and strategy, course-correct, and ensure we have a general sense of alignment. We help our clients do the same within their product development organizations. The challenge with strategy is it looks like talking, while tactics look like working, even if it’s work that doesn’t truly move the needle. This is a cognitive bias leaders and managers should be aware of because it can trap us into focusing on tactics that aren’t framed by a clear vision and strategy.

Execution is all about hitting the goals we lay out in our strategy. This is where tactics come into play, but rather than providing teams with a list of features to implement or tasks to perform, we empower them to make good decisions. This is made possible by our guiding vision and cross-functional, mission-driven product teams. Our product manager is figuring out what lies ahead and helping plan the best course of action for realizing our vision. They are looking at value and business viability risks for the product. Our designer is looking at usability risks, and our tech lead is looking at feasibility, making estimations, and contributing to the strategy in order to avoid potential obstacles. You’ll notice that nowhere have we mentioned agile or scrum because these are specific tactics for managing execution. Together, the team is determining execution and discovering a solution that moves the business towards the ideal state set forth by its leadership.

Becoming a Technology Product Company

The struggle with digital transformation is it doesn’t get at the heart of the issue. It’s a tactical response to a tangible, yet ultimately inconsequential, part of the problem. The problem is not due to technology or innovation or particular tactics, it’s due to organizational alignment and execution deficiencies. Unfortunately, the former is more visible and more easily acted on than the latter.

The transformation that organizations are actually after is becoming a technology product company. This requires empowered product teams in combination with vision, strategy, and execution. Most companies focus on the execution because it’s easier, but it’s not sufficient. Empowered product teams require a shared vision that enables them to make good decisions without the need for an overly regimented or top-down process. This is the only effective way I’ve seen software companies scale throughput and quality. Don’t let your organization think it’s building a boulevard when it’s actually planting perennials next to potholes.

Real Kinetic helps clients build great product development organizations. Learn more about working with us.

Microservice Observability, Part 2: Evolutionary Patterns for Solving Observability Problems

In part one of this series, I described the difference between monitoring and observability and why the latter starts to become more important when dealing with microservices. Next, we’ll discuss some strategies and patterns for implementing better observability. Specifically, we’ll look at the idea of an observability pipeline and how we can start to iteratively improve observability in our systems.

To recap, observability can be described simply as the ability to ask questions of your systems without knowing those questions in advance. This requires capturing a variety of signals such as logs, metrics, and traces as well as tools for interpreting those signals like log analysis, SIEM, data warehouses, and time-series databases. A number of challenges surface as a result of this. Clint Sharp does a great job discussing the key problems, which I’ll summarize below along with some of my own observations.

Problem 1: Agent Fatigue

A typical microservice-based system requires a lot of different operational tooling—log and metric collectors, uptime monitoring, analytics aggregators, security scanners, APM runtime instrumentation, and so on. Most of these involve agents that run on every node in the cluster (or, in some cases, every pod in Kubernetes). Since vendors optimize for day-one experience and differentiating capabilities, they are incentivized to provide agents unique to their products rather than attempting to unify or standardize on tooling. This causes problems for ops teams who are concerned with the day-two costs of running and managing all of these different agents. Resource consumption alone can be significant, especially if you add a service mesh like Istio into the mix. Additionally, since each agent is unique, the way they are configured and managed is different. Finally, from a security perspective, every agent added to a system introduces additional attack surface to hosts in the cluster. Each agent brings not just the vendor’s code into production but also all of its dependencies.

Problem 2: Capacity Anxiety

With the elastic microservice architectures I described in part one, capacity planning for things like logs and metrics starts to become a challenge. This point is particularly salient if, for example, you’ve ever been responsible for managing Splunk licensing. With microservices, a new deployment can now cause a spike in log volumes forcing back pressure on your log ingestion across all of your services. I’ve seen Splunk ingestion get backed up for days’ worth of logs, making it nearly impossible to debug production issues when logs are needed most. I’ve seen Datadog metric ingestion grind to a halt after someone added a high-cardinality dimension to classify a metric by user. And I’ve seen security teams turn on cloud audit log exporting to their SIEM only to get flooded with low-level minutiae and noise. Most tools prioritize gross data ingestion over fine-grained control like sampling, filtering, deduplicating, and aggregating. Using collectors such as Fluentd can help with this problem but add to the first problem. Elastic microservice architectures tend to require more control over data ingestion to avoid capacity issues.

Problem 3: Foresight Required

Unlike monitoring, observability is about asking questions that we hadn’t planned to ask in advance, but we can’t ask those questions if the necessary data was never collected in the first place! The capacity problem described above might cause us to under-instrument our systems, especially when the value of logs is effectively zero—until it’s not. Between monitoring, debugging, security forensics, and other activities, effective operations requires a lot of foresight. Unfortunately, this foresight tends to come from hindsight, which might be too late depending on the situation. Most dashboards are operational scar tissue, after all. Adding or reconfiguring instrumentation after the fact can have significant lag time, which can be the difference between prolonged downtime or a speedy remediation. Elastic microservice architectures benefit greatly from the ability to selectively and dynamically dial up the granularity of operational data when it’s needed and dial it back down when it’s not.

Problem 4: Tooling and Data Accessibility

Because of the problems discussed earlier, it’s not uncommon for organizations to settle on a limited set of operations tools like logging and analytics systems. This can pose its own set of challenges, however, as valuable operational data becomes locked up within certain systems in production environments. Vendor lock-in and high switching costs can make it difficult to use the right tool for the job.

There’s a wide range of data sources that provide high-value signals such as VMs, containers, load balancers, service meshes, audit logs, VPC flow logs, and firewall logs. And there’s a wide range of sinks and downstream consumers that can benefit from these different signals. The problem is that tool and data needs vary from team to team. Different tools or products are needed for different data and different use cases. The data that operations teams care about is different from the data that business analysts, security, or product managers care about. But if the data is siloed based on form or function or the right tools aren’t available, it becomes harder for these different groups to be effective. There’s an ever-changing landscape of tools, products, and services—particularly in the operations space—so the question is: how big of a lift is it for your organization to add or change tools? How easy is it to experiment with new ones? In addition to the data siloing, the “agent fatigue” problem described above can make this challenging when re-rolling host agents at scale.

Solution: The Observability Pipeline

Solving these problems requires a solution that offers the following characteristics:

  1. Allows capturing arbitrarily wide events
  2. Consolidates data collection and instrumentation
  3. Decouples data sources from data sinks
  4. Supports input-to-output schema normalization
  5. Provides a mechanism to encode routing, filtering, and transformation logic

When we implement these different concepts, we get an observability pipeline—a way to unify the collection of operational data, shape it, enrich it, eliminate noise, and route it to any tool in the organization that can benefit from it. With input-to-output schema normalization, we can perform schema-agnostic processing to enrich, filter, aggregate, sample, or drop fields from any shape and adapt data for different destinations. This helps to support a wider range of data collectors and agents. And by decoupling sources and sinks, we can easily introduce or change tools and reroute data without impacting production systems.

We’re starting to see the commercialization of this idea with products like Cribl, but there are ways to solve some of these problems yourself, incrementally, and without the use of commercial software. The remainder of this post will discuss patterns and strategies for building your own observability pipeline. While the details here will be fairly high level, part three of this series will share some implementation details and tactics through examples.

Pattern 1: Structured Data

A key part of improving system observability is being more purposeful in how we structure our data. Specifically, structured logging is critical to supporting production systems and aiding debuggability. The last thing you want to be doing when debugging a production issue is frantically grepping log files trying to pull out needles from a haystack. In the past, logs were primarily consumed by human operators. Today, they are primarily consumed by tools. That requires some adjustments at design time. For example, if we were designing a login system, historically, we might have a logging statement that resembles the following:

log.error("User '{}' login failed".format(user))

This would result in a log message like:

ERROR 2019-12-30 09:28.31 User 'tylertreat' login failed

When debugging login problems, we’d probably use a combination of grep and regular expressions to track down the users experiencing issues. This might be okay for the time being, but as we introduce additional metadata, it becomes more and more kludgy. It also means our logs are extremely fragile. People begin to rely on the format of logs in ways that might even be unknown to the developers responsible for them. Unstructured logs become an implicit, undocumented API.

With structured logs, we make that contract more explicit. Our logging statement might change to something more like:

log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="tyler.treat@realkinetic.com",
          error=error)

The actual format we use isn’t hugely important. I typically recommend JSON because it’s ubiquitous and easy to write and parse. With JSON, our log looks something like the following:

{
    "timestamp": "2019-12-30 09:28.31",
    "level": "ERROR",
    "event": "user_login_error",
    "user": "tylertreat",
    "email": "tyler.treat@realkinetic.com",
    "error": "Invalid username or password",
    "message": "User login failed"
}

With this, we can parse the structure, index it, query it, even transform or redact it, and we can add new pieces of metadata without breaking consumers. Our logs start to look more like events. Remember, observability is about being able to ask arbitrary questions of our systems. Events are like logs with context, and shifting towards this model helps with being able to ask questions of our systems.

Pattern 2: Request Context and Tracing

With elastic microservice architectures, correlating events and metadata between services becomes essential. Distributed tracing is one component of this. Another is tying our structured logs together and passing shared context between services as a request traverses the system. A pattern that I recommend to teams adopting microservices is to pass a context object to everything. This is actually a pattern that originated in Go for passing request-scoped values, cancelation signals, and deadlines across API boundaries. It turns out, this is also a useful pattern for observability when extended to service boundaries. While explicitly passing context objects is contentious because of how it intrudes on APIs, I find it better than relying on implicit, request-local storage.

In its most basic form, a context object is simply a key-value bag that lets us track metadata as a request passes through a service and is persisted through the entire execution path. OpenTracing refers to this as baggage. You can include this context as part of your structured logs. Some suggest having a single event/structured-log-with-context emitted per hop, but I think this is more aspirational. For most, it’s probably easier to get started by adding a context object to your existing logging. Our login system’s logging from above would look something like this:

def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

This adds rich metadata to our logs—great for debugging—as they start evolving towards events. The context is also a convenient way to propagate tracing information, such as a span ID, between services.

{
    "timestamp": "2019-12-30 09:28.31",
    "level": "ERROR",
    "event": "user_login_error",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "http_method": "POST",
        "http_path": "/login",
        "user": "tylertreat",
        "email": "tyler.treat@realkinetic.com",
        "span_id": "34fe6cbf9556424092fb230eab6f4ea6"
    },
    "error": "Invalid username or password",
    "message": "User login failed"
}

You might be wondering what to put on the context versus just putting on our structured logs. It’s a good question and, like most things, the answer is “it depends.” A good rule of thumb is what can you get for “free” and what do you need to pass along? These should typically be things specific to a particular request. For instance, CPU utilization and memory usage can be pulled from the environment, but a user or correlation ID are request-specific and must be propagated. This decision starts to become more obvious the deeper your microservice architectures get. Just be careful not to leak sensitive data into your logs! While we can introduce tooling into our observability pipeline to help with this risk, I believe code reviews are the best line of defense here.

Pattern 3: Data Schema

With our structured data and context, we can take it a step further and introduce schemas for each data type we collect, such as logs, metrics, and traces. Schemas provide a standard shape to the data and allow consumers to rely on certain fields and types. They might validate data types and enforce required fields like a user ID, license, or trace ID. These schemas basically take the explicit contract described above and codify it into a specification. This is definitely the most organization-dependent pattern, so it’s hard to provide specific advice. The key thing is having structured data that can be easily evolved and relied on for debugging or exploratory purposes.

These schemas also need libraries which implement the specifications and make it easy for developers to actually instrument their systems. There is a plethora of existing libraries available for structured logging. For tracing and metrics, OpenTelemetry has emerged as a vendor-neutral API and forthcoming data specification.
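Because this is so organization-dependent, treat the following as a rough sketch rather than a prescribed spec: a hand-rolled check that codifies a log-event schema in Python, with illustrative field names borrowed from the examples above.

REQUIRED_FIELDS = {
    "timestamp": str,
    "level": str,
    "event": str,
    "message": str,
    "context": dict,
}

def validate_event(event):
    """Return a list of schema violations; an empty list means the event conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append("missing required field: {}".format(field))
        elif not isinstance(event[field], expected_type):
            problems.append("field {} should be a {}".format(field, expected_type.__name__))
    if "span_id" not in event.get("context", {}):
        problems.append("context is missing span_id")  # needed to correlate with traces
    return problems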

Pattern 4: Data Collector

So far, we’ve talked mostly about development practices that improve observability. While they don’t directly address the problems described above, later, we’ll see how they also help support other parts of the observability pipeline. Now we’re going to look at some actual infrastructure patterns for building out a pipeline.

Recall that two of the characteristics we desire in our observability solution are the ability to consolidate data collection and instrumentation and decouple data sources from data sinks. One of the ways we can reduce agent fatigue is by using a data collector to unify the collection of key pieces of observability data—namely logs (or events), metrics, and traces. This component collects the data, optionally performs some transformations or filtering on it, and writes it to a data pipeline. This commonly runs as an agent on the host. In Kubernetes, this might be a DaemonSet with an instance running on each node. From the application or container side, data is written to stdout/stderr or a Unix domain socket which the collector reads. From here, the data gets written to the pipeline, which we’ll look at next.

Moving data collection out of process can be important if your application emits a significant amount of logs or you’re doing anything at a large enough scale. I’ve seen cases where applications were spending more time writing logs than performing actual business logic. Writing logs to disk can easily take down a database or other I/O-intensive workload just by sharing a filesystem with its logging. Rather than sacrificing observability by reducing the volume and granularity of logs, offload it and move it out of the critical execution path. Logging can absolutely affect the performance and reliability of your application.
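In practice, keeping the application side this simple often just means writing structured events to stdout and letting the collector handle everything else. Here is a minimal sketch, with an illustrative emit helper:

import json
import sys
import time

def emit(event, **fields):
    # The service just writes JSON lines to stdout; the collector agent on the
    # node picks them up and forwards them to the pipeline.
    record = {"timestamp": time.time(), "event": event}
    record.update(fields)
    sys.stdout.write(json.dumps(record) + "\n")

emit("user_login_error", user="tylertreat", error="Invalid username or password")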

For this piece, I generally recommend using either Fluentd or Logstash along with the Beats ecosystem. I usually avoid putting too much logic into the data collector due to the way it runs distributed and at scale. If you put a lot of processing logic here, it can become difficult to manage and evolve. I find it works better to have the collector act as a dumb pipe for getting data into the system where it can be processed offline.

Pattern 5: Data Pipeline

Now that we have an agent running on each host collecting our structured data, we need a scalable, fault-tolerant data stream to handle it all. Even at modestly sized organizations, I’ve seen upwards of 1TB of logs indexed daily with elastic microservice architectures. This volume can be much greater for larger organizations, and it can burst dramatically with the introduction of new services. As a result, decoupling sources and sinks becomes important for reducing capacity anxiety. This data pipeline is often something that can be partitioned for horizontal scalability. In doing this, we might just end up shifting the capacity anxiety from one system to another, but depending on the solution, this can be an easier problem to solve or might not be a problem at all if using a managed cloud service. Finally, a key reason for decoupling is that it also allows us to introduce or change sinks without impacting our production cluster. A benefit of this is that we can also evaluate and compare tools side-by-side. This helps reduce switching costs.

There are quite a few available solutions for this component, both open source and managed. On the open source side, examples include Apache Kafka, Apache Pulsar, and Liftbridge. On the cloud-managed services side, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs come to mind. I tend to prefer managed solutions since they allow me to focus on things that directly deliver business value rather than surrounding operational concerns.

Note that there are some important nuances depending on the pipeline implementation you use, and these might even determine which implementation you choose. For example: how long do you need to retain observability data? Do you need the ability to replay data streams? Do you need strict, in-order delivery of messages? Replaying operational data can be useful for retraining ML models or testing monitoring changes, for instance. For systems that are explicitly sharded, there’s also the question of how to partition the data. Random partitioning is usually easiest from a scaling and operations perspective, but it largely depends on how you intend to consume it.
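As a simplified sketch of what the hand-off into a managed pipeline can look like, here is publishing observability events to Google Cloud Pub/Sub using the google-cloud-pubsub client. The project and topic names are placeholders, and partitioning here is effectively random.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "observability-events")  # placeholders

def publish_event(event):
    # Events are serialized as JSON; an attribute carries the event type so
    # downstream routers can dispatch without parsing the payload.
    data = json.dumps(event).encode("utf-8")
    future = publisher.publish(topic_path, data=data,
                               event_type=event.get("event", "unknown"))
    return future.result()  # blocks until the publish is acknowledged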

Pattern 6: Data Router

The last pattern and component of our observability pipeline is the data router. With our operational data being written to a pipeline such as Kafka, we need something that can consume it, perform processing, and write it to various backend systems. This is also a great place to perform dynamic sampling, filtering, deduplication, aggregation, or data enrichment. The schema mentioned earlier becomes important here since the shape of the data determines how it gets handled. If you’re dealing with data from multiple sources, you’ll likely need to normalize to some common schema, either at ingestion time or processing time, in order to execute shared logic and perform schema-agnostic processing. Data may also need to be reshaped before writing to destination systems.

This piece can be as sophisticated or naive as you’d like, depending on your needs or your organization’s observability and operations maturity. A simple example is merely looking at the record type and sending logs to Splunk and Amazon Glacier cold storage, sending traces to Stackdriver, sending metrics to Datadog, and sending high-cardinality events to Honeycomb. More advanced use cases might involve dynamic sampling to dial up or down the granularity on demand, dropping values to reduce storage consumption or eliminate noise, masking values to implement data loss prevention, or joining data sources to create richer analytics.
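The naive version of that routing logic is little more than a dispatch on the record’s shape. In the sketch below, the send_to_* functions are hypothetical placeholders for real sink integrations; the point is the structure, not the clients.

import random

def route(record):
    kind = record.get("type")
    if kind == "log":
        send_to_splunk(record)
        send_to_cold_storage(record)  # e.g. an archival bucket
    elif kind == "trace":
        send_to_stackdriver(record)
    elif kind == "metric":
        send_to_datadog(record)
    elif kind == "event":
        # Dynamic sampling: keep only a fraction of noisy, low-severity events.
        if record.get("level") == "DEBUG" and random.random() > 0.1:
            return
        send_to_honeycomb(record)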

Ultimately, this is a glue component that’s reading data in, parsing the shape of it, and writing it out to assorted APIs or other topics/streams for further downstream processing. Depending on the statefulness of your router logic, this can be a good fit for serverless solutions like AWS Lambda, Google Cloud Functions, Google Cloud Run, Azure Functions, or OpenFaaS. If using Kafka, Kafka Streams might be a good fit.

The Journey to Better Observability

Observability with elastic microservice architectures introduces some unique challenges like agent fatigue, capacity anxiety, required foresight, and tooling and data accessibility. Solving these problems requires a solution that can capture arbitrarily wide events, consolidate data collection and instrumentation, decouple data sources and sinks, support input-to-output schema normalization, and encode routing, filtering, and transformation logic. When we implement this, we get an observability pipeline, which is really just a fancy name for a collection of observability patterns and best practices.

An observability pipeline should be an evolutionary or iterative process. You shouldn’t waste time building out a sophisticated pipeline early on; you should be focused on delivering value to your customers. Instead, start small with items that add immediate value to the observability of your systems.

Something you can begin doing today that adds a ton of value with minimal lift is structured logging. Another high-leverage pattern is passing a context object throughout your service calls to propagate request metadata which can be logged and correlated. Use distributed tracing to understand and identify issues with performance. Next, move log collection out of process using Fluentd or Logstash. If you’re not already, use a centralized logging system—Splunk, Elasticsearch, Sumo Logic, Graylog—there are a bunch of options here, both open source and commercial, SaaS or self-managed. With the out-of-process collector, you can then introduce a data pipeline to decouple log producers from consumers. Again, there are managed options like Amazon Kinesis or Google Cloud Pub/Sub and self-managed ones like Apache Kafka. With this, you can now add, change, or compare consumers and log sinks without impacting production systems. Evaluate a product like Honeycomb for storing high-cardinality events. At this point, you can start to unify the collection of other instrumentation such as metrics and traces and evolve your logs towards context-rich events.

Each of these things will incrementally improve the observability of your systems and can largely be done in a stepwise fashion. Whether you’re just beginning your transition to microservices or have fully adopted them, the journey to better observability doesn’t have to require a herculean effort. Rather, it’s done one step at a time.

In part three of this series, I’ll demonstrate a few implementation details through examples to show some of these observability patterns in practice.

Microservice Observability, Part 1: Disambiguating Observability and Monitoring

“Pets versus cattle” has become something of a standard vernacular for describing the shift in how we build systems. It alludes to the elastic and dynamic nature of these (typically, but not necessarily) container-based systems with on-demand scaling and more transparent fault-tolerance. I’ve talked about this transition before, specifically how it relates to monitoring. In particular, with these more dynamic, microservice-based systems, the conversation starts to shift away from traditional monitoring toward observability. In this series, I’ll describe that distinction, explain why it matters, and share some concrete tactical items for implementing observability in a microservice environment.

In the past, I’ve used the term “cloud-native” to describe these types of systems, but this buzzword has conflated so many different concepts that it’s been relegated to the likes of “DevOps”—entirely arbitrary and context-dependent. Depending on who you ask, cloud-native means containers, microservices, Kubernetes, elasticity, serverless, automation, or any number of other ideas. The truth, however, is that you can do many of these things on-prem just as much as in the cloud, the difference being largely CapEx versus OpEx. I think the spirit of “cloud-native” really just means architecting systems to take advantage of cloud capabilities, namely higher-level managed services (which may not even have on-prem equivalents), improved elasticity and fault-tolerance (which may or may not mean containers), and reduced operations investment (in part by leveraging managed services).

Because there are so many confounding and interrelated-yet-different ideas, I’m going to focus this discussion on elastic microservice architectures. Elastic meaning services that automatically scale up and down as needed (in contrast to static infrastructures), and microservice simply meaning applications comprised of many different—usually smaller—services (in contrast to monoliths or systems comprising just a few coarse-grained services).

Static Monolithic Architectures

With static monolithic architectures, monitoring is a reasonably well-understood problem. With a monolith, the system is typically in one of two states, up or down, and we can conceivably correlate this to customer impact. Bugs aside, when the monolith is down, we likely have a good idea of how this behavior manifests itself to the user. We can set up Nagios checks and get some meaningful signals out of it. Uptime is mostly a single data point.

With a monolith, it’s not unreasonable for ops teams to manage the day-to-day operations of the system and do so effectively. These teams tend to quickly develop a good intuition and “muscle memory” for the application when it’s the only thing they are responsible for, especially when it’s a single deployable unit. Logs can be grepped from a single log file, and if something is wrong with the application, operators might simply SSH into the box to poke at it. Runbooks and standard operating procedures are also common here.

With a monolith, we likely have a single runtime such as the JVM, which makes it easier to collect rich telemetry in a centralized way, all the way down to the code level. Tools like Dynatrace and AppDynamics can instrument the JVM itself to collect information on busy and idle threads, garbage collection stats, and request metrics. And because we have just a single deployed artifact running on a handful of static servers, this data can actually be useful and correlated back to customer impact and business metrics.

Elastic Microservice Architectures

With elastic microservice architectures, things start to change dramatically. Applications consist of dozens of different microservices. The system is no longer in one of two states but more like one of n-factorial states. In reality, it’s much more because in production you might have different versions of the same service running at the same time as you introduce more sophisticated deployment strategies and rollbacks. Integration testing can’t possibly account for all of these combinations. We can no longer easily correlate system behavior to actual customer impact because system behavior is much more emergent. It can be difficult to pinpoint how the behavior of a given service affects the user’s experience as the system operates in varying states of partial failure and services interact in unique ways. If it’s slow, which part is slow? The frontend service? An upstream service? The database? Some combination of these? Uptime is no longer a single data point but rather a composite of many different data points, but more importantly, what does “up” even mean in the context of a complex microservice architecture?

With microservices, it becomes intractable for a single ops team to manage dozens of heterogeneous services in anything beyond a first-responder, incident-router capacity. There is too much context and specific knowledge needed since microservices are literally the embodiment of the specialization of teams.

With microservices, it’s no longer practical or even feasible to grep log files or SSH into the box to debug a problem. There might not even be a box to SSH into if it’s a container that has since been descheduled or a managed serverless runtime. With heterogeneous services, we might have half a dozen languages and runtimes to support, each with differing types of runtime instrumentation. Moreover, because we now have dozens or even hundreds of nodes running many different instances of our services, the value of this low-level, summarized data starts to diminish. It makes for pretty dashboards and can help in answering very specific, predefined questions, but that’s about it. It’s no use for proactive monitoring because it’s too much noise, and it’s no use for reactive debugging because it’s pre-aggregated. There’s not much you can do when all you have are rolled-up time-series metrics, and it’s just as difficult to correlate this data back to customer impact.

Monitoring and Observability

With a complex system, relying on this type of data along with logs can often lead to a dead end when tracking down a particularly insidious bug. And this is where observability comes into play. It picks up where monitoring leaves off.

While monitoring and observability have been getting conflated a lot lately, there’s actually an important distinction to make. Monitoring tends to focus on the overall health of the system and on business metrics—questions we know to ask in advance. Observability is about providing more granular insights into the behavior of systems along with richer context. It’s the difference between “post hoc” and “ad hoc.”

Think of this in terms of the classic quadrant of knowns and unknowns. In the top-right corner, we have known knowns. These are things we understand well and have a large amount of data on, i.e. the things we are aware of and understand. For example, “the system has a 1GB memory limit.” As the designers of this system, this is something that we’re acutely aware of and understand. We know that we know how much memory the system can use before it moves outside of its operating boundaries and bad things happen.

In the bottom-right corner, we have known unknowns. These are things we are generally aware of but don’t necessarily understand. For example, “the system exceeded its memory limit and crashed, causing an outage.” As system designers, memory usage is something we know is important and affects system behavior. We can monitor it in production in order to gather lots of data on it, but just having that data often doesn’t help us to understand why memory is being consumed or even how that data manifests itself as system behavior.

In the top-left corner, we have unknown knowns, which are things we understand but are not completely aware of. This sounds like a strange, almost oxymoron-like categorization, but it’s basically the things that are gut instinct or intuition. It’s often things we know or think we know without even consciously realizing it. For example, “we implemented an orchestrator to ensure the system is always running.” Intuition tells us that if the process isn’t running, the system isn’t available, so we make sure that it gets restarted when something goes wrong. We might, however, be unaware of the unintended side effects of this decision, and it might be based more on theory and conjecture than data.

Which leads us to the bottom-left corner: unknown unknowns. These are the things we are neither aware of nor understand. The events we can’t even predict or foresee happening because if we could foresee them, they wouldn’t be unknown unknowns, they’d be known unknowns. For example, “instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns.” This was an unforeseen consequence of our orchestrator implementation. As a result, we could not have tested for it or looked for it with our monitoring tools. Instead, it’s something that happens, we learn from it, and quickly classify it as a known unknown—something we know to look for going forward.

In a sense, the known knowns are facts, the known unknowns are hypotheses, the unknown knowns are assumptions, and the unknown unknowns are discoveries. Through this lens, the distinction between observability and monitoring becomes clear. Monitoring is about testing hypotheses and observability is about exploring new discoveries. We monitor known unknowns because these are the things we know to look for, but unknown unknowns are, by definition, unpredictable. We cannot monitor them because we do not know to even look for them in the first place! Instead, we ask questions of our systems in order to understand and categorize these unknown unknowns. Observability is the ability to interrogate our systems after the fact in a data-rich, high-fidelity way. Monitoring, on the other hand, is before the fact and much lower fidelity. These are the dashboards and alerts we set up which usually consist of pre-aggregated metrics. This is what I mean by post hoc versus ad hoc. Observability allows us to ask arbitrary questions of our systems, not questions predefined in advance.

With this definition, monitoring is a subset of observability, and observability encompasses many different types of data. For example, things like distributed traces, application logs, system logs, audit logs, and application metrics are all important observability signals. But when we boil it all down, it turns out everything is really just events, which we want to view through different lenses. Some of this data provides context for the event itself, such as logs and metrics, and some of it describes relationships between events, such as traces. It’s important we have a way to collect all this context and store it such that we can query and analyze it using these different lenses. Aggregated metrics alone aren’t enough—they lack the granularity and context needed. Dashboards are simply answers to specific questions known in advance. Observability needs to go much deeper than this.

In part two of this series, we’ll revisit the concept of an observability pipeline as a tactical approach to implementing observability in a microservice environment. As part of this, we’ll discuss some steps that can be taken to incrementally improve observability while iterating toward this pattern.