software engineering – Brave New Geek

April 17, 2024April 17, 2024

Security, Maintainability, Velocity: Choose One

There are three competing priorities that companies have as it relates to software development: security, maintainability, and velocity. I’ll elaborate on what I mean by each of these in just a bit. When I originally started thinking about this, I thought of it in the context of the “good, fast, cheap: choose two” project management triangle. But after thinking about it for more than a couple minutes, and as I related it to my own experience and observations at other companies, I realized that in practice it’s much worse. For most organizations building software, it’s more like security, maintainability, velocity: choose one.

Of course, most organizations are not explicitly making these trade-offs. Instead, the internal preferences and culture of the company reveal them. I believe many organizations, consciously or not, accept this trade-off as an immovable constraint. More risk-averse groups might even welcome it. Though the triangle most often results in a “choose one” sort of compromise, it’s not some innate law. You can, in fact, have all three with a little bit of careful thought and consideration. And while reality is always more nuanced than what this simple triangle suggests, I find looking at the extremes helps to ground the conversation. It emphasizes the natural tension between these different concerns. Bringing that tension to the forefront allows us to be more intentional about how we manage it.

It wasn’t until recently that I distilled down these trade-offs and mapped them into the triangle shown above, but we’ve been helping clients navigate this exact set of competing priorities for over six years at Real Kinetic. We built Konfig as a direct response to this since it was such a common challenge for organizations. We’re excited to offer a solution which is the culmination of years of consulting and which allows organizations to no longer compromise, but first let’s explore the trade-offs I’m talking about.

Security

Companies, especially mid- to large- sized organizations, care a great deal about security (and rightfully so!). That’s not to say startups don’t care about it, but the stakes are just much higher for enterprises. They are terrified of being the next big name in the headlines after a major data breach or ransomware attack. I call this priority security for brevity, but it actually consists of two things which I think are closely aligned: security and governance.

Governance directly supports security in addition to a number of other concerns like reliability, risk management, and compliance. This is sometimes referred to as Governance, Risk, and Compliance or GRC. Enterprises need control over, and visibility into, all of the pieces that go into building and delivering software. This is where things like SDLC, separation of duties, and access management come into play. Startups may play it more fast and loose, but more mature organizations frequently have compliance or regulatory obligations like SOC 2 Type II, PCI DSS, FINRA, FedRAMP, and so forth. Even if they don’t have regulatory constraints, they usually have a reputation that needs to be protected, which typically means more rigid processes and internal controls. This is where things can go sideways for larger organizations as it usually leads to practices like change review boards, enterprise (ivory tower) architecture programs, and SAFe. Enterprises tend to be pretty good at governance, but it comes at a cost.

It should come as no surprise that security and governance are in conflict with speed, but they are often in contention with well-architected and maintainable systems as well. When organizations enforce strong governance and security practices, it can often lead to developers following bad practices. Let me give an example I have seen firsthand at an organization.

A company has been experiencing stability and reliability issues with its software systems. This has caused several high-profile, revenue-impacting outages which have gotten executives’ attention. The response is to implement a series of process improvements to effectively slow down the release of changes to production. This includes a change review board to sign-off on changes going to production and a production gating process which new workloads going to production must go through before they can be released. The hope is that these process changes will reduce defects and improve reliability of systems in production. At this point, we are wittingly trading off velocity.

What actually happened is that developers began batching up more and more changes to get through the change review board which resulted in “big bang” releases. This caused even more stability issues because now large sets of changes were being released which were increasingly complex, difficult to QA, and harder to troubleshoot. Rollbacks became difficult to impossible due to the size and complexity of releases, increasing the impact of defects. Release backlogs quickly grew, prompting developers to move on to more work rather than sit idle, which further compounded the issue and led to context switching. Decreasing the frequency of deployments only exacerbated these problems. Counterintuitively, slowing down actually increased risk.

To avoid the production gating process, developers began adding functionality to existing services which, architecturally speaking, should have gone into new services. Services became bloated grab bags of miscellaneous functionality since it was easier to piggyback features onto workloads already in production than it was to run the gauntlet of getting a new service to production. These processes were directly and unwittingly impacting system architecture and maintainability. In economics, this is called a “negative externality.” We may have security and governance, but we’ve traded off velocity and maintainability. Adding insult to injury, the processes were not even accomplishing the original goal of improving reliability, they were making it worse!

Maintainability

It’s critical that software systems are not just built to purpose, but also built to last. This means they need to be reliable, scalable, and evolvable. They need to be conducive to finding and correcting bugs. They need to support changing requirements such that new features and functionality can be delivered rapidly. They need to be efficient and cost effective. More generally, software needs to be built in a way that maximizes its useful life.

We simply call this priority maintainability. While it covers a lot, it can basically be summarized as: is the system architected and implemented well? Is it following best practices? Is there a lot of tech debt? How much thought and care has been put into design and implementation? Much of this comes down to gut feel, but an experienced engineer can usually intuit whether or not a system is maintainable pretty quickly. A good proxy can often be the change fail rate, mean time to recovery, and the lead time for implementing new features.

Maintainability’s benefits are more of a long tail. A maintainable system is easier to extend and add new features later, easier to identify and fix bugs, and generally experiences fewer defects. However, the cost for that speed is basically frontloaded. It usually means moving slower towards the beginning while reaping the rewards later. Conversely, it’s easy to go fast if you’re just hacking something together without much concern for maintainability, but you will likely pay the cost later. Companies can become crippled by tech debt and unmaintained legacy systems to the point of “bankruptcy” in which they are completely stuck. This usually leads to major refactors or rewrites which have their own set of problems.

Additionally, building systems that are both maintainable and secure can be surprisingly difficult, especially in more dynamic cloud environments. If you’ve ever dealt with IAM, for example, you know exactly what I mean. Scoping identities with the right roles or permissions, securely managing credentials and secrets, configuring resources correctly, ensuring proper data protections are in place, etc. Misconfigurations are frequently the cause of the major security breaches you see in the headlines. The unfortunate reality is security practices and tooling lag in the industry, and security is routinely treated as an afterthought. Often it’s a matter of “we’ll get it working and then we’ll come back later and fix up the security stuff,” but later never happens. Instead, an IAM principal is left with overly broad access or a resource is configured improperly. This becomes 10x worse when you are unfamiliar with the cloud, which is where many of our clients tend to find themselves.

Velocity

The last competing priority is simply speed to production or velocity. This one probably requires the least explanation, but it’s consistently the priority that is sacrificed the most. In fact, many organizations may even view it as the enemy of the first two priorities. They might equate moving fast with being reckless. Nonetheless, companies are feeling the pressure to deliver faster now more than ever, but it’s much more than just shipping quickly. It’s about developing the ability to adapt and respond to changing market conditions fast and fluidly. Big companies are constantly on the lookout for smaller, more nimble players who might disrupt their business. This is in part why more and more of these companies are prioritizing the move to cloud. The data center has long been their moat and castle as it relates to security and governance, however, and the cloud presents a new and serious risk for them in this space. As a result, velocity typically pays the price.

As I mentioned earlier, velocity is commonly in tension with maintainability as well, it’s usually just a matter of whether that premium is frontloaded or backloaded. More often than not, we can choose to move quickly up front but pay a penalty later on or vice versa. Truthfully though, if you’ve followed the DORA State of DevOps Reports, you know that a lot of companies neither frontload nor backload their velocity premium—they are just slow all around. These are usually more legacy-minded IT shops and organizations that treat software development as an IT cost center. These are also usually the groups that bias more towards security and governance, but they’re probably the most susceptible to disruption. “Move fast and break things” is not a phrase you will hear permeating these organizations, yet they all desire to modernize and accelerate. We regularly watch these companies’ teams spend months configuring infrastructure, and what they construct is often complex, fragile, and insecure.

Choose Three

Businesses today are demanding strong security and governance, well-structured and maintainable infrastructure, and faster speed to production. The reality, however, is that these three priorities are competing with each other, and companies often end up with one of the priorities dominating the others. If we can acknowledge these trade-offs, we can work to better understand and address them.

We built Konfig as a solution that tackles this head-on by providing an opinionated configuration of Google Cloud Platform and GitLab. Most organizations start from a position where they must assemble the building blocks in a way that allows them to deliver software effectively, but their own biases result in a solution that skews one way or the other. Konfig instead provides a turnkey experience that minimizes time-to-production, is secure by default, and has governance and best practices built in from the start. Rather than having to choose one of security, maintainability, and velocity, don’t compromise—have all three. In a follow-up post I’ll explain how Konfig addresses concerns like security and governance, infrastructure maintainability, and speed to production in a “by default” way. We’ll see how IAM can be securely managed for us, how we can enforce architecture standards and patterns, and how we can enable developers to ship production workloads quickly by providing autonomy with guardrails and stable infrastructure.

Follow @tyler_treat

February 19, 2024February 19, 2024

Choosing Good SLIs

Transitioning from an on-prem environment to a cloud environment involves a lot of major shifts for organizations. One of those shifts is often around how we monitor the overall health of systems. The typical way to measure things like the availability, reliability, and performance of systems is with SLIs or Service Level Indicators. SLIs are a valuable tool both on-prem and in the cloud, but when it comes to the latter, I often see organizations carrying over some operational anti-patterns from their data center environment.

Unlike public clouds, data centers are often resource-constrained. Services run on dedicated sets of VMs and it can take days or weeks for new physical servers to be provisioned. Consequently, it’s common for organizations to closely monitor metrics such as CPU utilization, memory consumption, disk space, and so forth since these are all precious resources within a data center.

Often what happens is that ops teams get really good at identifying and pattern-matching the common issues that arise in their on-prem environment. For instance, certain applications may be prone to latency issues. Each time we dig into a latency issue we find that the problem is due to excessive garbage collection pauses. As a result, we define a metric around garbage collection because it is often an indicator of performance problems in the application. In practice, this becomes an SLI, whether it’s explicitly defined as such or not, because there is some sort of threshold beyond which garbage collection is considered “excessive.” We begin watching this metric closely to gauge whether the service is healthy or not and alerting on it.

The cloud is a very different environment than on-prem. Whether we’re using an orchestrator such as Kubernetes or a serverless platform, containers are usually ephemeral and instances autoscale up and down. If an instance runs out of memory, it will just get recycled. This is why we sometimes say you can “pay your way out” of a problem in these environments because autoscaling and autohealing can hide a lot of application issues such as a slow memory leak. In an on-prem environment, these can be significantly more impactful. The performance profile of applications often looks quite differently in the cloud than on-prem as well. Underlying hardware, tenancy, and networking characteristics differ considerably. All this is to say, things look and behave quite differently between the two environments, so it’s important to reevaluate operational practices as well. With SLIs and monitoring, it’s easy to bias toward specific indicators from on-prem, but they might not translate to more cloud-native environments.

User-centric monitoring

So how do we choose good SLIs? The key question to ask is: what is the customer’s experience like? Everything should be driven from this. Is the application responding slowly? Is it returning errors to the user? Is it returning bad or incorrect results? These are all things that directly impact the customer’s experience. Conversely, things that do not directly impact the customer’s experience are questions such as what is the CPU utilization of the service? The memory consumption? The rate of garbage collection cycles? These are all things that could impact the customer’s experience, but without actually looking from the user’s perspective, we simply don’t know whether they are or not. Rather, they are diagnostic tools that—once an issue is identified—can help us to better understand the underlying cause.

Take, for example, the CPU and memory utilization of processes on your computer. Most people probably are not constantly watching the Activity Monitor on their MacBook. Instead, they might open it up when they notice their machine is responding slowly to see what might be causing the slowness.

Three key metrics

When it comes to monitoring services, there are really three key metrics that matter: traffic rate, error rate, and latency. These three things all directly impact the user’s experience.

Traffic Rate

Traffic rate, which is usually measured in requests or queries per second (qps), is important because it tells us if something is wrong upstream of us. For instance, our service might not be throwing any errors, but if it’s suddenly handling 0 qps when it ordinarily is handling 80-100 qps, then something happened upstream that we should know about. Perhaps there is a misconfiguration that is preventing traffic from reaching our service, which almost certainly impacts the user experience.

Error Rate

Error rate simply tells us the rate in which the service is returning errors to the client. If our service normally returns 200 responses but suddenly starts returning 500 errors, we know something is wrong. This requires good status code hygiene to be effective. I’ve encountered codebases where various types of error codes are used to indicate non-error conditions which can add a lot of noise to this type of SLI. Additionally, this metric might be more fine-grained than just “error” or “not error”, since—depending on the application—we might care about the rate of specific 2xx, 4xx, or 5xx responses, for example.

It’s common for teams to rely on certain error logs rather than response status codes for monitoring. This can provide even more granularity around types of error conditions, but in my experience, it usually works better to rely on fairly coarse-grained signals such as HTTP status codes for the purposes of aggregate monitoring and SLIs. Instead, use this logging for diagnostics and troubleshooting once you have identified there is a problem (I am, however, a fan of structured logging and log-based metrics for instrumentation but this is for another blog post).

Latency

Combined with error rate, latency tells us what the customer’s experience is really like. This is an important metric for synchronous, user-facing APIs but might be less critical for asynchronous processes such as services that consume events from a message queue. It’s important to point out that when looking at latency, you cannot use averages. This is a common trap I see ops teams and engineers fall into. Latency rarely follows a normal distribution, so relying on averages or medians to provide a summarized view of how a system is performing is folly.

Instead, we have to look at percentiles to get a better understanding of what the latency distribution looks like. Similarly, you cannot average percentiles either. It mathematically makes no sense, meaning you can’t, for instance, look at the average 90th percentile over some period of time. To summarize latency, we can plot multiple percentiles on a graph. Alternatively, heatmaps can be an effective way to visualize latency because they can reveal useful details like distribution modes and outliers. For example, the heatmap below shows that the latency for this service is actually bimodal. Requests usually either respond in approximately 10 milliseconds or 1 second. This modality is not apparent in the line chart above the heatmap where we are only plotting the 50th, 95th, and 99th percentiles. The line chart does, however, show that latency ticked up a tiny bit around 10:10 AM following a severe spike in tail latency where the 99th percentile momentarily jumped over 4 seconds…curious.

Latency distribution for a service as percentiles

Latency distribution for a service as a heatmap

Identifying other SLIs

While these three metrics are what I consider the critical baseline metrics, there may be other SLIs that are important to a service. For example, if our service is a cache, we might care about the freshness of data we’re serving as something that impacts the customer experience. If our service is queue-based, we might care about the time messages spend sitting in the queue.

Heatmap showing the age distribution of data retrieved from a cache

Whatever the SLIs are, they should be things that directly matter to the user’s experience. If they aren’t, then at best they are a useful diagnostic or debugging tool and at worst they are just dashboard window dressing. Usually, though, they’re no use for proactive monitoring because it’s too much noise, and they’re no use for reactive debugging because it’s typically pre-aggregated data.

What’s worse is that when we focus on the wrong SLIs, it can lead us to take steps that actively harm the customer’s experience or simply waste our own time. A real-world example of this is when I saw a team that was actively monitoring garbage collection time for a service. They noticed one instance in particular appeared to be running more garbage collections than the others. While it appeared there were no obvious indicators of latency issues, timeouts, or out-of-memory errors that would actually impact the client, the team decided to redeploy the service in order to force instances to be recycled. This redeploy ended up having a much greater impact on the user experience than any of the garbage collection behavior ever did. The team also spent a considerable amount of time tuning various JVM parameters and other runtime settings, which ultimately had minimal impact.

Where lower-level metrics can provide value is with optimizing resource utilization and cloud spend. While the elastic nature of cloud may allow us to pay our way out of certain types of problems such as a memory leak, this can lead to inefficiency and waste long term. If we see that our service only utilizes 20% of its allocated CPU, we are likely overprovisioned and could save money. If we notice memory consumption consistently creeping up and up before hitting an out-of-memory error, we likely have a memory leak. However, it’s important to understand this distinction in use cases: SLIs are about gauging customer experience while these system metrics are for identifying optimizations and understanding long-term resource characteristics of your system. At any rate, I think it’s preferable to get a system to production with good monitoring in place, put real traffic on it, and then start to fine-tune its performance and resource utilization versus trying to optimize it beforehand through synthetic means.

Transitioning from an on-prem environment to the cloud necessitates a shift in how we monitor the health of systems. It’s essential to recognize and discard operational anti-patterns from traditional data center environments, where resource constraints often lead to a focus on specific metrics and behaviors. This can frequently lead to a sort of “overfitting” when monitoring cloud-based systems. The key to choosing good SLIs is by aligning them with the customer’s experience. Metrics such as traffic rate, error rate, and latency directly impact the user and provide meaningful insights into the health of services. By emphasizing these critical baseline metrics and avoiding distractions with irrelevant indicators, organizations can proactively monitor and improve the customer experience. Focusing on the right SLIs ensures that efforts are directed toward resolving actual issues that matter to users, avoiding pitfalls that can inadvertently harm user experience or waste valuable time. As organizations navigate the complexities of migrating to a cloud-native environment, a user-centric approach to monitoring remains fundamental to successful and efficient operations.

Need help making the transition?

Real Kinetic helps organizations with their cloud migrations and implementing effective operations. If you have questions or need help getting started, we’d love to hear from you. These emails come directly to us, and we respond to every one.

Follow @tyler_treat

February 5, 2020

Digitally Transformed: Becoming a Technology Product Company

More and more established businesses are attempting to reinvent themselves as technology companies. At the heart of this is the digital transformation, a journey many organizations are undertaking in order to better compete and serve their customers. As a result, companies are pouring tons of cash into digital transformation strategies. For some, this means broader adoption of agile or DevOps practices. For others, it’s modernizing product offerings or moving to the cloud. Regardless of the changes, many are struggling to find success transforming themselves due to low throughput, quality issues, or failing to deliver the right thing at the right time. In a few cases, digital transformation has ended in outright disaster.

What is it that these companies are really after? To solve new problems in new ways through innovation? To more rapidly adapt to the changing market? To protect existing revenue? Any leader worth their salt will say all of these are important outcomes, so how do you even begin to make a “digital transformation” actionable? What are we transforming to? How do we know when we’ve arrived?

The reason so many digital transformations fail has to do with how IT is usually positioned within mature, established businesses. I believe what these companies are really after is not a digital transformation—whatever that might be—but rather an organizational one that radically changes the way the business operates. One that redefines what IT means in the context of building software. The technology is incidental to this cultural shift which involves the intersection of people, processes, and innovation. In order to be successful, these organizations need to become technology product companies.

The Genesis of IT

There is an inertia within organizations to overvalue tactics and undervalue strategy. This is true not just of mature, established businesses but really all businesses, startups included. In fact, it’s this exact reason most startups fail. A lack of clear strategy and guiding vision precludes even the best execution from delivering success outside of the odd unicorn (after all, someone has to win the Powerball). Established businesses, however, already have a reliable cash flow engine to fall back on. There is much more margin for error when it comes to both strategy and execution, but this peacetime mentality leads to disruption. Many leaders have begun to recognize this and act on it, falling right back to what they know best—tactics.

Why do companies and managers tend to bias towards tactics over strategy in software development? It comes back to the genesis of IT. Historically, IT was about managing computers, networks, email, phone systems, and other technical areas of the business. While this is still true today, the result of software eating the world has caused that scope to broaden significantly. But for mature, established businesses, IT has long been viewed as a cost center, and the mandate for an IT leader is cost minimization. This is in spite of the fact that the business has shifted away from humans, paper forms, and telephones to automation and software-based solutions. IT has always existed to support business operations, first by managing the technology the business depended on, now by building it. The only real change was IT transforming from a servant of the business to a partner of it.

Consequently, there are two key directives for a traditional IT organization: carry out the orders of the business and minimize cost. These goals inherently lead to a project mindset that is output- and task-oriented. Thus, IT has always been tactical and execution-minded in nature.

A Spotter’s Guide to Project-Minded IT

There are three ways to identify a project-minded IT organization. First, if both software engineers and more traditional IT roles like hardware support or help desk report up to a CIO, it’s likely a project-minded organization. In this case, it’s all just lumped into one group called “IT.”

This contrasts with product-minded companies which place IT responsibilities under a CIO, whose directive is still cost minimization, and product development responsibilities under a CTO and/or CPO (Chief Product Officer), whose directive is strategic investment. There are two distinct groups, IT and Product Development or R&D. It’s more common to see CTOs or CPOs at newer, technology-first companies than it is at mature, established businesses since this requires a major realignment. This alignment, however, is why we see many of the execution issues at companies attempting to “digitally transform” themselves.

Second, if there is a clear separation between IT or development and the business, there’s a good chance it’s a project-minded organization. This might be signaled by business partners, business analysts, or product owners who provide teams with implementation requirements and act as a backlog administrator. Developers might not have a good understanding of who their customers are or they view the business partner as the customer. This can also be signaled by frequently changing priorities, an ever-growing backlog of tasks, or unaddressed tech debt piling up. The team is typically not cross-functional, consisting only of developers and a business partner. Marty Cagan refers to these as delivery teams, and they are purely output-driven.

Alternatively, the team may be cross-functional with some form of designer (often oriented more towards UI than UX) and product manager, but it’s still governed by outputs. The product manager’s role is closer to that of a project manager armed with a product roadmap, and the closest thing developers have to product discovery is design and usability testing. Cagan refers to these as feature teams. Both delivery and feature teams exist to serve the business. These are the teams you’ll find at most companies building software.

At product-minded companies, teams are cross-functional with designers, UX, engineers, and product, and they are measured by outcomes, not outputs. This focus on outcomes means that the team is empowered to figure out the best way to solve the problems they’ve been asked to solve rather than being fed a list of features to build. These teams have an intimate understanding of their customers and interact with them regularly to perform product discovery and validate solutions. These are product teams in the truest sense but also quite rare.

The last way to spot a project-minded organization might be the most obvious. If the roadmap has a clear end point, it’s a project. Here, an IT organization treats building a software solution the same way it treats installing a new phone system. When the project is completed, teams or resources are reallocated to new projects and one of two things happen: it’s either dumped on another team to maintain and extend or no one sticks around to support it. The finished project languishes or former developers are told to context switch to it reactively and at the whims of the business. Engineers are treated as interchangeable and teams are not particularly durable or mission-driven but rather task-driven.

Product-minded companies instead embrace the virtues of minimum viable product, shipping incremental value, validating ideas, and iteration. The product manager provides a vision that unites the team in a common mission. Products are not “completed,” rather they grow and evolve. There is an emphasis on business outcomes over task outputs. Managers understand that teams are composed of people with diverse skills who are not easily fungible but who might be better suited to different phases of a product’s lifecycle. Members of a team might shift focus to other areas and priorities over time, but always in support of the team’s mission.

The Philosophical Dilemma of the Stoplight

A tactics-first mindset results in a propensity to treat software development like an assembly line. We can see this with the recent adoption of ideas from the Toyota Production System and lean manufacturing as it’s applied to software development. This emphasis on tactics causes managers to view product development as an optimization problem—if we just optimize the right set of tactics and practices, we can significantly improve throughput and quality at scale. This has led to the rise in packaged frameworks and processes like SAFe, LeSS, DAD, and Nexus as well as tactics like agile, pair programming, and test-driven development at large organizations.

The assembly-line mindset aims to take developers of arbitrary skill and background, run them through a prescribed process, and get high-quality, high-output results on the other end. I’ve never seen this deliver the desired outcomes in practice, at least not to the degree most leaders hope.

On the surface, mass production and software development share a lot of similarities. Both require quality standards, collaboration between groups of specialized workers, and repeatability. However, the reality is they are quite different from each other. A manufacturing assembly line is optimized to produce the exact same product over and over again, efficiently and reliably. Software products, especially Software as a Service, are heterogeneous. While we seek a process that produces consistent results, each product and situation is unique. Too prescriptive, and we end up with a rigid process that yields poor results and low-throughput. Too unstructured, and we end up with inconsistent and unreliable output.

Our Head of Client Experience Mike Taylor refers to this as the Stoplight Problem. To demonstrate, ask a roomful of people what to do at each phase of a stoplight. On green, everyone says “Go.” On red, “Stop.” And on yellow? The answers vary—even more so with the introduction of flashing yellow lights. How close are we to the light? How fast are we traveling? Are the roads icy? What are the cars in front or behind us doing? What happens at a yellow light is entirely context-dependent and situational. It comes down to making informed choices in the moment without an authoritative, black-and-white determination.

Execution and delivery issues invariably come down to one thing: the yellow light. The green and red lights are binary indicators. There are clear right and wrong actions to take. These are things that can be taught and learned—where tactics matter—but the yellow light comes down to making good decisions. This is something organizations struggle with at scale. How do you trust your teams to make good decisions? As a result, they end up making those decisions top-down in a command-and-control or assembly-line fashion. This is how organizations end up with delivery and feature teams. What’s needed is a sort of meta process or process for encouraging good decision making.

Empowered Product Teams

The emphasis on tactics isn’t limited to traditional project-minded IT organizations. Tactics are more visible and measurable. To a manager, tactics feel like work is happening, but they are rarely the difference maker for a company.

To illustrate, imagine handing out a bunch of axes to a group of people and telling them to go collect some wood. You might even teach them the proper technique for chopping down a tree. What happens next? Chaos. Confusion. A general sense of wandering in the woods. What kind of timber do we need? How much? What is it used for? How do we move it? Watching an army of people swinging axes is going to look like a lot of work is going on, but is it work that matters? You might follow people around, directing them where to go, which trees to cut down, and where to move them, but this won’t scale very well.

Without a guiding vision, we’re left with a bunch of people wandering in the woods swinging axes. Work happens, things get done—maybe even things that matter—but it’s haphazard and inefficient. More often than not, though, we’re always two weeks from completion because there isn’t clarity on where we’re trying to be. In agile terminology, we’re iterating to nowhere.

Our response might be to micromanage or implement the assembly-line process, turning our teams into feature factories. In my experience, this creates new challenges. In the first case, by grinding throughput to a halt, and in the second case, by failing to address the Stoplight Problem. The solution is a combination of vision, strategy, and execution.

A vision is a mental image of what the future could be like. It’s a grand and idealistic state, not something that can be achieved in a short amount of time. A shared vision empowers teams to make better decisions independently.

Strategy consists of a plan with decreasing fidelity. Some organizations attempt to plan 12 to 18 months out in a very waterfall-like fashion, and unless you’re sending a rocket into space, it just doesn’t work. A strategy is really a series of goals that get progressively fuzzier the further you go out. While a vision usually isn’t directly actionable, goals are both actionable and attainable in support of the overarching vision. We can break our strategy down into sets of three-month goals, which allows us to adjust course as needed. This is important since our goals are increasingly fuzzy. The key here is that strategy and goals are not dictated to teams. There needs to be give and take and dialog. OKRs can be a good tool for facilitating this.

At Real Kinetic, we hold quarterly leadership offsites to revisit our vision and strategy, course-correct, and ensure we have a general sense of alignment. We help our clients do the same within their product development organizations. The challenge with strategy is it looks like talking, while tactics look like working, even if it’s work that doesn’t truly move the needle. This is a cognitive bias leaders and managers should be aware of because it can trap us into focusing on tactics that aren’t framed by a clear vision and strategy.

Execution is all about hitting the goals we lay out in our strategy. This is where tactics come into play, but rather than providing teams with a list of features to implement or tasks to perform, we empower them to make good decisions. This is made possible by our guiding vision and cross-functional, mission-driven product teams. Our product manager is figuring out what lies ahead and helping plan the best course of action for realizing our vision. They are looking at value and business viability risks for the product. Our designer is looking at usability risks, and our tech lead is looking at feasibility, making estimations, and contributing to the strategy in order to avoid potential obstacles. You’ll notice that nowhere have we mentioned agile or scrum because these are specific tactics for managing execution. Together, the team is determining execution and discovering a solution that moves the business towards the ideal state set forth by its leadership.

Becoming a Technology Product Company

The struggle with digital transformation is it doesn’t get at the heart of the issue. It’s a tactical response to a tangible, yet ultimately inconsequential, part of the problem. The problem is not due to technology or innovation or particular tactics, it’s due to organizational alignment and execution deficiencies. Unfortunately, the former is more visible and more easily acted on than the latter.

The transformation that organizations are actually after is becoming a technology product company. This requires empowered product teams in combination with vision, strategy, and execution. Most companies focus on the execution because it’s easier, but it’s not sufficient. Empowered product teams require a shared vision that enables them to make good decisions without the need for an overly regimented or top-down process. This is the only effective way I’ve seen software companies scale throughput and quality. Don’t let your organization think it’s building a boulevard when it’s actually planting perennials next to potholes.

Real Kinetic helps clients build great product development organizations. Learn more about working with us.

Follow @tyler_treat

April 26, 2019April 26, 2019

Planting Perennials Next to Potholes

Silos, bikesheds, and focusing on what matters

If you’ve ever flown into Des Moines then you’ve had the privilege of driving on what might be the most decrepit major road in the metro area. An important artery, Fleur Drive is the only way to get to and from the airport, and the pavement is marginally better than that of a dirt road. Cars weave back and forth to dodge potholes and massive cracks in the asphalt as people race to catch their flights. There always appears to be some kind of construction going on somewhere along the six mile stretch of road, and yet, it never seems to actually improve. The road is also located in a major floodplain, so sometimes the city just closes it when the nearby river rises too much. It’s basically what you’d get if you agiled your way through urban planning.

Typically, you’ll see the Public Works Department planting flowers or otherwise maintaining the landscaping of the medians. It goes down to one lane when they have to water the flowers. Over the past month, they tore up and poured new concrete to replace the medians altogether, again bringing the road down to one lane in the process. The tulips look nice though.

It’s interesting because a lot of companies build software this way. They quickly pave the road by iterating their way there, ignoring nearby flood hazards or the anticipated traffic that’s going to be traversing it. They plant some flowers along the way to make it look nice and then move on to the next thing. Over time, the road deteriorates. Fleur is a main thoroughfare, so you can’t just close it and repave. The city doesn’t have the budget to repave it all at once anyway. So you patch up a few potholes and plant some new flowers.

There are a few different facets to this depending on what vantage point you look at it from. As it turns out, however, they all dovetail into the same thing. At the individual level, what you often see is bikeshedding. That is, engineers focusing time and energy on technical minutiae that, in the grand scheme of things, don’t really matter. Often it’s fixating on aesthetics and what you can see rather than function or things that truly move the needle forward in a meaningful way. Sometimes we get caught up in the details and plant flowers. When you’re up to your neck in alligators, it’s hard to remember that your initial objective was to drain the swamp. This often comes from a lack of direction for the team, and it’s the manager’s job to ensure we’re focusing on what matters.

At the team level, we start to run into siloing issues. This happens when we have different functions of the business focusing on their little parts of the world, more or less neglecting the other parts. Development focuses on development. Operations focuses on operations. Security focuses on security. What you get is gridlock, an utter inability to make progress because everyone is uncompromisingly fastened to their silo. Worse yet, what does manage to get done is a patchwork of competing goals and agendas. It’s building new medians as the roads crumble. And silos are not limited to pure business functions like development, operations, and security. There are silos within silos—Product Team X and Product Team Y, for example. Silos are recursive. They are a natural team dynamic that occurs as organizations grow in accordance with Dunbar’s number, especially at companies that rigidly specialize by function. This is why a cohesive vision is critical.

At the organization level, we see large-scale strategy problems and what I call “WIP-lash”—lots of WIP (Work In Progress), lots of shifting priorities, and lots of “high-priority” items. Priorities change at the drop of a hat or everything is a priority all of the time or the work is planned 12 months in advance and by the time we execute, the goalposts have moved. Executives make knee-jerk mandates in absolute terms to respond to the newest fire. Tech debt piles up as things are added to the never-ending priority queue (that’s at least one thing that doesn’t get equal priority as everything else!), but the infrastructure is in a constant state of ruin and the potholes don’t stop. WIP-lash is just strategic bikeshedding. This is a prioritization and planning issue through and through. We can’t close the entire road and repave it. Instead, we do it in phases. Managing tech debt works the same way. We have to pay it down periodically, but not with constant band-aids and chewing gum and not by stopping the world. We have to prioritize the work like everything else we do, and sometimes that means saying no to other things we deem important.

OKRs can be a useful way to force those difficult decisions and provide teams a shared vision. Specifically, they are the strategy to balance out the iterative tactics of agile. If you don’t have some kind of mile markers you’re working towards, you’re just iterating your way to nowhere. OKRs are not intended to be a waterfall approach, they are about providing strategic guidance. That doesn’t mean companies don’t screw it up though, especially when consultants get their hooks into things. They don’t need to be a large, scary, expensive process with fancy tools—just a Word document and real discussions about what needs to happen and dialogues about what is actually possible. OKRs are hard to get right though and, like anything, require iteration. A key part of good OKR processes is using them to drive discussions and negotiations up and down the organization. It surfaces conflicts and alignment issues earlier in the process. It provides line managers a mechanism to push back and force hard decisions and open a dialogue between groups. The discussions on what really matters and the negotiation about what is really possible is the major value.

“Do you want this or that? I only have resources for this.”
“Oh, I actually have engineers I can lend this quarter. Maybe that will help?”
“Sure, but we can only accomplish part of that.”
“We can make that work.”

OKRs are a vehicle for strategic discussions, not tactical status updates, task lists, or waterfall plans. Without some sort of guiding vision that you’re working towards, you’re just doing stuff. That might look and feel productive but only on the surface. It must be a negotiation if you want results and not just activity.

It really comes down to prioritization and alignment. At the individual level, we have tactical bikeshedding—focusing on items that are largely inconsequential. This is a prioritization problem. It falls on managers to keep teams focused, but it also flows from broader organizational issues. It’s particularly insidious in companies that separate product management (“the business”) from product development (“engineering”). At the organization level, we have strategic bikeshedding—being unable to make hard decisions and focus in on what matters to the business right now, resulting in WIP-lash. This is also a prioritization problem, and it leads to the tactical bikeshedding mentioned earlier. In between, at the team level, we have siloing. This causes all sorts of issues ranging from gridlock and broken customer experiences to duplication of effort. It’s an alignment problem.

There is not a simple, quick solution to these problems, but it starts at the top. If management is not in alignment and unable to prioritize what matters, no one else will. Work will happen, and to a passerby that can look reassuring, but is it work that matters? OKRs are not a silver bullet, and they are difficult to do and take time to get right. But when executed well, they can be a powerful lens to focus on what matters and provide a shared vision. As Intel co-founder and former CEO Andy Grove said, the most powerful tool of all is the word “no.”

Real Kinetic is committed to helping clients develop great engineering organizations. Learn more about working with us.

Follow @tyler_treat

January 3, 2019January 3, 2019

How to Level up Dev Teams

One question that clients frequently ask: how do you effectively level up development teams? How do you take a group of engineers who have never written Python and make them effective Python developers? How do you take a group who has never built distributed systems and have them build reliable, fault-tolerant microservices? What about a team who has never built anything in the cloud that is now tasked with building cloud software?

Some say training will level up teams. Bring in a firm who can teach us how to write effective Python or how to build cloud software. Run developers through a bootcamp; throw raw, undeveloped talent in one end and out pops prepared and productive engineers on the other.

My question to those who advocate this is: when do you know you’re ready? Once you’ve completed a training course? Is the two-day training enough or should we opt for the three-day one? The six-month pair-coding boot camp? You might be more ready than you were before, but you also spent piles of cash on training programs, not to mention the opportunity cost of having a team of expensive engineers sit in multi-day or multi-week workshops. Are the trade-offs worth it? Perhaps, but it’s hard to say. And what happens when the next new thing comes along? We have to start the whole process over again.

Others say tools will help level up teams. A CI/CD pipeline will make developers more effective and able to ship higher quality software faster. Machine learning products will make our on-call experience more manageable. Serverless will make engineers more productive. Automation will improve our company’s slow and bureaucratic processes.

This one’s simple: tools are often band-aids for broken or inefficient policies, and policies are organizational scar tissue. Tools can be useful, but they will not fix your broken culture and they certainly will not level up your teams, only supplement them at best.

Yet others say developer practices will level up teams. Teams doing pair programming or test-driven development (TDD) will level up faster and be more effective—or scrum, or agile, or mob programming. Teams not following these practices just aren’t ready, and it will take them longer to become ready.

These things can help, but they don’t actually matter that much. If this sounds like blasphemy to you, you might want to stop and reflect on that dogma for a bit. I have seen teams that use scrum, pair programming, and TDD write terrible software. I have seen teams that don’t write unit tests write amazing software. I have seen teams implement DevOps on-prem, and I have seen teams completely silo ops and dev in the cloud. These are tools in the toolbox that teams can choose to leverage, but they will not magically make a team ready or more effective. The one exception to this is code reviews by non-authors.

Code reviews are the one practice that helps improve software quality, and there is empirical data to support this. Pair programming can be a great way to mentor junior engineers and ensure someone else understands the code, but it’s not a replacement for code reviews. It’s just as easy to come up with a bad idea working by yourself as it is working with another person, but when you bring in someone uninvolved with outside perspective, they’re more likely to realize it’s a bad idea.

Code reviews are an effective way to quickly level up teams provided you have a few pockets of knowledgeable reviewers to bootstrap the process (which, as a corollary, means high-performing teams should occasionally be broken up to seed the rest of the organization). They provide quick feedback to developers who will eventually internalize it and then instill it in their own code reviews. Thus, it quickly spreads expertise. Leveling up becomes contagious.

I experienced this firsthand when I started working at Workiva. Having never written a single line of Python and having never used Google App Engine before, I joined a company whose product was predominantly written in Python and running on Google App Engine. Within the span of a few months, I became a fairly proficient Python developer and quite knowledgeable of App Engine and distributed systems practices. I didn’t do any training. I didn’t read any books. I rarely pair-coded. It was through code reviews (and, in particular, group code reviews!) alone that I leveled up. And it’s why we were ruthless on code reviews, which often caught new hires off guard. Using this approach, Workiva effectively took a team of engineers with virtually no Python or cloud experience, shipped a cloud-based SaaS product written in Python, and then IPO’d in the span of a few years.

Code reviews promote a culture which separates ego from code. People are naturally threatened by criticism, but with a culture of code reviews, we critique code, not people. Code reviews are also a good way to share context within a team. When other people review your code, they get an idea of what you’re up to and where you’re at. Code reviews provide a pulse to your team, and that can help when a teammate needs to context switch to something you were working on.

They are also a powerful way to scale other functions of product development. For example, one area many companies struggle with is security. InfoSec teams are frequently a bottleneck for R&D organizations and often resource-constrained. By developing a security-reviewer program, we can better scale how we approach security and compliance. Require security-sensitive changes to undergo a security review. In order to become a security reviewer, engineers must go through a security training program which must be renewed annually. Google takes this idea even further, having certifications for different areas like “JS readability.”

This is why our consulting at Real Kinetic emphasizes mentorship and building a culture of continuous improvement. It’s also why we bring a bias to action. We talk to companies who want to start adopting new practices and technologies but feel their teams aren’t prepared enough. Here’s the reality: you will never feel fully prepared because you can never be fully prepared. As John Gall points out, the best an army can do is be fully prepared to fight the previous war. This is where being agile does matter, but agile only in the sense of reacting and pivoting quickly.

Nothing is a replacement for experience. You don’t become a professional athlete by watching professional sports on TV. You don’t build reliable cloud software by reading about it in books or going to trainings. To be clear, these things can help, but they aren’t strategies. Similarly, developer practices can help, but they aren’t prerequisites. And more often than not, they become emotional or philosophical debates rather than objective discussions. Teams need to be given the latitude to experiment and make mistakes in order to develop that experience. They need to start doing.

The one exception is code reviews. This is the single most effective way to level up development teams. Through rigorous code reviews, quick iterations, and doing, your teams will level up faster than any training curriculum could achieve. Invest in training or other resources if you think they will help, but mandate code reviews on changes before merging into master. Along with regular retros, this is a foundational component to building a culture of continuous improvement. Expertise will start to spread like wildfire within your organization.

Follow @tyler_treat