Digitally Transformed: Becoming a Technology Product Company

More and more established businesses are attempting to reinvent themselves as technology companies. At the heart of this is the digital transformation, a journey many organizations are undertaking in order to better compete and serve their customers. As a result, companies are pouring tons of cash into digital transformation strategies. For some, this means broader adoption of agile or DevOps practices. For others, it’s modernizing product offerings or moving to the cloud. Regardless of the changes, many are struggling to find success transforming themselves because of low throughput, quality issues, or a failure to deliver the right thing at the right time. In a few cases, digital transformation has ended in outright disaster.

What is it that these companies are really after? To solve new problems in new ways through innovation? To more rapidly adapt to the changing market? To protect existing revenue? Any leader worth their salt will say all of these are important outcomes, so how do you even begin to make a “digital transformation” actionable? What are we transforming to? How do we know when we’ve arrived?

The reason so many digital transformations fail has to do with how IT is usually positioned within mature, established businesses. I believe what these companies are really after is not a digital transformation—whatever that might be—but rather an organizational one that radically changes the way the business operates. One that redefines what IT means in the context of building software. The technology is incidental to this cultural shift which involves the intersection of people, processes, and innovation. In order to be successful, these organizations need to become technology product companies.

The Genesis of IT

There is an inertia within organizations to overvalue tactics and undervalue strategy. This is true not just of mature, established businesses but really all businesses, startups included. In fact, it’s this exact reason most startups fail. A lack of clear strategy and guiding vision precludes even the best execution from delivering success outside of the odd unicorn (after all, someone has to win the Powerball). Established businesses, however, already have a reliable cash flow engine to fall back on. There is much more margin for error when it comes to both strategy and execution, but this peacetime mentality leads to disruption. Many leaders have begun to recognize this and act on it, falling right back to what they know best—tactics.

Why do companies and managers tend to bias towards tactics over strategy in software development? It comes back to the genesis of IT. Historically, IT was about managing computers, networks, email, phone systems, and other technical areas of the business. While this is still true today, software eating the world has broadened that scope significantly. But for mature, established businesses, IT has long been viewed as a cost center, and the mandate for an IT leader is cost minimization. This is in spite of the fact that the business has shifted away from humans, paper forms, and telephones to automation and software-based solutions. IT has always existed to support business operations, first by managing the technology the business depended on, now by building it. The only real change was IT transforming from a servant of the business to a partner of it.

Consequently, there are two key directives for a traditional IT organization: carry out the orders of the business and minimize cost. These goals inherently lead to a project mindset that is output- and task-oriented. Thus, IT has always been tactical and execution-minded in nature.

A Spotter’s Guide to Project-Minded IT

There are three ways to identify a project-minded IT organization. First, if both software engineers and more traditional IT roles like hardware support or help desk report up to a CIO, it’s likely a project-minded organization. In this case, it’s all just lumped into one group called “IT.”

This contrasts with product-minded companies which place IT responsibilities under a CIO, whose directive is still cost minimization, and product development responsibilities under a CTO and/or CPO (Chief Product Officer), whose directive is strategic investment. There are two distinct groups, IT and Product Development or R&D. It’s more common to see CTOs or CPOs at newer, technology-first companies than it is at mature, established businesses since this requires a major realignment. The lack of this alignment, however, is why we see many of the execution issues at companies attempting to “digitally transform” themselves.

Second, if there is a clear separation between IT or development and the business, there’s a good chance it’s a project-minded organization. This might be signaled by business partners, business analysts, or product owners who provide teams with implementation requirements and act as backlog administrators. Developers might not have a good understanding of who their customers are, or they may view the business partner as the customer. This can also be signaled by frequently changing priorities, an ever-growing backlog of tasks, or unaddressed tech debt piling up. The team is typically not cross-functional, consisting only of developers and a business partner. Marty Cagan refers to these as delivery teams, and they are purely output-driven.

Alternatively, the team may be cross-functional with some form of designer (often oriented more towards UI than UX) and product manager, but it’s still governed by outputs. The product manager’s role is closer to that of a project manager armed with a product roadmap, and the closest thing developers have to product discovery is design and usability testing. Cagan refers to these as feature teams. Both delivery and feature teams exist to serve the business. These are the teams you’ll find at most companies building software.

At product-minded companies, teams are cross-functional with designers, UX, engineers, and product, and they are measured by outcomes, not outputs. This focus on outcomes means that the team is empowered to figure out the best way to solve the problems they’ve been asked to solve rather than being fed a list of features to build. These teams have an intimate understanding of their customers and interact with them regularly to perform product discovery and validate solutions. These are product teams in the truest sense but also quite rare.

The last way to spot a project-minded organization might be the most obvious. If the roadmap has a clear end point, it’s a project. Here, an IT organization treats building a software solution the same way it treats installing a new phone system. When the project is completed, teams or resources are reallocated to new projects and one of two things happens: it’s either dumped on another team to maintain and extend or no one sticks around to support it. The finished project languishes or former developers are told to context switch to it reactively and at the whims of the business. Engineers are treated as interchangeable and teams are not particularly durable or mission-driven but rather task-driven.

Product-minded companies instead embrace the virtues of minimum viable product, shipping incremental value, validating ideas, and iteration. The product manager provides a vision that unites the team in a common mission. Products are not “completed,” rather they grow and evolve. There is an emphasis on business outcomes over task outputs. Managers understand that teams are composed of people with diverse skills who are not easily fungible but who might be better suited to different phases of a product’s lifecycle. Members of a team might shift focus to other areas and priorities over time, but always in support of the team’s mission.

The Philosophical Dilemma of the Stoplight

A tactics-first mindset results in a propensity to treat software development like an assembly line. We can see this with the recent adoption of ideas from the Toyota Production System and lean manufacturing as they are applied to software development. This emphasis on tactics causes managers to view product development as an optimization problem: if we just optimize the right set of tactics and practices, we can significantly improve throughput and quality at scale. This has led to the rise in packaged frameworks and processes like SAFe, LeSS, DAD, and Nexus as well as tactics like agile, pair programming, and test-driven development at large organizations.

The assembly-line mindset aims to take developers of arbitrary skill and background, run them through a prescribed process, and get high-quality, high-output results on the other end. I’ve never seen this deliver the desired outcomes in practice, at least not to the degree most leaders hope.

On the surface, mass production and software development share a lot of similarities. Both require quality standards, collaboration between groups of specialized workers, and repeatability. However, the reality is they are quite different from each other. A manufacturing assembly line is optimized to produce the exact same product over and over again, efficiently and reliably. Software products, especially Software as a Service, are heterogeneous. While we seek a process that produces consistent results, each product and situation is unique. Too prescriptive, and we end up with a rigid process that yields poor results and low throughput. Too unstructured, and we end up with inconsistent and unreliable output.

Our Head of Client Experience Mike Taylor refers to this as the Stoplight Problem. To demonstrate, ask a roomful of people what to do at each phase of a stoplight. On green, everyone says “Go.” On red, “Stop.” And on yellow? The answers vary—even more so with the introduction of flashing yellow lights. How close are we to the light? How fast are we traveling? Are the roads icy? What are the cars in front or behind us doing? What happens at a yellow light is entirely context-dependent and situational. It comes down to making informed choices in the moment without an authoritative, black-and-white determination.

Execution and delivery issues invariably come down to one thing: the yellow light. The green and red lights are binary indicators. There are clear right and wrong actions to take. These are things that can be taught and learned—where tactics matter—but the yellow light comes down to making good decisions. This is something organizations struggle with at scale. How do you trust your teams to make good decisions? As a result, they end up making those decisions top-down in a command-and-control or assembly-line fashion. This is how organizations end up with delivery and feature teams. What’s needed is a sort of meta process or process for encouraging good decision making.

Empowered Product Teams

The emphasis on tactics isn’t limited to traditional project-minded IT organizations. Tactics are more visible and measurable. To a manager, tactics feel like work is happening, but they are rarely the difference maker for a company.

To illustrate, imagine handing out a bunch of axes to a group of people and telling them to go collect some wood. You might even teach them the proper technique for chopping down a tree. What happens next? Chaos. Confusion. A general sense of wandering in the woods. What kind of timber do we need? How much? What is it used for? How do we move it? Watching an army of people swinging axes is going to look like a lot of work is going on, but is it work that matters? You might follow people around, directing them where to go, which trees to cut down, and where to move them, but this won’t scale very well.

Without a guiding vision, we’re left with a bunch of people wandering in the woods swinging axes. Work happens, things get done—maybe even things that matter—but it’s haphazard and inefficient. More often than not, though, we’re always two weeks from completion because there isn’t clarity on where we’re trying to be. In agile terminology, we’re iterating to nowhere.

Our response might be to micromanage or implement the assembly-line process, turning our teams into feature factories. In my experience, this creates new challenges: in the first case by grinding throughput to a halt, and in the second by failing to address the Stoplight Problem. The solution is a combination of vision, strategy, and execution.

A vision is a mental image of what the future could be like. It’s a grand and idealistic state, not something that can be achieved in a short amount of time. A shared vision empowers teams to make better decisions independently.

Strategy consists of a plan with decreasing fidelity. Some organizations attempt to plan 12 to 18 months out in a very waterfall-like fashion, and unless you’re sending a rocket into space, it just doesn’t work. A strategy is really a series of goals that get progressively fuzzier the further you go out. While a vision usually isn’t directly actionable, goals are both actionable and attainable in support of the overarching vision. We can break our strategy down into sets of three-month goals, which allows us to adjust course as needed. This is important since our goals are increasingly fuzzy. The key here is that strategy and goals are not dictated to teams. There needs to be give and take and dialog. OKRs can be a good tool for facilitating this.

At Real Kinetic, we hold quarterly leadership offsites to revisit our vision and strategy, course-correct, and ensure we have a general sense of alignment. We help our clients do the same within their product development organizations. The challenge with strategy is it looks like talking, while tactics look like working, even if it’s work that doesn’t truly move the needle. This is a cognitive bias leaders and managers should be aware of because it can trap us into focusing on tactics that aren’t framed by a clear vision and strategy.

Execution is all about hitting the goals we lay out in our strategy. This is where tactics come into play, but rather than providing teams with a list of features to implement or tasks to perform, we empower them to make good decisions. This is made possible by our guiding vision and cross-functional, mission-driven product teams. Our product manager is figuring out what lies ahead and helping plan the best course of action for realizing our vision. They are looking at value and business viability risks for the product. Our designer is looking at usability risks, and our tech lead is looking at feasibility, making estimations, and contributing to the strategy in order to avoid potential obstacles. You’ll notice that nowhere have we mentioned agile or scrum because these are specific tactics for managing execution. Together, the team is determining execution and discovering a solution that moves the business towards the ideal state set forth by its leadership.

Becoming a Technology Product Company

The struggle with digital transformation is that it doesn’t get at the heart of the issue. It’s a tactical response to a tangible, yet ultimately inconsequential, part of the problem. The problem is not due to technology or innovation or particular tactics; it’s due to organizational alignment and execution deficiencies. Unfortunately, the former is more visible and more easily acted on than the latter.

The transformation that organizations are actually after is becoming a technology product company. This requires empowered product teams in combination with vision, strategy, and execution. Most companies focus on the execution because it’s easier, but it’s not sufficient. Empowered product teams require a shared vision that enables them to make good decisions without the need for an overly regimented or top-down process. This is the only effective way I’ve seen software companies scale throughput and quality. Don’t let your organization think it’s building a boulevard when it’s actually planting perennials next to potholes.

Real Kinetic helps clients build great product development organizations. Learn more about working with us.

Planting Perennials Next to Potholes

Silos, bikesheds, and focusing on what matters

If you’ve ever flown into Des Moines, then you’ve had the privilege of driving on what might be the most decrepit major road in the metro area. An important artery, Fleur Drive is the only way to get to and from the airport, and the pavement is marginally better than that of a dirt road. Cars weave back and forth to dodge potholes and massive cracks in the asphalt as people race to catch their flights. There always appears to be some kind of construction going on somewhere along the six-mile stretch of road, and yet, it never seems to actually improve. The road is also located in a major floodplain, so sometimes the city just closes it when the nearby river rises too much. It’s basically what you’d get if you agiled your way through urban planning.

Typically, you’ll see the Public Works Department planting flowers or otherwise maintaining the landscaping of the medians. It goes down to one lane when they have to water the flowers. Over the past month, they tore up and poured new concrete to replace the medians altogether, again bringing the road down to one lane in the process. The tulips look nice though.

It’s interesting because a lot of companies build software this way. They quickly pave the road by iterating their way there, ignoring nearby flood hazards or the anticipated traffic that’s going to be traversing it. They plant some flowers along the way to make it look nice and then move on to the next thing. Over time, the road deteriorates. Fleur is a main thoroughfare, so you can’t just close it and repave. The city doesn’t have the budget to repave it all at once anyway. So you patch up a few potholes and plant some new flowers.

There are a few different facets to this depending on what vantage point you look at it from. As it turns out, however, they all dovetail into the same thing. At the individual level, what you often see is bikeshedding. That is, engineers focusing time and energy on technical minutiae that, in the grand scheme of things, don’t really matter. Often it’s fixating on aesthetics and what you can see rather than function or things that truly move the needle forward in a meaningful way. Sometimes we get caught up in the details and plant flowers. When you’re up to your neck in alligators, it’s hard to remember that your initial objective was to drain the swamp. This often comes from a lack of direction for the team, and it’s the manager’s job to ensure we’re focusing on what matters.

At the team level, we start to run into siloing issues. This happens when we have different functions of the business focusing on their little parts of the world, more or less neglecting the other parts. Development focuses on development. Operations focuses on operations. Security focuses on security. What you get is gridlock, an utter inability to make progress because everyone is uncompromisingly fastened to their silo. Worse yet, what does manage to get done is a patchwork of competing goals and agendas. It’s building new medians as the roads crumble. And silos are not limited to pure business functions like development, operations, and security. There are silos within silos—Product Team X and Product Team Y, for example. Silos are recursive. They are a natural team dynamic that occurs as organizations grow in accordance with Dunbar’s number, especially at companies that rigidly specialize by function. This is why a cohesive vision is critical.

At the organization level, we see large-scale strategy problems and what I call “WIP-lash”: lots of WIP (Work In Progress), lots of shifting priorities, and lots of “high-priority” items. Priorities change at the drop of a hat, or everything is a priority all of the time, or the work is planned 12 months in advance and by the time we execute, the goalposts have moved. Executives make knee-jerk mandates in absolute terms to respond to the newest fire. Tech debt piles up as things are added to the never-ending priority queue (that’s at least one thing that doesn’t get the same priority as everything else!), but the infrastructure is in a constant state of ruin and the potholes don’t stop. WIP-lash is just strategic bikeshedding. This is a prioritization and planning issue through and through. We can’t close the entire road and repave it. Instead, we do it in phases. Managing tech debt works the same way. We have to pay it down periodically, but not with constant band-aids and chewing gum and not by stopping the world. We have to prioritize the work like everything else we do, and sometimes that means saying no to other things we deem important.

OKRs can be a useful way to force those difficult decisions and provide teams a shared vision. Specifically, they are the strategy to balance out the iterative tactics of agile. If you don’t have some kind of mile markers you’re working towards, you’re just iterating your way to nowhere. OKRs are not intended to be a waterfall approach; they are about providing strategic guidance. That doesn’t mean companies don’t screw it up though, especially when consultants get their hooks into things. They don’t need to be a large, scary, expensive process with fancy tools. A Word document, real discussions about what needs to happen, and dialogues about what is actually possible are enough. OKRs are hard to get right though and, like anything, require iteration. A key part of good OKR processes is using them to drive discussions and negotiations up and down the organization. This surfaces conflicts and alignment issues earlier in the process. It provides line managers a mechanism to push back, force hard decisions, and open a dialogue between groups. The discussions on what really matters and the negotiation about what is really possible are the major value.

“Do you want this or that? I only have resources for this.”
“Oh, I actually have engineers I can lend this quarter. Maybe that will help?”
“Sure, but we can only accomplish part of that.”
“We can make that work.”

OKRs are a vehicle for strategic discussions, not tactical status updates, task lists, or waterfall plans. Without some sort of guiding vision that you’re working towards, you’re just doing stuff. That might look and feel productive but only on the surface. It must be a negotiation if you want results and not just activity.

It really comes down to prioritization and alignment. At the individual level, we have tactical bikeshedding—focusing on items that are largely inconsequential. This is a prioritization problem. It falls on managers to keep teams focused, but it also flows from broader organizational issues. It’s particularly insidious in companies that separate product management (“the business”) from product development (“engineering”). At the organization level, we have strategic bikeshedding—being unable to make hard decisions and focus in on what matters to the business right now, resulting in WIP-lash. This is also a prioritization problem, and it leads to the tactical bikeshedding mentioned earlier. In between, at the team level, we have siloing. This causes all sorts of issues ranging from gridlock and broken customer experiences to duplication of effort. It’s an alignment problem.

There is not a simple, quick solution to these problems, but it starts at the top. If management is not aligned and cannot prioritize what matters, no one else will. Work will happen, and to a passerby that can look reassuring, but is it work that matters? OKRs are not a silver bullet, and they are difficult to do and take time to get right. But when executed well, they can be a powerful lens to focus on what matters and provide a shared vision. As Intel co-founder and former CEO Andy Grove said, the most powerful tool of all is the word “no.”


Scaling DevOps and the Revival of Operations

Operations is going through a renaissance right now. With the move to cloud and the increasing amount and importance of automation, Ops as we know it is reinventing itself out of necessity. Infrastructure is becoming more and more sophisticated, and commoditized, and practices are just now starting to grow up around that. So while some worry about robots taking our jobs, the reality is more about how automation will help augment us to build better software and focus on higher-value things. It’s not so much about the distant future, whatever that may hold, as it is about the next five to ten years, what Operations looks like in that timeframe, and why I think it has to retool.

When we think about traditional Operations, we probably think about hardware and servers, managing networks and databases, application servers and runtimes, disaster recovery, Nagios checks, as well as the business side—vendor management, procurement, and so on. Finally, we have applications built on top by development teams.

We have a nice, clean separation: developers focus on building features and products, and Ops focuses on making sure the lights stay on. Of course, we know the reality is this separation also creates a lot of problems, so DevOps was born out of this as a way to bring these two groups into alignment by improving communication and feedback loops.

Now, with the move to cloud, many of these traditional Ops functions are effectively being outsourced to cloud providers, i.e. the idea of NoOps. We get unprecedented elasticity and on-demand compute with far less overhead than we ever had before—shrinking procurement time from days or weeks to seconds or minutes.

What this leaves is a thin but important slice between Google or Amazon and those products built by developers—the glue, essentially, between cloud and product. I call this NewOps (which I use facetiously in reference to NoSQL/NewSQL), and it’s the future of Ops. This encompasses infrastructure automation, deployment automation, configuration management, logging, monitoring, and many other things. When Marc Andreessen said software is eating the world, he really meant it. The future of Ops—and many other things—is software. It’s killing the boring, repetitive things we really don’t want to be doing anyway and letting us shift our focus elsewhere.

Certainly, automation is nothing new and is, I think, an important part of DevOps, so I’m going to explain what I mean by NewOps and why I’m distinguishing it. I also don’t want to mischaracterize by having these neatly delineated Ops models. The truth is, your company doesn’t just one day graduate and get its DevOps diploma. Instead, it might evolve through various manifestations of these different models. DevOps is a journey, not a destination in and of itself.

I like to think of a DevOps scale of automation, from manual provisioning all the way to fully self-service. Next, I add a second dimension, org size, from the smallest startups to the biggest enterprises.

Scaling DevOps

Scaling a business is probably one of the hardest things a company has to go through, particularly when it comes to dealing with the problem of silos. They happen at every company as it grows, but why is it that silos form in the first place?

Many companies start with a “DevOps” approach, often out of necessity more than anything. As a small startup, we can’t afford to have dedicated developers, QA, Ops, and security people. We just have people, and those people wear many different hats. Developers might be pushing their own code to production. They might even be managing the infrastructure that code runs on. There’s probably not a lot of stability, probably a lot of risk, and probably not a whole lot of thought towards controlling costs.

But as the product scales, we specialize. And as the business scales, we add various safety checks, controls, and processes. Developers write code, Ops people run it, QA gets blamed for defects, security blocks everything, and management wonders why nothing gets shipped.

And so we end up in the top left-hand quadrant with Ops as gatekeepers. Ops is fighting for stability and, at the same time, devs are basically fighting for change. More or less, we have a stable, cost-controlled, risk-averse environment—hopefully. But we also have a significant delivery and innovation bottleneck.

Specialization is good! But misalignment is not good. The question is, then, how do we scale specialization? Cross-functional teams come to mind. After all, DevOps encourages cooperation! We add an Ops engineer to each team, and maybe a reliability engineer, and perhaps a few extra for on-call backup, and of course a QA engineer too. Problem solved, right?

But hold on. What if we have 40 development teams? And all those teams are doing microservices. And, of course, all of those microservices are special snowflakes each with their own stacks, infrastructure, databases, and so on. This quickly gets out of control, but moreover, that’s a lot of teams and specialized roles on those teams. That’s a lot of headcount which equates to a lot of hiring and a lot of time and money. If you’re Google and you can just throw money at the problem, this might work out okay. For the rest of us, it might not be such a realistic option.

We go back to the drawing board and again ask ourselves: how do we scale specialization? My thought on how to do this is with vision and product.

A vision is simply a mental image of what the future could be like. It enables independent decision making and alignment. Vision allows all of those teams, and the people on those teams, to make decisions without having to constantly coordinate with each other. Without vision, you’re just iterating to nowhere fast.

But vision without execution is just hallucination. Products are how we scale execution. Specifically, this idea of Operations through the lens of product, which I’ll describe after showing the parallel with what’s happening in QA.

In a lot of engineering organizations, many QA roles have been quietly disappearing. I think what’s happening is an evolution of QA, specifically a shift from being test-focused to tools-focused.

We can look at companies like Amazon and Microsoft who popularized the SDET (Software Development Engineer in Test) model. These companies recognized that having a separate QA and development group causes a lot of problems, just like how having a separate Ops group does. We end up with SDEs (Software Development Engineers) who still focus on the development aspects of building software and SDETs who focus on the quality aspects, but rather than having two wholly separate groups, we just have development teams with SDETs embedded in them.

More recently, Microsoft moved to what they call a “Combined Engineering” model—effectively combining the SDE and SDET roles into a single role called a Software Engineer. Software Engineers write the product code, test code, and tools code needed to deliver their service. They are responsible for everything. Quality is a core concern of software development anyway.

Software Engineers write the code, unit tests, and integration tests. Those tests run in CI. The code moves through a CD pipeline before finally going out to production in some fashion. QA teams are shrinking, but what’s growing are the teams building the tools—the CI environments, the CD pipelines, the automated testing frameworks, the production tooling and automation, etc. The same is becoming true of Ops.

This is what I mean by “Operations through the lens of product.” The build, release, deploy automation, configuration management, infrastructure automation, logging, monitoring—these are all products.

Constraints often make problems easier. At Workiva, as we were struggling through that scaling phase, we placed a constraint on ourselves. We capped our infrastructure engineering headcount at 15% of R&D. This forced us to solve the problem using technology, and technical problems tend to be easier than people problems. In effect, this required us to productize our infrastructure. In doing so, we scaled. We controlled costs. We kept our headcount in check. We reduced risk. We accelerated development. Ultimately, we delivered value to customers faster, going from about three to four releases per year to multiple releases per day. In the end, this is really the goal of DevOps—to deliver value to customers continuously and to do it rapidly and reliably.
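To make the headcount trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It compares embedding dedicated specialists on each of the 40 development teams from the earlier example with capping infrastructure engineering at 15% of R&D. The 40 teams and the 15% cap come from the text above; the six developers per team and the two embedded specialists per team (one Ops, one QA) are purely illustrative assumptions, not Workiva's actual numbers.

```python
# Back-of-the-envelope headcount comparison.
# From the text: 40 development teams, infrastructure capped at 15% of R&D.
# Assumed for illustration only: 6 developers and 2 embedded specialists per team.

TEAMS = 40
DEVS_PER_TEAM = 6          # hypothetical
SPECIALISTS_PER_TEAM = 2   # hypothetical: one Ops engineer, one QA engineer
CAP_PERCENT = 15


def embedded_specialists(teams: int, per_team: int) -> int:
    """Specialist headcount when every team gets its own dedicated specialists."""
    return teams * per_team


def share_percent(part: int, total: int) -> float:
    """Percentage of the total headcount that a group represents."""
    return part * 100 / total


def max_under_cap(rd_total: int, cap_percent: int = CAP_PERCENT) -> int:
    """Largest whole-person group that stays within the cap (integer math)."""
    return rd_total * cap_percent // 100


developers = TEAMS * DEVS_PER_TEAM
specialists = embedded_specialists(TEAMS, SPECIALISTS_PER_TEAM)
rd_total = developers + specialists

print(f"Embedded model: {specialists} specialists, "
      f"{share_percent(specialists, rd_total):.0f}% of R&D")
print(f"Capped model: at most {max_under_cap(rd_total)} infrastructure engineers "
      f"for the same {rd_total}-person R&D organization")
```

Under these assumed numbers, the embedded model needs noticeably more specialists than the cap would allow, and that gap is the work that has to be absorbed by tooling and automation rather than by hiring.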

Rethinking Ops

It’s time we start to rethink Operations because clearly this model of Ops as cluster or infrastructure admins does not scale. Developers will always out-demand Ops’ capacity to supply. Either your headcount is out of control or your ability to innovate and deliver is severely hamstrung. Operations becomes this interrupt-driven thing where we’re just fighting fires as they happen. Ops as masters of production usually devolves to Ops becoming human incident routers, trying to figure out what team or person can help resolve problems because, being responsible for everything, they don’t have the insight to fix it themselves.

Another path that many companies take is Platform as a Service. Workiva is an example of this. For a very long time, Workiva didn’t have a traditional Ops team because the Ops team was Google. The first product was built on Google App Engine. This helped immensely to deliver value to customers quickly. We could just focus on the product and not the surrounding operational aspects, but there is a very real innovation bottleneck that comes with this.

The idea of “Ops lock-in” can be a major problem, whether it’s a PaaS like App Engine locking you in or your own Ops team who just isn’t able to support the kind of innovation that you’re trying to do.

My vision for the future of Operations is taking Combined Engineering to its logical conclusion. Just like with QA, Ops capabilities should be embedded within development teams. The reality is you can’t be an effective software engineer today without some Ops skills, and I think every role should be working towards automating itself out of a job. Specifically, my vision is enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services.

The knee-jerk reaction to this idea is usually fully embracing Infrastructure as a Service, infrastructure as code, and giving developers freedom—and usually the consequences are dire. The point here is that the pendulum can swing too far in the other direction. This was a problem for a brief period of time at Workiva. As we were building new products off of App Engine, developers had this newfound freedom, so teams all went different directions introducing new tech, new infrastructure, new services, and so forth. It was a free-for-all, an explosion of stuff, and the cost explosion that comes with it.

There has to be some control around that, so we tweak the vision statement a bit: enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services…with minimal Ops intervention. We have to have some checks and balances in place.

With this, Ops become force multipliers. We move away from the reactive, interrupt-driven model where Ops are masters of production responsible for everything. Instead, we make dev teams responsible for their services but provide the tools they need to actually own their systems end-to-end—from the code on their laptops to operating it in production.

Enabling developers to self-service through tooling and automation means treating Ops as a product team. The infrastructure automation, deployment automation, configuration management, logging, monitoring, and production tools—these are all products. It’s these products that allow teams to fully own their services. This leads to empowerment.

I have this theory that all engineering organizations operate in a fashion I call pain-driven development. As a company grows, it starts to develop limbs (teams or silos). Each of these limbs has its own pain receptors. Teams operate in a way that minimizes the amount of pain they feel; it’s human instinct. We make locally optimal decisions to minimize pain and end up following a path of least resistance.

Silos promote pain displacement, which results in a “bulkhead” effect. Product development feels the pain of building software, QA feels the pain of testing software, and Ops feels the pain of running software. This creates broken feedback loops. For instance, developers aren’t feeling the pain Ops is feeling trying to run their software. We just throw things over the wall and it becomes an empathy problem.

This leads to misaligned incentives because each team will optimize for the pain that they feel. How do you expect developers to care about quality if they’re not on the hook? Similarly, how do you expect them to care about operability if they’re not on the hook? Developers won’t build truly reliable software until they are on-call for it and directly responsible. However, responsibility requires empowerment. You can’t have one without the other. You can’t ask someone to care about something and fix it without also giving them the power to do so. Most Ops teams simply haven’t done enough to empower and offload responsibility onto development teams.

Products enable ownership. We move away from Ops as masters of production responsible for everything and push that responsibility onto dev teams. They are the experts for their services. They are best equipped to deal with problems that arise. But we provide the tools they need to diagnose and resolve those problems on their own.

Products maintain control through enablement—enabling teams to follow best practices for builds, testing, deploys, support, and compliance. Compliance and other SDLC requirements have to be encoded into the tools and processes. These are things developers won’t empathize with or simply won’t understand. Rather than giving them a long list of things they have to do, we take as many of those things as we can and bake them into our products. If you use these tools or follow these processes, you’ll get a lot of this stuff for free. This reduces risk and accelerates development.

Similarly, we can’t allow all of the special snowflakes to happen. We have to control that explosion of stuff. To do this, we use pain-driven development to our advantage by creating paths of least resistance. Using standardized patterns, application shapes, and infrastructure services, we can set up “paths” to both make it easier to reach production and meet the goals of the business. As a developer, if you follow this path, your life will be a lot easier and you’ll feel less pain. If you deviate from that path, things get much harder, and more painful.

We end up with a set “menu” of standard application shapes and infrastructure. If teams want to deviate and go off-menu, it’s on them to make a case for it. For example, if I want to introduce Erlang into our stack, it’s on my team and me to present the case for that. Part of this might mean we help build and maintain the tools needed to support that. If there is a compelling enough case or enough teams are making similar asks, we can start to standardize new shapes.

Note that we aren’t necessarily mandating technologies, but we’re leveraging pain-driven development to work in our favor.

Products in Practice

Next, I’m going to look at this idea of Operations through the lens of product in a bit more detail. We’ll see what this might actually look like in practice, again using Workiva as a bit of a case study.

The high-level flow I think about, from code on laptop to code in production, has four stages: build, release, deploy, and operate.

Starting with the Build and continuous integration stage, this workflow tends to look something like the following. A developer pushes a change to a branch in a code repository, e.g. GitHub. This triggers a few things to happen. First, the build process, which runs unit/integration tests and builds artifacts. This, in turn, might trigger a QA and/or compliance process. At the same time, we have code reviews happening. All of these processes provide feedback to the developer to quickly iterate.

Workiva has a lot of automated processes built into the developer workflow, some off-the-shelf and some built in-house. For example, when a PR is opened, a security scanner runs which does static analysis and looks for various security vulnerabilities. This can flag a security review when a closer look is needed. Likewise, there are code coverage checks, automated builds, unit and integration tests, Docker image builds, and compliance checks. The screenshots below come from an open-source repo showing some of these products in practice.

For compliance reasons, Workiva requires at least one other person to sign off on code changes. GitHub provides pretty good support for this. Code reviewers provide their feedback, developers work through that feedback, and, once satisfied, reviewers give their “plus one.”

The screenshot below shows some of the automated processes Workiva relies on in the developer workflow: Travis CI, Codecov, Smithy (which is Workiva’s internal build system), Skynet (automated testing), Rosie (automated compliance controls, e.g. do you have code reviews, security reviews, other SDLC compliance requirements?), and Aviary (the security scanner). Once all of these have passed, the PR is automatically labeled with “Merge Requirements Met” and the change can be merged into master.
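To make the “Merge Requirements Met” idea a bit more concrete, here is a minimal sketch in Python of how such a gate could work. This is not Workiva’s implementation. The check names, the `PullRequest` type, and the helper functions are all hypothetical, chosen only to illustrate encoding merge controls in a tool rather than in a checklist.

```python
from dataclasses import dataclass

# Hypothetical check names, loosely modeled on the kinds of checks described above.
REQUIRED_CHECKS = ("build", "unit_tests", "coverage", "security_scan", "compliance")


@dataclass(frozen=True)
class PullRequest:
    check_results: dict  # check name -> "passed", "failed", or "pending"
    approvals: int       # approvals from reviewers other than the author


def unmet_requirements(pr, min_approvals=1):
    """Return human-readable reasons the PR cannot be merged yet."""
    unmet = [
        f"check '{name}' is {pr.check_results.get(name, 'missing')}"
        for name in REQUIRED_CHECKS
        if pr.check_results.get(name) != "passed"
    ]
    if pr.approvals < min_approvals:
        unmet.append(f"needs {min_approvals} approval(s), has {pr.approvals}")
    return unmet


def merge_label(pr):
    """Return the label to apply, or None while requirements are outstanding."""
    return "Merge Requirements Met" if not unmet_requirements(pr) else None
```

The useful property here is that the developer gets a specific list of what is still outstanding instead of a generic rejection, which keeps the feedback loop tight.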

There are a couple things worth pointing out with this workflow. First, the build plan is part of the code and not baked into some build tool. This allows dev teams to fully control their builds. Second, you may have noticed that Workiva has very deep integration with GitHub. This has allowed them to build automated controls into the development process, which speeds up the developer’s workflow while reducing risk.

Next, we move on to the Release stage. This flow looks something like the following:

The developer tags a branch for release, which triggers a build process for creating the artifact. This may have a QA process which then promotes the artifact to a development artifact repository. As you may have noticed, Workiva has a lot of compliance requirements since they deal with companies’ pre-financial data, so there is typically a sign-off process at various stages involving different parties like Release Management, QA, Security, etc. Depending on your compliance controls, this might just be clicking a button to promote an artifact to a production repository. From there, it can actually be deployed to a production environment.
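The promotion flow above can be sketched as a small gate in Python. Everything in this sketch is an assumption made for illustration: the repository names, the sign-off roles, the `Artifact` type, and the `promote` function are hypothetical and would depend on your own compliance controls.

```python
from dataclasses import dataclass, field

# Hypothetical sign-offs required before an artifact may enter each repository.
REQUIRED_SIGNOFFS = {
    "development": set(),
    "production": {"qa", "security", "release_management"},
}


@dataclass
class Artifact:
    name: str
    version: str
    repository: str = "build"
    signoffs: set = field(default_factory=set)


def promote(artifact, target):
    """Move the artifact to `target` only if every required party has signed off."""
    missing = REQUIRED_SIGNOFFS[target] - artifact.signoffs
    if missing:
        raise PermissionError(
            f"cannot promote to {target}: missing sign-off from {sorted(missing)}"
        )
    artifact.repository = target
    return artifact
```

With the rules held as data, "clicking a button to promote" becomes a call to `promote`, and tightening or relaxing a control is a one-line change rather than a process document update.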

With this workflow, artifact tagging, building, and promotion are all automated. It’s also important we have processes around security. Container and machine image auditing is automated as well as security patching for OS updates, etc. For example, this workflow might use something like Packer to automate AMI building. Finally, the artifact sign-off is streamlined for the various parties involved, if not fully automated.

Now we’re ready to actually deploy our application. This is a key part of self-service and “owning” a product. This allows a team to configure their application and, ideally, deploy it themselves to production. Initially, this might be handled by a Release Management team who actually clicks the deploy button, but as you become more confident in your processes and your tools become more mature, more of this responsibility can be pushed onto the development teams.

This is also where control comes into play. For instance, I may be allowed to configure my application to use 1GB of RAM, but if I need 1TB, I may need to get additional sign-off.
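One way to express this kind of guard rail is as a small policy function that the self-service tooling calls before applying a configuration change. The sketch below is in Python and makes some assumptions: the 8 GiB threshold and the function name are invented for illustration, and a real policy would likely cover more than memory.

```python
GIB = 1024 ** 3

# Hypothetical policy: requests at or below this size are self-service.
SELF_SERVICE_MEMORY_LIMIT = 8 * GIB


def review_memory_request(requested_bytes, approved_by=None):
    """Decide whether a memory request can proceed without further sign-off."""
    if requested_bytes <= 0:
        raise ValueError("requested memory must be positive")
    if requested_bytes <= SELF_SERVICE_MEMORY_LIMIT:
        return "auto-approved"
    if approved_by:
        return f"approved by {approved_by}"
    return "needs sign-off"
```

Under this policy, a 1 GiB request goes straight through, while a 1 TiB request is held until someone with authority is recorded as the approver.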

Self-service deploys and self-service configuration—with guard rails—are an important part of continuous deployment. Additionally, infrastructure provisioning should be automated. No more submitting tickets for a nameless Ops person to provision and configure servers, VMs, or other resources—no ticket-driven development.

I’ve been deliberate about not prescribing particular solutions for some of these problems. You might be using Kubernetes or ECS to orchestrate containers; it doesn’t really matter. These should mostly be implementation details. What does matter, though, is having good abstractions around certain implementation details. For example, Workiva was meticulous about building some layers around workload scheduling. This allowed them at one point to switch from using Fleet to ECS to manage containers with virtually no impact to developers. With the amount of churn that happens in tech, it’s important not to tie yourself too heavily to any one implementation. Instead, think about the APIs you expose for your infrastructure and consider those the deliverable.
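As a rough sketch of what “the API is the deliverable” can look like, consider a thin scheduling interface in Python. Workiva’s actual layers are not shown here, so the `Scheduler` interface and `deploy` function are assumptions, and the two implementations are stand-ins that return strings rather than calling Fleet or ECS.

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """The contract developers deploy against; the backend is a detail."""

    @abstractmethod
    def run(self, service, image, replicas):
        """Start `replicas` copies of `image` for `service` and describe the result."""


class FleetScheduler(Scheduler):
    def run(self, service, image, replicas):
        # Stand-in: a real implementation would submit units to Fleet here.
        return f"fleet: {replicas} x {image} for {service}"


class EcsScheduler(Scheduler):
    def run(self, service, image, replicas):
        # Stand-in: a real implementation would create or update an ECS service here.
        return f"ecs: {replicas} x {image} for {service}"


def deploy(scheduler, service, image, replicas=2):
    """Developer-facing entry point; callers never name the backend."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return scheduler.run(service, image, replicas)
```

Swapping `FleetScheduler()` for `EcsScheduler()` changes nothing for the caller of `deploy`, which is the property that makes a backend migration a platform concern instead of a developer concern.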

Finally, we need to operate our service in production, another important part of ownership. There are a lot of products here, so we’ll just look at a cross section.

Logging is arguably the most important part of how we figure out what is happening in our systems. For this reason, Workiva built structured logging and metrics specs and language libraries implementing these specs. As a developer, this made it easy to simply pull in the library for your language and get structured, contextual logging for free. The other half to this was building out a data pipeline. Basically all metadata at Workiva went into Amazon Kinesis, including logs, metrics, and traces. First, this allowed us to reuse the same infrastructure for all of this data, from the agents running on the machines to the pipeline itself. Second, it allowed us to fan this data out to different backend systems—Splunk, SumoLogic, Datadog, Stackdriver, BigQuery, as well as various internal tools. This is probably one of the most important things you can do with your infrastructure.
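Both halves of this idea, a library that gives structured and contextual logs and a pipeline that fans them out, can be sketched in a few lines of Python. The field names and the `StructuredLogger` class are assumptions, not Workiva’s spec, and the sinks are plain callables standing in for a stream and the backends behind it. No real Kinesis or vendor APIs are used.

```python
import json
import time


class StructuredLogger:
    """Emits one JSON object per event and fans it out to every sink."""

    def __init__(self, service, sinks, context=None):
        self.service = service
        self.sinks = list(sinks)
        self.context = dict(context or {})

    def bind(self, **context):
        """Return a child logger that attaches extra context to every event."""
        return StructuredLogger(self.service, self.sinks, {**self.context, **context})

    def info(self, message, **fields):
        record = {
            **self.context,
            **fields,
            "ts": time.time(),
            "level": "info",
            "service": self.service,
            "message": message,
        }
        line = json.dumps(record, sort_keys=True)
        for sink in self.sinks:  # the same serialized event goes to every backend
            sink(line)
        return record


# Lists stand in for two different backend systems.
search_index, dashboards = [], []
log = StructuredLogger("billing", sinks=[search_index.append, dashboards.append])
log.bind(request_id="abc123").info("invoice created", invoice_id=42)
```

Because every event is serialized once and handed to each sink, adding a new backend means registering another sink, with no change to application code.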

Other continuous operations tools include telemetry, tracing, health checks, alerting, and more sophisticated production tools like canary deploys, A/B testing, and traffic shadowing. Some might refer to these as tools for testing in production. Realistically, once you reach a certain scale, testing in production is the only real alternative to the proliferation of deployment environments.
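As one illustration of this kind of production tooling, a canary deploy needs a way to send a small, stable slice of traffic to the new version. The Python sketch below is one common approach rather than a description of any particular product, and it assumes each request carries some stable identifier. The function names are hypothetical.

```python
import hashlib


def bucket(request_id):
    """Map an identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(request_id, canary_percent):
    """Send roughly `canary_percent` percent of identifiers to the canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(request_id) < canary_percent else "stable"
```

Hashing rather than picking at random means the same identifier always lands on the same side, and anything routed to the canary at 10 percent stays there when the rollout widens to 50 percent.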

It’s worth mentioning that you do not need to build all of these products yourself. In fact, you shouldn’t. Many off-the-shelf solutions just need to be glued together. However, I’ve also come to realize that it’s often the “glue” that is important. That is to say, taking some large, commercial off-the-shelf solution and introducing it into a company is frequently rife with headaches. It’s like Jira, a big Frankenstein product that attempts to solve everyone’s problems and, in doing so, solves none of them particularly well. This is why I tend to favor small, modular solutions that can be composed. But it also highlights why there is a cultural aspect to this.

If you think the solution to your ailments is some magical product—maybe a CI/CD pipeline or Kubernetes or something else—you’re misguided. If anything, most problems are cultural, not technical in nature. Technology will not fix your broken culture! The products are not the endgame, they are a means to an end. And the products need to fit the company, its culture, its architecture, and its constraints. It’s tempting to take something you see on Hacker News and introduce it into your stack, but you have to be careful.

Likewise, it’s tempting to dive straight into the deep end, automate everything, and build out a highly sophisticated infrastructure. But it’s important to start small and evolve over time. My approach to this is to get the workflow correct, start manual, then automate more and more over time.

Wrapping Up

Specialization leads to misalignment and broken feedback loops, but it’s an important part of scaling a business. The question is: how do we specialize?

We know the traditional Ops model does not scale—devs will always out-demand capacity in this reactive model. Not only this, the siloing creates an empathy problem. DevOps attempts to help with this by tightening feedback loops and building empathy. NewOps takes this further by empowering teams and providing autonomy. It’s not a replacement for DevOps, it’s an evolution of it. It’s applying a product mindset to the traditional Ops model.

The future of Ops is taking Combined Engineering to its logical conclusion. As such, Ops teams should be redefining their vision from being masters of production to enablers of production. Just like with QA, Ops capabilities need to be embedded within dev teams, but the caveat is they need to be enabled! This is the direction Operations is headed. Software is eating the world, which means both up and down the stack. NewOps treats Ops like a product team whose product, effectively, is infrastructure. It’s creating guard rails, not walls—taking SDLC and compliance controls and encoding them into products rather than giving devs a laundry list of things, having them run the gauntlet through a long, drawn-out development process, and having a gatekeeper at the end.

Offloading responsibility helps correct and scale feedback loops. In my opinion, this is how we scale specialization. Operations isn’t going away, it’s just getting a product manager.

Plant Trees Before You Need the Shade

Like humans, companies go through phases. There’s the early seed and development phase. Founders are so preoccupied with a problem they go crazy. They consider solutions and the feasibility of a business. There’s the startup phase, when a business is actually born, and it stumbles towards product/market fit. There’s the growth and scaling phase, as we try to close more and more deals while, at the same time, hiring the right people. If we’re lucky, we reach the later stages. There’s the expansion phase, as we attempt to land and expand or attack new verticals or geographies. This is when things get really interesting—and hard. Who are the right people to hire? What are the right products to build? The formula that got us here almost certainly won’t get us there. Lastly, there’s maturity, which is when the business has really hit its stride. Maybe there’s an exit, and very likely there’s new leadership involved.

Consistent in all of this are two things: culture and capabilities. Culture is the invisible hand inside your organization. It’s your company’s autopilot. Specifically, culture is the unique combination of processes and values within an organization. These processes and values are what enable us to replicate our success. They allow people to make decisions which are in alignment with the goals of the company without having to constantly coordinate with one another.

This also means your culture is derived from your capabilities, what your organization can and cannot do. Clayton Christensen groups these factors into three buckets: resources, processes, and values. Resources are the (mostly) tangible things a company has—people, capital, brands, intellectual property, relationships with customers, manufacturers, distributors, and so forth. Processes are what we do with resources to accomplish the organization’s goals, such as developing products, developing employees, hiring, firing, doing market research, and allocating resources. They take in resources and produce value. Processes help us protect and scale our values by providing a means of documenting and codifying them. These are predominantly intangible things. Finally, values define how a company makes decisions. What goes to the top of the list, and what gets ignored. What gets investment, and what doesn’t. These are our priorities that guide us.

There are a few problems with how leadership tends to view capabilities. There is typically an overemphasis on resources. This happens because in the startup phase, success is largely governed by resources. Notably, people. This is especially true of software startups. I quote this section from The Innovator’s Dilemma frequently:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

In the beginning stages, people drive success. Early engineers and founders (the two are not mutually exclusive) have an itch to scratch that they are so passionate about, they evangelize this crazy grand vision and people get excited. The company is small, focused, passionate, and everyone is working closely together to solve a customer problem they are obsessed with. And, sometimes, it pays off.

Next, leadership starts asking itself, “OK, we shipped this crazy successful product, now how do we grow?” This is where the wheels start to come off. There are three problems that occur.

First, the focus shifts away from solving a problem you’re obsessed with to finding the next big product. These companies fail to find a new product because they are searching for one without passion. They are seeking top-line revenue growth, not pursuing a vision. They are looking at markets through the lens of “here’s where we can make money” and not “here’s where we can solve problems,” and in doing so, they lose sight of the customer. At this stage, they basically have forgotten what got them here.

Second, their processes get in their own way. This seems contradictory given that processes, by their very nature, are meant to facilitate repeatability. If we have good processes in place, we should be able to apply them time and time again, even with different people, and end up with consistent results. But things break down when we apply the same processes to different problems. In essence, they try to find success using what brought them success to begin with, but with a warped perspective. You can look at every startup and they will all have wildly different stories about how they found success, but the one thing they will all have in common is that relentless itch. When we attack a new market, we need to do a reset on the processes and values, not a recycle. Christensen suggests if your company is so deeply entrenched in its processes that this isn’t viable, as is often the case with large enterprises, spin it out into a new venture. Processes are as dynamic as the company itself. They are not a one-size-fits-all deal. Christensen calls this the migration of capabilities.

The factors that define an organization’s capabilities and disabilities evolve over time—they start in resources; then move to visible, articulated processes and values; and migrate finally to culture. As long as the organization continues to face the same sorts of problems that its processes and values were designed to address, managing the organization can be straightforward. But because those factors also define what an organization cannot do, they constitute disabilities when the problems facing the company change fundamentally.

Third, they don’t fully appreciate the nature of dynamic priorities. Software startups in particular often mistakenly attribute their initial success to technology when, in reality, it’s because of people and timing. Aside from the passion, you need people early on who can ship and ship often. These might not be the most technically capable engineers, but they get things done when it matters. Later, you need people who can still ship but while cleaning things up. Lastly, you need maintainers—people who can refine without breaking anything. That’s not to say you have these archetypes exclusively at each stage—you want a balance of people—but it’s similar to how companies commonly need a different type of CTO at different phases or an Interim CTO.

While resources are an essential part of early success, it’s processes and values that will sustain you. However, we tend to overfit on resources because we become biased from that success—investing heavily in technology and innovation and grounding the company’s success in a few key individuals. We also overfit because resources are more visible and measurable. Being deliberate about establishing processes and values—which are derived from vision—helps to overcome this bias, but it needs to happen early and be continually reinforced. We also need to ensure our processes adapt to new problems. The larger and more complex an organization becomes, the harder this gets.

Moreover, we need to be conscientious about which processes matter to us the most. Often the most important capabilities aren’t reflected by the most visible processes—product development or customer service, for example—but in the less visible, background processes that support decisions about where to invest resources. These might include determining how market research is done, how financial and sales projections are drawn from this analysis, how products are conceived, how planning and budgets are negotiated, and so on. These processes are where many companies get their inability to cope with change.

And this is where the breakdown happens: a company has highly capable people—people who have helped shape its success as a startup—but arms them with the wrong processes and values. The result is often a boom followed by a bunch of fizzles as they try to catch lightning in a bottle once more. A compelling vision plants a seed. Strong processes and clear values help that seed to grow. But the shade produced by that seed—our capabilities—is stationary, so when we approach a new challenge, we need to recognize when to start tilling.

Software Is About Storytelling

Software engineering is more a practice in archeology than it is in building. As an industry, we undervalue storytelling and focus too much on artifacts and tools and deliverables. How many times have you been left scratching your head while looking at a piece of code, system, or process? The story, the legacy left behind by that artifact, is just as important as the artifact itself, if not more so.

And I don’t mean what’s in the version control history, which is often useless. I mean the real, human story behind something. Artifacts, whether that’s code or tools or something else entirely, are not just snapshots in time. They’re the result of a series of decisions, discussions, mistakes, corrections, problems, constraints, and so on. They’re the product of the engineering process, but the problem is they usually don’t capture that process in its entirety. They rarely capture it at all. They commonly end up being nothing but a snapshot in time.

It’s often the sign of an inexperienced engineer when someone looks at something and says, “this is stupid” or “why are they using X instead of Y?” They’re ignoring the context, the fact that circumstances may have been different. There is a story that led up to that point, a reason for why things are the way they are. If you’re lucky, the people involved are still around. Unfortunately, this is not typically the case. And so it’s not necessarily the poor engineer’s fault for wondering these things. Their predecessors haven’t done enough to make that story discoverable and share that context.

I worked at a company that built a homegrown container PaaS on ECS. Doing that today would be insane with the plethora of container solutions available now. “Why aren’t you using Kubernetes?” Well, four years ago when we started, Kubernetes didn’t exist. Even Docker was just in its infancy. And it’s not exactly a flick of a switch to move multiple production environments to a new container runtime, not to mention the politicking with leadership to convince them it’s worth it to not ship any new code for the next quarter as we rearchitect our entire platform. Oh, and now the people behind the original solution are no longer with the company. Good luck! And this is on the timescale of about five years. That’s maybe like one generation of engineers at the company at most—nothing compared to the decades or more software usually lives (an interesting observation is that timescale, I think, is proportional to the size of an organization). Don’t underestimate momentum, but also don’t underestimate changing circumstances, even on a small time horizon.

The point is, stop looking at technology in a vacuum. There are many facets to consider. Likewise, decisions are not made in a vacuum. Part of this is just being an empathetic engineer. The corollary to this is you don’t need to adopt every bleeding-edge tech that comes out to be successful, but the bigger point is software is about storytelling. The question you should be asking is how does your organization tell those stories? Are you deliberate or is it left to tribal knowledge and hearsay? Is it something you truly value and prioritize or simply a byproduct?

Documentation is good, but the trouble with documentation is it’s usually haphazard and stagnant. It’s also usually documentation of how and not why. Documenting intent can go a long way, and understanding the why is a good way to develop empathy. Code survives us. There’s a fantastic talk by Bryan Cantrill on oral tradition in software engineering where he talks about this. People care about intent. Specifically, when you write software, people care what you think. As Bryan puts it, future generations of programmers want to understand your intent so they can abide by it, so we need to tell them what our intent was. We need to broadcast it. Good code comments are an example of this. They give you a narrative of not only what’s going on, but why. When we write software, we write it for future generations, and that’s the most underestimated thing in all of software. Documenting intent also allows you to document your values, and that allows the people who come after you to continue to uphold them.

Storytelling in software is important. Without it, software archeology is simply the study of puzzles created by time and neglect. When an organization doesn’t record its history, it’s bound to repeat the same mistakes. A company’s memory is comprised of its people, but the fact is people churn. Knowing how you got here often helps you with getting to where you want to be. Storytelling is how we transcend generational gaps and the inevitable changing of the old guard to the new guard in a maturing engineering organization. The same is true when we expand that to the entire industry. We’re too memoryless—shipping code and not looking back, discovering everything old that is new again, and simply not appreciating our lineage.