Digitally Transformed: Becoming a Technology Product Company

More and more established businesses are attempting to reinvent themselves as technology companies. At the heart of this is the digital transformation, a journey many organizations are undertaking in order to better compete and serve their customers. As a result, companies are pouring tons of cash into digital transformation strategies. For some, this means broader adoption of agile or DevOps practices. For others, it’s modernizing product offerings or moving to the cloud. Regardless of the changes, many are struggling to find success transforming themselves due to low throughput, quality issues, or failing to deliver the right thing at the right time. In a few cases, digital transformation has ended in outright disaster.

What is it that these companies are really after? To solve new problems in new ways through innovation? To more rapidly adapt to the changing market? To protect existing revenue? Any leader worth their salt will say all of these are important outcomes, so how do you even begin to make a “digital transformation” actionable? What are we transforming to? How do we know when we’ve arrived?

The reason so many digital transformations fail has to do with how IT is usually positioned within mature, established businesses. I believe what these companies are really after is not a digital transformation—whatever that might be—but rather an organizational one that radically changes the way the business operates. One that redefines what IT means in the context of building software. The technology is incidental to this cultural shift which involves the intersection of people, processes, and innovation. In order to be successful, these organizations need to become technology product companies.

The Genesis of IT

There is an inertia within organizations to overvalue tactics and undervalue strategy. This is true not just of mature, established businesses but really all businesses, startups included. In fact, it’s this exact reason most startups fail. A lack of clear strategy and guiding vision precludes even the best execution from delivering success outside of the odd unicorn (after all, someone has to win the Powerball). Established businesses, however, already have a reliable cash flow engine to fall back on. There is much more margin for error when it comes to both strategy and execution, but this peacetime mentality leads to disruption. Many leaders have begun to recognize this and act on it, falling right back to what they know best—tactics.

Why do companies and managers tend to bias towards tactics over strategy in software development? It comes back to the genesis of IT. Historically, IT was about managing computers, networks, email, phone systems, and other technical areas of the business. While this is still true today, the result of software eating the world has caused that scope to broaden significantly. But for mature, established businesses, IT has long been viewed as a cost center, and the mandate for an IT leader is cost minimization. This is in spite of the fact that the business has shifted away from humans, paper forms, and telephones to automation and software-based solutions. IT has always existed to support business operations, first by managing the technology the business depended on, now by building it. The only real change was IT transforming from a servant of the business to a partner of it.

Consequently, there are two key directives for a traditional IT organization: carry out the orders of the business and minimize cost. These goals inherently lead to a project mindset that is output- and task-oriented. Thus, IT has always been tactical and execution-minded in nature.

A Spotter’s Guide to Project-Minded IT

There are three ways to identify a project-minded IT organization. First, if both software engineers and more traditional IT roles like hardware support or help desk report up to a CIO, it’s likely a project-minded organization. In this case, it’s all just lumped into one group called “IT.”

This contrasts with product-minded companies which place IT responsibilities under a CIO, whose directive is still cost minimization, and product development responsibilities under a CTO and/or CPO (Chief Product Officer), whose directive is strategic investment. There are two distinct groups, IT and Product Development or R&D. It’s more common to see CTOs or CPOs at newer, technology-first companies than it is at mature, established businesses since this requires a major realignment. This alignment, however, is why we see many of the execution issues at companies attempting to “digitally transform” themselves.

Second, if there is a clear separation between IT or development and the business, there’s a good chance it’s a project-minded organization. This might be signaled by business partners, business analysts, or product owners who provide teams with implementation requirements and act as a backlog administrator. Developers might not have a good understanding of who their customers are or they view the business partner as the customer. This can also be signaled by frequently changing priorities, an ever-growing backlog of tasks, or unaddressed tech debt piling up. The team is typically not cross-functional, consisting only of developers and a business partner. Marty Cagan refers to these as delivery teams, and they are purely output-driven.

Alternatively, the team may be cross-functional with some form of designer (often oriented more towards UI than UX) and product manager, but it’s still governed by outputs. The product manager’s role is closer to that of a project manager armed with a product roadmap, and the closest thing developers have to product discovery is design and usability testing. Cagan refers to these as feature teams. Both delivery and feature teams exist to serve the business. These are the teams you’ll find at most companies building software.

At product-minded companies, teams are cross-functional with designers, UX, engineers, and product, and they are measured by outcomes, not outputs. This focus on outcomes means that the team is empowered to figure out the best way to solve the problems they’ve been asked to solve rather than being fed a list of features to build. These teams have an intimate understanding of their customers and interact with them regularly to perform product discovery and validate solutions. These are product teams in the truest sense but also quite rare.

The last way to spot a project-minded organization might be the most obvious. If the roadmap has a clear end point, it’s a project. Here, an IT organization treats building a software solution the same way it treats installing a new phone system. When the project is completed, teams or resources are reallocated to new projects and one of two things happen: it’s either dumped on another team to maintain and extend or no one sticks around to support it. The finished project languishes or former developers are told to context switch to it reactively and at the whims of the business. Engineers are treated as interchangeable and teams are not particularly durable or mission-driven but rather task-driven.

Product-minded companies instead embrace the virtues of minimum viable product, shipping incremental value, validating ideas, and iteration. The product manager provides a vision that unites the team in a common mission. Products are not “completed,” rather they grow and evolve. There is an emphasis on business outcomes over task outputs. Managers understand that teams are composed of people with diverse skills who are not easily fungible but who might be better suited to different phases of a product’s lifecycle. Members of a team might shift focus to other areas and priorities over time, but always in support of the team’s mission.

The Philosophical Dilemma of the Stoplight

A tactics-first mindset results in a propensity to treat software development like an assembly line. We can see this with the recent adoption of ideas from the Toyota Production System and lean manufacturing as it’s applied to software development. This emphasis on tactics causes managers to view product development as an optimization problem—if we just optimize the right set of tactics and practices, we can significantly improve throughput and quality at scale. This has led to the rise in packaged frameworks and processes like SAFe, LeSS, DAD, and Nexus as well as tactics like agile, pair programming, and test-driven development at large organizations.

The assembly-line mindset aims to take developers of arbitrary skill and background, run them through a prescribed process, and get high-quality, high-output results on the other end. I’ve never seen this deliver the desired outcomes in practice, at least not to the degree most leaders hope.

On the surface, mass production and software development share a lot of similarities. Both require quality standards, collaboration between groups of specialized workers, and repeatability. However, the reality is they are quite different from each other. A manufacturing assembly line is optimized to produce the exact same product over and over again, efficiently and reliably. Software products, especially Software as a Service, are heterogeneous. While we seek a process that produces consistent results, each product and situation is unique. Too prescriptive, and we end up with a rigid process that yields poor results and low-throughput. Too unstructured, and we end up with inconsistent and unreliable output.

Our Head of Client Experience Mike Taylor refers to this as the Stoplight Problem. To demonstrate, ask a roomful of people what to do at each phase of a stoplight. On green, everyone says “Go.” On red, “Stop.” And on yellow? The answers vary—even more so with the introduction of flashing yellow lights. How close are we to the light? How fast are we traveling? Are the roads icy? What are the cars in front or behind us doing? What happens at a yellow light is entirely context-dependent and situational. It comes down to making informed choices in the moment without an authoritative, black-and-white determination.

Execution and delivery issues invariably come down to one thing: the yellow light. The green and red lights are binary indicators. There are clear right and wrong actions to take. These are things that can be taught and learned—where tactics matter—but the yellow light comes down to making good decisions. This is something organizations struggle with at scale. How do you trust your teams to make good decisions? As a result, they end up making those decisions top-down in a command-and-control or assembly-line fashion. This is how organizations end up with delivery and feature teams. What’s needed is a sort of meta process or process for encouraging good decision making.

Empowered Product Teams

The emphasis on tactics isn’t limited to traditional project-minded IT organizations. Tactics are more visible and measurable. To a manager, tactics feel like work is happening, but they are rarely the difference maker for a company.

To illustrate, imagine handing out a bunch of axes to a group of people and telling them to go collect some wood. You might even teach them the proper technique for chopping down a tree. What happens next? Chaos. Confusion. A general sense of wandering in the woods. What kind of timber do we need? How much? What is it used for? How do we move it? Watching an army of people swinging axes is going to look like a lot of work is going on, but is it work that matters? You might follow people around, directing them where to go, which trees to cut down, and where to move them, but this won’t scale very well.

Without a guiding vision, we’re left with a bunch of people wandering in the woods swinging axes. Work happens, things get done—maybe even things that matter—but it’s haphazard and inefficient. More often than not, though, we’re always two weeks from completion because there isn’t clarity on where we’re trying to be. In agile terminology, we’re iterating to nowhere.

Our response might be to micromanage or implement the assembly-line process, turning our teams into feature factories. In my experience, this creates new challenges. In the first case, by grinding throughput to a halt, and in the second case, by failing to address the Stoplight Problem. The solution is a combination of vision, strategy, and execution.

A vision is a mental image of what the future could be like. It’s a grand and idealistic state, not something that can be achieved in a short amount of time. A shared vision empowers teams to make better decisions independently.

Strategy consists of a plan with decreasing fidelity. Some organizations attempt to plan 12 to 18 months out in a very waterfall-like fashion, and unless you’re sending a rocket into space, it just doesn’t work. A strategy is really a series of goals that get progressively fuzzier the further you go out. While a vision usually isn’t directly actionable, goals are both actionable and attainable in support of the overarching vision. We can break our strategy down into sets of three-month goals, which allows us to adjust course as needed. This is important since our goals are increasingly fuzzy. The key here is that strategy and goals are not dictated to teams. There needs to be give and take and dialog. OKRs can be a good tool for facilitating this.

At Real Kinetic, we hold quarterly leadership offsites to revisit our vision and strategy, course-correct, and ensure we have a general sense of alignment. We help our clients do the same within their product development organizations. The challenge with strategy is it looks like talking, while tactics look like working, even if it’s work that doesn’t truly move the needle. This is a cognitive bias leaders and managers should be aware of because it can trap us into focusing on tactics that aren’t framed by a clear vision and strategy.

Execution is all about hitting the goals we lay out in our strategy. This is where tactics come into play, but rather than providing teams with a list of features to implement or tasks to perform, we empower them to make good decisions. This is made possible by our guiding vision and cross-functional, mission-driven product teams. Our product manager is figuring out what lies ahead and helping plan the best course of action for realizing our vision. They are looking at value and business viability risks for the product. Our designer is looking at usability risks, and our tech lead is looking at feasibility, making estimations, and contributing to the strategy in order to avoid potential obstacles. You’ll notice that nowhere have we mentioned agile or scrum because these are specific tactics for managing execution. Together, the team is determining execution and discovering a solution that moves the business towards the ideal state set forth by its leadership.

Becoming a Technology Product Company

The struggle with digital transformation is it doesn’t get at the heart of the issue. It’s a tactical response to a tangible, yet ultimately inconsequential, part of the problem. The problem is not due to technology or innovation or particular tactics, it’s due to organizational alignment and execution deficiencies. Unfortunately, the former is more visible and more easily acted on than the latter.

The transformation that organizations are actually after is becoming a technology product company. This requires empowered product teams in combination with vision, strategy, and execution. Most companies focus on the execution because it’s easier, but it’s not sufficient. Empowered product teams require a shared vision that enables them to make good decisions without the need for an overly regimented or top-down process. This is the only effective way I’ve seen software companies scale throughput and quality. Don’t let your organization think it’s building a boulevard when it’s actually planting perennials next to potholes.

Real Kinetic helps clients build great product development organizations. Learn more about working with us.

Planting Perennials Next to Potholes

Silos, bikesheds, and focusing on what matters

If you’ve ever flown into Des Moines then you’ve had the privilege of driving on what might be the most decrepit major road in the metro area. An important artery, Fleur Drive is the only way to get to and from the airport, and the pavement is marginally better than that of a dirt road. Cars weave back and forth to dodge potholes and massive cracks in the asphalt as people race to catch their flights. There always appears to be some kind of construction going on somewhere along the six mile stretch of road, and yet, it never seems to actually improve. The road is also located in a major floodplain, so sometimes the city just closes it when the nearby river rises too much. It’s basically what you’d get if you agiled your way through urban planning.

Typically, you’ll see the Public Works Department planting flowers or otherwise maintaining the landscaping of the medians. It goes down to one lane when they have to water the flowers. Over the past month, they tore up and poured new concrete to replace the medians altogether, again bringing the road down to one lane in the process. The tulips look nice though.

It’s interesting because a lot of companies build software this way. They quickly pave the road by iterating their way there, ignoring nearby flood hazards or the anticipated traffic that’s going to be traversing it. They plant some flowers along the way to make it look nice and then move on to the next thing. Over time, the road deteriorates. Fleur is a main thoroughfare, so you can’t just close it and repave. The city doesn’t have the budget to repave it all at once anyway. So you patch up a few potholes and plant some new flowers.

There are a few different facets to this depending on what vantage point you look at it from. As it turns out, however, they all dovetail into the same thing. At the individual level, what you often see is bikeshedding. That is, engineers focusing time and energy on technical minutiae that, in the grand scheme of things, don’t really matter. Often it’s fixating on aesthetics and what you can see rather than function or things that truly move the needle forward in a meaningful way. Sometimes we get caught up in the details and plant flowers. When you’re up to your neck in alligators, it’s hard to remember that your initial objective was to drain the swamp. This often comes from a lack of direction for the team, and it’s the manager’s job to ensure we’re focusing on what matters.

At the team level, we start to run into siloing issues. This happens when we have different functions of the business focusing on their little parts of the world, more or less neglecting the other parts. Development focuses on development. Operations focuses on operations. Security focuses on security. What you get is gridlock, an utter inability to make progress because everyone is uncompromisingly fastened to their silo. Worse yet, what does manage to get done is a patchwork of competing goals and agendas. It’s building new medians as the roads crumble. And silos are not limited to pure business functions like development, operations, and security. There are silos within silos—Product Team X and Product Team Y, for example. Silos are recursive. They are a natural team dynamic that occurs as organizations grow in accordance with Dunbar’s number, especially at companies that rigidly specialize by function. This is why a cohesive vision is critical.

At the organization level, we see large-scale strategy problems and what I call “WIP-lash”—lots of WIP (Work In Progress), lots of shifting priorities, and lots of “high-priority” items. Priorities change at the drop of a hat or everything is a priority all of the time or the work is planned 12 months in advance and by the time we execute, the goalposts have moved. Executives make knee-jerk mandates in absolute terms to respond to the newest fire. Tech debt piles up as things are added to the never-ending priority queue (that’s at least one thing that doesn’t get equal priority as everything else!), but the infrastructure is in a constant state of ruin and the potholes don’t stop. WIP-lash is just strategic bikeshedding. This is a prioritization and planning issue through and through. We can’t close the entire road and repave it. Instead, we do it in phases. Managing tech debt works the same way. We have to pay it down periodically, but not with constant band-aids and chewing gum and not by stopping the world. We have to prioritize the work like everything else we do, and sometimes that means saying no to other things we deem important.

OKRs can be a useful way to force those difficult decisions and provide teams a shared vision. Specifically, they are the strategy to balance out the iterative tactics of agile. If you don’t have some kind of mile markers you’re working towards, you’re just iterating your way to nowhere. OKRs are not intended to be a waterfall approach, they are about providing strategic guidance. That doesn’t mean companies don’t screw it up though, especially when consultants get their hooks into things. They don’t need to be a large, scary, expensive process with fancy tools—just a Word document and real discussions about what needs to happen and dialogues about what is actually possible. OKRs are hard to get right though and, like anything, require iteration. A key part of good OKR processes is using them to drive discussions and negotiations up and down the organization. It surfaces conflicts and alignment issues earlier in the process. It provides line managers a mechanism to push back and force hard decisions and open a dialogue between groups. The discussions on what really matters and the negotiation about what is really possible is the major value.

“Do you want this or that? I only have resources for this.”
“Oh, I actually have engineers I can lend this quarter. Maybe that will help?”
“Sure, but we can only accomplish part of that.”
“We can make that work.”

OKRs are a vehicle for strategic discussions, not tactical status updates, task lists, or waterfall plans. Without some sort of guiding vision that you’re working towards, you’re just doing stuff. That might look and feel productive but only on the surface. It must be a negotiation if you want results and not just activity.

It really comes down to prioritization and alignment. At the individual level, we have tactical bikeshedding—focusing on items that are largely inconsequential. This is a prioritization problem. It falls on managers to keep teams focused, but it also flows from broader organizational issues. It’s particularly insidious in companies that separate product management (“the business”) from product development (“engineering”). At the organization level, we have strategic bikeshedding—being unable to make hard decisions and focus in on what matters to the business right now, resulting in WIP-lash. This is also a prioritization problem, and it leads to the tactical bikeshedding mentioned earlier. In between, at the team level, we have siloing. This causes all sorts of issues ranging from gridlock and broken customer experiences to duplication of effort. It’s an alignment problem.

There is not a simple, quick solution to these problems, but it starts at the top. If management is not in alignment and unable to prioritize what matters, no one else will. Work will happen, and to a passerby that can look reassuring, but is it work that matters? OKRs are not a silver bullet, and they are difficult to do and take time to get right. But when executed well, they can be a powerful lens to focus on what matters and provide a shared vision. As Intel co-founder and former CEO Andy Grove said, the most powerful tool of all is the word “no.”

Real Kinetic is committed to helping clients develop great engineering organizations. Learn more about working with us.

Operations in the World of Developer Enablement

NewOps is not a replacement for DevOps, it’s an evolution of it by looking at Operations through the lens of product. It’s what I’ve come to call “Developer Enablement” because the goal is to shift the focus of Ops teams from being masters of production to enablers of production. Through Developer Enablement, teams are enabled—and tasked with the responsibility—to control their own destiny. This extends far beyond just the responsibility of building products. It includes how we build, test, secure, deploy, monitor, and operate systems.

For some, this might come naturally. Many startups don’t have the privilege of siloing up their organizations (although you’d be surprised!). For others, this can be a major shift in how we build software. Especially in large, established organizations with more specialized roles, responsibilities can be so siloed people aren’t even aware they’re happening. Basic “ilities” like scalability, reliability, and even security become someone else’s responsibility. “Good Operations” means no one even knows you’re there, unless something goes wrong.

So when this is turned on its ear, and these responsibilities are placed on the dev team’s shoulders, how do they adapt? In many cases, teams are eager to take on these new responsibilities but also blissfully unaware of what that actually entails. DBAs are a good example of this. Often a staple of enterprise IT Ops, DBAs are tasked with—among other things—installing and patching DBMSs, performing backups, managing HA and DR strategies, balancing database workloads, managing resources, tuning performance, configuring security settings, and monitoring systems. Many of these responsibilities are invisible to developers.

With cloud and Developer Enablement, this can change in profound ways. However, in a typical lift-and-shift, the role of DBAs is widely unchanged. In this case, we’re just running the same stuff in someone else’s data center. There are still databases to be patched, replication to be managed, backups to be made, and so on. But pure lift-and-shifts, at least as an end goal, are largely a misstep. You throw away all that institutional memory—the knowledge and experience you have managing your own data center—for more expensive compute with which you have less experience administering. Things change when we start to rely on managed cloud services. We no longer run our own databases on VMs but instead rely on cloud-managed ones. This is where things become much more grey—but also much more interesting.

Developer Enablement in the Cloud

First, a quick aside. There are two different concepts we’re talking about here: cloud and Developer Enablement (DevOps for brevity). These are two distinct but related concepts. We can “do” DevOps on-prem, just as we can in the cloud. Likewise, we can also do traditional Operations in the cloud, just as we can on-prem. One of the benefits of cloud is it allows us to focus more investment on business-differentiating things, but it also makes implementing DevOps easier for two reasons. First, the cloud provider takes on more operational responsibilities (the stuff that supports—but doesn’t directly contribute to—business value). Second, it provides a lower barrier to self-service infrastructure. This means developers can, of their own accord, provision and manage supporting infrastructure like databases, caches, queues, and other things without a go-between or the customary “throw-it-over-the-wall” approach. This is a key part of Developer Enablement.

In the world of Developer Enablement in the cloud, what is the role of a DBA, or any other Ops person for that matter? When you start to map who is accountable for what, you quickly realize there is far too much nuance to cleanly map responsibilities. Which cloud provider are we talking about? Within that cloud provider, which database offering? Proprietary NoSQL databases like Google’s Cloud Datastore? Relational databases like Amazon’s RDS? Globally-distributed databases like Spanner? How we handle things like HA and DR vary drastically depending on the service and service provider. In some cases, the vendor is entirely responsible, e.g. because the database has built-in replication. In other cases, the customer. Sometimes it’s a combination of both, such as a database that has automated backups which must first be enabled. It’s not as cut and dry as it used to be.

As we push more responsibility onto developers, how do we ensure they are actually tackling all of those responsibilities, especially the ones they might not even know about? How do we implement DevOps responsibly?

The goal of Developer Enablement is not to enable developers by giving them total control and free rein. Instead, it’s to empower them in a way that is “safe” for the business. People often misconstrue DevOps and automation as things that reduce lead times and increase deployment frequencies by simply pulling security out of the process. This is categorically not the purpose of DevOps. In fact, the intention is to improve security by integrating it more deeply and earlier into the process in a more reliable and repeatable way, i.e. “shift left.” Developer Enablement is about providing the tools, automation, services, and standards teams need to do just this.

So when we say we want to implement DevOps and Developer Enablement, we’re not saying we want to hand developers the keys to production with a pat on the back. We’re saying we want to pave a path to production which allows developers to release software in a way that is safe and secure with greater autonomy—because autonomy enables building more reliable software faster. In this world, Operations teams become increasingly Developer Enablement teams because there is simply less stuff to operate. It becomes more about supporting development teams and organizing around products than acting purely as a gatekeeper or service provider. It’s pretty amazing how things start to improve when you align yourself this way.

Responsibilities of Developer Enablement

Those Operations teams still have extremely valuable skill sets however. It’s just that they start to act more in an advisory role than the assembly-line-worker role converting Jira tickets into outputs. For instance, DBAs have deep expertise on the intricacies and operations of various database systems, but when Amazon is now responsible for installing the database, patching it, scaling it, monitoring it, performing backups, managing replication and failovers, and handling encryption and security, what do the DBAs do? They become domain experts and developer advocates. They make sure teams aren’t shooting themselves—or the company—in the foot and provide domain expertise and tooling in a supporting role. When a developer complains about a slow query, they are the ones who can help them identify, understand, and fix the problem. “It’s doing a full-table scan since you’re missing an index,” or “You have a hot partition because you’re using a timestamp as the partition key. Try using a more uniform ID to distribute workloads evenly.” These folks can often help developers better structure their data to improve application performance and scalability.

In addition to this supporting role, these Developer Enablement teams also help ensure dev teams are thinking about all the things they need to be considering. In the case of data, how is encryption handled? HA? DR? Data migrations? Rollbacks? Not that all of these things need to be handled by the teams themselves—again, often the cloud provider has it covered—but simply ensuring that they have been considered and can be spoken to is important. It’s vital to start this conversation early in the development process.

The Three Phases of Development

There are basically three phases of development to consider. There’s the “playground” phase, which is when teams are essentially exploring different technologies. At this stage, there can be little-to-no oversight outside of controlling cloud spend (which is important for when your intern accidentally starts a task bomb before leaving for the weekend). Teams are free to try out new ideas without worrying about production. Often this work happens in a separate “experimentation” cloud project.

Next, there’s the “green-light” phase. The thing being built is going to production, it’s part of the company’s strategic plan, people are talking about it, etc. At this point, we start an ongoing dialogue with the team and provide them with a list of the key things to be thinking about. This should not be a 10-page document. It should be a one-page document hitting the main areas. An example portion of this might look like the following:

  • How do you plan to implement HA?
  • What classifications of data will this system handle and how do you plan to secure that data in transit and at rest?
  • How much traffic do you expect the system to handle and how will you scale it?
  • How will the system handle authentication and authorization?
  • What are the integration points?
  • Who will support the system in production?
  • What is the CI/CD story for the system?
  • What is the testing strategy?

Depending on your company’s culture, this can sometimes be seen as an affront or threat to teams if they’re used to Ops or InfoSec groups gatekeeping. That is not the goal as it’s intended to be in an advisory capacity. This ends up having a couple benefits. First, it gets teams thinking about and planning for key operational items, and second, it uncovers any major gaps early in the process. The number of times I’ve heard someone ask, “What’s HA?” after reading this list is non-zero. The purpose of this isn’t to shame anyone, just to provide a way to start critical discussions between the team and Developer Enablement groups.

Finally, there’s the “ready-for-production” phase. The team is ready to ship what they’ve been building. This is where things get real. Typically, there are a few things that should happen here. When launching a new service or product, there should be a comprehensive review of the system. The team will sit down with a group of their peers, architects, and security engineers and walk them through the system. People hate the dreaded architecture review, so we call it a product technical walkthrough instead.

Operational Readiness and Change Management

About a month or so prior to the walkthrough, the team should be working through an “operational-readiness checklist” which is used to guide the walkthrough. This checklist is much more detailed than the previous one, enumerating items like what the deploy process consists of, configuration management, API versioning, incident-response procedures, system observability, etc. The checklist we commonly use with clients at Real Kinetic is about seven pages long and covers 10 areas: Deployment, Testing, Reliability/Failover, Architecture, Costs, Security, CI/CD, Infrastructure, Capacity/Performance Estimates, and Operations and Support. This checklist is used to probe different areas. If certain areas feel a little weak, this can lead to deeper discussions depending on the importance or severity. If a system is particularly critical to the business or high-risk, this process can veto a release. Having a sign-off process like this makes some people nervous, but it’s important to point out that this should only apply to new launches. It is not a general change-management process. It’s really about helping teams learn about running systems in production and understanding what that takes.

In addition to the product technical walkthrough, we also recommend doing a security assessment for new services. This usually encompasses a vulnerability and threat assessment, risk assessment, pen testing, the whole nine yards. I usually also like to see some sort of load profiling done on the service before putting it in production (though load and chaos testing should ideally be part of the normal development process, not saved for the very end).

When it comes to infrastructure, there’s also the question of how to manage changes. This is where infrastructure as code (IaC) becomes hugely important as it not only provides a way to automate infrastructure changes, but also a means to review those changes. We can treat infrastructure changes in the same way we treat application changes—storing them in source control, doing code reviews on them, running them through static analysis tools, and so forth. Infrastructure changes, like all changes, should go through a code review process. It cannot be overstated how essential code reviews are and how much they benefit your organization. And once again, this is where Developer Enablement comes into play. I recommend IaC changes be reviewed by a Developer Enablement team member. This provides a touchpoint where they can provide domain expertise and ensure changes are within acceptable parameters. If a developer is requesting a change which falls outside those parameters, such as a database instance with 1TB of RAM for example, it requires a conversation and sign-off process.

Conclusion

With Developer Enablement, what used to be Operations becomes primarily a product and advisory team. “Product” in the sense of providing systems and tools that help developers take on more responsibility, from day-to-day development to operations and support. “Advisory” in the sense of offering domain expertise and guidance. Through this approach, we get better alignment by giving engineers end-to-end ownership from development to on-call and improve efficiency by reducing handoffs. This also lets us scale more effectively. Through products and reduced hand-offs, a Developer Enablement group can empower far more engineers than any conventional Ops team could.

How to Level up Dev Teams

One question that clients frequently ask: how do you effectively level up development teams? How do you take a group of engineers who have never written Python and make them effective Python developers? How do you take a group who has never built distributed systems and have them build reliable, fault-tolerant microservices? What about a team who has never built anything in the cloud that is now tasked with building cloud software?

Some say training will level up teams. Bring in a firm who can teach us how to write effective Python or how to build cloud software. Run developers through a bootcamp; throw raw, undeveloped talent in one end and out pops prepared and productive engineers on the other.

My question to those who advocate this is: when do you know you’re ready? Once you’ve completed a training course? Is the two-day training enough or should we opt for the three-day one? The six-month pair-coding boot camp? You might be more ready than you were before, but you also spent piles of cash on training programs, not to mention the opportunity cost of having a team of expensive engineers sit in multi-day or multi-week workshops. Are the trade-offs worth it? Perhaps, but it’s hard to say. And what happens when the next new thing comes along? We have to start the whole process over again.

Others say tools will help level up teams. A CI/CD pipeline will make developers more effective and able to ship higher quality software faster. Machine learning products will make our on-call experience more manageable. Serverless will make engineers more productive. Automation will improve our company’s slow and bureaucratic processes.

This one’s simple: tools are often band-aids for broken or inefficient policies, and policies are organizational scar tissue. Tools can be useful, but they will not fix your broken culture and they certainly will not level up your teams, only supplement them at best.

Yet others say developer practices will level up teams. Teams doing pair programming or test-driven development (TDD) will level up faster and be more effective—or scrum, or agile, or mob programming. Teams not following these practices just aren’t ready, and it will take them longer to become ready.

These things can help, but they don’t actually matter that much. If this sounds like blasphemy to you, you might want to stop and reflect on that dogma for a bit. I have seen teams that use scrum, pair programming, and TDD write terrible software. I have seen teams that don’t write unit tests write amazing software. I have seen teams implement DevOps on-prem, and I have seen teams completely silo ops and dev in the cloud. These are tools in the toolbox that teams can choose to leverage, but they will not magically make a team ready or more effective. The one exception to this is code reviews by non-authors.

Code reviews are the one practice that helps improve software quality, and there is empirical data to support this. Pair programming can be a great way to mentor junior engineers and ensure someone else understands the code, but it’s not a replacement for code reviews. It’s just as easy to come up with a bad idea working by yourself as it is working with another person, but when you bring in someone uninvolved with outside perspective, they’re more likely to realize it’s a bad idea.

Code reviews are an effective way to quickly level up teams provided you have a few pockets of knowledgeable reviewers to bootstrap the process (which, as a corollary, means high-performing teams should occasionally be broken up to seed the rest of the organization). They provide quick feedback to developers who will eventually internalize it and then instill it in their own code reviews. Thus, it quickly spreads expertise. Leveling up becomes contagious.

I experienced this firsthand when I started working at Workiva. Having never written a single line of Python and having never used Google App Engine before, I joined a company whose product was predominantly written in Python and running on Google App Engine. Within the span of a few months, I became a fairly proficient Python developer and quite knowledgeable of App Engine and distributed systems practices. I didn’t do any training. I didn’t read any books. I rarely pair-coded. It was through code reviews (and, in particular, group code reviews!) alone that I leveled up. And it’s why we were ruthless on code reviews, which often caught new hires off guard. Using this approach, Workiva effectively took a team of engineers with virtually no Python or cloud experience, shipped a cloud-based SaaS product written in Python, and then IPO’d in the span of a few years.

Code reviews promote a culture which separates ego from code. People are naturally threatened by criticism, but with a culture of code reviews, we critique code, not people. Code reviews are also a good way to share context within a team. When other people review your code, they get an idea of what you’re up to and where you’re at. Code reviews provide a pulse to your team, and that can help when a teammate needs to context switch to something you were working on.

They are also a powerful way to scale other functions of product development. For example, one area many companies struggle with is security. InfoSec teams are frequently a bottleneck for R&D organizations and often resource-constrained. By developing a security-reviewer program, we can better scale how we approach security and compliance. Require security-sensitive changes to undergo a security review. In order to become a security reviewer, engineers must go through a security training program which must be renewed annually. Google takes this idea even further, having certifications for different areas like “JS readability.”

This is why our consulting at Real Kinetic emphasizes mentorship and building a culture of continuous improvement. It’s also why we bring a bias to action. We talk to companies who want to start adopting new practices and technologies but feel their teams aren’t prepared enough. Here’s the reality: you will never feel fully prepared because you can never be fully prepared. As John Gall points out, the best an army can do is be fully prepared to fight the previous war. This is where being agile does matter, but agile only in the sense of reacting and pivoting quickly.

Nothing is a replacement for experience. You don’t become a professional athlete by watching professional sports on TV. You don’t build reliable cloud software by reading about it in books or going to trainings. To be clear, these things can help, but they aren’t strategies. Similarly, developer practices can help, but they aren’t prerequisites. And more often than not, they become emotional or philosophical debates rather than objective discussions. Teams need to be given the latitude to experiment and make mistakes in order to develop that experience. They need to start doing.

The one exception is code reviews. This is the single most effective way to level up development teams. Through rigorous code reviews, quick iterations, and doing, your teams will level up faster than any training curriculum could achieve. Invest in training or other resources if you think they will help, but mandate code reviews on changes before merging into master. Along with regular retros, this is a foundational component to building a culture of continuous improvement. Expertise will start to spread like wildfire within your organization.

GCP and AWS: What’s the Difference?

AWS has long been leading the charge when it comes to public cloud providers. I believe this is largely attributed to Bezos’ mandate of “APIs everywhere” in the early days of Amazon, which in turn allowed them to be one of the first major players in the space. Google, on the other hand, has a very different DNA. In contrast to Amazon’s laser-focused product mindset, their approach to cloud has broadly been to spin out services based on internal systems backing Google’s core business. When put in the context of the very different leadership styles and cultures of the two companies, this actually starts to make a lot of sense. But which approach is better, and what does this mean for those trying to settle on a cloud provider?

I think GCP gets a bad rap for three reasons: historically, their support has been pretty terrible, there’s the massive gap in offerings between GCP and AWS, and Google tends to be very opaque with its product roadmaps and commitments. It is nearly impossible now to keep track of all the services AWS offers (which seems to continue to grow at a staggering rate), while GCP’s list of services remains fairly modest in comparison. Naively, it would seem AWS is the obvious “better” choice purely due to the number of services. Of course, there’s much more to the story. This article is less of a comparison of the two cloud providers (for that, there is a plethora of analyses) and more of a look at their differing philosophies and legacies.

Philosophies

AWS and GCP are working toward the same goal from completely opposite ends. AWS is the ops engineer’s cloud. It provides all of the low-level primitives ops folks love like network management, granular identity and access management (IAM), load balancers, placement groups for controlling how instances are placed on underlying hardware, and so forth. You need an ops team just to manage all of these things. It’s not entirely different from a traditional on-prem build-out, just in someone else’s data center. This is why ops folks tend to gravitate toward AWS—it’s familiar and provides the control and flexibility they like.

GCP is approaching it from the angle of providing the best managed services of any cloud. It is the software engineer’s cloud. In many cases, you don’t need a traditional ops team, or at least very minimal staffing in that area. The trade-off is it’s more opinionated. This is apparent when you consider GCP was launched in 2008 with the release of Google App Engine. Other key GCP offerings (and acquisitions) bear this out further, such as Google Kubernetes Engine (GKE), Cloud Spanner, Firebase, and Stackdriver.

Platform

A client recently asked me why more companies aren’t using Heroku. I have nothing personal against Heroku, but the reality is I have not personally run into a company of any size using it. I’m sure they exist, but looking at the customer list on their website, it’s mostly small startups. For greenfield initiatives, larger enterprises are simply apprehensive to use it (and PaaS offerings in general). But I think GCP has a pretty compelling story for managed services with a nice spectrum of control from fully managed “NoOps” type services to straight VMs:

Firebase, Cloud Functions → App Engine → App Engine Flex → GKE → GCE

With a typical PaaS like Heroku, you start to lose that ability to “drop down” a level. Even if a company can get by with a fully managed PaaS, they feel more comfortable having the escape hatch, whether it’s justified or not. App Engine Flexible Environment helps with this by providing a container as a service solution, making it much easier to jump to GKE.

I read an article recently on the good, bad, and ugly of GCP. It does a nice job of telling the same story in a slightly different way. It shows the byzantine nature of the IAM model in AWS and GCP’s much simpler permissioning system. It describes the dozens of compute-instance types AWS has and the four GCP has (micro, standard, highmem, and highcpu—with the ability to combine whatever combination of CPU and memory that makes sense for your workload). It also touches on the differences in product philosophy. In particular, when GCP releases new services or features into general availability (GA), they are usually very high quality. In contrast, when AWS releases something, the quality and production-readiness varies greatly. The common saying is “Google’s Beta is like AWS’s GA.” The flipside is GCP’s services often stay in Beta for a very long time.

GCP also does a better job of integrating their different services together, providing a much smaller set of core primitives that are global and work well for many use cases. The article points out Cloud Pub/Sub as a good example. In AWS, you have SQS, SNS, Amazon MQ, Kinesis Data Streams, Kinesis Data Firehose, DynamoDB Streams, and the list seems to only grow over time. GCP has Pub/Sub. It’s flexible enough to fit many (but not all) of the same use cases. The downside of this is Google engineers tend to be pretty opinionated about how problems should be solved.

This difference in philosophy usually means AWS is shipping more services, faster. I think a big part of this is because there isn’t much of a cohesive “platform” story. AWS has lots of disparate pieces—building blocks—many of which are low-level components or more or less hosted versions of existing tech at varying degrees of ready come GA. This becomes apparent when you have to trudge through their hodgepodge of clunky service dashboards which often have a wildly different look and feel than the others. That’s not to say there aren’t integrations between products, it just feels less consistent than GCP. The other reason for this, I suspect, is Amazon’s pervasive service-oriented culture.

For example, AWS took ActiveMQ and stood it up as a managed service called Amazon MQ. This is something Google is unlikely to do. It’s just not in their DNA. It’s also one reason why they are so far behind. GCP tends to be more on the side of shipping homegrown services, but the tech is usually good and ready for primetime when it’s released. Often they spin out internal services by rewriting them for public consumption. This has made them much slower than AWS.

Part of Amazon’s problem, too, is that they are—in a sense—victims of their own success. They got a much earlier head start. The AWS platform launched in 2002 and made its public debut in 2004 with SQS, shortly followed by S3 and EC2. As a result, there’s more legacy and cruft that has built up over time. Google just started a lot later.

More recently, Google has become much more strategic about embracing open APIs. The obvious case is what it has done with Kubernetes—first by open sourcing it, then rallying the community around it, and finally making a massive strategic investment in GKE and the surrounding ecosystem with pieces like Istio. And it has paid off. GKE is, by far and away, the best managed Kubernetes experience currently available. Amazon, who historically has shied away from open APIs (Google has too), had their hand forced, finally making Elastic Container Service for Kubernetes (EKS) generally available last month—probably a bit prematurely. For a long time, Amazon held firm on ECS as the way to run container workloads in AWS. The community spoke, however, and Amazon reluctantly gave in. Other lower-profile cases of Google embracing open APIs include Cloud Dataflow (Apache Beam) and Cloud ML (TensorFlow). As an aside, machine learning and data is another area GCP is leading the charge with its ML and other services like BigQuery, which is arguably a better product than Amazon Redshift.

There are some other implications with the respective approaches of GCP and AWS, one of which is compliance. AWS usually hits certifications faster, but it’s typically on a region-by-region basis. There’s also GovCloud for FedRAMP, which is an entirely separate region. GCP usually takes longer on compliance, but when it happens, it certifies everything. On the same note, services and features in AWS are usually rolled out by region, which often precludes organizations from taking advantage of them immediately. In GCP, resources are usually global, and the console shows things for the entire cloud project. In AWS, the console UIs are usually regional or zonal.

Billing and Support

For a long time, billing has been a rough spot for GCP. They basically gave you a monthly toy spreadsheet with your spend, which was nearly useless for larger operations. There also was not a good way to forecast spend and track it throughout the month. You could only alert on actual spend and not estimated usage. The situation has improved a bit more recently with better reporting, integration with Data Studio, and the recently announced forecasting feature, but it’s still not on par with AWS’s built-in dashboarding. That said, AWS’s billing is so complicated and difficult to manage, there is a small cottage industry just around managing your AWS bill.

Related to billing, GCP has a simpler pricing model. With AWS, you can purchase Reserved Instances to reduce compute spend, which effectively allows you to rent VMs upfront at a considerable discount. This can be really nice if you have stable and predictable workloads. GCP offers sustained use discounts, which are automatic discounts that get applied when running GCE instances for a significant portion of the billing month. If you run a standard instance for more than 25% of a month, Google automatically discounts your bill. The discount increases when you run for a larger portion of the month. They also do what they call inferred instances, which is bin-packing partial instance usage into a single instance to prevent you from losing your discount if you replace instances. Still, GCP has a direct answer to Amazon’s Reserved Instances called committed use discounts. This allows you to purchase a specific amount of vCPUs and memory for a discount in return for committing to a usage term of one or three years. Committed use discounts are automatically applied to the instances you run, and sustained use discounts are applied to anything on top of that.

Support has still been a touchy point for GCP, though they are working to improve it. In my experience, Google has become more committed to helping customers of all sizes be successful on GCP, primarily because AWS has eaten their lunch for a long time. They are much more willing to assign named account reps to customers regardless of size, while AWS won’t give you the time of day if you’re a smaller shop. Their Customer Reliability Engineering program is also one example of how they are trying to differentiate in the support area.

Outcomes

Something interesting that was pointed out to me by a friend and former AWS engineer was that, while GCP and AWS are converging on the same point from opposite ends, they also have completely opposite organizational structures and practices.

Google relies heavily on SREs and service error budgets for operations and support. SREs will manage the operations of a service, but if it exceeds its error budget too frequently, the pager gets handed back to the engineering team. Amazon support falls more on the engineers. This org structure likely influences the way Google and Amazon approach their services, i.e. Conway’s Law. AWS does less to separate development from operations and, as a result, the systems reflect that.

Suffice to say, there are compelling reasons to go with both AWS and GCP. Sufficiently large organizations will likely end up building out on both. You can use either provider to build the same thing, but how you get there depends heavily on the kinds of teams and skill sets your organization has, what your goals are operationally, and other nuances like compliance and workload shapes. If you have significant ops investment, AWS might be a better fit. If you have lots of software engineers, GCP might be. Pricing is often a point of discussion as well, but the truth is you will end up spending more in some areas and less in others. Moreover, all providers are essentially in a race to the bottom anyway as they commoditize more and more. Where it becomes interesting is how they differentiate with value-added services. This is where “multi-cloud” becomes truly meaningful.

Real Kinetic has extensive experience leveraging both AWS and GCP. Learn more about working with us.