How to Level Up Dev Teams

One question that clients frequently ask: how do you effectively level up development teams? How do you take a group of engineers who have never written Python and make them effective Python developers? How do you take a group who has never built distributed systems and have them build reliable, fault-tolerant microservices? What about a team that has never built anything in the cloud and is now tasked with building cloud software?

Some say training will level up teams. Bring in a firm who can teach us how to write effective Python or how to build cloud software. Run developers through a boot camp; throw raw, undeveloped talent in one end, and prepared, productive engineers pop out the other.

My question to those who advocate this is: when do you know you’re ready? Once you’ve completed a training course? Is the two-day training enough or should we opt for the three-day one? The six-month pair-coding boot camp? You might be more ready than you were before, but you also spent piles of cash on training programs, not to mention the opportunity cost of having a team of expensive engineers sit in multi-day or multi-week workshops. Are the trade-offs worth it? Perhaps, but it’s hard to say. And what happens when the next new thing comes along? We have to start the whole process over again.

Others say tools will help level up teams. A CI/CD pipeline will make developers more effective and able to ship higher quality software faster. Machine learning products will make our on-call experience more manageable. Serverless will make engineers more productive. Automation will improve our company’s slow and bureaucratic processes.

This one’s simple: tools are often band-aids for broken or inefficient policies, and policies are organizational scar tissue. Tools can be useful, but they will not fix your broken culture and they certainly will not level up your teams, only supplement them at best.

Yet others say developer practices will level up teams. Teams doing pair programming or test-driven development (TDD) will level up faster and be more effective—or scrum, or agile, or mob programming. Teams not following these practices just aren’t ready, and it will take them longer to become ready.

These things can help, but they don’t actually matter that much. If this sounds like blasphemy to you, you might want to stop and reflect on that dogma for a bit. I have seen teams that use scrum, pair programming, and TDD write terrible software. I have seen teams that don’t write unit tests write amazing software. I have seen teams implement DevOps on-prem, and I have seen teams completely silo ops and dev in the cloud. These are tools in the toolbox that teams can choose to leverage, but they will not magically make a team ready or more effective. The one exception to this is code reviews by non-authors.

Code reviews are the one practice that helps improve software quality, and there is empirical data to support this. Pair programming can be a great way to mentor junior engineers and ensure someone else understands the code, but it’s not a replacement for code reviews. It’s just as easy to come up with a bad idea working by yourself as it is working with another person, but when you bring in someone uninvolved who has an outside perspective, they’re more likely to recognize that it’s a bad idea.

Code reviews are an effective way to quickly level up teams provided you have a few pockets of knowledgeable reviewers to bootstrap the process (which, as a corollary, means high-performing teams should occasionally be broken up to seed the rest of the organization). They provide quick feedback to developers who will eventually internalize it and then instill it in their own code reviews. Thus, it quickly spreads expertise. Leveling up becomes contagious.

I experienced this firsthand when I started working at Workiva. Having never written a single line of Python and having never used Google App Engine before, I joined a company whose product was predominantly written in Python and running on Google App Engine. Within the span of a few months, I became a fairly proficient Python developer and quite knowledgeable about App Engine and distributed systems practices. I didn’t do any training. I didn’t read any books. I rarely pair-coded. It was through code reviews (and, in particular, group code reviews!) alone that I leveled up. And it’s why we were ruthless on code reviews, which often caught new hires off guard. Using this approach, Workiva effectively took a team of engineers with virtually no Python or cloud experience, shipped a cloud-based SaaS product written in Python, and then IPO’d in the span of a few years.

Code reviews promote a culture which separates ego from code. People are naturally threatened by criticism, but with a culture of code reviews, we critique code, not people. Code reviews are also a good way to share context within a team. When other people review your code, they get an idea of what you’re up to and where you’re at. Code reviews provide a pulse to your team, and that can help when a teammate needs to context switch to something you were working on.

They are also a powerful way to scale other functions of product development. For example, one area many companies struggle with is security. InfoSec teams are frequently a bottleneck for R&D organizations and often resource-constrained. By developing a security-reviewer program, we can better scale how we approach security and compliance: require security-sensitive changes to undergo a review by a designated security reviewer. In order to become a security reviewer, engineers must complete a security training program, with certification renewed annually. Google takes this idea even further, having certifications for different areas like “JS readability.”
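To make the gating mechanism concrete, here is a minimal sketch of how a CI pipeline might enforce such a review. Everything in it is an assumption to be adapted: the sensitive path prefixes, the SECURITY_APPROVED variable, and how approval gets recorded are placeholders, and on GitHub a CODEOWNERS entry pointing at a security-reviewers team achieves much the same effect.

```python
#!/usr/bin/env python3
"""CI gate: fail the build when security-sensitive paths change without
a recorded security review. A sketch, not a turnkey tool."""
import os
import subprocess
import sys

# Hypothetical paths that require a security reviewer's sign-off.
SENSITIVE_PREFIXES = ("auth/", "crypto/", "billing/")

def changed_files(base: str = "origin/master") -> list[str]:
    """Files changed on this branch relative to master."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", f"{base}...HEAD"], text=True
    )
    return [f for f in out.splitlines() if f]

def main() -> int:
    touched = [f for f in changed_files() if f.startswith(SENSITIVE_PREFIXES)]
    # SECURITY_APPROVED would be set by CI once a security reviewer approves.
    if touched and os.environ.get("SECURITY_APPROVED") != "true":
        print("Security review required for:")
        for f in touched:
            print(f"  {f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```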

This is why our consulting at Real Kinetic emphasizes mentorship and building a culture of continuous improvement. It’s also why we bring a bias toward action. We talk to companies who want to start adopting new practices and technologies but feel their teams aren’t prepared enough. Here’s the reality: you will never feel fully prepared because you can never be fully prepared. As John Gall points out, the best an army can do is be fully prepared to fight the previous war. This is where being agile does matter, but agile only in the sense of reacting and pivoting quickly.

Nothing is a replacement for experience. You don’t become a professional athlete by watching professional sports on TV. You don’t build reliable cloud software by reading about it in books or going to trainings. To be clear, these things can help, but they aren’t strategies. Similarly, developer practices can help, but they aren’t prerequisites. And more often than not, they become emotional or philosophical debates rather than objective discussions. Teams need to be given the latitude to experiment and make mistakes in order to develop that experience. They need to start doing.

The one exception is code reviews. This is the single most effective way to level up development teams. Through rigorous code reviews, quick iterations, and doing, your teams will level up faster than any training curriculum could achieve. Invest in training or other resources if you think they will help, but mandate code reviews on changes before merging into master. Along with regular retros, this is a foundational component of building a culture of continuous improvement. Expertise will start to spread like wildfire within your organization.

Multi-Cloud Is a Trap

It comes up in a lot of conversations with clients. We want to be cloud-agnostic. We need to avoid vendor lock-in. We want to be able to shift workloads seamlessly between cloud providers. Let me say it again: multi-cloud is a trap. Outside of appeasing a few major retailers who might not be too keen on stuff running in Amazon data centers, I can think of few reasons why multi-cloud should be a priority for organizations of any scale.

A multi-cloud strategy looks great on paper, but it creates unneeded constraints and results in a wild-goose chase. For most, it ends up being a distraction, creating more problems than it solves and costing more money than it’s worth. I’m going to caveat that claim in just a bit because it’s a bold blanket statement, but bear with me. For now, just know that when I say “multi-cloud,” I’m referring to the idea of running the same services across vendors or designing applications in a way that allows them to move between providers effortlessly. I’m not speaking to the notion of leveraging the best parts of each cloud provider or using higher-level, value-added services across vendors.

Multi-cloud rears its head for a number of reasons, but they can largely be grouped into the following points: disaster recovery (DR), vendor lock-in, and pricing. I’m going to speak to each of these and then discuss where multi-cloud actually does come into play.

Disaster Recovery

Multi-cloud gets pushed as a means to implement DR. When discussing DR, it’s important to have a clear understanding of how cloud providers work. Public cloud providers like AWS, GCP, and Azure have a concept of regions and availability zones (n.b. Azure only recently launched availability zones in select regions, which they’ve learned the hard way is a good idea). A region is a collection of data centers within a specific geographic area. An availability zone (AZ) is one or more data centers within a region. Each AZ is isolated with dedicated network connections and power backups, and AZs in a region are connected by low-latency links. AZs might be located in the same building (with independent compute, power, cooling, etc.) or completely separated, potentially by hundreds of miles.
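The hierarchy is easy to see from the cloud APIs themselves. As a quick sketch (assuming boto3 is installed and AWS credentials are configured), this enumerates the AZs within each region visible to an account:

```python
import boto3

# List every region visible to this account, then the AZs within each.
ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in regions:
    zones = boto3.client("ec2", region_name=region).describe_availability_zones()
    names = [z["ZoneName"] for z in zones["AvailabilityZones"]]
    print(f"{region}: {', '.join(names)}")  # e.g. us-east-1: us-east-1a, ...
```

Spreading replicas across those AZs is the first line of defense; multiple regions are the second.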

Region-wide outages are highly unusual. When one happens, it’s a high-profile event since it usually means half the Internet is broken. Since AZs themselves are geographically isolated to an extent, a natural disaster taking down an entire region would basically be the equivalent of a meteorite wiping out the state of Virginia. The more common causes of region failures are misconfigurations and other operator mistakes. While rare, they do happen. However, regions are highly isolated, and providers perform maintenance on them in staggered windows to avoid multi-region failures.

That’s not to say a multi-region failure is out of the realm of possibility (any more than a meteorite wiping out half the continental United States or some bizarre cascading failure). Some backbone infrastructure services might span regions, which can lead to larger-scale incidents. But while having a presence in multiple cloud providers is obviously safer than a multi-region strategy within a single provider, there are significant costs to this. DR is an incredibly nuanced topic that I think goes underappreciated, and cloud portability does little to minimize those costs in practice. You don’t need to be multi-cloud to have a robust DR strategy—unless, perhaps, you’re operating at Google or Amazon scale. After all, Amazon.com, one of the world’s largest retailers, runs on a single cloud, so if your DR strategy can match theirs, you’re probably in pretty good shape.

Vendor Lock-In

Vendor lock-in, and the fear, uncertainty, and doubt that surround it, is another frequently cited reason for a multi-cloud strategy. Beau hits on this in Stop Wasting Your Beer Money:

The cloud. DevOps. Serverless. These are all movements and markets created to commoditize the common needs. They may not be the perfect solution. And yes, you may end up “locked in.” But I believe that’s a risk worth taking. It’s not as bad as it sounds. Tim O’Reilly has a quote that sums this up:

“Lock-in” comes because others depend on the benefit from your services, not because you’re completely in control.

We are locked-in because we benefit from this service. First off, this means that we’re leveraging the full value from this service. And, as a group of consumers, we have more leverage than we realize. Those providers are going to do what is necessary to continue to provide value that we benefit from. That is what drives their revenue. As O’Reilly points out, the provider actually has less control than you think. They’re going to build the system they believe benefits the largest portion of their market. They will focus on what we, a player in the market, value.

Competition is the other key piece of leverage. As strong as a provider like AWS is, there are plenty of competing cloud providers. And while competitors attempt to provide differentiated solutions to what they view as gaps in the market they also need to meet the basic needs. This is why we see so many common services across these providers. This is all for our benefit. We should take advantage of this leverage being provided to us. And yes, there will still be costs to move from one provider to another but I believe those costs are actually significantly less than the costs of going from on-premise to the cloud in the first place. Once you’re actually on the cloud you gain agility.

The mental gymnastics I see companies go through to avoid vendor lock-in, and the “reasons” they give for multi-cloud, always astound me. It’s baffling the amount of money companies are willing to spend on things that do not differentiate them in any way whatsoever and that, in fact, force them to divert resources from business-differentiating things.

I think there are a couple of reasons for this. First, as Beau points out, we have a tendency to overvalue our own abilities and undervalue our costs. This causes us to miscalculate the build versus buy decision. This is also closely related to the IKEA effect, in which consumers place a disproportionately high value on products they partially created. Second, as the power and influence in organizations have shifted from IT to the business—and especially with the adoption of product mindset—it strikes me as another attempt by IT operations to retain control and relevance.

Being cloud-agnostic should not be an important enough goal that it drives key decisions. If that’s your starting point, you’re severely limiting your ability to fully reap the benefits of cloud. You’re just renting compute. Platforms like Pivotal Cloud Foundry and Red Hat OpenShift tout the ability to run on every major private and public cloud, but doing so—by definition—necessitates an abstraction layer that abstracts away all the differentiating features of each cloud platform. When you abstract away the differentiating features to avoid lock-in, you also abstract away the value. You end up with vendor “lock-out,” which basically means you aren’t leveraging the full value of services. Either the abstraction reduces things to a common interface or it doesn’t. If it does, it’s unclear how it can leverage differentiated provider features and remain cloud-agnostic. If it doesn’t, it’s unclear what the value of it is or how it can be truly multi-cloud.
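To see the lock-out problem in code, consider what a cloud-agnostic storage interface is forced to look like. This is a contrived Python sketch (the class and method names are mine, not any vendor’s API): the interface can only carry what every provider supports.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Lowest-common-denominator interface over S3, GCS, Azure Blob, etc."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

# Every provider can implement put/get, so the abstraction holds. But
# provider-specific value (S3 event notifications, GCS's integration
# with BigQuery, lifecycle rules) has no seat at this interface. Expose
# such features and the code is no longer cloud-agnostic; hide them and
# you're not leveraging the full value of the service. That's lock-out.
```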

Not to pick on PCF or Red Hat too much, but as the major cloud providers continue to unbundle their own platforms and rebundle them in a more democratized way, the value proposition of these multi-cloud platforms begins to diminish. In the pre-Kubernetes and containers era—aka the heyday of Platform as a Service (PaaS)—there was a compelling story. Now, with the prevalence of containers, Kubernetes, and especially things like Google’s GKE and GKE On-Prem (and equivalents in other providers), that story is getting harder to tell. Interestingly, the recently announced Knative was built in close partnership with, among others, both Pivotal and Red Hat, which seems to be a play to capture some of the value from enterprise adoption of serverless computing using the momentum of Kubernetes.

But someone needs to run these multi-cloud platforms as a service, and therein lies the rub. That responsibility is usually dumped on an operations or shared-services team who now needs to run it in multiple clouds—and probably subscribe to a services contract with the vendor.

A multi-cloud deployment requires expertise for multiple cloud platforms. A PaaS might abstract that away from developers, but it’s pushed down onto operations staff. And we’re not even getting into the security and compliance implications of certifying multiple platforms. For some companies who are just now looking to move to the cloud, this will seriously derail things. Once we get past the airy-fairy marketing speak, we really get into the hairy details of what it means to be multi-cloud.

There’s just less room today for running a PaaS that is not managed for you. It’s simply not strategic to any business. I also like to point out that revenues for companies like Pivotal and Red Hat are largely driven by services. These platforms act as a way to drive professional services revenue.

Generally speaking, the risk posed to businesses by vendor lock-in of non-strategic systems is low. For example, a database stores data. Whether it’s Amazon DynamoDB, Google Cloud Datastore, or Azure Cosmos DB—there might be technical differences like NoSQL, relational, ANSI-compliant SQL, proprietary, and so on—fundamentally, they just put data in and get data out. There may be engineering effort involved in moving between them, but it’s not insurmountable, and that cost is often far outweighed by the benefits we get from using them. Where vendor lock-in can become a problem is when relying on core strategic systems. These might be systems which perform actual business logic or are otherwise key enablers of a company’s business. As Joel Spolsky says, “If it’s a core business function—do it yourself, no matter what. Pick your core business competencies and goals, and do those in house.”
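As a rough illustration of that “data in, data out” point, here is the same write-then-read against DynamoDB and Cloud Datastore using their standard Python clients. The table and kind names are made up, and credentials are assumed to be configured; the vocabulary differs, but the work is mechanically the same, which is why the migration cost is real but bounded.

```python
import boto3
from google.cloud import datastore

# DynamoDB: put an item, then get it back by key.
table = boto3.resource("dynamodb").Table("users")  # hypothetical table
table.put_item(Item={"user_id": "42", "name": "Ada"})
item = table.get_item(Key={"user_id": "42"}).get("Item")

# Cloud Datastore: the same round trip, different vocabulary.
client = datastore.Client()
key = client.key("User", "42")  # hypothetical kind
entity = datastore.Entity(key=key)
entity.update({"name": "Ada"})
client.put(entity)
fetched = client.get(key)
```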

Pricing

Price competitiveness might be the weakest argument of all for multi-cloud. The reality is, as they commoditize more and more, all providers are in a race to the bottom when it comes to cost. Between providers, you will end up spending more in some areas and less in others. Multi-cloud price arbitrage is not a thing; it’s just something people pretend is a thing. For one, it’s wildly impractical. For another, it fails to account for volume discounts. As I mentioned in my comparison of AWS and GCP, it really comes down more to where you want to invest your resources when picking a cloud provider due to their differing philosophies.

And to Beau’s point earlier, the lock-in angle on pricing, i.e., a vendor locking you in and then driving up prices, just doesn’t make sense. First, that’s not how economies of scale work. Second, once you’re in the cloud, the cost of moving from one provider to another is dramatically less than when you were on-premise, so this simply would not be in providers’ best interest. They will do what’s necessary to capture the largest portion of the market, and competitive forces will drive Infrastructure as a Service (IaaS) costs down. Because of the competitive environment and desire to capture market share, pricing is likely to converge. For cloud providers to increase margins, they will need to move further up the stack toward Software as a Service (SaaS) and value-added services.

Additionally, most public cloud providers offer volume discounts. For instance, AWS offers Reserved Instances with significant discounts of up to 75% for EC2. Other AWS services also have volume discounts, and Amazon uses consolidated billing to combine usage from all the accounts in an organization to give you a lower overall price when possible. GCP offers sustained use discounts, which are automatic discounts applied when you run GCE instances for a significant portion of the billing month. It also implements what it calls inferred instances: bin-packing partial instance usage into a single instance so you don’t lose your discount when you replace instances. Finally, GCP likewise has an equivalent to Amazon’s Reserved Instances called committed use discounts. If resources are spread across multiple cloud providers, it becomes more difficult to qualify for many of these discounts.
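A toy example makes the point. The prices below are invented for illustration only; what matters is the shape of the arithmetic: splitting a steady workload across two providers can leave each half short of the commitment tier a single provider would have discounted.

```python
# Hypothetical numbers: a steady 100-instance workload at $100 per
# instance-month, with a 40% discount for committing 100 instances
# to a single vendor.
on_demand = 100 * 100 * 12          # $120,000/year with no commitment
committed = on_demand * (1 - 0.40)  # $72,000/year, all on one provider

# Split 50/50 across two clouds: neither half reaches the commitment
# tier, so the discount is forfeited on both sides.
split = 2 * (50 * 100 * 12)         # $120,000/year
print(committed, split)             # -> 72000.0 120000
```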

Where Multi-Cloud Makes Sense

I said I would caveat my claim and here it is. Yes, multi-cloud can be—and usually is—a distraction for most organizations. If you are a company that is just now starting to look at cloud, it will serve no purpose but to divert you from what’s really important. It will slow things down and plant seeds of FUD.

Some companies try to do build-outs on multiple providers at the same time in an attempt to hedge the risk of going all in on one. I think this is counterproductive and actually increases the risk of an unsuccessful outcome. For smaller shops, pick a provider and focus efforts on productionizing it. Leverage managed services where you can, and don’t use multi-cloud as a reason not to. For larger companies, it’s not unreasonable to have build-outs on multiple providers, but it should be done through controlled experimentation. And that’s one of the benefits of cloud: we can make limited investments and experiment without big up-front expenditures—watch out for that with the multi-cloud PaaS offerings and service contracts.

But no, that doesn’t mean multi-cloud doesn’t have a place. Things are never that cut and dried. For large enterprises with multiple business units, multi-cloud is an inevitability. This can come about through product teams at varying levels of maturity, through corporate IT infrastructure, and certainly through mergers and acquisitions. The main value of multi-cloud, and I think one of the few arguments for it, is leveraging the strengths of each cloud where they make sense. This gets back to providers moving up the stack. As they attempt to differentiate with value-added services, multi-cloud starts to become a lot more meaningful. Secondarily, there might be a case for multi-cloud due to data-sovereignty reasons, but I think this is becoming less and less of a concern with the prevalence of regions and availability zones. However, some services, such as Google’s Cloud Spanner, might forgo AZ-granularity due to being “globally available” services, so this is something to be aware of when dealing with regulations like GDPR. Finally, for enterprises with colocation facilities, hybrid cloud will always be a reality, though this gets complicated when extending those facilities out to multiple cloud providers.

If you’re just beginning to dip your toe into cloud, a multi-cloud strategy should not be at the forefront of your mind. It definitely should not be your guiding objective and something that drives core decisions or strategic items for the business. It has a time and place, but outside of that, it’s just a fool’s errand—a distraction from what’s truly important.

How Is Software Valued?

I was talking to a friend a few weeks ago who was putting together a business presentation for potential investors. He was developing a plan for a campground kiosk system that would rely on GIS data to allow guests to view and check in to campsites. The plan was reasonable enough and mostly feasible. He carefully considered all the costs—licensing for a third-party GIS, kiosk hardware, line trenching—and then there was software.

He allocated a mere $8,000 for the kiosk software, a low-ball figure by any definition, and he estimated it would take about four weeks to build from scratch.
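To see just how low, run the numbers (assuming one developer working a standard 40-hour week):

```python
budget = 8_000          # proposed software budget, USD
weeks = 4               # estimated schedule
hours = weeks * 40      # one developer, 40 hours/week
print(f"${budget / hours:.0f}/hour")  # -> $50/hour
```

Fifty dollars an hour, with nothing set aside for design, testing, or rework, is well below prevailing rates for experienced contract developers.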

“Where did you get that figure?” I asked him. The answer basically boiled down to “thin air.”

I didn’t have any kind of sudden realization, but this exchange did reinforce something many others have already observed: software is remarkably undervalued.

All too often clients say something along the lines of “You want me to pay you $X per hour to sit and type on your computer?!” What’s not obvious to many is that software engineers create extraordinary value for businesses. It’s almost ironic: just about everything these days is driven by software, to the extent that it’s taken for granted, yet that software doesn’t somehow materialize out of thin air.

So why is this the case? Is it because software isn’t a physical good? Maybe. However, I think the issue is largely attributable to the disparate levels of productivity between software engineers and other areas of industry. A developer might write an accounting system that leaves a large number of accountants redundant or automate a process that otherwise takes a dozen employees to complete. Are they compensated accordingly? Again, it’s about creating value, but the fact is, most developers aren’t paid in proportion to the value they create or their productivity. Consultant John Cook explains why this is the case:

A salesman who sells 10x as much as his peers will be noticed, and compensated accordingly. Sales are easy to measure, and some salesmen make orders of magnitude more money than others. If a bricklayer were 10x more productive than his peers this would be obvious too, but it doesn’t happen: the best bricklayers cannot lay 10x as much brick as average bricklayers. Software output cannot be measured as easily as dollars or bricks. The best programmers do not write 10x as many lines of code and they certainly do not work 10x longer hours.

It may also be due, at least in part, to software being endlessly enigmatic to non-software people. Is this auto mechanic ripping us off on our car? Is this developer ripping us off on our point-of-sale system? It’s easy for people to see what it takes to build a bridge—designing it, performing simulated load tests, pouring the concrete, assembling the steel, laying the superstructure—these are all tangible overheads.

What does it take to build software? It’s just some bit-twiddling, right? There’s no inventory that needs to be accounted for; there’s no manufacturing labor. As developers, we know it’s a lot more involved than that. The problem with software is that it’s a living thing. After you build a home, you don’t decide to move the bathroom to the other side of the house. The same cannot be said of software.

Product owners are fickle creatures. They don’t know what they want, except that Feature X needs to be changed to Feature Y and still ship on time. I’ve been on projects where this had become so problematic that developers started leaving Feature X implemented. That way, when NPD (new product development) ultimately decided X was correct in the first place, we would be on schedule, but that’s tangential to this conversation.

What I’m getting at is that there’s a lot more to building software than what may be perceived. There’s still planning, and designing, and prototyping, and implementing, and testing. But unlike the bridge or the house, the process doesn’t stop when the software ships.

No self-respecting (or sane) software engineer would agree to build a complete system in four weeks’ time for $8,000. It’s almost insulting. But to someone to whom software is completely foreign—and it is to most—it might not sound so outlandish. The problem is finding the appropriate level of value. It’s easy if you’re an independent consultant, but if you’re one of several hundred developers at a company, how is your value measured? As Cook explains, output, in terms of lines of code, is not a reliable metric. In fact, one could argue it’s inversely proportional to a developer’s ability.

The romantic image of an über-programmer is someone who fires up Emacs, types like a machine gun, and delivers a flawless final product from scratch. A more accurate image would be someone who stares quietly into space for a few minutes and then says “Hmm. I think I’ve seen something like this before.”

It’s for this reason, combined with the fact that programmer salaries don’t really vary dramatically, that many developers do consulting as a profession. They know exactly what their time is worth and the value they add to a business. Coming back to the problem described earlier, the downside of consulting is that many customers don’t recognize that value. As a consultant, it’s also your job to establish what that value is and why it matters.

I took on a contract last month to build some mobile software for a small engineering firm. They needed an Android application but didn’t have the resources in-house to do it. They met with a few software shops in the area but none of them specialized in mobile development. I build Android apps. This raised my value, and I already had a pretty good idea what the app would do for their business. At that point, it’s just letting economics work itself out.