The sunset is impossibly beautiful. The sun has just dipped under the seemingly endless horizon, but it's still painting the wispy clouds in tones of red, orange, and a bit of yellow. With a Mai Tai in hand, can life get any better? I reach over to take a si-BEEP BEEP BEEP BEEP BEEP.

It was all a dream…what time is it…how do I stop this awful noise again? I grab my phone and somehow manage to slap the parts of the screen that probably combine to form my passcode. Nope…I wait for my eyes to focus as the BEEP BEEP BEEP BEEP continues without mercy or respite. After a pause that feels like minutes but is actually four seconds, I can see well enough to enter my passcode and unlock my phone. The time, 02:13, mocks my earlier ambitions of a good night's sleep. The push notification is clear: "ERROR THRESHOLD BREACHED FOR > 3 CONSECUTIVE SAMPLING PERIODS" - and after opening my laptop and logging in, it becomes obvious that this isn't a joke. In fact, it appears that every single request to the service is now failing.

Every. Single. One.

Dependency outage? I pull up the logs and see the impossible:

[ERROR] ValidationException: Request missing required parameter 'objectId'.

Who pushed a code change out that broke our external calls and got it through our testing suite? I start imagining ways to express my UNYIELDING RAGE using polite, appropriate work email language to the person who broke just about every single control we painstakingly put in pl-

Nobody. We haven't deployed a change in days. Our dependency just made objectId required overnight, and brought down our entire service. There's no chance I'm going to sleep, and now I'm going to have to ruin my whole team's night too…

This particular story is fiction, but stories like it have happened, and I would wager somebody reading this has lived through something pretty close to it. Maybe it was during a code update when all your tests failed in your continuous integration (CI) pipeline (you DO have a CI pipeline, right?) because a library dependency changed an input shape, forcing you to rewrite every single integration with it. Hell, maybe they just changed a parameter name because they thought a new one would be "more descriptive" and developer pain means nothing to them. Stories like this are incredibly common across virtually all sub-fields of programming.

More and more of us maintain services, libraries, CLIs, and other forms of direct programmer interfaces that have the power to make this kind of nightmare come true. I have maintained open source libraries and CLIs for 7 years and counting, and causing a problem like this for my users is just about my greatest nightmare. It's also something my teams have put a lot of importance on, to the point where we've given multiple talks on the "what" and "how" of backwards compatibility.

As a brief definition, backwards compatibility for APIs (what we are focusing on here) means that access patterns that work on one version of an API will continue to work on all future revisions within the same major version - deviations from this are known as "breaking changes" and tend to lead to problems.

The Importance of Backwards Compatibility

A mental model for backwards compatibility I really appreciate is treating it as part of your service's durability and availability, a framing I first saw on a colleague's whiteboard.

Looking at our story, and thinking about other consequences of breaking changes, you can see that a breaking change is essentially an "Availability" failure. If you make a breaking change and bring down anybody calling your service with newly-invalid call patterns, you're effectively experiencing an outage. Worse, if you don't alarm on anomalous levels of 4XX errors, you may not even realize it until you get the angry calls/emails!

Even if the backwards-incompatible change is "only" in your client libraries, you still risk breaking customers who lack continuous integration, or, if they have it, blocking them from pushing out potentially critical changes of their own (imagine where you fall on this tradeoff if a breaking change you push in a minor version bump blocks somebody else's critical security patch!).

Breaking changes can even factor into durability - if you change your API behavior such that a call that used to just add or modify resources might now delete resources given the same parameters, your breaking change just hit durability.

This framing also points to the one case where a breaking change is clearly legitimate, though you should take great design pains to avoid ever needing to make this choice: if you need to make a breaking change to preserve security, do it. Get ready for a huge customer notification campaign, but do it.

The Fear

I'm hoping that by this point, you're on board with the idea that backwards compatibility is important. But there's one concern I've heard many times, and it's worth addressing:

“If I make all these backwards compatibility promises, does that mean that if I shipped a mistake, I can’t ever fix it?”

I’ll be honest upfront: This is a legitimate fear. The answer is NOT a straightforward “no” - there are some mistakes you’ll probably have to keep if you ship them. The case I want to make is that this does not need to be feared, and that you’ll rarely be frozen, unable to move forward or rectify mistakes.

The Reality

Within a single major version of a product, if you've decided that stability matters, you are committing to these rules of the road for backwards compatibility (a short Ruby sketch of the client-side rules follows the list):

  • Do
    • APIs
      • Add members/shapes
      • Add intermediate workflow states
      • Add detail to exceptions
      • Add new opt-in exceptions
      • Loosen constraints
    • Clients
      • Embrace forwards-compatibility (support all of the above)
      • Focus on discoverability
  • Do Not
    • APIs
      • Remove/rename members and shapes
      • Change member types
      • Add new terminal workflow states
      • Add new exceptions without opt-in
      • Tighten constraints
    • Clients
      • Validate API constraints
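
To make the client-side rules concrete, here is a minimal Ruby sketch. The task-API response shape and field names are hypothetical, invented for illustration rather than taken from any real service. A client that reads only the members it knows about keeps working when the API adds a member (a "Do"), while a client that re-validates the API's constraints itself (a "Do Not") breaks the moment anything is added.

require 'json'

# Hypothetical v1 response from a task API.
V1_RESPONSE = '{"taskId": "t-123", "status": "OPEN"}'

# Hypothetical later response: the service added a "priority" member (a "Do").
V2_RESPONSE = '{"taskId": "t-123", "status": "OPEN", "priority": "HIGH"}'

# Forwards-compatible client: reads the members it knows about, ignores the rest.
def task_status(response_json)
  JSON.parse(response_json)['status']
end

# Brittle client: validates the API's shape on its own (a "Do Not").
# The moment the service adds a member, this starts raising.
def strict_task_status(response_json)
  data = JSON.parse(response_json)
  unknown = data.keys - %w[taskId status]
  raise "Unexpected members: #{unknown}" unless unknown.empty?
  data['status']
end

task_status(V1_RESPONSE)        # => "OPEN"
task_status(V2_RESPONSE)        # => "OPEN", still fine after the addition
strict_task_status(V1_RESPONSE) # => "OPEN"
strict_task_status(V2_RESPONSE) # raises, even though nothing was removed
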

There are two skills to sharpen to help deliver a great product within these constraints: designing for feature growth, and adding new usage paths.

Design for Growth

Imagine a List-style API for a TODO-style task service. At launch, you expect (and may even enforce) that each customer will only have a small number of total tasks, and you'll only support filtering by, say, a Status field. You might design an API like so:

Inputs:
- Status: Enum<String>

Outputs:
- Tasks: List<Task>

Given those design expectations, this will totally work. But you have a couple of problems coming, and upfront design where you think about the future growth of your API can save you a massive headache down the road. Perhaps over time you'll need to support pagination (because a customer might have hundreds of tasks and you don't want to return them all in a single giant response object) and additional filters as you add dozens of fields. With the above design, you'll be stuck returning all tasks for all time (or making a breaking change to add pagination later, breaking the implied design contract), and each new filter would need to be a separate input. You might instead design an API like this up front, anticipating the ability to grow without the need for breaking changes:

Inputs:
- Filters: Map<String, String>
- NextToken: String
- MaxItems: Integer

Outputs:
- Tasks: List<Task>
- NextToken: String

For a minimum viable product that behaves in practice like the first API, the second design will work perfectly well, and it will support future growth in functionality.
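
As a rough illustration, here is how a caller might use that second design on day one and then later on. The list_tasks function and its parameters below are hypothetical stand-ins for whatever transport your service actually uses, not a real client.

# Hypothetical stand-in for the real service call; here it just fakes one page.
def list_tasks(filters: {}, next_token: nil, max_items: 50)
  { 'Tasks' => [{ 'Status' => 'OPEN' }], 'NextToken' => nil }
end

# Day one: behaves just like the simple design, one filter and no paging needed.
open_tasks = list_tasks(filters: { 'Status' => 'OPEN' })['Tasks']

# Later: extra filters and pagination arrive without a breaking change,
# because the input shape already had room for them.
all_tasks = []
next_token = nil
loop do
  page = list_tasks(
    filters: { 'Status' => 'OPEN', 'Assignee' => 'alex' }, # new filter key, same shape
    next_token: next_token,
    max_items: 100
  )
  all_tasks.concat(page['Tasks'])
  next_token = page['NextToken']
  break if next_token.nil?
end
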

New Usage Paths

Designing up front is not always enough. One design I've shipped and later needed to rework is the TableMigration concept in the aws-record library. It was an attempt to bring "migration" concepts familiar to Ruby developers into the library. The problem is, Amazon DynamoDB is not a relational database, and the way it deals with configuration and schemas is significantly different from that of a relational database. But it shipped, and once it shipped, simply removing it and saying "Sorry, there's a better way" wouldn't do. It worked, and if you had already built around it, it should continue to work.

The way around this was to keep supporting TableMigration, but to also ship, as a separate class, the better way of doing it. When TableConfig launched, it provided a declarative syntax and handled create-or-update logic on its own, allowing you to define your table as it should be while it handled the mechanics of creation on your behalf.
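
Here is a sketch of what the two paths look like side by side, based on the usage shown in the aws-record README; exact option names may have shifted between versions, and actually running this would create or update a real DynamoDB table.

require 'aws-record'

class Task
  include Aws::Record
  string_attr :task_id, hash_key: true
  string_attr :status
end

# The original path, still supported for everyone who built around it:
# an imperative, relational-style migration object.
migration = Aws::Record::TableMigration.new(Task)
migration.create!(
  provisioned_throughput: {
    read_capacity_units: 5,
    write_capacity_units: 2
  }
)
migration.wait_until_available

# The newer path, shipped alongside rather than in place of the old one:
# declare the table you want, and TableConfig figures out create-or-update.
table_config = Aws::Record::TableConfig.define do |t|
  t.model_class(Task)
  t.read_capacity_units(5)
  t.write_capacity_units(2)
end
table_config.migrate!
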

In this manner, you can work around past design errors without breaking any of your users. Over time this creates some cruft and technical debt, but with that tradeoff comes stability for your users.

Major Version Bumps - Eventually

Technical debt builds as you support new code paths to deliver new features without breaking existing customers. Keep a running list of these workarounds, and when the list gets long, it may be time for a new major version that addresses them.

The key here? When you launch a new major version, you keep supporting both the new and the old versions for months at the very least, ideally more than a year, to let people transition.

Choosing Stability

When you take pains to avoid shipping breaking changes, what you're really doing is providing predictability and stability as features. When I look at a product's CHANGELOG and see that it avoids breaking changes, I feel much more confident using that tool or service in production.

It's not a cost-free tradeoff, but I've come to believe through hard-won experience that it's absolutely worth it. Your users, who get to enjoy their vacations with a bit less worry about unpredictable outages, will agree.