The missing ops-dev feedback loop

Establishing a working resilience practice

Uwe Friedrichsen

I recently had a short discussion with a Product Owner after a talk I gave about resilient software design. His question was: “How can I motivate developers to care more about resilience?”

I found it interesting that a Product Owner (PO) approached me with that question because quite often POs are only focused on feature delivery and thus keep developers from building more resilient applications. Here it was the other way round – kudos to this Product Owner!

Still, the question remains: How to motivate developers to care more about resilience?

Forces against caring about resilience

In my blog post about the 100% availability trap, I wrote about some reasons why people tend to fall for the 100% availability trap and discussed a few potential countermeasures.

All the reasons I discussed there also apply to resilient software design in general:

  • If you do not understand distributed systems,
  • if you are shielded from them via standards and products that make remote communication “feel” like local communication,
  • if you are shielded from the production environment in general where the failures happen,
  • if you do not understand the math behind distributed systems failure modes,

you are not only prone to the 100% availability trap. Most likely, in such a setting you will not pay a lot of attention to resilient software design in general.
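To give a feeling for the math mentioned in the list above, here is a minimal sketch. It assumes that a request needs all involved parts to work and that the parts fail independently of each other, which is a simplification. The function name and the 99.9% figure are illustrative assumptions, not numbers from a real system.

```python
def serial_availability(part_availability: float, parts: int) -> float:
    """Availability of a request that needs all `parts` to work.

    Assumes independent failures and identical availability per part.
    """
    if not 0.0 <= part_availability <= 1.0:
        raise ValueError("part_availability must be between 0 and 1")
    if parts < 0:
        raise ValueError("parts must not be negative")
    # Every part must work, so the individual probabilities multiply.
    return part_availability ** parts


# Illustrative only: each part is assumed to be available 99.9% of the time.
for parts in (1, 10, 50):
    print(parts, f"{serial_availability(0.999, parts):.4f}")
```

Under these assumptions, ten parts at 99.9% each leave you with roughly 99.0% overall, and fifty parts with roughly 95.1%. The more remote dependencies a request touches, the less you can ignore failures.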

To make things worse, most developers are rewarded for delivering features as fast as possible, implementing an industrial software delivery mode. I will skip the discussion of why this software delivery mode is counterproductive almost everywhere today. If you are interested in the discussion, you may want to read my little blog series about understanding the consequences of uncertainty.

But even if this software delivery mode is highly outdated, cultural inertia in the companies ensures that we still see it in many places, making it hard for software engineers to care about resilience even if they would like to.

Industrial thinking amplifying the problem

Given the situation described above, it is no surprise that most software engineers do not pay attention to resilient software design. And the question still remains: How can we improve the situation?

In my blog post about the 100% availability trap, I sketched some ideas. The first of those ideas was to establish what I call the “ops-dev feedback loop” 1. I did not write much about the idea in that earlier post. Instead, I promised to discuss it in more detail in a later post. Well, here we are!

Before we jump into the solution, let us briefly dive a bit deeper into the problem.

In most enterprises, especially those that still stick to an industrial software delivery mode, you find a big wall between the development and the operations department – the “ops-dev wall”. Often this division between the departments goes up to the top executive level, the CIO being the first person who faces the needs and demands of both departments.

To make it even harder to get through (or over) the wall, often the only inter-department communication paths provisioned are the ones the software delivery process defines. All other communication must follow the chain of command, i.e., up to the CIO and back down into the other department. As a consequence, communication between the departments outside the narrow channels defined by the software development process is at least discouraged, often even actively suppressed.

This type of organization of IT departments is rooted in the division-of-labor approach of industrial production organizations, having scaling of production and cost-efficiency as primary goals. This kind of highly divided labor, assigned to highly specialized worker pools, controlled by a clear chain of command enables maximum scalability of production while keeping the costs low.

So, this kind of organization has its justification – or at least, it had. You can only scale production of complicated products in a cost-efficient way using such an organization. But as I discussed in an earlier post, software development crossed the line from complicated to complex a long time ago.

And you cannot cost-efficiently scale production of complex products this way. To act successfully in complex environments, you need a completely different approach that takes the uncertainty resulting from the task’s complexity into account. Still, due to cultural inertia in most companies, we see the outdated division-of-labor approach and the resulting ops-dev wall a lot.

Overall, this approach has a number of drawbacks. In the context of this post, we will focus on one crucial drawback: the ops-dev wall.

The ops-dev wall

As we have discussed in the blog series why we need resilient software design, in a distributed environment we cannot solve all robustness and availability issues at the infrastructure level alone. We also need to take care of resilience at the application level.

The infrastructure level can help to support resilience. Contemporary tooling offers quite a lot of support, from observing timeouts and autoscaling to rate limiting, smart rollout strategies, and more. Still, it cannot completely shield the application level from the non-deterministic failure modes of distributed systems.

E.g., it cannot know what to do at a functional level if a timeout hits, a circuit breaker trips or a retry fails. This must be handled at the application level. Complete classes of failure modes like response failures can only be detected at the application level. The same is true for out-of-order message arrival that can happen due to successful reconciliation attempts of the infrastructure tooling. Some essential measures like providing idempotent remote functions can only be implemented at the application level. And so on.
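As an illustration of one such application-level measure, here is a minimal sketch of an idempotent remote function. It assumes the caller sends a unique request ID with each logical request and reuses that ID when it retries. The class, the method and the in-memory store are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AccountService:
    """Hypothetical service; all names are illustrative assumptions."""

    balance: int = 0
    # Maps request IDs to the result already returned for that request.
    _processed: Dict[str, int] = field(default_factory=dict)

    def deposit(self, request_id: str, amount: int) -> int:
        """Apply a deposit at most once per request ID and return the balance."""
        if request_id in self._processed:
            # Duplicate delivery (e.g., a retry after a lost response):
            # return the stored result instead of applying the change again.
            return self._processed[request_id]
        self.balance += amount
        self._processed[request_id] = self.balance
        return self.balance


service = AccountService()
first = service.deposit("req-1", 100)
retry = service.deposit("req-1", 100)  # same request ID, e.g., after a timeout
print(first, retry, service.balance)
```

The retry does not change the balance a second time. In a real system, the store of processed request IDs would need to be durable and the check-and-update step atomic. Both are left out here to keep the sketch short.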

In short: You also need to implement parts of your robustness and availability measures at the application level.

Now let us assume developers are aware of this and start to implement resilience measures. Implementing resilience measures is a lot like optimizing the user experience (UX) of an application: You never get it right the first time.

Instead, it is a continuous build-measure-learn cycle:

  • You decide on a measure, build it and release the application to production.
  • You measure its effects. How does the new measure affect availability?
  • You learn from the results of measuring. Did it work as expected? Shall we keep it, improve it, or drop it and try something else?

And you repeat this cycle over and over again. You are never done. Instead, you work to improve your system availability and robustness step by step by step.
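The “measure” step of this cycle can be sketched as follows. The sketch assumes that you can obtain request counts from production for a time window before and after a resilience measure was released. The type, the function and the numbers are illustrative assumptions, not data from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed in production during one time window."""

    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        if self.total_requests == 0:
            raise ValueError("no requests observed in this window")
        return 1.0 - self.failed_requests / self.total_requests


def availability_change(before: Window, after: Window) -> float:
    """Change in availability in percentage points (positive = improvement)."""
    return (after.availability - before.availability) * 100.0


# Made-up numbers for illustration only.
before = Window(total_requests=1_000_000, failed_requests=2_000)
after = Window(total_requests=1_000_000, failed_requests=500)
print(f"{availability_change(before, after):+.2f} percentage points")
```

Such a plain before/after comparison ignores differences in traffic and other changes released in the same period. It is only a starting point for the “learn” step, not a proof that the measure worked.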

But how can you measure with the ops-dev wall in place?

Build is done at the dev side of the wall. Measure needs to take place at the production side. And learn would be at the dev side again.

The ops-dev wall cuts right through the required build-measure-learn cycle. Without access to production metrics developers have no idea if their measures improved application resilience or not. Instead, they figuratively shoot in the dark. They implement something without the faintest idea if it is useful or not.

Measuring the effectiveness of resilience measures in a QA environment, as some people then suggest, does not solve the problem. In a QA environment you can test the basic functionality of your measures – if they work at all. But you cannot check their effectiveness in real life. You need production traffic and load and all the complex and unforeseeable interaction patterns resulting from it to understand how well (or not) your resilience measures work.

You should test your measures in a QA environment for sure. But you need the production numbers to learn and improve. Without those numbers, I am blind as a developer regarding application resilience.

With an ops-dev wall in place, it is no wonder that developers tend not to care about resilience. Even if I am totally motivated in the beginning, without any feedback regarding the effectiveness of my measures I will eventually lose interest and motivation.

It is a bit like sending birthday presents to a person and never receiving any feedback: No feedback if the person was pleased to receive a present. No feedback if the person liked the present. Not even feedback if the person received the present at all. You can be a really selfless person. Without any feedback, you will become frustrated eventually and lose the motivation to send further presents.

And regarding resilience measures, missing feedback is even more critical because without feedback you have no idea if you did the right thing or not.

This is like being a doctor trying to help a patient to become healthier. You prescribe medication to the best of your knowledge and belief but never get any feedback how the patient responded to it: Did it help? Did it not help? Shall I continue prescribing the medicine? Do I need to try something else? Is the patient still alive??? You have no idea if you are on the right track or not. You shoot in the dark – which is also very frustrating.

The ops-dev feedback loop

So, feedback is essential – not only for keeping the developer’s motivation up but also to enable them to be effective instead of guessing and shooting in the dark.

Also, with an ops-dev wall in place, operations typically is left alone with the task of “guaranteeing” the availability of the applications which – as we have discussed before – is impossible at the infrastructure level alone.

But being deprived of access to development, they are doomed to fail in the highly distributed system environments we face today. And in the worst case, they will try to prevent the introduction of contemporary paradigms because they do not know how to “guarantee” availability with them. 2

The only effective way to resolve this undesirable situation is to tear down the ops-dev wall and to establish a working ops-dev feedback loop:

  • Developers must be able to see how their resilience measures work in production in order to continuously improve them.
  • Operations must be able to provide feedback where they experience problems regarding the applications’ availability and robustness.
  • Together they must learn how to add and change functionality all the time without compromising availability.

There is no panacea for implementing the feedback loop. There are different options, e.g.:

  • A quite well-proven option is to adopt DevOps (based on its original meaning) in your organization. This will for sure establish the feedback loop, but will also trigger a lot more change. 3
  • If DevOps is too big a change for your organization at the moment, you may want to take a look at Google’s Site Reliability Engineering (SRE). It has a much narrower focus than DevOps, only focusing on the balance between adding new features and production stability. But it will have similar effects regarding the required ops-dev feedback loop. 4
  • Or you may know an option that might work better for your organization than the two aforementioned approaches. But make sure you establish the required feedback loops in a sustainable way! 5
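For the SRE option above, the balance between new features and production stability is commonly expressed as an error budget: the share of requests that may fail without violating an agreed service level objective (SLO). The following minimal sketch shows the arithmetic. The function name, the SLO value and the request counts are illustrative assumptions.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget that is left (negative = overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests that may fail under the SLO.
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Made-up numbers: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget left")
```

A number like this only helps if both sides of the wall can see it and act on it together, which is exactly the feedback loop this post is about.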

The key point is to establish the feedback loop and tear down the ops-dev wall if it exists. You need the permeability. You need the collaboration. You need the transparency. You need the feedback loop: How well does my application behave in production? How does it respond to unexpected problems? What did I miss? Otherwise, your IT system resilience will always be questionable at best.

Summing up

If you want developers to care about resilient software design, you need to establish an ops-dev feedback loop. Without that loop – especially if an ops-dev wall is in place – developers do not get any feedback regarding the effectiveness of their resilience measures. This is demotivating and it prevents them from continuously improving the resilience of their applications.

A panacea for establishing such a feedback loop does not exist. There are multiple options. The best-known options are going for DevOps (based on its original definition, see footnote #3 for more information) or SRE. But other approaches are also possible. The key point is to establish the loop – no matter how you do it.

I hope I gave you some ideas to ponder and that this post motivates you to establish the ops-dev feedback loop in your company if it does not yet exist. Good luck!

  1. I deliberately use the unusual “reversed” order of the terms “dev” and “ops”. The reason is that “DevOps” and all terms derived from it are by now so overloaded with meanings that most likely your brain would have decided what this feedback loop idea is about before you had read the first sentence of this post. No blame intended – that is totally normal human behavior. To break that interpretation shortcut, I use the reversed word order and a different spelling: ops-dev. This way, the interpretation shortcut mechanism in our brains will not immediately trigger. Sometimes we need to take a detour to keep our brains from jumping to premature conclusions … ;) ↩︎

  2. If you additionally incentivize your dev and ops departments for completely contradictory goals, in the worst case connected to salary components, like rewarding development for speed of feature completion and operations for availability, the problems are maximized. Unfortunately, this is exactly what we can still see in many companies. ↩︎

  3. For a narrative introduction to the ideas of DevOps, you may want to read “The Phoenix Project” by Gene Kim, Kevin Behr and George Spafford. For a more hands-on introduction, you may want to read “The DevOps Handbook” by Gene Kim, Jez Humble, Patrick Debois, John Willis and Nicole Forsgren. ↩︎

  4. Some Google employees have written several good books about SRE. You can read the books online. If you prefer a printed version of the books or an ebook edition, you can also follow the link provided and find more details on how to purchase them. ↩︎

  5. The problem with most homemade approaches is that you will find little or no advice regarding good practices and the like. You have to figure it all out by yourself, which greatly increases the risk of failing. If you are successful, on the other hand, you have an approach that is perfectly tailored to your needs. Going for a homemade approach is a bit like betting on a number at roulette: high risk of losing, but if you win, you win big. ↩︎