Simplify! – Part 14
In the previous post, I discussed several relevant drivers that lead to ever-increasing complexity of the IT landscape, creating layers over layers of technology over time that never get cleaned up. As the post would have become too long otherwise, I left out the mitigation options.
This is the content of this post.
Improving the situation
As written in the previous post, there are certainly more drivers of ever-increasing IT landscape complexity, but the discussed ones are probably the most widespread and most harmful ones.
Thus for a last time in this blog series, let us ask the question: How can we do better?
As always, I do not think that there are easy answers. Otherwise, we most likely would have implemented them a long time ago. Still, I think there are some options to improve the situation. Here I focus on a few measures I consider the most powerful.
Establish a continuous improvement process
Probably the biggest problem is that there is never time to tackle growing complexity. Tackling complexity is always crowded out by “urgent” problems absorbing all available capacity all the time because they are “urgent” (see also the section “The no-time to clean up fallacy” in the previous post).
We see many companies that have something I would call a “cult(ure) of urgency”, where every task either is priority 1 or never gets done (leading to 95%+ of tasks getting assigned priority 1), where it is important to obediently render homage to the cult by demonstrating frantic activity all the time, where completed tasks on a status report are more important than actually solving a problem, and so on.
Even if the setting is not as bad as I just described, we still see “urgent” crowding out “important” all the time. The best (and probably only) way to stem the typical never-ending stream of “urgent” tasks is to explicitly reserve time for non-urgent, but important activities.
A good compromise would be reserving 20% of the IT capacity for improving the landscape and 80% for the normal “urgent” work. For maximum effectiveness, the 20% should be organized as a continuous improvement process.
Of course at least at the beginning this will create a new stream of continuous outcry and resistance:
- But we do not have time for this!
- But there are so many pressing problems!
- But the deadline!
- But it does not create any business value!
- But …
- But …
- But …
Depending on the company you work for it will be easier or harder to deal with this stream of resistance, pressure and sometimes outright threats. Yet, there are always deadlines. There are always pressing problems. But neglecting the ever-degrading state of the IT system landscape makes it less and less likely to meet the deadlines, to solve the pressing problems.
Additionally, if everything is urgent, nothing is urgent. To illustrate this claim, a little anecdote from my past: I once had a recurring discussion with a former boss. Whenever my next vacation came close, he approached me to tell me that it would be such a bad time for vacation and to ask whether I could shift it because the project situation was so critical. It was always “critical”. As we usually ran several projects in parallel with quite a big crew, there was always at least one project with pressure and urgent problems.
After the third time we had this discussion I asked him: “When would be a good time for vacation?” He pondered that question for a moment and honestly replied: “Probably never” (after all he was a good boss, always being fair and honest). My response was: “If it is always a bad time for vacation, it is always a good time for vacation.” He pondered that response for a moment, laughed and said: “Yeah, probably you are right.” We never had that discussion again.
This is what I meant with my claim: If it is always a bad time to improve your system landscape, it is always a good time to do it. The reference system is just broken. If 20% of the IT staff left or were sick for a longer period of time, the situation would be the same, but you would not face much resistance. Probably, people would complain a bit more about a lack of responsiveness of IT than they already do, but that would probably be it.
To make the point clearer regarding the broken reference system: You could ask for 25% more employees to make sure you get the improvement work done without slowing down all the “urgent” stuff (25% more staff means the improvement team makes up 20% of the enlarged department). It is clear that the continuous deterioration of the IT landscape is a huge problem, everybody feels it and everybody would be happy if IT became more responsive. The additional staff would pay for itself over time because IT becomes faster again, has fewer errors and outages, and so on.
Still, most likely people would look at you as if you had gone completely insane: How can you even consider adding more people? Well, don’t you always complain that IT is too slow? I offer you a way out here. But not by creating superfluous costs. And so on.
Long story short: The reference system is just broken. An arbitrary, historically grown number of employees in the IT department is considered “correct”, no matter if it gets the job done or not. Additionally, anything that is not immediately related to the here and now is considered superfluous. No long-term thinking, not even mid-term thinking. Just broken.
Still, every year you do nothing you probably become 5% - 10% slower in IT because your landscape continuously deteriorates and its complexity grows. If your improvement work just stopped the deterioration, i.e., kept the complexity at the same level, you would already be better off, and after two or three years the gains due to speed not lost would offset the whole improvement team.
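To make the arithmetic behind this claim tangible, here is a minimal sketch of a toy model. It is an illustration under strong assumptions, not a measured result: the slowdown compounds yearly, the reserved capacity fully stops the deterioration (and nothing more), and the function names are made up for this example.

```python
def delivery_capacity(years, decay_per_year, reserved_share):
    """Share of the original delivery capacity available in each year.

    decay_per_year: fraction of speed lost per year if nothing is done.
    reserved_share: fraction of capacity reserved for improvement work.
                    Assumption: any reserved share > 0 fully stops the decay.
    """
    capacity = []
    speed = 1.0
    for _ in range(years):
        if reserved_share == 0:
            # Nobody fights complexity, so speed erodes a bit more each year.
            speed *= 1 - decay_per_year
        # Only the non-reserved part of the capacity does "urgent" work.
        capacity.append(speed * (1 - reserved_share))
    return capacity


def break_even_year(decay_per_year, reserved_share, horizon=20):
    """First year in which reserving capacity delivers at least as much
    as doing nothing, or None if that does not happen within the horizon."""
    do_nothing = delivery_capacity(horizon, decay_per_year, 0.0)
    improve = delivery_capacity(horizon, decay_per_year, reserved_share)
    for year, (idle, active) in enumerate(zip(do_nothing, improve), start=1):
        if active >= idle:
            return year
    return None
```

In this toy model, `break_even_year(0.10, 0.20)` returns 3: with a 10% yearly slowdown, reserving 20% of the capacity delivers more than doing nothing from the third year on. At the lower end of the range (`break_even_year(0.05, 0.20)`) it takes until year 5. If the improvement work actually reduces complexity instead of merely freezing it, the break-even moves earlier.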
Overall, while knowing that you most likely will face a lot of resistance, I think it is worth going this way if you do not want to get stuck in a reinforcing downward spiral.
Allow for experiments
Many alternative options are not explored because environments to test them are too hard to provision and set up. Assume it would be easy to set up a second production environment next to the system you do not dare to switch off. The second environment would receive the same inputs, but not create any effects outside the environment. You could easily run a new implementation and compare the results – all without huge efforts, but more or less with a single click.
In such a setting, you would dare to make a lot more experiments because they are easy to conduct and mostly risk-free. Cloud environments support such a setup. Thus, it is not a totally far-fetched utopia. In such an environment, decommissioning old systems and trying a lot of other improvement experiments would become a lot easier. This would not only help against the fear of shutting off old systems, it would also help to test new options more easily and with less risk as technology evolves.
Thus, create an infrastructure that allows for easily exploring alternative ideas in a risk-free way. The technology is available, the concepts are known and the benefits are huge, once you have this in place.
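The core of such a side-by-side experiment can be sketched in a few lines. The following is a minimal illustration, assuming that both implementations can be called as side-effect-free functions on recorded or mirrored inputs. All names (`shadow_run`, `ShadowReport`, the two discount functions) are hypothetical and only serve the example.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowReport:
    total: int = 0
    # Each entry is (input, result of current system, result of candidate).
    mismatches: list = field(default_factory=list)

    @property
    def match_rate(self) -> float:
        return 1.0 if self.total == 0 else 1 - len(self.mismatches) / self.total


def shadow_run(inputs, current, candidate) -> ShadowReport:
    """Feed the same inputs to both implementations and record differences.

    The result of `current` stays authoritative. The candidate only runs
    "in the shadow", so its failures are recorded instead of propagated.
    """
    report = ShadowReport()
    for item in inputs:
        expected = current(item)
        try:
            actual = candidate(item)
        except Exception as exc:  # a crashing candidate counts as a mismatch
            actual = exc
        report.total += 1
        if actual != expected:
            report.mismatches.append((item, expected, actual))
    return report


# Hypothetical example: an old discount rule and its planned replacement.
def legacy_discount(total_cents):
    return total_cents * 10 // 100 if total_cents >= 10_000 else 0


def new_discount(total_cents):
    return total_cents // 10 if total_cents > 10_000 else 0


report = shadow_run([5_000, 10_000, 20_000], legacy_discount, new_discount)
```

Here `report.mismatches` contains the single entry `(10000, 1000, 0)`: the new implementation treats the boundary case differently, which is exactly the kind of undocumented behavior that makes people afraid of switching off the old system. In a real landscape, the hard part is not this comparison loop but mirroring the inputs and suppressing outside effects of the second environment.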
Calculate the price of doing nothing
Often clean-up or improvement activities to tackle excessive complexity of the IT system landscape are not taken into account because they “do not create business value”. While it is true that improvement activities do not create an immediate business value, I think this reasoning is way too shortsighted.
As discussed in the context of the continuous improvement process there is a price you pay for doing nothing. Not explicitly fighting complexity means increasing complexity all the time. I discussed that before in the context of legacy systems:
Meir Lehman pointed out as early as 1980 in his great paper “Programs, life cycles, and laws of software evolution” 1 that most programs need to be adapted continuously to respond to the ever-changing needs of their environment.
In conjunction with that observation Lehman framed the following law:
The law of increasing complexity
As an evolving program is continually changed, its complexity, reflecting deteriorating structure, increases unless work is done to maintain or reduce it.
This means that not investing in maintaining or reducing the complexity of the IT system landscape leads to systems that are continuously harder to maintain, continuously slower and more expensive system evolution, continuously more bugs, continuously more failures in operations, and so on. In plain words:
It costs you a lot not doing anything to tackle growing IT system landscape complexity.
If the costs and the risks of not fighting complexity are not contained in the business value calculation, the business case is pointless and will necessarily lead you to false conclusions. Therefore, a different calculation schema is needed that explicitly takes the costs and risks of unchecked growth of complexity into account. 2
Know when to move on
A facet of this discussion is the decision-making process of whether to move from a custom-built solution to a product or commodity solution. Especially with the rise of managed services we see more and more business and technical functionalities that can simply be rented instead of building, running and maintaining them on your own.
What does it mean to stick to the custom-built solution? How much development and operations capacity is blocked by maintaining and running the solution? Which other relevant activities are slowed down or blocked due to this? Do you manage complexity growth of the solution? If not (which is the most likely answer), how much would it cost to do so? How well do you manage security risks of the solution, especially if you use a lot of OSS libraries and frameworks (which is very likely)? Do you keep track of all CVEs? Do you patch your applications immediately? How much does it cost? How big are the risks?
And so on. The key point is that you need to understand the TCO (total cost of ownership) and TRO (total risk of ownership) to make a sensible decision. Unfortunately, most of the time nobody has ever considered the questions listed before. Only “sunk costs” and immediate “business value” are taken into account, necessarily always leading to the same decision: Not doing anything.
This does not mean that you should move to a product or commodity solution as soon as one is available. But it means that you need to create a complete picture considering all types of costs and risks for the options available before making a decision. You also need to keep in mind that not doing anything is also an option that needs to be assessed. This option often is not assessed because it is considered to be “okay” by default – which can be a very dangerous assumption.
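What assessing all options, including “do nothing”, could look like is sketched below. This is a deliberately crude model, not an established calculation schema. Risk is reduced to a single expected yearly loss, complexity growth to a fixed yearly cost increase, and all names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    upfront_cost: float          # migration, licenses, training, ...
    annual_run_cost: float       # development and operations capacity
    annual_cost_growth: float    # e.g. 0.10 if unmanaged complexity adds 10% per year
    annual_expected_loss: float  # probability-weighted cost of incidents (the risk side)


def total_cost_and_risk(option: Option, years: int) -> float:
    """Sum of cost and expected loss over the given horizon."""
    total = option.upfront_cost
    run_cost = option.annual_run_cost
    for _ in range(years):
        total += run_cost + option.annual_expected_loss
        # Unchecked complexity makes every following year more expensive.
        run_cost *= 1 + option.annual_cost_growth
    return total


def rank_options(options, years):
    """Cheapest option (cost plus expected loss) first."""
    return sorted(options, key=lambda o: total_cost_and_risk(o, years))


# Invented numbers: "do nothing" is an option like any other.
options = [
    Option("do nothing", upfront_cost=0, annual_run_cost=100,
           annual_cost_growth=0.10, annual_expected_loss=20),
    Option("managed service", upfront_cost=150, annual_run_cost=80,
           annual_cost_growth=0.0, annual_expected_loss=5),
]
```

With these made-up figures, `rank_options(options, 1)` puts “do nothing” first, while `rank_options(options, 5)` puts the managed service first. The flip illustrates the point of this section: a calculation that only looks at the immediate costs will always favor not doing anything, while a longer horizon that includes growing costs and risks can lead to a different decision.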
Architecture without an end state
Finally I would like to discuss what Michael Nygard calls “architecture without an end state”.
“Architecture without an end state” is the sensible alternative to the big cleanup initiatives that never work out. Generally speaking, it means that we need not only to accept, but to embrace heterogeneity. It means accepting that we will never replace systems that took several thousand person years to build in virtually no time with some magic initiative. As stated in the previous post, such an initiative would most likely also take several thousand person years to complete, which usually does not make any sense economically.
Instead, we need to let go of the illusion of big cleanup initiatives and try to find ways to integrate new solutions as well as possible into an existing heterogeneous system landscape, trying to increase overall complexity as little as possible.
That is a completely different way of thinking than coming up with the big initiative to clean up the “mess”. You should always remind yourself that the “mess” was created by people who thought and acted as you, who were as intelligent as you and who were exposed to the same conditions as you. They also had the best intentions in everything they did. And yet you throw up your hands in horror when you see the result. Thus, coming up with yet another “Let us clean up the mess once and for all” initiative is not the way to go.
The key insight is that architecture will never reach an end state where everything adheres to a unified paradigm. It will always be a bit “messy”. It will always evolve. There will always be a mix of paradigms, tools and technologies. So instead of hoping for the next and next and next magic initiative to clean that “mess” up for good, we need to manage the “mess”, make sure that complexity does not grow unnecessarily. We already discussed the continuous improvement process. Here I would like to add two more recommendations:
- Go for humble solutions that integrate well with the rest of the world. No matter if you are early or late in the technology evolution cycle, always be aware that your solution will not exist in isolation. There will always be a huge IT landscape it needs to work with. One aspect of this fact is that it always makes sense to strive for the simplest possible solution as there will always be more than enough complexity already.
- Balance innovation and complexity. Sometimes you need new technology to address new types of needs. But be careful with picking up too much new technology, especially bleeding edge, low abstraction technologies without a pressing reason (defined in business value terms). But also be aware that it is important not to cling to obsolete technologies without a reason (again, in terms of business value). This is an ongoing and anything but simple consideration.
Architecture without an end state is a huge topic and I could continue discussing it for a long time, discussing contributing ideas like, e.g., replaceability, good encapsulation of functional concepts, low coupling, easy discoverability and access, integrability. Very likely I will discuss it in more detail in some future posts, but in this post I will leave it here. Understanding and accepting the underlying idea is the first vital step. 3
In the previous post and this post we have discussed layers of technology that build up over time in an IT system landscape and never get cleaned up as a final source of accidental complexity (regarding this blog post series).
We have seen several drivers that lead to more and more technology layers piled up in the IT system landscape in the previous post:
- Missing responsiveness of IT – Slow response times of the IT departments leading to solutions bypassing the IT department, not adhering to existing standards.
- Fear of decommissioning old systems – Not knowing what existing systems do and the fear of missing something relevant if switching them off, resulting in never decommissioning them.
- Missing technology evolution – Not understanding technology evolution and thus missing the point in time when to replace a complex custom-built solution with a product or commodity solution.
- The no-time to clean up fallacy – A continuous flow of “urgent” tasks always inhibiting the important task to manage the complexity of the system landscape.
- The big clean-up initiative – The recurring initiatives to clean up the system landscape once and for all with a single new paradigm, never getting completed, always leaving another layer of complexity.
As I have written in the previous post, there are more drivers, but these are probably the most important – and harmful – ones.
In this post we have looked at some mitigation options to tackle the aforementioned drivers:
- Establish a continuous improvement process – Reserve a fixed percentage of capacity for continuously improving the system landscape and tackling ever-growing complexity in many small steps.
- Allow for experiments – Provide the infrastructural means that make it easy to test new ideas; this is very useful for improving the landscape and reduces the fear of switching off old systems.
- Calculate the price of doing nothing – Explicitly calculate the rising costs and risks of not addressing the ever-growing complexity of the system landscape and make them transparent.
- Know when to move on – Periodically assess if a complex custom-built solution should be replaced by a product or commodity solution due to technology evolution.
- Architecture without an end state – Accept and embrace that your landscape always will be heterogeneous; design and build your systems in a way that they integrate in such a landscape with the smallest increase of complexity possible.
Again, there are more mitigation options, but the discussed ones are really effective as they point the mindset and discussions in the right direction.
This was the last area of accidental complexity that I wanted to discuss in this blog series. Maybe I missed an area but this series already has become a lot bigger than I would ever have expected. Either I am too critical or IT actually is in a lot worse shape than I had realized at the beginning of this series. Well, probably a mixture of both aspects …
Before I conclude this “Simplify!” series with a summary post including some general considerations, I will insert a post that discusses a topic which also influences the accidental complexity we pile up all the time. It is the observation that we in IT as an industry do not learn. We continuously forget what we have learned and always need to relearn it from scratch. More about this in the next post. Stay tuned …
I will probably discuss this topic in more depth in a future post.
If you do not want to wait until I write in more detail about the topic, I really recommend checking the resources Michael Nygard offers regarding the topic. He has a lot of great concrete advice to offer. To start with, see, e.g., this really good writeup of Michael’s talk about the topic with a link to a recording of the talk.