Simplify! – Part 1
This little post series will be a bit different from the former ones, as it discusses a relatively new train of thought of mine. Hence, in some places the reasoning may still feel a bit rough and not as well thought out as in prior posts. If you encounter such a place, please bear with me – and maybe help improve the reasoning by discussing the topic with me.
The topic I would like to discuss in this blog series is simplification: the problems that avoidable, excessive complexity in our IT solutions causes and what we can do about it.
I think it is a very relevant topic these days, and one that concerns me a lot. That is why I decided to write about it. In this first post, I try to sketch the current situation to motivate the need for simplification in IT. In the next posts, I will dive deeper, analyze the effects of the current situation and discuss concrete simplification options.
IT has become indispensable
Why should a discussion regarding simplification of IT even be relevant? The usual response is something like: “Well, IT is complex. Deal with it. Maybe train people better, create more expert roles, automate things or whatever, but stop whining!”
Unfortunately, this response overlooks a vital point: These days, IT is virtually everywhere.
In the past, most everyday products and services were free of IT. Some of them contained a custom hardware solution that was designed and tested once and then used for many years. But that was basically it.
However, today most products and services are software-supported. This allows for more sophisticated solutions that can be enhanced and adapted in short time frames. Not only mobile phones but also cars, washing machines and amplifiers require regular software updates today. Even the cash register of your favorite coffee shop is usually software-based these days.
And that is just the visible part of the products and services. Most of them rely on backend services hosted in some (cloud) data center that implement big parts of their “smartness”.
Additionally, we do not only interact with “smart” consumer products and services. Often we interact with purely software-based services via our notebooks and mobile devices. There, too, we interact with lots of frontend and backend software.
This means major parts of so-called B2C (business-to-consumer) interaction happen via or are supported by software these days – and the trend is rising. We do not only buy from our favorite online retailer; we also communicate with our banks, our insurance companies, our energy suppliers and other companies via software. Even the communication with our city administration is gradually shifting to the online world.
If we look at B2B (business-to-business) interaction, we face even more software-based communication. Internally, too, most companies are highly dependent on software – and not only for executing their business processes at scale. Software also increasingly sneaks onto the shop floors where the products are built and maintained.
The key observation is: Software surrounds us these days. Software is everywhere.
This also means that more and more aspects of our daily business and private lives rely on working software. If the software does not work, if the respective applications are down, we are in trouble. In short:
Software has become indispensable in our everyday business and private life.
If the software does not work as intended, we have a problem.
A typical IT project today
At the same time, the underlying software systems are getting more and more complex, often on the verge of becoming unmanageable. To illustrate this claim, let us have a look at a quite typical IT project these days:
- It starts with a big requirements document containing every single imaginable requirement that might be of value for the solution to build. These days, it is often called a “backlog”, but at its core it is the same story, just that the requirements are left on a vaguer level. Actually, the vaguer level often leads to an even bigger scope by adding many “epics”, each worth a hundred pages or more of a traditional requirements document disguised in a single sentence.
- Often the requirements are not designed in a way that allows for a natural evolution of the existing solution. Instead, they tend to be based on momentary needs, not aligned with the solution history, sometimes even in plain conflict with the existing solution.
- On an organizational level, we usually find some “scaled agile” variant, more or less mimicking the fine-grained role and specialist structures of traditional enterprise software development approaches. The only difference is that software developers are now expected to work multiple times faster and produce higher quality, based on the aforementioned vaguely sketched requirements. 1
- As a consequence, developers work under constant pressure to get features implemented as quickly as possible, neglecting the effects of the chosen implementation on future system maintenance and evolution. 2
So much for requirements and organization to provide the context. Let us move on to the solution domain where the “juicy” parts for us IT people start, i.e., technology and architecture. Here I describe an exemplary setting with a strong bias towards OSS solutions that I have seen in many variants in recent years 3:
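- The frontend is a single-page application, built with a framework like, e.g., React. Of course, especially in the case of React, several more libraries are required, and Redux (for state management) is just a start.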
- Then we need a whole set of development/configuration/build/deployment tools like node, npm (or yarn), Webpack and Grunt, just to name a few of the usual suspects.
- If we decide to re-use the same code base also for mobile devices, we additionally need a framework like React Native or need to implement an approach like PWA (Progressive Web App). Otherwise, we need to implement dedicated solutions for the mobile devices where we need to pick up additional tools, frameworks and languages again.
- The backend logic is implemented as microservices, using frameworks like, e.g., Spring Boot or Micronaut in the case of Java.
- Depending on the number, structure and complexity of the services in the landscape, we may need to implement dedicated BFF services (Backend For Frontend) as facades for the different types of frontends we need to support.
- Then we need an API gateway as proxy for IAM (Identity and Access Management), rate limiting, standardized request logging and more cross-cutting concerns. We can implement that on our own (which usually is a bad idea) or leverage a product like, e.g., Kong.
- To implement a working IAM, these days usually based on OAuth2 and JWT (JSON Web Token), we need an IAM tool like, e.g., Keycloak.
- Of course, we do not deploy these services “raw”, but put them in Docker containers.
- And we do not start these containers manually or via scripts in production, but use a scheduler like Kubernetes. If you run multiple clusters (which is the norm for bigger installations), the different Kubernetes clusters need to be managed jointly, e.g., by embedding them in solutions like Rancher. Additionally, you use tools like, e.g., Helm and Operators for easier application deployment and administration.
- With microservices, of course, we need to take care of cross-cutting concerns like latency monitoring, retries and error handling, request routing, access control, request correlation, rollout strategies and more. As we do not want to implement that for every service from scratch, we complement our installation with a service mesh like, e.g., Istio or Linkerd.
- As it becomes extremely complex and error-prone to manage configuration parameters in separate files per service, we use a central configuration management solution like, e.g., Consul, ZooKeeper or etcd.
- To reliably operate those distributed service landscapes, we need a good monitoring solution that supports user defined metrics, events and alerting. Thus, we add tools like, e.g., Prometheus or Graphite for monitoring (which both need to be extended by additional solutions to support alerting).
- Additionally, we need log aggregation with flexible options for searching and analysis, e.g., ELK, to support ad hoc analysis of system behavior and post mortem analysis after failures.
- In times of “polyglot persistence”, a single RDBMS is not sufficient anymore. One or more NoSQL or NewSQL solutions, like, e.g., MongoDB or Cassandra are part of the storage landscape.
- Many trendy solutions are built using techniques like event sourcing and CQRS. Hence, they need solutions for handling event streams like, e.g., Kafka or RabbitMQ.
- As we cannot build, test and deploy such a solution manually anymore (please, do not even think about it!), we additionally need a continuous delivery pipeline, often still based on Jenkins today.
- Tests need to be automated (yes, they do!) and we need appropriate tooling for our unit tests, integration tests, user acceptance tests, load tests, resilience tests, security tests, and so on.
- Additionally, we need a source code repository, usually Git, plus additional repositories for artifacts and containers like, e.g., Artifactory or Nexus.
- Finally, we need to provision all the infrastructure needed to run all the applications and data stores, development and test environments, continuous delivery pipelines and more. Here we need IaC (Infrastructure as Code), supported by tools like, e.g., Terraform or AWS CloudFormation for resource provisioning and, e.g., Puppet or Ansible for node configuration. If you do not leverage public cloud resources but run your own on-premises data center, you need to make sure that you can provision your resources via code by leveraging appropriate compute, storage and network virtualization solutions that can be controlled via API.
- Did I mention security? The required continuous infrastructure and application hardening? Not only SIEM (Security Information and Event Management), EDR (Endpoint Detection and Response), VA (Vulnerability Assessment), corporate IAM, firewalls, micro-segmentation or BeyondCorp on the infrastructure level, but all the things needed to get a secure IT landscape? Gives you a headache? Yeah, I know, security is hard. But it is needed nevertheless – these days more urgently than ever. Unfortunately, it is still just an unloved afterthought in most projects.
I stop here. While this might sound a bit over the top at first sight, this is the project reality I have seen in many places. If you also work in “modern software development”, chances are that this listing sounds all too familiar to you.
Explosion of complexity
Actually, this listing or a variant of it is not only reality in many projects; it also only scratches the surface. For example, most companies do not just use one programming language, one architectural paradigm, one tool for a job, one type of ecosystem, but often 5, 10 or more of them. This means you need to multiply the aforementioned list by a factor of 5, 10 or higher to cover the complete IT department.
Additionally, I did not mention all the “legacy software”, standard software solutions and all the other missing bits and pieces that also need to be maintained and operated.
Overall, the complexity of most companies' IT landscapes has continually grown over the years. Especially the developments of recent years, with the rise of smarter frontends, microservices, containers, polyglot persistence, etc., led to an explosion of complexity that most companies struggle to cope with.
If you need proof of the explosion of concepts, tools and technologies, just have a look at the program of an average IT conference from the last few years. Tons of “new” ideas, trends, tools and technologies – every year a bit more.
Or look at the special interest IT conferences. Just a tiny example: There are whole conferences solely about Helm, a package manager for Kubernetes.
Nothing against Helm, but if you take a few steps back and look at the whole picture, you realize it is just a tiny, tiny implementation detail, invisible from the business problem to be solved. And Helm is just a single tool from the Kubernetes ecosystem. There are dozens more “must-know” tools.
Also have a look at Kubernetes itself: Kubernetes conferences easily attract thousands of attendees. Yet, in the end, Kubernetes is nothing but a resource scheduler, something we have known from operating systems for 50+ years, just applied to containers. Again, a tiny implementation detail, invisible from the business problem to be solved.
Now ask yourself: How many of those conferences does a single person need to visit to be able to solve any moderately complex business problem? That is what I mean by explosion of complexity.
To make things worse, we created a self-reinforcing cycle of “new”: In IT we love “new”. If something is new, it is good. “New” sells magazines, “new” fills conference sessions. Therefore, magazine editors and conference organizers prefer new topics. It sells more copies, it attracts more attendees. As they focus on “new”, readers and attendees develop a constant state of FOMO (fear of missing out) because if so many people write or talk about it, it must be important – which reinforces the focus on “new” and creates a self-reinforcing cycle.
In conjunction with the complexity explosion of the recent years, we can summarize the current state of IT like this:
In IT, we are drowning in complexity - and every day it gets a bit worse.
Where does this lead?
If we summarize this post, we see two evolutions:
- IT has become an indispensable part of everyday business and private life.
- The complexity on the IT solution side grows all the time.
I do not know how you feel when you see those two observations next to each other. I get a really bad feeling. From what I observe, most IT departments are on the verge of unmaintainability. Actually, some of them rather feel as if they were already two steps over the edge, just like Wile E. Coyote the moment before looking down. 4
They have stopped understanding their IT landscape many years ago. Their enterprise architecture efforts barely cover the tip of the software landscape iceberg, but completely fail to cover the actual complexity hidden under water, let alone making it comprehensible and traceable.
Every few years the companies start a new initiative, containing new concepts, tools and technologies they barely understand, being sure that this initiative will finally solve all their problems of the past – which of course it does not, just leaving another layer of complexity for the next initiative to come.
Finally, we have all the smaller accelerants: a new library here, a new language (including ecosystem) there, some strange requirements implemented as “hot fixes” because they were “urgent”, and, and, and – all the little things multiplying, resulting in more and more complexity. Eventually, if you need to change anything or – worse – something breaks in production and must be fixed, you can only rediscover this complexity, but you will fail to understand it.
This is why I get a bad feeling whenever I connect the two observations: an IT that our everyday business and private lives depend on is already on the verge of unmaintainability, and it gets pushed further every day. How much longer can this go well (or at least work)? This is a question that really concerns me.
Still, up to this point we just have two observations and a bad feeling. Thus, before jumping to action, we need to analyze the situation in more depth to understand if we really need to take action and where the biggest levers are.
As this post is long enough already, I will leave it here today and give you some time to ponder the observations of this post. In the next post, I will analyze the effects of this situation on people in more depth. Stay tuned …
I know that this description feels quite exaggerated. Unfortunately, it is not nearly as exaggerated as one would hope. With the rise and hype of “Agile”, especially Scrum, quite a few things went wrong and basically led to the situation described in many places. I will discuss the evolution, what went wrong and what we can learn from it in a future post. ↩︎
Actually, most “Agile” approaches only focus on functional requirements, while non-functional requirements (NFRs) are often neglected. This is because NFRs are usually not considered to create any “business value”. Unfortunately, this point of view is wrong. If you should disagree, then I expect you to deploy your next mission-critical application on a 300€ consumer PC under your desk, without implementing any security features at all and without adding any options to scale the application. Why bother? All those NFRs like availability, security or scalability just cost money and do not create business value anyway … ;) ↩︎
Please note that the products and technologies mentioned in the listing are only examples, not recommendations. There are many other products and technologies I could have used instead. This is also the reason why I did not include any links to the mentioned products and technologies. The idea of the listing is to provide an idea of the complexity we face in the technical domain today, not to advertise any products. Feel free to look any of the mentioned products and technologies up if you like, but please keep in mind that I did not recommend any of them. ↩︎
Remember: the Coyote only falls after he realizes that he ran over the edge. ↩︎