Review of Release It!

Release It! Design and Deploy Production-Ready Software by Michael T. Nygard, published by The Pragmatic Programmers.
ISBN: 978-0-9787-3921-8

Introduction

Release It! is a book I have had on my reading list for a few years. I started a new job at Tradera/eBay Sweden in June last year and Release It! felt more relevant to my work than ever before. It’s the first time in my career that I am working on a public site with a lot of traffic (well, a lot for Sweden anyway). Downtime can mean a lot of lost revenue for my company. As I’m on call every second month for a week at a time, downtime also really affects how much sleep I get during that week. So perfect timing to read a book about pragmatic architecture.

System downtime can both cost huge amounts of money and seriously damage the reputation of a company. The incident that springs to mind when I think about damage to a company’s reputation is the BlackBerry three-day outage in 2011. It was the beginning of the end for a previously very successful company. Release It! is all about building a pragmatic architecture and infrastructure to avoid one of these doomsday scenarios for your company.

The book is organised into four different parts and roars into action with the first part: Stability, or how to keep your system up 24 hours a day.

The Exception That Grounded An Airline

The author Michael Nygard has worked on several large, distributed systems and uses his war stories to illustrate anti-patterns.

The first story is of a check-in system for an airline that crashes early in the morning – peak time for check-ins. First, he describes the process of figuring out which servers were the problem (they had no errors in their logs). They then took the decision to restart them all. This restart took three hours and caused so many flight delays that it made the TV news. The post-mortem involved thread dumps and decompiling production binaries (there was bad blood between the developers and operations so they weren’t too keen to give him the source code). The bug turned out to be an uncaught SQLException that only ever got triggered during a database failover.

I love this way of illustrating technical concepts. The stories are great and make the book a very easy read despite it being quite technical. And the post-mortems after the incidents are like CSI for programmers.

The lesson learned in this case was that it is unlikely that you can catch all the bugs, but you CAN stop them spreading and taking down the whole system. This was a simple bug, but one unlikely ever to be caught in testing. Who tests that their code works during database failovers and has that setup in their test environment?

However, by failing fast and using patterns like timeouts or circuit breakers you can stop the problem spreading through the whole system and crashing everything.

Cascading Failures vs Fail Fast

After setting the scene, the author then introduces a group of Stability anti-patterns. They have great names like Cascading Failures and Attacks of Self-Denial. These anti-patterns represent all the ways you can design an unstable system.

To counteract these anti-patterns there is a group of Stability patterns. These include some very useful ones, like the practice of using timeouts so that calls don’t hang forever. Fail Fast counteracts the Cascading Failure anti-pattern: instead of an error being propagated from subsystem to subsystem, crashing each one in turn, the idea is to fail as fast as you can so that only the subsystem where the error occurred is affected.
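To make the timeout and Fail Fast ideas concrete, here is a minimal sketch in Python (not code from the book): the remote call is bounded by explicit timeouts and any failure is converted into an immediate, well-named error at the integration point instead of hanging a thread. The inventory service and its URL are invented purely for illustration.

```python
import requests

class DependencyUnavailable(Exception):
    """Raised immediately so the caller can degrade gracefully."""

def fetch_inventory(product_id):
    # Hypothetical downstream service; the URL is made up for this sketch.
    url = f"http://inventory-service.local/api/products/{product_id}"
    try:
        # Never let a remote call hang forever: bound both the connect
        # and the read phases with explicit timeouts (in seconds).
        response = requests.get(url, timeout=(0.5, 2.0))
        response.raise_for_status()
        return response.json()
    except requests.RequestException as error:
        # Fail fast: surface a clear error here instead of letting the
        # failure ripple from subsystem to subsystem.
        raise DependencyUnavailable(f"inventory lookup failed: {error}") from error
```

The caller can catch DependencyUnavailable and show a fallback, rather than queueing up request threads behind a dependency that is never going to answer.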

The most interesting pattern for me was the Circuit Breaker pattern. I haven’t used it yet in a production system but it got me looking at a .NET implementation called Polly.

Anyone else used it in a production system?
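Since I haven’t used the pattern in production yet, here is just a rough sketch of the idea in Python (not Polly’s actual .NET API): after a number of consecutive failures the breaker opens and further calls fail immediately, giving the struggling dependency time to recover.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Toy circuit breaker, for illustration only."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hammering a sick dependency.
                raise CircuitOpenError("circuit is open, failing fast")
            # Reset window has passed: go half-open and allow one trial call.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result

# Usage sketch:
#   breaker = CircuitBreaker()
#   data = breaker.call(fetch_inventory, 42)  # raises CircuitOpenError while open
```

Polly packages this kind of policy up properly, with configurable half-open behaviour, retries and so on; the sketch just shows why failing fast at the breaker protects everything upstream of it.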

Capacity

The second part of the book introduces the capacity theme with a story about the launch of a retailer’s new website. They deployed the site at 9 a.m. and by 9:30 a.m. it had crashed and burned. Total nightmare. One of the sysadmins ended up working a 36-hour shift setting up and deploying new app servers.

Again the author introduces a group of anti-patterns to watch out for and some patterns to counteract them. This time the anti-patterns are more obvious and are mostly some sort of waste of resources. They can be seemingly insignificant things like not creating indexes for lookup columns in a database table, session timeouts that are too long (every session held in memory is using valuable memory), sending too much data over the wire or cookies that are too large. For a small system these are not a problem, but if you multiply them by thousands of concurrent users then a lot more servers will be needed.
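A back-of-envelope calculation shows how quickly these small wastes multiply. The numbers below are invented purely for illustration, not measurements from any real system.

```python
# Hypothetical numbers, purely for illustration.
session_size_kb = 200         # in-memory state kept per user session
concurrent_users = 50_000     # sessions alive at the same time

total_session_memory_gb = session_size_kb * concurrent_users / 1024 / 1024
print(f"Session state alone: ~{total_session_memory_gb:.1f} GB of RAM")
# ~9.5 GB just for sessions; trimming the session object or shortening the
# session timeout directly reduces the number of servers needed to hold it.
```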

Michael Nygard makes a very good point here that most developers have less than 10 years of experience and most systems that are built do not have capacity problems; the combination of these two statistics means that very few developers have experience of working on large systems. This also means that the same mistakes are repeatedly made when designing large systems.

I personally still have a lot to learn here. I will be digging deeper into load testing and some of the more technical details of how our servers are configured. It’s easy to forget about how the code that I write affects the production environment.

General Design Issues

This part of the book is a collection of disparate design issues that can be important for large systems. Networking, security, load balancing, SLAs, building a QA environment and other topics are covered here. Interesting stuff, if not as compelling as the first two parts of the book.

Operations

The last part of the book kicks off with another great war story about a site suffering performance problems on Black Friday (the day after Thanksgiving in the US, when every shop in the country has a sale). This time the problem was caused by an external system run by another company that had taken down two of its four servers for maintenance (on the busiest day of the year!). They were able to save the situation by reconfiguring a component not to call the external servers. This was possible thanks to good component design that allowed them to restart just that component instead of the servers themselves (it would have taken six hours to restart all the servers). The site was back working within a few minutes.

Transparency

So how did they know so fast what was wrong in this case? They had built the system to be transparent. Most systems are black boxes that do not allow any sort of insight into how healthy they are. To make a system transparent it has to be designed to reveal its current state, and a monitoring infrastructure is needed to turn those measurements into trends that can be easily analyzed by a human.

For example, at Tradera/eBay Sweden we can see, for our frontend servers, the number of errors per minute, page response times (both client and server side) and the number of 404 and 500 HTTP responses per minute (as well as lots of other measurements). For each server we can see memory usage, number of connections and a host of other performance indicators. By just glancing at our dashboard of live statistics I can see if we are suffering any performance issues on the frontend. Without this help it would be extremely difficult to keep the system running when things go wrong. As you can see, this theme is close to my heart!
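Our actual monitoring setup is far more involved than I can show here, but the core idea is simple: count interesting events inside the application and expose them somewhere a dashboard or monitoring system can poll and graph over time. The toy sketch below is in Python, and the counter names and port are made up for illustration.

```python
import json
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared counters that request-handling code increments as it works,
# e.g. record("http_500") or record("page_rendered").
metrics = Counter()
metrics_lock = threading.Lock()

def record(metric_name):
    with metrics_lock:
        metrics[metric_name] += 1

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose the current counts as JSON so a dashboard can poll them
        # and turn the deltas into per-minute graphs.
        with metrics_lock:
            body = json.dumps(dict(metrics)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Even counters this crude, graphed per minute, answer the first question you ask during an incident: did something just change, and where?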

I did get a bit lost in the technical details of this section however. It is a really interesting theme but the technical details are either JVM-specific or have not aged well. A lot has happened in this space since the book was written in 2007.

Verdict

If you work with a system that is deployed on more than ten servers then this book is a must-read. There is a difference between just writing code and keeping a system up 24 hours a day. If you are passionate about DevOps, are on call regularly or care about creating stable systems then this book is for you.


The Legacy Code Lifecycle

If you have worked for more than a couple of years as a programmer then you’ve seen a legacy system. If you are lucky then you have only dabbled with them and not been drawn into the epicentre of a full-blown legacy code project. If you’ve been unlucky then your very first job was maintaining an old legacy system that made you question your decision to choose this career.

Inception of a Legacy System

So what do I mean when I say Legacy Code? There are lots of definitions: old code, somebody else’s code, bad code etc. But in the extreme cases nearly everyone recognizes legacy code. Have you ever made a change to a system that should have taken one day but took five? Or made a seemingly simple change that rippled through the system introducing a bug in a totally different module?

I think that the concept of legacy code is tightly coupled to how we as programmers write code. To illustrate this, let’s start with the lifecycle of a project gone wrong.

First Phase – The Land of Milk and Honey

A company decides to build a new system. The developers rejoice as they get to work on a shiny, new greenfield project. This is developer paradise. If the original team is reasonably competent then this phase is a joy to work in. It is easy to add new features and the customers are happy as their requests are fulfilled within a couple of days. The developers get to use the latest, shiny technology and enjoy the feeling of fast flowing development.

However, it is very possible to sabotage the whole system from the very start. An example is the prototype trap: first the team hacks together a prototype which the customer loves, then the team decides to build on the prototype instead of throwing it away. And with that decision made, they are well on their way to building a spaghetti code jungle.

Second Phase – Getting Fatter

Life is not too bad. The features are not flying out the door at the same speed anymore but the customer is reasonably happy. The code might be starting to get a bit clunky; the controllers/code-behind classes are not as skinny anymore. Class size is growing as the classes accumulate more methods and more lines per method. The bug count increases as new features involve changing existing code and not just adding new code. It is high time to set up bug tracking and a process for change requests.

Third Phase – A Bump in the Road

It is common for programmers to change jobs regularly and in most companies the key people will eventually leave for greener pastures. This means that the two or three developers that built the system and know every nook and cranny are gone. And so is their knowledge. New developers come in and do not manage to grasp the structure and logic of the system. Perhaps it is not even their main task and they are just helping out occasionally.

The era of quick fixes has begun. The new developers never went through the first phase and don’t remember when the code base was a thing of beauty. They see a chunky, inelegant code base and they want to get in and out as quickly as possible. Maybe the customer is putting the pressure on or they just want to get back to their other (greenfield) project.

Fourth Phase – The Big Project

Good news! The customer is delighted with the system. It is saving/earning them buckets of money. They have loads of ideas for new features and want them implemented pronto.

But this is where everything goes wrong. The current team of developers have allowed the code base to rot by applying quick fixes and not understanding the original vision for the system. There is not much structure left and new code has been dumped in inappropriate places. The team are trying to build on a foundation of sand, and now the bug-fix death march begins.

They try to estimate how long a new feature is going to take but this is extremely difficult due to no-one knowing how big the ripple effect will be. So they triple their estimate to be on the safe side. The problem is this feature will never be done. They might get to 90% done but that last 10% is unattainable. In an entangled code base most large changes introduce bugs. Fixing those bugs introduces new bugs. It is near impossible to complete this project without bugs and meet a deadline.

This is where it turns into a death march. The team has a deadline to meet and just throws code at the bugs in a desperate attempt to get everything done on time. People work long hours and the code quality worsens all the time. In the end, the customer gets a bug-ridden Big Ball of Mud. The customer is not happy and crisis talks begin about the future of the project.

Fifth Phase – What do we do now?

At this stage, you have a burnt-out team (or maybe no team at all) and a system that is a maintenance nightmare. What happens now? Throw away the code (and the money invested) and rewrite it? Limp along for years and spend your days fixing all the bugs and taking an eternity to add new features? This system might only be a few years old and already it could be thrown on the scrap heap.

A lot of us have seen and experienced projects like this and agree that this system qualifies as legacy code. But I favour Michael Feathers’ much tougher definition:

Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.

So yes, my example qualifies as legacy code, but so does most of the code I’ve written during my career as a programmer. Code without tests inevitably rots and becomes harder to maintain. It might never get to the fifth phase, but it will require a lot of care and attention to stay at the second phase and not get out of control.
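To make Feathers’ point concrete, here is a tiny, hypothetical example of the kind of test that pins down current behaviour so future changes can be verified. The calculate_shipping function is invented for illustration; the idea is the same for any legacy code you want to change safely.

```python
import unittest

def calculate_shipping(weight_kg, express=False):
    # Stand-in for an imaginary legacy function we want to pin down
    # before we dare to change it.
    base = 49 if weight_kg <= 5 else 99
    return base * 2 if express else base

class CalculateShippingTests(unittest.TestCase):
    # Characterization tests: record what the code does today so that
    # any change which alters the behaviour fails loudly.
    def test_light_parcel_standard_rate(self):
        self.assertEqual(calculate_shipping(2), 49)

    def test_heavy_parcel_express_rate(self):
        self.assertEqual(calculate_shipping(12, express=True), 198)

if __name__ == "__main__":
    unittest.main()
```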

In my experience, writing unit tests is the key to changing the lifecycle of a project and to start improving code quality and reducing the bug count. But how can you learn unit testing and TDD if you work with a legacy system? If you are stuck with a legacy system and want to make programming fun again then get started by reading my next article on TDD and Legacy Code.