Rewrite or Refactor?

The Process Black Hole

Posted by Cads on 20th May 2020

Back in the day we wrote software to do specific tasks. We wrote a little BASIC program to print "Hello World" seventeen gazillion times on a Commodore 64 or TRS-80 or whatever. It was fun to mess with the salespeople on the floor, who would have to break the program and reboot the computer. We would then add more and more bits to make it more fun for us and more annoying for the salespeople. BEEP and POKE became de rigueur, and more and more bits got added onto the original script.

The same is actually true when we write software nowadays. We start off small and add more and more bits to it until we have Twitter or Google or yes, even Netflix. We learn how to code bigger pieces by building smaller bits and adding them on/in to the original. It's like virtual Lego. Build up this thing that does X. Now build up something that does Y and add it together.

And this is GREAT. It allows us to program functionally; to iterate and create MVPs; to learn from mistakes at the smallest practicable level. But. And there is a big but. Sooner or later, we need to add things to this software that it wasn't originally designed to do. And we look at the hours we have spent building the current program and, as human beings, we are loath to waste that work. So we start modifying the original program to do things it was never supposed to do in the first place.

I was thinking about this "organic" growth a lot in my last role. It became clear that it wasn't just a function of waste avoidance, nor of time compression (get new feature x out in f-3 days). No - it's a function of human nature and of organizational/cultural approach.

There's the story of a driver in a strange town. The driver asks a passer-by "Hey how do I get to other strange town?" and the passer-by says "Well you don't want to start from here!". This joke, crudely told, allows us to laugh at both the driver because they are truly lost, and also the passer-by because clearly the driver has to start from here. And inside it there is the truth that quite often, we don't want to start from here. Quite often, we want to start from somewhere else entirely.

The natural response to this understanding is to throw away what we have done and build something new from scratch. Both the organic/iterative and the build-it-new approaches have things to offer.

The organic approach allows us to see what we can use from our current product to shortcut the design and development stage. It saves time, fits the DRY model and gives us a platform of familiarity upon which to base our new decisions.

Software companies love the organic approach - it fits with Scrum/Agile. It allows the bottom line to be directly linked to previous iterations of the software. It becomes the creation of a new API or FeatureSet or product that can be comfortably viewed as an e v o l u t i o n of the original software. v2 software is so often like this, and it allows the original teams to support the software.

The build-it-new or "blue-sky" approach allows us to design and build a new, fun, whizzy, shiny product, using best practices and a fresh look at the entire problem set. It avoids all the pitfalls that we learned building the previous product, and lets us experiment with new faster technologies and methods.

Going blue-sky also means that we end up with something that does exactly what we asked for. The requirements are met. They work, it works. Granted, it took longer to get here, but we go from A to D without passing through B and C.

Young, nimble software companies love the blue-sky approach. It allows them to invest in developer autonomy. Developer happiness is at the heart of productivity, so says the mantra. Creating new things is what gets us developers out of bed in the morning. Cleaning away the cruft on old products is boring. It is satisfying to a small portion of maintenance-focused engineers, but the majority of us equate new with exciting. Blue-sky approaches allow for excitement, innovation and drive.

In the world of existing software, the rewrite is generally thought of as trashing old work and wasting effort. In business terms, it is often thought that a rewrite is the single worst strategic mistake that can be made. A lot of this is due to studies that were done during/around the dot-com era (Joel Spolsky re: Netscape). And MOST of this is because it is axiomatic when we are discussing refactor vs rewrite that we are talking about the entire piece of software.

Interestingly, the same is true of Process - true, but for very different reasons. The cost of rewriting software is directly attributable: the time and effort spent can be clearly quantified. For processes, it is less clear. Sure, we can look at the team creating the new process and see the cost they incur, but without fully measuring the length of time and the interactions of the first process compared to the second for the exact same piece of work, it is difficult to make a concrete cost/benefit analysis. Also, with software, or product, there is a clear revenue stream coming in, whether that is the amount you charge per user on a SaaS or the cost of a specific license. A process, by contrast, is integral to your organization. The real cost of creating a new process is seen in employee happiness, which has so many variables feeding into it that it is hard to isolate whether the new process is causing issues, or whether it's the result of market forces, or product direction, or whether or not the VP of Engineering's football team is nearing relegation.

However, there are some things common to tweaking both process and software.

  • They both start from a perceived failure of the original
  • They both are attempting to achieve a new, albeit related, goal
  • They both require people to implement the solution
  • The approach to writing both is a function of the culture of the organization
  • And they both require understanding of the underlying goals and the current state of play

This list is of course not exhaustive, and we will probably come across more as we explore from this jumping-off point.

In an organic/refactor culture, value is placed on the work that has been done and the benefits that it brings the organization. A secondary value is placed on stability. Not stability necessarily of the product - though that will come through time - but rather, stability of the culture. It appears that refactor cultures tend to be those that plan more frequently, move more slowly, and rely on the "that's the way it is done" group-think. Maintenance costs of legacy software and processes are seen as just "part of doing business". Tech debt is often seen as a mortgage.

In a rewrite culture, value is placed on customers and the acquisition of new customers. This is seen as engineering places prioritization on the new code and product set. In the case of companies like 37Signals (Basecamp) the value of existing customers is clearly called out and old(er) versions are left available so that the existing customer can use it if they want, rather than move to the "latest greatest" software. Tech debt is seen as a necessary cost of business as well, and in this case it actually acts like one. It needs to be paid down, but over time, the principal (if not the principle) dwindles.

Processes, like software, obey these two rough-hewn models. The well-trodden way of doing things provides comfort, stability and a mirror of the culture. Just looking at the plethora of quarterly-planned, ITIL or ISO-27001 organizations shows us that this way of dealing with process creates a clearly identifiable look and feel for the organization. Those that adopt extremely lightweight or even informal processes (XP or FDD, for example) are generally seen to be faster-iterating. The term "move fast and break things" was the epitome of this, and it clearly outlines a company culture that thrives on disruption even at the development-team level.

And this is all well and good, but is there a path for stable development in a culture that looks to disrupt the market, but not itself? A culture wherein BlueSky and BrownField can co-exist? Inherently the answer seems to be yes, but it is hard to find.

Take the aforementioned "you don't want to start from here" joke. At the heart there is a simple idea. The driver travelled from A to B trying to get to C. Getting to C from B is painful. Perhaps they should go back to A and start again.

Organic software and process says: well, we've got something that goes, and while it doesn't go to C, let's improve it and get it to go A-B-C. BlueSky software says: what we've got doesn't go A to C, so let's build something completely new that does.

Breaking it down like this makes the answer sit up and beg for attention. What if rather than improving our process or program, we designed a new one with the new goal? And then, rather than looking to see if we can modify the existing one to fit the new goal, we see what parts of the existing one we can use.

This is fundamentally hard to do. It seems so simple, but as humans we have a great tendency to grab what we have and then see if it fits. This is of course only natural. But it is backwards thinking. In starting from the existing software and seeing if it fits, a number of things are left by the wayside.

  • We lose sight of the new goal in trying to (re)use as much as we can
  • We adopt patterns and anti-patterns that already exist and may/may not help
  • We place a value on the decision making that went into the existing product and are generally unwilling to re-evaluate that decision making

If we can but put this aside, and approach things from the other end, we can achieve so much more. We can design our new product/process with a clear understanding of what we are trying to achieve; we can motivate the owners of the work with "new and exciting" work; we still place a value on the existing product, but a higher value on the functionality and correctness of the product in general.

The rules behind this approach are simple, though difficult for an organization to adhere to.

  • When refactoring code or improving a process, STOP.
  • Formally design what you need to achieve.
  • Understand the use cases driving the new process/software.
  • Look at what exists that you can re-use.
  • Re-use only that which will 100% fit within the new design.
  • Write new process/procedure/code for the missing pieces.
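As a toy illustration, the rules above can be sketched as a tiny planning helper. This is a minimal sketch, not a real tool - all names and component structures here are hypothetical:

```python
# Minimal sketch of the "design first, then look for re-use" rules.
# All names are hypothetical; this just illustrates the ordering:
# the required capabilities come from the NEW design, and existing
# pieces are only pulled in afterwards, only if they fully fit.

def plan_rebuild(required_capabilities, existing_components):
    """Map each capability from the new design to a reused component
    (only if it fits 100%) or to the to-build list."""
    reused = {}
    to_build = []
    for capability in required_capabilities:
        match = existing_components.get(capability)
        # Re-use only that which will 100% fit within the new design.
        if match is not None and match["fits_design"]:
            reused[capability] = match["name"]
        else:
            to_build.append(capability)
    return reused, to_build

# Hypothetical inventory of what already exists.
existing = {
    "oncall-rota": {"name": "PagerDuty rota", "fits_design": True},
    "approvals":   {"name": "CAB sign-off",   "fits_design": False},
}

reused, to_build = plan_rebuild(["oncall-rota", "approvals", "handover"], existing)
# reused   -> {"oncall-rota": "PagerDuty rota"}
# to_build -> ["approvals", "handover"]
```

The point of the ordering is that "handover" and "approvals" land on the to-build list even though an approvals mechanism already exists - the existing one doesn't fit the new design, so it doesn't get dragged along.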

A useful example to examine this is a release process. The process runs something along these lines:

Product defines requirements. These are then vetted by the engineering team (including QA/SRE/Security/DevOps etc.). The engineering team does the work and it is implemented, tested and deployed.

The problem set is that a customer finds a major problem with the release.

In the modern CI/CD world, this is an issue but not a big deal. The systems are generally highly available and fault-tolerant, so a percentage of users will see the issue, new code will be written to fix the problem, and life moves on, ob-la-di ob-la-da. And that is all fine and dandy from the software point of view.

In the process world, though, we don't have CI/CD. We are dealing with the storage capacity and flexibility of meat. People followed the process, and now that we have to have a post-mortem, we have to actually figure out whether the process broke. And if it did (which, for the sake of this argument, it did), fix it.

A human being looking at this can see quite easily that our QA person was not notified. So in order to fix that, we add a step to the process that says "QA signs off on the release", and there's a little documentation that shows it. And now we have a slightly more heavyweight process - a little tweak that doesn't really do much except provide peace of mind. And it is likely that this checkbox or document or whatever is open to abuse. Or, worse, in the case of formal CAB approvals, it now adds more time and friction to the delivery of the software.

This addition and organic growth of the process follows a fairly consistent and predictable pattern, and always through good intention: the intent to prevent errors and to communicate effectively. Unfortunately, the continual tweaking, the adoption of new parts of process and the legislating for more and more edge cases tend to lead to the Process Black Hole.
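The pattern can be caricatured in a few lines of code - purely illustrative, with made-up gate names:

```python
# Purely illustrative caricature of organic process growth:
# each post-mortem appends a gate, and nothing ever removes one.
# Gate names are made up for the example.

release_gates = ["requirements-vetted", "tests-passed", "deployed"]

def add_gate_after_incident(gates, new_gate):
    """A post-mortem tends to bolt on a gate rather than redesign the process."""
    if new_gate not in gates:
        gates.append(new_gate)
    return gates

# Customer finds a major problem; QA wasn't notified, so a sign-off gate is added.
add_gate_after_incident(release_gates, "qa-sign-off")
# Next incident: a formal CAB approval gate, adding friction to every release.
add_gate_after_incident(release_gates, "cab-approval")

# len(release_gates) only ever grows -- the weight that feeds the black hole.
```

There is no `remove_gate` function in this sketch, and that is the point: the organic model has an append operation but no delete.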

The Process Black Hole

As we add more pieces to the process, so the process becomes more and more weighty. It behaves like a black hole, sucking more and more resources into it. In some organizations, there are whole Program Management Offices dedicated to managing the weight of the process.

It's been written about before that there is no such thing as a bad process, just a bad process for a given organization. That still seems to hold true. I once worked for an organization that used a full Prince2 model for its SDLC. And. It. Worked. I wouldn't recommend it for pretty much any organization other than that one, but there are cases where heavy process works.

In the start-up or agile/nimble/flexible/responsive world, such processes are usually foisted upon the engineering group by external forces. In early stages, it is easier to capitulate to large customers than push back, because the money that they are bringing in will help fund your next set of features. In mid stages, it is easier to go with whichever is the most stringent process that your customer wants because then the number of overall processes is reduced.

There are, however, ways to avoid this.

Firstly, prioritizing design over refactor instils a culture of evaluating each problem set as it is presented. Design really does mean that: set out what you want to achieve, and design a process that will achieve it.

Secondly, listen to the vocal naysayers. There is a group (often called the recalcitrant minority) who will clearly fight to their death to prevent some change happening. In the world of process design, this group needs to be heard, and heard well. It is very unlikely that they will fight against a new process, but they will strongly resist additions or changes to an existing process.

Thirdly, once the process design has been done, look for re-use. Treating process design like Lego bricks or software refactoring effectively allows us to use familiar techniques and prevent waste. The OODA loop becomes available to us because we are now manoeuvring, not just reacting.

Fourthly, communicate, communicate, communicate. Any new process requires people to follow it. And because the old process is going to be cut out, people will flounder until they know the new process. Getting the new process communicated early and often will save so much heartache.

Again, this is simple and can be done in practice. At Castlight Health for example, there was a set of shift work in our Indian Development Center. SREs and DBAs were on a strict rota that had 24/7 coverage. Castlight Health is based in San Francisco. There really was no need for this shift work, but it "had always been done that way".

At first, the idea was floated that we remove the night shift and let the Bay Area employees handle that load. This met with deep resistance, on both a technical and a cultural level. Who would know who was on call? This has been working well, why should we change? And so on.

The fact was that it wasn't working well and the team was hurting because of it. There were communication gaps that were increased rather than reduced by this approach. In order to improve communication, team health and productivity, the team designed a process from the ground up. Note: this was a DESIGN. The first thing that was done was creating a map for what needed to be achieved.

Once the design was clear, the team then looked at the tooling that it had in place and the processes that they already had and picked out what could be used 100% to fit the new design. This involved pulling parts of the RFC process, the approval process and the oncall process out and coupling them together with some communications mortar.

Communication about the new process started with the approved map. It was presented to Leadership and Managers and Stakeholders in several different ways. It was presented at Engineering all-hands and written up in forums that all engineering staff were using. A cutover day was proposed, rejected, modified and then finalized.

The first week of the new process was not smooth. People still wanted to know who from the IDC was available at what would have been 3am their time. People still expected IDC members to be doing this work rather than the Bay Area engineer sitting across from them. The JIRA queue seemed to languish because of a failure to include it in the redesign. But these things were akin to bugs in software. The JIRA issue was effectively a bug in the process, and was fixed by communicating that queue to the Bay Area team. The communication around the other parts was made clearly and in a timely manner.

Within two weeks, the IDC team became more productive (more issues cleared in any given 24-hour period than before the change). The communication between the IDC team and the SF team became clearer (Slack handover channels were created so that work wasn't duplicated). The IDC team became happier (who knew that sleep was important to happiness?!).

The external teams were happier because they could rely on people that were in their timezone and often within their office.

In comparison, the original vocal naysayer group had pointed out beforehand that previous attempts to remove the shift work had failed. The communications process was too hard, they said. There was a perceived lack of expertise at the right times. The teams were used to what they were doing and did it effectively.

Listening to those naysayers up front and building out the new process meant that the team didn't fall into the trap of creating a heavier process to do the same thing, but instead created a lighter-weight process to do more.

Avoiding the Process Black Hole is hard. It involves work. Generally we don't notice that we are near the event horizon until it's too late. We've adopted procedures that really can damage the organization, just to ensure that a box is checked. But it is completely avoidable with forethought, planning and not simply throwing the baby out with the bathwater.

The key here is to design what is needed FIRST, and then to investigate what can be re-used. Culture and people are hard. Being self-aware enough to understand that we ourselves are part of culture and people is a good clue to finding the path to building effective processes and effective software.

In short: design first, build second and re-use where completely applicable. It really is Reduce, Re-use, Recycle, Repurpose - in that order.

Notes on the diagram: Weight: There's no specific weight here. We haven't got to the point of measuring actual weight yet, so Boehm's model and Prince2, for example, would just be "heavyweight". There's no Conor McGregor vs Mike Tyson weight classing. Lightweight development methodologies embrace practices that allow quick and efficient solution building, with better responsiveness to changes in business requirements.

SANE?! : This region looks to be sane. There's enough responsiveness with enough long term planning to be able to provide predictability for the business, the market and consumer base. People aren't caught out by process or feature change.

Event Horizon : This is an arbitrary line (see Weight, above). It looks pretty, so that's where it goes. Before this line, it is possible within the realm of the process to not slip into the black hole. There's no specific time or defined weight (see Weight, above) that creates the Event Horizon. If you want to really push this analogy further, you should hire a vocal naysayer called Hawking and have them yell loudly about how bad the current process is.

Thanks to Randall Munroe for his amazing chart style.