The Problem
A legacy distributed system for massive quantitative data generation had reached its limits. System innovation had ground to a halt as frequent generation failures mounted under ever-increasing load. Users were growing frustrated and pressure to resolve the problem was mounting. Challenged to deal with the situation, we faced a tough choice.
Should the system be scaled horizontally, at the price of increased on-prem operational costs and the risk of persistent failures?
Should it be moved to the cloud, at the price of significant migration and operating costs as well as the risk of delaying innovation?
Or should it be sped up, at the price of difficult performance-tuning efforts and the risk of failing to improve sufficiently?
The Analysis
Against pressure to act quickly, we insisted on employing our scientific approach and began our journey by asking a simple yet penetrating question: what was the system actually computing?
We found that the system generated a huge number of quantitative results that differed only in the parameters passed to its computational code. A deeper, systematic analysis revealed the building blocks of the computations and showed that large groups of results were derived from much smaller intermediate data that the code did not readily expose.
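To make the finding concrete, here is a minimal sketch in Python. Every parameter name and formula in it is a hypothetical stand-in rather than anything from the actual system; it only illustrates how results that differ in a cheap final parameter can all be derived from one expensive intermediate shared by a large group.

    from itertools import product

    def expensive_intermediate(scenario, horizon):
        # Hypothetical stand-in for the costly shared building block,
        # e.g. a simulated path over the given horizon.
        return [scenario * step for step in range(horizon)]

    def final_result(intermediate, quantile):
        # Hypothetical stand-in for the cheap last step that differs per result.
        ordered = sorted(intermediate)
        return ordered[int(quantile * (len(ordered) - 1))]

    scenarios, horizons, quantiles = range(100), [250], [0.5, 0.95, 0.99]

    # Monolithic view: one expensive run per (scenario, horizon, quantile), i.e. 300 runs.
    # Decomposed view: 100 expensive intermediates, each reused by three cheap final steps.
    results = {}
    for scenario, horizon in product(scenarios, horizons):
        intermediate = expensive_intermediate(scenario, horizon)
        for quantile in quantiles:
            results[(scenario, horizon, quantile)] = final_result(intermediate, quantile)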
The Solution
At this point, the solution was clear to us. We restructured the computation to generate and cache the intermediate data in a format convenient for further processing, and we pushed the computation of final results into swift server code operating on top of that cached data. There was no longer any need to pre-generate the full set of quantitative results, since the server code could reproduce any of them dynamically.
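The restructured flow can be sketched in the same hypothetical terms: a batch job materializes and caches the intermediate data, and lightweight server code reproduces any requested result from the cache on demand. The cache layout, parameter names, and stand-in computations below are assumptions made for illustration, not the system's actual design.

    import json
    import pathlib
    from functools import lru_cache

    CACHE_DIR = pathlib.Path("intermediate_cache")

    def expensive_intermediate(scenario, horizon):
        # Same hypothetical stand-in for the costly building block as in the analysis sketch.
        return [scenario * step for step in range(horizon)]

    def final_result(intermediate, quantile):
        # Same hypothetical stand-in for the cheap, result-specific last step.
        ordered = sorted(intermediate)
        return ordered[int(quantile * (len(ordered) - 1))]

    def generate_and_cache(scenario, horizon):
        # Batch side: compute the intermediate data once and store it
        # in a format convenient for further processing.
        CACHE_DIR.mkdir(exist_ok=True)
        path = CACHE_DIR / f"{scenario}_{horizon}.json"
        path.write_text(json.dumps(expensive_intermediate(scenario, horizon)))

    @lru_cache(maxsize=None)
    def load_intermediate(scenario, horizon):
        # Server side: read a cached intermediate, keeping hot entries in memory.
        path = CACHE_DIR / f"{scenario}_{horizon}.json"
        return tuple(json.loads(path.read_text()))

    def serve_result(scenario, horizon, quantile):
        # Server side: reproduce any requested result dynamically;
        # the full set of results is never pre-generated.
        return final_result(load_intermediate(scenario, horizon), quantile)

    generate_and_cache(scenario=7, horizon=250)
    print(serve_result(scenario=7, horizon=250, quantile=0.95))

The design choice this sketch highlights is the split: expensive, stable work lives behind the cached intermediates, while cheap, fast-changing result logic lives in the server code, where it can be revised without rerunning the generation.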
The Impact
Within weeks, substantial benefits began to materialize. The revised data generation brought a steep reduction in required computational resources and operational costs, drove generation failure rates down to very low levels, and allowed the legacy system to be gradually retired. The dynamic server code enabled fast innovation, allowing the new system to evolve quickly to meet new requirements, often with little or no change to data generation.
Most importantly, the new system stood the test of time, with innovation and delivery continuing long after its launch.