Scalability, the surge, and the Obamacare websites
Two things surprised me about the Obamacare websites. The first was that (at least for Covered California) the errors thrown by an overloaded system on the last evening of enrollment were Java exceptions. The second was the proud claim that the federal site was able to handle 60k visitors at a time and 800k in a day.
Back in 1999 I wrote a custom web server for a now long-gone client called Sportal. It was for live “match-tracking” of the European soccer tournament Euro2000, the first of its kind. A new generation of these products can now be seen on many sports sites, notably the MLB baseball game tracker. The software had to run on a single Sun server. And it had to be scalable. We thought we had succeeded. We had failed.
Before writing a line of code, we tested off-the-shelf web servers. At that time, Apache had the best throughput and could handle hundreds of connections per second, but not the number of connections we were expecting. During one semi-final, our Java applet was being “watched” live by over 100k surfers, more people than were in the stadium. We had written a custom server, and had abandoned all of the cool technology in order to be scalable. Our server was single threaded and worked through “timeslices.” And the client was happy.
Until a few weeks after the competition, when they decided to use it in a different way. Rather than a 90-minute soccer game, they used it for a 5-minute, massively popular event: the draw for the next World Cup. And that was the day the software died.
To be truly scalable (rather than working for some fixed number of users, whether that number is 10, 100, 1000 or 100,000) you have to design at the limit: how does the software perform when it has reached saturation? Our server had two tasks: it had to service existing connections and accept new ones. First it pushed the new connections into a queue, and then it looked after all the people who were already connected. As it spent more and more time draining the “new input” queue, it serviced existing connections less and less. And we hadn’t understood that existing connections are very unstable. If you suddenly find your browser isn’t responding, you refresh the page. So rather than a system that gradually converted new requests into existing connections, we had one where most of the existing connections turned themselves back into new requests. No matter how fast the system worked, it could never recover.
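The shape of that failure is easier to see in code. Below is a minimal sketch (modern Java NIO, written for this post; the class name, port and message are invented, and the real 1999 server was custom code, not this) of a single-threaded, two-phase loop that always accepts before it services. Once the accept phase starts to dominate, clients in the service phase stop seeing data, refresh, and reappear in the accept phase.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class SurgeProneServer {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(8080));
        listener.configureBlocking(false);                      // accept() returns null when idle

        Deque<SocketChannel> newArrivals = new ArrayDeque<>();  // the "new input" queue
        List<SocketChannel> connected = new ArrayList<>();      // existing connections
        ByteBuffer update = ByteBuffer.wrap("score update\n".getBytes());

        while (true) {
            // Phase 1: accept everything that is waiting, before doing anything else.
            // This is the flaw: under a surge this phase keeps growing, because every
            // client we fail to service in time comes back as a new arrival.
            SocketChannel incoming;
            while ((incoming = listener.accept()) != null) {
                incoming.configureBlocking(false);
                newArrivals.add(incoming);
            }
            while (!newArrivals.isEmpty()) {
                connected.add(newArrivals.poll());
            }

            // Phase 2: spend whatever is left of the time slice on existing clients.
            // When phase 1 dominates, these writes are delayed, browsers look frozen,
            // users hit refresh, and each refresh lands back in phase 1.
            Iterator<SocketChannel> it = connected.iterator();
            while (it.hasNext()) {
                SocketChannel client = it.next();
                try {
                    update.rewind();
                    client.write(update);
                } catch (IOException e) {
                    it.remove();                                 // client gave up (probably refreshing)
                }
            }
        }
    }
}

Designing at the limit means deciding which phase gives way when the slice runs out; in this shape of loop it is always the existing connections that give way, which is exactly the wrong answer for a surge.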
The system was designed as a “match tracker.” People would be connecting and disconnecting throughout the game. It could cope with a steady state of 100,000 or more users, but it could not handle a massive surge. While we prided ourselves on our scalability (we had even developed a technique we called “soft multicasting” that could allow orders of magnitude more connections), we hadn’t actually designed for scalability. We hadn’t started from point N, the saturation point of the system, and asked what would happen at N+1. We had simply taken a number out of the air (100k, 200k, I can’t remember now) and built a robust system for that throughput. It wasn’t scalable.
When I joined the healthcare stress test on Monday evening and watched all of the errors, my first reaction was “why are they using Java?” But on reflection, maybe we have come far enough in 13 years that Java can handle the steady-state throughput you need for a site like Covered California. Now I think that a new generation of coders has learned the hard way that throughput is not enough, and that however large the number of connections you plan for, that number doesn’t make you scalable. You need to plan for the surge, and for the impatience of users who will always press the refresh button rather than wait. The military understand the value of a surge. It is very hard to keep everything running smoothly day to day and still be ready to deal with a surge.
For the Obamacare software, every year is going to be 360 productive and calm days, and a few crazy, chaotic days as the deadline approaches.