The longest week

There's an interesting thread in the JOS forum where people are swapping war stories about their worst SQL mistakes and database screw-ups. Reading the thread brought back memories of just about the longest week that I ever had to work.

Back in the late 1990s, I did a contract for Nomura, one of the largest financial brokers in the world. As part of my workload, I inherited a large Sybase SQL script written by an ex-employee who had departed several weeks earlier. This script was designed to run once only, using a set of SQL batches to update a significant amount of important data from one production database to another. 

Of course, like any battle-scarred veteran, I was very suspicious of another developer's code, especially now that I had to take responsibility for executing it. So I ran multiple tests using copies of the production database, and everything worked okay after I had fixed a couple of minor bugs. And then the big day arrived...

I arranged for a complete backup of the target production database, then at around 8 PM I set the script running and sat back to wait for the fireworks. But after about an hour, the script finished normally without any visible errors. I ran my own script that tested a few dozen random rows, and sure enough, all of my tests showed that the data had been updated correctly. Breathing a sigh of relief, I went home, stopping only at the pub for a well-deserved drink. Everything as planned, Smithers. Excellent.

The next day, everything went smoothly and no problems were found. But at 3 AM the following morning, somebody phoned to tell me that there was a problem with the data in the target database. Still confident despite the unearthly hour, I trundled into work and had a look around inside the database.

You know that first, faintest, tiniest inkling that something, somewhere, has gone terribly wrong...? 

The event log on the machine that had run the aforesaid database script revealed dozens of messages saying that a deadlock had been detected during an Update statement. When a deadlock occurs in a Sybase database, the server rolls back the transaction that has accumulated the least CPU time. About 1% of the transactions, some 600 or so updates, had failed - and the error messages were extremely cryptic about which specific SQL statements had failed - and the database had been continually updated throughout the whole day, rendering the backup that I'd taken useless - and the script that I had taken responsibility for didn't use correct commit/rollback semantics for some of its work - and...and...and...aaarrrgh!!

During all of my tests, I had been working with a database that wasn't being updated by any other process. When it came to the production run, unbeknownst to me, a large external feed had been competing with my script to update the same database. Every so often, that feed and my script deadlocked each other, and one of the SQL batches was killed. Not only was the deadlocked update not performed, but neither were any of the subsequent updates in the same batch! The tool I'd been using to execute the script (either Rapid-SQL or Query Analyzer, I can't remember which) had truncated the massive result set, which is why I didn't spot the errors immediately after the script had finished.
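For what it's worth, here's a minimal sketch of the kind of per-unit safeguard the script was missing. The table names (positions and staging_positions) are purely hypothetical, and depending on server settings a deadlock can kill the whole batch rather than just the statement, in which case the check has to live in whatever tool is driving the batches - but the principle is the same: one explicit transaction per logical unit of work, and a check of @@error afterwards, so that a failed unit is rolled back and reported rather than silently skipped.

    begin transaction

    update positions                            -- hypothetical target table
    set    book_value = s.book_value
    from   positions, staging_positions s       -- hypothetical staging table
    where  positions.position_id = s.position_id

    if @@error = 0
    begin
        commit transaction
    end
    else
    begin
        -- the unit failed (for example, chosen as a deadlock victim):
        -- undo it as a whole and say so loudly, instead of carrying on
        -- as if nothing had happened
        rollback transaction
        print 'Update unit failed - record it and re-run this unit later'
    end

It wouldn't have prevented the deadlocks, but it would have turned a silent 3 AM mystery into an obvious, countable list of failed units.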

After informing management, I got together to concoct a rescue plan with a colleague whose technical skills I trusted. We decided that the sort of detailed detective work and reconstruction required was actually my forte, so he would sit around all day checking my work whilst I wrote a myriad of small programs to assemble dozens of tiny clues into a reasonable picture of what exactly had happened and how we should try to correct it.

I commenced batch after batch of brain-busting reconstruction work. Every 2 or 3 hours, I would hand the results to my colleague to check while I continued on the next batch. He kept management up to date with my progress so that they didn't freak out. And of course the data in the database was still changing underneath me, so every extra day that I took made the problem progressively more difficult to understand. For 6 days I did nothing but sleep and work - my partner even brought meals to my desk! And slowly, ever so slowly, I started to grind the monster down.

Six days and 104 working hours later, the final batch of corrections was checked, passed, and implemented. My brain and body were both completely fried, so I collapsed into bed and slept for 24 hours, dreaming of ever-longer and more relentless spirals of numbers twisting in space around me.

UPDATED: My wife reminds me that it was 3 AM that I received the emergency call, not 4 PM. And I'm pretty sure that the database was Sybase, not SQL Server.