The game was crashing. Not a lot, but about once a day the game would randomly just explode. No particular rhyme or reason to the stack trace or what was going on at the time. Just, boom!
We obviously couldn't ship that way. It was getting kinda late in the development cycle, late enough that this made everyone nervous, and I was tasked with tracking the thing down. They said: we hope you find it fast, but fast or slow, you have to find it, because we don't really have a choice.
I spent about a week trying to see if it was related to any particular thing. I started disabling subsystems to track it down. No audio, wireframe graphics, simple test level, remove all equipment or controllers we don't need. Nope. When fiddly bugs show up at this stage, you start having those tense conversations like "Maybe we don't NEED level 6.3", but that was irrelevant, because nothing we could remove would solve it. I made some attempt to go back through source control to see if I could binary-search to where it first showed up, but the game crashing isn't exactly unheard of, and it was fiddly to even reproduce the thing in the first place. That was what made it so insanely slow to track down. No luck from any of this.
After several days, I understood that I wasn't in for a typical debugging session. What I decided on was to embark on making a build of the game that was byte-for-byte identical from run to run. No audio, no control input, constant RNG seed, track down anything that might make it deviate or anything time- or IO-dependent. That took about another week, but at the end, I had my build. Every byte of memory was the exact same from run to run; every address, every value. (This was a console game so this task was easier than it otherwise would have been.)
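To give a rough idea of what "repeatable" means in practice, here is a minimal C++ sketch. It is not the original engine code: every name in it (`DeterministicEnv`, `kFixedSeed`, `kFixedDt`, `run_frames`) is made up for illustration. It assumes the two usual sources of drift, the RNG and the clock, are replaced by a constant seed and a frame counter, and it folds the resulting state into a checksum so two runs can be compared.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical names throughout; the real engine's APIs are not known.
constexpr std::uint32_t kFixedSeed = 12345u;   // constant RNG seed, same every run
constexpr double kFixedDt = 1.0 / 60.0;        // pretend every frame takes exactly 1/60 s

struct DeterministicEnv {
    std::mt19937 rng{kFixedSeed};
    std::uint64_t frame = 0;

    // "Time" is derived from the frame count, never from the wall clock.
    double now() const { return static_cast<double>(frame) * kFixedDt; }

    // Raw engine output; no OS entropy involved.
    std::uint32_t random() { return static_cast<std::uint32_t>(rng()); }
};

// Runs a toy simulation and folds its state into an FNV-1a style checksum.
// Two runs with the same frame count should produce the same value.
std::uint64_t run_frames(int frames) {
    assert(frames >= 0);
    DeterministicEnv env;
    std::uint64_t checksum = 1469598103934665603ull;  // FNV offset basis
    for (int i = 0; i < frames; ++i) {
        ++env.frame;
        checksum ^= env.random();
        checksum *= 1099511628211ull;                 // FNV prime
        checksum ^= static_cast<std::uint64_t>(env.now() * 1000.0);
        checksum *= 1099511628211ull;
    }
    return checksum;
}
```

Comparing `run_frames(1000)` across two runs is the toy version of comparing memory dumps: if the values ever differ, something nondeterministic is still leaking in.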
So we're two weeks in now. By dumb luck, I managed to replicate the bug almost immediately once I had the repeatable build. The morbidly simple level and setup that reproduced it quickly was: Start the repeatable build with a script running that would spawn the player in a bare room with a floating gun. The gun fires, killing the player. The player drops to the ground, and then the game crashes. Every time.
I got you now, you son of a bitch. Happy that the thing was trapped now, unable to flee into its stochastic wilderness, I began to live in an endless world of execution-style slayings.
What followed was actually the longest part. As I said, the crash had no real pattern, but I now had the ability to go backwards in time. I could control everything. I took any wrong-looking values from the memory at the time of the crash, and started setting watchpoints on their memory addresses. How had it gotten this way? What touched this piece of memory? The process of single-stepping through, as a single piece of memory was allocated, used, freed, and then reallocated, trying to understand it all and find the wrong bit, was intense.
My memory of the following couple of weeks is honestly a little fuzzy. I know that, by single-stepping through every single place the offending memory was touched, I was able to definitively work out which value had been impossible, how it had caused the crash, and everything that had written to it, until I arrived at the offending code. The problem was... the bug wasn't in the code I was looking at. That code had been handed some weird impossible value in some of its own data, and that had caused it to set up something else wrong.
Oh. Oh no. Do I have to do it again? By now I was three or four weeks in.
I did do it again. Like I say, I don't remember clearly, but I remember that it took me about 6 weeks in total and I filled up the majority of one little notebook that I kept next to the computer to write notes in. Little hex addresses, stack traces, clues that I might need to flip back to later. No matter. After tracing back all the breakage, I finally arrived at the code that was actually causing it. Once I actually saw what was happening, it was easy to work out.
At some point, someone had added a cache for some expensive-to-compute data for actors in the world. The cache had a mapping of actors to cached-value matrices, and each actor had a single value that was a pointer to its cached value, or NULL if it had none. I think they were too big to store indefinitely? There was a little LRU cache. But, if an actor was destroyed, it wasn't removed from the cache. So its cached data would sit there in the LRU cache, like a little time bomb, until the thing expired, and then the last part of its removal from the cache would be... unsetting the pointer to cached data in the now-dead actor back to NULL. The actor's memory had already been freed, and so one int, somewhere at some random location in memory, would get set to 0. And once in a while, it would happen to be at a location where that would cause serious problems (usually leading quickly to a crash).
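The pattern can be sketched in a few dozen lines of C++. This is an illustration under assumptions, not the original code: `ActorCache`, `CachedValue`, `get_or_create`, and `remove` are invented names, the cache size is arbitrary, and the line marked as the fix is only one plausible shape for it, since the story doesn't say what the real line was. The point to notice is the back-pointer write in `evict_oldest`: if the owning actor has already been freed without being removed from the cache, that store lands in whatever now occupies its old memory.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// All names here are hypothetical; this is a sketch, not the original engine code.
struct Actor;
struct CachedValue { float m[16]; };  // stand-in for the expensive cached matrix

class ActorCache {
public:
    explicit ActorCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ >= 1);  // sketch assumes room for at least one entry
    }
    CachedValue* get_or_create(Actor& a);
    void remove(Actor& a);
    std::size_t size() const { return entries_.size(); }

private:
    struct Entry { Actor* owner; CachedValue value; };
    void evict_oldest();
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<Actor*, std::list<Entry>::iterator> index_;
    std::size_t capacity_;
};

struct Actor {
    explicit Actor(ActorCache& c) : cache(c) {}
    Actor(const Actor&) = delete;
    Actor& operator=(const Actor&) = delete;
    ~Actor() { cache.remove(*this); }  // hypothetical one-line fix: leave the cache on death
    ActorCache& cache;
    CachedValue* cached = nullptr;     // pointer to this actor's cached value, or null
};

CachedValue* ActorCache::get_or_create(Actor& a) {
    auto it = index_.find(&a);
    if (it != index_.end()) {
        // Already cached: move the entry to the front to mark it recently used.
        entries_.splice(entries_.begin(), entries_, it->second);
        return a.cached;
    }
    if (entries_.size() >= capacity_) evict_oldest();
    entries_.push_front(Entry{&a, CachedValue{}});
    index_[&a] = entries_.begin();
    a.cached = &entries_.front().value;
    return a.cached;
}

void ActorCache::evict_oldest() {
    Entry& victim = entries_.back();
    // Without the fix above, owner may already be freed, and this store
    // zeroes a word at some unrelated location in memory.
    victim.owner->cached = nullptr;
    index_.erase(victim.owner);
    entries_.pop_back();
}

void ActorCache::remove(Actor& a) {
    auto it = index_.find(&a);
    if (it == index_.end()) return;  // actor was never cached
    entries_.erase(it->second);
    index_.erase(it);
    a.cached = nullptr;
}
```

Deleting the body of `~Actor()` turns this sketch back into the time bomb: the entry outlives its actor, and the eventual eviction writes through a dangling pointer.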
The fix was 1 line.