Prompted by http://juan-gandhi.livejournal.com/3651456.html
The most epic ones I remember were:
1. My application crashed by jumping to a random address whenever I connected more than 1000 clients. Eventually I traced it to some bits in the return address on the stack of an OS authentication function getting flipped at random. This happened because the function did NIS authentication and used select() to wait on its sockets. I had built my code with a definition that supported up to 10240 bits in an fd_set, but the system library was built with the default setting of only 1024 bits, and it allocated its fd_set variables on the stack. So once more than 1024 sockets were open, the library got a socket with a descriptor number of 1024 or higher, and select() corrupted random bits on the stack beyond the end of the fd_set. Hm, I don't remember how I fixed it, or whether I fixed it at all. Maybe I moved the authentication into a separate process.
2. The machines in a Veritas cluster committed suicide. Two-node clusters are inherently unstable: each node has to decide whether the other machine has failed or the network has been partitioned (in case of a partition the master machine is supposed to continue and the slave machine is supposed to kill itself). The slave machine killed itself out of nowhere. It was at its worst when the master machine was being stopped: the load would move to the slave machine, and that one would kill itself in the middle of the fail-over. It took probably close to a year to solve: they would send a dump, I would add more diagnostics and a potential fix and send them a patch, in a month or so the problem would reoccur and they would send another dump, I would look at the diagnostics in it and do another iteration... Rather spectacularly, I solved the problem on site: the customer wanted someone to be present and hold their hand while they had a service window and installed the latest patch, and the support engineer had personal plans, so I went. This last patch contained diagnostics detailed enough to watch for unusual timing between the events. So we installed the patch, and it turned out that once in a while the OS time just stopped for about 4 seconds, which broke all the heartbeat algorithms. The weirdest part is that this timer bug was a known one, fixed in another edition of the OS, but somehow it never occurred to them to port the fix to this one too. But while working on it, I think I fixed a few genuine race conditions in the clustering code too :-)
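To illustrate the first story: an fd_set is a fixed-size bitmap, so setting a descriptor numbered FD_SETSIZE or higher writes outside it. Below is a minimal sketch in C, assuming a POSIX system, of two defensive options: a guard before FD_SET, and a wait based on poll(), which takes descriptor numbers in an array and has no such limit. The helper names checked_fd_set and wait_readable are hypothetical, not from the original code.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: add fd to the set only if it fits into the bitmap
 * size this file was compiled with. Returns 0 on success, -1 otherwise. */
int checked_fd_set(int fd, fd_set *set)
{
    assert(set != NULL);
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;              /* FD_SET here would write outside *set */
    FD_SET(fd, set);
    return 0;
}

/* Hypothetical helper: wait for fd to become readable using poll().
 * Returns 1 if readable, 0 on timeout, -1 on error or when only a
 * non-POLLIN event (such as POLLHUP or POLLERR) was reported. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;

    int rc = poll(&p, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (p.revents & POLLIN) ? 1 : -1;
}
```

Note that a guard like this only protects code compiled with the same FD_SETSIZE. In the story the overflow happened inside a system library built with the smaller default, so a check in the application's own code would not have caught it.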
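To illustrate the second story: the kind of diagnostic that watches for unusual timing between events can be sketched roughly as follows. This is not the actual patch, only a minimal C sketch under assumptions: the events are expected at a fixed period, their timestamps are in milliseconds, and the function name and parameters are hypothetical. The check is two-sided, because a stalled clock can make intervals look either too short or too long depending on where the timestamps come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical diagnostic: scan the timestamps (in ms) of events expected
 * every period_ms and count the intervals that deviate from the period by
 * more than tolerance_ms. If first_bad is non-NULL and an anomaly is found,
 * the index of the event that ends the first bad interval is stored there. */
size_t count_timing_anomalies(const int64_t *ts_ms, size_t n,
                              int64_t period_ms, int64_t tolerance_ms,
                              size_t *first_bad)
{
    assert(period_ms > 0 && tolerance_ms >= 0);
    size_t bad = 0;
    for (size_t i = 1; i < n; i++) {
        int64_t delta = ts_ms[i] - ts_ms[i - 1];
        int64_t dev = delta - period_ms;
        if (dev < 0)
            dev = -dev;         /* absolute deviation from the period */
        if (dev > tolerance_ms) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            bad++;
        }
    }
    return bad;
}
```

For example, with a 1000 ms period and timestamps 0, 1000, 2000, 2000, 2000, 3000, the two zero-length intervals would be flagged, which is how a clock that stops advancing would look if the events kept arriving.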