Programming War Stories: The LED Kernel Panic
The following is a true story about debugging a kernel panic with a corrupted backtrace, and the importance of checking all lines of code. Names have been deliberately avoided to protect the guilty/innocent.
1 Changing Code and Changing Kernel Panics
At a former company, the flagship product was a complex, custom hardware solution, a box filled with dozens of interconnected FPGAs and processors. Coordinating, initializing, and displaying the status of all these subsystems was a processor running a value-added strain of Linux.
The next generation product required new device drivers with significant modifications to kernel-space and user-space programs. After many months of development, we had a working system, but there was one problem: on rare occasions, for some unknown reason, the kernel would panic.
Driver programming inevitably leads to kernel panics, but with a backtrace and a scrutinizing eye over the latest code changes, they are usually found and fixed quickly. With an unreproducible bug and many recent modifications, the backtrace becomes even more important for locating the issue. However, the backtrace for this kernel panic was strange: it accused an innocuous function of crashing the system, led_set_state(), a function created in biblical times1 and untouched by the recent changes. This function was short, simple, and straightforward – it wasn't doing anything that could cause a kernel panic.
Furthermore, the backtrace showed an even stranger call sequence – it was impossible. The functions leading up to led_set_state() seemed plausible, but further up the call stack were calls to functions that could never have been reached from there. The codebase did have forests of function pointers, but this wasn't one of them. The backtrace was also inconsistent, showing different, impossible paths through the code on different panics.2 It looked like stack corruption.
We scoured the traces looking for any patterns or hints to narrow the scope, but found nothing. In parallel, we pored over the changed code, particularly the tricky and complicated parts, adding logs and asserts for better situational awareness when the bug happened again. By this point, the veteran driver programmers were brought into the fray. We scrubbed the code immaculately clean, finding bugs here and there, but never the bug we were looking for.
2 Kernel Debugger to the Rescue
Someone suggested kgdb could provide valuable information if it halted on the error. From there, we could probe the system or maybe even see an uncorrupted stack. However, the dark art of setting up kgdb was lost to time: enabling it in the kernel build, setting up a serial console, etc. Nobody knew how to do it.3
Equally qualified, I volunteered and ventured into the unknown to set up kgdb while the others continued to scrub the code. The process was arcane (no doubt made worse by our custom hardware), but after trial and error, poor documentation, and bad serial cables, kgdb was attached to the system.4 My finger was again poised over the enter key after countless attempts – this time, it would work. I struck the key and intentionally crashed the system as a proof of concept, testing whether the debugger configuration would catch panics. It did, and I was greeted with a fully functional prompt into the kernel – the bug's days were numbered. I set up the system again, ran some generic tests, and waited. The board was set, and the pieces were moving…
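Our exact incantations are lost to the same dark art, but on a reasonably modern kernel, a kgdb-over-serial setup looks roughly like the following sketch (the device names and baud rate are illustrative, not ours):

```shell
# Kernel build: enable the debugger, its serial backend, and sysrq
#   CONFIG_KGDB=y
#   CONFIG_KGDB_SERIAL_CONSOLE=y
#   CONFIG_MAGIC_SYSRQ=y

# On the target: point kgdb at a serial port (also settable via the
# kgdboc= boot parameter)
echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc

# On the target: break into the debugger on demand -- a handy
# proof-of-concept "crash" for testing the setup
echo g > /proc/sysrq-trigger

# On the host: attach gdb to the paused kernel over the serial line
gdb ./vmlinux \
    -ex 'set serial baud 115200' \
    -ex 'target remote /dev/ttyUSB0'
```

The sysrq trigger is the non-destructive way to verify the whole chain before trusting it to catch a real panic.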
The rest of the day passed, and then another, and another. Nothing. The bug hadn't happened yet, and we feared the new kernel build might have "fixed" the bug – a dreaded Heisenbug.5 We ran some different tests on the box, but hope was waning.
And then one morning, I walked into the lab and saw something unusual – a gdb prompt was staring at me. It had caught something! Like an archaeologist uncovering ancient ruins, I lightly examined the system, careful to not disturb the remains and destroy any clues.6 For the first time, I saw the real backtrace. It was uncorrupted and showed exactly where the bug was. I saw the line that was causing all our troubles, the elusive line that many eyes had passed over and over. It was a debug print.
3 The Wolf in Sheep's Clothing
The debug print seemed harmless enough: it just logged some memory usage statistics. It also printed additional information when memory use was high, notifying that the subsystem was taxed and throttling would soon begin. The issue was in this threshold logic.
Buried in the print statement arguments were a few ternary expressions over the threshold condition, selecting relevant details in each scenario. One side of the ternary, the more frequently taken side, worked perfectly. The other side, however, dereferenced a bad pointer, which ended up corrupting the stack somehow. But sometimes, this pointer was valid by dumb luck and the system continued normally.
But it gets better. When the system ran out of memory, it took a different code path, avoiding the print statement entirely. And the threshold was close to the maximum value, resulting in a narrow range for the print to trigger. Furthermore, the system dynamics were large and fast, ramping from zero to maximum utilization in little time, blowing past the threshold print. Only in a Goldilocks region of just right utilization, at the right time, with an unlucky pointer, would the kernel panic. That it also happened to corrupt the stack is Murphy's law in full effect.7
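The shape of the bug, reconstructed for illustration, was roughly the following. This is a userspace sketch: the struct names, the 90% threshold, and the printf-for-printk substitution are all invented, not the actual driver code.

```c
#include <stdio.h>

struct mem_stats     { unsigned long used, total; };
struct throttle_info { const char *reason; };   /* not always valid! */

/* Logs usage statistics; returns 1 when over the throttling threshold.
 * The rarely taken side of each ternary dereferences `ti`, which in
 * the real driver could be a stale or bad pointer. */
static int log_mem_usage(const struct mem_stats *st,
                         const struct throttle_info *ti)
{
    int over = (st->used * 100 / st->total) > 90;

    printf("mem: %lu/%lu%s%s\n", st->used, st->total,
           over ? ", throttling soon: " : "",
           over ? ti->reason : "");   /* crashes when over && ti is bad */
    return over;
}
```

Below the threshold, the unsafe side of the ternary never executes, so the print runs cleanly for months; only in that narrow over-threshold window does the bad dereference fire.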
All the actual code, which we had checked countless times looking for a phantom bug, worked fine. Who would suspect a purely informational debug print, one that proudly appeared in the logs, of being the culprit? Humbled by the simple (yet complex) bug, we promptly fixed it and never again saw led_set_state() wrongly accused. We were grateful it was an "easy" bug – it could have been a lot worse.
Thanks for reading. If you enjoyed this story, I'd love to hear your comments and feedback. You can contact me via email. ✚
- ↑ Every ancient project has at least one sepia-toned file, where only the copyright date in the header comment is ever changed. Its lineage was lost when the repository was upgraded from cvs to svn, not because the transfer was incorrect, but because it was originally lost when upgrading from sccs to cvs. led_set_state() was in one of those files.
- ↑ Most of the panics ended in led_set_state(), but not always – it just happened to be the most common. However, we didn't know if we had one unique bug with multiple signatures, or multiple bugs each with a unique signature.
- ↑ Logs, backtraces, small changes, and experience were sufficient for solving most bugs. kgdb was overkill, plus it had its own problems. Pausing the kernel at runtime caused general chaos: components timing out, failovers triggering, etc. The time spent configuring kgdb – which may not even help debug your particular issue – was better spent debugging the problem the old-fashioned way. Turning to kgdb was an act of desperation.
- ↑ Building the correct mental model to understand what's going on is the real challenge. Much documentation assumes the reader is an expert and only highlights pertinent details. Filling the gap and deciphering the details is, as they say, an exercise left to the reader.
- ↑ A portmanteau of "Heisenberg" and "bug," borrowing from Heisenberg's Uncertainty Principle from quantum mechanics: observing a bug changes its behavior, or fixes it entirely. They usually materialize in distributed systems, multithreaded code, and compiler-optimized code with debug disabled.
- ↑ With a rare, unreproducible bug, a system in a failed state is worth its weight in gold. Experts gather around to run various diagnostic commands, patching together an overall picture of what happened. The session usually concludes with, "Does anyone want to check anything else before I try X?"
- ↑ "Anything that can go wrong, will go wrong." Engineers are keenly aware of this fact, as they don't get calls when things are working, they get calls when things break. Failures that penetrate a robust design are usually a dumbfounding series of unfortunate events.