Programming War Stories: The Magical NOPs
The following is a true story about debugging a bootloader and the importance of understanding your hardware. It was recounted to me by the engineer in the story when we worked together at a previous employer. Names have been deliberately avoided to protect the guilty/innocent.
1 Changing the Bootloader
An embedded systems engineer was tasked with modifying low-level bootloader code for a processor inside the company product. The processor interacted with numerous pieces of complex hardware which were accessible only after an intricate configuration process. The code ran early in the boot sequence and was responsible for setting up the processor itself – configuring memory management units, caches, busses, peripheral blocks, etc. In such an environment, C code cannot run, so this section was written in assembly.
The low-level code rarely needed modification, with each new product simply carrying over the previous version and tweaking a few parameters when necessary.1 The red-letter changes slated for this product were similarly benign and straightforward, needing only a few new instructions added. However, internalizing the tomes of processor documentation2 to determine those new instructions, and navigating the abandoned and dusty codebase to find where to put them, were the first hurdles to success.
In the face of these challenges, the engineer persevered and soon had a patch ready to try. He loaded the new code and rebooted to apply the changes. Sitting in front of the system, he saw the blinding lights, heard the screaming fans, and witnessed the complicated system slowly coming to life.3 The interwoven subsystems delicately danced with their initialization protocols and one by one came online, but there was a problem – the processor was dead.
Thinking it a fluke, he rebooted the system to try again. And again, the system failed to boot. He reviewed his changes for typos or careless errors, but everything was in order. Reflashing the code and trying a third time, he rebooted the system and held his breath. The lights, the fans, and… nothing. As a sanity check, he removed his changes and tried the original code – perhaps something else in the system got corrupted, or something in the ancient project was misconfigured.4 But without the changes, everything worked as normal.
2 Forward Progress
Puzzled, the engineer dived back into the documentation. The task was simple: the processor was already mapping some memory regions to peripherals, and he was simply adding another region. The documentation and existing code confirmed his reasoning was sound. However, there are multiple versions of the same processor, each with different performance and memory capabilities. Perhaps he was consulting the wrong documentation, or unknowingly running on a lower-cost chip5 – the processor may have run out of region mappings. He verified the documentation version and the processor hardware, and everything agreed; he was looking at the right documentation for the right processor, and the documentation showed enough available regions for the change.
This small task was becoming a large problem, and the debugging techniques were running thin. With a suspicious eye on the documentation, he replaced an existing region with the new change. Applying the patch, he rebooted the system in the noisy lab and the system roared. A downpour of log messages ran across the screen, indicating a successful boot. With a working command line prompt, he briefly tested the change and it was working as intended. The old region obviously failed, but that was the price of progress. With this important data point in hand, it was time to seek an elder.
3 The Magical NOPs
This particular piece of code was ancient. Emerging from the mists of prehistoric times, the code had no author or maintainer in living memory. It had last been touched many years ago by someone much like the engineer himself: just a wayfarer who was tasked with a "small" modification. Luckily, that person was still with the company. After the engineer shared these strange riddles and mysteries with the previous victim, the understanding brother sat deep in thought, digging into the recesses of his memory for a shred of advice. His efforts were rewarded with a long-forgotten hidden gem. "Ah, did you remember to add the magical NOPs?" He was met with a blank stare. "You know, the long list of NOPs before that section of code. You need to add a few more whenever you change things around there."
The NOP, or no-op instruction, is a processor instruction that does nothing. Low-level coding, working closely with hardware, is notorious for esoteric behaviors. For example, changing the frequency of an internal clock may need a few cycles to stabilize, and the recommended solution is to insert a few NOPs before continuing. Changing the memory regions didn't modify the clock frequency, but perhaps it was doing something similar that needed time to take effect.
With this newfound information of the "magical NOPs," he returned to his task with more questions than answers. The proposition seemed plausible, but something didn't feel right. He reviewed the code again and found the fabled NOPs; there were dozens of them. But the sorcery wasn't confined to just one location – smaller groups of NOPs were suspiciously scattered throughout the code.
Suspending disbelief, he sprinkled in a few of the prescribed incantations and tried running the changes again – victory was near at hand. When the system powered on this time, the monitor was filled with a familiar sight. Pitch black. The system had failed to boot. Undeterred, he tried adding more NOPs into different magical locations – a bunch here, a few there – but still the code failed. And so it was time after time, until after much trial and error, he finally stumbled upon the correct amount of magic. The processor was sufficiently enchanted by the NOPs and flawlessly booted the new changes. Success had been won, but the solution was unsatisfying; what were all those NOPs doing?
4 Finding the Real Problem
At this point, he could have checked in the "fix" and moved on, as the many others before him had done (as evidenced by the quality code they left behind).6 However, unsatisfied with the hack solution, he was determined to find the real problem. He pored over the documentation, learning more about the processor than anyone would ever care to know. And then he found it, the explanation behind all the troubles. With perfect clarity, the problem rolled over and showed its childish simplicity. The processor failed to boot because of code alignment.
For performance, microprocessors don't read instructions directly out of memory; they read them into various levels of cache. The cache reads and writes data in multi-word chunks called cache lines. For the best performance, instructions are aligned to these cache line boundaries; an instruction straddling a boundary requires two cache lines to service, which slows down execution. On this processor, however, alignment is more than a performance concern: the code must start on a cache line boundary; otherwise, the processor will crash.
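The alignment arithmetic can be sketched in a few lines of Python. This is only an illustration: the 32-byte cache line is an assumed value (the story never names the processor or its line size), and the function names are hypothetical.

```python
CACHE_LINE_BYTES = 32  # assumed line size; the real processor's value is not given


def is_line_aligned(address, line_bytes=CACHE_LINE_BYTES):
    """Return True when the address sits exactly on a cache line boundary."""
    return address % line_bytes == 0


def lines_touched(address, size, line_bytes=CACHE_LINE_BYTES):
    """Count the cache lines spanned by `size` bytes starting at `address`."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return last_line - first_line + 1


if __name__ == "__main__":
    # A 4-byte item on a boundary needs one line; one that starts
    # 2 bytes before a boundary straddles it and needs two.
    print(is_line_aligned(0x1000), lines_touched(0x1000, 4))
    print(is_line_aligned(0x101E), lines_touched(0x101E, 4))
```

An address is aligned when the remainder after dividing by the line size is zero, and a block straddles a boundary when its first and last bytes fall in different lines.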
The magical NOPs were padding the instructions so they aligned with a cache line boundary. Adding a new memory region changed code in a couple of places, but also added constants before the code which offset everything following it – the assembly wasn't divided into sections or specified with any alignment requirements. Adding a four-byte data constant shifted the entire instruction stream by four bytes, and if the first instruction was pushed off a cache line boundary, it would fail. Earlier changes had stayed aligned because they merely altered constant values, added instructions in places that didn't matter, or benefited from dumb luck, which explains why the bug survived for such a long time.
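To make the shift concrete, here is a rough Python sketch of the bookkeeping the NOPs were doing by hand. The 32-byte cache line, the 4-byte NOP, the addresses, and the `nops_needed` helper are all assumptions for illustration, not details from the original code.

```python
CACHE_LINE_BYTES = 32  # assumed cache line size
NOP_BYTES = 4          # assumed size of one NOP instruction


def nops_needed(entry_address, line_bytes=CACHE_LINE_BYTES, nop_bytes=NOP_BYTES):
    """Return how many NOPs pad `entry_address` up to the next line boundary."""
    misalignment = entry_address % line_bytes
    if misalignment == 0:
        return 0  # already aligned, no padding required
    gap = line_bytes - misalignment
    if gap % nop_bytes != 0:
        raise ValueError("gap cannot be filled with whole NOPs")
    return gap // nop_bytes


if __name__ == "__main__":
    # Hypothetical layout: 64 bytes of constants precede the code.
    print(nops_needed(0x40))      # entry on a boundary: 0 NOPs
    # One extra 4-byte constant pushes the entry point to 0x44.
    print(nops_needed(0x40 + 4))  # 28 bytes to the next boundary: 7 NOPs
```

Under these assumptions, each added constant changes the required count, which is why the "right amount of magic" had to be rediscovered after every modification.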
After adding in the proper alignment statements, the engineer exorcised the magical NOPs and the code ran perfectly after every change, thus securing peace and sanity for future maintainers.
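An alignment statement asks the assembler to insert whatever padding is needed, so nobody has to count NOPs. The story does not say which assembler or directive was used; in the GNU assembler, for example, a directive such as `.balign 32` serves this purpose. The Python sketch below models the idea with a hypothetical `layout` function and an assumed 32-byte boundary.

```python
def align_up(address, boundary):
    """Round `address` up to the next multiple of `boundary`."""
    return (address + boundary - 1) // boundary * boundary


def layout(items, base=0):
    """Assign addresses to a toy listing and return {label: address}.

    Each item is ("data", size), ("align", boundary), or ("label", name).
    """
    address = base
    labels = {}
    for kind, value in items:
        if kind == "data":
            address += value
        elif kind == "align":
            address = align_up(address, value)  # padding inserted automatically
        elif kind == "label":
            labels[value] = address
        else:
            raise ValueError(f"unknown item kind: {kind}")
    return labels


if __name__ == "__main__":
    for constants in (64, 68, 72):  # growing block of constants before the code
        entry = layout([("data", constants), ("align", 32), ("label", "entry")])["entry"]
        print(constants, hex(entry), entry % 32 == 0)
```

However many bytes of constants precede it, the entry label always lands on a boundary, which is the property the hand-placed NOPs could only provide by trial and error.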
This story contains many important lessons – too many to count. I hope that sharing this experience imparts some of its wisdom, or provides solace to other programmers who have been in similar situations. Comments and feedback are welcome via email. ✚
1. This is the easiest and smartest thing to do. The "bring-up" phase of a new product, getting a new hardware design up and running, is a difficult process. New PCB designs may have wiring mistakes, signal integrity issues, power issues, etc. Additionally, each individual board may have manufacturing problems: cold solder joints, damaged parts, shorted pins, etc. And even with a properly designed and manufactured board, the microchips may have undocumented behaviors, unforeseen incompatibilities, overlooked requirements, etc. Because of all these hazards, a conservative product design begins with a previous project as a base, and only changes what is absolutely necessary.
2. A thousand-page manual for a processor is not uncommon, and they are only getting thicker. Products with longer feature lists are more attractive to sales, and as such, there is no counterbalancing force to the pull of complexity. Such complex products are fertile sources of bugs.
3. The lights really were blinding and the fans deafening. While the system was booting, the brightness and speed were set to maximums until the software woke up. The LEDs were a few lumens shy of the sun and left spots in your eyes if you glanced at them for too long. The fans were a few decibels shy of a jet engine, necessitating raised voices and lip reading for successful communication.
4. Contrary to popular belief, newer toolchains are not necessarily better. An older codebase may (unknowingly) rely on particular compiler quirks or carefully work around particular issues. In a production product, every bug is a feature. New compilers promise better code generation, performance, and correctness, but they too have bugs like their predecessors (and different bugs at that). Sometimes, the devil you know is better than the devil you don't know.
5. Early development systems often use cheaper, pin-compatible parts to keep prototyping costs low. Even after the final product's hardware has been finalized, these development systems continue to float around because they are "good enough" for many tasks. However, sometimes they aren't.
6. This is not to disparage the former programmers, as their conditions, responsibilities, and external constraints are unknown to us. However, a "quick and dirty fix" made with the intention of properly fixing it later often turns into a "permanent fix."