Hardware-Accelerated Bounds Checking with Watchpoints

Author	Kevin Nygaard
Published	17 July 2022

A watchpoint is a processor-dependent feature that halts program execution when a specified address is accessed. Bookending memory allocations with watchpoints protects the enclosed region from overflow and underflow, implementing efficient, hardware-based bounds checking.

1 Bounds Checking
2 Watchpoints
- 2.1 macOS 64-bit Arm
- 2.2 macOS 64-bit x86
- 2.3 Linux 64-bit x86
- 2.4 Windows 64-bit x86
3 Performance
4 Other Uses and Limitations

1 Bounds Checking

In programming languages that manipulate raw memory, reading and writing data using correct and valid addresses is the programmer's responsibility. At best, failure results in the program crashing; a bug either directly or indirectly causes an invalid access – reading from an unallocated region, writing to a read-only region, etc¹ – which the operating system catches and uses as justification to terminate the program, preventing it from causing more problems. At worst, failure results in data loss or a security issue that remains undetected for long periods of time (eg corrupted backups, Heartbleed). Since these errors occur within valid program memory, the operating system cannot detect them; they propagate into the program's use case.

A common mistake is ignoring memory allocation sizes² (eg allocating a 100 item array, but storing 120 items into it). Respecting the allocation size using a bounds check ensures accesses are confined within the valid region:

if (Address < LowerBound || Address > UpperBound) {
    InvalidAccess(Address);
} else {
    ReadOrWrite(Address);
}

Although correct and succinct, scattering these checks throughout a program costs execution time, so much so that programmers omit them when possible. Furthermore, the prevalence of memory-related security issues proves that a program "works" without bounds checks, needing them only for pathological cases – supporting special scenarios degrades normal program performance.³

However, the processor has dedicated hardware which eliminates these explicit checks, restoring performance and covering pathological cases.⁴

2 Watchpoints

Aiding in processor validation and program development, processors have special circuitry for controlling and monitoring instruction execution and memory accesses. Writing the address of an instruction into a breakpoint register raises an exception when that particular instruction is executed. Similarly, writing the address of a data word into a watchpoint register raises an exception when that particular word is accessed.⁵

For sequentially accessed memory buffers, there are only two boundary conditions: reading from an empty buffer, causing an underflow; and writing to a full buffer, causing an overflow. By marking these boundaries with watchpoints, the processor raises an exception when these addresses are accessed – bounds checking, with zero execution time.

Using this feature depends on the processor and operating system. The major processor architectures (x86 and Arm) support watchpoints, and the major operating systems (macOS, Linux, and Windows) expose them for program usage.⁶ In the examples below, a watchpoint is placed on Foo, raising a SIGTRAP signal whenever the variable is read or written.

2.1 macOS 64-bit Arm

For security, watchpoint registers are inaccessible via user space. Instead, they are accessed through the kernel, which maintains the register sets of each running process.⁷

#include <mach/mach.h> 
uint64_t Foo;
int main(int ArgCount, char *Args[]) {
    thread_act_port_array_t Threads;
    mach_msg_type_number_t  ThreadCount;
    arm_debug_state64_t     State;
    thread_state_flavor_t   Type = ARM_DEBUG_STATE64;
    mach_msg_type_number_t  Size = ARM_DEBUG_STATE64_COUNT;
    task_threads(mach_task_self(), &Threads, &ThreadCount);
    thread_get_state(*Threads, Type, (void *)&State, &Size);
    State.__wvr[0] = (uint64_t)&Foo;
    State.__wcr[0] = 0x1  << 0  // Enable
                   | 0x3  << 3  // RW
                   | 0xff << 5; // Any byte
    thread_set_state(*Threads, Type, (void *)&State, Size);
    Foo = 42;
    return 0;
}

mach_task_self() returns a kernel handle for the current process, which is used with thread_get_state() to return an array of handles, one for each thread in the process. thread_get_state() and thread_set_state() respectively read and write a thread's register set; the watchpoint registers are stored with the other debug registers under the ARM_DEBUG_STATE64 category. The watchpoint value register (DBGBWVR0_EL1) holds the target address, and the watchpoint control register (DBGBWCR0_EL1) enables the watchpoint and sets the trigger condition (both reads and writes on all bytes within a 64-bit word).

See §D2.10 on pD2-4690 in Arm's Manual for more details on encoding and behavior. See the Apple and MIT resources for more details on the Mach kernel (caveat lector, the documentation is scant).

2.2 macOS 64-bit x86

The Intel version follows closely to the Arm-based example above, with x86 replacing ARM. Each processor has different watchpoint capabilities and encodings, but the basic concept is the same.

#include <mach/mach.h> 
uint64_t Foo;
int main(int ArgCount, char *Args[]) {
    thread_act_port_array_t Threads;
    mach_msg_type_number_t  ThreadCount;
    x86_debug_state64_t     State;
    thread_state_flavor_t   Type = x86_DEBUG_STATE64;
    mach_msg_type_number_t  Size = x86_DEBUG_STATE64_COUNT;
    task_threads(mach_task_self(), &Threads, &ThreadCount);
    thread_get_state(*Threads, Type, (void *)&State, &Size);
    State.__dr0 = (uint64_t)&Foo;
    State.__dr7 = 0x1 <<  0  // Enable
                | 0x1 << 10  // Reserved
                | 0x3 << 16  // RW
                | 0x2 << 18; // Any byte
    thread_set_state(*Threads, Type, (void *)&State, Size);
    Foo = 42;
    return 0;
}

See §17 on p17-1 in Volume 3B of Intel's Manual for more details on encoding and behavior.

2.3 Linux 64-bit x86

The following example is for a 64-bit x86 processor running Ubuntu; other distributions may use a different approach. Access to a program's register set is done via ptrace(2).

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>
long Foo;
int main(int ArgCount, char *Args[]) {
    pid_t Pid = fork();
    if (!Pid) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        raise(SIGSTOP);
        Foo = 42;
        return 0;
    }
    waitpid(Pid, 0, 0);
    ptrace(PTRACE_ATTACH, Pid, 0, 0);
    long *Regs = (void *)offsetof(struct user, u_debugreg);
    long Dr0 = (long)&Foo;
    long Dr7 = 0x1 <<  0  // Enable
             | 0x1 << 10  // Reserved
             | 0x3 << 16  // RW
             | 0x2 << 18; // Any byte
    ptrace(PTRACE_POKEUSER, Pid, Regs + 0, Dr0);
    ptrace(PTRACE_POKEUSER, Pid, Regs + 7, Dr7);
    ptrace(PTRACE_DETACH, Pid, 0, 0);
    return 0;
}

fork(2) causes the program to split into two: the child and the parent, both executing simultaneously and in an undefined order.

Receiving a zero for Pid, the child enables itself for tracing by calling ptrace() with PTRACE_TRACME, then calls raise(3) with SIGSTOP to pause execution, waiting for its parent (the tracer) to attach.

Receiving the child's process id for Pid, the parent calls waitpid(2) to wait for the child's SIGSTOP. Then with ptrace(), it attaches to the child (the tracee) with PTRACE_ATTACH, and writes one watchpoint register at a time with PTRACE_POKEUSER. (The encoding is identical to the x86 macOS example above.) After, PTRACE_DETACH releases the child for normal execution.⁸

Unfortunately, a program cannot trace itself (the tracee must be halted for most ptrace commands), and a child cannot trace its parent (for some unknown reason), which forces the program of interest to be a child.

See the resources on ptrace() and other system calls for more details.⁹

2.4 Windows 64-bit x86

The Windows example is similar to the macOS version, but delightfully spartan:

#include <windows.h>
DWORD64 Foo;
int main(int ArgCount, char *Args[]) {
    HANDLE Thread = GetCurrentThread();
    CONTEXT Context;
    Context.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    GetThreadContext(Thread, &Context);
    Context.Dr0 = (DWORD64)&Foo;
    Context.Dr7 = 0x1 <<  0  // Enable
                | 0x1 << 10  // Reserved
                | 0x3 << 16  // RW
                | 0x2 << 18; // any byte
    SetThreadContext(Thread, &Context);
    Foo = 42;
    return 0;
}

GetCurrentThread() returns a handle for the current thread, which is used with GetThreadContext() to load the thread's register set into Context. The CONTEXT_DEBUG_REGISTERS flag tells SetThreadContext to only write the debug registers. (The encoding is identical to the x86 macOS example above.)

See the Microsoft resources for more details (here too, documentation is scant, with important details hidden in header files).

3 Performance

For evaluating performance of watchpoint-based bounds checks, consider the following example. It simulates a stack machine – a non-trivial problem requiring bounds checking:

for (;;) {
    StackIndex += Commands[CommandIndex];
    if (StackIndex <= 0 || StackIndex >= StackCount - 1) {
        break;
    }
    Stack[StackIndex] = CommandIndex;
    CommandIndex = CommandIndex + 1 & (CommandCount - 1);
}

Commands contains values of either -1 or 1, CommandCount is a power of 2, and StackIndex is initialized to the middle of Stack. The first and last elements of Stack are invalid locations. On each iteration, the command either increments or decrements the stack by one element, and after the bounds check, writes CommandIndex to the new location if it is valid. CommandIndex is then incremented and clamped for wraparound, repeatedly executing the same sequence of commands forever.¹⁰

For well-formed command sequences, the stack never overflows or underflows, making bounds checking worthless. However, poorly-formed command sequences overflow or underflow the stack, requiring bounds checking.¹¹

Extracting just the loop portion, an optimizing compiler emits 9 instructions for the code above:

100003e28: 89 5b 28 f8 	str	x9, [x28, w8, uxtw #3]
100003e2c: 49 05 00 11 	add	w9, w10, #1
100003e30: 29 41 40 92 	and	x9, x9, #0x1ffff
100003e34: ea 03 09 aa 	mov	x10, x9
100003e38: 2b 6b e9 38 	ldrsb	w11, [x25, x9]
100003e3c: 08 01 0b 0b 	add	w8, w8, w11
100003e40: 0b 05 00 51 	sub	w11, w8, #1
100003e44: 7f 01 18 6b 	cmp	w11, w24
100003e48: 03 ff ff 54 	b.lo	0x100003e28 <_main+0x1b4>

Time-profiling over 100 runs, this loop executed in 730 ms on average (standard deviation 3.5 ms).

Removing the explicit bounds check and using watchpoints instead, the loop is compressed to 7 instructions:

100003df8: ea 03 09 2a 	mov	w10, w9
100003dfc: 6b 6a ea 38 	ldrsb	w11, [x19, x10]
100003e00: 08 01 0b 0b 	add	w8, w8, w11
100003e04: ca da 28 f8 	str	x10, [x22, w8, sxtw #3]
100003e08: 29 05 00 11 	add	w9, w9, #1
100003e0c: 29 41 00 12 	and	w9, w9, #0x1ffff
100003e10: fa ff ff 17 	b	0x100003df8 <_main+0x200>

Time-profiling over 100 runs, this loop executed in 676 ms on average (standard deviation 3.6 ms).

With a difference of 54 ms (standard deviation 0.8 ms), the watchpoint solution ran 7.4% (standard deviation 0.1%) faster than the naive solution: a modest improvement. Tighter loops will see greater performance gains, as will critical paths with more expensive checks.

See the resources for the full benchmark program source code, assembly listings, machine details, and raw results.

4 Other Uses and Limitations

Watchpoint-based bounds checking does offer free performance, but comes with list of exceptions and conditions:

Watchpoints are dependent on both the processor and operating system: the hardware must have the feature, and the software must expose it
When they are supported, setup may be slow (eg many system calls) or impose other requirements (eg forking on Linux)
Each side of a boundary check requires one watchpoint, of which there are relatively few (4 on x86, 16 on Arm)
Each end must also be padded by at least a word; memory before and after an array are used validly elsewhere
Only the boundary is checked: jumping past watchpoints is undetectable
When errors are detected, signal-driven logic makes graceful recovery difficult
Using a debugger is also difficult, as they too use the watchpoints
They add complexity

This list is not exhaustive; it shows watchpoints are nuanced and require careful thought and planning for correct usage.

Looking forward, compilers could treat watchpoints like another register resource, automatically using them for bounds checking and other creative uses. Beyond twisting them into bounds checkers, watchpoints are more commonly found in debuggers like lldb and gdb, although a program could provide its own debugging capabilities; the example code is essentially the skeleton of a debugger.¹²

I hope you've found this article on watchpoints and bounds checking useful. Feedback and corrections are welcome via email. ✚

Footnotes

↑ The processor uses another dedicated piece of hardware, the Memory Management Unit (MMU), to efficiently make these checks. We'll explore bounds checking with virtual memory some other time.
↑ According to MITRE's Common Weakness Enumeration (CWE) report, out-of-bounds reads and writes are ranked number 5 and number 1 respectively for the "most common and impactful software weaknesses."
↑ Attempting to speed up the normal case by conditionally skipping bounds checks is counterproductive, particularly in loops. It adds another branch instruction and another value to test.
↑ …and it's not MPX. Intel processors had hardware-assisted bounds checking called Memory Protection Extensions (MPX), but it had issues. MPX is now unsupported in 64-bit mode and deprecated. For operational details, see chapter 17 in Volume 1 of Intel's Manual.
↑ On x86 processors, breakpoints and watchpoints use the same resources, of which there are only 4. On the other hand, Arm has separate resources for breakpoints and watchpoints, each having 16.
↑ OpenBSD, sadly, does not.
↑ This is how it performs preemptive multitasking. After a process expends its allotted time slice, the kernel saves its registers, swaps in the registers of the next scheduled process, and allows it to execute for a time slice.
↑ In this example, the parent tragically terminates after detaching, orphaning the child. Fortunately, the infant is adopted by the good-natured init process, tending to its signals and exit status. Unfortunately, any abnormal return status is hidden from plain view (the parent's shell, in particular)
↑ When possible, the resources link to OpenBSD's documentation, as it is better maintained and similar enough. For authoritative details, refer to the documentation included on your system.
↑ Its useless and impractical, but this article isn't about stack machines.
↑ This is another instance of the halting problem. Even if a sophisticated compiler could analyze the command sequence, it cannot handle sequences of indefinite length (streaming).
↑ You may accidentally create the world's best debugger – both gdb and lldb set a low bar in terms of usability.

Further Resources

Apple. Kernel Programming Guide: Mach Overview. 2013.
Chiueh, Tzi-cker. Fast Bounds Checking Using Debug Register (pdf). 2008.
⸻. XNU Source Code. 2022.
Arm. Arm® Architecture Reference Manual for A-profile architecture (pdf). 2022.
GNU. ptrace(2). 2021.
Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture (pdf). 2022.
⸻. Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2 (pdf). 2022.
MIT. Mach IPC Interface. 2022.
MITRE. 2022 CWE Top 25 Most Dangerous Software Weaknesses. 2022.
Microsoft. GetThreadContext. 2022.
⸻. SetThreadContext. 2022.
OpenBSD. fork(2). 2015.
⸻. raise(3). 2015.
⸻. waitpid(2). 2021.
Wikipedia. Bounds checking. 2022.