In my previous post, I described how native code callbacks can be patched behind the caller's back, which can be especially useful in a hot-reloading environment.
To perform this legerdemain, I had summoned an unruly flea circus of inline assembly instructions, built lots of ramps and a trampoline, and then let the code temporarily commandeer a CPU register so that the instruction pointer would, after some short initial leaps, jump far onto the final callback address. It worked, but ended up being so brittle that I decided to search for a better approach. Surprisingly, what I found turned out to be not only closer to the metal, but simpler to understand.
I knew I had a problem when my program started to crash once it was compiled with Clang. My original solution used inline assembly to get the compiler to build an elaborate collection of ramps onto a trampoline placed at a known offset. But GCC and Clang translated my inline assembly into slightly different machine code: Clang generated a few more bytes than GCC, which meant its ramps ended up just a little bit too long. This made the instruction pointer overshoot the trampoline. Not a pretty sight.
And so, instead of piling on even more compiler-specific inline assembly, I decided to cut out the middleman (that is, the inline assembler) and to generate the machine code for the trampolines myself, by hand.
I got the idea while reading Chris Wellons's blog post “C Closures as a Library”, where he describes how to “backport” the concept of closures to legacy C APIs. His post, amongst other things, shows how to add bytes of hand-coded machine code to a program.
Generating these hand-coded machine bytes is not magic. On the contrary, it is just very, very tedious: first, you write out your assembly instructions on a piece of paper. Then, using the Intel 64 and IA-32 Architectures Software Developer's Manual, you translate your assembly into a sequence of bytes.
How do these bytes ever get executed? Well, we store the machine code as an array of chars within the program. At runtime, we copy these bytes into a page of memory which we allocated using mmap(). We then bless this page by updating its permissions from read/write to read/execute using mprotect(). And – hocus pocus – we have transubstantiated these bytes from plain old data into executable code: from now on, we can call their address just like any other executable code.
With what we know now, we can look at the problem with fresh eyes. What we want is an immutable callback address, which we can give our caller, together with the promise that this address is forever. But internally, we want to be able to hot-patch wherever we forward to when this address gets called.
This means we need two things: a fixed, executable trampoline for each callback, whose address we hand out publicly, and a private, writable table of forwarding addresses which the trampolines read from.
How these two elements work together is perhaps best explained with a diagram:
Let's begin by allocating two consecutive pages of memory.
In the bottom page, we can store a tightly packed array of 64-bit forwarding addresses to our actual callbacks.
And what should go into the top page? Hand-coded trampolines. Lots of them.
By now, it should be clear that we're not expecting the world from our trampolines: all we want them to do is to look up a forwarding address, and then jump to it.
This minimalism pays off: let's say we find a way to translate our trampoline into 8 or fewer bytes of machine code. This would allow us to fit the same number of trampolines into the top page as there are forwarding address entries in the bottom page. And what's more, the relative lookup address for each trampoline will be identical, which means that all trampolines can be copies of the same code.
And what should the 8 bytes of code for a trampoline be exactly? Well, they would have to represent the machine code for the following assembly instruction:
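The instruction in question, judging from the description that follows, is a single rip-relative indirect jump; reconstructed in Intel syntax it would read roughly as follows (with offset a placeholder to be determined below):

```asm
; look up a 64-bit address at rip + offset, then jump to it
jmp qword ptr [rip + offset]
```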
In English, this translates to: “De-reference the address you get when adding offset to the instruction pointer register (rip) value, then jump to that.”
Note that we use rip-relative addressing, which means we don't have to worry about where our code ends up in memory: as long as it gets executed, the lookup table will always be just one page length away.
Let's assemble the snippet above into machine code by looking up how each instruction translates using a handbook – actually, let's cheat by using the asm utility on Linux, which is part of pwntools (and can be installed in Python via pip install pwn).
We can see that this instruction is 6 bytes long, and that the last four bytes are used to encode the offset. Perfect! This allows us to calculate the correct offset: One page size minus the length of the current jmp instruction (6 bytes) will point to the correct jump table entry, which, assuming the default page size of 4096 bytes, gives us the following byte sequence for our trampoline:
To show all this in context, I've written a small, pure-C, proof of concept. This should work in a C++ context, too.
Callback hot-patching works similarly to a “virtual GOT/PLT”:
First, we allocate two pages of memory, a top page and a bottom page. The bottom page contains a tightly packed array of 64-bit forwarding addresses. The top page contains code for trampolines. Each trampoline is made up of 8 bytes of hand-written code, which contains just enough logic to look up a forwarding address and jump to it.
Once the trampoline code has been copied into the top page, its permissions are changed to read/execute.
We made sure that each trampoline in the top page is exactly 8 bytes in length. Similarly, each forwarding address in the bottom page is 8 bytes in length. Because of this, the number of trampolines in the top page can be exactly the same as the number of forwarding address slots in the bottom page.
Similarly, the offset between each trampoline and its corresponding forwarding address is always one page size. This allows us to keep the hand-written code for our trampolines simple: all trampolines are copies of the same 8 bytes of machine code.
If we later want to patch a callback address, we only have to update its entry in the bottom page; the public address for the callback (somewhere within the top page) stays the same.
Note that the mechanism I described is similar to how procedure linkage tables (PLT) and global offset tables (GOT) work on Linux systems with position-independent executables (PIE). Chris Wellons suggested it could be really nice to somehow harness the program's actual PLT - a very elegant idea, which would allow the linker to update the target addresses automatically upon hot-reloading.
Unfortunately the architecture I chose for Island, my hot-reloading Vulkan graphics engine, prevents me from using the PLT, and so I haven't pursued this particular approach as far as it could possibly be pushed.
I'd also like to thank Doug Binks for his feedback on my previous blog post. His approach to runtime-compiled C++, RCC++, has been a great source of ideas and inspiration, and his collection of Runtime Compiled C & C++ Solutions is a fantastic resource.
And, finally, I'd like to thank Andrew Healey for his encouragement, and for being a second pair of eyes on an earlier draft of this post. All remaining errors and omissions are, of course, my own.