Callbacks and Hot-Reloading Reloaded: Bring your own PLT

Now even closer to the metal, yet surprisingly simpler! 

In my previous post, I described how native code callbacks can be patched behind the caller’s back, which can be especially useful in a hot-reloading environment.

To perform this legerdemain, I had summoned an unruly flea circus of inline assembly instructions, built lots of ramps and a trampoline, and then let the code temporarily commandeer a CPU register so that the instruction pointer would, after some short initial leaps, jump far onto the final callback address. It worked, but ended up being so brittle that I decided to search for a better approach. Surprisingly, what I found turned out to be not only closer to the metal, but simpler to understand.

100% machine code
*actual machine code content may differ

The Unpredictability of Inline Assembly 

I knew I had a problem when my program started to crash once it was compiled with Clang. My original solution used inline assembly to get the compiler to build an elaborate collection of ramps onto a trampoline placed at a known offset. But GCC and Clang translated my inline assembly into slightly different machine code: Clang generated a few more bytes than GCC, which meant its ramps ended up being just a little bit too long. This made the instruction pointer overshoot the trampoline. Not a pretty sight.

“Look Ma, no Assembler!” 

And so, instead of piling on even more compiler-specific inline assembly, I decided to cut out the middleman - the inline assembler - and generate the machine code for the trampolines myself, by hand.

I got the idea while reading Chris Wellons’s blog post “C Closures as a Library”, where he describes how to “backport” the concept of closures to legacy C APIs. His post - amongst other things - shows how to add bytes of hand-coded machine code to a program.

How to write machine code by hand 

Generating these hand-coded machine bytes is not magic. On the contrary, it is just very, very tedious: First, you write out your assembly instructions on a piece of paper. Then, using the Intel 64 and IA-32 Architectures Software Developer’s Manual, you translate your assembly to a sequence of bytes.

How do these bytes ever get executed? Well, we store the machine code as an array of chars within the program. At runtime, we copy these bytes into a page of memory which we allocated using mmap().

We then bless this page by updating its permissions using mprotect() from read/write to read/execute. And – hocus pocus – we have transubstantiated these bytes from plain old data into executable code: from now on, we can call their address just like any other executable code.
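
To make this concrete, here is a minimal, self-contained sketch of that data-to-code step (not the trampoline code itself): it assumes an x86-64 Linux system, and the six hand-coded bytes encode mov eax, 42 followed by ret, i.e. a function that simply returns 42.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main( void ) {
	// Six bytes of machine code: mov eax, 42; ret
	unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

	// Get a read/write page and copy the bytes in ...
	char *page = mmap( NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
	if ( page == MAP_FAILED ) {
		return 1;
	}
	memcpy( page, code, sizeof( code ) );

	// ... then flip the page from read/write to read/execute.
	mprotect( page, 4096, PROT_READ | PROT_EXEC );

	// The bytes are now executable: call their address like any other function.
	int ( *fn )( void ) = ( int ( * )( void ) )page;
	printf( "%d\n", fn() ); // prints 42

	munmap( page, 4096 );
	return 0;
}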

How does it fit together? 

With what we know now, we can look at the problem with fresh eyes. What we want is an immutable callback address, which we can give our caller, together with the promise that this address is forever. But internally, we want to be able to hot-patch where calls to this address get forwarded to.

This means we need two things:

  1. an immutable trampoline address, which forwards to
  2. the address of the actual callback

How will this work? 

How these two elements work together is perhaps best explained with a diagram:

Schematic drawing
Calling a trampoline (1) forwards to a callback.

The address for a trampoline (1) will never change. But we want to be able to change where the trampoline forwards to.

For this, we must be able to write to where the trampoline looks up the forwarding address (2).

How can we implement this? 

Let’s begin by allocating two consecutive pages of memory.

In the bottom page we can store a tightly packed array of 64-bit forwarding addresses to our actual callbacks.

And what should go into the top page? Hand-coded trampolines. Lots of them.

Diagram showing page layout
Two consecutive memory pages: the top page contains machine code for trampolines.

Each trampoline has its own forwarding address which it can look up from the bottom page.

Lots of tiny, hand-coded trampolines 

By now, it should be clear that we’re not expecting the world from our trampolines: all we want them to do is to look up a forwarding address, and then jump to it.

This minimalism pays off: let’s say we find a way to translate our trampoline into 8 or fewer bytes of machine code. This would allow us to fit the same number of trampolines into the top page as there are forwarding address entries in the bottom page. And what’s more, the relative lookup address for each trampoline will be identical, which means that all trampolines can be copies of the same code.
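
Expressed as a small, purely illustrative sketch (the names page_1 and page_size mirror the proof of concept further down), the correspondence looks like this:

#include <assert.h>
#include <stddef.h>

// Sketch: with 8-byte trampolines and 8-byte pointer slots, trampoline i and its
// forwarding slot share the same index and always sit exactly one page apart.
// page_1 is the start of the top (code) page; the bottom (data) page follows it.
void *slot_for_trampoline( char *page_1, size_t page_size, size_t i ) {
	char *trampoline_i = page_1 + 8 * i;             // lives in the executable top page
	char *slot_i       = page_1 + page_size + 8 * i; // same index, writable bottom page
	assert( ( size_t )( slot_i - trampoline_i ) == page_size ); // constant, independent of i
	return slot_i;
}

Because that distance is constant, the same RIP-relative offset works for every trampoline.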

And what should the 8 bytes of code for a trampoline be exactly? Well, they would have to represent the machine code for the following assembly instruction:

 jmp [ rip + offset ]    ; we still need to calculate the offset

In English, this translates to: “De-reference the address you get when adding offset to the instruction pointer register (rip) value, then jump to that.”

Note that we use RIP-relative addressing, which means we don’t have to worry about where our code ends up in memory - as long as it gets executed, the lookup table will always be just one page length away. Keep in mind that rip holds the address of the instruction following the jmp, which is why the offset will have to account for the length of the jmp instruction itself.

Let’s assemble the snippet above to machine code by looking up how each instruction translates using a handbook – actually, let’s cheat by using the asm utility on Linux, which is part of pwntools (and can be installed in Python via pip install pwn).

tim@hopper ~> asm "jmp [rip+0]" --context amd64 -f hex
ff2500000000 

We can see that this instruction is 6 bytes long, and that the last four bytes are used to encode the offset. Perfect! This allows us to calculate the correct offset: One page size minus the length of the current jmp instruction (6 bytes) will point to the correct jump table entry, which, assuming the default page size of 4096 bytes, gives us the following byte sequence for our trampoline:

unsigned char thunk[8] = {
    0xff, // Jump
    0x25, // RIP-relative. Bit pattern: Mod: 00, REG: 100, R/M: 101
    0xfa, // 4090 in little-endian notation
    0x0f, // -"-
    0x00, // -"-
    0x00, // -"-
    0x00, // Padding #1
    0x00, // Padding #2
};

A Proof of Concept 

To show all this in context, I’ve written a small, pure-C, proof of concept. This should work in a C++ context, too.

#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/mman.h>
#include <string.h>

static void say( char const *utterance ) {
	puts( "I say: " );
	puts( utterance );
}

static void shout( char const *utterance ) {
	puts( "I shout: " );
	puts( utterance );
}

int main( void ) {

	typedef void( say_fun_t )( char const * );

	size_t page_size = sysconf( _SC_PAGESIZE );

	// Let's get two pages of memory
	//
	// The idea is to have one page of executable trampolines,
	// followed by a data page of jump target addresses.
	//
	// If we want to redirect a callback, only the target address needs to be
	// updated, the callback function pointer can stay the same.
	//
	// We use mmap, so that we can be sure that our allocations
	// happen at page boundaries.
	char *page_1 = mmap( NULL, page_size * 2, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
	char *page_2 = page_1 + page_size;

	if ( page_1 == MAP_FAILED ) {
		puts( "Error mapping page." );
		return ( 1 );
	}

	// --------| Invariant: mapping successful.

	{

		unsigned char thunk[ 8 ] = {
		    0xff, // Jump
		    0x25, // RIP-relative. Bit pattern: Mod: 00, REG: 100, R/M: 101
		    0xfa, // 4090 in little-endian notation
		    0x0f, // -"-
		    0x00, // -"-
		    0x00, // -"-
		    0x00, // Padding #1
		    0x00, // Padding #2
		};

		// Patch offset in thunk - we don't want to assume a page size.
		{
			int32_t *offset = ( int32_t * )&thunk[ 2 ];

			*offset = page_size; // Corresponding entries are one page size apart
			*offset -= 6;        // But must subtract length of current (jmp) instruction
		}

		// Copy instructions for jmp into page1
		for ( char *dst = page_1; dst != page_2; dst += sizeof( thunk ) ) {
			memcpy( dst, thunk, sizeof( thunk ) );
		}

		// Write jump targets - these addresses are indirect pointers
		// to the callback function.
		// They start at the beginning of the second page.
		void **jump_table = ( void ** )( page_2 );

		// Trampolines start at page_1
		char *trampolines = ( page_1 );

		// Set exec bit, but remove write bit from our plt
		mprotect( page_1, page_size, PROT_READ | PROT_EXEC );

		// ------ callback trampolines have been set up - let's test them

		const int CALLBACK_ENTRY_INDEX = 1;

		// Set forwarder for entry CALLBACK_ENTRY_INDEX to
		// function `say`
		jump_table[ CALLBACK_ENTRY_INDEX ] = say;

		// Grab function pointer to trampoline forwarding to callback entry 1
		// We make it a const function pointer to emphasise that it can't change,
		// but you don't have to.
		say_fun_t *const say_callback = ( say_fun_t * )( trampolines + 8 * CALLBACK_ENTRY_INDEX );

		// Let's pretend the caller calls the (original) callback
		say_callback( "Hello" );

		// Now, let's change the callback's target address
		// without telling the caller:
		jump_table[ CALLBACK_ENTRY_INDEX ] = shout;

		// Let's pretend the caller calls the callback again:
		// this time the new method gets invoked.
		say_callback( "Hello again." );
	}

	// Note that this unmaps page 2 as well.
	int result = munmap( page_1, page_size * 2 );

	if ( result == -1 ) {
		puts( "there was an error while unmapping." );
		return ( 1 );
	}
}
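
Assuming the listing is saved as callbacks.c (a file name chosen just for this example), it can be built and run like this on an x86-64 Linux machine; the second call goes through the patched entry:

$ cc callbacks.c -o callbacks
$ ./callbacks
I say: 
Hello
I shout: 
Hello again.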

Summary 

Callback hot-patching works similarly to a “virtual GOT/PLT”:

First, we allocate two pages of memory, a top page and a bottom page. The bottom page contains a tightly packed array of 64-bit forwarding addresses. The top page contains code for trampolines. Each trampoline is made up of 8 bytes of hand-written machine code, which contains just enough logic to look up a forwarding address and jump to it.

Once the trampoline code has been copied into the top page, its permissions are changed to read-only/execute.

We made sure that each trampoline in the top page is exactly 8 bytes in length. Similarly, each forwarding address in the bottom page is 8 bytes in length. Because of this, the number of trampolines in the top page can be exactly the same as the number of forwarding address slots in the bottom page.

Furthermore, the offset between each trampoline and its corresponding forwarding address is always exactly one page size. This allows us to keep the hand-written code for our trampolines simple: all trampolines are copies of the same 8 bytes of machine code.

If we later want to patch a callback address, we only have to update its entry in the bottom page - the public address for the callback (somewhere within the top page) stays the same.

Further work 

Note that the mechanism I described is similar to how procedure linkage tables (PLT) and global offset tables (GOT) work on Linux systems with position-independent executables (PIE). Chris Wellons suggested it could be really nice to somehow harness the program’s actual PLT - a very elegant idea, which would allow the linker to update the target addresses automatically upon hot-reloading.

Unfortunately the architecture I chose for Island, my hot-reloading Vulkan graphics engine, prevents me from using the PLT, and so I haven’t pursued this particular approach as far as it could possibly be pushed.

Island preview image
If you’re interested in how I applied the method described in this post to the Island codebase, I recommend you take a look at the relevant lines in the source code for Island’s core module, le_core.cpp, on GitHub.

Thanks 

I’d like to thank Chris Wellons for feedback on my previous blog post, and for inspiring this current approach with his writing.

I’d also like to thank Doug Binks for feedback on my previous blog post. His approach to runtime-compiled C++, RCC++, has been a great source of ideas and inspiration. His collection of Runtime Compiled C & C++ Solutions is a fantastic resource.

And, finally, I’d like to thank Andrew Healey for his encouragement, and for being a second pair of eyes on an earlier draft of this post. All remaining errors and omissions are, of course, my own.

