Callbacks and Hot-Reloading: Must JMP through extra hoops

Callbacks and hot-reloading were not meant for each other. 

In this post I describe how, by assembling a little theme park of assembly instructions, some sinister, some not, but all well intentioned, we can coerce the two to find each other.

Requesting a callback is a promise that someone, in the future, will be there to pick up the call. This promise is easily broken in a hot-reloading scenario: every time a module is recompiled and reloaded, the updated callback handlers get mapped to new addresses, and any calls to old, stale callback addresses crash the program.

Imagine a situation where you have an external library, say GLFW, that triggers a callback whenever the user presses a key. The callback is set up when GLFW is loaded and the window is created. If the module that implements the callback (le_window in the illustration above) gets hot-reloaded, calls from GLFW will still go to the old address, and that, dear friend, is shaking the tree of pain.

We could tell external libraries to update their callbacks as soon as we detect hot-reloading has happened. But this is not always possible.

Wouldn't it be nice to have a solution where we can hide all these internal, embarrassing hot-reloading details from external libraries so we don't have to come back every time that hot-reloading has happened?

A magical proxy 

We could solve this by using a proxy or trampoline function. Such a function would need to figure out the current address of the callback handler and forward any calls (including parameters) to this final address. And while this final address can change, the proxy function itself must remain in the same place for the full duration of the program.
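To make the idea concrete, here is a minimal sketch of such a proxy in plain C++. It is deliberately limited: it assumes one fixed, simplified callback signature and a single slot, and all names (`key_forwarder`, `set_key_handler`, `handler_v1`, `handler_v2`) are made up for illustration. It only shows the principle that the address handed to the client stays put while the target behind it changes.

```cpp
#include <cassert>

// Hypothetical, simplified signature; a real key callback takes more parameters.
using key_handler_fn = int ( * )( int key );

// Slot which a hot-reloader would overwrite after each reload.
static key_handler_fn current_key_handler = nullptr;

// The proxy: this is the address we would hand to the external library.
// It must live in a part of the program that never gets reloaded.
int key_forwarder( int key ) {
    assert( current_key_handler != nullptr && "no handler registered yet" );
    return current_key_handler( key );
}

// Called (by an assumed reloader) whenever the real handler has moved.
void set_key_handler( key_handler_fn fn ) {
    current_key_handler = fn;
}

// Two stand-ins for "the handler before and after a reload".
int handler_v1( int key ) { return key + 1; }
int handler_v2( int key ) { return key * 2; }
```

A forwarder like this needs one hand-written function per callback signature, which is the limitation the rest of this post tries to get around.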

This rules out modern C++ trickery: perfect forwarding would still give us templated function implementations inside the compilation unit which sets the callback handler - the same compilation unit which almost certainly gets reloaded whenever the handler gets recompiled.

Okay, so we need to find a module or part of the program which never gets hot-reloaded. By process of elimination this leaves only one module: the module which actually implements the hot-reloading.

In Project Island, a hot-reloading Vulkan engine I've been working on for a little while, the le_core module fits the bill: because this core module deals with loading and reloading all other modules and contains the central API registry, it is the only module which cannot itself be reloaded.

Figure: le_core will always stay at the same memory address, regardless of all the hot-reloading going on around it. It is, in fact, responsible for all the hot-reloading. Why not make it responsible for callback forwarding, too, then? (Diagram not to scale)

So that's where our forwarder will need to live. But how do we get it to actually do what it needs to do?

We ruled out advanced C++ template magic a bit earlier. What's left? Perhaps we can use assembly?

When out of options, try assembly 

In assembly, a JMP instruction is used to jump to an address in code. In contrast to a CALL instruction, it leaves the stack exactly as it is. It just does what it says on the tin: it makes the instruction pointer jump to the target address.

If we can sell our client library the address of such a JMP instruction as the callback address, or maybe an address a little bit earlier, so that we can add a few assembly instructions to resolve the final address to jump to, we would be golden.

Let's write a POC 

Let's say the goal is to jump to the address that's held in the global variable target_addr inside le_core.cpp. We write an assembly method trampoline_func, which we can use as a proxy for any callback function. Any call to trampoline_func() shall get forwarded to target_addr().

// Final callback target, we want this method to be called 
// by a call to trampoline_func();
extern "C" void * target_addr; 

// Forward-declaration of our freestanding assembly function: 
// callback clients may call this function
extern "C" void trampoline_func(); 

// Our freestanding assembly function follows
asm(R"ASM(
.text

trampoline_func:
    jmp *target_addr

)ASM");

Here's what the compiler/linker has to say to that:

/usr/bin/ld: le_core/CMakeFiles/le_core.dir/le_core.cpp.o: relocation
R_X86_64_32S against symbol `target_addr' can not be used when making a shared
object; recompile with -fPIC 

Daw!! This does not work, because the inline assembler won't resolve the address of target_addr for us. le_core.cpp is actually compiled using -fPIC, so recompiling won't help. I've tried.

But because target_addr and trampoline_func() are part of the same compilation unit, we should be able to use rip-relative addressing to resolve target_addr.

And indeed, if I calculate the offset manually, and write some assembly that calculates the absolute target_addr by adding the offset to the %rip register, it works. But that's not practicable, because adding just one instruction will mess up that manually calculated offset, and I would have to recalculate and update the offset manually every time I wanted to add anything to le_core.cpp.

Maybe we can use extended assembly?

void * target_addr;

void trampoline_func(){
    asm( R"ASM(

    .text
        jmp *%0

    )ASM"   : /* deliberately empty */
            : "m" ( target_addr )
    );
}

Yes, with extended assembly, magically, we can use globals, and they will resolve to correct rip-relative addresses - the extended asm input variable target_addr, referred to in extended assembly code as “%0”, was translated into the first line of the following assembly listing:

0x7ffff7f87624  <+    4>  48 8b 05 5d e9 00 00  mov    0xe95d(%rip),%rax
0x7ffff7f8762b  <+   11>  ff 20                 jmpq   *(%rax)

Because the trampoline is now a regular C++ function, the compiler will have added a push rbp instruction as part of the function prologue. This messes up the stack for the callback function, but is easy to undo: we add a pop rbp instruction just before jumping to our real callback function.

0x7ffff7f87624  <+    4>  48 8b 05 5d e9 00 00  mov    0xe95d(%rip),%rax
0x7ffff7f8762b  <+   11>  5d                    pop    %rbp
0x7ffff7f8762c  <+   12>  ff 20                 jmpq   *(%rax)

A Trampoline and a Sled 

Next problem: How can we do this for multiple callback functions? Can we figure out a way to use the same forwarder function for each callback?

For the forwarder to be truly useful we need a way to store the actual final callback addresses in an array inside le_core.cpp, and have the forwarder function pick the correct target address. But based on what? We cannot somehow add an extra parameter to the proxy function, because we cannot change the number of parameters of the callback function.

We somehow have to encode an index with the function call, but we can't save any data with the function call - all we have is the address of the forwarder, which we give to the client.

Wait a moment - can we use this address to encode an index?

Yes we can: we can insert a sled (or should we call it a “ramp”?) at the beginning of trampoline_func, so that there are multiple entry points to the function, each entry point setting the %rbx register to a specific value. We can use that value to index into an array of addresses - we can look up the callback target function from our “phonebook” array stored in target_func_addr[], based on the index given in %rbx.

That way, to external libraries, we can hand out unique addresses for forwarder functions, with each address now encoding the index of the target function to call.

// a number of callback function pointers
void *target_func_addr[ 4 ] = {}; 

extern "C" void trampoline_func() {

// Using a sled to enter this function
// means we can use the entry point
// to calculate an offset.

asm(R"ASM(
    
    movq  $0x00, %rbx
    jmp sled_end
    movq  $0x08, %rbx
    jmp sled_end
    movq  $0x10, %rbx
    jmp sled_end
    movq  $0x18, %rbx
    jmp sled_end
    movq  $0x20, %rbx
    jmp sled_end

sled_end:
)ASM");

    asm( R"ASM(

.text

    /* pop %%rbp */
    add %%rbx, %%rax
    jmp *%0

)ASM" :
      : "m" ( target_func_addr[0] )
    );
};

void *core_get_callback_forwarder_addr(int target_index) {
    // Offsets as seen in the disassembly shown in this post:
    // +8 bytes to jump over the function prologue (`endbr64`,
    // `push rbp`, `mov rsp, rbp`). Each sled entry is 9 bytes in
    // size, therefore `target_index` controls which sled entry gets
    // used as entry point, which in turn controls the value ending
    // up in the %rbx register.
    return ( char * )&trampoline_func + ( 8 + 9 * target_index );
};

And this is the assembly code generated for void trampoline_func():

       47 [1]	extern "C" void trampoline_func() {
0x7ffff7fb9922            f3 0f 1e fa           endbr64
0x7ffff7fb9926  <+    4>  55                    push   %rbp
0x7ffff7fb9927  <+    5>  48 89 e5              mov    %rsp,%rbp
0x7ffff7fb992a  <+    8>  48 c7 c3 00 00 00 00  mov    $0x0,%rbx
0x7ffff7fb9931  <+   15>  eb 24                 jmp    0x7ffff7fb9957 <trampoline_func()+53>
0x7ffff7fb9933  <+   17>  48 c7 c3 08 00 00 00  mov    $0x8,%rbx
0x7ffff7fb993a  <+   24>  eb 1b                 jmp    0x7ffff7fb9957 <trampoline_func()+53>
0x7ffff7fb993c  <+   26>  48 c7 c3 10 00 00 00  mov    $0x10,%rbx
0x7ffff7fb9943  <+   33>  eb 12                 jmp    0x7ffff7fb9957 <trampoline_func()+53>
0x7ffff7fb9945  <+   35>  48 c7 c3 18 00 00 00  mov    $0x18,%rbx
0x7ffff7fb994c  <+   42>  eb 09                 jmp    0x7ffff7fb9957 <trampoline_func()+53>
0x7ffff7fb994e  <+   44>  48 c7 c3 20 00 00 00  mov    $0x20,%rbx
0x7ffff7fb9955  <+   51>  eb 00                 jmp    0x7ffff7fb9957 <trampoline_func()+53>
0x7ffff7fb9957  <+   53>  48 8b 05 32 e6 00 00  mov    0xe632(%rip),%rax        # 0x7ffff7fc7f90
0x7ffff7fb995e  <+   60>  48 01 d8              add    %rbx,%rax
0x7ffff7fb9961  <+   63>  48 8b 00              mov    (%rax),%rax
0x7ffff7fb9964  <+   66>  48 8b 00              mov    (%rax),%rax
0x7ffff7fb9967  <+   69>  ff e0                 jmpq   *%rax
        85 [1]	};
0x7ffff7fb9969  <+   71>  90                    nop
0x7ffff7fb996a  <+   72>  5d                    pop    %rbp
0x7ffff7fb996b  <+   73>  c3                    retq
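
As a cross-check, the entry points visible in the listing above can be reproduced with a little arithmetic. This is a sketch only: the constants are read off this particular disassembly (first sled entry at +8, entries 9 bytes apart) and will differ for other compilers or flags, and the names `sled_entry_offset`, `sled_entry_value` and `sled_entry_address` are made up for illustration.

```cpp
#include <cassert>

// Values read off the disassembly above; assumptions for any other build.
constexpr int kPrologueBytes  = 8; // endbr64 (4) + push %rbp (1) + mov %rsp,%rbp (3)
constexpr int kSledEntryBytes = 9; // mov $imm32,%rbx (7) + short jmp (2)

// Byte offset of sled entry `index`, relative to the start of the function.
constexpr int sled_entry_offset( int index ) {
    return kPrologueBytes + kSledEntryBytes * index;
}

// Value which that sled entry places into %rbx: a byte offset into
// an array of 8-byte pointers.
constexpr int sled_entry_value( int index ) {
    return 8 * index;
}

// Address to hand out for sled entry `index`.
inline const char *sled_entry_address( const char *function_start, int index ) {
    assert( index >= 0 );
    return function_start + sled_entry_offset( index );
}

// These match the <+8>, <+17> and <+44> lines of the listing.
static_assert( sled_entry_offset( 0 ) == 8, "first entry follows the prologue" );
static_assert( sled_entry_offset( 1 ) == 17, "second entry" );
static_assert( sled_entry_offset( 4 ) == 44, "fifth entry" );
```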

Wrapping up 

What's left to do is some cleaning up, and making it all a bit more ergonomic.

For one, it's important that we don't clobber the %rbx register, as it might actually be in use. So that we're able to restore the original value of %rbx we add an instruction at the beginning of each sled entry which pushes the initial %rbx value to the stack. We then pop that value back into %rbx just before we execute the final jmp instruction.

This means our sled entries will have to look like this:

push %rbx
movq  $0x0000, %rbx
jmp sled_end

push %rbx
movq  $0x0008, %rbx
jmp sled_end

push %rbx
movq  $0x0010, %rbx
jmp sled_end

push %rbx
movq  $0x0018, %rbx
jmp sled_end

# ...

And the final jump instruction will need an additional stack pop:

pop     %rbx
jmpq    *%rax

Another thing to consider is that if we add quite a few sled entries, we will have to modify the formula which calculates the entry address for a given target index. This is because the assembler will emit slightly different machine code based on how far a jump goes.

Sled entries placed within 128 bytes of the label sled_end will use a 2-byte short jmp for the jmp sled_end instruction, which makes these sled entries 10 bytes long in total. Sled entries which are further away from sled_end need a wider displacement, for which the assembler emits a 5-byte near jmp, making these sled entries 13 bytes long.
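
Where do the 10 and 13 bytes come from, and how many entries get the short form? The following sketch adds up the instruction sizes and derives the count. It assumes the x86-64 encodings seen in the listings of this post and a maximum forward reach of 127 bytes for a short jmp; the constant and function names are hypothetical.

```cpp
#include <cassert>

// Instruction sizes in bytes (x86-64).
constexpr int kPushRbxBytes  = 1; // push %rbx          (53)
constexpr int kMovImm32Bytes = 7; // movq $imm32, %rbx  (48 c7 c3 + 4 byte immediate)
constexpr int kShortJmpBytes = 2; // jmp rel8           (eb xx)
constexpr int kNearJmpBytes  = 5; // jmp rel32          (e9 xx xx xx xx)

constexpr int kShortEntryBytes = kPushRbxBytes + kMovImm32Bytes + kShortJmpBytes;
constexpr int kNearEntryBytes  = kPushRbxBytes + kMovImm32Bytes + kNearJmpBytes;

// A short jmp can reach at most 127 bytes forward.
constexpr int kMaxShortReach = 127;

// Distance from the end of an entry's jmp to sled_end, assuming all
// `entries_after` entries that follow it use the short form.
int bytes_to_sled_end( int entries_after ) {
    assert( entries_after >= 0 );
    return entries_after * kShortEntryBytes;
}

// Number of entries at the end of the sled whose jmp still fits a short jmp:
// the last entry jumps 0 bytes, the one before it 10 bytes, and so on.
constexpr int short_entries_at_end() {
    return kMaxShortReach / kShortEntryBytes + 1;
}

static_assert( kShortEntryBytes == 10, "short sled entry" );
static_assert( kNearEntryBytes == 13, "near sled entry" );
static_assert( short_entries_at_end() == 13, "12 entries may follow a short jmp" );
```

With 12 entries after it, a jmp has to cover 120 bytes, which still fits; with 13 entries after it, 130 bytes do not.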

Putting it all together 

Here's how we use this devil's concoction to set up a callback which can survive hot-reloading of the module in which it was declared:

// le_window.cpp

#include "le_core.cpp"

static void window_set_callbacks( le_window_o *self ) {

  glfwSetKeyCallback(
    self->window,
    ( GLFWkeyfun )core_get_callback_forwarder_addr(
      ( void * )&le_window_api_i->window_callbacks_i.glfw_key_callback_addr
    )
  );

}
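
Note that the call above registers the address of a slot in the API struct, not the address of the handler itself. My reading is that this gives a double lookup: the forwarder first finds the slot, then reads whatever function pointer the slot holds right now. The sketch below mimics this in plain C++. It assumes that the slot is rewritten on each reload, uses a simplified callback signature, and all names except `target_func_addr` and `glfw_key_callback_addr` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified signature; a real key callback takes more parameters.
using key_callback_fn = int ( * )( int key );

// Stand-in for an API struct kept in the central registry.
struct window_callbacks_t {
    key_callback_fn glfw_key_callback_addr;
};

window_callbacks_t window_callbacks = { nullptr };

constexpr std::size_t kMaxForwarders = 4;
void *target_func_addr[ kMaxForwarders ] = {};

// Store the address of the slot, not the address of the function.
void register_slot( std::size_t index, void *slot_addr ) {
    assert( index < kMaxForwarders );
    target_func_addr[ index ] = slot_addr;
}

// Assumed reload step: the slot receives the handler's new address.
void simulate_reload( key_callback_fn new_handler ) {
    window_callbacks.glfw_key_callback_addr = new_handler;
}

// What the forwarder does, minus the assembly: two dereferences, then a call.
int dispatch( std::size_t index, int key ) {
    assert( index < kMaxForwarders && target_func_addr[ index ] != nullptr );
    key_callback_fn *slot = static_cast<key_callback_fn *>( target_func_addr[ index ] );
    return ( *slot )( key );
}

// Two stand-ins for the handler before and after a reload.
int on_key_v1( int key ) { return key + 1; }
int on_key_v2( int key ) { return key + 100; }
```

The registration happens once; after `simulate_reload( on_key_v2 )` the same `dispatch( 0, ... )` call reaches the new handler without the client library noticing anything.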

Here are the relevant portions of le_core.cpp which deal with callback forwarding:

// le_core.cpp

// Note: run `python create_sled.py > sled.asm` after changing this value:
#define CORE_MAX_CALLBACK_FORWARDERS 32

// Array of lookup addresses for callbacks
void * target_func_addr[ CORE_MAX_CALLBACK_FORWARDERS ] = {};

// Number of callbacks actually in use
std::atomic<uint32_t> USED_CALLBACK_FORWARDERS = 0;

extern "C" void trampoline_func() {

  // What follows is a sled: the lookup index for the target
  // function is set depending on how far into the sled the
  // call enters.
  //
  // The sled itself is auto-generated via a python script, 
  // see CORE_MAX_CALLBACK_FORWARDERS above.
  #include "sled.asm" 

asm( R"ASM(

.text

    addq    %%rbx, %%rax    /* apply offset based on sled index */
    movq    %0, %%rax
    movq    (%%rax), %%rax
    pop     %%rbx           /* restore rbx register */
    jmpq    *%%rax

)ASM" : /* empty output operands */
      : "m" ( target_func_addr[0] )
    );
};

void *core_get_callback_forwarder_addr( void *callback_addr ) {
  // Note post-increment: we're interested in value before increment,
  // but increment is atomic, and therefore we can guarantee that
  // another call to this method will not end up with the same forwarder_idx
  uint32_t current_index = USED_CALLBACK_FORWARDERS++;

  // Make sure we're not overshooting
  assert( current_index < CORE_MAX_CALLBACK_FORWARDERS );

  target_func_addr[ current_index ] = callback_addr;

  // Number of machine code bytes of a sled entry depends on how close 
  // an entry is placed relative to the `sled_end` label:
  //
  // if it is within 128 bytes, (that applies to the last 13 entries), 
  // it will be 10 bytes in length, all other sled entries will be 
  // 13 bytes wide.
  //
  // This is because the jmp instruction is 2 bytes of machine code for
  // short jmp (`eb xx`), and 5 bytes (`e9 xx xx xx xx` ) for near jmp.

  int first_addr = std::min( 0, 13 - CORE_MAX_CALLBACK_FORWARDERS );

  int addr            = first_addr + int( current_index );              
  int num_large_jumps = std::max( 0, std::min( addr, 0 ) - first_addr );
  int num_small_jumps = std::max( 0, addr );           

  // We must now calculate the exact offset so that we hit the 
  // correct sled entry: 
  //
  // 8 bytes to jump over the function prologue (`endbr64`,
  // `push rbp`, `mov rsp, rbp`), then 13 bytes for each distant
  // sled entry, and 10 bytes for each of the 13 sled entries
  // closest to the `sled_end` label.

  // Return computed address.
  
  return ( char * )&trampoline_func + ( 8 +
                                        10 * num_small_jumps +
                                        13 * num_large_jumps );
};
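
To sanity-check the offset arithmetic without touching any assembly, the same calculation can be written as a small free-standing function. This is a sketch under the assumptions stated in the comments above (8 bytes of prologue, 13-byte entries followed by thirteen 10-byte entries); `sled_entry_offset` and the constants are hypothetical names, not part of le_core.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kPrologueBytes     = 8;  // bytes before the first sled entry
constexpr int kShortEntryBytes   = 10; // entry ending in a short jmp
constexpr int kNearEntryBytes    = 13; // entry ending in a near jmp
constexpr int kShortEntriesAtEnd = 13; // entries close enough for a short jmp

// Byte offset of sled entry `index` from the start of the function,
// for a sled with `total_entries` entries.
int sled_entry_offset( int index, int total_entries ) {
    assert( index >= 0 && index < total_entries );
    // All entries except the last 13 use the near form.
    int near_total   = std::max( 0, total_entries - kShortEntriesAtEnd );
    // Entries we have to skip to reach `index`, split by form.
    int near_before  = std::min( index, near_total );
    int short_before = index - near_before;
    return kPrologueBytes +
           kNearEntryBytes * near_before +
           kShortEntryBytes * short_before;
}
```

For a sled of 32 entries this gives offset 8 for index 0, 255 for index 19 (the first short entry, after nineteen 13-byte entries), and 375 for index 31.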

The file sled.asm contains just the code for the sled, which might be a bit repetitive to write by hand:

asm(R"ASM(

    push %rbx
    movq  $0x0000, %rbx
    jmp sled_end

    push %rbx
    movq  $0x0008, %rbx
    jmp sled_end

    push %rbx
    movq  $0x0010, %rbx
    jmp sled_end

    push %rbx
    movq  $0x0018, %rbx
    jmp sled_end

    push %rbx
    movq  $0x0020, %rbx
    jmp sled_end

/* ... */

sled_end:
)ASM");

This is why I added a python script which parses le_core.cpp for the value of CORE_MAX_CALLBACK_FORWARDERS and automatically generates a sled of the required length:

#!/usr/bin/python3

import re

# Fetch the number of sled entries from le_core.cpp file: 
# it is the value associated with CORE_MAX_CALLBACK_FORWARDERS

pattern = re.compile(
    r"\s*?#define\s+?CORE_MAX_CALLBACK_FORWARDERS\s+?(\d+).*")

core_max_callback_forwarders = 0

with open("le_core.cpp", "rt") as core_file:
    for line in core_file:
        result = pattern.search(line)
        if result is not None:
            core_max_callback_forwarders = int(result[1])
            break

if (core_max_callback_forwarders == 0):
    exit(1)

print("asm(R\"ASM(\n")


for offset in range(0, core_max_callback_forwarders):
    print("\tpush %rbx")
    print("\tmovq  $0x{:04x}, %rbx".format(offset * 8 ))
    print("\tjmp sled_end\n")

print("sled_end:")
print(")ASM\");")

Some Caveats 

Because this uses assembly, the method presented above is not as portable as it should be - so far this has only been developed for and tested on 64-bit Linux.

The assembly instructions used, however, are very simple, and it should be straightforward to port this to systems which don't necessarily use the System V calling convention, such as Windows.


Tagged:

code, hot-reloading, c, assembly, island

