nullprogram.com/blog/2025/12/12/
Back in 2017 I wrote about a technique for creating closures in C
using a JIT-compiled wrapper. It's neat, though rarely necessary in
real programs, so I don't think about it often. I applied it to qsort,
which sadly accepts no context pointer. More practical would be
working around insufficient custom allocator interfaces, to
create allocation functions at run-time bound to a particular allocation
region. I've learned a lot since I last wrote about this subject, and a
recent article had me thinking about it again, and how I could do
better than before. In this article I will enhance Win32 window procedure
callbacks with a fifth argument, allowing us to more directly pass extra
context. I'm using w64devkit on x64, but everything here should
work out-of-the-box with any x64 toolchain that speaks GNU assembly.
A window procedure has this prototype:
LRESULT Wndproc(
    HWND hWnd,
    UINT Msg,
    WPARAM wParam,
    LPARAM lParam
);
To create a window we must first register a class with RegisterClass,
which accepts a set of properties describing a window class, including a
pointer to one of these functions.
MyState *state = ...;
RegisterClassA(&(WNDCLASSA){ /* ... */ });
HWND hwnd = CreateWindowExA("my_class", ..., state);
The thread drives a message pump with events from the operating system,
dispatching them to this procedure, which then manipulates the program
state in response:
for (MSG msg; GetMessageW(&msg, 0, 0, 0);) {
    TranslateMessage(&msg);
    DispatchMessageW(&msg);
}
All four WNDPROC parameters are determined by Win32. There is no context
pointer argument. So how does this procedure access the program state? We
generally have two options:
- Global variables. Yucky but easy. Frequently seen in tutorials.
- A GWLP_USERDATA pointer attached to the window.
The second option takes some setup. Win32 passes the last CreateWindowEx
argument to the window procedure when the window is created, via WM_CREATE.
The procedure attaches the pointer to its window as GWLP_USERDATA. This
pointer is passed indirectly, through a CREATESTRUCT. So ultimately it
looks like this:
case WM_CREATE:
    CREATESTRUCT *cs = (CREATESTRUCT *)lParam;
    void *arg = (struct state *)cs->lpCreateParams;
    SetWindowLongPtr(hwnd, GWLP_USERDATA, (LONG_PTR)arg);
    // ...
In future messages we can retrieve it with GetWindowLongPtr. Every time
I go through this I wish there was a better way. What if there was a fifth
window procedure parameter through which we could pass a context?
typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);
We'll build just this as a trampoline. The x64 calling convention
passes the first four arguments in registers, and the rest are pushed on
the stack, including this new parameter. Our trampoline cannot just stuff
the extra parameter in a register, but will actually have to build a
stack frame. Slightly more complicated, but barely so.
Allocating executable memory
In previous articles, and in the programs where I've applied techniques
like this, I've allocated executable memory with VirtualAlloc (or mmap
elsewhere). This introduces a small challenge for solving the problem
generally: Allocations may be arbitrarily far from our code and data, out
of reach of relative addressing. If they're further than 2G apart, we need
to encode absolute addresses, and in the simple case would just assume
they're always too far apart.
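To make the 2G limit concrete, here is a minimal sketch of the range check a JIT would need before emitting a relative call. The helper name `rel32_for` is hypothetical and not part of the article's code. It assumes 64-bit addresses and two's complement conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: compute the rel32 displacement for a call whose
// following instruction begins at `next` and whose destination is `target`.
// Returns false when the distance does not fit a signed 32-bit field, in
// which case an absolute address would have to be encoded instead.
static bool rel32_for(uint64_t target, uint64_t next, int32_t *out)
{
    assert(out != NULL);
    // Unsigned subtraction wraps; reinterpret the result as a signed delta.
    int64_t delta = (int64_t)(target - next);
    if (delta < INT32_MIN || delta > INT32_MAX) {
        return false;
    }
    *out = (int32_t)delta;
    return true;
}
```

A trampoline allocated with VirtualAlloc could fail this check, which is why the simple approach assumes the worst and always encodes absolute addresses.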
These days I've more experience with executable formats, and allocation,
and I immediately see a better solution: Request a block of writable,
executable memory from the loader, then allocate our trampolines from it.
Other than being executable, this memory isn't special, and allocation
works the usual way, using functions unaware it's executable. By
allocating through the loader, this memory will be part of our loaded
image, guaranteed to be close to our other code and data, allowing our JIT
compiler to assume a small code model.
There are a number of ways to do this, and here's one way to do it with
GNU-styled toolchains targeting COFF:
.section .exebuf,"bwx"
.globl exebuf
exebuf: .space 1<<21
This assembly program defines a new section named .exebuf containing 2M
of writable ("w"), executable ("x") memory, allocated at run time just
like .bss ("b"). We'll treat this like an arena out of which we can
allocate all trampolines we'll probably ever need. With careful use of
.pushsection this could be basic inline assembly, but I've left it as a
separate source. On the C side I retrieve this like so:
typedef struct { /* ... */ } Arena;
Arena get_exebuf()
{
    // ...
}
Unfortunately I have to repeat myself on the size. There are different
ways to deal with this, but this is simple enough for now. I would have
loved to define the array in C with the GCC section attribute,
but as is usually the case with this attribute, it's not up to the task,
lacking the ability to set section flags. Besides, by not relying on the
attribute, any C compiler could compile this source, and we only need a
GNU-style toolchain to create the tiny COFF object containing exebuf.
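A minimal sketch of what the C side might look like. The `beg`/`end` field names are assumptions. So that the snippet compiles on its own, a plain array stands in for the assembly-defined section. In the real program the declaration would be `extern`, repeating the size, with the storage coming from the assembly source.

```c
#include <stddef.h>

// Stand-in so this compiles without the assembly object. In the real
// program this would read `extern char exebuf[1<<21];` and the storage
// would live in the .exebuf section, which is what makes it executable.
char exebuf[1<<21];

// Assumed layout: two pointers delimiting the unallocated region.
typedef struct {
    char *beg;
    char *end;
} Arena;

// Wrap the whole buffer in an arena. The size appears a second time here,
// via the array declaration above.
Arena get_exebuf(void)
{
    Arena r = {exebuf, exebuf + sizeof(exebuf)};
    return r;
}
```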
While we're at it, a reminder of some other basic definitions we'll need:
#define S(s) (Str){ /* ... */ }
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))
typedef struct { /* ... */ } Str;
Str clone(Arena *a, Str s)
{
    // ...
}
Which have been discussed at length in previous articles.
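For readers without those articles at hand, here is one plausible shape for these definitions. Only the `new` macro and the `data` member are confirmed by the code in this article. The `len` field, the body of `S`, and the `alloc` internals are assumptions in the style of a typical bump allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg; char *end; } Arena;    // assumed layout
typedef struct { char *data; ptrdiff_t len; } Str; // `len` is assumed

#define S(s)         (Str){s, sizeof(s)-1}  // wrap a string literal
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

// Bump allocator: align the front of the arena, carve off count*size
// bytes, and return them zeroed. Aborts when the arena is exhausted.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(count >= 0 && size > 0);
    assert(align > 0 && (align & (align - 1)) == 0);
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    if (avail < 0 || count > avail/size) {
        abort();
    }
    char *r = a->beg + pad;
    a->beg = r + count*size;
    return memset(r, 0, (size_t)(count*size));
}

// Copy a string's bytes into the arena and return the copy.
Str clone(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    memcpy(r.data, s.data, (size_t)s.len);
    return r;
}
```

Nothing here knows or cares whether the arena's memory is executable, which is the point made earlier about allocating through the loader.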
Trampoline compiler
From here the plan is to create a function that accepts a Wndproc5 and a
context pointer to bind, and returns a classic WNDPROC:
WNDPROC make_wndproc(Arena *, Wndproc5, void *arg);
Our window procedure now gets a fifth argument with the program state:
LRESULT my_wndproc(HWND, UINT, WPARAM, LPARAM, void *arg)
{
    MyState *state = arg;
    // ...
}
When registering the class we wrap it in a trampoline compatible with
RegisterClass:
RegisterClassA(&(WNDCLASSA){
    // ...
    .lpfnWndProc = make_wndproc(a, my_wndproc, state),
    .lpszClassName = "my_class",
    // ...
});
All windows using this class will readily have access to this state object
through their fifth parameter. It turns out setting up exebuf was the
more complicated part, and make_wndproc is quite simple!
WNDPROC make_wndproc(Arena *a, Wndproc5 proc, void *arg)
{
    Str thunk = S(
        "\x48\x83\xec\x28"      // sub  $40, %rsp
        "\x48\xb8........"      // movq $arg, %rax
        "\x48\x89\x44\x24\x20"  // mov  %rax, 32(%rsp)
        "\xe8...."              // call proc
        "\x48\x83\xc4\x28"      // add  $40, %rsp
        "\xc3"                  // ret
    );
    Str r   = clone(a, thunk);
    int rel = (int)((uintptr_t)proc - (uintptr_t)(r.data + 24));
    memcpy(r.data+ 6, &arg, sizeof(arg));
    memcpy(r.data+20, &rel, sizeof(rel));
    return (WNDPROC)r.data;
}
The assembly allocates a new stack frame, with callee shadow space, and
with room for the new argument, which also happens to re-align the stack.
It stores the new argument for the Wndproc5 just above the shadow space.
Then calls into the Wndproc5 without touching other parameters. There
are two "patches" to fill out, which I've initially filled with dots: the
context pointer itself, and a 32-bit signed relative address for the call.
It's going to be very near the callee. The only thing I don't like about
this function is that I've manually worked out the patch offsets.
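One way the offsets could be derived rather than hand-counted is to split the thunk into per-instruction literals and let sizeof do the arithmetic. This is only a sketch: the macro and constant names are made up for illustration, and the static assertions merely confirm the derived values match the 6, 20, and 24 used above.

```c
#include <stddef.h>

// Each instruction as its own literal; patch fields are left as dots.
#define I_SUB   "\x48\x83\xec\x28"      // sub  $40, %rsp
#define I_MOVQ  "\x48\xb8" "........"   // movq $arg, %rax
#define I_STORE "\x48\x89\x44\x24\x20"  // mov  %rax, 32(%rsp)
#define I_CALL  "\xe8" "...."           // call proc
#define I_ADD   "\x48\x83\xc4\x28"      // add  $40, %rsp
#define I_RET   "\xc3"                  // ret

// Length of a string literal, excluding the terminating null byte.
#define LEN(s) ((ptrdiff_t)sizeof(s) - 1)

enum {
    ARG_OFF  = LEN(I_SUB) + 2,          // imm64 follows the 2-byte prefix+opcode
    CALL_OFF = LEN(I_SUB) + LEN(I_MOVQ) + LEN(I_STORE),
    REL_OFF  = CALL_OFF + 1,            // rel32 follows the 1-byte opcode
    REL_END  = CALL_OFF + LEN(I_CALL),  // rel32 is relative to this offset
};

static const char thunk[] = I_SUB I_MOVQ I_STORE I_CALL I_ADD I_RET;

_Static_assert(ARG_OFF == 6,  "context pointer patch offset");
_Static_assert(REL_OFF == 20, "call displacement patch offset");
_Static_assert(REL_END == 24, "address the displacement is relative to");
```

If an instruction were added or resized, the constants would follow automatically, at the cost of a noisier thunk definition.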
It's probably not useful, but it's easy to update the context pointer at
any time if you hold onto the trampoline pointer:
void set_wndproc_arg(WNDPROC p, void *arg)
{
    memcpy((char *)p+6, &arg, sizeof(arg));
}
So, for instance:
MyState *state[2] = ...; // multiple states
WNDPROC proc = make_wndproc(a, my_wndproc, state[0]);
// ...
set_wndproc_arg(proc, state[1]); // switch states
Though I expect the most common case is just creating multiple procedures:
WNDPROC procs[] = {
    make_wndproc(a, my_wndproc, state[0]),
    make_wndproc(a, my_wndproc, state[1]),
};
To my slight surprise these trampolines still work with an active Control
Flow Guard system policy. Trampolines do not have stack unwind
entries, and I thought Windows might refuse to pass control to them.
Here's a complete, runnable example if you'd like to try it yourself:
main.c and exebuf.s
Better cases
This is more work than going through GWLP_USERDATA, and real programs
have a small, fixed number of window procedures (typically one), so this
isn't the best example, but I wanted to illustrate with a real interface.
Again, perhaps the best real use is a library with a weak custom allocator
interface:
typedef struct {
    void *(*malloc)(size_t);  // no context pointer!
    void  (*free)(void *);    // "
} Allocator;
void *arena_malloc(size_t, Arena *);
// ...
Allocator perm_allocator = {
    .malloc = make_trampoline(exearena, arena_malloc, perm),
    .free   = noop_free,
};
Allocator scratch_allocator = {
    .malloc = make_trampoline(exearena, arena_malloc, scratch),
    .free   = noop_free,
};
Something to keep in my back pocket for the future.
Posted on 1765671025
