Tooling
How can we debug code that doesn’t bind buffer and texture objects and doesn’t call an API to describe the memory layout explicitly? C/C++ debuggers have been doing that for decades. There are no special operating system APIs for describing your software’s memory layout. The debugger follows 64-bit pointer chains using the debug symbol data provided by your compiler, which includes the memory layouts of your structs and classes. CUDA and Metal use C/C++ based shading languages with full 64-bit pointer semantics. Both have robust debuggers that traverse pointer chains without issues. The texture descriptor heap is just GPU memory. The debugger can index it, load a texture descriptor, show the descriptor data and visualize the texels. All of this already works in the Xcode Metal debugger: click on a texture or a sampler handle in any struct at any GPU address and the debugger will visualize it.
Modern GPUs virtualize memory. Each process has its own page table. A GPU capture is replayed in a separate replayer process with its own virtual address space. If the replayer naively replayed all the allocations, each memory allocation would get a different GPU virtual address. This was fine for legacy APIs, as it wasn’t possible to directly store GPU addresses in your data structures. A modern API needs special replay memory allocation APIs that force the replayer to mirror the exact GPU virtual memory layout. DX12 and Vulkan BDA have public APIs for this: RecreateAt and VkMemoryOpaqueCaptureAddressAllocateInfo. Metal and CUDA debuggers do the same using internal undocumented APIs. A public API is preferable as it allows open source tools like RenderDoc to function.
Don’t raw pointers bring security concerns? Can’t you just read/write other apps’ memory? This is not possible due to virtual memory. You can only access your own memory pages. If you accidentally use a stale pointer or overflow, you will get a page fault. Page faults are already possible with existing buffer based APIs. DirectX 12 and Vulkan don’t clamp your storage (byteaddress/structured) buffer addresses, so an OOB access causes a page fault. Users can also accidentally free a memory heap and continue using stale buffer or texture descriptors to get a page fault. Nothing really changes. An access to an unmapped region is a page fault and the application crashes. This is familiar to C/C++ programmers. If you want robustness, you can use ptr + size pairs. That’s exactly how WebGPU is implemented. The WebGPU shader compiler (Tint or Naga) emits an extra clamp instruction for each buffer access, including vertex accesses (index buffer value out of bounds). WebGL didn’t allow sharing index buffer data with other data. WebGL scanned through the indices on the CPU side (making index buffer updates very slow). Back then custom vertex fetch was not possible. The hardware page faulted before the shader even ran.
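The ptr + size idea can be sketched on the CPU side in a few lines of C++. This is only an illustration of the clamping approach, not the code Tint or Naga actually emit; the `BoundedPtr` type and the `robustLoadExample` function are hypothetical names invented for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical "fat pointer": a base address plus an element count.
template <typename T>
struct BoundedPtr {
    const T* base;
    std::size_t count; // number of valid elements, must be greater than zero

    // Clamp the index into [0, count - 1] before loading, so an
    // out-of-bounds index reads the last valid element instead of faulting.
    T load(std::size_t index) const {
        assert(count > 0);
        return base[std::min(index, count - 1)];
    }
};

// Example: index 7 is out of bounds for 4 elements and is clamped to index 3.
inline std::uint32_t robustLoadExample() {
    static const std::uint32_t data[4] = {10, 20, 30, 40};
    BoundedPtr<std::uint32_t> p{data, 4};
    return p.load(7);
}
```

The cost is one extra min per access, which is the trade-off between a raw pointer and a robust one.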
Translation layers
Being able to run existing software is crucial. Translation layers such as ANGLE, Proton and MoltenVK play a crucial role in the portability and deprecation process of legacy APIs. Let’s talk about translating DirectX 12, Vulkan and Metal to our new API.
MoltenVK (Vulkan to Metal translation layer) proves that Vulkan’s buffer centric API can be translated to Metal’s 64-bit pointer based ecosystem. MoltenVK translates Vulkan’s descriptor sets into Metal’s argument buffers. The generated argument buffers are standard GPU structs containing a 64-bit GPU pointer per buffer binding and a 64-bit texture ID per texture binding. We can do better by allocating a contiguous range of texture descriptors in our texture heap for each descriptor set, and storing a single 32-bit base index instead of a 64-bit texture ID for each texture binding. This is possible since our API has a user managed texture heap unlike Metal.
MoltenVK maps descriptor sets to Metal API root bind slots. We generate a root struct with up to eight 64-bit pointer fields, each pointing to a descriptor set struct (see above). Root constants are translated into value fields and root descriptors (root buffers) are translated into 64-bit pointers. The efficiency should be identical, assuming the GPU driver preloads our root struct fields into uniform/scalar registers (as discussed in the root arguments chapter).
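As a rough illustration of the layout described above, the translated structs could look like the following C++ sketch. All type and field names (`GpuPtr`, `DescriptorSetStruct`, `RootStruct`, `textureHeapIndex`) are assumptions made for this example, and the binding counts are arbitrary.

```cpp
#include <cassert>
#include <cstdint>

using GpuPtr = std::uint64_t; // stand-in for a 64-bit GPU virtual address

// Hypothetical translated descriptor set: one 64-bit pointer per buffer
// binding, plus a single 32-bit base index for all texture bindings.
struct DescriptorSetStruct {
    GpuPtr buffers[2];         // buffer bindings 0 and 1
    std::uint32_t textureBase; // first slot of this set's contiguous texture heap range
};

// Hypothetical root struct: up to eight descriptor set pointers,
// root constants as plain values, root buffers as 64-bit pointers.
struct RootStruct {
    GpuPtr descriptorSets[8];   // each points to a DescriptorSetStruct in GPU memory
    std::uint32_t rootConstant; // example root constant
    GpuPtr rootBuffer;          // example root descriptor (root buffer)
};

// Texture binding N of a set resolves to heap slot textureBase + N.
constexpr std::uint32_t textureHeapIndex(const DescriptorSetStruct& set,
                                         std::uint32_t binding) {
    return set.textureBase + binding;
}
```

With a per-set base index, the texture part of the set costs 4 bytes regardless of how many textures it binds, instead of 8 bytes per texture.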
Our API uses 64-bit pointer semantics like Metal. We can use the same techniques employed by MoltenVK to translate the buffer load/store instructions in the shader. MoltenVK also supports translating Vulkan’s new buffer device address extension.
Proton (DX12 to Vulkan translation layer) proves that DirectX 12 SM 6.6 descriptor heap can be translated to Vulkan’s new descriptor buffer extension. Proton also translates other DirectX 12 features to Vulkan. We have already shown that Vulkan to Metal translation is possible with MoltenVK, transitively proving that translation from DirectX 12 to Metal should be possible. The biggest missing feature in MoltenVK is the SM 6.6 style descriptor heap (Vulkan’s descriptor buffer extension). Metal doesn’t expose the descriptor heap directly to the user. Our new proposed API has no such limitation. Our descriptor heap semantics are a superset of the SM 6.6 descriptor heap and a close match to Vulkan’s descriptor buffer extension. Translation is straightforward. Vulkan’s extension also adds a special flag for descriptor invalidate, matching our HAZARD_DESCRIPTORS. The DirectX 12 descriptor heap API is easy to translate, as it’s just a thin wrapper over the raw descriptor array in GPU memory.
To support Metal 4.0, we need to implement Metal’s driver managed texture descriptor heap. This can be implemented using a simple freelist over our texture heap. Metal uses 64-bit texture handles which are implemented as direct heap indices on modern Apple Silicon devices. Metal allows using the texture handles in shaders directly as textures. This is syntactic sugar for textureHeap[uint64(handle)]. A Metal texture handle is translated into uint64 by our shader translator, maintaining identical GPU memory layout.
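A freelist over the texture heap can be very small. The sketch below is a CPU-side illustration under simplifying assumptions: the class name `TextureSlotFreelist` and the `kInvalidTextureSlot` constant are hypothetical, and a real translation layer would also have to delay reuse of a released slot until the GPU has finished with it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kInvalidTextureSlot = 0xFFFFFFFFu;

// Hands out texture heap indices and recycles released ones.
class TextureSlotFreelist {
public:
    explicit TextureSlotFreelist(std::uint32_t capacity) : capacity_(capacity) {}

    // Returns a heap index, or kInvalidTextureSlot if the heap is full.
    std::uint32_t allocate() {
        if (!free_.empty()) {
            std::uint32_t slot = free_.back(); // reuse the most recently released slot
            free_.pop_back();
            return slot;
        }
        if (next_ < capacity_) return next_++; // grow into never-used slots
        return kInvalidTextureSlot;
    }

    // Returns a slot to the freelist. This sketch does not wait for the GPU.
    void release(std::uint32_t slot) {
        assert(slot < next_);
        free_.push_back(slot);
    }

private:
    std::uint32_t capacity_;
    std::uint32_t next_ = 0;
    std::vector<std::uint32_t> free_;
};

// A Metal-style 64-bit handle is then just the heap index widened to 64 bits.
inline std::uint64_t toTextureHandle(std::uint32_t slot) {
    return static_cast<std::uint64_t>(slot);
}
```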
Our API doesn’t support vertex buffers. WebGPU doesn’t use hardware vertex buffers either, yet it implements the classic vertex buffer abstraction. The WGSL shader translator (Tint or Naga) adds one storage buffer binding per vertex stream and emits vertex load instructions at the beginning of the vertex shader. Custom vertex fetch allows emitting clamp instructions to avoid OOB behavior, so a misbehaving website can’t crash the web browser. Our own shader translator adds a 64-bit pointer to the root struct for each vertex stream, generates a struct matching its layout and emits vertex struct load instructions at the beginning of the vertex shader.
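To make the translated vertex fetch concrete, here is a C++ sketch of what the generated prologue amounts to for two vertex streams. Host pointers stand in for 64-bit GPU pointers, and every name (`Stream0Vertex`, `Stream1Vertex`, `VertexRootStruct`, `vertexPrologue`) is an assumption made for this example rather than output of a real translator.

```cpp
#include <cassert>
#include <cstdint>

// Structs generated to match the layout of each vertex stream.
struct Stream0Vertex { float position[3]; };
struct Stream1Vertex { float uv[2]; };

// Root struct with one pointer per vertex stream.
struct VertexRootStruct {
    const Stream0Vertex* stream0;
    const Stream1Vertex* stream1;
};

// Everything the body of the vertex shader reads as vertex input.
struct VertexInput {
    Stream0Vertex s0;
    Stream1Vertex s1;
};

// The prologue emitted at the start of the vertex shader:
// one struct load per stream, indexed by the vertex index.
inline VertexInput vertexPrologue(const VertexRootStruct& root,
                                  std::uint32_t vertexIndex) {
    VertexInput in;
    in.s0 = root.stream0[vertexIndex];
    in.s1 = root.stream1[vertexIndex];
    return in;
}
```

This version trusts the index. A robust variant would carry a vertex count per stream and clamp the index before each load.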
We have shown that it’s possible to write translation layers to run DirectX 12, Vulkan and Metal applications on top of our new API. Since WebGPU is implemented on top of these APIs by browsers, we can run WebGPU applications too.
Min spec hardware
Nvidia Turing (RTX 2000 series, 2018) introduced ray-tracing, tensor cores, mesh shaders, low latency raw memory paths, bigger & faster caches, scalar unit, secondary integer pipeline and many other forward-looking features. Officially, PCIe ReBAR support launched with the RTX 3000 series, but there exist hacked Turing drivers that support it too, indicating that the hardware is capable of it. This 7-year-old GPU supports everything we need. Nvidia just ended GTX 1000 series driver support in fall 2025. All currently supported Nvidia GPUs could be supported by our new API.
AMD RDNA2 (RX 6000 series, 2020) matched Nvidia’s feature set with ray-tracing and mesh shaders. One year earlier, RDNA 1 introduced coherent L2$, new L1$ level, fast L0$, generic DCC read/write paths, fastpath unfiltered loads and a modern SIMD32 architecture. PCIe ReBAR is officially supported (brand name “Smart Access Memory”). This 5-year-old GPU supports everything we need. AMD already ended GCN driver support in 2021. Today RDNA 1 & RDNA 2 only receive bug fixes and security updates, and RDNA 3 is the oldest GPU receiving game optimizations. All the currently supported AMD GPUs could be supported by our API.
Intel Alchemist / Xe1 (2022) were the first Intel chips with SM 6.6 global indexable heap support. These chips also support ray-tracing, mesh shaders, PCIe ReBAR (discrete) and UMA (integrated). These 3-year-old Intel GPUs support everything we need.
Apple M1 / A14 (MacBook M1, iPhone 12, 2020) support Metal 4.0. Metal 4.0 guarantees GPU memory visibility to CPU (UMA on both phones and computers), and allows the user to write 64-bit pointers and 64-bit texture handles directly into GPU memory. Metal 4.0 has a new residency set API, solving a crucial usability issue with bindless resource management in the old useResource/useHeap APIs. iOS 26 still supports iPhone 11. Developers are not allowed to ship apps that require Metal 4.0 just yet. iOS 27 likely deprecates iPhone 11 support next year. On Mac, if you drop Intel Mac support, you have guaranteed Metal 4.0 support. M1-M5 = 5 generations = 5 years.
ARM Mali-G710 (2021) is ARM’s first modern architecture. It introduced their new command stream frontend (CSF), reducing the CPU dependency of draw call building and adding crucial features like multi-draw indirect and compute queues. Non-uniform index texture sampling is significantly faster and the AFBC lossless compressor now supports 16-bit floating point targets. G710 supports Vulkan BDA and descriptor buffer extensions and is capable of supporting the new 2025 unified image layout extension with future drivers. The Mali-G715 (2022) introduced support for ray-tracing.
Qualcomm Adreno 650 (2019) supports Vulkan BDA, descriptor buffer and unified image layout extensions, 16-bit storage/math, dynamic rendering and extended dynamic state with the latest Turnip open source drivers. Adreno 740 (2022) introduced support for ray-tracing.
PowerVR DXT (Pixel 10, 2025) is PowerVR’s first architecture that supports Vulkan descriptor buffer and buffer device address extensions. It also supports 64-bit atomics, 8-bit and 16-bit storage/math, dynamic rendering, extended dynamic state and all the other features we require.
Conclusion
Modern graphics APIs have improved gradually in the past 10 years. Six years after the DirectX 12 launch, SM 6.6 (2021) introduced the modern global texture heap, allowing fully bindless renderer design. Metal 4.0 (2025) and CUDA have a clean 64-bit pointer based shader architecture with minimal binding API surface. Vulkan has the most restrictive standard, but extensions such as buffer device address (2020), descriptor buffer (2022) and unified image layouts (2025) add support for modern bindless infrastructure, although tools are still lagging behind. As of today, there’s no single API that meets all our requirements, but if we combine their best bits together, we can build the perfect API for modern hardware.
10 years ago, modern APIs were designed for CPU-driven binding models. New bindless features were presented as optional features and extensions. A clean break would improve the usability and reduce the API bloat and driver complexity significantly. It’s extremely difficult to get the whole industry behind a brand new API. I am hoping that vendors are willing to drop backwards compatibility in their new major API versions (Vulkan 2.0, DirectX 13) to embrace the fully bindless GPU architecture we have today. A new bindless API design would solve the mismatch between the API and the game engine RHI, allowing us to get rid of the hash maps and fine grained resource tracking. Metal 4.0 is close to this goal, but it is still missing the global indexable texture heap. A 64-bit texture handle can’t represent a range of textures.
HLSL and GLSL shading languages were designed over 20 years ago as a framework of 1:1 elementwise transform functions (vertex, pixel, geometry, hull, domain, etc.). Memory access is abstracted and array handling is cumbersome as there’s no support for pointers. Despite 20 years of existence, HLSL and GLSL have failed to accumulate a library ecosystem. CUDA in contrast is a composable language exposing memory directly and new features (such as AI tensor cores) through intrinsics. CUDA has a broad library ecosystem, which has propelled Nvidia to a $4T valuation. We should learn from it.
WebGPU note: WebGPU design is based on 10 year old core Vulkan 1.0 with extra restrictions. WebGPU doesn’t support bindless resources, 64-bit GPU pointers or persistently mapped GPU memory. It feels like a mix between DirectX 11 and Vulkan 1.0. It is a great improvement for web graphics, but doesn’t meet modern bindless API standards. I will discuss WebGPU in a separate blog post.
My prototype API shows what is achievable with modern GPU architectures today, if we mix the best bits from all the latest APIs. It is possible to build an API that is simpler to use than DirectX 11 and Metal 1.0, yet it offers better performance and flexibility than DirectX 12 and Vulkan. We should embrace the modern bindless hardware.
Appendix
A simple userland GPU bump allocator used in all example code. We call gpuHostToDevicePointer once in the temp allocator constructor. We can perform standard pointer arithmetic (such as offset) on GPU pointers. Traditional Vulkan/DX12 buffer APIs require a separate offset. This simplifies the API and user code (ptr vs handle+offset pair). A production ready temp allocator would implement overflow handling (grow, flush, etc.).
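A minimal sketch of such an allocator might look like the following. It runs on the CPU only: `mockGpuMalloc`, `mockGpuFree` and `mockHostToDevicePointer` are invented stand-ins backed by host memory (the last one playing the role of gpuHostToDevicePointer), and `TempAllocation` and `GpuBumpAllocator` are hypothetical names, not the prototype API’s actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Mock stand-ins so the sketch runs without a GPU. Assumptions, not a real API.
inline void* mockGpuMalloc(std::size_t bytes) { return std::malloc(bytes); }
inline void mockGpuFree(void* p) { std::free(p); }
// Pretend the GPU sees the same memory at a fixed bias from the CPU address.
constexpr std::uint64_t kMockGpuAddressBias = 0x100000000ull;
inline std::uint64_t mockHostToDevicePointer(void* cpu) {
    return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(cpu)) +
           kMockGpuAddressBias;
}

template <typename T>
struct TempAllocation {
    T* cpu;            // CPU-visible pointer for writing the data
    std::uint64_t gpu; // matching GPU address to hand to shaders
};

class GpuBumpAllocator {
public:
    explicit GpuBumpAllocator(std::size_t capacity)
        : cpuBase_(static_cast<std::uint8_t*>(mockGpuMalloc(capacity))),
          gpuBase_(mockHostToDevicePointer(cpuBase_)), // translated once, here
          capacity_(capacity) {
        assert(cpuBase_ != nullptr);
    }
    ~GpuBumpAllocator() { mockGpuFree(cpuBase_); }
    GpuBumpAllocator(const GpuBumpAllocator&) = delete;
    GpuBumpAllocator& operator=(const GpuBumpAllocator&) = delete;

    // Bumps the offset; the GPU address is plain pointer arithmetic on the base.
    template <typename T>
    TempAllocation<T> allocate(std::size_t count = 1) {
        std::size_t aligned = (offset_ + alignof(T) - 1) & ~(alignof(T) - 1);
        std::size_t bytes = sizeof(T) * count;
        assert(aligned + bytes <= capacity_); // no overflow handling in this sketch
        offset_ = aligned + bytes;
        return {reinterpret_cast<T*>(cpuBase_ + aligned), gpuBase_ + aligned};
    }

    // Start over, typically once the GPU is done with the previous contents.
    void reset() { offset_ = 0; }

private:
    std::uint8_t* cpuBase_;
    std::uint64_t gpuBase_;
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

Each allocation returns a CPU pointer and a GPU address for the same bytes, so callers never carry a handle + offset pair around.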
#️⃣ #Graphics #API #Sebastian #Aaltonen
🕒 Posted on 1765916362
