ntransformer: a high-efficiency C++/CUDA LLM inference engine. It runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory over PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
Benchmarks

| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page cache thrashing (53 GB > 48 GB RAM) |
| Llama 3.1 70B Q6_K | Tiered (auto) | 0.2 tok/s | 23.1 GB | 26 VRAM + 54 RAM + 0 NVMe |
| Llama 3.1 70B Q4_K_M | Tiered (auto) | 0.3 tok/s | 22.9 GB | 36 VRAM + 44 RAM (50% faster) |
| Llama 3.1 70B Q4_K_M | Tiered + layer skip | 0.5 tok/s | 22.9 GB | 36 VRAM + 44 RAM, 20 layers skipped |
3-tier adaptive caching auto-sizes from hardware: VRAM-resident layers (zero I/O) + pinned RAM (H2D only) + NVMe/mmap fallback. Achieves 83x speedup over mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).
Bottleneck is PCIe H2D bandwidth at Gen3 x8 (~6.5 GB/s). Q4_K_M fits 10 more layers in VRAM (36 vs 26), reducing tier B transfers. Layer skip (cosine similarity calibration) eliminates 20/80 layers per token with minimal quality loss.
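Layer skip works on the residual stream: if a layer's output hidden state is nearly parallel to its input, the layer contributes little and can be dropped at decode time. A minimal sketch of the selection logic, assuming the usual input-vs-output comparison over a calibration prompt (names and averaging scheme are illustrative, not the project's code):

```cpp
// Calibration sketch: mark a layer skippable when cos(h_in, h_out) >= thr,
// i.e. the layer barely rotates the residual-stream hidden state.
#include <cmath>
#include <vector>

float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); i++) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return (float)(dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12));
}

// h[l] = hidden state entering layer l (h[l+1] = after layer l),
// averaged over the calibration prompt's tokens.
std::vector<bool> pick_skippable(const std::vector<std::vector<float>>& h,
                                 float thr /* e.g. 0.98 */) {
    std::vector<bool> skip(h.size() - 1);
    for (size_t l = 0; l + 1 < h.size(); l++)
        skip[l] = cosine(h[l], h[l + 1]) >= thr;
    return skip;
}
```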
Features

- Zero external dependencies beyond CUDA Toolkit (no PyTorch, no cuBLAS)
- GGUF model format with Q4_0, Q8_0, Q4_K_M, Q5_K, Q6_K, F16, F32 quantization
- 3-Tier Adaptive Caching: auto-sized VRAM resident + pinned RAM + NVMe/mmap tiers
- SLEP streaming: double-buffered layer pipeline overlaps NVMe reads, PCIe DMA, and GPU compute
- gpu-nvme-direct backend: userspace NVMe driver reads model weights directly to pinned GPU-accessible memory
- Layer skip: cosine-similarity calibration skips redundant layers (20/80 skipped at threshold 0.98)
- Self-speculative decoding: VRAM-resident layers as draft model, no extra model needed (see the sketch after this list)
- Four data paths (auto-selected): VRAM resident > pinned RAM H2D > mmap pinned > CPU worker memcpy
- Llama architecture: RoPE, GQA, SwiGLU, RMSNorm, KV cache
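Self-speculative decoding pays off under streaming because a full-model pass costs roughly the same for several tokens as for one, so verification can be batched. A greedy-acceptance sketch, where `draft_next` and `verify` are placeholders rather than the project's actual API:

```cpp
// Self-speculative decoding sketch (greedy case). The VRAM-resident layer
// prefix serves as the draft model; the full tiered model verifies a batch
// of k drafted tokens in one streamed pass.
#include <vector>

int draft_next(const std::vector<int>& ctx);          // resident layers only (placeholder)
std::vector<int> verify(const std::vector<int>& ctx,  // full model over ctx + draft,
                        const std::vector<int>& draft); // predictions at k+1 positions (placeholder)

void generate(std::vector<int>& tokens, int n_new, int draft_k) {
    int produced = 0;
    while (produced < n_new) {
        // 1) Draft k tokens cheaply -- no layer streaming involved.
        std::vector<int> draft, ctx = tokens;
        for (int i = 0; i < draft_k; i++) {
            int t = draft_next(ctx);
            draft.push_back(t);
            ctx.push_back(t);
        }
        // 2) One full pass scores all k positions, so the expensive layer
        //    streaming cost is paid once per batch instead of once per token.
        std::vector<int> verified = verify(tokens, draft);  // size k + 1
        // 3) Keep the longest agreeing prefix, then the full model's token
        //    (a correction on mismatch, a bonus token if all k matched).
        int accept = 0;
        while (accept < draft_k && verified[accept] == draft[accept]) accept++;
        for (int i = 0; i < accept; i++) tokens.push_back(draft[i]);
        tokens.push_back(verified[accept]);
        produced += accept + 1;
    }
}
```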
Requirements

- Linux (tested on Ubuntu, kernel 6.17+)
- CUDA Toolkit 13.1
- gcc-14 / g++-14
- NVIDIA GPU with Compute Capability 8.0+ (RTX 3090 tested)
- CMake 3.24+
- (Optional) NVMe SSD on separate PCIe slot + gpu-nvme-direct library
```bash
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc-14 \
    -DCMAKE_CXX_COMPILER=g++-14 \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Run (resident mode — model fits in VRAM)
./ntransformer -m /path/to/llama-8b-q8_0.gguf -p "Hello" -n 128

# Run (streaming mode — model larger than VRAM)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --streaming

# Run with layer skip (fastest for 70B)
./ntransformer -m /path/to/llama-70b-q4_k_m.gguf -p "Hello" -n 32 --streaming --skip-threshold 0.98

# Self-speculative decoding (VRAM layers as draft, no extra model)
./ntransformer -m /path/to/llama-70b-q6_k.gguf -p "Hello" -n 32 --self-spec --draft-k 3

# Chat mode
./ntransformer -m /path/to/model.gguf --chat

# Benchmark
./ntransformer -m /path/to/model.gguf --benchmark -n 64
```
Running ntransformer with NVMe direct I/O requires system-level modifications. An automated setup script handles all of them:
```bash
# Full first-time setup (interactive, creates backups)
sudo ./scripts/setup_system.sh

# Check current system state (no changes)
sudo ./scripts/setup_system.sh --check

# NVMe-only (run after every reboot)
sudo ./scripts/setup_system.sh --nvme-only
```
What the script modifies and why
| Phase | What | Why | Risk | Rollback |
|---|---|---|---|---|
| 1 | Installs gcc-14, cmake, kernel headers | CUDA 13.1 is incompatible with gcc-15 (Ubuntu 25.10 default) | Low — standard packages | `apt remove` |
| 2 | Adds `amd_iommu=off` to GRUB | The AMD root complex drops GPU→NVMe P2P reads if the IOMMU is on; disabling it lets posted PCIe writes (doorbells) through | Medium — removes hardware DMA isolation between all PCIe devices. Don't run on multi-tenant/server systems | Remove `amd_iommu=off` from `/etc/default/grub`, run `update-grub`, reboot |
| 3 | Patches NVIDIA DKMS (`os-mlock.c`) | `follow_pfn()` was removed in kernel 6.12+. Without the patch, `cudaHostRegisterIoMemory` fails and the GPU can't map NVMe BAR0 for MMIO writes | High — a bad patch prevents the GPU driver from loading (black screen on reboot). A `.orig` backup is created automatically | `cp os-mlock.c.orig os-mlock.c` in the DKMS source dir, then `dkms remove`/`install nvidia/VERSION` |
| 3b | Patches CUDA header (`math_functions.h`) | glibc 2.42+ (Ubuntu 25.10) declares `rsqrt()`/`rsqrtf()` with `noexcept`; CUDA 13.1 declares them without it, causing a build failure | Low — only affects one header; a backup is created | `cp math_functions.h.orig math_functions.h` |
| 4 | Loads VFIO modules (`vfio`, `vfio-pci`) | The NVMe must be bound to VFIO for userspace access. Consumer GPUs (GeForce) require `enable_unsafe_noiommu_mode=1` | Low — modules unload on reboot. "Unsafe noiommu" means no IOMMU DMA protection for VFIO devices | Reboot (or `modprobe -r vfio-pci vfio`) |
| 5 | Unbinds NVMe from the kernel, binds it to VFIO | gpu-nvme-direct needs raw PCIe access. The NVMe disappears from `/dev/` while bound to VFIO | High if pointed at the wrong device — never run on your boot drive. The script auto-detects and refuses boot devices | `sudo ./scripts/restore_nvme.sh` |
BIOS settings (manual, before running the script)
- Above 4G Decoding: ON (required for 64-bit BAR mapping)
- IOMMU: OFF (or leave on — the script adds the kernel parameter)
- Secure Boot: OFF (required for unsigned/patched kernel module loading)
WARNING: This project performs low-level PCIe operations (GPU MMIO writes to NVMe controller
registers, userspace NVMe command submission, VFIO device passthrough). While tested extensively on
RTX 3090 + WD SN740, incorrect configuration or hardware incompatibilities could theoretically cause:
- NVMe link failure requiring power cycle (observed during development with GPU reads)
- Data loss on the NVMe device used for raw block storage
- System instability from disabled IOMMU or patched kernel modules
Never use your boot drive for NVMe direct I/O. Always use a dedicated secondary NVMe.
The authors are not responsible for hardware damage or data loss. Use at your own risk.
| Script | Purpose | When to run |
|---|---|---|
| `scripts/setup_system.sh` | Full system configuration (7 phases) | First-time setup |
| `scripts/setup_system.sh --nvme-only` | VFIO + NVMe bind only | After every reboot |
| `scripts/setup_system.sh --check` | Verify system state | Debugging |
| `scripts/setup_nvme.sh [BDF]` | Bind a single NVMe to VFIO | After reboot (standalone) |
| `scripts/restore_nvme.sh [BDF]` | Restore NVMe to the kernel driver | When done with NVMe direct |
For models that don’t fit in VRAM, the NVMe backend eliminates the CPU from the data path:
```text
NVMe SSD → (DMA) → Pinned Staging → (PCIe H2D) → GPU Buffers → Compute
```
```bash
# Build with NVMe support (requires gpu-nvme-direct library)
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_GPUNVME=ON \
    -DCMAKE_C_COMPILER=gcc-14 -DCMAKE_CXX_COMPILER=g++-14 \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.1/bin/nvcc
cmake --build . -j

# Write GGUF model to NVMe raw device
sudo ./scripts/restore_nvme.sh   # ensure kernel driver is bound
sudo dd if=model.gguf of=/dev/nvme0n1 bs=1M oflag=direct status=progress

# Bind NVMe to VFIO for userspace access
sudo ./scripts/setup_nvme.sh     # loads VFIO, forces D0, enables BusMaster

# Run with NVMe backend
sudo GPUNVME_PCI_BDF=0000:01:00.0 GPUNVME_GGUF_LBA=0 \
    ./build/ntransformer -m /path/to/model.gguf -p "Hello" -n 32 --streaming

# Restore NVMe to kernel driver when done
sudo ./scripts/restore_nvme.sh
```
- The GGUF model file is written to raw NVMe blocks via `dd`
- `setup_nvme.sh` binds the NVMe to VFIO, forces the PCIe D0 power state, and enables BusMaster
- gpu-nvme-direct initializes the NVMe controller from userspace (admin queues, I/O queues)
- During inference, each layer (~670 MB for 70B Q6_K) is read via 670 NVMe commands in ~202 ms
- Data lands in CUDA pinned staging memory, then async DMA to GPU compute buffers
- Pipeline overlaps NVMe reads, H2D DMA, and GPU compute across double buffers
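A minimal sketch of that double-buffered overlap using two CUDA streams. It is illustrative only: `nvme_read_layer` and `run_layer` are placeholders standing in for the gpu-nvme-direct read and the layer kernels, not the project's actual API.

```cpp
// Double-buffering sketch: while the GPU computes layer l from buffer l%2,
// the next layer is read from NVMe and copied H2D into the other buffer.
#include <cuda_runtime.h>
#include <cstddef>

void nvme_read_layer(int layer, void* dst);                    // placeholder
void run_layer(int layer, const void* weights, cudaStream_t s); // placeholder

void stream_layers(int n_layers, size_t layer_bytes) {
    void *staging[2], *gpu_buf[2];
    cudaStream_t copy_s, compute_s;
    cudaEvent_t ready[2], done[2];
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&compute_s);
    for (int i = 0; i < 2; i++) {
        cudaHostAlloc(&staging[i], layer_bytes, cudaHostAllocDefault); // pinned RAM
        cudaMalloc(&gpu_buf[i], layer_bytes);
        cudaEventCreate(&ready[i]);
        cudaEventCreate(&done[i]);
        cudaEventRecord(done[i], compute_s);  // both buffers start out free
    }
    for (int l = 0; l < n_layers; l++) {
        int b = l % 2;
        cudaEventSynchronize(done[b]);        // wait until buffer b is free
        nvme_read_layer(l, staging[b]);       // NVMe -> pinned staging
        cudaMemcpyAsync(gpu_buf[b], staging[b], layer_bytes,
                        cudaMemcpyHostToDevice, copy_s);
        cudaEventRecord(ready[b], copy_s);
        cudaStreamWaitEvent(compute_s, ready[b], 0); // compute waits on H2D only
        run_layer(l, gpu_buf[b], compute_s);
        cudaEventRecord(done[b], compute_s);
    }
    cudaStreamSynchronize(compute_s);
}
```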
Project structure

```text
src/
├── core/        # Tensor, allocator, GPU device management
├── cuda/        # CUDA kernels: GEMV, RMSNorm, RoPE, SwiGLU, softmax
├── memory/      # SLEP layer streaming engine (NVMe + mmap backends)
├── model/       # Transformer: config, GGUF loader, attention, FFN, norms
├── inference/   # Tokenizer, sampler, engine
├── utils/       # Timer, logger
├── main.cpp     # CLI entry point
scripts/
├── setup_system.sh   # Full system setup (GRUB, NVIDIA patch, CUDA patch, VFIO, NVMe)
├── setup_nvme.sh     # Bind NVMe to VFIO, configure for gpu-nvme-direct
├── restore_nvme.sh   # Restore NVMe to kernel driver
tests/                # Unit tests (tensor, GEMM kernels, NVMe layer loader)
```
`forward_tiered()` — hybrid pipeline:

```text
Tier A (VRAM resident, layers 0..28):
  GPU Compute:  [layer 0][layer 1]...[layer 28]    (zero I/O, weights permanent)

Tier B (pinned RAM, layers 29..79, double-buffered):
  H2D DMA:      [L29→gpu0][L30→gpu1][L31→gpu0]...  (async from pinned RAM)
  GPU Compute:  [        ][layer 29][layer 30]...  (overlapped with H2D)

Tier C (NVMe/mmap fallback, if needed):
  NVMe/memcpy:  [read L→stg0][read L→stg1]...
  H2D DMA:      [           ][stg0→gpu0  ]...
  GPU Compute:  [           ][           ][layer]...
```
Tier sizes are auto-computed from `cudaMemGetInfo()` plus `MemAvailable` in `/proc/meminfo`.
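A sketch of how such auto-sizing can be derived. The reserve margins and helper names here are illustrative assumptions, not the project's actual logic:

```cpp
// Sketch: decide how many layers fit in VRAM (tier A) and pinned RAM
// (tier B); everything else falls through to NVMe/mmap (tier C).
// The 2 GiB / 4 GiB reserves are illustrative guesses.
#include <cuda_runtime.h>
#include <algorithm>
#include <fstream>
#include <string>

static size_t mem_available_bytes() {  // parse /proc/meminfo
    std::ifstream f("/proc/meminfo");
    std::string line;
    while (std::getline(f, line))
        if (line.rfind("MemAvailable:", 0) == 0)
            return std::stoull(line.substr(13)) * 1024;  // value is in kB
    return 0;
}

struct TierPlan { int vram_layers, ram_layers, nvme_layers; };

TierPlan plan_tiers(int n_layers, size_t layer_bytes) {
    size_t free_vram = 0, total_vram = 0;
    cudaMemGetInfo(&free_vram, &total_vram);
    size_t vram_budget = free_vram > (2ULL << 30) ? free_vram - (2ULL << 30) : 0;
    size_t ram_budget  = mem_available_bytes();
    ram_budget = ram_budget > (4ULL << 30) ? ram_budget - (4ULL << 30) : 0;

    TierPlan p{};
    p.vram_layers = (int)std::min<size_t>((size_t)n_layers, vram_budget / layer_bytes);
    p.ram_layers  = (int)std::min<size_t>((size_t)(n_layers - p.vram_layers),
                                          ram_budget / layer_bytes);
    p.nvme_layers = n_layers - p.vram_layers - p.ram_layers;
    return p;
}
```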
Quantization formats

| Format | Bits/Weight | Block Size (weights) | Supported |
|---|---|---|---|
| Q4_0 | 4.5 | 32 | Yes |
| Q8_0 | 8.5 | 32 | Yes |
| Q4_K_M | 4.5 | 256 | Yes (mixed: Q4_K + Q5_K + Q6_K) |
| Q5_K | 5.5 | 256 | Yes |
| Q6_K | 6.6 | 256 | Yes |
| F16 | 16 | 1 | Yes |
| F32 | 32 | 1 | Yes |
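The bits/weight figures follow directly from the GGUF block layouts. A Q8_0 block, for example, is one fp16 scale plus 32 int8 quants: 34 bytes per 32 weights, i.e. 8.5 bits/weight as in the table. A dequantization sketch (`fp16_to_fp32` is an assumed helper, standing in for whatever half-to-float conversion the host or GPU code uses):

```cpp
// Q8_0 block layout (GGUF/GGML): one fp16 scale + 32 signed 8-bit quants.
#include <cstdint>

struct block_q8_0 {
    uint16_t d;        // scale, IEEE fp16
    int8_t   qs[32];   // quantized weights
};
static_assert(sizeof(block_q8_0) == 34, "34 bytes / 32 weights = 8.5 bits/weight");

float fp16_to_fp32(uint16_t h);  // assumed helper

// Dequantize one block: w[i] = d * qs[i].
void dequant_q8_0(const block_q8_0* b, float* out) {
    float d = fp16_to_fp32(b->d);
    for (int i = 0; i < 32; i++)
        out[i] = d * b->qs[i];
}
```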
Roadmap

- Phase 1 – Foundation (complete): Llama 8B Q8_0, custom CUDA kernels, 48.9 tok/s
- Phase 2 – SLEP Streaming (complete): 70B on single GPU, 3-tier caching, 33x speedup
- Phase 3 – Optimization (complete): Q4_K_M/Q5_K support, layer skip (0.5 tok/s), self-speculative decoding, F16 KV cache
- Phase 4 – NVMe Direct: gpu-nvme-direct backend for tier C (GPU-initiated NVMe reads, 3.35 GB/s)
- Phase 5 – Polish: speculative decoding with draft model, benchmarks, public C API
License: BSD-2-Clause