robertcprice/nCPU: nCPU: model-native and tensor-optimized CPU research runtimes with organized workloads, tools, and docs · GitHub

💥 Read this awesome post from Hacker News 📖

📂 **Category**:

📌 **What You’ll Learn**:

A CPU that runs entirely on GPU — registers, memory, flags, and program counter are all tensors.
Every ALU operation is a trained neural network.
Addition uses Kogge-Stone carry-lookahead. Multiplication uses a learned byte-pair lookup table.
Bitwise ops use neural truth tables. Shifts use attention-based bit routing. No hardcoded arithmetic.

Tests
Models
Accuracy
License

pip install -e ".[dev]"

# Run a program — all arithmetic through trained neural networks
python main.py --program programs/sum_1_to_10.asm

# Run with execution trace
python main.py --program programs/fibonacci.asm --trace

# Inline assembly
python main.py --inline "MOV R0, 42; HALT"

# GPU tensor mode (maximum speed, native tensor ops)
python main.py --binary firmware.bin --fast

The entire CPU lives on GPU. Registers, memory, flags, and the program counter are
PyTorch tensors. Instruction decode, ALU dispatch, and state updates all happen on-device
— nothing round-trips to the host CPU. Every ALU operation routes through a trained .pt model:

Instruction	Neural Model	How It Works
`ADD R0, R1, R2`	arithmetic.pt + carry_combine.pt	Kogge-Stone CLA (8 neural passes)
`SUB R0, R1, R2`	arithmetic.pt + carry_combine.pt	Two’s complement + CLA
`MUL R0, R1, R2`	multiply.pt	Byte-pair LUT lookups (up to 64 pairs for 64-bit)
`DIV R0, R1, R2`	arithmetic.pt	Restoring division via neural subtraction
`AND R0, R1, R2`	logical.pt	Vectorized truth table (all 32 bits at once)
`OR R0, R1, R2`	logical.pt	Vectorized truth table
`XOR R0, R1, R2`	logical.pt	Vectorized truth table
`SHL R0, R1, 4`	lsl.pt	Attention-based bit routing per output position
`SHR R0, R1, 2`	lsr.pt	Attention-based bit routing
`CMP R0, R1`	arithmetic.pt	Neural subtraction → derive N/Z/C flags
`INC R0`	arithmetic.pt	Neural add 1
`DEC R0`	arithmetic.pt	Neural subtract 1

Math functions (sin, cos, sqrt, exp, log, atan2) also wired through trained models.

Result: 100% accuracy on integer arithmetic, verified by 347 automated tests.

23 models (~135 MB total), 13 actively wired:

Component	Model	Accuracy	Wired
ADD/SUB/INC/DEC	Kogge-Stone CLA (carry_combine + full adder)	100%	Yes
MUL	Byte-pair LUT [256×256×16]	100%	Yes
AND/OR/XOR	Neural truth tables [7×4]	100%	Yes
LSL	Decomposed shift network	100%	Yes
LSR	Decomposed shift network	100%	Yes
CMP	Neural subtraction	100%	Yes
sin/cos	Sine-activated deep network	Trained	Yes
sqrt	Two-stage with Newton refinement	Trained	Yes
exp/log	4-layer MLP	Trained	Yes
atan2	6-layer residual with BatchNorm	Trained	Yes
Decode LLM	Qwen2.5-Coder-1.5B LoRA	100%	Yes (real mode)

See models/MODEL_INDEX.md for full details.

Benchmarked on Apple Silicon (MPS backend, PyTorch 2.10.0), 1,000 iterations per operation:

Operation	Latency (mean)	Sequential Passes	Strategy
exp, log	21 us	1	Single-pass MLP
mul	21 us	1	Batched byte-pair LUT gather
and, or, xor	21 us	1	Vectorized truth table lookup
sin, cos	48 us	2	Sine-activated deep network
add, sub, cmp	248 us	8 (CLA)	Kogge-Stone carry-lookahead
shl, shr	434 us	3 (batched)	Vectorized attention routing
sqrt	522 us	2 + batch pad	Two-stage BatchNorm MLP
atan2	935 us	6 + batch pad	Residual BatchNorm network

All models load in 60ms. Programs execute at 136–262 us/cycle depending on instruction mix (~4,975 IPS).

# Run benchmarks
python benchmarks/benchmark_neural.py

Multiplication is 12x faster than addition, even with carry-lookahead. In conventional CPUs,
MUL is slower than ADD. In the neural CPU, it’s inverted: the byte-pair LUT (21 us) has zero
sequential dependency, while the CLA adder (248 us) requires O(log n) carry-combine stages.
Before CLA, the gap was 38x (826 us with 32 ripple-carry passes).

Carry-lookahead works in neural networks. The Kogge-Stone parallel-prefix algorithm,
using a trained carry-combine network (100% accuracy on all 16 inputs), reduced ADD/SUB/CMP
from ~826 us to ~248 us — a 3.3x speedup. Classical hardware design principles transfer
directly to neural architectures.

Vectorization recovers most of the attention cost. Shift operations went from ~2,833 us
(64 sequential passes) to ~434 us (3 batched passes) — a 6.5x speedup.

O(1) / O(log n) / O(n) hierarchy. Operations fall into three tiers: O(1) single-pass
lookups (~21 us), O(log n) parallel-prefix carry (~248 us), and O(n) sequential passes
(sqrt ~522 us, atan2 ~935 us).

See the research paper for detailed analysis.

The NeuralCPU is not a simulator that happens to use a GPU — it is a GPU program.
All CPU state lives permanently on-device as PyTorch tensors:

Registers: 31 x 64-bit (torch.int64 on GPU)
Memory: Flat byte-addressable (torch.uint8 on GPU)
Flags: N, Z, C, V as GPU-resident tensors
Program Counter: GPU tensor, incremented on-device

The CPU is 64-bit ARM64. Registers, addresses, and data paths are all 64-bit.
Instruction fetch, decode, execute, and writeback all happen on GPU. The ALU is a bank
of trained neural networks running as GPU inference. No host CPU arithmetic in the
execution loop.

Neural Mode (default) — Every ALU operation is a forward pass through a trained .pt model:

from ncpu.model import CPU
cpu = CPU(neural_execution=True)
cpu.load_program("MOV R0, 7\nMOV R1, 6\nMUL R2, R0, R1\nHALT")
cpu.run()
print(cpu.get_register("R2"))  # 42 (computed by neural byte-pair LUT)

Fast Mode (--fast) — Same GPU-resident architecture, but ALU uses torch.add/torch.mul
instead of model inference. Targets 1.35M IPS at batch_size=32768 on Apple Silicon MPS:

from ncpu.neural import NeuralCPU
cpu = NeuralCPU(fast_mode=True)  # Native GPU tensor ops
cpu.load_binary(arm64_binary)

For maximum throughput, the kernels/ directory contains native Metal GPU implementations
that run the entire ARM64 fetch-decode-execute loop on GPU with zero CPU-GPU synchronization:

MLX Metal (kernels/mlx/): Python interface to custom Metal compute kernels via Apple MLX.
Full ARM64 decode and execute in Metal Shading Language.
Rust Metal (kernels/rust_metal/): Direct Metal API via objc2-metal with PyO3 Python
bindings. Includes GPU-side syscall handling, basic block caching, neural dispatch, and
out-of-order execution. Ships with a native DOOM benchmark.

Text Assembly (ncpu.model)

Instruction	Format	Description
`MOV`	`MOV Rd, imm/Rs`	Load immediate or copy register
`ADD`	`ADD Rd, Rs1, Rs2`	Neural addition
`SUB`	`SUB Rd, Rs1, Rs2`	Neural subtraction
`MUL`	`MUL Rd, Rs1, Rs2`	Neural multiplication
`DIV`	`DIV Rd, Rs1, Rs2`	Neural division (restoring algorithm)
`AND`	`AND Rd, Rs1, Rs2`	Neural bitwise AND
`OR`	`OR Rd, Rs1, Rs2`	Neural bitwise OR
`XOR`	`XOR Rd, Rs1, Rs2`	Neural bitwise XOR
`SHL`	`SHL Rd, Rs, imm/Rn`	Neural shift left
`SHR`	`SHR Rd, Rs, imm/Rn`	Neural shift right
`INC`	`INC Rd`	Neural increment
`DEC`	`DEC Rd`	Neural decrement
`CMP`	`CMP Rs1, Rs2`	Neural compare (sets flags)
`JMP`	`JMP label`	Unconditional jump
`JZ/JNZ`	`JZ/JNZ label`	Jump if zero / not zero
`JS/JNS`	`JS/JNS label`	Jump if negative / not negative
`HALT`	`HALT`	Stop execution

ARM64 Binary (ncpu.neural)

Full ARM64 instruction set — real binary encoding. The NeuralCPU decodes and executes
real ARM64 instructions with GPU-resident state.

nCPU/
├── ncpu/
│   ├── neural/           # Full GPU neural CPU (ARM64, 12K lines)
│   │   ├── cpu.py        # NeuralCPU — all state on GPU as tensors
│   │   └── neural_alu_bridge.py  # Routes ops through trained models
│   ├── model/            # Model-based CPU (text assembly)
│   │   ├── cpu.py        # CPU orchestrator
│   │   ├── neural_ops.py # Loads and runs .pt models
│   │   └── architectures.py  # All model class definitions
│   └── tensor/           # Tensor-based ARM64 kernel
├── kernels/
│   ├── mlx/              # MLX Metal compute kernels (Python + Metal Shading Language)
│   └── rust_metal/       # Rust Metal GPU kernel (objc2-metal + PyO3 bindings)
│       └── src/          # Full ARM64 execute loop in Metal, zero GPU-CPU sync
├── models/               # 23 trained .pt models
│   ├── alu/              # arithmetic, carry_combine, multiply, divide, logical, compare
│   ├── shifts/           # lsl, lsr, asr, rol
│   ├── math/             # sincos, sqrt, exp, log, atan2, doom_trig
│   ├── decode_llm/       # Qwen2.5 LoRA adapter
│   └── MODEL_INDEX.md    # Complete model status
├── demos/                # DOOM raycaster and other demos
├── programs/             # Assembly programs (.asm)
├── tests/                # 347 tests
├── docs/                 # Architecture docs (neural mode + tensor mode)
├── benchmarks/           # Performance benchmarks
├── paper/                # Research paper
└── main.py               # CLI entry point

A DDA raycaster that runs all arithmetic through trained neural networks. Every ADD, SUB, MUL,
and CMP is a forward pass through a real .pt model.

# Neural mode — every op through trained models (~2.5 FPS)
python demos/doom_raycaster.py

# Fast mode — native Python arithmetic (~5,000 FPS)
python demos/doom_raycaster.py --fast

# Side-by-side comparison (verifies identical output)
python demos/doom_raycaster.py --both

Fixed-point arithmetic (scale 1024) keeps everything in 32-bit integers, matching the
nCPU’s integer-only ISA. Both modes execute identical algorithms and produce identical
frame output — the only difference is whether arithmetic routes through neural networks.

pytest tests/ -v
# 347 tests: decode, programs, registry, state, neural ops (incl. CLA + batch),
#            neural bridge, math ops, architecture forward-pass, division

MIT

⚡ **What’s your take?**
Share your thoughts in the comments below!

#️⃣ **#robertcpricenCPU #nCPU #modelnative #tensoroptimized #CPU #research #runtimes #organized #workloads #tools #docs #GitHub**

🕒 **Posted on**: 1772605214

🌟 **Want more?** Click here for more info! 🌟

robertcprice/nCPU: nCPU: model-native and tensor-optimized CPU research runtimes with organized workloads, tools, and docs · GitHub

Text Assembly (ncpu.model)

ARM64 Binary (ncpu.neural)

By

Leave a Reply Cancel reply