Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.

Inspired by @karpathy/autoresearch — which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: agent modifies one file, runs a fixed evaluation, keeps or reverts, repeats forever.
Give AutoKernel any PyTorch model. It will:
- Profile the model to find which GPU kernels are bottlenecks
- Extract each bottleneck as a standalone Triton kernel
- Optimize each kernel autonomously (edit, benchmark, keep/revert — forever)
- Verify end-to-end correctness and report the total speedup
The agent reads program.md — the “research org code” — which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl’s law.
Each experiment takes ~90 seconds. That’s ~40 experiments/hour, ~320 overnight, across all kernels.
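The core loop can be sketched as follows. This is a minimal illustration, not the actual orchestrator; `run_bench`, `propose_edit`, and `revert_edit` are hypothetical stand-ins for bench.py and the agent's edits to kernel.py:

```python
def autokernel_loop(run_bench, propose_edit, revert_edit,
                    budget_s=8 * 3600, experiment_s=90):
    """Sketch of the edit -> benchmark -> keep/revert loop.

    run_bench()    -> (correct: bool, latency_us: float)  # stand-in for bench.py
    propose_edit() -> applies one candidate change to kernel.py
    revert_edit()  -> undoes the last change
    """
    _, best_latency = run_bench()              # experiment 0: the baseline
    kept = [("baseline", best_latency)]
    for i in range(int(budget_s // experiment_s)):   # ~320 experiments overnight
        propose_edit()
        correct, latency = run_bench()
        if correct and latency < best_latency:       # keep only strict wins
            best_latency = latency
            kept.append((f"exp{i}", latency))
        else:                                        # wrong or slower: revert
            revert_edit()
    return best_latency, kept
```

At 90 seconds per iteration, the 8-hour budget above yields the ~320 overnight experiments mentioned in the text.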
Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.
```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
    --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py
```
Spin up Claude, Codex, or any coding agent in this directory and give it this prompt:

```
Read program.md and let's kick off a new experiment. Start with setup.
```
The agent will:
- Profile your model and present the optimization plan
- Create a branch (e.g., `autokernel/mar10-llama7b`)
- Optimize each bottleneck kernel in priority order
- Verify end-to-end correctness and report total speedup
program.md is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl’s law reasoning.
```
                profile.py       extract.py            bench.py (loop)    verify.py
Any PyTorch ──> Rank kernels ──> Generate baseline ──> Optimize each ──>  End-to-end
model           by GPU time      Triton kernels        kernel (agent)     verification
```
| Tool | What it does |
|---|---|
| `profile.py` | Profiles any PyTorch model with `torch.profiler`, ranks kernels by GPU time, classifies each as compute- or memory-bound |
| `extract.py` | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| `orchestrate.py` | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress |
| `bench.py` | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline analysis |
| `verify.py` | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |
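The compute/memory-bound classification is a roofline comparison: a kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine's ridge point. A minimal sketch, where the default peak numbers are rough H100 SXM datasheet figures, not values the repo necessarily uses:

```python
def classify(flops, bytes_moved, peak_tflops=989.0, peak_gbps=3350.0):
    """Classify a kernel as compute- or memory-bound via a simple roofline model.

    Compares arithmetic intensity (FLOPs per byte moved) to the ridge point
    (peak FLOP/s divided by peak bandwidth); above the ridge, more bandwidth
    won't help and the kernel is compute-bound.
    """
    intensity = flops / bytes_moved                    # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (peak_gbps * 1e9)   # ridge point, FLOPs/byte
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

For example, a 4096³ fp16 matmul (2·4096³ FLOPs over three 4096×4096 fp16 matrices) lands well above the ridge, while an elementwise op at a few FLOPs per byte lands far below it.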
9 kernel types covering the core operations of modern deep learning:
| Kernel | Description | Key Metric |
|---|---|---|
| matmul | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross entropy loss | GB/s |
| rotary_embedding | Rotary position embeddings (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |
Each has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
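For illustration, the "numerically stable" softmax variant looks like this in NumPy. This is a sketch of the reference semantics only, not the repo's reference.py code:

```python
import numpy as np

def softmax_reference(x):
    """Row-wise numerically stable softmax: subtract each row's max before
    exponentiating so large logits don't overflow, then normalize the row."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max(axis=-1, keepdims=True)   # max trick: exp args <= 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```

Without the max subtraction, logits around 1000 would overflow `exp` to infinity; with it, every exponent is at most zero.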
Self-contained model definitions ship with AutoKernel (no transformers library needed):
| Model | File | Params | Usage |
|---|---|---|---|
| GPT-2 Small | `models/gpt2.py` | 124M | `--class-name GPT2 --input-shape 1,1024` |
| LLaMA (compact) | `models/llama_7b.py` | 160M | `--class-name LlamaModel --input-shape 1,512` |
| LLaMA 7B | `models/llama_7b.py` | 7B | `--class-name LlamaModel7B --input-shape 1,2048` |
| BERT-base | `models/bert_base.py` | 110M | `--class-name BertModel --input-shape 8,512` |
| Custom | `models/custom.py` | — | Template for your own model |
For HuggingFace models (`uv sync --extra models`):

```bash
uv run profile.py --module transformers --class-name AutoModelForCausalLM \
    --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
```
```
autokernel/
  kernel.py        the file the agent modifies (one kernel at a time)
  program.md       agent instructions -- the "research org code"
  bench.py         fixed benchmark + 5-stage correctness harness
  reference.py     PyTorch reference implementations (ground truth)
  prepare.py       one-time setup: test data, baselines
  profile.py       profile any PyTorch model, rank kernels by GPU time
  extract.py       extract bottleneck kernels into workspace/
  orchestrate.py   multi-kernel scheduler (Amdahl's law)
  verify.py        end-to-end model verification + speedup report
  analysis.py      experiment visualization (generates progress.png)
  kernels/         starter Triton kernels (9 types)
  models/          self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/       runtime artifacts (gitignored)
```
Why Triton. Readable Python-like syntax the agent can understand and modify without mastering inline PTX or SASS. Well-tuned Triton regularly reaches 80-95% of cuBLAS. The agent needs to iterate fast — Triton compiles in seconds, not minutes.
Correctness first. The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from “optimizing” by producing garbage.
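A sketch of that gate, using NumPy in place of torch. The real harness compares against the PyTorch references in reference.py; the function name and tolerances here are illustrative:

```python
import numpy as np

def correctness_gate(candidate, reference, inputs, rtol=1e-2, atol=1e-3):
    """Run the candidate kernel against the reference before any timing.

    Returns False (i.e., revert) on shape mismatch, NaN/Inf output, or
    numeric drift, so a fast-but-wrong kernel never enters the results.
    """
    out = candidate(*inputs)
    ref = reference(*inputs)
    if out.shape != ref.shape:
        return False
    if not np.all(np.isfinite(out)):
        return False
    return np.allclose(out, ref, rtol=rtol, atol=atol)
```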
Amdahl’s law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.
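The arithmetic behind that prioritization is just Amdahl's law, which reduces to a one-line helper (a sketch; the function name is mine, not the repo's):

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of total GPU time gets a
    `kernel_speedup`x improvement and the remaining time is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)
```

Plugging in the numbers from the text: a 1.5x speedup on a 60% kernel gives 1/(0.4 + 0.4) = 1.25x end-to-end, while a 3x speedup on a 5% kernel gives only about 1.03x.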
Single file to modify. The agent only touches kernel.py. Scope stays manageable, diffs reviewable, reverts clean.
TSV logging. Results go to a plain results.tsv file. Human-readable, git-friendly, trivially parseable, no infrastructure.
Every experiment is logged to results.tsv (tab-separated):
| Column | Description |
|---|---|
experiment |
Sequential experiment number (0 = baseline) |
tag |
Short identifier |
kernel_type |
Which kernel (e.g., matmul) |
throughput_tflops |
Measured throughput (higher is better) |
latency_us |
Execution time in microseconds |
pct_peak |
Percentage of GPU theoretical peak |
speedup_vs_pytorch |
Speedup vs PyTorch/cuBLAS |
correctness |
PASS, FAIL, TIMEOUT, or CRASH |
peak_vram_mb |
Peak GPU memory usage |
description |
What was tried |
This project is autoresearch for GPU kernels — directly inspired by Andrej Karpathy’s autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent can run hundreds of experiments overnight, methodically exploring a search space and logging every result. AutoKernel applies that same loop — agent edits one file, runs a fixed evaluation, keeps or reverts — to the domain of GPU kernel optimization with Triton.
Built by the team behind Forge.
MIT
