Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.

Inspired by @karpathy/autoresearch — which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: agent modifies one file, runs a fixed evaluation, keeps or reverts, repeats forever.
Give AutoKernel any PyTorch model. It will:
- Profile the model to find which GPU kernels are bottlenecks
- Extract each bottleneck as a standalone Triton kernel
- Optimize each kernel autonomously (edit, benchmark, keep/revert — forever)
- Verify end-to-end correctness and report the total speedup
The agent reads program.md — the “research org code” — which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl’s law.
Each experiment takes ~90 seconds. That’s ~40 experiments/hour, ~320 overnight, across all kernels.
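The core loop can be sketched as follows. This is a minimal illustration, not the actual orchestrator; `run_bench`, `propose_edit`, and `revert_edit` are hypothetical stand-ins for bench.py and the agent's edits to kernel.py:

```python
def autokernel_loop(run_bench, propose_edit, revert_edit,
                    budget_s=8 * 3600, experiment_s=90):
    """Sketch of the edit -> benchmark -> keep/revert loop.

    run_bench()    -> (correct: bool, latency_us: float)  # stand-in for bench.py
    propose_edit() -> applies one candidate change to kernel.py
    revert_edit()  -> undoes the last change
    """
    _, best_latency = run_bench()              # experiment 0: the baseline
    kept = [("baseline", best_latency)]
    for i in range(int(budget_s // experiment_s)):   # ~320 experiments overnight
        propose_edit()
        correct, latency = run_bench()
        if correct and latency < best_latency:       # keep only strict wins
            best_latency = latency
            kept.append((f"exp{i}", latency))
        else:                                        # wrong or slower: revert
            revert_edit()
    return best_latency, kept
```

At 90 seconds per iteration, the 8-hour budget above yields the ~320 overnight experiments mentioned in the text.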
Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.
```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
    --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py
```
Spin up Claude, Codex, or any coding agent in this directory and give it this prompt:

```
Read program.md and let's kick off a new experiment. Start with setup.
```
The agent will:
- Profile your model and present the optimization plan
- Create a branch (e.g., `autokernel/mar10-llama7b`)
- Optimize each bottleneck kernel in priority order
- Verify end-to-end correctness and report total speedup
program.md is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl’s law reasoning.
```
                profile.py       extract.py            bench.py (loop)    verify.py
Any PyTorch ──> Rank kernels ──> Generate baseline ──> Optimize each ──>  End-to-end
model           by GPU time      Triton kernels        kernel (agent)     verification
```
| Tool | What it does |
|---|---|
| `profile.py` | Profiles any PyTorch model with `torch.profiler`, ranks kernels by GPU time, classifies each as compute- or memory-bound |
| `extract.py` | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| `orchestrate.py` | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress |
| `bench.py` | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline analysis |
| `verify.py` | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |
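The compute/memory-bound classification is a roofline comparison: a kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine's ridge point. A minimal sketch, where the default peak numbers are rough H100 SXM datasheet figures, not values the repo necessarily uses:

```python
def classify(flops, bytes_moved, peak_tflops=989.0, peak_gbps=3350.0):
    """Classify a kernel as compute- or memory-bound via a simple roofline model.

    Compares arithmetic intensity (FLOPs per byte moved) to the ridge point
    (peak FLOP/s divided by peak bandwidth); above the ridge, more bandwidth
    won't help and the kernel is compute-bound.
    """
    intensity = flops / bytes_moved                    # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (peak_gbps * 1e9)   # ridge point, FLOPs/byte
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

For example, a 4096³ fp16 matmul (2·4096³ FLOPs over three 4096×4096 fp16 matrices) lands well above the ridge, while an elementwise op at a few FLOPs per byte lands far below it.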
9 kernel types covering the core operations of modern deep learning:
| Kernel | Description | Key Metric |
|---|---|---|
| matmul | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross entropy loss | GB/s |
| rotary_embedding | Rotary position embeddings (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |
Each has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
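For illustration, the "numerically stable" softmax variant looks like this in NumPy. This is a sketch of the reference semantics only, not the repo's reference.py code:

```python
import numpy as np

def softmax_reference(x):
    """Row-wise numerically stable softmax: subtract each row's max before
    exponentiating so large logits don't overflow, then normalize the row."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max(axis=-1, keepdims=True)   # max trick: exp args <= 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)
```

Without the max subtraction, logits around 1000 would overflow `exp` to infinity; with it, every exponent is at most zero.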
Self-contained model definitions ship with AutoKernel (no transformers library needed):
| Model | File | Params | Usage |
|---|---|---|---|
| GPT-2 Small | `models/gpt2.py` | 124M | `--class-name GPT2 --input-shape 1,1024` |
| LLaMA (compact) | `models/llama_7b.py` | 160M | `--class-name LlamaModel --input-shape 1,512` |
| LLaMA 7B | `models/llama_7b.py` | 7B | `--class-name LlamaModel7B --input-shape 1,2048` |
| BERT-base | `models/bert_base.py` | 110M | `--class-name BertModel --input-shape 8,512` |
| Custom | `models/custom.py` | — | Template for your own model |
For HuggingFace models (`uv sync --extra models`):

```bash
uv run profile.py --module transformers --class-name AutoModelForCausalLM \
    --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
```
```
autokernel/
  kernel.py        the file the agent modifies (one kernel at a time)
  program.md       agent instructions -- the "research org code"
  bench.py         fixed benchmark + 5-stage correctness harness
  reference.py     PyTorch reference implementations (ground truth)
  prepare.py       one-time setup: test data, baselines
  profile.py       profile any PyTorch model, rank kernels by GPU time
  extract.py       extract bottleneck kernels into workspace/
  orchestrate.py   multi-kernel scheduler (Amdahl's law)
  verify.py        end-to-end model verification + speedup report
  analysis.py      experiment visualization (generates progress.png)
  kernels/         starter Triton kernels (9 types)
  models/          self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/       runtime artifacts (gitignored)
```
Why Triton. Readable Python-like syntax the agent can understand and modify without mastering inline PTX or SASS. Well-tuned Triton regularly reaches 80-95% of cuBLAS. The agent needs to iterate fast — Triton compiles in seconds, not minutes.
Correctness first. The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from “optimizing” by producing garbage.
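A sketch of that gate, using NumPy in place of torch. The real harness compares against the PyTorch references in reference.py; the function name and tolerances here are illustrative:

```python
import numpy as np

def correctness_gate(candidate, reference, inputs, rtol=1e-2, atol=1e-3):
    """Run the candidate kernel against the reference before any timing.

    Returns False (i.e., revert) on shape mismatch, NaN/Inf output, or
    numeric drift, so a fast-but-wrong kernel never enters the results.
    """
    out = candidate(*inputs)
    ref = reference(*inputs)
    if out.shape != ref.shape:
        return False
    if not np.all(np.isfinite(out)):
        return False
    return np.allclose(out, ref, rtol=rtol, atol=atol)
```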
Amdahl’s law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.
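The arithmetic behind that prioritization is just Amdahl's law, which reduces to a one-line helper (a sketch; the function name is mine, not the repo's):

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of total GPU time gets a
    `kernel_speedup`x improvement and the remaining time is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)
```

Plugging in the numbers from the text: a 1.5x speedup on a 60% kernel gives 1/(0.4 + 0.4) = 1.25x end-to-end, while a 3x speedup on a 5% kernel gives only about 1.03x.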
Single file to modify. The agent only touches kernel.py. Scope stays manageable, diffs reviewable, reverts clean.
TSV logging. Results go to a plain results.tsv file. Human-readable, git-friendly, trivially parseable, no infrastructure.
Every experiment is logged to results.tsv (tab-separated):
| Column | Description |
|---|---|
experiment |
Sequential experiment number (0 = baseline) |
tag |
Short identifier |
kernel_type |
Which kernel (e.g., matmul) |
throughput_tflops |
Measured throughput (higher is better) |
latency_us |
Execution time in microseconds |
pct_peak |
Percentage of GPU theoretical peak |
speedup_vs_pytorch |
Speedup vs PyTorch/cuBLAS |
correctness |
PASS, FAIL, TIMEOUT, or CRASH |
peak_vram_mb |
Peak GPU memory usage |
description |
What was tried |
This project is autoresearch for GPU kernels — directly inspired by Andrej Karpathy’s autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent can run hundreds of experiments overnight, methodically exploring a search space and logging every result. AutoKernel applies that same loop — agent edits one file, runs a fixed evaluation, keeps or reverts — to the domain of GPU kernel optimization with Triton.
Built by the team behind Forge.
MIT
