IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD
and a scalar fallback. Based on the algorithm from:
Zhihao Zhang, Lanzheng Liu, Chen Chen, Huiba Li, Jiwu Shu, Windsor Hsu,
Yiming Zhang.
PlanB: Efficient Software IPv6 Lookup with Linearized B+-Tree.
NSDI ’26. arXiv:2604.14650.
Author’s reference code: https://github.com/nicexlab/planb.
This repository is an independent, clean-room reimplementation of the
algorithm as a portable, MIT-licensed C++17 library with extras that were
not part of the reference:
- Portable header-only core (`include/lpm6.hpp`) that compiles without
  AVX-512 and transparently falls back to a scalar path.
- Dynamic FIB (`lpm6::Dynamic`) using the paper’s rebuild-and-swap model
  with `std::atomic<std::shared_ptr>`; lookups are wait-free.
- Python bindings via pybind11 (`pip install -e .`).
- Correctness tests (`tests/test_correctness.cpp`) that verify the tree
  against a brute-force LPM reference on synthetic FIBs.
- Sample FIB/trace generator (`examples/generate_sample.py`).
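The brute-force oracle the correctness tests compare against is conceptually tiny. Below is a Python sketch of the same idea (illustrative only — `brute_force_lpm` is a name invented here, not the repo’s C++ test code): scan every FIB entry and keep the longest prefix that covers the address.

```python
import ipaddress

def brute_force_lpm(fib, addr):
    """Reference LPM: O(N) scan over (prefix, length, next_hop) entries,
    keeping the longest prefix that covers addr. Slow but trivially
    correct -- the oracle any fast lookup structure is checked against."""
    best_len, best_hop = -1, None
    a = int(ipaddress.IPv6Address(addr))
    for prefix, plen, hop in fib:
        p = int(ipaddress.IPv6Address(prefix))
        # Mask of the top plen bits of a 128-bit address (0 for the default route).
        mask = ((1 << plen) - 1) << (128 - plen) if plen else 0
        if (a & mask) == (p & mask) and plen > best_len:
            best_len, best_hop = plen, hop
    return best_hop

fib = [("2001:db8::", 32, 10), ("2001:db8:1::", 48, 11), ("::", 0, 1)]
print(brute_force_lpm(fib, "2001:db8:1::5"))  # the /48 is the longest match -> 11
print(brute_force_lpm(fib, "fe80::1"))        # only the default /0 covers it -> 1
```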
The PlanB paper is a solid engineering idea, but the reference code on
GitHub is not something you can drop into another project: it is Linux-
and AVX-512-only, has no license, no Python layer, and no dynamic-FIB
path. planb-lpm re-expresses the algorithm from the paper as a
portable, tested, and permissively licensed library so the technique is
actually usable in research, teaching, and production software.
- Research / simulation — drop-in fast LPM backend for ns-3 or a
  custom packet-level simulator, or a reference to compare newer LPM
  structures against.
- Control-plane tooling — SDN controllers and route-monitoring
  services that need to answer “which prefix covers this address?” over
  a large FIB without pulling in a full routing stack.
- Network analysis — IPv6 scanners, traffic classifiers, and
  log-enrichment pipelines that tag flows with their covering prefix.
- Teaching — a compact, readable implementation of a modern SIMD
  lookup structure suitable for networking and computer-architecture
  courses.
```
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
```
Options:
| CMake option | Default | Effect |
|---|---|---|
| `LPM6_BUILD_PYTHON` | `OFF` | Build the pybind11 extension |
AVX-512 is auto-detected via `check_cxx_compiler_flag`; on CPUs or compilers
without support, the scalar path is used.
```
python3 examples/generate_sample.py 100000 1000000 42
./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20
```
- Synthetic FIB, 100 000 prefixes. Prefix length distribution is
  weighted to match the RIPE rrc00-25 table (roughly /48 45%, /32 11%,
  /40 10%, /44 10%, /36 4%, others), generated by
  `examples/generate_sample.py`.
- Synthetic trace, 1 M uniform-random 64-bit addresses (the upper half of an
  IPv6 address, which is all that drives /0../64 LPM).
- Each benchmark does one warmup pass (discarded) and 20 timed passes.
  We report min / q1 / median / q3 / max + IQR across the 20 runs so
  variance is visible, not hidden under an average.
- Pin to a single core with `taskset -c` to cut scheduler noise.
  The binaries print the CPU they ran on, the affinity-mask width, the
  `cpufreq` governor, and the Intel turbo state so numbers are
  reproducible. (On WSL the `/sys` probes are unavailable and those
  lines are silently omitted.)
- Hardware: laptop Intel i5-1035G7 (Ice Lake, 4C/8T, AVX-512),
  Ubuntu 24.04 on WSL, GCC 13.3, `-O3`. This is a consumer CPU; the
  paper’s numbers are from a 24-core Xeon 8331C and a Zen5 Ryzen 9 370HX.
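The min / q1 / median / q3 / max + IQR reporting is plain order statistics. A minimal Python sketch (the `five_number_summary` helper is invented here for illustration, not part of the repo’s `bench_stats.hpp`):

```python
import statistics

def five_number_summary(runs):
    """Summarize benchmark passes (e.g. MLPS per timed run) as
    min / q1 / median / q3 / max plus IQR, so run-to-run variance
    is visible instead of being hidden under a single average."""
    q1, med, q3 = statistics.quantiles(runs, n=4)  # three quartile cut points
    return {"min": min(runs), "q1": q1, "median": med,
            "q3": q3, "max": max(runs), "iqr": q3 - q1}

runs = [34, 45, 48, 51, 62]  # throughput of five hypothetical passes
print(five_number_summary(runs))
```

Note that `statistics.quantiles` defaults to the exclusive method, so the q1/q3 values interpolate between samples for small run counts.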
`taskset -c 2 ./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20` (20 timed runs + 1 warmup; batch paths use the real public API returning next-hops):
| Path | min | q1 | median | q3 | max | IQR |
|---|---|---|---|---|---|---|
| `single` | 34 | 45 | 48 | 51 | 62 | 6.6 |
| `batch<1>` | 32 | 44 | 47 | 51 | 60 | 7.6 |
| `batch<2>` | 36 | 48 | 54 | 58 | 68 | 10.4 |
| `batch<4>` | 42 | 60 | 66 | 71 | 84 | 11.3 |
| `batch<8>` | 53 | 66 | 73 | 81 | 101 | 13.4 |
| `batch<16>` | 43 | 57 | 62 | 66 | 78 | 9.1 |
| `batch<32>` | 41 | 70 | 76 | 83 | 94 | 13.4 |
Units: MLPS; medians from a rolling average of four 20-run sweeps.
Build time for that FIB is ~25 ms, tree depth 6, 531 k keys across
200 k edges.
Observations:

- `batch<1>` ≈ `single` — the batched entry point adds no overhead
  at M=1, confirming the dispatch cost is negligible.
- Sweet spot at M=8: ~1.5× single-core throughput over `single`.
- M=32 is essentially flat; going further would need larger-batch
  unrolling and register-allocation work.
- The M=16 dip is real and reproducible across runs — likely register
  pressure at the `#pragma GCC unroll 16` boundary interacting with the
  7-level traversal. A known item to investigate.
The paper’s ablation reports batching being worth 3–4.5× on AMD Zen5; we see ~1.5×
on Ice Lake. Part of the gap is the paper’s `lookup_batch_checksum`-style
fast path (our binary uses the public API that also writes next-hops);
part is AVX-512 sustained-frequency differences between Ice Lake and
Zen5. The paper reports 191–197 MLPS single-core on Xeon and 374–393
MLPS on Zen5 — a real server should land near those numbers.
## Against a conventional baseline
The `bench_naive` binary builds a Patricia radix trie (`tests/patricia.hpp`)
on the same FIB and runs the same trace through it. Patricia is a
standard path-compressed binary trie — representative of the
pointer-chasing LPM structures PlanB is designed to replace.

`taskset -c 2 ./build/bench_naive …` (same 20-run discipline):
| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| `lpm6::Tree` | 23 | 50 | 63 | 8.2 |
| Patricia trie | 2.4 | 2.4 | 2.6 | 0.06 |
Units: MLPS. Tree-vs-Patricia ≈ 20× on the median, bracketing the
upper end of the paper’s reported 1.6–14× advantage over PopTrie,
CP-Trie, Neurotrie, and HBS. Patricia isn’t one of those baselines —
it’s a conventional-radix-trie stand-in — so the exact multiplier will
shift when we benchmark against the paper’s actual algorithms (on the
roadmap, see plan.md). The binary also prints a linear-scan number
on a 50 000-address slice; that is O(N) vs. O(log N) and is only a
sanity check, not a comparison.
Same 100 000-prefix workload. `footprint_bytes()` counts the
algorithmic cost (our flat key + next-hop arrays for the tree; one
`Node` struct per Patricia node); RSS delta is the process-level
`VmRSS` change across the build and captures allocator overhead:

| Implementation | Algorithmic | RSS delta | Per-prefix |
|---|---|---|---|
| `lpm6::Tree` | 4.82 MB | 5.05 MB | ~50 B/prefix |
| Patricia trie | 6.04 MB | 9.02 MB | ~90 B/prefix |
At 100 K, the tree’s flat cache-aligned layout costs roughly half the
RSS of the pointer-chasing Patricia. This does not hold at scale —
see FIB-size sweep below: the linearized B+-tree grows in discrete
depth steps (9^depth buckets), so its footprint is bounded from
below by depth, not prefix count. Once depth transitions 6→7 (around
250 K prefixes) the tree’s algorithmic size jumps to ~38 MB and stays
there until depth 7 fills up. Patricia scales linearly with prefixes,
so at 250 K the tree is actually larger than Patricia. Paper
reports 56–92% memory reductions vs PopTrie/CP-Trie/Neurotrie/HBS;
those are heavier than plain Patricia, so the paper’s advantage likely
still holds against them even at our 250 K crossover.
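The depth-step footprint is easy to sanity-check arithmetically. A full 9-ary linearized tree of depth d with 8 keys per node has (9^d − 1)/8 nodes, i.e. 9^d − 1 key slots; at 8 bytes per key this reproduces the 531 K and 4.78 M sentinel-padded key counts reported elsewhere in this README, and approximates the 38.4 MB depth-7 footprint (the measured numbers also include next-hop arrays). A throwaway check:

```python
def tree_key_slots(depth):
    # A full 9-ary tree of this depth has (9**depth - 1) / 8 nodes;
    # with 8 key slots per 64-byte node that is 9**depth - 1 slots total.
    return 9 ** depth - 1

for d in (6, 7):
    slots = tree_key_slots(d)
    print(f"depth {d}: {slots:>9,} key slots, ~{slots * 8 / 1e6:.1f} MB of 8-byte keys")
```

This gives ~4.3 MB of raw keys at depth 6 and ~38.3 MB at depth 7 — the discrete jump behind the 250 K memory crossover.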
`examples/run_sweep.sh` drives `planb-lpm`, `bench_update`, and
`bench_naive` across five FIB sizes sharing one 1 M-address trace,
same 20-run discipline, pinned to core 2. Numbers are medians of
the timed runs; throughput units are MLPS:
| FIB | depth | single | batch<8> | batch<32> | rebuild | tree alg. | Patricia alg. |
|---|---|---|---|---|---|---|---|
| 10 K | 5 | 94 | 165 | 148 | 1.5 ms | 0.53 MB | 0.61 MB |
| 100 K | 6 | 46 | 69 | 75 | 17.7 ms | 4.82 MB | 6.04 MB |
| 250 K | 7 | 20 | 29 | 41 | 51.8 ms | 38.40 MB | 15.00 MB |
| 500 K | 7 | 12 | 18 | 27 | 107.9 ms | n/m | n/m |
| 1 M | 7 | 10 | 14 | 22 | 213.4 ms | n/m | n/m |
(n/m = not measured; Patricia allocation and the brute-force sanity
check become impractical at 500 K+ on a laptop, so the sweep script
only runs them for ≤250 K. Tree footprint at 500 K / 1 M is dominated
by the same depth-7 bucket skeleton as 250 K and grows only with the
filled-leaf count; we haven’t added a separate probe yet.)
Observations:

- Throughput falls as the FIB exceeds L3. The laptop has ~6 MB
  L3; the tree’s flat footprint is below that at 100 K (4.8 MB) and
  above it from 250 K onward. Single-lookup throughput drops 4.6×
  going from 10 K → 100 K and another 2.3× going from 100 K → 250 K,
  dominated by memory traffic once we spill out of L3.
- Batching helps most at scale. At 10 K, `batch<8>` beats
  `batch<32>` (165 vs 148 MLPS) — everything is in L1/L2 and large
  unrolls just add register pressure. At 1 M, `batch<32>` is the
  best path (22 vs 14 MLPS for `batch<8>`); the long-latency DRAM fetches
  benefit from more in-flight loads.
- Depth transitions matter. Depth goes 5→6 between 10 K and
  100 K, and 6→7 between 100 K and 250 K. Each extra level is one
  more cache-line indirection per lookup and bumps the sentinel-padded
  key count (531 K → 4.78 M going 6→7), which drives the tree-above-
  Patricia memory crossover.
- Rebuild scales roughly linearly: 1.5 / 17.7 / 51.8 / 107.9 /
  213.4 ms for 10 K / 100 K / 250 K / 500 K / 1 M. The paper’s Zen5
  number is 850 ms for 1 M, so our Ice Lake laptop is actually ~4×
  faster on the rebuild path (likely because our tree build elides
  some of the preprocessing the paper bundles into rebuild time).
Dataset caveat: these are synthetic FIBs with a RIPE-weighted length
distribution, not real BGP tables. Real BGP has stronger
prefix-length clustering and spatial locality in advertised ranges,
both of which change cache behavior — see the next section for a
reproduction on an actual RIPE RIS RIB.
## Real BGP reproduction (RIPE RIS rrc00)
`examples/mrt_to_fib.py` parses a TABLE_DUMP_V2 / RIB_IPV6_UNICAST
MRT dump (RFC 6396) into a planb-lpm FIB. On the rrc00 full-table
`latest-bview.gz` of 2026-04-19 we get 254 197 unique IPv6
prefixes — right in the range (≈235 K) the paper cites for rrc00 in
its evaluation. Same 1 M uniform-random 64-bit trace, same 20-run
discipline, pinned to core 2:
| Path | min | median | max | IQR |
|---|---|---|---|---|
| `single` | 48.7 | 67.5 | 74.4 | 3.6 |
| `batch<8>` | 115.3 | 137.2 | 151.0 | 8.7 |
| `batch<32>` | 100.5 | 118.6 | 137.1 | 8.3 |
Units: MLPS. Tree depth 7, 4.78 M sentinel-padded keys, 38.4 MB
algorithmic footprint (same skeleton as the synthetic 250 K row).
Build is 63 ms; a full rebuild under `bench_update` is median 40 ms
(paper: 850 ms/1 M on Zen5 → ~213 ms at 250 K scale; our Ice Lake laptop
at 40 ms is ~5× faster — likely because the paper bundles more
preprocessing into the rebuild number).
Throughput is roughly 3× higher on real BGP than on the same-size
synthetic (67 vs 20 MLPS single). The synthetic distribution is
length-weighted but picks each prefix’s address uniformly; real BGP
concentrates advertisements in a small fraction of the address space,
so a uniform-random lookup trace against a real RIB tends to hit the
same hot tree paths and stays L2-resident. This is the opposite of
the “real data is worse” result we expected — it mainly tells
us our trace is the limiting factor, not the FIB.
Against real BGP the Patricia baseline gets faster too — by more:
| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| `lpm6::Tree` | 43.5 | 65.3 | 75.4 | 5.1 |
| Patricia trie | 51.2 | 95.7 | 110.9 | 24.2 |
Patricia jumps from 2.4 MLPS on the 100 K synthetic to 95 MLPS on
real BGP — 40× better on the same machine, and now ahead of the
tree. Two things are going on:
- Uniform 64-bit lookups bail out early in Patricia on a BGP table.
  Most random upper-64-bit addresses fall into a handful of heavily
  covered root regions (think /6, /8, /12) — Patricia returns after
  1–2 pointer chases, staying in L1. The tree always does 7
  cache-line fetches no matter which prefix ultimately wins.
- Our Patricia is a stand-in, not one of the paper’s baselines.
  The paper’s 1.6–14× advantage is over PopTrie / CP-Trie /
  Neurotrie / HBS — those structures add per-node SIMD and bitmap
  acceleration that a plain pointer trie doesn’t have; on uniform
  lookups they pay more than our Patricia does. The gap to a
  genuine paper baseline is almost certainly larger than our gap to
  Patricia.
The honest conclusion: on this hardware, on a real BGP table, against
a uniform 64-bit trace, a plain Patricia trie is roughly at parity
with the tree. The tree’s advantage shows up with non-uniform
traces (real packet captures concentrate on long prefixes where
Patricia walks deep) or with the paper’s actual baselines. Both are
on the roadmap.
`bench_mt` runs `batch<8>` across T threads, each pinned to a distinct
logical CPU, all sharing one immutable `lpm6::Tree`. Threads release
in lockstep per run so L3 / DRAM contention is measured, not dodged.
Same 20-run discipline. Laptop: 4 physical cores / 8 logical (SMT):
| FIB | T | per-thread min | per-thread max | aggregate MLPS | efficiency |
|---|---|---|---|---|---|
| BGP 254 K | 1 | 95 | 140 | 130 | 1.00× |
| BGP 254 K | 2 | 95 | 143 | 278 | 1.07× |
| BGP 254 K | 4 | 68 | 142 | 396 | 0.76× |
| BGP 254 K | 8 | 17 | 98 | 626 | 0.60× |
| synth 1 M | 1 | 12 | 18 | 15 | 1.00× |
| synth 1 M | 2 | 13 | 18 | 31 | 1.02× |
| synth 1 M | 4 | 7 | 16 | 52 | 0.85× |
| synth 1 M | 8 | 3 | 15 | 85 | 0.69× |
Efficiency = aggregate at T ÷ (T × aggregate at T=1). Units MLPS.
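Applying that efficiency formula to the BGP rows of the table (a throwaway sketch; the inputs are the aggregate-MLPS column) reproduces the reported 1.07× / 0.76× / 0.60× values:

```python
def scaling_efficiency(aggregate_by_t):
    """Parallel efficiency: aggregate throughput at T threads divided by
    T copies of the single-thread aggregate (T=1 is 1.0 by definition)."""
    base = aggregate_by_t[1]
    return {t: agg / (t * base) for t, agg in aggregate_by_t.items()}

bgp = {1: 130, 2: 278, 4: 396, 8: 626}   # aggregate MLPS, BGP 254 K rows
eff = scaling_efficiency(bgp)
print({t: round(e, 2) for t, e in eff.items()})
```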
Observations:

- T=2 is super-linear on the BGP FIB (1.07×). At 1T, a single
  memory-bound lookup chain leaves functional units idle during DRAM
  waits; with 2T on 2 cores the hardware has more in-flight loads and
  can hide more of that latency. A similar effect is visible on synth.
- 4-of-4 physical cores is the cleanest scaling regime. At 4T the
  efficiency is ~0.76–0.85×, mainly limited by shared L3 once the hot
  set no longer fits per-core.
- 8T uses SMT (hyperthreading). Per-thread throughput collapses
  (the two sibling threads contend for L1/L2 and AVX-512 port
  bandwidth on the same physical core), but aggregate still rises
  because the second sibling occupies stall cycles. Efficiency vs the
  single-core baseline drops to 0.60–0.69×.
- Real BGP scales relatively worse than synth 1 M. BGP’s hot
  set at T=1 already hits ~130 MLPS and stays in L2; more threads push
  it out and each one pays the L3 tax. Synth-1M is DRAM-bound even at
  T=1 (footprint 38 MB ≫ L3), so adding threads is cheaper in relative
  terms — the DRAM subsystem was the bottleneck anyway.
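The “release in lockstep” discipline amounts to parking every worker on a barrier so all timed sections start simultaneously and contention is actually exercised. A hypothetical Python stand-in for what the C++ harness does (`run_lockstep` and the toy workload are invented here for illustration):

```python
import threading, time

def run_lockstep(worker, n_threads):
    """Start all threads, then release them through a barrier so every
    thread begins its timed section at the same instant -- shared-cache
    and DRAM contention is measured rather than staggered away."""
    barrier = threading.Barrier(n_threads)
    results = [None] * n_threads

    def wrapped(i):
        barrier.wait()                      # block until every thread has arrived
        t0 = time.perf_counter()
        value = worker(i)                   # the per-thread timed work
        results[i] = (value, time.perf_counter() - t0)

    threads = [threading.Thread(target=wrapped, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Toy CPU-bound workload standing in for a batch-lookup slice.
out = run_lockstep(lambda i: sum(range(100_000)), 4)
print([v for v, _ in out])
```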
Paper reports 3.4 BLPS on 12 Xeon cores (§5.4). Our 8-thread laptop
(4 real cores + SMT) hits 0.63 BLPS on the same-scale BGP table —
roughly 5× below the paper, which tracks with 12 vs 4 real cores
plus Xeon’s wider L2 / higher sustained AVX-512 frequency. A proper
comparison needs a server-class box; see Phase 2.6 in plan.md.
PlanB’s update model is batched: N pending prefix changes are coalesced
into one full rebuild (§3.6 of the paper). `bench_update` measures
rebuild time and per-op latency on the 100 000-prefix FIB:

```
tree rebuild                   : min 17.2  med 18.8  max 25.0 ms  (IQR 1.2)
dynamic insert (incl. rebuild) : med 14.4 ms
amortized if coalesced per 1 K : ~14 µs / update
amortized if coalesced per 10 K: ~1.4 µs / update
```
The paper reports 850 ms to rebuild a 1 M-prefix FIB on an AMD Zen5
server; linear scaling gives ~85 ms for our 100 K setup and we see
~19 ms on Ice Lake, so the rebuild path is at parity within hardware
noise.
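The amortized figures follow directly from dividing one rebuild (here the median 14.4 ms dynamic insert, which includes the rebuild) by the coalescing batch size:

```python
def amortized_update_us(rebuild_ms, batch_size):
    """Cost per route change when batch_size pending updates are coalesced
    into one rebuild-and-swap, expressed in microseconds per update."""
    return rebuild_ms * 1000.0 / batch_size

for n in (1_000, 10_000):
    print(f"coalesce per {n:>6}: {amortized_update_us(14.4, n):.1f} us/update")
```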
## Limitations — what this is not
- Not on the paper’s hardware. Laptop Ice Lake (4 physical cores,
  ~6 MB L3) vs. the paper’s 12-core Xeon 8331C / Zen5 server — a ≈5–15×
  raw single-core throughput gap before the algorithm even enters, and a
  ~3× core-count gap for multi-core aggregate. A server-class run on GCP
  `c2-standard-16` (Cascade Lake, AVX-512) is on the roadmap
  (Phase 2.6 in `plan.md`).
- Not the paper’s baselines. Patricia is a stand-in conventional
  pointer-chasing trie; we do not implement PopTrie, CP-Trie,
  Neurotrie, or HBS. PopTrie is the next item in `plan.md`
  (Phase 2.7); the others are tracked but not scheduled.
- Synthetic FIBs are length-weighted, not address-distributed like
  real BGP. The “Real BGP reproduction” section above shows how
  much this matters: throughput on a real `rrc00` table is ~3× higher
  than on a same-size synthetic for uniform 64-bit traces, and
  Patricia’s relative position against the tree flips entirely. Both
  the synthetic and the real-BGP runs use a uniform-random lookup
  trace — real packet captures concentrate on long prefixes, which
  would penalize Patricia further and is the regime where the
  linearized tree’s fixed 7-cache-line cost looks best.
Treat these numbers as a reproducibility check that the algorithm
behaves as the paper predicts on commodity hardware, not as a
competitive evaluation.
```
pip install pybind11 scikit-build-core
pip install -e .
python examples/basic_usage.py
```
```python
from planb_lpm import Tree, Dynamic

t = Tree()
t.build([("2001:db8::", 32, 10), ("fe80::", 10, 20)])
t.lookup("2001:db8::1")  # -> 10

d = Dynamic()
d.load([("::", 0, 1)])
d.insert("2001:db8::", 32, 7)
d.lookup("2001:db8::abcd")  # -> 7
```
- Expand each `prefix/len` into a `[start, end)` interval on the upper
  64 bits of the IPv6 address space.
- Sort the 2·N boundary points; resolve prefix nesting with a stack so that
  each elementary interval knows its active next-hop.
- Pack the sorted boundaries into the leaves of a 9-ary linearized B+-tree
  (8 keys per node, 64-byte cache-line aligned). Internal nodes hold
  separator keys copied from the leaves at strides of `9^L · 8`.
- Lookup is a pure predecessor search: at each internal node use a single
  AVX-512 `vpcmpuq` + `popcnt` to pick the child, then do the same at the
  leaf to locate the predecessor edge.
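The steps above can be sketched in pure Python, with `bisect` standing in for the SIMD tree descent as the predecessor search. This is illustrative only — `build_edges` and `lookup` are names invented here, and it assumes prefixes of at most /64 (the /0../64 scope the benchmarks use):

```python
import bisect
import ipaddress

def build_edges(fib):
    """Expand each (prefix, len, hop) into a [start, end) interval on the
    upper 64 bits, then sweep the sorted boundary points with a stack so
    every elementary interval knows its innermost (longest) covering prefix."""
    events = []
    for prefix, plen, hop in fib:
        start = int(ipaddress.IPv6Address(prefix)) >> 64   # upper 64 bits
        end = start + (1 << (64 - plen))                   # assumes plen <= 64
        events.append((start, 0, plen, hop))               # 0 = open
        events.append((end, 1, plen, hop))                 # 1 = close
    events.sort()
    edges, hops, stack = [], [], []                        # stack of (plen, hop)
    for point, kind, plen, hop in events:
        if kind == 0:
            stack.append((plen, hop))
            stack.sort()                                   # longest plen on top
        else:
            stack.remove((plen, hop))
        edges.append(point)
        hops.append(stack[-1][1] if stack else None)       # active next-hop
    return edges, hops

def lookup(edges, hops, addr):
    """Predecessor search over the edge array -- the role the linearized
    B+-tree plays with vpcmpuq + popcnt in the real implementation."""
    a = int(ipaddress.IPv6Address(addr)) >> 64
    i = bisect.bisect_right(edges, a) - 1
    return hops[i] if i >= 0 else None

fib = [("2001:db8::", 32, 10), ("2001:db8:1::", 48, 11), ("::", 0, 1)]
edges, hops = build_edges(fib)
print(lookup(edges, hops, "2001:db8:1::5"))   # nested /48 wins over the /32
print(lookup(edges, hops, "fe80::1"))         # falls through to the default route
```

The stack-based sweep is what turns overlapping prefixes into disjoint elementary intervals, so the lookup needs only a single predecessor search rather than a longest-match walk.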
```
include/lpm6.hpp             core header-only library
src/main.cpp                 CLI benchmark
src/ipv6_util.hpp            FIB/trace loaders
tests/test_correctness.cpp   brute-force vs tree vs patricia
tests/patricia.hpp           baseline path-compressed radix trie (benchmark only)
tests/bench_naive.cpp        three-way throughput: naive / patricia / tree
tests/bench_update.cpp       rebuild + dynamic-update latency
tests/bench_mt.cpp           multi-core scaling (shared const tree, per-thread slice)
tests/bench_stats.hpp        warmup / percentile / env-probe helpers
examples/mrt_to_fib.py       RFC 6396 TABLE_DUMP_V2 → FIB converter for RIPE RIS
python/binding.cpp           pybind11 module
examples/                    sample data + usage scripts
```
All files in this repository are original work under the MIT license — see
LICENSE.md. The author’s reference implementation at
https://github.com/nicexlab/planb is not vendored; see the paper cited
above for the algorithm itself.
The 0.1.0 tree has been built and tested in the following configurations on
Ubuntu 24.04 / GCC 13.3 (Intel Ice Lake):
| Configuration | Result |
|---|---|
| Release, AVX-512 auto-detect | ctest 1/1, bench OK |
| Release, AVX-512 forced off (scalar fallback) | ctest 1/1, 0 zmm instructions in binary |
| Debug + `-fsanitize=address,undefined` | ctest 1/1, no leaks / UB |
| `-Wall -Wextra -Wpedantic -Wshadow -Wconversion -Werror` | Clean build |
| `pip install -e .` in a fresh venv + `examples/basic_usage.py` | All lookups correct |
The core algorithm is from the PlanB paper by Zhang et al. (see top of file);
all source in this repository is an independent reimplementation.
Contributions of any size are welcome — bug reports, benchmark results on
new hardware, alternative SIMD backends, language bindings, documentation
fixes. See CONTRIBUTING for the dev workflow,
commit conventions, and how to run the full verification suite (scalar
fallback, ASAN/UBSAN, `-Werror`, `pip install -e .`) before opening a
pull request.