esutcu/planb-lpm

IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD
and a scalar fallback. Based on the algorithm from:

Zhihao Zhang, Lanzheng Liu, Chen Chen, Huiba Li, Jiwu Shu, Windsor Hsu,
Yiming Zhang.
PlanB: Efficient Software IPv6 Lookup with Linearized B+-Tree.
NSDI ’26. arXiv:2604.14650.
Author’s reference code: https://github.com/nicexlab/planb.

This repository is an independent, clean-room reimplementation of the
algorithm as a portable, MIT-licensed C++17 library with extras that were
not part of the reference:

  • Portable header-only core (include/lpm6.hpp) that compiles without
    AVX-512 and transparently falls back to a scalar path.
  • Dynamic FIB (lpm6::Dynamic) using the paper’s rebuild-and-swap model
    with an atomically swapped std::shared_ptr; lookups are wait-free.
  • Python bindings via pybind11 (pip install -e .).
  • Correctness tests (tests/test_correctness.cpp) that verify the tree
    against a brute-force LPM reference on synthetic FIBs.
  • Sample FIB/trace generator (examples/generate_sample.py).
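To make the brute-force check concrete, here is an illustrative Python sketch of the kind of oracle the correctness tests compare against (the repo’s actual reference is C++ in tests/test_correctness.cpp; the function name and FIB tuple layout here are assumptions):

```python
import ipaddress

def brute_force_lpm(fib, addr):
    """Linear scan over (prefix, plen, next_hop) entries; return the
    next-hop of the longest prefix covering addr, or None."""
    a = int(ipaddress.IPv6Address(addr))
    best_len, best_hop = -1, None
    for prefix, plen, hop in fib:
        p = int(ipaddress.IPv6Address(prefix))
        # Top `plen` bits of the 128-bit address must match the prefix.
        mask = ((1 << plen) - 1) << (128 - plen)
        if (a & mask) == (p & mask) and plen > best_len:
            best_len, best_hop = plen, hop
    return best_hop
```

A candidate structure can then be checked entry-by-entry against this O(N) oracle on a synthetic FIB and random trace.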

The PlanB paper is a solid engineering idea, but the reference code on
GitHub is not something you can drop into another project: it is Linux-
and AVX-512-only, has no license, no Python layer, and no dynamic-FIB
path. planb-lpm re-expresses the algorithm from the paper as a
portable, tested, and permissively licensed library so the technique is
actually usable in research, teaching, and production software.

Use cases:

  • Research / simulation — drop-in fast LPM backend for ns-3 or a
    custom packet-level simulator, or as a reference to compare newer LPM
    structures against.
  • Control-plane tooling — SDN controllers and route-monitoring
    services that need to answer “which prefix covers this address?” over
    a large FIB without pulling in a full routing stack.
  • Network analysis — IPv6 scanners, traffic classifiers, and
    log-enrichment pipelines that tag flows with their covering prefix.
  • Teaching — a compact, readable implementation of a modern SIMD
    lookup structure suitable for networking and computer-architecture
    courses.
Build and test:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Options:

| CMake option | Default | Effect |
|---|---|---|
| LPM6_BUILD_PYTHON | OFF | Build the pybind11 extension |

AVX-512 is auto-detected via check_cxx_compiler_flag; on CPUs or compilers
without support, the scalar path is used.

Running the benchmark:

python3 examples/generate_sample.py 100000 1000000 42
./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20
  • Synthetic FIB, 100 000 prefixes. Prefix length distribution is
    weighted to match the RIPE rrc00-25 table (roughly /48 45%, /32 11%,
    /40 10%, /44 10%, /36 4%, others), generated by
    examples/generate_sample.py.
  • Synthetic trace, 1 M uniform-random 64-bit addresses (upper half of an
    IPv6 address, which is all that drives /0../64 LPM).
  • Each benchmark does one warmup pass (discarded) and 20 timed passes.
    We report min / q1 / median / q3 / max + IQR across the 20 runs so
    variance is visible, not hidden under an average.
  • Pin to a single core with taskset -c to cut scheduler noise.
    The binaries print the CPU they ran on, the affinity-mask width, the
    cpufreq governor, and the Intel turbo state so numbers are
    reproducible. (On WSL the /sys probes are unavailable and those
    lines are silently omitted.)
  • Hardware: laptop Intel i5-1035G7 (Ice Lake, 4C/8T, AVX-512),
    Ubuntu 24.04 on WSL, GCC 13.3, -O3. This is a consumer CPU; the
    paper’s numbers are from a 24-core Xeon 8331C and a Zen5 Ryzen 9 370HX.
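The five-number-plus-IQR reporting is easy to reproduce. A Python sketch of the statistic (the repo’s actual helpers live in tests/bench_stats.hpp and are C++; `summarize` is a hypothetical name):

```python
from statistics import quantiles

def summarize(runs):
    """Five-number summary + IQR over per-run throughputs (e.g. MLPS),
    so run-to-run variance stays visible instead of averaged away."""
    s = sorted(runs)
    q1, med, q3 = quantiles(s, n=4, method="inclusive")
    return {"min": s[0], "q1": q1, "median": med,
            "q3": q3, "max": s[-1], "iqr": q3 - q1}
```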

taskset -c 2 ./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20

(20 timed runs + 1 warmup; the batch paths use the real public API returning next-hops):

| Path | min | q1 | median | q3 | max | IQR |
|---|---|---|---|---|---|---|
| single | 34 | 45 | 48 | 51 | 62 | 6.6 |
| batch<1> | 32 | 44 | 47 | 51 | 60 | 7.6 |
| batch<2> | 36 | 48 | 54 | 58 | 68 | 10.4 |
| batch<4> | 42 | 60 | 66 | 71 | 84 | 11.3 |
| batch<8> | 53 | 66 | 73 | 81 | 101 | 13.4 |
| batch<16> | 43 | 57 | 62 | 66 | 78 | 9.1 |
| batch<32> | 41 | 70 | 76 | 83 | 94 | 13.4 |

Units: MLPS; medians from a rolling average of four 20-run sweeps.
Build time for that FIB is ~25 ms, tree depth 6, 531 k keys across
200 k edges.

Observations:

  • batch<1> ≈ single — the batched entry point adds no overhead
    at M=1, confirming the dispatch cost is negligible.
  • Sweet spot at M=8: ~1.5× the single-lookup throughput.
    M=32 is essentially flat; going further would need larger-batch
    unrolling and register-allocation work.
  • M=16 dip is real and reproducible across runs — likely register
    pressure at the #pragma GCC unroll 16 boundary interacting with the
    7-level traversal. A known-to-investigate item.

The paper’s ablation reports batching as worth 3–4.5× on AMD Zen5; we
see ~1.5× on Ice Lake. Part of the gap is the paper’s lookup_batch_checksum-style
fast path (our binary uses the public API that also writes next-hops),
part is AVX-512 sustained frequency differences between Ice Lake and
Zen5. The paper reports 191–197 MLPS single-core on Xeon and 374–393
MLPS on Zen5 — a real server should land near those numbers.

Against a conventional baseline

The bench_naive binary builds a Patricia radix trie (tests/patricia.hpp)
on the same FIB and runs the same trace through it. Patricia is a
standard path-compressed binary trie — representative of the
pointer-chasing LPM structures PlanB is designed to replace.

taskset -c 2 ./build/bench_naive … (same 20-run discipline):

| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| lpm6::Tree | 23 | 50 | 63 | 8.2 |
| Patricia trie | 2.4 | 2.4 | 2.6 | 0.06 |

Units: MLPS. Tree-vs-Patricia ≈ 20× on the median, above even the
upper end of the paper’s reported 1.6–14× advantage over PopTrie,
CP-Trie, Neurotrie, and HBS. Patricia isn’t one of those baselines —
it’s a conventional-radix-trie stand-in — so the exact multiplier will
shift when we benchmark against the paper’s actual algorithms (on the
roadmap, see plan.md). The binary also prints a linear-scan number
on a 50 000-address slice; that is O(N) vs. O(log N) and is only a
sanity check, not a comparison.

Memory footprint, same 100 000-prefix workload. footprint_bytes() counts the
algorithmic cost (our flat key + next-hop arrays for the tree; one
Node struct per Patricia node); RSS delta is the process-level
VmRSS change across the build and captures allocator overhead:

| Implementation | Algorithmic | RSS delta | Per-prefix |
|---|---|---|---|
| lpm6::Tree | 4.82 MB | 5.05 MB | ~50 B/prefix |
| Patricia trie | 6.04 MB | 9.02 MB | ~90 B/prefix |

At 100 K, the tree’s flat cache-aligned layout costs roughly half the
RSS of the pointer-chasing Patricia. This does not hold at scale —
see FIB-size sweep below: the linearized B+-tree grows in discrete
depth steps (9^depth buckets), so its footprint is bounded from
below by depth, not prefix count. Once depth transitions 6→7 (around
250 K prefixes) the tree’s algorithmic size jumps to ~38 MB and stays
there until depth 7 fills up. Patricia scales linearly with prefixes,
so at 250 K the tree is actually larger than Patricia. The paper
reports 56–92% memory reductions vs PopTrie/CP-Trie/Neurotrie/HBS;
those structures are heavier than plain Patricia, so the paper’s
advantage likely still holds against them even at our 250 K crossover.
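The depth-step arithmetic follows directly from the 9-ary geometry: a node holds 8 keys and level L has 9^L nodes, so a full tree of depth d holds 9^d − 1 key slots. Assuming 8-byte keys (the upper-64-bit values), that reproduces both the 531 k key count at depth 6 and the ~38 MB jump at depth 7. A quick check:

```python
def tree_keys(depth):
    """Key slots in a full 9-ary linearized B+-tree: 8 keys per node,
    9**L nodes at level L, so the total telescopes to 9**depth - 1."""
    return sum(8 * 9**level for level in range(depth))

def key_mb(depth):
    # Key array alone at 8 bytes per key; the reported "algorithmic"
    # footprint also includes the next-hop arrays, so it sits slightly
    # above this number.
    return tree_keys(depth) * 8 / 2**20
```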

FIB-size sweep

examples/run_sweep.sh drives planb-lpm, bench_update, and
bench_naive across five FIB sizes sharing one 1 M-address trace,
same 20-run discipline, pinned to core 2. Numbers are medians of
the timed runs; throughput units are MLPS:

| FIB | depth | single | batch<8> | batch<32> | rebuild | tree alg. | Patricia alg. |
|---|---|---|---|---|---|---|---|
| 10 K | 5 | 94 | 165 | 148 | 1.5 ms | 0.53 MB | 0.61 MB |
| 100 K | 6 | 46 | 69 | 75 | 17.7 ms | 4.82 MB | 6.04 MB |
| 250 K | 7 | 20 | 29 | 41 | 51.8 ms | 38.40 MB | 15.00 MB |
| 500 K | 7 | 12 | 18 | 27 | 107.9 ms | n/m | n/m |
| 1 M | 7 | 10 | 14 | 22 | 213.4 ms | n/m | n/m |

(n/m = not measured; Patricia allocation and the brute-force sanity
check become impractical at 500 K+ on a laptop, so the sweep script
only runs them for ≤250 K. Tree footprint at 500 K / 1 M is dominated
by the same depth-7 bucket skeleton as 250 K and grows only with the
filled-leaf count; we haven’t added a separate probe yet.)

Observations:

  • Throughput falls as the FIB exceeds L3. The laptop has ~6 MB
    L3; the tree’s flat footprint is below that at 100 K (4.8 MB) and
    above it from 250 K onward. Single-lookup throughput drops ~2×
    going from 10 K → 100 K and another 2.3× going from 100 K → 250 K,
    dominated by memory traffic once we spill out of L3.
  • Batching helps most at scale. At 10 K, batch<8> beats
    batch<32> (165 vs 148 MLPS) — everything’s in L1/L2 and large
    unrolls just add register pressure. At 1 M, batch<32> is the
    best path (22 vs 14 for batch<8>); the long-latency DRAM fetches
    benefit from more in-flight loads.
  • Depth transitions matter. Depth goes 5→6 between 10 K and
    100 K, and 6→7 between 100 K and 250 K. Each extra level is one
    more cache-line indirection per lookup and bumps the sentinel-padded
    key count (531 K → 4.78 M going 6→7), which drives the tree-above-
    Patricia memory crossover.
  • Rebuild scales roughly linearly. 1.5 / 17.7 / 51.8 / 107.9 /
    213.4 ms for 10 K / 100 K / 250 K / 500 K / 1 M; the paper’s Zen5
    number is 850 ms for 1 M, so our Ice Lake laptop is actually ~4×
    faster on the rebuild path (likely because our tree build elides
    some of the preprocessing the paper bundles into rebuild time).
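The “roughly linearly” claim checks out when the sweep’s rebuild times are normalized to per-prefix cost (values taken from the table above):

```python
# Rebuild time per FIB size from the sweep table, normalized to
# microseconds per prefix.
sweep_ms = {10_000: 1.5, 100_000: 17.7, 250_000: 51.8,
            500_000: 107.9, 1_000_000: 213.4}
us_per_prefix = {n: ms * 1e3 / n for n, ms in sweep_ms.items()}
# Costs range from 0.15 to ~0.22 us/prefix, within ~1.5x of each other
# across a 100x size range, i.e. close to linear with a mild penalty
# as the working set grows.
```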

Dataset caveat: these are synthetic FIBs with a RIPE-weighted length
distribution, not real BGP tables. Real BGP has stronger
prefix-length clustering and spatial locality in advertised ranges,
both of which change cache behavior — see the next section for a
reproduction on an actual RIPE RIS RIB.

Real BGP reproduction (RIPE RIS rrc00)

examples/mrt_to_fib.py parses a TABLE_DUMP_V2 / RIB_IPV6_UNICAST
MRT dump (RFC 6396) into a planb-lpm FIB. On the rrc00 full-table
latest-bview.gz of 2026-04-19 we get 254 197 unique IPv6 prefixes —
right in the range (≈235 K) the paper cites for rrc00 in its
evaluation. Same 1 M uniform-random 64-bit trace, same 20-run
discipline, pinned to core 2:

| Path | min | median | max | IQR |
|---|---|---|---|---|
| single | 48.7 | 67.5 | 74.4 | 3.6 |
| batch<8> | 115.3 | 137.2 | 151.0 | 8.7 |
| batch<32> | 100.5 | 118.6 | 137.1 | 8.3 |

Units: MLPS. Tree depth 7, 4.78 M sentinel-padded keys, 38.4 MB
algorithmic footprint (same skeleton as the synthetic 250 K row).
Build is 63 ms, full rebuild under bench_update is median 40 ms
(paper 850 ms/1 M on Zen5 → ~213 ms/250 K scale; our Ice Lake laptop
at 40 ms is ~5× faster — likely because the paper bundles more
preprocessing into the rebuild number).

Throughput is roughly 3× higher on real BGP than on the same-size
synthetic (67 vs 20 MLPS single). The synthetic distribution is
length-weighted but picks each prefix’s address uniformly; real BGP
concentrates advertisements in a small fraction of the address space,
so a uniform-random lookup trace against a real RIB tends to hit the
same hot tree paths and stays L2-resident. This is the opposite of the
“real data is worse” effect we expected — it mainly tells us our trace
is the limiting factor, not the FIB.

Against real BGP the Patricia baseline gets faster too — by more:

| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| lpm6::Tree | 43.5 | 65.3 | 75.4 | 5.1 |
| Patricia trie | 51.2 | 95.7 | 110.9 | 24.2 |

Patricia jumps from 2.4 MLPS on the 100 K synthetic to 95 MLPS on
real BGP — 40× better on the same machine, and now ahead of the
tree. Two things are going on:

  1. Uniform 64-bit lookups bail out early in Patricia on a BGP table.
    Most random upper-64-bit addresses fall into a handful of heavily
    covered root regions (think /6, /8, /12) — Patricia returns after
    1–2 pointer chases, staying in L1. The tree always does 7
    cache-line fetches no matter which prefix ultimately wins.
  2. Our Patricia is a stand-in, not one of the paper’s baselines.
    The paper’s 1.6–14× advantage is over PopTrie / CP-Trie /
    Neurotrie / HBS — those structures add per-node SIMD and bitmap
    acceleration that a plain pointer trie doesn’t have; on uniform
    lookups they pay more than our Patricia does. The gap to a
    genuine paper baseline is almost certainly larger than our gap to
    Patricia.

The honest conclusion: on this hardware, on a real BGP table, against
a uniform 64-bit trace, a plain Patricia trie is roughly at parity
with the tree. The tree’s advantage shows up with non-uniform
traces (real packet captures concentrate on long prefixes where
Patricia walks deep) or with the paper’s actual baselines. Both are
on the roadmap.

Multi-core scaling

bench_mt runs batch<8> across T threads, each pinned to a distinct
logical CPU, all sharing one immutable lpm6::Tree. Threads release
in lockstep per run so L3 / DRAM contention is measured, not dodged.
Same 20-run discipline. Laptop: 4 physical cores / 8 logical (SMT):

| FIB | T | per-thread min | per-thread max | aggregate MLPS | efficiency |
|---|---|---|---|---|---|
| BGP 254 K | 1 | 95 | 140 | 130 | 1.00× |
| BGP 254 K | 2 | 95 | 143 | 278 | 1.07× |
| BGP 254 K | 4 | 68 | 142 | 396 | 0.76× |
| BGP 254 K | 8 | 17 | 98 | 626 | 0.60× |
| synth 1 M | 1 | 12 | 18 | 15 | 1.00× |
| synth 1 M | 2 | 13 | 18 | 31 | 1.02× |
| synth 1 M | 4 | 7 | 16 | 52 | 0.85× |
| synth 1 M | 8 | 3 | 15 | 85 | 0.69× |

Efficiency = aggregate at T ÷ (T × aggregate at T=1). Units MLPS.
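In code, the efficiency column is just (illustrative helper, not part of the repo):

```python
def efficiency(aggregate_t, t, aggregate_1):
    """Scaling efficiency: aggregate throughput at T threads divided by
    T times the single-thread aggregate (1.0 = perfect linear scaling)."""
    return aggregate_t / (t * aggregate_1)
```

For example, the BGP T=4 row is efficiency(396, 4, 130) ≈ 0.76.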

Observations:

  • T=2 is super-linear on the BGP FIB (1.07×). At 1T, a single
    memory-bound lookup chain leaves functional units idle during DRAM
    waits; with 2T on 2 cores the hardware has more in-flight loads and
    can hide more of that latency. Similar effect is visible on synth.
  • 4-of-4 physical cores is the cleanest scaling regime. At 4T the
    efficiency is ~0.76–0.85×, mainly limited by shared L3 once the hot
    set no longer fits per-core.
  • 8T uses SMT (hyperthreading). Per-thread throughput collapses
    (the two sibling threads contend for L1/L2 and AVX-512 port
    bandwidth on the same physical core), but aggregate still rises
    because the second sibling occupies stall cycles. Efficiency vs
    single-core baseline drops to 0.60–0.69×.
  • Real BGP scales worse relatively than synth 1 M. BGP’s hot
    set at T=1 already hits ~130 MLPS and stays in L2; more threads push
    it out and each one pays the L3 tax. Synth-1M is DRAM-bound even at
    T=1 (footprint 38 MB ≫ L3), so adding threads is cheaper in relative
    terms — the DRAM subsystem was the bottleneck anyway.

The paper reports 3.4 BLPS on 12 Xeon cores (§5.4). Our 8-thread laptop
(4 real cores + SMT) hits 0.63 BLPS on the same-scale BGP table —
roughly 5× below the paper, which tracks with 12 vs 4 real cores
plus Xeon’s wider L2 / higher sustained AVX-512 frequency. A proper
comparison needs a server-class box; see Phase 2.6 in plan.md.

Update performance

PlanB’s update model is batched: N pending prefix changes are coalesced
into one full rebuild (§3.6 of the paper). bench_update measures
rebuild time and per-op latency on the 100 000-prefix FIB:

  tree rebuild          : min 17.2  med 18.8  max 25.0  ms    (IQR 1.2)
  dynamic insert (incl. rebuild) : med 14.4 ms
  amortized if coalesced per 1 K : ~14 µs / update
  amortized if coalesced per 10 K: ~1.4 µs / update
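The amortized lines follow directly from the batched model: one rebuild absorbs N coalesced updates, so the per-update cost is the rebuild time split across the batch. A sketch (hypothetical helper name):

```python
def amortized_us_per_update(rebuild_ms, coalesced):
    """Per-update cost in microseconds when `coalesced` pending
    changes share one full rebuild."""
    return rebuild_ms * 1000.0 / coalesced
```

With the measured 14.4 ms dynamic insert, coalescing per 1 K gives ~14 µs/update and per 10 K gives ~1.4 µs/update, matching the table above.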

The paper reports 850 ms to rebuild a 1 M-prefix FIB on an AMD Zen5
server; linear scaling gives ~85 ms for our 100 K setup and we see
~19 ms on Ice Lake, so the rebuild path is at parity within hardware
noise.

Limitations — what this is not

  • Not on the paper’s hardware. Laptop Ice Lake (4 physical cores,
    ~6 MB L3) vs. the paper’s 12-core Xeon 8331C / Zen5 server — ≈5–15×
    raw single-core throughput gap before algorithm and ~3× core-count
    gap for multi-core aggregate. A server-class run on GCP
    c2-standard-16 (Cascade Lake, AVX-512) is on the roadmap
    (Phase 2.6 in plan.md).
  • Not the paper’s baselines. Patricia is a stand-in conventional
    pointer-chasing trie; we do not implement PopTrie, CP-Trie,
    Neurotrie, or HBS. PopTrie is the next item in plan.md
    (Phase 2.7); the others are tracked but not scheduled.
  • Synthetic FIBs are length-weighted, not address-distributed like
    real BGP. The “Real BGP reproduction” section above shows how
    much this matters: throughput on a real rrc00 table is ~3× higher
    than on a same-size synthetic for uniform 64-bit traces, and
    Patricia’s relative position against the tree flips entirely. Both
    the synthetic and the real-BGP runs use a uniform-random lookup
    trace — real packet captures concentrate on long prefixes, which
    would penalize Patricia further and is the regime where the
    linearized tree’s fixed 7 cache-line cost looks best.

Treat these numbers as a reproducibility check that the algorithm
behaves as the paper predicts on commodity hardware, not as a
competitive evaluation.

Python bindings:

pip install pybind11 scikit-build-core
pip install -e .
python examples/basic_usage.py
from planb_lpm import Tree, Dynamic

t = Tree()
t.build([("2001:db8::", 32, 10), ("fe80::", 10, 20)])
t.lookup("2001:db8::1")   # -> 10

d = Dynamic()
d.load([("::", 0, 1)])
d.insert("2001:db8::", 32, 7)
d.lookup("2001:db8::abcd") # -> 7
How it works:

  1. Expand each prefix/len into a [start, end) interval on the upper
    64 bits of the IPv6 address space.
  2. Sort the 2·N boundary points; resolve prefix nesting with a stack so that
    each elementary interval knows its active next-hop.
  3. Pack the sorted boundaries into the leaves of a 9-ary linearized B+-tree
    (8 keys per node, 64-byte cache-line aligned). Internal nodes hold
    separator keys copied from the leaves at strides of 9^L · 8.
  4. Lookup is a pure predecessor search: at each internal node use a single
    AVX-512 vpcmpuq + popcnt to pick the child, then do the same at the
    leaf to locate the predecessor edge.
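Steps 1, 2, and 4 can be sketched compactly. This is an illustrative scalar model in Python, not the repo’s code: a single bisect over one flat sorted boundary array stands in for the per-node vpcmpuq + popcnt step, step 3’s node packing is skipped, and the names and (start, plen, hop) layout are assumptions.

```python
from bisect import bisect_right

def linearize(fib):
    """Steps 1-2: expand each (start, plen, hop) prefix on the upper
    64 bits into [start, end), then sweep the sorted prefixes with a
    stack so each elementary interval records the next-hop of its
    innermost covering prefix."""
    bounds, hops = [], []
    stack = [(0, 1 << 64, None)]              # sentinel: whole space, no route

    def emit(pos):
        hop = stack[-1][2]                    # active next-hop from `pos` on
        if bounds and bounds[-1] == pos:
            hops[-1] = hop                    # coinciding boundary: overwrite
        elif not hops or hops[-1] != hop:
            bounds.append(pos)
            hops.append(hop)

    for start, plen, hop in sorted(fib):      # outer prefixes sort first
        end = start + (1 << (64 - plen))
        while stack[-1][1] <= start:          # pop prefixes that ended earlier
            emit(stack.pop()[1])
        stack.append((start, end, hop))
        emit(start)
    while len(stack) > 1:                     # drain remaining open prefixes
        emit(stack.pop()[1])
    return bounds, hops

def lookup(bounds, hops, addr64):
    """Step 4, scalar analogue: predecessor search over the sorted
    boundaries (the real tree does this per 8-key node with AVX-512)."""
    i = bisect_right(bounds, addr64) - 1
    return hops[i] if i >= 0 else None
```

lookup() returns the innermost covering prefix’s next-hop, which is exactly what the brute-force oracle in the correctness tests checks.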
Repository layout:

include/lpm6.hpp            core header-only library
src/main.cpp                CLI benchmark
src/ipv6_util.hpp           FIB/trace loaders
tests/test_correctness.cpp  brute-force vs tree vs patricia
tests/patricia.hpp          baseline path-compressed radix trie (benchmark only)
tests/bench_naive.cpp       three-way throughput: naive / patricia / tree
tests/bench_update.cpp      rebuild + dynamic-update latency
tests/bench_mt.cpp          multi-core scaling (shared const tree, per-thread slice)
tests/bench_stats.hpp       warmup / percentile / env-probe helpers
examples/mrt_to_fib.py      RFC 6396 TABLE_DUMP_V2 → FIB converter for RIPE RIS
python/binding.cpp          pybind11 module
examples/                   sample data + usage scripts

All files in this repository are original work under the MIT license — see
LICENSE.md. The author’s reference implementation at
https://github.com/nicexlab/planb is not vendored; see the paper cited
above for the algorithm itself.

The 0.1.0 tree has been built and tested in the following configurations on
Ubuntu 24.04 / GCC 13.3 (Intel Ice Lake):

| Configuration | Result |
|---|---|
| Release, AVX-512 auto-detect | ctest 1/1, bench OK |
| Release, AVX-512 forced off (scalar fallback) | ctest 1/1, 0 zmm instructions in binary |
| Debug + -fsanitize=address,undefined | ctest 1/1, no leaks / UB |
| -Wall -Wextra -Wpedantic -Wshadow -Wconversion -Werror | Clean build |
| pip install -e . in a fresh venv + examples/basic_usage.py | All lookups correct |

The core algorithm is from the PlanB paper by Zhang et al. (see top of file);
all source in this repository is an independent reimplementation.

Contributions of any size are welcome — bug reports, benchmark results on
new hardware, alternative SIMD backends, language bindings, documentation
fixes. See CONTRIBUTING for the dev workflow,
commit conventions, and how to run the full verification suite (scalar
fallback, ASAN/UBSAN, -Werror, pip install -e .) before opening a
pull request.
