kodomonocch1/see_proto: Schema-aware JSON compression with millisecond lookups — cut transfer/storage while enabling exists*/pos* queries. (Demo + wheels; core is binary-only)

SEE — Searchable JSON Compression (Semantic Entropy Encoding)

combined ≈ 19.5% • lookup p50 ≈ 0.18 ms • skip ≈ 99%

Why it matters
SEE reduces both the data tax (storage/egress) and the CPU tax (decompress/parse) by keeping JSON searchable while compressed.
SEE output is not always smaller than Zstd output, but the combination of searchability, low I/O, and random access can yield better TCO/ROI for many workloads.

① Download (Release) · ② OnePager (ROI) · ③ Try in 10 minutes

Enterprise / NDA inquiry → Private contact form
Under NDA: full VDR pack available. Please provide a company email (no confidential data required).

Schema-aware JSON compression: combines structure × delta × Zstd (+ Bloom / Skip) to stay searchable while compressed, with page-level random access.
Design trade-off: favors low I/O & low latency (ms) and ~99% skip rate over minimal size.
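To make the idea of Bloom-based page skipping concrete, here is a minimal, generic sketch. It is not SEE's implementation (the core is proprietary and binary-only). It assumes one Bloom filter per page over `key=value` tokens, and it uses the standard-library `zlib` as a stand-in for Zstd. The names `build_page`, `might_contain`, and `lookup` and the constants `BLOOM_BITS` and `NUM_HASHES` are hypothetical.

```python
import hashlib
import json
import zlib

BLOOM_BITS = 1024   # bits per page filter (hypothetical size)
NUM_HASHES = 3      # hash probes per token (hypothetical)


def _bit_positions(token: str):
    """Derive NUM_HASHES bit positions from a single SHA-256 digest."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    for i in range(NUM_HASHES):
        chunk = digest[4 * i: 4 * i + 4]
        yield int.from_bytes(chunk, "big") % BLOOM_BITS


def build_page(records):
    """Compress one page of records and attach a Bloom filter of its key=value tokens."""
    bloom = 0
    for rec in records:
        for key, value in rec.items():
            for pos in _bit_positions(f"{key}={value}"):
                bloom |= 1 << pos
    # zlib stands in for Zstd here so the sketch needs only the standard library.
    payload = zlib.compress(json.dumps(records).encode("utf-8"))
    return {"bloom": bloom, "payload": payload}


def might_contain(page, key, value) -> bool:
    """False means the page definitely lacks the token; True means it may contain it."""
    return all((page["bloom"] >> pos) & 1 for pos in _bit_positions(f"{key}={value}"))


def lookup(pages, key, value):
    """Return matching records and the number of pages skipped without decompression."""
    hits, skipped = [], 0
    for page in pages:
        if not might_contain(page, key, value):
            skipped += 1          # no decompress, no JSON parse for this page
            continue
        for rec in json.loads(zlib.decompress(page["payload"])):
            if rec.get(key) == value:
                hits.append(rec)
    return hits, skipped
```

A Bloom filter never produces false negatives, so a page that holds the value is always decompressed; the skip count only ever covers pages that cannot match. The skip ratio reported above would correspond to `skipped / len(pages)` in this sketch.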

Combined size: ≈19.5% of raw
Lookup present (ms): p50 ≈ 0.18 / p95 ≈ 0.28 / p99 ≈ 0.34
Skip ratio: present ≈ 0.99 / absent ≈ 0.992, Bloom density ≈ 0.30


Savings/TB = (1 − 0.195) × Price_per_GB × 1000
Example: $0.05/GB → ≈$40/TB, $0.25/GB → ≈$200/TB
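The savings formula can be checked with a few lines of Python. This is a direct transcription of the formula above; the function name `savings_per_tb` is only illustrative, and 1 TB is taken as 1000 GB, as in the formula.

```python
def savings_per_tb(price_per_gb: float, combined_ratio: float = 0.195) -> float:
    """Dollars saved per TB of raw data, using 1 TB = 1000 GB as in the formula."""
    return (1.0 - combined_ratio) * price_per_gb * 1000.0


for price in (0.05, 0.25):
    print(f"${price:.2f}/GB -> ${savings_per_tb(price):.2f}/TB")
```

The unrounded results are $40.25/TB and $201.25/TB, which the example rounds to ≈$40 and ≈$200.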

python samples/quick_demo.py

Prints compression ratio, skip rate, Bloom density, and lookup latency (p50/p95/p99).
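For readers who want to cross-check latency figures independently, the following sketch shows one common way to derive p50/p95/p99 from repeated timings. How the demo script measures latency internally is not documented here, so treat this as an assumption-laden illustration; `latency_percentiles` and its `fn` argument (any zero-argument lookup callable) are hypothetical.

```python
import statistics
import time


def latency_percentiles(fn, runs: int = 1000):
    """Time fn() repeatedly and return p50/p95/p99 in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Absolute values depend on hardware, cache state, and interpreter overhead, so percentiles measured this way are best compared between runs on the same machine.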

Demo package (Release v0.1.0):

Includes Python wheel, .see files, demo scripts, metrics, and OnePager PDF.
Reproducible on Windows / macOS / Linux.

Verify integrity using:

pwsh tools/verify_checksums.ps1
# or manually check SHA256SUMS.txt
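If PowerShell is not available, the manual check can be sketched in Python. This assumes SHA256SUMS.txt uses the common `sha256sum` layout (hex digest, whitespace, file name, one entry per line) with paths relative to the file's own directory; the actual layout in the Release may differ, so adjust the parsing if needed. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def verify(sums_file: Path) -> list[str]:
    """Return the names of listed files that are missing or whose digest differs."""
    failures = []
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()   # sha256sum marks binary mode with '*'
        target = sums_file.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling `verify(Path("SHA256SUMS.txt"))` from the unpacked Release directory returns an empty list when every listed file matches.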

KPI (demo): combined ≈ 19.5%, lookup p50 ≈ 0.18 ms, skip ≈ 99%, bloom ≈ 0.30.
Tradeoff: not always smaller than Zstd, but stays searchable while compressed, cutting I/O and CPU costs.

Zstd-only can be smaller, but not searchable; you still pay I/O + CPU to decompress and parse JSON.
SEE trades a small size increase for millisecond lookups and page-level random access, reducing I/O and CPU — resulting in better TCO.

Q. Will it ever be larger than Zstd?
A. Sometimes yes; in return you get ms lookups and ~99% skipping. For I/O/CPU-bound workloads, TCO decreases.
Q. Best-fit data?
A. Repetitive JSON/NDJSON such as logs, events, telemetry, and metrics.
Q. How long to reproduce?
A. About 10 minutes using the included Demo ZIP.
Q. Why not build a separate index?
A. Separate indexes add extra I/O, space, and consistency risk.
SEE keeps searchability inside the storage format, reducing random I/O and parsing overhead.
Q. How to tune for different data?
A. Adjust Bloom density (default ≈0.30, works best in 0.25–0.55). Demo prints all metrics for validation.
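To build intuition for the Bloom density knob, the sketch below uses the textbook Bloom filter approximations. It assumes that "Bloom density" means the fraction of bits set in a filter, which this document does not define explicitly, and it says nothing about how SEE sizes its filters internally. All function names are hypothetical.

```python
import math


def expected_density(m_bits: int, k_hashes: int, n_keys: int) -> float:
    """Expected fraction of set bits after inserting n_keys: 1 - exp(-k*n/m)."""
    return 1.0 - math.exp(-k_hashes * n_keys / m_bits)


def false_positive_rate(density: float, k_hashes: int) -> float:
    """A lookup is a false positive when all k probed bits happen to be set."""
    return density ** k_hashes


def bits_for_density(target_density: float, k_hashes: int, n_keys: int) -> int:
    """Filter size needed to land at (or just under) the target density."""
    return math.ceil(-k_hashes * n_keys / math.log(1.0 - target_density))
```

Under these assumptions, a density of 0.30 with 3 hash probes gives a false-positive rate of about 0.3³ ≈ 2.7% per page. Higher density means a smaller filter but more pages opened unnecessarily, which is the trade-off behind the recommended range.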

What’s included in the Release ZIP

Python Wheel (.whl)
Demo scripts: samples/quick_demo.py, samples/quick_bench.py (prints KPIs)
OnePager (PDF) and metrics/ summaries
Integrity check script: tools/verify_checksums.ps1
README_FIRST.md — concise reproduction guide

📦 VDR (Virtual Data Room) — Evaluation Package

What it is
The SEE VDR is a private, NDA-only evaluation bundle that lets third parties reproduce our key KPIs on their own machine:

Compression: combined size ≈ 19.5% of raw
Lookup latency: p50 ≈ 0.18 ms
Skipping: ≈ 99% page-level skip

What it contains (high level)

Sample .see artifacts with minimal metadata (for reproducible tests)
A prebuilt evaluation wheel (binary-only) for quick local runs
KPI summaries (CSV/JSON) and a frozen results snapshot
Simple verification scripts (checksums / quality-gate)
A concise One-Pager and evaluator README

ℹ️ Implementation details (core algorithms, dictionaries, low-level parameters) remain proprietary and are not disclosed in this repository.

Access policy

Distributed on request under NDA (no public download).
To request access, please contact us via LinkedIn (see Official Links & Profiles) with the subject: “SEE VDR Access”.
Redistribution, reverse engineering, and public benchmarking of VDR binaries are prohibited.
An Evaluation EULA applies in addition to the NDA.

How evaluators use it (under NDA)

Verify package integrity (checksums script).
Install the provided evaluation wheel into a clean virtual environment.
Run the 10-minute demo to print ratio / skip / bloom / p50–p99.
Compare local output with the included KPI snapshot (apples-to-apples).
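The final comparison step can be automated with a small tolerance check. This is only a sketch: the metric names and the 10% relative tolerance are hypothetical, and the real snapshot's file format and keys should be taken from the evaluator README.

```python
def compare_kpis(local: dict, snapshot: dict, rel_tol: float = 0.10) -> dict:
    """Return {metric: (local, snapshot)} for metrics missing or outside the tolerance."""
    drift = {}
    for name, expected in snapshot.items():
        actual = local.get(name)
        if actual is None or abs(actual - expected) > rel_tol * abs(expected):
            drift[name] = (actual, expected)
    return drift


# Hypothetical metric names; the values mirror the KPIs quoted in this document.
snapshot = {"combined_ratio": 0.195, "skip_present": 0.99, "bloom_density": 0.30}
local = {"combined_ratio": 0.196, "skip_present": 0.99, "bloom_density": 0.45}
print(compare_kpis(local, snapshot))
```

Size and skip metrics should be close to deterministic for identical inputs, while latency depends on the machine, so a looser tolerance for p50 to p99 is reasonable.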

Why VDR?

Ensures reproducible, verifiable numbers without exposing the core IP.
Shortens technical diligence for FinOps / M&A / platform teams while keeping trade secrets protected.

If you only need the public demo, see the repository’s samples and Release assets.
The VDR is reserved for formal evaluations (NDA) that require deeper verification.

Note: The GitHub Discussions “Enterprise (NDA)” category is public.
Do not post confidential information or emails there — use the private form above.

🔗 Official Links & Profiles

📬 If you’re interested in schema-aware compression, reproducible benchmarks, or potential collaboration, feel free to connect via LinkedIn.

From Bytes to Balance Sheets — SEE (Semantic Entropy Encoding)

Optional: For reproducibility or citation

If you reproduce benchmarks or use SEE in your research, please cite:

SEE (Semantic Entropy Encoding)
https://github.com/kodomonocch1/see_proto


🕒 Posted on 2025-10-17 UTC (Unix timestamp 1760716606)
