kodomonocch1/see_proto: Schema-aware JSON compression with millisecond lookups. Cuts transfer/storage while enabling exists*/pos* queries. (Demo + wheels; core is binary-only)

SEE: Searchable JSON Compression (Semantic Entropy Encoding)

combined ≈ 19.5% • lookup p50 ≈ 0.18 ms • skip ≈ 99%

Why it matters
SEE reduces both the data tax (storage/egress) and the CPU tax (decompress/parse) by keeping JSON searchable while compressed.
It may not always be smaller than Zstd, but searchability + low I/O + random access leads to better TCO/ROI for many workloads.

① Download (Release) ・
② OnePager (ROI) ・
③ Try in 10 minutes

Enterprise / NDA inquiry → Private contact form
Under NDA: full VDR pack available. Please provide a company email (no confidential data required).


  • Schema-aware JSON compression: combines structure × delta × Zstd (+ Bloom / Skip) to stay searchable while compressed, with page-level random access.
  • Design trade-off: favors low I/O & low latency (ms) and ~99% skip rate over minimal size.
  • Combined size: ≈ 19.5% of raw
  • Lookup present (ms): p50 ≈ 0.18 / p95 ≈ 0.28 / p99 ≈ 0.34
  • Skip ratio: present ≈ 0.99 / absent ≈ 0.992, Bloom density ≈ 0.30
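
SEE's core is binary-only and its algorithms are not disclosed, so the following is only a generic sketch of how a per-page Bloom filter can let an exists-style lookup skip most pages without reading them. Everything in it is an assumption made for illustration: the `PageBloom`, `build_pages`, and `exists` names, the `key=value` token format, the 1024-bit filter, and the three salted SHA-256 hashes. None of it is SEE's format or API.

```python
import hashlib


class PageBloom:
    """Tiny Bloom filter over the key=value tokens stored in one page."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, token):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))

    def density(self):
        # Fraction of bits set to 1.
        return bin(self.bits).count("1") / self.num_bits


def build_pages(records, page_size):
    """Split records into pages and build one Bloom filter per page."""
    pages = []
    for start in range(0, len(records), page_size):
        page = records[start:start + page_size]
        bloom = PageBloom()
        for rec in page:
            for key, value in rec.items():
                bloom.add(f"{key}={value}")
        pages.append((bloom, page))
    return pages


def exists(pages, key, value):
    """Return (found, skip_ratio) for an exact key=value lookup."""
    token = f"{key}={value}"
    skipped = 0
    found = False
    for bloom, page in pages:
        if not bloom.might_contain(token):
            skipped += 1  # the filter rules this page out; never scan it
            continue
        if any(rec.get(key) == value for rec in page):
            found = True
    return found, skipped / len(pages)


records = [{"user": f"u{i}", "status": "ok"} for i in range(1000)]
pages = build_pages(records, page_size=10)
print(exists(pages, "user", "u512"))    # present key
print(exists(pages, "user", "nobody"))  # absent key
```

In this toy setup a present key is found in one page and nearly every other page is skipped, while an absent key skips all pages except the occasional false positive. That is the general shape behind separate "present" and "absent" skip ratios.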


Savings/TB = (1 − 0.195) × Price_per_GB × 1000
Example: $0.05/GB → ≈ $40/TB, $0.25/GB → ≈ $200/TB
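
The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `savings_per_tb` and its parameter names are made up for illustration, and it assumes 1 TB = 1000 GB, as the formula does.

```python
def savings_per_tb(price_per_gb, combined_ratio=0.195, gb_per_tb=1000):
    """Dollars saved per raw TB when the stored size is combined_ratio of raw."""
    return (1 - combined_ratio) * price_per_gb * gb_per_tb


for price in (0.05, 0.25):
    print(f"${price:.2f}/GB -> ${savings_per_tb(price):.2f}/TB")
# $0.05/GB -> $40.25/TB
# $0.25/GB -> $201.25/TB
```

The rounded figures quoted in the example (≈ $40 and ≈ $200) match these values.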


python samples/quick_demo.py

Prints compression ratio, skip rate, Bloom density, and lookup latency (p50/p95/p99).
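
For readers who want to sanity-check how p50/p95/p99 figures are usually derived, here is a minimal sketch using the nearest-rank percentile method. It does not reproduce what `quick_demo.py` does internally, which the source does not describe. The `percentile` and `time_lookups` helpers are hypothetical, and the timed workload is a plain dict lookup standing in for a real query.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_lookups(lookup, keys):
    """Time each lookup call and return the latencies in milliseconds."""
    latencies_ms = []
    for key in keys:
        start = time.perf_counter()
        lookup(key)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


# Stand-in workload: a dict lookup, not a SEE call.
table = {f"k{i}": i for i in range(10_000)}
latencies = time_lookups(table.get, [f"k{i}" for i in range(0, 10_000, 10)])
for pct in (50, 95, 99):
    print(f"p{pct} = {percentile(latencies, pct):.4f} ms")
```

Tail percentiles such as p99 are sensitive to sample count and machine noise, so numbers from a short run should be compared loosely.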

Demo package (Release v0.1.0):

  • Includes Python wheel, .see files, demo scripts, metrics, and OnePager PDF.

  • Reproducible on Windows / macOS / Linux.

  • Verify integrity using:

    pwsh tools/verify_checksums.ps1
    # or manually check SHA256SUMS.txt
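
For the manual route, a sketch like the following may help on machines without PowerShell. It assumes `SHA256SUMS.txt` uses the common `sha256sum` layout (a hex digest, whitespace, then a file name relative to the checksum file); the release's actual layout is not shown in the source, so adjust the parsing if it differs. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file):
    """Return the names of listed files that are missing or whose digest differs."""
    sums_path = Path(sums_file)
    failures = []
    for line in sums_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading "*".
        target = sums_path.parent / name.lstrip("*").strip()
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling `verify_checksums("SHA256SUMS.txt")` from the unpacked release directory returns an empty list when every listed file matches.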

KPI (demo): combined ≈ 19.5%, lookup p50 ≈ 0.18 ms, skip ≈ 99%, bloom ≈ 0.30.
Tradeoff: not always smaller than Zstd, but stays searchable while compressed, cutting I/O and CPU costs.


  • Zstd-only can be smaller, but not searchable; you still pay I/O + CPU to decompress and parse JSON.
  • SEE trades a small size increase for millisecond lookups and page-level random access, reducing I/O and CPU, which results in better TCO.

  • Q. Will it ever be larger than Zstd?
    A. Sometimes yes; in return you get ms lookups and ~99% skipping. For I/O/CPU-bound workloads, TCO decreases.

  • Q. Best-fit data?
    A. Repetitive JSON/NDJSON such as logs, events, telemetry, and metrics.

  • Q. How long to reproduce?
    A. About 10 minutes using the included Demo ZIP.

  • Q. Why not build a separate index?
    A. Separate indexes add extra I/O, space, and consistency risk.
    SEE keeps searchability inside the storage format, reducing random I/O and parsing overhead.

  • Q. How to tune for different data?
    A. Adjust Bloom density (default ≈ 0.30, works best in 0.25–0.55). Demo prints all metrics for validation.
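
To get a feel for why density matters, the textbook Bloom filter estimate can be tabulated: if a fraction `density` of the bits is set and each query checks `num_hashes` independent positions, an absent key passes the filter with probability of roughly `density ** num_hashes`. This is a generic property of Bloom filters, shown as a rough sketch. It assumes that "Bloom density" here means the fraction of set bits, and the hash counts below are arbitrary examples, since SEE's parameters are not disclosed.

```python
def false_positive_rate(density, num_hashes):
    """Approximate chance that an absent key passes a Bloom filter check."""
    return density ** num_hashes


for density in (0.25, 0.30, 0.55):
    rates = ", ".join(
        f"k={k}: {false_positive_rate(density, k):.4f}" for k in (2, 4, 6)
    )
    print(f"density {density:.2f} -> {rates}")
```

Higher density means smaller filters but more pages that cannot be skipped for absent keys, which is the trade-off the tuning range is balancing.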


What’s included in the Release ZIP

  • Python Wheel (.whl)
  • Demo scripts: samples/quick_demo.py, samples/quick_bench.py (prints KPIs)
  • OnePager (PDF) and metrics/ summaries
  • Integrity check script: tools/verify_checksums.ps1
  • README_FIRST.md: concise reproduction guide

📦 VDR (Virtual Data Room): Evaluation Package

What it is
The SEE VDR is a private, NDA-only evaluation bundle that lets third parties reproduce our key KPIs on their own machine:

  • Compression: combined size ≈ 19.5% of raw
  • Lookup latency: p50 ≈ 0.18 ms
  • Skipping: ~99% page-level skip

What it contains (high level)

  • Sample .see artifacts with minimal metadata (for reproducible tests)
  • A prebuilt evaluation wheel (binary-only) for quick local runs
  • KPI summaries (CSV/JSON) and a frozen results snapshot
  • Simple verification scripts (checksums / quality-gate)
  • A concise One-Pager and evaluator README

ℹ️ Implementation details (core algorithms, dictionaries, low-level parameters) remain proprietary and are not disclosed in this repository.

Access policy

  • Distributed on request under NDA (no public download).
  • To request access, please contact us via LinkedIn (see Official Links & Profiles) with the subject: “SEE VDR Access”.
  • Redistribution, reverse engineering, and public benchmarking of VDR binaries are prohibited.
  • An Evaluation EULA applies in addition to the NDA.

How evaluators use it (under NDA)

  1. Verify package integrity (checksums script).
  2. Install the provided evaluation wheel into a clean virtual environment.
  3. Run the 10-minute demo to print ratio / skip / bloom / p50–p99.
  4. Compare local output with the included KPI snapshot (apples-to-apples).
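
Step 4 can be scripted once both sets of numbers are loaded. The sketch below is illustrative only: the metric names (`combined_ratio`, `lookup_p50_ms`, `skip_present`), the 10% relative tolerance, and the `compare_kpis` helper are assumptions, not the VDR's actual snapshot schema or quality gate. The snapshot values are the demo KPIs quoted in this document; the local values are made up.

```python
def compare_kpis(local, snapshot, rel_tol=0.10):
    """Return {metric: (local, snapshot)} for metrics outside the relative tolerance."""
    drift = {}
    for metric, expected in snapshot.items():
        actual = local.get(metric)
        if actual is None or abs(actual - expected) > rel_tol * abs(expected):
            drift[metric] = (actual, expected)
    return drift


snapshot = {"combined_ratio": 0.195, "lookup_p50_ms": 0.18, "skip_present": 0.99}
local = {"combined_ratio": 0.196, "lookup_p50_ms": 0.25, "skip_present": 0.99}
print(compare_kpis(local, snapshot))
# {'lookup_p50_ms': (0.25, 0.18)}
```

Latency figures depend on hardware, so a looser tolerance for the timing metrics than for ratio and skip rate is a reasonable choice.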

Why VDR?

  • Ensures reproducible, verifiable numbers without exposing the core IP.
  • Shortens technical diligence for FinOps / M&A / platform teams while keeping trade secrets protected.

If you only need the public demo, see the repository’s samples and Release assets.
The VDR is reserved for formal evaluations (NDA) that require deeper verification.

Note: The GitHub Discussions “Enterprise (NDA)” category is public.
Do not post confidential information or emails there; use the private form above.

🔗 Official Links & Profiles


📬 If you’re interested in schema-aware compression, reproducible benchmarks, or potential collaboration, feel free to connect via LinkedIn.

From Bytes to Balance Sheets: SEE (Semantic Entropy Encoding)


Optional: For reproducibility or citation

If you reproduce benchmarks or use SEE in your research, please cite:

SEE (Semantic Entropy Encoding)
https://github.com/kodomonocch1/see_proto

Posted on 2025-10-17 (Unix timestamp 1760716606)