Text Compression using Large Language Models

ts_zip: Text Compression using Large Language Models


The ts_zip utility can compress (and hopefully decompress) text files
using a Large Language Model. Its compression ratio is much higher
than that of other compression tools. There are some caveats, of
course:

  • A GPU is necessary to get a reasonable speed. 4 GB of RAM is
    required.
  • It is slower than conventional compressors (compression and
    decompression speed: up to 1 MB/s on an RTX 4090).
  • Only text files are supported. Binary files won’t be compressed
    much. The currently used language model (RWKV 169M v4) was trained
    mostly on English texts. Other languages, including source code,
    are supported.
  • It is experimental, so no backward compatibility should be
    expected between the various versions.
  • See also ts_sms, which is optimized for the compression of small
    messages.

Compression Ratio

The compression ratio is given in bits per byte (bpb).

File               Original size         xz     xz     ts_zip ts_zip
                         (bytes)    (bytes)  (bpb)    (bytes)  (bpb)
alice29.txt               152089      48492  2.551      21713  1.142
book1                     768771     261116  2.717     137477  1.431
enwik8                 100000000   24865244  1.989   13825741  1.106
enwik9                1000000000  213370900  1.707  135443237  1.084
linux-1.2.13.tar         9379840    1689468  1.441    1196859  1.021
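
As a quick check of the unit, bits per byte is the compressed size in bits divided by the original size in bytes. The short Python sketch below recomputes the two alice29.txt figures from the sizes listed above; the helper name `bits_per_byte` is only illustrative and is not part of ts_zip.

```python
def bits_per_byte(compressed_bytes, original_bytes):
    """Compressed size in bits divided by the original size in bytes."""
    return 8 * compressed_bytes / original_bytes


# alice29.txt: 152089 bytes originally, sizes taken from the table above
print(round(bits_per_byte(48492, 152089), 3))  # xz     -> 2.551
print(round(bits_per_byte(21713, 152089), 3))  # ts_zip -> 1.142
```

A value of 8 bpb means no compression at all, so lower is better.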

Results and speed for other programs on enwik8 and enwik9 are
available at the Large Text Compression Benchmark.

Download

Technical information

  • ts_zip uses the RWKV 169M v4 language model, which is a good
    compromise between speed and compression ratio. The model is
    quantized to 8 bits per parameter and evaluated using BF16
    floating point numbers.
  • The language model predicts the probabilities of the next token. An
    arithmetic coder then encodes the next token according to the
    probabilities.
  • The model is evaluated in a deterministic and reproducible
    way. Hence the result does not depend on the exact GPU or CPU
    model or on the number of configured threads. This key point
    ensures that a compressed file can be decompressed on a
    different hardware or software configuration.
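
The predict-then-encode loop described above can be sketched in a few lines of Python. This is a toy illustration, not ts_zip's implementation: the `predict` function is a hypothetical stand-in that uses adaptive character counts instead of the RWKV model, the alphabet is limited to lowercase letters and space, and exact `Fraction` arithmetic replaces the fixed-precision arithmetic coder a real tool would use. It does show why determinism matters: the decoder only recovers the text because it recomputes exactly the same probabilities as the encoder.

```python
from fractions import Fraction
from math import ceil

# Hypothetical toy alphabet; the real tool works on language-model tokens.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def predict(history):
    """Toy stand-in for the language model: adaptive symbol counts.

    Returns (symbol, probability) pairs that sum to exactly 1. It is a
    pure function of the history, so encoder and decoder always agree.
    """
    counts = {s: 1 for s in ALPHABET}
    for s in history:
        counts[s] += 1
    total = sum(counts.values())
    return [(s, Fraction(counts[s], total)) for s in ALPHABET]


def encode(text):
    """Narrow [low, low + width) once per symbol, then return a short
    binary fraction k / 2**nbits that lies inside the final interval."""
    low, width = Fraction(0), Fraction(1)
    for i, ch in enumerate(text):
        cum = Fraction(0)
        for s, p in predict(text[:i]):
            if s == ch:
                low, width = low + width * cum, width * p
                break
            cum += p
        else:
            raise ValueError(f"symbol {ch!r} is not in the toy alphabet")
    nbits = 0
    while True:
        k = ceil(low * 2**nbits)  # smallest k with k / 2**nbits >= low
        if Fraction(k, 2**nbits) < low + width:
            return k, nbits
        nbits += 1


def decode(k, nbits, length):
    """Replay the same predictions and pick, at each step, the symbol
    whose sub-interval contains the encoded value."""
    value = Fraction(k, 2**nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for s, p in predict(out):
            if low + width * cum <= value < low + width * (cum + p):
                low, width = low + width * cum, width * p
                out.append(s)
                break
            cum += p
    return "".join(out)


message = "the cat sat on the mat"
k, nbits = encode(message)
assert decode(k, nbits, len(message)) == message
print(nbits, "bits for", len(message), "characters")
```

The better the model predicts the next symbol, the wider its sub-interval and the fewer bits the final number needs. If the decoder's probabilities differed even slightly from the encoder's, the sub-interval boundaries would shift and decoding could diverge from that point on.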

Fabrice Bellard – https://bellard.org/

🕒 **Posted on**: 1768261829

🌟 **Want more?** Click here for more info! 🌟

By

Leave a Reply

Your email address will not be published. Required fields are marked *