text - Compressing text using recursive N-Grams

Question

I was just kicking around the idea of breaking up a large group of text into a single integer by using recursive 2-Gram storage until there is only one value left.

table pair
{
    id
    first_parent_id (points to -> this.id)
    second_parent_id (points to -> this.id)
}

For example, in the following code I have a 11 word sentence (twelve with the period). I could store each word pair in a database ("this" + "is" = ID #1) and then store each set of two wordpairs in the database (1 + 2 = ID #7), and repeat until I get down to only one word set left - which would be ID 12.

This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------

Then using the number "12" we can work backwards (if we have the same dataset)

------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.

While this would take a tremendous amount of work to compress/uncompress each string - it seems like it might have a use in some kind of archive work where the contents need to be stored - but are never read except in rare cases where the uncompression process isn't a problem.

Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500 word document).

score 2 · Accepted Answer

圧縮を実現するために「ダイグラムワード」が必要なのはなぜですか。それが厳密な要件ではない場合、さまざまなシーンリオでテキストデータを圧縮するさまざまな方法があります。それらは主に辞書の前処理と呼ばれます。あなたのケースに適用できるリストは次のとおりです。

単語の出現をカウントし、頻度の降順で並べ替えます。カスタムエンコーディング方法で上位Nワードを使用できます。ここで、Nはユーザーが構成できます。動的計画法などでNを最適化することもできます。実際のエンコードでは、次の記号が辞書の単語であるか直接エンコードされた単語であるかを示すフラグをエンコードします。
ダイグラムまたはトリグラム文字の組み合わせ（スペース、句読点などを含む）のヒストグラムを作成します。次に、未使用のバイト値を使用して、頻繁に見られる1つまたは複数のダイグラムをエンコードします。再帰的な方法を使用して何度もスキャンし、ソースファイルを減らすこともできます。

あなたの場合、上記の方法を検討すると非効率的です。なぜなら、エンコードされたデータをデコードするために本当に大きなデータが必要だとは考えていなかったようです。圧縮のアイデアのほとんどを理解するには、その出力を分析するための非常に単純なテストプログラムを作成することをお勧めします。最終的には、より強力で安定したアルゴリズムになります。

参照を提供するために頭に浮かぶ辞書プリプロセッサを次に示します。

XWRT：最先端の辞書プリプロセッサの1つ。
DICT：高性能のFreeArcアーカイバのプリプロセッサ（オープンソース）。それについての記事があります。残念ながら、それはロシア語です。
KWC：6グラムのコードを辞書コードに置き換える単純なテスト辞書プリプロセッサ。議論のためにここを見てください。
bpe2 V3：n-gram置換に基づいています。他のバージョン：V1、V2。また、それについての議論があります。

score 1 · Accepted Answer

要するに、そうです、シーケンスの可能な数は、これを効率的に行うには多すぎるでしょう。より大きな問題は、これらの単語のマッピング、およびそれらの各マッピングに続くn-gramをどこかに保存する必要があることです。これは、実際の「圧縮」の節約を大幅に上回ります。

text - Compressing text using recursive N-Grams

2 に答える 2

Related

Reference