I was just kicking around the idea of compressing a large block of text down to a single integer by using recursive 2-gram storage until there is only one value left.
table pair
{
    id
    first_parent_id   (points to pair.id)
    second_parent_id  (points to pair.id)
}
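In code, that table could be faked with a dictionary that hands out one integer per distinct pair. A rough Python sketch (the names pairs, members, and pair_id are mine, and words and pair IDs share one table just to keep it short):

    # In-memory stand-in for the pair table (hypothetical names).
    # A pair's members can be raw words (first level) or earlier pair IDs.
    pairs = {}    # (left, right) -> id, so a repeated pair reuses its id
    members = {}  # id -> (left, right), for walking back down later

    def pair_id(left, right):
        key = (left, right)
        if key not in pairs:
            pairs[key] = len(pairs) + 1   # IDs handed out in creation order
            members[pairs[key]] = key
        return pairs[key]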
For example, in the following diagram I have an 11-word sentence (twelve tokens with the period). I could store each word pair in a database ("This" + "is" = ID #1), then store each pair of word pairs in the database (1 + 2 = ID #7), and repeat until only one ID is left, which here would be ID 12 (see the sketch after the diagram):
This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------
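Continuing the sketch above, the whole compression pass is just repeated pairing, with an odd leftover promoted on its own ID (the way 9 becomes 11 above). This is only my reading of the scheme, not tested at scale:

    def compress(tokens):
        # Reduce a token list to a single ID, level by level, as in the diagram.
        level = list(tokens)
        while len(level) > 1:
            nxt = [pair_id(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:                    # odd leftover, e.g. 9 -> 11
                nxt.append(pair_id(level[-1], None))
            level = nxt
        return level[0]

    tokens = "This is my group of words which I plan to compress".split() + ["."]
    top = compress(tokens)   # with an empty table this comes out as 12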
Then, using the number 12, we can work backwards, as long as we have the same dataset (sketched in code after the diagram):
------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.
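The backwards walk, again building on the same sketch, is just a recursive lookup in the members table assumed above:

    def decompress(node):
        # Expand an ID back into its word list by walking the pair table down.
        if not isinstance(node, int):
            return [node]                 # a raw word: emit it as-is
        left, right = members[node]
        out = decompress(left)
        if right is not None:             # None marks a promoted odd leftover
            out += decompress(right)
        return out

    assert decompress(top) == tokens      # round-trips the sentence above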
While compressing and decompressing each string this way would take a tremendous amount of work, it seems like it might be useful for some kind of archival storage where the contents must be kept but are almost never read, so the slow decompression wouldn't be a problem.
Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500-word document.)
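For a rough sense of scale, here is my own back-of-envelope, assuming every pair is brand new (i.e. no sharing between documents):

    n, total, passes = 500, 0, 0
    while n > 1:
        n = (n + 1) // 2      # pairs produced per pass, odd leftover promoted
        total += n
        passes += 1
    print(passes, total)      # 9 passes, 501 new IDs for one 500-token document

So without heavy reuse of pairs across documents, the table would grow by roughly one row per token stored.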