Please help me build a word-pair frequency table from a table with 100 million records in a SQL Server 2008 database. The table looks like this:
Original table
id | source | comment(255)
---------------------------------------
1  | A1     | review budget limitation
source is an ID that can take about 800 different values. The distribution of sources in the original table is exponential, meaning that source A1 could have 20 million records while A500 has only 10,000.
In the end I would like to get a word-pair frequency table that ignores the following words: the, and, of, to, a, i, it, in, or, is
How I expect it to work (my approach may not be optimal):
- Read the first two words from the comment in the original table and put them into the Frequency table
- Read the next pair (the second and third words) and add it as well
Frequency table
id | word pairs        | source | Frequency
---------------------------------------------
1  | review budget     | A1     | 1
2  | budget limitation | A1     | 1
- Process the whole comment of the first record (for example, one with source A1) in this way
- Move on to the next record and process it in the same way
- If the same word pair already exists in the Frequency table with the same source, just increment Frequency; if the source is different, add the pair as a new row with the new source
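To make the expected result concrete, here is a minimal Python sketch of the logic described above (it is not the SQL I am asking for, only a reference for the output). It rests on a few assumptions: comments are lowercased and split on whitespace, the ignored words are dropped before pairs are formed, and pairs overlap as in the example table. The function name `count_word_pairs` and the sample rows are made up for illustration.

```python
from collections import Counter

# Words to ignore, as listed above.
STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def count_word_pairs(rows):
    """Count adjacent word pairs per source.

    rows: iterable of (source, comment) tuples.
    Returns a Counter keyed by (word_pair, source).
    """
    freq = Counter()
    for source, comment in rows:
        # Assumption: lowercase, split on whitespace, drop ignored words first.
        words = [w for w in comment.lower().split() if w not in STOP_WORDS]
        # Overlapping pairs: (w1, w2), (w2, w3), ...
        for first, second in zip(words, words[1:]):
            freq[(first + " " + second, source)] += 1
    return freq


# Hypothetical sample rows; only the first one comes from the example above.
rows = [
    ("A1", "review budget limitation"),
    ("A1", "review of the budget"),
    ("A500", "review budget limitation"),
]
freq = count_word_pairs(rows)
for (pair, source), n in sorted(freq.items()):
    print(pair, source, n)
```

The same pair under a different source is counted separately, which matches the last rule in the list. Whether ignored words should be removed before pairing or should instead break a pair is an open choice; the sketch takes the first option.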
Could you please help me with an optimal SQL script for SQL Server?