sql - script to build frequency word pairs on 100 mln DB

Question

Please help me build a word-pair frequency table from a table with 100 million records on a SQL Server 2008 database. The table looks like this:

Original table 
id |source |comment(255)
-------------------
1     A1     review budget limitation

source is an ID that can have about 800 different values. The distribution of sources in the original table is exponential: the number of records with source A1 could be 20 million, while A500 has only 10,000.

In the end I would like to get a word-pair frequency table that ignores these words: the, and, of, to, a, i, it, in, or, is

How I expect it to work (this may not be optimal):

Read the first two words from the comment in the original table and put them into the Frequency table.
Read the next two words and put them in as well.

Frequency table

id | word pairs        | source |Frequency
 ---------------------------------------------
1   review budget         A1         1
2   budget limitation     A1         1

Process the full comment of the first record (which has, for example, source A1) this way.
Then move on to the next record and process it the same way.
If a word pair already exists in the Frequency table with the same source, just increment its Frequency; if the source is different, add the pair as a new row with the new source.

Is there an optimal SQL script for doing this on SQL Server?

score 1 · Accepted Answer

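I'll work this out in a minute (given time), but first I want to give three guidelines:

Anything that needs to run fast in SQL has to be set-based. Avoid processing rows "one at a time".
Use a table-valued function to split each comment into a table of word pairs.
Use common table expressions to layer the work and keep it readable.

With these three rules you can move a lot of data. Once you have written the SELECT statement, all that is left is to dump its result into a table.

Edit: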

CREATE FUNCTION dbo.SplitToPairs(@sText nvarchar(255))
RETURNS @Pairs TABLE (
    Pair nvarchar(255) NOT NULL
)
AS
BEGIN
    -- Stop words to ignore, pipe-delimited so CHARINDEX matches whole words only
    DECLARE @StopWords nvarchar(100) = N'|the|and|of|to|a|i|it|in|or|is|';
    -- Trim the text and append one space so that every word, including the last, ends at a space
    DECLARE @Text nvarchar(256) = LTRIM(RTRIM(@sText)) + N' ';
    DECLARE @Start int = 1;                      -- position where the current word starts
    DECLARE @End int = CHARINDEX(N' ', @Text);   -- position of the space that ends it
    DECLARE @Word nvarchar(255);                 -- current word
    DECLARE @Prev nvarchar(255);                 -- previous kept word (NULL until one is found)

    WHILE @End > 0
    BEGIN
        SET @Word = SUBSTRING(@Text, @Start, @End - @Start);
        -- Skip empty tokens (caused by repeated spaces) and stop words
        IF LEN(@Word) > 0 AND CHARINDEX(N'|' + @Word + N'|', @StopWords) = 0
        BEGIN
            IF @Prev IS NOT NULL
                INSERT @Pairs (Pair) VALUES (@Prev + N' ' + @Word);
            SET @Prev = @Word;
        END
        -- Move on to the next word
        SET @Start = @End + 1;
        SET @End = CHARINDEX(N' ', @Text, @Start);
    END
    -- Note: if the text has fewer than two non-stop words, no insert happens
    RETURN
END
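As a quick sanity check, the function can be called directly on the sample comment from the question. Going by the question's Frequency table, the two pairs expected back are review budget and budget limitation. This is only a minimal sketch of how to try the function out, not a full test:

```sql
-- Split the question's sample comment into word pairs.
SELECT Pair
FROM dbo.SplitToPairs(N'review budget limitation');
```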

Then use it to build the select:

SELECT I.Source, P.Pair, COUNT(*) AS Frequency
FROM Information AS I CROSS APPLY dbo.SplitToPairs(i.Comment) AS P
GROUP BY I.Source, P.Pair
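As a rough sketch of the "dump it into a table" step, the query above could be layered with a common table expression and written out in a single set-based statement. The target table dbo.PairFrequency, its column names, and the type chosen for Source are assumptions made for illustration; Information is the table name used in the query above.

```sql
-- Hypothetical target table; the name and column types are assumptions.
CREATE TABLE dbo.PairFrequency (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    Pair      nvarchar(255) NOT NULL,
    Source    varchar(10)   NOT NULL,
    Frequency int           NOT NULL
);

-- Layer 1 (CTE): one row per word-pair occurrence, tagged with its source.
-- Layer 2: count the occurrences and insert the result in one statement.
WITH Pairs AS (
    SELECT I.Source, P.Pair
    FROM Information AS I
    CROSS APPLY dbo.SplitToPairs(I.Comment) AS P
)
INSERT INTO dbo.PairFrequency (Pair, Source, Frequency)
SELECT Pair, Source, COUNT(*)
FROM Pairs
GROUP BY Source, Pair;
```

Because the counting happens in the GROUP BY, there is no per-row "look up the pair and increment it" step as described in the question.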

I may be off on a few edge cases, but this should give you an idea of what I'm going for. Also, it does not treat "word1 word2" and "word2 word1" as equal.

I'll leave that as an exercise for the reader :p
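One possible approach to that exercise, offered only as a sketch: split each pair back into its two words and put them in alphabetical order before grouping, so both orderings collapse into one row. It assumes every Pair returned by dbo.SplitToPairs contains exactly one space between two non-empty words; the CTE names Pairs and Ordered are made up for this example.

```sql
-- Sketch: count "word1 word2" and "word2 word1" as the same pair.
WITH Pairs AS (
    -- Split each pair at its single space (assumed to always be present).
    SELECT I.Source,
           LEFT(P.Pair, CHARINDEX(N' ', P.Pair) - 1)           AS Word1,
           SUBSTRING(P.Pair, CHARINDEX(N' ', P.Pair) + 1, 255) AS Word2
    FROM Information AS I
    CROSS APPLY dbo.SplitToPairs(I.Comment) AS P
),
Ordered AS (
    -- Rebuild the pair with the two words in alphabetical order.
    SELECT Source,
           CASE WHEN Word1 <= Word2 THEN Word1 + N' ' + Word2
                ELSE Word2 + N' ' + Word1
           END AS Pair
    FROM Pairs
)
SELECT Source, Pair, COUNT(*) AS Frequency
FROM Ordered
GROUP BY Source, Pair;
```

The ordering uses the column's collation, so whether differently cased words compare as equal depends on the database settings.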

Edit:

Added the TABLE keyword on the RETURNS line.

Also, I think assigning a value in a DECLARE only works on SQL 2008 and later.

Edit:

Added the RETURN statement.

Edit:

Changes per AntarticIce's feedback.
