bash - Word frequency tally script is too slow

Question

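Background

I have written a script that counts the frequency of words in a plain text file. The script performs the following steps:

Count the frequency of each word in a corpus.
Retain each word in the corpus that is also found in a dictionary.
Create a comma-separated file of the frequencies.

The script is at http://pastebin.com/VAZdeKXs: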

#!/bin/bash

# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
  tr '[:upper:]' '[:lower:]' | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt

echo Creating corpus lexicon...
rm -f corpus-lexicon.txt

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done

echo Creating lexicon...
rm -f lexicon.txt

for i in $(cat corpus-lexicon.txt); do
  egrep -m 1 "^[0-9 ]* $i\$" frequency.txt | \
    awk '{print $2, $1}' | \
    tr ' ' ',' >> lexicon.txt;
done

Problem

The following lines scan the dictionary over and over, once per word, to find matches:

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done

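This works, but it is slow: to drop the words that are not in the dictionary, it runs a separate grep over the dictionary for every word found in the corpus. (The -m 1 parameter stops the scan once a match is found.)

Question

How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.

Thank you!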

score 2 · Accepted Answer

You can use grep -f to search for all of the words in a single pass over frequency.txt:

awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt

-F searches for fixed strings rather than regular expressions.
-x matches whole lines only.
-f dictionary.txt reads the search patterns from dictionary.txt.
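To see how the three flags behave together, here is a minimal sketch on made-up data. The sample words, their counts, and the scratch directory are assumptions for illustration only; they are not from the original post.

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# frequency.txt mimics the output of `uniq -c | sort -rn`: count, then word.
printf '%s\n' '      3 the' '      2 cat' '      1 xyzzy' '      1 cats' > frequency.txt
printf '%s\n' 'cat' 'cats' 'dog' 'the' > dictionary.txt

# A single grep process tests every corpus word against all dictionary entries:
#   -F  treat each pattern as a fixed string
#   -x  a pattern must match the whole input line
#   -f  read the patterns, one per line, from dictionary.txt
awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt

cat corpus-lexicon.txt
```

On this sample the output is the, cat and cats, one per line and still in frequency order; xyzzy is dropped because it is not in the dictionary.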

In fact, you could even combine this with the second loop and eliminate the intermediate corpus-lexicon.txt file. The two for loops can be replaced by a single grep:

grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

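Notice that I changed -x to -w. The lines of frequency.txt contain a count as well as a word, so a whole-line match would never succeed; -w matches the dictionary entry as a whole word within the line instead.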
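The difference between -x and -w can be checked on a small sample. This is only a sketch; the file contents below are invented for illustration.

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '      3 the' '      2 cat' '      1 xyzzy' '      1 cats' > frequency.txt
printf '%s\n' 'cat' 'cats' 'dog' 'the' > dictionary.txt

# With -x an entire line such as "      3 the" would have to equal a
# dictionary entry, so no line of frequency.txt matches.
whole_line_hits=$(grep -Fxf dictionary.txt frequency.txt | wc -l)
echo "matches with -x: $whole_line_hits"

# With -w a dictionary entry only has to appear as a whole word on the line;
# awk then swaps the columns into "word,count".
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}' > lexicon.txt

cat lexicon.txt
```

On this sample lexicon.txt contains the,3 then cat,2 then cats,1. One thing worth checking on real data: -w matches anywhere on the line, so a dictionary entry made up only of digits could match the count column rather than the word.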

score 1 · Accepted Answer

This is typically one of those scripts that you would write in Perl for speed. But if, like me, you hate write-only programming languages, you can do it all in Awk:

awk '
    BEGIN {
        while ((getline < "dictionary.txt") > 0)
            dict[$1] = 1
    }
    ($2 && $2 in dict) { print $2 }
' < frequency.txt > corpus-lexicon.txt

You don't need the rm -f corpus-lexicon.txt in this version, because the output redirection overwrites the file.
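The same in-memory lookup could plausibly replace the second loop as well and write the final comma-separated file directly. The following is a sketch of that variation, not part of the original answer: it uses the common NR == FNR two-file idiom in place of getline, and the sample data is invented.

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '      3 the' '      2 cat' '      1 xyzzy' '      1 cats' > frequency.txt
printf '%s\n' 'cat' 'cats' 'dog' 'the' > dictionary.txt

awk '
    # First file (dictionary.txt): NR equals FNR only while reading it,
    # so store each entry as an array key and move on.
    # (This idiom assumes dictionary.txt is not empty.)
    NR == FNR { dict[$1] = 1; next }

    # Second file (frequency.txt): $1 is the count, $2 is the word.
    ($2 != "") && ($2 in dict) { print $2 "," $1 }
' dictionary.txt frequency.txt > lexicon.txt

cat lexicon.txt
```

Each file is read exactly once, and every lookup is an array membership test, so neither for loop nor corpus-lexicon.txt is needed. On this sample the output is the,3 then cat,2 then cats,1.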

score 0 · Accepted Answer

Use a real programming language. All of the process startups and file scans are killing you. For instance, here is an example I just whipped up in Python (minimizing lines of code):

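import sys, re
# Usage: script.py CORPUS OUTPUT. Every run of word characters counts as a word.
words = re.findall(r'\w+', open(sys.argv[1]).read())
counts = {}
for word in words:
  counts[word] = counts.get(word, 0) + 1
# Python 3: dict.items() replaces Python 2's iteritems().
open(sys.argv[2],'w').write("\n".join(w+','+str(c) for (w,c) in counts.items()))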

Testing this against a large text file I had sitting around (1.4MB, 80,000 words according to wc), it completes in under a second (18,000 unique words) on a 5-year-old PowerMac.
