database - How to sort the ngrams in Google's database (or the AWS-hosted version) by frequency

Question

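I am looking for a way to sort the Google Books Ngrams by frequency.

The original dataset is at http://books.google.com/ngrams/datasets. Within each file, the ngrams are sorted alphabetically and then chronologically.

My computer is not powerful enough to handle 2.2 TB of data, so I think the only way to sort this would be "in the cloud".

The AWS-hosted version is at http://aws.amazon.com/datasets/8172056142375670.

Is there a cost-efficient way to find the 10,000 most frequent 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams?

To throw a wrench into it, the datasets contain data for multiple years: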

As an example, here are the 30,000,000th and 30,000,001st lines from file 0 
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate   1978   313    215   85 
circumvallate   1979   183    147   77

The first line tells us that in 1978, the word "circumvallate" (which means 
"surround with a rampart or other fortification", in case you were wondering) 
occurred 313 times overall, on 215 distinct pages and in 85 distinct books 
from our sample.

Ideally, the frequency lists would only contain data from 1980 to the present (the sum over those years for each ngram).

Any help would be appreciated!

Cheers,

score 4 · Accepted Answer

I would recommend using Pig!

Pig makes things like this very easy and straightforward. Here is a sample Pig script that does pretty much what you need:

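-- Load the tab-separated rows: ngram, year, match count, page count, book count
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
-- Keep only the rows from 1980 onward
filtered = FILTER raw BY year >= 1980;
-- Sum the yearly counts for each ngram
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
-- Sort by total count, highest first, and keep the top 10,000
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
-- Write the result as tab-separated text
STORE limited INTO '/foo/output' USING PigStorage('\t');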

Pig on AWS Elastic MapReduce can also operate directly on S3 data, so you could probably replace /foo/input and /foo/output with S3 buckets too.
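
To check the logic on a small sample before paying for a cluster run, the same filter, group, sum, and top-N steps can be sketched in plain Python. This is only a minimal sketch and not part of the Pig approach itself: the function name `top_ngrams` and its `min_year` and `n` parameters are hypothetical, it assumes tab-separated lines with the five fields shown in the question, and the last three sample rows are invented for illustration. It keeps every total in memory, so it is suitable for samples only, not for the full 2.2 TB.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_ngrams(lines: Iterable[str], min_year: int = 1980, n: int = 10000) -> List[Tuple[str, int]]:
    """Sum the per-year match counts of each ngram and return the n most frequent."""
    totals: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        ngram, year, count = fields[0], int(fields[1]), int(fields[2])
        if year >= min_year:        # corresponds to FILTER raw BY year >= 1980
            totals[ngram] += count  # corresponds to GROUP ... and SUM(count)
    # corresponds to ORDER ... DESC followed by LIMIT
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# The first two rows come from the question; the remaining rows are made up.
sample = [
    "circumvallate\t1978\t313\t215\t85",
    "circumvallate\t1979\t183\t147\t77",
    "circumvallate\t1980\t100\t90\t40",
    "circumvallate\t1981\t50\t45\t20",
    "the\t1980\t1000\t900\t300",
]
print(top_ngrams(sample, n=2))  # [('the', 1000), ('circumvallate', 150)]
```

The 1978 and 1979 rows are dropped by the year filter, so "circumvallate" totals 150 rather than 646. If the Pig job produces different totals on the same sample, the field order or delimiter assumed in the LOAD statement is the first thing to check.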
