c# - 1テラバイトのテキストを解析し、各単語の出現回数を効率的にカウントします

Question

最近、私は次のことを行う必要がある任意の言語でアルゴリズムを作成するためのインタビューの質問に出くわしました

1テラバイトのコンテンツを読む
そのコンテンツで繰り返される単語ごとに数えます
最も頻繁に出現する単語のトップ10をリストします

このためのアルゴリズムを作成するための最良の方法を教えてください。

編集：

OK、コンテンツが英語だとしましょう。そのコンテンツで最も頻繁に出現する上位10語をどのように見つけることができますか？私の他の疑問は、意図的に一意のデータを提供している場合、ヒープサイズのオーバーフローでバッファが期限切れになることです。それも処理する必要があります。

score 65 · Accepted Answer

インタビュー回答

このタスクは複雑になりすぎずに興味深いので、優れた技術的な議論を始めるのに最適な方法です。このタスクに取り組む私の計画は次のとおりです。

区切り文字として空白と句読点を使用して、入力データを単語に分割します
見つかったすべての単語をTrie構造にフィードし、単語の最後の文字を表すノードでカウンターを更新します
完全に入力されたツリーをトラバースして、カウントが最も高いノードを見つけます

インタビューの文脈で...私はボードや紙に木を描くことによってトライのアイデアを示します。空から始めて、少なくとも1つの繰り返し単語を含む単一の文に基づいてツリーを構築します。「猫はネズミを捕まえることができる」と言います。最後に、ツリーをトラバースして最大数を見つける方法を示します。次に、このツリーがどのように優れたメモリ使用量と優れた単語検索速度を提供し（特に、多くの単語が互いに派生する自然言語の場合）、並列処理に適しているかを正当化します。

ボードに描く

トライの例を描く

デモ

以下のC＃プログラムは、4コアのxeon W3520で75秒で2GBのテキストを処理し、最大8つのスレッドを処理します。パフォーマンスは毎秒約430万語で、入力解析コードは最適ではありません。単語を格納するTrie構造では、自然言語入力を処理するときにメモリは問題になりません。

ノート：

グーテンベルクプロジェクトから得られたテストテキスト
入力解析コードは改行を想定しており、かなり最適ではありません
句読点やその他の非単語の削除はあまりうまく行われていません
複数の小さなファイルではなく1つの大きなファイルを処理するには、ファイル内の指定されたオフセット間でスレッドの読み取りを開始するために少量のコードが必要になります。

using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            TrieNode root = new TrieNode(null, '?');
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();

            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            if (args.Length > 0)
            {
                foreach (string path in args)
                {
                    DataReader new_reader = new DataReader(path, ref root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
            int distinct_word_count = 0;
            int total_word_count = 0;
            root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 1;
        private TrieNode m_root;
        private string m_path;        

        public DataReader(string path, ref TrieNode root)
        {
            m_root = root;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                {
                    using (StreamReader sreader = new StreamReader(fstream))
                    {
                        string line;
                        while ((line = sreader.ReadLine()) != null)
                        {
                            string[] chunks = line.Split(null);
                            foreach (string chunk in chunks)
                            {
                                m_root.AddWord(chunk.Trim());
                            }
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private ConcurrentDictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new ConcurrentDictionary<char, TrieNode>();            
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.TryAdd(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index + 1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index + 1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    lock (this)
                    {
                        m_word_count++;                        
                    }
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index + 1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count++;
                total_word_count += m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            if (m_parent == null) return "";
            else return m_parent.ToString() + m_char;
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}

ここでは、同じ20MBのテキストを8つのスレッドで100回処理した結果を示しています。

Counting words...
Input data processed in 75.2879952 secs

Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times

323618000 words counted
60896 distinct words found

score 24 · Accepted Answer

ここでの多くは、指定されていないいくつかのものに依存します。たとえば、これを1回実行しようとしているのでしょうか、それとも定期的かつ継続的にこれを実行するシステムを構築しようとしているのでしょうか。入力を制御できますか？すべてが単一の言語（英語など）であるテキストを扱っているのですか、それとも多くの言語が表現されているのですか（もしそうなら、いくつありますか）？

これらは次の理由で重要です。

データが単一のハードドライブで開始される場合、並列カウント（map-reduceなど）は実際には効果がありません。ボトルネックはディスクからの転送速度です。より速くカウントできるように、より多くのディスクにコピーを作成すると、1つのディスクから直接カウントするよりも遅くなります。
これを定期的に実行するシステムを設計している場合、私たちの重点のほとんどは実際にはハードウェアにあります。具体的には、帯域幅を増やし、少なくとも少しだけそれに追いつくために、多数のディスクを並列に配置します。 CPU。
読んでいるテキストの量に関係なく、処理する必要のある個別の単語の数には制限があります。テラバイトの英語のテキストであっても、数十億のようなものは表示されません。英語の異なる単語。簡単なチェックを行うと、オックスフォード英語辞典には英語で約600,000語がリストされています。
実際の単語は言語によって明らかに異なりますが、言語ごとの単語数はほぼ一定であるため、作成するマップのサイズは、表現される言語の数に大きく依存します。

それは主に、いくつの言語を表現できるかという問題を残します。今のところ、最悪のケースを想定しましょう。ISO 639-2には、485の人間の言語のコードがあります。言語あたり平均700,000語、たとえば1語あたり10バイトのUTF-8の平均語長を想定します。

単純な線形リストとして保存されるだけです。つまり、地球上のすべての言語のすべての単語を、6ギガバイト弱の8バイトの頻度カウントとともに保存できます。代わりにPatriciaトライのようなものを使用する場合、少なくともある程度縮小することを計画できます。おそらく3ギガバイト以下に縮小する可能性がありますが、これらすべての言語について十分に理解しているわけではありません。

さて、現実には、私たちはそこの多くの場所で数をほぼ確実に過大評価しています-かなりの数の言語がかなりの数の単語を共有し、多くの（特に古い）言語はおそらく英語よりも単語が少なく、リストには、おそらくフォームがまったく書かれていないものが含まれているようです。

概要：ほとんどすべての合理的に新しいデスクトップ/サーバーには、マップを完全にRAMに保持するのに十分なメモリがあり、データが増えてもそれは変わりません。並列の1つ（またはいくつか）のディスクの場合、とにかくI / Oバウンドになるため、並列カウント（およびそのようなもの）はおそらく純損失になります。他の最適化が多くを意味する前に、おそらく数十のディスクを並列に必要とします。

score 15 · Accepted Answer

このタスクでは、 map-reduceアプローチを試すことができます。map-reduceの利点はスケーラビリティであるため、1TB、10TB、または1PBでも同じアプローチが機能し、新しいスケールに合わせてアルゴリズムを変更するために多くの作業を行う必要はありません。フレームワークは、クラスター内にあるすべてのマシン（およびコア）間で作業を分散することも処理します。

最初-(word,occurances)ペアを作成します。
このための擬似コードは次のようになります。

map(document):
  for each word w:
     EmitIntermediate(w,"1")

reduce(word,list<val>):
   Emit(word,size(list))

次に、topKの発生率が最も高いものを、ペアを1回繰り返すだけで簡単に見つけることができます。このスレッドでは、この概念について説明しています。主なアイデアは、上位K個の要素の最小ヒープを保持することです。反復中は、ヒープにこれまでに見た上位K個の要素が常に含まれていることを確認してください。完了すると、ヒープには上位K個の要素が含まれます。

よりスケーラブルな（マシンが少ない場合は遅くなりますが）代替手段は、map-reduceソート機能を使用し、発生に応じてデータをソートし、上位Kをgrepすることです。

score 9 · Accepted Answer

このための3つの注意点。

具体的には、ファイルが大きすぎてメモリに保持できない、ワードリストが（潜在的に）大きすぎてメモリに保持できない、ワード数が32ビット整数に対して大きすぎる可能性があります。

これらの警告を通り抜けたら、それは簡単なはずです。ゲームは潜在的に大きな単語リストを管理しています。

簡単な場合（頭が回転しないようにするため）。

「65KのRAMと1MBのファイルを備えたZ-808ビットマシンを実行しています...」

同じ正確な問題。

score 5 · Accepted Answer

要件によって異なりますが、ある程度のエラーが発生する可能性がある場合、ストリーミングアルゴリズムと確率的データ構造は、時間とスペースの効率が非常に高く、実装が非常に簡単であるため、興味深いものになる可能性があります。

ヘビーヒッター（例：スペース節約）、最も頻繁に使用される上位n個の単語のみに関心がある場合
カウント-最小スケッチ、任意の単語の推定カウントを取得します

これらのデータ構造に必要な定数スペースはごくわずかです（正確な量は、許容できるエラーによって異なります）。

これらのアルゴリズムの優れた説明については、http：//alex.smola.org/teaching/berkeley2012/streams.htmlを参照してください。

score 3 · Accepted Answer

私はDAWG（ウィキペディア、および詳細が記載されたC＃の記事）を使用したいと思うでしょう。リーフノードにカウントフィールドを追加するのは簡単で、メモリの面で効率的で、ルックアップに非常に適しています。

編集：あなたは単にDictionary<string, int>？<string, int>は単語とカウントを表しますか？おそらくあなたは早すぎて最適化しようとしていますか？

編集者のメモ：この投稿は、もともとこのウィキペディアの記事にリンクされていました。これは、DAWGという用語の別の意味についてのようです。効率的な近似文字列マッチングのために、1つの単語のすべての部分文字列を格納する方法。

score 2 · Accepted Answer

別の解決策は、SQLテーブルを使用して、システムに可能な限り適切にデータを処理させることです。wordまず、コレクション内の単語ごとに、単一のフィールドを持つテーブルを作成します。

次に、クエリを使用します（構文の問題で申し訳ありませんが、私のSQLは錆びています-これは実際には擬似コードです）：

SELECT DISTINCT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC

一般的な考え方は、最初にすべての単語を含むテーブル（ディスクに保存されている）を生成し、次にクエリを使用してカウントと並べ替え(word,occurances)を行うことです。次に、取得したリストから上位Kを取得できます。

すべての人に：SQLステートメントに構文やその他の問題がある場合：自由に編集してください

score 2 · Accepted Answer

まず、私は最近Trieデータ構造を「発見」しましたが、zeFrenchyの答えは、私がそれを理解するのに最適でした。

コメントで、パフォーマンスを改善する方法を提案している人が何人かいるのを見ましたが、これらはほんのわずかな調整でしたので、本当のボトルネックであることがわかったものをあなたと共有したいと思いました...ConcurrentDictionary。

スレッドローカルストレージを試してみたかったのですが、サンプルはそれを行う絶好の機会を与えてくれました。スレッドごとに辞書を使用するようにいくつかの小さな変更を加えた後、Join（）でパフォーマンスが約30％向上した後に辞書を結合しました。（8つのスレッドで20MBを100回処理すると、私のボックスでは約48秒から約33秒になりました）。

コードは下に貼り付けられており、承認された回答からあまり変更されていないことに気付くでしょう。

PS私は50以上の評判ポイントを持っていないので、これをコメントに入れることができませんでした。

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            List<ThreadLocal<TrieNode>> roots;
            if (args.Length == 0)
            {
                roots = new List<ThreadLocal<TrieNode>>(1);
            }
            else
            {
                roots = new List<ThreadLocal<TrieNode>>(args.Length);

                foreach (string path in args)
                {
                    ThreadLocal<TrieNode> root = new  ThreadLocal<TrieNode>(() =>
                    {
                        return new TrieNode(null, '?');
                    });

                    roots.Add(root);

                    DataReader new_reader = new DataReader(path, root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
            {
                roots[0].Value.CombineNode(root.Value);
                root.Dispose();
            }

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value, roots[0].Value };
            int distinct_word_count = 0;
            int total_word_count = 0;
            roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);

            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            roots[0].Dispose();

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
            Console.ReadLine();
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 100;
        private TrieNode m_root;
        private string m_path;

        public DataReader(string path, ThreadLocal<TrieNode> root)
        {
            m_root = root.Value;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i++) // fake large data set buy parsing smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                using (StreamReader sreader = new StreamReader(fstream))
                {
                    string line;
                    while ((line = sreader.ReadLine()) != null)
                    {
                        string[] chunks = line.Split(null);
                        foreach (string chunk in chunks)
                        {
                            m_root.AddWord(chunk.Trim());
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private Dictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new Dictionary<char, TrieNode>();            
        }

        public void CombineNode(TrieNode from)
        {
            foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
            {
                char keyChar = fromChild.Key;
                if (!m_children.ContainsKey(keyChar))
                {
                    m_children.Add(keyChar, new TrieNode(this, keyChar));
                }
                m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
                m_children[keyChar].CombineNode(fromChild.Value);
            }
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.Add(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index + 1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index + 1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    m_word_count++;                        
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index + 1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count++;
                total_word_count += m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            return BuildString(new StringBuilder()).ToString();
        }

        private StringBuilder BuildString(StringBuilder builder)
        {
            if (m_parent == null)
            {
                return builder;
            }
            else
            {
                return m_parent.BuildString(builder).Append(m_char);
            }
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}

score 1 · Accepted Answer

簡単な一般的なアルゴリズムとして、これを行います。

Create a map with entries being the count for a specific word and the key being the actual string.  

for each string in content:
   if string is a valid key for the map:
      increment the value associated with that key
   else
      add a new key/value pair to the map with the key being the word and the count being one
done

次に、マップで最大の値を見つけることができます


create an array size 10 with data pairs of (word, count) 

for each value in the map
    if current pair has a count larger than the smallest count in the array
        replace that pair with the current one

print all pairs in array

score 0 · Accepted Answer

非常に興味深い質問です。これは、コーディングよりも論理分析に関連しています。英語と有効な文章を前提とすると、それは簡単になります。

すべての単語を数える必要はありません。指定された言語の平均語長以下の長さの単語だけを数えます（英語の場合は5.1です）。したがって、メモリに問題はありません。

ファイルの読み取りに関しては、並列モードを選択する必要があります。空白のファイル位置を操作して、チャンク（選択したサイズ）を読み取ります。たとえば、1MBのチャンクを読み取る場合は、最初のチャンクを除くすべてのチャンクを少し広くする必要があります（左から+22バイト、右から+22バイト、22は最長の英語の単語を表します-私が正しい場合）。並列処理の場合、マージする並行ディクショナリまたはローカルコレクションが必要になります。

通常、有効なストップワードリストの一部としてトップ10になることに注意してください（これはおそらく、ファイルの内容が通常である限り有効である別の逆のアプローチです）。

score 0 · Accepted Answer

以下の方法では、データを1回だけ読み取り、メモリサイズに合わせて調整できます。

たとえば1GBのチャンクでファイルを読み取ります
チャンクごとに、頻度の高い5000の最も出現する単語のリストを作成します
頻度に基づいてリストをマージします（それぞれ5000語の1000リスト）
マージされたリストのトップ10を返します

理論的には言葉を見逃すかもしれませんが、チャンスは非常に小さいと思います。

score 0 · Accepted Answer

ストームは注目すべき技術です。データ入力（スパウト）の役割をプロセッサ（ボルト）から分離します。ストームブックの第2章では、問題を正確に解決し、システムアーキテクチャについて詳しく説明しています-http ://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

Stormは、Hadoopを使用したバッチ処理とは対照的に、よりリアルタイムの処理です。データが既存のものごとにある場合は、荷重をさまざまな注ぎ口に分散し、さまざまなボルトに処理するためにそれらを分散させることができます。

このアルゴリズムは、日付がリアルタイムで到着したときに分析されるため、テラバイトを超えて増大するデータのサポートも可能にします。

score 0 · Accepted Answer

さて、最初の考えは、ハッシュテーブル/配列または各単語の出現を保存するための形式でdtabaseを管理することですが、データサイズに応じて私はむしろ：

出現>=2である最初の10個の見つかった単語を取得します
これらの単語が文字列全体で何回出現するかを取得し、カウントしながらそれらを削除します
もう一度やり直してください。10語のセットが2つあると、両方のセットの中で最も出現した10語が得られます。
文字列の残りの部分についても同じようにします（これらの単語はもう含まれていません）。

より効率的にして、最初に見つかった10語から始めて、出現回数が5以上になるようにすることもできます。見つからない場合は、10語が見つかるまでこの値を減らします。これにより、大量のデータであるすべての単語の出現を集中的に保存するメモリの使用を回避する良いチャンスがあり、スキャンラウンドを保存できます（良い場合）

ただし、最悪の場合、従来のアルゴリズムよりも多くのラウンドが発生する可能性があります。

ちなみに、その問題は、OOPではなく関数型プログラミング言語で解決しようとしています。

score 0 · Accepted Answer

個人的には、ファイルを128 MBなどのさまざまなサイズに分割し、スキャン中に常に2つをメモリに保持し、検出された単語をハッシュリストに追加し、リストのリストをカウントしてから、リストを繰り返します。トップ10を見つけるために最後のリストの...

score 0 · Accepted Answer

この種の問題に取り組むために、特別なデータ構造を考えてみてください。この場合、文字列を特定の方法で格納するためのトライのような特別な種類のツリーで、非常に効率的です。または、単語を数えるなど、独自のソリューションを構築する2番目の方法。このTBのデータは英語であると思いますが、一般に約600,000語あるので、それらの単語のみを保存し、繰り返される文字列を数えることができます+このソリューションでは、特殊文字を削除するために正規表現が必要になります。最初の解決策はより速くなるでしょう、私はかなり確信しています。

http://en.wikipedia.org/wiki/Trie

これがJavaでのタイヤの実装です
http://algs4.cs.princeton.edu/52trie/TrieST.java.html

score 0 · Accepted Answer

MapReduce
WordCountは、hadoopを使用したmapreduceによって効率的に達成できます。 https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0 大きなファイルを解析できます。クラスター内の複数のノードを使用して、この操作を実行します。

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken());
             output.collect(word, one);
       }
         }

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       int sum = 0;
           while (values.hasNext()) {
             sum += values.next().get();
           }
       output.collect(key, new IntWritable(sum));
     }
       }

c# - 1テラバイトのテキストを解析し、各単語の出現回数を効率的にカウントします

16 に答える 16

インタビュー回答

デモ

Related

Reference