c# - Parsing a StreamReader efficiently using regular expressions

Question

I have a variable:

    StreamReader DebugInfo = GetDebugInfo();
    var text = DebugInfo.ReadToEnd();  // takes 10 seconds!!! because there are a lot of students

where text equals:

<student>
    <firstName>Antonio</firstName>
    <lastName>Namnum</lastName>
</student>
<student>
    <firstName>Alicia</firstName>
    <lastName>Garcia</lastName>
</student>
<student>
    <firstName>Christina</firstName>
    <lastName>SomeLattName</lastName>
</student>
... etc
.... many more students

What I am doing right now is:

  StreamReader DebugInfo = GetDebugInfo();
  var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!

  var mtch = Regex.Match(text , @"(?s)<student>.+?</student>");
  // keep parsing the file while there are more students
  while (mtch.Success)
  {
     AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
     mtch = mtch.NextMatch();
  }

The whole process takes about 25 seconds. Converting the StreamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds, and the rest takes about 15 seconds. I was hoping I could do the two parts at the same time...

Edit

I would like something like:

    const int bufferSize = 1024;

    var sb = new StringBuilder();

    Task.Factory.StartNew(() =>
    {
         Char[] buffer = new Char[bufferSize];
         int count = bufferSize;

         using (StreamReader sr = GetUnparsedDebugInfo())
         {

             while (count > 0)
             {
                 count = sr.Read(buffer, 0, bufferSize);
                 sb.Append(buffer, 0, count);
             }
         }

         var m = sb.ToString();
     });

     Thread.Sleep(100);

     // meanwhile, while the string is still being built, start adding items

     var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>"); 

     // keep parsing the file while there are more nodes
     while (mtch.Success)
     {
         AddStudent(mtch.Value);
         mtch = mtch.NextMatch();
     }

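Edit 2

Summary

Sorry, I should have mentioned that the text is very similar to XML but it is not XML. That is why I have to use regular expressions. In short, I think I can save time, because what I am doing now is converting the stream to a string and then parsing the string. Why not parse the stream directly with a regex? Or, if that is not possible, why not take a chunk of the stream and parse that chunk on a separate thread?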

score 2 · Accepted Answer

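Updated:

This basic code reads a (roughly) 20 megabyte file in 0.75 seconds. At that rate my machine should process roughly 53.33 megabytes in the 2 seconds you reference. Further, 20,000,000 / 2,048 = 9765.625 reads, and 0.75 / 9765.625 = 0.0000768. That means you are reading roughly 2,048 characters every 76.8 microseconds (7.68 × 10^-5 seconds). You need to understand the cost of context switching relative to the timing of your iterations to determine whether the added complexity of multi-threading is appropriate. At 7.68 × 10^-5 seconds per read, I see your reader thread sitting idle most of the time. It doesn't make sense to me. Just use a single loop on a single thread.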

char[] buffer = new char[2048];
using (StreamReader sr = new StreamReader(@"C:\20meg.bin"))
{
    while (sr.Read(buffer, 0, buffer.Length) != 0)
    {
        // do nothing; this loop only measures raw read time
    }
}

For a large operation like this, you want a forward-only, non-cached reader. Your data looks like XML, so an XmlTextReader is a good fit. Here is some sample code. Hope this helps.

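string firstName;
string lastName;
// Wrap the StreamReader returned by GetDebugInfo() in a forward-only XML reader
using (XmlTextReader reader = new XmlTextReader(GetDebugInfo()))
{
    while (reader.Read())
    {
        if (reader.IsStartElement() && reader.Name == "student")
        {
            reader.ReadToDescendant("firstName");
            reader.Read(); // move to the text node inside <firstName>
            firstName = reader.Value;
            reader.ReadToFollowing("lastName");
            reader.Read(); // move to the text node inside <lastName>
            lastName = reader.Value;
            AddStudent(firstName, lastName);
        }
    }
}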

I used the following XML:

<students>
    <student>
        <firstName>Antonio</firstName>
        <lastName>Namnum</lastName>
    </student>
    <student>
        <firstName>Alicia</firstName>
        <lastName>Garcia</lastName>
    </student>
    <student>
        <firstName>Christina</firstName>
        <lastName>SomeLattName</lastName>
    </student>
</students>

It may need some tweaking, but it should run much faster.

score 1 · Accepted Answer

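Regular expressions are not the fastest way to parse a string. You need a tailored parser, similar to XmlReader, that matches your data structure. It lets you read the file in parts and parse it much faster than RegEx does.

Since you have a limited set of tags, a nested FSM approach (http://en.wikipedia.org/wiki/Finite-state_machine) would work.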
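
To make the FSM idea concrete, here is a minimal sketch of a two-state scanner that pulls complete `<student>...</student>` records out of a TextReader one buffer at a time, so the whole file never has to be held in a single string. The class name `StudentScanner`, the method `ReadRecords`, and the default buffer size are hypothetical names chosen for illustration, and the sketch assumes the tags appear literally and are not nested.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StudentScanner
{
    private const string OpenTag = "<student>";
    private const string CloseTag = "</student>";

    // Outside: skipping text until OpenTag. Inside: collecting until CloseTag.
    private enum State { Outside, Inside }

    // Lazily yields each "<student>...</student>" block as it completes.
    public static IEnumerable<string> ReadRecords(TextReader reader, int bufferSize = 2048)
    {
        var buffer = new char[bufferSize];
        var current = new StringBuilder();
        var state = State.Outside;
        int count;

        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < count; i++)
            {
                current.Append(buffer[i]);

                if (state == State.Outside)
                {
                    if (EndsWith(current, OpenTag))
                    {
                        // Start a new record that begins with the opening tag.
                        current.Clear().Append(OpenTag);
                        state = State.Inside;
                    }
                    else if (current.Length >= OpenTag.Length)
                    {
                        // Keep only a short window so skipped text stays bounded.
                        current.Remove(0, 1);
                    }
                }
                else if (EndsWith(current, CloseTag))
                {
                    yield return current.ToString();
                    current.Clear();
                    state = State.Outside;
                }
            }
        }
    }

    // True when the builder's last characters equal the given suffix.
    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (sb.Length < suffix.Length) return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (sb[offset + i] != suffix[i]) return false;
        }
        return true;
    }
}
```

Because state is carried across reads, a tag that is split between two buffers is still recognized. Each yielded record could then be handed to the question's AddStudent, for example `foreach (var s in StudentScanner.ReadRecords(GetDebugInfo())) AddStudent(s);`. Whether this beats the regex on the real data would need to be measured.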

score 1 · Accepted Answer

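Here is what turned out to be the fastest (maybe I need to try more things).

I created an array of arrays, char[][] listToProcess = new char[200000][];, where I place chunks of the stream. On a separate task I start processing each chunk. The code looks like this: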

   StreamReader sr = GetUnparsedDebugInfo(); // get the StreamReader

   var task1 = Task.Factory.StartNew(() =>
   {
       Thread.Sleep(500); // wait a little so there are items on the list (listToProcess) to work with
       StartProcesingList();
   });

   int counter = 0;

   while (true)
   {
       char[] buffer = new char[2048]; // create a new buffer each time, because it is added to the list to process

       var charsRead = sr.Read(buffer, 0, buffer.Length);

       if (charsRead < 1) // if we reach the end then stop
       {
           break;
       }

       if (charsRead < buffer.Length) // a short read: drop the unused tail so no '\0' padding is processed
           Array.Resize(ref buffer, charsRead);

       listToProcess[counter] = buffer;
       counter++;
   }

   task1.Wait();

The StartProcesingList() method basically walks through the list until it reaches a null entry.

    void StartProcesingList()
    {
        int indexOnList = 0;

        while (true)
        {
            if (listToProcess[indexOnList] == null)
            {
                Thread.Sleep(100); // wait a little in case other thread is adding more items to the list

                if (listToProcess[indexOnList] == null)
                    break;
            }

            // Add the chunk to the dictionary. Recall that listToProcess[indexOnList] is a
            // char array, so this basically converts it to a string and splits it where appropriate.
            // There is more logic, for example for the case where the end of one chunk has to be
            // joined with the start of the next chunk on the list.
            ProcessChunk(listToProcess[indexOnList]);

            indexOnList++;                
        }

    }
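
The Thread.Sleep polling above can end the consumer early if the reader stalls for more than 100 ms. As a possible alternative, here is a minimal sketch of the same reader/processor split using BlockingCollection from System.Collections.Concurrent, which blocks the consumer until a chunk arrives and signals the end explicitly. The class `ChunkPipeline`, the method `Run`, and the `processChunk` callback are hypothetical names standing in for the answer's ProcessChunk. This is a swapped-in technique, not what the answer's author measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ChunkPipeline
{
    // Reads chunks on the calling thread and processes them on one background task.
    // processChunk is a hypothetical callback standing in for ProcessChunk.
    public static void Run(TextReader reader, Action<char[]> processChunk, int chunkSize = 2048)
    {
        // Unbounded queue: simple, but memory grows if the consumer falls behind.
        using (var queue = new BlockingCollection<char[]>())
        {
            var consumer = Task.Factory.StartNew(() =>
            {
                // Blocks while the queue is empty; the loop ends after CompleteAdding().
                foreach (char[] chunk in queue.GetConsumingEnumerable())
                {
                    processChunk(chunk);
                }
            });

            try
            {
                while (true)
                {
                    var buffer = new char[chunkSize];
                    int charsRead = reader.Read(buffer, 0, buffer.Length);
                    if (charsRead < 1) break; // end of stream

                    // Trim short reads so the consumer never sees unused padding.
                    if (charsRead < buffer.Length) Array.Resize(ref buffer, charsRead);
                    queue.Add(buffer);
                }
            }
            finally
            {
                queue.CompleteAdding(); // tell the consumer no more chunks are coming
            }

            consumer.Wait(); // rethrows consumer exceptions as AggregateException
        }
    }
}
```

A call might look like `ChunkPipeline.Run(GetUnparsedDebugInfo(), ProcessChunk);`. With a single consumer the chunks are processed in the order they were read, which matters because a record can straddle two chunks.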

score 1 · Accepted Answer

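You can read line by line, but if reading the data takes 15 seconds there is not much you can do to speed things up.

Before making any significant changes, try simply reading all the lines of the file without doing any processing. If that still takes longer than your goal, adjust the goal or change the file format. Otherwise, see how much gain you can expect from optimizing the parsing. RegEx is quite fast for non-complex expressions.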
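
A minimal sketch of that baseline measurement, using Stopwatch to time a read-only pass over the input, might look like this. The class `ReadBaseline` and method `Measure` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadBaseline
{
    // Reads every line and discards it; returns the elapsed time and the line count.
    public static TimeSpan Measure(TextReader reader, out long lineCount)
    {
        var stopwatch = Stopwatch.StartNew();
        lineCount = 0;

        while (reader.ReadLine() != null)
        {
            lineCount++; // no parsing here, so the timing reflects I/O and decoding only
        }

        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, `long n; TimeSpan t = ReadBaseline.Measure(GetDebugInfo(), out n);` gives the time spent purely on reading. Whatever remains of the 25 seconds is the most that optimizing the parsing could recover.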
