c# - What is the fastest way to search for text in multiple files?

Question

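I need to search for text in about 120 text files, and I would like to know the best and fastest way to do it. Should I read each file into a RichTextBox and use its methods to search the text, or should I read the files into a string variable and search with a regular expression?

I think the main factor behind performance is finding a way to avoid looping over lines that have already been tested for a match. Is there a way to find all the matches in a file in one go? Does anyone know how Visual Studio finds matches in text files? It searched 200 text files in about 800 to 1000 milliseconds. I assume it uses multiple threads to achieve this.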

score 2 · Accepted Answer

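From your description (120 files, 70K to 80K words, 1 to 2 MB per file), the best approach seems to be to read the files once and build a searchable index. I have included the example below to illustrate how that could be done, but it may be of limited use if you need to match search terms that are more complex than exact words or prefixes.

If you need more complex text search matching (while still getting good performance), I recommend looking into the excellent Lucene library, which was built specifically for this purpose.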

using System.Collections.Generic;
using System.IO;
using System.Linq;

public struct WordLocation
{
    public WordLocation(string fileName, int lineNumber, int wordIndex)
    {
        FileName = fileName;
        LineNumber = lineNumber;
        WordIndex = wordIndex;
    }
    public readonly string FileName; // file containing the word.
    public readonly int LineNumber;  // line within the file.
    public readonly int WordIndex;   // index within the line.
}

public struct WordOccurrences
{
    private WordOccurrences(int nOccurrences, WordLocation[] locations)
    {
        NumberOfOccurrences = nOccurrences;
        Locations = locations;
    }

    public static readonly WordOccurrences None = new WordOccurrences(0, new WordLocation[0]);

    public static WordOccurrences FirstOccurrence(string fileName, int lineNumber, int wordIndex)
    {
        return new WordOccurrences(1, new [] { new WordLocation(fileName, lineNumber, wordIndex) });
    }

    public WordOccurrences AddOccurrence(string fileName, int lineNumber, int wordIndex)
    {
        return new WordOccurrences(
            NumberOfOccurrences + 1, 
            Locations
                .Concat(
                    new [] { new WordLocation(fileName, lineNumber, wordIndex) })
                .ToArray());
    }

    public readonly int NumberOfOccurrences;
    public readonly WordLocation[] Locations;
}

public interface IWordIndexBuilder
{
    void AddWordOccurrence(string word, string fileName, int lineNumber, int wordIndex);
    IWordIndex Build();
}

public interface IWordIndex
{
    WordOccurrences Find(string word);
}

public static class BuilderExtensions
{
    public static IWordIndex BuildIndexFromFiles(this IWordIndexBuilder builder, IEnumerable<FileInfo> wordFiles)
    {
        var wordSeparators = new char[] {',', ' ', '\t', ';' /* etc */ };
        foreach (var file in wordFiles)
        {
            var lineNumber = 1;
            using (var reader = file.OpenText())
            {
                while (!reader.EndOfStream)
                {
                    var words = reader
                         .ReadLine() 
                         .Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries)
                         .Select(f => f.Trim());

                    var wordIndex = 1;
                    foreach (var word in words)
                        builder.AddWordOccurrence(word, file.FullName, lineNumber, wordIndex++);

                    lineNumber++;
                }
            }
        }
        return builder.Build();
    }
}

The simplest index implementation (which can only do exact-match lookups) then uses a dictionary internally.

public class DictionaryIndexBuilder : IWordIndexBuilder
{
    private Dictionary<string, WordOccurrences> _dict;

    private class DictionaryIndex : IWordIndex 
    {
        private readonly Dictionary<string, WordOccurrences> _dict;

        public DictionaryIndex(Dictionary<string, WordOccurrences> dict)
        {
            _dict = dict;
        }
        public WordOccurrences Find(string word)
        {
           WordOccurrences found;
           if (_dict.TryGetValue(word, out found))
               return found;
           return WordOccurrences.None;
        }
    }

    public DictionaryIndexBuilder(IEqualityComparer<string> comparer)
    {
        _dict = new Dictionary<string, WordOccurrences>(comparer);
    }
    public void AddWordOccurrence(string word, string fileName, int lineNumber, int wordIndex)
    {
        WordOccurrences current;
        if (!_dict.TryGetValue(word, out current))
            _dict[word] = WordOccurrences.FirstOccurrence(fileName, lineNumber, wordIndex);
        else
            _dict[word] = current.AddOccurrence(fileName, lineNumber, wordIndex);
    }
    public IWordIndex Build()
    {
        var dict = _dict;
        _dict = null;
        return new DictionaryIndex(dict);
    }
}

Usage:

var builder = new DictionaryIndexBuilder(EqualityComparer<string>.Default);
var index = builder.BuildIndexFromFiles(myListOfFiles);
var matchSocks = index.Find("Socks");
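For comparison, here is a more compact, self-contained sketch of the same idea. It is not the code above: the name `SimpleWordIndex` and the tuple-based result type are assumptions made for illustration, and it indexes a single sequence of lines rather than a set of files. Each word maps to a growable list, so recording an occurrence appends to that list instead of copying an array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical compact variant of the index above, for illustration only.
public static class SimpleWordIndex
{
    private static readonly char[] Separators = { ',', ' ', '\t', ';' };

    // Maps each word to the (line number, word position) pairs where it occurs.
    public static Dictionary<string, List<(int Line, int Word)>> Build(
        IEnumerable<string> lines, IEqualityComparer<string> comparer)
    {
        var index = new Dictionary<string, List<(int Line, int Word)>>(comparer);
        var lineNumber = 1;
        foreach (var line in lines)
        {
            var words = line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
            for (var i = 0; i < words.Length; i++)
            {
                if (!index.TryGetValue(words[i], out var hits))
                {
                    hits = new List<(int Line, int Word)>();
                    index[words[i]] = hits;
                }
                hits.Add((lineNumber, i + 1)); // 1-based positions, as in the example above
            }
            lineNumber++;
        }
        return index;
    }
}
```

Passing `StringComparer.OrdinalIgnoreCase` as the comparer makes lookups case-insensitive, and the lines of one file could be supplied with `File.ReadLines(path)`.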

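If you also want prefix searches, implement an index builder/index class that uses a sorted dictionary (and either change the IWordIndex.Find method to return multiple matches, or add a new method to the interface for finding partial/pattern matches).

If you want to do more complex lookups, go for something like Lucene.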
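One way a prefix lookup might work is sketched below. Note that it swaps in a sorted array with a binary search rather than a sorted dictionary, because a binary search gives the starting point of the prefix range directly. The class `SortedPrefixIndex` and its `FindByPrefix` method are hypothetical names, not part of the interfaces above, and the sketch returns only the matching words rather than their locations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public sealed class SortedPrefixIndex
{
    private readonly string[] _words; // distinct words in ordinal order

    public SortedPrefixIndex(IEnumerable<string> words)
    {
        _words = words.Distinct(StringComparer.Ordinal)
                      .OrderBy(w => w, StringComparer.Ordinal)
                      .ToArray();
    }

    // Returns every indexed word that starts with the given prefix.
    public IEnumerable<string> FindByPrefix(string prefix)
    {
        // On a miss, BinarySearch returns the bitwise complement of the insertion point.
        var start = Array.BinarySearch(_words, prefix, StringComparer.Ordinal);
        if (start < 0) start = ~start;

        // In ordinal order, all words sharing the prefix sit next to each other.
        for (var i = start; i < _words.Length; i++)
        {
            if (!_words[i].StartsWith(prefix, StringComparison.Ordinal)) yield break;
            yield return _words[i];
        }
    }
}
```

Each matching word could then be passed to the exact-match `Find` to retrieve its locations.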

score 0 · Accepted Answer

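Here is what I would do if I were in your place:

1- Load all the file paths into a list of strings.

2- Create a new list to store the file paths that match the search term.

3- Loop over the file list with foreach, search each file for the term, and add each matching file to the new list.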

using System.Collections.Generic;
using System.IO;

string searchTerm = "Some terms";
string[] MyFilesList = Directory.GetFiles(@"c:\txtDirPath\", "*.txt");
List<string> FoundedSearch = new List<string>();
foreach (string filename in MyFilesList)
{
    // Read the whole file into memory and test it for the search term.
    string textFile = File.ReadAllText(filename);
    if (textFile.Contains(searchTerm))
    {
        FoundedSearch.Add(filename);
    }
}

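You can then process the FoundedSearch list however you need.

By the way:

I don't know what the best answer is, but performance is very good up to 800 text files of about 1000 words each. You can see the performance fairly well in this chart.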
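Since the question guesses that multiple threads are involved, the same loop could also be spread across threads with PLINQ. This is only a sketch: `ParallelFileSearch` and `FindFilesContaining` are hypothetical names, and whether it is actually faster depends on the disk and the file sizes, so it is worth measuring.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, a parallel variant of the foreach loop above.
public static class ParallelFileSearch
{
    public static List<string> FindFilesContaining(IEnumerable<string> fileNames, string searchTerm)
    {
        return fileNames
            .AsParallel() // PLINQ distributes the files across worker threads
            .Where(name => File.ReadAllText(name)
                               .IndexOf(searchTerm, StringComparison.Ordinal) >= 0)
            .ToList();    // the order of the results is not guaranteed
    }
}
```

It would be called with the same inputs as the loop, for example `ParallelFileSearch.FindFilesContaining(MyFilesList, searchTerm)`.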
