c# - Optimized method to remove duplicate file entries

Question

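I have the following code that I would like to optimize. Because I expect the file to be large, I did not use a HashMap to store the lines and chose a string array instead. I tested the logic with n around 500,000 and it ran for about 14 minutes. I would definitely like it to be much faster than that, and I would appreciate any help or suggestions.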

        public static void RemoveDuplicateEntriesinFile(string filepath)
        {
              if (filepath == null)
                    throw new ArgumentException("Please provide a valid FilePath");
              String[] lines = File.ReadAllLines(filepath);
              for (int i = 0; i < lines.Length; i++)
              {
                    for (int j = (i + 1); j < lines.Length; j++)
                    {
                          if ((lines[i] !=null) && (lines[j]!=null) && lines[i].Equals(lines[j]))
                          {//replace duplicates with null
                                lines[j] = null;
                          }
                    }
              }

              File.WriteAllLines(filepath, lines);
        }

Thanks in advance!

score 1 · Accepted Answer

"Because I expect the file to be large, I did not use a HashMap to store the lines and chose a string array instead."

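I disagree with your reasoning. The larger the file, the greater the performance benefit you get from hashing. Your code compares each line against every subsequent line, which gives O(n²) computational complexity over the whole file.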

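With an efficient hashing algorithm, on the other hand, each hash lookup completes in O(1) on average, so the computational complexity of processing the whole file becomes O(n).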
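To put numbers on the two growth rates, here is a minimal sketch that counts the work for n = 500,000, the size mentioned in the question. The class and method names (ComparisonCount, PairwiseIterations, HashLookups) are made up for illustration. The count says nothing about wall-clock time, which also depends on line length and I/O.

```csharp
using System;

public static class ComparisonCount
{
    // Inner-loop iterations of the nested loops in the question:
    // every pair (i, j) with i < j is visited once, i.e. n(n-1)/2.
    public static long PairwiseIterations(long n) => n * (n - 1) / 2;

    // A hash-based pass performs one set lookup per line.
    public static long HashLookups(long n) => n;
}
```

For n = 500,000, PairwiseIterations returns 124,999,750,000 while HashLookups returns 500,000.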

Try using a HashSet<string> and see the difference in processing time.

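// Requires: using System; using System.Collections.Generic; using System.IO;
public static void RemoveDuplicateEntriesinFile(string filepath)
{
    if (filepath == null)
        throw new ArgumentException("Please provide a valid FilePath");

    // The constructor reads the file to the end and keeps one copy of each distinct line.
    HashSet<string> hashSet = new HashSet<string>(File.ReadLines(filepath));
    // Write the distinct lines back over the original file.
    File.WriteAllLines(filepath, hashSet);
}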
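HashSet<T> is documented as an unsorted collection, so if the output must keep each line at the position where it first appeared, it may be safer not to rely on the set's enumeration order. A minimal sketch of that variant follows. The names LineDeduplicator and DistinctInOrder are hypothetical, and the default string comparer (ordinal, case-sensitive) is assumed to be the desired notion of "duplicate".

```csharp
using System.Collections.Generic;

public static class LineDeduplicator
{
    // Returns the distinct lines in order of first occurrence.
    public static List<string> DistinctInOrder(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();   // used only for membership checks
        var result = new List<string>();    // carries the output order

        foreach (string line in lines)
        {
            // Add returns false when the line is already in the set.
            if (seen.Add(line))
                result.Add(line);
        }
        return result;
    }
}
```

The returned list can be passed to File.WriteAllLines in place of the set. It costs one extra reference per distinct line compared with writing the set directly.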

Edit: Could you try the following version of the algorithm and check how long it takes? It is optimized to minimize memory consumption.

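// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Security.Cryptography; using System.Text;
HashSet<string> hashSet = new HashSet<string>();   // holds one hash per distinct line, not the line itself
string tempFilePath = filepath + ".tmp";

using (HashAlgorithm hashAlgorithm = SHA256.Create())
using (var fs = new FileStream(tempFilePath, FileMode.Create, FileAccess.Write))
using (var sw = new StreamWriter(fs))
{
    // Stream the input one line at a time instead of loading the whole file.
    foreach (string line in File.ReadLines(filepath))
    {
        byte[] lineBytes = Encoding.UTF8.GetBytes(line);
        byte[] hashBytes = hashAlgorithm.ComputeHash(lineBytes);
        string hash = Convert.ToBase64String(hashBytes);

        // Add returns false when the hash is already present, i.e. the line is a duplicate.
        if (hashSet.Add(hash))
            sw.WriteLine(line);
    }
}

// Replace the original file with the de-duplicated copy.
File.Delete(filepath);
File.Move(tempFilePath, filepath);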
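To check how long each version takes, a small timing helper along these lines could be used. This is a sketch only: Timing and Measure are hypothetical names, and a single run on one machine is a rough indication at best.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the given action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, Timing.Measure(() => RemoveDuplicateEntriesinFile(path)) would time one run against a copy of the input file, where path is a placeholder for the file to test. Because the method overwrites its input, each version should be measured on a fresh copy.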

score 0 · Accepted Answer

You could try building a new list and adding the lines to it.

        public static void RemoveDuplicateEntriesinFile(string filepath)
        {
              if (filepath == null)
                    throw new ArgumentException("Please provide a valid FilePath");
              String[] lines = File.ReadAllLines(filepath);
              List<String> newLines = new List<String>();
              foreach (string s in lines)
              {
                   // skip lines that have already been added
                   if (newLines.Contains(s))
                         continue;
                   newLines.Add(s);
              }
              // File.WriteAllLines also accepts an IEnumerable<string>, so the list can be passed directly
              File.WriteAllLines(filepath, newLines);
        }

score 0 · Accepted Answer

lines[j] = null; did not work for me. File.WriteAllLines(filepath, lines); writes those lines as "" (string.Empty).
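One way around this, assuming the null-marking approach from the question is kept, is to drop the null entries before writing. A minimal sketch with hypothetical names (NullLineFilter, WithoutNulls):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NullLineFilter
{
    // Returns the lines with null entries removed, order unchanged.
    // Lines that are genuinely empty ("") are kept, since they are not null.
    public static List<string> WithoutNulls(IEnumerable<string> lines)
    {
        return lines.Where(line => line != null).ToList();
    }
}
```

In the question's method, the result of WithoutNulls(lines) would be passed to File.WriteAllLines instead of the raw array.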
