c# - 2 つの文字列配列の高速な count() 交差

Question

文字列の 2 つの大きな配列の交点に対応する要素の数を数え、それを非常に高速に行う必要があります。

次のコードを使用しています。

arr1[i].Intersect(arr2[j]).Count()

CPU 時間については、VS プロファイラーが示します

85.1%System.Linq.Enumerable.Count()
0.3%System.Linq.Enumerable.Intersect()

残念ながら、すべての作業を完了するには数時間かかる場合があります。

より速くする方法は？

score 4 · Accepted Answer

HashSetで使用できますarr2

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

arr2からへの変換を考慮しないarr2Set場合、これはO(n)

score 1 · Accepted Answer

プロファイラーがで消費された時間を表示する理由Countは、これがコレクションが実際に列挙される場所であると思われます (Intersectは遅延評価され、結果が必要になる前に実行されません)。

これをIntersect合理的に高速化するには、いくつかの内部最適化が必要だと思いHashSet<string>ますが、各要素の内部配列を検索せずに交差を確実に作成できるように、 a を使用してみてください。

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;

score 1 · Accepted Answer

うーん、交差点はおそらくN ^ 2です

両方の配列のクイックソートを高速化します。そして、両方の配列をトラバースします。交差点を数える。

それがどれほど速いかをテストするのが面倒ですが、O(nlogn +n)

    using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

        {

            HashSet<string> set = new HashSet<string>(arr1);
            var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
            set.IntersectWith(arr2);
            int count = set.Count;
            Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
        }
            {
               var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {

                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

結果

配列 914,637

ハッシュセット 816,119

ハッシュセット +new 1,150,978

SortedSet +new 16,173,836

新しいなしの SortedSet 7,946,709

最善の方法は、準備ができているハッシュセットを保持することです。

score 0 · Accepted Answer

小さい方の配列を使用してHashSetを構築し、大きい方の配列をループ処理して、ハッシュセットにアイテムが存在する場合はカウンターをインクリメントします。

score 0 · Accepted Answer

セットを扱っている場合、複雑さは O((n log n)*(m log m)) 程度になります。

ここはもっと速いはずだと思いますが、今は O((n log n)+(m log m)) かどうかはわかりません

possible would be 
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();  

nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();

c# - 2 つの文字列配列の高速な count() 交差

5 に答える 5

Related

Reference