c# - 個別の値を保持しながら、.NET で文字列配列を効率的にマージする

Question

.NET 3.5 を使用しています。1 つ以上の値を共有する 2 つの文字列配列があります。

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

重複する値のない 1 つの配列にそれらをマージする方法が欲しいです。

{ "apple", "orange", "banana", "pear", "grape" }

私はLINQでこれを行うことができます:

string[] result = list1.Concat(list2).Distinct().ToArray();

しかし、それは大規模な配列ではあまり効率的ではないと思います。

より良い方法はありますか？

score 109 · Accepted Answer

string[] result = list1.Union(list2).ToArray();

from msdn： "このメソッドは、戻りセットから重複を除外します。これは、重複を含む入力シーケンスのすべての要素を返すConcat（TSource）メソッドとは異なる動作です。"

score 12 · Accepted Answer

なぜ効率が悪いと思いますか？私の知る限り、Concat と Distinct はどちらも遅延評価され、Distinct の舞台裏で HashSet を使用して、既に返された要素を追跡します。

一般的な方法でそれよりも効率的にする方法がわかりません:)

編集: Distinct は実際には HashSet の代わりに Set (内部クラス) を使用しますが、要点はまだ正しいです。これは、LINQ がいかに優れているかを示す良い例です。最も単純な答えは、ドメインの知識がなくても達成できるのと同じくらい効率的です。

効果は次のものと同等です。

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}

score 3 · Accepted Answer

.NET 3.5 では、これを実行できる HashSet クラスが導入されました。

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

パフォーマンスはわかりませんが、あなたが与えたLinqの例よりも優れているはずです。

編集：私は訂正しました。Concat と Distinct の遅延実装には、重要なメモリと速度の利点があります。Concat/Distinct は約 10% 高速で、データの複数のコピーを保存します。

コードで確認しました：

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

次の出力です。

        int num = 3000000;
        int num10Pct = (int)(num / 10);

        Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
        string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
        string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();

        Console.WriteLine("Starting Hashset...");
        Stopwatch sw = new Stopwatch();
        sw.Start();
        string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
        sw.Stop();
        Console.WriteLine("HashSet: " + sw.Elapsed);

        Console.WriteLine("Starting Concat/Distinct...");
        sw.Reset();
        sw.Start();
        string[] merged2 = list1.Concat(list2).Distinct().ToArray();
        sw.Stop();
        Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

score 2 · Accepted Answer

免責事項これは時期尚早の最適化です。サンプル配列では、3.5 拡張メソッドを使用します。この領域でパフォーマンスの問題があることがわかるまでは、ライブラリコードを使用する必要があります。

配列を並べ替えることができる場合、またはコードのそのポイントに到達したときに配列が並べ替えられている場合は、次の方法を使用できます。

これらは、両方から 1 つのアイテムを取得し、「最も低い」アイテムを生成してから、両方のソースが使い果たされるまで、対応するソースから新しいアイテムをフェッチします。2 つのソースからフェッチされた現在のアイテムが等しい場合、最初のソースからアイテムが生成され、両方のソースでそれらがスキップされます。

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}

ソースの 1 つに重複が含まれている場合、出力に重複が表示される可能性があることに注意してください。既に並べ替えられたリストでこれらの重複を削除する場合は、次の方法を使用します。

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

これらはいずれも内部的にデータ構造を使用してデータのコピーを保持しないことに注意してください。そのため、入力がソートされている場合はコストがかかりません。それを保証できない、または保証しない場合は、既に見つけた 3.5 拡張メソッドを使用する必要があります。

上記のメソッドを呼び出すコード例を次に示します。

String[] list_1 = { "apple", "orange", "apple", "banana" };
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);

score 1 · Accepted Answer

測定するまで、どちらのアプローチが速いかはわかりません。LINQ の方法はエレガントで理解しやすいものです。

もう 1 つの方法は、セットをハッシュ配列 (Dictionary) として実装し、両方の配列のすべての要素をセットに追加することです。次に、set.Keys.ToArray() メソッドを使用して、結果の配列を作成します。

score 1 · Accepted Answer

おそらく、値をキーとしてハッシュテーブルを作成し（まだ存在しないものを追加するだけ）、キーを配列に変換することが実行可能な解決策になる可能性があります。

c# - 個別の値を保持しながら、.NET で文字列配列を効率的にマージする

6 に答える 6

Related

Reference