c# - 要素が欠落している可能性のある 2 つの非常に大きなコレクションを比較する必要がある

Question

仕事のために書いているプログラムでレンガの壁に出くわしました。具体的にコンテキストを知る必要はありませんが、簡単に言えば、私はそれぞれ約 65 万レコードのコレクションを 2 つ持っています。

コレクション A が正しいとわかっているコレクションであり、コレクション B が間違っているとわかっているコレクションであるとします。

コレクション B には、コレクション A の要素と同じ型のプロパティを持つ複合オブジェクトが含まれています (つまり、次のようになります)。

// Where T : IComparable
IEnumerable<DateTime> A = ...; // Collection of T elements
IEnumerable<Complex> B = ...; // Collection of complex elements.
class Complex<DateTime>
{
   public DateTime Time { get; set; }
   .....
}

私の問題は、基本的に A を順次列挙し、A の現在の要素が B の Complex オブジェクトに存在するかどうかを確認する必要があることです。存在しない場合は、その要素をカプセル化する Complex オブジェクトを作成する必要があります (とりわけ)。

問題は、両方のリストの長さが約 650,000 要素であることに気付いたときに発生します。データセットを減らすことはできません。この 650,000 を使用する必要があります。現在、私はを使用ICollection.Contains()しており、二分探索の (単純な) 実装を試みましたが、時間がかかりすぎます。

何か提案はありますか？

編集:それが役立つ場合、T は IComparable を実装します。EDIT2: もう少しコンテキスト: IEnumerable は、Linq To Objects を使用して DataTable から取得されます。

        IEnumerable<Complex> set = set.Tbl
            .Where(dbObject => dbObject.TS.CompareTo(default(DateTime)) != 0)
            .Select(Func<DataRow,Complex>) // Function that wraps the DataRow in a Complex object
            // Just done to make debugging a little easier so we still have a large sample but small enough that it doesn't make me grow a beard
            .Take(100000) 
            .AsEnumerable<Complex>();

この質問がアーカイブされ、他の誰かがこの問題を解決する必要がある場合に備えて、完全を期すために、現在の実装は次のようになりました

        BDataSet bSet = new BDataSet();
        B_LUTableAdapter adap = new B_LUTableAdapter();
        adap.Fill(bSet.B_LU);
        IEnumerable<Complex> w3 = bSet.B
            .Where(dbObject => dbObject.TS.CompareTo(default(DateTime)) != 0)
            // Function that just wraps datarow into a complex object
            .Select(Func<DataRow, Complex>)
            // Just for sake of debugging speed
            .Take(100000)
            .AsEnumerable<Complex>();

        List<Complex> b = bSet.OrderBy(x => x.Time).ToList<Complex>();
        // Get last & first timestamps
        // Some of the timestamps in b are 01/01/1011 for some reason,
        // So we do this check.
        Complex start = b.Where(x => x.Time != default(DateTime)).First();
        Complex end = b.Last();

        List<DateTime> a = new List<DateTime>();
        // RoundSeconds reduces seconds in a DateTime to 0.
        DateTime current = RoundSeconds(new DateTime(start.Time.Ticks));

        while (current.CompareTo(RoundSeconds(end.Time)) <= 0)
        {
            a.Add(current);
            current = current.Add(TimeSpan.FromMinutes(1));
        }

        IEnumerable<DateTime> times = b.Select(x => x.Time);
        var missing = a.Where(dt => times.Contains(dt));
        foreach (var dt in missing)
        {
            adap.Insert(dt, 0, "", "", "", null, 0, 0);
            // This has since been changed to List.Add()
        }

Cosmin のおかげで、この問題は解決され、完成した実装は次のとおりです。 List expected = new List(); DateTime current = RoundSeconds(new DateTime(start.Time.Ticks));

        while (current.CompareTo(RoundSeconds(end.Time)) <= 0)
        {
            expected.Add(current);
            current = current.Add(TimeSpan.FromMinutes(1));
        }

        Console.WriteLine("Expecting {0} intervals.", expected.Count);
        var missing = b.FindAllMissing(expected, x => x.Time);
        if(!missing.Any()) return;
        Console.WriteLine("{0} missing intervals.", missing.Count());
        foreach (var dt in missing)
        {
            b.Add(new Complex() { /* some values */ });
            //Console.WriteLine("\t> Inserted new record at {0}", dt);
        }

        //.....
        public static IEnumerable<Basic> FindAllMissing<Basic, Complex>(this IEnumerable<Complex> complexList,
        IEnumerable<Basic> basicList,
        Func<Complex, Basic> selector)
        {
            HashSet<Basic> inComplexList = new HashSet<Basic>();
            foreach (Complex c in complexList)
                inComplexList.Add(selector(c));
            List<Basic> missing = new List<Basic>();
            foreach (Basic basic in basicList)
                if (!(inComplexList.Contains(basic)))
                    missing.Add(basic);
            return missing;
        }

score 4 · Accepted Answer

ステップバイステップ:

ジェネリックコレクションの 1 つを使用して、2 番目のコレクションに既に存在するO(1)の高速検索可能なリストを作成します。T提案してもいいですかHashSet<T>
SECOND コレクションを列挙し、すべてのTを最初のステップのコレクションに入れます。
FIRST コレクションを列挙し、各項目がステップ 1 で作成されたコレクションに含まれているかどうかを確認します。その操作はであるため、これで解決策O(1)が得られましたO(n)。
楽しみ。

これは、そのアルゴリズムをジェネリック拡張メソッドとして実装し、LINQ との親和性を高めるクラスです。引数をIEnumerable<T>および returnとして取り、型 (および)IEnumerable<T>に関する仮定を作成しませんでした。私のテストでは、リストを複合型として使用し、単純型を単純型として使用しています。コンソールアプリケーションは、に 600000 個の値を入力し、100000 個の値を列挙子を使用して;にないすべての単純な値をカウントする simple に入れます。それは非常に速いので、それが機能しているのを見る機会がありません。ヒットすると結果が表示されます。TComplexTuple<int,int>intList<Tuple<int,int>>List<int>List<Tuple<int,int>>F5

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication2
{

    static class FixProblem
    {
        public static IEnumerable<T> FindAllThatNeedCreating<T, Complex>(this IEnumerable<Complex> list_of_complex, IEnumerable<T> list_of_T, Func<Complex, T> extract)
        {
            HashSet<T> T_in_list_of_complex = new HashSet<T>();
            foreach (Complex c in list_of_complex)
                T_in_list_of_complex.Add(extract(c));
            List<T> answer = new List<T>();
            foreach (T t in list_of_T)
                if (!T_in_list_of_complex.Contains(t))
                    answer.Add(t);
            return answer;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Test the code
            List<Tuple<int, int>> complex = new List<Tuple<int, int>>();
            List<int> simple = new List<int>();

            // Fill in some random data
            Random rnd = new Random();
            for (int i = 1; i < 600000; i++)
                complex.Add(new Tuple<int, int>(rnd.Next(), rnd.Next()));

            for (int i = 1; i < 100000; i++)
                simple.Add(rnd.Next());

            // This is the magic line of code:
            Console.WriteLine(complex.FindAllThatNeedCreating(simple, x => x.Item1).Count());

            Console.ReadKey();

        }
    }
}

score 0 · Accepted Answer

あなたの要件を理解していれば、このコードはうまく機能すると思います。両方のコレクションで 650 万レコードに増やしましたが、11 秒で完了しました。650k レコードに戻すのに 1 秒もかかりません。

var rnd = new Random();

var xs = Enumerable
    .Range(0, 650000)
    .Select(x => new Complex<int>()
    {
        MyComplex = rnd.Next(0, 100001)
    })
    .ToList();

var ys = Enumerable
    .Range(0, 650000)
    .Select(x => rnd.Next(0, 100001))
    .ToArray();

var xsLookup = xs.ToLookup(x => x.MyComplex);

var newYs = ys.Where(y => !xsLookup[y].Any()).ToArray();

newYs
    .ToList()
    .ForEach(y =>
    {
        xs.Add(new Complex<int>()
        {
            MyComplex = y
        });
    });

score 0 · Accepted Answer

リストが両方とも同じキーで並べられている場合は、両方を同時にループできます。containsこれを行うことで、時間を短縮することができましたexcept。

var A = SimpleCollection; // source of truth
var B = GetComplexOnDemand().ToArray();
var missing = new List<Complex<DateTime>>();
int cpxIx = 0;
int simpleIx = 0;
while (simpleIx < A.Count)
{
    if (cpxIx >= B.Length)
    {
        missing.Add(new Complex<DateTime>() {InnerValue = A[simpleIx]});
        simpleIx++;
        continue;
    }

    if (A[simpleIx] != B[cpxIx].InnerValue)
    {
        missing.Add(new Complex<DateTime>() {InnerValue = A[simpleIx]});
        simpleIx++;
        continue;
    }

    cpxIx++;
    simpleIx++;
}

テストデータを生成するには、次のことを行いました。

private static readonly List<DateTime> SimpleCollection = 
    Enumerable.Range(1, SimpleSize).Select(t => new DateTime(DateTime.Now.Ticks + t)).ToList();
public static IEnumerable<Complex<DateTime>> GetComplexOnDemand()
{
    for (int i = 1; i <= ComplexSize; i+=2)
    {
        yield return new Complex<DateTime>() { InnerValue = SimpleCollection[i] };
    }
}

score 0 · Accepted Answer

A タイプのプロパティをキーとして使用して、複雑なオブジェクトをディクショナリに格納することをお勧めします。次に、 A の要素が Dictionary に存在するかどうかをすぐに確認できます。各ルックアップ操作は O(1) になり、全体的なパフォーマンスは O(n) になります。

または、既存の IEnumerable で .ToLookup() を呼び出し、ラムダ式でキーと値を作成することもできます。返された ILookup は、ニーズに最適なはずです (つまり、A から値を検索するなど)。ここでの追加の利点は、A を含む複数の複雑なオブジェクトがある場合、Lookup を使用するとそれらのコレクションが返されることです。

c# - 要素が欠落している可能性のある 2 つの非常に大きなコレクションを比較する必要がある

5 に答える 5

Related

Reference