c# - String.comparison performance (with trim)

Question

I need to do a lot of high-performance, case-insensitive string comparisons, and I realized that my way of doing it, .ToLower().Trim(), was really stupid because of all the new strings being allocated.

So I dug around a little, and this approach seems preferable:

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

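The only problem here is that I want to ignore leading and trailing spaces, i.e. Trim(), but if I use Trim I have the same problem with string allocations. I guess I could check each string to see whether it StartsWith(" ") or EndsWith(" ") and only then Trim. Either that, or figure out the index and length for each string and pass them to this string.Compare overload: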

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

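But that seems rather messy, and I would probably need to use a few integers unless I write a really big if-else statement for every combination of leading and trailing blanks on both strings... Any ideas for an elegant solution?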

Here is my current proposal:

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
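        // Shortcut: if the lengths differ by more than 2, the strings can't be equal.
        // NOTE: this only holds when each string carries at most two padding characters.
        if (Math.Abs(a.Length - b.Length) > 2)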
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }

score 6 · Accepted Answer

Something like this should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   // Skip leading white space in both strings.
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   // Shrink the lengths to drop trailing white space.
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return -1; // an empty string sorts before a non-empty one
   if (lenB == 0) return 1;
   // Compare the common prefix in place, then break ties on length.
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), StringComparison.OrdinalIgnoreCase);
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

Output:

0

You should profile this against a simple Trim-and-Compare using real data, to see whether it actually makes a difference for what you are going to use it for.
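The profiling suggested above could be sketched roughly as follows. This is a minimal timing harness, not a rigorous benchmark: the class and method names (`TrimCompareBenchmark`, `TrimEquals`, `SpanEquals`, `Measure`) are made up for illustration, the inputs are assumed to be non-null, and the allocation-free variant uses `ReadOnlySpan<char>` (assuming .NET Core 2.1 or later) rather than the index arithmetic from the answer.

```csharp
using System;
using System.Diagnostics;

public static class TrimCompareBenchmark
{
    // Baseline: may allocate up to two new strings per call.
    public static bool TrimEquals(string a, string b) =>
        string.Equals(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase);

    // Allocation-free variant: trims views over the original strings.
    public static bool SpanEquals(string a, string b) =>
        a.AsSpan().Trim().Equals(b.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);

    // Times both variants over the same inputs and returns the elapsed time of each.
    public static (TimeSpan Trim, TimeSpan Span) Measure(string a, string b, int iterations)
    {
        int matches = 0; // consumed below so the calls are not optimized away

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            if (TrimEquals(a, b)) matches++;
        TimeSpan trim = sw.Elapsed;

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            if (SpanEquals(a, b)) matches++;
        TimeSpan span = sw.Elapsed;

        GC.KeepAlive(matches);
        return (trim, span);
    }
}
```

Calling `TrimCompareBenchmark.Measure(a, b, 1_000_000)` with strings taken from the real workload gives a first impression; the numbers depend heavily on string length, how often trimming is actually needed, and the runtime, so a dedicated benchmarking tool is the better choice for any decision that matters.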

score 3 · Accepted Answer

I would use the code you have

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

and add .Trim() calls as you need them. Compared with your initial option, this saves four strings most of the time (.ToLower().Trim() on both inputs) and two strings all of the time (.ToLower()).

If you still have performance problems after this, then your "messy" option is probably the best bet.

score 2 · Accepted Answer

Can't you just trim (and possibly lowercase) each string exactly once, when you obtain it? Doing more than that sounds like premature optimization...

score 2 · Accepted Answer

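First, make sure that you really need to optimize this code. Creating copies of the strings may not noticeably affect your program.

If you really do need to optimize, you can try processing the strings when you first store them rather than when you compare them (assuming the two happen at different stages of your program). For example, store trimmed and lowercased versions of the strings, so that comparing them later is a simple equality check.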
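As a rough illustration of normalizing at storage time, the sketch below computes a comparison key once in the constructor. `NormalizedText`, `Key`, and `Matches` are hypothetical names, and the key is upper-cased with the invariant culture, which is an assumption chosen to approximate an ordinal, case-insensitive comparison.

```csharp
using System;

// Hypothetical wrapper: pays the Trim/upper-case cost once, at construction time.
public sealed class NormalizedText
{
    public string Original { get; }

    // Trimmed, upper-cased copy used only for comparisons.
    public string Key { get; }

    public NormalizedText(string text)
    {
        Original = text ?? throw new ArgumentNullException(nameof(text));
        Key = text.Trim().ToUpperInvariant();
    }

    // Later comparisons are plain ordinal equality on the stored keys.
    public bool Matches(NormalizedText other) =>
        other != null && string.Equals(Key, other.Key, StringComparison.Ordinal);
}
```

This trades memory for speed: each value keeps a second string alive, which only pays off when the same strings are compared many times.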

score 0 · Accepted Answer

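The warnings about premature optimization are correct, but I'll assume you have tested this and found that a lot of time is being wasted copying strings. In that case I would try the following: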

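int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

// Comparing over the longer trimmed length still detects a length mismatch:
// the last character of the longer trimmed string is not white space, while the
// shorter string has white space or nothing at that position.
int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength, StringComparison.OrdinalIgnoreCase);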

FindStartAndLength is a function that finds the start index and length of the "trimmed" string (this is untested, but it should give you the general idea):

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    // Check the bound first so an empty or all-white-space string does not index out of range.
    startIndex = 0;
    while(startIndex < text.Length && char.IsWhiteSpace(text[startIndex]))
        startIndex++;

    length = text.Length - startIndex;
    while(length > 0 && char.IsWhiteSpace(text[startIndex + length - 1]))
        length--;
}

score 0 · Accepted Answer

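The thing is, if it needs to be done, it needs to be done. I don't think any of your different solutions will make a difference. In each case, a number of comparisons are needed to find or remove the white space.

Apparently, removing the white space is part of the problem, so you shouldn't worry about that.
Also, lowercasing a string before comparing it is a bug if you use Unicode characters, and it is possibly slower than copying the string.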
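To make the Unicode point concrete, here is a small sketch contrasting the two approaches. The class and method names are illustrative only. The commonly cited example is Turkish casing, where 'I' lowercases to the dotless 'ı'; whether that shows up on a given machine depends on the culture data available to the runtime, so treat the behaviour described below as an expectation rather than a guarantee.

```csharp
using System;
using System.Globalization;

public static class LowercasePitfall
{
    // Culture-sensitive lowercasing followed by an ordinal comparison.
    // Allocates two strings and depends on the casing rules of the given culture.
    public static bool LowerThenCompare(string a, string b, CultureInfo culture) =>
        string.Equals(a.ToLower(culture), b.ToLower(culture), StringComparison.Ordinal);

    // Case-insensitive ordinal comparison: no culture involved, no copies made.
    public static bool OrdinalIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

With Turkish culture data present, `LowerThenCompare("FILE", "file", new CultureInfo("tr-TR"))` would be expected to return false, because "FILE" lowercases to "fıle", whereas `OrdinalIgnoreCase("FILE", "file")` does not depend on the culture.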

score 0 · Accepted Answer

You can implement your own StringComparer. Here is a basic implementation:

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        // Compare the common prefix in place, then let the shorter string sort first.
        int result = String.Compare(x, indexX, y, indexY, Math.Min(lengthX, lengthY), _comparisonType);
        if (result != 0)
            return result;

        return lengthX.CompareTo(lengthY);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

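Remarks:

It has not been tested extensively and may contain bugs
Performance has not been evaluated yet (but it is probably better than calling Trim and ToLower anyway)
The GetHashCode method is not implemented, so don't use it as a key in a dictionary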

score 0 · Accepted Answer

I notice that your first suggestion only compares for equality rather than for sorting, which allows some further efficiency savings.

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

Of course, what is an equality checker without a matching hash code producer:

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
        //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}
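An equality check and a matching hash code are exactly what `Dictionary` and `HashSet` ask for through `IEqualityComparer<string>`. The sketch below shows one way to package that pair. It is not the code from the answer above: it swaps the hand-written loops for the span APIs (`AsSpan().Trim()` and the static `string.GetHashCode(ReadOnlySpan<char>, StringComparison)`), which assumes .NET Core 3.0 or later, and the class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: trimmed, ordinal, case-insensitive equality without allocating.
public sealed class TrimmedOrdinalIgnoreCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;       // same instance, or both null
        if (x == null || y == null) return false;     // exactly one is null
        return x.AsSpan().Trim().Equals(y.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }

    // Hashes the trimmed view with the same comparison rules used by Equals,
    // so strings that compare equal always produce the same hash code.
    public int GetHashCode(string s) =>
        s == null ? 0 : string.GetHashCode(s.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
}
```

Passing an instance to a collection, for example `new HashSet<string>(new TrimmedOrdinalIgnoreCaseComparer())`, makes lookups ignore case and surrounding white space. Unlike the sampled hash above, this version hashes every character of the trimmed string, so it trades some hashing speed for fewer collisions.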
