c# - Find the longest sequence of digits in a string

Question

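I'm trying to clean up the results of poor-quality OCR reads by removing everything I can safely assume is a mistake.

The desired result is a 6-digit numeric string, so I can rule out any character that isn't a digit. I also know that the digits of these numbers appear consecutively, so any digits that are out of sequence are very likely to be wrong as well.

(Yes, fixing the quality would be best, but no... they won't/can't change their documents.)

I immediately call Trim() to remove whitespace. Because the results end up as file names, I also remove all illegal characters.

I've worked out which characters are digits and added them to a dictionary keyed by the array position where each was found. This gives me a clear visual indication of the digit sequences, but I'm struggling with the logic for getting my program to recognise them.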

Tested with the string "Oct', 2$3622" (an actual bad read). The ideal output for this would be "3622", which is obvious to a human.

    public String FindLongest(string OcrText)
    {
        try
        {
            Char[] text = OcrText.ToCharArray();
            List<char> numbers = new List<char>();

            Dictionary<int, char> consec = new Dictionary<int, char>();

            for (int a = 0; a < text.Length; a++)
            {
                if (Char.IsDigit(text[a]))
                {
                    consec.Add(a, text[a]);

                    // Won't allow duplicates?
                    //consec.Add(text[a].ToString(), true);
                }
            }

            foreach (var item in consec.Keys)
            {
                #region Idea that didn't work
                // Combine values with consecutive keys into new list
                // With most consecutive?
                for (int i = 0; i < consec.Count; i++)
                {
                    // if index key doesn't match loop, value was not consecutive
                    // Ah... falsely assuming it will start at 1. Won't work.
                    if (item == i)
                        numbers.Add(consec[item]);
                    else
                        numbers.Add(Convert.ToChar("#")); //string split value
                }
                #endregion
            }

            return null;
        }
        catch (Exception ex)
        {
            string message;

            if (ex.InnerException != null)
                message =
                    "Exception: " + ex.Message +
                    "\r\n" +
                    "Inner: " + ex.InnerException.Message;
            else
                message = "Exception: " + ex.Message;
            MessageBox.Show(message);

            return null;
        }
    }

score 5 · Accepted Answer

A quick and dirty way to get the longest sequence of digits is to use a regex like this:

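    var t = "sfas234sdfsdf55323sdfasdf23";

    // Match every run of digits, then take the longest one.
    // Needs System.Linq and System.Text.RegularExpressions.
    var longest = Regex.Matches(t, @"\d+").Cast<Match>().OrderByDescending(m => m.Length).First();

    Console.WriteLine(longest); // prints 55323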

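This actually gets all of the sequences, and you can then use LINQ to select the longest of them.

This doesn't handle multiple sequences of the same length: if several tie for longest, only one of them is returned.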
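If ties matter, one option is to keep every run whose length equals the maximum. The following is only a sketch under that assumption; the class name `DigitRuns` and the method `LongestRuns` are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitRuns
{
    // Hypothetical helper: returns every digit run that ties for the longest length,
    // in the order the runs appear in the input.
    public static List<string> LongestRuns(string text)
    {
        var runs = Regex.Matches(text, @"\d+")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();

        // No digits at all: return an empty list instead of throwing.
        if (runs.Count == 0)
            return new List<string>();

        int maxLength = runs.Max(r => r.Length);
        return runs.Where(r => r.Length == maxLength).ToList();
    }
}
```

For example, `DigitRuns.LongestRuns("ab12cd34ef5")` yields the two runs "12" and "34", leaving it to the caller to decide which one to trust.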

score 1 · Accepted Answer

    var split = Regex.Split(OcrText, @"\D+").ToList();

    var longest = (from s in split
                   orderby s.Length descending
                   select s).FirstOrDefault();

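I would recommend using Regex.Split with \D (@"\D+" in the code), which matches every character that is not a digit, so the string is split on the non-digit runs. Then run a LINQ query to find the longest string by .Length.

As you can see, it's simple and very readable.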

score 1 · Accepted Answer

So you just need to find the longest sequence of digits? Why not use a regex?

    // Verbatim string so that \d is not treated as a C# escape sequence
    Regex reg = new Regex(@"\d+");
    MatchCollection mc = reg.Matches(input);
    foreach (Match mt in mc)
    {
        // mt.Groups[0].Value.Length is the length of the sequence
        // just find the longest
    }

Just a thought.

score 1 · Accepted Answer

Since you strictly want numeric matches, I would suggest using a regex that matches (\d+).

    MatchCollection matches = Regex.Matches(input, @"(\d+)");
    string longest = string.Empty;
    foreach (Match match in matches) {
        if (match.Success) {
            if (match.Value.Length > longest.Length) longest = match.Value;
        }
    }

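This will give you the number with the longest length. If you want to compare the actual values instead (which also works for "longest length", and may resolve the problem of matches with the same length):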

    MatchCollection matches = Regex.Matches(input, @"(\d+)");
    int biggest = 0;
    foreach (Match match in matches) {
        if (match.Success) {
            int current = 0;
            int.TryParse(match.Value, out current);
            if (current > biggest) biggest = current;
        }
    }
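For comparison with the dictionary-based attempt in the question, the same result can also be reached without a regex by scanning the string once and remembering where the longest run of digits started. This is a minimal sketch, not the approach of any answer above; `OcrCleanup` and `FindLongestDigitRun` are hypothetical names, and it assumes that the first run should win when two runs tie.

```csharp
using System;

public static class OcrCleanup
{
    // Hypothetical helper: single pass, tracking the current digit run
    // and the best (longest) run seen so far.
    public static string FindLongestDigitRun(string text)
    {
        int bestStart = 0, bestLength = 0;
        int runStart = 0, runLength = 0;

        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsDigit(text[i]))
            {
                // A new run begins wherever the previous character was not a digit.
                if (runLength == 0) runStart = i;
                runLength++;

                // Strict comparison keeps the earliest run when lengths tie.
                if (runLength > bestLength)
                {
                    bestStart = runStart;
                    bestLength = runLength;
                }
            }
            else
            {
                runLength = 0; // a non-digit ends the current run
            }
        }

        // Returns an empty string when the input contains no digits.
        return text.Substring(bestStart, bestLength);
    }
}
```

Calling `OcrCleanup.FindLongestDigitRun("Oct', 2$3622")` returns "3622". Tracking positions rather than collecting characters avoids the problem noted in the question's code comments, where the loop assumed that a run starts at a fixed index.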
