c# - Find the index of a blank line in a string

Question

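Suppose I have a string that holds the contents of a text file, including carriage returns, tabs, and so on. How can I find the index of the first blank line in that string (counting lines that contain only whitespace as blank)?

What I've tried:

I currently have a working function that leverages a bunch of ugly code to find the index of the blank line. There has to be a more elegant and readable way to do this.

To be clear, the function below returns the section of the string that runs from a supplied "title" up to the index of the first blank line after that title. I'm providing it in full because most of it is taken up by the search for that index, and to head off the "why in the world would you need the index of a blank line" question. It should also help counteract the XY problem, if that is what is happening here.

The (apparently working, not tested against every edge case) code: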

// Get subsection indicated by supplied title from supplied section
private static string GetSubSectionText(string section, string subSectionTitle)
{
        int indexSubSectionBgn = section.IndexOf(subSectionTitle);
        if (indexSubSectionBgn == -1)
            return String.Empty;

        int indexSubSectionEnd = section.Length;

        // Find first blank line after found sub-section
        bool blankLineFound = false;
        int lineStartIndex = indexSubSectionBgn; // start at the title, not at the top of the section
        int lineEndIndex = 0;
        do
        {
            string temp;
            lineEndIndex = section.IndexOf(Environment.NewLine, lineStartIndex);

            if (lineEndIndex == -1)
                temp = section.Substring(lineStartIndex);
            else
                temp = section.Substring(lineStartIndex, (lineEndIndex - lineStartIndex));

            temp = temp.Trim();
            if (temp.Length == 0)
            {
                if (lineEndIndex == -1)
                    indexSubSectionEnd = section.Length;
                else
                    indexSubSectionEnd = lineEndIndex;

                blankLineFound = true;
            }
            else
            {
                // Skip the whole newline sequence ("\r\n" is two characters on Windows)
                lineStartIndex = lineEndIndex + Environment.NewLine.Length;
            }
        } while (!blankLineFound && (lineEndIndex != -1));

        if (blankLineFound)
            // Substring takes a length, not an end index
            return section.Substring(indexSubSectionBgn, indexSubSectionEnd - indexSubSectionBgn);
        else
            return null;
}

Follow-up edit:

The result (based heavily on Konstantin's answer):

// Get subsection indicated by supplied title from supplied section
private static string GetSubSectionText(string section, string subSectionTitle)
{
        string[] lines = section.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
        int subsectStart = 0;
        int subsectEnd = lines.Length;

        // Find subsection start
        for (int i = 0; i < lines.Length; i++)
        {
            if (lines[i].Trim() == subSectionTitle)
            {
                subsectStart = i;
                break;
            }
        }

        // Find subsection end (ie, first blank line)
        for (int i = subsectStart; i < lines.Length; i++)
        {
            if (lines[i].Trim().Length == 0)
            {
                subsectEnd = i;
                break;
            }
        }

        return string.Join(Environment.NewLine, lines, subsectStart, subsectEnd - subsectStart);

}

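The main differences between this result and Konstantin's answer come from the framework version (I'm on .NET 2.0, which doesn't support string[].Take) and from using Environment.NewLine instead of a hard-coded '\n'. It's far prettier and more readable than my original pass. Thanks, everyone!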

score 4 · Accepted Answer

Have you tried using the String.Split method:

string s = "safsadfd\r\ndfgfdg\r\n\r\ndfgfgg";
string[] lines = s.Split('\n');
int i;
for (i = 0; i < lines.Length; i++)
{
    if (string.IsNullOrWhiteSpace(lines[i]))     
    //if (lines[i].Length == 0)          //or maybe this suits better..
    //if (lines[i].Equals(string.Empty)) //or this
    {
        Console.WriteLine(i);
        break;
    }
}
Console.WriteLine(string.Join("\n",lines.Take(i)));
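The loop above reports a line number. If what you need is a character index into the original string, one option is to add up the lengths of the lines that come before the blank one. The following is only a minimal sketch under that assumption: the class and method names (BlankLineIndex, FirstBlankLineCharIndex) are hypothetical, and it uses Trim() rather than string.IsNullOrWhiteSpace so that it should also compile on older frameworks.

```csharp
using System;

static class BlankLineIndex
{
    // Hypothetical helper, not part of the answer above.
    // Returns the character index where the first blank line starts, or -1 if none is found.
    public static int FirstBlankLineCharIndex(string s)
    {
        string[] lines = s.Split('\n');
        int offset = 0; // index in s where the current line starts

        foreach (string line in lines)
        {
            // Trim() also removes the '\r' that Split('\n') leaves behind on Windows text
            if (line.Trim().Length == 0)
                return offset;

            offset += line.Length + 1; // +1 for the '\n' that Split removed
        }

        return -1;
    }
}
```

For the sample string s above this returns 18, the position of the second "\r\n". Note that a string ending in '\n' produces an empty final element, so in that case s.Length is reported as the start of a blank line.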

Edit: updated in response to the OP's edit.

score 3 · Accepted Answer

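By "blank line", do you mean a line that contains only whitespace? If so, you should use a regular expression. The syntax you are looking for is @"(?<=\r?\n)[ \t]*(\r?\n|$)".

(?<=…) denotes a lookbehind: something that must precede what you are looking for.
\r?\n denotes a newline, supporting both the Unix and the Windows convention.
(?<=\r?\n) is therefore a lookbehind for a preceding newline.
[ \t]* means zero or more space or tab characters; these match the content (if any) of the blank line.
(\r?\n|$) means a newline or the end of the file.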

Example:

string source = "Line 1\r\nLine 2\r\n   \r\nLine 4\r\n";
Match firstBlankLineMatch = Regex.Match(source, @"(?<=\r?\n)[ \t]*(\r?\n|$)");
int firstBlankLineIndex = 
    firstBlankLineMatch.Success ? firstBlankLineMatch.Index : -1;
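Because the pattern above requires a preceding newline, a blank line at the very start of the string is not matched. If that case matters, a possible variation is to anchor on line starts with RegexOptions.Multiline instead of a lookbehind. This is a sketch of that variation, not the pattern from this answer, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlankLineRegex
{
    // Hypothetical helper. With RegexOptions.Multiline, ^ matches at the start of each line
    // and $ matches just before each '\n', so the optional \r covers Windows line endings.
    public static int FirstBlankLineIndex(string source)
    {
        Match m = Regex.Match(source, @"^[ \t]*\r?$", RegexOptions.Multiline);
        return m.Success ? m.Index : -1;
    }
}
```

For the source string in the example above this returns 16, and for a string such as "  \r\nabc" it returns 0.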

score 2 · Accepted Answer

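Just for fun: it looks like you're fine with allocating a new string once per line. In that case, you can write an iterator that evaluates the string lazily and returns each line. For example: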

IEnumerable<string> BreakIntoLines(string theWholeThing)
{
    int startIndex = 0;
    for(;;)
    {
        int newlineIndex = theWholeThing.IndexOf(Environment.NewLine, startIndex);
        if(newlineIndex == -1) //Didn't find a newline
        {
            //Return the end part of the string and finish
            yield return theWholeThing.Substring(startIndex);
            yield break;
        }
        else //Found a newline
        {
            //Remember to pick up the newline character(s) too!
            int endIndex = newlineIndex + Environment.NewLine.Length;
            //Return where we're at up to and including the newline
            yield return theWholeThing.Substring(startIndex, endIndex - startIndex);
            startIndex = endIndex;
        }
    }
}

You can then wrap that iterator in another iterator that returns only the lines you're interested in and discards the rest.

IEnumerable<string> GetSubsectionLines(string theWholeThing, string subsectionTitle)
{
    bool foundSubsectionTitle = false;
    foreach(var line in BreakIntoLines(theWholeThing))
    {
        if(line.Contains(subsectionTitle))
        {
            foundSubsectionTitle = true; //Start capturing
        }

        if(foundSubsectionTitle)
        {
            yield return line;
        } //Implicit "else" - Just discard the line if we haven't found the subsection title yet

        if(foundSubsectionTitle && String.IsNullOrWhiteSpace(line))
        {
            //This will stop iterating after returning the empty line, if there is one
            yield break;
        }
    }
}

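Now, this method (along with several of the others posted) doesn't do exactly what your original code does. For example, if the text of subsectionTitle happens to span a line break, it won't be found. I'm going to assume the spec is written so that this isn't allowed. This code also makes a copy of every line it returns, but your original code did that too, so it's probably fine.

The only advantage of this method over string.Split is that once you've finished returning the subsection, the rest of the string is never evaluated. For most reasonably sized strings you probably won't care, and any "performance gain" is likely to be nonexistent. If you really cared about performance, you wouldn't be copying each line in the first place!

The other advantage (which may actually be worthwhile) is code reuse. If you're writing a program that parses documents, being able to work with individual lines is handy.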
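As one illustration of that reuse, the same lazy line enumeration can answer the original question about a character index: since each yielded line keeps its newline, the lengths of the preceding lines add up to the offset of the blank one. This is a self-contained sketch with assumptions of my own: the names (LazyLines, FirstBlankLineIndex) are hypothetical, the newline sequence is passed in as a parameter instead of being read from Environment.NewLine, and Trim() stands in for String.IsNullOrWhiteSpace.

```csharp
using System;
using System.Collections.Generic;

static class LazyLines
{
    // Yields each line including its trailing newline sequence,
    // so the lengths of the yielded strings add up to character offsets.
    public static IEnumerable<string> BreakIntoLines(string text, string newline)
    {
        int start = 0;
        while (true)
        {
            int nl = text.IndexOf(newline, start, StringComparison.Ordinal);
            if (nl == -1)
            {
                // No more newlines: yield the remainder (possibly empty) and stop
                yield return text.Substring(start);
                yield break;
            }
            int end = nl + newline.Length;
            yield return text.Substring(start, end - start);
            start = end;
        }
    }

    // Returns the character index where the first blank line starts, or -1 if none is found.
    public static int FirstBlankLineIndex(string text, string newline)
    {
        int offset = 0;
        foreach (string line in BreakIntoLines(text, newline))
        {
            if (line.Trim().Length == 0)
                return offset; // enumeration stops here; the rest of the text is never split
            offset += line.Length;
        }
        return -1;
    }
}
```

For example, LazyLines.FirstBlankLineIndex("Title\r\nbody\r\n\r\nnext", "\r\n") returns 13. As with the other approaches, text that ends in a newline yields an empty final piece, so text.Length is reported as a blank line in that case.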
