.net - .NET RegEx - first N chars of first M lines

Question

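I need 4 general RegEx expressions for the following 4 basic cases:

1. Up to A chars starting after B chars from the start of the line, on up to C lines starting after D lines from the start of the file
2. Up to A chars starting after B chars from the start of the line, on up to C lines occurring before D lines from the end of the file
3. Up to A chars starting before B chars from the end of the line, on up to C lines starting after D lines from the start of the file
4. Up to A chars starting before B chars from the end of the line, on up to C lines starting before D lines from the end of the file

These would allow selecting arbitrary text blocks anywhere in the file.

So far I have only managed to come up with expressions that work for lines and for chars separately:

(?<=(?m:^[^\r]{N}))[^\r]{1,M} = UP TO M chars OF EVERY LINE, AFTER THE FIRST N chars
[^\r]{1,M}(?=(?m:.{N}\r$)) = UP TO M chars OF EVERY LINE, BEFORE THE LAST N chars

The above 2 expressions are for chars, and they return MANY matches (one for each line).

(?<=(\A([^\r]*\r\n){N}))(?m:\n*[^\r]*\r$){1,M} = UP TO M lines AFTER THE FIRST N lines
(((?=\r?)\n[^\r]*\r)|((?=\r?)\n[^\r]+\r?)){1,M}(?=((\n[^\r]*\r)|(\n[^\r]+\r?)){N}\Z) = UP TO M lines BEFORE THE LAST N lines

These 2 expressions are the equivalents for lines, but they always return just ONE match.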
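
The difference between the two behaviors can be seen in a small Python sketch. It is an adaptation, not the expressions above: Python's `re` module only supports fixed-width lookbehind, so the line-based pattern uses a capturing group in place of `(?<=...)`, the text is assumed to use `\n` line endings, and N = 2, M = 3 are arbitrary illustration values.

```python
import re

# Sample text from the examples; "\n" line endings and no trailing newline are assumed.
text = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))

N, M = 2, 3  # arbitrary values for illustration

# Up to M chars of every line, after the first N chars: one match per line.
chars_after = re.compile(r"(?<=^.{%d}).{1,%d}" % (N, M), re.MULTILINE)
char_matches = chars_after.findall(text)

# Up to M lines after the first N lines: the whole block comes back as one match.
# A capturing group stands in for the variable-length lookbehind that Python's re lacks.
lines_after = re.compile(r"\A(?:.*\n){%d}((?:.*\n?){1,%d})" % (N, M))
line_matches = lines_after.findall(text)

print(len(char_matches), char_matches)
print(len(line_matches), line_matches)
```

On the six-line sample, the character pattern yields six matches (one per line), while the line pattern yields a single match containing lines 3 to 5. Combining the two therefore means getting the line restriction into an assertion that is evaluated once for every per-line match.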

The task is to combine these expressions to allow for scenarios 1-4. Can anyone help?

Note that the case in the title of the question is just a subclass of scenario #1, where both B = 0 and D = 0.

EXAMPLE 1: Chars 3-6 of lines 3-5. A total of 3 matches.

SOURCE:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

RESULT:

<match>ne3 </match>
<match>ne4 </match>
<match>ne5 </match>

EXAMPLE 2: Last 4 chars of the 2 lines before the last 1 line. A total of 2 matches.

SOURCE:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

RESULT:

<match>ah 4</match>
<match>ah 5</match>

score 2 · Accepted Answer

Here's one regex for basic case 2:

Regex regexObj = new Regex(
    @"(?<=              # Assert that the following can be matched before the current position
     ^                # Start of line
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)
    (?=               # Assert that the following can be matched after the current position
     .*$              # rest of the current line
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne2
ne3
ne4

(ne2 starts at the third character (B = 2) of the fifth-to-last line (C+D = 5), and so on.)
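
Because this pattern needs only a fixed-width lookbehind, it can also be tried outside .NET. The sketch below is a Python adaptation, not the original .NET code: it assumes `\n` line endings and no trailing newline, and it uses Python's `\Z` (end of string) where .NET uses `\z`.

```python
import re

# Python adaptation of the case 2 pattern (B = 2, A = 3, D = 2, C = 3).
case2 = re.compile(
    r"""(?<=            # lookbehind: before the current position there must be
         ^              #   the start of a line
         .{2}           #   followed by 2 characters (B = 2)
        )
        .{1,3}          # match 1-3 characters (A = 3)
        (?=             # lookahead: after the match there must be
         .*$            #   the rest of the current line
         (?:\n.*){2,4}  #   2 to 4 entire lines (D = 2, C = 4+1-2)
         \Z             #   the end of the string
        )""",
    re.MULTILINE | re.VERBOSE,
)

text = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))
matches = case2.findall(text)
print(matches)
```

Only lines that are followed by 2 to 4 further lines satisfy the lookahead, which is what restricts the per-line matches to lines 2, 3 and 4 of the sample.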

score 1 · Accepted Answer

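EDIT: Based on your comments, it sounds like this really is something outside of your control. The reason I posted this answer is that I feel that developers, particularly when it comes to regular expressions, often get caught up in the technical challenge and lose sight of the actual goal: solving the problem. I know I am this way as well. I think it's just an unfortunate consequence of being both technically and creatively minded.

So I wanted to refocus you, if possible, on the problem at hand, and to stress that, in the presence of a well-stocked toolset, Regex is not the right tool for this job. If it's the only tool at your disposal for reasons outside of your control, then of course you've got no choice.

I figured you probably had real reasons for demanding a Regex solution; but since those reasons weren't fully explained, I felt there was still a chance you were just being stubborn ;)

You say this needs to be done in regular expressions, but I'm not convinced!

"First of all, I am limited to .NET 2.0 [ . . . ]"

No problem. Who says you need LINQ for a problem like this? LINQ just makes things easier; it doesn't make impossible things possible.

Here's one way you could implement the first case from your question, for example (and it would be fairly straightforward to refactor this into something more flexible, so that it covers cases 2-3 as well):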

public IEnumerable<string> ScanText(TextReader reader,
                                    int start,
                                    int count,
                                    int lineStart,
                                    int lineCount)
{
    int i = 0;
    while (i < lineStart && reader.Peek() != -1)
    {
        reader.ReadLine();
        ++i;
    }

    i = 0;
    while (i < lineCount && reader.Peek() != -1)
    {
        string line = reader.ReadLine();

        if (line.Length < start)
        {
            yield return ""; // or null? or continue?
        }
        else
        {
            int length = Math.Min(count, line.Length - start);
            yield return line.Substring(start, length);
        }

        ++i;
    }
}

So there you have a .NET 2.0-friendly solution to the general problem, without using regular expressions (or LINQ).
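
As an illustration of how the same idea might be generalized to all four cases from the question, here is a rough Python sketch (Python rather than C# purely for brevity). The function name `select_block` and the flags `from_line_end` and `from_file_end` are hypothetical, and the interpretation of A, B, C and D is an assumption based on the two examples in the question.

```python
def select_block(text, a, b, c, d, from_line_end=False, from_file_end=False):
    """Return up to `a` chars per line, skipping `b` chars, for up to `c` lines,
    skipping `d` lines. The flags choose which end the skipping is counted from."""
    lines = text.splitlines()

    if from_file_end:
        stop = max(len(lines) - d, 0)
        start = max(stop - c, 0)
    else:
        start, stop = d, d + c

    result = []
    for line in lines[start:stop]:
        if from_line_end:
            end = max(len(line) - b, 0)
            begin = max(end - a, 0)
        else:
            begin, end = b, b + a
        # Lines that are too short simply contribute a shorter (or empty) string.
        result.append(line[begin:end])
    return result


text = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))

# Example 1 from the question: chars 3-6 of lines 3-5.
example1 = select_block(text, a=4, b=2, c=3, d=2)
# Example 2 from the question: last 4 chars of the 2 lines before the last line.
example2 = select_block(text, a=4, b=0, c=2, d=1,
                        from_line_end=True, from_file_end=True)
print(example1)
print(example2)
```

Unlike ScanText, this version reads the whole text into memory, because counting lines from the end of the file requires knowing how many lines there are.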

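"Secondly, these [ . . . ]"

Maybe I'm just being dense. What is preventing you from starting with something non-Regex and then using Regex for more "sophisticated" behavior on top of that? If you need to do additional processing on the lines returned by ScanText above, for instance, you can certainly do so using Regex. But to insist on using Regex from the very beginning seems... I don't know, just unnecessary.

"Unfortunately, due to the nature of the project, it has to be done in RegEx [ . . . ]"

If that's truly the case, then very well. But if your reasons are only those from the excerpts above, then I disagree that this particular aspect of the problem (scanning certain characters from certain lines of text) needs to be tackled using Regex, even if Regex will be required for other aspects of the problem that are not covered by this question.

If, on the other hand, you are being forced to use Regex for some arbitrary reason (say, somebody chose to write into a requirement or spec, perhaps without much thought, that regular expressions would be used for this task), well, I would personally advise fighting against it. Explain to whoever is in a position to change this requirement that Regex is not necessary and that the problem can be solved easily without it... or with a combination of "normal" code and Regex.

The only other possibility I can think of (though this may be the result of my own lack of imagination) that would explain your needing Regex for the problem described in your question is that you are restricted to a particular tool that accepts only Regex as user input. But your question is tagged .net, so I have to assume that you can write at least some code of your own to solve this problem. And if that's the case, then I'll say it again: I don't think you need Regex ;)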

score 1 · Accepted Answer

First, here is an answer for "basic case 1":

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

You can now iterate over the matches using

Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
}

So, in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne3
ne4
ne5

score 1 · Accepted Answer

Here's one for basic case 3:

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .*               # any number of characters
    )                 # End of lookbehind assertion
    (?=               # Assert that the following can be matched after the current position
     .{8}             # 8 characters (B = 8)
     $                # end of line
    )                 # End of lookahead assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

So, in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

3 b
4 b
5 b

(3 b: three characters (A = 3), starting at the 8th-to-last character (B = 8), starting in the third line (D = 2), and so on.)

score 1 · Accepted Answer

Finally, one solution for basic case 4:

Regex regexObj = new Regex(
    @"(?=             # Assert that the following can be matched after the current position
     .{8}             # 8 characters (B = 8)
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )                 # End of lookahead assertion
    .{1,3}            # Match three characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

this will match

2 b
3 b
4 b

(2 b because it is three characters (A = 3), starting at the 8th-to-last character (B = 8), starting in the fifth-to-last line (C+D = 5).)
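
Taken together, the four answers differ only in where the counted pieces sit relative to the anchors. As a rough summary, the Python sketch below assembles the four .NET pattern strings from the parameters. It only builds strings and does not run them, since Python's `re` cannot execute the variable-length lookbehinds of cases 1 and 3. The names `quant` and `build_pattern` are hypothetical. B follows the answers' convention (for cases 3 and 4 it is the distance from the start of the match to the end of the line), and cases 2 and 3 additionally need `RegexOptions.Multiline`.

```python
def quant(lo, hi=None):
    """Format a regex quantifier such as {2} or {2,4}."""
    return "{%d}" % lo if hi is None else "{%d,%d}" % (lo, hi)


def build_pattern(case, a, b, c, d):
    """Build the .NET pattern string for basic case 1-4 (hypothetical helper)."""
    lines = quant(d, d + c - 1)  # D to D+C-1 entire lines, as in the answers
    chars = "." + quant(1, a)    # the 1 to A characters that are matched
    if case == 1:  # counted from start of file and start of line
        return r"(?<=\A(?:.*\r\n)" + lines + "." + quant(b) + ")" + chars
    if case == 2:  # counted from end of file and start of line (needs Multiline)
        return ("(?<=^." + quant(b) + ")" + chars
                + r"(?=.*$(?:\r\n.*)" + lines + r"\z)")
    if case == 3:  # counted from start of file and end of line (needs Multiline)
        return (r"(?<=\A(?:.*\r\n)" + lines + ".*)"
                + "(?=." + quant(b) + "$)" + chars)
    if case == 4:  # counted from end of file and end of line
        return "(?=." + quant(b) + r"(?:\r\n.*)" + lines + r"\z)" + chars
    raise ValueError("case must be 1, 2, 3 or 4")


# The parameter values used in the answers: A = 3, C = 3, D = 2, B = 2 or 8.
patterns = {
    1: build_pattern(1, a=3, b=2, c=3, d=2),
    2: build_pattern(2, a=3, b=2, c=3, d=2),
    3: build_pattern(3, a=3, b=8, c=3, d=2),
    4: build_pattern(4, a=3, b=8, c=3, d=2),
}
for case, pattern in sorted(patterns.items()):
    print(case, pattern)
```

The resulting strings are the compact (comment-free) forms of the four patterns, so they can be passed to `new Regex(...)` without `RegexOptions.IgnorePatternWhitespace`.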

score 0 · Accepted Answer

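Two points, with apologies:

I am proposing solutions that are not fully regex-based. I read that you want pure regex solutions, but I got caught up in this interesting problem and quickly concluded that using regexes here overcomplicates it. I was not able to answer with pure regex solutions. I found the ones below and show them anyway; maybe they will give you ideas.
I don't know C# or .NET, only Python. Since regexes are nearly the same in all languages, I thought I could answer with regexes alone, which is why I started looking into the problem. I show my solutions in Python all the same, because I think they are easy to understand anyway.

I think it is very difficult to capture all the occurrences of the characters you want with a single regex, because finding several characters in several lines looks to me like a problem of finding nested matches within matches (or maybe I am not skilled enough with regexes?).

So I thought it better to first find the occurrences of the characters in every line and put them in a list, and then select the wanted occurrences by slicing the list.

For finding the characters within a line, a regex seemed fine to me at first. Hence the solution with the function selectRE().

Afterwards, I realized that selecting characters in a line is the same as slicing the line at suitable indexes, just as the list is sliced. Hence the function select().

I give the two solutions together, so that you can verify that the two functions return equal results.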

import re

def selectRE(a, which_chars, b, x, which_lines, y, ch):
    ch = ch[:-1] if ch[-1] == '\n' else ch  # drop a trailing newline to obtain an exact number of lines
    NL = ch.count('\n') + 1  # number of lines

    def pat(a, which_chars, b):
        # Build a per-line pattern whose single group captures the wanted characters
        if which_chars == 'to':
            src = ('.{' + str(a-1) + '}' if a else '') + '(.{' + str(b-a+1) + '}).*(?:\n|$)'
        elif which_chars == 'before':
            src = '.*(.{' + str(a) + '})' + ('.{' + str(b) + '}' if b else '') + '(?:\n|$)'
        elif which_chars == 'after':
            src = ('.{' + str(b) + '}' if b else '') + '(.{' + str(a) + '}).*(?:\n|$)'
        print(repr(src))
        return re.compile(src)

    # Convert the line specification into a slice [x:y] over the list of per-line matches
    if   which_lines == 'to'    :  x   = x-1
    elif which_lines == 'before':  x,y = NL-x-y,NL-y
    elif which_lines == 'after' :  x,y = y,y+x

    return pat(a, which_chars, b).findall(ch)[x:y]


def select(a, which_chars, b, x, which_lines, y, ch):
    ch = ch[:-1] if ch[-1] == '\n' else ch  # drop a trailing newline to obtain an exact number of lines
    NL = ch.count('\n') + 1  # number of lines

    # Convert the character specification into slice indexes a:b
    if   which_chars == 'to'    :  a   = a-1
    elif which_chars == 'after' :  a,b = b,a+b

    # Convert the line specification into the index range x <= i < y
    if   which_lines == 'to'    :  x   = x-1
    elif which_lines == 'before':  x,y = NL-x-y,NL-y
    elif which_lines == 'after' :  x,y = y,y+x

    return [ line[len(line)-a-b:len(line)-b] if which_chars == 'before' else line[a:b]
             for i,line in enumerate(ch.splitlines()) if x<=i<y ]


ch = '''line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6
'''
print(ch, '\n')

print('Characters 3-6 of lines 3-5. A total of 3 matches.')
print(selectRE(3,'to',6,3,'to',5,ch))
print(  select(3,'to',6,3,'to',5,ch))
print()
print('Characters 1-5 of lines 4-5. A total of 2 matches.')
print(selectRE(1,'to',5,4,'to',5,ch))
print(  select(1,'to',5,4,'to',5,ch))
print()
print('7 characters before the last 3 chars of lines 2-6. A total of 5 matches.')
print(selectRE(7,'before',3,2,'to',6,ch))
print(  select(7,'before',3,2,'to',6,ch))
print()
print('6 characters before the 2 last characters of 3 lines before the 3 last lines.')
print(selectRE(6,'before',2,3,'before',3,ch))
print(  select(6,'before',2,3,'before',3,ch))
print()
print('4 last characters of 2 lines before 1 last line. A total of 2 matches.')
print(selectRE(4,'before',0,2,'before',1,ch))
print(  select(4,'before',0,2,'before',1,ch))
print()
print('last 1 character of 4 last lines. A total of 4 matches.')
print(selectRE(1,'before',0,4,'before',0,ch))
print(  select(1,'before',0,4,'before',0,ch))
print()
print('7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.')
print(selectRE(7,'before',3,3,'after',2,ch))
print(  select(7,'before',3,3,'after',2,ch))
print()
print('5 characters before the 3 last chars of the 5 first lines')
print(selectRE(5,'before',3,5,'after',0,ch))
print(  select(5,'before',3,5,'after',0,ch))
print()
print('Characters 3-6 of the 4 first lines')
print(selectRE(3,'to',6,4,'after',0,ch))
print(  select(3,'to',6,4,'after',0,ch))
print()
print('9 characters after the 2 first chars of the 3 lines after the 1 first line')
print(selectRE(9,'after',2,3,'after',1,ch))
print(  select(9,'after',2,3,'after',1,ch))

Result

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6


Characters 3-6 of lines 3-5. A total of 3 matches.
'.{2}(.{4}).*(?:\n|$)'
['ne3 ', 'ne4 ', 'ne5 ']
['ne3 ', 'ne4 ', 'ne5 ']

Characters 1-5 of lines 4-5. A total of 2 matches.
'.{0}(.{5}).*(?:\n|$)'
['line4', 'line5']
['line4', 'line5']

7 characters before the last 3 chars of lines 2-6. A total of 5 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']

6 characters before the 2 last characters of 3 lines before the 3 last lines.
'.*(.{6}).{2}(?:\n|$)'
['1 blah', '2 blah', '3 blah']
['1 blah', '2 blah', '3 blah']

4 last characters of 2 lines before 1 last line. A total of 2 matches.
'.*(.{4})(?:\n|$)'
['ah 4', 'ah 5']
['ah 4', 'ah 5']

last 1 character of 4 last lines. A total of 4 matches.
'.*(.{1})(?:\n|$)'
['3', '4', '5', '6']
['3', '4', '5', '6']

7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne3 bla', 'ne4 bla', 'ne5 bla']
['ne3 bla', 'ne4 bla', 'ne5 bla']

5 characters before the 3 last chars of the 5 first lines
'.*(.{5}).{3}(?:\n|$)'
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']

Characters 3-6 of the 4 first lines
'.{2}(.{4}).*(?:\n|$)'
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']

9 characters after the 2 first chars of the 3 lines after the 1 first line
'.{2}(.{9}).*(?:\n|$)'
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']

And now I am going to study Tim Pietzcker's tricky solutions.

score 0 · Accepted Answer

Why don't you just do something like this:

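// Assuming the text has been read into a string named sourceString
// Splitting on both "\r\n" and "\n" accounts for either line delimiter
string[] splitString = sourceString.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
int lineCount = Math.Min(M, splitString.Length); // the text may have fewer than M lines
string[] newStrings = new string[lineCount];
for (int i = 0; i < lineCount; i++) {
    // Clamp the length, because Substring throws if the line is shorter than N
    int length = Math.Min(N, splitString[i].Length);
    newStrings[i] = splitString[i].Substring(0, length); // or Substring(N, ...) depending on what you need
}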

No need for RegEx or LINQ.

Well, I re-read the start of your question; you could simply parameterize the start and end of the for loop and of the Split to get exactly what you need.
