javascript - Get the index of each capture in a JavaScript regex

Question

I want to match a regex like /(a).(b)(c.)d/ against "aabccde" and get the following information back:

"a" at index = 0
"b" at index = 2
"cc" at index = 3

How can I do this? String.match returns a list of matches and the index of the start of the complete match, not the index of every capture.

Edit: a test case that would not work with a simple indexOf

regex: /(a).(.)/
string: "aaa"
expected result: "a" at 0, "a" at 2
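
To make the failure concrete, here is a minimal sketch of the naive lookup, assuming each captured substring is simply searched for with indexOf starting from the match index. The variable names are illustrative only.

```javascript
// Naive approach: locate every captured substring with indexOf,
// always searching from the start of the full match.
const m = "aaa".match(/(a).(.)/);
const naive = m.slice(1).map((s) => [s, "aaa".indexOf(s, m.index)]);

console.log(m.index); // 0 (only the full match has a position)
console.log(naive);   // [ [ 'a', 0 ], [ 'a', 0 ] ]
```

Both captures are reported at 0, although the second group actually matched at index 2.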

Note: The question is similar to "Javascript Regex: How to find index of each subexpression?", but I cannot modify the regex to make every subexpression a capturing group.

score 8 · Accepted Answer

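There is currently a proposal (stage 4) to implement this in native JavaScript:

RegExp Match Indices for ECMAScript

ECMAScript RegExp Match Indices provide additional information about the start and end indices of captured substrings, relative to the start of the input string.

...We propose the adoption of an additional indices property on the array result (the substrings array) of RegExp.prototype.exec(). This property would itself be an indices array containing a pair of start and end indices for each captured substring. Any unmatched capture groups would be undefined, similar to their corresponding element in the substrings array. In addition, the indices array would itself have a groups property containing the start and end indices for each named capture group.

Here is an example of how things work. The following snippets run without errors in, at least, Chrome: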

const re1 = /a+(?<Z>z)?/d;

// indices are relative to start of the input string:
const s1 = "xaaaz";
const m1 = re1.exec(s1);
console.log(m1.indices[0][0]); // 1
console.log(m1.indices[0][1]); // 5
console.log(s1.slice(...m1.indices[0])); // "aaaz"

console.log(m1.indices[1][0]); // 4
console.log(m1.indices[1][1]); // 5
console.log(s1.slice(...m1.indices[1])); // "z"

console.log(m1.indices.groups["Z"][0]); // 4
console.log(m1.indices.groups["Z"][1]); // 5
console.log(s1.slice(...m1.indices.groups["Z"])); // "z"

// capture groups that are not matched return `undefined`:
const m2 = re1.exec("xaaay");
console.log(m2.indices[1]); // undefined
console.log(m2.indices.groups.Z); // undefined

So, for the code in the question, we could do:

const re = /(a).(b)(c.)d/d;
const str = 'aabccde';
const result = re.exec(str);
// indices[0], like result[0], describes the indices of the full match
const matchStart = result.indices[0][0];
result.forEach((matchedStr, i) => {
  const [startIndex, endIndex] = result.indices[i];
  console.log(`${matchedStr} from index ${startIndex} to ${endIndex} in the original string`);
  console.log(`From index ${startIndex - matchStart} to ${endIndex - matchStart} relative to the match start\n-----`);
});

Output:

aabccd from index 0 to 6 in the original string
From index 0 to 6 relative to the match start
-----
a from index 0 to 1 in the original string
From index 0 to 1 relative to the match start
-----
b from index 2 to 3 in the original string
From index 2 to 3 relative to the match start
-----
cc from index 3 to 5 in the original string
From index 3 to 5 relative to the match start
-----

Note that the indices array contains the indices of the matched groups relative to the start of the string, not relative to the start of the match.
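
As a compact illustration of the same idea, the sketch below wraps the lookup in a helper that returns just the start index of each capture, which is the shape the question asks for. The name captureIndices is hypothetical, and the helper assumes an engine that supports the d flag; it adds the flag itself when the caller's regex lacks it.

```javascript
// Hypothetical helper: returns [substring, startIndex] for every capture group,
// or null when the regex does not match. Unmatched groups yield [undefined, -1].
function captureIndices(re, str) {
  // Re-create the regex with the `d` flag so that exec() populates `indices`.
  const flags = re.flags.includes("d") ? re.flags : re.flags + "d";
  const m = new RegExp(re.source, flags).exec(str);
  if (m === null) return null;
  return m.slice(1).map((text, i) => {
    const range = m.indices[i + 1]; // indices[0] is the full match
    return range === undefined ? [undefined, -1] : [text, range[0]];
  });
}

console.log(captureIndices(/(a).(b)(c.)d/, "aabccde"));
// [ [ 'a', 0 ], [ 'b', 2 ], [ 'cc', 3 ] ]
console.log(captureIndices(/(a).(.)/, "aaa"));
// [ [ 'a', 0 ], [ 'a', 2 ] ]
```

The second call is the test case from the question's edit, where a plain indexOf search gives the wrong position.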

A polyfill is available here.
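
Whether a polyfill is needed can be checked at runtime. The following is a rough sketch, assuming that engines without support reject the d flag with a SyntaxError; the function name supportsMatchIndices is made up for illustration.

```javascript
// Returns true when the engine accepts the `d` flag and exposes `indices`.
function supportsMatchIndices() {
  try {
    const m = new RegExp("a", "d").exec("a");
    return Array.isArray(m.indices);
  } catch (e) {
    // Older engines throw on the unknown flag.
    return false;
  }
}

console.log(supportsMatchIndices());
```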

score 5 · Accepted Answer

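I created a little regexp parser which is also able to parse nested groups like a charm. It's small but huge. No really. Like Donald's hands. I would be really happy if someone could test it, so it gets battle tested. It can be found at https://github.com/valorize/MultiRegExp2

Usage: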

let regex = /a(?: )bc(def(ghi)xyz)/g;
let regex2 = new MultiRegExp2(regex);

let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX');

Will output:
[ { match: 'defghixyz', start: 8, end: 17 },
  { match: 'ghi', start: 11, end: 14 } ]

score 1 · Accepted Answer

So you have a text and a regular expression:

txt = "aabccde";
re = /(a).(b)(c.)d/;

The first step is to get the list of all substrings that match the regular expression:

subs = re.exec(txt);

Then you can do a simple search in the text for each substring. You have to keep the position of the last substring in a variable. I've named this variable cursor.

var cursor = subs.index;
for (var i = 1; i < subs.length; i++){
    var sub = subs[i];
    // search for this capture starting where the previous one ended
    var index = txt.indexOf(sub, cursor);
    cursor = index + sub.length;

    console.log(sub + ' at index ' + index);
}
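
The loop above can be packaged as a function to try it on both inputs from the question. This is only a sketch of the same cursor heuristic; the name cursorIndices and the handling of unmatched groups are assumptions added for illustration.

```javascript
// Cursor heuristic: search for each captured substring with indexOf,
// starting right after the end of the previous capture.
function cursorIndices(re, txt) {
  const subs = re.exec(txt);
  if (subs === null) return null;
  const out = [];
  let cursor = subs.index;
  for (let i = 1; i < subs.length; i++) {
    const sub = subs[i];
    if (sub === undefined) { // group did not participate in the match
      out.push([undefined, -1]);
      continue;
    }
    const index = txt.indexOf(sub, cursor);
    cursor = index + sub.length;
    out.push([sub, index]);
  }
  return out;
}

console.log(cursorIndices(/(a).(b)(c.)d/, "aabccde"));
// [ [ 'a', 0 ], [ 'b', 2 ], [ 'cc', 3 ] ]
console.log(cursorIndices(/(a).(.)/, "aaa"));
// [ [ 'a', 0 ], [ 'a', 1 ] ]  (the second capture is really at 2)
```

The heuristic works when the text between captures cannot be confused with the next capture, and it breaks on the "aaa" case because the character consumed by the uncaptured `.` looks the same as the following group.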

Edit: Thanks to @nhahtdh, I've improved the mechanism and made a complete function:

String.prototype.matchIndex = function(re){
    var res  = [];
    var subs = this.match(re);

    for (var cursor = subs.index, l = subs.length, i = 1; i < l; i++){
        var index = cursor;

        if (i+1 !== l && subs[i] !== subs[i+1]) {
            nextIndex = this.indexOf(subs[i+1], cursor);
            while (true) {
                currentIndex = this.indexOf(subs[i], index);
                if (currentIndex !== -1 && currentIndex <= nextIndex)
                    index = currentIndex + 1;
                else
                    break;
            }
            index--;
        } else {
            index = this.indexOf(subs[i], cursor);
        }
        cursor = index + subs[i].length;

        res.push([subs[i], index]);
    }
    return res;
}


console.log("aabccde".matchIndex(/(a).(b)(c.)d/));
// [ [ 'a', 1 ], [ 'b', 2 ], [ 'cc', 3 ] ]

console.log("aaa".matchIndex(/(a).(.)/));
// [ [ 'a', 0 ], [ 'a', 1 ] ] <-- problem here

console.log("bababaaaaa".matchIndex(/(ba)+.(a*)/));
// [ [ 'ba', 4 ], [ 'aaa', 6 ] ]
