regex - strsplit inconsistent with gregexpr

Question

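A comment on my answer to this question suggested a regex that should give the desired result with strsplit, but it does not, even though the pattern appears to match only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.

So why does strsplit split on every comma in this example, when regmatches returns only two matches for the same regex?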

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

# Matching positions are at
unlist(m)
[1]  4 13

#  And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","

Huh?! What is going on?

score 10 · Accepted Answer

@Aprillion's theory is correct, as the R documentation confirms:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

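In other words, at each iteration ^ matches the beginning of a new, shortened string (the previously consumed items are no longer there), so the first alternative of the regex matches again on every pass.

To illustrate this behavior simply: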

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
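
The documented loop can also be traced by hand. The following is a minimal sketch in R, not the actual implementation of strsplit: the function name split_sketch is hypothetical, it handles a single string only, and it deliberately refuses zero-length matches rather than trying to reproduce how strsplit treats them.

```r
# Hypothetical helper: a hand-written version of the documented loop.
# It is an illustration only and is not how strsplit is implemented.
split_sketch <- function(s, pattern) {
  out <- character(0)
  repeat {
    # if the string is empty: break
    if (nchar(s) == 0) break

    # Search the *remaining* string, so ^ anchors at its new start
    m <- regexpr(pattern, s, perl = TRUE)

    if (m[1] != -1) {
      len <- attr(m, "match.length")
      # Assumption of this sketch: zero-length matches are not handled
      if (len == 0) stop("zero-length matches are outside this sketch")
      # add the string to the left of the match to the output
      out <- c(out, substr(s, 1, m[1] - 1))
      # remove the match and everything to the left of it
      s <- substr(s, m[1] + len, nchar(s))
    } else {
      # no match: add what is left to the output and stop
      out <- c(out, s)
      break
    }
  }
  out
}

split_sketch("123,34,56,78,90", "^\\w+\\K,|,(?=\\w+$)")
# [1] "123" "34"  "56"  "78"  "90"

split_sketch("12345", "^.")
# [1] "" "" "" "" ""
```

Because the string is truncated after each match, the ^\\w+\\K, alternative succeeds on every pass, which reproduces the split on every comma seen in the question.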

Here you can see a consequence of this behavior when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
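
Since gregexpr scans the original string in a single pass rather than truncating it after each match, one possible workaround (a sketch, not something proposed in the answer above) is to reuse its match data and ask regmatches for the text between the matches via invert = TRUE.

```r
x <- "123,34,56,78,90"

# Same regex as in the question, matched against the untruncated string
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE returns the substrings that were NOT matched,
# i.e. the pieces on either side of the first and last comma
pieces <- regmatches(x, m, invert = TRUE)[[1]]
pieces
# [1] "123"      "34,56,78" "90"
```

This relies only on gregexpr and regmatches, the two functions already used in the question to verify the match positions.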
