regex - Regular expression matching with gregexpr in R

Question

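I am trying to count occurrences of three consecutive "a" characters, i.e. "aaa".

The string consists of lowercase letters, for example "abaaaababaaa".

I tried the following code, but its behavior is not quite what I am looking for.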

x <- "abaaaababaaa"
gregexpr("aaa", x)

I would like the match to report 3 occurrences of "aaa" rather than 2.

Assume indexing starts at 1.

The first occurrence of "aaa" is at index 3.
The second occurrence of "aaa" is at index 4 (this one is not caught by gregexpr).
The third occurrence of "aaa" is at index 10.

score 6 · Accepted Answer

To catch overlapping matches, you can use a lookahead like this:

gregexpr("a(?=aa)", x, perl=TRUE)

However, each match is now just a single "a", which can make processing the matches more complicated, especially if you are not always looking for fixed-length patterns.
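For the counting task in the question, the whole pattern can also be placed inside the lookahead, so that each match is zero-width and marks a position where "aaa" begins. This is a minimal sketch of that variant rather than the answer's exact code; the names m, positions, and count are illustrative.

```r
x <- "abaaaababaaa"

# Zero-width lookahead: matches at every position where "aaa" begins,
# without consuming characters, so overlapping occurrences are all found.
m <- gregexpr("(?=aaa)", x, perl = TRUE)[[1]]

# gregexpr returns -1 when there is no match, so guard before counting.
positions <- if (m[1] == -1) integer(0) else as.integer(m)
count <- length(positions)

positions  # 3 4 10
count      # 3
```

Because the matches are zero-width, only the start positions are useful here; the matched text itself is empty.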

score 1 · Accepted Answer

Late to the party, but I wanted to share this solution:

your.string <- "abaaaababaaa"
# number of overlapping 3-character windows in the string
nc1 <- nchar(your.string) - 2
x <- unlist(strsplit(your.string, NULL))
# build every 3-character window, one per start index
x2 <- character(nc1)
for (i in seq_len(nc1))
  x2[i] <- paste(x[i], x[i+1], x[i+2], sep="")
# indices of the windows that contain "aaa"
hits <- grep("aaa", x2)
cat("occurrences of <aaa> in <your.string> is,",
    length(hits), "and they are at index", hits)
> occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10

This was heavily inspired by this answer from Fran on R-help.
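The same sliding-window idea can be written without an explicit loop, since substring() is vectorized over its start and stop arguments. A minimal sketch, assuming the string is at least as long as the pattern; the names pattern, starts, windows, and hits are illustrative and not from the answer.

```r
your.string <- "abaaaababaaa"
pattern <- "aaa"

n <- nchar(your.string)
k <- nchar(pattern)

# Start index of every window of length k (assumes n >= k).
starts <- seq_len(n - k + 1)

# One substring per start index, each k characters long.
windows <- substring(your.string, starts, starts + k - 1)

# Keep the start indices whose window equals the pattern exactly.
hits <- starts[windows == pattern]
hits          # 3 4 10
length(hits)  # 3
```

This compares literal strings, so it only suits fixed patterns; for real regular expressions the lookahead approach is the better fit.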

score 0 · Accepted Answer

Here is a way to extract all overlapping matches of varying lengths using gregexpr.

x<-"abaaaababaaa"
# nest in lookahead + capture group
# to get all instances of the pattern "(ab)|b"
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# regmatches will reference the match.length attr. to extract the strings
# so move match length data from 'capture.length' to 'match.length' attr
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
# extract substrings
regmatches(x, matches)
# [[1]]
# [1] "ab" "b"  "ab" "b"  "ab" "b"

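The trick is to wrap the pattern in a capture group and then wrap that capture group in a lookahead assertion. gregexpr returns a list containing the start positions, with a capture.length attribute: a matrix whose first column holds the match lengths of the first capture group. If you convert that column to a vector and move it into the match.length attribute (which is otherwise all zeros, because the whole pattern sits inside a lookahead assertion), you can pass the result to regmatches to extract the strings.

As the type of the final result suggests, with a few modifications this can be vectorized for the case where x is a list of strings.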

x<-list(s1="abaaaababaaa", s2="ab")
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# make a function that replaces match.length attr with capture.length
set.match.length<-
function(x) structure(x, match.length=as.vector(attr(x, 'capture.length')[,1]))
# set match.length to capture.length for each match object
matches<-lapply(matches, set.match.length)
# extract substrings
mapply(regmatches, x, lapply(matches, list))
# $s1
# [1] "ab" "b"  "ab" "b"  "ab" "b" 
# 
# $s2
# [1] "ab" "b"
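Applied back to the original question, the capture-group-inside-lookahead idea could be wrapped up roughly as follows. This is a sketch under stated assumptions: overlapping_matches is a hypothetical helper, not a base R function, and it handles a single string only.

```r
# Hypothetical helper (not part of base R): returns the start positions and
# matched substrings of all overlapping matches of `pattern` in one string.
overlapping_matches <- function(pattern, x) {
  # Wrap the pattern in a capture group inside a lookahead assertion.
  m <- gregexpr(paste0("(?=(", pattern, "))"), x, perl = TRUE)[[1]]
  # gregexpr signals "no match" with -1.
  if (m[1] == -1) {
    return(list(start = integer(0), match = character(0)))
  }
  start <- as.integer(m)
  # Column 1 of capture.length is the length of the outer capture group.
  len <- as.vector(attr(m, "capture.length")[, 1])
  list(start = start, match = substring(x, start, start + len - 1))
}

res <- overlapping_matches("aaa", "abaaaababaaa")
res$start  # 3 4 10
res$match  # "aaa" "aaa" "aaa"
```

Returning the start positions alongside the substrings keeps both the count and the indices the question asks for in one result.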
