r - Approximate pattern matching and extraction over a series of integer data in R

Question

I have an integer pattern, c(1,2,3,4,5), that I need to match approximately against the data c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6).

I have tried:

pmatch()
all.equal()
grepl()

However, they do not seem to support this scenario.

pattern <- c(1,2,3,4,5)

data <- c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6)

For the example above, I need to produce the following output:

1,6,3,4,5

1,2,3,4,5

1,2,3,4,6

I would appreciate any thoughts on this.

Thanks

score 2 · Accepted Answer

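I think you are saying "match an integer sequence within another integer sequence where at least N-1 of the integers match." It is unclear what the behavior should be when matches overlap, so the code below picks up overlapping sequences.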

# helper function to test "match" at a threshold of 4 matches
is_almost <- function(s1, s2, thresh = 4) {
   sum(s1 == s2) >= thresh
}

# function to lookup and return sequences
extract_seq <- function(pattern, data) {
   res <- lapply(1:(length(data) - length(pattern) + 1), function(s) {
      subseq <- data[s:(s + length(pattern) - 1)]
      if (is_almost(pattern, subseq)) {
         subseq
      }
   })
   Filter(Negate(is.null), res)
}

# let's test it out
pattern <- c(1,2,3,4,5)
data <- c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6)

extract_seq(pattern,data)

[[1]]
[1] 1 6 3 4 5

[[2]]
[1] 1 2 3 4 5

[[3]]
[1] 1 2 3 4 6
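
The answer leaves open what should happen when matches overlap. As a rough sketch of one possible policy, assuming a match should consume its whole window so that later matches cannot overlap it, something like the following could be used. The name extract_seq_nonoverlap and the max_mismatch parameter are hypothetical and not part of the original answer.

```r
# Hypothetical variant of a sliding-window matcher: after a match, jump past
# the matched window so returned windows never overlap.
# max_mismatch is an assumed parameter: the number of positions allowed to differ.
extract_seq_nonoverlap <- function(pattern, data, max_mismatch = 1) {
  n <- length(pattern)
  out <- list()
  s <- 1
  while (s + n - 1 <= length(data)) {
    window <- data[s:(s + n - 1)]
    if (sum(window != pattern) <= max_mismatch) {
      out[[length(out) + 1]] <- window
      s <- s + n   # skip the whole matched window
    } else {
      s <- s + 1   # slide forward by one element
    }
  }
  out
}

pattern <- c(1, 2, 3, 4, 5)
data <- c(1, 10, 1, 6, 3, 4, 5, 1, 2, 3, 4, 5, 9, 10, 1, 2, 3, 4, 6)
matches <- extract_seq_nonoverlap(pattern, data)
```

For the example data this returns the same three windows, because none of the matches overlap. The two approaches would differ only on data where one near-match starts inside another.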

score 0 · Accepted Answer

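If you want to find the unique elements in a vector that match a given vector, you can use %in% to test for the presence of your "pattern" in the larger vector. The %in% operator returns a logical vector. Passing that output to which() returns the index of each TRUE value, which you can use to subset the larger vector and return all elements that match the "pattern", regardless of order. Passing the subset vector to unique() eliminates duplicates, so that only one occurrence of each element from the larger vector that matches the elements and length of the "pattern" vector remains.

For example: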

> num.data <- c(1, 10, 1, 6, 3, 4, 5, 1, 2, 3, 4, 5, 9, 10, 1, 2, 3, 4, 5, 6)
> num.pattern.1 <- c(1,6,3,4,5)
> num.pattern.2 <- c(1,2,3,4,5)
> num.pattern.3 <- c(1,2,3,4,6)
> unique(num.data[which(num.data %in% num.pattern.1)])
[1] 1 6 3 4 5
> unique(num.data[which(num.data %in% num.pattern.2)])
[1] 1 3 4 5 2
> unique(num.data[which(num.data %in% num.pattern.3)])
[1] 1 6 3 4 2

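Note that the first result coincidentally matches the order of num.pattern.1. The other two vectors do not match the order of their pattern vectors.

To find the exact sequence in num.data that matches the pattern, you can use something similar to the following function.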
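
To make the intermediate steps of the %in% / which() / unique() chain concrete, here is a small sketch on a shorter vector. The names x, p, hits and idx are illustrative only and do not appear in the answer.

```r
x <- c(1, 10, 1, 6, 3)   # illustrative data vector
p <- c(1, 6, 3, 4, 5)    # illustrative pattern vector

hits <- x %in% p         # logical vector: is each element of x present in p?
idx  <- which(hits)      # integer positions of the TRUE values
vals <- unique(x[idx])   # matching elements, duplicates dropped, data order kept
```

Here hits is FALSE only for the value 10, idx holds the positions 1, 3, 4 and 5, and vals keeps the first occurrence of each matching value. Indexing with the logical vector directly, x[hits], selects the same elements, so the which() step is optional.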

set.seed(12102015)
test.data <- sample(c(1:99), size = 500, replace = TRUE)
test.pattern.1 <- test.data[90:94]

find_vector <- function(test.data, test.pattern.1) {
   # List of all the vectors from test.data with length = length(test.pattern.1), currently empty
   lst <- vector(mode = "list")
   # List of vectors that meet condition 1, currently empty
   lst2 <- vector(mode = "list")
   # List of vectors that meet condition 2, currently empty
   lst3 <- vector(mode = "list")
   # A modifier to the iteration variable used to build 'lst'
   a <- length(test.pattern.1) - 1
   # The loop to iterate through 'test.data' testing for conditions and building lists to return a match.
   # It stops at the last full-length window so that no window is padded with NA.
   for(i in seq_len(max(0, length(test.data) - a))) {
     # The list is built incrementally as 'i' increases
     lst[[i]] <- test.data[c(i:(i+a))]
     # Condition 1
     if(sum(lst[[i]] %in% test.pattern.1) == length(test.pattern.1)) {lst2[[i]] <- lst[[i]]}
     # Condition 2
     if(identical(lst[[i]], test.pattern.1)) {lst3[[i]] <- lst[[i]]}
   }
   # Remove nulls from 'lst2' and 'lst3' (vapply also handles the case of an empty list)
   lst2 <- lst2[!vapply(lst2, is.null, logical(1))]
   lst3 <- lst3[!vapply(lst3, is.null, logical(1))]
   # Return the intersection of 'lst2' and 'lst3' which should be a match to the pattern vector.
   return(intersect(lst2, lst3))
}

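For reproducibility, I used set.seed() and created a test data set and pattern. The function find_vector() takes two arguments. The first, test.data, is the larger numeric vector in which to look for the pattern vector, and the second, test.pattern.1, is the shorter numeric vector you want to find in test.data. First, three lists are created: lst, which holds test.data split into smaller vectors whose length equals the length of the pattern vector; lst2, which holds the vectors from lst that meet the first condition; and lst3, which holds the vectors from lst that meet the second condition. The first condition tests whether the elements of a vector in lst are in the pattern vector. The second condition tests whether a vector from lst matches the pattern vector in order and element by element.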
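
The difference between the two conditions can be seen on a single window. The following is only an illustrative sketch: cond1 and cond2 are hypothetical wrappers around the two tests used inside find_vector(), and the vectors are made up.

```r
# Hypothetical wrappers around the two tests described above
cond1 <- function(w, p) sum(w %in% p) == length(p)  # every element of w occurs in p
cond2 <- function(w, p) identical(w, p)             # same values in the same order

p          <- c(1L, 2L, 3L)   # illustrative pattern
w_ordered  <- c(1L, 2L, 3L)   # same elements, same order
w_shuffled <- c(3L, 1L, 2L)   # same elements, different order

c(cond1(w_ordered, p),  cond2(w_ordered, p))    # both conditions hold
c(cond1(w_shuffled, p), cond2(w_shuffled, p))   # only condition 1 holds
```

Any window that satisfies the second condition also satisfies the first, so the windows that pass both tests are exactly those that pass the second.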

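One issue with this approach is that NULL values are introduced into each list when the conditions are not met, although the process stops once the conditions are met. For reference, you can print the lists to see all of the vectors tested, the vectors that meet the first condition, and the vectors that meet the second condition. The nulls can be removed. Removing the nulls and finding the intersection of lst2 and lst3 reveals the pattern that is an exact match in test.data.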

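To use the function, be sure to explicitly define test.data <- 'a numeric vector' and test.pattern.1 <- 'a numeric vector'. No special packages are needed. I did not do any benchmarking, but the function seems to run quickly. I also did not look for scenarios in which the function would fail.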
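
If only the positions of exact matches are needed, a more compact sketch is possible. This is an alternative technique, not the function from the answer: find_exact is a hypothetical name, and it returns start indices instead of the matched vectors.

```r
# Hypothetical helper: start positions of every exact occurrence of
# 'pattern' inside 'data', found by checking each full-length window.
find_exact <- function(data, pattern) {
  n <- length(pattern)
  starts <- seq_len(max(0, length(data) - n + 1))
  hit <- vapply(starts,
                function(s) all(data[s:(s + n - 1)] == pattern),
                logical(1))
  starts[hit]
}

# Same reproducible test data as in the answer
set.seed(12102015)
test.data <- sample(c(1:99), size = 500, replace = TRUE)
test.pattern.1 <- test.data[90:94]

pos <- find_exact(test.data, test.pattern.1)
```

Because the pattern was cut from positions 90 to 94 of test.data, pos must contain 90. It may contain other positions if the same five values happen to recur elsewhere in the data.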
