r - 2 つの文字列間の不連続な一致を見つける R の最長の共通部分文字列

Question

R で最も長い一般的な部分文字列を見つけることについて質問があります。StackOverflow のいくつかの投稿を検索しているときに、qualV パッケージについて知りました。ただし、このパッケージの LCS 関数は、連続していなくても、string2 に存在する string1 のすべての文字を実際に検出することがわかります。

説明するために、文字列が string1 : " hello " string2 : " hel 12345lo" の場合、出力は hel であると予想されますが、出力は hello として取得されます。私は何か間違ったことをしているに違いない。以下の私のコードを見てください。

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

Rlibstree メソッドも試しましたが、まだ連続していない部分文字列が得られます。また、部分文字列の長さも私の予想から外れています.s 以下を参照してください。

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

score 12 · Accepted Answer

考えられる解決策は 3 つあります。

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

またadist()、agrep()ベース R にはとがあり、stringdistパッケージには LCS メソッドを実行するいくつかの関数があります。をご覧くださいstringsidt。ペアになっていない文字の数を返します。

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

adist()これをもう少し詳しく調べたので、進むべき道かもしれないと思います。設定counts=TRUEすると、一致、挿入などのシーケンスが取得されます。したがって、それを指定stri_locate()すると、その行列を使用して a から b への一致を取得できます。

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

そのため、M値は直接一致することを示します。で部分文字列を取得できますstri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

申し訳ありませんが、文字列距離アルゴリズムに精通していないため、うまく説明できませんでした。

score 9 · Accepted Answer

adist使用できる@RichardScrivenの洞察を活用します（「おおよその文字列距離」を計算します。より包括的な関数を作成しました"trafos"。2つの文字列間の「距離」を決定するために使用される「変換」を表すことに注意してください（下の例）

編集この回答は、間違った/予期しない結果をもたらす可能性があります。@wdkrnlsが指摘したように：

「apple」と「big apple bagels」に対して関数を実行したところ、「appl」が返されました。私は「リンゴ」を期待していたでしょう。

間違った結果については、以下の説明を参照してください。longest_stringリスト内のを取得する関数から始めます。

longest_string <- function(s){return(s[which.max(nchar(s))])}

次に、@RichardSriven の作品とstringiライブラリを使用できます。

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

または、コードを書き直して、外部ライブラリを使用しないようにすることもできます(R のネイティブadist関数を引き続き使用します)。

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

'hello '上記の両方の関数は、' ではなく、最も長い共通部分文字列としてのみ識別hello r'します (どちらの引数が 2 つのうち長いかは関係ありません)。

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

最終編集 この結果の奇妙な動作に注意してください。

lcsbstr('hello world', 'hello')
#[1] 'hell'

私は期待してい'hello'ましたが、変換が実際に (削除によって) 世界の "o" を地獄の "o" に移動するため、次のように地獄の部分のみが一致すると見なされますM:

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

この動作は、この Levenstein ツールを使用して観察されます。これにより、これら 2 つの変換に相当する 2 つの可能な解が得られます。

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

adistあるソリューションを別のソリューションよりも優先するように構成できるかどうかわかりません。(変換の「重み」は同じです -- 「M」と「D」の数は同じです --連続 Mした数が多い変換を優先する方法がわかりません)

最後に、adist で渡すことができることを忘れないでくださいignore.case = TRUE(FALSEがデフォルトです)。

"trafos"のプロパティへのキーadist; ある文字列から別の文字列に変換する「変換」:

変換シーケンスは、戻り値の「trafos」属性として、要素M、I、DおよびS一致、挿入、削除、および置換を示す文字列として返されます。

score 1 · Accepted Answer

「こんにちは」の出力を取得するために何をしたかわかりません。以下の試行錯誤の実験に基づいて、LCS関数は次のように見えます。(b) 複数の同じ長さの LCS を検索します (最初のものだけを検索する sub() とは異なります)。(c) 文字列内の要素の順序は重要ではありません。以下に図はありません。(b) LCS 呼び出しでの文字列の順序は重要ではありません。これも表示されません。

したがって、b の「hel」の後に文字が続くため、a の「hello」には b に LCS がありませんでした。まあ、それが私の現在の仮説です。

上記のポイント A:

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

上記のポイント B:

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

r - 2 つの文字列間の不連続な一致を見つける R の最長の共通部分文字列

5 に答える 5

Related

Reference