r - Are there MetaPhone functions (like SoundEx) in R, and how are they used?

Question

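I would like to use MetaPhone, Double Metaphone, Caverphone, MetaPhone3, SoundEx and, if anyone has implemented it yet, NameX functions within R, so that I can categorize and summarize similar values and keep data-cleansing work before analysis to a minimum.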

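I am fully aware that each algorithm has its own strengths and weaknesses. I would prefer not to use SoundEx, although it might still do if I cannot find an alternative. As described in this post, Harper matches a whole list of unrelated names under SoundEx, but should not match them under Metaphone, which gives better matching results.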

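I am not sure which algorithm best suits my purposes while still keeping some flexibility. That is why I want to try several of them and, before examining the values, generate a table like the following.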

[Image: example table, not reproduced here]

Source of the table (link)

Surnames are not the subject of my initial analysis, but I think they make a good example, because I effectively want to treat all "sound-alike" words as the same value.

A few things I have already looked at:

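I know I could write a C package and call it through Rcpp, and there is a C solution for SoundEx on SE. However, I have never written an R package before, and I would rather not reinvent the wheel if there is an easier way. Can this be done directly in R, or is there a package that provides the functionality?
The RecordLinkage package and the current stringdist package have SoundEx functions, but no MetaPhone function of any kind.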

So the specific answer I am looking for is how MetaPhone / Caverphone can be made to work in R, so that I know the "values" and can group my data by them.

One additional caveat: I do not use R every day, so I still consider myself new to it.

score 9 · Accepted Answer

The algorithm is fairly straightforward, but I could not find an existing R package either. If you really need to do this in R, one short-term option is to install the Python module metaphone (pip install metaphone) and then call it from R through the rPython bridge:

library(rPython)

python.exec("from metaphone import doublemetaphone")
python.call("doublemetaphone", "architect")
[1] "ARKTKT" ""

This is not the most elegant solution, but it does let you run metaphone operations from R.

Apache Commons also has a codec library that implements the metaphone algorithm:

library(rJava)

.jinit() # need to have commons-codec-1.10.jar in your CLASSPATH

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
.jcall(mp,"S","metaphone", "architect")
[1] "ARXT"

You can wrap the .jcall above in an R function and use it like any other R function:

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

sapply(c("abridgement", "stupendous"), metaphone)

## abridgement  stupendous 
##      "ABRJ"      "STPN"

The Java interface may also have the advantage of being more portable across platforms.

Here is a more complete view of using the Java interface:

library(rJava)

.jinit()

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

double_metaphone <- function(x) {
  .jcall(dmp,"S","doubleMetaphone", x)  
}

words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')

data.frame(metaphone=sapply(words, metaphone),
           double=sapply(words, double_metaphone))

##           metaphone double
## Catherine      K0RN   K0RN
## Katherine      K0RN   K0RN
## Katarina       KTRN   KTRN
## Johnathan      JN0N   JN0N
## Jonathan       JN0N   JN0N
## John             JN     JN
## Teresa          TRS    TRS
## Theresa         0RS    0RS
## Smith           SM0    SM0
## Smyth           SM0    SM0
## Jessica         JSK    JSK
## Joshua           JX     JX
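
To get from codes to grouped data values, which is what the question asks for, one option is to split the original words by their code. This is only a minimal sketch: the codes are copied by hand from the metaphone column above so that it runs without Java, and the names `codes`, `groups` and `shared` are purely illustrative.

```r
# Metaphone codes copied from the output table above; in practice they
# would come from sapply(words, metaphone).
codes <- c(Catherine = "K0RN", Katherine = "K0RN", Katarina = "KTRN",
           Johnathan = "JN0N", Jonathan = "JN0N", John = "JN",
           Teresa = "TRS", Theresa = "0RS", Smith = "SM0",
           Smyth = "SM0", Jessica = "JSK", Joshua = "JX")

# split() collects all spellings that share a code into one list element.
groups <- split(names(codes), codes)

# Keep only the codes that are shared by more than one spelling.
shared <- groups[lengths(groups) > 1]
shared
```

With these codes, Catherine/Katherine, Johnathan/Jonathan and Smith/Smyth end up grouped, while Teresa and Theresa stay apart because their metaphone codes differ.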

score 8 · Accepted Answer

The R package PGRdup now includes an implementation of Double Metaphone.

install.packages("PGRdup")
library(PGRdup)
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')
DoubleMetaphone(words)

$primary
 [1] "K0RN" "K0RN" "KTRN" "JN0N" "JN0N" "JN"   "TRS"  "0RS"  "SM0"  "SM0"  "JSK"  "JX"  

$alternate
 [1] "KTRN" "KTRN" "KTRN" "ANTN" "ANTN" "AN"   "TRS"  "TRS"  "XMT"  "XMT"  "ASK"  "AX"
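
One common way to use the two code lists is to treat two words as sound-alikes when any of their codes coincide. The sketch below assumes that convention; it is not something PGRdup prescribes. The codes are copied by hand from the output above so that it runs without the package, and `sounds_alike` and `match_matrix` are hypothetical names introduced for illustration.

```r
words <- c("Catherine", "Katherine", "Katarina", "Johnathan",
           "Jonathan", "John", "Teresa", "Theresa", "Smith",
           "Smyth", "Jessica", "Joshua")

# Codes copied from the DoubleMetaphone(words) output above.
primary   <- c("K0RN", "K0RN", "KTRN", "JN0N", "JN0N", "JN",
               "TRS", "0RS", "SM0", "SM0", "JSK", "JX")
alternate <- c("KTRN", "KTRN", "KTRN", "ANTN", "ANTN", "AN",
               "TRS", "TRS", "XMT", "XMT", "ASK", "AX")

# Hypothetical helper: TRUE when words i and j share at least one code.
sounds_alike <- function(i, j) {
  length(intersect(c(primary[i], alternate[i]),
                   c(primary[j], alternate[j]))) > 0
}

# Compare every pair of words; Vectorize() lets outer() call the helper.
idx <- seq_along(words)
match_matrix <- outer(idx, idx, Vectorize(sounds_alike))
dimnames(match_matrix) <- list(words, words)

match_matrix["Teresa", "Theresa"]      # share the alternate code "TRS"
match_matrix["Catherine", "Katarina"]  # alternate "KTRN" equals primary "KTRN"
match_matrix["John", "Jonathan"]       # no code in common
```

Under this rule Teresa and Theresa match even though their primary codes differ, which the single-code Metaphone comparison would miss.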

score 1 · Accepted Answer

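I have been working on a package for this, called phonics, for several months. I have implemented a number of algorithms, both common and less common, including Caverphone, Caverphone2, Metaphone, and soundex, along with several others. There are still a few more I plan to implement before calling it 1.0, but I have just submitted a release of the package to CRAN.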

score 0 · Accepted Answer

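Here is an interpretation of Caverphone. It takes the cascading-rules approach, but note that Caverphone was always intended as an example of customising to a regional accent context (although people do use it as a general-purpose algorithm, with a different set of trade-offs from most of the others, which are based on their own regions). I would therefore suggest that you a) get the unique characters in your data source to make sure you are handling all of them, b) consider changing the final length limit to something relevant to your data, and c) think about modelling the mix of regional accents. This one was meant to model the various accent groups of New Zealand in the late 1800s and early 1900s, and the ways they would mis-transcribe what each other was saying.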

caverphonise <- function(x) {
# Convert to lowercase
x <- tolower(x)

# Remove anything not A-Z
x <- gsub("[^a-z]", "", x)

# If the name starts with
## cough make it cou2f
x <- gsub("^cough", "cou2f", x)
## rough make it rou2f
x <- gsub("^rough", "rou2f", x)
## tough make it tou2f
x <- gsub("^tough", "tou2f", x)
## enough make it enou2f
x <- gsub("^enough", "enou2f", x)
## gn make it 2n
x <- gsub("^gn", "2n", x)

# If the name ends with
## mb make it m2
x <- gsub("mb$", "m2", x)

# Replace
## cq with 2q
x <- gsub("cq", "2q", x)
## ci with si
x <- gsub("ci", "si", x)
## ce with se
x <- gsub("ce", "se", x)
## cy with sy
x <- gsub("cy", "sy", x)
## tch with 2ch
x <- gsub("tch", "2ch", x)
## c with k
x <- gsub("c", "k", x)
## q with k
x <- gsub("q", "k", x)
## x with k
x <- gsub("x", "k", x)
## v with f
x <- gsub("v", "f", x)
## dg with 2g
x <- gsub("dg", "2g", x)
## tio with sio
x <- gsub("tio", "sio", x)
## tia with sia
x <- gsub("tia", "sia", x)
## d with t
x <- gsub("d", "t", x)
## ph with fh
x <- gsub("ph", "fh", x)
## b with p
x <- gsub("b", "p", x)
## sh with s2
x <- gsub("sh", "s2", x)
## z with s
x <- gsub("z", "s", x)
## any initial vowel with an A
x <- gsub("^[aeiou]", "A", x)
## all other vowels with a 3
x <- gsub("[aeiou]", "3", x)
## 3gh3 with 3kh3
x <- gsub("3gh3", "3kh3", x)
## gh with 22
x <- gsub("gh", "22", x)
## g with k
x <- gsub("g", "k", x)
## groups of the letter s with a S
x <- gsub("s+", "S", x)
## groups of the letter t with a T
x <- gsub("t+", "T", x)
## groups of the letter p with a P
x <- gsub("p+", "P", x)
## groups of the letter k with a K
x <- gsub("k+", "K", x)
## groups of the letter f with a F
x <- gsub("f+", "F", x)
## groups of the letter m with a M
x <- gsub("m+", "M", x)
## groups of the letter n with a N
x <- gsub("n+", "N", x)
## w3 with W3
x <- gsub("w3", "W3", x)
## wy with Wy
x <- gsub("wy", "Wy", x)
## wh3 with Wh3
x <- gsub("wh3", "Wh3", x)
## why with Why
x <- gsub("why", "Why", x)
## w with 2
x <- gsub("w", "2", x)
## any initial h with an A
x <- gsub("^h", "A", x)
## all other occurrences of h with a 2
x <- gsub("h", "2", x)
## r3 with R3
x <- gsub("r3", "R3", x)
## ry with Ry
x <- gsub("ry", "Ry", x)
## r with 2
x <- gsub("r", "2", x)
## l3 with L3
x <- gsub("l3", "L3", x)
## ly with Ly
x <- gsub("ly", "Ly", x)
## l with 2
x <- gsub("l", "2", x)
## j with y
x <- gsub("j", "y", x)
## y3 with Y3
x <- gsub("y3", "Y3", x)
## y with 2
x <- gsub("y", "2", x)

# remove all
## 2s
x <- gsub("2", "", x)
## 3s
x <- gsub("3", "", x)
# put six 1s on the end
x <- paste0(x, "111111")
# take the first six characters as the code
substr(x, 1, 6)
}
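
Suggestion (a) above, checking which characters actually occur in your data, can be sketched as follows. The sample vector `names_raw` is hypothetical and only there to make the example runnable. The point is that the `gsub("[^a-z]", "", x)` step in the function silently drops every character listed in `dropped`.

```r
# Hypothetical sample data; replace with the column you want to encode.
names_raw <- c("O'Brien", "Te Whaiti", "Smith-Jones")

# Every distinct character that occurs after lowercasing.
chars <- unique(unlist(strsplit(tolower(names_raw), "")))

# Characters outside a-z, which the function removes before encoding.
dropped <- setdiff(chars, letters)
dropped
```

If `dropped` contains anything that carries sound in your data (accented letters, for example), it is probably worth adding a rule for it before the removal step rather than letting it disappear.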
