r - Chemical name matching in R

Question

[Update]

Problem

I have two databases:

1:

1   Name: D-Tagatose 1,6-bisphosphate
2   Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
3   Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
4   Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
5   Name: H+;: Hydron

2:

>  <NAME>    Benzaldehyde, 4-[(trimethylsilyl)oxy]-     >  <SYNONYMS>    Benzaldehyde, p-(trimethylsiloxy)-
>  <NAME>    Benzeneacetic acid, methyl ester           >  <SYNONYMS>    q qer
>  <NAME>    Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester    >  <SYNONYMS>    Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #
>  <NAME>    Mevalonic lactone, trimethylsilyl deriv.   >  <SYNONYMS>    Mevalonic lactone, trimethylsilyl
>  <NAME>    Benzeneacetic acid, phenylmethyl ester     >  <SYNONYMS>    Acetic acid, phenyl-, benzyl ester

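Desired output:

Match the names or synonyms in database 2 against the names in database 1. Because these are chemical compounds, the same compound can appear under slightly different names, so I have also used the linked online databases for the matching.

Test input:

Please see the linked Excel file: data

What have I tried?

Matching on the name alone (the "Name: " prefix has to be stripped from the names in db 1)
Partial name matching -> clearly not the best idea for chemical names.
Matching with the help of the following databases:

ChEBI

NIST

PubChem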
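For reference, the name-only approach can be sketched in base R as below. This is a minimal sketch, not the code I actually ran: the helper normalise() is a name introduced here, and the second row of db2 is an invented record added only so that the example produces a match. Every ";: "-separated name in database 1 is compared with both NAME and SYNONYMS of database 2 after trimming and lower-casing.

```r
# Normalise a chemical name for exact comparison: trim, lower-case,
# and collapse internal runs of whitespace to a single space.
normalise <- function(x) {
  x <- tolower(trimws(x))
  gsub("\\s+", " ", x)
}

# Database 1: one string per compound, names separated by ";: ".
db1 <- c("Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione",
         "Name: H+;: Hydron")

# Database 2: one NAME and one SYNONYMS value per compound.
# The second row is made up for this illustration.
db2 <- data.frame(
  name    = c(" Benzeneacetic acid, methyl ester", " androst-4-ene-3,17-dione"),
  synonym = c(" Acetic acid, phenyl-, methyl ester", " NA"),
  stringsAsFactors = FALSE
)

# Long table for database 1: one row per (record, name).
parts <- strsplit(sub("^Name: ", "", db1), ";: ", fixed = TRUE)
long1 <- data.frame(id1 = rep(seq_along(parts), lengths(parts)),
                    key = normalise(unlist(parts)),
                    stringsAsFactors = FALSE)

# Long table for database 2: NAME and SYNONYMS stacked,
# with the " NA" placeholders dropped.
long2 <- data.frame(id2 = rep(seq_len(nrow(db2)), 2),
                    key = normalise(c(db2$name, db2$synonym)),
                    stringsAsFactors = FALSE)
long2 <- long2[long2$key != "na", ]

# Exact join on the normalised name.
matches <- unique(merge(long1, long2, by = "key")[, c("id1", "id2", "key")])
matches
```

Exact matching of this kind only catches differences in case and spacing. It cannot tell that "Butanoic acid" and "Butyric acid" name the same compound, which is why an identifier from an external database is needed.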

Small R input:

Input 1

structure(c(">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
">  <NAME>", " Benzaldehyde, 4-[(trimethylsilyl)oxy]-", " Benzeneacetic acid, methyl ester", 
" Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester", 
" Mevalonic lactone, trimethylsilyl deriv.", " Benzeneacetic acid, phenylmethyl ester", 
" Butanoic acid, 3,3-dimethyl-, methyl ester", " Acetic acid, (4-(trifluoromethoxy)phenyl)methyl ester", 
" Phosphoramidothioic acid, O,S-dimethyl ester", " Octanoic acid, phenylmethyl ester", 
" Benzenepropanoic acid, methyl ester", " 2-Propenoic acid, 3-phenyl-, methyl ester", 
" Propanoic acid, 2-methyl-, phenylmethyl ester", " Acetic acid, (2,3-dichlorophenyl)methyl ester", 
" L-Methionine, methyl ester", " Butanoic acid, phenylmethyl ester", 
"<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
"<SYNONYMS>", ">  <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
"<SYNONYMS>", "<SYNONYMS>", ">  <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
" Benzaldehyde, p-(trimethylsiloxy)-", " Acetic acid, phenyl-, methyl ester", 
" Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #", 
" Mevalonic lactone, trimethylsilyl", " Acetic acid, phenyl-, benzyl ester", 
" Butyric acid, 3,3-dimethyl-, methyl ester", " NA", " Methamidophos", 
" Octanoic acid, benzyl ester", " Hydrocinnamic acid, methyl ester", 
" Cinnamic acid, methyl ester", " Isobutyric acid, benzyl ester", 
" NA", " Methyl 2-amino-4-(methylsulfanyl)butanoate #", " Butyric acid, benzyl ester"
), .Dim = c(15L, 4L), .Dimnames = list(c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"), 
    c("NAME", NA, "NA.1", "NA.2")))

Input 2

structure(c("Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol", 
"Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione", 
"Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine", 
"Name: H+;: Hydron", "Name: 3-Iodo-L-tyrosine", "Name: 3-Methoxytyramine", 
"Name: 3-Methoxy-4-hydroxyphenylacetaldehyde;: (4-Hydroxy-3-methoxyphenyl)acetaldehyde;: Homovanillin", 
"Name: L-Noradrenaline;: Noradrenaline;: Norepinephrine;: Arterenol;: 4-[(1R)-2-Amino-1-hydroxyethyl]-1,2-benzenediol", 
"Name: 3,4-Dihydroxymandelaldehyde;: 3,4-Dihydroxyphenylglycolaldehyde", 
"Name: L-Metanephrine", "Name: L-Adrenaline;: (R)-(-)-Adrenaline;: (R)-(-)-Epinephrine;: (R)-(-)-Epirenamine;: (R)-(-)-Adnephrine;: 4-[(1R)-1-Hydroxy-2-(methylamino)ethyl]-1,2-benzenediol", 
"Name: 3-Methoxy-4-hydroxyphenylglycolaldehyde", "Name: L-Normetanephrine", 
"Name: L-Dopachrome;: 2-L-Carboxy-2,3-dihydroindole-5,6-quinone", 
"Name: 5,6-Dihydroxyindole;: DHI"), .Dim = c(15L, 1L))

score 1 · Accepted Answer

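If you really need an R solution for this, try something like the following. I think you will need to tidy up your input, especially the fourth element of the second set. What I have put together only works on the first name of each chemical; I will leave the synonym work to you.

It does not look as though all of your chemical names are registered in the ChemSpider database. Catching entries whose name is not found is an important part of the function; without it everything breaks.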
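To show the catching idea in isolation, here is a minimal sketch that runs without any network access. safe_lookup() and fake_lookup() are hypothetical helpers introduced only for this illustration, and "id-001" is a placeholder, not a real ChemSpider id. Any error raised by the lookup becomes NA, so one bad name no longer stops the whole loop.

```r
# Wrap any lookup function so that an error becomes NA
# instead of aborting the surrounding loop.
safe_lookup <- function(lookup, name) {
  tryCatch(lookup(name), error = function(e) NA_character_)
}

# Hypothetical stand-in for a web lookup: it knows one name
# and raises an error for everything else.
fake_lookup <- function(name) {
  known <- c("Spermine" = "id-001")  # placeholder id, invented here
  if (!name %in% names(known)) stop("no record for: ", name)
  unname(known[name])
}

# vapply() keeps the result a plain character vector.
ids <- vapply(c("Spermine", "not a real compound"),
              function(n) safe_lookup(fake_lookup, n),
              character(1), USE.NAMES = FALSE)
ids
```

Returning NA rather than a text marker makes failures easy to filter with is.na() before joining the two tables.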

You need to register with ChemSpider to get a token for the API. It is free.

The example chemical names you gave do not appear to match between the two datasets, so df3 below will not contain any matches. I hope this helps.

library(RCurl)
library(XML)

token <- "your token here" # from chemspider profile
#url <- "http://www.chemspider.com/Search.asmx/AsyncSimpleSearch?query="
url <- "www.chemspider.com/Search.asmx/SimpleSearch?query="

chemCrawl <- function(chemname){ # Query chemspider with chemical names, return ids. 
  # df1[13] in particular seems to throw an error. Don't know why. 
  chem.id <-tryCatch(xmlValue(xmlRoot(xmlTreeParse(
    getURL(paste(url, "\"", curlEscape(chemname), "\"" ,"&token=" ,
                 token, sep = ""))
  ))), error=function(err) { 
    "oops"} )
  return(chem.id)
}

df1 <- as.data.frame(structure(c(">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
            ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
            ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", ">  <NAME>", 
            ">  <NAME>", " Benzaldehyde, 4-[(trimethylsilyl)oxy]-", " Benzeneacetic acid, methyl ester", 
            " Cyclopropaneoctanoic acid, 2-[[2-[(2-ethylcyclopropyl)methyl]cyclopropyl]methyl]-, methyl ester", 
            " Mevalonic lactone, trimethylsilyl deriv.", " Benzeneacetic acid, phenylmethyl ester", 
            " Butanoic acid, 3,3-dimethyl-, methyl ester", " Acetic acid, (4-(trifluoromethoxy)phenyl)methyl ester", 
            " Phosphoramidothioic acid, O,S-dimethyl ester", " Octanoic acid, phenylmethyl ester", 
            " Benzenepropanoic acid, methyl ester", " 2-Propenoic acid, 3-phenyl-, methyl ester", 
            " Propanoic acid, 2-methyl-, phenylmethyl ester", " Acetic acid, (2,3-dichlorophenyl)methyl ester", 
            " L-Methionine, methyl ester", " Butanoic acid, phenylmethyl ester", 
            "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
            "<SYNONYMS>", ">  <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
            "<SYNONYMS>", "<SYNONYMS>", ">  <SYNONYMS>", "<SYNONYMS>", "<SYNONYMS>", 
            " Benzaldehyde, p-(trimethylsiloxy)-", " Acetic acid, phenyl-, methyl ester", 
            " Methyl 8-[2-((2-[(2-ethylcyclopropyl)methyl]cyclopropyl)methyl)cyclopropyl]octanoate #", 
            " Mevalonic lactone, trimethylsilyl", " Acetic acid, phenyl-, benzyl ester", 
            " Butyric acid, 3,3-dimethyl-, methyl ester", " NA", " Methamidophos", 
            " Octanoic acid, benzyl ester", " Hydrocinnamic acid, methyl ester", 
            " Cinnamic acid, methyl ester", " Isobutyric acid, benzyl ester", 
            " NA", " Methyl 2-amino-4-(methylsulfanyl)butanoate #", " Butyric acid, benzyl ester"
), .Dim = c(15L, 4L), .Dimnames = list(c("1", "2", "3", "4", 
                                         "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15"), 
                                       c("NAME", NA, "NA.1", "NA.2"))))

names(df1) <-c("class1", "name", "class2", "synonym") 
df1$name <- as.character(df1$name)
df1[1, 2] # the names carry a leading space
df1$name <- sub("^\\s+", "", df1$name) # strip leading whitespace only
#details of chemspider search api: http://www.chemspider.com/Search.asmx

df1$chem.id <- lapply(df1$name, chemCrawl)
head(df1)

name2 <- structure(c("Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol", 
                   "Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione", 
                   "Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine", 
                   "Name: H+;: Hydron", "Name: 3-Iodo-L-tyrosine", "Name: 3-Methoxytyramine", 
                   "Name: 3-Methoxy-4-hydroxyphenylacetaldehyde;: (4-Hydroxy-3-methoxyphenyl)acetaldehyde;: Homovanillin", 
                   "Name: L-Noradrenaline;: Noradrenaline;: Norepinephrine;: Arterenol;: 4-[(1R)-2-Amino-1-hydroxyethyl]-1,2-benzenediol", 
                   "Name: 3,4-Dihydroxymandelaldehyde;: 3,4-Dihydroxyphenylglycolaldehyde", 
                   "Name: L-Metanephrine", "Name: L-Adrenaline;: (R)-(-)-Adrenaline;: (R)-(-)-Epinephrine;: (R)-(-)-Epirenamine;: (R)-(-)-Adnephrine;: 4-[(1R)-1-Hydroxy-2-(methylamino)ethyl]-1,2-benzenediol", 
                   "Name: 3-Methoxy-4-hydroxyphenylglycolaldehyde", "Name: L-Normetanephrine", 
                   "Name: L-Dopachrome;: 2-L-Carboxy-2,3-dihydroindole-5,6-quinone", 
                   "Name: 5,6-Dihydroxyindole;: DHI"), .Dim = c(15L, 1L))

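name2 <- sub("^Name: ", "", name2)  # drop the "Name: " prefix
name2 <- sub(";.+$", "", name2)     # keep only the first name of each entry
df2 <- data.frame(name2 = as.vector(name2), stringsAsFactors = FALSE)
df2$chem.id <- lapply(df2$name2, chemCrawl)
head(df2)
df1$chem.id <- as.character(df1$chem.id)
df2$chem.id <- as.character(df2$chem.id)
# "oops" (failed request) and empty strings are not ids. Drop those rows so
# that failed lookups in df1 and df2 cannot be joined to each other.
valid <- function(id) !is.na(id) & !id %in% c("oops", "")
df3 <- merge(df1[valid(df1$chem.id), ], df2[valid(df2$chem.id), ],
             by = "chem.id", all = TRUE)
df3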
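One thing to watch when joining on looked-up ids: to merge(), a failure marker such as "oops" is just another string, so failed lookups from both tables would pair up as if they were matches. The sketch below uses two tiny made-up tables (the names and the id "42" are invented) to show the effect and one way to avoid it; is_id() is a helper introduced here.

```r
# Two toy tables; "oops" marks a failed lookup in each.
a <- data.frame(name    = c("x1", "x2"),
                chem.id = c("oops", "42"),
                stringsAsFactors = FALSE)
b <- data.frame(name2   = c("y1", "y2"),
                chem.id = c("oops", "42"),
                stringsAsFactors = FALSE)

# Naive join: the two failures are reported as a match.
naive <- merge(a, b, by = "chem.id")

# Keep only rows that carry a usable id before joining.
is_id <- function(id) !is.na(id) & !id %in% c("oops", "")
clean <- merge(a[is_id(a$chem.id), ], b[is_id(b$chem.id), ], by = "chem.id")

nrow(naive)
nrow(clean)
```

The same check applies to NA, because merge() by default also pairs NA keys with each other.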
