
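I have an AppleScript that searches for and replaces about 100 terms using regular expressions. I want to bring this search-and-replace functionality into R, so I saved the AppleScript as a text file in Script Editor and imported it into R with readLines(). The dput() output of that import is shown as punct.out below. If I build my own data frame of patterns and replacements from raw vectors (see punct below) rather than from the import, search and replace on a test string (see test below) works fine. But when I run the same command with the imported data frame, it fails and returns NA.

Somehow the imported text is not being interpreted as regular expressions, or as a character vector... I can't figure it out.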

#structure of my imported patterns and replacements
punct.out<-structure(list(replace = c(NA, NA, "good-bye[a-z]+|good-bye", 
"good bye[a-z]+|good bye", "good-", "ill at ease", "ill-", "-like", 
" well,", "- well,", ", well,", "as well", ".,", ".... well", 
"... well", ". Well,", ": well,", "well-", "well,", "well,", 
"well,", "Well,", "- okay,", ", okay,", "okay,", " okay,", ".... okay", 
"... okay", ". Okay,", ": okay,", "OK", "'okay,", "okay,", "Okay,", 
"Okay", ", too", "too /", "too,", "too.", "too?", "too:", "(No)(. )([0-    9]+)", 
"( [A-Z])(.)( )", "www.", "ain't", "let's", "won't", "can't", 
"n't", "cannot", "'d", "'ll", "'m", "'ve", "'re", "!", "?", ";", 
"", ",", "--", "-", "-", "é", "è", "à", "ç", "&", "%", "per cent", 
"_", "Que.", "Ont.", "Nfld.", "Alta.", "Man.", "Sask.", "St.", 
"Ste.", "i.e.", "Mr.", "Ms.", "Mrs.", "Prof.", ".com", "a. m.", 
"p. m.", "a.m.", "p.m.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", 
"Jul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.", "gen.", "Dr.", 
"e. coli", "(.)([A-Z])(.)", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([0-9])(.)([0-9])", "()(S)", "([a-z]+)(')", "(')([a-z]+)", "bull ' s eye", 
"no man ' s land", "pandora ' s box", "....", "...", ".", ",", 
":", "", "", "", "", NA, NA), with = c("character(0)", "character(0)", 
"goodbye", "goodbye", "good x", "ill at xease", "ill x", " xlike", 
" xwell", " xwell", " xwell", "as xwell", " ", " xwell", " xwell", 
". xWell", ": xwell", "well x", "xwell", " xwell", "xwell", "xWell", 
" xokay", " xokay", " xokay", " xokay", " xokay", " xokay", ". xOkay", 
": xokay", "okay", "xokay", "xokay", "xOkay", "xOkay", " xtoo", 
"xtoo /", "xtoo", "xtoo.", "xtoo.", "xtoo", "#\\\\3", "\\\\1\\\\3", 
"www", "am not", "let us", "will not", "can not", " not", "can not", 
" would", " will", " am", " have", " are", ".", ".", "", "", 
"", " ", " ", " ", "e", "e", "a", "c", "and", "percent", "percent", 
" ", "Que", "Ont", "Nfld", "Alta", "Man", "Sask", "St", "Ste", 
"ie", "Mr", "Ms", "Mrs", "Prof", "com", "am", "pm", " am", " pm", 
"Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sept", "Oct", 
"Nov", "Dec", "gen", "Dr", "e coli", "\\\\1\\\\2 ", "\\\\1\\\\3", 
"\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1dot\\\\3", 
"\\\\1 \\\\2", "\\\\1 \\\\2", "\\\\1 \\\\2", "bull's eye", "no man's land", 
"pandora's box", "", "", " . ", " ,", "", " ", " ", " ", " ", 
"character(0)", "character(0)")), .Names = c("replace", "with"
), row.names = c(NA, -127L), class = "data.frame")

#library
library(stringi)
#test string
test<-c('Sept.', 'Mr.', 'Oct.', 'ill at ease', 'as well', 'Dr.', 'OK',
'well,', '.com')
#data frame of patterns and replacements
punct<-data.frame(replace=c('ill at ease', 'Sept.', 'Mr.', 'Oct.', 'as well',
'Dr.', 'OK', 'well,', '.com'), with=c('ill at xease', 'Sept',
'Mr', 'Oct', 'as xwell', 'Dr', 'okay', 'xwell', 'com'),
stringsAsFactors=FALSE)
#This works
stri_replace_all_regex(test, punct$replace, punct$with, vectorize_all=FALSE)
#But this doesn't
stri_replace_all_regex(test, punct.out$replace, punct.out$with, vectorize_all=FALSE)

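Second problem: I solved the problem above based on the comments below. However, a few of the regular expressions still have a specific problem. Specifically, I can't work out how to escape the backslashes so that \1, \2, etc. output the first and second groups matched by the regular expression.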

#Define data
punct.out<-structure(list(replace = c("(\\.)([A-Z])(\\.)", "([A-Z])(\\.)([A-Z])",
"([0-9])(\\.)([0-9])", "([a-z]+)(')", "(')([a-z]+)"), with =
c("\\\\1\\\\2 ",
"\\\\1\\\\3", "\\\\1dot\\\\3", "\\\\1 \\\\2", "\\\\1 \\\\2")), .Names =
c("replace",
"with"), row.names = c(104L, 105L, 110L, 112L, 113L), class = "data.frame")
#Test string of characters that the above regexes are supposed to match
test<-c('.B.', 'B.B', '1.1', 'premier\'s')
#This sort of works, but I clearly haven't figured out how to properly escape
#the backslashes to capture the references
stri_replace_all_regex(test, punct.out$replace, punct.out$with, vectorize_all=FALSE)
#Based on the help for stri_replace I also tried using $ to capture the references.
punct.out$with<-gsub('\\\\\\\\', '$', punct.out$with)
#And it did work.
stri_replace_all_regex(test, punct.out$replace, punct.out$with, vectorize_all=FALSE)
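
For reference, here is a minimal sketch of why the $ form works. It assumes stringi's ICU regex engine, where capture groups in the replacement are written $1, $2, ... and a backslash only makes the following character literal, whereas base R functions such as gsub() use \1, \2. The variable names (pattern, imported_with, converted and so on) and the small inputs are illustrative only, not taken from the imported file.

```r
library(stringi)

pattern <- "([A-Z])(\\.)([A-Z])"

# Base R: groups are referenced as \1, \3 (typed "\\1\\3" in R source).
base_result <- gsub(pattern, "\\1\\3", "B.B")

# stringi (ICU): the same groups are referenced as $1, $3.
icu_result <- stri_replace_all_regex("B.B", pattern, "$1$3")

# An imported replacement that holds two literal backslashes per reference,
# i.e. the characters \\1dot\\3 (assumed to mirror the dput() output above).
imported_with <- "\\\\1dot\\\\3"

# Swap every pair of literal backslashes for a single "$".
converted <- stri_replace_all_fixed(imported_with, "\\\\", "$")

dot_result <- stri_replace_all_regex("1.1", "([0-9])(\\.)([0-9])", converted)

print(c(base_result, icu_result, converted, dot_result))
```

Using a fixed (non-regex) replacement for the conversion avoids having to escape the backslashes a second time for the regex engine.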

1 Answer


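punct.out contains missing observations (NA), which is why you see NAs in the output. You should remove them first, for example with na.omit. In addition, because you are doing regular expression matching, some characters (e.g. .) need to be escaped, that is, preceded by a backslash. Also note that the first column contains several empty strings; those need to be removed as well.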
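
The cleanup could look roughly like the sketch below. It uses a small stand-in data frame (imported, a hypothetical subset shaped like punct.out) and assumes that every dot in the patterns is meant literally; other metacharacters such as ? would need the same treatment, and patterns that rely on . as a wildcard should be left alone.

```r
library(stringi)

# Hypothetical stand-in for the imported table, with the same kinds of
# problem rows as punct.out: an NA pattern and an empty-string pattern.
imported <- data.frame(
  replace = c(NA, "Sept.", "Mr.", "", "as well"),
  with    = c("character(0)", "Sept", "Mr", " ", "as xwell"),
  stringsAsFactors = FALSE
)

# Keep only rows whose pattern is neither missing nor empty.
keep  <- !is.na(imported$replace) & nzchar(imported$replace)
clean <- imported[keep, ]

# Escape literal dots so the regex engine does not treat them as wildcards.
clean$replace <- stri_replace_all_fixed(clean$replace, ".", "\\.")

test   <- c("Sept.", "Mr.", "as well")
result <- stri_replace_all_regex(test, clean$replace, clean$with,
                                 vectorize_all = FALSE)
print(result)
```

Filtering on the pattern column directly, rather than calling na.omit on the whole data frame, also catches the empty strings in the same step.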

answered 2016-05-19T14:28:19.913