I want to split a vector of character elements, text, into sentences. There is more than one splitting pattern ("and/ERT", "/$"). There are also exceptions to the patterns (":/$.", "and/ERT then", "./$. Smiley").

Attempt: match the cases where a split should occur, insert an unusual marker ("^&*") at that position, then strsplit on that marker.

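Problem: I don't know how to handle the exceptions properly. There are explicit cases where the unusual marker ("^&*") has to be removed and the original text restored before running strsplit.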

Code:

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# If you don't have exceptions, it works here. Unfortunately it splits "*$/*" into "*" and "$/*". Would be convenient to avoid this. See example "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
 [1] "This are faulty propositions one and/ERT" 
 [2] "two ,/$," 
 [3] "which I want to split ./$."
 [4] "There are cases where I explicitly want and/ERT" 
 [5] "some where I don't want to split ./$." 
 [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
 [7] "This is also one case where I dont't want to split ./$. Smiley !/$." 
 [8] "Thank you ./$!"

[[2]]
 [1] "This are the same faulty propositions one and/ERT"
 [2] "two ,/$,"
#...      

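# This attempt doesn't work! The replacement in the second gsub() is pseudocode: it is unclear how to restore the original text there.
text <- gsub(patternSplit, "^&*\\1", text)
text <- gsub(exceptionsSplit, "[original text without "^&*"]", text)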
textsplitted <- strsplit(text, "^&*", fixed = TRUE)

1 Answer

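I think you can use this expression to achieve the splits you want. Because strsplit consumes the characters it splits on, you have to split on the space that follows the things to match or not match (which is what the desired output in the OP shows):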

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"                                 
#[2] "two ,/$,"                                                                 
#[3] "which I want to split ./$."                                               
#[4] "There are cases where I explicitly want and/ERT"                          
#[5] "some where I don't want to split ./$."                                    
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."      
#[8] "Thank you ./$!" 
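
Since strsplit is vectorised over its first argument, the same pattern should also work on a whole character vector in one call. The following is only a sketch: txt is a shortened, hypothetical input (not the OP's full text vector) and split_rx is just a name chosen here for the pattern from the call above.

```r
# Shortened, hypothetical input; each element mixes split cases and exceptions
txt <- c("one and/ERT two ,/$, three ./$. Smiley !/$. end ./$!",
         "For example :/$. an and/ERT then no split ./$. done ./$!")

# Same pattern as in the answer, stored under an assumed name
split_rx <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# One call splits every element; the result is a list with one vector per element
pieces <- strsplit(txt, split_rx, perl = TRUE)
print(pieces)
```

The first element should come back in four pieces and the second in two, because ":/$.", "and/ERT then" and "./$. Smiley" are left unsplit.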

Explanation

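  • (?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
  • (?!then) - but only if that space is NOT followed, (?!...), by "then"
  • | - OR operator that chains the next expression
  • (?<=/\\$[[:punct:]]) - positive lookbehind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - matches a space that is NOT preceded by ":/$" plus a punctuation character (but, per the previous point, IS preceded by "/$" plus a punctuation character) and is NOT followed, (?!...), by "Smiley"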
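
To see each lookaround in isolation, the two alternatives of the pattern can be checked separately with grepl on tiny strings. This is a rough sketch: the names rx_and and rx_tag and the test strings are made up here for illustration and do not come from the original post.

```r
# First alternative: a space preceded by "and/ERT" and not followed by "then"
rx_and <- "(?<=and/ERT)\\s(?!then)"
# Second alternative: a space preceded by "/$" + punctuation, but not by ":/$" + punctuation,
# and not followed by "Smiley"
rx_tag <- "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

and_hits <- c(split     = grepl(rx_and, "and/ERT two",  perl = TRUE),
              exception = grepl(rx_and, "and/ERT then", perl = TRUE))

tag_hits <- c(split      = grepl(rx_tag, "./$. There",  perl = TRUE),
              colon_case = grepl(rx_tag, ":/$. when",   perl = TRUE),
              smiley     = grepl(rx_tag, "./$. Smiley", perl = TRUE))

print(and_hits)
print(tag_hits)
```

Only the entries named split should be TRUE. The other entries are the exceptions, so the space in them is never a split point.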
Answered 2013-09-09T12:55:06.913