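I want to split the character elements of a text vector. There are several patterns that act as split criteria ("and/ERT", "/$"). There are also exceptions to those patterns (":/$.", "and/ERT then", "./$. Smiley").

Attempt: match the cases where a split is needed, insert an unusual pattern ("^&*") there, and then strsplit on that specific pattern.

Problem: I don't know how to handle the exceptions properly. There are explicit cases where the unusual pattern ("^&*") has to be removed and the original text restored before running strsplit.

Code: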
text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")
patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")
exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")
# Without exceptions, this works. Unfortunately it splits "*/$*" into "*" and "/$*", which I would like to avoid. See the "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)
# Ideal split:
> textsplitted
[[1]]
[1] "This are faulty propositions one and/ERT"
[2] "two ,/$,"
[3] "which I want to split ./$."
[4] "There are cases where I explicitly want and/ERT"
[5] "some where I don't want to split ./$."
[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
[8] "Thank you ./$!"
[[2]]
[1] "This are the same faulty propositions one and/ERT"
[2] "two ,/$,"
#...
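# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
# Placeholder: the replacement should be the original text without "^&*"; this is the step I can't figure out.
text <- gsub(exceptionsSplit, "[original text without '^&*']", text)
textsplitted <- strsplit(text, "^&*", fixed = TRUE)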
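
The restore step could be sketched roughly as follows. This is only a minimal sketch under several assumptions, not a definitive solution: the names `marker`, `splitPatterns`, `exceptions`, `tokenPattern`, `mark`, `splitWithExceptions` and `example` are made up for illustration; the exceptions are given as literal strings rather than escaped regular expressions; the marker is placed after the whole whitespace-delimited token that contains a split pattern (instead of before the pattern), so that a token like ",/$," stays intact; and the marker is assumed never to occur in the text itself.

```r
marker <- "^&*"                          # assumed not to occur in the text
splitPatterns <- c("and/ERT", "/\\$")    # regular expressions
exceptions <- c(":/$.", "and/ERT then", "./$. Smiley")  # literal strings (assumption)

# Match the whole whitespace-delimited token that contains a split pattern.
tokenPattern <- paste0("(\\S*(?:", paste(splitPatterns, collapse = "|"), ")\\S*)")

# Append the marker to every token that contains a split pattern.
mark <- function(x) gsub(tokenPattern, paste0("\\1", marker), x, perl = TRUE)

splitWithExceptions <- function(x) {
  marked <- mark(x)
  # Undo the marking wherever an exception occurs: mark(e) is what the
  # exception looks like after marking, so replace it with the original e.
  for (e in exceptions) {
    marked <- gsub(mark(e), e, marked, fixed = TRUE)
  }
  # Split on the remaining markers and drop surrounding whitespace.
  lapply(strsplit(marked, marker, fixed = TRUE), trimws)
}

example <- "one and/ERT two ,/$, three ./$. For example :/$. an and/ERT then four ./$. Smiley !/$. Thank you ./$!"
result <- splitWithExceptions(example)
print(result[[1]])
```

Marking each exception with the same `mark()` function keeps the restore step in sync with whatever the split patterns are, so the exceptions do not need a hand-written "marked" form. `trimws()` requires R 3.2.0 or later.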