r - Parsing data with multiple strings in R

Question

I'm trying to write code to parse a single column that contains several pieces of information. For example, suppose I have the following data frame called df:

  ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue
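
To reproduce the example, the data frame can be rebuilt roughly as follows. This is a sketch that assumes `info` is stored as character (`stringsAsFactors = FALSE`); in the real data it may well be a factor.

```r
# Rebuild the example data frame.
# Assumption: info is character, not factor.
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)
```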

Running table(df) gives the following:

    table(df)
         info
    ids   blue circle;blue circle;red;green red;blue red;circle
      101    0           0                0        0          1
      102    0           0                1        0          0
      103    0           1                0        0          0
      122    0           0                0        0          0
      170    0           0                0        1          0
      213    1           0                0        0          0
         info
    ids   red;green
      101         0
      102         0
      103         0
      122         1
      170         0
      213         0

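What I want to do is: 1. split the info column into two columns, one for shape and one for color; and 2. label any ID that has more than one color as "Multicolored". So I wrote the following: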

df$shape <- as.character(df$info)
for (i in 1:dim(df)[1]){
  if (grepl("circle",df$info[i])==TRUE) {
    df$shape[i] <- "circle" 
  } else if (grepl("circle",df$info[i])==FALSE) {
    df$shape[i]<-NA}
}
for (i in 1:dim(df)[1]){
  if (grepl(";",df$info[i])==TRUE) {
    df$info[i] <- "Multicolored" 
  } else {df$info[i]<-df$info[i]}
}

This code gives the following output:

df
  ids         info  shape
1 101 Multicolored circle
2 103 Multicolored circle
3 122 Multicolored   <NA>
4 102 Multicolored circle
5 213         blue   <NA>
6 170 Multicolored   <NA>

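As my code is written, it labels entries such as 101 red;circle as Multicolored, when they are really just red and a circle. What is the correct way to parse this data when "circle" can appear first, in the middle, or last in the info column? Any suggestions are welcome. Thanks.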
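
The mislabeling can be reproduced in isolation. A minimal sketch: the second loop only tests for a `;`, so every entry with two or more tokens is flagged, whether or not the extra token is a color. The vector `info` here is a small illustrative sample, not the full data.

```r
# Three sample entries: color + shape, two colors, single color
info <- c("red;circle", "red;green", "blue")

# The test used in the second loop: TRUE whenever a separator is present
flagged <- grepl(";", info)
flagged
# TRUE TRUE FALSE  ("red;circle" is flagged even though it has one color)
```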

score 1 · Accepted Answer

For this type of problem, it can make sense to split the strings and then work with the resulting vectors of character strings. For example,

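# Split each entry on ";" (as.character guards against info being a factor)
mystrings <- strsplit(as.character(df$info),";")
# Return the single entry of x found in s, `none` if nothing matches,
# or `multiple` if two or three entries match
getStrings <- function(x,s,none=NA_character_,multiple="Multicolored")
   switch(sum(x%in%s)+1,none,x[x%in%s],multiple,multiple)
df$shape <- sapply(mystrings,FUN=getStrings,s=c("circle"))
df$color <- sapply(mystrings,FUN=getStrings,s=c("red","green","blue"))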

Personally, I find this approach easier than using pure regular expressions or if statements.
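
As a self-contained check of the split-then-classify idea, here is a sketch run against the question's data. `classify` is a hypothetical helper name, not part of the code above; it uses if/else instead of `switch` so that any number of matches is handled, and `info` is assumed to be character.

```r
# Example data from the question (info assumed to be character)
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

# One character vector of tokens per row
tokens <- strsplit(df$info, ";")

# Hypothetical helper: keep the tokens found in vocab, then
# return NA for none, the token itself for one, a label for several
classify <- function(x, vocab, none = NA_character_,
                     multiple = "Multicolored") {
  hits <- x[x %in% vocab]
  if (length(hits) == 0) {
    none
  } else if (length(hits) == 1) {
    hits
  } else {
    multiple
  }
}

df$shape <- vapply(tokens, classify, character(1), vocab = "circle")
df$color <- vapply(tokens, classify, character(1),
                   vocab = c("red", "green", "blue"))

df$shape
# "circle" "circle" NA "circle" NA NA
df$color
# "red" "blue" "Multicolored" "Multicolored" "blue" "Multicolored"
```

Using `vapply` with `character(1)` rather than `sapply` is a design choice in this sketch: it fails loudly if the helper ever returns something other than a single string.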

score 0 · Accepted Answer

You could also try:

pat1 <- paste0(c("red","blue", "green"), collapse="|")
shape1 <- gsub(paste(pat1, ";", sep="|"), "", df$info)
shape1[shape1==''] <- NA
df[,c("info", "shape")] <- as.data.frame(do.call(rbind,
    Map(`c`,
        lapply(regmatches(df$info, gregexpr(pat1, df$info)),
               function(x) if(length(x)>1) "Multicolored" else x),
        shape1)), stringsAsFactors=FALSE)

 df
 #  ids         info  shape
 #1 101          red circle
 #2 103         blue circle
 #3 122 Multicolored   <NA>
 #4 102 Multicolored circle
 #5 213         blue   <NA>
 #6 170 Multicolored   <NA>
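
To see how the pieces fit together, the intermediate values can be inspected one step at a time. This is only a sketch on a standalone vector `info` (a copy of the question's column, assumed to be character); it does not modify `df`.

```r
info <- c("red;circle", "circle;blue", "red;green",
          "circle;red;green", "blue", "red;blue")

# Alternation pattern matching any of the known colors
pat1 <- paste0(c("red", "blue", "green"), collapse = "|")
pat1
# "red|blue|green"

# Deleting every color and every ";" leaves only the shape, or ""
shape1 <- gsub(paste(pat1, ";", sep = "|"), "", info)
shape1
# "circle" "circle" "" "circle" "" ""

# All color matches per element; more than one means "Multicolored"
colors <- regmatches(info, gregexpr(pat1, info))
n_colors <- lengths(colors)
n_colors
# 1 1 2 2 1 2
```

Note that the pattern is unanchored, so it would also match a color name embedded in a longer word; that does not occur in this data.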
