r - R/dplyr: creating a larger data frame from a data frame whose cells contain comma-separated values

Question

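I'm working with a data frame like the one below. I did my best to format it for SO. The important thing is that person, personparty, and sponsordate have the same number of comma-separated entries (I truncated the cells, so the counts may not match in this example, but they do in the dataset).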

bill                                               status       person                       personparty       sponsordate
A bill to amend chapter 44 of title 18, ....        2ND Sen.   David Vitter [R-LA]           Republican             12/05/2015
A bill to authorize the appropriation of funds....  RESTRICT    Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]  Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat,     21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014, 21/07/2014, 09/06/2014, 02/06/2014, 12/06/2014, 21/05/2014, 02/06/2014, 21/05/2014

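I want to create a new data frame with 5 columns. Essentially, I want to unlist these (non-list) values into one larger data frame.

The final data frame should have one row for the i-th comma-separated entry of person, personparty, and sponsordate, and should keep the same column values for bill and status.

For example, the second row of the sample dataset would produce a row with the bill name (A bill to authorize the appropriation of funds....), the status (RESTRICT), Sen. Ed Markey [D-MA], Democrat, 21/05/2014. The next row would hold the second entry from each comma-separated value (same bill name, same status, Sen. Ed Markey [D-MA], Democrat, 02/06/2015), and so on.

Rows that have only one value in the last three columns stay the same.

How can I essentially unnest these list-like values?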

score 1 · Accepted Answer

It looks like you are looking for separate_rows.

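Assumption: within each row, the three columns hold the same number of comma-separated values. This is based on an excerpt from your post: "The important thing is that person, personparty, and sponsordate have the same number of comma-separated entries."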
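Since everything depends on that assumption, it may be worth checking it before splitting. Below is a minimal base R sketch that counts the entries per cell and flags rows where the three counts disagree. The data frame `df_check` and the names `n_entries` and `mismatched` are hypothetical, made up for illustration; if your columns are factors, convert them with `as.character()` first.

```r
# Hypothetical two-row data frame mirroring the three columns to be split
df_check <- data.frame(
  person      = c("David Vitter [R-LA]",
                  "Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"),
  personparty = c("Republican", "Democrat, Democrat"),
  sponsordate = c("12/05/2015", "21/05/2014, 22/05/2014"),
  stringsAsFactors = FALSE
)

# Count the comma-separated entries in every cell (one matrix column per data frame column)
n_entries <- sapply(df_check, function(col) lengths(strsplit(col, ",", fixed = TRUE)))

# Indices of rows whose three counts differ; these would break the positional pairing
mismatched <- which(apply(n_entries, 1, function(r) length(unique(r)) > 1))
```

An empty `mismatched` means every row is safe to split position by position. Note that `strsplit()` drops a trailing empty entry, so a cell ending in a comma is not counted as having an extra value.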

library(dplyr)
library(tidyr)

df %>%
  separate_rows(person, personparty, sponsordate, sep=",")

The output is:

                                                bill   status                     person personparty
1       A bill to amend chapter 44 of title 18, .... 2ND Sen.        David Vitter [R-LA]  Republican
2 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
3 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
4 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
5 A bill to authorize the appropriation of funds.... RESTRICT  Sen. Barbara Boxer [D-CA]    Democrat
  sponsordate
1  12/05/2015
2  21/05/2014
3  02/06/2015
4  05/04/2017
5  22/05/2014
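One caveat: because the entries are separated by a comma followed by a space, splitting on "," alone may leave a leading space on every entry after the first. Since `sep` in separate_rows is treated as a regular expression, a possible variant is to let the separator swallow the whitespace as well. This is a sketch on a hypothetical, shortened data frame (`df_small` and the bill names are placeholders, not the real data):

```r
library(tidyr)

# Hypothetical shortened data frame with the same five columns
df_small <- data.frame(
  bill        = c("Bill A", "Bill B"),
  status      = c("2ND", "RESTRICT"),
  person      = c("David Vitter [R-LA]",
                  "Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"),
  personparty = c("Republican", "Democrat, Democrat"),
  sponsordate = c("12/05/2015", "21/05/2014, 22/05/2014"),
  stringsAsFactors = FALSE
)

# Split on a comma plus any following whitespace, so the pieces carry no leading space
long <- separate_rows(df_small, person, personparty, sponsordate, sep = ",\\s*")
```

Here `long` has three rows, with bill and status repeated for the two rows that come from the second input row.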

Sample data:

df <- structure(list(bill = structure(1:2, .Label = c("A bill to amend chapter 44 of title 18, ....", 
"A bill to authorize the appropriation of funds...."), class = "factor"), 
    status = structure(1:2, .Label = c("2ND Sen.", "RESTRICT"
    ), class = "factor"), person = structure(1:2, .Label = c("David Vitter [R-LA]", 
    "Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"
    ), class = "factor"), personparty = structure(c(2L, 1L), .Label = c("Democrat, Democrat, Democrat, Democrat", 
    "Republican"), class = "factor"), sponsordate = structure(1:2, .Label = c("12/05/2015", 
    "21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014"), class = "factor")), .Names = c("bill", 
"status", "person", "personparty", "sponsordate"), class = "data.frame", row.names = c(NA, 
-2L))

score 0 · Accepted Answer

I'm not sure I understand what you want, so I'll start from the data frame I assume you have:

df=structure(list(bill = c("A bill to amend chapter 44 of title 18, .<U+0085>", 
"A bill to authorize the appropriation of funds...."), status = c("2ND Sen.", 
"RESTRICT"), person = c("David Vitter [R-LA]", "Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"
), personparty = c("Republican", "Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat,"
), sponsordate = c("12/05/15", "21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014, 21/07/2014, 09/06/2014, 02/06/2014, 12/06/2014, 21/05/2014, 02/06/2014, 21/05/2014"
)), .Names = c("bill", "status", "person", "personparty", "sponsordate"
), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
), spec = structure(list(cols = structure(list(bill = structure(list(), class = c("collector_character", 
"collector")), status = structure(list(), class = c("collector_character", 
"collector")), person = structure(list(), class = c("collector_character", 
"collector")), personparty = structure(list(), class = c("collector_character", 
"collector")), sponsordate = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("bill", "status", "person", "personparty", 
"sponsordate")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

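As I understand it, you want to expand row 2 into many rows. If "many" means every combination of the vector elements in columns 3, 4, and 5 of row 2, with each combination written over a copy of row 2, you can do it as follows: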

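library(stringr)
# Split each comma-separated cell of row 2 into a character vector
x01 <- str_split(df$person[2], ",")[[1]]
x02 <- str_split(df$personparty[2], ",")[[1]]
x03 <- str_split(df$sponsordate[2], ",")[[1]]
# Build every combination of the three vectors, kept as character columns
x04 <- expand.grid(x01, x02, x03, stringsAsFactors = FALSE)
# Repeat row 2 once per combination, then overwrite columns 3 to 5
df0 <- do.call("rbind", replicate(nrow(x04), df[2, ], simplify = FALSE))
df0$person <- x04$Var1
df0$personparty <- x04$Var2
df0$sponsordate <- x04$Var3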
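To make clear what "every combination" implies for the row count, here is a small sketch with hypothetical three-entry vectors (`persons`, `parties`, and `dates` are made-up names and values). It contrasts expand.grid, which crosses all entries, with pairing the entries by position:

```r
# Hypothetical split vectors with three entries each
persons <- c("A", "B", "C")
parties <- c("Democrat", "Democrat", "Republican")
dates   <- c("21/05/2014", "02/06/2015", "05/04/2017")

# expand.grid() builds every combination, so the row count is 3 * 3 * 3
combos <- expand.grid(person = persons, party = parties, date = dates,
                      stringsAsFactors = FALSE)

# data.frame() pairs the i-th entries together, so the row count stays at 3
paired <- data.frame(person = persons, party = parties, date = dates,
                     stringsAsFactors = FALSE)
```

If the i-th person, party, and date belong together, the positional pairing is the one that matches that reading; the combination approach only fits if every person should be crossed with every party and date.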

Hope this helps.
