r - Wrapping base R reshape for ease-of-use

Question

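It is a truth universally acknowledged that R's base reshape command is fast and powerful but has miserable syntax. I have therefore written a quick wrapper around it that I plan to include in the next release of the taRifx package. Before I do that, however, I would like to ask for improvements.

Here is my version, with updates from @RichieCotton: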

# reshapeasy: Version of reshape with way, way better syntax
 # Written with the help of the StackOverflow R community
 # data is a data.frame to be reshaped
 # direction is "wide" or "long"
 # vars are the names of the (stubs of) the variables to be reshaped (if omitted, defaults to everything not in id or vary)
 # id are the names of the variables that identify unique observations
 # vary is the variable that varies.  Going to wide this variable will cease to exist.  Going to long it will be created.
 # omit is a vector of characters which are to be omitted if found at the end of variable names (e.g. price_1 becomes price in long)
 # ... are options to be passed to stats::reshape
reshapeasy <- function( data, direction, id=(sapply(data,is.factor) | sapply(data,is.character)), vary=sapply(data,is.numeric), omit=c("_","."), vars=NULL, ... ) {
  if(direction=="wide") data <- stats::reshape( data=data, direction=direction, idvar=id, timevar=vary, ... )
  if(direction=="long") {
    varying <- which(!(colnames(data) %in% id))
    data <- stats::reshape( data=data, direction=direction, idvar=id, varying=varying, timevar=vary, ... )
  }
  colnames(data) <- gsub( paste("[",paste(omit,collapse="",sep=""),"]$",sep=""), "", colnames(data) )
  return(data)
}

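Note that you can move from wide to long without changing any option other than direction. To me, this is the key to usability.

If you chat or e-mail me your details, I am happy to credit any substantial improvements in the function's help file.

Improvements might fall into the following areas:

Naming of the function and its arguments
Making it more general (currently it handles a fairly specific case, which I believe to be the most common one, but it does not yet exhaust the capabilities of stats::reshape)
Code improvements

Examples

Sample data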

x.wide <- structure(list(surveyNum = 1:6, pio_1 = structure(c(2L, 2L, 1L, 
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), pio_2 = structure(c(2L, 1L, 2L, 1L, 
2L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), pio_3 = structure(c(2L, 2L, 1L, 1L, 
2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_1 = structure(c(2L, 1L, 1L, 
2L, 1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_2 = structure(c(1L, 2L, 2L, 
2L, 2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_3 = structure(c(1L, 2L, 1L, 
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_1 = structure(c(1L, 2L, 2L, 1L, 
1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_2 = structure(c(2L, 2L, 1L, 2L, 
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_3 = structure(c(2L, 1L, 2L, 2L, 
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), price_1 = structure(c(2L, 1L, 2L, 5L, 
3L, 4L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor"), price_2 = structure(c(6L, 
5L, 5L, 4L, 4L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor"), price_3 = structure(c(3L, 
5L, 2L, 5L, 4L, 5L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor")), .Names = c("surveyNum", 
"pio_1", "pio_2", "pio_3", "caremgmt_1", "caremgmt_2", "caremgmt_3", 
"prev_1", "prev_2", "prev_3", "price_1", "price_2", "price_3"
), idvars = "surveyNum", rdimnames = list(structure(list(surveyNum = 1:24), .Names = "surveyNum", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
), class = "data.frame"), structure(list(variable = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("pio", 
"caremgmt", "prev", "price"), class = "factor"), .id = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("variable", 
".id"), row.names = c("pio_1", "pio_2", "pio_3", "caremgmt_1", 
"caremgmt_2", "caremgmt_3", "prev_1", "prev_2", "prev_3", "price_1", 
"price_2", "price_3"), class = "data.frame")), row.names = c(NA, 
6L), class = c("cast_df", "data.frame"))

x.long <- structure(list(.id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), pio = structure(c(2L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("1", "2"), class = "factor"), 
    caremgmt = structure(c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
    1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 
    1L, 2L, 2L), .Label = c("1", "2"), class = "factor"), prev = structure(c(1L, 
    2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
    1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 
    2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("1", 
    "2"), class = "factor"), price = structure(c(2L, 1L, 2L, 
    5L, 3L, 4L, 1L, 5L, 4L, 3L, 1L, 2L, 6L, 6L, 5L, 4L, 6L, 3L, 
    5L, 6L, 3L, 1L, 2L, 4L, 3L, 5L, 2L, 5L, 4L, 5L, 6L, 6L, 4L, 
    6L, 4L, 1L, 2L, 3L, 1L, 2L, 2L, 5L, 1L, 6L, 1L, 3L, 4L, 3L, 
    6L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 6L, 3L, 1L, 4L, 4L, 5L, 1L, 
    3L, 6L, 1L, 3L, 5L, 1L, 3L, 6L, 2L), .Label = c("1", "2", 
    "3", "4", "5", "6"), class = "factor"), surveyNum = c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L, 
    3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
    16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L, 3L, 
    4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
    17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L)), .Names = c(".id", 
"pio", "caremgmt", "prev", "price", "surveyNum"), row.names = c(NA, 
-72L), class = "data.frame")

Examples

> x.wide
  surveyNum pio_1 pio_2 pio_3 caremgmt_1 caremgmt_2 caremgmt_3 prev_1 prev_2 prev_3 price_1 price_2 price_3
1         1     2     2     2          2          1          1      1      2      2       2       6       3
2         2     2     1     2          1          2          2      2      2      1       1       5       5
3         3     1     2     1          1          2          1      2      1      2       2       5       2
4         4     2     1     1          2          2          2      1      2      2       5       4       5
5         5     1     2     2          1          2          1      1      1      1       3       4       4
6         6     1     2     1          2          1          1      2      1      1       4       2       5
> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
    surveyNum id pio caremgmt prev price
1.1         1  1   2        2    1     2
2.1         2  1   2        1    2     1
3.1         3  1   1        1    2     2
4.1         4  1   2        2    1     5
5.1         5  1   1        1    1     3
6.1         6  1   1        2    2     4
1.2         1  2   2        1    2     6
2.2         2  2   1        2    2     5
3.2         3  2   2        2    1     5
4.2         4  2   1        2    2     4
5.2         5  2   2        2    1     4
6.2         6  2   2        1    1     2
1.3         1  3   2        1    2     3
2.3         2  3   2        2    1     5
3.3         3  3   1        1    2     2
4.3         4  3   1        2    2     5
5.3         5  3   2        1    1     4
6.3         6  3   1        1    1     5

> head(x.long)
  .id pio caremgmt prev price surveyNum
1   1   2        2    1     2         1
2   1   2        1    2     1         2
3   1   1        1    2     2         3
4   1   2        2    1     5         4
5   1   1        1    1     3         5
6   1   1        2    2     4         6

> head(reshapeasy( x.long, direction="wide", id="surveyNum", vary=".id" ))
  surveyNum pio.1 caremgmt.1 prev.1 price.1 pio.3 caremgmt.3 prev.3 price.3 pio.2 caremgmt.2 prev.2 price.2
1         1     2          2      1       2     2          1      2       3     2          1      2       6
2         2     2          1      2       1     2          2      1       5     1          2      2       5
3         3     1          1      2       2     1          1      2       2     2          2      1       5
4         4     2          2      1       5     1          2      2       5     1          2      2       4
5         5     1          1      1       3     2          1      1       4     2          2      1       4
6         6     1          2      2       4     1          1      1       5     2          1      1       2

score 3 · Accepted Answer

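I would also like to see an option to order the output, since that is one of the things I don't like about reshape in base R. As an example, let's use the Stata Learning Module: Reshaping data wide to long, which you are already familiar with. The example I am looking at is the "kids heights and weights at age 1 and age 2" one.

Here is what I normally do with reshape():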

library(foreign)  # provides read.dta()
kidshtwt = read.dta("http://www.ats.ucla.edu/stat/stata/modules/kidshtwt.dta")
kidshtwt.l = reshape(kidshtwt, direction="long", idvar=1:2, 
                     varying=3:6, sep="", timevar="age")
# The reshaped data is correct, just not in the order I want it
# so I always have to do another step like this
kidshtwt.l = kidshtwt.l[order(kidshtwt.l$famid, kidshtwt.l$birth),]

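Since this is an annoying step that I always have to perform when reshaping data, I think it would be useful to add it to your function.

I would also suggest having an option to do the same thing for the final column order, at least when reshaping from long to wide.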
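As a rough illustration of the row-ordering option suggested above, here is a minimal sketch. The helper name order_long and its arguments are hypothetical (they are not part of reshapeasy or taRifx), and the toy data only mimics the shape of the kidshtwt example.

```r
# Hypothetical helper (not part of taRifx): order a long data frame
# by its id column(s) first and by the time variable second.
order_long <- function(data, id, vary) {
  keys <- unname(as.list(data[c(id, vary)]))  # unnamed so order() sees plain vectors
  data[do.call(order, keys), , drop = FALSE]
}

# Toy long data, in the time-major order that reshape() returns
long <- data.frame(famid = c(1, 2, 1, 2),
                   age   = c(1, 1, 2, 2),
                   ht    = c(2.8, 2.9, 3.4, 3.8))

sorted <- order_long(long, id = "famid", vary = "age")
sorted
```

Inside a wrapper this could sit behind a logical argument (say sort = TRUE, again an assumption) so that users who want the reshape() order can still get it.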

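Example function for column ordering

I am not sure of the best way to integrate this into your function, but I put this together to sort a data frame based on basic patterns in the variable names.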

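col.name.sort = function(data, patterns) {
  a = names(data)
  b = length(patterns)

  # One slot per pattern, filled with the matching column names
  subs = vector("list", b)

  for (i in seq_len(b)) {
    subs[[i]] = sort(grep(patterns[i], a, value=TRUE))
  }
  x = unlist(subs)
  # Columns come back in pattern order, sorted within each pattern
  data[ , x ]
}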

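It can be used in the following manner. Suppose we had saved the output of your reshapeasy long-to-wide example as a data frame named a, and we wanted it ordered by "surveyNum", "caremgmt" (1-3), "prev" (1-3), "pio" (1-3), and "price" (1-3). We could then use: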

col.name.sort(a, c("sur", "car", "pre", "pio", "pri"))
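To make the effect concrete without the full survey data, here is a small self-contained sketch. The toy data frame and the anchored patterns are my own assumptions, and the function body is a compact variant of the idea rather than the exact code above.

```r
# Compact variant of the column-sorting idea; toy data is hypothetical.
col.name.sort <- function(data, patterns) {
  cols <- unlist(lapply(patterns, function(p) {
    sort(grep(p, names(data), value = TRUE))  # matches for one pattern, sorted
  }))
  data[, cols, drop = FALSE]
}

a <- data.frame(surveyNum = 1:2,
                pio.1 = 1:2, price.1 = 3:4,
                pio.2 = 5:6, price.2 = 7:8)

# Anchoring with ^ keeps a pattern from matching in the middle of a name
sorted <- col.name.sort(a, c("^surveyNum$", "^price", "^pio"))
names(sorted)
```

One caveat with this approach: if two patterns match the same column, that column is selected twice, so fairly specific (ideally anchored) patterns are safer than short prefixes.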

score 2 · Accepted Answer

For those who are perhaps lazy and dislike typing variable names, you could add the following at the start of your function:

  # Accept column positions for id and vary and convert them to names;
  # character input is left untouched
  if (is.numeric(id)) {
    id = colnames(data)[id]
  }

  if (is.numeric(vary)) {
    vary = colnames(data)[vary]
  }

Then, following your examples, you can use this shorthand:

reshapeasy(x.wide, direction="long", id=1, sep="_", vary="id")
reshapeasy(x.long, direction="wide", id=6, vary=1)

(This may not be good practice, since it can make the code harder to read or harder for someone else to understand later, but it is something that comes up often.)

score 2 · Accepted Answer

I think there may be an error in your example. When going from wide to long, I get the following error:

> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
Error in gsub(paste("[", paste(omit, collapse = "", sep = ""), "]$", sep = ""),  : 
  invalid regular expression '[]$', reason 'Missing ']''

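Removing the NULL fixes the problem. What is its intended purpose?

I also think the function would be improved if it generated a time variable by default when the user does not specify one explicitly (as reshape() does).

For example, see the following from base reshape():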

> head(reshape(x.wide, direction="long", idvar=1, varying=2:13, sep="_"))
    surveyNum time pio caremgmt prev price
1.1         1    1   2        2    1     2
2.1         2    1   2        1    2     1
3.1         3    1   1        1    2     2
4.1         4    1   2        2    1     5
5.1         5    1   1        1    1     3
6.1         6    1   1        2    2     4

If I am used to this, and I see that your function takes care of "varying" for me, I might be tempted to try:

> head(reshapeasy( x.wide, "long", id="surveyNum", sep="_" ))
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘1.1’

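But that is not a very useful error. Including a custom error message might help in the final function.

It also does not seem wise to let the user set vary to NULL, as your current version of the function allows. That gives output like this: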

> head(reshapeasy( x.wide, "long", id="surveyNum", NULL, sep="_" ))
    surveyNum pio caremgmt prev price
1.1         1   2        2    1     2
2.1         2   2        1    2     1
3.1         3   1        1    2     2
4.1         4   2        2    1     5
5.1         5   1        1    1     3
6.1         6   1        2    2     4

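The problem with this output is that if I need to go back to wide, I cannot do so easily. So I think that keeping reshape's default of generating a time variable, while still letting the user override it, would be a useful feature.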
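Here is a minimal sketch of what those two suggestions could look like together. The function name to_long, the default vary = "time", and the wording of the messages are all assumptions for illustration; this is not the taRifx implementation.

```r
# Hypothetical sketch: wide -> long with a default time variable and
# explicit, readable errors instead of reshape()'s internal ones.
to_long <- function(data, id, vary = "time", sep = ".") {
  if (is.null(vary)) {
    stop("'vary' must name the time variable to create; it cannot be NULL")
  }
  if (!all(id %in% colnames(data))) {
    stop("'id' must name columns that exist in 'data'")
  }
  varying <- setdiff(colnames(data), id)
  if (!all(grepl(sep, varying, fixed = TRUE))) {
    stop("every non-id column name must contain the separator '", sep, "'")
  }
  stats::reshape(data, direction = "long", idvar = id,
                 varying = varying, timevar = vary, sep = sep)
}

wide <- data.frame(surveyNum = 1:2,
                   pio_1 = c(2, 2), pio_2 = c(2, 1),
                   price_1 = c(2, 1), price_2 = c(6, 5))

long <- to_long(wide, id = "surveyNum", sep = "_")  # time column created by default

# Passing NULL now fails early with a message that names the argument
msg <- tryCatch(to_long(wide, id = "surveyNum", vary = NULL, sep = "_"),
                error = function(e) conditionMessage(e))
```

Because the time column is always present, the long result can be taken back to wide with the same id and vary values.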

score 2 · Accepted Answer

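Some initial thoughts:

I have always thought that the "wide" and "long" values of the direction argument were a little ambiguous. Do they mean that you want to convert the data to that format, or that the data is already in that format? It is something you have to learn or look up. You can avoid the problem by having separate functions, reshapeToWide and reshapeToLong. As a bonus, each function's signature has one fewer argument.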
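A minimal sketch of that split, assuming the argument names id and vary are carried over from reshapeasy and that the default vary = "time" is acceptable; neither function exists in taRifx as far as this thread shows.

```r
# Hypothetical sketch: one function per direction, so the name states
# the target format and no 'direction' argument is needed.
reshapeToLong <- function(data, id, vary = "time", ...) {
  varying <- setdiff(colnames(data), id)  # everything that is not an id column
  stats::reshape(data, direction = "long", idvar = id,
                 varying = varying, timevar = vary, ...)
}

reshapeToWide <- function(data, id, vary = "time", ...) {
  stats::reshape(data, direction = "wide", idvar = id, timevar = vary, ...)
}

wide <- data.frame(surveyNum = 1:3,
                   pio_1 = c(2, 2, 1),
                   pio_2 = c(2, 1, 2))

long <- reshapeToLong(wide, id = "surveyNum", sep = "_")
back <- reshapeToWide(long, id = "surveyNum", sep = "_")
```

The remaining arguments stay identical between the two calls, which preserves the property the question cares about: switching direction means changing only the function name.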

I don't think you meant to include the line

varying <- which(!(colnames(x.wide) %in% "surveyNum"))

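since it refers to a specific data set.

I prefer data over x as the name of the first argument, since it makes clear that the input is a data frame.

It is generally better form to put arguments without defaults first, so vars should come after id and vary.

Can you pick defaults for id and vary? reshape::melt defaults to factor and character columns for id, and numeric columns for vary.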
