r - How to find balanced panel data in R (aka, how to find which entries in a panel are complete over a given window)

Question

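I have a large panel of data from Compustat. I am adding some hand-collected data to it (seriously hand-collected, from a stack of old books). Rather than hand-collect for the entire panel, I only want to collect for a randomly selected subset. To find the larger set from which I will draw the random selection, I'd like to start with a balanced panel from Compustat.

I see the plm library for working with unbalanced panels, but I'd like to keep mine balanced. Is there a clean way to do this, other than searching out and dropping the firms ("individuals" in panel-speak) that don't run the entire sample period? Thanks!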

score 1 · Accepted Answer

Having thought about it some more, there is a much simpler way to do this.

Take a look at this:

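# Keep only the rows of subjects that have the expected number of observations
data.with.only.complete.subjects.data <- function(xx, subject.column, number.of.observation.a.subject.should.have)
{
    subjects <- xx[,subject.column]
    # count the rows for each subject
    num.of.observations.per.subject <- table(subjects)
    # names of the subjects whose count matches the expected number
    subjects.to.keep <- names(num.of.observations.per.subject)[num.of.observations.per.subject == number.of.observation.a.subject.should.have]

    # logical index of the rows that belong to complete subjects
    subset.by.me <- subjects %in% subjects.to.keep

    new.xx <- xx[subset.by.me ,]

    return(new.xx)
}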

xx <- data.frame(subject = rep(1:4, each = 3),
            observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2,5),]

data.with.only.complete.subjects.data(xx.mis , 1, 3)
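
The same filter can probably be written a little more compactly with ave(), which repeats each subject's row count on every row of that subject. This is only a minimal sketch on the toy data above; keep.complete is a hypothetical helper name, not part of any package.

```r
# Toy data from above: 4 subjects x 3 observations, with rows 2 and 5 dropped
xx <- data.frame(subject = rep(1:4, each = 3),
                 observation.per.subject = rep(1:3, 4))
xx.mis <- xx[-c(2, 5), ]

# Hypothetical helper: keep the rows whose subject has exactly n.expected rows
keep.complete <- function(df, id.col, n.expected) {
    ids <- df[[id.col]]
    # ave() returns, for every row, the number of rows sharing that row's id
    n.per.row <- ave(seq_along(ids), ids, FUN = length)
    df[n.per.row == n.expected, , drop = FALSE]
}

xx.bal <- keep.complete(xx.mis, "subject", 3)
xx.bal
```

Because the rows are selected with a logical index, the column types of the original data frame are left untouched.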

score 0 · Accepted Answer

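Update: I don't think this solution is any better than the other one I posted above, but I'm leaving it here as an example of a solution, even if it is not a very good one :)

Hi Richard,

It's a bit difficult without some sample data to work with.

But it seems to me that you could reshape your data using "melt" and "cast" from the "reshape" package. Doing so would let you find the subjects with too few observations, and you could then use that information to subset your data.

Here is some sample code showing how to do this: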

xx <- data.frame(subject = rep(1:4, each = 3),
            observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2,5),]

require(reshape)


num.of.obs.per.subject <- cast(xx.mis, subject ~.)
the.number <- num.of.obs.per.subject[,2]
subjects.to.keep <- num.of.obs.per.subject[,1] [the.number  == 3]

ss.index.of.who.to.keep <- xx.mis$subject %in% subjects.to.keep

xx.to.work.with <- xx.mis[ss.index.of.who.to.keep ,]


xx.to.work.with

Cheers,

Tal

score 0 · Accepted Answer

> # read data
> file.in <- "243815928.csv"
> data <- read.csv(file.in)
> 
> # find which gvkeys run the entire sample period (taken here to be the most common number of observations per gvkey)
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> 
> # create new df w/o firms that don't run the whole sample period
> pot.obs.index <- which(data$gvkey %in% pot.obs)
> data.bal <- data[pot.obs.index, ]
> 
> # write data to csv file
> file.out <- paste(substr(file.in, 1, (nchar(file.in)-4)), "sorted.csv", sep="")
> write.csv(data.bal, file.out)
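
Counting rows per gvkey treats the most common count as the full sample period, and it cannot tell a firm with a duplicated year from a firm with complete coverage. A possible alternative is to check each firm against an explicit window of years. The sketch below assumes the gvkey and fyear columns from the Compustat extract; the toy values, the window, and the object names are made up for illustration.

```r
# Hypothetical toy panel; gvkey and fyear mirror the Compustat column names
panel <- data.frame(
    gvkey = c(1001, 1001, 1001, 1002, 1002, 1003, 1003, 1003, 1003),
    fyear = c(2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002, 2002)
)
window <- 2000:2002  # assumed sample period

# Restrict to the window before checking coverage
in.window <- panel[panel$fyear %in% window, ]

# For each firm: TRUE when every year of the window is present
covers.all <- tapply(in.window$fyear, in.window$gvkey,
                     function(y) all(window %in% y))
# For each firm: TRUE when no firm-year is repeated
no.dups <- tapply(in.window$fyear, in.window$gvkey,
                  function(y) !any(duplicated(y)))

# Keep firms that pass both checks
keep <- names(covers.all)[covers.all & no.dups]
panel.bal <- in.window[as.character(in.window$gvkey) %in% keep, ]
panel.bal
```

In this toy panel, firm 1002 is missing 2001 and firm 1003 repeats 2002, so only firm 1001 survives. Whether firms with duplicated firm-years should be dropped or deduplicated is a judgment call that depends on the data.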

score 0 · Accepted Answer

I see now that I lose some of the data formatting, but I can figure that out later. Here is my attempt at taking the balanced portion of the panel:

> data <- read.csv("223601533.csv")
> head(data)
  gvkey indfmt  datafmt consol popsrc fyear fyr datadate exchg         isin
1  2721   INDL HIST_STD      C      I  2000  12 20001231   264 JP3242800005
2  2721   INDL HIST_STD      C      I  2001  12 20011231   264 JP3242800005
3  2721   INDL HIST_STD      C      I  2002  12 20021231   264 JP3242800005
4  2721   INDL HIST_STD      C      I  2003  12 20031231   264 JP3242800005
5  2721   INDL HIST_STD      C      I  2004  12 20041231   264 JP3242800005
6  2721   INDL HIST_STD      C      I  2005  12 20051231   264 JP3242800005
    sedol      conm costat fic
1 6172323 CANON INC      A JPN
2 6172323 CANON INC      A JPN
3 6172323 CANON INC      A JPN
4 6172323 CANON INC      A JPN
5 6172323 CANON INC      A JPN
6 6172323 CANON INC      A JPN
> 
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> data.bal <- as.data.frame(matrix(NA, nrow=nt.bal, ncol=ncol(data)))
> colnames(data.bal) <- colnames(data)
> 
> for(i in 1:length(pot.obs)) {
+   last.row <- i * mode.num.obs
+   first.row <- last.row - (mode.num.obs - 1)
+   data.bal[first.row:last.row, ] <- subset(data, gvkey == pot.obs[i])
+ }
> 
> head(data.bal)
  gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin sedol conm
1  2721      2       1      1      1  2000  12 20001231   264  875   359  331
2  2721      2       1      1      1  2001  12 20011231   264  875   359  331
3  2721      2       1      1      1  2002  12 20021231   264  875   359  331
4  2721      2       1      1      1  2003  12 20031231   264  875   359  331
5  2721      2       1      1      1  2004  12 20041231   264  875   359  331
6  2721      2       1      1      1  2005  12 20051231   264  875   359  331
  costat fic
1      1   1
2      1   1
3      1   1
4      1   1
5      1   1
6      1   1
>
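
The integer codes in the last output (conm showing 331 instead of CANON INC, for example) look like factor columns being reduced to their underlying codes when rows are copied into the NA-filled data frame; before R 4.0.0, read.csv() converted strings to factors by default. Selecting rows with a logical index instead of copying them in a loop should keep the column types. A minimal sketch on made-up data, assuming the full sample period equals the largest number of rows any firm has; "OTHER CO" and the object names are hypothetical.

```r
# Hypothetical toy data; conm is a factor, as older read.csv() defaults produced
firms <- data.frame(
    gvkey = rep(c(2721, 3000), times = c(3, 2)),
    fyear = c(2000, 2001, 2002, 2000, 2001),
    conm  = factor(rep(c("CANON INC", "OTHER CO"), times = c(3, 2)))
)

# Number of rows for each firm, repeated on every row of that firm
n.per.row <- ave(seq_len(nrow(firms)), firms$gvkey, FUN = length)

# Keep the firms with the maximum row count (assumed to be the full period)
firms.bal <- firms[n.per.row == max(n.per.row), ]
firms.bal
```

Here conm is still a factor showing company names in firms.bal, and no preallocated data frame or loop is needed.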
