r - R: Recode factor levels, collapsing the rest into "other"

Question

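I don't use factors much and generally find them comprehensible, but I'm often fuzzy on the details of specific operations. Currently I'm coding/collapsing categories that have few observations into "other", and I'm looking for a quick way to do that. I have a variable with roughly 20 levels, and I'm interested in collapsing a bunch of them into one.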

data <- data.frame(employees = sample.int(1000,500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340','621391','621399','621410','621420','621491','621492','621493','621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace=T))

Here are the levels I'm interested in, and their labels, in separate vectors.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

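I could use the factor() call, enumerating all of the levels and classifying a category as "other" every time it has few observations.

Assuming the above top8 and top8_desc are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are coded correctly and everything else is recoded as other?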

score 6 · Accepted Answer

I think the easiest way is to relabel every naics value that is not in the top 8 to a special value.

data$naics[!(data$naics %in% top8)] = -99

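Then you can use the exclude option when turning it into a factor. Note that the excluded value ends up as NA rather than as a level named "other".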

factor(data$naics, exclude=-99)
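
If an explicit "other" level is wanted instead of NA, one possible variation is to replace the codes outside top8 with a placeholder string and then pass both levels and labels to factor(). The following is only a sketch: the shortened top8/top8_desc vectors and the toy naics vector are made up for illustration, and the placeholder name "other" is an assumption.

```r
# Toy stand-ins for the question's objects (shortened, illustrative values)
top8      <- c("621111", "621210", "621399")
top8_desc <- c("Offices of physicians",
               "Offices of dentists",
               "Offices of all other miscellaneous health practitioners")
naics <- c("621111", "621999", "621210", "621410", "621399", "621111")

# Replace every code that is not in top8 with the placeholder "other"
naics_chr <- as.character(naics)
naics_chr[!(naics_chr %in% top8)] <- "other"

# Declare the factor: levels are the raw codes, labels are the descriptions
naics_f <- factor(naics_chr,
                  levels = c(top8, "other"),
                  labels = c(top8_desc, "other"))

table(naics_f)
```

Working on a character copy avoids the problem that assigning a new value into an existing factor yields NA when that value is not already one of its levels.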

score 0 · Accepted Answer

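I wrote a function to do this, which might be useful to others. It first checks, in relative terms, whether a level accounts for less than the share mp of the data set (mp = 0.01 means 1%). After that it caps the number of levels at ml.

ds is the data set at hand, of type data.frame. The function does this for every column listed in cat_var_names, which holds the names of the factor columns.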

cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # collapse infrequent factor levels into "other"
  n <- nrow(ds)
  # keep levels that cover more than the share mp of all cases
  for (i in var_list){
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }

  # keep top ml levels
  for (i in var_list){
    # head() avoids NA entries when a column has fewer than ml levels
    keep <- head(names(sort(table(ds[[i]]), decreasing = TRUE)), ml)
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }
  return(ds)
}
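
To see the two passes on something small, here is a minimal single-factor sketch of the same idea. collapse_rare, min_prop and max_levels are hypothetical names introduced only for this illustration (they play the roles of mp and ml above), and the toy data is made up.

```r
# Hypothetical helper: collapse rare levels of ONE factor into "other".
# min_prop and max_levels correspond to mp and ml in recodeLevels().
collapse_rare <- function(x, min_prop = 0.01, max_levels = 25) {
  x <- as.factor(x)

  # Pass 1: levels covering at most min_prop of the observations become "other"
  counts <- table(x)
  rare <- names(counts)[counts <= min_prop * length(x)]
  levels(x)[levels(x) %in% rare] <- "other"

  # Pass 2: keep only the max_levels most frequent levels that remain
  counts <- sort(table(x), decreasing = TRUE)
  keep <- head(names(counts), max_levels)
  levels(x)[!(levels(x) %in% keep)] <- "other"

  x
}

# Toy data: five levels with very uneven counts
x <- factor(c(rep("a", 50), rep("b", 30), rep("c", 15), rep("d", 4), "e"))
y <- collapse_rare(x, min_prop = 0.02, max_levels = 3)
table(y)
```

Assigning the same name to several levels merges them, so both passes feed a single "other" level. Because "other" sits on top of the kept levels, the result can have up to max_levels + 1 levels.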
