r - Combining factor levels in a data frame column

Question

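I have a data frame, data, with a column named "Project License" that represents a categorical variable, so in R terms it is a factor. I am trying to create a new column in which the open source software licenses are grouped into larger categories according to my own classification. However, when I try to combine (merge) the levels of that factor, I get a column in which all levels are lost, a column that is unchanged, or an error message like the following: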

Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive",  : invalid 'labels'; length 4 should be 1 or 6

Here is the code for this functionality (extracted from a function):

myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')

licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)

data[["Project License"]] <- licenses

classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

restrictiveness <- 
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

data[["License Restrictiveness"]] <- restrictiveness

I have also tried other approaches (including the one described in section 8.2.5 of "The R Inferno"), but so far without success.

What am I doing wrong, and how can I solve this problem? Thank you!

Update (data):

> head(data, n=20)
   Project ID Project License
1       45556            lgpl
2       41636             bsd
3       95627             gpl
4       66930             gpl
5       51103             gpl
6       65637             gpl
7       41834             gpl
8       70998             gpl
9       95064             gpl
10      48810            lgpl
11      95934             gpl
12      90909             gpl
13       6538         website
14      16439             gpl
15      41924             gpl
16      78987             gpl
17      58662            zlib
18       1904             bsd
19      93838          public
20      90047            lgpl

> str(data)
'data.frame':   45033 obs. of  2 variables:
 $ Project ID     : chr  "45556" "41636" "95627" "66930" ...
 $ Project License: chr  "lgpl" "bsd" "gpl" "gpl" ...
 - attr(*, "SQL")=Class 'base64'  chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
 - attr(*, "indicatorName")=Class 'base64'  chr "cHJqTGljZW5zZQ=="
 - attr(*, "resultNames")=Class 'base64'  chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"

Update 2 (data):

> unique(data[["Project License"]])
 [1] "lgpl"       "bsd"        "gpl"        "website"    "zlib"
 [6] "public"     "other"      "ibmcpl"     "rpl"        "mpl11"
[11] "mit"        "afl"        "python"     "mpl"        "apache"
[16] "osl"        "w3c"        "iosl"       "artistic"   "apsl"
[21] "ibm"        "plan9"      "php"        "qpl"        "psfl"
[26] "ncsa"       "rscpl"      "sunpublic"  "zope"       "eiffel"
[31] "nethack"    "sissl"      "none"       "opengroup"  "sleepycat"
[36] "nokia"      "attribut"   "xnet"       "eiffel2"    "wxwindows"
[41] "motosoto"   "vovida"     "jabber"     "cvw"        "historical"
[46] "nausite"    "real"

score 3 · Accepted Answer

The problem is that, when the factor is created, the number of labels neither equals the number of levels nor is 1.

From ?factor:

labels  
  either an optional character vector of labels for the levels (in the same order as
  levels after removing those in exclude), or a character string of length 1.

You need to make these match. The names in classification are not a hint to factor to combine labels.

For example:

factor(..., levels=classification, labels=c('Highly Restrictive',
                                            'Restrictive.1',
                                            'Restrictive.2',
                                            'Permissive.1',
                                            'Permissive.2',
                                            'Unknown'))
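
The 6 in the error message can be traced back to how c() treats the nested vectors. A minimal sketch, reusing the classification definition from the question (nothing else is assumed):

```r
# The question's definition: each name is paired with a vector, but c()
# flattens everything into a single named character vector.
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

# Six elements, one per license, so factor() sees six levels.
print(length(classification))

# Names of multi-element groups get a numeric suffix (restrictive1, restrictive2, ...).
print(names(classification))
```

Because factor() receives six levels, it expects either one label or six; the group names do not collapse anything.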

To map a factor onto another factor with fewer levels, you can index a vector by name. Turn classification into a lookup vector:

 classification <- c(gpl='Highly Restrictive',
                     lgpl='Restrictive', 
                     public='Restrictive',
                     bsd='Permissive',
                     artistic='Permissive',
                     other='Unknown')

To use this as a lookup table:

data[["License Restrictiveness"]] <- 
    as.factor(classification[as.character(data[['Project License']])])

head(data)
##   Project ID Project License License Restrictiveness
## 1      45556            lgpl             Restrictive
## 2      41636             bsd              Permissive
## 3      95627             gpl      Highly Restrictive
## 4      66930             gpl      Highly Restrictive
## 5      51103             gpl      Highly Restrictive
## 6      65637             gpl      Highly Restrictive
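
The data in the question also contain licenses (for example website and zlib) that the lookup vector does not name, and indexing by a missing name yields NA. The following is a sketch of one way to handle that, under the assumption that unlisted licenses should be filed under 'Unknown'; the licenses vector is a toy stand-in for the real column.

```r
# Toy stand-in for data[["Project License"]] (values taken from the question).
licenses <- c("lgpl", "bsd", "gpl", "website", "zlib", "public")

classification <- c(gpl = 'Highly Restrictive',
                    lgpl = 'Restrictive',
                    public = 'Restrictive',
                    bsd = 'Permissive',
                    artistic = 'Permissive',
                    other = 'Unknown')

# Indexing by name returns NA for names missing from the lookup vector.
mapped <- unname(classification[licenses])

# Assumption: anything not listed in the lookup counts as 'Unknown'.
mapped[is.na(mapped)] <- 'Unknown'

# Fix the level order explicitly instead of relying on alphabetical sorting.
restrictiveness <- factor(mapped,
                          levels = c('Highly Restrictive', 'Restrictive',
                                     'Permissive', 'Unknown'))
print(table(restrictiveness))
```

Whether unlisted licenses belong in 'Unknown' or should stay NA depends on the analysis; dropping the replacement line keeps them as NA.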

score 1 · Accepted Answer

Converting to character first may make the task easier, for example (untested):

license.map <- c(lgpl="Permissive", bsd="Permissive", 
                 gpl="Restrictive", website="Unknown") # etc.
dat <- transform(dat, LicenseType=license.map[Project.License])

Because stringsAsFactors is TRUE by default (in R versions before 4.0.0), the new column is a factor.
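
A sketch of the same idea that does not depend on the stringsAsFactors default and that copes with the space in the column name from the question. The four-row dat is made-up illustration data, and the license groupings are simply those of the snippet above.

```r
# Made-up stand-in for the asker's data frame; check.names = FALSE keeps
# the space in "Project License".
dat <- data.frame(`Project License` = c("lgpl", "bsd", "gpl", "website"),
                  check.names = FALSE, stringsAsFactors = FALSE)

license.map <- c(lgpl = "Permissive", bsd = "Permissive",
                 gpl = "Restrictive", website = "Unknown")

# as.character() makes the lookup work whether the column is character or factor;
# factor() makes the result a factor regardless of the R version.
keys <- as.character(dat[["Project License"]])
dat[["LicenseType"]] <- factor(unname(license.map[keys]))

str(dat)
```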
