r - Confusion between factor levels and factor labels

Question

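There seems to be a difference between the levels and the labels of a factor in R. Up to now, I always thought that levels were the "real" names of the factor levels, and labels were the names used for output (such as tables and plots). Obviously this is not the case, as the following example shows: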

# stringsAsFactors=TRUE is needed in R >= 4.0.0 to get a factor column
df <- data.frame(v=c(1,2,3),f=c('a','b','c'),stringsAsFactors=TRUE)
str(df)
'data.frame':   3 obs. of  2 variables:
 $ v: num  1 2 3
 $ f: Factor w/ 3 levels "a","b","c": 1 2 3

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))
levels(df$f)
[1] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

I thought that the levels ('a', 'b', 'c') would still be accessible when scripting, but this does not work:

> df$f=='a'
[1] FALSE FALSE FALSE

But this does:

> df$f=='Treatment A: XYZ' 
[1]  TRUE FALSE FALSE

So my question has two parts:

What is the difference between levels and labels?
Is it possible to have different names for factor levels for scripting and for output?

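Background: For longer scripts, scripting with short factor levels seems to be much easier. For reports and plots, however, these short factor levels may not be adequate and should be replaced with more precise names.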

score 140 · Accepted Answer

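Very short: levels are the input, labels are the output of the factor() function. A factor has only a levels attribute, which is set by the labels argument of the factor() function. This is different from the concept of labels in statistical packages like SPSS, and can be confusing at first.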

What you do in this line of code

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))

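is tell R that there is a vector df$f

which you want to convert into a factor,
in which the different levels are coded as a, b, and c,
and for which you want the levels to be labelled Treatment A and so on.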
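
As a small illustration of the three points above, this sketch (with shortened, made-up labels and an extra value "d" that are not part of the original example) shows that labels are paired with levels by position, and that a value not listed in levels becomes NA:

```r
# Values to encode; "d" is deliberately not among the declared levels
x <- c("a", "b", "c", "d")

f <- factor(x,
            levels = c("a", "b", "c"),              # values to look for in x
            labels = c("Trt A", "Trt B", "Trt C"))  # names stored in the result

levels(f)      # the labels have become the levels
as.integer(f)  # internal integer codes; the unmatched "d" becomes NA
```

The original codes 'a', 'b', 'c' are not stored anywhere in f, which is why a comparison such as f == 'a' no longer matches.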

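The factor function looks for the values a, b, and c, converts them to the factor's internal integer codes, and stores the label values in the levels attribute of the factor. This attribute is used to convert the internal integer codes back to the correct labels. But as you can see, there is no labels attribute.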

> df <- data.frame(v=c(1,2,3),f=c('a','b','c'),stringsAsFactors=TRUE)
> attributes(df$f)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

> df$f <- factor(df$f, levels=c('a','b','c'),
+   labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))    
> attributes(df$f)
$levels
[1] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

$class
[1] "factor"
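
For the second part of the question, one possible base R approach (a sketch, not something the code above does) is to keep the short codes as the factor levels for scripting and to apply the long names only when producing output. The lookup vector pretty_names and the column f_pretty are hypothetical names chosen for illustration.

```r
df <- data.frame(v = c(1, 2, 3), f = c("a", "b", "c"),
                 stringsAsFactors = TRUE)

# Hypothetical lookup table: short code -> name used in reports and plots
pretty_names <- c(a = "Treatment A: XYZ",
                  b = "Treatment B: YZX",
                  c = "Treatment C: ZYX")

# Scripting keeps using the short codes
df$v[df$f == "a"]

# For output, build a relabelled copy and leave df$f untouched
df$f_pretty <- factor(df$f,
                      levels = names(pretty_names),
                      labels = unname(pretty_names))
table(df$f_pretty)
```

The trade-off is an extra column (or a relabelling step just before each table or plot), but comparisons such as df$f == "a" keep working throughout the script.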

score 19 · Accepted Answer

I wrote a package, "lfactors", that lets you refer to either levels or labels.

# packages
install.packages("lfactors")
require(lfactors)

flips <- lfactor(c(0,1,1,0,0,1), levels=0:1, labels=c("Tails", "Heads"))
# Tails can now be referred to as, "Tails" or 0
# These two lines return the same result
flips == "Tails"
#[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
flips == 0 
#[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

Note that lfactor requires the levels to be numeric so that they cannot be confused with the labels.

score 0 · Accepted Answer

I just wanted to share a technique I generally use to deal with this issue of using different names for the levels of a factor variable for scripting and for pretty printing:

# Load packages
library(tidyverse)
library(sjlabelled)
library(patchwork)

# Create data frames
df <- data.frame(v = c(1, 2, 3), f = c("a", "b", "c"))
df_labelled <- data.frame(v = c(1, 2, 3), f = c("a", "b", "c")) %>%
  val_labels(
    # levels are characters
    f = c(
      "a" = "Treatment A: XYZ", "b" = "Treatment B: YZX", 
      "c" = "Treatment C: ZYX"
    ), 
    # levels are numeric
    v = c("1" = "Exp. Unit 1", "2" = "Exp. Unit 2", "3" = "Exp. Unit 3")
  )

# df and df_labelled appear exactly the same when printed and nothing changes
# in terms of scripting
df
#>   v f
#> 1 1 a
#> 2 2 b
#> 3 3 c
df_labelled
#>   v f
#> 1 1 a
#> 2 2 b
#> 3 3 c

# Now, let's take a look at the structure of df and df_labelled
str(df)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ v: num  1 2 3
#>  $ f: chr  "a" "b" "c"
str(df_labelled) # notice the attributes
#> 'data.frame':    3 obs. of  2 variables:
#>  $ v: num  1 2 3
#>   ..- attr(*, "labels")= Named num [1:3] 1 2 3
#>   .. ..- attr(*, "names")= chr [1:3] "Exp. Unit 1" "Exp. Unit 2" "Exp. Unit 3"
#>  $ f: chr  "a" "b" "c"
#>   ..- attr(*, "labels")= Named chr [1:3] "a" "b" "c"
#>   .. ..- attr(*, "names")= chr [1:3] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

# Lastly, create ggplots with and without pretty names for factor levels
p1 <- df_labelled %>% # or, df
  ggplot(aes(x = f, y = v)) + 
  geom_point() + 
  labs(x = "Treatment", y = "Measurement")
p2 <- df_labelled %>%
  ggplot(aes(x = to_label(f), y = to_label(v))) + 
  geom_point() + 
  labs(x = "Treatment", y = "Experimental Unit")

p1 / p2

Created on 2021-08-17 by the reprex package (v2.0.0)
