r - How to set uniform bar widths in a multi-histogram plot in R

Question

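I am reading a dataset "Dummy_data.csv" with 26 columns and 4288 instances. In total, 17 of its parameters (columns) matter for the data analysis. Of these 17 parameters, 6 (param1, param2, param3, param5, param6, param7) are critical parameters which, when simultaneously out of range, determine whether an item is defective (the class label). For example,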

range1 = (min1, max1) = (0.25, 0.35)
range2 = (min2, max2) = (2.5, 3.1)
range3 = (min3, max3) = (680, 700)
range5 = (min5, max5) = (56, 64)
range6 = (min6, max6) = (40, 60)
range7 = (min7, max7) = (28, 38)

if (param1 out of range1 & param2 out of range2 & param3 out of range3 &
    param5 out of range5 & param6 out of range6 & param7 out of range7)
    class = 'defective'
else
    class = 'ok'

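I need to do two defect analyses on the data above. First, I need to find the percentage of defective items out of the total number of items. Second, I need to find a frequency histogram of the out-of-range values for each of the 6 critical parameters, to understand which out-of-range values of these critical parameters are causing the defects.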

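What I did: Since the ranges of these 6 critical parameters hardly overlap, I first scaled the 17 parameters (though scaling the 6 critical parameters would have sufficed!) to (0, 1) using (x - min(x))/(max(x)-min(x)), so that I can do a frequency distribution of the out-of-range values of the 6 parameters on a uniform x-axis scale. In terms of the graph, a parameter value below 0 means it is below the minimum, and a parameter value above 1 means it exceeds the maximum. I then filter all defective instances from the dataset using z and draw a pie chart showing the percentage of defective and OK items. (1st analysis)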
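The scaling and flagging step described above could be sketched as follows. This is only an illustration under stated assumptions: the limits `lower` and `upper` and the small matrix `x` are made-up values, not taken from Dummy_data.csv, and only two parameters are shown. Base R's `scale()` subtracts `center` and divides by `scale` column by column, so passing the lower limits and the range widths maps in-range values into [0, 1].

```r
# Hypothetical spec limits for two parameters (illustrative values only)
lower <- c(PARAM1 = 0.25, PARAM2 = 2.5)
upper <- c(PARAM1 = 0.35, PARAM2 = 3.1)

# Hypothetical measurements: row 1 in range, row 2 below, row 3 above
x <- cbind(PARAM1 = c(0.30, 0.20, 0.40),
           PARAM2 = c(2.8, 2.2, 3.4))

# (x - lower) / (upper - lower), applied column by column
scaled <- scale(x, center = lower, scale = upper - lower)

# TRUE where a scaled value falls outside its allowed range
out_of_range <- scaled < 0 | scaled > 1

# An item is flagged only when every listed parameter is out of range
defective <- apply(out_of_range, 1, all)
```

With these made-up values, only the first row is in range for both parameters, so it is the only item not flagged.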

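For the frequency histogram (2nd analysis), I filter all defective instances from the scaled dataset scaled.dat.df into defect.dat.df. Then I select the min and max across all 6 parameters to determine the defect interval range. Next, I binned the unique values of each of the 6 parameters into p1.bin.defect.dat.df through p7.bin.defect.dat.df and plotted the individual histograms on the same plot using the plot function.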

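Problem with the multiple overlapping histogram plot

I am getting a multiple histogram plot as shown below, but the problem is that the bar widths differ across the 6 parameters. Does anyone know how to set uniform bar widths on a multi-histogram plot? Also, how do I add a proper legend to this multi-histogram plot?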

Helpful suggestions/answers will be greatly appreciated and rewarded accordingly.

Note: I followed another thread here on multiple histogram plots, how-to-plot-two-histograms-together-in-r, and I need a multi-histogram plot very similar to that one, but with 6 overlapping histograms instead of 2 (as in that thread).

library(RWeka)
library(party)
library(plyr)
library(plotrix)
library(sm)

#read data and class labels
dat <- read.csv("Dummy_data.csv", head=T, sep=",")
datm <- as.matrix(dat[,8:24])
class <- as.matrix(dat[,26])

#center and scale data
center <- c(0.25, 2.5, 680, 1067, 56, 40, 28, -99, -99, 40, 5, 50, 5000, 15000, 11.3, 9.1, 0)
scale <- c(0.1, 0.6, 20, 6, 8, 20, 10, 19, 19, 20, 2, 10, 500, 1000, 3.4, 18.3, 5)
scaled.datm <- scale(datm, center, scale)
write.table(scaled.datm, 
file = "C:\\Users\\schakrabarti\\Documents\\Dummy_data_whdr17.csv", 
append=FALSE, quote=TRUE, sep=",", eol = "\n", na = "NA", dec = ".", 
row.names = FALSE, col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

#filter total non-compliants
scaled.dat.df <- as.data.frame(scaled.datm)
total <- length(scaled.dat.df[,1])
z <- c((scaled.dat.df[,"PARAM1"]<0 | scaled.dat.df[,"PARAM1"]>1) & 
    (scaled.dat.df[,"PARAM2"]<0 | scaled.dat.df[,"PARAM2"]>1) & 
    (scaled.dat.df[,"PARAM3"]<0 | scaled.dat.df[,"PARAM3"]>1) & 
    (scaled.dat.df[,"PARAM5"]<0 | scaled.dat.df[,"PARAM5"]>1) &
    (scaled.dat.df[,"PARAM6"]<0 | scaled.dat.df[,"PARAM6"]>1) & 
    (scaled.dat.df[,"PARAM7"]<0 | scaled.dat.df[,"PARAM7"]>1) )
noncompliant <- length(z[z == TRUE])

slices <- c(noncompliant, total - noncompliant)
labls <- c("NOT OK","OK")
pct <- round(slices/sum(slices)*100, digits=2)
labls <- paste(labls, pct)
labls <- paste(labls, "%", sep="")

#pie3D(slices,labels=labls,explode=0.05, col=c(rgb(0.75,0,0.5),rgb(0,1,0.75)),main="Defect Analysis due to critical parameters")
pie(slices,labels=labls,main="Defect Analysis due to critical parameters")

#filter non-compliants due to individual params
defect.dat.df <- scaled.dat.df[z,]

#select defect interval range
min <- min(as.numeric(sapply(defect.dat.df[,c("PARAM1","PARAM2","PARAM3","PARAM5","PARAM6","PARAM7")], function(x) min(as.numeric(x)))))
max <- max(as.numeric(sapply(defect.dat.df[,c("PARAM1","PARAM2","PARAM3","PARAM5","PARAM6","PARAM7")], function(x) max(as.numeric(x)))))


#plot histogram for param1 defect
#p1.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM1")], breaks=seq(-0.4,0.2,by=0.2))
p1.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM1")], breaks=seq(min,max,by=0.2))
#h1 <- hist(defect.dat.df[,c("PARAM1")])
#plot(h1, col=rgb(1,0,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h1 <- hist(defect.dat.df[,c("PARAM1")], col=rgb(1,0,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
#h1 <- hist(defect.dat.df[,c("PARAM1")], col=rgb(1,0,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max))
box()

p2.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM2")], breaks=seq(min,max,by=0.2))
#h2 <- hist(defect.dat.df[,c("PARAM2")])
#plot(h2, col=rgb(0,0,1,1/7), xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h2 <- hist(defect.dat.df[,c("PARAM2")], col=rgb(0,0,1,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p2.bin.defect.dat.df$breaks, n=1),tail(p2.bin.defect.dat.df$breaks, n=1)), add=T)
#h2 <- hist(defect.dat.df[,c("PARAM2")], col=rgb(0,0,1,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max), add=T)
box()

p3.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM3")], breaks=seq(min,max,by=0.2))
#h3 <- hist(defect.dat.df[,c("PARAM3")])
#plot(h3, col=rgb(0,1,0,1/7), xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h3 <- hist(defect.dat.df[,c("PARAM3")], col=rgb(0,1,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p3.bin.defect.dat.df$breaks, n=1),tail(p3.bin.defect.dat.df$breaks, n=1)), add=T)
#h3 <- hist(defect.dat.df[,c("PARAM3")], col=rgb(0,1,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max), add=T)
box()

p5.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM5")], breaks=seq(min,max,by=0.2))
#h5 <- hist(defect.dat.df[,c("PARAM5")])
#plot(h5, col=rgb(0.5,0.5,0,1/7), xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h5 <- hist(defect.dat.df[,c("PARAM5")], col=rgb(0.5,0,0.5,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p5.bin.defect.dat.df$breaks, n=1),tail(p5.bin.defect.dat.df$breaks, n=1)), add=T)
#h5 <- hist(defect.dat.df[,c("PARAM5")], col=rgb(0.5,0,0.5,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max), add=T)
box()

p6.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM6")], breaks=seq(min,max,by=0.2))
#h6 <- hist(defect.dat.df[,c("PARAM6")])
#plot(h6, col=rgb(0,0.5,0.5,1/7), xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h6 <- hist(defect.dat.df[,c("PARAM6")], col=rgb(0,0.5,0.5,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p6.bin.defect.dat.df$breaks, n=1),tail(p6.bin.defect.dat.df$breaks, n=1)), add=T)
#h6 <- hist(defect.dat.df[,c("PARAM6")], col=rgb(0,0.5,0.5,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max), add=T)
box()

p7.bin.defect.dat.df <- binning(defect.dat.df[,c("PARAM7")], breaks=seq(min,max,by=0.2))
#h7 <- hist(defect.dat.df[,c("PARAM7")])
#plot(h7, col=rgb(0.5,0,0.5,1/7), xlim=c(head(p1.bin.defect.dat.df$breaks, n=1),tail(p1.bin.defect.dat.df$breaks, n=1)))
h7 <- hist(defect.dat.df[,c("PARAM7")], col=rgb(0.5,0.5,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(head(p7.bin.defect.dat.df$breaks, n=1),tail(p7.bin.defect.dat.df$breaks, n=1)), add=T)
#h7 <- hist(defect.dat.df[,c("PARAM7")], col=rgb(0.5,0.5,0,1/7), xlab="Param Defect Intervals", main="Frequency of Parameter Defects", xlim=c(min,max), add=T)
box()

[Image: multiple histogram plot produced by the code above]

score 0 · Accepted Answer

Indeed, to get the bars to be the same width, you need to use the same breaks for all groups. I went ahead and made some sample data. Rather than having a bunch of different data.frames, I put everything in a list.

#sample data
set.seed(15)
observations <- lapply(1:6, function(x) rnorm(100*x))

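Here we have 6 items, each with a different number of random draws from a standard normal distribution. Now I work out the complete range across all items and create 50 equally spaced break points (that is, 49 bins).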

maxrange <- range(sapply(observations, range))
breaks <- seq(maxrange[1], maxrange[2], length.out=50)

Now I apply these same breaks to all of the sets, calculating the histogram data without actually plotting.

hists <- lapply(observations, hist, breaks=breaks, plot=FALSE)
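As a quick sanity check on this step, the sketch below confirms that sharing one break vector gives every histogram identical, equally wide bins. It regenerates its own sample data so that it runs on its own. The names `samples`, `common_breaks`, and `h` are introduced here for illustration only.

```r
set.seed(15)
# Hypothetical samples of different sizes, mirroring the sample data above
samples <- lapply(1:6, function(k) rnorm(100 * k))

# One common set of equally spaced breaks spanning all samples
full_range <- range(unlist(samples))
common_breaks <- seq(full_range[1], full_range[2], length.out = 50)

# Compute histogram objects without drawing them
h <- lapply(samples, hist, breaks = common_breaks, plot = FALSE)

# Every histogram carries the same break points ...
same_breaks <- all(sapply(h, function(x) isTRUE(all.equal(x$breaks, common_breaks))))

# ... and the bin widths are constant (up to floating-point error)
widths <- diff(common_breaks)
uniform <- isTRUE(all.equal(min(widths), max(widths)))
```

If `hist()` were instead left to choose breaks per sample, each sample could end up with a different bin width, which is the effect seen in the question.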

To help with plotting, we need to pre-compute ylim so that the first plot is tall enough for the data in all of the other plots. I also choose the colors at this point.

ylim <- range(sapply(hists, function(x) {range(x$density)}))
cols<-list(
    rgb(1,0,0,1/7), rgb(0,0,1,1/7), rgb(0,1,0,1/7),
    rgb(0.5,0,0.5,1/7), rgb(0,0.5,0.5,1/7), rgb(0.5,0.5,0,1/7)
)

Then I plot the first histogram with all the labels and such that I want, and then plot all the others on top of it.

plot(hists[[1]], ylim=ylim, col=cols[[1]], freq=F, 
    main="Combined Histogram", xlab="Observation")
invisible(mapply(function(x, c) 
    plot(x, ylim=ylim, col=c, freq=F, add=T), 
hists[-1], cols[-1]))
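The question also asks about a legend, which the code above does not add. One possible way, sketched here with base R's `legend()`, is to pass the same fill colors used for the bars. The block is self-contained and regenerates its own sample data. The parameter names in `param_names` are placeholders chosen to match the question, not something the sample data actually contains.

```r
set.seed(15)
# Hypothetical samples standing in for the six parameters
samples <- lapply(1:6, function(k) rnorm(100 * k))
param_names <- paste0("PARAM", c(1, 2, 3, 5, 6, 7))  # illustrative labels

# Shared breaks so that all bars have the same width
rng <- range(unlist(samples))
common_breaks <- seq(rng[1], rng[2], length.out = 50)
h <- lapply(samples, hist, breaks = common_breaks, plot = FALSE)

# Semi-transparent fills, one per histogram
fill <- c(rgb(1, 0, 0, 1/7), rgb(0, 0, 1, 1/7), rgb(0, 1, 0, 1/7),
          rgb(0.5, 0, 0.5, 1/7), rgb(0, 0.5, 0.5, 1/7), rgb(0.5, 0.5, 0, 1/7))

# y-axis tall enough for the densest histogram
y_limits <- c(0, max(sapply(h, function(x) max(x$density))))

# First histogram sets up the axes; the rest are overlaid
plot(h[[1]], freq = FALSE, ylim = y_limits, col = fill[1],
     main = "Combined Histogram", xlab = "Observation")
for (i in 2:6) plot(h[[i]], freq = FALSE, col = fill[i], add = TRUE)

# One legend entry per histogram, using the same fill colors
legend("topright", legend = param_names, fill = fill, bty = "n")
```

Because the fills are semi-transparent, the legend swatches will look lighter than the regions where several histograms overlap.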

[Image: multiple histograms]
