r - R での正規分布の当てはめ

Question

正規分布に合わせて次のコードを使用しています。「b」のデータセットのリンク (大きすぎて直接投稿できません) は次のとおりです。

setwd("xxxxxx")
library(fitdistrplus)

require(MASS)
tazur <-read.csv("b", header= TRUE, sep=",")
claims<-tazur$b
a<-log(claims)
plot(hist(a))

ヒストグラムをプロットした後、正規分布がうまく適合するはずです。

f1n <- fitdistr(claims,"normal")
summary(f1n)

#Length Class  Mode   
#estimate 2      -none- numeric
#sd       2      -none- numeric
#vcov     4      -none- numeric
#n        1      -none- numeric
#loglik   1      -none- numeric

plot(f1n)

xy.coords(x, y, xlabel, ylabel, log) のエラー:

'x' はリストですが、コンポーネント 'x' と 'y' がありません

当てはめた分布をプロットしようとすると上記のエラーが発生し、要約統計量でさえ f1n がオフになっています。

助けていただければ幸いです。

score 20 · Accepted Answer

MASS::fitdistrとを混同しているようですfitdistrplus::fitdist。

MASS::fitdistrクラス "fitdistr" のオブジェクトを返しますが、これには plot メソッドはありません。したがって、推定パラメータを抽出し、推定密度曲線を自分でプロットする必要があります。
fitdistrplus関数呼び出しは明らかに使用していることを示しているため、package をロードする理由がわかりませんMASS。とにかく、クラス「fitdist」のオブジェクトを返すfitdistrplus関数があります。fitdistこのクラスには plot メソッドがありますが、によって返される "fitdistr" では機能しませんMASS。

両方のパッケージを使用する方法を紹介します。

## reproducible example
set.seed(0); x <- rnorm(500)

使用するMASS::fitdistr

利用可能なプロット方法はありませんので、自分で行ってください。

library(MASS)
fit <- fitdistr(x, "normal")
class(fit)
# [1] "fitdistr"

para <- fit$estimate
#         mean            sd 
#-0.0002000485  0.9886248515 

hist(x, prob = TRUE)
curve(dnorm(x, para[1], para[2]), col = 2, add = TRUE)

使用するfitdistrplus::fitdist

library(fitdistrplus)
FIT <- fitdist(x, "norm")    ## note: it is "norm" not "normal"
class(FIT)
# [1] "fitdist"

plot(FIT)    ## use method `plot.fitdist`

score 7 · Accepted Answer

前の回答のレビュー

前の回答では、2 つの方法の違いについては触れていませんでした。一般に、最尤推定を選択する場合は、を使用することをお勧めしMASS::fitdistrます。これは、多くの基本的な分布に対して、数値最適化ではなく正確な推定を実行するためです。Doc of?fitdistrはこれをかなり明確にしました：

For the Normal, log-Normal, geometric, exponential and Poisson
distributions the closed-form MLEs (and exact standard errors) are
used, and ‘start’ should not be supplied.

For all other distributions, direct optimization of the
log-likelihood is performed using ‘optim’.  The estimated standard
errors are taken from the observed information matrix, calculated
by a numerical approximation.  For one-dimensional problems the
Nelder-Mead method is used and for multi-dimensional problems the
BFGS method, unless arguments named ‘lower’ or ‘upper’ are
supplied (when ‘L-BFGS-B’ is used) or ‘method’ is supplied
explicitly.

一方、fitdistrplus::fitdist正確な推論が存在する場合でも、常に数値的な方法で推論を実行します。確かに、の利点fitdistは、より多くの推論原理が利用できることです。

Fit of univariate distributions to non-censored data by maximum
likelihood (mle), moment matching (mme), quantile matching (qme)
or maximizing goodness-of-fit estimation (mge).

この回答の目的

この回答では、正規分布の正確な推論を探ります。理論的な風味はありますが、尤度原理の証明はありません。結果のみが与えられます。これらの結果に基づいて、正確な推論のために独自の R 関数を作成します。これはと比較できますMASS::fitdistr。一方、と比較するために、負の対数尤度関数を数値的に最小化するためfitdistrplus::fitdistに使用します。optim

これは、統計との比較的高度な使用法を学ぶ絶好の機会ですoptim。便宜上、標準誤差ではなく分散という尺度パラメータを推定します。

正規分布の正確な推定

推論関数を自分で書く

次のコードはよくコメントされています。スイッチがありexactます。が設定されている場合FALSE、数値解法が選択されます。

## fitting a normal distribution
fitnormal <- function (x, exact = TRUE) {
  if (exact) {
    ################################################
    ## Exact inference based on likelihood theory ##
    ################################################
    ## minimum negative log-likelihood (maximum log-likelihood) estimator of `mu` and `phi = sigma ^ 2`
    n <- length(x)
    mu <- sum(x) / n
    phi <- crossprod(x - mu)[1L] / n  # (a bised estimator, though)
    ## inverse of Fisher information matrix evaluated at MLE
    invI <- matrix(c(phi, 0, 0, phi * phi), 2L,
                   dimnames = list(c("mu", "sigma2"), c("mu", "sigma2")))
    ## log-likelihood at MLE
    loglik <- -(n / 2) * (log(2 * pi * phi) + 1)
    ## return
    return(list(theta = c(mu = mu, sigma2 = phi), vcov = invI, loglik = loglik, n = n))
    }
  else {
    ##################################################################
    ## Numerical optimization by minimizing negative log-likelihood ##
    ##################################################################
    ## negative log-likelihood function
    ## define `theta = c(mu, phi)` in order to use `optim`
    nllik <- function (theta, x) {
      (length(x) / 2) * log(2 * pi * theta[2]) + crossprod(x - theta[1])[1] / (2 * theta[2])
      }
    ## gradient function (remember to flip the sign when using partial derivative result of log-likelihood)
    ## define `theta = c(mu, phi)` in order to use `optim`
    gradient <- function (theta, x) {
      pl2pmu <- -sum(x - theta[1]) / theta[2]
      pl2pphi <- -crossprod(x - theta[1])[1] / (2 * theta[2] ^ 2) + length(x) / (2 * theta[2])
      c(pl2pmu, pl2pphi)
      }
    ## ask `optim` to return Hessian matrix by `hessian = TRUE`
    ## use "..." part to pass `x` as additional / further argument to "fn" and "gn"
    ## note, we want `phi` as positive so box constraint is used, with "L-BFGS-B" method chosen
    init <- c(sample(x, 1), sample(abs(x) + 0.1, 1))  ## arbitrary valid starting values
    z <- optim(par = init, fn = nllik, gr = gradient, x = x, lower = c(-Inf, 0), method = "L-BFGS-B", hessian = TRUE)
    ## post processing ##
    theta <- z$par
    loglik <- -z$value  ## flip the sign to get log-likelihood
    n <- length(x)
    ## Fisher information matrix (don't flip the sign as this is the Hessian for negative log-likelihood)
    I <- z$hessian / n  ## remember to take average to get mean
    invI <- solve(I, diag(2L))  ## numerical inverse
    dimnames(invI) <- list(c("mu", "sigma2"), c("mu", "sigma2"))
    ## return
    return(list(theta = theta, vcov = invI, loglik = loglik, n = n))
    }
  }

テストには以前のデータを引き続き使用します。

set.seed(0); x <- rnorm(500)

## exact inference
fit <- fitnormal(x)

#$theta
#           mu        sigma2 
#-0.0002000485  0.9773790969 
#
#$vcov
#              mu    sigma2
#mu     0.9773791 0.0000000
#sigma2 0.0000000 0.9552699
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500

hist(x, prob = TRUE)
curve(dnorm(x, fit$theta[1], sqrt(fit$theta[2])), add = TRUE, col = 2)

数値法もかなり正確ですが、分散共分散が対角線から完全に 0 にならない点が異なります。

fitnormal(x, FALSE)

#$theta
#[1] -0.0002235315  0.9773732277
#
#$vcov
#                 mu       sigma2
#mu     9.773826e-01 5.359978e-06
#sigma2 5.359978e-06 1.910561e+00
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500

r - R での正規分布の当てはめ

2 に答える 2

Related

Reference