c - Fast and efficient least squares fit algorithm in C?

Question

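I am trying to implement a linear least squares fit on two arrays of data: time vs. amplitude. The only technique I know so far is to test every possible m and b value in (y = m * x + b) and pick the combination that fits my data with the least error. However, I suspect that iterating over so many combinations is wasteful, because it tests everything. Are there any techniques to speed up the process that I don't know about? Thanks.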

score 7 · Accepted Answer

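There are efficient algorithms for least squares fitting; see Wikipedia for details. There are also libraries that implement these algorithms, likely more efficiently than a naive implementation would. The GNU Scientific Library is one example, but there are others under more permissive licenses as well.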
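For the straight-line case in the question, the reason no search over m and b is needed can be sketched as follows. This is the standard simple linear regression result, not something specific to any one library; the symbols S, n, x_i and y_i are introduced here for illustration. Minimizing the sum of squared errors and setting both partial derivatives to zero gives m and b directly from a few sums over the data:

```latex
\[
S(m, b) = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
\]
\[
\frac{\partial S}{\partial m} = 0, \quad \frac{\partial S}{\partial b} = 0
\;\Longrightarrow\;
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}
\]
```

Each sum takes a single pass over the data, so the whole fit is O(n). The denominator is zero only when all x values are identical.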

score 3 · Accepted Answer

Here is my version of a C/C++ function that performs simple linear regression. The calculations follow the Wikipedia article on simple linear regression. It is published as a single-header public-domain (MIT) library on GitHub: simple_linear_regression. The library (.h file) has been tested to work on Linux and Windows, and from C and C++ using -Wall -Werror and every -std version supported by clang/gcc.

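#include <stddef.h> /* NULL */

#define SIMPLE_LINEAR_REGRESSION_ERROR_INPUT_VALUE -2
#define SIMPLE_LINEAR_REGRESSION_ERROR_NUMERIC     -3

/* Fits y = slope * x + intercept to n points by ordinary least squares.
 * Any output pointer may be NULL if that result is not needed.
 * Returns 0 on success or a negative SIMPLE_LINEAR_REGRESSION_ERROR_* code. */
int simple_linear_regression(const double * x, const double * y, const int n, double * slope_out, double * intercept_out, double * r2_out) {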
    double sum_x = 0.0;
    double sum_xx = 0.0;
    double sum_xy = 0.0;
    double sum_y = 0.0;
    double sum_yy = 0.0;
    double n_real = (double)(n);
    int i = 0;
    double slope = 0.0;
    double denominator = 0.0;

    if (x == NULL || y == NULL || n < 2) {
        return SIMPLE_LINEAR_REGRESSION_ERROR_INPUT_VALUE;
    }

    for (i = 0; i < n; ++i) {
        sum_x += x[i];
        sum_xx += x[i] * x[i];
        sum_xy += x[i] * y[i];
        sum_y += y[i];
        sum_yy += y[i] * y[i];
    }

    denominator = n_real * sum_xx - sum_x * sum_x;
    if (denominator == 0.0) {
        return SIMPLE_LINEAR_REGRESSION_ERROR_NUMERIC;
    }
    slope = (n_real * sum_xy - sum_x * sum_y) / denominator;

    if (slope_out != NULL) {
        *slope_out = slope;
    }

    if (intercept_out != NULL) {
        *intercept_out = (sum_y  - slope * sum_x) / n_real;
    }

    if (r2_out != NULL) {
        denominator = ((n_real * sum_xx) - (sum_x * sum_x)) * ((n_real * sum_yy) - (sum_y * sum_y));
        if (denominator == 0.0) {
            return SIMPLE_LINEAR_REGRESSION_ERROR_NUMERIC;
        }
        *r2_out = ((n_real * sum_xy) - (sum_x * sum_y)) * ((n_real * sum_xy) - (sum_x * sum_y)) / denominator;
    }

    return 0;
}

Usage example:

#define SIMPLE_LINEAR_REGRESSION_IMPLEMENTATION
#include "simple_linear_regression.h"

#include <stdio.h>

/* Some data that we want to find the slope, intercept and r2 for */
static const double x[] = { 1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83 };
static const double y[] = { 52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46 };

int main(void) {
    double slope = 0.0;
    double intercept = 0.0;
    double r2 = 0.0;
    int res = 0;

    res = simple_linear_regression(x, y, sizeof(x) / sizeof(x[0]), &slope, &intercept, &r2);
    if (res < 0) {
        printf("Error: %s\n", simple_linear_regression_error_string(res));
        return res;
    }

    printf("slope: %f\n", slope);
    printf("intercept: %f\n", intercept);
    printf("r2: %f\n", r2);

    return 0;
}

score 2 · Accepted Answer

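Take a look at Section 1 of this paper. That section expresses 2D linear regression as a matrix multiplication exercise. As long as your data is well-behaved, this technique should let you develop a quick least squares fit.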

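Depending on the size of your data, it might be worthwhile to algebraically reduce the matrix multiplication to a simple set of equations, thereby avoiding the need to write a matmult() function. (Be forewarned, this is completely impractical for more than 4 or 5 data points!)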
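As a rough illustration of reducing the matrix form to plain equations, here is a minimal sketch for the straight-line case. It is not taken from the paper: the function name fit_line_normal_equations and its return codes are hypothetical, and it assumes the design matrix X = [x 1], so that X^T X is only 2x2. The entries of X^T X and X^T y are accumulated as running sums, and the 2x2 system is solved with Cramer's rule, so no general matmult() routine is needed.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper or the answers above): fits
 * y = m * x + b by solving the 2x2 normal equations directly.
 * Returns 0 on success, -1 on bad input or when all x values coincide. */
int fit_line_normal_equations(const double *x, const double *y, size_t n,
                              double *m_out, double *b_out)
{
    double sxx = 0.0, sx = 0.0, sxy = 0.0, sy = 0.0;
    double det;
    size_t i;

    if (x == NULL || y == NULL || m_out == NULL || b_out == NULL || n < 2) {
        return -1;
    }

    /* Entries of X^T X and X^T y for the design matrix X = [x 1]. */
    for (i = 0; i < n; ++i) {
        sxx += x[i] * x[i];
        sx  += x[i];
        sxy += x[i] * y[i];
        sy  += y[i];
    }

    /* Determinant of X^T X = [[sxx, sx], [sx, n]]. */
    det = sxx * (double)n - sx * sx;
    if (det == 0.0) {
        return -1;
    }

    /* Cramer's rule on [[sxx, sx], [sx, n]] [m, b]^T = [sxy, sy]^T. */
    *m_out = (sxy * (double)n - sx * sy) / det;
    *b_out = (sxx * sy - sx * sxy) / det;
    return 0;
}
```

Called with x = {0, 1, 2, 3} and y = {1, 3, 5, 7}, the function returns 0 and sets m to 2 and b to 1. Because only four sums are kept, the size of the system does not grow with the number of data points. For poorly conditioned data a library routine is likely the safer choice.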
