classification - How to perform logistic regression using vowpal wabbit on a very imbalanced dataset

Question

I am trying to use vowpal wabbit for logistic regression. I am not sure whether this is the right syntax for it.

For training, I do

 ./vw -d ~/Desktop/new_data.txt --passes 20 --binary --cache_file cache.txt -f lr.vw --loss_function logistic --l1 0.05

For testing I do 
./vw -d ~/libsvm-3.18_test/matlab/new_data_test.txt --binary -t -i lr.vw -p predictions.txt -r raw_score.txt

Here is a snippet from my training data

-1:1.00038 | 110:0.30103 262:0.90309 689:1.20412 1103:0.477121 1286:1.5563 2663:0.30103 2667:0.30103 2715:4.63112 3012:0.30103 3113:8.38411 3119:4.62325 3382:1.07918 3666:1.20412 3728:5.14959 4029:0.30103 4596:0.30103

1:2601.25 | 32:2.03342 135:3.77379 146:3.19535 284:2.5563 408:0.30103 542:3.80618 669:1.07918 689:2.25527 880:0.30103 915:1.98227 1169:5.35371 1270:0.90309 1425:0.30103 1621:0.30103 1682:0.30103 1736:3.98227 1770:0.60206 1861:4.34341 1900:3.43136 1905:7.54141 1991:5.33791 2437:0.954243 2532:2.68664 3370:2.90309 3497:0.30103 3546:0.30103 3733:0.30103 3963:0.90309 4152:3.23754 4205:1.68124 4228:0.90309 4257:1.07918 4456:0.954243 4483:0.30103 4766:0.30103

Here is a snippet from my test data

-1 | 110:0.90309 146:1.64345 543:0.30103 689:0.30103 1103:0.477121 1203:0.30103 1286:2.82737 1892:0.30103 2271:0.30103 2715:4.30449 3012:0.30103 3113:7.99039 3119:4.08814 3382:1.68124 3666:0.60206 3728:5.154 3960:0.778151 4309:0.30103 4596:0.30103 4648:0.477121

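However, when I look at the results, the predictions are all -1 and the raw scores are all 0. I have around 200,000 examples, of which 100 are +1 and the rest are -1. To handle this imbalance, I gave the positive examples a weight of 200,000/100 and the negative examples a weight of 200,000/(200000-100). Is this happening because my data is so highly imbalanced, even though I adjusted the weights?

I was expecting the raw score file to contain P(y|x), but I get all zeros. I just need the probability outputs. Any suggestions as to what is going on?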

score 18 · Accepted Answer

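This is a summary of the detailed answer by arielf.

It is important to know what the intended final cost (loss) function is: logistic loss, 0/1 loss (i.e. accuracy), F1 score, area under the ROC curve, or something else?

Here is Bash code for part of arielf's answer. Note that the odd attempts at importance weighting must first be removed from train.txt (that is, the ":1.00038" and ":2601.25" in the question).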
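
The weight removal mentioned above is not spelled out in the commands that follow. A minimal sketch of one way to do it with sed, assuming every line starts with a label of 1 or -1 followed directly by ":weight", as in the question's snippet. The function name `strip_label_weights` is hypothetical and not part of vw:

```shell
# Hypothetical helper: drop a ":weight" suffix glued onto the leading label.
# Only the start of the line is touched, so feature pairs such as 110:0.30103
# after the "|" are left alone.
strip_label_weights() {
  sed -E 's/^(-?1):[0-9.]+/\1/'
}

# Toy demonstration on two shortened lines from the question.
printf '%s\n' \
  '-1:1.00038 | 110:0.30103 262:0.90309' \
  '1:2601.25 | 32:2.03342 135:3.77379' \
  | strip_label_weights
```

On the real data this could be run as `strip_label_weights < new_data.txt > train.txt`, assuming new_data.txt is the question's training file and train.txt is the input expected by step A.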

A. Prepare the training data
grep '^-1' train.txt | shuf > neg.txt
grep '^1' train.txt | shuf > p.txt
for i in `seq 2000`; do cat p.txt; done > pos.txt
paste -d '\n' neg.txt pos.txt > newtrain.txt
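
The repetition count of 2000 in step A is roughly the ratio of negative to positive examples in the question (about 199,900 to 100). A hedged sketch of deriving that count from the data instead of hard-coding it. `oversample_factor` is a hypothetical helper name, and the sketch assumes the labels are exactly 1 and -1 at the start of each line:

```shell
# Hypothetical helper: integer ratio of negative to positive examples,
# i.e. how many copies of the positives roughly balance the two classes.
oversample_factor() {
  neg_count=$(grep -c '^-1' "$1" || true)
  pos_count=$(grep -c '^1' "$1" || true)
  if [ "$pos_count" -eq 0 ]; then
    echo "no positive examples found in $1" >&2
    return 1
  fi
  echo $(( neg_count / pos_count ))
}

# Toy demonstration: 4 negatives and 2 positives give a factor of 2.
demo=$(mktemp)
printf '%s\n' '-1 | a:1' '-1 | b:1' '-1 | c:1' '-1 | d:1' '1 | e:1' '1 | f:1' > "$demo"
oversample_factor "$demo"
rm -f "$demo"
```

With 199,900 negatives and 100 positives this gives 1999, which is close to the 2000 used in step A; the result could be passed to `seq` in place of the literal.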

B. Train model.vw
# Note that passes=1 is the default.
# With one pass, holdout_off is the default.
vw -d newtrain.txt --loss_function=logistic -f model.vw
#average loss = 0.0953586

C. Compute test loss using vw
vw -d test.txt -t -i model.vw --loss_function=logistic -r raw_predictions.txt
#average loss = 0.0649306

D. Compute AUROC using http://osmot.cs.cornell.edu/kddcup/software.html
cut -d ' ' -f 1 test.txt | sed -e 's/^-1/0/' > gold.txt
$VW_HOME/utl/logistic -0 raw_predictions.txt > probabilities.txt
perf -ROC -files gold.txt probabilities.txt 
#ROC    0.83484
perf -ROC -plot roc -files gold.txt probabilities.txt | head -n -2 > graph
echo 'plot "graph"' | gnuplot -persist
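
Step D turns raw scores into probabilities with the utl/logistic script. For readers without that script at hand, the mapping can be approximated with awk. This is a sketch that assumes the usual logistic (sigmoid) function p = 1 / (1 + exp(-raw)) and that the raw score is the first whitespace-separated field on each line; `raw_to_prob` is a hypothetical name:

```shell
# Hypothetical stand-in for the logistic mapping: p = 1 / (1 + exp(-raw)).
# Reads raw scores on stdin (first field of each line) and prints one
# probability per line with six decimal places.
raw_to_prob() {
  awk '{ printf "%.6f\n", 1 / (1 + exp(-$1)) }'
}

# Toy demonstration: a raw score of 0 maps to 0.5, positive scores map
# above 0.5, and negative scores map below 0.5.
printf '%s\n' 0 2.5 -2.5 | raw_to_prob
```

Under this mapping, raw scores that are all 0, as reported in the question, correspond to a probability of 0.5 for every example, so all-zero raw output means the model has learned nothing rather than that the probabilities are 0.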
