python - Fitting a pandas DataFrame to a Scikit-Learn model without additional libraries or methods

Question

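On the one hand, people say that pandas works well with scikit-learn. For example, pandas Series objects fit sklearn models without trouble in this video. On the other hand, there is sklearn-pandas, which provides a bridge between Scikit-Learn's machine learning methods and pandas-style data frames, and its existence suggests that such libraries are needed. Moreover, some people convert pandas data frames to numpy arrays before fitting a model.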

I wonder whether it is possible to combine pandas and scikit-learn without any additional methods or libraries. My problem is that whenever I fit my dataset to a sklearn model in the following way:

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

d = {'x': np.linspace(1., 100., 20), 'y': np.linspace(1., 10., 20)}
df = pd.DataFrame(d)

train, test = train_test_split(df, test_size = 0.2)

trainX = train['x']
trainY = train['y']

lin_svm = SVC(kernel='linear').fit(trainX, trainY)

I get this error:

ValueError: Unknown label type: 19    10.000000
0      1.000000
17     9.052632
18     9.526316
12     6.684211
11     6.210526
16     8.578947
14     7.631579
10     5.736842
7      4.315789
8      4.789474
2      1.947368
13     7.157895
1      1.473684
6      3.842105
3      2.421053
Name: y, dtype: float64

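As far as I understand, the problem has something to do with the data structure. However, there are a few examples on the internet that use similar code without any problems.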

score 1 · Accepted Answer

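What you are trying to do is regression, not classification.

Think about it: to perform classification, you need a binary or a multiclass output. In your case, you are giving continuous data to a classifier.

If you trace the sklearn error back and dig a little deeper into the implementation of the .fit() method, you will find the following function: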

def check_classification_targets(y):
    """Ensure that target y is of a non-regression type.

    Only the following target types (as defined in type_of_target) are allowed:
        'binary', 'multiclass', 'multiclass-multioutput',
        'multilabel-indicator', 'multilabel-sequences'

    Parameters
    ----------
    y : array-like
    """
    y_type = type_of_target(y)
    if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
                      'multilabel-indicator', 'multilabel-sequences']:
        raise ValueError("Unknown label type: %r" % y)

The docstring of the function type_of_target is:

def type_of_target(y):
    """Determine the type of data indicated by target `y`

    Parameters
    ----------
    y : array-like

    Returns
    -------
    target_type : string
        One of:
        * 'continuous': `y` is an array-like of floats that are not all
          integers, and is 1d or a column vector.
        * 'continuous-multioutput': `y` is a 2d array of floats that are
          not all integers, and both dimensions are of size > 1.
        * 'binary': `y` contains <= 2 discrete values and is 1d or a column
          vector.
        * 'multiclass': `y` contains more than two discrete values, is not a
          sequence of sequences, and is 1d or a column vector.
        * 'multiclass-multioutput': `y` is a 2d array that contains more
          than two discrete values, is not a sequence of sequences, and both
          dimensions are of size > 1.
        * 'multilabel-indicator': `y` is a label indicator matrix, an array
          of two dimensions with at least two columns, and at most 2 unique
          values.
        * 'unknown': `y` is array-like but none of the above, such as a 3d
          array, sequence of sequences, or an array of non-sequence objects.

In your case, type_of_target(trainY) == 'continuous', so the function check_classification_targets() raises a ValueError.
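To see this check in isolation, here is a minimal sketch that calls type_of_target directly on a pandas Series. It assumes a scikit-learn version that exposes the function in sklearn.utils.multiclass; the binarized target (above or below the median) is a made-up example and not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Same target as in the question: 20 evenly spaced floats between 1 and 10.
y_continuous = pd.Series(np.linspace(1., 10., 20), name='y')

# Hypothetical binarized target: 1 where y is above its median, 0 otherwise.
y_binary = (y_continuous > y_continuous.median()).astype(int)

continuous_type = type_of_target(y_continuous)
binary_type = type_of_target(y_binary)

print(continuous_type)  # 'continuous' -> rejected by classifiers
print(binary_type)      # 'binary'     -> accepted by classifiers
```

The pandas Series is passed in as-is, so the data structure itself is not what triggers the error; the values are.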

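Conclusion:

If you want to perform classification, change your target y (for example, use a binary vector).
If you want to keep your continuous data, perform regression instead: use svm.SVR.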
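Both options can be sketched with the data frame from the question and no extra bridge library. This is only an illustration under a few assumptions: it imports train_test_split from sklearn.model_selection (the sklearn.cross_validation module used in the question was removed in later scikit-learn releases), it selects the feature with double brackets so that X stays two-dimensional (recent versions reject a 1-D X), and the median threshold used to build class labels is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR

df = pd.DataFrame({'x': np.linspace(1., 100., 20),
                   'y': np.linspace(1., 10., 20)})

train, test = train_test_split(df, test_size=0.2, random_state=0)

# Double brackets keep the features as a 2-D DataFrame of shape (n_samples, 1).
train_X, test_X = train[['x']], test[['x']]

# Option 1: keep the continuous target and fit a regressor.
svr = SVR(kernel='linear').fit(train_X, train['y'])
reg_pred = svr.predict(test_X)

# Option 2: turn the target into class labels and fit a classifier.
# The threshold is an assumption made only for this example.
threshold = df['y'].median()
train_labels = (train['y'] > threshold).astype(int)
svc = SVC(kernel='linear').fit(train_X, train_labels)
clf_pred = svc.predict(test_X)

print(reg_pred)
print(clf_pred)
```

In both cases the estimators take the pandas objects directly; no conversion to numpy arrays is written out by hand.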
