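I wrote a small program that uses scikit-learn to build a classifier for a specific dataset. Now I want to try it out and check that the classifier works. For example, the clf should detect "cats".
This is how I proceed:
- I have 50 pictures of cats and 50 pictures of "no cats".
- Get the descriptors from the data_set using a SIFT feature detector.
- Split the data into a training set and a test set (25 cat pictures + 25 non-cat pictures = training_set, and the same for test_set).
- Get the cluster centers from the training_set using kmeans.
- Create histogram data for the training_set and test_set using the cluster centers.
- Try this code from scikit-learn: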
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_scores holds the per-fold scores of one parameter combination
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, cv_scores.std() / 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(y_true)
    print(y_pred)
    print(classification_report(y_true, y_pred))
    print()
    print(clf.score(X_train, y_train))
    print("score")
    print(clf.best_params_)
    print("best_params")
    pred = clf.predict(X_test)
    print(accuracy_score(y_test, pred))
    print("accuracy_score")
And I get this result:
# Tuning hyper-parameters for recall
()
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.].
average=average)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.].
average=average)
Best parameters set found on development set:
()
SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.]
             precision    recall  f1-score   support
        0.0       1.00      0.04      0.08        25
        1.0       0.51      1.00      0.68        25
avg / total       0.76      0.52      0.38        50
()
0.52
score
{'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
best_params
0.52
accuracy_score
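The numbers in this report can be checked by hand from the two printed arrays. A minimal sketch, assuming only numpy and the arrays exactly as printed above; the names tp, fn, fp and tn are introduced here for illustration:

```python
import numpy as np

# Labels as printed above: 25 cats (1) followed by 25 non-cats (0).
y_true = np.array([1] * 25 + [0] * 25)

# Predictions as printed above: every entry is 1 except the one at index 44.
y_pred = np.ones(50, dtype=int)
y_pred[44] = 0

# Confusion counts, treating "cat" (1) as the positive class.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

precision_cat = tp / float(tp + fp)      # 25 / 49
recall_cat = tp / float(tp + fn)         # 25 / 25
precision_not_cat = tn / float(tn + fn)  # 1 / 1
recall_not_cat = tn / float(tn + fp)     # 1 / 25
accuracy = (tp + tn) / float(len(y_true))

print(tp, fn, fp, tn)
print(round(precision_cat, 2), recall_cat, precision_not_cat, recall_not_cat, accuracy)
```

In other words, 49 of the 50 test pictures are predicted as cats, and the accuracy of 0.52 is barely above the 0.50 that always answering "cat" would give on this balanced test set.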
The clf seems to tell everything that it is a cat... but why?
Is the data_set too small to get good results?
Edit: I'm using VLFeat to detect the SIFT descriptors.
Functions:
import numpy
from sklearn.cluster import KMeans
# vlfeat_module is not shown here, so its import is omitted

def create_descriptor_data(data, ID):
    descriptor_list = []
    datas = numpy.genfromtxt(data, dtype='str')
    for p in datas:
        # create descriptors and save descs in file
        locs, desc = vlfeat_module.vlf_create_descriptors(p, str(ID) + '.key', ID)
        if len(desc) > 500:
            # take between 400 - 800 descriptors; the slice step must be an integer
            desc = desc[::len(desc) // 400]
        descriptor_list.append(desc)
        ID += 1  # ID for filename
    return descriptor_list

# create k-mean centers from all *.txt files in directory (data)
def create_center_data(data):
    #data = numpy.vstack(data)
    # note: numpy.unique flattens its input, so this counts the distinct
    # scalar values in data, not the distinct descriptors
    n_clusters = len(numpy.unique(data))
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
    kmeans.fit(data)
    return kmeans, n_clusters

def create_histogram_data(kmeans, descs, n_clusters):
    histogram_list = []
    # load from each file data
    for desc in descs:
        length = len(desc)
        # create histogram from descriptors
        histogram = kmeans.predict(desc)
        histogram = numpy.bincount(histogram, minlength=n_clusters)  # minlength = k in k-means
        histogram = numpy.divide(histogram, length, dtype='float')
        histogram_list.append(histogram)
    histogram = numpy.vstack(histogram_list)
    return histogram
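To make the histogram step concrete, here is a minimal sketch of what create_histogram_data computes for a single image, assuming the cluster index of each descriptor is already known. The labels array and k = 5 are made-up values for illustration; they stand in for the output of kmeans.predict(desc) and for n_clusters.

```python
import numpy as np

k = 5                            # hypothetical number of clusters (visual words)
labels = np.array([0, 2, 2, 3])  # hypothetical cluster index of each descriptor

# Count how many descriptors fall into each cluster; minlength pads the
# clusters that received no descriptor with zeros.
counts = np.bincount(labels, minlength=k)

# Normalise by the number of descriptors so that images with different
# descriptor counts give comparable histograms.
histogram = counts / float(len(labels))

print(counts)
print(histogram)
```

Each row of X_train and X_test is one such vector, so its length has to equal n_clusters for every image.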
And the call:
X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 25 pics
X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 25 pics
X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)
x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))
X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)
X_train = numpy.vstack([X_train_pos, X_train_neg])
y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])
X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)
X_test = numpy.vstack([X_test_pos, X_test_neg])
y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_scores holds the per-fold scores of one parameter combination
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, cv_scores.std() / 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(y_true)
    print(y_pred)
    print(classification_report(y_true, y_pred))
    print()
    print(clf.score(X_train, y_train))
    print("score")
    print(clf.best_params_)
    print("best_params")
    pred = clf.predict(X_test)
    print(accuracy_score(y_test, pred))
    print("accuracy_score")
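For readers on a recent scikit-learn: grid_scores_ was removed in version 0.20, and the per-candidate scores now live in cv_results_. The following is a minimal sketch of the same reporting loop, assuming scikit-learn 0.20 or later. The data comes from make_classification purely as a stand-in for the histograms; the sample count, feature count and random_state are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training histograms (50 samples, 2 classes).
X_train, y_train = make_classification(n_samples=50, n_features=20, random_state=0)

tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

clf = GridSearchCV(SVC(), tuned_parameters, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)

# cv_results_ is a dict of parallel arrays, one entry per parameter combination.
results = clf.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("%0.3f (std %0.3f) for %r" % (mean, std, params))

print(clf.best_params_)
```

The grid has 8 rbf combinations and 4 linear ones, so the loop prints 12 lines.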
Edit: I made some changes: I updated the parameter ranges and scored by "accuracy" again.
# Tuning hyper-parameters for accuracy
()
Best parameters set found on development set:
()
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
...
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1.
1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
             precision    recall  f1-score   support
        0.0       0.88      0.92      0.90        25
        1.0       0.92      0.88      0.90        25
avg / total       0.90      0.90      0.90        50
()
1.0
score
{'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}
best_params
0.9
accuracy_score
But when I test it on a single picture with
rslt = clf.predict(test_histogram)
it still tells the couch: "You are a cat" :D
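When predicting on a single new picture, the histogram has to be built with the same fitted kmeans object that produced the training histograms, and it has to be passed as a 2-D array with one row. A minimal sketch of that path, assuming scikit-learn's KMeans and SVC and using random arrays in place of real SIFT descriptors. The helpers fake_descriptors and to_histogram, the 8-dimensional descriptors and all sizes are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n_clusters = 4

def fake_descriptors(offset, n=60):
    # Hypothetical stand-in for the SIFT descriptors of one image.
    return rng.normal(loc=offset, scale=1.0, size=(n, 8))

def to_histogram(kmeans, desc, n_clusters):
    # Same recipe as create_histogram_data, for a single image.
    words = kmeans.predict(desc)
    counts = np.bincount(words, minlength=n_clusters)
    return counts / float(len(desc))

# Ten "cat" images around offset 0 and ten "no cat" images around offset 3.
train_descs = ([fake_descriptors(0.0) for _ in range(10)] +
               [fake_descriptors(3.0) for _ in range(10)])
y_train = np.array([1] * 10 + [0] * 10)

# Fit the vocabulary once, on training descriptors only.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
kmeans.fit(np.vstack(train_descs))

X_train = np.vstack([to_histogram(kmeans, d, n_clusters) for d in train_descs])
svc = SVC(kernel='linear', C=1.0).fit(X_train, y_train)

# A single new image: reuse the fitted kmeans, then reshape to (1, n_clusters).
test_histogram = to_histogram(kmeans, fake_descriptors(3.0), n_clusters).reshape(1, -1)
rslt = svc.predict(test_histogram)
print(test_histogram.shape, rslt)
```

If test_histogram were built with a different kmeans, or with a different number of bins, its columns would no longer mean the same visual words as the columns of X_train.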