numpy - NumPy concatenate dimension mismatch

Question

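I've run into a problem with numpy concatenation that I haven't been able to make sense of, and I'm hoping someone has hit the same issue and resolved it. I'm trying to join two arrays created by SciKit-Learn's TfidfVectorizer and a label encoder (LabelBinarizer in the code below), but I get the error message "arrays must have same number of dimensions" even though my inputs are a (77946, 12157) array and a (77946, 1000) array, respectively. (As requested in the comments, a reproducible example is at the bottom.)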

TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()
tagvect2=preprocessing.LabelBinarizer()

tagvect2.fit(DS['location2'].tolist())
TV.fit(DS['tweet'])
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
A=tagvect2.transform(DS['location2'])
B=TV.transform(DS['tweet'])
print A.shape
print B.shape
pdb.set_trace()
Xtrain=np.concatenate([A,B.todense()],axis=1)

Initially I thought the problem might stem from B being encoded as a sparse matrix, but converting it to a dense matrix did not resolve it. I ran into the same problem using hstack instead.
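One way to test the sparse-matrix hypothesis in isolation is to stack a small dense block and a small sparse block by hand. The sketch below is only an illustration under stated assumptions: `A_demo` and `B_demo` are made-up stand-ins for A and B (they are not from the code above), and SciPy is assumed to be installed. As far as I can tell, np.concatenate does not unpack SciPy sparse matrices, so the sparse block either has to be densified with toarray() or stacked with SciPy's own hstack.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins for the real inputs (illustrative only):
# A_demo plays the role of the dense LabelBinarizer output,
# B_demo the role of the sparse TfidfVectorizer output.
A_demo = np.array([[1, 0], [1, 0], [0, 1]])
B_demo = sparse.csr_matrix(np.array([[0.5, 0.0, 0.5],
                                     [0.0, 1.0, 0.0],
                                     [0.2, 0.0, 0.8]]))

# Option 1: densify the sparse block explicitly, then stack with NumPy.
# toarray() returns a plain 2-D ndarray (todense() returns np.matrix).
dense_stack = np.hstack([A_demo, B_demo.toarray()])

# Option 2: stay sparse; convert the dense block and use SciPy's hstack.
sparse_stack = sparse.hstack([sparse.csr_matrix(A_demo), B_demo]).tocsr()

print(dense_stack.shape)
print(sparse_stack.shape)
```

For inputs as large as (77946, 12157) the second option is probably the safer one, since it avoids materialising the full dense matrix in memory.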

Even stranger, no error occurs when I add a third label encoder matrix:

TV.fit(DS['tweet'])
tagvect.fit(DS['state'].tolist())
tagvect2.fit(DS['location'].tolist())
GBR=GradientBoostingRegressor()
print "creating Xtrain and test"
Xtrain=pd.DataFrame(np.concatenate([tagvect.transform(DS['state']),tagvect2.transform(DS['location']),TV.transform(DS['tweet'])],axis=1))

The error message is:

  Traceback (most recent call last):
  File "smallerdimensions.py", line 49, in <module>
    Xtrain=pd.DataFrame(np.concatenate((A,B.todense()),axis=1))
ValueError: arrays must have same number of dimensions
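Since the error complains about the number of dimensions rather than the shapes, it may help to look at what NumPy makes of each input once it has been converted with np.asarray. The following is a minimal debugging sketch, not a diagnosis: `describe_inputs` is a hypothetical helper (not part of NumPy), and `Opaque` is a made-up stand-in for any object NumPy cannot unpack into an array. Such an object ends up wrapped as a 0-dimensional object array, which cannot match a 2-D array no matter what its own .shape attribute says.

```python
import numpy as np

def describe_inputs(arrays):
    """Return (type name, ndim after np.asarray) for each input.

    Hypothetical debugging helper. np.asarray is used on purpose
    instead of reading the object's own .ndim or .shape attribute,
    so the result reflects what NumPy itself sees.
    """
    return [(type(a).__name__, np.asarray(a).ndim) for a in arrays]

class Opaque(object):
    """Made-up stand-in for an object NumPy cannot unpack."""
    shape = (3, 4)  # NumPy ignores this attribute during conversion

two_d = np.zeros((3, 2))
report = describe_inputs([two_d, Opaque()])
print(report)
```

If one of the real inputs reports 0 dimensions here while the other reports 2, the mismatch comes from the type of that input rather than from its shape.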

Thank you for any help you can offer. Here is a reproducible example:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
import numpy as np


tweets=["Jazz for a Rainy Afternoon","RT: @mention: I love rainy days.", "Good Morning Chicago!"]
location=["Oklahoma", "Oklahoma","Illinois"]

DS=pd.DataFrame({"tweet":tweets,"location":location})



TV=TfidfVectorizer(min_df=1,max_features=1000)
tagvect=preprocessing.LabelBinarizer()

DS['location']=DS['location'].fillna("none")

tagvect.fit(DS['location'].tolist())
TV.fit(DS['tweet'])
print "before problem"
print DS['tweet']
print DS['location']
print tagvect.transform(DS['location'])
print tagvect.transform(DS['location']).shape
print TV.transform(DS['tweet']).shape
print TV.transform(DS['tweet'])
print TV.transform(DS['tweet']).todense()
print np.concatenate([tagvect.transform(DS['location']),TV.transform(DS['tweet'])],axis=1)

Numpy is v 1.6.1, pandas is v 0.12.0, and scikit is 0.14.1.
