python - Using dimensionality reduction on a matrix

Question

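For a supervised learning task my matrix is very large, and as a result only certain models will run on it. I have read that PCA can help reduce the dimensionality considerably.

Below is my code: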

import subprocess

import numpy as np

def run(command):
    # Run a shell command and return its raw output (bytes)
    output = subprocess.check_output(command, shell=True)
    return output

f = open('/Users/ya/Documents/10percent/Vik.txt','r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("Training column size:")
print(col)

#dataset = list()

row = run('cat '+'/Users/ya/Documents/10percent/X_true.txt'+" | wc -l").split()[0]
print("Training row size:")
print(row)
matrix_tmp = np.zeros((int(row),col), dtype=np.int64)
print("Train Matrix size:")
print(matrix_tmp.size)
# label_tmp.ndim must be equal to 1
label_tmp = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/ya/Documents/10percent/X_true.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()
print("Train matrix is:\n ")
print(matrix_tmp)
print(label_tmp)
print(len(label_tmp))
print("No. of topics in train:")
print(len(set(label_tmp)))
print("Train Label size:")
print(len(label_tmp))

The matrix is roughly (202180 x 9984), so I would like to apply PCA to matrix_tmp. How can I change my code to include it?

score 1 · Accepted Answer

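import codecs
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Read one document per line
with codecs.open('input_file', 'r', encoding='utf-8') as inf:
    lines = inf.readlines()

# binary=True records presence/absence (0/1) instead of counts;
# the result is a scipy sparse matrix, not a dense array
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(lines)

# Set to True only if the dimensionality really needs reducing
perform_pca = False
if perform_pca:
    n_components = 100
    # TruncatedSVD accepts sparse input directly; unlike PCA it does not center the data
    pca = TruncatedSVD(n_components)
    X_train = pca.fit_transform(X_train)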

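1- Use the vectorizers available in sklearn to do the vectorization. This produces a sparse matrix rather than a full matrix filled mostly with zeros.

2- Perform PCA only if it is needed.

3- If necessary, tune the parameters of the vectorizer and of the PCA to improve performance.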
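
As a rough sketch of how the three points could be applied to the question's setup (a fixed vocabulary file and whitespace-separated words), something like the following may work. The function name `vectorize_and_reduce` and its parameters are made up for illustration, and the `token_pattern` and `lowercase` settings are my assumption about how to mimic `line.split()` with exact word matching. Note that `CountVectorizer` rejects duplicate vocabulary entries, so duplicates are dropped here, which can make the column count smaller than `len(vocab_temp)`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


def vectorize_and_reduce(lines, vocab, n_components=100, random_state=0):
    """Hypothetical helper: sparse binary term matrix over `vocab`, then SVD."""
    # CountVectorizer raises on duplicate terms, so keep the first
    # occurrence of each word (dict.fromkeys preserves order).
    vocab = list(dict.fromkeys(vocab))

    vectorizer = CountVectorizer(
        vocabulary=vocab,       # fixed columns, words outside it are ignored
        binary=True,            # 0/1 presence, as in the question's loop
        lowercase=False,        # assumption: match words exactly as written
        token_pattern=r"\S+",   # assumption: split on whitespace like str.split()
    )
    X_sparse = vectorizer.fit_transform(lines)

    # n_components should be smaller than the number of columns
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    X_reduced = svd.fit_transform(X_sparse)
    return X_sparse, X_reduced, svd
```

With the question's data this would be called with the lines of X_true.txt and `vocab_temp`. Summing `svd.explained_variance_ratio_` gives a rough idea of how much variance the chosen `n_components` retains, which is one way to decide whether 100 components is enough.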
