machine-learning - 文書分類のための教師あり潜在的ディリクレ割り当て？

Question

いくつかのグループには、すでに人間が分類したドキュメントがたくさんあります。

モデルをトレーニングし、後でそれを使用して未知のドキュメントを分類するために使用できるldaの修正バージョンはありますか？

score 15 · Accepted Answer

価値のあることとして、分類器としてのLDAは生成モデルであり、分類は判別問題であるため、かなり弱くなります。より識別力のある基準を使用してトピックを形成する監視付きLDAと呼ばれるLDAの変形があり（これのソースはさまざまな場所で入手できます）、最大マージンの定式化が行われた論文もあります。ソースコードごと。分類問題におけるトピックとカテゴリ間の対応について強い仮定をしているので、それがあなたが望むものであることが確実でない限り、私はラベル付きLDA定式化を避けます。

ただし、これらの方法はいずれもトピックモデルを直接使用して分類を行うものではないことを指摘しておく価値があります。代わりに、ドキュメントを取得し、単語ベースの機能を使用する代わりに、トピックの後方（ドキュメントの推論から生じるベクトル）を機能表現として使用してから、分類器（通常は線形SVM）にフィードします。これにより、トピックモデルベースの次元削減に続いて、強力な識別分類器が得られます。これはおそらくあなたが求めているものです。このパイプラインは、一般的なツールキットを使用してほとんどの言語で利用できます。

score 6 · Accepted Answer

Metropolisサンプラーを使用して次のグラフィカルモデルの潜在変数を学習するPyMCを使用して、 監視対象LDAを実装できます。

トレーニングコーパスは、10の映画レビュー（5つのポジティブと5つのネガティブ）と、各ドキュメントに関連する星の評価で構成されています。星による評価は、各ドキュメントに関連付けられた関心のある量である応答変数として知られています。ドキュメントと応答変数は、将来のラベルのないドキュメントの応答変数を最もよく予測する潜在トピックを見つけるために、共同でモデル化されます。詳細については、元の論文をご覧ください。次のコードを検討してください。

import pymc as pm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus =  ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3

#LDA parameters
num_features = 1000  #vocabulary size
num_topics = 4       #fixed for LDA

tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')

#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  #size D x V

print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]

#tf-idf dictionary    
tfidf_dict = tfidf.get_feature_names()

K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents

data = A_tfidf_sp.toarray()

#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]        
alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])    

z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

@pm.deterministic
def zbar(z=z):    
    zbar_list = []
    for i in range(len(z)):
        hist, bin_edges = np.histogram(z[i], bins=K)
        zbar_list.append(hist / float(np.sum(hist)))                
    return pm.Container(zbar_list)

eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                  value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

#visualize topics    
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()

トレーニングデータ（観測された単語と応答変数）が与えられると、各ドキュメント（シータ）のトピックの比率に加えて、応答変数（Y）を予測するためのグローバルトピック（ベータ）と回帰係数（eta）を学習できます。学習したベータとイータを指定してYを予測するために、Yを観測しない新しいモデルを定義し、以前に学習したベータとイータを使用して次の結果を取得できます。

ここでは、1つの文で構成されるテストコーパスの肯定的なレビュー（レビューの評価範囲が-2から2の場合の約2）を予測しました。「これは本当に肯定的なレビューであり、すばらしい映画です」。右。完全な実装については、ipythonノートブックを参照してください。

score 3 · Accepted Answer

はい、 http：//nlp.stanford.edu/software/tmt/tmt-0.4/のスタンフォードパーサーでラベル付きLDAを試すことができます。

machine-learning - 文書分類のための教師あり潜在的ディリクレ割り当て？

3 に答える 3

Related

Reference