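This is my first attempt at natural language processing, so I started with latent semantic analysis and built the algorithm by following this tutorial. After testing it, I found that it only separates out the first semantic terms and then repeats the same terms over and over for the other documents.
I also tried feeding it the documents found HERE, and the result is exactly the same: the values of one topic are repeated several times in the other topics.
Can anyone explain what is happening? I have been searching for a long time, and everything seems to be exactly the same as in the tutorials.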
from sklearn.decomposition import TruncatedSVD
#tfdv is assumed to be an alias for TfidfVectorizer (the original import was not shown).
from sklearn.feature_extraction.text import TfidfVectorizer as tfdv

testDocs = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''
#First we apply the standard SKLearn algorithm to compare with.
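#tokens.append(tokenizer.tokenize(element.lower()))
#Lowercase every document. Assigning to the loop variable would leave testDocs unchanged, so build a new list.
testDocs = [element.lower() for element in testDocs]
print(testDocs)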
#Vectorize the features.
vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)#, ngram_range=(1,3))
#Store the values in matrix X.
X = vectorizer.fit_transform(testDocs)
#Apply LSA.
lsa = TruncatedSVD(n_components=3, n_iter=100)
lsa.fit(X)
#Get a list of the terms in the order it was decomposed.
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = vectorizer.get_feature_names_out()
print("Terms decomposed from the document: " + str(terms))
print()
#Prints the matrix of concepts. Each number represents how important the term is to the concept and the position relates to the position of the term.
print("Number of components in element 0 of matrix of components:")
print(lsa.components_[0])
print("Shape: " + str(lsa.components_.shape))
print()
for i, comp in enumerate(lsa.components_):
    #Stick each of the terms to the respective components. Zip command creates a tuple from 2 components.
    termsInComp = zip(terms, comp)
    #Sort the terms according to...
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
    #Use % so the concept number is actually substituted into the string.
    print("Concept %d" % i)
    for term in sortedTerms:
        print(term[0], end="\t")
    print()
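
For reference while debugging: each row of lsa.components_ holds one weight for every term, so the loop above lists the whole vocabulary under every concept and only the ordering changes. A minimal sketch of a helper that keeps just the strongest few terms per concept, together with their weights, may make the comparison easier. The name top_terms and the demo weights below are my own assumptions for illustration, not output from the code above. Ranking by absolute value is also a choice on my part, since the sign of an SVD component is arbitrary.

```python
def top_terms(terms, components, top_n=3):
    """For each component, return the top_n (term, weight) pairs ranked by absolute weight."""
    result = []
    for comp in components:
        # Each component has exactly one weight per term, so pair them up first.
        pairs = list(zip(terms, comp))
        # Rank by magnitude so that strongly negative weights are not pushed to the end.
        pairs.sort(key=lambda p: abs(p[1]), reverse=True)
        result.append(pairs[:top_n])
    return result


# Made-up weights for illustration only; they are NOT results from the model above.
demo_terms = ["book", "dummies", "estate", "market"]
demo_components = [
    [0.5, 0.4, 0.6, 0.3],
    [-0.1, 0.7, -0.6, 0.2],
]

for i, concept in enumerate(top_terms(demo_terms, demo_components, top_n=2)):
    print("Concept %d: %s" % (i, ", ".join("%s (%+.2f)" % (t, w) for t, w in concept)))
```

With the fitted model from the question, the same helper should accept the real values directly, e.g. top_terms(terms, lsa.components_).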