
I am using the document similarity as defined here.

My question is: is there a way to sort the numpy.ndarray and get the top-K most similar documents?

Here is the sample code.

from sklearn.feature_extraction.text import TfidfVectorizer

poem = ["All the world's a stage",
"And all the men and women merely players",
"They have their exits and their entrances",
"And one man in his time plays many parts",
"His acts being seven ages. At first, the infant",
"Mewling and puking in the nurse's arms",
"And then the whining school-boy, with his satchel",
"And shining morning face, creeping like snail",
"Unwillingly to school. And then the lover",
"Sighing like furnace, with a woeful ballad",
"Made to his mistress' eyebrow. Then a soldier",
"Full of strange oaths and bearded like the pard",
"Jealous in honour, sudden and quick in quarrel",
"Seeking the bubble reputation",
"Even in the cannon's mouth. And then the justice",
"In fair round belly with good capon lined",
"With eyes severe and beard of formal cut",
"Full of wise saws and modern instances",
"And so he plays his part. The sixth age shifts",
"Into the lean and slipper'd pantaloon",
"With spectacles on nose and pouch on side",
"His youthful hose, well saved, a world too wide",
"For his shrunk shank; and his big manly voice",
"Turning again toward childish treble, pipes",
"And whistles in his sound. Last scene of all",
"That ends this strange eventful history",
"Is second childishness and mere oblivion",
"Sans teeth, sans eyes, sans taste, sans everything"]


vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(poem) 

result = (tfidf * tfidf.T).toarray()  # dense pairwise similarity matrix

print(type(result))

print(result)

1 Answer


Set the diagonal elements to 0, use argsort() to find the top-K indices of the flattened array, and use unravel_index() to convert those 1D indices into 2D indices.

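import numpy as np

result[np.diag_indices_from(result)] = 0.0  # ignore each document's similarity to itself
idx = np.argsort(result, axis=None)[-10:]  # flat indices of the 10 largest values
midx = np.unravel_index(idx, result.shape)  # convert to (row, column) index arrays
print(midx)
print(result[midx])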

Result:

(array([ 8, 14, 1, 0, 11, 17, 8, 10, 6, 8]), array([14, 8, 0, 1, 17, 11, 10, 8, 8, 6]))
[ 0.2329741 0.2329741 0.2379527 0.2379527 0.25723394 0.25723394 0.26570327 0.26570327 0.34954834 0.34954834]
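
Because the similarity matrix is symmetric, every pair shows up twice in the output above (for example 8/14 and 14/8). If that is not wanted, one possible variation is to look only at the upper triangle. The sketch below also shows a per-document lookup, which may be closer to "top-K related documents" for a single query. The helper names `top_k_pairs` and `top_k_for_doc` and the small `sim` matrix are made up for illustration; `sim` stands in for `result`.

```python
import numpy as np

def top_k_pairs(sim, k):
    """Return the k most similar (i, j, score) pairs with i < j, best first."""
    sim = np.asarray(sim, dtype=float)
    rows, cols = np.triu_indices_from(sim, k=1)  # strictly above the diagonal
    scores = sim[rows, cols]
    order = np.argsort(scores)[::-1][:k]  # positions of the k largest scores
    return [(int(rows[n]), int(cols[n]), float(scores[n])) for n in order]

def top_k_for_doc(sim, doc, k):
    """Return the k documents most similar to `doc` as (index, score), best first."""
    row = np.array(sim[doc], dtype=float)  # copy, so `sim` is left untouched
    row[doc] = -np.inf  # exclude the document itself
    order = np.argsort(row)[::-1][:k]
    return [(int(j), float(row[j])) for j in order]

# Hypothetical 4-document similarity matrix standing in for `result`.
sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

pairs = top_k_pairs(sim, 3)
neighbours = top_k_for_doc(sim, 0, 2)
print(pairs)
print(neighbours)
```

Unlike the snippet above, this does not overwrite the diagonal of the matrix, so the original `result` can still be reused afterwards. For a large matrix, `np.argpartition` might be worth considering instead of a full `argsort`.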

answered 2013-06-07T22:53:13.607