python - Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

Question

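I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a vanilla version of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?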

score 54 · Accepted Answer

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For the cosine_similarity:


import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:


from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
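
As a rough illustration of how the two pieces above fit together, here is a minimal sketch that counts word n-grams and compares the raw counts with cosine similarity. It uses only the standard library. The helper names (word_ngrams, cosine_similarity, ngram_similarity) and the whitespace tokenization are assumptions made for this example, not part of NLTK, and no tf-idf weighting is applied yet.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))]


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def ngram_similarity(text_a, text_b, n=2):
    """Similarity in [0, 1] based on raw n-gram counts (no tf-idf weighting)."""
    return cosine_similarity(Counter(word_ngrams(text_a, n)),
                             Counter(word_ngrams(text_b, n)))
```

Because the counts are never negative, the score stays between 0 and 1: identical texts score (up to floating-point rounding) 1, and texts with no n-gram in common score 0.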

For tf-idf, you will have to compute the distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term

If you like pylucene, this will show you how to compute tf.idf:

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t,f,x) in zip(terms,frequencies,xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k,v in rec.items():
                    rec[k] = v / denom
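
For readers without a Lucene index, the same idea (tf times idf, then cosine normalization) can be sketched in plain Python. This is only an approximation of the snippet above: the weighting used here, raw term frequency times log(N / df), is an assumption and differs from Lucene's own sim.tf and sim.idf formulas, and the function names tfidf_vectors and similarity are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw tf times log(N / df); other variants exist.
        vec = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Cosine normalization, as in the Lucene snippet above.
        denom = math.sqrt(sum(w * w for w in vec.values()))
        if denom > 0:
            vec = {t: w / denom for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def similarity(vec_a, vec_b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
```

The tokens can be words or the tuples produced by an n-gram function. Note that with this particular idf, a term that occurs in every document gets weight 0, so it does not contribute to the score.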

score 28 · Accepted Answer

If you are interested, I have written a tutorial series (Part I and Part II) that explains tf-idf and uses the Scikits.learn (sklearn) Python module.

Part 3 covers cosine similarity.
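
The tutorial code itself is not reproduced here, but a minimal sketch of the sklearn route might look like the following. It assumes scikit-learn is installed; the sample documents are hypothetical stand-ins for rows loaded from the DB, and ngram_range=(1, 2) is just one possible choice of gram sizes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents standing in for rows loaded from the DB.
documents = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over a sleepy dog",
    "tf-idf weights terms by how rare they are",
]

# ngram_range=(1, 2) uses word unigrams and bigrams; adjust it to choose the gram sizes.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(documents)  # sparse matrix, one row per document

# Pairwise cosine similarities; entry [i, j] compares documents i and j.
scores = cosine_similarity(tfidf)
```

Since tf-idf weights are non-negative, every entry of scores lies between 0 and 1, which matches the kind of score the question asks for.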

score 4 · Accepted Answer

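If you are still interested in this problem, I have done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter: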

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To convert a list of ngrams to a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

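# minlength and maxlength are the n-gram size bounds that NGramAnalyzer.__init__ requires
wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)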
wr.addDocument(doc)

Building queries is a bit more difficult, since Lucene's QueryParser expects a query language with special operators, quotes, etc., but this can be circumvented (as partially explained here).

score 3 · Accepted Answer

In our Information Retrieval course, we use some code written by our professor in Java. Sorry, there is no Python port. "It's being released for educational and research purposes only under the GNU General Public License."

You can check the documentation at http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, check out http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/
