python - Python natural language processing for named entities

Question

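I am writing a Python web application that needs to process search queries containing named entities. For example, suppose the search query is "mac os lion" and I have to process it against the candidates available in my database:

Google Android.
Microsoft Windows.
Apple Mac OS X Lion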
...

We all know that the third result is the right one. But is there a way to map the user's query, i.e. "mac os lion", to "Apple Mac OS X Lion" (the entry available in my database)? Can someone tell me what to look for or what to do?

score 3 · Accepted Answer

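You need some kind of normalisation of the user queries, and you need to "learn" a mapping from the normalised queries to the correct "class".

A simple way is to compute the overlap between the "tokens" of the query and the tokens of each "class". The following sample code may help: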

CLASSES = ['Google Android', 'Microsoft Windows', 'Apple Mac OS X Lion']

def classify_query(query_string):
    """
    Computes the most "likely" class for the given query string.

    First normalises the query to lower case, then computes the number of
    overlapping tokens for each of the possible classes.

    The class(es) with the highest overlap are returned as a list.

    """
    query_tokens = query_string.lower().split()
    class_tokens = [[x.lower() for x in c.split()] for c in CLASSES]

    overlap = [0] * len(CLASSES)
    for token in query_tokens:
        for index in range(len(CLASSES)):
            if token in class_tokens[index]:
                overlap[index] += 1

    best_count = max(overlap)
    if best_count == 0:
        # No query token matched any class, so there is no meaningful guess.
        return []

    # Keep every class that ties for the highest overlap.
    return [c for c, count in zip(CLASSES, overlap) if count == best_count]

Example output

classify_query('mac OS x') -> ['Apple Mac OS X Lion']
classify_query('Google') -> ['Google Android']

Of course, this is only a very basic solution. You might want to add some spell checking to make it more robust against typos in the query string...
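
One possible way to get that typo tolerance, sketched with the standard library's difflib instead of a real spell checker: each query token is first snapped to the closest token that occurs in any class name, and the overlap is then counted as before. The function name `fuzzy_classify_query` and the `cutoff` of 0.75 are assumptions chosen for illustration; a real application would need to tune the threshold on its own data.

```python
import difflib

CLASSES = ['Google Android', 'Microsoft Windows', 'Apple Mac OS X Lion']


def fuzzy_classify_query(query_string, cutoff=0.75):
    """Token-overlap classification that tolerates small typos (sketch)."""
    class_tokens = [[t.lower() for t in c.split()] for c in CLASSES]
    # Vocabulary of every token that appears in any class name.
    vocabulary = sorted({t for tokens in class_tokens for t in tokens})

    overlap = [0] * len(CLASSES)
    for token in query_string.lower().split():
        # Snap the token to its closest vocabulary entry, if one is close enough.
        # The cutoff is an assumed value, not a recommendation.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if not matches:
            continue
        for index, tokens in enumerate(class_tokens):
            if matches[0] in tokens:
                overlap[index] += 1

    best_count = max(overlap)
    if best_count == 0:
        return []  # nothing matched, so no guess is returned
    return [c for c, count in zip(CLASSES, overlap) if count == best_count]
```

Under these assumptions, a misspelled query such as `fuzzy_classify_query('gogle')` resolves to `['Google Android']`, while a query with no recognisable token returns an empty list.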

Hope that helps :)

score 1 · Accepted Answer

If you only need to find text similar to the query, you can use a text search engine that has Python bindings, such as Lucene + PyLucene.
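
To give a rough idea of what such an engine does underneath, here is a minimal inverted index in plain Python. This is not Lucene's or PyLucene's API: `build_index` and `search` are hypothetical names, and the ranking is a simple count of matching query tokens rather than a real relevance score.

```python
from collections import Counter, defaultdict

DOCUMENTS = ['Google Android', 'Microsoft Windows', 'Apple Mac OS X Lion']


def build_index(documents):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, documents, query, limit=5):
    """Return documents ranked by how many distinct query tokens they contain."""
    hits = Counter()
    for token in set(query.lower().split()):
        # .get() avoids creating empty entries for unknown tokens.
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return [documents[doc_id] for doc_id, _ in hits.most_common(limit)]


INDEX = build_index(DOCUMENTS)
```

A call such as `search(INDEX, DOCUMENTS, 'mac os lion')` only looks at documents that share at least one token with the query, which is the property that lets a search engine scale to far more entries than a linear scan over every class.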
