python - テキスト内のキーワードの正確なヒットセット

Question

テキストにはいくつかのキーワードがあり、それらの出現の開始/終了位置があります。キーワードは部分的に重複する場合があります（例： "something"-> "something" / "some" / "thing"：

keywords_occurences = {
    "key_1": [(11, 59)],
    "key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
    "key_3": [(24, 56), (1208, 1240)],
    ...
}

キーワードごとに1つの位置を選択して、重複しないようにする必要があります。この場合の解決策は次のとおりです。

key_1: 11-59
key_2: 301-323     (or 1673-1695, it does not matter)
key_3: 1208-1240

これができない場合は、重複しない一意のキーの最大数を選択してください。

一種の「正確な打撃セット」の問題のように見えますが、アルゴリズムの説明が見つかりません。

score 1 · Accepted Answer

次のコードはあなたが望むことをしていると思います。

#!/usr/bin/env python

# keyword occurrences -> [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
kw_all_occ = {"key_1" : [(11, 59)],
"key_2" : [(24, 56), (301, 333), (1208, 1240), (1673, 1705)],
"key_3" : [(24, 46), (1208, 1230)]}

def non_overlapping_occ(occ):
    # dictionary with all keyword occurrences
    all_occ = dict({})
    all_occ.update(occ)

    # list with the first non overlapping occurrences of every keyword -> 
    # [('key_1', (start_1, end_1)),('key_2', (start_2, end_2)),...]
    result = []

    # Sort keywords by length -> [(22, 'key_3'), (32, 'key_2'), (48, 'key_1')]
    kw_lengths = []
    for k, v in all_occ.iteritems():
        kw_lengths.append((v[0][1] - v[0][0], k))
    kw_lengths.sort()

    while len(kw_lengths):
        # Current longest keyword
        longest_keyword = kw_lengths.pop(-1)[1]
        try:
            result.append((longest_keyword, all_occ[longest_keyword][0]))
            # Remove occurrences overlapping any occurrence of the current
            # longest_keyword value
            for item in all_occ[longest_keyword]:
                start = item[0]
                end = item[1]
                for l, k in kw_lengths:
                    v = all_occ[k]
                    all_occ[k] = filter(lambda x: (x[0] > end) | (x[1] < start), v)

        except IndexError:
            result.append((longest_keyword, ()))

    return result

print non_overlapping_occ(kw_all_occ)

次の出力が生成されます。

vicent@deckard:~$ python prova.py 
[('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]

コードでセットを使用していないことに注意してください。あなたの質問は、セットが問題の解決に役立つ可能性があることを示唆しているだけなので、セットの使用は必須ではないことを理解しています。

また、コードは十分にテストされていませんが、問題なく機能しているようです（質問で提供されたキーワードの出現も適切に処理します。実際、これらの出現は、より単純ですが一般的ではないコードで解決できます）。

python - テキスト内のキーワードの正確なヒットセット

1 に答える 1

Related

Reference