python - Search a list based on the values in another list

Question

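I have a list of names that I'm trying to pull out of a list of strings. I keep getting false positives such as partial matches. The other caveat is that I'd also like to grab the last name where applicable.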

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

I tried this code:

[i for e in names for i in target if i.startswith(e)]

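As expected, this returns Chris Smith, Christmas is here, and Kimberly.

What is the best way to approach this? Should I use a regex, or can it be done with a list comprehension? Performance may be an issue, since the real list of names is up to 880,000 names long.

(Python 2.7)

Edit: I've realized that my criteria in this example are unrealistic, given the impossible request of including Kimberly while excluding Christmas. To mitigate this, I found a more complete list of names that includes variations (both Kim and Kimberly are included).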
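Once the name list contains the variations explicitly, one possible approach is an exact, case-insensitive lookup of each target's first word in a set. The following is only a minimal sketch under that assumption: `filter_by_first_word` is a hypothetical helper, and the extra `'Kimberly'` entry stands in for the fuller name list mentioned in the edit. A set lookup is constant time on average, which may help with a list of several hundred thousand names, though this has not been timed here.

```python
def first_word(text):
    """Return the first whitespace-separated word in lowercase, or '' if none."""
    parts = text.split()
    return parts[0].lower() if parts else ''


def filter_by_first_word(names, targets):
    """Keep targets whose first word is exactly one of the names (ignoring case)."""
    name_set = set(n.lower() for n in names)  # built once, reused for every target
    return [t for t in targets if first_word(t) in name_set]


# Assumption: the fuller list contains both 'Kim' and 'Kimberly'.
names = ['Chris', 'Jack', 'Kim', 'Kimberly']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

print(filter_by_first_word(names, target))
```

Because the whole target string is kept, the last name comes along with the match when it is present.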

score 1 · Accepted Answer

A complete guess (again), since I can't see how you could exclude `Christmas is here` given no reasonable criteria:

This matches any target that contains a word starting with one of the names...

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

import re
matches = [targ for targ in target if any(re.search(r'\b{}'.format(name), targ, re.I) for name in names)]
print(matches)
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']

If you change it to `r'\b{}\b'`, then you'll get ['Chris Smith', 'CHRIS'], so you lose Kim…
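The two patterns can be compared side by side. This is a rough sketch rather than the answerer's code: `find_matches` and its `whole_word` flag are hypothetical names, the names are joined into one alternation so each target is scanned once, and `re.escape` is added on the assumption that real names may contain characters that are special in a regex. How well a single alternation behaves with hundreds of thousands of names has not been measured here.

```python
import re


def find_matches(names, targets, whole_word=False):
    """Return targets containing a word that starts with (or equals) a name."""
    # Escape each name so characters such as '.' are matched literally.
    alternation = '|'.join(re.escape(n) for n in names)
    # A trailing \b turns the prefix match into a whole-word match.
    suffix = r'\b' if whole_word else ''
    pattern = re.compile(r'\b(?:{}){}'.format(alternation, suffix), re.I)
    return [t for t in targets if pattern.search(t)]


names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

print(find_matches(names, target))
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
print(find_matches(names, target, whole_word=True))
# ['Chris Smith', 'CHRIS']
```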

score 0 · Accepted Answer

Does this work?

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']

res = []
for tof in target:
    for name in names:
        if tof.lower().startswith(name.lower()):
            res.append(tof)
            break
print(res)
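The same prefix test can be written more compactly, because `str.startswith` also accepts a tuple of prefixes. The sketch below is an alternative phrasing of the loop above, with `starts_with_any_name` as a hypothetical helper name. Like the loop, it still keeps `Christmas is here`.

```python
def starts_with_any_name(names, targets):
    """Return targets that begin with any of the names, ignoring case."""
    # startswith takes a tuple, so the inner loop over names is not needed.
    prefixes = tuple(n.lower() for n in names)
    return [t for t in targets if t.lower().startswith(prefixes)]


names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

print(starts_with_any_name(names, target))
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
```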

score 0 · Accepted Answer

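There is no deterministic way to drop the `Christmas is here` match, because the system may not be able to tell whether Christmas is a name or something else. Instead, if you want to speed up the process, you can try this O(n) approach. I haven't timed it, but it is definitely faster than yours or the proposed solutions.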

from difflib import SequenceMatcher
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']
def foo(names, target):
    # Create a generator to search the names
    def bar(names, target):
        # which, for each target,
        for t in target:
            # takes the first matching block, a triple (i, j, n) meaning a[i:i+n] == b[j:j+n]
            match = SequenceMatcher(None, names, t).get_matching_blocks()[0]
            # match.size == 0 means no match,
            # and match.b > 0 means the match does not happen at the start
            if match.size > 0 and match.b == 0:
                # and yields the matching target
                yield t
    # Join the names to create a single string
    names = ','.join(names)
    # then call the generator and return its results as a list
    return list(bar(names, target))

>>> foo(names, target)
['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']
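To see what the generator is testing, it can help to look at the first matching block directly. The sketch below only inspects the `Match(a, b, size)` tuples that `difflib` returns for the example data; `first_block` is a hypothetical helper name. Note that `SequenceMatcher` compares characters case-sensitively, so in this example `CHRIS` passes the test through its leading `C` alone.

```python
from difflib import SequenceMatcher


def first_block(joined_names, text):
    """Return the first matching block between the joined names and the text."""
    return SequenceMatcher(None, joined_names, text).get_matching_blocks()[0]


joined = ','.join(['Chris', 'Jack', 'Kim'])  # 'Chris,Jack,Kim'

kim = first_block(joined, 'Kimberly')
print(kim)      # 'Kim' is found at the start of the target (b == 0)

xmas = first_block(joined, 'Christmas is here')
print(xmas)     # 'Chris' is also found at the start, hence the false positive

upper = first_block(joined, 'CHRIS')
print(upper)    # only the single character 'C' lines up at position 0
```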

score 0 · Accepted Answer

According to your description, the rules are:

Matching is case-insensitive.
The target word must start with the keyword.
If the target word is not exactly the keyword, it must be the only word in the sentence.

Try this:

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']
desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']

actual_output = []
for key in names:
    for words in target:
        for word in words.split():
            if key.lower() == word.lower():
                actual_output.append(words)
            elif key.lower() == word.lower()[:len(key)] and len(words.split()) == 1:
                actual_output.append(words)
print(actual_output)

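This prints the same entries as the desired output, though in name order, `['Chris Smith', 'CHRIS', 'Kimberly']`, because the outer loop runs over the names (by the way, are you sure this is what you want?). Don't be put off by the three nested loops. If you have N names and M sentences, and each sentence has a limited number of words, the complexity of this code is at most O(MN).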
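If the order of the targets should be preserved and each sentence reported at most once, the same three rules can be applied with the sentences in the outer loop. This is a sketch of one possible rearrangement, not the answerer's code; `match_by_rules` is a hypothetical name. The exact-word rule uses a set lookup, while the prefix rule still checks every name.

```python
def match_by_rules(names, targets):
    """Apply the three rules above, keeping the targets in their original order."""
    keys = [n.lower() for n in names]
    key_set = set(keys)
    result = []
    for sentence in targets:
        words = sentence.lower().split()
        if any(w in key_set for w in words):
            # Some word in the sentence is exactly a name (ignoring case).
            result.append(sentence)
        elif len(words) == 1 and any(words[0].startswith(k) for k in keys):
            # A single-word sentence that merely starts with a name.
            result.append(sentence)
    return result


names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly',
          'Christmas is here', 'CHRIS']

print(match_by_rules(names, target))
# ['Chris Smith', 'Kimberly', 'CHRIS']
```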
