python - Recursively split strings that contain a defined set of prefixes - Python

Question

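Given a list of prefixes that can be attached to a string, how can I split such a string into its prefixes and the remaining characters of the substring that follows? For example: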

prefixes = ['over','under','re','un','co']

str1 = "overachieve"
output: ["over","achieve"]

str2 = "reundo"
output = ["re","un","do"]

Is there a better way to do this, perhaps with a regex or some string functions, other than the following:

str1 = "reundo"
output = []

for x in [p for p in prefixes if p in str1]:
    output.append(x)    
    str1 =  str1.replace(x,"",1)
output.append(str1)

score 5 · Accepted Answer

Regular expressions are an efficient way to search for many alternative prefixes.

import re

def split_prefixes(word, prefixes):
    regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True)))
    result = []
    i = 0
    while True:
        mo = regex.match(word, i)
        if mo is None:
            result.append(word[i:])
            return result
        result.append(mo.group())
        i = mo.end()


>>> prefixes = ['over', 'under', 're', 'un', 'co']
>>> for word in ['overachieve', 'reundo', 'empire', 'coprocessor']:
...     print(word, '-->', split_prefixes(word, prefixes))

overachieve --> ['over', 'achieve']
reundo --> ['re', 'un', 'do']
empire --> ['empire']
coprocessor --> ['co', 'processor']
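The sort by length in the regex matters because Python's alternation takes the first alternative that matches, not the longest one. A minimal sketch, assuming a hypothetical prefix list in which 'un' is listed before 'under':

```python
import re

# Hypothetical prefix list where a short prefix precedes a longer one
prefixes = ['un', 'under']

unsorted_regex = re.compile('|'.join(prefixes))
sorted_regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True)))

# Alternation picks the first alternative that matches at this position
first_unsorted = unsorted_regex.match('underachieve').group()
first_sorted = sorted_regex.match('underachieve').group()

print(first_unsorted)  # un
print(first_sorted)    # under
```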

score 1 · Accepted Answer

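With the "two problems" proverb in mind, I'd still say this is a job for a regular expression. All the alternatives are compiled into a single pattern, so one match call inside the regex engine tests every prefix, instead of a Python-level loop testing them one at a time.

Here is an implementation that takes advantage of this: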

import re

def split_string(string, prefixes):
    regex = re.compile('|'.join(map(re.escape, prefixes))) # (1)
    while True:
        match = regex.match(string)
        if not match:
            break
        end = match.end()
        yield string[:end]
        string = string[end:]
    if string:
        yield string # (2)

prefixes = ['over','under','re','un','co']
assert (list(split_string('recouncoundo',prefixes))
        == ['re','co','un','co','un','do'])

Note how the regex is constructed in (1):

The prefixes are escaped with re.escape so that special characters don't interfere.
The escaped prefixes are joined with the | (or) regex operator.
The whole thing is compiled.
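To illustrate why the escaping step is there, here is a small sketch. The prefix 'co.' is a made-up example containing a regex metacharacter and is not part of the original prefix list:

```python
import re

# Hypothetical prefix containing a regex metacharacter (the dot)
prefixes = ['co.']

raw = re.compile('|'.join(prefixes))
escaped = re.compile('|'.join(map(re.escape, prefixes)))

raw_match = raw.match('coworker')          # '.' matches any character, here 'w'
escaped_match = escaped.match('coworker')  # requires a literal '.', so no match
literal_match = escaped.match('co.op')     # the literal prefix is still found

print(raw_match.group())      # cow
print(escaped_match)          # None
print(literal_match.group())  # co.
```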

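Line (2) yields the final word, if anything is left after splitting off the prefixes. You may want to remove the if string check if you want the function to yield an empty string when nothing remains after prefix stripping.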

Also note that re.match (unlike re.search) only looks for the pattern at the beginning of the input string, so there is no need to add ^ to the regex.
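The difference between match and search can be checked with a short sketch. The sample strings here are chosen only for illustration:

```python
import re

regex = re.compile('re|un')

at_start = regex.match('undo')          # pattern occurs at position 0
not_at_start = regex.match('do-re-mi')  # 're' occurs later, so match() fails
anywhere = regex.search('do-re-mi')     # search() scans the whole string

print(at_start.group())  # un
print(not_at_start)      # None
print(anywhere.start())  # 3
```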

score 1 · Accepted Answer

Use the str.startswith method:

for p in prefixes:
    if str1.startswith(p):
        output.append(p)
        str1 = str1.replace(p, '', 1)
output.append(str1)

The biggest flaw in your code is that a string like 'found' would output ['un', 'fod'].

However, for a hypothetical string like 'reuncoundo', you will need to iterate over the list more than once.

while True:
    if not any(str1.startswith(i) for i in prefixes):
        output.append(str1)
        break
    for p in prefixes:
        if str1.startswith(p):
            output.append(p)
            str1 = str1.replace(p, '', 1)

This outputs ['re', 'un', 'co', 'un', 'do'].
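For reuse, the same idea can be wrapped in a function. This is only a sketch: the name strip_prefixes is hypothetical, slicing replaces the replace call, and a for/else restarts the scan after each hit instead of continuing through the list, which gives the same result for the examples here.

```python
def strip_prefixes(word, prefixes):
    output = []
    while True:
        for p in prefixes:
            # Skip empty prefixes, which would otherwise loop forever
            if p and word.startswith(p):
                output.append(p)
                word = word[len(p):]  # slice off the matched prefix
                break
        else:
            # No prefix matched: whatever remains is the final piece
            output.append(word)
            return output

prefixes = ['over', 'under', 're', 'un', 'co']
print(strip_prefixes('reuncoundo', prefixes))  # ['re', 'un', 'co', 'un', 'do']
print(strip_prefixes('found', prefixes))       # ['found']
```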

score 1 · Accepted Answer

prefixes = ['over','under','re','un','co']

def test(string, prefixes, existing=None):
    prefixes.sort(key = lambda s: len(s))
    prefixes.reverse() # This and the previous line ensure that longer prefixes are searched first regardless of initial sorting.
    if existing is None:
        existing = [] # deals with the fact that placing [] as a default parameter and modifying it modifies it for the entire session
    for prefix in prefixes:
        if string.startswith(prefix):
            existing.append(prefix)
            return test(string[len(prefix):], prefixes, existing)
    existing.append(string)
    return existing

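This code recursively walks the string, stripping known prefixes until none remain, and then returns the whole list. For longer strings a generator is probably the better approach, but for short strings this may be the better solution, since it avoids the overhead of a generator.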
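A generator variant along those lines might look like the sketch below. The name iter_prefixes is hypothetical, and unlike the function above it sorts a copy of the list, so the caller's prefixes are not reordered:

```python
def iter_prefixes(string, prefixes):
    # Sort a copy, longest first, so the caller's list is left untouched
    ordered = sorted(prefixes, key=len, reverse=True)
    while True:
        for prefix in ordered:
            # Skip empty prefixes, which would otherwise loop forever
            if prefix and string.startswith(prefix):
                yield prefix
                string = string[len(prefix):]
                break
        else:
            # No prefix matched: yield the remainder (possibly empty) and stop
            yield string
            return

prefixes = ['over', 'under', 're', 'un', 'co']
print(list(iter_prefixes('reundo', prefixes)))        # ['re', 'un', 'do']
print(list(iter_prefixes('underachieve', prefixes)))  # ['under', 'achieve']
```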

score 1 · Accepted Answer

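If you are dealing with prefixes, you don't need a regex; all you need is startswith(). Of course you can use a regex, but it is harder to read and maintain, even for something as simple as this. startswith() is simpler, in my opinion.

Also, the other answers seem too complicated for such a simple problem. I'd suggest a recursive function like this: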

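def split_prefixes(word, prefixes):
    # Collect every prefix the word starts with
    split = [p for p in prefixes if word.startswith(p)]
    if split:
        # Keep only the first match (e.g. 'under', not also 'un'), then recurse
        return split[:1] + split_prefixes(word[len(split[0]):], prefixes)
    else:
        return [word]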

Here are the results:

"overachieve" -> ['over', 'achieve']
"reundo" -> ['re', 'un', 'do']
"reuncoundo" -> ['re', 'un', 'co', 'un', 'do']
"empire" -> ['empire']
