python - テキストのセグメンテーション: 入力を辞書の最長の単語と照合するアルゴリズム

Question

各単語が辞書から取得されるように、文字列を単語に分割する必要があります。また、左から可能な限り長い単語が選択されていることを確認してください。したがって

thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong) 
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

文字列の末尾から、可能な限り長い単語に一致する先頭までトラバースすることで、この問題を解決することができました。しかし、問題はこれらの問題のために私たちを切り取り始めました..

shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'.

エラーが発生する前に見つかった有効なセグメントを削除することで、この問題を解決しようとしました。

shareasale => 'share' and 'as' (error = 'ale')

それらを辞書から一度削除してから、問題を解決します。そう

shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.

したがって、私もこの問題を解決することができました。しかし、私はこれを解決することができません

asignas => 'as' ( error = 'ignas')

私の解決策は、辞書から「として」を削除し、それを解決しようとします

asignas => 'a' 'sign' (error = 'as')

新しい再帰呼び出しでは、「as」が辞書から削除されているためです。私が書いた関数はこのリンクにあります。誰かがそれを調べて、これを解決するためのより良いアルゴリズムを見つけるのを手伝ってくれることを願っています。それ以外の場合は、既存のアルゴリズムの変更を提案してください。

score 3 · Accepted Answer

本質的にあなたの問題はツリーの問題であり、すべてのレベルでツリーの接頭辞を形成するすべての単語が分岐します。文字列の一部を残さない枝が正解です。

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

したがって、この例では 2 つのソリューションがありますが、最も長い単語を使用してソリューションを選択する必要があります。つまり、深さ優先探索戦略を使用してツリーを右から探索します。

したがって、アルゴリズムは次のようにする必要があります。

長さの降順に辞書を並べ替えます。
現在の文字列のすべてのプレフィックスを検索します。何もない場合は、を返しFalseます。
prefix最長の未調査のプレフィックスに設定します。
弦から外します。文字列が空の場合、解決策が見つかったので、すべてのプレフィックスのリストを返します。
2 に戻ります。
この分岐は失敗しました。戻りFalseます。

このソリューションの実装例:

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string.replace(pref, '', 1), wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as

score 1 · Accepted Answer

私が使用するアルゴリズムは次のようになります。

長さの降順で辞書を並べ替える
prefix(s)始まる辞書の単語を順番に生成する関数を構築するs
- すなわち。prefix("informativecow")最初に「informative」、次に「inform」、「info」、「in」の順に生成されます
次のような再帰関数test(s)を作成します。
- が空の文字列の場合s、True を返します
- プレフィックスごとに、s からプレフィックスを削除し、test()その文字列で呼び出します。true を見つけたら、true を返します。true が見つからない場合は、False を返します。

python - テキストのセグメンテーション: 入力を辞書の最長の単語と照合するアルゴリズム

4 に答える 4

Related

Reference