python - Some words get merged when tokenizing a string

Question

I use the following code to tokenize a string read from standard input.

import sys

d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
            cur = ''
    else:
        cur = cur + i.lower()

This gives me an array of the words with no repeats. In the output, however, some words are not split.

My input is

Dan went to the north pole to lead an expedition during summer.

The output array d is

['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']

Why is tolead joined together?

score 3 · Accepted Answer

Try this:

import sys

d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
        cur = '' # note the different indentation
    else:
        cur = cur + i.lower()

score 1 · Accepted Answer

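"to" is already in d. So your loop passes the space between "to" and "lead" without resetting cur, and keeps concatenating. When it reaches the next space, it sees that "tolead" is not in d, so it appends it.

A simpler solution, which also strips all forms of punctuation: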
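To see the effect described above, here is a minimal sketch that runs the same loop twice over the sentence from the question. The function name tokenize and the reset_always flag are made up for illustration; they only switch between the original placement of the reset and the corrected one.

```python
def tokenize(text, reset_always):
    """Collect unique lowercase words; reset_always toggles the fix."""
    d = []
    cur = ''
    for ch in text:
        if ch in ' .':
            if cur not in d and cur != '':
                d.append(cur)
                if not reset_always:
                    cur = ''  # original placement: reset only after an append
            if reset_always:
                cur = ''      # corrected placement: reset at every delimiter
        else:
            cur = cur + ch.lower()
    return d

sentence = "Dan went to the north pole to lead an expedition during summer."
buggy = tokenize(sentence, reset_always=False)
fixed = tokenize(sentence, reset_always=True)
print(buggy)  # contains 'tolead'
print(fixed)  # contains 'lead' as its own word
```

With the original placement, the second "to" is a repeat, so nothing is appended and cur still holds 'to' when "lead" starts.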

>>> import string
>>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split())
set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the'])
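The call translate(None, string.punctuation) above is the Python 2 form. In Python 3, str.translate takes a single table argument, so a rough equivalent would look like the sketch below. The variable names are just for illustration, and the dict.fromkeys line is an optional extra, assuming you want to keep the words in their original order (dicts preserve insertion order from Python 3.7 on).

```python
import string

sentence = "Dan went to the north pole to lead an expedition during summer."

# str.maketrans('', '', chars) builds a table that deletes every character in chars
table = str.maketrans('', '', string.punctuation)
words = sentence.translate(table).lower().split()

unique = set(words)                          # unordered, like the set above
ordered_unique = list(dict.fromkeys(words))  # first occurrence order kept

print(ordered_unique)
```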
