python - Removing stopwords and punctuation

Question

I'm struggling with NLTK stopwords.

Here is my code. Can someone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''

score 28 · Accepted Answer

Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

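You need to iterate over the words and check each one. Fortunately, a split function already exists in the Python standard library, as the string method str.split. However, if you are dealing with natural language that includes punctuation, you should look for a more robust approach that uses the re module.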
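As a rough illustration of the difference (a minimal sketch using only the standard library; the sample sentence is made up for this example), str.split breaks on whitespace but leaves punctuation attached to the words, while re.findall with \w+ keeps only runs of word characters:

```python
import re

# Sample sentence invented for this illustration
text = "Buenos dias, amigo."

# Iterating over the string itself yields single characters, not words
chars = [c for c in text]

# str.split() breaks on whitespace; punctuation stays attached to the words
split_words = text.split()

# re.findall(r'\w+') keeps only runs of letters, digits and underscores
regex_words = re.findall(r'\w+', text)

print(split_words)  # ['Buenos', 'dias,', 'amigo.']
print(regex_words)  # ['Buenos', 'dias', 'amigo']
```

With the split version, 'dias,' would never match a stop list entry such as 'dias', which is why the regex (or a real tokenizer) tends to be the safer choice for natural-language text.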

Once you have a list of words, lowercase them all before comparing, and then compare them the way you have already shown.
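To see why lowercasing matters, here is a small sketch. It assumes the stop list is all lowercase; the hand-written set below is a hypothetical stand-in for stopwords.words('spanish'), so that the example runs without NLTK data:

```python
# Hypothetical mini stop list, standing in for stopwords.words('spanish')
stop_words = {"el", "la", "de", "que", "y"}

words = ["El", "problema", "de", "la", "casa"]

# Without lowercasing, the capitalised "El" slips through the filter
naive = [w for w in words if w not in stop_words]

# Lowercase each word before the membership test
filtered = [w for w in words if w.lower() not in stop_words]

print(naive)     # ['El', 'problema', 'casa']
print(filtered)  # ['problema', 'casa']
```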

Buena suerte.

Edit 1

Try this code; it should work. It shows two ways of doing it. They are essentially the same, but the first is a little clearer and the second is more pythonic.

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words.
# In Python 3, \w already matches Unicode word characters for str patterns,
# and re.LOCALE is only valid with bytes patterns, so no flags are passed.
words = re.findall(r'\w+', sentence)

# Load the stop list once, as a set, instead of on every iteration
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more pythonic way
important_words = list(filter(lambda x: x not in spanish_stopwords, words))

print(important_words)

I hope this helps.

score 4 · Accepted Answer

Use a tokenizer first and compare the list of tokens (symbols) against the stop list, so you don't need the re module. I added an extra argument so you can switch languages.

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [ token for token in nltk.word_tokenize(sentence) if token.lower() not in stopwords.words(language) ]
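Note that a tokenizer usually returns punctuation marks as tokens of their own, so a stop-list filter alone tends to leave them in. The sketch below shows one possible way to drop them as well. The token list is hand-written (an assumption standing in for tokenizer output), the stop list is a hypothetical stand-in for stopwords.words(language), and drop_stopwords_and_punctuation is a name made up for this example:

```python
# Hand-written token list, an assumption standing in for tokenizer output
tokens = ["El", "amor", ",", "y", "el", "desayuno", "."]

# Hypothetical mini stop list, standing in for stopwords.words(language)
stop_words = {"el", "y"}

def drop_stopwords_and_punctuation(tokens, stop_words):
    # Keep tokens that contain at least one alphanumeric character
    # and whose lowercase form is not in the stop list
    return [t for t in tokens
            if any(ch.isalnum() for ch in t) and t.lower() not in stop_words]

result = drop_stopwords_and_punctuation(tokens, stop_words)
print(result)  # ['amor', 'desayuno']
```

Building the stop list once as a set, as done here, also avoids reloading it for every token.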

Dime si te fue de utilidad ;)
