nlp - nltk stemmer: string index out of range

Question

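I am stemming text documents with nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view.

However, when I stem the documents inside the Django view, PorterStemmer().stem() raises IndexError: string index out of range for the string 'oed'. As a result, running the following: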

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')  # raises IndexError inside the view
    return render(request, 'list.html')

raises the error mentioned above:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

What is really odd is that running the same stemmer on the same string outside of Django (whether in a separate Python file or in an interactive Python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What could be causing this issue?

score 31 · Accepted Answer

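This is an NLTK bug specific to NLTK version 3.2.2, and I am the one to blame for it. It was introduced by PR https://github.com/nltk/nltk/pull/1261, which rewrote the Porter stemmer.

I wrote a fix that was released in NLTK 3.2.3. If you are on version 3.2.2 and want the fix, just upgrade: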

pip install -U nltk
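
To confirm whether an installation is the affected release before upgrading, a check along these lines may help. This is only a sketch: the helper name is_affected_nltk is hypothetical, and it assumes the bug is confined to exactly 3.2.2, as stated above.

```python
def is_affected_nltk(version):
    """Return True if the version string names NLTK 3.2.2 exactly.

    Hypothetical helper; assumes only release 3.2.2 carries the bug.
    """
    parts = version.strip().split(".")
    return parts[:3] == ["3", "2", "2"]


if __name__ == "__main__":
    try:
        import nltk
    except ImportError:
        print("nltk is not installed")
    else:
        # nltk.__version__ holds the installed release as a string.
        print(nltk.__version__, is_affected_nltk(nltk.__version__))
```

If the check reports True, the upgrade command above should pull in a release that contains the fix.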

score 3 · Answer

I debugged the nltk.stem.porter module using pdb. After a few iterations, inside _apply_rule_list() you end up with:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

At this point the _ends_double_consonant() method tries to evaluate word[-1] == word[-2] and fails, because the one-character string u'o' has no index -2.
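
The failure can be reproduced without NLTK, since it comes down to negative indexing on a one-character string. A minimal illustration, using the value of word from the debugger session above (the helper name compare_last_two is hypothetical and only mimics the unguarded comparison):

```python
def compare_last_two(word):
    """Mimic the unguarded comparison word[-1] == word[-2]."""
    return word[-1] == word[-2]


for candidate in ("o", "add", "oed"):
    try:
        print(candidate, compare_last_two(candidate))
    except IndexError as exc:
        # A one-character string has no index -2.
        print(candidate, "IndexError:", exc)
```

Only the one-character input raises; longer strings are compared normally.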

If I'm not mistaken, in NLTK 3.2 the corresponding method looked like this:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):      
        return False        
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing from the new version.

Changing _ends_double_consonant() to something like this should work:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

I have proposed this change in the related NLTK issue.
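
The effect of the length guard can be checked in isolation. The following is a standalone sketch, not NLTK's code: is_consonant is a simplified stand-in for _is_consonant() that assumes lowercase ASCII input and treats y as a consonant unless it follows a consonant, in the spirit of the Porter paper.

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    """Simplified stand-in for the stemmer's consonant test (assumption)."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # 'y' counts as a consonant at the start or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(word):
    """Condition *d: True if word ends with a double consonant."""
    if len(word) < 2:
        # Too short to have two trailing letters; avoids the IndexError.
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


for candidate in ("", "o", "add", "see"):
    print(repr(candidate), ends_double_consonant(candidate))
```

With the guard in place, empty and one-character inputs simply return False instead of raising.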
