python - Finding different realizations of a word in a sentence string - Python

Question

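(This question is about string checking in general rather than natural language processing as such. If you do view it as an NLP problem, imagine the language is one that current analyzers cannot handle. For simplicity, I will use English strings as examples.)

Let's say a word can be realized in only 6 possible forms:

first letter capitalized
plural form with "s"
plural form with "es"
capitalized + "es"
capitalized + "s"
the base form, without plural or capitalization

Say I want to find the index of the first instance of any form of the word coach in a sentence. Is there a simpler way to do this than the following two methods?

Long if condition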

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

Try/except in a loop

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

for i in variations:
  try:
    j = sentence.split(" ").index(i)
    print(j)
  except ValueError:
    continue

score 2 · Accepted Answer

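I recommend having a look at NLTK's stem package: http://nltk.org/api/nltk.stem.html

Using it, you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for e.g. grammatical role, tense, derivational morphology, leaving only the stem of the word."

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't care about NLTK, you should still write your code as a collection of small, easily combined utility functions, for example: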

import string

def variation(stem, word):
    # True if word is the stem, optionally followed by 's' or 'es' (case-insensitive).
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Lazily yield (index, word) for every word that is a form of stem.
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    # Drop ASCII punctuation so that "coach," compares equal to "coach".
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none.
    for i, w  in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))

score 1 · Accepted Answer

Morphology is typically a finite-state phenomenon, so regular expressions are a natural tool for handling it. Build an RE that matches all of the cases with a function like this:

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # The trailing "?" makes the suffix optional so the bare stem also matches.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
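
Since the question asks only for the index of the first matching word, the list comprehension can be swapped for a generator passed to `next()`, which stops at the first hit. The following is a minimal sketch and not part of the original answer: `first_match` is a hypothetical helper name, the stem is assumed to start with a letter, and the pattern treats the suffix as optional so that the bare and capitalized forms listed in the question also match.

```python
import re

def first_match(sentence, stem):
    """Return (index, word) for the first form of stem, or None if absent."""
    # Assumed rules: first letter in either case, optional "s" or "es" suffix.
    pattern = re.compile(
        r"[%s%s]%s(?:e?s)?" % (stem[0].lower(), stem[0].upper(), re.escape(stem[1:]))
    )
    words = sentence.split()
    # next() consumes the generator only until the first full-word match.
    return next(((i, w) for i, w in enumerate(words) if pattern.fullmatch(w)), None)

print(first_match("this is a sentence with the Coaches", "coach"))
# (6, 'Coaches')
```

Returning `None` when nothing matches avoids the try/except around `list.index` used in the question.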

If your inflection rules get more complicated than this, consider using Python's verbose REs.
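
As a rough illustration of what a verbose RE might look like, the sketch below spells out the six forms from the question using `re.VERBOSE`, which lets the pattern carry whitespace and inline comments. The name `inflect_verbose` is hypothetical, and the rules encoded are only the ones listed in the question (optional capital, optional "s" or "es").

```python
import re

def inflect_verbose(stem):
    """Build a commented pattern matching the six forms of stem (a sketch)."""
    pat = r"""
        ^
        [%s%s]        # first letter, lowercase or capitalized
        %s            # rest of the stem, escaped
        (?: e? s )?   # optional plural suffix: "s" or "es"
        $
    """ % (stem[0].lower(), stem[0].upper(), re.escape(stem[1:]))
    # re.VERBOSE ignores the layout whitespace and the "#" comments above.
    return re.compile(pat, re.VERBOSE)

sentence = "this is a sentence with the Coaches"
pattern = inflect_verbose("coach")
print([(i, w) for i, w in enumerate(sentence.split()) if pattern.match(w)])
# [(6, 'Coaches')]
```

Each additional rule can then be added on its own commented line rather than being packed into a single dense string.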
