spacy - Multi-word expression recognition in spaCy

Question

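I have a text along with its index entries, some of which indicate important multi-word expressions (MWEs) that occur in the text (for example, "spongy bone" in a biology text). I want to use these entries to build a custom matcher in spaCy so that I can recognize occurrences of the MWEs in the text. An additional requirement is that the matched occurrences must preserve the lemmatized forms and POS tags of the MWE's constituent words.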

I have looked at existing spaCy examples that do something similar, but I can't seem to get the patterns right.

score -1 · Accepted Answer

The spaCy documentation is not very clear about using the Matcher class with multi-word phrases, but there is a multi-phrase matching example in the GitHub repository.

I recently faced the same challenge and got it working as shown below. My text file contains one record per line, with the phrase and its description separated by '::'.

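import io

import spacy
from spacy.matcher import PhraseMatcher

# spaCy v3 removed the 'en' shortcut; load a named English pipeline instead
nlp = spacy.load('en_core_web_sm')
text = nlp(u'Your text here')

# Create a list of (phrase, description) tuples from the file, skipping blank lines
with io.open('textfile', 'r', encoding='utf8') as doc:
    rules = [tuple(line.rstrip('\n').split('::')) for line in doc if line.strip()]

# Convert each phrase string to a spaCy Doc object (tokenization only)
rules = [(nlp.make_doc(item[0].lower()), item[-1]) for item in rules]

# Create a dictionary keyed by the phrase string, which is also used as the match key
rules_dict = dict()
for key, val in rules:
    rules_dict[key.text] = val

# Get just the phrase Docs from the rules list
rules_phrases = [item[0] for item in rules]

# Match case-insensitively using the PhraseMatcher class
# (spaCy v3 API: patterns are registered with add(), not passed to the constructor)
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
for phrase in rules_phrases:
    matcher.add(phrase.text, [phrase])
matches = matcher(text)
result = list()

# Each match is a (match_id, start, end) tuple; text[start:end] is a Span of the
# parsed text, so its tokens keep their lemma_ and pos_ attributes
for match_id, start, end in matches:
    label = nlp.vocab.strings[match_id]
    result.append({"start": start, "end": end, "phrase": label, "desc": rules_dict[label]})
print(result)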
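
The file-reading step above assumes every line contains exactly one '::'. As a minimal sketch of a more defensive parser for the same "phrase::description" format, the following may help; note that `parse_rules` is a hypothetical helper name and the sample records are made up for illustration, neither comes from the original answer.

```python
def parse_rules(lines, sep="::"):
    """Return a dict mapping each lowercased phrase to its description.

    Hypothetical helper for illustration; not part of spaCy.
    """
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            # Skip blank lines and lines that lack the separator
            continue
        # Split only on the first separator so a description may contain '::'
        phrase, desc = line.split(sep, 1)
        rules[phrase.strip().lower()] = desc.strip()
    return rules


# Made-up sample records standing in for the lines of the text file
sample = [
    "Spongy bone::porous inner bone tissue\n",
    "\n",
    "no separator here\n",
    "compact bone::dense outer layer :: of bone\n",
]
print(parse_rules(sample))
```

Splitting on the first separator is a design choice that differs slightly from the code above, which takes the last field as the description; either works as long as phrases themselves never contain '::'. The returned dictionary could be used in place of rules_dict, with its keys passed through nlp.make_doc to build the phrase patterns.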
