python - Identifying subject and object in Python

Question

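I want to identify the subject and object of each sentence in a set of sentences. My actual task is to identify cause and effect from a set of review data.

I am using the spaCy package to chunk and parse the data, but it does not quite get me to my goal. Is there a way to do this?

For example: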

 I thought it was the complete set

Output:

subject  object
I        complete set

score 14 · Accepted Answer

In the simplest way: dependencies are accessed through token.dep_.

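import spacy

# spaCy 3 removed the 'en' shortcut; load a model package by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
parsed_text = nlp("I thought it was the complete set")

# default to None so the prints below still work when a label is absent
subject = direct_object = indirect_object = None

# get token dependencies
for token in parsed_text:
    # nsubj: nominal subject
    if token.dep_ == "nsubj":
        subject = token.text
    # iobj: indirect object (some English models may use "dative" instead)
    if token.dep_ == "iobj":
        indirect_object = token.text
    # dobj: direct object
    if token.dep_ == "dobj":
        direct_object = token.text

print(subject)
print(direct_object)
print(indirect_object)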
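
The loop above keeps only the last token seen for each label. A minimal sketch of one way to keep every match instead, assuming only that each token exposes `dep_` and `text` as spaCy tokens do. The helper name `collect_by_dep` and the choice of labels are illustrative assumptions, not part of spaCy:

```python
from collections import defaultdict

# Assumed label set. "attr" is included because spaCy's English models
# typically label the complement of "to be" as attr rather than dobj.
LABELS = ("nsubj", "dobj", "iobj", "attr")


def collect_by_dep(doc, labels=LABELS):
    """Group token texts by dependency label, keeping every match in order."""
    found = defaultdict(list)
    for token in doc:
        if token.dep_ in labels:
            found[token.dep_].append(token.text)
    # Labels that never occurred are simply absent from the result.
    return dict(found)
```

With a loaded pipeline this would be called as `collect_by_dep(nlp("I thought it was the complete set"))`; the exact labels returned depend on the model.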

score 1 · Accepted Answer

You can use noun chunks.

Code

doc = nlp("I thought it was the complete set")
for nc in doc.noun_chunks:
    print(nc.text)

Result:

I
it
the complete set

To select only "I" rather than both "I" and "it", first write a test that picks the nsubj to the left of the ROOT.

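One possible reading of that test, as a hedged sketch: keep a noun chunk only when its head token is an nsubj that is governed by the sentence ROOT and precedes it. The function name `main_clause_subjects` is hypothetical. It relies only on `doc.noun_chunks`, `chunk.root`, `token.dep_`, `token.head` and `token.i`, which spaCy provides:

```python
def main_clause_subjects(doc):
    """Return noun chunks whose head word is an nsubj attached to, and left of, the ROOT."""
    subjects = []
    for chunk in doc.noun_chunks:
        head = chunk.root        # syntactic head token of the chunk
        governor = head.head     # token that the chunk head depends on
        if (
            head.dep_ == "nsubj"
            and governor.dep_ == "ROOT"
            and head.i < governor.i  # chunk head precedes the ROOT
        ):
            subjects.append(chunk.text)
    return subjects
```

Subjects of embedded clauses, such as "it" in the example sentence, are governed by a non-ROOT verb and are therefore skipped.
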
score 0 · Accepted Answer

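Stanza is built with highly accurate neural network components that also allow efficient training and evaluation on your own annotated data. The modules are built on top of the PyTorch library.

Stanza is a Python natural language analysis package. It contains tools that can be used in a pipeline to convert a string of human language text into lists of sentences and words, generate the base forms, parts of speech, and morphological features of those words, produce a syntactic dependency parse, and recognize named entities.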

def find_Subject_Object(text):
    # import required packages
    import stanza
    nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')
    doc = nlp(text)
    clausal_subject = []
    nominal_subject = []
    indirect_object = []
    Object          = []
    for sent in doc.sentences:
        for word in sent.words:
            if word.deprel  == "nsubj":
                nominal_subject.append({word.text:"nominal_subject nsubj"})
            elif word.deprel  == "csubj":
                clausal_subject.append({word.text:"clausal_subject csubj"})
            elif word.deprel  == "iobj":
                indirect_object.append({word.text:"indirect_object iobj"})
            elif word.deprel  == "obj":
                Object.append({word.text:"object obj"})
    return indirect_object, Object, clausal_subject,nominal_subject

text ="""John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City."""

find_Subject_Object(text)
# output #
([], [{'City': 'object obj'}], [], [{'John': 'nominal_subject nsubj'}, {'Airport': 'nominal_subject nsubj'}])
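
The output above contains only head words ('John', 'Airport'), not whole phrases. A rough sketch of how a head could be expanded to the words beneath it in the dependency tree, assuming each word exposes integer `id` and `head` fields as Stanza's `Word` objects do, with `head == 0` marking the root. `phrase_for` is a hypothetical helper, not a Stanza function:

```python
def phrase_for(head, words):
    """Return `head` plus every word that depends on it, in sentence order."""
    parent = {w.id: w.head for w in words}  # word id -> id of its head (0 = root)

    def under_head(word_id):
        # Walk up the tree until we reach `head` or pass the root.
        while word_id != 0:
            if word_id == head.id:
                return True
            word_id = parent[word_id]
        return False

    # Joining with spaces is a simplification; original spacing is not restored.
    return " ".join(w.text for w in words if under_head(w.id))
```

Inside the loop in `find_Subject_Object`, this would be called as `phrase_for(word, sent.words)`. The result includes any modifiers or clauses attached to the head, so it can be longer than a simple noun phrase.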

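Stanza also includes a Python interface to the CoreNLP Java package and inherits additional functionality from it, such as constituency parsing, coreference resolution, and linguistic pattern matching.

To summarize, Stanza features:

Native Python implementation requiring minimal effort to set up.
Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition.
Pretrained neural models supporting 66 (human) languages.
A stable, officially maintained Python interface to CoreNLP.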
