python - Python で有向単語グラフを生成する

Question

文のリストから、有向グラフを生成して、次のプロパティに従ってサブセンテンスを生成したい:

次の 3 つの文があるとします。

お菓子大好き
愛してます
私はキャンディーが大好きです

グラフには、ペアが出現するたびに、重みが 1 のすべての連続する単語の間にエッジがあります。

次に、最大の重みを持つグラフ内のパスを見つけます。ここでは、重み 3+2+1+1 で「I love candy very much」を返します。

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
# create a list of words separated at whitespaces
wordList1 = str1.split(None)
# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
# last character of each word
lastchar = word1[-1:]
# use a list of punctuation marks
if lastchar in [",", ".", "!", "?", ";"]:
    word2 = word1.rstrip(lastchar)
else:
    word2 = word1
# build a wordList of lower case modified words
wordList2.append(word2.lower())

wordList2連続した単語のリストが追加されました。グラフに変換するにはどうすればよいですか。私は networkx ライブラリを使用していますが、あまり詳しくありません。

エッジ加重グラフの作成を進めるにはどうすればよいですか?

score 8 · Accepted Answer

networkX を使用した問題の解決策を次に示します。

このコードは、有向グラフ dG を生成します。dG は単語ごとに 1 つのノードを持ち、単語が何回見られたかを示す「count」属性を持ちます。バイグラムごとに 1 つの有向エッジがあり、バイグラムが発生した回数を示す「重み」属性があります。

import networkx as nx
import string
from sys import maxint

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""

wordList1 = str1.split(None)

wordList2 = [string.rstrip(x.lower(), ',.!?;') for x in wordList1]

dG = nx.DiGraph()

for i, word in enumerate(wordList2):
    try:
        next_word = wordList2[i + 1]
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
        if not dG.has_node(next_word):
            dG.add_node(next_word)
            dG.node[next_word]['count'] = 0

        if not dG.has_edge(word, next_word):
            dG.add_edge(word, next_word, weight=maxint - 1)
        else:
            dG.edge[word][next_word]['weight'] -= 1
    except IndexError:
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
    except:
        raise

グラフの内容を確認するには、ノードとエッジを印刷します。

for node in dG.nodes():
    print '%s:%d\n' % (node, dG.node[node]['count'])

for edge in dG.edges():
    print '%s:%d\n' % (edge, maxint - dG.edge[edge[0]][edge[1]]['weight'])

バイグラムエッジの重みは、ゼロからカウントアップするのではなく、maxint からカウントダウンすることに注意してください。これは、networkX の最短パスユーティリティが重み値をエッジあたりのコストとして使用するためです。最大のカウントを最小の重みにすることで、これらのユーティリティを使用して、最大のエッジカウントを持つパスを見つけることができます。

たとえば、「man」という単語と「exhausted」という単語の間のカウントが最大のパスを取得できます。

>>>shortest_path = nx.shortest_path(dG, source='man', target='exhausted', weight='weight')
>>>print shortest_path
['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted']

または、「man」という単語と他のすべての単語の間のカウントが最も高いパスを取得できます。

>>>shortest_paths = nx.shortest_path(dG, source='man', weight='weight')
>>>print shortest_paths
{'run': ['man', 'who', 'run'], 
'get': ['man', 'who', 'run', 'behind', 'car', 'get'], 
'exhausted': ['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted'], 
'car': ['man', 'who', 'run', 'behind', 'car'], 
'who': ['man', 'who'], 
'behind': ['man', 'who', 'run', 'behind'], 
'of': ['man', 'who', 'run', 'in', 'front', 'of'], 
'in': ['man', 'who', 'run', 'in'], 
'front': ['man', 'who', 'run', 'in', 'front'], 
'tired': ['man', 'who', 'run', 'behind', 'car', 'get', 'tired'], 
'man': ['man']}

上で述べたように、このようなグラフではサイクルが発生する可能性があり、networkx 最短パスアルゴリズムがそのケースをどの程度うまく処理できるかはわかりません。

幸運を！

score 1 · Accepted Answer

辞書を使用してバイグラムを格納し、バイグラムに遭遇するたびに値に 1 を追加します。辞書値の最大値を取得して開始点を生成し、最後の単語で始まるバイグラムがなくなるまで、前のバイグラムの最後の単語で始まる最大値を持つバイグラムを取得する関数を再帰的に呼び出します。循環グラフの可能性に注意してください。たとえばI love that I love that I love、無限 (再帰制限を設定) などです。

特にnetworkxを使用したことはありませんが、ホームページを一瞥した後でも、上記が当てはまります。

python - Python で有向単語グラフを生成する

2 に答える 2

Related

Reference