python - Separating blocks of text using regular expressions

Question

I have the following output from the Stanford Parser:

nicaragua president ends visit to finland .

nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)

guatemala president ends visit to tropos .

nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)

[...]

I need to segment this output to get tuples containing each sentence and a list of all its dependencies, i.e. (sentence, [list of dependencies]) for each sentence. Could someone suggest a way to do this in Python? Thanks!

score 0 · Accepted Answer

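You could do something like this, although it is probably overkill for the structure you are parsing. It is relatively easy to extend if you also need to parse the dependencies themselves. I have not run this or checked its syntax yet, so please don't kill me if it doesn't work right away.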

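# Parser states
READ_SENT = 0  # expecting a sentence line
PRE_DEPS = 1   # expecting the blank line that follows a sentence
DEPS = 2       # collecting dependency lines until a blank line

def parse_output(text):
    """Return a list of (sentence, [dependency lines]) tuples."""
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in text.splitlines():
        line = line.strip()
        if state == READ_SENT:
            if line:  # skip any extra blank lines between blocks
                sent = line
                state = PRE_DEPS
        elif state == PRE_DEPS:
            if line:
                raise ValueError('invalid format')
            state = DEPS
        elif state == DEPS:
            if line:
                deps.append(line)
            else:
                # A blank line ends the current block.
                results.append((sent, deps))
                sent = None
                deps = []
                state = READ_SENT
    # Flush the last block if the input does not end with a blank line.
    if sent is not None:
        results.append((sent, deps))
    return results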
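
Since the question title asks about regular expressions, here is a minimal regex-based sketch as an alternative to the state machine above. It assumes that blocks are separated by blank lines and strictly alternate between a sentence and its dependency list. The names `split_blocks` and `parse_dependency` are hypothetical, and the dependency pattern only covers the simple `rel(head-i, dependent-j)` form shown in the question; other Stanford Parser output variants may need a looser pattern.

```python
import re

# Assumed dependency form: rel(head-index, dependent-index),
# e.g. nn(ends-3, nicaragua-1)
DEP_RE = re.compile(r'^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$')

def parse_dependency(line):
    """Turn 'nn(ends-3, nicaragua-1)' into ('nn', ('ends', 3), ('nicaragua', 1))."""
    m = DEP_RE.match(line.strip())
    if m is None:
        raise ValueError('unrecognized dependency: %r' % line)
    rel, head, head_idx, dep, dep_idx = m.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))

def split_blocks(text):
    """Return a list of (sentence, [dependency lines]) tuples."""
    # Split wherever one or more blank (whitespace-only) lines occur.
    chunks = [c.strip() for c in re.split(r'\n\s*\n', text.strip())]
    chunks = [c for c in chunks if c]
    # Chunks alternate: sentence, dependency block, sentence, ...
    return [(sent, deps.splitlines())
            for sent, deps in zip(chunks[0::2], chunks[1::2])]

if __name__ == '__main__':
    sample = (
        "nicaragua president ends visit to finland .\n"
        "\n"
        "nn(ends-3, nicaragua-1)\n"
        "root(ROOT-0, visit-4)\n"
    )
    for sent, deps in split_blocks(sample):
        print(sent, [parse_dependency(d) for d in deps])
```

The pairing with `zip` silently drops a trailing sentence that has no dependency block, so the state machine remains the better choice if malformed input should raise an error.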
