python - How to extract chunks from BIO-chunked sentences?

Question

Given an input sentence with BIO chunk tags:

[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

I need to extract the relevant phrases. For example, to extract 'NP', I need to extract the fragments of tuples that contain B-NP and I-NP.

[out]:

[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

(Note: the numbers in the extracted tuples represent the token indices.)

I tried to extract them using the following code:

def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-'+chunk_type in pos: # Append the word to the current_chunk.
            current_chunk.append((word))
            current_chunk_position.append((idx))
        else:
            if current_chunk: # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk) 
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str 
                current_chunk = []
                current_chunk_position = []
    if current_chunk: # Flush the last chunk.
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))

But when there are adjacent chunks of the same type:

tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

print (list(extract_chunks(tagged_sent, chunk_type='NP')))

it outputs this:

[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]

instead of the desired:

[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]

How can this be fixed in the code above?

Apart from fixing the code above, is there a better way to extract the desired chunks of a specific chunk_type?

score 0 · Accepted Answer

I would do it like this:

import re

def extract_chunks(tagged_sent, chunk_type):
    # compiles the expression we want to match
    regex = re.compile(chunk_type)

    # filters matched items into a dictionary whose keys are the matched indexes
    first_step = {index_: tag[0] for index_, tag in enumerate(tagged_sent) if regex.search(tag[1])}

    # builds a list of [words, positions] pairs following the output format
    second_step = []
    for key_ in sorted(first_step):
        is_adjacent = bool(second_step) and int(second_step[-1][1].split('-')[-1]) == key_ - 1
        # a B- tag always opens a new chunk, even right after a chunk of the same type
        if is_adjacent and not tagged_sent[key_][1].startswith('B-'):
            second_step[-1][0] += ' {0}'.format(first_step[key_])
            second_step[-1][1] += '-{0}'.format(key_)
        else:
            second_step.append([first_step[key_], str(key_)])

    # builds output in final format
    return [tuple(item) for item in second_step]

You can adapt it to use a generator instead of building the whole output in memory as I do, and refactor it for better performance (I'm in a hurry, so the code is far from optimal).
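A minimal sketch of what such a generator version might look like. The name iter_chunks is hypothetical, and the sketch assumes every tag is either 'O' or has the form 'B-TYPE' / 'I-TYPE'. It closes the open chunk whenever the tag leaves the requested type or a new B- tag begins, so adjacent chunks of the same type stay separate:

```python
def iter_chunks(tagged_sent, chunk_type):
    """Yield (chunk_text, 'i-j-k') pairs for each chunk of chunk_type."""
    words, positions = [], []
    for idx, (word, tag) in enumerate(tagged_sent):
        # Split 'B-NP' into ('B', 'NP'); a bare 'O' yields an empty label.
        prefix, _, label = tag.partition('-')
        inside = label == chunk_type and prefix in ('B', 'I')
        # Close the open chunk when we leave the type or a new B- tag starts.
        if words and (not inside or prefix == 'B'):
            yield ' '.join(words), '-'.join(map(str, positions))
            words, positions = [], []
        if inside:
            words.append(word)
            positions.append(idx)
    if words:  # Flush a chunk that runs to the end of the sentence.
        yield ' '.join(words), '-'.join(map(str, positions))


tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'),
               ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'),
               ('ate', 'B-VP'), ('ramen', 'B-NP')]
print(list(iter_chunks(tagged_sent, 'NP')))
```

Because it compares the label exactly instead of searching for a substring, a request for 'NP' would not pick up a tag type that merely contains "NP".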

Hope it helps!
