python - Can NLTK recognize initials followed by a dot?

Question

I'm trying to use NLTK to parse Russian text, but it doesn't work on abbreviations and initials such as А. И. Манташева and Я. Вышинский.

Instead, it breaks the sentence like this:

организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.

И.

Манташева.

The same happened when I used russian.pickle from https://github.com/mhq/train_punkt. Is this a general NLTK limitation, or is it language-specific?

score 4 · Accepted Answer

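As some of the comments hinted, what you want to use is the Punkt sentence segmenter/tokenizer.

NLTK or language-specific?

Neither. As you have realized, you cannot simply split on every period. NLTK comes with several Punkt segmenters trained on different languages. However, if you are running into problems, your best bet is to give the Punkt tokenizer a larger training corpus to learn from.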
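To make the "cannot simply split on every period" point concrete, here is a minimal sketch that uses only the standard library. It runs a naive period-based splitter over the sentence from the question. The helper name `naive_split` is made up for this illustration and is not part of NLTK.

```python
import re

# Sentence from the question; the initials "А." and "И." each end in a period.
TEXT = ("организовывал забастовки и демонстрации, поднимал рабочих "
        "на бакинских предприятиях А. И. Манташева.")


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)


if __name__ == "__main__":
    # Each initial ends up as its own "sentence", which is the failure
    # described in the question.
    for piece in naive_split(TEXT):
        print(piece)
```

A rule this simple cannot tell an initial from the end of a sentence, which is why Punkt learns abbreviations and orthographic context from a corpus instead.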

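Documentation links

Sample implementation

Below is a piece of code to point you in the right direction. You should be able to do the same for yourself by providing Russian text files. One source for those could be the Russian version of a Wikipedia database dump, but I leave that to you as a potential secondary problem.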

import logging
import pickle

import nltk


def create_punkt_sent_detector(fnames, punkt_fname, progress_count=None):
    """Makes a pass through the corpus to train a Punkt sentence segmenter.

    Args:
        fnames: List of filenames to be used for training.
        punkt_fname: Filename to save the trained Punkt sentence segmenter.
        progress_count: Display a progress count every integer number of pages.
    """
    logger = logging.getLogger('create_punkt_sent_detector')

    punkt = nltk.tokenize.punkt.PunktTrainer()

    logger.info("Training punkt sentence detector")

    doc_count = 0
    try:
        for fname in fnames:
            # Punkt trains on text (str), not bytes. UTF-8 is assumed here;
            # adjust the encoding to match your corpus files.
            with open(fname, encoding='utf-8') as f:
                punkt.train(f.read(), finalize=False, verbose=False)
                doc_count += 1
                if progress_count and doc_count % progress_count == 0:
                    logger.debug('Pages processed: %i', doc_count)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')

    logger.info('Now finalizing Punkt training.')

    # Compute the final abbreviation, collocation and sentence-starter sets.
    punkt.finalize_training(verbose=True)
    learned = punkt.get_params()
    sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(learned)
    # Cache the trained tokenizer so later runs can skip training.
    with open(punkt_fname, mode='wb') as f:
        pickle.dump(sbd, f, protocol=pickle.HIGHEST_PROTOCOL)

    return sbd


if __name__ == '__main__':
    punkt_fname = 'punkt_russian.pickle'
    try:
        # Reuse a previously trained tokenizer if one was saved.
        with open(punkt_fname, mode='rb') as f:
            sent_detector = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        sent_detector = None

    if sent_detector is None:
        corpora = ['russian-1.txt', 'russian-2.txt']
        sent_detector = create_punkt_sent_detector(fnames=corpora,
                                                   punkt_fname=punkt_fname)

    tokenized_text = sent_detector.tokenize("some russian text.",
                                            realign_boundaries=True)
    print('\n'.join(tokenized_text))
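If retraining on a larger corpus is not practical, one complementary workaround (not part of the answer above, so treat it as a sketch under assumptions) is to seed Punkt's abbreviation set by hand. Punkt stores abbreviations lowercased and without the trailing period. The `INITIALS` set, the `build_tokenizer` helper, and the second sample sentence are all made up for illustration. The tokenizer here starts from empty parameters, so it only demonstrates the mechanism; with a trained model you would update `abbrev_types` on the parameters object returned by `get_params()` instead.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical hand-picked initials: lowercase, no trailing period.
INITIALS = {"а", "и", "я"}


def build_tokenizer(extra_abbrevs):
    """Return a Punkt tokenizer whose parameters hold only the given abbreviations."""
    params = PunktParameters()
    params.abbrev_types.update(extra_abbrevs)
    return PunktSentenceTokenizer(params)


if __name__ == "__main__":
    # First sentence is from the question; the second is an invented filler.
    text = ("поднимал рабочих на бакинских предприятиях А. И. Манташева. "
            "Это второе предложение.")
    for sentence in build_tokenizer(INITIALS).tokenize(text):
        print(sentence)
```

Listing every single letter as an abbreviation is a blunt instrument: a sentence that genuinely ends in a one-letter word will no longer be split there, so a better-trained model remains the more robust fix.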
