python - Sentence processing in Python

Question

I have a file containing data like this:

Sentence[0].Sentence[1].Sentence[2].'/n'
Sentence[0].Sentence[1].Sentence[2].'/n'
Sentence[0].Sentence[1].Sentence[2].'/n'

I want to print every Sentence[0]. This is what I tried, but it prints an empty list.

from nltk import *
import codecs
f=codecs.open('topon.txt','r+','cp1251')
text = f.readlines()
first=[sentence for sentence in text if re.findall('\.\n^Abc',sentence)]
print first

score 3 · Accepted Answer

You don't need NLTK for this (and you aren't actually using it). Unless I've misunderstood the question, this should do the trick:

with open('topon.txt') as infile:
  for line in infile:
    print line.split('.', 1)[0]
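
For anyone on Python 3, the same idea could be sketched roughly as follows. The helper name first_sentences and the sample text are made up for illustration; with the real file you would pass in open('topon.txt', encoding='cp1251') instead, assuming the file really is cp1251-encoded as the question suggests.

```python
import io


def first_sentences(lines):
    """Return the text before the first period on each non-blank line."""
    return [line.strip().split('.', 1)[0] for line in lines if line.strip()]


# Stand-in for the real file; the content here is invented sample data.
sample = io.StringIO("First one.Second one.Third one.\nAlpha.Beta.Gamma.\n")
result = first_sentences(sample)
print(result)
```

Passing maxsplit=1 to split() stops after the first period, so the rest of the line is never scanned further than necessary.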

score 1 · Accepted Answer

In addition to @inspectorG4dget's answer, you can do it with a regular expression:

import re  # imported explicitly; NLTK itself is not needed here
import codecs

f = codecs.open('a.txt', 'r+', 'cp1251')
text = f.readlines()
print [re.findall('^[^.]+', sentence) for sentence in text]
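
Note that re.findall returns a list for each line, so the snippet above prints a list of lists. If plain strings are wanted, a sketch along these lines (Python 3, with made-up sample lines standing in for the file) uses re.match instead:

```python
import re

# Everything up to, but not including, the first period.
FIRST_SENTENCE = re.compile(r'[^.]+')

# Invented sample lines standing in for f.readlines().
lines = [
    "First one.Second one.Third one.\n",
    "Alpha.Beta.Gamma.\n",
    ".starts with a period\n",
]

first = []
for line in lines:
    match = FIRST_SENTENCE.match(line)  # match() only tries at the start of the line
    if match:                           # lines that begin with a period are skipped
        first.append(match.group(0))
print(first)
```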

score 1 · Accepted Answer

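Splitting a paragraph on periods only works if every sentence ends with a period and periods are not used for anything else. With any real amount of text, neither of these is even close to true: you'll trip over abbreviations, questions? exclamations! and so on. So use the tool that nltk provides for this purpose, the function sent_tokenize(). It isn't perfect, but it is far better than looking for periods. If text is a list of paragraphs, use it like this: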

import nltk

first = []
for par in text:
    # sent_tokenize splits a paragraph into a list of sentences
    sentences = nltk.sent_tokenize(par)
    first.append(sentences[0])

You could collapse the above into a list comprehension, but it wouldn't be very readable...
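
To make the problem with period-splitting concrete, here is a small sketch (Python 3, with invented sample paragraphs) of what the naive approach returns on text containing an abbreviation and a question mark:

```python
# Invented sample paragraphs; real text will have many more such cases.
paragraphs = [
    "Dr. Smith arrived at 9 a.m. sharp. He sat down.",
    "Is this right? I think so. Yes.",
]

# Naive approach: cut each paragraph at its first period.
naive_first = [par.split('.', 1)[0] for par in paragraphs]
print(naive_first)  # ['Dr', 'Is this right? I think so']
```

The first result is cut off at the abbreviation, and the second runs past the question mark into the next sentence. These are the kinds of cases sent_tokenize() is meant to handle, though as said above it is not perfect either.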
