python - How do I get the first sentence from the following paragraph?

Question

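This may sound easy. I thought of splitting on the first dot (.), but that breaks down as soon as abbreviations or short forms appear.

Example -

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.

Here the first dot is the one in "Hon.", but I need the complete first sentence, which ends with "Second World War."

Is that possible?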

score 8 · Accepted Answer

If you use nltk, you can add abbreviations like this:

>>> import nltk
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent_detector._params.abbrev_types.add('hon')
>>> sent_detector.tokenize(your_text)
['Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA 
(30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and 
statesman known for his leadership of the United Kingdom during the Second 
World War.', 
'He is widely regarded as one of the great wartime leaders and served as Prime 
Minister twice.', 
'A noted statesman and orator, Churchill was also an officer in the British Army,
a historian, a writer, and an artist.']

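This approach is based on Kiss & Strunk 2006, which reports an F-score (the harmonic mean of precision and recall) for Punkt of between 91% and 99%, depending on the test corpus.

Kiss, Tibor and Jan Strunk. 2006. "Unsupervised Multilingual Sentence Boundary Detection". Computational Linguistics, (32) 485-525.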
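For readers unfamiliar with the metric, here is a minimal sketch of how an F-score is computed as the harmonic mean of precision and recall. The function name `f_score` and the input values are illustrative assumptions; they are not figures taken from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (often written F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, chosen only to illustrate the formula.
example = f_score(0.99, 0.95)
```

The harmonic mean stays close to the smaller of the two inputs, so a tokenizer cannot score well by being strong on precision alone or on recall alone.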

score 1 · Accepted Answer

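This is not possible in general. Abbreviations, numbers ("$23.45", "32.5 degrees"), quotations ("he said: 'ha! you'll never [...]'"), names that contain punctuation (e.g. "Panic! At the Disco"), or even whole parenthesized subordinate clauses that are basically sentences of their own ("the cook (who also is a great painter!) [...]") mean that you cannot simply split the text at dots and exclamation/question marks, or use any other "simple" approach.

Basically, to solve the general case you would need a parser for natural language with a grammar that handles all of these special cases (in which case you might be better off using Prolog rather than Python). If you can reduce the problem to something less general, for example by handling only abbreviations and quotations, you may be able to cobble something together. Even then you would still need some kind of parser or state machine, because regular expressions are not powerful enough for this kind of thing.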
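As a rough illustration of the reduced problem, the sketch below is a small hand-written scanner that handles only three of the special cases: known abbreviations, decimals, and parenthesized text. The function name `first_sentence` and the abbreviation list are assumptions made for this example. Quotations and names such as "Panic! At the Disco" are deliberately left unhandled.

```python
# Hypothetical, hand-picked abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"hon", "mr", "mrs", "dr", "prof", "e.g", "i.e"}


def first_sentence(text, abbreviations=ABBREVIATIONS):
    """Return the first sentence of text using a simple character scanner."""
    depth = 0  # current nesting level of parentheses
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch in ".!?" and depth == 0:
            # A terminator must be followed by whitespace or end the text,
            # which skips decimals such as "1.3" or "$23.45".
            if i + 1 < len(text) and not text[i + 1].isspace():
                continue
            # A period right after a known abbreviation does not end the sentence.
            words = text[:i].split()
            last_word = words[-1].lower() if words else ""
            if ch == "." and last_word in abbreviations:
                continue
            return text[: i + 1]
    return text  # no terminator found: treat the whole text as one sentence
```

Calling `first_sentence` on the Churchill paragraph from the question skips "Hon." because "hon" is in the list and stops at "Second World War." Any abbreviation missing from the list will still cut the sentence short, which is the weakness of a fixed list.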

score 1 · Accepted Answer

Have you looked into the Natural Language Toolkit, nltk? It appears to have a sentence tokenizer available. http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize-module.html

score 0 · Accepted Answer

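The first sentence of a Wikipedia article almost always says that something is, was, are or were. So one possible solution is to not end the sentence until you have reached a linking verb (is, was, are, were). Of course this will not be 100% accurate, but here is a possible solution: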

def get_first_sentence(my_string):

    linking_verbs = set(['was', 'is', 'are', 'were'])

    split_string = my_string.split(' ')

    first_sentence = []
    linked_verb_booly = False
    for ele in split_string:
        first_sentence.append(ele)
        if ele in linking_verbs:
            linked_verb_booly = True
        if '.' in ele and linked_verb_booly == True:
            break

    return ' '.join(first_sentence)

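Example 1:

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.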

my_string_1 = 'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.'
first_sentence_1 =  get_first_sentence(my_string_1)

Result:

>>> first_sentence_1
'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War.'

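Example 2:

Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability. Its syntax is said to be clear[12] and expressive[13]. Python has a large and comprehensive standard library[14].

Result: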

>>> first_sentence_2
'Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability.'

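Example 3:

China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]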

my_string_3 = "China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]"
first_sentence_3 = get_first_sentence(my_string_3)

Result:

>>> first_sentence_3

    "China (Listeni/\xcb\x88t\xca\x83a\xc9\xaan\xc9\x99/; Chinese: \xe4\xb8\xad\xe5\x9b\xbd; pinyin: Zh\xc5\x8dnggu\xc3\xb3; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3"

This one is cut short because there is a '.' in 1.3.

Also, the above would probably be better done with a regex.

Just an idea.
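Following up on the regex suggestion, here is one way the same linking-verb heuristic might look as a regular expression. This is a sketch, not the answer author's code. The name `get_first_sentence_re` is made up for this example, and it adds one assumption the loop above does not make: a period only ends the sentence when it is followed by whitespace or the end of the string, which avoids the cut at "1.3".

```python
import re

# Lazily skip to the first linking verb, then to the first period that is
# followed by whitespace or by the end of the string.
FIRST_SENTENCE_RE = re.compile(
    r"^.*?\b(?:is|was|are|were)\b.*?\.(?=\s|$)",
    re.DOTALL,
)


def get_first_sentence_re(text):
    """Return the text up to the first sentence-ending period after a linking verb."""
    match = FIRST_SENTENCE_RE.match(text)
    # Fall back to the whole text when the pattern does not match.
    return match.group(0) if match else text
```

The heuristic is still fragile: an abbreviation that appears after the linking verb (for example "Dr. ") would end the sentence too early.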

score 0 · Accepted Answer

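Many people here have made good points, but natural language processing is a genuinely hard task: a huge amount of research has gone into it, and the results are still far from reliable. That said, there are solutions out there. Many people have mentioned the Natural Language Toolkit, one of the most powerful natural language processing tools in existence. In fact, NLTK has a ready-built sentence tokenizer that is not perfect but is very good. It is called PunktSentenceTokenizer, and it filters out abbreviations properly. It has quite a bit of trouble with more slangy speech, but it works very well on sentences like the one above. The documentation is here: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html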

from nltk import tokenize

def print_sentences(text):
    test = tokenize.punkt.PunktSentenceTokenizer()
    return test.sentences_from_text(text)

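Sadly, it does not actually work on the example you presented, but it has a very detailed lookup and catches many abbreviations. I think a good part of the problem with this example is that "Hon." is also a proper noun and may appear as such in the dictionary. It is possible to custom-configure the dictionary in nltk to catch this particular case.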

score -1 · Accepted Answer

If you stick to the rule that a period ends a sentence only when it is followed by a space or a newline, you can do something like this:

s="Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist."
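# Candidate terminators: punctuation followed by a space or a newline.
sentence_delimiters = ['. ', '.\n', '? ', '?\n', '! ', '!\n']
# First position of each delimiter in s (-1 when it does not occur).
pos = [s.find(delimiter) for delimiter in sentence_delimiters]
pos = min([p for p in pos if p >= 0])
# Note: for this string the earliest delimiter is the '. ' after 'Hon'.
print(s[:pos])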
