python - How to find title case phrases from a passage or bunch of paragraphs

Question

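How do I parse sentence-case phrases from a passage?

For example, from this passage:

Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882.

I need to generate Conan Doyle, Holmes, Dr. Joseph Bell, Wendell Scherer, and so on.

I would prefer a Pythonic solution if possible.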

score 5 · Accepted Answer

This kind of processing can be very tricky. This simple code does almost the right thing:

import re
# `text` holds the passage quoted in the question.
for s in re.finditer(r"([A-Z][a-z]+[. ]+)+([A-Z][a-z]+)?", text):
    print(s.group(0))

Produces:

Conan Doyle
Holmes
Dr. Joseph Bell
Doyle
Edinburgh Royal Infirmary. Like Holmes
Bell
Michael Harrison
Ellery Queen
Mystery Magazine
Wendell Scherer
England

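To include "Dr. Joseph Bell", you have to accept a period inside the matched string, and that is also what lets "Edinburgh Royal Infirmary. Like Holmes" through.

I had a similar problem: separating sentences.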
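
One way to ease the period trade-off described above is to accept a period only after a known abbreviation. The following is a minimal sketch under that assumption; the `TITLES` tuple and the `title_case_phrases` name are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical list of honorifics whose trailing period may stay inside a phrase.
TITLES = ("Dr", "Mr", "Mrs", "Ms", "Prof")

# One capitalised ASCII word, e.g. "Holmes".
WORD = r"\b[A-Z][a-z]+\b"

# Optional "Title. " prefix, then capitalised words separated by single whitespace.
PHRASE = re.compile(r"(?:\b(?:%s)\.\s+)?%s(?:\s%s)*" % ("|".join(TITLES), WORD, WORD))


def title_case_phrases(text):
    """Return capitalised phrases, allowing a period only after a known title."""
    return [m.group(0) for m in PHRASE.finditer(text)]
```

On the first two sentences of the sample passage this keeps "Dr. Joseph Bell" together and stops at the sentence boundary after "Edinburgh Royal Infirmary". A sentence-initial word still leaks in, as in "Like Holmes", so it remains a heuristic.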

score 2 · Accepted Answer

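The 're' approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, try it on Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.

Update: Following is an 're'-based approach that finds a lot more valid cases. I still don't think that this is a good approach, however. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode and normalise whitespace at some stage (on input or on output).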

import re

text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""

text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""

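# Ned B's pattern: capitalised words joined by spaces or periods.
pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"

# Punctuation and lower-case particles that may appear inside a name.
joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()

# Runs of separators, joiners, and words containing exactly one capital letter.
pattern2 = r"""(?x)
    (?:
        (?:[ .]|\b%s\b)*
        (?:\b[a-z]*[A-Z][a-z]*\b)?
    )+
    """ % r'\b|\b'.join(joiners)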

def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        s = m.group(0).strip(" .'-")
        if s:
            yield s

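for t in (text1, text2):
    # Single-argument print() calls behave the same on Python 2 and 3.
    print("*** text:  %s ..." % t[:20])
    print("=== Ned B")
    for s in re.finditer(pattern1, t):  # the text argument was missing here
        print(repr(s.group(0)))
    print("=== John M ==")
    for name in get_names(pattern2, t):
        print(repr(name))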

Output:

C:\junk\so>\python26\python extract_names.py
*** text:  Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text:  Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'
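
The advice above about working in Unicode and normalising whitespace could look roughly like this. It is a sketch that assumes Python 3, where `str` patterns are Unicode-aware; the names `normalise` and `capitalised_words` are hypothetical, and it only picks out single capitalised words rather than whole names.

```python
import re
import unicodedata

# A run of Unicode letters: word characters that are neither digits nor "_".
LETTERS = re.compile(r"[^\W\d_]+")


def normalise(text):
    """Compose characters to NFC and collapse each whitespace run to one space."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def capitalised_words(text):
    """Return words made of one upper-case letter followed by lower-case letters."""
    words = LETTERS.findall(normalise(text))
    return [w for w in words if w[0].isupper() and w[1:].islower()]
```

The NFC step matters when an umlaut arrives as a base letter plus a combining mark: without it, "Köfering" typed as "Ko" followed by U+0308 would be split at the combining character.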
