python - Can I use pywikipedia to get just the text of a page?

Question

Is it possible, using pywikipedia, to get only the text of a page, without any internal links, templates, pictures and so on?

score 5 · Accepted Answer

If you mean "I want to get the wikitext only", then look at the wikipedia.Page class and its get method.

import wikipedia

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print(page.get())  # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology==
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

This way you get the complete, raw wikitext of the article.

If you want to strip out the wiki syntax, for example turn [[Concept inventory]] into Concept inventory, it takes a bit more work.

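The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it very hard to parse and to strip. I currently know of no software that does this accurately. There is the MediaWiki Parser class, of course, but it is PHP, a bit hard to grasp, and its purpose is very different.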

But if you only want to strip out links, or very simple wiki constructs, use regular expressions:

import re

text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

And then for piped links:

text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

And so on.
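
As a rough sketch, the two substitutions above can be folded into a single pattern with an optional link-target part. The names WIKILINK and unlink are made up for this illustration, and the pattern only covers simple, non-nested links:

```python
import re

# Optional "target|" prefix, then the visible label, inside [[...]].
WIKILINK = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')

def unlink(text):
    """Replace [[target]] and [[target|label]] with their visible text."""
    return WIKILINK.sub(r'\1', text)

print(unlink('Lorem [[dolor]] sit [[amet|AMET]].'))
```

Nested constructs, such as image links whose captions contain further links, are not fully handled by this pattern.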

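But, for example, there is no reliable, easy way to strip nested templates from a page. The same goes for images that have links in their comments. It is quite hard: you have to recursively remove the innermost link, replace it with a marker, and start over. Have a look at the templateWithParams function in wikipedia.py if you want, but it is not pretty.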
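
The innermost-first idea can be sketched with the standard library alone. This simplified version just deletes each innermost template instead of swapping in a marker, and it ignores parser functions, nowiki sections and unbalanced braces; strip_templates and max_passes are names invented for the example:

```python
import re

# A template whose body contains no braces, i.e. an innermost one.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')

def strip_templates(wikitext, max_passes=50):
    """Delete innermost {{...}} templates repeatedly until none are left."""
    for _ in range(max_passes):
        stripped, count = INNERMOST_TEMPLATE.subn('', wikitext)
        if count == 0:
            break  # nothing left to remove
        wikitext = stripped
    return wikitext

sample = 'Intro {{outer|a={{inner|x}}|b=2}} text.'
print(repr(strip_templates(sample)))
```

Each pass peels off one nesting level, so the outer template only becomes removable once the inner one is gone.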

score 1 · Accepted Answer

There is a module on GitHub called mwparserfromhell that can get you very close to what you want, depending on what you need. It has a method called strip_code() that strips a lot of the markup.

import pywikibot
import mwparserfromhell

test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()

full = mwparserfromhell.parse(text)
stripped = full.strip_code()

print(full)
print('*******************')
print(stripped)

Snippet of the comparison:

{{db-foreign}}
<!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   

==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


*******************

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   

Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat.

score 0 · Accepted Answer

Pywikibot can remove wikitext or HTML tags. There are two functions inside textlib:

removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

Returns the text without the portions where HTML markup is disabled, i.e. it removes the HTML tags but keeps the text between them. For example:
```
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeHTMLParts(text, keeptags=[]))
```
This prints:
```
 This is small text
```
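
To make the behaviour concrete, here is a small standard-library approximation of the same idea. It is not Pywikibot's implementation; strip_html_tags is a hypothetical helper that drops tags unless their name is listed in keeptags:

```python
import re

# Opening or closing tag; group 1 captures the tag name.
HTML_TAG = re.compile(r'</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>')

def strip_html_tags(text, keeptags=()):
    """Remove HTML tags but keep the text between them."""
    def replace(match):
        # Leave the tag in place if its name is in keeptags.
        return match.group(0) if match.group(1).lower() in keeptags else ''
    return HTML_TAG.sub(replace, text)

print(strip_html_tags('This is <small>small</small> text'))
```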
removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:

Returns the text without the portions where wiki markup is disabled. This removes the given wikitext tags together with the text inside them. For example:
```
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeDisabledParts(text, tags=['small']))
```
This prints:
```
 This is  text
```
There are a lot of predefined tags to remove or keep, such as 'comment', 'header', 'link' and 'template'.

The default for the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

Some other examples:

removeDisabledParts('See [[this link]]', tags=['link']) gives 'See '

removeDisabledParts('', tags=['comment']) gives ''

removeDisabledParts('{{Infobox}}', tags=['template']) gives '', but works only with Pywikibot 6.0.0 or later
