wikipedia - Is there a Wikipedia API just for retrieving the content summary?

Question

I just need to retrieve the first paragraph of a Wikipedia page.

The content must be HTML-formatted, ready to be displayed on my website (so no BBCode or Wikipedia's special markup!).

score 232 · Accepted Answer

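There is a way to get the entire "intro section" without any HTML parsing. Similar to AnthonyS's answer, but with the additional explaintext parameter, you can get the intro section text as plain text.

Query

Getting Stack Overflow's intro in plain text:

Using the page title: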

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or using pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

Documentation: API: query / prop=extracts
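As a rough sketch of how the query above could be assembled and its response unwrapped in JavaScript: the function names buildExtractUrl and firstExtract are made up for this illustration, and the flag parameters are sent as exintro=1 / explaintext=1 on the assumption that MediaWiki treats any supplied value for a flag as "on".

```javascript
// Build an action=query / prop=extracts URL for a single page title.
// Assumption: flag parameters (exintro, explaintext) are enabled by
// being present, so sending "1" is equivalent to the bare flag.
function buildExtractUrl(title, plainText) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "1",
    redirects: "1",
    titles: title,
  });
  if (plainText) {
    params.set("explaintext", "1");
  }
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// The response keys pages by page ID, which the caller usually does not
// know in advance, so take the first entry instead of a fixed key.
// Returns null when there is no page or no extract.
function firstExtract(response) {
  const pages = (response.query && response.query.pages) || {};
  const ids = Object.keys(pages);
  if (ids.length === 0) {
    return null;
  }
  const page = pages[ids[0]];
  return typeof page.extract === "string" ? page.extract : null;
}

// Usage with a shortened copy of the response shown above.
const sampleResponse = {
  query: {
    pages: {
      "21721040": {
        pageid: 21721040,
        ns: 0,
        title: "Stack Overflow",
        extract: "Stack Overflow is a privately held website...",
      },
    },
  },
};
console.log(buildExtractUrl("Stack Overflow", true));
console.log(firstExtract(sampleResponse));
```

Taking the first key of "pages" avoids hard-coding the page ID, which matters when the query is made by title.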

score 84 · Accepted Answer

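Actually, there is a very nice prop called extracts that can be used with queries and is designed specifically for this purpose.

Extracts let you get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text of the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Note that if you specifically want the first paragraph, you still need to do some additional parsing, as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other suggested API queries, because the API response to parse contains no additional assets such as images.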
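The "additional parsing" for the first paragraph can be quite small if the extract is requested as plain text. The sketch below assumes the explaintext parameter is added to the query, so that paragraphs in the extract are separated by newline characters; firstParagraph and toHtmlParagraph are hypothetical helper names, and the escaping step is only there because the question asks for HTML output.

```javascript
// Return the first non-empty paragraph of a plain-text extract.
// Assumption: paragraphs are separated by "\n", as in explaintext output.
function firstParagraph(extract) {
  const paragraphs = extract
    .split("\n")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return paragraphs.length > 0 ? paragraphs[0] : "";
}

// Wrap plain text in a <p> element, escaping characters that are
// special in HTML so the text can be inserted into a page safely.
function toHtmlParagraph(text) {
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return "<p>" + escaped + "</p>";
}

// Usage with a hand-written two-paragraph extract.
const extract = "Stack Overflow is a website.\nIt features questions and answers.";
console.log(toHtmlParagraph(firstParagraph(extract)));
```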

score 78 · Accepted Answer

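Since 2017, Wikipedia has provided a REST API with better caching. In the documentation you can find the following API, which fits your use case perfectly (it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data, which can be used to display a summary with a small thumbnail: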

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "Stack Overflow",
  "extract": "Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of <i>Coding Horror</i>, Atwood's popular programming blog.</p>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "Stack Overflow"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png/320px-Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 320,
    "height": 149
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 462,
    "height": 215
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "902900099",
  "tid": "1a9cdbc0-949b-11e9-bf92-7cc0de1b4f72",
  "timestamp": "2019-06-22T03:09:01Z",
  "description": "website hosting questions and answers on a wide range of topics in computer programming",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Stack_Overflow",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Stack_Overflow",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Stack_Overflow",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Stack_Overflow",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Stack_Overflow"
  }
}

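By default, it follows redirects (so that /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain, you can set the CORS header with &origin= (for example &origin=*).

As of 2019: the API seems to return more useful information about the page.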
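A minimal sketch of how the summary response could be turned into a snippet for a web page. It relies only on the fields visible in the response above (extract_html and thumbnail); summaryUrl, escapeAttr and summaryToHtml are names invented for this example, not part of the REST API.

```javascript
// Build the REST summary URL for a title. Spaces become underscores,
// matching the canonical form in the sample response.
function summaryUrl(title, followRedirects) {
  const path = encodeURIComponent(title.replace(/ /g, "_"));
  let url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + path;
  if (followRedirects === false) {
    url += "?redirect=false";
  }
  return url;
}

// Escape a value before placing it inside a double-quoted attribute.
function escapeAttr(value) {
  return String(value).replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Combine the optional thumbnail and the HTML extract into one snippet.
// extract_html is already HTML, so it is passed through unchanged.
function summaryToHtml(summary) {
  const parts = [];
  if (summary.thumbnail && summary.thumbnail.source) {
    parts.push(
      '<img src="' + escapeAttr(summary.thumbnail.source) +
      '" width="' + escapeAttr(summary.thumbnail.width) +
      '" height="' + escapeAttr(summary.thumbnail.height) +
      '" alt="">'
    );
  }
  parts.push(summary.extract_html || "");
  return parts.join("\n");
}

// Usage with a trimmed-down summary object (placeholder thumbnail URL).
console.log(summaryUrl("Stack Overflow"));
console.log(summaryToHtml({
  extract_html: "<p><b>Stack Overflow</b> is a question and answer site.</p>",
  thumbnail: { source: "https://example.org/thumb.png", width: 320, height: 149 },
}));
```

In a browser, the JSON could be obtained with fetch(summaryUrl(title)) and the returned snippet assigned to an element's innerHTML.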

score 39 · Accepted Answer

This code lets you retrieve the content of the first paragraph of a page in plain text.

Parts of this answer come from here and thus from here. See the MediaWiki API documentation for more information.

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in JSON format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // Get the main text content of the query (it's parsed HTML)

// Pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // Content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}

score 33 · Accepted Answer

Yes, there is. For example, if you want to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

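The parts mean this:

format=xml: Return the result formatted as XML. Other options (such as JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.
action=query&prop=revisions: Get information about the revisions of the page. Since no revision is specified, the latest one is used.
titles=Stack%20Overflow: Get information about the page Stack Overflow. You can get the text of several pages in one request by separating their names with |.
rvprop=content: Return the content (or text) of the revision.
rvsection=0: Return only the content of section 0.
rvparse: Return the content parsed as HTML.

Note that this returns the whole first section, including things like hatnotes ("For other uses …"), infoboxes, and images.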
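To illustrate how the parameters above fit together, including the |-separated list of titles, here is a small sketch. buildSectionZeroUrl is a made-up helper name; it requests JSON rather than XML simply because that is easier to consume from JavaScript, and it sends rvparse=1 on the assumption that the flag only needs to be present. Recent MediaWiki versions may report a deprecation warning for rvparse, in which case the action=parse style of query is an alternative.

```javascript
// Build a prop=revisions query for section 0 of one or more pages.
// Titles are joined with "|" as described in the parameter list.
function buildSectionZeroUrl(titles) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "revisions",
    titles: titles.join("|"),
    rvprop: "content",
    rvsection: "0",
    rvparse: "1", // assumption: any value enables the flag
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// Usage: one request for two pages.
console.log(buildSectionZeroUrl(["Stack Overflow", "Baseball"]));
```

URLSearchParams percent-encodes the | separator as %7C, which the server decodes back before parsing the title list.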

There are several libraries available for various languages that make working with the API easier; it may be better for you to use one of them.

score 14 · Accepted Answer

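This is the code I'm currently using on a website I'm building that needs to get the leading paragraphs / summary / section 0 of Wikipedia articles. It all runs within the browser (client-side JavaScript), thanks to the magic of JSONP! -> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) as HTML, like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips out the HTML and other undesired data, giving you a clean string of the article summary. If you want, with a little tweaking you can put a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code: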

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

// Get leading paragraphs (section 0)
$.getJSON("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (text in data.parse.text) {
        var text = data.parse.text[text].split("<p>");
        var pText = "";

        for (p in text) {
            // Remove HTML comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            // Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') // Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); // Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); // Remove reference tags (e.x. [1], [4], etc)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});

score 8 · Accepted Answer

This URL returns the summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I wrote a function that fetches the description of a keyword from Wikipedia.

function getDescription($keyword) {
    $url = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=' . urlencode($keyword) . '&MaxHits=1';
    $xml = simplexml_load_file($url);
    return $xml->Result->Description;
}

echo getDescription('agra');

score 5 · Accepted Answer

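You can also get content such as the first paragraph via DBPedia, which takes Wikipedia content, creates structured information from it (RDF), and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and is quite easy to wrap.

As an example, here is a very simple JavaScript library named WikipediaJS that can extract structured content, including a summary first paragraph.

You can read more about it in this blog post: WikipediaJS - accessing Wikipedia article data through Javascript

The JavaScript library code can be found in wikipedia.js.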
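For readers who want to see what "wrapping" the SPARQL API might look like without a library, here is a rough sketch. Everything specific in it is an assumption rather than something stated in this answer: the endpoint https://dbpedia.org/sparql, its query and format parameters, the http://dbpedia.org/ontology/abstract property, and the helper names buildAbstractQueryUrl and firstAbstract. The response handling follows the standard SPARQL JSON results layout (results.bindings).

```javascript
// Build a SPARQL query URL asking for the abstract of one resource.
// Assumptions: endpoint URL, "format" parameter and the abstract
// property are as commonly used with DBpedia; verify before relying on them.
// resourceName is inserted as-is, so it must be a trusted, valid name.
function buildAbstractQueryUrl(resourceName, lang) {
  const sparql =
    "SELECT ?abstract WHERE { " +
    "<http://dbpedia.org/resource/" + resourceName + "> " +
    "<http://dbpedia.org/ontology/abstract> ?abstract . " +
    'FILTER (lang(?abstract) = "' + lang + '") } LIMIT 1';
  const params = new URLSearchParams({
    query: sparql,
    format: "application/sparql-results+json",
  });
  return "https://dbpedia.org/sparql?" + params.toString();
}

// Pull the abstract text out of a SPARQL JSON result, or null if absent.
function firstAbstract(result) {
  const bindings = (result.results && result.results.bindings) || [];
  if (bindings.length === 0 || !bindings[0].abstract) {
    return null;
  }
  return bindings[0].abstract.value;
}

// Usage with a hand-written result in the SPARQL JSON results layout.
const sampleResult = {
  results: { bindings: [{ abstract: { type: "literal", value: "Agra is a city..." } }] },
};
console.log(buildAbstractQueryUrl("Agra", "en"));
console.log(firstAbstract(sampleResult));
```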

score 2 · Accepted Answer

The abstract.xml.gz dump sounds like what you want.

Answered 2011-12-18T22:35:10.570

score 1 · Accepted Answer

If you are just looking for the text, which you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw

score 1 · Accepted Answer

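My approach was as follows (in PHP):

$url = "whatever_you_need"; // the search term

// URL-encode the term so spaces and special characters survive the request
$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search=' . urlencode($url));
// Turn any "U+XXXX" sequences into HTML entities, then decode them to UTF-8
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');

$utf8html might need further cleaning, but that's basically it.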

score 1 · Accepted Answer

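I tried Michael Rapadas's and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. Like here:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

Note that I truncated the response with exsentences=1

Apparently "title normalization" was not working correctly:

Title normalization converts page titles to their canonical form. This means capitalizing the first character, replacing underscores with spaces, and changing the namespace to the localized form defined for that wiki. Title normalization is done automatically, regardless of which query modules are used. However, any trailing line breaks in page titles (\n) will cause odd behavior and they should be stripped out first.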
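A local approximation of the quoted rules makes the limitation visible. This is only a sketch of the description above, not MediaWiki's actual implementation: normalizeTitle is a hypothetical name, and namespace localization is left out entirely.

```javascript
// Approximate the quoted normalization rules on the client side:
// strip trailing line breaks, turn underscores into spaces, and
// capitalize only the first character. Namespaces are not handled.
function normalizeTitle(raw) {
  const cleaned = raw.replace(/[\r\n]+$/, "").replace(/_/g, " ").trim();
  if (cleaned.length === 0) {
    return cleaned;
  }
  return cleaned.charAt(0).toUpperCase() + cleaned.slice(1);
}

console.log(normalizeTitle("led_zeppelin\n")); // "Led zeppelin"
```

Only the first character is capitalized, so "led zeppelin" becomes "Led zeppelin" rather than "Led Zeppelin"; whether the lookup then succeeds depends on a redirect existing for that spelling.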

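I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

So, because I really wanted the very first paragraph of a well-known and well-defined search (no risk of fetching information from another article), I did it like this: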

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

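Note that in this case I did the truncation with limit=1

This way:

I can access the response data very easily.
The response is quite small.

But you still have to be careful with the capitalization of your search.

More info: https://www.mediawiki.org/wiki/API:Opensearch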
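To show why the response is easy to access, here is a sketch of unpacking it. It assumes the four-element array layout of the opensearch JSON format (search term, titles, descriptions, URLs); the helper names are invented, the sample data is hand-written, and on some wikis the description entries may come back empty.

```javascript
// Build the opensearch URL used above for a search term and hit limit.
function buildOpenSearchUrl(term, limit) {
  const params = new URLSearchParams({
    action: "opensearch",
    search: term,
    limit: String(limit),
    format: "json",
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// Unpack the first hit from an opensearch response.
// Assumed layout: [term, [titles], [descriptions], [urls]].
function firstOpenSearchHit(response) {
  if (!Array.isArray(response) || response.length < 4) {
    return null;
  }
  const titles = Array.isArray(response[1]) ? response[1] : [];
  const descriptions = Array.isArray(response[2]) ? response[2] : [];
  const urls = Array.isArray(response[3]) ? response[3] : [];
  if (titles.length === 0) {
    return null;
  }
  return {
    title: titles[0],
    description: descriptions[0] || "",
    url: urls[0] || "",
  };
}

// Usage with a hand-written response in the assumed layout.
const sampleSearch = [
  "led zeppelin",
  ["Led Zeppelin"],
  ["Led Zeppelin were an English rock band."],
  ["https://en.wikipedia.org/wiki/Led_Zeppelin"],
];
console.log(buildOpenSearchUrl("led zeppelin", 1));
console.log(firstOpenSearchHit(sampleSearch));
```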
