php - Parsing the first paragraph of a Wikipedia article?

Question

Possible duplicate:
How to use the Wikipedia API to get content using PHP?

How can I get the first paragraph of a Wikipedia article using the MediaWiki API?

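This is mostly an XML-related question.

I am trying to do this with the MediaWiki API.

I can get the response in XML format (I can switch to JSON if that is easier), and the response contains all the content I need. Example: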

http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=War%20and%20Peace&prop=revisions&rvprop=content&format=xmlfm

I used xmlfm here so that the output is formatted for reading. What I am doing in PHP:

$request = "http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=War%20and%20Peace&prop=revisions&rvprop=content&format=xml";

$response = @file_get_contents($request);

$wxml = simplexml_load_string($response);

var_dump($wxml);

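This dumps the entire XML. My question is: how do I get the first paragraph out of it?

I can parse it out of the full article, so what I am really asking is: how do I get the article text from this XML? Of course, if there is a way to go straight to the first paragraph, that would be best.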

score 5 · Accepted Answer

I would definitely say this is what you are looking for.

To retrieve everything in the first section (not just the first paragraph):

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'http://en.wikipedia.org/w/api.php?action=parse&page=Baseball&format=json&prop=text&section=0';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // get the main text content of the query (it's parsed HTML)

// pattern matching each paragraph (only bare <p> tags, without attributes)
$pattern = '#<p>(.*?)</p>#s'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match_all($pattern, $content, $matches))
{
    // $matches[0] holds every matched paragraph, including the wrapping <p> tags
    // print implode("\n\n", $matches[0]);
    print strip_tags(implode("\n\n",$matches[1])); // all paragraphs of the section, without the HTML tags
}
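
If only the first paragraph is wanted, one option is to walk the parsed HTML with PHP's DOM extension instead of a regex. The following is a minimal sketch, not part of the original answer: firstParagraphText() is a hypothetical helper name, and the sample HTML is made up for illustration. It assumes the HTML is UTF-8 and that the first non-empty <p> is the paragraph you are after. Unlike the '#<p>(.*?)</p>#s' pattern, it also matches <p> tags that carry attributes.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the plain text of the first non-empty <p> element in an
// HTML fragment, or null if there is none.
function firstParagraphText(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The leading XML declaration is a common way to make loadHTML() treat
    // the input as UTF-8; the <div> wrapper gives the fragment a single root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text; // first paragraph that actually contains text
        }
    }
    return null;
}

// Made-up sample standing in for $content from the API response.
$sample = '<p class="mw-empty-elt"></p>'
        . '<p><b>Baseball</b> is a bat-and-ball sport.</p>'
        . '<p>Second paragraph.</p>';
print firstParagraphText($sample) . "\n";
```

With the code above, this would be called as firstParagraphText($content) once $content has been read from the JSON response.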
