php - Scraping an article excerpt in PHP, like Readability

Question

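I have seen this question, but it does not really give me what I am looking for. The answers there were either to lift the text from the meta description tag, or to generate an excerpt from an article whose body you already have.

What I want to do is actually grab the first few sentences of the article, the way Readability does. What is the best way to do this? HTML parsing? Here is what I am currently using, but it is not very reliable.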

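function guessExcerpt($url) {
    $html = file_get_contents_curl($url);
    if ($html === false || $html === '') {
        return null; // request failed or returned an empty body
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup

    $metas = $doc->getElementsByTagName('meta');
    $description = null; // stays null when the page has no description tag

    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if($meta->getAttribute('name') == 'description')
            $description = $meta->getAttribute('content');

    }

    return $description;
}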

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

score 10 · Accepted Answer

Here is a PHP port of Readability: https://github.com/andreskrey/readability.php. Give it a try. The extraction results are similar to Readability's, since it implements Readability's algorithm.

require 'lib/Readability.inc.php';

$html = file_get_contents_curl($url);
$html_input_charset = 'utf-8'; // charset of the fetched page (assumed here)

$Readability     = new Readability($html, $html_input_charset); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$title   = $ReadabilityData['title'];
$content = $ReadabilityData['content'];

You can then use the first few sentences of $content as the excerpt.
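As a rough sketch of that last step, something like the following could turn the extracted HTML into a short plain-text excerpt. The function name firstSentences and its $maxSentences parameter are made up for this example and are not part of any Readability port. The sentence split is deliberately naive (it breaks after ., ! or ? followed by whitespace), so abbreviations such as "e.g." will be cut in the wrong place.

```php
<?php
// Hypothetical helper, not part of any Readability library:
// returns the first $maxSentences sentences of an HTML fragment as plain text.
function firstSentences(string $contentHtml, int $maxSentences = 3): string
{
    // Add a space after block-level closing tags and <br>, so that text from
    // adjacent paragraphs does not run together once the tags are stripped.
    $spaced = preg_replace(
        '#</(?:p|div|li|h[1-6]|blockquote)>|<br\s*/?>#i',
        '$0 ',
        $contentHtml
    ) ?? $contentHtml;

    // Drop the markup, decode entities, and collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // Naive sentence split: break on whitespace that follows ., ! or ?
    $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return implode(' ', array_slice($sentences, 0, $maxSentences));
}
```

With the variables from the snippet above, the call would be $excerpt = firstSentences($content, 3);. For anything beyond a quick excerpt, a proper sentence tokenizer would be more robust than this regular expression.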

