php - Removing some elements using Simple Html Dom

Question

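This is the page I am trying to parse using Simple Html Dom. I have finished about 90% of the functionality, but since this is my first time using the library, I am not sure how to do this last part.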

I want to scrape the text of each news item, but the text sits directly inside the <p> element, so using something like ->innertext grabs everything inside it, including the link.

Here is what I have tried:

<h1>Scraper Noticias</h1>

<?php

include('simple_html_dom.php');

class News {
    var $image;
    var $fechanoticia;
    var $title;
    var $description;
    var $sourceurl;

    function get_image( ) {
        return $this->image;
    }

    function set_image ($new_image) {
        $this->image = $new_image;
    }

    function get_fechanoticia( ) {
        return $this->fechanoticia;
    }

    function set_fechanoticia ($new_fechanoticia) {
        $this->fechanoticia = $new_fechanoticia;
    }

    function get_title( ) {
        return $this->title;
    }

    function set_title ($new_title) {
        $this->title = $new_title;
    }

    function get_description( ) {
        return $this->description;
    }

    function set_description ($new_description) {
        $this->description = $new_description;
    }

    function get_sourceurl( ) {
        return $this->sourceurl;
    }

    function set_sourceurl ($new_sourceurl) {
        $this->sourceurl = $new_sourceurl;
    }
}

// Create DOM from URL or file
$html = file_get_html('http://www.uvm.cl/noticias_mas.shtml');

$parsedNews = array();

// Find all news items.
foreach($html->find('#cont2 p') as $element) {

    $newItem = new News;

    // Parse the news item's thumbnail image.
    foreach ($element->find('img') as $image) {
        $newItem->set_image($image->src);
        //echo $newItem->get_image() . "<br />";
    }

    // Parse the news item's post date.
    foreach ($element->find('span.fechanoticia') as $fecha) {
        $newItem->set_fechanoticia($fecha->innertext);
        //echo $newItem->get_fechanoticia() . "<br />";
    }

    // Parse the news item's title.
    foreach ($element->find('a') as $title) {
        $newItem->set_title($title->innertext);
        //echo $newItem->get_title() . "<br />";
    }

    // Parse the news item's source URL link.
    foreach ($element->find('a') as $sourceurl) {
        $newItem->set_sourceurl("http://www.uvm.cl/" . $sourceurl->href);
    }

    // Parse the news items' description text.
    echo $element->innertext; // This is the entire content of the <p> tag. How can I get just the text, not the link?

} 

?>

score 2 · Accepted Answer

This is the solution I found. I would be glad if someone could improve the code.

<h1>Scraper Noticias</h1>

<?php

include('simple_html_dom.php');

class News {
    var $image;
    var $fechanoticia;
    var $title;
    var $description;
    var $sourceurl;

    function get_image( ) {
        return $this->image;
    }

    function set_image ($new_image) {
        $this->image = $new_image;
    }

    function get_fechanoticia( ) {
        return $this->fechanoticia;
    }

    function set_fechanoticia ($new_fechanoticia) {
        $this->fechanoticia = $new_fechanoticia;
    }

    function get_title( ) {
        return $this->title;
    }

    function set_title ($new_title) {
        $this->title = $new_title;
    }

    function get_description( ) {
        return $this->description;
    }

    function set_description ($new_description) {
        $this->description = $new_description;
    }

    function get_sourceurl( ) {
        return $this->sourceurl;
    }

    function set_sourceurl ($new_sourceurl) {
        $this->sourceurl = $new_sourceurl;
    }
}

// Create DOM from URL or file
$html = file_get_html('http://www.uvm.cl/noticias_mas.shtml');

$parsedNews = array();

// Find all news items.
foreach($html->find('#cont2 p') as $element) {

    $newItem = new News;

    // Parse the news item's thumbnail image.
    foreach ($element->find('img') as $image) {
        $newItem->set_image($image->src);
        //echo $newItem->get_image() . "<br />";
    }

    // Parse the news item's post date.
    foreach ($element->find('span.fechanoticia') as $fecha) {
        $newItem->set_fechanoticia($fecha->innertext);
        //echo $newItem->get_fechanoticia() . "<br />";
    }

    // Parse the news item's title.
    foreach ($element->find('a') as $title) {
        $newItem->set_title($title->innertext);
        //echo $newItem->get_title() . "<br />";
    }

    // Parse the news item's source URL link.
    foreach ($element->find('a') as $sourceurl) {
        $newItem->set_sourceurl("http://www.uvm.cl/" . $sourceurl->href);
    }

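    // Parse the news item's description text.
    // Blank out the link, date and image so only the description remains inside the <p>.
    foreach ($element->find('a') as $link) {
        $link->outertext = '';
    }

    foreach ($element->find('span') as $span) {
        $span->outertext = '';
    }

    foreach ($element->find('img') as $img) {
        $img->outertext = '';
    }

    $newItem->set_description(trim($element->innertext));
    echo $newItem->get_description();

    // Keep the parsed item so it can be used after the loop.
    $parsedNews[] = $newItem;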

} 

?>
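
The removal step in the answer above can probably be shortened. The following is only a sketch of the same idea, run against a made-up fragment instead of the live page: the `$sample` markup and the `$descriptions` array are assumptions for illustration and are not taken from the real site. It relies on Simple Html Dom accepting a comma-separated selector in `find()`, and on `simple_html_dom.php` being available next to the script, as in the original code.

```php
<?php
// Assumes simple_html_dom.php sits next to this script, as in the answer above.
include('simple_html_dom.php');

// Hypothetical sample markup, invented to mimic one news item; the real page may differ.
$sample = '<div id="cont2"><p><img src="img/foto.jpg">'
        . '<span class="fechanoticia">01/01/2013</span>'
        . '<a href="noticia.shtml">Titulo</a> Texto de la noticia.</p></div>';

$html = str_get_html($sample);

$descriptions = array();

foreach ($html->find('#cont2 p') as $element) {
    // Blank out every child that should not appear in the description.
    // The comma-separated selector matches all three tag types in one call.
    foreach ($element->find('a, span, img') as $unwanted) {
        $unwanted->outertext = '';
    }

    // innertext is now assembled from the remaining children only.
    $descriptions[] = trim($element->innertext);
}

print_r($descriptions);
```

Note that the title, date and image have to be read before this loop runs, because blanking `outertext` also removes them from what the parent element returns afterwards.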

score 0 · Accepted Answer

Use innertext instead of outertext:

    foreach ($element->find('a') as $sourceurl) {
        echo $sourceurl->innertext . "<br />";
    }
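
To make the difference between these properties concrete, here is a small sketch on a made-up one-link fragment (the markup is an assumption, not taken from the page in the question). It reads `outertext`, `innertext` and `plaintext` from the same element so the three can be compared side by side.

```php
<?php
// Assumes simple_html_dom.php sits next to this script.
include('simple_html_dom.php');

// Hypothetical fragment with markup nested inside the link.
$html = str_get_html('<p><a href="noticia.shtml"><b>Titulo</b></a></p>');

// find() with an index returns a single element instead of an array.
$link = $html->find('a', 0);

$outer = $link->outertext; // the element itself, including the <a> tag
$inner = $link->innertext; // only the markup inside the <a> tag
$plain = $link->plaintext; // the text with all tags stripped

echo $outer . "\n";
echo $inner . "\n";
echo $plain . "\n";
```

For a news title, `plaintext` is likely the safer choice when the link may contain nested tags, while `innertext` keeps that inner markup.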
