php - How to parse text and images from complex XML

Question

I hope you can help me with this. The XML file looks like this:

<channel><item>
<description>
<div>  <a href="http://image.com">
<span>   
<img src="http://image.com" /> 
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc... 
</div>
</description>
</item></channel>

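I can get the contents of the description tag, but when I do, I get the whole structure, with a lot of CSS in it, and I don't want that. What I really need is to parse only the href link and the Lorem Ipsum text. I'm trying SimpleXML, but I can't figure it out; it looks too complicated. Any ideas?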

Edit: the code I use to parse the XML

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

score 1 · Accepted Answer

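This XML looks very much like an RSS or Atom feed (or an extract of one). The description node would usually be escaped, or placed inside a section marked with <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text even if they contain <, >, or &.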

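Your sample doesn't show that, but if echo is giving you the whole content, including img tags and so on, then that is what is happening, and your question is similar to "Trying to Parse Only the Images from an RSS Feed": you need to grab the whole content of description and parse it as its own document.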
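To illustrate the escaped case, here is a minimal sketch. The feed string, the rss wrapper element and the variable names are assumptions made up for this example, not taken from your real feed. It shows that a CDATA-wrapped description has no child elements as far as SimpleXML is concerned, and that casting it to a string gives back the raw HTML, tags included.

```php
<?php
// Hypothetical feed extract: the description is wrapped in CDATA,
// so SimpleXML sees a single block of text, not child elements.
$feed = <<<'XML'
<rss><channel><item>
<description><![CDATA[<div><a href="http://image.com"><span><img src="http://image.com" /></span></a>Lorem Ipsum is simply dummy text</div>]]></description>
</item></channel></rss>
XML;

$xml = new SimpleXMLElement($feed);
$description = $xml->channel->item->description;

// No child elements are visible, because the markup is plain text here.
$childCount = count($description->children());

// Casting to string returns the raw HTML, tags and all.
$rawHtml = (string) $description;

echo $childCount, "\n";
echo $rawHtml, "\n";
```

If your feed behaves like this, $description->div will never work, and the string in $rawHtml has to be handed to an HTML parser as a separate step.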

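If for some reason the HTML is not escaped, and really is included as a bunch of child nodes inside the XML, then you can access the linked URL directly (assuming the structure is always consistent):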

echo (string)$post->description->div->a['href'];

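As for the text, when you cast an element to a string with (string), SimpleXML concatenates all the text content of that particular element (but not text from within its children). echo performs this cast automatically, although you will presumably want to do more than echo eventually.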

In your example, the text you want is inside the first (and only) div, so this would display it:

echo (string)$post->description->div;
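Putting the two accessors together, a self-contained sketch might look like this. It assumes the HTML really is present as child nodes, and it wraps your sample in an rss root element (an assumption, made so that $xml->channel->item resolves the way it does in your code). The trim() call is my addition, to drop the whitespace that surrounds the text inside the div.

```php
<?php
// The sample from the question, wrapped in a hypothetical <rss> root
// so that ->channel->item matches the question's own loop.
$feed = <<<'XML'
<rss><channel><item>
<description>
<div>  <a href="http://image.com">
<span>
<img src="http://image.com" />
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>
</description>
</item></channel></rss>
XML;

$xml = new SimpleXMLElement($feed);

foreach ($xml->channel->item as $post) {
    $div = $post->description->div;

    // Attribute access uses array syntax on the element.
    $href = (string) $div->a['href'];

    // The string cast only picks up text directly inside the <div>,
    // not text nested in <a> or <span>.
    $text = trim((string) $div);

    echo $href, "\n", $text, "\n";
}
```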

However, you mention "lots of CSS", which I assume you have left out of your example for simplicity, so I can't tell how consistent your real content is.

score 0 · Accepted Answer

Here is the final code that answers the question.

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ( $descriptions as $description_node ) {

    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );

    $description_sxml = simplexml_import_dom( $description_dom );

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ( $imgs as $image ) {
        echo (string)$image['src'];
    }

    foreach ( $text as $t ) {
        echo (string)$t;
    }
}

This is IMSoP's code; I added $text = $description_sxml->xpath('//div'); to read the text inside the <div>.

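In my case, some posts in the XML contain several <div> and <span> tags, so to parse all of them you need to add another ->xpath call for <span>, or an if ... else statement that echoes the <span> content instead when there is nothing inside the <div>. Thanks for your replies.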
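The <span> fallback described above could be sketched roughly like this. The two sample HTML strings and the $results array are hypothetical, standing in for description contents from a real feed, and libxml_use_internal_errors() is only there to keep the HTML parser quiet about fragments that are not complete documents.

```php
<?php
// Hypothetical description contents: the first has text directly in the
// <div>, the second only has text inside a <span>.
$samples = [
    '<div><a href="http://image.com"><img src="http://image.com" /></a>Text in div</div>',
    '<div><span>Text in span</span></div>',
];

// Collect parser warnings internally so they are not printed.
libxml_use_internal_errors(true);

$results = [];
foreach ($samples as $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $sxml = simplexml_import_dom($dom);

    // First choice: text sitting directly inside a <div>.
    $text = '';
    foreach ($sxml->xpath('//div') as $div) {
        $text = trim((string) $div);
        if ($text !== '') {
            break;
        }
    }

    // Fallback: no direct <div> text, so look inside <span> instead.
    if ($text === '') {
        foreach ($sxml->xpath('//span') as $span) {
            $text = trim((string) $span);
            if ($text !== '') {
                break;
            }
        }
    }

    $results[] = $text;
}

print_r($results);
```

Whether the first non-empty match is the right one to keep depends on how the real posts are laid out, so treat the break statements as a placeholder for whatever rule fits your feed.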

score 0 · Accepted Answer

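This is going to be complicated. What you have there is not XML but HTML. One difference is that in XML a tag can't contain both another tag and text. That is why I would use PHP's DOM (which I haven't used yet, but it is similar to plain JavaScript).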

Here is what I hacked together (untested):

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// this tries to get an a element which has a href and returns that href
function getAHref ( $doc ) {
    // now get all a elements to get the one with a href
    $aElements = $doc->getElementsByTagName( "a" );
    foreach ( $aElements as $a ) {
        // does this element have an href? then return it
        if ( $a->hasAttribute( "href" ) ) {
            return $a->getAttribute( "href" );
        }
    }
    // failed? return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in anything so this is going to be difficult
function getTextFromNode ( $doc ) {
    // get and loop all divs (assuming the text is always a child of a div)
    $divs = $doc->getElementsByTagName( "div" ); // do we know it's always in that div?
    foreach ( $divs as $div ) {
        // also loop all child nodes to get the text nodes
        foreach ( $div->childNodes as $child ) {
            // is this a text node?
            if ( $child->nodeType == XML_TEXT_NODE ) {
                // is there something in it (new lines count as text nodes)
                if ( trim( $child->nodeValue ) != "" ) {
                    // *pfew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // failed? return false
    return false;
}
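
For comparison, the same two lookups can also be written with DOMXPath instead of hand-written loops. This is a swapped-in alternative rather than the code above, and the HTML string and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical description HTML, modelled on the sample in the question.
$html = '<div>  <a href="http://image.com"><span><img src="http://image.com" /></span></a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>';

// Keep parser warnings about the HTML fragment out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// First <a> element that carries an href attribute, or false if none.
$link = $xpath->query('//a[@href]')->item(0);
$href = $link ? $link->getAttribute('href') : false;

// First text node directly inside a <div> that is not just whitespace.
$textNode = $xpath->query('//div/text()[normalize-space()]')->item(0);
$text = $textNode ? trim($textNode->nodeValue) : false;

var_dump($href, $text);
```

The XPath predicates do the filtering that the nested foreach and if statements do by hand, which keeps the code short but assumes the same thing: that the text you want is a direct child of a div.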
