php - Parse HTML with PHP to get all h3s after an h2 and before the next h2

Question

I'm looking to find the first h2 in an article. Once it is found, I want every h3 up to the next h2. Rinse and repeat until all headings and subheadings have been found.

Before you immediately flag or close this as a duplicate parsing question, please note the question title: this is not about basic node retrieval. I've got that part down.

I'm using DOMDocument to parse the HTML: DOMDocument::loadHTML() to load it, then DOMDocument::getElementsByTagName() and DOMDocument::saveHTML() to retrieve the article's important headings.

My code is as follows:

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('h2') as $node) {
    $matches['heading-two'][] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches['heading-three'][] = $dom->saveHtml($node);
}
if($matches){
    $this->key_points = $matches;
}

This gives me output like the following:

array(
    'heading-two' => array(
        '<h2>Here is the first heading two</h2>',
        '<h2>Here is the SECOND heading two</h2>'
    ),
    'heading-three' => array(
        '<h3>Here is the first h3</h3>',
        '<h3>Here is the second h3</h3>',
        '<h3>Here is the third h3</h3>',
        '<h3>Here is the fourth h3</h3>',
    )
);

I'm looking for something more like this:

array(
    '<h2>Here is the first heading two</h2>' => array(
        '<h3>Here is an h3 under the first h2</h3>',
        '<h3>Here is another h3 found under first h2, but after the first h3</h3>'
    ),
    '<h2>Here is the SECOND heading two</h2>' => array(
        '<h3>Here is an h3 under the SECOND h2</h3>',
        '<h3>Here is another h3 found under SECOND h2, but after the first h3</h3>'
    )
);

I'm not exactly looking for finished code (though if you feel it would help others, go ahead), but rather for guidance or advice in the right direction on building a nested array like the one above.

score 10 · Accepted Answer

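I'm assuming all headings sit at the same level in the DOM, so every h3 is a sibling of its h2. Under that assumption, you can iterate over the siblings of each h2 until you reach the next h2: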

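foreach($dom->getElementsByTagName('h2') as $node) {
    // Use the serialized h2 as the key for its group of h3s.
    $key = $dom->saveHTML($node);
    $matches[$key] = array();
    // Step through the following siblings; stop at the next h2 or when
    // nextSibling returns null (end of the parent element).
    while(($node = $node->nextSibling) && $node->nodeName !== 'h2') {
        if($node->nodeName === 'h3') {
            $matches[$key][] = $dom->saveHTML($node);
        }
    }
}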
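
For anyone who wants to try the sibling walk end to end, here is a minimal self-contained sketch. The sample markup in $content is hypothetical, and it assumes, as the answer does, that every heading is a direct sibling of the others. It also uses a separate $sibling variable instead of reassigning the loop variable; that is only a readability choice.

```php
<?php
// Hypothetical sample markup: every heading is a direct child of <body>.
$content = <<<HTML
<h2>First</h2>
<p>Intro text</p>
<h3>First A</h3>
<h3>First B</h3>
<h2>Second</h2>
<h3>Second A</h3>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($content);

$matches = array();
foreach ($dom->getElementsByTagName('h2') as $h2) {
    $key = $dom->saveHTML($h2);
    $matches[$key] = array();

    // Walk forward through the siblings until the next h2 or the end of the parent.
    // Text nodes and other elements (such as the <p>) are simply skipped.
    $sibling = $h2->nextSibling;
    while ($sibling !== null && $sibling->nodeName !== 'h2') {
        if ($sibling->nodeName === 'h3') {
            $matches[$key][] = $dom->saveHTML($sibling);
        }
        $sibling = $sibling->nextSibling;
    }
}

print_r($matches);
```

With this input the result has two keys, one per h2, holding two and one h3 entries respectively. If the h3s are wrapped in other elements (a div or section, say), they are no longer siblings of the h2 and this walk will not see them.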

score 2 · Answer

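This can also work by getting the line number at which each node element was found in the document and storing it as the array key. Then ksort($matches) puts each node element in the array back into its original line order from the HTML document.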

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);

foreach($dom->getElementsByTagName('h2') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}

ksort($matches);

...or, as slightly tighter code:

foreach(array('h2', 'h3') as $tag) {
    foreach($dom->getElementsByTagName($tag) as $node) {
        $matches[$node->getLineNo()] = $dom->saveHtml($node);
    }
}

ksort($matches);
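
The line-sorted list is still flat, so one more pass is needed to get the nested shape the question asks for. The sketch below is one possible way to do that fold, not part of the original answer. The sample markup and the names $ordered, $nested and $current are hypothetical, and it assumes each heading sits on its own source line; two headings on the same line would share a key and overwrite each other.

```php
<?php
// Hypothetical sample markup; each heading is on its own line so that the
// line numbers used as array keys are unique.
$content = <<<HTML
<h2>First</h2>
<h3>First A</h3>
<div><h3>First B (nested in a div)</h3></div>
<h2>Second</h2>
<h3>Second A</h3>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($content);

// Collect both heading levels keyed by source line, then restore document order.
$ordered = array();
foreach (array('h2', 'h3') as $tag) {
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $ordered[$node->getLineNo()] = $node;
    }
}
ksort($ordered);

// Fold the flat list: each h2 opens a group, each h3 joins the current group.
$nested = array();
$current = null;
foreach ($ordered as $node) {
    $html = $dom->saveHTML($node);
    if ($node->nodeName === 'h2') {
        $current = $html;
        $nested[$current] = array();
    } elseif ($current !== null) {
        // An h3 that appears before any h2 has no group and is skipped.
        $nested[$current][] = $html;
    }
}

print_r($nested);
```

Because the ordering comes from line numbers rather than sibling links, this variant also picks up an h3 that is wrapped in another element, at the cost of depending on how the source HTML is laid out across lines.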
