php - Can't get the content inside HTML

Question

I'm trying to extract HTML content from a website. I only need the content inside the <body> tag.

    // $validlink is a link with an .htm extension; the source is rather large
    // and contains 24,000 lines of HTML code

    $thehtml = file_get_contents($validlink);
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);

What else can I do? $thehtml is empty. I'm trying to insert this into a WordPress post, but $thehtml is empty for some strange reason. Could it be a timeout issue or something like that?

If I just output file_get_contents($validlink); there is no timeout issue at all. For some reason it just can't find the BODY.

Another possible solution would be to grab the content between the first div and the last div found in the document.

score 0 · Accepted Answer

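    $thehtml = file_get_contents($validlink);
    // preg_match() returns the number of matches (0 or 1); the matched text goes into $matches
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml, $matches);
    // $matches[0] is the full match, so it still includes the <body> tags themselves
    $thehtml = $matches[0];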

score 0 · Accepted Answer

Use strpos() to get the string positions of both the opening and closing tags, then use a substring function, i.e. substr(), with those positions.
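A minimal sketch of that strpos()/substr() idea, for illustration only. The function name extractBodyWithStrpos is hypothetical, and the case-insensitive stripos() is swapped in for strpos() so that <BODY> is found too. It assumes the first '>' after '<body' closes the opening tag, which would not hold if an attribute value contained '>'.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text between
// the opening <body ...> tag and the first </body>, or null if either is missing.
function extractBodyWithStrpos(string $html): ?string
{
    // stripos() is the case-insensitive variant of strpos().
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // The opening tag may carry attributes, so find the '>' that ends it.
    $openEnd = strpos($html, '>', $open);
    if ($openEnd === false) {
        return null;
    }
    $start = $openEnd + 1;

    // Look for the closing tag only after the opening tag.
    $close = stripos($html, '</body>', $start);
    if ($close === false) {
        return null;
    }

    return substr($html, $start, $close - $start);
}

$sample = '<html><BODY class="x"><p>Hello</p></BODY></html>';
echo extractBodyWithStrpos($sample), "\n";
```

Checking each position against false with === matters here, because a match at offset 0 is a valid result that a loose comparison would treat as a failure.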

score 0 · Accepted Answer

The correct code is:

    $thehtml = file_get_contents($validlink);
    preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
    // $matches[1] is the first capture group: only the content between the tags
    $thehtml = $matches[1];

However, I'd recommend using a DOM parser instead.
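One possible way to do this with a DOM parser is PHP's bundled DOMDocument class, sketched below. The answer does not name a specific parser, so this choice is an assumption, as is the hypothetical function name extractBodyWithDom. It also assumes the dom extension is available. Note that loadHTML() treats input as ISO-8859-1 unless the page declares its charset, so non-ASCII pages may need extra handling.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML
// of the <body> element, or null when the input is empty or has no body.
function extractBodyWithDom(string $html): ?string
{
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid markup; collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // saveHTML($node) serializes a single node; concatenating the children
    // yields the body's inner HTML without the <body> tags themselves.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo extractBodyWithDom('<html><body class="x"><p>Hello</p></body></html>'), "\n";
```

With the question's variables this would be called as extractBodyWithDom((string) file_get_contents($validlink)), where the cast turns a failed download (false) into an empty string that the helper rejects.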
