php - Get the HTML source of an external web page without header/encoding

Question

I just want to know whether it is possible to extract the (UTF-8) encoded content of an HTML file when the header does not declare the encoding.

My specific case is this website:

http://www.metal-archives.com/band/discography/id/203/tab/all

I want to extract all of the information, but as you can see, this word, for example, comes out garbled:

MotÃ¶rhead

I tried file_get_html, htmlentities, utf8_decode and utf8_encode, and combined them with various options, but I cannot find a solution...

Edit:

I just want to display the same website in the correct format using this simple code:

$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);

I want the correctly formatted content as a string or a DOM object, so that I can extract some parts of it later.

Edit 2:

file_get_html is a function from the "simplehtmldom" library:

http://simplehtmldom.sourceforge.net/

It has this code:

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

score 2 · Accepted Answer

The content type of the URL

http://www.metal-archives.com/band/discography/id/203/tab/all

is:

Content-Type: text/html

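With no charset parameter, this defaults to ISO-8859-1. The content is actually UTF-8, however, so UTF-8 should be declared instead. Change the Content-Type so that the encoding is signalled correctly: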

Content-Type: text/html; charset=utf-8

See: Setting the HTTP charset parameter
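To illustrate why the word comes out garbled, here is a minimal sketch (not part of the original answer) that reproduces the effect on a hard-coded string. It assumes PHP's iconv extension is available; the variable names are purely illustrative.

```php
<?php
// "Motörhead" as UTF-8 bytes: the "ö" is the two bytes 0xC3 0xB6.
$utf8 = "Mot\xC3\xB6rhead";

// Treat each of those bytes as one ISO-8859-1 character and re-encode to UTF-8.
// This is what effectively happens when UTF-8 data is read as ISO-8859-1.
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

// Reversing the conversion recovers the original bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $garbled);

echo bin2hex($garbled), "\n";          // hex dump of the garbled form
echo ($repaired === $utf8 ? "round trip ok" : "round trip failed"), "\n";
```

The garbled form is the same "Ã" plus "¶" pair shown in the question, which suggests the page data itself is fine and only the declared encoding is wrong.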

score 1 · Accepted Answer

header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');

As long as you send it as UTF-8, the raw data works fine.
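If the content is needed as a DOM object rather than echoed back, the same idea (declaring UTF-8 explicitly) can be applied inside the document. The following is only a sketch under assumptions: it uses PHP's built-in DOMDocument instead of simplehtmldom, a hard-coded string stands in for the fetched page, and the prepended meta tag is a common workaround rather than something the answers above prescribe.

```php
<?php
// Hypothetical stand-in for the fetched page body: UTF-8 bytes, no charset declared.
$html = "<html><body><a class=\"band\">Mot\xC3\xB6rhead</a></body></html>";

// Prepend a charset declaration so the HTML parser treats the bytes as UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($hint . $html);
libxml_clear_errors();

// DOM text values are returned as UTF-8 strings.
$bandName = $dom->getElementsByTagName('a')->item(0)->textContent;
echo $bandName, "\n";
```

When echoing the result in a browser, the Content-Type header with charset=utf-8 is still needed, as shown above.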

score 0 · Accepted Answer

Try using html_entity_decode http://php.net/manual/en/function.html-entity-decode.php (the source of that page contains encoded characters).
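A minimal sketch of that suggestion, using a hard-coded sample string (an assumption for illustration, not taken from the actual page source):

```php
<?php
// Sample text containing HTML entities, as might appear in a page's source.
$encoded = 'Mot&ouml;rhead &amp; friends';

// Decode the entities into real characters, producing a UTF-8 string.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
```

Passing 'UTF-8' explicitly as the third argument makes the output encoding independent of the default_charset setting.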
