php - PHP DOM UTF-8 problem

Question

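First of all, my database uses Windows-1250 as its native character set, and I output the data as UTF-8. Throughout the site I use the iconv() function to convert Windows-1250 strings to UTF-8 strings, and this works perfectly.

The problem appears when I use PHP DOM to parse HTML stored in the database (the HTML is output from a WYSIWYG editor and is not a valid document: it has no html, head, or body tags).

The HTML looks something like this, for example: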

<p>Hello</p>

Here is the method I use to parse that HTML from the database:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

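The output of the method above is garbage, with every special character replaced by strange sequences such as ÃšÄ�.

One more thing: it works on my development server.

It does not work on the production server, though.

Any suggestions?

PHP version on the production server: PHP Version 5.2.0RC4-dev

PHP version on the development server: PHP Version 5.2.13

Update:

I am working on a solution myself, inspired by this PHP bug report (which is not actually a bug): http://bugs.php.net/bug.php?id=32547

This is my proposed solution. I will try it tomorrow and report whether it works.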

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

score 5 · Accepted Answer

Your "hack" makes no sense.

You convert the Windows-1250 HTML to UTF-8 and then prepend <?xml encoding="UTF-8">. That will not work. For HTML input, the DOM extension:

takes the character set specified in a meta http-equiv for "Content-Type";
otherwise assumes ISO-8859-1.
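
To see the two rules side by side, the following minimal sketch parses the same UTF-8 bytes once without any charset hint and once with a meta http-equiv element. The variable names are illustrative, and the script file itself is assumed to be saved as UTF-8. How the unhinted case comes out can depend on the libxml2 build, so only the hinted result is annotated.

```php
<?php
// Sketch: parse the same UTF-8 bytes with and without a charset hint.
libxml_use_internal_errors(true); // fragments are not valid documents; keep warnings quiet

$utf8 = "<p>\xC3\xA1</p>"; // "á" as UTF-8 bytes

// No charset declared: the parser has to guess the encoding.
$plain = new DOMDocument();
$plain->loadHTML($utf8);
$withoutMeta = $plain->getElementsByTagName('p')->item(0)->textContent;

// Charset declared through a meta http-equiv element placed before the fragment.
$hinted = new DOMDocument();
$hinted->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">' . $utf8
);
$withMeta = $hinted->getElementsByTagName('p')->item(0)->textContent;

var_dump(bin2hex($withoutMeta)); // differs from c3a1 when the ISO-8859-1 fallback applies
var_dump(bin2hex($withMeta));    // string(4) "c3a1"
```

If the first dump shows more than two bytes, the parser treated each UTF-8 byte as a separate ISO-8859-1 character, which is the kind of garbage described in the question.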

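Instead, I suggest converting from Windows-1250 to ISO-8859-1 and not prepending anything.

EDIT: That suggestion is not very good, because Windows-1250 has characters that ISO-8859-1 does not. Since you are handling fragments without a meta element for the content type, you can add your own to force interpretation as UTF-8: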

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
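
Applied to the method from the question, the same idea might look like the sketch below. The function name parseFragment is hypothetical, and passing a node to saveHTML() assumes PHP 5.3.6 or later, so it would not run unchanged on the 5.2 production server. It also serializes the children of body directly instead of splitting the output on '<body>'.

```php
<?php
// Sketch only: parseFragment() is a hypothetical helper, not code from the question.
// Assumes the iconv and dom extensions are loaded and PHP >= 5.3.6 (saveHTML($node)).
function parseFragment($win1250Html)
{
    // Convert the stored Windows-1250 bytes to UTF-8 first.
    $utf8 = iconv('Windows-1250', 'UTF-8', $win1250Html);

    // Prepend a meta element so the HTML parser reads the input as UTF-8.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    $previous = libxml_use_internal_errors(true); // silence warnings for invalid markup
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser moves the meta element into <head>, so <body> holds only the fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// Simulate database content by converting a UTF-8 literal to Windows-1250.
$fragment = iconv('UTF-8', 'Windows-1250', "<p>\xC4\x84\xC3\xA1</p>"); // "Ąá"
echo parseFragment($fragment), "\n";
```

The attribute rewriting loops from the question would go between loadHTML() and the serialization step.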

score 1 · Accepted Answer

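Two solutions.

You can set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: If both of these are already set correctly, do the following:

Create a small page that contains a UTF-8 character.
Write the page the same way you already do.
Use Fiddler or Wireshark to examine the raw bytes transferred in the DEV and PROD environments. You can also use Fiddler/Wireshark to double-check the headers.

If you are sure the correct header is being sent, your best chance of finding the error is to start looking at the raw bytes. Identical bytes sent to an identical browser produce the same result, so you need to find out why they are not identical. Fiddler/Wireshark will help with that.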
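
The same byte-level check can also be done inside PHP before anything reaches the network. This is a minimal sketch. describeBytes is a hypothetical helper name, and the double-encoded value is produced on purpose by re-reading UTF-8 bytes as ISO-8859-1.

```php
<?php
// Sketch: dump the raw bytes of a string to see how it is really encoded.
// describeBytes() is a hypothetical helper name.
function describeBytes($s)
{
    // bin2hex() gives two hex digits per byte; chunk_split() adds a space after each pair.
    return trim(chunk_split(bin2hex($s), 2, ' '));
}

$correct = "\xC3\xA1";                             // "á" encoded as UTF-8
$doubled = iconv('ISO-8859-1', 'UTF-8', $correct); // same bytes misread as ISO-8859-1

echo describeBytes($correct), "\n"; // c3 a1
echo describeBytes($doubled), "\n"; // c3 83 c2 a1
```

Running this on the value just before it is echoed on both servers shows whether the bytes already differ at that point or only later in the response.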

score 0 · Accepted Answer

I had the same problem. My fix was to use Notepad++ and set the encoding of the PHP document to "UTF-8 without BOM". I hope this helps someone else.
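
A byte order mark at the start of a file is three bytes (EF BB BF) that editors usually do not display. As a rough sketch, a string read from a file could be checked for it, and stripped, like this. stripUtf8Bom is a hypothetical helper name.

```php
<?php
// Sketch: detect and strip a leading UTF-8 byte order mark (EF BB BF).
// stripUtf8Bom() is a hypothetical helper name.
function stripUtf8Bom($s)
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares only the first three bytes of the string with the BOM.
    if (strncmp($s, $bom, 3) === 0) {
        return substr($s, 3);
    }
    return $s;
}

var_dump(stripUtf8Bom("\xEF\xBB\xBF<p>Hello</p>") === '<p>Hello</p>'); // bool(true)
var_dump(stripUtf8Bom('<p>Hello</p>') === '<p>Hello</p>');             // bool(true)
```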
