php - How do I convert HTML character references (&#1507;) to regular UTF-8?

Question

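I have some Hebrew websites that contain character references like this: &#1504;&#1493;&#1507; (which renders as נוף).

I can only view these characters if I save the file as .html and view it with UTF-8 encoding.

If I try to open it as a regular text file, UTF-8 encoding does not show the proper output.

I noticed that when I open a text editor and write Hebrew in UTF-8, each character takes two bytes rather than four as in this example (&#1493;).

Any ideas whether this is UTF-16 or some other kind of UTF representation of the characters?

If possible, how can I convert them to normal characters?

I'm using the latest PHP version.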

score 6 · Accepted Answer

These are character references that refer to a character in ISO 10646 by specifying its code point in decimal (&#n;) or hexadecimal (&#xn;) notation.
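
To make the relationship concrete, here is a small sketch (assuming PHP 7.2+ with the mbstring extension, for mb_ord and the "\u{...}" escape) showing that the number in the reference is the Unicode code point itself, not a UTF-8 or UTF-16 byte sequence:

```php
<?php
// The Hebrew letter final pe, written with a Unicode escape.
$char = "\u{05E3}";

// The decimal number used in &#1507; is simply the code point.
echo mb_ord($char, 'UTF-8'), "\n";   // code point as a decimal number

// In UTF-8 the same character occupies two bytes.
echo strlen($char), "\n";            // byte length
echo bin2hex($char), "\n";           // raw UTF-8 bytes in hex
```

So the reference is an ASCII spelling of the code point, while the two bytes seen in the text editor are its UTF-8 encoding.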

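You can use html_entity_decode, which decodes such character references as well as the entity references for the entities defined for HTML 4, so other references like &lt;, &gt;, and &amp; are decoded too: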

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
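
As a quick illustration of that call, the sketch below uses a made-up input string ($input is only an example) to show what gets decoded and what ENT_NOQUOTES leaves alone:

```php
<?php
// Example input: three numeric references, two named entities, and quotes.
$input = '&#1504;&#1493;&#1507; &lt;b&gt; &quot;x&quot;';

// Numeric references and &lt;/&gt; are decoded; with ENT_NOQUOTES,
// &quot; is left untouched.
$decoded = html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');

echo $decoded, "\n";
```

Use ENT_QUOTES instead of ENT_NOQUOTES if quote entities should be decoded as well.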

If you only want to decode the numeric character references, you can use this:

function html_dereference($match) {
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
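
On PHP 7.2 or later, the pack/mb_convert_encoding step can probably be replaced with mb_chr. The following is only a sketch of that variant; the function name decode_numeric_references is made up for illustration, and it assumes the mbstring extension is loaded:

```php
<?php
// Hypothetical helper: decodes only numeric character references,
// leaving named entities such as &amp; untouched.
function decode_numeric_references(string $str): string
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        static function (array $m): string {
            // Hexadecimal references start with "x" (or "X").
            if (strtolower($m[1][0]) === 'x') {
                $codepoint = intval(substr($m[1], 1), 16);
            } else {
                $codepoint = intval($m[1], 10);
            }
            $char = mb_chr($codepoint, 'UTF-8');
            // If mb_chr() cannot encode the code point, keep the reference as-is.
            return $char === false ? $m[0] : $char;
        },
        $str
    );
}

echo decode_numeric_references('&#1504;&#1493;&#1507; &amp; &#x5D0;'), "\n";
```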

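As YuriKolovsky and thirtydot have pointed out in another question, browser vendors seem to have "silently" agreed on a character reference mapping that differs from the specification and is largely undocumented.

Some character references that would normally map onto the Latin 1 Supplement block are actually mapped onto different characters. This is because the mapping is taken from Windows-1252 rather than from ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.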
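
The difference can be seen by converting a single byte from both encodings. This is just an illustrative sketch (assuming PHP 7.2+ with mbstring); byte 0x92 is picked as an example:

```php
<?php
// Byte 0x92 (decimal 146) is a C1 control character in ISO 8859-1
// but a right single quotation mark in Windows-1252.
$byte = chr(0x92);

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

printf("Windows-1252: U+%04X\n", mb_ord($asCp1252, 'UTF-8'));
printf("ISO-8859-1:   U+%04X\n", mb_ord($asLatin1, 'UTF-8'));
```

Browsers treat &#146; like the Windows-1252 byte, so it is displayed as U+2019 (decimal 8217) instead of the control character U+0092.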

Here is an extension of the html_dereference function above that handles this quirk:

function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}

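If anonymous functions are not available (they were introduced in PHP 5.3.0), you could also use create_function. Note that create_function was deprecated in PHP 7.2.0 and removed in PHP 8.0.0, so this variant only applies to legacy versions: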

$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');

Here is yet another function that tries to comply with the behavior of HTML 5:

function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            if (strtolower($match[1][1]) === 'x') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }

            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references

            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($overrides[$codepoint])) {
                $codepoint = $overrides[$codepoint];
            }

            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            return html_entity_decode($match[0], $flags, $charset);
        }   
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}

I also noticed that PHP 5.4.0 added another flag named ENT_HTML5 to the html_entity_decode function for HTML 5 behavior.
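
As a small sketch of what the flag changes (assuming PHP 5.4 or later): &apos; is not an HTML 4.01 entity, so it is only decoded when the HTML 5 entity table is selected. The variable names are just for illustration.

```php
<?php
// With the HTML 4.01 table, &apos; is unknown and stays as-is.
$html4 = html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

// With the HTML 5 table, &apos; is decoded to a single quote.
$html5 = html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($html4, $html5);
```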

score 5 · Accepted Answer

These are XML character references. Use html_entity_decode() to decode them:

$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

For more information, you can search Google for the entity in question.
