php - Handling strings in a file with mixed character encodings (ISO-8859-1 and UTF-8)

Question

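I have a file containing a series of lines, each of which may represent a multi-line comment. The original developer chose the pilcrow (¶) as the line separator because he felt it would never appear in anyone's comment. I am now loading these into a database and would like to use a more common line separator (one that may be configured by the application's installer).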

The problem is that some lines use ISO-8859-1 encoding (hex b6) for the pilcrow and others use UTF-8 encoding (hex c2b6). I am looking for an elegant way to deal with this that is more robust than what I am doing now.
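To make the two byte sequences concrete, here is a minimal sketch that prints them and shows why the cases can be told apart. It assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// The pilcrow (U+00B6) as it is stored under each encoding.
$latin1 = "\xB6";      // ISO-8859-1: a single byte
$utf8   = "\xC2\xB6";  // UTF-8: two bytes

echo bin2hex($latin1), "\n"; // b6
echo bin2hex($utf8), "\n";   // c2b6

// A lone 0xB6 byte is not valid UTF-8, so the two cases are distinguishable.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```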

This is how I have handled it so far:

// Due to the way the quote file is stored, line breaks can either be
// in 2-byte or 1-byte characters for the pilcrow. Since we're dealing
// with them on a unix system, it makes more sense to replace these
// funky characters with a newline character as is more standard.
//
// To do this, however, requires a bit of chicanery. We have to do
// 1-byte replacement, but with a 2-byte character.
//
// First, some constants:
define('PILCROW', '¶'); // standard two-byte pilcrow character
define('SHORT_PILCROW', chr(0XB6)); // the one-byte version used in the source data some places
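// Alternation rather than a character class: without the /u modifier a class
// matches single bytes, so a two-byte pilcrow would be replaced twice.
define('NEEDLE', '/'.PILCROW.'|'.SHORT_PILCROW.'/'); // this is what is searched for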
define('REPLACEMENT', $GLOBALS['linesep']);

function fix_line_breaks($quote)
{
  $t0 = preg_replace(NEEDLE,REPLACEMENT,$quote); // convert either long or short pilcrow to a newline. 
  return $t0;
}

score 0 · Accepted Answer

I would do it like this:

define('PILCROW', '¶'); // standard two-byte pilcrow character
define('REPLACEMENT', $GLOBALS['linesep']);

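function fix_encoding($quote) {
    // List the candidate encodings explicitly and use strict mode; the default
    // detection order does not normally include ISO-8859-1.
    $from = mb_detect_encoding($quote, ['UTF-8', 'ISO-8859-1'], true);
    return mb_convert_encoding($quote, 'UTF-8', $from);
}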

function fix_line_breaks($quote) {
    // convert UTF-8 pilcrow to a newline.
    return str_replace(PILCROW, REPLACEMENT, $quote);
}

For each comment line, call fix_encoding and then fix_line_breaks:

$quote = fix_encoding($quote);
$quote = fix_line_breaks($quote);
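The two steps can also be folded into one function. The following is only a sketch under stated assumptions: normalize_quote is a hypothetical name, mbstring is assumed to be available, and any line that is not valid UTF-8 is assumed to be ISO-8859-1. It uses mb_check_encoding instead of mb_detect_encoding so the decision does not depend on detection heuristics.

```php
<?php
// Hypothetical helper: bring one line to UTF-8, then swap pilcrows for $lineSep.
function normalize_quote(string $quote, string $lineSep = "\n"): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($quote, 'UTF-8')) {
        $quote = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
    }
    // "\xC2\xB6" is the UTF-8 pilcrow, written as bytes so that the
    // encoding of this source file does not matter.
    return str_replace("\xC2\xB6", $lineSep, $quote);
}

$latin1Line = "first\xB6second";          // ISO-8859-1 pilcrow
$utf8Line   = "first\xC2\xB6second";      // UTF-8 pilcrow
$umlautLine = "sch\xC3\xB6n\xC2\xB6next"; // UTF-8 "ö" (c3 b6) plus a pilcrow

var_dump(normalize_quote($latin1Line) === "first\nsecond");      // bool(true)
var_dump(normalize_quote($utf8Line) === "first\nsecond");        // bool(true)
var_dump(normalize_quote($umlautLine) === "sch\xC3\xB6n\nnext"); // bool(true)
```

Because the replacement only runs after the line is known to be UTF-8, the b6 byte inside a character such as "ö" is left alone. A single line that mixes both encodings is not handled by this sketch.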
