regex - Regular expression to parse an email subject with multiple encodings

Question

Hi there!

I want to match every inline-encoded part in a single mail subject and build the decoded subject string in UTF-8.

Some examples:

[Listname | Topic123] =?utf-8?Q?encodedtext?=
=?iso-8859-1?q?this=20is=20some=20text?=
Klartext-Betreff
[Listname | Topic123] =?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

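I have also received mails that use two different encodings in one subject (see the last example).

In an email, the subject can also be split across multiple lines, where every line except the first starts with at least one whitespace character.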
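As a rough illustration of the folding described above, a folded subject can be joined back into one line by removing each line break that is directly followed by a space or tab. This is only a sketch: `unfold_header` is a hypothetical helper name, and it assumes the raw header value is already available as a string.

```php
<?php
// Hypothetical helper (not from the original post): removes every line break
// that is immediately followed by a space or tab. The whitespace itself stays.
function unfold_header(string $raw): string
{
    return preg_replace('/\r?\n(?=[ \t])/', '', $raw);
}

$folded = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n"
        . "    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";

// Both encoded-words now sit on one line, separated by four spaces.
echo unfold_header($folded), "\n";
```

A line break that is not followed by whitespace is left untouched, since it would start a new header rather than continue the subject.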

So I am looking for a regex that parses:

Part+

where Part is one of:

text, including spaces
=?charset?encoding?encoded-text?=

I think it would look something like this:

ENC = (=\?)([A-Za-z0-9-]*)(\?)([A-Za-z0-9-]*)(\?)([Any Character])(\?=)
Part = ENC, or any run of characters that does not match ENC
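To make the ENC idea concrete, here is a minimal PHP sketch that lists the encoded parts of a subject with `preg_match_all`. The pattern is an assumption, not something confirmed by the post: it restricts the encoding to B or Q and treats `?` and whitespace as not allowed inside any of the three fields.

```php
<?php
// One encoded part: =?charset?encoding?encoded-text?=
// Assumption: the encoding is B or Q, in either case.
$encodedWord = '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/';

$subject = '[Listname | Topic123] =?utf-8?Q?encodedtext?= '
         . '=?iso-8859-1?q?this=20is=20some=20text?=';

preg_match_all($encodedWord, $subject, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = charset, $m[2] = encoding, $m[3] = encoded text
    printf("%s | %s | %s\n", $m[1], $m[2], $m[3]);
}
// prints:
// utf-8 | Q | encodedtext
// iso-8859-1 | q | this=20is=20some=20text
```

Everything the pattern does not match, such as `[Listname | Topic123] `, is the plain-text kind of Part.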

score 0 · Accepted Answer

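// Decodes the encoded parts (=?charset?encoding?text?=) in $string and
// converts the whole string to $dest_enc. $source_enc is the charset assumed
// for the text outside the encoded parts.
function decode ($string, $source_enc, $dest_enc)
{
    // With PREG_SPLIT_DELIM_CAPTURE the result repeats in groups of four:
    // plain text, charset, encoding, encoded text.
    $parts = preg_split (
        '/=\?([^?]+)\?([^?]+)\?([^?]+)\?=/',
        $string,
        -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = "";

    for ($i = 0; $i < count ($parts); $i++)
    {
        $part = $parts [$i];

        if ($i % 4 == 0)
            // Plain text, including any whitespace between two encoded parts.
            $result .= iconv ($source_enc, $dest_enc, $part);
        else
        {
            // Consume the three captured fields; the loop's own $i++ then
            // moves on to the next plain-text part.
            $charset = $parts [$i++];
            $encoding = strtoupper ($parts [$i++]);
            $text = $parts [$i];

            if ($encoding == 'Q')
                // In the Q encoding an underscore stands for a space.
                $text = quoted_printable_decode (str_replace ('_', ' ', $text));
            elseif ($encoding == 'B')
                $text = base64_decode ($text);

            $result .= iconv ($charset, $dest_enc, $text);
        }
    }

    return $result;
}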

echo (decode ("=?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=", 
    "ISO-8859-1", "ISO-8859-1"));

The output for me is:

encodedtext this is some text If you can read this yo u understand the example.
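The "yo u" in this output appears because the line break and spaces between the last two encoded parts are kept. RFC 2047 says that whitespace between two adjacent encoded-words should be ignored when the text is displayed. The following is a minimal sketch of a variant that does this and returns UTF-8. `decode_subject` is a hypothetical name, not part of the answer, and the sketch assumes that the iconv extension is loaded and that text outside the encoded parts is already ASCII or UTF-8.

```php
<?php
// Hypothetical helper, not part of the answer above. Assumes the iconv
// extension is loaded and that text outside encoded-words is ASCII or UTF-8.
function decode_subject(string $subject): string
{
    // One encoded-word without capture groups: =?charset?encoding?text?=
    $w = '=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=';

    // Remove whitespace (including folded line breaks) that sits between
    // two adjacent encoded-words; other whitespace is left alone.
    $subject = preg_replace("/($w)\s+(?=$w)/", '$1', $subject) ?? $subject;

    return preg_replace_callback(
        '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
        function (array $m): string {
            $bytes = strtoupper($m[2]) === 'B'
                ? base64_decode($m[3])
                // In the Q encoding an underscore stands for a space.
                : quoted_printable_decode(str_replace('_', ' ', $m[3]));
            $utf8 = iconv($m[1], 'UTF-8', $bytes);
            // If iconv cannot convert the text, keep the raw encoded-word.
            return $utf8 === false ? $m[0] : $utf8;
        },
        $subject
    ) ?? $subject;
}

$raw = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
     . "    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";

echo decode_subject($raw), "\n";
// prints: If you can read this you understand the example.
```

PHP also ships `iconv_mime_decode()` and `mb_decode_mimeheader()`, which may be worth trying before hand-rolling a parser.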
