php - Why does PHP's preg_match behave differently on a string variable than on a string literal when extracting data into a named array?

Question

I am trying to extract six fields (sender, customer ID, and so on) from the body of an HTML email.

$string = '... some other html text ... <p>
   <strong>Sender:</strong>&nbsp;Holly Schöne<br>
   <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
   <strong>Email:</strong>&nbsp;email@test.net<br>
   <strong>Transaction ID:</strong>&nbsp;836248467<br>
   <strong>Reference:</strong>&nbsp;product<br>
   <strong>Explanation:</strong>&nbsp;Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
</p>... some more html text ...';

...which I fetch like so:

$message = imap_fetchbody($inbox, $email_number, $section);
// determine $encoding and $charset
$decodedMessage = decodeMessage($message, $encoding, $charset);

Using this function (the other encodings are omitted because nothing is done for them):

function decodeMessage($message, $encoding, $charset) {
    switch ($encoding) {
        case 3: // BASE64
            $message = base64_decode($message);
            break;
        case 4: // QUOTED-PRINTABLE
            $message = quoted_printable_decode($message);
            break;
        default:
            break;
    }
    if ($charset != NULL) {
        $message = mb_convert_encoding($message , 'utf-8' , $charset);
        //$message = mb_convert_encoding($message , 'iso-8859-1' , $charset);
    }
    return $message;
}
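
As a rough illustration of what the two steps in decodeMessage() do to a non-ASCII name, here is a minimal sketch. The input string is a made-up example and assumes the mail part is quoted-printable in ISO-8859-1; the real message may use a different charset.

```php
<?php
// Hypothetical input: "Holly Schöne" as quoted-printable text in ISO-8859-1.
$raw = 'Holly Sch=F6ne';

// Step 1: undo the transfer encoding. The result holds the single byte 0xF6.
$decoded = quoted_printable_decode($raw);

// Step 2: convert from the declared charset to UTF-8 (requires ext-mbstring).
$utf8 = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');

echo bin2hex($decoded), "\n"; // 486f6c6c7920536368f66e65
echo bin2hex($utf8), "\n";    // 486f6c6c7920536368c3b66e65

// The raw ISO-8859-1 bytes are not valid UTF-8; the converted string is.
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)
```

If the UTF-8 bytes c3 b6 are later displayed as ISO-8859-1, they show up as "Ã¶", which is only a display issue and does not by itself stop a byte-oriented regex from matching.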

That all works like a charm. The problem starts here:

$regex = '/\<p\>[\w\W. ]*?\<strong\>Sender\:\<\/strong\>&nbsp;(?<sender>[\w\W ]+?)\<br\>.*?\<strong\>Customer ID\:\<\/strong\>&nbsp;(?<customerId>[\w\W ]+?)\<br\>.*?\<strong\>Email\:\<\/strong\>&nbsp;(?<email>[\w\W ]+?)\<br\>.*?\<strong\>Transaction ID\:\<\/strong\>&nbsp;(?<transactionId>[\w\W ]+?)\<br\>.*?\<strong\>Reference\:\<\/strong\>&nbsp;(?<reference>[\w\W ]+?)\<br\>.*?\<strong\>Explanation\:\<\/strong\>&nbsp;(?<explanation>[\w\W ]+?)\<\/p\>/is';
$result = preg_match($regex, $decodedMessage, $matches);

When I apply that regex to the string above, I get exactly what I want, an array like this:

print_r($matches) = Array (
    [0] => <p>
       <strong>Sender:</strong>&nbsp;Holly SchÃ¶ne<br>
       <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
       <strong>Email:</strong>&nbsp;email@test.net<br>
       <strong>Transaction ID:</strong>&nbsp;836248467<br>
       <strong>Reference:</strong>&nbsp;product<br>
       <strong>Explanation:</strong>&nbsp;Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    </p>
    [sender] => Holly SchÃ¶ne
    [1] => Holly SchÃ¶ne
    [customerId] => 3853XXXX
    [2] => 3853XXXX
    [email] => email@test.net
    [3] => email@test.net
    [transactionId] => 836248467
    [4] => 836248467
    [reference] => product
    [5] => product
    [explanation] => Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    [6] => Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
)

...but when I do the same with $decodedMessage, I get:

preg_last_error() -> PREG_NO_ERROR
$result -> [empty string]
$matches -> array()

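I have tried everything and searched around, but I cannot figure out the problem. My guess is that it has to do with the encoding or character set of the email body. Any help is greatly appreciated.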

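OK, you asked for it. I thought this question was already very long... Here is the var_dump; I have only changed some personal information.

...oh damn... and there was my problem.

I was fooled by Waterfox's source code viewer.

It displayed <br /> as <br> and added a <tbody> to each table, so the source code I based my regex on was not what the email actually contained. I feel pretty stupid now. The actual HTML source is below: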

<html>
<table width="750" cellpadding="0" cellspacing="0">
    <tr>
        <td style="background-repeat:no-repeat;" background="http://i1.mbsvr.net/images/bg_mailframe.gif" width="100%" align="center">
            <table width="95%" align="center">
                <tr>
                    <td align="left" style="padding:10px 0 0 10px;">
                        <a href="http://www.moneybookers.com/app/?l=EN" target="_blank" style="color:FD932C;font-weight:normal;" onfocus="this.blur()">
                            <img src="http://i1.mbsvr.net/images/skrill/mb-logo-the-future.png" border="0" />
                        </a>
                    </td>
                </tr>
            </table>
            <table width="740">
                <tr><td style="padding:0px 40px 0px 0px" align="center">
<table width="100%" border="0" cellpadding="0" cellspacing="0">
    <tr>
        <td valign="top" align="middle">
            <table cellspacing="0" cellpadding="0" width="100%" border="0">
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
                <tr> 
                    <td style="!important; font-family: verdana, arial, sans-serif; margin: 0; padding: 0px 0px 10px 0px; color: #EF8116; font-weight: bold; font-size: 18px;" nowrap width="50%">
                        You have received EUR 0.05
                    </td>
                </tr>
                <tr>
                    <td style="!important; font-family: verdana, arial, sans-serif;  font-size: 11px;   color: #656565;">
                        <br/> 
                        Dear Mmmmmmm Bbbbbbb,<br />
                        <br/>
                        Holly Schöne has sent you EUR 0.05 via Skrill (Moneybookers). The full details of the transaction are:<br />
                        <p>
                            <strong>Sender:</strong> Holly Schöne<br />
                            <strong>Customer ID:</strong> 3853XXXX<br />
                            <strong>Email:</strong> email@test.net<br />
                            <strong>Transaction ID:</strong> 836151721<br />
                            <strong>Reference:</strong> TPBwishes<br />
                            <strong>Explanation:</strong> Holly Schoene
#gsg4sda65g4r65e4g8s4g56asd54e#
                        </p>
                        Your money is waiting for you in your Skrill (Moneybookers) account - <a href="https://www.moneybookers.com">https://www.moneybookers.com</a>.<br />
                        <br />
                        <b>IMPORTANT:</b> If you are using Skrill (Moneybookers) commercially, we <b>STRONGLY</b> advise that you check in your Skrill (Moneybookers) account history that the money is there.<br />
                        <br />
                        Have you increased your withdrawal and receiving limits? Just log into your Skrill (Moneybookers) account and click <b>View Limits</b> in the "My Account" section.<br />
                        <br />
                        Kind regards,<br />
                        Skrill (Moneybookers)<br />
                    </td>
                </tr>
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
            </table>
            <table cellspacing=0 cellpadding=0 width="100%" border=0> 
    <tr>
        <td style="font-family: verdana, arial, sans-serif; font-size: 12px;    color: #656565;"><b>Skrill (Moneybookers) Security Reminders</b></td>
    </tr>
       <tr>          

<td class=smooth valign="top" style="font-family: verdana, arial, sans-serif; font-size: 11px;  color: #656565;"><p> <br>              <strong>Protect Your Password</strong><br>Skrill (Moneybookers) and its representatives will NEVER ask you to reveal your password. There are NO EXCEPTIONS to this policy. If anyone asks for your password by phone or by email, or on any website other than moneybookers.com, refuse and immediately report this to <a href="mailto:security@moneybookers.com" style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Access your account ONLY using the login link on the Moneybookers homepage</strong><br>Please be advised that Skrill (Moneybookers) and its representatives will NEVER send you an email asking you to provide your login details within a form provided or to click on a hyperlink to access your account! Immediately report any incident to <a href="mailto:security@moneybookers.com"              style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Case Sensitive Login</strong><br>Please remember your password is case-sensitive, at least 8 characters long and contains at least one number or non-alphabetic character such as '-'. <br>              <br>            </p></td>                </tr>      </table>
        </td>
    </tr>
</table>                </td></tr>
                <tr>
                    <td style="padding:0px 54px 0px 0px" class="separator"><hr style="border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;"/></td>
                </tr>
            </table>
            <table align="left" width="740">
                <tr>
                    <td width="10"> </td>
                    <td style="font-family: verdana, arial, sans-serif; font-size: 11px;    color: #656565;" valign="top" width="100%" align="center">
                    Moneybookers Ltd., London, Registered in England and Wales no 4260907.<br>
Registered office: Welken House, 10-11 Charterhouse Square, London, EC1M 6EH, United Kingdom.<br>
Authorised by the Financial Services Authority (FSA) under the Electronic Money Regulations 2011 for the issuing of electronic money.
                    </td>
                </tr>
            </table>
        </td>
    <tr>
        <td valign="top">
            <img src="http://i1.mbsvr.net/images/bg_mailframe_bottom.gif" border="0" />
        </td>
    </tr>
</table>
</html>
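
The mismatch can be reproduced in isolation. This is a minimal sketch using a made-up one-field excerpt (not the full email) and a one-field reduction of the first regex; the only difference between the two inputs is <br> versus <br />.

```php
<?php
// Hypothetical excerpt as the browser's source viewer displayed it.
$asViewed = '<p><strong>Sender:</strong>&nbsp;Holly<br></p>';
// The same excerpt with the self-closed tag the email really used.
$asSent   = '<p><strong>Sender:</strong>&nbsp;Holly<br /></p>';

// One-field reduction of the first regex: it expects a literal <br>.
$regex = '/<strong>Sender:<\/strong>&nbsp;(?<sender>.+?)<br>/is';

$hitViewed = preg_match($regex, $asViewed, $m1); // 1, $m1['sender'] is "Holly"
$hitSent   = preg_match($regex, $asSent, $m2);   // 0, $m2 is an empty array
$error     = preg_last_error();                  // still PREG_NO_ERROR

var_dump($hitViewed, $hitSent, $error === PREG_NO_ERROR);

// Dumping the raw bytes around a label shows what is really in the string,
// independent of any viewer that normalizes the markup.
$pos = strpos($asSent, 'Sender:');
var_dump(substr($asSent, $pos, 40));
```

A failed match is not an error as far as PCRE is concerned, which is why preg_last_error() reports PREG_NO_ERROR while $matches stays empty.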

So, together with Tomalak's answer, I now have two working solutions:

The regex that now works, which accounts for properly closed <br /> tags and also parses the amount:

$regex = '/<td .*?>.*?You have received(?<value>.+?\d+\.\d\d).*?<\/td>.*?<p>.*?<strong>Sender:<\/strong>(?<sender>.+?)<br*.?\/?>.*?<strong>Customer ID:<\/strong>(?<customerId>.+?)<br*.?\/?>.*?<strong>Email:<\/strong>(?<email>.+?)<br*.?\/?>.*?<strong>Transaction ID:<\/strong>(?<transactionId>.+?)<br*.?\/?>.*?<strong>Reference:<\/strong>(?<reference>.+?)<br*.?\/?>.*?<strong>Explanation:<\/strong>(?<explanation>.+?)<\/p>/is';

The XPath, adjusted from Tomalak's solution below:

$path = "p/strong[contains(., '$info')]/following-sibling::text()[1]";

Having no slashes at the beginning means the XPath can be anywhere in the DOM tree and matches only where it needs to.
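
One note on the regex above: the sub-pattern <br*.?\/?> is looser than it looks, because r* applies to the letter r and .? accepts any single character, so it also matches a tag such as <b>. A stricter variant is sketched below. It assumes the only forms that occur are <br>, <br/> and <br />, and it runs against a made-up three-field fragment rather than the full email.

```php
<?php
// Hypothetical fragment shaped like the <p> block in the email source.
$html = "<p>\n<strong>Sender:</strong> Holly Schöne<br />\n"
      . "<strong>Customer ID:</strong> 3853XXXX<br>\n"
      . "<strong>Email:</strong> email@test.net<br/>\n</p>";

// Accepts <br>, <br/> and <br /> only.
$br = '<br\s*\/?>';

// \s* around each group keeps surrounding whitespace out of the captures.
$regex = '/<strong>Sender:<\/strong>\s*(?<sender>.+?)\s*' . $br
       . '.*?<strong>Customer ID:<\/strong>\s*(?<customerId>.+?)\s*' . $br
       . '.*?<strong>Email:<\/strong>\s*(?<email>.+?)\s*' . $br . '/is';

$fields = [];
if (preg_match($regex, $html, $matches) === 1) {
    // Keep only the named groups and drop the numeric duplicates.
    $fields = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}
print_r($fields);

// The loose pattern matches <b>; the strict one does not.
var_dump(preg_match('/<br*.?\/?>/', '<b>'));  // int(1)
var_dump(preg_match('/' . $br . '/', '<b>')); // int(0)
```

The remaining fields would follow the same shape; they are left out here only to keep the sketch short.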

Thanks to everyone who tried to help.

score 0 · Accepted Answer

Just for the sake of it, here is an implementation that avoids regex:

$doc = new DOMDocument();
$doc->loadHTML($decodedMessage);
$xpath = new DOMXPath($doc);

$info = array(
  'sender'         => get_info($xpath, 'Sender:'),
  'customer_id'    => get_info($xpath, 'Customer ID:'),
  'email'          => get_info($xpath, 'Email:'),
  'transaction_id' => get_info($xpath, 'Transaction ID:'),
  'reference'      => get_info($xpath, 'Reference:'),
  'explanation'    => get_info($xpath, 'Explanation:')
);


function get_info($xpath_object, $info) 
{
    $result = null;
    $path   = "//strong[contains(., '$info')]/following-sibling::text()[1]";
    $nodes  = $xpath_object->query($path);

    foreach ($nodes as $node)
    {
        $result = $node->textContent;
        break;
    }

    return $result;
}
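
For a quick check of this approach outside the mail pipeline, here is a self-contained sketch. The HTML fragment is made up, and get_info_trimmed() is a hypothetical variant of get_info() that additionally trims the whitespace surrounding each text node.

```php
<?php
// Hypothetical fragment shaped like the <p> block in the email source.
$html = '<html><body><p>'
      . '<strong>Customer ID:</strong> 3853XXXX<br />'
      . '<strong>Email:</strong> email@test.net<br />'
      . "<strong>Reference:</strong> TPBwishes\n</p></body></html>";

// Collect libxml parser warnings instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: same XPath as get_info(), plus trim() on the result.
// $label must not contain a single quote, since it is embedded in the query.
function get_info_trimmed(DOMXPath $xpath, string $label): ?string
{
    $path  = "//strong[contains(., '$label')]/following-sibling::text()[1]";
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return null; // label not found
    }
    return trim($nodes->item(0)->textContent);
}

$customerId = get_info_trimmed($xpath, 'Customer ID:');
$email      = get_info_trimmed($xpath, 'Email:');
$reference  = get_info_trimmed($xpath, 'Reference:');
$missing    = get_info_trimmed($xpath, 'No such label:');

var_dump($customerId, $email, $reference, $missing);
```

Because the query works on the parsed tree, it does not care whether the line breaks are written as <br>, <br/> or <br />.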
