php - Regular expression to match content other than an opening tag

Question

I'm trying to match the contents of a tag while excluding other tags. I have some malformed html that I'm trying to clean up.
Simply put:

    <td><ins>sample content</td>
    <td>other content</td>
</tr>
<tr>
    <td>remaining</ins>other copy</td>

I want to capture "sample content", the html before it ( <td><ins>), and the html after it, excluding the </ins>.

I think what I'm looking for is a negative lookahead, but I'm a bit lost as to how this works in PHP.
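For reference, PHP's preg_* functions write a negative lookahead as (?!...). The following is only a minimal sketch, assuming a hypothetical sample string and a pattern made up for illustration: it captures the text after an <ins> only when the next tag is not </ins>. The possessive quantifier *+ stops the engine from backtracking into the captured text to satisfy the lookahead.

```php
<?php
// Hypothetical sample: the first <ins> is never closed, the second one is.
$html = '<td><ins>sample content</td><td><ins>fine</ins></td>';

// <ins>, then text up to the next '<' (possessive, so no backtracking),
// then a negative lookahead asserting that what follows is NOT </ins>.
$pattern = '/<ins>([^<]*+)(?!<\/ins>)/';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]); // only the content of the unclosed <ins>
```

This only looks one tag ahead, so it is a way to see the lookahead syntax in action rather than a general fix for malformed html.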

score 0 · Accepted Answer

Because you don't know which opening tag is left unclosed, you are effectively writing a bare-bones parser.

You need to loop over the content and stop every time you encounter an opening angle bracket <. This should really be done in AWK; however, if you must use PHP, you could do it like this:

<?php

    $file = file_get_contents('path/to/file');
    $file = preg_replace( '/[\n\r\t]/' , '' , $file );

    $pieces = explode( '<' , $file );
    if ( !$pieces[0] ) array_shift($pieces);

    /* Given your example code, $pieces looks like this
    $pieces = array(
        [0] => 'td>',
        [1] => 'ins>sample content',
        [2] => '/td>',
        [3] => 'td>other content',
        [4] => '/td>',
        [5] => '/tr>',
        [6] => 'tr>',
        [7] => 'td>remaining',
        [8] => '/ins>other copy',
        [9] => '/td>'
    );
    */

    $openers = array();//$openers = [];
    $closers = array();//$closers = [];
    $brokens = array();//$brokens = [];

    for ( $i = 0 , $count = count($pieces) ; $i < $count ; $i++ ) {
        //grab everything essentially between the brackets
        $tag = strstr( $pieces[$i] , '>' , TRUE );
        //a '/' anywhere in the piece (not only at the start) marks it as a closer
        $oORc = strpos( $pieces[$i] , '/' );

        if ( $oORc !== FALSE ) {
            //store this for later (and maintain $pieces' index)
            $closers[$i] = $tag;
            $elm = str_replace( '/' , '' , $tag );
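            //note: array_search() returns a key, and key 0 is falsy, so an opener stored at index 0 never passes this test
            if ( ( $oIndex = array_search( $elm , $openers ) ) && count($openers) != count($closers) ) {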
                //more openers than closers ==> broken pair
                $brokens[$oIndex] = $pieces[$oIndex];
                $cIndex = array_search( $tag, $closers );
                $brokens[$cIndex] = $pieces[$cIndex];
                //remove the unpaired elements from the 2 arrays so count works
                unset( $openers[$oIndex] , $closers[$cIndex] );
            }
        } else {
            $openers[$i] = $tag;
        }//fi

    }//for

    print_r($brokens);

?>

The indexes of $brokens are the indexes in $pieces where the malformed html appeared, and the values are the offending tags and their content.

$brokens = Array(
    [1] => ins>sample content
    [8] => /ins>other copy
);
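Since the keys of $brokens line up with $pieces, one possible follow-up is to rebuild the markup while dropping the offending tags. This is a sketch only: it assumes that "cleaning up" means removing the unpaired tags and keeping their text, which the question does not spell out, and it hard-codes $pieces and $brokens as shown above.

```php
<?php
// Values copied from the example above.
$pieces = array(
    'td>', 'ins>sample content', '/td>', 'td>other content', '/td>',
    '/tr>', 'tr>', 'td>remaining', '/ins>other copy', '/td>'
);
$brokens = array(
    1 => 'ins>sample content',
    8 => '/ins>other copy'
);

$clean = '';
foreach ($pieces as $i => $piece) {
    if (isset($brokens[$i])) {
        // broken piece: keep only the text after the tag's closing '>'
        $clean .= substr(strstr($piece, '>'), 1);
    } else {
        // intact piece: restore the '<' that explode() removed
        $clean .= '<' . $piece;
    }
}

echo $clean;
```

With the sample data the two ins tags disappear and the text they surrounded stays in place inside the td cells.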

Caveat: this does not account for self-closing tags like <br /> or <img /> (but that is why you should use one of the many software apps that already exist for this).
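If you still want to stay with the loop, one way to work around the caveat might be to skip self-closing pieces before pairing openers with closers. The helper name isSelfClosingPiece and the sample string below are hypothetical, and the check only recognises the XHTML-style form ending in "/>"; a bare <br> would still be treated as an opener.

```php
<?php
// Hypothetical helper (not in the original answer): true when a piece such as
// 'br />text' or 'img src="a.png" />' holds an XHTML-style self-closing tag.
function isSelfClosingPiece($piece)
{
    $tag = strstr($piece, '>', true);   // text before the first '>'
    if ($tag === false) {
        return false;                   // no '>' at all, so not a complete tag
    }
    return substr(rtrim($tag), -1) === '/';
}

$pieces = explode('<', '<td>line one<br />line two<img src="a.png" /></td>');
if (!$pieces[0]) array_shift($pieces);

// Drop self-closing pieces for the pairing step only;
// array_filter keeps the original indexes.
$kept = array_filter($pieces, function ($piece) {
    return !isSelfClosingPiece($piece);
});

print_r($kept);
```

The text that follows a self-closing tag lives in the same piece, so $kept is meant for the opener/closer bookkeeping only; keep the full $pieces array for any output.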
