php - Is there a PHP substr() function that can set start and stop points and preserve HTML formatting?

Question

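With PHP's regular substr() function you can decide where to "start" cutting the string and also set a length. Length is probably what gets used most, but in this case I need to cut about 120 characters off the beginning. The problem is that I need to keep the HTML in the string intact and cut only the actual text inside the tags.

I found a few custom functions, but not one of them lets you set the starting point, that is, where you want to start cutting the string.

Here is one I found: Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

So basically I need a function that works exactly like the original substr(), except that it keeps the formatting.

Any suggestions?

Example content to modify: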

<p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

After cutting 5 characters from the start:

<p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>

And 5 from both the start and the end:

<p>ary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.1</p>

Yeah, you catch my drift?

If the cut lands in the middle of a word I would rather drop the whole word, but that is not very important.

**Edit:** Fixed the quotes.

score 2 · Accepted Answer

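There is a massive amount of complexity in what you are asking for (essentially, producing a valid HTML subset given string offsets). It would really be better to reformulate your problem so that it is expressed as a number of text characters you want to keep, rather than slicing an arbitrary string that happens to contain HTML. If you do that, the problem becomes much easier, because you can use a real HTML parser. You will not have to worry about:

Accidentally cutting elements in half.
Accidentally cutting entities in half.
Not counting the text inside elements.
Making sure character entities count as one character.
Making sure all elements are properly closed.
Making sure you do not trash the string because you are using substr() on a UTF-8 string.

It is possible to accomplish this with regular expressions (using the u flag), mb_substr() and a tag stack (I have done it before), but there are many edge cases and it is generally a hard slog.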
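
Two of the pitfalls listed above can be seen in a few lines. This is only an illustrative sketch with made-up sample strings (they are not from the question), and it assumes the mbstring extension is available:

```php
<?php
// Sample strings are hypothetical and used only for illustration.

// substr() counts bytes, so it can split a multibyte UTF-8 character.
$text  = "na\u{00EF}ve caf\u{00E9}";          // "naïve café"
$bytes = substr($text, 0, 3);                  // "na" plus the first byte of "ï"
$chars = mb_substr($text, 0, 3, 'UTF-8');      // "naï"

var_dump(mb_check_encoding($bytes, 'UTF-8'));  // bool(false): broken UTF-8
var_dump(mb_check_encoding($chars, 'UTF-8'));  // bool(true)

// A character entity is several bytes of markup but one character of text.
$markup  = 'Fish &amp; chips';
$decoded = html_entity_decode($markup, ENT_QUOTES, 'UTF-8');
echo strlen($markup), ' vs ', mb_strlen($decoded, 'UTF-8'), "\n"; // prints "16 vs 12"
```

A parser-based approach sidesteps both issues because the offsets are applied to decoded text nodes rather than to raw markup bytes.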

A DOM solution, however, is quite simple: walk through all the text nodes while counting string length, and remove or substring their text content as needed. The code below does this.

$html = <<<'EOT'
<p>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going <a href="#">through the cites</a> of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus</p> <p>Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the <strong>Renaissance</strong>. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.</p>
EOT;

function substr_html($html, $start, $length=null, $removeemptyelements=true) {
    if (is_int($length)) {
        if ($length===0) return '';
        $end = $start + $length;
    } else {
        $end = null;
    }
    $d = new DOMDocument();
    $d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><title></title></head><body>'.$html.'</body>');
    $body = $d->getElementsByTagName('body')->item(0);
    $dxp = new DOMXPath($d);
    $t_start = 0; // text node's start pos relative to all text
    $t_end   = null; // text node's end pos relative to all text

    // copy because we may modify result of $textnodes
    $textnodes = iterator_to_array($dxp->query('/descendant::*/text()', $body));

// PHP 5.2 doesn't seem to implement Traversable on DOMNodeList,
// so `iterator_to_array()` won't work. Use this instead:
// $textnodelist = $dxp->query('/descendant::*/text()', $body);
// $textnodes = array();
// for ($i = 0; $i < $textnodelist->length; $i++) {
//  $textnodes[] = $textnodelist->item($i);
//}
//unset($textnodelist);

    foreach($textnodes as $text) {
        $t_end = $t_start + $text->length;
        $parent = $text->parentNode;
        if ($start >= $t_end || ($end!==null && $end < $t_start)) {
            $parent->removeChild($text);
        } else {
            $n_offset = max($start - $t_start, 0);
            // substringData() takes a count, not an end position, so subtract the offset
            $n_length = ($end===null) ? $text->length : $end - $t_start - $n_offset;
            if (!($n_offset===0 && $n_length >= $text->length)) {
                $substr = $text->substringData($n_offset, $n_length);
                if (strlen($substr)) {
                    $text->deleteData(0, $text->length);
                    $text->appendData($substr);
                } else {
                    $parent->removeChild($text);
                }
            }
        }

        // if removing this text emptied the parent of nodes, remove the node!
        if ($removeemptyelements && !$parent->hasChildNodes()) {
            $parent->parentNode->removeChild($parent);
        }

        $t_start = $t_end;
    }
    unset($textnodes);
    $newstr = $d->saveHTML($body);

    // mb_substr() is to remove <body></body> tags
    return mb_substr($newstr, 6, -7, 'utf-8');
}


echo substr_html($html, 480, 30);

This will output:

<p> of "de Finibus</p> <p>Bonorum et Mal</p>

Note that it is not confused by the fact that the "substring" spans multiple p elements.
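
To make the meaning of these offsets concrete, here is a small sketch that prints each text node together with its position in the overall text. The sample markup and variable names are hypothetical; the point is only that `$start` and `$length` index into the concatenated text of all text nodes, not into the markup:

```php
<?php
// Hypothetical sample markup, not taken from the answer above.
$html = '<p>Hello <b>big</b> world, and <i>more</i></p>';

$d = new DOMDocument();
$d->loadHTML('<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"></head><body>' . $html . '</body></html>');
$body = $d->getElementsByTagName('body')->item(0);

// Walk the text nodes in document order, tracking the running text offset.
$offset = 0;
$xp = new DOMXPath($d);
foreach ($xp->query('.//text()', $body) as $text) {
    printf("%2d..%2d  %s\n", $offset, $offset + $text->length, json_encode($text->data));
    $offset += $text->length;
}

// The same offsets index into the concatenated text of <body>.
$slice = mb_substr($body->textContent, 6, 9, 'UTF-8');
echo $slice, "\n"; // big world
```

Here the slice starting at text offset 6 begins inside the `b` element and ends inside the following text node, which is the kind of range the function has to handle without breaking tags.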

score 1 · Accepted Answer

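Here is a solution that uses DOMDocument (an XML/HTML parser), RecursiveIteratorIterator (for easy traversal of recursive structures) and a custom DOMNodeList iterator implementation to feed to RecursiveIteratorIterator.

It is all still pretty sloppy (it does not return a copy but acts on the reference to the DOMNode/DOMDocument) and it lacks the usual fancy features of substr(), such as negative values for $start and/or $length, but it seems to do the job so far. I am sure there are bugs in it, though. Still, it should give you an idea of how to go about this with DOMDocument.

The custom iterators: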

class DOMNodeListIterator
    implements Iterator
{
    protected $domNodeList;

    protected $position;

    public function __construct( DOMNodeList $domNodeList )
    {
        $this->domNodeList = $domNodeList;
        $this->rewind();
    }

    public function valid()
    {
        return $this->position < $this->domNodeList->length;
    }

    public function next()
    {
        $this->position++;
    }

    public function key()
    {
        return $this->position;
    }

    public function rewind()
    {
        $this->position = 0;
    }

    public function current()
    {
        return $this->domNodeList->item( $this->position );
    }
}

class RecursiveDOMNodeListIterator
    extends DOMNodeListIterator
    implements RecursiveIterator
{
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    public function getChildren()
    {
        return new self( $this->current()->childNodes );
    }
}

The actual function:

function DOMSubstr( DOMNode $domNode, $start = 0, $length = null )
{
    // strict comparison, so that a $length of 0 is not mistaken for "no length given"
    if( $start == 0 && ( $length === null || $length >= strlen( $domNode->nodeValue ) ) )
    {
        return;
    }

    $nodesToRemove = array();
    $rii = new RecursiveIteratorIterator( new RecursiveDOMNodeListIterator( $domNode->childNodes ), RecursiveIteratorIterator::SELF_FIRST );
    foreach( $rii as $node )
    {
        if( $start <= 0 && $length !== null && $length <= 0 )
        {
            /* can't remove immediately
             * because this will mess with
             * iterating over RecursiveIteratorIterator
             * so remember for removal, later on
             */
            $nodesToRemove[] = $node;
            continue;
        }

        if( $node->nodeType == XML_TEXT_NODE )
        {
            if( $start > 0 )
            {
                $count = min( $node->length, $start );
                $node->deleteData( 0, $count );
                $start -= $count;
            }

            if( $start <= 0 )
            {
                if( $length === null )
                {
                    break;
                }
                else if( $length <= 0 )
                {
                    continue;
                }
                else if( $length >= $node->length )
                {
                    $length -= $node->length;
                    continue;
                }
                else
                {
                    $node->deleteData( $length, $node->length - $length );
                    $length = 0;
                }
            }
        }
    }

    foreach( $nodesToRemove as $node )
    {
        $node->parentNode->removeChild( $node );
    }
}

Usage:

$html = <<<HTML
<p>Just a short text sample with <a href="#">a link</a> and some trailing elements such as <strong>strong text</strong>, <em>emphasized text</em>, <del>deleted text</del> and <ins>inserted text</ins></p>
HTML;

$dom = new DomDocument();
$dom->loadHTML( $html );
/*
 * this is particularly sloppy:
 * I pass $dom->firstChild->nextSibling->firstChild (i.e. <body>)
 * because the function uses strlen( $domNode->nodeValue )
 * which will be 0 for DOMDocument itself
 * and I didn't want to utilize DOMXPath in the function
 * but perhaps I should have
 */
DOMSubstr( $dom->firstChild->nextSibling->firstChild, 8, 25 );

/*
 * passing a specific node to DOMDocument::saveHTML()
 * only works with PHP >= 5.3.6
 */
echo $dom->saveHTML( $dom->firstChild->nextSibling->firstChild->firstChild );
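
As the comment above admits, reaching `<body>` through `firstChild->nextSibling->firstChild` is fragile. A possible alternative, sketched here with hypothetical sample markup, is to look the element up by name and serialize only its children:

```php
<?php
// Hypothetical sample markup for illustration.
$dom = new DOMDocument();
$dom->loadHTML('<p>Just a <em>short</em> sample</p>');

// Look the <body> element up by name instead of walking
// firstChild/nextSibling, which depends on the exact tree the parser builds.
$body = $dom->getElementsByTagName('body')->item(0);

// Serialize only the children of <body>, so no wrapper tags need trimming.
// Passing a node to saveHTML() requires PHP >= 5.3.6, as noted above.
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
echo $inner, "\n";
```

The `$body` node obtained this way could be passed to a function such as DOMSubstr() in place of the chained property access.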

score 0 · Accepted Answer

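If the text is not long (because of the runtime), you could try this.

"But in this case, I need to cut about 120 characters off the beginning."

It does exactly that. Type in the text, or get it from somewhere, and enter the number of characters to erase from the beginning.

And I cannot stress this enough: it is a solution for short strings and not the best way to do it, but it is a fully working code sample.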

<?php
$text = "<a href='blablabla'>m</a>ylinks...<b>not this code is working</b>......";
$newtext = "";
$delete = 13;
$tagopen = false;

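while ($text != "") {
    // take the next character off the front of the string
    $checktag = $text[0];
    $text = substr($text, 1);
    if ($checktag == "<" || $tagopen == true) {
        // inside a tag: copy everything through verbatim
        $newtext .= $checktag;
        if ($checktag == ">") {
            $tagopen = false;
        } else {
            $tagopen = true;
        }
    } elseif ($delete > 0) {
        // text character that still has to be dropped
        $delete = $delete - 1;
    } else {
        $newtext .= $checktag;
    }
}
echo $newtext;
?>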

It returns:

<a href='blablabla'></a><b> this code is working</b>......
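
The same scan can be wrapped in a function and done in a single pass by indexing into the string instead of calling substr() on every iteration. This is a sketch of that idea, not the answerer's code; the function name `skip_leading_text` is hypothetical. Like the original, it counts bytes rather than characters and will be confused by a `<` or `>` inside attribute values or comments:

```php
<?php
/**
 * Hypothetical helper: drop the first $count text bytes from $html,
 * copying everything between "<" and ">" through unchanged.
 */
function skip_leading_text(string $html, int $count): string
{
    $out   = '';
    $inTag = false;
    $len   = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($inTag) {
            // inside a tag: keep the character, leave the tag on ">"
            $out .= $ch;
            if ($ch === '>') {
                $inTag = false;
            }
        } elseif ($ch === '<') {
            $inTag = true;
            $out .= $ch;
        } elseif ($count > 0) {
            $count--;            // text character to be dropped
        } else {
            $out .= $ch;         // text character to be kept
        }
    }
    return $out;
}

$result = skip_leading_text("<a href='blablabla'>m</a>ylinks...<b>not this code is working</b>......", 13);
echo $result, "\n"; // <a href='blablabla'></a><b> this code is working</b>......
```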
