php - Shortening text without splitting words or breaking html tags

Question

I am trying to cut off text after 236 characters without cutting words in half and while keeping html tags intact. This is what I am using right now:

$shortdesc = $_helper->productAttribute($_product, $_product->getShortDescription(), 'short_description');
$length = 236;
echo substr($shortdesc, 0, strrpos(substr($shortdesc, 0, $length), " "));

This works in the majority of cases, but it does not take html tags into account. For example, this text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>

will be cut with the tag still open. Is there a way to cut off the text after 236 characters while respecting html tags?
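To make the failure concrete, here is a small sketch that applies the same substr/strrpos cut to the example text and counts the tags that survive. The variable names are only for illustration.

```php
<?php
$text = 'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>';
$limit = 236;

// Same approach as above: cut at the last space within the first 236 bytes.
$cut = substr($text, 0, strrpos(substr($text, 0, $limit), ' '));

// The cut lands inside the <strong> element, so the closing tag is lost.
$opened = substr_count($cut, '<strong>');
$closed = substr_count($cut, '</strong>');
echo "opened: $opened, closed: $closed\n"; // opened: 1, closed: 0
```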

score 19 · Accepted Answer

The best solution I have come across for this is from the CakePHP framework's TextHelper class.

Here is the method:

/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string  $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
    $default = array(
        'ending' => '...', 'exact' => true, 'html' => false
    );
    $options = array_merge($default, $options);
    extract($options);

    if ($html) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen(strip_tags($ending));
        $openTags = array();
        $truncate = '';

        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }
    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if ($spacepos !== false) { // mb_strrpos() returns false when no space is found
            if ($html) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }
    $truncate .= $ending;

    if ($html) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

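Other frameworks may have similar (or different) solutions to this problem, so it is worth looking at them too. I linked to Cake's solution simply because it is the framework I am familiar with.

Edit:

I just tested this method with the OP's text in an app I am working on: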

<?php 
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>', 
236, 
array('html' => true, 'ending' => '')); 
?>

Output:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergr</strong>

Note that the output stops just short of completing the last word, but it includes the complete strong tags.
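The mid-word cut comes from the default 'exact' => true. With 'exact' => false the method backs up to the last space before appending the ending. As a rough, standalone illustration of that step only (the helper name cutAtLastSpace is made up here and is not part of CakePHP; it assumes the mbstring extension):

```php
<?php
// Hypothetical helper: shows the "back up to the last space" step in isolation.
function cutAtLastSpace(string $text): string
{
    $pos = mb_strrpos($text, ' ');
    // mb_strrpos() returns false when there is no space; leave the text unchanged then.
    return $pos === false ? $text : mb_substr($text, 0, $pos);
}

echo cutAtLastSpace('Stet clita kasd gubergr'), "\n"; // Stet clita kasd
```

With the method above, the corresponding call would be truncate($text, 236, array('html' => true, 'ending' => '', 'exact' => false)).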

score 15 · Accepted Answer

This should do it:

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = array();

    public static function trim($html, $maxLen = 25)
    {

        $dom = new DomDocument();

        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            $dom->loadHTML($html);
        } else {
            $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        }

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        // remove wrapper tags added by DD (doctype, html...)
        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            // http://stackoverflow.com/a/6953808/1058140
            $dom->removeChild($dom->firstChild);
            $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

            return $dom->saveHTML();
        }

        return $dom->saveHTML();
    }

    protected function walk(DomNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof DomText) {
                $this->totalLen += $nodeLen = strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    $node->nodeValue = substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

Use it like this: $str = Html::trim($str, 236);

(Demo here)

Some performance comparisons between this and CakePHP's regex solution:

There is very little difference, and at very large string sizes DomDocument is actually faster. In my opinion, reliability matters more than saving a few microseconds.
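The comment in walk() about mb_strlen / mb_substr matters as soon as the text contains non-ASCII characters. A small sketch of the difference, assuming UTF-8 input and the mbstring extension (the sample string is made up):

```php
<?php
// "\u{00E9}" is é and "\u{00E8}" is è; each takes two bytes in UTF-8.
$text = "caf\u{00E9} cr\u{00E8}me";

$bytes = strlen($text);               // counts bytes: 12
$chars = mb_strlen($text, 'UTF-8');   // counts characters: 10

// substr() cuts by bytes and can split a multi-byte character in half.
$byteCut = substr($text, 0, 4);
// mb_substr() cuts by characters and keeps the string valid.
$charCut = mb_substr($text, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump($charCut === "caf\u{00E9}");           // bool(true)
```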

score -2 · Accepted Answer

You can use an XML approach and push elements onto a string variable until the string's length exceeds 236.

Sample code:

for each node // text or tag
  push to the string var

  if string length > 236
    break

endfor

For parsing HTML in PHP, see http://simplehtmldom.sourceforge.net/
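A minimal PHP sketch of that loop, using the built-in DOMDocument instead of simplehtmldom (an assumption made here to keep the example dependency-free). The function name takeNodesUntil is hypothetical, and ASCII input is assumed. As in the pseudocode, whole top-level nodes are appended and the loop stops only once the limit has been exceeded, so the result can be longer than the limit.

```php
<?php
// Hypothetical helper: append whole top-level nodes until the text length exceeds $limit.
function takeNodesUntil(string $html, int $limit): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);            // collect parser warnings instead of printing them
    $dom->loadHTML('<div>' . $html . '</div>');  // the wrapper div gives one known container
    libxml_clear_errors();

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    $length = 0;
    foreach ($wrapper->childNodes as $node) {
        $out .= $dom->saveHTML($node);           // push the whole node (text or tag)
        $length += strlen($node->textContent);   // byte count; fine for ASCII input
        if ($length > $limit) {
            break;
        }
    }
    return $out;
}

echo takeNodesUntil('<p>one</p><p>two</p><p>three</p>', 4), "\n";
```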

score -2 · Accepted Answer

Here is a JS solution: trim-html

The idea is to split the HTML string so that you get an array whose elements are either an html tag (opening or closing) or a plain string.

var arr = html.replace(/</g, "\n<")
              .replace(/>/g, ">\n")
              .replace(/\n\n/g, "\n")
              .replace(/^\n/g, "")
              .replace(/\n$/g, "")
              .split("\n");

Then iterate over the array and count the characters.
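The same split-and-count idea can be sketched in PHP. This is only an illustration under simplifying assumptions, not a port of trim-html: the function name truncateHtmlTokens and the short list of void tags are made up here, entities such as &amp; are counted as several characters, and malformed markup is not handled.

```php
<?php
// Hypothetical helper: split into tag tokens and text tokens, count only the text.
function truncateHtmlTokens(string $html, int $limit): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps each tag as its own array element.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $voidTags = ['br', 'hr', 'img', 'input', 'meta', 'link']; // illustrative, not exhaustive
    $open = [];   // stack of currently open tag names
    $out = '';
    $count = 0;   // number of text characters emitted so far

    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            // Markup is copied through without being counted.
            if (preg_match('/^<(\/?)([a-z][a-z0-9]*)[^>]*>$/i', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (!in_array($name, $voidTags, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            $out .= $token;
            continue;
        }

        $remaining = $limit - $count;
        if (mb_strlen($token) > $remaining) {
            $out .= mb_substr($token, 0, $remaining); // partial text, then stop
            break;
        }
        $out .= $token;
        $count += mb_strlen($token);
        if ($count >= $limit) {
            break;
        }
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncateHtmlTokens('Hello <strong>brave new</strong> world', 9), "\n";
// Hello <strong>bra</strong>
```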

score -2 · Accepted Answer

I did it in JS. I hope this logic helps in PHP too.

splitText : function(content, count){
        var originalContent = content;
         content = content.substring(0, count);
          //If there is no occurrence of matches before the breaking point and the cut falls inside an html tag.
         if (content.lastIndexOf("<") > content.lastIndexOf(">")){
            content = content.substring(0, content.lastIndexOf('<'));
            count = content.length;
            if(originalContent.indexOf("</", count)!=-1){
                content += originalContent.substring(count, originalContent.indexOf('>', originalContent.indexOf("</", count))+1);
            }else{
                 content += originalContent.substring(count, originalContent.indexOf('>', count)+1);
            }
          //If the breaking point is in between tags.
         }else if(content.lastIndexOf("<") != content.lastIndexOf("</")){
            content = originalContent.substring(0, originalContent.indexOf('>', count)+1);
         }
        return content;
    },

I hope this logic helps someone.
