php - Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

Question

I have various HTML strings that I want to cut to 100 characters (counting the stripped content, not the original markup) without stripping tags and without breaking the HTML.

Original HTML string (288 characters):

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

Standard trim: trimming to 100 characters breaks the HTML, and the stripped content comes to only about 40 characters:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */

Stripped HTML: outputs the correct character count but obviously loses the formatting:

$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or a purifier to close the tags outputs clean HTML, but not 100 characters of displayed text.

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

The challenge: to output clean HTML and n characters (excluding the characters of the HTML elements themselves):

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */

score 13 · Accepted Answer

Use PHP's DOMDocument class to normalize an HTML fragment:

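// Parsing the fragment lets libxml close the unclosed <div> and <p>.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
// Select the <body> element that the parser wraps around the fragment.
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
// Serialize <body> together with its repaired contents.
echo($dom->saveXml($body->item(0)));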

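This question is similar to an earlier question, and I have copied and pasted one solution here. If the HTML is submitted by users, you will also need to filter out potential JavaScript attack vectors such as onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier are designed to catch and fix these problems and are far more comprehensive than any code I could post.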
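
The DOMDocument approach above repairs a fragment but does not count characters. As a rough sketch of how it might be extended to the original challenge, the code below walks the parsed tree, counts only the characters in text nodes, and drops everything after the limit. The names cut_html_sketch and trim_node are hypothetical (they do not come from any answer here). Wrapping the input in an explicit <body> and prepending an XML encoding declaration are assumptions made so that the fragment is read as UTF-8 and can be found again after parsing. It is not a sanitizer; user-submitted HTML still needs something like HTML Purifier.

```php
<?php
/**
 * Hypothetical helper: truncate HTML to $limit visible characters
 * while keeping the markup well formed.
 */
function cut_html_sketch(string $html, int $limit = 100, string $suffix = '...'): string
{
    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the encoding declaration makes libxml read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="utf-8" ?><body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $remaining = $limit;
    $truncated = false;
    trim_node($body, $remaining, $truncated);

    // Serialize the children of <body>, not <body> itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $truncated ? $out . $suffix : $out;
}

/**
 * Depth-first walk: shorten the text node that crosses the limit
 * and remove every node that comes after it.
 */
function trim_node(DOMNode $node, int &$remaining, bool &$truncated): void
{
    // Copy the live node list first, because children may be removed below.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
            $truncated = true;
            continue;
        }
        if ($child instanceof DOMText) {
            $length = mb_strlen($child->nodeValue, 'UTF-8');
            if ($length > $remaining) {
                $child->nodeValue = mb_substr($child->nodeValue, 0, $remaining, 'UTF-8');
                $truncated = true;
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } elseif ($child->hasChildNodes()) {
            trim_node($child, $remaining, $truncated);
        }
    }
}

// Usage: keep 15 visible characters of a small nested fragment.
echo cut_html_sketch('<div>With a <span class="a">span over here</span> and more text</div>', 15), "\n";
```

Because the cut is made on the parsed tree rather than on the raw string, a tag is never split in half and the serializer writes the closing tags. The suffix is appended outside the last closing tag, as in the expected output in the question.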

score 5 · Accepted Answer

I wrote another function to do this. It supports UTF-8:

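/**
 * Limit a string's length without breaking HTML tags.
 * Supports UTF-8.
 *
 * @param string $value HTML string to shorten
 * @param int    $limit Maximum width of the text, excluding tags (default 100)
 * @return string
 */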
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Trim to the limit plus the width taken up by the tags, and repeat
    // until the tag-stripped text fits. Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

See the demo.

I recommend calling html_entity_decode at the start of the function so that UTF-8 characters are preserved:

 $value = html_entity_decode($value);
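
To illustrate why decoding first matters for the character count, here is a small sketch. Passing the flags and the charset explicitly is a choice made for this example (the call above relies on PHP's defaults), and the sample strings are made up. The last lines show a side effect to keep in mind: escaped markup turns into real markup once it is decoded.

```php
<?php
// An entity such as &eacute; is eight characters until it is decoded.
$raw = 'caf&eacute; &amp; cr&egrave;me';
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

$rawLength = mb_strlen($raw, 'UTF-8');         // counts the entity text itself
$decodedLength = mb_strlen($decoded, 'UTF-8'); // counts the visible characters

echo $rawLength, ' -> ', $decodedLength, "\n";

// Side effect: text that was escaped on purpose becomes a tag again.
$escaped = html_entity_decode('&lt;b&gt;bold&lt;/b&gt;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $escaped, "\n";
```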

score 3 · Accepted Answer

You should use Tidy HTML: cut the string, then run Tidy to close the tags.

(Credit where credit is due)
