php - Cleaning up HTML in PHP to produce a clean string

Question

I have a lot of HTML data that I am writing to a PDF file using PHP. In the PDF, I want all of the HTML stripped out and cleaned up. For example, this:

<ul>
    <li>First list item</li>
    <li>Second list item which is quite a bit longer</li>
    <li>List item with apostrophe 's 's</li>
</ul>

should become:

First list item
Second list item which is quite a bit longer
List item with apostrophe 's 's

However, if I simply use strip_tags(), I get something like this:

   First list item&#8232;

   Second list item which is quite a bit
longer&#8232;

   List item with apostrophe &rsquo;s &rsquo;s

Note the indentation of the output as well.

Any tips on how to properly clean up the HTML into a tidy string, without messy whitespace or odd characters?

Thanks :)

score 5 · Accepted Answer

Those characters look like HTML entities. Try:

html_entity_decode( strip_tags( $my_html_code ) );
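The one-liner above decodes the entities but leaves the indentation and blank lines in place. A minimal sketch that combines both steps could look like the following. The function name html_to_plain_text is hypothetical (it is not from this thread), and the sketch assumes the input is UTF-8:

```php
<?php
// Hypothetical helper; the name and the UTF-8 assumption are illustrative only.
function html_to_plain_text(string $html): string
{
    // Strip the tags first, then turn entities such as &rsquo; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on any newline sequence, and explicitly on U+2028 (what &#8232; decodes to).
    // Fall back to a single line if preg_split fails, e.g. on invalid UTF-8.
    $lines = preg_split('/\R|\x{2028}/u', $text) ?: [$text];

    // Remove leading/trailing whitespace from each line and drop lines that end up empty.
    $lines = array_map('trim', $lines);
    $lines = array_filter($lines, function (string $line): bool {
        return $line !== '';
    });

    return implode("\n", $lines);
}

echo html_to_plain_text("<ul>\n    <li>First list item&#8232;</li>\n    <li>It&rsquo;s here</li>\n</ul>");
```

Note that &rsquo; decodes to the typographic apostrophe (U+2019), not a straight quote. If the PDF font lacks that glyph, it can be mapped to ' with str_replace after decoding.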

score 3 · Accepted Answer

You can either decode the result of strip_tags with html_entity_decode, or remove the entities with preg_replace:

$text = strip_tags($html_text);
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );

To remove the whitespace at the start of each line, use ltrim:

$content = join("\n", array_map("ltrim", explode("\n", $content )));

To keep the apostrophes, use this instead:

$text = strip_tags($html_text);
$text = str_replace("&rsquo;","'", $text); 
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );

score 0 · Accepted Answer

You can use the PHP Tidy library to clean up the HTML. In your case, though, I would use the DOMDocument class to extract the data from the HTML.
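As a rough sketch of the DOMDocument approach, assuming the text of interest lives in <li> elements as in the question and that the dom extension is available, something like this could work. The function name list_items_to_text is hypothetical:

```php
<?php
// Hypothetical helper; it assumes UTF-8 input and that only <li> text is wanted.
function list_items_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common way to hint UTF-8 to loadHTML.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $lines = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent has entities already decoded; collapse whitespace runs to one space.
        $lines[] = trim((string) preg_replace('/\s+/u', ' ', $li->textContent));
    }

    return implode("\n", $lines);
}

echo list_items_to_text("<ul>\n <li>First list item</li>\n <li>It&rsquo;s   here</li>\n</ul>");
```

Because the parser decodes entities and each item is read separately, no regex-based entity cleanup or per-line ltrim is needed with this approach.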
