php - nodeValue and character count in PHP

Question

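I have a string variable holding the contents of a text file (.html, for example) that I read with fopen(). I then pass it to strip_tags() so I can use the tag-free text as an article preview. Before doing that, though, I need to get the h1 nodeValue and count its characters, so that I can replace the zero in the code below with that count and end at that count plus 150.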

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
$StrippedFile=strip_tags($WholeFile);
$TextExtract = mb_substr("$StrippedFile", 0,150);

What is the best way to go about this? Is a parser the answer? This is the only situation [so far] in which I need to extract a value from an HTML tag.

score 2 · Accepted Answer

If you have structured text (HTML, XML, JSON, YAML, and so on), you should always use a proper parser unless you have a very good reason not to.

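You might get away with regular expressions in this situation, but the solution would be very fragile and could run into problems with character encoding, entities, or whitespace. The other solutions posted here all break in subtle ways. For example, suppose you have input like this: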

<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here's
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>
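To make those failure modes concrete, here is a minimal sketch of how byte-based counting goes wrong on the entity and the multi-byte character from the sample input above. The variable names are illustrative only, and the sketch assumes the mbstring extension is available.

```php
<?php
// Illustrative values: an entity and a multi-byte character like the ones
// in the sample input.
$entity = '&mdash;';
$copyright = "\xC2\xA9"; // the two UTF-8 bytes of the copyright sign

// Counting the raw markup treats the entity as 7 characters.
$entityRawLength = strlen($entity);

// Decoding first gives the single character the reader actually sees.
$decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$entityDecodedLength = mb_strlen($decoded, 'UTF-8');

// strlen() counts bytes, mb_strlen() counts characters.
$copyrightBytes = strlen($copyright);
$copyrightChars = mb_strlen($copyright, 'UTF-8');

// Cutting by bytes can split the character and leave invalid UTF-8.
$brokenCut = substr('a' . $copyright, 0, 2);
$cutIsValid = mb_check_encoding($brokenCut, 'UTF-8');
```

A parser sidesteps these cases because it hands back decoded text rather than raw markup.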

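Here is a solution using DOMDocument and DOMXPath that works on all but the worst HTML and always gives you a UTF-8 result of up to 150 characters (characters, not bytes), with all entities normalized to their character values.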

$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here\'s
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>';


$doc = new DOMDocument();
$doc->loadHTML($html);
// if you have a url or filename, you can use this instead:
// $doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);

// you can easily modify the XPath query to match the "title" of different documents
$titlenode = $xp->query('/html/body//h1[1]');

// XPath string positions are 1-based, so substring() has to start at 1
// to return 150 characters (starting at 0 would return only 149)
$xpath = 'normalize-space(substring(
        concat(
            normalize-space(.),
            " ",
            normalize-space(./following-sibling::*)
        ), 1, 150))';


$excerpt = null;
if ($titlenode->length) {
    $excerpt = $xp->evaluate($xpath, $titlenode->item(0));
}

var_export($excerpt);

This code outputs:

'Title — maybe emdash counted as 7 characters and whitespace counted excessively too. And here\'s a utf-8 character that may get split in the middle: ©;'

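The basic idea is to match the h1 (or whatever your title element is) with XPath, take the string value of that element and of the element that follows it, and truncate the result to 150 characters, also in XPath. Note that when XPath converts the following-sibling::* node set to a string, it uses only the first node in the set, so only the first following sibling contributes text. Keeping everything in XPath avoids all the messy character set and entity issues you would otherwise have to handle in PHP.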
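If what you need is the h1 text and its character count on the PHP side, as the question asks, the same parser can supply it. The following is a minimal sketch rather than a definitive implementation: the helper name first_h1_info() is hypothetical, and it assumes PHP 7.1 or later with the dom and mbstring extensions.

```php
<?php
// Hypothetical helper: returns the text of the first <h1> and its length
// in characters, or null when the document has no <h1>.
function first_h1_info(string $html): ?array
{
    // Collect libxml warnings internally so sloppy HTML does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $h1 = $doc->getElementsByTagName('h1')->item(0);
    if ($h1 === null) {
        return null;
    }

    // nodeValue is UTF-8 with entities already decoded; collapse whitespace
    // so the count matches what a reader would see.
    $title = trim(preg_replace('/\s+/u', ' ', $h1->nodeValue));

    return ['title' => $title, 'length' => mb_strlen($title, 'UTF-8')];
}

$info = first_h1_info('<html><body><h1>Hello &mdash; world</h1><p>Body</p></body></html>');
```

The returned length could then stand in for the zero in the question's mb_substr() call, provided the stripped text is normalized the same way as the title.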

score 0 · Accepted Answer

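If you are sure about the contents of the files you are processing and you know the title is in an H1, you could split the string you have read into two strings at the position of </h1> (there are many ways to do this; strstr() is one example).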

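You can then strip the tags from the first string to get the title, and strip the tags from the second string to get the content. This assumes the file contains only a single h1, holding the title, before the DOM element that contains the article content.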

Note that this is not the best way to parse arbitrary articles from around the web. For a more general solution, look into a dedicated parser class.

Here is a code sample:

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
// Modified part
$content = strip_tags(strstr($WholeFile, '</h1>'));
$title = strip_tags(strstr($WholeFile, '</h1>', true)); // the third argument requires PHP 5.3.0 or later
$TextExtract = mb_substr($content, 0,150);
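strstr() returns false when the needle is not found, so the sample above quietly produces empty strings for a file without an </h1>. Below is a guarded sketch of the same idea. The helper name split_at_h1() and its default length are assumptions made for illustration, and it assumes PHP 7.1 or later with mbstring.

```php
<?php
// Hypothetical helper: splits an HTML string at the first </h1> and returns
// [title, preview], or null when no </h1> is present.
function split_at_h1(string $html, int $previewLength = 150): ?array
{
    $before = strstr($html, '</h1>', true); // everything before the closing tag
    $after = strstr($html, '</h1>');        // the closing tag and everything after it

    if ($before === false || $after === false) {
        return null;
    }

    $title = trim(strip_tags($before));
    $content = trim(strip_tags($after));

    // mb_substr() cuts by characters, not bytes.
    return [$title, mb_substr($content, 0, $previewLength, 'UTF-8')];
}

$result = split_at_h1('<html><body><h1>My Title</h1><p>Body text here.</p></body></html>');
$missing = split_at_h1('<p>No heading</p>');
```

The single-h1 assumption still applies, and if the file has a head section with a title element, that text ends up in the extracted title as well.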
