php - Wrong keywords when extracting content from a website (OOP)

Question

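I am running into a problem when extracting keywords from a website (a wiki article). The extracted keywords are not really keywords: they are words taken from the HTML source, not from the site's content.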

I use the following code:

include("Extkeys.php");
[...]
if (empty($keywords)) {
    $ekeywords = new Extkeys;
    $keywords = $ekeywords->Keys($webhtml);
}

And the code of "Extkeys" is the following:

<?php
class Extkeys {
    function Keys($webhtml) {
        $webhtml = $this->clean($webhtml);
        $blacklist = 'de,la,los,las,el,ella,nosotros,yo,tu,el,te,mi,del,ellos'; // stop words to drop
        $sticklist = 'test'; // always prepended to the result
        $minlength = 3;      // shortest word that is kept
        $count = 17;         // maximum number of keywords

        // Replace punctuation with spaces
        $webhtml = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $webhtml);
        $webhtml = preg_replace('/¡/', '', $webhtml);
        $webhtml = preg_replace('/¿/', '', $webhtml);

        // Count how often each lower-cased word occurs
        $keysArray = explode(" ", $webhtml);
        $keysArray = array_count_values(array_map('strtolower', $keysArray));
        $blackArray = explode(",", $blacklist);

        foreach ($blackArray as $blackWord) {
            if (isset($keysArray[trim($blackWord)]))
                unset($keysArray[trim($blackWord)]);
        }
        arsort($keysArray); // most frequent words first
        $i = 1;
        $keywords = "";
        foreach ($keysArray as $word => $instances) {
            if ($i > $count) break;
            if (strlen(trim($word)) >= $minlength && is_string($word)) {
                $keywords .= $word . ", ";
                $i++;
            }
        }

        $keywords = rtrim($keywords, ", ");

        // $sticklist is joined to the first keyword with no separator
        return $keywords = $sticklist . '' . $keywords;
    }

    function clean($webhtml) {
        // Remove e-mail addresses; the result goes to $desc, which is never used afterwards
        $regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/ix';
        $desc = preg_replace($regex, '', $webhtml);
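        // Strip <script> blocks
        $webhtml = preg_replace( "'<script[^>]*>.*?</script>'si", '', $webhtml );
        // Rewrite links as "text (href)"
        $webhtml = preg_replace( '/<a [^>]*href\s*=\s*"?([^>" ]*)"?[^>]*>([^<]+)<\/a>/is', '\2 (\1)', $webhtml );
        // Strip HTML comments
        $webhtml = preg_replace( '/<!--(.|\s)*?-->/', '', $webhtml );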
        $webhtml = preg_replace( '/{.+?}/', '', $webhtml );
        $webhtml = preg_replace( '/&nbsp;/', ' ', $webhtml );
        $webhtml = preg_replace( '/&amp;/', ' ', $webhtml );
        $webhtml = preg_replace( '/&quot;/', ' ', $webhtml );
        $webhtml = strip_tags( $webhtml );
        $webhtml = htmlspecialchars($webhtml);
        $webhtml = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $webhtml);

        // Collapse runs of spaces into a single space
        while (strchr($webhtml, "  ")) {
            $webhtml = str_replace("  ", " ", $webhtml);
        }

        // Make sure every '.' or ',' is followed by a space
        for ($cnt = 1; $cnt < strlen($webhtml) - 1; $cnt++) {
            if (($webhtml[$cnt] == '.') || ($webhtml[$cnt] == ',')) {
                if ($webhtml[$cnt + 1] != ' ') {
                    $webhtml = substr_replace($webhtml, ' ', $cnt + 1, 0);
                }
            }
        }
        return $webhtml;
    }
}
?>

This is an example of the extracted keywords:

testfalse, lang, {mw, loader, window, function, true, vector, user, gadget, mediawiki, legacy, options, usebetatoolbar, implement, resourceloader, default

From the article: http://en.wikipedia.org/wiki/Searchengine

The "Extkeys" code is copied from a tutorial; I adapted it to get it working.

How can I make the code extract the keywords of the website rather than of the HTML?

Thanks in advance!

score 1 · Accepted Answer

Assuming I understand your question, I think the solution you are looking for is simply the following.

Instead of taking the HTML as a parameter, it reads the HTML from a URL (e.g. http://www.whatever.com/page.html) and uses that to generate the keys.

function Keys($url) {
    $webhtml = file_get_contents($url);
    // ... rest of Keys() unchanged ...
}
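One thing to watch with this approach is that file_get_contents() returns false when the read fails, and that value would then flow into the keyword code. A minimal sketch of a wrapper that fails loudly instead; the name fetchHtml is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// read a page and throw instead of silently returning false.
function fetchHtml(string $url): string
{
    // file_get_contents() returns false on failure and emits a warning,
    // which is suppressed here because the failure is handled below.
    // Reading http(s) URLs this way requires allow_url_fopen to be enabled.
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not read $url");
    }
    return $html;
}

// Possible use with the class from the question:
// $keywords = (new Extkeys)->Keys(fetchHtml('http://www.whatever.com/page.html'));
```

Note that this only changes where the HTML comes from; the markup still has to be cleaned before the words are counted.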

score 1 · Accepted Answer

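First extract the content from the page, then look for keywords. In other words, you want to find the actual content of the page and strip out elements such as sidebars and footers. Google "HTML content extraction"; there are plenty of articles on it.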
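As a rough illustration of that idea (not a real content-extraction algorithm), PHP's bundled DOM extension can drop the elements that rarely hold article text before the words are counted. The function name extractVisibleText and the list of tags treated as non-content are assumptions made for this sketch:

```php
<?php
// Hypothetical helper (an assumption, not from the original post): returns the
// text of an HTML page with scripts, styles and common page chrome removed.
function extractVisibleText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed; collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    // Note: without a charset declaration in the markup, loadHTML() assumes
    // ISO-8859-1, so non-ASCII pages may need an encoding hint first.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Which tags count as "not content" is an assumption; adjust per site.
    $xpath = new DOMXPath($doc);
    $junk = $xpath->query('//script | //style | //noscript | //nav | //header | //footer | //aside');
    // Copy the node list into an array before removing nodes from the tree.
    foreach (iterator_to_array($junk) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : '';

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$html = '<html><head><script>var mw = {loader: true};</script></head>'
      . '<body><nav>Home | About</nav><p>A search engine indexes web pages.</p>'
      . '<footer>Legal notice</footer></body></html>';

echo extractVisibleText($html), "\n"; // prints "A search engine indexes web pages."
```

The returned text could then be passed to the word-counting part of Keys() in place of the raw HTML.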

I did this once in Java; there is a library called boilerpipe. I don't know whether there is a PHP port or interface for it; a quick Google search did not turn anything up. But I'm sure there are similar libraries for PHP.

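The simplest way to strip the HTML without specifically looking for the page content is to remove all the HTML with a regular expression: s/<[^>]+>//g. However, for a search engine that is probably not the best approach, because it leaves a lot of junk that can spoil the key extraction.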
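To show the kind of junk this leaves behind, here is a small sketch in PHP comparing the tag-only regex with a variant that first drops whole script and style elements. Both function names are hypothetical, and the second pattern is an illustrative assumption, not a robust HTML parser:

```php
<?php
// Hypothetical helper: the one-step approach, removing every tag.
function stripTagsNaive(string $html): string
{
    return preg_replace('/<[^>]+>/', '', $html);
}

// Hypothetical helper: remove <script>/<style> elements (tags and contents)
// first, then remove the remaining tags.
function stripTagsAfterRemovingCode(string $html): string
{
    $withoutCode = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html);
    return stripTagsNaive($withoutCode);
}

$html = '<p>Search engines</p><script>var mw = {loader: true};</script>';

echo stripTagsNaive($html), "\n";             // prints "Search enginesvar mw = {loader: true};"
echo stripTagsAfterRemovingCode($html), "\n"; // prints "Search engines"
```

With the tag-only regex the JavaScript source survives as plain text, which is the sort of input that would produce keywords like "mw" and "loader".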

Edit: Here is an article on content extraction using PHP.
