text - Extracting all the text from a website to build a concordance

Question

How can I grab all the text in a website? I don't just mean Ctrl+A/Ctrl+C. I'd like to extract all the text from a website (and all of its associated pages) and use it to build a concordance of the words on that site. Any ideas?

score 1 · Accepted Answer

I was intrigued by this, so I've written the first part of a solution.

The code is written in PHP because of its convenient strip_tags function. It's also rough and procedural, but I feel it demonstrates my idea.

<?php
$url = "http://www.stackoverflow.com";

// To use this you'll need to get a key for the Readability Parser API: http://readability.com/developers/api/parser
$token = "";

// Make an HTTP GET request to the Readability API and decode the returned JSON.
// The page URL and token are URL-encoded so they survive being embedded in the query string.
$parserResponse = json_decode(file_get_contents("http://www.readability.com/api/content/v1/parser?url=" . urlencode($url) . "&token=" . urlencode($token)));

// Only the content string in the JSON object is of interest
$content = $parserResponse->content;

// Strip the HTML tags from the article content
$wordsOnPage = strip_tags($content);

$wordCounter = array();

// Split on any run of whitespace (spaces, tabs, newlines) and drop empty strings
$wordSplit = preg_split('/\s+/', $wordsOnPage, -1, PREG_SPLIT_NO_EMPTY);

// Loop through each word in the article, keeping count of how many times each has been seen
foreach ($wordSplit as $word)
{
    incrementWordCounter($word);
}

// Sort the array so the most frequent words are at the end
asort($wordCounter);

// And dump the array
var_dump($wordCounter);

function incrementWordCounter($word)
{
    global $wordCounter;

    if (isset($wordCounter[$word]))
    {
        $wordCounter[$word] = $wordCounter[$word] + 1;
    }
    else
    {
        $wordCounter[$word] = 1;
    }
}
?>

To get this working I had to configure PHP for SSL, which the Readability API uses.

The next step in the solution would be to search for links in each page and call this recursively, to handle the "associated pages" requirement.
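
As a rough illustration of that link-following step, here is a minimal sketch, not part of the original answer. The names extractSameHostLinks, crawlSite, $fetch and $maxPages are all hypothetical. It assumes PHP's DOM extension is available, handles only absolute and root-relative links (other relative links are skipped), ignores ports, and uses a queue with a "seen" set instead of literal recursion so that pages are not visited twice and the crawl stays bounded.

```php
<?php
// Hypothetical helper: collect absolute links on the same host as $baseUrl.
function extractSameHostLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return array();
    }
    // Real-world HTML is rarely well-formed; collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']; // assumption: no port
    $links = array();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // skip empty and in-page links
        }
        $href = explode('#', $href, 2)[0]; // drop any fragment
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = $origin . $href; // root-relative link
        }
        if (parse_url($href, PHP_URL_HOST) === $base['host']) {
            $links[$href] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Hypothetical crawler: breadth-first, same host only, capped at $maxPages.
// $fetch is any callable that takes a URL and returns its HTML, or false on failure.
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $queue = array($startUrl);
    $seen = array($startUrl => true);
    $pages = array();

    while (count($queue) > 0 && count($pages) < $maxPages) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            continue; // fetch failed or the page was empty
        }
        $pages[$url] = $html;
        foreach (extractSameHostLinks($html, $url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages; // map of URL => raw HTML
}
```

Passing 'file_get_contents' as $fetch would crawl a live site. The crawler needs the raw HTML to find links, so each discovered URL would then be fed to the word-counting step separately.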

Also, the code above only gives you the raw word counts; you would want to process that data a bit more to make it meaningful.
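
One possible form of that extra processing, sketched here as an assumption rather than as the original author's method, is to lowercase the text, drop punctuation and stop words, and record each word together with its neighbouring words, which is closer to a concordance than a bare count. The function name buildConcordance and its $stopWords and $window parameters are hypothetical; the sketch assumes valid UTF-8 input and only lowercases ASCII letters.

```php
<?php
// Hypothetical helper: map each word to the list of contexts it appears in
// (a simple keyword-in-context concordance).
function buildConcordance(string $text, array $stopWords = array(), int $window = 3): array
{
    // Take runs of letters (apostrophes allowed inside a word); punctuation is dropped.
    preg_match_all('/\p{L}[\p{L}\']*/u', strtolower($text), $matches);
    $tokens = $matches[0];
    $stop = array_flip($stopWords);
    $concordance = array();

    foreach ($tokens as $i => $word) {
        if (isset($stop[$word])) {
            continue; // stop words get no entry but still appear as context
        }
        // Up to $window words on either side of the current word.
        $start = max(0, $i - $window);
        $context = array_slice($tokens, $start, $i - $start + $window + 1);
        $concordance[$word][] = implode(' ', $context);
    }

    ksort($concordance); // alphabetical order, as in a printed concordance
    return $concordance;
}
```

The number of contexts stored for a word is its frequency, so count($concordance[$word]) gives the same information as the raw counter while keeping the surrounding text.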
