php - Most used words in text with PHP

Question

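I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words such as "a, if, you, have, etc." from the count? Or do I have to remove those elements after counting? How would I do this? Thanks in advance.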

<?php

$text = "A very nice to tot to text. Something nice to think about if you're into text.";


$words = str_word_count($text, 1); 

$frequency = array_count_values($words);

arsort($frequency);

echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

score 11 · Accepted Answer

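Here is a function that extracts the most common words from a string. It takes three parameters: the string, an array of stop words, and the number of keywords to return. Load the stop words from a text file with the PHP function that reads a file into an array: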

$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$this->extract_common_words($text, $stop_words);

You can use this stop_words.txt file as your primary stop-word file, or create your own.
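As a minimal sketch of how the stop-word file ends up as an array: the temporary file and its contents below are made up for illustration and stand in for stop_words.txt, which is assumed to hold one word per line.

```php
<?php
// Hypothetical stop-word list, one word per line (with a blank line on purpose),
// written to a temporary file that stands in for stop_words.txt.
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "a\nif\n\nyou\nhave\nto\n");

// file() returns one array element per line.
// FILE_IGNORE_NEW_LINES strips the trailing newline from each element;
// FILE_SKIP_EMPTY_LINES drops the blank line.
$stop_words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);

print_r($stop_words);
```

Without FILE_IGNORE_NEW_LINES each element would keep its trailing newline, and in_array() would then never match a plain word such as "if".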

function extract_common_words($string, $stop_words, $max_count = 5) {
      $string = preg_replace('/\s+/', ' ', $string); // collapse runs of whitespace into a single space
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase
    
      preg_match_all('/\b.*?\b/i', $string, $match_words);
      $match_words = $match_words[0];
       
      foreach ( $match_words as $key => $item ) {
          if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
              unset($match_words[$key]);
          }
      }  
       
      $word_count = str_word_count( implode(" ", $match_words) , 1); 
      $frequency = array_count_values($word_count);
      arsort($frequency);
      
      $keywords = array_slice($frequency, 0, $max_count);
      return $keywords;
}

score 4 · Accepted Answer

Here is my solution using built-in PHP functions:

most_frequent_words - Finds the most frequently occurring words in a string

function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make string lowercase

    $words = str_word_count($string, 1); // Returns an array containing all the words found inside the string
    $words = array_diff($words, $stop_words); // Remove black-list words from the array
    $words = array_count_values($words); // Count the number of occurrence

    arsort($words); // Sort based on count

    return array_slice($words, 0, $limit); // Limit the number of words and returns the word array
}

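Returns an array containing the most frequently occurring words in the string.

Parameters:

string $string - The input string.

array $stop_words (optional) - The list of words to exclude from the result; defaults to an empty array.

int $limit (optional) - Limits the number of words returned; defaults to 5.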
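A usage sketch, applying the function to the text from the question. The function is repeated so the snippet runs on its own; the stop-word list and the limit of 2 are assumptions chosen for illustration.

```php
<?php
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string);             // normalize case
    $words = str_word_count($string, 1);       // split into an array of words
    $words = array_diff($words, $stop_words);  // drop the stop words
    $words = array_count_values($words);       // word => number of occurrences
    arsort($words);                            // highest count first
    return array_slice($words, 0, $limit);     // keep only the top $limit words
}

$text = "A very nice to tot to text. Something nice to think about if you're into text.";

// Illustrative stop words; note that str_word_count() keeps the apostrophe in "you're".
$stop_words = ['a', 'if', 'to', 'very', 'about', 'into', "you're"];

$top = most_frequent_words($text, $stop_words, 2);
print_r($top); // "nice" and "text", each with a count of 2
```

The stop words must be lowercase, because the input is lowercased before array_diff() compares it against them.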

score 2 · Accepted Answer

This can be done easily using array_diff():

$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");

print_r(array_diff($words, $stopwords));

which gives:

 Array
(
    [2] => do
    [3] => this
    [4] => I
    [5] => do
    [6] => that
)

However, you have to handle lowercase and uppercase yourself. The simplest way here is to convert the text to lowercase beforehand.
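To make the case issue concrete, here is a small sketch. The word list is a variation on the one above with a capitalized "If", and array_map() with strtolower is just one way to normalize the words.

```php
<?php
$words = array("If", "you", "do", "this", "I", "do", "that");
$stopwords = array("a", "you", "if");

// array_diff() compares values as strings, case-sensitively,
// so the capitalized "If" is not removed here.
$case_sensitive = array_diff($words, $stopwords);

// Lowercasing every word first lets "If" match the stop word "if".
// array_values() renumbers the keys, which array_diff() preserves.
$lowered = array_map('strtolower', $words);
$filtered = array_values(array_diff($lowered, $stopwords));

print_r($filtered);
```

Lowercasing also turns "I" into "i", so the stop-word list itself should be kept in lowercase.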

score 2 · Accepted Answer

There is no additional parameter or native PHP function that lets you pass in words to exclude. So I would keep what you have and ignore a custom set of words in the array returned by str_word_count.
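One possible way to do that, sketched here as an assumption and not as the only option, is to count first and then drop the stop words by key with array_diff_key(). The stop-word list is illustrative.

```php
<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";

// Illustrative stop words, all lowercase to match the lowercased text.
$stop_words = ['a', 'if', 'to', "you're"];

// Count every word first: word => number of occurrences.
$frequency = array_count_values(str_word_count(strtolower($text), 1));

// array_flip() turns the stop words into keys, so array_diff_key()
// removes every counted word that is also a stop word.
$frequency = array_diff_key($frequency, array_flip($stop_words));

arsort($frequency); // highest count first
print_r($frequency);
```

This answers the "remove elements after counting" variant of the question; filtering the word list before counting, as in the other answers, gives the same counts.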
