php - How can I find the most used two-word combinations in a block of text?

Question

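How can I find the most common pairs of consecutive words in a block of text? In other words, is there an online or offline tool (or some code) where I can paste text and have it output the frequency of the most used two-word combinations, like this:

From most used to least used:

"the cat" 2.9% "she said" 1.8% "went to" 1.2%

Thanks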

score 2 · Accepted Answer

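Split the text into two-word pairs (substr and strpos can do this):
- Starting at the beginning of a word, use strpos to find the second space after it, then use substr to take everything from the start of that word up to that second space. That substring is one two-word pair; move on to the next word and repeat.
Add each pair to a map (in PHP, an associative array) with the pair as the key and a count as the value. If the pair is already in the map, increment its count.
Once the whole text has been parsed, work out each pair's percentage by dividing its count by the total number of pairs counted.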
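
The steps above could be sketched roughly as follows. This is only one possible reading of the answer, not the answerer's own code: the function names count_word_pairs and pair_percentages are made up for illustration, and lowercasing the text and treating every non-word character as a space are assumptions.

```php
<?php
// Hypothetical helper (not from the original answer): counts adjacent word
// pairs by walking the string with strpos() and substr().
function count_word_pairs(string $text): array
{
    // Assumption: lowercase the text, turn every run of non-word characters
    // into a single space, and trim the ends.
    $text = trim(preg_replace('/\W+/', ' ', strtolower($text)));
    $text .= ' '; // sentinel space so the last pair also ends at a space

    $counts = [];
    $start = 0; // index where the current word begins
    while (($first = strpos($text, ' ', $start)) !== false) {
        // Look for the second space after the start of the current word.
        $second = strpos($text, ' ', $first + 1);
        if ($second === false) {
            break; // only one word left, so there is no further pair
        }
        $pair = substr($text, $start, $second - $start);
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        $start = $first + 1; // slide forward by one word
    }
    return $counts;
}

// Hypothetical helper: converts raw counts into a percentage of all pairs.
function pair_percentages(array $counts): array
{
    $total = array_sum($counts);
    $result = [];
    foreach ($counts as $pair => $n) {
        $result[$pair] = round($n / $total * 100, 1);
    }
    return $result;
}

$counts = count_word_pairs('The cat sat. The cat ran.');
print_r(pair_percentages($counts));
```

In this sample "the cat" is counted twice out of five pairs. Note that the sketch also pairs words across sentence boundaries ("sat the"), which may or may not be what you want.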

score 1 · Accepted Answer

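This was fun, so I had a little go at it.

It basically groups the words into pairs, indexes them in an array, increments a count each time a pair is found, and finally converts the counts to percentages :)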

$data = 'In the first centuries of typesetting, quotations were distinguished merely by indicating the speaker, and this can still be seen in some editions of the Bible. During the Renaissance, quotations were distinguished by setting in a typeface contrasting with the main body text (often Italic type with roman, or the other way round). Block quotations were set this way at full size and full measure.
Quotation marks were first cut in type during the middle of the sixteenth century, and were used copiously by some printers by the seventeenth. In Baroque and Romantic-period books, they could be repeated at the beginning of every line of a long quotation. When this practice was abandoned, the empty margin remained, leaving an indented block quotation';

//Clean the data: replace every non-word character with a space
$data = preg_replace("/[^\w]/"," ",$data);

//Consecutive spaces leave empty segments; these are skipped below,
//so no pair is formed across a gap such as ", " or ". "
$segments = explode(" ",$data);
$indexes = array();
$total_double_words = 0;

//Start at 1 so that there is always a previous segment to pair with
for($i=1;$i<count($segments);$i++)
{
   $first = trim($segments[$i - 1]);
   $second = trim($segments[$i]);

   if($first != "" && $second != "")
   {
      $key = $first . " " . $second;
      if(array_key_exists($key,$indexes))
      {
          $indexes[$key]["count"]++;
      }else
      {
          $indexes[$key] = array(
              'count' => 1,
              'words' => $key
          );
      }
      $total_double_words++;
   }
}

//Change to the percentage of all counted pairs
//(count($segments) would also include empty segments and single words)
foreach($indexes as $id => $set)
{
    $indexes[$id]['percentage'] = number_format((($set['count']/ $total_double_words) * 100),2) . "%";
}

var_dump($indexes);

You can see it live here: http://codepad.org/rcwpddW8
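
Since the question asks for the pairs ordered from most used to least used, here is a rough sketch of how the result could be sorted and trimmed to the top few entries. It is not part of the answer above: the function name top_word_pairs and its $limit parameter are made up for illustration, and it assumes you want case-insensitive matching and do not mind pairs being formed across punctuation.

```php
<?php
// Hypothetical helper (name and signature are illustrative): returns the
// $limit most frequent adjacent word pairs with their share of all pairs.
function top_word_pairs(string $text, int $limit = 10): array
{
    // Assumption: lowercase first so "In the" and "in the" count together,
    // then split on any run of non-word characters.
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 1; $i < count($words); $i++) {
        $pair = $words[$i - 1] . ' ' . $words[$i];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }

    arsort($counts); // highest count first, keys preserved
    $total = array_sum($counts);

    $report = [];
    foreach (array_slice($counts, 0, $limit, true) as $pair => $count) {
        $report[$pair] = number_format($count / $total * 100, 1) . '%';
    }
    return $report;
}

foreach (top_word_pairs('the cat sat on the mat and the cat ran', 3) as $pair => $pct) {
    echo $pair, ' ', $pct, PHP_EOL;
}
```

arsort keeps the pair strings as keys while ordering by count, so the first entry is always the most frequent pair; the order among pairs with equal counts is not something to rely on.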
