I have two tables. The first is tb_sentence:
========================================
| id | doc_id | sentence_id | sentence |
========================================
|  1 |      1 |           0 | AB       |
|  2 |      1 |           1 | CD       |
|  3 |      2 |           0 | EF       |
|  4 |      2 |           1 | GH       |
|  5 |      2 |           2 | IJ       |
|  6 |      2 |           3 | KL       |
========================================
First, I count the number of sentences for every document_id and store the counts in the variable $total_sentence. So the value of $total_sentence is Array ( [0] => 2 [1] => 4 ).
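For reference, the counting step could be sketched in plain PHP as below. This is only an illustration under assumptions: the rows are hard-coded from the tb_sentence table instead of being fetched from MySQL, and the result is keyed by doc_id (1 and 2) rather than by position (0 and 1) as in the array shown above.

```php
<?php
// Rows copied from tb_sentence (only the columns needed for counting).
$sentences = [
    ['doc_id' => 1, 'sentence_id' => 0],
    ['doc_id' => 1, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 0],
    ['doc_id' => 2, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 2],
    ['doc_id' => 2, 'sentence_id' => 3],
];

// Count sentences per document, keyed by doc_id (an assumption of this sketch).
$total_sentence = [];
foreach ($sentences as $row) {
    $doc = $row['doc_id'];
    $total_sentence[$doc] = ($total_sentence[$doc] ?? 0) + 1;
}
// $total_sentence is now [1 => 2, 2 => 4]
```

Keying by doc_id makes it possible to look up the sentence count of a specific document later without relying on array position.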
The second table is tb_stem:
====================================
| id | stem | doc_id | sentence_id |
====================================
|  1 | B    |      1 |           0 |
|  2 | A    |      1 |           1 |
|  3 | C    |      2 |           0 |
|  4 | A    |      2 |           1 |
|  5 | E    |      2 |           2 |
|  6 | C    |      2 |           3 |
|  7 | D    |      2 |           4 |
|  8 | G    |      2 |           5 |
|  9 | A    |      2 |           6 |
====================================
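Second, I need to group all the stem data and, for every doc_id, count the number of sentence_id values that contain the previous result ($token). The idea is to divide the total number of documents by the number of documents that contain the stem. The code: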
$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; // expected $token values: ABACDEG
}
$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
    $ndw = $row['ndw']; // expected $ndw values: 1122111
}
$idf = log($total_sentence / $ndw)+1; // each doc_id's $total_sentence must be divided by $ndw (doc_id = 1, doc_id = 2, etc.)
However, the result is not separated by document, as the table below shows:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | | | |
|2 | B | | | |
|3 | C | | | |
|4 | D | | | |
|5 | E | | | |
|6 | G | | | |
============================
The result should look like this:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | 1 | | |
|2 | B | 1 | | |
|3 | A | 2 | | |
|4 | C | 2 | | |
|5 | D | 2 | | |
|6 | E | 2 | | |
|7 | G | 2 | | |
============================
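One possible way to get one row per (doc_id, stem) pair, as in the table above, is sketched here in plain PHP. It is an illustration under assumptions, not a tested fix for the queries: the tb_stem rows are hard-coded, $total_sentence is assumed to be keyed by doc_id, and the names $seen and $result are made up for this example. The idf line reuses the log(...)+1 expression from the code in the question.

```php
<?php
// Rows copied from tb_stem: [stem, doc_id, sentence_id].
$stems = [
    ['B', 1, 0], ['A', 1, 1],
    ['C', 2, 0], ['A', 2, 1], ['E', 2, 2], ['C', 2, 3],
    ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
];
// Sentences per document, keyed by doc_id (assumed layout).
$total_sentence = [1 => 2, 2 => 4];

// Collect the distinct sentence_ids for every (doc_id, stem) pair.
$seen = [];
foreach ($stems as [$stem, $doc, $sentence]) {
    $seen[$doc][$stem][$sentence] = true;
}

// Build one result row per (doc_id, stem), ordered by doc_id, then stem.
$result = [];
ksort($seen);
foreach ($seen as $doc => $byStem) {
    ksort($byStem);
    foreach ($byStem as $stem => $sentenceIds) {
        $ndw = count($sentenceIds); // distinct sentences containing the stem
        $idf = log($total_sentence[$doc] / $ndw) + 1; // expression from the question
        $result[] = ['word' => $stem, 'doc_id' => $doc, 'ndw' => $ndw, 'idf' => $idf];
    }
}
```

The key point is that doc_id stays part of the grouping key all the way through, so A in document 1 and A in document 2 end up in separate rows. In SQL terms this would presumably correspond to selecting doc_id alongside stem in a single query grouped by stem and doc_id, instead of looping over one $token at a time.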
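Please help, thank you :)
The idf formula is idf = log(N/df), where N is the number of documents and df is the number of documents in which the term (t) appears. Every sentence is treated as a document. Here is an example idf calculation.
Documents: (D1) Do you read poetry while flying. (D2) Many people find it relaxing to read on long flights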
=================================================
| Term | Document1(D1)| D2| df | idf |
=================================================
| find | 0 | 1 | 1 |log(2/1)|
| fly | 1 | 1 | 2 |log(2/2)|
| long | 0 | 1 | 1 |log(2/1)|
| people | 0 | 1 | 1 |log(2/1)|
| poetry | 1 | 0 | 1 |log(2/1)|
| read | 1 | 1 | 2 |log(2/2)|
| relax | 0 | 1 | 1 |log(2/1)|
=================================================
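The example table can be reproduced with a few lines of PHP. This is a minimal sketch, assuming the 0/1 presence values are copied from the table as given and that log means the natural logarithm (PHP's log() default); the question does not state the base.

```php
<?php
// Term presence per document [D1, D2], copied from the table above.
$presence = [
    'find'   => [0, 1],
    'fly'    => [1, 1],
    'long'   => [0, 1],
    'people' => [0, 1],
    'poetry' => [1, 0],
    'read'   => [1, 1],
    'relax'  => [0, 1],
];
$N = 2; // number of documents (each sentence counts as one document)

$idf = [];
foreach ($presence as $term => $docs) {
    $df = array_sum($docs);       // number of documents containing the term
    $idf[$term] = log($N / $df);  // idf = log(N/df), natural log assumed
}
```

Terms that occur in both documents (fly, read) get log(2/2) = 0, while terms that occur in only one document get log(2/1).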