I have two tables:

tb_sentence:

================================
|id|doc_id|sentence_id|sentence|
================================
| 1|  1   |   0       |    AB  |
| 2|  1   |   1       |    CD  |
| 3|  2   |   0       |    EF  |
| 4|  2   |   1       |    GH  |
| 5|  2   |   2       |    IJ  |
| 6|  2   |   3       |    KL  |
================================

First, I count the number of sentences for every document (doc_id) and save the counts in the variable $total_sentence. So the value of $total_sentence is Array ( [0] => 2 [1] => 4 ).

The second table is tb_stem:

============================
|id|stem|doc_id|sentence_id|
============================
|1 | B  |  1   |     0     |
|2 | A  |  1   |     1     |
|3 | C  |  2   |     0     |
|4 | A  |  2   |     1     |
|5 | E  |  2   |     2     |
|6 | C  |  2   |     3     |
|7 | D  |  2   |     4     |
|8 | G  |  2   |     5     |
|9 | A  |  2   |     6     |
============================

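Second, I need to group all the data by stem and, for each doc_id, count the number of sentence_id values that contain each resulting stem ($token). The idea is to divide the total number of documents by the number of documents that contain the stem. The code: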

$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; //the result $token must be : ABACDEG
}

$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
    $ndw = $row['ndw']; //the result must be : 1122111
}

$idf = log($total_sentence / $ndw)+1; //$total_sentence for doc_id = 1 must be divide $ndw with the doc_id = 2, etc

But the result is not separated by document, as the table below shows:

============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |      |      |    |
|2 | B  |      |      |    |
|3 | C  |      |      |    |
|4 | D  |      |      |    |
|5 | E  |      |      |    |
|6 | G  |      |      |    |
============================

The result should look like this:

============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |   1  |      |    |
|2 | B  |   1  |      |    |
|3 | A  |   2  |      |    |
|4 | C  |   2  |      |    |
|5 | D  |   2  |      |    |
|6 | E  |   2  |      |    |
|7 | G  |   2  |      |    |
============================

Please help me, thank you :)

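The formula for idf is idf = log(N/df), where N is the number of documents and df is the number of documents in which the word (t) appears. Every sentence is treated as a document. Here is an example of the idf calculation.

Documents: "Do you read poetry while flying." and "Many people find it relaxing to read on long flights."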

=================================================
|     Term     | Document1(D1)| D2| df |   idf  |
=================================================
|     find     |     0        | 1 |  1 |log(2/1)|
|     fly      |     1        | 1 |  2 |log(2/2)|
|     long     |     0        | 1 |  1 |log(2/1)|
|    people    |     0        | 1 |  1 |log(2/1)|
|    poetry    |     1        | 0 |  1 |log(2/1)|
|     read     |     1        | 1 |  2 |log(2/2)|
|    relax     |     0        | 1 |  1 |log(2/1)|
=================================================
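
For reference, the example table above can be reproduced with a few lines of PHP. This is only a minimal sketch: the $docs array is a hypothetical stand-in that holds the already stemmed terms of the two example sentences, and log() is PHP's natural logarithm, so the base may differ from the one you intend.

```php
<?php
// Stemmed terms per document, copied from the example table
// (hypothetical input; stemming and stop-word removal are assumed done).
$docs = [
    'D1' => ['read', 'poetry', 'fly'],
    'D2' => ['people', 'find', 'relax', 'read', 'long', 'fly'],
];

// df: number of documents in which each term occurs at least once.
$df = [];
foreach ($docs as $terms) {
    foreach (array_unique($terms) as $term) {
        $df[$term] = ($df[$term] ?? 0) + 1;
    }
}
ksort($df);

// idf = log(N / df), with N the number of documents (natural log).
$n = count($docs);
$idf = [];
foreach ($df as $term => $count) {
    $idf[$term] = log($n / $count);
}

foreach ($idf as $term => $value) {
    printf("%-7s df=%d idf=%.4f\n", $term, $df[$term], $value);
}
```

Terms that occur in both documents (fly, read) get log(2/2) = 0, and every other term gets log(2/1), as in the table.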

1 Answer

This query gives you the table you are looking for:

SELECT t1.doc_id,
       t2.token AS word,
       t2.token_freq AS df,
       LOG(t1.docs / t2.token_freq) AS idf
FROM (SELECT doc_id, COUNT(sentence_id) AS docs   -- sentences per document
      FROM tb_sentence
      GROUP BY doc_id) AS t1
JOIN (SELECT stem AS token, doc_id, COUNT(sentence_id) AS token_freq   -- rows per (document, stem)
      FROM tb_stem
      GROUP BY doc_id, stem) AS t2
  ON t1.doc_id = t2.doc_id

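Note: UNIQUE, which your original query uses as a column alias, is a reserved word in MySQL and causes a syntax error unless it is quoted with backticks.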
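
To see what the grouping does without a database, here is a rough PHP sketch that applies the same steps to the sample rows from the question. The arrays $sentences and $stems are hypothetical in-memory copies of tb_sentence and tb_stem, and the variable names are only illustrative; it is not a replacement for the query.

```php
<?php
// Sample rows copied from the question (hypothetical in-memory stand-ins).
$sentences = [[1, 0], [1, 1], [2, 0], [2, 1], [2, 2], [2, 3]]; // [doc_id, sentence_id]
$stems = [
    ['B', 1, 0], ['A', 1, 1], ['C', 2, 0], ['A', 2, 1], ['E', 2, 2],
    ['C', 2, 3], ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
]; // [stem, doc_id, sentence_id]

// Like t1: number of sentences per doc_id.
$total = [];
foreach ($sentences as [$docId, $sentenceId]) {
    $total[$docId] = ($total[$docId] ?? 0) + 1;
}

// Like t2: number of rows per (doc_id, stem).
$ndw = [];
foreach ($stems as [$stem, $docId, $sentenceId]) {
    $ndw[$docId][$stem] = ($ndw[$docId][$stem] ?? 0) + 1;
}

// Join on doc_id and compute idf = log(docs / token_freq), natural log.
$rows = [];
foreach ($ndw as $docId => $counts) {
    ksort($counts);
    foreach ($counts as $stem => $count) {
        $rows[] = [
            'doc_id' => $docId,
            'word'   => $stem,
            'df'     => $count,
            'idf'    => log($total[$docId] / $count),
        ];
    }
}

foreach ($rows as $r) {
    printf("%d %s %d %.4f\n", $r['doc_id'], $r['word'], $r['df'], $r['idf']);
}
```

On the sample data this produces seven rows, A and B for doc_id 1 followed by A, C, D, E and G for doc_id 2, which is the per-document separation the question asks for.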

answered 2012-09-12T20:46:52.490