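I am trying to use Lucene to calculate the similarity between a large number of documents, using both BM25 and VSM.
Besides Lucene I am using GATE, an open-source framework for language processing tasks.
When I calculate the similarity between the documents (15 of them), I run into strange behavior.
With VSM the results look like this: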
Post-processing links before ranking
Ranking all links by similarities
3/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 3 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[1.6188]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1.5119]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[0.2702]
Clearing previous runtime results...
Score breakdown:
6.860396E-7 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    0.0032560423 = queryNorm
  6.860396E-7 = (MATCH) product of:
    0.0034322562 = (MATCH) sum of:
      0.0017054792 = (MATCH) weight(TERM:http in 1) [DefaultSimilarity], result of:
        0.0017054792 = score(doc=1,freq=2.0), product of:
          0.0045762537 = queryWeight, product of:
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.0032560423 = queryNorm
          0.37268022 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
    1.9988007E-4 = coord(3/15009)
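To make it easier to see where the tiny VSM number comes from, here is a small sketch that redoes the arithmetic of the breakdown above by hand. It assumes the classic TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the class and method names are made up for illustration and this is not Lucene code. The queryNorm, fieldNorm and coord inputs are copied from the explain output.

```java
public class VsmScoreSketch {
    // Classic TF-IDF term frequency factor: sqrt(freq).
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Classic TF-IDF inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // One term's contribution: queryWeight * fieldWeight.
    static double termScore(double freq, long docFreq, long maxDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;            // idf * queryNorm
        double fieldWeight = tf(freq) * idf * fieldNorm; // tf * idf * fieldNorm
        return queryWeight * fieldWeight;
    }

    // Sum of the three matching clauses ("http" once, "use" twice) times coord.
    static double totalScore(double queryNorm, double fieldNorm, double coord) {
        double http = termScore(2.0, 3, 6, queryNorm, fieldNorm);
        double use = termScore(2.0, 5, 6, queryNorm, fieldNorm);
        return (http + use + use) * coord;
    }

    public static void main(String[] args) {
        double queryNorm = 0.0032560423; // value taken from the explain output
        double fieldNorm = 0.1875;       // value taken from the explain output
        double coord = 3.0 / 15009.0;    // 3 of 15009 query clauses matched
        System.out.println("http  = " + termScore(2.0, 3, 6, queryNorm, fieldNorm));
        System.out.println("use   = " + termScore(2.0, 5, 6, queryNorm, fieldNorm));
        System.out.println("total = " + totalScore(queryNorm, fieldNorm, coord));
    }
}
```

The two factors that shrink the score are queryNorm (about 0.003) and coord (3 of 15009 clauses matched); without them the sum of the term weights would be on a very different scale.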
With BM25 the strange behavior shows up:
Post-processing links before ranking
Ranking all links by similarities
40/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 40 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[10768.2471]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1798.1300]
Link = [12695.xml(0,58320)@Bug[15009] | 13091.xml(0,1721)@Feature[216]]@[965.0315]
Link = [5822.xml(0,10098)@Bug[1434] | 13091.xml(0,1721)@Feature[216]]@[372.0819]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[174.2649]
Link = [12695.xml(0,58320)@Bug[15009] | 12700.xml(0,410)@Feature[36]]@[97.6378]
Link = [5822.xml(0,10098)@Bug[1434] | 1910.xml(0,237)@Feature[21]]@[46.4066]
Link = [12694.xml(0,1504)@Bug[188] | 13091.xml(0,1721)@Feature[216]]@[35.8532]
Link = [5822.xml(0,10098)@Bug[1434] | 12701.xml(0,137)@Feature[14]]@[29.6364]
Link = [12698.xml(0,362)@Bug[56] | 12713.xml(0,18247)@Feature[1974]]@[22.4652]
Link = [132.xml(0,409)@Bug[33] | 12713.xml(0,18247)@Feature[1974]]@[21.1697]
Link = [5822.xml(0,10098)@Bug[1434] | 12700.xml(0,410)@Feature[36]]@[16.7317]
Link = [132.xml(0,409)@Bug[33] | 13091.xml(0,1721)@Feature[216]]@[15.8749]
Link = [12697.xml(0,257)@Bug[34] | 12713.xml(0,18247)@Feature[1974]]@[15.5943]
Link = [12696.xml(0,272)@Bug[40] | 12713.xml(0,18247)@Feature[1974]]@[14.8670]
Link = [5822.xml(0,10098)@Bug[1434] | 12702.xml(0,88)@Feature[9]]@[14.8045]
Link = [12694.xml(0,1504)@Bug[188] | 1910.xml(0,237)@Feature[21]]@[13.8415]
Link = [12694.xml(0,1504)@Bug[188] | 12700.xml(0,410)@Feature[36]]@[11.7942]
Link = [12703.xml(0,331)@Bug[43] | 12713.xml(0,18247)@Feature[1974]]@[11.2949]
Link = [12699.xml(0,616)@Bug[67] | 12713.xml(0,18247)@Feature[1974]]@[9.4193]
Link = [12695.xml(0,58320)@Bug[15009] | 12701.xml(0,137)@Feature[14]]@[8.6146]
Link = [12699.xml(0,616)@Bug[67] | 13091.xml(0,1721)@Feature[216]]@[7.1386]
Link = [12695.xml(0,58320)@Bug[15009] | 1910.xml(0,237)@Feature[21]]@[5.9274]
Link = [12698.xml(0,362)@Bug[56] | 13091.xml(0,1721)@Feature[216]]@[4.4054]
Link = [12699.xml(0,616)@Bug[67] | 12700.xml(0,410)@Feature[36]]@[4.0292]
Link = [12703.xml(0,331)@Bug[43] | 13091.xml(0,1721)@Feature[216]]@[3.3257]
Link = [12696.xml(0,272)@Bug[40] | 13091.xml(0,1721)@Feature[216]]@[2.5366]
Link = [12695.xml(0,58320)@Bug[15009] | 12702.xml(0,88)@Feature[9]]@[2.2157]
Link = [12699.xml(0,616)@Bug[67] | 1910.xml(0,237)@Feature[21]]@[2.0420]
Link = [12697.xml(0,257)@Bug[34] | 13091.xml(0,1721)@Feature[216]]@[0.9461]
Link = [12694.xml(0,1504)@Bug[188] | 12702.xml(0,88)@Feature[9]]@[0.9092]
Link = [12694.xml(0,1504)@Bug[188] | 12701.xml(0,137)@Feature[14]]@[0.8928]
Link = [12697.xml(0,257)@Bug[34] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12696.xml(0,272)@Bug[40] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12698.xml(0,362)@Bug[56] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12698.xml(0,362)@Bug[56] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12696.xml(0,272)@Bug[40] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12697.xml(0,257)@Bug[34] | 12701.xml(0,137)@Feature[14]]@[0.8178]
BM25 links everything with a "good" or high score. The explanation looks like this:
Score breakdown:
2.2157059 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    1.0 = queryNorm
  2.2157059 = (MATCH) sum of:
    1.3065486 = (MATCH) weight(TERM:http in 1) [BM25Similarity], result of:
      1.3065486 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.6931472 = idf(docFreq=3, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
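For comparison, the BM25 breakdown above can be recomputed by hand in the same way. The sketch below assumes the usual BM25 formulas (idf = ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5)) and the k1/b length-normalized term frequency); class and method names are hypothetical, and the field lengths are simply the values printed in the explain output.

```java
public class Bm25ScoreSketch {
    // BM25 inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long maxDocs) {
        return Math.log(1.0 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 term frequency normalization with parameters k1 and b.
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        double lengthFactor = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    // One term's contribution: idf * tfNorm (no queryNorm, no coord in this breakdown).
    static double termScore(double freq, long docFreq, long maxDocs, double k1, double b,
                            double fieldLength, double avgFieldLength) {
        return idf(docFreq, maxDocs) * tfNorm(freq, k1, b, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;            // defaults, as in the question
        double fieldLength = 28.444445;       // value taken from the explain output
        double avgFieldLength = 746.8333;     // value taken from the explain output
        double http = termScore(2.0, 3, 6, k1, b, fieldLength, avgFieldLength);
        double use = termScore(2.0, 5, 6, k1, b, fieldLength, avgFieldLength);
        System.out.println("http  = " + http);
        System.out.println("use   = " + use);
        System.out.println("total = " + (http + use + use));
    }
}
```

Unlike the VSM breakdown, nothing here multiplies the sum by a small queryNorm or coord factor, so the per-term weights are summed as they are.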
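For debugging purposes I have disabled term boosts and similar adjustments so that I can see the raw results. Normally, every value is clamped to 1 or 0 if it is greater than 1 or less than 0.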
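A minimal sketch of that clamping step, assuming it is a plain min/max clamp to the range [0, 1] (the name `clampToUnit` is made up for illustration and is not taken from my actual code):

```java
public class ClampSketch {
    // Clamp a raw similarity score into [0, 1]:
    // values above 1 become 1, values below 0 become 0.
    static double clampToUnit(double score) {
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        // Raw scores taken from the link lists above, plus one negative value.
        double[] raw = {10768.2471, 1.6188, 0.2702, -0.5};
        for (double score : raw) {
            System.out.println(score + " -> " + clampToUnit(score));
        }
    }
}
```

With such a clamp, every raw BM25 score above 1 collapses to the same value 1.0, so the ordering among those links is lost after normalization.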
I am using Lucene 5.0.0. The documents are ordinary tickets that contain references to other tickets.
The similarities are instantiated as follows:
new BM25Similarity(k1, b); where k1 = 1.2 and b = 0.75 (defaults). (BM25)
new DefaultSimilarity() (VSM)
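I cannot explain why the scores differ so much. As far as I can see, every score VSM produces is small.
Has anyone else encountered this strange behavior?
Any kind of help is appreciated!
- EDIT
I am also wondering why queryNorm equals 1.0 for every query with BM25, whereas with VSM it differs from query to query.
According to this: Lucene scoring: in what context is queryNorm used?
queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
So it should always be the same, right?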
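For reference, my reading of the classic TF-IDF documentation is that DefaultSimilarity computes queryNorm as 1 / sqrt(sumOfSquaredWeights), where each weight is a query term's idf times its boost. The sketch below only illustrates that formula; the class name and the example weights are hypothetical and not taken from my index.

```java
public class QueryNormSketch {
    // Classic TF-IDF style query normalization:
    // 1 / sqrt(sum of squared term weights), weight = idf * boost per query term.
    static double classicQueryNorm(double[] termWeights) {
        double sumOfSquaredWeights = 0.0;
        for (double w : termWeights) {
            sumOfSquaredWeights += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Two hypothetical queries with different term weights.
        System.out.println(classicQueryNorm(new double[] {1.0}));      // single term of weight 1
        System.out.println(classicQueryNorm(new double[] {3.0, 4.0})); // two heavier terms
    }
}
```

Under that formula the factor is constant within one query but changes whenever the set of query terms (or their idf values) changes, which would match what I see with VSM. Why BM25 reports 1.0 for every query is the part I am unsure about.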