php - Removing duplicate documents from a Zend Lucene index

Question

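The way I build and optimize my index is to process the records in chunks, indexing and optimizing one chunk at a time rather than converting everything at once. The problem I am facing now is that duplicate documents/records end up in the index. I need to know whether there is a function or some code for removing duplicates from the index. Thanks in advance.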

score 2 · Accepted Answer

You have to delete a record before you update it. That is how Lucene works: an existing record cannot be updated in place.

This is how to delete a record:

$index = Zend_Search_Lucene::open('data/index'); // 'data/index' is the index directory that Lucene generated

// 'listing_id' is a field I added when the index was first created;
// $listing_id is the ID value of the row I want to delete.
$query = new Zend_Search_Lucene_Search_Query_Term(
    new Zend_Search_Lucene_Index_Term($listing_id, 'listing_id')
);
$hits = $index->find($query);
foreach ($hits as $hit) {
    // $hit->id is not listing_id; it is Lucene's internal document number
    // for the document whose listing_id equals $listing_id.
    $index->delete($hit->id);
}

Now you can perform the update, which is really just an insert :). That is how Lucene works.

score 0 · Accepted Answer

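You need a term that serves as a unique identifier. Then, before adding a document to the index, delete any existing document that has that identifier.

A duplicate means that several documents share the same unique ID. So enumerate all the terms of the unique-ID field and look for terms that match two or more documents. As far as I know, there is no built-in way to do this.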
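The enumeration described above could be done by hand roughly like this. This is only a sketch under some assumptions: Zend Framework 1 is already loaded, `$index` comes from `Zend_Search_Lucene::open()`, the ID field was indexed as a keyword so each ID is a single term, and the highest internal document number is the newest copy. `removeDuplicates()` and `$idField` are names made up for illustration.

```php
<?php
// Sketch only: removeDuplicates() and $idField are hypothetical names.
// Assumes $index is an opened Zend_Search_Lucene index and that the
// unique-ID field was indexed untokenized (e.g. as a keyword field).
function removeDuplicates($index, $idField)
{
    $removed = 0;

    // terms() returns every term in the index, so this can be slow
    // and memory-hungry on a large index.
    foreach ($index->terms() as $term) {
        if ($term->field !== $idField) {
            continue; // only terms of the unique-ID field matter here
        }

        // Collect the internal document numbers that carry this ID,
        // skipping documents that are already marked as deleted.
        $docIds = array();
        foreach ($index->termDocs($term) as $docId) {
            if (!$index->isDeleted($docId)) {
                $docIds[] = $docId;
            }
        }

        // Keep the highest-numbered document (assumed to be the newest
        // copy) and delete all the others.
        sort($docIds);
        array_pop($docIds);
        foreach ($docIds as $docId) {
            $index->delete($docId);
            $removed++;
        }
    }

    $index->commit(); // make the deletions visible
    return $removed;
}
```

Calling something like `removeDuplicates($index, 'listing_id')` once after a chunk has been indexed would then return how many extra copies were dropped. Deleting the old document before each insert is still the cheaper approach, since it avoids walking the whole term dictionary.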

score 0 · Accepted Answer

Don't forget to call $index->commit() before adding new data. That was why my $index->find($query) kept returning duplicate data.

$index = Zend_Search_Lucene::open('/lucene/index');
$query = new Zend_Search_Lucene_Search_Query_Term(new Zend_Search_Lucene_Index_Term($id, 'key'));

$hits = $index->find($query);
foreach ($hits as $hit) {
    // $hit->id is not 'key'; it is Lucene's internal document number
    // for the document that has key = $id.
    $index->delete($hit->id);
}
$index->commit(); // apply the deletions before indexing new data

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::keyword('key', $id));
$doc->addField(Zend_Search_Lucene_Field::text('user', $user, 'utf-8'));
$index->addDocument($doc); // add the fresh copy to the index
