java - Generating a Term-Document matrix using Lucene 4.4

Question

I am trying to build a Term-Document matrix for a small corpus so that I can experiment further with LSI. However, I could not find a way to do this with Lucene 4.4.

I know how to get the TermVector for each document, as follows:

//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);    
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size());  //just testing

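I thought I could get the matrix simply by combining all the termVectors as its columns. However, the termVectors of different documents have different sizes, and I don't know how to pad a termVector with zeros. So this approach clearly does not work as it stands.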

So, could someone show me how to create Term-Document vectors with Lucene 4.4? (Sample code would be appreciated.)
If Lucene does not support this, what other approach would you recommend?

Many thanks,

score 2 · Accepted Answer

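I found a solution to my problem here. It is a very detailed example by Sujit, but the code was written for an older version of Lucene, so many things need to change. I will update the details once my code is finished. This is my solution, which works with Lucene 4.4: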

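import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Apache Commons Math 3 package names are assumed here
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class BuildTermDocumentMatrix {
private final IndexReader reader;
private final IndexSearcher searcher;
private final File corpus;                      //directory holding the documents
private final Map<String, Integer> termIdMap;   //term text -> row index

public BuildTermDocumentMatrix(File index, File corpus) throws IOException{
    reader = DirectoryReader.open(FSDirectory.open(index));
    searcher = new IndexSearcher(reader);
    this.corpus = corpus;
    termIdMap = computeTermIdMap(reader);
}   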
/**
*  Map each term to a fixed integer id so that we can build the document matrix later.
*  The id selects the term's row in the Term-Document matrix.
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
    Map<String,Integer> termIdMap = new HashMap<String,Integer>();
    int id = 0;
    Fields fields = MultiFields.getFields(reader);
    Terms terms = fields.terms("contents");
    TermsEnum itr = terms.iterator(null);
    BytesRef term = null;
    while ((term = itr.next()) != null) {               
        String termText = term.utf8ToString();              
        if (termIdMap.containsKey(termText))
            continue;
        //System.out.println(termText); 
        termIdMap.put(termText, id++);
    }

    return termIdMap;
}

/**
*  build term-document matrix for the given directory
*/
public RealMatrix buildTermDocumentMatrix () throws IOException {
    //iterate through directory to work with each doc
    int col = 0;
    int numDocs = countDocs(corpus);            //get the number of documents here      
    int numTerms = termIdMap.size();    //total number of terms     
    RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

    for (File f : corpus.listFiles()) {
        if (!f.isHidden() && f.canRead()) {
            //I build term document matrix for a subset of corpus so
            //I need to lookup document by path name. 
            //If you build for the whole corpus, just iterate through all documents
            String path = f.getPath();
            BooleanQuery pathQuery = new BooleanQuery();
            pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
            TopDocs hits = searcher.search(pathQuery, 1);

            //get term vector
            Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
            TermsEnum itr = termVector.iterator(null);
            BytesRef term = null;

            //compute term weight
            while ((term = itr.next()) != null) {               
                String termText = term.utf8ToString();              
                int row = termIdMap.get(termText);
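                //within a term vector, totalTermFreq() is the term's frequency in this document
                long termFreq = itr.totalTermFreq();
                //docFreq() on a term-vector enum only covers this one document,
                //so ask the index for the corpus-wide document frequency instead
                long docCount = reader.docFreq(new Term("contents", term));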
                double weight = computeTfIdfWeight(termFreq, docCount, numDocs);
                tdMatrix.setEntry(row, col, weight);
            }
            col++;
        }
    }       
    return tdMatrix;
}
}
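
The class above calls two helpers, countDocs and computeTfIdfWeight, whose bodies are not shown. The following is only a sketch of what they might look like, not the original implementation. The class name TermWeightHelpers is hypothetical, the weighting formula (raw term frequency times ln(numDocs / docFreq)) is an assumption, and countDocs is assumed to apply the same "not hidden and readable" filter as the loop in buildTermDocumentMatrix so that the column count matches.

```java
import java.io.File;

/** Hypothetical helpers; the answer does not show these two methods. */
class TermWeightHelpers {

    /**
     * Assumed weighting: raw term frequency times ln(numDocs / docFreq).
     * Returns 0 for non-positive inputs so that we never divide by zero.
     */
    static double computeTfIdfWeight(long termFreq, long docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        double idf = Math.log((double) numDocs / (double) docFreq);
        return termFreq * idf;
    }

    /**
     * Counts the files that the matrix-building loop would visit,
     * using the same filter (not hidden, readable).
     */
    static int countDocs(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {    // not a directory, or an I/O error occurred
            return 0;
        }
        int count = 0;
        for (File f : files) {
            if (!f.isHidden() && f.canRead()) {
                count++;
            }
        }
        return count;
    }
}
```

To use them, paste both methods into BuildTermDocumentMatrix so that the unqualified calls resolve. Note that with this particular formula a term that occurs in every document gets weight 0, because ln(1) = 0. If that is not what you want for LSI, a smoothed idf is a common alternative.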
