I am customizing the Highlighter plugin (using the FastVectorHighlighter, FVH) to output the position offsets of the query terms for a given search. So far I have been able to extract this information for normal queries using the code below. For phrase queries, however, the code returns the positions of every occurrence of every query term (i.e. everything in termSet), even occurrences that are not part of the matched phrase. Is there a way in Lucene to get the offset information of only the matched phrase for phrase queries using FVH?
// In DefaultSolrHighlighter.java::doHighlightingByFastVectorHighlighter()
SolrIndexSearcher searcher = req.getSearcher();
TermFreqVector[] tvector = searcher.getReader().getTermFreqVectors(docId);
// Assumes the first term vector belongs to the highlighted field and stores positions
TermPositionVector tvposition = (TermPositionVector) tvector[0];
Set<String> termSet = highlighter.getHitTermSet(fieldQuery, fieldName);
List<String> hitOffsetPositions = new ArrayList<String>();
for (String term : termSet)
{
    int index = tvposition.indexOf(term);
    if (index < 0)
        continue; // query term does not occur in this document
    int[] positions = tvposition.getTermPositions(index);
    // Join all positions of this term into one comma-separated string
    StringBuilder sb = new StringBuilder();
    for (int pos : positions)
    {
        if (sb.length() > 0)
            sb.append(',');
        sb.append(pos);
    }
    hitOffsetPositions.add(sb.toString());
}
if (snippets != null && snippets.length > 0)
{
    docSummaries.add(fieldName, snippets);
    docSummaries.add("hitOffsetPositions", hitOffsetPositions);
}
// In FastVectorHighlighter.java
// Wrapper function to get the query terms
public Set<String> getHitTermSet(FieldQuery fieldQuery, String fieldName)
{
    return fieldQuery.getTermSet(fieldName);
}
Current Output:
<lst name="6H500F0">
<arr name="name">
<str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
<str>2</str>
<str>3</str>
<str>10</str>
</arr>
Expected Output:
<lst name="6H500F0">
<arr name="name">
<str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
<str>2</str>
<str>3</str>
</arr>
The field that I am trying to highlight has termVectors="true", termPositions="true" and termOffsets="true", and I am using Lucene 3.1.0.
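To make the expected output concrete: what I am after amounts to keeping only those positions where the phrase terms occur at consecutive positions. Below is a minimal sketch of that filtering in plain Java, independent of Lucene. The class PhrasePositionFilter and the method matchedPositions are hypothetical names of my own and not part of Lucene or Solr. It assumes an exact phrase (slop 0), that the phrase terms are supplied in query order, and that the per-term positions have already been read from the TermPositionVector as in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class PhrasePositionFilter {

    // Returns the sorted positions covered by exact (slop 0) occurrences of the phrase.
    // termPositions maps each term to its positions in the field, for example the
    // arrays returned by TermPositionVector.getTermPositions(index).
    static List<Integer> matchedPositions(List<String> phraseTerms,
                                          Map<String, int[]> termPositions) {
        Set<Integer> matched = new TreeSet<Integer>();
        if (phraseTerms.isEmpty()) {
            return new ArrayList<Integer>(matched);
        }
        int[] firstPositions = termPositions.get(phraseTerms.get(0));
        if (firstPositions == null) {
            return new ArrayList<Integer>(matched); // first term absent: no match possible
        }
        // Put the positions of the remaining terms into sets for fast lookups.
        List<Set<Integer>> rest = new ArrayList<Set<Integer>>();
        for (int i = 1; i < phraseTerms.size(); i++) {
            int[] positions = termPositions.get(phraseTerms.get(i));
            if (positions == null) {
                return new ArrayList<Integer>(matched); // a phrase term is absent
            }
            Set<Integer> set = new HashSet<Integer>();
            for (int p : positions) {
                set.add(p);
            }
            rest.add(set);
        }
        // A phrase occurrence starts at 'start' if term i sits at position start + i.
        for (int start : firstPositions) {
            boolean allConsecutive = true;
            for (int i = 0; i < rest.size(); i++) {
                if (!rest.get(i).contains(start + i + 1)) {
                    allConsecutive = false;
                    break;
                }
            }
            if (allConsecutive) {
                for (int i = 0; i < phraseTerms.size(); i++) {
                    matched.add(start + i);
                }
            }
        }
        return new ArrayList<Integer>(matched);
    }

    public static void main(String[] args) {
        // Positions taken from the "Current Output" above: hard at 2, drive at 3 and 10.
        Map<String, int[]> termPositions = new HashMap<String, int[]>();
        termPositions.put("hard", new int[] {2});
        termPositions.put("drive", new int[] {3, 10});
        // prints [2, 3]
        System.out.println(matchedPositions(Arrays.asList("hard", "drive"), termPositions));
    }
}
```

This only handles a single exact phrase and ignores slop, so I would still prefer to reuse whatever phrase matching FVH already performs internally, if that information can be reached from the highlighter.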