Lucene Search: Highlighting Results with highlighter

Introduction to highlighter

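I have been working overtime these past few days and the blog has gone three days without an update; my apologies. When we run a query, we want the parts of each result that match the search text to be emphasized, as in the effect shown below.

Here the search text is “一步一步跟我学习lucene” ("Learn Lucene with me, step by step"). In the results the search engine displays, the user's input is rendered in a different color; this visual distinction between ordinary text and the text the user searched for is what highlighting means.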

The benefits:

The text blocks that match the search are easy to spot visually; the interface is friendlier.

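Lucene provides the highlighter plugin to produce this kind of effect.

highlighter highlights the query keywords in the results.

The highlighter package contains the functionality for highlighting query text on the results page. The Highlighter class is the core component of the package; it relies on classes such as Fragmenter, the fragment Scorer, and Formatter to let users customize how highlights are displayed.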

Example program

Here I reuse the directory file index built in the earlier posts.

package com.lucene.search.util;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;

public class HighlighterTest {
    public static void main(String[] args) {
        IndexSearcher searcher;
        TopDocs docs;
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            // SearchUtil is the query utility class from the earlier posts
            searcher = SearchUtil.getMultiSearcher("index", service);
            Term term = new Term("content", new BytesRef("lucene"));
            TermQuery termQuery = new TermQuery(term);
            docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
            ScoreDoc[] hits = docs.scoreDocs;
            QueryScorer scorer = new QueryScorer(termQuery);
            // Highlight format <B>keyword</B>; this is also the default format
            SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>", "</B>");
            Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
            // Size, in characters, of each returned fragment
            highlighter.setTextFragmenter(new SimpleFragmenter(20));
            Analyzer analyzer = new StandardAnalyzer();
            for (int i = 0; i < hits.length; i++) {
                Document doc = searcher.doc(hits[i].doc);
                // May be null when the field text contains no scoring fragment
                String str = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
                System.out.println(str);
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            e.printStackTrace();
        } finally {
            service.shutdown();
        }
    }
}
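The example above passes a fragment size of 20 to SimpleFragmenter. To give a feel for what size-based fragmenting means, here is a rough, Lucene-free sketch. The class FragmentSketch and its split method are hypothetical names made up for illustration, and the whitespace tokenization is an assumption; this is a simplified model of the idea, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified model of size-based fragmenting (not Lucene code).
class FragmentSketch {

    // Splits text into fragments of roughly fragmentSize characters,
    // breaking only at token boundaries (tokens are whitespace-separated here).
    static List<String> split(String text, int fragmentSize) {
        List<String> fragments = new ArrayList<>();
        Matcher tokens = Pattern.compile("\\S+").matcher(text);
        int fragmentStart = 0;
        int fragmentCount = 1;
        while (tokens.find()) {
            // A token whose end offset reaches the next size boundary opens a new fragment.
            if (tokens.end() >= fragmentSize * fragmentCount) {
                fragmentCount++;
                if (tokens.start() > fragmentStart) {
                    fragments.add(text.substring(fragmentStart, tokens.start()).trim());
                    fragmentStart = tokens.start();
                }
            }
        }
        // Whatever remains after the last boundary is the final fragment.
        String tail = text.substring(fragmentStart).trim();
        if (!tail.isEmpty()) {
            fragments.add(tail);
        }
        return fragments;
    }

    public static void main(String[] args) {
        for (String fragment : split("learn lucene step by step with me", 10)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

Because breaks fall only on token boundaries, a fragment can run slightly over the requested size; the size is a target rather than a hard limit in this sketch.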

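How Lucene's highlighter works:

A Highlighter object is created from a Formatter and a Scorer: the formatter defines how highlighted text is displayed, and the scorer defines how the text is scored for highlighting.

The scoring first derives a weight for each term of the query. On that basis it iterates over the tokens of the text, checking whether each token matches a query term and tracking the position at which it occurs in the field's text (which is what makes highlighting possible). Each distinct matching term is added to the total score only once. The method that scores each token is: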

public float getTokenScore() {
    position += posIncAtt.getPositionIncrement(); // record the token's position
    String termText = termAtt.toString();

    WeightedSpanTerm weightedSpanTerm;
    if ((weightedSpanTerm = fieldWeightedSpanTerms.get(termText)) == null) {
        return 0;
    }
    if (weightedSpanTerm.positionSensitive && !weightedSpanTerm.checkPosition(position)) {
        return 0;
    }

    float score = weightedSpanTerm.getWeight(); // get the term's weight

    // found a query term - is it unique in this doc?
    if (!foundTerms.contains(termText)) { // count each distinct term only once
        totalScore += score;
        foundTerms.add(termText);
    }
    return score;
}
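To make the deduplication step easier to see in isolation, here is a minimal plain-Java sketch of the same idea. TokenScoreSketch, scoreToken, and the weights map are hypothetical names, and the position check from the method above is deliberately left out; treat it as a simplified model under those assumptions rather than as Lucene's scorer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of per-token scoring (not Lucene code).
class TokenScoreSketch {
    private final Map<String, Float> queryTermWeights; // query term -> weight
    private final Set<String> foundTerms = new HashSet<>();
    private float totalScore = 0f;

    TokenScoreSketch(Map<String, Float> queryTermWeights) {
        this.queryTermWeights = new HashMap<>(queryTermWeights);
    }

    // Returns the weight of the token if it is a query term, otherwise 0.
    float scoreToken(String tokenText) {
        Float weight = queryTermWeights.get(tokenText);
        if (weight == null) {
            return 0f;
        }
        // Set.add returns false for a repeat, so each distinct term
        // contributes to the total only once.
        if (foundTerms.add(tokenText)) {
            totalScore += weight;
        }
        return weight;
    }

    float getTotalScore() {
        return totalScore;
    }

    public static void main(String[] args) {
        Map<String, Float> weights = new HashMap<>();
        weights.put("lucene", 1.0f);
        TokenScoreSketch scorer = new TokenScoreSketch(weights);
        for (String token : "learn lucene with lucene examples".split(" ")) {
            System.out.println(token + " -> " + scorer.scoreToken(token));
        }
        System.out.println("total = " + scorer.getTotalScore());
    }
}
```

Every occurrence of a query term still gets a non-zero token score, so every occurrence can be highlighted; only the total, which ranks fragments against each other, ignores repeats.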

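The formatter works as follows: it checks the text being processed, and if the totalScore obtained from the scorer is greater than 0, meaning the query text occurs in the corresponding field, it builds the output as preTag + matched text + postTag; otherwise it returns the original text unchanged.

The detailed algorithm is: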

public String highlightTerm(String originalText, TokenGroup tokenGroup) {
    if (tokenGroup.getTotalScore() <= 0) {
        return originalText;
    }

    // Allocate StringBuilder with the right number of characters from the
    // beginning, to avoid char[] allocations in the middle of appends.
    StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
    returnBuffer.append(preTag);
    returnBuffer.append(originalText);
    returnBuffer.append(postTag);
    return returnBuffer.toString();
}

Its default format is “<B></B>”, so a match is rendered as <B>keyword</B>.
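The same rule is easy to try with other tags. The following is a small plain-Java sketch, assuming a hypothetical TagFormatterSketch class that takes the score as a plain float instead of a TokenGroup; it illustrates the wrapping rule only and is not the library's formatter.

```java
// Hypothetical, simplified model of tag-based formatting (not Lucene code).
class TagFormatterSketch {
    private final String preTag;
    private final String postTag;

    TagFormatterSketch(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    // Wraps the text in the tags only when its score is positive.
    String format(String originalText, float totalScore) {
        if (totalScore <= 0) {
            return originalText;
        }
        return preTag + originalText + postTag;
    }

    public static void main(String[] args) {
        TagFormatterSketch formatter = new TagFormatterSketch("<em>", "</em>");
        System.out.println(formatter.format("lucene", 1.0f));
        System.out.println(formatter.format("search", 0f));
    }
}
```

In the real example program, the equivalent change would be passing different tag strings to the SimpleHTMLFormatter constructor.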

Using the scorer and the formatter, Highlighter analyzes the document text; the highlighted results are produced by calling getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments).
