Implementing a Custom Chinese Synonym Analyzer in Lucene

----------------------------------------
Lucene analysis: an introduction to Chinese analyzers
----------------------------------------

Paoding (the "庖丁解牛" analyzer): no longer maintained.
mmseg: uses the Sogou dictionary. To use it:
1. Import the jar. Two variants are distributed: one that bundles the dic dictionary files and one that does not. If you use the variant without dic, you must point it at a dictionary directory yourself.
2. Create an MMSegAnalyzer and pass it the dictionary location.

----------------------------------------
Lucene analysis: custom synonym analyzer, design sketch
----------------------------------------

Construction order: Analyzer -> TokenStream -> TokenFilter -> Tokenizer. This is the reverse of the order you follow when reading detailed token information back out of an analyzer.

/*
 * Custom Chinese synonym analyzer (backed by an mmseg dictionary)
 */
public class MySameAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        Dictionary dic = Dictionary.getInstance("F:\\BaiduYunDownload\\Cache\\lucune\\chinesedic");
        return new MySameTokenFilter(new MMSegTokenizer(new MaxWordSeg(dic), reader));
    }
}

----------------------------------------
Lucene analysis: custom synonym analyzer, implementing the filter
----------------------------------------

Pipeline: Reader -> MMSegTokenizer (tokenization) -> MySameTokenFilter (the custom filter that injects synonyms).
To store synonyms at the same position as the original token: look up the synonyms -> a synonym is found -> capture the current state -> advance to the next element -> emit elements from the synonym list -> restore the captured state -> store each synonym at the same position.

/*
 * Custom synonym token filter
 */
public class MySameTokenFilter extends TokenFilter {
    // Holds the term text
    private CharTermAttribute cta = null;
    // Holds the token's position-increment information
    private PositionIncrementAttribute pia = null;
    // Saved state of the current element, captured when it has synonyms
    private AttributeSource.State current;
    // Stack of pending synonyms
    private Stack<String> sames = null;

    protected MySameTokenFilter(TokenStream input) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
    }

    @Override
    public boolean incrementToken() throws IOException {
        // First emit any pending synonyms of the previous token
        while (sames.size() > 0) {
            // Pop one synonym off the stack
            String str = sames.pop();
            // Restore the previous token's state
            restoreState(current);
            // Overwrite the term text with the synonym
            cta.setEmpty();
            cta.append(str);
            // Position increment 0: same position as the original token
            pia.setPositionIncrement(0);
            return true;
        }
        // Advance to the next token
        if (!this.input.incrementToken())
            // No more tokens
            return false;
        if (getSameWords(cta.toString())) {
            // The token has synonyms: capture the current state so it can
            // be restored when the synonyms are emitted
            current = captureState();
        }
        return true;
    }

    /*
     * Look up synonyms for a term
     */
    private boolean getSameWords(String name) {
        Map<String, String[]> maps = new HashMap<String, String[]>();
        maps.put("我", new String[] { "俺", "咱" });
        maps.put("湖南", new String[] { "鱼米之乡", "湘" });
        String[] sws = maps.get(name);
        if (sws != null) {
            // Push every synonym onto the stack
            for (String str : sws) {
                sames.push(str);
            }
            return true;
        }
        return false;
    }
}
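The emission order this filter produces can be checked without Lucene. The sketch below is my own illustration, not code from the filter: it replays the same stack logic (emit the original token, push its synonyms, pop them out afterwards) on a plain list of tokens, using the same hard-coded synonym map.

```java
import java.util.*;

public class SynonymStackDemo {

    // Same hard-coded synonym map as getSameWords() above
    static final Map<String, String[]> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("我", new String[] { "俺", "咱" });
        SYNONYMS.put("湖南", new String[] { "鱼米之乡", "湘" });
    }

    // Replay of the filter's emission order on a plain token list
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Stack<String> sames = new Stack<>();
        for (String token : tokens) {
            out.add(token);                     // the original token first
            String[] sws = SYNONYMS.get(token);
            if (sws != null) {
                for (String s : sws) {
                    sames.push(s);              // push, as in getSameWords()
                }
            }
            while (!sames.isEmpty()) {
                out.add(sames.pop());           // popped in reverse push order
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("我", "来自", "湖南")));
        // → [我, 咱, 俺, 来自, 湖南, 湘, 鱼米之乡]
    }
}
```

Because Stack is LIFO, synonyms come out in the reverse of the order they were pushed; for indexing this does not matter, since all of them receive a position increment of 0 anyway.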

----------------------------------------
Lucene analysis: custom synonym analyzer, a cleaner design
----------------------------------------

Idea: programming against an interface is the way to go.

1. Define an interface that manages the synonyms:

/*
 * Interface for providing synonyms
 */
public interface MySameContxt {
    // Return the synonyms of a term as a String[]
    public String[] getSameWords(String name);
}

2. Implement the interface with a synonym store:

/*
 * Simple in-memory implementation of the synonym interface
 */
public class MySimpleSameContxt implements MySameContxt {
    Map<String, String[]> maps = new HashMap<String, String[]>();

    public MySimpleSameContxt() {
        maps.put("我", new String[] { "俺", "咱" });
        maps.put("湖南", new String[] { "鱼米之乡", "湘" });
    }

    public String[] getSameWords(String name) {
        return maps.get(name);
    }
}

3. Add the synonym provider as a field of the custom TokenFilter:

// The component that manages the synonym store
private MySameContxt sameContxt;
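The benefit of programming against the interface is that swapping the synonym source touches only one class; the filter and analyzer never change. As a hedged sketch (the class name and the "word=syn1,syn2" line format are my own, not from the article), a second implementation could parse synonyms out of plain text lines, which might in turn be read from a file or a database:

```java
import java.util.*;

// The interface from step 1, repeated here so the sketch is self-contained
interface MySameContxt {
    String[] getSameWords(String name);
}

// Hypothetical alternative to MySimpleSameContxt: synonyms come from
// "word=syn1,syn2" text lines instead of a hard-coded map.
public class MyLineSameContxt implements MySameContxt {
    private final Map<String, String[]> maps = new HashMap<>();

    public MyLineSameContxt(List<String> lines) {
        for (String line : lines) {
            String[] kv = line.split("=", 2);   // "word=syn1,syn2"
            maps.put(kv[0].trim(), kv[1].split(","));
        }
    }

    public String[] getSameWords(String name) {
        return maps.get(name);
    }

    public static void main(String[] args) {
        MySameContxt ctx = new MyLineSameContxt(
                Arrays.asList("我=俺,咱", "湖南=鱼米之乡,湘"));
        System.out.println(Arrays.toString(ctx.getSameWords("湖南")));
        // → [鱼米之乡, 湘]
    }
}
```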

Full listing:

/*
 * Custom synonym token filter
 */
public class MySameTokenFilter extends TokenFilter {
    // Holds the term text
    private CharTermAttribute cta = null;
    // Holds the token's position-increment information
    private PositionIncrementAttribute pia = null;
    // Saved state of the current element, captured when it has synonyms
    private AttributeSource.State current;
    // Stack of pending synonyms
    private Stack<String> sames = null;
    // The component that manages the synonym store
    private MySameContxt sameContxt;

    protected MySameTokenFilter(TokenStream input, MySameContxt sameContxt) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
        pia = this.addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
        this.sameContxt = sameContxt;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // First emit any pending synonyms of the previous token
        while (sames.size() > 0) {
            // Pop one synonym off the stack
            String str = sames.pop();
            // Restore the previous token's state
            restoreState(current);
            // Overwrite the term text with the synonym
            cta.setEmpty();
            cta.append(str);
            // Position increment 0: same position as the original token
            pia.setPositionIncrement(0);
            return true;
        }
        // Advance to the next token
        if (!this.input.incrementToken())
            // No more tokens
            return false;
        if (getSameWords(cta.toString())) {
            // The token has synonyms: capture the current state
            current = captureState();
        }
        return true;
    }

    /*
     * Look up synonyms through the MySameContxt interface
     */
    private boolean getSameWords(String name) {
        String[] sws = sameContxt.getSameWords(name);
        if (sws != null) {
            // Push every synonym onto the stack
            for (String str : sws) {
                sames.push(str);
            }
            return true;
        }
        return false;
    }
}

4. Implement the Analyzer that builds the TokenStream:

/*
 * Custom Chinese synonym analyzer (backed by an mmseg dictionary)
 */
public class MySameAnalyzer extends Analyzer {
    // The synonym store
    private MySameContxt sameContxt;

    public MySameAnalyzer(MySameContxt msc) {
        this.sameContxt = msc;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The custom synonym store manager is passed through to the filter
        Dictionary dic = Dictionary.getInstance("F:\\BaiduYunDownload\\Cache\\lucune\\chinesedic");
        return new MySameTokenFilter(new MMSegTokenizer(new MaxWordSeg(dic), reader), sameContxt);
    }
}

5. Write an indexing test:

public void test05() {
    try {
        // Hand the synonym store to the analyzer
        Analyzer a1 = new MySameAnalyzer(new MySimpleSameContxt());
        String txt = "我来自湖南邵阳";
        // Build the index
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35, a1));
        Document doc = new Document();
        doc.add(new Field("content", txt, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
        // Search: a query for the synonym should hit the document
        IndexReader reader = IndexReader.open(dir);
        IndexSearcher search = new IndexSearcher(reader);
        TopDocs tds = search.search(new TermQuery(new Term("content", "鱼米之乡")), 10);
        for (ScoreDoc sdc : tds.scoreDocs) {
            Document docc = search.doc(sdc.doc);
            System.out.println(docc.get("content"));
        }
        // new AnalyzerUtils().displayToken(txt, a1);
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (LockObtainFailedException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
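The reason the TermQuery for 鱼米之乡 finds the document is the position increment of 0 set in the filter: Lucene derives a token's absolute position as the running sum of increments, so a zero-increment synonym lands on the same position as its source token. A small Lucene-free sketch of that arithmetic:

```java
import java.util.*;

public class PositionDemo {

    // A token's absolute position is the running sum of position
    // increments; an increment of 0 stacks a synonym onto the
    // position of the token before it.
    static Map<String, Integer> positions(String[] terms, int[] increments) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int pos = -1;                           // before the first token
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            out.put(terms[i], pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "我 来自 湖南" after synonym injection; increment 0 marks a synonym
        String[] terms = { "我", "咱", "俺", "来自", "湖南", "湘", "鱼米之乡" };
        int[] incs = { 1, 0, 0, 1, 1, 0, 0 };
        System.out.println(positions(terms, incs));
        // 湘 and 鱼米之乡 share the position of 湖南
    }
}
```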

Your eyes may be nearsighted, but your vision must never be short.

