What is Tika?
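Tika is a toolkit for extracting text and metadata from documents. It is built on top of existing parser libraries such as POI and PDFBox.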
Supported formats:
PDF: via PDFBox
MS-*: via POI
HTML
OpenOffice
Archive: zip, tar, gzip, bzip, etc.
RTF: provided by Tika itself
Image: metadata extraction only
XML

Parser interface
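public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
Newer Tika versions add a fourth ParseContext argument to this method, which is the form the example code uses.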
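To make the role of the ContentHandler argument concrete: a Tika parser reports the document body to the handler as XHTML SAX events, and the handler decides what to keep. The sketch below uses only the JDK. The built-in SAX parser stands in for a Tika parser, and the names SaxTextDemo, TextCollector and extractText are hypothetical, not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextDemo {

    /** Hypothetical handler: keeps only the character data it receives. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so append each chunk.
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the stream to the JDK SAX parser (a stand-in for a Tika parser). */
    static String extractText(InputStream stream) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello, Tika</p></body></html>";
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractText(in)); // prints: Hello, Tika
    }
}
```

Tika's own BodyContentHandler plays a similar role to TextCollector here: it collects the text of the body element and returns it from toString().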
Tip: besides the main tika-xx.jar, parsing a given file type requires the jar of the library that handles it. For example, Excel files need poi-xx.jar.
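A missing parser jar is easy to overlook, so it can help to check the classpath before parsing. The following is a minimal sketch, assuming that probing for one well-known class per library is a good enough signal. The class ParserDependencyCheck and its method are hypothetical helpers, not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ParserDependencyCheck {

    /** Returns true if the named class can be loaded from the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' = do not run static initializers, only locate the class.
            Class.forName(className, false, ParserDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One representative class per library (an assumption for illustration).
        Map<String, String> required = new LinkedHashMap<>();
        required.put("Tika", "org.apache.tika.parser.AutoDetectParser");
        required.put("POI (Excel, Word)", "org.apache.poi.hssf.usermodel.HSSFWorkbook");
        required.put("PDFBox (PDF)", "org.apache.pdfbox.pdmodel.PDDocument");

        for (Map.Entry<String, String> entry : required.entrySet()) {
            String status = isOnClasspath(entry.getValue()) ? "found" : "MISSING";
            System.out.println(entry.getKey() + ": " + status);
        }
    }
}
```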
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.*;

public class TiKaUtil {

    public static String parseFile(File file) {
        Parser parser = new AutoDetectParser();
        InputStream input = null;
        try {
            Metadata metadata = new Metadata();
            metadata.set(Metadata.CONTENT_ENCODING, "utf-8");
            metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
            input = new FileInputStream(file);
            // BodyContentHandler stops at 100000 characters by default. For larger
            // documents pass a limit, e.g. new BodyContentHandler(1024*1024*10).
            ContentHandler handler = new BodyContentHandler();
            ParseContext context = new ParseContext();
            context.set(Parser.class, parser);
            parser.parse(input, handler, metadata, context);
            // Print every metadata field, then the extracted body text.
            for (String name : metadata.names()) {
                System.out.println(name + ":" + metadata.get(name));
            }
            System.out.println(handler.toString());
            return handler.toString();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (input != null) input.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    public static void main(String argt0[]) throws Exception {
        parseFile(new File("D:\\svntest\\svnkittest\\branches\\doImport.txt"));
    }
}
Summary:
Only fingers that have bled can play the world's finest music.