Reading the contents of the webpage table

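Nutch stores the information it crawls from web pages in HBase, by default in a table named <crawlId>_webpage. The cell values are stored as serialized bytes, so a scan in the HBase shell, or a direct read through the Java API, returns only raw bytes (shown as \xNN hex escapes) rather than readable values. Nutch therefore provides the readdb command, which dumps the contents of the table to a text file. Usage:

    $ bin/nutch readdb
    Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                          [-crawlId <id>] [-content] [-headers] [-links] [-text]
        -crawlId <id>  - the id to prefix the schemas to operate on,
                         (default: storage.crawl.id)
        -stats [-sort] - print overall statistics to System.out
        [-sort]        - list status sorted by host
        -url <url>     - print information on <url> to System.out
        -dump <out_dir> [-regex regex] - dump the webtable to a text file in <out_dir>
        -content       - dump also raw content
        -headers       - dump protocol headers
        -links         - dump links
        -text          - dump extracted text
        [-regex]       - filter on the URL of the webtable entry

Example:

(1) seed.txt contains a single seed URL (http://money.163.com/, which is what the row key com.163.money:http/ in step (3) corresponds to).

(2) Run the inject step:

    bin/nutch inject seed.txt -crawlId test001

(3) Scan the table. The values are not human-readable: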

    hbase(main):002:0> scan 'test001_webpage'
    ROW                    COLUMN+CELL
     com.163.money:http/   column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
     com.163.money:http/   column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
     com.163.money:http/   column=mk:_injmrk_, timestamp=1423550107073, value=y
     com.163.money:http/   column=mk:dist, timestamp=1423550107073, value=0
     com.163.money:http/   column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
     com.163.money:http/   column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
    1 row(s) in 0.4090 seconds

(4) Dump the table contents to /mnt/jediael/2:

    bin/nutch readdb -dump /mnt/jediael/2 -crawlId test001 -content

(5) Inspect the contents of /mnt/jediael/2:

    $ ll
    total 4
    -rwxrwxrwx. 1 jediael jediael 344 Feb 10 14:41 part-r-00000
    -rwxrwxrwx. 1 jediael jediael   0 Feb 10 14:41 _SUCCESS
    $ cat part-r-00000
    key: com.163.money:http/
    baseUrl: null
    status: 0 (null)
    fetchTime: 1423550105558
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 1.0
    marker _injmrk_ : y
    marker dist : 0
    reprUrl: null
    metadata _csh_ : ?锟
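The values in the scan are not really unreadable, only binary. As a rough sketch, the three numeric cells can be decoded by hand and compared with the dump. This assumes the values are stored as big-endian Java primitives (an int, a long and a float), which is consistent with the numbers readdb prints but is not stated in the original post. The dictionary name `cells` is made up for this illustration.

```python
import struct

# Cell values copied from the scan of 'test001_webpage'.
# The HBase shell prints printable bytes as characters and everything else
# as \xNN, so "?\x80\x00\x00" stands for the four bytes 3F 80 00 00.
cells = {
    "f:fi": b"\x00'\x8D\x00",            # 4 bytes
    "f:ts": b"\x00\x00\x01Kr2\xC7\xD6",  # 8 bytes
    "s:s": b"?\x80\x00\x00",             # 4 bytes
}

# Assumption: big-endian int / long / float, as Java would serialize them.
fetch_interval = struct.unpack(">i", cells["f:fi"])[0]
fetch_time_ms = struct.unpack(">q", cells["f:ts"])[0]
score = struct.unpack(">f", cells["s:s"])[0]

print(fetch_interval)  # 2592000, same as fetchInterval in the dump
print(fetch_time_ms)   # 1423550105558, same as fetchTime in the dump
print(score)           # 1.0, same as score in the dump
```

The mtdt:_csh_ cell holds the same four bytes as s:s. The dump prints that metadata value as raw text instead of decoding it, which is probably why the last line of part-r-00000 looks garbled.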

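The row key com.163.money:http/ that appears in both the scan and the dump is the URL with its host name reversed, followed by the protocol, an optional port, and the path. Nutch 2.x does this conversion in Java (TableUtil.reverseUrl, to the best of my knowledge). The Python below is only a sketch of the same layout, and the function names `reverse_url` and `unreverse_url` are invented for this example.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Turn a URL into a row key shaped like 'com.163.money:http/'."""
    parts = urlsplit(url)
    # Reverse the host labels: money.163.com -> com.163.money
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    # An empty path is stored as "/"
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


def unreverse_url(key: str) -> str:
    """Inverse of reverse_url: rebuild the URL from a row key."""
    head, _, rest = key.partition("/")
    fields = head.split(":")
    host = ".".join(reversed(fields[0].split(".")))
    scheme = fields[1]
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{scheme}://{host}{port}/{rest}"


print(reverse_url("http://money.163.com/"))
print(unreverse_url("com.163.money:http/"))
```

Reversing the host keeps pages from the same domain next to each other in the table, so a row-key prefix such as com.163. covers every subdomain of 163.com.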