WebCollector 2.x Getting Started Tutorial

WebCollector crawler official site: https://github.com/CrawlScript/WebCollector

Technical discussion group: 250108697

1. Import WebCollector 2.x into your project:

Go to the WebCollector 2.x homepage: https://github.com/CrawlScript/WebCollector

Download: webcollector-<version>-bin.zip

Unzip it and add all the jar files in the extracted folder to your project.
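Alternatively, if your project is managed with Maven, WebCollector can be declared as a dependency instead of copying the jars by hand. The snippet below is only a sketch: the coordinates follow the project's published artifacts, but the version shown is a placeholder, so check the project homepage for the exact 2.x release you want to use.

<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <!-- placeholder version; use the 2.x release listed on the project homepage -->
    <version>2.09</version>
</dependency>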

2. Crawl the titles and contents of Yahoo News with WebCollector 2.x:

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;

/**
 * Crawl news from yahoo news
 *
 * @author hu
 */
public class YahooCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     * information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will auto extract
     * links which match the regex rules from each page
     */
    public YahooCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start page*/
        this.addSeed("http://news.yahoo.com/");

        /*fetch urls like http://news.yahoo.com/xxxx*/
        this.addRegex("http://news.yahoo.com/.*");
        /*do not fetch urls like http://news.yahoo.com/xxxx/xxxx*/
        this.addRegex("-http://news.yahoo.com/.+/.*");
        /*do not fetch jpg|png|gif*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch urls containing #*/
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, Links nextLinks) {
        String url = page.getUrl();
        /*if page is a news page*/
        if (Pattern.matches("http://news.yahoo.com/.+html", url)) {
            /*we use jsoup to parse the page*/
            Document doc = page.getDoc();

            /*extract title and content of the news by css selector*/
            String title = doc.select("h1[class=headline]").first().text();
            String content = doc.select("div[class=body yom-art-content clearfix]").first().text();

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl, add them to nextLinks*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and a link added to nextLinks does not match
              the regex rules, that link will also be filtered*/
            //nextLinks.add("");
        }
    }

    public static void main(String[] args) throws Exception {
        YahooCrawler crawler = new YahooCrawler("crawl", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        //crawler.setResumable(true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }
}
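To run the example, compile YahooCrawler with the WebCollector jars on the classpath and execute its main method. If a crawl is interrupted, the setResumable(true) call that is commented out above can be enabled so that the crawler continues from the state kept in the crawlPath directory ("crawl" in this example). The sketch below is only a minimal variant of the main method above with that flag turned on; all other settings simply repeat the ones already shown.

    public static void main(String[] args) throws Exception {
        YahooCrawler crawler = new YahooCrawler("crawl", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        /*keep crawl state in the crawlPath directory so that an interrupted
          crawl can be resumed by running the program again*/
        crawler.setResumable(true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }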
