java从零到变身爬虫大神

刚开始先从最简单的爬虫逻辑入手

爬虫最简单的解析面真的是这样

import org.jsoup.Jsoup;import org.jsoup.nodes.Document;import java.io.IOException;public class Test {    public static void Get_Url(String url) {        try {         Document doc = Jsoup.connect(url)           //.data("query", "Java")          //.userAgent("头部")          //.cookie("auth", "token")          //.timeout(3000)          //.post()          .get();        } catch (IOException e) {              e.printStackTrace();        }    }}

这只是一个函数而已

那么在下面加上：

  //main函数     public static void main(String[] args) {         String url = "...";        Get_Url(url);     }

哈哈，搞定

就是这么一个爬虫了

太神奇

但是得到的只是网页的html页面的东西

而且还没筛选

那么就筛选吧

public static void Get_Url(String url) {        try {         Document doc = Jsoup.connect(url)           //.data("query", "Java")          //.userAgent("头部")          //.cookie("auth", "token")          //.timeout(3000)          //.post()          .get();                 //得到html的所有东西        Element content = doc.getElementById("content");        //分离出html下<a>...</a>之间的所有东西        Elements links = content.getElementsByTag("a");        //Elements links = doc.select("a[href]");        // 扩展名为.png的图片        Elements pngs = doc.select("img[src$=.png]");        // class等于masthead的div标签        Element masthead = doc.select("div.masthead").first();                    for (Element link : links) {              //得到<a>...</a>里面的网址              String linkHref = link.attr("href");              //得到<a>...</a>里面的汉字              String linkText = link.text();              System.out.println(linkText);            }        } catch (IOException e) {              e.printStackTrace();        }    }

那就用上面的来解析一下我的博客园

解析的是<a>…</a>之间的东西

看起来还不错吧

——————————-我是一根牛逼的分割线——————————-

其实还有另外一种爬虫的方法更加好

他能批量爬取网页保存到本地

先保存在本地再去正则什么的筛选自己想要的东西

这样效率比上面的那个高了很多

看代码！

//将抓取的网页变成html文件，保存在本地    public static void Save_Html(String url) {        try {            File dest = new File("src/temp_html/" + "保存的html的名字.html");            //接收字节输入流            InputStream is;            //字节输出流            FileOutputStream fos = new FileOutputStream(dest);                URL temp = new URL(url);            is = temp.openStream();                        //为字节输入流加缓冲            BufferedInputStream bis = new BufferedInputStream(is);            //为字节输出流加缓冲            BufferedOutputStream bos = new BufferedOutputStream(fos);                int length;                byte[] bytes = new byte[1024*20];            while((length = bis.read(bytes, 0, bytes.length)) != -1){                fos.write(bytes, 0, length);            }            bos.close();            fos.close();            bis.close();            is.close();        } catch (IOException e) {            e.printStackTrace();        }    }

这个方法直接将html保存在了文件夹src/temp_html/里面

在批量抓取网页的时候

都是先抓下来，保存为html或者json

然后在正则什么的进数据库

东西在本地了，自己想怎么搞就怎么搞

反爬虫关我什么事

上面两个方法都会造成一个问题

这个错误代表

这种爬虫方法太low逼

大部分网页都禁止了

所以，要加个头

就是UA

方法一那里的头部那里直接

userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

方法二间接加：

 URL temp = new URL(url); URLConnection uc = temp.openConnection();uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");  is = temp.openStream();

加了头部，几乎可以应付大部分网址了

——————————-我是一根牛逼的分割线——————————-

将html下载到本地后需要解析啊

解析啊看这里啊

//解析本地的html    public static void Get_Localhtml(String path) {        //读取本地html的路径        File file = new File(path);        //生成一个数组用来存储这些路径下的文件名        File[] array = file.listFiles();        //写个循环读取这些文件的名字                for(int i=0;i<array.length;i++){            try{                if(array[i].isFile()){                //文件名字                System.out.println("正在解析网址：" + array[i].getName());                //下面开始解析本地的html                Document doc = Jsoup.parse(array[i], "UTF-8");                //得到html的所有东西                Element content = doc.getElementById("content");                //分离出html下<a>...</a>之间的所有东西                Elements links = content.getElementsByTag("a");                //Elements links = doc.select("a[href]");                // 扩展名为.png的图片                Elements pngs = doc.select("img[src$=.png]");                // class等于masthead的div标签                Element masthead = doc.select("div.masthead").first();                                for (Element link : links) {                      //得到<a>...</a>里面的网址                      String linkHref = link.attr("href");                      //得到<a>...</a>里面的汉字                      String linkText = link.text();                      System.out.println(linkText);                        }                    }                }catch (Exception e) {                    System.out.println("网址：" + array[i].getName() + "解析出错");                    e.printStackTrace();                    continue;                }        }    }

文字配的很漂亮

就这样解析出来啦

主函数加上

//main函数    public static void main(String[] args) {        String url = "http://www.cnblogs.com/TTyb/";        String path = "src/temp_html/";        Get_Localhtml(path);    }

那么这个文件夹里面的所有的html都要被我解析掉

好啦

3天java1天爬虫的结果就是这样子咯

——————————-我是快乐的分割线——————————-

其实对于这两种爬取html的方法来说，最好结合在一起

作者测试过

方法二稳定性不足

方法一速度不好

所以自己改正

将方法一放到方法二的catch里面去

当方法二出现错误的时候就会用到方法一

但是当方法一也错误的时候就跳过吧

结合如下：

import org.jsoup.Jsoup;import org.jsoup.nodes.Document;import org.jsoup.nodes.Element;import org.jsoup.select.Elements;import java.io.BufferedInputStream;import java.io.BufferedOutputStream;import java.io.BufferedReader;import java.io.File;import java.io.FileOutputStream;import java.io.IOException;import java.io.InputStream;import java.io.InputStreamReader;import java.io.OutputStream;import java.io.OutputStreamWriter;import java.net.HttpURLConnection;import java.net.URL;import java.net.URLConnection;import java.util.Date;import java.text.SimpleDateFormat;public class JavaSpider {        //将抓取的网页变成html文件，保存在本地    public static void Save_Html(String url) {        try {            File dest = new File("src/temp_html/" + "我是名字.html");            //接收字节输入流            InputStream is;            //字节输出流            FileOutputStream fos = new FileOutputStream(dest);                URL temp = new URL(url);            URLConnection uc = temp.openConnection();            uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");            is = temp.openStream();                        //为字节输入流加缓冲            BufferedInputStream bis = new BufferedInputStream(is);            //为字节输出流加缓冲            BufferedOutputStream bos = new BufferedOutputStream(fos);                int length;                byte[] bytes = new byte[1024*20];            while((length = bis.read(bytes, 0, bytes.length)) != -1){                fos.write(bytes, 0, length);            }            bos.close();            fos.close();            bis.close();            is.close();        } catch (IOException e) {            e.printStackTrace();            System.out.println("openStream流错误，跳转get流");            //如果上面的那种方法解析错误            //那么就用下面这一种方法解析            try{                Document doc = Jsoup.connect(url)                .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")                .timeout(3000)                 .get();                                File dest = new File("src/temp_html/" + "我是名字.html");                if(!dest.exists())                    dest.createNewFile();                FileOutputStream out=new FileOutputStream(dest,false);                out.write(doc.toString().getBytes("utf-8"));                out.close();            }catch (IOException E) {                E.printStackTrace();                System.out.println("get流错误，请检查网址是否正确");            }                    }    }        //解析本地的html    public static void Get_Localhtml(String path) {        //读取本地html的路径        File file = new File(path);        //生成一个数组用来存储这些路径下的文件名        File[] array = file.listFiles();        //写个循环读取这些文件的名字                for(int i=0;i<array.length;i++){            try{                if(array[i].isFile()){                    //文件名字                    System.out.println("正在解析网址：" + array[i].getName());                    //文件地址加文件名字                    //System.out.println("#####" + array[i]);                     //一样的文件地址加文件名字                    //System.out.println("*****" + array[i].getPath());                                                             //下面开始解析本地的html                    Document doc = Jsoup.parse(array[i], "UTF-8");                    //得到html的所有东西                    Element content = doc.getElementById("content");                    //分离出html下<a>...</a>之间的所有东西                    Elements links = content.getElementsByTag("a");                    //Elements links = doc.select("a[href]");                    // 扩展名为.png的图片                    Elements pngs = doc.select("img[src$=.png]");                    // class等于masthead的div标签                    Element masthead = doc.select("div.masthead").first();                                        for (Element link : links) {                          //得到<a>...</a>里面的网址                          String linkHref = link.attr("href");                          //得到<a>...</a>里面的汉字                          String linkText = link.text();                          System.out.println(linkText);                        }                    }                }catch (Exception e) {                    System.out.println("网址：" + array[i].getName() + "解析出错");                    e.printStackTrace();                    continue;                }            }        }    //main函数    public static void main(String[] args) {        String url = "http://www.cnblogs.com/TTyb/";        String path = "src/temp_html/";        //保存到本地的网页地址        Save_Html(url);        //解析本地的网页地址        Get_Localhtml(path);    }}

总的来说

java爬虫的方法比python的多好多

java的库真特么变态

感受不同地域不一样的节奏与表象。

相关文章：

你感兴趣的文章：

标签云：