When a Crawler Meets JavaScript

Table of Contents

1 Preface
2 Analyzing the Target Page
3 grspider.py
4 Afterword
5 References

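1 Preface

This time my goal is to write a crawler that fetches the Remic Prospectuses files published each month on the GNMA official website. The URL for a specific year and month looks like this: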

http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/prospectuses/Pages/remic_prospectuses.aspx?YearDropDown=2013&MonthDropDown=March

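When the listing spans several pages, the other pages can only be reached by clicking a link, after which JavaScript generates the page; otherwise their content is not visible. For example: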

javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');

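The program should accept year and month arguments, fetch the Remic list automatically, and also handle pages that only appear after such a click.

2 Analyzing the Target Page

The target page URL follows a regular pattern: the year is numeric and matches \d+, while the month is the full English month name with the first letter capitalized.

On a given page, the source for each Remic file looks like this: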
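
The URL pattern described above can be sketched roughly as follows. This is only an illustration in Python 3, not the code from grspider.py: the names BASE_URL, MONTHS and build_url are my own, and treating the placeholder value "Month" (seen in the sample run in section 3) as "no month selected" is an assumption.

```python
import re
from urllib.parse import urlencode

BASE_URL = (
    "http://www.ginniemae.gov/doing_business_with_ginniemae/"
    "investor_resources/prospectuses/Pages/remic_prospectuses.aspx"
)

# Full English month names with the first letter capitalized.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def build_url(year, month="Month"):
    """Return the listing URL for a year and an optional month.

    "Month" is the placeholder seen when only a year is given;
    its exact meaning on the server side is an assumption.
    """
    year = str(year)
    if not re.fullmatch(r"\d+", year):
        raise ValueError("year must contain digits only: %r" % year)
    if month != "Month" and month not in MONTHS:
        raise ValueError("month must be a full English month name: %r" % month)
    query = urlencode({"YearDropDown": year, "MonthDropDown": month})
    return BASE_URL + "?" + query
```

Calling build_url(2013, "March") reproduces the example URL from the preface, and an invalid month such as "march" is rejected before any request is made.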

<a href="http://lesliezhu.github.com/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2013Mar21-037.pdf" target="_blank">2013-037 - Dated March 21, 2013</a>
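
Links of this shape can be pulled out with a regular expression. The sketch below is hypothetical (REMIC_HREF and extract_remics are names I made up) and assumes every file name follows the pattern seen here: four-digit year, three-letter month, day, a dash, then the deal number. The dictionary key is built from the file name so that it matches the keys in the JSON output shown in section 3.

```python
import re

# Matches href values such as
#   .../ProspectusesLib/2013Mar21-037.pdf
# and captures the year and the deal number from the file name.
REMIC_HREF = re.compile(
    r'href="(?P<url>[^"]*/ProspectusesLib/'
    r'(?P<year>\d{4})[A-Za-z]{3}\d{1,2}-(?P<num>\w+)\.pdf)"'
)


def extract_remics(html):
    """Map deal ids like "2013-037" to the URL of the PDF file."""
    remics = {}
    for match in REMIC_HREF.finditer(html):
        deal = "%s-%s" % (match.group("year"), match.group("num"))
        remics[deal] = match.group("url")
    return remics
```

A regular expression is enough here because only one attribute is needed; a real HTML parser would be the safer choice if the markup varied more than it does in this sample.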

The corresponding JavaScript links, once tidied up, look like this:

<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={1};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={21};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={31};dvt_startposition={}');">
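
The two arguments of each __doPostBack call are what matter for paging, so a crawler first has to collect them. A minimal sketch, assuming the calls appear in the page source exactly as above, with single-quoted arguments that are not HTML-escaped; POSTBACK and extract_postbacks are illustrative names only:

```python
import re

# Captures the two single-quoted arguments of a __doPostBack(...) call.
POSTBACK = re.compile(
    r"__doPostBack\('(?P<target>[^']*)',\s*'(?P<argument>[^']*)'\)"
)


def extract_postbacks(html):
    """Return unique (target, argument) pairs in first-seen order."""
    pairs = []
    for match in POSTBACK.finditer(html):
        pair = (match.group("target"), match.group("argument"))
        if pair not in pairs:  # paging links are often repeated on a page
            pairs.append(pair)
    return pairs
```

Each pair corresponds to one extra page of results; the number inside dvt_firstrow={...} appears to be the index of the first row on that page.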

Fortunately, the definition can be found in the page source:

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

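In other words, __doPostBack only fills in the form fields __EVENTTARGET and __EVENTARGUMENT and then submits aspnetForm. So the other pages have to be fetched by submitting that form with the POST method, using the information taken from the form.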
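
What such a form submission might look like in Python 3 is sketched below using only the standard library. It is not the original script. I assume that the form's hidden inputs (ASP.NET pages usually carry state in fields such as __VIEWSTATE, but I have not listed this page's fields here) must be sent back unchanged; HiddenInputs and build_postback_request are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback_request(url, html, target, argument):
    """Build (but do not send) the POST that __doPostBack would submit."""
    parser = HiddenInputs()
    parser.feed(html)
    fields = dict(parser.fields)
    # These two assignments mirror what the JavaScript function does.
    fields["__EVENTTARGET"] = target
    fields["__EVENTARGUMENT"] = argument
    data = urlencode(fields).encode("ascii")
    return Request(url, data=data)  # a Request with data is sent as POST
```

The returned object can be passed to urllib.request.urlopen to fetch the next page. Whether the server also expects cookies or extra headers is something to check in the browser's network panel.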

3 grspider.py

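Firebug can produce a cURL command line that downloads the corresponding page, but working that way gets messy, so grspider.py simulates the form's POST submission instead.

Run: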

$ python grspider.py -y 2014
http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/prospectuses/Pages/remic_prospectuses.aspx?YearDropDown=2014&MonthDropDown=Month
$ cat gnma_remic.json
{
    "2014-001": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001.pdf",
    "2014-001O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001O.pdf",
    "2014-002": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002.pdf",
    "2014-002O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002O.pdf",
    "2014-003": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-003.pdf",
    "2014-003O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-003O.pdf",
    "2014-004": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-004.pdf",
    "2014-004O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-004O.pdf",
     ....
}
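
For completeness, a file in the shape of gnma_remic.json can be written with the standard json module. This is a guess at the output step rather than the script's actual code; save_remics is a hypothetical helper, and the four-space indent and sorted keys are chosen only to match the listing above.

```python
import json


def save_remics(remics, path="gnma_remic.json"):
    """Write the deal-id -> PDF URL mapping as sorted, indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(remics, fh, indent=4, sort_keys=True)
        fh.write("\n")
```

Sorting the keys keeps the file stable between runs, which makes it easy to diff when new prospectuses are published.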

4 Afterword

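When a page depends on JavaScript, it is still best to inspect it with Firebug or a similar tool to find the pattern, and then simulate the form submission.

5 References

Zhihu question on crawling pages where many URLs are generated by JavaScript code: http://www.zhihu.com/question/20626694
http://phantomjs.org/
https://www.ptt.cc/bbs/java/M.1307291078.A.1B4.html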

Date: 12/13/2014

Author: Leslie Zhu

Org version 7.8.11 with Emacs version 24
