This is a small crawler program I wrote. Its main job is to fetch every page of my little site and save each article title to a local file. It mainly uses BeautifulSoup and regular expressions to pull out the useful addresses and page titles. I'm sharing it here just to get the ball rolling; the code is below:
#author: liangliang
#email: liangliangyy@gmail.com
#blog: http://www.lylinux.org/

import urllib2
from bs4 import BeautifulSoup
import re

#baseurl = "http://www.lylinux.org"
title_file = "title.txt"

# Save the article title to a local file.
def save_page(link):
    page = urllib2.urlopen(link)
    soup = BeautifulSoup(page)
    title = soup.head.title
    print title
    with open(title_file, 'a') as f:  # 'with' closes the file; the original left it open
        f.write(str(title) + "\n")

# Get the URL of the next page.
def get_next_page(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    # A note here: while testing I found that every page works except this one --
    # the next-page address cannot be extracted from it.
    # I don't know why; could it be the theme?
    if url == "http://www.lylinux.org/page/15":
        return "http://www.lylinux.org/page/16"
    next_page = None  # return None when there is no next page, instead of raising NameError
    pattern = re.compile(r'http://www\.lylinux\.org/page/[0-9]+')  # dots escaped
    for i in soup.findAll(attrs={"class": "next-page"}):
        for match in pattern.findall(str(i)):
            next_page = match
        print next_page
    return next_page

def loop_page(url):
    print url
    get_article_url(url)
    next_page = get_next_page(url)
    if next_page:  # stop recursing on the last page
        loop_page(next_page)

# Get the article URLs on a page.
def get_article_url(link):
    print "get article " + link
    page = urllib2.urlopen(link)
    soup = BeautifulSoup(page)
    # [a-zA-Z], not [a-zA-z]: the lowercase z also matched [ \ ] ^ _ and backtick
    pattern = re.compile(r'[a-zA-Z]+://[^\s]*?\.html')
    for i in soup.findAll(attrs={"class": "focus"}):
        for article_link in pattern.findall(str(i)):
            save_page(article_link)

if __name__ == '__main__':
    loop_page('http://www.lylinux.org/')
    print "the end!"
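One detail worth calling out from the article-link step: a pattern written as `[a-zA-z]` silently matches six extra ASCII characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_`, and the backtick), so the correct class is `[a-zA-Z]`, and a non-greedy `\S+?` keeps each match from running past the first `.html`. A minimal, standalone sketch of the extraction idea, using a made-up HTML fragment in place of a real page (the URLs and the `class="focus"` wrapper here are just illustrative):

```python
import re

# Corrected pattern: [a-zA-Z] instead of [a-zA-z], escaped dot,
# and a non-greedy \S+? so the match stops at the first ".html".
pattern = re.compile(r'[a-zA-Z]+://\S+?\.html')

# Hypothetical snippet standing in for one <div class="focus"> block.
html = ('<div class="focus">'
        '<a href="http://www.lylinux.org/a.html">A</a> '
        '<a href="http://www.lylinux.org/b.html">B</a>'
        '</div>')

links = pattern.findall(html)
print(links)
```

Running this on the sample fragment yields both article URLs as separate matches; with the original greedy `[^\s]*.html`, adjacent links inside one whitespace-free attribute string could be swallowed into a single match.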
Well, that's all the code. I'm new to Python, so corrections are welcome.
Please credit the source when reprinting: 逝去日子的博客 » python爬虫小程序抓取网站