Getting the title of a given URL

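Recently, while running an ad campaign, I needed to find a strongly relevant audience. I had a set of URLs that might be related to the ad, so I needed to parse the page content behind each URL to judge whether that URL is actually relevant to the ad.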

Here I use Python's urllib or urllib2 package to fetch the content of a URL. The approach is as follows:

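    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib2
    import re

    url = ''
    # Download the raw HTML of the page
    html = urllib2.urlopen(url).read()
    # Collect every <title>...</title> occurrence on a single line
    res_list = re.findall(r"<title>.*</title>", html)
    for t in res_list:
        print t

The output is: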

python如何正确抓取网页标题 – SegmentFault
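The snippet above is Python 2. In Python 3, urllib2 no longer exists and its functionality lives in urllib.request and urllib.error, and read() returns bytes that have to be decoded before a str regex can be applied. The following is a minimal sketch of the same idea for Python 3; the helper names find_title_tags and fetch_html and the sample HTML string are my own assumptions, not part of the original script.

```python
import re
import urllib.request


def find_title_tags(html):
    """Return every <title>...</title> match, using the same pattern as above."""
    return re.findall(r"<title>.*</title>", html)


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (Python 3 returns bytes)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical sample page, so the example runs without network access.
sample = "<html><head><title>Example page</title></head></html>"
for t in find_title_tags(sample):
    print(t)
```

For a real page, find_title_tags(fetch_html(url)) would play the role of the two middle lines of the original script.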

BeautifulSoup, which is mentioned in that thread, also looks like a good option. I have not tried it myself, but the code is:

    import urllib
    from BeautifulSoup import BeautifulSoup

    content = urllib.urlopen('').read()
    soup = BeautifulSoup(content)
    print soup.find('title')
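If installing BeautifulSoup is not an option, a real HTML parser is still available in the Python 3 standard library. The sketch below is not the BeautifulSoup approach from the thread; it swaps in html.parser.HTMLParser to do the same job of picking out the first title element. The class name TitleParser and the function parse_title are names I made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled too.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def parse_title(html):
    """Return the page title as text, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Because this works on parsed tags rather than raw text, line breaks inside the title need no special handling.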

When I used the first snippet to fetch the title of my own home page, ?viewmode=list, it failed with an error:

    Traceback (most recent call last):
      File "get_title.py", line 9, in <module>
        html = urllib2.urlopen(url).read()
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = meth(req, response)
      File "/usr/lib/python2.7/urllib2.py", line 513, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.7/urllib2.py", line 438, in error
        return self._call_chain(*args)
      File "/usr/lib/python2.7/urllib2.py", line 372, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 521, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden

This is a 403 error: CSDN apparently refuses this kind of request. So we need to imitate an ordinary person visiting with a browser by adding headers:

    url = '?viewmode=list'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    # Put user_agent into the request headers
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, headers = headers)
    html = urllib2.urlopen(req).read()
    res_list = re.findall(r"<title>.*\s?.*</title>", html)
    for t in res_list:
        # Slice off the leading <title> and the trailing </title>
        print t[7:-8]

This runs without problems, and the output is:

lming_08技术博客 – 博客频道 – CSDN.NET
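For reference, the header trick carries over to Python 3 almost unchanged, with urllib.request.Request taking the place of urllib2.Request. Here is a small sketch; the function name build_request and the URL http://example.com/ are placeholders I chose for illustration, and the User-Agent string is just one plausible browser string. Nothing in it contacts the network, it only builds the request object.

```python
import urllib.request

# A browser-like User-Agent string; any realistic value could be used here.
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
    "Gecko/20091201 Firefox/3.5.6"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only to show what the request object contains.
req = build_request("http://example.com/")
# Request stores header names capitalized, hence "User-agent" for the lookup.
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req) then sends the header along with the request, as in the Python 2 code above.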

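Note that the title here contains a \n character. In the earlier regular expression r"<title>.*</title>", the "." does not match \n, so the pattern was changed to r"<title>.*\s?.*</title>" or r"<title>.*\n?.*</title>", which tolerates a single line break inside the title.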
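The \s? workaround only covers one line break. A more general option, sketched here under my own assumptions rather than taken from the original script, is to compile the pattern with re.DOTALL so that "." also matches \n, make it non-greedy, and use a capture group instead of slicing with t[7:-8]. The name extract_title is hypothetical.

```python
import re

# re.DOTALL lets "." match newlines; re.IGNORECASE accepts <TITLE> as well.
# [^>]* tolerates attributes on the tag, and .*? stops at the first </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the title text with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # split() with no arguments splits on any run of whitespace,
    # so embedded newlines and indentation become single spaces.
    return " ".join(match.group(1).split())


print(extract_title("<title>lming_08\n tech blog</title>"))
```

Because the capture group returns only the text between the tags, the result does not depend on the exact length of the opening tag.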

More recently I found that fetching the title was returning a 403 error again. After looking into it, the headers needed to be updated:

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    headers = { 'User-Agent' : user_agent }

The output is:

    mingliang@ubuntu:~/MyWorkSpace/Pycode/htmlparse$ python get_title.py
    Mentholatum 曼秀雷敦肌研极润保湿系列套装（洁面乳50g+化妆水100ml+乳液90ml+面膜18ml+眼膜2片+眼霜3g） 99元（199-100）_乐蜂网优惠_什么值得买

In practice we parse a large number of URLs, and along the way some servers will inevitably return other error codes, so we catch the exception and keep going:

    try:
        html = urllib2.urlopen(req).read()
        res_list = re.findall(r"<title>.*\s?.*</title>", html)
        for t in res_list:
            print t[7:-8]
    except urllib2.HTTPError:
        print "failed parsing web url"
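When looping over many URLs, it may also be worth catching URLError, which covers failures such as DNS errors or refused connections where no HTTP status comes back at all. Below is a rough Python 3 sketch of such a loop. The function fetch_titles, the stand-in fake_fetch, and the example.com URLs are all hypothetical and exist only so the example runs without network access; a real fetch function would download the page and extract its title.

```python
import io
import urllib.error


def fetch_titles(urls, fetch_title):
    """Call fetch_title on each URL, recording None for the ones that fail."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch_title(url)
        except urllib.error.HTTPError as err:
            # HTTPError is a subclass of URLError, so it must be caught first.
            print("failed parsing %s: HTTP %d" % (url, err.code))
            results[url] = None
        except urllib.error.URLError as err:
            print("failed reaching %s: %s" % (url, err.reason))
            results[url] = None
    return results


def fake_fetch(url):
    """Stand-in for a real fetcher; simulates the failure modes."""
    if url.endswith("/forbidden"):
        raise urllib.error.HTTPError(url, 403, "Forbidden", {}, io.BytesIO(b""))
    if url.endswith("/down"):
        raise urllib.error.URLError("connection refused")
    return "Title of " + url


urls = [
    "http://example.com/ok",
    "http://example.com/forbidden",
    "http://example.com/down",
]
results = fetch_titles(urls, fake_fetch)
```

Recording None for failed URLs, rather than skipping them, keeps it easy to see afterwards which pages still need a retry.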

