[Python Knowledge] Crawler Basics: Installing and Introducing the BeautifulSoup Library

I. Preface

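In the previous few articles I showed how to crawl blogs, Wikipedia InfoBoxes, and images by analyzing page source code with Python. The articles are:

[Python Study] Simple Crawling of Wikipedia Programming-Language Infoboxes
[Python Study] A Simple Web Crawler for Blog Posts and the Ideas Behind It
[Python Study] Simple Crawling of Images from an Image Gallery Site

The core code, written for Python 2, is as follows: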

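    # coding=utf-8
    import urllib
    import re

    # Download the static HTML page
    url = ''
    content = urllib.urlopen(url).read()
    open('csdn.html', 'w+').write(content)

    # Extract the title
    title_pat = r'(?<=<title>).*?(?=</title>)'
    title_ex = re.compile(title_pat, re.M | re.S)
    title_obj = re.search(title_ex, content)
    title = title_obj.group()
    print title

    # Extract hyperlink text
    href = r'<a href=.*?>(.*?)</a>'
    m = re.findall(href, content, re.S | re.M)
    for text in m:
        print unicode(text, 'utf-8')
        break  # print only the first link

The output is:

    >>>
    CSDN.NET – 全球最大中文IT社区，为IT专业技术人员提供最全面的信息传播和服务平台
    登录
    >>>

The core code for downloading an image is:

    import os
    import urllib

    # Send a browser-like User-Agent instead of the urllib default
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/5.0"

    urllib._urlopener = AppURLopener()
    url = ""
    filename = os.path.basename(url)
    urllib.urlretrieve(url, filename)

However, crawling site content by analyzing raw HTML this way has many drawbacks, for example:

1. The regular expressions are tied to the HTML source rather than to a more abstract structure, so a small change in the page layout can break the program.
2. The program has to interpret the actual HTML source, so it may run into HTML features such as the character entity &amp;, and it needs special-case handling for different kinds of content such as icon hyperlinks and subscripts.
3. Regular expressions are not very readable, and they become messy as the HTML and the query expressions grow more complex.

《Python基础教程(第2版)》 (Beginning Python, 2nd edition) offers two solutions to these problems: the first is to use Tidy (a Python library) together with XHTML parsing; the second is to use the BeautifulSoup library.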
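The first two drawbacks are easy to reproduce. The sketch below is written for Python 3 and uses only the standard library, unlike the Python 2 listings in this article. It runs the link pattern from the listing above against two hypothetical snippets that differ only in attribute order, then compares the result with a small extractor built on `html.parser`. The names `LinkTextParser` and `link_texts` and the sample markup are illustrative assumptions, not part of the original code.

```python
import re
from html.parser import HTMLParser

# The link pattern from the listing above: it only matches when `href`
# is the first attribute of the <a> tag.
LINK_PATTERN = re.compile(r'<a href=.*?>(.*?)</a>', re.S | re.M)


class LinkTextParser(HTMLParser):
    """Collect the text inside every <a> element, whatever its attributes."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data
        super().__init__(convert_charrefs=True)
        self.links = []
        self._inside_a = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_a = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_a:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_a:
            self.links.append("".join(self._buffer))
            self._inside_a = False


def link_texts(html):
    """Return the text of every link found in `html`."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: the same link with two attribute orders.
original = '<a href="/news">News &amp; Events</a>'
reordered = '<a class="nav" href="/news">News &amp; Events</a>'

print(LINK_PATTERN.findall(original))   # ['News &amp; Events']
print(LINK_PATTERN.findall(reordered))  # []
print(link_texts(original))             # ['News & Events']
print(link_texts(reordered))            # ['News & Events']
```

The regular expression silently returns nothing once an attribute is placed before `href`, and it leaves the entity undecoded, while the parser-based version depends only on the element name.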

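II. Installing and Introducing the Beautiful Soup Library

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a great deal of programming time. As the book puts it: "You didn't write those awful pages; you are just trying to get some data out of them. Now you don't need to care what the HTML looks like; the parser takes care of it for you."

Download address:

To install, run:

    python setup.py install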
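After installing, it can be useful to confirm that Python can actually see the package. The following is a minimal sketch for Python 3.8 or later using only the standard library. It assumes Beautiful Soup 4, whose importable module is `bs4` and whose distribution name is `beautifulsoup4`; the helper name `describe_install` is made up for illustration.

```python
import importlib.util
from importlib import metadata


def describe_install(module_name, dist_name):
    """Return a one-line report on whether `module_name` can be imported."""
    # find_spec returns None when no top-level module of that name exists
    if importlib.util.find_spec(module_name) is None:
        return "%s: not installed" % module_name
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module is importable but has no distribution metadata
        version = "unknown version"
    return "%s: installed (%s)" % (module_name, version)


# Assumed names for Beautiful Soup 4: module "bs4", distribution "beautifulsoup4"
print(describe_install("bs4", "beautifulsoup4"))
```

The check does not import the package itself, so it is safe to run even when the installation failed.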

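For detailed usage, the Chinese documentation is recommended. Below is a brief walkthrough of BeautifulSoup, using the official "Alice in Wonderland" example: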

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="" class="sister" id="link1">Elsie</a>,
    <a href="" class="sister" id="link2">Lacie</a> and
    <a href="" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">…</p>
    """

    # Build the BeautifulSoup object and print it with standard indentation
    soup = BeautifulSoup(html_doc)
    print(soup.prettify())

The document is printed as a tree with standard indentation:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="" id="link3">
        Tillie
       </a>
       ;and they lived at the bottom of a well.
      </p>
      <p class="story">
       …
      </p>
     </body>
    </html>

Here is a quick introduction to the BeautifulSoup library (reference: the official documentation):

    # Get the title
    print soup.title
    # <title>The Dormouse's story</title>
    print soup.title.name
    # title
    print unicode(soup.title.string)
    # The Dormouse's story

    # Get the first <p> tag and the first <a> tag
    print soup.p
    # <p class="title"><b>The Dormouse's story</b></p>
    print soup.a
    # <a class="sister" href="" id="link1">Elsie</a>

    # Find all <a> tags in the document
    print soup.find_all('a')
    # [<a class="sister" href="" id="link1">Elsie</a>,
    #  <a class="sister" href="" id="link2">Lacie</a>,
    #  <a class="sister" href="" id="link3">Tillie</a>]

    # Print the href attribute of each link
    for link in soup.find_all('a'):
        print(link.get('href'))

    # Find the tag whose id is "link3"
    print soup.find(id='link3')
    # <a class="sister" href="" id="link3">Tillie</a>

To get all the text in the document, use:

    # Get all the text in the document
    print soup.get_text()
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # …

Along the way you may run into two typical error messages:

1. ImportError: No module named BeautifulSoup
After you have successfully installed BeautifulSoup 4, the statement "from BeautifulSoup import BeautifulSoup" may raise this error. Version 4 is imported from the bs4 package, as in the listing above: "from bs4 import BeautifulSoup".
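One way to make a script tolerant of this naming difference is to try the candidate module names in order. The sketch below is written for Python 3 and uses only the standard library; `import_first` is a hypothetical helper, and falling back to the legacy `BeautifulSoup` module name is an assumption about what an older environment might provide, not something this article prescribes.

```python
import importlib


def import_first(candidates):
    """Import and return the first module in `candidates` that is available."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This name is not importable here; try the next candidate
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(candidates))


# Prefer the Beautiful Soup 4 package name, then the legacy module name.
try:
    soup_module = import_first(["bs4", "BeautifulSoup"])
    print("using", soup_module.__name__)
except ImportError as error:
    print(error)
```

Both modules expose a class named `BeautifulSoup`, so the rest of a script can use `soup_module.BeautifulSoup` either way, although the two versions differ in other method names.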
