1. Install Beautiful Soup 4
pip install beautifulsoup4
pip install lxml
pip install html5lib
lxml and html5lib are third-party parsers that Beautiful Soup can use to parse HTML.
2. The HTML file
<!-- This is the example.html file. -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
Save the HTML above as example.html.
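3. Start parsing
import bs4

# Read the saved file and parse it with the html5lib parser.
with open('example.html') as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html5lib')

# select() returns a list-like collection of matching Tag objects.
elems = exampleSoup.select('#author')
type(elems)
print(elems[0].getText())
Output: Al Sweigart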
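A matched Tag carries more than its text. The following is a minimal sketch, assuming the author markup is held in an inline string instead of example.html, and assuming Python's built-in 'html.parser' so it runs without html5lib installed. The variable names are only illustrative.

```python
import bs4

# Assumption: a small inline fragment stands in for example.html.
html = '<p>By <span id="author">Al Sweigart</span></p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

elems = soup.select('#author')
author = elems[0]

match_count = len(elems)         # how many elements matched the selector
author_text = author.getText()   # the text inside the tag
author_attrs = author.attrs      # the tag's attributes as a dict
author_html = str(author)        # the tag serialized back to HTML

print(match_count, author_text, author_attrs)
print(author_html)
```

Checking len(elems) before indexing avoids an IndexError when the selector matches nothing.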
Beautiful Soup finds elements with the select() method, which takes CSS selectors much like jQuery does:
soup.select('div'): all <div> elements
soup.select('#author'): the element whose id is author
soup.select('.notice'): all elements whose class is notice
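The three selector forms above can be tried against the example page. This is a small sketch, assuming the HTML is embedded as a string and parsed with the built-in 'html.parser'. Because the example page has no <div> and no notice class, it selects 'p' and '.slogan' instead.

```python
import bs4

# Assumption: the example page is embedded here rather than read from disk.
html = """<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>"""

soup = bs4.BeautifulSoup(html, 'html.parser')

paragraphs = soup.select('p')      # by tag name: every <p> element
slogans = soup.select('.slogan')   # by class: elements with class="slogan"
authors = soup.select('#author')   # by id: the element with id="author"

print(len(paragraphs))
print(slogans[0].getText())
print(authors[0].getText())
```

Each call returns a list-like collection, so an empty result means the selector matched nothing.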
Reference: "Automate the Boring Stuff with Python" by Al Sweigart (Chinese edition).