Learning BeautifulSoup

Prerequisites:

pip install bs4
pip install lxml

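BeautifulSoup parsers

As the table above shows, the lxml parser can parse both HTML and XML documents, is fast, and tolerates malformed markup well, so it is the recommended choice.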
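A minimal sketch of choosing a parser defensively, assuming lxml may or may not be installed. The helper name make_soup is hypothetical; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the built-in parser needs no extra package
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>parser</b></p>")
print(soup.b.string)
```

The two parsers can build slightly different trees for broken markup, so it is worth using the same parser in development and in production.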

Node selectors

Getting the tag name

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.name)

Getting attributes

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="Dormouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:

{'class': ['title'], 'name': 'Dormouse'}
Dormouse
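The ways of reading an attribute can be sketched side by side. This is a small illustration with a made-up one-line document, using the built-in html.parser so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = '<p class="title big" name="Dormouse" id="intro">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["name"])      # index access; raises KeyError if the attribute is missing
print(p.get("href"))  # .get() returns None when the attribute is missing
print(p["class"])     # class is multi-valued, so it comes back as a list
```

Using .get() is the safer choice when an attribute may be absent on some of the tags being processed.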

Getting child nodes

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title",name="测试一下"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie
<p name="hahah测试一下"</p>
</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# Print individual elements
print(soup.p)
print("soup.title = ",soup.title)
print("soup.a = ",soup.a)
print("type(p):",type(soup.p))
print("soup.title.name = ",soup.title.name)
print("soup.a.name = ",soup.a.name)
print("type(soup.a.name):",type(soup.a.name))
attrs = soup.p.attrs

Output:

<p ,name="测试一下" class="title"><b>The Dormouse's story</b></p>
soup.title =  <title>The Dormouse's story</title>
soup.a =  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
type(p): <class 'bs4.element.Tag'>
soup.title.name =  title
soup.a.name =  a
type(soup.a.name): <class 'str'>

Related nodes:

Getting child and descendant nodes:

Once a node is selected, its direct children are available through the contents attribute.

For example:

print("soup.body.contents:",soup.body.contents)
print("type(soup.body.contents):",type(soup.body.contents))

Output:

soup.body.contents: ['\n', <p ,name="测试一下" class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie
<p <="" name="hahah测试一下" p="">
</p></a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
type(soup.body.contents): <class 'list'>

The children attribute gives the same nodes, but as an iterator rather than a list.

print("soup.body.children:",soup.body.children)
print("type(soup.body.children):",type(soup.body.children))

Output:

soup.body.children: <list_iterator object at 0x104cc75e0>
type(soup.body.children): <class 'list_iterator'>
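Because whitespace between tags shows up as text nodes, a common follow-up step is to keep only the Tag children. The sketch below assumes a tiny made-up document and the built-in html.parser; it also counts descendants for comparison.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# children yields text nodes (the newlines) as well as tags
all_children = list(soup.body.children)
# keep only real tags
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]
# descendants also walks into each <p>, so it reaches the inner strings
all_descendants = list(soup.body.descendants)

print(len(all_children), len(tag_children), len(all_descendants))
```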

To get all descendants (children, grandchildren, and so on), use the descendants attribute.

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span>Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p.descendants)
for child in soup.p.descendants:
    print(child)

# Test parent and ancestor nodes
pars = soup.p.parents
print("type(soup.p.parents):",type(pars))
for par in pars:
    print("Parent/ancestor node:",par)

Output (captured from the html.parser soup in the "Getting child nodes" example, which is why the malformed p tags appear):

type(soup.p.parents): <class 'generator'>
Parent/ancestor node: <body>
<p ,name="测试一下" class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie
<p <="" name="hahah测试一下" p="">
</p></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
Parent/ancestor node: <html><head><title>The Dormouse's story</title></head>
<body>
<p ,name="测试一下" class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie
<p <="" name="hahah测试一下" p="">
</p></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
Parent/ancestor node: 
<html><head><title>The Dormouse's story</title></head>
<body>
<p ,name="测试一下" class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie
<p <="" name="hahah测试一下" p="">
</p></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

The remaining lines come from a sibling test (part, next_sib) whose code is not shown:

part value: <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
next_sib value:  and
type(next_siblings): <class 'generator'>
next_sib :  and
next_sib : <a class="sister" href="http://example.com/tillie" id="link3">Tillie
<p <="" name="hahah测试一下" p="">
</p></a>

Getting parent and ancestor nodes:

To get a node's parent, use the parent attribute; to walk all of its ancestors, use parents.

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.a.parent)   # direct parent only
print(soup.a.parents)  # generator over all ancestors
for i, parent in enumerate(soup.a.parents):
    print(i, parent)

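The parents attribute returns the ancestors as a generator, so a loop is needed to walk through each one.

Run the code above and you will see that the output includes the body node and the html node.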
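Printing whole ancestors gets long quickly, so a compact way to inspect the chain is to print only the tag names. This sketch assumes a small made-up document and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# parent is a single Tag; parents walks up to the document itself
parent_name = a.parent.name
ancestor_names = [ancestor.name for ancestor in a.parents]

print(parent_name)
print(ancestor_names)
```

The last entry, "[document]", is the BeautifulSoup object itself, which sits above the html node.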

Getting sibling nodes:

The two examples above cover parents and children. To get nodes at the same level, use the four attributes next_sibling, previous_sibling, next_siblings, and previous_siblings.

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.a.next_sibling)
print(list(soup.a.next_siblings))
print(soup.a.previous_sibling)
print(list(soup.a.previous_siblings))

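The code above uses four attributes. next_sibling and previous_sibling return the next and the previous sibling of a node, respectively. A sibling can be a text node: here soup.a.next_sibling is the string "hello" followed by a newline, not the next a tag.

next_siblings and previous_siblings yield all following and all preceding siblings, respectively. They return generators (type(next_siblings): <class 'generator'>), as do parents and the other plural traversal attributes, so their contents can be read with a for loop.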
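The difference between the sibling attributes and a sibling search that skips text nodes can be sketched as follows, assuming a one-line made-up document and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>hello<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling           # the text node "hello"
prev_node = first.previous_sibling       # None: nothing comes before the first <a>
next_tag = first.find_next_sibling("a")  # skips text and returns the next <a>
kinds = [type(s).__name__ for s in first.next_siblings]

print(repr(next_node), prev_node, next_tag["id"], kinds)
```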

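Method selectors

Everything so far selects nodes through attributes. That is fast, but it becomes cumbersome for more complex selections. Beautiful Soup therefore provides search methods such as find_all() and find(), which take the search criteria as arguments. (In practice, method selectors are what we use most.)

find_all()

Its API is as follows (the string argument was called text in older versions):

find_all(name, attrs, recursive, string, limit, **kwargs)

(1) name

Selects nodes by tag name.

For example: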

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="Dormouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('a'))
print(len(soup.find_all('a')))

The code above calls find_all() with the name argument set to a.

Run it and you get all the a nodes: the result behaves like a list and has length 3.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="Dormouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
<a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('a'))
for a in soup.find_all('a'):
    print(a.find_all('span'))
    print(a.string)

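The code above is a slightly modified version of the previous example.

Run it and you will see that span nodes can be found from each a node, and that the text of each a node is available as well.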
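The name argument accepts more than a single string. The sketch below, on a made-up document parsed with the built-in html.parser, shows a list of names, a compiled regular expression, and the limit argument.

```python
import re

from bs4 import BeautifulSoup

html = "<body><b>bold</b><a href='/1'>one</a><a href='/2'>two</a><a href='/3'>three</a></body>"
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order
names_from_list = [t.name for t in soup.find_all(["a", "b"])]
# A regular expression is matched against each tag name
names_from_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# limit stops the search after the given number of matches
first_two = soup.find_all("a", limit=2)

print(names_from_list, names_from_regex, len(first_two))
```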

(2) attrs

Besides the tag name, nodes can also be queried by attribute.

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="Dormouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
<a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(attrs={'id': 'link1'}))
print(soup.find_all(attrs={'name': 'Dormouse'}))

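The query takes the attrs argument, which is a dictionary.

For common attributes such as class, the attribute can be passed directly as a keyword argument. Using the same document as above: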

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(class_ = 'sister'))

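Note that class is a reserved word in Python, so the keyword argument is written class_ with a trailing underscore.

The id attribute works the same way. Using the same document again: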

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(id = 'link2'))
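The attribute filters can be combined and can take regular expressions. A small sketch on a made-up document (built-in html.parser); it also shows why an HTML attribute called name has to go through attrs, since the name keyword is already taken by the tag name.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p name="Dormouse">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister brother" id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the value
by_class = soup.find_all("a", class_="sister")
# any attribute can be filtered with a regular expression
by_href = soup.find_all(href=re.compile("lacie"))
# an HTML attribute literally called "name" must be passed through attrs
by_name_attr = soup.find_all(attrs={"name": "Dormouse"})

print(len(by_class), by_href[0]["id"], by_name_attr[0].name)
```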

find()

Besides find_all() there is find(). The former returns all matching elements as a list; the latter returns a single element.

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="Dormouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>Lacie</span></a> and
<a href="http://example.com/tillie" class="sister" id="link3"><span>Tillie</span></a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find(name='a'))
print(type(soup.find(name='a')))

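Run the code above and you will see that find() returns the first a node, and that its type is Tag.

find() is used the same way as find_all().

A few other method selectors are worth a brief mention.

find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent.

find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.

find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.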
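These methods can be tried on a small list. The document below is made up for illustration and parsed with the built-in html.parser; each call is given a tag name so that text nodes and the document object are not part of the result.

```python
from bs4 import BeautifulSoup

html = "<div id='box'><ul><li>a</li><li id='mid'>b</li><li>c</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")
mid = soup.find(id="mid")

box_id = mid.find_parent("div")["id"]                         # nearest matching ancestor
ancestors = [p.name for p in mid.find_parents(["ul", "div"])]  # all matching ancestors
after = mid.find_next_sibling("li").string                     # first <li> after mid
before = [li.string for li in mid.find_previous_siblings("li")]

print(box_id, ancestors, after, before)
```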

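Notes:

In BeautifulSoup, soup.find().string and soup.find().text differ as follows:

soup.find().string:

string returns the string inside the tag. If the tag has more than one child node, string is None.
If the tag contains exactly one text node, string is that text node.

soup.find().text:

text returns the concatenated text of everything inside the tag, including the text of its child nodes and of their descendants.
Unlike string, text does not depend on how many children the tag has; it always returns all the text inside the tag.

So when a tag holds a single text node, soup.find().string and soup.find().text give the same result. When it holds several nodes, the results differ.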
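The difference is easy to see on two paragraphs, one with a single text node and one with mixed content. A minimal sketch with a made-up document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = "<div><p id='one'>only text</p><p id='two'>Hello <b>World</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")
one = soup.find(id="one")
two = soup.find(id="two")

print(one.string, "|", one.text)  # same result: a single text node
print(two.string, "|", two.text)  # string is None: the tag has three children
# stripped_strings yields each text node with surrounding whitespace removed
pieces = list(two.stripped_strings)
print(pieces)
```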

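Example:

To get only the first line of a tag's text, combine .find().text with .splitlines() and .strip():
first_text = soup.find().text.splitlines()[0].strip()
print(first_text)
This first collects all the text inside the tag with .find().text, takes the first line with .splitlines()[0], and finally strips leading and trailing whitespace with .strip(). The result equals the first text node only when the text nodes are separated by line breaks.

splitlines() is a Python string method that splits a string into a list of lines. It splits at line boundaries such as \n, \r, and \r\n.

For example, given a string containing several lines:
text = "Hello\nWorld\nWelcome\rTo\rPython\r\nProgramming"
lines = text.splitlines()
print(lines)
The output is a list with one entry per line:
['Hello', 'World', 'Welcome', 'To', 'Python', 'Programming']
In the earlier example, .splitlines() splits the tag's text into lines, and the index [0] picks the first one.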

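Notes 2:

On find() versus find_all(): find_all() returns a ResultSet, which has no string attribute of its own; iterate over it to get each node and then read that node's attributes. find() returns a <class 'bs4.element.Tag'> object, or None when nothing matches.

Notes 3:

In BeautifulSoup, find() and find_all() can be called on two types of object:

BeautifulSoup object: calling find() or find_all() on it searches the whole document.

Tag object: calling find() or find_all() on it searches the subtree under that tag.

Both types are defined by the BeautifulSoup library and represent a document or an element in it. With find() and find_all() it is easy to locate specific tags in a document and then process them further.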
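The two return types can be compared directly. This sketch uses a made-up list and the built-in html.parser; it also shows searching from a Tag rather than from the whole document.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")                 # ResultSet: a list of Tag objects
texts = [li.string for li in items]         # read .string on each Tag, not on the set
first = soup.find("li")                     # a single Tag
missing = soup.find("table")                # None when nothing matches
nested = soup.find("ul").find_all("li")     # searching from a Tag searches its subtree

print(type(items).__name__, texts, first.string, missing, len(nested))
```

Checking for None before using the result of find() avoids an AttributeError on pages where the expected tag is absent.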

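CSS selectors

Beautiful Soup also offers another kind of selector: CSS selectors. Anyone familiar with front-end development will already know them.

To use one, call the select() method and pass it a CSS selector string built from tag names, classes, ids, or attribute values.

For example: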

html_doc = """
<div class="panel"><div class="panel-heading"><h4>Hello World</h4>   </div><div class="panel-body"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul><ul class="list list-samll" id="list-2"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul></div></div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('.panel .panel-heading')) # nodes with class panel-heading inside .panel
print(soup.select('ul li')) # li nodes under ul
print(soup.select('#list-2 li')) # li nodes under the node with id list-2
print(soup.select('ul'))    # all ul nodes
print(type(soup.select('ul')[0]))

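Run the code above and look at the output; most of the behavior becomes clear from it.

The last line prints the type of an element of the result list, which is again Tag.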
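A few more selector forms, sketched on a made-up document with the built-in html.parser: select_one() for a single match, the child combinator, and an attribute selector.

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="list-1"><li class="element">Foo</li>'
    '<li class="element active">Bar</li></ul>'
    '<a href="https://example.com/doc.pdf">pdf</a>'
)
soup = BeautifulSoup(html, "html.parser")

active = soup.select_one("#list-1 li.active")            # first match, or None
direct = [li.string for li in soup.select("ul > li")]    # direct children only
pdf_links = soup.select('a[href$=".pdf"]')               # href ending in .pdf
nothing = soup.select_one("table")

print(active.string, direct, pdf_links[0]["href"], nothing)
```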

Nested selection

select() also supports nested selection: for example, select all ul nodes, then loop over them and select the li nodes inside each.

With the same HTML as above:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

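Run the code above; it prints, for each ul node, the list of li nodes under it.

Getting attributes

As the examples show, every selected node is a Tag, so attributes can be read the same way as before. With the same HTML, this reads the id attribute of each ul node.

For example: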

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

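As the code shows, an attribute can be read either by passing its name in square brackets or through the attrs dictionary.

Getting text

Besides the string attribute described earlier, text can also be obtained with the get_text() method.

With the same HTML as before: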

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for li in soup.select('li'):
    print('String:', li.string)
    print('get text:', li.get_text())

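Summary

That covers the basics of Beautiful Soup.

When writing a crawler, find_all() and find() are the usual way to get specific nodes.

If you are comfortable with CSS selectors, select() works too.

Hands-on practice

1. Scraping Douban's Top 250 most popular movies, to watch at leisure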

from bs4 import BeautifulSoup
import requests
import openpyxl
from fake_useragent import UserAgent

wb = openpyxl.Workbook()
ws = wb.active
ws.append(["Title", "Image", "Rank", "Rating", "Author", "Intro"])
ua = UserAgent()
start = 0
output_file = "./thetop250movies.xlsx"
url = "https://movie.douban.com/top250?start=" + str(start) + "&filter="
headers = {"user-agent": ua.random}
# Optional single-request check before looping:
# response = requests.get(url=url, headers=headers)
# print("response.status_code =", response.status_code)
# print("response.text:", response.text)

# Loop over the pages, 25 movies per page
while start < 250:
    headers = {"user-agent": ua.random}
    url = "https://movie.douban.com/top250?start=" + str(start) + "&filter="
    response = requests.get(url=url, headers=headers)
    # Get the page content
    content = response.text
    # Build the soup
    soup = BeautifulSoup(content, "html.parser")
    element = soup.find(class_="grid_view")
    for li in element.find_all("li"):
        src = li.find("img").get("src")
        title = li.find(class_="title").string
        index = li.find("em").string
        print(index)
        pre_intro = li.find(class_="inq")
        if pre_intro is None:
            intro = ""
        else:
            intro = pre_intro.string
        score = li.find(class_="rating_num").string
        author = li.find("p").text.splitlines()[1].strip()
        ws.append([title, src, index, score, author, intro])
    start += 25
wb.save(output_file)
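One way to make a scraper like this easier to test is to separate parsing from fetching, so that the parsing step can run on saved HTML without any network access. The sketch below is an illustration only: parse_movies and SAMPLE_PAGE are hypothetical, and the sample markup is a simplified stand-in that reuses the class names from the code above (grid_view, title, rating_num, inq), not the site's real page.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the class names used above
SAMPLE_PAGE = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <img src="https://example.com/poster1.jpg">
    <span class="title">Movie A</span>
    <span class="rating_num">9.7</span>
    <span class="inq">A tagline</span>
  </li>
  <li>
    <em>2</em>
    <img src="https://example.com/poster2.jpg">
    <span class="title">Movie B</span>
    <span class="rating_num">9.6</span>
  </li>
</ol>
"""


def parse_movies(html):
    """Return one dict per movie found in the given page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find(class_="grid_view")
    if grid is None:
        # For example, a blocked request that returned an error page
        return []
    rows = []
    for li in grid.find_all("li"):
        inq = li.find(class_="inq")
        rows.append({
            "rank": li.find("em").get_text(strip=True),
            "title": li.find(class_="title").get_text(strip=True),
            "poster": li.find("img").get("src"),
            "score": li.find(class_="rating_num").get_text(strip=True),
            "intro": inq.get_text(strip=True) if inq is not None else "",
        })
    return rows


movies = parse_movies(SAMPLE_PAGE)
print(movies[0]["title"], movies[1]["intro"] == "")
```

With this split, the fetching loop only needs to pass response.text to the parsing function and append the returned rows to the worksheet.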

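Notes on the User-Agent: why does the crawler get <Response [418]>? requests.get() returns <Response [418]> here because the site has an anti-crawler mechanism that rejects requests that do not look like they come from a browser. The fix is to send a request header (user-agent) with the request.

In Python, the fake_useragent library can generate random User-Agent strings to imitate different browsers. The following example uses it: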

from fake_useragent import UserAgent
import requests

# Create a UserAgent object
ua = UserAgent()
# Generate a random User-Agent
user_agent = ua.random
print(user_agent)
# Send a request with the generated User-Agent
headers = {'User-Agent': user_agent}
response = requests.get('https://www.example.com', headers=headers)
# Print the status code
print(response.status_code)

2. Scraping Bilibili danmaku (bullet comments)

import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()
headers = {"user-agent": ua.random}
# Alternative URL, not used below
url = "https://api.bilibili.com/x/v2/dm/wbi/web/seg.so?type=1&oid=276746872&pid=501083374&segment_index=1&pull_mode=1&ps=0&pe=120000&web_location=1315873&w_rid=43b159864d00ef22bf1c483986cbc52e&wts=1702340150"
url1 = "https://api.bilibili.com/x/v1/dm/list.so?oid=276746872"
response = requests.get(url1, headers=headers)
content = response.content
print(content)
soup = BeautifulSoup(content, "html.parser")
results = soup.find_all('d')
print('type(results):', type(results))
# Write one comment per line; the with block closes the file
with open("danmaku.txt", "w", encoding="utf-8") as file1:
    for result in results:
        # get_text() returns "" for an empty tag, where .string would be None
        file1.write(result.get_text())
        file1.write("\n")
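The extraction step can be tried offline on a small sample. The sketch below assumes the response is XML in which each comment sits in a d tag, which is what the find_all('d') call above relies on; SAMPLE_XML, its attribute values, and the helper extract_danmaku are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample; the p attribute values are placeholders
SAMPLE_XML = (
    "<i>"
    '<d p="1.5,1,25,16777215">first comment</d>'
    '<d p="3.0,1,25,16777215">second comment</d>'
    '<d p="4.2,1,25,16777215"></d>'
    "</i>"
)


def extract_danmaku(xml_text):
    """Return the non-empty text of every <d> tag."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # get_text() is "" for an empty <d>, so empty comments are skipped
    return [d.get_text() for d in soup.find_all("d") if d.get_text()]


comments = extract_danmaku(SAMPLE_XML)
print(comments)
```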

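Additional notes:

Browser F12: what the controls in the Network panel do

Under Network

Preserve log: when checked, requests are kept across page navigation (for example, after a successful login redirects to the home page, the login request would normally disappear; with Preserve log checked, requests made before the redirect are kept);

Disable cache: when checked, the browser skips its local cache and fetches resources from the server; when unchecked, the local cache is used (a common situation in testing: the developer says a bug is fixed but it still shows up locally, possibly because of the local cache; checking this option fetches fresh resources from the server);

The row of filters starting with All selects the request type, which corresponds to the Type column in the request list below;

All: all request types;

XHR: API requests;

Doc: documents (type: document)

JS: JavaScript files (type: script)

CSS: front-end stylesheets (type: stylesheet)

Img: images (type: jpg, jpeg, png, gif...)

Media: audio and video (type: MP4, MP3...)

Font: fonts (novel sites, for example, sometimes use special fonts...)

WS: WebSocket, a protocol for full-duplex communication over a single TCP connection.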

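The following terms come up in web development and in network requests and responses. Brief explanations:

Headers: in an HTTP request or response, headers carry metadata about the message, such as content type, content length, and authorization information. Request headers hold what the client sends to the server; response headers hold what the server sends back.

Payload: in network communication, the payload is the actual data, excluding the protocol and other control information. In HTTP it usually means the actual content of a request or response, such as an HTML page or JSON data.

Preview: in the developer tools, a quick rendered view of the data in a request or response, such as the structure of JSON data or a fragment of an HTML page.

Response: in HTTP, the server's reply to a client request. It contains a status code, headers, and an optional payload.

Initiator: what triggered the request, such as a script on the page or a user interaction.

Timing: timing information for the request, such as DNS lookup time, connection time, SSL handshake time, and transfer time. It is typically used for performance analysis and optimization.

Cookies: in web development, a cookie is a small piece of text data that the server sends to the browser and the browser stores locally; it holds information about the user. The browser sends the cookie back to the server on later requests, which lets the server identify the user and track session state.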
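The same pieces (headers, cookies, payload) can be inspected from Python without sending anything over the network, by preparing a request with the requests library. The URL, header value, cookie, and JSON body below are placeholders chosen for illustration.

```python
import json

import requests

request = requests.Request(
    "POST",
    "https://www.example.com/api/search",
    headers={"User-Agent": "demo-client/0.1"},
    cookies={"session": "abc123"},
    json={"keyword": "soup"},
)
# prepare() builds the final request but does not send it
prepared = request.prepare()

print(prepared.headers["User-Agent"])    # a request header
print(prepared.headers["Cookie"])        # cookies travel as a header
print(prepared.headers["Content-Type"])  # set automatically for a JSON payload
print(json.loads(prepared.body))         # the payload
```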