Python Web Scraping, Lesson 15: CSS Selector Basics

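When scraping web pages with Python, CSS selectors are one of the most common ways to extract specific elements from an HTML document. A CSS selector locates elements by their position in the document structure and by their attributes. Combined with a Python library such as BeautifulSoup or PyQuery, selectors let you parse a page and filter out the data you want with very little code.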

CSS Selector Basics

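Type selector:
Uses the element name as the selector, such as div or a.
Class selector:
Uses a dot followed by the class name, such as .classname.
ID selector:
Uses a hash sign followed by the ID, such as #idname.
Attribute selector:
Selects elements that have a given attribute, or a given attribute value, such as [href] or [class="myclass"].
Child selector:
Selects the direct children of an element, such as ul > li.
Descendant selector:
Selects all descendants of an element at any depth, such as div p (every p inside a div).
Adjacent sibling selector:
Selects the element that immediately follows another element with the same parent, such as h1 + p.
General sibling selector:
Selects every later sibling of an element that shares its parent, such as h1 ~ p (every p that comes after an h1 at the same level).
Selector list (grouping):
Combines several selectors separated by commas and matches an element if any one of them matches, such as div, span.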
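
To make the list above concrete, here is a minimal sketch that runs one selector of each kind against a small document with BeautifulSoup and counts the matches. The sample markup and the helper name count are made up for this illustration; they do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div id="main" class="box">
  <h1>Heading</h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
  <ul>
    <li><a href="/a">Link A</a></li>
    <li><a>No href</a></li>
  </ul>
  <span>Note</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def count(selector):
    """Return how many elements in the sample match the selector."""
    return len(soup.select(selector))


results = {
    "p": count("p"),                # type selector
    ".intro": count(".intro"),      # class selector
    "#main": count("#main"),        # ID selector
    "[href]": count("[href]"),      # attribute selector
    "ul > li": count("ul > li"),    # child selector
    "div a": count("div a"),        # descendant selector
    "h1 + p": count("h1 + p"),      # adjacent sibling selector
    "h1 ~ p": count("h1 ~ p"),      # general sibling selector
    "h1, span": count("h1, span"),  # selector list
}

for selector, n in results.items():
    print(f"{selector!r}: {n}")
```

Note the difference between the two sibling selectors: h1 + p matches only the paragraph directly after the heading, while h1 ~ p matches both paragraphs.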

Using CSS Selectors in Python

Using BeautifulSoup

Suppose you have the following HTML:

<div id="content">
  <h1>My Title</h1>
  <p class="description">This is a description.</p>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>
</div>

Parse it with BeautifulSoup and extract the data:

from bs4 import BeautifulSoup

html_doc = """
<div id="content">
  <h1>My Title</h1>
  <p class="description">This is a description.</p>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title (select_one returns the first match)
title = soup.select_one('h1').text
print("Title:", title)

# Get the description
description = soup.select_one('.description').text
print("Description:", description)

# Get the list items (select returns every match)
items = [item.text for item in soup.select('li')]
print("Items:", items)
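
One detail worth keeping in mind with the code above: select_one returns None when nothing matches, so calling .text on the result raises AttributeError, whereas select simply returns an empty list. The sketch below shows one way to guard against that. The helper text_or_default is a hypothetical name introduced here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# A cut-down document that has a title but no description.
html = '<div id="content"><h1>My Title</h1></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_default(soup, selector, default=""):
    """Return the stripped text of the first match, or default if none."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


title = text_or_default(soup, "h1")
description = text_or_default(soup, ".description", default="(no description)")
missing = soup.select(".description")  # empty list, no exception

print("Title:", title)
print("Description:", description)
print("Matches for .description:", len(missing))
```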

Using PyQuery

PyQuery's API is closer to jQuery's:

from pyquery import PyQuery as pq

html_doc = """
<div id="content">
  <h1>My Title</h1>
  <p class="description">This is a description.</p>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>
</div>
"""

doc = pq(html_doc)

# Get the title
title = doc('h1').text()
print("Title:", title)

# Get the description
description = doc('.description').text()
print("Description:", description)

# Get the list items (map calls the function with the index and the element)
items = doc('li').map(lambda i, e: pq(e).text())
print("Items:", list(items))

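That covers the basics of scraping web data with CSS selectors in Python. With these tools you can pull the information you need from a page flexibly and precisely.

The same approach extends to more complex HTML structures, a wider range of CSS selectors, and error handling. The following, more detailed example shows how to use BeautifulSoup and PyQuery on an HTML document that contains more elements and attributes.

Suppose we have the following HTML structure: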

<html>
<head>
  <title>Sample Page</title>
</head>
<body>
  <div id="header"><h1>Welcome to Our Site</h1></div>
  <div id="content">
    <section class="main">
      <article class="post" data-id="1">
        <h2>Post Title 1</h2>
        <p>Some text here...</p>
        <a href="/post/1" class="read-more">Read more</a>
      </article>
      <article class="post" data-id="2">
        <h2>Post Title 2</h2>
        <p>Some other text here...</p>
        <a href="/post/2" class="read-more">Read more</a>
      </article>
    </section>
    <aside>
      <h3>Latest Comments</h3>
      <ul>
        <li>User 1 commented on Post 1</li>
        <li>User 2 commented on Post 2</li>
      </ul>
    </aside>
  </div>
  <footer><p>Copyright © 2024</p></footer>
</body>
</html>

We will use this HTML to show how to extract the post titles, text, and links.

Using BeautifulSoup

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
  <title>Sample Page</title>
</head>
<body>
  <div id="header"><h1>Welcome to Our Site</h1></div>
  <div id="content">
    <section class="main">
      <article class="post" data-id="1">
        <h2>Post Title 1</h2>
        <p>Some text here...</p>
        <a href="/post/1" class="read-more">Read more</a>
      </article>
      <article class="post" data-id="2">
        <h2>Post Title 2</h2>
        <p>Some other text here...</p>
        <a href="/post/2" class="read-more">Read more</a>
      </article>
    </section>
    <aside>
      <h3>Latest Comments</h3>
      <ul>
        <li>User 1 commented on Post 1</li>
        <li>User 2 commented on Post 2</li>
      </ul>
    </aside>
  </div>
  <footer><p>Copyright © 2024</p></footer>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the title of every post
titles = [post.h2.text for post in soup.select('.post')]
print("Titles:", titles)

# Extract the link of every post.
# The selector already returns the <a> elements themselves, so read
# href directly from each one and skip any link without that attribute.
links = [a['href'] for a in soup.select('.post .read-more') if a.has_attr('href')]
print("Links:", links)

# Extract the text of the first post.
# :first-of-type compares element types, so this matches the first
# <article> among its siblings, provided it also has the class "post".
first_post_text = soup.select_one('.post:first-of-type p').text
print("First Post Text:", first_post_text)

# Check whether there are any latest comments
latest_comments = soup.select_one('#content aside ul')
if latest_comments is not None:
    print("Latest Comments Found!")
else:
    print("No latest comments found.")
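
The sample document also carries data-id and href attributes, which the attribute selectors from the first section can target directly. The following is a minimal sketch on a trimmed copy of the sample HTML. The variable names are chosen for this illustration, and the prefix match [href^="..."] is a standard CSS attribute selector that BeautifulSoup's select accepts.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document: only the two posts are kept.
html = """
<section class="main">
  <article class="post" data-id="1">
    <h2>Post Title 1</h2>
    <a href="/post/1" class="read-more">Read more</a>
  </article>
  <article class="post" data-id="2">
    <h2>Post Title 2</h2>
    <a href="/post/2" class="read-more">Read more</a>
  </article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact attribute value: the post whose data-id is "2".
second = soup.select_one('article.post[data-id="2"]')
second_title = second.h2.get_text(strip=True)

# Attribute prefix: links whose href starts with "/post/".
post_links = [a["href"] for a in soup.select('a.read-more[href^="/post/"]')]

# Attribute presence: every article that has a data-id at all.
ids = [article["data-id"] for article in soup.select("article[data-id]")]

print(second_title)
print(post_links)
print(ids)
```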

Using PyQuery

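from pyquery import PyQuery as pq

# Use the same HTML document as in the BeautifulSoup example above
html_doc = """
<html>
<!-- HTML content here -->
</html>
"""

doc = pq(html_doc)

# Extract the title of every post
titles = doc('.post h2').map(lambda i, e: pq(e).text())
print("Titles:", list(titles))

# Extract the link of every post
links = doc('.post .read-more').map(lambda i, e: pq(e).attr('href'))
print("Links:", list(links))

# Extract the text of the first post.
# The tag name is written out here because PyQuery's selector backend
# (cssselect) may reject :first-of-type without an explicit element name.
first_post_text = doc('article.post:first-of-type p').text()
print("First Post Text:", first_post_text)

# Check whether there are any latest comments
latest_comments = doc('#content aside ul')
if latest_comments.length:
    print("Latest Comments Found!")
else:
    print("No latest comments found.")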

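The code above shows how to use CSS selectors with Python libraries to extract information from a more complex HTML document. In real-world use you will also have to deal with failed network requests, malformed HTML, and pages whose structure is not consistent, so production code needs additional error checking and exception handling.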

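Next, we add exception handling so that the program copes gracefully with network errors, invalid HTML, and missing elements. The code can also be made more robust by using more specific CSS selectors, which reduces the chance of accidental matches, and performance becomes a consideration when processing large amounts of data.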
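
To illustrate what a more specific selector buys you, the sketch below compares a bare p selector with one scoped to the post body. It uses a trimmed copy of the sample HTML; the variable names broad and narrow are chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document: one post plus the footer.
html = """
<div id="content">
  <article class="post"><h2>Post Title 1</h2><p>Some text here...</p></article>
</div>
<footer><p>Copyright © 2024</p></footer>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare type selector also picks up the footer paragraph.
broad = [p.get_text(strip=True) for p in soup.select("p")]

# Scoping by ID, class, and the child combinator keeps only post text.
narrow = [p.get_text(strip=True) for p in soup.select("#content article.post > p")]

print("Broad:", broad)
print("Narrow:", narrow)
```

The trade-off is that a tightly scoped selector breaks more easily when the site changes its markup, so how specific to be is a judgment call for each page.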

Below are improved versions of the BeautifulSoup and PyQuery extraction code for the HTML above:

Using BeautifulSoup

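from bs4 import BeautifulSoup
import requests


def fetch_data(url):
    """Download a page and return its HTML, or None if the request fails."""
    try:
        # A timeout keeps the request from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises HTTPError for 4xx and 5xx responses
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching URL: {url}")
        print(e)
        return None


def parse_html(html):
    """Return a list of dicts with the title, text, and link of each post."""
    if html is None:
        return []
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for post in soup.select('.post'):
        try:
            title = post.h2.text.strip()
            text = post.p.text.strip()
            link = post.find('a', class_='read-more')['href']
            posts.append({'title': title, 'text': text, 'link': link})
        except (AttributeError, TypeError, KeyError):
            # AttributeError: the h2 or p is missing
            # TypeError: there is no read-more link (find returned None)
            # KeyError: the link has no href attribute
            print("Missing element in post, skipping...")
            continue
    return posts


def main():
    url = "http://example.com"
    html = fetch_data(url)
    posts = parse_html(html)
    print(posts)


if __name__ == "__main__":
    main()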

Using PyQuery

from pyquery import PyQuery as pq
import requests


def fetch_data(url):
    """Download a page and return its HTML, or None if the request fails."""
    try:
        # A timeout keeps the request from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises HTTPError for 4xx and 5xx responses
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching URL: {url}")
        print(e)
        return None


def parse_html(html):
    """Return a list of dicts with the title, text, and link of each post."""
    if html is None:
        return []
    doc = pq(html)
    posts = []

    def collect(i, e):
        post = pq(e)
        # Keep only posts that have a title, a paragraph, and a read-more link
        if post('h2') and post('p') and post('a.read-more'):
            posts.append({
                'title': post('h2').text(),
                'text': post('p').text(),
                'link': post('a.read-more').attr('href'),
            })

    doc('.post').each(collect)
    return posts


def main():
    url = "http://example.com"
    html = fetch_data(url)
    posts = parse_html(html)
    print(posts)


if __name__ == "__main__":
    main()
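
Both versions return each link exactly as it appears in the href attribute, which in the sample HTML is a relative path such as /post/1. If you want to request those pages next, one common approach is to resolve them against the page URL with the standard library's urljoin. This is a sketch under that assumption; the third link is a made-up example of an already absolute URL.

```python
from urllib.parse import urljoin

# The page the links were scraped from (same URL as in main() above).
base_url = "http://example.com"

# Two relative links from the sample HTML, plus a hypothetical absolute one.
links = ["/post/1", "/post/2", "https://other.example/page"]

# urljoin resolves relative paths and leaves absolute URLs unchanged.
absolute_links = [urljoin(base_url, link) for link in links]

for link in absolute_links:
    print(link)
```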

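In these two examples we made the following improvements:

Added a network request function, fetch_data, which handles network errors and HTTP errors.
In parse_html, added handling for missing elements (a try/except block in the BeautifulSoup version, a presence check in the PyQuery version), so that one absent element does not crash the whole program.
In the BeautifulSoup version, used strip() to remove leading and trailing whitespace from the text, keeping the data clean.
In the PyQuery version, used .each() to iterate over every .post element, which handles the extraction one post at a time; a post is collected only when its h2, p, and a.read-more elements are all present, so incomplete posts are left out of the result.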
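
The behavior described above can be checked offline, without any network request, by feeding the parser a document in which one post is incomplete. The sketch below is a variation rather than the code from this lesson: it uses explicit None checks instead of try/except, and the function name parse_posts and the test markup are made up for the illustration.

```python
from bs4 import BeautifulSoup


def parse_posts(html):
    """Collect complete posts; skip any post missing a title, text, or link."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.select(".post"):
        title = post.select_one("h2")
        text = post.select_one("p")
        # [href] makes sure the link actually has an href attribute.
        link = post.select_one("a.read-more[href]")
        if title is None or text is None or link is None:
            continue  # incomplete post, leave it out
        posts.append({
            "title": title.get_text(strip=True),
            "text": text.get_text(strip=True),
            "link": link["href"],
        })
    return posts


# Hypothetical input: the second post has no read-more link.
html = """
<article class="post"><h2> Complete post </h2><p> Body text </p>
  <a class="read-more" href="/post/1">Read more</a></article>
<article class="post"><h2>No link here</h2><p>Body text</p></article>
"""

posts = parse_posts(html)
print(posts)
```

Checking for None up front makes it explicit which elements are required, at the cost of a few more lines than the try/except form.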

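These improvements make the code more robust: when something unexpected happens, it reports or skips the problem instead of crashing abruptly.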