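Python Web Scraping, Part 17: Using BeautifulSoup

Scraping web pages with Python is a common task, especially when you need to collect data from a website in bulk. BeautifulSoup is a widely used Python library for parsing HTML and XML documents, which makes it a good fit for extracting information from web pages.

The simple example below shows how to scrape data from a web page with Python and BeautifulSoup.

Installing dependencies

First make sure that the requests and beautifulsoup4 libraries are installed. If they are not, install them with pip: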

pip install requests beautifulsoup4

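Example code

Suppose we want to scrape the latest headline titles from a news site. A fictional news site serves as the example here.

1. Import the required libraries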

import requests
from bs4 import BeautifulSoup

2. Send the HTTP request

We send a GET request to the target site and read the body of the response.

url = 'https://example.com/news'  # assumed URL of the news site to scrape
response = requests.get(url)
html_content = response.text
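The request step can be made a little more defensive. The sketch below is one possible variant, not the only way to do it: it sets a timeout, sends a User-Agent header, raises on HTTP errors, and lets requests guess the encoding when the server declares none. The function name fetch_html, the DEFAULT_HEADERS value, and the optional session parameter are assumptions introduced for illustration.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies your crawler.
DEFAULT_HEADERS = {"User-Agent": "news-title-tutorial/0.1"}


def fetch_html(url, timeout=10.0, session=None):
    """Fetch a page and return its decoded text, raising on HTTP errors."""
    # Use the caller's session if one is given, otherwise the module-level helper.
    get = session.get if session is not None else requests.get
    response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    # requests falls back to ISO-8859-1 for text responses that declare no
    # charset, which can garble non-Latin pages; guess from the body instead.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

A call such as fetch_html('https://example.com/news') would then take the place of the requests.get and response.text lines. Passing a requests.Session as session reuses connections across several requests.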

3. Parse the HTML document

Next, we parse the HTML document with BeautifulSoup.

soup = BeautifulSoup(html_content, 'html.parser')

4. Extract the data

Suppose we want every news title. Titles usually sit inside a specific HTML tag, such as <h2> or <a>.

titles = soup.find_all('h2', class_='news-title')  # 'news-title' is an assumed class name; adjust it to the real page
for title in titles:
    print(title.text.strip())
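When the title text sits in an <a> element nested inside the <h2>, a CSS selector can pick up the text and the link together. The following is a minimal sketch run against inline HTML; the markup, the news-title class, and the extract_headlines helper are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page; the structure and the
# 'news-title' class are assumptions made for this sketch.
SAMPLE_HTML = """
<div id="headlines">
  <h2 class="news-title"><a href="/news/1"> First headline </a></h2>
  <h2 class="news-title"><a href="/news/2">Second headline</a></h2>
  <h2 class="sidebar-title">Not a news item</h2>
</div>
"""


def extract_headlines(html):
    """Return a list of {'title', 'link'} dicts, one per matching heading link."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    # CSS selector: <a> elements nested inside <h2 class="news-title">
    for link in soup.select("h2.news-title a"):
        headlines.append({
            "title": link.get_text(strip=True),  # text with surrounding whitespace removed
            "link": link.get("href"),            # None if the attribute is missing
        })
    return headlines


for item in extract_headlines(SAMPLE_HTML):
    print(item["title"], item["link"])
```

Both soup.select and soup.find_all return lists of elements, so either style fits the loop shown in this step.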

Complete example

Putting the pieces above together gives a complete script:

import requests
from bs4 import BeautifulSoup


def fetch_news_titles(url):
    """Return the text of every <h2 class="news-title"> element on the page."""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        titles = soup.find_all('h2', class_='news-title')
        return [title.text.strip() for title in titles]
    else:
        print(f"Failed to retrieve data from {url}")
        return []


if __name__ == '__main__':
    url = 'https://example.com/news'
    news_titles = fetch_news_titles(url)
    for title in news_titles:
        print(title)

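Notes

In practice, replace the example URL with the real news site's URL, and adjust the CSS selectors or HTML tags to match the site's actual structure.
Some sites load content dynamically with JavaScript, in which case fetching the raw HTML may not return all of the data. A tool such as Selenium can help here.
Follow the rules in the target site's robots.txt file and the applicable laws, and respect the site's copyright and data-use policies.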
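The robots.txt check mentioned in the last note can be automated with the standard library. The sketch below parses a hypothetical robots.txt from a string so that it runs offline; the rules, the USER_AGENT name, and the build_parser helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules standing in for a site's robots.txt (hypothetical content).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "news-title-tutorial"  # hypothetical crawler name


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch(USER_AGENT, "https://example.com/news"))          # True
print(parser.can_fetch(USER_AGENT, "https://example.com/private/data"))  # False
print(parser.crawl_delay(USER_AGENT))                                    # 2
```

Against a live site, calling parser.set_url() with the robots.txt address and then parser.read() downloads and parses the real file instead.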

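Next, we can extend the earlier example to handle more complex page structures and add features such as error handling and logging. We can also add support for dynamically loaded content.

Extended example

1. Error handling and logging

Error handling matters a great deal in a real scraping project. We should plan for failed network connections, server errors, and changes to the page structure. Good logging also helps us trace what the program did and where it went wrong.

2. Handling dynamically loaded content

Some sites load content dynamically with JavaScript, so fetching the raw HTML alone misses part of the page. In that case we can use Selenium to simulate a browser.

3. Storing the data

Finally, we need to decide how to store the scraped data. The simplest options are writing it to a text file or to a database.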
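The two storage options can be sketched with the standard library alone. The example below assumes each scraped item is a dict with a title key (an assumed shape), and writes to an in-memory buffer and an in-memory SQLite database so that it runs without touching the disk. The helper names and the news table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical scraped items; the {'title': ...} shape is an assumption.
news_items = [{"title": "First headline"}, {"title": "Second headline"}]


def save_to_csv(items, file_obj):
    """Write items as CSV rows with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=["title"])
    writer.writeheader()
    writer.writerows(items)


def save_to_sqlite(items, connection):
    """Insert items into a 'news' table, skipping titles already stored."""
    connection.execute("CREATE TABLE IF NOT EXISTS news (title TEXT PRIMARY KEY)")
    connection.executemany(
        "INSERT OR IGNORE INTO news (title) VALUES (:title)", items
    )
    connection.commit()


buffer = io.StringIO()  # stands in for open('news_items.csv', 'w', newline='')
save_to_csv(news_items, buffer)

conn = sqlite3.connect(":memory:")  # pass a file path for a persistent database
save_to_sqlite(news_items, conn)
save_to_sqlite(news_items, conn)    # second call adds nothing: titles are unique
print(conn.execute("SELECT COUNT(*) FROM news").fetchone()[0])  # 2
```

Making the title the primary key is one simple way to keep repeated runs from storing the same headline twice; a real schema would likely add a URL and a timestamp.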

Example code

1. Import the required libraries

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import logging
import time

2. Configure logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

3. Define a function that fetches static content

def fetch_static_content(url):
    try:
        logger.info(f"Fetching static content from: {url}")
        response = requests.get(url)
        response.raise_for_status()  # raise on HTTP errors such as 404
        return response.text
    except requests.RequestException as e:
        logger.error(f"Error fetching content: {e}")
        return None

4. Define a function that fetches dynamically loaded content

def fetch_dynamic_content(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run without a visible browser window
    service = Service('path/to/chromedriver')  # path to the chromedriver binary
    driver = webdriver.Chrome(service=service, options=options)
    try:
        logger.info(f"Fetching dynamic content from: {url}")
        driver.get(url)
        # Assume the page has a "load more" button that must be clicked
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "load-more"))
        )
        load_more_button.click()
        # Wait for the new content to finish loading
        time.sleep(3)
        # Grab the rendered page source
        html_content = driver.page_source
        return html_content
    except Exception as e:
        logger.error(f"Error fetching dynamic content: {e}")
        return None
    finally:
        driver.quit()

5. Define a function that parses the content and extracts the data

def parse_and_extract_data(html_content):
    if not html_content:
        return []
    soup = BeautifulSoup(html_content, 'html.parser')
    # Assume every news title is in an <h2> tag with a specific class attribute
    titles = soup.find_all('h2', class_='news-title')
    news_items = [{'title': title.text.strip()} for title in titles]
    return news_items

6. Define the main function

def main():
    url = 'https://example.com/news'

    # Fetch and parse the static content
    static_html = fetch_static_content(url)
    static_news_items = parse_and_extract_data(static_html)
    logger.info("Static content parsed.")

    # Fetch and parse the dynamic content
    dynamic_html = fetch_dynamic_content(url)
    dynamic_news_items = parse_and_extract_data(dynamic_html)
    logger.info("Dynamic content parsed.")

    # Merge the results
    all_news_items = static_news_items + dynamic_news_items

    # Write the titles to a file, one per line
    with open('news_items.txt', 'w', encoding='utf-8') as f:
        for item in all_news_items:
            f.write(f"{item['title']}\n")
    logger.info("Data saved to file.")


if __name__ == '__main__':
    main()
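Because main() fetches the same URL twice, once statically and once through the browser, the merged list will probably contain every statically rendered title twice. A possible fix, sketched here with a hypothetical merge_unique helper that is not part of the original code, is to drop repeats while keeping the original order.

```python
def merge_unique(*item_lists):
    """Concatenate lists of {'title': ...} dicts, keeping first occurrences only."""
    seen = set()
    merged = []
    for items in item_lists:
        for item in items:
            if item["title"] not in seen:
                seen.add(item["title"])
                merged.append(item)
    return merged


# Hypothetical results: the dynamic fetch repeats the static titles and adds one.
static_items = [{"title": "A"}, {"title": "B"}]
dynamic_items = [{"title": "A"}, {"title": "B"}, {"title": "C"}]
print(merge_unique(static_items, dynamic_items))
# [{'title': 'A'}, {'title': 'B'}, {'title': 'C'}]
```

In main(), merge_unique(static_news_items, dynamic_news_items) could replace the plain list concatenation.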

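Notes

When using Selenium, make sure that ChromeDriver is installed and that its path is set correctly.
Adjust the CSS selectors or HTML tags to match the structure of the real site.
If the site has anti-scraping measures, you may need additional strategies, such as using proxies or adding reasonable delays between requests.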
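The delay between requests mentioned in the last note can be wrapped in a small helper. The class below is a sketch under assumptions, not an established API: the Throttle name, its parameters, and the example URLs are all hypothetical. It enforces a minimum gap between calls and adds random jitter so the requests are not perfectly periodic.

```python
import random
import time


class Throttle:
    """Enforce a minimum delay (plus random jitter) between requests."""

    def __init__(self, min_delay=1.0, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the delay, then record the time."""
        if self._last_request is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            elapsed = self._clock() - self._last_request
            if elapsed < target:
                self._sleep(target - elapsed)
        self._last_request = self._clock()


throttle = Throttle(min_delay=0.2, jitter=0.1)
for url in ["https://example.com/news?page=1", "https://example.com/news?page=2"]:
    throttle.wait()
    print("would fetch", url)  # place the real request here
```

The first call returns immediately; later calls sleep only for the part of the delay that has not already elapsed.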

That completes a fuller example, with error handling, logging, and dynamic content scraping.

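The code can be made more robust and more efficient. Possible improvements:

Exception handling and retries: when the network is unreliable, retry logic makes the scraper more stable.
Asynchronous requests: asynchronous I/O handles several requests concurrently and improves throughput.
More detailed logging: recording more information makes debugging and monitoring easier.
Graceful shutdown: the program should exit cleanly when an error occurs.

The optimized code follows: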

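1. Asynchronous requests

We use asyncio and aiohttp to send requests asynchronously and improve efficiency.

2. Retries

Network requests get a retry mechanism to cope with occasional network hiccups.

3. Logging

We log in more detail, including request status and exception information.

4. Safe exit

The program should exit cleanly when an error occurs.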

Example code

1. Install the additional dependency

pip install aiohttp

2. Update the imports

import asyncio
import logging
from bs4 import BeautifulSoup
from aiohttp import ClientSession, ClientResponseError
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

3. Configure logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

4. Define the asynchronous request function

async def fetch(session, url, max_retries=3):
    for retry in range(max_retries):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                logger.info(f"Fetched {url} successfully.")
                return await response.text()
        except ClientResponseError as e:
            logger.warning(f"Failed to fetch {url}, status code: {e.status}. Retrying...")
        except Exception as e:
            logger.error(f"Error fetching {url}: {e}. Retrying...")
        await asyncio.sleep(2 * (retry + 1))  # linear backoff: 2 s, 4 s, 6 s
    logger.error(f"Failed to fetch {url} after {max_retries} retries.")
    return None
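The delay in fetch, 2 * (retry + 1), grows linearly with each attempt. If true exponential backoff is wanted, one possible shape is sketched below. The names backoff_delay and retry_async, their parameters, and the flaky demo operation are assumptions made for illustration and are not part of aiohttp.

```python
import asyncio
import random


def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=30.0, jitter=0.0):
    """Delay before retry number `attempt` (0-based): base * factor**attempt, capped."""
    delay = min(base * factor ** attempt, max_delay)
    return delay + random.uniform(0, jitter)


async def retry_async(operation, max_retries=3, sleep=asyncio.sleep, **backoff_kwargs):
    """Await operation() until it succeeds or max_retries attempts have failed."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception as error:  # narrow this to network errors in real code
            last_error = error
            if attempt < max_retries - 1:  # skip the sleep after the final attempt
                await sleep(backoff_delay(attempt, **backoff_kwargs))
    raise last_error


async def demo():
    calls = {"count": 0}

    async def flaky():
        # Hypothetical operation that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    return await retry_async(flaky, max_retries=3, base=0.01)


print(asyncio.run(demo()))  # ok
```

With the defaults, the delays are 1, 2, 4, 8 seconds and so on, capped at max_delay. A non-zero jitter spreads out retries from clients that failed at the same moment.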

5. Define the asynchronous function that fetches static content

async def fetch_static_content(url, session):
    html_content = await fetch(session, url)
    return html_content

6. Define the function that fetches dynamic content

def fetch_dynamic_content(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run without a visible browser window
    service = Service('path/to/chromedriver')  # path to the chromedriver binary
    driver = webdriver.Chrome(service=service, options=options)
    try:
        logger.info(f"Fetching dynamic content from: {url}")
        driver.get(url)
        # Assume the page has a "load more" button that must be clicked
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "load-more"))
        )
        load_more_button.click()
        # Wait for the new content to finish loading
        time.sleep(3)
        # Grab the rendered page source
        html_content = driver.page_source
        return html_content
    except Exception as e:
        logger.error(f"Error fetching dynamic content: {e}")
        return None
    finally:
        driver.quit()

7. Define the function that parses the content and extracts the data

def parse_and_extract_data(html_content):
    if not html_content:
        return []
    soup = BeautifulSoup(html_content, 'html.parser')
    # Assume every news title is in an <h2> tag with a specific class attribute
    titles = soup.find_all('h2', class_='news-title')
    news_items = [{'title': title.text.strip()} for title in titles]
    return news_items

8. Define the main function

async def main():
    url = 'https://example.com/news'
    async with ClientSession() as session:
        # Fetch and parse the static content
        static_html = await fetch_static_content(url, session)
        static_news_items = parse_and_extract_data(static_html)
        logger.info("Static content parsed.")

        # Fetch and parse the dynamic content
        # (Selenium is synchronous, so this call blocks the event loop)
        dynamic_html = fetch_dynamic_content(url)
        dynamic_news_items = parse_and_extract_data(dynamic_html)
        logger.info("Dynamic content parsed.")

        # Merge the results
        all_news_items = static_news_items + dynamic_news_items

        # Write the titles to a file, one per line
        with open('news_items.txt', 'w', encoding='utf-8') as f:
            for item in all_news_items:
                f.write(f"{item['title']}\n")
        logger.info("Data saved to file.")


if __name__ == '__main__':
    asyncio.run(main())
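The main function above awaits a single URL, so the asynchronous version gains little over the synchronous one. The benefit shows up when several pages are fetched at once. The sketch below illustrates that pattern with asyncio.gather and a semaphore; fetch_all, fake_fetch, and the paginated URLs are hypothetical, and fake_fetch only simulates a download so that the example runs offline.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrency=5):
    """Run fetch_one(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:  # limits how many downloads are in flight
            return await fetch_one(url)

    # gather preserves input order; with return_exceptions=True a failure is
    # returned as an exception object in the list instead of being raised.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


async def fake_fetch(url):
    # Stand-in for a real coroutine such as fetch(session, url).
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = [f"https://example.com/news?page={n}" for n in range(1, 4)]  # hypothetical pages
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
print(len(pages))  # 3
```

With the article's code, fetch_one could be a small wrapper such as lambda u: fetch(session, u), called inside the async with ClientSession() block. Keeping max_concurrency low is also gentler on the target site.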