Common Python Web Scraping Libraries and Examples

I. Common Libraries

The following Python libraries are commonly used for web scraping, listed roughly in order of popularity:

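Requests: Requests is a very popular library for sending HTTP requests, valued for its simple API and broad community support.
Beautiful Soup: Beautiful Soup is a widely used HTML and XML parsing library for extracting and manipulating page content.
Scrapy: Scrapy is a full-featured crawling framework widely used for large-scale crawls. It has an active community and thorough documentation.
Selenium: Selenium automates a real browser and can therefore handle JavaScript-driven content; it is especially suited to tasks that need to simulate user behavior.
lxml: lxml is a high-performance HTML and XML parsing library. It supports XPath and can process large volumes of data efficiently.
Splash: Splash is a JavaScript rendering service, useful for scraping dynamic pages.
PyQuery: PyQuery offers a jQuery-like syntax for parsing and manipulating HTML.
Tornado: Tornado is an asynchronous networking library suitable for building high-performance crawlers, although it is not limited to crawling.
Gevent: Gevent is a coroutine library for writing asynchronous, high-performance network applications; it can also be used for crawlers.
Aiohttp: Aiohttp is an asynchronous HTTP client/server framework, suited to asynchronous crawlers.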

II. Examples

Below are simple usage examples for each of these libraries:

Requests:

import requests

# Send a GET request
response = requests.get("https://www.example.com")
print(response.text)

Beautiful Soup:

from bs4 import BeautifulSoup
import requests

# Send a GET request and parse the HTML
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title text (soup.title is None when the page has no <title>)
title = soup.title.string if soup.title else None
print("Title:", title)

Scrapy:

Use Scrapy to extract the h1 headings from a page. First create a project:

scrapy startproject myproject

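Create a Spider in the project's spiders directory and define the extraction rules:

import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = [
        "https://www.example.com",
    ]

    def parse(self, response):
        # Yield one item per <h1> element; get() returns the element's HTML
        for title in response.css('h1'):
            yield {
                'title': title.get(),
            }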

Run the Spider from inside the project directory:

scrapy crawl example

Selenium:

Use Selenium to open a web page and take a screenshot:

from selenium import webdriver

# Launch the Chrome browser
driver = webdriver.Chrome()

# Open the page
driver.get("https://www.example.com")

# Save a screenshot
driver.save_screenshot("screenshot.png")

# Close the browser
driver.quit()

lxml:

Use lxml to parse HTML and extract links:

from lxml import html
import requests

# Send a GET request
response = requests.get("https://www.example.com")

# Parse the HTML
tree = html.fromstring(response.content)

# Extract the href attribute of every link
links = tree.xpath('//a/@href')
for link in links:
    print("Link:", link)
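
The href values returned by the XPath query above are often relative (for example "/about"), so they usually need to be resolved against the page URL before they can be fetched. The following is a minimal sketch using only the standard library; the helper name absolutize and the sample hrefs are hypothetical and stand in for the result of tree.xpath('//a/@href').

```python
from urllib.parse import urljoin, urlparse


def absolutize(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) links."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Drop non-web schemes such as mailto: or javascript:
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


# Hypothetical hrefs, as they might come back from the XPath query
sample_hrefs = [
    "/about",
    "news/today.html",
    "https://www.iana.org/domains",
    "mailto:admin@example.com",
]
links = absolutize("https://www.example.com/docs/index.html", sample_hrefs)
for link in links:
    print("Link:", link)
```

Filtering on the scheme is one possible policy; a real crawler might also deduplicate links or restrict them to a single domain.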

PyQuery:

Use PyQuery to parse HTML and extract the title:

from pyquery import PyQuery as pq
import requests

# Send a GET request
response = requests.get("https://www.example.com")

# Build a PyQuery object
doc = pq(response.text)

# Extract the title text
title = doc('title').text()
print("Title:", title)

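Splash:

Use Splash to render JavaScript and retrieve the rendered page content:

import requests

# Ask the Splash service to render the page
# (a Splash instance must be listening on localhost:8050)
url = "http://localhost:8050/render.html"
params = {
    'url': "https://www.example.com",
    'wait': 2,  # wait 2 seconds so that JavaScript has time to finish loading
}
response = requests.get(url, params=params)

# The response body is the rendered HTML
rendered_html = response.text
print(rendered_html)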
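
To make it clearer what the request above sends to Splash, here is a small sketch that builds an equivalent query string by hand with the standard library. The function name build_render_url is hypothetical; the endpoint and the url and wait parameters are the ones used in the example above. Passing params= to requests should produce roughly the same URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the example above; assumes a local Splash instance
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def build_render_url(target_url, wait=2):
    """Return the Splash render.html URL for target_url (no request is sent)."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_ENDPOINT}?{query}"


render_url = build_render_url("https://www.example.com")
print(render_url)
```

Note that the target URL is percent-encoded inside the query string, which is why it should be passed as a parameter rather than concatenated by hand.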

Tornado:

Use Tornado to build a simple asynchronous crawler:

import tornado.ioloop
import tornado.httpclient


async def fetch_url(url):
    http_client = tornado.httpclient.AsyncHTTPClient()
    response = await http_client.fetch(url)
    print("Fetched URL:", url)
    return response.body


async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    # Each fetch is awaited in turn, so the URLs are fetched one after another
    for url in urls:
        html = await fetch_url(url)
        # Process the HTML content here


if __name__ == "__main__":
    tornado.ioloop.IOLoop.current().run_sync(main)

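Gevent:

Use Gevent to fetch several URLs concurrently:

from gevent import monkey
monkey.patch_all()  # patch blocking I/O so that requests yields to other greenlets

import gevent
import requests


def fetch_url(url):
    response = requests.get(url)
    print("Fetched URL:", url)
    # Process the HTML content here


urls = ["https://www.example.com", "https://www.example2.com"]
jobs = [gevent.spawn(fetch_url, url) for url in urls]
gevent.joinall(jobs)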

Aiohttp:

Use Aiohttp to fetch several URLs asynchronously:

import aiohttp
import asyncio


async def fetch_url(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            print("Fetched URL:", url)
            # Process the HTML content here


async def main():
    urls = ["https://www.example.com", "https://www.example2.com"]
    # Run all fetches concurrently and wait for them to finish
    await asyncio.gather(*(fetch_url(url) for url in urls))


asyncio.run(main())
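
The example above starts every request at once, which is fine for two URLs but can overload a site when the list is long. One common way to cap the number of requests in flight is asyncio.Semaphore. The sketch below assumes a limit of 2 and uses a hypothetical fake_fetch coroutine that sleeps instead of touching the network; in a real crawler its body would be the session.get call.

```python
import asyncio

MAX_CONCURRENCY = 2  # assumed limit, chosen only for illustration


async def fake_fetch(url, semaphore, stats):
    """Stand-in for a real HTTP request; sleeps instead of using the network."""
    async with semaphore:  # at most MAX_CONCURRENCY coroutines pass this point
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["active"] -= 1
        return url


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_fetch(url, semaphore, stats) for url in urls)
    )
    return results, stats["peak"]


urls = [f"https://www.example.com/page/{i}" for i in range(5)]
results, peak = asyncio.run(crawl(urls))
print("Fetched", len(results), "URLs; peak concurrency:", peak)
```

asyncio.gather returns results in the order the coroutines were passed in, so results lines up with urls even though the fetches overlap.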