https://www.firecrawl.dev/

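What is Firecrawl
Firecrawl is a crawler that converts websites into Markdown that is easy for AI to process. It is offered mainly as an API service and needs no sitemap: given a single URL, it crawls that site and every accessible subpage under it.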
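To make "no sitemap needed" concrete, here is a minimal sketch of how a crawler can discover subpages from the links on a page it has already fetched. This is an illustration using only the Python standard library, not Firecrawl's actual implementation; the names LinkCollector and same_site_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute, de-duplicated links that stay on the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links
```

A crawler would call same_site_links on each fetched page and queue the returned URLs it has not visited yet.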
Deploying Firecrawl locally
https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
For a simpler setup, you can use Docker Compose to run all services:
- Prerequisites: Make sure you have Docker and Docker Compose installed
- Copy the .env.example file to .env in the /apps/api/ directory and configure as needed
- From the root directory, run: docker compose up
This will start Redis, the API server, and workers automatically in the correct configuration.
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
Create the .env file
cp apps/api/.env.example apps/api/.env
If you need the LLM features, set OPENAI_API_KEY and OPENAI_BASE_URL:
OPENAI_API_KEY=xxx
OPENAI_BASE_URL=xxx
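Before building, it can help to confirm that both values are actually filled in. The following is a small sketch with the standard library only; parse_env and missing_keys are hypothetical helper names, and the parser handles only plain KEY=VALUE lines (no quoting or variable expansion).

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env, required=("OPENAI_API_KEY", "OPENAI_BASE_URL")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

For example, missing_keys(parse_env(open("apps/api/.env", encoding="utf-8").read())) returns an empty list when both keys are set.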
Build and start
docker compose build
docker compose up -d
Downloading Playwright can be slow from mainland China. If so, edit apps/playwright-service-ts/Dockerfile to use mirrors:
RUN echo "deb http://mirrors.aliyun.com/debian/ bookworm main non-free contrib\n\
deb http://mirrors.aliyun.com/debian/ bookworm-updates main non-free contrib\n\
deb http://mirrors.aliyun.com/debian-security bookworm-security main non-free contrib" > /etc/apt/sources.list
# Install Playwright dependencies
ENV PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright/
RUN npx playwright install --with-deps
Test it
curl -X GET http://localhost:3002/test
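The same check can be scripted, for instance to wait for the containers to come up. This is a sketch using only the standard library; is_reachable is a hypothetical helper, and it assumes any 2xx response from the /test endpoint means the API server is up.

```python
from urllib.error import URLError
from urllib.request import urlopen


def is_reachable(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status, else False."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False
```

Calling is_reachable("http://localhost:3002/test") mirrors the curl command above.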
Calling it from Python
pip install firecrawl-py
import logging

from firecrawl import FirecrawlApp

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main():
    try:
        # Point the client at the local instance (api_key is left as None here)
        app = FirecrawlApp(api_key=None, api_url="http://localhost:3002")
        # Ask for the page content as Markdown
        params = {
            'formats': ['markdown'],
        }
        logger.info("Starting scrape...")
        scrape_status = app.scrape_url('https://www.kujiale.com/', params=params)
        logger.info("Scrape result:")
        print(scrape_status)
    except Exception as e:
        logger.error(f"Error during scrape: {str(e)}")
        raise


if __name__ == "__main__":
    main()


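As the result shows, Firecrawl extracts the page content, which makes it convenient to hand the data straight to an AI model or insert it into a RAG pipeline for further processing.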
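As a rough sketch of that follow-up step, the helpers below pull the Markdown out of a scrape result and split it into heading-based chunks for indexing. Both function names are hypothetical. The shape of the result is an assumption: depending on the firecrawl-py version it may be a dict with a "markdown" key or an object with a markdown attribute, so the code accepts either.

```python
def extract_markdown(result):
    """Return the Markdown text from a scrape result, or "" if none is found.

    Assumption: `result` is a dict with a "markdown" key or an object
    with a `markdown` attribute.
    """
    if isinstance(result, dict):
        return result.get("markdown") or ""
    return getattr(result, "markdown", None) or ""


def split_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at each '#' line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only whitespace.
    return [chunk for chunk in chunks if chunk]
```

This split is deliberately naive: a line starting with # inside a fenced code block is also treated as a heading, so a real pipeline would likely use a proper Markdown parser or a token-based splitter.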
