测试pdfplumber识别效果好些;另外pdf这两个如果超过20多页就没法识别了,结果为空
1、pdfplumber
安装:pip install pdfplumber -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
代码:
import pdfplumberwith pdfplumber.open(r"C:\Users\loong\Downloads\数字人研究报告.pdf") as pdf:num_pages = len(pdf.pages)print(num_pages)for page_num in range(num_pages):page = pdf.pages[page_num]text = page.extract_text()print(text)
原内容
识别结果:
2、PyPDF2
安装:pip install PyPDF2
代码:
import PyPDF2
from tqdm import tqdmpdftext = ""
with open(r"C:\Users\loong\Desktop\杰创\大模型\杰创智能.pdf", "rb") as pdfFileObj:pdfReader = PyPDF2.PdfReader(pdfFileObj)for page in tqdm(pdfReader.pages):pdftext += page.extract_text()print(pdftext)