Purpose
Split a PDF document into chunks, where each chunk is a passage from the document. The goal is for every chunk to carry reasonably complete semantics while not growing too long.
About This Project
I looked at many chunking projects, including langchain, Langchain-Chatchat, Chinese-LangChain, LangChain-ChatGLM-Webui, ChatPDF, semchunk, and so on. They work reasonably well, but not perfectly: they give "\n" a high priority as a split point, yet text extracted with pymupdf is full of "\n" characters, and deleting them all badly damages the semantics.
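To make the problem concrete, here is a minimal sketch (the sample string is invented) showing why "\n"-priority splitting misbehaves on pymupdf-style output:

```python
# In pymupdf-style extraction, hard "\n" breaks fall inside sentences,
# not at paragraph boundaries.
extracted = "本文提出了一种新的\n切分方法。该方法保持\n段落完整性。"

# A splitter that gives "\n" high priority cuts mid-sentence:
naive_chunks = extracted.split("\n")
print(naive_chunks)
# → ['本文提出了一种新的', '切分方法。该方法保持', '段落完整性。']

# Deleting every "\n" instead leaves no paragraph boundaries at all,
# so there is nothing reliable left to split on:
merged = extracted.replace("\n", "")
print(merged)  # → 本文提出了一种新的切分方法。该方法保持段落完整性。
```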
Chunking Logic
1. Preserve paragraph integrity
2. Preserve semantic integrity
Code Logic
1. Convert the PDF file to a DOCX file
2. Loop over the paragraphs to preserve paragraph integrity
3. Split at the period character to preserve semantic integrity
Code Implementation
import os
import csv

from pdf2docx import Converter
from docx import Document


def pdf_to_docx(pdf_file_path):
    """Convert a PDF to a DOCX file next to it; return the DOCX path."""
    try:
        docx_path = os.path.join(
            os.path.dirname(pdf_file_path),
            os.path.basename(pdf_file_path).split(".")[0] + ".docx",
        )
        cv = Converter(pdf_file_path)
        cv.convert(docx_path)
        cv.close()
        return docx_path
    except Exception as e:
        print(f"Error during conversion: {str(e)}")
        return False


def pdf2docx_to_csv(pdf_file_path, max_length=400):
    docx_path = pdf_to_docx(pdf_file_path)
    if not docx_path:
        return False
    docx = Document(docx_path)
    result = []
    current_text = ""
    for paragraph in docx.paragraphs:
        section = paragraph.text.strip()
        # Keep appending paragraphs while the chunk stays under max_length.
        if not current_text or len(current_text) + len(section) + 1 <= max_length:
            current_text += " " + section
        else:
            # Chunk is full: cut at the last full-width period so the chunk
            # ends on a complete sentence, and carry the remainder forward.
            period_index = current_text.rfind('。')
            if period_index != -1:
                period_text = current_text[:period_index + 1].strip()
                if period_text:
                    result.append((os.path.basename(docx_path), period_text))
                current_text = current_text[period_index + 1:] + section
            else:
                # No period found: emit the chunk as-is.
                current_text = current_text.strip()
                if current_text:
                    result.append((os.path.basename(docx_path), current_text))
                current_text = section
    if current_text.strip():
        result.append((os.path.basename(docx_path), current_text.strip()))

    output_path = os.path.join(
        os.path.dirname(pdf_file_path),
        os.path.basename(pdf_file_path).split(".")[0] + "_pdf2docx_" + ".csv",
    )
    with open(output_path, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['filename', 'text'])
        csvwriter.writerows(result)
    print(f"{pdf_file_path} processed")


if __name__ == "__main__":
    pdf_file_path = "/path/to/your/xxx.pdf"
    pdf2docx_to_csv(pdf_file_path)
If you find it useful, give it a like!