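This code processes a Word document by extracting DOIs, looking them up, replacing them in the text, and generating a reference list. It can be divided into the following functional modules:
1. Import modules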
import re
from docx import Document
from docx.oxml.ns import qn
from habanero import Crossref
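The code imports the regular-expression module re for text pattern matching, the Document class from the python-docx library for working with Word documents, the qn function for handling XML namespaces (not actually used in the code), and the Crossref class from the habanero library, which looks up reference metadata by DOI.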
2. DOI extraction function
def extract_dois(text):
    doi_pattern = r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)'
    return re.findall(doi_pattern, text, re.IGNORECASE)
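The extract_dois function takes a text argument, text, and matches DOI-shaped strings with the regular expression doi_pattern. re.findall returns every match as a list of DOI strings, and the re.IGNORECASE flag makes the match case-insensitive.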
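The character class in doi_pattern includes '.', ';', ':' and ')', so a DOI that ends a sentence or sits inside parentheses is captured together with the trailing punctuation, and the same DOI can be returned more than once. A minimal sketch of one possible cleanup step follows. extract_clean_dois and the sample text are illustrative assumptions, not part of the original code.

```python
import re

# Same pattern as extract_dois; the cleanup below is an added assumption.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[-._;()/:A-Z0-9]+', re.IGNORECASE)


def extract_clean_dois(text):
    """Return unique DOIs in order of first appearance, without trailing punctuation."""
    seen = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip('.,;:')
        # Drop a closing parenthesis only when it has no opening partner in the DOI.
        while doi.endswith(')') and doi.count(')') > doi.count('('):
            doi = doi[:-1].rstrip('.,;:')
        if doi not in seen:
            seen.append(doi)
    return seen


sample = ("See 10.1000/xyz123. Also (doi:10.1016/S0022-2836(05)80360-2), "
          "and 10.1000/xyz123 again.")
print(extract_clean_dois(sample))
# ['10.1000/xyz123', '10.1016/S0022-2836(05)80360-2']
```

Stripping punctuation this way is a heuristic: a DOI that legitimately ends in one of these characters would be shortened, so the trade-off should be checked against the documents being processed.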
3. Reference lookup function
def get_reference(doi):
    cr = Crossref()
    try:
        result = cr.works(ids=doi)
        if 'message' in result:
            message = result['message']

            # Extract author information
            authors = []
            if 'author' in message:
                for author in message['author']:
                    if 'family' in author and 'given' in author:
                        last_name = author['family']
                        first_initial = author['given'][0] if author['given'] else ''
                        authors.append(f"{last_name}, {first_initial}.")
            author_str = ', '.join(authors)

            # Extract year, title, and the remaining fields
            # (Crossref keys are hyphenated: 'date-parts', 'container-title')
            year = (message['issued']['date-parts'][0][0]
                    if 'issued' in message
                    and 'date-parts' in message['issued']
                    and message['issued']['date-parts']
                    else 'n.d.')
            title = message['title'][0] if 'title' in message and message['title'] else 'No title'
            journal = (message['container-title'][0]
                       if 'container-title' in message and message['container-title']
                       else 'No journal')
            volume = message['volume'] if 'volume' in message else 'No volume'
            issue = message['issue'] if 'issue' in message else 'No issue'
            pages = message['page'] if 'page' in message else 'No pages'

            reference = f"{author_str} ({year}). {title}. {journal}, {volume}({issue}), {pages}. doi:{doi}"
            return reference
        else:
            return None
    except Exception:
        # Any lookup or parsing failure is reported to the caller as None
        return None
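The get_reference function takes a DOI argument, doi, creates a Crossref instance cr, and queries the metadata for that DOI. If the result contains a message field, the function extracts the authors, year, title, journal, volume, issue, and pages from it and returns them as an APA-style reference string. Missing fields fall back to placeholders such as 'n.d.' or 'No title'. If the result has no message field or any exception is raised, the function returns None.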
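Because get_reference mixes the network call with the string formatting, the formatting is hard to check without querying Crossref. The sketch below separates out a formatter that works on a message-shaped dict and leaves out absent fields instead of printing placeholders. format_reference is a hypothetical helper and the sample metadata is made up for illustration.

```python
def format_reference(message, doi):
    """Build an APA-like string from a Crossref-style `message` dict."""
    authors = []
    for author in message.get('author', []):
        family = author.get('family')
        given = author.get('given', '')
        if family:
            authors.append(f"{family}, {given[0]}." if given else family)
    author_str = ', '.join(authors) or 'Unknown author'

    # 'date-parts' is a list of [year, month, day] lists and may be missing or empty.
    date_parts = message.get('issued', {}).get('date-parts') or [[None]]
    first = date_parts[0]
    year = first[0] if first and first[0] else 'n.d.'

    title = (message.get('title') or ['No title'])[0]
    journal = (message.get('container-title') or [''])[0]

    # Assemble the source part only from fields that are present.
    source_bits = []
    if journal:
        source_bits.append(journal)
    volume, issue = message.get('volume'), message.get('issue')
    if volume:
        source_bits.append(f"{volume}({issue})" if issue else str(volume))
    if message.get('page'):
        source_bits.append(message['page'])

    parts = [f"{author_str} ({year}). {title}."]
    if source_bits:
        parts.append(', '.join(source_bits) + '.')
    parts.append(f"doi:{doi}")
    return ' '.join(parts)


# Made-up metadata in the shape of a Crossref `message` (assumption for illustration).
sample_message = {
    'author': [{'family': 'Smith', 'given': 'John'}, {'family': 'Lee', 'given': 'Ann'}],
    'issued': {'date-parts': [[2020, 5, 1]]},
    'title': ['A sample title'],
    'container-title': ['Journal of Examples'],
    'volume': '12',
    'page': '34-56',
}
print(format_reference(sample_message, '10.1000/xyz123'))
# Smith, J., Lee, A. (2020). A sample title. Journal of Examples, 12, 34-56. doi:10.1000/xyz123
```

With this split, get_reference would only need to fetch result['message'] and pass it on, and the formatter can be tested offline.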
4. Main processing function
def convert_dois_in_word(input_file, output_file):
    doc = Document(input_file)
    all_dois = []
    doi_original_index = {}
    index = 1

    # Collect every unique DOI in the document and number it by first appearance
    for paragraph in doc.paragraphs:
        dois = extract_dois(paragraph.text)
        for doi in dois:
            if doi not in all_dois:
                all_dois.append(doi)
                doi_original_index[doi] = index
                index += 1

    references = []
    successful_dois = []
    failed_dois = []

    # Look up the reference for each DOI
    for doi in all_dois:
        reference = get_reference(doi)
        if reference:
            references.append(reference)
            successful_dois.append(doi)
        else:
            failed_dois.append(doi)

    # Replace each resolved DOI with a superscript citation number
    for paragraph in doc.paragraphs:
        for doi in successful_dois:
            index = successful_dois.index(doi) + 1
            # Iterate over a snapshot, because new runs are added inside the loop
            for run in list(paragraph.runs):
                if doi in run.text:
                    before, after = run.text.split(doi, 1)
                    run.text = before
                    new_run = paragraph.add_run(f"[{index}]")
                    new_run.font.superscript = True
                    tail_run = paragraph.add_run(after)
                    # add_run appends at the end of the paragraph, so move both
                    # new runs to sit directly after the run that held the DOI
                    run._r.addnext(new_run._r)
                    new_run._r.addnext(tail_run._r)

    # Append the reference list at the end of the document
    doc.add_page_break()
    doc.add_heading('References', level=1)
    for i, reference in enumerate(references, start=1):
        doc.add_paragraph(f"[{i}] {reference}")

    doc.save(output_file)

    # Print the conversion results
    print("Successfully converted DOIs:")
    for doi in successful_dois:
        print(doi)

    print("\nDOIs that failed to convert:")
    for doi in failed_dois:
        original_index = doi_original_index[doi]
        print(f"{original_index}. {doi}")
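The convert_dois_in_word function takes the input and output file paths, input_file and output_file. It opens the input Word document, walks through the paragraphs to extract every DOI, and numbers each unique DOI in order of first appearance. It then tries to fetch the reference for each DOI and sorts the DOIs into successful and failed lookups. A second pass over the paragraphs replaces each successfully resolved DOI with a superscript citation number. The replacement works run by run, so a DOI that Word has split across several runs is not matched. Finally, the function appends a page break, a "References" heading, and the formatted reference list to the end of the document, saves it, and prints the DOIs that succeeded and those that failed.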
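One way to keep the replacement step testable is to separate the text splitting from the python-docx calls. The sketch below is a plain-Python helper under that assumption: it splits a run's text into ordinary segments and citation markers, and each pair could then be written out as its own run, with font.superscript set for the citation segments. split_run_text and its numbering argument are hypothetical names, not part of the original code.

```python
def split_run_text(text, numbering):
    """Split `text` into (segment, is_citation) pairs.

    `numbering` maps each DOI to its reference number (hypothetical input).
    """
    segments = [(text, False)]
    # Handle longer DOIs first so a DOI that is a prefix of another
    # does not match inside the longer one.
    for doi in sorted(numbering, key=len, reverse=True):
        updated = []
        for segment, is_citation in segments:
            if is_citation or doi not in segment:
                updated.append((segment, is_citation))
                continue
            pieces = segment.split(doi)
            for i, piece in enumerate(pieces):
                if piece:
                    updated.append((piece, False))
                # A marker goes between consecutive pieces, where the DOI was.
                if i < len(pieces) - 1:
                    updated.append((f"[{numbering[doi]}]", True))
        segments = updated
    return segments


numbering = {'10.1000/a': 1, '10.1000/ab': 2}
print(split_run_text("See 10.1000/a and 10.1000/ab.", numbering))
# [('See ', False), ('[1]', True), (' and ', False), ('[2]', True), ('.', False)]
```

Unlike a single split on the first match, this handles several DOIs, and repeated occurrences of the same DOI, within one run. It still does not cover a DOI that spans more than one run.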
5. Usage example
input_file = 'input.docx'
output_file = 'output.docx'
convert_dois_in_word(input_file, output_file)
This defines the input and output file paths and calls convert_dois_in_word to convert the DOIs in the Word document and generate the reference list.