BERT Application: Analyzing the Relatedness Between Texts

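This article combines natural language processing (NLP) with deep learning to identify the verb in a given task text and then examine how that verb relates to a list of attributes. The technical path consists of part-of-speech tagging, semantic encoding, and model inference.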

I. Technical Approach

NLP and Part-of-Speech Tagging

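In natural language processing, understanding a text usually starts with basic grammatical analysis, including part-of-speech (POS) tagging. This step identifies the grammatical category of each word in the text, such as noun or verb. The code here uses spaCy, an open-source NLP library that ships pretrained language models for POS tagging and many other linguistic analysis tasks. Identifying the main verb of the task text captures the action the sentence asks for, which is essential for understanding what the task is.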
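The verb-selection rule can be illustrated without loading a language model. The sketch below assumes the sentence has already been tagged with Penn Treebank tags (the tag set spaCy exposes as `token.tag_` for English); the tags are written by hand for illustration, and `first_verb` is a hypothetical helper, not part of spaCy.

```python
from typing import Iterable, Optional, Tuple


def first_verb(tagged: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the first token whose Penn Treebank tag starts with "VB"."""
    for text, tag in tagged:
        # VB, VBD, VBG, VBN, VBP and VBZ all mark verb forms.
        if tag.startswith("VB"):
            return text
    return None


# Hand-written tags for the task sentence (an assumption, not spaCy output).
tagged_sentence = [
    ("Please", "UH"), ("take", "VB"), ("the", "DT"),
    ("cup", "NN"), ("for", "IN"), ("me", "PRP"),
]

main_verb = first_verb(tagged_sentence)
print(main_verb)
```

Picking the first verb is a simple heuristic. For sentences with several verbs, the root of the dependency parse may be a better choice.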

The BERT Deep Learning Model

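The BERT model used in the code is a pretrained Transformer model that can handle a wide range of language-understanding tasks. Its core strength is its bidirectional encoder representations, which capture how a word relates to the context on both sides of it. With further training or fine-tuning, BERT can be adapted to a specific text-classification task, such as judging whether an attribute is related to a given verb.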

Semantic Encoding and Model Inference

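Text encoding converts text into a format the model can process, which usually involves tokenization and numerical representation. For BERT, a dedicated tokenizer splits the text into a sequence of tokens and then maps each token to an index in the vocabulary that was used when the model was trained. The model operates on this numerical form of the original text.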
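The encoding steps can be sketched with a toy vocabulary. This is not the real BERT tokenizer: it splits on whitespace only, whereas WordPiece also breaks words into subword pieces. The special-token ids and the id for "take" match those visible in the output section of this article; `TOY_VOCAB` and `encode` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Tiny stand-in vocabulary (an assumption; the real one has about 30,000 entries).
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "take": 2202,
}


def encode(text: str, max_length: int = 32) -> List[int]:
    """Lowercase, split on whitespace, add special tokens, map to ids, pad."""
    # Leave room for [CLS] and [SEP] when truncating.
    words = text.lower().split()[: max_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Pad on the right up to max_length.
    return ids + [TOY_VOCAB["[PAD]"]] * (max_length - len(ids))


ids = encode("take")
print(ids[:5], len(ids))
```

Words missing from the toy vocabulary map to `[UNK]`; the real tokenizer would instead try to decompose them into known subword pieces first.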

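To judge relatedness, the code encodes each attribute independently, runs it through the BERT model, and reads off a score for each attribute. A threshold then decides which attributes count as highly related to the verb. This step is meant to rely on the language features the model has learned to judge how relevant and important each attribute is for completing the task. Two caveats apply to the code as written: because each attribute is encoded on its own, the verb is not part of the model input, and the classification head that `BertForSequenceClassification` adds on top of `bert-base-uncased` is untrained, so the scores only become meaningful after fine-tuning on labeled verb-attribute examples.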
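The thresholding step can be sketched in plain Python. The values the classifier returns are raw logits rather than probabilities, so this sketch assumes a softmax is applied before comparing against a threshold such as 0.5. That is a common convention, not necessarily what the code in this article does. The sample logits are taken from the output section; `softmax` and `is_related` are hypothetical helpers.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_related(logits: List[float], threshold: float = 0.5) -> bool:
    """Treat index 1 as the "related" class and compare its probability."""
    return softmax(logits)[1] > threshold


sample_logits = [0.2044, -0.1437]  # logits reported for the first attribute
probs = softmax(sample_logits)
print(probs, is_related(sample_logits))
```

With two classes, a 0.5 threshold on the softmax probability is equivalent to asking whether the class-1 logit exceeds the class-0 logit.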

II. Code Implementation

from transformers import BertTokenizer, BertForSequenceClassification
import torch
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load the pretrained bert-base-uncased model and tokenizer.
# The classification head is newly initialized (not fine-tuned),
# so the scores below will differ from run to run.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Task text
task_text = "Please take the cup for me"
# Known attributes of the object
attribute_text = "graspable, containable, red, cylinder shape"

# Process the task text
doc = nlp(task_text)

# Extract the main verb (the first token whose tag contains "VB")
main_verb = None
for token in doc:
    if "VB" in token.tag_:
        main_verb = token.text
        break

# Print the main verb; stop here if none was found, since it cannot be encoded
if main_verb:
    print(f"The main verb is: {main_verb}")
else:
    raise SystemExit("No main verb found in the sentence.")

# Tokenize and encode the main verb
task_tokens = tokenizer.encode(main_verb, add_special_tokens=True, max_length=32, truncation=True, padding='max_length')
print('task_tokens: ', task_tokens)
task_input_ids = torch.tensor(task_tokens).unsqueeze(0)  # add a batch dimension
print('task_input_ids: ', task_input_ids)
# Note: task_input_ids is only printed; it is not passed to the model below.

# Tokenize and encode each attribute
attributes = attribute_text.split(", ")
is_related = []

# Threshold for deciding whether an attribute is related to the task
# (compared against the raw class-1 logit)
threshold = 0.5

# Loop over the attributes and judge relatedness
for attribute in attributes:
    attribute_tokens = tokenizer.encode(attribute, add_special_tokens=True, max_length=32, truncation=True, padding='max_length')
    attribute_input_ids = torch.tensor(attribute_tokens).unsqueeze(0)

    # Inference
    with torch.no_grad():
        attribute_output = model(input_ids=attribute_input_ids)
    print('attribute_output: ', attribute_output)

    # Get the model's classification scores (logits)
    attribute_scores = attribute_output.logits
    print(attribute_scores)

    # Check the score of class 1 (related)
    print(attribute_scores[0][1])
    is_related_attribute = attribute_scores[0][1] > threshold
    is_related.append((attribute, is_related_attribute))

print(is_related)

# Print the results
for attribute, related in is_related:
    if related:
        print(f"The task text is related to attribute '{attribute}'")
    else:
        print(f"The task text is not related to attribute '{attribute}'")

Output

The main verb is: take
task_tokens:  [101, 2202, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
task_input_ids:  tensor([[ 101, 2202, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attribute_output:  SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2044, -0.1437]]), hidden_states=None, attentions=None)
tensor([[ 0.2044, -0.1437]])
tensor(-0.1437)
attribute_output:  SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2036, -0.1523]]), hidden_states=None, attentions=None)
tensor([[ 0.2036, -0.1523]])
tensor(-0.1523)
attribute_output:  SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2480, -0.1076]]), hidden_states=None, attentions=None)
tensor([[ 0.2480, -0.1076]])
tensor(-0.1076)
attribute_output:  SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2650, -0.1961]]), hidden_states=None, attentions=None)
tensor([[ 0.2650, -0.1961]])
tensor(-0.1961)
[('graspable', tensor(False)), ('containable', tensor(False)), ('red', tensor(False)), ('cylinder shape', tensor(False))]
The task text is not related to attribute 'graspable'
The task text is not related to attribute 'containable'
The task text is not related to attribute 'red'
The task text is not related to attribute 'cylinder shape'
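All four attributes receive nearly identical scores, which is what one would expect when the verb never enters the model input. One possible way to let the model see both texts, sketched here as an assumption rather than as the author's method, is BERT's sentence-pair layout: `[CLS] verb [SEP] attribute [SEP]`, with segment ids marking which tokens belong to which text. The id for "take" comes from the output above; the attribute ids are placeholders, not real vocabulary lookups, and `build_pair_input` is a hypothetical helper.

```python
from typing import List, Tuple


def build_pair_input(
    first_ids: List[int],
    second_ids: List[int],
    cls_id: int = 101,
    sep_id: int = 102,
) -> Tuple[List[int], List[int]]:
    """Lay out two token-id sequences as [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + first_ids + [sep_id] + second_ids + [sep_id]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return input_ids, token_type_ids


verb_ids = [2202]             # "take", as printed in the output above
attribute_ids = [7000, 7001]  # placeholder ids for a two-token attribute (hypothetical)

input_ids, token_type_ids = build_pair_input(verb_ids, attribute_ids)
print(input_ids)
print(token_type_ids)
```

A classifier fed this kind of input would still need fine-tuning on labeled verb-attribute pairs before its scores could be trusted.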