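- NLP and part-of-speech tagging
In natural language processing, understanding a text usually starts with basic grammatical analysis, including part-of-speech (POS) tagging. This step identifies the grammatical category of each word in the text, such as noun or verb. The code here uses spaCy, an open-source NLP library whose pretrained language models perform a range of linguistic analysis tasks, POS tagging among them. By identifying the main verb in the task text, we capture the action the sentence asks for, which is essential to understanding what the task is.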
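The verb filter that this step relies on can be sketched without loading a model. The snippet below is a minimal illustration, assuming tokens arrive as (text, tag) pairs carrying Penn Treebank tags, which is the tag set spaCy's English models expose through `token.tag_`. The helper name `first_verb` and the hand-tagged sentence are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Penn Treebank verb tags all start with "VB": VB, VBD, VBG, VBN, VBP, VBZ.
VERB_PREFIX = "VB"


def first_verb(tagged: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the text of the first verb-tagged token, or None if there is none."""
    for text, tag in tagged:
        if tag.startswith(VERB_PREFIX):
            return text
    return None


# Hand-written tags for illustration; a real tagger may label tokens differently.
sentence = [("Please", "UH"), ("take", "VB"), ("the", "DT"),
            ("cup", "NN"), ("for", "IN"), ("me", "PRP")]

print(first_verb(sentence))
print(first_verb([("the", "DT"), ("cup", "NN")]))
```

Taking the first verb is a heuristic: in a sentence with several verbs it may not be the one that carries the instruction.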
- The BERT deep learning model
- Semantic encoding and model inference
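The encoding step in the script below can be pictured as wrapping the word-piece IDs in special tokens and padding them to a fixed length. The sketch imitates that in plain Python; it is not the tokenizer's implementation. The helper `pad_to_length` is hypothetical, and the IDs are the ones that appear in the `task_tokens` output of the script: 101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`, and 2202 for "take".

```python
from typing import List

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # values seen in the script's printed output


def pad_to_length(piece_ids: List[int], max_length: int) -> List[int]:
    """Add [CLS]/[SEP], truncate to max_length, then right-pad with [PAD]."""
    body = piece_ids[: max_length - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


task_tokens = pad_to_length([2202], max_length=32)
print(task_tokens[:5], len(task_tokens))
```

A fixed length lets inputs of different sizes be stacked into one tensor, at the cost of carrying padding positions through the model.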
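from transformers import BertTokenizer, BertForSequenceClassification
import torch
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load the pretrained bert-base-uncased model and tokenizer.
# Note: this checkpoint has no fine-tuned classification head, so the head is
# randomly initialized and its logits are not meaningful until it is trained.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Task text
task_text = "Please take the cup for me"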
# Known attributes of the object
attribute_text = "graspable, containable, red, cylinder shape"

# Process the task text
doc = nlp(task_text)

# Extract the main verb (the first token with a VB* tag)
main_verb = None
for token in doc:
    if token.tag_.startswith("VB"):
        main_verb = token.text
        break

# Print the main verb, or stop if there is none (encoding None would fail)
if main_verb is None:
    raise SystemExit("No main verb found in the sentence.")
print(f"The main verb is: {main_verb}")

# Tokenize and encode the task text (here, only the main verb)
task_tokens = tokenizer.encode(main_verb, add_special_tokens=True, max_length=32, truncation=True, padding='max_length')
print('task_tokens: ', task_tokens)
task_input_ids = torch.tensor(task_tokens).unsqueeze(0)  # add a batch dimension
print('task_input_ids: ', task_input_ids)

# Split the attribute string; each attribute is tokenized and encoded in the loop
attributes = attribute_text.split(", ")
is_related = []

# Threshold for deciding whether an attribute is related to the task
threshold = 0.5

# Iterate over the attributes and judge relevance.
# Note: task_input_ids is never passed to the model, so each score depends on the attribute alone.
for attribute in attributes:
    attribute_tokens = tokenizer.encode(attribute, add_special_tokens=True, max_length=32, truncation=True, padding='max_length')
    attribute_input_ids = torch.tensor(attribute_tokens).unsqueeze(0)

    # Inference
    with torch.no_grad():
        attribute_output = model(input_ids=attribute_input_ids)
    print('attribute_output: ', attribute_output)

    # Get the model's classification scores (raw logits, not probabilities)
    attribute_scores = attribute_output.logits
    print(attribute_scores)

    # Check the score for label 1 (related)
    print(attribute_scores[0][1])
    is_related_attribute = attribute_scores[0][1] > threshold
    is_related.append((attribute, is_related_attribute))
print(is_related)
# Print the results
for attribute, related in is_related:
    if related:
        print(f"The task text is related to attribute '{attribute}'")
    else:
        print(f"The task text is not related to attribute '{attribute}'")

Output:

The main verb is: take
task_tokens: [101, 2202, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
task_input_ids: tensor([[ 101, 2202, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attribute_output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2044, -0.1437]]), hidden_states=None, attentions=None)
tensor([[ 0.2044, -0.1437]])
tensor(-0.1437)
attribute_output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2036, -0.1523]]), hidden_states=None, attentions=None)
tensor([[ 0.2036, -0.1523]])
tensor(-0.1523)
attribute_output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2480, -0.1076]]), hidden_states=None, attentions=None)
tensor([[ 0.2480, -0.1076]])
tensor(-0.1076)
attribute_output: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2650, -0.1961]]), hidden_states=None, attentions=None)
tensor([[ 0.2650, -0.1961]])
tensor(-0.1961)
[('graspable', tensor(False)), ('containable', tensor(False)), ('red', tensor(False)), ('cylinder shape', tensor(False))]
The task text is not related to attribute 'graspable'
The task text is not related to attribute 'containable'
The task text is not related to attribute 'red'
The task text is not related to attribute 'cylinder shape'
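A threshold of 0.5 is easier to interpret on a probability than on a raw logit. As a small worked example, the snippet below applies a softmax to the four logit pairs printed above. The helper `softmax` is written out by hand for illustration and is not part of the original script.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1 (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logit pairs copied from the output of the script.
logit_pairs = {
    "graspable": [0.2044, -0.1437],
    "containable": [0.2036, -0.1523],
    "red": [0.2480, -0.1076],
    "cylinder shape": [0.2650, -0.1961],
}

for attribute, logits in logit_pairs.items():
    p_related = softmax(logits)[1]  # probability assigned to label 1
    print(f"{attribute}: p(label 1) = {p_related:.3f}, related = {p_related > 0.5}")
```

Every probability here falls below 0.5, so the verdicts match the raw-logit comparison. Because the classification head loaded from `bert-base-uncased` has not been fine-tuned, neither number reflects real relevance until the model is trained on labeled task and attribute pairs.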