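[Study Notes] Preprocessing and Inspecting the SQuAD Dataset with Python transformers

1. Data Overview

SQuAD official website
SQuAD (Stanford Question Answering Dataset) is a widely used machine reading comprehension dataset developed at Stanford University. It is used to train and evaluate question answering systems, and tests how well a model can answer questions about a passage of natural-language text.

Further reading: Introduction to the SQuAD Dataset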

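1.1 Evaluation Metrics

SQuAD is scored with two metrics: Exact Match (EM) and a token-overlap F1 score, which gives partial credit to answers that only partly match the reference.

1.1.1 Exact Match (EM)

Exact Match checks whether the model's answer is identical to the reference answer. In the official evaluation, both strings are normalized first (lowercased, with punctuation, articles, and extra whitespace removed), and a question with several reference answers takes the best score among them.
If the model's answer is identical to the reference answer, the EM score is 1; otherwise it is 0.
Formula:
EM = 1 if the answer matches the reference answer exactly;
EM = 0 if it does not.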
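
The EM rule can be sketched in a few lines of Python. This is a minimal illustration, not the official scorer: it assumes the usual SQuAD normalization (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace), and the helper names normalize_answer and exact_match are chosen here for illustration.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> int:
    """Return 1 if the normalized strings are identical, otherwise 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))


# Both strings normalize to "rome", so they count as an exact match.
print(exact_match("The Rome.", "rome"))
# "rome italy" differs from "rome", so there is no exact match.
print(exact_match("Rome, Italy", "Rome"))
```

Because of the normalization, differences in case, punctuation, or a leading article do not cost the model its EM point, while any extra or missing word does.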

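1.1.2 Partial Match (F1 Score)

The F1 score measures how similar the model's answer is to the reference answer by comparing the tokens the two share.
Here prediction is the answer produced by the model, and ground truth is the reference answer.
Computation:
First, split the model answer and the reference answer into lists of tokens (words, or single characters where that is the natural unit).
Then count the tokens the two lists share and the tokens in the reference:

count_pred: the number of tokens in the model answer that also appear in the reference answer (counted with multiplicity).
count_ref: the number of tokens in the reference answer, that is, len(ground truth).

Next, compute precision and recall, and combine them into the F1 score:
Precision: precision = count_pred / len(prediction)
Recall: recall = count_pred / len(ground truth)
F1 Score: F1 = 2 * (precision * recall) / (precision + recall)
If the two answers share no tokens, F1 is 0.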
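
The formulas above can be checked with a short sketch. It assumes plain lowercase whitespace tokenization, which is simpler than what an official scorer does, and the function name token_f1 is chosen here for illustration.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    # Assumption: lowercase whitespace tokenization, for illustration only.
    pred_tokens = prediction.lower().split()
    ref_tokens = ground_truth.lower().split()

    # Shared tokens, counted with multiplicity (count_pred in the text).
    count_pred = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if count_pred == 0:
        return 0.0

    precision = count_pred / len(pred_tokens)
    recall = count_pred / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction has one extra token: precision = 2/3, recall = 1.
print(token_f1("over 1,600 wins", "over 1,600"))
```

With a prediction of "over 1,600 wins" against the reference "over 1,600", two of the three predicted tokens are shared, so precision is 2/3, recall is 1, and F1 works out to 0.8, whereas EM would be 0.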

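1.1.2.1 What the F1 Score Means

The F1 score is a standard metric for binary classification that combines a model's precision and recall into a single number. It describes how well the model identifies positive examples: how many of its positive predictions are correct, and how many of the actual positives it finds. True negatives do not enter the formula.
In a binary classification problem we usually care about two things:
Precision: the number of examples correctly predicted as positive divided by the number of all examples predicted as positive. It measures how reliable the model's positive predictions are.
Recall: the number of examples correctly predicted as positive divided by the number of all actual positive examples. It measures how much of the actual positive class the model covers.
The F1 score is the harmonic mean of precision and recall, so it summarizes the model's performance when both matter at once. In SQuAD, the tokens of the predicted answer play the role of the positive predictions, and the tokens of the reference answer play the role of the actual positives.

The F1 score ranges from 0 to 1, where 1 is the best and 0 the worst. It is close to 1 only when precision and recall are both high; if either one is low, the harmonic mean pulls F1 down toward the lower value.

F1 is widely used in machine learning evaluation, especially on imbalanced datasets or whenever precision and recall need to be weighed together. Optimizing for F1 encourages a balance between making few false positive predictions and missing few actual positives.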
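
A small numeric sketch shows why the harmonic mean is used. The two precision/recall pairs below are made-up values chosen only for illustration; both have the same arithmetic mean, but very different F1 scores.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: both pairs have an arithmetic mean of 0.5.
balanced = f1_from(0.5, 0.5)
lopsided = f1_from(0.9, 0.1)
print(balanced, lopsided)
```

The balanced pair keeps an F1 of 0.5, while the lopsided pair drops to roughly 0.18, so a model cannot obtain a high F1 by maximizing only one of the two quantities.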

1.2 Data Structure

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

A single example:

{'id': '5733bed24776f41900661188','title': 'University_of_Notre_Dame','context': 'The university is the major seat of the Congregation of Holy Cross (albeit not its official headquarters, which are in Rome). Its main seminary, Moreau Seminary, is located on the campus across St. Joseph lake from the Main Building. Old College, the oldest building on campus and located near the shore of St. Mary lake, houses undergraduate seminarians. Retired priests and brothers reside in Fatima House (a former retreat center), Holy Cross House, as well as Columba Hall near the Grotto. The university through the Moreau Seminary has ties to theologian Frederick Buechner. While not Catholic, Buechner has praised writers from Notre Dame and Moreau Seminary created a Buechner Prize for Preaching.','question': 'Where is the headquarters of the Congregation of the Holy Cross?','answers': {'text': ['Rome'], 'answer_start': [119]}}
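
The answer_start field is a character offset into context, so the answer text can be recovered by slicing. The sketch below uses only the first sentence of the example above (enough to contain the answer); the helper name answer_span is chosen here for illustration.

```python
# First sentence of the example context shown above, plus its answer annotation.
example = {
    "context": "The university is the major seat of the Congregation of Holy Cross "
               "(albeit not its official headquarters, which are in Rome).",
    "question": "Where is the headquarters of the Congregation of the Holy Cross?",
    "answers": {"text": ["Rome"], "answer_start": [119]},
}


def answer_span(example: dict, index: int = 0) -> tuple:
    """Return (start_char, end_char) of the index-th answer inside the context."""
    start = example["answers"]["answer_start"][index]
    end = start + len(example["answers"]["text"][index])
    return start, end


start, end = answer_span(example)
print(start, end, example["context"][start:end])
```

The same start and end character positions are what the preprocessing later converts into token positions.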

2. Data Preprocessing

2.1 Displaying Random Examples

from datasets import load_dataset
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct random indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    print(dataset.features.items())
    for column, typ in dataset.features.items():
        print(column, typ)
        # Replace integer class ids with their label names where a column uses ClassLabel.
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    # display(HTML(df.to_html()))
    print(df)


# "squad_v2" if squad_v2 else "squad" is a conditional (ternary) expression:
# if squad_v2 is True it evaluates to "squad_v2" and SQuAD 2.0 is loaded;
# if squad_v2 is False it evaluates to "squad" and SQuAD 1.1 is loaded.
squad_v2 = False
datasets = load_dataset("squad_v2" if squad_v2 else "squad")
show_random_elements(datasets["train"])

Output:

dict_items([('id', Value(dtype='string', id=None)), ('title', Value(dtype='string', id=None)), ('context', Value(dtype='string', id=None)), ('question', Value(dtype='string', id=None)), ('answers', Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None))])
id Value(dtype='string', id=None)
title Value(dtype='string', id=None)
context Value(dtype='string', id=None)
question Value(dtype='string', id=None)
answers Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)
                         id  ...                                            answers
0  572812cd2ca10214002d9d4a  ...  {'text': ['group homomorphisms'], 'answer_star...
1  56faebc18f12f319006302d3  ...  {'text': ['Nadifa Mohamed'], 'answer_start': [...
2  5726e490dd62a815002e9420  ...  {'text': ['Rabobank, a large bank, has its hea...

[3 rows x 5 columns]
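
The retry loop that show_random_elements uses to avoid duplicate picks can also be written with random.sample from the standard library, which draws distinct items directly. This is an alternative sketch, not the code used above; the function name pick_unique_indices is chosen here for illustration.

```python
import random


def pick_unique_indices(dataset_size: int, num_examples: int = 3) -> list:
    """Draw num_examples distinct row indices from range(dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    return random.sample(range(dataset_size), num_examples)


# 87599 is the size of the SQuAD 1.1 training split shown earlier.
picks = pick_unique_indices(87599, 3)
print(picks)
```

The resulting list can be passed to the dataset in the same way as picks in the original function.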

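The code above imports ClassLabel and Sequence with from datasets import ClassLabel, Sequence.
In the Hugging Face datasets library, ClassLabel and Sequence are feature types: classes that describe the features (columns) of a dataset and how to handle them.

2.1.1 What ClassLabel Does

  • ClassLabel represents a feature with a fixed set of discrete classes, typically the label column of a classification task.
  • It provides a convenient way to convert between label indices and label names, and to inspect the set of labels.
  • A ClassLabel instance can be used as a feature type in a dataset to declare the set of class labels that feature can take.

from datasets import ClassLabel

# Define a ClassLabel feature
label_feature = ClassLabel(names=["cat", "dog", "bird"])

# Number of class labels
num_labels = label_feature.num_classes
print(num_labels)

# Convert an index to a label name
label_index = 1
label_name = label_feature.names[label_index]
print(label_name)

# Convert a label name to an index
label_name = "bird"
label_index = label_feature.str2int(label_name)
print(label_index)
print(label_feature)

Output: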

3
dog
2
ClassLabel(names=['cat', 'dog', 'bird'], id=None)

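2.1.2 What Sequence Does

  • Sequence is a container feature type in the Hugging Face datasets library. It represents list-valued features, such as the token sequence of a text or a time series.
  • It records the type of the items in the list (feature) and an optional fixed length (length, where -1 means variable length).

Code:

from datasets import Sequence, Value

# Define a feature that holds a sequence of float32 values
sequence_feature = Sequence(Value("float32"))

# A plain Python list used below
example_sequence = [1, 2, 3]

print(sequence_feature)
print(type(sequence_feature))
print(sequence_feature.length)
print(sequence_feature.feature)

# Overwrite the feature attribute with the list. This is only an experiment to show
# that Sequence is a simple container; in normal use, feature holds a feature type
# such as Value, not the data itself.
sequence_feature.feature = example_sequence
print(type(sequence_feature.feature[0]))
print(sequence_feature)

Output: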

Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)
<class 'datasets.features.features.Sequence'>
-1
Value(dtype='float32', id=None)
<class 'int'>
Sequence(feature=[1, 2, 3], length=-1, id=None)

2.2 Tokenizer Processing

Code:

from datasets import load_dataset
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from transformers import AutoTokenizer
import transformers

# "squad_v2" if squad_v2 else "squad" is a conditional (ternary) expression:
# if squad_v2 is True it evaluates to "squad_v2" and SQuAD 2.0 is loaded;
# if squad_v2 is False it evaluates to "squad" and SQuAD 1.1 is loaded.
squad_v2 = False
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Make sure we are using a fast tokenizer (Rust-backed: faster, and it provides the
# offset mapping and sequence_ids used below).
# isinstance() is a built-in function that checks whether an object is an instance of a class.
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
print(tokenizer("What is your name?", "My name is Sylvain."))
print(tokenizer("What is your name?"))
print(tokenizer("My name is Sylvain."))

# A problem specific to question answering preprocessing is how to handle very long documents.
# In other tasks we usually truncate a document that exceeds the model's maximum length,
# but here dropping part of the context could remove the answer we are looking for.
# To handle this, one (long) example is allowed to produce several input features,
# each no longer than the model's maximum length (or the hyperparameter we set).

# The maximum length of a feature (question and context)
max_length = 384
print("--------------------Find the first training example longer than 384 tokens, then stop----------------------")
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]
print(example)
print(tokenizer(example["question"], example["context"]))

print("--------------------Truncate the context and discard the overflow------------------------")
tokenized_example0 = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
)
print(tokenized_example0)
print(tokenized_example0["input_ids"])

print("--------------------Truncate the context, keep the question and the overflow------------------------")
# truncation="only_second" truncates only the second sequence (the context) and keeps the question whole.
# return_overflowing_tokens=True together with stride keeps the truncated part as additional features.

# The authorized overlap between two parts of the context when splitting is needed.
doc_stride = 120
tokenized_example1 = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
)
print([len(x) for x in tokenized_example1["input_ids"]])
for x in tokenized_example1["input_ids"][:2]:
    print(tokenizer.decode(x))

# Use the offset mapping to map tokens back to the original text.
# With return_offsets_mapping=True, every token in every feature comes with a
# (start_char, end_char) pair giving its position in the original text.
# As shown below, the first token ([CLS]) has offsets (0, 0) because it does not
# correspond to any part of the question or context; the second token corresponds
# to characters 0 to 3 of the question.
tokenized_example2 = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
print(len(tokenized_example2["offset_mapping"][0]))
print(tokenized_example2["offset_mapping"][0][:20])
print([len(x) for x in tokenized_example2["offset_mapping"]])

# This mapping can be used to find the start and end token positions of the answer in a feature.
# We only need to tell which offsets belong to the question and which belong to the context.
print(example["question"])
first_token_id = tokenized_example2["input_ids"][0][1]
offsets = tokenized_example2["offset_mapping"][0][1]
print(first_token_id)
print(offsets)
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

second_token_id = tokenized_example2["input_ids"][0][2]
offsets = tokenized_example2["offset_mapping"][0][2]
print(tokenizer.convert_ids_to_tokens([second_token_id])[0], example["question"][offsets[0]:offsets[1]])

# The sequence_ids method of the tokenized example tells us where each token comes from:
# it returns None for special tokens,
# and the index of the input sequence (starting from 0) for ordinary tokens.
# With it, the start and end tokens of the answer in a feature are easy to locate.
sequence_ids = tokenized_example2.sequence_ids()
print(sequence_ids)

print("-------------------answers----------------------")
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])
print(answers)
print(end_char)

# Start token index of the current span in the text.
# The loop stops at the first position whose sequence id is 1,
# which is the first context token of this feature.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1
print("token_start_index : ", token_start_index)

# End token index of the current span in the text.
token_end_index = len(tokenized_example2["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
print("token_end_index : ", token_end_index)

# Detect whether the answer is out of the span (in which case the feature is labeled with the CLS index).
offsets = tokenized_example2["offset_mapping"][0]
# Use the character-level answer boundaries (start_char, end_char) to adjust the
# token-level indices (token_start_index, token_end_index).
# The answer lies inside this feature if the first context token starts at or before
# start_char and the last context token ends at or after end_char.
if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
    # Move token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go past the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
    # Decode the answer from the context using the positions found through the offset mapping.
    # This stays inside the if branch because start_position and end_position only exist here.
    print(tokenizer.decode(tokenized_example2["input_ids"][0][start_position: end_position + 1]))
else:
    print("The answer is not in this feature.")
# Print the reference answer from the dataset (answers["text"])
print(answers["text"][0])
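
The effect of max_length and stride on a long context can be modeled in plain Python. The sketch below is a simplified model of the windowing, not the tokenizer's actual implementation: the function name split_with_stride is chosen here for illustration, and the count of 379 context tokens is an assumption inferred from the feature lengths [384, 149] printed by the script.

```python
def split_with_stride(context_tokens: list, capacity: int, stride: int) -> list:
    """Split tokens into windows of at most `capacity` tokens,
    where consecutive windows share `stride` tokens."""
    if stride >= capacity:
        raise ValueError("stride must be smaller than capacity")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + capacity])
        if start + capacity >= len(context_tokens):
            break
        start += capacity - stride
    return windows


max_length = 384
question_len = 14      # question tokens in the example (sequence id 0)
special_tokens = 3     # [CLS], [SEP] after the question, [SEP] at the end
capacity = max_length - question_len - special_tokens  # room left for context tokens

# Assumption: the example context has 379 tokens (inferred from the printed lengths).
context_tokens = list(range(379))
windows = split_with_stride(context_tokens, capacity, stride=120)
feature_lengths = [len(w) + question_len + special_tokens for w in windows]
print(feature_lengths)
```

Under these assumptions the second window starts 120 tokens before the end of the first one, so an answer that straddles the cut still appears whole in at least one feature.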

Output:

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
--------------------Find the first training example longer than 384 tokens, then stop----------------------
{'id': '5733caf74776f4190066124c', 'title': 'University_of_Notre_Dame', 'context': "The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at Notre Dame, has achieved a 332-165 record. In 2009 they were invited to the NIT, where they advanced to the semifinals but were beaten by Penn State who went on and beat Baylor in the championship. The 2010–11 team concluded its regular season ranked number seven in the country, with a record of 25–5, Brey's fifth straight 20-win season, and a second-place finish in the Big East. During the 2014-15 season, the team went 32-6 and won the ACC conference tournament, later advancing to the Elite 8, where the Fighting Irish lost on a missed buzzer-beater against then undefeated Kentucky. Led by NBA draft picks Jerian Grant and Pat Connaughton, the Fighting Irish beat the eventual national champion Duke Blue Devils twice during the season. The 32 wins were the most by the Fighting Irish team since 1908-09.", 'question': "How many wins does the Notre Dame men's basketball team have?", 'answers': {'text': ['over 1,600'], 'answer_start': [30]}}
{'input_ids': [101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 
1022, 1010, 2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 1996, 2087, 2011, 1996, 3554, 3493, 2136, 2144, 5316, 1011, 5641, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
--------------------Truncate the context and discard the overflow------------------------
{'input_ids': [101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 
1022, 1010, 2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[101, 2129, 2116, 5222, 2515, 1996, 10289, 8214, 2273, 1005, 1055, 3455, 2136, 2031, 1029, 102, 1996, 2273, 1005, 1055, 3455, 2136, 2038, 2058, 1015, 1010, 5174, 5222, 1010, 2028, 1997, 2069, 2260, 2816, 2040, 2031, 2584, 2008, 2928, 1010, 1998, 2031, 2596, 1999, 2654, 5803, 8504, 1012, 2280, 2447, 5899, 12385, 4324, 1996, 2501, 2005, 2087, 2685, 3195, 1999, 1037, 2309, 2208, 1997, 1996, 2977, 2007, 6079, 1012, 2348, 1996, 2136, 2038, 2196, 2180, 1996, 5803, 2977, 1010, 2027, 2020, 2315, 2011, 1996, 16254, 2015, 5188, 3192, 2004, 2120, 3966, 3807, 1012, 1996, 2136, 2038, 23339, 1037, 2193, 1997, 6314, 2015, 1997, 2193, 2028, 4396, 2780, 1010, 1996, 2087, 3862, 1997, 2029, 2001, 4566, 12389, 1005, 1055, 2501, 6070, 1011, 2208, 3045, 9039, 1999, 3326, 1012, 1996, 2136, 2038, 7854, 2019, 3176, 2809, 2193, 1011, 2028, 2780, 1010, 1998, 2216, 3157, 5222, 4635, 2117, 1010, 2000, 12389, 1005, 1055, 2184, 1010, 2035, 1011, 2051, 1999, 5222, 2114, 1996, 2327, 2136, 1012, 1996, 2136, 3248, 1999, 4397, 10601, 26429, 10531, 1006, 2306, 1996, 9493, 1052, 1012, 11830, 2415, 1007, 1010, 2029, 11882, 2005, 1996, 2927, 1997, 1996, 2268, 1516, 2230, 2161, 1012, 1996, 2136, 2003, 8868, 2011, 3505, 7987, 3240, 1010, 2040, 1010, 2004, 1997, 1996, 2297, 1516, 2321, 2161, 1010, 2010, 16249, 2012, 10289, 8214, 1010, 2038, 4719, 1037, 29327, 1011, 13913, 2501, 1012, 1999, 2268, 2027, 2020, 4778, 2000, 1996, 9152, 2102, 1010, 2073, 2027, 3935, 2000, 1996, 8565, 2021, 2020, 7854, 2011, 9502, 2110, 2040, 2253, 2006, 1998, 3786, 23950, 1999, 1996, 2528, 1012, 1996, 2230, 1516, 2340, 2136, 5531, 2049, 3180, 2161, 4396, 2193, 2698, 1999, 1996, 2406, 1010, 2007, 1037, 2501, 1997, 2423, 1516, 1019, 1010, 7987, 3240, 1005, 1055, 3587, 3442, 2322, 1011, 2663, 2161, 1010, 1998, 1037, 2117, 1011, 2173, 3926, 1999, 1996, 2502, 2264, 1012, 2076, 1996, 2297, 1011, 2321, 2161, 1010, 1996, 2136, 2253, 3590, 1011, 1020, 1998, 2180, 1996, 16222, 3034, 2977, 1010, 2101, 10787, 2000, 1996, 7069, 1022, 1010, 
2073, 1996, 3554, 3493, 2439, 2006, 1037, 4771, 12610, 2121, 1011, 3786, 2121, 2114, 2059, 15188, 5612, 1012, 2419, 2011, 6452, 4433, 11214, 15333, 6862, 3946, 1998, 6986, 9530, 2532, 18533, 2239, 1010, 1996, 3554, 3493, 3786, 1996, 9523, 2120, 3410, 3804, 2630, 13664, 3807, 2076, 1996, 2161, 1012, 1996, 3590, 5222, 2020, 102]
--------------------Truncate the context, keep the question and the overflow------------------------
[384, 149]
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP]
[CLS] how many wins does the notre dame men's basketball team have? [SEP] its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]
384
[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9)]
[384, 149]
How many wins does the Notre Dame men's basketball team have?
2129
(0, 3)
how How
many many
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]
-------------------answers----------------------
{'text': ['over 1,600'], 'answer_start': [30]}
40
token_start_index :  16
token_end_index :  382
23 26
over 1, 600
over 1,600
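
The search for the answer's token span can be replayed without loading a tokenizer. The sketch below uses a shortened feature: the question offsets are the ones printed above, while the context offsets after the first four are derived by hand from the context string and should be read as an assumption. The function name find_answer_tokens is chosen here for illustration.

```python
def find_answer_tokens(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) of the answer, or None if it is not in this feature."""
    # First and last tokens that belong to the context (sequence id 1).
    token_start = 0
    while sequence_ids[token_start] != 1:
        token_start += 1
    token_end = len(sequence_ids) - 1
    while sequence_ids[token_end] != 1:
        token_end -= 1

    # The answer must lie fully inside the context part of this feature.
    if not (offsets[token_start][0] <= start_char and offsets[token_end][1] >= end_char):
        return None

    # Walk inward until the tokens bracket the answer characters.
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    while offsets[token_end][1] >= end_char:
        token_end -= 1
    return token_start - 1, token_end + 1


# [CLS] + 14 question tokens + [SEP]: offsets as printed in the output above.
question_offsets = [(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33),
                    (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0)]
# Context tokens "the men ' s basketball team has over 1 , 600 wins" + [SEP].
# Assumption: offsets after the first four are derived by hand from the context string.
context_offsets = [(0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29),
                   (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (0, 0)]
offsets = question_offsets + context_offsets
sequence_ids = [None] + [0] * 14 + [None] + [1] * 12 + [None]

# answer_start = 30 and len("over 1,600") = 10, so the answer spans characters 30..40.
span = find_answer_tokens(offsets, sequence_ids, start_char=30, end_char=40)
print(span)
```

The result matches the positions 23 and 26 printed by the full script, which cover the tokens "over", "1", ",", and "600". Returning None for the out-of-span case leaves it to the caller to label such a feature with the [CLS] index.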
