data_collator in transformers

Preface

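The usual workflow is to load a dataset with huggingface's Dataset and then encode the text with a tokenizer. At that point the features are still plain Python lists rather than tensors, and they have to be converted into the tensor type that the deep learning framework expects. This is the job of the data_collator: it takes a list of features and turns it into a batch of tensors.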

This post covers two commonly used data collators in huggingface transformers: default_data_collator and DataCollatorWithPadding. BertTokenizer serves as the base tokenizer throughout, set up as follows:

from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")


def func(exam):
    # Tokenize the "text" field of a single example.
    return tokenizer(exam["text"])

default_data_collator

With PyTorch, default_data_collator simply dispatches to torch_default_data_collator. Note that the input must be a List[Any] and the output is a Dict[str, Any].

def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all `features` in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)

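The source of torch_default_data_collator is shown below. It assumes that all features share the same attributes, so it inspects only the first example when deciding what to do. It also gives special treatment to the label and label_ids attributes, which correspond to single-label and multi-label classification respectively, and stores either one under the key "labels", since the forward method of most pretrained models names this keyword argument labels.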

def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])

    return batch
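The label branch above can be tried out without torch. The following is a minimal sketch that only mirrors the decision logic: collate_labels is a hypothetical helper, not part of transformers, and it returns the dtype as a plain string ("long" or "float") instead of a torch dtype.

```python
from typing import Any, Dict, List, Optional, Tuple


def collate_labels(features: List[Dict[str, Any]]) -> Optional[Tuple[List[Any], str]]:
    """Hypothetical stand-in for the label branch: (collected labels, dtype name) or None."""
    first = features[0]  # the first example stands in for the whole batch
    if first.get("label") is not None:
        # One value per example: int -> long, anything else -> float.
        dtype = "long" if isinstance(first["label"], int) else "float"
        return [f["label"] for f in features], dtype
    if first.get("label_ids") is not None:
        # A list of values per example: the first element decides the dtype.
        dtype = "long" if type(first["label_ids"][0]) is int else "float"
        return [f["label_ids"] for f in features], dtype
    return None


single = collate_labels([{"label": 1}, {"label": 0}])
multi = collate_labels([{"label_ids": [0.0, 1.0]}, {"label_ids": [1.0, 1.0]}])
print(single)  # ([1, 0], 'long')
print(multi)   # ([[0.0, 1.0], [1.0, 1.0]], 'float')
```

In both cases the real collator stores the resulting tensor under the single key "labels".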

Example:

x = [{"text": "我爱中国。", "label": 1}, {"text": "我爱中国。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
dataset = default_data_collator(features)

DataCollatorWithPadding

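Note that DataCollatorWithPadding is a class: it must be instantiated first, and the instance then converts features into a batch. Unlike default_data_collator, DataCollatorWithPadding pads the features it receives, so that every sequence in the batch ends up with the same length. Its source is as follows: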

@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
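To make the effect of padding=True concrete, here is a minimal pure-Python sketch of padding to the longest sequence in a batch. It is a simplified stand-in, not the library implementation: pad_to_longest is a hypothetical helper, the token ids are made up, and right-side padding with pad id 0 is an assumption about the tokenizer.

```python
from typing import Any, Dict, List


def pad_to_longest(features: List[Dict[str, Any]], pad_token_id: int = 0) -> Dict[str, List[Any]]:
    """Hypothetical helper: right-pad every example to the longest input_ids in the batch."""
    target = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[Any]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = list(f["input_ids"])
        missing = target - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * missing)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(ids) + [0] * missing)
    # Same renaming as in __call__: "label" / "label_ids" end up under "labels".
    for key in ("label", "label_ids"):
        if key in features[0]:
            batch["labels"] = [f[key] for f in features]
    return batch


batch = pad_to_longest([
    {"input_ids": [101, 2769, 102], "label": 1},                # made-up token ids
    {"input_ids": [101, 2769, 4263, 704, 102], "label": 0},
])
print(batch["input_ids"])       # [[101, 2769, 102, 0, 0], [101, 2769, 4263, 704, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
print(batch["labels"])          # [1, 0]
```

Because the target length is computed per batch, different batches can end up with different sequence lengths, which is what "dynamically pad" refers to in the docstring.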

When instantiating it, note that pad_to_multiple_of rounds max_length up to an integer multiple of the given value. For example, with max_length=510 and pad_to_multiple_of=8, max_length becomes 512. See the source of transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad:

    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        ...
        ...
        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
        ...
        ...
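The rounding rule in this excerpt is easy to check in isolation. The sketch below copies the two lines into a standalone function; effective_max_length is a hypothetical name used only for illustration.

```python
from typing import Optional


def effective_max_length(max_length: Optional[int], pad_to_multiple_of: Optional[int]) -> Optional[int]:
    """Round max_length up to the next multiple of pad_to_multiple_of, as in the excerpt from _pad."""
    if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    return max_length


print(effective_max_length(510, 8))     # 512: 510 is not a multiple of 8, so it is rounded up
print(effective_max_length(512, 8))     # 512: already a multiple, left unchanged
print(effective_max_length(510, None))  # 510: no rounding requested
```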

The __call__ method of DataCollatorWithPadding likewise renames label or label_ids to labels. The actual padding is delegated to transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad.

    def pad(
        self,
        encoded_inputs: Union[
            BatchEncoding,
            List[BatchEncoding],
            Dict[str, EncodedInput],
            Dict[str, List[EncodedInput]],
            List[Dict[str, EncodedInput]],
        ],
        padding: Union[bool, str, PaddingStrategy] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        verbose: bool = True,
    ) -> BatchEncoding:
        """
        Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
        in the batch.

        Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
        `self.pad_token_id` and `self.pad_token_type_id`).

        Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
        text followed by a call to the `pad` method to get a padded encoding.

        <Tip>

        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
        PyTorch tensors, you will lose the specific device of your tensors however.

        </Tip>

        Args:
            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]` or `List[Dict[str, List[int]]]`):
                Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
                tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
                List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
                collate function.

                Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
                the note above for the return type.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence if provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific tokenizer's default, defined by the `return_outputs` attribute.

                [What are attention masks?](../glossary#attention-mask)
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            verbose (`bool`, *optional*, defaults to `True`):
                Whether or not to print more information and warnings.
        """
        ......
        # If we have a list of dicts, let's convert it in a dict of lists
        # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
        if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
            encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
        ......
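  • First, note what pad accepts as input. EncodedInput is an alias of List[int]. BatchEncoding can be treated as a dictionary object of the form Dict[str, Any] whose contents are stored in its data attribute. When a BatchEncoding is instantiated it calls convert_to_tensors, which converts the contents of data into tensors when a tensor type has been requested.
  • If the input features are a List[Dict[str, Any]], they are converted into a Dict[str, List], which is what lets pad serve as the collate function of a PyTorch Dataloader. Passing a datasets.Dataset instance directly to pad raises an error, because a datasets.Dataset instance has no keys attribute.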
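The list-of-dicts to dict-of-lists step can be reproduced on its own. This is a minimal sketch built around the same isinstance check and dict comprehension as the excerpt from pad; to_dict_of_lists is a hypothetical wrapper and the token ids are made up.

```python
from collections.abc import Mapping
from typing import Any


def to_dict_of_lists(encoded_inputs: Any) -> Any:
    """Turn a list (or tuple) of per-example dicts into one dict of per-key lists."""
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
        return {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
    # Anything else is passed through untouched, including objects that are neither list nor tuple.
    return encoded_inputs


rows = [
    {"input_ids": [101, 2769, 102], "label": 1},  # made-up token ids
    {"input_ids": [101, 704, 102], "label": 0},
]
columns = to_dict_of_lists(rows)
print(columns)  # {'input_ids': [[101, 2769, 102], [101, 704, 102]], 'label': [1, 0]}
```

Since the check only matches list and tuple, a datasets.Dataset object skips this conversion, which is why it has to be turned into a plain list before being handed to the collator.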

Example:

x += [{"text": "中国是一个伟大国家。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list())  # convert Dataset into List
