Preface
After loading a dataset with huggingface's Dataset and encoding the text with a tokenizer, the resulting features are still plain Python lists rather than tensors, so they need to be converted into the tensor type required by the deep learning framework. That is exactly what a data_collator does: it collates the feature data into a batch of tensors.
This post covers two commonly used data collators in huggingface transformers: default_data_collator and DataCollatorWithPadding. BertTokenizer is used as the base tokenizer throughout, as shown below:
from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")

def func(exam):
    return tokenizer(exam["text"])
default_data_collator
With the PyTorch backend, default_data_collator essentially calls torch_default_data_collator. Note that the input must be in List[Any] form and the output satisfies Dict[str, Any].
def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all `features` in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)
The source of torch_default_data_collator is shown below. It assumes that all features in a batch share the same attributes, so it uses the first sample as a proxy for the whole batch. It also gives special treatment to the label and label_ids keys, which roughly correspond to the single-label and multi-label cases, and renames them to "labels", since the forward method of most pretrained models defines a keyword argument named labels.
def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])

    return batch
Example (note that default_data_collator does not pad, so the two samples below deliberately have the same length):
x = [{"text": "我爱中国。", "label": 1}, {"text": "我爱中国。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
dataset = default_data_collator(features)
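The collated batch is a plain dict of tensors. As a quick sketch of what to expect (the exact sequence length L depends on the tokenizer output):

print(dataset.keys())              # 'labels' plus the tokenizer outputs: input_ids, token_type_ids, attention_mask
print(dataset["labels"])           # tensor([1, 1]), dtype torch.long because the labels are Python ints
print(dataset["input_ids"].shape)  # torch.Size([2, L]); both samples encode to the same length, so no padding is needed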
DataCollatorWithPadding
Note that DataCollatorWithPadding is a class: it must be instantiated first, and the instance is then called on the features. Compared with default_data_collator, DataCollatorWithPadding additionally pads the received features so that every sample in the batch has the same size along each dimension. Its source is as follows:
@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
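As a minimal illustration of the padding strategies listed in the docstring (the variable names here are only for demonstration), the collator can be instantiated for dynamic padding or for fixed-length padding:

# Dynamic padding: each batch is padded to its own longest sequence.
dynamic_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
# Fixed-length padding: every batch is padded to max_length tokens.
fixed_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=128)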
When instantiating the collator, pay attention to pad_to_multiple_of: it rounds the padding target (max_length) up to a multiple of the given value. For example, with max_length=510 and pad_to_multiple_of=8, the effective max_length becomes 512. See the source of transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad:
def _pad(
    self,
    encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
    max_length: Optional[int] = None,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
) -> dict:
    """
    Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

    Args:
        encoded_inputs:
            Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
        max_length: maximum length of the returned list and optionally padding length (see below).
            Will truncate by taking into account the special tokens.
        padding_strategy: PaddingStrategy to use for padding.

            - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
            - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
            - PaddingStrategy.DO_NOT_PAD: Do not pad
            The tokenizer padding sides are defined in self.padding_side:

                - 'left': pads on the left of the sequences
                - 'right': pads on the right of the sequences
        pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
            `>= 7.5` (Volta).
        return_attention_mask:
            (optional) Set to False to avoid returning attention mask (default: set to model specifics)
    """
    ...
    if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    ...
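The rounding in the excerpt above is easy to verify on its own (a standalone sketch of the same arithmetic, not library code):

max_length, pad_to_multiple_of = 510, 8
if max_length % pad_to_multiple_of != 0:
    # Round up to the next multiple of pad_to_multiple_of: 510 -> 512.
    max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
print(max_length)  # 512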
In DataCollatorWithPadding's __call__ method, label or label_ids is likewise renamed to labels, and the padding itself is delegated to transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad:
def pad(
    self,
    encoded_inputs: Union[
        BatchEncoding,
        List[BatchEncoding],
        Dict[str, EncodedInput],
        Dict[str, List[EncodedInput]],
        List[Dict[str, EncodedInput]],
    ],
    padding: Union[bool, str, PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    verbose: bool = True,
) -> BatchEncoding:
    """
    Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
    in the batch.

    Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
    `self.pad_token_id` and `self.pad_token_type_id`).

    Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
    text followed by a call to the `pad` method to get a padded encoding.

    <Tip>

    If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
    result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
    PyTorch tensors, you will lose the specific device of your tensors however.

    </Tip>

    Args:
        encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]` or `List[Dict[str, List[int]]]`):
            Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
            tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
            List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
            collate function.

            Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
            the note above for the return type.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding
            index) among:

            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            `>= 7.5` (Volta).
        return_attention_mask (`bool`, *optional*):
            Whether to return the attention mask. If left to the default, will return the attention mask according
            to the specific tokenizer's default, defined by the `return_outputs` attribute.

            [What are attention masks?](../glossary#attention-mask)
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors instead of list of python integers. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return Numpy `np.ndarray` objects.
        verbose (`bool`, *optional*, defaults to `True`):
            Whether or not to print more information and warnings.
    """
    ...

    # If we have a list of dicts, let's convert it in a dict of lists
    # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
        encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}

    ...
- First, note the input types accepted by pad: EncodedInput is an alias for List[int], and BatchEncoding can be regarded as a dict-like object satisfying Dict[str, Any], with its data stored in the data attribute. When a BatchEncoding is instantiated with a tensor type, it calls convert_to_tensors, which converts the contents of data into tensors.
- If the input features come as List[Dict[str, Any]], pad converts them to Dict[str, List] so that it can also serve as a collate_fn for a PyTorch DataLoader. Also note that passing a datasets.Dataset instance directly to pad raises an error, because a datasets.Dataset has no keys attribute; see the sketch after this list.
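A small sketch of the two points above, reusing the tokenizer loaded earlier (the variable names are illustrative only):

encoded = [tokenizer("我爱中国。"), tokenizer("中国是一个伟大国家。")]  # a list of dict-like encodings
padded = tokenizer.pad(encoded, padding=True, return_tensors="pt")
print(type(padded))         # transformers.tokenization_utils_base.BatchEncoding
print(list(padded.keys()))  # behaves like a dict: input_ids, token_type_ids, attention_mask
print(type(padded.data))    # the underlying dict that stores the padded tensors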
Example:
x += [{"text": "中国是一个伟大国家。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list()) # convert Dataset into List
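The returned batch is padded to the longest sequence in the batch, and label has been renamed to labels. A sketch of what to expect (the exact padded length L depends on how the longer sentence is tokenized):

print(dataset["labels"])             # tensor([1, 1, 1])
print(dataset["input_ids"].shape)    # torch.Size([3, L]), L = length of the longest encoded sentence
print(dataset["attention_mask"][0])  # trailing zeros mark the padded positions of the shorter sentence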