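Reading notes on a micro-expression recognition paper published in IEEE Transactions on Affective Computing. The paper discusses how data leakage affects the evaluation of model accuracy, and how combining several micro-expression datasets can improve accuracy. The work is substantial and careful, and the code is open source, which deserves credit.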
data leakage
To this end, we go through common pitfalls, propose a new standardized evaluation protocol using facial action units with over 2000 micro-expression samples, and provide an open source library that implements the evaluation protocols in a standardized manner.
However, recently, we have spotted a worrying trend with extremely high yet unreliable performances reaching close to perfect performance and potential issues during evaluation when analyzing available source code.
Data leakage refers to using information from the testing data during the training procedure, giving an overly optimistic evaluation result.
The concern with data leakage is that it creates a misleading understanding of the capabilities of models.
The use of different datasets with varying evaluation strategies and different number of emotions, subjects and samples creates more confusion and difficulties.
To act towards more united protocols, we propose a new protocol, CD6ME, that consists of six ME datasets with over 2000 samples.
By combining the datasets and using AUs, problems with the inconsistency of the labels can be largely alleviated, as the datasets are annotated by standardized FACS (facial action coding system) [38] certified coders.
MEB implements tedious data loading routines, standardized training pipelines and multiple different models from the ME literature.
Common pitfalls found in the ME literature are showcased and discussed.
A new composite cross-dataset action unit classification protocol for ME analysis is proposed.
Comprehensive analysis is performed that compares action units and emotions in MEs.
The typical framework of a micro-expression analysis system consists of two phases: spotting and recognition.
In the spotting phase, unsegmented videos are given as inputs and the task is to spot a temporal sequence during which an ME is occurring.
In the recognition phase, the pre-segmented video clip is classified to an emotion category such as happiness, sadness, surprise, etc.
The FACS (facial action coding system) [38] is a taxonomy of fine-grained facial configurations.
AUs (action units)
AUs (action units) serve as the basic units for encoding facial muscle movements.
AUs can be considered as sign judgement of the face [49], as opposed to emotional labels that attempt to convey the meaning. Due to this difference, automatic AU systems can be applied to a wider set of applications such as pain detection and analysis of nonaffective facial expressions [49].
Each AU can be given five different intensity levels (and one for neutral) denoted by an uppercase letter from A to E, where A is a trace and E is maximum [38].
Most datasets use a different set of emotion inducing videos.
Compared to the onset and apex frames, the offset frame is more ambiguous as faces do not necessarily fully return to a relaxed state.
Different annotation strategies create a discrepancy between the datasets and make comparison between the datasets inconsistent.
The measure is between zero and one, where one means complete agreement.
Objective classes [39] based on action units have been suggested to avoid this problem. More recently, directly using action units [43] has also been suggested.
However, a large meta study of facial expressions [10] suggests that there is no one-to-one mapping between facial movements and emotions.
This supports the findings of the meta-study [10] and that there are no one-to-one mappings between AUs and self-reported emotions for MEs.
These inconsistencies make training on emotions difficult, especially with small datasets.
This means that we cannot expect models to perform with an accuracy of 100% as the ground-truth labels contain noise.
These include data leakage, imprecise use of the F1-Score and evaluation strategies.
In data leakage, information from the testing data leaks to the training data that is used to train the model, leading to overly optimistic evaluation.
early stopping
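In machine learning, early stopping is a technique for preventing overfitting. Training is stopped once the model's performance on a validation set stops improving, so that the model does not learn noise in the training data and lose its ability to generalize. Typically, a performance metric on the validation set is monitored during training, and training stops as soon as that metric no longer improves or starts to get worse.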
Using information from the test data during training can lead to a large positive bias, but the positive bias is misleading and not representative of the generalizable performance, especially when a fold is just a single subject.
The experiments show that using early stopping with test data can create a large positive bias, while using the validation data shows barely any impact.
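To make the early stopping pitfall concrete, here is a minimal sketch of a leave-one-subject-out split in which the validation subject is taken from the training pool, so the stopping decision never touches the test fold. The function names and the rule for choosing the validation subject are assumptions made for illustration; they are not taken from the paper or from MEB.

```python
from typing import Iterator, List, Sequence, Tuple


def loso_folds_with_validation(
    subjects: Sequence[str],
) -> Iterator[Tuple[List[str], str, str]]:
    """Yield (train_subjects, val_subject, test_subject) for each fold.

    The validation subject comes from the training pool, never from the
    test fold, so early stopping cannot see the test data.
    """
    unique = sorted(set(subjects))
    if len(unique) < 3:
        raise ValueError("need at least three subjects")
    for i, test_subject in enumerate(unique):
        remaining = unique[:i] + unique[i + 1:]
        # Hypothetical choice: rotate which remaining subject is held out
        # for validation.
        val_subject = remaining[i % len(remaining)]
        train = [s for s in remaining if s != val_subject]
        yield train, val_subject, test_subject


def early_stopping_epoch(val_scores: Sequence[float], patience: int = 3) -> int:
    """Return the best epoch index, stopping once `patience` epochs pass
    without an improvement of the validation score."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch
```

The leaky variant would pass the test subject's scores to `early_stopping_epoch`; the model is then selected with knowledge of the very data it is evaluated on.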
To avoid the above issue, the pre-training should be done using additional data not part of the evaluation data or the pre-training should be done inside the individual folds.
If the evaluation is done with the same dataset that the generative model was trained on, a data leak may occur.
A dummy model that always predicts the class with the most common occurrence could achieve good performance when measured by accuracy. Use of F1-Score is a standard practice in the ME recognition task [40].
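A small worked example of the dummy-model point, assuming a toy label distribution of 90 samples of one class and 10 of another (the numbers and class names are invented for illustration, not taken from any ME dataset). The majority-class predictor scores high on accuracy but poorly on the macro-averaged F1-Score.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced labels (hypothetical class names).
y_true = ["negative"] * 90 + ["surprise"] * 10
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)   # the dummy model

dummy_accuracy = accuracy(y_true, y_pred)   # 0.9
dummy_macro_f1 = macro_f1(y_true, y_pred)   # about 0.47
```

The minority class contributes an F1 of zero, which pulls the macro average down even though 90% of the predictions are correct.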
How the F1-Score is computed.
The F1-Score can be generalized to a multi-class setting by a few different strategies.
One should be aware that when computing the F1-Score as noted by Opitz and Burst [35], the averaging can be done in two ways, as shown in Equation 4 or by first aggregating over the classes to compute precision and recall and using Equation 2 to compute the F1-Score.
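These notes do not reproduce the paper's equations, so the following is the standard formulation of the two averaging orders rather than a copy of the paper's Equations 2 and 4. It assumes that Equation 2 is the usual harmonic mean of precision and recall and that Equation 4 is the mean of the per-class F1 scores; the symbol names are chosen here for illustration.

```latex
% Per-class precision, recall and F1 for class c out of C classes
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2 P_c R_c}{P_c + R_c}

% Way 1: average the per-class F1 scores
F1_{\text{avg of F1}} = \frac{1}{C} \sum_{c=1}^{C} F1_c

% Way 2: average precision and recall first, then take their harmonic mean
\bar{P} = \frac{1}{C} \sum_{c=1}^{C} P_c, \qquad
\bar{R} = \frac{1}{C} \sum_{c=1}^{C} R_c, \qquad
F1_{\text{F1 of avg}} = \frac{2 \bar{P} \bar{R}}{\bar{P} + \bar{R}}
```

The two quantities are generally not equal, so a reported "macro F1" is only comparable across papers if the averaging order is stated.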
A common pitfall is to compute the F1-Score in each fold separately and aggregate the results together.
As can be seen, both micro- and weighted F1 give a positive bias as they do not take the class imbalance into account, while averaging over the folds leads to a significant negative bias.
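The fold-averaging pitfall can be shown with a toy case, sketched here under the assumption that a class missing from a fold is scored as zero for that fold (a common default, but not confirmed by the paper). Every prediction below is correct, yet averaging the per-fold scores gives 0.5, while pooling the predictions from all folds first gives 1.0. The data and function names are invented for illustration.

```python
from typing import List, Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             classes: Sequence[str]) -> float:
    """Macro F1 over a fixed class list; a class absent from both the
    labels and the predictions is scored as 0 (assumed convention)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


classes = ["a", "b"]
# Each fold is one subject; each subject only shows one class.
folds: List[Tuple[List[str], List[str]]] = [
    (["a", "a", "a"], ["a", "a", "a"]),   # (labels, predictions)
    (["b", "b"], ["b", "b"]),
]

# Pitfall: score each fold separately, then average the scores.
per_fold = [macro_f1(t, p, classes) for t, p in folds]
fold_averaged_f1 = sum(per_fold) / len(per_fold)

# Alternative: pool predictions from all folds, then score once.
all_true = [t for fold_true, _ in folds for t in fold_true]
all_pred = [p for _, fold_pred in folds for p in fold_pred]
pooled_f1 = macro_f1(all_true, all_pred, classes)
```

With single-subject folds, missing classes are the norm rather than the exception, which is why the negative bias of fold averaging can be large.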
They split the validation strategies into three categories: 1) person dependent evaluation (PDE), 2) person independent evaluation (PIE) and 3) cross domain evaluation (CDE).
The three strategies are ordered from easiest to hardest.
In addition to different evaluation strategies, the number of samples and the number of used emotions may be different across articles.
However, different works use changing subsets with different number of emotions and samples. Add this to the common pitfalls discussed in the previous section and the comparison of different works is extremely difficult.
The use of AUs allows us to combine the datasets as the annotation of AUs is standardized by having the annotators be qualified FACS coders.
The training and testing is repeated n_D times, where n_D refers to the number of datasets.
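A leave-one-dataset-out loop of this kind can be sketched as follows. The sample representation as `(dataset_name, sample_id)` pairs and the placeholder dataset names are assumptions for illustration; MEB's actual data structures are not described in these notes.

```python
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, int]  # (dataset_name, sample_id), a hypothetical layout


def leave_one_dataset_out(
    samples: Sequence[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_name, train_samples, test_samples), once per dataset.

    Each dataset is used as the test set exactly once, and the model is
    trained on all remaining datasets, giving n_D train/test rounds.
    """
    names = sorted({name for name, _ in samples})
    for held_out in names:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Placeholder dataset names, not the real CD6ME members.
toy_samples = [("dataset_A", 0), ("dataset_A", 1),
               ("dataset_B", 0),
               ("dataset_C", 0), ("dataset_C", 1), ("dataset_C", 2)]
rounds = list(leave_one_dataset_out(toy_samples))
```

Because a whole dataset is held out at a time, no subject, recording setup or annotation style from the test set is seen during training.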
In AU detection an unsegmented video clip is given with frame level labels. The task is to predict a binary multi-label whether an AU exists for each frame separately.
In AU classification a pre-segmented video clip is given with a single binary multi-label [78]. The task is to predict whether an AU exists for the whole clip.
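The difference between the two label formats can be illustrated with a toy clip. The sketch below assumes that a clip-level label marks an AU as present if it appears in any frame; this is an illustrative convention, since the datasets themselves provide clip-level annotations directly. The AU columns are hypothetical.

```python
from typing import List, Sequence


def clip_label_from_frames(frame_labels: Sequence[Sequence[int]]) -> List[int]:
    """Collapse frame-level multi-labels into one clip-level multi-label.

    An AU is marked present for the clip if it is present in any frame.
    """
    if not frame_labels:
        raise ValueError("clip has no frames")
    n_aus = len(frame_labels[0])
    return [int(any(frame[k] for frame in frame_labels)) for k in range(n_aus)]


# Toy clip with two hypothetical AU columns, e.g. [AU4, AU12].
# AU detection target: one binary vector per frame.
frames = [[1, 0], [1, 0], [0, 1]]
# AU classification target: one binary vector for the whole clip.
clip_label = clip_label_from_frames(frames)
```

In this toy clip the second AU only appears in the last frame, so a label read from any single earlier frame would miss it.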
Optical strain
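Optical strain is a way of detecting deformation and change on an object's surface in images. In computer vision, particularly when analyzing motion or deformation, optical strain is commonly used to represent local shape changes in an object or scene.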
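A minimal sketch of how an optical strain magnitude map can be derived from a dense optical flow field, using the usual small-deformation strain tensor. It assumes the flow components `u` (horizontal) and `v` (vertical) have already been estimated by some optical flow method; the function name is hypothetical and this is not necessarily the exact formulation used in the paper.

```python
import numpy as np


def optical_strain(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Per-pixel strain magnitude from a dense flow field (u, v).

    u and v are 2D arrays of shape (height, width) holding horizontal and
    vertical displacement. Rows are the y direction, columns the x direction.
    """
    du_dy, du_dx = np.gradient(u)   # derivatives along axis 0 (y), axis 1 (x)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                    # normal strain in x
    e_yy = dv_dy                    # normal strain in y
    e_xy = 0.5 * (du_dy + dv_dx)    # shear strain (e_xy equals e_yx)
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)


# A uniform stretch along x produces constant strain, while a rigid
# translation (constant flow) produces none.
ys, xs = np.mgrid[0:5, 0:6].astype(float)
stretch_strain = optical_strain(0.5 * xs, np.zeros_like(xs))
translation_strain = optical_strain(np.ones((5, 6)), np.ones((5, 6)))
```

Since strain depends only on spatial derivatives of the flow, it ignores rigid translation and responds only to local deformation such as skin stretching.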
As mentioned previously, data leakage and evaluation issues are largely affected, which made reproducing results difficult.
By combining the above together in this paper, we are able to evaluate methods in a more realistic setting, while providing increased performance by using additional data.
Multiple AUs may occur at different times; using only the apex may therefore miss one or more AUs.
The results in Table 6 show promising results for the use of RGB as input when using large composite data.
As shown by our work, significant gains can be obtained without touching the models.
Although the cross-dataset is a more realistic setting, the data is still from a laboratory setting, which limits the applicability for in-the-wild scenarios.
Another limitation is the need for data which requires capturing spontaneous subtle facial-expressions from human subjects and accurate labor intensive annotations.
We point out common pitfalls such as data leakage and fragmented use of evaluation protocols in micro-expression recognition.
We propose a new benchmark, CD6ME, that uses a cross-dataset protocol for generalized evaluation.
Action units are used instead of emotional classes for a more objective and consistent label.
A micro-expression analysis library, MEB, with the implementation of data loading routines, training loops and several commonly used micro-expression models, is introduced and openly shared.