Paper 1: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

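[Abstract] Audio pattern recognition is an important research topic in machine learning and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification, and sound event detection. Recently, neural networks have been applied to audio pattern recognition problems. However, previous systems were built on specific datasets of limited duration. In computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks, yet there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio-related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN that uses both the log-mel spectrogram and the waveform as input features. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system at 0.392. We transfer PANNs to six audio pattern recognition tasks and demonstrate state-of-the-art performance in several of them.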

Paper 2: Towards Duration Robust Weakly Supervised Sound Event Detection

> [1]

Introduction

Sound event detection (SED) research classifies and localizes particular audio events (e.g., dog barking, alarm ringing) within an audio clip, assigning each event a label along with a start point (onset) and an endpoint (offset).
Label assignment is usually referred to as tagging, while onset/offset detection is referred to as localization.
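To make the distinction between tagging and localization concrete, the following is a minimal sketch of how frame-wise activity decisions for one event class could be turned into labelled segments with an onset and an offset. The names `SoundEvent`, `frames_to_events`, and `clip_tags`, as well as the frame rate used in the usage note, are hypothetical choices for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    """One localized event: a label plus onset/offset in seconds."""
    label: str
    onset: float
    offset: float


def frames_to_events(label: str, active: List[bool], frame_rate_hz: float) -> List[SoundEvent]:
    """Localization: merge consecutive active frames into (onset, offset) segments."""
    events: List[SoundEvent] = []
    start = None  # index of the first frame of the segment currently open
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            events.append(SoundEvent(label, start / frame_rate_hz, i / frame_rate_hz))
            start = None
    if start is not None:
        # The segment runs until the end of the clip.
        events.append(SoundEvent(label, start / frame_rate_hz, len(active) / frame_rate_hz))
    return events


def clip_tags(events: List[SoundEvent]) -> List[str]:
    """Tagging: only report which labels occur, discarding all timing."""
    return sorted({event.label for event in events})
```

For example, `frames_to_events("dog_bark", [False, True, True, False, True], 2.0)` yields two segments for the same label, whereas `clip_tags` on that result keeps only the single tag `dog_bark`.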
SED can be used for query-based sound retrieval [1], smart cities and homes [2], [3], as well as voice activity detection [4].
Unlike common classification tasks such as image or speaker recognition, a single audio clip might contain multiple different sound events (multi-output), sometimes occurring simultaneously (multi-label).
In particular, the localization task increases the difficulty of SED, since different sound events have different durations and each occurrence is unique.
Two main approaches exist to train an effective localization model: fully supervised SED and weakly supervised SED (WSSED).
Fully supervised approaches, which potentially perform better than weakly supervised ones, require manual time-stamp labeling.
However, manual labeling is a significant hindrance to scaling to large datasets because of its high labor cost.
This paper primarily focuses on WSSED, which only has access to clip-level event labels during training yet must predict onsets and offsets at the inference stage.
Challenges such as the Detection and Classification of Acoustic Scenes and Events (DCASE) exemplify the difficulties in training robust SED systems.
DCASE challenge datasets are real-world recordings (e.g., audio with no quality control and with lossy compression), thus containing unknown noises and scenarios.
Specifically, in each challenge since 2017, at least one task has been primarily concerned with WSSED. Most previous work focuses on providing single-target, task-specific solutions for WSSED at the tagging, segment, or event level.
Tagging-level solutions are often capable of localizing event boundaries, yet their temporal consistency is inferior to that of segment- and event-level methods.
This was seen in the DCASE2017 challenge, where no single model could win both the tagging and the localization subtasks.
Solutions optimized for the segment level often use a fixed target time resolution (e.g., 1 Hz), inhibiting fine-scale localization performance (e.g., 50 Hz).
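The effect of a fixed target resolution can be illustrated with a small sketch that snaps an event's boundaries outward to the nearest grid points. The function name `snap_to_grid`, the use of integer milliseconds, and the example event are assumptions made for illustration only. The sketch also assumes the resolution divides 1000 evenly so that the grid step is a whole number of milliseconds.

```python
from typing import Tuple


def snap_to_grid(onset_ms: int, offset_ms: int, resolution_hz: int) -> Tuple[int, int]:
    """Expand an event to the enclosing grid cells of a fixed time resolution.

    Times are integer milliseconds to avoid floating-point rounding surprises.
    """
    if resolution_hz <= 0 or 1000 % resolution_hz != 0:
        raise ValueError("resolution_hz must be a positive divisor of 1000 in this sketch")
    step_ms = 1000 // resolution_hz
    snapped_onset = (onset_ms // step_ms) * step_ms        # round down to the grid
    snapped_offset = -(-offset_ms // step_ms) * step_ms    # round up to the grid
    return snapped_onset, snapped_offset
```

Under these assumptions, a short event from 2300 ms to 2420 ms is reported as a full second (2000 ms to 3000 ms) on a 1 Hz grid, while a 50 Hz grid with 20 ms steps preserves its boundaries.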
Lastly, successful event-level solutions require prior knowledge about each event's duration to obtain temporally consistent predictions.
Previous work in [5] showed that successful models such as the DCASE2018 task 4 winner are biased towards predicting tags from long-duration clips, which might limit their ability to generalize to different datasets (e.g., deploying the same model on a new dataset), since new datasets possibly contain events of short or unknown duration.
In contrast, we aim to enhance WSSED performance, specifically in duration estimation regarding short, abrupt events, without a pre-estimation of each respective event's individual weight.

Related work
Most current approaches within SED and WSSED utilize neural networks, in particular convolutional neural networks (CNN) [6], [7] and convolutional recurrent neural networks (CRNN) [4], [5].
CNN models generally excel at audio tagging [8], [9] and scale with data, yet fall behind CRNN approaches in onset and offset estimation [10].
Apart from different modeling methods, many recent works propose other approaches to the localization problem.
A plethora of temporal pooling strategies have been proposed, aiming to summarize frame-level beliefs into a single clip-wise probability.
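As an illustration of what a temporal pooling strategy does, the sketch below implements three pooling functions that are common in the weakly supervised literature: max, mean, and linear softmax pooling. These are generic examples and are not necessarily the strategies this paper compares. The function names are hypothetical.

```python
from typing import Sequence


def max_pool(frame_probs: Sequence[float]) -> float:
    """Clip probability is the most confident frame."""
    return max(frame_probs)


def mean_pool(frame_probs: Sequence[float]) -> float:
    """Clip probability is the average over all frames."""
    return sum(frame_probs) / len(frame_probs)


def linear_softmax_pool(frame_probs: Sequence[float]) -> float:
    """Each frame is weighted by its own probability: sum(p^2) / sum(p)."""
    total = sum(frame_probs)
    if total == 0:
        return 0.0  # no frame shows any activity
    return sum(p * p for p in frame_probs) / total
```

For a short event that is active in only a few frames, mean pooling pulls the clip probability towards zero, max pooling ignores every frame but one, and linear softmax pooling sits between the two.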
Contribution:
In our work, we modify and extend the framework of [5] to other datasets and aim to analyze the benefits and limits of duration-robust training.
Our main goal with this work is to bridge the gap between real-world SED and research models and to facilitate a common framework that works well at both the tagging and the localization level without using dataset-specific knowledge.
Our contributions are:
A new, lightweight model architecture for WSSED using L4-norm temporal subsampling.
A novel thresholding technique named triple threshold, bridging the gap between tagging and localization performance.
Verification of our proposed approach across three publicly available datasets, without the need to manually optimize dataset-specific hyperparameters.
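The excerpt does not spell out how the L4-norm temporal subsampling is configured, so the following is only a minimal sketch of generic Lp-norm pooling along the time axis with p = 4. It assumes non-overlapping windows, the sum form of the norm, and that an incomplete trailing window is dropped. The function name `lp_norm_subsample` and its parameters are hypothetical.

```python
from typing import List, Sequence


def lp_norm_subsample(frames: Sequence[float], window: int, p: float = 4.0) -> List[float]:
    """Reduce the time resolution by taking the Lp norm of each window of frames.

    Assumption: windows do not overlap and a trailing partial window is discarded.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled: List[float] = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        pooled.append(sum(abs(x) ** p for x in chunk) ** (1.0 / p))
    return pooled
```

With p = 1 this reduces to summing the absolute values in each window, and as p grows the result approaches max pooling, so p = 4 lies between the two and gives more weight to the strongest frame in a window.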