R Software Time Series Analysis Papers
Data Science, Machine Learning
In machine learning with time series, using features extracted from each series is more powerful than simply treating a time series in tabular form, with each date/timestamp in a separate column. Such features can capture characteristics of a series, such as trend and autocorrelation.
But… what sorts of features can you extract, and how do you select among them?
In this article, I discuss the findings of two papers that analyze feature-based representations of time series. The papers do comprehensive work: they collect thousands of time series feature extractors and evaluate which features capture the most useful information from a series.
- Highly comparative time-series analysis: the empirical structure of time series and their methods (Fulcher et al., 2013)
- catch22: CAnonical Time-series CHaracteristics (Lubba et al., 2019)
The papers show how to compare time series by extracting features that describe each series' behavior, and they suggest a pipeline for identifying an “optimal” subset of time series features.
Why Is This Important?
There are two basic ways to compare time series:
- Use a similarity measure that quantifies whether two time series are close (on average) across time, such as Dynamic Time Warping. These measures are typically best for short, aligned series of equal length. They tend to scale poorly: computation is quadratic in both the number of time series and the series length, because distances must be computed between all pairs of series.
- Define similarity between series in terms of features extracted from each series using time series analysis algorithms. Feature extractors do not require series to be of equal length. The result is an interpretable summary of the dynamical characteristics of each series. These features can then be used for machine learning.
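To make the contrast concrete, here is a minimal sketch of both approaches using only the Python standard library. The four features (mean, standard deviation, lag-1 autocorrelation, linear trend slope) and all function names are my own illustrative choices, not operations taken from either paper.

```python
import math
from statistics import fmean, pstdev


def dtw_distance(a, b):
    """Classic dynamic time warping; cost table is len(a) x len(b), hence quadratic."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 (0.0 for a constant series)."""
    mu = fmean(x)
    den = sum((v - mu) ** 2 for v in x)
    if den == 0:
        return 0.0
    return sum((x[t] - mu) * (x[t + 1] - mu) for t in range(len(x) - 1)) / den


def trend_slope(x):
    """Least-squares slope of x against its time index."""
    t_mean = (len(x) - 1) / 2
    x_mean = fmean(x)
    num = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    den = sum((t - t_mean) ** 2 for t in range(len(x)))
    return num / den


def feature_vector(x):
    """Fixed-length summary, regardless of how long the series is."""
    return [fmean(x), pstdev(x), lag1_autocorr(x), trend_slope(x)]


def feature_distance(x, y):
    return math.dist(feature_vector(x), feature_vector(y))


# Series of different lengths can be compared directly in feature space.
short_sine = [math.sin(2 * math.pi * t / 20) for t in range(60)]
long_sine = [math.sin(2 * math.pi * t / 20) for t in range(200)]
ramp = [0.05 * t for t in range(120)]
print(feature_distance(short_sine, long_sine), feature_distance(short_sine, ramp))
```

In this toy setup the two sine waves end up close together in feature space even though their lengths differ, while the ramp sits far away. In practice the features would need rescaling before distances are meaningful.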
Interpretability is another key benefit: time series features can capture complex, time-varying patterns in a set of interpretable characteristics.
The problem is that there is a vast number of methods for extracting interpretable features from time series. Further, feature selection is often done manually and subjectively.
What sort of features can be extracted from a series, and how can you select among them?
Highly comparative time-series analysis: the empirical structure of time series and their methods
Paper motivation: although time series are studied across scientific disciplines (e.g. stock prices in finance, human heartbeats in medicine), different methods for time series analysis have been developed separately in different disciplines.
Given the great number of methods, it is difficult to determine how methods developed by different disciplines are related. So how can a practitioner select the optimal method for their data?
To address this challenge, the HCTSA paper:
- Assembles an extensive annotated library of time series data and methods for time series analysis.
- Models time series methods according to their behavior on the data, and groups time series by their measured properties.
- Introduces a range of comparative analysis techniques for series and their methods. First, the ability to link a given time series to similar real-world and model-generated series. Second, the ability to link specific time series analysis methods to a range of alternatives across the literature.
HCTSA Framework and Scope
The paper's scope is extensive: the authors annotated a library of 38,190 univariate time series and 9,613 time series analysis algorithms.
The time series analysis methods vary in form, ranging from summary statistics to statistical model fits. Each transformation summarizes an input series with a single real number.
The library of time series transformations covers a wide range of time series properties:
- basic statistics of the distribution (e.g. location, spread, outlier properties)
- linear correlations (e.g. autocorrelations, features of the power spectrum)
- stationarity (e.g. sliding window measures, unit root tests)
- information-theoretic and entropy measures (e.g. auto-mutual information, Approximate Entropy)
- methods from the physical nonlinear time-series analysis literature (e.g. correlation dimension)
- linear and nonlinear model fits (e.g. goodness of fit and parameters from autoregressive models)
- others (e.g. wavelet methods)
For transformations that require parameter values, the transformation is repeated for multiple parameter values. A single “operation” is a transformation plus a single parameter value. Among the roughly 9,000 operations evaluated in the paper, a single transformation may therefore be counted several times, once for each parameter value. The paper evaluates approximately 1,000 unique transformations.
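The distinction between a transformation and an operation can be sketched as follows. The two transformations, their parameter grids, and the naming scheme are hypothetical examples of mine; they only illustrate how a small number of transformations expands into a larger number of operations, each returning a single real number.

```python
from statistics import fmean


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    mu = fmean(x)
    den = sum((v - mu) ** 2 for v in x)
    if den == 0:
        return float("nan")
    return sum((x[t] - mu) * (x[t + lag] - mu) for t in range(len(x) - lag)) / den


def quantile_spread(x, q):
    """Distance between the q and (1 - q) empirical quantiles (simple index-based)."""
    s = sorted(x)
    low = s[int(q * (len(s) - 1))]
    high = s[int((1 - q) * (len(s) - 1))]
    return high - low


# Hypothetical mini-library: each transformation with the parameter values to try.
TRANSFORMATIONS = {
    "autocorrelation": (autocorrelation, [1, 2, 5, 10]),
    "quantile_spread": (quantile_spread, [0.05, 0.25]),
}


def build_operations(transformations):
    """Expand every (transformation, parameter value) pair into one named operation."""
    ops = {}
    for name, (func, params) in transformations.items():
        for p in params:
            # Bind func and p as defaults so each lambda keeps its own pair.
            ops[f"{name}_{p}"] = lambda x, f=func, p=p: f(x, p)
    return ops


operations = build_operations(TRANSFORMATIONS)
series = [float((3 * t) % 7) for t in range(50)]
fingerprint = {name: op(series) for name, op in operations.items()}
print(len(TRANSFORMATIONS), "transformations ->", len(operations), "operations")
```

Here two transformations yield six operations, which is the same counting convention that turns about a thousand transformations into roughly nine thousand operations.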
HCTSA: Empirical structure of time series analysis methods
The authors used k-medoids clustering to identify four broad categories of time series analysis operations:
- Linear correlation
- Stationarity (properties that change with time)
- Information theory
- Nonlinear time series analysis
The clustering analysis revealed that a subset of 200 time series operations, which acts as an empirical fingerprint of a series’ behavior, can approximate the 8,651 operations considered. The 200 operations summarize different behaviors of time series analysis methods. These operations include techniques developed in a variety of disciplines.
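The idea of picking representative operations with k-medoids can be sketched on a toy example. This is not the authors' code: the dissimilarity (one minus the absolute Pearson correlation between the outputs of two operations across a set of series), the farthest-point initialisation, and the tiny data matrix are all assumptions I made for illustration.

```python
from statistics import fmean


def pearson(u, v):
    mu_u, mu_v = fmean(u), fmean(v)
    num = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    den_u = sum((a - mu_u) ** 2 for a in u) ** 0.5
    den_v = sum((b - mu_v) ** 2 for b in v) ** 0.5
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)


def distance_matrix(rows):
    """Assumed dissimilarity between operations: 1 - |correlation| of their outputs."""
    n = len(rows)
    return [[0.0 if i == j else 1.0 - abs(pearson(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]


def k_medoids(dist, k, max_iter=100):
    n = len(dist)
    # Deterministic farthest-point initialisation (an illustrative choice).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(current):
        # Each operation is labelled with the index of its nearest medoid.
        return [min(current, key=lambda m: dist[i][m]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m] or [m]
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Rows are operations, columns are their outputs on eight series (toy values).
base_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
base_b = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
base_c = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
rows = [
    base_a, [2 * v + 1 for v in base_a],
    base_b, [-3 * v for v in base_b],
    base_c, [0.5 * v + 2 for v in base_c],
]
medoids, labels = k_medoids(distance_matrix(rows), k=3)
print(sorted(medoids), labels)
```

Each pair of rows that are linear rescalings of each other collapses onto a single medoid, so three representative operations stand in for six. The paper applies the same idea at a far larger scale.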
Further, the analysis uncovered a local structure surrounding each target operation. For a given operation, the authors were able to identify alternative operations with similar behavior.
“By comparing their empirical behaviour, the techniques demonstrated above can be used to connect new methods to alternatives developed in other fields in a way that encourages interdisciplinary collaboration on the development of novel methods for time-series analysis that do not simply reproduce the behaviour of existing methods” [1]
HCTSA: Empirical structure of time series
Time series can be represented by properties that capture the important dynamical behavior of the series. The authors use 200 representative operations to compare 24,577 time series from different systems and of varying lengths.
This empirical fingerprint of 200 diverse time-series analysis operations facilitates a meaningful comparison of scientific time series.
To group their library of roughly 24,000 time series, the authors used complete linkage clustering to form 2,000 clusters. Because such a wide range of time series properties was used, the clusters grouped series according to their dynamics, even when their lengths differed.
Most clusters grouped time series measured from the same system. Some clusters contained series generated by different systems.
The reduced representation of time series allows you to retrieve a local neighborhood of series with similar properties. This allows you to automatically relate real-world time series to similar, model-generated time series.
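Retrieving such a local neighborhood amounts to a nearest-neighbor search in feature space. The sketch below assumes each series has already been reduced to a rescaled feature vector; the series names, the three-dimensional fingerprints, and the Euclidean metric are placeholders of mine, not values or choices from the paper.

```python
import math


def nearest_neighbours(query, library, k=3):
    """Return the names of the k library entries closest to `query` (Euclidean)."""
    ranked = sorted(library.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical fingerprints, assumed already rescaled to comparable ranges.
library = {
    "real_series_a": [0.90, 0.10, 0.30],
    "model_series_b": [0.85, 0.15, 0.35],
    "model_series_c": [0.05, 0.95, 0.50],
    "real_series_d": [0.20, 0.80, 0.90],
}
query = [0.88, 0.12, 0.33]
matches = nearest_neighbours(query, library, k=2)
print(matches)
```

If the closest matches to a real-world series include model-generated series, those models become natural candidates to try on the real data.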
Thus, the transformations can be used to suggest suitable families of models for use in real-world systems.
HCTSA Code
The code for Highly Comparative Time Series Analysis can be found on GitHub; however, it is written in Matlab. (You can use the hctsa package from Python using the pyopy package.) The hctsa package allows thousands of features to be extracted from a time series. The software also has an accompanying paper.
Importantly, it is slow to run. Reducing the full set of HCTSA operations to even 200 of the thousands of candidate features is computationally expensive. This approach is infeasible for some applications, especially those with large training data.
HCTSA also has a web platform, CompEngine. CompEngine “is a self-organizing database of time-series data that allows users to upload, explore, and compare thousands of diverse types of time-series data.” [4]
catch22: CAnonical Time-series CHaracteristics
The subsequent catch22: CAnonical Time-series CHaracteristics paper (2019) builds on HCTSA by reducing the set of representative features to 22 time series features that:
- exhibit strong classification performance across a given collection of time-series problems,
- are minimally redundant, and
- capture the diversity of analysis contained in HCTSA.
The paper creates a data-driven subset of the most useful features extracted from a time series. The authors compare across a diverse set of time series analysis algorithms, starting with the features in the HCTSA toolbox.
The catch22 characteristics form a diverse and interpretable “signature” of a time series based on its properties.
This signature includes linear and non-linear temporal auto-correlation, successive differences, value distributions and outliers, and fluctuation scaling properties.
Benefits of catch22 features
- Fast computation (~1000x faster than the full HCTSA feature set in Matlab)
- Provides a low-dimensional summary of time series
- Interpretable characteristics that are useful for classification and clustering
Further, if the catch22 features are not appropriate for your problem, the feature selection pipeline is general. The pipeline can be used to select informative subsets of features for new or more complex problems.
Catch22 feature scoring
The authors score features by evaluating decision tree classification accuracy across a set of 93 classification problems from the Time Series Classification Repository. Using 4,791 features from HCTSA yields a mean class-balanced accuracy of 77.2% across all tasks. Using the smaller set of 22 features yields a mean class-balanced accuracy of 71.7%.
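For readers unfamiliar with the metric, here is a minimal sketch of class-balanced accuracy under one common definition, the mean of per-class recall, averaged over tasks. I am assuming this definition matches the paper's usage; the labels below are toy data.

```python
from collections import defaultdict
from statistics import fmean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of its size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return fmean(correct[c] / total[c] for c in total)


def mean_balanced_accuracy(tasks):
    """Average the balanced accuracy over a list of (y_true, y_pred) tasks."""
    return fmean(balanced_accuracy(y_true, y_pred) for y_true, y_pred in tasks)


# An imbalanced toy task: always predicting the majority class looks good on
# plain accuracy (0.8) but not on class-balanced accuracy.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
print(balanced_accuracy(y_true, y_pred))
```

Balancing by class matters here because many benchmark tasks have uneven class sizes, and a plain accuracy score would reward features that only recognise the majority class.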
Catch22 feature selection pipeline
For all data sets, each time series feature was linearly rescaled to the unit interval [0, 1]. This scaling may not be appropriate for some real-world applications.
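A linear rescaling to the unit interval is ordinary min-max scaling applied to each feature column. The sketch below is a generic version; mapping a constant feature to 0 is a convention I chose, since the source does not say how that case was handled.

```python
def rescale_unit_interval(values):
    """Linearly map values so the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        # Constant feature: no spread to rescale, so map to 0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def rescale_columns(rows):
    """Rescale each feature (column) of a series-by-feature table independently."""
    columns = [rescale_unit_interval(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Three series (rows) described by two features (columns) on very different scales.
table = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(rescale_columns(table))
```

Min-max scaling is sensitive to outliers, because a single extreme value compresses every other value toward one end of the interval. That is one reason it may not suit some real-world data.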
First, the authors excluded features sensitive to the mean and variance of the distribution of values, because the majority of the series were normalized.
For some applications, this preselection is not desirable. If you are working with non-normalized series, you should consider including distributional features such as the mean and standard deviation. These can lead to significant performance gains.
Next, the authors excluded the transformations that frequently output special values. Special values indicate that an algorithm is not suitable for the input data, or that it did not evaluate successfully.
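Filtering out features that frequently return special values could look something like the sketch below. Treating NaN, infinity, and None as special values and the 25% tolerance are assumptions on my part; the source does not state the cutoff the authors used.

```python
import math


def is_special(value):
    """Treat None, NaN and +/-inf as 'special' outputs (an assumed definition)."""
    return value is None or math.isnan(value) or math.isinf(value)


def drop_unreliable_features(feature_table, max_special_fraction=0.25):
    """Keep features whose share of special values is at or below the threshold."""
    kept = {}
    for name, values in feature_table.items():
        fraction = sum(is_special(v) for v in values) / len(values)
        if fraction <= max_special_fraction:
            kept[name] = values
    return kept


# Hypothetical outputs of three features on four series.
feature_table = {
    "stable_feature": [0.1, 0.2, 0.3, 0.4],
    "occasionally_failing": [0.1, None, 0.3, 0.4],
    "fragile_feature": [float("nan"), 0.2, float("inf"), 0.4],
}
reliable = drop_unreliable_features(feature_table)
print(sorted(reliable))
```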
Last, the authors created a pipeline to filter for features that can individually discriminate between classes across a range of real-world data. The pipeline then filtered for features that have complementary behavior.
The feature selection pipeline had three rounds:
- Statistical pre-filtering: filter out features whose performance was statistically insignificant on the given learning tasks.
- Performance filtering: select features that perform best across all tests. “Performance” is the ability to distinguish between labeled classes in 93 classification tasks with a decision tree classifier.
- Redundancy minimization: the top features were clustered (hierarchical clustering with complete linkage) into groups according to their performance scores across tasks. From each cluster, a single representative feature was selected for the feature set. The representative was the feature with the highest score across tasks, unless it was computationally intensive, in which case another high-accuracy feature with greater interpretability and efficiency was manually selected.
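The last two rounds can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline (their code is linked later in the article): the correlation-based distance, the 0.2 merge threshold, the `top_n` cutoff, and the per-task accuracies are all assumptions, and the manual substitution step is omitted.

```python
from statistics import fmean


def pearson(u, v):
    mu_u, mu_v = fmean(u), fmean(v)
    num = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    den_u = sum((a - mu_u) ** 2 for a in u) ** 0.5
    den_v = sum((b - mu_v) ** 2 for b in v) ** 0.5
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)


def complete_linkage_clusters(names, vectors, threshold):
    """Agglomerative clustering; merging stops once the closest pair exceeds threshold."""
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(1.0 - pearson(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        distance, i, j = min(pairs)
        if distance > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def select_features(scores, top_n, threshold=0.2):
    """Performance filter (top_n by mean score), then one representative per cluster."""
    ranked = sorted(scores, key=lambda n: fmean(scores[n]), reverse=True)[:top_n]
    clusters = complete_linkage_clusters(ranked, scores, threshold)
    return [max(cluster, key=lambda n: fmean(scores[n])) for cluster in clusters]


# Hypothetical accuracies of five features on five classification tasks.
scores = {
    "feature_1": [0.90, 0.60, 0.80, 0.50, 0.70],
    "feature_2": [0.88, 0.58, 0.79, 0.52, 0.69],  # behaves like feature_1
    "feature_3": [0.50, 0.90, 0.60, 0.85, 0.70],
    "feature_4": [0.52, 0.88, 0.60, 0.80, 0.65],  # behaves like feature_3
    "feature_5": [0.50, 0.50, 0.50, 0.50, 0.50],  # weak performer
}
selected = select_features(scores, top_n=4)
print(sorted(selected))
```

Clustering on performance profiles rather than on raw feature values means two features are treated as redundant when they succeed and fail on the same tasks, which is what matters for building a complementary set.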
Accuracy / Interpretability Trade-off
The authors compared classification performance using the catch22 features with a wide variety of time series classification algorithms, such as those implemented in sktime.
The classification of time series with catch22 features, despite the large dimensionality reduction, results in “similar” performance to alternative methods. The authors admit that the majority of datasets exhibit better performance using existing algorithms than using catch22.
The paper often claims that catch22 only has a “small” reduction in accuracy. (The authors did not publish the performance of classifiers with catch22 features.) In one instance, they called a decrease from 99.2% to 89.5% “small”, but in my opinion, this is not small for many applications.
While the authors failed to prove, in my view, that a classification model built with the catch22 features could outperform a native time series classifier, catch22 does offer interpretable features for model explanation.
In particular, the authors highlighted one classifier where a single feature was able to perfectly separate two classes (series = triangle or noise). The feature “quantifies the length of the longest continued descending increments in the data”. Clearly, this is simple to explain.
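To show how simple such a feature is, here is my own simplified re-implementation of the idea (the length of the longest run of consecutive decreases), applied to a synthetic triangle wave and to uniform noise. It is a sketch based on the quoted description, not the catch22 library code, and the synthetic data are not the dataset from the paper.

```python
import random


def longest_decreasing_stretch(x):
    """Length of the longest run of consecutive decreases (x[t + 1] < x[t])."""
    longest = current = 0
    for prev, nxt in zip(x, x[1:]):
        if nxt < prev:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def triangle_wave(n, period=20):
    """Integer triangle wave that falls for half a period, then rises."""
    half = period // 2
    return [abs((t % period) - half) for t in range(n)]


random.seed(0)
triangle = triangle_wave(200)
noise = [random.random() for _ in range(200)]
print(longest_decreasing_stretch(triangle), longest_decreasing_stretch(noise))
```

The triangle wave decreases for ten steps in a row, while long runs of decreases are very unlikely in independent noise, so a single threshold on this one number separates the two classes.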
Alternative Time Series Feature Sets
The authors noted that “There is no single representation that is best for all time-series datasets.” Instead, “the optimal representation depends on the structure of the dataset and the questions being asked of it.” [3]
Thus, the catch22 features may not be the optimal features for all time series datasets and tasks.
On datasets that do not have “reliable shape differences between classes”, the catch22 feature representation often outperforms classifiers based on time-domain distance metrics.
The authors compared the performance of the catch22 features to the time series features available in the tsfeatures R package. On the same set of classification tasks, the tsfeatures features had a 69.4% mean accuracy, compared to catch22’s 71.7%.
Implementation
Extraction of the catch22 features has been implemented in C, with wrappers in Python, R, and Matlab. An open-source implementation of catch22 can be found on GitHub.
The C version of catch22 exhibits near-linear computational complexity, O(N^1.16), in the time series length N. For a time series with 10,000 observations, the catch22 features can be computed in 0.5 seconds.
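As a back-of-the-envelope aid, the two numbers above can be combined into a rough runtime estimate for other series lengths. This is only an extrapolation I am making from the quoted figures; it assumes the power law holds outside the measured range and ignores hardware differences.

```python
def estimated_runtime_seconds(n, ref_n=10_000, ref_seconds=0.5, exponent=1.16):
    """Scale the reference timing by (n / ref_n) ** exponent."""
    return ref_seconds * (n / ref_n) ** exponent


for length in (1_000, 10_000, 100_000):
    print(length, round(estimated_runtime_seconds(length), 3))
```

Under these assumptions, a tenfold increase in length costs about 10^1.16, or roughly 14.5 times, as much compute, so a series of 100,000 observations would take on the order of 7 seconds.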
The code for the feature selection pipeline that produced the 22 features is available on GitHub at https://github.com/chlubba/op_importance.
Application to real problems
A wide range of features can be extracted from time series to describe the many properties and dynamics of a series.
The features analyzed in the HCTSA paper, which are available on GitHub, are comprehensive and informative. The key challenge is that there are “too many” features for most applications.
The catch22 features are tailored to capture key properties of the UCR/UEA datasets, whose series are short and phase-aligned. The feature selection method could be rerun to generate reduced feature sets tailored to other applications.
Indeed, new feature selection may be necessary in many applications where the series have different properties, such as those where the location and variance of the data distribution are highly relevant. Distributional features were excluded from the catch22 analysis because the data considered were normalized. (Normalization removes location and scale information.)
A Final Word
If you enjoyed this article, please follow me for more content about time series machine learning. Articles on time series classification and a taxonomy of time series features are in the works.
Translated from: https://medium.com/towards-artificial-intelligence/highly-comparative-time-series-analysis-a-paper-review-5b51d14a291c