Time Series Smoothing for Better Clustering

In time series analysis, dirty and messy data can distort our reasoning and conclusions. This is especially true in this domain because temporal dependency plays a crucial role when dealing with temporal sequences.

Noise and outliers must be handled with care, usually through ad-hoc solutions. In this situation, the tsmoothie package can save us a lot of time when preparing time series for analysis. Tsmoothie is a Python library for time series smoothing and outlier detection that can handle multiple series in a vectorized way. It is useful because it provides the preprocessing steps we need, such as denoising or outlier removal, while preserving the temporal pattern present in the raw data.

In this post, we use these tricks to improve a clustering task. More precisely, we try to identify changes in financial data with an unsupervised approach. In the end, we expect to point out clear patterns in the closing prices that can be used to inspect the hidden behavior of the market.

THE DATA

As introduced before, we work with financial time series. Many tools and premade datasets provide and store financial data. For our purposes, we use a dataset collected from Kaggle. Stock data 2000–2018 is a cleaned collection of stock prices from 2000 to 2018 for around 39 different stocks. It reports daily volumes and open, high, low, and close prices. We focus on the close prices.

For demonstration purposes, we consider the Amazon stock price, but the same findings also appear in other stock signals.

Figure: Amazon closing price history and distribution

Time Series Smoothing

The first step in our workflow is time series preprocessing. Our strategy is intuitive and effective. Given a time series of closing prices, we split it into small sliding pieces. Each piece is then smoothed in order to remove outliers. The smoothing process is essential to reduce the noise present in our series and to point out the true patterns that may appear over time.
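The splitting step can be sketched with plain NumPy. This is a minimal illustration, not tsmoothie's own utility: the function name `make_windows`, the `step` parameter, and the synthetic `prices` array are assumptions made for the example, and the window length of 20 simply matches the value used in this post.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, window_length=20, step=1):
    """Split a 1-D series into overlapping windows of equal length.

    `step` (hypothetical parameter) is the offset between the starts of
    two consecutive windows; step=1 yields every possible window.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim != 1:
        raise ValueError("series must be one-dimensional")
    if len(series) < window_length:
        raise ValueError("series is shorter than the window length")
    # sliding_window_view returns a read-only view; copy to get a normal array
    windows = sliding_window_view(series, window_length)
    return windows[::step].copy()


# Synthetic stand-in for a column of closing prices (assumption)
prices = np.arange(100, dtype=float)
windows = make_windows(prices, window_length=20)
print(windows.shape)  # (81, 20)
```

Each row of `windows` is one piece of the series, so every later step (smoothing, clustering) can operate row by row on an array of equal-length sequences.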

Tsmoothie provides different smoothing techniques for our purpose. It also has a built-in utility for a sliding smoothing approach: the raw time series is partitioned into windows of equal length, which are then smoothed independently. We select Locally Weighted Scatterplot Smoothing (LOWESS) as the smoothing procedure.

LOWESS is a powerful non-parametric technique for fitting a smooth line to given data, through either univariate or multivariate smoothing. For each abscissa value, it fits a regression on the points in a moving range around it, weighted according to their distance, in order to calculate the corresponding ordinate value. The smoothing parameter (alpha) is often selected purely by repeated trial, since there is no specific technique for choosing its exact value. A poorly chosen value may lead to over-smoothing or under-smoothing.

Below is the result of applying this procedure with sliding windows of length 20 (days) and alpha equal to 0.6. In other words, we compute a LOWESS for every generated window.
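To make the idea concrete, here is a minimal NumPy sketch of a LOWESS-style smoother applied to a single window of 20 points. It is not tsmoothie's implementation: it assumes equally spaced observations, uses local linear fits with tricube weights, skips the robustness iterations of the full LOWESS algorithm, and treats `frac` as the alpha mentioned above (the fraction of points used in each local fit). The function name and the synthetic window are assumptions for illustration.

```python
import numpy as np


def lowess_smooth(y, frac=0.6):
    """Smooth equally spaced observations with locally weighted linear fits.

    Simplified sketch: tricube weights, degree-1 local regression,
    no robustness iterations.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 points")
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    x = np.arange(n, dtype=float)
    k = min(n, max(3, int(np.ceil(frac * n))))  # points per local fit
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        bandwidth = np.sort(dist)[k - 1]  # distance to the k-th nearest point
        weights = np.clip(1.0 - (dist / bandwidth) ** 3, 0.0, None) ** 3
        # Weighted least squares line centred on x[i]; its intercept is the fit at x[i]
        root_w = np.sqrt(weights)
        design = np.column_stack([np.ones(n), x - x[i]])
        coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
        smoothed[i] = coef[0]
    return smoothed


# One synthetic 20-day window: a linear trend plus noise (assumption)
rng = np.random.default_rng(0)
trend = np.linspace(100.0, 110.0, 20)
window = trend + rng.normal(scale=1.0, size=20)
smoothed = lowess_smooth(window, frac=0.6)
```

A larger `frac` averages over more neighbours and gives a flatter curve, while a smaller one follows the raw points more closely, which is the over-smoothing versus under-smoothing trade-off described above.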

Figure: The first smoothed windows from the AMZN stock prices

Time Series Clustering

The second step uses a clustering algorithm to identify the behaviors in our time series. Creating windows of equal length is meant to make this task easier.

Generally speaking, clustering different time series into similar groups is challenging because each data point follows a temporal structure that we must respect in order to obtain satisfactory results. The distance measures used in standard clustering algorithms, such as Euclidean distance, are often not appropriate for time series. A stronger approach is to replace the default distance measure with a metric designed for comparing time series, such as Dynamic Time Warping.
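A small example shows why the choice of distance matters. The sketch below is a textbook dynamic-programming version of Dynamic Time Warping with squared point-wise costs and no window constraint, written here for illustration (the function name and the two toy series are assumptions). The two series have the same shape shifted by one step, so Euclidean distance penalizes them while DTW aligns them perfectly.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


# Same bump, shifted by one time step
a = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])

euclidean = float(np.linalg.norm(a - b))
dtw = dtw_distance(a, b)
print(euclidean, dtw)  # 2.0 0.0
```

Because DTW is allowed to stretch or compress the time axis, two windows that show the same movement slightly earlier or later are treated as similar, which is usually what we want when grouping price patterns.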

Searching for 4 clusters with K-means and the Dynamic Time Warping metric produces the results shown in the figure below.
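The clustering step can be illustrated with a self-contained sketch. Note the substitution: K-means with DTW needs a DTW-specific averaging routine to compute centroids, so this example uses K-medoids instead, where each cluster centre is one of the actual windows. It is therefore not the exact procedure behind the figures in this post. The function names, the farthest-first initialization, and the four synthetic shapes (rising, falling, peak, trough) are all assumptions made for the example.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic Time Warping distance with squared point-wise costs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def dtw_kmedoids(windows, n_clusters=4, n_iter=20):
    """Cluster equal-length windows with K-medoids on a DTW distance matrix."""
    windows = np.asarray(windows, dtype=float)
    n = len(windows)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(windows[i], windows[j])

    # Farthest-first initialization: deterministic and spreads the medoids out
    medoids = [0]
    while len(medoids) < n_clusters:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)

    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if len(members):
                # New medoid: the member with the smallest total distance to the others
                within = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids


# Four synthetic shapes, three slightly shifted copies of each (assumption)
t = np.linspace(0.0, 1.0, 20)
shapes = [t, 1 - t, 1 - np.abs(2 * t - 1), np.abs(2 * t - 1)]
windows = np.array([s + 0.01 * k for s in shapes for k in range(3)])

labels, medoids = dtw_kmedoids(windows, n_clusters=4)
print(labels)
```

On real windows, the same function can be run once on the smoothed windows and once on the raw ones to reproduce the kind of comparison discussed in this section. The cluster numbering it returns is arbitrary and will not necessarily match the numbering used in the figures.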

Figure: with smoothing

As we can see, 4 distinct clusters emerge, representing 4 different market movements: an increasing trend (cluster 0), a decreasing trend (cluster 1), a downward turning point (cluster 2), and an upward turning point (cluster 3). We can do the same with our raw time windows, without computing the smoothing, and make a comparison.

Figure: without smoothing

Now the difference between the 4 groups is not so marked, and it is more difficult to interpret the generated clusters. The ability to generate meaningful groups from a clustering algorithm is a key prerequisite of any unsupervised approach. If we can't attach an explanation to the results, they can't be used to make a decision. In this sense, adopting a smoothing preprocessing step can help the analysis.

Figure: with smoothing

SUMMARY

In the financial domain, the concept of volatility is fundamental to making decisions. It measures the uncertainty, i.e. the risk, present in the market. Here we went deeper, extending our idea of market regimes to the short term. We identified four clear market conditions by smoothing our time series blocks to better understand the real dynamics of the data. In this post, we took advantage of time series smoothing in a financial clustering application, but the approach is valid and useful in other contexts involving time series analysis.

CHECK MY GITHUB REPO

Keep in touch: Linkedin

Translated from: https://towardsdatascience.com/time-series-smoothing-for-better-clustering-121b98f308e8
