Time Series Smoothing for Better Clustering

In time series analysis, dirty and messy data can distort our reasoning and conclusions. This is especially true in this domain, because temporal dependency plays a crucial role when dealing with sequences ordered in time.

Noise and outliers must be handled with care, usually with ad-hoc solutions. In this situation, the tsmoothie package can save us a lot of time when preparing time series for analysis. Tsmoothie is a Python library for time series smoothing and outlier detection that can handle multiple series in a vectorized way. It’s useful because it provides the preprocessing steps we need, like denoising or outlier removal, while preserving the temporal pattern present in our raw data.

In this post, we use these tricks to improve a clustering task. More precisely, we try to identify changes in financial data using an unsupervised approach. In the end, we expect to point out clear patterns in the closing prices that can be used to inspect the hidden behavior of the market.

THE DATA

As introduced before, we work with financial time series. There are a lot of tools and premade datasets that provide and store financial data. For our purposes, we use a dataset collected from Kaggle. The Stock data 2000–2018 is a cleaned collection of daily stock prices from 2000 to 2018 for around 39 different stocks. It reports volume, open, high, low, and close prices for each day. We focus on the close prices.

For demonstration purposes, we consider the Amazon stock price, but the same findings also appear in other stock signals.

[Figure: Amazon closing price history and distribution]

TIME SERIES SMOOTHING

The first step in our workflow is time series preprocessing. Our strategy is intuitive and effective. Given a time series of closing prices, we split it into small sliding pieces. Each piece is then smoothed in order to remove outliers. The smoothing process is essential to reduce the noise in our series and to bring out the true patterns that may be present over time.

Tsmoothie provides different smoothing techniques for our purpose. It also has a built-in utility for sliding-window smoothing: the raw time series is partitioned into windows of equal length, which are then smoothed independently. We select Locally Weighted Scatterplot Smoothing (LOWESS) as the smoothing procedure.
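
To make the windowing step concrete, here is a minimal sketch in plain Python. It does not use tsmoothie: the helper name sliding_windows, the step parameter, and the toy prices are our own assumptions for illustration only.

```python
def sliding_windows(series, window_size, step=1):
    """Return consecutive windows of `window_size` points, shifted by `step`.

    Hypothetical helper for illustration; it is not part of tsmoothie.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    return [
        list(series[start:start + window_size])
        for start in range(0, len(series) - window_size + 1, step)
    ]


# Toy closing prices (made-up values, not taken from the dataset)
prices = [10.0, 10.5, 10.2, 11.0, 11.4, 11.1, 12.0]
windows = sliding_windows(prices, window_size=4, step=1)

for window in windows:
    print(window)
```

Every window has the same length, which is what lets us later treat each one as an independent sample for clustering.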

LOWESS is a powerful non-parametric technique for fitting a smooth line to data, through either univariate or multivariate smoothing. It fits a regression to the points within a moving range around each abscissa value, weighting them by distance, in order to calculate the corresponding ordinate values. The smoothing parameter (alpha) is often chosen purely by repeated trial, since there is no specific technique for selecting its exact value. A poorly chosen value may lead to “over-smoothing” or “under-smoothing”.
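
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not tsmoothie's implementation: we assume equally spaced abscissa values (0, 1, 2, ...), tricube distance weights, a local straight-line fit, and no robustness iterations. The function name lowess and the toy series are our own.

```python
def lowess(y, alpha=0.6):
    """Smooth y, observed at x = 0, 1, ..., n-1, with local weighted linear fits.

    Simplified sketch: tricube weights, degree-1 fits, no robustness iterations.
    """
    n = len(y)
    if n < 3:
        return [float(v) for v in y]  # too few points to smooth
    # alpha is the fraction of points used in each local regression
    k = min(n, max(3, round(alpha * n)))
    smoothed = []
    for i in range(n):
        dists = [abs(j - i) for j in range(n)]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        # tricube weights: closer points count more, points at distance >= h get zero
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        sw = sum(w)
        mx = sum(wj * j for j, wj in enumerate(w)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (j - mx) ** 2 for j, wj in enumerate(w))
        sxy = sum(wj * (j - mx) * (yj - my) for j, (wj, yj) in enumerate(zip(w, y)))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the weighted mean
        smoothed.append(my + slope * (i - mx))
    return smoothed


# A linear trend with an alternating +1/-1 disturbance (toy data)
noisy = [i + (-1) ** i for i in range(20)]
smooth = lowess(noisy, alpha=0.6)
```

A larger alpha uses more neighbours in each local fit and gives a flatter curve, while a smaller alpha follows the raw points more closely. This is the over-smoothing versus under-smoothing trade-off mentioned above.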

Below is the result of applying this procedure with sliding windows of length 20 (days) and alpha equal to 0.6. In other words, we compute a LOWESS fit for every generated window.

[Figure: The first smoothed windows from the AMZN stock prices]

TIME SERIES CLUSTERING

The second step uses a clustering algorithm to identify the behaviors in our time series. Creating windows of equal length makes this task easier.

Generally speaking, clustering different time series into similar groups is challenging because each data point follows a temporal structure that we must respect in order to obtain satisfactory results. The distance measures used in standard clustering algorithms, such as Euclidean distance, are often not appropriate for time series. A stronger approach is to replace the default distance measure with a metric designed for comparing time series, such as Dynamic Time Warping.
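
To see why the choice of distance matters, here is a minimal sketch of the classic dynamic-programming formulation of Dynamic Time Warping, compared with the Euclidean distance on two toy series that have the same shape but are shifted by one step. The function names and the data are our own, and the unconstrained warping path is an assumption (libraries often add window constraints).

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences (squared local cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b
                                 cost[i][j - 1],      # b[j-1] repeats a match with a
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m] ** 0.5


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Same bump, shifted by one time step
s1 = [0, 0, 1, 2, 1, 0]
s2 = [0, 1, 2, 1, 0, 0]

print(euclidean_distance(s1, s2))  # 2.0
print(dtw_distance(s1, s2))        # 0.0
```

The Euclidean distance penalizes the shift, while Dynamic Time Warping aligns the two bumps and reports no difference in shape.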

Searching for 4 clusters with K-means and the Dynamic Time Warping metric produces the following results:

[Figure: with smoothing]
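
For readers who want to experiment with the idea, here is a small self-contained sketch of clustering windows under Dynamic Time Warping. Note that it is not the K-means procedure used for the results above: K-means with DTW needs a dedicated averaging step for the centroids, so for brevity we swap in k-medoids, where each cluster is represented by one of its own windows. The function names, the farthest-first initialization, and the toy windows are all our own assumptions.

```python
def dtw(a, b):
    """Dynamic Time Warping distance (squared local cost, unconstrained path)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1] ** 0.5


def k_medoids(series, k, dist, n_iter=20):
    """Cluster `series` into k groups; returns (labels, medoid indices)."""
    n = len(series)
    d = [[dist(series[i], series[j]) for j in range(n)] for i in range(n)]

    def assign(medoids):
        # each series goes to the cluster of its nearest medoid
        return [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]

    # Deterministic farthest-first initialization, starting from the first series
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))
    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # the medoid is the member with the smallest total distance to the others
            new_medoids.append(min(members, key=lambda i: sum(d[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


# Toy windows (made-up values): two rising, two falling, two peaks, two valleys
toy_windows = [
    [0, 1, 2, 3, 4], [0, 1.1, 2, 3.1, 4],
    [4, 3, 2, 1, 0], [4, 3.1, 2, 0.9, 0],
    [0, 2, 4, 2, 0], [0, 2.1, 4, 1.9, 0],
    [4, 2, 0, 2, 4], [4, 1.9, 0, 2.1, 4],
]
labels, medoids = k_medoids(toy_windows, k=4, dist=dtw)
print(labels)
```

On this toy input the four shapes (rising, falling, peak, valley) end up in four separate groups, mirroring the kind of market movements we look for in the real windows.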

As we can see, 4 distinct clusters emerge, representing 4 different market movements: an increasing trend (cluster 0), a decreasing trend (cluster 1), a downward turning point (cluster 2), and an upward turning point (cluster 3). We can do the same with our raw time windows, without computing the smoothing, and make a comparison.

[Figure: without smoothing]

Now the difference between the 4 groups is not so marked, and it’s more difficult to interpret the generated clusters. The ability to generate meaningful groups from a clustering algorithm is the most important prerequisite of any unsupervised approach. If we can’t attach an explanation to the groups, the results can’t be used to make a decision. In this sense, adopting a smoothing preprocessing step can help the analysis.

[Figure: with smoothing]

SUMMARY

In the financial domain, the concept of volatility is fundamental to decision making. It measures the uncertainty, i.e. the risk, present in the market. Here we went deeper, extending our idea of market regimes to the short term. We identified four clear market conditions by smoothing our time series blocks to better understand the real dynamics of the data. In this post, we took advantage of time series smoothing in a financial clustering application, but this approach is also valid and useful in other contexts involving time series analysis.

CHECK MY GITHUB REPO

Keep in touch: Linkedin

Translated from: https://towardsdatascience.com/time-series-smoothing-for-better-clustering-121b98f308e8
