Multivariate Time Series Analysis and Forecasting: Applying the Vector Autoregression (VAR) Model to a Real-World Multivariate Dataset

Multivariate Time Series Analysis

A univariate time series contains a single time-dependent variable, while a multivariate time series consists of several. We generally use multivariate time series analysis to model and explain the interdependencies and co-movements among the variables. The assumption in multivariate analysis is that each time-dependent variable depends not only on its own past values but also on the past values of the other variables. Multivariate time series models exploit these dependencies to provide more reliable and accurate forecasts for a given dataset, although univariate analysis often outperforms multivariate analysis in general [1]. In this article, we apply a multivariate time series method called Vector Autoregression (VAR) to a real-world dataset.

Vector Auto Regression (VAR)

A VAR model is a stochastic process model that represents a group of time-dependent variables as a linear function of their own past values and the past values of all the other variables in the group.

For instance, consider a bivariate time series model that describes hourly temperature and wind speed as functions of their past values [2]:

temp(t) = a1 + w11*temp(t-1) + w12*wind(t-1) + e1(t)

wind(t) = a2 + w21*temp(t-1) + w22*wind(t-1) + e2(t)

where a1 and a2 are constants; w11, w12, w21, and w22 are the coefficients; and e1 and e2 are the error terms.
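To make the two equations concrete, here is a minimal simulation sketch in matrix form, y(t) = a + W·y(t-1) + e(t). The coefficient values, the noise scale, and the helper name simulate_var1 are illustrative assumptions, not values taken from the article or from reference [2].

```python
import numpy as np


def simulate_var1(a, W, n_steps, noise_scale=1.0, seed=0):
    """Simulate y(t) = a + W @ y(t-1) + e(t) with Gaussian noise e(t)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    W = np.asarray(W, dtype=float)
    y = np.zeros((n_steps, a.size))          # the process starts at zero
    for t in range(1, n_steps):
        e_t = rng.normal(scale=noise_scale, size=a.size)
        y[t] = a + W @ y[t - 1] + e_t
    return y


# Hypothetical coefficients: row 0 is the temp equation, row 1 the wind equation.
a = [1.0, 0.5]                      # a1, a2
W = [[0.6, 0.2],                    # w11, w12
     [0.1, 0.5]]                    # w21, w22
series = simulate_var1(a, W, n_steps=200)
print(series[:3])
```

Each row of W holds the weights of one equation, so the off-diagonal entries (w12 and w21) are what couple the two variables together.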

Dataset

statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests [3]. It also ships with several example datasets, including time series data. We load one of these datasets from the package.

To load the data, we import the required libraries and then read the dataset:

import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
data = sm.datasets.macrodata.load_pandas().data
data.head(2)

The output shows the first two observations of the total dataset:

Figure: A snippet of the dataset

The dataset contains several time series. For this experiment we take only two time-dependent variables, “realgdp” and “realdpi”, and use the “year” column as the index.

data1 = data[["realgdp", 'realdpi']]
data1.index = data["year"]

Output:

Figure: A snippet of the data

Let's visualize the data:

data1.plot(figsize = (8,5))

Both series show an increasing trend over time, with slight ups and downs.

Stationarity

Before applying VAR, both time series should be stationary. Neither series appears stationary, since neither shows a constant mean and variance over time. We can confirm this with a statistical test such as the Augmented Dickey-Fuller (ADF) test, using the AIC to select the lag length.

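from statsmodels.tsa.stattools import adfuller

# ADF test on the undifferenced series; the lag length is chosen by AIC
adfuller_test = adfuller(data1['realgdp'], autolag="AIC")
print("ADF test statistic: {}".format(adfuller_test[0]))
print("p-value: {}".format(adfuller_test[1]))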

Output:

Figure: Result of the statistical test for stationarity

In both cases the p-value is not statistically significant, so we cannot reject the null hypothesis of a unit root, and we treat both series as non-stationary.

Differencing

As neither series is stationary, we difference both and then check stationarity again.

data_d = data1.diff().dropna()

The “realgdp” series becomes stationary after first differencing, as the p-value of the ADF test is statistically significant.

Figure: ADF test for the first-differenced realgdp data

The “realdpi” series also becomes stationary after first differencing, as the p-value of the ADF test is statistically significant.

Figure: ADF test for the first-differenced realdpi data

Model

In this section, we apply the VAR model to the first-differenced series. We carry out a train-test split and keep the last 10 observations as test data. (The macrodata series are quarterly, so each observation is one quarter rather than one day.)

train = data_d.iloc[:-10,:]
test = data_d.iloc[-10:,:]

Searching optimal order of VAR model

In VAR modeling, we use the Akaike Information Criterion (AIC) as the model selection criterion. In simple terms, we select the order (p) of the VAR based on the best (lowest) AIC score. The AIC penalizes models for being too complex, even though more complex models may perform slightly better on some other model selection criteria. Hence, we expect an inflection point (a minimum) when searching over the order (p): the AIC score should decrease as p grows up to a certain order and then start increasing. To find it, we perform a grid search over the order (p).

forecasting_model = VAR(train)
results_aic = []
for p in range(1, 10):  # orders 1 through 9
    results = forecasting_model.fit(p)
    results_aic.append(results.aic)

The first line of the code creates the VAR model from the training data. The rest of the code loops over fitting orders 1 to 9 and records the AIC score of each fit. We can plot the AIC scores against the orders to see where the minimum lies:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()
plt.plot(list(np.arange(1, 10, 1)), results_aic)
plt.xlabel("Order")
plt.ylabel("AIC")
plt.show()

Figure: Investigating the optimal order of the VAR model

From the plot, the lowest AIC score is achieved at order 2, after which the AIC scores increase as the order p grows. Hence, we select 2 as the optimal order of the VAR model and fit the forecasting model with order 2.

Let's check the summary of the model:

results = forecasting_model.fit(2)
results.summary()

The summary output contains a lot of information:

Figure: Summary output of the fitted model

Forecasting

We use 2 as the optimal order in fitting the VAR model. Thus, we take the final 2 observations in the training data to forecast the immediate next step (i.e., the first observation of the test data).

Figure: Forecasting test data

Now, after fitting the model, we forecast the test period: the last 2 observations of the training data serve as the lagged values, and steps is set to 10 because we want to forecast the next 10 observations.

lagged_values = train.values[-2:]
forecast = pd.DataFrame(results.forecast(y=lagged_values, steps=10), index=test.index, columns=['realgdp_1d', 'realdpi_1d'])
forecast

Output:

Figure: First-differenced forecasts
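For intuition about what a multi-step forecast does with the two lagged observations, the recursion can be written out by hand: each forecast is the intercept plus the lag-1 and lag-2 coefficient matrices applied to the two most recent values, and each new forecast is fed back in as a lag for the next step. The coefficient values and the function name var2_forecast below are hypothetical, chosen only to illustrate the recursion; they are not the fitted values from the article.

```python
import numpy as np


def var2_forecast(c, A1, A2, last_two, steps):
    """Iterate y(t) = c + A1 @ y(t-1) + A2 @ y(t-2) forward for `steps` periods.

    `last_two` holds the two most recent observations, oldest first.
    """
    history = [np.asarray(row, dtype=float) for row in last_two]
    out = []
    for _ in range(steps):
        y_next = c + A1 @ history[-1] + A2 @ history[-2]
        out.append(y_next)
        history.append(y_next)      # the forecast becomes a lag for the next step
    return np.array(out)


# Hypothetical VAR(2) parameters for two variables.
c = np.array([1.0, 2.0])
A1 = np.array([[0.5, 0.0], [0.0, 0.5]])
A2 = np.array([[0.1, 0.0], [0.0, 0.1]])
last_two = np.array([[10.0, 20.0],   # y(t-2)
                     [12.0, 22.0]])  # y(t-1)
path = var2_forecast(c, A1, A2, last_two, steps=3)
print(path)
```

Because later steps are built on earlier forecasts rather than on observed values, forecast errors can compound as the horizon grows.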

Note that these forecasts are for the first-differenced series. Hence, we must convert the differenced forecasts back to the original scale by adding their cumulative sum to the last observed value of each original series before the test period.

# Undo the differencing: last pre-test level plus the cumulative sum of forecast differences
forecast["realgdp_forecasted"] = data1["realgdp"].iloc[-10-1] + forecast['realgdp_1d'].cumsum()
forecast["realdpi_forecasted"] = data1["realdpi"].iloc[-10-1] + forecast['realdpi_1d'].cumsum()

Output:

Figure: Forecasted values for the first-differenced series and for the original series

The first two columns are the forecasted values for the first-differenced series, and the last two columns are the forecasted values for the original series.
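The inversion relies on the identity that a level equals the last known level plus the cumulative sum of the differences that follow it. A toy check with made-up numbers (not the article's data) illustrates this:

```python
import pandas as pd

levels = pd.Series([100.0, 103.0, 101.0, 106.0, 110.0])   # made-up original series
diffs = levels.diff().dropna()                             # first differences

n_test = 2                                   # pretend the last 2 points are the test period
base = levels.iloc[-n_test - 1]              # last level before the test period
recovered = base + diffs.iloc[-n_test:].cumsum()

print(recovered.tolist())
```

With true differences the original levels are recovered exactly; with forecasted differences, any error in an early step carries through the cumulative sum into all later levels.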

Now, we visualize the original test values and the values forecasted by the VAR model.

Figure: Original and forecasted values for realgdp and realdpi

The original realdpi and the forecasted realdpi show a similar pattern throughout the forecast period. For realgdp, the first half of the forecasted values follows a pattern similar to the original values, whereas the second half does not.
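The visual comparison can be complemented with numeric error measures. The sketch below computes RMSE and MAE per column; the helper name forecast_errors and the toy numbers are assumptions for illustration, not results from the article. With the article's variables, the actual values would be data1.iloc[-10:] and the predictions would be the two *_forecasted columns.

```python
import numpy as np
import pandas as pd


def forecast_errors(actual, predicted):
    """Return RMSE and MAE for each column of two equally shaped DataFrames."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return pd.DataFrame(
        {"rmse": np.sqrt((err ** 2).mean(axis=0)), "mae": np.abs(err).mean(axis=0)},
        index=actual.columns,
    )


# Toy values, not results from the article.
actual = pd.DataFrame({"realgdp": [10.0, 12.0], "realdpi": [5.0, 7.0]})
predicted = pd.DataFrame({"realgdp": [11.0, 11.0], "realdpi": [5.0, 9.0]})
errors = forecast_errors(actual, predicted)
print(errors)
```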

To sum up, in this article we discussed multivariate time series analysis and applied the VAR model to a real-world multivariate time series dataset.

You can also read the article “A real-world time series data analysis and forecasting”, where I applied ARIMA (a univariate time series analysis model) to forecast univariate time series data.

[1] https://homepage.univie.ac.at/robert.kunst/prognos4.pdf

[2] https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/

[3] https://www.statsmodels.org/stable/index.html

Translated from: https://towardsdatascience.com/multivariate-time-series-forecasting-456ace675971
