Hands-on Time Series Forecasting with Python

Time series analysis is the practice of extracting meaningful summaries and statistical information from data points recorded in chronological order. It is widely used in applied science and engineering fields that involve temporal measurements, such as signal processing, pattern recognition, mathematical finance, weather forecasting, control engineering, healthcare digitization, and smart-city applications.

As we continuously monitor and collect time series data, the opportunities for applying time series analysis and forecasting keep growing.

In this article, I will show how to develop an ARIMA model with a seasonal component for time series forecasting in Python. We will follow the Box-Jenkins three-stage modeling approach to arrive at the best model for forecasting.

I encourage anyone to check out the Jupyter Notebook on my GitHub for the full analysis.

In time series analysis, the Box-Jenkins method, named after the statisticians George Box and Gwilym Jenkins, applies ARIMA models to find the best fit of a time series model.

The method consists of three steps: model identification, parameter estimation, and model validation.


Time Series

As data, we will use the monthly milk production dataset. It contains monthly production records, in pounds per cow, from 1962 to 1975.

import pandas as pd

# Parse the 'Date' column as datetimes and use it as the index
df = pd.read_csv('./monthly_milk_production.csv', sep=',',
                 parse_dates=['Date'], index_col='Date')

Time Series Data Inspection

[Figure: monthly milk production over time]

As we can observe from the plot above, the data has an increasing trend and very strong seasonality.

We will use Python's statsmodels library to perform a time series decomposition, a statistical method that deconstructs a time series into its trend, seasonal, and residual components.

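import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# period=12 for monthly data (older statsmodels releases called this argument `freq`)
decomposition = seasonal_decompose(df['Production'], period=12)
decomposition.plot()
plt.show()

[Figure: decomposition of the series into trend, seasonal, and residual components]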

The decomposition plot indicates that monthly milk production has an increasing trend and a seasonal pattern.

If we want to observe the seasonal component more precisely, we can plot the data by month.

[Figure: milk production plotted by month]
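One simple way to look at the data by month is to average the observations for each calendar month. The sketch below is not the code behind the article's figure; it uses a synthetic stand-in series (the `demo` frame, its trend, and its seasonal amplitude are all assumptions) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend + yearly cycle + noise
idx = pd.date_range("1962-01-01", periods=168, freq="MS")
month = idx.month.to_numpy()
rng = np.random.default_rng(0)
values = (600 + 1.5 * np.arange(168)
          + 40 * np.sin(2 * np.pi * (month - 1) / 12)
          + rng.normal(0, 5, 168))
demo = pd.DataFrame({"Production": values}, index=idx)

# Average production for each calendar month (1 = January, ..., 12 = December)
monthly_profile = demo.groupby(demo.index.month)["Production"].mean()
print(monthly_profile.round(1))
```

With the real data, replacing `demo` with `df` should give the same kind of twelve-value seasonal profile, which can then be drawn as a bar or line chart.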

1. Model Identification

In this step, we need to determine whether the time series is stationary and, if not, what kind of transformation is required to make it stationary.

A time series is stationary when its statistical properties, such as mean, variance, and autocorrelation, are constant over time. In other words, a stationary series does not depend on time and has no trend or seasonal effects. Most statistical forecasting methods are based on the assumption that the time series is (approximately) stationary.

Imagine a time series that consistently increases over time. Its sample mean and variance grow with the size of the sample, so they will always underestimate the mean and variance in future periods. This is why we need to start with a stationary time series, one from which the time-dependent trend and seasonal components have been removed.
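To make the point about a drifting mean concrete, here is a small sketch that compares the mean of the first and second half of two series. Both series are synthetic assumptions (a linear trend plus noise, and the noise alone), and `half_means` is a hypothetical helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Hypothetical series: one with a linear upward trend, one without
noise = rng.normal(0.0, 1.0, n)
trending = 0.5 * np.arange(n) + noise
flat = noise

def half_means(x):
    """Return the mean of the first and second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

trend_first, trend_second = half_means(trending)
flat_first, flat_second = half_means(flat)

print(f"trending: {trend_first:.2f} -> {trend_second:.2f}")
print(f"flat:     {flat_first:.2f} -> {flat_second:.2f}")
```

For the trending series the two halves have very different means, so a mean estimated on early data understates later values. For the flat series the two means stay close together.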

We can check stationarity by using different approaches:

  • We can inspect plots, such as the decomposition plot we saw previously, where we have already observed a trend and seasonality.
  • We can plot the autocorrelation function (ACF) and partial autocorrelation function (PACF), which show how time series values depend on their previous values. If the time series is stationary, the ACF/PACF plots will show a quick cut-off after a small number of lags.

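import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))  # ACF on top, PACF below
plot_acf(df['Production'], lags=50, ax=ax1)
plot_pacf(df['Production'], lags=50, ax=ax2)
plt.show()

[Figure: ACF and PACF plots of the original series]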

Here we see that neither the ACF nor the PACF plot shows a quick cut-off into the 95% confidence interval area (in blue), which means the time series is not stationary.

  • We can apply statistical tests, of which the Augmented Dickey-Fuller (ADF) test is a widely used one. The null hypothesis of the test is that the time series has a unit root, meaning that it is non-stationary. We interpret the result using the p-value of the test. If the p-value is lower than the threshold (5% or 1%), we reject the null hypothesis and treat the time series as stationary. If the p-value is higher than the threshold, we fail to reject the null hypothesis and treat the time series as non-stationary.

from statsmodels.tsa.stattools import adfuller

dftest = adfuller(df['Production'])
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
# dftest[4] maps each significance level to its critical value
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print('Results of Dickey-Fuller Test:')
print(dfoutput)

Results of Dickey-Fuller Test:
Test Statistic                  -1.303812
p-value                          0.627427
#Lags Used                      13.000000
Number of Observations Used    154.000000
Critical Value (1%)             -3.473543
Critical Value (5%)             -2.880498
Critical Value (10%)            -2.576878

The p-value is greater than the threshold, so we fail to reject the null hypothesis: the time series is non-stationary and has a time-dependent component.
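The decision rule can be spelled out with the numbers reported above. This is only a sketch: `adf_decision` is a hypothetical helper written for illustration, and the 5% default for `alpha` is an assumption matching the threshold mentioned in the text.

```python
# Values copied from the ADF output reported above
test_statistic = -1.303812
p_value = 0.627427
critical_values = {"1%": -3.473543, "5%": -2.880498, "10%": -2.576878}

def adf_decision(stat, p, crit, alpha=0.05):
    """Hypothetical helper: summarize an ADF result at significance level alpha."""
    return {
        "reject_by_p_value": p < alpha,
        # The unit-root null is rejected when the statistic is more negative
        # than the critical value for that level
        "reject_by_critical_value": {level: stat < value for level, value in crit.items()},
    }

decision = adf_decision(test_statistic, p_value, critical_values)
print(decision)
```

Both views agree here: the p-value is above 0.05, and the test statistic is less negative than every critical value, so the unit-root null hypothesis is not rejected at any of the three levels.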

All these approaches suggest that our data is non-stationary. Now we need to find a way to make it stationary.

Non-stationarity in a time series has two major causes: trend and seasonality. We can apply differencing to make the time series stationary by subtracting previous observations from the current ones. Doing so removes the trend and seasonality and stabilizes the mean of the series. Because our data has both a trend and a seasonal component, we apply one non-seasonal difference, diff(), and one seasonal difference, diff(12).

df_diff = df.diff().diff(12).dropna()

[Figures: plot of the differenced series and its ACF/PACF plots]
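It can help to see what the combined differencing computes. The sketch below applies the same `diff().diff(12)` chain to a synthetic series (the series `y` and its parameters are assumptions) and checks it against the term-by-term expression.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series with the same length as the milk data (168 months)
idx = pd.date_range("1962-01-01", periods=168, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(600 + 1.5 * np.arange(168) + rng.normal(0, 5, 168), index=idx)

# Non-seasonal difference followed by a seasonal difference of lag 12
y_diff = y.diff().diff(12).dropna()

# The same quantity written out term by term:
# (y_t - y_{t-1}) - (y_{t-12} - y_{t-13})
manual = (y - y.shift(1) - y.shift(12) + y.shift(13)).dropna()

print(len(y), len(y_diff))            # 13 observations are lost to differencing
print(np.allclose(y_diff, manual))
```

Each differenced value needs the observation from 13 months earlier, which is why the differenced series is 13 rows shorter than the original.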

Results of Dickey-Fuller Test:
Test Statistic                  -5.038002
p-value                          0.000019
#Lags Used                      11.000000
Number of Observations Used    143.000000
Critical Value (1%)             -3.476927
Critical Value (5%)             -2.881973
Critical Value (10%)            -2.577665

Applying the stationarity checks listed earlier, we see that the plot of the differenced series reveals no specific trend or seasonal behavior, the ACF/PACF plots cut off quickly, and the ADF test returns a p-value of almost 0.00, which is lower than the threshold. All these checks suggest that the differenced data is stationary.
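The difference between a slowly decaying ACF and a quick cut-off can be reproduced on synthetic data. In this sketch, `sample_acf` is a hypothetical helper implementing the usual sample autocorrelation, the series are assumptions (a linear trend plus noise and its first difference), and `band` is the approximate 95% band for white noise.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D array for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
n = 200
noise = rng.normal(0.0, 1.0, n)
trending = 0.5 * np.arange(n) + noise   # hypothetical non-stationary series
differenced = np.diff(trending)         # first difference removes the linear trend

band = 1.96 / np.sqrt(n)                # approximate 95% band for white noise
acf_trend = sample_acf(trending, 12)
acf_diff = sample_acf(differenced, 12)

print("95% band:            +/-", round(band, 3))
print("lag 12, trending:    ", round(acf_trend[12], 3))
print("lag 12, differenced: ", round(acf_diff[12], 3))
```

The trending series still shows a large autocorrelation at lag 12, while the differenced series drops to values near the band after the first lag.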

We will apply the Seasonal Autoregressive Integrated Moving Average model (SARIMA, or Seasonal ARIMA), an extension of ARIMA that supports time series data with a seasonal component. ARIMA stands for Autoregressive Integrated Moving Average and is one of the most common techniques for time series forecasting.

ARIMA models are written as ARIMA(p, d, q), and SARIMA models as SARIMA(p, d, q)(P, D, Q)m.

AR(p) is a regression model that uses the dependence between an observation and a number of lagged observations.

I(d) is the order of differencing needed to make the time series stationary.

MA(q) is a model that uses the dependence between an observation and the residual errors of a moving average model applied to lagged observations.

(P, D, Q)m is the additional set of parameters that describes the seasonal components of the model. P, D, and Q are the seasonal autoregressive, differencing, and moving average orders, and m is the number of data points in each seasonal cycle.
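In backshift notation, where $B y_t = y_{t-1}$, the pieces above combine into a single equation. This is the standard textbook form rather than something stated in the original article, and the sign convention for the moving-average polynomials varies between references.

```latex
\phi_p(B)\,\Phi_P(B^m)\,(1 - B)^d\,(1 - B^m)^D\, y_t
  = c + \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t

\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p, \qquad
\theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q

\Phi_P(B^m) = 1 - \Phi_1 B^m - \dots - \Phi_P B^{Pm}, \qquad
\Theta_Q(B^m) = 1 + \Theta_1 B^m + \dots + \Theta_Q B^{Qm}
```

For $d = 1$, $D = 1$, and $m = 12$, the differencing part expands to $(1 - B)(1 - B^{12})\,y_t = y_t - y_{t-1} - y_{t-12} + y_{t-13}$, which is exactly what diff().diff(12) computes.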

2. Model Parameter Estimation

We will use Python's pmdarima library to automatically find the best parameters for our seasonal ARIMA model. In the auto_arima function, we specify d=1 and D=1 because we difference once for the trend and once for seasonality, m=12 because we have monthly data, trend='c' to include a constant, and seasonal=True to fit a seasonal ARIMA. We also specify trace=True to print the status of each fit, which helps us determine the best parameters by comparing AIC scores.

import pmdarima as pm

model = pm.auto_arima(df['Production'], d=1, D=1,
                      m=12, trend='c', seasonal=True,
                      start_p=0, start_q=0, max_order=6, test='adf',
                      stepwise=True, trace=True)

AIC (Akaike Information Criterion) is an estimator of out-of-sample prediction error and therefore of the relative quality of candidate models. The goal is to find the model with the lowest AIC score.
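The criterion itself is a short formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below compares three candidates; their labels, parameter counts, and log-likelihoods are made-up values for illustration, not results from the milk data.

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical candidates: (label, number of estimated parameters, maximized log-likelihood)
candidates = [
    ("model A", 3, -520.4),
    ("model B", 4, -519.9),
    ("model C", 5, -512.1),
]
scores = {label: aic(k, ll) for label, k, ll in candidates}
best = min(scores, key=scores.get)

print(scores)
print("lowest AIC:", best)
```

Model B fits slightly better than model A but pays for its extra parameter and ends up with a higher AIC, while model C improves the likelihood enough to justify two more parameters.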

[Figure: auto_arima trace output listing the AIC of each candidate model]

The auto_arima results for the various (p, d, q)(P, D, Q)m combinations show that the lowest AIC score is obtained with the parameters (1, 1, 0)(0, 1, 1, 12).

We split the dataset into a train set and a test set, using 85% of the data for training. We then create a SARIMA model on the train set with the suggested parameters, using the SARIMAX class from the statsmodels library (the X stands for exogenous variables, but we do not add any here). After fitting the model, we can also print the summary statistics.
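The article does not show the split itself, so the following is only one plausible way to do it: a chronological cut at 85% of the rows. The `demo` frame and the names `demo_train` and `demo_test` are assumptions used for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the milk data: 168 monthly observations
idx = pd.date_range("1962-01-01", periods=168, freq="MS")
demo = pd.DataFrame({"Production": np.linspace(600, 900, 168)}, index=idx)

# Chronological split: the first 85% of rows for training, the rest for testing.
# No shuffling, so the test set always lies after the training set in time.
split_point = int(len(demo) * 0.85)
demo_train = demo.iloc[:split_point]
demo_test = demo.iloc[split_point:]

print(len(demo_train), len(demo_test))
print(demo_train.index.max() < demo_test.index.min())
```

Keeping the order intact matters for time series: a random split would let the model see observations from the period it is later asked to forecast.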

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train['Production'],
                order=(1, 1, 0), seasonal_order=(0, 1, 1, 12))
results = model.fit()
results.summary()

[Figure: SARIMAX model summary]

3. Model Validation

The primary concern in model validation is to ensure that the residuals are uncorrelated and normally distributed with zero mean.

To check the residual statistics, we can plot the model diagnostics:

results.plot_diagnostics()
plt.show()
[Figure: model diagnostics with four panels]

  • The top-left plot shows the residuals over time. They appear to be white noise with no seasonal component.
  • The top-right plot shows that the KDE line (in red) closely follows the N(0,1) line, the standard notation for a normal distribution with zero mean and a standard deviation of 1, suggesting that the residuals are normally distributed.
  • The bottom-left normal Q-Q plot shows that the ordered distribution of residuals (in blue) closely follows the linear trend of samples taken from a standard normal distribution, again suggesting that the residuals are normally distributed.
  • The bottom-right plot is a correlogram, indicating that the residuals have a low correlation with their lagged versions.

All these results suggest that the residuals are normally distributed with low correlation.

To measure the accuracy of the forecasts, we compare the predicted values on the test set with the actual values.

forecast_object = results.get_forecast(steps=len(test))
mean = forecast_object.predicted_mean
conf_int = forecast_object.conf_int()
dates = mean.index
[Figure: forecast and confidence interval plotted against the actual test values]

From the plot, we see that the model's predictions nearly match the actual values of the test set.

from sklearn.metrics import r2_score

predictions = mean  # assumption: `predictions` is the forecast mean computed above
r2_score(test['Production'], predictions)
# Output: 0.9240433686806808

The R-squared (coefficient of determination) of the model is 0.92, meaning the forecasts account for about 92% of the variance in the test set values.

import numpy as np

mean_absolute_percentage_error = np.mean(np.abs(predictions - test['Production']) / np.abs(test['Production'])) * 100
# Output: 1.649905

Mean absolute percentage error (MAPE) is one of the most widely used accuracy metrics. It expresses the average absolute error as a percentage of the actual values. The MAPE of the model is 1.65, indicating that the forecasts are off by about 1.65% on average.
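Both metrics are easy to compute by hand, which makes their meaning clearer. The sketch below implements them with NumPy on a tiny made-up example; the helper names `r_squared` and `mape` and the four data points are assumptions, not values from the milk forecast.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(predicted - actual) / np.abs(actual)) * 100

# Hypothetical actual and predicted values
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]

print(round(r_squared(actual, predicted), 4))
print(round(mape(actual, predicted), 4))
```

Every prediction here is off by 10 units, so the squared errors are identical, but the percentage error is largest for the smallest actual value. MAPE therefore weights errors on small observations more heavily than R-squared does.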

Since both the diagnostic plots and the accuracy metrics indicate that the model fits the data well, we can go on to produce future forecasts.

Here is the forecast for the next 60 months.

results.get_forecast(steps=60)
[Figure: forecast for the next 60 months]

I hope you enjoyed following this tutorial and building time series forecasts in Python.


Let me know if you have any questions or suggestions.✨


Translated from: https://towardsdatascience.com/hands-on-time-series-forecasting-with-python-d4cdcabf8aac
