Seasonal Time Series Data Analysis: A How-To Guide on Exploratory Data Analysis for Time Series Data

Why Exploratory Data Analysis?

You may have heard that before tackling a machine learning problem, it is good practice to analyze the data end to end through proper exploratory data analysis (EDA). A common question that follows is: why EDA?

· What makes EDA so important?

· How do you do proper EDA and get insights from the data?

· What is the right way to begin exploratory data analysis?

So, let us see how we can perform exploratory data analysis and get useful insights from our data. For this walkthrough I will use the dataset from Kaggle's M5 Forecasting Accuracy competition.

Understanding the Problem Statement:

Before you begin EDA, it is important to understand the problem statement. EDA depends on what you are trying to solve or find. If your EDA is not aligned with the problem you are solving, it will amount to plotting meaningless graphs.

So, let us understand the problem statement for this data.

Problem Statement:

We have hierarchical sales data for Walmart products across different categories and stores in three states: California, Wisconsin and Texas. From this data we need to predict the sales of each product for the next 28 days. The training data consists of the daily unit sales of each product for 1913 days (d_1 to d_1913), and we use it to forecast the days that follow.

The following files are provided as part of the competition:

  1. calendar.csv: Contains information about the dates on which the products are sold.
  2. sales_train_validation.csv: Contains the historical daily unit sales data per product and store [d_1 to d_1913].
  3. sample_submission.csv: The correct format for submissions. Reference the Evaluation tab for more info.
  4. sell_prices.csv: Contains information about the price of the products sold per store and date.
  5. sales_train_evaluation.csv: Includes sales [d_1 to d_1941] (labels used for the Public leaderboard).

Using this dataset we need to make the sales prediction for the next 28 days.

Analyzing Dataframes:

Once you have understood the problem statement, the first step in EDA is to analyze the dataframes and understand the features present in the dataset.

As mentioned earlier, this data comes in 5 different CSV files. So, to begin with, we will print the head of each dataframe to get an intuition for the features and the dataset.

Here, I am using Python's pandas library to read the data and print the first few rows. View the first few rows and write down your observations:
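
The reading step might look like the following minimal sketch. It assumes the competition CSVs sit in a local folder; the `load_frames` and `preview` helpers and the `data_dir` argument are illustrative names, not taken from the original code.

```python
from pathlib import Path

import pandas as pd

# File names as listed in the competition data.
CSV_FILES = [
    "calendar.csv",
    "sales_train_validation.csv",
    "sample_submission.csv",
    "sell_prices.csv",
    "sales_train_evaluation.csv",
]


def load_frames(data_dir):
    """Read every competition CSV found in data_dir into a dict of DataFrames."""
    frames = {}
    for name in CSV_FILES:
        path = Path(data_dir) / name
        if path.exists():  # skip files that were not downloaded
            frames[path.stem] = pd.read_csv(path)
    return frames


def preview(frames, n=5):
    """Print the shape and the first n rows of each DataFrame."""
    for name, df in frames.items():
        print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
        print(df.head(n))
```

Calling `preview(load_frames(data_dir))` with the folder that holds the downloaded files prints one block per file, which is a convenient place to start writing down observations.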

Calendar Data:

First Few Rows:

Value Counts Plot:

To get a visual idea of our data, we will plot the value counts for each categorical feature of the calendar dataframe. For this we will use the Seaborn library.

Figure: Code snippet for plotting value counts of each feature
Figure: Value counts for each day of the week
Figure: Value counts for each month
Figure: Value counts for each year
Figure: Value counts for each event based on name
Figure: Value counts for each event based on event_name
Figure: Value counts for the type of event in type_1
Figure: Value counts for the type of event in type_2
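
A minimal sketch of how such value-count plots could be produced is given below. The helper names and the toy `calendar` frame are assumptions for illustration; the column names follow calendar.csv. The plotting function uses seaborn's `countplot`, as the text describes, while the counting helper uses only pandas.

```python
import pandas as pd


def value_counts_table(df, columns):
    """Return {column: value counts} for the listed columns (NaN is ignored)."""
    return {col: df[col].value_counts() for col in columns}


def plot_value_counts(df, columns):
    """Draw one seaborn count plot per column."""
    # Imported here so the counting helper above works without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    for col in columns:
        plt.figure(figsize=(10, 4))
        sns.countplot(data=df, x=col, order=df[col].value_counts().index)
        plt.title(f"Value counts for {col}")
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()


# Tiny stand-in for calendar.csv; the real file has one row per day.
calendar = pd.DataFrame(
    {
        "weekday": ["Saturday", "Sunday", "Monday", "Saturday"],
        "event_type_1": [None, "Sporting", None, "Cultural"],
    }
)
counts = value_counts_table(calendar, ["weekday", "event_type_1"])
print(counts["weekday"])
```

Because `value_counts` drops missing values by default, days without an event do not show up in the event plots.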

Observations from Calendar Dataframe:

  1. We have the date, weekday, month, year and event for each day for which we have forecast information.
  2. We see many NaN values in our data, especially in the event fields: on a day with no event, the field holds a missing-value placeholder.
  3. We have data for all the weekdays with equal counts. Hence, it is safe to say we do not have any missing entries here.
  4. We have a higher count of values for the months of March, April and May. For the last quarter, the count is low.
  5. We have data from 2011 to 2016, although not for every day of 2016. This explains the higher count of values for the first few months of the year.
  6. We also have a list of events, which might be useful in analyzing trends and patterns in our data.
  7. We have more data for cultural events than for religious events.

Hence, by just plotting a few basic graphs we are able to pick up some useful information about our dataset that we did not know earlier. So, let us try the same for the other CSV files we have.

Sales Validation Dataset:

First few rows:

Next, we will explore the validation dataset provided to us:

Figure: First five rows of validation data

Value counts plot:

Figure: Code snippet for count_plot
Figure: Value counts plot for each store
Figure: Value counts plot for each state
Figure: Value counts plot for each category
Figure: Value counts plot for each department

Observations from Sales Data:

  1. We have data for three different categories: Household, Foods and Hobbies.
  2. We have data for three different states: California, Wisconsin and Texas. Of these three states, California accounts for the most entries.
  3. The Foods category has the highest count.

Sell Price Data:

First few rows:

Figure: First 5 rows of sell price data

Observations:

  1. Here we have the sell_price of each item.
  2. We have already seen the item_id and store_id plots earlier.

Asking Questions to your Data:

Till now we have seen the basic EDA plots, which gave us a brief overview of the data. In the next phase we need to find answers to the questions we have about our data. These questions depend on the problem statement.

For Example:

We need to forecast the sales of each product for the next 28 days. So we need to know whether the sales before those 28 days show any patterns, because if they do, the sales are likely to follow the same patterns over the next 28 days too.

So, here goes our first question:

What is the Sales distribution in the past?

To find out, let us randomly select a few products and look at their sales over the 1913 days given in our validation data:

Figure: Code snippet for plotting sales of a product
Figure: Sales distribution plot for FOODS_3_0900_CA_3_validation
Figure: Sales distribution plot for HOUSEHOLD_2_348_CA_1_validation
Figure: Sales distribution plot for FOODS_3_325_TX_3_validation
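
A minimal sketch of pulling one product's daily sales out of the wide table is given below. It assumes the layout of sales_train_validation.csv, with an `id` column and one `d_<n>` column per day; the helper name and the toy numbers are made up for illustration.

```python
import pandas as pd


def product_sales_series(sales, product_id):
    """Return the daily unit sales of one product, indexed by day number."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    rows = sales.loc[sales["id"] == product_id, day_cols]
    if rows.empty:
        raise KeyError(f"unknown product id: {product_id}")
    series = rows.iloc[0].astype(int)
    series.index = [int(c.split("_")[1]) for c in day_cols]  # "d_12" -> 12
    series.name = product_id
    return series


# Toy frame in the wide layout of sales_train_validation.csv (made-up numbers).
sales = pd.DataFrame(
    {
        "id": ["FOODS_3_325_TX_3_validation", "HOUSEHOLD_2_348_CA_1_validation"],
        "d_1": [0, 2],
        "d_2": [0, 0],
        "d_3": [4, 1],
        "d_4": [1, 3],
    }
)
series = product_sales_series(sales, "FOODS_3_325_TX_3_validation")
print(series)
```

With matplotlib installed, `series.plot(figsize=(12, 4), title=series.name)` draws the line plot for that product.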

Observations:

  1. The plots look very noisy and it is difficult to find a pattern.
  2. For FOODS_3_0900_CA_3_validation, sales were high on day 1 and then nil for some time. After that they rose again and have fluctuated up and down since. The sudden fall after day 1 might be because the product went out of stock.
  3. For HOUSEHOLD_2_348_CA_1_validation, the sales plot is extremely noisy. On some days the sales are high and on others they drop considerably.
  4. For FOODS_3_325_TX_3_validation, there are no sales at all for the first 500 days, which suggests the product was not in stock during that period. After that the sales peak roughly every 200 days. Hence, for this food product we see a seasonal dependency.

Hence, by just plotting a few randomly chosen sales graphs we are able to draw some important insights from our dataset. These insights will also help us choose the right model for training.

What is the Sales Pattern on Weekly, Monthly and Yearly Basis?

We saw earlier that there are seasonal trends in our data. So, next let us break down the time variables and look at the weekly, monthly and yearly sales patterns:

Figure: Code snippet for weekly average sales distribution
Figure: Weekly average sales distribution for HOUSEHOLD_1_118_CA_3_validation

For this particular product, HOUSEHOLD_1_118_CA_3_validation, sales drop after Tuesday and hit a minimum on Saturday.

Figure: Code snippet for monthly average sales distribution
Figure: Monthly average sales distribution for HOUSEHOLD_1_118_CA_3_validation

The monthly sales drop in the middle of the year and reach a minimum in the 7th month, July.

Figure: Code snippet for yearly average sales distribution
Figure: Yearly average sales distribution for HOUSEHOLD_1_118_CA_3_validation

From the above graph we can see that the sales dropped to zero from 2013 to 2014. This suggests that the product might have been replaced by a new version or removed from this store. From this plot it seems reasonable to expect that the sales for the days to be predicted will still be zero.
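
The weekly, monthly and yearly averages can be sketched as one grouped mean over the calendar columns. The snippet below is a minimal sketch under stated assumptions: the helper name is hypothetical, the sales numbers are made up, and the `d`, `weekday`, `month` and `year` columns follow calendar.csv.

```python
import pandas as pd


def average_sales_by(sales, calendar, product_id, key):
    """Average daily units of one product, grouped by a calendar column."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    row = sales.loc[sales["id"] == product_id, day_cols].iloc[0]
    # Wide row -> long table with one line per day, then attach the calendar.
    long = row.rename("units").rename_axis("d").reset_index()
    long["units"] = long["units"].astype(float)
    merged = long.merge(calendar[["d", key]], on="d", how="left")
    return merged.groupby(key)["units"].mean()


# Toy data: 14 consecutive days starting on Saturday 2011-01-29 (made-up sales).
dates = pd.date_range("2011-01-29", periods=14)
calendar = pd.DataFrame(
    {
        "d": [f"d_{i}" for i in range(1, 15)],
        "weekday": dates.day_name(),
        "month": dates.month,
        "year": dates.year,
    }
)
product = "HOUSEHOLD_1_118_CA_3_validation"
sales = pd.DataFrame([{"id": product, **{f"d_{i}": i - 1 for i in range(1, 15)}}])

weekly = average_sales_by(sales, calendar, product, "weekday")
monthly = average_sales_by(sales, calendar, product, "month")
yearly = average_sales_by(sales, calendar, product, "year")
print(weekly)
```

Each result is a small Series, so `weekly.plot(kind="bar")` or a similar call is enough to draw it once a plotting backend is available.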

What is the Sales Distribution in Each Category?

We have sales data belonging to three different categories, so it is worth checking whether the sales of a product depend on the category it belongs to:

Figure: Code snippet for category-wise sales distribution
Figure: Sales distribution for each category

We see that sales are highest for Foods. Also, the sales curve for Foods does not overlap at all with the other two categories, which shows that on any day Foods sells more than Household and Hobbies.
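
One way to check the "no overlap" claim numerically is to total the units per category for each day and compare the columns. This is a minimal sketch; the helper name and the toy numbers are assumptions, while `cat_id` and `state_id` are columns of the sales file.

```python
import pandas as pd


def daily_sales_by(sales, key):
    """Total units sold per day for each value of key (for example "cat_id")."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    totals = sales.groupby(key)[day_cols].sum()
    return totals.T  # rows: days, columns: groups


# Toy rows in the layout of the sales file (made-up numbers).
sales = pd.DataFrame(
    {
        "cat_id": ["FOODS", "FOODS", "HOBBIES", "HOUSEHOLD"],
        "state_id": ["CA", "TX", "CA", "WI"],
        "d_1": [5, 4, 1, 2],
        "d_2": [6, 3, 0, 4],
        "d_3": [7, 5, 2, 1],
    }
)
by_category = daily_sales_by(sales, "cat_id")
# True only if Foods outsells every other category on every day.
others = by_category.drop(columns="FOODS").max(axis=1)
foods_always_highest = bool((by_category["FOODS"] > others).all())
print(by_category)
```

The same helper with `key="state_id"` gives the state-wise totals.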

What is the Sales Distribution for Each State?

Besides the category, we also know the state each sale belongs to. So, let us analyze whether there is a state whose sales follow a different pattern:

Figure: Code snippet for state-wise sales distribution
Figure: Sales distribution for each state

What is the Sales Distribution for Products that belong to category of Hobbies on weekly, monthly and yearly basis?

Now, let us look at the sales of randomly selected products from the Hobbies category and see if their weekly, monthly or yearly averages follow a pattern:

Figure: Code snippet for plotting sales distribution of products from Hobbies

Observations

From the above plots we see that in mid-week, usually on the 4th and 5th day (Tuesday and Wednesday), the sales drop, especially when the state is 'WI' or 'TX'.

Since different states show different sales patterns, let us analyze individual states to see this more clearly. This brings us to our next question:

What is the Sales Distribution for Products that belong to the category of Hobbies on weekly, monthly and yearly basis for a particular state?

Figure: Code snippet for selecting sales of products from the Hobbies category and the state of Wisconsin
Figure: Code snippet for selecting a few products at random and plotting their distribution
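
The selection step can be sketched with a boolean mask followed by a seeded random sample. The helper name, its parameters and the toy rows are assumptions for illustration; `cat_id` and `state_id` are columns of the sales file.

```python
import pandas as pd


def sample_products(sales, cat_id, state_id=None, n=3, seed=42):
    """Pick up to n random rows of one category, optionally within one state."""
    mask = sales["cat_id"] == cat_id
    if state_id is not None:
        mask = mask & (sales["state_id"] == state_id)
    subset = sales[mask]
    # A fixed seed keeps the "random" selection reproducible between runs.
    return subset.sample(n=min(n, len(subset)), random_state=seed)


# Toy rows in the layout of the sales file (made-up numbers).
sales = pd.DataFrame(
    {
        "id": ["H1_WI", "H2_WI", "H3_WI", "H4_TX", "F1_WI"],
        "cat_id": ["HOBBIES", "HOBBIES", "HOBBIES", "HOBBIES", "FOODS"],
        "state_id": ["WI", "WI", "WI", "TX", "WI"],
        "d_1": [1, 0, 2, 3, 9],
    }
)
picked = sample_products(sales, "HOBBIES", state_id="WI", n=2)
print(picked["id"].tolist())
```

Each sampled row can then be averaged by weekday, month and year to reproduce the plots for that product.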

Observations:

  1. From the above plots, we can see that in the state of Wisconsin, sales of most products decrease considerably in mid-week.
  2. This also gives us a little sense of the lifestyle of people in Wisconsin: they do not shop much on days 3 and 4, which are Monday and Tuesday. This is probably because these are the busiest days of the week.
  3. From the monthly averages we can see that sales often dip in the first quarter.
  4. For the product HOBBIES_1_369_WI_2_validation, the sales are nil until 2014. This suggests the product was introduced after that year, so the weekly and monthly patterns we see for it come from 2014 onwards.

What is the Sales Distribution for Products that belong to category of Foods on weekly, monthly and yearly basis?

Analyzing Hobbies on its own gave us some useful insights. Let us try the same for the Foods category:

Figure: Code snippet for building a dataframe with only Foods products
Figure: Code snippet for plotting weekly, monthly and yearly average sales for food products

Observation:

  1. From the plots above we can say that food items are purchased more early in the week than on the last two days.
  2. This might be because people are in the habit of buying food supplies at the start of the week and using them through the week. The curves show this behavior.

What is the Sales Distribution for Products that belong to category of Household on weekly, monthly and yearly basis?

Figure: Code snippet for plotting sales distribution of products from the Household category

Observation:

  1. From the plots above we can say that purchases of Household items dip on Monday and Tuesday.
  2. At the start of the week people are busy with office work and hardly go shopping. This is the pattern that we see here.

Is there a way to see the sales of products more clearly without losing information?

The sales plots we saw earlier for individual products were quite cluttered, and we could not see the patterns clearly. You might be wondering whether there is a way to fix that, and the good news is that there is.

This is where denoising comes into the picture. We will denoise our data and look at the distribution again.

Here we will see two common denoising techniques: wavelet denoising and moving average.

Wavelet Denoising:

From the sales plots of individual products we saw that the sales change rapidly. This is because the sales of a product on a given day depend on multiple factors. So, let us try denoising our data and see if we can find anything interesting.

The basic idea behind wavelet denoising, or wavelet thresholding, is that the wavelet transform leads to a sparse representation for many real-world signals and images. What this means is that the wavelet transform concentrates signal and image features in a few large-magnitude wavelet coefficients. Wavelet coefficients which are small in value are typically noise, and you can "shrink" those coefficients or remove them without affecting the signal or image quality. After you threshold the coefficients, you reconstruct the data using the inverse wavelet transform.

For wavelet denoising, we require the library pywt. To decide the denoising threshold we will use the Mean Absolute Deviation.

Figure: Code snippet for wavelet denoising
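
The steps described above (transform, threshold, inverse transform) can be sketched with pywt as follows. This is a minimal sketch rather than the original snippet: the `db4` wavelet, the periodic boundary mode, the universal threshold `sigma * sqrt(2 * log(n))`, the 0.6745 scaling constant and hard thresholding are all assumptions borrowed from a common recipe, and the test signal is synthetic.

```python
import numpy as np
import pywt


def mean_abs_deviation(values):
    """Mean absolute deviation of values around their mean."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - np.mean(values)))


def wavelet_denoise(signal, wavelet="db4"):
    """Denoise a 1-D signal by hard-thresholding its wavelet detail coefficients."""
    signal = np.asarray(signal, dtype=float)
    coeffs = pywt.wavedec(signal, wavelet, mode="per")
    # Noise level estimated from the finest detail coefficients (assumed scaling).
    sigma = mean_abs_deviation(coeffs[-1]) / 0.6745
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    # Keep the approximation coefficients, threshold every detail level.
    coeffs[1:] = [pywt.threshold(c, value=threshold, mode="hard") for c in coeffs[1:]]
    denoised = pywt.waverec(coeffs, wavelet, mode="per")
    return denoised[: len(signal)]  # reconstruction can be one sample longer


# Synthetic stand-in for a sales series: a slow wave plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(512)
noisy = 5 * np.sin(2 * np.pi * t / 128) + rng.normal(0, 1, size=t.size)
denoised = wavelet_denoise(noisy)
```

Plotting `noisy` and `denoised` on the same axes shows how much of the day-to-day fluctuation the thresholding removes.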

Observations:

We can see the pattern more clearly after denoising the data. It shows the same pattern repeating every 500 days, which we were not able to see before denoising.

Moving Average Denoising:

Let us now try a simple smoothing technique. In this technique, we take a fixed window size and move it along our time-series data, calculating the average at each position. We also take a stride value that sets how far the window moves between positions. For example, say we take a window size of 20 and a stride of 5. Then our first point will be the mean of the points from day 1 to day 20, the next will be the mean from day 6 to day 25, then day 11 to day 30, and so on.

So, let us try this average smoothing on our dataset and see if we find any kind of patterns here.

Figure: Code snippet for moving window average calculation
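
The windowed average with a stride can be sketched in a few lines of NumPy. The function name and defaults are illustrative; the example reuses the window of 20 and stride of 5 from the text on a toy series of days 1 to 30.

```python
import numpy as np


def moving_average(values, window=20, stride=5):
    """Mean of each full window of length `window`, moved `stride` points at a time."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    values = np.asarray(values, dtype=float)
    starts = range(0, len(values) - window + 1, stride)
    return np.array([values[s : s + window].mean() for s in starts])


# Toy series where the value on day k is k, for days 1 to 30.
days = np.arange(1, 31)
smoothed = moving_average(days, window=20, stride=5)
print(smoothed)  # means of days 1-20, 6-25 and 11-30
```

A larger window smooths more aggressively, while a larger stride returns fewer points, so the output is shorter than the input series.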

Observations:

We see that average smoothing does remove some noise, but it is not as effective as wavelet denoising.

Do the sales vary overall for each state?

Now, from a broader perspective, let us see if the sales vary for each state:

Figure: Code snippet for average sales in each state
Figure: Sales pattern for each state
Figure: Box plot for sales distribution of each state
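
A minimal sketch of the store-level comparison follows. It averages the units per item for each store and day, reshapes the result to long form, and leaves the box plot to seaborn. The helper names and the toy numbers are assumptions; `store_id` is a column of the sales file.

```python
import pandas as pd


def store_daily_mean(sales):
    """Mean units per item for each store and day, in long form."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    wide = sales.groupby("store_id")[day_cols].mean()
    return wide.reset_index().melt(
        id_vars="store_id", var_name="d", value_name="mean_units"
    )


def plot_store_boxes(long):
    """Box plot of the daily mean sales of each store."""
    # Imported here so the aggregation above works without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.boxplot(data=long, x="store_id", y="mean_units")
    plt.title("Daily mean sales per store")
    plt.show()


# Toy rows in the layout of the sales file (made-up numbers).
sales = pd.DataFrame(
    {
        "store_id": ["CA_3", "CA_3", "CA_4", "TX_2"],
        "d_1": [8, 6, 1, 4],
        "d_2": [9, 7, 2, 5],
    }
)
long = store_daily_mean(sales)
medians = long.groupby("store_id")["mean_units"].median()
print(medians)
```

The medians printed here correspond to the center lines of the boxes, which makes it easy to confirm which stores sit highest and lowest.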

Observations:

  1. From the above plot we can see that the sales for store CA_3 lie above the sales of all other stores. Similarly, CA_4 stands apart with the lowest sales. For the other stores the patterns are distinguishable to some extent.
  2. All these patterns follow a similar trend that repeats itself after some time. Also, the sales reach higher values later in the graph.
  3. As in the line plot, the box plot shows non-overlapping sales patterns for CA_3 and CA_4.
  4. The California stores do not overlap with one another, even though they all belong to the same state. This shows high variance within the state of California.
  5. For Texas, the stores TX_1 and TX_3 have quite similar patterns and intersect a couple of times, but TX_2 lies above them with the highest sales and more disparity compared to the other two. In the later parts, we see that TX_3 is growing rapidly and approaching TX_2. From this, we can conclude that sales for TX_3 are increasing at the fastest pace.

Conclusion:

Hence, by just plotting a few simple graphs we get to know our dataset quite well. It is just a matter of which questions you want to ask of the data; the plots will give you the answers.

I hope this has given you an idea of how to do simple EDA. You can find the complete code in my GitHub repository.


Original article: https://medium.com/analytics-vidhya/how-to-guide-on-exploratory-data-analysis-for-time-series-data-34250ff1d04f
