Feature engineering on date-time data

According to Wikipedia, feature engineering refers to the process of using domain knowledge to extract features from raw data via data mining techniques. These features can then be used to improve the performance of machine learning algorithms.

Feature engineering does not have to be fancy. One simple yet prevalent use case is time-series data. Feature engineering matters here because raw time-series data usually has only a single column representing the time attribute, namely the date-time (or timestamp).

For date-time data, feature engineering means extracting useful information from that single column as standalone (distinct) features. For example, from the date-time “2020-07-01 10:21:05” we might want to extract the following features:

  1. Month: 7
  2. Day of month: 1
  3. Day name: Wednesday (2020-07-01 was a Wednesday)
  4. Hour: 10
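
As a quick check of the four values listed above, here is a minimal sketch that parses the example date-time into a single pandas Timestamp. The names ts and extracted are only for illustration and do not appear in the rest of the article.

```python
import pandas as pd

# the example date-time from the text, parsed into a single pandas Timestamp
ts = pd.Timestamp("2020-07-01 10:21:05")

extracted = {
    "month": ts.month,          # 7
    "day_of_month": ts.day,     # 1
    "day_name": ts.day_name(),  # 'Wednesday'
    "hour": ts.hour,            # 10
}
print(extracted)
```

The same attributes are available on a whole column through the .dt accessor, which is what the rest of the article relies on.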

Extracting these kinds of features from date-time data is precisely the objective of this article. Afterwards, we will use the engineered features as predictors in a gradient boosting regression model. Specifically, we will forecast metro interstate traffic volume.

Quick summary

This article covers the following.

A step-by-step guide to extracting the features below from a date-time column.

  1. Month
  2. Day of month
  3. Day name
  4. Hour
  5. Daypart (morning, afternoon, etc.)
  6. Weekend flag (1 if weekend, else 0)

How to incorporate those features into a gradient boosting regression model to forecast metro interstate traffic volume.

The data

Throughout the article, we use the Metro Interstate Traffic Volume Data Set, which is available from the UCI Machine Learning Repository.

Quoting its abstract, the data set records hourly Minneapolis-St Paul, MN traffic volume for westbound I-94, and includes weather and holiday features from 2012-2018. It has 48,204 rows and the following attributes.

  1. holiday: Categorical. US national holidays plus a regional holiday, the Minnesota State Fair
  2. temp: Numeric. Average temperature in kelvin
  3. rain_1h: Numeric. Amount of rain in mm that occurred in the hour
  4. snow_1h: Numeric. Amount of snow in mm that occurred in the hour
  5. clouds_all: Numeric. Percentage of cloud cover
  6. weather_main: Categorical. Short textual description of the current weather
  7. weather_description: Categorical. Longer textual description of the current weather
  8. date_time: DateTime. Hour of the data collected, in local CST time
  9. traffic_volume: Numeric. Hourly I-94 ATR 301 reported westbound traffic volume (the target)

Let’s load the data.

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the data
raw = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

# display first five rows
raw.head()

# display details for each column
raw.info()
Figure: output of raw.head()
Figure: output of raw.info()

From the output of the info method above, we see that the date_time column is still of object type, so we need to convert it to a datetime type.

# convert date_time column to datetime type
raw.date_time = pd.to_datetime(raw.date_time)

Start feature engineering

From the output of the info method above, we know there are categorical features besides the date_time column. However, given the topic of this article, we will focus on feature engineering the date_time column.

Month

It turns out that pandas has many handy attributes for working with datetime-typed data. To extract time/date components, all we need to do is use the pd.Series.dt accessor family. pd.Series.dt.month is the one we need to extract the month component. It yields a Series holding the month number (e.g. 1 for January, 10 for October) as integers (int64 in the pandas version used here; newer pandas releases may return int32 instead).

# extract month feature
months = raw.date_time.dt.month

Day of month

Much as before, we just need pd.Series.dt.day. For example, a date-time of 2012-10-27 09:00:00 yields 27 with this attribute.

# extract day of month feature
day_of_months = raw.date_time.dt.day

Hour

This one is also straightforward. The attribute pd.Series.dt.hour yields a Series of hour values, ranging from 0 to 23.

# extract hour feature
hours = raw.date_time.dt.hour

Day name

This one is getting interesting. Our goal is to extract the day name for each date-time in the raw.date_time Series. It takes two steps. First, extract the day name using the pd.Series.dt.day_name() method. Then one-hot encode the result of the first step using the pd.get_dummies() function.

# first: extract the day name literal
to_one_hot = raw.date_time.dt.day_name()

# second: one-hot encode to 7 columns
days = pd.get_dummies(to_one_hot)

# display data
days
Figure: Day name, one-hot encoded
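
One detail worth knowing: for string input, pd.get_dummies orders the resulting columns alphabetically (Friday, Monday, ...), not in calendar order. The sketch below is one possible way to get Monday-to-Sunday columns. The helper name one_hot_day_names and the two sample timestamps are assumptions made for illustration, not part of the original article.

```python
import pandas as pd

WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def one_hot_day_names(date_times: pd.Series) -> pd.DataFrame:
    """One-hot encode day names with columns in Monday..Sunday order."""
    dummies = pd.get_dummies(date_times.dt.day_name())
    # reindex puts the columns in calendar order and adds an all-zero
    # column for any weekday that never occurs in the data
    return dummies.reindex(columns=WEEKDAY_ORDER, fill_value=0).astype(int)

# tiny illustrative input (not the traffic data): a Wednesday and a Saturday
sample = pd.Series(pd.to_datetime(["2020-07-01 10:21:05", "2020-07-04 08:00:00"]))
encoded = one_hot_day_names(sample)
print(encoded)
```

For tree-based models the column order does not affect the fit; the reordering only makes the table easier to read.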

Daypart

In this part, we create a grouping based on the hour values. At a high level, we want six groups, one for each daypart: Dawn (02:00 to 05:59), Morning (06:00 to 09:59), Noon (10:00 to 13:59), Afternoon (14:00 to 17:59), Evening (18:00 to 21:59), and Midnight (22:00 to 01:59 on the following day).

To this end, we write a function that maps an hour to its daypart and pass it to the apply method of a Series. Afterwards, we one-hot encode the resulting dayparts.

# daypart function
def daypart(hour):
    if hour in [2, 3, 4, 5]:
        return "dawn"
    elif hour in [6, 7, 8, 9]:
        return "morning"
    elif hour in [10, 11, 12, 13]:
        return "noon"
    elif hour in [14, 15, 16, 17]:
        return "afternoon"
    elif hour in [18, 19, 20, 21]:
        return "evening"
    else:
        return "midnight"

# utilize it along with apply method
raw_dayparts = hours.apply(daypart)

# one hot encoding
dayparts = pd.get_dummies(raw_dayparts)

# re-arrange columns for convenience
dayparts = dayparts[['dawn', 'morning', 'noon', 'afternoon', 'evening', 'midnight']]

# display data
dayparts
Figure: Day parts, one-hot encoded
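
Because the six dayparts are equal 4-hour blocks starting at 02:00, the same mapping can also be computed arithmetically on the whole column at once. The following is a sketch of that alternative, not the article's method. The names DAYPART_LABELS and daypart_vectorized are hypothetical.

```python
import numpy as np
import pandas as pd

# labels in the same order as the six 4-hour blocks that start at 02:00
DAYPART_LABELS = np.array(
    ["dawn", "morning", "noon", "afternoon", "evening", "midnight"]
)

def daypart_vectorized(hours: pd.Series) -> pd.Series:
    """Map hour values (0-23) to dayparts without a Python-level loop."""
    # shift so that 02:00 becomes 0, wrap around midnight, then bucket by 4 hours
    block = ((hours.to_numpy() - 2) % 24) // 4
    return pd.Series(DAYPART_LABELS[block], index=hours.index, name="daypart")

# check the mapping on every possible hour value
all_hours = pd.Series(range(24))
print(daypart_vectorized(all_hours).tolist())
```

The apply-based version is easier to read and to adjust if the blocks are ever made unequal; the arithmetic version mainly saves time on large columns.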

Weekend flag

The final feature we engineer from the date_time column is is_weekend. This column indicates whether the given date-time falls on a weekend (Saturday or Sunday). To build it, we reuse the pd.Series.dt.day_name() method from before and apply a simple lambda function on top of it.

# is_weekend flag 
day_names = raw.date_time.dt.day_name()
is_weekend = day_names.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
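
An equivalent flag can be computed without string comparison by using the numeric day of the week. This is a minimal sketch of that alternative. The function name weekend_flag and the sample dates are assumptions for illustration.

```python
import pandas as pd

def weekend_flag(date_times: pd.Series) -> pd.Series:
    """Return 1 for Saturday/Sunday and 0 otherwise."""
    # dt.dayofweek counts Monday as 0, so Saturday is 5 and Sunday is 6
    return (date_times.dt.dayofweek >= 5).astype(int)

# Wednesday, Saturday, Sunday, Monday
sample = pd.Series(pd.to_datetime(["2020-07-01", "2020-07-04", "2020-07-05", "2020-07-06"]))
print(weekend_flag(sample).tolist())  # → [0, 1, 1, 0]
```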

Holiday flag & weather

Luckily for us, the data also contains public holiday information. The information is granular, since it gives the name of each public holiday. Nevertheless, I assume there is no significant gain from encoding each holiday separately. Therefore, let’s just create a binary feature indicating whether or not the corresponding date is a holiday.

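# is_holiday flag
# non-holiday rows hold the string "None"; some pandas versions read that
# string as a missing value (NaN) on load, so treat both as "not a holiday"
is_holiday = raw.holiday.apply(lambda x : 0 if pd.isna(x) or x == "None" else 1)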

The last categorical feature we need to take care of is the weather_main column (my assumption strikes again here: I do not include the weather_description feature). As you might guess, we simply one-hot encode the feature as follows.

# one-hot encode weather
weathers = pd.get_dummies(raw.weather_main)

# display data
weathers
Figure: Weather, one-hot encoded

The final data

Hurray! We finally have our final, ready-to-train data!

# features table
features = pd.DataFrame({
    'temp' : raw.temp,
    'rain_1h' : raw.rain_1h,
    'snow_1h' : raw.snow_1h,
    'clouds_all' : raw.clouds_all,
    'month' : months,
    'day_of_month' : day_of_months,
    'hour' : hours,
    'is_holiday' : is_holiday,
    'is_weekend' : is_weekend
})
features = pd.concat([features, days, dayparts, weathers], axis = 1)

# target column
target = raw.traffic_volume

Before we feed the data to our model, we need to split it into training and test data.

# split data into training and test data
# (shuffle = False keeps the time order, so the last 10% of rows become the test set)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1, shuffle = False)

Modelling part

Now we are ready to build our model to forecast metro interstate traffic volume. In this work, we will use the Gradient Boosting regression model.

The details of the model are beyond the scope of this article, but at a high level, gradient boosting belongs to the ensemble model family: it builds weak learners (decision trees) sequentially and additively, using gradient descent to minimize the error.
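
To give a rough feel for the "sequential, additive" idea, here is a toy sketch of gradient boosting for squared error, where each new tree is fitted to the residuals of the current prediction. It is a simplified illustration on synthetic data, not scikit-learn's actual implementation. The function names, the hyperparameter defaults, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error: each tree fits the residuals."""
    baseline = float(np.mean(y))          # start from a constant prediction
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # take a small step in the direction suggested by the new tree
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return {"baseline": baseline, "learning_rate": learning_rate, "trees": trees}

def predict_boosted_trees(model, X):
    """Add the shrunken contribution of every tree to the baseline."""
    total = np.full(len(X), model["baseline"])
    for tree in model["trees"]:
        total = total + model["learning_rate"] * tree.predict(X)
    return total

# synthetic data for illustration only (not the traffic data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 24, size=(200, 1))
y_demo = np.sin(X_demo[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 0.1, size=200)

toy_model = fit_boosted_trees(X_demo, y_demo)
mse_baseline = np.mean((y_demo - toy_model["baseline"]) ** 2)
mse_boosted = np.mean((y_demo - predict_boosted_trees(toy_model, X_demo)) ** 2)
print("training MSE, constant baseline:", mse_baseline)
print("training MSE, after boosting:   ", mse_boosted)
```

The learning rate plays the same role as the learning_rate parameter used later: smaller steps usually need more trees but tend to generalize better.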

Model training

Let’s instantiate and train the model on the training data!

from sklearn import ensemble

# define the model parameters
params = {'n_estimators': 500,
          'max_depth': 4,
          'min_samples_split': 5,
          'learning_rate': 0.01,
          # least-squares loss; scikit-learn releases before 1.0 call it 'ls'
          'loss': 'squared_error'}

# instantiate and train the model
gb_reg = ensemble.GradientBoostingRegressor(**params)
gb_reg.fit(X_train, y_train)

Just wait a little while until training finishes.

Model evaluation

To evaluate the model, we use two metrics: MAPE (mean absolute percentage error) and the R2 score. We compute both on the test data.

# define MAPE function
def mape(true, predicted):
    inside_sum = np.abs(predicted - true) / true
    return round(100 * np.sum(inside_sum) / inside_sum.size, 2)

# import r2 score
from sklearn.metrics import r2_score

# evaluate the metrics
y_true = y_test
y_pred = gb_reg.predict(X_test)
#print(f"GB model MSE is {round(mean_squared_error(y_true, y_pred),2)}")
print(f"GB model MAPE is {mape(y_true, y_pred)} %")
print(f"GB model R2 is {round(r2_score(y_true, y_pred)* 100 , 2)} %")
Figure: The metrics performance on test data
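
To make the two metrics concrete, the sketch below computes them by hand on three toy values. The numbers and the function names mape_percent and r2 are made up for illustration and are unrelated to the traffic data. Note that MAPE divides by the true values, so it is undefined whenever a true value is 0.

```python
import numpy as np

def mape_percent(true, predicted):
    """Mean absolute percentage error, in percent."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100 * np.mean(np.abs(predicted - true) / true)

def r2(true, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((true - predicted) ** 2)    # squared error of the model
    ss_tot = np.sum((true - true.mean()) ** 2)  # squared error of the mean
    return 1 - ss_res / ss_tot

# toy numbers for illustration only
toy_true = [100, 200, 400]
toy_pred = [110, 180, 400]
toy_mape = mape_percent(toy_true, toy_pred)  # (10% + 10% + 0%) / 3
toy_r2 = r2(toy_true, toy_pred)              # 1 - 500 / 46666.67
print(round(toy_mape, 2), round(toy_r2, 4))
```

MAPE reads as "off by this many percent on average", while R2 compares the model's squared error with that of always predicting the mean.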

We can see that our model performs quite decently. Our MAPE is less than 15%, while the R2 score is a little over 95%.

Graphical results

To understand our model’s performance visually, let’s make a plot!

Because the test data is long (4820 data points), we plot the actual vs model-predicted values only for the last 100 data points. We also include another model (called gb_reg_lite in the plotting code below; its training code is not shown in this article) that does not use the date-time engineered features as predictors. Its features are only the non-date-time columns, such as temp, weather, etc.
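
The original article does not show how gb_reg_lite and X_test_lite are built, so the following is only a guess at one reasonable construction: drop the date-time columns, apply the same time-ordered split, and fit the same kind of model. The helper fit_lite_model, its datetime_columns argument, and the synthetic demo data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def fit_lite_model(features, target, datetime_columns, test_size=0.1, **gb_params):
    """Fit a gradient boosting model on every column except datetime_columns."""
    features_lite = features.drop(columns=list(datetime_columns))
    # same time-ordered split as the main model: no shuffling
    X_train_lite, X_test_lite, y_train_lite, _ = train_test_split(
        features_lite, target, test_size=test_size, shuffle=False
    )
    model = GradientBoostingRegressor(**gb_params).fit(X_train_lite, y_train_lite)
    return model, X_test_lite

# synthetic stand-in for the article's features / target (illustration only)
rng = np.random.default_rng(0)
n = 300
demo_features = pd.DataFrame({
    "temp": rng.normal(285, 10, n),
    "clouds_all": rng.integers(0, 101, n),
    "hour": np.arange(n) % 24,
    "is_weekend": ((np.arange(n) // 24) % 7 >= 5).astype(int),
})
demo_target = pd.Series(
    1000 + 200 * np.sin(demo_features["hour"] / 24 * 2 * np.pi) + rng.normal(0, 20, n)
)

gb_reg_lite, X_test_lite = fit_lite_model(
    demo_features, demo_target,
    datetime_columns=["hour", "is_weekend"],
    n_estimators=50, max_depth=2,
)
print(list(X_test_lite.columns))
```

With the article's own variables, the call might look like fit_lite_model(features, target, datetime_columns=['month', 'day_of_month', 'hour', 'is_weekend', *days.columns, *dayparts.columns], **params). Whether is_holiday should also count as a date-time feature is left open by the original text.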

fig, ax = plt.subplots(figsize = (12,6))
index_ordered = raw.date_time.astype('str').tolist()[-len(X_test):][-100:]
ax.set_xlabel('Date')
ax.set_ylabel('Traffic Volume')

# the actual values
ax.plot(index_ordered, y_test[-100:].to_numpy(), color='k', ls='-', label = 'actual')
# predictions of model with engineered features
ax.plot(index_ordered, gb_reg.predict(X_test)[-100:], color='b', ls='--', label = 'predicted; with date-time features')
# predictions of model without engineered features
ax.plot(index_ordered, gb_reg_lite.predict(X_test_lite)[-100:], color='r', ls='--', label = 'predicted; w/o date-time features')

every_nth = 5
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
ax.tick_params(axis='x', labelrotation= 90)
plt.legend()
plt.title('Actual vs predicted on the last 100 data points')
plt.draw()
Figure: Our prediction performance on the last 100 data points

The figure supports the good evaluation metrics reported above: the blue dashed line closely tracks the black solid line. That is, our gradient boosting model forecasts the metro traffic decently.

Meanwhile, we see that the model that does not use the date-time engineered features falls apart in performance (red dashed line). Why does this occur? Because the target (traffic volume) does depend on the features we just created: traffic tends to be lower on weekend days and to spike during rush hours. We would miss these sound predictors if we did not perform feature engineering on the date-time column!

Before you go

Congratulations to those of you who have read this far!

Now for a short recap. In this article, we learned how to perform feature engineering on date-time data. Afterwards, we used the engineered features to build a gradient boosting regression model that forecasts metro traffic volume.

Finally, thanks for reading, and let’s connect on LinkedIn!

Translated from: https://towardsdatascience.com/feature-engineering-on-date-time-data-90f6e954e6b8
