Feature engineering on date-time data
According to Wikipedia, feature engineering refers to the process of using domain knowledge to extract features from raw data via data mining techniques. These features can then be used to improve the performance of machine learning algorithms.
Feature engineering does not necessarily have to be fancy. One simple, yet prevalent, use case of feature engineering is in time-series data. The importance of feature engineering in this realm is due to the fact that (raw) time-series data usually only contains one single column to represent the time attribute, namely date-time (or timestamp).
For this date-time data, feature engineering can be seen as extracting useful information from it as standalone (distinct) features. For example, from the date-time value "2020-07-01 10:21:05", we might want to extract the following features:
- Month: 7
- Day of month: 1
- Day name: Wednesday (2020-07-01 was a Wednesday)
- Hour: 10
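As a quick sanity check of the example above, here is a minimal sketch using a single pandas Timestamp. The names `ts` and `example_features` are only for illustration and do not appear elsewhere in the article.

```python
import pandas as pd

# Parse the example date-time string into a single Timestamp
ts = pd.Timestamp("2020-07-01 10:21:05")

# Each component is available as a plain attribute or method
example_features = {
    "month": ts.month,
    "day_of_month": ts.day,
    "day_name": ts.day_name(),
    "hour": ts.hour,
}
print(example_features)
```

The same attribute names reappear later on whole columns through the `.dt` accessor.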
Extracting such kinds of features from date-time data is precisely the objective of the current article. Afterwards, we will incorporate our engineered features as predictors of a gradient boosting regression model. Specifically, we will forecast metro interstate traffic volume.
Quick summary
This article will cover the following.
A step-by-step guide to extracting the features below from a date-time column.
- Month
- Day of month
- Day name
- Hour
- Daypart (morning, afternoon, etc)
- Weekend flag (1 if weekend, else 0)
How to incorporate those features in a Gradient Boosting regression model to forecast metro interstate traffic volume.
The data
Throughout the article, we use the Metro Interstate Traffic Volume Data Set, which can be found in the UCI Machine Learning Repository.
Citing its abstract, the data covers hourly Minneapolis-St Paul, MN traffic volume for westbound I-94, and includes weather and holiday features from 2012-2018. This 48,204-row dataset contains the following attributes.
- holiday: Categorical. US national holidays plus regional holiday, Minnesota State Fair
- temp: Numeric. Average temp in kelvin
- rain_1h: Numeric. Amount in mm of rain that occurred in the hour
- snow_1h: Numeric. Amount in mm of snow that occurred in the hour
- clouds_all: Numeric. Percentage of cloud cover
- weather_main: Categorical. Short textual description of the current weather
- weather_description: Categorical. Longer textual description of the current weather
- date_time: DateTime. Hour of the data collected in local CST time
- traffic_volume: Numeric. Hourly I-94 ATR 301 reported westbound traffic volume (the target)
Let’s load the data.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the data
raw = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

# display first five rows
raw.head()

# display details for each column
raw.info()
From the output of the info method above, we know the date_time column is still of object type, so we need to convert it to datetime type.
# convert date_time column to datetime type
raw.date_time = pd.to_datetime(raw.date_time)
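To see the conversion in isolation, here is a small sketch on a made-up two-row CSV (the rows and the names `csv_text`, `toy` and `toy_parsed` are invented for illustration, not taken from the dataset). It also shows `parse_dates`, a `read_csv` option that performs the same conversion at load time.

```python
import io
import pandas as pd

# Two made-up rows that mimic the layout of the date_time column
csv_text = (
    "date_time,traffic_volume\n"
    "2012-10-02 09:00:00,1000\n"
    "2012-10-02 10:00:00,1200\n"
)

# Option 1: load as text, then convert explicitly (as done above)
toy = pd.read_csv(io.StringIO(csv_text))
toy["date_time"] = pd.to_datetime(toy["date_time"])

# Option 2: let read_csv convert the column while loading
toy_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_time"])

print(toy["date_time"].dtype, toy_parsed["date_time"].dtype)
```

Either route should leave a datetime column on which the `.dt` accessor is available.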
Start feature engineering
From the output of the info method above, we know there are categorical features other than the date_time column. However, given the topic of this article, we will focus on feature engineering the date_time column.
Month
It turns out that Pandas has many handy methods for working with datetime-typed data. To extract time/date components, all we need is the pd.Series.dt family of attributes. pd.Series.dt.month is the one we need to extract the month component. It yields a Series holding the month number (e.g. 1 for January, 10 for October) as integers (int64 here; the exact integer width may depend on the pandas version).
# extract month feature
months = raw.date_time.dt.month
Day of month
Much as before, we just need pd.Series.dt.day. For example, a date-time of 2012-10-27 09:00:00 yields 27 with this attribute.
# extract day of month feature
day_of_months = raw.date_time.dt.day
Hour
This one is also trivial. The attribute pd.Series.dt.hour yields a Series of hour digits, ranging from 0 to 23.
# extract hour feature
hours = raw.date_time.dt.hour
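To make the three attributes concrete, here is a minimal sketch on a toy Series of three date-times (the values and the names `toy_times` and `toy_features` are illustrative only).

```python
import pandas as pd

# A toy datetime Series standing in for raw.date_time
toy_times = pd.Series(pd.to_datetime([
    "2012-10-27 09:00:00",
    "2020-07-01 10:21:05",
    "2018-01-15 23:00:00",
]))

# The .dt accessor exposes each component as an integer Series
toy_features = pd.DataFrame({
    "month": toy_times.dt.month,
    "day_of_month": toy_times.dt.day,
    "hour": toy_times.dt.hour,
})
print(toy_features)
```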
Day name
This one is more interesting. Our goal is to extract the day name for each date-time in the raw.date_time Series. It takes two steps. First, we extract the day name literal using the pd.Series.dt.day_name() method. Then, we one-hot encode the results of the first step using the pd.get_dummies() function.
# first: extract the day name literal
to_one_hot = raw.date_time.dt.day_name()

# second: one-hot encode to 7 columns
days = pd.get_dummies(to_one_hot)

# display data
days
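One detail worth knowing: pd.get_dummies only creates columns for the day names that actually occur, and orders them alphabetically. A possible variation, sketched here on toy data, declares the seven names as categories first so the columns always come out in calendar order. The names `DAY_ORDER`, `toy_times`, `toy_day_names` and `toy_days` are assumptions made for this illustration.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Toy data containing only a Wednesday and a Saturday
toy_times = pd.Series(pd.to_datetime(["2020-07-01 10:00", "2020-07-04 18:00"]))

# Declare all seven day names as categories, in calendar order
day_dtype = pd.CategoricalDtype(categories=DAY_ORDER)
toy_day_names = toy_times.dt.day_name().astype(day_dtype)

# dtype=int gives 0/1 columns regardless of the pandas default
toy_days = pd.get_dummies(toy_day_names, dtype=int)
print(toy_days)
```

On the full dataset every day name occurs, so the simpler call above already produces seven columns; the categorical version mainly matters for small slices or for keeping train and test columns aligned.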
Daypart
In this part, we create a grouping based on the hour digits. At a high level, we want six groups, one per daypart: Dawn (02.00-05.59), Morning (06.00-09.59), Noon (10.00-13.59), Afternoon (14.00-17.59), Evening (18.00-21.59), and Midnight (22.00-01.59 on Day+1).
To this end, we create an identifying function that we later pass to the apply method of a Series. Afterwards, we one-hot encode the resulting dayparts.
# daypart function
def daypart(hour):
    if hour in [2,3,4,5]:
        return "dawn"
    elif hour in [6,7,8,9]:
        return "morning"
    elif hour in [10,11,12,13]:
        return "noon"
    elif hour in [14,15,16,17]:
        return "afternoon"
    elif hour in [18,19,20,21]:
        return "evening"
    else:
        return "midnight"

# utilize it along with apply method
raw_dayparts = hours.apply(daypart)

# one hot encoding
dayparts = pd.get_dummies(raw_dayparts)

# re-arrange columns for convenience
dayparts = dayparts[['dawn','morning','noon','afternoon','evening','midnight']]

# display data
dayparts
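Because the six dayparts are all four hours wide and start at 02.00, the same grouping can also be computed arithmetically instead of with apply. The sketch below is an alternative, not the method used in this article; `DAYPART_LABELS` and `daypart_vectorized` are hypothetical names.

```python
import numpy as np
import pandas as pd

DAYPART_LABELS = np.array(
    ["dawn", "morning", "noon", "afternoon", "evening", "midnight"]
)

def daypart_vectorized(hours: pd.Series) -> pd.Series:
    """Map hour digits (0-23) to the six daypart labels without a Python loop."""
    # Shift so that 02:00 becomes 0, wrap around midnight, then cut into 4-hour bins
    bin_index = ((hours - 2) % 24) // 4
    return pd.Series(DAYPART_LABELS[bin_index.to_numpy()], index=hours.index)

all_hours = pd.Series(range(24))
print(daypart_vectorized(all_hours).tolist())
```

For a dataset of this size the apply version is fast enough; the arithmetic version is mostly useful when the column is much longer.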
Weekend flag
The final feature we engineer from the date_time column is is_weekend. This column indicates whether the given date-time falls on a weekend (Saturday or Sunday). To this end, we reuse the pd.Series.dt.day_name() method from before and apply a simple lambda function on top of it.
# is_weekend flag
day_names = raw.date_time.dt.day_name()
is_weekend = day_names.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
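An equivalent flag can be derived from the numeric day of week, which avoids string comparisons. This is a sketch on toy dates; `toy_times` and `toy_is_weekend` are illustrative names.

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday
toy_times = pd.Series(pd.to_datetime(
    ["2020-07-03", "2020-07-04", "2020-07-05", "2020-07-06"]
))

# dt.dayofweek counts Monday as 0, so Saturday is 5 and Sunday is 6
toy_is_weekend = (toy_times.dt.dayofweek >= 5).astype(int)
print(toy_is_weekend.tolist())
```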
Holiday flag & weather
Luckily for us, the data also contains public holiday information. The information is granular, since it gives the name of each public holiday. Nevertheless, I assumed there is no significant gain from encoding each of these holidays separately. Therefore, let’s just create a binary feature indicating whether or not the corresponding date is a holiday.
# is_holiday flag
is_holiday = raw.holiday.apply(lambda x : 0 if x == "None" else 1)
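One caveat, stated as an assumption about your environment rather than about the original run: depending on the pandas version, read_csv may treat the literal text None as a missing value. In that case the comparison with "None" never matches and every row would be flagged as a holiday. A defensive sketch that handles both cases, on toy values (`toy_holiday`, `no_holiday` and `toy_is_holiday` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy column: "None" as a string, a real holiday name, and a missing value
toy_holiday = pd.Series(["None", "Labor Day", np.nan, "None"])

# A row is a non-holiday if it is missing or holds the literal "None"
no_holiday = toy_holiday.isna() | (toy_holiday == "None")
toy_is_holiday = (~no_holiday).astype(int)
print(toy_is_holiday.tolist())
```

Checking `raw.holiday.value_counts(dropna=False)` after loading is a quick way to see which case applies.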
The last categorical feature we need to take care of is the weather_main column (my assumption strikes again here: I do not include the weather_description feature). As you might guess, we simply one-hot encode the feature as follows.
# one-hot encode weather
weathers = pd.get_dummies(raw.weather_main)

# display data
weathers
The final data
Hurray! We finally have our final, ready-to-train data!
# features table
features = pd.DataFrame({
    'temp' : raw.temp,
    'rain_1h' : raw.rain_1h,
    'snow_1h' : raw.snow_1h,
    'clouds_all' : raw.clouds_all,
    'month' : months,
    'day_of_month' : day_of_months,
    'hour' : hours,
    'is_holiday' : is_holiday,
    'is_weekend' : is_weekend
})
features = pd.concat([features, days, dayparts, weathers], axis = 1)

# target column
target = raw.traffic_volume
Before we feed the data to our model, we need to split the data (training and test data).
# split data into training and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1, shuffle = False)
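Note the shuffle = False argument: it keeps the rows in their original (chronological) order, so the test set is simply the last 10% of the timeline. A tiny sketch on toy data makes this visible (`toy_X`, `toy_y` and the `X_tr`/`X_te` names are illustrative).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 toy rows whose index plays the role of time order
toy_X = pd.DataFrame({"hour": range(20)})
toy_y = pd.Series(range(20))

# Without shuffling, the split is a straight cut near the end
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.1, shuffle=False
)
print(X_tr.index.tolist())
print(X_te.index.tolist())
```

For forecasting this matters, since a shuffled split would let the model train on hours that come after the ones it is tested on.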
Modelling parts
Now we are ready to build our model to forecast metro interstate traffic volume. In this work, we will use the Gradient Boosting regression model.
The details of the model are beyond the scope of this article, but at a high level, gradient boosting belongs to the ensemble model family: it uses gradient descent to minimize the error of a sequence of additive weak learners (decision trees).
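To give a feel for the "sequential (additive) weak learners" idea, here is a deliberately simplified sketch for squared-error loss, where the negative gradient is just the residual. It is not how scikit-learn implements the model internally; the synthetic data and all names (`X`, `y`, `learning_rate`, `prediction`) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
baseline_mse = np.mean((y - prediction) ** 2)

for _ in range(50):
    # For squared-error loss, the negative gradient is the residual
    residuals = y - prediction
    # Fit a small tree (weak learner) to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped version of its output to the running prediction
    prediction += learning_rate * tree.predict(X)

boosted_mse = np.mean((y - prediction) ** 2)
print(baseline_mse, boosted_mse)
```

Each round nudges the prediction in the direction that reduces the training error, which is the sense in which boosting performs gradient descent.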
Model training
Let’s instantiate and train the model on the training data!
from sklearn import ensemble

# define the model parameters
params = {'n_estimators': 500,
          'max_depth': 4,
          'min_samples_split': 5,
          'learning_rate': 0.01,
          # least-squares loss; older scikit-learn versions spelled this 'ls'
          'loss': 'squared_error'}

# instantiate and train the model
gb_reg = ensemble.GradientBoostingRegressor(**params)
gb_reg.fit(X_train, y_train)
Just wait a little while until training finishes.
Model evaluation
To evaluate the model, we use two metrics: MAPE (mean absolute percentage error) and R2 score. We will compute these metrics on the test data.
# define MAPE function
def mape(true, predicted):
    inside_sum = np.abs(predicted - true) / true
    return round(100 * np.sum(inside_sum) / inside_sum.size, 2)

# import r2 score
from sklearn.metrics import r2_score

# evaluate the metrics
y_true = y_test
y_pred = gb_reg.predict(X_test)

#print(f"GB model MSE is {round(mean_squared_error(y_true, y_pred),2)}")
print(f"GB model MAPE is {mape(y_true, y_pred)} %")
print(f"GB model R2 is {round(r2_score(y_true, y_pred)* 100 , 2)} %")
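To make the two metrics tangible, here is a small worked example on three made-up values (`y_true_toy` and `y_pred_toy` are invented for illustration). One thing to keep in mind is that MAPE divides by the true value, so it is undefined for any hour whose true volume is 0.

```python
import numpy as np

y_true_toy = np.array([100.0, 200.0, 400.0])
y_pred_toy = np.array([110.0, 180.0, 400.0])

# MAPE: mean of |predicted - true| / true, as a percentage
abs_pct_errors = np.abs(y_pred_toy - y_true_toy) / y_true_toy   # 0.1, 0.1, 0.0
mape_toy = round(100 * np.mean(abs_pct_errors), 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(mape_toy, round(r2_toy, 4))   # 6.67 0.9893
```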
We can see that our model performs quite decently: the MAPE is below 15%, while the R2 score is a little over 95%.
Graphical results
To see our model's performance visually, let’s make a plot!
Because the test data is long (4820 data points), we plot the actual vs model-predicted values for the last 100 data points only. We also include another model (called gb_reg_lite in the plotting code below) that does not use the date-time engineered features as predictors (it only contains the non-date-time columns as features, including temp, weather, etc).
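The article does not show how gb_reg_lite was built, so the following is only a hypothetical sketch of the comparison, run on synthetic hourly data rather than the real dataset. The target is constructed to depend on rush hours and weekends, and we compare a model that sees the date-time features with one that sees only temp. All names and numbers here (`toy`, `toy_target`, `fit_and_score`, the rush-hour definition) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 24 * 60  # 60 days of hourly data
times = pd.Series(pd.Timestamp("2018-01-01") + pd.to_timedelta(np.arange(n), unit="h"))

# Date-time features, engineered as in the article
hour = times.dt.hour
is_weekend = (times.dt.dayofweek >= 5).astype(int)
is_rush = hour.isin([7, 8, 16, 17]).astype(int)

toy = pd.DataFrame({
    "temp": rng.normal(280, 10, size=n),   # unrelated to the target by construction
    "hour": hour,
    "is_weekend": is_weekend,
})
# Synthetic target: rush-hour spikes, weekend dips, plus noise
toy_target = 3000 + 2000 * is_rush - 1000 * is_weekend + rng.normal(0, 100, size=n)

def fit_and_score(columns):
    """Train on the first 90% of the timeline and return R2 on the last 10%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[columns], toy_target, test_size=0.1, shuffle=False
    )
    model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

r2_full = fit_and_score(["temp", "hour", "is_weekend"])   # with date-time features
r2_lite = fit_and_score(["temp"])                         # without them
print(r2_full, r2_lite)
```

On the real data, the lite model would be trained the same way on a features table that leaves out the month, day, hour, day-name, daypart and weekend columns.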
fig, ax = plt.subplots(figsize = (12,6))

index_ordered = raw.date_time.astype('str').tolist()[-len(X_test):][-100:]

ax.set_xlabel('Date')
ax.set_ylabel('Traffic Volume')

# the actual values
ax.plot(index_ordered, y_test[-100:].to_numpy(), color='k', ls='-', label = 'actual')

# predictions of model with engineered features
ax.plot(index_ordered, gb_reg.predict(X_test)[-100:], color='b', ls='--', label = 'predicted; with date-time features')

# predictions of model without engineered features
ax.plot(index_ordered, gb_reg_lite.predict(X_test_lite)[-100:], color='r', ls='--', label = 'predicted; w/o date-time features')

# show only every 5th tick label on the x axis
every_nth = 5
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)

ax.tick_params(axis='x', labelrotation= 90)

plt.legend()
plt.title('Actual vs predicted on the last 100 data points')
plt.draw()
The figure supports our earlier finding that the model attained good evaluation metrics, as the blue dashed line closely tracks the black solid line. That is, our gradient boosting model forecasts the metro traffic decently.
Meanwhile, the model that does not use the date-time engineered features falls apart in performance (red dashed line). Why does this occur? Because the target (transportation traffic) does depend on the features we just created. Traffic tends to be lower on weekends, but spikes during rush hours. Thus, we would miss these sound predictors if we did not perform feature engineering on the date-time column!
Before you go
Congratulations to those of you who have read this far!
Now for a short recap. In this article, we learned how to perform feature engineering on date-time data. Afterwards, we incorporated the engineered features to build a powerful gradient boosting regression model, to forecast metro traffic volume.
Finally, thanks for reading, and let’s connect on LinkedIn!
Translated from: https://towardsdatascience.com/feature-engineering-on-date-time-data-90f6e954e6b8