DengAI — How to approach Data Science competitions? (EDA)

Understanding ML

This article is based on my entry in the DengAI competition on the DrivenData platform. I’ve managed to score within the top 0.2% (14/9069 as of 02 Jun 2020). Some of the ideas presented here are designed strictly for competitions like this one and might not be useful IRL.

Before we start, I have to warn you that some parts might be obvious to more advanced data engineers, and that it’s a very long article. You might read it section by section or just pick the parts that are interesting to you:

DengAI, Data preprocessing

More parts are coming soon…

Problem description

First, we need to discuss the competition itself. DengAI’s goal was (actually, at this moment still is, because the administration of DrivenData decided to make it an “ongoing” competition, so you can join and try it yourself) to predict the number of dengue cases in a particular week, based on weather data and location. Each participant was given a training dataset and a test dataset (not a validation dataset). MAE (Mean Absolute Error) is the metric used to calculate the score, and the training dataset covers 28 years of weekly values for 2 cities (1456 weeks). The test data is smaller and spans 5 and 3 years (depending on the city).

For those who don’t know, Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. Because it’s carried by mosquitoes, the transmission is related to climate and weather variables.

Dataset

If we look at the training dataset it has multiple features:

City and date indicators:

  • city — City abbreviations: sj for San Juan and iq for Iquitos

  • week_start_date — Date given in yyyy-mm-dd format

NOAA’s GHCN daily climate data weather station measurements:

  • station_max_temp_c — Maximum temperature

  • station_min_temp_c — Minimum temperature

  • station_avg_temp_c — Average temperature

  • station_precip_mm — Total precipitation

  • station_diur_temp_rng_c — Diurnal temperature range

PERSIANN satellite precipitation measurements (0.25x0.25 degree scale):

  • precipitation_amt_mm — Total precipitation

NOAA’s NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale):

  • reanalysis_sat_precip_amt_mm — Total precipitation

  • reanalysis_dew_point_temp_k — Mean dew point temperature

  • reanalysis_air_temp_k — Mean air temperature

  • reanalysis_relative_humidity_percent — Mean relative humidity

  • reanalysis_specific_humidity_g_per_kg — Mean specific humidity

  • reanalysis_precip_amt_kg_per_m2 — Total precipitation

  • reanalysis_max_air_temp_k — Maximum air temperature

  • reanalysis_min_air_temp_k — Minimum air temperature

  • reanalysis_avg_temp_k — Average air temperature

  • reanalysis_tdtr_k — Diurnal temperature range

Satellite vegetation — Normalized difference vegetation index (NDVI) — NOAA’s CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements:

  • ndvi_se — Pixel southeast of city centroid

  • ndvi_sw — Pixel southwest of city centroid

  • ndvi_ne — Pixel northeast of city centroid

  • ndvi_nw — Pixel northwest of city centroid

Additionally, we have information about the number of total_cases each week.

It is easy to spot that for each row in the dataset we have multiple features describing similar kinds of data. There are four categories:

  • temperature

  • precipitation

  • humidity

  • ndvi (these four NDVI features refer to different points in the cities, so they are not exactly the same data)

Because of that, we should be able to remove some of the redundant data from the input. Of course, we cannot just pick one temperature feature at random. If we look at the temperature data alone, there is a distinction between ranges (min, avg, max) and even between types (mean dew point or diurnal).

Input example:

week_start_date 1994-05-07
total_cases 22
station_max_temp_c 33.3
station_avg_temp_c 27.7571428571
station_precip_mm 10.5
station_min_temp_c 22.8
station_diur_temp_rng_c 7.7
precipitation_amt_mm 68.0
reanalysis_sat_precip_amt_mm 68.0
reanalysis_dew_point_temp_k 295.235714286
reanalysis_air_temp_k 298.927142857
reanalysis_relative_humidity_percent 80.3528571429
reanalysis_specific_humidity_g_per_kg 16.6214285714
reanalysis_precip_amt_kg_per_m2 14.1
reanalysis_max_air_temp_k 301.1
reanalysis_min_air_temp_k 297.0
reanalysis_avg_temp_k 299.092857143
reanalysis_tdtr_k 2.67142857143
ndvi_location_1 0.1644143
ndvi_location_2 0.0652
ndvi_location_3 0.1321429
ndvi_location_4 0.08175

Submission format:

city,year,weekofyear,total_cases
sj,1990,18,4
sj,1990,19,5
...

Score evaluation:

[Figure: score evaluation metric — MAE = (1/n) · Σ|yᵢ − ŷᵢ|]
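
Since the score is plain MAE, here is a minimal sketch of the computation (using scikit-learn is my assumption here; any MAE implementation works, as the platform computes the score server-side):

from sklearn.metrics import mean_absolute_error

y_true = [4, 5, 7]   # actual weekly case counts
y_pred = [5, 5, 5]   # model predictions
# MAE = (|4-5| + |5-5| + |7-5|) / 3 = 1.0
print(mean_absolute_error(y_true, y_pred))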

Data Analysis

Before we even start designing the models, we need to look at the raw data and fix it. To accomplish that, we’re going to use the Pandas library. Usually, we can just import .csv files out of the box and work on the imported DataFrame, but sometimes (especially when there is no column description in the first row) we have to provide a list of columns.

import pandas as pd

# Limit displayed float precision to keep the summary table readable
pd.set_option("display.precision", 2)
df = pd.read_csv('./dengue_features_train_with_out.csv')
df.describe()

Pandas has a built-in method called describe which displays basic statistical info about the columns in a dataset.

[Table: output of df.describe() for the training set]

Naturally, this method works only on numerical data. If we have non-numerical columns, we have to do some preprocessing first. In our case, the only categorical column is city. This column contains only two values, sj and iq, and we’re going to deal with it later.

Back to the main table. Each row contains a different kind of information:

  • count — describes the number of non-NaN values; basically, how many values are present rather than empty

  • mean — mean value from the whole column (useful for normalization)

  • std — standard deviation (also useful for normalization)

  • min -> max — shows us the range in which the values are contained (useful for scaling)

Let us start with the count. It is important to know how many records in your dataset have missing data (in one or many fields) and to decide what to do with them. If you look at the ndvi_nw value, it is empty in 13.3% of cases (see the snippet after this list for a quick way to check that). That might be a problem if you decide to replace missing values with some arbitrary value like 0. Usually, there are two common solutions to this problem:

  • set an average value

  • do the interpolation

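A quick way to check how much data is missing per column, reusing the df loaded earlier (a minimal sketch):

# Fraction of missing values per column; ndvi_nw comes out around 0.133
print(df.isna().mean().sort_values(ascending=False))
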
Interpolation (dealing with missing data)

When dealing with series data (like we do), it’s better to interpolate a missing value from its neighbors (averaging just the nearby points) than to replace it with an average of the entire set. Usually, series data has some correlation between consecutive values, and using the neighbors gives a better result. Let me give you an example.

Suppose you’re dealing with temperature data, and your entire dataset consists of the values from January to December. The average value from the entire year is going to be an invalid replacement for missing days throughout most of the year. If you take days from July then you might have values like [28, 27, -, -, 30] (or [82, 81, -, -, 86] for those who prefer imperial units). If that would be a London then an annual average temperature is 11C (or 52F). Using 11 seams wrong in this case, doesn’t it? That’s why we should use interpolation instead of the average. With interpolation (even in the case when there is a wider gap) we should be able to achieve a better result. If you calculate values you should get (27+30)/2=28.5 and (28.5+30)/2=29.25 so at the end our dataset will look like [28, 27, 28.5, 29.25, 30], way better than [28, 27, 11, 11, 30].

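To make this concrete, here is a minimal sketch of how pandas fills that gap with linear interpolation (the same method used later in this article):

import pandas as pd

s = pd.Series([28, 27, None, None, 30])
print(s.interpolate(method='linear').tolist())
# [28.0, 27.0, 28.0, 29.0, 30.0]
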
Splitting dataset into cities

Because we’ve already covered some important things lets define a method which allows us to redefine our categorical column ( city) into binary column vectors and interpolate data:

def extract_data(train_file_path, columns, categorical_columns=CATEGORICAL_COLUMNS,
                 categories_desc=CATEGORIES, interpolate=True):
    # Read csv file with only the selected columns
    all_data = pd.read_csv(train_file_path, usecols=columns)
    if categorical_columns is not None:
        # Map categorical values onto themselves; unexpected values become NaN
        for feature_name in categorical_columns:
            mapping_dict = {categories_desc[feature_name][i]: categories_desc[feature_name][i]
                            for i in range(0, len(categories_desc[feature_name]))}
            all_data[feature_name] = all_data[feature_name].map(mapping_dict)
        # Change mapped categorical data to 0/1 columns
        all_data = pd.get_dummies(all_data, prefix='', prefix_sep='')
    # Fix missing data
    if interpolate:
        all_data = all_data.interpolate(method='linear', limit_direction='forward')
    return all_data

All constants (like CATEGORICAL_COLUMNS) are defined in this Gist.

This function returns a dataset with two binary columns, sj and iq, which hold 1 in the rows where city was set to sj or iq respectively.

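For illustration, here is what the pd.get_dummies step does to the city column (a toy sketch, not the article’s actual data):

import pandas as pd

toy = pd.DataFrame({'city': ['sj', 'iq', 'sj']})
# One 0/1 column per category (or True/False, depending on the pandas version)
print(pd.get_dummies(toy, prefix='', prefix_sep=''))
#    iq  sj
# 0   0   1
# 1   1   0
# 2   0   1
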
Plotting the data

It is important to plot your data to get a visual understanding of how values are distributed in the series. We’re going to use a library called Seaborn to help us with plotting data.

Let’s start with a single feature plotted across the whole training set:

[Figure: reanalysis_min_air_temp_k in the training dataset]
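
A plot like this is a pandas one-liner (a sketch; dataset is assumed to hold the extracted training data, as in the pairplot call below):

dataset['reanalysis_min_air_temp_k'].plot(figsize=(16, 4))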

Here we have just one feature from the dataset, and we can clearly distinguish seasons and cities (the point when the average value drops from ~297K to ~292K).

Another thing that could be useful is the pair correlation between different features. That way we might be able to remove some of the redundant features from our dataset.

sns.pairplot(dataset[["precipitation_amt_mm", "reanalysis_sat_precip_amt_mm", "station_precip_mm"]], diag_kind="kde")

[Figure: Precipitation pairplot]

As you can notice, we can drop one of the precipitation features right away. It might seem counterintuitive at first, but because we have data from different sources, the same kind of measurement (like precipitation) won’t always be fully correlated across sources. This might be due to different measurement methods or something else.

Data Correlation

When working with a lot of features, we don’t really have to plot pair plots for every pair like that. Another option is to calculate something called a Correlation Score. There are different types of correlation for different types of data. Because our dataset consists only of numerical data, we can use the built-in method called .corr() to generate correlations for each city.

If there are categorical columns that shouldn’t be treated as binary, you could calculate Cramér’s V measure of association to find out a “correlation” between them and the rest of the data.

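For completeness, a minimal sketch of Cramér’s V built on a chi-squared test (scipy is an assumption here; cramers_v is a hypothetical helper, not part of the article’s code):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Contingency table of the two categorical columns
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, c = confusion.shape
    # V = sqrt(chi2 / (n * (min(r, c) - 1))), ranging from 0 to 1
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

Back to our numerical features. Here is the correlation heatmap for the sj city:
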
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Importing our extraction function
from helpers import extract_data
from data_info import *

train_data = extract_data(train_file, CSV_COLUMNS)
# Get data for "sj" city and drop both binary columns
sj_train = train_data[train_data['sj'] == 1].drop(['sj', 'iq'], axis=1)
# Generate heatmap (the mask hides the redundant upper triangle)
corr = sj_train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(20, 10))
ax = sns.heatmap(
    corr,
    mask=mask,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_title('Data correlation for city "sj"')
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
[Figure: Feature correlation for sj city]

You could do the same for the iq city and compare both of them (the correlations are different).

If you look at this heatmap, it’s obvious which features are correlated with each other and which are not. You should be aware that there are positive and negative correlations (dark blue and dark red). Features without correlation are white. There are groups of positively correlated features and, unsurprisingly, they refer to the same type of measurement (like the correlation between station_min_temp_c and station_avg_temp_c). But there are also correlations between different kinds of features (like reanalysis_specific_humidity_g_per_kg and reanalysis_dew_point_temp_k). We should also focus on the correlation between total_cases and the rest of the features, because that’s what we have to predict.

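If you prefer numbers over colors, a quick sketch that lists the most strongly correlated feature pairs from the same corr frame (the 0.9 cutoff is an arbitrary choice):

# Keep the upper triangle only (k=1 skips the diagonal), then flatten and sort
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9])
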
This time we’re out of luck because nothing is really strongly correlated with our target. But we still should be able to pick the most important features for our model. Looking on the heatmap is not that useful right now so let me switch to the bar plot.

sorted_y = corr.sort_values(by='total_cases', axis=0).drop('total_cases')
plt.figure(figsize=(20, 10))
ax = sns.barplot(x=sorted_y.total_cases, y=sorted_y.index, color="b")
ax.set_title('Correlation with total_cases in "sj"')
[Figure: Correlation to the target value for sj city]

Usually, when picking features for our model, we choose the ones with the highest absolute correlation with the target. It’s up to you to decide how many features to pick; you might even choose all of them, but that’s usually not the best idea.

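Such a selection is a one-liner, reusing the corr frame computed for sj above (a sketch; k is an arbitrary choice):

# Pick the k features with the highest absolute correlation to the target
k = 10
top_features = corr['total_cases'].drop('total_cases').abs().sort_values(ascending=False).head(k)
print(top_features)
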
It is also important to look at how target values are distributed within our dataset. We can easily do that using pandas:

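A sketch of the plotting code (assuming the sj_train split from the heatmap section and building an analogous iq_train):

import matplotlib.pyplot as plt

# Mirror the earlier sj split for the second city
iq_train = train_data[train_data['iq'] == 1].drop(['sj', 'iq'], axis=1)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sj_train['total_cases'].hist(bins=50, ax=axes[0])
axes[0].set_title('sj total cases histogram')
iq_train['total_cases'].hist(bins=50, ax=axes[1])
axes[1].set_title('iq total cases histogram')
plt.show()
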
[Figure: sj total cases histogram]
[Figure: iq total cases histogram]

On average, the number of cases per week is quite low. Only from time to time (roughly once a year) does the total number of cases jump to a higher value. We need to remember that when designing our model, because even if we manage to capture those “jumps”, we might lose a lot during the weeks with few or no cases.

What is an NDVI value?

The last thing we have to discuss in this article is the NDVI index (Normalized Difference Vegetation Index). This index is an indicator of vegetation. High negative values correspond to water, values close to 0 represent rocks/sand/snow, and values close to 1 correspond to tropical forests. In the given dataset, we have 4 different NDVI values for each city (each for a different corner on the map).

Even though the overall NDVI index is quite useful for understanding the type of terrain we’re dealing with, and it would come in handy if we had to design one model for multiple cities, in this case we have only two cities, whose climate and position on the map are known. We don’t have to train our model to figure out which kind of environment it’s dealing with; instead, we can just train two separate models, one for each city.

I’ve spent a while trying to make use of those values (especially that interpolation is hard in this case because we’re using a lot of information during the process). Using the NDVI index might be also misleading because changes in values don’t have to correspond to changes in the vegetation process.

If you want to check those cities out, please refer to San Juan, Puerto Rico, and Iquitos, Peru.

Conclusion

At this point, you should be aware of what our dataset looks like. We haven’t even started designing the first model, but we already know that some of the features are less important than others and that some of them just repeat the same data. If you were to take one thing away from this entire article, it would be: “Try to understand your data first!”

Originally published at https://erdem.pl.
