BigMart Dataset Sales Prediction

Note: This post is heavy on code, but it is well documented.

The Problem Description

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Find the entire notebook on GitHub: BigMart Sales Prediction

Metric used: Root Mean Squared Error (RMSE)
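
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch, assuming NumPy is available; the `rmse` helper and the sample numbers are made up for illustration and are not taken from the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sales figures, for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
error = rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.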

I achieved an RMSE of 946.34, thanks to K-Fold cross-validation, a Random Forest Regressor, and of course plenty of patience.

You can find the dataset here: DATASET

First, let's get a feel for the data.

train.dtypes

Item_Identifier object
Item_Weight float64
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

Checking if the table has missing values

train.isnull().sum(axis=0)

Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Item_Weight has 1463 missing values and Outlet_Size has 2410.

train.describe()

Let's do some data viz!

[Figure: scatter plot]

Hmm... items with visibility below 0.2 sold the most.

[Figure: bar plot]

Top 2 contributors: Outlet 27 > Outlet 35
Bottom 2 contributors: Outlet 10 and Outlet 19

Let's check which item type sold the most

[Figure: bar plot]

Checking for outliers

The Health and Hygiene item type has an outlier.
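
One common way to flag outliers numerically is the 1.5 x IQR rule that box plots use. The sketch below is only an illustration on made-up numbers; the `iqr_outliers` helper is hypothetical and is not necessarily how the plot in the original notebook was produced:

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the entries lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up values with one obvious extreme point
sample = pd.Series([10, 12, 11, 13, 12, 11, 50])
flagged = iqr_outliers(sample)
```

Applied per item type (for example inside a `groupby`), the same helper would list which rows sit outside the whiskers.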

Here comes the FUN part!!

DATA CLEANING

Peeking into what kinds of values Item_Fat_Content and Item_Visibility contain.

train.Item_Fat_Content.value_counts() # has mismatched factor levels

Low Fat 5089
Regular 2889
LF 316
reg 117
low fat 112
Name: Item_Fat_Content, dtype: int64

train.Item_Visibility.value_counts().head()

0.000000 526
0.076975 3
0.041283 2
0.085622 2
0.187841 2
Name: Item_Visibility, dtype: int64

Strange!! Item visibility can't be 0. Let's keep a note of that for now.

train.Outlet_Size.value_counts()

Medium 2793
Small 2388
High 932
Name: Outlet_Size, dtype: int64

Quick observations from the dataset so far:

1. Item_Fat_Content has mismatched factor levels
2. Min(Item_Visibility) = 0, which is not practically possible. Treat the 0s as missing values
3. Item_Weight has 1463 missing values
4. Outlet_Size has 2410 missing values

Data Imputation

Filling outlet size

My opinion: Outlet size depends on outlet type and the location of the outlet

crosstable = pd.crosstab(train['Outlet_Size'],train['Outlet_Type'])
crosstable
[Table: crosstab of Outlet_Size by Outlet_Type]

This is why I love the crosstab feature ❤

From the above table it is evident that all grocery stores are small, which is mostly true in the real world.

Therefore, map Grocery Store outlets to the Small size.

dic = {'Grocery Store': 'Small'}
s = train.Outlet_Type.map(dic)
train.Outlet_Size = train.Outlet_Size.combine_first(s)
train.Outlet_Size.value_counts()

Small 2943
Medium 2793
High 932
Name: Outlet_Size, dtype: int64

# Checking if imputation was successful
train.isnull().sum(axis=0)

Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 1855
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
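
To see why `map` followed by `combine_first` fills only the gaps, here is a small sketch on a made-up four-row frame (the `toy` data is hypothetical; the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type1"],
    "Outlet_Size": [np.nan, "Medium", "Small", np.nan],
})

# map() returns NaN for any Outlet_Type that is not a key in the dict
fallback = toy["Outlet_Type"].map({"Grocery Store": "Small"})

# combine_first() keeps existing values and takes `fallback` only where
# the original value is missing
toy["Outlet_Size"] = toy["Outlet_Size"].combine_first(fallback)
```

The last row stays missing because its outlet type has no entry in the mapping, which matches the 1855 values still missing in the output above.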

In the real world, outlet size usually varies with the location of the outlet, so let's check the relationship between the two.

[Table: outlet size by outlet location type]

From the above table it is evident that all Tier 2 stores are small. Therefore, map Tier 2 stores to the Small size.

dic = {"Tier 2":"Small"}
s = train.Outlet_Location_Type.map(dic)
train.Outlet_Size = train.Outlet_Size.combine_first(s)
train.isnull().sum(axis=0)

Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

train.Item_Identifier.value_counts().sum()

8523

Missing values for Outlet_Size have been imputed.

Imputing for Item_Weight

Instead of imputing with the overall mean of all items, it is better to impute with the mean of similar items, because some products (across Food, Drinks, and Non-Consumable) are on the heavier side and some on the lighter. The code below uses the mean weight of each Item_Identifier.

# Fill missing item weights with the mean weight of the same Item_Identifier
train['Item_Weight'] = train['Item_Weight'].fillna(train.groupby('Item_Identifier')['Item_Weight'].transform('mean'))
train.isnull().sum()

Item_Identifier 0
Item_Weight 4
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

train[train.Item_Weight.isnull()]

[Table: the 4 rows with missing Item_Weight]

The 4 item weights above were not imputed because each of those items has only one record in the dataset, and that record is missing its weight. Hence no per-item mean could be calculated.

So, we will fill these 4 values with the mean Item_Weight of the corresponding Item_Type.

# List of item types
item_type_list = train.Item_Type.unique().tolist()

# Grouping by item type and calculating the mean item weight
Item_Type_Means = train.groupby('Item_Type')['Item_Weight'].mean()

# Mapping item weight to the item type mean
for i in item_type_list:
    dic = {i: Item_Type_Means[i]}
    s = train.Item_Type.map(dic)
    train.Item_Weight = train.Item_Weight.combine_first(s)

Item_Type_Means = train.groupby('Item_Type')['Item_Weight'].mean()

# Checking if imputation was successful
train.isnull().sum()

Item_Identifier 0
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Missing values for Item_Weight have been imputed.
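
The two-stage idea (per-item mean first, per-type mean as a fallback) can be seen on a toy frame. This is a sketch on made-up rows, and it swaps the loop over item types for a single `groupby(...).transform('mean')`, which should be equivalent here:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "FDB20", "DRC01"],
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Soft Drinks"],
    "Item_Weight": [9.0, np.nan, np.nan, 6.0],
})

# Stage 1: per-item mean. FDB20 has no known weight at all, so its
# group mean is NaN and the gap survives this stage.
per_item = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(per_item)
still_missing = int(toy["Item_Weight"].isnull().sum())

# Stage 2: fall back to the per-type mean for whatever is left
per_type = toy.groupby("Item_Type")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(per_type)
```

After stage 1 one value is still missing; after stage 2 the column is complete.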

Imputing for item visibility

Item visibility cannot be 0, so zeros should be treated as missing values and imputed.

Imputing with the mean Item_Visibility of each item identifier, since some items are more visible (large ones such as TVs and fridges) and some less visible (shampoo sachets, Surf Excel, and other such small pouches).

# Replacing 0's with NaN
train['Item_Visibility'] = train['Item_Visibility'].replace(0, np.nan)

# Now fill by mean of visibility based on item identifiers
train.Item_Visibility = train.Item_Visibility.fillna(train.groupby('Item_Identifier')['Item_Visibility'].transform('mean'))

# Checking if imputation was carried out successfully
train.isnull().sum()

Item_Identifier 0
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Renaming Item_Fat_Content levels

The Item_Fat_Content levels include different values that represent the same category. For example, Regular and reg are the same. Let's deal with this.

train.Item_Fat_Content.value_counts()

Low Fat 5089
Regular 2889
LF 316
reg 117
low fat 112
Name: Item_Fat_Content, dtype: int64
# Replacing the mismatched levels
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(to_replace=["LF", "low fat"], value="Low Fat")
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(to_replace="reg", value="Regular")
train.Item_Fat_Content.value_counts()

Low Fat 5517
Regular 3006
Name: Item_Fat_Content, dtype: int64

# Creating a feature that describes the number of years the outlet had been in existence as of 2013
train['Outlet_Year'] = (2013 - train.Outlet_Establishment_Year)
train.head()
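
An alternative to chaining `replace` calls is to lower-case the labels and push them through one lookup table. A minimal sketch on the five spellings seen above; the `canonical` dictionary is an illustrative name, not part of the notebook:

```python
import pandas as pd

labels = pd.Series(["Low Fat", "Regular", "LF", "reg", "low fat"])

# Lower-case first so that "Low Fat" and "low fat" collapse to one key
canonical = {
    "low fat": "Low Fat",
    "lf": "Low Fat",
    "regular": "Regular",
    "reg": "Regular",
}
cleaned = labels.str.lower().map(canonical)

# Any spelling missing from the dict would become NaN, so count those
# before trusting the result
unmapped = int(cleaned.isnull().sum())
```

The explicit `unmapped` check is the main benefit of this design: an unexpected spelling surfaces as a missing value instead of passing through silently.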

Feature Encoding

Encoding Categorical Variables

var_cat = train.select_dtypes(include=[object])
var_cat.head()
#Convert categorical into numerical 
var_cat = var_cat.columns.tolist()
var_cat = ['Item_Fat_Content',
'Item_Type',
'Outlet_Size',
'Outlet_Location_Type',
'Outlet_Type']
var_cat

['Item_Fat_Content',
'Item_Type',
'Outlet_Size',
'Outlet_Location_Type',
'Outlet_Type']

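Using regex to replace values by their prefix (FD, DR, NC) with broad categories and store the result in a new column

# Assumption: Item_Type_New starts as a copy of Item_Identifier (the line creating it is missing from the post)
train['Item_Type_New'] = train['Item_Identifier']
train['Item_Type_New'] = train['Item_Type_New'].replace(to_replace="^FD.*", value="Food", regex=True)
train['Item_Type_New'] = train['Item_Type_New'].replace(to_replace="^DR.*", value="Drinks", regex=True)
train['Item_Type_New'] = train['Item_Type_New'].replace(to_replace="^NC.*", value="Non-Consumable", regex=True)
train.head()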
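
The patterns above key on the prefixes FD, DR, and NC. Assuming those are the first two characters of the item identifier, the same mapping can be written without regex by slicing and mapping. The sample IDs and the `prefix_to_category` name are for illustration only:

```python
import pandas as pd

ids = pd.Series(["FDA15", "DRC01", "NCD19"])

# The first two characters of the identifier carry the broad category
prefix_to_category = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
categories = ids.str[:2].map(prefix_to_category)
```

An identifier with an unknown prefix would come out as NaN here, which makes unexpected values easy to spot.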

Label encoding features using LabelEncoder

le = LabelEncoder()
train['Outlet'] = le.fit_transform(train.Outlet_Identifier)
train['Item'] = le.fit_transform(train.Item_Type_New)
train.head()

for i in var_cat:
    train[i] = le.fit_transform(train[i])
train.head()
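
Reusing a single LabelEncoder, as above, means every `fit_transform` call overwrites the previous mapping, so the integer codes cannot be decoded afterwards. A sketch of one possible alternative that keeps one encoder per column; the toy frame and the `encoders` dictionary are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Small"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type2"],
})

# One fitted encoder per column, kept so each mapping can be inverted later
encoders = {}
for col in list(toy.columns):
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels for one column
decoded = encoders["Outlet_Size"].inverse_transform(toy["Outlet_Size"])
```

LabelEncoder assigns codes in sorted label order, so the integers carry no meaningful ranking. Tree models tolerate this, while a linear model would read it as an ordering.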

Visualizing Correlation

[Figures: correlation plots]

Predictive Modelling

Choosing the predictors for our model

predictors=['Item_Fat_Content','Item_Visibility','Item_Type','Item_MRP','Outlet_Size','Outlet_Location_Type','Outlet_Type','Outlet_Year',
'Outlet','Item','Item_Weight']
seed = 240
np.random.seed(seed)
X = train[predictors]
y = train.Item_Outlet_Sales
X.head()

y.head()

0 3735.1380
1 443.4228
2 2097.2700
3 732.3800
4 994.7052
Name: Item_Outlet_Sales, dtype: float64

Splitting the Dataset into Training and Testing Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape

(6392, 11)

X_train.tail()

X_test.shape

(2131, 11)

y_train.shape

(6392,)

y_test.shape

(2131,)
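
The shapes above follow from simple arithmetic on the 8523 rows. As far as I know, scikit-learn rounds the test share up and gives the remainder to training when only `test_size` is set; a quick check under that assumption:

```python
import math

n_rows = 8523      # total rows in the training file
test_size = 0.25

# Test share is rounded up; training gets whatever is left
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
```

This gives 2131 test rows and 6392 training rows, matching the shapes printed by the notebook.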

Model Building

We will be building different types of models.

1. Linear Regression

lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

Plotting the model results

plt.scatter(y_test,predictions)
plt.show()
[Figure: scatter plot of actual vs. predicted sales]

Evaluating the Model

# R^2 score
print("Linear Regression Model Score:", model.score(X_test, y_test))

Linear Regression Model Score: 0.5052133696581114
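
For a scikit-learn regressor, `score` reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A hand-rolled sketch on made-up numbers (the `r2` helper is illustrative) shows what the 0.505 above measures:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Made-up values, for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r2(actual, predicted)
```

A score of about 0.5 therefore means the model explains roughly half of the variance in sales on the held-out rows.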

Calculating RMSE

original_values = y_test

# Root mean squared error
rmse = np.sqrt(metrics.mean_squared_error(original_values, predictions))
print("Linear Regression RMSE: ", rmse)

Linear Regression without cross-validation:

Linear Regression R2 score: 0.505
Linear Regression RMSE: 1168.37

# Linear Regression with statsmodels
x = sm.add_constant(X_train)
results = sm.OLS(y_train,x).fit()
results.summary()
predictions = results.predict(x)
predictionsDF = pd.DataFrame({"Predictions": predictions})
joined = x.join(predictionsDF)
joined.head()

Performing Cross Validation

# Perform 5-fold cross validation
score = cross_val_score(model,X,y,cv=5)
print("Linear Regression CV Score: ",score)

Linear Regression CV Score: [0.51828865 0.5023478 0.48262104 0.50311721 0.4998021 ]
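
By default `cross_val_score` reports R^2 for a regressor, which is what the five numbers above are. To get an RMSE per fold instead, one option is the `neg_mean_squared_error` scorer. A sketch on synthetic data (`X_demo` and `y_demo` stand in for the real `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the BigMart features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE comes back negated
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
```

Looking at the spread of `rmse_per_fold` gives a rough idea of how stable the error is across folds.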

Predicting with cross_val_predict

predictions = cross_val_predict(model, X, y, cv=6)

# Plotting the results
plt.scatter(y, predictions)
plt.show()

[Figure: scatter plot of actual vs. cross-validated predicted sales]

Linear Regression with cross-validation:

Linear Regression R2 with CV: 0.501
Linear Regression RMSE with CV: 1205.05

Using KFold Validation

Function to fit the model and return training and validation error

def calc_metrics(X_train, y_train, X_test, y_test, model):
    '''fits model and returns the RMSE for in-sample error and out-of-sample error'''
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    validation_error = calc_validation_error(X_test, y_test, model)
    return train_error, validation_error

Function to calculate the training error

def calc_train_error(X_train, y_train, model):
    '''returns in-sample RMSE for an already fit model.'''
    predictions = model.predict(X_train)
    mse = metrics.mean_squared_error(y_train, predictions)
    rmse = np.sqrt(mse)
    return rmse

Function to calculate the validation error

def calc_validation_error(X_test, y_test, model):
    '''returns out-of-sample RMSE for an already fit model.'''
    predictions = model.predict(X_test)
    mse = metrics.mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    return rmse

Performing 10-fold cross-validation along with Lasso regression to overcome over-fitting of the model.

Find the code here: CODE
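
Since the linked code is not reproduced here, the following is only a rough sketch of what a 10-fold loop around Lasso could look like. It runs on synthetic data, and the `alpha` value and all variable names are arbitrary assumptions, not the author's settings:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the BigMart features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
train_rmse, val_rmse = [], []

for train_idx, val_idx in kf.split(X_demo):
    model = Lasso(alpha=1.0)  # alpha chosen arbitrarily for the sketch
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # In-sample and out-of-sample RMSE for this fold
    train_pred = model.predict(X_demo[train_idx])
    val_pred = model.predict(X_demo[val_idx])
    train_rmse.append(np.sqrt(mean_squared_error(y_demo[train_idx], train_pred)))
    val_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], val_pred)))

mean_train_rmse = float(np.mean(train_rmse))
mean_val_rmse = float(np.mean(val_rmse))
```

A validation RMSE far above the training RMSE would point to over-fitting; raising `alpha` increases the L1 penalty and shrinks more coefficients to zero.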

2. Decision Tree Regressor

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=0, splitter='best')

predictions = regressor.predict(X_test)
predictions[:5]

array([ 792.302 , 356.8688, 365.5242, 5778.4782, 2356.932 ])

results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
results.head()

Decision Tree Regression with KFold validation

Mean Absolute Error: 625.88
Root Mean Squared Error: 1161.40

3. Random Forest Regressor

This is the model that gave me the best RMSE.

rf = RandomForestRegressor(random_state=43)
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=43, verbose=0, warm_start=False)

predictions = rf.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
results.head()

Random Forest Regression with KFold validation score

RMSE: 946.34
R2 Score: 0.675
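
The post does not show how the K-fold score for the random forest was computed. One way to obtain an RMSE and R2 of this kind is to collect out-of-fold predictions with `cross_val_predict`. A sketch on synthetic data, where the fold count and `n_estimators` are assumptions rather than the author's settings:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the BigMart features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=20.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=43)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Every row is predicted by a model that never saw it during fitting
oof_pred = cross_val_predict(rf_demo, X_demo, y_demo, cv=folds)

rf_rmse = float(np.sqrt(mean_squared_error(y_demo, oof_pred)))
rf_r2 = float(r2_score(y_demo, oof_pred))
```

Scoring on out-of-fold predictions uses every row for validation exactly once, which tends to give a steadier estimate than a single train/test split.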

Summary

This was a great learning project for me, as I applied many different techniques and did a lot of research on the issues I faced along the way. I would like to thank the Analytics Vidhya team for hosting this challenge. Also, kudos to Towards Data Science for their amazing content on different aspects of data science.

Future Improvements

Hyper-parameter Tuning and Gradient Boosting.
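
As a pointer for that future work, here is a minimal sketch combining both ideas: a small grid search over a gradient boosting model. It runs on synthetic data, and the grid values are arbitrary placeholders rather than recommended settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the BigMart features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=20.0, random_state=0)

# Placeholder grid: two values per hyper-parameter keeps the search quick
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
# Square root of the mean cross-validated MSE of the best combination
best_rmse = (-search.best_score_) ** 0.5
```

On the real data the grid would normally also cover `learning_rate` and `min_samples_leaf`, at the cost of a longer search.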

Translated from: https://medium.com/analytics-vidhya/bigmart-dataset-sales-prediction-c1f1cdca9af1
