Predicting Financial Transactions with CatBoost, LightGBM, XGBoost and Keras


The goal of this challenge is to predict whether a customer will make a transaction (“target” = 1) or not (“target” = 0). For that, we get a data set of 200 anonymised variables, and our submission is judged on the Area Under the Receiver Operating Characteristic Curve (AUC), which we have to maximise.

This project is somewhat different from others: you basically get a huge amount of data with no missing values and only numbers. A dream come true for any data scientist. Of course, that sounds too good to be true! Let’s dive in.

I. Set up

We start by loading the data and getting a quick overview of what we’ll have to handle. We do so by calling the describe() and info() functions.

# Load the data sets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Create a merged data set and review initial information
combined_df = pd.concat([train_df, test_df])
print(combined_df.describe())
print(combined_df.info())

We have a total of 400,000 observations, 200,000 of which are in our training set. We can also see that we will have to deal with the class imbalance issue, as the mean of the target column is 0.1.

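To quantify the imbalance, we can simply count the classes in the training set before any resampling; a minimal sketch using the columns loaded above:

# Check how the target classes are distributed
print(train_df["target"].value_counts())
print(train_df["target"].value_counts(normalize=True))  # roughly 90% zeros and 10% ones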

II. Missing values

Let’s check whether we have any missing values. For that, we print the column names that contain missing values.

# Check missing values
print(combined_df.columns[combined_df.isnull().any()])

We have zero missing values. Let’s move forward.

III. Data types

Let’s check the data we have. Are we dealing with categorical variables? Or text? Or just numbers? We print a dictionary containing the different data types present and how often each occurs.

# Get the data types
print(Counter([combined_df[col].dtype for col in combined_df.columns.values.tolist()]).items())

Only float data. We don’t have to create dummy variables.

IV. Data cleaning

We don’t want to use our ID column to make our predictions, so we store it in the index.

# Set the ID column as index
for element in [train_df, test_df]:
    element.set_index('ID_code', inplace=True)

We now separate the target variable from our training set and store it in its own dataframe.

# Create the X_train_df and y_train_df sets
X_train_df = train_df.drop("target", axis=1)
y_train_df = train_df["target"]

V. Scaling

We haven’t done anything when it comes to data exploration and outlier analysis. Conducting these is always highly recommended; however, given the nature of the challenge, we suspect that the variables in themselves might not be too interesting.

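For anyone who does want a quick look at the distributions before scaling, a minimal sketch of a distribution check on a few of the anonymised columns (matplotlib is assumed in addition to the packages listed at the end):

import matplotlib.pyplot as plt

# Plot histograms for the first four feature columns
X_train_df.iloc[:, :4].hist(bins=50, figsize=(10, 6))
plt.tight_layout()
plt.show()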

In order to compensate for our lack of outlier detection, we scale the data using RobustScaler().

# Scale the data and use RobustScaler to minimise the effect of outliers
scaler = RobustScaler()

# Scale the X_train set
X_train_scaled = scaler.fit_transform(X_train_df.values)
X_train_df = pd.DataFrame(X_train_scaled, index=X_train_df.index, columns=X_train_df.columns)

# Scale the X_test set
X_test_scaled = scaler.transform(test_df.values)
X_test_df = pd.DataFrame(X_test_scaled, index=test_df.index, columns=test_df.columns)

We now create X_train, y_train, X_test and y_test sets for training our models and then testing them on hold-out data.

# Split our training sample into train and test, leaving 20% for test
X_train, X_test, y_train, y_test = train_test_split(X_train_df, y_train_df, test_size=0.2, random_state=20)

VI. Outliers

When it comes to outliers, one could use IsolationForest() to automatically identify and remove rows that are outliers. This technique is often used for data sets with numerous variables. This code chunk has been borrowed from MachineLearningMastery.

# OUTLIERS
# Remove outliers automatically
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
print(yhat)

# Select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train.loc[mask, :], y_train.loc[mask]

Please note that this automated outlier discovery did not add any predictive power to our model and we decided to comment it out.

VII. Class Imbalance

In our data, we have seen that we have far fewer observations that made a transaction than did not. If we want our model to be equally capable of predicting both, we should make sure we don’t feed it with skewed data.

We correct for class imbalance by resampling the training data; below we show three alternatives: downsampling the majority class, upsampling the minority class, and generating synthetic samples with SMOTE. These techniques are inspired by this excellent article by Tara Boyle.

# CLASS IMBALANCE
# Downsample majority class
# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

# Downsample the majority class
not_transa_down = resample(not_transa,
                           replace=False,           # sample without replacement
                           n_samples=len(transa),   # match minority n
                           random_state=27)         # reproducible results

# Combine minority and downsampled majority
downsampled = pd.concat([not_transa_down, transa])

# Check counts
print(downsampled.target.value_counts())

# Create the training set again
y_train = downsampled.target
X_train = downsampled.drop('target', axis=1)
print(len(X_train))

Here is the code for upsampling the minority class.

# Upsample minority class
# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

# Upsample the minority class
transa_up = resample(transa,
                     replace=True,                # sample with replacement
                     n_samples=len(not_transa),   # match majority n
                     random_state=27)             # reproducible results

# Combine upsampled minority and majority
upsampled = pd.concat([transa_up, not_transa])

# Check counts
print(upsampled.target.value_counts())

# Create the training set again
y_train = upsampled.target
X_train = upsampled.drop('target', axis=1)
print(len(X_train))

And here is the code for creating synthetic samples with SMOTE.

# Create synthetic samples
sm = SMOTE(random_state=27, sampling_strategy='minority')
X_train, y_train = sm.fit_resample(X_train, y_train)  # fit_sample in older imblearn versions
print(y_train.value_counts())

VIII. Modelling

We now dive deeper into the models. The plan is to create 4 different models and then average their predictions to make an ensemble that will yield the final prediction. We do not plan to fine-tune the models to any great extent, so we leave GridSearch out of this.

1. Neural Network With Keras

# NEURAL NETWORK
# Build our neural network with input dimension 200
classifier = Sequential()

# First hidden layer
classifier.add(Dense(150, activation='relu', kernel_initializer='random_normal', input_dim=200))
# Second hidden layer
classifier.add(Dense(350, activation='relu', kernel_initializer='random_normal'))
# Third hidden layer
classifier.add(Dense(250, activation='relu', kernel_initializer='random_normal'))
# Fourth hidden layer
classifier.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
# Output layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

# Compile the network
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fit the network to the training data
classifier.fit(X_train, y_train, batch_size=100, epochs=150)

# Evaluate the model on the training data
eval_model = classifier.evaluate(X_train, y_train)
print(eval_model)

# Make predictions on the hold-out data
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Get the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions and create the submission file
predictions = (classifier.predict(X_test_df) > 0.5)
predictions = np.concatenate(predictions, axis=0)
my_pred = pd.DataFrame({'ID_code': X_test_df.index, 'target': predictions})

# Set 0s and 1s instead of True and False
my_pred["target"] = my_pred["target"].map({True: 1, False: 0})

# Create CSV file
my_pred.to_csv('pred_ann.csv', index=False)
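
Since the competition is scored on AUC rather than accuracy, it is also worth checking that metric on the hold-out set. A minimal sketch, assuming the fitted classifier and the X_test / y_test split from above:

# Evaluate the hold-out AUC on predicted probabilities rather than hard labels
y_proba = classifier.predict(X_test).ravel()  # sigmoid outputs in [0, 1]
print("AUC of {}".format(roc_auc_score(y_test, y_proba)))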

This model is built upon the excellent review from Renu Khandelwal. We haven’t modified the original script except for adding some layers and increasing the number of neurons per layer.

Our first submission with this Neural Network gives us a score of 0.80882.

2. LightGBM

# LIGHT GBM
# Get the train and test data for the training sequence
train_data = lgbm.Dataset(X_train, label=y_train)
test_data = lgbm.Dataset(X_test, label=y_test)

# Set parameters
parameters = {'application': 'binary',
              'objective': 'binary',
              'metric': 'auc',
              'is_unbalance': 'true',
              'boosting': 'gbdt',
              'num_leaves': 31,
              'feature_fraction': 0.5,
              'bagging_fraction': 0.5,
              'bagging_freq': 20,
              'learning_rate': 0.05,
              'verbose': 0}

# Train our classifier
classifier = lgbm.train(parameters,
                        train_data,
                        valid_sets=test_data,
                        num_boost_round=5000,
                        early_stopping_rounds=100)

# Make predictions
predictions = classifier.predict(X_test_df.values)

# Create submission file
my_pred_lgbm = pd.DataFrame({'ID_code': X_test_df.index, 'target': predictions})

# Create CSV file
my_pred_lgbm.to_csv('pred_lgbm.csv', index=False)

This code chunk is based on some work from this Kaggle Notebook by E. Zietsman. If you want a complete overview of how LightGBM works and how to optimally tune it, make sure you read this article from Pushkar Mandot.

This gives us a score of 0.89217.

3. XGBoost

# XGBOOST
# Instantiate the classifier
classifier = XGBClassifier(tree_method='hist',
                           objective='binary:logistic',
                           eval_metric='auc',
                           learning_rate=0.01,
                           max_depth=2,
                           colsample_bytree=0.35,
                           subsample=0.8,
                           min_child_weight=53,
                           gamma=9,
                           silent=1)

# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold-out data
y_pred = (classifier.predict_proba(X_test)[:, 1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:, 1] >= 0.5).astype(int)

# Create submission file
my_pred_xgb = pd.DataFrame({'ID_code': X_test_df.index, 'target_xgb': predictions})

# Create CSV file
my_pred_xgb.to_csv('pred_xgb.csv', index=False)

We also rely on XGBoost and the helpful insights from Félix Revert.

This gives us a score of 0.59283.

4. CatBoost

# CATBOOST
# Instantiate the classifier
classifier = cb.CatBoostClassifier(loss_function="Logloss",
                                   eval_metric="AUC",
                                   learning_rate=0.01,
                                   iterations=1000,
                                   random_seed=42,
                                   od_type="Iter",
                                   depth=10,
                                   early_stopping_rounds=500)

# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold-out data
y_pred = (classifier.predict_proba(X_test)[:, 1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))
# Get the f1-score
print("f1 score of {}".format(f1_score(y_test, y_pred)))
# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:, 1] >= 0.5).astype(int)

# Create submission file
my_pred_cat = pd.DataFrame({'ID_code': X_test_df.index, 'target_cat': predictions})

# Create CSV file
my_pred_cat.to_csv('pred_cat.csv', index=False)

This part is inspired by Wakame on Kaggle.

This gives us a score of 0.78769.

5. Ensemble

In this last part, we take the 4 models we created and ensemble them in order to generate our final answer. We require at least 3 out of the 4 models to classify an observation as 1 for the ensemble to do so.

# ENSEMBLE
# Create a data frame with one binary prediction column per model.
# The ANN frame was stored as my_pred with a plain "target" column, and the
# LightGBM predictions are probabilities, so both are renamed/thresholded here.
my_pred_ens = pd.concat([my_pred["target"].rename("target_ann"),
                         my_pred_xgb["target_xgb"],
                         my_pred_cat["target_cat"],
                         (my_pred_lgbm["target"] > 0.5).astype(int).rename("target_lgbm")],
                        axis=1, sort=False)

# Review our frame
print(my_pred_ens.describe())

# Sum all the predictions
my_pred_ens["target"] = (my_pred_ens["target_ann"] + my_pred_ens["target_xgb"]
                         + my_pred_ens["target_lgbm"] + my_pred_ens["target_cat"])

# Only assign a 1 if the sum is higher than 2
my_pred_ens["target"] = np.where(my_pred_ens["target"] > 2, 1, 0)

# Remove the per-model target columns
my_pred_ens = my_pred_ens.drop(["target_ann", "target_lgbm", "target_xgb", "target_cat"], axis=1)

# Create submission file
my_pred = pd.DataFrame({'ID_code': X_test_df.index, 'target': my_pred_ens["target"]})

# Create CSV file
my_pred.to_csv('pred_ens.csv', index=False)

This gives us a score of 0.78627.
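
If we wanted to follow the averaging idea from the modelling plan more literally, we could blend the per-model probabilities instead of hard labels. A minimal sketch, assuming the four probability vectors on the Kaggle test set were stored as proba_ann, proba_lgbm, proba_xgb and proba_cat (hypothetical names, not defined in the code above):

# Average the predicted probabilities of the four models (hypothetical arrays)
avg_proba = (proba_ann + proba_lgbm + proba_xgb + proba_cat) / 4

# Threshold the averaged probability to obtain the submitted labels
my_pred_avg = pd.DataFrame({'ID_code': X_test_df.index,
                            'target': (avg_proba > 0.5).astype(int)})
my_pred_avg.to_csv('pred_ens_avg.csv', index=False)

Since the leaderboard metric is AUC, submitting the averaged probabilities directly, without thresholding, would also be a reasonable option.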

IX. Conclusion

Our best model was LightGBM. In order to improve our score, we could rely on stratified k-folds or another cross-validation technique. We could also fine-tune our models in more detail.

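A minimal sketch of what a stratified cross-validation loop for the LightGBM model could look like, reusing the parameters dictionary from above (an illustration, not the setup behind the scores reported here):

# Stratified 5-fold cross-validation for the LightGBM model (illustrative sketch)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
auc_scores = []

for train_idx, valid_idx in skf.split(X_train_df, y_train_df):
    X_tr, X_val = X_train_df.iloc[train_idx], X_train_df.iloc[valid_idx]
    y_tr, y_val = y_train_df.iloc[train_idx], y_train_df.iloc[valid_idx]

    train_data = lgbm.Dataset(X_tr, label=y_tr)
    valid_data = lgbm.Dataset(X_val, label=y_val)

    booster = lgbm.train(parameters, train_data,
                         valid_sets=valid_data,
                         num_boost_round=5000,
                         early_stopping_rounds=100)

    # Out-of-fold probabilities and AUC for this fold
    fold_pred = booster.predict(X_val)
    auc_scores.append(roc_auc_score(y_val, fold_pred))

print("Mean CV AUC: {:.5f}".format(np.mean(auc_scores)))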


Packages used

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.utils import resample
import lightgbm as lgbm
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import IsolationForest
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import catboost as cb
from catboost import Pool
from sklearn.model_selection import KFold

Helpful sources we drew inspiration from

Renu Khandelwal’s neural network walkthrough, E. Zietsman’s LightGBM Kaggle notebook, Pushkar Mandot’s LightGBM tuning article, Félix Revert’s XGBoost insights, Wakame’s CatBoost notebook on Kaggle, Tara Boyle’s article on class imbalance, and MachineLearningMastery’s guide to automatic outlier removal.

Translated from: https://medium.com/@invest_gs/predicting-financial-transactions-with-catboost-lgbm-xgboost-and-keras-ede24a6e4a76
