PyCaret: The Machine Learning Omnibus

Every machine learning project starts with loading a dataset and ends (or continues?!) with finalizing the best model, or ensemble of models, for predictions on unseen data and for production deployment.

As machine learning practitioners, we know there are several pit stops along the way to the best possible prediction performance. These intermediate steps include Exploratory Data Analysis (EDA) and data preprocessing (missing value treatment, outlier treatment, changing data types, encoding categorical features, data transformation, feature engineering and selection, sampling, and the train-test split, to name a few) before we can embark on model building, evaluation and then prediction.

We end up importing dozens of Python packages to help with this, which means getting familiar with the syntax and parameters of multiple function calls in each of them.

Have you ever wished for a single package that handles the entire journey end to end with a consistent interface? I sure have!

Enter PyCaret

These wishes were answered by the PyCaret package, and it is now even more awesome with the release of pycaret 2.0.

Starting with this article, I will post a series on how pycaret helps us zip through the various stages of an ML project.

Installation

Installation is a breeze and takes only a few minutes, with all dependencies installed along the way. It is recommended to install into a virtual environment, such as a Python 3 virtualenv or a conda environment, to avoid clashes with other pre-installed packages.

pip install pycaret==2.0

Once installed, we are ready to begin! We import the package into our notebook environment. We will take up a classification problem here. Similarly, the respective PyCaret modules can be imported for regression, clustering, anomaly detection, NLP and association rule mining.

We will use the Titanic dataset from kaggle.com, where it is available for download.

Let's check the first few rows of the dataset using the head() function.
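The screenshots of the original notebook cells are not reproduced here, so the following is a minimal sketch of the loading and preview steps rather than the author's exact code. It assumes the Kaggle training file was saved locally as train.csv (a hypothetical path) and that pandas is installed; load_titanic and preview are illustrative helper names.

```python
from pathlib import Path

# Assumption: the Kaggle Titanic training file was saved under this name.
DATA_PATH = Path("train.csv")


def load_titanic(path=DATA_PATH):
    """Read the CSV file into a pandas DataFrame."""
    # pandas is imported lazily so the helpers can be defined without it.
    import pandas as pd

    return pd.read_csv(path)


def preview(df, n=5):
    """Return the first n rows, which is what head() shows in a notebook."""
    return df.head(n)
```

In a notebook this would typically follow `from pycaret.classification import *`, with `data = load_titanic()` and then `preview(data)`.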

Setup

The setup() function of pycaret does most (correction: all) of the heavy lifting that would otherwise take dozens of lines of code, in just a single line!

We just need to pass the dataframe and specify the name of the target feature as arguments. The setup command then prints a table of the data types it has inferred for each feature.
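As a rough sketch of that single call (not the author's original cell, which was an image), the wrapper below passes a DataFrame and a target name to setup(). The target column name Survived follows the Kaggle file and is an assumption here, as is the helper name run_setup.

```python
def run_setup(data, target="Survived", **options):
    """Initialise a PyCaret classification experiment on a DataFrame.

    Assumption: the target column is called "Survived", as in the Kaggle
    Titanic file. Extra keyword arguments are forwarded to setup() unchanged.
    """
    # Imported lazily so this helper can be defined without PyCaret installed.
    from pycaret.classification import setup

    return setup(data=data, target=target, **options)
```

A call such as `run_setup(data)` is then all that is needed to start the experiment; further setup() arguments can be passed as keywords.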

setup has helpfully inferred the data types of the features in the dataset. If we agree with them, all we need to do is hit Enter. Otherwise, if you think the inferred data types are not correct, you can type quit in the field at the bottom and go back to the setup function to make changes. We will see how to do that shortly. For now, let's hit Enter and see what happens.


Whew! A whole lot seems to have happened under the hood in just one line of innocuous-looking code! Let's take stock:


  • checked for missing values
  • identified numeric and categorical features
  • created train and test data sets from the original dataset
  • imputed missing values in continuous features with the mean
  • imputed missing values in categorical features with a constant value
  • done label encoding
  • ...and a whole host of other options seem to be available, including outlier treatment, data scaling, feature transformation, dimensionality reduction, multi-collinearity treatment, feature selection and handling of imbalanced data!
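To make the two imputation steps in the list concrete, here is a small plain-Python sketch of the idea. It is not PyCaret's internal code: the helper names, the toy values and the placeholder string are assumptions chosen for illustration.

```python
from statistics import mean


def impute_numeric_with_mean(values):
    """Replace missing (None) entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_categorical_with_constant(values, constant="missing"):
    """Replace missing (None) entries with one fixed placeholder category."""
    return [constant if v is None else v for v in values]


# Toy values for illustration only.
ages = impute_numeric_with_mean([22, None, 26])
cabins = impute_categorical_with_constant(["C85", None])
```

The numeric gap is filled with the average of the values that are present, while the categorical gap simply becomes its own category.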

But hey! What is that on lines 11 and 12 of the output? The number of features in the train and test datasets is 1745? This looks like categorical encoding gone berserk, most probably on high-cardinality categorical features like name, ticket and cabin (strictly speaking this must be one-hot style encoding, since label encoding alone does not add columns). Later in this article and in the next, we will look at how we can control setup to address such cases proactively.
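A quick way to see where a feature count like this can come from: if every distinct value of a categorical column becomes its own indicator column, nearly unique columns contribute almost one column per row. The sketch below counts those columns for a few made-up rows; the rows and the helper name one_hot_width are hypothetical, and this illustrates the arithmetic only, not PyCaret's encoder.

```python
def one_hot_width(rows, categorical_columns):
    """Count the indicator columns a plain one-hot encoding would create."""
    return sum(len({row[col] for row in rows}) for col in categorical_columns)


# Made-up rows for illustration only (not taken from the real dataset).
rows = [
    {"Name": "Passenger A", "Sex": "male", "Ticket": "T1"},
    {"Name": "Passenger B", "Sex": "female", "Ticket": "T2"},
    {"Name": "Passenger C", "Sex": "male", "Ticket": "T2"},
]

# Three distinct names, two sexes and two tickets.
all_columns = one_hot_width(rows, ["Name", "Sex", "Ticket"])
# Only the low-cardinality column.
relevant_only = one_hot_width(rows, ["Sex"])
```

With free-text columns such as names included, the width grows with the number of rows; with only low-cardinality columns it stays small.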

Customizing setup

To start with, how can we exclude features such as the three above from model building? We pass the variables we want to exclude in the ignore_features argument of the setup function. Note that ID and DateTime columns, when inferred, are automatically set to be ignored for modelling.
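A hedged sketch of such a call is given here, since the original cell was an image. The column names assume the Kaggle file's capitalisation (Name, Ticket, Cabin, Survived), and run_setup_ignoring is a hypothetical helper name; only setup() and its ignore_features argument come from pycaret.

```python
# Assumption: Kaggle-style column names for the three high-cardinality features.
IGNORED_FEATURES = ["Name", "Ticket", "Cabin"]


def run_setup_ignoring(data, target="Survived", ignored=IGNORED_FEATURES):
    """Run setup() while excluding the given columns from modelling."""
    # Imported lazily so this helper can be defined without PyCaret installed.
    from pycaret.classification import setup

    return setup(data=data, target=target, ignore_features=list(ignored))
```

Passing a different list as `ignored` lets the same helper drop other columns without touching the DataFrame itself.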

Note that pycaret, while asking for our confirmation, has dropped the three features mentioned above. Let's press Enter and proceed.

In the resulting output, we can see that after setup the dataset shape is much more manageable, with encoding applied only to the remaining, more relevant categorical features.

In the next article in this series, we will look in detail at further data preprocessing tasks we can perform on the dataset by passing additional arguments to this single setup function of pycaret.

But before we go, let's do a flash-forward to the amazing model comparison capabilities of pycaret using the compare_models() function.

Figure: model performance compared on various classification metrics.

Boom! All it takes is compare_models() to get the results of 15 classification algorithms compared across various classification metrics under cross-validation. At a glance, we can see that the CatBoost classifier performs best on most of the metrics, with Naive Bayes doing well on recall and Gradient Boosting on precision. The top-performing model for each metric is highlighted automatically by pycaret.
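The comparison step can be sketched as a thin wrapper around compare_models(). This is an illustrative helper (rank_models is a hypothetical name), and it assumes setup() has already been run in the same session; the sort argument selects the metric column used for ranking.

```python
def rank_models(sort="Accuracy"):
    """Cross-validate the available classifiers and rank them by one metric.

    Must be called after setup() has initialised the experiment.
    """
    # Imported lazily so this helper can be defined without PyCaret installed.
    from pycaret.classification import compare_models

    return compare_models(sort=sort)
```

For instance, `rank_models(sort="Recall")` would order the comparison table by recall instead of accuracy.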

Depending on the model evaluation metric(s) we are interested in, pycaret helps us zoom straight in on the top-performing model, which we can then tune further through its hyperparameters. More on this in upcoming articles.

In conclusion, we have caught a glimpse of how pycaret can help us fast-track the ML project life cycle through minimal code combined with extensive customization of the critical data preprocessing stages.

You may also be interested in my other articles on awesome packages that use minimal code to deliver maximum results in Exploratory Data Analysis (EDA) and visualization.

Thanks for reading. I would love to hear your feedback. Cheers!

Translated from: https://towardsdatascience.com/pycaret-the-machine-learning-omnibus-dadf6e230f7b
