Heart Disease Classifier Tuning Using Optuna and mlflow
Background
Data science should be an enjoyable process focused on delivering insights and real benefits. However, that enjoyment can sometimes get lost in tools and processes. Nowadays it is important for an applied data scientist to be comfortable using tools and software for data/code versioning, reproducible model development, experiment tracking, and model inspection — to name a few!
The purpose of this article is to briefly touch upon reproducible model development and experiment tracking through a worked example for tuning a Random Forest classifier using Optuna and mlflow.
The heart disease data and preprocessing
This example uses the heart disease data available in the UCI ML archive. In previous work I describe some processing done to create a semi-cleaned dataset: developing a model for heart disease prediction using pycaret. In the .csv file provided for this short worked example, I have additionally converted any categorical features with two values into a single binary feature, where for feature_name_x the _x suffix indicates the level represented by a value of 1. The target to predict is heart disease (hd_yes), and we want to tune a Random Forest classifier to do so.
After loading the data and creating the feature matrix and target vector, I created a preprocessing pipeline. For numeric features, missing values were imputed using the median and the features then scaled. For binary features, missing values were imputed using the mode. For the three-level categorical feature chest pain, missing values were treated as a separate category and leave-one-out target encoding was applied. Finally, these processing steps were wrapped up into a single processing pipeline.
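A minimal sketch of what such a pipeline could look like with scikit-learn and category_encoders is shown below; the column groupings and exact transformer choices are assumptions for illustration rather than the code from my notebook.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from category_encoders import LeaveOneOutEncoder

# Assumed column groupings; the real notebook's column names may differ.
num_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]  # numeric features (assumed)
bin_cols = ["sex_male", "fbs_true", "exang_yes"]              # binary 0/1 features (assumed)
cat_cols = ["cp"]                                             # three-level chest pain feature

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation
    ("scale", StandardScaler()),                   # then scaling
])
binary_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),  # missing as its own level
    ("encode", LeaveOneOutEncoder()),                                      # leave-one-out target encoding
])

# Wrap everything into a single preprocessing step.
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("bin", binary_pipe, bin_cols),
    ("cat", categorical_pipe, cat_cols),
])
```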
Using Optuna and mlflow
With our feature matrix, target vector and preprocessing pipeline ready to go, we can now tune a Random Forest classifier to predict heart disease. Note that for the purpose of this demonstration I am going to forgo the use of a hold-out test set. To do the hyper-parameter optimization (model development) we will use Optuna, and for experiment tracking, one of its newer features: mlflow integration (see: mlflow callback in Optuna and mlflow tracking).
First, we begin by defining the objective function to optimize with Optuna (Figure 1; a sketch follows the list below), which has three main components:
Defining the space of hyper-parameters and values to optimize: in our case we focus on six Random Forest hyper-parameters (n_estimators, max_depth, max_features, bootstrap, min_samples_split, min_samples_leaf)
The Machine Learning pipeline: we have a preprocessing pipeline plus a Random Forest classifier
The calculation of the metric to optimize: we use the 5-fold CV average ROC AUC
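A minimal sketch of such an objective function is shown below, assuming the preprocessing pipeline above is available as preprocessor and that X and y hold the feature matrix and target vector; the search ranges are illustrative assumptions rather than the exact values in Figure 1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def objective(trial):
    # 1. The hyper-parameter search space (ranges are illustrative assumptions)
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 20),
        "max_features": trial.suggest_int("max_features", 2, 10),
        "bootstrap": trial.suggest_categorical("bootstrap", [True, False]),
        "min_samples_split": trial.suggest_float("min_samples_split", 0.01, 0.5),
        "min_samples_leaf": trial.suggest_float("min_samples_leaf", 0.01, 0.5),
    }

    # 2. The machine learning pipeline: preprocessing plus a Random Forest classifier
    model = Pipeline([
        ("preprocess", preprocessor),
        ("rf", RandomForestClassifier(random_state=42, **params)),
    ])

    # 3. The metric to optimize: 5-fold CV average ROC AUC
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    return scores.mean()
```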
Next, we define a callback to mlflow, one of the new (and experimental!) features in Optuna 2.0 in order to track our classifier hyper-parameter tuning (Figure 2).
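A minimal sketch of the callback definition, assuming the default local tracking location (the mlruns folder) and a metric name of roc_auc:

```python
from optuna.integration.mlflow import MLflowCallback

# Log every trial's parameters and objective value to mlflow.
# tracking_uri=None writes to the local ./mlruns folder by default.
mlflc = MLflowCallback(
    tracking_uri=None,
    metric_name="roc_auc",
)
```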
Then we instantiate the Optuna study object to maximize our chosen metric (ROC AUC) (Figure 3). We also apply Hyperband pruning, which helps find optima in less time by stopping unpromising trials early. Note that this typically brings more benefit for more intensive and difficult optimization such as in deep learning.
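A sketch of the study creation; the study name here is an assumed label for illustration:

```python
import optuna

study = optuna.create_study(
    study_name="heart_disease_rf",            # assumed name for illustration
    direction="maximize",                     # we maximize ROC AUC
    pruner=optuna.pruners.HyperbandPruner(),  # stop unpromising trials early
)
```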
Finally, we run the hyper-parameter tuning using 200 trials (Figure 4). It may take a couple of minutes to run.
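Kicking off the search with the mlflow callback attached could look like this:

```python
# 200 trials, logging each one to mlflow via the callback defined earlier.
study.optimize(objective, n_trials=200, callbacks=[mlflc])

print("Best ROC AUC:", study.best_value)
print("Best params:", study.best_params)
```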
The mlflow-logged experiment, including the assessed hyper-parameter configurations for the Random Forest classifier (the Optuna study/trials), is stored in a folder called mlruns. To view these results using the mlflow user interface, do the following:
- Open a shell (for example a Windows PowerShell) in the directory containing the mlruns folder
- Activate the virtual environment used for this notebook, i.e., conda activate optuna_env
- Run mlflow ui
- Navigate to the localhost address provided, something like: http://kubernetes.docker.internal:5000/
You should see Figure 5, and you can sift through the experiment trials. I deselected the user, source and version columns, as well as the tags section to simplify the view. I also sorted the results by ROC AUC. The top hyper-parameter configurations provided a ROC AUC of 0.917, where max_depth was in the range 14 to 16, max_features was 4, min_samples_leaf was 0.1, min_samples_split was 0.2, and n_estimators was 340 (note these results may differ slightly for you if you re-run the notebook).
Optuna also has great visualizations for summarizing experiments. Here we look at hyper-parameter importance and the optimization history (Figure 6). We can see that the min_samples_leaf is the most important hyper-parameter to tune in this example. The hyper-parameter importance is calculated using a functional ANOVA approach. For more information see Hutter, Hoos and Leyton-Brown 2014.
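A sketch of how these plots can be generated with Optuna's built-in visualization module (plotly-based, so they render directly in a Jupyter notebook):

```python
from optuna.visualization import plot_optimization_history, plot_param_importances

# Hyper-parameter importances (functional ANOVA based evaluation).
plot_param_importances(study).show()

# Objective value (ROC AUC) over the 200 trials.
plot_optimization_history(study).show()
```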
Summary
Optuna has come a long way since its inception, and version 2.0, released in July this year, has some fantastic additions, one of which we touched upon in this short article: the mlflow callback. Perhaps as expected the tuning worked great, and the summaries available via the built-in visualizations in Optuna and the mlflow UI made this a lot of fun.
The jupyter notebook, virtual environment and data used for this article are available at my GitHub. As always comments, thoughts, feedback and discussion are very welcome.
Translated from: https://medium.com/@jasonpben/heart-disease-classifier-tuning-using-optuna-and-mlflow-fc1366eefdec