
In this post we'll go into the concept of supervised learning, the requirements for machines to learn, and the process of learning and enhancing prediction accuracy.


What is Supervised Learning

When it comes to machine learning, there are primarily four types:


  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning

Supervised machine learning refers to the process of training a machine using labeled data. Labels can be numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you’ll need to “label” each image with the name of the animal it shows. The machine will then learn to pick up similar patterns in new photos and predict the appropriate label.
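To make "labeled data" concrete, here is a minimal sketch in plain Python. The features (weight and ear length) and every value are invented for illustration; a real image dataset would carry pixel data instead, but the pairing of inputs with labels would look much the same.

```python
# Each training example pairs input features with a human-provided label.
# Feature names and values are invented purely for illustration.
labeled_data = [
    ({"weight_kg": 4.0, "ear_cm": 6.5}, "cat"),
    ({"weight_kg": 3.5, "ear_cm": 7.0}, "cat"),
    ({"weight_kg": 20.0, "ear_cm": 11.0}, "dog"),
    ({"weight_kg": 30.0, "ear_cm": 12.5}, "dog"),
]

# Most learning libraries expect features (X) and labels (y) as separate lists.
X = [[row["weight_kg"], row["ear_cm"]] for row, _ in labeled_data]
y = [label for _, label in labeled_data]

print(X)
print(y)
```

The "supervision" is simply the `y` list: someone had to provide the right answer for every row.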


Machine Learning

Machine Learning is a term that refers to the process that a machine undergoes so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen this particular cat. But how?


Through training, of course, which is an iterative process that gradually improves the accuracy of the output (the prediction). In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.
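A toy sketch can show what "iterative" means here. It uses gradient descent, one common training method, to fit a single weight; that choice is an assumption made for illustration, since this post does not say which method any particular model uses. The data and learning rate are made up.

```python
# Toy data that follows y = 2x, so the "right" weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean_squared_error(w):
    """Average squared gap between predictions (w * x) and the labels."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial guess
learning_rate = 0.01  # step size, picked by hand for this toy example
errors = []

for step in range(200):
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient  # nudge w in the direction that lowers the error
    errors.append(mean_squared_error(w))

print(round(w, 3), errors[0] > errors[-1])
```

Each pass nudges the weight a little closer to the value that best explains the labeled examples, which is why the error at the end is lower than at the start.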


Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and most similar services use algorithms that learn from the data they collect about you so they can recommend things you are likely to enjoy. That's why you can spend endless hours scrolling.

The more data you give these services, the more they learn about you. Some of them may even know you better than you know yourself.


Supervised vs Unsupervised

Think of it like this:

As humans, we recognize a cat as a cat because we have been taught what a cat looks like by our parents and teachers. They basically "supervised" us and "labeled" our data. However, when we classify good and bad friends, we rely on our personal experiences and observations to achieve this. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images, or through unsupervised learning, where they make their own judgments based on the data provided to them.


A machine's learning process differs from ours, although parts of it were inspired by the human brain. To train computers, we'll mostly use statistical algorithms such as Linear Regression, Decision Trees (DTs), and K-Nearest Neighbors (KNN).
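As an illustration of one of these algorithms, here is a small K-Nearest Neighbors classifier written with only the Python standard library. It is a sketch rather than a production implementation; the function name `knn_predict`, the points, and the labels are all invented for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented 2-D points forming two loose clusters labeled "A" and "B".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.8, 6.2)))
```

KNN has no real training phase: it keeps the labeled examples and compares each new input against them at prediction time.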


An algorithm is a sequence of operations that is typically used by computers to find the correct solution to a problem (or identify that there are no correct solutions).


5 things you'll need to train your model


Understand the problem

First, you'll need to understand the problem that you're trying to solve. Usually, we can use machine learning to answer a broad range of questions, things like:


  • Can we accurately predict diseases in patients?
  • Can we predict the price of houses?

It's important to understand the question we're trying to answer. Let's take the first question from the list above:


"Can we accurately predict diseases in patients?"

We can rephrase this question to:

"Is it possible to utilize historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"


The answer is: Yes. This is known as a classification problem: the input data is used to predict which of a predetermined set of categories applies, here whether a new patient is likely to develop a given disease.
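Before a model can use a patient record, the record usually has to be turned into numbers. The sketch below shows one common way to do that, one-hot encoding a categorical field. The record, the field names, the category list, and the label are all hypothetical.

```python
# Hypothetical patient record; field names and values are invented for illustration.
patient = {"age": 54, "gender": "female", "blood_pressure": 130, "cholesterol": 220}

GENDER_CATEGORIES = ["female", "male"]  # assumed category list for this sketch

def encode_patient(record):
    """Turn one record into a flat list of numbers (one-hot encoding for gender)."""
    gender_one_hot = [1 if record["gender"] == g else 0 for g in GENDER_CATEGORIES]
    return [record["age"], record["blood_pressure"], record["cholesterol"]] + gender_one_hot

features = encode_patient(patient)
label = "heart_disease"  # made-up example of one predetermined category

print(features, label)
```

The numeric list is the input and the category is the answer the model learns to predict.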


Get and prepare the data

Imagine you buy a textbook for a math class, and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You need organized information. Similarly, we need to prepare our data before we can use it.


The quality of your data will determine the quality of your predictions.


So, the next step is to get the data. It could be located in many places, like:


  • Hospital internal database (SQL)
  • Publicly available information (Web Scraping)
  • Public health records (JSON)

As you can see, the data could be in multiple locations and in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.


Data Wrangling is the process of working with raw data and converting it into a usable form.
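As a rough sketch of what wrangling can look like, the snippet below merges a CSV export and a JSON export into one consistent list of records. Both inputs, their field names, and the helper `normalize_gender` are invented for illustration; real sources will need their own cleaning rules.

```python
import csv
import io
import json

# Stand-ins for two sources; contents and field names are invented.
csv_export = "age,gender,bp\n54,F,130\n61,M,\n"
json_export = '[{"Age": 47, "Sex": "male", "BloodPressure": 118}]'

def normalize_gender(value):
    """Map short codes like 'F' or 'M' to one consistent spelling."""
    cleaned = value.strip().lower()
    return {"f": "female", "m": "male"}.get(cleaned, cleaned)

records = []
for row in csv.DictReader(io.StringIO(csv_export)):
    records.append({
        "age": int(row["age"]),
        "gender": normalize_gender(row["gender"]),
        # Empty strings become None so missing values are explicit.
        "blood_pressure": int(row["bp"]) if row["bp"] else None,
    })
for item in json.loads(json_export):
    records.append({
        "age": item["Age"],
        "gender": normalize_gender(item["Sex"]),
        "blood_pressure": item["BloodPressure"],
    })

print(records)
```

After this step every record has the same keys and value conventions, regardless of where it came from.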

Explore and analyze the data

Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the main characteristics. For example, to understand the distribution of our dataset we can calculate the mean, median, and range of our age variable. We can also examine the relationship between disease and gender by calculating the percentage of patients of each gender who have the disease.
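Those two calculations can be sketched with Python's built-in `statistics` module. The six patients below are invented sample data, not the dataset used in this post.

```python
import statistics

# Invented sample; each tuple is (age, gender, has_disease).
patients = [
    (34, "female", False), (51, "male", True), (47, "female", True),
    (62, "male", True), (29, "female", False), (58, "male", False),
]

# Summary statistics for the age variable.
ages = [age for age, _, _ in patients]
print("mean age:", statistics.mean(ages))
print("median age:", statistics.median(ages))
print("age range:", max(ages) - min(ages))

def disease_rate(gender):
    """Percentage of patients of the given gender who have the disease."""
    group = [has_disease for _, g, has_disease in patients if g == gender]
    return 100 * sum(group) / len(group)

for gender in ("female", "male"):
    print(gender, round(disease_rate(gender), 1))
```

In practice you would likely do the same with pandas or numpy, but the arithmetic is no more than this.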


The most common programming languages used for EDA and data analysis are Python and R. Popular Python libraries include matplotlib, seaborn, and numpy, among others.


We will not go into technical details in this post, but some common analyses done during the EDA phase include:


  • Data Distribution
  • Dataset Structure
  • Handle Missing Values and Outliers
  • Determine Correlations
  • Evaluate Assumptions
  • Visualize by Plotting
  • Identify Patterns
  • Understand the Relevancy of External Data
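To give a flavor of one item from that list, here is a minimal sketch of handling missing values and outliers using only the standard library. Filling with the median and flagging with the 1.5 × IQR rule are common conventions chosen here as assumptions, not steps this post prescribes, and the readings are invented.

```python
import statistics

# Invented cholesterol readings; None marks a missing value, 900 is a likely typo.
cholesterol = [190, 210, None, 205, 900, 198, 220]

# 1) Fill missing values with the median of the observed ones.
observed = [v for v in cholesterol if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in cholesterol]

# 2) Flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
outliers = [v for v in filled if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(median, outliers)
```

Whether a flagged value should be corrected, dropped, or kept is a judgment call that depends on the problem.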

Choose a suitable algorithm

As we've seen earlier, we have a classification problem. We can therefore build model candidates using common classification algorithms and then compare outputs to choose the most accurate.


For this example, I'm going to use two algorithms popular for solving classification problems:


  • Random Forest
  • Support Vector Machine (SVM)
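The "build candidates, then compare" pattern can be sketched without any libraries. To stay dependency-free, the snippet below swaps in two deliberately simple stand-in classifiers (a majority-class baseline and an age threshold) instead of Random Forest and SVM; only the comparison pattern carries over. All data and function names are invented.

```python
from collections import Counter

# Invented data: (age, has_disease). Not taken from this post's dataset.
train = [(30, False), (35, False), (40, False), (45, False),
         (55, True), (60, True), (65, True)]
test = [(32, False), (48, False), (58, True), (70, True)]

def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda age: most_common

def fit_threshold(rows):
    """Predict disease when age exceeds the midpoint between the two class means."""
    pos = [age for age, label in rows if label]
    neg = [age for age, label in rows if not label]
    cutoff = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda age: age > cutoff

def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(age) == label for age, label in rows) / len(rows)

# Train every candidate on the same data, score on the same held-out rows.
candidates = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
scores = {name: accuracy(model, test) for name, model in candidates.items()}
best = max(scores, key=scores.get)

print(scores, best)
```

Every candidate is trained on the same rows and scored on the same held-out rows, which keeps the comparison fair.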

Train, test, and refine

Using Python and scikit-learn (a machine learning library for Python), we can determine the accuracy of both algorithms given our dataset. We'll train the model by giving it a piece of the data.


While we could use all of the data in our dataset to train the model, we'll split it into two parts. A common choice is an 80/20 split, meaning 80% of our data goes to training and the remaining 20% is held back for testing. Testing on data the model has never seen tells us how well it generalizes and helps us detect overfitting. The topic of overfitting was discussed in this article.
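To see what such a split does under the hood, here is a plain-Python sketch. The helper `split_train_test` is hypothetical and written only for this illustration; it does not reproduce the exact rows that any particular library would select.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

rows = list(range(10))  # stand-in for 10 labeled examples
train_rows, test_rows = split_train_test(rows)

print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters: if the data is sorted, say by date or by label, taking the last 20% as-is could give a test set that looks nothing like the training set. In practice, scikit-learn's `train_test_split` performs this kind of shuffle-and-split in a single call.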


# ... Previous code omitted for brevity
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

The code that prints both accuracy scores, along with its output, is shown below:

# ... Previous code omitted for brevity
print("SVM Accuracy:", svm_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

SVM Accuracy: 0.2857142857142857
Random Forest Accuracy: 0.8571428571428571

Looking at the SVM vs. Random Forest accuracy results, we'll choose Random Forest since it has an accuracy of about 86% vs. just 29% for SVM.


Accuracy refers to the ability of the model to correctly classify the disease given a set of testing data.
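In code, accuracy is simply the number of correct predictions divided by the total number of predictions. The sketch below uses invented labels and a hypothetical `accuracy` helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels for a seven-row test set.
y_true = ["sick", "healthy", "sick", "healthy", "sick", "healthy", "sick"]
model_a = ["sick", "healthy", "sick", "healthy", "sick", "healthy", "healthy"]  # 6 of 7 right
model_b = ["healthy", "sick", "healthy", "sick", "healthy", "healthy", "sick"]  # 2 of 7 right

print(accuracy(y_true, model_a))
print(accuracy(y_true, model_b))
```

The fractions 6/7 and 2/7 happen to equal the two accuracy values printed above, which would be consistent with a test set of only seven rows. That is an inference, since the post does not state the dataset size. On a test set that small, a single prediction shifts accuracy by roughly 14 percentage points.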


Obviously, you can try other algorithms until you're satisfied with the outputs based on your criteria and the problem you're trying to solve.


The above is essentially what goes into the Supervised Machine Learning process. It's important to highlight that this is an iterative process and does not end after training. We need to deploy the model and acquire feedback from stakeholders which could lead to model refinement based on new data and other factors.


Conclusion

Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps such as Data Wrangling and EDA, which are crucial to the accuracy and relevance of your model's predictions.






