An Introduction to Supervised Learning in Machine Learning

In this post, we'll go over the concept of supervised learning, what machines need in order to learn, and the process of learning and improving prediction accuracy.

What is Supervised Learning?

When it comes to machine learning, there are primarily four types:

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning

Supervised machine learning refers to the process of training a machine using labeled data. The labels can be numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you'll need to "label" each image with the name of the animal it shows. The machine will then learn to pick up similar patterns in new photos and predict the appropriate label.
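To make "labeled data" concrete, here is a minimal sketch in plain Python. It is not how image recognition works in practice: the two features (weight and ear length), their values, and the `predict` helper are all invented for illustration, and the "model" simply copies the label of the closest known example.

```python
import math

# Hypothetical labeled examples: (weight_kg, ear_length_cm) -> label.
# All feature values are invented for illustration.
labeled_data = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((12.0, 9.0), "dog"),
]

def predict(features):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    nearest = min(labeled_data, key=lambda item: math.dist(item[0], features))
    return nearest[1]

# Neither of these animals appears in the labeled data.
prediction_small = predict((4.2, 6.2))
prediction_large = predict((25.0, 11.0))
print(prediction_small, prediction_large)
```

The point of the sketch is only that every training example carries its answer, and the prediction for an unseen animal is derived from those answers.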

Machine Learning

Machine Learning is a term that refers to the process a machine undergoes so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen this particular cat. But how?

Through training, of course: an iterative process that improves the accuracy of the output (or prediction). In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.

Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and most other services implement some kind of algorithm that uses the data collected from you to learn about you, so it can show you things you like. That's why you spend endless hours scrolling.

The more data you give these services, the more they learn about you. Some of them even know you better than you know yourself.

Supervised vs Unsupervised

Think of it like this:

As humans, we recognize a cat as a cat because our parents and teachers taught us what a cat looks like. They basically "supervised" us and "labeled" our data. However, when we sort friends into good and bad, we rely on our own experiences and observations. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images, or through unsupervised learning, where they make their own judgments based on the data provided to them.
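The contrast can be sketched in a few lines of plain Python. This is a toy illustration under invented assumptions: the weights, the labels, and the midpoint-threshold rule are made up, and the unsupervised half is a tiny one-dimensional k-means with two clusters rather than anything the article prescribes.

```python
from statistics import mean

weights = [3.5, 4.0, 5.0, 18.0, 20.0, 30.0]          # hypothetical animal weights (kg)
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]   # only the supervised path sees these

# Supervised: use the labels to place a decision threshold between the class means.
cat_mean = mean(w for w, label in zip(weights, labels) if label == "cat")
dog_mean = mean(w for w, label in zip(weights, labels) if label == "dog")
threshold = (cat_mean + dog_mean) / 2

def classify(weight):
    return "cat" if weight < threshold else "dog"

# Unsupervised: group the same weights into two clusters without any labels
# (a tiny one-dimensional k-means with k=2).
centers = [min(weights), max(weights)]
for _ in range(10):
    clusters = [[], []]
    for w in weights:
        nearest = 0 if abs(w - centers[0]) <= abs(w - centers[1]) else 1
        clusters[nearest].append(w)
    centers = [mean(c) for c in clusters]

print(classify(6.0))   # a named label
print(clusters)        # two unnamed groups
```

The supervised path returns a name because it was given names; the unsupervised path finds two groups but has no way to know that one of them is called "cat".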

A machine's learning process differs from ours, although it was inspired by how our brains work. To train computers, we mostly use statistical algorithms such as Linear Regression, Decision Trees (DTs), and K-nearest neighbours (KNN).

An algorithm is a sequence of operations that a computer follows to find a solution to a problem (or to determine that no solution exists).

5 things you'll need to train your model

Understand the problem

First, you'll need to understand the problem that you're trying to solve. Machine learning can help answer a broad range of questions, such as:

  • Can we accurately predict diseases in patients?
  • Can we predict the price of houses?

It's important to understand the question we're trying to answer. Let's take the first question from the list above:

"Can we accurately predict diseases in patients?"

We can rephrase this question as:

"Is it possible to utilize historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"

The answer is: Yes. This is known as a classification problem, where the input data is used to assign each patient to one of a list of predetermined categories (for example, likely or unlikely to develop the disease).
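Framing the question this way amounts to separating the inputs (features) from the answer we want to predict (the target). A minimal sketch follows, assuming invented records and field names; `disease` with the values "yes" and "no" stands in for the predetermined categories.

```python
# Hypothetical historical patient records; every value is invented.
patients = [
    {"age": 63, "gender": "M", "blood_pressure": 145, "cholesterol": 233, "disease": "yes"},
    {"age": 41, "gender": "F", "blood_pressure": 120, "cholesterol": 180, "disease": "no"},
    {"age": 57, "gender": "F", "blood_pressure": 150, "cholesterol": 260, "disease": "yes"},
]

feature_names = ["age", "gender", "blood_pressure", "cholesterol"]

# X holds the inputs, y holds the label we want the model to predict.
X = [[p[name] for name in feature_names] for p in patients]
y = [p["disease"] for p in patients]

# The predetermined categories are simply the distinct label values.
categories = sorted(set(y))
print(categories)
```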

Get and prepare the data

Imagine you buy a textbook for a math class, and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You'll need organized information. Similarly, we'll need to prepare our data before we can use it.

The quality of your data will determine the quality of your predictions.

So, the next step is to get the data. It could be located in many places, like:

  • Hospital internal database (SQL)
  • Publicly available information (Web Scraping)
  • Public health records (JSON)

As you can see, the data could be in multiple locations and in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.

Data Wrangling is the process of working with raw data and converting it into a usable form.
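As a rough illustration of what wrangling can involve, the sketch below cleans a handful of raw rows in plain Python. The rows, the `GENDER_MAP` lookup, and the `clean` helper are hypothetical; real projects often use a library such as pandas for the same steps.

```python
# Hypothetical raw rows as they might arrive from different sources.
raw_rows = [
    {"age": "63", "gender": " m ", "blood_pressure": "145"},
    {"age": "63", "gender": " m ", "blood_pressure": "145"},    # exact duplicate
    {"age": "", "gender": "F", "blood_pressure": "120"},        # missing age
    {"age": "57", "gender": "female", "blood_pressure": "150"},
]

# Map the spellings seen in the raw data onto one consistent code.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean(rows):
    """Drop rows with a missing age, normalize types and spellings, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["age"].strip():
            continue  # skip rows with a missing age
        record = (
            int(row["age"]),
            GENDER_MAP[row["gender"].strip().lower()],
            int(row["blood_pressure"]),
        )
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append({"age": record[0], "gender": record[1], "blood_pressure": record[2]})
    return cleaned

clean_rows = clean(raw_rows)
print(clean_rows)
```

Dropping rows with missing values is only one option; filling them in is another, and which is appropriate depends on the data.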

Explore and analyze the data

Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the main characteristics. For example, to understand the distribution of our dataset, we can calculate the mean, median, and range of our age variable. We can also examine the relationship between disease and gender by calculating the percentage of patients of each gender who have the disease.
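The two calculations just described can be sketched with Python's standard library alone. The records and the `disease_rate` helper are invented for illustration.

```python
from statistics import mean, median

# Hypothetical cleaned records; all values are invented.
records = [
    {"age": 63, "gender": "M", "disease": True},
    {"age": 41, "gender": "F", "disease": False},
    {"age": 57, "gender": "F", "disease": True},
    {"age": 35, "gender": "M", "disease": False},
    {"age": 70, "gender": "M", "disease": True},
    {"age": 52, "gender": "F", "disease": False},
]

# Summary statistics for the age variable.
ages = [r["age"] for r in records]
age_summary = {
    "mean": mean(ages),
    "median": median(ages),
    "range": max(ages) - min(ages),
}

def disease_rate(gender):
    """Percentage of patients of the given gender who have the disease."""
    group = [r for r in records if r["gender"] == gender]
    return 100 * sum(r["disease"] for r in group) / len(group)

print(age_summary)
print(round(disease_rate("M"), 1), round(disease_rate("F"), 1))
```

A difference in rates between groups on a sample this small says very little by itself; in a real analysis it is a prompt for further checks, not a conclusion.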

The most common programming languages used for EDA and data analysis are Python and R. Popular Python libraries include matplotlib, seaborn, numpy, and others.

We will not go into technical details in this post, but some common analyses done during the EDA phase include:

  • Data Distribution
  • Dataset Structure
  • Handle Missing Values and Outliers
  • Determine Correlations
  • Evaluate Assumptions
  • Visualize by Plotting
  • Identify Patterns
  • Understand the Relevancy of External Data

Choose a suitable algorithm

As we've seen earlier, we have a classification problem. We can therefore build candidate models using common classification algorithms and then compare their outputs to choose the most accurate one.

For this example, I'm going to use two algorithms that are popular for classification problems:

  • Random Forest
  • Support Vector Machine (SVM)

Train, test, and refine

Using Python and scikit-learn (a machine learning library for Python), we can determine the accuracy of both algorithms on our dataset. We'll train each model on a portion of the data.

While we could use all of the data in our dataset to train the model, we'll split it into two parts. A common choice is an 80/20 split: 80% of the data goes to training, and the remaining 20% is held out for testing. Evaluating on data the model has not seen lets us detect overfitting, which a score computed on the training data would hide. The topic of overfitting was discussed in this article.
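To see why the held-out 20% matters, consider a deliberately bad "model" that only memorizes its training examples. The sketch below is a toy in plain Python: the data, the fixed interleaved split, and the `memorizer` and `accuracy` helpers are all assumptions made for illustration.

```python
# Hypothetical data: the label is True when x is at least 50.
samples = [(x, x >= 50) for x in range(0, 100, 10)]   # 10 samples

# A fixed 80/20 split: every fifth sample is held out for testing.
test = samples[::5]
train = [s for s in samples if s not in test]

# The "model" stores the training answers and guesses False for anything unseen.
memory = dict(train)

def memorizer(x):
    return memory.get(x, False)

def accuracy(model, data):
    """Fraction of examples for which the model returns the correct label."""
    return sum(model(x) == label for x, label in data) / len(data)

train_accuracy = accuracy(memorizer, train)
test_accuracy = accuracy(memorizer, test)
print(train_accuracy, test_accuracy)
```

The memorizer looks perfect when scored on its own training data and much worse on the held-out samples. That gap is the symptom of overfitting that a test set is meant to reveal.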

# ... Previous code omitted for brevity
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

The output is as shown below:

# ... Previous code omitted for brevity
print("SVM Accuracy:", svm_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

SVM Accuracy: 0.2857142857142857
Random Forest Accuracy: 0.8571428571428571

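Comparing the accuracy of SVM and Random Forest, we'll choose Random Forest, since its accuracy is about 86% versus about 29% for SVM. Both values are consistent with a very small test set (6 of 7 and 2 of 7 correct predictions would give exactly these numbers), so the gap should be read with some caution.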

Accuracy is the fraction of the testing data that the model classifies correctly.
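Accuracy can be computed by hand as the number of correct predictions divided by the number of test samples. In the sketch below, the true labels and predictions are invented; they were chosen so that 6 of 7 predictions are correct.

```python
# Hypothetical true labels and model predictions for seven test patients.
y_true = ["yes", "no", "yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "no", "no", "no", "yes"]

# Count the positions where the prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(correct, len(y_true), accuracy)
```

Accuracy is easy to read, but with unbalanced classes (for example, a rare disease) it can look high even when the model misses most positive cases, so it is worth checking alongside other metrics.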

Of course, you can try other algorithms until you're satisfied with the outputs, based on your criteria and the problem you're trying to solve.
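Trying several algorithms usually means running the same train-and-score loop over a set of candidates. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in for the patient data, so its accuracy values will not match the ones reported above; the `candidates` and `results` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the patient dataset (an assumption, not the article's data).
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate models, all with default settings; add more entries to try other algorithms.
candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Train each candidate on the training set and score it on the held-out test set.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(results, key=results.get)
print(results, best_name)
```

Because every candidate is scored on the same held-out data, the comparison is like for like. Repeatedly picking the best score on one test set can still be optimistic, which is one reason cross-validation is often used instead.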

The above is essentially what goes into the Supervised Machine Learning process. It's important to highlight that this is an iterative process that does not end after training. We need to deploy the model and gather feedback from stakeholders, which could lead to refining the model based on new data and other factors.

Conclusion

Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps such as Data Wrangling and EDA that are crucial to the prediction accuracy and relevance of your model.
