Data Mining: 2. Classification

3. Classification

Given a collection of records (training set)
– each record contains a set of attributes
– one of the attributes is the class (label) that should be predicted
Find a model for class attribute as a function of the values of other attributes
Goal: previously unseen records should be assigned a class as accurately as possible
Application areas: direct marketing, fraud detection
Why is it needed?
Classic programming often does not work here, because the required knowledge is missing and the task is difficult to formalize as an algorithm

3.1 Definition

Given: a set of labeled records, consisting of
• data fields (a.k.a. attributes or features)
• a class label (e.g., true/false)
Generate: a function which can be used for classifying previously unseen records
• input: a record
• output: a class label

3.2 k Nearest Neighbors

The k nearest neighbors of a record x are the data points that have the k smallest distances to x.
Requires: stored records, a distance metric, and a value for k
Steps: compute the distance to each training record -> identify the k nearest neighbors -> use the class labels of the nearest neighbors to determine the class label of the unknown record, by majority vote or by weighting each vote according to distance
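
The three steps can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a plain (unweighted) majority vote; the function name knn_predict and the toy data are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a tuple of numbers."""
    # Step 1: compute the distance from the query to each training record.
    distances = [(math.dist(point, query), label) for point, label in train]
    # Step 2: identify the k nearest neighbors.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (4.9, 5.0), k=3))
```

Note that every call scans the whole training set, which is why classification time grows linearly with the number of stored records.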

Choosing k
Rule of thumb: test k values between 1 and 10.
k too small -> sensitive to noise points
k too large -> neighborhood may include points from other classes

k-NN is often very accurate, but slow at classification time because the training data has to be searched. It can handle decision boundaries that are not parallel to the axes.

3.3 Nearest Centroids = Rocchio classifier

Assign each data point to the nearest centroid (the center of all points of that class). Nearest Centroid is a simple classification algorithm that computes the centroid (average position) of each class from the training data, so each class is represented by a single point. During testing, it classifies a new instance by assigning it to the class whose centroid is closest.
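
A minimal sketch of this idea, assuming numeric attributes and Euclidean distance; fit_centroids and predict_centroid are illustrative names, and the data is made up.

```python
import math
from collections import defaultdict

def fit_centroids(train):
    """Compute one centroid (mean point) per class from (point, label) pairs."""
    groups = defaultdict(list)
    for point, label in train:
        groups[label].append(point)
    return {label: tuple(sum(coord) / len(points) for coord in zip(*points))
            for label, points in groups.items()}

def predict_centroid(centroids, query):
    """Assign the query to the class whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], query))

# Hypothetical toy data.
train = [((0.0, 0.0), "A"), ((2.0, 0.0), "A"),
         ((10.0, 10.0), "B"), ((12.0, 10.0), "B")]
centroids = fit_centroids(train)
print(centroids)
print(predict_centroid(centroids, (3.0, 1.0)))
```

Only the centroids are kept after training, so memory and classification time depend on the number of classes, not on the number of training records.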

k-NN vs. Nearest Centroid
k-NN

  • slow at classification time (linear in number of data points)
  • requires much memory (storing all data points)
  • robust to outliers

Nearest Centroid

  • fast at classification time (linear in number of classes)
  • requires only little memory (storing only the centroids)
  • robust to label noise
  • robust to class imbalance

Which classifier is better? Strongly depends on the problem at hand!

3.4 Bayes Classifier

P(C|A): conditional probability (how likely is C, given that we observe A)
Conditional probability: P(C|A) = P(A,C) / P(A)
Bayes Theorem: P(C|A) = P(A|C) * P(C) / P(A)

Example 1 - Bayes Theorem
Example 2 - Bayes Theorem

Estimating the Prior Probability P(C)
counting the records in the training set that are labeled with class Cj, dividing the count by the overall number of records
Estimating the Conditional Probability P(A | C)
Naïve Bayes assumes that all attributes are statistically independent given the class
This independence assumption allows the joint probability P(A|Cj) to be written as the product of the individual probabilities P(Ai|Cj).
P(A1,A2,…An|Cj) = P(A1|Cj) * P(A2|Cj) * … * P(An|Cj)
Estimating the Probabilities P(Ai|Cj)
count how often an attribute value co-occurs with class Cj, divide by the overall number of instances in class Cj
Bayes Theorem Example
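
The counting-based estimates can be sketched as below for categorical attributes. This is an illustrative sketch, not a full implementation: it uses no smoothing, the names fit_naive_bayes and nb_predict are made up, and the records are hypothetical. The scores are the unnormalized products P(Cj) * P(A1|Cj) * … * P(An|Cj).

```python
from collections import Counter, defaultdict

def fit_naive_bayes(records, labels):
    # Prior P(Cj): count of records with class Cj divided by all records.
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    # Count how often each attribute value co-occurs with each class.
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for record, c in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, c)][value] += 1

    def cond_prob(i, value, c):
        # P(Ai = value | Cj): co-occurrence count divided by class size.
        return value_counts[(i, c)][value] / class_counts[c]

    return priors, cond_prob

def nb_predict(priors, cond_prob, record):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(record):
            score *= cond_prob(i, value, c)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (outlook, windy) -> play?
records = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
           ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "no", "no", "yes", "yes"]

priors, cond_prob = fit_naive_bayes(records, labels)
label, scores = nb_predict(priors, cond_prob, ("sunny", "no"))
print(label, scores)
```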

Handling Numerical Attributes
1. Discretize numerical attributes before learning the classifier.
2. Assume that numerical attributes follow a normal distribution given the class.
   Use the training data to estimate the parameters of the distribution (e.g., mean and standard deviation).
   Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|Cj).
Normal Distribution
Normal Distribution Example 1

Normal Distribution Example 2
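
A small sketch of the second option, assuming the sample standard deviation is used as the estimate; gaussian_pdf is an illustrative name and the attribute values are hypothetical. The density value then plays the role of P(Ai|Cj) in the Naïve Bayes product.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical values of one numerical attribute for the records of one class.
values_in_class = [83, 70, 68, 64, 69, 75, 75, 72, 81]

mu = mean(values_in_class)      # estimated mean for this class
sigma = stdev(values_in_class)  # estimated (sample) standard deviation

# Density used in place of P(Ai = 66 | Cj).
print(gaussian_pdf(66, mu, sigma))
```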

Handling Missing Values
Missing values may occur in training and in unseen classification records.
Training: Record is not included into frequency count for attribute value-class combination
Classification: Attribute will be omitted from calculation

Zero Frequency Problem
one of the conditional probabilities is zero -> the entire expression becomes zero
It is not unlikely that exactly the same data point has not been observed yet, so an estimated probability of 0 can occur.
Solution: Laplace Smoothing
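
A sketch of one common form of Laplace smoothing (add-alpha, with alpha = 1 by default). The function name smoothed_prob and its parameters are illustrative; n_values is the number of distinct values the attribute can take.

```python
def smoothed_prob(count, class_total, n_values, alpha=1):
    """Estimate P(Ai = v | Cj) with add-alpha smoothing.

    count:       how often value v co-occurs with class Cj
    class_total: number of training records in class Cj
    n_values:    number of distinct values of attribute Ai
    """
    return (count + alpha) / (class_total + alpha * n_values)

# A value never seen with the class no longer gets probability zero.
print(smoothed_prob(0, 2, 2))
print(smoothed_prob(2, 2, 2))
```

The smoothed estimates over all values of an attribute still sum to 1, so they remain a valid probability distribution.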

Decision Boundary of Naive Bayes Classifier

Summary

  • Robust to isolated noise points
  • Handles missing values by ignoring the instance during probability estimate calculations
  • Robust to irrelevant attributes [reasons: the probabilistic framework + conditional independence assumption + probability smoothing techniques]
  • Independence assumption may not hold for some attributes [Use other techniques such as Bayesian Belief Networks (BBN)]

Naïve Bayes works surprisingly well even if independence assumption is clearly violated [Reasons: Robustness to Violations + Effective Classification + Maximum Probability Assignment + Simple and Efficient]

Too many redundant attributes will cause problems. Solution: Select attribute subset as Naïve Bayes often works as well or better with just a fraction of all attributes.

Technical advantages:
(1) Learning Naïve Bayes classifiers is computationally cheap (probabilities are estimated in one pass over the training data)
(2) Storing the probabilities does not require a lot of memory

Redundant Variables
Violate independence assumption in Naive Bayes [Can, at large scale, skew the result]
May also skew the distance measures in k-NN, but the effect is not as drastic (Depends on the distance measure used)

Irrelevant Variables
For Naive Bayes: p(x=v|A) = p(x=v|B) for any value v of an irrelevant attribute x and any classes A and B. Since x is random, it does not depend on the class variable, so the overall result does not change.

kNN vs. Naïve Bayes

                      | k-NN                       | Naïve Bayes
computation           | -                          | faster
data                  | less sensitive to outliers | uses all data, less sensitive to label noise
redundant attributes  | less problematic           | -
irrelevant attributes | -                          | less problematic
pre-selection         | yes                        | yes

3.5 Lazy vs. Eager Learning

k-NN is a “lazy” method.
Lazy methods do not build an explicit model! “Learning” is only performed on demand for unseen records.
Nearest Centroid and Naive Bayes are simple “eager” methods (also: decision trees, rule sets, …)
Eager methods first learn a model and then use it to classify unseen instances.

3.6 Model Evaluation

3.6.1 Metrics for Performance Evaluation

Confusion Matrix

Accuracy & Error Rate
Accuracy = (TP + TN) / (TP + TN + FP + FN); Error Rate = 1 - Accuracy

Baseline: naive guessing (always predict the majority class)
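
The following sketch counts the four cells of a binary confusion matrix and compares accuracy with the majority-class baseline. Function names and the labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def majority_baseline_accuracy(actual):
    """Accuracy of always predicting the most frequent class."""
    return Counter(actual).most_common(1)[0][1] / len(actual)

# Hypothetical unbalanced test set: 2 positives, 8 negatives.
actual    = ["pos", "neg", "neg", "neg", "pos", "neg", "neg", "neg", "neg", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="pos")
print(tp, fp, fn, tn)
print(accuracy(tp, fp, fn, tn), majority_baseline_accuracy(actual))
```

In this toy case the classifier is no more accurate than the baseline, although it finds only one of the two positives.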

3.6.2 Limitation of Accuracy: Unbalanced Data

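1. Precision & Recall
Precision p = TP / (TP + FP); Recall r = TP / (TP + FN)

2. F1-Measure (the higher, the better)
F1-Score combines precision and recall into one measure by using the harmonic mean (which tends to be closer to the smaller of the two).
For the F1 value to be large, both p and r must be large. When precision and recall are both high, the F1 score is also high, indicating that the model strikes a good balance between the two.
F1-Score: F1 = 2 * p * r / (p + r)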
F1-Measure Graph
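
A short sketch of the three measures computed from confusion-matrix counts. Returning 0.0 when a denominator is zero is a convention assumed here, not something the notes prescribe.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

p = precision(tp=1, fp=1)
r = recall(tp=1, fn=1)
print(p, r, f1_score(p, r))

# The harmonic mean stays close to the smaller of the two values.
print(f1_score(0.9, 0.1))
```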

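Confidence scores: how sure the algorithm is about its prediction
3. ROC Curves:

  1. Sort the classifications according to confidence scores. For each test example the model outputs a confidence score (in Naïve Bayes, for example, the predicted probability); sort all test examples by this score from highest to lowest.
  2. Evaluate
    correct prediction -> draw one step up (increases the true positive rate)
    incorrect prediction -> draw one step to the right (increases the false positive rate)
    Interpreting ROC Curves

False Positive Rate: the fraction of all actual negatives that are wrongly classified as positive, FPR = FP / (FP+TN)
True Positive Rate: the fraction of all actual positives that are correctly classified as positive (equal to recall), TPR = TP / (TP+FN)
The closer the curve is to the top-left corner, the better the model: a higher TPR at a lower FPR.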
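
The construction can be sketched as below. It assumes that each test example comes with a confidence score for the positive class, that a "correct" step means the example is actually positive, and that both classes occur in the test set; roc_points and the scores are made up for illustration.

```python
def roc_points(scored):
    """scored: list of (confidence for the positive class, is_actually_positive).

    Returns the ROC curve as a list of (FPR, TPR) points.
    Assumes at least one positive and one negative example.
    """
    # 1. Sort by confidence, highest first.
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    # 2. Walk down the ranking: positive -> step up, negative -> step right.
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical confidence scores.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
points = roc_points(scored)
for fpr, tpr in points:
    print(fpr, tpr)
```
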
4. Cost Matrix
Computing Cost of Classification
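
As a sketch, the total cost of a classifier can be computed by multiplying each cell of the confusion matrix with the corresponding cell of the cost matrix and summing up. The counts and costs below are hypothetical; here a missed positive is assumed to be 100 times as expensive as a false alarm.

```python
def total_cost(confusion, cost):
    """Both arguments map (actual, predicted) pairs to a count or a cost."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)

# Hypothetical confusion matrix: keys are (actual, predicted).
confusion = {("pos", "pos"): 1, ("pos", "neg"): 1,
             ("neg", "pos"): 1, ("neg", "neg"): 7}

# Hypothetical cost matrix.
cost = {("pos", "pos"): 0, ("pos", "neg"): 100,
        ("neg", "pos"): 1, ("neg", "neg"): 0}

print(total_cost(confusion, cost))
```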

3.7 Decision Tree Classifiers

There can be more than one tree that fits the same data!
Decision Boundary
Decision boundary: border line between two neighboring regions of different classes.
The decision boundary is parallel to the axes because each test condition involves only a single attribute at a time
Finding an optimal decision tree is NP-hard
Tree building algorithms use a greedy, top-down, recursive partitioning strategy (also known as divide and conquer) to induce a reasonable solution. Examples: Hunt’s Algorithm, ID3, CHAID, C4.5

3.7.1 Hunt’s Algorithm

Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
Recursively apply the procedure to each subset
Hunt's Algorithm
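
The general procedure can be sketched as a short recursive function. This is a simplified illustration for categorical attributes: it tests attributes in the given order instead of choosing the best split, and it falls back to the majority class when no attribute is left (both are assumptions of this sketch). The names hunt and classify and the records are made up.

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Return a class label (leaf) or a nested dict (inner node)."""
    # All records belong to the same class -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: no attribute left to test -> leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # More than one class -> split the data with an attribute test.
    attr, remaining = attributes[0], attributes[1:]
    partitions = {}
    for record, label in zip(records, labels):
        recs, labs = partitions.setdefault(record[attr], ([], []))
        recs.append(record)
        labs.append(label)
    # Recursively apply the procedure to each subset.
    return {"attribute": attr,
            "branches": {value: hunt(recs, labs, remaining)
                         for value, (recs, labs) in partitions.items()}}

def classify(tree, record):
    # Follow the branches until a leaf (a plain label) is reached.
    # Raises KeyError for attribute values not seen during training.
    while isinstance(tree, dict):
        tree = tree["branches"][record[tree["attribute"]]]
    return tree

# Hypothetical training records.
records = [{"refund": "yes", "married": "no"},
           {"refund": "no", "married": "yes"},
           {"refund": "no", "married": "no"},
           {"refund": "yes", "married": "yes"}]
labels = ["no", "no", "yes", "no"]

tree = hunt(records, labels, ["refund", "married"])
print(tree)
print(classify(tree, {"refund": "no", "married": "no"}))
```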

3.7.2 Split

Splitting Based on Nominal Attributes
Splitting Based on Ordinal Attributes
Splitting Based on Continuous Attributes

Discretization to form an ordinal categorical attribute
equal-interval binning
equal-frequency binning
binning based on user-provided boundaries

Binary Decision: (A < v) or (A ≥ v)
usually sufficient in practice
consider all possible splits
find the best cut (i.e., the best v) based on a purity measure
can be computationally expensive
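
A sketch of the exhaustive search for the best cut v on one continuous attribute. It assumes the Gini index as the purity measure and midpoints between adjacent sorted values as candidate cuts; best_threshold and the data are illustrative.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try every midpoint between adjacent distinct values; return (v, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value < v]
        right = [label for value, label in pairs if value >= v]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Hypothetical attribute values and class labels.
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]

v, score = best_threshold(values, labels)
print(v, score)
```

Recomputing both partitions for every candidate makes this sketch quadratic in the number of records, which illustrates why the search can be expensive.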

3.7.3 Common measures of node impurity

How to determine the Best Split?
Nodes with homogeneous class distribution are preferred. Need a measure of node impurity.
1. Gini Index
GINI
Splitting Based on GINI
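
For reference, the Gini index of a node t is GINI(t) = 1 - sum over all classes j of p(j|t)^2, and the quality of a split is the size-weighted average of the Gini indices of the child nodes. A minimal sketch (function names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini index of a node, given the class labels of its records."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    """Weighted Gini index of a split; partitions is a list of label lists."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

print(gini(["a", "a", "b", "b"]))          # maximally impure for two classes
print(gini(["a", "a", "a", "a"]))          # pure node
print(gini_split([["a", "a"], ["b", "b"]]))  # a split into two pure children
```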

Binary Attributes
Binary Attributes: Computing GINI Index

Continuous Attributes
Continuous Attributes: Computing Gini Index

2. Entropy (Information Gain)
Entropy

Splitting Based on Information Gain
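
For reference, the entropy of a node t is Entropy(t) = - sum over all classes j of p(j|t) * log2 p(j|t), and the information gain of a split is the parent's entropy minus the size-weighted entropy of the children. A minimal sketch (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a node, given the class labels of its records."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

parent = ["a", "a", "b", "b"]
print(entropy(parent))
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # perfect split
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # useless split
```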

How to Find the Best Split

Decision Tree Advantages & Disadvantages
Advantages: Inexpensive to construct; Fast at classifying unknown records; Easy to interpret by humans for small-sized trees; Accuracy is comparable to other classification techniques for many simple data sets
Disadvantages: Decisions are based on only a single attribute at a time; Can only represent decision boundaries that are parallel to the axes

Decision Tree vs. k-NN

                      | k-NN                                            | Decision Tree
Decision Boundaries   | arbitrary                                       | rectangular
Sensitivity to Scales | needs normalization                             | does not need normalization
Runtime & Memory      | cheap to train but expensive for classification | expensive to train but cheap for classification

3.8 Overfitting

Overfitting: Good accuracy on training data, but poor on test data.
Symptoms: Tree too deep and too many branches
Typical causes of overfitting: too little training data, noise, poor learning algorithm
Which tree do you prefer?
Occam’s Razor
If you have two theories that explain a phenomenon equally well, choose the simpler one!

Learning Curve

Holdout Method
The holdout method reserves a certain amount for testing and uses the remainder for training
Typical: one third for testing, the rest for training

For unbalanced datasets (few or no instances of some classes), random samples might not be representative. -> Stratified sample: make sure that each class is represented with approximately equal proportions in both subsets. Other attributes may also be considered for stratification, e.g., gender, age, …
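
A sketch of a stratified holdout split that reserves roughly one third of each class for testing. The function name, the fixed seed, and the rounding rule are assumptions of this sketch.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Return (train indices, test indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical unbalanced labels: 9 records of class "a", 3 of class "b".
labels = ["a"] * 9 + ["b"] * 3
train_idx, test_idx = stratified_holdout(labels)
print(train_idx, test_idx)
```

Both subsets keep the 3:1 class ratio of the full data set.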

Leave One Out
Iterate over all examples
– train a model on all examples but the current one
– evaluate on the current one
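Every example serves as the test set exactly once, while the remaining examples form the training set. The process is repeated until each example has been evaluated once.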
Yields a very accurate estimate but is computationally infeasible in most cases

Cross-Validation (k-fold cross-validation)
A compromise between the accuracy of Leave One Out and a decent runtime
Cross-validation avoids overlapping test sets
Steps:
Step 1: Data is split into k subsets of equal size (stratification may be applied)
Step 2: Each subset in turn is used for testing and the remainder for training
The error estimates are averaged to yield an overall error estimate
Frequently used value for k: 10 (long considered the gold standard for the number of folds)
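
The two steps can be sketched as below, without stratification. The evaluate callback is hypothetical: it stands for "train a model on the training indices and return its error on the test indices". Setting k to the number of records gives Leave One Out.

```python
def k_fold_indices(n, k):
    """Yield (train indices, test indices) for each of the k folds."""
    # Step 1: split the record indices 0..n-1 into k folds of (nearly) equal size.
    folds = [list(range(i, n, k)) for i in range(k)]
    # Step 2: each fold in turn is the test set, the rest is the training set.
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n, k, evaluate):
    """Average the error estimates returned by evaluate(train, test)."""
    errors = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(errors) / len(errors)

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(train, test)

# Dummy callback that reports a fixed error, just to show the averaging.
print(cross_validate(10, 5, lambda train, test: 0.25))
```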

How to Address Overfitting?

  1. Pre-Pruning (Early Stopping Rule)
    Stop the algorithm before it becomes a fully-grown tree
    Typical stopping conditions for a node: Stop if all instances belong to the same class or Stop if all the attribute values are the same
    Less restrictive conditions: Stop if number of instances within a node is less than some user-specified threshold or Stop if expanding the current node only slightly improves the impurity measure (user-specified threshold)
  2. Post-pruning
    Grow the decision tree to its entire size -> trim the nodes of the decision tree in a bottom-up fashion (using a validation data set or an estimate of the generalization error) -> if the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is the majority class of the instances in the sub-tree

Training vs. Generalization Errors
Training error = resubstitution error = apparent error: errors made in training, misclassified training instances -> can be computed
Generalization error: errors made on unseen data, no apparent evidence -> must be estimated

Estimating the Generalization Error

Example of Post-Pruning

3.9 Alternative Classification Methods

Some cases are not nicely expressible in trees and rule sets. (Example: if at least two of Employed, Owns House, and Balance Account are yes → Get Credit is yes)
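
The "at least two of three" rule is awkward as a tree, but a single linear threshold unit (the kind of building block used in neural networks) expresses it directly. A sketch with hand-picked weights (1, 1, 1) and threshold 2, which are choices made for this illustration:

```python
from itertools import product

def threshold_unit(inputs, weights, threshold):
    """Fire (True) if the weighted sum of the inputs reaches the threshold."""
    return sum(w * x for w, x in zip(weights, inputs)) >= threshold

def get_credit(employed, owns_house, balance_account):
    # Inputs are 1 for yes and 0 for no.
    return threshold_unit([employed, owns_house, balance_account], [1, 1, 1], 2)

for combo in product([0, 1], repeat=3):
    print(combo, get_credit(*combo))
```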

3.9.1 Artificial Neural Networks (ANN)

ANN

General Structure of ANN

Algorithm for learning ANN

Decision Boundaries of ANN: Arbitrarily shaped objects & Fuzzy boundaries

3.9.2 Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data.

What is computed?
a separating hyperplane, defined by its support vectors (hence the name)

“Support vectors” are the training data points that are crucial in defining the decision boundary (or hyperplane).
Challenges: Computing an optimal separation is expensive, so good approximations are required
Dealing with noisy data: introducing “slack variables” in the margin computation

3.9.3 Nonlinear Support Vector Machines

  • Transform the data from the original feature space into a higher-dimensional space
  • Transformation in higher dimensional space
    Kernel function
    Different variants: polynomial function, radial basis function, …
  • Finding a hyperplane in higher dimensional space
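
A sketch of the two kernel variants named above, in their common textbook forms; the parameter names degree, c, and gamma are assumptions of this sketch. The last lines illustrate the idea behind kernels for one special case: for 2-dimensional inputs, the degree-2 polynomial kernel with c = 0 equals a dot product in a 3-dimensional feature space, without ever computing the mapping phi explicitly.

```python
import math

def polynomial_kernel(x, y, degree=2, c=0.0):
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

def rbf_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def phi(x):
    """Explicit feature map matching the degree-2 polynomial kernel (c = 0) in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(polynomial_kernel(x, y), explicit)
print(rbf_kernel(x, y), rbf_kernel(x, x))
```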

3.10 Hyperparameter Selection

A hyperparameter is a parameter which influences the learning process and whose value is set before the learning begins. For example, pruning thresholds for trees and rules; gamma and C for SVMs; learning rate, hidden layers for ANNs.
Parameters, in contrast, are learned from the training data. For example, weights in an ANN, probabilities in Naïve Bayes, splits in a tree.

How to determine good hyperparameters?
(1) manually try out different hyperparameter settings
(2) have your machine automatically test many different settings (Hyperparameter Optimization)

Hyperparameter Optimization
Goal: Find the combination of hyperparameter values that results in learning the model with the lowest generalization error
How to determine the parameter value combinations to be tested?

  • Grid Search: Test all combinations in user-defined ranges
  • Random Search: Test combinations of random parameter values
  • Evolutionary Search: Keep specific parameter values that worked well
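
Grid search and random search can be sketched as follows. The score callback is hypothetical: in practice it would train a model with the given hyperparameters and return an estimate of its generalization error (for example from cross-validation); here a made-up formula stands in for it.

```python
import itertools
import random

def grid_search(grid, score):
    """Test all combinations; return (best parameters, lowest score)."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if best is None or s < best[1]:
            best = (params, s)
    return best

def random_search(grid, score, n_trials, seed=0):
    """Test n_trials randomly drawn combinations; return the best one found."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_trials)]
    return min(((p, score(p)) for p in candidates), key=lambda pair: pair[1])

# Hypothetical search space for a k-NN classifier.
grid = {"k": [1, 3, 5, 7, 9], "weighted": [True, False]}

def fake_error(params):
    # Stand-in for an estimated generalization error (made up for this sketch).
    return (params["k"] - 5) ** 2 + (0 if params["weighted"] else 1)

best_params, best_score = grid_search(grid, fake_error)
print(best_params, best_score)
print(random_search(grid, fake_error, n_trials=4))
```

Grid search evaluates all 10 combinations here, while random search evaluates only as many as requested and may miss the best one.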

3.11 Model Selection

From all learned models M, select the model mbest that is expected to generalize best to unseen records

Model Selection Using a Validation Set

Model Selection using Cross-Validation
Model Evaluation using Nested Cross Validation

grid search for model selection
cross-validation for model evaluation
