
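Earlier posts covered several supervised learning algorithms. This one covers the support vector machine (SVM). Compared with logistic regression and neural networks, it offers a cleaner and often more powerful way to learn complex nonlinear functions.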

Support Vector Machines

SVM hypothesis

Example Dataset 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy.io import loadmat
from sklearn import svm
mat = loadmat("ex6data1.mat")
X = mat['X']
y = mat['y']

def plot_data(X, y):
    # Scatter plot of the two features, colored by class label
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='rainbow')
    plt.xlabel('X1')
    plt.ylabel('X2')

plot_data(X, y)

def plot_boundary(clf, X):
    # Predict on a dense grid and draw the contour where the predicted class changes
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.1, X[:, 1].max() * 1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

# C must be passed by keyword in recent scikit-learn versions
models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
clfs = [model.fit(X, y.ravel()) for model in models]
titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
for model, title in zip(clfs, titles):
    plot_data(X, y)  # opens its own figure
    plot_boundary(model, X)
    plt.title(title)
    plt.show()

SVM with Gaussian Kernels

Gaussian Kernel
def gauss_kernel(x1, x2, sigma):
    # Gaussian (RBF) similarity: exp(-||x1 - x2||^2 / (2 * sigma^2))
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))
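As a quick sanity check, the kernel can be evaluated on a pair of small vectors. The sketch below redefines `gauss_kernel` so that it runs on its own; the sample vectors and `sigma = 2` are illustrative values, not taken from the datasets. It also checks the conversion `gamma = 1 / (2 * sigma^2)`, which is the parameterization scikit-learn's `rbf` kernel uses.

```python
import numpy as np

def gauss_kernel(x1, x2, sigma):
    # Gaussian similarity between two vectors
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

# Illustrative inputs (assumed values, not from the exercise data)
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

k_sigma = gauss_kernel(x1, x2, sigma)

# Same kernel written with gamma instead of sigma
gamma = 1.0 / (2 * sigma ** 2)
k_gamma = np.exp(-gamma * np.sum((x1 - x2) ** 2))

print(round(float(k_sigma), 6))            # 0.324652
print(bool(np.isclose(k_sigma, k_gamma)))  # True
```

The squared distance here is 9, so the kernel value is exp(-9/8).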
Example Dataset 2
mat = loadmat('ex6data2.mat')
X2 = mat['X']
y2 = mat['y']
plot_data(X2, y2)

sigma = 0.1
gamma = np.power(sigma, -2.) / 2  # the rbf kernel takes gamma = 1 / (2 * sigma^2)
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(X2, y2.flatten())
plot_data(X2, y2)
plot_boundary(model, X2)
Example Dataset 3
mat3 = loadmat('ex6data3.mat')
X3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plot_data(X3, y3)
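Dataset 3 comes with a validation set (`Xval`, `yval`), which can be used to choose `C` and `sigma`. The following is a minimal sketch of such a search, not the author's code: the helper name `search_c_sigma`, the candidate grid, and the synthetic stand-in data are all assumptions, chosen so that the block runs without `ex6data3.mat`.

```python
import numpy as np
from sklearn import svm

def search_c_sigma(X, y, Xval, yval, candidates):
    # Try every (C, sigma) pair and keep the one with the best validation accuracy
    best = (None, None, -1.0)
    for C in candidates:
        for sigma in candidates:
            gamma = 1.0 / (2 * sigma ** 2)
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            clf.fit(X, y.ravel())
            score = clf.score(Xval, yval.ravel())
            if score > best[2]:
                best = (C, sigma, score)
    return best

# Synthetic stand-in data (assumption): label is 1 inside a circle, 0 outside
rng = np.random.default_rng(0)
X_all = rng.uniform(-1, 1, size=(300, 2))
y_all = (np.sum(X_all ** 2, axis=1) < 0.5).astype(int).reshape(-1, 1)
X_train, y_train = X_all[:200], y_all[:200]
X_val, y_val = X_all[200:], y_all[200:]

candidates = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]  # assumed grid
best_C, best_sigma, best_score = search_c_sigma(X_train, y_train, X_val, y_val, candidates)
print(best_C, best_sigma, best_score)
```

With the exercise data loaded, the same helper could be called as `search_c_sigma(X3, y3, Xval, yval, candidates)`, and the selected pair used to fit the final classifier.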

Spam Classification

Preprocessing Emails

import re
import nltk

with open('emailSample1.txt', 'r') as f:
    email = f.read()
print(email)

# Apply every preprocessing step except word stemming and removal of non-words
def process_email(email):
    email = email.lower()
    # Strip HTML tags: '<', then one or more characters that are neither '<' nor '>', then '>'
    email = re.sub(r'<[^<>]+>', ' ', email)
    # Replace URLs: match everything after '://' up to the next whitespace character
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
    email = re.sub(r'[\$]+', 'dollar', email)
    email = re.sub(r'[\d]+', 'number', email)
    return email

# Preprocess the email and return a clean list of word stems
def email2TokenList(email):
    # I'll use the NLTK stemmer because it more accurately duplicates the
    # performance of the OCTAVE implementation in the assignment
    stemmer = nltk.stem.porter.PorterStemmer()
    email = process_email(email)
    # Split the email into words; re.split() accepts a whole set of delimiters.
    # \s is used instead of a plain space so that words on adjacent lines are not merged.
    tokens = re.split(r'[\s\@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
    tokenlist = []
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Skip the empty strings left over from splitting
        if not len(token):
            continue
        # Use the Porter stemmer to extract the word stem
        stemmed = stemmer.stem(token)
        tokenlist.append(stemmed)
    return tokenlist
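To see what the normalization step does, it helps to run the substitutions on a single line of text. The sketch below is a compact, standalone version of the same regular expressions; the function name `normalize` and the sample sentence are made up for illustration.

```python
import re

def normalize(text):
    # Lower-case, strip HTML tags, then replace URLs, addresses, '$' and digits
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = re.sub(r'[\$]+', 'dollar', text)
    text = re.sub(r'[\d]+', 'number', text)
    return text

# Hypothetical sample text
sample = 'Visit <b>our</b> site at http://example.com/deals or mail sales@example.com to save $25 today'
words = normalize(sample).split()
print(words)
# ['visit', 'our', 'site', 'at', 'httpaddr', 'or', 'mail', 'emailaddr', 'to', 'save', 'dollarnumber', 'today']
```

The order of the substitutions matters: URLs and addresses are replaced before digits, so numbers inside a URL do not turn into separate `number` tokens.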
Vocabulary List
# Return the vocabulary indices of the words that occur in the email
def email2VocabIndices(email, vocab):
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token]
    return index

Extracting Features from Emails

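# Convert an email into a word vector of length n = len(vocab).
# Positions of words that occur in the email are set to 1, all others to 0.
def email2FeatureVector(email):
    df = pd.read_table('data/vocab.txt', names=['words'])
    vocab = df['words'].to_numpy()  # 1-D array of words (as_matrix() was removed in pandas 1.0)
    vector = np.zeros(len(vocab))  # init vector
    vocab_indices = email2VocabIndices(email, vocab)  # indices of the words present in the email
    # Set the entries for the words that occur to 1
    for i in vocab_indices:
        vector[i] = 1
    return vector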
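The mapping from tokens to a binary vector is easiest to follow on a toy case. The sketch below assumes a six-word vocabulary and a hand-written token list, both invented for illustration; `tokens_to_feature_vector` is a hypothetical helper, not part of the code above.

```python
import numpy as np

def tokens_to_feature_vector(tokens, vocab):
    # Binary bag-of-words: entry i is 1 if vocab[i] occurs among the tokens
    present = set(tokens)  # set lookup avoids rescanning the token list for each word
    return np.array([1.0 if word in present else 0.0 for word in vocab])

vocab = ['anyon', 'buy', 'click', 'dollarnumber', 'httpaddr', 'know']  # toy vocabulary (assumption)
tokens = ['anyon', 'know', 'how', 'much', 'it', 'cost', 'click', 'httpaddr', 'click']

vector = tokens_to_feature_vector(tokens, vocab)
print(vector)             # [1. 0. 1. 0. 1. 1.]
print(int(vector.sum()))  # 4
```

Repeated words such as `click` still contribute a single 1, and tokens outside the vocabulary are ignored.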

Training SVM for Spam Classification

vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

# Training set
mat1 = loadmat('spamTrain.mat')
X, y = mat1['X'], mat1['y']

# Test set
mat2 = loadmat('spamTest.mat')
Xtest, ytest = mat2['Xtest'], mat2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.ravel())  # ravel() turns the (m, 1) label column into the 1-D array sklearn expects

Top Predictors for Spam

predTrain = clf.score(X, y)
predTest = clf.score(Xtest, ytest)
predTrain, predTest
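For a linear SVM, the learned weights show which words push an email toward the spam class. The sketch below is one possible way to list them, under stated assumptions: the five-word vocabulary and the synthetic occurrence matrix are invented so the block runs without `spamTrain.mat`, and the names `clf_toy`, `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn import svm

vocab = np.array(['click', 'free', 'meeting', 'report', 'winner'])  # toy vocabulary (assumption)

# Synthetic binary word-occurrence features (assumption, stands in for spamTrain.mat)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, len(vocab)))
# Label an email as spam (1) when it contains 'free' or 'winner'
y_toy = ((X_toy[:, 1] + X_toy[:, 4]) > 0).astype(int)

clf_toy = svm.SVC(C=0.1, kernel='linear')
clf_toy.fit(X_toy, y_toy)

# With a linear kernel, coef_ has shape (1, n_features); larger weights favor class 1
weights = clf_toy.coef_.ravel()
order = np.argsort(weights)[::-1]  # indices sorted by descending weight
top_words = vocab[order[:2]].tolist()
for i in order[:2]:
    print(vocab[i], round(float(weights[i]), 3))
```

Applied to the trained spam classifier, the same `argsort` over `clf.coef_` combined with the vocabulary list would give the words most indicative of spam.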

C = 1/λ
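Large C: low bias, high variance (corresponds to small λ)
Small C: high bias, low variance (corresponds to large λ)
Large σ^2: the kernel varies more smoothly; high bias, low variance
Small σ^2: the kernel is more sharply peaked; low bias, high variance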

Steps for using an SVM:


Need to specify:

  1. choice of parameter C
  2. choice of kernel (similarity function):
    eg: No kernel(‘linear kernel’)
    Gaussian kernel
    need to choose σ^2

logistic vs SVM
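(2) If n (the number of features) is small and m (the number of training examples) is intermediate, for example n between 1 and 1,000 and m between 10 and 10,000, use an SVM with a Gaussian kernel.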





