sklearn Datasets
- 1. Dataset splitting
- 1.1 Loading data
- 1.2 Return type of the loaders
- Example
- 1.3 Splitting a dataset
- Example
- 2. sklearn classification datasets
- 3. sklearn regression datasets
1. Dataset splitting
A machine learning dataset is generally divided into two parts:
Training data: used for training, i.e. building the model (classification, regression, clustering)
Test data: used during model evaluation, to check whether the model is effective
A typical split ratio is 75% training / 25% test.
sklearn API for dataset splitting: sklearn.model_selection.train_test_split
1.1 Loading data
There are two kinds of datasets: small ones bundled inside datasets that can be loaded directly, and large ones that must be downloaded first.
sklearn.datasets
Loads popular benchmark datasets
datasets.load_*()
Loads a small dataset; the data ships inside the datasets package
datasets.fetch_*(data_home=None)
Fetches a large dataset that must be downloaded from the network; the first argument, data_home, is the directory the data is downloaded to (default: ~/scikit_learn_data/)
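As a quick illustration of this naming convention, here is a minimal sketch using two of the bundled small datasets (so nothing needs to be downloaded):

```python
from sklearn.datasets import load_iris, load_digits

# load_* datasets ship with scikit-learn and load instantly
iris = load_iris()
digits = load_digits()
print(iris.data.shape)    # (150, 4)
print(digits.data.shape)  # (1797, 64)

# fetch_* datasets (e.g. fetch_20newsgroups) would instead be downloaded
# to data_home (~/scikit_learn_data/ by default) on first use
```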
1.2 Return type of the loaders
load_* and fetch_* both return a datasets.base.Bunch (a dict-like object) with the following fields:
data: feature array, a 2-D numpy.ndarray of shape [n_samples, n_features]
target: label array, a 1-D numpy.ndarray of length n_samples
DESCR: description of the dataset
feature_names: feature names (the news, handwritten-digit, and regression datasets do not have this field)
target_names: label names
Example:
sklearn.datasets.load_iris(): loads and returns the iris dataset
This is a 150-row, 4-column array. Here is how to load it:
from sklearn.datasets import load_iris

li = load_iris()
print("Feature values:")
print(li.data)
print("Target values:")
print(li.target)
Here li is a datasets.base.Bunch, and running the script prints:
Target values:
[0 0 0 ... 0 1 1 1 ... 1 2 2 2 ... 2]   (150 values in total: 50 of each class)
.. _iris_dataset:

Iris plants dataset
-------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
The target values are the class labels 0, 1, and 2.
The output also includes a description (DESCR): this is perhaps the best-known database in the pattern recognition literature; Fisher's paper is a classic in the field and is still cited today (see e.g. Duda & Hart). The dataset contains 3 classes of 50 instances each, where each class refers to one species of iris plant.
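The other Bunch fields listed in section 1.2 can be inspected the same way; a minimal sketch:

```python
from sklearn.datasets import load_iris

li = load_iris()
print(li.feature_names)   # names of the 4 measured features
print(li.target_names)    # names of the 3 iris species
print(li.data.shape)      # (150, 4)
print(li.target.shape)    # (150,)
```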
1.3 Splitting a dataset
Split the data into a training set and a test set; the training features must stay aligned with their corresponding target values.
sklearn.model_selection.train_test_split(*arrays, **options)
x: feature values of the dataset
y: label values of the dataset
test_size: size of the test set, usually a float
random_state: random seed; different seeds produce different shuffles, while the same seed always reproduces the same split
Returns: training features, test features, training labels, test labels
(sampled at random by default)
Example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

li = load_iris()
# print("Feature values:")
# print(li.data)
# print("Target values:")
# print(li.target)
# print(li.DESCR)
# Note the order of the return values: train x_train, y_train; test x_test, y_test -- do not mix them up
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
print("Training features and targets:", x_train, y_train)
print("Test features and targets:", x_test, y_test)
Output:
Training features and targets: [[5.4 3.9 1.7 0.4]
 [6.3 2.9 5.6 1.8]
 ...
 [5.  3.3 1.4 0.2]] [0 2 2 0 1 2 ... 1 0]   (112 samples)
Test features and targets: [[4.6 3.1 1.5 0.2]
 [5.6 3.  4.1 1.3]
 ...
 [5.4 3.4 1.7 0.2]] [0 1 0 2 2 0 ... 2 0]   (38 samples)
As you can see, the training set is now smaller: it holds 75% of the samples, and the split is shuffled by default.
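The random_state parameter described above makes the shuffle reproducible; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

li = load_iris()

# Same seed -> identical split on every run
a = train_test_split(li.data, li.target, test_size=0.25, random_state=42)
b = train_test_split(li.data, li.target, test_size=0.25, random_state=42)
print(np.array_equal(a[0], b[0]))  # True: identical training features

# A different seed -> a different shuffle
c = train_test_split(li.data, li.target, test_size=0.25, random_state=1)
print(np.array_equal(a[0], c[0]))
```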
2. sklearn classification datasets
A large dataset for classification:
sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
subset: 'train', 'test', or 'all' (optional) -- which part of the dataset to load:
'train' for the training set, 'test' for the test set, 'all' for both
datasets.clear_data_home(data_home=None)
Deletes the downloaded data under the given directory
The articles fall into 20 categories, i.e. 20 target classes; data_home is the download directory; on first access the files are downloaded before the data is returned.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.data)
print(news.target)
3. sklearn regression datasets
sklearn.datasets.load_boston(): loads and returns the Boston house-prices dataset (note: load_boston was deprecated and removed in scikit-learn 1.2, so the call below only works on older versions)
from sklearn.datasets import load_boston

lb = load_boston()
print("Feature values:")
print(lb.data)
print("Target values:")
print(lb.target)
print(lb.DESCR)
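On recent scikit-learn versions, where load_boston is gone, the same pattern can be tried with the bundled diabetes regression dataset instead (load_diabetes is my substitution here, not part of the original example):

```python
from sklearn.datasets import load_diabetes

lb = load_diabetes()
print("Feature values:")
print(lb.data)        # 2-D array of shape (442, 10)
print("Target values:")
print(lb.target)      # continuous disease-progression scores
print(lb.DESCR)
```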