sklearn Datasets
- 1. Dataset splitting
- 1.1 Loading data
- 1.2 Return type of the loaders
- Example
- 1.3 Splitting a dataset
- Example
- 2. sklearn classification datasets
- 3. sklearn regression datasets
1. Dataset splitting
A machine learning dataset is generally divided into two parts:
Training data: used for training, i.e. building the model (classification, regression, clustering)
Test data: used during model evaluation, to check whether the model is effective
A typical split ratio is 75% training / 25% test.
sklearn API for dataset splitting: sklearn.model_selection.train_test_split
1.1 Loading data
There are two kinds of datasets: small ones bundled inside datasets that can be loaded directly, and large ones that must be downloaded first.
sklearn.datasets
Loads popular benchmark datasets
datasets.load_*()
Loads a small dataset; the data ships inside the datasets package
datasets.fetch_*(data_home=None)
Fetches a large dataset that must be downloaded from the network; the first argument, data_home, is the directory the data is downloaded to (default: ~/scikit_learn_data/)
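As a quick illustration of this naming convention, here is a minimal sketch using two of the bundled small datasets (so nothing needs to be downloaded):

```python
from sklearn.datasets import load_iris, load_digits

# load_* datasets ship with scikit-learn and load instantly
iris = load_iris()
digits = load_digits()
print(iris.data.shape)    # (150, 4)
print(digits.data.shape)  # (1797, 64)

# fetch_* datasets (e.g. fetch_20newsgroups) would instead be downloaded
# to data_home (~/scikit_learn_data/ by default) on first use
```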
1.2 Return type of the loaders
load_* and fetch_* both return a datasets.base.Bunch (a dict-like object) with the following fields:
data: feature array, a 2-D numpy.ndarray of shape [n_samples, n_features]
target: label array, a 1-D numpy.ndarray of length n_samples
DESCR: description of the dataset
feature_names: feature names (the news, handwritten-digit, and regression datasets do not have this field)
target_names: label names
Example:
sklearn.datasets.load_iris(): loads and returns the iris dataset
This is a 150-row, 4-column array. Here is how to load it:
from sklearn.datasets import load_iris

li = load_iris()
print("Feature values:")
print(li.data)
print("Target values:")
print(li.target)
Here li is a datasets.base.Bunch, and running the script prints:
Target values:
[0 0 0 ... 0 1 1 1 ... 1 2 2 2 ... 2]   (150 values in total: 50 of each class)
.. _iris_dataset:

Iris plants dataset
-------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
The target values are the class labels 0, 1, and 2.
The output also includes a description (DESCR): this is perhaps the best-known database in the pattern recognition literature; Fisher's paper is a classic in the field and is still cited today (see e.g. Duda & Hart). The dataset contains 3 classes of 50 instances each, where each class refers to one species of iris plant.
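The other Bunch fields listed in section 1.2 can be inspected the same way; a minimal sketch:

```python
from sklearn.datasets import load_iris

li = load_iris()
print(li.feature_names)   # names of the 4 measured features
print(li.target_names)    # names of the 3 iris species
print(li.data.shape)      # (150, 4)
print(li.target.shape)    # (150,)
```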
1.3 Splitting a dataset
Split the data into a training set and a test set; the training features must stay aligned with their corresponding target values.
sklearn.model_selection.train_test_split(*arrays, **options)
x: feature values of the dataset
y: label values of the dataset
test_size: size of the test set, usually a float
random_state: random seed; different seeds produce different shuffles, while the same seed always reproduces the same split
Returns: training features, test features, training labels, test labels
(sampled at random by default)
Example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

li = load_iris()
# print("Feature values:")
# print(li.data)
# print("Target values:")
# print(li.target)
# print(li.DESCR)
# Note the order of the return values: train x_train, y_train; test x_test, y_test -- do not mix them up
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
print("Training features and targets:", x_train, y_train)
print("Test features and targets:", x_test, y_test)
Output:
Training features and targets: [[5.4 3.9 1.7 0.4]
 [6.3 2.9 5.6 1.8]
 ...
 [5.  3.3 1.4 0.2]] [0 2 2 0 1 2 ... 1 0]   (112 samples)
Test features and targets: [[4.6 3.1 1.5 0.2]
 [5.6 3.  4.1 1.3]
 ...
 [5.4 3.4 1.7 0.2]] [0 1 0 2 2 0 ... 2 0]   (38 samples)
As you can see, the training set is now smaller: it holds 75% of the samples, and the split is shuffled by default.
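The random_state parameter described above makes the shuffle reproducible; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

li = load_iris()

# Same seed -> identical split on every run
a = train_test_split(li.data, li.target, test_size=0.25, random_state=42)
b = train_test_split(li.data, li.target, test_size=0.25, random_state=42)
print(np.array_equal(a[0], b[0]))  # True: identical training features

# A different seed -> a different shuffle
c = train_test_split(li.data, li.target, test_size=0.25, random_state=1)
print(np.array_equal(a[0], c[0]))
```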
2. sklearn classification datasets
A large dataset for classification:
sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
subset: 'train', 'test', or 'all' (optional) -- which part of the dataset to load:
'train' for the training set, 'test' for the test set, 'all' for both
datasets.clear_data_home(data_home=None)
Deletes the downloaded data under the given directory
The articles fall into 20 categories, i.e. 20 target classes; data_home is the download directory; on first access the files are downloaded before the data is returned.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.data)
print(news.target)
3. sklearn regression datasets
sklearn.datasets.load_boston(): loads and returns the Boston house-prices dataset (note: load_boston was deprecated and removed in scikit-learn 1.2, so the call below only works on older versions)
from sklearn.datasets import load_boston

lb = load_boston()
print("Feature values:")
print(lb.data)
print("Target values:")
print(lb.target)
print(lb.DESCR)
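On recent scikit-learn versions, where load_boston is gone, the same pattern can be tried with the bundled diabetes regression dataset instead (load_diabetes is my substitution here, not part of the original example):

```python
from sklearn.datasets import load_diabetes

lb = load_diabetes()
print("Feature values:")
print(lb.data)        # 2-D array of shape (442, 10)
print("Target values:")
print(lb.target)      # continuous disease-progression scores
print(lb.DESCR)
```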