Machine Learning from Start to Finish with Scikit-Learn

This notebook covers the basic Machine Learning process in Python step-by-step. Go from raw data to at least 78% accuracy on the Titanic Survivors dataset.

Steps Covered

  1. Importing a DataFrame
  2. Visualize the Data
  3. Cleanup and Transform the Data
  4. Encode the Data
  5. Split Training and Test Sets
  6. Fine Tune Algorithms
  7. Cross Validate with KFold
  8. Upload to Kaggle

CSV to DataFrame

CSV files can be loaded into a DataFrame by calling pd.read_csv. After loading the training and test files, print a sample to see what you're working with.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')

data_train.sample(3)

Out[1]:

     PassengerId  Survived  Pclass  Name                       Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked
47   48           1         3       O'Driscoll, Miss. Bridget  female  NaN   0      0      14311   7.7500   NaN    Q
296  297          0         3       Hanna, Mr. Mansour         male    23.5  0      0      2693    7.2292   NaN    C
530  531          1         2       Quick, Miss. Phyllis May   female  2.0   1      1      26360   26.0000  NaN    S
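
Beyond eyeballing a sample, it can help to count the missing values in each column, since Age and Cabin already show NaN in the rows above. Here is a minimal sketch, assuming a hand-made three-row CSV (csv_text and toy are hypothetical names) in place of the real train.csv:

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for train.csv (values taken from the sample above).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin
48,1,3,female,,7.75,
297,0,3,male,23.5,7.2292,
531,1,2,female,2.0,26.0,
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column; columns with NaNs need a fill strategy later.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same isnull().sum() call on data_train should show which columns need attention before any transformation.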

Visualizing Data

Visualizing data is crucial for recognizing underlying patterns to exploit in the model.

In [2]:

sns.barplot(x="Embarked", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by Embarked, split by Sex]

In [3]:

sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data_train,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"]);

[Figure: point plot of survival rate by Pclass, split by Sex]
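
By default, both sns.barplot and sns.pointplot draw the mean of y for each group, so the numbers behind these charts can be read off with a groupby. This is a minimal sketch on a hypothetical six-row frame (toy), not the real data:

```python
import pandas as pd

# Hypothetical toy data; the notebook itself uses data_train.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Pclass, Sex) group is the survival rate the plot shows.
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)
```

Swapping toy for data_train should give the exact heights plotted above.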

Transforming Features

  1. Aside from 'Sex', the 'Age' feature is second in importance. To avoid overfitting, I'm grouping people into logical human age groups.
  2. Each Cabin starts with a letter. I bet this letter is much more important than the number that follows, so let's keep only the letter.
  3. Fare is another continuous value that should be simplified. I ran data_train.Fare.describe() to get the distribution of the feature, then placed the fares into quartile bins accordingly.
  4. Extract information from the 'Name' feature. Rather than use the full name, I extracted the last name and name prefix (Mr., Mrs., etc.), then appended them as their own features.
  5. Lastly, drop useless features (Ticket and Name). The code below also drops Embarked.
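
To see how the age grouping in step 1 behaves, here is a small sketch using the same bin edges and labels as simplify_ages. The sample ages are made up for illustration:

```python
import pandas as pd

# Same bin edges and labels as simplify_ages; the sample ages are hypothetical.
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
               'Young Adult', 'Adult', 'Senior']

# A missing age is filled with -0.5 so that it falls into the (-1, 0] 'Unknown' bin.
ages = pd.Series([None, 2, 18, 22, 26, 38, 71], dtype="float64").fillna(-0.5)

# pd.cut uses right-closed intervals by default, so 18 falls in (12, 18].
age_groups = pd.cut(ages, bins, labels=group_names)
print(age_groups.tolist())
```

Note that an 18-year-old is labelled 'Teenager' rather than 'Student' because each bin includes its upper edge.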

In [4]:

def simplify_ages(df):
    # Missing ages become -0.5 so they land in the 'Unknown' bin
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    # Missing cabins become 'N'; keep only the leading letter
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    # Missing fares become -0.5 so they land in the 'Unknown' bin
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    # Names look like "Braund, Mr. Owen Harris": first token is the last name, second the prefix
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

data_train = transform_features(data_train)
data_test = transform_features(data_test)
data_train.head()

Out[4]:

   PassengerId  Survived  Pclass  Sex     Age          SibSp  Parch  Fare        Cabin  Lname       NamePrefix
0  1            0         3       male    Student      1      0      1_quartile  N      Braund,     Mr.
1  2            1         1       female  Adult        1      0      4_quartile  C      Cumings,    Mrs.
2  3            1         3       female  Young Adult  0      0      1_quartile  N      Heikkinen,  Miss.
3  4            1         1       female  Young Adult  1      0      4_quartile  C      Futrelle,   Mrs.
4  5            0         3       male    Young Adult  0      0      2_quartile  N      Allen,      Mr.
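
Notice that splitting on spaces keeps the trailing comma in Lname ("Braund,") and would cut a multi-word surname in half. As a possible alternative (an assumption on my part, not what this notebook uses), a regular expression can capture everything before the comma as the surname and the word before the first period as the prefix:

```python
import pandas as pd

# Sample names in the dataset's "Last, Title. First" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",
])

# The notebook's approach: split on spaces, which keeps punctuation.
by_split = names.apply(lambda x: x.split(' ')[0])

# Alternative sketch (not the notebook's method): text before the comma is the
# surname, the word before the first period is the prefix.
parts = names.str.extract(r'^(?P<Lname>[^,]+),\s*(?P<NamePrefix>[^.]+)\.')
print(by_split.tolist())
print(parts)
```

Switching to this would change the encoded values further down, so treat it as an experiment rather than a drop-in fix.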

In [5]:

sns.barplot(x="Age", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by Age group, split by Sex]

In [6]:

sns.barplot(x="Cabin", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by Cabin letter, split by Sex]

In [7]:

sns.barplot(x="Fare", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by Fare bin, split by Sex]

Some Final Encoding

The last part of the preprocessing phase is to normalize labels. The LabelEncoder in Scikit-learn will convert each unique string value into a number, making our data more flexible for various algorithms.

The result is a table of numbers that looks scary to humans, but beautiful to machines.

In [8]:

from sklearn import preprocessing

def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    # Fit each encoder on train and test together so both share one mapping
    df_combined = pd.concat([df_train[features], df_test[features]])

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train, data_test = encode_features(data_train, data_test)
data_train.head()

Out[8]:

   PassengerId  Survived  Pclass  Sex  Age  SibSp  Parch  Fare  Cabin  Lname  NamePrefix
0  1            0         3       1    4    1      0      0     7      100    19
1  2            1         1       0    0    1      0      3     2      182    20
2  3            1         3       0    7    0      0      0     7      329    16
3  4            1         1       0    7    1      0      3     2      267    20
4  5            0         3       1    7    0      0      1     7      15     19
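
The codes in this table follow the sorted order of each column's unique values, which is why 'Adult' became 0 and 'Young Adult' became 7. The sketch below, on hypothetical cabin letters, shows the mapping and why encode_features fits on the combined train and test values: a LabelEncoder fit on the training values alone raises an error when asked to transform a label it has not seen.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test columns; 'T' appears only in the test split.
train_cabin = pd.Series(['N', 'C', 'N'])
test_cabin = pd.Series(['C', 'T'])

# Fit on both splits so every label gets a code.
le = LabelEncoder().fit(pd.concat([train_cabin, test_cabin]))

# classes_ is sorted, so codes follow alphabetical order: C -> 0, N -> 1, T -> 2.
train_codes = le.transform(train_cabin)
test_codes = le.transform(test_cabin)
print(list(le.classes_), train_codes, test_codes)
```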

Splitting up the Training Data

Now it's time for some Machine Learning.

First, separate the features (X) from the labels (y).

X_all: All features minus the value we want to predict (Survived).

y_all: Only the value we want to predict.

Second, use Scikit-learn to randomly shuffle and split this data into four variables. In this case, I'm training on 80% of the data, then testing against the other 20%.

Later, this data will be reorganized into a KFold pattern to validate the effectiveness of a trained algorithm.

In [9]:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)
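
To check what the split actually produces, here is a minimal sketch on hypothetical data (X_demo and y_demo are made-up arrays, not the Titanic features). With test_size=0.20, 100 rows become 80 training rows and 20 test rows, and no row lands in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, binary labels.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0, 1] * 50)

# Same test_size and random_state as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=23)

print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the shuffle repeatable, so the same rows end up in the test set on every run.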

Fitting and Tuning an Algorithm

Now it's time to figure out which algorithm is going to deliver the best model. I'm going with the RandomForestClassifier, but you can drop any other classifier here, such as Support Vector Machines or Naive Bayes.

In [10]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
clf = RandomForestClassifier()

# Choose some parameter combinations to try.
# Note: 'auto' is only accepted by older scikit-learn releases; remove it if your version rejects it.
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt', 'auto'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Fit the best algorithm to the data.
clf.fit(X_train, y_train)

Out[10]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=9, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
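
Besides best_estimator_, the fitted search object exposes best_params_ and best_score_, which are handy for seeing what won and by how much. Below is a minimal sketch on synthetic data with a deliberately tiny grid. The data, grid, and variable names are my assumptions for illustration. Since the classifier above leaves random_state unset, its results can vary between runs, so this sketch fixes it to stay repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so the sketch runs quickly (4 combinations).
param_grid = {'n_estimators': [4, 9], 'max_depth': [2, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination; best_score_ is its mean
# cross-validated accuracy across the 5 folds.
print(search.best_params_, search.best_score_)
```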

In [11]:

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
0.798882681564

Validate with KFold

Is this model actually any good? It helps to verify the effectiveness of the algorithm using KFold. This will split our data into 10 buckets, then run the algorithm using a different bucket as the test set for each iteration.

In [12]:

# Note: sklearn.cross_validation was removed in scikit-learn 0.20 (see the warning below the output)
from sklearn.cross_validation import KFold

def run_kfold(clf):
    # 891 is the number of rows being split
    kf = KFold(891, n_folds=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf:
        fold += 1
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))

run_kfold(clf)
Fold 1 accuracy: 0.8111111111111111
Fold 2 accuracy: 0.8651685393258427
Fold 3 accuracy: 0.7640449438202247
Fold 4 accuracy: 0.8426966292134831
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8202247191011236
Fold 7 accuracy: 0.7528089887640449
Fold 8 accuracy: 0.8089887640449438
Fold 9 accuracy: 0.8876404494382022
Fold 10 accuracy: 0.8426966292134831
Mean Accuracy: 0.8226841448189763
/opt/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
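
As the warning says, the cross_validation module gave way to model_selection, where KFold takes n_splits and is applied to the data through split(). The sketch below shows one way the same 10-fold check could look with that interface, using cross_val_score to handle the loop. The synthetic data and the model settings are stand-ins I chose for illustration, not the tuned classifier from above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_all / y_all.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=9, random_state=0)

# KFold now takes n_splits; cross_val_score calls kf.split() internally,
# fits on nine folds and scores on the remaining one, ten times over.
kf = KFold(n_splits=10)
scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring='accuracy')

for fold, score in enumerate(scores, start=1):
    print("Fold {0} accuracy: {1}".format(fold, score))
print("Mean Accuracy: {0}".format(np.mean(scores)))
```

With the real data, passing clf, X_all, and y_all instead should give a comparable per-fold report.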

Predict the Actual Test Data

And now for the moment of truth. Make the predictions, export the CSV file, and upload them to Kaggle.

In [13]:

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
# output.to_csv('titanic-predictions.csv', index = False)
output.head()

Out[13]:

   PassengerId  Survived
0  892          0
1  893          0
2  894          0
3  895          0
4  896          0
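
The to_csv call in the cell above is commented out. To preview what the exported file would contain, the frame can be written to an in-memory buffer first. This is a small sketch with hypothetical predictions (output_demo is a made-up frame):

```python
import io
import pandas as pd

# Hypothetical predictions for three test passengers.
output_demo = pd.DataFrame({'PassengerId': [892, 893, 894],
                            'Survived': [0, 1, 0]})

# index=False keeps the DataFrame's row index out of the file,
# leaving only the PassengerId and Survived columns.
buffer = io.StringIO()
output_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```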

Reposted from: https://my.oschina.net/cloudcoder/blog/1068712
