Machine Learning from Start to Finish with Scikit-Learn


This notebook covers the basic Machine Learning process in Python step-by-step. Go from raw data to at least 78% accuracy on the Titanic Survivors dataset.

Steps Covered

  1. Importing a DataFrame
  2. Visualize the Data
  3. Cleanup and Transform the Data
  4. Encode the Data
  5. Split Training and Test Sets
  6. Fine Tune Algorithms
  7. Cross Validate with KFold
  8. Upload to Kaggle

CSV to DataFrame

CSV files can be loaded into a DataFrame by calling pd.read_csv. After loading the training and test files, print a sample to see what you're working with.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_train = pd.read_csv('../input/train.csv')
data_test = pd.read_csv('../input/test.csv')

data_train.sample(3)

Out[1]:

     PassengerId  Survived  Pclass  Name                       Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked
47   48           1         3       O'Driscoll, Miss. Bridget  female  NaN   0      0      14311   7.7500   NaN    Q
296  297          0         3       Hanna, Mr. Mansour         male    23.5  0      0      2693    7.2292   NaN    C
530  531          1         2       Quick, Miss. Phyllis May   female  2.0   1      1      26360   26.0000  NaN    S

Visualizing Data

Visualizing data is crucial for recognizing underlying patterns to exploit in the model.

In [2]:

sns.barplot(x="Embarked", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by port of embarkation, split by sex]

In [3]:

sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data_train,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"]);

[Figure: point plot of survival rate by passenger class, split by sex]

Transforming Features

  1. Aside from 'Sex', 'Age' is the second most important feature. To avoid overfitting, I'm grouping people into logical human age groups.
  2. Each Cabin starts with a letter. I bet this letter is much more important than the number that follows, so let's slice it off.
  3. Fare is another continuous value that should be simplified. I ran data_train.Fare.describe() to get the distribution of the feature, then placed the values into quartile bins accordingly (see the sketch after this list).
  4. Extract information from the 'Name' feature. Rather than use the full name, I extracted the last name and name prefix (Mr., Mrs., etc.), then appended them as their own features.
  5. Lastly, drop unused features (Ticket, Name, and Embarked).
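As a quick check on item 3, here is a minimal sketch of where those quartile bins come from. It assumes a fresh copy of the training CSV (loaded here as raw_train, a hypothetical name) so that Fare is still numeric; the 25%/50%/75% quantiles reported by describe() sit near the hand-picked bin edges 8, 15, and 31 used in simplify_fares below, and pd.qcut is an alternative that derives quartile bins automatically.

import pandas as pd

# Fresh copy of the data so Fare has not been binned yet.
raw_train = pd.read_csv('../input/train.csv')

# The 25%/50%/75% quantiles of Fare motivate the hand-picked
# bin edges (-1, 0, 8, 15, 31, 1000) in simplify_fares.
print(raw_train.Fare.describe())

# pd.qcut computes quartile bins directly instead of hard-coding edges.
fare_quartiles = pd.qcut(raw_train.Fare, 4,
                         labels=['1_quartile', '2_quartile',
                                 '3_quartile', '4_quartile'])
print(fare_quartiles.value_counts())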

In [4]:

def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

data_train = transform_features(data_train)
data_test = transform_features(data_test)
data_train.head()

Out[4]:

   PassengerId  Survived  Pclass  Sex     Age          SibSp  Parch  Fare        Cabin  Lname       NamePrefix
0  1            0         3       male    Student      1      0      1_quartile  N      Braund,     Mr.
1  2            1         1       female  Adult        1      0      4_quartile  C      Cumings,    Mrs.
2  3            1         3       female  Young Adult  0      0      1_quartile  N      Heikkinen,  Miss.
3  4            1         1       female  Young Adult  1      0      4_quartile  C      Futrelle,   Mrs.
4  5            0         3       male    Young Adult  0      0      2_quartile  N      Allen,      Mr.

In [5]:

sns.barplot(x="Age", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by age group, split by sex]

In [6]:

sns.barplot(x="Cabin", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by cabin letter, split by sex]

In [7]:

sns.barplot(x="Fare", y="Survived", hue="Sex", data=data_train);

[Figure: bar plot of survival rate by fare quartile, split by sex]

Some Final Encoding

The last part of the preprocessing phase is to normalize labels. The LabelEncoder in Scikit-learn will convert each unique string value into a number, making our data more flexible for various algorithms.

The result is a table of numbers that looks scary to humans, but beautiful to machines.
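To see what LabelEncoder does in isolation before applying it to the whole table, here is a minimal sketch on toy data (the values are illustrative): fit learns a sorted list of classes, and transform maps each string to its integer index. This is also why the cell below fits each encoder on train and test combined, so the test set can never contain a label the encoder has not seen.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['male', 'female', 'female', 'male'])

print(list(le.classes_))                       # ['female', 'male']
print(list(le.transform(['male', 'female'])))  # [1, 0]
print(list(le.inverse_transform([0, 1])))      # ['female', 'male']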

In [8]:

from sklearn import preprocessing
def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train, data_test = encode_features(data_train, data_test)
data_train.head()

Out[8]:

   PassengerId  Survived  Pclass  Sex  Age  SibSp  Parch  Fare  Cabin  Lname  NamePrefix
0  1            0         3       1    4    1      0      0     7      100    19
1  2            1         1       0    0    1      0      3     2      182    20
2  3            1         3       0    7    0      0      0     7      329    16
3  4            1         1       0    7    1      0      3     2      267    20
4  5            0         3       1    7    0      0      1     7      15     19

Splitting up the Training Data

Now it's time for some Machine Learning.

First, separate the features (X) from the labels (y).

X_all: All features minus the value we want to predict (Survived).

y_all: Only the value we want to predict.

Second, use Scikit-learn to randomly shuffle this data into four variables. In this case, I'm training on 80% of the data and testing against the other 20%.

Later, this data will be reorganized into a KFold pattern to validate the effectiveness of a trained algorithm.

In [9]:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

Fitting and Tuning an Algorithm

Now it's time to figure out which algorithm is going to deliver the best model. I'm going with the RandomForestClassifier, but you can drop in any other classifier here, such as Support Vector Machines or Naive Bayes; a sketch of the SVM variant follows.
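For instance, a minimal sketch of what swapping in a Support Vector Machine might look like; the parameter grid is illustrative rather than tuned, and it assumes the X_train/y_train split from the previous cell.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Any estimator with fit/predict drops into the same GridSearchCV
# workflow; only the parameter grid changes.
clf = SVC()
parameters = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf'],
              'gamma': [0.01, 0.1, 1]}
grid_obj = GridSearchCV(clf, parameters, scoring='accuracy')
# grid_obj = grid_obj.fit(X_train, y_train)  # then proceed as below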

In [10]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
clf = RandomForestClassifier()

# Choose some parameter combinations to try
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt', 'auto'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Fit the best algorithm to the data.
clf.fit(X_train, y_train)

Out[10]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=9, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
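Rather than reading the winning settings out of the estimator repr above, you can ask the grid search directly; a small sketch assuming grid_obj from the previous cell:

# Best parameter combination and its mean cross-validated accuracy.
print(grid_obj.best_params_)
print(grid_obj.best_score_)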

In [11]:

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
0.798882681564

Validate with KFold

Is this model actually any good? It helps to verify the effectiveness of the algorithm using KFold. This will split our data into 10 buckets, then run the algorithm using a different bucket as the test set for each iteration.

In [12]:

from sklearn.cross_validation import KFold

def run_kfold(clf):
    kf = KFold(891, n_folds=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf:
        fold += 1
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))

run_kfold(clf)
Fold 1 accuracy: 0.8111111111111111
Fold 2 accuracy: 0.8651685393258427
Fold 3 accuracy: 0.7640449438202247
Fold 4 accuracy: 0.8426966292134831
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8202247191011236
Fold 7 accuracy: 0.7528089887640449
Fold 8 accuracy: 0.8089887640449438
Fold 9 accuracy: 0.8876404494382022
Fold 10 accuracy: 0.8426966292134831
Mean Accuracy: 0.8226841448189763
/opt/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
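As the warning says, sklearn.cross_validation was removed in scikit-learn 0.20. Here is a sketch of the same validation against the current model_selection API (run_kfold_modern is a hypothetical name): the new KFold takes n_splits and yields indices from its split method, so the dataset length is no longer hard-coded.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score

def run_kfold_modern(clf):
    kf = KFold(n_splits=10)
    outcomes = []
    for fold, (train_index, test_index) in enumerate(kf.split(X_all), 1):
        X_tr, X_te = X_all.values[train_index], X_all.values[test_index]
        y_tr, y_te = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_tr, y_tr)
        accuracy = accuracy_score(y_te, clf.predict(X_te))
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    print("Mean Accuracy: {0}".format(np.mean(outcomes)))

# run_kfold_modern(clf)
# Or, more compactly:
# print(cross_val_score(clf, X_all, y_all, cv=10).mean())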

Predict the Actual Test Data

And now for the moment of truth. Make the predictions, export the CSV file, and upload them to Kaggle.

In [13]:

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId': ids, 'Survived': predictions })
# output.to_csv('titanic-predictions.csv', index = False)
output.head()

Out[13]:

   PassengerId  Survived
0  892          0
1  893          0
2  894          0
3  895          0
4  896          0

Reposted from: https://my.oschina.net/cloudcoder/blog/1068712
