Exploratory Data Analysis: A Case Study on the Titanic Data Set (Part 2)

Data is simply useless until you know what it’s trying to tell you.

With this quote, we continue our quest to uncover the hidden secrets of the Titanic. ‘The Unsinkable’, as its designers and makers called it, proved that even the best of human engineering may fail when nature puts it to the test.

In the last article, we looked at the different attributes of the data and took a quick glance at what the data looked like. If you haven’t read Part 1 of this blog, I recommend reading it before continuing. In this article, we’ll look at how each attribute relates to the survival of the passengers, and continue our quest to find out whether you would have survived the sinking of the Titanic.

1. Correlation of Passenger Class with Survival

Since there are three passenger classes on the ship, let’s count the passengers in each class.
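The original code cell did not survive in this copy, so here is a minimal sketch of how such a count could be done with pandas. The tiny DataFrame is made-up sample data; with the real data you would load the Kaggle file instead (the file name `train.csv` is an assumption), and the column name `Pclass` follows the Kaggle Titanic data set.

```python
import pandas as pd

# Hypothetical miniature sample; with the real data you would use something like
# df = pd.read_csv("train.csv")  (file name is an assumption).
df = pd.DataFrame({"Pclass": [1, 3, 3, 2, 3, 1, 3, 2]})

# Number of passengers in each class, ordered by class label.
class_counts = df["Pclass"].value_counts().sort_index()
print(class_counts)
```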

Output:

Image for post

Now, let’s find out the total number of survivors from each class.
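One way to get both the survivor count and the survival percentage per class is a grouped aggregation. This is only a sketch on made-up rows, not the author’s original cell, so the percentages it prints are not the ones quoted in this article.

```python
import pandas as pd

# Hypothetical sample rows (Survived: 1 = survived, 0 = did not survive).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Survivors and total passengers per class, then the percentage that survived.
by_class = df.groupby("Pclass")["Survived"].agg(survivors="sum", total="count")
by_class["survival_pct"] = (100 * by_class["survivors"] / by_class["total"]).round(2)
print(by_class)
```

Because `Survived` is coded as 0/1, its sum is the number of survivors and its mean is the survival rate.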

Output:

Image for post
Image for post

As you can see, Upper Class passengers fared better than the other two classes, with a survival percentage of around 62.96%.

The survival percentage of Middle Class passengers is around 47.28%, which is better than that of the Lower Class but worse than that of the Upper Class.

The Lower Class was hit the hardest, with a survival percentage of just 24.23%, significantly lower than the other two classes.

The results indicate that survival in the Titanic sinking was strongly associated with the class a passenger belonged to, which points to discrimination based on class.

2. Correlation of Gender with Survival

Let’s start by printing the number of passengers of each gender.

Output:

Image for post

Now, let’s find out the survival percentage of the passengers of each gender.
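A rough sketch of both steps, the head count per gender and the survival percentage per gender, could look like this. The rows are invented for illustration; the column names `Sex` and `Survived` follow the Kaggle Titanic data set.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 0],
})

# Passengers per gender.
gender_counts = df["Sex"].value_counts()

# Mean of a 0/1 column is the survival rate; scale to a percentage.
survival_pct = df.groupby("Sex")["Survived"].mean().mul(100).round(2)

print(gender_counts)
print(survival_pct)
```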

Output:

Image for post
Image for post

The numbers suggest that women were given the highest priority during the rescue. About 74.2% of the women survived, compared with only 18.89% of the men. (How pure these gentlemen were!😢❤️)

3. Correlation of Age with Survival

Now, let’s look at the effect of age on survival. But first, let’s take a quick look at some summary statistics for age, along with the number of values that are missing from the data set.
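As a sketch of what such a summary might look like (made-up ages, not the real column), `describe()` gives the usual statistics while `isna().sum()` counts the missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

# count, mean, std, min, quartiles and max; NaN values are skipped.
age_summary = df["Age"].describe()

# Number of passengers whose age is not recorded.
missing_ages = int(df["Age"].isna().sum())

print(age_summary)
print("Missing ages:", missing_ages)
```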

Output:

Image for post
Image for post

There are a total of 177 missing values, i.e. the ages of 177 passengers are missing from the data set. These missing values may cause problems during prediction and hence need to be addressed.

Now, let’s visualize the data by plotting some histograms.

kde = True overlays a kernel density estimate on the histogram, and the rug is the row of small tick marks along the axis that mark the exact positions of the individual data points.

Output:

Image for post

Now, let’s check the survival in each group by plotting the following graph with a KDE. The y-axis denotes the estimated probability density, so the area under the KDE curve over a range of x values gives the estimated probability of a value falling in that range.
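To make the “area under the curve” idea concrete, here is a small hand-rolled Gaussian KDE written with NumPy only. It is an illustration rather than what seaborn does internally: the ages, the bandwidth of 6 years and the cut-off of 15 years are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical ages, split by survival status.
ages_survived = np.array([2.0, 4.0, 22.0, 27.0, 35.0, 38.0])
ages_died = np.array([20.0, 24.0, 28.0, 40.0, 54.0, 61.0])

def gaussian_kde(grid, samples, bandwidth):
    """Average of Gaussian bumps centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

grid = np.linspace(-60.0, 140.0, 4001)  # wide enough to capture both tails
step = grid[1] - grid[0]
bandwidth = 6.0                          # assumed smoothing width, in years

density_survived = gaussian_kde(grid, ages_survived, bandwidth)
density_died = gaussian_kde(grid, ages_died, bandwidth)

# Riemann sums: the whole curve integrates to about 1,
# and the area up to age 15 is the estimated share of children in each group.
total_area = density_survived.sum() * step
children = grid <= 15.0
share_children_survived = density_survived[children].sum() * step
share_children_died = density_died[children].sum() * step

print(round(total_area, 3), share_children_survived > share_children_died)
```

The height of the curve at a single age is a density, not a probability; only areas over an interval can be read as probabilities.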

Output:

Image for post

The following plot shows the distribution of gender across the age groups.

Output:

Image for post

Now, let’s compare survival in each of these groups using a KDE plot.

Output:

Image for post

We can also summarize what these histograms represent as follows:

Output:

Image for post

4. Correlation of the Number of Siblings/Spouses with Survival

Let’s start by understanding the distribution of the values of this attribute.

Output:

Image for post

Now, let’s plot a histogram describing the survival of passengers by their number of Siblings/Spouses.

Output:

Image for post

The numbers behind the above histogram can be derived using the following code:
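The cell itself is not preserved in this copy. A cross-tabulation is one plausible way to get the same information, so here is a sketch on invented rows; `SibSp` is the Kaggle column name for the number of siblings/spouses aboard.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 2, 2, 4],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Rows: number of siblings/spouses; columns: 0 = died, 1 = survived.
sibsp_table = pd.crosstab(df["SibSp"], df["Survived"])

# Share of survivors within each SibSp value.
sibsp_table["survival_pct"] = (100 * sibsp_table[1] / sibsp_table.sum(axis=1)).round(2)
print(sibsp_table)
```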

Output:

Image for post

5. Correlation of the Number of Parents/Children with Survival

The distribution of the number of Parents/Children is as follows:

Output:

Image for post

Here are two different plots showing the survival of passengers by their number of Parents/Children. The first uses ‘distplot’ and the second uses ‘countplot’.

Output:

Image for post

6. Correlation of Fare with Survival

Now, let’s try to understand whether there was any regularity in the fares and whether fare is related to survival. The code describes the distribution of the fare.

Output:

Image for post

Let’s plot the distribution of the Fare, split by Survival.
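Alongside the plot, a quick numeric comparison can help. The sketch below uses made-up fares and simply compares the mean and median fare of the two groups; it is not the original cell.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Fare":     [7.25, 8.05, 7.9, 13.0, 71.0, 26.0, 53.0, 10.0],
})

# Mean and median fare for non-survivors (0) and survivors (1).
fare_by_survival = df.groupby("Survived")["Fare"].agg(["mean", "median"])
print(fare_by_survival)
```

The median is worth reporting next to the mean because a few very expensive tickets can pull the mean upward.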

Output:

Image for post

Let’s check whether the passengers were charged uniformly or not. If not, let’s try to understand which factors decided the ticket fare.

To check whether ‘Gender’ was a factor in deciding the fare, here’s the plot for each port of embarkation, followed by the numbers behind it.
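A grouped mean is one simple way to put numbers next to such a plot. The rows below are invented; `Embarked` uses the Kaggle codes S (Southampton), C (Cherbourg) and Q (Queenstown).

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q", "Q"],
    "Sex":      ["male", "female", "male", "female", "male", "female", "female", "male", "female"],
    "Fare":     [8.0, 26.0, 10.0, 30.0, 30.0, 80.0, 70.0, 8.0, 8.0],
})

# Mean fare for every (port, gender) pair, with gender spread across the columns.
mean_fare = df.groupby(["Embarked", "Sex"])["Fare"].mean().unstack("Sex")
print(mean_fare)
```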

Output:

Image for post

Output:

Image for post
Image for post
Image for post

Thus, as per the data, the mean fare charged to women was significantly higher than that charged to men for passengers who embarked at Cherbourg and Southampton.

To check whether ‘Embarkation’, ‘Class’ and ‘Age’ were factors in deciding the fare, here’s the plot for each port of embarkation and class, split by ‘Survival’ status, followed by the numbers behind it.
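As a rough numeric companion to the plot, one could tabulate the median fare per port and class and look at the correlation between age and fare. This is a sketch on invented rows; note that a correlation coefficient only captures a linear relationship, so a value near zero is a hint rather than proof that age played no role.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "C"],
    "Pclass":   [1, 3, 3, 1, 1, 3],
    "Age":      [40.0, 20.0, 30.0, 25.0, 50.0, 35.0],
    "Fare":     [50.0, 8.0, 8.0, 80.0, 80.0, 10.0],
})

# Median fare for each port (rows) and class (columns).
median_fare = df.pivot_table(index="Embarked", columns="Pclass",
                             values="Fare", aggfunc="median")

# Pearson correlation between age and fare.
age_fare_corr = df["Age"].corr(df["Fare"])

print(median_fare)
print(round(age_fare_corr, 3))
```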

Output:

Image for post

Thus, the data suggest that tickets were priced mostly on the basis of Pclass and the port of embarkation, not on the basis of age.

7. Correlation of Embarkation with Survival

Until now, we have seen the description of the numerical attributes. Here’s a look at the description of the categorical data.
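For non-numeric columns, `describe()` reports the count, the number of unique values, the most frequent value and its frequency. A minimal sketch on invented rows, selecting the categorical columns by excluding the numeric ones:

```python
import pandas as pd

# Hypothetical sample rows with one numeric and two categorical columns.
df = pd.DataFrame({
    "Sex":      ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare":     [7.25, 71.28, 8.05, 8.46],
})

# Keep only the non-numeric columns, then summarize them.
categorical_summary = df.select_dtypes(exclude="number").describe()
print(categorical_summary)
```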

Output:

Image for post

Here’s a plot describing the survival ratio of passengers from each port of embarkation.

Output:

Image for post

And now here’s the pair plot of all the attributes that we have discussed so far.

Output:

Image for post

As you might have noticed, we’ve ignored Passenger_Id, Name of the Passenger, Ticket and Cabin No., as they play little to no role in determining the survival of the passenger.

Thus, we tried to understand the data by visualizing it with various techniques and uncovered several mysteries related to the Titanic. In the next article, we’ll look at the types of data and why some types need to be converted into a specific format before various Machine Learning models can be fitted to them. Thank you for joining me on this journey of exploration. I hope you enjoyed the experience of being a detective!🕵

Link to the Notebook: Click Here

Link to Part 1 of this Blog: Click Here

Translated from: https://medium.com/@bapreetam/exploratory-data-analysis-a-case-study-on-titanic-data-set-part-2-96a9f3df963a
