Titanic Data Set Predictive Analysis

Data is simply useless until you know what it’s trying to tell you.

With this quote, we’ll continue our quest to uncover the hidden secrets of the Titanic. ‘The Unsinkable’, as its designers and makers called it, proved that even the best of human engineering may sometimes fail when nature puts it to the test.

In the last article, we saw the different attributes of the data and had a quick glance at what the data looked like. If you haven’t read Part 1 of this blog, I recommend reading it by clicking here before continuing. In this article, we’ll look at the relationship of each attribute to passenger survival and continue our quest to find out whether you would have survived the sinking of the Titanic.

1. Correlation of Passenger Class with Survival

Since there are three classes on the ship, let’s find the number of passengers in each class.

Now, let’s find out the total number of survivors from each class.
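A minimal sketch of how the per-class counts and survival percentages could be computed with pandas. The column names `Pclass` and `Survived` are assumed to follow the usual Kaggle Titanic layout, and the ten-row frame is made-up data, so its numbers will not match the percentages quoted in this article.

```python
import pandas as pd

# Tiny made-up sample (an assumption); the real analysis would load the full data set.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Number of passengers in each class.
class_counts = df["Pclass"].value_counts().sort_index()

# Survived is 0/1, so its mean per class is the survival rate.
survival_pct = (df.groupby("Pclass")["Survived"].mean() * 100).round(2)

print(class_counts)
print(survival_pct)
```

Because `Survived` only holds 0 and 1, taking the mean avoids a separate "survivors divided by total" step.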

As you can see, the Upper Class passengers fared better than the other two classes, with a survival percentage of around 62.96%.

The survival percentage of Middle Class passengers is around 47.28%, which is better than that of the Lower Class but worse than that of the Upper Class.

The Lower Class was hit the hardest, with a survival percentage of just 24.23%, significantly lower than that of the other two classes.

The results indicate that surviving the sinking of the Titanic was largely affected by the class you belonged to, which points to discrimination based on class.

2. Correlation of Gender with Survival

Let’s start by printing the number of passengers of each gender.


Now, let’s find out the survival percentage of the passengers belonging to each gender.

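One way this could look in pandas is sketched below. The column names `Sex` and `Survived` are assumed from the usual Kaggle layout, and the data is a made-up sample, not the real passenger list.

```python
import pandas as pd

# Made-up sample (an assumption), only to illustrate the calculation.
df = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female",
                 "male", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Passengers and survivors per gender, using named aggregation.
summary = df.groupby("Sex")["Survived"].agg(total="count", survived="sum")

# Survival percentage per gender.
summary["survival_pct"] = (summary["survived"] / summary["total"] * 100).round(2)

print(summary)
```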

The information suggests that women were given the highest priority during the rescue. Almost 74.2% of the women survived, compared with 18.89% of the men. (How noble these gentlemen were! 😢❤️)

3. Correlation of Age with Survival

Now, let’s look at the effect of age on survival. But first, let’s take a quick look at some statistics for age, along with the values that are missing in the data set.
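A small sketch of how the age statistics and the number of missing values might be obtained. The `Age` name and the eight values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries (NaN).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0], name="Age")

# describe() ignores NaN and reports count, mean, std, min, quartiles and max.
stats = ages.describe()

# Number of missing ages.
missing = ages.isna().sum()

print(stats)
print("Missing ages:", missing)
```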

There are a total of 177 missing values, i.e. the ages of 177 passengers are missing from the data set. These missing values may cause problems during prediction and hence need to be addressed.

Now, let’s visualize the data by plotting some histograms.

kde=True adds a kernel density estimate curve to the histogram, and the rug is the row of small marks that show the exact points at which the data were recorded.

Now, let’s check the survival in each group by plotting the following graph with a KDE. The y-axis denotes the probability density given by the kernel density estimate, and the area under the KDE curve over a range of the x-axis gives the probability of a value falling in that range.
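To make the "area under the curve" idea concrete, here is a rough sketch of a Gaussian kernel density estimate written with the standard library only. The ages and the bandwidth are made-up assumptions, and plotting libraries choose their bandwidth differently, so this only illustrates the principle.

```python
import math

ages = [2, 19, 22, 24, 27, 30, 35, 38, 54, 61]  # made-up ages
bandwidth = 5.0                                  # arbitrarily chosen smoothing width

def kde(x):
    """Gaussian kernel density estimate at point x."""
    norm = len(ages) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - a) / bandwidth) ** 2) for a in ages) / norm

def area(lo, hi, steps=2000):
    """Trapezoidal approximation of the area under the KDE between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (kde(lo) + kde(hi))
    total += sum(kde(lo + i * h) for i in range(1, steps))
    return total * h

total_area = area(-50, 120)  # whole curve: should be very close to 1
p_20_40 = area(20, 40)       # estimated probability that an age lies in [20, 40]

print(round(total_area, 4), round(p_20_40, 4))
```

The density at a single point is not a probability by itself. Only the area over an interval is, which is why the whole curve integrates to about 1.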

The following plot shows the distribution of gender in each age group.

Now, let’s compare survival in each of these groups using a KDE plot.

We can also summarize what these histograms represent as follows:
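A possible way to put numbers behind such histograms is to bin the ages and group by gender. The bin edges, the labels and the `AgeGroup` column are hypothetical choices made for this sketch, and the data is made up.

```python
import pandas as pd

# Made-up sample (an assumption).
df = pd.DataFrame({
    "Age":      [4, 9, 25, 31, 28, 45, 52, 63, 36, 70],
    "Sex":      ["male", "female", "male", "female", "male",
                 "female", "male", "male", "female", "female"],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Hypothetical age bins; intervals are right-inclusive, e.g. (18, 40].
bins = [0, 18, 40, 60, 100]
labels = ["0-18", "19-40", "41-60", "61+"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels).astype(str)

# Passenger count and survival rate for each age group and gender.
table = df.groupby(["AgeGroup", "Sex"])["Survived"].agg(["count", "mean"])

print(table)
```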

4. Correlation of Number of Siblings/Spouses of the Passenger with Survival

Let’s start by understanding the distribution of values of this attribute.


Now, let’s plot a histogram describing the survival of passengers by their number of siblings/spouses.

The inference from the above histogram can be derived using the following code:
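As a sketch of what such code might look like, a cross-tabulation gives the survivor and non-survivor counts for each number of siblings/spouses. The column name `SibSp` is assumed from the usual Kaggle layout, and the data is made up.

```python
import pandas as pd

# Made-up sample (an assumption).
df = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 2, 2, 4],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
})

# Rows: number of siblings/spouses; columns: 0 = did not survive, 1 = survived.
counts = pd.crosstab(df["SibSp"], df["Survived"])

# Same table as row-wise fractions, so column 1 is the survival rate.
rates = pd.crosstab(df["SibSp"], df["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```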

5. Correlation of Number of Parents/Children with Survival

The distribution of the number of parents/children is as follows:

Here are two different plots showing the survival of passengers by their number of parents/children. The first uses ‘distplot’ and the second uses ‘countplot’.

6. Correlation of Fare with Survival

Now, let’s try to understand whether there was any regularity in the fare and whether it has any relation to survival. The code describes the distribution of the fare.

Let’s plot the distribution of the fare, split by survival.

Let’s check whether the passengers were charged uniformly. If not, let’s try to understand which factors decided the fare of the tickets.

To check whether gender was a factor in deciding the fare of the tickets, here’s the plot for each port of embarkation, followed by the inference from it.
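The inference step could be approximated with a pivot table of mean fares. The column names `Embarked`, `Sex` and `Fare` are assumed from the usual Kaggle layout, and the fares below are invented, so they do not reproduce the article's results.

```python
import pandas as pd

# Made-up sample (an assumption); C, Q, S stand for the three embarkation ports.
df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "Q", "Q", "S", "S", "S", "S"],
    "Sex":      ["female", "female", "male", "male", "female",
                 "male", "female", "female", "male", "male"],
    "Fare":     [80.0, 60.0, 50.0, 30.0, 8.0, 8.0, 40.0, 30.0, 20.0, 10.0],
})

# Mean fare for each port (rows) and gender (columns).
mean_fare = df.pivot_table(values="Fare", index="Embarked",
                           columns="Sex", aggfunc="mean")

print(mean_fare)
```

A mean can be pulled up by a few very expensive tickets, so comparing medians as well (with `aggfunc="median"`) may be worth considering.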

Thus, as per the data, the mean fare charged to women was significantly higher in Cherbourg and Southampton.

To check whether embarkation, class and age were factors in deciding the fare of the tickets, here’s the plot for each port of embarkation and class, split by survival status, followed by the inference from it.

Thus, it is evident from the data that tickets were priced mostly on the basis of Pclass and the port of embarkation, but not on the basis of age.

7. Correlation of Embarkation with Survival

Until now, we have seen the description of the numerical attributes of the data. Here’s a look at the description of the categorical data.

Here’s a plot showing the survival ratio of passengers from each port of embarkation.
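A brief sketch of how a categorical column can be described and how the survival ratio per port might be computed. The `Embarked` and `Survived` names are assumed from the usual Kaggle layout, and the rows are made up.

```python
import pandas as pd

# Made-up sample (an assumption).
df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "S", "C", "C", "C", "Q", "Q"],
    "Survived": [0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})

# For a text column, describe() reports count, unique, top (most frequent) and freq.
embarked_desc = df["Embarked"].describe()

# Fraction of survivors among the passengers from each port.
port_rate = df.groupby("Embarked")["Survived"].mean()

print(embarked_desc)
print(port_rate.round(2))
```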

And now, here’s the pair plot of each of the attributes that we have discussed so far.

As you might have noticed, we’ve ignored Passenger_Id, the passenger’s name, the ticket and the cabin number, as they play little to no role in determining the survival of the passenger.

Thus, we tried to understand the data by visualizing it with various techniques and uncovered various mysteries related to the Titanic. In the next article, we’ll look at the types of data and why some types need to be converted into a specific format before various machine learning models can be fitted to them. Thank you for joining me on this journey of exploration. I hope you enjoyed the experience of being a detective! 🕵

Link to the Notebook: Click Here

Link to Part 1 of this Blog: Click Here

Translated from: https://medium.com/@bapreetam/exploratory-data-analysis-a-case-study-on-titanic-data-set-part-2-96a9f3df963a
