Part 1: Titanic - Basics of Data Analysis

My goal was to get a better understanding of how to work with tabular data, so I challenged myself and started with the Titanic project. I think this was an excellent way to learn the basics of data analysis with Python.

You can find the competition at https://www.kaggle.com/c/titanic and I really recommend trying it yourself if you want to learn how to analyze data and build machine learning models.

I started by importing the packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Pandas is a great package for tabular data analysis. NumPy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib helps you generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Seaborn is built on top of Matplotlib and can be used to create attractive and informative statistical graphics.

After importing these packages, I loaded the data:

df=pd.read_csv("train.csv")

Then I had a quick look at the data:

df.head()
# Shows the first 5 rows of the table.
# To show 10 rows instead of 5, pass the number of rows:
df.head(10)

Screenshot of the first rows

df.tail()
# Shows the last 5 rows of the table.
I recommend starting with a look at the data so that you can be sure everything is as it should be. This helps you avoid careless mistakes later in the analysis.

df.shape
# Returns the number of rows and columns as a (rows, columns) tuple.

It is a good habit to print out the shape of the data in the beginning so you can check the number of columns and rows and be sure you haven’t missed any data during the analysis.
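As a minimal sketch of that habit, the snippet below unpacks `df.shape` into row and column counts and re-checks the row count after a filtering step. The small DataFrame and the `adults` filter are made-up stand-ins for illustration, not the Titanic data.

```python
import pandas as pd

# Tiny made-up table standing in for train.csv (hypothetical values).
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Age": [38.0, 22.0, 4.0, 27.0],
    "Fare": [71.28, 7.25, 16.70, 13.00],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Re-check the shape after a step that can drop rows.
adults = df[df["Age"] >= 18]
print(f"{adults.shape[0]} of {n_rows} rows kept")
```

Comparing the two numbers makes it obvious when a step has removed more rows than intended.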

Analyze the data

Then I continued to look at the data by counting the values. This gave me a lot of information about the content of the data.

df['Pclass'].value_counts()
# Counts how many passengers are in each class.

The number of passengers in each class. 3rd class was the most common.

I prefer using percentages to showcase values. It is easier to understand the values in percentages.

df['Pclass'].value_counts(normalize=True)
# Same as above, but normalize=True returns each count as a fraction of the total (0.55 means 55%).

55% of passengers were in 3rd class
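Because `normalize=True` yields fractions rather than percentages, a small extra step is needed to display actual percent values. The sketch below shows one way to do that; the `pclass` Series holds made-up values, not the real column.

```python
import pandas as pd

# Made-up class column (hypothetical values, not the real data).
pclass = pd.Series([3, 3, 1, 2, 3, 1, 3, 3, 2, 3], name="Pclass")

# normalize=True returns fractions that sum to 1.
fractions = pclass.value_counts(normalize=True)

# Scale to percentages and round for display.
percentages = fractions.mul(100).round(1)
print(percentages)
```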

I counted the values for each column separately. In the future, I plan to write a function that prints the value counts for every column, but that was outside the scope of this project.
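One possible shape for such a function is sketched here. The helper name `print_value_shares`, its parameters, and the sample rows are all hypothetical and are not taken from the original project.

```python
import pandas as pd

def print_value_shares(frame, columns):
    """Print the share of each value for every listed column.

    Hypothetical helper, not part of the original project.
    Returns the results in a dict so they can be reused.
    """
    shares = {}
    for col in columns:
        # dropna=False keeps missing values visible as their own row.
        shares[col] = frame[col].value_counts(normalize=True, dropna=False)
        print(f"--- {col} ---")
        print(shares[col])
    return shares

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Sex": ["female", "male", "male", "female"],
    "Embarked": ["C", "S", "S", None],
})
shares = print_value_shares(df, ["Pclass", "Sex", "Embarked"])
```

Passing the column list explicitly keeps free-text columns such as names out of the output, where value counts would not be informative.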

I also wanted to understand how the values in individual columns are distributed, so I used the describe() method for that.

df['Fare'].describe()
# describe() returns basic summary statistics: count, mean, standard deviation, minimum, quartiles, and maximum.

“Fare” column values

Here you can see, for example, that the minimum ticket price was $0.00 and the maximum was $512.33.
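The result of `describe()` is itself a pandas Series, so single statistics can be picked out by label instead of being read off the printed table. A minimal sketch, using made-up fares rather than the real column:

```python
import pandas as pd

# Made-up fares (hypothetical values, not the real column).
fares = pd.Series([0.0, 7.25, 13.0, 26.55, 71.28], name="Fare")

summary = fares.describe()  # a Series indexed by statistic name

# Individual statistics can be looked up by label.
print(f"min:    ${summary['min']:.2f}")
print(f"median: ${summary['50%']:.2f}")
print(f"max:    ${summary['max']:.2f}")
```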

I built several cross-tabulations to understand which variables were most closely related to survival.

pd.crosstab(df['Survived'], df['Sex'])
# Cross-tabulates passenger counts by survival and sex.

Here I also recommend using percentages instead of raw counts:

pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# With normalize=True, each cell is a fraction of all passengers.

Same as above, as fractions of the total

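Cross-tabulating different variables gives you information about possible relationships between them, for example between sex and survival. Note that with normalize=True each cell is a share of all passengers: about 26% of all passengers were women who survived, and the largest group, about 52% of all passengers, were men who did not survive.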
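To read survival rates within each sex rather than shares of all passengers, `pd.crosstab` also accepts `normalize="columns"`, which divides each cell by its column total. The sketch below contrasts the two modes on made-up rows; the values are hypothetical and do not reproduce the Titanic figures.

```python
import pandas as pd

# Made-up passengers (hypothetical values).
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# normalize=True divides by the grand total: shares of all passengers.
overall = pd.crosstab(df["Survived"], df["Sex"], normalize=True)

# normalize="columns" divides by each column total: survival rate per sex.
by_sex = pd.crosstab(df["Survived"], df["Sex"], normalize="columns")

print(overall)
print(by_sex)
```

In the `by_sex` table each column sums to 1, so the row for `Survived == 1` can be read directly as the survival rate of each group.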

Visualize the data

It is nice to have numerical values in tables, but visualized data is easier to understand, at least for me. This is why I plotted histograms and bar charts, which also taught me how to visualize data. Here are a few examples:

df.hist(column='Age')

In this histogram, you can see that most passengers were 20–40 years old.
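The counts behind a histogram like this can also be inspected as numbers by binning the ages with `pd.cut`. This is a sketch under assumptions: the ages and the 20-year bin edges are made up for illustration and are not the bins that `df.hist` chooses by default.

```python
import pandas as pd

# Made-up ages (hypothetical values, not the real column).
ages = pd.Series([4, 22, 25, 38, 35, 54, 19, 61], name="Age")

# Bin edges chosen for illustration; each interval is closed on the right.
bins = [0, 20, 40, 60, 80]

# Count how many ages fall into each interval, in interval order.
age_counts = pd.cut(ages, bins=bins).value_counts().sort_index()
print(age_counts)
```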

I used the seaborn library for the bar charts.

sns.countplot(x='Sex', hue='Survived', data=df);

More females survived than males.

Also, I used a heatmap to see the correlation between different columns.

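# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns;
# recent pandas versions raise an error if text columns such as Name are included.
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});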

The heatmap shows a strong negative correlation between Fare and Pclass: when one increases, the other decreases. This is logical because ticket prices in 1st class are higher than in 3rd class.

If we focus on the correlations between survival and the other columns, we see a positive correlation between survival and fare: passengers who paid more for their ticket were more likely to survive.
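To focus on survival alone, the relevant column of the correlation matrix can be pulled out and sorted instead of being read off the heatmap. The sketch below uses made-up rows, so its numbers do not reproduce the Titanic values. It selects numeric columns with `select_dtypes` as one way to keep text columns out of the calculation.

```python
import pandas as pd

# Made-up rows (hypothetical values); Sex is text and must be excluded.
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Fare":     [71.3, 53.1, 7.3, 8.1, 13.0, 7.9],
    "Sex":      ["female", "female", "male", "male", "female", "male"],
})

# Pearson correlation of each numeric column with Survived, lowest first.
corr_with_survived = (
    df.select_dtypes(include="number")
    .corr()["Survived"]
    .drop("Survived")
    .sort_values()
)
print(corr_with_survived)
```

A correlation coefficient only measures linear association, so a value here is a hint about where to look next, not evidence of cause.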

You can find the project on GitHub. Please feel free to try it yourself and comment if anything needs clarifying!

Thank you to the highly trained monkey (Risto Hinno) for motivating and inspiring me!

Translated from: https://medium.com/swlh/part-1-titanic-basic-of-data-analysis-ab3025d29f6e
