Pandas: First Step Towards Data Science (Part 3)

Data is almost never perfect. Data scientists spend more time preprocessing datasets than building models. We often come across datasets in which some values are missing. In Pandas, such data points are represented as NaN (Not a Number). It is therefore important to find the columns that contain NaN/null values early in the analysis.
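To make the idea of NaN concrete, here is a minimal sketch on made-up data. The `people` dataframe and its values are hypothetical and are not part of the Titanic dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing name (None) and one missing age (np.nan).
people = pd.DataFrame({
    "name": ["Ann", None, "Cid"],
    "age": [31.0, 45.0, np.nan],
})

# NaN never compares equal to anything, including itself,
# so `== np.nan` cannot be used to find missing values.
nan_equals_itself = (np.nan == np.nan)

# isnull() flags both np.nan and None as missing.
missing_mask = people.isnull()

print(nan_equals_itself)
print(missing_mask)
```

Because an equality check cannot detect NaN, Pandas provides dedicated methods for finding missing values.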

We have already covered many methods of the Pandas library. If you haven't read the previous articles in this series, I recommend going through them first to get into the flow. If you have been following from the beginning, let's get started.

In this article, we are going to learn:

  1. What is NaN?
  2. How to find NaN in a dataset?
  3. How to deal with NaN as a beginner?
  4. Finally, some methods to make a dataframe more readable.

How to find NaN in a dataset?

To check for NaN values in a column or in an entire dataframe, we use isnull() or isna(). Both behave identically (isnull() is an alias of isna()), so we will use isnull() in this article. Let's begin by checking for null values in the entire dataset.

>> print(titanic_data.info())
output :
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Here you can see some valuable information about the dataset, but the part we are interested in is the Non-Null Count column, which shows the number of non-null data points in each column. The first line of the output shows that there are 891 entries in total, that is, 891 data points, so any column with a count below 891 contains missing values. We can also get the number of non-null entries in each column directly with the count() method.

>> print(titanic_data.count())
output :
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

From this we can conclude that Age, Cabin and Embarked are the columns with null values. There is another way to get this result, using the isnull() method discussed earlier.

>> print(titanic_data.isnull().any())
output :
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

>> print(titanic_data.isnull().sum())
output :
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we can see, this output is more convenient when we are solely interested in null values: isnull().any() flags the columns that contain them, and isnull().sum() counts them.
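The checks above can be tried without the Titanic file. The following is a minimal sketch on a made-up dataframe; `toy` and its values are hypothetical stand-ins for titanic_data, and the percentage calculation is an extra step that the article itself does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

non_null_counts = toy.count()           # non-null values per column
has_null = toy.isnull().any()           # True where a column has at least one null
null_counts = toy.isnull().sum()        # number of nulls per column
null_share = toy.isnull().mean() * 100  # percentage of nulls per column

print(null_counts)
print(null_share)
```

The percentage view can be handy when deciding how to treat a column, since a raw count of nulls only means something relative to the number of rows.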

How to deal with NaN as a beginner?

It is important to know the number of null values in a column, because it helps us decide how to deal with them. If there are only a few null values, as in Embarked (2 of 891), we can remove those entries from the dataset. However, if most of the values are null, as in Cabin (687 of 891), it is better to skip that column when creating a model.
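The article does not show code for these two cases, so here is one possible sketch using dropna() and drop(). The `toy` dataframe is hypothetical, and choosing these two methods is an assumption about how the removal might be done.

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_data.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q", "S"],
    "Cabin": [None, "C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Few nulls: drop only the rows where Embarked is missing.
rows_dropped = toy.dropna(subset=["Embarked"])

# Mostly null: leave the Cabin column out altogether.
cleaned = rows_dropped.drop(columns=["Cabin"])

print(cleaned)
```

Both calls return new dataframes, so `toy` itself keeps all five rows and all three columns.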

There is a third case, where there are too many null values to remove the affected entries but too few to justify skipping the column, as with Age here (177 of 891). There are many ways to deal with null values in such cases, but as beginners we will learn just one trick here: filling them with a value. We will use the fillna() method to do that.

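>> titanic_data.fillna({"Age": "Unknown"}, inplace=True)
>> print(titanic_data.Age.isnull().any())
output :
False
# The Age column no longer has any null values

We used the inplace argument so that the change is applied to the dataframe that calls the method. If we do not pass this argument, or leave it as False, fillna() returns a new object and the change does not appear in our dataset. We call fillna() on the dataframe with a {column: value} dictionary rather than on titanic_data.Age, because in recent Pandas versions an in-place call on a selected column may not update the parent dataframe. Note also that filling a numeric column with a string such as "Unknown" changes its dtype to object; a numeric fill value keeps the column usable for calculations. We can check whether a specific column has null values in the same way as we did for the whole dataset.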
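As a variation on the fill step, the sketch below fills a numeric column with its median instead of a string and assigns the result back instead of using inplace. The `toy` dataframe is hypothetical, and the choice of the median is an assumption made for illustration, not the author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Age column of titanic_data.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# Without inplace=True, fillna() returns a new object; `toy` is untouched.
filled_copy = toy["Age"].fillna(0)
still_missing = toy["Age"].isnull().sum()

# Assigning the result back updates the dataframe and keeps the float dtype.
median_age = toy["Age"].median()  # the median ignores NaN values
toy["Age"] = toy["Age"].fillna(median_age)

print(still_missing)
print(toy["Age"].tolist())
```

Assigning the result back to the column is an alternative to inplace that makes it explicit which object is being modified.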

We can also replace values that are not NaN in a column using the replace() method.

# Assign the result back instead of calling replace() in place on a selected column
>> titanic_data["Sex"] = titanic_data["Sex"].replace({"male": "M", "female": "F"})
>> print(titanic_data.Sex)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex, Length: 891, dtype: object
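The replacement can also be tried on a small made-up Series. In this sketch `sex` is a hypothetical stand-in for the Sex column, and a single dictionary maps each old value to its new one.

```python
import pandas as pd

# Hypothetical stand-in for the Sex column of titanic_data.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# A dict maps each old value to its new value in a single call.
short = sex.replace({"male": "M", "female": "F"})

print(short.tolist())
```

replace() returns a new Series here, so `sex` keeps its original values.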

Some methods to make a dataset more readable

  1. rename() : There may be situations where we realize that a column name does not suit our requirements. We can use the rename() method to change it.

>> titanic_data.rename(columns={"Sex":"Gender"},inplace=True)
>> print(titanic_data.Gender)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Gender, Length: 891, dtype: object

  2. rename_axis() : This is a simple method and, as the name suggests, it is used to give names to the axes.

>> titanic_data.rename_axis("Sr.No",axis='rows',inplace=True)
>> titanic_data.rename_axis("Category",axis='columns',inplace=True)
>> print(titanic_data.head(2))
output :
Category  PassengerId  Survived  Pclass  .....
Sr.No
0                   1         0       3
1                   2         1       1
[2 rows x 12 columns]
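The difference between the two methods is easy to check on a small made-up dataframe. In this sketch `toy` is hypothetical, and the calls return new objects instead of using inplace.

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_data.
toy = pd.DataFrame({"Sex": ["M", "F"], "Fare": [7.25, 71.28]})

# rename() returns a new dataframe with the column relabelled.
renamed = toy.rename(columns={"Sex": "Gender"})

# rename_axis() names the axes themselves, not the individual labels.
labelled = (
    renamed
    .rename_axis("Sr.No", axis="index")
    .rename_axis("Category", axis="columns")
)

print(labelled)
```

rename() changes individual labels such as a column name, while rename_axis() only sets the name of the index or of the column axis.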

With this we come to the end of this article and of the series on Pandas. I believe the methods we came across in this series are very helpful for analyzing data before we start training models on it. However, they are just a small fraction of the methods in the Pandas library, and only the beginning of data exploration and preprocessing. As a beginner, though, I think they are enough to get started on a Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding ! 😄

Translated from: https://medium.com/swlh/pandas-first-step-towards-data-science-part-3-351321c24cc0
