Pandas Profiling Report: The Best Exploratory Data Analysis with Pandas Profiling

Table of Contents

  1. Introduction
  2. Overview
  3. Variables
  4. Interactions
  5. Correlations
  6. Missing Values
  7. Sample
  8. Summary

Introduction

There are countless ways to perform exploratory data analysis (EDA) in Python (and in R). I do most of mine in the popular Jupyter Notebook. Once I realized there was a library that could summarize my dataset with just one line of code, I made sure to use it for every project, and its ease of use has paid off repeatedly. EDA should come before any Machine Learning model is run, and the developers of Pandas Profiling [2] have made it easy to view your dataset in a clean format that also describes its contents well.

The Pandas Profiling report is an EDA tool that offers the following: an overview, variables, interactions, correlations, missing values, and a sample of your data. I will use randomly generated data as an example of this tool.

Overview

Overview example. Screenshot by Author [3].

The overview tab in the report gives a quick look at how many variables and observations you have, that is, the number of columns and rows. It also counts the missing cells and reports them as a percentage of all cells in the dataframe. Additionally, it points out duplicate rows and calculates their percentage. This tab is most similar to part of the describe function from Pandas, while providing a better user-interface (UI) experience.

The overview is broken into dataset statistics and variable types. You can also refer to the warnings and reproduction tabs for more specific information on your data.
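The dataset statistics described above can be approximated with plain Pandas. The following is a minimal sketch, not the report's own implementation; the seeded random data and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 15 rows and 6 integer columns, like the article's dummy data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

n_observations, n_variables = df.shape         # rows, columns
missing_cells = int(df.isna().sum().sum())     # NaN count over the whole frame
missing_pct = 100 * missing_cells / df.size    # share of all cells
duplicate_rows = int(df.duplicated().sum())    # rows identical to an earlier row
duplicate_pct = 100 * duplicate_rows / len(df)

print(f"Variables: {n_variables}, observations: {n_observations}")
print(f"Missing cells: {missing_cells} ({missing_pct:.1f}%)")
print(f"Duplicate rows: {duplicate_rows} ({duplicate_pct:.1f}%)")
print(df.dtypes.value_counts())                # rough stand-in for "variable types"
```

The report gathers these numbers in one panel; computing them by hand is mostly useful when you want to assert on them in a script.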

Next, I will discuss variables, which are also referred to as the columns or features of your dataframe.

Variables

Variables example. Screenshot by Author [4].

For more granular descriptive statistics, the variables tab is the way to go. For each feature or variable of your dataframe you can look at the distinct and missing counts and at aggregations such as the mean, minimum, and maximum. You can also see the type of data you are working with (e.g., NUM). Not pictured is what happens when you click on ‘Toggle details’: the toggle reveals a whole set of additional statistics. The details include:

Statistics: quantile and descriptive

Quantile statistics:

Minimum
5th percentile
Q1
Median
Q3
95th percentile
Maximum
Range
Interquartile range (IQR)

Descriptive statistics:

Standard deviation
Coefficient of variation (CV)
Kurtosis
Mean
Median Absolute Deviation (MAD)
Skewness
Sum
Variance
Monotonicity

These statistics provide similar information to the describe function that I see most Data Scientists using today; however, there are a few more of them, and the report presents them in an easy-to-view format.
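For comparison, the quantile and descriptive statistics listed above can be computed for a single column with Pandas. This is a sketch under assumptions: the data is randomly generated, and the report's exact definitions (for example sample versus population variance) may differ from the Pandas defaults used here.

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the numbers are reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))
col = df["A"]

quantile_stats = {
    "minimum": col.min(),
    "5th percentile": col.quantile(0.05),
    "Q1": col.quantile(0.25),
    "median": col.median(),
    "Q3": col.quantile(0.75),
    "95th percentile": col.quantile(0.95),
    "maximum": col.max(),
    "range": col.max() - col.min(),
    "IQR": col.quantile(0.75) - col.quantile(0.25),
}

descriptive_stats = {
    "standard deviation": col.std(),
    "coefficient of variation": col.std() / col.mean(),
    "kurtosis": col.kurt(),
    "mean": col.mean(),
    "median absolute deviation": (col - col.median()).abs().median(),
    "skewness": col.skew(),
    "sum": col.sum(),
    "variance": col.var(),
    "monotonic": col.is_monotonic_increasing or col.is_monotonic_decreasing,
}

print(pd.Series(quantile_stats))
print(pd.Series(descriptive_stats))
```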

Histograms

The histograms provide an easily digestible visual of your variables. You can expect to see the frequency of your variable on the y-axis and fixed-size bins (bins=15 is the default) on the x-axis.
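To see what fixed-size binning does to a column, here is a small text-only sketch using NumPy. It uses 15 bins to match the default stated above; the data and the printed layout are assumptions for illustration, and the report itself draws a proper chart.

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the numbers are reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

# Split the range of column A into 15 equal-width bins and count values per bin.
counts, bin_edges = np.histogram(df["A"], bins=15)

# Print one line per bin: the bin's edges followed by one '#' per observation.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * int(count)}")
```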

Common Values

The common values section lists the most common values of your variable, with the count and frequency of each.
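A rough Pandas equivalent of the common values table is value_counts. In this sketch the column is hypothetical and deliberately drawn from a small range so that values repeat.

```python
import numpy as np
import pandas as pd

# Hypothetical column with repeated values (integers 0..4, 50 observations).
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 5, size=50), name="A")

# value_counts sorts by count, most frequent first.
common = col.value_counts().to_frame(name="count")
common["frequency (%)"] = 100 * common["count"] / len(col)

print(common)
```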

Extreme Values

The extreme values section lists the smallest and largest values of your variable, with the count and frequency of each.
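The same idea can be sketched for the extremes by sorting the value counts by value instead of by count. The column below is hypothetical, and showing five values at each end is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical column (integers 0..19, 50 observations).
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 20, size=50), name="A")

# Count each distinct value, then order by the value itself.
counts = col.value_counts().sort_index()

smallest = counts.head(5)             # five smallest distinct values with counts
largest = counts.tail(5).iloc[::-1]   # five largest distinct values, largest first

print("Smallest values:")
print(smallest)
print("Largest values:")
print(largest)
```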

Interactions

Interactions example. Screenshot by Author [5].

The interactions feature of the profiling report is unique in that you can choose which of your columns goes on the x-axis and which on the y-axis. For example, pictured above is variable A against variable A, which is why the points overlap. You can easily switch to other variables or columns to get a different plot and a clear representation of your data points.

Correlations

Correlations example. Screenshot by Author [6].

Making fancier or more colorful correlation plots can be time-consuming if you build them line by line in Python. With this correlation plot, however, you can easily visualize the relationships between the variables in your data, and the plot is nicely color-coded. There are four main plots that you can display:

  • Pearson’s r
  • Spearman’s ρ
  • Kendall’s τ
  • Phik (φk)

You may only be used to one of these correlation methods, so the others may sound confusing or unusable. The correlation plot therefore comes with a toggle that explains the meaning of each correlation you can visualize. This feature really helps when you need a refresher on correlation, as well as when you are deciding which plot(s) to use for your analysis.
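If you want the numbers behind two of these plots, Pandas can compute the correlation matrices directly. This sketch covers only Pearson and Spearman on hypothetical random data; DataFrame.corr also accepts method="kendall" (which relies on SciPy being installed), and Phik is not part of Pandas, so both are left out here.

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the numbers are reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

pearson = df.corr(method="pearson")    # linear correlation
spearman = df.corr(method="spearman")  # rank-based correlation

print(pearson.round(2))
print(spearman.round(2))
```

Each matrix is symmetric with ones on the diagonal, which is the grid that the color-coded plot visualizes.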

Missing Values

Missing Values example. Screenshot by Author [7].

As you can see from the plot above, the report also covers missing values. You can see how much of each variable is missing, shown both as a count and as a matrix. It is a nice way to check your data before you fit any models with it. Ideally you want to see a plot like the one above, which means you have no missing values.
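The count and matrix views can be imitated with isna. In this sketch a few NaN values are injected into hypothetical data on purpose, since the article's dummy data has none.

```python
import numpy as np
import pandas as pd

# Hypothetical data as floats so that NaN can be stored.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 200, size=(15, 6)).astype(float), columns=list("ABCDEF")
)

# Inject a few missing values for illustration.
df.loc[[1, 4, 7], "B"] = np.nan
df.loc[[2], "E"] = np.nan

# "Count" view: missing and present cells per variable.
missing = pd.DataFrame({"missing": df.isna().sum(), "present": df.notna().sum()})
missing["missing (%)"] = 100 * missing["missing"] / len(df)
print(missing)

# "Matrix" view: 1 marks a missing cell, 0 a present one.
print(df.isna().astype(int))
```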

Sample

Sample example. Screenshot by Author [8].

Sample acts similarly to the head and tail functions, returning your dataframe’s first few rows or last few rows. In this example, you can see both the first rows and the last rows. I use this tab when I want a sense of where my data starts and where it ends. I recommend ranking or ordering your data first to get more out of this tab, since the first and last rows then show the range of your data.
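The ordering tip above can be tried with head, tail, and sort_values. This is a sketch on hypothetical data; sorting by column A is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the numbers are reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 200, size=(15, 6)), columns=list("ABCDEF"))

# First and last rows in their original order (5 rows each by default).
print(df.head())
print(df.tail())

# After sorting, head and tail show the low and high end of column A.
ranked = df.sort_values("A").reset_index(drop=True)
print(ranked.head())
print(ranked.tail())
```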

Summary

Photo by Elena Loshina on Unsplash [9].

I hope this article provided you with some inspiration for your next exploratory data analysis. Being a Data Scientist can be overwhelming, and EDA is often forgotten or not practiced as much as model-building. With the Pandas Profiling report, you can perform EDA with minimal code and still get useful statistics and visualizations. That way, you can focus on the fun part of Data Science and Machine Learning: the modeling process.

To summarize, the main features of the Pandas Profiling report include the overview, variables, interactions, correlations, missing values, and a sample of your data.

Here is the code I used to install and import the libraries, to generate some dummy data for the example, and finally the one line of code that generates the Pandas Profiling report from your Pandas dataframe [10].

# install library
#!pip install pandas_profiling
import pandas_profiling
import pandas as pd
import numpy as np

# create data
df = pd.DataFrame(np.random.randint(0,200,size=(15, 6)), columns=list('ABCDEF'))

# run your report!
df.profile_report()

# I did get an error and had to reinstall matplotlib to fix

Please feel free to comment below if you have any questions or have used this feature before. There is still some information I did not describe, but you can find more of it at the link I provided above.

Thank you for reading, I hope you enjoyed it!

Translated from: https://towardsdatascience.com/the-best-exploratory-data-analysis-with-pandas-profiling-e85b4d514583
