How to Work with Excel Files in Pandas

From what I have seen so far, CSV seems to be the most popular format for storing data among data scientists. That's understandable: it gets the job done and it's a quite simple format. In Python, even without any library, one can build a simple CSV parser in under 10 lines of code.

But you may not always find the data you need in CSV format. Sometimes the only available format is an Excel file. One example is this dataset on ons.gov.uk about crime in England and Wales, which is available only in xlsx format; I will use it in the examples below.

Reading Excel files

The simplest way to read Excel files into pandas data frames is the following function (assuming you have run import pandas as pd):

df = pd.read_excel('path_to_excel_file', sheet_name='…')

Here sheet_name can be the name of the sheet we want to read, its index, or a list of all the sheets we want to read; the elements of the list can mix sheet names and indices. If we want all the sheets, we can use sheet_name=None. When more than one sheet is read, the result is a dictionary of data frames. Each key is either the index or the name of a sheet, depending on how it was specified in sheet_name; with sheet_name=None, the keys are sheet names.
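The sheet_name behaviour described above can be sketched as follows. This is a minimal illustration, assuming pandas and openpyxl are installed; the workbook, its sheet names ("first", "second") and its values are made up for the example and are not the ONS dataset.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# File name, sheet names and values are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

# A single name (or index) returns one data frame.
single = pd.read_excel(path, sheet_name="second")

# A list, possibly mixing indices and names, returns a dict
# whose keys are exactly what was passed in the list.
mixed = pd.read_excel(path, sheet_name=[0, "second"])

# sheet_name=None reads every sheet; the keys are the sheet names.
everything = pd.read_excel(path, sheet_name=None)

print(list(mixed))
print(list(everything))
```

Note that the dictionary keys follow the request: asking for index 0 gives the key 0, not the name of that sheet.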

Now, if we use it to read our Excel file, we get an error! It turns out that pandas cannot read Excel files on its own, so we need to install another Python package to do that.

There are 2 options that we have: xlrd and openpyxl. When this article was written, xlrd could open both Excel 2003 (.xls) and Excel 2007+ (.xlsx) files, whereas openpyxl can open only Excel 2007+ (.xlsx) files, so we install xlrd. Note that xlrd 2.0 and later read only .xls files; with a current xlrd, reading .xlsx files requires openpyxl instead.

pip install xlrd

Now, if we try to read the same data again, it works!

But Excel files can be a little bit messier. Besides data, they may have other comments/explanations in the first and/or last couple of rows.

To tell pandas to start reading an Excel sheet from a specific row, use the header argument, set to the 0-indexed row where reading should start. By default, header=0, and that row supplies the names of the data frame columns.

To skip rows at the end of a sheet, use skipfooter, set to the number of rows to skip.
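As a rough sketch of header and skipfooter, the snippet below writes a small sheet with two comment rows on top and one note at the bottom, then reads only the table in between. It assumes pandas and openpyxl are installed; the rows are invented for illustration and do not come from the ONS file.

```python
import os
import tempfile

import pandas as pd

# A made-up "messy" sheet: two title rows, the real header,
# two data rows, and one trailing note.
rows = [
    ["Table 1: made-up crime counts", None],
    ["Source: hypothetical", None],
    ["year", "offences"],
    [2018, 100],
    [2019, 120],
    ["Note: figures are illustrative", None],
]
path = os.path.join(tempfile.mkdtemp(), "messy.xlsx")
pd.DataFrame(rows).to_excel(path, header=False, index=False)

# header=2: the third row (0-indexed row 2) holds the column names.
# skipfooter=1: ignore the last row of the sheet.
df = pd.read_excel(path, header=2, skipfooter=1)
print(df)
```

Rows above the header row are dropped automatically, so only header and skipfooter are needed here.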

Reading our dataset with these two arguments gives a result that is a little better. There are still some issues that are specific to this data, and depending on what we want to achieve we may also need to rearrange the data values. But in this article, we will focus only on reading and writing data frames from and to Excel files.

Another way to read Excel files is by using a pd.ExcelFile object, constructed with pd.ExcelFile('excel_file_path'). An ExcelFile object can be used in a couple of ways. Firstly, it has a .sheet_names attribute, which is a list of all the sheet names inside the opened Excel file.

This ExcelFile object also has a .parse() method that parses a sheet from the file and returns a data frame. The first parameter of this method can be the index of the sheet we want to parse or its name. The rest of the parameters are the same as in the pd.read_excel() function.

For example, we can parse the second sheet by its index (1), or parse the same sheet using its name instead of an index.
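A minimal sketch of .sheet_names and .parse(), assuming pandas and openpyxl are installed. The workbook and the sheet names "first sheet" and "second sheet" are hypothetical stand-ins for the real file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook with two sheets.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first sheet", index=False)
    pd.DataFrame({"y": [3, 4]}).to_excel(writer, sheet_name="second sheet", index=False)

xlsx = pd.ExcelFile(path)
names = xlsx.sheet_names            # list of all sheet names, in order

by_index = xlsx.parse(1)            # second sheet, by 0-based index
by_name = xlsx.parse("second sheet")  # the same sheet, by name

xlsx.close()                        # release the underlying file handle

print(names)
print(by_index.equals(by_name))
```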

ExcelFile objects can also be used inside with … as … statements, which is handy if you want to do something a little more elaborate, like parsing only the sheets with 2 words in their name. You can do the same thing with pd.read_excel() instead of the .parse() method, and if you simply want all the sheets, you can pass sheet_name=None.
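The three variants just described might look like the sketch below. It assumes pandas and openpyxl are installed; the four sheet names are invented so that exactly two of them consist of 2 words.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook; the sheet names are made up for the example.
sheet_names = ["summary", "crime totals", "notes and sources", "regional data"]
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for i, name in enumerate(sheet_names):
        pd.DataFrame({"value": [i]}).to_excel(writer, sheet_name=name, index=False)

with pd.ExcelFile(path) as xlsx:
    # Keep only the sheets whose name consists of exactly 2 words.
    two_word = [name for name in xlsx.sheet_names if len(name.split()) == 2]

    # Variant 1: parse each selected sheet with .parse().
    frames = {name: xlsx.parse(name) for name in two_word}

    # Variant 2: pd.read_excel() also accepts an ExcelFile object.
    frames_alt = pd.read_excel(xlsx, sheet_name=two_word)

    # Variant 3: all sheets at once.
    all_frames = pd.read_excel(xlsx, sheet_name=None)

print(two_word)
```

Using the with statement closes the file automatically when the block ends.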

Writing Excel Files

Now that we know how to read Excel files, the next step is to write a data frame to an Excel file. We can do that with the data frame method .to_excel('path_to_excel_file', sheet_name='…').

Let's first create a simple data frame and then try to write it to an Excel file. With no writer package installed, we get an error.

Again, pandas can't write to Excel files on its own; we need another package for that. The main options that we have are:

  • xlwt: works only with Excel 2003 (.xls) files; append mode not supported

  • xlsxwriter: works only with Excel 2007+ (.xlsx) files; append mode not supported

  • openpyxl: works only with Excel 2007+ (.xlsx) files; supports append mode

If we want to write to the old .xls format, we should install xlwt, as it is the only one that handles those files. Note that xlwt support was removed in pandas 2.0, so this applies only to older pandas versions. For .xlsx files, we will choose openpyxl, as it also supports append mode.

pip install xlwt openpyxl

Now if we run the above code again, it works, and an Excel file is created. By default, pandas also writes the index column along with our columns. To get rid of it, pass index=False; the index column is then no longer written.
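The effect of index=False can be checked with a short sketch like the one below. It assumes pandas and openpyxl are installed; the data frame contents and the file names are made up.

```python
import os
import tempfile

import pandas as pd

# A made-up data frame to write.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

folder = tempfile.mkdtemp()
with_index = os.path.join(folder, "with_index.xlsx")
without_index = os.path.join(folder, "without_index.xlsx")

# Default: the index is written as an extra first column.
df.to_excel(with_index, sheet_name="sheet1")

# index=False: only the data columns are written.
df.to_excel(without_index, sheet_name="sheet1", index=False)

cols_with = list(pd.read_excel(with_index).columns)
cols_without = list(pd.read_excel(without_index).columns)
print(len(cols_with), cols_without)
```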

What if we want to write more sheets? Suppose we want to add a second sheet to the previous file by calling .to_excel() again with the same path and a new sheet name, sheet2. Do you think that will work?

The answer is no. It will just overwrite the file with only one sheet: sheet2.

To write more sheets to an Excel file, we need to use a pd.ExcelWriter object. First, we create another data frame for sheet2; then we open an Excel file as an ExcelWriter object and write the 2 data frames to it.
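Both behaviours, the overwrite and the multi-sheet write with pd.ExcelWriter, can be sketched as follows. This assumes pandas and openpyxl are installed; the data frames and the file name are invented for the example.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})   # made-up data for sheet1
df2 = pd.DataFrame({"b": [3, 4]})   # made-up data for sheet2
path = os.path.join(tempfile.mkdtemp(), "sheets.xlsx")

# Calling .to_excel() twice with a path replaces the whole file each time.
df1.to_excel(path, sheet_name="sheet1", index=False)
df2.to_excel(path, sheet_name="sheet2", index=False)
with pd.ExcelFile(path) as xlsx:
    after_overwrite = xlsx.sheet_names

# With an ExcelWriter, both data frames end up in the same workbook.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="sheet1", index=False)
    df2.to_excel(writer, sheet_name="sheet2", index=False)
with pd.ExcelFile(path) as xlsx:
    after_writer = xlsx.sheet_names

print(after_overwrite, after_writer)
```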

Now our Excel file should have 2 sheets. If we then want to add another sheet to it, we need to open the file in append mode and run code similar to the previous one. After doing so, our Excel file has 3 sheets.
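Append mode could look like the sketch below, assuming pandas and openpyxl are installed (append mode needs the openpyxl engine). The sheet names and values are placeholders.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "sheets.xlsx")

# Start with a two-sheet workbook (made-up data).
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="sheet1", index=False)
    pd.DataFrame({"b": [2]}).to_excel(writer, sheet_name="sheet2", index=False)

# mode="a" opens the existing file and keeps its sheets.
with pd.ExcelWriter(path, mode="a", engine="openpyxl") as writer:
    pd.DataFrame({"c": [3]}).to_excel(writer, sheet_name="sheet3", index=False)

with pd.ExcelFile(path) as xlsx:
    final_names = xlsx.sheet_names

print(final_names)
```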

Working with Excel Formulas

At this point, you are probably wondering about Excel formulas. How do we read from files that have formulas? How do we write them to Excel files?

Well, good news: it is quite easy. Writing formulas to Excel files is as simple as writing the string of the formula; Excel will automatically interpret these strings as formulas.

As an example, we can write a data frame whose cells contain formula strings; opened in Excel, those cells are evaluated as formulas.

Now, if we read an Excel file with formulas in it, such as the file we just created, pandas reads the results of those formulas into data frames.

Sometimes you need to save the Excel file manually for this to work, or you get zeros instead of the results of the formulas (open the file in Excel and hit CTRL+S before reading it). The results are calculated by Excel rather than by pandas, so they are available in the file only after Excel has saved it.
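Writing formulas can be sketched as below. This is an illustration under assumptions: pandas and openpyxl are installed, the column names and formulas are made up, and openpyxl is used directly only to inspect what was stored in the cell.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

# With index=False the columns land in A, B and C; row 1 is the header,
# so the first data row is row 2. The formulas are plain strings.
df = pd.DataFrame({
    "a": [1, 2],
    "b": [3, 4],
    "total": ["=A2+B2", "=A3+B3"],
})
path = os.path.join(tempfile.mkdtemp(), "formulas.xlsx")
df.to_excel(path, index=False, engine="openpyxl")

# Inspect the raw cell: it holds the formula text, not a computed value.
workbook = load_workbook(path)
cell = workbook.active["C2"]
stored = cell.value
workbook.close()

print(stored)
```

The computed result only becomes available to pd.read_excel() once a spreadsheet application has opened and saved the file, which matches the CTRL+S advice above.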


That’s all for this article. Thanks for reading!

Translated from: https://towardsdatascience.com/how-to-work-with-excel-files-in-pandas-c584abb67bfb
