How to Work with Excel Files in Pandas

From what I have seen so far, CSV seems to be the most popular format for storing data among data scientists. That is understandable: it gets the job done and it is a very simple format. In Python, even without any library, one can build a simple CSV parser in under 10 lines of code.
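
As a rough illustration of that claim, here is a minimal sketch of such a parser. The function name parse_csv and the sample data are made up for this example, and the parser deliberately ignores quoting and escaped separators, so it is not a substitute for the csv module.

```python
def parse_csv(text, sep=","):
    """Naive CSV parser: no quoting, no escaped separators (illustration only)."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split(sep)
    # Pair every data row with the header to get one dict per row.
    return [dict(zip(header, line.split(sep))) for line in lines[1:]]


rows = parse_csv("name,age\nAda,36\nAlan,41\n")
print(rows)
```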

But you may not always find the data you need in CSV format. Sometimes the only available format is an Excel file. One example is the dataset on ons.gov.uk about crime in England and Wales, which is published only in xlsx format; I will use it in the examples below.

Reading Excel files

The simplest way to read an Excel file into a pandas data frame is the following function (assuming you have run import pandas as pd):

df = pd.read_excel('path_to_excel_file', sheet_name='…')

Here sheet_name can be the name of the sheet we want to read, its index, or a list of all the sheets we want to read; the elements of the list can be mixed, sheet names and indices together. If we want all the sheets, we can use sheet_name=None. When more than one sheet is read, the sheets are returned as a dictionary of data frames. The keys of that dictionary are the indices or names of the sheets, matching how each one was specified in sheet_name; with sheet_name=None, the keys are the sheet names.
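
The variants of sheet_name can be sketched as follows. This is an illustration under assumptions, not the article's own code: the file name demo.xlsx and the sheet names first and second are made up, the workbook is built on the fly so the snippet is self-contained, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The file name and the sheet names are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

one = pd.read_excel(path, sheet_name="second")         # a single data frame
mixed = pd.read_excel(path, sheet_name=[0, "second"])  # dict keyed as requested
everything = pd.read_excel(path, sheet_name=None)      # dict keyed by sheet name

print(list(mixed.keys()))
print(list(everything.keys()))
```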

Now, if we use it to read our Excel file we get:

That's right, an error! It turns out that pandas cannot read Excel files on its own, so we need to install another Python package to do that.

We have two options: xlrd and openpyxl. Versions of xlrd before 2.0 can open both Excel 2003 (.xls) and Excel 2007+ (.xlsx) files, whereas openpyxl opens only Excel 2007+ (.xlsx) files. From version 2.0 on, xlrd reads only .xls files, so with a current xlrd you also need openpyxl (pip install openpyxl) for .xlsx files such as this dataset. With an older xlrd, installing it alone covers both formats:

pip install xlrd

Now, if we try to read the same data again:

It works!

But Excel files can be a little bit messier. Besides data, they may have other comments/explanations in the first and/or last couple of rows.

To tell pandas to start reading an Excel sheet from a specific row, pass header=n, where n is the 0-indexed row at which reading should start. By default header=0, and that first row supplies the names of the data frame columns.

To skip rows at the end of a sheet, pass skipfooter=n, where n is the number of rows to skip.

For example:
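
The snippet below is a sketch of both arguments on made-up data rather than on the ONS dataset: it writes a small sheet with two comment rows above the header and one note row below the data, then reads it back with header=2 and skipfooter=1. The file name and all cell values are hypothetical, and openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "messy.xlsx")

# Made-up sheet: 2 comment rows, then the real header, data, and a footer note.
raw = pd.DataFrame([
    ["Crime figures (made-up sample)", None],
    ["Source: illustrative data only", None],
    ["offence", "count"],
    ["burglary", 10],
    ["robbery", 4],
    ["Note: figures are provisional", None],
])
raw.to_excel(path, header=False, index=False)

# Row 2 (0-indexed) holds the column names; the last row is a note to skip.
df = pd.read_excel(path, header=2, skipfooter=1)
print(df)
```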

This is a little better. There are still some issues that are specific to this data, and depending on what we want to achieve we may also need to rearrange the data values in another way. In this article, however, we will focus only on reading Excel files into data frames and writing data frames back to Excel files.

Another way to read Excel files is to use a pd.ExcelFile object. Such an object is created with the pd.ExcelFile('excel_file_path') constructor. An ExcelFile object can be used in a couple of ways. Firstly, it has a .sheet_names attribute, which is a list of all the sheet names inside the opened Excel file.

Then, this ExcelFile object also has a .parse() method that can be used to parse a sheet from the file and return a data frame. The first parameter of this method can be the index of the sheet we want to parse or its name. The rest of the parameters are the same as in the pd.read_excel() function.

For example, the second sheet can be parsed either by its index (1) or by its name; both calls return the same data frame.
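
A minimal sketch of .sheet_names and .parse(), assuming openpyxl is installed. The workbook, its file name and its sheet names (first, second) are invented so that the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-sheet workbook, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

xl = pd.ExcelFile(path)
names = xl.sheet_names        # list of all sheet names in the file
by_index = xl.parse(1)        # second sheet, selected by 0-based index
by_name = xl.parse("second")  # the same sheet, selected by name
xl.close()

print(names)
print(by_index.equals(by_name))
```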

An ExcelFile can also be used in a with … as … statement. If you want to do something a little more elaborate, like parsing only the sheets with 2 words in their name, you can filter .sheet_names and call .parse() on the sheets that match.
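
One way to write that filter is sketched below. The sheet names (summary, table one, table two) and the file name are hypothetical, and "2 words" is interpreted here as two whitespace-separated tokens; openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook with one single-word and two two-word sheet names.
path = os.path.join(tempfile.mkdtemp(), "report.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for i, name in enumerate(["summary", "table one", "table two"]):
        pd.DataFrame({"value": [i]}).to_excel(writer, sheet_name=name, index=False)

# The with statement closes the file when the block ends.
with pd.ExcelFile(path) as xl:
    frames = {
        name: xl.parse(name)
        for name in xl.sheet_names
        if len(name.split()) == 2  # keep only two-word sheet names
    }

print(list(frames))
```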

You can do the same thing with pd.read_excel() instead of the .parse() method, by passing it the ExcelFile object together with the sheet name.

Or, if you simply want all the sheets, pass sheet_name=None.
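
Both variants are sketched here on an invented workbook (the file and sheet names are assumptions, and openpyxl is assumed to be installed). pd.read_excel() accepts an open ExcelFile object as its first argument, so the file is opened once and reused for every call.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook, created only so the example is runnable.
path = os.path.join(tempfile.mkdtemp(), "report.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for i, name in enumerate(["summary", "table one", "table two"]):
        pd.DataFrame({"value": [i]}).to_excel(writer, sheet_name=name, index=False)

with pd.ExcelFile(path) as xl:
    # Same filter as with .parse(), but going through pd.read_excel().
    two_word = {
        name: pd.read_excel(xl, sheet_name=name)
        for name in xl.sheet_names
        if len(name.split()) == 2
    }
    # sheet_name=None returns every sheet as a dict keyed by sheet name.
    all_sheets = pd.read_excel(xl, sheet_name=None)

print(list(two_word))
print(list(all_sheets))
```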

Writing Excel Files

Now that we know how to read Excel files, the next step is to write a data frame to an Excel file. We can do that with the data frame method .to_excel('path_to_excel_file', sheet_name='…').

Let's first create a simple data frame and then try to write it to an Excel file with .to_excel().

… and we got an error.

Again, pandas can't write to Excel files on its own; we need another package for that. The main options that we have are:

  • xlwt: works only with Excel 2003 (.xls) files; append mode not supported

  • xlsxwriter: works only with Excel 2007+ (.xlsx) files; append mode not supported

  • openpyxl: works only with Excel 2007+ (.xlsx) files; supports append mode

If we want to be able to write to the old .xls format, we should install xlwt, as it is the only one that handles those files (pandas 2.0 removed its xlwt engine, so this applies only to older pandas versions). For .xlsx files, we will choose openpyxl, as it also supports append mode.

pip install xlwt openpyxl

Now, if we run the code above again, it works and an Excel file is created.

By default, pandas also writes the index column along with our columns. To get rid of it, pass index=False to .to_excel().
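
The effect of index=False can be sketched like this. The data frame contents, file names and sheet name are made up, and openpyxl is assumed to be installed. Reading both files back shows the extra, unnamed index column only in the first one.

```python
import os
import tempfile

import pandas as pd

# Made-up data frame for illustration.
df = pd.DataFrame({"name": ["Ada", "Alan"], "score": [90, 85]})

folder = tempfile.mkdtemp()
with_index = os.path.join(folder, "with_index.xlsx")
without_index = os.path.join(folder, "without_index.xlsx")

df.to_excel(with_index, sheet_name="sheet1")                  # index written too
df.to_excel(without_index, sheet_name="sheet1", index=False)  # index left out

cols_with = pd.read_excel(with_index).columns.tolist()
cols_without = pd.read_excel(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```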

The index column isn't there now.

What if we want to write more sheets? If we want to add a second sheet to the previous file, do you think that simply calling .to_excel() again on the same file with sheet_name='sheet2' will work?

The answer is no. It will just overwrite the file with only one sheet: sheet2.

To write more sheets to an Excel file we need to use a pd.ExcelWriter object. First, we create another data frame for sheet2, then we open an Excel file as an ExcelWriter object and write the 2 data frames to it.
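
A minimal sketch of writing two sheets through one ExcelWriter, assuming openpyxl is installed. The data frames, file name and sheet names are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Two made-up data frames, one per sheet.
df1 = pd.DataFrame({"name": ["Ada", "Alan"], "score": [90, 85]})
df2 = pd.DataFrame({"city": ["London", "Leeds"], "visits": [3, 1]})

path = os.path.join(tempfile.mkdtemp(), "two_sheets.xlsx")

# Everything written inside the with block ends up in the same file;
# the file is saved when the block exits.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="sheet1", index=False)
    df2.to_excel(writer, sheet_name="sheet2", index=False)

with pd.ExcelFile(path) as xl:
    written = xl.sheet_names
print(written)
```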

Now our Excel file should have 2 sheets. If we then want to add another sheet to it, we need to open the file in append mode (mode='a') and write to it in the same way as before.
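
Append mode can be sketched as follows. The snippet first creates a two-sheet file so that it runs on its own, then reopens it with mode="a" and the openpyxl engine to add a third sheet. All data, file names and sheet names are hypothetical.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "sheets.xlsx")

# Starting point: a file that already has 2 sheets (made-up data).
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="sheet1", index=False)
    pd.DataFrame({"b": [2]}).to_excel(writer, sheet_name="sheet2", index=False)

# mode="a" keeps the existing sheets and adds the new one.
df3 = pd.DataFrame({"c": [3]})
with pd.ExcelWriter(path, mode="a", engine="openpyxl") as writer:
    df3.to_excel(writer, sheet_name="sheet3", index=False)

with pd.ExcelFile(path) as xl:
    sheets = xl.sheet_names
print(sheets)
```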

Our Excel file now has 3 sheets.

Working with Excel Formulas

At this point you are probably wondering about Excel formulas. How do we read from files that contain formulas? How do we write formulas to Excel files?

Well… good news. It is quite easy. Writing formulas to Excel files is as simple as just writing the string of the formula, and these strings will be automatically interpreted by Excel as formulas.

As an example:
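
The snippet below is a sketch with made-up data, not the article's original code. It adds a column of formula strings to a data frame and writes it with the openpyxl engine (assumed to be installed). The column letters A and B and the starting row 2 follow from writing without the index and with a header in row 1. The stored cell is then inspected with openpyxl to confirm that the formula text was written.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Excel row 1 holds the header, so the data starts at Excel row 2.
df["total"] = [f"=A{row}+B{row}" for row in range(2, len(df) + 2)]

path = os.path.join(tempfile.mkdtemp(), "formulas.xlsx")
df.to_excel(path, index=False)  # default sheet name is "Sheet1"

# Inspect the raw cell: openpyxl returns the formula text, not a value.
stored = load_workbook(path)["Sheet1"]["C2"].value
print(stored)
```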

When the file produced by the code above is opened in Excel, the formula cells show their computed values.

If we read an Excel file that contains formulas, pandas reads the results of those formulas into the data frame, not the formulas themselves.

For example, we can read the previously created file back with pd.read_excel(). Sometimes you first need to open the file in Excel and save it manually (CTRL+S) for this to work; otherwise you get zeros or empty cells instead of the formula results, because a file that no spreadsheet application has calculated stores only the formulas and not their values.

That’s all for this article. Thanks for reading!

Translated from: https://towardsdatascience.com/how-to-work-with-excel-files-in-pandas-c584abb67bfb
