From what I have seen so far, CSV seems to be the most popular format for storing data among data scientists. That is understandable: it gets the job done and it is quite a simple format. In Python, even without any library, one can build a simple CSV parser in under 10 lines of code.
But you may not always find the data you need in CSV format. Sometimes the only available format is an Excel file. One example is this dataset on ons.gov.uk about crime in England and Wales, which is available only in xlsx format; I will use it in the examples below.
Reading Excel files
The simplest way to read Excel files into pandas data frames is by using the following function (assuming you have run import pandas as pd):
df = pd.read_excel('path_to_excel_file', sheet_name='…')
Where sheet_name can be the name of the sheet we want to read, its index, or a list of all the sheets we want to read; the elements of the list can be mixed: sheet names or indices. If we want all the sheets, we can use sheet_name=None. When more than one sheet is requested, they are returned as a dictionary of data frames. The keys of such a dictionary are either the index or the name of each sheet, depending on how it was specified in sheet_name; in the case of sheet_name=None, the keys are the sheet names.
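The different forms of sheet_name can be tried with a minimal sketch like the one below. The workbook, its sheet names ("first", "second") and its values are made up so that the example runs on its own, and it assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-sheet workbook, written only so the example is runnable.
path = Path(tempfile.mkdtemp()) / "example.xlsx"
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

one_by_name = pd.read_excel(path, sheet_name="second")   # a single data frame
one_by_index = pd.read_excel(path, sheet_name=1)         # same sheet, by position
mixed = pd.read_excel(path, sheet_name=[0, "second"])    # dict keyed as requested
everything = pd.read_excel(path, sheet_name=None)        # dict keyed by sheet name

print(list(mixed.keys()))
print(list(everything.keys()))
```

Note that the keys of the mixed dictionary mirror exactly what was passed in: an integer for the first sheet and a string for the second.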
Now, if we use it to read our Excel file, we get an error. It turns out that pandas cannot read Excel files on its own, so we need to install another Python package to do that.
There are 2 options that we have: xlrd and openpyxl. Older releases of xlrd could open both Excel 2003 (.xls) and Excel 2007+ (.xlsx) files, but since version 2.0 xlrd reads only .xls files, whereas openpyxl opens only Excel 2007+ (.xlsx) files. Our dataset is an .xlsx file, so we will install openpyxl; add xlrd as well if you also need to read the old .xls format:
pip install openpyxl xlrd
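To see which of these packages are already importable before calling pd.read_excel, a small check along these lines may help. The helper name installed_engines is made up for this sketch; it uses only the standard library.

```python
from importlib.util import find_spec


def installed_engines(candidates=("openpyxl", "xlrd")):
    """Return the candidate package names that can be imported here."""
    return [name for name in candidates if find_spec(name) is not None]


# Prints the subset of ("openpyxl", "xlrd") available in this environment.
print(installed_engines())
```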
Now, if we try to read the same data again, it works!
But Excel files can be a little messier. Besides data, they may contain comments or explanations in the first and/or last few rows.
To tell pandas to start reading an Excel sheet from a specific row, use the argument header, set to the 0-indexed row where reading should start. That row supplies the names of the data frame columns, and the rows above it are ignored. By default, header=0, so the first row of the sheet is used for the column names.
To skip rows at the end of a sheet, use skipfooter, set to the number of rows to skip.
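A minimal sketch of the two arguments together is shown below. It does not use the ONS dataset; instead it builds a small, made-up sheet with two comment rows on top and one footnote row at the bottom, and it assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Hypothetical messy sheet: 2 comment rows, a header row, data, 1 footnote row.
path = Path(tempfile.mkdtemp()) / "messy.xlsx"
wb = Workbook()
ws = wb.active
ws.append(["Some title", None])
ws.append(["Some explanation", None])
ws.append(["year", "count"])        # 0-indexed row 2: the real header
ws.append([2019, 10])
ws.append([2020, 12])
ws.append(["Source: made-up footnote", None])
wb.save(str(path))

# header=2 skips the two comment rows; skipfooter=1 drops the footnote row.
df = pd.read_excel(path, header=2, skipfooter=1)
print(df)
```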
This is a little better. There are still some issues that are specific to this data. Depending on what we want to achieve, we may also need to rearrange the data values in another way. But in this article, we will focus only on reading and writing to and from data frames.
Another way to read Excel files besides the one above is by using a pd.ExcelFile object. Such an object can be constructed with pd.ExcelFile('excel_file_path'). An ExcelFile object can be used in a couple of ways. Firstly, it has a .sheet_names attribute, which is a list of all the sheet names inside the opened Excel file.
Then, this ExcelFile object also has a .parse() method that can be used to parse a sheet from the file and return a data frame. The first parameter of this method can be the index of the sheet we want to parse or its name. The rest of the parameters are the same as in the pd.read_excel() function.
A sheet can be parsed by its index, for example the second sheet (index 1), or by its name instead.
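Both ways of calling .parse() can be sketched as follows. The workbook and its sheet names are hypothetical and exist only to make the example runnable; openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-sheet workbook used only for this illustration.
path = Path(tempfile.mkdtemp()) / "example.xlsx"
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

xl = pd.ExcelFile(path)
names = xl.sheet_names          # list of all sheet names in the file
by_index = xl.parse(1)          # second sheet, selected by index
by_name = xl.parse("second")    # the same sheet, selected by name
xl.close()

print(names)
print(by_index.equals(by_name))
```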
ExcelFile objects can also be used inside with … as … statements, and if you want to do something a little more elaborate, like parsing only the sheets with 2 words in their name, you can combine .sheet_names with .parse().
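One way this could look is sketched below, assuming that "2 words" means two whitespace-separated tokens in the sheet name. The sheet names are invented for the example and openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook whose sheet names have 1, 2, 2 and 3 words.
path = Path(tempfile.mkdtemp()) / "example.xlsx"
sheet_names = ["Summary", "Table one", "Table two", "Notes and sources"]
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for i, name in enumerate(sheet_names):
        pd.DataFrame({"value": [i]}).to_excel(writer, sheet_name=name, index=False)

# The file is closed automatically when the with block ends.
with pd.ExcelFile(path) as xl:
    two_word_sheets = {
        name: xl.parse(name)
        for name in xl.sheet_names
        if len(name.split()) == 2
    }

print(list(two_word_sheets.keys()))
```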
You can do the same thing by passing the ExcelFile object to pd.read_excel() instead of calling the .parse() method, or, if you simply want all the sheets, by using sheet_name=None.
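A sketch of both variants follows; pd.read_excel() accepts an open ExcelFile object in place of a path. The workbook and sheet names are again hypothetical, and openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook used only for this illustration.
path = Path(tempfile.mkdtemp()) / "example.xlsx"
sheet_names = ["Summary", "Table one", "Table two"]
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for i, name in enumerate(sheet_names):
        pd.DataFrame({"value": [i]}).to_excel(writer, sheet_name=name, index=False)

with pd.ExcelFile(path) as xl:
    wanted = [name for name in xl.sheet_names if len(name.split()) == 2]
    selected = pd.read_excel(xl, sheet_name=wanted)   # only the 2-word sheets
    all_sheets = pd.read_excel(xl, sheet_name=None)   # every sheet in the file

print(list(selected.keys()))
print(list(all_sheets.keys()))
```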
Writing Excel Files
Now that we know how to read Excel files, the next step is to be able to write a data frame to an Excel file. We can do that by using the data frame method .to_excel('path_to_excel_file', sheet_name='…').
Let’s first create a simple data frame for writing to an Excel file. If we then try to write it without any extra package installed, we get an error.
Again, pandas can’t write to Excel files on its own; we need another package for that. The main options that we have are:
xlwt: works only with Excel 2003 (.xls) files; append mode not supported
xlsxwriter: works only with Excel 2007+ (.xlsx) files; append mode not supported
openpyxl: works only with Excel 2007+ (.xlsx) files; supports append mode
If we want to be able to write to the old .xls format, we should install xlwt, as it is the only one that handles those files. Note, however, that pandas 2.0 and later no longer include the xlwt engine, so writing .xls this way requires an older pandas release. For .xlsx files, we will choose openpyxl, as it also supports append mode.
pip install xlwt openpyxl
Now, if we run the above code again, it works; an Excel file was created.
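A minimal sketch of such a write is shown below. The data frame contents, the file name and the sheet name are made up for illustration, and the call assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data frame and output path.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
path = Path(tempfile.mkdtemp()) / "output.xlsx"

df.to_excel(path, sheet_name="sheet1")

# Reading the file back shows that the index was written as an extra,
# unnamed first column.
back = pd.read_excel(path, sheet_name="sheet1")
print(path.exists())
print(list(back.columns))
```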
By default, pandas also writes the index column along with our columns. To get rid of it, pass index=False to .to_excel(); the index column is then no longer written.
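The effect of index=False can be checked with a short sketch such as this one, using a made-up data frame and file name and assuming openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data frame and output path.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
path = Path(tempfile.mkdtemp()) / "output.xlsx"

# index=False leaves the index out of the written sheet.
df.to_excel(path, sheet_name="sheet1", index=False)

back = pd.read_excel(path, sheet_name="sheet1")
print(list(back.columns))
```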
What if we want to write more sheets? If we want to add a second sheet to the previous file, do you think that simply calling .to_excel() again with another sheet name will work?
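The attempt in question might look like the sketch below, which writes two made-up data frames to the same path one after the other and then lists the sheets that the file ends up with. It assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data frames and output path.
df1 = pd.DataFrame({"a": [1, 2, 3]})
df2 = pd.DataFrame({"c": [7, 8, 9]})
path = Path(tempfile.mkdtemp()) / "output.xlsx"

df1.to_excel(path, sheet_name="sheet1", index=False)
df2.to_excel(path, sheet_name="sheet2", index=False)   # same path, second call

with pd.ExcelFile(path) as xl:
    remaining = xl.sheet_names

print(remaining)
```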
The answer is no. It will just overwrite the file, leaving only one sheet: sheet2.
To write more sheets to an Excel file, we need to use a pd.ExcelWriter object. First, we create another data frame for sheet2, then we open an Excel file as an ExcelWriter object, in which we write the 2 data frames.
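A sketch of this pattern, with invented data frames and file name and assuming openpyxl is installed, could be:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data frames and output path.
df1 = pd.DataFrame({"a": [1, 2, 3]})
df2 = pd.DataFrame({"c": [7, 8, 9]})
path = Path(tempfile.mkdtemp()) / "output.xlsx"

# Both sheets are written through the same writer; the file is saved
# when the with block exits.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="sheet1", index=False)
    df2.to_excel(writer, sheet_name="sheet2", index=False)

with pd.ExcelFile(path) as xl:
    written = xl.sheet_names

print(written)
```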
Now our Excel file should have 2 sheets. If we then want to add another sheet to it, we need to open the file in append mode and run code similar to the previous one.
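Append mode is selected with mode='a' on pd.ExcelWriter and needs the openpyxl engine. The sketch below first creates a two-sheet file and then appends a third sheet; all names and values are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

path = Path(tempfile.mkdtemp()) / "output.xlsx"

# Hypothetical starting point: a workbook with 2 sheets.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2, 3]}).to_excel(writer, sheet_name="sheet1", index=False)
    pd.DataFrame({"c": [7, 8, 9]}).to_excel(writer, sheet_name="sheet2", index=False)

# Re-open the existing file in append mode and add one more sheet.
df3 = pd.DataFrame({"e": [10, 11, 12]})
with pd.ExcelWriter(path, mode="a", engine="openpyxl") as writer:
    df3.to_excel(writer, sheet_name="sheet3", index=False)

with pd.ExcelFile(path) as xl:
    sheets = xl.sheet_names

print(sheets)
```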
Our Excel file now has 3 sheets.
Working with Excel Formulas
At this point, you are probably wondering about Excel formulas. What about them? How do we read from files that have formulas? How do we write them to Excel files?
Well… good news: it is quite easy. Writing formulas to Excel files is as simple as writing the string of the formula, starting with an equals sign, and these strings will be automatically interpreted by Excel as formulas.
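As a sketch, the data frame below has a made-up "total" column that holds formula strings. With index=False the columns land in A, B and C and the first data row is row 2, which is what the cell references assume. The check at the end uses openpyxl directly to confirm the cell was stored as a formula; openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import load_workbook

# Hypothetical data; the "total" column contains formula strings.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "total": ["=A2+B2", "=A3+B3", "=A4+B4"],
})
path = Path(tempfile.mkdtemp()) / "formulas.xlsx"
df.to_excel(path, sheet_name="sheet1", index=False, engine="openpyxl")

# Inspect the raw cell: it keeps the formula text and is typed as a formula.
ws = load_workbook(str(path))["sheet1"]
cell = ws["C2"]
print(cell.value, cell.data_type)
```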
Now, if we read an Excel file with formulas in it, such as our previously created file, pandas will read the results of those formulas into the data frame, not the formulas themselves.
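Sometimes you need to save the Excel file manually for this to work; otherwise you may get zeros or missing values instead of the results of the formulas (open the file in Excel and hit CTRL+S before reading it). This happens because the results are computed by Excel and stored in the file only when Excel saves it; the library that wrote the formulas does not evaluate them.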
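This caveat can be reproduced without Excel. In the sketch below, which uses made-up data and assumes openpyxl is installed for both writing and reading, the formulas are written by pandas and the file is read back straight away, so no results have been stored yet and the formula column is expected to come back as missing values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data; "total" holds formula strings that nothing has evaluated.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "total": ["=A2+B2", "=A3+B3", "=A4+B4"],
})
path = Path(tempfile.mkdtemp()) / "formulas.xlsx"
df.to_excel(path, sheet_name="sheet1", index=False, engine="openpyxl")

# The file was never opened and saved in Excel, so there are no cached results.
back = pd.read_excel(path, sheet_name="sheet1", engine="openpyxl")
print(back)
print(back["total"].isna().all())
```

After opening the same file in Excel and saving it, reading it again should return the computed totals instead.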
That’s all for this article. Thanks for reading!
Translated from: https://towardsdatascience.com/how-to-work-with-excel-files-in-pandas-c584abb67bfb