Life is short, I use Python: converting file formats with pandas

Preface
Example 1: converting between Excel and CSV
Methods for common formats
- Flat file
- Excel
- JSON
- XML
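Example 2: converting between common formats
- Requirements
- Dependencies
- The export function
- The main function
Appendix: methods for other formats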
- HTML
- Pickling
- Clipboard
- Latex
- HDFStore: PyTables (HDF5)
- Feather
- Parquet
- ORC
- SAS
- SPSS
- SQL
- Google BigQuery
- STATA

Preface

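pandas supports many file formats, and its IO methods make it possible to convert between them. Building on an Excel/CSV conversion example and on the formats pandas supports, this post implements a simple file format converter.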

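Example 1: converting between Excel and CSV

A previous post used pandas to convert Excel to CSV. The reverse direction, CSV to Excel, works just as well.

Sample code for both directions follows: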

Excel to CSV

    import os

    import pandas as pd


    def export_csv(input_file, output_path):
        # Open the workbook once as an ExcelFile object
        with pd.ExcelFile(input_file) as xls:
            # Iterate over the worksheet names
            for i, sheet_name in enumerate(xls.sheet_names):
                # Read the worksheet into a DataFrame
                df = pd.read_excel(xls, sheet_name=sheet_name)
                output_file = os.path.join(output_path, f'{i + 1}-{sheet_name}.csv')
                # Write the DataFrame to a CSV file, one file per worksheet
                df.to_csv(output_file, index=False)

CSV to Excel

    import pathlib

    import pandas as pd


    def export_excel(input_file, output_file=None):
        # If no output file is given, reuse the input name with an .xlsx extension
        if not output_file:
            input_path = pathlib.Path(input_file)
            output_path = input_path.parent / (input_path.stem + '.xlsx')
            output_file = str(output_path)
        df = pd.read_csv(input_file)
        df.to_excel(output_file, index=False)

Methods for common formats

The tables below are taken from the Input/output section of the official pandas documentation.

Flat file

Method	Description
`read_table`(filepath_or_buffer, *[, sep, …])	Read general delimited file into DataFrame.
`read_csv`(filepath_or_buffer, *[, sep, …])	Read a comma-separated values (csv) file into DataFrame.
`DataFrame.to_csv`([path_or_buf, sep, na_rep, …])	Write object to a comma-separated values (csv) file.
`read_fwf`(filepath_or_buffer, *[, colspecs, …])	Read a table of fixed-width formatted lines into DataFrame.
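
As a quick illustration of the flat-file functions listed above, the sketch below parses semicolon-delimited text with `read_csv` and writes it back out as comma-separated text with `DataFrame.to_csv`. The sample data and variable names are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data: semicolon-delimited text held in memory.
raw = "name;age\nAlice;30\nBob;25\n"

# sep tells read_csv which delimiter to split on (the default is a comma).
df = pd.read_csv(io.StringIO(raw), sep=";")

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file path instead of the `StringIO` object, and a path as the first argument of `to_csv`, gives the file-to-file version of the same conversion.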

Excel

Method	Description
`read_excel`(io[, sheet_name, header, names, …])	Read an Excel file into a `pandas` `DataFrame`.
`DataFrame.to_excel`(excel_writer, *[, …])	Write object to an Excel sheet.
`ExcelFile`(path_or_buffer[, engine, …])	Class for parsing tabular Excel sheets into DataFrame objects.
`ExcelFile.book`
`ExcelFile.sheet_names`
`ExcelFile.parse`([sheet_name, header, names, …])	Parse specified sheet(s) into a DataFrame.

Method	Description
`Styler.to_excel`(excel_writer[, sheet_name, …])	Write Styler to an Excel sheet.

Method	Description
`ExcelWriter`(path[, engine, date_format, …])	Class for writing DataFrame objects into excel sheets.

JSON

Method	Description
`read_json`(path_or_buf, *[, orient, typ, …])	Convert a JSON string to pandas object.
`json_normalize`(data[, record_path, meta, …])	Normalize semi-structured JSON data into a flat table.
`DataFrame.to_json`([path_or_buf, orient, …])	Convert the object to a JSON string.

Method	Description
`build_table_schema`(data[, index, …])	Create a Table schema from `data`.
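
To make the JSON functions more concrete, here is a minimal round-trip sketch: a DataFrame is serialized with `DataFrame.to_json` and read back with `read_json`. The sample data is invented for illustration. `orient="records"` and `force_ascii=False` are the same options the export function later in this post uses.

```python
import io

import pandas as pd

# Hypothetical sample data containing non-ASCII text.
df = pd.DataFrame({"name": ["张三", "Alice"], "age": [30, 25]})

# orient="records" emits a list with one JSON object per row;
# force_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
json_text = df.to_json(orient="records", force_ascii=False)

# read_json accepts a file-like object, so wrap the string in StringIO.
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```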

XML

Method	Description
`read_xml`(path_or_buffer, *[, xpath, …])	Read XML document into a `DataFrame` object.
`DataFrame.to_xml`([path_or_buffer, index, …])	Render a DataFrame to an XML document.

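Example 2: converting between common formats

Using the IO methods for the common formats, we can build a converter between them.

Step 1: read the data from a file in the source format into a DataFrame object.

Step 2: write the data in the DataFrame to a file in the target format.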

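Requirements

Convert automatically based on the extensions of the input and output files, and print a message if a format is not supported.
Supported formats: csv, xlsx, json, xml.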
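
One way to implement the extension check is sketched below using only the standard library. `SUPPORTED_FORMATS` and `detect_format` are hypothetical names introduced here, not part of pandas or of the original code. Unlike a plain `endswith` test, this version treats extensions case-insensitively.

```python
from pathlib import Path

# Formats named in the requirements above.
SUPPORTED_FORMATS = {"csv", "xlsx", "json", "xml"}


def detect_format(file_name):
    """Return the lower-cased extension if it is supported, else None."""
    suffix = Path(file_name).suffix.lower().lstrip(".")
    return suffix if suffix in SUPPORTED_FORMATS else None


print(detect_format("data.CSV"))
print(detect_format("notes.txt"))
```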

Dependencies

pip install pandas
pip install openpyxl
pip install lxml

The export function

    import os

    import pandas as pd


    def export(input_file, output_file):
        if not os.path.isfile(input_file):
            print('Input file does not exist')
            return

        # Step 1: read the input file into a DataFrame
        if input_file.endswith('.csv'):
            df = pd.read_csv(input_file, encoding='utf-8')
        elif input_file.endswith('.json'):
            df = pd.read_json(input_file, encoding='utf-8')
        elif input_file.endswith('.xlsx'):
            df = pd.read_excel(input_file)
        elif input_file.endswith('.xml'):
            df = pd.read_xml(input_file, encoding='utf-8')
        else:
            print('Input file type not supported')
            return

        # Step 2: write the DataFrame in the output format
        if output_file.endswith('.csv'):
            df.to_csv(output_file, index=False)
        elif output_file.endswith('.json'):
            df.to_json(output_file, orient='records', force_ascii=False)
        elif output_file.endswith('.xlsx'):
            df.to_excel(output_file, index=False)
        elif output_file.endswith('.xml'):
            df.to_xml(output_file, index=False)
        elif output_file.endswith('.html'):
            df.to_html(output_file, index=False, encoding='utf-8')
        else:
            print('Output file type not supported')
            return
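
The if/elif chain in `export` can also be written as a pair of lookup tables keyed by file extension, which makes adding a format a one-line change. The following is a sketch of that alternative, not the code used in this post. `READERS`, `WRITERS` and `convert` are hypothetical names, and the demo at the end exercises only the CSV-to-JSON path, so it needs neither openpyxl nor lxml.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Extension -> reader. Each reader takes a path and returns a DataFrame.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,  # requires openpyxl when called
    ".xml": pd.read_xml,     # requires lxml when called
}

# Extension -> writer. Each writer takes (DataFrame, path).
WRITERS = {
    ".csv": lambda df, path: df.to_csv(path, index=False),
    ".json": lambda df, path: df.to_json(path, orient="records", force_ascii=False),
    ".xlsx": lambda df, path: df.to_excel(path, index=False),
    ".xml": lambda df, path: df.to_xml(path, index=False),
}


def convert(input_file, output_file):
    """Convert input_file to output_file; return True on success."""
    reader = READERS.get(Path(input_file).suffix.lower())
    writer = WRITERS.get(Path(output_file).suffix.lower())
    if reader is None or writer is None:
        print("File type not supported")
        return False
    writer(reader(input_file), output_file)
    return True


# Demo with a throwaway CSV file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "people.csv"
    dst = Path(tmp) / "people.json"
    src.write_text("name,age\nAlice,30\n", encoding="utf-8")
    ok = convert(str(src), str(dst))
    records = json.loads(dst.read_text(encoding="utf-8"))

print(ok, records)
```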

The main function

    import getopt
    import sys


    def main(argv):
        usage = 'usage: export.py -i <inputpath> -o <outputpath>'
        input_path = None
        output_path = None
        try:
            shortopts = "hi:o:"
            # "help" must be listed here for --help to be accepted
            longopts = ["help", "ipath=", "opath="]
            opts, args = getopt.getopt(argv, shortopts, longopts)
        except getopt.GetoptError:
            print(usage)
            sys.exit(2)
        for opt, arg in opts:
            if opt in ("-h", "--help"):
                print(usage)
                sys.exit()
            elif opt in ("-i", "--ipath"):
                input_path = arg
            elif opt in ("-o", "--opath"):
                output_path = arg
        # Both paths are required
        if not input_path or not output_path:
            print(usage)
            sys.exit(2)
        print(f'Input path: {input_path}')
        print(f'Output path: {output_path}')
        export(input_path, output_path)


    if __name__ == '__main__':
        main(sys.argv[1:])
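
As an alternative to `getopt`, the standard library's `argparse` module generates the usage message and the `-h`/`--help` option automatically and can mark options as required. This is a different technique from the one used in this post. The sketch below keeps the same option names, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser with the same -i/-o options as the getopt version."""
    parser = argparse.ArgumentParser(
        prog="export.py",
        description="Convert a file between formats based on its extension.",
    )
    parser.add_argument("-i", "--ipath", required=True, help="input file path")
    parser.add_argument("-o", "--opath", required=True, help="output file path")
    return parser


# Parse an explicit argument list; pass no argument to read sys.argv instead.
args = build_parser().parse_args(["-i", "data.csv", "-o", "data.json"])
print(args.ipath, args.opath)
```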

Appendix: methods for other formats

The tables below are taken from the Input/output section of the official pandas documentation.

HTML

Method	Description
`read_html`(io, *[, match, flavor, header, …])	Read HTML tables into a `list` of `DataFrame` objects.
`DataFrame.to_html`([buf, columns, col_space, …])	Render a DataFrame as an HTML table.

Method	Description
`Styler.to_html`([buf, table_uuid, …])	Write Styler to a file, buffer or string in HTML-CSS format.

Pickling

Method	Description
`read_pickle`(filepath_or_buffer[, …])	Load pickled pandas object (or any object) from file.
`DataFrame.to_pickle`(path, *[, compression, …])	Pickle (serialize) object to file.
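
A minimal pickle round trip looks like the sketch below. The file name and sample data are invented for illustration. Unlike CSV or JSON, a pickle preserves dtypes and the index exactly, but it should only be loaded from trusted sources, because unpickling can execute arbitrary code.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame.pkl"
    df.to_pickle(path)               # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back

print(restored.equals(df))
```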

Clipboard

Method	Description
`read_clipboard`([sep, dtype_backend])	Read text from clipboard and pass to `read_csv()`.
`DataFrame.to_clipboard`(*[, excel, sep])	Copy object to the system clipboard.

Latex

Method	Description
`DataFrame.to_latex`([buf, columns, header, …])	Render object to a LaTeX tabular, longtable, or nested table.

Method	Description
`Styler.to_latex`([buf, column_format, …])	Write Styler to a file, buffer or string in LaTeX format.

HDFStore: PyTables (HDF5)

Method	Description
`read_hdf`(path_or_buf[, key, mode, errors, …])	Read from the store, close it if we opened it.
`HDFStore.put`(key, value[, format, index, …])	Store object in HDFStore.
`HDFStore.append`(key, value[, format, axes, …])	Append to Table in file.
`HDFStore.get`(key)	Retrieve pandas object stored in file.
`HDFStore.select`(key[, where, start, stop, …])	Retrieve pandas object stored in file, optionally based on where criteria.
`HDFStore.info`()	Print detailed information on the store.
`HDFStore.keys`([include])	Return a list of keys corresponding to objects stored in HDFStore.
`HDFStore.groups`()	Return a list of all the top-level nodes.
`HDFStore.walk`([where])	Walk the pytables group hierarchy for pandas objects.

Warning

One can store a subclass of DataFrame or Series to HDF5, but the type of the subclass is lost upon storing.

Feather

Method	Description
`read_feather`(path[, columns, use_threads, …])	Load a feather-format object from the file path.
`DataFrame.to_feather`(path, **kwargs)	Write a DataFrame to the binary Feather format.

Parquet

Method	Description
`read_parquet`(path[, engine, columns, …])	Load a parquet object from the file path, returning a DataFrame.
`DataFrame.to_parquet`([path, engine, …])	Write a DataFrame to the binary parquet format.

ORC

Method	Description
`read_orc`(path[, columns, dtype_backend, …])	Load an ORC object from the file path, returning a DataFrame.
`DataFrame.to_orc`([path, engine, index, …])	Write a DataFrame to the ORC format.

SAS

Method	Description
`read_sas`(filepath_or_buffer, *[, format, …])	Read SAS files stored as either XPORT or SAS7BDAT format files.

SPSS

Method	Description
`read_spss`(path[, usecols, …])	Load an SPSS file from the file path, returning a DataFrame.

SQL

Method	Description
`read_sql_table`(table_name, con[, schema, …])	Read SQL database table into a DataFrame.
`read_sql_query`(sql, con[, index_col, …])	Read SQL query into a DataFrame.
`read_sql`(sql, con[, index_col, …])	Read SQL query or database table into a DataFrame.
`DataFrame.to_sql`(name, con, *[, schema, …])	Write records stored in a DataFrame to a SQL database.
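
The SQL functions can be tried without a database server, since pandas accepts a standard-library `sqlite3` connection directly. The sketch below writes a DataFrame to an in-memory SQLite database with `DataFrame.to_sql` and queries it back with `read_sql_query`. The table name `people` and the sample data are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# An in-memory SQLite database that disappears when the connection closes.
con = sqlite3.connect(":memory:")
try:
    # index=False skips writing the DataFrame index as an extra column.
    df.to_sql("people", con, index=False)
    older = pd.read_sql_query("SELECT name, age FROM people WHERE age > 26", con)
finally:
    con.close()

print(older)
```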

Google BigQuery

Method	Description
`read_gbq`(query[, project_id, index_col, …])	(DEPRECATED) Load data from Google BigQuery.

STATA

Method	Description
`read_stata`(filepath_or_buffer, *[, …])	Read Stata file into DataFrame.
`DataFrame.to_stata`(path, *[, convert_dates, …])	Export DataFrame object to Stata dta format.

Method	Description
`StataReader.data_label`	Return data label of Stata file.
`StataReader.value_labels`()	Return a nested dict associating each variable name to its value and label.
`StataReader.variable_labels`()	Return a dict associating each variable name with corresponding label.
`StataWriter.write_file`()	Export DataFrame object to Stata dta format.