Life Is Short, I Use Python: pandas File Format Conversion
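Foreword
pandas supports many file formats, and its IO methods make it possible to convert between them. Using an Excel/CSV round-trip example and the list of file formats pandas supports, this article builds a simple file format conversion tool.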
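Example 1: converting between Excel and CSV
A previous article implemented Excel-to-CSV conversion, that is, using pandas to convert an Excel file to CSV. The reverse, CSV to Excel, works as well.
Below is example code for converting between Excel and CSV: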
```python
import os
import pathlib

import pandas as pd


def export_csv(input_file, output_path):
    # Write each worksheet to its own CSV file, named "<position>-<sheet name>.csv".
    with pd.ExcelFile(input_file) as xls:
        for i, sheet_name in enumerate(xls.sheet_names):
            df = pd.read_excel(xls, sheet_name=sheet_name)
            output_file = os.path.join(output_path, f'{i + 1}-{sheet_name}.csv')
            df.to_csv(output_file, index=False)


def export_excel(input_file, output_file):
    # If no output file is given, write next to the input file with an .xlsx suffix.
    if not output_file:
        input_path = pathlib.Path(input_file)
        output_path = input_path.parent / (input_path.stem + '.xlsx')
        output_file = str(output_path)
    df = pd.read_csv(input_file)
    df.to_excel(output_file, index=False)
```
Methods for common formats
The following is taken from the Input/output section of the official pandas documentation.
Flat file

| Method | Description |
| --- | --- |
| read_table(filepath_or_buffer, *[, sep, …]) | Read general delimited file into DataFrame. |
| read_csv(filepath_or_buffer, *[, sep, …]) | Read a comma-separated values (csv) file into DataFrame. |
| DataFrame.to_csv([path_or_buf, sep, na_rep, …]) | Write object to a comma-separated values (csv) file. |
| read_fwf(filepath_or_buffer, *[, colspecs, …]) | Read a table of fixed-width formatted lines into DataFrame. |

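As a rough illustration of the flat-file methods listed above, the sketch below parses the same two rows once as CSV and once as fixed-width text, then writes them back out tab-separated. The sample data, column names, and `colspecs` ranges are made up for this example, and `io.StringIO` stands in for real files.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
csv_text = "name,score\nalice,90\nbob,85\n"
fwf_text = "alice 90\nbob   85\n"

# read_csv parses comma-separated text into a DataFrame.
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_fwf parses fixed-width columns; each colspecs entry is a
# half-open [start, end) range of character positions.
df_fwf = pd.read_fwf(
    io.StringIO(fwf_text),
    colspecs=[(0, 5), (6, 8)],
    header=None,
    names=["name", "score"],
)

# to_csv without a path returns the text instead of writing a file;
# sep="\t" produces tab-separated output.
tsv_text = df_csv.to_csv(sep="\t", index=False)
```

Both readers should yield the same two columns here; only the way the source text is split into fields differs.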
Excel

| Method | Description |
| --- | --- |
| read_excel(io[, sheet_name, header, names, …]) | Read an Excel file into a pandas DataFrame. |
| DataFrame.to_excel(excel_writer, *[, …]) | Write object to an Excel sheet. |
| ExcelFile(path_or_buffer[, engine, …]) | Class for parsing tabular Excel sheets into DataFrame objects. |
| ExcelFile.book | |
| ExcelFile.sheet_names | |
| ExcelFile.parse([sheet_name, header, names, …]) | Parse specified sheet(s) into a DataFrame. |

| Method | Description |
| --- | --- |
| Styler.to_excel(excel_writer[, sheet_name, …]) | Write Styler to an Excel sheet. |

| Method | Description |
| --- | --- |
| ExcelWriter(path[, engine, date_format, …]) | Class for writing DataFrame objects into excel sheets. |

JSON

| Method | Description |
| --- | --- |
| read_json(path_or_buf, *[, orient, typ, …]) | Convert a JSON string to pandas object. |
| json_normalize(data[, record_path, meta, …]) | Normalize semi-structured JSON data into a flat table. |
| DataFrame.to_json([path_or_buf, orient, …]) | Convert the object to a JSON string. |

| Method | Description |
| --- | --- |
| build_table_schema(data[, index, …]) | Create a Table schema from data. |

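To make the JSON methods above more concrete, here is a minimal sketch that flattens nested records with `json_normalize`, serializes the result with `to_json`, and reads it back with `read_json`. The records and their field names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical nested records, used only for illustration.
records = [
    {"id": 1, "user": {"name": "alice", "city": "Paris"}},
    {"id": 2, "user": {"name": "bob", "city": "Berlin"}},
]

# json_normalize flattens nested dicts into dotted column names.
flat = pd.json_normalize(records)

# orient="records" writes one JSON object per row;
# force_ascii=False keeps non-ASCII text readable in the output.
json_text = flat.to_json(orient="records", force_ascii=False)

# read_json accepts a file-like object, so the string is wrapped in StringIO.
restored = pd.read_json(io.StringIO(json_text), orient="records")
```

The `orient` value matters when round-tripping: reading with the same orientation that was used for writing avoids surprises in how rows and columns are reconstructed.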
XML

| Method | Description |
| --- | --- |
| read_xml(path_or_buffer, *[, xpath, …]) | Read XML document into a DataFrame object. |
| DataFrame.to_xml([path_or_buffer, index, …]) | Render a DataFrame to an XML document. |

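The XML methods above can be tried with a small round trip. The sketch below is only an illustration: the file name, `root_name`, and `row_name` values are made up, and it passes `parser="etree"` so that the standard library's ElementTree is used and lxml is not needed for this particular example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    xml_path = os.path.join(tmp_dir, "scores.xml")

    # Each DataFrame row becomes a <row> element under the <data> root.
    df.to_xml(xml_path, index=False, root_name="data", row_name="row", parser="etree")

    # By default read_xml treats each child of the root element as one row.
    restored = pd.read_xml(xml_path, parser="etree")
```

With the default `parser="lxml"`, the same calls rely on the lxml package listed in the dependencies of Example 2.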
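Example 2: converting between common formats
Using the IO methods for the common formats, build a conversion tool for them.
Step 1: read data from a file in the specified format and convert it to a DataFrame object.
Step 2: write the data in the DataFrame to a file in the specified format.
Brief requirements
Choose the conversion automatically from the suffixes of the input and output file names, and print a message if a format is not supported. Supported formats: csv, xlsx, json, xml.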
Dependencies
pip install pandas
pip install openpyxl
pip install lxml
The export method
```python
import os

import pandas as pd


def export(input_file, output_file):
    if not os.path.isfile(input_file):
        print('Input file does not exist')
        return

    # Step 1: read the input file into a DataFrame, chosen by file suffix.
    if input_file.endswith('.csv'):
        df = pd.read_csv(input_file, encoding='utf-8')
    elif input_file.endswith('.json'):
        df = pd.read_json(input_file, encoding='utf-8')
    elif input_file.endswith('.xlsx'):
        df = pd.read_excel(input_file)
    elif input_file.endswith('.xml'):
        # encoding is an argument of read_xml; str.endswith() does not accept it.
        df = pd.read_xml(input_file, encoding='utf-8')
    else:
        print('Input file type not supported')
        return

    # Step 2: write the DataFrame in the format given by the output suffix.
    if output_file.endswith('.csv'):
        df.to_csv(output_file, index=False)
    elif output_file.endswith('.json'):
        df.to_json(output_file, orient='records', force_ascii=False)
    elif output_file.endswith('.xlsx'):
        df.to_excel(output_file, index=False)
    elif output_file.endswith('.xml'):
        df.to_xml(output_file, index=False)
    elif output_file.endswith('.html'):
        df.to_html(output_file, index=False, encoding='utf-8')
    else:
        print('Output file type not supported')
        return
```
The main method
```python
import getopt
import sys


def main(argv):
    usage = 'usage: export.py -i <inputpath> -o <outputpath>'
    input_path = None
    output_path = None
    try:
        shortopts = "hi:o:"
        # "help" is listed so that --help is accepted alongside -h.
        longopts = ["help", "ipath=", "opath="]
        opts, args = getopt.getopt(argv, shortopts, longopts)
    except getopt.GetoptError:
        print(usage)
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            print(usage)
            sys.exit()
        elif opt in ("-i", "--ipath"):
            input_path = arg
        elif opt in ("-o", "--opath"):
            output_path = arg
    # Both paths are required; without them export() has nothing to work on.
    if not input_path or not output_path:
        print(usage)
        sys.exit(2)
    print(f'Input path: {input_path}')
    print(f'Output path: {output_path}')
    export(input_path, output_path)


if __name__ == "__main__":
    main(sys.argv[1:])
```
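The if/elif chains in export grow by one branch per format. One possible alternative, sketched here and not the approach used above, is to keep the suffix-to-function mapping in two dictionaries so that supporting a new format means adding one entry. The names `READERS`, `WRITERS`, and `convert` are hypothetical, and this version raises `ValueError` instead of printing a message.

```python
import os
import tempfile

import pandas as pd

# Hypothetical lookup tables mapping a file suffix to a reader or writer.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
    ".xml": pd.read_xml,
}
WRITERS = {
    ".csv": lambda df, path: df.to_csv(path, index=False),
    ".json": lambda df, path: df.to_json(path, orient="records", force_ascii=False),
    ".xlsx": lambda df, path: df.to_excel(path, index=False),
    ".xml": lambda df, path: df.to_xml(path, index=False),
}


def convert(input_file, output_file):
    """Read input_file into a DataFrame and write it in output_file's format."""
    in_ext = os.path.splitext(input_file)[1].lower()
    out_ext = os.path.splitext(output_file)[1].lower()
    if in_ext not in READERS:
        raise ValueError(f"Input file type not supported: {in_ext!r}")
    if out_ext not in WRITERS:
        raise ValueError(f"Output file type not supported: {out_ext!r}")
    df = READERS[in_ext](input_file)
    WRITERS[out_ext](df, output_file)


# Usage: convert a small CSV file to JSON inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "people.csv")
    json_path = os.path.join(tmp_dir, "people.json")
    pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]}).to_csv(csv_path, index=False)
    convert(csv_path, json_path)
    result = pd.read_json(json_path, orient="records")
```

Using `os.path.splitext` with `.lower()` also makes the suffix check case-insensitive, which the `endswith` checks are not.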
Appendix: methods for other formats
The following is taken from the Input/output section of the official pandas documentation.
HTML

| Method | Description |
| --- | --- |
| read_html(io, *[, match, flavor, header, …]) | Read HTML tables into a list of DataFrame objects. |
| DataFrame.to_html([buf, columns, col_space, …]) | Render a DataFrame as an HTML table. |

| Method | Description |
| --- | --- |
| Styler.to_html([buf, table_uuid, …]) | Write Styler to a file, buffer or string in HTML-CSS format. |

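As a small illustration of `DataFrame.to_html` from the table above, the sketch below renders a made-up two-row frame to an HTML string rather than to a file.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# With no buf argument, to_html returns the markup as a string
# instead of writing it to a file.
html_text = df.to_html(index=False)
```

Going the other way, `read_html` returns a list of DataFrames, one per table found, and needs an HTML parser such as lxml to be installed.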
Pickling

| Method | Description |
| --- | --- |
| read_pickle(filepath_or_buffer[, …]) | Load pickled pandas object (or any object) from file. |
| DataFrame.to_pickle(path, *[, compression, …]) | Pickle (serialize) object to file. |

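Unlike text formats such as CSV, a pickle keeps the index and column dtypes exactly as they were. The sketch below shows a round trip through a temporary file; the data and file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"when": pd.to_datetime(["2024-01-01", "2024-01-02"]), "score": [1.5, 2.5]},
    index=["a", "b"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    pkl_path = os.path.join(tmp_dir, "frame.pkl")
    df.to_pickle(pkl_path)
    # The datetime column and the string index come back unchanged.
    restored = pd.read_pickle(pkl_path)
```

Loading a pickle can execute arbitrary code, so `read_pickle` is best reserved for files from a trusted source.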
Clipboard

| Method | Description |
| --- | --- |
| read_clipboard([sep, dtype_backend]) | Read text from clipboard and pass to read_csv(). |
| DataFrame.to_clipboard(*[, excel, sep]) | Copy object to the system clipboard. |

Latex

| Method | Description |
| --- | --- |
| DataFrame.to_latex([buf, columns, header, …]) | Render object to a LaTeX tabular, longtable, or nested table. |

| Method | Description |
| --- | --- |
| Styler.to_latex([buf, column_format, …]) | Write Styler to a file, buffer or string in LaTeX format. |

HDFStore: PyTables (HDF5)

| Method | Description |
| --- | --- |
| read_hdf(path_or_buf[, key, mode, errors, …]) | Read from the store, close it if we opened it. |
| HDFStore.put(key, value[, format, index, …]) | Store object in HDFStore. |
| HDFStore.append(key, value[, format, axes, …]) | Append to Table in file. |
| HDFStore.get(key) | Retrieve pandas object stored in file. |
| HDFStore.select(key[, where, start, stop, …]) | Retrieve pandas object stored in file, optionally based on where criteria. |
| HDFStore.info() | Print detailed information on the store. |
| HDFStore.keys([include]) | Return a list of keys corresponding to objects stored in HDFStore. |
| HDFStore.groups() | Return a list of all the top-level nodes. |
| HDFStore.walk([where]) | Walk the pytables group hierarchy for pandas objects. |

Warning
One can store a subclass of DataFrame or Series to HDF5, but the type of the subclass is lost upon storing.
Feather

| Method | Description |
| --- | --- |
| read_feather(path[, columns, use_threads, …]) | Load a feather-format object from the file path. |
| DataFrame.to_feather(path, **kwargs) | Write a DataFrame to the binary Feather format. |

Parquet

| Method | Description |
| --- | --- |
| read_parquet(path[, engine, columns, …]) | Load a parquet object from the file path, returning a DataFrame. |
| DataFrame.to_parquet([path, engine, …]) | Write a DataFrame to the binary parquet format. |

ORC

| Method | Description |
| --- | --- |
| read_orc(path[, columns, dtype_backend, …]) | Load an ORC object from the file path, returning a DataFrame. |
| DataFrame.to_orc([path, engine, index, …]) | Write a DataFrame to the ORC format. |

SAS

| Method | Description |
| --- | --- |
| read_sas(filepath_or_buffer, *[, format, …]) | Read SAS files stored as either XPORT or SAS7BDAT format files. |

SPSS

| Method | Description |
| --- | --- |
| read_spss(path[, usecols, …]) | Load an SPSS file from the file path, returning a DataFrame. |

SQL

| Method | Description |
| --- | --- |
| read_sql_table(table_name, con[, schema, …]) | Read SQL database table into a DataFrame. |
| read_sql_query(sql, con[, index_col, …]) | Read SQL query into a DataFrame. |
| read_sql(sql, con[, index_col, …]) | Read SQL query or database table into a DataFrame. |
| DataFrame.to_sql(name, con, *[, schema, …]) | Write records stored in a DataFrame to a SQL database. |

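The SQL methods above can be tried without a database server by using SQLite from the Python standard library. In this sketch the table name `scores` and the query are made up for illustration, and the database lives only in memory.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# An in-memory SQLite database; nothing is written to disk.
con = sqlite3.connect(":memory:")
try:
    # Write the DataFrame to a new table, leaving out the index column.
    df.to_sql("scores", con, index=False)
    # Read back only the rows selected by the query.
    top = pd.read_sql_query("SELECT name, score FROM scores WHERE score > 86", con)
finally:
    con.close()
```

A plain `sqlite3` connection is enough for `to_sql` and `read_sql_query`; `read_sql_table` and other database backends go through SQLAlchemy.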
Google BigQuery

| Method | Description |
| --- | --- |
| read_gbq(query[, project_id, index_col, …]) | (DEPRECATED) Load data from Google BigQuery. |

STATA

| Method | Description |
| --- | --- |
| read_stata(filepath_or_buffer, *[, …]) | Read Stata file into DataFrame. |
| DataFrame.to_stata(path, *[, convert_dates, …]) | Export DataFrame object to Stata dta format. |

| Method | Description |
| --- | --- |
| StataReader.data_label | Return data label of Stata file. |
| StataReader.value_labels() | Return a nested dict associating each variable name to its value and label. |
| StataReader.variable_labels() | Return a dict associating each variable name with corresponding label. |
| StataWriter.write_file() | Export DataFrame object to Stata dta format. |