创建两张表,通过一种是parquet , 一种使用parquet snappy压缩
创建表
使用snappy
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');使用gzip
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='GZIP');使用uncompressed
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='UNCOMPRESSED');使用默认
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET;也可以在执行语句前执行 set parquet.compression=SNAPPY; 会对之后跑的数据进行压缩,之前已经存在的不会进行snappy压缩
通过 desc formatted tableName 查看表结构
使用parquet snappy
Table Type: EXTERNAL_TABLE
Table Parameters: EXTERNAL TRUE numFiles 25 numPartitions 1 numRows 0 parquet.compression SNAPPY rawDataSize 0 totalSize 4570350557 transient_lastDdlTime 1552269085 # Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params: field.delim \u0001 serialization.format \u0001
使用parquet默认
Table Type: EXTERNAL_TABLE
Table Parameters: EXTERNAL TRUE numFiles 25 numPartitions 1 numRows 0 rawDataSize 0 totalSize 4570650197 transient_lastDdlTime 1552269039 # Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params: field.delim \u0001 serialization.format \u0001
测试数据量:20208432
UNCOMPRESSED :4570325699
PARQUET 默认 :4570650197
parquet gzip :4570314033
parquet snappy :4570350557
textfile :10356207038
通过对比发现,当数据量较少时parquet各压缩方式差别不大,但相比TEXTFILE压缩减少了1倍以上,后续再做一下性能对比测试一下。