Parquet
Apache Parquet is a column-oriented binary storage format optimized for analytical workloads. Originally developed within the Apache Hadoop ecosystem, Parquet provides efficient compression and encoding schemes for large-scale data processing.
Binary Layout
Section | Internal Name | Description | Possible Values / Format |
---|---|---|---|
File Header | magic | 4-byte magic number identifying Parquet files | ASCII: PAR1 (hex: 50 41 52 31) |
Row Group | row_group_metadata | Metadata for each row group (serialized as part of the footer metadata) | Contains column chunk metadata and statistics |
Row Group | column_chunk | Data for each column in the row group | Compressed and encoded column data |
File Footer | metadata | File-level metadata including schema and row groups | Thrift-encoded metadata structure |
File Footer | metadata_length | Length of metadata section | 4-byte little-endian integer |
File Footer | magic | Footer magic number | ASCII: PAR1 (hex: 50 41 52 31) |
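The layout in the table can be checked with a few lines of Python. The following is a minimal sketch that works on an in-memory byte string rather than a real file. The function name `read_footer_length` and the synthetic `sample` buffer are illustrative assumptions, and the placeholder bytes stand in for real Thrift-encoded metadata, which this sketch does not parse.

```python
import struct

MAGIC = b"PAR1"  # hex: 50 41 52 31


def read_footer_length(data: bytes) -> int:
    """Check both magic numbers and return the metadata length from the footer."""
    # Smallest possible layout: header magic + length field + footer magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC:
        raise ValueError("missing header magic")
    if data[-4:] != MAGIC:
        raise ValueError("missing footer magic")
    # The 4 bytes before the trailing magic hold the metadata length
    # as a little-endian integer (read here as unsigned, an assumption).
    (metadata_length,) = struct.unpack("<I", data[-8:-4])
    if metadata_length > len(data) - 12:
        raise ValueError("metadata length exceeds file size")
    return metadata_length


# Synthetic buffer (not a valid Parquet file): placeholder bytes stand in
# for the row groups and for the Thrift-encoded metadata.
fake_metadata = b"\x00" * 5
sample = (
    MAGIC
    + b"row-group-bytes"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

footer_length = read_footer_length(sample)
# The metadata ends right before the 8 trailing bytes (length + magic).
metadata_start = len(sample) - 8 - footer_length
```

Reading from the end of the file is the point of this layout: a reader can locate the metadata with one seek to the last 8 bytes, without scanning the row groups first.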
Column Storage Example
Row-based storage (traditional):
id,name,last_name,age
1,John,Buck,35
2,Jane,Doe,27
3,Joe,Dane,42
Column-based storage (Parquet):
id: [1, 2, 3]
name: [John, Jane, Joe]
last_name: [Buck, Doe, Dane]
age: [35, 27, 42]
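The pivot from rows to columns shown above can be sketched in Python as follows. This only illustrates the logical regrouping of values; it is not how a Parquet writer is implemented, and the helper name `rows_to_columns` is an assumption made for this example.

```python
import csv
import io

ROWS = """id,name,last_name,age
1,John,Buck,35
2,Jane,Doe,27
3,Joe,Dane,42
"""


def rows_to_columns(text):
    """Regroup row-oriented CSV text into one list of values per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


columns = rows_to_columns(ROWS)

# A query that needs only one column touches only that list;
# the other columns are never read.
ages = [int(value) for value in columns["age"]]
average_age = sum(ages) / len(ages)
```

Storing each column contiguously also places similar values next to each other, which is what makes the encodings and compression codecs described in this article effective.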
Encoding Types
Encoding | Internal Name | Description | Use Case |
---|---|---|---|
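Plain | PLAIN | Values stored back to back with no transformation | Small datasets or unsorted data |
Dictionary | PLAIN_DICTIONARY | Values replaced with indices into a dictionary of distinct values (named RLE_DICTIONARY in newer format versions) | Repeated string values |
Run Length | RLE | Runs of consecutive identical values stored as a value and a repeat count (combined with bit packing in a hybrid scheme) | Sparse or repetitive data |
Bit Packing | BIT_PACKED | Values packed using the minimum required number of bits (deprecated in favor of the RLE hybrid) | Boolean or small integer ranges |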
Delta | DELTA_BINARY_PACKED | Store differences between consecutive values | Sorted numerical data |
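The ideas behind the encodings in the table can be sketched in plain Python. These functions are simplified illustrations that operate on lists; they do not reproduce Parquet's actual byte-level formats, and all function names are assumptions made for this example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse runs of consecutive identical values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value and the differences between consecutive values."""
    if not values:
        return None, []
    return values[0], [b - a for a, b in zip(values, values[1:])]


def delta_decode(first, deltas):
    """Rebuild the original values from the first value and the deltas."""
    out = [first]
    for delta in deltas:
        out.append(out[-1] + delta)
    return out


def min_bit_width(values):
    """Bits needed per value when bit packing non-negative integers."""
    return max(values).bit_length() if values else 0


dictionary, indices = dictionary_encode(["Doe", "Buck", "Doe", "Doe"])
runs = run_length_encode([0, 0, 0, 1, 1, 0])
first, deltas = delta_encode([100, 101, 103, 106])
width = min_bit_width(indices)
```

With this input the dictionary holds two entries, so each index fits in a single bit, which shows why dictionary indices pair well with run-length encoding and bit packing.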
Compression Codecs
Codec | Description | Best For |
---|---|---|
UNCOMPRESSED | No compression applied | Testing or very small files |
SNAPPY | Fast compression/decompression | General-purpose, balanced performance |
GZIP | Higher compression ratio | Storage-constrained environments |
LZO | Fast decompression | Read-heavy workloads |
BROTLI | Modern compression algorithm | High compression ratio needs |
LZ4 | Extremely fast compression | Low-latency applications |
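The effect of a codec on repetitive column data can be tried with Python's standard library, which includes a `gzip` module; the other codecs in the table generally require third-party packages. The following is a rough sketch on a synthetic column, not a benchmark, and the helper name `gzip_sizes` is an assumption made for this example.

```python
import gzip


def gzip_sizes(data: bytes, level: int):
    """Return (original size, compressed size) for one gzip compression level."""
    compressed = gzip.compress(data, compresslevel=level)
    # Round-trip check: decompression must restore the exact input bytes.
    if gzip.decompress(compressed) != data:
        raise ValueError("round trip failed")
    return len(data), len(compressed)


# Synthetic column: one last name repeated 1000 times, newline-separated.
column = "\n".join(["Doe"] * 1000).encode("utf-8")

original_size, fast_size = gzip_sizes(column, level=1)
_, small_size = gzip_sizes(column, level=9)
```

Lower levels favor speed and higher levels favor size, which mirrors the trade-off the table describes between fast codecs such as SNAPPY or LZ4 and higher-ratio codecs such as GZIP or BROTLI.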