Version: 1.6.1

Parquet

Apache Parquet is a column-oriented binary storage format optimized for analytical workloads. Originally developed within the Apache Hadoop ecosystem, Parquet provides efficient compression and encoding schemes for large-scale data processing.

Binary Layout

| Section | Internal Name | Description | Possible Values / Format |
| --- | --- | --- | --- |
| File Header | magic | 4-byte magic number identifying Parquet files | ASCII: PAR1 (hex: 50 41 52 31) |
| Row Group | row_group_metadata | Metadata for each row group | Contains column chunk metadata and statistics |
| Row Group | column_chunk | Data for each column in the row group | Compressed and encoded column data |
| File Footer | metadata | File-level metadata including schema and row groups | Thrift-encoded metadata structure |
| File Footer | metadata_length | Length of metadata section | 4-byte little-endian integer |
| File Footer | magic | Footer magic number | ASCII: PAR1 (hex: 50 41 52 31) |
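
As a rough illustration of the layout above, the following Python sketch checks both magic numbers and reads the footer length from the last 8 bytes of a buffer. The function name `locate_footer` and the synthetic sample buffer are assumptions made for this example, and the length is read as an unsigned integer. The metadata bytes here are a placeholder, not real Thrift.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (metadata_offset, metadata_length) for a Parquet-like byte buffer."""
    # Smallest possible file: header magic + 4-byte length + footer magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the footer magic hold the metadata length (little-endian).
    (metadata_length,) = struct.unpack("<I", data[-8:-4])
    metadata_offset = len(data) - 8 - metadata_length
    if metadata_offset < 4:
        raise ValueError("metadata length exceeds file size")
    return metadata_offset, metadata_length


# Synthetic buffer: the metadata bytes are a stand-in, not real Thrift.
fake_metadata = b"\x00" * 10
sample = (
    MAGIC
    + b"column-data"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)
print(locate_footer(sample))  # (15, 10)
```

Reading from the end of the file first is what lets a reader find the schema and row group locations without scanning the column data.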

Column Storage Example

Row-based storage (traditional):

id,name,last_name,age
1,John,Buck,35
2,Jane,Doe,27
3,Joe,Dane,42

Column-based storage (Parquet):

id: [1, 2, 3]
name: [John, Jane, Joe]
last_name: [Buck, Doe, Dane]
age: [35, 27, 42]
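
The pivot from rows to columns shown above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not how a Parquet writer is implemented; `rows_to_columns` is a hypothetical helper name.

```python
import csv
import io

ROWS = """id,name,last_name,age
1,John,Buck,35
2,Jane,Doe,27
3,Joe,Dane,42
"""


def rows_to_columns(text: str) -> dict[str, list[str]]:
    """Pivot CSV rows into one list of values per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns: dict[str, list[str]] = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = rows_to_columns(ROWS)
# Reading a single column touches only that list, not every row.
ages = [int(a) for a in columns["age"]]
print(ages)  # [35, 27, 42]
```

Because each column holds values of one type next to each other, a query that needs only `age` can skip the other three columns entirely.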

Encoding Types

| Encoding | Internal Name | Description | Use Case |
| --- | --- | --- | --- |
| Plain | PLAIN | Values encoded back to back without compression | Default fallback for all data types |
| Dictionary | RLE_DICTIONARY | Values replaced with dictionary indices using RLE | Repeated string values, low-cardinality columns |
| Run Length / Bit-Packing Hybrid | RLE | Combination of bit-packing and run length encoding | Repetition/definition levels, dictionary indices, booleans |
| Delta Binary Packed | DELTA_BINARY_PACKED | Delta encoding with binary packing for integers | INT32, INT64 with sequential or clustered values |
| Delta Length Byte Array | DELTA_LENGTH_BYTE_ARRAY | Delta-encoded lengths followed by concatenated data | Variable-length byte arrays |
| Delta Byte Array | DELTA_BYTE_ARRAY | Incremental/front compression storing prefix lengths | BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY with common prefixes |
| Byte Stream Split | BYTE_STREAM_SPLIT | Scatters bytes to separate streams for better compression | FLOAT, DOUBLE (added in format version 2.8); INT32, INT64, FIXED_LEN_BYTE_ARRAY (added in 2.11) |
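
To make the ideas behind several of these encodings concrete, here is a simplified Python sketch. It shows the transformations only (dictionary indices, run lengths, deltas, byte scattering) and does not reproduce the exact on-disk bit layout, block structure, or headers that Parquet uses. All function names are hypothetical and chosen for this example.

```python
import struct
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices


def run_lengths(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(indices)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours.

    Assumes a non-empty list of integers.
    """
    return values[0], [b - a for a, b in zip(values, values[1:])]


def byte_stream_split(floats):
    """Group byte 0 of every float32, then byte 1, and so on."""
    raw = [struct.pack("<f", x) for x in floats]
    return b"".join(bytes(r[i] for r in raw) for i in range(4))


dictionary, indices = dictionary_encode(["US", "US", "US", "DE", "DE", "US"])
print(dictionary, run_lengths(indices))  # ['US', 'DE'] [(0, 3), (1, 2), (0, 1)]
print(delta_encode([1000, 1001, 1003, 1006]))  # (1000, [1, 2, 3])
print(byte_stream_split([1.0, 2.0]).hex())  # 0000000080003f40
```

In each case the transformed output is more regular than the input (small integers, long runs, or grouped similar bytes), which is what makes the later bit-packing and compression steps effective.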

Deprecated Encodings

| Encoding | Internal Name | Description | Replacement |
| --- | --- | --- | --- |
| Plain Dictionary | PLAIN_DICTIONARY | Legacy dictionary encoding in data pages | Use RLE_DICTIONARY in data pages |
| Bit Packed | BIT_PACKED | Fixed-width bit-packing without padding | Use RLE hybrid encoding |
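
The practical difference between the deprecated BIT_PACKED encoding and the bit-packed runs of the RLE hybrid is the bit order within each byte. The sketch below packs the values 0 through 7 at a bit width of 3 in both orders. It reflects my reading of the Parquet format specification's worked example and covers only the packing step, without the run headers of the hybrid encoding. The function names are hypothetical.

```python
def pack_lsb_first(values, bit_width):
    """Pack values starting at the least significant bit of each byte
    (the order used by bit-packed runs in the RLE hybrid).

    Assumes every value fits in bit_width bits.
    """
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def pack_msb_first(values, bit_width):
    """Pack values starting at the most significant bit of each byte
    (the order used by the deprecated BIT_PACKED encoding).

    Assumes every value fits in bit_width bits.
    """
    acc = 0
    for v in values:
        acc = (acc << bit_width) | v
    total_bits = len(values) * bit_width
    n_bytes = (total_bits + 7) // 8
    # Left-align so any unused bits fall at the end of the last byte.
    acc <<= n_bytes * 8 - total_bits
    return acc.to_bytes(n_bytes, "big")


values = list(range(8))  # 0..7 fit in 3 bits each
print(pack_lsb_first(values, 3).hex())  # 88c6fa
print(pack_msb_first(values, 3).hex())  # 053977
```

Both calls produce 3 bytes for 8 three-bit values; only the arrangement of bits inside those bytes differs.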

Compression Codecs

| Codec | Description | Best For |
| --- | --- | --- |
| UNCOMPRESSED | No compression applied | Testing or very small files |
| SNAPPY | Fast compression/decompression | General-purpose, balanced performance |
| GZIP | Higher compression ratio | Storage-constrained environments |
| LZO | Fast decompression | Read-heavy workloads |
| BROTLI | Modern compression algorithm | High compression ratio needs |
| LZ4 | Extremely fast compression | Low-latency applications |
| ZSTD | Zstandard compression with configurable levels | Workloads that need a balance of speed and compression ratio |
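
The trade-off between speed and compression ratio can be explored with Python's standard library, which includes gzip but not the other codecs listed above (those need third-party packages). The sketch below compresses an illustrative low-cardinality column at three gzip levels. The sample data and the helper name `gzip_sizes` are assumptions for this example, and the sizes it prints say nothing about how a real Parquet writer would perform on real data.

```python
import gzip


def gzip_sizes(data: bytes, levels=(1, 6, 9)) -> dict[int, int]:
    """Return the compressed size in bytes for each gzip level."""
    sizes = {}
    for level in levels:
        compressed = gzip.compress(data, compresslevel=level)
        # Round-trip check: decompression must restore the original bytes.
        if gzip.decompress(compressed) != data:
            raise RuntimeError("round trip failed")
        sizes[level] = len(compressed)
    return sizes


# A low-cardinality column serialized as newline-separated text (illustrative only).
column = "\n".join(["US", "DE", "FR", "US"] * 2500).encode()

sizes = gzip_sizes(column)
for level, size in sizes.items():
    print(f"level={level} raw={len(column)} compressed={size}")
```

Higher levels generally spend more CPU time for a smaller output, which is the same kind of choice the codec and level settings expose when writing Parquet files.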