
Parquet

Apache Parquet is a column-oriented binary storage format optimized for analytical workloads. Originally developed within the Apache Hadoop ecosystem, Parquet provides efficient compression and encoding schemes for large-scale data processing.
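
A minimal write/read round trip, shown here with the pyarrow library (one of several Parquet implementations; the column values and file name are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory table and persist it as a Parquet file.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["John", "Jane", "Joe"],
    "age": [35, 27, 42],
})
pq.write_table(table, "people.parquet")

# Reading it back yields the same schema and data.
print(pq.read_table("people.parquet"))
```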

Binary Layout

| Section | Internal Name | Description | Possible Values / Format |
| --- | --- | --- | --- |
| File Header | magic | 4-byte magic number identifying Parquet files | ASCII: PAR1 (hex: 50 41 52 31) |
| Row Group | row_group_metadata | Metadata for each row group | Contains column chunk metadata and statistics |
| Row Group | column_chunk | Data for each column in the row group | Compressed and encoded column data |
| File Footer | metadata | File-level metadata including schema and row groups | Thrift-encoded metadata structure |
| File Footer | metadata_length | Length of the metadata section | 4-byte little-endian integer |
| File Footer | magic | Footer magic number | ASCII: PAR1 (hex: 50 41 52 31) |
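
This layout can be checked directly against a file's raw bytes. The sketch below writes a small file with pyarrow (the file name is illustrative) and then reads the header magic, footer magic, and metadata_length using the standard struct module:

```python
import struct
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny file so there is something to inspect.
pq.write_table(pa.table({"x": [1, 2, 3]}), "layout_demo.parquet")

with open("layout_demo.parquet", "rb") as f:
    data = f.read()

print(data[:4])    # b'PAR1' -> header magic
print(data[-4:])   # b'PAR1' -> footer magic

# The 4 bytes just before the footer magic hold metadata_length,
# a little-endian unsigned 32-bit integer.
(metadata_length,) = struct.unpack("<I", data[-8:-4])
print(metadata_length)  # size of the Thrift-encoded footer metadata in bytes
```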

Column Storage Example

Row-based storage (traditional):

id,name,last_name,age
1,John,Buck,35
2,Jane,Doe,27
3,Joe,Dane,42

Column-based storage (Parquet):

id: [1, 2, 3]
name: [John, Jane, Joe]
last_name: [Buck, Doe, Dane]
age: [35, 27, 42]
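
Because each column's values are stored contiguously, a reader can fetch only the columns a query touches and skip the rest. A minimal sketch, assuming pyarrow and an illustrative file name:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Persist the example rows from above in columnar form.
pq.write_table(pa.table({
    "id": [1, 2, 3],
    "name": ["John", "Jane", "Joe"],
    "last_name": ["Buck", "Doe", "Dane"],
    "age": [35, 27, 42],
}), "people.parquet")

# Only the 'age' column chunks are read and decoded.
ages = pq.read_table("people.parquet", columns=["age"])
print(ages.column("age").to_pylist())  # [35, 27, 42]
```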

Encoding Types

| Encoding | Internal Name | Description | Use Case |
| --- | --- | --- | --- |
| Plain | PLAIN | No encoding applied | Small datasets or unsorted data |
| Dictionary | PLAIN_DICTIONARY | Values replaced with dictionary indices | Repeated string values |
| Run Length | RLE | Consecutive identical values compressed | Sparse or repetitive data |
| Bit Packing | BIT_PACKED | Values packed using the minimum required bits | Boolean or small integer ranges |
| Delta | DELTA_BINARY_PACKED | Store differences between consecutive values | Sorted numerical data |
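
Writers choose encodings per column chunk, and the choices are recorded in the footer metadata. As a sketch with pyarrow (column names and file name are illustrative, and the exact encoding names reported can vary with the writer version), dictionary encoding can be requested per column and the result inspected afterwards:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "US", "US", "DE", "DE"],  # few distinct values: dictionary candidate
    "reading": [10, 11, 12, 13, 14],            # sorted integers: delta-friendly
})

# Ask the writer to dictionary-encode only the 'country' column.
pq.write_table(table, "encodings_demo.parquet", use_dictionary=["country"])

# The footer metadata records which encodings each column chunk used.
meta = pq.ParquetFile("encodings_demo.parquet").metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, col.encodings)
```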

Compression Codecs

| Codec | Description | Best For |
| --- | --- | --- |
| UNCOMPRESSED | No compression applied | Testing or very small files |
| SNAPPY | Fast compression/decompression | General-purpose, balanced performance |
| GZIP | Higher compression ratio | Storage-constrained environments |
| LZO | Fast decompression | Read-heavy workloads |
| BROTLI | Modern compression algorithm | High compression ratio needs |
| LZ4 | Extremely fast compression | Low-latency applications |
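
The codec is selected at write time and applied per column chunk, so the trade-offs above can be compared by writing the same data with different codecs. A quick size comparison using pyarrow (codec availability depends on how the library was built; file names are illustrative):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive column compresses well, which makes codec differences visible.
table = pa.table({"value": [i % 100 for i in range(200_000)]})

for codec in ["NONE", "SNAPPY", "GZIP"]:
    path = f"demo_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:8} {os.path.getsize(path):>10} bytes")
```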