Look, I know what you're thinking. Another article about file formats? Really? You'd rather be debugging that mysterious production issue or arguing about tabs versus spaces. But hear me out for a minute.
Last week, I was happily hunting through our log data - you know, the usual terabytes of events that compliance keeps asking for - when our Head of Finance dropped by. "Hey, why is our logging bill so high?"
Narrator: And thus began our hero's journey into the world of file formats.
Remember the good old days when we could just dump everything into CSV files? They were simple, readable, and everyone understood them. Then your data grew. And grew. And suddenly, reading that one column you needed meant downloading 500GB of data you didn't. Your Athena queries started costing more than your coffee budget. That's when you know it's time to talk about Apache Parquet.
The "Why Should I Care?" Part
Here's a fun experiment. Take this piece of Python code:
import pandas as pd
import numpy as np
# Create some annoyingly repetitive data
df = pd.DataFrame({
    'user_id': range(1_000_000),
    'event': np.random.choice(['login', 'logout', 'purchase'], 1_000_000),
    'timestamp': pd.date_range('2024-01-01', periods=1_000_000, freq='1min'),
    'session_id': [f'sess_{i % 1000}' for i in range(1_000_000)]
})
# Save it both ways
df.to_csv('regrets.csv')
df.to_parquet('smart_choice.parquet')
Now check the file sizes. Go ahead, I'll wait.

Running this on my machine gave:
$ ls -lh regrets.csv smart_choice.parquet
-rw-r--r--@ 1 user staff 48M Nov 1 19:40 regrets.csv
-rw-r--r--@ 1 user staff 13M Nov 1 19:40 smart_choice.parquet
That's about a 73% reduction in size. "But storage is cheap!" I hear you say. Sure, until you're running queries on this data 50 times a day across hundreds of files, each time downloading the entire file because you just need to check the 'event' column. With the Parquet file format, you'd only download the columns you need (about 1/4 of the data in this case), and it's already compressed efficiently.
Let's do some quick math:
- CSV: 48MB × 50 queries × 30 days = 72GB transferred per month
- Parquet (querying just one specific column): 3.25MB × 50 queries × 30 days = 4.875GB transferred per month
Your cloud provider just got a little less happy about their profit margins.
Understanding Apache Parquet Files
Apache Parquet is a columnar storage format designed for efficient data storage and retrieval in big data workloads. Having worked with petabyte-scale data pipelines for observability data, I can tell you that choosing the right file format can make or break your data engineering architecture.
What Makes Parquet Special?
- Columnar Storage: Instead of storing data by rows like in traditional row-based formats (CSV, Excel), Parquet stores data by columns. This fundamental difference completely transforms how we interact with large datasets.
- Schema-aware: Unlike CSV files that are just plain text, Parquet files include schema information that defines the data types and structure, making data validation more reliable.
- Optimized for OLAP: Parquet was built specifically for analytical queries (OLAP) rather than transaction processing (OLTP), making it ideal for data lakes and data warehouses.
- Open-source Ecosystem: Being part of the Apache ecosystem means Parquet has broad compatibility with frameworks like Apache Spark, Hadoop, and countless programming languages.
Why Choose Parquet?
1. Efficient Compression
Parquet uses column-level compression, which is more efficient than row-based compression. Here's a real-world example comparing different encoding schemes:
import pandas as pd
import numpy as np
import os
import pyarrow as pa
import pyarrow.parquet as pq
def compare_storage_efficiency():
    """
    Compare different compression methods for Parquet files
    using realistic data volumes
    """
    # First create the date range
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
    n_rows = len(dates)  # This will be 365

    # Create a dataset with repetitive values - matching the date length
    df = pd.DataFrame({
        'category': np.random.choice(['A', 'B', 'C'], n_rows),
        'value': np.random.randn(n_rows),
        'timestamp': dates,
        # Add some more columns to make it interesting
        'user_id': [f'user_{i % 100}' for i in range(n_rows)],
        'score': np.random.uniform(0, 100, n_rows)
    })

    # Compare different compression methods
    compressions = ['snappy', 'gzip', 'brotli']
    results = {}

    print("\nOriginal DataFrame shape:", df.shape)
    print("Memory usage:", df.memory_usage(deep=True).sum() / 1024, "KB")

    for comp in compressions:
        output_file = f'data_{comp}.parquet'
        df.to_parquet(output_file, compression=comp)
        size = os.path.getsize(output_file)
        results[comp] = size
        print(f"\n{comp.upper()} compression:")
        print(f"File size: {size / 1024:.2f} KB")

    # Also save as CSV for comparison
    df.to_csv('data.csv')
    results['csv'] = os.path.getsize('data.csv')
    print(f"\nCSV size: {results['csv'] / 1024:.2f} KB")

    # Calculate compression ratios compared to CSV
    csv_size = results['csv']
    print("\nCompression ratios compared to CSV:")
    for method, size in results.items():
        if method != 'csv':
            ratio = csv_size / size
            print(f"{method}: {ratio:.2f}x smaller")

    return results, df

# Run the comparison
compression_results, df = compare_storage_efficiency()

# Print a sample of the data to verify
print("\nSample of the data:")
print(df.head())
When I ran this, I got output like:
Original DataFrame shape: (365, 5)
Memory usage: 46.43 KB
SNAPPY compression:
File size: 10.94 KB
GZIP compression:
File size: 9.41 KB
BROTLI compression:
File size: 9.41 KB
CSV size: 22.27 KB
Compression ratios compared to CSV:
snappy: 2.04x smaller
gzip: 2.37x smaller
brotli: 2.37x smaller
Sample of the data:
category value timestamp user_id score
0 B 0.286035 2023-01-01 user_0 33.123630
1 A 0.767671 2023-01-02 user_1 73.225731
2 A 0.179130 2023-01-03 user_2 13.983122
3 B 1.942285 2023-01-04 user_3 57.281699
4 A 1.841576 2023-01-05 user_4 33.683535
As you can see, even with this relatively small dataset, Parquet with GZIP compression is 2.37x smaller than the CSV version. Now imagine scaling that to terabytes!
2. Schema Evolution
One of the most powerful features of the Parquet file format is schema evolution, which makes it perfect for evolving systems. The format allows you to add new columns to your datasets without breaking compatibility with older data.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def demonstrate_schema_evolution():
    # Original schema
    df1 = pd.DataFrame({
        'id': range(5),
        'value': range(5)
    })

    # Write the first file with the original schema
    pq.write_table(pa.Table.from_pandas(df1), 'evolving_v1.parquet')

    # New schema with an additional column
    df2 = pd.DataFrame({
        'id': range(5, 10),
        'value': range(5, 10),
        'new_column': ['a', 'b', 'c', 'd', 'e']  # New column added!
    })

    # Write a second file with the new schema
    pq.write_table(pa.Table.from_pandas(df2), 'evolving_v2.parquet')

    # Reading back works with both schemas: the tables are unified,
    # and the older rows simply get nulls for the new column
    combined = pa.concat_tables(
        [pq.read_table('evolving_v1.parquet'),
         pq.read_table('evolving_v2.parquet')],
        promote_options='default'
    )
    print(combined.to_pandas())
This is something you simply can't do with CSV files without a lot of manual work or complex data processing pipelines.
3. Predicate Pushdown for Query Performance
When working with large datasets in Amazon S3 or other cloud storage, Parquet allows for predicate pushdown – a fancy term that means "filter before loading." This significantly improves query performance for analytical queries.
# Example using PyArrow to read only specific rows meeting a condition
from datetime import datetime
import pyarrow.parquet as pq

# This only loads the filtered data, not the entire file
filtered_data = pq.read_table(
    'large_dataset.parquet',
    filters=[('timestamp', '>=', datetime(2024, 1, 1)), ('level', '=', 'ERROR')]
)
How Parquet Actually Works
Think of CSV files like a library where every time you want to find something, you have to read every book from cover to cover. Parquet is more like having an index for each topic - you only read what you need.
Here's what makes it special:
1. Columnar Storage Structure
In row-based storage formats like CSV, each row is stored together. In columnar storage like Parquet, each column is stored together:
CSV (Row-based):
id,name,age
1,Alice,25
2,Bob,30
3,Charlie,22
Parquet (Conceptually):
Column 'id': [1, 2, 3]
Column 'name': ['Alice', 'Bob', 'Charlie']
Column 'age': [25, 30, 22]
This structure means when you're only interested in the 'age' column, you don't need to read the entire file – just that specific column.
2. Row Groups and Metadata
Parquet organizes data into row groups with built-in metadata that describes:
- Schema information with data types
- Statistics for each column (min/max values)
- Encoding information
This metadata allows the query engine to skip entire chunks of data that don't match filtering criteria.
3. Complex Data Structures Support
Unlike CSV, Parquet can efficiently store nested data structures like arrays, maps, and structs, making it perfect for complex data:
# Creating a DataFrame with nested data
df = pd.DataFrame({
    'id': range(3),
    'nested': [
        {'a': 1, 'b': [1, 2, 3]},
        {'a': 2, 'b': [4, 5, 6]},
        {'a': 3, 'b': [7, 8, 9]}
    ]
})

# Save to Parquet (works fine!)
df.to_parquet('nested_data.parquet')

# Try with CSV (will not preserve structure)
df.to_csv('nested_data.csv')
When to Use Parquet
✅ Perfect for:
- Analytics data and data warehouses
- Log storage and event streams
- Metrics data
- Data lakes
- Big data processing in Hadoop ecosystems
- Optimizing storage space and query performance
❌ Not great for:
- Small datasets (<100MB) where the overhead isn't worth it
- Frequent small updates (OLTP workloads)
- When you need direct file editing (like spreadsheets)
- When compatibility with legacy systems is critical
Comparing Parquet with Other Data Formats
Format | Type | Strengths | Weaknesses | Best For |
---|---|---|---|---|
Parquet | Columnar | Compression, query perf | Not human-readable | Analytics |
CSV | Row-based | Simplicity, compatibility | Size, no schema | Small datasets |
Avro | Row-based | Schema evolution | Not columnar | Streaming |
ORC | Columnar | Similar to Parquet | Less ecosystem support | Hive workloads |
Best Practices We Learned the Hard Way
1. Don't Over-Partition
- Good: Partition by year, month
- Bad: Partition by year, month, day, hour, service, region
- Why? Too many small files kill S3 performance and can overwhelm file system metadata
I once had a data pipeline that generated one file per minute per service (we had about 50 services). Within a week, we had over 500,000 tiny Parquet files that made listing operations painfully slow. We had to reprocess everything with more reasonable partitioning.
2. Choose Compression Wisely
- Hot data: Snappy (fast compression/decompression)
- Cold data: GZIP or ZSTD (better compression ratio)
- Analytics: Depends on your query patterns
Our team found that for frequently queried log data, the slight increase in storage cost with Snappy was worth the improved query performance compared to GZIP.
3. Row Group Sizes Matter
- Too small: More S3 requests
- Too large: More memory pressure
- Sweet spot: 100MB-200MB for analytics
We tune our row group sizes depending on the expected query patterns. For datasets where we typically scan entire columns, larger row groups work better. For datasets where we apply very selective filters, smaller row groups help with parallelism.
The Plot Twist
Here's the thing about Parquet that no one tells you in the technical docs: it's not just about saving money or making queries faster. It's about changing how you think about data.
Remember that time you had to add a new field to your logging, and everyone panicked about backward compatibility? With Parquet's schema evolution, that's just... not a problem anymore. Need to analyze just one field across five years of data? Go ahead, it won't bankrupt the company.
Real Talk: A Tale of Two Queries
Let me share something that happened last month. We had two identical queries running in production:
-- The query both teams ran:
SELECT date_trunc('hour', timestamp) AS hour,
       COUNT(*) AS errors
FROM logs
WHERE service = 'payment-api'
  AND level = 'ERROR'
GROUP BY 1
ORDER BY 1
Team A (running against CSVs on Amazon S3):
- Query cost: $5.23
- Data scanned: 1.2TB
- Runtime: 3.5 minutes
- Number of angry Slack messages: 7
Team B (running against Parquet on Amazon S3 with Athena):
- Query cost: $0.89
- Data scanned: 157GB
- Runtime: 42 seconds
- Number of coffee breaks taken while waiting: 0
The funny part? Both teams were looking at the exact same data. Team B just stopped reading columns they didn't need and rows they didn't want. Revolutionary, I know.
Practical Tips for Working with Parquet
Reading Parquet with Python
import pandas as pd
# Read an entire Parquet file
df = pd.read_parquet('mydata.parquet')
# Read only specific columns
df = pd.read_parquet('mydata.parquet', columns=['timestamp', 'user_id'])
# Filter data while reading (using PyArrow)
import pyarrow.parquet as pq
table = pq.read_table('mydata.parquet', filters=[('value', '>', 100)])
df = table.to_pandas()
Reading Parquet in SQL Engines
-- Amazon Athena, Presto, Spark SQL
SELECT * FROM parquet_table
WHERE date_column BETWEEN '2024-01-01' AND '2024-01-31'
AND status = 'completed';
Converting from CSV to Parquet
import pandas as pd
# Simple conversion
df = pd.read_csv('legacy_data.csv')
df.to_parquet('optimized_data.parquet')
# Chunked conversion for large files
import os
os.makedirs('output', exist_ok=True)

chunksize = 100_000
for i, chunk in enumerate(pd.read_csv('huge_file.csv', chunksize=chunksize)):
    chunk.to_parquet(f'output/part-{i:05d}.parquet')
Why This Actually Matters
Look, at the end of the day, this isn't just about file formats. It's about:
- Being able to analyze problems without watching the AWS bill climb
- Running queries without planning your coffee break around them
- Adding new data without breaking old code
- Making your Finance team actually like you
In my years of working with data engineering and big data, switching to Parquet for our data lake was one of those rare changes that immediately showed benefits. Query costs dropped by 60%, and data storage costs fell by around 40%. The entire data team suddenly had more budget for the things that actually mattered.
Getting Started
Ready to join the Parquet revolution? Here's how to get started:
- For Python users:
pip install pandas pyarrow
- For Spark users:
# Reading
df = spark.read.parquet("s3://your-bucket/path/to/data")
# Writing
df.write.parquet("s3://your-bucket/path/to/output")
- For AWS Athena/Glue users:
-- Convert your data to Parquet
CREATE TABLE my_parquet_table
WITH (format = 'PARQUET')
AS SELECT * FROM my_csv_table;
Conclusion
If you're still dumping terabytes of data into CSV files, you're probably spending more time and money than you need to. Parquet's columnar storage format offers significant advantages in storage efficiency, query performance, and schema flexibility that are hard to ignore in today's data-driven world.
Give it a try – your cloud bill (and your colleagues waiting for those analytical queries to finish) will thank you.