The Parquet Files: A Surprisingly Entertaining Guide to Columnar Storage
Look, I know what you're thinking. Another article about file formats? Really? You'd rather be debugging that mysterious production issue or arguing about tabs versus spaces. But hear me out for a minute.
Last week, I was happily hunting through our logs data - you know, the usual terabytes of events that compliance keeps asking for - when our Head of Finance dropped by. "Hey, why is our logging bill so high?"
Narrator: And thus began our hero's journey into the world of file formats.
Remember the good old days when we could just dump everything into CSV files? They were simple, readable, and everyone understood them. Then your data grew. And grew. And suddenly, reading that one column you needed meant downloading 500GB of data you didn't. Your Athena queries started costing more than your coffee budget. That's when you know it's time to talk about Parquet files.
The "Why Should I Care?" Part
Here's a fun experiment. Take this piece of code:
Now check the file sizes. Go ahead, I'll wait.
Running this on my machine gave:
$ ls -lh regrets.csv smart_choice.parquet
-rw-r--r--@ 1 user staff 48M Nov 1 19:40 regrets.csv
-rw-r--r--@ 1 user staff 13M Nov 1 19:40 smart_choice.parquet
That's about a 73% reduction in size. "But storage is cheap!" I hear you say. Sure, until you're running queries on this data 50 times a day across hundreds of files, each time downloading the entire file because you just need to check the 'event' column. With Parquet, you'd only download the columns you need (about 1/4 of the data in this case), and it's already compressed efficiently.
Let's do some quick math:
CSV: 48MB × 50 queries × 30 days = 72GB transferred per month
Parquet (querying just one column): 3.25MB × 50 queries × 30 days = 4.875GB transferred per month
Your cloud provider just got a lot less happy about their profit margins.
Understanding Parquet Files
Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval in big data workloads. Having worked with petabyte-scale data pipelines for observability data through the Last9 Control Plane, I can tell you that choosing the right file format can make or break your o11y architecture.
Why Choose Parquet?
1. Efficient Compression
Parquet uses column-level compression, which is more efficient than row-based compression. Here's a real-world example:
import pandas as pd
import numpy as np
import os

def compare_storage_efficiency():
    """
    Compare different compression methods for Parquet files
    using realistic data volumes
    """
    # First create the date range
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
    n_rows = len(dates)  # This will be 365

    # Create a dataset with repetitive values - matching the date length
    df = pd.DataFrame({
        'category': np.random.choice(['A', 'B', 'C'], n_rows),
        'value': np.random.randn(n_rows),
        'timestamp': dates,
        # Add some more columns to make it interesting
        'user_id': [f'user_{i%100}' for i in range(n_rows)],
        'score': np.random.uniform(0, 100, n_rows)
    })

    # Compare different compression methods
    compressions = ['snappy', 'gzip', 'brotli']
    results = {}

    print("\nOriginal DataFrame shape:", df.shape)
    print("Memory usage:", df.memory_usage(deep=True).sum() / 1024, "KB")

    for comp in compressions:
        output_file = f'data_{comp}.parquet'
        df.to_parquet(output_file, compression=comp)
        size = os.path.getsize(output_file)
        results[comp] = size
        print(f"\n{comp.upper()} compression:")
        print(f"File size: {size/1024:.2f} KB")

    # Also save as CSV for comparison
    df.to_csv('data.csv')
    results['csv'] = os.path.getsize('data.csv')
    print(f"\nCSV size: {results['csv']/1024:.2f} KB")

    # Calculate compression ratios
    csv_size = results['csv']
    print("\nCompression ratios compared to CSV:")
    for method, size in results.items():
        if method != 'csv':
            ratio = csv_size / size
            print(f"{method}: {ratio:.2f}x smaller")

    return results, df

# Run the comparison
compression_results, df = compare_storage_efficiency()

# Print a sample of the data to verify
print("\nSample of the data:")
print(df.head())
When you run this, you should see output like:
$ python3 compare.py
Original DataFrame shape: (365, 5)
Memory usage: 46.427734375 KB
SNAPPY compression:
File size: 10.94 KB
GZIP compression:
File size: 9.41 KB
BROTLI compression:
File size: 9.41 KB
CSV size: 22.27 KB
Compression ratios compared to CSV:
snappy: 2.04x smaller
gzip: 2.37x smaller
brotli: 2.37x smaller
Sample of the data:
category value timestamp user_id score
0 B 0.286035 2023-01-01 user_0 33.123630
1 A 0.767671 2023-01-02 user_1 73.225731
2 A 0.179130 2023-01-03 user_2 13.983122
3 B 1.942285 2023-01-04 user_3 57.281699
4 A 1.841576 2023-01-05 user_4 33.683535
2. Schema Evolution
The Parquet file format supports schema evolution: it's built for the reality that schemas change over time. You can add new columns without rewriting your existing dataset; files written before the change simply read back with nulls in the columns they don't have.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def demonstrate_schema_evolution():
    # Original schema
    df1 = pd.DataFrame({
        'id': range(5),
        'value': range(5)
    })
    # Write the first file with the original schema
    table1 = pa.Table.from_pandas(df1)
    pq.write_to_dataset(table1, 'evolving_data')

    # New schema with an additional column
    df2 = pd.DataFrame({
        'id': range(5, 10),
        'value': range(5, 10),
        'new_column': ['a', 'b', 'c', 'd', 'e']
    })
    # Append a second file with the extended schema
    table2 = pa.Table.from_pandas(df2)
    pq.write_to_dataset(table2, 'evolving_data',
                        existing_data_behavior='overwrite_or_ignore')

    # Read everything back under the newest schema; rows from the
    # older file come back with nulls in new_column
    merged = pq.read_table('evolving_data', schema=table2.schema)
    print(merged.to_pandas())
How Parquet Actually Works
Think of CSV files like a library where every time you want to find something, you have to read every book from cover to cover. Parquet is more like having an index for each topic - you only read what you need.
Here's what makes it special:
Columnar Storage: Only read the columns you need. Looking for error rates? Just grab the timestamp and error columns, leave everything else on disk.
Compression: Each column is compressed separately. Those repeated strings like 'login', 'logout'? They get compressed way better than in row-based formats.
Predicate Pushdown: Fancy term for "filter before loading". Want last week's errors? The reader uses each row group's min/max statistics to skip chunks that can't possibly match, so only the relevant byte ranges get fetched from S3.
When to Use Parquet
✅ Perfect for:
Analytics data
Log storage
Event streams
Metrics data
Data lakes
❌ Not great for:
Small datasets (<100MB)
Frequent small updates
OLTP workloads
When you need direct file editing
Best Practices We Learned the Hard Way
Don't Over-Partition
Good: year, month
Bad: year, month, day, hour, service, region
Why? Too many small files kill S3 performance
Choose Compression Wisely
Hot data: Snappy (fast compression/decompression)
Cold data: ZSTD (better compression ratio)
Analytics: Depends on your query patterns
Row Group Sizes Matter
Too small: More S3 requests
Too large: More memory pressure
Sweet spot: 100MB-200MB for analytics
The Plot Twist
Here's the thing about Parquet that no one tells you in the technical docs: it's not just about saving money or making queries faster. It's about changing how you think about data.
Remember that time you had to add a new field to your logging, and everyone panicked about backward compatibility? With Parquet's schema evolution, that's just... not a problem anymore. Need to analyze just one field across five years of data? Go ahead, it won't bankrupt the company.
Real Talk: A Tale of Two Queries
Let me share something that happened last month. We had two identical queries running in production:
-- The query both teams ran:
SELECT date_trunc('hour', timestamp) AS hour,
       COUNT(*) AS errors
FROM logs
WHERE service = 'payment-api'
  AND level = 'ERROR'
GROUP BY 1
ORDER BY 1
Team A (running against CSVs):
Query cost: $5.23
Data scanned: 1.2TB
Runtime: 3.5 minutes
Number of angry Slack messages: 7
Team B (running against Parquet):
Query cost: $0.89
Data scanned: 157GB
Runtime: 42 seconds
Number of coffee breaks taken while waiting: 0
The funny part? Both teams were looking at the exact same data. Team B just skipped the columns they didn't need and filtered out the rows they didn't want. Revolutionary, I know.
Why This Actually Matters
Look, at the end of the day, this isn't just about file formats. It's about:
Being able to analyze problems without watching the AWS bill
Running queries without planning your coffee break
Adding new data without breaking old code
Making your Finance team actually like you (it's the same capability we lean on at Last9!)