Nov 2nd, '24 · 6 min read

The Parquet Files: A Surprisingly Entertaining Guide to Columnar Storage

Look, I know what you're thinking. Another article about file formats? Really? You'd rather be debugging that mysterious production issue or arguing about tabs versus spaces. But hear me out for a minute.

Last week, I was happily hunting through our logs data - you know, the usual terabytes of events that compliance keeps asking for - when our Head of Finance dropped by. "Hey, why is our logging bill so high?"

Narrator: And thus began our hero's journey into the world of file formats.

Remember the good old days when we could just dump everything into CSV files? They were simple, readable, and everyone understood them. Then your data grew. And grew. And suddenly, reading that one column you needed meant downloading 500GB of data you didn't. Your Athena queries started costing more than your coffee budget. That's when you know it's time to talk about Parquet files.

The "Why Should I Care?" Part

Here's a fun experiment. Take this piece of code:

import pandas as pd
import numpy as np

# Create some annoyingly repetitive data
df = pd.DataFrame({
    'user_id': range(1_000_000),
    'event': np.random.choice(['login', 'logout', 'purchase'], 1_000_000),
    'timestamp': pd.date_range('2024-01-01', periods=1_000_000, freq='1min'),
    'session_id': [f'sess_{i % 1000}' for i in range(1_000_000)]
})

# Save it both ways
df.to_csv('regrets.csv')
df.to_parquet('smart_choice.parquet')


Now check the file sizes. Go ahead, I'll wait.

Waiting ...

Running this on my machine gave:

$ ls -lh regrets.csv smart_choice.parquet
-rw-r--r--@ 1 user staff 48M Nov 1 19:40 regrets.csv
-rw-r--r--@ 1 user staff 13M Nov 1 19:40 smart_choice.parquet

That's about a 73% reduction in size. "But storage is cheap!" I hear you say. Sure, until you're running queries on this data 50 times a day across hundreds of files, each time downloading the entire file because you just need to check the 'event' column. With Parquet, you'd only download the columns you need (about 1/4 of the data in this case), and it's already compressed efficiently.
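
You can check the column-pruning claim yourself against the two files we just wrote - a quick sketch, nothing more than pandas' usecols and columns arguments:

import pandas as pd

# CSV: even for a single column, the whole file still has to be downloaded
# and scanned line by line before pandas can keep just 'event'
events_csv = pd.read_csv('regrets.csv', usecols=['event'])

# Parquet: the reader seeks straight to the 'event' column chunks
events_parquet = pd.read_parquet('smart_choice.parquet', columns=['event'])

print(events_csv.shape, events_parquet.shape)  # same data, very different I/O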

Let's do some quick math:

  • CSV: 48MB × 50 queries × 30 days = 72GB transferred per month
  • Parquet (querying just one column): 3.25MB × 50 queries × 30 days = 4.875GB transferred per month

Your cloud provider just got a lot less happy about their profit margins.

Understanding Parquet Files

Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval in big data workloads. Having worked with petabyte-scale data pipelines for observability data through the Last9 Control Plane, I can tell you that choosing the right file format can make or break your o11y architecture.

Why Choose Parquet?

1. Efficient Compression

Parquet uses column-level compression, which is more efficient than row-based compression. Here's a real-world example:

import pandas as pd
import numpy as np
import os

def compare_storage_efficiency():
    """
    Compare different compression methods for Parquet files
    using realistic data volumes
    """
    # First create the date range
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
    n_rows = len(dates)  # This will be 365
    
    # Create a dataset with repetitive values - matching the date length
    df = pd.DataFrame({
        'category': np.random.choice(['A', 'B', 'C'], n_rows),
        'value': np.random.randn(n_rows),
        'timestamp': dates,
        # Add some more columns to make it interesting
        'user_id': [f'user_{i%100}' for i in range(n_rows)],
        'score': np.random.uniform(0, 100, n_rows)
    })
    
    # Compare different compression methods
    compressions = ['snappy', 'gzip', 'brotli']
    results = {}
    
    print("\nOriginal DataFrame shape:", df.shape)
    print("Memory usage:", df.memory_usage(deep=True).sum() / 1024, "KB")
    
    for comp in compressions:
        output_file = f'data_{comp}.parquet'
        df.to_parquet(output_file, compression=comp)
        size = os.path.getsize(output_file)
        results[comp] = size
        print(f"\n{comp.upper()} compression:")
        print(f"File size: {size/1024:.2f} KB")
    
    # Also save as CSV for comparison
    df.to_csv('data.csv')
    results['csv'] = os.path.getsize('data.csv')
    print(f"\nCSV size: {results['csv']/1024:.2f} KB")
    
    # Calculate compression ratios
    csv_size = results['csv']
    print("\nCompression ratios compared to CSV:")
    for method, size in results.items():
        if method != 'csv':
            ratio = csv_size / size
            print(f"{method}: {ratio:.2f}x smaller")
    
    return results, df

# Run the comparison
compression_results, df = compare_storage_efficiency()

# Print a sample of the data to verify
print("\nSample of the data:")
print(df.head())

When you run this, you should see output like:

$ python3 compare.py

Original DataFrame shape: (365, 5)
Memory usage: 46.427734375 KB

SNAPPY compression:
File size: 10.94 KB

GZIP compression:
File size: 9.41 KB

BROTLI compression:
File size: 9.41 KB

CSV size: 22.27 KB

Compression ratios compared to CSV:
snappy: 2.04x smaller
gzip: 2.37x smaller
brotli: 2.37x smaller

Sample of the data:
  category     value  timestamp user_id      score
0        B  0.286035 2023-01-01  user_0  33.123630
1        A  0.767671 2023-01-02  user_1  73.225731
2        A  0.179130 2023-01-03  user_2  13.983122
3        B  1.942285 2023-01-04  user_3  57.281699
4        A  1.841576 2023-01-05  user_4  33.683535

2. Schema Evolution

The Parquet format also supports schema evolution. Individual Parquet files are immutable, but at the dataset level you can add new columns over time without touching your existing data: new files carry the extra column, old files stay as they are, and readers reconcile the schemas at read time.

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def demonstrate_schema_evolution():
    # Original schema
    df1 = pd.DataFrame({
        'id': range(5),
        'value': range(5)
    })
    pq.write_table(pa.Table.from_pandas(df1), 'events_v1.parquet')
    
    # New schema with an additional column - written as a new file,
    # the existing file is left untouched
    df2 = pd.DataFrame({
        'id': range(5, 10),
        'value': range(5, 10),
        'new_column': ['a', 'b', 'c', 'd', 'e']
    })
    pq.write_table(pa.Table.from_pandas(df2), 'events_v2.parquet')
    
    # Old readers keep working: they just ask for the columns they know about
    print(pq.read_table('events_v2.parquet', columns=['id', 'value']).to_pandas())
    
    # New readers can scan both files under a unified schema;
    # rows from the old file simply get nulls for 'new_column'
    unified = pa.unify_schemas([pq.read_schema('events_v1.parquet'),
                                pq.read_schema('events_v2.parquet')])
    dataset = ds.dataset(['events_v1.parquet', 'events_v2.parquet'],
                         schema=unified, format='parquet')
    print(dataset.to_table().to_pandas())

How Parquet Actually Works

Think of CSV files like a library where every time you want to find something, you have to read every book from cover to cover. Parquet is more like having an index for each topic - you only read what you need.

Here's what makes it special:

  1. Columnar Storage: Only read the columns you need. Looking for error rates? Just grab the timestamp and error columns, leave everything else on disk.
  2. Compression: Each column is compressed separately. Those repeated strings like 'login', 'logout'? They get compressed way better than in row-based formats.
  3. Predicate Pushdown: Fancy term for "filter before loading". Want last week's errors? The reader checks each row group's min/max statistics and only fetches the byte ranges that can possibly match - see the sketch below.
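
Here's a rough sketch of what that looks like from Python, using PyArrow against the events file from the first example (the filter value is just illustrative):

import pyarrow.parquet as pq

# Column pruning and predicate pushdown in one call: only the 'timestamp' and
# 'event' columns are read, and row groups whose min/max statistics rule out
# 'purchase' can be skipped before any data is fetched (our random demo data
# won't skip much, but time-ordered production data does)
purchases = pq.read_table(
    'smart_choice.parquet',
    columns=['timestamp', 'event'],
    filters=[('event', '=', 'purchase')]
)
print(purchases.num_rows)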

When to Use Parquet

✅ Perfect for:

  • Analytics data
  • Log storage
  • Event streams
  • Metrics data
  • Data lakes

❌ Not great for:

  • Small datasets (<100MB)
  • Frequent small updates
  • OLTP workloads
  • When you need direct file editing

Best Practices We Learned the Hard Way

  1. Don't Over-Partition
    • Good: year, month
    • Bad: year, month, day, hour, service, region
    • Why? Too many small files kill S3 performance
  2. Choose Compression Wisely
    • Hot data: Snappy (fast compression/decompression)
    • Cold data: ZSTD (better compression ratio)
    • Analytics: Depends on your query patterns
  3. Row Group Sizes Matter
    • Too small: More S3 requests
    • Too large: More memory pressure
    • Sweet spot: 100MB-200MB for analytics (a code sketch of all three knobs follows below)
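
A minimal sketch of what those three knobs look like with pandas and PyArrow - assuming df is the events DataFrame from the first example, with purely illustrative paths and sizes:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Partition by year and month only - every partition-column combination
#    becomes its own directory of files, so more columns = more tiny files
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df.to_parquet('events/', partition_cols=['year', 'month'], compression='snappy')

# 2. Cold data: spend more write CPU for a better compression ratio
df.to_parquet('events_archive.parquet', compression='zstd')

# 3. row_group_size is counted in rows, so pick a row count that lands each
#    group somewhere around 100-200MB on disk for your typical row width
pq.write_table(pa.Table.from_pandas(df), 'events_tuned.parquet',
               row_group_size=1_000_000)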

The Plot Twist

Here's the thing about Parquet that no one tells you in the technical docs: it's not just about saving money or making queries faster. It's about changing how you think about data.

Remember that time you had to add a new field to your logging, and everyone panicked about backward compatibility? With Parquet's schema evolution, that's just... not a problem anymore. Need to analyze just one field across five years of data? Go ahead, it won't bankrupt the company.

Real Talk: A Tale of Two Queries

Let me share something that happened last month. We had two identical queries running in production:

-- The query both teams ran:
SELECT date_trunc('hour', timestamp) AS hour,
       COUNT(*) AS errors
FROM logs
WHERE service = 'payment-api'
  AND level = 'ERROR'
GROUP BY 1
ORDER BY 1

Team A (running against CSVs):

  • Query cost: $5.23
  • Data scanned: 1.2TB
  • Runtime: 3.5 minutes
  • Number of angry Slack messages: 7

Team B (running against Parquet):

  • Query cost: $0.89
  • Data scanned: 157GB
  • Runtime: 42 seconds
  • Number of coffee breaks taken while waiting: 0

The funny part? Both teams were looking at the exact same data. Team B just stopped reading columns they didn't need and filtering data they didn't want. Revolutionary, I know.
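
If you want to see where that difference comes from, the file's own footer will tell you: Parquet stores per-column sizes and statistics in its metadata. A small sketch against the file from the first example:

import pyarrow.parquet as pq

# Every Parquet file carries a footer describing its row groups and columns
meta = pq.ParquetFile('smart_choice.parquet').metadata

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        column = row_group.column(col)
        print(f"row group {rg}, {column.path_in_schema}: "
              f"{column.total_compressed_size / 1024:.0f} KB compressed")

Query engines read exactly this footer first, which is how they know which byte ranges they can skip entirely.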

Why This Actually Matters

Look, at the end of the day, this isn't just about file formats. It's about:

  • Being able to analyze problems without watching the AWS bill
  • Running queries without planning your coffee break
  • Adding new data without breaking old code
  • Making your Finance team actually like you - something we lean on heavily at Last9

Authors

Anjali Udasi

Helping to make the tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.
