DuckDB: Process 100M Records 350x Faster than Pandas

The Pandas library is well known as a beginner-friendly tool for data analysis, but it struggles with large data sets. In contrast, the open-source DuckDB processes large data remarkably fast thanks to its columnar, vectorized execution engine, far outpacing Pandas. Moreover, DuckDB ships with a Python library, so users who already know SQL can switch quickly and dramatically improve their data processing efficiency.

Next, let’s see how these two tools compare when handling a data set of more than one hundred million records.

1. Benchmark Testing Setup for Pandas and DuckDB

This section presents the data set used for benchmarking and the code implementations for both Pandas and DuckDB. The tests were conducted on a MacBook Pro with an M2 Pro chip (12-core CPU, 19-core GPU) and 16 GB of memory.

1.1 Data Set Information

The data set utilized is the trip record data provided by the New York City Taxi and Limousine Commission (TLC). This data was obtained from the official website of the New York City government on April 18, 2024, and is available for free use. Licensing information regarding data usage can be found on the nyc.gov website.

1.2 Benchmark Testing Objectives

The objective of the benchmark test is to load Parquet format data files using both Pandas and DuckDB, and subsequently compute monthly statistics, including total trips, average duration, distance traveled, total fare, and tip amounts.

To achieve this, the benchmark has to derive several date-time fields, filter the data to specific time periods, and, in Pandas, handle the multi-level index that the aggregation produces.

After loading, the data set turns out to contain over 111 million records.
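
A self-contained way to verify that count is a one-line DuckDB query (a sketch using the parquet_scan() function introduced in section 1.4; the folder path is the same placeholder used throughout this article):

import duckdb

# Count rows across every Parquet file in the folder without loading them into memory
row_count = duckdb.sql(
    "SELECT COUNT(*) FROM parquet_scan('path/to/the/folder/*.parquet')"
).fetchone()[0]
print(f"{row_count:,} records")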

The desired result is a DataFrame with one row per month containing these aggregated statistics.

1.3 Pandas Setup

Pandas is an essentially single-threaded library designed for convenient data manipulation, not for crunching large volumes of data quickly.

Pandas must first load all of the data into memory, and it reads the Parquet files one at a time, which is inefficient.

Additionally, Pandas requires the cumbersome step of resetting the multi-level index that groupby aggregations produce, so that the resulting columns are easy to access:

import os
import pandas as pd

# Load data
base_path = "path/to/the/folder"

parquet_files = [
    os.path.join(base_path, file)
    for file in os.listdir(base_path)
    if file.endswith('.parquet')
]

dfs = [pd.read_parquet(file) for file in parquet_files]
df_pd = pd.concat(dfs, ignore_index=True)

# Benchmark test function
def calculate_monthly_taxi_stats_pandas(df: pd.DataFrame) -> pd.DataFrame:
    # ... (implementation omitted; a possible version is sketched below)
    return df

# Run
res_pandas = calculate_monthly_taxi_stats_pandas(df=df_pd)
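
For reference, here is a minimal sketch of what such a function might look like. It is an assumption, not the author's original code: the column names (tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, total_amount, tip_amount) follow the TLC yellow taxi schema, and the filter year is hypothetical:

def calculate_monthly_taxi_stats_pandas(df: pd.DataFrame) -> pd.DataFrame:
    # Derive the date-time fields needed for filtering and grouping
    df["trip_duration"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    df["month"] = df["tpep_pickup_datetime"].dt.month

    # Hypothetical filter window; restrict to a single year of pickups
    df = df[df["tpep_pickup_datetime"].dt.year == 2023]

    # Aggregate per month, then reset the index so the result is a flat DataFrame
    stats = (
        df.groupby("month")
        .agg(
            trip_count=("tpep_pickup_datetime", "count"),
            avg_duration=("trip_duration", "mean"),
            avg_distance=("trip_distance", "mean"),
            total_fare=("total_amount", "sum"),
            total_tips=("tip_amount", "sum"),
        )
        .reset_index()
    )
    return stats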

1.4 DuckDB Setup

There are various ways to interact with DuckDB from Python, but the most straightforward is to write plain SQL. In fact, only two SELECT statements are needed to achieve what the Pandas function above does.

DuckDB also includes an efficient parquet_scan() function (an alias of read_parquet()), which reads all Parquet files matching a path pattern at once, greatly speeding up data loading:

import duckdb

# Database connection
conn = duckdb.connect()

# Benchmark test function
def calculate_monthly_taxi_stats_duckdb(conn: duckdb.DuckDBPyConnection, path: str) -> pd.DataFrame:
    # ... (implementation omitted; a possible version is sketched below)
    return df

# Run
res_duckdb = calculate_monthly_taxi_stats_duckdb(conn=conn, path="path/to/the/folder/*.parquet")
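
Likewise, a minimal sketch of what the DuckDB version might look like, again an assumption rather than the author's original code. The inner SELECT derives the per-trip fields and the outer SELECT aggregates them; the column names and filter year are the same hypothetical TLC schema assumptions as above:

def calculate_monthly_taxi_stats_duckdb(conn: duckdb.DuckDBPyConnection, path: str) -> pd.DataFrame:
    query = f"""
        SELECT
            month,
            COUNT(*) AS trip_count,
            AVG(duration_minutes) AS avg_duration,
            AVG(trip_distance) AS avg_distance,
            SUM(total_amount) AS total_fare,
            SUM(tip_amount) AS total_tips
        FROM (
            SELECT
                date_part('month', tpep_pickup_datetime) AS month,
                date_diff('minute', tpep_pickup_datetime, tpep_dropoff_datetime) AS duration_minutes,
                trip_distance,
                total_amount,
                tip_amount
            FROM parquet_scan('{path}')
            -- Hypothetical filter window; restrict to a single year of pickups
            WHERE date_part('year', tpep_pickup_datetime) = 2023
        )
        GROUP BY month
        ORDER BY month
    """
    # Run the query and return the result as a Pandas DataFrame
    return conn.sql(query).df()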

2. Benchmark Test Results — DuckDB is 352 Times Faster than Pandas

DuckDB processes the entire data set of over 100 million records in under two seconds, an astonishing result!
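
The original timing harness is not shown; a simple sketch with time.perf_counter() would look like this:

import time

start = time.perf_counter()
res_duckdb = calculate_monthly_taxi_stats_duckdb(conn=conn, path="path/to/the/folder/*.parquet")
print(f"DuckDB: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
res_pandas = calculate_monthly_taxi_stats_pandas(df=df_pd)
print(f"Pandas: {time.perf_counter() - start:.2f} s")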

If you can accept DuckDB's limitations, such as being an SQL-driven analytics (OLAP) engine rather than a general-purpose DataFrame library, it is undoubtedly a viable alternative to Pandas.

3. Conclusion

Overall, DuckDB lets you write and run data aggregation queries quickly in familiar SQL, with speed improvements of more than two orders of magnitude over Pandas.

DuckDB also supports other file formats, including JSON, CSV, and Excel, and its scanner extensions can attach directly to external databases such as SQLite, PostgreSQL, and MySQL. If you plan to use DuckDB in a more demanding production environment, you will have many flexible options.
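
For example, querying other formats is just as direct; this quick sketch uses hypothetical file names (events.csv, events.json):

import duckdb

# Query a CSV file directly; read_csv_auto infers column names and types
duckdb.sql("SELECT * FROM read_csv_auto('events.csv') LIMIT 5").show()

# JSON works the same way through read_json_auto
duckdb.sql("SELECT COUNT(*) FROM read_json_auto('events.json')").show()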
