The data industry is constantly evolving, with new tools emerging and existing ones getting faster. These rapid advancements in data tooling are undoubtedly exciting.

DuckDB and Polars

This article compares two powerful tools: DuckDB and Polars. The test dataset is of moderate size, reflecting real-world workloads, and the queries are kept simple and intuitive so the tests stay actionable and the results easy to interpret.

Test Environment

The benchmark tests were conducted using the 2021 New York City Yellow Taxi trip data, which consists of 30 million records and 18 fields, with a disk storage size of approximately 3GB.

The tests were performed on a 2021 MacBook Pro equipped with an Apple M1 Max chip (10-core CPU), 64GB of RAM, and a 1TB SSD.

DuckDB version 0.10.0 and Polars version 0.20.15 were used in the tests. The code structure follows Marc Garcia’s benchmark methodology, whose easy-to-understand code and clear structure improve the transparency and reproducibility of the tests.

Test Methodology

The following operations were performed:

  1. Read CSV files
  2. Simple aggregations (sum, average, min, max)
  3. Group aggregations
  4. Window functions
  5. Joins

The query scripts for each operation are outlined below:

Reading CSV Files – DuckDB (read_csv_duckdb.py)

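A minimal sketch of what read_csv_duckdb.py could look like, assuming the CSV path from the step-by-step guide and collecting the result as an Arrow table so the full read is actually executed:

    # read_csv_duckdb.py -- time a full CSV read with DuckDB
    import time

    import duckdb  # .arrow() also requires pyarrow to be installed

    start = time.perf_counter()

    # Read the whole file and materialize it as an Arrow table so that the
    # complete pipeline runs, not just query planning.
    table = duckdb.sql(
        "SELECT * FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')"
    ).arrow()

    elapsed = time.perf_counter() - start
    print(f"DuckDB read_csv: {elapsed:.2f}s, {table.num_rows} rows")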

Reading CSV Files – Polars (read_csv_polars.py)

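The Polars counterpart is likely a straightforward eager read; a sketch:

    # read_csv_polars.py -- time a full CSV read with Polars
    import time

    import polars as pl

    start = time.perf_counter()

    # Eagerly read the CSV into a DataFrame.
    df = pl.read_csv("data/2021_Yellow_Taxi_Trip_Data.csv")

    elapsed = time.perf_counter() - start
    print(f"Polars read_csv: {elapsed:.2f}s, {df.height} rows")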

Simple Aggregations (Sum, Average, Min, Max) – DuckDB (agg_duckdb.py)

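A sketch of the simple-aggregation query for DuckDB; the column names (total_amount, trip_distance, fare_amount) assume the standard Yellow Taxi schema:

    # agg_duckdb.py -- sum, average, min and max in a single scan
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            SUM(total_amount)  AS total_revenue,
            AVG(trip_distance) AS avg_distance,
            MIN(fare_amount)   AS min_fare,
            MAX(fare_amount)   AS max_fare
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        """
    ).arrow()

    print(f"DuckDB aggregation: {time.perf_counter() - start:.2f}s")
    print(result)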

Simple Aggregations (Sum, Average, Min, Max) – Polars (agg_polars.py)

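The equivalent Polars version, sketched as a lazy scan that is forced with .collect():

    # agg_polars.py -- sum, average, min and max with a lazy scan
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .select(
            pl.col("total_amount").sum().alias("total_revenue"),
            pl.col("trip_distance").mean().alias("avg_distance"),
            pl.col("fare_amount").min().alias("min_fare"),
            pl.col("fare_amount").max().alias("max_fare"),
        )
        .collect()  # execution happens here
    )

    print(f"Polars aggregation: {time.perf_counter() - start:.2f}s")
    print(result)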

Group Aggregations – DuckDB (groupby_agg_duckdb.py)

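A sketch of the grouped aggregation for DuckDB, grouping by payment type (an assumed but typical grouping key for this dataset):

    # groupby_agg_duckdb.py -- aggregate per payment type
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            payment_type,
            COUNT(*)           AS trips,
            AVG(total_amount)  AS avg_amount,
            SUM(trip_distance) AS total_distance
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        GROUP BY payment_type
        """
    ).arrow()

    print(f"DuckDB group-by aggregation: {time.perf_counter() - start:.2f}s")
    print(result)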

Group Aggregations – Polars (groupby_agg_polars.py)

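The Polars counterpart, using the same assumed grouping key:

    # groupby_agg_polars.py -- aggregate per payment type
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .group_by("payment_type")
        .agg(
            pl.len().alias("trips"),
            pl.col("total_amount").mean().alias("avg_amount"),
            pl.col("trip_distance").sum().alias("total_distance"),
        )
        .collect()
    )

    print(f"Polars group-by aggregation: {time.perf_counter() - start:.2f}s")
    print(result)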

Window Functions – DuckDB (window_func_duckdb.py)

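A sketch of a window-function query for DuckDB, ranking fares within each pickup zone (the exact window function used in the benchmark may differ):

    # window_func_duckdb.py -- rank fares within each pickup zone
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            PULocationID,
            total_amount,
            ROW_NUMBER() OVER (
                PARTITION BY PULocationID
                ORDER BY total_amount DESC
            ) AS fare_rank
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        """
    ).arrow()

    print(f"DuckDB window function: {time.perf_counter() - start:.2f}s")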

Window Functions – Polars (window_func_polars.py)

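The Polars version of the same window computation, expressed with .over():

    # window_func_polars.py -- rank fares within each pickup zone
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .select(
            pl.col("PULocationID"),
            pl.col("total_amount"),
            pl.col("total_amount")
            .rank(method="ordinal", descending=True)
            .over("PULocationID")
            .alias("fare_rank"),
        )
        .collect()
    )

    print(f"Polars window function: {time.perf_counter() - start:.2f}s")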

Joins – DuckDB (join_duckdb.py)

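A sketch of the join for DuckDB. Since the benchmark uses a single source file, the join here is against a per-zone aggregate derived from the same data; the actual join in the benchmark may differ:

    # join_duckdb.py -- join each trip to its pickup zone's average fare
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        WITH trips AS (
            SELECT * FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        ),
        zone_avg AS (
            SELECT PULocationID, AVG(total_amount) AS zone_avg_amount
            FROM trips
            GROUP BY PULocationID
        )
        SELECT trips.*, zone_avg.zone_avg_amount
        FROM trips
        JOIN zone_avg USING (PULocationID)
        """
    ).arrow()

    print(f"DuckDB join: {time.perf_counter() - start:.2f}s")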

Joins – Polars (join_polars.py)

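And the Polars counterpart, built from the same assumed per-zone aggregate:

    # join_polars.py -- join each trip to its pickup zone's average fare
    import time

    import polars as pl

    start = time.perf_counter()

    trips = pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")

    # Per-zone average fare, joined back onto every trip.
    zone_avg = trips.group_by("PULocationID").agg(
        pl.col("total_amount").mean().alias("zone_avg_amount")
    )

    result = trips.join(zone_avg, on="PULocationID", how="inner").collect()

    print(f"Polars join: {time.perf_counter() - start:.2f}s")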

Test Results

Here are the benchmark test results:

[Chart: benchmark test results for DuckDB and Polars]

The results were surprising. DuckDB was expected to outperform Polars in most queries, but the outcome was quite different: the performance gaps were most noticeable in reading CSV files and executing window functions.

In the CSV file reading task, Polars was three times faster than DuckDB. In executing window functions, Polars was more than seven times faster.

While Polars is known for its fast CSV reading, its performance on window function processing exceeded expectations. This may suggest that the specific window functions used here were more challenging for DuckDB to handle.

On the other hand, in join operations, DuckDB had a speed advantage of about 1.3 times over Polars. Joins are computationally expensive, and they are crucial in analysis work that requires integrating and consolidating data.

Step-by-Step Guide

Follow these steps to run the benchmark tests:

  1. Download the 2021 New York City Yellow Taxi trip CSV file.
  2. Create a folder named “data” in the project’s root directory and place the CSV file inside. Ensure the file path is correctly set to “data/2021_Yellow_Taxi_Trip_Data.csv”. If the file name is changed, update the path information in the Python scripts accordingly.
  3. Create and activate a Python virtual environment.
  4. Install the dependencies.
   pip install -r requirements.txt

or

   pip install duckdb polars pandas numpy
  5. Finally, run the benchmark test scripts.
   python read_csv_duckdb.py
   python read_csv_polars.py
   python agg_duckdb.py
   python agg_polars.py
   python groupby_agg_duckdb.py
   python groupby_agg_polars.py
   python window_func_duckdb.py
   python window_func_polars.py
   python join_duckdb.py
   python join_polars.py

You can also run the unit tests to verify the correctness of the code:

pytest

Caveats and Limitations

Benchmarking the DuckDB queries involves a subtlety: result-collection methods such as .arrow(), .pl(), .df(), and .fetchall() guarantee that the query executes completely, but they also pull result materialization, a non-core factor, into the measurement, which can affect the accuracy of the results.

Among these methods, .arrow() is used in these benchmarks because it collects query results efficiently.

Note, however, that the .execute() method alone, although it may seem sufficient, does not accurately reflect the full query execution time, because the final query pipeline only runs when a result-collection method is called.

Polars, in contrast, provides the .collect() method, which fully materializes the DataFrame and therefore captures the whole query execution.
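To make the distinction concrete, here is a minimal illustration (not part of the benchmark itself):

    import duckdb
    import polars as pl

    # DuckDB: building the relation or calling .execute() is not enough for
    # timing; a result-collection call such as .arrow() (or .df(), .pl(),
    # .fetchall()) is what forces the full pipeline to run.
    table = duckdb.sql("SELECT 42 AS answer").arrow()

    # Polars: a lazy query only runs once .collect() materializes the DataFrame.
    df = pl.LazyFrame({"answer": [42]}).collect()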

Conclusion

These tests were designed to be fair, and the results show that DuckDB and Polars perform quite similarly.

After a comprehensive evaluation, it’s clear that both DuckDB and Polars offer exceptional speed and efficiency, and either is a sound choice for advancing a project.

This article aimed to reveal the subtle performance differences between DuckDB and Polars, providing a reference for readers when selecting the most suitable tool for their needs.
