The data industry is constantly evolving, with new tools emerging and existing ones getting faster. These rapid advancements in data tooling are undoubtedly exciting.

DuckDB and Polars

This article compares two powerful tools: DuckDB and Polars. The test dataset is of moderate size, reflecting real-world workloads, and the queries are kept simple and intuitive so the tests stay actionable and the results easy to interpret.

Test Environment

The benchmark tests were conducted using the 2021 New York City Yellow Taxi trip data, which consists of 30 million records and 18 fields, with a disk storage size of approximately 3GB.

The tests were performed on a 2021 MacBook Pro equipped with an Apple M1 Max chip (10-core CPU), 64GB of RAM, and a 1TB SSD.

DuckDB version 0.10.0 and Polars version 0.20.15 were used in the tests. The code structure follows Marc Garcia’s benchmark methodology, whose easy-to-understand code and clear structure improve the transparency and reproducibility of the tests.

Test Methodology

The following operations were performed:

  1. Read CSV files
  2. Simple aggregations (sum, average, min, max)
  3. Group aggregations
  4. Window functions
  5. Joins

The query scripts for each operation are outlined below:

Reading CSV Files – DuckDB (read_csv_duckdb.py)

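A minimal sketch of what read_csv_duckdb.py could look like, assuming the CSV path from the step-by-step guide and collecting the result as an Arrow table so the full read is actually executed:

    # read_csv_duckdb.py -- time a full CSV read with DuckDB
    import time

    import duckdb  # .arrow() also requires pyarrow to be installed

    start = time.perf_counter()

    # Read the whole file and materialize it as an Arrow table so that the
    # complete pipeline runs, not just query planning.
    table = duckdb.sql(
        "SELECT * FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')"
    ).arrow()

    elapsed = time.perf_counter() - start
    print(f"DuckDB read_csv: {elapsed:.2f}s, {table.num_rows} rows")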

Reading CSV Files – Polars (read_csv_polars.py)

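The Polars counterpart is likely a straightforward eager read; a sketch:

    # read_csv_polars.py -- time a full CSV read with Polars
    import time

    import polars as pl

    start = time.perf_counter()

    # Eagerly read the CSV into a DataFrame.
    df = pl.read_csv("data/2021_Yellow_Taxi_Trip_Data.csv")

    elapsed = time.perf_counter() - start
    print(f"Polars read_csv: {elapsed:.2f}s, {df.height} rows")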

Simple Aggregations (Sum, Average, Min, Max) – DuckDB (agg_duckdb.py)

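A sketch of the simple-aggregation query for DuckDB; the column names (total_amount, trip_distance, fare_amount) assume the standard Yellow Taxi schema:

    # agg_duckdb.py -- sum, average, min and max in a single scan
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            SUM(total_amount)  AS total_revenue,
            AVG(trip_distance) AS avg_distance,
            MIN(fare_amount)   AS min_fare,
            MAX(fare_amount)   AS max_fare
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        """
    ).arrow()

    print(f"DuckDB aggregation: {time.perf_counter() - start:.2f}s")
    print(result)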

Simple Aggregations (Sum, Average, Min, Max) – Polars (agg_polars.py)

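The equivalent Polars version, sketched as a lazy scan that is forced with .collect():

    # agg_polars.py -- sum, average, min and max with a lazy scan
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .select(
            pl.col("total_amount").sum().alias("total_revenue"),
            pl.col("trip_distance").mean().alias("avg_distance"),
            pl.col("fare_amount").min().alias("min_fare"),
            pl.col("fare_amount").max().alias("max_fare"),
        )
        .collect()  # execution happens here
    )

    print(f"Polars aggregation: {time.perf_counter() - start:.2f}s")
    print(result)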

Group Aggregations – DuckDB (groupby_agg_duckdb.py)

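A sketch of the grouped aggregation for DuckDB, grouping by payment type (an assumed but typical grouping key for this dataset):

    # groupby_agg_duckdb.py -- aggregate per payment type
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            payment_type,
            COUNT(*)           AS trips,
            AVG(total_amount)  AS avg_amount,
            SUM(trip_distance) AS total_distance
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        GROUP BY payment_type
        """
    ).arrow()

    print(f"DuckDB group-by aggregation: {time.perf_counter() - start:.2f}s")
    print(result)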

Group Aggregations – Polars (groupby_agg_polars.py)

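The Polars counterpart, using the same assumed grouping key:

    # groupby_agg_polars.py -- aggregate per payment type
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .group_by("payment_type")
        .agg(
            pl.len().alias("trips"),
            pl.col("total_amount").mean().alias("avg_amount"),
            pl.col("trip_distance").sum().alias("total_distance"),
        )
        .collect()
    )

    print(f"Polars group-by aggregation: {time.perf_counter() - start:.2f}s")
    print(result)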

Window Functions – DuckDB (window_func_duckdb.py)

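A sketch of a window-function query for DuckDB, ranking fares within each pickup zone (the exact window function used in the benchmark may differ):

    # window_func_duckdb.py -- rank fares within each pickup zone
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        SELECT
            PULocationID,
            total_amount,
            ROW_NUMBER() OVER (
                PARTITION BY PULocationID
                ORDER BY total_amount DESC
            ) AS fare_rank
        FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        """
    ).arrow()

    print(f"DuckDB window function: {time.perf_counter() - start:.2f}s")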

Window Functions – Polars (window_func_polars.py)

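The Polars version of the same window computation, expressed with .over():

    # window_func_polars.py -- rank fares within each pickup zone
    import time

    import polars as pl

    start = time.perf_counter()

    result = (
        pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")
        .select(
            pl.col("PULocationID"),
            pl.col("total_amount"),
            pl.col("total_amount")
            .rank(method="ordinal", descending=True)
            .over("PULocationID")
            .alias("fare_rank"),
        )
        .collect()
    )

    print(f"Polars window function: {time.perf_counter() - start:.2f}s")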

Joins – DuckDB (join_duckdb.py)

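A sketch of the join for DuckDB. Since the benchmark uses a single source file, the join here is against a per-zone aggregate derived from the same data; the actual join in the benchmark may differ:

    # join_duckdb.py -- join each trip to its pickup zone's average fare
    import time

    import duckdb

    start = time.perf_counter()

    result = duckdb.sql(
        """
        WITH trips AS (
            SELECT * FROM read_csv_auto('data/2021_Yellow_Taxi_Trip_Data.csv')
        ),
        zone_avg AS (
            SELECT PULocationID, AVG(total_amount) AS zone_avg_amount
            FROM trips
            GROUP BY PULocationID
        )
        SELECT trips.*, zone_avg.zone_avg_amount
        FROM trips
        JOIN zone_avg USING (PULocationID)
        """
    ).arrow()

    print(f"DuckDB join: {time.perf_counter() - start:.2f}s")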

Joins – Polars (join_polars.py)

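And the Polars counterpart, built from the same assumed per-zone aggregate:

    # join_polars.py -- join each trip to its pickup zone's average fare
    import time

    import polars as pl

    start = time.perf_counter()

    trips = pl.scan_csv("data/2021_Yellow_Taxi_Trip_Data.csv")

    # Per-zone average fare, joined back onto every trip.
    zone_avg = trips.group_by("PULocationID").agg(
        pl.col("total_amount").mean().alias("zone_avg_amount")
    )

    result = trips.join(zone_avg, on="PULocationID", how="inner").collect()

    print(f"Polars join: {time.perf_counter() - start:.2f}s")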

Test Results

Here are the benchmark test results:

[Chart: benchmark test results for DuckDB and Polars]

The results were surprising. DuckDB was expected to outperform Polars in most queries, but the outcome was quite different: the performance gaps were most noticeable in reading CSV files and executing window functions.

In the CSV file reading task, Polars was three times faster than DuckDB. In executing window functions, Polars was more than seven times faster.

While Polars is known for its fast CSV reading, its performance on window function processing exceeded expectations. This may suggest that the specific window functions used here were more challenging for DuckDB to handle.

On the other hand, in join operations, DuckDB had a speed advantage of about 1.3 times over Polars. Joins are computationally expensive, and they are crucial in analysis work that requires integrating and consolidating data.

Step-by-Step Guide

Follow these steps to run the benchmark tests:

  1. Download the 2021 New York City Yellow Taxi trip CSV file.
  2. Create a folder named “data” in the project’s root directory and place the CSV file inside. Ensure the file path is correctly set to “data/2021_Yellow_Taxi_Trip_Data.csv”. If the file name is changed, update the path information in the Python scripts accordingly.
  3. Create and activate a Python virtual environment.
  4. Install the dependencies.
   pip install -r requirements.txt

or

   pip install duckdb polars pandas numpy
  5. Finally, run the benchmark test scripts.
   python read_csv_duckdb.py
   python read_csv_polars.py
   python agg_duckdb.py
   python agg_polars.py
   python groupby_agg_duckdb.py
   python groupby_agg_polars.py
   python window_func_duckdb.py
   python window_func_polars.py
   python join_duckdb.py
   python join_polars.py

You can also run the unit tests to verify the correctness of the code:

pytest

Caveats and Limitations

Benchmarking the DuckDB queries involves a subtlety: result-collection methods such as .arrow(), .pl(), .df(), and .fetchall() guarantee that the query executes completely, but they also pull result materialization, a non-core factor, into the measurement, which can affect the accuracy of the results.

Among these methods, .arrow() is used in these benchmarks because it collects query results efficiently.

Note, however, that the .execute() method alone, although it may seem sufficient, does not accurately reflect the full query execution time, because the final query pipeline only runs when a result-collection method is called.

Polars, in contrast, provides the .collect() method, which fully materializes the DataFrame and therefore captures the whole query execution.
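To make the distinction concrete, here is a minimal illustration (not part of the benchmark itself):

    import duckdb
    import polars as pl

    # DuckDB: building the relation or calling .execute() is not enough for
    # timing; a result-collection call such as .arrow() (or .df(), .pl(),
    # .fetchall()) is what forces the full pipeline to run.
    table = duckdb.sql("SELECT 42 AS answer").arrow()

    # Polars: a lazy query only runs once .collect() materializes the DataFrame.
    df = pl.LazyFrame({"answer": [42]}).collect()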

Conclusion

These tests were designed to be fair, and the results show that DuckDB and Polars perform quite similarly.

After a comprehensive evaluation, it’s clear that both DuckDB and Polars offer exceptional speed and efficiency, and either is a sound choice for advancing a project.

This article aimed to reveal the subtle performance differences between DuckDB and Polars, providing a reference for readers when selecting the most suitable tool for their needs.
