We are continuing our saga of CPU vs. GPU articles comparing the most common data-processing toolkits, and this time it will be about tabular data. Specifically, we will compare frameworks with Pandas-like Python interfaces on a dataset often used to compare SQL Databases.

TLDR: Use Arrow to parse large datasets, split them into batches, and process them with CuDF up to 15x faster!

Accelerated Data-Science Libraries talk at PyData by Ashot Vardanian

If you are looking for a broader overview of the topic, my recent PyData talk on “Accelerated Data-Science Libraries” was just published on YouTube.

Setup

Meet our dataset - “New York Taxi Rides”.

  • 125 Parquet files,
  • Totaling 39.7 GB on disk,
  • Containing 1,547,741,381 rows.

The benchmark includes four queries, which we will analyze one by one: 1, 2, 3, 4. But for the most impatient, here is a chart.

ADSB Tabular Speedups

Our baseline is Pandas. Each remaining bar in a group shows the relative speedup of one of the following libraries:

  1. Modin - a Python library that uses Dask or Ray to parallelize the evaluation of Pandas queries across processes/workers.
  2. CuDF - a hybrid Python, C++, and CUDA library by Nvidia that backs Pandas API calls with GPU kernels.
  3. Spark - a Java-based big-data processing framework with a Python wrapper and a .pandas interface.
  4. Arrow - a C++ framework for in-memory columnar representations and operations on them, with bindings for Python and many other languages.

Frameworks

We also looked at Dask, Dask-CuDF, SQLite, DuckDB, and other alternatives. All those projects are vastly different, so comparing them isn’t precisely apples to apples. For now, let’s understand the differences in our original set.

| | Embedded | Memory Constraints |
|---|---|---|
| Pandas | yes | RAM |
| Modin | no | RAM |
| CuDF | yes | VRAM |
| Arrow | yes | RAM |
| Spark | no | RAM/JVM |

Two systems (Modin and Spark) are not “embedded” libraries. They depend on external processes with which they communicate through shared memory or sockets. That may not seem like a problem when working with 100 rows, but it is with terabytes. A typical CPU can perform ~10 billion operations every second, while few I/O engines can handle even a million packets per second on a socket. For reference, Redis, often praised for its performance, hardly reaches that mark, while still being ahead of Spark.

ADSB Tabular Speedups

The size of our dataset also makes it hard to process in memory. Not every computer has 40 GB of RAM, and even that won’t be enough once you decompress the data and reserve temporary memory for the analysis. Luckily, our sandbox machine has 1 TB of RAM, but CuDF relies on the GPU’s internal memory - VRAM. It is smaller and more expensive, and each of our GPUs has only 24 GB of it. So we hand-crafted a “batching” mechanism tailored for each query: CuDF would operate on batches of 10 files at a time instead of looking at all 125 at once. Consequently, it had to re-parse every file for each of the four queries, while the other frameworks, limited only by the size of RAM, parsed the files just once and reused the in-memory representation for all the analytics. Even so, CuDF was already up to 15x faster than Pandas before we included the parsing time in the measurement, and the picture becomes even more dramatic once we do.
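A minimal sketch of that batching loop for Q1 (the dataset path and the merge step are illustrative assumptions, not the exact benchmark code):

```python
import cudf
from pathlib import Path

# Hypothetical location of the 125 Parquet files
files = sorted(str(f) for f in Path('taxi/').glob('*.parquet'))
batch_size = 10
partials = []

for start in range(0, len(files), batch_size):
    batch = files[start:start + batch_size]
    # Parse only the column Q1 needs, to spare VRAM
    trips = cudf.read_parquet(batch, columns=['cab_type'])
    partials.append(trips.groupby('cab_type').size())
    del trips  # release VRAM before the next batch

# Per-batch counts sum trivially; Q2's averages would need per-batch
# sums and counts instead, which is why each query got its own batching
final = cudf.concat(partials).groupby(level=0).sum()
```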

Queries

Enough with the charts. Let’s see exactly what we are comparing.

Query 1

The original SQL query has the following structure:

```sql
SELECT cab_type,
       count(*)
FROM trips
GROUP BY 1;
```

Rewriting it with Pandas would look similar to this:

```python
selected_df = trips[['cab_type']]
grouped_df = selected_df.groupby('cab_type')
final_df = grouped_df.size().reset_index(name='counts')
```

It looks pretty straightforward until you start adjusting it for every engine. The APIs look similar - ideally, switching engines is just an import swap, as the sketch below shows - but not all arguments match, and some functionality will be missing. We will return to stability and interoperability in the findings.
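Here is a minimal sketch of that import-swap idea (the dataset path is hypothetical, and in practice each backend still needed its own adjustments):

```python
# Pick exactly one of these imports; the query below stays the same
import pandas as pd            # baseline
# import modin.pandas as pd    # parallel CPU backend
# import cudf as pd            # GPU backend

trips = pd.read_parquet('taxi/')  # hypothetical dataset path
final_df = (trips[['cab_type']]
            .groupby('cab_type')
            .size()
            .reset_index(name='counts'))
```

Now, let us organize the measurements.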

| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125, full dataset | 125, full dataset |
| Pandas | 608.6 s | 172.1 s | 51.7 s | 69.7 s |
| Modin | 743.1 s | 141.8 s | 121.2 s | 7.5 s |
| CuDF | 61.6 s | 16.7 s | n/a | n/a |
| Spark | 159.8 s | 68.3 s | 11.7 s | 158.9 s |
| Arrow | 78.6 s | 69.1 s | 48.0 s | 25.8 s |

Every measurement here is the number of seconds it took to evaluate the query. We repeated the measurements many times on two machines, using a simple wall-clock loop sketched after this list:

  1. “Desktop”: Intel Core i7 + Titan Volta + 32 GB of DDR4.
  2. “Server”: the machine with 1 TB of RAM mentioned above.
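The timing itself was a plain wall-clock loop along these lines (a sketch; the repetition count and the choice to keep the fastest run are illustrative, not the exact harness):

```python
import time

def measure(run_query, repetitions=3):
    # Wall-clock the query callable several times and keep the fastest run
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)
```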

As expected, CuDF can’t fit the whole dataset into VRAM, so only batched measurements are present. Still, CuDF needs over 4x less time to parse the batches and evaluate Q1 than Pandas spends just evaluating the query on an already parsed dataset.

Query 2

```sql
SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY 1;
```

This query is also quite simple to translate:

```python
selected_df = trips[['passenger_count', 'total_amount']]
grouped_df = selected_df.groupby('passenger_count')
final_df = grouped_df.mean().reset_index()
```

Results:

| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125, full dataset | 125, full dataset |
| Pandas | 535.1 s | 114.1 s | 51.7 s | 20.0 s |
| Modin | 669.7 s | 154.1 s | 121.2 s | 27.0 s |
| CuDF | 28.4 s | 10.6 s | n/a | n/a |
| Spark | 73.6 s | 43.7 s | 11.7 s | n/a |
| Arrow | 62.7 s | 54.4 s | 48.0 s | 14.4 s |

Starting with this query, Spark, SQLite, and a few other backends became unusably slow when evaluating the full dataset at once, hence the missing cells.

Query 3

```sql
SELECT passenger_count,
       extract(year from pickup_datetime),
       count(*)
FROM trips
GROUP BY 1,
         2;
```

Queries like this one are often harder for databases. They typically have an index ready for single-column group-by operations, but aggregating over multiple columns may require a full scan. For in-memory processing systems, this is less of an issue.

```python
import pandas as pd

# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
selected_df = trips[['passenger_count', 'pickup_datetime']].copy()
selected_df['year'] = pd.to_datetime(selected_df.pop('pickup_datetime'), format='%Y-%m-%d %H:%M:%S').dt.year
grouped_df = selected_df.groupby(['passenger_count', 'year'])
final_df = grouped_df.size().reset_index(name='counts')
```

Results:

| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125, full dataset | 125, full dataset |
| Pandas | 858.5 s | 247.8 s | 51.7 s | 153.0 s |
| Modin | 729.1 s | 153.7 s | 121.2 s | 17.0 s |
| CuDF | 25.5 s | 11.2 s | n/a | n/a |
| Spark | 125.1 s | 63.1 s | 11.7 s | n/a |
| Arrow | 82.4 s | 74.5 s | 48.0 s | 36.1 s |

Query 4

```sql
SELECT passenger_count,
       extract(year from pickup_datetime),
       round(trip_distance),
       count(*)
FROM trips
GROUP BY 1,
         2,
         3
ORDER BY 2,
         4 desc;
```

This query is the hardest, combining everything we have tried before. In Pandas, it is a bit of a mouthful:

```python
import pandas as pd

selected_df = trips[['passenger_count', 'pickup_datetime', 'trip_distance']].copy()
selected_df['trip_distance'] = selected_df['trip_distance'].round().astype(int)
selected_df['year'] = pd.to_datetime(selected_df.pop('pickup_datetime'), format='%Y-%m-%d %H:%M:%S').dt.year
grouped_df = selected_df.groupby(['passenger_count', 'year', 'trip_distance'])
final_df = grouped_df.size().reset_index(name='counts')
final_df = final_df.sort_values(['year', 'counts'], ascending=[True, False])
```

Results:

| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125, full dataset | 125, full dataset |
| Pandas | 859.0 s | 470.8 s | 51.7 s | 178.4 s |
| Modin | 1024.2 s | 247.5 s | 121.2 s | 72.3 s |
| CuDF | 26.9 s | 12.0 s | n/a | n/a |
| Spark | 669.4 s | 136.3 s | 11.7 s | n/a |
| Arrow | 95.7 s | 79.4 s | 48.0 s | 43.9 s |

Findings

CuDF

Pros:

  • Peak performance: up to 15x faster than Pandas.
  • High energy efficiency.
  • Pandas-like API backed by GPU kernels.

Cons:

  • VRAM limits dataset size.
  • Not everyone has a GPU. Why?!
  • Default horizontal scaling methods, namely Dask-CuDF, work horribly, even with RMM pools (sketched below) and NV-Link.
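For reference, enabling the RMM pool mentioned above looks roughly like this (the pool size is an assumption):

```python
import rmm

# Pre-allocate a 20 GB pool of VRAM to avoid per-allocation overhead;
# even with this enabled, Dask-CuDF scaling remained poor in our runs
rmm.reinitialize(pool_allocator=True, initial_pool_size=20 * 2**30)
```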

Even the current performance is far from saturating the GPU’s compute capabilities. Parsing & data transfers take more time than the actual kernel execution. CXL, PCI-E 5, and next-gen NV-Link will compensate for this, but only partly. This is what GPU utilization looks like in such batched queries on an RTX 3090:

CuDF GPU Utilization

The dataset was placed on a fast PCI-E Gen4 SSD, but we also tried rerunning the benchmarks with the dataset stored on a RAID-0 array of eight even faster Gen4 SSDs. We hoped it would affect the batched workloads but registered no fluctuations in our measurements.

Implementation Source.

Spark

Pros:

  • Faster than Pandas.
  • Widespread outside of the Python world.
  • A major part of the Apache ecosystem.

Cons:

  • Memory usage patterns and performance are unpredictable due to JVM.
  • Cumbersome configuration via SparkConf, SparkSession, and SparkContext (see the sketch after this list).
  • Some types, like datetime64[s], aren’t supported, while others, like datetime64[ns], work.
  • Types like category may be dropping the original string labels and need workarounds.
  • Stability issues.
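To illustrate the configuration point, here is a minimal sketch of bootstrapping the `.pandas` interface (the app name, memory size, and path are assumptions):

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# JVM memory limits must be tuned up front and are easy to get wrong
spark = (SparkSession.builder
         .appName('taxi-benchmark')                       # hypothetical name
         .config('spark.driver.memory', '200g')           # hypothetical size
         .config('spark.sql.execution.arrow.pyspark.enabled', 'true')
         .getOrCreate())

trips = ps.read_parquet('taxi/')                          # hypothetical path
```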

Implementation Source.

Modin

Pros:

  • API compatibility.

Cons:

  • Poor scaling and over-utilization of hardware resources.
  • Many operations, like masking, still fall back on Pandas’ implementation.

Overall, we are not big fans of Modin’s “using Python to accelerate Python” methodology. We have said it before and will repeat it - Python needs a significant upgrade.

Multi-threading is not usable in Python due to the Global Interpreter Lock. That is why PyTorch deprecated DataParallel training in favor of DistributedDataParallel.

Need more than one thread? Run multiple processes and multiple Python interpreter instances, all isolated, and pay the inter-process communication costs. Just waiting for those processes to synchronize often takes longer than the task itself, making the approach meaningless for fine-grained parallelism. The same applies to Ray and Dask.
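A toy demonstration of that overhead (a sketch; exact numbers vary by machine, but on fine-grained work a process pool usually loses to a single thread):

```python
import time
from multiprocessing import Pool

def tiny_task(x):
    # A task so small that inter-process communication dominates its cost
    return x + 1

if __name__ == '__main__':
    with Pool(4) as pool:
        start = time.perf_counter()
        pool.map(tiny_task, range(100_000))
        pooled = time.perf_counter() - start

    start = time.perf_counter()
    list(map(tiny_task, range(100_000)))
    inline = time.perf_counter() - start

    print(f'process pool: {pooled:.3f}s, single thread: {inline:.3f}s')
```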

Implementation Source.

Arrow

Pros:

  • Highest speed and efficiency on CPU.
  • Embedded and easy to configure/manage.
  • Growing ecosystem.

Cons:

  • The API is very different from Pandas.

But it may not be much of a problem for some users, as Arrow has awe-inspiring support from all sides of the industry…

Arrow

Including both compute and storage. Our Kernel Libraries may not be ready for showtime, but we are slowly joining the ecosystem on the storage side with UKV.

UKV

Arrow representations are first-class citizens in UKV. Furthermore, it seems to be the first storage-layer solution using Apache Arrow Flight RPC for client-server communication, meaning that every Arrow-compatible project can work with UKV without additional bindings. We have just completed benchmarking it on 10 TB document collections with 10x more entries and 100x more data, so we are certain it will be pivotal in reaching the next level of scale!
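For illustration, here is roughly what streaming a table from any Arrow Flight server looks like in pyarrow (the endpoint address is hypothetical):

```python
import pyarrow.flight as flight

# Hypothetical server address; every Flight-compatible server speaks the same protocol
client = flight.connect('grpc://localhost:38709')

# Enumerate the available datasets and stream each back as an Arrow table
for info in client.list_flights():
    reader = client.do_get(info.endpoints[0].ticket)
    table = reader.read_all()
    print(info.descriptor, table.num_rows)
```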

Implementation Source.

Conclusions

To summarize the results, here are the winners across categories:

  1. Consistent performance: Arrow.
  2. Peak performance: CuDF.
  3. Energy efficiency: CuDF.
  4. Accessibility: Modin.
  5. Ecosystem: Arrow and Spark.

This entire benchmark - ADSB - is publicly available on our corporate GitHub, together with the original measurements on different machines and the scripts for downloading & preprocessing the data. Enjoy 🤗