Pandas, CuDF, Modin, Arrow, Spark, and a Billion Taxi Rides
Choosing the right in-memory tabular processing framework
We are continuing our saga of CPU vs. GPU articles comparing the most common data-processing toolkits, and this time it is about tabular data. Specifically, we will compare frameworks with Pandas-like Python interfaces on a dataset often used to benchmark SQL databases.
TLDR: Use Arrow to parse large datasets, split them into batches, and process the batches with CuDF up to 15x faster!
If you are looking for a broader overview of the topic, my recent PyData talk on “Accelerated Data-Science Libraries” was just published on YouTube.
Setup
Meet our dataset - “New York Taxi Rides”.
- 125 Parquet files,
- Totaling 39.7 GB on disk,
- Containing 1,547,741,381 rows.
The benchmark includes four queries, and we will analyze them one by one below. But for the most impatient, here is a chart.
Our baseline is Pandas. Every other bar in each group shows the relative speedup of one of the following libraries:
- Modin - a Python library that uses Dask or Ray to parallelize the evaluation of Pandas queries across processes/workers.
- CuDF - a hybrid Python, C++, and CUDA library by Nvidia that backs Pandas API calls with GPU kernels.
- Spark - a Java-based big-data processing framework with a Python wrapper and a `.pandas` interface.
- Arrow - a C++ framework for in-memory columnar representations and operations on them, with bindings for Python and many other languages.
Frameworks
We also looked at Dask, Dask-CuDF, SQLite, DuckDB, and other alternatives. All those projects are vastly different, so comparing them isn’t precisely apples to apples. For now, let’s understand the differences in our original set.
| Framework | Embedded | Memory Constraints |
|---|---|---|
| Pandas | ✅ | RAM |
| Modin | ❌ | RAM |
| CuDF | ✅ | VRAM |
| Arrow | ✅ | RAM |
| Spark | ❌ | RAM/JVM |
Two systems (Modin and Spark) are not “embedded” libraries. They depend on external processes with which they communicate through shared memory or sockets. That may not seem like a problem when working with 100 rows, but it becomes one with terabytes. A typical CPU can perform ~10 billion operations every second, while few I/O engines can handle even a million packets per second on a socket. For reference, Redis, often praised for its performance, hardly reaches that mark, while still being ahead of Spark.
The size of our dataset also makes it hard to process in memory. Not every computer has 40 GB of RAM, and even that won’t be enough once you decompress the data and reserve temporary memory for running the analysis. Luckily, our sandbox machine has 1 TB of RAM, but CuDF relies on the GPU’s internal memory, VRAM, which is smaller and more expensive: each of our GPUs has only 24 GB of it. So we hand-crafted a “batching” mechanism tailored for each query. CuDF would operate on batches of 10 files at a time instead of looking at all 125 at once. Consequently, it had to parse every file four times, once per query, while the other frameworks, limited only by the size of RAM, parsed the files just once before running analytics and reused the in-memory representation. Even so, CuDF was already up to 15x faster than Pandas before we included the parsing time in the measurement, and the picture becomes even more dramatic once we do.
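As a rough sketch of that batching idea for the first query, here is what processing 10 files at a time with CuDF could look like; the file paths, the column name, and the merge step are illustrative assumptions rather than the exact benchmark code:

```python
import glob

import cudf
import pandas as pd

# Hypothetical location of the 125 Parquet files
parquet_files = sorted(glob.glob("data/taxi_rides/*.parquet"))
batch_size = 10  # 10 files at a time fit into 24 GB of VRAM

partial_counts = []
for start in range(0, len(parquet_files), batch_size):
    batch = cudf.read_parquet(
        parquet_files[start:start + batch_size], columns=["cab_type"]
    )
    # Aggregate on the GPU, then move the tiny partial result to host memory
    partial_counts.append(batch.groupby("cab_type").size().to_pandas())
    del batch  # release VRAM before loading the next batch

# The partial aggregates are tiny, so merging them on the CPU is cheap
totals = pd.concat(partial_counts).groupby(level=0).sum()
```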
Queries
Enough with drawing. Let’s see what exactly we are comparing.
Query 1
The original SQL query has the following structure:
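In the commonly used formulation of this benchmark, it is a single-column group-by with a count; the `trips` table and `cab_type` column below follow that convention and may differ from the schema of our Parquet files:

```sql
SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;
```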
Rewriting it with Pandas would look similar to this:
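A minimal sketch, assuming the files were parsed into a single DataFrame and the columns keep the benchmark’s conventional names:

```python
import pandas as pd

# Hypothetical path; in the benchmark, all 125 Parquet files are parsed up front
df = pd.read_parquet("data/taxi_rides/")

# Q1: number of rides per cab type
df.groupby("cab_type").size()
```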
It looks pretty straightforward until you start adjusting it for every engine. Their APIs look similar, but not all arguments match, and some functionality will be missing. We will return to stability and interoperability, but before that, let us organize the measurements.
| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125 (full dataset) | 125 (full dataset) |
| Pandas | 608.6 s | 172.1 s | 51.7 s | 69.7 s |
| Modin | 743.1 s | 141.8 s | 121.2 s | 7.5 s |
| CuDF | 61.6 s | 16.7 s | ❌ | ❌ |
| Spark | 159.8 s | 68.3 s | 11.7 s | 158.9 s |
| Arrow | 78.6 s | 69.1 s | 48.0 s | 25.8 s |
Every measurement here is the number of seconds it took to evaluate the query. We repeated the measurements many times on two machines:
- “Desktop”: Intel Core i7 + Titan Volta + 32 GB of DDR4.
- “Server”: the machine mentioned before, with 1 TB of RAM and 24 GB GPUs.
As expected, CuDF can’t fit the whole dataset into VRAM, so only batched measurements are present. Still, CuDF needs roughly a quarter of the time to parse the batches and evaluate Q1 compared to Pandas just evaluating the query on an already parsed dataset.
Query 2
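In the same conventional formulation, Q2 averages the total fare per passenger count (table and column names assumed as before):

```sql
SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;
```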
This query is also quite simple to translate:
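A sketch of the Pandas version, reusing the `df` loaded for Q1:

```python
# Q2: average total fare per passenger count
df.groupby("passenger_count")["total_amount"].mean()
```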
Results:
| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125 (full dataset) | 125 (full dataset) |
| Pandas | 535.1 s | 114.1 s | 51.7 s | 20.0 s |
| Modin | 669.7 s | 154.1 s | 121.2 s | 27.0 s |
| CuDF | 28.4 s | 10.6 s | ❌ | ❌ |
| Spark | 73.6 s | 43.7 s | 11.7 s | ∞ |
| Arrow | 62.7 s | 54.4 s | 48.0 s | 14.4 s |
Starting with this query, Spark, SQLite, and a few other backends became unusably slow when evaluating the full dataset at once.
Query 3
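Conventionally, Q3 groups by the passenger count and the year of the pickup timestamp (names assumed as before):

```sql
SELECT passenger_count,
       extract(year FROM pickup_datetime),
       count(*)
FROM trips
GROUP BY 1, 2;
```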
Queries like these are often harder for databases. They often have an index ready for single-column group-by operations, but aggregating over multiple columns may require a full scan. For in-memory processing systems, this is less of an issue.
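A sketch of the Pandas translation, assuming `pickup_datetime` was parsed as a datetime column in the `df` from Q1:

```python
# Q3: number of rides per (passenger count, pickup year) pair
df["year"] = df["pickup_datetime"].dt.year
df.groupby(["passenger_count", "year"]).size()
```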
Results:
| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125 (full dataset) | 125 (full dataset) |
| Pandas | 858.5 s | 247.8 s | 51.7 s | 153.0 s |
| Modin | 729.1 s | 153.7 s | 121.2 s | 17.0 s |
| CuDF | 25.5 s | 11.2 s | ❌ | ❌ |
| Spark | 125.1 s | 63.1 s | 11.7 s | ∞ |
| Arrow | 82.4 s | 74.5 s | 48.0 s | 36.1 s |
Query 4
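In its conventional form, Q4 adds the rounded trip distance to the grouping keys and sorts the result (names assumed as before):

```sql
SELECT passenger_count,
       extract(year FROM pickup_datetime),
       round(trip_distance),
       count(*)
FROM trips
GROUP BY 1, 2, 3
ORDER BY 2, 4 DESC;
```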
This query is the hardest and combines everything we have tried before. In Pandas, it is a bit of a mouthful:
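A sketch of the Pandas version, again reusing the `df` and the assumed column names from the previous queries:

```python
# Q4: rides per (passenger count, pickup year, rounded distance), sorted by year
# and by the number of rides in descending order
df["year"] = df["pickup_datetime"].dt.year
df["distance"] = df["trip_distance"].round()
(
    df.groupby(["passenger_count", "year", "distance"])
    .size()
    .reset_index(name="rides")
    .sort_values(["year", "rides"], ascending=[True, False])
)
```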
Results:
| | Desktop | Server | Server | Server |
|---|---|---|---|---|
| Operations | Load + Eval | Load + Eval | Load | Eval |
| Batch Size | 2 | 10 | 125 (full dataset) | 125 (full dataset) |
| Pandas | 859.0 s | 470.8 s | 51.7 s | 178.4 s |
| Modin | 1024.2 s | 247.5 s | 121.2 s | 72.3 s |
| CuDF | 26.9 s | 12.0 s | ❌ | ❌ |
| Spark | 669.4 s | 136.3 s | 11.7 s | ∞ |
| Arrow | 95.7 s | 79.4 s | 48.0 s | 43.9 s |
Findings
CuDF
Pros:
- Almost 1-to-1 API interop with Pandas, best in class!
- Exceptional performance out of the box.
Cons:
- VRAM limits dataset size.
- Not everyone has a GPU. Why?!
- Default horizontal scaling methods, namely Dask-CuDF, work horribly, even with RMM pools and NV-Link.
Even the current performance is far from saturating the GPU’s compute capabilities: parsing and data transfers take more time than the actual kernel execution. CXL, PCI-E 5, and next-gen NV-Link will compensate for this, but only partly. This is what GPU utilization looks like in such batched queries on an RTX 3090:
The dataset was placed on a fast PCI-E Gen4 SSD, but we also tried rerunning the benchmarks over datasets stored on a RAID-0 array of 8x even faster Gen4 SSDs. We hoped it would affect the batched workloads but registered no meaningful difference in our measurements.
Spark
Pros:
- Faster than Pandas.
- Widespread outside of the Python world.
- A major part of the Apache ecosystem.
Cons:
- Memory usage patterns and performance are unpredictable due to the JVM.
- Cumbersome configuration via `SparkConf`, `SparkSession`, and `SparkContext` (see the sketch below).
- Some types, like `datetime64[s]`, aren’t supported, while others, like `datetime64[ns]`, work.
- Types like `category` may drop the original string labels and need workarounds.
- Stability issues.
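For context, here is roughly what standing up the pandas-on-Spark interface looks like; the session settings and path below are illustrative, not the exact values from our runs:

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Session setup spreads across several objects and configuration keys
spark = (
    SparkSession.builder
    .appName("taxi-rides")
    .config("spark.driver.memory", "64g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Hypothetical path; pandas-on-Spark mirrors much of the Pandas API
df = ps.read_parquet("data/taxi_rides/")
df.groupby("cab_type").size()
```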
Modin
Pros:
- API compatibility.
Cons:
- Poor scaling and over-utilization of hardware resources.
- Many operations, like masking, still fall back on Pandas' implementation.
Overall, we are not big fans of Modin’s “using Python to accelerate Python” methodology. We have said it before and will repeat it - Python needs a significant upgrade.
Multi-threading is not usable in Python due to the Global Interpreter Lock. That is why PyTorch deprecated `DataParallel` training in favor of `DistributedDataParallel`.
Need more than one thread - run multiple processes and multiple Python interpreter instances, all isolated, and pay the inter-process communication costs. Just waiting for those processes to synchronize often takes longer than the task itself, making it meaningless for fine-grained parallelism. This also applies to Ray and Dask.
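As a toy illustration (not part of the benchmark), dispatching a trivial per-element task to a process pool is easily slower than doing all the work in a single process, because each task pays the serialization and synchronization tax:

```python
import time
from multiprocessing import Pool


def square(x):
    return x * x


if __name__ == "__main__":
    numbers = list(range(10_000))

    start = time.perf_counter()
    results = [square(x) for x in numbers]
    print("single process:", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(processes=4) as pool:
        # chunksize=1 maximizes the number of inter-process round-trips
        results = pool.map(square, numbers, chunksize=1)
    print("process pool:  ", time.perf_counter() - start)
```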
Arrow
Pros:
- Highest speed and efficiency on CPU.
- Embedded and easy to configure/manage.
- Growing ecosystem.
Cons:
- The API is very different from Pandas.
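For a taste of the difference, Q1 expressed through the PyArrow dataset and compute APIs could look roughly like this (the path is a placeholder, and the column name follows the same assumptions as above):

```python
import pyarrow.dataset as ds

# Scan only the column Q1 needs instead of materializing the whole table
dataset = ds.dataset("data/taxi_rides/", format="parquet")
table = dataset.to_table(columns=["cab_type"])

# Group-by lives on the Table itself rather than on a DataFrame-like object
counts = table.group_by("cab_type").aggregate([([], "count_all")])
print(counts.to_pandas())
```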
But it may not be much of a problem for some users, as Arrow has awe-inspiring support from all sides of the industry…
Including both compute and storage. Our Kernel Libraries may not be ready for showtime, but we are slowly joining the ecosystem on the storage side with UKV.
Arrow representations are first-class residents there. Furthermore, UKV seems to be the first storage-layer solution using Apache Arrow Flight RPC for client-server communication, meaning that every Arrow-compatible project can work with UKV without additional bindings. We have just finished benchmarking it on 10 TB document collections, with 10x more entries and 100x more data, so we are confident it will be pivotal in taking on the next level of scale!
Conclusions
To summarize the results, here are the winners across categories:
- Consistent performance: Arrow.
- Peak performance: CuDF.
- Energy efficiency: CuDF.
- Accessibility: Modin.
- Ecosystem: Arrow and Spark.
This entire benchmark, ADSB, is publicly available on our corporate GitHub, together with the original measurements on different machines and the scripts for downloading & preprocessing the data. Enjoy 🤗