50x Faster NumPy = CuPy
Benchmarking BLAS and other Linear Algebra packages on AMD Threadripper PRO 3995WX and Nvidia RTX 3090
Contents
As this is the first article in the series, let’s start with a somewhat generic introduction. For raw benchmarks, jump here.
Intro to CuPy
Many groups benefit from GPUs: gamers, miners, AI researchers…, but surprisingly few can program them. For a good reason  complexity!
On CPUs, everything is easy. Most of the code is portable.
Threads rarely communicate with each other.
And when they do, there is only a few that you need to synchronize.
On GPUs, it’s a mess. You have 10'000 cores, all of them executing the same task, mainly on a shared piece of information.
That is scary even for the most experienced C++ programmers.
Three years ago I spent weeks experimenting with different ways to essentially sum(numpy.random.rand(1_000_000_000))
on GPUs.
It culminated in a one hour breadthfirst comparison of C++, OpenCL, Vulkan and SyCL, which I presented at CppRussia 2019. What if there was a simpler way?
OpenGL  CUDA  OpenCL  Metal  Vulkan  CuPy  

Release Year  1992  2007  2009  2014  2016  2019 
Runs on Intel  Yes  No  Yes  No  Yes  No 
Runs on AMD  Yes  No  Yes  No  Yes  No 
Runs on Nvidia  Yes  Yes  Yes  No  Yes  Yes 
Runs on Apple  Deprecated  No  MacOS  Yes  MoltenVK  No 
Runs on Android  Yes  No  Depends  No  Yes  No 
One just appeared around that time. CuPy builds on the stable foundation of CUDA. Being the least portable GPGPU adaptation of a Clike language, it is by far the most mature! The gap between CUDA and HIP or any other backend rendered all nonNvidia GPUs obsolete for Machine Learning workloads. It was enough pain in the S to build TensorFlow, PyTorch or MxNet with CUDA support that essentially no one even tried porting them to AMD GPUs.
Still, not everyone wants to leave the warmth and comfort of a Jupyter Notebook to write CUDA kernels. Nvidia has you covered. CuPy implements most of the NumPy operations providing a dropin replacement for Python users.
Experienced software developers now realize that many layers are separating the wmma::
CUDA intrinsics and CuPy.
So we will not be able to benchmark all the interesting cases and are constrained to the most common functionality of NumPy.
Still, even with the f32
matrix multiplications below, the numbers on Ampere are much better than on Turing.
Software
NumPy is not a HighPerformance numerics framework.
It is obviously an improvement over [[0]*100]*100
, but NumPy doesn’t implement HPC staff itself.
All the HPC staff comes from BLAS or Basic Linear Algebra Subroutines.
One of the oldest and most optimized families of software libraries.
Every hardware vendor maintains a team of HPC engineers porting and tuning fast Linear Algebra kernels to their architectures.
Nvidia and Intel are not exceptions, and AMD can run the x86 packages built at Intel.
Essentially, our NumPy vs CuPy match boils to comparing the OpenBLAS, MKL and cuBLAS through their higherlevel interfaces. We will be looking at five such combinations:
 NumPy with BLIS, as a baseline
 NumPy with OpenBLAS
 NumPy with Intel MKL
 CuPy with TF32 Numeric Optimizations Disabled
 CuPy with TF32 Numeric Acceleration on Tensor Cores
Installation & Configuration
With NumPy, things are simple.
If it’s not on your machine yet, pip install numpy
and call it a day.
Rapids portal also provides an easy script for installing Nvidia Python libraries.


Depending on your case, this may end up bringing 50+ dependencies, totalling at ~2.5 GB and taking a few minutes to install. After that, you need to pass extra environment variables to the Python runtime, as Nvidia may disable CUB reductions and Tensor Cores by default:


One point to Griffindor NumPy.
Not so fast.
Which BLAS version did we just download?
How was the library built?
The further you are from bare metal, the easier it is to miss such detail.
Let’s check what pip
brought us:


The answer is  OpenBLAS. For this experiment we will create separate Conda environments and will fill them differently:


MKL is here, all cores are fully loaded, we can safely proceed.
Benchmarks
Moving Averages
This is a simple benchmark, familiar to all analysts working with timeseries. Imagine that rows of the matrix are stock prices, and you want to compute moving averages. This will make your curves smoother. Who doesn’t want that?!


As we will see, in most cases, the difference between OpenBLAS and Intel MKL is indistinguishable.
The same can be said about CuPy with and without TF32
support.
Still, a massive gap generally exists between CPU and GPU solutions and between BLIS and other CPU packages.
The problem here, again, is likely in the configuration files.
BLIS isn’t always saturating the cores, but its OpenMP backend must be adjustible.
All charts are loglog. All matrices contain 32bit floating point numbers. Matrix sizes grow from 512⨉512 to 16K⨉16K.
The biggest challenge with this article was to find a suitable visualization method. We failed.
Matrix Multiplication
This one is pretty interesting. Some might expect a more significant gap, but Threadripper did great in GEMM compared to an average CPU.
When we implement GEMMs on GPU, we generally limit ourselves to 2 levels of matrix tiling.
One comes naturally from CUDAs way of addressing cores  grids and blocks.
The second is needed to use the __shared__
memory between the cores in the same warp.
On the CPUs, more levels of tiling are typical to exploit the deeper memory hierarchy.


All the benchmarks were unary functions and we did’t want to multiply the matrix by itself, so we subtract a matrix of scalars from it. The overall runtime won’t be affected by it and the asymptote remains ~O(N^3).
This must look highly satisfying to Machine Learning practitioners.
Just don’t forget, what a TF32
is.
It is not a 32bit floating point number and not a 16bit number.
It’s actually 19 bits, so on huge matrices you will be accumulating an error.
Singular Values Decomposition
We finally meet a benchmark, where GPU lost.


Pearson Correlations
Aside from being slower at times, CuPy isn’t precisely a onetoone NumPy replacement.
The subtle differences are noted in the documentation here and the others you can notice yourself.
For example, NumPy docs and CuPy docs define different arguments for the corrcoef
function.


Sorting Big Arrays
After all the 2D shenanigans, let’s flatten our matrices and analyze arrays. We will start with sorting, one of the biggest strengths of GPUs. Instead of employing trivial comparisonbased sorting algorithms, they rely on linearcomplexity Radix sort variants. Those aren’t generic and won’t work with any types, only numerics.


Array Medians


Beyond NumPy
Let’s conclude with just one table. Fastest CPU vs fastest GPU. Fastest CPU BLAS implementation  MKL, vs cuBLAS.
Speedups  512²  1024²  2048²  4096²  8192²  16384² 

SVD  0.8x  0.7x  0.6x  0.5x  0.5x  0.6x 
Pearson Corr.  2.5x  2.0x  1.4x  1.1x  1.5x  1.4x 
GEMM  8.3x  17.9x  25.1x  21.0x  19.9x  23.4x 
Moving Average  10.9x  19.9x  49.0x  59.3x  59.4x  53.9x 
Medians  11.2x  30.3x  58.8x  72.2x  99.4x  83.5x 
Sorting  87.7x  274.4x  547.3x  788.8x  920.5x  1008.5x 
CuPy replaces not one framework, not two, but probably three. The part we covered today supersedes NumPy. The second part supersedes SciPy. The third one steps on the toes of Numba.
It’s big, and we will learn it in parts, leaving CuPy JIT and raw CUDA kernels next time. So get your popcorn ready, and start prototyping before the next generation of hardware arrives!