As this is the first article in the series, let’s start with a somewhat generic introduction. For raw benchmarks, jump here.

## Intro to CuPy

Many groups benefit from GPUs: gamers, miners, AI researchers…, but surprisingly few can program them. For a good reason - complexity!

On CPUs, everything is easy. Most of the code is portable. Threads rarely communicate with each other. And when they do, there is only a few that you need to synchronize. On GPUs, it’s a mess. You have 10'000 cores, all of them executing the same task, mainly on a shared piece of information. That is scary even for the most experienced C++ programmers. Three years ago I spent weeks experimenting with different ways to essentially sum(numpy.random.rand(1_000_000_000)) on GPUs.

It culminated in a one hour breadth-first comparison of C++, OpenCL, Vulkan and SyCL, which I presented at CppRussia 2019. What if there was a simpler way?

OpenGL CUDA OpenCL Metal Vulkan CuPy
Release Year 1992 2007 2009 2014 2016 2019
Runs on Intel Yes No Yes No Yes No
Runs on AMD Yes No Yes No Yes No
Runs on Nvidia Yes Yes Yes No Yes Yes
Runs on Apple Deprecated No MacOS Yes MoltenVK No
Runs on Android Yes No Depends No Yes No

One just appeared around that time. CuPy builds on the stable foundation of CUDA. Being the least portable GPGPU adaptation of a C-like language, it is by far the most mature! The gap between CUDA and HIP or any other backend rendered all non-Nvidia GPUs obsolete for Machine Learning workloads. It was enough pain in the S to build TensorFlow, PyTorch or MxNet with CUDA support that essentially no one even tried porting them to AMD GPUs.

Still, not everyone wants to leave the warmth and comfort of a Jupyter Notebook to write CUDA kernels. Nvidia has you covered. CuPy implements most of the NumPy operations providing a drop-in replacement for Python users.

Experienced software developers now realize that many layers are separating the wmma:: CUDA intrinsics and CuPy. So we will not be able to benchmark all the interesting cases and are constrained to the most common functionality of NumPy. Still, even with the f32 matrix multiplications below, the numbers on Ampere are much better than on Turing.

## Software

NumPy is not a High-Performance numerics framework. It is obviously an improvement over [[0]*100]*100, but NumPy doesn’t implement HPC staff itself. All the HPC staff comes from BLAS or Basic Linear Algebra Subroutines. One of the oldest and most optimized families of software libraries. Every hardware vendor maintains a team of HPC engineers porting and tuning fast Linear Algebra kernels to their architectures. Nvidia and Intel are not exceptions, and AMD can run the x86 packages built at Intel.

Essentially, our NumPy vs CuPy match boils to comparing the OpenBLAS, MKL and cuBLAS through their higher-level interfaces. We will be looking at five such combinations:

• NumPy with BLIS, as a baseline
• NumPy with OpenBLAS
• NumPy with Intel MKL
• CuPy with TF32 Numeric Optimizations Disabled
• CuPy with TF32 Numeric Acceleration on Tensor Cores

## Installation & Configuration

With NumPy, things are simple. If it’s not on your machine yet, pip install numpy and call it a day. Rapids portal also provides an easy script for installing Nvidia Python libraries.

 1  conda create -n rapids-21.12 -c rapidsai -c nvidia -c conda-forge rapids=21.12 python=3.8 cudatoolkit=11.5 dask-sql 

Depending on your case, this may end up bringing 50+ dependencies, totalling at ~2.5 GB and taking a few minutes to install. After that, you need to pass extra environment variables to the Python runtime, as Nvidia may disable CUB reductions and Tensor Cores by default:

 1  CUPY_ACCELERATORS="cub,cutensor" CUPY_TF32=1 python numpy_vs_cupy.py 

One point to Griffindor NumPy.

Not so fast. Which BLAS version did we just download? How was the library built? The further you are from bare metal, the easier it is to miss such detail. Let’s check what pip brought us:

 1 2 3  python >>> import numpy as np >>> np.show_config() 

The answer is - OpenBLAS. For this experiment we will create separate Conda environments and will fill them differently:

 1 2 3 4 5  channels: - conda-forge dependencies: - 'blas=*=mkl' - 'numpy' 

MKL is here, all cores are fully loaded, we can safely proceed.

## Benchmarks

### Moving Averages

This is a simple benchmark, familiar to all analysts working with time-series. Imagine that rows of the matrix are stock prices, and you want to compute moving averages. This will make your curves smoother. Who doesn’t want that?!

 1 2 3 4  def moving_average(matrix, window: int = 3): ret = backend.cumsum(matrix, axis=1, dtype=dtype) ret[:, window:] = ret[:, window:] - ret[:, :-window] return ret[:, window - 1:] / window 

As we will see, in most cases, the difference between OpenBLAS and Intel MKL is indistinguishable. The same can be said about CuPy with and without TF32 support. Still, a massive gap generally exists between CPU and GPU solutions and between BLIS and other CPU packages. The problem here, again, is likely in the configuration files. BLIS isn’t always saturating the cores, but its OpenMP backend must be adjustible.

All charts are log-log. All matrices contain 32-bit floating point numbers. Matrix sizes grow from 512⨉512 to 16K⨉16K.

The biggest challenge with this article was to find a suitable visualization method. We failed.

### Matrix Multiplication

This one is pretty interesting. Some might expect a more significant gap, but Threadripper did great in GEMM compared to an average CPU. When we implement GEMMs on GPU, we generally limit ourselves to 2 levels of matrix tiling. One comes naturally from CUDAs way of addressing cores - grids and blocks. The second is needed to use the __shared__ memory between the cores in the same warp. On the CPUs, more levels of tiling are typical to exploit the deeper memory hierarchy.

 1 2  def matrix_multiply(matrix): return backend.matmul(matrix, matrix - backend.ones(matrix.shape, dtype=matrix.dtype)) 

All the benchmarks were unary functions and we did’t want to multiply the matrix by itself, so we subtract a matrix of scalars from it. The overall runtime won’t be affected by it and the asymptote remains ~O(N^3).

This must look highly satisfying to Machine Learning practitioners. Just don’t forget, what a TF32 is. It is not a 32-bit floating point number and not a 16-bit number. It’s actually 19 bits, so on huge matrices you will be accumulating an error.

### Singular Values Decomposition

We finally meet a benchmark, where GPU lost.

 1 2  def singular_decomposition(matrix): return backend.linalg.svd(matrix) 

### Pearson Correlations

Aside from being slower at times, CuPy isn’t precisely a one-to-one NumPy replacement. The subtle differences are noted in the documentation here and the others you can notice yourself. For example, NumPy docs and CuPy docs define different arguments for the corrcoef function.

 1 2 3  def pearson_correlations(matrix): return backend.corrcoef(matrix, rowvar=True, dtype=dtype) if numpy == backend else backend.corrcoef(matrix, rowvar=True) 

### Sorting Big Arrays

After all the 2D shenanigans, let’s flatten our matrices and analyze arrays. We will start with sorting, one of the biggest strengths of GPUs. Instead of employing trivial comparison-based sorting algorithms, they rely on linear-complexity Radix sort variants. Those aren’t generic and won’t work with any types, only numerics.

 1 2  def flat_sort(matrix): return backend.sort(matrix, axis=None) 

### Array Medians

 1 2  def flat_median(matrix): return backend.median(matrix, axis=None) 

## Beyond NumPy

Let’s conclude with just one table. Fastest CPU vs fastest GPU. Fastest CPU BLAS implementation - MKL, vs cuBLAS.

Speedups 512² 1024² 2048² 4096² 8192² 16384²
SVD 0.8x 0.7x 0.6x 0.5x 0.5x 0.6x
Pearson Corr. 2.5x 2.0x 1.4x 1.1x 1.5x 1.4x
GEMM 8.3x 17.9x 25.1x 21.0x 19.9x 23.4x
Moving Average 10.9x 19.9x 49.0x 59.3x 59.4x 53.9x
Medians 11.2x 30.3x 58.8x 72.2x 99.4x 83.5x
Sorting 87.7x 274.4x 547.3x 788.8x 920.5x 1008.5x

CuPy replaces not one framework, not two, but probably three. The part we covered today supersedes NumPy. The second part supersedes SciPy. The third one steps on the toes of Numba.

It’s big, and we will learn it in parts, leaving CuPy JIT and raw CUDA kernels next time. So get your popcorn ready, and start prototyping before the next generation of hardware arrives!