It has been six months since we published the last benchmarks. The last time we did it, collections were only 1 TB in size, and we disabled all the cool features in our engine to get understandable and comparable baselines across all famous DB engines like RocksDB, LevelDB, and WiredTiger.
Those systems are surprisingly bad at multithreading and asynchrony, are largely incapable of bypassing the Linux kernel, and have no GPU acceleration. Yet they power the most established data-processing systems at FAANG and are at the core of every rising DBMS startup… Time for a change.
To properly introduce our system in its full glory, we compare 3 setups:
- RocksDB.
- UnumDB, running on CPUs only.
- UnumDB with CUDA acceleration on Nvidia GPUs.
If you want to skip the configuration details, jump straight to the speed comparisons below!
You can find all the configuration files in our public repo on GitHub. We have already described the setup and the methodology in over 5000 words the last time, but to remind you:
- Keys are 8 byte integers.
- Values are 1024 byte binary blobs.
- Compression is disabled.
- Flushing isn’t enforced on writes.
- Swap space is disabled.
Unlike the last time:
- The benchmarks ran in 64 concurrent threads.
- 8x 4 TB SSD partitions in SLC mode were reserved for each DBMS.
- No RAM usage limits were enforced.
- Batch size: 256 keys.
- Range-scan lengths: 256 keys.
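The workload shape described above is easy to reproduce. Here is a minimal Python sketch of one benchmark batch; the function and constant names are ours for illustration, not part of the actual benchmark harness:

```python
import os
import struct

KEY_BYTES = 8          # keys are 8-byte integers
VALUE_BYTES = 1024     # values are 1024-byte binary blobs
BATCH_SIZE = 256       # keys per batched read or upsert

def make_batch(start_key: int) -> dict[bytes, bytes]:
    """Builds one batch of 256 key-value pairs with random payloads."""
    return {
        struct.pack("<Q", start_key + i): os.urandom(VALUE_BYTES)
        for i in range(BATCH_SIZE)
    }

batch = make_batch(0)
assert len(batch) == BATCH_SIZE
assert all(len(k) == KEY_BYTES for k in batch)
```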
- CPU: AMD Threadripper PRO 3995WX ~ 128 threads.
- RAM: 8x Samsung M393AAG40M32-CAE ~ 1 TB.
- SSD: 8x Samsung PM1733 U.2 ~ 64 TB.
- GPU: 1x Nvidia RTX 3090.
- OS: Ubuntu 22.04.
- Linux Kernel: 5.15.
- Filesystem: Ext4.
- RocksDB version: 6.29.3.
No two engines are the same. Otherwise, writing them would be meaningless. Here are some of the most notable differences affecting the presented numbers.
RocksDB and Integer Keys
It is also worth reminding that RocksDB doesn’t fully support fixed-length keys. Its primary “Block-Based SST” file format is designed for variable-length keys, and its “Plain SST” file format misses half of the features needed for these benchmarks. Fixed-length keys are an absolute necessity for a modern high-performance storage system, and we will be happy to repeat the benchmarks once RocksDB gets them. To at least partially compensate, we have disabled the “fixed-length values” flag in the used distribution of UnumDB. Yes, we have an optimization for such collections as well.
RocksDB and Flash Arrays
RocksDB also has trouble working with more than one storage directory:
- Ingesting SST files during initialization always puts them on the first disk.
- Passing more than 4 disk paths is impossible. The limit is hardcoded, and we have no clue as to why.
Those are major yet trivially solvable problems, considering modern flash arrays. Imagine two CPU sockets and 24x Gen4 NVMe SSDs per server: no SATA, no SAS, just a direct PCIe connection to the CPUs. With only 4 paths supported, most enterprise RocksDB installations have to operate over software RAID. Again, we will happily repeat the benchmarks once those issues are fixed.
Such patches would affect RocksDB’s numbers, but not nearly enough to reach parity, even with our CPU-only version.
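To make the problem concrete, here is a hypothetical sketch of how an engine could place shards directly on independent NVMe mount points by hashing, instead of hiding the drives behind a software-RAID device. All names here are illustrative, not any engine’s real API:

```python
import hashlib

# Hypothetical: 24 independently mounted Gen4 NVMe drives.
MOUNT_POINTS = [f"/mnt/nvme{i}" for i in range(24)]

def path_for(shard_name: str) -> str:
    """Deterministically maps a shard to one of the mount points."""
    digest = hashlib.blake2b(shard_name.encode(), digest_size=8).digest()
    index = int.from_bytes(digest, "little") % len(MOUNT_POINTS)
    return f"{MOUNT_POINTS[index]}/{shard_name}.sst"

# Placement is deterministic, and load spreads across the drives
# without any RAID layer in between.
assert path_for("shard-000") == path_for("shard-000")
```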
Today, GPUs are installed in most HPC servers and are available on every public cloud, so they are pretty abundant. In terms of comparability of results, adding a $1'500 GPU to a server worth well over $20'000 hardly changes the cost of running the experiment. Energy-wise, however, that GPU draws 300 W to 400 W under heavy load, about the same as the CPU.
That may look like a lot, but if doubling the energy draw gets the work done 4x faster, it means a 2x net reduction in overall consumed electricity. Faster service, cleaner planet, profit 💵
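The arithmetic behind that claim is simple, since energy is power multiplied by time:

```python
# Normalized units: the GPU doubles total power draw,
# but the job finishes 4x faster. Energy E = P * t.
cpu_power, gpu_extra_power = 1.0, 1.0
speedup = 4.0

energy_ratio = (cpu_power + gpu_extra_power) / speedup  # vs. CPU-only run
assert energy_ratio == 0.5  # i.e. a 2x net reduction in consumed energy
```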
The two most important operations in a database engine are “Batch Reads” and “Batch Upserts”, where “upsert” is short for “update if present, insert if missing”. We are crushing both, in relative and absolute terms:
| Operation | RocksDB | UnumDB | UnumDB + GPU |
| --- | --- | --- | --- |
| Initialization | 603 K | 60 M | — |
| Read | 420 K | 1 M | — |
| Batch Read | 650 K | 4.5 M | — |
| Range Select | 5 M | 2 M | — |
| Scan | 17 M | 17 M | — |
| Read & Update, 1:1 | 64 K | 173 K | 214 K |
| Read & Upsert, 19:1 | 128 K | 270 K | 276 K |
| Batch Upsert | 57 K | 117 K | 182 K |
| Remove | 420 K | 874 K | 1 M |

All numbers are in operations per second.
Truth be told, we didn’t expect to lose in Range Selections. Speed-wise, it is the only unstable operation, yielding very different results depending on the range length:
- RocksDB goes from 50 Kops to 17 Mops on lengths from 1 to ∞.
- UnumDB goes from 100 Kops to 17 Mops on lengths from 1 to ∞, dipping somewhere in the middle. This will be improved in our November release.
Speed is good, but what does it mean for your experiment? For your product? For your budget? 10 TB collections are too big to repeatedly evaluate each workload with each engine on all 10 Billion entries. So, aside from initialization, each workload was only applied to 10% of randomly sampled keys, meaning 1 Billion operations.
| Metric | RocksDB | UnumDB | UnumDB + GPU |
| --- | --- | --- | --- |
| Experiment Duration | 17.6 h | 6.3 h | 5.2 h |
| Post-Construction Duration | 13.0 h | 6.0 h | 5.0 h |
| Extrapolated Duration | 134.9 h ~ 5 d 14 h | 59.7 h ~ 2 d 12 h | 49.2 h ~ 2 d 1 h |
- The first line describes the actual duration of the presented benchmark run.
- The second line subtracts the duration of “Initialization”, assuming the data is already prepared for experiments.
- The third line extrapolates the duration if we were to scale each workload to the level of “Initialization”.
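As a rough sanity check, the extrapolated figure is approximately the one-off initialization time plus ten times the post-construction time, since each sampled workload covers 10% of the keys. The small mismatches come from rounding in the reported hours:

```python
def extrapolate(total_h: float, post_h: float, scale: int = 10) -> float:
    """Scales each post-construction workload from 1B to 10B operations."""
    init_h = total_h - post_h        # one-off "Initialization" phase
    return init_h + scale * post_h

# Compare against the table, tolerating rounding of the inputs:
assert abs(extrapolate(17.6, 13.0) - 134.9) < 1.0   # RocksDB
assert abs(extrapolate(6.3, 6.0) - 59.7) < 1.0      # UnumDB
assert abs(extrapolate(5.2, 5.0) - 49.2) < 1.5      # UnumDB + GPU
```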
Knowing experiment duration and hardware specs, we can calculate the costs:
| Metric | RocksDB | UnumDB | UnumDB + GPU |
| --- | --- | --- | --- |
| Extrapolated Duration | 134.9 h | 59.7 h | 49.2 h |
| Closest Machine on AWS | i4i.metal | i4i.metal | i4i.metal + eg1.2xlarge |
| Hourly Price | $11 | $11 | $11 + $0.4 |
| Average Energy Consumption | ~ 500 W | ~ 500 W | ~ 650 W |
| Overall Cost | $1'485 | $657, -55% | $561, -62% |
| Overall Energy Consumption | 243 MJ | 107 MJ, -55% | 115 MJ, -53% |
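The cost and energy rows follow directly from the extrapolated durations, hourly prices, and average power draws above:

```python
# Reproducing the "Overall Cost" and "Overall Energy Consumption" rows.
hours = {"RocksDB": 134.9, "UnumDB": 59.7, "UnumDB + GPU": 49.2}
price = {"RocksDB": 11.0, "UnumDB": 11.0, "UnumDB + GPU": 11.4}  # $/hour
watts = {"RocksDB": 500, "UnumDB": 500, "UnumDB + GPU": 650}

cost = {k: hours[k] * price[k] for k in hours}                 # dollars
energy = {k: hours[k] * 3600 * watts[k] / 1e6 for k in hours}  # megajoules

assert round(cost["UnumDB"]) == 657
assert round(cost["UnumDB + GPU"]) == 561
assert round(cost["RocksDB"]) in (1484, 1485)  # $1'485 in the table, rounding
assert round(energy["RocksDB"]) == 243
assert round(energy["UnumDB"]) == 107
assert round(energy["UnumDB + GPU"]) == 115
```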
As our numbers are continually improving and our stability is often higher than RocksDB’s, we are assembling an ecosystem of partly open-source technologies around UKV: a minimalistic C interface, backed by all kinds of database engines and wrapped into familiar application-specific SDKs for binary, document, and graph workloads!
This lets you mix and match various collection kinds with different interfaces, producing combinations like:
- An embedded Graph DB using LevelDB with a NetworkX-like Python interface. Like Neo4J, but without weird DSLs, with a familiar interface and predictable non-JVM performance and memory consumption.
- A standalone multi-column transactional Document DB using RocksDB, with Apache Arrow representations and Flight RPC for communication, translating Array-of-Structures internals into a Pandas-style Structure-of-Arrays API… All packed into a sub-100 MB Docker image, deployable on the cloud.
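To give a feel for the graph-over-KV idea, here is a hypothetical sketch of a NetworkX-like facade over a key-value store. The class and method names are purely illustrative and are not UKV’s actual API:

```python
class GraphOverKV:
    """Hypothetical adjacency storage on top of a key-value engine."""

    def __init__(self) -> None:
        # A dict stands in for an embedded engine like LevelDB here.
        self._kv: dict[bytes, set[int]] = {}

    def add_edge(self, u: int, v: int) -> None:
        # Undirected edge: record the neighbor under both endpoints' keys.
        for a, b in ((u, v), (v, u)):
            self._kv.setdefault(a.to_bytes(8, "little"), set()).add(b)

    def neighbors(self, u: int) -> set[int]:
        return self._kv.get(u.to_bytes(8, "little"), set())

g = GraphOverKV()
g.add_edge(1, 2)
assert g.neighbors(1) == {2} and g.neighbors(2) == {1}
```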
The UKV library is under active development and still highly unstable, but it already knows a lot of tricks.
We have a public changelog and we are inviting everyone to join the development. To summarize, the long-term efforts will be guided towards:
- Improved RPC.
- Support for more languages.
- Semantic (vector search) indices.
- Interoperability with CuDF and CuGraph.
Similarly, in UnumDB we are working on:
- 2x-3x faster mixed transactions.
- 2x faster compactions.
- 60% faster batch-reads.
- Hardware-accelerated implementation of graph collections.
- Hardware-accelerated implementation of document collections.
- Hardware-accelerated implementation of vector collections.
Accelerating analytical operations on GPUs is well explored, but accelerating CRUD operations, especially updates, is tricky. Since we pioneered GPU-accelerated updates in 2020, to our knowledge, we are still the only engine capable of them. We have come a long way since then and have prepared much more to surprise our customers! So come see us at WebSummit this year ❤️