
Everyone in the tech industry has probably heard the words “data lake”, “data warehouse”, “database management software”, but have you ever wondered how all of that is built? According to Gartner, the DBMS market was estimated at $65 billion in 2020 and is expected to reach $150 billion by 2025. This article is about the technology that underpins that whole industry. About how modern IT infrastructure is created. About famous benchmarks and typical pitfalls.

A sneak peek at Unum capabilities

You can comment and discuss this article on HackerNews.

Why Do We Benchmark?

Let us say you buy an expensive server today or rent it in the cloud. You take a powerful CPU, a bunch of GPUs and some fast SSDs. Why fast? Because persistent storage is, in general, exceptionally slow, and when the data streams through your entire system, the chain is only as strong as the weakest link.

  • GPUs can fetch over 1'000 GB from their memory every single second.
  • CPUs can fetch over 100 GB during the same interval. Slower, but hardly slow.
  • SSDs, in the meantime, generally peak at 4 GB/s in very simple workloads, but when mixed read & write operations are involved, the number may plummet to 10 MB/s. Or in other words, 0.01% of the CPU/RAM throughput.

10'000x is the bandwidth difference between your volatile and persistent memory. For people less familiar with hardware, that is the speed difference between a computer's short-term and long-term memory. If you want to analyze data, train Artificial Intelligence models or handle the significant volumes of transactions that your users generate - you need to bridge that gap!

When you see numbers that big, you know there is a heated market. Every DBMS vendor rushes to promote their solution and claim the throne. Whom should you trust? How do you choose a suitable data lake for your business? The only answer - benchmark them!

How Do We Benchmark? YCSB

Meet YCSB. The Yahoo Cloud Serving Benchmark has been the industry-standard Key-Value Store evaluation benchmark since 2010. For years people have used it to evaluate MongoDB, ElasticSearch, SnowFlake, Oracle NoSQL Database, Postgres and Redis, among other household names. The reason behind its popularity is its generality. In other words, if you excel in YCSB, you are likely to excel in most real-life workloads.

Our DBMS natively supports ACID transactions and guarantees consistency of the data. For those operations, the TPC-C benchmark from the Transaction Processing Performance Council exists, and we plan to share its results. We have previously used YCSB as one of the metrics for DBMS workloads and have already shared older UnumDB v0.1 results here. Our technology has matured since then, but we still have room for optimizations.

Who Is Against Us?

We start with the toughest kids on the block:

  • LevelDB, which powers Google’s infrastructure.
  • RocksDB, which powers Facebook’s infrastructure and half of the modern DBMS companies.
  • WiredTiger, which powers MongoDB.

Wait, I have not heard those names before, you might say. No problem, you are not alone. It turns out almost all modern databases are written on top of these three projects. It is like the engine of a car. You may be buying an Aston Martin, but they do not have an engine-design team. They buy engines from Mercedes-AMG.

As far as I know, only two kinds of DBMS engines exist:

  1. Key-Value Stores (KVS).
  2. Append-Only Stores.

The second category is a bit weird. To continue the automotive analogy, it is like a dragster car: it can only drive straight, and if the direction changes, people generally die. Time-series databases are sometimes built this way, but we want to cover more domains. So we need a KVS that supports all kinds of data mutations, reorganizations and random insertions.

Dragster

We have isolated the KVS. Now let us demystify it. The theoretical side of things is pretty stagnant. Not much debate is happening, and everyone builds a KVS using B-Trees, B+-Trees or LSM Trees. So you do not have to invent the structure, but do you have to implement it? Not necessarily. Everyone reuses or copycats what is already on GitHub.
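To make the “Key-Value Store” abstraction concrete, here is roughly what talking to one of these embedded engines looks like through LevelDB's public C++ API; the database path and keys are made up for illustration:

```cpp
#include <cassert>
#include <string>

#include "leveldb/db.h"

int main() {
  // Open (or create) a database directory on disk.
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/example_kvs", &db);
  assert(status.ok());

  // The entire interface boils down to Put / Get / Delete on binary strings.
  status = db->Put(leveldb::WriteOptions(), "user:42", "{\"name\":\"Ada\"}");
  assert(status.ok());

  std::string value;
  status = db->Get(leveldb::ReadOptions(), "user:42", &value);
  assert(status.ok() && !value.empty());

  delete db;
  return 0;
}
```

RocksDB and WiredTiger wrap the same idea in different APIs; everything a document store or an SQL engine does eventually reduces to calls like these.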

Recipe

The diagram above is not an exaggeration. It is how modern DBMS companies are assembled. You do not have to invent anything to be a Xillion dollar company. Open-source software has done wonders for the tech community in the past decades. Still, every coin has two sides. The negative one here is the growth of laziness, sloppiness and incompetence. Why innovate if you can take what works and call it a day?

This list goes on and on, and you can read an extended version on Wikipedia.

It is surprisingly hard to find a “database technology company” that actually designs new technology to solve this bottleneck. We are not a database company, per se, but we did it out of necessity. If one thing is clear from all the charts and numbers above, having faster KVS is essential to designing a high-throughput, low-latency data lake.

Technical Methodology

Every good experiment review starts with a technical specification! Before we go forward, let me say that writing a good, fair benchmark is highly complicated! Maybe harder and more time-consuming than designing state-of-the-art software itself. We have machines that range from 3-Watt ARM IoT boards to massive 3'000-Watt x86 servers, which not every datacenter rack can accommodate. We measure:

  • CPU load,
  • RAM usage,
  • SSD lifetime effects,
  • Electricity consumption,
  • Network throughput, and so much more!

The next problem arises with the benchmarking scripts themselves. They are often written worse than the DBMS they measure, so they further degrade the numbers and introduce bottlenecks that were not present in the original software. Even though our software remains private, we are preparing to open-source new benchmarking and introspection utilities for public use. So follow our GitHub page for updates!

Software Versions

  1. WiredTiger. Version 10.0.0.
  2. LevelDB. Version 1.23.
  3. RocksDB. Version 6.20.3 (fork of LevelDB).
  4. UnumDB. Version 0.2 (pre-release, but stable).

The host machine runs Ubuntu 20.04. Every DBMS runs on a separate, identical SSD. No Docker containers or Virtual Machines are involved. For every dataset size, we configure the DBs to keep no more than 5% of the data in RAM.

Compression was disabled in all cases to benchmark the persistent data structures themselves. Any of these DBMSs can be configured with any compression library, so it is not relevant for the comparison.
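As an illustration of that kind of tuning, here is a hedged sketch of how the RAM cap and the disabled compression could be expressed for RocksDB. The exact option values used in our runs are not published; the 5 GB cache below is simply the 5% figure applied to the 100 GB dataset:

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch only: plausible RocksDB settings matching the description above.
rocksdb::Options MakeBenchmarkOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Benchmark the data structure, not the compression codec.
  options.compression = rocksdb::kNoCompression;
  // Cap cached blocks at ~5% of the 100 GB dataset.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(5ull * 1024 * 1024 * 1024);
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```

LevelDB and WiredTiger expose equivalent cache-size and compression knobs; the point is that every engine was constrained the same way.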

Our Toys

Scientists are grown-up kids, and we are no different. Still, time goes on, the games change, and the toys get pricier. Let me show you the setup we used for the YCSB benchmark. It was custom-built with the help of our partners at Comino. It is impossible to do consistent benchmarking in the cloud, so you need to buy and assemble clusters yourself. That is how we validate the speed and efficiency of our software. Moreover, buying pre-designed systems from giant vendors like HPE and Dell would never give us the flexibility we needed.

  1. The entire system is liquid-cooled with metal blocks, so cooling is as efficient as it can be. Overheating never affects the numbers.
  2. This system only has one socket, unlike most of our dual-socket AMD Epyc machines. It eliminates NUMA artefacts when accessing RAM modules attached to another socket.
  3. Every SSD is directly attached to the CPU over PCI-e. There are no chipsets, RAID cards or PCI-e risers/splitters in between.

Unum Benchmarking Server

So what are the specs?

CPU: AMD Threadripper PRO 3995WX | 64 cores, 128 threads | 2.2 GHz base frequency | 256 MB L3 cache
GPU: 4x Nvidia RTX 3090 | 4x 24 GB GDDR6X = 96 GB | 4x 142 FP16 TOPS | 2+2 NVLink
RAM: 8x Samsung M393AAG40M32-CAE | 8x 128 GB = 1 TB | 3.2 GHz frequency | CL16 latency
SSD: 8x Samsung PM1733 U.2 | 8x 7.68 TB = 64 TB | 48 GB/s read | 30 GB/s write

Disabled parts:

  1. Our software becomes faster with GPUs, but these benchmarks purposefully exclude them to make the comparison fairer.
  2. We run a single node benchmark, so networking is irrelevant. Subscribe to learn about our upcoming distributed benchmarks over 200 GBit/s Mellanox Infiniband.

Our Results

YCSB generates keys randomly on the fly, using a composition of three probability distributions: the uniform, the Zipfian and another custom one, which favours the most recently generated keys (a small sketch of the Zipfian skew follows the list below). This highlights the strengths and weaknesses of DBs across a comprehensive set of settings. On top of that, we can stretch and scale the benchmark to see the trends. We took the following sizes of collections:

  • 100'000 entries, 100 MB workload size, took 5 minutes.
  • 1'000'000 entries, 1 GB workload size, took 12 minutes.
  • 10'000'000 entries, 10 GB workload size, took 1 hour.
  • 100'000'000 entries, 100 GB workload size, took 7.5 hours.
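For intuition on the Zipfian part, here is a minimal, illustrative key sampler in C++. YCSB's actual generator uses a different, constant-time construction, and the 0.99 skew constant below is an assumption made for this sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative only: materialize the Zipfian CDF over n keys and sample it.
class ZipfianKeys {
 public:
  explicit ZipfianKeys(std::size_t n, double theta = 0.99) : cdf_(n) {
    double norm = 0;
    for (std::size_t i = 0; i < n; ++i) norm += 1.0 / std::pow(i + 1.0, theta);
    double running = 0;
    for (std::size_t i = 0; i < n; ++i) {
      running += 1.0 / std::pow(i + 1.0, theta) / norm;
      cdf_[i] = running;
    }
  }

  // Draw one key index in [0, n): low indices ("hot" keys) dominate.
  std::size_t operator()(std::mt19937_64& rng) const {
    double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
    std::size_t i = std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
    return std::min(i, cdf_.size() - 1);  // guard against rounding at the tail
  }

 private:
  std::vector<double> cdf_;
};
```

A uniform distribution stresses the whole dataset evenly, while a Zipfian one hammers a few hot keys, which is exactly why the two expose different weaknesses in caching and compaction.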

The overall wall time for all experiments is about 9 hours, and we ran them multiple times, averaging the results, clearing the caches and resetting the drives between runs. If this sounds impressive, subscribe for our upcoming 50 TB benchmark… 500x more extensive than this comparison.

Now let us dive deeper to see how those numbers change depending on the size of the workload and how they use RAM & SSD space within the system.

Initialization

The task here is simple. We have an empty database. Let us fill it. 100% of all operations are “writes”.

Initialization Speed

Speed is good, but it is not the only metric. We must make sure that the system uses comparable resources. Let us evaluate the balance between memory consumption and speed. The bigger the dataset, the more stable the numbers, so we focus on the 100 GB case in the detailed comparisons. Ideally, every DBMS would use exactly 100 GB of SSD space, as compression was disabled. The rest is metadata that the DBMS needs for bookkeeping.

Brand                  | RAM consumption ↓ | SSD consumption ↓ | Speed ↑
WiredTiger by MongoDB  | 617.16 MB         | 123.19 GB         | 12K ops/s
LevelDB by Google      | 285.30 MB         | 110.23 GB         | 40K ops/s
RocksDB by Facebook    | 745.02 MB         | 109.59 GB         | 46K ops/s
UnumDB by Unum         | 4.18 GB           | 108.41 GB         | 84K ops/s 📍📍📍

Triple emojis are our way of marking extreme awesomeness. You only get them if you are number one and the gap between you and the second-best solution is bigger than the gap between any other consecutive pair of results.

Workload A: 50% Read, 50% Update

Workload A

This is the most common operation mix in transactional databases, and it is close to real-world Read-Modify-Write scenarios. Let us say you run a bank and your client just bought a cup of coffee. You must read their previous account balance, subtract the price of the coffee and write the new value back - all before validating the financial transaction, to guarantee consistency. Here we immediately see the value of speed. It is not just lower AWS bills. It is a better customer experience.
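To make the scenario concrete, here is a hedged sketch of such a Read-Modify-Write using RocksDB's pessimistic transactions. UnumDB's own transactional API is not shown here, and the account key and cents-based balance encoding are illustrative assumptions:

```cpp
#include <string>

#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

// Sketch: atomically charge `price_cents` against an account balance.
bool ChargeCoffee(rocksdb::TransactionDB* db, const std::string& account,
                  long long price_cents) {
  rocksdb::Transaction* txn = db->BeginTransaction(rocksdb::WriteOptions());

  std::string balance_str;
  // GetForUpdate locks the key so nobody else modifies it mid-transaction.
  rocksdb::Status s =
      txn->GetForUpdate(rocksdb::ReadOptions(), account, &balance_str);
  if (!s.ok()) { delete txn; return false; }

  long long balance = std::stoll(balance_str) - price_cents;
  if (balance < 0) { txn->Rollback(); delete txn; return false; }

  s = txn->Put(account, std::to_string(balance));
  if (s.ok()) s = txn->Commit();
  delete txn;
  return s.ok();
}
```

Workload A itself issues independent reads and updates, but the latency of those two primitives is exactly what bounds how many such transactions you can validate per second.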

Brand                  | RAM consumption ↓ | SSD consumption ↓ | Speed ↑
WiredTiger by MongoDB  | 285.08 MB         | 169.37 GB         | 13K ops/s
LevelDB by Google      | 1.22 GB           | 112.75 GB         | 25K ops/s
RocksDB by Facebook    | 879.55 MB         | 112.24 GB         | 16K ops/s
UnumDB by Unum         | 4.03 GB           | 108.41 GB         | 32K ops/s

Workload C: Only Read Operations

Workload C

We skip the results of Workload B for clarity, as it is virtually identical to Workload D, which we will cover later. This chart shows the most unrealistic scenario, but it is included for completeness. It shows how fast read operations work if you only run one at a time. In real-world production, we batch the reads and perform them simultaneously to utilize the hardware fully. See our Workload Z for that!

Brand                  | RAM consumption ↓ | SSD consumption ↓ | Speed ↑
WiredTiger by MongoDB  | 177.50 MB         | 123.19 GB         | 13K ops/s
LevelDB by Google      | 2.20 GB           | 110.03 GB         | 19K ops/s
RocksDB by Facebook    | 681.04 MB         | 109.58 GB         | 16K ops/s
UnumDB by Unum         | 2.21 GB           | 108.41 GB         | 20K ops/s

Workload D: 95% Read, 5% Insert

Workload D

Unfortunately, this benchmark was not enough to saturate our engine and show it in full glory. From the structural design properties of our solution, we expect this number to grow by a factor of five for UnumDB in the 100 GB case. Still, for a fair comparison, we put it here. We are still faster than the others, but the difference is slim.

Brand                  | RAM consumption ↓ | SSD consumption ↓ | Speed ↑
WiredTiger by MongoDB  | 262.93 MB         | 169.40 GB         | 17K ops/s
LevelDB by Google      | 1.83 GB           | 112.44 GB         | 29K ops/s
RocksDB by Facebook    | 783.33 MB         | 112.67 GB         | 27K ops/s
UnumDB by Unum         | 2.24 GB           | 108.82 GB         | 32K ops/s

Can we do more? Can we do better?

YES! YCSB covers many workloads, but not all of them. At Unum, we do a lot of Deep Learning, and it is crucial to rapidly sample data from SSDs and stream it to our GPU clusters. The data leaves the DBMS, is packed into Apache Arrow buffers and is passed to PyTorch, TensorFlow or any other Machine Learning framework. We extended the original benchmark a bit to give you a sneak peek of what is coming!
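As a rough illustration of that hand-off, here is what packing a batch of fetched values into an Arrow column could look like with the Arrow C++ API. The actual Unum pipeline is not public, so treat this as a sketch:

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>

// Sketch: pack raw values fetched from the KVS into one Arrow binary column.
std::shared_ptr<arrow::Array> PackValues(
    const std::vector<std::string>& values) {
  arrow::BinaryBuilder builder;
  for (const auto& v : values) {
    auto st = builder.Append(reinterpret_cast<const uint8_t*>(v.data()),
                             static_cast<int32_t>(v.size()));
    if (!st.ok()) return nullptr;
  }
  std::shared_ptr<arrow::Array> column;
  if (!builder.Finish(&column).ok()) return nullptr;
  // The resulting buffers can be handed to PyArrow and on to DL frameworks
  // without extra copies.
  return column;
}
```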

Workload Z: Batch Selections

Machine Learning & Artificial Intelligence! The buzzwords of Generation Z, hence the name. Take away the fuzz, and what you see is decades of scientific and technological advances and the kind of hardware we could not imagine 20 years ago. GPUs, TPUs, IPUs - all of it costs money. Those chips are good at crunching data, but unless you can deliver it in time, the hardware sits there idle - wasting compute cycles and your money. And when money is on the table, you start optimizing!

Workload Z

Let us define this workload technically. In analytics, we never process data one-by-one. In this specific benchmark, each DBMS receives 256 randomized read requests in a single bundle. All of them target different random regions of the disk, so the DBMS must order the operations smartly to finish sooner.
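In engine terms, it is the difference between calling Get in a loop and handing the whole bundle over at once. Here is a hedged sketch of the latter through RocksDB's MultiGet; UnumDB's batch interface is not shown here, but the contract is the same:

```cpp
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"

// Sketch: fetch a bundle of keys in one call so the engine can reorder
// and overlap the underlying SSD reads.
std::vector<std::string> BatchRead(rocksdb::DB* db,
                                   const std::vector<std::string>& keys) {
  std::vector<rocksdb::Slice> slices(keys.begin(), keys.end());
  std::vector<std::string> values;
  std::vector<rocksdb::Status> statuses =
      db->MultiGet(rocksdb::ReadOptions(), slices, &values);
  // statuses[i] reports whether keys[i] was found; values[i] holds its payload.
  return values;
}
```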

Brand                  | RAM consumption ↓ | SSD consumption ↓ | Speed ↑
WiredTiger by MongoDB  | 223.45 MB         | 169.40 GB         | 17K ops/s
LevelDB by Google      | 1.97 GB           | 109.95 GB         | 29K ops/s
RocksDB by Facebook    | 631.83 MB         | 112.67 GB         | 17K ops/s
UnumDB by Unum         | 2.21 GB           | 108.82 GB         | 124K ops/s 🔞🔞🔞

As we can see, there is room for innovation even in something as primitive as the Key-Value Store. Today, we use this functionality to process datasets like the 200+ TB of CommonCrawl HTML or structured networks with trillions of relations. Essentially, everyone can benefit from this technology, and we will do our best to package it and ship it to our customers. The only thing left now is to assemble a DBMS interface layer on top of it. MongoDB did it, SnowFlake did it, ElasticSearch did it, so it should not be too hard.

  • Subscribe to receive similar articles 📨
  • Reach Us to try the fastest data processing software 🔥