2021 was terrific for DataBase Management Software and startups in general. While classical SQL is shrinking, the data-management market as a whole is booming at 17% CAGR and will reach $150 Billion in 2026, according to Gartner. That and the hype allowed dozens of DBMS startups to raise more capital last year alone than in their entire preceding decade-long history. For 13 companies in our previous extended comparison, it meant swallowing $4.5 Billion of VC money.


With so many players and such high stakes, there must be an evaluation metric - a way to separate the wheat from the chaff. There are two:

  • YCSB: Yahoo Cloud Serving Benchmark,
  • TPC: Transaction Processing Performance Council.

Those cover different workloads. The first is for Key-Value Stores (KVS), and the second is mostly for SQL DBMSs built on top of a KVS. So if you are building a DBMS, it makes sense to use both, one for the persistent data structures and one for the higher-level logic. As expected, we use both and outperform other players in both, but we will skip the TPC for now.

With around 4K ⭐ on GitHub, YCSB is the popular option. In the past, we have used it extensively, and our previous article covers a lot we can skip this time:

  • How 🦄 are built on top of open-source RocksDB and WiredTiger.
  • The liquid-cooled 👹 monster hardware we use for benchmarking.
  • 100 MB, 1 GB, 10 GB and 100 GB results.

As we have previously promised, we are back with expanded datasets and new optimizations, but they are not just inside UnumDB! After careful evaluation, we decided to rewrite the original YCSB package, extending and updating it along the way! Oh, and it’s open-source - check it on GitHub 🤗 If you just want to see the new results - here you go. Spoiler:

UCSB Benchmark Duration for RocksDB, MongoDB and UnumDB


Overall, designing new benchmarks is considered bad form. Especially if you are going to measure your own (hopefully upcoming) product, it makes it too easy to prioritize the operations you are good at and downplay the others. So we preserved the principal part of YCSB - its canonical random key generators and the three most misleading letters of the name 😅

We will talk about many things, including:

  • A benchmark for High-Performance Software must be High-Performance Software in itself.
  • Tracking hardware resource usage from a separate process, Valgrind style.
  • ACID guarantees and multithreading in Key-Value Stores.
  • Cost of running a DBMS in a Docker container.
  • How SLC, MLC and TLC NAND modes affect DBMS speed.
  • 1 TB results for RocksDB, UnumDB and the others.

If it sounds interesting, let’s jump in!

Performance is a Feature

The original YCSB was published over 10 years ago and targeted isolated DBMS applications. Those run in a separate process, in a different address space and communicate through sockets, often via plain-text commands. It was simple enough to be understandable and diverse enough to be broadly applicable, so it took off. People like us have applied it to systems that are much more “low-level” than, let’s say Amazon DynamoDB, Apache Cassandra or ElasticSearch.

In those 10 years, the hardware has changed. Let’s compare AMD CPUs from those two eras:

| | 2012 | 2022 |
| --- | --- | --- |
| Top CPU Model | Athlon II X4 651K | EPYC 7773X |
| Lithography | 32 nm | 7 nm |
| TDP | 100 Watt | 280 Watt |
| Core Count | 4 | 64 |
| Clock Frequency | 3.0 GHz | 2.2 - 3.5 GHz |
| Cache Size | 4 MB | 804 MB |
| PCIe | 20x Gen2 | 128x Gen4 |
| PCIe Bandwidth | 10 GB/s | 256 GB/s |
| RAM | 2x channel DDR3-1866 | 8x channel DDR4-3200 |
| RAM Bandwidth | 30 GB/s | 204 GB/s |

In reality, not all of that theoretical bandwidth is always available, but I guess you don’t need cpu-world.com to agree that CPUs changed!

The same applies to SSDs and GPUs. Storage-level technologies are heavily underutilizing the latter. The software must harness all of that speed and parallelism, but it’s only feasible in low-level languages.

Java & Java-like

All performant KVS are implemented in C++ and YCSB is implemented in Java. This means that you would need some form of a “Foreign Function Interface” to interact with the KVS. This immediately adds unnecessary work for our CPU, but it’s a minor problem compared to the rest.

Example 1

Every language and its ecosystem has different priorities. Java focuses on the simplicity of development, while C++ trades it for higher performance.

private static String getRowKey(String db, String table, String key) {
    return db + ":" + table + ":" + key;
}

The above snippet is from the Apples & SnowFlakes FoundationDB adapter inside YCSB, but it’s identical across the entire repo. It’s responsible for generating keys for queries. Here is what a modern recommended C++ version would look like:

auto get_row_key(std::string_view db, std::string_view table, std::string_view key) {
    return std::format("{}:{}:{}", db, table, key);
}

My entire Java experience is about 1 week long and happened over 10 years ago. So take the next section with a grain of salt.

From Java 7 onwards, the Java String Pool lives in the Heap space, which is garbage collected by the JVM. This code will produce a StringBuilder, a heap-allocated array of pointers to heap-allocated strings, later materializing in the final concatenated String. Of course, on-heap again. And if we know something about High-Performance Computing, the heap is expensive, but together with Garbage Collection and multithreading, it becomes completely intolerable. The same applies to the C++ version. Yes, we are doing only 1 allocation there, but it is also too slow to be called HPC. We need to replace std::format with std::format_to and export the result into a reusable buffer.
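The reusable-buffer idea from the paragraph above can be sketched as follows. To stay within C++17, this version swaps `std::format_to` for plain `std::string::append`; the intent is the same, and the helper name `get_row_key` with a caller-owned buffer is our own illustration, not code from YCSB or UCSB.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: rebuilds "db:table:key" inside a caller-owned buffer.
// `clear()` resets the length but, in practice, keeps the allocated capacity,
// so once the buffer has grown, later calls do not touch the heap.
inline std::string_view get_row_key(std::string& buffer, std::string_view db,
                                    std::string_view table, std::string_view key) {
    buffer.clear();
    buffer.append(db).push_back(':');
    buffer.append(table).push_back(':');
    buffer.append(key);
    return buffer; // A view into `buffer`, valid until the next call
}
```

Each worker thread would keep one such buffer for its whole lifetime, so that key generation stays off the allocator in the hot loop.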

Example 2

If one example is not enough, below is the code snippet that produces random integers before packing them into a String key.

long nextLong(long itemcount) {
    // from "Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, SIGMOD 1994
    if (itemcount != countforzeta) {
        synchronized (this) {
            if (itemcount > countforzeta) {
                ...
            else
                ...
        }
    }

    double u = ThreadLocalRandom.current().nextDouble();
    double uz = u * zetan;

    if (uz < 1.0)
        return base;
    if (uz < 1.0 + Math.pow(0.5, theta))
        return base + 1;

    long ret = base + (long) ((itemcount) * Math.pow(eta * u - eta + 1, alpha));
    setLastValue(ret);
    return ret;
}

To generate a long, we are doing numerous operations on doubles, by far the most computationally expensive numeric type on modern computers (except for integer division). Aside from that, this PRNG contains four if statements and a synchronized (this) mutex. Creating random integers for most distributions generally takes under 50 CPU cycles or 10 nanoseconds. In this implementation, every if branch may cost that much, and the mutex may cost orders of magnitude more!
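For contrast, here is a minimal sketch of how the same Gray et al. formula could be evaluated without any lock. It assumes the item count is fixed at construction (the YCSB code also handles a growing count, which this sketch does not), and every caller passes its own random engine. The class name `zipfian_t` and the default `theta` of 0.99 are our assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Zipfian ranks in [0, count), after Gray et al., SIGMOD 1994.
// All constants are computed once; sampling touches no shared mutable state.
class zipfian_t {
    std::uint64_t count_;
    double zeta_n_, alpha_, eta_, second_threshold_;

    static double zeta(std::uint64_t n, double theta) {
        double sum = 0;
        for (std::uint64_t i = 1; i <= n; ++i)
            sum += 1.0 / std::pow(static_cast<double>(i), theta);
        return sum;
    }

  public:
    explicit zipfian_t(std::uint64_t count, double theta = 0.99)
        : count_(count), zeta_n_(zeta(count, theta)), alpha_(1.0 / (1.0 - theta)),
          eta_((1.0 - std::pow(2.0 / static_cast<double>(count), 1.0 - theta)) /
               (1.0 - zeta(2, theta) / zeta_n_)),
          second_threshold_(1.0 + std::pow(0.5, theta)) {
        assert(count > 2 && theta > 0.0 && theta < 1.0);
    }

    template <typename engine_at>
    std::uint64_t operator()(engine_at& engine) const {
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(engine);
        double uz = u * zeta_n_;
        if (uz < 1.0) return 0;
        if (uz < second_threshold_) return 1;
        auto rank = static_cast<std::uint64_t>(
            static_cast<double>(count_) * std::pow(eta_ * u - eta_ + 1.0, alpha_));
        return rank < count_ ? rank : count_ - 1; // Guard against rounding at the upper edge
    }
};
```

With one `thread_local` engine per worker, threads can share a single `const zipfian_t` object and never contend. The floating-point math is still there, so this only addresses the locking part of the problem.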

It looks like a severe systemic problem to me, so we have searched for C/C++ ports.

Existing C++ Ports

We are not the first to consider porting, but the existing implementations aren’t popular. They solve the first issue by calling LevelDB, RocksDB or other C++ persistent data structure libraries without FFI, but they don’t solve the other problems.

inline uint64_t ZipfianGenerator::Next(uint64_t num) {
    assert(num >= 2 && num < kMaxNumItems);
    std::lock_guard<std::mutex> lock(mutex_);

    if (num > n_for_zeta_) { 
        RaiseZeta(num);
        eta_ = Eta();
    }
  
    double u = utils::RandomDouble();
    double uz = u * zeta_n_;
  
    if (uz < 1.0) 
        return last_value_ = 0;
    if (uz < 1.0 + std::pow(0.5, theta_))
        return last_value_ = 1;
    return last_value_ = base_ + num * std::pow(eta_ * u - eta_ + 1, alpha_);
}

Again, we are generating random numbers under a mutex, and utils::RandomDouble in turn calls a static std::default_random_engine. Even the most straightforward functions cause expensive on-heap copies and throw exceptions:

inline bool StrToBool(std::string str) {
    std::transform(str.begin(), str.end(), str.begin(), ::tolower);
    if (str == "true" || str == "1") {
        return true;
    } else if (str == "false" || str == "0") {
        return false;
    } else {
        throw Exception("Invalid bool string: " + str);
    }
}

While it is a step in the right direction, it still causes malloc-backed heap allocations at every function interface. Exceptions, for reference, are banned in half of the companies using C++, including Google.
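As an illustration, the same parsing could be done without copies or exceptions. This is a sketch, not code from any of the ports: the name `str_to_bool` and the choice of `std::optional` for reporting bad input are our assumptions.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Non-throwing, non-allocating counterpart of StrToBool.
// Returns std::nullopt for unrecognized input instead of throwing.
inline std::optional<bool> str_to_bool(std::string_view str) noexcept {
    // Compares `a` against an already lowercase `b`, folding ASCII upper case on the fly
    auto equals_ignoring_case = [](std::string_view a, std::string_view b) {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i != a.size(); ++i) {
            char c = a[i];
            if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
            if (c != b[i]) return false;
        }
        return true;
    };
    if (str == "1" || equals_ignoring_case(str, "true")) return true;
    if (str == "0" || equals_ignoring_case(str, "false")) return false;
    return std::nullopt;
}
```

The argument is taken as a `std::string_view`, so passing a string literal or a slice of a config file costs nothing, and the caller decides what to do with invalid input.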

Being Future-Proof

In 2023 we will be looking at up to 2 Million Random 4KB Read operations per second on the next-gen Intel Optane. With just 24 drives in a 2U chassis, we will be getting around 50 MOps/s or 200 GB/s. That is far more than your memory system can sustain with copies, let alone your heap allocator. Even the Linux kernel is expected to choke at 10 MOps/s, not to mention the JVM and most software ever written.

We started by using older C++ ports with wrappers for RocksDB, LevelDB and LMDB. Then:

  • We added a WiredTiger backend, which is the foundation of MongoDB.
  • We normalized and extended the configuration files.
  • We removed a few more of those Java-ish inefficiencies.

By that time, it was easier to throw away everything except a couple of classes and rebuild the rest using the Google Benchmark suite.

In every complicated situation start from scratch!

Ancient C++ Wisdom

New Workloads

Before getting into the tedious intricacies, let’s spice things up a little. YCSB had 6 mixed workloads, from A to F, plus initialization. Those mostly do Read+Update, Read+Insert, Read-Only and Write-Only operations.

Good, but not enough. Today State-of-the-Art Language Models are trained on CommonCrawl samples. That dataset contains 300 TB worth of HTML. It’s just one of many datasets used to solve one of many AI problems.

In NLP Everything Is Getting Bigger

To work with such volumes, we wanted more “verbs” than just “set” and “get”, but we had to trim some fat to keep it brief. Instead of 7 initial workloads, we have 8:

  • ∅: imports monotonically increasing keys 🔄
  • A: 50% reads + 50% updates, all random
  • C: reads, all random
  • D: 95% reads + 5% inserts, all random
  • E: range scan 🔄
  • ✗: batch read 🆕
  • Y: batch insert 🆕
  • Z: scans 🆕

The ∅ was previously implemented as one-by-one inserts, but some KVS support the external construction of their internal representation files. The E was previously mixed with 5% insertions.

How We Configured DBs?

We concentrated on benchmarking the following state-of-the-art Key-Value Stores:

  1. WiredTiger. Version 10.0.0.
  2. LevelDB. Version 1.23.
  3. RocksDB. Version 6.29.3 (fork of LevelDB).
  4. UnumDB. Version 0.2.
  5. LMDB is also supported, but it was too slow to include in the charts. Version 0.9.29.

UnumDB is currently in a pre-release form, but we use it internally in a broad set of configurations on terabyte-scale collections.

Memory Limits

Every KVS supports setting a RAM limit, which we choose to be 10% or, more commonly, 5% of the overall database size. It’s a typical server setting, as you generally have at least 10x less RAM than disk space. Desktop setups are even less balanced than that. Many users have just 16 GB of RAM and a 1 TB SSD, meaning a 60x gap.

Compression

None of the DBs relies on custom compression. They all reuse similar sets of open-source compression libs, like Snappy and Zlib. We want to benchmark the DBs and not the compression, so we disabled it across all deployments.

Disk Representation

RocksDB famously has multiple file formats. By default, it uses the BlockBasedTable for SSTs, but also provides the PlainTable and BlobDB with separate files for keys and values. The latter ones were unstable and feature-incomplete, so we took the default.

The configuration files for RocksDB, for example, contain over a hundred lines. They include settings for the Write-Ahead-Log, flushing guarantees, Skip-List capacity, file sizes, LSM tree growth factors, Bloom-filter specs and more. RocksDB has the biggest codebase and is probably the hardest to read, understand and maintain. The cloc utility measured its codebase to be ≈ 650'000 lines of code. Removing blanks and noise, here is what we get:

| | LevelDB | RocksDB | WiredTiger | UnumDB |
| --- | --- | --- | --- | --- |
| Code & Tests | ≈ 20 K | ≈ 375 K | ≈ 130 K | ≈ 19 K |
| Comments | ≈ 4 K | ≈ 107 K | ≈ 95 K | ≈ 3 K |
| Wrappers & Other | ≈ 3 K | ≈ 90 K | ≈ 85 K | ≈ 1 K |

Even though UnumDB is smaller than LevelDB, it’s fair to say that LevelDB has the most readable codebase. This, however, comes at the cost of performance.

RocksDB truly stands out in its complexity. So we tried to stick to default configs with minimal changes.

Supported Verbs

KVS variants differ in supported operations. Many are not available natively, so they were simulated using the fastest available functionality.

| | WiredTiger | LevelDB | RocksDB | UnumDB |
| --- | --- | --- | --- | --- |
| Insert | ✅ | ✅ | ✅ | ✅ |
| Select | ✅ | ✅ | ✅ | ✅ |
| Remove | ✅ | ✅ | ✅ | ✅ |
| Scan | ✅ | ✅ | ✅ | ✅ |
| Initialize | ✅ | ❌ | ✅ | ✅ |
| Batch Select | ❌ | ❌ | ✅ | ✅ |
| Batch Insert | ❌ | ✅ | ✅ | ✅ |

There is also asymmetry elsewhere:

  • WiredTiger supports fixed-size integer keys.
  • LevelDB only supports variable-length keys and values.
  • RocksDB has minimal support for fixed_key_len, incompatible with BlockBasedTable.
  • UnumDB supports both fixed-size keys and values.

Just like YCSB, we use 8-byte integer keys and 1000-byte values. Both WiredTiger and UnumDB were configured to natively use integer keys. The RocksDB wrapper reverses the order of bytes in keys, so that the native bytewise comparator sorts them in numeric order. None of the DBs was set to use fixed-size values, as only UnumDB supports that.
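To see why the byte order matters for a bytewise comparator, consider the sketch below. It is not the UCSB wrapper itself; the helper names are hypothetical. It serializes the key most-significant-byte first, which on a little-endian machine is equivalent to reversing the bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using key_bytes_t = std::array<unsigned char, 8>;

// Most significant byte first, so that a memcmp-style comparator
// orders the serialized keys the same way `<` orders the integers.
inline key_bytes_t to_big_endian(std::uint64_t key) noexcept {
    key_bytes_t bytes{};
    for (int i = 7; i >= 0; --i) {
        bytes[static_cast<std::size_t>(i)] = static_cast<unsigned char>(key & 0xFF);
        key >>= 8;
    }
    return bytes;
}

inline std::uint64_t from_big_endian(key_bytes_t const& bytes) noexcept {
    std::uint64_t key = 0;
    for (unsigned char byte : bytes) key = (key << 8) | byte;
    return key;
}

// What a default bytewise comparator effectively does with two keys
inline bool bytewise_less(key_bytes_t const& a, key_bytes_t const& b) noexcept {
    return std::memcmp(a.data(), b.data(), a.size()) < 0;
}
```

With raw little-endian bytes, 255 would sort after 256, because its first byte is 0xFF versus 0x00. After the conversion, the bytewise order and the numeric order agree.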

Caveats We Faced

If you use Google Benchmark, you know about its bunch of nifty tricks, like DoNotOptimize or the automatic resolution of the number of iterations at runtime. It’s widespread in micro-benchmarking, but it begs for extensions when you start profiling a DBMS. The ones shipped with UCSB spawn a sibling process that samples usage statistics from the OS. Like valgrind, we read from /proc/* files and aggregate stats like SSD I/O and overall RAM usage.
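A minimal sketch of the parsing side of such a sampler is shown below. It is not the UCSB code: the function name `read_proc_field` is hypothetical, and it only assumes the "Name: value" line layout that Linux uses in files like /proc/<pid>/status (for example `VmRSS`) and /proc/<pid>/io (for example `read_bytes`).

```cpp
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Scans "Name: value [unit]" lines and returns the number that follows
// the requested field name, or std::nullopt if the field is missing.
inline std::optional<std::uint64_t> read_proc_field(std::istream& stream, std::string const& field) {
    std::string line;
    while (std::getline(stream, line)) {
        bool matches = line.size() > field.size() &&
                       line.compare(0, field.size(), field) == 0 &&
                       line[field.size()] == ':';
        if (!matches) continue;
        std::istringstream rest(line.substr(field.size() + 1));
        std::uint64_t value = 0;
        if (rest >> value) return value;
        return std::nullopt;
    }
    return std::nullopt;
}
```

A sibling process would periodically open the file of the benchmarked PID with `std::ifstream`, pass it to this function, and aggregate the samples. Taking a generic `std::istream` keeps the parser testable without a live process.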

Durability vs Speed

Unlike for humans, ACID is one of the best things that can happen to a DBMS.

Like all good things, ACID is unreachable, because of at least one property - Durability. Absolute Durability is practically impossible and high Durability is expensive.

Most high-performance DBs are designed as Log-Structured Merge Trees. It’s a design that essentially bans in-place file overwrites. Instead, it builds layers of immutable files arranged in a Tree-like order. The problem is that until you have enough content to populate an entire top-level file, you keep data in RAM - in structures often called MemTables.

LSM Tree

If the lights go off, volatile memory will be discarded. So a copy of every incoming write is generally appended to a Write-Ahead-Log (WAL). Two problems here:

  1. You can’t have a full write confirmation before appending to WAL. It’s still a write to disk. A system call. A context switch to kernel space. If you want to avoid it with io_uring or SPDK, be ready to change all the above logic to work in an async manner, but fast enough not to create a new bottleneck. Hint: std::async will not cut it.
  2. WAL is functionally stepping on the toes of higher-level logic. Every wrapping DBMS generally implements such mechanisms itself, so it disables the WAL in the KVS to avoid extra stalls and replication. Example: Yugabyte is a port of Postgres to RocksDB and disables the embedded WAL.

We generally disable WAL and benchmark the core. Still, you can tweak all of that in the UCSB configuration files yourself.
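The write path described above can be reduced to a toy model. The sketch below is purely illustrative: the class `toy_store_t`, its record layout and the use of `std::map` as a MemTable are our assumptions, and a real WAL would also need checksums, recovery and an `fsync`-like call.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <ostream>
#include <sstream>
#include <string>

// Toy LSM-style write path: an optional log append, then a MemTable insert.
class toy_store_t {
    std::ostream* wal_; // nullptr means the WAL is disabled
    std::map<std::uint64_t, std::string> memtable_;

  public:
    explicit toy_store_t(std::ostream* wal = nullptr) : wal_(wal) {}

    void put(std::uint64_t key, std::string const& value) {
        if (wal_) {
            // Assumed record layout: 8-byte key, 4-byte length, raw value (native byte order)
            auto length = static_cast<std::uint32_t>(value.size());
            wal_->write(reinterpret_cast<char const*>(&key), sizeof(key));
            wal_->write(reinterpret_cast<char const*>(&length), sizeof(length));
            wal_->write(value.data(), length);
            // Pushes the bytes out of the stream buffer; this alone is not a durability guarantee
            wal_->flush();
        }
        memtable_[key] = value;
    }

    std::optional<std::string> get(std::uint64_t key) const {
        auto it = memtable_.find(key);
        if (it == memtable_.end()) return std::nullopt;
        return it->second;
    }
};
```

Constructing the store without a stream models the "WAL disabled" setting: writes only touch memory until a flush to an immutable file, which is the fastest but least durable configuration.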

Furthermore, as widely discussed, flushing the data still may not guarantee its preservation on your SSD. So pick your hardware wisely and tune your benchmarks cautiously.

Strict vs Flexible RAM Limits

When users specify a RAM limit for a KVS, they expect all of the required in-memory state to fit into that many bytes. It would be too obvious for modern software, so here is one more problem.

Fast I/O is hard. The faster you want it, the more abstractions you will need to replace.

graph LR
    Application -->|libc| LIBC[Userspace Buffers]
    Application -->|mmap| PC[Page Cache]
    Application -->|mmap+O_DIRECT| BL[Block I/O Layer]
    Application -->|SPDK| DL[Device Layer]
    LIBC --> PC
    PC --> BL
    BL --> DL

Generally, the OS keeps copies of the requested pages in a RAM cache. To avoid it, enable O_DIRECT. It will slow down the app and will require some more engineering. For one, all the disk I/O will have to be aligned to page sizes, generally 4KB, which includes both the address in the file and the address in the userspace buffers. Split-loads should also be managed with extra code on your side. So most KVS solutions (except for UnumDB, of course 😂) don’t bother implementing very fast I/O, like SPDK. In that case, they can’t even know how much RAM the underlying OS has reserved for them. So we have to configure them carefully and, ideally, add external constraints:

systemd-run --scope -p MemoryLimit=100M /path/ucsb
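As for the O_DIRECT alignment requirement mentioned earlier, the offset arithmetic can be sketched as below. The 4096-byte page size is an assumption (real code should query the device), and the names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t page_size_k = 4096; // Assumed; query the device for the real value

struct aligned_range_t {
    std::uint64_t offset; // Where the aligned read has to start
    std::uint64_t length; // How many bytes it has to cover
    std::uint64_t skip;   // Bytes to drop from the front of the buffer afterwards
};

// Expands an arbitrary (offset, length) request to page boundaries.
constexpr aligned_range_t align_for_direct_io(std::uint64_t offset, std::uint64_t length) {
    std::uint64_t start = offset / page_size_k * page_size_k;
    std::uint64_t end = (offset + length + page_size_k - 1) / page_size_k * page_size_k;
    return {start, end - start, offset - start};
}
```

A 200-byte value starting at file offset 4000 straddles a page boundary, so it turns into an 8192-byte read, which is the "split-load" case. The destination buffer must be page-aligned as well, for example via `posix_memalign`.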

Now a question. Let’s say you want to mmap files and be done. Anyways, Linux can do a far better job at managing caches than most DBs. In that case - the memory usage will always be very high but within the limits of that process. As soon as we near the limit - the OS will drop the old caches. Is it better to use the least RAM or the most RAM until the limit?

For our cloud-first offering, we will favour the second option. It will give the users the most value for their money on single-purpose instances.

Furthermore, we allow and enable “Workload Isolation” in UCSB by default. It will create a separate process and a separate address space for each workload of each DB. Between workloads, we flush the whole system. The caches filled during insertion benchmarks will be invalidated before the reads begin. This makes the numbers more reliable but limits concurrent benchmarks to one.

In one of the next articles we will write about the in-hardware Memory Management Unit and the Linux mmap implementation, so subscribe 🤗

Dataset Size & NAND Modes

Large capacity SSDs store multiple bits per cell. If you are buying a Quad Level Cell SSD, you expect each of them to store 4 bits of relevant information. That may be a false expectation.

SLC MLC vs TLC

The SSD can switch to SLC mode during intensive writes, where IO is faster, especially if a lot of space is available. In the case of an 8 TB SSD, before we reach 2 TB used space, all NAND arrays can, in theory, be populated with just one relevant bit.

SLC vs eMLC vs MLC vs TLC

If you are benchmarking the DBMS, not the SSD, ensure that you run all benchmarks within the same mode. In our case, for a 1 TB workload on 8 TB drives, it’s either:

  • starting with an empty drive,
  • starting with an 80% full drive.

Listing the Knobs

As you see, there is a lot to take into account, and everyone may be interested in a different setting. To sum things up, here is a functionality comparison between YCSB and UCSB.

| | Present in YCSB | Present in UCSB |
| --- | --- | --- |
| Size of the dataset | ✅ | ✅ |
| DB configuration files | ✅ | ✅ |
| Workload specifications | ✅ | ✅ |
| Tracking hardware usage | ❌ | ✅ |
| Workload Isolation | ❌ | ✅ |
| Concurrency | ❌ | ✅ |
| Batch Operations | ❌ | ✅ |
| Bulk Operations | ❌ | ✅ |
| Support of Transactions | ❌ | ✅ |

There is too much control-flow to tune, so instead of 1'000 CLI arguments, we organize them into a run.py Python script.

Results 💥

And then, what’s the point of writing a benchmark if you don’t get to run it! One of the comments to our previous post was wondering, why run small workloads on big machines? The answer is - to upscale the experiment within the same environment and analyze its scaling behaviour.

How we set the Knobs this time:

  • Transactions: ❌
  • Concurrent: ❌
  • Workload Isolation: ✅
  • Sizes: 10 GB, 100 GB, 1 TB.

Here is how long one iteration of the benchmark takes:

  • 10 GB: 42 minutes.
  • 100 GB: 5 hours, 54 minutes.
  • 1 TB: 2 days, 10 hours, 6 minutes.

Totalling at 2 days, 16 hours, 42 minutes. Benchmark duration by DBMS:

  • WiredTiger: 8 hours, 16 minutes.
  • LevelDB: 1 day, 15 hours, 19 minutes.
  • RocksDB: 12 hours, 8 minutes.
  • UnumDB: 4 hours, 59 minutes.

We reran those benchmarks many times with different settings. Every DBMS received its own empty 8 TB Samsung SSD. Later we will avoid slower DBs and will focus on bigger setups that don’t fit on one SSD:

  • Transactions: ✅
  • Concurrent: ✅
  • Workload Isolation: ✅
  • Sizes: 10 TB, 50 TB.

The 🥈 and 🥉 places often change, but the 🥇 leader remains constant, with the performance difference often being 2x - 5x against the second-best solution in each workload. Sometimes, this speed comes at the cost of using more RAM, but not always. If the gap between 🥇 and 🥈 is bigger than the gap within any other consecutive pair of entries in the leaderboard - we mark the result with 🏅.
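The marking rule can be written down in a few lines. This sketch assumes that the "gap" is the absolute difference in speed; the article does not spell that out, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// True when the lead of the fastest entry over the runner-up exceeds
// every other gap between neighbours in the sorted leaderboard.
inline bool leader_stands_out(std::vector<double> speeds) {
    if (speeds.size() < 3) return false;
    std::sort(speeds.begin(), speeds.end(), std::greater<double>());
    double lead = speeds[0] - speeds[1];
    for (std::size_t i = 2; i < speeds.size(); ++i)
        if (speeds[i - 1] - speeds[i] >= lead) return false;
    return true;
}
```

Under this reading, the speeds of workload A (197.3K, 115.2K, 101.1K, 86.8K) qualify for the mark, while the speeds of workload C (175.7K, 160.7K, 146.6K, 30.6K) do not, because the gap between the third and the fourth entry is larger than the lead.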

0: Bulk Initialization

Initializing the KVS is done via monotonically ascending keys. The original YCSB initialization always happens one key at a time. We went one step further and implemented a bulk insert functionality. When possible, it constructs big DB-compatible files externally and then submits them into KVS. WiredTiger, RocksDB and UnumDB natively support that, but LevelDB doesn’t.

This is vastly faster than inserting data one by one and even in batches! Think of it as your zero-to-hero time: how fast can you import all your Parquet files from S3 buckets before you start working with them?

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 1.00 cores | 3.99 GiB | 989.77 GiB | 1.1M 🥈 |
| LevelDB ⊃ Google | 0.99 cores | 904.12 MiB | 984.19 GiB | 30.2K |
| RocksDB ⊃ Facebook | 0.89 cores | 2.65 GiB | 976.27 GiB | 385.3K 🥉 |
| UnumDB ⊃ Unum | 1.00 cores | 3.90 GiB | 968.58 GiB | 2.2M 🥇🏅 |

A: 50% Random Reads + 50% Random Updates

Every operation is a random single-element operation. Half of them are reads, and half are updates for existing keys.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 3.19 cores | 23.67 GiB | 1.02 TiB | 101.1K 🥉 |
| LevelDB ⊃ Google | 1.78 cores | 2.54 GiB | 989.39 GiB | 115.2K 🥈 |
| RocksDB ⊃ Facebook | 1.50 cores | 4.95 GiB | 978.55 GiB | 86.8K |
| UnumDB ⊃ Unum | 1.00 cores | 10.01 GiB | 976.32 GiB | 197.3K 🥇🏅 |

C: 100% Random Reads

Again, not a particularly interesting benchmark, but a common case when dealing with poorly optimized software. Use the batched approach whenever possible.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 1.39 cores | 17.43 GiB | 989.77 GiB | 146.6K 🥉 |
| LevelDB ⊃ Google | 1.12 cores | 105.41 MiB | 984.40 GiB | 30.6K |
| RocksDB ⊃ Facebook | 0.97 cores | 4.72 GiB | 976.27 GiB | 160.7K 🥈 |
| UnumDB ⊃ Unum | 0.98 cores | 1.42 GiB | 972.30 GiB | 175.7K 🥇 |

D: 95% Random Reads + 5% Random Inserts

Unlike A, this benchmark inserts new key-value pairs instead of updating the previously existing ones.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 2.03 cores | 18.48 GiB | 1.02 TiB | 182.4K 🥈 |
| LevelDB ⊃ Google | 1.85 cores | 232.08 MiB | 1.01 TiB | 20.1K |
| RocksDB ⊃ Facebook | 0.99 cores | 5.04 GiB | 1.00 TiB | 173.3K 🥉 |
| UnumDB ⊃ Unum | 1.01 cores | 17.71 GiB | 1022.78 GiB | 189.9K 🥇 |

E: Range Select

Here we randomly select a key and then retrieve the following 100 values. One can easily change the scan length through settings and even define it through provided probability distributions.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 0.38 cores | 4.54 GiB | 989.77 GiB | 250.2K 🥈 |
| LevelDB ⊃ Google | 0.32 cores | 50.79 MiB | 983.99 GiB | 236.3K 🥉 |
| RocksDB ⊃ Facebook | 0.34 cores | 4.72 GiB | 976.27 GiB | 177.3K |
| UnumDB ⊃ Unum | 0.52 cores | 344.49 MiB | 972.30 GiB | 384.3K 🥇🏅 |

X: Batch Reads

It is a benchmark of “batch selections”, where instead of submitting one read operation at a time and waiting for it synchronously, you request a batch of, let’s say, 256 random keys. It enables the DBMS to execute them in a batch-asynchronous fashion, reordering separate operations internally for speed.

This workload is significant for BI and analytical workloads, network/graph analysis being a perfect example. At every step of your dataset exploration, you are fetching data from very different parts of your dataset to provide insights, and this requires fast “batch selections”!
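One simple form of such internal reordering is to serve the batch in key order and return the results in request order. The sketch below uses `std::map` as a stand-in for a real KVS; it illustrates the idea and makes no claim about how any of the benchmarked engines implement it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

using store_t = std::map<std::uint64_t, std::string>; // Stand-in for a real KVS

// Looks keys up in ascending order, then reports results in the order they were requested.
inline std::vector<std::optional<std::string>> batch_select(store_t const& store,
                                                            std::vector<std::uint64_t> const& keys) {
    // Sort positions instead of keys, so every result can be scattered back to its slot
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    std::vector<std::optional<std::string>> results(keys.size());
    for (std::size_t position : order) {
        auto found = store.find(keys[position]);
        if (found != store.end()) results[position] = found->second;
    }
    return results;
}
```

On a disk-backed store, visiting keys in sorted order lets neighbouring lookups share pages and lets the engine issue the underlying reads together, which is where a batch interface can gain over one-at-a-time requests.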

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 0.50 cores | 14.29 GiB | 989.77 GiB | 40.9K 🥉 |
| LevelDB ⊃ Google | 0.64 cores | 55.87 MiB | 983.99 GiB | 16.5K |
| RocksDB ⊃ Facebook | 0.35 cores | 9.07 GiB | 976.27 GiB | 51.8K 🥈 |
| UnumDB ⊃ Unum | 0.37 cores | 830.48 MiB | 972.30 GiB | 303.8K 🥇🏅 |

Moreover, today AI researchers mostly train their neural networks after uniformly shuffling the datasets. It is a simple approach, but is it the best strategy if we can rapidly sample batches?

Y: Batch Insert

Instead of inserting one value at a time, like in D, we submit batches of up to 10'000 values, depending on the DB size. It’s natively supported by LevelDB, RocksDB and UnumDB, as listed in the table of supported verbs.

This is different from a transaction with 256 new entries in it, as if one of the operations fails, we want the remaining to proceed. Transactions add the rollback functionality and will be the subject of future publications.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 1.00 cores | 227.22 MiB | 1.02 TiB | 450.3K 🥉 |
| LevelDB ⊃ Google | 0.99 cores | 602.71 MiB | 1.11 TiB | 41.1K |
| RocksDB ⊃ Facebook | 1.71 cores | 5.63 GiB | 1.10 TiB | 690.2K 🥈 |
| UnumDB ⊃ Unum | 1.00 cores | 2.38 GiB | 1.04 TiB | 813.2K 🥇 |

Z: Scan

It streams all the data present in the store, which may be later sampled and forwarded into some neural-network training procedure.

| Brand | CPU Usage | RAM Usage | Disk Usage | Speed |
| --- | --- | --- | --- | --- |
| WiredTiger ⊃ MongoDB | 1.22 cores | 16.80 GiB | 989.77 GiB | 1.4M 🥈 |
| LevelDB ⊃ Google | 0.59 cores | 7.36 GiB | 983.99 GiB | 1.2M 🥉 |
| RocksDB ⊃ Facebook | 0.14 cores | 4.72 GiB | 976.27 GiB | 84.5K |
| UnumDB ⊃ Unum | 0.34 cores | 16.89 GiB | 968.58 GiB | 1.7M 🥇 |

When comparing LevelDB and RocksDB, we see their rivalry. One is a lot better for sequential scan-like workloads. In contrast, the other one is better at randomized operations. The most astonishing result for UnumDB - it is better at both! WiredTiger often also performs well, but almost always uses the most energy and memory to get there.

| Brand | 0 | A | C | D | E | X | Y | Z |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WiredTiger | 🥈 | 🥉 | 🥉 | 🥈 | 🥈 | 🥉 | 🥉 | 🥈 |
| LevelDB | | 🥈 | | | 🥉 | | | 🥉 |
| RocksDB | 🥉 | | 🥈 | 🥉 | | 🥈 | 🥈 | |
| UnumDB | 🥇🏅 | 🥇🏅 | 🥇 | 🥇 | 🥇🏅 | 🥇🏅 | 🥇 | 🥇 |

This may not sound relevant, but it is. Most companies operate 2 data stores separately - one for small real-time transactions and another for big analytical queries. This means more engineers, complex synchronization and outdated analytics. We hope to replace that entire mess with a single solution.

Is this the limit? ⚡

Far from it! We run those benchmarks with our hands essentially cuffed. Here is a broader picture:

  • when put in a Docker container, most KVS slow down by 15%, UnumDB - by around 7%.
  • on ARM, the gap between UnumDB and the second-best solution is generally more prominent than on x86 due to our adoption of the Neon instruction family.
  • multithreaded and transactional operations are again a lot faster, as we invest heavily into concurrent in-memory data structures far more performant and memory-efficient than a concurrent Skip-List can be.
  • initialization order has a significant effect on the speed of measured operations - when the keys aren’t monotonically rising, other DBs can often be 2x-5x slower.
  • with GPUs enabled, the gap becomes astronomical, but it’s not a fair comparison, as GPUs would draw more power: 400W per GPU vs 300W when fully loading a 64-core CPU.

So prepare to see broader benchmarks soon! Anyways, our KVS performance seems solid for now, and we are focusing on:

  • horizontal scaling and cloud deployment,
  • going from string values to a higher-level type system,
  • supporting arbitrary JSON-like inputs,
  • an SQL-compatibility layer,
  • bindings for dynamic scripting languages.

It might seem too ambitious for a side-project of an AI-oriented team, but we are not stopping any time soon! It’s not just about examining more data faster. It’s also about studying the same data efficiently. In 2020 alone:

  • data transmission networks consumed 260-340 TWh.
  • data centre electricity use was 200-250 TWh.
  • cryptocurrency mining consumed ~100 TWh.

Energy Efficiency at Unum

Totalling at 600-700 TWh, or what the entire 1.26 Billion population of Africa consumed in that same year. Or over 100x the energy consumption of our home country.

With that in mind, we invite everyone to think twice about the tools we use. At least if we are genuinely committed to making the world a better place through software. It takes only a couple of search queries to realize that CuPy can be 1'000x faster than NumPy. Similarly, other startups modernize the streaming layer, like RedPanda replacing Apache Kafka. We have already announced our non-DBMS research directions for 2022, but all the solutions we are releasing this year target the storage layer. Any computer takes at least 100 microseconds to fetch anything from a fast SSD, meaning that accessing data can be over 200'000x more expensive than processing it. It’s the slowest piece of the modern computer, so our UnumDB should impact millions of applications widening their most critical bottleneck!

C++ Meetups in Yerevan

So far, I have been personally spending millions on our research since 2015. Offices, high-end servers, R&D teams, you name it 😋 Now, we are expanding further, attracting brilliant, inspiring researchers and organizing more technical conferences than ever. What previously felt like an impossible dream of making 🇦🇲 the second 🇮🇱, almost feels like a passed step! If you want to do research with us, join C++ conferences as a speaker or invest, just let us know! Anyways, who would need Oracle or MongoDB once there is UnumDB? 😉
