I recently read an article by Andy Pavlo, one of the best-known people in the database world, reflecting on the database market in 2021. He calls today - "the golden age of DataBase Management Software" (DBMS), and he is right! In one year, many of the startups in the space have raised more than in their entire previous decade-long histories!
When you see separate news on yet another DBMS startup, their rounds and their promises - you read it, you agree with it, and you forget it. But once you stitch everything together, every piece of the puzzle, every article, the numbers just become absurd! Let’s investigate!
To skip the disclaimer and cut to the chase, jump here:
To discuss on HackerNews, jump into this thread.
Before we go further, I want to remind readers that we are hardly unbiased. We have an internal DBMS project - UnumDB, that lets us analyze Peta-scale datasets without Google-scale budgets. We needed something truly fast for our analytical workloads, and when everything else failed, we wrote a new DBMS from scratch! Now it’s in private beta, but it will be a publicly available product next year. So I will be talking about our future competitors, and you should probably discount for that.
Second, we are quite critical of modern programming practices, and of business and money in general. We will see some big numbers below, but big numbers are hardly an indicator of rapidly advancing technology. Often the opposite.
People immediately forget about deep fundamental research or engineering when the hype comes. Building a software business turns into fitting buzzwords into content. The more "AI" and "Cloud" you can fit on your landing page, the better the outcome. We of all people should know, it's in our domain name 😂. Whether these companies truly need, and can productively digest, all of that capital remains to be seen.
Big money means big responsibility, so let’s just hope they will use it for the best of our industry!
Why is there Action?
Database Management Software has been a pretty dull corner of IT, at least until recently. The last time it was hip, Oracle was being built. Then, a few decades and a ton of open-source software later, new companies emerged.
| Startup | Raised pre-IPO | IPO Date | Raised @ IPO | Current Cap | Focus |
|---|---|---|---|---|---|
Fueled by seemingly infinite VC pockets, all of them do essentially the same thing - storage infrastructure. They may call it a "data lake", a "Database Management Software" (DBMS), a "cloud-native storage solution", or something else, but it's just a piece of code that saves data to disk and retrieves it. The faster and more scalable your solution is, the better your technology. The metric is as simple as that.
The fact that the company could raise 💲1.4 Billion before going public is already shocking. On their public debut, the company was valued at 💲22B. A few hours later - it skyrocketed to 💲75B, attracting even wider swarms of VCs to this seemingly forgotten IT sector. On that day, Snowflake drew 💲3.36 Billion more, bringing its total to 💲4.76B. There are over 100 countries worldwide with annual government budgets smaller than that, including countries with populations of over 25 million!
How was the money spent?
| Company | Revenue | Cost of Revenue | G&A | M&S | R&D | EBIT |
|---|---|---|---|---|---|---|
Everyone closed this year at a net loss. All had revenues between 💲500M and 💲600M, with the cost of revenue at around 25%-40%. Generally, DBMS vendors only target the top 3 clouds for managed deployments, meaning that these 3 DBMS brands paid about 💲580M to AWS, Azure and GCP. We must also remember that every one of the cloud brands has its own adaptation of MongoDB and Elastic - I can only imagine how much revenue Amazon DocumentDB brings in. The part that is clear: none of the companies spent more on R&D than on M&S, and we can feel that in the accumulated technical debt.
The Technical Debt
GAAP and IFRS both have rules regarding what counts as R&D spending. However, companies are never forced to elaborate on how that R&D further splits into different projects within the company. Let's take Elastic as an example. From their income statements, R&D budgets were:
- 2019: 💲100M
- 2020: 💲165M
- 2021: 💲199M
They have spent over 💲450M in three years on R&D alone. As of right now, I would describe the ELK stack as two parts (even though it's technically three products): storing data and visualizing it - the old core and the new addition.
How old, might you ask? The oldest commit in the ElasticSearch repository is from 2010, but the core is much older. ElasticSearch is just a wrapper around the open-source Lucene library, and the latter was started in September 2001. It's still maintained, but early design decisions invariably shape the outcome. The world was very different back in 2001. The first dual-core CPUs were just announced and would take another four years to reach the desktop market, in 2005! Features such as multi-threading, a first-class citizen in modern software, could only be retrofitted there.
I guess, about 20 years ago, people still hoped that high-performance software could be written in high-level languages. Today we know that it doesn't work. You need C, C++, or Assembly. If you have a long-running application, you must think about memory allocation strategies beforehand, to reduce the unavoidable memory fragmentation in reference-counting environments and to avoid multi-threaded stalls in the Garbage-Collected ones.
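To make the allocation point concrete, here is a minimal, illustrative sketch of an "arena" (bump) allocator - an assumed toy design, not UnumDB's actual strategy - that reserves memory up front so a long-running service never churns the general-purpose heap:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "arena" allocator: one block is reserved up front,
// allocations just bump an offset, and everything is released at once,
// so the general-purpose heap never fragments over the service lifetime.
class Arena {
    std::vector<std::byte> buffer_;
    std::size_t offset_ = 0;

  public:
    explicit Arena(std::size_t capacity) : buffer_(capacity) {}

    // `alignment` must be a power of two.
    void* allocate(std::size_t size, std::size_t alignment = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > buffer_.size())
            return nullptr; // arena exhausted - the caller decides what to do
        offset_ = aligned + size;
        return buffer_.data() + aligned;
    }

    // Releases every allocation at once, in O(1) - no per-object frees.
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }
};
```

Real designs add thread-local arenas and chained blocks, but the core idea - pre-allocate once, free in bulk - is precisely what reference-counted and Garbage-Collected runtimes make hard to express.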
I am not even touching the numerous vulnerabilities that a big & popular piece of high-level open-source software would collect over the years. Such projects have so many dependencies that security issues like the recent Log4J become unavoidable. Elastic suffered from that vulnerability just as much as every other data processing tool written in Java. The world changes, the hardware changes, the software must change too. Due to the engineering problems above, Elastic isn’t vertically scalable by design.
The same applies to theoretical research. If you were to design a text search two decades ago, without deep Theoretical Computer Science (CS) training, you would probably build an "inverted index". Simply speaking, you keep a list of document identifiers for every word, describing where that word is present. Once fetched, search results are scored with TF-IDF, an algorithm published in 1972. "Substring Search" is one of the most studied problems in CS. Dozens of efficient algorithms have been published in the last 50 years. Of course, a complex algorithm with the lowest asymptotic curve doesn't always translate into faster code. Still, it would be nice to see tech startups competing in ingenuity and investing in fundamental research or advanced engineering. Not the vibrancy of their dashboards.
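For flavour, here is a toy sketch of that classical design - a hypothetical inverted index with TF-IDF scoring, not how Lucene or any real engine implements it:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy inverted index with classical TF-IDF scoring - a sketch of the
// textbook design described above. Document ids are assumed dense, from 0.
struct InvertedIndex {
    // word -> (document id -> occurrences of the word in that document)
    std::map<std::string, std::map<int, int>> postings;
    std::vector<int> doc_lengths; // words per document

    void add_document(int id, const std::string& text) {
        std::istringstream stream(text);
        std::string word;
        int length = 0;
        while (stream >> word) {
            ++postings[word][id];
            ++length;
        }
        if (id >= (int)doc_lengths.size())
            doc_lengths.resize(id + 1, 0);
        doc_lengths[id] = length;
    }

    // TF-IDF: term frequency within the document, times the log of the
    // inverse document frequency across the whole collection.
    double score(const std::string& word, int doc) const {
        auto posting = postings.find(word);
        if (posting == postings.end())
            return 0.0;
        auto hit = posting->second.find(doc);
        if (hit == posting->second.end())
            return 0.0;
        double tf = double(hit->second) / doc_lengths[doc];
        double idf = std::log(double(doc_lengths.size()) / posting->second.size());
        return tf * idf;
    }
};
```

Rarer words score higher via the `idf` term - that is the entire trick, and it has not fundamentally changed since 1972.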
The whole situation may not be so dramatic, but it's how I see most of the startups below. Still, there must be a place for one more Snowflake in such a vast market. Who is it going to be?
The New Cohort, the Wannabes
I may have ended the last part with too obvious a question. Everyone knows who will be the next Snowflake - it's Databricks! They go after the same customers, they compete in the same benchmarks, they have raised mind-boggling sums of money, and both were founded by a long list of names, most of whom held leadership positions in the most prominent American enterprises. Both are, of course, headquartered in Silicon Valley.
I took the first 20 DBMS startups I could remember that have raised over 💲50M during their lifetimes.
| Company | Raised in 2021 | Total Raised | Valuation | Total Rounds | Share Raised Last Year |
|---|---|---|---|---|---|
| Greenplum (∈ EMC) | 💲0M | 💲96M | 💲0B | 6 | 0% |
All the data was gathered from Crunchbase. The valuations aren't always reported, so I had to make a few guesses here and there, rounding to the nearest billion. This list may not be exhaustive, but it's impressive nevertheless. Let's do some arithmetic:
- There are a total of 20 famous DBMS startups on that list.
- Together they raised a total of 💲7.7 Billion throughout their lifetime.
- Out of 20, 13 have raised in 2021.
- Those 13 companies have attracted 💲6.8 billion over their lifetime.
- Out of that 💲6.8 Billion, 💲4.5 Billion was attracted in 2021 alone.
- This year they raised 2x more than in all previous years combined!
Graphs & Series vs Tables & Docs
Databricks, Redis, Cockroach Labs and Neo4J seem to be the closest to IPOs, but only Redis has publicly announced the intention. It is expected to go public in May 2022 at a 💲5B valuation. Neo4J rebranded this year and raised three rounds between June and November - 💲174M + 💲152M + 💲66M, according to Crunchbase.
Investors across all those companies intersect a lot. In alphabetical order: Altimeter, Andreessen Horowitz, Benchmark, Coatue, Index, Insight, Redpoint and Tiger Global. The usual suspects. This year they were mostly interested in the following DBMS sectors:
- Graphs, like Neo4J, TigerGraph
- Time-Series, like ClickHouse
- In-Memory, like Redis
The classics are still king, however. Gartner estimates that the DBMS market will reach 💲150 Billion by 2026. Tabular (commonly called relational) databases still represent the most significant yet shrinking part of the market. Snowflake, Databricks, PlanetScale, CockroachDB, TiDB, Yugabyte and Greenplum are all examples of companies solving the same problem - horizontal scalability of relational DBs. It proved to be a substantial limitation of open-source Postgres and MySQL, so myriads of startups started patching, forking and wrapping them, solving one scalability issue at a time. Greenplum was doing that in 2005. PlanetScale began doing it in 2018. Time goes by, but very few things change.
Luckily, non-tabular DBs have spiced things up! They are growing so fast that they are stretching the whole market by an average of 17%/year! Of those, Mongo is on top! Couchbase was its rising competitor until it flopped during its IPO this year; it is now valued at 💲1B, 35x less than Mongo. Out of the big players, MongoDB seems the most promising to us. Their stack is C++. They don't try to adapt SQL, which is excellent! They have a JSON-like query language, which isn't great, but gets most of the work done. Hopefully, it's just a step towards abandoning text-based communication protocols altogether and moving to the bright future of Remote Procedure Calls.
Mongo also made a bunch of notable strategic acquisitions over its history. One of them, Realm, was the most ambitious and auspicious persistent storage engine for mobile - great for the Mongo developer community and great for the future of building apps! Most importantly, though, MongoDB acquired WiredTiger - an open-source project that implements a persistent key-value store (KVS), the essential piece of software for building a DBMS company. Only a few companies have ever made decent persistent data structures:
As you can imagine, it's pretty hard to get right if only the titans of the industry even attempt it 😁 Everyone who wants to be a trillion-dollar internet company should have a good KVS. Oracle probably doesn't have one. That's why they are worth only 💲250B 😂
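For readers unfamiliar with the term, here is a hedged sketch of the narrow surface such an engine exposes to the query layers above it - hypothetical names, with an in-memory `std::map` standing in for the on-disk B-tree or LSM-tree that WiredTiger and RocksDB actually implement:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Illustrative KVS interface - not WiredTiger's or RocksDB's actual API.
// Real engines add transactions, compression and compaction underneath,
// but the surface the rest of the DBMS sees stays roughly this small.
class KeyValueStore {
    std::map<std::string, std::string> tree_; // stand-in for on-disk storage

  public:
    void put(const std::string& key, const std::string& value) { tree_[key] = value; }

    std::optional<std::string> get(const std::string& key) const {
        auto it = tree_.find(key);
        if (it == tree_.end())
            return std::nullopt;
        return it->second;
    }

    bool erase(const std::string& key) { return tree_.erase(key) > 0; }

    // Ordered range scan over [from, to) - the operation that separates
    // real KVS designs (B-trees, LSM-trees) from plain hash tables.
    template <typename Callback>
    void scan(const std::string& from, const std::string& to, Callback&& callback) const {
        for (auto it = tree_.lower_bound(from); it != tree_.end() && it->first < to; ++it)
            callback(it->first, it->second);
    }
};
```

Making those four operations fast, crash-safe and concurrent on persistent media is where the decade of engineering goes.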
A considerable part of the startups in the list above is built by composing RocksDB with an open-source query evaluation engine. In comparatively good cases, a startup would choose one piece of a database to optimize and focus on that tiny part, never acknowledging the whole picture. It's less risky, more understandable, and VCs are ready to fund it!
2021 for Unum & UnumDB
On the other side of the globe, we are the upside-down ones¡ The geeks, the freaks, the nerds, who have been doing weird risky fun things for the past 6 years. The ones who assemble their liquid-cooled servers to design the fastest software on bare-metal!
This year was huge for us. Here are some of our 2021 achievements and hopes for 2022:

- We outperformed CUTLASS on Nvidia Ampere, Turing & Volta GPUs in classical GEMM workloads on some matrix sizes. I guess it is the wet dream of every HPC developer - to design a fast matrix multiplication kernel. We are not implementing the whole GraphBLAS specification in any case. We intersect with it in functionality but mostly target very large, sparse, low-precision matrix-to-matrix operations, like the ones needed for training modern Artificial Neural Networks.
- We progressed from fine-tuning pre-trained Transformer Neural Networks, as we had done since 2017, to training them from scratch. It's exceptionally computationally demanding and requires research in numerically-stable optimization algorithms. People seem to favour large-batch methods like LARS, but the evidence of their superiority over AdamW isn't always conclusive.
- We increased our fleet of AVX-512-enabled x86 machines and ARM CPUs, which led to even more exciting optimizations in our internal frameworks, mainly compression and search. Still waiting for proper SVE2 hardware support in 2022.
- Data-centers will receive ginormous upgrades next year, with budgets sometimes exceeding the numbers in this article. We will be upgrading too - here is what everyone will be buying! I can't wait to have 1 PB of low-latency NVMe storage in a 1U chassis.
- Linux is increasingly becoming a bottleneck for High-Performance Computing, especially High-Performance I/O. The ceiling seems to be around 10M operations/second, or about 8x Intel Optane SSDs per storage node. We are fully utilizing GPUDirect Storage and actively working with SPDK to bypass the Linux kernel and the file system altogether. The SSD becomes a block-addressable asynchronous device: we decide what to write and at which addresses. This comes at extreme engineering costs but seems wholly justified. It will further amplify what is already an order-of-magnitude speed improvement in randomized read operations, critical for analytical BI applications. We have recently released our 100 GB/node benchmarks and just finished conducting 1 TB/node benchmarks:
At this scale, the staggering difference in performance translates into stress tests and benchmarks that can run for days, weeks or even months, similar to training State-of-the-Art neural networks. Even on something as small as 1 Billion entries, WiredTiger would crash while other DBs would continue, so we had to restart the benchmarks a bunch of times. UnumDB, on the other hand, spent less time finishing the 10 TB benchmark than others spent on the 1 TB one. Seems like work worth continuing, right?
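Returning to the kernel-bypass point above: once the kernel's caches and file system are out of the way, the application must respect device-block alignment itself. SPDK's user-space NVMe API is far richer, but a purely illustrative sketch of that discipline - using ordinary POSIX `pread` and an assumed 4 KB logical block size - looks like this:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Illustration only, not SPDK or UnumDB code: reading an arbitrary byte
// range by widening it to device-block boundaries, the access pattern that
// O_DIRECT and user-space NVMe drivers make mandatory.
constexpr size_t kBlock = 4096; // assumed logical block size

inline size_t round_down_to_block(size_t offset) { return offset & ~(kBlock - 1); }

// Reads `len` bytes at an arbitrary `offset` by issuing one block-aligned
// pread into a block-aligned buffer, then copying out the requested slice.
ssize_t read_unaligned(int fd, void* dst, size_t len, size_t offset) {
    size_t start = round_down_to_block(offset);
    size_t span = ((offset + len + kBlock - 1) & ~(kBlock - 1)) - start;
    void* buffer = nullptr;
    if (posix_memalign(&buffer, kBlock, span) != 0)
        return -1;
    ssize_t got = pread(fd, buffer, span, start);
    bool enough = got >= (ssize_t)(offset - start + len);
    if (enough)
        memcpy(dst, (char*)buffer + (offset - start), len);
    free(buffer);
    return enough ? (ssize_t)len : -1;
}
```

With O_DIRECT or SPDK the alignment is not optional; here, on a regular file descriptor, the sketch merely mimics the access pattern.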
Together with the upcoming hardware upgrades, such software could store all the data and run all the analytics of a Fortune 500 company on just a couple of servers - with minimal latency, higher flexibility, tremendously lower costs and better security than today!
What a time to be an engineer! No matter where you are, in Africa or America, in Europe or Asia, or at the crossroads, like us, we all have equal access to knowledge! Just open up arxiv.org, and let’s start hacking, like in the old days! Good luck and much health to all of us in 2022!