A single software company can spend over 💲10 Billion a year on data centres, but not every year is the same. When all the stars align, bursts of new technologies reach the market simultaneously, restarting the purchasing super-cycle. 2022 will be just such a year, so let’s jump a couple of quarters ahead and see what’s on the shopping list of your favourite hyperscaler! Friendly warning: this article is full of technical terms and jargon, so it may be hard to follow if you don’t write code or haven’t assembled a computer before.
Everyone in the tech industry has probably heard the words “data lake”, “data warehouse” and “database management system”, but have you ever wondered how all of that is built? According to Gartner, the DBMS market was estimated at 💲65 Billion in 2020 and will reach 💲150 Billion by 2025. This article is about the technology that underpins that entire industry: how modern IT infrastructure is created.
David Patterson recently noted (paraphrasing): programmers may benefit from using complex instruction sets directly, but it is increasingly challenging for compilers to generate them automatically in the right spots. Over the last 3-4 years I have given a number of talks on the intricacies of SIMD programming, highlighting the divergence of hardware and software design over the past decade. Chips keep growing bigger and more complicated to add functionality, but general-purpose compilers like GCC, LLVM, MSVC and ICC cannot keep up with the pace.
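To make that divergence concrete, here is a minimal sketch (not from the original talks): two functions computing the same dot product, where only the access pattern differs. Mainstream auto-vectorizers handle the first shape well, but typically produce poor or no SIMD code for the second, even though modern ISAs like AVX-512 have gather instructions designed for exactly it.

```cpp
#include <cstddef>
#include <vector>

// A plain reduction over contiguous data: GCC and LLVM vectorize this
// shape reliably at -O2/-O3.
double dot_contiguous(const std::vector<double>& a,
                      const std::vector<double>& b) {
    double sum = 0;
    for (std::size_t i = 0; i != a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// The same reduction through an index indirection (a "gather" pattern):
// compilers often fall back to scalar code here, despite the hardware
// having dedicated gather instructions for this access shape.
double dot_gather(const std::vector<double>& a,
                  const std::vector<double>& b,
                  const std::vector<std::size_t>& idx) {
    double sum = 0;
    for (std::size_t i : idx)
        sum += a[i] * b[i];
    return sum;
}
```

Inspecting the generated assembly (e.g. on Compiler Explorer) for the two functions is the quickest way to see the gap for yourself.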
Whatever database you pick, underneath it will almost always be a set of key-value collections. Every table and index in your SQL database, every document store and graph database relies on a fast underlying Key-Value Store (KVS) engine. These are generally implemented as persistent tree data structures, such as B-Trees, B+-Trees or LSM Trees. A faster KVS is an essential prerequisite for building a fast DBMS and is immensely valuable on its own.
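The interface every higher layer builds on is small. The toy sketch below uses `std::map` (an in-memory red-black tree) purely as a stand-in for the persistent B+-Trees and LSM Trees that real engines like RocksDB or WiredTiger use; the class and method names are illustrative, not any real engine's API.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// A toy KVS sketch. The essential property is ordered storage: range
// scans are what distinguish a KVS engine from a plain hash table,
// because every SQL index scan compiles down to one.
class ToyKVS {
    std::map<std::string, std::string> tree_;  // stand-in for a B+-Tree

  public:
    void put(std::string key, std::string value) {
        tree_[std::move(key)] = std::move(value);
    }
    std::optional<std::string> get(const std::string& key) const {
        auto it = tree_.find(key);
        if (it == tree_.end())
            return std::nullopt;
        return it->second;
    }
    // Visit all pairs with keys in [from, to), in sorted order.
    template <typename F>
    void scan(const std::string& from, const std::string& to, F&& f) const {
        for (auto it = tree_.lower_bound(from);
             it != tree_.end() && it->first < to; ++it)
            f(it->first, it->second);
    }
};
```

A production engine adds persistence, write-ahead logging, compression and concurrency on top, but the `put`/`get`/`scan` contract stays recognizably the same.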
A modern DBMS can be viewed as a composition of the following layers: a key-value binary storage layer, such as RocksDB or WiredTiger; a single-instance logic layer, such as Postgres; and a distributed load balancer and replication manager, like Greenplum. Depending on your performance goals, you may have to optimize one or more layers of this stack. We replaced all of them with custom solutions. Given that a DBMS implementation can reach millions of lines of code, there are many design decisions to make.
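The layering can be sketched in a few dozen lines. All the names below are hypothetical analogues, and each layer is reduced to a single responsibility; the point is only that each layer talks exclusively to the one below it, which is what makes them independently replaceable.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Layer 1: the key-value storage engine (RocksDB / WiredTiger analogue),
// here a toy in-memory map.
struct StorageEngine {
    std::map<std::string, std::string> data;
    void put(const std::string& k, const std::string& v) { data[k] = v; }
};

// Layer 2: single-instance logic (Postgres analogue). Parsing, planning
// and validation would live here; we reduce it to one call.
struct QueryProcessor {
    StorageEngine storage;
    void insert_row(const std::string& key, const std::string& row) {
        storage.put(key, row);
    }
};

// Layer 3: the distributed balancer (Greenplum analogue), reduced to
// hash partitioning across shards; real ones also manage replication.
struct Cluster {
    std::vector<QueryProcessor> shards;
    explicit Cluster(std::size_t n) : shards(n) {}
    QueryProcessor& route(const std::string& key) {
        return shards[std::hash<std::string>{}(key) % shards.size()];
    }
};
```

Swapping one layer (say, a different storage engine under the same query processor) only requires honoring the interface at that one boundary.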
Most of the data flowing into our analytical pipelines originally arrives in textual form. It is later transformed into a more structured representation, but we often re-evaluate the same context once more related data arrives. At such times it’s crucial to fetch all the occurrences of a specific string or RegEx pattern, so we need a fast text index! Reproduce our results with this dataset: COVID-19 Scientific Papers (45,941 documents; 1,604,649 sections).
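The core structure behind "fetch all occurrences of a string" is an inverted index: a mapping from token to the set of documents containing it. The sketch below is a deliberately minimal illustration with hypothetical names; real text indexes add positional postings, tokenization rules and compression, and RegEx support requires further machinery such as n-gram indexes.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Token -> set of document ids. Looking up one token is a single tree
// search instead of a scan over every document.
class InvertedIndex {
    std::map<std::string, std::set<std::size_t>> postings_;

  public:
    // Split text on whitespace and record each token's document.
    void add(std::size_t doc_id, const std::string& text) {
        std::istringstream words(text);
        std::string token;
        while (words >> token)
            postings_[token].insert(doc_id);
    }
    std::set<std::size_t> find(const std::string& token) const {
        auto it = postings_.find(token);
        return it == postings_.end() ? std::set<std::size_t>{} : it->second;
    }
};
```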
Why would an AI-Research team spend years designing new storage infrastructure? Well, let’s look at hardware. A high-end server like the DGX A100 today costs around $300,000 and comes with the following components: 2x CPUs (with shared RAM), 8x GPUs (each with private VRAM) and 2x SSDs. Almost the entire cost of the hardware is the cost of logic and volatile memory (RAM). RAM is fast, but also small. The chances are, your computer spends most of its time simply fetching and sending data instead of actually “computing”.
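The "fetching instead of computing" claim can be checked with back-of-the-envelope roofline math. The sketch below uses rough, assumed peak numbers (not official specs): a kernel is memory-bound whenever its arithmetic intensity (FLOPs per byte moved) falls below the machine's ratio of peak compute to memory bandwidth.

```cpp
// Roofline-style estimate. All peak figures passed in are assumptions
// supplied by the caller, not hardware specifications.
struct Machine {
    double peak_flops;      // floating-point operations per second
    double peak_bandwidth;  // bytes per second from memory
    // The "ridge point": intensity at which compute and memory
    // throughput balance out.
    double ridge_point() const { return peak_flops / peak_bandwidth; }
    bool memory_bound(double flops_per_byte) const {
        return flops_per_byte < ridge_point();
    }
};

// Example kernel: a double-precision dot product does 2 FLOPs
// (multiply + add) per 16 bytes read, i.e. 0.125 FLOPs/byte.
constexpr double dot_product_intensity = 2.0 / 16.0;
```

With assumed GPU-class peaks of ~20 TFLOPS and ~1.5 TB/s, the ridge point is roughly 13 FLOPs/byte, so a dot product at 0.125 FLOPs/byte spends almost all of its time waiting on memory.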
At Unum, we develop neuro-symbolic AI, which means combining discrete structural representations of data with semi-continuous neural representations. Think of it as building a huge Knowledge Graph. Such graphs have an extremely irregular structure, which makes data-access patterns very unpredictable. Sounds like the ultimate workload for a serious DBMS benchmark. Reproduce our results with these datasets: Patent Citation Network (272 MB; 16,518,947 edges; average degree 8) and Mouse Gene Regulatory Network.
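Why are graph workloads so punishing for a KVS? A minimal adjacency-list sketch makes it visible: neighbor lists vary wildly in length under a power-law degree distribution, so a traversal hops between unrelated memory locations on every step. The class below is an illustrative toy, not our engine's representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// An undirected graph as adjacency lists. Each lookup in adj_ lands on
// an unrelated heap allocation, which is exactly the unpredictable
// access pattern that stresses a storage engine.
class Graph {
    std::unordered_map<std::uint64_t, std::vector<std::uint64_t>> adj_;
    std::size_t edges_ = 0;

  public:
    void add_edge(std::uint64_t u, std::uint64_t v) {
        adj_[u].push_back(v);
        adj_[v].push_back(u);
        ++edges_;
    }
    // Each undirected edge contributes to two vertices' degrees.
    double average_degree() const {
        return adj_.empty() ? 0.0 : 2.0 * edges_ / adj_.size();
    }
    const std::vector<std::uint64_t>& neighbors(std::uint64_t u) {
        return adj_[u];
    }
};
```

In the Patent Citation Network above, an average degree of 8 hides a long tail: a few heavily cited patents have enormous neighbor lists while most have tiny ones, so no fixed-size layout fits all vertices well.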