Most of the data that flows into our analytical pipelines arrives in textual form. It is later transformed into a more structured representation, but we often have to re-evaluate the same content once more related data arrives. At such times it’s crucial to fetch every occurrence of a specific string or RegEx pattern, so we need a fast text index!
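To make the workload concrete, here is a minimal sketch of the two query shapes we care about, written against MongoDB (one of the engines benchmarked below) with PyMongo. The database, collection, field names, and query strings are illustrative assumptions, not part of the benchmark itself.

    # Minimal PyMongo sketch; "corpus", "sections" and the "text" field are hypothetical.
    from pymongo import MongoClient, TEXT

    sections = MongoClient("mongodb://localhost:27017")["corpus"]["sections"]
    sections.create_index([("text", TEXT)])  # full-text index over the section body

    # Exact-phrase lookup, served by the text index.
    phrase_hits = sections.find({"$text": {"$search": "\"spike protein\""}})

    # Regular-expression scan; an unanchored pattern is not accelerated by the text index.
    regex_hits = sections.find({"text": {"$regex": r"SARS[- ]CoV[- ]?2"}})

    print(sum(1 for _ in phrase_hits), sum(1 for _ in regex_hits))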

Reproduce Our Results

Setup

Datasets

  • Covid 19 Scientific Papers.
    • Documents: 45,941.
    • Sections: 1,604,649.
    • Size: 1.7 GB.
  • Political Tweets from India.
    • Documents: 12,488,144.
    • Sections: 12,488,144.
    • Size: 2.3 GB.
  • English Wikipedia Dump from 2017.
    • Documents: 4,902,648.
    • Sections: 23,046,187.
    • Size: 18.2 GB.

Device

  • CPU:
    • Model: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz.
    • Cores: 8 (16 threads @ 2.3 GHz).
  • RAM Space: 16.0 GB.
  • Disk Space: 931.5 GB.
  • OS Family: Darwin.
  • Python Version: 3.7.7.

Databases were configured to use 512 MB of RAM for cache and 4 cores for query execution.
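For reference, one way to apply such a cache cap to MongoDB at runtime is through the WiredTiger engine configuration, as sketched below; this is only an illustration of the constraint, not the exact setup script used here, and Elasticsearch would instead be capped through its JVM heap settings.

    # Hedged sketch: cap MongoDB's WiredTiger cache at roughly 512 MB at runtime.
    from pymongo import MongoClient

    admin = MongoClient("mongodb://localhost:27017").admin
    admin.command({"setParameter": 1, "wiredTigerEngineRuntimeConfig": "cache_size=512MB"})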

Sequential Writes: Import CSV (docs/sec)

Every data science project starts by importing the data. Let’s see how long it takes to load each dataset into each DB.
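For MongoDB, the import loop looks roughly like the sketch below; the file name, column name, and batch size are illustrative assumptions rather than the exact benchmark code, and the Elasticsearch variant would use its bulk helper instead.

    # Hedged sketch of a batched CSV import into MongoDB.
    import csv
    from pymongo import MongoClient

    BATCH_SIZE = 10_000
    sections = MongoClient("mongodb://localhost:27017")["corpus"]["sections"]

    with open("covid19_sections.csv", newline="", encoding="utf-8") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append({"text": row["text"]})
            if len(batch) == BATCH_SIZE:
                sections.insert_many(batch, ordered=False)
                batch.clear()
        if batch:
            sections.insert_many(batch, ordered=False)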

                 Covid19      PoliticalTweetsIndia   Gains
MongoDB          398.40       3,301.61               1x
ElasticSearch    5,915.53     10,262.29              8.98x
UnumDB           142,575.07   930,285.22             319.82x

The same imports expressed as total duration:

                 Covid19           PoliticalTweetsIndia
MongoDB          1 hour, 7 mins    50 mins, 29 secs
ElasticSearch    4 mins, 31 secs   16 mins, 14 secs
UnumDB           0 mins, 11 secs   0 mins, 13 secs

Those benchmarks only tell half of the story. We should consider not only performance, but also the disk space used and the effect on hardware lifetime, since SSDs endure only a limited number of writes. MongoDB generally performs well across different benchmarks, but it failed to import the English Wikipedia dump within 10 hours. I suspect a bug in the implementation of its text index, as some batch import operations took over 10 minutes for a modest batch size of 10,000 docs.
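One simple way to spot such stalls is to time each batch individually; the sketch below flags slow inserts, with the one-minute threshold chosen arbitrarily for illustration.

    # Hedged sketch: time every insert_many() call and flag suspiciously slow batches.
    import time
    from pymongo import MongoClient

    sections = MongoClient("mongodb://localhost:27017")["corpus"]["sections"]

    def timed_insert(batch):
        started = time.perf_counter()
        sections.insert_many(batch, ordered=False)
        elapsed = time.perf_counter() - started
        if elapsed > 60:  # arbitrary threshold for flagging outliers
            print(f"slow batch: {len(batch)} docs took {elapsed:.0f} s")
        return elapsed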

                 Covid19   PoliticalTweetsIndia   EnglishWikipedia
MongoDB          1.9 GB    3.2 GB                 Expected 60.7 GB
ElasticSearch    2.5 GB    2.9 GB                 33.5 GB

Import Overhead - Total Bytes Written