Most of the data that flows into our analytical pipelines originally arrives in textual form. It is later transformed into a more structured representation, but we often re-evaluate the same content once more related data arrives. At such times it's crucial to fetch all occurrences of a specific string or RegEx pattern, so we need a fast text index!
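As a quick illustration of such lookups, here is a sketch against MongoDB via pymongo; the database, collection, and field names are hypothetical:

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")  # hypothetical local instance
sections = client["benchmark"]["sections"]

# A text index accelerates token- and phrase-level lookups over "body".
sections.create_index([("body", TEXT)])

# Every section containing the exact phrase "text index".
phrase_hits = sections.find({"$text": {"$search": '"text index"'}})

# Arbitrary regular expressions cannot use MongoDB's text index and fall
# back to a collection scan, which is why fast text indexes matter here.
regex_hits = sections.find({"body": {"$regex": r"occurrences?"}})
```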
- Covid-19 Scientific Papers.
  - Documents: 45,941.
  - Sections: 1,604,649.
  - Size: 1.7 GB.
- Political Tweets from India.
  - Documents: 12,488,144.
  - Sections: 12,488,144.
  - Size: 2.3 GB.
- English Wikipedia Dump from 2017.
  - Documents: 4,902,648.
  - Sections: 23,046,187.
  - Size: 18.2 GB.
Intel(R) Core(TM) i9-9880H CPU @ 2.30 GHz.
- Cores: 8 (16 threads @ 2.3 GHz).
- RAM Space: 16.0 GB.
- Disk Space: 931.5 GB.
- OS Family: Darwin.
- Python Version: 3.7.7.
Databases were configured to use 512 MB of RAM for cache and 4 cores for query execution.
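For reproducibility, here is a minimal sketch of how such a cache cap might be applied at launch time, assuming local installations; the data path is made up, and the per-engine way of pinning query execution to 4 cores is omitted:

```python
import os
import subprocess

# Cap MongoDB's WiredTiger cache at 0.5 GB (512 MB); --wiredTigerCacheSizeGB
# is a standard mongod flag, while the dbpath is purely illustrative.
subprocess.Popen(
    ["mongod", "--dbpath", "/data/bench", "--wiredTigerCacheSizeGB", "0.5"]
)

# For Elasticsearch, the JVM heap is the closest analogue of a cache cap;
# ES_JAVA_OPTS is the documented way to override it from the environment.
subprocess.Popen(
    ["elasticsearch"],
    env={**os.environ, "ES_JAVA_OPTS": "-Xms512m -Xmx512m"},
)
```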
Sequential Writes: Import CSV (total time)
Every data science project starts by importing the data. Let's see how long it takes to load each dataset into every DB; a sketch of such a batched import follows the results table.
| Database | Covid-19 Papers | Political Tweets |
| :--- | ---: | ---: |
| MongoDB | 1 hour, 7 mins | 50 mins, 29 secs |
| ElasticSearch | 4 mins, 31 secs | 16 mins, 14 secs |
| UnumDB | 0 mins, 11 secs | 0 mins, 13 secs |
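As mentioned above, a batched import along these lines can be sketched with pymongo; the CSV layout, column names, and file name are hypothetical:

```python
import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical local instance
sections = client["benchmark"]["sections"]

BATCH_SIZE = 10_000  # the same modest batch size discussed below

def import_csv(path: str) -> None:
    """Stream a CSV file into MongoDB in fixed-size unordered batches."""
    with open(path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append({"title": row["title"], "body": row["body"]})
            if len(batch) == BATCH_SIZE:
                sections.insert_many(batch, ordered=False)
                batch.clear()
        if batch:  # flush the trailing partial batch
            sections.insert_many(batch, ordered=False)

import_csv("covid19_papers.csv")  # hypothetical file name
```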
Those benchmarks only tell half of the story. We should consider not only performance, but also the disk space consumed and the effect on hardware lifetime, since write-heavy workloads wear out SSDs. MongoDB generally performs well across different benchmarks, but it failed to finish importing the English Wikipedia within 10 hours. I suspect a bug in the implementation of its text index, as some batch import operations took over 10 mins for a modest batch size of 10,000 docs.
| Database | Covid-19 Papers | Political Tweets | English Wikipedia |
| :--- | ---: | ---: | ---: |
| MongoDB | 1.9 GB | 3.2 GB | Expected 60.7 GB |
| ElasticSearch | 2.5 GB | 2.9 GB | 33.5 GB |
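Figures like these can be collected programmatically; for MongoDB, a minimal sketch using the collStats command (database and collection names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["benchmark"].command("collStats", "sections")

# storageSize counts the (compressed) documents themselves; totalIndexSize
# counts every index, and the text index often dominates on this workload.
print(f"data:    {stats['storageSize'] / 2**30:.1f} GB")
print(f"indexes: {stats['totalIndexSize'] / 2**30:.1f} GB")
```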