Processing COVID-19 literature: intake into Redis Cluster and Cuckoo filter

Alex Mikhalev
2 min read · May 14, 2020

In the last article, I spun up a Redis Cluster; this time the task is to populate it with data. I am not going to build one large data science pipeline that takes 9 GB of RAM just to start. Instead, I am splitting it into “nano” services, which I can debug easily (and hopefully add tests to :) at some point.

The Kaggle CORD-19 dataset is one sizeable zipped archive containing a bunch of JSON files; download the dataset and unzip it:

$ kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge
$ unzip CORD-19-research-challenge.zip

Just a small intake script (a runnable sketch follows the list below). Take all:

  • json_files -> parse_json: extract the article_id
  • parse `body_text` and return all the text as paragraphs
  • store the paragraphs in Redis under article_id:paragraph_id
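
A minimal sketch of that intake, assuming the redis-py-cluster client and the CORD-19 JSON schema (`paper_id`, `body_text`); the `paragraphs:` key prefix, the cluster address, and the folder path are illustrative, not the exact values from my script:

import json
import pathlib

from rediscluster import RedisCluster  # pip install redis-py-cluster

# Entry point of the cluster; host/port here are placeholders.
rc = RedisCluster(startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
                  decode_responses=True)

def process_file(path: pathlib.Path) -> None:
    with open(path) as f:
        article = json.load(f)
    article_id = article["paper_id"]  # CORD-19 JSON schema field
    # body_text is a list of {"section": ..., "text": ...} objects
    for idx, paragraph in enumerate(article["body_text"]):
        # illustrative key scheme: article_id:paragraph_id
        rc.set(f"paragraphs:{article_id}:{idx}", paragraph["text"])

# Walk the unzipped archive; adjust the root folder to wherever it landed.
for json_file in pathlib.Path("CORD-19-research-challenge").rglob("*.json"):
    process_file(json_file)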

JSON can be parsed really fast; the role of the RedisBloom (Cuckoo) filter is to prevent re-processing files at each step, and I was planning to use a different filter for each stage. When loading JSON into Redis, I may not care that much, but if I am processing text via BERT/Transformers, it will take noticeable time per line/file.
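
In code, a per-stage check-and-mark could look like this sketch, assuming the redisbloom package (whose Client talks to a single Redis instance); filter names like cf:intake are made up for illustration:

from redisbloom.client import Client  # pip install redisbloom

# The Cuckoo filter lives in a plain, non-clustered Redis instance.
rb = Client(host="localhost", port=6379)

def seen_before(stage: str, item: str) -> bool:
    # CF.ADDNX adds the item only if it is not already present and
    # returns 0 when it (probably) was; one filter per stage.
    return rb.cfAddNX(f"cf:{stage}", item) == 0

if not seen_before("intake", "some_article_id"):
    ...  # the expensive processing happens only once per article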

Unfortunately, the RedisBloom Python library doesn’t support Redis Cluster, so I will have to fall back to storing a set in Redis. If I redo this intake for a single-machine configuration, I will rewrite it in Rust with a Xor filter.
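
The set-based fallback is tiny, since SADD already reports whether the member was new; the key name below is illustrative:

from rediscluster import RedisCluster  # pip install redis-py-cluster

rc = RedisCluster(startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
                  decode_responses=True)

def already_processed(article_id: str) -> bool:
    # SADD returns 0 when the member was already in the set, so one
    # round trip both checks membership and marks the item as done.
    return rc.sadd("processed:intake", article_id) == 0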

For now, I stored the Cuckoo filter in a local Redis instance while populating the Redis Cluster. Good news: the script ran successfully on my laptop (Windows 10, 8 GB RAM).
