Processing COVID-19 literature: intake into Redis Cluster and Cbloom filter

In the last article, I spun up a Redis Cluster. Now it is time to populate it with data. This time I am not going to create one large data science pipeline that takes 9 GB of RAM just to start. Instead, I am going to split it into “nano” services, which I can debug easily (and, hopefully, add tests to at some point :).

The Kaggle CORD19 data is one sizeable zipped archive with a bunch of JSON files. Download the Kaggle dataset and unzip it:

$ kaggle datasets download…
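
Once the archive is unzipped, each "nano" service only needs to see one JSON file at a time, so nothing has to hold the whole dataset in RAM. Here is a minimal sketch of that idea, assuming the archive was unzipped into a local directory; the helper name `iter_json_documents` and the directory name used below are hypothetical, not part of the original pipeline.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple


def iter_json_documents(root: str) -> Iterator[Tuple[Path, Dict[str, Any]]]:
    """Yield (path, parsed JSON) for every .json file under root.

    Files are read lazily, one per iteration, so memory use stays
    proportional to a single document rather than the whole archive.
    """
    # rglob walks subdirectories too; sorting keeps the order deterministic.
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)
```

Looping over `iter_json_documents("cord19")` (with `cord19` standing in for wherever you unzipped the archive) hands each parsed file to the next step, which makes it easy to run the intake on a handful of files while debugging.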