RedisGears for Data Science NLP — Journey continues

If you procrastinate on open-sourcing a project for several months, you forget what you wanted to do in the first place. I thought I wanted to split the code into a streams-based version and open-source that. Then, after completing most of the work, I realised I wanted to build the open-source project on the KeysReader part of cord19project: it’s way easier to debug and fits better into a data science flow. Create a batch script, run it, validate the output, then switch run to register so it picks up new data.

Also, looking at my own code now, it’s not something to be proud of or put into production.

So, let’s rewrite it.

  1. I want to be able to run it on a laptop (although a 256 GB RAM server is on standby for the demo)
  2. It should not be that ugly. It’s a hobby data science project, so a few rough spots here and there are OK, but it should be readable (and hopefully contributable)
  3. Do I care about taking snapshots at every step? No, I only want one “interim” snapshot, taken once sentences are spellchecked, so I can hook up a BERT-based tokeniser as a subscriber.

How can I fit the NLP pipeline into RedisGears better?

  1. For each record, detect the language (discard non-English): it’s a “filter”
  2. Split paragraphs into sentences: it’s a “flatmap”
  3. Spellcheck the sentences: it’s a “map”
  4. Save sentences into a hash: processor
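
To make the mapping concrete, here is a rough sketch of the four steps as plain Python functions, so they can be tried on a laptop without Redis. Everything named here is an assumption for illustration: the language check and the spellchecker are trivial stand-ins (the real ones would be proper libraries), records are assumed to be KeysReader-style dicts with a string "value", and the key names "paragraphs:*" and "sentences" are made up. The commented chain at the bottom shows how the same functions might be wired into GearsBuilder inside RedisGears.

```python
import re


def is_english(record):
    """filter step. Stand-in heuristic (not a real language detector):
    keep the record if more than 90% of its letters are ASCII."""
    text = record["value"]
    letters = sum(ch.isalpha() for ch in text)
    ascii_letters = sum(ch.isalpha() and ch.isascii() for ch in text)
    return letters > 0 and ascii_letters / letters > 0.9


def split_sentences(record):
    """flatmap step: one paragraph record in, a list of sentence records out."""
    parts = re.split(r"(?<=[.!?])\s+", record["value"].strip())
    return [
        {"key": f"{record['key']}:{i}", "value": sentence}
        for i, sentence in enumerate(parts)
        if sentence
    ]


def spellcheck(record):
    """map step. Stand-in 'correction' that only normalises whitespace."""
    return {"key": record["key"], "value": " ".join(record["value"].split())}


def run_locally(records):
    """Apply the steps in pipeline order, saving into a dict instead of a hash."""
    store = {}
    for record in filter(is_english, records):
        for sentence in split_sentences(record):
            cleaned = spellcheck(sentence)
            store[cleaned["key"]] = cleaned["value"]
    return store


# Inside RedisGears the same functions could be chained roughly like this
# (GearsBuilder and execute are provided by the Gears runtime; the key
# pattern and hash name are hypothetical):
#
# gb = GearsBuilder("KeysReader")
# gb.filter(is_english)
# gb.flatmap(split_sentences)
# gb.map(spellcheck)
# gb.foreach(lambda r: execute("HSET", "sentences", r["key"], r["value"]))
# gb.run("paragraphs:*")       # batch run while validating
# gb.register("paragraphs:*")  # swap in for run() to process new keys
```

Keeping the steps as plain functions means they can be checked with `run_locally` first and only then handed to Gears, which matches the run, validate, register flow.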

And since there is no support for multiple files in a RedisGears project, there is no point in splitting the project into that many files. Let’s consolidate it into fewer, smaller chunks.

New branch pushed to GitHub.

