Alex Mikhalev
2 min read · Jun 16, 2022

Great article and a good start with biomedical knowledge.

I invite you and the rest of the community to collaborate on such NLP pipelines and build a Reference Architecture for AI; everything is open-sourced, including the website. Your flow fits perfectly into this pipeline (https://reference-architecture.ai/docs/nlp/), and from there we can expand to more advanced deployments using BERT-Large QA models (https://reference-architecture.ai/docs/bert-qa-benchmarking/).

This approach is similar to the path I took in the CORD19 Kaggle competition: I tried to use UMLS to map entities into concepts, and it put me on a learning journey I would like to share.
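For anyone who wants to try that mapping step, here is a minimal sketch using scispacy's UMLS entity linker. The model name, config, and example sentence follow the scispacy docs and are assumptions, not the exact code I used in the competition:

```python
# Sketch: map biomedical entities to UMLS concepts with scispacy.
# Assumes scispacy and the en_core_sci_sm model are installed;
# illustrative only, not the original CORD19 pipeline.
import spacy
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" pipe

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe(
    "scispacy_linker",
    config={"resolve_abbreviations": True, "linker_name": "umls"},
)
linker = nlp.get_pipe("scispacy_linker")

doc = nlp("Spinal and bulbar muscular atrophy is an X-linked recessive disease.")
for ent in doc.ents:
    for cui, score in ent._.kb_ents[:1]:  # top-ranked UMLS candidate
        concept = linker.kb.cui_to_entity[cui]
        print(ent.text, "->", cui, concept.canonical_name, round(score, 2))
```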

To help make your solution better:

1) A DOM parser for XML risks running out of memory. Moving article parsing into a callback function that is invoked on a stream of XML data will minimize memory usage (although Spark memory consumption will still be an issue); see the streaming sketch right after this list.

2) The current solution would take real effort to scale beyond 1,000 articles. I tried a similar approach during the Kaggle data science competition, and even for a "small" dataset of 50,000 articles (full texts, not just abstracts), I spent a week waiting for the writes to land in Neo4j. Neo4j has a lot of advantages, but it is not designed for fast writes (things have improved since 2019, the last time I tried).
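On the first point, here is a minimal sketch of the streaming approach using Python's standard-library iterparse. The "article" tag, file name, and callback are assumptions; adjust them to the actual structure of your XML:

```python
# Sketch: stream-parse a large XML dump instead of loading the whole DOM.
# The "article" tag and handle_article callback are illustrative assumptions.
import xml.etree.ElementTree as ET

def handle_article(elem):
    # Replace with your real per-article processing (NER, linking, etc.).
    print(elem.findtext("title"))

def parse_stream(path, callback):
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # grab the document root
    for event, elem in context:
        if event == "end" and elem.tag == "article":
            callback(elem)
            root.clear()  # drop processed subtrees so memory stays bounded

parse_stream("pubmed_dump.xml", handle_article)
```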

As I learned, the fastest way to write into Neo4j is to build the node and edge structures in Redis first, then dump them to CSV using redis-cli and load the CSV with neo4j-admin import (a destructive operation, since it builds a fresh database).
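Roughly, that flow looks like the sketch below (redis-py instead of redis-cli for brevity; the key names, sample CUIs, and relationship type are assumptions, while the CSV headers follow the neo4j-admin import header convention):

```python
# Sketch: stage nodes/edges in Redis, then dump to CSV for bulk import.
# Key names (nodes:*, edges) and sample values are illustrative; the CSV
# headers use the neo4j-admin import convention (:ID, :START_ID, :END_ID, :TYPE).
import csv
import redis

r = redis.Redis(decode_responses=True)

# 1) Fast writes: push nodes and edges into Redis during parsing.
r.hset("nodes:C0027051", mapping={"name": "Myocardial infarction"})
r.rpush("edges", "C0027051|ASSOCIATED_WITH|C0010068")

# 2) Dump to CSV in the format the bulk importer expects.
with open("nodes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["cui:ID", "name", ":LABEL"])
    for key in r.scan_iter("nodes:*"):
        cui = key.split(":", 1)[1]
        w.writerow([cui, r.hget(key, "name"), "Concept"])

with open("edges.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow([":START_ID", ":TYPE", ":END_ID"])
    for edge in r.lrange("edges", 0, -1):
        w.writerow(edge.split("|"))

# 3) Bulk load into an *empty* database (hence "destructive"); Neo4j 5 syntax,
#    older versions use `neo4j-admin import`:
#    neo4j-admin database import full --nodes=nodes.csv --relationships=edges.csv
```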

That journey led me to participate in Redis hackathons and win the Redis "Build on Redis" hackathon (https://redis.com/blog/build-on-redis-hackathon-winners/).
