BERT models with millisecond inference — The Pattern project

A newsletter from NVIDIA about a BERT model achieving 1.2 ms inference on their latest hardware reminded me that I ran BERT QA (bert-large-uncased-whole-word-masking-finetuned-squad) in 1.46 ms on a laptop with an Intel i7-10875H, using RedisAI, during the Redis Labs RedisConf 2021 hackathon.

The challenge:

1) BERT QA requires user input (a question), so the server cannot pre-compute responses the way it can for summarization; inference has to run at request time (see the sketch below).
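
For illustration, here is a minimal sketch of BERT QA inference using the Hugging Face transformers pipeline. This is not the project's actual RedisAI serving code, and the context string is made up; it only shows why the answer cannot be cached ahead of time:

```python
# Minimal sketch (assumption: the transformers library is installed).
# The model name is the one cited above; the context is hypothetical.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = "RedisAI serves deep learning models directly from Redis."

# The question is user input, known only at request time,
# so the answer cannot be pre-computed on the server.
result = qa(question="What does RedisAI serve?", context=context)
print(result["answer"], result["score"])
```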