Description
We currently use Ollama (https://ollama.com/) to prompt LLMs in the text-summary and image-summary plugins. Ollama is easy to use and offers a wide variety of LLMs that can be downloaded easily. However, Ollama does not support batch processing of inputs: it always runs one prompt at a time through the LLM. Other LLM libraries such as vLLM (https://github.com/vllm-project/vllm) do support batch processing of inputs. The downside with vLLM is that popular local models such as Llama 3.1, Gemma 3, etc. are "gated" on Huggingface and require Huggingface authentication (e.g. via huggingface-cli) to download.
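For reference, here is a minimal sketch of what batched inference looks like with vLLM: `LLM.generate()` accepts a list of prompts and schedules them together (continuous batching), rather than one at a time as with Ollama. The model name and sampling settings below are placeholders, and vLLM itself requires a GPU at runtime, so the import is done lazily.

```python
from typing import List

def batch_summarize(prompts: List[str], model_name: str = "facebook/opt-125m") -> List[str]:
    """Run all prompts through the LLM in a single batched generate() call.

    vLLM batches the prompts internally, in contrast to Ollama, which
    processes one prompt per request. model_name is a placeholder; a real
    deployment would pick the summarization model the plugins use.
    """
    # Imported lazily: vLLM needs a GPU, which may not be present everywhere.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.0, max_tokens=128)
    # One call, many prompts -- this is where the batching win comes from.
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Note that gated models would still fail to download here without Huggingface credentials, which is exactly the shipping constraint described below.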
The task is to implement batch processing of prompts through an LLM (multi-modal models must be supported as well) while still letting us ship the code in an easy-to-use manner: we can't ask users to provide Huggingface login info at runtime, for example. We also can't bundle the models inside the installer (which would eliminate the need for runtime downloads from Huggingface) because that would make the installer too large.
A simpler first step would be to implement batching using vLLM or another LLM library and report the performance gains of batch processing versus Ollama.
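For the comparison, a small timing harness keeps the measurement backend-agnostic: pass it any runner (a sequential Ollama loop, a batched vLLM call) and compare prompts-per-second. This is a sketch; the runner signatures are assumptions, not part of either library's API.

```python
import time
from typing import Callable, List

def measure_throughput(run: Callable[[List[str]], List[str]], prompts: List[str]) -> float:
    """Return prompts processed per second for the given runner.

    `run` is any function that maps a list of prompts to a list of outputs,
    e.g. a loop over Ollama requests or a single vLLM generate() call.
    """
    start = time.perf_counter()
    results = run(prompts)
    elapsed = time.perf_counter() - start
    # Sanity check: every prompt must produce exactly one output.
    assert len(results) == len(prompts)
    return len(prompts) / elapsed
```

Running the same prompt set through both runners and dividing the two throughput numbers would give the speedup figure the issue asks for.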