If needed, we should optimize huggingface generation to be faster. It is currently synchronous because loading an adapter modifies the underlying model, so only one "type" of request (i.e., the base model or a single adapter) can be served at a time.
Improvements:
We could activate the lock only when adapters are actually added to the model; this would keep non-adapter generation concurrent.
Note: the code from that PR should be refactored to use clearer names.
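A minimal sketch of the "activate the lock only when needed" idea. The class name and the `adapters_loaded` flag are hypothetical, not from the PR; the point is that base-model-only serving skips the lock entirely and stays concurrent:

```python
import asyncio
from contextlib import asynccontextmanager

class ConditionalLock:
    """Serializes generation only once an adapter has mutated the model."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.adapters_loaded = False  # flip to True the first time an adapter is attached

    @asynccontextmanager
    async def acquire_if_needed(self):
        if self.adapters_loaded:
            # an adapter modifies the shared nn.Module, so requests must serialize
            async with self._lock:
                yield
        else:
            # base model only: nothing mutates the weights, run fully concurrently
            yield
```

A generation handler would wrap its forward pass in `async with cond_lock.acquire_if_needed():`, paying for serialization only after the first adapter load.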
A modified "conditional semaphore" approach, where multiple copies of the model are loaded into memory. Those copies can then run concurrently, and we route each request to the copy matching its type. Some additional thoughts on this approach:
Keep multiple copies of the model (each one modified by a different adapter) in CPU memory, assuming main memory is sufficient.
if CPU inference:
just use each model directly; no lock required.
if GPU inference:
Compute the amount of GPU memory each huggingface model (= just an instance of nn.Module) requires.
Divide it by some coarse unit, e.g., 1 GB, and take the ceiling, not the floor.
max_units = [available GPU memory] / 1 GB.
Create a global semaphore with max_units capacity for each GPU.
Each model acquires [model memory] / 1 GB units of the semaphore to be resident on a GPU.
Iterate over the semaphores/GPUs and acquire the units wherever they are available.
Move the adapted model to GPU before running the inference. This should be a no-op if the model is already on the GPU.
Implementation note:
Releasing the GPU memory should be lazy; otherwise the model would be evicted from the GPU every time its semaphore units are released, even if the next request needs the same model.
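The unit-counting scheme above could be sketched roughly as follows. All names here are hypothetical; note that acquiring N units must be atomic (a condition variable rather than N blocking `Semaphore.acquire()` calls), otherwise two models each holding partial units can deadlock:

```python
import math
import threading

UNIT_BYTES = 1 << 30  # coarse unit: 1 GB

def model_units(param_bytes: int) -> int:
    # ceiling, not floor: a 2.3 GB model must reserve 3 units
    return math.ceil(param_bytes / UNIT_BYTES)

class GpuMemorySemaphore:
    """Counting semaphore sized in 1 GB units, one instance per GPU."""

    def __init__(self, total_bytes: int):
        # floor here: we cannot promise a partial unit of real memory
        self.free_units = total_bytes // UNIT_BYTES
        self._cond = threading.Condition()

    def acquire(self, units: int) -> None:
        # atomic multi-unit acquire: wait until ALL requested units are free
        with self._cond:
            while self.free_units < units:
                self._cond.wait()
            self.free_units -= units

    def release(self, units: int) -> None:
        with self._cond:
            self.free_units += units
            self._cond.notify_all()
```

Before inference, a worker would compute `units = model_units(bytes)` for its adapted model, call `acquire(units)` on some GPU's semaphore, move the model to that GPU (a no-op if already resident), and `release(units)` afterward, with the actual device-to-host eviction deferred lazily.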
Pros: can run different adapters (and the base model) concurrently.
Pros: maximizes GPU usage.
Cons: affects loading multiple m.session instances (though I assume this is rare / not common yet).
With a global semaphore based on 1 GB units, multiple sessions can be mixed.
Cons: still does not address the multi-GPU case.
Multiple semaphores (one per GPU) would address this.
Room for improvement: could be slow if consecutive queries require different adapters.
A scheduler that groups queries to the same adapter together would address this.
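Such a scheduler might look like the sketch below (names and batching policy are assumptions, not an existing implementation). It prefers the currently loaded adapter to avoid a swap, and otherwise picks the adapter with the longest backlog:

```python
from collections import defaultdict, deque

class AdapterBatchingScheduler:
    """Groups pending requests by adapter so each adapter swap serves a whole batch."""

    def __init__(self):
        self._queues = defaultdict(deque)  # adapter name -> pending requests
        self._active = None                # adapter currently loaded on the model

    def submit(self, adapter: str, request) -> None:
        self._queues[adapter].append(request)

    def next_batch(self, max_batch: int = 8):
        # stick with the active adapter while it still has work (no swap cost)
        if self._active not in self._queues or not self._queues[self._active]:
            candidates = {a: q for a, q in self._queues.items() if q}
            if not candidates:
                return None, []
            # otherwise swap to the adapter with the most queued requests
            self._active = max(candidates, key=lambda a: len(candidates[a]))
        q = self._queues[self._active]
        batch = [q.popleft() for _ in range(min(max_batch, len(q)))]
        return self._active, batch
```

The serving loop would call `next_batch()` repeatedly, loading the returned adapter once per batch instead of once per request.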