System Info
When using Lorax with a LoRA adapter via the /v1/chat/completions endpoint, the adapter works as expected when "stream": false.
However, when "stream": true is set, the response clearly comes from the base model only; the adapter specified via adapter_name appears to be ignored.
Reproduction
Works:
```json
{
  "model": "Mistral-7B-Instruct-v0.1",
  "adapter_name": "Medical-Insights-QA",
  "stream": false,
  "messages": [
    {"role": "user", "content": "What are symptoms of cancer?"}
  ]
}
```
Broken (stream: true only uses base model):
```json
{
  "model": "Mistral-7B-Instruct-v0.1",
  "adapter_name": "Medical-Instruct-QA",
  "stream": true,
  "messages": [
    {"role": "user", "content": "What are symptoms of cancer?"}
  ]
}
```
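For convenience, the two requests above can be driven from a small Python script. This is a minimal repro sketch, not part of the report's original setup: the server URL (`http://localhost:8080`) is an assumption and should be adjusted to your LoRAX deployment. The two payloads are identical apart from the `stream` flag, which is what makes the divergent outputs a bug.

```python
"""Repro sketch: send the same /v1/chat/completions request with
stream off and on, and compare the outputs. Assumes a locally
running server; adjust LORAX_URL for your deployment."""
import json
import urllib.request

LORAX_URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint

def build_payload(stream: bool) -> dict:
    # Identical request apart from the `stream` flag, so the adapter
    # should influence generation in both cases.
    return {
        "model": "Mistral-7B-Instruct-v0.1",
        "adapter_name": "Medical-Insights-QA",
        "stream": stream,
        "messages": [
            {"role": "user", "content": "What are symptoms of cancer?"}
        ],
    }

def post(payload: dict) -> bytes:
    """POST the JSON payload and return the raw response body."""
    req = urllib.request.Request(
        LORAX_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Non-streaming: adapter applied as expected.
    print(post(build_payload(False)).decode("utf-8"))
    # Streaming: SSE chunks come back, but the text reads like
    # base-model output, i.e. the adapter is apparently ignored.
    print(post(build_payload(True)).decode("utf-8"))
```

Comparing the decoded streamed text against the non-streaming response makes the discrepancy easy to see side by side.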
Expected behavior
When using the /v1/chat/completions endpoint with "stream": true, I expect the streamed response to be generated with the specified LoRA adapter (adapter_name), just as it is when "stream": false.
The adapter should influence generation in both streaming and non-streaming modes, so that the two modes behave consistently and produce outputs aligned with the fine-tuned model.