I have a short question. I built an Android app deploying Llama 3 and uploaded it here: https://bwsyncandshare.kit.edu/s/t3898Ge7AZ6SWBn.
However, I couldn't get the model to continue from the last conversation turns.
I assume that the KV cache is stored internally in module_, and that here
https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/runner.cpp#L175
only the last decoded token and its position index are given to the model. Is that correct?
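For reference, here is a minimal sketch of how I understand the decode loop; `forward_one_token` is a hypothetical stand-in for the actual module call, not the real API:

```cpp
#include <cstdint>

// Hypothetical stand-in for the actual module call in runner.cpp:
// feeds one token at the given position, updates the KV cache held
// inside module_, and returns the sampled next token.
uint64_t forward_one_token(uint64_t token, int64_t start_pos);

// Decode loop as I understand it: per step, only the newest token and
// its position index are passed in; attention over all earlier tokens
// goes through the KV cache stored internally.
void decode_from(uint64_t cur_token, int64_t pos, int64_t seq_len) {
  while (pos < seq_len) {
    cur_token = forward_one_token(cur_token, pos);
    ++pos;
  }
}
```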
To reuse the last conversation turns within the next prompt, I tried to start here
https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/runner.cpp#L277
not with 0 as the start position index, but with the number of tokens that were processed during the previous conversation turns (see the sketch below). However, that didn't work: the model didn't remember the previous turns (I tried e.g. "My name is Christian" -> answer -> "What is my name?"). Is my approach wrong?
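Concretely, this is the change I attempted, expressed with the same hypothetical `forward_one_token` helper as above; `pos_`, `tokenize`, and `MultiTurnRunner` are made-up names for illustration, not the actual runner code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

uint64_t forward_one_token(uint64_t token, int64_t start_pos); // as above
std::vector<uint64_t> tokenize(const std::string& prompt);     // hypothetical

class MultiTurnRunner {
 public:
  // Instead of resetting the start position to 0 on every call, continue
  // from where the previous turn ended, so the new prompt is prefilled on
  // top of the KV cache entries from earlier turns.
  void generate(const std::string& prompt, int64_t max_new_tokens) {
    int64_t pos = pos_;  // runner.cpp currently starts at 0 here
    for (uint64_t tok : tokenize(prompt)) {
      last_token_ = forward_one_token(tok, pos++);  // prefill the new turn
    }
    for (int64_t i = 0; i < max_new_tokens; ++i) {
      last_token_ = forward_one_token(last_token_, pos++);  // decode
    }
    pos_ = pos;  // remember the total number of cached tokens
  }

 private:
  int64_t pos_ = 0;        // tokens already in the KV cache across turns
  uint64_t last_token_ = 0;
};
```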
For performance reasons I don't want to feed the whole conversation history to the model again on every turn.
Best,
Christian