Currently we hard-code prompt templates in ExecuTorch LLM apps. But HF tokenizers already know how to apply a model's chat template, e.g.:
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
This makes it very easy to use HF models from Python with the right template. Can we have similar logic in our C++ runners?
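For illustration, here is one shape such an API could take. Everything in this sketch is hypothetical: the `Message` struct, `ChatTemplate` class, and `apply` method are not part of ExecuTorch or its tokenizers library today. A real implementation would evaluate the Jinja template shipped in the model's `tokenizer_config.json` (the `chat_template` field), e.g. with a minimal C++ Jinja engine such as minja (the approach llama.cpp takes), rather than the hard-coded ChatML-style placeholder below.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical API sketch -- none of these names exist in ExecuTorch today.

// One conversation turn, mirroring HF's {"role": ..., "content": ...} dicts.
struct Message {
  std::string role;  // "system", "user", or "assistant"
  std::string content;
};

class ChatTemplate {
 public:
  // jinja_source would come from the "chat_template" field of the model's
  // tokenizer_config.json, which ships alongside the HF tokenizer files.
  explicit ChatTemplate(std::string jinja_source)
      : jinja_source_(std::move(jinja_source)) {}

  // Renders a conversation into a single prompt string, analogous to
  // tokenizer.apply_chat_template(..., tokenize=False) in Python. A real
  // implementation would evaluate jinja_source_ with a Jinja engine; the
  // ChatML-style formatting below is only a stand-in showing the output shape.
  std::string apply(const std::vector<Message>& messages,
                    bool add_generation_prompt = true) const {
    std::string out;
    for (const auto& m : messages) {
      out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_generation_prompt) {
      out += "<|im_start|>assistant\n";  // cue the model to respond
    }
    return out;
  }

 private:
  std::string jinja_source_;
};

int main() {
  ChatTemplate tmpl("{# template text loaded from tokenizer_config.json #}");
  std::string prompt = tmpl.apply(
      {{"user", "Give me a short introduction to large language model."}});
  std::cout << prompt;  // ready to tokenize and feed to the runner
  return 0;
}
```

Wiring something like this into the runner would mean the prompt format travels with the model artifacts instead of being baked into each app, so one binary could serve any HF model whose tokenizer files it loads.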