| title | Simulated Users |
|---|---|
| icon | robot |
| sidebarTitle | Simulated Users |
Evaluating conversational agents typically requires expensive human participants or pre-recorded dialogues that don't adapt to agent behavior. EP can simulate end-users in multi-turn evaluations, enabling full conversational loops without a human in the loop. This is powered by a lightweight user simulator derived from 𝜏²-bench and integrated into EP’s rollout manager.
- Generates realistic user turns based on scenario instructions and global guidelines.
- Interleaves with the agent’s tool-using turns to create full conversations.
- Signals when to stop (e.g., task complete, transfer, or out-of-scope) via a special termination token.
Under the hood, EP uses UserSimulator. Rollout orchestration is handled by ExecutionManager. The simulator:
- Builds a system prompt from global guidelines + your scenario instructions.
- Optionally uses tool schemas to steer requests.
- Provides an `is_stop(...)` check that EP maps to `termination_reason = "user_stop"`.
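
The prompt assembly and stop check listed above can be sketched roughly as follows. This is a minimal illustration, not EP's `UserSimulator`: `GLOBAL_GUIDELINES`, `STOP_TOKENS`, `build_system_prompt`, and this standalone `is_stop` are hypothetical names, and the guideline text and token spellings are placeholders chosen for the example.

```python
from typing import Optional, Sequence

# Placeholder text; the real global guidelines ship with the simulator.
GLOBAL_GUIDELINES = "You are playing a customer. Reveal details only when asked."
# Assumed token spellings, one per stop intent (complete, transfer, out-of-scope).
STOP_TOKENS = ("###STOP###", "###TRANSFER###", "###OUT-OF-SCOPE###")


def build_system_prompt(scenario_instructions: str,
                        tool_names: Optional[Sequence[str]] = None) -> str:
    """Join global guidelines, scenario instructions, and optional tool hints."""
    parts = [GLOBAL_GUIDELINES, "Scenario:\n" + scenario_instructions]
    if tool_names:
        # Tool names steer the simulated user toward requests the agent can serve.
        parts.append("The agent can use these tools: " + ", ".join(tool_names))
    return "\n\n".join(parts)


def is_stop(user_message: str) -> bool:
    """Return True when the simulated user's message carries a stop token."""
    return any(token in user_message for token in STOP_TOKENS)
```

In this sketch, a caller would end the episode and record `"user_stop"` as the termination reason as soon as `is_stop` returns `True` for a simulated user message.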
To turn on the simulator for a row, provide `dataset_info.user_simulation` under `input_metadata` in your `EvaluationRow` (or dataset):

{
  "messages": [
    { "role": "system", "content": "You are an assistant that uses tools." }
  ],
  "input_metadata": {
    "dataset_info": {
      "user_prompt_template": "Observation: {observation}",
      "environment_context": { "seed": 42 },
      "user_simulation": {
        "enabled": true,
        "system_prompt": "You are a shopper trying to find a red jacket under $100.",
        "llm": "gpt-4.1",
        "llm_args": { "temperature": 0.0 }
      }
    }
  }
}

Fields and defaults:

- `enabled`: boolean flag; if `true`, EP uses the simulator for the conversation.
- `system_prompt`: scenario instructions appended to global guidelines.
- `llm`: backing model for the user simulation (default: `gpt-4.1`).
- `llm_args`: sampling args for the simulator (default: `{ "temperature": 0.0 }`).
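
As an illustration of how the documented defaults could be applied, the sketch below resolves a `user_simulation` block from a plain `dataset_info` dictionary. The helper `resolve_user_simulation` is hypothetical and not part of EP. It also makes two assumptions the text above does not confirm: a missing `enabled` flag is treated as disabled, and a user-supplied `llm_args` replaces the default wholesale rather than being merged key by key.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Defaults as documented in the field list above.
DEFAULTS: Dict[str, Any] = {"llm": "gpt-4.1", "llm_args": {"temperature": 0.0}}


def resolve_user_simulation(dataset_info: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the simulator config with defaults filled in, or None if disabled."""
    config = dataset_info.get("user_simulation") or {}
    # Assumption: a missing or false `enabled` flag means no simulation.
    if not config.get("enabled", False):
        return None
    resolved = deepcopy(DEFAULTS)
    # Assumption: explicit values override defaults at the top level only.
    resolved.update(deepcopy(config))
    return resolved
```

For example, passing `{"user_simulation": {"enabled": True, "system_prompt": "Book a table."}}` yields a config that uses `gpt-4.1` at temperature `0.0`.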
When user_simulation.enabled is true:
- EP seeds the conversation with the simulator’s first user message.
- The agent policy receives tool schemas and responds with tool calls or a final answer.
- After each agent turn, the simulator may produce the next user message.
- If the simulator emits a stop intent, EP ends the episode with `termination_reason = "user_stop"`.
Step counting:
- Without simulation: each tool call increments the step counter.
- With simulation: EP increments the step counter after a full agent↔user turn, and records a consolidated control-plane step (reward, termination, tool calls).
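
The flow and step counting described above can be sketched with stub callables in place of real models. Everything here is illustrative rather than EP's `ExecutionManager`: `run_simulated_episode`, `scripted_user`, `echo_agent`, the `###STOP###` token spelling, and the `"max_steps"` label are assumptions for the example, and tool calls, rewards, and control-plane records are omitted.

```python
from typing import Callable, Dict, List, Tuple

ChatMessage = Dict[str, str]
STOP_TOKEN = "###STOP###"  # assumed spelling of the termination token


def run_simulated_episode(
    agent_turn: Callable[[List[ChatMessage]], str],
    user_turn: Callable[[List[ChatMessage]], str],
    max_steps: int,
) -> Tuple[List[ChatMessage], int, str]:
    """Alternate agent and simulated-user turns until a stop token or the step limit."""
    # The simulator's first message seeds the conversation.
    history = [{"role": "user", "content": user_turn([])}]
    steps = 0
    while steps < max_steps:
        history.append({"role": "assistant", "content": agent_turn(history)})
        reply = user_turn(history)
        history.append({"role": "user", "content": reply})
        steps += 1  # one step per full agent<->user exchange
        if STOP_TOKEN in reply:
            return history, steps, "user_stop"
    return history, steps, "max_steps"  # placeholder label for hitting the limit


def scripted_user(history: List[ChatMessage]) -> str:
    """Stub simulator: replays a fixed script, ending with the stop token."""
    script = ["I need a table for two at 7pm.", "Yes, that works.", "Thanks! ###STOP###"]
    spoken = sum(1 for message in history if message["role"] == "user")
    return script[min(spoken, len(script) - 1)]


def echo_agent(history: List[ChatMessage]) -> str:
    """Stub agent: acknowledges the latest user message."""
    return "Understood: " + history[-1]["content"]


history, steps, reason = run_simulated_episode(echo_agent, scripted_user, max_steps=64)
```

With this script the episode ends on the simulated user's third message, after two full agent↔user exchanges.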

import asyncio

import eval_protocol as ep
from eval_protocol.models import EvaluationRow, Message

# One row whose conversation is driven by the simulated user.
rows = [
    EvaluationRow(
        messages=[Message(role="system", content="Use tools to help the user.")],
        input_metadata={
            "dataset_info": {
                "user_prompt_template": "Obs: {observation}",
                "environment_context": {"seed": 7},
                "user_simulation": {
                    "enabled": True,
                    "system_prompt": "Book a table for two tonight at 7pm.",
                    "llm": "gpt-4.1",
                    "llm_args": {"temperature": 0.0}
                }
            }
        },
    )
]

# Connect to the MCP environment and choose the agent policy under evaluation.
envs = ep.make("http://localhost:8000/mcp", evaluation_rows=rows, model_id="my-model")
policy = ep.OpenAIPolicy(model_id="gpt-4o-mini")

async def run():
    # Print why each rollout ended (for example, "user_stop").
    async for row in ep.rollout(envs, policy=policy, steps=64):
        print(row.rollout_status.termination_reason)

# Run from a script; in a notebook, use `await run()` instead.
asyncio.run(run())

- Keep scenario instructions specific and outcome-oriented to guide the simulator.
- Set `temperature` low for reproducible behavior (or use record/playback).
- Use rewards and control-plane summaries to assess task success rather than dialogue length alone.
- Simulator does nothing: ensure `user_simulation.enabled` is `true` and you have at least a system message.
- Episode never ends: check that your environment’s rewards/termination are wired, or set a sensible `steps` limit.
- Unexpected termination: the simulator may have emitted a stop intent; inspect `termination_reason` and conversation history.
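
The first check above can be automated before launching a rollout. The sketch below works on a row in its plain dictionary (JSON) form; `preflight` is a hypothetical helper, not an EP function, and it covers only the two conditions named for the "simulator does nothing" case.

```python
from typing import Any, Dict, List


def preflight(row: Dict[str, Any]) -> List[str]:
    """Return a list of configuration problems that would keep the simulator idle."""
    problems = []
    simulation = (
        row.get("input_metadata", {})
        .get("dataset_info", {})
        .get("user_simulation", {})
    )
    if simulation.get("enabled") is not True:
        problems.append("user_simulation.enabled is not true")
    if not any(message.get("role") == "system" for message in row.get("messages", [])):
        problems.append("no system message in messages")
    return problems
```

An empty list means both conditions hold; any other result names what to fix before running the rollout.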
- User simulation integration in rollouts (ExecutionManager):
- Backing user simulator (𝜏²-bench):
- Convenience facade and types: