ProbeLLM

A PHP testing framework for LLM-powered agents.
Built on top of PHPUnit / Pest.

Installation · Quick Start · Features · ElevenLabs · Cassettes · Providers · License


Why ProbeLLM?

Testing LLM agents is hard. Responses are non-deterministic, API calls are slow and expensive, and tool-calling flows require multi-turn orchestration.

ProbeLLM solves this with:

  • Fluent DSL for writing multi-turn dialog tests
  • ElevenLabs ConvAI simulation testing with evaluation criteria, tool mocks, and dynamic variables
  • Cassette record/replay so tests run offline, fast, and deterministically
  • LLM-as-judge assertions for evaluating response quality with natural language criteria
  • Tool calling support with auto-resolution of tool_call_id
  • Multimodal attachments (images, PDFs, audio) via local files or URLs
  • PHPUnit attributes for declarative test configuration

Installation

composer require probellm/probellm --dev

Requires PHP 8.4+ and ext-curl.

Quick Start

use ProbeLLM\AgentTestCase;
use ProbeLLM\Attributes\AgentSystem;
use ProbeLLM\Attributes\AgentModel;
use ProbeLLM\Attributes\AgentReplayMode;
use ProbeLLM\DSL\AnswerExpectations;
// Also import LLMProvider and OpenAICompatibleProvider from their
// namespaces in your installed version; both are used below.

#[AgentSystem('You are a helpful assistant. Always respond in valid JSON.')]
#[AgentModel('gpt-4o')]
#[AgentReplayMode]
class MyAgentTest extends AgentTestCase
{
    protected function resolveProvider(): LLMProvider
    {
        return OpenAICompatibleProvider::openAI(getenv('OPENAI_API_KEY'));
    }

    public function test_greeting(): void
    {
        $this->dialog()
            ->user('Return JSON with key "greeting" and value "hello".')
            ->answer(function (AnswerExpectations $a) {
                $a->assertJson()
                  ->assertJsonPath('$.greeting', equals: 'hello');
            });
    }
}

First run calls the real API and records cassettes automatically. All subsequent runs use cached responses — instant, no API calls:

./vendor/bin/phpunit

Features

Multi-turn Dialogs

Chain .user() / .answer() / .toolResult() calls to test full conversation flows:

$this->dialog()
    ->user('Return JSON: {"count": 1}')
    ->answer(function (AnswerExpectations $a) {
        $a->assertJson()
          ->assertJsonPath('$.count', equals: 1);
    })
    ->user('Now increment the count.')
    ->answer(function (AnswerExpectations $a) {
        $a->assertJsonPath('$.count', equals: 2);
    });

JSON Assertions

->answer(function (AnswerExpectations $a) {
    $a->assertJson()                                          // valid JSON
      ->assertJsonPath('$.name', equals: 'Alice')             // exact match
      ->assertJsonPath('$.bio', contains: 'engineer')         // substring
      ->assertJsonPath('$.bio', notContains: 'manager')       // negative substring
      ->assertJsonPath('$.items[0].id', notEmpty: true);      // nested array access
})

Tool Calling

Define tools via ToolContract, assert on calls and arguments:

#[AgentTools(SearchTool::class)]
public function test_agent_searches(): void
{
    $this->dialog()
        ->user('Search for "PHP 8.4 features".')
        ->answer(function (AnswerExpectations $a) {
            $a->assertToolCalled('search')
              ->assertToolArgs('search', function (array $args) {
                  self::assertStringContainsString('PHP', $args['query']);
              });
        })
        ->toolResult('search', [
            'results' => [['title' => 'PHP 8.4 Released', 'url' => 'https://php.net']],
        ])
        ->answer(function (AnswerExpectations $a) {
            self::assertNotEmpty($a->lastMessage());
        });
}
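
The SearchTool referenced above implements ToolContract. The interface itself is not shown in this README, so the following is only a hypothetical shape (method names and namespace are assumptions):

use ProbeLLM\Contracts\ToolContract; // namespace assumed

final class SearchTool implements ToolContract
{
    // Hypothetical: a tool exposes a name plus a JSON-Schema argument spec.
    public function name(): string
    {
        return 'search';
    }

    public function schema(): array
    {
        return [
            'type' => 'object',
            'properties' => ['query' => ['type' => 'string']],
            'required' => ['query'],
        ];
    }
}

Since ->toolResult() supplies the tool output manually in the test, the contract plausibly only describes the tool to the model rather than executing it.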

Multimodal Attachments

Send images, PDFs, and audio files alongside user messages:

use ProbeLLM\DTO\Attachment;

$this->dialog()
    ->userWithAttachments('What is in this image?', [
        '/path/to/photo.png',                                    // local file
        Attachment::fromUrl('https://example.com/img.jpg'),      // URL
        Attachment::fromBase64($data, 'image/jpeg'),             // base64
    ])
    ->answer(function (AnswerExpectations $a) {
        $a->assertByPrompt('The response describes the contents of the image');
    });

Supported types: image/*, application/pdf, audio/*.

LLM-as-Judge

Use natural language criteria to evaluate responses:

->answer(function (AnswerExpectations $a) {
    $a->assertJson()
      ->assertByPrompt('The response contains a healthy breakfast suggestion')
      ->assertByPrompt('No excessive sugar is recommended');
})

Judge model and temperature can be configured per-call, per-method, or per-class:

// Per-call override
$a->assertByPrompt('Criteria here', model: 'gpt-4o', temperature: 0.1);

// Via attributes
#[JudgeModel('gpt-4o-mini')]
#[JudgeTemperature(0.0)]

PHPUnit Attributes

Declarative configuration at class or method level:

Attribute                          Scope           Description
#[AgentSystem('...')]              Class / Method  System prompt
#[AgentSystemFile('path')]         Class / Method  System prompt from file
#[AgentModel('gpt-4o')]            Class / Method  Model name
#[AgentTemperature(0.7)]           Class / Method  Sampling temperature
#[AgentTools(SearchTool::class)]   Class / Method  Enable tool calling
#[AgentReplayMode]                 Class / Method  Enforce cassette-only mode
#[JudgeModel('gpt-4o-mini')]       Class / Method  Judge model
#[JudgeTemperature(0.0)]           Class / Method  Judge temperature
#[ElevenLabsAgentId('agent_...')]  Class / Method  ElevenLabs agent ID
#[ElevenLabsAgentId(env: 'VAR')]   Class / Method  Agent ID from env variable
#[ElevenLabsTurnsLimit(20)]        Class / Method  Max simulation turns

Method-level attributes override class-level. Multiple #[AgentSystem] and #[AgentSystemFile] are concatenated.
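
For example (a minimal sketch applying the rules above):

#[AgentSystem('You are a helpful assistant.')]
#[AgentSystem('Always respond in valid JSON.')]   // concatenated with the attribute above
#[AgentModel('gpt-4o')]
class AttributePrecedenceTest extends AgentTestCase
{
    #[AgentModel('gpt-4o-mini')]   // overrides the class-level gpt-4o for this test only
    public function test_with_cheaper_model(): void
    {
        // ...
    }
}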

ElevenLabs ConvAI

Test ElevenLabs conversational AI agents using the simulate-conversation API. ProbeLLM sends a simulated user against your agent and lets you assert on the resulting transcript, tool calls, evaluations, and workflow transfers.

Setup

use ProbeLLM\ElevenLabsTestCase;
use ProbeLLM\Attributes\ElevenLabsAgentId;
use ProbeLLM\Attributes\ElevenLabsTurnsLimit;
use ProbeLLM\DSL\ElevenLabsExpectations;

#[ElevenLabsAgentId(env: 'ELEVENLABS_AGENT_ID')]
#[ElevenLabsTurnsLimit(20)]
class MyVoiceAgentTest extends ElevenLabsTestCase
{
    // Uses ELEVENLABS_API_KEY env var automatically.
    // Override resolveElevenLabsProvider() for custom setup.
}

Simulation Scenario

public function test_greeting(): void
{
    $this->elevenLabs()
        ->withDynamicVariable('companyName', 'Acme Corp')
        ->withUserPrompt('You just called the company, wait for the greeting')
        ->withTurnsLimit(4)
        ->withEvaluation('greeting', 'Agent greeted the user and mentioned the company name')
        ->run(function (ElevenLabsExpectations $e) {
            $e->assertMinTurns(2)
                ->assertAllEvaluationsPassed()
                ->assertByPrompt('The agent greeted the user politely');
        });
}

Dynamic Variables

Pass {{placeholder}} values that your agent's prompt references:

$this->elevenLabs()
    ->withDynamicVariable('companyName', 'Acme Corp')
    ->withDynamicVariable('agentName', 'Sarah')
    ->withDynamicVariables([
        'businessHours' => '9am-5pm',
        'maxDiscount' => 15,
    ])

Tool Mocks

Mock tool responses so the agent's tools return predetermined data during simulation:

$this->elevenLabs()
    ->withToolMock('Create_order', ['status' => 'success', 'request_id' => 'REQ-001'])
    ->withToolMock('Transfer-to-number', ['status' => 'transferred'])

Evaluation Criteria

Define criteria that ElevenLabs evaluates against the conversation:

$this->elevenLabs()
    ->withEvaluation('data_collected', 'Agent collected name, phone, and address')
    ->withEvaluation('lead_created', 'Agent used the Create_order tool')
    ->run(function (ElevenLabsExpectations $e) {
        $e->assertAllEvaluationsPassed();      // all criteria passed
        $e->assertEvaluation('data_collected'); // specific criterion passed
        $e->assertEvaluationFailed('some_id');  // specific criterion failed
        $e->assertEvaluationCount(2);           // expected number of results
    });

ElevenLabs Assertions

Tool assertions

$e->assertToolCalled('Create_order')           // tool was called at least once
  ->assertToolNotCalled('Dangerous_tool')        // tool was NOT called
  ->assertToolCalledTimes('Create_order', 1)    // exact call count
  ->assertToolExecuted('Create_order')          // called AND executed
  ->assertToolCallCount(2)                       // total tool calls
  ->assertNoToolsCalled()                        // no tools called at all
  ->assertToolCallParam('Create_order', 'name', 'John')  // param value
  ->assertToolCallParamContains('Create_order', 'address', 'Maple')
  ->assertToolCallHasParam('Create_order', 'phone')
  ->assertToolArgs('Create_order', function (array $args) {
      self::assertArrayHasKey('name', $args);
  });

Transcript assertions

$e->assertTranscriptContains('hello')            // full transcript contains string
  ->assertTranscriptNotContains('error')          // full transcript does NOT contain
  ->assertTranscriptMatchesRegex('/\d{3}-\d{4}/')
  ->assertAgentSaid('How can I help')             // only agent messages
  ->assertAgentNeverSaid('I am an AI')
  ->assertFirstAgentMessage('Welcome')
  ->assertLastAgentMessage('Goodbye')
  ->assertTranscriptRole(0, 'agent')              // role at index
  ->assertTranscriptContent(0, 'exact text')      // content at index
  ->assertMinTurns(4)                             // at least N entries
  ->assertMaxTurns(20);                           // at most N entries

Workflow / transfer assertions

$e->assertAgentHandled('agent_abc123')            // agent appeared in transcript
  ->assertTransferredToAgent('agent_xyz789')       // conversation transferred
  ->assertWorkflowNodeReached('node_qualifier')
  ->assertAgentCount(2);                           // number of unique agents

Analysis assertions

$e->assertCallSuccessful()                        // analysis.call_successful = "success"
  ->assertTranscriptSummaryContains('booked');     // analysis.transcript_summary contains

LLM Judge

$e->assertByPrompt('The agent collected all required information before creating the request');

Requires a judge provider. Override resolveJudgeProvider() in your test case or set LLM_API_KEY / LLM_BASE_URL env vars (auto-configured in ElevenLabsTestCase).

Cassette System

Cassettes record LLM responses to JSON files in tests/cassettes/. Each cassette is keyed by a SHA256 hash of all inputs (system prompt, messages, model, temperature, tools, test name, turn index) — any change produces a new key.
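
Conceptually, the key derivation looks like this (a minimal sketch; the actual field layout is internal to ProbeLLM and may differ):

// Illustrative only: every input that can change the LLM response
// feeds the hash, so any change produces a fresh cassette key.
$key = hash('sha256', json_encode([
    'system'      => $systemPrompt,
    'messages'    => $messages,
    'model'       => $model,
    'temperature' => $temperature,
    'tools'       => $toolDefinitions,
    'test'        => $testName,
    'turn'        => $turnIndex,
]));

$cassettePath = "tests/cassettes/{$key}.json";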

Decision logic per turn:

Cassette exists?  Replay mode?  Result
Yes               -             Load from cassette
No                Yes           Call API, save cassette
No                No            Call API (no caching)

ElevenLabs simulations use the same cassette system. The hash is computed from agent ID, user prompt, first message, tool mocks, evaluation criteria, turns limit, dynamic variables, and test name.

Providers

OpenAI-compatible (OpenAI, OpenRouter, Groq, Together, Ollama, etc.)

protected function resolveProvider(): LLMProvider
{
    return new OpenAICompatibleProvider(
        apiKey: getenv('LLM_API_KEY'),
        baseUrl: 'https://api.openai.com/v1',
    );

    // Or use factory methods:
    // return OpenAICompatibleProvider::openAI(getenv('OPENAI_API_KEY'));
    // return OpenAICompatibleProvider::openRouter(getenv('OPENROUTER_API_KEY'));
}

Anthropic (Claude)

protected function resolveProvider(): LLMProvider
{
    return new AnthropicProvider(apiKey: getenv('ANTHROPIC_API_KEY'));
}

ElevenLabs ConvAI

protected function resolveElevenLabsProvider(): ElevenLabsConvaiProvider
{
    return new ElevenLabsProvider(apiKey: getenv('ELEVENLABS_API_KEY'));
}

Separate judge provider

protected function resolveJudgeProvider(): ?LLMProvider
{
    return new AnthropicProvider(apiKey: getenv('ANTHROPIC_API_KEY'));
}

Exception Hierarchy

All exceptions extend ProbeLLMException (which extends RuntimeException), so you can catch them granularly or broadly:

Exception                 When
CassetteMissingException  Replay mode, cassette not found
ProviderException         HTTP/curl errors from LLM API
InvalidResponseException  Invalid JSON from provider or judge
ToolResolutionException   Tool class issues, missing tool_call_id
ConfigurationException    Missing ext-curl, file not found, no provider configured
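
For example, catching granularly first and broadly as a fallback (exception namespaces are assumptions):

use ProbeLLM\Exceptions\CassetteMissingException; // namespace assumed
use ProbeLLM\Exceptions\ProbeLLMException;        // namespace assumed

try {
    $this->dialog()
        ->user('Hello')
        ->answer(fn (AnswerExpectations $a) => $a->assertJson());
} catch (CassetteMissingException $e) {
    // Granular: replay mode expected a cassette that was never recorded.
    self::markTestIncomplete('Record cassettes first: ' . $e->getMessage());
} catch (ProbeLLMException $e) {
    // Broad: any other ProbeLLM failure (provider, configuration, ...).
    self::fail($e->getMessage());
}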

Environment Variables

Variable             Description
LLM_API_KEY          API key for your LLM provider
LLM_BASE_URL         Provider endpoint (default: https://api.openai.com/v1)
ELEVENLABS_API_KEY   API key for ElevenLabs ConvAI
ELEVENLABS_AGENT_ID  Default agent ID for ElevenLabs tests
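
For example, in a CI job (values are placeholders):

export LLM_API_KEY="sk-..."
export LLM_BASE_URL="https://api.openai.com/v1"
export ELEVENLABS_API_KEY="..."
export ELEVENLABS_AGENT_ID="agent_..."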

License

MIT
