@MichaelAnders (Contributor)

Summary

  • Add SUGGESTION_MODE_MODEL env var to skip suggestion mode LLM calls entirely or redirect them to a lighter model
  • Add hallucination guard to drop tool calls when no tools were sent to the model

Problem

Each user message triggers three concurrent runAgentLoop calls (main, suggestion, and others). Suggestion mode calls are useful, but they cost real LLM execution time (with large local models via Ollama, 30-90 seconds of GPU time per call) and, on hosted backends, can incur billable token usage. That compute could instead be serving the main request.

Changes

  • src/orchestrator/index.js: Added detectSuggestionMode(), which scans the last user message for the [SUGGESTION MODE: marker used by Claude Code's CLI and tags requests with _requestMode ("suggestion" vs. "main") in sanitizePayload (a sketch follows this list). When SUGGESTION_MODE_MODEL=none, suggestion requests return an empty response immediately without calling the LLM; when set to a model name, they use that lighter model instead of the default, freeing GPU for the main request
  • src/orchestrator/index.js: Added a hallucination guard that drops tool calls when no tools were sent to the model; some models (e.g. Llama 3.1) hallucinate tool_call blocks from conversation history even when the request had zero tool definitions (sketched after the Testing section)
  • src/config/index.js: Added SUGGESTION_MODE_MODEL config (default: "default")
  • src/clients/databricks.js: Minor adjustment to pass the suggestion mode model config through
  • .env.example: Documented the new env var with usage examples
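
A minimal sketch of the detection and tagging path, assuming an OpenAI-style messages array; the exact message shape and the surrounding sanitizePayload body are illustrative assumptions, not the actual implementation:

// Sketch only: message shape and helper wiring are assumptions.
function detectSuggestionMode(messages) {
  // Scan the last user message for the marker Claude Code's CLI adds.
  const lastUser = [...messages].reverse().find((m) => m.role === 'user');
  if (!lastUser) return false;
  // Content may be a plain string or an array of content blocks.
  const text = typeof lastUser.content === 'string'
    ? lastUser.content
    : lastUser.content.map((block) => block.text ?? '').join('\n');
  return text.includes('[SUGGESTION MODE:');
}

function sanitizePayload(payload) {
  // ...existing sanitization...
  payload._requestMode = detectSuggestionMode(payload.messages)
    ? 'suggestion'
    : 'main';
  return payload;
}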

Configuration

# Skip suggestion mode entirely
SUGGESTION_MODE_MODEL=none

# Redirect to a lighter model
SUGGESTION_MODE_MODEL=llama3.2:1b

# Default behavior (unchanged)
SUGGESTION_MODE_MODEL=default
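
For reference, the skip/override branch in the agent loop looks roughly like the sketch below; config.suggestionModeModel and the empty-response shape are assumptions for illustration:

// Hedged sketch: the real runAgentLoop does more; only the suggestion
// mode branch is shown, and the empty-response shape is an assumption.
async function runAgentLoop(payload, config) {
  if (payload._requestMode === 'suggestion') {
    const override = config.suggestionModeModel; // SUGGESTION_MODE_MODEL
    if (override === 'none') {
      // Skip the LLM call entirely; the caller gets an empty completion.
      return { role: 'assistant', content: '' };
    }
    if (override && override !== 'default') {
      // Redirect to the lighter model; 'default' (or unset) changes nothing.
      payload.model = override;
    }
  }
  // ...existing loop: call the model, handle tool calls, stream results...
}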

Testing

  • SUGGESTION_MODE_MODEL=none: suggestion calls return instantly (0ms)
  • SUGGESTION_MODE_MODEL=llama3.2:1b: redirects to the lighter model correctly
  • Default/unset: behavior unchanged
  • Main agent loop unaffected in all cases
  • Hallucination guard tested with Llama 3.1 no-tool requests (sketched below)
  • npm run test:unit passes with no regressions
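
The guard itself reduces to a small post-processing step. This sketch assumes an OpenAI-style chat-completions response; the field names (choices, tool_calls, finish_reason) are assumptions about the shapes this codebase handles:

// Sketch only: drops tool calls when the request offered no tools.
function stripHallucinatedToolCalls(response, requestTools) {
  const hadTools = Array.isArray(requestTools) && requestTools.length > 0;
  if (hadTools) return response; // tools were offered, so tool calls are legitimate
  for (const choice of response.choices ?? []) {
    if (choice.message?.tool_calls?.length) {
      // Models like Llama 3.1 can invent tool_call blocks from history
      // even when the request carried zero tool definitions.
      delete choice.message.tool_calls;
      choice.finish_reason = 'stop';
    }
  }
  return response;
}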
