
Conversation


@DBOYttt DBOYttt commented Nov 24, 2025

  • Implement dynamic model discovery from OpenAI API with 1-hour caching
  • Filter models to only include vision-capable models (GPT-4o, GPT-4 variants)
  • Exclude O1/O3 models that don't support image inputs
  • Add OpenAIModule import to TasksModule for dependency injection
  • Make model selector scrollable in UI (max-height: 300px)

This fixes task execution failures when using non-vision models like O3-mini with computer-use agents that send screenshots.

This pull request enhances how available OpenAI models are managed and surfaced in the Bytebot agent. The most significant changes are dynamically fetching and caching OpenAI models that support vision (image inputs), improved fallback logic, and an updated models list that prioritizes relevant options. There are also minor UI improvements to the select dropdown component.

Dynamic OpenAI Model Management:

  • Added a new method getAvailableModels in OpenAIService to fetch available models from the OpenAI API, filter for those supporting vision, cache them for one hour, and provide a fallback to a hardcoded list if needed. This ensures the agent always offers up-to-date and relevant model options.
  • Updated the hardcoded OPENAI_MODELS list to include only models that support vision (image input), with revised names, titles, and context windows.
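The fetch-filter-cache-fallback flow described in these bullets can be sketched as follows. This is an illustrative sketch, not the PR's code: the names (getAvailableModels, FALLBACK_MODELS, isVisionCapable) and the injected listModels fetcher are assumptions; the actual service calls the OpenAI client directly.

```typescript
// Illustrative sketch of the described flow; names and shapes are assumptions.
interface ModelInfo {
  id: string;
}

const FALLBACK_MODELS = ['gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo', 'gpt-4'];
const CACHE_TTL_MS = 60 * 60 * 1000; // one hour, per the PR description

let cachedModels: string[] | null = null;
let cachedAt = 0;

// Vision filter from the PR: keep gpt-* models, drop gpt-3.5 and instruct
// variants; o1/o3 ids never start with "gpt-", so they fall out here too.
function isVisionCapable(id: string): boolean {
  return id.startsWith('gpt-') && !id.startsWith('gpt-3.5') && !id.includes('instruct');
}

async function getAvailableModels(
  listModels: () => Promise<ModelInfo[]>, // e.g. a wrapper around the models API
): Promise<string[]> {
  if (cachedModels && Date.now() - cachedAt < CACHE_TTL_MS) {
    return cachedModels; // serve the cached list within the TTL
  }
  try {
    const models = await listModels();
    cachedModels = models.map((m) => m.id).filter(isVisionCapable);
    cachedAt = Date.now();
    return cachedModels;
  } catch {
    return FALLBACK_MODELS; // API unreachable: fall back to the hardcoded list
  }
}
```

Note that with this shape a warm cache also masks transient API failures: within the TTL, a failing fetch is never attempted at all.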

Integration with Task Controller:

  • Modified the TasksController to fetch OpenAI models dynamically using the new getAvailableModels method, with a fallback to the hardcoded list if fetching fails. Models from other providers are still included based on API key presence.
  • Updated the TasksModule to import OpenAIModule so that OpenAIService can be injected into TasksController.

UI Improvement:

  • Improved the select dropdown in SelectContent by limiting its maximum height and enabling vertical scrolling, enhancing usability when many models are available.

DBOYttt and others added 2 commits November 24, 2025 17:29
- Implement dynamic model discovery from OpenAI API with 1-hour caching
- Filter models to only include vision-capable models (GPT-4o, GPT-4 variants)
- Exclude O1/O3 models that don't support image inputs
- Add OpenAIModule import to TasksModule for dependency injection
- Make model selector scrollable in UI (max-height: 300px)

This fixes task execution failures when using non-vision models like O3-mini
with computer-use agents that send screenshots.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create cursor-overlay.ts utility with SVG-based cursor generation
- Modify screendump() to capture cursor position and overlay cursor
- Cursor is rendered as black arrow with white outline for visibility
- Fallback to screenshot without cursor if overlay fails

This enables users to see the mouse position in screenshots sent to the API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
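The SVG-cursor idea from this commit can be sketched as below. The function name and the polygon coordinates are illustrative, not the PR's actual cursor shape; the commit then composites such an SVG onto the screenshot buffer, falling back to the plain screenshot if the overlay fails.

```typescript
// Illustrative sketch: build a black arrow cursor with a white outline as SVG.
// The exact arrow coordinates here are assumptions, not the PR's cursor.
function createCursorSvg(size: number = 24): string {
  const s = size / 24; // scale a 24px reference shape to the requested size
  const points = [
    [0, 0], [0, 16], [4, 12], [7, 19], [10, 18], [7, 11], [13, 11],
  ]
    .map(([x, y]) => `${x * s},${y * s}`)
    .join(' ');
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${size}" height="${size}">` +
    `<polygon points="${points}" fill="black" stroke="white" stroke-width="1.5"/>` +
    `</svg>`
  );
}
```

The white stroke around the black fill is what keeps the cursor visible on both light and dark regions of the screenshot.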
Copilot AI review requested due to automatic review settings December 3, 2025 13:22

Copilot AI left a comment


Pull request overview

This PR enhances the Bytebot agent system by implementing dynamic OpenAI model discovery with vision-capability filtering, addressing runtime failures when non-vision models are used with computer-use agents that send screenshots. The changes include a new model fetching service with 1-hour caching, improved UI scrollability for model selection, and cursor overlay functionality for screenshots.

Key Changes:

  • Dynamic OpenAI model fetching with intelligent filtering for vision-capable models (GPT-4 variants) and 1-hour result caching
  • Hardcoded fallback model list updated to include only vision-capable models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4)
  • Screenshot cursor overlay feature to draw mouse cursor position on captured screenshots

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Per-file summary:

  • packages/bytebot-agent/src/openai/openai.service.ts: Adds a getAvailableModels() method to fetch and cache OpenAI models from the API, plus helper methods for model title formatting and context window estimation
  • packages/bytebot-agent/src/openai/openai.constants.ts: Updates the hardcoded model list to include only vision-capable GPT-4 variants as a fallback
  • packages/bytebot-agent/src/tasks/tasks.controller.ts: Integrates dynamic model fetching with error handling and fallback to the hardcoded models
  • packages/bytebot-agent/src/tasks/tasks.module.ts: Imports OpenAIModule to enable OpenAIService dependency injection
  • packages/bytebot-ui/src/components/ui/select.tsx: Adds max-height and scrolling to the select dropdown for better UX with many models
  • packages/bytebotd/src/nut/cursor-overlay.ts: New file implementing cursor image creation and overlay functionality
  • packages/bytebotd/src/nut/nut.service.ts: Updates the screendump method to optionally overlay the cursor position on screenshots


* @param cursorSize The size of the cursor (default 24)
* @returns A Buffer containing the screenshot with cursor overlay
*/
export async function overlayeCursorOnScreenshot(

Copilot AI Dec 3, 2025


Typo in function name: 'overlaye' should be 'overlay'. The function should be named overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.

Suggested change
export async function overlayeCursorOnScreenshot(
export async function overlayCursorOnScreenshot(

const models = modelsList.data;

// Filter for relevant chat models that support vision (images/screenshots)
// Exclude O1 and O3 models as they don't support image inputs

Copilot AI Dec 3, 2025


The comment states "Exclude O1 and O3 models as they don't support image inputs", but the filtering logic below only checks for models starting with 'gpt-'. O1 and O3 models (which would have IDs like 'o1-...' or 'o3-...') are already implicitly excluded by the first filter condition model.id.startsWith('gpt-'). The comment should be clarified to explain that O1/O3 models are excluded because they don't start with 'gpt-', or the comment should be removed if it's redundant.

Suggested change
// Exclude O1 and O3 models as they don't support image inputs
// Only include models whose IDs start with 'gpt-' (O1 and O3 models are excluded by this filter)

Comment on lines +140 to +141
if (modelId.includes('o1')) return 128000;
if (modelId.includes('o3')) return 200000;

Copilot AI Dec 3, 2025


The getContextWindow method includes logic for 'o1' and 'o3' models (lines 140-141), but these models are filtered out in getAvailableModels (line 76) because they don't start with 'gpt-'. Since these models are intentionally excluded from the available models list, this dead code should be removed to avoid confusion.

Suggested change
if (modelId.includes('o1')) return 128000;
if (modelId.includes('o3')) return 200000;

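After removing the dead o1/o3 branches the reviewer flags, the context-window heuristic might reduce to something like the sketch below. The function name matches the one under review, but the exact mapping is an assumption; the numeric limits reflect commonly documented values (128k for gpt-4o and gpt-4-turbo, 8,192 for base gpt-4).

```typescript
// Sketch of the context-window heuristic without the unreachable o1/o3 cases.
// The mapping is illustrative; limits are commonly documented values.
function getContextWindow(modelId: string): number {
  if (modelId.includes('gpt-4o')) return 128000;
  if (modelId.includes('gpt-4-turbo')) return 128000;
  if (modelId.includes('gpt-4')) return 8192; // base gpt-4
  return 128000; // default for other gpt-* models that pass the filter
}
```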
Comment on lines +121 to +130
      // Convert model IDs like "gpt-4o-mini" to "GPT-4o Mini"
      return modelId
        .split('-')
        .map((part) => {
          if (part === 'gpt') return 'GPT';
          if (part.match(/^\d/)) return part; // Keep numbers as-is
          return part.charAt(0).toUpperCase() + part.slice(1);
        })
        .join('-')
        .replace(/-/g, ' ');

Copilot AI Dec 3, 2025


[nitpick] The formatModelTitle method transforms model IDs like "gpt-4o-mini" to "GPT 4o Mini". However, this transformation logic may not handle all edge cases correctly. For example, "gpt-4o-2024-05-13" would become "GPT 4o 2024 05 13" with spaces between date components. Consider adding test cases or documentation for expected behavior with dated model IDs.

Suggested change

      // Convert model IDs like "gpt-4o-mini" to "GPT-4o Mini"
      return modelId
        .split('-')
        .map((part) => {
          if (part === 'gpt') return 'GPT';
          if (part.match(/^\d/)) return part; // Keep numbers as-is
          return part.charAt(0).toUpperCase() + part.slice(1);
        })
        .join('-')
        .replace(/-/g, ' ');

      // Convert model IDs like "gpt-4o-mini" to "GPT 4o Mini"
      // If the model ID ends with a date (e.g., "2024-05-13"), keep the date together as a single part
      const parts = modelId.split('-');
      // Check if the last three parts form a date (YYYY-MM-DD)
      const len = parts.length;
      let formattedParts: string[];
      if (
        len >= 3 &&
        /^\d{4}$/.test(parts[len - 3]) &&
        /^\d{2}$/.test(parts[len - 2]) &&
        /^\d{2}$/.test(parts[len - 1])
      ) {
        // Group the last three parts as a date
        const datePart = `${parts[len - 3]}-${parts[len - 2]}-${parts[len - 1]}`;
        formattedParts = parts.slice(0, len - 3).concat([datePart]);
      } else {
        formattedParts = parts;
      }
      return formattedParts
        .map((part) => {
          if (part === 'gpt') return 'GPT';
          if (part.match(/^\d/)) return part; // Keep numbers as-is
          // If part is a date (YYYY-MM-DD), keep as-is
          if (/^\d{4}-\d{2}-\d{2}$/.test(part)) return part;
          return part.charAt(0).toUpperCase() + part.slice(1);
        })
        .join(' ');

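The reviewer's date-grouping suggestion can be condensed into a single pass. This is a sketch of the same idea, not the suggested code verbatim; the function name matches the method under review, but the implementation below is illustrative.

```typescript
// Sketch: format "gpt-4o-mini" as "GPT 4o Mini", keeping a trailing
// YYYY-MM-DD date (e.g. "gpt-4o-2024-05-13") together as one token.
function formatModelTitle(modelId: string): string {
  const parts = modelId.split('-');
  const out: string[] = [];
  for (let i = 0; i < parts.length; i++) {
    // Re-join a YYYY-MM-DD run into a single token
    if (
      /^\d{4}$/.test(parts[i]) &&
      /^\d{2}$/.test(parts[i + 1] ?? '') &&
      /^\d{2}$/.test(parts[i + 2] ?? '')
    ) {
      out.push(`${parts[i]}-${parts[i + 1]}-${parts[i + 2]}`);
      i += 2;
      continue;
    }
    if (parts[i] === 'gpt') {
      out.push('GPT');
    } else if (/^\d/.test(parts[i])) {
      out.push(parts[i]); // keep numeric fragments like "4o" as-is
    } else {
      out.push(parts[i].charAt(0).toUpperCase() + parts[i].slice(1));
    }
  }
  return out.join(' ');
}
```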
Comment on lines +75 to +78
        (model) =>
          model.id.startsWith('gpt-') &&
          !model.id.startsWith('gpt-3.5') && // Exclude GPT-3.5 (no vision support)
          !model.id.includes('instruct'), // Exclude instruct models

Copilot AI Dec 3, 2025


The filter assumes all models starting with 'gpt-' (except gpt-3.5) support vision. However, not all GPT-4 models may support vision (e.g., older base 'gpt-4' vs 'gpt-4-vision-preview' or 'gpt-4-turbo'). Consider using a more explicit allowlist of known vision-capable model patterns (e.g., contains 'gpt-4o', 'gpt-4-turbo', 'gpt-4-vision', 'gpt-4v') or checking model capabilities via the OpenAI API if available. This would prevent non-vision models from being included and causing runtime errors when screenshots are sent.

Suggested change

        (model) =>
          model.id.startsWith('gpt-') &&
          !model.id.startsWith('gpt-3.5') && // Exclude GPT-3.5 (no vision support)
          !model.id.includes('instruct'), // Exclude instruct models

        (model) => {
          // Only include known vision-capable models
          const id = model.id;
          // Add to this list as new vision-capable models are released
          return (
            (id.includes('gpt-4o') ||
              id.includes('gpt-4-turbo') ||
              id.includes('gpt-4-vision') ||
              id.includes('gpt-4v')) &&
            !id.includes('instruct')
          );
        },

} from '@nut-tree-fork/nut-js';
import { spawn } from 'child_process';
import * as path from 'path';
import { overlayeCursorOnScreenshot } from './cursor-overlay';

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in imported function name: 'overlaye' should be 'overlay'. The import should be overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.

Suggested change
import { overlayeCursorOnScreenshot } from './cursor-overlay';
import { overlayCursorOnScreenshot } from './cursor-overlay';

    // Overlay cursor if position was captured
    if (includeCursor && cursorPosition) {
      try {
        const withCursor = await overlayeCursorOnScreenshot(

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in function call: 'overlaye' should be 'overlay'. The function call should be overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.

Comment on lines +90 to +92
if (name.includes('gpt-4.1')) return 1;
if (name.includes('gpt-4')) return 2;
if (name.includes('gpt-5')) return 3;

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort priority includes 'gpt-4.1' at priority 1 and 'gpt-5' at priority 3, but according to the filter logic (line 76), only models starting with 'gpt-' and excluding 'gpt-3.5' are included. Since GPT-4.1 and GPT-5 are hypothetical future models that may not exist yet, consider whether these priority cases are necessary. If they are intended for future-proofing, a comment explaining this would be helpful.

Comment on lines +53 to +55
// Ensure cursor position is within screenshot bounds
const safeX = Math.max(0, Math.min(cursorX, width - 1));
const safeY = Math.max(0, Math.min(cursorY, height - 1));

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cursor position bounds check should account for the cursor size to prevent the cursor from being clipped at the edges. Currently, Math.min(cursorX, width - 1) allows the cursor to be placed at width - 1, but since the cursor has a size (default 24px), part of it will extend beyond the image boundary. Consider using Math.min(cursorX, width - cursorSize) and Math.min(cursorY, height - cursorSize) instead.

Suggested change

    // Ensure cursor position is within screenshot bounds
    const safeX = Math.max(0, Math.min(cursorX, width - 1));
    const safeY = Math.max(0, Math.min(cursorY, height - 1));

    // Ensure cursor position is within screenshot bounds (account for cursor size)
    const safeX = Math.max(0, Math.min(cursorX, width - cursorSize));
    const safeY = Math.max(0, Math.min(cursorY, height - cursorSize));

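The suggested clamping amounts to a small pure helper, sketched here with assumed names; clamping to width - cursorSize keeps the entire cursor glyph inside the frame rather than just its hotspot.

```typescript
// Sketch of the reviewer's suggestion: clamp so the whole cursor fits in frame.
function clampCursor(
  cursorX: number,
  cursorY: number,
  width: number,
  height: number,
  cursorSize: number = 24,
): { x: number; y: number } {
  return {
    x: Math.max(0, Math.min(cursorX, width - cursorSize)),
    y: Math.max(0, Math.min(cursorY, height - cursorSize)),
  };
}
```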
DBOYttt and others added 5 commits December 6, 2025 14:54
The Microsoft APT repository was unreliable, causing build failures.
Changed to download .deb package directly from code.visualstudio.com
for both amd64 and arm64 architectures.
- Add -cursor arrow -cursorpos flags to x11vnc configuration
- Enable showDotCursor in react-vnc VncViewer component
- Ensures cursor is visible in live desktop preview

Fixes issue where cursor was not visible to the AI agent during
task execution, causing it to get stuck on positioning.
- Add logic to parse model name and determine provider (openai/anthropic/google)
- Handle model names stored as strings in database
- Fallback to OpenAI's available models list for unknown models

Fixes "No service found for model provider: undefined" error that
prevented task execution.
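The provider-inference logic this commit describes might look like the sketch below. The prefix checks are assumptions based on common model naming, not the commit's exact code; the commit also falls back to OpenAI's available-models list for names that match nothing.

```typescript
// Sketch: infer the provider from a model name stored as a plain string.
// Prefixes are assumptions based on common naming conventions.
type Provider = 'openai' | 'anthropic' | 'google' | 'unknown';

function inferProvider(modelName: string): Provider {
  if (
    modelName.startsWith('gpt-') ||
    modelName.startsWith('o1') ||
    modelName.startsWith('o3')
  ) {
    return 'openai';
  }
  if (modelName.startsWith('claude')) return 'anthropic';
  if (modelName.startsWith('gemini')) return 'google';
  return 'unknown'; // caller can fall back to a lookup in the models list
}
```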
- Add instructions about cursor visibility in screenshots
- Remind agent to use computer_cursor_position when having trouble
- Discourage repeatedly clicking same coordinates if not working

Helps agent handle positioning issues more intelligently.
- Handle both string and object formats for task.model
- Check type before attempting to parse model name
- Use proper TypeScript casting through unknown

This fixes the "modelName.startsWith is not a function" error
that was causing immediate task failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
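The string-vs-object narrowing this commit describes can be sketched as follows; the ModelRecord shape and function name are assumptions for illustration.

```typescript
// Sketch: accept task.model as either a plain string or an object with a
// name field, guarding types before calling string methods on the result.
interface ModelRecord {
  name: string;
}

function resolveModelName(model: unknown): string {
  if (typeof model === 'string') return model; // stored as a string in the DB
  if (model && typeof model === 'object') {
    const name = (model as Partial<ModelRecord>).name;
    if (typeof name === 'string') return name;
  }
  throw new Error('Unrecognized task.model shape');
}
```

Checking `typeof` before use is what prevents errors like "modelName.startsWith is not a function" when the stored value is not the expected shape.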