Add dynamic OpenAI model fetching with vision-only filtering #173
base: main
Conversation
- Implement dynamic model discovery from OpenAI API with 1-hour caching
- Filter models to only include vision-capable models (GPT-4o, GPT-4 variants)
- Exclude O1/O3 models that don't support image inputs
- Add OpenAIModule import to TasksModule for dependency injection
- Make model selector scrollable in UI (max-height: 300px)

This fixes task execution failures when using non-vision models like O3-mini with computer-use agents that send screenshots.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
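The fetch-filter-cache flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's exact service code: the `fetchModels` callback and `CachedModels` shape are hypothetical names standing in for the real OpenAI client call.

```typescript
// Minimal sketch of 1-hour model caching with vision-only filtering.
interface CachedModels {
  models: string[];
  fetchedAt: number;
}

const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour
let cache: CachedModels | null = null;

async function getAvailableModels(
  fetchModels: () => Promise<string[]>, // e.g. wraps openai.models.list()
  fallback: string[],
): Promise<string[]> {
  const now = Date.now();
  if (cache && now - cache.fetchedAt < CACHE_TTL_MS) {
    return cache.models; // serve the cached result within the TTL
  }
  try {
    const all = await fetchModels();
    // Keep only vision-capable chat models; O1/O3 ids don't start with "gpt-",
    // so they fall out of this filter implicitly.
    const models = all.filter(
      (id) => id.startsWith('gpt-') && !id.startsWith('gpt-3.5'),
    );
    cache = { models, fetchedAt: now };
    return models;
  } catch {
    return fallback; // fall back to the hardcoded list on API failure
  }
}
```

On a cache hit the API is not contacted at all, which is why the fallback list only matters on the first fetch of each hour.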
- Create cursor-overlay.ts utility with SVG-based cursor generation - Modify screendump() to capture cursor position and overlay cursor - Cursor is rendered as black arrow with white outline for visibility - Fallback to screenshot without cursor if overlay fails This enables users to see the mouse position in screenshots sent to the API. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
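The SVG cursor generation described above can be sketched like this. The path data and function name are illustrative assumptions, not the PR's actual implementation; the key idea is a black arrow with a white outline so it stays visible on any background.

```typescript
// Sketch of an SVG arrow cursor like the one cursor-overlay.ts generates.
// The arrow path is illustrative; the PR's exact shape may differ.
function createCursorSvg(size: number = 24): string {
  return [
    `<svg xmlns="http://www.w3.org/2000/svg" width="${size}" height="${size}" viewBox="0 0 24 24">`,
    // Black arrow with a white outline for contrast on any background.
    `<path d="M4 2 L4 20 L9 15 L12 22 L15 21 L12 14 L19 14 Z" fill="black" stroke="white" stroke-width="1.5"/>`,
    `</svg>`,
  ].join('');
}
```

The resulting SVG string can then be composited onto the screenshot buffer at the captured cursor coordinates with an image library (the PR's overlay function appears to do exactly this step).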
Pull request overview
This PR enhances the Bytebot agent system by implementing dynamic OpenAI model discovery with vision-capability filtering, addressing runtime failures when non-vision models are used with computer-use agents that send screenshots. The changes include a new model fetching service with 1-hour caching, improved UI scrollability for model selection, and cursor overlay functionality for screenshots.
Key Changes:
- Dynamic OpenAI model fetching with intelligent filtering for vision-capable models (GPT-4 variants) and 1-hour result caching
- Hardcoded fallback model list updated to include only vision-capable models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4)
- Screenshot cursor overlay feature to draw mouse cursor position on captured screenshots
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| packages/bytebot-agent/src/openai/openai.service.ts | Adds getAvailableModels() method to fetch and cache OpenAI models from API, plus helper methods for model title formatting and context window estimation |
| packages/bytebot-agent/src/openai/openai.constants.ts | Updates hardcoded model list to include only vision-capable GPT-4 variants as fallback |
| packages/bytebot-agent/src/tasks/tasks.controller.ts | Integrates dynamic model fetching with error handling and fallback to hardcoded models |
| packages/bytebot-agent/src/tasks/tasks.module.ts | Imports OpenAIModule to enable OpenAIService dependency injection |
| packages/bytebot-ui/src/components/ui/select.tsx | Adds max-height and scrolling to select dropdown for better UX with many models |
| packages/bytebotd/src/nut/cursor-overlay.ts | New file implementing cursor image creation and overlay functionality |
| packages/bytebotd/src/nut/nut.service.ts | Updates screendump method to optionally overlay cursor position on screenshots |
```typescript
 * @param cursorSize The size of the cursor (default 24)
 * @returns A Buffer containing the screenshot with cursor overlay
 */
export async function overlayeCursorOnScreenshot(
```
Copilot AI commented on Dec 3, 2025:
Typo in function name: 'overlaye' should be 'overlay'. The function should be named overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.
Suggested change:

```diff
-export async function overlayeCursorOnScreenshot(
+export async function overlayCursorOnScreenshot(
```
```typescript
      const models = modelsList.data;

      // Filter for relevant chat models that support vision (images/screenshots)
      // Exclude O1 and O3 models as they don't support image inputs
```
Copilot AI commented on Dec 3, 2025:
The comment states "Exclude O1 and O3 models as they don't support image inputs", but the filtering logic below only checks for models starting with 'gpt-'. O1 and O3 models (which would have IDs like 'o1-...' or 'o3-...') are already implicitly excluded by the first filter condition model.id.startsWith('gpt-'). The comment should be clarified to explain that O1/O3 models are excluded because they don't start with 'gpt-', or the comment should be removed if it's redundant.
Suggested change:

```diff
-      // Exclude O1 and O3 models as they don't support image inputs
+      // Only include models whose IDs start with 'gpt-' (O1 and O3 models are excluded by this filter)
```
```typescript
    if (modelId.includes('o1')) return 128000;
    if (modelId.includes('o3')) return 200000;
```
Copilot AI commented on Dec 3, 2025:
The getContextWindow method includes logic for 'o1' and 'o3' models (lines 140-141), but these models are filtered out in getAvailableModels (line 76) because they don't start with 'gpt-'. Since these models are intentionally excluded from the available models list, this dead code should be removed to avoid confusion.
Suggested change (remove the dead branches):

```diff
-    if (modelId.includes('o1')) return 128000;
-    if (modelId.includes('o3')) return 200000;
```
```typescript
    // Convert model IDs like "gpt-4o-mini" to "GPT-4o Mini"
    return modelId
      .split('-')
      .map((part) => {
        if (part === 'gpt') return 'GPT';
        if (part.match(/^\d/)) return part; // Keep numbers as-is
        return part.charAt(0).toUpperCase() + part.slice(1);
      })
      .join('-')
      .replace(/-/g, ' ');
```
Copilot AI commented on Dec 3, 2025:
[nitpick] The formatModelTitle method transforms model IDs like "gpt-4o-mini" to "GPT 4o Mini". However, this transformation logic may not handle all edge cases correctly. For example, "gpt-4o-2024-05-13" would become "GPT 4o 2024 05 13" with spaces between date components. Consider adding test cases or documentation for expected behavior with dated model IDs.
Suggested change:

```diff
-    // Convert model IDs like "gpt-4o-mini" to "GPT-4o Mini"
-    return modelId
-      .split('-')
-      .map((part) => {
-        if (part === 'gpt') return 'GPT';
-        if (part.match(/^\d/)) return part; // Keep numbers as-is
-        return part.charAt(0).toUpperCase() + part.slice(1);
-      })
-      .join('-')
-      .replace(/-/g, ' ');
+    // Convert model IDs like "gpt-4o-mini" to "GPT 4o Mini"
+    // If the model ID ends with a date (e.g., "2024-05-13"), keep the date together as a single part
+    const parts = modelId.split('-');
+    // Check if the last three parts form a date (YYYY-MM-DD)
+    const len = parts.length;
+    let formattedParts: string[];
+    if (
+      len >= 3 &&
+      /^\d{4}$/.test(parts[len - 3]) &&
+      /^\d{2}$/.test(parts[len - 2]) &&
+      /^\d{2}$/.test(parts[len - 1])
+    ) {
+      // Group the last three parts as a date
+      const datePart = `${parts[len - 3]}-${parts[len - 2]}-${parts[len - 1]}`;
+      formattedParts = parts.slice(0, len - 3).concat([datePart]);
+    } else {
+      formattedParts = parts;
+    }
+    return formattedParts
+      .map((part) => {
+        if (part === 'gpt') return 'GPT';
+        if (part.match(/^\d/)) return part; // Keep numbers as-is
+        // If part is a date (YYYY-MM-DD), keep as-is
+        if (/^\d{4}-\d{2}-\d{2}$/.test(part)) return part;
+        return part.charAt(0).toUpperCase() + part.slice(1);
+      })
+      .join(' ');
```
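To make the reviewed edge case concrete, here is a standalone sketch of the date-aware title formatting, extracted from the suggestion above so it can be run on its own (the function name matches the PR; the standalone packaging is ours):

```typescript
// Standalone version of the suggested formatModelTitle: trailing
// YYYY-MM-DD segments are grouped so dated snapshots stay readable.
function formatModelTitle(modelId: string): string {
  const parts = modelId.split('-');
  const len = parts.length;
  let formattedParts: string[];
  if (
    len >= 3 &&
    /^\d{4}$/.test(parts[len - 3]) &&
    /^\d{2}$/.test(parts[len - 2]) &&
    /^\d{2}$/.test(parts[len - 1])
  ) {
    // Re-join the last three parts into a single date token
    const datePart = `${parts[len - 3]}-${parts[len - 2]}-${parts[len - 1]}`;
    formattedParts = parts.slice(0, len - 3).concat([datePart]);
  } else {
    formattedParts = parts;
  }
  return formattedParts
    .map((part) => {
      if (part === 'gpt') return 'GPT';
      if (/^\d/.test(part)) return part; // numbers and dates kept as-is
      return part.charAt(0).toUpperCase() + part.slice(1);
    })
    .join(' ');
}

formatModelTitle('gpt-4o-mini');       // → "GPT 4o Mini"
formatModelTitle('gpt-4o-2024-05-13'); // → "GPT 4o 2024-05-13"
```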
```typescript
      (model) =>
        model.id.startsWith('gpt-') &&
        !model.id.startsWith('gpt-3.5') && // Exclude GPT-3.5 (no vision support)
        !model.id.includes('instruct'), // Exclude instruct models
```
Copilot AI commented on Dec 3, 2025:
The filter assumes all models starting with 'gpt-' (except gpt-3.5) support vision. However, not all GPT-4 models may support vision (e.g., older base 'gpt-4' vs 'gpt-4-vision-preview' or 'gpt-4-turbo'). Consider using a more explicit allowlist of known vision-capable model patterns (e.g., contains 'gpt-4o', 'gpt-4-turbo', 'gpt-4-vision', 'gpt-4v') or checking model capabilities via the OpenAI API if available. This would prevent non-vision models from being included and causing runtime errors when screenshots are sent.
Suggested change:

```diff
-      (model) =>
-        model.id.startsWith('gpt-') &&
-        !model.id.startsWith('gpt-3.5') && // Exclude GPT-3.5 (no vision support)
-        !model.id.includes('instruct'), // Exclude instruct models
+      (model) => {
+        // Only include known vision-capable models
+        const id = model.id;
+        // Add to this list as new vision-capable models are released
+        return (
+          (id.includes('gpt-4o') ||
+            id.includes('gpt-4-turbo') ||
+            id.includes('gpt-4-vision') ||
+            id.includes('gpt-4v')) &&
+          !id.includes('instruct')
+        );
+      },
```
```typescript
} from '@nut-tree-fork/nut-js';
import { spawn } from 'child_process';
import * as path from 'path';
import { overlayeCursorOnScreenshot } from './cursor-overlay';
```
Copilot AI commented on Dec 3, 2025:
Typo in imported function name: 'overlaye' should be 'overlay'. The import should be overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.
Suggested change:

```diff
-import { overlayeCursorOnScreenshot } from './cursor-overlay';
+import { overlayCursorOnScreenshot } from './cursor-overlay';
```
```typescript
      // Overlay cursor if position was captured
      if (includeCursor && cursorPosition) {
        try {
          const withCursor = await overlayeCursorOnScreenshot(
```
Copilot AI commented on Dec 3, 2025:
Typo in function call: 'overlaye' should be 'overlay'. The function call should be overlayCursorOnScreenshot instead of overlayeCursorOnScreenshot.
```typescript
    if (name.includes('gpt-4.1')) return 1;
    if (name.includes('gpt-4')) return 2;
    if (name.includes('gpt-5')) return 3;
```
Copilot AI commented on Dec 3, 2025:
The sort priority includes 'gpt-4.1' at priority 1 and 'gpt-5' at priority 3, but according to the filter logic (line 76), only models starting with 'gpt-' and excluding 'gpt-3.5' are included. Since GPT-4.1 and GPT-5 are hypothetical future models that may not exist yet, consider whether these priority cases are necessary. If they are intended for future-proofing, a comment explaining this would be helpful.
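The check order in the excerpt matters: since `'gpt-4.1'` also contains `'gpt-4'`, the more specific family must be tested first. A small self-contained sketch of the priority idea (the `sortPriority` wrapper name is ours; lower numbers sort first):

```typescript
// Sketch of the sort-priority logic under review. Order of checks is
// significant: 'gpt-4.1' ids would otherwise match the 'gpt-4' branch.
function sortPriority(name: string): number {
  if (name.includes('gpt-4.1')) return 1;
  if (name.includes('gpt-4')) return 2;
  if (name.includes('gpt-5')) return 3;
  return 4; // anything else sorts last
}

const ordered = ['gpt-5', 'gpt-4o', 'gpt-4.1-mini'].sort(
  (a, b) => sortPriority(a) - sortPriority(b),
);
// → ['gpt-4.1-mini', 'gpt-4o', 'gpt-5']
```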
```typescript
  // Ensure cursor position is within screenshot bounds
  const safeX = Math.max(0, Math.min(cursorX, width - 1));
  const safeY = Math.max(0, Math.min(cursorY, height - 1));
```
Copilot AI commented on Dec 3, 2025:
The cursor position bounds check should account for the cursor size to prevent the cursor from being clipped at the edges. Currently, Math.min(cursorX, width - 1) allows the cursor to be placed at width - 1, but since the cursor has a size (default 24px), part of it will extend beyond the image boundary. Consider using Math.min(cursorX, width - cursorSize) and Math.min(cursorY, height - cursorSize) instead.
Suggested change:

```diff
-  // Ensure cursor position is within screenshot bounds
-  const safeX = Math.max(0, Math.min(cursorX, width - 1));
-  const safeY = Math.max(0, Math.min(cursorY, height - 1));
+  // Ensure cursor position is within screenshot bounds (account for cursor size)
+  const safeX = Math.max(0, Math.min(cursorX, width - cursorSize));
+  const safeY = Math.max(0, Math.min(cursorY, height - cursorSize));
```
The Microsoft APT repository was unreliable and caused build failures. Switched to downloading the .deb package directly from code.visualstudio.com for both the amd64 and arm64 architectures.
- Add -cursor arrow -cursorpos flags to x11vnc configuration
- Enable showDotCursor in react-vnc VncViewer component
- Ensure cursor is visible in live desktop preview

Fixes issue where the cursor was not visible to the AI agent during task execution, causing it to get stuck on positioning.
- Add logic to parse the model name and determine the provider (openai/anthropic/google)
- Handle model names stored as strings in the database
- Fall back to OpenAI's available models list for unknown models

Fixes the "No service found for model provider: undefined" error that prevented task execution.
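The provider detection described in this commit could look roughly like the following. This is a hedged sketch: the function name, the exact prefixes checked, and the `Provider` union are assumptions, not the PR's actual code.

```typescript
// Hypothetical sketch of mapping a stored model-name string to a provider.
type Provider = 'openai' | 'anthropic' | 'google';

function detectProvider(modelName: string): Provider | undefined {
  if (
    modelName.startsWith('gpt-') ||
    modelName.startsWith('o1') ||
    modelName.startsWith('o3')
  ) {
    return 'openai';
  }
  if (modelName.startsWith('claude')) return 'anthropic';
  if (modelName.startsWith('gemini')) return 'google';
  // Unknown: caller can fall back to OpenAI's available-models list
  return undefined;
}
```

Returning `undefined` for unknown names lets the caller apply the fallback path instead of dispatching to a non-existent provider service.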
- Add instructions about cursor visibility in screenshots
- Remind the agent to use computer_cursor_position when having trouble
- Discourage repeatedly clicking the same coordinates when it is not working

Helps the agent handle positioning issues more intelligently.
- Handle both string and object formats for task.model
- Check the type before attempting to parse the model name
- Use proper TypeScript casting through unknown

This fixes the "modelName.startsWith is not a function" error that was causing immediate task failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
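The string-versus-object narrowing this commit describes can be sketched as below. The `{ name: string }` shape is an assumption about how the model object is stored; the real field may differ.

```typescript
// Sketch of safely extracting a model name from a value that may be a
// plain string or a stored object, narrowing through `unknown`.
function resolveModelName(model: unknown): string | undefined {
  if (typeof model === 'string') return model;
  if (
    typeof model === 'object' &&
    model !== null &&
    typeof (model as { name?: unknown }).name === 'string'
  ) {
    return (model as { name: string }).name;
  }
  // Anything else would blow up on .startsWith(), so refuse it here
  return undefined;
}
```

Guarding before calling string methods is what prevents the "modelName.startsWith is not a function" crash when the database hands back an object instead of a string.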
This pull request enhances how available OpenAI models are managed and surfaced in the Bytebot agent. The most significant changes include dynamically fetching and caching OpenAI models that support vision (image inputs), improving fallback logic, and updating the models list to prioritize relevant options. Additionally, there are minor UI improvements to the select dropdown component.
Dynamic OpenAI Model Management:
- Added `getAvailableModels` in `OpenAIService` to fetch available models from the OpenAI API, filter for those supporting vision, cache them for one hour, and provide a fallback to a hardcoded list if needed. This ensures the agent always offers up-to-date and relevant model options.
- Updated the `OPENAI_MODELS` list to include only models that support vision (image input), with revised names, titles, and context windows.

Integration with Task Controller:
- Updated `TasksController` to fetch OpenAI models dynamically using the new `getAvailableModels` method, with a fallback to the hardcoded list if fetching fails. Models from other providers are still included based on API key presence.
- Updated `TasksModule` to import `OpenAIModule` so that `OpenAIService` can be injected into `TasksController`.

UI Improvement:
- Improved `SelectContent` by limiting its maximum height and enabling vertical scrolling, enhancing usability when many models are available.