MuhammadEhsan02/voxibrowse


VoxiBrowse

VoxiBrowse is an AI agent that browses the web hands-free from conversational voice commands, executing real browser actions mid-conversation. Built for the Mistral Worldwide Hackathon, Online Edition.


What It Does

| Scenario | Action | What Happens |
|---|---|---|
| Web Search | "Search Wikipedia for Artificial Intelligence" | VoxiBrowse navigates to Wikipedia, types in the search query, hits enter, and reads out the top result summary. |
| Form Filling | "Enter my name as Muhammad and select priority shipping" | Locates the specific form inputs, types the name, and interacts with the dropdown to select the right shipping option. |
| Scrolling & Navigation | "Scroll down and click on the second article" | Evaluates the visible elements, scrolls the page down, identifies the new DOM state, and clicks the requested element. |
| Interruption | "Actually, cancel that and go to YouTube" | Instantly stops its current TTS narration and browser action, and pivots to load the new URL. |

The killer feature is the custom DOM extraction engine. Instead of relying on slow, expensive coordinate-based vision models, VoxiBrowse injects a lightweight JS extractor that maps all visible, interactive elements to numeric IDs (e.g., [42] "Search"). Mistral Large reads this structured text state and accurately interacts with elements by their designated IDs.
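
As a rough illustration of the idea, the following Python sketch shows how a list of extracted elements could be rendered into the numbered text state the model reads, and how a numeric ID could map back to a selector. The `PageElement` shape, the inclusion of the tag name in each line, and both function names are assumptions made for this example; the actual extractor is the injected JavaScript file and may format its output differently.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """One visible, interactive element (hypothetical shape for this sketch)."""
    agent_id: int  # numeric ID assigned by the extractor
    tag: str       # e.g. "button", "input", "a"
    label: str     # visible text, placeholder, or similar


def format_page_state(elements: list[PageElement]) -> str:
    """Render one line per element, e.g. [42] <button> "Search"."""
    return "\n".join(
        f'[{el.agent_id}] <{el.tag}> "{el.label}"' for el in elements
    )


def selector_for(agent_id: int) -> str:
    """CSS attribute selector for an element tagged with this numeric ID."""
    return f'[data-agent-id="{agent_id}"]'


elements = [
    PageElement(41, "input", "Search Wikipedia"),
    PageElement(42, "button", "Search"),
]
print(format_page_state(elements))
print(selector_for(42))
```

Because the model only ever sees short lines like these, it can refer to an element by a single integer instead of reasoning about screen coordinates.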


Architecture

       Microphone
             │ (PyAudio PCM 16kHz)
             ▼
       ElevenLabs STT (Scribe WebSocket)
             │
             ▼
       Mistral Large (Reasoning & Tool Calling)
             │
       ┌─────┴─────┐
       ▼           ▼
   Playwright    ElevenLabs TTS
 (Browser & DOM)   (Audio Synthesis)
       │           │
       ▼           ▼
   Web Page      Speaker (PyAudio)
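
To make the microphone leg of the diagram concrete, here is a small sketch of the arithmetic behind streaming 16 kHz PCM in fixed-size chunks. The 16-bit sample width, the mono channel count, and the helper names are assumptions for illustration; the diagram only states the sample rate.

```python
SAMPLE_RATE_HZ = 16_000  # from the architecture diagram
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM
CHANNELS = 1             # assumption: mono microphone input


def chunk_size_bytes(duration_ms: int) -> int:
    """Number of bytes of raw PCM covering duration_ms of audio."""
    frames = SAMPLE_RATE_HZ * duration_ms // 1000
    return frames * BYTES_PER_SAMPLE * CHANNELS


def split_pcm(data: bytes, duration_ms: int) -> list[bytes]:
    """Split a raw PCM buffer into chunks of at most duration_ms each."""
    size = chunk_size_bytes(duration_ms)
    return [data[i:i + size] for i in range(0, len(data), size)]


chunks = split_pcm(bytes(8000), 100)
print(chunk_size_bytes(100), [len(c) for c in chunks])
```

Under these assumptions, a 100 ms chunk is 1,600 frames, or 3,200 bytes, which is the unit that would be handed to the STT stream.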

Key Design Decisions

  • Custom DOM Addressing: Injects dom_extractor.js into every page to identify visible, interactive elements and assign them sequential data-agent-id numbers. The LLM then accurately triggers Playwright interactions using these safe numeric IDs.
  • Agentic Tool Loop: Mistral Large is strictly constrained to one browser action per turn. This prevents batch-action desync (where clicking an element might change the DOM and invalidate the IDs of subsequent actions).
  • Real-Time Interruption: The async event loop continuously monitors ElevenLabs STT for partial transcripts. If the user speaks while the agent is narrating, TTS playback stops instantly, the LLM call is cancelled, and the conversational context rolls back to process the new command smoothly.
  • Async I/O: The entire system is built on Python's asyncio. Audio streaming to STT, TTS playback chunks, and Playwright execution all run concurrently without blocking the main event loop.

Tech Stack

| Component | Technology | Role |
|---|---|---|
| LLM | Mistral Large (mistral-large-latest) | Reasoning core and JSON function (tool) calling |
| Speech-to-Text | ElevenLabs Scribe | Real-time audio streaming and VAD transcription via WebSocket |
| Text-to-Speech | ElevenLabs (eleven_flash_v2_5) | Low-latency voice synthesis for natural conversations |
| Browser Engine | Playwright (Chromium) | Headful browser automation and JS evaluation |
| Audio Processing | PyAudio | Direct microphone capture and speaker output |
| Runtime | Python asyncio | Non-blocking concurrent event loop |

Project Structure

voxibrowse/
├── run.py                 # Main CLI point of entry
├── requirements.txt       # Dependencies
├── .env                   # Configuration & API Keys
└── backend/
    ├── cli_session.py     # Orchestrates Mic -> STT -> LLM -> Tools -> TTS
    ├── llm_client.py      # Mistral wrapper, tracks action history & loop
    ├── browser_agent.py   # Manages Playwright, Navigation, and DOM injection
    ├── dom_extractor.js   # JS script to map interactive elements to numeric IDs
    ├── tools.py           # Tool schemas strictly formatted for Mistral
    ├── stt_client.py      # ElevenLabs Realtime STT connection
    ├── tts_client.py      # ElevenLabs TTS streaming generator
    ├── audio_utils.py     # PCM binary utilities
    └── config.py          # Environment settings and hardware sample rates
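
For orientation, a settings module along the lines of config.py might look like the sketch below. It is only an assumption about the shape of that file: the `Settings` class, the `load_settings` helper, and the field names are invented here, while the two variable names and the 16 kHz rate come from this README. It reads from a mapping and does not parse the .env file itself.

```python
import os
from dataclasses import dataclass

REQUIRED_KEYS = ("MISTRAL_API_KEY", "ELEVENLABS_API_KEY")


@dataclass(frozen=True)
class Settings:
    mistral_api_key: str
    elevenlabs_api_key: str
    mic_sample_rate: int = 16_000  # Hz, matching the 16 kHz capture rate


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping, failing fast if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(env["MISTRAL_API_KEY"], env["ELEVENLABS_API_KEY"])


# Usage with an explicit mapping instead of the real environment:
settings = load_settings({"MISTRAL_API_KEY": "m-key", "ELEVENLABS_API_KEY": "e-key"})
print(settings.mic_sample_rate)
```

Failing at startup with a list of missing names is usually friendlier than a later authentication error from one of the APIs.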

Getting Started

  1. Clone the repository

    git clone https://github.com/muhammadehsan02/voxibrowse.git
    cd voxibrowse
  2. Install Python dependencies

    pip install -r requirements.txt
  3. Install Playwright Chromium Browser

    python -m playwright install chromium
  4. Set environment variables. Create a .env file in the root directory:

    MISTRAL_API_KEY=your_mistral_api_key
    ELEVENLABS_API_KEY=your_elevenlabs_api_key
  5. Run the Agent

    python run.py

    Say "hey browser" to wake the agent, and "just stop" to put it to sleep.
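
The wake and sleep phrases amount to a two-state switch driven by transcripts. A minimal sketch of that logic follows; the normalisation step and the function names are assumptions, since the README does not say how the phrases are matched.

```python
import re

WAKE_PHRASE = "hey browser"
SLEEP_PHRASE = "just stop"


def normalize(transcript: str) -> str:
    """Lowercase and drop punctuation so 'Hey, browser!' matches 'hey browser'."""
    return " ".join(re.findall(r"[a-z']+", transcript.lower()))


def update_awake(awake: bool, transcript: str) -> bool:
    """Return the new awake state after hearing one transcript."""
    text = normalize(transcript)
    if SLEEP_PHRASE in text:
        return False
    if WAKE_PHRASE in text:
        return True
    return awake  # any other speech leaves the state unchanged


awake = False
for heard in ["search for cats", "Hey, browser!", "scroll down", "Just stop."]:
    awake = update_awake(awake, heard)
    print(f"{heard!r} -> awake={awake}")
```

While asleep, transcripts other than the wake phrase would simply be ignored instead of being forwarded to the LLM.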


Available Browser Tools

| Tool Name | Description |
|---|---|
| answer_to_user | Verbally speaks text to the user via TTS stream. |
| click_element | Auto-scrolls and clicks an interactive DOM element by its ID. |
| type_text | Instantly fills text into standard input fields. |
| type_and_submit | Types character-by-character to trigger JS debouncers/autocomplete, then hits Enter. |
| select_option | Picks a specified value from native <select> dropdowns or custom JS menus. |
| scroll_down / scroll_up | Dynamically scrolls the page viewport by a chosen intensity (1-5). |
| go_to_url | Navigates the active Playwright tab to a specified URL. |
| get_iframe_content | Reads and extracts the nested DOM of an embedded iframe. |
| click_iframe_element | Directly clicks an element isolated inside a specific iframe by ID. |
| type_iframe_text | Feeds text inputs to elements nested deep within iframes. |
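
To show what one of these tools could look like on the wire, here is a sketch of a function-calling schema for click_element in the JSON Schema style that Mistral's tool-calling API accepts, plus a validation step before execution. The parameter name `element_id`, the `parse_tool_call` helper, and the stale-ID check are assumptions for illustration; the real definitions live in tools.py and may differ.

```python
import json

CLICK_ELEMENT_TOOL = {
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Auto-scrolls and clicks an interactive DOM element by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "element_id": {  # hypothetical parameter name
                    "type": "integer",
                    "description": "Numeric ID shown in the page state, e.g. 42.",
                }
            },
            "required": ["element_id"],
        },
    },
}


def parse_tool_call(name: str, arguments: str, known_ids: set[int]) -> dict:
    """Validate a tool call (name plus JSON-encoded arguments) before running it."""
    if name != CLICK_ELEMENT_TOOL["function"]["name"]:
        raise ValueError(f"unknown tool: {name}")
    element_id = json.loads(arguments).get("element_id")
    if not isinstance(element_id, int) or element_id not in known_ids:
        raise ValueError(f"element_id {element_id!r} is not on the current page")
    return {"tool": name, "element_id": element_id}


call = parse_tool_call("click_element", '{"element_id": 42}', known_ids={41, 42})
print(call)
```

Rejecting IDs that are not in the latest page state is one way to catch a model that refers to an element from an earlier, now stale, DOM snapshot.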

How the Agent Loop Works

  1. Listen: PyAudio continuously captures 16 kHz PCM chunks and streams them to ElevenLabs Scribe over a WebSocket.
  2. Think: A committed STT transcript triggers the agent. browser_agent.py evaluates dom_extractor.js to get a structured text representation of the current page. The transcript + page state are sent to Mistral Large.
  3. Act: Mistral decides on exactly one browser action. Playwright executes it.
  4. Speak: Mistral simultaneously generates a short narration of its action, which is streamed through ElevenLabs TTS to PyAudio.
  5. Re-evaluate: The agent triggers dom_extractor.js again to capture the new DOM, feeds it to Mistral, and repeats the process until the user's objective is fully complete.
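
The Think, Act, Speak, Re-evaluate cycle can be sketched as a small loop with the LLM, browser, and TTS passed in as callables. This version is synchronous and uses stubs so it runs on its own, whereas the project is asyncio-based. The action dictionary shape, the `narration` field, the `max_steps` guard, and the rule that answer_to_user ends the task are all assumptions made for the example.

```python
from typing import Callable

Action = dict  # e.g. {"tool": "go_to_url", "url": "...", "narration": "..."}


def run_agent_loop(
    transcript: str,
    get_page_state: Callable[[], str],
    decide: Callable[[str, str, list], Action],
    execute: Callable[[Action], None],
    speak: Callable[[str], None],
    max_steps: int = 10,  # hypothetical guard against endless loops
) -> list:
    """Run one browser action per turn until the model answers the user."""
    history: list = []
    for _ in range(max_steps):
        page_state = get_page_state()                     # fresh DOM snapshot
        action = decide(transcript, page_state, history)  # exactly one action
        history.append(action)
        if action["tool"] == "answer_to_user":
            speak(action["text"])
            break
        execute(action)
        speak(action.get("narration", ""))
    return history


# Stubs standing in for Playwright, Mistral, and TTS.
page = {"url": "about:blank"}
spoken: list = []


def fake_state() -> str:
    return f"url={page['url']}"


def fake_decide(transcript: str, state: str, history: list) -> Action:
    if "wikipedia" not in state:
        return {"tool": "go_to_url", "url": "https://www.wikipedia.org",
                "narration": "Opening Wikipedia."}
    return {"tool": "answer_to_user", "text": "Wikipedia is open."}


def fake_execute(action: Action) -> None:
    page["url"] = action["url"]


history = run_agent_loop("open Wikipedia", fake_state, fake_decide,
                         fake_execute, spoken.append)
print([a["tool"] for a in history], spoken)
```

Taking a new snapshot at the top of every iteration is what keeps the numeric IDs valid: the model never acts on a page state older than its previous action.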

Author

Muhammad Ehsan — built for the Mistral Worldwide Hackathon
Powered by Mistral

About

VoxiBrowse translates natural speech into multi-step browser actions to search, interact with live pages, and complete complex online tasks entirely through conversational commands.
