VoxiBrowse

An AI agent that browses the web hands-free from conversational voice commands, executing real browser actions mid-conversation. Built for the Mistral Worldwide Hackathon, Online Edition.

What It Does

| Scenario | Action | What Happens |
| --- | --- | --- |
| Web Search | "Search Wikipedia for Artificial Intelligence" | VoxiBrowse navigates to Wikipedia, types in the search query, hits enter, and reads out the top result summary. |
| Form Filling | "Enter my name as Muhammad and select priority shipping" | Locates the specific form inputs, types the name, and interacts with the dropdown to select the right shipping option. |
| Scrolling & Navigation | "Scroll down and click on the second article" | Evaluates the visible elements, scrolls the page down, identifies the new DOM state, and clicks the requested element. |
| Interruption | "Actually, cancel that and go to YouTube" | Instantly stops its current TTS narration and browser action, and pivots to load the new URL. |

The killer feature is the custom DOM extraction engine. Instead of relying on slow, expensive coordinate-based vision models, VoxiBrowse injects a lightweight JS extractor that maps all visible, interactive elements to numeric IDs (e.g., [42] "Search"). Mistral Large reads this structured text state and accurately interacts with elements by their designated IDs.
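
To make the numeric-ID idea concrete, here is a minimal Python sketch of how extracted elements might be rendered into the text state the LLM reads, and how a chosen ID could be mapped back to a selector. The `PageElement` shape, the inclusion of the tag name in each line, and both function names are assumptions for illustration, not the project's actual code; only the `[42] "Search"` style and the `data-agent-id` attribute come from this README.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """One visible, interactive element (hypothetical shape, not the real extractor output)."""
    agent_id: int  # numeric ID assigned by the extractor
    tag: str       # e.g. "button", "input", "a" (including the tag is an assumption)
    label: str     # visible text, placeholder, or similar


def render_page_state(elements: list[PageElement]) -> str:
    """Render one line per element, e.g. `[42] <button> "Search"`."""
    return "\n".join(f'[{el.agent_id}] <{el.tag}> "{el.label}"' for el in elements)


def resolve_selector(agent_id: int) -> str:
    """Map a numeric ID chosen by the LLM back to a CSS attribute selector."""
    return f'[data-agent-id="{agent_id}"]'


elements = [PageElement(42, "button", "Search"), PageElement(43, "input", "Query")]
page_state = render_page_state(elements)
```

The point of the text form is that the model only ever has to emit a small integer, which is then resolved to a selector on the browser side.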

Architecture

       Microphone
             │ (PyAudio PCM 16kHz)
             ▼
       ElevenLabs STT (Scribe WebSocket)
             │
             ▼
       Mistral Large (Reasoning & Tool Calling)
             │
       ┌─────┴─────┐
       ▼           ▼
   Playwright    ElevenLabs TTS
 (Browser & DOM)   (Audio Synthesis)
       │           │
       ▼           ▼
   Web Page      Speaker (PyAudio)

Key Design Decisions

Custom DOM Addressing: Injects dom_extractor.js into every page to identify visible, interactive elements and assign them sequential data-agent-id numbers. The LLM then accurately triggers Playwright interactions using these safe numeric IDs.
Agentic Tool Loop: Mistral Large is strictly constrained to one browser action per turn. This prevents batch-action desync (where clicking an element might change the DOM and invalidate the IDs of subsequent actions).
Real-Time Interruption: The async event loop continuously monitors ElevenLabs STT for partial transcripts. If the user speaks while the agent is narrating, TTS playback stops instantly, the LLM call is cancelled, and the conversational context rolls back to process the new command smoothly.
Async I/O: The entire system is built on Python's asyncio. Audio streaming to STT, TTS playback chunks, and Playwright execution all run concurrently without blocking the event loop.
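
The interruption behaviour described above can be sketched with plain asyncio. This is a simplified illustration, not the project's implementation: `narrate` is a stand-in for TTS playback, the `asyncio.Event` stands in for "a partial transcript arrived", and all names are hypothetical. Context rollback and LLM-call cancellation are not shown.

```python
import asyncio


async def narrate(text: str, spoken: list[str]) -> None:
    """Stand-in for TTS playback: 'plays' one word roughly every 10 ms."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)


async def run_turn(text: str, interrupt: asyncio.Event) -> tuple[list[str], bool]:
    """Play narration until it finishes or `interrupt` is set, whichever comes first."""
    spoken: list[str] = []
    playback = asyncio.create_task(narrate(text, spoken))
    watcher = asyncio.create_task(interrupt.wait())
    done, pending = await asyncio.wait(
        {playback, watcher}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        # Cancels playback if the user spoke, or the watcher if playback finished.
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return spoken, watcher in done


async def demo() -> tuple[list[str], bool]:
    interrupt = asyncio.Event()
    # Simulate the user starting to speak 25 ms into the narration.
    asyncio.get_running_loop().call_later(0.025, interrupt.set)
    return await run_turn("opening the first search result for you now", interrupt)
```

Running `asyncio.run(demo())` returns the words that were "spoken" before the interruption and a flag saying whether the turn was cut short. The same race-then-cancel pattern would apply to an in-flight LLM request.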

Tech Stack

| Component | Technology | Role |
| --- | --- | --- |
| LLM | Mistral Large (`mistral-large-latest`) | Reasoning core and JSON function (tool) calling |
| Speech-to-Text | ElevenLabs Scribe | Real-time audio streaming and VAD transcription via WebSocket |
| Text-to-Speech | ElevenLabs (`eleven_flash_v2_5`) | Low-latency voice synthesis for natural conversations |
| Browser Engine | Playwright (Chromium) | Headful browser automation and JS evaluation |
| Audio Processing | PyAudio | Microphone capture and speaker playback |
| Runtime | Python `asyncio` | Non-blocking concurrent event loop |

Project Structure

voxibrowse/
├── run.py                 # Main CLI entry point
├── requirements.txt       # Dependencies
├── .env                   # Configuration & API Keys
└── backend/
    ├── cli_session.py     # Orchestrates Mic -> STT -> LLM -> Tools -> TTS
    ├── llm_client.py      # Mistral wrapper, tracks action history & loop
    ├── browser_agent.py   # Manages Playwright, Navigation, and DOM injection
    ├── dom_extractor.js   # JS script to map interactive elements to numeric IDs
    ├── tools.py           # Tool schemas strictly formatted for Mistral
    ├── stt_client.py      # ElevenLabs Realtime STT connection
    ├── tts_client.py      # ElevenLabs TTS streaming generator
    ├── audio_utils.py     # PCM binary utilities
    └── config.py          # Environment settings and hardware sample rates

Getting Started

Clone the repository

git clone https://github.com/muhammadehsan02/voxibrowse.git
cd voxibrowse

Install Python dependencies
```
pip install -r requirements.txt
```
Install Playwright Chromium Browser
```
python -m playwright install chromium
```

Environment Variables

Create a .env file in the root directory:

MISTRAL_API_KEY=your_mistral_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
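
As a rough illustration of how these two keys might be validated at startup, the sketch below reads them from the process environment and fails fast if either is missing. It assumes the `.env` values have already been loaded into the environment by some loader; the README does not say which mechanism `config.py` uses, and `load_api_keys` and `MissingConfigError` are hypothetical names.

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = ("MISTRAL_API_KEY", "ELEVENLABS_API_KEY")


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the required keys, or raise with a list of the ones that are missing."""
    source = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise MissingConfigError("Missing environment variables: " + ", ".join(missing))
    return {key: source[key] for key in REQUIRED_KEYS}
```

Failing before the microphone and browser start makes a missing key show up as one clear error rather than as a failed API call mid-conversation.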

Run the Agent
```
python run.py
```
Say "hey browser" to wake the agent, and "just stop" to put it to sleep.

Available Browser Tools

| Tool Name | Description |
| --- | --- |
| `answer_to_user` | Verbally speaks text to the user via TTS stream. |
| `click_element` | Auto-scrolls and clicks an interactive DOM element by its ID. |
| `type_text` | Instantly fills text into standard input fields. |
| `type_and_submit` | Types character-by-character to trigger JS debouncers/autocomplete, then hits Enter. |
| `select_option` | Picks a specified value from native `<select>` dropdowns or custom JS menus. |
| `scroll_down` / `scroll_up` | Dynamically scrolls the page viewport by a chosen intensity (1-5). |
| `go_to_url` | Navigates the active Playwright tab to a specified URL. |
| `get_iframe_content` | Reads and extracts the nested DOM of an embedded iframe. |
| `click_iframe_element` | Directly clicks an element isolated inside a specific iframe by ID. |
| `type_iframe_text` | Feeds text inputs to elements nested deep within iframes. |
How the Agent Loop Works

Listen: PyAudio continuously captures 16 kHz PCM chunks and streams them to ElevenLabs Scribe over a WebSocket.
Think: A committed STT transcript triggers the agent. browser_agent.py evaluates dom_extractor.js to get a structured text representation of the current page. The transcript + page state are sent to Mistral Large.
Act: Mistral decides on exactly one browser action. Playwright executes it.
Speak: Mistral simultaneously generates a short narration of its action, which is streamed through ElevenLabs TTS to PyAudio.
Re-evaluate: The agent triggers dom_extractor.js again to capture the new DOM, feeds it to Mistral, and repeats the process until the user's objective is fully complete.
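
The Think, Act, and Re-evaluate steps above can be sketched as a small loop. This version is synchronous and takes the DOM extractor, the LLM decision, and the browser executor as plain callables so it can run without a browser or API key. The function name, the `None` return as a "done" signal, and the `max_steps` safety cap are assumptions, not details from the project.

```python
from typing import Callable, Optional

Action = dict  # e.g. {"tool": "click_element", "element_id": 3}


def run_agent_loop(
    transcript: str,
    extract_dom: Callable[[], str],
    decide: Callable[[str, str, list], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 10,  # assumed safety cap so a confused model cannot loop forever
) -> list:
    """Run one action per turn, re-reading the page before every decision."""
    history: list = []
    for _ in range(max_steps):
        page_state = extract_dom()  # fresh numeric IDs for the current DOM
        action = decide(transcript, page_state, history)
        if action is None:  # assumed signal that the objective is complete
            break
        execute(action)  # exactly one browser action per turn
        history.append(action)
    return history
```

Because the page is re-extracted at the top of every iteration, an action never relies on IDs computed before the previous click or navigation changed the DOM.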

Author

Muhammad Ehsan — built for the Mistral Worldwide Hackathon
Powered by Mistral
