A lightweight and simple Voice-To-Text Cross-Platform Application made using Tauri, Node.js and Deepgram.
Demo Video: https://drive.google.com/file/d/1jVsNRN4lepOspCLNGmCXqNeVZma1lnuA/view?usp=sharing
- **Simple and Lightweight**: Zero bloat, a minimal UI, and fast performance.
- **Cross-Platform**: Fully cross-platform support enabled using Tauri.
- **Smooth Auto-Scroll**: New transcriptions pin to the bottom for a seamless experience.
- **Live Transcriptions**: Transcriptions appear live as the user speaks, made possible using WebSockets.
- **Minimalistic UI**: An incredibly minimal UI made with Pico CSS and a bit of custom CSS.
- **Copy Transcriptions**: Copy all transcriptions with a single click.
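The auto-scroll behaviour above can be sketched with a small helper (illustrative only, not the app's actual code; `shouldAutoScroll` and the element id are hypothetical names): pin to the bottom only when the user is already near it, so scrolling back through older transcripts is not interrupted.

```javascript
// Hypothetical helper: decide whether new transcripts should pin the view
// to the bottom. Only auto-scroll when the user is already near the bottom.
function shouldAutoScroll(scrollTop, clientHeight, scrollHeight, slackPx = 40) {
  // Distance between the bottom of the viewport and the bottom of the content.
  const distanceFromBottom = scrollHeight - (scrollTop + clientHeight);
  return distanceFromBottom <= slackPx;
}

// In the client this would run whenever a new transcript arrives, e.g.:
// const el = document.getElementById('transcripts'); // hypothetical id
// if (shouldAutoScroll(el.scrollTop, el.clientHeight, el.scrollHeight)) {
//   el.scrollTop = el.scrollHeight; // pin to the bottom
// }
```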
- Frontend: Tauri, HTML, CSS, JavaScript.
- Backend: Node.js, WebSockets (ws).
- AI: Deepgram.
Prerequisites: Node.js, Tauri, Rust (a Tauri dependency)
- Clone the repository:

  ```sh
  git clone https://github.com/NexusWasLost/Voice-To-Text.git
  ```

- Navigate into the directory:

  ```sh
  cd Voice-To-Text
  ```

- Set up the frontend:
  - Navigate into the client directory:

    ```sh
    cd client/wispr-clone
    ```

  - Install all the dependencies:

    ```sh
    npm install
    ```

  - Start the frontend client:

    ```sh
    npm run tauri dev
    ```
- Set up the backend:
  - From the client directory, navigate to the server directory:

    ```sh
    cd ../../server
    ```

  - Install the server dependencies:

    ```sh
    npm install
    ```

  - Create a `.env` file and add the environment variables (an example is shown below).

  - Start the server:

    ```sh
    npm run dev
    ```

  - If the frontend client is already running, refresh it to connect to the server.
Here is an example `.env`:

```
PORT=8080
DEEPGRAM_API_KEY=<your_api_key>
DEEPGRAM_WEBSOCKET_URL=wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&interim_results=true
```
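For context, loading a `.env` like this amounts to the sketch below. The real server would typically use the `dotenv` package; `parseDotEnv` here is an illustrative stand-in, not the project's actual code.

```javascript
// Minimal sketch of what loading a .env file amounts to (illustrative;
// a real server would normally use the `dotenv` package instead).
function parseDotEnv(text) {
  const env = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue; // skip blanks and comments
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue;                           // not a KEY=value line
    // Split on the FIRST '=' only, so values containing '=' (like the
    // Deepgram URL's query string) survive intact.
    env[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return env;
}
```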
- Log in to Deepgram or create an account: https://deepgram.com/
- Create a project (if Deepgram didn't create one for you already).
- Navigate to "API Keys", create an API key, and copy its value.
- Inside the `.env`, set `DEEPGRAM_API_KEY=<your_api_key>` (the API key goes there)!
The architecture of the whole system is more or less straightforward:

- The client connects to the Node.js server using a WebSocket connection.
- The client is then ready to send data to the server.
- Upon receiving the first chunk of data, the server initiates a second WebSocket connection, to Deepgram's streaming endpoint.
- Once that socket is ready, the audio data is forwarded to Deepgram over it; Deepgram transcribes the audio in real time and returns transcripts.
- On receiving the transcript data from Deepgram, the Node server processes it and sends the transcription back to the client.
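The relay described above can be sketched as follows. This is a hedged sketch, not the project's actual code: it assumes the `ws` package on the server side and Deepgram's documented `Authorization: Token <api_key>` header; `startRelay` and `deepgramHeaders` are illustrative names.

```javascript
// Builds the auth header Deepgram's streaming API expects
// (documented as `Authorization: Token <api_key>`).
function deepgramHeaders(apiKey) {
  return { Authorization: `Token ${apiKey}` };
}

// Illustrative client -> server -> Deepgram relay. Never invoked here,
// so the sketch stays side-effect free; requires the `ws` package.
function startRelay({ port, deepgramUrl, apiKey }) {
  const { WebSocketServer, WebSocket } = require('ws');

  const wss = new WebSocketServer({ port });
  wss.on('connection', (client) => {
    let deepgram = null;

    client.on('message', (chunk) => {
      // Lazily open the upstream socket on the first audio chunk,
      // authenticating with the API key that never leaves the server.
      if (!deepgram) {
        deepgram = new WebSocket(deepgramUrl, {
          headers: deepgramHeaders(apiKey),
        });
        // Forward Deepgram's transcript messages back to the client.
        deepgram.on('message', (msg) => client.send(msg.toString()));
      }
      // A real server would buffer chunks until the upstream socket opens.
      if (deepgram.readyState === WebSocket.OPEN) deepgram.send(chunk);
    });

    client.on('close', () => deepgram && deepgram.close());
  });
  return wss;
}
```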
Here are a few decisions I made while building this project:

- **Server (middleman)**: Even though Deepgram provides a URL to its open WebSocket endpoint, and that is exactly what the application uses, I preferred to keep a server in between, both to keep the Deepgram API key secure and to process the transcription data.
- **Minimalistic UI**: The UI is made using base HTML, CSS and JavaScript together with Pico CSS and a bit of custom CSS. I focused entirely on the core functionality of the application rather than on a detailed frontend.
- **Code Modularity**: I tried to keep the code as modular as possible for better readability and maintainability, encapsulating similarly functioning parts together.
- **Hosting**: Unfortunately, hosting a server that primarily uses WebSockets for free is a big hassle and inefficient, so I decided not to host the server.
- **Transcription Latency**: I utilized Deepgram's `is_final` flag to ensure transcript accuracy and readability. While this introduces a slight delay as the model determines sentence boundaries, it prevents the word redundancy common in raw interim streams.
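The `is_final` handling can be sketched like this (a sketch, not the project's actual code; it assumes the documented shape of Deepgram's streaming results, where the text lives under `channel.alternatives[0].transcript` alongside an `is_final` flag, and `finalTranscript` is a hypothetical helper):

```javascript
// Hypothetical helper: return the transcript text only for finalized
// segments. Interim results (is_final === false) are revised as the model
// refines sentence boundaries, so forwarding them would duplicate words.
function finalTranscript(raw) {
  const msg = JSON.parse(raw);
  const text = msg?.channel?.alternatives?.[0]?.transcript ?? '';
  // Trade a little latency for clean, non-redundant output.
  return msg.is_final && text ? text : null;
}
```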
A key challenge I faced was managing the initial audio metadata.

The issue was that after the first audio stream finished, Deepgram would not process a second stream of audio. After some debugging (and consulting AI), the bug I found was the audio metadata!

While streaming, the start of the buffer carries the metadata for the audio (the container header). Deepgram expects that metadata only once per connection; if the connection stays open and metadata arrives again, Deepgram discards it and any data sent afterwards.

To resolve this, I implemented a check for the metadata byte (26) in the incoming buffer. When it is detected, the server creates a new WebSocket connection to Deepgram, terminating any previous one, ensuring that the first thing each connection receives is the metadata and everything after it is always audio data.
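That check can be sketched as below (illustrative, not the project's actual code). One grounding note: byte 26 is 0x1A, the first byte of the EBML magic (0x1A 0x45 0xDF 0xA3) that opens a WebM/Matroska stream, which is presumably why seeing it again signals a fresh recording whose container header has been re-sent; `isStreamHeader` and `openDeepgram` are hypothetical names.

```javascript
// Detect the metadata byte described above: 0x1A (decimal 26) is the first
// byte of the EBML magic that starts a WebM/Matroska container header, so
// its reappearance means the browser has begun a brand-new audio stream.
function isStreamHeader(chunk) {
  return chunk.length > 0 && chunk[0] === 0x1a;
}

// In the server's message handler this would drive the reconnect:
// if (isStreamHeader(chunk)) {
//   deepgramSocket?.close();          // terminate the previous connection
//   deepgramSocket = openDeepgram();  // hypothetical helper; the new socket's
// }                                   // first bytes are then the metadata
```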