A lightweight and simple Voice-To-Text Cross-Platform Application made using Tauri, Node.js and Deepgram.
Demo Video: https://drive.google.com/file/d/1jVsNRN4lepOspCLNGmCXqNeVZma1lnuA/view?usp=sharing
- **Simple and Lightweight**: Zero bloat, a minimal UI, and fast performance.
- **Cross-Platform**: Fully cross-platform support enabled using Tauri.
- **Smooth Auto-Scroll**: New transcriptions pin to the bottom for a seamless experience.
- **Live Transcriptions**: Transcriptions appear live as the user speaks, made possible using WebSockets.
- **Minimalistic UI**: An incredibly minimal UI made with Pico CSS and a bit of custom CSS.
- **Copy Transcriptions**: Copy all transcriptions with a single click.
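The auto-scroll behaviour above can be sketched with a small helper (illustrative only, not the app's actual code; `shouldAutoScroll` and the element id are hypothetical names): pin to the bottom only when the user is already near it, so scrolling back through older transcripts is not interrupted.

```javascript
// Hypothetical helper: decide whether new transcripts should pin the view
// to the bottom. Only auto-scroll when the user is already near the bottom.
function shouldAutoScroll(scrollTop, clientHeight, scrollHeight, slackPx = 40) {
  // Distance between the bottom of the viewport and the bottom of the content.
  const distanceFromBottom = scrollHeight - (scrollTop + clientHeight);
  return distanceFromBottom <= slackPx;
}

// In the client this would run whenever a new transcript arrives, e.g.:
// const el = document.getElementById('transcripts'); // hypothetical id
// if (shouldAutoScroll(el.scrollTop, el.clientHeight, el.scrollHeight)) {
//   el.scrollTop = el.scrollHeight; // pin to the bottom
// }
```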
- Frontend: Tauri, HTML, CSS, JavaScript.
- Backend: Node.js, WebSockets (ws).
- AI: Deepgram.
Prerequisites: Node.js, Tauri, Rust (a Tauri dependency)
- Clone the repository:

  ```sh
  git clone https://github.com/NexusWasLost/Voice-To-Text.git
  ```

- Navigate into the directory:

  ```sh
  cd Voice-To-Text
  ```

- Set up the frontend:
  - Navigate into the client directory:

    ```sh
    cd client/wispr-clone
    ```

  - Install all the dependencies:

    ```sh
    npm install
    ```

  - Start the frontend client:

    ```sh
    npm run tauri dev
    ```
- Set up the backend:
  - From the client directory, navigate to the server directory:

    ```sh
    cd ../../server
    ```

  - Install the server dependencies:

    ```sh
    npm install
    ```

  - Create a `.env` file and add the environment variables (an example is shown below).

  - Start the server:

    ```sh
    npm run dev
    ```

  - If the frontend client is already running, refresh it to connect to the server.
Here is an example `.env`:

```
PORT=8080
DEEPGRAM_API_KEY=<your_api_key>
DEEPGRAM_WEBSOCKET_URL=wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&interim_results=true
```
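For context, loading a `.env` like this amounts to the sketch below. The real server would typically use the `dotenv` package; `parseDotEnv` here is an illustrative stand-in, not the project's actual code.

```javascript
// Minimal sketch of what loading a .env file amounts to (illustrative;
// a real server would normally use the `dotenv` package instead).
function parseDotEnv(text) {
  const env = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue; // skip blanks and comments
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue;                           // not a KEY=value line
    // Split on the FIRST '=' only, so values containing '=' (like the
    // Deepgram URL's query string) survive intact.
    env[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return env;
}
```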
- Log in to Deepgram or create an account: https://deepgram.com/
- Create a project (if Deepgram didn't create one for you already).
- Navigate to "API Keys", create an API key, and copy its value.
- Inside the `.env`, set `DEEPGRAM_API_KEY=<your_api_key>` (the API key goes there)!
The architecture of the whole system is more or less straightforward:

- The client connects to the Node.js server using a WebSocket connection.
- The client is then ready to send data to the server.
- Upon receiving the first chunk of data, the server initiates a second WebSocket connection, to Deepgram's streaming endpoint.
- Once that socket is ready, the audio data is forwarded to Deepgram over it; Deepgram transcribes the audio in real time and returns transcripts.
- On receiving the transcript data from Deepgram, the Node server processes it and sends the transcription back to the client.
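The relay described above can be sketched as follows. This is a hedged sketch, not the project's actual code: it assumes the `ws` package on the server side and Deepgram's documented `Authorization: Token <api_key>` header; `startRelay` and `deepgramHeaders` are illustrative names.

```javascript
// Builds the auth header Deepgram's streaming API expects
// (documented as `Authorization: Token <api_key>`).
function deepgramHeaders(apiKey) {
  return { Authorization: `Token ${apiKey}` };
}

// Illustrative client -> server -> Deepgram relay. Never invoked here,
// so the sketch stays side-effect free; requires the `ws` package.
function startRelay({ port, deepgramUrl, apiKey }) {
  const { WebSocketServer, WebSocket } = require('ws');

  const wss = new WebSocketServer({ port });
  wss.on('connection', (client) => {
    let deepgram = null;

    client.on('message', (chunk) => {
      // Lazily open the upstream socket on the first audio chunk,
      // authenticating with the API key that never leaves the server.
      if (!deepgram) {
        deepgram = new WebSocket(deepgramUrl, {
          headers: deepgramHeaders(apiKey),
        });
        // Forward Deepgram's transcript messages back to the client.
        deepgram.on('message', (msg) => client.send(msg.toString()));
      }
      // A real server would buffer chunks until the upstream socket opens.
      if (deepgram.readyState === WebSocket.OPEN) deepgram.send(chunk);
    });

    client.on('close', () => deepgram && deepgram.close());
  });
  return wss;
}
```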
Here are a few decisions I made while building this project:

- **Server (middleman)**: Even though Deepgram provides a URL to its open WebSocket endpoint, and that is exactly what the application uses, I preferred to keep a server in between, both to keep the Deepgram API key secure and to process the transcription data.
- **Minimalistic UI**: The UI is made using base HTML, CSS and JavaScript together with Pico CSS and a bit of custom CSS. I focused entirely on the core functionality of the application rather than on a detailed frontend.
- **Code Modularity**: I tried to keep the code as modular as possible for better readability and maintainability, encapsulating similarly functioning parts together.
- **Hosting**: Unfortunately, hosting a server that primarily uses WebSockets for free is a big hassle and inefficient, so I decided not to host the server.
- **Transcription Latency**: I utilized Deepgram's `is_final` flag to ensure transcript accuracy and readability. While this introduces a slight delay as the model determines sentence boundaries, it prevents the word redundancy common in raw interim streams.
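The `is_final` handling can be sketched like this (a sketch, not the project's actual code; it assumes the documented shape of Deepgram's streaming results, where the text lives under `channel.alternatives[0].transcript` alongside an `is_final` flag, and `finalTranscript` is a hypothetical helper):

```javascript
// Hypothetical helper: return the transcript text only for finalized
// segments. Interim results (is_final === false) are revised as the model
// refines sentence boundaries, so forwarding them would duplicate words.
function finalTranscript(raw) {
  const msg = JSON.parse(raw);
  const text = msg?.channel?.alternatives?.[0]?.transcript ?? '';
  // Trade a little latency for clean, non-redundant output.
  return msg.is_final && text ? text : null;
}
```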
A key challenge I faced was managing the initial audio metadata.

The issue was that after the first audio stream finished, Deepgram would not process a second stream of audio. After some debugging (and consulting AI), the bug I found was the audio metadata!

While streaming, the start of the buffer carries the metadata for the audio (the container header). Deepgram expects that metadata only once per connection; if the connection stays open and metadata arrives again, Deepgram discards it and any data sent afterwards.

To resolve this, I implemented a check for the metadata byte (26) in the incoming buffer. When it is detected, the server creates a new WebSocket connection to Deepgram, terminating any previous one, ensuring that the first thing each connection receives is the metadata and everything after it is always audio data.
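That check can be sketched as below (illustrative, not the project's actual code). One grounding note: byte 26 is 0x1A, the first byte of the EBML magic (0x1A 0x45 0xDF 0xA3) that opens a WebM/Matroska stream, which is presumably why seeing it again signals a fresh recording whose container header has been re-sent; `isStreamHeader` and `openDeepgram` are hypothetical names.

```javascript
// Detect the metadata byte described above: 0x1A (decimal 26) is the first
// byte of the EBML magic that starts a WebM/Matroska container header, so
// its reappearance means the browser has begun a brand-new audio stream.
function isStreamHeader(chunk) {
  return chunk.length > 0 && chunk[0] === 0x1a;
}

// In the server's message handler this would drive the reconnect:
// if (isStreamHeader(chunk)) {
//   deepgramSocket?.close();          // terminate the previous connection
//   deepgramSocket = openDeepgram();  // hypothetical helper; the new socket's
// }                                   // first bytes are then the metadata
```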