Text-to-Audio GenAI Mini Project

Team Members:

Meghana Bisa (PES2UG23CS337)
Mitha M K (PES2UG23CS339)
Monisha Sharma (PES2UG23CS906)

Semester: 6th Semester, PES University

Slide 1: Title Slide

Project Title: Modular Pipeline for Text-to-Audio Generation using Diffusion Models
Domain: Generative AI System (LLMs, Diffusion Models)
Team: 8

Slide 2: Problem Statement and Abstract

Abstract: Generating high-fidelity audio from free-form text remains challenging because it requires text-audio alignment, semantic richness, and temporal coherence. Our project builds a modular GenAI pipeline that transforms short natural-language instructions into high-quality soundscapes, music, and speech.

Problem Statement: Standard text-to-audio models struggle with brief prompts that lack acoustic detail. We address this with a modular text-to-audio pipeline that uses an LLM-inspired RAG step to enrich prompt semantics automatically, then generates audio with a Latent Diffusion Model (AudioLDM 2) conditioned on Flan-T5.

Slide 3: Use Case of the Project

Foley & Sound Design Automation: Game developers and film editors can synthesize missing environmental sounds (SFX) instantly.
Dynamic Background Music Generation: Content creators can generate royalty-free background music specifying tempo, genre, and mood.
Accessibility: Generates realistic, dynamic audio descriptions that make digital experiences more accessible to visually impaired users.

Slide 4: Novelty of the Proposed Work

Multi-modal Conditioning: Combines Large Language Models (Flan-T5) with Latent Diffusion to bridge textual semantics and audio latents.
RAG-based Prompt Enhancement (Unit-2): Traditional approaches expect prompt engineering directly from the user. We augment user prompts with an offline Retrieval-Augmented Generation (RAG) feature to inject rich acoustic details based on matching keywords.
Comprehensive Evaluation Suite: Moves beyond subjective listening tests by implementing quantitative evaluation metrics, including CLAP alignment, FAD, and cross-domain metrics (CRI, CDTS).

Slide 5: Validation Metrics (Proposed)

Our pipeline uses rigorous, state-of-the-art Generative AI metrics:

CLAP Similarity Score: Measures how well the generated audio aligns with the original text prompt in a joint embedding space.
FAD (Fréchet Audio Distance): Compares VGGish embeddings of synthetic audio against a set of real-world "reference" audio files to determine overall audio realism / fidelity.
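POAS (Prompt-to-Audio Similarity): Our custom wrapper over CLAP that scores how closely the generated audio matches the semantic intent of the prompt as written.
CRI (Cross-domain Robustness Index): Measures the standard deviation of POAS scores across distinct audio domains (Speech, Music, SFX); a lower value indicates more consistent performance across domains.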
CDTS (Cross-domain Transfer Score): Measures cross-domain generalization capability.
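
The relationship between POAS and CRI can be illustrated with a minimal sketch. The exact formulas are not given here, so this assumes that POAS is the cosine similarity between a CLAP text embedding and a CLAP audio embedding, and that CRI is the population standard deviation of the per-domain mean POAS. The function names, the toy vectors standing in for real CLAP embeddings, and the example scores are all hypothetical.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def cross_domain_robustness_index(poas_by_domain):
    """Assumed CRI: population std dev of the per-domain mean POAS."""
    domain_means = [statistics.fmean(s) for s in poas_by_domain.values()]
    return statistics.pstdev(domain_means)


# Toy vectors standing in for CLAP text/audio embeddings (hypothetical values).
text_emb = [1.0, 0.0, 1.0]
audio_emb = [1.0, 1.0, 1.0]
poas = cosine_similarity(text_emb, audio_emb)

# Hypothetical POAS scores grouped by audio domain.
scores = {
    "speech": [0.30, 0.40],
    "music": [0.50, 0.60],
    "sfx": [0.40, 0.50],
}
cri = cross_domain_robustness_index(scores)
```

Under these assumptions, a CRI near zero means the pipeline follows prompts about equally well in every domain, while a large CRI points to at least one domain lagging behind the others.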

Slide 6: Existing Work / Literature Review

AudioLDM: Liu, H., et al. (2023). AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. Proposed the foundational latent-space modeling for continuous audio.
Flan-T5 (LLM Instruction Tuning): Chung, H. W., et al. (2022). Scaling Instruction-Finetuned Language Models. We use Flan-T5 as the core text encoder to handle complex instructional prompts for diffusion conditioning.
RAG (Retrieval-Augmented Generation): Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Our project leverages this principle (Unit-2) to fetch acoustic descriptive features prior to generation.

Slide 7: Technical Aspects of GenAI Concepts Used

This project integrates the units required by the GenAI Project Scope:

Unit 1 & 4 (LLMs): Uses Flan-T5 (a large language model) directly within the pipeline framework to deeply encode complex textual prompts.
Unit 2 (Prompt Engineering & RAG): Implements a PromptEnhancer class that intercepts user prompts and performs pseudo-retrieval (RAG) to append rich acoustic characteristics (e.g. converting "A piano" -> "A piano, grand piano, sustained pedal, classical reverb, high fidelity audio").
Core Generation Mechanism: Uses AudioLDM 2, a state-of-the-art text-to-audio Latent Diffusion model to generate the actual new content.
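
The keyword-based prompt enrichment described above can be sketched as follows. This is a minimal illustration and not the project's actual PromptEnhancer: the constructor argument, the `enhance` method name, and the contents of the lookup table are assumptions, with the piano entry taken from the example in the Unit 2 item.

```python
import re


class PromptEnhancer:
    """Keyword-lookup prompt enrichment (a sketch, not the project's class)."""

    def __init__(self, knowledge_base):
        # Maps a lowercase keyword to a list of acoustic descriptors.
        self.knowledge_base = knowledge_base

    def enhance(self, prompt):
        """Append descriptors for every keyword found in the prompt."""
        words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
        extras = []
        for keyword, descriptors in self.knowledge_base.items():
            if keyword in words:
                for descriptor in descriptors:
                    # Avoid repeating a descriptor shared by two keywords.
                    if descriptor not in extras:
                        extras.append(descriptor)
        if not extras:
            # No keyword matched: leave the prompt untouched.
            return prompt
        return ", ".join([prompt] + extras)


# Hypothetical offline knowledge base with a single entry.
enhancer = PromptEnhancer({
    "piano": [
        "grand piano",
        "sustained pedal",
        "classical reverb",
        "high fidelity audio",
    ],
})
enhanced = enhancer.enhance("A piano")
```

Returning the prompt unchanged when nothing matches keeps the enhancer safe to place in front of every generation call.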

Implementation Stack:

diffusers, transformers (Hugging Face)
laion-clap (Text-Audio Feature Encoding)
frechet_audio_distance (VGGish realism metric)
PyTorch (GPU Tensor Computation)
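
A hedged sketch of how this stack might be wired together for a single generation call is shown below. It uses the `AudioLDM2Pipeline` class from diffusers; the checkpoint name `cvssp/audioldm2`, the clip length, the step count, the helper names, and the use of scipy to write the WAV file are assumptions, since the settings the project actually uses are not stated here.

```python
def build_generation_kwargs(prompt, seconds=5.0, steps=100):
    """Collect keyword arguments for the AudioLDM 2 pipeline call."""
    if seconds <= 0 or steps <= 0:
        raise ValueError("seconds and steps must be positive")
    return {
        "prompt": prompt,
        "audio_length_in_s": seconds,
        "num_inference_steps": steps,
    }


def generate_wav(prompt, out_path, checkpoint="cvssp/audioldm2"):
    """Generate one clip and save it as a WAV file.

    Heavy imports live inside the function so this module can be
    imported on a machine without the GPU stack installed.
    """
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDM2Pipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision only makes sense on a GPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = AudioLDM2Pipeline.from_pretrained(checkpoint, torch_dtype=dtype)
    pipe = pipe.to(device)
    audio = pipe(**build_generation_kwargs(prompt)).audios[0]
    # Assumes the checkpoint produces 16 kHz mono audio.
    scipy.io.wavfile.write(out_path, rate=16000, data=audio)
    return out_path
```

A call such as `generate_wav("A piano, grand piano, sustained pedal", "piano.wav")` downloads the checkpoint on first use, so it needs network access and benefits greatly from a GPU.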

Slide 8: Validation Metrics (Implemented / Showing Code)

We have implemented a custom src/evaluation.py module that computes the metrics over a test batch of prompts in data/prompts.csv. To run the full pipeline and generate the final results table, outputs/results_table.csv, run:

pip install -r requirements.txt
python main.py

Outputs include: generated .wav files and a metric console dump aggregating CRI and CDTS.
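
The batch flow from prompts file to results table might look like the sketch below. It is not the contents of src/evaluation.py: the column names `prompt`, `domain`, and `poas`, the function name, and the placeholder scorer are assumptions, and in-memory buffers stand in for data/prompts.csv and outputs/results_table.csv.

```python
import csv
import io


def build_results_table(prompts_file, results_file, score_fn):
    """Read prompt rows, score each one, and write one result row per prompt."""
    reader = csv.DictReader(prompts_file)
    writer = csv.DictWriter(results_file, fieldnames=["prompt", "domain", "poas"])
    writer.writeheader()
    count = 0
    for row in reader:
        score = score_fn(row["prompt"])
        writer.writerow({
            "prompt": row["prompt"],
            "domain": row["domain"],
            "poas": f"{score:.3f}",
        })
        count += 1
    return count


# In-memory stand-ins for the prompts file and the results table.
prompts_csv = io.StringIO("prompt,domain\nA piano,music\nRain on a roof,sfx\n")
results_csv = io.StringIO()

# Placeholder scorer; a real run would generate audio and score it with CLAP.
rows_written = build_results_table(prompts_csv, results_csv, lambda prompt: 0.5)
```

Passing the scorer in as a function keeps file handling separate from the expensive generation and embedding steps, which makes the table logic easy to check without a GPU.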

Draft for internal review formatting prior to PPT creation.
