Synthetic Data Generator From User Prompt

A FastAPI-based service that generates realistic synthetic datasets from natural language descriptions. Built with the help of the BMAD (Breakthrough Method of Agile AI-Driven Development) methodology and Claude Code.

Live Demo

Hosted on: https://data-gen.djhuang.dev/

Features

Gradio Web Interface: User-friendly web UI for generating datasets without code
Natural Language Input: Describe your dataset needs in plain English
Intelligent Schema Generation: Powered by OpenAI GPT-4o-mini for Faker-compatible schemas
High-Quality Synthetic Data: Uses Faker library with 80+ field types and domain-specific values
Multiple Domains: E-commerce, healthcare, finance, education, and general datasets
CSV Export: Ready-to-use CSV files compatible with pandas, Excel, and data science tools
Performance Optimized: Generate 1,000+ rows/second with sub-30-second response times
Schema Caching: Intelligent caching reduces API costs and improves performance
REST API: Full-featured API for programmatic access
Heroku Ready: Single-dyno deployment for cost-effective hosting

API Endpoints

Generate Dataset

POST /api/v1/generate

Request Body:

{
  "description": "E-commerce product catalog with names, prices, and categories",
  "rows": 1000,
  "format": "csv"
}

Response: CSV file download with appropriate headers
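
The request above can be sent from Python with the standard library alone. This is a minimal client sketch, assuming the service is reachable at http://localhost:8000 (the local address listed under "Access the Application") and that the response body is the raw CSV. The function names `build_generate_request` and `download_csv` are illustrative and not part of the project.

```python
import json
import urllib.request


def build_generate_request(base_url: str, description: str, rows: int) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/generate."""
    payload = {"description": description, "rows": rows, "format": "csv"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def download_csv(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the response body to out_path."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
    with open(out_path, "wb") as handle:
        handle.write(body)


# Example usage (requires a running server):
# request = build_generate_request(
#     "http://localhost:8000",
#     "E-commerce product catalog with names, prices, and categories",
#     1000,
# )
# download_csv(request, "products.csv")
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.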

Health Check

GET /health
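
A health endpoint is often polled before sending real requests, for example right after starting the server. The sketch below assumes that GET /health answers with HTTP 200 when the service is ready; the response body format is not documented here, so it is ignored. `fetch_status_code` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status_code(url: str) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return 0


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], int] = fetch_status_code,
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll url until it returns 200 (assumed to mean healthy) or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


# Example usage (requires a running server):
# ready = wait_until_healthy("http://localhost:8000/health")
```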

Cache Statistics

GET /api/v1/cache/stats

Installation

Prerequisites

Python 3.11+
uv package manager

Setup

# Clone the repository
git clone <repository-url>
cd data_generator_from_user_prompt

# Install dependencies
uv sync

# Create environment configuration
cp .env.example .env
# Edit .env and add the API key for your chosen provider (OpenAI or Anthropic)

Environment Variables

# API Provider (openai or anthropic)
API_PROVIDER=openai  # or anthropic

# API Keys (set one based on your provider)
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Demo Mode (true = no API key needed, false = uses API)
DEMO_MODE=false

# Optional settings
CACHE_ENABLED=true
LOG_LEVEL=INFO
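
The variables above interact: the provider decides which key is needed, and Demo Mode removes the need for any key. The following sketch shows one way to resolve them at startup. It is an illustration only; the project's actual logic lives in src/core/config.py and may differ. The `Settings` class, the `load_settings` function, and the fallback defaults are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_provider: str
    api_key: Optional[str]
    demo_mode: bool
    cache_enabled: bool
    log_level: str


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from an environment mapping (defaults are assumed)."""
    provider = env.get("API_PROVIDER", "openai").strip().lower()
    if provider not in {"openai", "anthropic"}:
        raise ValueError(f"Unsupported API_PROVIDER: {provider!r}")

    demo_mode = _as_bool(env.get("DEMO_MODE", "false"))
    key_name = "OPENAI_API_KEY" if provider == "openai" else "ANTHROPIC_API_KEY"
    api_key = env.get(key_name) or None

    # A key is only required when the service will actually call an LLM API.
    if not demo_mode and api_key is None:
        raise ValueError(f"{key_name} must be set when DEMO_MODE is false")

    return Settings(
        api_provider=provider,
        api_key=api_key,
        demo_mode=demo_mode,
        cache_enabled=_as_bool(env.get("CACHE_ENABLED", "true")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing the environment as a mapping makes the function easy to exercise with plain dictionaries instead of real environment variables.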

BMAD Setup

This project uses the BMAD methodology for systematic development. To set up BMAD for this project, refer to the official BMAD repository.

Running the Service with Gradio Frontend

Testing Without LLM API Key (Demo Mode)

You can test the entire application without an LLM API key using Demo Mode:

# Install dependencies
uv sync

# Set the demo mode to true
export DEMO_MODE=true

# Run in demo mode (no API key needed!)
uv run python main.py

Demo Mode uses predefined templates for common data domains (e-commerce, healthcare, finance, etc.) and allows you to test Gradio, FastAPI, and all backend services without API costs.
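
To illustrate the idea of predefined templates, the sketch below picks a template by matching keywords in the description. It is not the project's implementation: the keyword lists, the `pick_demo_template` function, and the "general" columns are assumptions. The column names for the three named domains are taken from the Sample Output section.

```python
# Hypothetical templates: domain name -> column names.
DEMO_TEMPLATES = {
    "ecommerce": ["product_name", "category", "brand", "price", "rating", "in_stock", "sku"],
    "healthcare": ["patient_name", "age", "gender", "blood_type", "condition", "department"],
    "finance": ["account_holder", "account_number", "account_type", "balance",
                "transaction_date", "merchant"],
    "general": ["name", "email", "created_at"],  # assumed fallback columns
}

# Hypothetical keyword lists used to route a description to a domain.
DOMAIN_KEYWORDS = {
    "ecommerce": ("e-commerce", "ecommerce", "product", "catalog"),
    "healthcare": ("healthcare", "patient", "medical"),
    "finance": ("financ", "account", "transaction"),
}


def pick_demo_template(description: str) -> str:
    """Return the first domain whose keywords appear in the description."""
    text = description.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return domain
    return "general"


def demo_columns(description: str) -> list[str]:
    """Look up the column list for the matched demo template."""
    return DEMO_TEMPLATES[pick_demo_template(description)]
```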

Testing with LLM API Key

# Install dependencies (including Gradio)
uv sync

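# Select your LLM provider and set its API key (Anthropic shown here;
# for OpenAI, use API_PROVIDER=openai and OPENAI_API_KEY instead)
export API_PROVIDER=anthropic
export ANTHROPIC_API_KEY=your_key_here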

# Run the integrated application
uv run python main.py

Access the Application

Gradio Web Interface: http://localhost:8000/gradio (or http://localhost:8000/)
API Documentation: http://localhost:8000/docs
Health Check: http://localhost:8000/health
Cache Stats: http://localhost:8000/api/v1/cache/stats

Using the Gradio Interface

The Gradio interface provides an intuitive way to generate datasets:

Enter Description: Describe your dataset in natural language
Set Row Count: Choose how many rows to generate (1-10,000)
Generate: Click the generate button
Preview: View the first 100 rows in your browser
Download: Download the complete dataset as CSV
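
Two of the steps above have simple numeric rules: the row count must fall between 1 and 10,000, and the preview shows only the first 100 rows. A minimal sketch of both rules follows, using only the standard library. The function names are hypothetical, and the real interface in src/frontend/gradio_app.py may implement these rules differently.

```python
import csv
import io
from itertools import islice

MIN_ROWS = 1
MAX_ROWS = 10_000
PREVIEW_LIMIT = 100


def validate_row_count(rows: int) -> int:
    """Reject row counts outside the supported 1-10,000 range."""
    if not MIN_ROWS <= rows <= MAX_ROWS:
        raise ValueError(f"rows must be between {MIN_ROWS} and {MAX_ROWS}, got {rows}")
    return rows


def preview_rows(csv_text: str, limit: int = PREVIEW_LIMIT) -> list[dict[str, str]]:
    """Parse CSV text and return at most `limit` data rows as dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, limit))
```

Using `islice` stops reading after the preview limit, so the full dataset does not need to be parsed just to show a preview.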

Example Prompts:

"E-commerce product catalog with product names, categories, prices, ratings, and stock status"
"Healthcare patient records with names, ages, blood types, medical conditions, and departments"
"Financial transactions with account holders, account numbers, transaction amounts, dates, and merchants"

Architecture

Services

SchemaGenerator: Converts natural language to Faker-compatible schemas using OpenAI
DataGenerator: Generates realistic synthetic data using Faker library
CSVExporter: Converts generated data to CSV format with proper headers
SchemaCache: Caches generated schemas to reduce API costs
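
The caching idea can be sketched as a lookup keyed by a hash of the description, so that a repeated description does not trigger a second LLM call. This is an in-memory illustration under stated assumptions: the class name `InMemorySchemaCache`, the normalization rule, and the use of SHA-256 are not taken from the project, whose cache persists to the data/ directory and may key entries differently.

```python
import hashlib
from typing import Callable


def description_key(description: str) -> str:
    """Hash a description after lowercasing and collapsing whitespace."""
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InMemorySchemaCache:
    """Hypothetical cache: description hash -> generated schema."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get_or_create(self, description: str, generate: Callable[[str], dict]) -> dict:
        """Return the cached schema, calling `generate` only on a miss."""
        key = description_key(description)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        schema = generate(description)
        self._store[key] = schema
        return schema
```

The hit and miss counters are the kind of figures a cache statistics endpoint could report, although the actual fields returned by the service are not documented here.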

Data Flow

Natural Language → Schema Generation → Data Generation → CSV Export → HTTP Response
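
The flow above can be sketched end to end with stand-ins for the external pieces. In this illustration a fixed dictionary replaces the LLM-generated schema and the standard library's `random` module replaces Faker, so the block runs without third-party packages or an API key. All function names are hypothetical. The CSV step quotes every field, which matches the style shown in the Sample Output section.

```python
import csv
import io
import random


def generate_schema(description: str) -> dict[str, list[str]]:
    """Stand-in for the LLM step: return a fixed column -> choices mapping."""
    return {
        "category": ["Clothing", "Electronics", "Books"],
        "in_stock": ["True", "False"],
    }


def generate_rows(schema: dict[str, list[str]], rows: int, seed: int = 0) -> list[dict[str, str]]:
    """Stand-in for Faker: pick a random choice for each column in each row."""
    rng = random.Random(seed)
    return [
        {column: rng.choice(choices) for column, choices in schema.items()}
        for _ in range(rows)
    ]


def export_csv(records: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialize records to CSV text with a header row and all fields quoted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def run_pipeline(description: str, rows: int) -> str:
    """Natural language -> schema -> rows -> CSV text."""
    schema = generate_schema(description)
    records = generate_rows(schema, rows)
    return export_csv(records, list(schema))
```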

Performance

Throughput: 500-18,000+ rows/second depending on dataset size
Response Time: Complete pipeline typically under 1 second for 1,000 rows
Scalability: Supports 1-10,000 rows per request
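
Throughput here means rows produced divided by elapsed wall-clock time. The sketch below shows one way to measure it for any row-generating callable. It does not reproduce the figures above, which depend on the real generator and hardware; `measure_throughput` is a hypothetical helper.

```python
import time
from typing import Callable


def measure_throughput(generate: Callable[[int], list], rows: int) -> float:
    """Return rows per second for a single call to `generate(rows)`."""
    start = time.perf_counter()
    produced = generate(rows)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        # Guard against a zero reading on very coarse timers.
        return float("inf")
    return len(produced) / elapsed


# Example usage with a trivial generator:
# rate = measure_throughput(lambda n: [{"id": i} for i in range(n)], 10_000)
# print(f"{rate:,.0f} rows/second")
```

Small requests are dominated by fixed overhead, which is one plausible reason the reported range varies with dataset size.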

Sample Output

The service generates realistic data across multiple domains:

E-commerce

"product_name","category","brand","price","rating","in_stock","sku"
"Balanced web-enabled alliance","Clothing","Apple","sample_price_0","5","True","693"
"Robust background productivity","Clothing","Amazon","sample_price_1","3","True","781"

Healthcare

"patient_name","age","gender","blood_type","condition","department"
"Roy Livingston","22","Male","O+","Heart Disease","Emergency"
"Aaron Rich","78","Male","B-","Arthritis","Neurology"

Finance

"account_holder","account_number","account_type","balance","transaction_date","merchant"
"Anna Berger","194","Loan","sample_balance_0","2024-10-24","Stafford, Kennedy and Young"
"Elizabeth Esparza","76","Credit","sample_balance_1","2024-04-11","Fox Inc"

Development

This project was developed using modern software engineering practices:

Methodology

BMAD Framework: The Breakthrough Method of Agile AI-Driven Development, used for systematic development
Story-Driven Development: Feature implementation based on user stories and acceptance criteria

Tools Used

Claude Code: AI-powered development assistant for code generation and architecture design
FastAPI: Modern, high-performance web framework for APIs
Pydantic: Data validation and settings management
Faker: Library for generating fake data
Pandas: Data manipulation and CSV export
uv: Fast Python package and project manager

Code Quality

Type Hints: Full type annotation throughout the codebase
Error Handling: Comprehensive error handling with appropriate HTTP status codes
Performance: Optimized for high throughput and low latency
Testing: Manual testing framework with comprehensive test scenarios

Project Structure

data_generator_from_user_prompt/
├── src/
│   ├── api/
│   │   ├── models.py          # Pydantic models
│   │   └── routes.py          # FastAPI routes
│   ├── frontend/
│   │   └── gradio_app.py      # Gradio web interface
│   ├── services/
│   │   ├── schema_generator.py # OpenAI integration
│   │   ├── data_generator.py   # Faker-based data generation
│   │   ├── csv_exporter.py     # CSV export functionality
│   │   └── cache_service.py    # Schema caching
│   ├── core/
│   │   ├── config.py          # Configuration management
│   │   └── exceptions.py      # Custom exceptions
│   └── utils/
│       ├── hash_utils.py      # Hashing utilities
│       └── file_operations.py # File I/O utilities
├── data/                      # Cache storage (git-ignored)
├── docs/stories/              # User stories and documentation
├── main.py                    # Application entry point
├── test_full_pipeline.py      # Complete pipeline testing
├── Procfile                   # Heroku deployment config
├── runtime.txt                # Python version for Heroku
├── requirements.txt           # Python dependencies
├── DEPLOYMENT.md              # Deployment guide
└── README.md

Deployment

Heroku Deployment

This application is configured for easy deployment to Heroku with a single Basic dyno ($7/month):

Initial Setup

# Login to Heroku
heroku login

# Create new app
heroku create your-app-name

Configure Environment Variables

Choose Your API Provider:

# For Anthropic Claude
heroku config:set API_PROVIDER=anthropic
heroku config:set ANTHROPIC_API_KEY=your_anthropic_api_key_here

# OR for OpenAI
heroku config:set API_PROVIDER=openai
heroku config:set OPENAI_API_KEY=your_openai_api_key_here

# Optional: Enable caching
heroku config:set CACHE_ENABLED=true

Switch Between Demo Mode and Production Mode:

# Enable Demo Mode (No API Key Required)
heroku config:set DEMO_MODE=true
heroku restart

# Enable Production Mode (Uses API)
heroku config:set DEMO_MODE=false
heroku restart

Verify Configuration:

# View all environment variables
heroku config

Deploy Application

# Deploy from the main branch
git push heroku main

# OR deploy a feature branch by pushing it to Heroku's main branch
# (Heroku always deploys from its main branch, so the branch name
# after the colon must be main)
git push heroku your-branch-name:main

Scale Dynos

# Scale to 1 Basic dyno (starts the app)
heroku ps:scale web=1

# Scale down to 0 (stops the app to save costs)
heroku ps:scale web=0

# View current dyno status
heroku ps

Access Your Application

# Open application in browser
heroku open

# Get application URL
heroku info | grep "Web URL"

Monitor and Debug

# View real-time logs
heroku logs --tail

# View recent logs
heroku logs

# View app information
heroku info

# Restart application
heroku restart

Managing Your Heroku App

# View app info (includes dyno type, region, URL, etc.)
heroku info

# Restart app (useful after config changes)
heroku restart

# View current dyno type and status
heroku ps

# View all environment variables
heroku config

# Remove an environment variable
heroku config:unset VARIABLE_NAME

For detailed deployment instructions, see DEPLOYMENT.md.
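
Heroku assigns each web dyno a port through the PORT environment variable, while local runs in this README use port 8000. The sketch below shows one way an entry point could choose its bind address in both settings. Whether main.py does exactly this is an assumption, and `resolve_bind_address` is a hypothetical helper.

```python
import os
from typing import Mapping


def resolve_bind_address(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Pick host and port: Heroku-style PORT if present, else local defaults."""
    if "PORT" in env:
        # On a dyno the app must listen on all interfaces at the assigned port.
        return "0.0.0.0", int(env["PORT"])
    return "127.0.0.1", 8000


# Example usage with an ASGI server such as uvicorn (not imported here):
# host, port = resolve_bind_address()
# uvicorn.run(app, host=host, port=port)
```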

Local Development

# Install dependencies
uv sync

# Set environment variables
export OPENAI_API_KEY=your_key_here
export CACHE_ENABLED=true

# Run application
uv run python main.py

# Access at http://localhost:8000

Contributing

This project demonstrates modern AI-assisted development practices:

User Story Development: Each feature implemented based on detailed acceptance criteria
Test-Driven Approach: Comprehensive testing without external API dependencies
Performance Focus: Built with scalability and speed requirements in mind
Documentation: Clear documentation and examples for easy adoption
