Production-quality AI chatbot providing accurate, real-time information about Startup India policies, funding, registration, and benefits.
- RAG Pipeline: Retrieval-Augmented Generation with confidence scoring
- Smart Intent Detection: Automatically categorizes queries (eligibility, funding, registration, etc.)
- Fallback Handling: Graceful handling of off-topic or unclear queries
- Session Management: Tracks conversations and maintains context
- Dynamic Content: Handles JavaScript-rendered pages with Playwright
- PDF Processing: Extracts text from government documents using PyMuPDF
- Content Cleaning: Removes navigation, footers, and irrelevant content
- Structured Output: Organized by topics, sections, and metadata
- Vector Search: ChromaDB with sentence-transformers embeddings
- LLM Integration: Groq API with llama3-8b-8192 for fast responses
- Confidence Scoring: Evaluates answer reliability
- Source Attribution: Links responses to original documents
- Modern UI: Gradient designs, animations, and responsive layout
- Real-time Chat: Live conversation with typing indicators
- Source Cards: Interactive source display with relevance scores
- Quick Topics: Pre-defined questions for common queries
- Admin Panel: System stats, reload functions, and debugging tools
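The intent detection feature listed above can be illustrated with a small keyword matcher. This is only a sketch: the category names, the keyword lists, and the `detect_intent` function are assumptions made for illustration, and the actual logic in `query_handler.py` may work differently.

```python
# Hypothetical keyword map; the project's real categories and rules may differ.
INTENT_KEYWORDS = {
    "registration": ("register", "registration", "dpiit", "certification"),
    "eligibility": ("eligib", "who can apply", "age limit"),
    "funding": ("fund", "grant", "seed", "investor"),
    "tax": ("tax", "exemption"),
}


def detect_intent(query: str) -> str:
    """Return the category with the most keyword hits, or 'general' if none match."""
    text = query.lower()
    scores = {
        intent: sum(keyword in text for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best_intent, best_score = max(scores.items(), key=lambda item: item[1])
    return best_intent if best_score > 0 else "general"
```

A query that matches no keyword falls through to `"general"`, which is one simple way to route off-topic questions to a fallback response.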
# Python 3.11 or lower required
python --version
# Install Playwright browsers
playwright install

git clone <repository-url>
cd startupguru
pip install -r requirements.txt

# Copy the example environment file
cp .env.example .env
# Edit .env and add your actual Groq API key
# Get your API key from: https://console.groq.com/
# Replace 'your_groq_api_key_here' with your actual key

# Single command to scrape, process, and deploy
python startupguru_main.py pipeline

# Start both backend and frontend
python startupguru_main.py deploy

Access Points:
- Frontend: http://localhost:8501
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
# Create virtual environment
python -m venv startupguru_env
source startupguru_env/bin/activate # Linux/Mac
# or
startupguru_env\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

# Install browser drivers
playwright install chromium

Scraping Only:
python startupguru_main.py scrape --force

Processing Only:
python startupguru_main.py process --rebuild

API Server Only:
python startupguru_main.py serve --port 8000

Frontend Only:
python startupguru_main.py frontend --port 8501

# Test complete system
python startupguru_main.py test
# View system statistics
python startupguru_main.py stats

graph TB
A[User Query] --> B[Streamlit Frontend]
B --> C[FastAPI Backend]
C --> D[Query Handler]
D --> E[Intent Detection]
D --> F[Vector Search]
F --> G[ChromaDB]
D --> H[LLM Processing]
H --> I[Groq API]
D --> J[Response Generation]
J --> B
K[Web Scraper] --> L[Document Processor]
L --> M[Text Chunking]
M --> N[Embeddings]
N --> G
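The "Text Chunking" step in the diagram above can be sketched in plain Python. This is a simplified stand-in for the LangChain splitter the project uses, not its actual implementation; the `chunk_text` function and its default sizes are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Overlapping chunks keep a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.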
- Smart Scraper (smart_scraper.py)
  - Playwright-based dynamic content scraping
  - PDF text extraction
  - Content cleaning and structuring
- Document Processor (document_processor.py)
  - LangChain text splitting
  - Sentence-transformers embeddings
  - ChromaDB vector storage
- Query Handler (query_handler.py)
  - RAG pipeline implementation
  - Confidence scoring
  - Fallback mechanisms
- FastAPI Backend (startupguru_api.py)
  - RESTful API endpoints
  - Background task processing
  - Comprehensive error handling
- Streamlit Frontend (startupguru_app.py)
  - Modern chat interface
  - Real-time system monitoring
  - Interactive source display
- Framework: FastAPI 0.104+
- LLM: Groq API (llama3-8b-8192)
- Embeddings: sentence-transformers (all-MiniLM-L6-v2)
- Vector DB: ChromaDB 0.4.18+
- Text Processing: LangChain 0.1+
- Browser Automation: Playwright 1.40+
- PDF Processing: PyMuPDF 1.23+
- HTML Parsing: BeautifulSoup4 4.12+
- UI Framework: Streamlit 1.28+
- Styling: Custom CSS with gradients and animations
- Charts: Plotly (optional)
- Logging: Loguru
- CLI: Click
- Environment: python-dotenv
POST /chat
Content-Type: application/json

{
  "message": "What is Startup India?",
  "session_id": "optional_session_id",
  "include_debug": false
}

Response:
{
  "response": "Startup India is a flagship initiative...",
  "confidence": 0.89,
  "sources": [
    {
      "title": "Startup India Initiative",
      "url": "https://www.startupindia.gov.in/...",
      "topic": "general",
      "similarity": 0.92
    }
  ],
  "topic_detected": "startup_definition",
  "processing_time": 1.23,
  "session_id": "session_123"
}

- GET /health: System health check
- GET /stats: System statistics
- POST /scrape: Start background scraping
- POST /process: Start document processing
- POST /reload: Full system reload
- GET /search: Direct document search
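The `/chat` request shown above can be sent from Python using only the standard library. This is a minimal client sketch: the helper names `build_chat_request` and `ask` are made up for illustration, and the base URL assumes the default local API address.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed default local API address


def build_chat_request(message, session_id=None, include_debug=False,
                       base_url=API_URL):
    """Build a POST /chat request with the JSON body documented above."""
    payload = {"message": message, "include_debug": include_debug}
    if session_id is not None:
        payload["session_id"] = session_id
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_chat_request(message, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, `ask("What is Startup India?")` should return a dict with the `response`, `confidence`, and `sources` keys shown in the sample response.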
- Response Time: ~1-3 seconds per query
- Throughput: 50+ concurrent requests
- Accuracy: 85%+ confidence on domain queries
- Scraping Speed: ~100 pages in 10 minutes
- Sentence-transformers for fast embeddings
- ChromaDB for efficient vector search
- Groq API for ultra-fast LLM responses
- Background task processing
- Caching and session management
- "What is Startup India initiative?"
- "How does Startup India help entrepreneurs?"
- "How to register a startup in India?"
- "What is DPIIT recognition process?"
- "Steps to get startup certification"
- "What are the eligibility criteria for startups?"
- "Who can apply for Startup India benefits?"
- "Age limit for startup recognition"
- "What funding options are available?"
- "Seed fund scheme details"
- "Government grants for startups"
- "Tax exemptions for startups"
- "Income tax benefits under Startup India"
- "Angel tax provisions"
- "Support for women entrepreneurs"
- "Women-specific startup schemes"
# View comprehensive stats
python startupguru_main.py stats

All queries are logged to ./logs/query_log.csv with:
- Timestamp
- Query text
- Response
- Confidence score
- Processing time
- Topic detected
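The logged fields above make it straightforward to summarize chatbot quality per topic. The sketch below assumes the CSV has a header row with the column names `confidence`, `processing_time`, and `topic_detected`; the actual header names in `query_log.csv` may differ, and `summarize_log` is a hypothetical helper.

```python
import csv
from collections import defaultdict


def summarize_log(lines) -> dict:
    """Aggregate query count, mean confidence, and mean processing time per topic.

    `lines` is any iterable of CSV lines, such as an open file object.
    Column names are assumptions about the log's header row.
    """
    totals = defaultdict(lambda: {"queries": 0, "confidence": 0.0, "time": 0.0})
    for row in csv.DictReader(lines):
        entry = totals[row["topic_detected"]]
        entry["queries"] += 1
        entry["confidence"] += float(row["confidence"])
        entry["time"] += float(row["processing_time"])
    return {
        topic: {
            "queries": entry["queries"],
            "avg_confidence": entry["confidence"] / entry["queries"],
            "avg_processing_time": entry["time"] / entry["queries"],
        }
        for topic, entry in totals.items()
    }
```

To run it against the real log, open the file with `open("./logs/query_log.csv", newline="", encoding="utf-8")` and pass the file object to `summarize_log`.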
Enable debug information in API calls:
{
  "message": "your query",
  "include_debug": true
}

# Development mode with auto-reload
python startupguru_main.py serve --reload
python startupguru_main.py frontend

# Coming soon - Dockerfile for containerized deployment

# Production server
python startupguru_main.py deploy

- Vercel: Deploy Streamlit frontend
- Railway: Deploy FastAPI backend
- Heroku: Full-stack deployment
# Clone repository
git clone <repo-url>
cd startupguru
# Install development dependencies
pip install -r requirements.txt
pip install black flake8 pytest
# Run tests
python startupguru_main.py test
# Format code
black .

- Fork the repository
- Create feature branch
- Add tests
- Submit pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Startup India: Official government initiative data
- Groq: Fast LLM API infrastructure
- LangChain: Document processing framework
- ChromaDB: Vector database solution
- Streamlit: Beautiful web app framework
- Issues: GitHub Issues
- Documentation: This README + API docs
- Community: [Link to community/discord if available]