AI Data Extractor is a modular automation platform designed for organizations to streamline the extraction, validation, and management of customer and invoice data from emails, web forms, and PDF invoices. The system combines rule-based and AI-powered (LLM) extraction, human-in-the-loop review, and seamless export to Google Sheets, providing a scalable, secure, and user-friendly solution for data operations.
- Upload and manage files (emails, forms, invoices) via a web dashboard
- Automated extraction of key fields using hybrid rule-based and LLM methods
- Manual review, editing, and approval of extracted data
- Error and warning detection with clear user feedback
- Export confirmed data to Google Sheets
- Full audit trail and file lifecycle management
- Extensible architecture for future integrations
- Frontend: React web application for user interaction, file management, and data review
- Backend: Python REST API for business logic, extraction workflows, and data management
- Database: Stores file metadata, extracted data, and audit logs securely
- AI Extraction Engine: Supports any major LLM provider and custom rule-based logic
- External Integrations: Google Sheets API for data export
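The hybrid approach named above (rule-based logic plus an LLM) could look roughly like the following minimal sketch. It is not the project's actual code: the `FieldResult` type, the `extract_invoice_number` function, the regular expression, and the confidence values are all hypothetical. The LLM is represented by a plain callable so the example stays provider-agnostic.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldResult:
    value: Optional[str]
    confidence: float  # illustrative scale from 0.0 to 1.0
    source: str        # "rule", "llm", or "none"


# Hypothetical rule: matches text such as "Invoice No: INV-2024-001".
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\b\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)",
    re.IGNORECASE,
)


def extract_invoice_number(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> FieldResult:
    """Try the cheap rule first; call the LLM only if the rule finds nothing."""
    match = INVOICE_NO.search(text)
    if match:
        # Assumed: a rule hit is treated as high confidence.
        return FieldResult(match.group(1), 0.95, "rule")
    if llm_fallback is not None:
        value = llm_fallback(text)
        if value:
            # Assumed: LLM answers get a lower score so they are flagged for review.
            return FieldResult(value, 0.6, "llm")
    return FieldResult(None, 0.0, "none")
```

Passing the LLM call in as a function keeps the rule path testable without network access and lets any provider be plugged in behind the same interface.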
- File Upload: Users upload files (emails, forms, invoices) through the dashboard. Metadata and contents are stored securely.
- Extraction: The backend processes each file, extracting relevant fields using rule-based logic and LLMs. Confidence scores and warnings are generated for each field.
- Review & Approval: Users review extracted data in interactive tables, edit fields as needed, and approve entries. Human-in-the-loop controls ensure only validated data is finalized.
- State Management: Each file moves through three states—unprocessed, waiting (for review), and complete (approved). State transitions are tracked and visible in the dashboard.
- Export: Approved data can be exported to Google Sheets for reporting and further analysis.
- Audit & Compliance: All actions are logged for traceability and compliance. The system is designed to support GDPR and other regulatory requirements.
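The three file states described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the project's implementation: the `FileState` enum, the `transition` helper, and the forward-only transition table are hypothetical, and the audit log is modeled as a plain list.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class FileState(Enum):
    UNPROCESSED = "unprocessed"
    WAITING = "waiting"      # extracted, waiting for human review
    COMPLETE = "complete"    # reviewed and approved


# Assumed: files only move forward through the lifecycle.
ALLOWED: Dict[FileState, Set[FileState]] = {
    FileState.UNPROCESSED: {FileState.WAITING},
    FileState.WAITING: {FileState.COMPLETE},
    FileState.COMPLETE: set(),
}


def transition(
    current: FileState,
    target: FileState,
    audit_log: List[Tuple[str, str]],
) -> FileState:
    """Validate a state change and record it in the audit log."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    audit_log.append((current.value, target.value))
    return target
```

Keeping the allowed transitions in one table makes it easy to see which moves are legal and to reject, for example, marking an unprocessed file as complete without review.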
- Python 3.8+
- Node.js 16+
- Google Cloud project (for Sheets API)
- Clone the repository:
git clone <repo-url>
cd AI-Data-Extractor
- Install Python dependencies:
pip install -r requirements.txt
- Initialize the database:
cd src\backend\database
python db.py
- Start the backend server:
cd src\backend
uvicorn api.main:app --reload
- Navigate to the frontend directory:
cd src/frontend
- Install Node.js dependencies:
npm install
- Start the frontend development server:
npm start
To enable AI-powered extraction, set your ANTHROPIC_API_KEY environment variable with a valid API key from Anthropic (or your chosen LLM provider):
set ANTHROPIC_API_KEY=your_actual_api_key_here
- Upload files via the dashboard
- Review, edit, and approve extracted data
- Export confirmed data to Google Sheets
- Monitor file status and audit logs
- Add new extraction models or document types by updating backend extraction logic
- Integrate with other platforms (Excel, ERP, CRM) via modular API endpoints
- Enhance security and compliance features as needed
- See buisness-documents/technical-description/architecture.md for a full technical overview
- For troubleshooting, consult the logs and error messages in the dashboard
- Contact the maintainers for support or feature requests
After starting the backend server, you can access the interactive API documentation at:
http://localhost:8000/docs
This provides a full overview of all available endpoints and allows you to test API calls directly from your browser.
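The same endpoint overview can also be read programmatically. The sketch below assumes the backend is a FastAPI app that serves its OpenAPI schema at the default path `/openapi.json`; if the project changes that path, the URL needs adjusting. The helper names `list_endpoints` and `fetch_schema` are hypothetical.

```python
import json
from typing import Dict, List
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "options", "head"}


def list_endpoints(schema: Dict) -> List[str]:
    """Return 'METHOD /path' strings for every operation in an OpenAPI schema."""
    lines = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS:
                lines.append(f"{method.upper()} {path}")
    return lines


def fetch_schema(base_url: str = "http://localhost:8000") -> Dict:
    """Download the schema; assumes FastAPI's default /openapi.json path."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


# Usage, with the backend server running:
#     for line in list_endpoints(fetch_schema()):
#         print(line)
```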
MIT License
This project automates extraction and management of data from emails, invoices, and contact forms using Python (FastAPI, SQLAlchemy, LangChain) and React.
- Upload, view, delete, and extract data from files
- LLM and rule-based extraction workflows
- Dashboard and file management UI
- Robust error handling and workflow automation
See the code and comments for backend and frontend setup instructions.