GramosoftAI/GcrawlAI


✨ Why GcrawlAI?

Most web crawlers dump raw HTML in your lap. GcrawlAI gives your LLM exactly what it needs: clean Markdown, structured metadata, and zero noise.

Here's what you can build with it:

🔍 RAG Pipelines: Feed your retrieval-augmented generation system with clean, structured web content instead of tag soup.

🤖 AI Search Tools: Index the web semantically. GcrawlAI extracts what matters, so your search understands context, not just keywords.

📄 Document Intelligence Systems: Turn web-based reports, filings, and articles into structured data your models can actually reason over.

💰 Price Monitoring Engines: Track competitor pricing across e-commerce platforms in real time, without a single broken XPath selector.

📊 Competitor Intelligence Dashboards: Continuously extract product updates, hiring signals, and announcements from competitor websites automatically.

🌐 Market Research Aggregators: Collect and synthesize data from hundreds of sources into clean, analysis-ready datasets.

🎯 Lead Generation Pipelines: Scrape company directories, job boards, and industry listings to build targeted, enriched prospect lists.

📰 News & Regulatory Trackers: Monitor policy changes, regulatory updates, and industry news without the noise of irrelevant content.

🛍️ Product Catalog Enrichers: Pull product descriptions, specs, and images from supplier sites and normalize them into your schema automatically.

No brittle CSS selectors. No HTML parsing headaches. No maintenance nightmares when a site redesigns overnight.

GcrawlAI handles the messy web so you don't have to.

  • ⚡ Instant or Deep: Single-page real-time extraction or full-site distributed crawling at scale
  • 🧹 LLM-Native Output: Auto Markdown conversion, clean enough to feed directly into your vector store
  • 🥷 Stealth by Default: Playwright stealth mode + automatic browser fallback to bypass bot detection
  • 📊 Real-Time Visibility: Live WebSocket progress tracking and an interactive dashboard
  • 🔐 Secure Auth: JWT + Email OTP, production-ready from day one
  • 🌍 Fully Open Source: MIT licensed. Fork it, extend it, ship it
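
As a hedged illustration of the LLM-native output idea, the sketch below converts a fragment of HTML into Markdown using only the Python standard library. It is a toy, not GcrawlAI's actual converter; the class name and the handful of supported tags are illustrative assumptions:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    # Hypothetical, minimal HTML -> Markdown converter (illustrative only).
    def __init__(self):
        super().__init__()
        self.out, self.href = [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown("<h1>Docs</h1><p>See the <a href='/guide'>guide</a>.</p>"))
```

A production converter also has to handle tables, code blocks, nested lists, and boilerplate stripping; this sketch only shows the shape of the problem.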

🚀 Features

| Feature | Description |
| --- | --- |
| Single Page Crawl | Direct, real-time extraction from any individual URL; instant results |
| Full Site Crawl | Distributed crawling of entire websites via Celery workers; handles thousands of pages |
| LLM-Ready Markdown | Auto-converts web content into clean Markdown optimized for LLM consumption |
| HTML & Screenshot Capture | Captures raw HTML and full-page screenshots for visual and structural analysis |
| SEO Metadata Extraction | Extracts title, description, keywords, and Open Graph tags automatically |
| Stealth & Anti-Bot | Playwright with stealth plugins; auto-fallback (Chromium → Firefox/Camoufox) |
| Real-Time Progress | Live crawl updates via WebSockets with an interactive dashboard |
| Secure Auth | JWT-based auth, Email OTP signup/verification, and password reset flow |
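
The SEO-metadata feature can be made concrete with a small standard-library sketch that pulls the title, meta description, and Open Graph tags out of raw HTML. GcrawlAI's real extractor is internal; the class and key names here are illustrative assumptions:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    # Illustrative sketch, not GcrawlAI's actual extractor.
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # <meta name="..."> and <meta property="og:..."> both carry content
            key = a.get("name") or a.get("property")
            if key and "content" in a:
                self.meta[key] = a["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data

extractor = MetaExtractor()
extractor.feed('<head><title>GcrawlAI</title>'
               '<meta name="description" content="LLM-ready crawling">'
               '<meta property="og:image" content="/og.png"></head>')
print(extractor.meta)
```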

🛠️ Technology Stack

  • Backend: FastAPI, Python 3.9+
  • Frontend: Angular
  • Database: PostgreSQL
  • Task Queue: Celery + Redis
  • Browser Automation: Playwright
  • Authentication: JWT, BCrypt

📋 Prerequisites

  • Python 3.9+
  • PostgreSQL (running on default port 5432)
  • Redis (running on default port 6379)
  • Git

Linux System Dependencies

If you are running on Linux (Debian/Ubuntu), you will need to install the following system dependencies for the automated browsers to function correctly:

sudo apt update

sudo apt install -y \
libnss3 \
libatk1.0-0t64 \
libatk-bridge2.0-0t64 \
libcups2t64 \
libxcomposite1 \
libxdamage1 \
libxrandr2 \
libgbm1 \
libasound2t64 \
libpangocairo-1.0-0 \
libgtk-3-0t64

⚙️ Installation

  1. Clone the repository

    git clone https://github.com/GramosoftAI/GcrawlAI.git
    cd GcrawlAI
  2. Create and activate virtual environment

    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    venv\Scripts\activate     # Windows
  3. Install dependencies

    pip install -r requirements.txt
  4. Install Playwright browsers

    playwright install

🔧 Configuration

  1. Database Config: Update config.yaml with your PostgreSQL credentials.

    postgres:
      host: "localhost"
      port: 5432
      database: "crawlerdb"
      user: "postgres"
      password: "your_password"
  2. Initialize Database Tables:

    python -m api.db_setup
    # OR
    python api/db_setup.py
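
For orientation, the fields in config.yaml map one-to-one onto a standard PostgreSQL connection URL. How GcrawlAI assembles its connection string internally is an assumption; this standard-library sketch just shows the mapping, percent-encoding the password so special characters survive:

```python
from urllib.parse import quote

def postgres_dsn(cfg: dict) -> str:
    # Build postgresql://user:password@host:port/database,
    # percent-encoding the password so symbols like @ or / stay parseable.
    return (
        f"postgresql://{cfg['user']}:{quote(cfg['password'], safe='')}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['database']}"
    )

cfg = {"host": "localhost", "port": 5432, "database": "crawlerdb",
       "user": "postgres", "password": "p@ss/word"}
print(postgres_dsn(cfg))
```

If your password contains characters such as @, :, or /, the quote(..., safe='') call is what keeps the resulting URL valid.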

🏃‍♂️ Running the Application

You need to run four separate processes; use a separate terminal window for each.

1. Start Redis Server (if not running as a service)

redis-server

⚠️ Windows Users: Redis does not run natively on Windows. Use WSL (Windows Subsystem for Linux) or Docker instead.
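
Since all four processes depend on Redis and PostgreSQL being reachable, a quick preflight check can save debugging time. This helper is not part of GcrawlAI; it is a small standard-library sketch assuming the default ports from the prerequisites:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in [("Redis", 6379), ("PostgreSQL", 5432)]:
    status = "up" if port_open("localhost", port) else "DOWN"
    print(f"{name:10} localhost:{port} -> {status}")
```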

2. Start Celery Worker

# Linux (recommended)
celery -A web_crawler.celery_config worker -l info

# Windows
celery -A web_crawler.celery_config.celery_app worker --loglevel=info --pool=solo

3. Start Backend API

# Windows / development
uvicorn api.api:app --port 8000

# Linux / production (recommended)
uvicorn api.api:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 120

API Docs will be available at: http://localhost:8000/docs

4. Start Frontend Dashboard

See the Angular frontend's own README for setup instructions.

Project Structure

.
├── api/                    # FastAPI backend
│   ├── api.py              # Main API entry point
│   ├── auth_manager.py     # Authentication logic
│   └── db_setup.py         # Database initialization
├── web_crawler/            # Crawler logic
│   ├── web_crawler.py      # Core crawler orchestrator
│   ├── page_crawler.py     # Individual page processing
│   └── celery_config.py    # Celery configuration
├── config.yaml             # Application configuration
└── requirements.txt        # Python dependencies

🔍 API Endpoints

  • POST /crawler: Start a new crawl job (single page or full site).
  • GET /crawler/status/{task_id}: Check the status of a Celery crawl task.
  • GET /crawl/get/content: Retrieve the extracted content for a completed crawl.
  • POST /auth/signup/send-otp: Send a one-time password to the signup email address.
  • POST /auth/signup/verify-otp: Verify the OTP and complete account creation.
  • POST /auth/signin: Sign in and receive a JWT.
  • POST /auth/forgot-password: Request a password-reset OTP by email.
  • POST /auth/reset-password: Set a new password using the emailed OTP.

Full interactive API docs available at http://localhost:8000/docs when running locally.
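
Tying the endpoints together, a client can start a crawl and then poll its status. The paths below come from the list above, but the JSON payload fields and response shape are assumptions; consult /docs for the actual schema:

```python
import json
from urllib import request

def post_json(url, payload, token=None):
    # Generic JSON POST helper; attach the JWT from /auth/signin if you have one.
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 **({"Authorization": f"Bearer {token}"} if token else {})},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical usage against a local GcrawlAI instance (field names assumed):
# job = post_json("http://localhost:8000/crawler",
#                 {"url": "https://example.com", "crawl_type": "single"})
# ...then poll GET /crawler/status/{task_id} until the task completes.
```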


🤝 Contributing

Contributions are welcome and appreciated! Here's how to get involved:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/YourFeature
  3. Commit your changes: git commit -m 'Add YourFeature'
  4. Push to your branch: git push origin feature/YourFeature
  5. Open a Pull Request

Please ensure your code follows the existing style and includes relevant tests. For large changes, open an issue first to discuss your proposal.


🙌 Credits & Inspiration

GcrawlAI was built by the team at Gramosoft Private Limited, inspired by the incredible open-source web scraping and AI ecosystem. We stand on the shoulders of giants:

| Project | What We Learned |
| --- | --- |
| 🔥 Firecrawl | LLM-ready Markdown output, distributed crawling architecture, and benchmark-driven quality |
| 🕷️ ScrapeGraphAI | Graph-based pipeline design and LLM-powered structured extraction |
| 🎭 Playwright | Browser automation, stealth crawling, and anti-bot bypass strategies |
| ⚡ FastAPI | High-performance async API design patterns |
| 🌿 Celery | Distributed task queue architecture for large-scale crawling |
| 🔴 Redis | In-memory message brokering for task queue management |
| 🐘 PostgreSQL | Reliable relational data storage for crawl results and auth |

Disclaimer: GcrawlAI is an independent open-source project built by Gramosoft Private Limited. All referenced projects are the intellectual property of their respective owners and contributors. GcrawlAI is not affiliated with, derived from, or endorsed by any of the above projects. We simply admire their work and credit them accordingly.


📄 License

GcrawlAI is released under the MIT License.

MIT License

Copyright (c) 2026 Gramosoft Private Limited

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

See the LICENSE file for full details.


🙏 Acknowledgements

  • Thank you to all contributors and the open-source community for your continued support
  • GcrawlAI is intended for legitimate data extraction, AI development, and research purposes only
  • Users are responsible for respecting websites' robots.txt directives, terms of service, and applicable privacy policies when crawling

Built with ❤️ by Gramosoft Private Limited

⭐ If GcrawlAI saves you time, please star the repo; it helps others discover it!
