A Python web scraper designed to extract Urdu poetry (ghazals) from Rekhta.org, one of the largest online repositories of Urdu literature.
- Urdu Poetry Extraction: Automatically extracts Urdu verses from Rekhta.org ghazal pages
- Text Cleaning: Removes HTML artifacts and normalizes text formatting
- Unicode Support: Full support for Urdu Unicode characters (Arabic script)
- Simple Output: Saves extracted verses to a clean text file
- Error Handling: Robust error handling with informative messages
- Python 3.13 or higher
- Internet connection to access Rekhta.org
- Clone the repository:
git clone https://github.com/xposed73/Rekhta-Scraper-Urdu.git
cd Rekhta-Scraper-Urdu
- Install dependencies:
uv sync
If you don't have uv installed, you can install it using one of these methods:
On Windows (PowerShell):
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
On macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Using pip:
pip install uv
Run the scraper with the default URL (Allama Iqbal's ghazal):
uv run app.py
The script will:
- Fetch the ghazal from Rekhta.org
- Extract all Urdu verses
- Save them to ghazal.txt
To scrape a different ghazal, modify the url variable in app.py:
url = "https://www.rekhta.org/ghazals/your-ghazal-url-here"
The extracted verses are saved to ghazal.txt in UTF-8 encoding, with each verse on a separate line.
Example output:
ستاروں سے آگے جہاں اور بھی ہیں
ابھی عشق کے امتحاں اور بھی ہیں
تہی زندگی سے نہیں یہ فضائیں
...
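The output format described above (UTF-8, one verse per line) can be written and read back with a few lines of Python. This is a minimal sketch rather than the code in app.py; the helper names `save_verses` and `load_verses` are hypothetical.

```python
from pathlib import Path


def save_verses(verses, path="ghazal.txt"):
    """Write one verse per line, always encoded as UTF-8."""
    # Passing the encoding explicitly avoids platform defaults (for example
    # cp1252 on some Windows setups) that cannot represent Urdu characters.
    text = "".join(verse + "\n" for verse in verses)
    Path(path).write_text(text, encoding="utf-8")


def load_verses(path="ghazal.txt"):
    """Read the file back into a list of verses, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```

Reading the file with the same explicit encoding is what keeps the round trip lossless, for example when loading ghazal.txt into another script for analysis.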
- Web Scraping: Uses requests and BeautifulSoup to fetch and parse HTML content
- Text Extraction: Identifies Urdu text blocks using CSS selectors
- Text Cleaning: Removes HTML entities and normalizes whitespace
- Unicode Detection: Uses regex patterns to identify Urdu/Arabic script
- File Output: Saves clean verses to a text file
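The cleaning and detection steps listed above could look roughly like the following. It is a sketch under stated assumptions: the function names are hypothetical, and the U+0600 to U+06FF range is one reasonable way to detect Arabic-script text, not necessarily the pattern app.py uses. Fetching and CSS-selector parsing are left out so the example depends only on the standard library.

```python
import html
import re

# Assumption: the basic Arabic Unicode block (U+0600-U+06FF), which contains
# the letters used to write Urdu. The actual script may use another range.
URDU_PATTERN = re.compile(r"[\u0600-\u06FF]")


def clean_text(raw):
    """Decode HTML entities and collapse runs of whitespace to one space."""
    text = html.unescape(raw)  # "&nbsp;" -> non-breaking space, "&amp;" -> "&"
    # In Python 3, \s also matches the non-breaking space, so it is normalized too.
    return re.sub(r"\s+", " ", text).strip()


def is_urdu(text):
    """Return True if the text contains at least one Arabic-script character."""
    return bool(URDU_PATTERN.search(text))


def extract_verses(raw_lines):
    """Clean each candidate line and keep only non-empty Urdu lines."""
    cleaned = (clean_text(line) for line in raw_lines)
    return [line for line in cleaned if line and is_urdu(line)]
```

Here `raw_lines` stands in for the text of whatever elements the CSS selectors matched. Non-Urdu page chrome such as button labels is dropped by the `is_urdu` filter.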
rekhta-scraper/
├── app.py # Main scraper script
├── ghazal.txt # Output file (generated)
├── pyproject.toml # Project configuration
├── uv.lock # Dependency lock file
└── README.md # This file
- requests: HTTP library for web requests
- beautifulsoup4: HTML parsing and extraction
- re: Regular expressions for text processing (part of the Python standard library, no installation needed)
- Rate Limiting: Be respectful of Rekhta.org's servers. Don't make too many requests in quick succession.
- Terms of Service: Ensure your usage complies with Rekhta.org's terms of service.
- Educational Use: This tool is intended for educational and research purposes.
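When scraping more than one ghazal, the rate-limiting advice above can be followed by pausing between requests. The sketch below is one possible approach and is not part of app.py: `fetch_politely`, its parameters, and the two-second default delay are all assumptions chosen for illustration.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

With requests installed, a call might look like `fetch_politely(urls, lambda u: requests.get(u, timeout=10).text)`. A longer delay is the safer choice when in doubt.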
Contributions are welcome! Here are some ways you can help:
- Bug Reports: Report issues with specific URLs or error messages
- Feature Requests: Suggest improvements or new features
- Code Improvements: Submit pull requests for better code organization
- Documentation: Help improve this README or add code comments
This project is open source. Please ensure your usage complies with Rekhta.org's terms of service.
- Rekhta.org for providing access to Urdu literature
- The Urdu poetry community for preserving and sharing this beautiful literary tradition
- Rekhta.org - The source website
- BeautifulSoup Documentation
- Requests Documentation
Note: This scraper is designed for educational purposes. Please respect the source website's robots.txt and terms of service when using this tool.
If you find this project helpful, consider supporting it by donating via UPI.
Thank you for your support! ❤️
