Rekhta Scraper Urdu

A Python web scraper designed to extract Urdu poetry (ghazals) from Rekhta.org, one of the largest online repositories of Urdu literature.

🌟 Features

Urdu Poetry Extraction: Automatically extracts Urdu verses from Rekhta.org ghazal pages
Text Cleaning: Removes HTML artifacts and normalizes text formatting
Unicode Support: Full support for Urdu Unicode characters (Arabic script)
Simple Output: Saves extracted verses to a clean text file
Error Handling: Robust error handling with informative messages

📋 Prerequisites

Python 3.13 or higher
Internet connection to access Rekhta.org

🚀 Installation

Clone the repository:

```
git clone https://github.com/xposed73/Rekhta-Scraper-Urdu.git
cd Rekhta-Scraper-Urdu
```

Install dependencies:
```
uv sync
```

Installing uv

If you don't have uv installed, you can install it using one of these methods:

On Windows (PowerShell):

```
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

On macOS/Linux:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Using pip:

```
pip install uv
```

📖 Usage

Basic Usage

Run the scraper with the default URL (Allama Iqbal's ghazal):

```
uv run app.py
```

The script will:

Fetch the ghazal from Rekhta.org
Extract all Urdu verses
Save them to ghazal.txt

Custom URL

To scrape a different ghazal, modify the url variable in app.py:

```
url = "https://www.rekhta.org/ghazals/your-ghazal-url-here"
```

Output

The extracted verses are saved to ghazal.txt in UTF-8 encoding, with each verse on a separate line.

Example output:

```
ستاروں سے آگے جہاں اور بھی ہیں
ابھی عشق کے امتحاں اور بھی ہیں
تہی زندگی سے نہیں یہ فضائیں
...
```
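As an illustration of the output format described above, here is a minimal sketch of writing verses one per line in UTF-8 and reading them back. The function names `save_verses` and `load_verses` are hypothetical and are not taken from app.py, which may write the file differently.

```python
from pathlib import Path


def save_verses(verses: list[str], path: str = "ghazal.txt") -> None:
    """Write each verse on its own line, encoded as UTF-8."""
    # Hypothetical helper: app.py may produce the file in another way.
    Path(path).write_text("\n".join(verses) + "\n", encoding="utf-8")


def load_verses(path: str = "ghazal.txt") -> list[str]:
    """Read the verses back as a list, one entry per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


if __name__ == "__main__":
    import os
    import tempfile

    # Use a temporary directory so the demo does not overwrite a real ghazal.txt.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "ghazal.txt")
        save_verses(
            ["ستاروں سے آگے جہاں اور بھی ہیں", "ابھی عشق کے امتحاں اور بھی ہیں"],
            target,
        )
        for verse in load_verses(target):
            print(verse)
```

Passing `encoding="utf-8"` explicitly matters on systems whose default encoding is not UTF-8, where Urdu text could otherwise fail to write or be garbled.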

🛠️ How It Works

Web Scraping: Uses requests and BeautifulSoup to fetch and parse HTML content
Text Extraction: Identifies Urdu text blocks using CSS selectors
Text Cleaning: Removes HTML entities and normalizes whitespace
Unicode Detection: Uses regex patterns to identify Urdu/Arabic script
File Output: Saves clean verses to a text file
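The cleaning and Unicode detection steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in app.py: the function names are hypothetical, and the pattern assumes that matching the main Arabic Unicode block (U+0600 to U+06FF) is enough to recognize Urdu text. Fetching with requests and selecting elements with BeautifulSoup are left out because the selectors the script uses are not documented here.

```python
import html
import re

# Assumption: any character in the Arabic block U+0600-U+06FF marks a line as Urdu.
# The actual pattern in app.py may be broader or narrower.
URDU_RE = re.compile(r"[\u0600-\u06FF]")


def clean_text(raw: str) -> str:
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    text = html.unescape(raw)
    return re.sub(r"\s+", " ", text).strip()


def extract_urdu_lines(blocks: list[str]) -> list[str]:
    """Clean each text block and keep only those containing Arabic-script characters."""
    cleaned = (clean_text(block) for block in blocks)
    return [line for line in cleaned if URDU_RE.search(line)]


if __name__ == "__main__":
    # Text blocks as they might look after being pulled out of parsed HTML.
    sample = ["  ستاروں سے آگے   جہاں اور بھی ہیں&nbsp;", "Read more", ""]
    for line in extract_urdu_lines(sample):
        print(line)
```

Cleaning before filtering means that blocks consisting only of entities or whitespace are reduced to empty strings and dropped by the same Urdu check.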

📁 Project Structure

```
Rekhta-Scraper-Urdu/
├── .gitignore          # Git ignore rules
├── .python-version     # Python version pin
├── app.py              # Main scraper script
├── ghazal.txt          # Output file (generated)
├── pyproject.toml      # Project configuration
├── uv.lock             # Dependency lock file
└── README.md           # This file
```

🔧 Dependencies

requests: HTTP library for web requests
beautifulsoup4: HTML parsing and extraction
re: Regular expressions for text processing (part of the Python standard library, so nothing to install)

⚠️ Important Notes

Rate Limiting: Be respectful of Rekhta.org's servers. Don't make too many requests in quick succession.
Terms of Service: Ensure your usage complies with Rekhta.org's terms of service.
Educational Use: This tool is intended for educational and research purposes.
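For anyone adapting the script to fetch several ghazals, the rate-limiting note above could be followed with a fixed pause between requests. The sketch below is hypothetical: `fetch_politely` is not part of app.py, the `fetch` argument stands in for whatever function downloads a page, and the 2-second default is an arbitrary placeholder rather than a documented limit.

```python
import time
from typing import Callable, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,  # Assumption: arbitrary placeholder value.
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the demo makes no network requests.
    pages = fetch_politely(
        ["https://www.rekhta.org/ghazals/your-ghazal-url-here"],
        fetch=lambda url: f"<html>{url}</html>",
        delay_seconds=0.1,
    )
    print(len(pages))
```

Taking `sleep` as a parameter keeps the helper easy to test without actually waiting.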

🤝 Contributing

Contributions are welcome! Here are some ways you can help:

Bug Reports: Report issues with specific URLs or error messages
Feature Requests: Suggest improvements or new features
Code Improvements: Submit pull requests for better code organization
Documentation: Help improve this README or add code comments

📝 License

This project is open source. Please ensure your usage complies with Rekhta.org's terms of service.

🙏 Acknowledgments

Rekhta.org for providing access to Urdu literature
The Urdu poetry community for preserving and sharing this beautiful literary tradition

🔗 Related Links

Note: This scraper is designed for educational purposes. Please respect the source website's robots.txt and terms of service when using this tool.

🙏 Support My Work

If you find this project helpful, consider supporting it by donating via UPI.

Thank you for your support! ❤️
