# Medical Data Warehouse

This project focuses on building a robust data pipeline for Ethiopian medical businesses by scraping data from Telegram channels, cleaning and transforming the data, and storing it in a data warehouse for analysis. The pipeline consists of two main tasks:
- Task 1: Data Scraping and Collection Pipeline - Scrapes data from Telegram channels.
- Task 2: Data Cleaning and Transformation - Cleans and transforms the scraped data using Python and DBT.
## Table of Contents

- Project Overview
- Repository Structure
- Task 1: Data Scraping and Collection Pipeline
- Task 2: Data Cleaning and Transformation
- Setup and Installation
- Usage
- Challenges and Solutions
- Contributing
- License
## Project Overview

The goal of this project is to build a data pipeline that:

- Scrapes data from Telegram channels related to Ethiopian medical businesses.
- Cleans and transforms the scraped data.
- Stores the data in a PostgreSQL database for analysis.

The pipeline is designed to be modular, scalable, and easy to maintain.
## Task 1: Data Scraping and Collection Pipeline

Scrape data from Telegram channels, including text and media, and store it in a structured format.

- Tools: Python (`telethon`, `pandas`, `logging`), Telegram API.
- Steps:
  1. Set up Telegram API access using `API_ID` and `API_HASH`.
  2. Scrape data from the specified Telegram channels (e.g., DoctorsET, Chemed).
  3. Store raw data in JSON files and media files in a structured directory.
  4. Log all activities for monitoring and debugging.
- Outputs:
  - Raw data stored in the `raw_data/` directory.
  - Media files stored in `raw_data/media/`.
  - Logs stored in `scraping.log`.
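The scraping steps above can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the function names, session name, and placeholder credentials are assumptions; only the directory layout (`raw_data/`, `raw_data/media/`), the log file name, and the channel names come from this README.

```python
# Hypothetical sketch of the Task 1 scraper using Telethon.
# API_ID/API_HASH are placeholders; obtain real values from my.telegram.org.
import json
import logging
import os

API_ID = 12345          # placeholder
API_HASH = "0123abcd"   # placeholder

logging.basicConfig(filename="scraping.log", level=logging.INFO)


def to_record(message_id, channel, date, text):
    """Normalize one Telegram message into the JSON schema we store."""
    return {"message_id": message_id, "channel": channel,
            "date": date, "text": text}


def scrape_channel(channel, limit=100):
    """Download up to `limit` messages (and any media) from one channel."""
    # Imported here so the module stays importable without Telethon installed.
    from telethon.sync import TelegramClient

    os.makedirs("raw_data/media", exist_ok=True)
    records = []
    with TelegramClient("scraper_session", API_ID, API_HASH) as client:
        for msg in client.iter_messages(channel, limit=limit):
            records.append(to_record(
                msg.id, channel,
                msg.date.isoformat() if msg.date else None,
                msg.text))
            if msg.media:  # photos/documents go next to the JSON dump
                client.download_media(msg, file="raw_data/media/")
    with open(f"raw_data/{channel}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    logging.info("Scraped %d messages from %s", len(records), channel)
```

Keeping `to_record` separate makes the JSON schema explicit and easy to test without network access.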
## Task 2: Data Cleaning and Transformation

Clean and transform the scraped data to ensure consistency, remove duplicates, and prepare it for analysis.

- Tools: Python (`pandas`, `sqlalchemy`), DBT (Data Build Tool).
- Steps:
  1. Load raw data from JSON files.
  2. Clean the data by removing duplicates, handling missing values, and standardizing formats.
  3. Validate the data to ensure quality.
  4. Store the cleaned data in a PostgreSQL database.
  5. Use DBT to transform the data into analytical models.
- Outputs:
  - Cleaned data stored in PostgreSQL (`raw_medical_data` table).
  - DBT models for staging (`stg_medical_data`) and analytics (`fact_messages`).
  - Logs stored in `data_cleaning.log`.
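A minimal sketch of the cleaning step with pandas. The function name and the `text`/`date` columns are illustrative assumptions; only the deduplication key (`message_id`, channel) and the target table name `raw_medical_data` come from this README.

```python
# Hypothetical cleaning sketch; clean_messages and the column names other
# than message_id/channel are illustrative assumptions.
import pandas as pd


def clean_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, drop empty messages, and standardize text/date formats."""
    df = df.drop_duplicates(subset=["message_id", "channel"])  # dedup key
    df = df.dropna(subset=["text"])           # handle missing message text
    df = df.assign(
        text=df["text"].str.strip(),
        # Unparseable dates become NaT instead of raising.
        date=pd.to_datetime(df["date"], errors="coerce", utc=True),
    )
    return df.reset_index(drop=True)
```

Loading and storing would then be a `pd.read_json(...)` over the `raw_data/` files followed by `DataFrame.to_sql("raw_medical_data", engine, if_exists="append", index=False)` with a SQLAlchemy engine.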
## Setup and Installation

Clone the repository and install the Python dependencies:

```bash
git clone https://github.com/Azazh/Medical-Data-Warehouse.git
cd Medical-Data-Warehouse
pip install -r requirements.txt
```

Set up PostgreSQL:
- Create a database named `medical_dw`.
- Update the connection string in `data_cleaning.py` and `profiles.yml`.
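For reference, a `profiles.yml` for dbt-postgres might look like the following. The profile name `medical_transform` and all connection values are assumptions to replace with your own; only the database name `medical_dw` comes from this README.

```yaml
# Hypothetical dbt-postgres profile; replace host/user/password with your own.
medical_transform:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: postgres
      password: your_password
      dbname: medical_dw
      schema: public
      threads: 4
```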
Set up Telegram API access:
- Obtain `API_ID` and `API_HASH` from my.telegram.org.
- Update the credentials in `telegram_scraper.py`.
Install DBT with the Postgres adapter:

```bash
pip install dbt-postgres
```

## Usage

Run the scraper, then the cleaning script:

```bash
python telegram_scraper.py
python data_cleaning.py
```

Run the DBT transformations:

```bash
cd medical_transform
dbt run --models marts
dbt test
dbt docs generate
dbt docs serve
```

## Challenges and Solutions

| Challenge | Solution |
|---|---|
| Rate limits on Telegram API | Implemented rate limiting and retries in the scraping script. |
| Inconsistent data formats in Telegram messages | Standardized text and date formats during cleaning. |
| Duplicate messages in scraped data | Removed duplicates based on `message_id` and channel. |
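The retry handling mentioned above can be sketched as a generic retry-with-backoff helper. The helper name and parameters are illustrative; with Telethon specifically, one would catch `telethon.errors.FloodWaitError` and sleep for its `seconds` attribute instead of using a fixed backoff.

```python
# Generic retry-with-backoff sketch; with_retries and its parameters are
# illustrative, not the project's actual API.
import time


def with_retries(func, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call func(), retrying with exponential backoff on the given errors."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                                 # out of attempts
            time.sleep(base_delay * (2 ** attempt))   # 1s, 2s, 4s, ...
```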
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a detailed description of your changes.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, please contact: