# Medical Data Warehouse

This project focuses on building a robust data pipeline for Ethiopian medical businesses by scraping data from Telegram channels, cleaning and transforming the data, and storing it in a data warehouse for analysis. The pipeline consists of two main tasks:
- Task 1: Data Scraping and Collection Pipeline - Scrapes data from Telegram channels.
- Task 2: Data Cleaning and Transformation - Cleans and transforms the scraped data using Python and DBT.
## Table of Contents

- Project Overview
- Repository Structure
- Task 1: Data Scraping and Collection Pipeline
- Task 2: Data Cleaning and Transformation
- Setup and Installation
- Usage
- Challenges and Solutions
- Contributing
- License
## Project Overview

The goal of this project is to build a data pipeline that:

- Scrapes data from Telegram channels related to Ethiopian medical businesses.
- Cleans and transforms the scraped data.
- Stores the data in a PostgreSQL database for analysis.

The pipeline is designed to be modular, scalable, and easy to maintain.
## Task 1: Data Scraping and Collection Pipeline

Scrape data from Telegram channels, including text and media, and store it in a structured format.

- Tools: Python (`telethon`, `pandas`, `logging`), Telegram API.
- Steps:
  1. Set up Telegram API access using `API_ID` and `API_HASH`.
  2. Scrape data from the specified Telegram channels (e.g., DoctorsET, Chemed).
  3. Store raw data in JSON files and media files in a structured directory.
  4. Log all activities for monitoring and debugging.
- Outputs:
  - Raw data stored in the `raw_data/` directory.
  - Media files stored in `raw_data/media/`.
  - Logs stored in `scraping.log`.
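The scraping steps above can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the function names, session name, and placeholder credentials are assumptions; only the directory layout (`raw_data/`, `raw_data/media/`), the log file name, and the channel names come from this README.

```python
# Hypothetical sketch of the Task 1 scraper using Telethon.
# API_ID/API_HASH are placeholders; obtain real values from my.telegram.org.
import json
import logging
import os

API_ID = 12345          # placeholder
API_HASH = "0123abcd"   # placeholder

logging.basicConfig(filename="scraping.log", level=logging.INFO)


def to_record(message_id, channel, date, text):
    """Normalize one Telegram message into the JSON schema we store."""
    return {"message_id": message_id, "channel": channel,
            "date": date, "text": text}


def scrape_channel(channel, limit=100):
    """Download up to `limit` messages (and any media) from one channel."""
    # Imported here so the module stays importable without Telethon installed.
    from telethon.sync import TelegramClient

    os.makedirs("raw_data/media", exist_ok=True)
    records = []
    with TelegramClient("scraper_session", API_ID, API_HASH) as client:
        for msg in client.iter_messages(channel, limit=limit):
            records.append(to_record(
                msg.id, channel,
                msg.date.isoformat() if msg.date else None,
                msg.text))
            if msg.media:  # photos/documents go next to the JSON dump
                client.download_media(msg, file="raw_data/media/")
    with open(f"raw_data/{channel}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    logging.info("Scraped %d messages from %s", len(records), channel)
```

Keeping `to_record` separate makes the JSON schema explicit and easy to test without network access.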
## Task 2: Data Cleaning and Transformation

Clean and transform the scraped data to ensure consistency, remove duplicates, and prepare it for analysis.

- Tools: Python (`pandas`, `sqlalchemy`), DBT (Data Build Tool).
- Steps:
  1. Load raw data from JSON files.
  2. Clean the data by removing duplicates, handling missing values, and standardizing formats.
  3. Validate the data to ensure quality.
  4. Store the cleaned data in a PostgreSQL database.
  5. Use DBT to transform the data into analytical models.
- Outputs:
  - Cleaned data stored in PostgreSQL (`raw_medical_data` table).
  - DBT models for staging (`stg_medical_data`) and analytics (`fact_messages`).
  - Logs stored in `data_cleaning.log`.
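A minimal sketch of the cleaning step with pandas. The function name and the `text`/`date` columns are illustrative assumptions; only the deduplication key (`message_id`, channel) and the target table name `raw_medical_data` come from this README.

```python
# Hypothetical cleaning sketch; clean_messages and the column names other
# than message_id/channel are illustrative assumptions.
import pandas as pd


def clean_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, drop empty messages, and standardize text/date formats."""
    df = df.drop_duplicates(subset=["message_id", "channel"])  # dedup key
    df = df.dropna(subset=["text"])           # handle missing message text
    df = df.assign(
        text=df["text"].str.strip(),
        # Unparseable dates become NaT instead of raising.
        date=pd.to_datetime(df["date"], errors="coerce", utc=True),
    )
    return df.reset_index(drop=True)
```

Loading and storing would then be a `pd.read_json(...)` over the `raw_data/` files followed by `DataFrame.to_sql("raw_medical_data", engine, if_exists="append", index=False)` with a SQLAlchemy engine.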
## Setup and Installation

Clone the repository and install the Python dependencies:

```bash
git clone https://github.com/Azazh/Medical-Data-Warehouse.git
cd Medical-Data-Warehouse
pip install -r requirements.txt
```

Set up PostgreSQL:
- Create a database named `medical_dw`.
- Update the connection string in `data_cleaning.py` and `profiles.yml`.
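For reference, a `profiles.yml` for dbt-postgres might look like the following. The profile name `medical_transform` and all connection values are assumptions to replace with your own; only the database name `medical_dw` comes from this README.

```yaml
# Hypothetical dbt-postgres profile; replace host/user/password with your own.
medical_transform:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: postgres
      password: your_password
      dbname: medical_dw
      schema: public
      threads: 4
```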
Set up Telegram API access:
- Obtain `API_ID` and `API_HASH` from my.telegram.org.
- Update the credentials in `telegram_scraper.py`.
Install DBT with the Postgres adapter:

```bash
pip install dbt-postgres
```

## Usage

Run the scraper, then the cleaning script:

```bash
python telegram_scraper.py
python data_cleaning.py
```

Run the DBT transformations:

```bash
cd medical_transform
dbt run --models marts
dbt test
dbt docs generate
dbt docs serve
```

## Challenges and Solutions

| Challenge | Solution |
|---|---|
| Rate limits on Telegram API | Implemented rate limiting and retries in the scraping script. |
| Inconsistent data formats in Telegram messages | Standardized text and date formats during cleaning. |
| Duplicate messages in scraped data | Removed duplicates based on `message_id` and channel. |
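The retry handling mentioned above can be sketched as a generic retry-with-backoff helper. The helper name and parameters are illustrative; with Telethon specifically, one would catch `telethon.errors.FloodWaitError` and sleep for its `seconds` attribute instead of using a fixed backoff.

```python
# Generic retry-with-backoff sketch; with_retries and its parameters are
# illustrative, not the project's actual API.
import time


def with_retries(func, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call func(), retrying with exponential backoff on the given errors."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                                 # out of attempts
            time.sleep(base_delay * (2 ** attempt))   # 1s, 2s, 4s, ...
```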
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a detailed description of your changes.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, please contact: