Azazh/Medical-Data-Warehouse

Ethiopian Medical Businesses Data Pipeline

This project focuses on building a robust data pipeline for Ethiopian medical businesses by scraping data from Telegram channels, cleaning and transforming the data, and storing it in a data warehouse for analysis. The pipeline consists of two main tasks:

  1. Task 1: Data Scraping and Collection Pipeline - Scrapes data from Telegram channels.
  2. Task 2: Data Cleaning and Transformation - Cleans and transforms the scraped data using Python and DBT.

Table of Contents

  1. Project Overview
  2. Task 1: Data Scraping and Collection Pipeline
  3. Task 2: Data Cleaning and Transformation
  4. Setup and Installation
  5. Usage
  6. Challenges and Solutions
  7. Contributing
  8. License
  9. Contact

Project Overview

The goal of this project is to build a data pipeline that:

  • Scrapes data from Telegram channels related to Ethiopian medical businesses.
  • Cleans and transforms the scraped data.
  • Stores the data in a PostgreSQL database for analysis.

The pipeline is modular: scraping, cleaning, and DBT transformation are separate steps that can be run and maintained independently.

Task 1: Data Scraping and Collection Pipeline

Objective

Scrape data from Telegram channels, including text and media, and store it in a structured format.

Implementation

  • Tools: Python (telethon, pandas, logging), Telegram API.
  • Steps:
    1. Set up Telegram API access using API_ID and API_HASH.
    2. Scrape data from specified Telegram channels (e.g., DoctorsET, Chemed).
    3. Store raw data in JSON files and media files in a structured directory.
    4. Log all activities for monitoring and debugging.
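The steps above could be sketched roughly as follows. This is a minimal illustration, not the contents of telegram_scraper.py: the helper names (`message_to_record`, `save_records`, `scrape_channel`) and the one-JSON-file-per-channel layout are assumptions. The `client` argument is expected to be an already connected Telethon `TelegramClient`; the sketch relies only on its `iter_messages` and `download_media` methods.

```python
import json
from pathlib import Path

RAW_DIR = Path("raw_data")  # output directory named in this README


def message_to_record(channel, message_id, text, date, media_path=None):
    """Flatten one Telegram message into a JSON-serializable dict."""
    return {
        "channel": channel,
        "message_id": message_id,
        "text": text or "",
        "date": date.isoformat() if date else None,
        "media_path": str(media_path) if media_path else None,
    }


def save_records(records, channel, raw_dir=RAW_DIR):
    """Write the records of one channel to <raw_dir>/<channel>.json (assumed layout)."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_path = raw_dir / f"{channel}.json"
    with out_path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return out_path


async def scrape_channel(client, channel, limit=100, raw_dir=RAW_DIR):
    """Collect up to `limit` messages from `channel` and store them as JSON."""
    media_dir = Path(raw_dir) / "media"
    media_dir.mkdir(parents=True, exist_ok=True)
    records = []
    async for message in client.iter_messages(channel, limit=limit):
        media_path = None
        if message.media:
            # download_media returns the saved path, or None if nothing was saved
            media_path = await client.download_media(message, file=str(media_dir))
        records.append(
            message_to_record(channel, message.id, message.text, message.date, media_path)
        )
    return save_records(records, channel, raw_dir)
```

Keeping the record-building and file-writing helpers separate from the Telethon loop makes them easy to exercise without network access.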

Output

  • Raw data stored in raw_data/ directory.
  • Media files stored in raw_data/media/.
  • Logs stored in scraping.log.
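File logging of this kind can be set up with the standard `logging` module. The sketch below is one possible configuration; the helper name `get_logger` and the log format are assumptions, not taken from the project's scripts.

```python
import logging


def get_logger(name, log_file):
    """Return a logger that appends INFO-and-above records to log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

For example, `get_logger("scraper", "scraping.log").info("Scraped %d messages", count)` would append one timestamped line to scraping.log.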

Task 2: Data Cleaning and Transformation

Objective

Clean and transform the scraped data to ensure consistency, remove duplicates, and prepare it for analysis.

Implementation

  • Tools: Python (pandas, sqlalchemy), DBT (Data Build Tool).
  • Steps:
    1. Load raw data from JSON files.
    2. Clean data by removing duplicates, handling missing values, and standardizing formats.
    3. Validate data to ensure quality.
    4. Store cleaned data in a PostgreSQL database.
    5. Use DBT to transform data into analytical models.
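The cleaning step (step 2) could look roughly like the sketch below. To stay dependency-free it uses only the standard library rather than pandas, so it is an illustration of the logic and not the code in data_cleaning.py. The record fields (`channel`, `message_id`, `text`, `date`) and the choice to treat naive timestamps as UTC are assumptions.

```python
import re
from datetime import datetime, timezone


def normalize_date(value):
    """Return an ISO 8601 UTC timestamp, or None if the value cannot be parsed."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def clean_records(records):
    """Drop duplicates and incomplete rows, then standardize text and dates."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("channel"), rec.get("message_id"))
        if None in key or key in seen:
            continue  # skip rows without an identity and repeated messages
        seen.add(key)
        text = re.sub(r"\s+", " ", rec.get("text") or "").strip()
        cleaned.append({**rec, "text": text, "date": normalize_date(rec.get("date"))})
    return cleaned
```

With pandas, the same de-duplication is typically written as `df.drop_duplicates(subset=["channel", "message_id"])`.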

Output

  • Cleaned data stored in PostgreSQL (raw_medical_data table).
  • DBT models for staging (stg_medical_data) and analytics (fact_messages).
  • Logs stored in data_cleaning.log.

Setup and Installation

Clone the Repository

git clone https://github.com/Azazh/Medical-Data-Warehouse.git
cd Medical-Data-Warehouse

Install Python Dependencies

pip install -r requirements.txt

Set Up PostgreSQL Database

  1. Create a database named medical_dw.
  2. Update the connection string in data_cleaning.py and profiles.yml.
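As an alternative to editing the connection string in place, it can be assembled from environment variables so that the password stays out of version control. This is a sketch under assumptions: the function name `build_postgres_url` and the defaults are illustrative, and the variable names follow the standard libpq conventions (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`).

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Build a postgresql:// URL from libpq-style environment variables."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "medical_dw")  # database name from this README
    auth = quote_plus(user)
    if password:
        # Percent-encode so characters such as "@" or "/" do not break the URL
        auth += ":" + quote_plus(password)
    return f"postgresql://{auth}@{host}:{port}/{database}"
```

The resulting string can be passed to `sqlalchemy.create_engine(...)` in the cleaning script.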

Set Up Telegram API

  1. Obtain API_ID and API_HASH from my.telegram.org.
  2. Update the credentials in telegram_scraper.py.
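If you prefer not to hard-code the credentials in the script, one option is to read them from the environment. The helper below is a hypothetical sketch; it assumes the variables are named `API_ID` and `API_HASH`, matching the names used above.

```python
import os


def load_telegram_credentials(env=os.environ):
    """Return (api_id, api_hash), failing early if either variable is missing."""
    try:
        api_id = int(env["API_ID"])  # Telegram issues the API ID as an integer
        api_hash = env["API_HASH"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None
    return api_id, api_hash
```

The returned pair can then be passed to Telethon, for example `TelegramClient("session_name", api_id, api_hash)`, where the session name is your choice.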

Install DBT

pip install dbt-postgres

Usage

Run the Scraping Script

python telegram_scraper.py

Run the Data Cleaning Script

python data_cleaning.py

Run DBT Transformations

cd medical_transform
dbt run --select marts
dbt test
dbt docs generate
dbt docs serve

Challenges and Solutions

| Challenge | Solution |
| --- | --- |
| Rate limits on Telegram API | Implemented rate limiting and retries in the scraping script. |
| Inconsistent data formats in Telegram messages | Standardized text and date formats during cleaning. |
| Duplicate messages in scraped data | Removed duplicates based on message_id and channel. |
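The retry handling for rate limits can be sketched as a small wrapper with exponential backoff. This is an illustration only; the function name `call_with_retries` and its parameters are assumptions, and the scraping script may handle this differently. Telethon reports rate limits through `FloodWaitError`, which exposes the required wait in a `seconds` attribute, so the sketch honors that value when present. Catching every exception is a simplification; a real script would probably retry only on specific error types.

```python
import asyncio


async def call_with_retries(make_call, max_attempts=5, base_delay=1.0, sleep=asyncio.sleep):
    """Await make_call(), retrying failed attempts with increasing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Prefer the server-provided wait (e.g. FloodWaitError.seconds);
            # otherwise back off exponentially: base_delay, 2x, 4x, ...
            delay = getattr(exc, "seconds", None) or base_delay * 2 ** (attempt - 1)
            await sleep(delay)
```

A call such as `await call_with_retries(lambda: client.get_entity(channel))` would then retry transient failures instead of aborting the whole run.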

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Submit a pull request with a detailed description of your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or feedback, please contact:

About

Kara Solutions is building a data warehouse for Ethiopian medical business insights by scraping Telegram channels. Using Python (Telethon) and DBT, we extract, clean, transform, and store data efficiently. AI-powered analysis and object detection enable better decision-making in the medical sector. 🚀
