Sanalyz scripts

Tooling for the Sanalyz project

ETL

An ETL pipeline that extracts data from datasets, transforms it, and loads it into the Sanalyz API.

Install the dependencies using:

pip install -r requirements.txt

First, download the supported datasets into a folder. Currently, the following datasets are supported:

Ensure that the datasets are in CSV format and placed in a folder. The folder structure should look like this:

data/
├── monkeypox.csv
└── covid19.csv

The names of the files are not important, as long as the files are in CSV format.
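Because only the format matters, a folder scan can key on the file extension alone. The following is a minimal sketch of that idea, not the pipeline's actual code; the function name `find_csv_files` is hypothetical.

```python
from pathlib import Path


def find_csv_files(folder):
    """Return every .csv file directly inside `folder`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        # Match on the extension only, ignoring case and skipping directories.
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```

With the layout above, `find_csv_files("data")` would return the paths of `covid19.csv` and `monkeypox.csv`, whatever the files happen to be called.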

If you want to add support for new datasets, see the Adding Support for New Datasets section.

When your datasets are ready, run the ETL pipeline with the following command:

python etl <path_to_datasets_folder> <api_base_url>

For example:

python etl data https://api.sanalyz.com

This extracts the data from the datasets, cleans it, transforms it into a form that can be loaded, and then loads it into the Sanalyz API.
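The four stages can be pictured with a small standard-library sketch. It is illustrative only and does not reproduce the project's code: the function names, the payload shape, and the `/records` endpoint are all assumptions, since this README does not describe the Sanalyz API's routes.

```python
import csv
import io
import json
import urllib.request


def extract(csv_text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(rows):
    """Clean: strip whitespace and drop rows whose values are all empty."""
    cleaned = []
    for row in rows:
        stripped = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus cells beyond the header
        }
        if any(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def transform(rows, source):
    """Transform: wrap each row in a payload (hypothetical shape)."""
    return [{"source": source, "data": row} for row in rows]


def post_json(url, payload):
    """Send one payload as a JSON POST request and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def load(payloads, api_base_url, send=post_json):
    """Load: send every payload to a hypothetical /records endpoint."""
    url = api_base_url.rstrip("/") + "/records"
    return [send(url, payload) for payload in payloads]
```

Passing the sender as the `send` parameter keeps the load stage testable without a network connection: a stub function can stand in for `post_json` and record what would have been sent.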

To add support for a new dataset:

Once the new extractor is added, the ETL pipeline will automatically detect and use it.
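How the pipeline detects extractors is not described here. One common way to get this behaviour is a self-registering base class, sketched below under stated assumptions: the class names, the `required_columns` attribute, and the column names are all made up for illustration, and matching on CSV headers is only a guess suggested by the fact that file names do not matter.

```python
class Extractor:
    """Hypothetical base class; subclasses register themselves on definition."""

    registry = []
    required_columns = frozenset()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Defining a subclass is enough to make it discoverable.
        Extractor.registry.append(cls)

    @classmethod
    def can_handle(cls, columns):
        """Return True if the CSV header contains every required column."""
        return cls.required_columns <= set(columns)


class Covid19Extractor(Extractor):
    # Illustrative column names, not taken from any real dataset.
    required_columns = frozenset({"date", "country", "new_cases"})


class MonkeypoxExtractor(Extractor):
    # Illustrative column names, not taken from any real dataset.
    required_columns = frozenset({"date", "country", "confirmed"})


def pick_extractor(columns):
    """Return the first registered extractor that matches, or None."""
    for extractor in Extractor.registry:
        if extractor.can_handle(columns):
            return extractor
    return None
```

In a design like this, supporting a new dataset means writing one more subclass; no central list of extractors has to be edited.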