Contributing to malariagen-data-python

Thanks for your interest in contributing to this project! This guide will help you get started.

About the project

This package provides Python tools for accessing and analyzing genomic data from MalariaGEN, a global research network studying the genomic epidemiology of malaria and its vectors. It provides access to data on Anopheles mosquito species and Plasmodium malaria parasites, with functionality for variant analysis, haplotype clustering, population genetics, and visualization.

Setting up your development environment

Prerequisites

You'll need:

- pipx for installing Python tools
- git for version control

Both can be installed with your distribution's package manager, or with Homebrew on macOS.

Initial setup

Fork and clone the repository

Fork the repository on GitHub, then clone your fork:

```
git clone git@github.com:[your-username]/malariagen-data-python.git
cd malariagen-data-python
```

Add the upstream remote

```
git remote add upstream https://github.com/malariagen/malariagen-data-python.git
```

Install Poetry
```
pipx install poetry
```
Install Python 3.12

Python 3.12 is tested in the CI system and is the recommended version.
```
poetry python install 3.12
```
Install the project and its dependencies
```
poetry env use 3.12
poetry install --with dev,test,docs
```
This installs the runtime dependencies along with the dev, test, and docs dependency groups. If you only need to run tests, `poetry install --with test` is sufficient.

Recommended: Use `poetry run` to run commands inside the virtual environment:
```
poetry run pytest
poetry run python script.py
```
Optional: If you prefer an interactive shell session, install the shell plugin first:
```
poetry self add poetry-plugin-shell
```
Then activate the environment with:
```
poetry shell
```
After activation, commands run directly inside the virtual environment:
```
pytest
python script.py
```
Install pre-commit hooks
```
pipx install pre-commit
pre-commit install
```
Pre-commit hooks will automatically run ruff (linter and formatter) on your changes before each commit.

Development workflow

Creating a new feature or fix

Sync with upstream

```
git checkout master
git pull upstream master
```

Create a feature branch

If an issue does not already exist for your change, create one first. Then create a branch using the convention GH{issue number}-{short description}:
```
git checkout -b GH123-fix-broken-filter
# or
git checkout -b GH456-add-new-analysis
```
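The naming convention can also be checked mechanically. The following is a minimal sketch, not part of the project tooling: `BRANCH_PATTERN` and `parse_branch_name` are hypothetical names, and the assumption that descriptions are lowercase, hyphen-separated words is inferred only from the two example branch names above.

```python
import re
from typing import Optional, Tuple

# Pattern for the GH{issue number}-{short description} convention.
# Assumption: the description is lowercase words or digits joined by hyphens.
BRANCH_PATTERN = re.compile(
    r"GH(?P<issue>\d+)-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"
)


def parse_branch_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (issue number, description), or None if the name does not match."""
    match = BRANCH_PATTERN.fullmatch(name)
    if match is None:
        return None
    return int(match["issue"]), match["description"]


print(parse_branch_name("GH123-fix-broken-filter"))  # (123, 'fix-broken-filter')
print(parse_branch_name("fix-broken-filter"))  # None
```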
Make your changes

Write your code, add tests, and update documentation as needed.

Run tests locally

Fast unit tests using simulated data (no external data access):
```
poetry run pytest -v tests --ignore tests/integration
```
To run integration tests which read data from GCS, you'll need to request access to MalariaGEN data on GCS.

Once access has been granted, install the Google Cloud CLI. For example, on Linux:
```
./install_gcloud.sh
```
You'll then need to obtain application-default credentials, e.g.:
```
./google-cloud-sdk/bin/gcloud auth application-default login
```
Once this is done, you can run integration tests:
```
poetry run pytest -v tests/integration
```
Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.

Run typechecking

Run static typechecking with mypy:

```
poetry run mypy malariagen_data tests --ignore-missing-imports
```

Check code quality

The pre-commit hooks will run automatically, but you can also run them manually:
```
pre-commit run --all-files
```

Code style

We use ruff for both linting and formatting. The configuration is in pyproject.toml. Key points:

- Line length: 88 characters (the Black default)
- Follow PEP 8 conventions
- Use type hints where appropriate
- Write clear docstrings (we use numpydoc format)
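
As an illustration of these points, here is a sketch of a function with type hints and a numpydoc-style docstring. The function `alt_allele_frequency` is hypothetical and does not exist in the package. It only demonstrates the layout.

```python
from typing import Sequence


def alt_allele_frequency(
    alt_counts: Sequence[int],
    total_counts: Sequence[int],
) -> float:
    """Compute the pooled alternate allele frequency across samples.

    Parameters
    ----------
    alt_counts : Sequence[int]
        Number of alternate alleles observed in each sample.
    total_counts : Sequence[int]
        Number of alleles called in each sample.

    Returns
    -------
    float
        Alternate allele count divided by total allele count.

    Raises
    ------
    ValueError
        If the inputs differ in length or no alleles were called.

    Examples
    --------
    >>> alt_allele_frequency([1, 0, 2], [2, 2, 2])
    0.5
    """
    if len(alt_counts) != len(total_counts):
        raise ValueError("alt_counts and total_counts must have the same length")
    total = sum(total_counts)
    if total == 0:
        raise ValueError("no alleles were called")
    return sum(alt_counts) / total
```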

The pre-commit hooks will handle most formatting automatically. If you want to run ruff manually:

```
ruff check .
ruff format .
```

Testing

- Write tests for new functionality: Add unit tests in the tests/ directory
- Test coverage: Aim to maintain or improve test coverage
- Fast tests: Unit tests should use simulated data when possible (see tests/anoph/)
- Integration tests: Tests requiring GCS data access are slower and run separately
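
A unit test on simulated data might look like the following sketch. Both `simulate_genotypes` and `count_alt_alleles` are hypothetical stand-ins rather than project code, and the sketch uses only the standard library. The actual tests under tests/anoph/ may be organized differently.

```python
import random
from typing import List, Tuple

Genotypes = List[List[Tuple[int, int]]]  # sites x samples x (allele, allele)


def simulate_genotypes(n_sites: int, n_samples: int, seed: int) -> Genotypes:
    """Simulate biallelic diploid genotypes with a fixed random seed."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    return [
        [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_samples)]
        for _ in range(n_sites)
    ]


def count_alt_alleles(genotypes: Genotypes) -> List[int]:
    """Hypothetical function under test: alternate allele count per site."""
    return [sum(a + b for a, b in site) for site in genotypes]


def test_count_alt_alleles() -> None:
    genotypes = simulate_genotypes(n_sites=50, n_samples=10, seed=42)
    counts = count_alt_alleles(genotypes)
    # One count per site, each bounded by the number of alleles sampled.
    assert len(counts) == 50
    assert all(0 <= c <= 2 * 10 for c in counts)
    # A hand-built case pins down the exact expected values.
    assert count_alt_alleles([[(0, 0), (0, 1)], [(1, 1), (1, 1)]]) == [1, 4]
```

Because the data is generated in memory, a test like this needs no GCS access and runs with the fast unit test suite.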

Run dynamic type checking with:

```
poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.anoph
```
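
Dynamic type checking compares actual argument values against the type hints while the tests run, which catches mismatches that would otherwise pass silently. The sketch below is not typeguard and not project code. It is a hand-rolled decorator, with the hypothetical names `check_types` and `sample_label`, that handles only plain class annotations and only arguments, to show the idea.

```python
import functools
import inspect
from typing import get_type_hints


def check_types(func):
    """Check plain class annotations on arguments at call time (illustration)."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Skip parameters without a hint or with a non-class hint.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def sample_label(prefix: str, index: int) -> str:
    """Hypothetical helper: without the check, index="7" would be accepted."""
    return f"{prefix}{index}"


print(sample_label("AG", 7))  # AG7
try:
    sample_label("AG", "7")
except TypeError as error:
    print(error)  # index must be int, got str
```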

Documentation

- Update docstrings if you modify public APIs
- Documentation is built using Sphinx with the pydata theme
- API docs are auto-generated from docstrings
- Follow the numpydoc style guide

Submitting your contribution

Before opening a pull request

- Tests pass locally
- Pre-commit hooks pass (or run `pre-commit run --all-files`)
- Code is well-documented
- Commit messages are clear and descriptive

Opening a pull request

Push your branch
```
git push origin your-branch-name
```
Create the pull request
- Go to the repository on GitHub
- Click "Pull requests" → "New pull request"
- Select your fork and branch
- Write a clear title and description

Pull request description should include:
- What problem does this solve?
- How does it solve it?
- Any relevant issue numbers (e.g., "Fixes #123")
- Testing done
- Any breaking changes or migration notes

Review process

- PRs require approval from a project maintainer
- CI tests must pass (pytest on Python 3.10 with NumPy 1.26.4)
- Address review feedback by pushing new commits to your branch
- Once approved, a maintainer will merge your PR

Communication

- Issues: Use GitHub Issues for bug reports and feature requests
- Discussions: For questions and general discussion, use GitHub Discussions
- Pull requests: Use PR comments for code review discussions
- Email: For data access questions, contact support@malariagen.net

Finding something to work on

- Look for issues labeled `good first issue`
- Check for issues labeled `help wanted`
- Improve documentation or add examples
- Increase test coverage

Questions?

If you're unsure about anything, feel free to:

- Open an issue to ask
- Start a discussion on GitHub Discussions
- Ask in your pull request

We appreciate your contributions and will do our best to help you succeed!

License

By contributing to this project, you agree that your contributions will be licensed under the MIT License.
