Thanks for your interest in contributing to this project! This guide will help you get started.
This package provides Python tools for accessing and analyzing genomic data from MalariaGEN, a global research network studying the genomic epidemiology of malaria and its vectors. It provides access to data on Anopheles mosquito species and Plasmodium malaria parasites, with functionality for variant analysis, haplotype clustering, population genetics, and visualization.
You'll need:
Both of these can be installed using your distribution's package manager or Homebrew on Mac.
- Fork and clone the repository
Fork the repository on GitHub, then clone your fork:
git clone git@github.com:[your-username]/malariagen-data-python.git
cd malariagen-data-python
- Add the upstream remote
git remote add upstream https://github.com/malariagen/malariagen-data-python.git
- Install Poetry
pipx install poetry
- Install Python 3.12
Python 3.12 is tested in the CI system and is the recommended version.
poetry python install 3.12
- Install the project and its dependencies
poetry env use 3.12
poetry install --with dev,test,docs
This installs the runtime dependencies along with the dev, test, and docs dependency groups. If you only need to run tests, poetry install --with test is sufficient.
Recommended: Use poetry run to run commands inside the virtual environment:
poetry run pytest
poetry run python script.py
Optional: If you prefer an interactive shell session, install the shell plugin first:
poetry self add poetry-plugin-shell
Then activate the environment with:
poetry shell
After activation, commands run directly inside the virtual environment:
pytest
python script.py
- Install pre-commit hooks
pipx install pre-commit
pre-commit install
Pre-commit hooks will automatically run ruff (linter and formatter) on your changes before each commit.
- Sync with upstream
git checkout master
git pull upstream master
- Create a feature branch
If an issue does not already exist for your change, create one first. Then create a branch using the convention GH{issue number}-{short description}:
git checkout -b GH123-fix-broken-filter
# or
git checkout -b GH456-add-new-analysis
- Make your changes
Write your code, add tests, and update documentation as needed.
- Run tests locally
Fast unit tests using simulated data (no external data access):
poetry run pytest -v tests --ignore tests/integration
To run integration tests which read data from GCS, you'll need to request access to MalariaGEN data on GCS.
Once access has been granted, install the Google Cloud CLI. E.g., if on Linux:
./install_gcloud.sh
You'll then need to obtain application-default credentials, e.g.:
./google-cloud-sdk/bin/gcloud auth application-default login
Once this is done, you can run integration tests:
poetry run pytest -v tests/integration
Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.
- Run typechecking
Run static typechecking with mypy:
poetry run mypy malariagen_data tests --ignore-missing-imports
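As a rough illustration of what static typechecking catches, here is a minimal sketch. The functions parse_contig_length and length_in_kb are hypothetical and are not part of malariagen_data. The return annotation Optional[int] tells mypy that the value may be None, so mypy would report an error if the None check were removed and the value used directly in arithmetic.

```python
from typing import Optional


def parse_contig_length(text: str) -> Optional[int]:
    """Return the length as an int, or None if the text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def length_in_kb(text: str) -> float:
    """Convert a length given as text into kilobases."""
    length = parse_contig_length(text)
    # Without this check, mypy would flag `length / 1000` below,
    # because `length` may be None at that point.
    if length is None:
        raise ValueError(f"not a valid length: {text!r}")
    return length / 1000
```

After the explicit check, mypy narrows the type of length to int, so the division typechecks.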
- Check code quality
The pre-commit hooks will run automatically, but you can also run them manually:
pre-commit run --all-files
We use ruff for both linting and formatting. The configuration is in pyproject.toml. Key points:
- Line length: 88 characters (black default)
- Follow PEP 8 conventions
- Use type hints where appropriate
- Write clear docstrings (we use numpydoc format)
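To make the points above concrete, here is a small sketch of a function with type hints and a numpydoc-style docstring. The function alt_allele_frequency is a hypothetical example written for this guide, not an existing function in the package, and it uses plain Python lists only to stay self-contained.

```python
from typing import Sequence


def alt_allele_frequency(
    alt_counts: Sequence[int], total_counts: Sequence[int]
) -> list[float]:
    """Compute the alternate allele frequency at each site.

    Parameters
    ----------
    alt_counts : sequence of int
        Number of alternate allele observations at each site.
    total_counts : sequence of int
        Total number of allele observations at each site.

    Returns
    -------
    list of float
        Alternate allele frequency at each site. Sites with no
        observations yield NaN.

    Raises
    ------
    ValueError
        If the two inputs have different lengths.

    """
    if len(alt_counts) != len(total_counts):
        raise ValueError("alt_counts and total_counts must have the same length")
    return [
        alt / total if total > 0 else float("nan")
        for alt, total in zip(alt_counts, total_counts)
    ]
```

The section names (Parameters, Returns, Raises) and their underlines follow the numpydoc layout, and every line stays within the 88-character limit.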
The pre-commit hooks will handle most formatting automatically. If you want to run ruff manually:
ruff check .
ruff format .
- Write tests for new functionality: Add unit tests in the tests/ directory
- Test coverage: Aim to maintain or improve test coverage
- Fast tests: Unit tests should use simulated data when possible (see tests/anoph/)
- Integration tests: Tests requiring GCS data access are slower and run separately
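The idea of a fast unit test on simulated data can be sketched as follows. This is only an illustration: simulate_genotypes and count_alt_alleles are hypothetical names, and the simulators actually used under tests/anoph/ may be structured quite differently. The point is that the data is generated locally from a fixed seed, so the test needs no GCS access and gives the same result on every run.

```python
import random


def simulate_genotypes(
    n_sites: int, n_samples: int, seed: int = 42
) -> list[list[int]]:
    """Simulate diploid genotypes as alternate allele counts (0, 1 or 2)."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    return [[rng.randint(0, 2) for _ in range(n_samples)] for _ in range(n_sites)]


def count_alt_alleles(genotypes: list[list[int]]) -> list[int]:
    """Hypothetical function under test: alternate allele count per site."""
    return [sum(site) for site in genotypes]


def test_count_alt_alleles_on_simulated_data():
    n_sites, n_samples = 5, 10
    genotypes = simulate_genotypes(n_sites=n_sites, n_samples=n_samples)
    counts = count_alt_alleles(genotypes)
    # One count per site, each bounded by two alleles per diploid sample.
    assert len(counts) == n_sites
    assert all(0 <= count <= 2 * n_samples for count in counts)


def test_simulation_is_reproducible():
    assert simulate_genotypes(3, 4, seed=1) == simulate_genotypes(3, 4, seed=1)
```

Pytest collects the two test_ functions automatically when the file is placed under tests/ with a name starting with test_.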
Run dynamic type checking with:
poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.anoph
- Update docstrings if you modify public APIs
- Documentation is built using Sphinx with the pydata theme
- API docs are auto-generated from docstrings
- Follow the numpydoc style guide
- Tests pass locally
- Pre-commit hooks pass (or run pre-commit run --all-files)
- Code is well-documented
- Commit messages are clear and descriptive
- Push your branch
git push origin your-branch-name
- Create the pull request
- Go to the repository on GitHub
- Click "Pull requests" → "New pull request"
- Select your fork and branch
- Write a clear title and description
- Pull request description should include:
- What problem does this solve?
- How does it solve it?
- Any relevant issue numbers (e.g., "Fixes #123")
- Testing done
- Any breaking changes or migration notes
- PRs require approval from a project maintainer
- CI tests must pass (pytest on Python 3.10 with NumPy 1.26.4)
- Address review feedback by pushing new commits to your branch
- Once approved, a maintainer will merge your PR
- Issues: Use GitHub Issues for bug reports and feature requests
- Discussions: For questions and general discussion, use GitHub Discussions
- Pull requests: Use PR comments for code review discussions
- Email: For data access questions, contact support@malariagen.net
- Look for issues labeled "good first issue"
- Check for issues labeled "help wanted"
- Improve documentation or add examples
- Increase test coverage
If you're unsure about anything, feel free to:
- Open an issue to ask
- Start a discussion on GitHub Discussions
- Ask in your pull request
We appreciate your contributions and will do our best to help you succeed!
By contributing to this project, you agree that your contributions will be licensed under the MIT License.