
Contributing to DataBeak

Thank you for your interest in contributing to DataBeak! This guide will help you get started with contributing to the project.

Table of Contents

  • Code of Conduct
  • Getting Started
  • Development Setup
  • Development Workflow
  • Code Standards
  • Testing
  • Documentation
  • Submitting Changes
  • Release Process
  • Documentation Standards
  • Getting Help
  • Recognition

Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct:

  • Be respectful and inclusive
  • Welcome newcomers and help them get started
  • Focus on constructive criticism
  • Accept feedback gracefully
  • Put the project's best interests first

Getting Started

  1. Fork the repository on GitHub

  2. Clone your fork locally:

    git clone https://github.com/YOUR_USERNAME/databeak.git
    cd databeak
  3. Add upstream remote:

    git remote add upstream https://github.com/jonpspri/databeak.git

Development Setup

Prerequisites

  • Python 3.12 or higher
  • Git
  • uv - Ultra-fast package manager (required)

Installation

Using uv (Required - 10-100x faster than pip)

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or on macOS: brew install uv
# Or with pip: pip install uv

# Install all dependencies (creates the virtual environment automatically)
uv sync --all-extras

# Install pre-commit hooks
uv run pre-commit install

# That's it! You're ready to go in seconds!

Alternative: Using pip (slower, not recommended)

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,test,docs]"

# Install pre-commit hooks
pre-commit install

Note: We standardize on uv for all development. It is significantly faster than pip and manages virtual environments and dependencies automatically.

Verify Installation

# All commands use uv
uv run databeak --help
uv run pytest -n auto
uv run ruff check
uv run mypy src/databeak/

Development Workflow

🚨 IMPORTANT: Direct commits to main are prohibited. Pre-commit hooks enforce branch-based development.

1. Create a Feature Branch

# Update main branch
git checkout main
git pull origin main

# Create descriptive feature branch
git checkout -b feature/your-feature-name
# OR use other prefixes: fix/, docs/, test/, refactor/, chore/

2. Make Your Changes

Follow these guidelines:

  • Branch-based development only - Never commit directly to main
  • One feature per PR - Keep pull requests focused
  • Write tests - All new features must have tests
  • Update docs - Update README and docstrings as needed
  • Follow style guide - Use Ruff and MyPy
  • Conventional commits - Use conventional commit format (enforced by hooks)
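
Conventional commit messages take the form `type: description` (e.g. `feat: Add new data filtering tool`). As a minimal sketch of what the hooks check — the actual enforcement comes from the pre-commit configuration, and this validator is hypothetical:

```python
import re

# Prefixes named in this guide: feat, fix, docs, test, refactor, chore.
# An optional scope like "fix(parser):" is also conventional.
CONVENTIONAL_RE = re.compile(r"^(feat|fix|docs|test|refactor|chore)(\([\w-]+\))?!?: .+")


def is_conventional(message: str) -> bool:
    """Return True if the first line follows the conventional commit format."""
    return bool(CONVENTIONAL_RE.match(message.splitlines()[0]))
```

For example, `is_conventional("feat: Add new data filtering tool")` is true, while a bare `"updated stuff"` is rejected.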

3. Run Quality Checks

# All commands use uv for speed and consistency
uv run ruff format # Format code with Ruff
uv run ruff check  # Lint with Ruff
uv run mypy src/databeak/        # Type check with MyPy

4. Test Your Changes

DataBeak uses a three-tier testing structure. Run tests appropriate to your changes:

# Run tests by category
uv run pytest -n auto tests/unit/          # Fast unit tests (run frequently)
uv run pytest -n auto tests/integration/   # Integration tests
uv run pytest -n auto tests/e2e/           # End-to-end tests
uv run pytest -n auto                      # All tests

# Run with coverage
uv run pytest -n auto --cov=src/databeak --cov-report=term-missing

# Run specific tests
uv run pytest -n auto tests/unit/servers/  # Test specific module
uv run pytest -k "test_filter"             # Run tests matching pattern (single test, no parallel)
uv run pytest -n auto -x                   # Stop on first failure

Testing Requirements:

  • New features must have unit tests in tests/unit/
  • Bug fixes must include regression tests
  • Maintain 80%+ code coverage
  • See Testing Guide for details

5. Create Pull Request

# Push feature branch
git push -u origin feature/your-feature-name

# Create PR using GitHub CLI
gh pr create --title "feat: Add new data filtering tool" --body "Description of changes..."

# OR create via GitHub web interface

Pull Request Requirements:

  • Descriptive title with conventional commit prefix (feat:, fix:, docs:, etc.)
  • Clear description explaining what changes and why
  • Link related issues with "Closes #123" syntax
  • All checks must pass (tests, linting, type checking)
  • Review and approval required before merge

Code Standards

Python Style

We use modern Python tooling for code quality:

  • Ruff for code formatting and linting (line length: 100; replaces flake8, isort, and more)
  • MyPy for type checking
  • Pre-commit for automated checks

Code Guidelines

  1. Type Hints: All functions must have type hints

    async def process_data(
        session_id: str, options: dict[str, Any], ctx: Context | None = None
    ) -> dict[str, Any]:
        """Process data with given options."""
        ...
  2. Docstrings: Use Google-style docstrings

    def analyze_data(df: pd.DataFrame) -> dict[str, Any]:
        """Analyze DataFrame and return statistics.
    
        Args:
            df: Input DataFrame to analyze
    
        Returns:
            Dictionary containing analysis results
    
        Raises:
            ValueError: If DataFrame is empty
        """
  3. Error Handling: Use specific exceptions

    if not session:
        raise ValueError(f"Session {session_id} not found")
  4. Async/Await: Use async for all tool functions

    @mcp.tool
    async def my_tool(param: str, ctx: Context) -> dict[str, Any]:
        result = await async_operation(param)
        return {"success": True, "data": result}
  5. Logging: Use appropriate log levels

    logger.debug("Processing row %d", row_num)
    logger.info("Session %s created", session_id)
    logger.warning("Large dataset: %d rows", row_count)
    logger.error("Failed to load file: %s", error)

File Structure

src/databeak/
├── __init__.py          # Package initialization
├── server.py            # Main server entry point
├── models/              # Data models and schemas
│   ├── __init__.py
│   ├── csv_session.py   # Session management
│   └── data_models.py   # Pydantic models
├── servers/             # MCP server implementations
│   ├── __init__.py
│   ├── io_operations.py
│   ├── transformations.py
│   ├── analytics.py
│   └── validation.py
├── resources/           # MCP resources
├── prompts/            # MCP prompts
└── utils/              # Utility functions

Testing

Test Structure

tests/
├── unit/               # Unit tests
│   ├── test_models.py
│   ├── test_transformations.py
│   └── test_analytics.py
├── integration/        # Integration tests
│   ├── test_server.py
│   └── test_workflows.py
├── benchmark/          # Performance tests
│   └── test_performance.py
└── fixtures/           # Test data
    └── sample_data.csv

Writing Tests

  1. Use pytest fixtures:

    import uuid

    import pytest

    @pytest.fixture
    async def session_with_data():
        """Create a session with sample data."""
        manager = get_session_manager()
        session_id = str(uuid.uuid4())
        session = manager.get_or_create_session(session_id)
        # ... setup
        yield session_id
        # ... cleanup
        manager.remove_session(session_id)
  2. Test async functions:

    @pytest.mark.asyncio
    async def test_filter_rows(session_with_data):
        result = await filter_rows(
            session_id=session_with_data, conditions=[{"column": "age", "operator": ">", "value": 18}]
        )
        assert result["success"]
        assert result["rows_after"] < result["rows_before"]
  3. Use parametrize for multiple cases:

    @pytest.mark.parametrize(
        "dtype,expected",
        [
            ("int", True),
            ("float", True),
            ("str", False),
        ],
    )
    def test_is_numeric(dtype, expected):
        assert is_numeric_dtype(dtype) == expected

Coverage Requirements

  • Minimum coverage: 80%
  • New features must have >90% coverage
  • Run coverage: uv run pytest -n auto --cov
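
The 80% floor can be enforced in `pyproject.toml` so that a coverage run fails below the threshold. A sketch using coverage.py's `fail_under` option (the project's actual configuration may differ):

```toml
[tool.coverage.report]
fail_under = 80
```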

Documentation

Docstring Standards

All public functions, classes, and modules must have docstrings:

"""Module description.

This module provides functionality for X, Y, and Z.
"""


class DataProcessor:
    """Process CSV data with various transformations.

    Attributes:
        session_id: Unique session identifier
        df: Pandas DataFrame containing the data
    """

    def transform(self, operation: str, **kwargs: Any) -> pd.DataFrame:
        """Apply transformation to data.

        Args:
            operation: Name of the transformation
            **kwargs: Additional parameters for the operation

        Returns:
            Transformed DataFrame

        Raises:
            ValueError: If operation is not supported

        Examples:
            >>> processor.transform("normalize", columns=["price"])
            >>> processor.transform("fill_missing", strategy="mean")
        """

Updating Documentation

  1. README.md: Update for new features or breaking changes
  2. API Docs: Ensure docstrings are complete
  3. Examples: Add examples for new features
  4. Changelog: Update CHANGELOG.md

Submitting Changes

Pull Request Process

  1. Update your branch:

    git fetch upstream
    git rebase upstream/main
  2. Push to your fork:

    git push origin feature/your-feature-name
  3. Create Pull Request:

    • Go to GitHub and create a PR from your fork
    • Use a clear, descriptive title
    • Fill out the PR template
    • Link related issues

PR Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Coverage maintained/improved

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Process

  1. Automated checks must pass
  2. Code review by at least one maintainer
  3. Address feedback promptly
  4. Squash commits if requested

Release Process

Version Numbering

We follow Semantic Versioning:

  • MAJOR: Breaking changes
  • MINOR: New features (backward compatible)
  • PATCH: Bug fixes
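
As an illustration of those rules, a version bump can be expressed as a small function (a sketch for clarity, not part of the release tooling):

```python
def bump(version: str, change: str) -> str:
    """Apply a semantic-version bump: 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        # Breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":
        # New backward-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change}")
```

So a bug fix takes `1.2.3` to `1.2.4`, while a breaking change takes it to `2.0.0`.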

Release Steps

  1. Update version in pyproject.toml
  2. Update CHANGELOG.md
  3. Create release PR
  4. Tag release after merge
  5. Publish to PyPI (automated)

Documentation Standards

Writing Style and Tone

DataBeak documentation maintains a professional, factual tone:

Avoid Self-Aggrandizing Language

Prohibited terms:

  • "exceptional", "perfect", "amazing", "outstanding", "superior"
  • "revolutionary", "cutting-edge", "world-class", "best-in-class"
  • "unparalleled", "state-of-the-art", "industry-leading", "premium", "elite"
  • "ultimate", "maximum", "optimal", "flawless"

Use factual alternatives:

  • "exceptional standards" → "strict standards"
  • "perfect compliance" → "clean compliance"
  • "comprehensive coverage" → "high coverage"
  • "API design excellence" → "clear API design"
  • "security best practices" → "defensive practices"

Measurable Claims Only

Acceptable (measurable):

  • "Zero ruff violations" (verifiable metric)
  • "100% mypy compliance" (measurable result)
  • "1100+ unit tests" (concrete count)

Prohibited (subjective claims):

  • "production quality" (marketing speak)
  • "advanced analytics" (vague superlative)
  • "sophisticated architecture" (self-congratulatory)

Professional Descriptors

Use measured, technical language:

  • "provides" not "delivers amazing"
  • "supports" not "offers comprehensive"
  • "implements" not "features advanced"
  • "handles" not "excels at"
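
A check against the word lists above can be automated. A minimal sketch (hypothetical helper, not an existing DataBeak script):

```python
# Subset of the "Avoid Self-Aggrandizing Language" list above
PROHIBITED = {
    "exceptional", "perfect", "amazing", "outstanding", "superior",
    "revolutionary", "cutting-edge", "world-class", "best-in-class",
    "unparalleled", "state-of-the-art", "industry-leading",
}


def find_prohibited(text: str) -> list[str]:
    """Return prohibited terms found in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(term for term in PROHIBITED if term in lowered)
```

Running it over a draft flags phrases like "revolutionary, world-class tool" while factual wording such as "provides strict standards" passes clean.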

Getting Help

  • Issues: Use GitHub Issues for bugs and features
  • Discussions: Use GitHub Discussions for questions
  • Discord: Join our Discord server (link in README)

Recognition

Contributors are recognized in:

  • AUTHORS.md file
  • Release notes
  • Project README

Thank you for contributing to DataBeak!