
Contributing to DataBeak

Thank you for your interest in contributing to DataBeak! This guide will help you get started with contributing to the project.

Table of Contents

  • Code of Conduct
  • Getting Started
  • Development Setup
  • Development Workflow
  • Code Standards
  • Testing
  • Documentation
  • Submitting Changes
  • Release Process
  • Documentation Standards
  • Getting Help
  • Recognition

Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct:

  • Be respectful and inclusive
  • Welcome newcomers and help them get started
  • Focus on constructive criticism
  • Accept feedback gracefully
  • Put the project's best interests first

Getting Started

  1. Fork the repository on GitHub

  2. Clone your fork locally:

    git clone https://github.com/YOUR_USERNAME/databeak.git
    cd databeak
  3. Add upstream remote:

    git remote add upstream https://github.com/jonpspri/databeak.git

Development Setup

Prerequisites

  • Python 3.12 or higher
  • Git
  • uv - Ultra-fast package manager (required)

Installation

Using uv (Required - 10-100x faster than pip)

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or on macOS: brew install uv
# Or with pip: pip install uv

# Install all dependencies (creates the virtual environment automatically)
uv sync --all-extras

# Install pre-commit hooks
uv run pre-commit install

# That's it! You're ready to go in seconds!

Alternative: Using pip (slower, not recommended)

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,test,docs]"

# Install pre-commit hooks
pre-commit install

Note: We standardize on uv for all development. It is significantly faster than pip and manages virtual environments and dependencies automatically.

Verify Installation

# All commands use uv
uv run databeak --help
uv run pytest -n auto
uv run ruff check
uv run mypy src/databeak/

Development Workflow

🚨 IMPORTANT: Direct commits to main are prohibited. Pre-commit hooks enforce branch-based development.

1. Create a Feature Branch

# Update main branch
git checkout main
git pull origin main

# Create descriptive feature branch
git checkout -b feature/your-feature-name
# OR use other prefixes: fix/, docs/, test/, refactor/, chore/

2. Make Your Changes

Follow these guidelines:

  • Branch-based development only - Never commit directly to main
  • One feature per PR - Keep pull requests focused
  • Write tests - All new features must have tests
  • Update docs - Update README and docstrings as needed
  • Follow style guide - Use Ruff and MyPy
  • Conventional commits - Use conventional commit format (enforced by hooks)
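
Conventional commit messages take the form `type: description` (e.g. `feat: Add new data filtering tool`). As a minimal sketch of what the hooks check — the actual enforcement comes from the pre-commit configuration, and this validator is hypothetical:

```python
import re

# Prefixes named in this guide: feat, fix, docs, test, refactor, chore.
# An optional scope like "fix(parser):" is also conventional.
CONVENTIONAL_RE = re.compile(r"^(feat|fix|docs|test|refactor|chore)(\([\w-]+\))?!?: .+")


def is_conventional(message: str) -> bool:
    """Return True if the first line follows the conventional commit format."""
    return bool(CONVENTIONAL_RE.match(message.splitlines()[0]))
```

For example, `is_conventional("feat: Add new data filtering tool")` is true, while a bare `"updated stuff"` is rejected.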

3. Run Quality Checks

# All commands use uv for speed and consistency
uv run ruff format # Format code with Ruff
uv run ruff check  # Lint with Ruff
uv run mypy src/databeak/        # Type check with MyPy

4. Test Your Changes

DataBeak uses a three-tier testing structure. Run tests appropriate to your changes:

# Run tests by category
uv run pytest -n auto tests/unit/          # Fast unit tests (run frequently)
uv run pytest -n auto tests/integration/   # Integration tests
uv run pytest -n auto tests/e2e/           # End-to-end tests
uv run pytest -n auto                      # All tests

# Run with coverage
uv run pytest -n auto --cov=src/databeak --cov-report=term-missing

# Run specific tests
uv run pytest -n auto tests/unit/servers/  # Test specific module
uv run pytest -k "test_filter"             # Run tests matching pattern (single test, no parallel)
uv run pytest -n auto -x                   # Stop on first failure

Testing Requirements:

  • New features must have unit tests in tests/unit/
  • Bug fixes must include regression tests
  • Maintain 80%+ code coverage
  • See Testing Guide for details

5. Create Pull Request

# Push feature branch
git push -u origin feature/your-feature-name

# Create PR using GitHub CLI
gh pr create --title "feat: Add new data filtering tool" --body "Description of changes..."

# OR create via GitHub web interface

Pull Request Requirements:

  • Descriptive title with conventional commit prefix (feat:, fix:, docs:, etc.)
  • Clear description explaining what changes and why
  • Link related issues with "Closes #123" syntax
  • All checks must pass (tests, linting, type checking)
  • Review and approval required before merge

Code Standards

Python Style

We use modern Python tooling for code quality:

  • Ruff for code formatting and linting (line length: 100; replaces flake8, isort, and more)
  • MyPy for type checking
  • Pre-commit for automated checks

Code Guidelines

  1. Type Hints: All functions must have type hints

    async def process_data(
        session_id: str, options: dict[str, Any], ctx: Context | None = None
    ) -> dict[str, Any]:
        """Process data with given options."""
        ...
  2. Docstrings: Use Google-style docstrings

    def analyze_data(df: pd.DataFrame) -> dict[str, Any]:
        """Analyze DataFrame and return statistics.
    
        Args:
            df: Input DataFrame to analyze
    
        Returns:
            Dictionary containing analysis results
    
        Raises:
            ValueError: If DataFrame is empty
        """
  3. Error Handling: Use specific exceptions

    if not session:
        raise ValueError(f"Session {session_id} not found")
  4. Async/Await: Use async for all tool functions

    @mcp.tool
    async def my_tool(param: str, ctx: Context) -> dict[str, Any]:
        result = await async_operation(param)
        return {"success": True, "data": result}
  5. Logging: Use appropriate log levels

    logger.debug("Processing row %d", row_num)
    logger.info("Session %s created", session_id)
    logger.warning("Large dataset: %d rows", row_count)
    logger.error("Failed to load file: %s", error)

File Structure

src/databeak/
├── __init__.py          # Package initialization
├── server.py            # Main server entry point
├── models/              # Data models and schemas
│   ├── __init__.py
│   ├── csv_session.py   # Session management
│   └── data_models.py   # Pydantic models
├── servers/             # MCP server implementations
│   ├── __init__.py
│   ├── io_operations.py
│   ├── transformations.py
│   ├── analytics.py
│   └── validation.py
├── resources/           # MCP resources
├── prompts/            # MCP prompts
└── utils/              # Utility functions

Testing

Test Structure

tests/
├── unit/               # Unit tests
│   ├── test_models.py
│   ├── test_transformations.py
│   └── test_analytics.py
├── integration/        # Integration tests
│   ├── test_server.py
│   └── test_workflows.py
├── benchmark/          # Performance tests
│   └── test_performance.py
└── fixtures/           # Test data
    └── sample_data.csv

Writing Tests

  1. Use pytest fixtures:

    import uuid

    import pytest

    @pytest.fixture
    async def session_with_data():
        """Create a session with sample data."""
        manager = get_session_manager()
        session_id = str(uuid.uuid4())
        session = manager.get_or_create_session(session_id)
        # ... setup
        yield session_id
        # ... cleanup
        manager.remove_session(session_id)
  2. Test async functions:

    @pytest.mark.asyncio
    async def test_filter_rows(session_with_data):
        result = await filter_rows(
            session_id=session_with_data, conditions=[{"column": "age", "operator": ">", "value": 18}]
        )
        assert result["success"]
        assert result["rows_after"] < result["rows_before"]
  3. Use parametrize for multiple cases:

    @pytest.mark.parametrize(
        "dtype,expected",
        [
            ("int", True),
            ("float", True),
            ("str", False),
        ],
    )
    def test_is_numeric(dtype, expected):
        assert is_numeric_dtype(dtype) == expected

Coverage Requirements

  • Minimum coverage: 80%
  • New features must have >90% coverage
  • Run coverage: uv run pytest -n auto --cov
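
The 80% floor can be enforced in `pyproject.toml` so that a coverage run fails below the threshold. A sketch using coverage.py's `fail_under` option (the project's actual configuration may differ):

```toml
[tool.coverage.report]
fail_under = 80
```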

Documentation

Docstring Standards

All public functions, classes, and modules must have docstrings:

"""Module description.

This module provides functionality for X, Y, and Z.
"""


class DataProcessor:
    """Process CSV data with various transformations.

    Attributes:
        session_id: Unique session identifier
        df: Pandas DataFrame containing the data
    """

    def transform(self, operation: str, **kwargs: Any) -> pd.DataFrame:
        """Apply transformation to data.

        Args:
            operation: Name of the transformation
            **kwargs: Additional parameters for the operation

        Returns:
            Transformed DataFrame

        Raises:
            ValueError: If operation is not supported

        Examples:
            >>> processor.transform("normalize", columns=["price"])
            >>> processor.transform("fill_missing", strategy="mean")
        """

Updating Documentation

  1. README.md: Update for new features or breaking changes
  2. API Docs: Ensure docstrings are complete
  3. Examples: Add examples for new features
  4. Changelog: Update CHANGELOG.md

Submitting Changes

Pull Request Process

  1. Update your branch:

    git fetch upstream
    git rebase upstream/main
  2. Push to your fork:

    git push origin feature/your-feature-name
  3. Create Pull Request:

    • Go to GitHub and create a PR from your fork
    • Use a clear, descriptive title
    • Fill out the PR template
    • Link related issues

PR Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Tests pass locally
- [ ] Added new tests
- [ ] Coverage maintained/improved

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Process

  1. Automated checks must pass
  2. Code review by at least one maintainer
  3. Address feedback promptly
  4. Squash commits if requested

Release Process

Version Numbering

We follow Semantic Versioning:

  • MAJOR: Breaking changes
  • MINOR: New features (backward compatible)
  • PATCH: Bug fixes
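
As an illustration of those rules, a version bump can be expressed as a small function (a sketch for clarity, not part of the release tooling):

```python
def bump(version: str, change: str) -> str:
    """Apply a semantic-version bump: 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        # Breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":
        # New backward-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change}")
```

So a bug fix takes `1.2.3` to `1.2.4`, while a breaking change takes it to `2.0.0`.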

Release Steps

  1. Update version in pyproject.toml
  2. Update CHANGELOG.md
  3. Create release PR
  4. Tag release after merge
  5. Publish to PyPI (automated)

Documentation Standards

Writing Style and Tone

DataBeak documentation maintains a professional, factual tone:

Avoid Self-Aggrandizing Language

Prohibited terms:

  • "exceptional", "perfect", "amazing", "outstanding", "superior"
  • "revolutionary", "cutting-edge", "world-class", "best-in-class"
  • "unparalleled", "state-of-the-art", "industry-leading", "premium", "elite"
  • "ultimate", "maximum", "optimal", "flawless"

Use factual alternatives:

  • "exceptional standards" → "strict standards"
  • "perfect compliance" → "clean compliance"
  • "comprehensive coverage" → "high coverage"
  • "API design excellence" → "clear API design"
  • "security best practices" → "defensive practices"

Measurable Claims Only

Acceptable (measurable):

  • "Zero ruff violations" (verifiable metric)
  • "100% mypy compliance" (measurable result)
  • "1100+ unit tests" (concrete count)

Prohibited (subjective claims):

  • "production quality" (marketing speak)
  • "advanced analytics" (vague superlative)
  • "sophisticated architecture" (self-congratulatory)

Professional Descriptors

Use measured, technical language:

  • "provides" not "delivers amazing"
  • "supports" not "offers comprehensive"
  • "implements" not "features advanced"
  • "handles" not "excels at"
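
A check against the word lists above can be automated. A minimal sketch (hypothetical helper, not an existing DataBeak script):

```python
# Subset of the "Avoid Self-Aggrandizing Language" list above
PROHIBITED = {
    "exceptional", "perfect", "amazing", "outstanding", "superior",
    "revolutionary", "cutting-edge", "world-class", "best-in-class",
    "unparalleled", "state-of-the-art", "industry-leading",
}


def find_prohibited(text: str) -> list[str]:
    """Return prohibited terms found in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(term for term in PROHIBITED if term in lowered)
```

Running it over a draft flags phrases like "revolutionary, world-class tool" while factual wording such as "provides strict standards" passes clean.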

Getting Help

  • Issues: Use GitHub Issues for bugs and features
  • Discussions: Use GitHub Discussions for questions
  • Discord: Join our Discord server (link in README)

Recognition

Contributors are recognized in:

  • AUTHORS.md file
  • Release notes
  • Project README

Thank you for contributing to DataBeak!