Skip to content

Latest commit

 

History

History
469 lines (356 loc) · 16.4 KB

File metadata and controls

469 lines (356 loc) · 16.4 KB

Developer Activity Data Model Documentation

This document outlines the data model used to track and analyze developer activity across software ecosystems, with special focus on the Sui blockchain ecosystem. The model processes raw data from sources like GitHub and blockchain APIs through several layers of transformation and aggregation to produce insightful metrics.


Core Concepts

To understand the data model, it's essential to be familiar with these core concepts:

Ecosystems

Meaningful groupings of software repositories. An ecosystem can represent:

  • A single project (e.g., Uniswap)
  • A protocol (e.g., Ethereum)
  • A broad technology category (e.g., All Crypto)

Ecosystems serve as the primary unit of analysis.

Parent/Child Ecosystems

Ecosystems can be nested to create a hierarchy:

  • Parent: Broader category (e.g., Sui blockchain)
  • Child: Project belonging to parent (e.g., Walrus, Sui Name Service)

This allows activity to be aggregated upwards or analyzed in detail downwards.

Canonical Developer

A unique identifier for an individual developer. We deduplicate contributions across different names and email addresses to ensure that one person is counted only once, providing an accurate measure of the developer base.

Monthly Active Developer (MAD)

A canonical developer who has authored at least one commit within a rolling 28-day period. For data stability, a commit on day X counts the developer as active until day X+28.

Commit & Developer Counting

Our methodology is designed to credit only original code contributions:

  • Forks: We only count new code added in a fork. Merges from the original upstream repository are excluded.
  • Original Code: We use a fingerprinting technique to identify and eliminate copy-pasted code, ensuring that only the original author and ecosystem receive credit.
  • Branches: We analyze commits from all branches, not just the default main or master, providing a comprehensive view of development activity.

Developer Slices

To capture the diversity of contributions, we segment developers into several categories:

By Activity Level

  • Full-Time: Developers contributing on 10 or more days in a month
  • Part-Time: Developers contributing on fewer than 10 days in a month
  • One-Time: Developers who made only a single contribution within a rolling 3-month window

By Ecosystem Engagement

  • Exclusive: Contributes to only one blockchain ecosystem on a given day
  • Multichain: Contributes to multiple blockchain ecosystems on a given day

By Tenure

  • Newcomer (devs_0_1y): Active in crypto for less than one year
  • Emerging (devs_1_2y): Active for one to two years
  • Established (devs_2y_plus): Active for more than two years

Data Pipeline Layers

The data flows through four distinct layers, from raw source data to final aggregated metrics.

1. PostgreSQL Sources Layer

This layer contains the raw, unprocessed data ingested from our primary sources.

Core Tables:

  • canonical_developers: Master table of unique, deduplicated developers
  • ecosystems_child_ecosystems: Direct parent-child ecosystem relationships
  • raw_commits: Raw commit data from repositories

Metadata:

  • ecosystems_repos_ch: Links ecosystems to specific repositories
  • ecosystems_ch: Definitive list of all ecosystems
  • github_organizations: Information about GitHub organizations
  • repos: Information about repositories
  • ecosystems_github_organizations: Links ecosystems to GitHub organizations

2. Base Layer

Tables in this layer represent cleaned, standardized versions of the source data, preparing it for further processing.

Core Tables:

  • commits: Cleaned and standardized commit data
  • ecosystems_repos: Cleaned mapping of ecosystems to repos
  • canonical_developer_locations: Enriched developer data with location information
  • ecosystems_child_ecosystems: Cleaned parent-child ecosystem relationships

Metadata:

  • organizations: Cleaned GitHub organization data
  • repos: Cleaned repository data
  • ecosystems_organizations: Cleaned mapping of ecosystems to organizations

3. Stage Layer

This layer contains intermediate transformations and aggregations, where data from different base tables is joined and prepared for the final rollup.

Core Transformations:

  • ecosystems_repos_recursive (Critical!)

    • Resolves the parent-child hierarchy
    • For any given parent ecosystem, lists all repositories belonging to it and all of its descendants (children, grandchildren, etc.)
    • Essential for accurate ecosystem-wide queries
  • developer_activities

    • Links each commit from the commits table to a unique canonical_developer
    • Foundational log of "who did what"
    • Basis for all developer attribution
  • ecosystems_child_ecosystems_recursive

    • Flattens the ecosystem hierarchy
    • For any parent, lists all descendant ecosystems

Enrichment Tables:

  • developer_profiles: GitHub profile data (login, avatar, email, name)
  • repository_profiles: GitHub repository metadata (description, language, license, topics)
  • employees: Identified Mysten Labs/Sui employees

4. Rollup Layer

This final layer produces the aggregated tables used for analytics and visualization. These tables synthesize all the processed data into meaningful metrics.

Primary Analytics Tables:

  • eco_mads (Primary output table!)

    • Synthesizes 28-day activity, tenure, and rank data
    • Produces final Monthly Active Developer counts
    • Segmented by developer slices (Full-Time, Part-Time, Newcomer, etc.)
    • Powers most dashboards and analyses
  • eco_mads_excl_employees (Materialized view)

    • Same as eco_mads but excluding employees
    • 80-90% faster queries when filtering employees
    • Refreshed daily

Supporting Analytics Tables:

  • eco_developer_activities: Maps every developer commit to every ecosystem it belongs to (including all parent ecosystems)
  • eco_developer_28d_activities: Aggregates developer activity in 28-day rolling windows
  • repo_developer_28d_activities: Same as above but at repository level
  • eco_developer_tenures: Developer tenure categories (Newcomer, Emerging, Established)
  • eco_developer_contribution_ranks: Developer activity classification (full-time, part-time, one-time)

Application-Specific Tables

Beyond the core Electric Capital pipeline, the dashboard includes additional tracking capabilities:

Event Tracking

event_data

  • Stores hackathon, workshop, and community event participants
  • Links events to developers and repositories
  • Schema:
    • id: Primary key
    • event_name: Event identifier (e.g., "Overflow 2024")
    • event_date: Date of the event
    • developer_github_url: Participant GitHub profile
    • repository_url: Project repository
    • email: Participant email
    • country, city: Location data
    • canonical_developer_id: Link to canonical developer (resolved)
    • repo_id: Link to repository (resolved)
    • created_at: Import timestamp
    • import_metadata: Import source information (JSONB)

event_csv_uploads

  • Stores raw CSV content from event imports
  • Enables version history tracking and re-processing
  • Schema:
    • id: Primary key
    • event_name: Event identifier
    • event_date: Event date
    • uploaded_at: Upload timestamp (allows multiple versions)
    • csv_content: Raw CSV file content (TEXT)
    • filename: Original CSV filename
    • row_count: Number of data rows in CSV

event_csv_row_status

  • Tracks resolution status for each CSV row during import
  • Links to event_csv_uploads via csv_upload_id
  • Schema:
    • id: Primary key
    • csv_upload_id: Foreign key to event_csv_uploads
    • row_index: Zero-based row index in CSV
    • resolved: Boolean indicating if row was successfully matched
    • canonical_developer_id: Resolved developer (if found)
    • developer_github_login: GitHub login for linking
    • repo_id: Resolved repository (if found)
    • repo_full_name: Repository full name for linking
    • resolution_notes: How the row was resolved (e.g., "GitHub profile", "Email match")
  • Unique constraint: (csv_upload_id, row_index)
  • Indexes for fast lookups by upload and resolution status

Event Attribution Logic:

Events use temporal proximity to determine which repositories and packages are related to an event:

  • New Projects: Repositories created within 14 days before to 28 days after the event date

    • No star/fork limits - purely temporal criteria
    • Represents projects built specifically for the event
  • Event Infrastructure: Repositories created 14-60 days before event

    • Templates, toolkits, and event-specific resources
  • Reference Projects: Repositories created 60-180 days before event

    • Projects participants likely referenced while building
  • Established Projects: Repositories created 180+ days before event

    • Existing ecosystem projects

Event Import and Resolution:

When importing event CSVs:

  1. Enhanced Column Scanning: Parser scans ALL columns for GitHub URLs and emails, not just named columns
  2. Multi-Step Resolution: Attempts to resolve developers via:
    • GitHub profile URLs
    • Repository contributor lists
    • Email matching against database
    • Username extraction from failed repository URLs
  3. Row Status Tracking: Each CSV row's resolution outcome is saved to event_csv_row_status
  4. Version History: Multiple CSV uploads per event are preserved with timestamps

Package Tracking

packages

  • Sui blockchain packages from Blockberry API
  • Top 1,000 most active packages by transaction count
  • Schema:
    • package_id: Unique package identifier (Sui address, primary key)
    • package_name: Human-readable package name
    • version: Package version number
    • modules_count: Number of Move modules
    • txs_count: Total historical transaction count
    • create_timestamp: Package creation timestamp (milliseconds)
    • publisher_address: Sui address that published the package
    • project_name: Associated project name
    • project_img: Project image URL
    • is_verified: Verification status
    • fetched_at: Data fetch timestamp
    • updated_at: Last update timestamp

packages_repos

  • Links packages to GitHub repositories
  • Schema:
    • package_id: Foreign key to packages
    • repo_id: Foreign key to repos
    • confidence_score: Match confidence (0.00-1.00)
    • mapping_source: How the link was created (e.g., 'mvr_name', 'name_match')
    • notes: Additional information
    • created_at: Link creation timestamp
  • Primary key: (package_id, repo_id)

Confidence Scores:

  • 1.00: High confidence (from Move Registry or exact name match)
  • <1.00: Fuzzy name matching with decreasing confidence

Current Coverage (as of import):

  • 1,000 packages from Blockberry
  • 438 packages linked to 309 repositories
  • 1,915 total package-repo mappings
  • Uses both Move Registry reverse resolution (mvr_name) and name matching (name_match)

Importing Package Data:

To populate or refresh package data, use the integrated npm scripts:

# One-time: Create database tables (packages, packages_repos, sui_addresses)
npm run create-sui-schema

# Full pipeline: Import packages, link to repos, generate report
npm run sui-packages-pipeline

# Or run steps individually:
npm run import-sui-packages  # Fetch from Blockberry API
npm run link-sui-packages    # Link to GitHub repos
npm run sui-package-report   # Generate ecosystem report

Data Sources:

  • Blockberry API: Top 1,000 packages by transaction count (requires BLOCKBERRY_API_KEY env var)
  • Move Registry API: Package-to-repository mappings
  • Fuzzy name matching: Additional repository links

Refresh Schedule: Weekly or monthly to keep package statistics current.

Implementation Files:

  • scripts/import-blockberry-packages.ts: Fetches and imports packages
  • scripts/link-sui-packages.ts: Links packages to repositories
  • scripts/show-sui-package-ecosystem.ts: Generates reports
  • lib/blockberry-client.ts: Blockberry API client
  • scripts/create-sui-packages-schema.sql: Database schema

sui_addresses

  • Publisher address statistics
  • Schema:
    • address: Sui blockchain address (primary key)
    • canonical_developer_id: Link to canonical developer (nullable)
    • first_seen_at: First package publication
    • last_seen_at: Most recent activity
    • packages_count: Number of packages published
    • total_txs_count: Sum of transactions across all packages
    • notes: Additional information

Data Relationships

Key Relationship Chains

1. Ecosystem Hierarchy → Repository Attribution

ecosystems_child_ecosystems_recursive
  → ecosystems_repos_recursive
    → repos

2. Developer Activity Attribution

commits
  → developer_activities (links to canonical_developer)
    → eco_developer_activities (links to ecosystems)
      → eco_developer_28d_activities (rolling windows)
        → eco_mads (final metrics)

3. Package → Repository → Developer

packages
  → packages_repos (via package_id)
    → repos (via repo_id)
      → developer_activities (via repo_id)
        → canonical_developers (via canonical_developer_id)
          → developer_profiles (enrichment)

4. Package → Developer (via blockchain)

packages
  → sui_addresses (via publisher_address)
    → canonical_developers (via canonical_developer_id)
      → developer_profiles (enrichment)

5. Event → Developer/Repository

event_data
  → canonical_developers (via canonical_developer_id)
  → repos (via repo_id)
  → packages (via repos and temporal proximity)

Important Data Considerations

  1. Ecosystem Hierarchy: Parent ecosystems aggregate activity from all child ecosystems via ecosystems_child_ecosystems_recursive

  2. Repository Attribution: A single repo can belong to multiple ecosystems. Always use ecosystems_repos_recursive to see the full hierarchy.

  3. Developer Deduplication: All commit attribution flows through canonical_developers to ensure unique counting

  4. 28-Day Windows: MAD calculations use rolling 28-day windows, not calendar months

  5. Package Linking: Multiple repositories may match a package name. Use confidence_score to filter for high-quality matches.

  6. Event Attribution: Uses temporal criteria (creation dates relative to event date) rather than popularity metrics


Query Patterns

Get all repos for an ecosystem (including descendants)

SELECT * FROM ecosystems_repos_recursive
WHERE ecosystem_id = ?

Get MADs for an ecosystem

SELECT day, all_devs, exclusive_devs, multichain_devs
FROM eco_mads
WHERE ecosystem_id = ?

Weekly snapshots (Mondays only)

WHERE EXTRACT(ISODOW FROM day) = 1

Get packages linked to repositories

SELECT p.*, r.name, r.link
FROM packages p
JOIN packages_repos pr ON p.package_id = pr.package_id
JOIN repos r ON pr.repo_id = r.id
WHERE pr.confidence_score = 1.0  -- High confidence matches only
ORDER BY p.txs_count DESC

Get event-related packages

WITH event_repos AS (
  SELECT DISTINCT ed.repo_id, ed.event_date
  FROM event_data ed
  WHERE ed.event_name = ?
    AND ed.repo_id IS NOT NULL
)
SELECT DISTINCT p.*
FROM event_repos er
JOIN packages_repos pr ON pr.repo_id = er.repo_id
JOIN packages p ON p.package_id = pr.package_id
WHERE pr.confidence_score = 1.0
  AND p.repo_created_at >= (er.event_date - INTERVAL '14 days')
  AND p.repo_created_at <= (er.event_date + INTERVAL '28 days')

Data Sources

Electric Capital Database

  • Core developer activity data
  • Ecosystem and repository metadata
  • Pre-calculated MAD metrics
  • Updated DAILY - requires running npm run update to refresh all downstream data

GitHub GraphQL API

  • Developer profiles (avatars, names, emails)
  • Repository metadata (descriptions, languages, licenses, topics)
  • Real-time enrichment

Blockberry API

  • Sui package data (top 1,000 by activity)
  • Transaction counts
  • Package metadata

Move Registry API

  • Package-to-repository reverse resolution
  • Project metadata
  • Verification status

CSV Imports

  • Event participant data
  • Custom developer/repository lists
  • Community-contributed data

Documentation References

For detailed examples and implementation guidance:

  • EXAMPLE_QUERIES.md: SQL query patterns for common analytics
  • CLAUDE.md: Project guidance for AI-assisted development
  • README.md: Setup instructions and feature documentation
  • MATERIALIZED_VIEWS.md: Performance optimization details