Developer Activity Data Model Documentation

This document outlines the data model used to track and analyze developer activity across software ecosystems, with special focus on the Sui blockchain ecosystem. The model processes raw data from sources like GitHub and blockchain APIs through several layers of transformation and aggregation to produce insightful metrics.

Core Concepts

To understand the data model, it's essential to be familiar with these core concepts:

Ecosystems

Meaningful groupings of software repositories. An ecosystem can represent:

A single project (e.g., Uniswap)
A protocol (e.g., Ethereum)
A broad technology category (e.g., All Crypto)

Ecosystems serve as the primary unit of analysis.

Parent/Child Ecosystems

Ecosystems can be nested to create a hierarchy:

Parent: Broader category (e.g., Sui blockchain)
Child: Project belonging to parent (e.g., Walrus, Sui Name Service)

This allows activity to be aggregated upwards or analyzed in detail downwards.

Canonical Developer

A unique identifier for an individual developer. We deduplicate contributions across different names and email addresses to ensure that one person is counted only once, providing an accurate measure of the developer base.

Monthly Active Developer (MAD)

A canonical developer who has authored at least one commit within a rolling 28-day period. For data stability, a commit on day X counts the developer as active until day X+28.

Commit & Developer Counting

Our methodology is designed to credit only original code contributions:

Forks: We only count new code added in a fork. Merges from the original upstream repository are excluded.
Original Code: We use a fingerprinting technique to identify and eliminate copy-pasted code, ensuring that only the original author and ecosystem receive credit.
Branches: We analyze commits from all branches, not just the default main or master, providing a comprehensive view of development activity.

Developer Slices

To capture the diversity of contributions, we segment developers into several categories:

By Activity Level

Full-Time: Developers contributing on 10 or more days in a month
Part-Time: Developers contributing on fewer than 10 days in a month
One-Time: Developers who made only a single contribution within a rolling 3-month window

By Ecosystem Engagement

Exclusive: Contributes to only one blockchain ecosystem on a given day
Multichain: Contributes to multiple blockchain ecosystems on a given day

By Tenure

Newcomer (devs_0_1y): Active in crypto for less than one year
Emerging (devs_1_2y): Active for one to two years
Established (devs_2y_plus): Active for more than two years

Data Pipeline Layers

The data flows through four distinct layers, from raw source data to final aggregated metrics.

1. PostgreSQL Sources Layer

This layer contains the raw, unprocessed data ingested from our primary sources.

Core Tables:

canonical_developers: Master table of unique, deduplicated developers
ecosystems_child_ecosystems: Direct parent-child ecosystem relationships
raw_commits: Raw commit data from repositories

Metadata:

ecosystems_repos_ch: Links ecosystems to specific repositories
ecosystems_ch: Definitive list of all ecosystems
github_organizations: Information about GitHub organizations
repos: Information about repositories
ecosystems_github_organizations: Links ecosystems to GitHub organizations

2. Base Layer

Tables in this layer represent cleaned, standardized versions of the source data, preparing it for further processing.

Core Tables:

commits: Cleaned and standardized commit data
ecosystems_repos: Cleaned mapping of ecosystems to repos
canonical_developer_locations: Enriched developer data with location information
ecosystems_child_ecosystems: Cleaned parent-child ecosystem relationships

Metadata:

organizations: Cleaned GitHub organization data
repos: Cleaned repository data
ecosystems_organizations: Cleaned mapping of ecosystems to organizations

3. Stage Layer

This layer contains intermediate transformations and aggregations, where data from different base tables is joined and prepared for the final rollup.

Core Transformations:

ecosystems_repos_recursive (Critical!)
- Resolves the parent-child hierarchy
- For any given parent ecosystem, lists all repositories belonging to it and all of its descendants (children, grandchildren, etc.)
- Essential for accurate ecosystem-wide queries
developer_activities
- Links each commit from the commits table to a unique canonical_developer
- Foundational log of "who did what"
- Basis for all developer attribution
ecosystems_child_ecosystems_recursive
- Flattens the ecosystem hierarchy
- For any parent, lists all descendant ecosystems

Enrichment Tables:

developer_profiles: GitHub profile data (login, avatar, email, name)
repository_profiles: GitHub repository metadata (description, language, license, topics)
employees: Identified Mysten Labs/Sui employees

4. Rollup Layer

This final layer produces the aggregated tables used for analytics and visualization. These tables synthesize all the processed data into meaningful metrics.

Primary Analytics Tables:

eco_mads (Primary output table!)
- Synthesizes 28-day activity, tenure, and rank data
- Produces final Monthly Active Developer counts
- Segmented by developer slices (Full-Time, Part-Time, Newcomer, etc.)
- Powers most dashboards and analyses
eco_mads_excl_employees (Materialized view)
- Same as eco_mads but excluding employees
- 80-90% faster queries when filtering employees
- Refreshed daily

Supporting Analytics Tables:

eco_developer_activities: Maps every developer commit to every ecosystem it belongs to (including all parent ecosystems)
eco_developer_28d_activities: Aggregates developer activity in 28-day rolling windows
repo_developer_28d_activities: Same as above but at repository level
eco_developer_tenures: Developer tenure categories (Newcomer, Emerging, Established)
eco_developer_contribution_ranks: Developer activity classification (full-time, part-time, one-time)

Application-Specific Tables

Beyond the core Electric Capital pipeline, the dashboard includes additional tracking capabilities:

Event Tracking

event_data

Stores hackathon, workshop, and community event participants
Links events to developers and repositories
Schema:
- id: Primary key
- event_name: Event identifier (e.g., "Overflow 2024")
- event_date: Date of the event
- developer_github_url: Participant GitHub profile
- repository_url: Project repository
- email: Participant email
- country, city: Location data
- canonical_developer_id: Link to canonical developer (resolved)
- repo_id: Link to repository (resolved)
- created_at: Import timestamp
- import_metadata: Import source information (JSONB)

event_csv_uploads

Stores raw CSV content from event imports
Enables version history tracking and re-processing
Schema:
- id: Primary key
- event_name: Event identifier
- event_date: Event date
- uploaded_at: Upload timestamp (allows multiple versions)
- csv_content: Raw CSV file content (TEXT)
- filename: Original CSV filename
- row_count: Number of data rows in CSV

event_csv_row_status

Tracks resolution status for each CSV row during import
Links to event_csv_uploads via csv_upload_id
Schema:
- id: Primary key
- csv_upload_id: Foreign key to event_csv_uploads
- row_index: Zero-based row index in CSV
- resolved: Boolean indicating if row was successfully matched
- canonical_developer_id: Resolved developer (if found)
- developer_github_login: GitHub login for linking
- repo_id: Resolved repository (if found)
- repo_full_name: Repository full name for linking
- resolution_notes: How the row was resolved (e.g., "GitHub profile", "Email match")
Unique constraint: (csv_upload_id, row_index)
Indexes for fast lookups by upload and resolution status

Event Attribution Logic:

Events use temporal proximity to determine which repositories and packages are related to an event:

New Projects: Repositories created within 14 days before to 28 days after the event date
- No star/fork limits - purely temporal criteria
- Represents projects built specifically for the event
Event Infrastructure: Repositories created 14-60 days before event
- Templates, toolkits, and event-specific resources
Reference Projects: Repositories created 60-180 days before event
- Projects participants likely referenced while building
Established Projects: Repositories created 180+ days before event
- Existing ecosystem projects

Event Import and Resolution:

When importing event CSVs:

Enhanced Column Scanning: Parser scans ALL columns for GitHub URLs and emails, not just named columns
Multi-Step Resolution: Attempts to resolve developers via:
- GitHub profile URLs
- Repository contributor lists
- Email matching against database
- Username extraction from failed repository URLs
Row Status Tracking: Each CSV row's resolution outcome is saved to event_csv_row_status
Version History: Multiple CSV uploads per event are preserved with timestamps

Package Tracking

packages

Sui blockchain packages from Blockberry API
Top 1,000 most active packages by transaction count
Schema:
- package_id: Unique package identifier (Sui address, primary key)
- package_name: Human-readable package name
- version: Package version number
- modules_count: Number of Move modules
- txs_count: Total historical transaction count
- create_timestamp: Package creation timestamp (milliseconds)
- publisher_address: Sui address that published the package
- project_name: Associated project name
- project_img: Project image URL
- is_verified: Verification status
- fetched_at: Data fetch timestamp
- updated_at: Last update timestamp

packages_repos

Links packages to GitHub repositories
Schema:
- package_id: Foreign key to packages
- repo_id: Foreign key to repos
- confidence_score: Match confidence (0.00-1.00)
- mapping_source: How the link was created (e.g., 'mvr_name', 'name_match')
- notes: Additional information
- created_at: Link creation timestamp
Primary key: (package_id, repo_id)

Confidence Scores:

1.00: High confidence (from Move Registry or exact name match)
<1.00: Fuzzy name matching with decreasing confidence

Current Coverage (as of import):

1,000 packages from Blockberry
438 packages linked to 309 repositories
1,915 total package-repo mappings
Uses both Move Registry reverse resolution (mvr_name) and name matching (name_match)

Importing Package Data:

To populate or refresh package data, use the integrated npm scripts:

# One-time: Create database tables (packages, packages_repos, sui_addresses)
npm run create-sui-schema

# Full pipeline: Import packages, link to repos, generate report
npm run sui-packages-pipeline

# Or run steps individually:
npm run import-sui-packages  # Fetch from Blockberry API
npm run link-sui-packages    # Link to GitHub repos
npm run sui-package-report   # Generate ecosystem report

Data Sources:

Blockberry API: Top 1,000 packages by transaction count (requires BLOCKBERRY_API_KEY env var)
Move Registry API: Package-to-repository mappings
Fuzzy name matching: Additional repository links

Refresh Schedule: Weekly or monthly to keep package statistics current.

Implementation Files:

scripts/import-blockberry-packages.ts: Fetches and imports packages
scripts/link-sui-packages.ts: Links packages to repositories
scripts/show-sui-package-ecosystem.ts: Generates reports
lib/blockberry-client.ts: Blockberry API client
scripts/create-sui-packages-schema.sql: Database schema

sui_addresses

Publisher address statistics
Schema:
- address: Sui blockchain address (primary key)
- canonical_developer_id: Link to canonical developer (nullable)
- first_seen_at: First package publication
- last_seen_at: Most recent activity
- packages_count: Number of packages published
- total_txs_count: Sum of transactions across all packages
- notes: Additional information

Data Relationships

Key Relationship Chains

1. Ecosystem Hierarchy → Repository Attribution

ecosystems_child_ecosystems_recursive
  → ecosystems_repos_recursive
    → repos

2. Developer Activity Attribution

commits
  → developer_activities (links to canonical_developer)
    → eco_developer_activities (links to ecosystems)
      → eco_developer_28d_activities (rolling windows)
        → eco_mads (final metrics)

3. Package → Repository → Developer

packages
  → packages_repos (via package_id)
    → repos (via repo_id)
      → developer_activities (via repo_id)
        → canonical_developers (via canonical_developer_id)
          → developer_profiles (enrichment)

4. Package → Developer (via blockchain)

packages
  → sui_addresses (via publisher_address)
    → canonical_developers (via canonical_developer_id)
      → developer_profiles (enrichment)

5. Event → Developer/Repository

event_data
  → canonical_developers (via canonical_developer_id)
  → repos (via repo_id)
  → packages (via repos and temporal proximity)

Important Data Considerations

Ecosystem Hierarchy: Parent ecosystems aggregate activity from all child ecosystems via ecosystems_child_ecosystems_recursive
Repository Attribution: A single repo can belong to multiple ecosystems. Always use ecosystems_repos_recursive to see the full hierarchy.
Developer Deduplication: All commit attribution flows through canonical_developers to ensure unique counting
28-Day Windows: MAD calculations use rolling 28-day windows, not calendar months
Package Linking: Multiple repositories may match a package name. Use confidence_score to filter for high-quality matches.
Event Attribution: Uses temporal criteria (creation dates relative to event date) rather than popularity metrics

Query Patterns

Get all repos for an ecosystem (including descendants)

SELECT * FROM ecosystems_repos_recursive
WHERE ecosystem_id = ?

Get MADs for an ecosystem

SELECT day, all_devs, exclusive_devs, multichain_devs
FROM eco_mads
WHERE ecosystem_id = ?

Weekly snapshots (Mondays only)

WHERE EXTRACT(ISODOW FROM day) = 1

Get packages linked to repositories

SELECT p.*, r.name, r.link
FROM packages p
JOIN packages_repos pr ON p.package_id = pr.package_id
JOIN repos r ON pr.repo_id = r.id
WHERE pr.confidence_score = 1.0  -- High confidence matches only
ORDER BY p.txs_count DESC

Get event-related packages

WITH event_repos AS (
  SELECT DISTINCT ed.repo_id, ed.event_date
  FROM event_data ed
  WHERE ed.event_name = ?
    AND ed.repo_id IS NOT NULL
)
SELECT DISTINCT p.*
FROM event_repos er
JOIN packages_repos pr ON pr.repo_id = er.repo_id
JOIN packages p ON p.package_id = pr.package_id
WHERE pr.confidence_score = 1.0
  AND p.repo_created_at >= (er.event_date - INTERVAL '14 days')
  AND p.repo_created_at <= (er.event_date + INTERVAL '28 days')

Data Sources

Electric Capital Database

Core developer activity data
Ecosystem and repository metadata
Pre-calculated MAD metrics
Updated DAILY - requires running npm run update to refresh all downstream data

GitHub GraphQL API

Developer profiles (avatars, names, emails)
Repository metadata (descriptions, languages, licenses, topics)
Real-time enrichment

Blockberry API

Sui package data (top 1,000 by activity)
Transaction counts
Package metadata

Move Registry API

Package-to-repository reverse resolution
Project metadata
Verification status

CSV Imports

Event participant data
Custom developer/repository lists
Community-contributed data

Documentation References

For detailed examples and implementation guidance:

EXAMPLE_QUERIES.md: SQL query patterns for common analytics
CLAUDE.md: Project guidance for AI-assisted development
README.md: Setup instructions and feature documentation
MATERIALIZED_VIEWS.md: Performance optimization details

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Developer Activity Data Model Documentation

Core Concepts

Ecosystems

Parent/Child Ecosystems

Canonical Developer

Monthly Active Developer (MAD)

Commit & Developer Counting

Developer Slices

By Activity Level

By Ecosystem Engagement

By Tenure

Data Pipeline Layers

1. PostgreSQL Sources Layer

2. Base Layer

3. Stage Layer

4. Rollup Layer

Application-Specific Tables

Event Tracking

Package Tracking

Data Relationships

Key Relationship Chains

Important Data Considerations

Query Patterns

Get all repos for an ecosystem (including descendants)

Get MADs for an ecosystem

Weekly snapshots (Mondays only)

Get packages linked to repositories

Get event-related packages

Data Sources

Documentation References

FilesExpand file tree

DATA_STRUCTURE.md

Latest commit

History

DATA_STRUCTURE.md

File metadata and controls

Developer Activity Data Model Documentation

Core Concepts

Ecosystems

Parent/Child Ecosystems

Canonical Developer

Monthly Active Developer (MAD)

Commit & Developer Counting

Developer Slices

By Activity Level

By Ecosystem Engagement

By Tenure

Data Pipeline Layers

1. PostgreSQL Sources Layer

2. Base Layer

3. Stage Layer

4. Rollup Layer

Application-Specific Tables

Event Tracking

Package Tracking

Data Relationships

Key Relationship Chains

Important Data Considerations

Query Patterns

Get all repos for an ecosystem (including descendants)

Get MADs for an ecosystem

Weekly snapshots (Mondays only)

Get packages linked to repositories

Get event-related packages

Data Sources

Documentation References