This document outlines the data model used to track and analyze developer activity across software ecosystems, with special focus on the Sui blockchain ecosystem. The model processes raw data from sources like GitHub and blockchain APIs through several layers of transformation and aggregation to produce insightful metrics.
To understand the data model, it's essential to be familiar with these core concepts:
Meaningful groupings of software repositories. An ecosystem can represent:
- A single project (e.g., Uniswap)
- A protocol (e.g., Ethereum)
- A broad technology category (e.g., All Crypto)
Ecosystems serve as the primary unit of analysis.
Ecosystems can be nested to create a hierarchy:
- Parent: Broader category (e.g., Sui blockchain)
- Child: Project belonging to parent (e.g., Walrus, Sui Name Service)
This allows activity to be aggregated upwards or analyzed in detail downwards.
A unique identifier for an individual developer. We deduplicate contributions across different names and email addresses to ensure that one person is counted only once, providing an accurate measure of the developer base.
A canonical developer who has authored at least one commit within a rolling 28-day period. For data stability, a commit on day X counts the developer as active until day X+28.
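The window logic can be sketched as follows (illustrative Python; the function name and sample dates are hypothetical, not part of the pipeline):

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # a commit on day X keeps the developer active through day X + 28

def is_active(commit_days: list[date], on: date) -> bool:
    """True if any commit falls within the trailing 28-day window ending on `on`."""
    return any(on - timedelta(days=WINDOW_DAYS) <= d <= on for d in commit_days)

# A single commit on March 1 keeps the developer active through March 29
print(is_active([date(2024, 3, 1)], date(2024, 3, 29)))  # True
print(is_active([date(2024, 3, 1)], date(2024, 3, 30)))  # False
```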
Our methodology is designed to credit only original code contributions:
- Forks: We only count new code added in a fork. Merges from the original upstream repository are excluded.
- Original Code: We use a fingerprinting technique to identify and eliminate copy-pasted code, ensuring that only the original author and ecosystem receive credit.
- Branches: We analyze commits from all branches, not just the default `main` or `master`, providing a comprehensive view of development activity.
To capture the diversity of contributions, we segment developers into several categories:
- Full-Time: Developers contributing on 10 or more days in a month
- Part-Time: Developers contributing on fewer than 10 days in a month
- One-Time: Developers who made only a single contribution within a rolling 3-month window
- Exclusive: Contributes to only one blockchain ecosystem on a given day
- Multichain: Contributes to multiple blockchain ecosystems on a given day
- Newcomer (`devs_0_1y`): Active in crypto for less than one year
- Emerging (`devs_1_2y`): Active for one to two years
- Established (`devs_2y_plus`): Active for more than two years
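As a rough sketch, the rank and tenure rules above amount to simple threshold classifiers. This is illustrative Python: the function names are invented here, and only the thresholds and `devs_*` bucket names come from the rules above.

```python
def contribution_rank(active_days_in_month: int) -> str:
    # 10 or more active days in a month counts as full-time
    return "full-time" if active_days_in_month >= 10 else "part-time"

def tenure_bucket(years_in_crypto: float) -> str:
    # Bucket names mirror the devs_* column names above
    if years_in_crypto < 1:
        return "devs_0_1y"    # Newcomer
    if years_in_crypto < 2:
        return "devs_1_2y"    # Emerging
    return "devs_2y_plus"     # Established

print(contribution_rank(12), tenure_bucket(0.5))  # full-time devs_0_1y
```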
The data flows through four distinct layers, from raw source data to final aggregated metrics.
This layer contains the raw, unprocessed data ingested from our primary sources.
Core Tables:
- `canonical_developers`: Master table of unique, deduplicated developers
- `ecosystems_child_ecosystems`: Direct parent-child ecosystem relationships
- `raw_commits`: Raw commit data from repositories
Metadata:
- `ecosystems_repos_ch`: Links ecosystems to specific repositories
- `ecosystems_ch`: Definitive list of all ecosystems
- `github_organizations`: Information about GitHub organizations
- `repos`: Information about repositories
- `ecosystems_github_organizations`: Links ecosystems to GitHub organizations
Tables in this layer represent cleaned, standardized versions of the source data, preparing it for further processing.
Core Tables:
- `commits`: Cleaned and standardized commit data
- `ecosystems_repos`: Cleaned mapping of ecosystems to repos
- `canonical_developer_locations`: Enriched developer data with location information
- `ecosystems_child_ecosystems`: Cleaned parent-child ecosystem relationships
Metadata:
- `organizations`: Cleaned GitHub organization data
- `repos`: Cleaned repository data
- `ecosystems_organizations`: Cleaned mapping of ecosystems to organizations
This layer contains intermediate transformations and aggregations, where data from different base tables is joined and prepared for the final rollup.
Core Transformations:
- `ecosystems_repos_recursive` (Critical!)
  - Resolves the parent-child hierarchy
  - For any given parent ecosystem, lists all repositories belonging to it and all of its descendants (children, grandchildren, etc.)
  - Essential for accurate ecosystem-wide queries
- `developer_activities`
  - Links each commit from the `commits` table to a unique `canonical_developer`
  - Foundational log of "who did what"
  - Basis for all developer attribution
- `ecosystems_child_ecosystems_recursive`
  - Flattens the ecosystem hierarchy
  - For any parent, lists all descendant ecosystems
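The flattening these recursive tables perform can be illustrated with a recursive CTE. This is a minimal sketch against a toy in-memory SQLite table, assuming a simple `(parent_id, child_id)` shape with made-up IDs; the production tables and columns may differ.

```python
import sqlite3

# Toy hierarchy standing in for ecosystems_child_ecosystems (IDs are illustrative)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ecosystems_child_ecosystems (parent_id INTEGER, child_id INTEGER);
    INSERT INTO ecosystems_child_ecosystems VALUES (1, 2), (1, 3), (2, 4);
""")

# Flatten the hierarchy under parent 1, as ecosystems_child_ecosystems_recursive would
rows = conn.execute("""
    WITH RECURSIVE descendants(id) AS (
        SELECT child_id FROM ecosystems_child_ecosystems WHERE parent_id = 1
        UNION
        SELECT e.child_id
        FROM ecosystems_child_ecosystems e
        JOIN descendants d ON e.parent_id = d.id
    )
    SELECT id FROM descendants ORDER BY id
""").fetchall()

print([r[0] for r in rows])  # [2, 3, 4]: children and grandchildren of ecosystem 1
```

The same pattern joined against an ecosystem-to-repo mapping yields the repository-level flattening.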
Enrichment Tables:
- `developer_profiles`: GitHub profile data (login, avatar, email, name)
- `repository_profiles`: GitHub repository metadata (description, language, license, topics)
- `employees`: Identified Mysten Labs/Sui employees
This final layer produces the aggregated tables used for analytics and visualization. These tables synthesize all the processed data into meaningful metrics.
Primary Analytics Tables:
- `eco_mads` (Primary output table!)
  - Synthesizes 28-day activity, tenure, and rank data
  - Produces final Monthly Active Developer counts
  - Segmented by developer slices (Full-Time, Part-Time, Newcomer, etc.)
  - Powers most dashboards and analyses
- `eco_mads_excl_employees` (Materialized view)
  - Same as `eco_mads` but excluding employees
  - 80-90% faster queries when filtering employees
  - Refreshed daily
Supporting Analytics Tables:
- `eco_developer_activities`: Maps every developer commit to every ecosystem it belongs to (including all parent ecosystems)
- `eco_developer_28d_activities`: Aggregates developer activity in 28-day rolling windows
- `repo_developer_28d_activities`: Same as above but at the repository level
- `eco_developer_tenures`: Developer tenure categories (Newcomer, Emerging, Established)
- `eco_developer_contribution_ranks`: Developer activity classification (full-time, part-time, one-time)
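The rolling aggregation behind tables like `eco_developer_28d_activities` can be sketched as counting distinct developers per trailing window (illustrative Python; the function and sample data are hypothetical):

```python
from datetime import date, timedelta

def mads_on(day: date, activity: list[tuple[int, date]]) -> int:
    """Distinct developers with at least one commit in the trailing 28-day window."""
    start = day - timedelta(days=28)
    return len({dev for dev, d in activity if start <= d <= day})

# (developer_id, commit_day) pairs
activity = [(1, date(2024, 3, 1)), (2, date(2024, 3, 10)), (1, date(2024, 3, 15))]
print(mads_on(date(2024, 3, 20), activity))  # 2: both developers in the window
print(mads_on(date(2024, 4, 10), activity))  # 1: only dev 1's Mar 15 commit remains
```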
Beyond the core Electric Capital pipeline, the dashboard includes additional tracking capabilities:
`event_data`
- Stores hackathon, workshop, and community event participants
- Links events to developers and repositories
- Schema:
  - `id`: Primary key
  - `event_name`: Event identifier (e.g., "Overflow 2024")
  - `event_date`: Date of the event
  - `developer_github_url`: Participant GitHub profile
  - `repository_url`: Project repository
  - `email`: Participant email
  - `country`, `city`: Location data
  - `canonical_developer_id`: Link to canonical developer (resolved)
  - `repo_id`: Link to repository (resolved)
  - `created_at`: Import timestamp
  - `import_metadata`: Import source information (JSONB)
`event_csv_uploads`
- Stores raw CSV content from event imports
- Enables version history tracking and re-processing
- Schema:
  - `id`: Primary key
  - `event_name`: Event identifier
  - `event_date`: Event date
  - `uploaded_at`: Upload timestamp (allows multiple versions)
  - `csv_content`: Raw CSV file content (TEXT)
  - `filename`: Original CSV filename
  - `row_count`: Number of data rows in the CSV
`event_csv_row_status`
- Tracks resolution status for each CSV row during import
- Links to `event_csv_uploads` via `csv_upload_id`
- Schema:
  - `id`: Primary key
  - `csv_upload_id`: Foreign key to `event_csv_uploads`
  - `row_index`: Zero-based row index in the CSV
  - `resolved`: Boolean indicating whether the row was successfully matched
  - `canonical_developer_id`: Resolved developer (if found)
  - `developer_github_login`: GitHub login for linking
  - `repo_id`: Resolved repository (if found)
  - `repo_full_name`: Repository full name for linking
  - `resolution_notes`: How the row was resolved (e.g., "GitHub profile", "Email match")
- Unique constraint: (`csv_upload_id`, `row_index`)
- Indexes for fast lookups by upload and resolution status
Event Attribution Logic:
Events use temporal proximity to determine which repositories and packages are related to an event:
- New Projects: Repositories created within 14 days before to 28 days after the event date
  - No star/fork limits - purely temporal criteria
  - Represents projects built specifically for the event
- Event Infrastructure: Repositories created 14-60 days before the event
  - Templates, toolkits, and event-specific resources
- Reference Projects: Repositories created 60-180 days before the event
  - Projects participants likely referenced while building
- Established Projects: Repositories created 180+ days before the event
  - Existing ecosystem projects
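The four temporal buckets reduce to one classification over the gap between repository creation and event date. This is illustrative Python: the bucket labels are invented, and only the day ranges come from the criteria above.

```python
from datetime import date

def attribution_bucket(repo_created: date, event_date: date) -> str:
    """Classify a repository's relation to an event purely by creation date."""
    days_before = (event_date - repo_created).days  # negative if created after the event
    if -28 <= days_before <= 14:
        return "new_project"            # 14 days before to 28 days after
    if 14 < days_before <= 60:
        return "event_infrastructure"   # 14-60 days before
    if 60 < days_before <= 180:
        return "reference_project"      # 60-180 days before
    if days_before > 180:
        return "established_project"    # 180+ days before
    return "unrelated"                  # more than 28 days after the event

event = date(2024, 6, 1)
print(attribution_bucket(date(2024, 6, 15), event))  # new_project
print(attribution_bucket(date(2024, 4, 15), event))  # event_infrastructure
print(attribution_bucket(date(2023, 6, 1), event))   # established_project
```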
Event Import and Resolution:
When importing event CSVs:
- Enhanced Column Scanning: Parser scans ALL columns for GitHub URLs and emails, not just named columns
- Multi-Step Resolution: Attempts to resolve developers via:
- GitHub profile URLs
- Repository contributor lists
- Email matching against database
- Username extraction from failed repository URLs
- Row Status Tracking: Each CSV row's resolution outcome is saved to `event_csv_row_status`
- Version History: Multiple CSV uploads per event are preserved with timestamps
`packages`
- Sui blockchain packages from Blockberry API
- Top 1,000 most active packages by transaction count
- Schema:
  - `package_id`: Unique package identifier (Sui address, primary key)
  - `package_name`: Human-readable package name
  - `version`: Package version number
  - `modules_count`: Number of Move modules
  - `txs_count`: Total historical transaction count
  - `create_timestamp`: Package creation timestamp (milliseconds)
  - `publisher_address`: Sui address that published the package
  - `project_name`: Associated project name
  - `project_img`: Project image URL
  - `is_verified`: Verification status
  - `fetched_at`: Data fetch timestamp
  - `updated_at`: Last update timestamp
`packages_repos`
- Links packages to GitHub repositories
- Schema:
  - `package_id`: Foreign key to `packages`
  - `repo_id`: Foreign key to `repos`
  - `confidence_score`: Match confidence (0.00-1.00)
  - `mapping_source`: How the link was created (e.g., 'mvr_name', 'name_match')
  - `notes`: Additional information
  - `created_at`: Link creation timestamp
- Primary key: (`package_id`, `repo_id`)
Confidence Scores:
- 1.00: High confidence (from Move Registry or exact name match)
- <1.00: Fuzzy name matching with decreasing confidence
Current Coverage (as of import):
- 1,000 packages from Blockberry
- 438 packages linked to 309 repositories
- 1,915 total package-repo mappings
- Uses both Move Registry reverse resolution (`mvr_name`) and name matching (`name_match`)
Importing Package Data:
To populate or refresh package data, use the integrated npm scripts:
```bash
# One-time: Create database tables (packages, packages_repos, sui_addresses)
npm run create-sui-schema

# Full pipeline: Import packages, link to repos, generate report
npm run sui-packages-pipeline

# Or run steps individually:
npm run import-sui-packages   # Fetch from Blockberry API
npm run link-sui-packages     # Link to GitHub repos
npm run sui-package-report    # Generate ecosystem report
```

Data Sources:
- Blockberry API: Top 1,000 packages by transaction count (requires `BLOCKBERRY_API_KEY` env var)
- Move Registry API: Package-to-repository mappings
- Fuzzy name matching: Additional repository links
Refresh Schedule: Weekly or monthly to keep package statistics current.
Implementation Files:
- `scripts/import-blockberry-packages.ts`: Fetches and imports packages
- `scripts/link-sui-packages.ts`: Links packages to repositories
- `scripts/show-sui-package-ecosystem.ts`: Generates reports
- `lib/blockberry-client.ts`: Blockberry API client
- `scripts/create-sui-packages-schema.sql`: Database schema
`sui_addresses`
- Publisher address statistics
- Schema:
  - `address`: Sui blockchain address (primary key)
  - `canonical_developer_id`: Link to canonical developer (nullable)
  - `first_seen_at`: First package publication
  - `last_seen_at`: Most recent activity
  - `packages_count`: Number of packages published
  - `total_txs_count`: Sum of transactions across all packages
  - `notes`: Additional information
1. Ecosystem Hierarchy → Repository Attribution

```
ecosystems_child_ecosystems_recursive
  → ecosystems_repos_recursive
  → repos
```

2. Developer Activity Attribution

```
commits
  → developer_activities (links to canonical_developer)
  → eco_developer_activities (links to ecosystems)
  → eco_developer_28d_activities (rolling windows)
  → eco_mads (final metrics)
```

3. Package → Repository → Developer

```
packages
  → packages_repos (via package_id)
  → repos (via repo_id)
  → developer_activities (via repo_id)
  → canonical_developers (via canonical_developer_id)
  → developer_profiles (enrichment)
```

4. Package → Developer (via blockchain)

```
packages
  → sui_addresses (via publisher_address)
  → canonical_developers (via canonical_developer_id)
  → developer_profiles (enrichment)
```

5. Event → Developer/Repository

```
event_data
  → canonical_developers (via canonical_developer_id)
  → repos (via repo_id)
  → packages (via repos and temporal proximity)
```
- Ecosystem Hierarchy: Parent ecosystems aggregate activity from all child ecosystems via `ecosystems_child_ecosystems_recursive`
- Repository Attribution: A single repo can belong to multiple ecosystems. Always use `ecosystems_repos_recursive` to see the full hierarchy.
- Developer Deduplication: All commit attribution flows through `canonical_developers` to ensure unique counting
- 28-Day Windows: MAD calculations use rolling 28-day windows, not calendar months
- Package Linking: Multiple repositories may match a package name. Use `confidence_score` to filter for high-quality matches.
- Event Attribution: Uses temporal criteria (creation dates relative to the event date) rather than popularity metrics
All repositories in an ecosystem, including descendants:

```sql
SELECT * FROM ecosystems_repos_recursive
WHERE ecosystem_id = ?
```

Daily MAD metrics for an ecosystem:

```sql
SELECT day, all_devs, exclusive_devs, multichain_devs
FROM eco_mads
WHERE ecosystem_id = ?
```

To sample one data point per week, filter to Mondays:

```sql
WHERE EXTRACT(ISODOW FROM day) = 1
```

Packages with high-confidence repository links:

```sql
SELECT p.*, r.name, r.link
FROM packages p
JOIN packages_repos pr ON p.package_id = pr.package_id
JOIN repos r ON pr.repo_id = r.id
WHERE pr.confidence_score = 1.0  -- High confidence matches only
ORDER BY p.txs_count DESC
```

Packages linked to an event's repositories within the "New Projects" window:

```sql
WITH event_repos AS (
    SELECT DISTINCT ed.repo_id, ed.event_date
    FROM event_data ed
    WHERE ed.event_name = ?
      AND ed.repo_id IS NOT NULL
)
SELECT DISTINCT p.*
FROM event_repos er
JOIN packages_repos pr ON pr.repo_id = er.repo_id
JOIN packages p ON p.package_id = pr.package_id
WHERE pr.confidence_score = 1.0
  AND p.repo_created_at >= (er.event_date - INTERVAL '14 days')
  AND p.repo_created_at <= (er.event_date + INTERVAL '28 days')
```

Electric Capital Database
- Core developer activity data
- Ecosystem and repository metadata
- Pre-calculated MAD metrics
- Updated DAILY - requires running `npm run update` to refresh all downstream data
GitHub GraphQL API
- Developer profiles (avatars, names, emails)
- Repository metadata (descriptions, languages, licenses, topics)
- Real-time enrichment
Blockberry API
- Sui package data (top 1,000 by activity)
- Transaction counts
- Package metadata
Move Registry API
- Package-to-repository reverse resolution
- Project metadata
- Verification status
CSV Imports
- Event participant data
- Custom developer/repository lists
- Community-contributed data
For detailed examples and implementation guidance:
- `EXAMPLE_QUERIES.md`: SQL query patterns for common analytics
- `CLAUDE.md`: Project guidance for AI-assisted development
- `README.md`: Setup instructions and feature documentation
- `MATERIALIZED_VIEWS.md`: Performance optimization details