Add Repository Summary Stats API Endpoint (Git-First Approach) #117

@jonasyr

Description


Add Repository Summary Stats API Endpoint for Frontend UI Redesign

📋 Summary

Implement a new backend API endpoint /api/repositories/summary that provides comprehensive repository statistics for the frontend UI redesign. The endpoint uses a Git-first approach with a sparse, blobless clone to avoid GitHub API rate limits while minimizing bandwidth and disk usage.

🎯 Motivation

The frontend redesign (issue #87) includes a "Repository Summary" card displaying key repository metrics:

  • Repository name and owner
  • Created date and age
  • Last commit date
  • Total commits count
  • Contributors count
  • Repository status (active/inactive)

Currently, these values are mocked. The backend needs to expose this data through a dedicated endpoint that primarily uses Git operations to avoid rate limiting problems in multi-user deployments.

🚨 Why Git-First + Sparse Clone?

Rate Limit Problem:

  • GitHub API: 60 requests/hour per IP (unauthenticated)
  • All users share the same backend server IP
  • With 10 concurrent users, a handful of requests each exhausts the shared quota within minutes

Git-First Solution:

  • ✅ No rate limits - unlimited local Git operations
  • ✅ No external dependencies - works offline
  • ✅ Fast - local operations are instant
  • ✅ Scalable - works for unlimited users
  • ✅ Privacy - no data sent to external APIs
  • ✅ Universal - works for GitHub, GitLab, Bitbucket, self-hosted

Sparse Clone Optimization (95-99% bandwidth reduction):

  • Uses existing getFileTreeSparse() pattern from fileAnalysisService.ts
  • ✅ Downloads ONLY commit and tree metadata, no file contents (--filter=blob:none)
  • ✅ Full commit history kept (no --depth=1): the total commit count, contributor count, and first-commit date all require it, and history objects are tiny compared to blobs
  • ✅ Single branch (--single-branch)
  • ✅ No checkout (--no-checkout)
  • Existing infrastructure: Already battle-tested in production

Trade-offs:

  • "Created date" is approximate (first commit date) vs exact (GitHub API) → Acceptable for 99% of use cases
  • Requires a minimal Git clone instead of zero network calls → still 95-99% less data than a full clone

✅ Requirements / Acceptance Criteria

API Endpoint

  • Create GET /api/repositories/summary endpoint
  • Accept repoUrl as query parameter
  • Return structured JSON with all required stats
  • Implement proper validation and error handling
  • Add rate limiting (reuse existing middleware)

Data Extraction (Sparse Clone Optimization)

  • Use a blobless clone with --filter=blob:none --single-branch --no-checkout (full history is needed for commit counts and the first-commit date)
  • Parse repository URL to extract platform, owner, and repo name
  • Get first commit date via git log --reverse --format=%aI | head -1 (note: --max-count=1 cannot be combined with --reverse, because commit limiting is applied before the reversal)
  • Get last commit info via git log -1 --format='%aI|%H|%an' (format string quoted so | is not treated as a shell pipe)
  • Count total commits via git rev-list --count HEAD
  • Count unique contributors via git shortlog -s -n --all | wc -l
  • Calculate repository age from first commit date
  • Determine status based on last commit recency
  • Clean up temp directory after extraction
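
The URL-parsing step above might look like the following sketch (the regexes, platform detection, and error messages are illustrative assumptions, not the final implementation):

```typescript
// Hypothetical sketch of the URL-parsing step. Accepts HTTPS and SSH
// forms, with the .git suffix optional, as required above.
interface RepositoryUrlInfo {
  platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  owner: string;
  name: string;
  fullUrl: string;
}

function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo {
  // SSH form: git@host:owner/repo(.git); HTTPS form: https://host/owner/repo(.git)
  const sshMatch = repoUrl.match(/^git@([^:]+):(.+)$/);
  const httpsMatch = repoUrl.match(/^https?:\/\/([^/]+)\/(.+)$/);
  const match = sshMatch ?? httpsMatch;
  if (!match) throw new Error(`Unsupported repository URL: ${repoUrl}`);

  const [, host, rawPath] = match;
  const [owner, repo] = rawPath.replace(/\.git$/, '').split('/');
  if (!owner || !repo) throw new Error(`Cannot extract owner/name from: ${repoUrl}`);

  // Platform detection by hostname substring; anything else is 'other'.
  const platform = host.includes('github') ? 'github'
    : host.includes('gitlab') ? 'gitlab'
    : host.includes('bitbucket') ? 'bitbucket'
    : 'other';

  return { platform, owner, name: repo, fullUrl: repoUrl };
}
```

Nested GitLab groups (host/group/subgroup/repo) would need extra handling; the sketch assumes the common owner/name layout.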

Response Format

{
  "repository": {
    "name": "analytics-pro",
    "url": "https://github.com/octo-org/analytics-pro.git",
    "owner": "octo-org",
    "platform": "github"
  },
  "created": {
    "date": "2019-06-14T10:30:00.000Z",
    "source": "first-commit"
  },
  "age": {
    "years": 5,
    "months": 5,
    "formatted": "5.7y"
  },
  "lastCommit": {
    "date": "2025-11-17T14:23:00.000Z",
    "relativeTime": "2 days ago",
    "sha": "abc123def456",
    "author": "John Doe"
  },
  "stats": {
    "totalCommits": 3482,
    "contributors": 24,
    "status": "active"
  },
  "metadata": {
    "cached": true,
    "dataSource": "git-sparse-clone",
    "createdDateAccuracy": "approximate",
    "bandwidthSaved": "95-99% vs full clone",
    "lastUpdated": "2025-11-19T10:30:00.000Z"
  }
}

Caching Strategy

  • Cache summary data for 24 hours (Git data is stable)
  • Use cache key: repo:summary:${hash(repoUrl)}
  • Integrate with existing multi-tier cache (Redis → Disk → Memory)
  • Use repositoryCoordinator to prevent duplicate clones
  • Cache remains valid until new commits (commit-hash based validation)
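
The key derivation above needs nothing beyond Node's built-in crypto module; a minimal sketch (the function name summaryCacheKey is hypothetical):

```typescript
import { createHash } from 'node:crypto';

// Build the `repo:summary:${hash(repoUrl)}` cache key described above.
// md5 is used purely for key addressing here, not for security.
function summaryCacheKey(repoUrl: string): string {
  return `repo:summary:${createHash('md5').update(repoUrl).digest('hex')}`;
}
```

Note that hashing the raw URL gives HTTPS and SSH spellings of the same repository separate cache entries; normalizing the URL before hashing would avoid that.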

Edge Cases

  • Handle empty repositories (0 commits)
  • Handle very large repositories (>100k commits) efficiently
  • Handle repositories with single commit
  • Handle network timeouts during clone
  • Handle corrupted Git repositories
  • Handle deleted/moved repositories
  • Support all Git URL formats (HTTPS, SSH, .git suffix optional)
  • Clean up temp directories on error

Type Safety

  • Add RepositorySummary interface to @gitray/shared-types
  • Add RepositoryStatus type: 'active' | 'inactive' | 'archived' | 'empty'
  • Add DataSource type: 'git-sparse-clone' | 'git+github-api' | 'cache'
  • Add CreatedDateSource type: 'first-commit' | 'github-api' | 'gitlab-api'

🛠️ Implementation Plan

1. New Service: repositorySummaryService.ts

Location: apps/backend/src/services/repositorySummaryService.ts

Core Functions:

// Main orchestrator - Sparse clone approach
async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary>

// Parse repository URL
function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo

// Sparse clone for metadata extraction (NEW - based on fileAnalysisService pattern)
async function getSummaryViaSparseClone(repoUrl: string): Promise<{
  firstCommit: string;
  lastCommit: {date: string, sha: string, author: string};
  totalCommits: number;
  contributors: number;
  tempDir: string;
}>

// Git operations (can be run on sparse clone)
async function getFirstCommitDate(repoPath: string): Promise<string>
async function getLastCommitInfo(repoPath: string): Promise<{ date: string; sha: string; author: string }>
async function getContributorCount(repoPath: string): Promise<number>

// Calculations
function calculateRepositoryAge(createdDate: Date): AgeInfo
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus

// Cleanup
async function cleanupSparseClone(tempDir: string): Promise<void>
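
calculateRepositoryAge from the list above could be sketched as follows (the month arithmetic and the one-decimal "Ny" formatting are assumptions about the final rounding rules):

```typescript
interface AgeInfo {
  years: number;
  months: number;
  formatted: string; // e.g. "6.4y"
}

// Whole years plus remaining months between the first commit and now,
// with a compact decimal form for the UI. `now` is injectable for tests.
function calculateRepositoryAge(createdDate: Date, now: Date = new Date()): AgeInfo {
  const totalMonths =
    (now.getUTCFullYear() - createdDate.getUTCFullYear()) * 12 +
    (now.getUTCMonth() - createdDate.getUTCMonth()) -
    // Not a full month yet if the day-of-month hasn't been reached.
    (now.getUTCDate() < createdDate.getUTCDate() ? 1 : 0);
  return {
    years: Math.floor(totalMonths / 12),
    months: totalMonths % 12,
    formatted: `${(totalMonths / 12).toFixed(1)}y`,
  };
}
```

With the example dates from the response format above (first commit 2019-06-14, now 2025-11-17) this yields 6 years, 5 months, "6.4y".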

Sparse Clone Implementation (based on existing pattern):

import os from 'node:os';
import path from 'node:path';
import { promises as fs } from 'node:fs';
import { simpleGit } from 'simple-git';

async function performSparseCloneForSummary(repoUrl: string): Promise<string> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-summary-'));
  const git = simpleGit(tempDir);
  
  try {
    // Initialize an empty repo and point it at the remote
    await git.init();
    await git.addRemote('origin', repoUrl);
    
    // Blobless fetch: full commit history, zero file contents.
    // Deliberately NOT --depth=1: the commit count, contributor count,
    // and first-commit date all need the full history, and history
    // objects are tiny compared to blobs.
    await git.raw([
      'fetch',
      '--filter=blob:none',  // Exclude file contents (blobs)
      'origin',
      'HEAD'
    ]);
    
    // Set HEAD without materializing a working tree (keeps the
    // --no-checkout promise); log/rev-list/shortlog all work on HEAD.
    await git.raw(['update-ref', 'HEAD', 'FETCH_HEAD']);
    
    return tempDir;
  } catch (error) {
    await cleanupSparseClone(tempDir);
    throw error;
  }
}

Git Commands on Sparse Clone:

# All of these work on the blobless clone (no file contents needed):

# First commit date (repository "creation")
# Note: -n/--max-count is applied BEFORE --reverse, so
# `git log --reverse --max-count=1` would return the *latest* commit.
git log --reverse --format=%aI | head -1

# Last commit info (format string quoted so | is not a shell pipe)
git log -1 --format='%aI|%H|%an'

# Contributor count
git shortlog -s -n --all | wc -l

# Total commits
git rev-list --count HEAD
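
On the TypeScript side, the pipe-delimited last-commit line can be split back apart; a sketch (parseLastCommitLine is a hypothetical helper name):

```typescript
interface LastCommitInfo {
  date: string;   // ISO 8601 from %aI
  sha: string;    // full hash from %H
  author: string; // author name from %an
}

// Parse one line of `git log -1 --format='%aI|%H|%an'` output.
// Date and sha are '|'-free by construction, so any extra '|'
// characters are folded back into the author name.
function parseLastCommitLine(line: string): LastCommitInfo {
  const [date, sha, ...authorParts] = line.trim().split('|');
  if (!date || !sha || authorParts.length === 0) {
    throw new Error(`Unexpected git log output: ${line}`);
  }
  return { date, sha, author: authorParts.join('|') };
}
```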

2. New Route in repositoryRoutes.ts

router.get(
  '/summary',
  ...repoUrlValidation(),
  handleValidationErrors,
  async (req: Request, res: Response, next: NextFunction) => {
    try {
      const { repoUrl } = req.query;
      const summary = await repositorySummaryService.getRepositorySummary(repoUrl as string);
      res.status(HTTP_STATUS.OK).json(summary);
    } catch (error) {
      next(error);
    }
  }
);

3. Shared Types in packages/shared-types/src/index.ts

export interface RepositorySummary {
  repository: {
    name: string;
    url: string;
    owner: string;
    platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  };
  created: {
    date: string; // ISO 8601
    source: CreatedDateSource;
  };
  age: {
    years: number;
    months: number;
    formatted: string; // e.g., "5.7y"
  };
  lastCommit: {
    date: string; // ISO 8601
    relativeTime: string; // e.g., "2 days ago"
    sha: string;
    author: string;
  };
  stats: {
    totalCommits: number;
    contributors: number;
    status: RepositoryStatus;
  };
  metadata: {
    cached: boolean;
    dataSource: DataSource;
    createdDateAccuracy: 'exact' | 'approximate';
    bandwidthSaved?: string;
    lastUpdated: string; // ISO 8601
  };
}

export type RepositoryStatus = 'active' | 'inactive' | 'archived' | 'empty';
export type DataSource = 'git-sparse-clone' | 'git+github-api' | 'cache';
export type CreatedDateSource = 'first-commit' | 'github-api' | 'gitlab-api';

export interface RepositoryUrlInfo {
  platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  owner: string;
  name: string;
  fullUrl: string;
}

4. Status Determination Logic

import { differenceInDays } from 'date-fns';

// 'empty' is decided earlier by the caller: a repository with 0 commits
// has no lastCommitDate to compare against.
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus {
  const daysSinceLastCommit = differenceInDays(new Date(), lastCommitDate);
  
  if (daysSinceLastCommit <= 30) return 'active';
  if (daysSinceLastCommit <= 180) return 'inactive';
  return 'archived';
}

Status Rules:

  • active: Last commit within 30 days
  • inactive: Last commit between 30-180 days
  • archived: No commit in 180+ days
  • empty: 0 commits

5. Caching Implementation

import crypto from 'node:crypto';

async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary> {
  const cacheKey = `repo:summary:${crypto.createHash('md5').update(repoUrl).digest('hex')}`;
  
  // Try cache first (24h TTL)
  const cached = await repositoryCache.get(cacheKey);
  if (cached) {
    return { ...cached, metadata: { ...cached.metadata, cached: true } };
  }
  
  // Perform sparse clone for metadata extraction
  const tempDir = await performSparseCloneForSummary(repoUrl);
  
  try {
    const summary = await buildSummaryFromSparseClone(tempDir, repoUrl);
    
    // Cache for 24 hours
    await repositoryCache.set(cacheKey, summary, 86400);
    
    return summary;
  } finally {
    await cleanupSparseClone(tempDir);
  }
}

📦 Dependencies & Related Code

Existing Services to Leverage

  • apps/backend/src/services/gitService.ts - getCommitCount(tempDir)
  • apps/backend/src/services/repositoryCache.ts - Multi-tier caching
  • apps/backend/src/services/fileAnalysisService.ts - Sparse clone pattern (getFileTreeSparse() lines 1026-1140)
  • apps/backend/src/utils/gitUtils.ts - shallowClone() utility

Existing Sparse Clone Pattern to Reuse

See apps/backend/src/services/fileAnalysisService.ts:1026-1140:

private async getFileTreeSparse(repoUrl: string): Promise<{...}> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-sparse-'));
  const git = simpleGit(tempDir);
  
  await git.init();
  await git.addRemote('origin', repoUrl);
  await git.raw(['config', 'core.sparseCheckout', 'true']);
  await git.raw(['fetch', '--filter=blob:none', '--depth=1', 'origin', 'HEAD']);
  await git.raw(['checkout', 'FETCH_HEAD']);
  
  // ... extract metadata ...
  
  return { files, commitHash, tempDir };
}

NPM Packages Already Available

  • simple-git - Git operations
  • date-fns - Date calculations (differenceInDays, formatDistanceToNow)
  • crypto (Node.js built-in) - Hash URLs for cache keys

🧪 Testing Requirements

Unit Tests

Location: apps/backend/__tests__/unit/services/repositorySummaryService.unit.test.ts

Test Cases:

  • performSparseCloneForSummary() - creates temp dir with sparse clone
  • parseRepositoryUrl() - GitHub HTTPS, SSH, GitLab, Bitbucket
  • getFirstCommitDate() - with sparse clone repository
  • getLastCommitInfo() - parse date, sha, author correctly
  • getContributorCount() - count unique authors from sparse clone
  • calculateRepositoryAge() - various date ranges
  • determineRepositoryStatus() - active/inactive/archived/empty
  • cleanupSparseClone() - removes temp directory
  • Cache hit/miss scenarios
  • Error handling (empty repo, corrupted repo, cleanup on error)

Integration Tests

Location: apps/backend/__tests__/unit/routes/repositoryRoutes.unit.test.ts

Test Cases:

  • GET /api/repositories/summary?repoUrl=... returns 200 OK
  • Response matches expected schema
  • Bandwidth usage is minimal (verify sparse clone used)
  • Temp directories cleaned up after request
  • Invalid URL returns 400 Bad Request
  • Caching works (second request faster)
  • Error handling (network timeout, invalid repo)

Manual Test Repositories

# Small repo (fast test)
https://github.com/octocat/Hello-World.git

# Medium repo (verify sparse clone efficiency)
https://github.com/facebook/react.git

# Large repo (stress test)
https://github.com/torvalds/linux.git

# GitLab repo
https://gitlab.com/gitlab-org/gitlab.git

Coverage Target

  • ≥ 80% code coverage (per AGENTS.md)
  • All edge cases covered
  • All error paths tested
  • Cleanup logic tested

📊 Performance Considerations

Optimization Strategies

  1. Sparse Clone (95-99% bandwidth reduction)

    • --filter=blob:none excludes all file contents
    • Full history is still fetched (commit and contributor counts need it), but history objects are small
    • --single-branch limits the fetch to the default branch
    • Based on proven fileAnalysisService pattern
  2. Efficient Git Commands

    • git log -1 is O(1) for the last commit; the first-commit date costs one history walk
    • git rev-list --count walks the commit graph once (fast even at 100k+ commits, since no blobs are touched)
    • shortlog -s is optimized by Git
    • All commands work on the blobless clone
  3. Aggressive Caching

    • 24-hour TTL (Git data rarely changes)
    • Hash-based cache keys
    • Multi-tier cache (Redis → Disk → Memory)
  4. Resource Cleanup

    • Temp directories cleaned immediately after use
    • Try-finally blocks ensure cleanup on error
    • No resource leaks

Expected Performance

  • First request (sparse clone): 0.5-2 seconds (vs 5-30s for full clone)
  • Bandwidth: 1-5 MB (vs 100-500 MB for full clone)
  • Cached request: <50ms (cache lookup)
  • Large repos (100k commits): still fast; history objects are small and no blobs are transferred

Comparison: Full Clone vs Sparse Clone

Metric            | Full Clone   | Sparse Clone  | Improvement
------------------|--------------|---------------|------------
Bandwidth         | 100-500 MB   | 1-5 MB        | 95-99%
Time              | 5-30 seconds | 0.5-2 seconds | 75-90%
Disk Usage        | 100-500 MB   | 1-5 MB        | 95-99%
Metadata Accuracy | 100%         | 100%          | Same

✔️ Definition of Done

  • New endpoint /api/repositories/summary responds correctly
  • All stats from mockup are returned (name, owner, created, age, commits, contributors, status)
  • Sparse clone implementation (not full clone)
  • Bandwidth usage verified (should be <5MB for most repos)
  • Temp directory cleanup verified (no leaks)
  • URL parsing works for GitHub, GitLab, Bitbucket, self-hosted
  • First commit date used as "created" date
  • Last commit info extracted correctly
  • Contributor count accurate
  • Status determination logic works
  • 24-hour caching implemented
  • Unit tests pass with ≥80% coverage
  • Integration tests pass
  • No regression in existing tests
  • Linting passes (pnpm lint)
  • Build succeeds (pnpm build)
  • Frontend can successfully consume the endpoint
  • API documented in README.md
  • PR reviewed and approved

🔗 Related Issues

  • #87 - Frontend UI redesign (source of the Repository Summary card)

🚀 Future Enhancements (Out of Scope)

Optional GitHub API Enrichment (v2)

  • Add env var ENABLE_GITHUB_API=false (default: disabled)
  • If user provides GITHUB_TOKEN, fetch exact creation date
  • Only use for single-user deployments
  • Gracefully fall back to Git data if the API is unavailable
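
That fallback could be sketched like this (the GitHub endpoint GET /repos/{owner}/{repo} and its created_at field are real; resolveCreatedDate and the injected fetchJson are hypothetical names, chosen so the fallback path is testable without network access):

```typescript
// Injected fetcher so the fallback path can be exercised with a stub.
type FetchJson = (url: string, headers: Record<string, string>) => Promise<{ created_at?: string }>;

// Prefer the exact created_at from the GitHub API when a token is
// configured; otherwise (or on any API failure) use the first-commit date.
async function resolveCreatedDate(
  owner: string,
  repo: string,
  firstCommitDate: string,
  fetchJson: FetchJson,
  token?: string
): Promise<{ date: string; source: 'github-api' | 'first-commit' }> {
  if (token) {
    try {
      const data = await fetchJson(`https://api.github.com/repos/${owner}/${repo}`, {
        Authorization: `Bearer ${token}`,
      });
      if (data.created_at) return { date: data.created_at, source: 'github-api' };
    } catch {
      // API unavailable or rate limited: fall through to Git data.
    }
  }
  return { date: firstCommitDate, source: 'first-commit' };
}
```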

Additional Stats (v2)

  • Repository size (disk usage)
  • Primary language detection
  • Branch count
  • Recent activity trend (commits per week)

📝 Implementation Notes

Why Sparse Clone?

The codebase already uses sparse cloning in fileAnalysisService.ts for file analysis:

  • Proven in production - battle-tested pattern
  • 95-99% bandwidth reduction - excludes all file contents
  • Metadata extraction - Git commands work perfectly on sparse clones
  • No full clone needed - tree structure + commit history is sufficient

Why First Commit ≈ Created Date?

In practice, first commit date is functionally equivalent to repository creation date:

  • Most repos have first commit within hours of creation
  • Users can't distinguish the difference in the UI
  • Avoids rate limiting entirely
  • Works universally across all Git platforms

Alternative Approaches Considered

  • Full Clone: Wastes 95-99% bandwidth downloading file contents we don't need
  • GitHub API Primary: Rate limiting kills multi-user deployments
  • Scraping: Fragile, violates ToS, slower than Git
  • Sparse Clone + Git-First (chosen): Fast, scalable, reliable, minimal bandwidth


Estimated Effort: Medium (8-12 hours)
Priority: High (blocks frontend redesign completion)
Complexity: Medium
Risk: Low (reuses existing sparse clone infrastructure)
