Add Repository Summary Stats API Endpoint for Frontend UI Redesign
📋 Summary
Implement a new backend API endpoint /api/repositories/summary to provide comprehensive repository statistics for the frontend UI redesign. This endpoint uses a Git-first approach with sparse/shallow cloning to avoid GitHub API rate limiting issues while minimizing bandwidth and disk usage.
🎯 Motivation
The frontend redesign (issue #87) includes a "Repository Summary" card displaying key repository metrics:
- Repository name and owner
- Created date and age
- Last commit date
- Total commits count
- Contributors count
- Repository status (active/inactive)
Currently, these values are mocked. The backend needs to expose this data through a dedicated endpoint that primarily uses Git operations to avoid rate limiting problems in multi-user deployments.
🚨 Why Git-First + Sparse Clone?
Rate Limit Problem:
- GitHub API: 60 requests/hour per IP (unauthenticated)
- All users share the same backend server IP
- With 10 concurrent users (~1 request per user per minute), the shared limit is exhausted in about 6 minutes!
Git-First Solution:
- ✅ No rate limits - unlimited local Git operations
- ✅ No external dependencies - works offline
- ✅ Fast - local operations are instant
- ✅ Scalable - works for unlimited users
- ✅ Privacy - no data sent to external APIs
- ✅ Universal - works for GitHub, GitLab, Bitbucket, self-hosted
Sparse Clone Optimization (95-99% bandwidth reduction):
- ✅ Uses existing getFileTreeSparse() pattern from fileAnalysisService.ts
- ✅ Downloads ONLY tree structure, no file contents (--filter=blob:none)
- ✅ Shallow clone with --depth=1 (latest commit only)
- ✅ Single branch (--single-branch)
- ✅ No checkout (--no-checkout)
- ✅ Existing infrastructure: already battle-tested in production
Trade-offs:
- "Created date" is approximate (first commit date) vs exact (GitHub API) → Acceptable for 99% of use cases
- Requires a minimal Git clone instead of zero network calls → Still transfers 95-99% less data than a full clone
✅ Requirements / Acceptance Criteria
API Endpoint
- Create GET /api/repositories/summary endpoint
- Accept repoUrl as query parameter
- Return structured JSON with all required stats
- Implement proper validation and error handling (see the validation sketch after this list)
- Add rate limiting (reuse existing middleware)
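The repoUrlValidation() and handleValidationErrors middleware referenced in the route sketch further down may already exist in the codebase; if not, a minimal version along these lines would fit (this assumes express-validator, which the naming suggests; the URL regex and error messages are illustrative only):

import { query, validationResult } from 'express-validator';
import type { Request, Response, NextFunction } from 'express';

export const repoUrlValidation = () => [
  // Require a syntactically valid HTTPS or SSH-style Git URL in the repoUrl query param
  query('repoUrl')
    .isString()
    .trim()
    .notEmpty()
    .withMessage('repoUrl is required')
    .matches(/^(https?:\/\/|git@)[\w.@:\/~-]+$/)
    .withMessage('repoUrl must be a valid Git repository URL'),
];

export function handleValidationErrors(req: Request, res: Response, next: NextFunction): void {
  const errors = validationResult(req);
  if (!errors.isEmpty()) {
    res.status(400).json({ errors: errors.array() });
    return;
  }
  next();
}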
Data Extraction (Sparse Clone Optimization)
- Use sparse clone with --filter=blob:none --depth=1 --single-branch --no-checkout
- Parse repository URL to extract platform, owner, and repo name
- Get first commit date via git log --reverse --format=%aI | head -1 (--max-count is applied before --reverse, so it cannot be used here)
- Get last commit info via git log -1 --format='%aI|%H|%an' (see the extraction sketch after this list)
- Count total commits via git rev-list --count HEAD
- Count unique contributors via git shortlog -s -n --all | wc -l
- Calculate repository age from first commit date
- Determine status based on last commit recency
- Clean up temp directory after extraction
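The two trickiest items above are parsing the last-commit format string and counting contributors. A minimal sketch of how these could be run through simple-git (function names follow the core-function list in the implementation plan below; error handling omitted):

import { simpleGit } from 'simple-git';

interface LastCommitInfo {
  date: string;   // ISO 8601 author date (%aI)
  sha: string;    // full commit hash (%H)
  author: string; // author name (%an)
}

// Runs `git log -1 --format=%aI|%H|%an` via simple-git and splits the result.
// With raw() each argument is passed directly to git, so no shell quoting is needed.
export async function getLastCommitInfo(repoPath: string): Promise<LastCommitInfo> {
  const output = await simpleGit(repoPath).raw(['log', '-1', '--format=%aI|%H|%an']);
  const [date, sha, ...rest] = output.trim().split('|');
  return { date, sha, author: rest.join('|') }; // author names could themselves contain '|'
}

// Counts unique author lines from `git shortlog -s -n --all`. This assumes the
// full commit history was fetched (i.e. the fetch was not limited with --depth=1).
export async function getContributorCount(repoPath: string): Promise<number> {
  const output = await simpleGit(repoPath).raw(['shortlog', '-s', '-n', '--all']);
  return output.split('\n').filter((line) => line.trim().length > 0).length;
}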
Response Format
{
  "repository": {
    "name": "analytics-pro",
    "url": "https://github.com/octo-org/analytics-pro.git",
    "owner": "octo-org",
    "platform": "github"
  },
  "created": {
    "date": "2019-06-14T10:30:00.000Z",
    "source": "first-commit"
  },
  "age": {
    "years": 6,
    "months": 5,
    "formatted": "6.4y"
  },
  "lastCommit": {
    "date": "2025-11-17T14:23:00.000Z",
    "relativeTime": "2 days ago",
    "sha": "abc123def456",
    "author": "John Doe"
  },
  "stats": {
    "totalCommits": 3482,
    "contributors": 24,
    "status": "active"
  },
  "metadata": {
    "cached": true,
    "dataSource": "git-sparse-clone",
    "createdDateAccuracy": "approximate",
    "bandwidthSaved": "95-99% vs full clone",
    "lastUpdated": "2025-11-19T10:30:00.000Z"
  }
}

Caching Strategy
- Cache summary data for 24 hours (Git data is stable)
- Use cache key: repo:summary:${hash(repoUrl)}
- Integrate with existing multi-tier cache (Redis → Disk → Memory)
- Use repositoryCoordinator to prevent duplicate clones
- Cache remains valid until new commits (commit-hash based validation; see the sketch after this list)
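For the commit-hash based validation mentioned in the last bullet, one low-cost option is comparing the cached HEAD sha against the remote's current HEAD via git ls-remote, which needs a single round-trip and no clone. The cached-entry shape and helper below are illustrative, not part of the existing cache API:

import { simpleGit } from 'simple-git';
import type { RepositorySummary } from '@gitray/shared-types';

// Hypothetical cached shape: the summary plus the HEAD sha it was computed from.
interface CachedSummary {
  headSha: string;
  summary: RepositorySummary;
}

// Check whether a cached summary is still current by comparing its HEAD sha
// with the remote's current HEAD (one `git ls-remote <url> HEAD` call, no clone).
export async function isCacheStillValid(repoUrl: string, cached: CachedSummary): Promise<boolean> {
  // ls-remote prints "<sha>\tHEAD"
  const output = await simpleGit().raw(['ls-remote', repoUrl, 'HEAD']);
  const remoteHeadSha = output.split('\t')[0]?.trim();
  return remoteHeadSha !== undefined && remoteHeadSha === cached.headSha;
}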
Edge Cases
- Handle empty repositories (0 commits)
- Handle very large repositories (>100k commits) efficiently
- Handle repositories with single commit
- Handle network timeouts during clone
- Handle corrupted Git repositories
- Handle deleted/moved repositories
- Support all Git URL formats (HTTPS, SSH, .git suffix optional; see the parsing sketch after this list)
- Clean up temp directories on error
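As referenced above, a minimal URL parser covering HTTPS and SSH remotes with an optional .git suffix could look like the following (the regexes are illustrative and would need tests for self-hosted hosts and GitLab subgroups; RepositoryUrlInfo is defined in the Shared Types section below):

import type { RepositoryUrlInfo } from '@gitray/shared-types';

// Normalizes HTTPS and SSH remotes, e.g.
//   https://github.com/octo-org/analytics-pro.git
//   git@gitlab.com:group/project
export function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo {
  const match =
    repoUrl.match(/^https?:\/\/([^/]+)\/([^/]+)\/([^/]+?)(?:\.git)?\/?$/) ??
    repoUrl.match(/^(?:ssh:\/\/)?git@([^:/]+)[:/]([^/]+)\/([^/]+?)(?:\.git)?$/);
  if (!match) {
    throw new Error(`Unsupported repository URL: ${repoUrl}`);
  }
  const [, host, owner, name] = match;
  const platform = host.includes('github')
    ? 'github'
    : host.includes('gitlab')
      ? 'gitlab'
      : host.includes('bitbucket')
        ? 'bitbucket'
        : 'other';
  return { platform, owner, name, fullUrl: repoUrl };
}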
Type Safety
- Add RepositorySummary interface to @gitray/shared-types
- Add RepositoryStatus type: 'active' | 'inactive' | 'archived' | 'empty'
- Add DataSource type: 'git-sparse-clone' | 'git+github-api' | 'cache'
- Add CreatedDateSource type: 'first-commit' | 'github-api' | 'gitlab-api'
🛠️ Implementation Plan
1. New Service: repositorySummaryService.ts
Location: apps/backend/src/services/repositorySummaryService.ts
Core Functions:
// Main orchestrator - Sparse clone approach
async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary>

// Parse repository URL
function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo

// Sparse clone for metadata extraction (NEW - based on fileAnalysisService pattern)
async function getSummaryViaSparseClone(repoUrl: string): Promise<{
  firstCommit: string;
  lastCommit: { date: string; sha: string; author: string };
  totalCommits: number;
  contributors: number;
  tempDir: string;
}>

// Git operations (can be run on sparse clone)
async function getFirstCommitDate(repoPath: string): Promise<string>
async function getLastCommitInfo(repoPath: string): Promise<{ date: string; sha: string; author: string }>
async function getContributorCount(repoPath: string): Promise<number>

// Calculations
function calculateRepositoryAge(createdDate: Date): AgeInfo
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus

// Cleanup
async function cleanupSparseClone(tempDir: string): Promise<void>

Sparse Clone Implementation (based on existing pattern):
import { promises as fs } from 'fs';
import os from 'os';
import path from 'path';
import { simpleGit } from 'simple-git';

async function performSparseCloneForSummary(repoUrl: string): Promise<string> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-summary-'));
  const git = simpleGit(tempDir);
  try {
    // Initialize repo
    await git.init();
    await git.addRemote('origin', repoUrl);
    // Sparse clone with maximum optimization
    await git.raw([
      'fetch',
      '--filter=blob:none', // Exclude file contents (blobs)
      '--depth=1',          // Only latest commit
      'origin',
      'HEAD'
    ]);
    // Point HEAD at the fetched commit without checking out a working tree
    // (a checkout on a blob:none clone would lazily download blobs)
    await git.raw(['update-ref', 'HEAD', 'FETCH_HEAD']);
    return tempDir;
  } catch (error) {
    await cleanupSparseClone(tempDir);
    throw error;
  }
}

Git Commands on Sparse Clone:
# All of these work on a blobless clone (no file contents needed).
# Note: first-commit date, total commits, and contributor counts require the
# full commit history, so fetch without --depth=1 (or run `git fetch --unshallow`)
# before computing them.
# First commit date (repository "creation")
git log --reverse --format=%aI | head -1
# Last commit info
git log -1 --format='%aI|%H|%an'
# Contributor count
git shortlog -s -n --all | wc -l
# Total commits
git rev-list --count HEAD

2. New Route in repositoryRoutes.ts
router.get(
  '/summary',
  ...repoUrlValidation(),
  handleValidationErrors,
  async (req: Request, res: Response, next: NextFunction) => {
    try {
      const { repoUrl } = req.query;
      const summary = await repositorySummaryService.getRepositorySummary(repoUrl as string);
      res.status(HTTP_STATUS.OK).json(summary);
    } catch (error) {
      next(error);
    }
  }
);

3. Shared Types in packages/shared-types/src/index.ts
export interface RepositorySummary {
  repository: {
    name: string;
    url: string;
    owner: string;
    platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  };
  created: {
    date: string; // ISO 8601
    source: CreatedDateSource;
  };
  age: {
    years: number;
    months: number;
    formatted: string; // e.g., "5.7y"
  };
  lastCommit: {
    date: string; // ISO 8601
    relativeTime: string; // e.g., "2 days ago"
    sha: string;
    author: string;
  };
  stats: {
    totalCommits: number;
    contributors: number;
    status: RepositoryStatus;
  };
  metadata: {
    cached: boolean;
    dataSource: DataSource;
    createdDateAccuracy: 'exact' | 'approximate';
    bandwidthSaved?: string;
    lastUpdated: string; // ISO 8601
  };
}

export type RepositoryStatus = 'active' | 'inactive' | 'archived' | 'empty';
export type DataSource = 'git-sparse-clone' | 'git+github-api' | 'cache';
export type CreatedDateSource = 'first-commit' | 'github-api' | 'gitlab-api';

export interface RepositoryUrlInfo {
  platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  owner: string;
  name: string;
  fullUrl: string;
}

4. Status Determination Logic
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus {
  const daysSinceLastCommit = differenceInDays(new Date(), lastCommitDate);
  if (daysSinceLastCommit <= 30) return 'active';
  if (daysSinceLastCommit <= 180) return 'inactive';
  return 'archived';
}

Status Rules:
- active: Last commit within 30 days
- inactive: Last commit between 30-180 days
- archived: No commit in 180+ days
- empty: 0 commits
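The age and relative-time fields from the response format can be derived with the date-fns helpers already available in the repo. A sketch of calculateRepositoryAge and a relative-time helper (the AgeInfo shape below is inferred from the response format, not an existing type):

import { differenceInMonths, formatDistanceToNow } from 'date-fns';

// Shape implied by the response format: { years, months, formatted }
interface AgeInfo {
  years: number;
  months: number;
  formatted: string; // e.g. "5.7y"
}

export function calculateRepositoryAge(createdDate: Date): AgeInfo {
  const totalMonths = Math.max(0, differenceInMonths(new Date(), createdDate));
  const years = Math.floor(totalMonths / 12);
  const months = totalMonths % 12;
  return { years, months, formatted: `${(totalMonths / 12).toFixed(1)}y` };
}

// "2 days ago"-style string for lastCommit.relativeTime
export function toRelativeTime(date: Date): string {
  return formatDistanceToNow(date, { addSuffix: true });
}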
5. Caching Implementation
async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary> {
  const cacheKey = `repo:summary:${crypto.createHash('md5').update(repoUrl).digest('hex')}`;

  // Try cache first (24h TTL)
  const cached = await repositoryCache.get(cacheKey);
  if (cached) {
    return { ...cached, metadata: { ...cached.metadata, cached: true } };
  }

  // Perform sparse clone for metadata extraction
  const tempDir = await performSparseCloneForSummary(repoUrl);
  try {
    const summary = await buildSummaryFromSparseClone(tempDir, repoUrl);
    // Cache for 24 hours
    await repositoryCache.set(cacheKey, summary, 86400);
    return summary;
  } finally {
    await cleanupSparseClone(tempDir);
  }
}

📦 Dependencies & Related Code
Existing Services to Leverage
- ✅ apps/backend/src/services/gitService.ts - getCommitCount(tempDir)
- ✅ apps/backend/src/services/repositoryCache.ts - Multi-tier caching
- ✅ apps/backend/src/services/fileAnalysisService.ts - Sparse clone pattern (getFileTreeSparse(), lines 1026-1140)
- ✅ apps/backend/src/utils/gitUtils.ts - shallowClone() utility
Existing Sparse Clone Pattern to Reuse
See apps/backend/src/services/fileAnalysisService.ts:1026-1140:
private async getFileTreeSparse(repoUrl: string): Promise<{...}> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-sparse-'));
  const git = simpleGit(tempDir);
  await git.init();
  await git.addRemote('origin', repoUrl);
  await git.raw(['config', 'core.sparseCheckout', 'true']);
  await git.raw(['fetch', '--filter=blob:none', '--depth=1', 'origin', 'HEAD']);
  await git.raw(['checkout', 'FETCH_HEAD']);
  // ... extract metadata ...
  return { files, commitHash, tempDir };
}

NPM Packages Already Available
- ✅ simple-git - Git operations
- ✅ date-fns - Date calculations (differenceInDays, formatDistanceToNow)
- ✅ crypto (Node.js built-in) - Hash URLs for cache keys
🧪 Testing Requirements
Unit Tests
Location: apps/backend/__tests__/unit/services/repositorySummaryService.unit.test.ts
Test Cases:
- performSparseCloneForSummary() - creates temp dir with sparse clone
- parseRepositoryUrl() - GitHub HTTPS, SSH, GitLab, Bitbucket
- getFirstCommitDate() - with sparse clone repository
- getLastCommitInfo() - parse date, sha, author correctly
- getContributorCount() - count unique authors from sparse clone
- calculateRepositoryAge() - various date ranges
- determineRepositoryStatus() - active/inactive/archived/empty (see the test sketch after this list)
- cleanupSparseClone() - removes temp directory
- Cache hit/miss scenarios
- Error handling (empty repo, corrupted repo, cleanup on error)
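As referenced above, the status tests are straightforward. A sketch assuming Jest-style globals (matching the *.unit.test.ts naming) and the service path from this plan:

import { subDays } from 'date-fns';
import { determineRepositoryStatus } from '../../../src/services/repositorySummaryService';

// Runner and import path are assumptions; adjust to the project's test setup.
describe('determineRepositoryStatus', () => {
  it('returns "active" for a commit within the last 30 days', () => {
    expect(determineRepositoryStatus(subDays(new Date(), 2))).toBe('active');
  });

  it('returns "inactive" for a commit between 30 and 180 days old', () => {
    expect(determineRepositoryStatus(subDays(new Date(), 90))).toBe('inactive');
  });

  it('returns "archived" when there has been no commit for 180+ days', () => {
    expect(determineRepositoryStatus(subDays(new Date(), 365))).toBe('archived');
  });
});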
Integration Tests
Location: apps/backend/__tests__/unit/routes/repositoryRoutes.unit.test.ts
Test Cases:
- GET /api/repositories/summary?repoUrl=... returns 200 OK (see the request sketch after this list)
- Response matches expected schema
- Bandwidth usage is minimal (verify sparse clone used)
- Temp directories cleaned up after request
- Invalid URL returns 400 Bad Request
- Caching works (second request faster)
- Error handling (network timeout, invalid repo)
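As referenced above, a request-level sketch using supertest; the supertest dependency and the ../../../src/app export are assumptions, so wire it up however the existing route tests do:

import request from 'supertest';
// Hypothetical app export; replace with the project's actual Express app import.
import { app } from '../../../src/app';

describe('GET /api/repositories/summary', () => {
  it('returns 200 and a summary for a valid repoUrl', async () => {
    // Note: this performs a real sparse clone of a small public repo.
    const res = await request(app)
      .get('/api/repositories/summary')
      .query({ repoUrl: 'https://github.com/octocat/Hello-World.git' });

    expect(res.status).toBe(200);
    expect(res.body.repository.name).toBe('Hello-World');
    expect(res.body.stats.status).toBeDefined();
  });

  it('returns 400 for an invalid repoUrl', async () => {
    const res = await request(app)
      .get('/api/repositories/summary')
      .query({ repoUrl: 'not-a-url' });

    expect(res.status).toBe(400);
  });
});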
Manual Test Repositories
# Small repo (fast test)
https://github.com/octocat/Hello-World.git
# Medium repo (verify sparse clone efficiency)
https://github.com/facebook/react.git
# Large repo (stress test)
https://github.com/torvalds/linux.git
# GitLab repo
https://gitlab.com/gitlab-org/gitlab.git

Coverage Target
- ≥ 80% code coverage (per AGENTS.md)
- All edge cases covered
- All error paths tested
- Cleanup logic tested
📊 Performance Considerations
Optimization Strategies
- Sparse Clone (95-99% bandwidth reduction)
  - --filter=blob:none excludes all file contents
  - --depth=1 fetches only the latest commit
  - --single-branch fetches only the main branch
  - Based on the proven fileAnalysisService pattern
- Efficient Git Commands
  - git log -1 for the last commit (O(1)); head -1 on a reversed log for the first commit
  - --count flag for efficient commit counting
  - shortlog -s is optimized by Git
  - All commands work on a blobless clone (history-based counts need the full history)
- Aggressive Caching
  - 24-hour TTL (Git data rarely changes)
  - Hash-based cache keys
  - Multi-tier cache (Redis → Disk → Memory)
- Resource Cleanup
  - Temp directories cleaned immediately after use
  - Try-finally blocks ensure cleanup on error
  - No resource leaks
Expected Performance
- First request (sparse clone): 0.5-2 seconds (vs 5-30s for full clone)
- Bandwidth: 1-5 MB (vs 100-500 MB for full clone)
- Cached request: <50ms (cache lookup)
- Large repos (100k commits): Same speed (sparse clone is size-independent for metadata)
Comparison: Full Clone vs Sparse Clone
| Metric | Full Clone | Sparse Clone | Improvement |
|---|---|---|---|
| Bandwidth | 100-500 MB | 1-5 MB | 95-99% |
| Time | 5-30 seconds | 0.5-2 seconds | 75-90% |
| Disk Usage | 100-500 MB | 1-5 MB | 95-99% |
| Metadata Accuracy | 100% | 100% | Same |
✔️ Definition of Done
- New endpoint /api/repositories/summary responds correctly
- All stats from the mockup are returned (name, owner, created, age, commits, contributors, status)
- Sparse clone implementation (not full clone)
- Bandwidth usage verified (should be <5MB for most repos)
- Temp directory cleanup verified (no leaks)
- URL parsing works for GitHub, GitLab, Bitbucket, self-hosted
- First commit date used as "created" date
- Last commit info extracted correctly
- Contributor count accurate
- Status determination logic works
- 24-hour caching implemented
- Unit tests pass with ≥80% coverage
- Integration tests pass
- No regression in existing tests
- Linting passes (pnpm lint)
- Build succeeds (pnpm build)
- Frontend can successfully consume the endpoint
- API documented in README.md
- PR reviewed and approved
🔗 Related Issues
- feat(frontend): UI Redesign Migration to shadcn/ui (#87)
🚀 Future Enhancements (Out of Scope)
Optional GitHub API Enrichment (v2)
- Add env var ENABLE_GITHUB_API=false (default: disabled)
- If the user provides a GITHUB_TOKEN, fetch the exact creation date (see the sketch after this list)
- Only use for single-user deployments
- Gracefully fall back to Git data if the API is unavailable
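A sketch of how this optional enrichment could work, guarded by the env vars above and falling back to the Git-derived date; it uses the GitHub REST API's GET /repos/{owner}/{repo} created_at field and assumes Node 18+ global fetch:

// v2 sketch only: exact created_at from the GitHub REST API, guarded by env vars.
// Falls back to the Git-derived date (first commit) on any failure.
export async function getExactCreatedDate(
  owner: string,
  repo: string,
  gitFallbackDate: string
): Promise<{ date: string; source: 'github-api' | 'first-commit' }> {
  const enabled = process.env.ENABLE_GITHUB_API === 'true';
  const token = process.env.GITHUB_TOKEN;
  if (!enabled || !token) {
    return { date: gitFallbackDate, source: 'first-commit' };
  }
  try {
    const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`, {
      headers: { Authorization: `Bearer ${token}`, Accept: 'application/vnd.github+json' },
    });
    if (!res.ok) throw new Error(`GitHub API responded with ${res.status}`);
    const data = (await res.json()) as { created_at: string };
    return { date: data.created_at, source: 'github-api' };
  } catch {
    return { date: gitFallbackDate, source: 'first-commit' };
  }
}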
Additional Stats (v2)
- Repository size (disk usage)
- Primary language detection
- Branch count
- Recent activity trend (commits per week)
📝 Implementation Notes
Why Sparse Clone?
The codebase already uses sparse cloning in fileAnalysisService.ts for file analysis:
- Proven in production - battle-tested pattern
- 95-99% bandwidth reduction - excludes all file contents
- Metadata extraction - Git commands work perfectly on sparse clones
- No full clone needed - tree structure + commit history is sufficient
Why First Commit ≈ Created Date?
In practice, first commit date is functionally equivalent to repository creation date:
- Most repos have first commit within hours of creation
- Users can't distinguish the difference in the UI
- Avoids rate limiting entirely
- Works universally across all Git platforms
Alternative Approaches Considered
❌ Full Clone: Wastes 95-99% bandwidth downloading file contents we don't need
❌ GitHub API Primary: Rate limiting kills multi-user deployments
❌ Scraping: Fragile, violates ToS, slower than Git
✅ Sparse Clone + Git-First: Fast, scalable, reliable, minimal bandwidth
Estimated Effort: Medium (8-12 hours)
Priority: High (blocks frontend redesign completion)
Complexity: Medium
Risk: Low (reuses existing sparse clone infrastructure)