Add Repository Summary Stats API Endpoint (Git-First Approach) #117

@jonasyr

Description


Add Repository Summary Stats API Endpoint for Frontend UI Redesign

📋 Summary

Implement a new backend API endpoint /api/repositories/summary that provides comprehensive repository statistics for the frontend UI redesign. The endpoint uses a Git-first approach with a sparse, blobless clone to avoid GitHub API rate limits while minimizing bandwidth and disk usage.

🎯 Motivation

The frontend redesign (issue #87) includes a "Repository Summary" card displaying key repository metrics:

  • Repository name and owner
  • Created date and age
  • Last commit date
  • Total commits count
  • Contributors count
  • Repository status (active/inactive)

Currently, these values are mocked. The backend needs to expose this data through a dedicated endpoint that primarily uses Git operations to avoid rate limiting problems in multi-user deployments.

🚨 Why Git-First + Sparse Clone?

Rate Limit Problem:

  • GitHub API: 60 requests/hour per IP (unauthenticated)
  • All users share the same backend server IP
  • With 10 concurrent users, a handful of requests each exhausts the shared quota within minutes

Git-First Solution:

  • ✅ No rate limits - unlimited local Git operations
  • ✅ No external dependencies - works offline
  • ✅ Fast - local operations are instant
  • ✅ Scalable - works for unlimited users
  • ✅ Privacy - no data sent to external APIs
  • ✅ Universal - works for GitHub, GitLab, Bitbucket, self-hosted

Sparse Clone Optimization (95-99% bandwidth reduction):

  • Uses existing getFileTreeSparse() pattern from fileAnalysisService.ts
  • ✅ Downloads ONLY commit and tree metadata, no file contents (--filter=blob:none)
  • ✅ Full commit history kept (no --depth=1): the total commit count, contributor count, and first-commit date all require it, and history objects are tiny compared to blobs
  • ✅ Single branch (--single-branch)
  • ✅ No checkout (--no-checkout)
  • Existing infrastructure: Already battle-tested in production

Trade-offs:

  • "Created date" is approximate (first commit date) vs exact (GitHub API) → Acceptable for 99% of use cases
  • Requires a minimal Git clone instead of zero network calls → still 95-99% less data than a full clone

✅ Requirements / Acceptance Criteria

API Endpoint

  • Create GET /api/repositories/summary endpoint
  • Accept repoUrl as query parameter
  • Return structured JSON with all required stats
  • Implement proper validation and error handling
  • Add rate limiting (reuse existing middleware)

Data Extraction (Sparse Clone Optimization)

  • Use a blobless clone with --filter=blob:none --single-branch --no-checkout (full history is needed for commit counts and the first-commit date)
  • Parse repository URL to extract platform, owner, and repo name
  • Get first commit date via git log --reverse --format=%aI | head -1 (note: --max-count=1 cannot be combined with --reverse, because commit limiting is applied before the reversal)
  • Get last commit info via git log -1 --format='%aI|%H|%an' (format string quoted so | is not treated as a shell pipe)
  • Count total commits via git rev-list --count HEAD
  • Count unique contributors via git shortlog -s -n --all | wc -l
  • Calculate repository age from first commit date
  • Determine status based on last commit recency
  • Clean up temp directory after extraction
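
The URL-parsing step above might look like the following sketch (the regexes, platform detection, and error messages are illustrative assumptions, not the final implementation):

```typescript
// Hypothetical sketch of the URL-parsing step. Accepts HTTPS and SSH
// forms, with the .git suffix optional, as required above.
interface RepositoryUrlInfo {
  platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  owner: string;
  name: string;
  fullUrl: string;
}

function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo {
  // SSH form: git@host:owner/repo(.git); HTTPS form: https://host/owner/repo(.git)
  const sshMatch = repoUrl.match(/^git@([^:]+):(.+)$/);
  const httpsMatch = repoUrl.match(/^https?:\/\/([^/]+)\/(.+)$/);
  const match = sshMatch ?? httpsMatch;
  if (!match) throw new Error(`Unsupported repository URL: ${repoUrl}`);

  const [, host, rawPath] = match;
  const [owner, repo] = rawPath.replace(/\.git$/, '').split('/');
  if (!owner || !repo) throw new Error(`Cannot extract owner/name from: ${repoUrl}`);

  // Platform detection by hostname substring; anything else is 'other'.
  const platform = host.includes('github') ? 'github'
    : host.includes('gitlab') ? 'gitlab'
    : host.includes('bitbucket') ? 'bitbucket'
    : 'other';

  return { platform, owner, name: repo, fullUrl: repoUrl };
}
```

Nested GitLab groups (host/group/subgroup/repo) would need extra handling; the sketch assumes the common owner/name layout.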

Response Format

{
  "repository": {
    "name": "analytics-pro",
    "url": "https://github.com/octo-org/analytics-pro.git",
    "owner": "octo-org",
    "platform": "github"
  },
  "created": {
    "date": "2019-06-14T10:30:00.000Z",
    "source": "first-commit"
  },
  "age": {
    "years": 5,
    "months": 5,
    "formatted": "5.7y"
  },
  "lastCommit": {
    "date": "2025-11-17T14:23:00.000Z",
    "relativeTime": "2 days ago",
    "sha": "abc123def456",
    "author": "John Doe"
  },
  "stats": {
    "totalCommits": 3482,
    "contributors": 24,
    "status": "active"
  },
  "metadata": {
    "cached": true,
    "dataSource": "git-sparse-clone",
    "createdDateAccuracy": "approximate",
    "bandwidthSaved": "95-99% vs full clone",
    "lastUpdated": "2025-11-19T10:30:00.000Z"
  }
}

Caching Strategy

  • Cache summary data for 24 hours (Git data is stable)
  • Use cache key: repo:summary:${hash(repoUrl)}
  • Integrate with existing multi-tier cache (Redis → Disk → Memory)
  • Use repositoryCoordinator to prevent duplicate clones
  • Cache remains valid until new commits (commit-hash based validation)
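
The key derivation above needs nothing beyond Node's built-in crypto module; a minimal sketch (the function name summaryCacheKey is hypothetical):

```typescript
import { createHash } from 'node:crypto';

// Build the `repo:summary:${hash(repoUrl)}` cache key described above.
// md5 is used purely for key addressing here, not for security.
function summaryCacheKey(repoUrl: string): string {
  return `repo:summary:${createHash('md5').update(repoUrl).digest('hex')}`;
}
```

Note that hashing the raw URL gives HTTPS and SSH spellings of the same repository separate cache entries; normalizing the URL before hashing would avoid that.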

Edge Cases

  • Handle empty repositories (0 commits)
  • Handle very large repositories (>100k commits) efficiently
  • Handle repositories with single commit
  • Handle network timeouts during clone
  • Handle corrupted Git repositories
  • Handle deleted/moved repositories
  • Support all Git URL formats (HTTPS, SSH, .git suffix optional)
  • Clean up temp directories on error

Type Safety

  • Add RepositorySummary interface to @gitray/shared-types
  • Add RepositoryStatus type: 'active' | 'inactive' | 'archived' | 'empty'
  • Add DataSource type: 'git-sparse-clone' | 'git+github-api' | 'cache'
  • Add CreatedDateSource type: 'first-commit' | 'github-api' | 'gitlab-api'

🛠️ Implementation Plan

1. New Service: repositorySummaryService.ts

Location: apps/backend/src/services/repositorySummaryService.ts

Core Functions:

// Main orchestrator - Sparse clone approach
async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary>

// Parse repository URL
function parseRepositoryUrl(repoUrl: string): RepositoryUrlInfo

// Sparse clone for metadata extraction (NEW - based on fileAnalysisService pattern)
async function getSummaryViaSparseClone(repoUrl: string): Promise<{
  firstCommit: string;
  lastCommit: {date: string, sha: string, author: string};
  totalCommits: number;
  contributors: number;
  tempDir: string;
}>

// Git operations (can be run on sparse clone)
async function getFirstCommitDate(repoPath: string): Promise<string>
async function getLastCommitInfo(repoPath: string): Promise<{ date: string; sha: string; author: string }>
async function getContributorCount(repoPath: string): Promise<number>

// Calculations
function calculateRepositoryAge(createdDate: Date): AgeInfo
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus

// Cleanup
async function cleanupSparseClone(tempDir: string): Promise<void>
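
calculateRepositoryAge from the list above could be sketched as follows (the month arithmetic and the one-decimal "Ny" formatting are assumptions about the final rounding rules):

```typescript
interface AgeInfo {
  years: number;
  months: number;
  formatted: string; // e.g. "6.4y"
}

// Whole years plus remaining months between the first commit and now,
// with a compact decimal form for the UI. `now` is injectable for tests.
function calculateRepositoryAge(createdDate: Date, now: Date = new Date()): AgeInfo {
  const totalMonths =
    (now.getUTCFullYear() - createdDate.getUTCFullYear()) * 12 +
    (now.getUTCMonth() - createdDate.getUTCMonth()) -
    // Not a full month yet if the day-of-month hasn't been reached.
    (now.getUTCDate() < createdDate.getUTCDate() ? 1 : 0);
  return {
    years: Math.floor(totalMonths / 12),
    months: totalMonths % 12,
    formatted: `${(totalMonths / 12).toFixed(1)}y`,
  };
}
```

With the example dates from the response format above (first commit 2019-06-14, now 2025-11-17) this yields 6 years, 5 months, "6.4y".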

Sparse Clone Implementation (based on existing pattern):

import os from 'node:os';
import path from 'node:path';
import { promises as fs } from 'node:fs';
import { simpleGit } from 'simple-git';

async function performSparseCloneForSummary(repoUrl: string): Promise<string> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-summary-'));
  const git = simpleGit(tempDir);
  
  try {
    // Initialize an empty repo and point it at the remote
    await git.init();
    await git.addRemote('origin', repoUrl);
    
    // Blobless fetch: full commit history, zero file contents.
    // Deliberately NOT --depth=1: the commit count, contributor count,
    // and first-commit date all need the full history, and history
    // objects are tiny compared to blobs.
    await git.raw([
      'fetch',
      '--filter=blob:none',  // Exclude file contents (blobs)
      'origin',
      'HEAD'
    ]);
    
    // Set HEAD without materializing a working tree (keeps the
    // --no-checkout promise); log/rev-list/shortlog all work on HEAD.
    await git.raw(['update-ref', 'HEAD', 'FETCH_HEAD']);
    
    return tempDir;
  } catch (error) {
    await cleanupSparseClone(tempDir);
    throw error;
  }
}

Git Commands on Sparse Clone:

# All of these work on the blobless clone (no file contents needed):

# First commit date (repository "creation")
# Note: -n/--max-count is applied BEFORE --reverse, so
# `git log --reverse --max-count=1` would return the *latest* commit.
git log --reverse --format=%aI | head -1

# Last commit info (format string quoted so | is not a shell pipe)
git log -1 --format='%aI|%H|%an'

# Contributor count
git shortlog -s -n --all | wc -l

# Total commits
git rev-list --count HEAD
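
On the TypeScript side, the pipe-delimited last-commit line can be split back apart; a sketch (parseLastCommitLine is a hypothetical helper name):

```typescript
interface LastCommitInfo {
  date: string;   // ISO 8601 from %aI
  sha: string;    // full hash from %H
  author: string; // author name from %an
}

// Parse one line of `git log -1 --format='%aI|%H|%an'` output.
// Date and sha are '|'-free by construction, so any extra '|'
// characters are folded back into the author name.
function parseLastCommitLine(line: string): LastCommitInfo {
  const [date, sha, ...authorParts] = line.trim().split('|');
  if (!date || !sha || authorParts.length === 0) {
    throw new Error(`Unexpected git log output: ${line}`);
  }
  return { date, sha, author: authorParts.join('|') };
}
```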

2. New Route in repositoryRoutes.ts

router.get(
  '/summary',
  ...repoUrlValidation(),
  handleValidationErrors,
  async (req: Request, res: Response, next: NextFunction) => {
    try {
      const { repoUrl } = req.query;
      const summary = await repositorySummaryService.getRepositorySummary(repoUrl as string);
      res.status(HTTP_STATUS.OK).json(summary);
    } catch (error) {
      next(error);
    }
  }
);

3. Shared Types in packages/shared-types/src/index.ts

export interface RepositorySummary {
  repository: {
    name: string;
    url: string;
    owner: string;
    platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  };
  created: {
    date: string; // ISO 8601
    source: CreatedDateSource;
  };
  age: {
    years: number;
    months: number;
    formatted: string; // e.g., "5.7y"
  };
  lastCommit: {
    date: string; // ISO 8601
    relativeTime: string; // e.g., "2 days ago"
    sha: string;
    author: string;
  };
  stats: {
    totalCommits: number;
    contributors: number;
    status: RepositoryStatus;
  };
  metadata: {
    cached: boolean;
    dataSource: DataSource;
    createdDateAccuracy: 'exact' | 'approximate';
    bandwidthSaved?: string;
    lastUpdated: string; // ISO 8601
  };
}

export type RepositoryStatus = 'active' | 'inactive' | 'archived' | 'empty';
export type DataSource = 'git-sparse-clone' | 'git+github-api' | 'cache';
export type CreatedDateSource = 'first-commit' | 'github-api' | 'gitlab-api';

export interface RepositoryUrlInfo {
  platform: 'github' | 'gitlab' | 'bitbucket' | 'other';
  owner: string;
  name: string;
  fullUrl: string;
}

4. Status Determination Logic

import { differenceInDays } from 'date-fns';

// 'empty' is decided earlier by the caller: a repository with 0 commits
// has no lastCommitDate to compare against.
function determineRepositoryStatus(lastCommitDate: Date): RepositoryStatus {
  const daysSinceLastCommit = differenceInDays(new Date(), lastCommitDate);
  
  if (daysSinceLastCommit <= 30) return 'active';
  if (daysSinceLastCommit <= 180) return 'inactive';
  return 'archived';
}

Status Rules:

  • active: Last commit within 30 days
  • inactive: Last commit between 30-180 days
  • archived: No commit in 180+ days
  • empty: 0 commits

5. Caching Implementation

import crypto from 'node:crypto';

async function getRepositorySummary(repoUrl: string): Promise<RepositorySummary> {
  const cacheKey = `repo:summary:${crypto.createHash('md5').update(repoUrl).digest('hex')}`;
  
  // Try cache first (24h TTL)
  const cached = await repositoryCache.get(cacheKey);
  if (cached) {
    return { ...cached, metadata: { ...cached.metadata, cached: true } };
  }
  
  // Perform sparse clone for metadata extraction
  const tempDir = await performSparseCloneForSummary(repoUrl);
  
  try {
    const summary = await buildSummaryFromSparseClone(tempDir, repoUrl);
    
    // Cache for 24 hours
    await repositoryCache.set(cacheKey, summary, 86400);
    
    return summary;
  } finally {
    await cleanupSparseClone(tempDir);
  }
}

📦 Dependencies & Related Code

Existing Services to Leverage

  • apps/backend/src/services/gitService.ts - getCommitCount(tempDir)
  • apps/backend/src/services/repositoryCache.ts - Multi-tier caching
  • apps/backend/src/services/fileAnalysisService.ts - Sparse clone pattern (getFileTreeSparse() lines 1026-1140)
  • apps/backend/src/utils/gitUtils.ts - shallowClone() utility

Existing Sparse Clone Pattern to Reuse

See apps/backend/src/services/fileAnalysisService.ts:1026-1140:

private async getFileTreeSparse(repoUrl: string): Promise<{...}> {
  const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'gitray-sparse-'));
  const git = simpleGit(tempDir);
  
  await git.init();
  await git.addRemote('origin', repoUrl);
  await git.raw(['config', 'core.sparseCheckout', 'true']);
  await git.raw(['fetch', '--filter=blob:none', '--depth=1', 'origin', 'HEAD']);
  await git.raw(['checkout', 'FETCH_HEAD']);
  
  // ... extract metadata ...
  
  return { files, commitHash, tempDir };
}

NPM Packages Already Available

  • simple-git - Git operations
  • date-fns - Date calculations (differenceInDays, formatDistanceToNow)
  • crypto (Node.js built-in) - Hash URLs for cache keys

🧪 Testing Requirements

Unit Tests

Location: apps/backend/__tests__/unit/services/repositorySummaryService.unit.test.ts

Test Cases:

  • performSparseCloneForSummary() - creates temp dir with sparse clone
  • parseRepositoryUrl() - GitHub HTTPS, SSH, GitLab, Bitbucket
  • getFirstCommitDate() - with sparse clone repository
  • getLastCommitInfo() - parse date, sha, author correctly
  • getContributorCount() - count unique authors from sparse clone
  • calculateRepositoryAge() - various date ranges
  • determineRepositoryStatus() - active/inactive/archived/empty
  • cleanupSparseClone() - removes temp directory
  • Cache hit/miss scenarios
  • Error handling (empty repo, corrupted repo, cleanup on error)

Integration Tests

Location: apps/backend/__tests__/unit/routes/repositoryRoutes.unit.test.ts

Test Cases:

  • GET /api/repositories/summary?repoUrl=... returns 200 OK
  • Response matches expected schema
  • Bandwidth usage is minimal (verify sparse clone used)
  • Temp directories cleaned up after request
  • Invalid URL returns 400 Bad Request
  • Caching works (second request faster)
  • Error handling (network timeout, invalid repo)

Manual Test Repositories

# Small repo (fast test)
https://github.com/octocat/Hello-World.git

# Medium repo (verify sparse clone efficiency)
https://github.com/facebook/react.git

# Large repo (stress test)
https://github.com/torvalds/linux.git

# GitLab repo
https://gitlab.com/gitlab-org/gitlab.git

Coverage Target

  • ≥ 80% code coverage (per AGENTS.md)
  • All edge cases covered
  • All error paths tested
  • Cleanup logic tested

📊 Performance Considerations

Optimization Strategies

  1. Sparse Clone (95-99% bandwidth reduction)

    • --filter=blob:none excludes all file contents
    • Full history is still fetched (commit and contributor counts need it), but history objects are small
    • --single-branch limits the fetch to the default branch
    • Based on proven fileAnalysisService pattern
  2. Efficient Git Commands

    • git log -1 is O(1) for the last commit; the first-commit date costs one history walk
    • git rev-list --count walks the commit graph once (fast even at 100k+ commits, since no blobs are touched)
    • shortlog -s is optimized by Git
    • All commands work on the blobless clone
  3. Aggressive Caching

    • 24-hour TTL (Git data rarely changes)
    • Hash-based cache keys
    • Multi-tier cache (Redis → Disk → Memory)
  4. Resource Cleanup

    • Temp directories cleaned immediately after use
    • Try-finally blocks ensure cleanup on error
    • No resource leaks

Expected Performance

  • First request (sparse clone): 0.5-2 seconds (vs 5-30s for full clone)
  • Bandwidth: 1-5 MB (vs 100-500 MB for full clone)
  • Cached request: <50ms (cache lookup)
  • Large repos (100k commits): still fast; history objects are small and no blobs are transferred

Comparison: Full Clone vs Sparse Clone

Metric            | Full Clone   | Sparse Clone  | Improvement
------------------|--------------|---------------|------------
Bandwidth         | 100-500 MB   | 1-5 MB        | 95-99%
Time              | 5-30 seconds | 0.5-2 seconds | 75-90%
Disk Usage        | 100-500 MB   | 1-5 MB        | 95-99%
Metadata Accuracy | 100%         | 100%          | Same

✔️ Definition of Done

  • New endpoint /api/repositories/summary responds correctly
  • All stats from mockup are returned (name, owner, created, age, commits, contributors, status)
  • Sparse clone implementation (not full clone)
  • Bandwidth usage verified (should be <5MB for most repos)
  • Temp directory cleanup verified (no leaks)
  • URL parsing works for GitHub, GitLab, Bitbucket, self-hosted
  • First commit date used as "created" date
  • Last commit info extracted correctly
  • Contributor count accurate
  • Status determination logic works
  • 24-hour caching implemented
  • Unit tests pass with ≥80% coverage
  • Integration tests pass
  • No regression in existing tests
  • Linting passes (pnpm lint)
  • Build succeeds (pnpm build)
  • Frontend can successfully consume the endpoint
  • API documented in README.md
  • PR reviewed and approved

🔗 Related Issues

  • #87 - Frontend UI redesign (source of the Repository Summary card)

🚀 Future Enhancements (Out of Scope)

Optional GitHub API Enrichment (v2)

  • Add env var ENABLE_GITHUB_API=false (default: disabled)
  • If user provides GITHUB_TOKEN, fetch exact creation date
  • Only use for single-user deployments
  • Gracefully fall back to Git data if the API is unavailable
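
That fallback could be sketched like this (the GitHub endpoint GET /repos/{owner}/{repo} and its created_at field are real; resolveCreatedDate and the injected fetchJson are hypothetical names, chosen so the fallback path is testable without network access):

```typescript
// Injected fetcher so the fallback path can be exercised with a stub.
type FetchJson = (url: string, headers: Record<string, string>) => Promise<{ created_at?: string }>;

// Prefer the exact created_at from the GitHub API when a token is
// configured; otherwise (or on any API failure) use the first-commit date.
async function resolveCreatedDate(
  owner: string,
  repo: string,
  firstCommitDate: string,
  fetchJson: FetchJson,
  token?: string
): Promise<{ date: string; source: 'github-api' | 'first-commit' }> {
  if (token) {
    try {
      const data = await fetchJson(`https://api.github.com/repos/${owner}/${repo}`, {
        Authorization: `Bearer ${token}`,
      });
      if (data.created_at) return { date: data.created_at, source: 'github-api' };
    } catch {
      // API unavailable or rate limited: fall through to Git data.
    }
  }
  return { date: firstCommitDate, source: 'first-commit' };
}
```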

Additional Stats (v2)

  • Repository size (disk usage)
  • Primary language detection
  • Branch count
  • Recent activity trend (commits per week)

📝 Implementation Notes

Why Sparse Clone?

The codebase already uses sparse cloning in fileAnalysisService.ts for file analysis:

  • Proven in production - battle-tested pattern
  • 95-99% bandwidth reduction - excludes all file contents
  • Metadata extraction - Git commands work perfectly on sparse clones
  • No full clone needed - tree structure + commit history is sufficient

Why First Commit ≈ Created Date?

In practice, first commit date is functionally equivalent to repository creation date:

  • Most repos have first commit within hours of creation
  • Users can't distinguish the difference in the UI
  • Avoids rate limiting entirely
  • Works universally across all Git platforms

Alternative Approaches Considered

  • Full Clone: Wastes 95-99% bandwidth downloading file contents we don't need
  • GitHub API Primary: Rate limiting kills multi-user deployments
  • Scraping: Fragile, violates ToS, slower than Git
  • Sparse Clone + Git-First (chosen): Fast, scalable, reliable, minimal bandwidth


Estimated Effort: Medium (8-12 hours)
Priority: High (blocks frontend redesign completion)
Complexity: Medium
Risk: Low (reuses existing sparse clone infrastructure)
