Python-based ETL pipeline using Render Workflows SDK for distributed task execution.
The workflow orchestrates parallel data collection from GitHub's API, processes repositories through a 3-layer data pipeline, and stores results in PostgreSQL for analytics.
```
main_analysis_task (orchestrator)
├── fetch_language_repos (Python)
│   └── analyze_repo_batch
├── fetch_language_repos (TypeScript)
│   └── analyze_repo_batch
├── fetch_language_repos (Go)
│   └── analyze_repo_batch
├── fetch_render_repos
│   └── analyze_repo_batch
└── aggregate_results (ETL: Staging → Analytics)
    └── cleanup_old_data
```
Orchestrator - Spawns 4 parallel tasks (3 languages + Render), then aggregates results.
Execution modes:
- Production: Full pipeline (all languages + Render)
- Dev mode: Python only (`DEV_MODE=true` + `DEV_REPO_LIMIT=5`)
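The fan-out/fan-in shape is easiest to see as code. A minimal sketch (task names come from the tree above; the Workflows SDK's actual spawn/await API may differ from plain asyncio, and the body is illustrative, not the real implementation):

```python
import asyncio
import os
import time

# Sketch of the orchestrator: fan out to the four fetch tasks, then aggregate.
async def main_analysis_task() -> dict:
    started = time.monotonic()
    if os.getenv("DEV_MODE") == "true":
        results = [await fetch_language_repos("Python")]  # dev mode: Python only
    else:
        results = await asyncio.gather(
            fetch_language_repos("Python"),
            fetch_language_repos("TypeScript"),
            fetch_language_repos("Go"),
            fetch_render_repos(),
        )
    cleanup_stats = await aggregate_results(results)  # staging -> analytics + cleanup
    return {
        "repos_processed": sum(len(r["repos"]) for r in results),
        "execution_time": time.monotonic() - started,
        "cleanup_stats": cleanup_stats,
        "trace_id": "...",  # supplied by WorkflowTrace in the real task
        "success": True,
    }
```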
Returns:
```python
{
    'repos_processed': int,
    'execution_time': float,
    'cleanup_stats': dict,
    'trace_id': str,
    'success': bool
}
```

Parallel task - Fetches trending repos for Python, TypeScript, or Go.
Process:
- Search GitHub API (repos updated in last 30 days, created in last 180 days)
- Take top 25 repos (sorted by stars)
- Fetch READMEs in parallel
- Store in raw layer
- Spawn `analyze_repo_batch` subtask
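The search and README steps above can be sketched with aiohttp, the project's HTTP client. The endpoints and search qualifiers are standard GitHub API usage; the helper names are assumptions, and auth headers are omitted for brevity:

```python
import asyncio
from datetime import datetime, timedelta, timezone

import aiohttp

GITHUB_API = "https://api.github.com"

async def search_trending(session: aiohttp.ClientSession, language: str, limit: int = 25):
    """Repos updated in the last 30 days, created in the last 180, sorted by stars."""
    now = datetime.now(timezone.utc)
    pushed = (now - timedelta(days=30)).strftime("%Y-%m-%d")
    created = (now - timedelta(days=180)).strftime("%Y-%m-%d")
    params = {
        "q": f"language:{language} pushed:>={pushed} created:>={created}",
        "sort": "stars",
        "order": "desc",
        "per_page": str(limit),
    }
    async with session.get(f"{GITHUB_API}/search/repositories", params=params) as resp:
        resp.raise_for_status()
        data = await resp.json()
    return data["items"]

async def fetch_readme(session: aiohttp.ClientSession, full_name: str) -> str | None:
    """Fetch a repo's README as raw text; return None on 404 instead of failing."""
    url = f"{GITHUB_API}/repos/{full_name}/readme"
    headers = {"Accept": "application/vnd.github.raw+json"}
    async with session.get(url, headers=headers) as resp:
        if resp.status == 404:
            return None
        resp.raise_for_status()
        return await resp.text()

async def fetch_readmes(session: aiohttp.ClientSession, repos: list) -> list:
    # READMEs are fetched concurrently to minimize API latency
    return await asyncio.gather(*(fetch_readme(session, r["full_name"]) for r in repos))
```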
Returns:
```python
{
    'repos': List[Dict],     # Enriched repo data
    'subtasks': List[Dict],  # Timing metadata
    'started_at': str,       # ISO timestamp
    'completed_at': str
}
```

Parallel task - Discovers Render projects via code search.
Detection strategy:
- Code search for `render.yaml` in repository root only
- Repos created within last 18 months
- Target: 25 projects
Process:
- Search GitHub for `filename:render.yaml path:/ language:yaml`
- Fetch READMEs in parallel
- Store in raw layer (assigned `language='render'`)
- Spawn `analyze_repo_batch` subtask
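Discovery goes through GitHub's code-search endpoint rather than repository search, so results must be de-duplicated back to repositories. A sketch (authentication, again, omitted; code search requires an authenticated token):

```python
import aiohttp

async def search_render_projects(session: aiohttp.ClientSession, limit: int = 25):
    """Code-search for render.yaml at repo root, de-duplicate to repositories."""
    params = {"q": "filename:render.yaml path:/ language:yaml", "per_page": "100"}
    async with session.get("https://api.github.com/search/code", params=params) as resp:
        resp.raise_for_status()
        data = await resp.json()

    seen, repos = set(), []
    for item in data["items"]:
        repo = item["repository"]  # each code hit carries its containing repository
        if repo["full_name"] not in seen:
            seen.add(repo["full_name"])
            repos.append(repo)
        if len(repos) >= limit:
            break
    # The 18-month creation window is applied afterward from repo metadata,
    # since code search does not support a created: qualifier (assumption).
    return repos
```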
Subtask - Analyzes repos in batches of 10.
Per repository:
- Validate required fields (full_name, language, created_at, updated_at)
- Build enriched data structure
- Parse ISO datetime strings to timezone-aware objects
- Store in staging layer (`stg_repos_validated`)
Error handling: Continues on individual repo failures (logged but not fatal)
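A sketch of that per-repo validation and datetime handling (field names come from the list above; `validate_and_enrich` is a hypothetical helper name):

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ("full_name", "language", "created_at", "updated_at")

def validate_and_enrich(repo: dict) -> dict | None:
    """Return an enriched record, or None (logged, not fatal) on validation failure."""
    missing = [f for f in REQUIRED_FIELDS if not repo.get(f)]
    if missing:
        logger.warning("Skipping %s: missing %s", repo.get("full_name"), missing)
        return None
    try:
        # GitHub timestamps look like '2026-02-02T12:00:00Z'; normalize the
        # trailing 'Z' so fromisoformat yields timezone-aware datetimes.
        created = datetime.fromisoformat(repo["created_at"].replace("Z", "+00:00"))
        updated = datetime.fromisoformat(repo["updated_at"].replace("Z", "+00:00"))
    except ValueError as exc:
        logger.warning("Skipping %s: bad timestamp (%s)", repo["full_name"], exc)
        return None
    return {**repo, "created_at": created, "updated_at": updated}
```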
ETL pipeline - Extracts from staging, calculates scores, loads to analytics.
Process:
- Extract top 50 repos per language (balanced)
- Extract ALL Render repos (`language='render'`)
- Calculate momentum scores (70% recency + 30% normalized stars)
- Load to analytics layer: `dim_repositories` (upsert with SCD Type 2) and `fact_repo_snapshots` (daily snapshot with momentum score)
- Run data retention cleanup
Scoring formula:

```python
recency_score = exponential_decay(repo_age_days)  # 1.0 for ≤14 days, decay to 0.01
normalized_stars = stars / max_stars_in_category
momentum_score = (recency_score * 0.7) + (normalized_stars * 0.3)
```
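In Python that works out to the following (a sketch: the decay constant is an assumption chosen so the score bottoms out near the 180-day search window; only the 14-day plateau and the 0.01 floor come from the formula above):

```python
import math

# Decay constant so exp(-k * (180 - 14)) == 0.01 (assumption, not from the source)
DECAY_RATE = math.log(100) / (180 - 14)

def recency_score(repo_age_days: float) -> float:
    """1.0 for repos up to 14 days old, exponential decay toward 0.01 after that."""
    if repo_age_days <= 14:
        return 1.0
    return max(0.01, math.exp(-DECAY_RATE * (repo_age_days - 14)))

def momentum_score(repo_age_days: float, stars: int, max_stars_in_category: int) -> float:
    normalized_stars = stars / max(max_stars_in_category, 1)  # guard division by zero
    return recency_score(repo_age_days) * 0.7 + normalized_stars * 0.3
```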
Maintenance task - Applies tiered data retention policy.
Retention windows:
- Raw layer: 7 days
- Staging layer: 7 days
- Analytics layer: 30 days
Error handling: Failures logged but don't break workflow
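With asyncpg, the tiered cleanup reduces to one parameterized delete per layer (a sketch; table names follow the data-flow diagram below, but the `created_at` timestamp column is an assumption):

```python
import logging

import asyncpg

logger = logging.getLogger(__name__)

# (table, retention in days) per layer
RETENTION = [
    ("raw_github_repos", 7),      # raw layer
    ("stg_repos_validated", 7),   # staging layer
    ("fact_repo_snapshots", 30),  # analytics layer
]

async def cleanup_old_data(pool: asyncpg.Pool) -> dict:
    stats = {}
    for table, days in RETENTION:
        try:
            result = await pool.execute(
                f"DELETE FROM {table} WHERE created_at < now() - ($1::int * interval '1 day')",
                days,
            )
            stats[table] = result  # asyncpg returns a status tag, e.g. 'DELETE 42'
        except Exception:
            logger.exception("cleanup failed for %s", table)  # logged, never fatal
            stats[table] = "error"
    return stats
```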
| File | Purpose |
|---|---|
| `workflow.py` | Main workflow entry point; holds the app instance and calls `app.start()` |
| `app.py` | Shared Workflows app instance (imported by all task modules) |
| `github_api.py` | Async GitHub API client (search, fetch) |
| `connections.py` | Shared resource management (DB pool, HTTP session) |
| `auth_setup.py` | Interactive GitHub auth token generator |
| `tasks/main_task.py` | Main orchestrator task (spawns parallel subtasks, aggregates) |
| `tasks/language_tasks.py` | `fetch_language_repos` and `fetch_render_repos` tasks |
| `tasks/batch_analysis.py` | `analyze_repo_batch` subtask (validates, enriches, stores to staging) |
| `etl/extract.py` | Raw layer ingestion (`store_raw_repos`) and staging extraction |
| `etl/cleanup.py` | Staging layer storage (`store_in_staging`) and data retention cleanup |
| `etl/load.py` | Analytics layer loading (fact/dim inserts, momentum scoring) |
| `etl/aggregate.py` | ETL orchestration (staging → analytics pipeline) |
| `lib/encryption.py` | Fernet encryption/decryption for GitHub tokens |
| `lib/oauth_manager.py` | GitHub OAuth token lifecycle management |
| `utils/` | Shared helpers (chunking, tracing, connection init) |
| `requirements.txt` | Python dependencies |
```bash
# Required
DATABASE_URL=postgresql://...

# GitHub Authentication (see Authentication Setup below)
GITHUB_TOKEN_ENCRYPTION_KEY=<generated-key>  # Auto-generated by auth_setup.py

# GitHub OAuth (managed automatically, optional for initial seed)
GITHUB_CLIENT_ID=<your-oauth-app-client-id>
GITHUB_CLIENT_SECRET=<your-oauth-app-client-secret>
GITHUB_ACCESS_TOKEN=ghu_...   # Optional: Only for initial DB seed
GITHUB_REFRESH_TOKEN=ghr_...  # Optional: Only for initial DB seed

# OR: Personal Access Token (PAT) - simpler alternative
GITHUB_ACCESS_TOKEN=ghp_...   # No expiration, no refresh needed

# Optional (development)
DEV_MODE=true     # Runs Python task only
DEV_REPO_LIMIT=5  # Processes 5 repos instead of 25

# Optional (local dev)
RENDER_USE_LOCAL_DEV=true
RENDER_LOCAL_DEV_URL=http://localhost:8120
```

```bash
python ../bin/local_dev.py
```

```bash
# Terminal 1: Start task server
cd workflows
pip install -r requirements.txt
python workflow.py # Listens on port 8120
# Terminal 2: Trigger workflow
cd trigger
python trigger.py
```

Workflows are triggered via:
- Cron job: Daily at 14:00 UTC (6 AM PST)
- Manual: Dashboard → Workflows → trender-wf → Run Task
- API: `python trigger/trigger.py`
Check workflow status:
```bash
# Via logs
render logs --service trender-wf

# Via database
psql $DATABASE_URL -c "SELECT * FROM fact_workflow_runs ORDER BY started_at DESC LIMIT 1;"
```

Inspect trace tree:

```sql
-- Query the task_tree JSONB column
SELECT run_id, status, task_tree FROM fact_workflow_runs
WHERE run_id = 'wfr_...';
```

Common issues:
| Issue | Solution |
|---|---|
| Auth errors / 401 | Run `python auth_setup.py` to initialize credentials in the database |
| Connection refused | Check `DATABASE_URL` is correct |
| Rate limit errors | Verify OAuth token is valid; token auto-refreshes daily via `refresh_auth.py` |
| No repos returned | GitHub API may be rate limited or down |
```
GitHub API
    ↓
raw_github_repos (JSONB)
    ↓
stg_repos_validated (cleaned)
    ↓
dim_repositories + fact_repo_snapshots (dimensional model)
    ↓
Dashboard views
```
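The dim/fact step of that flow can be sketched with asyncpg (column and key names are assumptions; the real logic lives in etl/load.py):

```python
import asyncpg

async def upsert_dim_repository(conn: asyncpg.Connection, repo: dict) -> None:
    """SCD Type 2 upsert: close the current row on change, then insert a new version."""
    async with conn.transaction():
        current = await conn.fetchrow(
            "SELECT repo_key, language FROM dim_repositories"
            " WHERE full_name = $1 AND is_current",
            repo["full_name"],
        )
        if current and current["language"] == repo["language"]:
            return  # nothing changed: keep the current version row
        if current:
            # Attribute changed: close out the old version
            await conn.execute(
                "UPDATE dim_repositories SET valid_to = now(), is_current = false"
                " WHERE repo_key = $1",
                current["repo_key"],
            )
        await conn.execute(
            "INSERT INTO dim_repositories (full_name, language, valid_from, is_current)"
            " VALUES ($1, $2, now(), true)",
            repo["full_name"], repo["language"],
        )
```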
- Execution time: ~90 seconds for ~100 repos
- Parallelism: 4 concurrent tasks (25 repos each)
- Batching: Repos analyzed in batches of 10
- README fetching: Parallel requests to minimize API latency
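The batching above typically comes from a small chunking helper of the kind kept in utils/ (a sketch, not the project's actual helper):

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: List[T], size: int = 10) -> Iterator[List[T]]:
    """Yield successive fixed-size batches; the last one may be smaller."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage: for batch in chunked(repos, 10): spawn analyze_repo_batch(batch)
```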
```bash
# Fast iteration: Dev mode (Python only, 5 repos)
export DEV_MODE=true
export DEV_REPO_LIMIT=5
python workflow.py

# Full pipeline locally
unset DEV_MODE
python ../bin/local_dev.py
```

The `WorkflowTrace` class tracks execution with hierarchical task timing:
```python
{
    'name': 'main_analysis_task',
    'started_at': '2026-02-02T12:00:00Z',
    'completed_at': '2026-02-02T12:00:15Z',
    'status': 'completed',
    'children': [
        {'name': 'fetch_language_repos', 'language': 'Python', ...},
        {'name': 'fetch_language_repos', 'language': 'TypeScript', ...},
        ...
    ]
}
```

Stored in `fact_workflow_runs` for observability.
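A minimal version of that class might look like this (a sketch matching the JSON shape above; the real implementation lives in utils/):

```python
from datetime import datetime, timezone

class WorkflowTrace:
    """Hierarchical task timing; serializes to the dict stored in fact_workflow_runs."""

    def __init__(self, name: str, **meta):
        self.name = name
        self.meta = meta  # e.g. language='Python'
        self.started_at = datetime.now(timezone.utc)
        self.completed_at = None
        self.status = "running"
        self.children: list["WorkflowTrace"] = []

    def child(self, name: str, **meta) -> "WorkflowTrace":
        node = WorkflowTrace(name, **meta)
        self.children.append(node)
        return node

    def complete(self, status: str = "completed") -> None:
        self.completed_at = datetime.now(timezone.utc)
        self.status = status

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            **self.meta,
            "started_at": self.started_at.isoformat(),
            "completed_at": self.completed_at.isoformat() if self.completed_at else None,
            "status": self.status,
            "children": [c.to_dict() for c in self.children],
        }
```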
The workflow requires GitHub API authentication. Run the setup script:
```bash
cd workflows
python auth_setup.py
```

The script will:
- Auto-generate an encryption key (`GITHUB_TOKEN_ENCRYPTION_KEY`) for secure token storage
- Walk you through OAuth or PAT setup
- Save credentials to the database (OAuth only) for automatic refresh
**Option 1: Personal Access Token (PAT)**

- ✅ Simple setup (one token)
- ✅ No expiration (or you control it)
- ✅ Same API rate limits (5,000 requests/hour)
- Generate at: https://github.com/settings/tokens/new
- Required scopes: `repo`, `read:org`
**Option 2: GitHub OAuth App**

- ✅ Tokens auto-refresh every 8 hours (zero manual intervention)
- ✅ Encrypted storage in PostgreSQL (Fernet encryption)
- ✅ Self-sustaining chain - refresh tokens renew every 6 months
- 🔒 Requires one-time setup: OAuth app + encryption key
- Use case: Production deployments, long-running workflows
How OAuth auto-refresh works:
- First run: Loads tokens from DB (or seeds from env vars if DB empty)
- Every run: Checks expiry → refreshes if needed → persists to DB
- Mid-run 401 errors: Automatically refreshes and retries
- You never touch tokens again, as long as the workflow runs at least once every 6 months
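The encrypted-storage side of this is standard Fernet from the cryptography package; the core of lib/encryption.py presumably reduces to something like this sketch:

```python
import os

from cryptography.fernet import Fernet

def _fernet() -> Fernet:
    # GITHUB_TOKEN_ENCRYPTION_KEY is the key auth_setup.py generates
    return Fernet(os.environ["GITHUB_TOKEN_ENCRYPTION_KEY"].encode())

def encrypt_token(token: str) -> bytes:
    return _fernet().encrypt(token.encode())

def decrypt_token(ciphertext: bytes) -> str:
    return _fernet().decrypt(ciphertext).decode()

# Key generation (what auth_setup.py does under the hood):
# Fernet.generate_key() -> url-safe base64 bytes suitable for the env var
```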
Required for Render (one-time):
- `GITHUB_TOKEN_ENCRYPTION_KEY` (auto-generated by `auth_setup.py`)
- `GITHUB_CLIENT_ID` and `GITHUB_CLIENT_SECRET` (from OAuth app)
- `DATABASE_URL` (already configured)
Optional for first run only:
- `GITHUB_ACCESS_TOKEN` and `GITHUB_REFRESH_TOKEN` (seeds the DB, then auto-managed)
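The seed-then-auto-manage behavior can be sketched like this (the `github_credentials` table and its columns are hypothetical; the real logic lives in lib/oauth_manager.py and would also apply the encryption helpers above):

```python
import os

import asyncpg

async def load_or_seed_tokens(conn: asyncpg.Connection) -> dict:
    """Load OAuth tokens from the DB; seed from env vars on the very first run."""
    row = await conn.fetchrow(
        "SELECT access_token, refresh_token FROM github_credentials LIMIT 1"
    )
    if row:
        return dict(row)  # decrypt_token() would be applied here in the real code
    access = os.getenv("GITHUB_ACCESS_TOKEN")
    refresh = os.getenv("GITHUB_REFRESH_TOKEN")
    if not (access and refresh):
        raise RuntimeError("DB empty and no seed tokens in environment")
    await conn.execute(
        "INSERT INTO github_credentials (access_token, refresh_token) VALUES ($1, $2)",
        access, refresh,
    )
    return {"access_token": access, "refresh_token": refresh}
```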
See `requirements.txt`:

- `render-sdk`: Workflows SDK with `@app.task` decorators
- `asyncpg`: Async PostgreSQL driver
- `aiohttp`: Async HTTP client
- `python-dotenv`: Environment variable management
- Make changes to workflow logic
- Test locally: `python ../bin/local_dev.py`
- Deploy: Push to GitHub (auto-deploy enabled)
- Monitor: Dashboard → Workflows → Logs