# JIT Token Expiration with Long-Running Sequential Workflows
## Problem Summary

When running GitHub Actions workflows with `max-parallel: 1` and long-running sequential jobs (total runtime > 60 minutes), JIT (Just-In-Time) runner tokens expire after ~60 minutes, causing jobs to fail with "The operation was canceled" errors.

This is a fundamental limitation when:
- Total workflow runtime exceeds the JIT token lifetime (~60 minutes)
- Jobs must run sequentially (`max-parallel: 1`)
- Runners are ephemeral, JIT-configured self-hosted runners
## Environment

- **Runner Platform**: Serverless (Modal/AWS Lambda/Azure Functions/etc.)
- **Runner Type**: Self-hosted with JIT configuration
- **Configuration**:
  - Jobs: N matrix jobs (where N × job_duration > 60 minutes)
  - `max-parallel: 1` (sequential execution)
  - Example: 37 jobs × 6 minutes = 222 minutes total runtime
## Steps to Reproduce

1. Create a workflow with a matrix strategy and `max-parallel: 1`:

   ```yaml
   strategy:
     fail-fast: false
     max-parallel: 1
     matrix:
       job_id: [1, 2, 3, ..., N]  # N jobs where N × 6 minutes > 60 minutes
   ```

2. Use a self-hosted runner with JIT configuration:

   ```python
   # Serverless runner fetches JIT config on webhook receipt
   jit_config = await fetch_jit_config(repo_url, job_id, labels)
   sandbox = modal.Sandbox.create(env={"GHA_JIT_CONFIG": jit_config})
   ```

3. Trigger the workflow with enough jobs that total runtime exceeds 60 minutes.

4. Observe that:
   - Jobs 1-10 complete successfully (~60 minutes)
   - Jobs 11+ fail with "The operation was canceled"
## Expected Behavior

All jobs should complete successfully, with each job getting a fresh JIT token when it starts (not when the webhook is received).
## Actual Behavior
| Job Range | Status | Time Elapsed | JIT Token State |
|---|---|---|---|
| 1-10 | ✅ Success | 0-60 min | Valid |
| 11+ | ❌ Failed | 60+ min | Expired |
Error observed:

```
The operation was canceled.
```
Failed job timing pattern:
- Jobs complete successfully until ~60-minute mark
- Jobs starting after 60 minutes fail immediately or within 2-3 minutes
- Failure occurs exactly at JIT token expiration time
## Root Cause Analysis

### JIT Token Lifecycle

From GitHub documentation and runner source code:

1. **JIT Generation**: When the `generate-jitconfig` API is called, GitHub creates a runner registration with a time-limited token
2. **Token Validity**: ~60 minutes (confirmed via GitHub Community Discussion #25699)
3. **Expiration**: After 60 minutes, GitHub invalidates the runner registration
4. **Job Cancellation**: Any job using that runner gets "The operation was canceled"
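For reference, the token in question comes from GitHub's REST endpoint for creating a just-in-time runner configuration. A minimal sketch of building that request (owner, repo, and runner names are illustrative; a real call would POST this payload with an `Authorization` header):

```python
# Build the request for GitHub's "create configuration for a just-in-time
# runner" endpoint. This only constructs the URL and payload; sending it
# and handling the encoded_jit_config in the response is omitted.
def build_jitconfig_request(owner, repo, runner_name, labels,
                            runner_group_id=1):
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           "/actions/runners/generate-jitconfig")
    payload = {
        "name": runner_name,
        "runner_group_id": runner_group_id,
        "labels": labels,
    }
    return url, payload

url, payload = build_jitconfig_request("octo-org", "octo-repo",
                                       "job-1-runner", ["self-hosted"])
# The encoded_jit_config in the response is what expires ~60 minutes later.
```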
### The Math Problem

```
N jobs × M minutes each = Total runtime
JIT token lifetime      = 60 minutes

If Total runtime > 60 minutes:
  Jobs 1 to floor(60/M):   complete successfully ✅
  Jobs floor(60/M)+1 to N: fail with expired token ❌
```

Example with 6-minute jobs:

```
37 jobs × 6 minutes = 222 minutes total
Jobs 1-10:  complete within the 60-minute window ✅
Jobs 11-37: start after token expiration ❌
```
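The arithmetic above can be sketched as a quick check (a hypothetical helper, not part of any runner code):

```python
# With sequential jobs of M minutes each and a ~60-minute token lifetime,
# only the first floor(60/M) jobs start inside the token window.
def last_successful_job(n_jobs, job_minutes, token_lifetime_min=60):
    return min(n_jobs, token_lifetime_min // job_minutes)

print(last_successful_job(37, 6))  # 10 -> jobs 1-10 succeed, 11-37 fail
```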
### Why the Current Architecture Fails

The serverless runner typically:

1. Receives N webhooks simultaneously when the workflow triggers
2. Fetches N JIT configs immediately (all tokens created at T=0)
3. Spawns N sandboxes/containers (each with a pre-fetched JIT config)
4. Runs jobs sequentially, while the JIT tokens expire at T=60 regardless

**Key issue**: JIT tokens are generated at webhook receipt time, not at job execution time.
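The failure mode above can be sketched in one place. `fetch_jit_config` and `spawn_sandbox` are stand-ins for the platform-specific calls, injected so the timing problem is visible:

```python
# Sketch of the failing webhook handler: every token is minted at webhook
# receipt (T = 0), but with max-parallel: 1 the Nth sandbox only picks up
# its job (N - 1) * job_duration minutes later, after the token expired.
def on_workflow_jobs_queued(events, fetch_jit_config, spawn_sandbox):
    sandboxes = []
    for event in events:
        jit_config = fetch_jit_config(event["job_id"])  # all at T = 0
        sandboxes.append(spawn_sandbox(jit_config))
    return sandboxes

boxes = on_workflow_jobs_queued(
    [{"job_id": i} for i in (1, 2, 3)],
    fetch_jit_config=lambda job_id: f"jit-{job_id}",  # stub
    spawn_sandbox=lambda cfg: cfg,                    # stub
)
print(boxes)  # ['jit-1', 'jit-2', 'jit-3']
```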
## Attempted Solutions

### 1. Queue-Based Worker with Deferred JIT Fetch

**Approach**: Move JIT fetching from the webhook handler to a worker function that processes jobs sequentially.

```python
# Webhook: queue job metadata only
# Worker: fetch the JIT config when the job actually runs, then spawn
```

**Why it fails**:
- GitHub expects the runner to connect within 2-5 minutes of JIT generation
- Delaying the JIT fetch creates a race condition in which GitHub cancels the job
- GitHub's job assignment model expects immediate runner registration
- It doesn't solve the fundamental issue: sequential execution still exceeds the token lifetime

Reference: actions/runner auth documentation
### 2. Retry/Refresh JIT Config

**Attempt**: Detect the expired token and re-fetch the JIT config.

**Why it fails**:
- A JIT config is single-use per job
- It cannot be re-fetched for the same runner ID after expiration
- The job is tied to the original runner registration
- The webhook is a one-way notification; GitHub doesn't resend or support replay
- `generate-jitconfig` creates a NEW runner registration; it doesn't refresh an existing one
### 3. Increase max-parallel

**Attempt**: Run jobs in parallel to reduce total runtime below 60 minutes.

**Why it's not always possible**:
- Some workflows have inherent sequential dependencies
- External API rate limits may require throttling
- Resource constraints (e.g., API quotas, database locks)
- Business logic may require ordered execution
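When parallelism is an option, the smallest `max-parallel` that fits the whole matrix inside the token window can be estimated with a rough model (a hypothetical helper that ignores job start-up overhead):

```python
import math

# Smallest parallelism p such that ceil(N / p) waves of M-minute jobs
# finish within the token lifetime.
def min_parallel(n_jobs, job_minutes, token_lifetime_min=60):
    for p in range(1, n_jobs + 1):
        if math.ceil(n_jobs / p) * job_minutes <= token_lifetime_min:
            return p
    return n_jobs

print(min_parallel(37, 6))  # 4 -> ceil(37/4) = 10 waves × 6 min = 60 min
```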
### 4. Persistent Runner Token

**Attempt**: Use `--token` instead of `--jitconfig` with a long-lived token.

**Trade-offs**:
- ✅ Solves the expiration problem
- ❌ Security risk (long-lived token vs. ephemeral JIT)
- ❌ Requires manual token management and rotation
- ❌ Defeats the purpose of the JIT security model
## Research & References

### GitHub Documentation

- **GitHub Actions Limits**: Usage limits for self-hosted runners
  - Job queue time: 24 hours
  - JIT token lifetime: ~60 minutes (implied, not explicitly documented in the main docs)
- **Automatic Token Authentication**: GITHUB_TOKEN documentation
  - "The installation access token expires after 60 minutes"
- **Self-Hosted Runners**: About self-hosted runners
  - Documentation on JIT runner configuration
### GitHub Community Discussions

- **Discussion #25699**: GitHub token lifetime
  - Confirms: "The installation access token expires after 60 minutes"
  - Direct link to answer
- **Discussion #50472**: Long-running workflow GITHUB_TOKEN timeout
  - "Unable to extend GITHUB_TOKEN expiration time due to: GITHUB_TOKEN has expired"
  - Note: GITHUB_TOKEN (24 h) is different from the JIT token (60 min), but the discussion is relevant for token expiration patterns
- **Discussion #60513**: How to configure idle_timeout with JIT
  - Discusses JIT runner lifecycle and limitations
### GitHub Issues

- **actions/runner #1799**: How long is the runner registration token valid?
  - Answer: "It's valid for one hour"
  - Official confirmation from a GitHub maintainer
- **actions/runner #2920**: Unable to use ./config remove on a JIT runner
  - Discusses JIT runner lifecycle issues and a missing `gitHubUrl` in the config
  - Closed as completed (bug fix released)
- **actions-runner-controller #4183**: Runners not terminating after token expiry
  - Real-world production issue: "Runners not terminating after job completion – blocked queue due to token expiry"
  - Shows that token expiration affects even Kubernetes-based runners
- **actions-runner-controller #2466**: Jobs expire while in the queue
  - "Capacity reservations expire before the jobs are even queued"
  - Similar underlying problem with token/job timing
- **actions/runner #845**: Support for autoscaling self-hosted runners
  - Feature request for better autoscaling support
  - Related to managing runner lifecycle
### External Resources

- **AWS CodeBuild Issue**: Failure to get JIT token
  - Real-world example of JIT token issues in production
- **Orchestra Guide**: JIT Runner Configuration
  - Best practices for JIT runner setup (still doesn't solve the 60-minute limit)
## Constraints & Considerations

### Why This Is Hard to Solve

1. **Security Model**: JIT tokens are designed to be short-lived for security
2. **GitHub Architecture**: Jobs are assigned to runners at webhook time, not execution time
3. **Serverless Limitations**: Serverless functions can't maintain long-lived connections
4. **No Token Refresh API**: GitHub doesn't provide an API to refresh or extend JIT tokens

### Common Misconceptions

❌ "We can just fetch the JIT config when the job runs"
✅ GitHub expects runner registration within minutes of job assignment

❌ "We can retry failed jobs with a fresh JIT config"
✅ JIT is tied to a specific runner registration; it can't be re-fetched for the same job

❌ "Queue the jobs and process them later"
✅ GitHub's job queue timeout (24 h) ≠ JIT token lifetime (60 min)
## Proposed Solutions

### Option 1: Batch Processing (Recommended Workaround)

Split long-running workflows into multiple workflow runs that each complete within 60 minutes:

```
# Instead of one workflow with 37 jobs, create multiple
# workflow runs (or use a dynamic matrix):
# Workflow Run 1: Jobs 1-9   (54 min)
# Workflow Run 2: Jobs 10-18 (54 min)
# Workflow Run 3: Jobs 19-27 (54 min)
# Workflow Run 4: Jobs 28-N  (remaining)
```

**Implementation**:

```yaml
strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    # Use only a subset of job IDs per workflow run
    job_id: ${{ fromJson(env.JOB_BATCH) }}
```

**Pros**:
- Works within existing JIT limitations
- No changes to runner infrastructure
- Each batch completes within the token window

**Cons**:
- Requires orchestration to trigger multiple runs
- More complex workflow management
- Job history split across multiple runs
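The batching step itself can be sketched as a hypothetical helper (the 9-job, 54-minute runs match the split above; `margin_min` is an assumed safety buffer):

```python
# Split job IDs into runs whose sequential runtime stays under the token
# window, minus a safety margin. Each returned batch becomes the
# JOB_BATCH input for one workflow run.
def batch_job_ids(job_ids, job_minutes, window_min=60, margin_min=6):
    per_batch = max(1, (window_min - margin_min) // job_minutes)
    return [job_ids[i:i + per_batch]
            for i in range(0, len(job_ids), per_batch)]

batches = batch_job_ids(list(range(1, 38)), job_minutes=6)
print(len(batches), batches[0])  # 5 batches; the first covers jobs 1-9
```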
### Option 2: Persistent Runner Token (Security Trade-off)

Use traditional runner registration instead of JIT:

```bash
# Register the runner once (manually or automated)
./config.sh --url https://github.com/OWNER/REPO --token $REGISTRATION_TOKEN

# Run with the persistent registration
./run.sh
```

**Pros**:
- The token doesn't expire during job execution
- Simple implementation

**Cons**:
- Security risk (long-lived token)
- Requires a token rotation policy
- Loses the benefits of ephemeral runners
### Option 3: Hybrid Approach - Batch with Persistent Runner

Use a persistent runner for long sequential workflows and JIT for short ones:

```python
if total_estimated_runtime > 3600:  # 1 hour, in seconds
    use_persistent_runner()
else:
    use_jit_runner()
```

**Pros**:
- Best of both worlds
- Secure for short jobs, functional for long jobs

**Cons**:
- More complex runner management
- Still requires a persistent token for some cases
### Option 4: Workflow-Level Retry with Fresh Webhooks

Instead of job-level retries, trigger new workflow runs:

```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: '0 */2 * * *'  # Every 2 hours

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - name: Check which items need processing
        id: check
        run: |
          # Logic to determine unprocessed items
          echo "batch=$ITEMS" >> $GITHUB_OUTPUT
      - name: Trigger batch workflow
        if: steps.check.outputs.batch != '[]'
        uses: benc-uk/workflow-dispatch@v1
        with:
          workflow: process-batch.yml
          inputs: '{"items": "${{ steps.check.outputs.batch }}"}'
```

**Pros**:
- Fresh webhooks = fresh JIT tokens
- Each batch fits within the 60-minute window

**Cons**:
- Complex orchestration
- Potential for duplicate processing
- Harder to track overall progress
### Option 5: GitHub-Supported Solution (Requested)

Request that GitHub support one of the following:

1. **JIT Token Refresh API**:
   `POST /repos/{owner}/{repo}/actions/runners/{runner_id}/refresh-token`
2. **Extended JIT Lifetime**:
   - Allow configuring the JIT token lifetime (e.g., 4 hours for long workflows)
   - Or auto-extend it for active runners
3. **Job-Level JIT**:
   - Generate a JIT token per job instead of per runner
   - Token valid for the job duration only
## Questions for GitHub

1. Is there an official way to refresh or extend the JIT token lifetime for long-running workflows?
2. Can GitHub Support increase the JIT token lifetime for specific repositories or use cases?
3. Is there a documented pattern for handling workflows that exceed 60 minutes on self-hosted runners?
4. Should the `generate-jitconfig` API support token refresh or longer lifetimes for sequential job processing?
5. Could GitHub provide a "job-level" JIT token that is valid for the duration of a specific job rather than the runner registration?
## Related Issues & Discussions
- actions/runner #1799 - Token lifetime discussion
- actions-runner-controller #4183 - Token expiry in production
- actions-runner-controller #2466 - Jobs expiring in queue
- GitHub Community #25699 - Token lifetime confirmation
- GitHub Community #50472 - Long-running workflow timeouts
## Additional Context

### Serverless Runner Architecture

Typical serverless GitHub Actions runner flow:

```
GitHub workflow trigger
    ↓
GitHub sends workflow_job webhook (action: queued)
    ↓
Serverless function receives webhook
    ↓
Function calls GitHub API: POST /actions/runners/generate-jitconfig
    ↓
GitHub returns JIT config (valid for ~60 minutes)
    ↓
Function spawns container/sandbox with JIT config
    ↓
Container runs: ./run.sh --jitconfig $JIT_CONFIG
    ↓
Runner connects to GitHub and picks up job
    ↓
Job executes
    ↓
Job completes, runner exits
```

The problem occurs when:
- The JIT generation step happens at T=0 for all jobs
- The job execution step for job N happens at T > 60 minutes
## Workaround Checklist

If you're experiencing this issue, check:

- Can you split jobs into multiple workflow runs (< 60 min each)?
- Can you increase `max-parallel` to reduce total runtime?
- Can you use persistent runner tokens instead of JIT?
- Can you optimize job duration to be < 6 minutes each?
- Can you reduce the number of jobs in the matrix?
## Labels

Suggested labels for this issue: `enhancement`, `self-hosted-runners`, `jit-tokens`, `long-running-workflows`, `sequential-jobs`, `documentation`
## Summary

This issue documents a fundamental architectural limitation: JIT tokens are designed for short-lived ephemeral runners (~60 minutes), but GitHub Actions workflows can legitimately require longer sequential execution.

The core conflict:
- **JIT security model**: Short-lived tokens (60 min) for ephemeral runners
- **Sequential workflows**: May require > 60 min total runtime
- **Serverless architecture**: Can't maintain persistent connections

Viable workarounds:
1. Batch processing (multiple workflow runs)
2. Persistent runner tokens (security trade-off)
3. Reduced total runtime (optimize jobs or increase parallelism)

A long-term solution requires GitHub to either:
- Extend the JIT token lifetime for long workflows,
- Provide a token refresh mechanism, or
- Support job-level (not runner-level) JIT tokens.

This issue was compiled from multiple real-world production scenarios and extensive research. It aims to document the limitation clearly and provide actionable workarounds while advocating for a supported long-term solution.