JIT Token Expiration with Long-Running Sequential Workflows #2

@manascb1344

Description


Problem Summary

When running GitHub Actions workflows with max-parallel: 1 and long-running sequential jobs (total runtime > 60 minutes), JIT (Just-In-Time) runner tokens expire after ~60 minutes, causing the remaining jobs to fail with a "The operation was canceled" error.

This is a fundamental limitation when:

  • Total workflow runtime exceeds JIT token lifetime (~60 minutes)
  • Jobs must run sequentially (max-parallel: 1)
  • Using ephemeral JIT-configured self-hosted runners

Environment

  • Runner Platform: Serverless (Modal/AWS Lambda/Azure Functions/etc.)
  • Runner Type: Self-hosted with JIT configuration
  • Configuration:
    • Jobs: N matrix jobs (where N × job_duration > 60 minutes)
    • max-parallel: 1 (sequential execution)
    • Example: 37 jobs × 6 minutes = 222 minutes total runtime

Steps to Reproduce

  1. Create workflow with matrix strategy and max-parallel: 1:
strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    job_id: [1, 2, 3, ..., N]  # N jobs where N × 6_minutes > 60_minutes
  2. Use self-hosted runner with JIT configuration:
# Serverless runner fetches JIT config on webhook receipt
jit_config = await fetch_jit_config(repo_url, job_id, labels)
sandbox = modal.Sandbox.create(env={"GHA_JIT_CONFIG": jit_config})
  3. Trigger workflow with enough jobs that total runtime exceeds 60 minutes

  4. Observe that:

    • Jobs 1-10 complete successfully (~60 minutes)
    • Jobs 11+ fail with "The operation was canceled"

Expected Behavior

All jobs should complete successfully, with each job getting a fresh JIT token when it starts (not when the webhook is received).

Actual Behavior

Job Range   Status      Time Elapsed   JIT Token State
1-10        ✅ Success   0-60 min       Valid
11+         ❌ Failed    60+ min        Expired

Error observed:

The operation was canceled.

Failed job timing pattern:

  • Jobs complete successfully until ~60-minute mark
  • Jobs starting after 60 minutes fail immediately or within 2-3 minutes
  • Failure occurs exactly at JIT token expiration time

Root Cause Analysis

JIT Token Lifecycle

From GitHub documentation and runner source code:

  1. JIT Generation: When generate-jitconfig API is called, GitHub creates a runner registration with a time-limited token
  2. Token Validity: ~60 minutes (confirmed via GitHub Community Discussion #25699)
  3. Expiration: After 60 minutes, GitHub invalidates the runner registration
  4. Job Cancellation: Any job using that runner gets "The operation was canceled"
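
The generate-jitconfig call in step 1 is the documented REST endpoint POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig. A minimal sketch of building that request (stdlib only; the OWNER/REPO/token placeholders are illustrative, and the calling token must have repo administration rights):

```python
import json
import urllib.request

def build_jitconfig_request(owner, repo, token, runner_name, labels):
    """Build the POST request for GitHub's generate-jitconfig endpoint.
    The response's encoded_jit_config is what ./run.sh --jitconfig consumes,
    and its embedded registration is the token that expires after ~60 min."""
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/generate-jitconfig"
    payload = {"name": runner_name, "runner_group_id": 1, "labels": labels}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

req = build_jitconfig_request("OWNER", "REPO", "<token>", "ephemeral-runner-1",
                              ["self-hosted", "linux"])
print(req.get_method(), req.full_url)
```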

The Math Problem

N jobs × M minutes each = Total runtime
JIT token lifetime = 60 minutes

If Total runtime > 60 minutes:
  Jobs 1 to floor(60/M): Complete successfully ✅
  Jobs floor(60/M)+1 to N: Fail with expired token ❌

Example with 6-minute jobs:
  37 jobs × 6 minutes = 222 minutes total
  Jobs 1-10: Complete within 60 min window ✅
  Jobs 11-37: Start after token expiration ❌
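
The arithmetic above can be expressed as a small helper (the ~60-minute lifetime is the empirically observed value, not a documented constant):

```python
def jobs_within_token_window(num_jobs, job_minutes, token_lifetime_min=60):
    """With max-parallel: 1 and all JIT tokens minted at T=0, only jobs
    that *start* inside the token lifetime can complete."""
    succeeding = min(num_jobs, token_lifetime_min // job_minutes)
    return succeeding, num_jobs - succeeding

# 37 jobs × 6 minutes each: jobs 1-10 succeed, jobs 11-37 fail
print(jobs_within_token_window(37, 6))  # (10, 27)
```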

Why Current Architecture Fails

The serverless runner typically:

  1. Receives N webhooks simultaneously when workflow triggers
  2. Fetches N JIT configs immediately (all tokens created at T=0)
  3. Spawns N sandboxes/containers (each with pre-fetched JIT)
  4. Jobs run sequentially, but JIT tokens expire at T=60 regardless

Key Issue: JIT tokens are generated at webhook receipt time, not at job execution time.

Attempted Solutions

1. Queue-Based Worker with Deferred JIT Fetch

Approach: Move JIT fetching from webhook handler to worker function that processes jobs sequentially.

# Webhook: Queue metadata only
# Worker: Fetch JIT when job actually runs, then spawn

Why It Fails:

  • GitHub expects the runner to connect within 2-5 minutes of JIT generation
  • Delaying the JIT fetch creates a race condition where GitHub cancels the job
  • GitHub's job assignment model expects immediate runner registration
  • Doesn't solve fundamental issue: sequential execution still exceeds token lifetime

Reference: actions/runner auth documentation

2. Retry/Refresh JIT Config

Attempt: Detect expired token and re-fetch JIT config.

Why It Fails:

  • JIT config is single-use per job
  • Cannot re-fetch for same runner ID after expiration
  • Job is tied to original runner registration
  • The webhook is a one-way notification; GitHub doesn't resend or support replay
  • generate-jitconfig creates a NEW runner registration; it does not refresh an existing one

3. Increase max-parallel

Attempt: Run jobs in parallel to reduce total runtime below 60 minutes.

Why Not Always Possible:

  • Some workflows have inherent sequential dependencies
  • External API rate limits may require throttling
  • Resource constraints (e.g., API quotas, database locks)
  • Business logic may require ordered execution

4. Persistent Runner Token

Attempt: Use --token instead of --jitconfig with a long-lived token.

Trade-offs:

  • ✅ Solves the expiration problem
  • ❌ Security risk (long-lived token vs ephemeral JIT)
  • ❌ Requires manual token management and rotation
  • ❌ Defeats the purpose of JIT security model

Research & References

GitHub Documentation

  1. GitHub Actions Limits: Usage limits for self-hosted runners

    • Job queue time: 24 hours
    • JIT token lifetime: ~60 minutes (implied, not explicitly documented in main docs)
  2. Automatic Token Authentication: GITHUB_TOKEN documentation

    • "The installation access token expires after 60 minutes"
  3. Self-Hosted Runners: About self-hosted runners

    • Documentation on JIT runner configuration

GitHub Community Discussions

  1. Discussion #25699: GitHub token lifetime

  2. Discussion #50472: Long-running workflow GITHUB_TOKEN timeout

    • "Unable to extend GITHUB_TOKEN expiration time due to: GITHUB_TOKEN has expired"
    • Note: GITHUB_TOKEN (expires at job end, max 24 h) is different from the JIT runner token (~60 min), but the discussion is relevant for token expiration patterns
  3. Discussion #60513: How to configure idle_timeout with JIT

    • Discusses JIT runner lifecycle and limitations

GitHub Issues

  1. actions/runner #1799: How long is runner registration token valid?

    • Answer: "It's valid for one hour"
    • Official confirmation from GitHub maintainer
  2. actions/runner #2920: Unable to use ./config remove on JIT runner

    • Discusses JIT runner lifecycle issues and missing gitHubUrl in config
    • Closed as completed (bug fix released)
  3. actions-runner-controller #4183: Runners not terminating after token expiry

    • Real-world production issue: "Runners not terminating after job completion – blocked queue due to token expiry"
    • Shows token expiration affects even Kubernetes-based runners
  4. actions-runner-controller #2466: Jobs expire while on queue

    • "Capacity reservations expire before the jobs are even queued"
    • Similar underlying problem with token/job timing
  5. actions/runner #845: Support for autoscaling self-hosted runners

    • Feature request for better autoscaling support
    • Related to managing runner lifecycle

External Resources

  1. AWS CodeBuild Issue: Failure to get JIT token

    • Real-world example of JIT token issues in production
  2. Orchestra Guide: JIT Runner Configuration

    • Best practices for JIT runner setup (still doesn't solve 60-min limit)

Constraints & Considerations

Why This Is Hard to Solve

  1. Security Model: JIT tokens are designed to be short-lived for security
  2. GitHub Architecture: Jobs are assigned to runners at webhook time, not execution time
  3. Serverless Limitations: Serverless functions can't maintain long-lived connections
  4. No Token Refresh API: GitHub doesn't provide an API to refresh/extend JIT tokens

Common Misconceptions

"We can just fetch JIT when the job runs"
❌ Reality: GitHub expects runner registration within minutes of job assignment

"We can retry failed jobs with fresh JIT"
❌ Reality: JIT is tied to a specific runner registration; it can't be re-fetched for the same job

"Queue the jobs and process later"
❌ Reality: GitHub's job queue timeout (24 h) ≠ JIT token lifetime (60 min)

Proposed Solutions

Option 1: Batch Processing (Recommended Workaround)

Split long-running workflows into multiple workflow runs that each complete within 60 minutes:

# Instead of one workflow with 37 jobs,
# Create multiple workflows or use dynamic matrix:

# Workflow Run 1: Jobs 1-9 (54 min)
# Workflow Run 2: Jobs 10-18 (54 min) 
# Workflow Run 3: Jobs 19-27 (54 min)
# Workflow Run 4: Jobs 28-N (remaining)

Implementation:

strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    # Use only subset per workflow run
    job_id: ${{ fromJson(env.JOB_BATCH) }}
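
A batching helper along these lines can compute the JOB_BATCH subsets so each workflow run stays inside one token lifetime (the 6-minute safety margin is an assumption; tune it to your queueing overhead):

```python
def make_batches(job_ids, job_minutes, token_lifetime_min=60, margin_min=6):
    """Split a job list into batches whose sequential runtime fits inside
    one JIT token lifetime, minus a safety margin for queueing overhead."""
    per_batch = max(1, (token_lifetime_min - margin_min) // job_minutes)
    return [job_ids[i:i + per_batch] for i in range(0, len(job_ids), per_batch)]

# 37 six-minute jobs → four batches of 9 (54 min each) plus a remainder
batches = make_batches(list(range(1, 38)), job_minutes=6)
print([len(b) for b in batches])  # [9, 9, 9, 9, 1]
```

Each batch can then be serialized (e.g. with json.dumps) and passed to a run as the JOB_BATCH value consumed by fromJson above.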

Pros:

  • Works within existing JIT limitations
  • No changes to runner infrastructure
  • Each batch completes within token window

Cons:

  • Requires orchestration to trigger multiple runs
  • More complex workflow management
  • Job history split across multiple runs

Option 2: Persistent Runner Token (Security Trade-off)

Use traditional runner registration instead of JIT:

# Register runner once (manual or automated); the short-lived
# registration token is exchanged for persistent runner credentials
./config.sh --url https://github.com/OWNER/REPO --token $REGISTRATION_TOKEN

# Run with the stored credentials (no --jitconfig, so no 60-minute expiry)
./run.sh

Pros:

  • Token doesn't expire during job execution
  • Simple implementation

Cons:

  • Security risk (long-lived token)
  • Requires token rotation policy
  • Loses benefits of ephemeral runners

Option 3: Hybrid Approach - Batch with Persistent Runner

Use persistent runner for long sequential workflows, JIT for short ones:

# Sketch: route to the runner type based on estimated total runtime
JIT_LIFETIME_SECONDS = 3600  # ~60-minute JIT token window

if total_estimated_runtime > JIT_LIFETIME_SECONDS:
    use_persistent_runner()   # long sequential workflows
else:
    use_jit_runner()          # short workflows keep the ephemeral model

Pros:

  • Best of both worlds
  • Secure for short jobs, functional for long jobs

Cons:

  • More complex runner management
  • Still requires persistent token for some cases

Option 4: Workflow-Level Retry with Fresh Webhooks

Instead of job-level retry, trigger new workflow runs:

on:
  workflow_dispatch:
  schedule:
    - cron: '0 */2 * * *'  # Every 2 hours

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - name: Check which items need processing
        id: check
        run: |
          # Logic to determine unprocessed items
          echo "batch=$ITEMS" >> $GITHUB_OUTPUT
      
      - name: Trigger batch workflow
        if: steps.check.outputs.batch != '[]'
        uses: benc-uk/workflow-dispatch@v1
        with:
          workflow: process-batch.yml
          inputs: '{"items": "${{ steps.check.outputs.batch }}"}'
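
The dispatch step can equivalently be done straight against the REST API, which benc-uk/workflow-dispatch wraps; a stdlib-only sketch (OWNER/REPO/token placeholders are illustrative):

```python
import json
import urllib.request

def build_dispatch_request(owner, repo, workflow_file, token, ref, inputs):
    """Build the POST for GitHub's workflow_dispatch REST endpoint.
    Each dispatched run arrives as a fresh webhook, so its JIT tokens
    are minted at that run's own T=0."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    payload = {"ref": ref, "inputs": inputs}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

req = build_dispatch_request("OWNER", "REPO", "process-batch.yml", "<token>",
                             "main", {"items": "[1, 2, 3]"})
print(req.get_method(), req.full_url)
```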

Pros:

  • Fresh webhooks = fresh JIT tokens
  • Each batch within 60-minute window

Cons:

  • Complex orchestration
  • Potential for duplicate processing
  • Harder to track overall progress

Option 5: GitHub-Supported Solution (Requested)

Request GitHub to support one of:

  1. JIT Token Refresh API:

    POST /repos/{owner}/{repo}/actions/runners/{runner_id}/refresh-token
    
  2. Extended JIT Lifetime:

    • Allow configuration of JIT token lifetime (e.g., 4 hours for long workflows)
    • Or auto-extend for active runners
  3. Job-Level JIT:

    • Generate JIT token per job instead of per runner
    • Token valid for job duration only

Questions for GitHub

  1. Is there an official way to refresh or extend JIT token lifetime for long-running workflows?

  2. Can GitHub support increase JIT token lifetime for specific repositories/use cases?

  3. Is there a documented pattern for handling workflows that exceed 60 minutes with self-hosted runners?

  4. Should the generate-jitconfig API support token refresh or longer lifetimes for sequential job processing?

  5. Could GitHub provide a "job-level" JIT token that's valid for the duration of a specific job rather than runner registration?

Additional Context

Serverless Runner Architecture

Typical serverless GitHub Actions runner flow:

GitHub Workflow Trigger
        ↓
GitHub sends workflow_job webhook (action: queued)
        ↓
Serverless function receives webhook
        ↓
Function calls GitHub API: POST /actions/runners/generate-jitconfig
        ↓
GitHub returns JIT config (valid for ~60 minutes)
        ↓
Function spawns container/sandbox with JIT config
        ↓
Container runs: ./run.sh --jitconfig $JIT_CONFIG
        ↓
Runner connects to GitHub and picks up job
        ↓
Job executes
        ↓
Job completes, runner exits

The problem occurs when:

  • Step 3 (JIT generation) happens at T=0 for all jobs
  • Step 7 (job execution) for job N happens at T > 60 minutes
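
That T=0 vs T>60 mismatch can be simulated with pure arithmetic, mirroring the flow above:

```python
def sequential_start_times(num_jobs, job_minutes):
    """With max-parallel: 1, job k starts at (k - 1) * job_minutes."""
    return {k: (k - 1) * job_minutes for k in range(1, num_jobs + 1)}

TOKEN_LIFETIME_MIN = 60  # observed JIT lifetime; all tokens minted at T=0
starts = sequential_start_times(37, 6)
expired = [k for k, t in starts.items() if t >= TOKEN_LIFETIME_MIN]
print(expired[0], len(expired))  # 11 27  → jobs 11-37 start on an expired token
```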

Workaround Checklist

If you're experiencing this issue, check:

  • Can you split jobs into multiple workflow runs (< 60 min each)?
  • Can you increase max-parallel to reduce total runtime?
  • Can you use persistent runner tokens instead of JIT?
  • Can you optimize job duration to be < 6 minutes each?
  • Can you reduce number of jobs in matrix?

Labels

Suggested labels for this issue:

  • enhancement
  • self-hosted-runners
  • jit-tokens
  • long-running-workflows
  • sequential-jobs
  • documentation

Summary

This issue documents a fundamental architectural limitation: JIT tokens are designed for short-lived ephemeral runners (~60 minutes), but GitHub Actions workflows can legitimately require longer sequential execution.

The core conflict:

  • JIT Security Model: Short-lived tokens (60 min) for ephemeral runners
  • Sequential Workflows: May require >60 min total runtime
  • Serverless Architecture: Can't maintain persistent connections

Viable workarounds:

  1. Batch processing (multiple workflow runs)
  2. Persistent runner tokens (security trade-off)
  3. Reduce total runtime (optimize jobs or increase parallelism)

Long-term solution: Requires GitHub to either:

  • Extend JIT token lifetime for long workflows
  • Provide token refresh mechanism
  • Support job-level (not runner-level) JIT tokens

This issue was compiled from multiple real-world production scenarios and extensive research. It aims to document the limitation clearly and provide actionable workarounds while advocating for a supported long-term solution.
