Update job runner spec with cross-account architecture design #4904
58c7118 to
8d52d84
rhysh left a comment
Some comments, non blocking
from observatory.
- Run evaluation jobs in a dedicated AWS account separate from primary infrastructure
- Jobs don't submit their own results or pull inputs from Observatory
- Hot pool of pre-warmed nodes with per-job pod teardown (~5-10s startup target, one node per pod)
The hot pool is a strategy; the goals look like 1/ limit the impact of container escape by recycling nodes often, 2/ have fast startup
**Dispatcher** (primary account, part of Observatory)
- Creates k8s jobs in eval cluster via cross-account kubeconfig
"... based on aws eks get-token"
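A kubeconfig that obtains its token from `aws eks get-token` might look roughly like the sketch below. It builds the config as a Python dict using the standard kubeconfig `exec` credential plugin fields. The user name `dispatcher`, the function names, and the idea of passing `--role-arn` for the eval account are assumptions for illustration, not something the spec states.

```python
import json


def build_eks_kubeconfig(cluster_name, endpoint, ca_data, role_arn=None):
    """Return a kubeconfig dict that authenticates via `aws eks get-token`."""
    args = ["eks", "get-token", "--cluster-name", cluster_name]
    if role_arn:
        # Assumption: the dispatcher assumes a role in the eval account.
        args += ["--role-arn", role_arn]
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster_name,
            "cluster": {
                "server": endpoint,
                "certificate-authority-data": ca_data,
            },
        }],
        "users": [{
            "name": "dispatcher",  # hypothetical user name
            "user": {"exec": {
                "apiVersion": "client.authentication.k8s.io/v1beta1",
                "command": "aws",
                "args": args,
            }},
        }],
        "contexts": [{
            "name": cluster_name,
            "context": {"cluster": cluster_name, "user": "dispatcher"},
        }],
        "current-context": cluster_name,
    }


def render_kubeconfig(config):
    """Serialize the config as JSON, which YAML parsers also accept."""
    return json.dumps(config, indent=2)
```

With this shape no long-lived cluster credential is stored in the file; the token is requested each time the client needs one.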
- Reads job spec from presigned GET URL (env var `JOB_SPEC_URI`)
- Downloads policies from presigned GET URLs
- Runs pure episode runner
"... in the same pod / container"
- Writes results/replay to presigned PUT URLs
- No Observatory access, no AWS credentials, no network to primary account
Do we make any attempt to restrict Internet access? (If not, then we'll also have to contend with "the policy is actually a DoS bot" type things.)
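For reference, the worker steps quoted above (read the spec from `JOB_SPEC_URI`, download policies, run, upload results) could be sketched as follows, using only the standard library so the pod needs no AWS credentials. The job spec field names `policy_urls` and `results_url` and the `run_episode` callable are hypothetical; the spec's actual schema is not shown in this thread.

```python
import json
import os
import urllib.request


def http_get(url):
    """Fetch the body of a presigned GET URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def http_put(url, body):
    """Upload bytes to a presigned PUT URL.

    Any headers sent must match whatever was signed into the URL.
    """
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req):
        pass


def run_job(run_episode, get=http_get, put=http_put, environ=os.environ):
    """Run one job using only the URLs handed to the pod."""
    spec = json.loads(get(environ["JOB_SPEC_URI"]))
    # "policy_urls" and "results_url" are hypothetical field names.
    policies = [get(url) for url in spec["policy_urls"]]
    results = run_episode(spec, policies)
    put(spec["results_url"], json.dumps(results).encode("utf-8"))
    return results
```

Keeping `get` and `put` injectable makes the flow testable without network access. It does not restrict egress by itself; that would still need something like a network policy on the cluster.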
1. **Job Creation** (primary account)
   - Observatory creates job row in Postgres (status=pending)
   - Dispatcher generates presigned S3 URIs
"... for policy zips"
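As a rough illustration of the presigning step in Job Creation, the dispatcher could generate GET URLs for the policy zips and a PUT URL for results along these lines. This is a sketch that assumes a boto3 S3 client is passed in as `s3_client`; the function name, the returned field names, and the one-hour expiry are hypothetical.

```python
def presign_job_urls(s3_client, bucket, policy_keys, results_key,
                     expires_in=3600):
    """Build presigned URLs for one job.

    `s3_client` is assumed to be a boto3 S3 client, for example
    boto3.client("s3"); only generate_presigned_url is called on it.
    """
    policy_urls = [
        s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,
        )
        for key in policy_keys
    ]
    results_url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": results_key},
        ExpiresIn=expires_in,
    )
    # Hypothetical field names for the job spec handed to the worker.
    return {"policy_urls": policy_urls, "results_url": results_url}
```

Because the URLs carry their own signature and expiry, the eval pod can read inputs and write outputs without holding any AWS credentials.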
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
652ed45 to
4d2744f