cancel jobs #64

wlggraham · 2025-11-18T00:18:24Z

PR Details

Clickup Link -

Description

This PR enables a job/{jobid}/cancel endpoint that will cancel any ASYNC jobs.

The way this mechanism works:

When the cancel API is called, it immediately updates the job's status to CANCELLING in the database and also updates the cancelled_by field.

Each machine running Heimdall creates a cancellable context for each async job when the job is started. Immediately after starting the plugin handler, a separate goroutine starts polling the database (every 10 seconds) to see whether the status has changed to CANCELLING. If it has, the goroutine cancels the context. All existing ASYNC jobs have components that respect the context, so a cancelled context causes the job to fail. For now, the job status remains CANCELLING until we implement the janitor resource clean-up, which will terminate any remote resources and finalize the job with a CANCELLED state.

Types of changes

Docs change / refactoring / dependency upgrade
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

My code follows the code style of this project.
My change requires a change to the documentation and I have updated the documentation accordingly.
I have added tests to cover my changes.

wiz-55ccc8b716 · 2025-11-18T00:19:07Z

Wiz Scan Summary

Scanner	Findings
Vulnerabilities	-
Sensitive Data	-
Secrets	-
IaC Misconfigurations	-
SAST Findings	2
Software Supply Chain Findings	-

Total	2

wlggraham · 2025-11-18T00:19:39Z

internal/pkg/heimdall/job.go

+	}
+
+	// make sure we have a job object
+	job, ok := currentJob.(*job.Job)


Is there a good reason for all of the endpoints to be returning generics? We wouldn't need to do a type check here if cancelJob expected a job object.

Go does not support covariant return types the way Java does, so Liskov-style substitution does not apply to return types.
Go does not allow this: the method signature must match exactly.
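To illustrate the point in this exchange, here is a small sketch with simplified, hypothetical stand-ins (`Job`, `jobRequest`, `endpoint`, `asJob`) rather than the project's real types. It shows why a caller of a handler that returns `any` needs a checked type assertion to get back to the concrete job.

```go
package main

import (
	"errors"
	"fmt"
)

// Job and jobRequest are simplified, hypothetical stand-ins for the real types.
type Job struct {
	ID     string
	Status string
}

type jobRequest struct{ ID string }

// endpoint is the shared handler shape: every endpoint returns any.
type endpoint func(*jobRequest) (any, error)

// getJob matches endpoint exactly, so callers only see an any value.
func getJob(req *jobRequest) (any, error) {
	return &Job{ID: req.ID, Status: "RUNNING"}, nil
}

// A function returning (*Job, error) is not assignable to endpoint:
//
//	var e endpoint = func(req *jobRequest) (*Job, error) { ... } // compile error
//
// Go function types have no covariant return types.

// asJob recovers the concrete type with a checked type assertion.
func asJob(v any) (*Job, error) {
	j, ok := v.(*Job)
	if !ok {
		return nil, errors.New("unexpected result type")
	}
	return j, nil
}

func main() {
	var e endpoint = getJob
	v, err := e(&jobRequest{ID: "job-1"})
	if err != nil {
		panic(err)
	}
	j, err := asJob(v)
	fmt.Println(j.ID, j.Status, err)
}
```

One way to avoid the assertion, assuming the endpoints must keep a common signature, would be a typed internal helper that returns `*Job` plus a thin wrapper that adapts it to `endpoint`.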

internal/pkg/heimdall/jobs_async.go

wlggraham · 2025-11-18T00:21:49Z

internal/pkg/object/command/ecs/ecs.go


-		// Sleep until next poll time
-		time.Sleep(time.Duration(execCtx.PollingInterval))
+		// Check for cancellation or sleep until next poll time


I picked the area where ECS plugin spends the most time (status polling) to implement the cancellation check.

Termination and clean-up should be separated. A plugin shouldn't care about termination.
It should be handled in func (h *Heimdall) runAsyncJob(ctx context.Context, j *job.Job) error for all plugins.
The plugin should change from a function type to an interface type with 2 functions (Terminate and Handle).
Our major plugins (Spark, ECS, Trino, Clickhouse) allow this to be implemented.
The Handle function should rely on its libraries and trust that each library is context-aware and respects context cancellation.
What does this mean for the PR?

type Handler func(context.Context, *Runtime, *job.Job, *cluster.Cluster) error

becomes

type Handler interface {
	Handle(context.Context, *Runtime, *job.Job, *cluster.Cluster) error
	Terminate(context.Context, *Runtime, *job.Job, *cluster.Cluster) error
}

The Handle function doesn't contain:

select {
case <-ctx.Done():
	stopAllTasks(execCtx, "Job cancelled by user")
	return nil
case <-time.After(time.Duration(execCtx.PollingInterval)):
}

When we run func (h *Heimdall) runJob(job *job.Job, command *command.Command, cluster *cluster.Cluster, ctx context.Context) error, we start 2 goroutines:

1. Execute the job.

2. Check the job status in the DB and, if the job is being cancelled, cancel the context. We can call handler.Terminate here, because all the resources are in our hands, or wait for the Janitor to cancel everything.

wlggraham · 2025-11-18T00:22:44Z

internal/pkg/object/command/sparkeks/sparkeks.go


 // handler executes the Spark EKS job submission and execution.
-func (s *sparkEksCommandContext) handler(r *plugin.Runtime, j *job.Job, c *cluster.Cluster) error {
+func (s *sparkEksCommandContext) handler(ct ct.Context, r *plugin.Runtime, j *job.Job, c *cluster.Cluster) error {


Context is set as a var here so it can be accessed globally. I'm wondering whether reassigning the global context to the context we pass in to the handler is the right approach. @hladush @sanketjadhavSF

hladush · 2025-11-18T01:21:05Z

internal/pkg/heimdall/job.go

 }

-func (h *Heimdall) runJob(job *job.Job, command *command.Command, cluster *cluster.Cluster) error {
+func (h *Heimdall) runJob(job *job.Job, command *command.Command, cluster *cluster.Cluster, ctx context.Context) error {


In Go, context is conventionally the first parameter.

hladush · 2025-11-18T01:35:17Z

internal/pkg/heimdall/job.go

+func (h *Heimdall) cancelJob(req *jobRequest) (any, error) {
+
+	// validate that job exists and get its current status
+	currentJob, err := h.getJob(req)


Suppose a user calls the API or clicks the button several times (cancel, cancel, cancel), or the UI has an issue and sends 2 requests with a tiny delay.
Machine 1 executes lines 173 to 188; meanwhile another machine processes the same request, and then machine 1 continues and sets the status to cancel.
Can we make cancelling an atomic operation?
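To make the race concrete, here is a small in-memory sketch of an atomic check-and-set. The `store` type and `tryCancel` method are hypothetical and use a mutex in place of the database; the status names other than CANCELLING are assumptions. The point is only that the check and the write happen as one step, so repeated cancel requests produce exactly one transition.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the jobs table.
type store struct {
	mu     sync.Mutex
	status map[string]string
}

// tryCancel moves a job to CANCELLING only if it is still NEW or RUNNING.
// The check and the write happen under one lock, so concurrent callers
// cannot both observe RUNNING and both perform the transition.
func (s *store) tryCancel(jobID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch s.status[jobID] {
	case "NEW", "RUNNING":
		s.status[jobID] = "CANCELLING"
		return true
	default:
		return false // unknown job, already cancelling, or already finished
	}
}

func main() {
	s := &store{status: map[string]string{"job-1": "RUNNING"}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < 3; i++ { // three rapid cancel requests
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.tryCancel("job-1") {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println(wins, s.status["job-1"]) // prints "1 CANCELLING"
}
```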

hladush · 2025-11-18T01:35:36Z

internal/pkg/heimdall/job.go

+func (h *Heimdall) cancelJob(req *jobRequest) (any, error) {
+
+	// validate that job exists and get its current status
+	currentJob, err := h.getJob(req)


add metrics and logs for errors

hladush · 2025-11-18T01:38:14Z

internal/pkg/pool/queries/cancelling_jobs_select.sql

+    j.job_id
+from
+    jobs j
+    join job_statuses js on j.job_status_id = js.job_status_id


Why do we need a join here if status.go already has a clear mapping between job_status_id and the status name in the DB?

sanketjadhavSF · 2025-11-18T07:53:31Z

internal/pkg/pool/queries/job_status_update_by_id.sql

+set
+    job_status_id = $1,
+    job_error = $2,
+    updated_at = extract(epoch from now())::int


We should also add an updated_by column to log the username of whoever requested the job cancellation, for future audit trails or debugging.
Maybe we can also add an optional cancellation_reason column to log the reason for the cancellation. We can make it mandatory when we integrate the cancel API with the UI.

hladush · 2025-11-18T22:33:49Z

internal/pkg/object/command/trino/trino.go

 }

-func (t *commandContext) handler(r *plugin.Runtime, j *job.Job, c *cluster.Cluster) error {
+func (t *commandContext) handler(ct ct.Context, r *plugin.Runtime, j *job.Job, c *cluster.Cluster) error {


nit: in Go, the package is imported as context and the usual short name is ctx (ctx context.Context). Let's unify it.

hladush · 2025-11-18T22:36:27Z

internal/pkg/heimdall/handler.go


 		// execute request
-		result, err := fn(&payload)
+		result, err := fn(r.Context(), &payload)


hladush · 2025-11-18T22:38:47Z

internal/pkg/heimdall/job.go

+	pluginCtx, cancel := context.WithCancel(ctx)
+
+	// Start plugin execution in goroutine
+	go func() {


I like this idea

hladush · 2025-11-18T22:43:31Z

internal/pkg/heimdall/job.go

+	}
+
+	// check current job status
+	switch job.Status {


Can we run a single query that tries to update the job, and return a result based on whether 0 rows or 1 row was updated?

I think the current approach makes the most sense because we already have to make 1 call to h.getJob, and based on the status we don't actually need to make another call unless the job is in a running or new state. I could reuse the updateAsyncJobStatus function, though, and get rid of this custom one. Thoughts? @hladush

hladush · 2025-11-18T22:44:16Z

internal/pkg/heimdall/job.go

 }
+
+// isJobCancelling checks if a specific job is in CANCELLING state
+func (h *Heimdall) isJobCancelling(j *job.Job) bool {


Maybe just add getJobStatus, and let each caller interpret the result on top of it?

hladush · 2025-11-18T22:50:55Z

internal/pkg/pool/pool.go

 	Size  int `yaml:"size,omitempty" json:"size,omitempty"`
 	Sleep int `yaml:"sleep,omitempty" json:"sleep,omitempty"`
-	queue chan T
+	queue chan *job.Job


Pool is an abstract object that can do work on any type of item. Let's keep it abstract.

internal/pkg/heimdall/job.go

hladush · 2025-11-20T22:09:21Z

internal/pkg/object/command/clickhouse/clickhouse.go

 )

-type commandContext struct {
+type clickhouseCommandContext struct {


I see why you named it that way, but the usual convention in Go is not to repeat the package name in the type name.
Maybe let's name it execution context, or commandExecutionContext?

hladush · 2025-12-19T00:16:00Z

internal/pkg/heimdall/job.go


+func (h *Heimdall) cancelJob(ctx context.Context, req *jobRequest) (any, error) {
+
+	sess, err := h.Database.NewSession(false)


nit: metrics?

hladush · 2025-12-19T00:16:12Z

internal/pkg/heimdall/job.go

+
+	// Start cancellation monitoring for async jobs
+	if !j.IsSync {
+		go func() {


nit: separate method + metrics?

sanketjadhavSF · 2025-12-19T15:51:52Z

internal/pkg/heimdall/job.go

-		return err
+	// Check if context was cancelled and mark status appropriately
+	if pluginCtx.Err() != nil {
+		j.Status = jobStatus.Cancelling // janitor will update to cancelled when resources are cleaned up


Do we have a func/step in the janitor to update the status from CANCELLING to CANCELLED?

The Janitor will do this once we implement it. Once the remote resources are shut down, it will move the job to cancelled.

wlggraham · 2025-12-19T17:34:06Z

internal/pkg/heimdall/jobs_async.go

-		j.Status = status.Failed
-		j.Error = jobError.Error()
-	}
-


This is redundant. The status is already set in job.go.

wlggraham and others added 3 commits November 17, 2025 16:44

cancel jobs

5415114

update query mod

925c889

Merge branch 'main' into cancel_jobs

8f23e86

wlggraham commented Nov 18, 2025

internal/pkg/heimdall/jobs_async.go

wlggraham commented Nov 18, 2025

wlggraham requested review from hladush and sanketjadhavSF November 18, 2025 00:23

hladush reviewed Nov 18, 2025

sanketjadhavSF reviewed Nov 18, 2025

wlggraham added 2 commits November 18, 2025 14:27

move cancellation poll to main job routine

fadc32e

Merge branch 'cancel_jobs' of https://github.com/patterninc/heimdall …

8a784eb

…into cancel_jobs merge in main

hladush reviewed Nov 18, 2025

update plugins to use context

fe9d65b

wlggraham commented Nov 20, 2025

internal/pkg/heimdall/job.go

runtime s3 function fix

ae56e8e

hladush reviewed Nov 20, 2025

wlggraham added 3 commits December 18, 2025 15:02

update context naming conventions

8d073a6

add cancelled_by job field

10062cd

update readme with ecs & new endpoint

d1c858d

wlggraham added 2 commits December 18, 2025 15:44

update old cancelled_by job fields

caaa34c

spelling typo

34f6d86

hladush reviewed Dec 19, 2025

hladush previously approved these changes Dec 19, 2025

sanketjadhavSF reviewed Dec 19, 2025

sanketjadhavSF previously approved these changes Dec 19, 2025

merge main into cancel branch

3e1e6c6

wlggraham dismissed stale reviews from sanketjadhavSF and hladush via 3e1e6c6 December 19, 2025 17:32

wlggraham commented Dec 19, 2025

hladush approved these changes Dec 19, 2025

wlggraham merged commit 555b316 into main Dec 19, 2025
6 checks passed

wlggraham deleted the cancel_jobs branch December 19, 2025 17:44


