Keep and backfill Usage Event records of running apps, tasks, and service instances by joyvuu-dave · Pull Request #5210 · cloudfoundry/cloud_controller_ng

joyvuu-dave · 2026-06-19T19:20:24Z

This PR fully addresses #4182.

It solves the issue of consumers of Usage Records not having a way of determining the current state of running apps, tasks, and service instances.

With this change, usage event records related to running Apps, Tasks, and Service Instances are kept from being pruned during the normal cleanup job. A one-time backfill also seeds a baseline event (WAS_RUNNING / TASK_WAS_RUNNING) for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.

Seeded baseline events are never deleted, because consumers may already have read them. If a resource stops while the backfill is running and its baseline is left without a matching ending event, the backfill adds the missing ending event (STOPPED / DELETED / TASK_STOPPED) instead. Two consequences for consumers, both documented on the V3 usage event resources: an added ending event carries the time of the repair rather than the exact stop time, so the interval it closes can run slightly long; and consumers should close an interval on the first ending event they see and ignore any duplicates. Task stop events are also only emitted when the task has a start event or baseline on record, so consumers never see a stop they cannot pair with a start.
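The two consumer rules above can be illustrated with a minimal consumer-side sketch. It assumes a consumer that reads events forward by id. `Event`, `reconstruct_intervals`, and the generic `guid` field are hypothetical names for illustration, not part of cloud_controller_ng or the V3 API.

```ruby
# Hypothetical consumer-side sketch; none of these names exist in cloud_controller_ng.
Event = Struct.new(:id, :state, :guid, :created_at, keyword_init: true)

OPENING_STATES = %w[STARTED WAS_RUNNING].freeze
ENDING_STATES  = %w[STOPPED].freeze

# Walks events in id order and returns [closed_intervals, still_open].
def reconstruct_intervals(events)
  still_open = {} # guid => created_at of the first opening event of the run
  closed = []     # [guid, started_at, stopped_at]

  events.sort_by(&:id).each do |event|
    if OPENING_STATES.include?(event.state)
      # Keep the first opening; later openings do not move the start time.
      still_open[event.guid] ||= event.created_at
    elsif ENDING_STATES.include?(event.state)
      started_at = still_open.delete(event.guid)
      # A duplicate ending finds nothing left to close and is ignored.
      closed << [event.guid, started_at, event.created_at] if started_at
    end
  end

  [closed, still_open]
end
```

With this shape, a repaired ending that arrives after a real one is a no-op for the consumer, and a baseline with no ending yet simply stays in `still_open`.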

Deployment note

After deploying this change, run rake db:was_running_backfill once so we repair any events that happened while old API servers were still serving traffic.

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

philippthun · 2026-06-29T13:07:52Z

Heads-up: #5121 merged, replacing machinist + Sham with FactoryBot. This PR will need a rebase and a fixup - your diff adds around 29 .make / Sham lines. Mechanical conversions:

AppUsageEvent.make(args) → create(:app_usage_event, args)
ServiceUsageEvent.make(args) → create(:service_usage_event, args)
ProcessModelFactory.make(...) stays as-is (kept as a thin wrapper)
Other ClassName.make(args) → create(:symbol, args) (e.g. AppModel.make → create(:app_model))
Sham.foo → generate(:foo) only inside spec/support/factory_definitions/; elsewhere a compatibility shim keeps Sham.foo working

joyvuu-dave · 2026-06-30T01:33:52Z

Done.

The scheduled usage event cleanup job used to delete every record older than the configured cutoff age, including the opening STARTED/CREATED event of a resource that is still running. Once the cleanup deleted that event, nothing was left to reconstruct what is running right now.

Database::OldRecordCleanup can now optionally keep the records of running resources. Each model declares its lifecycles via usage_lifecycles: which states open a run (STARTED/CREATED/TASK_STARTED, plus the WAS_RUNNING/TASK_WAS_RUNNING baselines), which state closes it (STOPPED/DELETED/TASK_STOPPED), and which column names the resource. An old opening event is then only deleted when:

* a closing event for the same resource exists later and is also old -- the run is over; or
* it is neither the first opening of the current run nor the resource's latest one (again judged only against old rows).

Consumers only need the first opening (the true start time) and the latest (the current size). The ones in between, written each time a running resource is scaled or updated, tell a consumer nothing it still needs -- and deleting them is what keeps the table size bounded for long-running, frequently-changed resources.

The app and service usage event repositories turn this on with keep_running_records: true. Asking for it on a model without usage_lifecycles raises an error instead of silently deleting the records of running resources.

Task events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING -> TASK_STOPPED, matched by task_guid), so the start events of long-running tasks survive cleanup too. Task baselines use their own TASK_WAS_RUNNING state because task events carry an empty app_guid: if they said WAS_RUNNING, the app lifecycle would see them all as events of one app whose guid is '' and wrongly delete them (and the backfill's repair would write bogus STOPPED events for that phantom app).

Deletion runs in a deliberate order: first the opening events that are safe to delete, while the events that make them safe still exist; then everything else. The reverse order could delete a closing event first and leave its opening event looking like a still-running resource.

The cleanup log line now reports the row counts BatchDelete returns instead of running extra COUNT queries, and BatchDelete fetches each batch's ids in the same query that checks whether anything is left, halving the evaluations of the (potentially expensive) filtered dataset.

Also renames the positional days_ago argument to a cutoff_age_in_days keyword.
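For readers who want the retention rule spelled out, here is a rough in-memory sketch of the two deletion conditions and the two-pass order. It is not the SQL that Database::OldRecordCleanup runs. `UsageRow`, `deletable_opening?`, `surviving_rows`, and the single `guid` column are assumptions made for illustration, and only the app-style STARTED/WAS_RUNNING/STOPPED states are modeled.

```ruby
# Illustrative only: these names are hypothetical, not the cleanup's real code.
UsageRow = Struct.new(:id, :state, :guid, :created_at, keyword_init: true)

OPENING = %w[STARTED WAS_RUNNING].freeze
CLOSING = %w[STOPPED].freeze

# True when an old opening event may be deleted, judged only against old rows.
def deletable_opening?(row, old_rows)
  same_resource = old_rows.select { |r| r.guid == row.guid }
  closings = same_resource.select { |r| CLOSING.include?(r.state) }

  # Case 1: a later closing event is also old, so the run is over.
  return true if closings.any? { |r| r.id > row.id }

  # Case 2: within the current run, keep only the first and the latest opening.
  last_closing_id = closings.map(&:id).max || 0
  run_opening_ids = same_resource
                    .select { |r| OPENING.include?(r.state) && r.id > last_closing_id }
                    .map(&:id)
  row.id != run_opening_ids.min && row.id != run_opening_ids.max
end

# Returns the rows that survive a cleanup with the given cutoff.
def surviving_rows(rows, cutoff)
  old_rows = rows.select { |r| r.created_at < cutoff }

  # Pass 1: pick deletable openings while the closings that justify them still exist.
  doomed = old_rows.select { |r| OPENING.include?(r.state) && deletable_opening?(r, old_rows) }
  # Pass 2: every other old row that is not an opening event.
  doomed += old_rows.reject { |r| OPENING.include?(r.state) }

  rows - doomed
end
```

In this sketch a finished run disappears entirely, a running resource that was scaled several times keeps only its first and latest opening, and an opening whose closing event is newer than the cutoff is kept until that closing event ages out too.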

Add a composite [state, <guid>, id] index on app_usage_events and service_usage_events. The keep-running cleanup decides whether to delete an event by looking up related events of the same resource (same guid, a given state, a higher or lower id), and the backfill checks whether a resource already has an event on record; both lookups walk exactly this index. Created concurrently on Postgres. Task events need no third index: they are looked up by task_guid, a task has only a handful of events, and the existing app_usage_events_task_guid_index makes that cheap.

…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app process, a TASK_WAS_RUNNING event for every currently-running task, and a WAS_RUNNING event for every existing service instance. Billing consumers can then bootstrap a complete picture of what is running, even though the usage event cleanup deleted the original STARTED/TASK_STARTED/CREATED events long ago.

The backfill is a batched VCAP::WasRunningBackfill helper called from thin no_transaction migrations, following the bigint-migration pattern. It walks the started processes / running tasks / service instances in id order, one batch at a time, each batch in its own READ COMMITTED transaction -- so no statement comes near the migration statement timeout, and MySQL's INSERT..SELECT takes no shared next-key locks on the scanned rows while the API keeps serving traffic.

Tasks in CANCELING count as running: they stay billable until Diego reports them dead, and no usage event marks the moment a task enters CANCELING. The app query limits its package/droplet subqueries to each batch's apps so it never scans those whole tables, and it COALESCEs nullable legacy columns so one bad NULL row cannot abort a deploy.

The seeds skip any resource whose start is already on record -- an earlier baseline, or a real STARTED/TASK_STARTED/CREATED/UPDATED event -- so running the backfill again cannot give a resource a second start that a consumer would bill twice.

The API stays live during migrations, so a seed batch can race a stop or delete and write a baseline for a resource that is already gone -- or whose stop event landed earlier in the table, with a lower id. Deleting such rows would not help: consumers read these tables forward, by id, and keep what they read. A poller may already have the baseline, and for tasks a TASK_STOPPED may already have been written against it. You can delete a row; you cannot make a consumer un-read it.
So instead, a post-seed repair adds the missing ending event (STOPPED / DELETED / TASK_STOPPED) for every baseline whose resource is no longer running and that has no later ending event (one with a higher id). The ending is built from the baseline row itself, which carries every NOT NULL column an ending needs -- necessary, because the resource row may be gone entirely. A baseline that already has its real ending is never touched, and each added ending stops its baseline from matching the test, so re-running the backfill changes nothing.

Two properties of the added ending are deliberate. Its created_at is the repair time, not the true stop time: a bounded overbill that ends, which beats a missing ending billed forever. And its previous_state is the baseline's state, which no normal ending carries, so repaired endings are easy to tell apart.

A skip_was_running_backfill config flag lets operators opt out. The migrations check it (not the helper), because they are recorded as applied either way; 'rake db:was_running_backfill' runs the same seeding and repair later. Use the rake task after a skipped migration, once after the deploy that ships these migrations (to repair anything that slipped through while old API servers were still running), or after a destructive usage-event purge, which wipes the task start events that task stop events depend on. The rake task takes a session advisory lock so two runs cannot both add the same missing ending.

The migrations' down blocks are deliberate no-ops: consumers may already have read the seeded rows, and deleting a row cannot make a consumer un-read it -- it would only leave the stop events written against these rows without a start event to pair with.

Document the WAS_RUNNING/TASK_WAS_RUNNING states, their created_at semantics, the repaired ending events, and the rules consumers must follow on the V3 resources, and list the new states in the legacy V2 usage-event docs because V2 reads the same event rows.
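The repair test described above (baseline present, resource no longer running, no ending with a higher id) can be sketched over an in-memory event list. This is only an illustration under simplifying assumptions: the real helper works in SQL batches, and `UsageEvent`, `repair_missing_endings`, and `running_guids` are hypothetical names. Only the app-style WAS_RUNNING/STOPPED pair is modeled.

```ruby
# Hypothetical sketch of the post-seed repair; not VCAP::WasRunningBackfill itself.
UsageEvent = Struct.new(:id, :state, :previous_state, :guid, :created_at, keyword_init: true)

BASELINE_STATE = 'WAS_RUNNING'
ENDING_STATE   = 'STOPPED'

# Appends a missing ending for each baseline whose resource is no longer running
# and that has no ending event with a higher id. Returns the endings it added.
def repair_missing_endings(events, running_guids, now:)
  next_id = (events.map(&:id).max || 0) + 1
  added = []

  # select returns a snapshot, so appending to events below is safe.
  events.select { |e| e.state == BASELINE_STATE }.each do |baseline|
    next if running_guids.include?(baseline.guid)

    has_later_ending = events.any? do |e|
      e.guid == baseline.guid && e.state == ENDING_STATE && e.id > baseline.id
    end
    next if has_later_ending

    # Built from the baseline row: created_at is the repair time, and
    # previous_state is the baseline's state so the repair is recognizable.
    ending = UsageEvent.new(id: next_id, state: ENDING_STATE,
                            previous_state: baseline.state,
                            guid: baseline.guid, created_at: now)
    events << ending
    added << ending
    next_id += 1
  end

  added
end
```

Because each added ending has a higher id than its baseline, a second call finds nothing left to repair, which mirrors the re-run behavior the commit message describes.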

create_stop_event_if_needed skipped the TASK_STOPPED event whenever the TASK_STARTED event was absent. So a task whose start event the cleanup had already deleted never got a stop event when it finished, and a billing consumer that had recorded the start billed the task forever.

Now the stop is written when either piece of recorded start evidence exists: the TASK_STARTED event, or the TASK_WAS_RUNNING baseline the backfill seeds for tasks that were already running when the keep-running cleanup was introduced. A legitimately started task always has one of the two: the cleanup no longer deletes the start event of a running task, and the backfill covers tasks that had already lost theirs. When neither exists (say a task canceled before it ever ran), no consumer ever saw the task start, and a stop event would be noise nothing can pair with.

The after_destroy hook now goes through the same check. It used to write a stop unconditionally, so destroying a never-started PENDING task (app deletion destroys each non-terminal task) produced exactly the unmatched stop the update path avoids.

Both pieces of evidence are looked up in one query, and a comment pins a MySQL constraint: at MySQL's default REPEATABLE READ isolation level, the evidence read must be the first query in the surrounding transaction.
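The evidence check can be pictured with a small in-memory sketch. It is not the actual create_stop_event_if_needed, which queries the database. `TaskUsageEvent` and `stop_event_if_start_on_record` are hypothetical names used only here.

```ruby
# Hypothetical sketch of the start-evidence check; the real code runs one DB query.
TaskUsageEvent = Struct.new(:state, :task_guid, keyword_init: true)

START_EVIDENCE = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

# Appends and returns a TASK_STOPPED event only if the task has a recorded
# start (a real start event or a seeded baseline); otherwise returns nil.
def stop_event_if_start_on_record(events, task_guid)
  has_evidence = events.any? do |e|
    e.task_guid == task_guid && START_EVIDENCE.include?(e.state)
  end
  return nil unless has_evidence

  stop = TaskUsageEvent.new(state: 'TASK_STOPPED', task_guid: task_guid)
  events << stop
  stop
end
```

A task that was canceled before it ever ran has neither kind of evidence, so the sketch returns nil and writes nothing, matching the behavior described for both the update path and the after_destroy hook.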

joyvuu-dave mentioned this pull request Jun 19, 2026

Keep Usage Event records of running apps and services #4646

Closed

5 tasks

joyvuu-dave force-pushed the was-running-backfill branch 6 times, most recently from 0790a05 to 15657b0 Compare June 25, 2026 15:56

joyvuu-dave force-pushed the was-running-backfill branch 3 times, most recently from 2282a50 to 81492dc Compare June 29, 2026 22:40

joyvuu-dave force-pushed the was-running-backfill branch 4 times, most recently from f02f8f8 to a9542d8 Compare July 2, 2026 21:31

joyvuu-dave added 4 commits July 2, 2026 16:38

joyvuu-dave force-pushed the was-running-backfill branch from a9542d8 to a1a3a3d Compare July 2, 2026 21:39

joyvuu-dave wants to merge 4 commits into cloudfoundry:main from joyvuu-dave:was-running-backfill
