Keep and backfill Usage Event records of running apps, tasks, and service instances#5210

Open
joyvuu-dave wants to merge 4 commits into cloudfoundry:main from joyvuu-dave:was-running-backfill

@joyvuu-dave joyvuu-dave commented Jun 19, 2026

Contributor

This PR fully addresses #4182.

Until now, consumers of usage events had no way to determine the current state of running apps, tasks, and service instances. This PR gives them one.

With this change, the regular cleanup job no longer prunes usage event records that belong to running apps, tasks, and service instances. A one-time backfill also seeds a baseline event (WAS_RUNNING / TASK_WAS_RUNNING) for resources that were already running when the change shipped, so consumers can reconstruct the current state even after the original events have been pruned.

Seeded baseline events are never deleted, because consumers may already have read them. If a resource stops while the backfill is running and its baseline is left without a matching ending event, the backfill adds the missing ending event (STOPPED / DELETED / TASK_STOPPED) instead. Two consequences for consumers, both documented on the V3 usage event resources: an added ending event carries the time of the repair rather than the exact stop time, so the interval it closes can run slightly long; and consumers should close an interval on the first ending event they see and ignore any duplicates. Task stop events are also only emitted when the task has a start event or baseline on record, so consumers never see a stop they cannot pair with a start.
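To make the consumer rules above concrete, here is a minimal sketch of a consumer that rebuilds run intervals from app usage events. The `Event` struct, the `reconstruct_intervals` helper, and the integer timestamps are hypothetical and only illustrate the rule "close on the first ending event, ignore duplicates"; real usage event resources carry more fields.

```ruby
# Hypothetical event shape, for illustration only.
Event = Struct.new(:id, :state, :guid, :created_at, keyword_init: true)

OPENING_STATES = %w[STARTED WAS_RUNNING].freeze
ENDING_STATES  = %w[STOPPED].freeze

# Walks events in id order and returns closed intervals plus still-open runs.
def reconstruct_intervals(events)
  open_runs = {} # guid => time the current run opened
  closed = []

  events.sort_by(&:id).each do |event|
    if OPENING_STATES.include?(event.state)
      # Keep the first opening of a run; later openings do not move the start.
      open_runs[event.guid] ||= event.created_at
    elsif ENDING_STATES.include?(event.state)
      started_at = open_runs.delete(event.guid)
      # A duplicate ending finds no open run and is ignored.
      closed << { guid: event.guid, from: started_at, to: event.created_at } if started_at
    end
  end

  { closed: closed, open: open_runs }
end

events = [
  Event.new(id: 1, state: 'WAS_RUNNING', guid: 'app-1', created_at: 100),
  Event.new(id: 2, state: 'STARTED',     guid: 'app-2', created_at: 110),
  Event.new(id: 3, state: 'STOPPED',     guid: 'app-1', created_at: 150),
  Event.new(id: 4, state: 'STOPPED',     guid: 'app-1', created_at: 160) # duplicate ending
]

result = reconstruct_intervals(events)
```

In this sketch the interval for `app-1` closes at the first STOPPED event, and the second STOPPED is dropped because no run is open for that guid any more.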

Deployment note

After deploying this change, run rake db:was_running_backfill once to repair any events written while old API servers were still serving traffic.

  • I have reviewed the contributing guide
  • I have viewed, signed, and submitted the Contributor License Agreement
  • I have made this pull request to the main branch
  • I have run all the unit tests using bundle exec rake
  • I have run CF Acceptance Tests

@joyvuu-dave joyvuu-dave force-pushed the was-running-backfill branch 6 times, most recently from 0790a05 to 15657b0 Compare June 25, 2026 15:56
@philippthun

Member

Heads-up: #5121 merged, replacing machinist + Sham with FactoryBot. This PR will need a rebase and a fixup - your diff adds around 29 .make / Sham lines. Mechanical conversions:

  • AppUsageEvent.make(args) → create(:app_usage_event, args)
  • ServiceUsageEvent.make(args) → create(:service_usage_event, args)
  • ProcessModelFactory.make(...) stays as-is (kept as a thin wrapper)
  • Other ClassName.make(args) → create(:symbol, args) (e.g. AppModel.make → create(:app_model))
  • Sham.foo → generate(:foo) only inside spec/support/factory_definitions/; elsewhere a compatibility shim keeps Sham.foo working

@joyvuu-dave joyvuu-dave force-pushed the was-running-backfill branch 3 times, most recently from 2282a50 to 81492dc Compare June 29, 2026 22:40
@joyvuu-dave

Contributor Author

Done.

@joyvuu-dave joyvuu-dave force-pushed the was-running-backfill branch 4 times, most recently from f02f8f8 to a9542d8 Compare July 2, 2026 21:31

The scheduled usage event cleanup job used to delete every record older than
the configured cutoff age, including the opening STARTED/CREATED event of a
resource that is still running. Once the cleanup deleted that event, nothing
was left to reconstruct what is running right now.

Database::OldRecordCleanup can now optionally keep the records of running
resources. Each model declares its lifecycles via usage_lifecycles: which
states open a run (STARTED/CREATED/TASK_STARTED, plus the
WAS_RUNNING/TASK_WAS_RUNNING baselines), which state closes it
(STOPPED/DELETED/TASK_STOPPED), and which column names the resource. An old
opening event is then only deleted when:

* a closing event for the same resource exists later and is also old -- the
  run is over; or
* it is neither the first opening of the current run nor the resource's
  latest one (again judged only against old rows). Consumers only need the
  first opening (the true start time) and the latest (the current size). The
  ones in between, written each time a running resource is scaled or updated,
  tell a consumer nothing it still needs -- and deleting them is what keeps
  the table size bounded for long-running, frequently-changed resources.
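The two deletion rules above can be sketched in plain Ruby over in-memory rows. This is an illustration under stated assumptions, not the query that Database::OldRecordCleanup builds: the `Row` struct, the `old` flag (standing in for "older than the cutoff"), and `deletable_opening?` are hypothetical names, and only the app lifecycle states are modelled.

```ruby
# Hypothetical row shape; `old` stands in for "older than the cutoff age".
Row = Struct.new(:id, :state, :guid, :old, keyword_init: true)

OPENINGS = %w[STARTED WAS_RUNNING].freeze
CLOSING  = 'STOPPED'

def deletable_opening?(row, rows)
  return false unless row.old && OPENINGS.include?(row.state)

  # Only old rows of the same resource take part in the decision.
  old_same = rows.select { |r| r.old && r.guid == row.guid }

  # Rule 1: a later closing event that is itself old means the run is over.
  return true if old_same.any? { |r| r.state == CLOSING && r.id > row.id }

  # Rule 2: within the current run, keep only the first and latest openings.
  last_close_id = old_same.select { |r| r.state == CLOSING }.map(&:id).max || 0
  current_run = old_same.select { |r| OPENINGS.include?(r.state) && r.id > last_close_id }
  first_id, latest_id = current_run.map(&:id).minmax
  row.id != first_id && row.id != latest_id
end

rows = [
  Row.new(id: 1, state: 'STARTED', guid: 'app-a', old: true),
  Row.new(id: 2, state: 'STOPPED', guid: 'app-a', old: true),  # run is over
  Row.new(id: 3, state: 'STARTED', guid: 'app-b', old: true),  # first opening
  Row.new(id: 4, state: 'STARTED', guid: 'app-b', old: true),  # in between
  Row.new(id: 5, state: 'STARTED', guid: 'app-b', old: true),  # latest old opening
  Row.new(id: 6, state: 'STARTED', guid: 'app-b', old: false), # too recent to judge
  Row.new(id: 7, state: 'STARTED', guid: 'app-c', old: true),
  Row.new(id: 8, state: 'STOPPED', guid: 'app-c', old: false)  # closing is not old yet
]

deletable_ids = rows.select { |r| deletable_opening?(r, rows) }.map(&:id)
```

Here only rows 1 and 4 qualify: row 1 because its run has an old closing event, row 4 because it sits between the first and latest old openings of a run that is still open.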

The app and service usage event repositories turn this on with
keep_running_records: true. Asking for it on a model without usage_lifecycles
raises an error instead of silently deleting the records of running
resources. Task events get their own lifecycle (TASK_STARTED/TASK_WAS_RUNNING
-> TASK_STOPPED, matched by task_guid), so the start events of long-running
tasks survive cleanup too. Task baselines use their own TASK_WAS_RUNNING
state because task events carry an empty app_guid: if they said WAS_RUNNING,
the app lifecycle would see them all as events of one app whose guid is ''
and wrongly delete them (and the backfill's repair would write bogus STOPPED
events for that phantom app).

Deletion runs in a deliberate order: first the opening events that are safe
to delete, while the events that make them safe still exist; then everything
else. The reverse order could delete a closing event first and leave its
opening event looking like a still-running resource. The cleanup log line now
reports the row counts BatchDelete returns instead of running extra COUNT
queries, and BatchDelete fetches each batch's ids in the same query that
checks whether anything is left, halving the evaluations of the (potentially
expensive) filtered dataset. Also renames the positional days_ago argument to
a cutoff_age_in_days keyword.

Add a composite [state, <guid>, id] index on app_usage_events and
service_usage_events. The keep-running cleanup decides whether to delete an
event by looking up related events of the same resource (same guid, a given
state, a higher or lower id), and the backfill checks whether a resource
already has an event on record; both lookups walk exactly this index. The
indexes are created concurrently on Postgres.

Task events need no third index: they are looked up by task_guid, a task has
only a handful of events, and the existing app_usage_events_task_guid_index
makes that cheap.

…ances

Seed a synthetic WAS_RUNNING usage event for every currently-running app
process, a TASK_WAS_RUNNING event for every currently-running task, and a
WAS_RUNNING event for every existing service instance. Billing consumers can
then bootstrap a complete picture of what is running, even though the usage
event cleanup deleted the original STARTED/TASK_STARTED/CREATED events long
ago.

The backfill is a batched VCAP::WasRunningBackfill helper called from thin
no_transaction migrations, following the bigint-migration pattern. It walks
the started processes / running tasks / service instances in id order, one
batch at a time, each batch in its own READ COMMITTED transaction -- so no
statement comes near the migration statement timeout, and MySQL's
INSERT..SELECT takes no shared next-key locks on the scanned rows while the
API keeps serving traffic. Tasks in CANCELING count as running: they stay
billable until Diego reports them dead, and no usage event marks the moment a
task enters CANCELING. The app query limits its package/droplet subqueries to
each batch's apps so it never scans those whole tables, and it COALESCEs
nullable legacy columns so one bad NULL row cannot abort a deploy. The seeds
skip any resource whose start is already on record -- an earlier baseline, or
a real STARTED/TASK_STARTED/CREATED/UPDATED event -- so running the backfill
again cannot give a resource a second start that a consumer would bill twice.
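The batching and the "skip if a start is already on record" rule can be sketched as follows. This is a simplified in-memory model, not VCAP::WasRunningBackfill itself: the real helper issues one INSERT..SELECT per batch inside its own transaction, while `seed_baselines`, the hash-shaped rows, and the reduced state list here are assumptions made for illustration.

```ruby
# States that count as "a start is already on record" in this sketch
# (the commit also names CREATED, UPDATED and the task states).
START_ON_RECORD = %w[WAS_RUNNING STARTED].freeze

# Walks running resources in id order, one batch at a time, and appends a
# WAS_RUNNING baseline for each resource that has no start on record.
def seed_baselines(running, events, batch_size: 2)
  seeded = 0
  last_id = 0

  loop do
    batch = running.select { |r| r[:id] > last_id }.sort_by { |r| r[:id] }.first(batch_size)
    break if batch.empty?

    batch.each do |resource|
      on_record = events.any? do |e|
        e[:guid] == resource[:guid] && START_ON_RECORD.include?(e[:state])
      end
      next if on_record

      next_id = (events.map { |e| e[:id] }.max || 0) + 1
      events << { id: next_id, state: 'WAS_RUNNING', guid: resource[:guid] }
      seeded += 1
    end

    last_id = batch.last[:id] # resume after the last row of this batch
  end

  seeded
end

running = [{ id: 1, guid: 'p1' }, { id: 2, guid: 'p2' }, { id: 3, guid: 'p3' }]
events  = [{ id: 1, state: 'STARTED', guid: 'p2' }]

first_run  = seed_baselines(running, events)
second_run = seed_baselines(running, events)
```

The second run seeds nothing because every resource now has a baseline or a real start on record, which is the property that makes re-running the backfill safe.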

The API stays live during migrations, so a seed batch can race a stop or
delete and write a baseline for a resource that is already gone -- or whose
stop event landed earlier in the table, with a lower id. Deleting such rows
would not help: consumers read these tables forward, by id, and keep what
they read. A poller may already have the baseline, and for tasks a
TASK_STOPPED may already have been written against it. You can delete a row;
you cannot make a consumer un-read it. So instead, a post-seed repair adds
the missing ending event (STOPPED / DELETED / TASK_STOPPED) for every
baseline whose resource is no longer running and that has no later ending
event (one with a higher id). The ending is built from the baseline row
itself, which carries every NOT NULL column an ending needs -- necessary,
because the resource row may be gone entirely. A baseline that already has
its real ending is never touched, and each added ending stops its baseline
from matching the test, so re-running the backfill changes nothing. Two
properties of the added ending are deliberate. Its created_at is the repair
time, not the true stop time: a bounded overbill that ends, which beats a
missing ending billed forever. And its previous_state is the baseline's
state, which no normal ending carries, so repaired endings are easy to tell
apart.
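A minimal sketch of the repair pass, again over in-memory hashes rather than the real tables. `repair_missing_endings` and the row shape are hypothetical, and only the app case (WAS_RUNNING closed by STOPPED) is modelled; the point is the matching test (no ending with a higher id) and the two deliberate properties of the added ending.

```ruby
# Adds a STOPPED ending for every WAS_RUNNING baseline whose resource is no
# longer running and that has no ending event with a higher id.
def repair_missing_endings(events, running_guids, now:)
  added = []

  events.select { |e| e[:state] == 'WAS_RUNNING' }.each do |baseline|
    next if running_guids.include?(baseline[:guid])

    has_later_ending = events.any? do |e|
      e[:guid] == baseline[:guid] && e[:state] == 'STOPPED' && e[:id] > baseline[:id]
    end
    next if has_later_ending

    # Build the ending from the baseline row itself: the resource may be gone.
    ending = baseline.merge(
      id: events.map { |e| e[:id] }.max + 1,
      state: 'STOPPED',
      previous_state: baseline[:state], # marks the row as a repaired ending
      created_at: now                   # repair time, not the true stop time
    )
    events << ending
    added << ending
  end

  added
end

events = [
  { id: 1, state: 'WAS_RUNNING', guid: 'g1', created_at: 10 }, # still running
  { id: 2, state: 'STOPPED',     guid: 'g2', created_at: 5 },  # stop landed first
  { id: 3, state: 'WAS_RUNNING', guid: 'g2', created_at: 11 }, # baseline raced it
  { id: 4, state: 'WAS_RUNNING', guid: 'g3', created_at: 12 },
  { id: 5, state: 'STOPPED',     guid: 'g3', created_at: 20 }  # real ending exists
]

first_pass  = repair_missing_endings(events, ['g1'], now: 999)
second_pass = repair_missing_endings(events, ['g1'], now: 1000)
```

Only `g2` is repaired: its real stop has a lower id than the baseline, so a consumer reading forward would otherwise never see the run end. The second pass adds nothing because the ending written by the first pass now satisfies the test.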

A skip_was_running_backfill config flag lets operators opt out. The
migrations check it (not the helper), because they are recorded as applied
either way; 'rake db:was_running_backfill' runs the same seeding and repair
later. Use the rake task after a skipped migration, once after the deploy
that ships these migrations (to repair anything that slipped through while
old API servers were still running), or after a destructive usage-event
purge, which wipes the task start events that task stop events depend on. The
rake task takes a session advisory lock so two runs cannot both add the same
missing ending. The migrations' down blocks are deliberate no-ops: consumers
may already have read the seeded rows, and deleting a row cannot make a
consumer un-read it -- it would only leave the stop events written against
these rows without a start event to pair with.

Document the WAS_RUNNING/TASK_WAS_RUNNING states, their created_at semantics,
the repaired ending events, and the rules consumers must follow on the V3
resources, and list the new states in the legacy V2 usage-event docs because
V2 reads the same event rows.

create_stop_event_if_needed skipped the TASK_STOPPED event whenever the
TASK_STARTED event was absent. So a task whose start event the cleanup had
already deleted never got a stop event when it finished, and a billing
consumer that had recorded the start billed the task forever.

Now the stop is written when either piece of recorded start evidence exists:
the TASK_STARTED event, or the TASK_WAS_RUNNING baseline the backfill seeds
for tasks that were already running when the keep-running cleanup was
introduced. A legitimately started task always has one of the two: the
cleanup no longer deletes the start event of a running task, and the
backfill covers tasks that had already lost theirs. When neither exists (say
a task canceled before it ever ran), no consumer ever saw the task start,
and a stop event would be noise nothing can pair with.
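The start-evidence check described above amounts to a single lookup. The sketch below models it over in-memory rows; `start_evidence?` and `maybe_write_stop` are hypothetical names rather than the methods in the diff, and the real check runs as one database query.

```ruby
START_EVIDENCE = %w[TASK_STARTED TASK_WAS_RUNNING].freeze

# True when a start event or a seeded baseline exists for the task.
def start_evidence?(task_guid, events)
  events.any? { |e| e[:task_guid] == task_guid && START_EVIDENCE.include?(e[:state]) }
end

# Appends a TASK_STOPPED event only when a consumer could pair it with a start.
def maybe_write_stop(task_guid, events)
  return false unless start_evidence?(task_guid, events)

  events << { task_guid: task_guid, state: 'TASK_STOPPED' }
  true
end

events = [
  { task_guid: 't-started',  state: 'TASK_STARTED' },
  { task_guid: 't-baseline', state: 'TASK_WAS_RUNNING' }
]

wrote_for_started  = maybe_write_stop('t-started', events)
wrote_for_baseline = maybe_write_stop('t-baseline', events)
wrote_for_pending  = maybe_write_stop('t-never-ran', events) # canceled before running
```

A task that never ran leaves no evidence behind, so no stop is written for it and consumers never receive an unmatched TASK_STOPPED.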

The after_destroy hook now goes through the same check. It used to write a
stop unconditionally, so destroying a never-started PENDING task (app
deletion destroys each non-terminal task) produced exactly the unmatched
stop the update path avoids. Both pieces of evidence are looked up in one
query, and a comment pins a MySQL constraint: at MySQL's default REPEATABLE
READ isolation level, the evidence read must be the first query in the
surrounding transaction.

@joyvuu-dave joyvuu-dave force-pushed the was-running-backfill branch from a9542d8 to a1a3a3d Compare July 2, 2026 21:39