feat(transform): add merge_subwatcher_fields — general subwatcher field enrichment #138

Open
TimeToBuildBob wants to merge 2 commits into ActivityWatch:master from TimeToBuildBob:feat/merge-subwatcher-fields

@TimeToBuildBob
Contributor

Problem

ActivityWatch has several subwatchers (browser, editor) that observe finer-grained
context while a parent app is active. Users want to categorize time by these fields
(url/$domain for browser, project/file/language for editor), but today
they never reach the categorize() pipeline:

  • #352 (browser url, open since 2020): url field is never merged into the categorized window stream
  • #1305 (editor project/file): aw-webui#851 works around this via concat, but fabricates phantom events that break app/title/duration aggregations

The root cause is that aw_query has no field-enrichment-by-overlap primitive — so
client-side code had to express enrichment as concat (new events) instead of a
merge (new fields on existing events).
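
To make the concat-versus-merge distinction concrete, here is a minimal sketch. It assumes a hypothetical `Ev` dataclass in place of the real ActivityWatch event type and does not reproduce the aw-webui#851 code; it only illustrates why appending subwatcher events changes duration totals while copying fields onto existing events does not.

```python
from dataclasses import dataclass, field


@dataclass
class Ev:
    """Hypothetical stand-in for an ActivityWatch event (not aw_core.Event)."""
    start: float      # seconds since an arbitrary origin
    duration: float   # seconds
    data: dict = field(default_factory=dict)


window = [Ev(0, 60, {"app": "code", "title": "main.py"})]
editor = [Ev(10, 40, {"project": "aw-core"})]

# Concat: subwatcher events are appended as extra events,
# so the same wall-clock time is counted twice.
concatenated = window + editor
concat_total = sum(e.duration for e in concatenated)

# Merge: subwatcher fields are copied onto the existing window event,
# so the event count and the total duration stay the same.
merged = [Ev(w.start, w.duration, {**editor[0].data, **w.data}) for w in window]
merge_total = sum(e.duration for e in merged)

print(concat_total, merge_total)
```

In this toy case the concatenated stream reports 100 seconds for a 60-second window, while the merged stream still reports 60 seconds and carries the `project` field.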

Solution

Add merge_subwatcher_fields(base, sub, keys) to aw_transform:

  • For each base event, finds the longest-overlapping subwatcher event and copies
    the named keys into the base event's data dict
  • Timestamps, durations, and event count are unchanged — no phantom events, so
    app/title/duration aggregations stay correct for free
  • Backend transform → every client (webui, native UIs, exporters) benefits
  • Exposed as q2_merge_subwatcher_fields in the query2 function registry
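
The behaviour listed above can be sketched roughly as follows. This is not the code from the PR: the `Event` dataclass, the `_overlap` helper, and the name `merge_fields_sketch` are hypothetical, and the choices to skip keys missing from the subwatcher event, to keep the first candidate on ties, and to deep-copy the copied values are assumptions made for illustration.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Event:
    """Hypothetical minimal event; the real transform works on aw_core events."""
    timestamp: datetime
    duration: timedelta
    data: dict = field(default_factory=dict)


def _overlap(a: Event, b: Event) -> timedelta:
    """Length of the time range shared by a and b (zero if they are disjoint)."""
    start = max(a.timestamp, b.timestamp)
    end = min(a.timestamp + a.duration, b.timestamp + b.duration)
    return max(end - start, timedelta(0))


def merge_fields_sketch(base: List[Event], sub: List[Event], keys: List[str],
                        conflict: str = "base_wins") -> List[Event]:
    if conflict not in ("base_wins", "sub_wins"):
        raise ValueError(f"conflict must be 'base_wins' or 'sub_wins', got {conflict!r}")
    result = []
    for b in base:
        enriched = deepcopy(b)  # never mutate the caller's events
        # Attach-longest: pick the sub event that shares the most time with b.
        best = max(sub, key=lambda s: _overlap(b, s), default=None)
        if best is not None and _overlap(b, best) > timedelta(0):
            for k in keys:
                if k not in best.data:
                    continue  # assumption: missing keys are skipped
                if conflict == "base_wins" and k in enriched.data:
                    continue  # keep the base event's own value
                enriched.data[k] = deepcopy(best.data[k])
        result.append(enriched)
    return result


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
base = [Event(t0, timedelta(minutes=10), {"app": "code"})]
sub = [
    Event(t0, timedelta(minutes=2), {"project": "a"}),
    Event(t0 + timedelta(minutes=2), timedelta(minutes=8), {"project": "b"}),
]
out = merge_fields_sketch(base, sub, ["project"])
print(out[0].data)
```

Here the second subwatcher event overlaps the base event for 8 of its 10 minutes, so its `project` value is the one attached; the base event's timestamp and duration are left as they were.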

Usage (in a query)

editor_events = flood(query_bucket(bid_editor));
editor_events = filter_period_intersect(editor_events, events);
events = merge_subwatcher_fields(events, editor_events, ["project", "file", "language"]);
events = categorize(events, classes);
RETURN = merge_events_by_keys(events, ["app", "$category"]);

Same pattern for browser url enrichment (closes #352):

browser_events = split_url_events(flood(query_bucket(bid_browser)));
browser_events = filter_period_intersect(browser_events, events);
events = merge_subwatcher_fields(events, browser_events, ["url", "$domain", "$path"]);

Tests

6 new tests covering:

  • Basic field injection into overlapping base events
  • No-overlap passthrough (base unchanged)
  • conflict="base_wins" (default): base keys not overwritten
  • conflict="sub_wins": subwatcher fields win
  • N:1 overlap: attach-longest strategy selects the best subwatcher event
  • Empty sub / empty keys short-circuit

All 164 existing tests continue to pass.

Related

  • Closes #1305 (editor-bucket-aware categorization) — together with a follow-up to aw-webui to swap the concat approach for this transform
  • Closes #352 (categorize by browser URL) — same mechanism, wire browser url/domain through the same loop
  • aw-webui#851 remains a working stopgap; this transform enables the clean replacement

… enrichment

Adds a new backend transform `merge_subwatcher_fields(base, sub, keys)` that
enriches base window events with extra fields from a subwatcher (editor, browser)
by finding the longest-overlapping subwatcher event per base event and copying
the named keys into the base event's data dict.

Unlike the concat workaround used in aw-webui#851, this approach:
- Preserves timestamps, durations, and event count exactly (no phantom events)
- Keeps app/title/duration aggregations correct by construction
- Lives in the backend so every client (webui, native UIs, exporters) benefits
- Handles both editor (project/file/language) and browser (url/$domain) fields
  through one mechanism, closing both #1305 and #352

Exposed as q2_merge_subwatcher_fields in the query2 function registry.
Adds 6 tests covering: basic enrichment, no-overlap passthrough, base_wins and
sub_wins conflict modes, attach-longest N:1 overlap selection, and empty inputs.

Closes ActivityWatch/activitywatch#1305, ActivityWatch/activitywatch#352
@greptile-apps

greptile-apps Bot commented May 31, 2026

Greptile Summary

This PR adds merge_subwatcher_fields — a new backend transform that enriches base window events with fields from a subwatcher bucket (browser URL, editor project/file/language) by finding the longest-overlapping subwatcher event, leaving timestamps and durations unchanged and avoiding the phantom-event problem of the existing concat workaround.

  • New transform (aw_transform/merge_subwatcher_fields.py): implements conflict resolution (base_wins / sub_wins), a ValueError guard on invalid conflict strings, and defensive deep-copying of all result events; exposed via q2_merge_subwatcher_fields in the query2 registry with the conflict parameter forwarded correctly.
  • Tests (tests/test_transforms.py): seven new tests cover basic enrichment, no-overlap passthrough, both conflict modes, the attach-longest tie-break, empty inputs, and invalid conflict values — all existing 164 tests continue to pass.

Confidence Score: 5/5

Safe to merge; the transform is additive and does not touch timestamps, durations, or event counts.

The core algorithm is correct: the break/continue boundary conditions are consistent with the rest of the codebase, the conflict-resolution guard surfaces invalid values immediately, and every result event is a deep copy of the original base. The two findings are both defensive-programming suggestions — one around deep-copying mutable subwatcher values (harmless for the typical string data AW stores today), and one around algorithm scalability for very large datasets. Neither introduces incorrect behaviour for the described use cases.

aw_transform/merge_subwatcher_fields.py — the value-copy and inner-scan patterns are worth a second look before this is used on large date ranges.
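
One possible way to address the scalability note is a sweep that remembers where the previous base event's scan started. The sketch below is an assumption about how that could look, not code from the PR: it represents events as hypothetical `(start, end)` tuples and requires both lists to be sorted by start time.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # hypothetical simplification: (start, end) in seconds


def longest_overlap_indices(base: List[Span], sub: List[Span]) -> List[Optional[int]]:
    """For each base span, return the index of the sub span with the longest overlap.

    Assumes both lists are sorted by start time. Returns None where nothing overlaps.
    """
    out: List[Optional[int]] = []
    lo = 0  # first sub span that can still overlap the current or a later base span
    for b_start, b_end in base:
        # A sub span ending at or before b_start cannot overlap this base span,
        # nor any later one, because base is sorted by start.
        while lo < len(sub) and sub[lo][1] <= b_start:
            lo += 1
        best, best_len = None, 0.0
        j = lo
        # Stop at the first sub span starting at or after b_end; later ones start later still.
        while j < len(sub) and sub[j][0] < b_end:
            length = min(b_end, sub[j][1]) - max(b_start, sub[j][0])
            if length > best_len:
                best, best_len = j, length
            j += 1
        out.append(best)
    return out


base = [(0, 10), (10, 20), (30, 40)]
sub = [(0, 3), (3, 12), (12, 19), (25, 28)]
result = longest_overlap_indices(base, sub)
print(result)
```

When the subwatcher events do not overlap each other, each one is visited only a small number of times, so the total work is close to linear in the two list lengths. With heavily overlapping subwatcher events the inner scan can still degrade toward the quadratic case.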

Important Files Changed

| Filename | Overview |
| --- | --- |
| aw_transform/merge_subwatcher_fields.py | New transform implementing the attach-longest overlap strategy; correctness looks solid, but subwatcher values are copied by reference rather than deep-copied, and the inner scan restarts from index 0 for every base event (O(n×m)). |
| aw_query/functions.py | Adds q2_merge_subwatcher_fields wrapper with correct decorator ordering and conflict parameter forwarding; no issues. |
| aw_transform/__init__.py | Correctly exports merge_subwatcher_fields via import and __all__; no issues. |
| tests/test_transforms.py | Seven new tests cover the main code paths (basic enrichment, no-overlap, both conflict modes, attach-longest selection, empty inputs, invalid conflict value); good coverage. |

Sequence Diagram

sequenceDiagram
    participant Q as Query Pipeline
    participant FPI as filter_period_intersect
    participant MSF as merge_subwatcher_fields
    participant CAT as categorize

    Q->>FPI: subwatcher_events, base_events
    FPI-->>Q: clipped subwatcher_events

    Q->>MSF: base_events, clipped subwatcher_events, keys
    note over MSF: For each base event:<br/>find longest-overlapping sub event<br/>(attach-longest strategy)
    MSF-->>Q: enriched base_events (timestamps/durations unchanged)

    Q->>CAT: enriched base_events, classes
    CAT-->>Q: categorized events with $category

Reviews (2): Last reviewed commit: "fix(transform): address Greptile P1/P2 f..."

Comment thread aw_transform/merge_subwatcher_fields.py Outdated
Comment on lines +64 to +65
if not subwatcher_events or not keys:
    return base_events

P1 The early-return path returns the original base_events list by reference rather than a defensive copy. Every other code path through this function returns a list of deepcopy-ed events, so callers reasonably expect the returned value to be independent of the input. With the early return, any in-place mutation of the result silently corrupts the original list, which is especially dangerous in a query pipeline where the same events variable is reused.

Suggested change
- if not subwatcher_events or not keys:
-     return base_events
+ if not subwatcher_events or not keys:
+     return [deepcopy(e) for e in base_events]

Contributor Author

Fixed in e78081b: early return now returns [deepcopy(e) for e in base_events] instead of the original list reference, matching the contract documented in the docstring ('Returns: A new list of base events'). Test updated to assert result is not base.
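
The aliasing hazard discussed in this thread can be reproduced with a small sketch. The two functions below are hypothetical stand-ins that implement only the early-return path, and plain dicts are used in place of real events.

```python
from copy import deepcopy


def passthrough_by_reference(base_events, subwatcher_events, keys):
    if not subwatcher_events or not keys:
        return base_events  # same list object as the input
    raise NotImplementedError("merge path omitted in this sketch")


def passthrough_by_copy(base_events, subwatcher_events, keys):
    if not subwatcher_events or not keys:
        return [deepcopy(e) for e in base_events]  # independent copies
    raise NotImplementedError("merge path omitted in this sketch")


# Returning the input list: mutating the result also changes the caller's data.
base = [{"data": {"app": "firefox"}}]
aliased = passthrough_by_reference(base, [], ["url"])
aliased[0]["data"]["app"] = "mutated"
leaked = base[0]["data"]["app"]

# Returning deep copies: the caller's data is left alone.
base = [{"data": {"app": "firefox"}}]
copied = passthrough_by_copy(base, [], ["url"])
copied[0]["data"]["app"] = "mutated"
preserved = base[0]["data"]["app"]

print(leaked, preserved)
```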

Comment thread aw_transform/merge_subwatcher_fields.py Outdated
Comment on lines +64 to +65
if not subwatcher_events or not keys:
    return base_events

P2 The conflict parameter is compared only against "base_wins", so any other string — including typos like "sub_win" — silently behaves as "sub_wins". An explicit validation guard surfaces mistakes early.

Suggested change
- if not subwatcher_events or not keys:
-     return base_events
+ if conflict not in ("base_wins", "sub_wins"):
+     raise ValueError(f"conflict must be 'base_wins' or 'sub_wins', got {conflict!r}")
+ if not subwatcher_events or not keys:
+     return base_events

Contributor Author

Fixed in e78081b: added if conflict not in ("base_wins", "sub_wins"): raise ValueError(...) guard before the early-return check, so invalid values (including typos like 'sub_win') surface immediately. Test test_merge_subwatcher_fields_invalid_conflict covers this.
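
A short sketch of why the order of the two checks matters: with the guard placed first, a typo is reported even when the inputs are empty and the function would otherwise short-circuit. The function name `check_then_short_circuit` is hypothetical and the merge path is omitted.

```python
from copy import deepcopy

VALID_CONFLICTS = ("base_wins", "sub_wins")


def check_then_short_circuit(base_events, subwatcher_events, keys, conflict="base_wins"):
    # Validate first, so a bad value is never hidden by the early return below.
    if conflict not in VALID_CONFLICTS:
        raise ValueError(f"conflict must be 'base_wins' or 'sub_wins', got {conflict!r}")
    if not subwatcher_events or not keys:
        return [deepcopy(e) for e in base_events]
    raise NotImplementedError("merge path omitted in this sketch")


raised = False
message = ""
try:
    # Empty inputs would normally short-circuit, but the typo is caught first.
    check_then_short_circuit([], [], [], conflict="sub_win")
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```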

Comment thread aw_query/functions.py Outdated
Comment on lines +231 to +236
@q2_function(merge_subwatcher_fields)
@q2_typecheck
def q2_merge_subwatcher_fields(
    base_events: list, subwatcher_events: list, keys: list
) -> List[Event]:
    return merge_subwatcher_fields(base_events, subwatcher_events, keys)

P2 The conflict parameter from merge_subwatcher_fields is not forwarded through the query2 wrapper, so query2 callers are permanently limited to "base_wins" with no way to opt into "sub_wins". Since q2_typecheck skips optional parameters, adding conflict: str = "base_wins" and forwarding it is a safe, non-breaking change.

Suggested change
- @q2_function(merge_subwatcher_fields)
- @q2_typecheck
- def q2_merge_subwatcher_fields(
-     base_events: list, subwatcher_events: list, keys: list
- ) -> List[Event]:
-     return merge_subwatcher_fields(base_events, subwatcher_events, keys)
+ @q2_function(merge_subwatcher_fields)
+ @q2_typecheck
+ def q2_merge_subwatcher_fields(
+     base_events: list, subwatcher_events: list, keys: list, conflict: str = "base_wins"
+ ) -> List[Event]:
+     return merge_subwatcher_fields(base_events, subwatcher_events, keys, conflict=conflict)

Contributor Author

Fixed in e78081b: q2_merge_subwatcher_fields now accepts conflict: str = "base_wins" and forwards it to merge_subwatcher_fields. Query2 callers can now opt into sub_wins via the standard QL2 function call syntax.
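
The effect of forwarding the parameter can be shown without the query2 decorators. In this sketch `transform` and both wrappers are hypothetical plain functions; the real `q2_function` and `q2_typecheck` decorators are deliberately left out because their behaviour is not shown in this thread.

```python
def transform(base, sub, keys, conflict="base_wins"):
    """Hypothetical stand-in that only reports which conflict mode it received."""
    return conflict


def wrapper_without_forwarding(base, sub, keys):
    # Callers have no way to pass conflict, so the default always applies.
    return transform(base, sub, keys)


def wrapper_with_forwarding(base, sub, keys, conflict="base_wins"):
    # The optional parameter is accepted and passed straight through.
    return transform(base, sub, keys, conflict=conflict)


default_mode = wrapper_with_forwarding([], [], [])
chosen_mode = wrapper_with_forwarding([], [], [], "sub_wins")

rejected = False
try:
    wrapper_without_forwarding([], [], [], conflict="sub_wins")
except TypeError:
    rejected = True  # the narrower wrapper cannot accept the argument at all

print(default_mode, chosen_mode, rejected)
```

Because the new parameter has a default, existing three-argument calls keep working unchanged.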

…ields

- P1: early return now returns deepcopy list instead of original reference,
  matching the contract 'Returns: A new list of base events'
- P2a: validate conflict param before early return; invalid values raise
  ValueError rather than silently behaving as 'sub_wins'
- P2b: forward conflict param through q2_merge_subwatcher_fields wrapper
  so query2 callers can opt into 'sub_wins' instead of being locked to default
- test: extend empty-inputs test to assert result is not the same list object
- test: add test_merge_subwatcher_fields_invalid_conflict for P2a validation
@TimeToBuildBob
Contributor Author

@greptileai review

@TimeToBuildBob
Contributor Author

Addressed all Greptile findings in e78081b:

  • P1 (defensive copy): merge_subwatcher_fields early-return path now returns [deepcopy(e) for e in base_events] — consistent with every other code path; test asserts result is not base.
  • P2a (conflict validation): Added ValueError guard before the early return; invalid strings like "sub_win" now raise immediately rather than silently falling through to sub_wins semantics.
  • P2b (q2 wrapper): q2_merge_subwatcher_fields now accepts and forwards conflict: str = "base_wins", so QL2 callers can opt into "sub_wins".

CI running. Greptile re-review triggered.

@TimeToBuildBob
Contributor Author

CI is green on all platforms (ubuntu, macOS, Windows, lint) and Greptile re-review came back 5/5 after the P1/P2 fixes in e78081b.

This PR is ready for maintainer review and merge. Happy to address any additional feedback.

@TimeToBuildBob
Contributor Author

Ready to merge — CI green, Greptile 5/5. Erik approved on ActivityWatch/activitywatch#1305 at 2026-05-31T12:53:08Z. Waiting for a maintainer with merge permissions.
