
spec(manipulation/memory): object memory tracker on memory2, propose … #2067

Draft
jhengyilin wants to merge 4 commits into dimensionalOS:main from jhengyilin:feature/mainpulation_with_memory_jhengyi

Conversation


@jhengyilin jhengyilin commented May 13, 2026

Spec for the object memory tracker on memory2 (Refs #1893)

flowchart TB
    camera([camera])
    perception["ObjectSceneRegistrationModule<br/><i>detection only — no ObjectDB</i>"]
    tracker["ObjectMemoryTracker<br/><i>(MemoryModule)</i>"]
    manipulation["PickAndPlaceModule<br/><i>manipulation — no API change</i>"]
    skills["@skill recall(name)<br/><i>cross-session memory</i>"]

    camera --> perception
    perception -->|"raw detections<br/>list[DetObject]"| tracker
    tracker -->|"tracked_objects (port)<br/>list[DetObject]"| manipulation

    subgraph m2 ["memory2 — source of truth"]
        direction TB
        obs[("object_observations<br/><i>dense — log</i>")]
        events[("object_events<br/><i>sparse — lifecycle</i>")]
    end

    tracker == "append + inline cache update" ==> obs
    tracker == "append + inline cache update" ==> events

    obs -. "sync .to_list() replay on start()" .-> tracker
    events -. "sync .to_list() replay on start()" .-> tracker

    events --> skills

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class obs,events stream
    class perception,tracker,manipulation module
    class camera,skills external

How it works — proposed architecture workflow

| t (s) | Event | Tracker's response |
|---|---|---|
| 0 | First scan sees "cup" at (0.4, 0.1, 0.9) | No match → APPEARED event + observation. confidence = 1.0 |
| 2–10 | More scans of the same cup | Tight-spatial match → observation each time. After 6 detections → PROMOTED. Cup is in tracked_objects. |
| 14 | Hand covers camera, no detection | confidence ≈ 0.77 — still confident |
| 20 | Hand still there | confidence ≈ 0.51 — borderline |
| 24 | Hand still there | confidence ≈ 0.41 → tentative. Out of snapshot, still match-eligible. |
| 25 | Hand moves, scan sees cup again | Tight-spatial match → observation. Confidence resets to 1.0. No duplicate identity. |
| 60 | User moves cup to (1.0, 0.5, 0.9) — 70 cm away | Tight match fails (>0.2 m). Wider-radius voted-name match (drift) → MOVED event. No phantom at old position. |
| 120 | User takes the cup away | After ~45 s of decay, confidence < 0.1 → LOST event. Cup moves to recently-lost bucket. |
| 200 | Process crashes and restarts | Sync replay (stream.to_list() over both streams) rebuilds the cache from memory2 before the tracker accepts new detections. No bespoke load code. |
| 205 | Agent calls recall("cup") | Query: events.tags(name="cup").last() → returns LOST event. Process answers about a cup it never saw in its own lifetime. |
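
The confidence values in the table are consistent with a simple exponential decay since the last matched observation. A minimal sketch of that reading, assuming exponential decay driven by the time_constant_s = 15 tunable described below (the function name, the tentative band constant, and the LOST threshold constant are illustrative names, not the actual implementation):

```python
import math

TIME_CONSTANT_S = 15.0        # the one tunable: how forgiving the tracker is of occlusion
TENTATIVE_BAND = (0.2, 0.5)   # mid-confidence objects stay match-eligible but leave the snapshot
LOST_THRESHOLD = 0.1          # below this the tracker emits a LOST event

def confidence(now_s: float, last_seen_s: float) -> float:
    """Belief that an object is still present, decaying since its last matched observation."""
    return math.exp(-(now_s - last_seen_s) / TIME_CONSTANT_S)

# Reproduces the walkthrough numbers (last observation around t = 10):
#   confidence(14, 10)  ≈ 0.77   still confident
#   confidence(20, 10)  ≈ 0.51   borderline
#   confidence(110, 42) ≈ 0.011  < 0.1 → LOST (the v3 walkthrough further below)
```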

Why this design:

  • Two streams in memory2: object_observations (every matched detection — the evidence) and object_events (lifecycle transitions APPEARED / PROMOTED / LABEL_CHANGED / MOVED / LOST — the story).

  • Continuous belief over binary present/absent — one tunable (time_constant_s = 15) controls how forgiving the tracker is of occlusion. The tentative band (0.2 – 0.5) keeps mid-confidence objects match-eligible, so a single missed scan can't create a duplicate identity.

  • Memory2 holds the persistent record — object history lives in the streams across sessions. The tracker reads from memory2 on startup, so cross-session memory comes for free.

  • No change to manipulation's API: tracked_objects publishes list[Object] (same type used today). PickAndPlaceModule works without modification.
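
To make the two-stream split concrete, here is a hedged sketch of what a record on each stream might carry. Only the stream names, the lifecycle kinds, and the append-plus-replay pattern come from this spec; the field names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class LifecycleEvent(str, Enum):
    APPEARED = "APPEARED"
    PROMOTED = "PROMOTED"
    LABEL_CHANGED = "LABEL_CHANGED"
    MOVED = "MOVED"
    LOST = "LOST"

@dataclass
class ObjectObservation:
    """Dense stream: object_observations — one row per matched detection (the evidence)."""
    identity_id: str                       # stable id the tracker assigned to this object
    name: str                              # label of this particular detection
    position: tuple[float, float, float]   # 3D position in the world frame
    ts: float                              # detection timestamp (seconds)

@dataclass
class ObjectEvent:
    """Sparse stream: object_events — one row per lifecycle transition (the story)."""
    identity_id: str
    kind: LifecycleEvent
    name: str                              # voted name at the time of the transition
    position: tuple[float, float, float]
    ts: float
```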

Solves the two issues we discussed:

  1. Stable labels despite YOLO label flicker (vote across recent detections instead of latest-detection-wins; see the sketch after this list)
  2. Memory between actions (soft persistence + re-acquire + survives restart)
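
A minimal sketch of the label-voting idea from point 1, assuming the tracker keeps a short history of labels per identity (the class, window size, and helper names are illustrative):

```python
from collections import Counter, deque

class VotedLabel:
    """Vote across recent detections so a single YOLO flicker cannot rename an identity."""

    def __init__(self, window: int = 15):
        self._recent: deque[str] = deque(maxlen=window)  # labels from the last N matched detections

    def add(self, label: str) -> None:
        self._recent.append(label)

    @property
    def name(self) -> str:
        if not self._recent:
            raise ValueError("no detections voted yet")
        # Majority label wins; one "vase" frame among many "cup" frames does not change the name.
        return Counter(self._recent).most_common(1)[0][0]

# Illustrative use inside the tracker:
#   tracked.votes.add(det.name)                       # on every matched detection
#   if tracked.votes.name != tracked.name: ...        # emit LABEL_CHANGED, update tracked.name
```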

…architecture design for integrating memory2 with current manipulation stack

jhengyilin commented May 13, 2026

Memory2-native perception (Refs #1893)

Goal: give the agent natural-language access to its workspace — find, recall, and manipulate objects without a fixed class list, with the tracker maintaining identity and lifecycle automatically.

flowchart TB
    camera([camera])

    subgraph m2 ["memory2"]
        direction TB
        semsearch["SemanticSearch<br/><i>continuous CLIP, brightness/sharpness filtered</i>"]
        subgraph db ["recording.db"]
            direction LR
            color[("color_image")]
            depth[("depth_image")]
            info[("camera_info")]
            embedded[("color_image_embedded")]
            obs[("object_observations")]
            events[("object_events")]
        end
    end

    recorder["RGBDCameraRecorder"]
    lazy["LazyPerceptionModule<br/><i>@skill find_objects(prompts)</i><br/>+ startup scan + 10s heartbeat"]
    tracker["ObjectMemoryTracker<br/><i>identity + lifecycle</i>"]
    manip["PickAndPlaceModule<br/><i>manipulation — no API change</i>"]
    recall["@skill recall(name)"]

    camera --> recorder
    recorder ==> color
    recorder ==> depth
    recorder ==> info
    color -. "auto-subscribe" .-> semsearch
    semsearch ==> embedded
    embedded -. "pulled on trigger" .-> lazy
    depth -. ".at(peak.ts)" .-> lazy
    info -. ".last()" .-> lazy
    lazy ==>|"list[Object]"| tracker
    tracker ==> obs
    tracker ==> events
    tracker ==>|"tracked_objects"| manip
    tracker ==>|"watched_names: set[str]"| lazy
    events --> recall

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class color,depth,info,embedded,obs,events stream
    class recorder,lazy,semsearch,tracker,recall,manip module
    class camera external

What this unlocks (agent-facing API)

  • find_objects(prompts) — open-vocab detection. Agent passes any natural language ("cup", "red mug with handle", "the screwdriver near the corner"); returns 3D positions for
    matching objects.
  • recall(name) — cross-session memory. "Where did I last see X?" Answers across process restarts.
  • pick(name) — manipulation, reads the tracker's snapshot.

Once any object enters the tracker (via agent call, startup scan, or heartbeat), the tracker's watched_names port adds it to the heartbeat's scan set automatically — discovered objects keep being tracked without the agent re-querying.
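
A sketch of how the heartbeat's scan set could be assembled from the blueprint defaults plus the tracker's watched_names port. The union default_prompts ∪ watched_names and the 10 s period come from this spec; the constant values, helper name, and port-access syntax are illustrative:

```python
DEFAULT_PROMPTS = {"cup", "bowl", "plate"}   # blueprint-configured ambient prompts (example values)
HEARTBEAT_PERIOD_S = 10.0                    # heartbeat scan interval

def heartbeat_prompts(default_prompts: set[str], watched_names: set[str]) -> set[str]:
    """Anything the agent or the startup scan ever named keeps being scanned for."""
    return default_prompts | watched_names

# Every HEARTBEAT_PERIOD_S seconds (illustrative wiring):
#   prompts = heartbeat_prompts(DEFAULT_PROMPTS, tracker.watched_names)
#   detections = find_objects(", ".join(sorted(prompts)))   # open-vocab scan, comma-separated prompts
#   tracker.ingest(detections)                               # feeds the tracker's raw_detections seam
```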

How it works — v3 workflow

| t (s) | Event | What happens |
|---|---|---|
| 0 | Blueprint boots | Recorder + SemanticSearch begin. CLIP continuously embeds qualifying frames into color_image_embedded. |
| 5 | Startup scan fires | LazyPerceptionModule scans for blueprint-configured startup_prompts; seeds the tracker with whatever is already in view. |
| 10 | User places a cup at (0.4, 0.1, 0.9) | Recorder captures. |
| 15 | Agent: find_objects("cup, bowl, plate") | CLIP search → peaks → VLM → 3D project. Cup detected. APPEARED. Tracker publishes watched_names = {"cup"} — heartbeat now auto-scans for cup. |
| 25 | User moves cup to (1.0, 0.5, 0.9) | Recorder captures. |
| 30 | Heartbeat (every 10 s) | Scans default_prompts ∪ watched_names = {"cup"}. Detects cup at new location. Tier-3 drift match — silent state refresh; snapshot publishes new pose. |
| 45 | User removes cup | Recorder captures empty scene. |
| 110 | Heartbeat — lookback now excludes all cup frames | No detection. confidence(110, 42) ≈ 0.011 < 0.1 → LOST event fires. |
| 115 | Agent: recall("cup") | events.tags(name="cup").last() → "Last saw a cup at (1.0, 0.5, 0.9) (about 73s ago — event: LOST)". |
| restart | Process crashes and restarts | tracker.start() replays both streams; recovers identities, voted names, _lost bucket. Heartbeat resumes via republished watched_names. |

The result: cross-session continuity for manipulation.
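
A hedged sketch of the recall(name) skill built from the events-stream query quoted in the table. The .tags()/.last() primitives and the response format come from this spec; the event field names, the time source, and the assumption that the events stream handle and the framework's @skill registration are in scope are illustrative:

```python
import time

def recall(name: str) -> str:
    """Answer "where did I last see X?" from the object_events stream, across restarts."""
    event = events.tags(name=name).last()   # most recent lifecycle event for this name
    if event is None:
        return f"I have never seen a {name}."
    age_s = time.time() - event.ts
    x, y, z = event.position
    return (f"Last saw a {name} at ({x:.1f}, {y:.1f}, {z:.1f}) "
            f"(about {age_s:.0f}s ago — event: {event.kind})")
```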

Why this design

  • Open-vocab end-to-end. No fixed vocabulary in any module. Heartbeat scans default_prompts ∪ watched_names, so anything the agent or startup scan discovers is automatically maintained
    by ambient detection. Closes the lifecycle gap for any object the agent ever names.
  • Cheap continuous + expensive on-demand. CLIP runs always to build the embedding index that drives fast semantic search. VLM runs only on triggered scans (agent call
    / startup / heartbeat).
  • Tracker is detector-agnostic. raw_detections: In[list[DetObject]] is the only seam. PickAndPlaceModule works
    without modification.
  • Same persistence model. Same memory2 streams (object_observations, object_events), same synchronous replay on start(). Cross-session memory comes for free, and the new watched_names Out port lets the heartbeat resume tracking exactly what the previous session was tracking (see the sketch after this list).
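
The replay that last bullet refers to, sketched under the spec's .to_list() primitive (the stream attribute names, the _apply_* helpers, and the internal cache are illustrative):

```python
class ObjectMemoryTracker:
    def start(self) -> None:
        """Rebuild the in-memory cache from memory2 before accepting new detections."""
        # Dense evidence first: restores identities and their last-seen poses/timestamps.
        for obs in self.object_observations.to_list():
            self._apply_observation(obs, replay=True)

        # Then the sparse lifecycle stream: restores voted names and the recently-lost bucket.
        for event in self.object_events.to_list():
            self._apply_event(event, replay=True)

        # Only now start consuming raw_detections, and republish watched_names
        # so the heartbeat resumes scanning what the previous session was tracking.
        self.watched_names.publish({t.name for t in self._tracked.values()})
```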

jhengyilin (Author) commented:

Memory2-native perception for manipulation (Refs #1893)

flowchart TB
    camera([camera])

    subgraph m2 ["memory2"]
        direction TB
        semsearch["SemanticSearch<br/><i>continuous CLIP, brightness/sharpness filtered</i>"]
        subgraph db ["recording.db"]
            direction LR
            color[("color_image")]
            depth[("depth_image")]
            info[("camera_info")]
            embedded[("color_image_embedded")]
        end
    end

    recorder["RGBDCameraRecorder"]
    lazy["LazyPerceptionModule<br/><i>skills: find_objects · find_objects_near · recall</i>"]
    manip["PickAndPlaceModule<br/><i>manipulation — reads latest_detections</i>"]

    camera --> recorder
    recorder ==> color
    recorder ==> depth
    recorder ==> info
    color -. "auto-subscribe" .-> semsearch
    semsearch ==> embedded
    embedded -. ".search → .filter → .order_by(ts).first()" .-> lazy
    depth -. ".at(obs.ts)" .-> lazy
    info -. ".last()" .-> lazy
    lazy ==>|"latest_detections: list[Object]"| manip

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class color,depth,info,embedded stream
    class recorder,lazy,semsearch,manip module
    class camera external

What this unlocks (agent-facing API)

Three skills, each a one-line composition of memory2 primitives. Every skill returns the most recent confident match along with its timestamp.

| Skill | Composition | Returns |
|---|---|---|
| find_objects(prompt) | .search(vec).filter(sim≥thr).order_by("ts",desc).first() → VLM → 3D project | list[Object] + "(seen Ns ago)" summary |
| find_objects_near(prompt, x, y, z, radius=1.0) | .near((x,y,z),r).search(...) (same as above) | list[Object] + "(seen Ns ago)" summary |
| recall(name) | .search(vec).filter(sim≥thr).order_by("ts",desc).first() (no VLM, cheaper) | Camera pose at match + "(seen Ns ago)" |
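
To ground the table, a sketch of find_objects(prompt) as that one-line composition. The .search/.filter/.order_by/.first chain, .at(ts), .last(), and the latest_detections port come from this spec; the embedding, VLM, and projection helpers, the threshold constant, and the exact call signatures are placeholder assumptions:

```python
SIM_THRESHOLD = 0.2   # similarity threshold (example value from the walkthrough below)

def find_objects(prompt: str) -> str:
    """Open-vocab detection: most recent confident CLIP match → VLM detect → 3D project."""
    vec = embed_text(prompt)                                      # CLIP text embedding (placeholder)
    match = (color_image_embedded
             .search(vec)                                         # semantic search over embedded frames
             .filter(lambda m: m.similarity >= SIM_THRESHOLD)     # keep confident matches only
             .order_by("ts", desc=True)
             .first())                                            # most recent confident frame
    if match is None:
        return f"No confident match for '{prompt}'."

    detections = vlm_detect(match.image, prompt)                  # expensive VLM, only on demand
    depth = depth_image.at(match.ts)                              # depth frame from the same instant
    intrinsics = camera_info.last()
    objects = [project_to_3d(d, depth, intrinsics) for d in detections]

    latest_detections.publish(objects)                            # manipulation reads this port
    age_s = now_s() - match.ts
    positions = ", ".join(str(o.position) for o in objects)
    return f"Found {len(objects)} {prompt} at {positions} (seen {age_s:.0f}s ago)"
```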

How it works — walkthrough

| t (s) | Event | What happens |
|---|---|---|
| 0 | Blueprint boots | Recorder + SemanticSearch begin. CLIP continuously embeds qualifying frames into color_image_embedded. No detection runs. |
| 10 | User places a cup at (0.4, 0.1, 0.9) | Recorder captures color/depth/intrinsics with tf-resolved world pose. CLIP embeds the frame. |
| 15 | Agent: find_objects("cup") | .search(vec).filter(sim≥0.2).order_by("ts",desc).first() → VLM → 3D. Returns "Found 1 cup at (0.4, 0.1, 0.9) (seen 2s ago)". Publishes [cup] on latest_detections. |
| 16 | Agent: pick("cup") | Manipulation reads latest_detections, picks at (0.4, 0.1, 0.9). |
| 30 | User places a screwdriver near the workbench at (1.0, 0.5, 0.8) | Recorder captures. |
| 35 | Agent: find_objects_near("screwdriver", 1.0, 0.5, 0.8, radius=0.5) | .near((1,0.5,0.8),0.5).search(...).filter().order_by("ts",desc).first() — memory2's R*Tree pre-filters to frames captured at the workbench; only those go to VLM. |
| 45 | User removes the cup | Recorder captures empty scene. |
| 50 | Agent: find_objects("cup") | Most recent confident cup match is the t≈42 frame — returns "Found 1 cup at (0.4, 0.1, 0.9) (seen 8s ago)". Agent reads "8s ago" and decides whether to re-query or act. |
| 120 | Process crashes and restarts | New process, same recording.db. CLIP embeddings persist. |
| 125 | Agent: recall("cup") | Returns "Last saw 'cup' with camera near (X, Y, Z) (seen 105s ago)" — works because memory2's SQLite is the persistence. Process answers about a cup it never saw in its own lifetime. |
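
The t = 35 row leans on memory2's spatial index. Under the same assumptions as the previous sketch, only the query changes for find_objects_near; the VLM → 3D → publish tail is identical (the helper name below is illustrative):

```python
def near_query(prompt: str, x: float, y: float, z: float, radius: float = 1.0):
    """Build the spatially pre-filtered query for find_objects_near."""
    vec = embed_text(prompt)
    return (color_image_embedded
            .near((x, y, z), radius)                         # R*Tree pre-filter: frames captured near the point
            .search(vec)                                      # then the usual semantic search
            .filter(lambda m: m.similarity >= SIM_THRESHOLD)
            .order_by("ts", desc=True)
            .first())                                         # only this frame (if any) goes to the VLM
```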

Why this design

  • Memory2 IS the temporal/spatial substrate. "Most recent confident match" is .search().filter().order_by("ts",desc).first() — one line, all push-down to indexes.
  • Freshness lives in the response. Skill returns (seen Ns ago); the LLM agent reads it and decides if it's actionable.
  • Stateless skills. Each call is an independent memory2 query → VLM → 3D → publish.
  • Manipulation API unchanged. PickAndPlaceModule.pick(name) reads latest_detections (the cache of the most recent perception result). Same pick(name) skill the team already uses.
