
spec(manipulation/memory): object memory tracker on memory2, propose … #2067

Draft
jhengyilin wants to merge 4 commits into dimensionalOS:main from jhengyilin:feature/mainpulation_with_memory_jhengyi

Conversation


@jhengyilin jhengyilin commented May 13, 2026

Spec for the object memory tracker on memory2 (Refs #1893)

flowchart TB
    camera([camera])
    perception["ObjectSceneRegistrationModule<br/><i>detection only — no ObjectDB</i>"]
    tracker["ObjectMemoryTracker<br/><i>(MemoryModule)</i>"]
    manipulation["PickAndPlaceModule<br/><i>manipulation — no API change</i>"]
    skills["@skill recall(name)<br/><i>cross-session memory</i>"]

    camera --> perception
    perception -->|"raw detections<br/>list[DetObject]"| tracker
    tracker -->|"tracked_objects (port)<br/>list[DetObject]"| manipulation

    subgraph m2 ["memory2 — source of truth"]
        direction TB
        obs[("object_observations<br/><i>dense — log</i>")]
        events[("object_events<br/><i>sparse — lifecycle</i>")]
    end

    tracker == "append + inline cache update" ==> obs
    tracker == "append + inline cache update" ==> events

    obs -. "sync .to_list() replay on start()" .-> tracker
    events -. "sync .to_list() replay on start()" .-> tracker

    events --> skills

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class obs,events stream
    class perception,tracker,manipulation module
    class camera,skills external

How it works — proposed architecture workflow

| t (s) | Event | Tracker's response |
|---|---|---|
| 0 | First scan sees "cup" at (0.4, 0.1, 0.9) | No match → APPEARED event + observation. confidence = 1.0 |
| 2–10 | More scans of the same cup | Tight-spatial match → observation each time. After 6 detections → PROMOTED. Cup is in tracked_objects. |
| 14 | Hand covers camera, no detection | confidence ≈ 0.77 — still confident |
| 20 | Hand still there | confidence ≈ 0.51 — borderline |
| 24 | Hand still there | confidence ≈ 0.41 → tentative. Out of snapshot, still match-eligible. |
| 25 | Hand moves, scan sees cup again | Tight-spatial match → observation. Confidence resets to 1.0. No duplicate identity. |
| 60 | User moves cup to (1.0, 0.5, 0.9) — 70 cm away | Tight match fails (>0.2 m). Wider-radius voted-name match (drift) → MOVED event. No phantom at old position. |
| 120 | User takes the cup away | After ~45 s of decay, confidence < 0.1 → LOST event. Cup moves to recently-lost bucket. |
| 200 | Process crashes and restarts | Sync replay (stream.to_list() over both streams) rebuilds the cache from memory2 before the tracker accepts new detections. No bespoke load code. |
| 205 | Agent calls recall("cup") | Query: events.tags(name="cup").last() → returns LOST event. Process answers about a cup it never saw in its own lifetime. |
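
The confidence values in the table are consistent with a simple exponential decay since the last matched observation. A minimal sketch of that reading, assuming exponential decay driven by the time_constant_s = 15 tunable described below (the function name, the tentative band constant, and the LOST threshold constant are illustrative names, not the actual implementation):

```python
import math

TIME_CONSTANT_S = 15.0        # the one tunable: how forgiving the tracker is of occlusion
TENTATIVE_BAND = (0.2, 0.5)   # mid-confidence objects stay match-eligible but leave the snapshot
LOST_THRESHOLD = 0.1          # below this the tracker emits a LOST event

def confidence(now_s: float, last_seen_s: float) -> float:
    """Belief that an object is still present, decaying since its last matched observation."""
    return math.exp(-(now_s - last_seen_s) / TIME_CONSTANT_S)

# Reproduces the walkthrough numbers (last observation around t = 10):
#   confidence(14, 10)  ≈ 0.77   still confident
#   confidence(20, 10)  ≈ 0.51   borderline
#   confidence(110, 42) ≈ 0.011  < 0.1 → LOST (the v3 walkthrough further below)
```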

Why this design:

  • Two streams in memory2: object_observations (every matched detection — the evidence) and object_events (lifecycle transitions APPEARED / PROMOTED / LABEL_CHANGED / MOVED / LOST — the story).

  • Continuous belief over binary present/absent — one tunable (time_constant_s = 15) controls how forgiving the tracker is of occlusion. The tentative band (0.2 – 0.5) keeps mid-confidence objects match-eligible, so a single missed scan can't create a duplicate identity.

  • Memory2 holds the persistent record — object history lives in the streams across sessions. The tracker reads from memory2 on startup, so cross-session memory comes for free.

  • No change to manipulation's API: tracked_objects publishes list[Object] (same type used today). PickAndPlaceModule works without modification.
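
To make the two-stream split concrete, here is a hedged sketch of what a record on each stream might carry. Only the stream names, the lifecycle kinds, and the append-plus-replay pattern come from this spec; the field names are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class LifecycleEvent(str, Enum):
    APPEARED = "APPEARED"
    PROMOTED = "PROMOTED"
    LABEL_CHANGED = "LABEL_CHANGED"
    MOVED = "MOVED"
    LOST = "LOST"

@dataclass
class ObjectObservation:
    """Dense stream: object_observations — one row per matched detection (the evidence)."""
    identity_id: str                       # stable id the tracker assigned to this object
    name: str                              # label of this particular detection
    position: tuple[float, float, float]   # 3D position in the world frame
    ts: float                              # detection timestamp (seconds)

@dataclass
class ObjectEvent:
    """Sparse stream: object_events — one row per lifecycle transition (the story)."""
    identity_id: str
    kind: LifecycleEvent
    name: str                              # voted name at the time of the transition
    position: tuple[float, float, float]
    ts: float
```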

Solves the two issues we discussed:

  1. Stable labels despite YOLO label flicker (vote across recent detections instead of latest-detection-wins; see the sketch after this list)
  2. Memory between actions (soft persistence + re-acquire + survives restart)
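
A minimal sketch of the label-voting idea from point 1, assuming the tracker keeps a short history of labels per identity (the class, window size, and helper names are illustrative):

```python
from collections import Counter, deque

class VotedLabel:
    """Vote across recent detections so a single YOLO flicker cannot rename an identity."""

    def __init__(self, window: int = 15):
        self._recent: deque[str] = deque(maxlen=window)  # labels from the last N matched detections

    def add(self, label: str) -> None:
        self._recent.append(label)

    @property
    def name(self) -> str:
        if not self._recent:
            raise ValueError("no detections voted yet")
        # Majority label wins; one "vase" frame among many "cup" frames does not change the name.
        return Counter(self._recent).most_common(1)[0][0]

# Illustrative use inside the tracker:
#   tracked.votes.add(det.name)                       # on every matched detection
#   if tracked.votes.name != tracked.name: ...        # emit LABEL_CHANGED, update tracked.name
```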

…architecture design for integrating memory2 with current manipulation stack

jhengyilin commented May 13, 2026

Memory2-native perception (Refs #1893)

Goal: give the agent natural-language access to its workspace — find, recall, and manipulate objects without a fixed class list, with the tracker maintaining identity and lifecycle automatically.

flowchart TB
    camera([camera])

    subgraph m2 ["memory2"]
        direction TB
        semsearch["SemanticSearch<br/><i>continuous CLIP, brightness/sharpness filtered</i>"]
        subgraph db ["recording.db"]
            direction LR
            color[("color_image")]
            depth[("depth_image")]
            info[("camera_info")]
            embedded[("color_image_embedded")]
            obs[("object_observations")]
            events[("object_events")]
        end
    end

    recorder["RGBDCameraRecorder"]
    lazy["LazyPerceptionModule<br/><i>@skill find_objects(prompts)</i><br/>+ startup scan + 10s heartbeat"]
    tracker["ObjectMemoryTracker<br/><i>identity + lifecycle</i>"]
    manip["PickAndPlaceModule<br/><i>manipulation — no API change</i>"]
    recall["@skill recall(name)"]

    camera --> recorder
    recorder ==> color
    recorder ==> depth
    recorder ==> info
    color -. "auto-subscribe" .-> semsearch
    semsearch ==> embedded
    embedded -. "pulled on trigger" .-> lazy
    depth -. ".at(peak.ts)" .-> lazy
    info -. ".last()" .-> lazy
    lazy ==>|"list[Object]"| tracker
    tracker ==> obs
    tracker ==> events
    tracker ==>|"tracked_objects"| manip
    tracker ==>|"watched_names: set[str]"| lazy
    events --> recall

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class color,depth,info,embedded,obs,events stream
    class recorder,lazy,semsearch,tracker,recall,manip module
    class camera external

What this unlocks (agent-facing API)

  • find_objects(prompts) — open-vocab detection. Agent passes any natural language ("cup", "red mug with handle", "the screwdriver near the corner"); returns 3D positions for
    matching objects.
  • recall(name) — cross-session memory. "Where did I last see X?" Answers across process restarts.
  • pick(name) — manipulation, reads the tracker's snapshot.

Once any object enters the tracker (via agent call, startup scan, or heartbeat), the tracker's watched_names port adds it to the heartbeat's scan set automatically — discovered objects keep being tracked without the agent re-querying.
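
A sketch of how the heartbeat's scan set could be assembled from the blueprint defaults plus the tracker's watched_names port. The union default_prompts ∪ watched_names and the 10 s period come from this spec; the constant values, helper name, and port-access syntax are illustrative:

```python
DEFAULT_PROMPTS = {"cup", "bowl", "plate"}   # blueprint-configured ambient prompts (example values)
HEARTBEAT_PERIOD_S = 10.0                    # heartbeat scan interval

def heartbeat_prompts(default_prompts: set[str], watched_names: set[str]) -> set[str]:
    """Anything the agent or the startup scan ever named keeps being scanned for."""
    return default_prompts | watched_names

# Every HEARTBEAT_PERIOD_S seconds (illustrative wiring):
#   prompts = heartbeat_prompts(DEFAULT_PROMPTS, tracker.watched_names)
#   detections = find_objects(", ".join(sorted(prompts)))   # open-vocab scan, comma-separated prompts
#   tracker.ingest(detections)                               # feeds the tracker's raw_detections seam
```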

How it works — v3 workflow

| t (s) | Event | What happens |
|---|---|---|
| 0 | Blueprint boots | Recorder + SemanticSearch begin. CLIP continuously embeds qualifying frames into color_image_embedded. |
| 5 | Startup scan fires | LazyPerceptionModule scans for blueprint-configured startup_prompts; seeds the tracker with whatever is already in view. |
| 10 | User places a cup at (0.4, 0.1, 0.9) | Recorder captures. |
| 15 | Agent: find_objects("cup, bowl, plate") | CLIP search → peaks → VLM → 3D project. Cup detected. APPEARED. Tracker publishes watched_names = {"cup"} — heartbeat now auto-scans for cup. |
| 25 | User moves cup to (1.0, 0.5, 0.9) | Recorder captures. |
| 30 | Heartbeat (every 10 s) | Scans default_prompts ∪ watched_names = {"cup"}. Detects cup at new location. Tier-3 drift match — silent state refresh; snapshot publishes new pose. |
| 45 | User removes cup | Recorder captures empty scene. |
| 110 | Heartbeat — lookback now excludes all cup frames | No detection. confidence(110, 42) ≈ 0.011 < 0.1 → LOST event fires. |
| 115 | Agent: recall("cup") | events.tags(name="cup").last() → "Last saw a cup at (1.0, 0.5, 0.9) (about 73s ago — event: LOST)". |
| restart | Process crashes and restarts | tracker.start() replays both streams; recovers identities, voted names, _lost bucket. Heartbeat resumes via republished watched_names. |

The result: cross-session continuity for manipulation.
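
A hedged sketch of the recall(name) skill built from the events-stream query quoted in the table. The .tags()/.last() primitives and the response format come from this spec; the event field names, the time source, and the assumption that the events stream handle and the framework's @skill registration are in scope are illustrative:

```python
import time

def recall(name: str) -> str:
    """Answer "where did I last see X?" from the object_events stream, across restarts."""
    event = events.tags(name=name).last()   # most recent lifecycle event for this name
    if event is None:
        return f"I have never seen a {name}."
    age_s = time.time() - event.ts
    x, y, z = event.position
    return (f"Last saw a {name} at ({x:.1f}, {y:.1f}, {z:.1f}) "
            f"(about {age_s:.0f}s ago — event: {event.kind})")
```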

Why this design

  • Open-vocab end-to-end. No fixed vocabulary in any module. Heartbeat scans default_prompts ∪ watched_names, so anything the agent or startup scan discovers is automatically maintained
    by ambient detection. Closes the lifecycle gap for any object the agent ever names.
  • Cheap continuous + expensive on-demand. CLIP runs always to build the embedding index that drives fast semantic search. VLM runs only on triggered scans (agent call
    / startup / heartbeat).
  • Tracker is detector-agnostic. raw_detections: In[list[DetObject]] is the only seam. PickAndPlaceModule works
    without modification.
  • Same persistence model. Same memory2 streams (object_observations, object_events), same synchronous replay on start(). Cross-session memory comes for free, and the new watched_names Out port lets the heartbeat resume tracking exactly what the previous session was tracking (see the sketch after this list).
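
The replay that last bullet refers to, sketched under the spec's .to_list() primitive (the stream attribute names, the _apply_* helpers, and the internal cache are illustrative):

```python
class ObjectMemoryTracker:
    def start(self) -> None:
        """Rebuild the in-memory cache from memory2 before accepting new detections."""
        # Dense evidence first: restores identities and their last-seen poses/timestamps.
        for obs in self.object_observations.to_list():
            self._apply_observation(obs, replay=True)

        # Then the sparse lifecycle stream: restores voted names and the recently-lost bucket.
        for event in self.object_events.to_list():
            self._apply_event(event, replay=True)

        # Only now start consuming raw_detections, and republish watched_names
        # so the heartbeat resumes scanning what the previous session was tracking.
        self.watched_names.publish({t.name for t in self._tracked.values()})
```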

jhengyilin (Author) commented:

Memory2-native perception for manipulation (Refs #1893)

flowchart TB
    camera([camera])

    subgraph m2 ["memory2"]
        direction TB
        semsearch["SemanticSearch<br/><i>continuous CLIP, brightness/sharpness filtered</i>"]
        subgraph db ["recording.db"]
            direction LR
            color[("color_image")]
            depth[("depth_image")]
            info[("camera_info")]
            embedded[("color_image_embedded")]
        end
    end

    recorder["RGBDCameraRecorder"]
    lazy["LazyPerceptionModule<br/><i>skills: find_objects · find_objects_near · recall</i>"]
    manip["PickAndPlaceModule<br/><i>manipulation — reads latest_detections</i>"]

    camera --> recorder
    recorder ==> color
    recorder ==> depth
    recorder ==> info
    color -. "auto-subscribe" .-> semsearch
    semsearch ==> embedded
    embedded -. ".search → .filter → .order_by(ts).first()" .-> lazy
    depth -. ".at(obs.ts)" .-> lazy
    info -. ".last()" .-> lazy
    lazy ==>|"latest_detections: list[Object]"| manip

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class color,depth,info,embedded stream
    class recorder,lazy,semsearch,manip module
    class camera external

What this unlocks (agent-facing API)

Three skills, each a one-line composition of memory2 primitives. Every skill returns the most recent confident match along with its timestamp.

| Skill | Composition | Returns |
|---|---|---|
| find_objects(prompt) | .search(vec).filter(sim≥thr).order_by("ts",desc).first() → VLM → 3D project | list[Object] + "(seen Ns ago)" summary |
| find_objects_near(prompt, x, y, z, radius=1.0) | .near((x,y,z),r).search(...) (same as above) | list[Object] + "(seen Ns ago)" summary |
| recall(name) | .search(vec).filter(sim≥thr).order_by("ts",desc).first() (no VLM, cheaper) | Camera pose at match + "(seen Ns ago)" |
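
To ground the table, a sketch of find_objects(prompt) as that one-line composition. The .search/.filter/.order_by/.first chain, .at(ts), .last(), and the latest_detections port come from this spec; the embedding, VLM, and projection helpers, the threshold constant, and the exact call signatures are placeholder assumptions:

```python
SIM_THRESHOLD = 0.2   # similarity threshold (example value from the walkthrough below)

def find_objects(prompt: str) -> str:
    """Open-vocab detection: most recent confident CLIP match → VLM detect → 3D project."""
    vec = embed_text(prompt)                                      # CLIP text embedding (placeholder)
    match = (color_image_embedded
             .search(vec)                                         # semantic search over embedded frames
             .filter(lambda m: m.similarity >= SIM_THRESHOLD)     # keep confident matches only
             .order_by("ts", desc=True)
             .first())                                            # most recent confident frame
    if match is None:
        return f"No confident match for '{prompt}'."

    detections = vlm_detect(match.image, prompt)                  # expensive VLM, only on demand
    depth = depth_image.at(match.ts)                              # depth frame from the same instant
    intrinsics = camera_info.last()
    objects = [project_to_3d(d, depth, intrinsics) for d in detections]

    latest_detections.publish(objects)                            # manipulation reads this port
    age_s = now_s() - match.ts
    positions = ", ".join(str(o.position) for o in objects)
    return f"Found {len(objects)} {prompt} at {positions} (seen {age_s:.0f}s ago)"
```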

How it works — walkthrough

| t (s) | Event | What happens |
|---|---|---|
| 0 | Blueprint boots | Recorder + SemanticSearch begin. CLIP continuously embeds qualifying frames into color_image_embedded. No detection runs. |
| 10 | User places a cup at (0.4, 0.1, 0.9) | Recorder captures color/depth/intrinsics with tf-resolved world pose. CLIP embeds the frame. |
| 15 | Agent: find_objects("cup") | .search(vec).filter(sim≥0.2).order_by("ts",desc).first() → VLM → 3D. Returns "Found 1 cup at (0.4, 0.1, 0.9) (seen 2s ago)". Publishes [cup] on latest_detections. |
| 16 | Agent: pick("cup") | Manipulation reads latest_detections, picks at (0.4, 0.1, 0.9). |
| 30 | User places a screwdriver near the workbench at (1.0, 0.5, 0.8) | Recorder captures. |
| 35 | Agent: find_objects_near("screwdriver", 1.0, 0.5, 0.8, radius=0.5) | .near((1,0.5,0.8),0.5).search(...).filter().order_by("ts",desc).first() — memory2's R*Tree pre-filters to frames captured at the workbench; only those go to VLM. |
| 45 | User removes the cup | Recorder captures empty scene. |
| 50 | Agent: find_objects("cup") | Most recent confident cup match is the t≈42 frame — returns "Found 1 cup at (0.4, 0.1, 0.9) (seen 8s ago)". Agent reads "8s ago" and decides whether to re-query or act. |
| 120 | Process crashes and restarts | New process, same recording.db. CLIP embeddings persist. |
| 125 | Agent: recall("cup") | Returns "Last saw 'cup' with camera near (X, Y, Z) (seen 105s ago)" — works because memory2's SQLite is the persistence. Process answers about a cup it never saw in its own lifetime. |
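
The t = 35 row leans on memory2's spatial index. Under the same assumptions as the previous sketch, only the query changes for find_objects_near; the VLM → 3D → publish tail is identical (the helper name below is illustrative):

```python
def near_query(prompt: str, x: float, y: float, z: float, radius: float = 1.0):
    """Build the spatially pre-filtered query for find_objects_near."""
    vec = embed_text(prompt)
    return (color_image_embedded
            .near((x, y, z), radius)                         # R*Tree pre-filter: frames captured near the point
            .search(vec)                                      # then the usual semantic search
            .filter(lambda m: m.similarity >= SIM_THRESHOLD)
            .order_by("ts", desc=True)
            .first())                                         # only this frame (if any) goes to the VLM
```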

Why this design

  • Memory2 IS the temporal/spatial substrate. "Most recent confident match" is .search().filter().order_by("ts",desc).first() — one line, all push-down to indexes.
  • Freshness lives in the response. Skill returns (seen Ns ago); the LLM agent reads it and decides if it's actionable.
  • Stateless skills. Each call is an independent memory2 query → VLM → 3D → publish.
  • Manipulation API unchanged. PickAndPlaceModule.pick(name) reads latest_detections (the cache of the most recent perception result). Same pick(name) skill the team already uses.
