feat(team-memory): close Phase 1-3 — decisions, deploy scaffolding, consumer setup#6

Open
matze4u wants to merge 11 commits into main from docs/team-memory-roadmap

@matze4u matze4u commented Apr 27, 2026

Summary

Closes Phase 1–3 of agents/roadmaps/team-memory-deployment.md — the roadmap that turns the per-developer local memory into a single shared brain for the Galawork team, without any package code change. All decisions are recorded as ADRs; deployment artefacts and consumer docs land here so Phase 2 (spike) can start on top of a reviewed baseline.

Phase 1 — Decisions (ADRs Accepted)

  • ADR-0004 — Hosting: Hetzner Cloud CX22 (EU-Falkenstein) + self-managed Postgres+pgvector in Compose; nightly pg_dump to a Hetzner Storage Box. Total ≈ €8.82/month (well under the €25 ceiling).
  • ADR-0005 — Auth: Tailscale tailnet as the network gate, layered with the existing MEMORY_MCP_AUTH_TOKEN bearer. Defense in depth; SSO offboarding via Tailscale group.
  • ADR-0006 — Scope policy: Team-brain default — every consumer .agent-memory.yml omits repository:. Per-entry scope.repository provenance preserved. Any developer may run memory promote; the existing trust pipeline is the V1 quality gate.
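
The team-brain default in ADR-0006 depends on each consumer config leaving out `repository:`. A rollout check along the following lines could flag configs that still pin a repository. This is a minimal sketch, not part of the PR: the function names are hypothetical, and it assumes `repository:` appears as an uncommented key at the start of a line in `.agent-memory.yml`, which the real schema may not guarantee.

```shell
#!/bin/sh
# Sketch: report whether a consumer config still pins a repository filter.
# Assumption: the key appears as an uncommented `repository:` line
# (optionally indented); the real .agent-memory.yml schema may differ.
has_repository_filter() {
  # -E enables extended regex, -q suppresses output and only sets the exit code.
  grep -Eq '^[[:space:]]*repository:' "$1"
}

check_team_brain_config() {
  local config_file="${1:-.agent-memory.yml}"
  if [ ! -f "$config_file" ]; then
    echo "missing: $config_file"
    return 2
  fi
  if has_repository_filter "$config_file"; then
    echo "repo-scoped: $config_file still sets repository:"
    return 1
  fi
  echo "team-brain: $config_file omits repository:"
}
```

Calling `check_team_brain_config path/to/.agent-memory.yml` returns 0 for a team-brain config, 1 when a filter is still present, and 2 when the file is missing.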

Phase 2 — Deploy scaffolding

  • deploy/team-memory/docker-compose.yml — pgvector/pgvector:pg17 + agent-memory in SSE mode, port bound to TAILNET_IP only (never 0.0.0.0).
  • deploy/team-memory/.env.example — required POSTGRES_PASSWORD, MEMORY_MCP_AUTH_TOKEN, TAILNET_IP.
  • deploy/team-memory/README.md — end-to-end runbook (provision → Tailscale → deploy → verify → backups → restore drill → consumer onboarding pointer).
  • agents/analysis/team-memory-spike-notes.md — spike log template (acceptance checks, daily entries, cost + latency reality checks, restore-drill preview, sign-off decision).
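
Because the compose file binds the SSE port to TAILNET_IP only, an empty or wildcard value is the failure mode to guard against. A pre-deploy guard could look roughly like this. It is a sketch under assumptions: the function names are hypothetical and not part of the runbook, and the check relies on Tailscale assigning IPv4 addresses from the 100.64.0.0/10 range.

```shell
#!/bin/sh
# Sketch: refuse to deploy unless TAILNET_IP looks like a Tailscale IPv4 address.
is_tailnet_ipv4() {
  local ip="$1" a b c d extra octet
  # Reject empty input, stray characters, and leading or trailing dots.
  case "$ip" in
    '' | *[!0-9.]* | .* | *.) return 1 ;;
  esac
  # Split on dots; anything beyond four fields lands in $extra.
  IFS=. read -r a b c d extra <<EOF
$ip
EOF
  [ -z "$extra" ] || return 1
  for octet in "$a" "$b" "$c" "$d"; do
    # Each octet needs 1 to 3 digits and a value of at most 255.
    case "$octet" in
      '' | ????*) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
  done
  # 100.64.0.0/10 covers 100.64.0.0 through 100.127.255.255.
  [ "$a" -eq 100 ] && [ "$b" -ge 64 ] && [ "$b" -le 127 ]
}

require_tailnet_bind() {
  if is_tailnet_ipv4 "${TAILNET_IP:-}"; then
    echo "ok: binding to ${TAILNET_IP}"
  else
    echo "refusing: TAILNET_IP='${TAILNET_IP:-}' is not in 100.64.0.0/10" >&2
    return 1
  fi
}
```

Running `require_tailnet_bind && docker compose up -d` would stop before Compose starts when the variable is unset, `0.0.0.0`, or a public address. It does not replace a host-level firewall.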

Phase 3 — Consumer-side documentation + onboarding

  • docs/consumer-setup-docker-sidecar.md §4 — full team-memory remote-mode reference (1Password-backed bearer fetch, SSE MCP client config, /sse curl health probe, team-brain .agent-memory.yml, troubleshooting rows).
  • docs/consumer-setup-generic.md — Pattern C: MCP over SSE (shared brain).
  • docs/consumer-setup-node.md §4 — SSE alternative for the team-brain case.
  • scripts/team-memory-onboard.sh — read-only developer helper that checks Tailscale, brain DNS, 1Password bearer fetch, /sse handshake; prints copy-pasteable shell exports.

Out of scope (tracked, not in this PR)

  • Phase 3 Steps 3–4 — per-consumer-repo rollout (each repo updates its own .agent-memory.yml in its own PR) and CI memory doctor integration.
  • Phase 4 — migration of existing local DBs (deferred until after the spike).
  • Phase 5 — operations (backups, monitoring, capacity, offboarding runbook). Restore drill is the gate to declare Phase 5 done.

Constraint compliance

  • ✅ No src/ changes — every artefact composes from existing CLI/MCP primitives (memory mcp --transport sse, repository-filter omission, promotion gate).
  • ✅ Cost ≤ €25/month: estimated €8.82 (CX22 + BX11 storage box).
  • ✅ All cross-document links validated by npm run check:links (218 successful, 0 errors).
  • docs/secret-safety.md extended with the policy floor for shared deployments (PII, customer data, production logs, opinions — never permitted regardless of pattern catalog).

Verification

npm run check:links     # ✓ 218 successful, 0 errors
bash -n scripts/team-memory-onboard.sh && scripts/team-memory-onboard.sh --help

When exercised on a non-tailnet host, the onboarding script's failure paths produce the expected red checks and a non-zero exit.

Reviewer focus

  • ADR-0004/0005/0006 — are the decisions tight enough that Phase 2 spike can proceed?
  • deploy/team-memory/docker-compose.yml — is the TAILNET_IP:7078:7078 port binding enforcement sufficient, or do we need a host-level firewall layer?
  • docs/consumer-setup-docker-sidecar.md §4 — does the bearer-fetch flow match how the team actually distributes secrets?

Co-authored by Augment Code

matze4u added 5 commits April 27, 2026 02:40
Adds the team-wide agent-memory deployment roadmap (5 phases) plus three
Status: Proposed ADRs that scope the Phase-1 decisions:

- ADR-0004: hosting (Hetzner CX22 / AWS RDS / Fly.io / existing Galawork)
- ADR-0005: auth model (Tailscale / Cloudflare Tunnel+mTLS / public+token)
- ADR-0006: scope default + promotion authority

No infra change, no package patch, no decisions made yet — the ADRs lay
out the option matrices so reviewers can weigh in before Phase 2 spike.

ADR-0004: Hetzner CX22 + Storage Box (~€8.82/mo, self-managed pg_dump)
ADR-0005: Tailscale + existing MCP_AUTH_TOKEN bearer (defense in depth)
ADR-0006: team-brain default + any-dev promotion (trust pipeline = gate)

Also adds the policy floor section to docs/secret-safety.md covering
what is never allowed in shared memory beyond the technical pattern
catalog (PII, production data, personal opinions). Closes Phase 1
Steps 1–5 of agents/roadmaps/team-memory-deployment.md.
…plate

deploy/team-memory/ holds the maintainer-side artefacts for the shared brain:

  - docker-compose.yml: pgvector/pgvector:pg17 + agent-memory in SSE mode,
    SSE port bound to TAILNET_IP only; postgres internal-only
  - .env.example: required POSTGRES_PASSWORD, MEMORY_MCP_AUTH_TOKEN, TAILNET_IP
  - README.md: end-to-end runbook (provision -> Tailscale -> deploy ->
    verify -> backups -> restore drill -> consumer onboarding pointer)

agents/analysis/team-memory-spike-notes.md is the Phase-2 spike log
template (acceptance checks, daily entries, cost + latency reality
checks, restore-drill preview, sign-off decision).

No package code changed; everything composes from existing CLI/MCP
primitives per the team-memory roadmap constraints.

Adds the team-memory remote-mode path to all three consumer-setup docs:

  - docker-sidecar.md §4: full section (bearer fetch via 1Password,
    SSE MCP client config, /sse curl health probe, team-brain
    .agent-memory.yml shape, troubleshooting rows)
  - generic.md: Pattern C — MCP over SSE, with pointer to the runbook
  - node.md §4: SSE alternative for the team-brain case

All three docs now describe the team-brain default (no repository:
filter, per ADR-0006) and the Tailscale + bearer auth model
(ADR-0005). Roadmap Phase 3 Step 1 ticked.

scripts/team-memory-onboard.sh runs four read-only checks:

  1. Tailscale CLI installed and tailnet up
  2. Brain hostname resolves over the tailnet
  3. MCP bearer fetched from 1Password (op://Engineering/team-memory/mcp-bearer)
  4. SSE handshake on /sse returns the endpoint header

Never edits shell rc files, never writes secrets to disk. Prints
copy-pasteable export commands on success. Failure exits non-zero
with actionable hints. Defaults overridable via MEMORY_BRAIN_HOST,
MEMORY_BRAIN_PORT, MEMORY_BEARER_OP_REF env vars.

Roadmap Phase 3 Step 2 ticked. Steps 3-4 remain open and are tracked
per consumer repo (out of scope for this repo's PR).
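
The read-only check pattern described for the onboarding script can be illustrated with a small runner that tallies failures and writes nothing to disk. This is not the contents of scripts/team-memory-onboard.sh. It is a sketch in which `run_check`, `onboard_checks`, and the default host name `team-memory` are hypothetical, and it assumes the `tailscale` and `op` CLIs are on the PATH. The /sse handshake check is left out here.

```shell
#!/bin/sh
# Sketch of a read-only check runner; names are illustrative only.
failures=0

run_check() {
  # $1 is the label, the remaining arguments are the command to run.
  # Output is discarded so a fetched secret never reaches the terminal.
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf '[ok]   %s\n' "$label"
  else
    printf '[fail] %s\n' "$label"
    failures=$((failures + 1))
  fi
}

onboard_checks() {
  # Hypothetical default host; the real script's default is not shown in this PR.
  local host="${MEMORY_BRAIN_HOST:-team-memory}"
  local op_ref="${MEMORY_BEARER_OP_REF:-op://Engineering/team-memory/mcp-bearer}"
  failures=0
  run_check "tailscale up"            tailscale status
  run_check "brain hostname resolves" tailscale ip -4 "$host"
  run_check "bearer readable"         op read "$op_ref"
  # Succeed only when every check passed.
  [ "$failures" -eq 0 ]
}
```

A caller could end with `onboard_checks || exit 1` to get the non-zero exit on any failed check.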
@matze4u matze4u changed the title docs(team-memory): add deployment roadmap and Phase-1 ADR stubs feat(team-memory): close Phase 1-3 — decisions, deploy scaffolding, consumer setup Apr 27, 2026
matze4u and others added 6 commits April 27, 2026 04:01
Validated deploy/team-memory/docker-compose.yml on a workstation before any
Hetzner spend: stack composes cleanly, both containers reach healthy, all
four SSE auth boundaries (200/401/403/404) match docs/mcp-http.md, and the
data-plane round-trip (propose -> promote -> verify) works end-to-end.

Findings logged in agents/analysis/team-memory-dryrun-results.md:

 - GHCR :latest is not yet published; runbook §5 will fail until a sha-tag
   lands (papercut, no blocker).
 - Synthetic entries without --file/--scenario depress below the trust
   threshold by design — the smoke test now seeds realistic entries.

scripts/team-memory-smoketest.sh codifies the four acceptance checks from
agents/analysis/team-memory-spike-notes.md so the operator can run them
deterministically during the spike. Spike-notes header references the
dry-run outcome. Roadmap dashboard regenerated per roadmap-progress-sync.
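
A status-code assertion of the kind the auth-boundary checks need can be sketched with `curl`. The helper names `http_status` and `expect_status` are hypothetical and this is not the smoketest's actual code. Which request shape should yield which of 200/401/403/404 is defined in docs/mcp-http.md and is not assumed here.

```shell
#!/bin/sh
# Sketch: probe an endpoint and compare the HTTP status with an expectation.
http_status() {
  # -s silences progress, -o /dev/null drops the body, -w prints the status code.
  # --max-time bounds the call because an SSE stream otherwise stays open.
  curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$@"
}

expect_status() {
  # $1 is the wanted status, the remaining arguments go to curl unchanged.
  local want="$1"
  shift
  local got
  # curl exits non-zero when --max-time cuts off a stream; keep going anyway.
  got="$(http_status "$@")" || true
  if [ "$got" = "$want" ]; then
    echo "ok   $want"
  else
    echo "FAIL want=$want got=$got"
    return 1
  fi
}
```

Usage takes the form `expect_status <code> <curl arguments>`, for example a URL on port 7078 plus whatever headers the boundary under test requires.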
Three setup pieces that live outside docker-compose.yml and that the
spike operator needs in addition to the Compose stack:

 - deploy/team-memory/tailscale-acl.json — pasteable Tailscale ACL
   implementing ADR-0005, with tagOwners, two groups, default-deny ACL
   for tag:memory-host:7078, admin SSH on :22, and regression tests.
 - deploy/team-memory/operator-setup.md — Hetzner Cloud Firewall recipe
   (Console + hcloud CLI), Tailscale ACL pointer, and 1Password vault
   item schema with Bitwarden / Vault / Doppler equivalents.
 - deploy/team-memory/README.md §1/§2/§3 link to the new artefacts so
   the runbook stays the one entry point during the spike.

…ty.md

Phase 1 commit 485bb9a stripped two trailing spaces from the auto-generated
catalog header lines; CI II1 (Secret-pattern doc drift) flagged it. Re-ran
npm run docs:secrets to put them back. Catalog content unchanged (still
v1.0.0, 27 patterns, 17 providers).

…ity step

Two issues surfaced by the dry-run: ':latest' didn't exist (workflow line 69
reserves it for v* git tags, no release shipped yet) and the package was
private (HTTP 401 on anonymous pulls). Operator decision: keep ':latest'
convention intact, default to ':main' for the spike, set the package public.

 - .env.example: MEMORY_IMAGE_TAG=main (was: latest), with comment block
   pointing at sha-tags for production reproducibility and v* tags as the
   future ':latest' source.
 - docker-compose.yml: fallback ${MEMORY_IMAGE_TAG:-main} (was: -latest).
 - operator-setup.md §4 (new): one-time GitHub UI step to flip the GHCR
   package to public, plus the available-tag matrix and revert path.
 - README.md §5: one-time prerequisite note pointing at operator-setup §4.
 - team-memory-dryrun-results.md Finding 1: marked RESOLVED with the
   investigation table and the chosen path.
 - team-memory-spike-notes.md header: dry-run pointer reflects the
   resolution, not the original papercut.

No workflow change. Verified: docker compose config now resolves to
ghcr.io/event4u-app/agent-memory:main; npm run check:links clean (218 ✓).
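
The `${MEMORY_IMAGE_TAG:-main}` fallback uses the same syntax in Compose interpolation and in POSIX shells: the default applies when the variable is unset or empty. The sketch below shows that resolution in plain shell. `resolve_image` is a hypothetical helper for illustration, not something the repo ships.

```shell
#!/bin/sh
# Sketch: mirror the Compose fallback for the image reference in plain shell.
resolve_image() {
  # ":-" substitutes "main" when MEMORY_IMAGE_TAG is unset or empty.
  echo "ghcr.io/event4u-app/agent-memory:${MEMORY_IMAGE_TAG:-main}"
}
```

With the variable unset this prints the `:main` reference quoted in the commit message. Setting `MEMORY_IMAGE_TAG` to a sha-tag or a `v*` tag overrides it.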
CLI `memory propose` cannot attach evidence — only ingestion scanners
and `mcp.memory_ingest` from an agent context can. So the previous
smoketest hit two trust-pipeline floors:

1. `--impact normal` requires MIN_EVIDENCE_COUNT=1 → rejected at
   promotion with `evidence_floor`.
2. Even after promotion, zero-evidence entries floor at trust 0.2
   (src/trust/scoring.ts:27), below both default (0.6) and
   low-trust (0.3) retrieval thresholds.

Fix: smoketest now uses --impact low (MIN_EVIDENCE_COUNT=0), verifies
the entry via `memory verify`, and asserts retrieve indexes it as a
candidate (totalCandidates >= 1) instead of expecting a surfaced result
(filtering CLI-only entries is correct behaviour).

Adds a `memory health` probe as a third check.

Re-validated against the local Compose stack: 8/8 ✓.

Updates dryrun-results.md Finding 2 with the technically-accurate
explanation (CLI cannot attach evidence at all — `--file`/`--scenario`
populate scope, not evidence) and clarifies that real Phase-2 spike
Step-4 round-trip must drive through mcp.memory_ingest from an agent.

Co-authored-by: Mathias Berg <noreply@example.com>
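
The two trust-pipeline floors in the smoketest fix can be worked through with the numbers quoted in that commit message: an evidence minimum of 1 for `--impact normal` and 0 for `--impact low`, a trust floor of 0.2 for zero-evidence entries, and retrieval thresholds of 0.6 (default) and 0.3 (low-trust). The sketch below only restates that arithmetic. The function names are hypothetical and it does not reproduce src/trust/scoring.ts.

```shell
#!/bin/sh
# Sketch: restate the promotion and retrieval floors from the commit message.
min_evidence_for() {
  case "$1" in
    low) echo 0 ;;
    normal) echo 1 ;;
    *) return 2 ;; # other impact levels are not described in the source
  esac
}

can_promote() {
  # $1 is the impact level, $2 the number of attached evidence items.
  local min
  min="$(min_evidence_for "$1")" || return 2
  [ "$2" -ge "$min" ]
}

surfaces() {
  # $1 is the trust score, $2 the retrieval threshold.
  # awk does the float comparison and exits 0 when trust >= threshold.
  awk -v t="$1" -v th="$2" 'BEGIN { exit !((t + 0) >= (th + 0)) }'
}
```

Under these numbers `can_promote normal 0` fails while `can_promote low 0` passes, and `surfaces 0.2 0.6` and `surfaces 0.2 0.3` both fail. That is why the smoketest asserts on the candidate count rather than on a surfaced result.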
One-page sequence for the Phase 2 spike — short commands per step,
links to the long-form runbook (README.md) and the manual setup pieces
(operator-setup.md). Covers Day 0 prep work (no spend) + Day 1
provisioning + 'if something goes wrong' table.

README §preamble adds a pointer to the cheat-sheet so the entry point
to the runbook is the operator's choice (verbose vs scannable).