diff --git a/pages/slack-colibri-bridge.md b/pages/slack-colibri-bridge.md new file mode 100644 index 0000000..99dd077 --- /dev/null +++ b/pages/slack-colibri-bridge.md @@ -0,0 +1,392 @@ +--- +title: Slack → Colibri Bridge (Proposal) +contributors: Tom Larkworthy +--- + +> **Status:** Draft proposal. Looking for feedback. + +A one-way sync that mirrors Slack messages from the FoC Slack workspace into the [Colibri](https://colibri.social) atproto network, so the conversation data ends up public and consumable via the atproto firehose. + +## Why this shape + +- **Lowest-risk path to atproto.** The [social.colibri](https://lexicon.garden/browse/social.colibri) lexicon is the most fleshed-out Slack-shaped lexicon already on atproto. Reusing it lets us playtest Colibri as a Slack replacement without designing a new schema. +- **Slack stays canonical.** One-way (Slack → atproto). No write-back. +- **Public by default.** Records on atproto are public. + +## Architecture + +1. **Slack Events API** pushes message events to a Cloudflare Worker (the *producer*). It verifies Slack's HMAC signature, enqueues the event, and acks within Slack's 3-second budget. +2. **Cloudflare Queue** holds events durably. A consumer Worker pulls batches and publishes to atproto. Queue retries cover PDS slowness and rate limits. +3. **atproto is the source of truth.** The bot's bsky.social repo holds the published Colibri messages and our own sidecar records. The PDS is the dedupe authority — see Race resolution. +4. **Cloudflare D1 (SQLite)** holds: + - A single `cache` table — dumb projection of atproto records from any repo, keyed by `(repo, collection, rkey)`, body stored opaquely as JSON. Wipe it and the firehose rebuilds it. + - Separate private tables (e.g. `oauth_tokens` in v2) for credentials that can't go on atproto. 
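A minimal sketch of the producer's signature check from step 1, assuming Slack's documented v0 signing scheme: HMAC-SHA256 over `v0:{timestamp}:{rawBody}`, compared against the `X-Slack-Signature` header, with the timestamp taken from `X-Slack-Request-Timestamp`. The helper names, the five-minute replay window, and the use of `node:crypto` (on Workers this needs Node compatibility; Web Crypto would be the alternative) are illustrative assumptions, not part of the proposal.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Replay window in seconds; five minutes is an assumed default, treat it as tunable.
const MAX_SKEW_SECONDS = 5 * 60;

// Hypothetical helper: computes the value expected in X-Slack-Signature.
function slackSignature(signingSecret: string, timestamp: string, rawBody: string): string {
  const base = `v0:${timestamp}:${rawBody}`;
  return "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");
}

// Hypothetical helper: true only if the timestamp is fresh and the signature matches.
function verifySlackRequest(
  signingSecret: string,
  timestamp: string,
  rawBody: string,
  signatureHeader: string,
  nowSeconds: number,
): boolean {
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) return false;
  const expected = Buffer.from(slackSignature(signingSecret, timestamp, rawBody), "utf8");
  const received = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The check has to run against the raw request body, before any JSON parsing, because re-serialising the payload can change the bytes that were signed.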
+ +The consumer writes records in a fixed order per event: **first** `slackRaw` (lossless capture of the payload as received from Slack), **then** the derived `social.colibri.message`, **then** `slackOrigin` linking the two. If derivation crashes or is later improved, the raw record is still on atproto and the Colibri view can be regenerated without re-pulling Slack. See [slackRaw](#new-comfeelingofbridgeslackraw). + +### Sequence diagram + +```mermaid +sequenceDiagram + autonumber + actor U as User in Slack + participant S as Slack Events API + participant P as Producer Worker + participant Q as CF Queue + participant C as Consumer Worker + participant DB as D1 cache + participant PDS as bsky.social PDS<br/>(bot's repo) + participant App as Colibri app /<br/>firehose readers + + U->>S: send message / reply + S->>P: POST /slack/events + P->>P: verify X-Slack-Signature + P->>Q: enqueue { event } + P-->>S: 200 OK (within 3s) + + C->>Q: pull batch + C->>DB: SELECT cache<br/>WHERE repo=bot<br/>AND collection='…slackOrigin'<br/>AND rkey='channel-ts' + alt cache hit + Note over C: already bridged — skip + else cache miss + C->>PDS: createRecord social.colibri.message + PDS-->>C: { uri, cid } + C->>PDS: createRecord slackOrigin<br/>(deterministic rkey) + alt PDS 200 + C->>DB: INSERT INTO cache + else PDS 409 (concurrent dup) + C->>PDS: getRecord slackOrigin + PDS-->>C: { record } + C->>DB: INSERT INTO cache + end + end + + App->>PDS: firehose / fetch + App-->>U: public conversation visible +``` + +ASCII fallback: + +``` + Slack user + │ + ▼ + Slack Events API ──▶ Producer Worker (verify, ack <3s) + │ + ▼ + CF Queue + │ + ▼ + Consumer Worker + │ + SELECT cache (repo, collection, rkey) + │ + hit ──┼── miss + ▼ ▼ + skip createRecord(message, slackOrigin) + │ + PDS 200 │ PDS 409 (race) + ▼ ▼ + INSERT cache getRecord, INSERT cache + │ + ▼ + firehose ──▶ Colibri app +``` + +## Storage model + +atproto holds the public truth. D1 contains: + +- A single **`cache`** table — a polymorphic key-value mirror of atproto records from any repo, keyed by `(repo, collection, rkey)` with the record body stored as JSON. Dumb projection: no bridge bookkeeping, no fields that aren't already on atproto. Wipe it and the firehose rebuilds it. +- Separate **private** tables for state that can't go on atproto (v2 OAuth tokens). + +Queries that need particular fields use SQLite's JSON1 functions: + +```sql +-- "what's the colibri channel for slack channel C123?" +SELECT json_extract(record, '$.colibriChannelUri') FROM cache +WHERE repo = 'did:plc:' + AND collection = 'com.feelingof.bridge.slackChannel' + AND rkey = 'C123'; +``` + +### Schema sketch + +```sql +CREATE TABLE cache ( + repo TEXT NOT NULL, -- DID of the repo this record lives on + collection TEXT NOT NULL, -- e.g.
'com.feelingof.bridge.slackOrigin' + rkey TEXT NOT NULL, + record TEXT NOT NULL, -- JSON; lexicon-conformant record body + cached_at INTEGER NOT NULL, + PRIMARY KEY (repo, collection, rkey) +); + +-- v2; private; never on atproto +CREATE TABLE oauth_tokens ( + slack_user_id TEXT PRIMARY KEY, + did TEXT NOT NULL, + access_token_enc BLOB, + refresh_token_enc BLOB NOT NULL, + expires_at INTEGER NOT NULL, + scope TEXT, + updated_at INTEGER NOT NULL +); +``` + +### Race resolution + +The PDS is the dedupe authority via the deterministic rkey on `com.feelingof.bridge.slackOrigin`. The consumer: + +1. Check `cache` for `(repo=bot-did, collection='…slackOrigin', rkey='channel-ts')`. Hit → skip. +2. Miss → `createRecord` on PDS. 200 means we won the race. 409 means another consumer already published; `getRecord` fetches the canonical record. +3. Either path → INSERT into `cache`. + +Cache staleness is the residual risk: an upstream edit on the PDS leaves our row out of date until it's evicted. Our sidecar records are mostly write-once so this is rare in practice; v0 ignores it. + +## Lexicons + +### Reused: `social.colibri.message` + +- `text` ← Slack text (truncated to 2048 chars, prefixed `**@user:** ` for attribution — see Identity) +- `channel` ← via the `com.feelingof.bridge.slackChannel` sidecar +- `parent` ← parent's `slackOrigin.messageUri` → rkey +- `createdAt` ← Slack `ts` +- `facets` ← mentions, links (v0.1) +- `attachments` ← deferred (Slack files need blob re-upload) + +### New: `com.feelingof.bridge.slackRaw` + +Lossless archival of the raw Slack event payload. Written **before** any derivation, so the bridge never drops information it doesn't yet know how to render — reactions, edits, attachments, blocks, mrkdwn nuances — even if the v0 Colibri-message derivation ignores most of them. The bot's atproto repo becomes a public, replicable Slack archive that anyone can re-derive a Colibri view from. 
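To make the "capture first, derive later" step concrete, here is a rough sketch of building the raw record from a Slack Events API envelope. The `SlackEnvelope` type, the `buildSlackRaw` name, and the choice to store the whole envelope minus its top-level `token` field are assumptions for illustration; the output field names follow the `com.feelingof.bridge.slackRaw` lexicon.

```typescript
// Assumed minimal shape of an Events API envelope; real payloads carry more fields.
type SlackEnvelope = {
  token?: string; // legacy verification token; must not be published
  event: { type: string; subtype?: string; channel: string; ts: string; [k: string]: unknown };
  [k: string]: unknown;
};

type SlackRawRecord = {
  $type: "com.feelingof.bridge.slackRaw";
  slackChannelId: string;
  slackTs: string;
  eventType: string;
  payload: unknown;
  capturedAt: string;
};

// Hypothetical helper: copies the envelope, removes the token, and fills the lexicon fields.
function buildSlackRaw(envelope: SlackEnvelope, capturedAt: Date): SlackRawRecord {
  const payload: Record<string, unknown> = { ...envelope };
  delete payload.token;
  return {
    $type: "com.feelingof.bridge.slackRaw",
    slackChannelId: envelope.event.channel,
    slackTs: envelope.event.ts,
    // Prefer the subtype (e.g. message_changed) when Slack sends one.
    eventType: envelope.event.subtype ?? envelope.event.type,
    payload,
    capturedAt: capturedAt.toISOString(),
  };
}
```

Everything the derivation does not yet understand (blocks, files, reactions) rides along untouched inside `payload`, which is what allows a later re-derivation without another Slack pull.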
+ +`key: "any"` so the rkey matches the `slackOrigin` rkey for join-free lookup: + +``` +rkey = `${slackChannelId}-${slackTs.replace('.','-')}` +``` + +Edits and reactions update the same record (`putRecord`). v0.1 could split this into per-event-type records (`channelId-ts-eventType-seq`) if we need full event history rather than latest-known state. + +```json +{ + "lexicon": 1, + "id": "com.feelingof.bridge.slackRaw", + "defs": { + "main": { + "type": "record", + "key": "any", + "record": { + "type": "object", + "required": ["slackChannelId", "slackTs", "payload", "capturedAt"], + "properties": { + "slackChannelId": { "type": "string" }, + "slackTs": { "type": "string" }, + "eventType": { "type": "string", "description": "Slack event subtype: 'message', 'message_changed', 'message_deleted', 'reaction_added', etc." }, + "payload": { "type": "unknown", "description": "Raw Slack message object as received (minus auth tokens)." }, + "capturedAt": { "type": "string", "format": "datetime" } + } + } + } + } +} +``` + +Slack file attachments referenced in `payload.files[]` are not blobbed in v0; their `url_private` is captured but the bytes stay on Slack. v0.1 fetches and re-uploads as atproto blobs, referencing them from the derived `social.colibri.message`. + +#### Why this lives under `com.feelingof.*`, not `social.colibri.*` + +`slackRaw` is the *FoC community's* archive of its own Slack history. Its lifetime, schema, and ownership belong to FoC — not to Colibri — and we want that boundary explicit in the lexicon namespace. Three practical consequences: + +- **De-risks Colibri lexicon churn.** Colibri is actively reworking its lexicons on the `feat/rework` branch (the v1 `src/utils/atproto/lexicons.ts` is deleted there; a new `apps/website/src/utils/atproto/lexicons.ts` is in flight, along with new appview spec, streaming event types, and a refactored Message component tree). 
If a future Colibri version renames `social.colibri.message`, splits the facet model, or changes the channel/community ownership semantics that drove the constraints in this proposal, the `slackRaw` archive is untouched. We re-run the derivation against the new lexicons and republish — no Slack re-pull, no loss. +- **De-risks Colibri disappearing.** If Colibri is abandoned, the FoC archive is still complete, public, and addressable on atproto. A different reader (or a static-site generator off `foc-server`-style infrastructure) can render it. +- **Avoids polluting Colibri's namespace.** Bridge-specific concepts (Slack `ts`, `slack_user_id`, `subtype`) have no business inside `social.colibri.*`. Other Slack-on-Colibri bridges (different communities, different workspaces) would invent their own `com..bridge.slackRaw` analogues; that's the right shape. + +### New: `com.feelingof.bridge.slackOrigin` + +Provenance + dedupe authority. One-to-one with a `social.colibri.message`. `key: "any"` so we control the rkey: + +``` +rkey = `${slackChannelId}-${slackTs.replace('.','-')}` +``` + +A redelivered Slack event hits a 409 at the PDS — that's what makes the bridge idempotent regardless of cache state. We can't rely on a deterministic rkey on the message itself: `social.colibri.message` uses `key: "tid"`, and TIDs are expected to increase monotonically per repo, so a PDS that enforces that ordering would reject old Slack history backfilled after live messages (`bsky.social` currently tolerates it; see Asks of Colibri). The sidecar sidesteps this: the message gets a fresh PDS-minted TID, and the sidecar's `messageUri` links the two records.
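A sketch of how the deterministic rkey and the conflict path fit together. `originRkey`, `publishOnce`, and the in-memory `FakePds` are hypothetical names; the fake stands in for the PDS's create and get calls and is synchronous only to keep the example short (the real calls are async HTTP requests).

```typescript
type OriginRecord = { messageUri: string; slackChannelId: string; slackTs: string };

// Same formula as the proposal: channel ID, then ts with its dot replaced by a dash.
function originRkey(slackChannelId: string, slackTs: string): string {
  return `${slackChannelId}-${slackTs.replace(".", "-")}`;
}

// Stand-in for the PDS: create reports a conflict if the rkey already exists (the 409 case).
class FakePds {
  private records = new Map<string, OriginRecord>();
  createRecord(rkey: string, record: OriginRecord): { created: boolean } {
    if (this.records.has(rkey)) return { created: false };
    this.records.set(rkey, record);
    return { created: true };
  }
  getRecord(rkey: string): OriginRecord | undefined {
    return this.records.get(rkey);
  }
}

// Cache hit -> skip; miss -> create; conflict -> adopt the record that won.
function publishOnce(
  pds: FakePds,
  cache: Map<string, OriginRecord>,
  record: OriginRecord,
): "skipped" | "won" | "lost" {
  const rkey = originRkey(record.slackChannelId, record.slackTs);
  if (cache.has(rkey)) return "skipped";
  const { created } = pds.createRecord(rkey, record);
  const canonical = created ? record : pds.getRecord(rkey)!;
  cache.set(rkey, canonical);
  return created ? "won" : "lost";
}
```

Two consumers with separate caches converge on the same `slackOrigin` record. The sketch does not model the message record itself: since the message is created before `slackOrigin`, a consumer that loses the race would presumably need to clean up its orphaned message.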
+ +```json +{ + "lexicon": 1, + "id": "com.feelingof.bridge.slackOrigin", + "defs": { + "main": { + "type": "record", + "key": "any", + "record": { + "type": "object", + "required": ["messageUri", "slackChannelId", "slackTs", "createdAt"], + "properties": { + "messageUri": { "type": "string", "format": "at-uri" }, + "slackChannelId": { "type": "string" }, + "slackTs": { "type": "string" }, + "slackUserId": { "type": "string" }, + "createdAt": { "type": "string", "format": "datetime" } + } + } + } + } +} +``` + +### New: `com.feelingof.bridge.slackChannel` + +Slack → Colibri channel mapping. Rkey = Slack channel ID. Created lazily on first sighting of a new Slack channel, after auto-creating the corresponding `social.colibri.channel`. + +```json +{ + "lexicon": 1, + "id": "com.feelingof.bridge.slackChannel", + "defs": { + "main": { + "type": "record", + "key": "any", + "record": { + "type": "object", + "required": ["slackChannelId", "colibriChannelUri", "createdAt"], + "properties": { + "slackChannelId": { "type": "string" }, + "slackChannelName": { "type": "string" }, + "colibriChannelUri": { "type": "string", "format": "at-uri" }, + "createdAt": { "type": "string", "format": "datetime" } + } + } + } + } +} +``` + +### New: `com.feelingof.bridge.slackUser` + +Identity record. Rkey = sanitised Slack user ID. `claimedDid` starts unset; populated via the claim flow. The OAuth half (v2) does not appear here — those credentials live in a separate D1 table, never on atproto. 
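The proposal does not define "sanitised". One plausible reading, sketched below, is mapping the Slack user ID onto atproto's record-key syntax (ASCII letters, digits, and `.`, `-`, `_`, `:`, `~`; 1 to 512 characters; not `.` or `..`). The function name and the choice of `-` as the replacement character are assumptions.

```typescript
// Hypothetical helper: coerce a Slack user ID into a syntactically valid record key.
function slackUserRkey(slackUserId: string): string {
  const cleaned = slackUserId.replace(/[^A-Za-z0-9._:~-]/g, "-").slice(0, 512);
  if (cleaned === "" || cleaned === "." || cleaned === "..") {
    throw new Error(`cannot derive a record key from ${JSON.stringify(slackUserId)}`);
  }
  return cleaned;
}
```

Typical Slack user IDs (uppercase letters and digits) pass through unchanged. Because replacement is lossy, two unusual IDs could in principle map to the same rkey, so the original ID should still be read from the record's `slackUserId` field.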
+ +```json +{ + "lexicon": 1, + "id": "com.feelingof.bridge.slackUser", + "defs": { + "main": { + "type": "record", + "key": "any", + "record": { + "type": "object", + "required": ["slackUserId", "createdAt"], + "properties": { + "slackUserId": { "type": "string" }, + "slackHandle": { "type": "string" }, + "displayName": { "type": "string" }, + "claimedDid": { "type": "string", "format": "did" }, + "claimedAt": { "type": "string", "format": "datetime" }, + "createdAt": { "type": "string", "format": "datetime" } + } + } + } + } +} +``` + +## Identity + +One bot DID on `bsky.social` (e.g. `feelingof.bsky.social`). The bot owns the Colibri community, every category in it, every channel under those categories, and every bridged message. Attribution for the original Slack speaker lives in the message text body as `@user: ...` (rendered with a mention facet once the speaker has claimed a DID). + +### Authorship is immutable + +The author of an atproto record is the DID of the repo it lives on. `putRecord` edits content; `deleteRecord` removes a record; nothing reassigns authorship. Consequences: + +- All bridged messages stay authored by the bot. Forever. +- "Post from user's DID" (v2 below) only applies to *new* messages after claim. Hybrid history. +- Retroactive delete + republish would change the at-uri, breaking links / threads / firehose state. Not recommended. + +### Channels live in the bot's community + +The Colibri appview hard-couples channel ownership to community ownership by author DID. From `jetstream.rs` channel handler: + +```rust +let community_uri = + format!("at://{}/social.colibri.community/{}", did, record.community); +``` + +The community URI an indexed channel belongs to is constructed from the **channel record's author DID** plus the channel's `community` rkey field. A bot-authored channel record can only resolve to a community on the bot's own repo; the appview will not index a bot-authored channel into a community owned by someone else's DID. 
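The same rule restated in TypeScript, as a small sketch of why a bot-authored channel cannot land in someone else's community. `communityUriFor` mirrors the `format!` call quoted above; `channelLandsIn` and the bot DID used in the usage lines are hypothetical.

```typescript
// Mirrors the appview's construction: author DID + the channel's `community` rkey.
function communityUriFor(channelAuthorDid: string, communityRkey: string): string {
  return `at://${channelAuthorDid}/social.colibri.community/${communityRkey}`;
}

// Hypothetical check: would a channel by this author be indexed into the target community?
function channelLandsIn(
  channelAuthorDid: string,
  communityRkey: string,
  targetCommunityUri: string,
): boolean {
  return communityUriFor(channelAuthorDid, communityRkey) === targetCommunityUri;
}

// Usage: the owner's community URI can only be produced by a channel the owner authored.
const ownerDid = "did:plc:j7nm3lrd5h7fm3sfhcv3lhfv";
const botDid = "did:plc:examplebotexamplebotexam"; // placeholder DID, not a real account
const ownerCommunity = communityUriFor(ownerDid, "community-rkey");
console.log(channelLandsIn(botDid, "community-rkey", ownerCommunity)); // false
console.log(channelLandsIn(ownerDid, "community-rkey", ownerCommunity)); // true
```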
+ +Consequences: + +- The bot bootstraps and owns its own `social.colibri.community` record. Bridged channels cannot be inserted into a pre-existing third-party community by the bot. +- Discovery of the bot-owned space happens by linking to the bot's community URI, not by appearing inside an existing community's sidebar. +- The bot's community + at least one category must exist before any channel lazy-creation. Categories' `channelOrder` is a read-modify-write append per new channel. + +#### Inverse pattern: community owner pre-creates channels + +The constraint is on the channel record's author DID, not on who writes messages into the channel. So a cooperative community owner can pre-create the bridged channels on their own repo, under their own community + category, and the bot publishes messages referencing those channel rkeys. Validated against the FoC Colibri instance: six Slack channels (`present-company`, `share-your-work`, `thinking-together`, `of-ai`, `devlog-together`, `linking-together`) created by the community owner on `did:plc:j7nm3lrd5h7fm3sfhcv3lhfv` under the Feelingsof community; `trendingnotebooks.bsky.social` (with a `social.colibri.membership` for that community) published a 9-message backfill (1 top-level + 8 thread replies) into `#present-company`. Messages appeared with correct threading. + +In this mode the bridge only needs the slack-channel → colibri-channel-rkey mapping; it does no channel-creation, no `social.colibri.community` ownership, no `channelOrder` mutation. The trade-off is one-time manual setup by the community owner and an ongoing convention that the owner adds a Colibri channel whenever a new Slack channel should be bridged. + +### Per-message avatar / displayName + +Avatar and displayName come from the actor's profile record — one per DID. With one bot DID, every bridged message renders with the bot's avatar. The Colibri message lexicon has no override fields. 
Per-user ghost DIDs (Bridgy Fed's approach) would solve this but we reject them for v0/v1: bsky.social account-creation rate limits, and creating DIDs for users without consent. + +Upstream ask of Colibri: extend `social.colibri.message` with optional render-time author overrides. + +```json +"displayAuthor": { + "type": "object", + "description": "Override author render for bridged messages.", + "properties": { + "name": { "type": "string", "maxLength": 64 }, + "avatar": { "type": "blob", "accept": ["image/jpeg", "image/png"] } + } +} +``` + +### Claim flow + +A user posts `I am did:plc:...` in any bridged Slack channel. The bridge extracts the DID and writes it to the `slackUser.claimedDid` atproto record (and the cache mirrors it). No cryptographic verification in v1 — posting it from their Slack account is the trust signal, and the claim is publicly visible for anyone to challenge. + +v2 requires a counter-claim record on the claimed DID's repo, verifiable from atproto alone. + +### Posting from the claimed DID (v2) + +Once `claimedDid` is set, future messages could be authored from the user's DID. Requires an OAuth credential delegated to the bridge, against the user's PDS — atproto OAuth specifically, not Bluesky app-passwords (which are a bsky.social UX, not portable across PDSes). Refresh tokens are medium-lived; the bridge re-prompts via Slack DM near expiry. Tokens live in D1's `oauth_tokens` table (separate from the atproto cache), encrypted at rest, keyed by Slack user ID. + +### Credential-free alternative: user-driven backfill + +A claimed user can republish their own messages onto their own repo at any time without granting the bridge anything. The bot's repo is a public archive — pull the `slackOrigin` records matching their `slackUserId`, republish the corresponding messages from their own DID. We ship a small CLI. No trust delegation, full data ownership. 
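For the claim flow in this section, a minimal sketch of extracting the DID from a Slack message. It assumes the claim must be the entire message (so quoting someone else's claim mid-sentence does not count) and that only `did:plc` identifiers, which are `did:plc:` followed by 24 lowercase base32 characters, are accepted. Both constraints and the function name are assumptions rather than decisions recorded in the proposal.

```typescript
// "I am did:plc:<24 base32 chars>" as the whole message, surrounding whitespace allowed.
const CLAIM_RE = /^\s*I am (did:plc:[a-z2-7]{24})\s*$/;

// Hypothetical helper: returns the claimed DID, or null if the text is not a claim.
function extractClaimedDid(text: string): string | null {
  const match = CLAIM_RE.exec(text);
  return match ? match[1] : null;
}
```

A non-null result would be written to `slackUser.claimedDid` for the Slack user who posted the message; since v1 does no cryptographic verification, the strict pattern mainly guards against accidental matches.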
+ +## Asks of Colibri + +These are upstream changes the bridge benefits from but does not block on. Until they land, `slackRaw` preserves enough state to re-derive when they do. + +- **Per-record author override** — optional `displayAuthor: { name, avatar? }` on `social.colibri.message` and `social.colibri.reaction`. Without it every bridged message and every aggregated reaction renders as the bot, with attribution hacked into the message text body as `@user: ` and reactions collapsed to a single "@bot reacted" entry per emoji. Single biggest UX win; unblocks proper reaction multi-reactor counts too. +- **Collapsed / nested thread rendering** — Colibri's current UI is Discourse-flat (every reply is a top-level row referencing a `parent` rkey). Slack's threaded conversations don't survive the trip: a 30-reply thread on one Slack message becomes 30 sibling rows in the channel scroll. Inspected `feat/rework` (substantial monorepo + lexicon rewrite in flight) and the new Message component still renders flat with `parent_message` as a jump-link, not as a collapsed sub-thread. Worth raising as a v2 UX direction. +- **Cross-repo channel ownership** — the appview hard-codes `community_uri = at://{channel_author}/social.colibri.community/{rkey}`. The bot cannot create channels in a community it does not own. Today's workaround is "community owner pre-creates channels"; cleaner is either a `communityRepo` field on `social.colibri.channel`, or a `social.colibri.delegation` record granting channel-creation to a specific DID. +- **Quote facet feature** — Slack's `rich_text_quote` blocks render as `> `-prefixed plain text today because Colibri's facet feature set covers bold/italic/strikethrough/code/mention/link/channel but not quote. +- **Confirm TID-on-rkey monotonicity expectations** — `social.colibri.message` uses `key: "tid"`. `bsky.social` tolerates non-monotonic TIDs on rkeys (otherwise our backfill would 409 against live messages). 
PDSes that *do* enforce monotonicity would break the bridge. Worth a one-line "we don't require monotonic rkeys" assurance in the lexicon docs, or a switch to `key: "any"`. +- **Attachment shape clarity** — examples / docs for `social.colibri.message.attachments[]` would unblock our v0.1 file-attachment work. + +## Open questions + +- Backfill from `dump-history.js` snapshot, or forward-only? All-channels backfill is significant volume. +- Channel / category layout: single community with flat siblings, or map Slack groupings to Colibri categories? Sidecar is agnostic. +- Private channels and DMs — out of scope. Bot joins public channels only. +- Reactions, edits, deletes — v0 publishes one `social.colibri.reaction` per (target_message, emoji) on the bot's repo; multi-reactor counts are preserved losslessly in `slackRaw` and become recoverable once per-record author override lands. Edits: `putRecord` on both `slackRaw` and the message. Deletes: tombstone the message, retain the `slackRaw` for audit. +- Slack file attachments — `payload.files[]` is preserved in `slackRaw` (including `url_private`) from v0; v0.1 fetches and re-uploads as atproto blobs. bsky.social blob size limits (~1 MB images, ~50 MB video) will force large attachments to external hosting or a more permissive PDS. +- False DID claims. v1 unverified; v2 requires two-sided counter-claim. +- OAuth re-auth UX (v2): frequency cap, fallback when user ignores the prompt. + +## Prior art + +- **[Bridgy Fed](https://fed.brid.gy)** — ActivityPub ↔ atproto. Not applicable directly (Slack isn't ActivityPub) but informs the rejected per-user ghost-DID approach and our `slackOrigin` provenance pattern. +- **[matrix-appservice-slack](https://github.com/matrix-org/matrix-appservice-slack)** — closest sibling. Same Slack-webhook → ghost-users → federated-protocol shape, targeting Matrix. 
+- **Mariano's `scripts/dump-history.js` + `foc-server`** ([repo](https://github.com/marianoguerra/Feeling-of-Computing)) — the existing FoC Slack pipeline runs in a different shape: a Node CLI that pulls `conversations.history` + `conversations.replies` via Slack's REST API on a manual / weekly cadence, writes JSON to `history/YYYY/MM/DD{,.replies}.json`, indexes it into LanceDB with sentence-transformer embeddings, and serves search via a Rust `axum` binary deployed on Ubuntu under systemd behind nginx (see `foc-server/docs/systemd.md`). Pull, not push; ingest-and-reindex, not bridge. Our `slackRaw` lexicon is the atproto-native analog of those committed `history/*.json` dumps — same archival role, public over the firehose instead of `git push`. + +## Related + +- [[Projects]] — Colibri and FoC are both listed. +- [Colibri lexicons](https://lexicon.garden/browse/social.colibri) +- [Colibri source](https://github.com/colibri-social/colibri.social) +- [FoC repo](https://github.com/marianoguerra/Feeling-of-Computing) — see `scripts/dump-history.js` for the current Slack puller. diff --git a/pages/slack-colibri-bridge.ts b/pages/slack-colibri-bridge.ts new file mode 100644 index 0000000..8670bda --- /dev/null +++ b/pages/slack-colibri-bridge.ts @@ -0,0 +1,694 @@ +#!/usr/bin/env bun +// Reference implementation of the Slack→Colibri bridge backfill. +// See pages/slack-colibri-bridge.md for the design proposal this script realises. +// +// Reads a day's worth of Slack history JSON (Mariano's dump-history.js output) +// and publishes each message as a `social.colibri.message` record on a bot's +// atproto repo. Two-pass: top-level first, then thread replies with `parent` +// set from a deterministic TID derived from `thread_ts`. +// +// Channel lookup precedence: +// 1. tools/slack-to-colibri-channel.json (community-owner-pre-created channels) +// 2. 
otherwise deterministic rkey derived from Slack channel.created; +// live mode lazy-creates the channel + updates the category's channelOrder +// (needs COLIBRI_COMMUNITY_URI + COLIBRI_CATEGORY_RKEY). +// +// Rich text: when a message carries Slack's structured `blocks` (rich_text), +// it's walked into Colibri `text + facets` covering mentions, channels, links, +// bold/italic/strikethrough/code, code blocks, quotes, lists, and emoji +// (resolved via vendor/feeling-of-computing/conversations/src/emoji-data.js). +// Falls back to the legacy `text` field with regex link extraction when +// `blocks` is absent (older Slack messages, app-posted messages without blocks). +// +// Idempotent: every rkey is derived from Slack identifiers; uses putRecord. +// Dry-run by default. Pass --live to publish. +// --live always needs BSKY_HANDLE + BSKY_APP_PASSWORD. + +import { readFileSync } from "node:fs"; +import { resolve as resolvePath } from "node:path"; +import { pathToFileURL } from "node:url"; +import { parseArgs } from "node:util"; + +const PDS = "https://bsky.social"; +const USERS_JSON = "vendor/feeling-of-computing/history/users.json"; +const CHANNELS_JSON = "vendor/feeling-of-computing/history/channels.json"; +const SLACK_TO_DID_JSON = "tools/slack-to-did.json"; +const SLACK_TO_COLIBRI_CHANNEL_JSON = "tools/slack-to-colibri-channel.json"; +const EMOJI_DATA_JS = + "vendor/feeling-of-computing/conversations/src/emoji-data.js"; + +const { values } = parseArgs({ + args: process.argv.slice(2), + options: { + "src-day": { type: "string" }, + "src-dir": { + type: "string", + default: "vendor/feeling-of-computing/history", + }, + limit: { type: "string", default: "1000" }, + live: { type: "boolean", default: false }, + "delay-ms": { type: "string", default: "200" }, + }, +}); + +if (!values["src-day"]) { + console.error("usage: bun pages/slack-colibri-bridge.ts --src-day YYYY/MM/DD [--limit N] [--live]"); + process.exit(1); +} + +const srcDay = values["src-day"]!; +const 
srcDir = values["src-dir"]!; +const limit = parseInt(values.limit!, 10); +const dryRun = !values.live; +const delayMs = parseInt(values["delay-ms"]!, 10); + +// ── reference data ────────────────────────────────────────────────────────── +type SlackUser = { + id: string; + name?: string; + real_name?: string; + profile?: { display_name?: string }; +}; +const users: SlackUser[] = JSON.parse(readFileSync(USERS_JSON, "utf-8")); +const nameOf = new Map<string, string>( + users.map((u) => [u.id, u.profile?.display_name || u.real_name || u.name || u.id]), +); + +type SlackChannel = { id: string; name: string; created: number }; +const channels: SlackChannel[] = JSON.parse(readFileSync(CHANNELS_JSON, "utf-8")); +const channelOf = new Map<string, SlackChannel>(channels.map((c) => [c.id, c])); + +const SLACK_TO_DID: Record<string, string> = (() => { + const map: Record<string, string> = {}; + try { + const raw = JSON.parse(readFileSync(SLACK_TO_DID_JSON, "utf-8")); + for (const [k, v] of Object.entries(raw)) { + if (k.startsWith("_")) continue; + const did = typeof v === "string" ? v : (v as any)?.did; + if (did) map[k] = did; + } + } catch {} + return map; +})(); + +const MANUAL_CHANNELS: Record<string, string> = (() => { + const map: Record<string, string> = {}; + try { + const raw = JSON.parse(readFileSync(SLACK_TO_COLIBRI_CHANNEL_JSON, "utf-8")); + for (const [k, v] of Object.entries(raw)) { + if (k.startsWith("_")) continue; + const rkey = typeof v === "string" ? v : (v as any)?.rkey; + if (rkey) map[k] = rkey; + } + } catch {} + return map; +})(); + +// emoji map (Mariano's tables); loaded dynamically because the file lives in a vendor submodule +let EMOJI_MAP = new Map<string, string>(); +try { + const url = pathToFileURL(resolvePath(process.cwd(), EMOJI_DATA_JS)).href; + const mod: any = await import(url); + for (const [name, unicode] of mod.entries ?? []) EMOJI_MAP.set(name, unicode); + for (const [name, unicode] of Object.entries(mod.aliases ??
{})) + EMOJI_MAP.set(name, unicode as string); +} catch { + console.error(`(emoji data not loaded from ${EMOJI_DATA_JS}; falling back to :name:)`); +} + +// ── deterministic TID derivation (53b microseconds + 10b clock id) ────────── +const TID_ALPHABET = "234567abcdefghijklmnopqrstuvwxyz"; +function tidFromMicros(microseconds: bigint, clockId = 0): string { + let n = (microseconds << 10n) | BigInt(clockId & 0x3ff); + const chars: string[] = []; + for (let i = 0; i < 13; i++) { + chars.push(TID_ALPHABET[Number(n & 0x1fn)]); + n >>= 5n; + } + return chars.reverse().join(""); +} +function tidFromSlackTs(ts: string, clockId = 0) { + const [sec, usecRaw = ""] = ts.split("."); + const usec = (usecRaw + "000000").slice(0, 6); + return tidFromMicros(BigInt(sec) * 1_000_000n + BigInt(usec), clockId); +} +function hash10(s: string): number { + let h = 0; + for (const c of s) h = (h * 31 + c.charCodeAt(0)) | 0; + return Math.abs(h) & 0x3ff; +} +// Reaction rkey: synthesise time from the *message* ts so reactions live next to +// their target in TID order; clockId distinguishes the emoji. With 10 bits of +// clockId space and a small number of distinct emojis per message, collisions +// are rare; collisions just merge two emoji into one reaction record, which +// `slackRaw` can correct if we re-derive. 
+function tidForReaction(messageTs: string, emojiName: string) { + return tidFromSlackTs(messageTs, hash10(`react:${emojiName}`)); +} +function colibriChannelRkey(slackChannelId: string): string { + const ch = channelOf.get(slackChannelId); + if (!ch) throw new Error(`unknown slack channel ${slackChannelId}`); + return tidFromMicros(BigInt(ch.created) * 1_000_000n, hash10(slackChannelId)); +} + +// ── facet builder ────────────────────────────────────────────────────────── +const enc = new TextEncoder(); +const utf8Len = (s: string) => enc.encode(s).length; + +type Facet = { + $type: "social.colibri.richtext.facet"; + index: { byteStart: number; byteEnd: number }; + features: any[]; +}; + +class FacetBuilder { + parts: string[] = []; + facets: Facet[] = []; + byteOffset = 0; + + emit(text: string, ...features: any[]) { + if (!text) return; + const start = this.byteOffset; + this.parts.push(text); + this.byteOffset += utf8Len(text); + if (features.length > 0) { + this.facets.push({ + $type: "social.colibri.richtext.facet", + index: { byteStart: start, byteEnd: this.byteOffset }, + features, + }); + } + } + + finish() { + return { text: this.parts.join(""), facets: this.facets }; + } +} + +// ── blocks walker (ported from Mariano's components.js fromData methods) ── +// Produces Colibri text + facets from Slack's rich_text block tree. Element +// types we recognise: rich_text_section, rich_text_quote, rich_text_preformatted, +// rich_text_list (and inside sections: text, link, user, channel, emoji, broadcast). + +function walkBlocks(blocks: any[], b: FacetBuilder) { + for (let i = 0; i < blocks.length; i++) { + const block = blocks[i]; + if (block?.type !== "rich_text") continue; + walkRichTextElements(block.elements ?? 
[], b); + if (i < blocks.length - 1) b.emit("\n"); + } +} + +function walkRichTextElements(elements: any[], b: FacetBuilder) { + for (let i = 0; i < elements.length; i++) { + const el = elements[i]; + switch (el?.type) { + case "rich_text_section": + walkSection(el.elements ?? [], b); + break; + case "rich_text_quote": + walkQuote(el.elements ?? [], b); + break; + case "rich_text_preformatted": + walkPreformatted(el.elements ?? [], b); + break; + case "rich_text_list": + walkList(el, b); + break; + } + if (i < elements.length - 1) b.emit("\n"); + } +} + +function walkSection(elements: any[], b: FacetBuilder) { + for (const el of elements) walkSectionItem(el, b); +} + +function walkSectionItem(item: any, b: FacetBuilder) { + switch (item?.type) { + case "text": { + const features: any[] = []; + const s = item.style ?? {}; + if (s.bold) + features.push({ $type: "social.colibri.richtext.facet#bold" }); + if (s.italic) + features.push({ $type: "social.colibri.richtext.facet#italic" }); + if (s.strike) + features.push({ $type: "social.colibri.richtext.facet#strikethrough" }); + if (s.code) + features.push({ $type: "social.colibri.richtext.facet#code" }); + b.emit(item.text ?? "", ...features); + break; + } + case "link": { + const text = item.text || item.url; + b.emit(text, { + $type: "social.colibri.richtext.facet#link", + uri: item.url, + }); + break; + } + case "user": { + const did = SLACK_TO_DID[item.user_id]; + const name = nameOf.get(item.user_id) ?? item.user_id; + if (did) + b.emit(`@${name}`, { + $type: "social.colibri.richtext.facet#mention", + did, + }); + else b.emit(`@${name}`); + break; + } + case "channel": { + const ch = channelOf.get(item.channel_id); + const name = ch?.name ?? 
item.channel_id; + const rkey = MANUAL_CHANNELS[item.channel_id]; + if (rkey) + b.emit(`#${name}`, { + $type: "social.colibri.richtext.facet#channel", + channel: rkey, + }); + else b.emit(`#${name}`); + break; + } + case "emoji": { + // Slack sends `unicode` (codepoint sequence, dash-separated) for standard emoji, + // and only the `name` for custom workspace emoji. + let unicode = ""; + if (item.unicode) { + try { + unicode = String.fromCodePoint( + ...item.unicode.split("-").map((h: string) => parseInt(h, 16)), + ); + } catch {} + } + if (!unicode) unicode = EMOJI_MAP.get(item.name) ?? `:${item.name}:`; + b.emit(unicode); + break; + } + case "broadcast": + b.emit(`@${item.range}`); + break; + case "color": + b.emit(item.value ?? ""); + break; + } +} + +function walkQuote(elements: any[], b: FacetBuilder) { + // Build the inner text, then line-prefix with "> ". Facet offsets inside + // the quoted block are dropped (v0 best-effort). + const inner = new FacetBuilder(); + walkSection(elements, inner); + const { text } = inner.finish(); + const prefixed = text + .split("\n") + .map((l) => `> ${l}`) + .join("\n"); + b.emit(prefixed); +} + +function walkPreformatted(elements: any[], b: FacetBuilder) { + // Render the inner text, then wrap the whole span in a single code facet. + const inner = new FacetBuilder(); + walkSection(elements, inner); + const { text } = inner.finish(); + if (!text) return; + b.emit("\n"); + const start = b.byteOffset; + b.parts.push(text); + b.byteOffset += utf8Len(text); + b.facets.push({ + $type: "social.colibri.richtext.facet", + index: { byteStart: start, byteEnd: b.byteOffset }, + features: [{ $type: "social.colibri.richtext.facet#code" }], + }); + b.emit("\n"); +} + +function walkList(list: any, b: FacetBuilder) { + const ordered = list.style === "ordered"; + const items = list.elements ?? []; + for (let i = 0; i < items.length; i++) { + const prefix = ordered ? `${i + 1}. 
` : "• "; + b.emit(prefix); + // Items can be rich_text_section or another rich_text_list. + const child = items[i]; + if (child?.type === "rich_text_section") walkSection(child.elements ?? [], b); + else if (child?.elements) walkSection(child.elements, b); + if (i < items.length - 1) b.emit("\n"); + } +} + +// ── message builder ───────────────────────────────────────────────────────── + +function buildMessage(m: any, channelRkey: string, parentRkey?: string) { + const author = nameOf.get(m.user || "") || m.user || "unknown"; + const claimedDid = SLACK_TO_DID[m.user || ""]; + + const b = new FacetBuilder(); + if (claimedDid) + b.emit(`@${author}`, { + $type: "social.colibri.richtext.facet#mention", + did: claimedDid, + }); + else b.emit(`@${author}`); + b.emit(": "); + + if (Array.isArray(m.blocks) && m.blocks.some((blk: any) => blk?.type === "rich_text")) { + walkBlocks(m.blocks, b); + } else { + // Legacy fallback: plain text + URL regex link facets + entity decoding. + legacyTextFallback(m.text || "", b); + } + + let { text, facets } = b.finish(); + + // 2048-char hard cap. If we truncate, drop any facets that extend past the cut. + const truncated = text.length > 2048; + if (truncated) { + text = text.slice(0, 2048); + const maxBytes = utf8Len(text); + facets = facets.filter((f) => f.index.byteEnd <= maxBytes); + } + + return { + rkey: tidFromSlackTs(m.ts), + record: { + $type: "social.colibri.message", + text, + channel: channelRkey, + createdAt: new Date(parseFloat(m.ts) * 1000).toISOString(), + facets, + attachments: [], + ...(parentRkey ? { parent: parentRkey } : {}), + }, + hasBlocks: Array.isArray(m.blocks) && m.blocks.length > 0, + facetCount: facets.length, + // Report whether the 2048-char cap above actually cut the text. + truncated, + }; +} + +function emojiForReaction(name: string): string { + // Strip Slack ":name::skin-tone-X:" → look up base name, accept the loss of skin tone in v0. + const baseName = name.split("::")[0]; + return EMOJI_MAP.get(baseName) ?? 
`:${name}:`; +} + +// Walk a message's `reactions` array → one reaction record per emoji (per +// message). Multiple Slack users with the same emoji collapse into one record +// (they'd all author from the bot anyway and the appview likely dedupes by +// (author, emoji, target)). Multi-reactor count is preserved losslessly in +// `slackRaw`. +function reactionsFor(m: any, targetMessageRkey: string) { + const out: { rkey: string; record: any; userCount: number; name: string; emoji: string }[] = []; + for (const r of m.reactions ?? []) { + if (!r?.name) continue; + const emoji = emojiForReaction(r.name); + out.push({ + rkey: tidForReaction(m.ts, r.name), + name: r.name, + emoji, + userCount: (r.users ?? []).length || r.count || 1, + record: { + $type: "social.colibri.reaction", + emoji, + targetMessage: targetMessageRkey, + }, + }); + } + return out; +} + +function legacyTextFallback(raw: string, b: FacetBuilder) { + const decoded = raw + .replace(/<([^>|]+)\|([^>]+)>/g, "$2") + .replace(/<(https?:\/\/[^>]+)>/g, "$1") + .replace(/&lt;/g, "<") + .replace(/&gt;/g, ">") + .replace(/&amp;/g, "&"); + const urlRe = /https?:\/\/[^\s<>"']+/g; + let last = 0; + let m: RegExpExecArray | null; + while ((m = urlRe.exec(decoded)) !== null) { + if (m.index > last) b.emit(decoded.slice(last, m.index)); + b.emit(m[0], { + $type: "social.colibri.richtext.facet#link", + uri: m[0], + }); + last = m.index + m[0].length; + } + if (last < decoded.length) b.emit(decoded.slice(last)); +} + +// ── load day's data ───────────────────────────────────────────────────────── +const dayPath = `${srcDir}/${srcDay}`; +let topLevelRaw: any[] = []; +let repliesRaw: any[] = []; +try { + topLevelRaw = JSON.parse(readFileSync(`${dayPath}.json`, "utf-8")); +} catch {} +try { + repliesRaw = JSON.parse(readFileSync(`${dayPath}.replies.json`, "utf-8")); +} catch {} + +const tops = topLevelRaw + .filter( + (m) => + m.type === "message" && + !m.subtype && + m.text && + (!m.thread_ts || m.thread_ts === m.ts), + ) + 
.sort((a, b) => parseFloat(a.ts) - parseFloat(b.ts)) + .slice(0, limit); + +const replies = repliesRaw + .filter( + (m) => + m.type === "message" && + !m.subtype && + m.text && + m.thread_ts && + m.thread_ts !== m.ts, + ) + .sort((a, b) => parseFloat(a.ts) - parseFloat(b.ts)) + .slice(0, limit); + +const slackChannelsTouched = new Set( + [...tops, ...replies].map((m) => m.channel_id).filter(Boolean), +); + +const channelMap: Record<string, string> = {}; +const channelSrc: Record<string, string> = {}; +for (const cid of slackChannelsTouched) { + if (MANUAL_CHANNELS[cid]) { + channelMap[cid] = MANUAL_CHANNELS[cid]; + channelSrc[cid] = "manual"; + } else { + channelMap[cid] = colibriChannelRkey(cid); + channelSrc[cid] = "derived"; + } +} +const allManual = [...slackChannelsTouched].every( + (cid) => channelSrc[cid] === "manual", +); + +// ── preview ───────────────────────────────────────────────────────────────── +console.log(`=== ${srcDay} ===`); +console.log( + `top-level: ${tops.length} replies: ${replies.length} channels: ${slackChannelsTouched.size}`, +); +console.log(""); +console.log(`CHANNELS (${allManual ? "manual mapping" : "deterministic / lazy-create"}):`); +for (const cid of slackChannelsTouched) { + const ch = channelOf.get(cid)!; + console.log( + ` ${cid.padEnd(13)} ${ch.name.padEnd(22)} → ${channelMap[cid]} [${channelSrc[cid]}]`, + ); +} + +const fmtRow = ( + m: any, + built: ReturnType<typeof buildMessage>, + parent?: string, +) => { + const tags = `${built.hasBlocks ? "B" : "."}${built.facetCount.toString().padStart(2, " ")}`; + const parentCol = parent ? `parent=${parent}` : " "; + const rxCount = (m.reactions ?? []).length; + const rxTag = rxCount ? 
`+${rxCount}r` : " "; + return ` ${m.ts} ${(m.channel_name || "?").padEnd(20)} ${built.rkey} ${parentCol} ${tags} ${rxTag} '${built.record.text.slice(0, 70).replace(/\n/g, " ")}…'`; +}; + +console.log(""); +console.log("TOP-LEVEL:"); +for (const m of tops) + console.log(fmtRow(m, buildMessage(m, channelMap[m.channel_id]))); + +console.log(""); +console.log("REPLIES:"); +for (const m of replies) { + const parent = tidFromSlackTs(m.thread_ts!); + console.log(fmtRow(m, buildMessage(m, channelMap[m.channel_id], parent), parent)); +} + +const allWithReactions: { m: any; targetRkey: string }[] = []; +for (const m of tops) + if (m.reactions?.length) + allWithReactions.push({ m, targetRkey: tidFromSlackTs(m.ts) }); +for (const m of replies) + if (m.reactions?.length) + allWithReactions.push({ m, targetRkey: tidFromSlackTs(m.ts) }); + +if (allWithReactions.length > 0) { + console.log(""); + console.log("REACTIONS:"); + for (const { m, targetRkey } of allWithReactions) { + for (const r of reactionsFor(m, targetRkey)) { + console.log( + ` ${m.ts} target=${targetRkey} rkey=${r.rkey} ${r.emoji} (:${r.name}: ×${r.userCount})`, + ); + } + } +} + +if (dryRun) { + console.log(""); + console.log("(dry-run; pass --live to publish)"); + process.exit(0); +} + +// ── live mode ─────────────────────────────────────────────────────────────── +const HANDLE = process.env.BSKY_HANDLE; +const PASSWORD = process.env.BSKY_APP_PASSWORD; +if (!HANDLE || !PASSWORD) { + console.error("set BSKY_HANDLE, BSKY_APP_PASSWORD"); + process.exit(1); +} +const COMMUNITY_URI = process.env.COLIBRI_COMMUNITY_URI; +const CATEGORY_RKEY = process.env.COLIBRI_CATEGORY_RKEY; +if (!allManual && (!COMMUNITY_URI || !CATEGORY_RKEY)) { + console.error("some channels need lazy-create; set COLIBRI_COMMUNITY_URI + COLIBRI_CATEGORY_RKEY,"); + console.error(`or add them to ${SLACK_TO_COLIBRI_CHANNEL_JSON}`); + process.exit(1); +} +const COMMUNITY_RKEY = COMMUNITY_URI?.split("/").pop(); + +const sessRes = await 
fetch(`${PDS}/xrpc/com.atproto.server.createSession`, { + method: "POST", + headers: { "Content-Type": "application/json" }, + body: JSON.stringify({ identifier: HANDLE, password: PASSWORD }), +}); +if (!sessRes.ok) throw new Error(`login: ${await sessRes.text()}`); +const sess: any = await sessRes.json(); +const did = sess.did as string; +const auth = { + "Content-Type": "application/json", + Authorization: `Bearer ${sess.accessJwt}`, +}; +console.error(`logged in as @${sess.handle} (${did})`); + +async function put(collection: string, rkey: string, record: any) { + const r = await fetch(`${PDS}/xrpc/com.atproto.repo.putRecord`, { + method: "POST", + headers: auth, + body: JSON.stringify({ repo: did, collection, rkey, record }), + }); + if (!r.ok) + throw new Error(`putRecord ${collection}/${rkey}: ${r.status} ${await r.text()}`); + return await r.json(); +} +async function get(repo: string, collection: string, rkey: string) { + const r = await fetch( + `${PDS}/xrpc/com.atproto.repo.getRecord?repo=${repo}&collection=${collection}&rkey=${rkey}`, + ); + if (r.status === 404) return null; + if (!r.ok) throw new Error(`getRecord ${collection}/${rkey}: ${r.status}`); + return await r.json(); +} + +if (!allManual) { + const catRes = await get(did, "social.colibri.category", CATEGORY_RKEY!); + if (!catRes) { + console.error(`category ${CATEGORY_RKEY} not found on ${did}; create it first`); + process.exit(1); + } + const categoryRecord: any = catRes.value; + const existingOrder: string[] = categoryRecord.channelOrder || []; + const newRkeys: string[] = []; + + for (const cid of slackChannelsTouched) { + if (channelSrc[cid] === "manual") continue; + const rkey = channelMap[cid]; + const ch = channelOf.get(cid)!; + const existing = await get(did, "social.colibri.channel", rkey); + if (!existing) { + await put("social.colibri.channel", rkey, { + $type: "social.colibri.channel", + name: ch.name, + type: "text", + category: CATEGORY_RKEY, + community: COMMUNITY_RKEY, + 
ownerOnly: false, + }); + console.error(` created #${ch.name} (${rkey})`); + } + if (!existingOrder.includes(rkey)) newRkeys.push(rkey); + await new Promise((r) => setTimeout(r, delayMs)); + } + + if (newRkeys.length > 0) { + categoryRecord.channelOrder = [...existingOrder, ...newRkeys]; + await put("social.colibri.category", CATEGORY_RKEY!, categoryRecord); + console.error(` category.channelOrder +${newRkeys.length}`); + } +} + +console.error(""); +console.error("top-level…"); +let okT = 0, failT = 0; +for (const m of tops) { + const built = buildMessage(m, channelMap[m.channel_id]); + try { + await put("social.colibri.message", built.rkey, built.record); + okT++; + } catch (e) { + failT++; + console.error(` fail ${m.ts}: ${e}`); + } + if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs)); +} + +console.error(""); +console.error("replies…"); +let okR = 0, failR = 0; +for (const m of replies) { + const parent = tidFromSlackTs(m.thread_ts!); + const built = buildMessage(m, channelMap[m.channel_id], parent); + try { + await put("social.colibri.message", built.rkey, built.record); + okR++; + } catch (e) { + failR++; + console.error(` fail ${m.ts}: ${e}`); + } + if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs)); +} + +console.error(""); +console.error("reactions…"); +let okX = 0, failX = 0; +for (const { m, targetRkey } of allWithReactions) { + for (const r of reactionsFor(m, targetRkey)) { + try { + await put("social.colibri.reaction", r.rkey, r.record); + okX++; + } catch (e) { + failX++; + console.error(` fail ${m.ts} ${r.name}: ${e}`); + } + if (delayMs > 0) await new Promise((rs) => setTimeout(rs, delayMs)); + } +} + +console.error(""); +console.error(`done: ${okT} top-level, ${okR} replies, ${okX} reactions, ${failT + failR + failX} failed`);