Commit 358bce8

Merge branch 'main' into brendan/sou-948-access-token-scope-validation-in-resource-server
2 parents 090a2ae + 216c7d8 commit 358bce8

49 files changed

Lines changed: 2725 additions & 95 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [EE] Added a context-window usage gauge to the Ask Sourcebot chat details, showing how much of the selected model's context window each turn occupies. Window sizes are resolved from the models.dev catalog. [#1370](https://github.com/sourcebot-dev/sourcebot/pull/1370)
 - Added language model input-modality and document capability resolution, automatically resolved from the models.dev catalog (falls back to text-only for uncatalogued/self-hosted models). [#1372](https://github.com/sourcebot-dev/sourcebot/pull/1372)
 - [EE] Added DPoP sender-constrained OAuth tokens for MCP clients. [#1395](https://github.com/sourcebot-dev/sourcebot/pull/1395)
+- [EE] Added text file attachments to Ask Sourcebot, letting users attach text/code/config files to a chat message via the paperclip button, drag-and-drop, or paste, with large pastes auto-converted to attachments. [#1374](https://github.com/sourcebot-dev/sourcebot/pull/1374)
+- [EE] Added image attachments to Ask Sourcebot, letting users attach images to a chat message when the selected model supports image input. [#1375](https://github.com/sourcebot-dev/sourcebot/pull/1375)
 
 ### Fixed
 - Send anonymous server-side PostHog events as personless so unauthenticated requests don't inflate person counts. [#1367](https://github.com/sourcebot-dev/sourcebot/pull/1367)

docs/docs/configuration/environment-variables.mdx

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ The following environment variables allow you to configure your Sourcebot deploy
 | `SOURCEBOT_TELEMETRY_DISABLED` | `false` | <p>Enables/disables telemetry collection in Sourcebot. See [this doc](/docs/misc/telemetry) for more info.</p> |
 | `DEFAULT_MAX_MATCH_COUNT` | `10000` | <p>The default maximum number of search results to return when using search in the web app.</p> |
 | `ALWAYS_INDEX_FILE_PATTERNS` | - | <p>A comma separated list of glob patterns matching file paths that should always be indexed, regardless of size or number of trigrams.</p> |
+| `SOURCEBOT_CHAT_ATTACHMENT_MAX_IMAGE_BYTES` | `10485760` (10 MiB) | <p>Maximum size in bytes of a single image attachment uploaded to Ask Sourcebot. Enforced server-side at upload time.</p> |
+| `SOURCEBOT_CHAT_ATTACHMENT_ORPHAN_TTL_HOURS` | `24` | <p>How long in hours an uploaded-but-unsent attachment is retained before being deleted by the orphan sweep. Set to `0` to disable the sweep.</p> |
 | `NODE_USE_ENV_PROXY` | `0` | <p>Enables Node.js to automatically use `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` environment variables for network requests. Set to `1` to enable or `0` to disable. See [this doc](https://nodejs.org/en/learn/http/enterprise-network-configuration) for more info.</p> |
 | `HTTP_PROXY` | - | <p>HTTP proxy URL for routing non-SSL requests through a proxy server (e.g., `http://proxy.company.com:8080`). Requires `NODE_USE_ENV_PROXY=1`.</p> |
 | `HTTPS_PROXY` | - | <p>HTTPS proxy URL for routing SSL requests through a proxy server (e.g., `http://proxy.company.com:8080`). Requires `NODE_USE_ENV_PROXY=1`.</p> |
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+import { AttachmentStatus, PrismaClient } from "@sourcebot/db";
+import { createLogger, env, getStorageBackend } from "@sourcebot/shared";
+import { setIntervalAsync } from "./utils.js";
+
+const BATCH_SIZE = 1_000;
+const ONE_HOUR_MS = 60 * 60 * 1000;
+
+const logger = createLogger('attachment-pruner');
+
+/**
+ * Periodically reclaims orphaned attachment blobs older than the configured TTL,
+ * along with their stored bytes, using the `DELETING` tombstone protocol: an
+ * orphan is first atomically flipped to `DELETING`, then its bytes are deleted,
+ * and only then is the row removed. Because the row (the only durable handle to
+ * the bytes) outlives the byte delete, a failed byte delete is always retryable.
+ *
+ * Each tick condemns two classes of orphan to `DELETING`, then reclaims all
+ * tombstones:
+ *
+ * 1. PENDING (uploaded-but-never-linked): produced when a user selects a file
+ *    in the chat box but never sends the message.
+ * 2. COMMITTED with zero links: normally a committed blob is reclaimed inline
+ *    by the chat-delete sweep in the web app, but if that sweep is interrupted
+ *    (process crash / DB error / failed byte delete) the blob is left tombstoned
+ *    or unlinked. This is the backstop for that case.
+ *
+ * @note Byte deletion goes through the shared `StorageBackend`, so the web app
+ * and this worker share one on-disk layout.
+ */
+export class AttachmentPruner {
+    private interval?: NodeJS.Timeout;
+    private readonly storage = getStorageBackend();
+
+    constructor(private db: PrismaClient) {}
+
+    startScheduler() {
+        const ttlHours = env.SOURCEBOT_CHAT_ATTACHMENT_ORPHAN_TTL_HOURS;
+        if (ttlHours <= 0) {
+            logger.info('SOURCEBOT_CHAT_ATTACHMENT_ORPHAN_TTL_HOURS is 0, attachment orphan pruning is disabled.');
+            return;
+        }
+
+        logger.debug(`Attachment pruner started. Reclaiming orphaned attachments older than ${ttlHours} hours.`);
+
+        // Run immediately on startup, then every hour. The startup call isn't
+        // awaited, so log any failure here: this worker exits on
+        // unhandledRejection, and the recurring schedule will retry.
+        this.pruneOrphanedAttachments().catch((error) => {
+            logger.warn(`Initial attachment prune failed: ${error}`);
+        });
+        this.interval = setIntervalAsync(() => this.pruneOrphanedAttachments(), ONE_HOUR_MS);
+    }
+
+    async dispose() {
+        if (this.interval) {
+            clearInterval(this.interval);
+            this.interval = undefined;
+        }
+    }
+
+    private async pruneOrphanedAttachments() {
+        const cutoff = new Date(Date.now() - env.SOURCEBOT_CHAT_ATTACHMENT_ORPHAN_TTL_HOURS * ONE_HOUR_MS);
+
+        // Condemn orphans by flipping them to the DELETING tombstone. Each claim
+        // is atomic, so a PENDING blob committed by a concurrent send (its commit
+        // matches only PENDING rows) or a zero-link blob re-linked by a concurrent
+        // duplicate-chat loses the claim and is left intact.
+        //
+        // PENDING orphans: uploaded but the message was never sent.
+        const pendingClaimed = await this.db.attachment.updateMany({
+            where: {
+                status: AttachmentStatus.PENDING,
+                createdAt: { lt: cutoff },
+            },
+            data: { status: AttachmentStatus.DELETING },
+        });
+
+        // COMMITTED orphans: blobs left with zero links by an interrupted
+        // chat-delete sweep in the web app.
+        const committedClaimed = await this.db.attachment.updateMany({
+            where: {
+                status: AttachmentStatus.COMMITTED,
+                createdAt: { lt: cutoff },
+                chats: { none: {} },
+            },
+            data: { status: AttachmentStatus.DELETING },
+        });
+
+        // Reclaim every tombstone: delete bytes, then the row. This also picks up
+        // tombstones left behind by the web app's inline reclaim (or a crashed
+        // earlier tick) whose byte delete failed.
+        const reclaimed = await this.reclaimTombstonedAttachments();
+
+        if (pendingClaimed.count > 0 || committedClaimed.count > 0 || reclaimed > 0) {
+            logger.debug(
+                `Attachment prune: condemned ${pendingClaimed.count} PENDING + ` +
+                `${committedClaimed.count} COMMITTED orphan(s), reclaimed ${reclaimed} tombstone(s).`,
+            );
+        }
+    }
+
+    /**
+     * Deletes the bytes for every `DELETING` tombstone, then removes the row.
+     * The row (the only durable handle to the bytes) is removed only after its
+     * bytes are confirmed gone, so a failed byte delete leaves the tombstone in
+     * place to be retried on the next tick; bytes can never be orphaned by a
+     * transient storage error. Rows whose byte delete fails this run are
+     * excluded from subsequent batches so a persistent failure can't spin the
+     * loop.
+     *
+     * @returns the number of tombstones fully reclaimed (bytes + row).
+     */
+    private async reclaimTombstonedAttachments(): Promise<number> {
+        let totalReclaimed = 0;
+        const failedIds: string[] = [];
+
+        while (true) {
+            const batch = await this.db.attachment.findMany({
+                where: { status: AttachmentStatus.DELETING, id: { notIn: failedIds } },
+                select: { id: true, storageKey: true },
+                take: BATCH_SIZE,
+            });
+
+            if (batch.length === 0) {
+                break;
+            }
+
+            const settled = await Promise.allSettled(
+                batch.map((attachment) => this.storage.delete(attachment.storageKey)));
+
+            const reclaimedIds: string[] = [];
+            batch.forEach((attachment, index) => {
+                const outcome = settled[index];
+                if (outcome.status === 'fulfilled') {
+                    reclaimedIds.push(attachment.id);
+                } else {
+                    logger.warn(`Failed to delete bytes for tombstoned attachment ${attachment.id}, will retry next tick: ${outcome.reason}`);
+                    failedIds.push(attachment.id);
+                }
+            });
+
+            if (reclaimedIds.length > 0) {
+                const result = await this.db.attachment.deleteMany({
+                    where: { id: { in: reclaimedIds }, status: AttachmentStatus.DELETING },
+                });
+                totalReclaimed += result.count;
+            }
+
+            if (batch.length < BATCH_SIZE) {
+                break;
+            }
+        }
+
+        return totalReclaimed;
+    }
+}
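
The three-step protocol described in the class comment (flip to `DELETING`, delete the bytes, then delete the row) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `InMemoryStore`, `condemn`, `commit`, `reclaim`, `linkCount`, and `failingDeletes` are hypothetical names invented for illustration, the model is synchronous, and none of it is the Prisma or `StorageBackend` code from this commit.

```typescript
// Hypothetical in-memory model of the DELETING tombstone protocol.
// All names here are illustrative; they are not part of the Sourcebot codebase.
type Status = "PENDING" | "COMMITTED" | "DELETING";

interface Row {
    id: string;
    status: Status;
    createdAt: number; // epoch milliseconds
    linkCount: number; // stands in for the number of ChatAttachment rows
}

class InMemoryStore {
    rows: Row[] = [];
    storedBytes = new Set<string>();    // ids whose bytes still exist
    failingDeletes = new Set<string>(); // ids whose byte delete is simulated to fail

    add(row: Row): void {
        this.rows.push({ ...row });
        this.storedBytes.add(row.id);
    }

    // A commit matches only PENDING rows, so it loses to an earlier condemn.
    commit(id: string): boolean {
        const row = this.rows.find((r) => r.id === id && r.status === "PENDING");
        if (!row) return false;
        row.status = "COMMITTED";
        row.linkCount += 1;
        return true;
    }

    // Step 1: flip orphans older than the cutoff to the DELETING tombstone.
    condemn(cutoff: number): number {
        let condemned = 0;
        for (const row of this.rows) {
            const orphan = row.status === "PENDING" ||
                (row.status === "COMMITTED" && row.linkCount === 0);
            if (orphan && row.createdAt < cutoff) {
                row.status = "DELETING";
                condemned++;
            }
        }
        return condemned;
    }

    // Steps 2 and 3: delete the bytes, then drop the row only if that succeeded.
    reclaim(): number {
        const before = this.rows.length;
        this.rows = this.rows.filter((row) => {
            if (row.status !== "DELETING") return true;
            if (this.failingDeletes.has(row.id)) return true; // tombstone kept for retry
            this.storedBytes.delete(row.id);
            return false;
        });
        return before - this.rows.length;
    }
}
```

In this model a row whose byte delete fails stays in `DELETING` with its bytes still present, so a later `reclaim()` call can retry it once the simulated failure clears. That mirrors the retry property the comment describes, without claiming anything about how the real storage backend reports errors.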

packages/backend/src/index.ts

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,7 @@ import 'express-async-errors';
 import { existsSync } from 'fs';
 import { mkdir } from 'fs/promises';
 import { Api } from "./api.js";
+import { AttachmentPruner } from "./attachmentPruner.js";
 import { ConfigManager } from "./configManager.js";
 import { ConnectionManager } from './connectionManager.js';
 import { INDEX_CACHE_DIR, REPOS_CACHE_DIR, SHUTDOWN_SIGNALS } from './constants.js';
@@ -55,10 +56,12 @@ const accountPermissionSyncer = new AccountPermissionSyncer(prisma, settings, re
 const repoIndexManager = new RepoIndexManager(prisma, settings, redis, promClient);
 const configManager = new ConfigManager(prisma, connectionManager, env.CONFIG_PATH);
 const auditLogPruner = new AuditLogPruner(prisma);
+const attachmentPruner = new AttachmentPruner(prisma);
 
 connectionManager.startScheduler();
 await repoIndexManager.startScheduler();
 auditLogPruner.startScheduler();
+attachmentPruner.startScheduler();
 
 if (env.PERMISSION_SYNC_ENABLED === 'true' && !await hasEntitlement('permission-syncing')) {
     logger.warn('Permission syncing is not supported in current plan. Please contact team@sourcebot.dev for assistance.');
@@ -99,6 +102,7 @@ const listenToShutdownSignals = () => {
             await repoPermissionSyncer.dispose()
             await accountPermissionSyncer.dispose()
             await auditLogPruner.dispose()
+            await attachmentPruner.dispose()
             await configManager.dispose()
 
             await prisma.$disconnect();
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+-- CreateEnum
+CREATE TYPE "AttachmentStatus" AS ENUM ('PENDING', 'COMMITTED');
+
+-- CreateTable
+CREATE TABLE "Attachment" (
+    "id" TEXT NOT NULL,
+    "orgId" INTEGER NOT NULL,
+    "storageKey" TEXT NOT NULL,
+    "filename" TEXT NOT NULL,
+    "mediaType" TEXT NOT NULL,
+    "sizeBytes" INTEGER NOT NULL,
+    "checksum" TEXT NOT NULL,
+    "uploadedById" TEXT,
+    "status" "AttachmentStatus" NOT NULL DEFAULT 'PENDING',
+    "createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
+
+    CONSTRAINT "Attachment_pkey" PRIMARY KEY ("id")
+);
+
+-- CreateTable
+CREATE TABLE "ChatAttachment" (
+    "id" TEXT NOT NULL,
+    "chatId" TEXT NOT NULL,
+    "attachmentId" TEXT NOT NULL,
+    "createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
+
+    CONSTRAINT "ChatAttachment_pkey" PRIMARY KEY ("id")
+);
+
+-- CreateIndex
+CREATE INDEX "Attachment_status_createdAt_idx" ON "Attachment"("status", "createdAt");
+
+-- CreateIndex
+CREATE INDEX "ChatAttachment_attachmentId_idx" ON "ChatAttachment"("attachmentId");
+
+-- CreateIndex
+CREATE UNIQUE INDEX "ChatAttachment_chatId_attachmentId_key" ON "ChatAttachment"("chatId", "attachmentId");
+
+-- AddForeignKey
+ALTER TABLE "Attachment" ADD CONSTRAINT "Attachment_orgId_fkey" FOREIGN KEY ("orgId") REFERENCES "Org"("id") ON DELETE CASCADE ON UPDATE CASCADE;
+
+-- AddForeignKey
+ALTER TABLE "Attachment" ADD CONSTRAINT "Attachment_uploadedById_fkey" FOREIGN KEY ("uploadedById") REFERENCES "User"("id") ON DELETE SET NULL ON UPDATE CASCADE;
+
+-- AddForeignKey
+ALTER TABLE "ChatAttachment" ADD CONSTRAINT "ChatAttachment_chatId_fkey" FOREIGN KEY ("chatId") REFERENCES "Chat"("id") ON DELETE CASCADE ON UPDATE CASCADE;
+
+-- AddForeignKey
+ALTER TABLE "ChatAttachment" ADD CONSTRAINT "ChatAttachment_attachmentId_fkey" FOREIGN KEY ("attachmentId") REFERENCES "Attachment"("id") ON DELETE CASCADE ON UPDATE CASCADE;
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+-- AlterEnum
+ALTER TYPE "AttachmentStatus" ADD VALUE 'DELETING';

packages/db/prisma/schema.prisma

Lines changed: 79 additions & 0 deletions
@@ -24,6 +24,20 @@ enum ChatVisibility {
   PUBLIC
 }
 
+/// Lifecycle status of an uploaded attachment blob.
+/// PENDING: uploaded but not yet linked to a chat (orphan until a message
+/// referencing it is sent). COMMITTED: linked to at least one chat.
+/// DELETING: condemned tombstone. The row is kept (as the only durable handle
+/// to the bytes) until the stored bytes are confirmed deleted, at which point
+/// the row is removed. A failed byte delete leaves the row DELETING for the
+/// attachment pruner's reclaim sweep to retry, so a transient storage error
+/// can never orphan bytes.
+enum AttachmentStatus {
+  PENDING
+  COMMITTED
+  DELETING
+}
+
 /// @note: The @map annotation is required to maintain backwards compatibility
 /// with the existing database.
 /// @note: In the generated client, these mapped values will be in pascalCase.
@@ -272,6 +286,7 @@ model Org {
   connections Connection[]
   repos Repo[]
   apiKeys ApiKey[]
+  attachments Attachment[]
   isOnboarded Boolean @default(false)
   imageUrl String?
 
@@ -456,6 +471,7 @@ model User {
   chats Chat[]
   sharedChats ChatAccess[]
   repoVisits RepoVisit[]
+  uploadedAttachments Attachment[]
 
   oauthTokens OAuthToken[]
   oauthAuthCodes OAuthAuthorizationCode[]
@@ -608,6 +624,69 @@ model Chat {
   messages Json // This is a JSON array of `Message` types from @ai-sdk/ui-utils.
 
   sharedWith ChatAccess[]
+
+  attachments ChatAttachment[]
+}
+
+/// A user-uploaded binary attachment blob (e.g. an image). The bytes live in
+/// the configured StorageBackend (keyed by `storageKey`), never in the DB.
+/// Attachments are NOT chat-bound: they are uploaded before any chat
+/// association exists, and linked to chats via `ChatAttachment`. Permissions
+/// are derived entirely from the linked chat(s); there are no independent ACLs.
+model Attachment {
+  id String @id @default(cuid())
+
+  org Org @relation(fields: [orgId], references: [id], onDelete: Cascade)
+  orgId Int
+
+  /// Opaque key the StorageBackend uses to locate the bytes.
+  storageKey String
+
+  /// Original (sanitized) filename supplied by the uploader.
+  filename String
+
+  /// Final media type of the stored bytes (validated by decoding at upload).
+  mediaType String
+
+  /// Size of the stored bytes.
+  sizeBytes Int
+
+  /// Hex SHA-256 of the stored bytes (integrity / debugging; not used for dedup).
+  checksum String
+
+  /// The user who uploaded this blob. Uploads require authentication, so this
+  /// is set at creation (anonymous users cannot upload binary attachments). It
+  /// is nulled if the uploader is later deleted, so committed attachments
+  /// survive on the chats they're linked to.
+  uploadedBy User? @relation(fields: [uploadedById], references: [id], onDelete: SetNull)
+  uploadedById String?
+
+  status AttachmentStatus @default(PENDING)
+
+  createdAt DateTime @default(now())
+
+  chats ChatAttachment[]
+
+  @@index([status, createdAt])
+}
+
+/// Join table linking an `Attachment` blob to a `Chat`. This is the linker
+/// that makes chat duplication metadata-only (no byte copy) and keeps
+/// attachment access purely chat-derived. Deleting a chat cascades these rows;
+/// a separate sweep deletes `Attachment`s left with zero links (and their bytes).
+model ChatAttachment {
+  id String @id @default(cuid())
+
+  chat Chat @relation(fields: [chatId], references: [id], onDelete: Cascade)
+  chatId String
+
+  attachment Attachment @relation(fields: [attachmentId], references: [id], onDelete: Cascade)
+  attachmentId String
+
+  createdAt DateTime @default(now())
+
+  @@unique([chatId, attachmentId])
+  @@index([attachmentId])
 }
 
 /// Represents a user's access to a chat that has been shared with them.
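
The `ChatAttachment` comment says the join table makes chat duplication metadata-only and leaves zero-link attachments for a separate sweep. A minimal sketch of that bookkeeping, using plain arrays rather than the Prisma client, might look as follows. `ChatAttachmentLink`, `duplicateChatLinks`, `deleteChatLinks`, and `zeroLinkAttachments` are hypothetical helpers written for illustration and do not appear in the commit.

```typescript
// Hypothetical model of ChatAttachment link rows; this is not the Prisma client API.
interface ChatAttachmentLink {
    chatId: string;
    attachmentId: string;
}

// Duplicating a chat adds link rows only; no attachment bytes are copied.
function duplicateChatLinks(
    links: ChatAttachmentLink[],
    sourceChatId: string,
    newChatId: string,
): ChatAttachmentLink[] {
    const copies = links
        .filter((link) => link.chatId === sourceChatId)
        .map((link) => ({ chatId: newChatId, attachmentId: link.attachmentId }));
    return links.concat(copies);
}

// Deleting a chat removes its link rows (the ON DELETE CASCADE behaviour).
function deleteChatLinks(links: ChatAttachmentLink[], chatId: string): ChatAttachmentLink[] {
    return links.filter((link) => link.chatId !== chatId);
}

// Attachments with no remaining links are the candidates for the separate sweep.
function zeroLinkAttachments(attachmentIds: string[], links: ChatAttachmentLink[]): string[] {
    return attachmentIds.filter(
        (id) => !links.some((link) => link.attachmentId === id),
    );
}
```

Under this model an attachment shared by an original chat and its duplicate only becomes a zero-link candidate after both chats are deleted, which is why the blob itself cannot simply cascade with a single chat.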

packages/shared/src/env.server.ts

Lines changed: 17 additions & 0 deletions
@@ -321,6 +321,23 @@ const options = {
         SOURCEBOT_CHAT_PROMPT_CACHE_BREAK_DETECTION_ENABLED: booleanSchema.default('false'),
         SOURCEBOT_MCP_TOOL_CALL_TIMEOUT_MS: numberSchema.int().positive().max(maxTimerDelayMs).default(60000),
 
+        /**
+         * Maximum size (in bytes) of a single image attachment uploaded to the
+         * Ask chat. Enforced server-side at upload time. Distinct from the
+         * inline-text cap (which lives as a web-package constant).
+         * @default 10 MiB
+         */
+        SOURCEBOT_CHAT_ATTACHMENT_MAX_IMAGE_BYTES: numberSchema.int().positive().default(10 * 1024 * 1024),
+
+        /**
+         * How long (in hours) an uploaded-but-unlinked (PENDING) attachment
+         * blob is retained before the orphan sweep deletes it and its bytes.
+         * Covers "select a file then never send" abandonment. Set to 0 to
+         * disable the orphan sweep entirely.
+         * @default 24 hours
+         */
+        SOURCEBOT_CHAT_ATTACHMENT_ORPHAN_TTL_HOURS: numberSchema.int().nonnegative().default(24),
+
         DEBUG_WRITE_CHAT_MESSAGES_TO_FILE: booleanSchema.default('false'),
         DEBUG_ENABLE_REACT_SCAN: booleanSchema.default('false'),
         DEBUG_ENABLE_REACT_GRAB: booleanSchema.default('false'),
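
To make the TTL semantics concrete, the sketch below shows how a nonnegative-integer setting with a default of 24 could translate into the prune cutoff, with 0 meaning the sweep is disabled. It is an illustration only: `parseOrphanTtlHours` and `orphanCutoff` are hypothetical helpers, and the hand-written parsing stands in for the project's `numberSchema`, whose exact coercion rules are not shown in this diff.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;
const DEFAULT_TTL_HOURS = 24;

// Hypothetical stand-in for the schema: unset falls back to the default,
// anything that is not a nonnegative integer is rejected.
function parseOrphanTtlHours(raw: string | undefined): number {
    if (raw === undefined) return DEFAULT_TTL_HOURS;
    const value = Number(raw);
    if (!Number.isInteger(value) || value < 0) {
        throw new Error(`TTL must be a nonnegative integer, got "${raw}"`);
    }
    return value;
}

// Returns the creation-time cutoff for orphans, or null when the sweep is disabled.
function orphanCutoff(ttlHours: number, nowMs: number): Date | null {
    if (ttlHours <= 0) return null;
    return new Date(nowMs - ttlHours * ONE_HOUR_MS);
}
```

With the default of 24, a tick running at `nowMs` treats rows created before `nowMs - 86_400_000` as eligible. Because the pruner runs hourly, an orphan may outlive its TTL by up to roughly one interval before it is condemned.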
