
Allow custom dataset max rows #128

Merged
simantak-dabhade merged 7 commits into main from pranav/custom-maxrows on Jun 5, 2026

@pranavjana (Collaborator) commented Jun 5, 2026

Summary

  • add a persisted max row count for datasets
  • let users set max rows during dataset creation and edit it later from dataset settings
  • enforce the saved limit in populate prompts, subagent checks, and Convex row insertion

Verification

  • backend: npm run build
  • frontend: tsc --noEmit

Note: Convex codegen could not run locally without CONVEX_DEPLOYMENT set.

@coderabbitai (Bot, Contributor) commented Jun 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e9e15c2-5e6c-4d84-981d-bbfc33cd93ab

📥 Commits

Reviewing files that changed from the base of the PR and between 094bb7d and d6779aa.

📒 Files selected for processing (5)
  • backend/src/index.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/pipeline/populate.ts
  • frontend/app/dataset/[id]/page.tsx
  • frontend/convex/datasets.ts
💤 Files with no reviewable changes (1)
  • backend/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • backend/src/pipeline/populate.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • frontend/convex/datasets.ts
  • frontend/app/dataset/[id]/page.tsx

📝 Walkthrough


This PR adds a configurable per-dataset maxRowCount (default 100, capped by FREE_TIER_MONTHLY_QUOTA = 2500). It extends the schemas and Convex mutations to store and validate maxRowCount, adds quota-aware UI for selecting and editing it in NewDataset and the Dataset settings, includes maxRowCount in frontend populate calls, and threads the value through /populate, scheduled refreshes, and the Mastra workflows and agents. The run_subagent tool and the dataset row-insert logic enforce the cap and signal ROW_LIMIT_REACHED once it is hit. /populate also re-queries the dataset after the claim and returns 404 if the dataset is missing.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Backend as /populate
  participant Workflow as mastra/workflows/populate
  participant Agent as buildPopulateAgent
  participant Subagent as run_subagent (buildSubagentTool)
  participant Convex as Convex DB

  Client->>Backend: POST /populate(maxRowCount, columns)
  Backend->>Convex: claim populate & re-query dataset
  Backend->>Workflow: start workflow with effective maxRowCount
  Workflow->>Agent: pass inputData.maxRowCount
  Agent->>Subagent: construct/run with maxRowCount
  Subagent->>Convex: countByDataset
  Convex-->>Subagent: rowCount
  alt rowCount >= maxRowCount
    Subagent-->>Agent: ROW_LIMIT_REACHED
    Agent-->>Workflow: stop generation due to ROW_LIMIT_REACHED
  else rowCount < maxRowCount
    Subagent->>Convex: insert rows (subject to updated cap)
  end
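The alt/else branch in the diagram above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: `FakeRowStore`, `insertCapped`, `runSubagent`, and the `SubagentResult` shape are hypothetical names, and an in-memory map stands in for Convex. Only `countByDataset`, `maxRowCount`, and the `ROW_LIMIT_REACHED` sentinel come from the walkthrough.

```typescript
// Sentinel named in the walkthrough; everything else here is a hypothetical stand-in.
const ROW_LIMIT_REACHED = "ROW_LIMIT_REACHED" as const;

type Row = Record<string, string>;

// In-memory substitute for the Convex row table (assumption for this sketch).
class FakeRowStore {
  private rows = new Map<string, Row[]>();

  countByDataset(datasetId: string): number {
    return this.rows.get(datasetId)?.length ?? 0;
  }

  // Accepts only as many rows as still fit under the cap; returns how many were written.
  insertCapped(datasetId: string, incoming: Row[], maxRowCount: number): number {
    const existing = this.rows.get(datasetId) ?? [];
    const room = Math.max(0, maxRowCount - existing.length);
    const accepted = incoming.slice(0, room);
    this.rows.set(datasetId, [...existing, ...accepted]);
    return accepted.length;
  }
}

type SubagentResult =
  | { status: "ok"; inserted: number }
  | { status: typeof ROW_LIMIT_REACHED };

// Check the count first, then insert; the insert re-applies the cap as a second guard.
function runSubagent(
  store: FakeRowStore,
  datasetId: string,
  candidateRows: Row[],
  maxRowCount: number,
): SubagentResult {
  if (store.countByDataset(datasetId) >= maxRowCount) {
    return { status: ROW_LIMIT_REACHED };
  }
  const inserted = store.insertCapped(datasetId, candidateRows, maxRowCount);
  return { status: "ok", inserted };
}
```

Enforcing the cap in both the pre-check and the insert mirrors the summary's point that the limit is applied in the subagent checks and again at row insertion, so a subagent that starts just under the cap cannot overshoot it.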

Possibly related PRs

  • tinyfish-io/bigset#111: Addresses the same hardcoded 100-row cap by threading maxRowCount through populate orchestration.
  • tinyfish-io/bigset#107: Modifies populate agent/subagent plumbing and instrumentation overlapping the agent/tool signatures changed here.
  • tinyfish-io/bigset#81: Refactors the investigate subagent architecture used by the populate agent and subagent tooling.

Suggested reviewers

  • simantak-dabhade
  • giaphutran12
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Allow custom dataset max rows' clearly and concisely summarizes the main feature added in the changeset: enabling users to configure a custom maximum row count per dataset. |
| Description check | ✅ Passed | The description is well-organized and directly related to the changeset, covering the persisted max row count addition, user-facing configuration options, and enforcement points across the codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/mastra/tools/investigate-tool.ts (1)

130-130: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update investigate subagent maxSteps to match the 25-step guideline

backend/src/mastra/tools/investigate-tool.ts spawns the investigate subagent with agent.generate(prompt, { maxSteps: 10 }), but the guidelines require maxSteps: 25 for fresh investigate subagents. Update this to maxSteps: 25 (or revise the guideline if 10 is an intentional, documented change).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/tools/investigate-tool.ts` at line 130, The investigate
subagent is being spawned with agent.generate(prompt, { maxSteps: 10 }) which
violates the guideline requiring fresh investigate subagents use maxSteps: 25;
locate the agent.generate call in investigate-tool (the line that assigns const
result = await agent.generate(prompt, { maxSteps: 10 })) and change the options
to { maxSteps: 25 } (or update any surrounding comments to reflect an
intentional alternative if 10 is desired).
🧹 Nitpick comments (2)
backend/src/pipeline/populate.ts (1)

15-15: ⚡ Quick win

Consider adding an upper bound for defense-in-depth.

The frontend validates maxRowCount <= FREE_TIER_MONTHLY_QUOTA, but the backend schema has no upper bound. While data flows through validated frontend mutations in the normal path, adding .max() here would provide defense-in-depth against bugs or direct backend calls.

🛡️ Add upper bound validation
-  maxRowCount: z.number().int().min(1).default(100),
+  maxRowCount: z.number().int().min(1).max(10000).default(100),

Note: Replace 10000 with the actual FREE_TIER_MONTHLY_QUOTA value, or import it from a shared constants location if available.
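For readers who want to see the intended bounds without the zod dependency, here is a minimal hand-rolled sketch of the same rule: integer, at least 1, at most the quota, defaulting to 100 when omitted. It is an assumption-laden illustration, not the backend's schema. `parseMaxRowCount` is a hypothetical helper, and the 2500 value is taken from the walkthrough's `FREE_TIER_MONTHLY_QUOTA = 2500` rather than the 10000 placeholder in the suggested diff.

```typescript
// Values as stated in the walkthrough; real code would import them from a shared constants module.
const FREE_TIER_MONTHLY_QUOTA = 2500;
const DEFAULT_MAX_ROW_COUNT = 100;

// Hypothetical helper: applies the default when the field is omitted,
// and rejects non-integers and values outside [1, FREE_TIER_MONTHLY_QUOTA].
function parseMaxRowCount(input: unknown): number {
  if (input === undefined) {
    return DEFAULT_MAX_ROW_COUNT;
  }
  if (typeof input !== "number" || !Number.isInteger(input)) {
    throw new RangeError("maxRowCount must be an integer");
  }
  if (input < 1 || input > FREE_TIER_MONTHLY_QUOTA) {
    throw new RangeError(`maxRowCount must be between 1 and ${FREE_TIER_MONTHLY_QUOTA}`);
  }
  return input;
}
```

With the upper bound in place, a direct backend call that bypasses the frontend mutation is rejected at the boundary instead of being threaded through the workflow.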

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/pipeline/populate.ts` at line 15, The schema property maxRowCount
in the populate pipeline (defined as maxRowCount:
z.number().int().min(1).default(100)) needs an upper bound for defense-in-depth;
update that zod schema to add .max(FREE_TIER_MONTHLY_QUOTA) (or a literal like
.max(10000) if the constant isn't available) and, if possible, import
FREE_TIER_MONTHLY_QUOTA from the shared constants module so the backend enforces
the same cap as the frontend.
frontend/convex/datasets.ts (1)

65-65: 💤 Low value

Consider consolidating the default row count constant.

DEFAULT_MAX_ROW_COUNT here (line 65) and DEFAULT_MAX_DATASET_ROWS in datasetRows.ts (line 8) both define the same value (100) with different names. This duplication could lead to drift if one is updated without the other.

♻️ Consolidation approach

Define the constant in one location (e.g., datasets.ts) and import it into datasetRows.ts:

// In datasetRows.ts
import { DEFAULT_MAX_ROW_COUNT } from "./datasets.js";
// Remove: const DEFAULT_MAX_DATASET_ROWS = 100;
// Use DEFAULT_MAX_ROW_COUNT throughout
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/datasets.ts` at line 65, Consolidate the duplicate default
row constant by keeping a single exported constant (e.g., DEFAULT_MAX_ROW_COUNT)
and remove the duplicate DEFAULT_MAX_DATASET_ROWS; update the other module to
import and use DEFAULT_MAX_ROW_COUNT instead of defining its own value.
Specifically, export DEFAULT_MAX_ROW_COUNT from the module where it currently
exists, remove the local declaration of DEFAULT_MAX_DATASET_ROWS, and replace
usages of DEFAULT_MAX_DATASET_ROWS with the imported DEFAULT_MAX_ROW_COUNT
(ensure import statements are added and no other references remain).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/index.ts`:
- Around line 696-702: The code calls
setDatasetPopulateStatus(parsed.data.datasetId, "failed", ...) even when the
dataset lookup (const dataset = await
convex.query(internal.datasets.getInternal, { id: parsed.data.datasetId }))
returns null, which can throw because the dataset doesn't exist; remove that
status update in the not-found branch and simply return reply.code(404).send({
error: "Dataset not found" }) (i.e., delete or skip the setDatasetPopulateStatus
call in the block that handles !dataset so only the 404 response is sent).

In `@frontend/app/dataset/[id]/page.tsx`:
- Around line 222-227: The catch block in the max-rows mutation only logs errors
(console.error and captureException) but doesn't surface them to the user; add a
local React state (e.g., maxRowCountError) in the component that you set in the
catch of the update handler (the block that currently logs "[max rows] failed"
and calls captureException with operation "dataset_max_row_count_update" and
datasetId dataset._id), and clear that state on successful mutation; then render
maxRowCountError text in the settings dropdown help text area (around the
existing settings UI) so backend validation/quota errors are visible to the
user.
- Line 214: The silent early return on the quota check (the line with "if (usage
&& maxRowCount > usage.remaining) return;") leaves users without feedback when
save is blocked; update the Save button click handler that contains this check
to surface a clear error/toast/inline message (e.g., set an error state or call
the existing showToast/snackbar) explaining the quota limit, and keep the early
return after showing that message so the save is aborted but the user sees why;
reference the same symbols usage, maxRowCount, and usage.remaining so the change
is applied where the check currently occurs.

In `@frontend/convex/datasets.ts`:
- Around line 427-440: The quota check in updateMaxRowCount uses
args.maxRowCount directly but should only require quota for additional rows;
compute additionalRows = Math.max(0, args.maxRowCount - (dataset.rowCount || 0))
after loading the dataset (loadOwnedDataset) and pass additionalRows to
requireQuotaRemaining(ctx, dataset.ownerId, additionalRows) instead of
args.maxRowCount, then proceed to validateMaxRowCount and ctx.db.patch as
before; this ensures lowering the cap requires no quota and raising it requires
only the incremental quota.

---

Outside diff comments:
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Line 130: The investigate subagent is being spawned with
agent.generate(prompt, { maxSteps: 10 }) which violates the guideline requiring
fresh investigate subagents use maxSteps: 25; locate the agent.generate call in
investigate-tool (the line that assigns const result = await
agent.generate(prompt, { maxSteps: 10 })) and change the options to { maxSteps:
25 } (or update any surrounding comments to reflect an intentional alternative
if 10 is desired).

---

Nitpick comments:
In `@backend/src/pipeline/populate.ts`:
- Line 15: The schema property maxRowCount in the populate pipeline (defined as
maxRowCount: z.number().int().min(1).default(100)) needs an upper bound for
defense-in-depth; update that zod schema to add .max(FREE_TIER_MONTHLY_QUOTA)
(or a literal like .max(10000) if the constant isn't available) and, if
possible, import FREE_TIER_MONTHLY_QUOTA from the shared constants module so the
backend enforces the same cap as the frontend.

In `@frontend/convex/datasets.ts`:
- Line 65: Consolidate the duplicate default row constant by keeping a single
exported constant (e.g., DEFAULT_MAX_ROW_COUNT) and remove the duplicate
DEFAULT_MAX_DATASET_ROWS; update the other module to import and use
DEFAULT_MAX_ROW_COUNT instead of defining its own value. Specifically, export
DEFAULT_MAX_ROW_COUNT from the module where it currently exists, remove the
local declaration of DEFAULT_MAX_DATASET_ROWS, and replace usages of
DEFAULT_MAX_DATASET_ROWS with the imported DEFAULT_MAX_ROW_COUNT (ensure import
statements are added and no other references remain).
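
The inline comment on updateMaxRowCount above asks for the quota check to cover only the additional rows. A minimal sketch of that arithmetic, assuming the dataset exposes a current row count and the caller knows the remaining quota (`additionalRowsNeeded` and `canUpdateMaxRowCount` are hypothetical helper names, not functions from the PR):

```typescript
// Rows that still need quota when the cap changes; lowering the cap needs none.
function additionalRowsNeeded(
  requestedMaxRowCount: number,
  currentRowCount: number | undefined,
): number {
  return Math.max(0, requestedMaxRowCount - (currentRowCount ?? 0));
}

// Hypothetical guard: the update is allowed when the incremental rows fit in the remaining quota.
function canUpdateMaxRowCount(
  requestedMaxRowCount: number,
  currentRowCount: number | undefined,
  quotaRemaining: number,
): boolean {
  return additionalRowsNeeded(requestedMaxRowCount, currentRowCount) <= quotaRemaining;
}
```

For example, a dataset holding 80 rows that raises its cap to 200 needs quota for 120 rows, while lowering the cap to 50 needs none.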

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c411e1c-525e-449c-a1ab-8465c85c0c8d

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2e7c3 and ef6db16.

📒 Files selected for processing (11)
  • backend/src/index.ts
  • backend/src/mastra/agents/populate.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/mastra/workflows/populate.ts
  • backend/src/pipeline/populate.ts
  • frontend/app/dataset/[id]/page.tsx
  • frontend/app/dataset/new/page.tsx
  • frontend/convex/datasetRows.ts
  • frontend/convex/datasets.ts
  • frontend/convex/schema.ts
  • frontend/lib/backend.ts

Comment thread backend/src/index.ts
Comment thread frontend/app/dataset/[id]/page.tsx Outdated
Comment thread frontend/app/dataset/[id]/page.tsx
Comment thread frontend/convex/datasets.ts
# Conflicts:
#	backend/src/mastra/agents/populate.ts
#	backend/src/mastra/tools/investigate-tool.ts
#	backend/src/mastra/workflows/populate.ts
#	frontend/app/dataset/[id]/page.tsx
#	frontend/convex/datasetRows.ts

@simantak-dabhade (Contributor) left a comment

LGTM

@simantak-dabhade simantak-dabhade merged commit 0567771 into main Jun 5, 2026
3 checks passed
@simantak-dabhade simantak-dabhade deleted the pranav/custom-maxrows branch June 5, 2026 17:37
balasiddarthan22 pushed a commit to balasiddarthan22/bigset that referenced this pull request Jun 6, 2026
* Cap dataset population at 100 rows

* Handle row cap count failures in subagent tool

* Mention row limit sentinel in populate prompt

* Allow custom dataset max rows

* Allow editing dataset max rows

* Address max rows review feedback