Allow custom dataset max rows by pranavjana · Pull Request #128 · tinyfish-io/bigset

pranavjana · 2026-06-05T03:31:19Z

Summary

add a persisted max row count for datasets
let users set max rows during dataset creation and edit it later from dataset settings
enforce the saved limit in populate prompts, subagent checks, and Convex row insertion

Verification

backend: npm run build
frontend: tsc --noEmit

Note: Convex codegen could not run locally without CONVEX_DEPLOYMENT set.

coderabbitai · 2026-06-05T03:31:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e9e15c2-5e6c-4d84-981d-bbfc33cd93ab

📥 Commits

Reviewing files that changed from the base of the PR and between 094bb7d and d6779aa.

📒 Files selected for processing (5)

backend/src/index.ts
backend/src/mastra/tools/investigate-tool.ts
backend/src/pipeline/populate.ts
frontend/app/dataset/[id]/page.tsx
frontend/convex/datasets.ts

💤 Files with no reviewable changes (1)

backend/src/index.ts

🚧 Files skipped from review as they are similar to previous changes (4)

backend/src/pipeline/populate.ts
backend/src/mastra/tools/investigate-tool.ts
frontend/convex/datasets.ts
frontend/app/dataset/[id]/page.tsx

📝 Walkthrough

Walkthrough

This PR adds a configurable per-dataset maxRowCount (default 100, capped by FREE_TIER_MONTHLY_QUOTA = 2500). It extends schemas and Convex mutations to store/validate maxRowCount, adds UI for selecting/editing it in NewDataset and Dataset settings (quota-aware), includes maxRowCount in frontend populate calls, and threads the value through /populate, scheduled refreshes, Mastra workflows and agents. The run_subagent tool and dataset row-insert logic enforce the cap and return/signal ROW_LIMIT_REACHED when reached; /populate re-queries the dataset after claim and returns 404 if missing.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Backend as /populate
  participant Workflow as mastra/workflows/populate
  participant Agent as buildPopulateAgent
  participant Subagent as run_subagent (buildSubagentTool)
  participant Convex as Convex DB

  Client->>Backend: POST /populate(maxRowCount, columns)
  Backend->>Convex: claim populate & re-query dataset
  Backend->>Workflow: start workflow with effective maxRowCount
  Workflow->>Agent: pass inputData.maxRowCount
  Agent->>Subagent: construct/run with maxRowCount
  Subagent->>Convex: countByDataset
  Convex-->>Subagent: rowCount
  alt rowCount >= maxRowCount
    Subagent-->>Agent: ROW_LIMIT_REACHED
    Agent-->>Workflow: stop generation due to ROW_LIMIT_REACHED
  else rowCount < maxRowCount
    Subagent->>Convex: insert rows (subject to updated cap)
  end

Possibly related PRs

tinyfish-io/bigset#111: Addresses the same hardcoded 100-row cap by threading maxRowCount through populate orchestration.
tinyfish-io/bigset#107: Modifies populate agent/subagent plumbing and instrumentation overlapping the agent/tool signatures changed here.
tinyfish-io/bigset#81: Refactors the investigate subagent architecture used by the populate agent and subagent tooling.

Suggested reviewers

simantak-dabhade
giaphutran12

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Allow custom dataset max rows' clearly and concisely summarizes the main feature added in the changeset—enabling users to configure a custom maximum row count per dataset.
Description check	✅ Passed	The description is well-organized and directly related to the changeset, covering the persisted max row count addition, user-facing configuration options, and enforcement points across the codebase.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pranav/custom-maxrows

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/src/mastra/tools/investigate-tool.ts (1)
130-130: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update investigate subagent maxSteps to match the 25-step guideline

backend/src/mastra/tools/investigate-tool.ts spawns the investigate subagent with agent.generate(prompt, { maxSteps: 10 }), but the guidelines require maxSteps: 25 for fresh investigate subagents. Update this to maxSteps: 25 (or revise the guideline if 10 is an intentional, documented change).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/tools/investigate-tool.ts` at line 130, The investigate
subagent is being spawned with agent.generate(prompt, { maxSteps: 10 }) which
violates the guideline requiring fresh investigate subagents use maxSteps: 25;
locate the agent.generate call in investigate-tool (the line that assigns const
result = await agent.generate(prompt, { maxSteps: 10 })) and change the options
to { maxSteps: 25 } (or update any surrounding comments to reflect an
intentional alternative if 10 is desired).

🧹 Nitpick comments (2)

backend/src/pipeline/populate.ts (1)
15-15: ⚡ Quick win

Consider adding an upper bound for defense-in-depth.

The frontend validates maxRowCount <= FREE_TIER_MONTHLY_QUOTA, but the backend schema has no upper bound. While data flows through validated frontend mutations in the normal path, adding .max() here would provide defense-in-depth against bugs or direct backend calls.
🛡️ Add upper bound validation
-  maxRowCount: z.number().int().min(1).default(100),
+  maxRowCount: z.number().int().min(1).max(10000).default(100),
Note: Replace 10000 with the actual FREE_TIER_MONTHLY_QUOTA value, or import it from a shared constants location if available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/pipeline/populate.ts` at line 15, The schema property maxRowCount
in the populate pipeline (defined as maxRowCount:
z.number().int().min(1).default(100)) needs an upper bound for defense-in-depth;
update that zod schema to add .max(FREE_TIER_MONTHLY_QUOTA) (or a literal like
.max(10000) if the constant isn't available) and, if possible, import
FREE_TIER_MONTHLY_QUOTA from the shared constants module so the backend enforces
the same cap as the frontend.
frontend/convex/datasets.ts (1)
65-65: 💤 Low value

Consider consolidating the default row count constant.

DEFAULT_MAX_ROW_COUNT here (line 65) and DEFAULT_MAX_DATASET_ROWS in datasetRows.ts (line 8) both define the same value (100) with different names. This duplication could lead to drift if one is updated without the other.
♻️ Consolidation approach

Define the constant in one location (e.g., datasets.ts) and import it into datasetRows.ts:
// In datasetRows.ts
import { DEFAULT_MAX_ROW_COUNT } from "./datasets.js";
// Remove: const DEFAULT_MAX_DATASET_ROWS = 100;
// Use DEFAULT_MAX_ROW_COUNT throughout
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/datasets.ts` at line 65, Consolidate the duplicate default
row constant by keeping a single exported constant (e.g., DEFAULT_MAX_ROW_COUNT)
and remove the duplicate DEFAULT_MAX_DATASET_ROWS; update the other module to
import and use DEFAULT_MAX_ROW_COUNT instead of defining its own value.
Specifically, export DEFAULT_MAX_ROW_COUNT from the module where it currently
exists, remove the local declaration of DEFAULT_MAX_DATASET_ROWS, and replace
usages of DEFAULT_MAX_DATASET_ROWS with the imported DEFAULT_MAX_ROW_COUNT
(ensure import statements are added and no other references remain).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/index.ts`:
- Around line 696-702: The code calls
setDatasetPopulateStatus(parsed.data.datasetId, "failed", ...) even when the
dataset lookup (const dataset = await
convex.query(internal.datasets.getInternal, { id: parsed.data.datasetId }))
returns null, which can throw because the dataset doesn't exist; remove that
status update in the not-found branch and simply return reply.code(404).send({
error: "Dataset not found" }) (i.e., delete or skip the setDatasetPopulateStatus
call in the block that handles !dataset so only the 404 response is sent).

In `@frontend/app/dataset/`[id]/page.tsx:
- Around line 222-227: The catch block in the max-rows mutation only logs errors
(console.error and captureException) but doesn't surface them to the user; add a
local React state (e.g., maxRowCountError) in the component that you set in the
catch of the update handler (the block that currently logs "[max rows] failed"
and calls captureException with operation "dataset_max_row_count_update" and
datasetId dataset._id), and clear that state on successful mutation; then render
maxRowCountError text in the settings dropdown help text area (around the
existing settings UI) so backend validation/quota errors are visible to the
user.
- Line 214: The silent early return on the quota check (the line with "if (usage
&& maxRowCount > usage.remaining) return;") leaves users without feedback when
save is blocked; update the Save button click handler that contains this check
to surface a clear error/toast/inline message (e.g., set an error state or call
the existing showToast/snackbar) explaining the quota limit, and keep the early
return after showing that message so the save is aborted but the user sees why;
reference the same symbols usage, maxRowCount, and usage.remaining so the change
is applied where the check currently occurs.

In `@frontend/convex/datasets.ts`:
- Around line 427-440: The quota check in updateMaxRowCount uses
args.maxRowCount directly but should only require quota for additional rows;
compute additionalRows = Math.max(0, args.maxRowCount - (dataset.rowCount || 0))
after loading the dataset (loadOwnedDataset) and pass additionalRows to
requireQuotaRemaining(ctx, dataset.ownerId, additionalRows) instead of
args.maxRowCount, then proceed to validateMaxRowCount and ctx.db.patch as
before; this ensures lowering the cap requires no quota and raising it requires
only the incremental quota.

---

Outside diff comments:
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Line 130: The investigate subagent is being spawned with
agent.generate(prompt, { maxSteps: 10 }) which violates the guideline requiring
fresh investigate subagents use maxSteps: 25; locate the agent.generate call in
investigate-tool (the line that assigns const result = await
agent.generate(prompt, { maxSteps: 10 })) and change the options to { maxSteps:
25 } (or update any surrounding comments to reflect an intentional alternative
if 10 is desired).

---

Nitpick comments:
In `@backend/src/pipeline/populate.ts`:
- Line 15: The schema property maxRowCount in the populate pipeline (defined as
maxRowCount: z.number().int().min(1).default(100)) needs an upper bound for
defense-in-depth; update that zod schema to add .max(FREE_TIER_MONTHLY_QUOTA)
(or a literal like .max(10000) if the constant isn't available) and, if
possible, import FREE_TIER_MONTHLY_QUOTA from the shared constants module so the
backend enforces the same cap as the frontend.

In `@frontend/convex/datasets.ts`:
- Line 65: Consolidate the duplicate default row constant by keeping a single
exported constant (e.g., DEFAULT_MAX_ROW_COUNT) and remove the duplicate
DEFAULT_MAX_DATASET_ROWS; update the other module to import and use
DEFAULT_MAX_ROW_COUNT instead of defining its own value. Specifically, export
DEFAULT_MAX_ROW_COUNT from the module where it currently exists, remove the
local declaration of DEFAULT_MAX_DATASET_ROWS, and replace usages of
DEFAULT_MAX_DATASET_ROWS with the imported DEFAULT_MAX_ROW_COUNT (ensure import
statements are added and no other references remain).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c411e1c-525e-449c-a1ab-8465c85c0c8d

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2e7c3 and ef6db16.

📒 Files selected for processing (11)

backend/src/index.ts
backend/src/mastra/agents/populate.ts
backend/src/mastra/tools/investigate-tool.ts
backend/src/mastra/workflows/populate.ts
backend/src/pipeline/populate.ts
frontend/app/dataset/[id]/page.tsx
frontend/app/dataset/new/page.tsx
frontend/convex/datasetRows.ts
frontend/convex/datasets.ts
frontend/convex/schema.ts
frontend/lib/backend.ts

# Conflicts: # backend/src/mastra/agents/populate.ts # backend/src/mastra/tools/investigate-tool.ts # backend/src/mastra/workflows/populate.ts # frontend/app/dataset/[id]/page.tsx # frontend/convex/datasetRows.ts

simantak-dabhade

LGTM

* Cap dataset population at 100 rows * Handle row cap count failures in subagent tool * Mention row limit sentinel in populate prompt * Allow custom dataset max rows * Allow editing dataset max rows * Address max rows review feedback

pranavjana added 5 commits June 1, 2026 16:28

Cap dataset population at 100 rows

4e467ec

Handle row cap count failures in subagent tool

efb4510

Mention row limit sentinel in populate prompt

4def5e8

Allow custom dataset max rows

af35c5b

Allow editing dataset max rows

ef6db16

pranavjana requested a review from simantak-dabhade June 5, 2026 03:33

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread backend/src/index.ts

Comment thread frontend/app/dataset/[id]/page.tsx Outdated

Comment thread frontend/app/dataset/[id]/page.tsx

Comment thread frontend/convex/datasets.ts

pranavjana added 2 commits June 4, 2026 20:48

Merge remote-tracking branch 'origin/main' into pranav/custom-maxrows

094bb7d

# Conflicts: # backend/src/mastra/agents/populate.ts # backend/src/mastra/tools/investigate-tool.ts # backend/src/mastra/workflows/populate.ts # frontend/app/dataset/[id]/page.tsx # frontend/convex/datasetRows.ts

Address max rows review feedback

d6779aa

simantak-dabhade approved these changes Jun 5, 2026

View reviewed changes

simantak-dabhade merged commit 0567771 into main Jun 5, 2026
3 checks passed

simantak-dabhade deleted the pranav/custom-maxrows branch June 5, 2026 17:37

coderabbitai Bot mentioned this pull request Jun 6, 2026

fix: resolve Makefile merge and centralize .env CRLF handling #119

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow custom dataset max rows#128

Allow custom dataset max rows#128
simantak-dabhade merged 7 commits into
mainfrom
pranav/custom-maxrows

pranavjana commented Jun 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

simantak-dabhade left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pranavjana commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

simantak-dabhade left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pranavjana commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading