Improve Playwright row extraction refresh path #142

Closed
AdamEXu wants to merge 3 commits into tinyfish-io:main from AdamEXu:pw-row-extraction
Conversation

@AdamEXu AdamEXu commented Jun 11, 2026

Contributor

No description provided.

Copilot AI review requested due to automatic review settings June 11, 2026 23:28
coderabbitai Bot commented Jun 11, 2026

Contributor

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c789a5f7-1682-4b19-a821-c039ed63b043

📥 Commits

Reviewing files that changed from the base of the PR and between fe9955e and 31ea152.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
  • backend/package.json
  • backend/src/config/models.ts
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/mastra/workflows/populate.ts
  • backend/src/mastra/workflows/update.ts
  • backend/src/row-extractors/try-row-extractor.ts
  • frontend/app/dashboard/settings/models/page.tsx
  • frontend/convex/modelConfig.ts
  • frontend/convex/schema.ts
  • frontend/lib/backend.ts

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.


📝 Walkthrough

Walkthrough

This PR introduces configurable GitHub repository row extraction into the BigSet platform. It adds two user-tunable parameters, rowExtractorConcurrency and rowExtractorBrowserAttempts, that flow from environment variables through backend configuration, Convex persistence, and the frontend UI. The core extraction logic drives TinyFish Browser through Playwright to read rendered repository facts, augments them with the GitHub REST API, and compares values per column in a type-aware way to detect rows that need a refresh. The extractor is integrated into the investigate tool for early fact extraction and into the update workflow for an optimized per-row refresh, with a fallback to the agent-based approach.
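
To make "per-column type-aware comparison" concrete, here is a minimal sketch of how such a check could look. The ColumnType union and the valuesDiffer helper are hypothetical names chosen for illustration; the PR's actual comparison logic is not shown in this conversation and may differ.

```typescript
// Hypothetical column types; the real schema may define others.
type ColumnType = "text" | "number" | "date";

// Returns true when the freshly extracted value should replace the stored one.
function valuesDiffer(type: ColumnType, stored: unknown, fresh: unknown): boolean {
  // Treat null and undefined as the same "missing" state.
  if (stored == null || fresh == null) {
    return (stored == null) !== (fresh == null);
  }
  switch (type) {
    case "number":
      // "42" and 42 count as equal; unparsable values compare as changed.
      return Number(stored) !== Number(fresh);
    case "date":
      // Compare instants rather than string spellings.
      return new Date(String(stored)).getTime() !== new Date(String(fresh)).getTime();
    default:
      // Text: ignore leading and trailing whitespace only.
      return String(stored).trim() !== String(fresh).trim();
  }
}
```

Comparing by type avoids rewriting a row just because a star count arrived as a string or a timestamp was formatted differently.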


@AdamEXu AdamEXu closed this Jun 11, 2026

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds configurable “row extractor” settings and introduces a GitHub-focused row extractor that can insert/refresh dataset rows via TinyFish Browser, with workflow-level integration and a local-mode settings UI.

Changes:

  • Add row extractor settings (concurrency + browser attempts) to model config types, storage schema, backend settings endpoint, and frontend settings UI.
  • Introduce a new GitHub row extractor implementation using TinyFish Browser + Playwright CDP, and wire it into investigate + refresh workflows (with fallback to the existing agent).
  • Normalize/validate row extractor numeric settings (bounds + defaults) across backend and frontend.
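
The "bounds + defaults" normalization mentioned in the last bullet can be sketched as a small clamp helper. The bounds and fallback values below are placeholders, not the values used in the PR, and normalizeIntSetting is a hypothetical name.

```typescript
interface IntSetting {
  min: number;
  max: number;
  fallback: number;
}

// Placeholder bounds for illustration only; the PR's real limits are not shown here.
const ROW_EXTRACTOR_CONCURRENCY: IntSetting = { min: 1, max: 16, fallback: 4 };
const ROW_EXTRACTOR_BROWSER_ATTEMPTS: IntSetting = { min: 1, max: 5, fallback: 2 };

// Accepts any input, returns an integer inside [min, max] or the fallback.
function normalizeIntSetting(value: unknown, setting: IntSetting): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    return setting.fallback;
  }
  return Math.min(setting.max, Math.max(setting.min, Math.trunc(value)));
}
```

Running the same helper on the backend and the frontend would keep both sides in agreement about what a saved value means.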

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| frontend/lib/backend.ts | Extends model config types and normalizes row extractor settings in getModelConfig. |
| frontend/convex/schema.ts | Persists new row extractor settings fields in Convex schema. |
| frontend/convex/modelConfig.ts | Allows upserting new fields via Convex mutations. |
| frontend/app/dashboard/settings/models/page.tsx | Adds local-mode UI controls to edit/save row extractor settings. |
| backend/src/row-extractors/try-row-extractor.ts | New GitHub row extractor + refresh path using TinyFish Browser and Playwright CDP. |
| backend/src/mastra/workflows/update.ts | Runs row extractor during refresh; uses model config concurrency and browser attempts. |
| backend/src/mastra/workflows/populate.ts | Extends workflow auth context schema with row extractor settings defaults/bounds. |
| backend/src/mastra/tools/investigate-tool.ts | Tries row extractor before spawning subagent (insert shortcut + fallback). |
| backend/src/index.ts | Accepts new settings fields on /settings/models and forwards to persistence layer. |
| backend/src/env.ts | Adds env defaults for row extractor settings. |
| backend/src/config/models.ts | Normalizes row extractor settings and includes them in returned effective model config. |
| backend/package.json | Adds playwright-core dependency for CDP connection. |
Files not reviewed (1)
  • backend/package-lock.json: Generated file
Comments suppressed due to low confidence (1)

frontend/lib/backend.ts:1

  • The new SavedModelConfig fields are typed as number | null, but the backend route only accepts them when typeof ... === "number" (so null is silently ignored), and the Convex validators are v.optional(v.number()) (which also won't accept null). Either (a) change the frontend types to number | undefined and omit keys to preserve existing values, or (b) explicitly support null end-to-end with "clear/reset" semantics (backend parsing + Convex schema + persistence).
export interface InferredSchema {
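
Option (a) from the suppressed comment can be sketched as a payload builder that drops unset keys instead of sending null. The interface and function names here are hypothetical; only the two field names come from the PR.

```typescript
// Hypothetical shape of what the frontend would send to the settings endpoint.
interface RowExtractorSettingsPayload {
  rowExtractorConcurrency?: number;
  rowExtractorBrowserAttempts?: number;
}

// Form state may still hold null for "not set"; the payload never does.
interface RowExtractorFormState {
  rowExtractorConcurrency?: number | null;
  rowExtractorBrowserAttempts?: number | null;
}

function toSettingsPayload(form: RowExtractorFormState): RowExtractorSettingsPayload {
  const payload: RowExtractorSettingsPayload = {};
  // Only copy real numbers, so omitted keys leave the stored values untouched.
  if (typeof form.rowExtractorConcurrency === "number") {
    payload.rowExtractorConcurrency = form.rowExtractorConcurrency;
  }
  if (typeof form.rowExtractorBrowserAttempts === "number") {
    payload.rowExtractorBrowserAttempts = form.rowExtractorBrowserAttempts;
  }
  return payload;
}
```

With this shape, a backend check of typeof value === "number" and an optional-number validator see the same set of keys the frontend intended to change.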

};

const pkColumns = columns.filter((c) => c.isPrimaryKey);
const maxConcurrent = authContext.modelConfig.rowExtractorConcurrency;
Comment on lines +238 to +240
`[refresh-rows] Processing ${rows.length} rows (max ${maxConcurrent} concurrent)`,
);
- await processWithConcurrency(rows, processRow, MAX_CONCURRENT);
+ await processWithConcurrency(rows, processRow, maxConcurrent);
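
For readers unfamiliar with the pattern, a helper with the same call shape as processWithConcurrency could be written roughly as below. This is an illustrative sketch, not the repository's implementation, which is not shown in this conversation; in particular, the guard against a non-finite limit is an assumption about desirable behavior.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
async function processWithConcurrency<T>(
  items: readonly T[],
  worker: (item: T) => Promise<void>,
  limit: number,
): Promise<void> {
  let next = 0; // index of the next unclaimed item, shared by all runners

  const runner = async (): Promise<void> => {
    // Claiming an index is synchronous, so no two runners take the same item.
    while (next < items.length) {
      const item = items[next++] as T;
      await worker(item);
    }
  };

  // Fall back to one runner when the configured limit is unusable.
  const safeLimit = Number.isFinite(limit) ? Math.floor(limit) : 1;
  const size = Math.max(1, Math.min(safeLimit, items.length));
  await Promise.all(Array.from({ length: size }, () => runner()));
}
```

Because the limit now comes from user-editable configuration rather than a constant, validating it before it reaches a helper like this matters more than it did with MAX_CONCURRENT.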
Comment on lines +125 to +134
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
if (/duplicate/i.test(msg)) {
return {
status: "miss",
reason: `${msg} Move on to the next entity.`,
};
}
return { status: "failed", reason: msg };
}
Comment on lines +469 to +477
const response = await page.request.get(
`https://api.github.com/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`,
{
headers: {
Accept: "application/vnd.github+json",
},
timeout: FETCH_TIMEOUT_MS,
},
);
2 participants