improvement(connectors): audit and harden all 30 knowledge base connectors by waleedlatif1 · Pull Request #3603 · simstudioai/sim

waleedlatif1 · 2026-03-15T11:59:07Z

Summary

Audited all 30 knowledge base connectors against their service API docs
Fixed query injection in Google Docs/Drive (folderId escaping) and Outlook (OData escaping)
Fixed Notion OAuth token refresh: useBasicAuth + JSON body per their docs
Replaced raw fetch with fetchWithRetry in getJiraCloudId (used by 48 files)
Added syncContext caching to getDocument in HubSpot, Jira, Salesforce, SharePoint, Slack
Added lastModified tags to GitHub, Google Calendar, Google Sheets
Batched concurrent API calls in Obsidian and Zendesk (Promise.all with concurrency 5)
Added private channel support to Slack connector (groups:read/groups:history scopes)
Fixed WordPress URL normalization, OneDrive byte-accurate size checks, Confluence blogpost support in getDocument
Removed redundant offline_access scope from Salesforce, removed no-op Prefer header from Teams
Enhanced /validate-connector skill with input sanitization, API efficiency, and pagination checks

Type of Change

Bug fix
Enhancement

Testing

Tested via TypeScript compilation and lint. All 30 connectors validated against API docs.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

…ctors

vercel · 2026-03-15T11:59:12Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Mar 15, 2026 0:39am

cursor · 2026-03-15T11:59:13Z

PR Summary

Medium Risk
Touches many production connectors and shared OAuth/Jira utilities; changes affect auth refresh behavior, scope requirements, and fetch/pagination logic across multiple services.

Overview
Hardens multiple knowledge base connectors by fixing input/query escaping (e.g., Google Drive/Docs folder queries, Outlook OData filters) and normalizing URLs/encoding paths used in sourceUrl generation.

Improves connector correctness and efficiency: getDocument now handles additional content types (Confluence pages vs blogposts), adds syncContext caching to avoid repeated lookups (HubSpot/Jira/Salesforce/SharePoint/Slack), tightens pagination to respect max* caps (Jira), and reduces per-item API calls via bounded concurrency (Obsidian, Zendesk) and field selection (ServiceNow).

Updates auth behavior and metadata: fixes Notion refresh-token exchange to use Basic Auth + JSON body, adjusts required OAuth scopes (Slack private channels, Microsoft Teams channel lookup) and trims Salesforce provider scopes, and adds/extends Last Modified/Created tagging for GitHub, Google Calendar, and Google Sheets. Also adds a new .claude/commands/validate-connector.md command to standardize connector validation and remediation steps.

^{Written by Cursor Bugbot for commit 441d2a6. Configure here.}

greptile-apps · 2026-03-15T12:05:04Z

Greptile Summary

This PR is a broad audit and hardening pass across all 30 knowledge base connectors, fixing a range of correctness, security, and performance issues found by cross-referencing each connector against its service's official API docs.

Key changes and their impact:

Security / injection fixes: OData injection in Outlook getDocument (' → ''); Google Drive/Docs folderId injection (\' escaping in Drive query q parameter).
Auth fixes: Notion OAuth token refresh now correctly uses Basic Auth header + JSON body as required by the Notion API; offline_access scope removed from Salesforce (superseded by refresh_token).
Reliability: getJiraCloudId (used by ~48 files) upgraded from raw fetch to fetchWithRetry with explicit non-OK error handling; VALIDATE_RETRY_OPTIONS threaded through linearGraphQL and inlined in Confluence validateConfig; Jira pagination guard prevents a wasted API call when the issue limit is already reached.
Performance: syncContext caching added to getDocument in HubSpot, Jira, Salesforce, SharePoint, and Slack (avoids re-fetching portalId, cloudId, instanceUrl, siteId, and user maps on every document refresh); SharePoint resolveSiteId refactored to return displayName in one call instead of two; sequential per-item fetches in Obsidian and Zendesk batched to Promise.all with concurrency 5; Reddit getDocument reuses comments already returned by the initial API response.
Content correctness: OneDrive byte-accurate size check and truncation (Buffer.byteLength / Buffer.from(...).subarray); Webflow HTML detection replaced with a tag-shape regex; ServiceNow getDocument adds sysparm_fields to limit payload; Salesforce htmlToPlainText stripping applied to Description and Summary fields.
Feature additions: Confluence getDocument now supports blogposts; Slack connector gains private channel support (groups:read / groups:history); Teams gains Channel.ReadBasic.All scope and drops the no-op Outlook Prefer header; WordPress URL normalisation strips protocol and trailing slashes; lastModified tags added to GitHub, Google Calendar, and Google Sheets.
Tooling: New validate-connector.md Claude skill provides a structured 11-step audit checklist for future connector reviews.

Confidence Score: 4/5

Safe to merge — the majority of changes are correct hardening fixes with no regressions identified; two minor inefficiencies noted but neither affects correctness.
The PR correctly resolves a large number of real bugs (OData injection, Notion auth, Jira pagination, OneDrive byte truncation, etc.) and has been verified against API docs. All new OAuth scopes (groups:read, groups:history, Channel.ReadBasic.All) are confirmed present in the provider config. The Notion auth fix is well-covered by a new dedicated test. The one issue worth following up is that Google Sheets getDocument omits syncContext support and therefore makes a redundant Drive API call per sheet on every document refresh — a performance concern but not a correctness bug. The Confluence sequential probe for blogposts is a minor inefficiency in the same category.
apps/sim/connectors/google-sheets/google-sheets.ts — getDocument doesn't accept syncContext and re-fetches modifiedTime from the Drive API on every per-sheet call.

Important Files Changed

Filename	Overview
apps/sim/lib/oauth/oauth.ts	Fixes Notion OAuth token refresh to use Basic Auth + JSON body per Notion API docs; removes redundant `offline_access` scope from Salesforce; adds `useJsonBody` flag to `ProviderAuthConfig` and plumbs it through `buildAuthRequest`/`refreshOAuthToken`. Logic is correct and well-tested.
apps/sim/tools/jira/utils.ts	Upgrades `getJiraCloudId` from raw `fetch` to `fetchWithRetry` and adds explicit error handling for non-OK responses. Correct improvement used by ~48 files.
apps/sim/connectors/google-sheets/google-sheets.ts	Adds `lastModified` tag via a parallel Drive API call for `modifiedTime`. Works correctly in `listDocuments`, but `getDocument` doesn't accept `syncContext` and re-fetches `modifiedTime` on every per-sheet call — a redundant Drive API call per document when multiple sheets exist in the same spreadsheet.
apps/sim/connectors/confluence/confluence.ts	Adds blogpost support in `getDocument` via sequential endpoint probe (pages → blogposts) and inlines `VALIDATE_RETRY_OPTIONS` into `validateConfig`. Blogpost detection always incurs a 404 round-trip for non-page content; storing the type in metadata would avoid this.
apps/sim/connectors/notion/notion.ts	Correctly restricts Notion block traversal to `child_page` only (excluding `child_database` blocks that can't be fetched via the Pages API). No issues.
apps/sim/connectors/jira/jira.ts	Adds `Math.max(0, ...)` guard and early-return when `remaining === 0`, preventing unnecessary Jira API calls at the limit boundary; adds `syncContext` caching for `cloudId` in `getDocument`.
apps/sim/connectors/hubspot/hubspot.ts	Restores sort-by-`lastmodifieddate` in `listDocuments`, adds `syncContext` caching for `portalId` in `getDocument`, and switches `validateConfig` to a lighter-weight GET call. Removal of `filterGroups: []` from the search body is safe (field is optional per HubSpot docs).
apps/sim/connectors/outlook/outlook.ts	Fixes OData injection in `getDocument` by escaping `'` → `''` in the `$filter` value; switches `validateConfig` search URL construction to `URLSearchParams` to avoid double-encoding. Both fixes are correct.
apps/sim/connectors/onedrive/onedrive.ts	Fixes byte-accurate size check and truncation using `Buffer.byteLength` and `Buffer.from(...).subarray(0, MAX_FILE_SIZE).toString('utf8')`. Correct fix for multi-byte UTF-8 content.
apps/sim/connectors/sharepoint/sharepoint.ts	Refactors `resolveSiteId` to return `{ id, displayName }`, eliminating a redundant second Graph API call for site display name; adds `syncContext` caching in `getDocument`. Net reduction in API calls.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Connector sync triggered] --> B{listDocuments}
    B -->|cursor / pages| C[External API call\nwith fetchWithRetry]
    C --> D[Store result in syncContext\ne.g. cloudId / portalId / siteId]
    D --> E[Build ExternalDocument list]
    E --> F{hasMore?}
    F -->|yes| B
    F -->|no| G[Sync engine: per-document refresh]
    G --> H{getDocument}
    H --> I{syncContext hit?}
    I -->|yes - HubSpot · Jira · SF · SP · Slack| J[Reuse cached ID\nno extra API call]
    I -->|no - Google Sheets| K[Extra Drive API call\nfor modifiedTime each time]
    J --> L[Fetch document content]
    K --> L
    L --> M[Build ExternalDocument\nwith tags & contentHash]

    style K fill:#ffe0e0,stroke:#cc0000
    style I fill:#e0f0ff

Comments Outside Diff (3)

apps/sim/connectors/google-sheets/google-sheets.ts, line 285-310 (link)

modifiedTime Drive API call not cached across getDocument calls

Every getDocument invocation calls fetchSpreadsheetModifiedTime, which hits the Drive API for modifiedTime. Since modifiedTime is a file-level property shared by all sheets in a spreadsheet, a sync that individually refreshes N sheets will make N identical Drive API calls for the same value.

This PR adds syncContext caching to getDocument in five other connectors (HubSpot, Jira, Salesforce, SharePoint, Slack). Google Sheets getDocument was not updated the same way — it doesn't even accept syncContext in its signature yet.

The fix follows the same pattern used by the other connectors:

getDocument: async (
  accessToken: string,
  sourceConfig: Record<string, unknown>,
  externalId: string,
  syncContext?: Record<string, unknown>
): Promise<ExternalDocument | null> => {
  // ...
  let modifiedTime = syncContext?.modifiedTime as string | undefined
  if (!modifiedTime) {
    ;[metadata, modifiedTime] = await Promise.all([
      fetchSpreadsheetMetadata(accessToken, spreadsheetId),
      fetchSpreadsheetModifiedTime(accessToken, spreadsheetId),
    ])
    if (syncContext) {
      syncContext.spreadsheetMetadata = metadata
      syncContext.modifiedTime = modifiedTime
    }
  } else {
    metadata = (syncContext?.spreadsheetMetadata as SpreadsheetMetadata) ?? await fetchSpreadsheetMetadata(accessToken, spreadsheetId)
  }

apps/sim/connectors/confluence/confluence.ts, line 349-374 (link)

Sequential probe always wastes a request for blogpost documents

The new logic tries the pages endpoint first, then falls back to blogposts. For any document that was indexed as a blogpost, getDocument will always make a failed request to /pages/{id} (which returns 404) before succeeding on /blogposts/{id}. At scale this doubles the API traffic for every blogpost document refresh.

A more efficient approach would be to store the content type (page or blogpost) in the document's metadata during listDocuments, then read it back here:
```
// In listDocuments, store type: 'page' | 'blogpost' in metadata
// Then in getDocument:
const contentType = (syncContext?.contentType as string) ?? 'pages' // read from stored metadata if available
const endpoints = contentType === 'blogpost' ? ['blogposts', 'pages'] : ['pages', 'blogposts']
for (const endpoint of endpoints) { ... }
```
This is not a blocking issue — supporting blogposts at all is a clear improvement — but it's worth noting for a follow-up optimisation.
apps/sim/connectors/google-docs/google-docs.ts, line 544 (link)

Wrong escape sequence for Google Drive query syntax

The Google Drive API q parameter requires single quotes inside a quoted literal to be escaped as \'. In JavaScript, the replacement string "\\'" is the two-character sequence \' (backslash + apostrophe), which is exactly what the Drive API expects.

However, the same change was applied identically to google-drive.ts — both files now produce queries like 'abc\'def' in parents. This is correct per the Drive API docs but it is worth noting that an alternative (and arguably safer) approach used by some clients is to reject folder IDs that contain quotes at the validateConfig step, since real Drive folder IDs are opaque base-64-like strings that never contain single quotes.

No code change is strictly required, just documenting this for awareness.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

_{Last reviewed commit: c29bc2b}

apps/sim/connectors/salesforce/salesforce.ts

apps/sim/connectors/hubspot/hubspot.ts

apps/sim/connectors/jira/jira.ts

…orce - HubSpot: revert to Search API (POST /search) to restore lastmodifieddate DESCENDING sorting - Salesforce: restore ArticleBody field and add it to HTML_FIELDS for proper stripping - Jira: add zero-remaining guard to prevent requesting 0 maxResults

…Version field ArticleBody is not a standard field on KnowledgeArticleVersion per Salesforce API docs. Article body content lives in custom fields on org-specific __kav objects. Including ArticleBody in the SOQL query would cause runtime errors.

waleedlatif1 · 2026-03-15T12:15:34Z

@greptile

waleedlatif1 · 2026-03-15T12:15:38Z

@cursor review

apps/sim/connectors/onedrive/onedrive.ts

apps/sim/connectors/reddit/reddit.ts

apps/sim/connectors/onedrive/onedrive.ts

apps/sim/connectors/webflow/webflow.ts

- OneDrive: use Buffer.subarray for byte-accurate truncation instead of character-count slice - Reddit: deduplicate comment extraction — fetchPostComments now calls extractComments instead of duplicating the logic - Webflow: replace crude value.includes('<') with regex /<[a-z][^>]*>/i to avoid false positives on plain text containing '<' - Jira: add response.ok check in getJiraCloudId before parsing JSON to surface real HTTP errors instead of misleading "No Jira resources found"

waleedlatif1 · 2026-03-15T12:25:26Z

Addressed both issues from the review summary:

getJiraCloudId missing response.ok check — Added a response.ok guard before calling .json(). On HTTP errors (401/403/etc), it now throws a descriptive error with the status code and response body instead of the misleading "No Jira resources found".
OneDrive byte vs character truncation — Switched from text.slice(0, MAX_FILE_SIZE) to Buffer.from(text, 'utf8').subarray(0, MAX_FILE_SIZE).toString('utf8') for byte-accurate truncation.

Both fixed in 441d2a6.

waleedlatif1 · 2026-03-15T12:26:09Z

@greptile

waleedlatif1 · 2026-03-15T12:26:12Z

@cursor review

apps/sim/connectors/github/github.ts

apps/sim/connectors/jira/jira.ts

waleedlatif1 · 2026-03-15T12:31:18Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…Outlook URL encoding - Jira: replace bare fetch() with fetchWithRetry in downloadJiraAttachments for retry logic on transient errors and rate limits - Outlook: use URLSearchParams in validateConfig $search URL construction to match buildInitialUrl and produce RFC 3986 compliant encoding

waleedlatif1 · 2026-03-15T12:40:12Z

Addressed the review findings:

Fixed:

downloadJiraAttachments raw fetch — Replaced with fetchWithRetry for retry logic on transient errors and 429 rate limits. Fixed in c29bc2b.
Outlook validateConfig URL encoding — Switched from manual encodeURIComponent + raw " delimiters to URLSearchParams (matching buildInitialUrl), producing RFC 3986 compliant %22 encoding. Fixed in c29bc2b.

Won't fix (design limitations, not bugs):
3. Notion listFromParentPage depth-1 — This is intentional. Recursing into deeply nested Notion hierarchies risks exponential API calls and hitting Notion's rate limits (3 req/s). The function name and TSDoc clearly scope it to direct children. Users with deep hierarchies can use the database source type instead.
4. Zendesk ignores cursor/syncContext — Zendesk's Help Center Articles API doesn't support cursor-based incremental sync (it always returns the full set). Tickets are paginated internally within a single listDocuments call via fetchTickets. The connector returns all results in one pass by design, so external cursor/syncContext would add complexity with no benefit.

waleedlatif1 · 2026-03-15T12:41:29Z

@greptile

improvement(connectors): audit and harden all 30 knowledge base conne…

6a9b878

…ctors

vercel bot deployed to Preview March 15, 2026 12:01 View deployment

fix(oauth): update Notion test to match Basic Auth + JSON body config

4649033

vercel bot temporarily deployed to Preview March 15, 2026 12:03 Inactive

greptile-apps bot reviewed Mar 15, 2026

View reviewed changes

apps/sim/connectors/salesforce/salesforce.ts Show resolved Hide resolved

apps/sim/connectors/hubspot/hubspot.ts Outdated Show resolved Hide resolved

apps/sim/connectors/jira/jira.ts Outdated Show resolved Hide resolved

vercel bot temporarily deployed to Preview March 15, 2026 12:09 Inactive

vercel bot temporarily deployed to Preview March 15, 2026 12:13 Inactive

cursor bot reviewed Mar 15, 2026

View reviewed changes

apps/sim/connectors/onedrive/onedrive.ts Show resolved Hide resolved

apps/sim/connectors/reddit/reddit.ts Show resolved Hide resolved

greptile-apps bot reviewed Mar 15, 2026

View reviewed changes

apps/sim/connectors/onedrive/onedrive.ts Show resolved Hide resolved

apps/sim/connectors/webflow/webflow.ts Show resolved Hide resolved

vercel bot temporarily deployed to Preview March 15, 2026 12:25 Inactive

cursor bot reviewed Mar 15, 2026

View reviewed changes

apps/sim/connectors/github/github.ts Show resolved Hide resolved

apps/sim/connectors/jira/jira.ts Show resolved Hide resolved

cursor bot reviewed Mar 15, 2026

View reviewed changes

vercel bot temporarily deployed to Preview March 15, 2026 12:39 Inactive

waleedlatif1 merged commit 6818c51 into staging Mar 15, 2026
11 checks passed

waleedlatif1 deleted the waleedlatif1/audit-connectors branch March 15, 2026 12:51

Conversation

waleedlatif1 commented Mar 15, 2026

Summary

Type of Change

Testing

Checklist

Uh oh!

vercel bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

greptile-apps bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (3)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

Uh oh!

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

waleedlatif1 commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel bot commented Mar 15, 2026 •

edited

Loading

cursor bot commented Mar 15, 2026 •

edited

Loading

greptile-apps bot commented Mar 15, 2026 •

edited

Loading