Skip to content

improvement(connectors): audit and harden all 30 knowledge base connectors#3603

Merged
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/audit-connectors
Mar 15, 2026
Merged

improvement(connectors): audit and harden all 30 knowledge base connectors#3603
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/audit-connectors

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

  • Audited all 30 knowledge base connectors against their service API docs
  • Fixed query injection in Google Docs/Drive (folderId escaping) and Outlook (OData escaping)
  • Fixed Notion OAuth token refresh: useBasicAuth + JSON body per their docs
  • Replaced raw fetch with fetchWithRetry in getJiraCloudId (used by 48 files)
  • Added syncContext caching to getDocument in HubSpot, Jira, Salesforce, SharePoint, Slack
  • Added lastModified tags to GitHub, Google Calendar, Google Sheets
  • Batched concurrent API calls in Obsidian and Zendesk (Promise.all with concurrency 5)
  • Added private channel support to Slack connector (groups:read/groups:history scopes)
  • Fixed WordPress URL normalization, OneDrive byte-accurate size checks, Confluence blogpost support in getDocument
  • Removed redundant offline_access scope from Salesforce, removed no-op Prefer header from Teams
  • Enhanced /validate-connector skill with input sanitization, API efficiency, and pagination checks

Type of Change

  • Bug fix
  • Enhancement

Testing

Tested via TypeScript compilation and lint. All 30 connectors validated against API docs.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Mar 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 15, 2026 0:39am

Request Review

@cursor
Copy link

cursor bot commented Mar 15, 2026

PR Summary

Medium Risk
Touches many production connectors and shared OAuth/Jira utilities; changes affect auth refresh behavior, scope requirements, and fetch/pagination logic across multiple services.

Overview
Hardens multiple knowledge base connectors by fixing input/query escaping (e.g., Google Drive/Docs folder queries, Outlook OData filters) and normalizing URLs/encoding paths used in sourceUrl generation.

Improves connector correctness and efficiency: getDocument now handles additional content types (Confluence pages vs blogposts), adds syncContext caching to avoid repeated lookups (HubSpot/Jira/Salesforce/SharePoint/Slack), tightens pagination to respect max* caps (Jira), and reduces per-item API calls via bounded concurrency (Obsidian, Zendesk) and field selection (ServiceNow).

Updates auth behavior and metadata: fixes Notion refresh-token exchange to use Basic Auth + JSON body, adjusts required OAuth scopes (Slack private channels, Microsoft Teams channel lookup) and trims Salesforce provider scopes, and adds/extends Last Modified/Created tagging for GitHub, Google Calendar, and Google Sheets. Also adds a new .claude/commands/validate-connector.md command to standardize connector validation and remediation steps.

Written by Cursor Bugbot for commit 441d2a6. Configure here.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR is a broad audit and hardening pass across all 30 knowledge base connectors, fixing a range of correctness, security, and performance issues found by cross-referencing each connector against its service's official API docs.

Key changes and their impact:

  • Security / injection fixes: OData injection in Outlook getDocument ('''); Google Drive/Docs folderId injection (\' escaping in Drive query q parameter).
  • Auth fixes: Notion OAuth token refresh now correctly uses Basic Auth header + JSON body as required by the Notion API; offline_access scope removed from Salesforce (superseded by refresh_token).
  • Reliability: getJiraCloudId (used by ~48 files) upgraded from raw fetch to fetchWithRetry with explicit non-OK error handling; VALIDATE_RETRY_OPTIONS threaded through linearGraphQL and inlined in Confluence validateConfig; Jira pagination guard prevents a wasted API call when the issue limit is already reached.
  • Performance: syncContext caching added to getDocument in HubSpot, Jira, Salesforce, SharePoint, and Slack (avoids re-fetching portalId, cloudId, instanceUrl, siteId, and user maps on every document refresh); SharePoint resolveSiteId refactored to return displayName in one call instead of two; sequential per-item fetches in Obsidian and Zendesk batched to Promise.all with concurrency 5; Reddit getDocument reuses comments already returned by the initial API response.
  • Content correctness: OneDrive byte-accurate size check and truncation (Buffer.byteLength / Buffer.from(...).subarray); Webflow HTML detection replaced with a tag-shape regex; ServiceNow getDocument adds sysparm_fields to limit payload; Salesforce htmlToPlainText stripping applied to Description and Summary fields.
  • Feature additions: Confluence getDocument now supports blogposts; Slack connector gains private channel support (groups:read / groups:history); Teams gains Channel.ReadBasic.All scope and drops the no-op Outlook Prefer header; WordPress URL normalisation strips protocol and trailing slashes; lastModified tags added to GitHub, Google Calendar, and Google Sheets.
  • Tooling: New validate-connector.md Claude skill provides a structured 11-step audit checklist for future connector reviews.

Confidence Score: 4/5

  • Safe to merge — the majority of changes are correct hardening fixes with no regressions identified; two minor inefficiencies noted but neither affects correctness.
  • The PR correctly resolves a large number of real bugs (OData injection, Notion auth, Jira pagination, OneDrive byte truncation, etc.) and has been verified against API docs. All new OAuth scopes (groups:read, groups:history, Channel.ReadBasic.All) are confirmed present in the provider config. The Notion auth fix is well-covered by a new dedicated test. The one issue worth following up is that Google Sheets getDocument omits syncContext support and therefore makes a redundant Drive API call per sheet on every document refresh — a performance concern but not a correctness bug. The Confluence sequential probe for blogposts is a minor inefficiency in the same category.
  • apps/sim/connectors/google-sheets/google-sheets.tsgetDocument doesn't accept syncContext and re-fetches modifiedTime from the Drive API on every per-sheet call.

Important Files Changed

Filename Overview
apps/sim/lib/oauth/oauth.ts Fixes Notion OAuth token refresh to use Basic Auth + JSON body per Notion API docs; removes redundant offline_access scope from Salesforce; adds useJsonBody flag to ProviderAuthConfig and plumbs it through buildAuthRequest/refreshOAuthToken. Logic is correct and well-tested.
apps/sim/tools/jira/utils.ts Upgrades getJiraCloudId from raw fetch to fetchWithRetry and adds explicit error handling for non-OK responses. Correct improvement used by ~48 files.
apps/sim/connectors/google-sheets/google-sheets.ts Adds lastModified tag via a parallel Drive API call for modifiedTime. Works correctly in listDocuments, but getDocument doesn't accept syncContext and re-fetches modifiedTime on every per-sheet call — a redundant Drive API call per document when multiple sheets exist in the same spreadsheet.
apps/sim/connectors/confluence/confluence.ts Adds blogpost support in getDocument via sequential endpoint probe (pages → blogposts) and inlines VALIDATE_RETRY_OPTIONS into validateConfig. Blogpost detection always incurs a 404 round-trip for non-page content; storing the type in metadata would avoid this.
apps/sim/connectors/notion/notion.ts Correctly restricts Notion block traversal to child_page only (excluding child_database blocks that can't be fetched via the Pages API). No issues.
apps/sim/connectors/jira/jira.ts Adds Math.max(0, ...) guard and early-return when remaining === 0, preventing unnecessary Jira API calls at the limit boundary; adds syncContext caching for cloudId in getDocument.
apps/sim/connectors/hubspot/hubspot.ts Restores sort-by-lastmodifieddate in listDocuments, adds syncContext caching for portalId in getDocument, and switches validateConfig to a lighter-weight GET call. Removal of filterGroups: [] from the search body is safe (field is optional per HubSpot docs).
apps/sim/connectors/outlook/outlook.ts Fixes OData injection in getDocument by escaping ''' in the $filter value; switches validateConfig search URL construction to URLSearchParams to avoid double-encoding. Both fixes are correct.
apps/sim/connectors/onedrive/onedrive.ts Fixes byte-accurate size check and truncation using Buffer.byteLength and Buffer.from(...).subarray(0, MAX_FILE_SIZE).toString('utf8'). Correct fix for multi-byte UTF-8 content.
apps/sim/connectors/sharepoint/sharepoint.ts Refactors resolveSiteId to return { id, displayName }, eliminating a redundant second Graph API call for site display name; adds syncContext caching in getDocument. Net reduction in API calls.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Connector sync triggered] --> B{listDocuments}
    B -->|cursor / pages| C[External API call\nwith fetchWithRetry]
    C --> D[Store result in syncContext\ne.g. cloudId / portalId / siteId]
    D --> E[Build ExternalDocument list]
    E --> F{hasMore?}
    F -->|yes| B
    F -->|no| G[Sync engine: per-document refresh]
    G --> H{getDocument}
    H --> I{syncContext hit?}
    I -->|yes - HubSpot · Jira · SF · SP · Slack| J[Reuse cached ID\nno extra API call]
    I -->|no - Google Sheets| K[Extra Drive API call\nfor modifiedTime each time]
    J --> L[Fetch document content]
    K --> L
    L --> M[Build ExternalDocument\nwith tags & contentHash]

    style K fill:#ffe0e0,stroke:#cc0000
    style I fill:#e0f0ff
Loading

Comments Outside Diff (3)

  1. apps/sim/connectors/google-sheets/google-sheets.ts, line 285-310 (link)

    modifiedTime Drive API call not cached across getDocument calls

    Every getDocument invocation calls fetchSpreadsheetModifiedTime, which hits the Drive API for modifiedTime. Since modifiedTime is a file-level property shared by all sheets in a spreadsheet, a sync that individually refreshes N sheets will make N identical Drive API calls for the same value.

    This PR adds syncContext caching to getDocument in five other connectors (HubSpot, Jira, Salesforce, SharePoint, Slack). Google Sheets getDocument was not updated the same way — it doesn't even accept syncContext in its signature yet.

    The fix follows the same pattern used by the other connectors:

    getDocument: async (
      accessToken: string,
      sourceConfig: Record<string, unknown>,
      externalId: string,
      syncContext?: Record<string, unknown>
    ): Promise<ExternalDocument | null> => {
      // ...
      let modifiedTime = syncContext?.modifiedTime as string | undefined
      if (!modifiedTime) {
        ;[metadata, modifiedTime] = await Promise.all([
          fetchSpreadsheetMetadata(accessToken, spreadsheetId),
          fetchSpreadsheetModifiedTime(accessToken, spreadsheetId),
        ])
        if (syncContext) {
          syncContext.spreadsheetMetadata = metadata
          syncContext.modifiedTime = modifiedTime
        }
      } else {
        metadata = (syncContext?.spreadsheetMetadata as SpreadsheetMetadata) ?? await fetchSpreadsheetMetadata(accessToken, spreadsheetId)
      }
  2. apps/sim/connectors/confluence/confluence.ts, line 349-374 (link)

    Sequential probe always wastes a request for blogpost documents

    The new logic tries the pages endpoint first, then falls back to blogposts. For any document that was indexed as a blogpost, getDocument will always make a failed request to /pages/{id} (which returns 404) before succeeding on /blogposts/{id}. At scale this doubles the API traffic for every blogpost document refresh.

    A more efficient approach would be to store the content type (page or blogpost) in the document's metadata during listDocuments, then read it back here:

    // In listDocuments, store type: 'page' | 'blogpost' in metadata
    // Then in getDocument:
    const contentType = (syncContext?.contentType as string) ?? 'pages' // read from stored metadata if available
    const endpoints = contentType === 'blogpost' ? ['blogposts', 'pages'] : ['pages', 'blogposts']
    for (const endpoint of endpoints) { ... }

    This is not a blocking issue — supporting blogposts at all is a clear improvement — but it's worth noting for a follow-up optimisation.

  3. apps/sim/connectors/google-docs/google-docs.ts, line 544 (link)

    Wrong escape sequence for Google Drive query syntax

    The Google Drive API q parameter requires single quotes inside a quoted literal to be escaped as \'. In JavaScript, the replacement string "\\'" is the two-character sequence \' (backslash + apostrophe), which is exactly what the Drive API expects.

    However, the same change was applied identically to google-drive.ts — both files now produce queries like 'abc\'def' in parents. This is correct per the Drive API docs but it is worth noting that an alternative (and arguably safer) approach used by some clients is to reject folder IDs that contain quotes at the validateConfig step, since real Drive folder IDs are opaque base-64-like strings that never contain single quotes.

    No code change is strictly required, just documenting this for awareness.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Last reviewed commit: c29bc2b

…orce

- HubSpot: revert to Search API (POST /search) to restore lastmodifieddate DESCENDING sorting
- Salesforce: restore ArticleBody field and add it to HTML_FIELDS for proper stripping
- Jira: add zero-remaining guard to prevent requesting 0 maxResults
…Version field

ArticleBody is not a standard field on KnowledgeArticleVersion per Salesforce
API docs. Article body content lives in custom fields on org-specific __kav
objects. Including ArticleBody in the SOQL query would cause runtime errors.
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

- OneDrive: use Buffer.subarray for byte-accurate truncation instead of
  character-count slice
- Reddit: deduplicate comment extraction — fetchPostComments now calls
  extractComments instead of duplicating the logic
- Webflow: replace crude value.includes('<') with regex /<[a-z][^>]*>/i
  to avoid false positives on plain text containing '<'
- Jira: add response.ok check in getJiraCloudId before parsing JSON to
  surface real HTTP errors instead of misleading "No Jira resources found"
@waleedlatif1
Copy link
Collaborator Author

Addressed both issues from the review summary:

  1. getJiraCloudId missing response.ok check — Added a response.ok guard before calling .json(). On HTTP errors (401/403/etc), it now throws a descriptive error with the status code and response body instead of the misleading "No Jira resources found".

  2. OneDrive byte vs character truncation — Switched from text.slice(0, MAX_FILE_SIZE) to Buffer.from(text, 'utf8').subarray(0, MAX_FILE_SIZE).toString('utf8') for byte-accurate truncation.

Both fixed in 441d2a6.

@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

…Outlook URL encoding

- Jira: replace bare fetch() with fetchWithRetry in downloadJiraAttachments
  for retry logic on transient errors and rate limits
- Outlook: use URLSearchParams in validateConfig $search URL construction
  to match buildInitialUrl and produce RFC 3986 compliant encoding
@waleedlatif1
Copy link
Collaborator Author

Addressed the review findings:

Fixed:

  1. downloadJiraAttachments raw fetch — Replaced with fetchWithRetry for retry logic on transient errors and 429 rate limits. Fixed in c29bc2b.
  2. Outlook validateConfig URL encoding — Switched from manual encodeURIComponent + raw " delimiters to URLSearchParams (matching buildInitialUrl), producing RFC 3986 compliant %22 encoding. Fixed in c29bc2b.

Won't fix (design limitations, not bugs):
3. Notion listFromParentPage depth-1 — This is intentional. Recursing into deeply nested Notion hierarchies risks exponential API calls and hitting Notion's rate limits (3 req/s). The function name and TSDoc clearly scope it to direct children. Users with deep hierarchies can use the database source type instead.
4. Zendesk ignores cursor/syncContext — Zendesk's Help Center Articles API doesn't support cursor-based incremental sync (it always returns the full set). Tickets are paginated internally within a single listDocuments call via fetchTickets. The connector returns all results in one pass by design, so external cursor/syncContext would add complexity with no benefit.

@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1 waleedlatif1 merged commit 6818c51 into staging Mar 15, 2026
11 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/audit-connectors branch March 15, 2026 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant