Fix/ingestion flow #159

Merged
nikilok merged 9 commits into main from fix/ingestion-flow
Jun 10, 2026

Conversation

@nikilok nikilok commented Jun 10, 2026

Owner

Summary by CodeRabbit

  • New Features

    • Sponsor licence numbers are now tracked and displayed in search results and sponsor details.
  • Improvements

    • Location information now derives from Companies House profiles for enhanced accuracy.
    • Search results employ improved deduplication and grouping logic for sponsor records.
  • Documentation

    • Added comprehensive guide documenting HMRC CSV format updates and integration changes.

vercel Bot commented Jun 10, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| learn-tanstack-start | Ready | Preview, Comment | Jun 10, 2026 4:52pm |

coderabbitai Bot commented Jun 10, 2026

Warning

Review limit reached

@nikilok, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 27 minutes and 59 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fb5b7617-55d2-4885-af0a-86fe5819dcf6

📥 Commits

Reviewing files that changed from the base of the PR and between fe85e6a and d6681ae.

📒 Files selected for processing (2)
  • apps/web/scripts/ingest-hmrc-csv.ts
  • apps/web/src/api/hmrc.ts
📝 Walkthrough

This PR migrates HMRC sponsor location data from HMRC-provided town/city fields to Companies House-derived location. The schema, ingestion, search API, matching adapters, and UI components are updated to source location from Companies House profiles via query-time joins while capturing sponsor licence numbers.

Changes

HMRC Location Data Migration

| Layer / File(s) | Summary |
| --- | --- |
| Database Schema & Migrations<br>packages/db/src/schema.ts, packages/db/migrations/0025_add-sponsor-licence.sql, packages/db/migrations/0026_drop-town-county.sql, packages/db/migrations/0027_widen-sponsor-licence.sql, packages/db/migrations/meta/* | Drop town_city and county columns from hmrc_skilled_workers; add sponsor_licence_number and sponsor_status columns with supporting indexes. Three sequential migrations implement the changes; snapshots and journal track schema state. |
| Cache Control Headers<br>apps/web/src/api/cache-headers.ts | Add SHORT_EDGE_CACHE constant for negative lookups and setSsrCacheControl isomorphic function to set Cache-Control on SSR document responses, complementing RPC-level cache control. |
| HMRC Search & Lookup API<br>apps/web/src/api/hmrc.ts | Refactor searchHmrc to group by (organisationName, nameSlug, typeRating, route), compute canonical slugId via min(hash), aggregate sponsorLicenceNumbers, and join Companies House profiles to derive locality and region. Update getHmrcBySlugId to return canonical group-level data with new cache behavior. Change getHmrcBySlug stale-slug fallback to order by hash ascending and return up to 10 rows instead of 2. |
| HMRC CSV Ingestion<br>apps/web/scripts/ingest-hmrc-csv.ts | Update CSV schema to expect sponsor licence number and status instead of location/type columns. Rewrite hash computation to use licence + typeRating + route. Update staging table DDL, row validation, batch INSERT logic, and indexes to match the new schema. |
| Sponsor Lookup Adapters<br>apps/web/scripts/drain-review-queue.ts, apps/web/scripts/seed-companies-house.ts, apps/web/src/api/companiesHouse.ts, apps/web/src/lib/phase5/sql.ts | Simplify SponsorRow to track only route; update loadSponsors to group by route without locality tie-breaking. Remove HMRC locality lookups from resolver path and set locality to null. Update makeLookupSponsor to select only route and force locality to null in returned shape. |
| Sitemap Generation<br>apps/web/scripts/generate-sitemap.ts | Use Drizzle sql() for deterministic de-duplication, grouping by (organisationName, nameSlug, typeRating, route, companiesHouseProfiles.updatedAt) and selecting min(hash) as canonical hash per group. |
| Location Display Components<br>apps/web/src/components/HmrcCard.tsx, apps/web/src/components/HmrcResults.tsx, apps/web/src/components/McpTools.tsx | Import and use formatLocation(locality, region) for location display instead of formatting town_city/county. Add sponsorLicenceNumbers to MCP tool outputs. |
| Company Detail Route<br>apps/web/src/routes/company.$id.$slug.tsx | Introduce registeredLocation() helper to derive location from Companies House address fields (locality/address_line_2 + region). Update route loader to handle null-cache invalidation and SSR-only canonical redirects. Add licence number display in sponsor cards when Companies House profile is missing or licence data is present. Update page metadata and rendering to use the new location helper. |
| Implementation Documentation<br>docs/hmrc-csv-format-change.md | Comprehensive guide covering schema/ingest changes, query updates, UI wiring, matching pipeline adjustments, sitemap/redirect behavior, verification steps, and open confirmations for the location data migration. |
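
To make the Company Detail Route entry more concrete, here is a minimal sketch of a location helper in the spirit of registeredLocation(). The PR's actual implementation is not shown in this conversation, so the name `registeredLocationSketch`, the `RegisteredAddress` shape, and the comma-joined output are assumptions; only the field names (locality, address_line_2, region) and the fallback order come from the summary above.

```typescript
// Hypothetical shape: only the Companies House address fields named in the summary.
interface RegisteredAddress {
  locality?: string | null;
  address_line_2?: string | null;
  region?: string | null;
}

// Trim a value and treat empty or missing strings as null.
function clean(value: string | null | undefined): string | null {
  const trimmed = value ? value.trim() : "";
  return trimmed.length > 0 ? trimmed : null;
}

// Assumed behaviour: prefer locality, fall back to address_line_2, then append region.
function registeredLocationSketch(
  address: RegisteredAddress | null,
): string | null {
  if (!address) return null;
  const place = clean(address.locality) ?? clean(address.address_line_2);
  const region = clean(address.region);
  const parts = [place, region].filter((p): p is string => p !== null);
  return parts.length > 0 ? parts.join(", ") : null;
}
```

Returning null when no field is usable lets a caller decide what to render for sponsors without a matched Companies House profile, rather than showing an empty string.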

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • nikilok/learn-tanstack-start#130: Main PR's drain-review-queue.ts changes to sponsor location handling directly impact the Phase 5 scorer logic that uses compareForInlineResolution for candidate ranking.
  • nikilok/learn-tanstack-start#99: Main PR's removal of locality fields from sponsor data and forcing null values is directly aligned with Phase 5 "same-rank inline resolution" and the compareForInlineResolution decision logic.
  • nikilok/learn-tanstack-start#61: Main PR's updates to apps/web/src/api/hmrc.ts (getHmrcBySlug stale-slug behavior) and route redirect/canonicalization logic directly overlap with the introduction of getHmrcBySlug and 404-to-redirect handling.

Poem

🐰 No more town and county in the HMRC feed,
Companies House profiles provide what sponsors need,
Licences captured, location refined,
Location data from a better source we find—
A hop towards clarity! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'Fix/ingestion flow' is vague and generic, using non-descriptive terminology that doesn't convey meaningful information about the substantial changes to HMRC data handling, sponsor licence tracking, and query refactoring across multiple files. | Consider a more specific title that highlights the main change, such as 'Refactor HMRC data handling to use sponsor licence numbers instead of location fields' or 'Update HMRC ingestion to track sponsor licence and remove location-based deduplication'. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/web/src/api/hmrc.ts (1)

220-228: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't cap stale-slug fallback to 10 raw licence rows.

The comment here talks about a handful of (rating, route) groups, but this query still returns one row per licence. Any slug with more than 10 licences can drop the requested hash from the loader's containment scan and incorrectly 301 to a different record even though that hash exists again.
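
The failure mode described here can be reproduced in memory. The sketch below is not the code under review: `fallbackRows` and `containsHash` are hypothetical stand-ins for the capped query and the loader's containment scan, and the sample hashes are invented for illustration.

```typescript
interface LicenceRow {
  hash: string;
  typeRating: string;
  route: string;
}

// Mirrors the reviewed query shape: order by hash ascending, optionally cap the rows.
function fallbackRows(rows: LicenceRow[], limit?: number): LicenceRow[] {
  const sorted = [...rows].sort((a, b) =>
    a.hash < b.hash ? -1 : a.hash > b.hash ? 1 : 0,
  );
  return limit === undefined ? sorted : sorted.slice(0, limit);
}

// Hypothetical loader check: is the requested hash among the returned rows?
function containsHash(rows: LicenceRow[], hash: string): boolean {
  return rows.some((row) => row.hash === hash);
}

// Twelve licences under one slug; hashes are zero-padded so string order is numeric order.
const licenceRows: LicenceRow[] = Array.from({ length: 12 }, (_, i) => ({
  hash: "h" + (i + 1 < 10 ? "0" : "") + String(i + 1),
  typeRating: "A rating",
  route: "Skilled Worker",
}));

// With the cap, the last two hashes never reach the containment scan.
const cappedHit = containsHash(fallbackRows(licenceRows, 10), "h12");
// Without the cap, the same hash is found.
const uncappedHit = containsHash(fallbackRows(licenceRows), "h12");
```

Here `cappedHit` is false and `uncappedHit` is true, which is the situation that would turn a valid URL into a redirect.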

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/web/src/api/hmrc.ts` around lines 220 - 228, The query against
hmrcSkilledWorkers currently uses .limit(10) which caps raw licence rows and can
drop the requested hash; remove the .limit(10) and instead return group-level
results (or at minimum all matching rows) so the loader's containment scan won't
miss the requested hash. Modify the db.select/from/where block that builds rows
(the call using hmrcSkilledWorkers, eq(hmrcSkilledWorkers.nameSlug, slug),
orderBy(asc(hmrcSkilledWorkers.hash))) to either remove the limit or replace it
with a GROUP BY on the intended grouping columns (e.g.,
hmrcSkilledWorkers.rating and hmrcSkilledWorkers.route) and select a single
representative per group (e.g., min/max hash) so you get "a handful" of groups
rather than up to 10 arbitrary licence rows.
🧹 Nitpick comments (1)
docs/hmrc-csv-format-change.md (1)

96-96: ⚡ Quick win

Clarify or remove the "CLAUDE.md" reference.

The documentation references a "CLAUDE.md invariant" but no such file appears in the provided context, coding guidelines, or review stack. This could confuse readers.

If this refers to an internal document, consider either:

  • Linking to it explicitly
  • Replacing it with "coding guidelines" or "useCardMetrics sync requirement"
  • Removing the reference and keeping just the technical constraint description
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/hmrc-csv-format-change.md` at line 96, The doc line references a
mysterious "CLAUDE.md invariant" alongside `fixedHeight:62` and
`useCardMetrics`; update this to avoid confusion by either (a) adding an
explicit link to the CLAUDE.md file if it exists, (b) replacing "CLAUDE.md
invariant" with a clearer phrase like "coding guidelines" or "useCardMetrics
sync requirement", or (c) remove the file reference entirely and leave only the
technical constraint (e.g., "`fixedHeight:62` — `useCardMetrics` must stay in
sync"), ensuring the terms `fixedHeight:62` and `useCardMetrics` remain
unchanged so the technical requirement is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/web/scripts/ingest-hmrc-csv.ts`:
- Around line 117-125: The ingestion currently dedupes solely by the hash
produced in computeHash(licence,typeRating,route), which can hide conflicting
values in other columns (e.g., organisation_name, sponsor_status); before
collapsing rows by hmrc_skilled_workers.hash, group incoming CSV rows by the
computed hash and compare all non-hash, persistent fields (at least
organisation_name and sponsor_status) for consistency; if any group contains
differing values, surface the conflict (fail the import or log and skip
according to policy) so you don’t silently retain arbitrary
organisation_name/sponsor_status, and only then dedupe/insert the canonical row
into the table or abort.
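
A minimal sketch of the consistency check suggested above, assuming rows have already been parsed from the CSV. The `CsvRow` shape, `dedupeKey`, and `findConflicts` are hypothetical names; in particular `dedupeKey` is a plain joined string standing in for computeHash(licence, typeRating, route), not the script's real hash.

```typescript
interface CsvRow {
  organisationName: string;
  sponsorLicenceNumber: string;
  sponsorStatus: string;
  typeRating: string;
  route: string;
}

// Stand-in for computeHash(licence, typeRating, route): a joined key, not a digest.
function dedupeKey(row: CsvRow): string {
  return [row.sponsorLicenceNumber, row.typeRating, row.route].join("|");
}

interface Conflict {
  key: string;
  field: "organisationName" | "sponsorStatus";
  values: string[];
}

// Group rows by dedupe key and report any persistent field with more than one value.
function findConflicts(rows: CsvRow[]): Conflict[] {
  const groups = new Map<string, CsvRow[]>();
  for (const row of rows) {
    const key = dedupeKey(row);
    const group = groups.get(key);
    if (group) group.push(row);
    else groups.set(key, [row]);
  }

  const fields: Conflict["field"][] = ["organisationName", "sponsorStatus"];
  const conflicts: Conflict[] = [];
  groups.forEach((group, key) => {
    for (const field of fields) {
      const values = Array.from(new Set(group.map((r) => r[field]))).sort();
      if (values.length > 1) conflicts.push({ key, field, values });
    }
  });
  return conflicts;
}
```

Whether a non-empty result should abort the import or only be logged is a policy choice the review comment leaves open.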

In `@apps/web/src/api/hmrc.ts`:
- Around line 54-58: The array_agg(distinct ...) expression for
sponsorLicenceNumbers is non-deterministic because it doesn't guarantee order;
change it to produce a deterministic ordered array by using ORDER BY inside the
aggregate (or an ordered subselect). Update the field that references
hmrcSkilledWorkers.sponsorLicenceNumber (the sponsorLicenceNumbers projection)
to use coalesce(array_agg(distinct ${hmrcSkilledWorkers.sponsorLicenceNumber}
ORDER BY ${hmrcSkilledWorkers.sponsorLicenceNumber}) FILTER (WHERE
${hmrcSkilledWorkers.sponsorLicenceNumber} IS NOT NULL), '{}') so the array
elements are consistently ordered, and make the same change for the other
occurrence mentioned (the block at lines ~154-157).
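
For readers who want to see the intended result outside SQL, the following is an in-memory analogue of the grouped query with an ordered, distinct, null-filtered aggregate. It is a sketch only: `SponsorRow`, `SponsorGroup`, and `groupSponsors` are hypothetical names, and the sort uses plain JavaScript string comparison, which may differ from the database's collation.

```typescript
interface SponsorRow {
  hash: string;
  organisationName: string;
  nameSlug: string;
  typeRating: string;
  route: string;
  sponsorLicenceNumber: string | null;
}

interface SponsorGroup {
  slugId: string; // smallest hash in the group, mirroring min(hash)
  organisationName: string;
  nameSlug: string;
  typeRating: string;
  route: string;
  sponsorLicenceNumbers: string[]; // distinct, non-null, sorted ascending
}

function groupSponsors(rows: SponsorRow[]): SponsorGroup[] {
  const groups = new Map<string, SponsorGroup>();
  for (const row of rows) {
    // Group on the same four columns the search query groups by.
    const key = JSON.stringify([
      row.organisationName,
      row.nameSlug,
      row.typeRating,
      row.route,
    ]);
    let group = groups.get(key);
    if (!group) {
      group = {
        slugId: row.hash,
        organisationName: row.organisationName,
        nameSlug: row.nameSlug,
        typeRating: row.typeRating,
        route: row.route,
        sponsorLicenceNumbers: [],
      };
      groups.set(key, group);
    }
    if (row.hash < group.slugId) group.slugId = row.hash;
    const licence = row.sponsorLicenceNumber;
    if (licence !== null && group.sponsorLicenceNumbers.indexOf(licence) === -1) {
      group.sponsorLicenceNumbers.push(licence);
    }
  }
  const result = Array.from(groups.values());
  // Sorting makes the output independent of the order in which rows arrive.
  for (const group of result) group.sponsorLicenceNumbers.sort();
  return result;
}
```

Because the licence list is sorted after grouping, two calls with the same rows in a different order produce identical arrays, which is the property the ORDER BY inside the aggregate is meant to guarantee.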

---

Outside diff comments:
In `@apps/web/src/api/hmrc.ts`:
- Around line 220-228: The query against hmrcSkilledWorkers currently uses
.limit(10) which caps raw licence rows and can drop the requested hash; remove
the .limit(10) and instead return group-level results (or at minimum all
matching rows) so the loader's containment scan won't miss the requested hash.
Modify the db.select/from/where block that builds rows (the call using
hmrcSkilledWorkers, eq(hmrcSkilledWorkers.nameSlug, slug),
orderBy(asc(hmrcSkilledWorkers.hash))) to either remove the limit or replace it
with a GROUP BY on the intended grouping columns (e.g.,
hmrcSkilledWorkers.rating and hmrcSkilledWorkers.route) and select a single
representative per group (e.g., min/max hash) so you get "a handful" of groups
rather than up to 10 arbitrary licence rows.

---

Nitpick comments:
In `@docs/hmrc-csv-format-change.md`:
- Line 96: The doc line references a mysterious "CLAUDE.md invariant" alongside
`fixedHeight:62` and `useCardMetrics`; update this to avoid confusion by either
(a) adding an explicit link to the CLAUDE.md file if it exists, (b) replacing
"CLAUDE.md invariant" with a clearer phrase like "coding guidelines" or
"useCardMetrics sync requirement", or (c) remove the file reference entirely and
leave only the technical constraint (e.g., "`fixedHeight:62` — `useCardMetrics`
must stay in sync"), ensuring the terms `fixedHeight:62` and `useCardMetrics`
remain unchanged so the technical requirement is preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ac48351c-94c0-45ec-b24a-db04fc6a6fab

📥 Commits

Reviewing files that changed from the base of the PR and between 754b971 and fe85e6a.

📒 Files selected for processing (21)
  • apps/web/scripts/drain-review-queue.ts
  • apps/web/scripts/generate-sitemap.ts
  • apps/web/scripts/ingest-hmrc-csv.ts
  • apps/web/scripts/seed-companies-house.ts
  • apps/web/src/api/cache-headers.ts
  • apps/web/src/api/companiesHouse.ts
  • apps/web/src/api/hmrc.ts
  • apps/web/src/components/HmrcCard.tsx
  • apps/web/src/components/HmrcResults.tsx
  • apps/web/src/components/McpTools.tsx
  • apps/web/src/lib/phase5/sql.ts
  • apps/web/src/routes/company.$id.$slug.tsx
  • docs/hmrc-csv-format-change.md
  • packages/db/migrations/0025_add-sponsor-licence.sql
  • packages/db/migrations/0026_drop-town-county.sql
  • packages/db/migrations/0027_widen-sponsor-licence.sql
  • packages/db/migrations/meta/0025_snapshot.json
  • packages/db/migrations/meta/0026_snapshot.json
  • packages/db/migrations/meta/0027_snapshot.json
  • packages/db/migrations/meta/_journal.json
  • packages/db/src/schema.ts

Comment thread apps/web/scripts/ingest-hmrc-csv.ts
Comment thread apps/web/src/api/hmrc.ts
@nikilok nikilok merged commit 021d18e into main Jun 10, 2026
5 checks passed
@nikilok nikilok deleted the fix/ingestion-flow branch June 10, 2026 16:57