Embedding lambda sends updateEmbedding message to Thrall #4631

Merged
ellenmuller merged 59 commits into main from em-write-embedding-to-es on Mar 11, 2026


@ellenmuller ellenmuller commented Feb 18, 2026

What does this change?

Completes the pipeline for persisting image embeddings into Elasticsearch by having the image-embedder lambda send embedding data to Thrall via its Kinesis stream, where Thrall writes it to ES using the existing migrationAwareUpdater pattern.

Previously, the image-embedder lambda generated embeddings and stored them in the S3 Vector Store, but they were not indexed in Elasticsearch. This PR closes that gap. We are now writing the embeddings to both.

How it works

  1. Image-embedder lambda: after generating a Cohere embedding, the lambda serialises an UpdateEmbeddingMessage and publishes it to the Thrall Kinesis stream using PutRecords. Failed Kinesis publishes are reported back as batchItemFailures so that SQS can retry them.
  2. Thrall (MessageProcessor): a new UpdateEmbeddingMessage case is handled by delegating to ElasticSearch.updateEmbedding.
  3. Thrall (ElasticSearch): updateEmbedding uses migrationAwareUpdater with a Painless script to set ctx._source.embedding on the image document.
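
As an illustration of step 1 only, here is a minimal sketch of how per-record PutRecords failures could be turned into an SQS partial batch response. It is not the code from this PR: the helper name toBatchItemFailures and the assumption that each SQS message maps to exactly one Kinesis record (in the same order) are hypothetical. The local types mirror the documented shape of a PutRecords result entry (a failed record carries ErrorCode and ErrorMessage) and of the Lambda batchItemFailures response, so the block runs without the AWS SDK.

```typescript
// Minimal local types so the sketch runs without the AWS SDK or @types/aws-lambda.
// They mirror the documented PutRecords result entry and SQS partial batch response.
interface PutRecordsResultEntry {
  SequenceNumber?: string;
  ShardId?: string;
  ErrorCode?: string; // only present when this record failed
  ErrorMessage?: string;
}

interface SQSBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical helper: messageIds[i] is the SQS message whose embedding was
// sent as the i-th record of the PutRecords request. PutRecords returns its
// result entries in request order, so the two arrays line up by index.
function toBatchItemFailures(
  messageIds: string[],
  results: PutRecordsResultEntry[],
): SQSBatchResponse {
  const batchItemFailures = messageIds
    .filter((_, i) => {
      const result = results[i];
      // A missing entry is treated as a failure so the message is retried.
      return result === undefined || result.ErrorCode !== undefined;
    })
    .map((itemIdentifier) => ({ itemIdentifier }));
  return { batchItemFailures };
}

// Usage: the second record was throttled, so only its message is retried.
const response = toBatchItemFailures(
  ["msg-1", "msg-2"],
  [
    { SequenceNumber: "1", ShardId: "shardId-000000000000" },
    { ErrorCode: "ProvisionedThroughputExceededException", ErrorMessage: "Rate exceeded" },
  ],
);
console.log(JSON.stringify(response)); // {"batchItemFailures":[{"itemIdentifier":"msg-2"}]}
```

Returning only the failed message IDs means SQS redelivers just those messages, provided the event source mapping has ReportBatchItemFailures enabled.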

⚠️ Rollout note

Kinesis publishing is currently gated off on PROD — embeddings are only written to Thrall on non-PROD stages while testing is in progress.

We will monitor TEST after this is merged in and start embedding to ES on PROD after we feel confident that all is well!
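
The gating described in this rollout note could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's code: the function name shouldPublishToThrall is hypothetical, as is the idea that the stage reaches the lambda as a plain string. Only the stage name PROD comes from the description above.

```typescript
// Hypothetical gate for the rollout note: publish to the Thrall Kinesis
// stream on every stage except PROD. How the real lambda learns its stage
// is not shown in this PR description, so the stage is passed in here.
function shouldPublishToThrall(stage: string | undefined): boolean {
  // Design choice of this sketch: fail closed, so an unset stage is
  // treated like PROD and nothing is published.
  if (stage === undefined) return false;
  return stage.trim().toUpperCase() !== "PROD";
}

console.log(shouldPublishToThrall("TEST")); // true
console.log(shouldPublishToThrall("PROD")); // false
```

Keeping the check in one small function would make enabling PROD later a one-line change.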

How should a reviewer test this change?

How can success be measured?

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@ellenmuller ellenmuller added the "feature" label (Departmental tracking: work on a new feature) Feb 18, 2026
@joelochlann
Member

Not related to this PR, but just for reference while we have it. This should count the vectors in the vector store in TEST:

aws s3vectors list-vectors \
  --vector-bucket-name image-embeddings-test \
  --index-name cohere-embed-english-v3 \
  --profile media-service \
  --region eu-central-1 \
  --page-size 1000  | jq '.vectors | length'

@ellenmuller ellenmuller force-pushed the em-write-embedding-to-es branch from 9aacaa9 to 03517c3 on March 2, 2026 11:28

gu-prout Bot commented Mar 11, 2026

Seen on auth, image-loader, metadata-editor, leases, cropper, collections, media-api, kahuna (merged by @ellenmuller 9 minutes and 49 seconds ago) Please check your changes!

gu-prout Bot commented Mar 11, 2026

Seen on usage (merged by @ellenmuller 9 minutes and 56 seconds ago) Please check your changes!

gu-prout Bot commented Mar 11, 2026

Seen on thrall (merged by @ellenmuller 10 minutes and 1 second ago) Please check your changes!
