LlamaIndexEmbeddingOperator returns vector=None for every chunk #68416

@vojay-dev

Under which category would you file this issue?

Providers

Apache Airflow version

3.2.2+astro.1

What happened and how to reproduce it?

Versions (as tested):

  • apache-airflow-providers-common-ai 0.4.0 (bug also present on main as of 2026-06-11)
  • llama-index-core 0.14.22, llama-index-embeddings-openai 0.6.0
  • Apache Airflow 3.2.2 (Astro Runtime 3.2-5), Python 3.13

Summary:
LlamaIndexEmbeddingOperator.execute() returns {"chunks": [{"text", "metadata", "vector"}], ...}, but vector is always None. Downstream tasks consuming the documented chunk output (e.g. inserting vectors into a vector table) fail or silently store nulls.

Root cause:
The operator builds the index and then reads embeddings back off its own local nodes list, relying on a side effect that doesn't exist (llamaindex_embedding.py lines ~128–149 at tag providers-common-ai/0.4.0):

# ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
# side effect of building the index; ...
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)
...
chunks = [{"text": node.text, "metadata": node.metadata, "vector": node.embedding} for node in text_nodes]

But VectorStoreIndex._get_node_with_embedding() in llama-index-core attaches embeddings to copies, never the originals:

result = node.model_copy()
result.embedding = embedding

I checked llama-index-core tags v0.10.68, v0.11.23, v0.12.52, v0.13.5, and 0.14.22; all of them copy the node (older ones via node.copy()). So the side-effect assumption has never held, and no version pin fixes it. The embeddings end up only inside the index's vector store (index.vector_store.data.embedding_dict for SimpleVectorStore, keyed by node_id).

Minimal reproduction (no API key needed):

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding

docs = [Document(text="hello world", metadata={"id": 1})]
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))

print(nodes[0].embedding)                          # None  <-- what the operator returns
print(index.vector_store.data.embedding_dict)      # {node_id: [0.5, ...]}  <-- where vectors actually are

Or via the operator itself with any real connection: every entry in result["chunks"] has "vector": None.

What you think should happen instead?

Suggested fixes (either works):

  1. After building the index, read vectors back from the store: index.vector_store.data.embedding_dict[node.node_id] (works for the SimpleVectorStore default; needs a fallback for stores that don't retain data).
  2. Pre-embed before building: call embed_model.get_text_embedding_batch([...]) and assign node.embedding on the original nodes first. llama_index.core.indices.utils.embed_nodes() skips nodes whose .embedding is already set, so VectorStoreIndex reuses them — no duplicate API calls, and the existing return code works unchanged. (I verified the skip behavior in 0.14.22.)
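
Both options could look roughly like the helpers below. This is only a sketch, not the provider's code: the helper names vectors_from_store and pre_embed_nodes are hypothetical, the objects are duck-typed, and the stand-in classes at the bottom exist only so the snippet runs without llama-index installed. It also assumes that embedding node.text is acceptable; the index itself may embed a metadata-aware rendering of each node, so a real fix should match whatever text the index would have used.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence


def vectors_from_store(index: Any, nodes: Sequence[Any]) -> list[Optional[list[float]]]:
    """Option 1: look each node's vector up in the index's vector store.

    Assumes a SimpleVectorStore-like store exposing ``data.embedding_dict``.
    Yields None for a node when the store does not retain embeddings, so the
    caller still needs a fallback for such stores.
    """
    data = getattr(index.vector_store, "data", None)
    embedding_dict = getattr(data, "embedding_dict", None) or {}
    return [embedding_dict.get(node.node_id) for node in nodes]


def pre_embed_nodes(nodes: Sequence[Any], embed_model: Any) -> None:
    """Option 2: embed the original nodes in place before building the index."""
    pending = [node for node in nodes if node.embedding is None]
    if not pending:
        return
    # Assumption: node.text is the text to embed (see the note above).
    vectors = embed_model.get_text_embedding_batch([node.text for node in pending])
    for node, vector in zip(pending, vectors):
        node.embedding = vector


# Hypothetical stand-ins (not llama-index classes) so the sketch is runnable.
class FakeEmbedModel:
    def get_text_embedding_batch(self, texts):
        return [[float(len(text))] * 3 for text in texts]


nodes = [
    SimpleNamespace(node_id="a", text="hello", embedding=None),
    SimpleNamespace(node_id="b", text="hi", embedding=None),
]
fake_index = SimpleNamespace(
    vector_store=SimpleNamespace(data=SimpleNamespace(embedding_dict={"a": [0.5, 0.5]}))
)

store_vectors = vectors_from_store(fake_index, nodes)  # option 1
pre_embed_nodes(nodes, FakeEmbedModel())               # option 2
```

With option 2, the operator would call pre_embed_nodes(...) on its own node list before constructing VectorStoreIndex, and the existing list comprehension that reads node.embedding could stay as it is.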

Workaround for users: set persist_dir, then load the vectors downstream with ctx = StorageContext.from_defaults(persist_dir=...) and combine ctx.vector_store.data.embedding_dict (vectors keyed by node_id) with ctx.docstore.get_node(node_id).metadata.
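
A downstream task could rebuild the documented chunk shape from the persisted storage roughly as follows. This is a sketch under assumptions: chunks_from_storage is a hypothetical helper name, the persisted store is assumed to be the SimpleVectorStore default, and fake_ctx is a stand-in so the snippet runs without llama-index or a persisted directory.

```python
from types import SimpleNamespace
from typing import Any


def chunks_from_storage(ctx: Any) -> list[dict[str, Any]]:
    """Rebuild {"text", "metadata", "vector"} dicts from a storage context."""
    chunks = []
    # embedding_dict maps node_id -> vector; the docstore holds text and metadata.
    for node_id, vector in ctx.vector_store.data.embedding_dict.items():
        node = ctx.docstore.get_node(node_id)
        chunks.append({"text": node.text, "metadata": node.metadata, "vector": vector})
    return chunks


# Hypothetical stand-in for a loaded StorageContext (not a llama-index object).
_stored_nodes = {"n1": SimpleNamespace(text="hello world", metadata={"id": 1})}
fake_ctx = SimpleNamespace(
    vector_store=SimpleNamespace(data=SimpleNamespace(embedding_dict={"n1": [0.5, 0.5]})),
    docstore=SimpleNamespace(get_node=_stored_nodes.__getitem__),
)

chunks = chunks_from_storage(fake_ctx)
```

In a real DAG, fake_ctx would be replaced by the context returned from StorageContext.from_defaults(persist_dir=...), pointing at the same persist_dir the operator wrote to.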

Operating System

No response

Deployment

None

Apache Airflow Provider(s)

common-ai

Versions of Apache Airflow Providers

Providers (pip freeze | grep apache-airflow-providers):

apache-airflow-providers-celery==3.20.0
apache-airflow-providers-common-ai==0.4.0
apache-airflow-providers-common-compat==1.15.0
apache-airflow-providers-common-io==1.7.2
apache-airflow-providers-common-sql==1.30.2
apache-airflow-providers-elasticsearch==6.5.4
apache-airflow-providers-openlineage==2.17.0
apache-airflow-providers-smtp==3.0.1
apache-airflow-providers-standard==1.13.1

Other:

llama-index-core==0.14.22
llama-index-embeddings-openai==0.6.0

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
