Under which category would you file this issue?
Providers
Apache Airflow version
3.2.2+astro.1
What happened and how to reproduce it?
Versions (as tested):
- apache-airflow-providers-common-ai 0.4.0 (bug also present on main as of 2026-06-11)
- llama-index-core 0.14.22, llama-index-embeddings-openai 0.6.0
- Apache Airflow 3.2.2 (Astro Runtime 3.2-5), Python 3.13
Summary:
LlamaIndexEmbeddingOperator.execute() returns {"chunks": [{"text", "metadata", "vector"}], ...}, but vector is always None. Downstream tasks consuming the documented chunk output (e.g. inserting vectors into a vector table) fail or silently store nulls.
Root cause:
The operator builds the index and then reads embeddings back off its own local nodes list, relying on a side effect that doesn't exist (llamaindex_embedding.py lines ~128–149 at tag providers-common-ai/0.4.0):
# ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
# side effect of building the index; ...
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)
...
chunks = [{"text": node.text, "metadata": node.metadata, "vector": node.embedding} for node in text_nodes]
But VectorStoreIndex._get_node_with_embedding() in llama-index-core attaches embeddings to copies, never the originals:
result = node.model_copy()
result.embedding = embedding
I checked llama-index-core tags v0.10.68, v0.11.23, v0.12.52, v0.13.5, and 0.14.22; all of them copy (older ones via node.copy()). So the side-effect assumption has never held, and no version pin fixes it. The embeddings end up only inside the index's vector store (index.vector_store.data.embedding_dict for SimpleVectorStore, keyed by node_id).
Minimal reproduction (no API key needed):
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
docs = [Document(text="hello world", metadata={"id": 1})]
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))
print(nodes[0].embedding) # None <-- what the operator returns
print(index.vector_store.data.embedding_dict) # {node_id: [0.5, ...]} <-- where vectors actually are
Or via the operator itself with any real connection: every entry in result["chunks"] has "vector": None.
What you think should happen instead?
Suggested fixes (either works):
- After building the index, read vectors back from the store: index.vector_store.data.embedding_dict[node.node_id] (works for the SimpleVectorStore default; needs a fallback for stores that don't retain data).
- Pre-embed before building: call embed_model.get_text_embedding_batch([...]) and assign node.embedding on the original nodes first. llama_index.core.indices.utils.embed_nodes() skips nodes whose .embedding is already set, so VectorStoreIndex reuses them: no duplicate API calls, and the existing return code works unchanged. (I verified the skip behavior in 0.14.22.)
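A rough sketch of the pre-embed option, to show the shape of the change rather than a finished patch. The helper name pre_embed_nodes and the FakeNode / FakeEmbedModel stand-ins are hypothetical and exist only so the snippet runs without llama-index installed. The sketch assumes only the attributes named above (node.embedding, embed_model.get_text_embedding_batch) plus node.get_content(). In the real operator the text passed to the embed model should probably match what the index itself would embed (I believe that is get_content(metadata_mode=MetadataMode.EMBED)), which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def pre_embed_nodes(nodes, embed_model):
    """Embed the ORIGINAL nodes in one batch, before any index is built.

    Hypothetical helper: nodes that already carry an embedding are left alone,
    so calling it twice does not trigger a second embedding request.
    """
    pending = [node for node in nodes if node.embedding is None]
    if pending:
        # Assumption: plain get_content(); a real patch may need the
        # metadata-aware variant so vectors match what the index would compute.
        texts = [node.get_content() for node in pending]
        vectors = embed_model.get_text_embedding_batch(texts)
        for node, vector in zip(pending, vectors):
            node.embedding = vector
    return nodes


# --- Hypothetical stand-ins (NOT llama-index classes) ---
@dataclass
class FakeNode:
    node_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[List[float]] = None

    def get_content(self) -> str:
        return self.text


class FakeEmbedModel:
    def __init__(self) -> None:
        self.calls = 0  # counts batch requests, to show there is only one

    def get_text_embedding_batch(self, texts):
        self.calls += 1
        return [[float(len(text)), 0.5] for text in texts]


nodes = [FakeNode("n1", "hello world"), FakeNode("n2", "airflow")]
model = FakeEmbedModel()
pre_embed_nodes(nodes, model)

# Same shape as the operator's existing return code; vectors are now populated.
chunks = [
    {"text": node.text, "metadata": node.metadata, "vector": node.embedding}
    for node in nodes
]
```

With the originals embedded up front, the existing list comprehension in the operator would no longer depend on what VectorStoreIndex does to its internal copies.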
Workaround for users: set persist_dir, then load vectors downstream via StorageContext.from_defaults(persist_dir=...) → ctx.vector_store.data.embedding_dict + ctx.docstore.get_node(node_id).metadata.
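Both the first suggested fix and this workaround read from embedding_dict keyed by node_id, so a defensive lookup might look like the sketch below. The function name vectors_by_node_id is hypothetical, and the stores are modelled with SimpleNamespace stand-ins rather than real llama-index objects; the only assumption carried over from the report is the .data.embedding_dict attribute path. A store without that path yields None per node instead of raising, which is where a real fallback would go.

```python
from types import SimpleNamespace


def vectors_by_node_id(vector_store, node_ids):
    """Return {node_id: vector or None} from a SimpleVectorStore-like object.

    Hypothetical helper: stores that do not expose .data.embedding_dict
    produce None for every node rather than an AttributeError.
    """
    data = getattr(vector_store, "data", None)
    embedding_dict = getattr(data, "embedding_dict", None) or {}
    return {node_id: embedding_dict.get(node_id) for node_id in node_ids}


# Hypothetical stand-ins shaped like the attributes named in this report.
simple_store = SimpleNamespace(
    data=SimpleNamespace(embedding_dict={"n1": [0.5, 0.5]})
)
remote_store = SimpleNamespace()  # models a store that does not retain data

found = vectors_by_node_id(simple_store, ["n1", "n2"])
missing = vectors_by_node_id(remote_store, ["n1"])
```

Returning None explicitly keeps the "vector missing" case visible to the caller, which can then raise or re-embed instead of silently storing nulls.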
Operating System
No response
Deployment
None
Apache Airflow Provider(s)
common-ai
Versions of Apache Airflow Providers
Providers (pip freeze | grep apache-airflow-providers):
apache-airflow-providers-celery==3.20.0
apache-airflow-providers-common-ai==0.4.0
apache-airflow-providers-common-compat==1.15.0
apache-airflow-providers-common-io==1.7.2
apache-airflow-providers-common-sql==1.30.2
apache-airflow-providers-elasticsearch==6.5.4
apache-airflow-providers-openlineage==2.17.0
apache-airflow-providers-smtp==3.0.1
apache-airflow-providers-standard==1.13.1
Other:
llama-index-core==0.14.22
llama-index-embeddings-openai==0.6.0
Official Helm Chart version
Not Applicable
Kubernetes Version
No response
Helm Chart configuration
No response
Docker Image customizations
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct