LlamaIndexEmbeddingOperator returns vector=None for every chunk #68416

### Under which category would you file this issue?

Providers

### Apache Airflow version

3.2.2+astro.1

### What happened and how to reproduce it?

**Versions (as tested):**
- apache-airflow-providers-common-ai **0.4.0** (bug also present on `main` as of 2026-06-11)
- llama-index-core **0.14.22**, llama-index-embeddings-openai 0.6.0
- Apache Airflow 3.2.2 (Astro Runtime 3.2-5), Python 3.13

**Summary:**
`LlamaIndexEmbeddingOperator.execute()` returns `{"chunks": [{"text", "metadata", "vector"}], ...}`, but `vector` is always `None`. Downstream tasks consuming the documented chunk output (e.g. inserting vectors into a vector table) fail or silently store nulls.

**Root cause:**
The operator builds the index and then reads embeddings back off its own local `nodes` list, relying on a side effect that doesn't exist ([`llamaindex_embedding.py` lines ~128–149 at tag `providers-common-ai/0.4.0`](https://github.com/apache/airflow/blob/providers-common-ai/0.4.0/providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py)):

```python
# ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
# side effect of building the index; ...
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=False)
...
chunks = [{"text": node.text, "metadata": node.metadata, "vector": node.embedding} for node in text_nodes]
```

But `VectorStoreIndex._get_node_with_embedding()` in llama-index-core attaches embeddings to **copies**, never the originals:

```python
result = node.model_copy()
result.embedding = embedding
```

I checked llama-index-core tags v0.10.68, v0.11.23, v0.12.52, v0.13.5, and 0.14.22; all of them copy (older ones via `node.copy()`). The side-effect assumption has therefore never held, and no version pin fixes it. The embeddings end up only inside the index's vector store (`index.vector_store.data.embedding_dict` for `SimpleVectorStore`, keyed by node_id).
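To make the copy semantics concrete, here is a stdlib-only model of the failure. `FakeNode` and `build_index` are hypothetical stand-ins written for this illustration, not llama-index classes or functions; they only mimic the copy-then-assign pattern quoted above.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a llama-index node (not the real class)."""
    node_id: str
    text: str
    embedding: Optional[list] = None


def build_index(nodes):
    """Mimic the copy-then-assign pattern: embeddings land on copies only."""
    store = {}
    for node in nodes:
        result = copy.copy(node)          # the original node is never mutated
        result.embedding = [0.5] * 4      # dummy vector, assigned to the copy
        store[result.node_id] = result.embedding
    return store


nodes = [FakeNode("n1", "hello world")]
store = build_index(nodes)

print(nodes[0].embedding)  # None: what reading the caller's list gives back
print(store["n1"])         # [0.5, 0.5, 0.5, 0.5]: the vector lives in the store
```

Any caller that keeps a reference to the original list, as the operator does, sees `None` regardless of how the index was built.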

**Minimal reproduction (no API key needed):**

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding

docs = [Document(text="hello world", metadata={"id": 1})]
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))

print(nodes[0].embedding)                          # None  <-- what the operator returns
print(index.vector_store.data.embedding_dict)      # {node_id: [0.5, ...]}  <-- where vectors actually are
```

The same result appears when running the operator itself against any real connection: every entry in `result["chunks"]` has `"vector": None`.

### What you think should happen instead?

**Suggested fixes (either works):**
1. After building the index, read vectors back from the store: `index.vector_store.data.embedding_dict[node.node_id]` (works for the `SimpleVectorStore` default; needs a fallback for stores that don't retain `data`).
2. Pre-embed before building: call `embed_model.get_text_embedding_batch([...])` and assign `node.embedding` on the original nodes first. `llama_index.core.indices.utils.embed_nodes()` skips nodes whose `.embedding` is already set, so `VectorStoreIndex` reuses them — no duplicate API calls, and the existing return code works unchanged. (I verified the skip behavior in 0.14.22.)
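A rough sketch of option 2, assuming duck-typed nodes and embed model so that it runs without llama-index installed. `pre_embed_nodes`, `FakeNode`, and `FakeEmbedModel` are hypothetical names for this illustration only; `get_text_embedding_batch` and `get_content` are the only llama-index method names it relies on.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


def pre_embed_nodes(nodes, embed_model, metadata_mode: Any = None) -> None:
    """Assign embeddings on the ORIGINAL node objects, in place.

    Only nodes that have no embedding yet are sent to the model, in one batch.
    """
    pending = [node for node in nodes if node.embedding is None]
    if not pending:
        return
    if metadata_mode is None:
        texts = [node.get_content() for node in pending]
    else:
        texts = [node.get_content(metadata_mode=metadata_mode) for node in pending]
    vectors = embed_model.get_text_embedding_batch(texts)
    for node, vector in zip(pending, vectors):
        node.embedding = vector


# --- Hypothetical stand-ins so the sketch runs without llama-index ---
@dataclass
class FakeNode:
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[list] = None

    def get_content(self, metadata_mode=None) -> str:
        return self.text


class FakeEmbedModel:
    def __init__(self) -> None:
        self.calls = 0

    def get_text_embedding_batch(self, texts):
        self.calls += 1
        return [[float(len(text))] for text in texts]  # dummy 1-d vectors


nodes = [FakeNode("hello world"), FakeNode("cached", embedding=[1.0])]
model = FakeEmbedModel()
pre_embed_nodes(nodes, model)

# The operator's existing return expression would now see real vectors.
chunks = [{"text": n.text, "metadata": n.metadata, "vector": n.embedding} for n in nodes]
```

In the operator this would run just before `VectorStoreIndex(...)` is built. Passing `MetadataMode.EMBED` from `llama_index.core.schema` as `metadata_mode` should match the text that `embed_nodes()` itself embeds, though that is worth confirming against the pinned llama-index-core version.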

**Workaround for users:** set `persist_dir`, then load vectors downstream via `StorageContext.from_defaults(persist_dir=...)` → `ctx.vector_store.data.embedding_dict` + `ctx.docstore.get_node(node_id).metadata`.
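The downstream half of that workaround could look roughly like the helper below. `chunks_from_storage` is a hypothetical name, and the `SimpleNamespace` objects only imitate the attribute layout named above (`vector_store.data.embedding_dict`, `docstore.get_node`) so the sketch runs without llama-index; it assumes a `SimpleVectorStore`-style store and text nodes that expose `.text`.

```python
from types import SimpleNamespace


def chunks_from_storage(ctx) -> list:
    """Rebuild the documented chunk shape from a loaded storage context."""
    chunks = []
    for node_id, vector in ctx.vector_store.data.embedding_dict.items():
        node = ctx.docstore.get_node(node_id)
        chunks.append({"text": node.text, "metadata": node.metadata, "vector": vector})
    return chunks


# --- Hypothetical stand-in for a loaded StorageContext ---
fake_node = SimpleNamespace(text="hello world", metadata={"id": 1})
ctx = SimpleNamespace(
    vector_store=SimpleNamespace(
        data=SimpleNamespace(embedding_dict={"n1": [0.5, 0.5]})
    ),
    docstore=SimpleNamespace(get_node=lambda node_id: {"n1": fake_node}[node_id]),
)

chunks = chunks_from_storage(ctx)
print(chunks)
```

In a real downstream task, `ctx` would come from `StorageContext.from_defaults(persist_dir=...)` pointing at the directory the operator persisted to.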

### Operating System

_No response_

### Deployment

None

### Apache Airflow Provider(s)

common-ai

### Versions of Apache Airflow Providers

**Providers** (`pip freeze | grep apache-airflow-providers`):
```
apache-airflow-providers-celery==3.20.0
apache-airflow-providers-common-ai==0.4.0
apache-airflow-providers-common-compat==1.15.0
apache-airflow-providers-common-io==1.7.2
apache-airflow-providers-common-sql==1.30.2
apache-airflow-providers-elasticsearch==6.5.4
apache-airflow-providers-openlineage==2.17.0
apache-airflow-providers-smtp==3.0.1
apache-airflow-providers-standard==1.13.1
```

Other:
```
llama-index-core==0.14.22
llama-index-embeddings-openai==0.6.0
```

### Official Helm Chart version

Not Applicable

### Kubernetes Version

_No response_

### Helm Chart configuration

_No response_

### Docker Image customizations

_No response_

### Anything else?

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

