Feature Request: Embeddings as Native LLM Input
The Problem — The Lossy Round-Trip
Every retrieval-augmented system today follows the same pipeline:
text → embedding → vector search → retrieve text → feed to LLM
The embedding step captures semantic meaning in high-dimensional space (e.g., 4096 dimensions). The retrieval step then converts it back to flat text — discarding the geometric relationships, cluster positions, and distance signals that the vector space already computed. The LLM then re-encodes that text into its own internal representations, reconstructing what the embedding already knew.
This is a lossy round-trip. The information exists in vector form, gets serialized to text, and then gets re-vectorized internally by the model. The intermediate text step is a bottleneck — both in fidelity and in token cost.
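The round-trip can be made concrete with a toy retrieval loop (a pure-Python sketch; the memory store, vectors, and texts here are illustrative, not from any real system):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy memory store: each entry keeps both the vector and its source text.
memories = [
    {"text": "Shipped the new retrieval fusion layer.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Fixed a flaky Kafka consumer.",           "vector": [0.1, 0.8, 0.2]},
]

query_vector = [0.85, 0.15, 0.05]  # embedding of the user's question

# Vector search: similarities are computed in embedding space...
ranked = sorted(
    memories,
    key=lambda m: cosine_similarity(query_vector, m["vector"]),
    reverse=True,
)

# ...but the prompt keeps only the text. The scores, and the vectors
# themselves, are discarded at this step -- the lossy round-trip.
prompt = "\n".join(m["text"] for m in ranked)
print(prompt)
```

The geometry does all the work of ranking, then vanishes: only the flat text survives into the prompt.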
The Proposal — Vector Prompt Interface
What if the Messages API accepted embedding vectors as a native input modality, alongside text and images?
Conceptually, this would look like a new content block type:
response = client.messages.create(
    model="claude-opus-4-6",
    messages=[{
        "role": "user",
        "content": [
            # Traditional text context
            {"type": "text", "text": "Given these retrieved memories, answer my question:"},
            # NEW: embedding vectors injected at the input-encoding layer
            {
                "type": "embedding",
                "vectors": [
                    {"data": [0.0234, -0.0891, ...], "dimensions": 4096, "label": "memory_1"},
                    {"data": [0.0112, -0.0453, ...], "dimensions": 4096, "label": "memory_2"},
                ],
                "model": "qwen3-embedding-8b",  # or voyage-3, text-embedding-3-large, etc.
                "metadata": {
                    "distances": [0.12, 0.34],        # cosine distances from query
                    "retrieval_scores": [0.95, 0.82]  # fusion scores if available
                }
            },
            {"type": "text", "text": "What was the key breakthrough last week?"}
        ]
    }]
)
Why This Matters
- Lossless retrieval context — Geometric relationships between retrieved items (cluster distances, traversal paths, similarity scores) arrive intact instead of being serialized to text descriptions.
- Token efficiency — A 4096-dimensional embedding carries the semantic weight of thousands of tokens in a single vector. Systems with 32K+ retrievable items could provide richer context without hitting token limits.
- Native agent-to-agent communication — Multi-agent systems increasingly use embeddings as inter-agent signals (e.g., binary-quantized vectors streamed via Kafka). Accepting these natively eliminates the serialization/deserialization overhead.
- Retrieval metadata preservation — Fusion scores, distance metrics, graph traversal paths, and other retrieval signals could be passed directly rather than described in natural language.
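A back-of-envelope comparison shows why pasting raw vector values into the prompt is not a workaround for the token-efficiency point (the ~8 characters per serialized value and ~4 characters per token are rough rules of thumb, not measured figures):

```python
DIMENSIONS = 4096
CHARS_PER_VALUE = 8   # e.g. "-0.0891," -- sign, digits, separator (assumed)
CHARS_PER_TOKEN = 4   # rough rule of thumb for English-like text

# Serializing one embedding as decimal text inside the prompt:
text_chars = DIMENSIONS * CHARS_PER_VALUE
text_tokens = text_chars // CHARS_PER_TOKEN
print(f"~{text_tokens} tokens per vector as text")    # ~8192 tokens

# The same vector as raw float32 is a fixed-size binary payload:
binary_bytes = DIMENSIONS * 4
print(f"{binary_bytes} bytes per vector as float32")  # 16384 bytes (16 KiB)
```

Even under generous assumptions, one vector serialized as decimal text costs thousands of tokens, while a native embedding block would carry the same information as a small fixed-size payload.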
Precedent
Images proved that LLMs can process non-text modalities at the input-encoding layer. The architecture already supports multimodal input. Embeddings are semantically closer to the model's internal representations than pixels are — they're a more natural fit for this pattern.
Real-World Use Case
We operate a multi-agent system (UCIS) with:
- 32,000+ memories in a graph database, each with 4096d embeddings
- 6-signal retrieval fusion (vector search, keyword, temporal, Q-value, foresight, ACT-R decay) producing ranked results with rich scoring metadata
- Binary-quantized embedding streaming between 12 agents via Kafka
- 3 embedding pipelines (nightly batch, real-time streaming, on-demand)
The entire infrastructure produces rich vector representations — and then throws them away at the last mile, converting back to text for the API call. Every system doing RAG at scale has this same bottleneck.
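To illustrate the binary-quantized streaming mentioned above, here is a minimal sketch of sign-based one-bit quantization, a common scheme that packs eight dimensions per byte (the exact quantizer any given system uses will differ; this is illustrative only):

```python
def binary_quantize(vector):
    """One-bit quantization: keep the sign of each dimension,
    packed 8 dimensions per byte (LSB-first within each byte)."""
    out = bytearray()
    for i in range(0, len(vector), 8):
        byte = 0
        for bit, value in enumerate(vector[i:i + 8]):
            if value > 0:
                byte |= 1 << bit
        out.append(byte)
    return bytes(out)

vec = [0.3, -0.1, 0.7, -0.2, 0.0, 0.5, -0.9, 0.4]
packed = binary_quantize(vec)
print(len(packed))  # 8 dimensions pack into a single byte
```

At this rate a 4096d vector packs into 512 bytes, versus 16 KiB as float32 — exactly the kind of compact inter-agent signal that currently has to be inflated back into text before it can reach the model.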
Summary
This is a feature request, not a research problem. The embedding infrastructure exists across the ecosystem. The multimodal input architecture exists in the model. What's missing is the API surface to connect them.
Thank you for considering this.