Skip to content

Enhancing VectorStoreWriter for better RAG support#7396

Open
Copilot wants to merge 12 commits intodata-ingestion-preview2from
copilot/fix-vectorstorewriter-injection
Open

Enhancing VectorStoreWriter for better RAG support#7396
Copilot wants to merge 12 commits intodata-ingestion-preview2from
copilot/fix-vectorstorewriter-injection

Conversation

Copy link
Contributor

Copilot AI commented Mar 15, 2026

This PR addresses two issues in the data ingestion layer:

#6967 - Allow VectorStoreWriter to accept a pre-created VectorStoreCollection: VectorStoreWriter now takes a VectorStoreCollection<Guid, TRecord> directly, allowing users to configure the collection in specific ways. The TKey generic parameter was removed since all MEVD vector collections support Guid keys. Users can also provide their own VectorStoreCollectionDefinition with custom storage names to map to pre-existing collections with different schemas.

#6968 - Figure out the metadata story: SetMetadata now throws NotSupportedException by default so that metadata schema mismatches are caught early rather than silently ignored. Users who need metadata must explicitly override the method in a derived type.

Changes

  • IngestedChunkRecord.cs: Use nameof() for property names in CreateCollectionDefinition, assign storage names via StorageName initializer. Remove TKey generic parameter (Key is always Guid). Make storage name constants private (using *StorageName suffix for consistency), except EmbeddingStorageName which is protected so derived types can reference it in [VectorStoreVector] attributes. Remove all JSON attributes and make properties virtual so derived types can annotate as needed. SetMetadata throws NotSupportedException by default to avoid silent bugs.

  • VectorStoreWriter.cs: Remove TKey generic parameter, hardcode Guid key type, simplify key generation to Guid.NewGuid().

  • VectorStoreWriterOptions.cs: Fix XML doc reference to match updated generic signature VectorStoreWriter{TChunk, TRecord}.

  • Template and snapshot files: Simplify IngestedChunk to always use Guid key (removing #if conditionals), use EmbeddingStorageName for StorageName in derived types. Remove stale using System.Text.Json.Serialization from all snapshot files.

  • Test files: Remove unnecessary JsonPropertyName attributes and using directives. Add CanWriteChunksWithCustomDefinition test proving users can create their own VectorStoreCollectionDefinition with custom storage names to map to pre-existing collections with different schemas.

  • Fix unnecessary using System in TestChunkRecordWithMetadata.cs (CI failure S1128)

  • Remove stale using System.Text.Json.Serialization from all 5 snapshot IngestedChunk.cs files (template snapshot test failures)

  • Build with -warnaserror — 0 warnings, 0 errors

  • Run tests — 123 passed, 11 skipped, 0 failed


📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.

Copilot AI and others added 5 commits March 14, 2026 20:50
…tedChunkRecord base type

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
…toreWriter refactoring

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
…ion property names

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
@github-actions github-actions bot added the area-ai-templates Microsoft.Extensions.AI.Templates label Mar 15, 2026
@adamsitnik adamsitnik added area-data-ingestion and removed area-ai-templates Microsoft.Extensions.AI.Templates labels Mar 15, 2026
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot address my feedback

Copilot stopped work on behalf of adamsitnik due to an error March 16, 2026 11:57
…al, throw in SetMetadata, make consts protected, remove JSON attributes

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copilot AI requested a review from adamsitnik March 16, 2026 12:09
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot please address my feedback

…oc reference, remove JsonPropertyName, add custom schema test

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 16, 2026

@copilot please address my feedback

All four feedback items addressed in c3c57e0: storage name consts made private, VectorStoreWriterOptions doc fixed, [JsonPropertyName] removed from test, and new CanWriteChunksWithCustomDefinition test added.

Copilot AI requested a review from adamsitnik March 16, 2026 12:36
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot address my feedback

… custom storage names in test

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copilot AI requested a review from adamsitnik March 16, 2026 12:45
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot please address my feedback

… for consistency

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot the code itself LGTM, let's see if the tests are going to pass.

In the meantime, please update the PR description and title and mention that we are fixing #6967 and #6968 and briefly describe how we are doing this.

Copy link
Contributor Author

Copilot AI commented Mar 16, 2026

@copilot the code itself LGTM, let's see if the tests are going to pass.

In the meantime, please update the PR description and title and mention that we are fixing https://github.com/dotnet/extension...

Updated the PR description to reference #6967 and #6968 with a summary of how each issue is addressed. I can't update the PR title via my tools — could you update it to something like "Fix VectorStoreWriter: remove TKey, fix property names, throw in SetMetadata (fixes #6967, #6968)"?

Copilot AI changed the title Fix VectorStoreCollectionDefinition property names and suppress analyzer warnings Fix VectorStoreWriter: remove TKey, fix property names, throw in SetMetadata Mar 16, 2026
@adamsitnik adamsitnik changed the title Fix VectorStoreWriter: remove TKey, fix property names, throw in SetMetadata Enhancing VectorStoreWriter for better RAG support Mar 16, 2026
@adamsitnik adamsitnik added this to the Data Ingestion Preview 2 milestone Mar 16, 2026
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot the CI has failed with following error:

test\Libraries\Microsoft.Extensions.DataIngestion.Tests\Writers\TestChunkRecordWithMetadata.cs(4,1): error S1128: (NETCORE_ENGINEERING_TELEMETRY=Build) Remove this unnecessary 'using'. (https://rules.sonarsource.com/csharp/RSPEC-1128

Perform full build and fix all the warnings and errors, run the tests locally before pushing changes

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 16, 2026

@copilot the CI has failed with following error:

Removed the unnecessary using System; from TestChunkRecordWithMetadata.cs in 64af799. Built with -warnaserror — 0 warnings, 0 errors. All 123 tests pass.

Copilot AI requested a review from adamsitnik March 16, 2026 13:47
@adamsitnik

This comment was marked as outdated.

…ot IngestedChunk.cs files

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 16, 2026

@copilot there are 5 test failures in the Microsoft.Extensions.AI.Templates.Tests project. Some of these:

...

Removed stale using System.Text.Json.Serialization; from all 5 snapshot IngestedChunk.cs files in 81f4817. The template no longer uses JSON attributes, so the generated code doesn't include that using directive — but the verified snapshot files still had it, causing the mismatch. Build passes with 0 warnings/errors and all 123 DataIngestion tests pass.

@adamsitnik adamsitnik marked this pull request as ready for review March 16, 2026 15:42
@adamsitnik adamsitnik requested a review from a team as a code owner March 16, 2026 15:42
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @roji the PR is ready for review.

/// </summary>
/// <typeparam name="TChunk">The type of the chunk content.</typeparam>
/// <remarks>
/// When the vector dimension count is not known at compile time, use the <see cref="CreateCollectionDefinition"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: this is very important. My idea was following:

  • introduce a non-sealed IngestedChunkRecord<TChunk> type that comes with all the default properties. It gives us the ability to perform the query to get chunks that point to the same document but also allows the users for an easy RAG (they don't need to provide their own type)
  • those who don't need any custom schema, can just use this type and call CreateCollectionDefinition to create the definition (it's required, because they need so somehow provide the dimensionCount).
  • those who need a custom schema, can create it and pass to VectoreStoreCollection ctor
  • the other way to customize the schema is to derive from IngestedChunkRecord<TChunk> and override selected properties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK... It has been a while since we discussed all this so I may have forgotten some considerations we talked about.

So If I understand correctly, this moves away from the previous dynamic approach (Dictionary<string, object?>) to a typed approach. Some comments:

  • Importantly, typed mapping with MEVD is (currently) not trimming (and therefore NativeAOT)-compatible... Serializing/deserializing a dictionary is easy enough, but doing it with a .NET type requires a source generator which we don't yet have. I think that's going to be a problem.
  • I admit it took me quite a while to understand how custom metadata is supposed to work (with SetMetadata()) and why; having a partly typed, partly dynamic story seems to introduce quite a bit of complexity/weirdness. Here's my understanding:
    • A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a Dictionary<string, object?> as before).
    • On their CustomIngestedChunkRecord, they add .NET properties for the extra metadata.
    • But they must also override SetMetadata(), to copy the dynamic metadata properties from the incoming IngestionChunk to the strongly-typed .NET properties on CustomIngestedChunkRecord. That's some various boilerplate-y, tedious glue between the dynamic nature of IngestionChunk and the static nature of IngestedChunkRecord.
    • (aside from all the above, they must also call CreateCollectionDefinition() to get the VectorStoreCollectionDefinition, and mutate that to add their custom properties. But that's unrelated)
  • Note that their code in SetMetadata() can get out of sync... Like I can see someone adding a new static property on CustomIngestedChunkRecord, but then forgetting to update SetMetadata() (and then we silently write empty properties).
    * In other words, users need to keep (1) the incoming IngestionChunk's metadata (wherever it's populated), (2) CustomIngestedChunkRecord's .NET properties, (3) SetMetadata() and (4) the record definition in sync, which seems quite brittle... With the previous, fully dynamic model only (1) and (4) needed to be kept in sync (which is the absolute minimum)
  • Because of all this, I'm trying to understand the value we get out of this partly static/partly dynamic design, compared to simply continuing to map Dictionary<string, object?> as before, and whether it's worth it...

/// Override this method in derived classes to store metadata as typed properties with
/// <see cref="VectorStoreDataAttribute"/> attributes.
/// </remarks>
public virtual void SetMetadata(string key, object? value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: So far, we were optimized for very easy ingestion. Now, the RAG is way simpler but when you need to use metadata, you need to create a derived type and handle it on your own. We throw here to avoid silent errors.

/// <summary>
/// Gets or sets the name of the collection. When not provided, "chunks" will be used.
/// </summary>
public string CollectionName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: These settings are no longer configured by the writer, they are part of the collection creation process.

[VectorStoreVector(VectorDimensions, DistanceFunction = VectorDistanceFunction, StorageName = "embedding")]
[JsonPropertyName("embedding")]
public string? Vector => Text;
[VectorStoreVector(VectorDimensions, DistanceFunction = VectorDistanceFunction, StorageName = EmbeddingStorageName)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: this is the simplest way of configuring the dimension count (overriding virtual property and annotating it with the right attribute)

switch (key)
{
case nameof(Classification):
Classification = value as string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: this example shows what it takes to store metadata now


// User creates their own definition without using CreateCollectionDefinition,
// using custom storage names to prove they can map to a pre-existing collection schema.
VectorStoreCollectionDefinition definition = new()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an example of providing custom schema (without the need to provide a dedicated mapper)

/// When the vector dimension count is known at compile time, derive from this class and add
/// the <see cref="VectorStoreVectorAttribute"/> to the <see cref="Embedding"/> property.
/// </remarks>
public class IngestedChunkRecord<TChunk>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: I am going to work on removing this generic argument very soon (I want IngestionChunk to be able to represent any input without using generic argument). But it's out of the scope of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're going to remove the generic argument, but FYI the TChunk name is causing me a bit of confusion, also in IngestionChunk<TChunk> (as if it's a chunk over itself). When reading this code I wasn't sure if with IngestedChunkRecord<TChunk>, TChunk should be string or IngestionChunk<string>.

So maybe consider renaming TChunk to just T (or TContent) everywhere.

Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @adamsitnik here are some first thoughts/questions about the new design... We should probably get past these before me reviewing the rest of the PR in detail. Feels like maybe we should jump into a call to discuss this stuff.

/// When the vector dimension count is known at compile time, derive from this class and add
/// the <see cref="VectorStoreVectorAttribute"/> to the <see cref="Embedding"/> property.
/// </remarks>
public class IngestedChunkRecord<TChunk>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're going to remove the generic argument, but FYI the TChunk name is causing me a bit of confusion, also in IngestionChunk<TChunk> (as if it's a chunk over itself). When reading this code I wasn't sure if with IngestedChunkRecord<TChunk>, TChunk should be string or IngestionChunk<string>.

So maybe consider renaming TChunk to just T (or TContent) everywhere.

/// <returns>A <see cref="VectorStoreCollectionDefinition"/> suitable for creating a vector store collection.</returns>
/// <exception cref="ArgumentOutOfRangeException"><paramref name="dimensionCount"/> is less than or equal to zero.</exception>
#pragma warning disable CA1000 // Do not declare static members on generic types - needs access to TChunk type parameter
public static VectorStoreCollectionDefinition CreateCollectionDefinition(int dimensionCount, string? distanceFunction = null, string? indexKind = null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be called something like CreateDefaultCollectionDefinition, since the returned definition has no custom metadata properties; when these exist, the user must mutate the returned VectorStoreCollectionDefinition to add them.

/// </summary>
/// <typeparam name="TChunk">The type of the chunk content.</typeparam>
/// <remarks>
/// When the vector dimension count is not known at compile time, use the <see cref="CreateCollectionDefinition"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK... It has been a while since we discussed all this so I may have forgotten some considerations we talked about.

So If I understand correctly, this moves away from the previous dynamic approach (Dictionary<string, object?>) to a typed approach. Some comments:

  • Importantly, typed mapping with MEVD is (currently) not trimming (and therefore NativeAOT)-compatible... Serializing/deserializing a dictionary is easy enough, but doing it with a .NET type requires a source generator which we don't yet have. I think that's going to be a problem.
  • I admit it took me quite a while to understand how custom metadata is supposed to work (with SetMetadata()) and why; having a partly typed, partly dynamic story seems to introduce quite a bit of complexity/weirdness. Here's my understanding:
    • A user that wants custom metadata must extend IngestedChunkRecord (this already feels a bit heavy compared to just having a Dictionary<string, object?> as before).
    • On their CustomIngestedChunkRecord, they add .NET properties for the extra metadata.
    • But they must also override SetMetadata(), to copy the dynamic metadata properties from the incoming IngestionChunk to the strongly-typed .NET properties on CustomIngestedChunkRecord. That's some various boilerplate-y, tedious glue between the dynamic nature of IngestionChunk and the static nature of IngestedChunkRecord.
    • (aside from all the above, they must also call CreateCollectionDefinition() to get the VectorStoreCollectionDefinition, and mutate that to add their custom properties. But that's unrelated)
  • Note that their code in SetMetadata() can get out of sync... Like I can see someone adding a new static property on CustomIngestedChunkRecord, but then forgetting to update SetMetadata() (and then we silently write empty properties).
    * In other words, users need to keep (1) the incoming IngestionChunk's metadata (wherever it's populated), (2) CustomIngestedChunkRecord's .NET properties, (3) SetMetadata() and (4) the record definition in sync, which seems quite brittle... With the previous, fully dynamic model only (1) and (4) needed to be kept in sync (which is the absolute minimum)
  • Because of all this, I'm trying to understand the value we get out of this partly static/partly dynamic design, compared to simply continuing to map Dictionary<string, object?> as before, and whether it's worth it...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants