
fix: preserve preprocessing pipeline state during serialization#936

Merged
ooples merged 4 commits into master from fix/preprocessing-serialization
Mar 7, 2026
Conversation

Owner
@ooples ooples commented Mar 7, 2026

Summary

Fixes #930 — Preprocessing pipeline fitted state (means, standard deviations, min/max values, etc.) was lost during serialization/deserialization because Newtonsoft.Json's DefaultContractResolver only serializes public properties by default, not private fields where transformers store their fitted parameters.
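The failure mode is easy to reproduce in isolation. The sketch below is illustrative only (not the library's actual scaler code); it shows how fitted state held in a private field silently vanishes on a default Newtonsoft.Json round-trip, and how `[JsonProperty]` on the field fixes it.

```csharp
using Newtonsoft.Json;

public class BrokenScaler
{
    private double _mean;  // fitted state: invisible to the default contract resolver
    public void Fit(double[] x) { _mean = 0; foreach (var v in x) _mean += v / x.Length; }
    public double Mean => _mean;  // read-only: serialized out, but never set on deserialize
}

public class FixedScaler
{
    [JsonProperty] private double _mean;  // opt the private field into (de)serialization
    public void Fit(double[] x) { _mean = 0; foreach (var v in x) _mean += v / x.Length; }
    public double Mean => _mean;
}

public static class Demo
{
    public static void Main()
    {
        var data = new[] { 1.0, 2.0, 3.0 };

        var broken = new BrokenScaler(); broken.Fit(data);
        var b2 = JsonConvert.DeserializeObject<BrokenScaler>(JsonConvert.SerializeObject(broken));
        System.Console.WriteLine(b2!.Mean);  // 0 — fitted mean lost in the round-trip

        var fs = new FixedScaler(); fs.Fit(data);
        var f2 = JsonConvert.DeserializeObject<FixedScaler>(JsonConvert.SerializeObject(fs));
        System.Console.WriteLine(f2!.Mean);  // 2 — fitted mean survives
    }
}
```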

Changes

  • Add [JsonProperty] attributes to private fields across 11 preprocessing transformers (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler, DecimalScaler, GlobalContrastScaler, LogScaler, LogMeanVarianceScaler, LpNormScaler, Normalizer, SimpleImputer) and TransformerBase
  • Add [JsonConstructor] to MinMaxScaler and RobustScaler to resolve ambiguous constructor selection during deserialization (both have two constructors)
  • Replace ValueTuple in PreprocessingPipeline._steps with a new PipelineStep<T, TInput> class — ValueTuples don't serialize reliably with Newtonsoft.Json (field names are lost)
  • Add [JsonProperty(TypeNameHandling = TypeNameHandling.Auto)] to PreprocessingInfo.Pipeline and TargetPipeline for polymorphic deserialization of pipeline step transformers
  • Add 16 integration tests verifying serialization round-trips for all modified transformers, multi-step pipelines, PreprocessingInfo, and inverse transforms after deserialization
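The changes above can be pictured with a minimal sketch of the new step type. Names follow the PR description and later review commits (validating constructor, private setters, tuple-style `Deconstruct`); the real implementation may differ, and the `IDataTransformer` stub here only stands in for the project's interface.

```csharp
using System;
using Newtonsoft.Json;

// Stand-in for the project's transformer interface (members elided).
public interface IDataTransformer<T, TInput, TOutput> { }

// Hypothetical sketch of the serializable step type that replaces the ValueTuple.
public class PipelineStep<T, TInput>
{
    [JsonProperty] public string Name { get; private set; }

    [JsonProperty] public IDataTransformer<T, TInput, TInput> Transformer { get; private set; }

    [JsonConstructor]
    public PipelineStep(string name, IDataTransformer<T, TInput, TInput> transformer)
    {
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("Step name must be non-empty.", nameof(name));
        Name = name;
        Transformer = transformer ?? throw new ArgumentNullException(nameof(transformer));
    }

    // Backward-compatible tuple-style destructuring: var (name, t) = step;
    public void Deconstruct(out string name, out IDataTransformer<T, TInput, TInput> transformer)
    {
        name = Name;
        transformer = Transformer;
    }
}
```

Unlike a ValueTuple, both members carry stable JSON names and the `[JsonConstructor]` gives Newtonsoft.Json an unambiguous way to rebuild a validated instance.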

Files changed

  • 14 source files in src/Preprocessing/ (attribute additions + PipelineStep class)
  • 1 new test file: tests/.../PreprocessingSerializationIntegrationTests.cs

Test plan

  • All 16 new serialization round-trip tests pass
  • All 58 existing PreprocessingPipelineIntegrationTests pass (no regressions)
  • All 49 existing ScalerIntegrationTests pass
  • All 24 existing LinearRegressionIntegrationTests pass
  • Build succeeds on net10.0
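A representative round-trip test looks roughly like the sketch below. It is a hedged approximation: `StandardScaler<double>`, its `Fit`/`Transform` signatures, the matrix `Rows`/`Columns`/indexer API, and the `CreateTestMatrix` helper are assumed from the PR description, and the serializer settings mirror what the PR says is used for model persistence.

```csharp
using Newtonsoft.Json;
using Xunit;

public class SerializationRoundTripSketch
{
    [Fact]
    public void StandardScaler_RoundTrip_PreservesTransform()
    {
        var settings = new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.Auto };

        var scaler = new StandardScaler<double>();
        var data = CreateTestMatrix();  // assumed test helper
        scaler.Fit(data);

        // Round-trip through JSON, then verify behavioral equivalence.
        var json = JsonConvert.SerializeObject(scaler, settings);
        var restored = JsonConvert.DeserializeObject<StandardScaler<double>>(json, settings);

        Assert.NotNull(restored);
        var expected = scaler.Transform(data);
        var actual = restored!.Transform(data);
        for (int i = 0; i < expected.Rows; i++)
            for (int j = 0; j < expected.Columns; j++)
                Assert.Equal(expected[i, j], actual[i, j], 10);  // 10 decimal places
    }
}
```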

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • JSON serialization support added across preprocessing components so fitted state, configurations, named pipeline steps, and nested pipelines can be persisted and restored without changing runtime behavior.
  • Tests

    • Comprehensive integration tests added to verify round-trip serialization/deserialization and correctness of Transform/InverseTransform after restore.

- Add [JsonProperty] attributes to private fields across all preprocessing
  transformers (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
  DecimalScaler, GlobalContrastScaler, LogScaler, LogMeanVarianceScaler,
  LpNormScaler, Normalizer, SimpleImputer) so Newtonsoft.Json serializes
  fitted state (means, std devs, min/max, etc.)

- Add [JsonProperty] to TransformerBase.IsFitted and ColumnIndices for
  proper deserialization of base class state

- Add [JsonConstructor] to MinMaxScaler and RobustScaler to resolve
  ambiguous constructor selection during deserialization

- Replace ValueTuple in PreprocessingPipeline._steps with a new
  PipelineStep<T, TInput> class that serializes reliably with
  Newtonsoft.Json (ValueTuples lose field names during serialization)

- Add [JsonProperty(TypeNameHandling = TypeNameHandling.Auto)] to
  PreprocessingInfo.Pipeline and TargetPipeline for polymorphic
  deserialization of pipeline step transformers

- Add 16 integration tests verifying serialization round-trips for all
  modified transformers, pipelines, and inverse transforms

Fixes #930

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 13:51

@coderabbitai
Contributor

coderabbitai bot commented Mar 7, 2026

Walkthrough

Adds Newtonsoft.Json attributes across preprocessing components to enable (de)serialization of fitted state, replaces tuple-backed pipeline steps with a public serializable PipelineStep<T, TInput>, tightens TransformerBase.ColumnIndices setter to private, and adds extensive integration tests verifying JSON round-trips and inverse transforms.

Changes

Cohort / File(s) Summary
Scaler JSON serialization
src/Preprocessing/Scalers/DecimalScaler.cs, src/Preprocessing/Scalers/GlobalContrastScaler.cs, src/Preprocessing/Scalers/LogMeanVarianceScaler.cs, src/Preprocessing/Scalers/LogScaler.cs, src/Preprocessing/Scalers/LpNormScaler.cs, src/Preprocessing/Scalers/MaxAbsScaler.cs, src/Preprocessing/Scalers/Normalizer.cs, src/Preprocessing/Scalers/StandardScaler.cs
Added using Newtonsoft.Json and [JsonProperty] annotations to private fields and [JsonIgnore] to public accessors where appropriate so internal scaler state is included/excluded in JSON (no algorithmic changes).
Advanced scaler serialization
src/Preprocessing/Scalers/MinMaxScaler.cs, src/Preprocessing/Scalers/RobustScaler.cs
Annotated internal state with [JsonProperty], added [JsonIgnore] to public accessors, and marked constructors with [JsonConstructor] to support deserialization while preserving public API signatures.
Base transformer / imputer serialization
src/Preprocessing/Imputers/SimpleImputer.cs, src/Preprocessing/TransformerBase.cs
Applied [JsonProperty] to internal fields/properties; tightened TransformerBase.ColumnIndices setter to private set; annotated IsFitted for serialization. No public-signature removals, but mutability changed for ColumnIndices.
Pipeline architecture / serialization
src/Preprocessing/PreprocessingPipeline.cs
Introduced public PipelineStep<T, TInput> (serializable) and replaced tuple-backed _steps with List<PipelineStep<T, TInput>>; updated Steps property, Add methods, iteration, and added [JsonProperty] annotations to enable pipeline (de)serialization.
Preprocessing metadata serialization
src/Preprocessing/PreprocessingInfo.cs
Annotated Pipeline and TargetPipeline with [JsonProperty(TypeNameHandling = TypeNameHandling.Auto)] to enable polymorphic pipeline (de)serialization.
Integration tests
tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs
Added comprehensive round-trip JSON integration tests covering many scalers, SimpleImputer, pipelines, PreprocessingInfo, column indices handling, inverse-transform verification, and serializer settings to assert fidelity after deserialization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

feature

Poem

🧩 Pipelines learn to pack and send,
Private fields no longer bend,
Steps named clearly, state retained,
Round-trip JSON — no logic strained,
Tests applaud the fix — well-aimed! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: preserve preprocessing pipeline state during serialization' directly and clearly summarizes the main change: enabling JSON serialization of preprocessing transformer state by adding JsonProperty attributes and replacing ValueTuples with a proper PipelineStep class.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Preprocessing/PreprocessingPipeline.cs`:
- Around line 35-39: The parameterless PipelineStep constructor sets Transformer
to null! solely for JSON deserialization but can be misused by callers; update
it to discourage manual construction by either marking the constructor internal
(or private) and adding the [JsonConstructor] attribute, or add a clear comment
above the constructor explaining it's only for deserialization and should not be
used directly; reference the PipelineStep() constructor and the Transformer
property so reviewers can locate and apply the change.

In
`@tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs`:
- Around line 239-258: The test SimpleImputer_RoundTrip_PreservesFittedState
should also verify that Transform behavior is preserved after serialization:
after fitting imputer with CreateTestMatrix(), capture a small test input (e.g.,
a matrix/row with missing values), call imputer.Transform(testInput) to get
expectedOutput, then call deserialized.Transform(testInput) to get actualOutput
and assert they are equal (element-wise) using existing test helpers or
Assert.Equal for sequences/arrays; update the test to use SimpleImputer<double>,
CreateTestMatrix(), imputer.Transform and deserialized.Transform and compare
outputs to ensure transform equivalence.
- Around line 138-166: The test
MinMaxScaler_CustomRange_RoundTrip_PreservesRange should explicitly assert that
the FeatureRangeMin and FeatureRangeMax values survive serialization: after
deserializing (variable deserialized) add assertions that
deserialized.FeatureRangeMin equals scaler.FeatureRangeMin and
deserialized.FeatureRangeMax equals scaler.FeatureRangeMax (using the same
tolerance as numeric compares if needed); keep the existing transform
comparisons but add these direct property checks to make the intent explicit
(location: inside the test after Assert.NotNull(deserialized) and before
comparing transformed matrices).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ba481d87-22fb-46b3-9ceb-35e04c5a6310

📥 Commits

Reviewing files that changed from the base of the PR and between 640251b and ae9eed6.

📒 Files selected for processing (15)
  • src/Preprocessing/Imputers/SimpleImputer.cs
  • src/Preprocessing/PreprocessingInfo.cs
  • src/Preprocessing/PreprocessingPipeline.cs
  • src/Preprocessing/Scalers/DecimalScaler.cs
  • src/Preprocessing/Scalers/GlobalContrastScaler.cs
  • src/Preprocessing/Scalers/LogMeanVarianceScaler.cs
  • src/Preprocessing/Scalers/LogScaler.cs
  • src/Preprocessing/Scalers/LpNormScaler.cs
  • src/Preprocessing/Scalers/MaxAbsScaler.cs
  • src/Preprocessing/Scalers/MinMaxScaler.cs
  • src/Preprocessing/Scalers/Normalizer.cs
  • src/Preprocessing/Scalers/RobustScaler.cs
  • src/Preprocessing/Scalers/StandardScaler.cs
  • src/Preprocessing/TransformerBase.cs
  • tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs

Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses issue #930 by ensuring preprocessing transformers and pipelines preserve their fitted/configuration state across Newtonsoft.Json serialization/deserialization, aligning behavior with AiModelResult.Serialize/Deserialize() so persisted models keep producing correct predictions.

Changes:

  • Added Newtonsoft.Json annotations ([JsonProperty], plus targeted [JsonConstructor]) to persist fitted parameters/configuration stored in non-public members.
  • Reworked PreprocessingPipeline step storage from ValueTuple to a dedicated PipelineStep<T,TInput> type to improve JSON round-tripping of step name/transformer pairs.
  • Added integration tests covering JSON round-trips for multiple transformers, single/multi-step pipelines, PreprocessingInfo, and inverse transforms after deserialization.
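The ValueTuple issue the rework addresses can be shown in two lines: tuple element names are compile-time metadata, so at runtime Newtonsoft.Json only ever sees the `ValueTuple` fields `Item1`/`Item2`. (The names below are illustrative, not the pipeline's actual step contents.)

```csharp
using Newtonsoft.Json;

var step = (Name: "standardize", Transformer: "StandardScaler");
string json = JsonConvert.SerializeObject(step);

// The runtime type is ValueTuple<string, string>, whose public fields are
// Item1/Item2 — the element names "Name"/"Transformer" are gone:
System.Console.WriteLine(json);  // {"Item1":"standardize","Item2":"StandardScaler"}
```

A dedicated class with `[JsonProperty]` members keeps stable, meaningful JSON keys across the round-trip.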

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs Adds integration tests verifying preprocessing serialization round-trips with AiModelResult-like settings.
src/Preprocessing/TransformerBase.cs Marks core transformer state (IsFitted, ColumnIndices) for JSON serialization.
src/Preprocessing/PreprocessingPipeline.cs Introduces PipelineStep<T,TInput>, serializes pipeline internals, and adapts step storage to a serializable form.
src/Preprocessing/PreprocessingInfo.cs Adds JSON polymorphism handling attributes for pipeline properties.
src/Preprocessing/Scalers/StandardScaler.cs Serializes fitted parameters and configuration fields.
src/Preprocessing/Scalers/MinMaxScaler.cs Serializes fitted parameters/config and adds [JsonConstructor] for deserialization selection.
src/Preprocessing/Scalers/RobustScaler.cs Serializes fitted parameters/config and adds [JsonConstructor] for deserialization selection.
src/Preprocessing/Scalers/MaxAbsScaler.cs Serializes fitted parameters.
src/Preprocessing/Scalers/DecimalScaler.cs Serializes fitted parameters.
src/Preprocessing/Scalers/LogScaler.cs Serializes fitted parameters.
src/Preprocessing/Scalers/GlobalContrastScaler.cs Serializes fitted parameters.
src/Preprocessing/Scalers/LogMeanVarianceScaler.cs Serializes fitted parameters/config (_epsilon).
src/Preprocessing/Scalers/LpNormScaler.cs Serializes fitted parameters/config (p, norms, column count).
src/Preprocessing/Scalers/Normalizer.cs Serializes configuration (_normType).
src/Preprocessing/Imputers/SimpleImputer.cs Serializes fitted parameters/config (strategy/fill settings/statistics).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/Preprocessing/TransformerBase.cs Outdated
Comment thread src/Preprocessing/Scalers/LogMeanVarianceScaler.cs
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/PreprocessingInfo.cs
Comment thread src/Preprocessing/Scalers/LpNormScaler.cs
Comment thread src/Preprocessing/Scalers/Normalizer.cs
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/PreprocessingInfo.cs
- Remove null-forgiving operator from PipelineStep (use JsonConstructor)
- Make PipelineStep.Transformer setter private (immutable after construction)
- Fix ColumnIndices deserialization (add private setter for Json.NET)
- Fix Steps property to return cached read-only list (no alloc per call)
- Add explicit range assertions to MinMaxScaler custom range test
- Add transform equivalence check to SimpleImputer test
- Add LpNormScaler serialization round-trip test
- Add Normalizer serialization round-trip test
- Add LogMeanVarianceScaler serialization round-trip test
- TypeNameHandling.Auto is safe: SafeSerializationBinder restricts types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
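SafeSerializationBinder is referenced above but not shown in this PR. A restrictive `ISerializationBinder` along these lines is the standard way to make `TypeNameHandling.Auto` safe against type-injection payloads; this is a hypothetical allow-list sketch, not the project's actual binder.

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

// Hypothetical allow-list binder; the project's SafeSerializationBinder may differ.
public class AllowListBinder : ISerializationBinder
{
    public Type BindToType(string assemblyName, string typeName)
    {
        // Refuse to resolve anything outside the library's own namespaces.
        if (!typeName.StartsWith("AiDotNet.", StringComparison.Ordinal))
            throw new JsonSerializationException($"Type '{typeName}' is not allowed.");
        return Type.GetType($"{typeName}, {assemblyName}", throwOnError: true);
    }

    public void BindToName(Type serializedType, out string assemblyName, out string typeName)
    {
        assemblyName = serializedType.Assembly.GetName().Name;
        typeName = serializedType.FullName;
    }
}

public static class SafeSettingsSketch
{
    // Pair the binder with TypeNameHandling.Auto so polymorphic pipeline steps
    // round-trip, while $type resolution stays confined to known types.
    public static JsonSerializerSettings Create() => new JsonSerializerSettings
    {
        TypeNameHandling = TypeNameHandling.Auto,
        SerializationBinder = new AllowListBinder()
    };
}
```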
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs (1)

254-285: ⚠️ Potential issue | 🟠 Major

Exercise the imputation path with missing values.

CreateTestMatrix() has no missing entries, so this round-trip only proves the deserialized imputer can pass through already-complete data. If Statistics deserialize to the wrong values, this test can still pass. Use a transform input with missing cells and compare the imputed outputs.

As per coding guidelines, "Good tests should ... assert specific expected values" and "test edge cases and error conditions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs`
around lines 254 - 285, The test SimpleImputer_RoundTrip_PreservesFittedState
currently uses CreateTestMatrix() which has no missing values, so it doesn't
exercise imputation; change the Arrange/Act to use input data that contains
missing entries (e.g., a matrix with double.NaN in some cells) before calling
imputer.Fit/Transform so the imputer actually computes Statistics, then after
deserialization assert deserialized.IsFitted and deserialized.Statistics and
verify that for the positions that were missing the imputed values from
deserialized.Transform match those from the original imputer.Transform (and
optionally assert those imputed values equal specific expected numbers), using
the same tolerance as the existing loop; reference SimpleImputer<double>.Fit,
SimpleImputer<double>.Transform and the Statistics property when updating the
test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Preprocessing/PreprocessingPipeline.cs`:
- Around line 23-24: The PipelineStep Name property is mutable and allows
callers to rename steps after addition; change the public setter to a private
setter on the Name property (e.g., public string Name { get; private set; }) so
step identity can only be set at construction/deserialization, preserving the
duplicate-name/whitespace validation enforced by Add() and preventing ambiguity
in GetStep(); keep the existing [JsonProperty] attribute so deserialization
still works and prefer private over internal visibility if adjusting access
elsewhere.
- Around line 41-45: The PipelineStep JsonConstructor currently allows Name to
become string.Empty and accepts a null Transformer, which can introduce invalid
pipeline state; update the PipelineStep(string name, IDataTransformer<T, TInput,
TInput> transformer) constructor to validate inputs and throw immediately: throw
ArgumentNullException for a null transformer and ArgumentException (or
ArgumentNullException) for a null or empty name so Name and Transformer cannot
be set to invalid values; reference the PipelineStep constructor, the Name and
Transformer properties, and the IDataTransformer<T, TInput, TInput> parameter
when making the change.

In `@src/Preprocessing/TransformerBase.cs`:
- Around line 49-50: Add a unit test that verifies non-default ColumnIndices
round-trip and that Transform preserves untouched columns: construct a
TransformerBase (or a concrete subclass used in tests) and set ColumnIndices to
a non-default array (e.g. [0,2,...]); serialize and deserialize it using the
same project serialization helpers (the code paths that exercise TransformerBase
serialization), then assert deserialized.ColumnIndices equals the original array
and call Transform on a sample input row to assert that columns not listed in
ColumnIndices remain unchanged while selected columns are processed; reference
TransformerBase.ColumnIndices and the Transform method (or the concrete subclass
used in existing tests) when locating where to add this case.

In
`@tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs`:
- Around line 365-399: The test only validates Pipeline round-trip; add a
non-null TargetPipeline to the PreprocessingInfo and assert its transform
behavior is preserved after serialization. Concretely: create and Fit a
target-side PreprocessingPipeline (e.g., using a scaler), set it on the
PreprocessingInfo.TargetPipeline before serializing, call TransformTargets (or
the appropriate target-transform method) on the original and deserialized info
with a test target vector, and Assert equality of resulting values (with the
same tolerance used for features). Ensure you reference the
PreprocessingInfo.TargetPipeline, PreprocessingPipeline.Add/Fit, and
TransformTargets methods in the change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f1d0898-a649-4b33-8528-ec19dbf861ca

📥 Commits

Reviewing files that changed from the base of the PR and between ae9eed6 and 5aca2de.

📒 Files selected for processing (3)
  • src/Preprocessing/PreprocessingPipeline.cs
  • src/Preprocessing/TransformerBase.cs
  • tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs

Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/TransformerBase.cs
…ests

- Made PipelineStep.Name private set to prevent mutation after construction
- PipelineStep constructor now validates name (non-whitespace) and transformer
  (non-null), failing fast on invalid serialized steps
- Added ColumnIndices round-trip test proving non-default indices survive
  serialization and untouched columns remain unchanged
- Added TargetPipeline round-trip tests proving both null and non-null target
  pipelines survive serialization with transform equivalence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs (1)

138-181: 🧹 Nitpick | 🔵 Trivial

Consider explicit range property assertions.

The behavioral verification (checking transformed column min/max values) is valid, but directly asserting the deserialized properties would make the test intent clearer and catch deserialization issues that happen to produce equivalent transforms.

Optional: Add explicit range assertions
         // Assert
         Assert.NotNull(deserialized);
+        Assert.Equal(-1.0, deserialized.FeatureRangeMin, 1e-10);
+        Assert.Equal(1.0, deserialized.FeatureRangeMax, 1e-10);
 
         // Verify the custom range [-1, 1] is preserved
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs`
around lines 138 - 181, After Assert.NotNull(deserialized) add explicit
assertions that the deserialized MinMaxScaler<double> preserved its target range
by checking the scaler range properties on the deserialized instance (e.g.,
assert deserialized.Min == -1.0 and deserialized.Max == 1.0 or, if your API uses
different names, assert the equivalent properties such as LowerBound/UpperBound
or TargetMin/TargetMax equal -1.0 and 1.0); place these checks immediately after
the null check (before the transformed-data min/max and element-wise
comparisons) so deserialization of the range is validated directly on the
deserialized variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f97c7cb8-8e9c-4cf1-8446-7d6234fe9bb1

📥 Commits

Reviewing files that changed from the base of the PR and between 5aca2de and 7dce4cb.

📒 Files selected for processing (2)
  • src/Preprocessing/PreprocessingPipeline.cs
  • tests/AiDotNet.Tests/IntegrationTests/Preprocessing/PreprocessingSerializationIntegrationTests.cs

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.



Comment thread src/Preprocessing/Scalers/StandardScaler.cs
Comment thread src/Preprocessing/Scalers/RobustScaler.cs
Comment thread src/Preprocessing/PreprocessingPipeline.cs Outdated
Comment thread src/Preprocessing/Imputers/SimpleImputer.cs
Comment thread src/Preprocessing/PreprocessingPipeline.cs
Comment thread src/Preprocessing/Scalers/GlobalContrastScaler.cs
Comment thread src/Preprocessing/Scalers/MinMaxScaler.cs
Comment thread src/Preprocessing/Scalers/Normalizer.cs
Comment thread src/Preprocessing/Scalers/MaxAbsScaler.cs
Comment thread src/Preprocessing/Scalers/LogScaler.cs
…ameHandling

Add [JsonIgnore] to public getter-only properties that mirror [JsonProperty]
backing fields in all transformers (StandardScaler, RobustScaler, SimpleImputer,
DecimalScaler, GlobalContrastScaler, MinMaxScaler, Normalizer, MaxAbsScaler,
LogScaler) to prevent Newtonsoft emitting both field and property.

Remove per-member TypeNameHandling.Auto from PipelineStep.Transformer and
_finalTransformer — rely on serializer-level settings with SafeSerializationBinder
instead, preventing unsafe TypeNameHandling bypass.

Add Deconstruct method to PipelineStep for backward-compatible tuple destructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Preprocessing/PreprocessingPipeline.cs (1)

407-421: ⚠️ Potential issue | 🟡 Minor

Shallow clone shares transformer state - potential for unexpected mutation.

The Clone() method shares PipelineStep instances (and thus Transformer instances) between original and clone. Since transformers hold fitted state (_isFitted, computed statistics, etc.), calling Fit() on the clone will mutate the shared transformers and affect the original pipeline.

The current comment mentions transformers need ICloneable for deep cloning, but doesn't warn callers about the mutation risk. Consider either:

  1. Adding a warning to the XML doc about mutation side effects, or
  2. Returning an unfitted clone that requires re-fitting (as the current doc suggests with "without fitted state", but the code doesn't enforce this)
📝 Suggested documentation improvement
     /// <summary>
     /// Creates a clone of this pipeline without fitted state.
     /// </summary>
-    /// <returns>A new unfitted pipeline with the same structure.</returns>
+    /// <returns>
+    /// A new pipeline referencing the same transformer instances as the original.
+    /// <para><b>Warning:</b> This is a shallow clone. Fitting the clone will mutate
+    /// the shared transformers and affect the original pipeline. The clone starts
+    /// with <c>_isFitted = false</c>, but the underlying transformers retain their
+    /// fitted state from the original.</para>
+    /// </returns>
     public PreprocessingPipeline<T, TInput, TOutput> Clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Preprocessing/PreprocessingPipeline.cs` around lines 407 - 421, The Clone
method currently copies PipelineStep references (via _steps) and
_finalTransformer, sharing fitted transformer state; change Clone() in
PreprocessingPipeline<T,TInput,TOutput> so it returns an unfitted clone: for
each PipelineStep in _steps create a new PipelineStep instance (or otherwise
copy only its configuration, not the fitted Transformer) and add that to
clone._steps, and set clone._finalTransformer to a non-fitted transformer (null
or a new instance without fitted state) so calling Fit() on the clone won't
mutate the original's transformer state; if PipelineStep or transformers expose
a Reset/Unfit method use that to clear _isFitted and any computed statistics,
otherwise copy only immutable configuration fields and omit fitted fields when
constructing the new steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Preprocessing/Scalers/Normalizer.cs`:
- Around line 48-55: The backing field _normType is currently marked with
[JsonProperty] without a name which serializes as "_normType" and breaks the
existing "NormType" JSON contract; change the attribute to explicitly pin the
JSON name (e.g., [JsonProperty("NormType")]) on the _normType field so the
serialized key remains "NormType" and add a regression test that deserializes a
payload containing "NormType" into the Normalizer (and verifies NormType
property value) to ensure backward compatibility.

---

Outside diff comments:
In `@src/Preprocessing/PreprocessingPipeline.cs`:
- Around lines 407-421: The Clone method currently copies PipelineStep references (via `_steps`) and `_finalTransformer`, so the clone shares fitted transformer state with the original. Change `Clone()` in `PreprocessingPipeline<T,TInput,TOutput>` to return an unfitted clone: for each PipelineStep in `_steps`, create a new PipelineStep instance (or otherwise copy only its configuration, not the fitted Transformer) and add that to `clone._steps`; set `clone._finalTransformer` to a non-fitted transformer (null or a new instance without fitted state) so that calling `Fit()` on the clone cannot mutate the original's transformer state. If PipelineStep or the transformers expose a Reset/Unfit method, use it to clear `_isFitted` and any computed statistics; otherwise copy only immutable configuration fields and omit fitted fields when constructing the new steps.


📥 Commits

Reviewing files that changed from the base of the PR and between 7dce4cb and 8a18af4.

📒 Files selected for processing (10)
  • src/Preprocessing/Imputers/SimpleImputer.cs
  • src/Preprocessing/PreprocessingPipeline.cs
  • src/Preprocessing/Scalers/DecimalScaler.cs
  • src/Preprocessing/Scalers/GlobalContrastScaler.cs
  • src/Preprocessing/Scalers/LogScaler.cs
  • src/Preprocessing/Scalers/MaxAbsScaler.cs
  • src/Preprocessing/Scalers/MinMaxScaler.cs
  • src/Preprocessing/Scalers/Normalizer.cs
  • src/Preprocessing/Scalers/RobustScaler.cs
  • src/Preprocessing/Scalers/StandardScaler.cs

@ooples ooples merged commit 49903c7 into master Mar 7, 2026
37 of 40 checks passed
@ooples ooples deleted the fix/preprocessing-serialization branch March 7, 2026 19:37
ooples added a commit that referenced this pull request Mar 9, 2026
* fix: add serialization support to preprocessing pipeline fitted state

- Add [JsonProperty] attributes to private fields across all preprocessing
  transformers (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
  DecimalScaler, GlobalContrastScaler, LogScaler, LogMeanVarianceScaler,
  LpNormScaler, Normalizer, SimpleImputer) so Newtonsoft.Json serializes
  fitted state (means, std devs, min/max, etc.)

- Add [JsonProperty] to TransformerBase.IsFitted and ColumnIndices for
  proper deserialization of base class state

- Add [JsonConstructor] to MinMaxScaler and RobustScaler to resolve
  ambiguous constructor selection during deserialization

- Replace ValueTuple in PreprocessingPipeline._steps with a new
  PipelineStep<T, TInput> class that serializes reliably with
  Newtonsoft.Json (ValueTuples lose field names during serialization)

- Add [JsonProperty(TypeNameHandling = TypeNameHandling.Auto)] to
  PreprocessingInfo.Pipeline and TargetPipeline for polymorphic
  deserialization of pipeline step transformers

- Add 16 integration tests verifying serialization round-trips for all
  modified transformers, pipelines, and inverse transforms

Fixes #930

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
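The round-trip property the new tests verify can be sketched generically (Python and `json` here; the project itself uses Newtonsoft.Json in C#, and `mean_`/`std_` stand in for the private backing fields): fitted statistics must survive serialize then deserialize, and the restored transformer must produce identical output.

```python
import json

class StandardScalerSketch:
    """Toy 1-D standard scaler; mean_/std_ play the role of fitted private fields."""
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, xs):
        n = len(xs)
        self.mean_ = sum(xs) / n
        self.std_ = (sum((x - self.mean_) ** 2 for x in xs) / n) ** 0.5
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]

    def to_json(self):
        # Fitted state must be written out explicitly: the analogue of
        # adding [JsonProperty] to the private backing fields.
        return json.dumps({"mean": self.mean_, "std": self.std_})

    @classmethod
    def from_json(cls, payload):
        d = json.loads(payload)
        s = cls()
        s.mean_, s.std_ = d["mean"], d["std"]
        return s

data = [1.0, 2.0, 3.0, 4.0]
fitted = StandardScalerSketch().fit(data)
restored = StandardScalerSketch.from_json(fitted.to_json())
```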

* fix: resolve 13 pr review comments - serialization safety, tests, perf

- Remove null-forgiving operator from PipelineStep (use JsonConstructor)
- Make PipelineStep.Transformer setter private (immutable after construction)
- Fix ColumnIndices deserialization (add private setter for Json.NET)
- Fix Steps property to return cached read-only list (no alloc per call)
- Add explicit range assertions to MinMaxScaler custom range test
- Add transform equivalence check to SimpleImputer test
- Add LpNormScaler serialization round-trip test
- Add Normalizer serialization round-trip test
- Add LogMeanVarianceScaler serialization round-trip test
- TypeNameHandling.Auto is safe: SafeSerializationBinder restricts types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make pipelinestep immutable, validate constructor, add missing tests

- Made PipelineStep.Name private set to prevent mutation after construction
- PipelineStep constructor now validates name (non-whitespace) and transformer
  (non-null), failing fast on invalid serialized steps
- Added ColumnIndices round-trip test proving non-default indices survive
  serialization and untouched columns remain unchanged
- Added TargetPipeline round-trip tests proving both null and non-null target
  pipelines survive serialization with transform equivalence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent JSON duplicate serialization and remove per-member TypeNameHandling

Add [JsonIgnore] to public getter-only properties that mirror [JsonProperty]
backing fields in all transformers (StandardScaler, RobustScaler, SimpleImputer,
DecimalScaler, GlobalContrastScaler, MinMaxScaler, Normalizer, MaxAbsScaler,
LogScaler) to prevent Newtonsoft emitting both field and property.

Remove per-member TypeNameHandling.Auto from PipelineStep.Transformer and
_finalTransformer — rely on serializer-level settings with SafeSerializationBinder
instead, preventing unsafe TypeNameHandling bypass.

Add Deconstruct method to PipelineStep for backward-compatible tuple destructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ooples added a commit that referenced this pull request Mar 10, 2026
…ining directories (#933) (#941)

* fix: prevent data leakage by fitting preprocessing only on training data

In BuildSupervisedInternalAsync(), the preprocessing pipeline (e.g.,
StandardScaler) was fitted on the ENTIRE dataset before the train/test
split. This leaked test/validation statistics (mean, std dev) into the
training pipeline, causing artificially inflated metrics.

The fix restructures the method:
1. Split data into train/val/test FIRST using DataSplitter.Split()
2. FitTransform preprocessing pipeline on training data ONLY
3. Transform (not FitTransform) validation and test data

The federated learning path is unchanged — it correctly uses all data
as training data, so FitTransform on everything remains correct.

Fixes #929

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
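The fitting-order change can be sketched with a mean-only "scaler" (Python; the point is which rows contribute statistics, not the transform itself): fitting before the split lets extreme test values leak into the training statistics.

```python
def mean_of(rows):
    return sum(rows) / len(rows)

data = [1.0, 2.0, 3.0, 100.0]   # last row is an extreme test sample
train, test = data[:3], data[3:]

# Leaky order: statistics computed over ALL rows, then split.
leaky_mean = mean_of(data)       # the test value inflates the training stats

# Fixed order: split first, fit on train only, transform test with train stats.
train_mean = mean_of(train)      # the test row contributed nothing
test_centered = [x - train_mean for x in test]
```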

* fix: resolve 12 pr review comments - leakage tests, pragma scope, registry sync

- Scope #pragma warning disable CS8600/CS8604 with matching restore
- Restore PreprocessingRegistry.Current in all ConfigurePreprocessing overloads
- Update PreprocessingRegistry docs to reflect thread-safety and usage pattern
- Rewrite tests to actually detect leakage: verify scaler produces different
  output than all-data scaler, verify transform shape and non-identity
- Add end-to-end predict test with preprocessing
- Add test verifying no-preprocessing path still works with predict

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: move fitresample after train/test split to prevent data leakage

FitResample (SMOTE, outlier removal, augmentation) was applied to the
full dataset before splitting, allowing synthetic samples derived from
test/validation data to leak into training. Now FitResample runs after
the split on training data only in both standard and AutoML paths.
Also fixed AutoML preprocessing to FitTransform on training only and
Transform on validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update registry docs and use consistent shuffle policy for splits

Updated PreprocessingRegistry docs to clearly warn about concurrent build
safety issues. Applied same time-series non-shuffle policy to the final
supervised training split that AutoML already uses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use post-preparation training data in no-preprocessing else branch

When _dataPreparationPipeline runs (resampling/outlier removal) it modifies
XTrain/yTrain. The else branch (no preprocessing) was incorrectly using
preparedX/preparedY (pre-preparation) instead of XTrain/yTrain, losing
training-only preparation results for downstream logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make data leakage tests self-sufficient without PR 928

Refactor tests that called result.Predict() (which fails due to the
feature selection bug fixed in PR #928) to verify preprocessing pipeline
state directly via PreprocessingInfo.TransformFeatures() instead.
Tests now verify the data leakage fix independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: preserve preprocessing pipeline state during serialization (#936)

* fix: add serialization support to preprocessing pipeline fitted state

- Add [JsonProperty] attributes to private fields across all preprocessing
  transformers (StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
  DecimalScaler, GlobalContrastScaler, LogScaler, LogMeanVarianceScaler,
  LpNormScaler, Normalizer, SimpleImputer) so Newtonsoft.Json serializes
  fitted state (means, std devs, min/max, etc.)

- Add [JsonProperty] to TransformerBase.IsFitted and ColumnIndices for
  proper deserialization of base class state

- Add [JsonConstructor] to MinMaxScaler and RobustScaler to resolve
  ambiguous constructor selection during deserialization

- Replace ValueTuple in PreprocessingPipeline._steps with a new
  PipelineStep<T, TInput> class that serializes reliably with
  Newtonsoft.Json (ValueTuples lose field names during serialization)

- Add [JsonProperty(TypeNameHandling = TypeNameHandling.Auto)] to
  PreprocessingInfo.Pipeline and TargetPipeline for polymorphic
  deserialization of pipeline step transformers

- Add 16 integration tests verifying serialization round-trips for all
  modified transformers, pipelines, and inverse transforms

Fixes #930

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 13 pr review comments - serialization safety, tests, perf

- Remove null-forgiving operator from PipelineStep (use JsonConstructor)
- Make PipelineStep.Transformer setter private (immutable after construction)
- Fix ColumnIndices deserialization (add private setter for Json.NET)
- Fix Steps property to return cached read-only list (no alloc per call)
- Add explicit range assertions to MinMaxScaler custom range test
- Add transform equivalence check to SimpleImputer test
- Add LpNormScaler serialization round-trip test
- Add Normalizer serialization round-trip test
- Add LogMeanVarianceScaler serialization round-trip test
- TypeNameHandling.Auto is safe: SafeSerializationBinder restricts types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: make pipelinestep immutable, validate constructor, add missing tests

- Made PipelineStep.Name private set to prevent mutation after construction
- PipelineStep constructor now validates name (non-whitespace) and transformer
  (non-null), failing fast on invalid serialized steps
- Added ColumnIndices round-trip test proving non-default indices survive
  serialization and untouched columns remain unchanged
- Added TargetPipeline round-trip tests proving both null and non-null target
  pipelines survive serialization with transform equivalence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent JSON duplicate serialization and remove per-member TypeNameHandling

Add [JsonIgnore] to public getter-only properties that mirror [JsonProperty]
backing fields in all transformers (StandardScaler, RobustScaler, SimpleImputer,
DecimalScaler, GlobalContrastScaler, MinMaxScaler, Normalizer, MaxAbsScaler,
LogScaler) to prevent Newtonsoft emitting both field and property.

Remove per-member TypeNameHandling.Auto from PipelineStep.Transformer and
_finalTransformer — rely on serializer-level settings with SafeSerializationBinder
instead, preventing unsafe TypeNameHandling bypass.

Add Deconstruct method to PipelineStep for backward-compatible tuple destructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove static PreprocessingRegistry — prevents concurrent build race conditions (#935)

* fix: remove static PreprocessingRegistry usage from AiModelBuilder (#931)

Remove the 3 `PreprocessingRegistry<T, TInput>.Current = _preprocessingPipeline`
assignments from AiModelBuilder.ConfigurePreprocessing() and the 3 analogous
PostprocessingRegistry assignments from ConfigurePostprocessing(). The pipeline
is already stored as an instance field on the builder and flows to AiModelResult
via PreprocessingInfo — the global static registry was unnecessary and caused
race conditions when multiple builders ran concurrently.

Changes:
- Remove 6 static registry assignments from AiModelBuilder.cs
- Add [Obsolete] attribute to PreprocessingRegistry<T, TInput> class
- Replace PreprocessingRegistry usage in DocumentNeuralNetworkBase with
  an instance-level PreprocessingTransformer property
- Add 4 integration tests verifying concurrent builds don't cross-contaminate

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 7 pr review comments - registry deprecation, thread safety, docs

- Fix PreprocessingTransformer docs: clarify protected accessibility
- Add PostprocessingTransformer instance property to DocumentNeuralNetworkBase
- Replace PostprocessingRegistry static usage with instance-based transformer
- Mark PostprocessingRegistry as [Obsolete] (same treatment as PreprocessingRegistry)
- Update PreprocessingRegistry.Current docs to reflect no longer auto-set
- Fix concurrent test to use separate data loaders (thread-safety)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wire preprocessing transformer onto document models in buildasync

Changed PreprocessingTransformer and PostprocessingTransformer from
protected to protected internal so AiModelBuilder can set them.
BuildSupervisedInternalAsync now wires the preprocessing pipeline onto
DocumentNeuralNetworkBase models, replacing the static registry approach
that caused race conditions during concurrent builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: narrow transformer surface, fail fast on unfitted, wire postprocessing

- Changed PreprocessingTransformer/PostprocessingTransformer back to protected
  and added internal ConfigureTransformers() method for AiModelBuilder wiring
- PreprocessDocument/PostprocessOutput now throw if transformer is set but not
  fitted instead of silently falling back to defaults
- Wired both preprocessing AND postprocessing onto DocumentNeuralNetworkBase
- Updated PreprocessingRegistry docs to reflect deprecated/removed status
- Added [Collection] to registry tests and try/finally cleanup
- Fixed "override" wording to "assign" in docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: centralize document transformer injection and strengthen test assertions

Extract transformer wiring into ConfigureDocumentTransformers() helper method
that always passes current pipeline values (clearing prior state on reuse).
Called from both supervised build path and BuildProgramSynthesisInferenceOnlyResult().

Strengthen PreprocessingRegistryIntegrationTests to verify PreprocessingInfo
is non-null and fitted, not just Assert.NotNull(result).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unfitted check from PostprocessOutput, update docs, strengthen tests

PostprocessOutput no longer throws on unfitted transformers since
postprocessing transforms (Softmax, LabelDecoder) are often stateless.
Updated XML docs to reflect instance-level transformer priority.
Strengthen concurrent and sequential test assertions to verify
PreprocessingInfo is non-null and fitted on each result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove null-forgiving operators from video

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove null-forgiving operators from neural network layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: warn on empty attributions and use guarded locals in jwt attestation verifier

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: include gain layer in backward pass and parameter updates in deepfilternet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rebuild streaming buffers after deserializing fft/hop size in neural noise reducer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove null-forgiving operators from federated learning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove null-forgiving operators from federated learning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: respect bootstrap flag and align estimator probabilities to ensemble label space

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove null-forgiving operators from clustering modules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: throw on invalid ensemble members and cold model in adaptive random forest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: unify trained-state error messages in ordinal regression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unused alpha_s and use guarded local consistently in unipc scheduler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add guard for latent variance in compute elbo

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: hoist cached gradient lookup out of per-sample loops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace console.writeline with debug.writeline for checkpoint reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cache compressed momentum after resampling projection matrices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: show total sharded memory independently of per-device memory in tostring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address 21 pr review comments for code quality improvements

- CLIPScore: throw on embedding/weight dimension mismatch instead of Math.Min
- Adam8BitOptimizer: remove redundant _vScales null check
- BFGSOptimizer: move _inverseHessian null check before UpdateInverseHessian call
- FTRLOptimizer: use post-update accumulator (nPlusGradSq) in both paths
- MomentumOptimizer: snapshot _velocity before serialization loop
- OptimizerBase: use RequireModel() instead of Model!
- SymbolicPhysicsLearner: Clone throws on null children like Evaluate/Format
- IncrementalPCA: fix error message to "internal state inconsistent"
- ContradictionDetector: wrap JToken.Value<bool>() in try-catch
- CriticModel: wrap JToken.Value<double>() in try-catch
- LogNormalAFT: hoist Coefficients null check outside loops (3 methods)
- WeibullAFT: hoist Coefficients null check outside loops (3 methods)
- NHiTSModel: remove redundant ContainsKey before TryGetValue
- DiffusionTools: support both 1D and 2D audio_data deserialization
- GMMDetector: validate feature count before scoring a point
- TimeSeriesForestClassifier: reject NumTrees <= 0 during training

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use guarded sigmaTs local variable consistently in dpm solver methods

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: apply feature selection in Predict() — fixes critical dimension mismatch bug (#928)

* fix: apply feature selection in AiModelResult.Predict() — fixes dimension mismatch bug

The NormalOptimizer (default) randomly selects a subset of features during
training via PrepareAndEvaluateSolution, but AiModelResult.Predict() never
applied the same feature selection to new input data. This caused the model
to receive all columns while it was trained on fewer, throwing "Number of
columns must equal length of vector" errors.

Changes:
- Add SelectedFeatureIndices (List<int>) to OptimizationStepData,
  OptimizationResult, and ModelResult to store actual column indices
- Propagate indices through optimizer flow: PrepareAndEvaluateSolution →
  UpdateBestSolution → CreateOptimizationResult
- Apply feature selection in AiModelResult.Predict() after preprocessing
  but before Model.Predict(), using OptimizerHelper.SelectFeatures()
- Add 12 integration tests verifying the full BuildAsync → Predict pipeline

Closes #927

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 7 pr review comments - feature selection, copy semantics, docs

- Remove unused LooseTolerance constant
- Fix off-by-one: use != instead of > for feature index count check
- Apply feature selection in InferenceSession.Predict() (critical gap)
- Add comprehensive docs to SelectedFeatureIndices property
- Fix WithParameters to deep-copy collections (consistent with DeepCopy)
- Add copy independence assertion to WithParameters test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: support 1D tensor feature selection and add predict integration test

SelectFeaturesTensor previously threw for 1D tensors. Now handles them
by directly selecting elements at the specified indices. Added integration
test proving Predict applies SelectedFeatureIndices correctly with a real
FeedForwardNeuralNetwork.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use identity-check instead of count heuristic for feature selection

Replaced inputSize == featureIndices.Count heuristic with proper
identity-mapping check using SequenceEqual. A reordered selection like
[2, 0, 1] has the same count but still needs reordering. Also validates
indices are in-range before calling SelectFeatures. Renamed misleading
test names and added shape assertion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
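Why a count heuristic is insufficient: a permutation like [2, 0, 1] selects the same number of columns as the identity [0, 1, 2] but still reorders them. A sketch of the identity test in the spirit of the SequenceEqual check (Python; function names are illustrative):

```python
def is_identity_selection(indices, input_size):
    # Every position i must map back to column i; a matching count alone proves nothing.
    return len(indices) == input_size and all(ix == i for i, ix in enumerate(indices))

def select_features(row, indices):
    # Reorders (and/or subsets) columns according to the selection.
    return [row[i] for i in indices]

permuted = select_features([10, 20, 30], [2, 0, 1])
```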

* fix: fix collinear test fixture and add permuted feature selection test

- Fixed direct-model test to use non-collinear features with correct
  formula comment (y = 2*x1 + 3*x2 + 1) and tighter assertion ranges
- Added permutation test proving [2, 0, 1] selection is not treated as
  identity — verifies the SequenceEqual check works correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add null guards to OptimizationResult Clone and WithParameters

Callers that create OptimizationResult via object initializer (bypassing
the constructor) leave CoefficientLowerBounds, CoefficientUpperBounds,
FitnessHistory, SelectedFeatures, and SelectedFeatureIndices as null.
Deep-copy calls in Clone/WithParameters would throw NullReferenceException.
Now falls back to empty defaults when properties are null.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: unify feature selection in InferenceSequence, one-time bounds check, robust assertions

- InferenceSequence.Predict() now delegates to ApplySelectedFeaturesForPrediction
  instead of using a count-based check that skipped permuted selections
- Move bounds validation to one-time check (cached _featureIndicesValidated flag)
  instead of scanning on every Predict() call
- Assert permuted predictions actually differ from identity predictions
- Replace fragile selectedCount < 10 with valid range assertion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: assert feature selection actually occurred in regression test

Add assertion verifying optimizer selected a feature subset before testing
Predict() with full-width input, proving the test exercises the reduced-column
code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: cache identity check with volatile int, set deterministic weights in test

Replace per-call SequenceEqual + Enumerable.Range allocation with cached
volatile int state (0=unknown, 1=identity, 2=needs-selection). Uses a
simple loop instead of LINQ for zero-allocation identity check.
Thread-safe via volatile read/write semantics.

Set explicit non-uniform weights [1, 2, 3] in permuted feature selection
test to ensure permutation deterministically produces different output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve first batch of pr #941 review comments (10 of 54)

- InterpretabilityDashboard: filter empty attributions instead of using
  console output, use validCount for averaging
- BaggingClassifier: validate sampleSize vs data size when bootstrap=false
- Denclue: hoist _attractors and _attractorDensities guards out of loops
- BIRCH: use Labels ?? Predict(x) instead of Labels! in FitPredict
- SelfOrganizingMap: hoist _weights and _neuronLabels guards out of loops
- CLIQUE: hoist Labels guard out of accumulation loop
- SUBCLU: hoist _trainingData and Labels guards out of loops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve second batch of pr #941 review comments (20 of 54)

- GraphIsomorphismLayer: add local variable aliases for _lastMlpHiddenPreRelu
  in Forward and BackwardManual for clearer null safety
- PagedAttentionKernel: add braces to ToFloat/FromFloat if blocks, replace
  null-forgiving with explicit boxing guards
- KnowledgeDistillationTrainerBase: remove Console.WriteLine from library
  checkpoint loading code
- DiffusionMemoryManager: separate PerDeviceMemory and TotalShardedMemory
  formatting into independent checks
- MixtureOfExpertsLayer: add null guard before ApplyActivationToGraph
  instead of null-forgiving operator
- KVCache: replace null-forgiving with explicit boxing guard for Half
  conversion
- CachedMultiHeadAttention: hoist _cache null guard before Append call
- CachedGroupedQueryAttention: hoist _cache null guard before Append call
- DeformableConvolutionalLayer: stale comments (variables already renamed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve build errors from linter's automated null-forgiving removal

Fix syntax and compilation errors introduced by the automated linter that
removed null-forgiving operators without proper replacements:
- Fix missing closing parentheses in foreach statements (COPKMeans, AgglomerativeClustering)
- Fix duplicate variable declarations (AntColonyOptimization, CachedGroupedQueryAttention, CachedMultiHeadAttention)
- Fix dropped underscore prefixes on private field references (KVCache, ConvLSTMLayer, PrincipalNeighbourhoodAggregationLayer)
- Add validated accessor properties for nullable cache arrays (KVCache)
- Fix variable ordering issues (FloraAdapter)
- Fix nullable reference handling (VectorModel, UncertaintyCalibrationData, CrossValidationResult, AiModelResult.Uncertainty)
- Fix Debug.WriteLine() call with no arguments (KnowledgeDistillationTrainerBase)
- Add null guard for ClassLabels in PredictProbabilities (BaggingClassifier)
- Fix nullable permutation array access (ConcatenateLayer)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: hoist null guards for lambdas and alphaTs in DPMSolverMultistepScheduler

Replace null-forgiving operators with proper null-coalescing throws
for _lambdas and _alphaTs in FirstOrderUpdate, SecondOrderUpdate,
and ThirdOrderUpdate methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove redundant null guard in LeafSent140FederatedDatasetLoader

The testFilePath is already validated by IsNullOrWhiteSpace check
in the enclosing if statement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: capture PreprocessingInfo in local variable to prevent TOCTOU

Snapshot PreprocessingInfo once before accessing its members
to avoid potential race conditions with concurrent access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: dispose pre-permutation tensor to prevent memory leak in ConcatenateLayer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: accumulate projector gradients from both views in BarlowTwins, SimCLR, and iBOT

BarlowTwins: capture view1 projector grads before view2 backward clears them,
then combine both views' grads for the projector update.

SimCLR: replay projector forward per-view before backward to use correct caches,
and accumulate projector gradients from both views.

iBOT: backpropagate the full objective (clsLoss1 + clsLoss2 + lambda*(patchLoss1 + patchLoss2))
instead of only clsLoss1. Add ComputeMaskedPatchGradient for MSE gradient on masked patches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct backward pass in RIFE, FILM, and RAFT video models

RIFE: use cached block input shape instead of previous block output shape
for gradient splitting to correctly match the fused input width.

FILM: fail fast on all required backward-pass caches instead of silently
skipping fusion and flow estimator backprop when caches are null.

RAFT: replace incorrect upsampleConv.Backward with bilinear UpsampleFlowBackward
that mirrors the actual forward upsampling, and hoist invariant layer guards
out of the refinement loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strengthen DataLeakagePrevention tests with behavioral assertions

ScalerStatisticsReflectTrainingDataOnly: add positive assertion that the
pipeline transformation is not identity, confirming it was fitted on data.

StillWorksCorrectly: exercise the model with actual predictions and assert
outputs are finite numbers, not just that objects are non-null.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address major PR review comments across multiple files

KNNEvaluator: hoist _trainLabels null check outside the voting loop.
JointMutualInformation: use local miScores instead of field on line 212.
ContradictionDetector: treat malformed contradictory fields as false instead
of falling through to text scan that matches JSON key names.
BFGSOptimizer: guard inverse Hessian update with curvature condition (y*s > 0)
and use validated locals instead of null-forgiving operators.
IncrementalPCA: validate all fitted state upfront in PartialFit and remove
null-forgiving on _explainedVariance.
RealGatedLinearRecurrenceLayer: hoist all cached tensor validation to top of
Backward to prevent partial gradient corruption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
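The curvature guard in the BFGS change rests on a standard condition: the inverse-Hessian update preserves positive-definiteness only when y·s > 0, where s is the step taken and y is the change in gradient; otherwise the update is skipped. A minimal sketch (Python; names and the tolerance are illustrative):

```python
def should_update_inverse_hessian(s, y, eps=1e-10):
    """Curvature condition: y·s must be strictly positive (within tolerance),
    otherwise the BFGS update can destroy positive-definiteness."""
    ys = sum(si * yi for si, yi in zip(s, y))
    return ys > eps

# A descent-consistent (s, y) pair satisfies the condition...
ok = should_update_inverse_hessian([0.1, 0.2], [0.5, 0.4])
# ...a negative-curvature pair does not, and the update should be skipped.
bad = should_update_inverse_hessian([0.1, -0.2], [-0.5, 0.4])
```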

* fix: address remaining major PR review comments

PreprocessingRegistry: align documentation — clarify get/set is thread-safe
but concurrent builds still overwrite each other's pipeline.

InMemoryFederatedTrainer: fail fast when HE is enabled but no parameter
indices are resolved instead of silently downgrading. Fix DP accounting:
FedAsync uses 1/totalClients per iteration, FedBuff uses bufferSize/totalClients.

AiModelBuilder: extract IsTimeSeriesTask() helper that checks both AutoML
options and model type name for chronological split decisions. Reject
row-changing data preparation in natural-client federated mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address trivial PR review comments across multiple files

DeformableConvolutionalLayer: rename misleading variable names (mwShape/mbShape
to maskWeights/maskBias, mw2/mwg/mbg to maskW/maskWeightGrad/maskBiasGrad).

S4DLayer: hoist _lastHiddenStatesReal/Imag guards out of the BPTT loop.

OnlineFeatureSelector: pass resolved running arrays into ComputeOnlineCorrelation
instead of re-checking nullable fields inside the hot loop.

SSLPretrainingPipeline: extract encoder to a local variable to avoid duplicate
null-check expressions.

SubpixelConvolutionalLayer: hoist ScalarActivation out of the per-element loop.

BaggingClassifier: replace OrderBy(random.Next()) with Fisher-Yates partial
shuffle for O(sampleSize) unbiased sampling without replacement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
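The partial Fisher-Yates replacement for OrderBy(random.Next()) draws sampleSize items without replacement by shuffling only the first sampleSize positions, costing O(sampleSize) swaps instead of a full sort. A sketch (Python; function name is illustrative):

```python
import random

def sample_without_replacement(pool, sample_size, rng=random):
    """Partial Fisher-Yates: only the first sample_size positions get shuffled."""
    pool = list(pool)               # copy so the caller's sequence is untouched
    n = len(pool)
    for i in range(sample_size):
        j = rng.randrange(i, n)     # pick uniformly from the not-yet-fixed suffix
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:sample_size]

picked = sample_without_replacement(range(100), 10)
```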

* fix: hoist gradient tensor validations in HedgehogLayer.FeatureMapBackward

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address 12 pr review comments across multiple files

- FloraAdapter: fix momentum reset bug (was zeroed every step)
- KnowledgeDistillationTrainerBase: remove dead checkpoint branch
- KVCache: hoist loop-invariant lookups outside inner loops
- AiModelResult: capture stable model reference in explainer delegate
- VectorModel: guard empty coefficients, remove dead BoxOrDefault
- ConvLSTMLayer: use local lastInput instead of field reference
- PrincipalNeighbourhoodAggregationLayer: hoist invariant tensors,
  remove null-forgiving operators in fallback path
- BarlowTwins: remove redundant projector null-check
- FILM: use local-cache pattern for cached features
- DeepFilterNet: propagate gain layer backward gradient

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: net471 build errors in baggingclassifier and leafsent140loader

- Replace range syntax pool[..sampleSize] with Array.Copy for net471
  compatibility (RuntimeHelpers.GetSubArray not available)
- Add explicit null check before string.IsNullOrWhiteSpace for net471
  compatibility (missing NotNullWhen attribute)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use correct inner class name and remove null-forgiving in dashboard

Replace InterpretabilityExplanation with Explanation (the actual inner
class name) and replace null-forgiving operators with proper null guards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address pr #941 review comments

- KDTree: use single guarded local for _data null check in Compare
- BallTree: use single guarded local for _data null check in Sort
- DiffusionMemoryManager: remove duplicate unconditional total sharded line
- InfluenceFunctionExplainer: hoist cachedGradients null check outside loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

fix: preprocessing pipeline state lost during serialization/deserialization
