
feat(evaluation): Add Pareto-Optimal Evaluation#213

Open
ankitlade12 wants to merge 6 commits into Nixtla:main from ankitlade12:feat/multi-objective-eval

Conversation

@ankitlade12

Description

This pull request introduces multi-objective evaluation capabilities to utilsforecast. It adds a robust ParetoFrontier class directly within evaluation.py, providing a model-agnostic and dataframe-agnostic way to identify the best-performing models across conflicting metrics (e.g., minimizing RMSE while minimizing MAE, or minimizing latency while maximizing accuracy).

Since utilsforecast acts as the foundational evaluation layer for Nixtla's ecosystem, integrating Pareto selection natively enables downstream libraries (like mlforecast and statsforecast) to perform multi-objective model benchmarking out of the box using the standard output of evaluate().

Key Changes

  • Added ParetoFrontier class in evaluation.py: Includes a mathematically validated is_dominated bounding check and exposes a find_non_dominated routine.
  • Built-in 2D Plotting: Implemented plot_pareto_2d() to visually inspect the trade-off frontier. Matplotlib is lazily imported and handled gracefully with an explicit warnings.warn if missing, addressing reviewer feedback to avoid raw print statements.
  • Dataframe Agnosticism (AnyDFType): Ensures pandas and polars dataframes pass cleanly through the mathematical logic, addressing previous maintainer concerns about hard pandas dependencies.
  • Exposed in __init__.py: Integrated evaluate and ParetoFrontier into __all__ for easy top-level access (from utilsforecast import ParetoFrontier).
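The dominance rule behind is_dominated and find_non_dominated can be sketched as follows. This is a hypothetical, simplified stand-in for the actual implementation, operating on a plain cost matrix with every metric minimized:

```python
import numpy as np

def find_non_dominated(costs: np.ndarray) -> np.ndarray:
    """Boolean mask of Pareto-optimal rows (all metrics minimized).

    Row i is dominated if some row j is no worse on every metric
    and strictly better on at least one.
    """
    n = costs.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(costs[j] <= costs[i]) and np.any(costs[j] < costs[i]):
                keep[i] = False
                break
    return keep

costs = np.array([
    [1.0, 3.0],  # model A: best rmse
    [2.0, 2.0],  # model B: best mae
    [3.0, 3.0],  # model C: dominated by both A and B
])
mask = find_non_dominated(costs)  # -> [True, True, False]
```

The real class works on evaluate() output rather than a raw array, but the pairwise comparison is the core of any Pareto-frontier computation.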

Example Usage

from utilsforecast.evaluation import evaluate, ParetoFrontier
from utilsforecast.losses import mae, rmse

# 1. Run standard evaluation
performance_df = evaluate(cv_results, metrics=[mae, rmse], agg_fn="mean")

# 2. Get Pareto Optimal Models
pareto_optimal_df = ParetoFrontier.find_non_dominated(performance_df)

# 3. Visualize Trade-offs
ax = ParetoFrontier.plot_pareto_2d(performance_df, metric_x='rmse', metric_y='mae')

@CLAassistant

CLAassistant commented Mar 2, 2026

CLA assistant check
All committers have signed the CLA.

@ankitlade12 ankitlade12 force-pushed the feat/multi-objective-eval branch from 4328af7 to 00dbac1 Compare March 4, 2026 07:58
Contributor

@nasaul nasaul left a comment


The core Pareto logic looks algorithmically sound, but the PR needs some changes, and tests for ParetoFrontier should be added:

- Add evaluate and ParetoFrontier to __init__.py __all__
- Use narwhals native DataFrame filtering in ParetoFrontier
- Properly extract evaluate() model columns in plot_pareto_2d
- Add tests for ParetoFrontier
@ankitlade12 ankitlade12 requested a review from nasaul March 10, 2026 04:26
@nasaul
Contributor

nasaul commented Apr 5, 2026

Thanks for the updates — the core Pareto dominance algorithm is correct and the narwhals-based approach in find_non_dominated is the right direction. There are a few bugs and design issues that need to be addressed before merging.

Bugs

1. plot_pareto_2d breaks polars (hard pd.DataFrame dependency)

In the "metric" column branch, a plain pd.DataFrame is constructed internally, and then pandas-only methods are called on the result of find_non_dominated:

pareto_sorted = pareto_df.sort_values(metric_x)  # pandas only
ax.scatter(pareto_df[metric_x], ...)              # pandas-only indexing
for _, row in plot_df.iterrows():                 # pandas only

The method signature accepts AnyDFType but breaks silently for polars input. Either convert to pandas explicitly at the start of the plotting method, or use narwhals throughout.

2. plot_pareto_2d mutates a caller-provided DataFrame

plot_df["model"] = plot_df.index.astype(str)

This modifies the passed-in DataFrame in place when it's pandas. Use .copy() or .assign() instead.
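A minimal sketch of the non-mutating alternative, using `assign` so the caller's frame is left untouched:

```python
import pandas as pd

df = pd.DataFrame({"rmse": [1.0, 1.3]}, index=["ModelA", "ModelB"])

# assign() returns a new DataFrame; the caller's df keeps its columns
plot_df = df.assign(model=df.index.astype(str))

# df is unchanged; only plot_df carries the extra "model" column
```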

Design Issues

3. __init__.py unexpectedly exports evaluate at package level

evaluate was not previously exported from utilsforecast.__init__. Adding it here is an unintended API surface change and causes eager import of all of evaluation.py on every import utilsforecast. Only ParetoFrontier (if desired) should be added.

4. Confusing metrics parameter in find_non_dominated

In the "metric" column branch (evaluate() output format), when metrics is passed it is actually used as model names, not metric names:

else:
    models = metrics  # misleading: metrics is used as model names here

A user will naturally pass metric names like ["rmse", "mae"] and get unexpected behavior. Please rename the parameter (e.g., model_subset) or document this clearly, and consider whether the evaluate() format branch even needs this override.
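One way to make the intent explicit: rename the override and validate it against the actual model columns. In this hypothetical sketch (`select_model_columns` and `model_subset` are invented names), the evaluate() format is assumed, where every column other than the id/metric columns is a model:

```python
import pandas as pd

def select_model_columns(df: pd.DataFrame, model_subset=None) -> list:
    # In evaluate() output, each non-id/metric column is a model
    model_cols = [c for c in df.columns if c not in ("unique_id", "metric")]
    if model_subset is not None:
        unknown = [m for m in model_subset if m not in model_cols]
        if unknown:
            # a user passing metric names like ["rmse", "mae"] fails loudly here
            raise ValueError(f"Not model columns: {unknown}")
        model_cols = [c for c in model_cols if c in model_subset]
    return model_cols

df = pd.DataFrame({
    "metric": ["rmse", "mae"],
    "ModelA": [1.0, 0.8],
    "ModelB": [1.2, 0.7],
})
all_models = select_model_columns(df)            # -> ['ModelA', 'ModelB']
subset = select_model_columns(df, ["ModelB"])    # -> ['ModelB']
```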

Minor

  • evaluation.py module-level __all__ still only has ['evaluate'] — add ParetoFrontier.
  • Missing two blank lines before the ParetoFrontier class definition (PEP 8).
  • Test has a stray # wait: debug comment that should be removed.

