
Improving oracle #18

Open
arifimtiaz012 wants to merge 5 commits into main from improve_oracle

@arifimtiaz012
Collaborator

Summary

Adds two new models, HGBT and GP, to the oracle model pool.
Changes the model selection criterion from RMSE to Spearman ranking quality, and adds a benchmarking workflow intended solely for observing oracle performance on a chosen dataset.

image

Key Changes

Model Pool

  • Added:
    • HistGradientBoostingRegressor (HGBT)
    • GaussianProcessRegressor (GP) with ConstantKernel * RBF + WhiteKernel
  • All models that accept n_estimators (including Extra Trees) now use n_estimators=200
  • Model pool configurable via model_candidates

Model Selection (Ranking-first)

  • Selection priority:

    1. Spearman (top-K%)
    2. Spearman (full dataset)
    3. RMSE (tiebreaker)
  • Metrics (via cross_val_predict):

    • rmse
    • spearman_all
    • spearman_top_k
  • Added top_k_pct (default 3%)
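
The ranking-first priority above can be sketched as a lexicographic comparison. This is an illustration under stated assumptions, not the PR's implementation: `score_model` and `select_best` are hypothetical names, the top-K% subset is assumed to be chosen by true target value, and a 5-fold shuffled split is assumed for `cross_val_predict`.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold, cross_val_predict


def score_model(model, X, y, top_k_pct=3.0, seed=0, maximize=True):
    """Compute rmse, spearman_all and spearman_top_k from out-of-fold predictions."""
    y = np.asarray(y, dtype=float)
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)  # assumed CV scheme
    pred = cross_val_predict(model, X, y, cv=cv)

    rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
    rho_all, _ = spearmanr(y, pred)

    # Top-K% rows by true target (assumption); keep at least 3 rows.
    k = min(len(y), max(3, int(np.ceil(len(y) * top_k_pct / 100.0))))
    order = np.argsort(y)
    top = order[-k:] if maximize else order[:k]
    rho_top, _ = spearmanr(y[top], pred[top])

    return {
        "rmse": rmse,
        "spearman_all": float(rho_all),
        "spearman_top_k": float(rho_top),
    }


def select_best(metrics_by_model):
    """Pick the model with the best (spearman_top_k, spearman_all, -rmse) tuple."""
    def key(name):
        m = metrics_by_model[name]
        # Higher Spearman is better; lower RMSE is better, hence the minus sign.
        return (m["spearman_top_k"], m["spearman_all"], -m["rmse"])

    return max(metrics_by_model, key=key)
```

Tuple comparison means RMSE only matters when both Spearman scores are exactly tied. NaN correlations would need to be handled before this comparison.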


Categorical Handling

  • Strategy change for missing categorical values: most_frequent → constant ("no_value")
  • Preserves missingness as explicit "no_value" category
  • Avoids ambiguity on CSV reload
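
To illustrate the difference between the two strategies, here is a small sketch using scikit-learn's `SimpleImputer`. Whether the PR uses `SimpleImputer` directly is an assumption; the toy column is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing entry (illustrative data only).
X_cat = np.array([["red"], [np.nan], ["blue"], ["red"]], dtype=object)

# Previous behaviour: the gap is filled with the most common category,
# so the missing row becomes indistinguishable from a real "red".
old_imputer = SimpleImputer(strategy="most_frequent")
X_old = old_imputer.fit_transform(X_cat)

# New behaviour: the gap becomes its own explicit category.
new_imputer = SimpleImputer(strategy="constant", fill_value="no_value")
X_new = new_imputer.fit_transform(X_cat)

print(X_old.ravel().tolist())
print(X_new.ravel().tolist())
```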

Oracle Metadata

  • Now stores:
    • Per-model metrics (rmse, spearman_all, spearman_top_k)
    • Selected model scores
    • top_k_pct, objective, seed, backend_id
  • Replaces RMSE-only tracking

Utilities

  • Added _safe_spearman
    • Handles small samples (<3)
    • Guards against NaNs

CLI Updates

  • build-oracle:
    • --top-k-pct
    • --model-candidates

Benchmarking (New)

  • New script: oracle_benchmark.py

Design

  • Seed with true targets, then iterate using oracle predictions
  • Compare all models vs random baseline

Features

  • Stratified seed sampling (by target quantiles)
  • Mixed-feature nearest-neighbour lookup
  • Metrics:
    • Convergence (best value per iteration)
    • Top-K% recovery rate
    • First-hit iteration

Output

  • Multi-panel PDF (per model + random baseline)
  • See the image above for an example

Note

  • The code defaults to trying all four implemented models. If runs feel too slow during development, you may want to restrict the pool to ExtraTrees and RandomForest, as before
