Integration with Upgini #251

Open
c3p0upgini wants to merge 2 commits into autogluon:main from upgini:main
@c3p0upgini c3p0upgini commented Nov 14, 2025

Hey there,
We're Upgini, a team developing automated data processing, label-supervised data retrieval, and robust feature selection. We're currently investigating how data processing affects AI agent performance, and we ran several tests with AutoGluon Assistant. We think the library-selection approach is a solid alternative to full code generation, and we also see that on tabular data AutoGluon can outperform most current agentic approaches.

We propose extending the tooling, in particular by adding data processing tools to the agent's capabilities. We acknowledge that the current integration with Upgini is intentionally quite hard-coded: we chose the shortest path to validate the hypothesis that automated feature enrichment can improve model quality.

In a more extensible design, the LLM would select the appropriate preprocessing tool based on the task description, with Upgini being one of the available tools. This would also allow us to provide clear documentation describing the use cases where Upgini is beneficial and when it should be applied.

Another thing we noticed is that AutoGluon can overfit on large datasets. For example, on the New York City Taxi Fare Prediction Kaggle competition, the model performs much better when trained on a 1-million-row sample than on the original 55-million-row dataset. So we added support for sampling, enabled by setting the environment variable MLZERO_SAMPLE_SIZE.

Even at this early stage, we are already seeing measurable improvements: for example, on several Kaggle competitions the integration yields an improvement of about 1.2%, even without external data. If the MLZero team is interested in deeper Upgini support, we would be happy to proceed with a more architecturally correct and extensible integration.

Description

Added integration with the Upgini library to automatically enrich tabular datasets during the preprocessing stage.
This gives the model additional external features and selects the most relevant ones before training.
Added optional sampling controlled by the environment variable MLZERO_SAMPLE_SIZE. AutoGluon tends to overfit on very large datasets, so sampling can sometimes lead to better metrics.
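
The sampling behavior described above could look roughly like the following. This is only a minimal sketch, not the code in this PR: the helper name `sample_rows` and the fixed seed are hypothetical, and it uses the standard library so it runs anywhere. With a pandas DataFrame, the equivalent step would be `df.sample(n=..., random_state=...)`.

```python
import os
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], env_var: str = "MLZERO_SAMPLE_SIZE", seed: int = 42) -> List[T]:
    """Return a random subset of rows if env_var is set, else all rows.

    Hypothetical helper for illustration; the PR's actual logic may differ.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        # Variable unset or empty: sampling is disabled.
        return list(rows)
    size = int(raw)  # raises ValueError on a non-integer value
    if size <= 0 or size >= len(rows):
        # Nothing to do if the requested size is invalid or covers the whole dataset.
        return list(rows)
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(list(rows), size)
```

Because the variable is optional, leaving it unset keeps the full dataset and the previous behavior.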

Major changes:

  • azure_openai_chat.py – fixed handling of o1 and o3 models that don’t support the temperature parameter (tested with the o3-mini model deployed in Azure).
  • bash_coder_prompt.py – fixed environment handling so that packages are installed into the same runtime environment (previously, installed packages were not visible when executing generated LLM Python scripts).
  • python_coder_prompt.py – added LLM instructions for feature enrichment using Upgini before model training.
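
To illustrate the first item, here is a minimal sketch of how the temperature parameter might be dropped for o1 and o3 models. The function names and the prefix-based check are assumptions made for illustration, not the code in azure_openai_chat.py. In particular, Azure deployment names can be arbitrary, so a real implementation may need to check the underlying model rather than the name.

```python
from typing import Any, Dict, Optional

# Assumption: model families that reject the temperature parameter,
# identified here by name prefix.
NO_TEMPERATURE_PREFIXES = ("o1", "o3")


def supports_temperature(model_name: str) -> bool:
    """Return False for models whose name starts with o1 or o3."""
    return not model_name.lower().startswith(NO_TEMPERATURE_PREFIXES)


def build_chat_kwargs(model_name: str, temperature: Optional[float] = None, **extra: Any) -> Dict[str, Any]:
    """Build request keyword arguments, omitting temperature where unsupported."""
    kwargs: Dict[str, Any] = {"model": model_name, **extra}
    if temperature is not None and supports_temperature(model_name):
        kwargs["temperature"] = temperature
    return kwargs
```

With this sketch, `build_chat_kwargs("o3-mini", temperature=0.2)` leaves the temperature key out, while a model such as `gpt-4o` keeps it.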

How Has This Been Tested?

  • Unit tests (pytest tests/)
  • Integration tests: multiple runs of the MLZero agent on various tabular datasets from MLE Benchmark:
      • Nomad Semiconductors
      • Tabular Series Playground
      • New York City Taxi Fare Prediction
  • Verified pipeline behavior with and without Upgini enrichment enabled.

Configuration Changes

  • Added config file for models deployed in Azure: azure.yaml

For the integration to work, you must define the following environment variable before running:

export UPGINI_API_KEY=<your_api_key>

You can obtain your API key after registering at https://profile.upgini.com/.
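
A script that depends on this variable can fail fast when it is missing. The check below is a minimal sketch; the helper name `require_env` is hypothetical and is not part of this PR or of the Upgini library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            f"run `export {name}=<your_api_key>` before starting."
        )
    return value
```

Calling `require_env("UPGINI_API_KEY")` at startup surfaces a missing key before any enrichment step runs.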

Type of Change

  • Bug fix
  • New feature

Related Work / Benchmark Results

For a detailed comparison of model performance with and without the Upgini integration, see the benchmark runs in the MLE Benchmark repository:

🔗 [Results PR]

This PR demonstrates the improvement obtained by enriching datasets with Upgini during preprocessing.

@c3p0upgini
Author

@FANGAreNotGnu, could you please review this PR?

@FANGAreNotGnu
Collaborator

Thank you for this work on integrating Upgini with AutoGluon Assistant.

After careful consideration, we've decided not to merge this functionality. Our main concern is the additional commercial dependency: while we use paid LLM APIs for core functionality, adding another paid service for feature enrichment expands external dependencies beyond our intended scope and increases the maintenance burden.

As an alternative, you might consider maintaining the integration as a standalone repository.

We appreciate your contribution.
