Add integration with Upgini + support Azure provider
Author
@FANGAreNotGnu, we kindly ask you to review this PR.
Collaborator
Thank you for this work on integrating Upgini with AutoGluon Assistant. After careful consideration, we've decided not to merge this functionality. Our main concern is the additional commercial dependency: while we use paid LLM APIs for core functionality, adding another paid service for feature enrichment expands external dependencies beyond our intended scope and increases the maintenance burden. As an alternative, you might consider maintaining the integration as a standalone repository. We appreciate your contribution.
Hey there,
We're Upgini, a team building automated data processing, label-supervised data retrieval, and robust feature selection. We're currently investigating how data processing affects AI-agent performance, and we ran several tests with AutoGluon Assistant. We think the library-selection approach is a solid alternative to full code generation, and we also see that on tabular data AutoGluon can surpass most current agentic approaches.
We propose extending the tooling, in particular adding data-processing tools to the agent's capabilities. We acknowledge that the current integration with Upgini is intentionally quite hard-coded: we chose the shortest path to validate the hypothesis that automated feature enrichment can improve model quality.
In such a design, the LLM would be able to select the appropriate preprocessing tool based on the task description, with Upgini being one of the available tools. This would also allow us to provide clear documentation describing the use cases where Upgini is beneficial and when it should be applied.
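As a rough illustration of that design (all names below are hypothetical, not part of the current codebase), the preprocessing step could expose a small registry of named tools, each with a description the LLM reads when deciding which tool fits the task:

```python
from typing import Callable, Dict
import pandas as pd

# Hypothetical registry: maps a tool name to a preprocessing function plus
# a description the LLM can consult when choosing a tool for the task.
PREPROCESSING_TOOLS: Dict[str, dict] = {}

def register_tool(name: str, description: str):
    def decorator(fn: Callable[[pd.DataFrame], pd.DataFrame]):
        PREPROCESSING_TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return decorator

@register_tool("drop_constant_columns", "Remove columns with a single unique value.")
def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[:, df.nunique(dropna=False) > 1]

@register_tool("upgini_enrichment", "Enrich tabular data with external features via Upgini.")
def upgini_enrichment(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: a real integration would invoke Upgini's enrichment here.
    return df

def apply_tool(name: str, df: pd.DataFrame) -> pd.DataFrame:
    """Run the tool selected by the LLM; unknown names fall back to a no-op."""
    tool = PREPROCESSING_TOOLS.get(name)
    return tool["fn"](df) if tool else df
```

The no-op fallback keeps the pipeline robust if the LLM hallucinates a tool name.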
Another thing we noticed is that AutoGluon can overfit on large datasets. For example, on the New York City Taxi Fare Prediction Kaggle competition, the model fits much better on a 1-million-row sample than on the original 55-million-row dataset. So we added support for sampling via the MLZERO_SAMPLE_SIZE environment variable.
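A minimal sketch of how env-var-driven sampling might look (the variable name MLZERO_SAMPLE_SIZE comes from the PR; the surrounding function and seed handling are illustrative assumptions, not the actual implementation):

```python
import os
import pandas as pd

def maybe_sample(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Downsample a training DataFrame if MLZERO_SAMPLE_SIZE is set.

    Illustrative sketch: the real integration may differ in where this
    hook is called and how the random seed is chosen.
    """
    raw = os.environ.get("MLZERO_SAMPLE_SIZE")
    if not raw:
        return df  # variable unset: no sampling requested
    n = int(raw)
    if n >= len(df):
        return df  # sample size covers the whole dataset: nothing to do
    return df.sample(n=n, random_state=seed)
```

Making the feature opt-in via an environment variable keeps default behavior unchanged for existing users.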
Even at this early stage we are already seeing measurable improvements: on several Kaggle competitions the integration yields roughly a +1.2% improvement even without external data. If the MLZero team is interested in deeper Upgini support, we would be happy to proceed with a more architecturally sound and extensible integration.
Description
Added integration with Upgini library to automatically enrich tabular datasets during the preprocessing stage.
This allows the model to receive additional external features and select the most relevant ones before training.
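For readers unfamiliar with Upgini, enrichment follows a fit/transform pattern around Upgini's FeaturesEnricher. The sketch below is illustrative only: the search-key column name ("date") and the env-var name UPGINI_API_KEY are assumptions for this example, and enrichment falls back to the original data when Upgini or an API key is unavailable:

```python
import os
import pandas as pd

def enrich_with_upgini(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Best-effort external feature enrichment; returns X unchanged on failure.

    Illustrative sketch: the search-key column ("date") depends on the
    dataset, and the API-key env-var name here is an assumption.
    """
    if not os.environ.get("UPGINI_API_KEY"):
        return X  # no API key configured: skip enrichment entirely
    try:
        from upgini import FeaturesEnricher, SearchKey
        enricher = FeaturesEnricher(search_keys={"date": SearchKey.DATE})
        enricher.fit(X, y)            # search external sources for relevant features
        return enricher.transform(X)  # X plus the selected external features
    except Exception:
        return X  # enrichment is best-effort; never block training on it
```

Treating enrichment as best-effort means a network or service failure degrades gracefully to the baseline pipeline instead of aborting training.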
Added optional sampling controlled by the environment variable
MLZERO_SAMPLE_SIZE. AutoGluon tends to overfit on very large datasets, so sampling can sometimes lead to better metrics.

Major changes:

- azure_openai_chat.py – fixed handling of o1 and o3 models, which don't support the temperature parameter (tested with an o3-mini model deployed in Azure).
- bash_coder_prompt.py – fixed environment handling so that packages are installed into the same runtime environment (previously, installed packages were not visible when executing generated LLM Python scripts).
- python_coder_prompt.py – added LLM instructions for feature enrichment with Upgini before model training.

How Has This Been Tested?
Ran the test suite (pytest tests/).

Configuration Changes
For the integration to work, you must define the following environment variable before running:
You can obtain your API key after registering at https://profile.upgini.com/.
Type of Change
Related Work / Benchmark Results
For a detailed comparison of model performance with and without the Upgini integration, see the benchmark runs in the MLE Benchmark repository:
🔗 [Results PR]
This PR demonstrates the improvement obtained by enriching datasets with Upgini during preprocessing.