Add Ability to Generate Test Cases for GEPA based on User Instruction #293
Draft
auschoi96 wants to merge 4 commits into databricks-solutions:main from
Summary
After some initial testing and internal discussion, we decided that more test cases are needed, especially as resources update. For example, if the zerobus SDK updates and introduces breaking changes, we want to make sure that's captured in the skill. Or, if the user only wants serverless to be used, they can run this to make sure the skill prioritizes that.
This should generate new `test_cases` entries in `ground_truth.yaml` and help make GEPA more accurate.
What's in the PR
This PR addresses the following:
Test Plan

You can run the following commands to test the new flags and optimizations. You will need to set the correct environment variables according to `.test/README.md`:
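As a minimal sketch, the required credentials might be exported like this. The variable names below (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`) are the standard Databricks authentication variables and are an assumption here; `.test/README.md` is the authoritative list for this repo.

```shell
# Hypothetical example; exact variable names are defined in .test/README.md.
# DATABRICKS_HOST and DATABRICKS_TOKEN are standard Databricks auth variables,
# assumed here for illustration only.
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="dapi-your-personal-access-token"
```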
```
uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment"
```

This is an example of using `--focus` to generate more examples, which will aid in the optimization:

```
uv run python .test/scripts/optimize.py databricks-zerobus-ingest --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --focus "ensure the latest databricks-sdk is being used like 0.97.0" --focus "ensure the latest databricks zerobus sdk is being used 1.1.0" --focus "make sure compatibility across runtimes"
```