-
Notifications
You must be signed in to change notification settings - Fork 143
[Feature] Support tool-use message with compresive format #429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Feature] Support tool-use message with compresive format #429
Conversation
Summary of ChangesHello @jiapingW, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the system's ability to process conversation datasets that include tool-use messages. It introduces a standardized approach to handle varying formats of tool parameters by serializing complex data types into strings during dataset ingestion and deserializing them during template construction. This ensures data integrity and compatibility with underlying data structures, preventing errors caused by inconsistent data types and allowing for more reliable training and evaluation of models that utilize tool-use capabilities. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a robust data loading mechanism to handle inconsistencies in the tool_calls field within conversation datasets. It achieves this by serializing dictionary/list-like fields into JSON strings during data loading and then deserializing them back during the parsing stage. The changes are well-structured, primarily introducing a safe_conversations_generator and updating data loading scripts to use it. My review includes suggestions to improve the robustness of the JSON parsing logic and enhance the logging within the new generator function.
specforge/data/parse.py
Outdated
| if sentence["tool_calls"] is not None: | ||
| sentence["tool_calls"] = json.loads(sentence["tool_calls"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block can be brittle. It will raise a KeyError if a message (sentence) does not contain the tool_calls key. It will also raise a TypeError if tool_calls is already a dictionary/list instead of a string, or a json.JSONDecodeError for a malformed JSON string. To make it more robust, you should safely access tool_calls using .get(), check if it's a string before parsing, and handle potential JSON decoding errors.
tool_calls = sentence.get("tool_calls")
if isinstance(tool_calls, str):
try:
sentence["tool_calls"] = json.loads(tool_calls)
except json.JSONDecodeError:
warnings.warn(f"Failed to parse tool_calls JSON: {tool_calls}")
sentence["tool_calls"] = NoneReferences
- When handling data from external sources, always validate and handle potential errors gracefully. This includes checking for key existence, verifying data types, and using try-except blocks for operations that can fail, like JSON parsing.
specforge/utils.py
Outdated
| print( | ||
| f"⚠️ Line {i}: 'conversations' is not a list. Please check!" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.
| print( | |
| f"⚠️ Line {i}: 'conversations' is not a list. Please check!" | |
| ) | |
| logger.warning( | |
| f"Line {i + 1}: 'conversations' is not a list. Please check!" | |
| ) |
References
- In library or application code, use the
loggingmodule instead ofprint()for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.
specforge/utils.py
Outdated
| continue | ||
|
|
||
| cleaned_convs = [] | ||
| for idx, msg in enumerate(raw_convs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop variable idx is not used. You can replace enumerate(raw_convs) with just raw_convs to simplify the code and signal that the index is not needed.
| for idx, msg in enumerate(raw_convs): | |
| for msg in raw_convs: |
References
- Unused variables should be removed to improve code clarity and maintainability. It signals to future readers that the variable is not needed.
specforge/utils.py
Outdated
| yield {"conversations": cleaned_convs} | ||
|
|
||
| except Exception as e: | ||
| print(f"⚠️ Skipping line {i}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.
| print(f"⚠️ Skipping line {i}: {e}") | |
| logger.warning(f"Skipping line {i + 1}: {e}") |
References
- In library or application code, use the
loggingmodule instead ofprint()for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.
Motivation
Because the current model requires tool parameters when using tool-use, and these tool parameters have various and inconsistent formats, directly loading the conversations stored in the Dataset as a list can lead to errors, such as some fields being dictionaries and others being strings. Therefore, we treat the commonly used tool-calls field in the agent as a string, and then parse it into a list from JSON during the template construction process to ensure that the Dataset can correctly store the conversations.
Modifications
Related Issues
Accuracy Test
We conducted tests on complex business agent scenarios, implementing dataset mapping and caching operations. I will also add corresponding tests for data anonymization later.
Benchmark & Profiling
Checklist