Skip to content

Conversation

@jiapingW
Copy link
Collaborator

Motivation

Because the current model requires tool parameters when using tool-use, and these tool parameters have various and inconsistent formats, directly loading the conversations stored in the Dataset as a list can lead to errors, such as some fields being dictionaries and others being strings. Therefore, we treat the commonly used tool-calls field in the agent as a string, and then parse it into a list from JSON during the template construction process to ensure that the Dataset can correctly store the conversations.

Modifications

Related Issues

Accuracy Test

We conducted tests on complex business agent scenarios, implementing dataset mapping and caching operations. I will also add corresponding tests for data anonymization later.

Benchmark & Profiling

Checklist

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @jiapingW, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the system's ability to process conversation datasets that include tool-use messages. It introduces a standardized approach to handle varying formats of tool parameters by serializing complex data types into strings during dataset ingestion and deserializing them during template construction. This ensures data integrity and compatibility with underlying data structures, preventing errors caused by inconsistent data types and allowing for more reliable training and evaluation of models that utilize tool-use capabilities.

Highlights

  • Robust Tool-Use Message Handling: Introduced a new mechanism to handle tool-use messages within conversation datasets more robustly. This addresses inconsistencies in tool parameter formats, ensuring that complex structures like tool_calls are correctly processed.
  • Data Serialization for Dataset Compatibility: Implemented a safe_conversations_generator that serializes list/dict-type field values (e.g., tool_calls) within conversation messages into JSON strings during dataset loading. This prevents type conflicts when working with data processing frameworks like Apache Arrow.
  • Dynamic Deserialization during Template Application: Modified the conversation parsing logic to deserialize the tool_calls field from its JSON string representation back into its original list/dict format just before applying the chat template. This ensures the model receives the correct structured data.
  • Dataset Loading with Generator: Updated scripts/prepare_hidden_states.py and scripts/train_eagle3.py to use datasets.Dataset.from_generator with the new safe_conversations_generator, replacing the direct load_dataset call for improved data handling.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust data loading mechanism to handle inconsistencies in the tool_calls field within conversation datasets. It achieves this by serializing dictionary/list-like fields into JSON strings during data loading and then deserializing them back during the parsing stage. The changes are well-structured, primarily introducing a safe_conversations_generator and updating data loading scripts to use it. My review includes suggestions to improve the robustness of the JSON parsing logic and enhance the logging within the new generator function.

Comment on lines 95 to 96
if sentence["tool_calls"] is not None:
sentence["tool_calls"] = json.loads(sentence["tool_calls"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block can be brittle. It will raise a KeyError if a message (sentence) does not contain the tool_calls key. It will also raise a TypeError if tool_calls is already a dictionary/list instead of a string, or a json.JSONDecodeError for a malformed JSON string. To make it more robust, you should safely access tool_calls using .get(), check if it's a string before parsing, and handle potential JSON decoding errors.

                tool_calls = sentence.get("tool_calls")
                if isinstance(tool_calls, str):
                    try:
                        sentence["tool_calls"] = json.loads(tool_calls)
                    except json.JSONDecodeError:
                        warnings.warn(f"Failed to parse tool_calls JSON: {tool_calls}")
                        sentence["tool_calls"] = None
References
  1. When handling data from external sources, always validate and handle potential errors gracefully. This includes checking for key existence, verifying data types, and using try-except blocks for operations that can fail, like JSON parsing.

Comment on lines 329 to 331
print(
f"⚠️ Line {i}: 'conversations' is not a list. Please check!"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.

Suggested change
print(
f"⚠️ Line {i}: 'conversations' is not a list. Please check!"
)
logger.warning(
f"Line {i + 1}: 'conversations' is not a list. Please check!"
)
References
  1. In library or application code, use the logging module instead of print() for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.

continue

cleaned_convs = []
for idx, msg in enumerate(raw_convs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The loop variable idx is not used. You can replace enumerate(raw_convs) with just raw_convs to simplify the code and signal that the index is not needed.

Suggested change
for idx, msg in enumerate(raw_convs):
for msg in raw_convs:
References
  1. Unused variables should be removed to improve code clarity and maintainability. It signals to future readers that the variable is not needed.

yield {"conversations": cleaned_convs}

except Exception as e:
print(f"⚠️ Skipping line {i}: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.

Suggested change
print(f"⚠️ Skipping line {i}: {e}")
logger.warning(f"Skipping line {i + 1}: {e}")
References
  1. In library or application code, use the logging module instead of print() for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants