[Feature] Support tool-use message with compresive format #429

jiapingW · 2026-01-15T14:17:35Z

Motivation

Because the current model requires tool parameters when using tool-use, and these tool parameters have various and inconsistent formats, directly loading the conversations stored in the Dataset as a list can lead to errors, such as some fields being dictionaries and others being strings. Therefore, we treat the commonly used tool-calls field in the agent as a string, and then parse it into a list from JSON during the template construction process to ensure that the Dataset can correctly store the conversations.

Modifications

Related Issues

Accuracy Test

We conducted tests on complex business agent scenarios, implementing dataset mapping and caching operations. I will also add corresponding tests for data anonymization later.

Benchmark & Profiling

Checklist

[✅] Format your code according to the Code Formatting with Pre-Commit.
[] Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist · 2026-01-15T14:17:54Z

Summary of Changes

Hello @jiapingW, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the system's ability to process conversation datasets that include tool-use messages. It introduces a standardized approach to handle varying formats of tool parameters by serializing complex data types into strings during dataset ingestion and deserializing them during template construction. This ensures data integrity and compatibility with underlying data structures, preventing errors caused by inconsistent data types and allowing for more reliable training and evaluation of models that utilize tool-use capabilities.

Highlights

Robust Tool-Use Message Handling: Introduced a new mechanism to handle tool-use messages within conversation datasets more robustly. This addresses inconsistencies in tool parameter formats, ensuring that complex structures like tool_calls are correctly processed.
Data Serialization for Dataset Compatibility: Implemented a safe_conversations_generator that serializes list/dict-type field values (e.g., tool_calls) within conversation messages into JSON strings during dataset loading. This prevents type conflicts when working with data processing frameworks like Apache Arrow.
Dynamic Deserialization during Template Application: Modified the conversation parsing logic to deserialize the tool_calls field from its JSON string representation back into its original list/dict format just before applying the chat template. This ensures the model receives the correct structured data.
Dataset Loading with Generator: Updated scripts/prepare_hidden_states.py and scripts/train_eagle3.py to use datasets.Dataset.from_generator with the new safe_conversations_generator, replacing the direct load_dataset call for improved data handling.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a robust data loading mechanism to handle inconsistencies in the tool_calls field within conversation datasets. It achieves this by serializing dictionary/list-like fields into JSON strings during data loading and then deserializing them back during the parsing stage. The changes are well-structured, primarily introducing a safe_conversations_generator and updating data loading scripts to use it. My review includes suggestions to improve the robustness of the JSON parsing logic and enhance the logging within the new generator function.

gemini-code-assist · 2026-01-15T14:19:56Z

specforge/data/parse.py

+                if sentence["tool_calls"] is not None:
+                    sentence["tool_calls"] = json.loads(sentence["tool_calls"])


This block can be brittle. It will raise a KeyError if a message (sentence) does not contain the tool_calls key. It will also raise a TypeError if tool_calls is already a dictionary/list instead of a string, or a json.JSONDecodeError for a malformed JSON string. To make it more robust, you should safely access tool_calls using .get(), check if it's a string before parsing, and handle potential JSON decoding errors.

tool_calls = sentence.get("tool_calls") if isinstance(tool_calls, str): try: sentence["tool_calls"] = json.loads(tool_calls) except json.JSONDecodeError: warnings.warn(f"Failed to parse tool_calls JSON: {tool_calls}") sentence["tool_calls"] = None

References

When handling data from external sources, always validate and handle potential errors gracefully. This includes checking for key existence, verifying data types, and using try-except blocks for operations that can fail, like JSON parsing.

gemini-code-assist · 2026-01-15T14:19:57Z

specforge/utils.py

+                        print(
+                            f"⚠️ Line {i}: 'conversations' is not a list. Please check!"
+                        )


For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.

Suggested change

print(

f"⚠️ Line {i}: 'conversations' is not a list. Please check!"

)

logger.warning(

f"Line {i + 1}: 'conversations' is not a list. Please check!"

)

References

In library or application code, use the logging module instead of print() for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.

gemini-code-assist · 2026-01-15T14:19:57Z

specforge/utils.py

+                        continue
+
+                cleaned_convs = []
+                for idx, msg in enumerate(raw_convs):


The loop variable idx is not used. You can replace enumerate(raw_convs) with just raw_convs to simplify the code and signal that the index is not needed.

Suggested change

for idx, msg in enumerate(raw_convs):

for msg in raw_convs:

References

Unused variables should be removed to improve code clarity and maintainability. It signals to future readers that the variable is not needed.

gemini-code-assist · 2026-01-15T14:19:57Z

specforge/utils.py

+                yield {"conversations": cleaned_convs}
+
+            except Exception as e:
+                print(f"⚠️ Skipping line {i}: {e}")


For better logging practice and consistency with the rest of the file, it's recommended to use the logger object instead of print for warnings. Also, enumerate is 0-indexed, so reporting i + 1 would be more intuitive for users who are looking at line numbers in their file editor.

Suggested change

print(f"⚠️ Skipping line {i}: {e}")

logger.warning(f"Skipping line {i + 1}: {e}")

References

In library or application code, use the logging module instead of print() for diagnostic output. This allows the application's user to configure the verbosity and destination of the logs.

…ve_jsonl_data_file

jiapingW added 2 commits January 14, 2026 15:39

support handle comprehensive jsonl data file

6c778df

support handle different tool-use message

2560f94

jiapingW requested review from FlamingoPg, shuaills, sleepcoo and zyksir as code owners January 15, 2026 14:17

gemini-code-assist bot reviewed Jan 15, 2026

View reviewed changes

jiapingW added 4 commits January 15, 2026 22:21

Merge remote-tracking branch 'upstream/main' into support_comprehensi…

5501b3a

…ve_jsonl_data_file

polish code

61cbcc1

tokenizer add max_length

9c94079

polish code

83b83d8

sleepcoo approved these changes Jan 16, 2026

View reviewed changes

jiapingW mentioned this pull request Jan 21, 2026

[Draft] Long Context Training VRAM Optimization #446

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Support tool-use message with compresive format #429

[Feature] Support tool-use message with compresive format #429

Uh oh!

jiapingW commented Jan 15, 2026

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if sentence["tool_calls"] is not None:
		sentence["tool_calls"] = json.loads(sentence["tool_calls"])

-                        print(
-                            f"⚠️ Line {i}: 'conversations' is not a list. Please check!"
-                        )
+                        logger.warning(
+                            f"Line {i + 1}: 'conversations' is not a list. Please check!"
+                        )

	print(f"⚠️ Skipping line {i}: {e}")
	logger.warning(f"Skipping line {i + 1}: {e}")

[Feature] Support tool-use message with compresive format #429

Are you sure you want to change the base?

[Feature] Support tool-use message with compresive format #429

Uh oh!

Conversation

jiapingW commented Jan 15, 2026

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants