Skip to content

Add difficulty map builder#1693

Open
undfined wants to merge 19 commits into
mainfrom
split-create-difficulty-map
Open

Add difficulty map builder#1693
undfined wants to merge 19 commits into
mainfrom
split-create-difficulty-map

Conversation

@undfined
Copy link
Copy Markdown

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a difficulty-aware curriculum sampling framework for RLVR and GRPO training. It includes a new sampler capable of both static scheduling and adaptive sampling based on reward and advantage signals, alongside a utility script for generating difficulty metadata from Hugging Face datasets using Beta-Binomial posterior estimation. Comprehensive unit tests are provided for both components. Reviewer feedback focuses on enhancing code robustness and clarity by adding missing type hints, specifying UTF-8 encoding for file operations, and removing redundant conditional logic in serialization helpers.



def _build_difficulty_bucket_index(
dataset,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dataset parameter is missing a type hint. Adding one (e.g., Any or a more specific sequence type) improves code clarity and maintainability.

Suggested change
dataset,
dataset: Any,


def __init__(
self,
dataset,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dataset parameter in the __init__ method is missing a type hint.

Suggested change
dataset,
dataset: Any,

*, output_jsonl: Path, schema_json: Path, metadata_json: Path, dataset: Dataset, dataset_metadata: DatasetMetadata
) -> None:
output_jsonl.parent.mkdir(parents=True, exist_ok=True)
with output_jsonl.open("w") as output_file:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is recommended to specify the encoding when opening files for writing to ensure consistency across different platforms, especially for JSON data which should be UTF-8.

Suggested change
with output_jsonl.open("w") as output_file:
with output_jsonl.open("w", encoding="utf-8") as output_file:

schema_payload: Any = dataset.features.to_dict()
except AttributeError:
schema_payload = str(dataset.features)
with schema_json.open("w") as output_file:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when opening the schema file for writing to ensure consistent behavior across environments.

Suggested change
with schema_json.open("w") as output_file:
with schema_json.open("w", encoding="utf-8") as output_file:

json.dump(schema_payload, output_file, indent=2, sort_keys=True)

metadata_json.parent.mkdir(parents=True, exist_ok=True)
with metadata_json.open("w") as output_file:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Specify encoding="utf-8" when opening the metadata file for writing.

Suggested change
with metadata_json.open("w") as output_file:
with metadata_json.open("w", encoding="utf-8") as output_file:

(key if isinstance(key, str) else "" if key is None else str(key)): make_jsonable(item)
for key, item in value.items()
}
return "" if value is None else str(value)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check value is None is redundant here because None is already handled at line 1130. This line can be simplified to just return the string representation.

Suggested change
return "" if value is None else str(value)
return str(value)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant