Conversation
📝 WalkthroughWalkthroughThe PR adds a new MACE model variant "mace-osaka24-large" to the discovery system. It includes enum registration linking the model to its metadata file, a conversion script that aligns and reorders JSON data between reference and target formats, and comprehensive YAML metadata containing model configuration, training details, hyperparameters, and evaluation metrics. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In @models/mace/convert_json_to_2024_format_mace_osaka24_large.py:
- Around line 32-43: The function align_sort_order_gz_input is missing type
annotations and contains E701 violations; update its signature to include types
for ref_file, target_file, output_file (e.g., str) and an explicit return type
(-> None), and split any combined statements into separate lines (replace "if
ref_data is None: return" and "if target_data is None: return" with two-line if
blocks that return on their own line) so the code conforms to style and
type-checking.
- Around line 13-30: The linter wants type annotations: update the signature of
load_json_smart to accept filepath: str and return Optional[Any] (since
json.load can return any JSON type and the function returns None on
FileNotFoundError), e.g. def load_json_smart(filepath: str) -> Optional[Any]:,
and add the necessary typing import (from typing import Optional, Any) at the
top of the module; keep the existing behavior and return types unchanged.
In @models/mace/mace-osaka24-large.yml:
- Line 91: The YAML has a syntax error on the field "pred_file_url": add a space
after the colon so the line reads "pred_file_url:
https://figshare.com/ndownloader/files/60020837" (ensure there is exactly one
space after the colon and no inline comments or stray characters) so the file
parses correctly.
- Line 45: The pr_url field currently contains a placeholder comment; replace
that placeholder in the models/mace/mace-osaka24-large.yml entry for pr_url with
the actual pull request URL string (the real PR link) so the required metadata
field is populated before merging; ensure the value is a valid URL and follows
the same quoting/format style as other YAML fields in this file.
- Line 4: The YAML entry's date_added value is incorrect; update the date_added
field in the mace-osaka24-large model entry (field name: date_added) from
'2025-12-23' to the correct commit date '2026-01-09' so the model metadata
reflects when it was added to the repository.
🧹 Nitpick comments (2)
models/mace/convert_json_to_2024_format_mace_osaka24_large.py (2)
115-117: Remove redundant pass statement.The
passstatement on line 116 is unnecessary after a function call and serves no purpose.♻️ Proposed fix
if __name__ == "__main__": - align_sort_order_gz_input(ref_filename, target_filename, output_filename) - pass + align_sort_order_gz_input(REF_FILE, target_filename, output_filename)Note: Also updated
ref_filenametoREF_FILEif you apply the earlier refactoring suggestion.
13-30: Consider adding error handling for JSON parsing failures.While the function handles
FileNotFoundError, it doesn't catch potentialjson.JSONDecodeErrorexceptions that could occur if the file exists but contains invalid JSON.♻️ Proposed improvement
def load_json_smart(filepath: str) -> dict[str, any] | None: """ Function to load JSON files. Automatically handles gzip decompression if the file ends with .gz. """ print(f"Loading: {filepath}") try: if filepath.endswith('.gz'): # Open with gzip module in text mode (rt) if extension is .gz with gzip.open(filepath, 'rt', encoding='utf-8') as f: return json.load(f) else: # Open normally for standard .json files with open(filepath, 'r', encoding='utf-8') as f: return json.load(f) except FileNotFoundError: print(f"Error: File {filepath} not found.") return None + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON in {filepath}: {e}") + return None
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
matbench_discovery/enums.pymodels/mace/convert_json_to_2024_format_mace_osaka24_large.pymodels/mace/mace-osaka24-large.yml
🧰 Additional context used
🪛 GitHub Actions: Lint
models/mace/mace-osaka24-large.yml
[error] 91-91: YAML decode failed: could not parse key-value at line 91 (pred_file_url).
[warning] 88-88: Note: recent changes suggested by hooks; ensure YAML formatting is consistent.
models/mace/convert_json_to_2024_format_mace_osaka24_large.py
[error] 6-6: Line too long (98 > 88)
[error] 9-9: Line too long (123 > 88)
[error] 10-10: Line too long (120 > 88)
[error] 12-12: ANN201 Missing return type annotation for public function load_json_smart
[error] 12-12: ANN001 Missing type annotation for function argument filepath
[error] 31-31: ANN201 Missing return type annotation for public function align_sort_order_gz_input
[error] 31-31: ANN001 Missing type annotation for function argument ref_file
[error] 31-31: ANN001 Missing type annotation for function argument target_file
[error] 31-31: ANN001 Missing type annotation for function argument output_file
[error] 37-37: E701 Multiple statements on one line (colon)
[error] 41-41: E701 Multiple statements on one line (colon)
🪛 GitHub Actions: Tests
models/mace/mace-osaka24-large.yml
[error] 91-92: YAML syntax error: could not find expected ':' while scanning a simple key.
🪛 Gitleaks (8.30.0)
models/mace/mace-osaka24-large.yml
[high] 2-2: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 Ruff (0.14.10)
models/mace/convert_json_to_2024_format_mace_osaka24_large.py
38-38: Multiple statements on one line (colon)
(E701)
42-42: Multiple statements on one line (colon)
(E701)
🪛 YAMLlint (1.37.1)
models/mace/mace-osaka24-large.yml
[error] 92-92: syntax error: could not find expected ':'
(syntax)
🔇 Additional comments (1)
matbench_discovery/enums.py (1)
338-340: LGTM! Enum entry correctly added.The new model enum entry is properly structured and follows the existing pattern. The comment provides helpful context about the training data sources and split.
| # Specify input filenames here (supports both .gz and .json) | ||
| ref_filename = '../../data/phonons/2024-11-09-kappas-phononDB-PBE-noNAC.json.gz' # Reference file | ||
| model_name = 'mace-osaka24-large' | ||
| json_name = 'force-sets.json.gz' | ||
| target_filename = f'./{model_name}/2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05/{json_name}' # Target file | ||
| output_filename = f'./{model_name}/2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05/reformated_{json_name}' |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Refactor configuration to improve maintainability and fix line length violations.
The hardcoded paths have multiple issues flagged by the linter (lines exceed 88 characters). Consider refactoring to improve readability and maintainability.
♻️ Proposed refactoring
-# --- Execution Settings ---
-# Specify input filenames here (supports both .gz and .json)
-ref_filename = '../../data/phonons/2024-11-09-kappas-phononDB-PBE-noNAC.json.gz' # Reference file
-model_name = 'mace-osaka24-large'
-json_name = 'force-sets.json.gz'
-target_filename = f'./{model_name}/2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05/{json_name}' # Target file
-output_filename = f'./{model_name}/2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05/reformated_{json_name}'
+# --- Execution Settings ---
+# Specify input filenames here (supports both .gz and .json)
+REF_FILE = "../../data/phonons/2024-11-09-kappas-phononDB-PBE-noNAC.json.gz"
+MODEL_NAME = "mace-osaka24-large"
+JSON_NAME = "force-sets.json.gz"
+KAPPA_DIR = "2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05"
+
+# Build paths
+target_filename = f"./{MODEL_NAME}/{KAPPA_DIR}/{JSON_NAME}"
+output_filename = f"./{MODEL_NAME}/{KAPPA_DIR}/reformated_{JSON_NAME}"Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 GitHub Actions: Lint
[error] 6-6: Line too long (98 > 88)
[error] 9-9: Line too long (123 > 88)
[error] 10-10: Line too long (120 > 88)
| def load_json_smart(filepath): | ||
| """ | ||
| Function to load JSON files. | ||
| Automatically handles gzip decompression if the file ends with .gz. | ||
| """ | ||
| print(f"Loading: {filepath}") | ||
| try: | ||
| if filepath.endswith('.gz'): | ||
| # Open with gzip module in text mode (rt) if extension is .gz | ||
| with gzip.open(filepath, 'rt', encoding='utf-8') as f: | ||
| return json.load(f) | ||
| else: | ||
| # Open normally for standard .json files | ||
| with open(filepath, 'r', encoding='utf-8') as f: | ||
| return json.load(f) | ||
| except FileNotFoundError: | ||
| print(f"Error: File {filepath} not found.") | ||
| return None |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add type annotations to fix linting errors.
The function is missing return type and parameter type annotations, causing pipeline failures.
♻️ Proposed fix
-def load_json_smart(filepath):
+def load_json_smart(filepath: str) -> dict[str, any] | None:
"""
Function to load JSON files.
Automatically handles gzip decompression if the file ends with .gz.
"""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def load_json_smart(filepath): | |
| """ | |
| Function to load JSON files. | |
| Automatically handles gzip decompression if the file ends with .gz. | |
| """ | |
| print(f"Loading: {filepath}") | |
| try: | |
| if filepath.endswith('.gz'): | |
| # Open with gzip module in text mode (rt) if extension is .gz | |
| with gzip.open(filepath, 'rt', encoding='utf-8') as f: | |
| return json.load(f) | |
| else: | |
| # Open normally for standard .json files | |
| with open(filepath, 'r', encoding='utf-8') as f: | |
| return json.load(f) | |
| except FileNotFoundError: | |
| print(f"Error: File {filepath} not found.") | |
| return None | |
| def load_json_smart(filepath: str) -> dict[str, Any] | None: | |
| """ | |
| Function to load JSON files. | |
| Automatically handles gzip decompression if the file ends with .gz. | |
| """ | |
| print(f"Loading: {filepath}") | |
| try: | |
| if filepath.endswith('.gz'): | |
| # Open with gzip module in text mode (rt) if extension is .gz | |
| with gzip.open(filepath, 'rt', encoding='utf-8') as f: | |
| return json.load(f) | |
| else: | |
| # Open normally for standard .json files | |
| with open(filepath, 'r', encoding='utf-8') as f: | |
| return json.load(f) | |
| except FileNotFoundError: | |
| print(f"Error: File {filepath} not found.") | |
| return None |
🤖 Prompt for AI Agents
In @models/mace/convert_json_to_2024_format_mace_osaka24_large.py around lines
13 - 30, The linter wants type annotations: update the signature of
load_json_smart to accept filepath: str and return Optional[Any] (since
json.load can return any JSON type and the function returns None on
FileNotFoundError), e.g. def load_json_smart(filepath: str) -> Optional[Any]:,
and add the necessary typing import (from typing import Optional, Any) at the
top of the module; keep the existing behavior and return types unchanged.
| def align_sort_order_gz_input(ref_file, target_file, output_file): | ||
| print("Starting process...") | ||
|
|
||
| # 1. Load the reference file (2024 version) | ||
| # We read this to establish the "correct sort order" of mp_ids. | ||
| ref_data = load_json_smart(ref_file) | ||
| if ref_data is None: return | ||
|
|
||
| # 2. Load the target file (2025 version) | ||
| target_data = load_json_smart(target_file) | ||
| if target_data is None: return | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add type annotations and fix style violations.
Multiple issues on this function:
- Missing type annotations for parameters and return type (pipeline failure)
- Lines 38 and 42 have E701 violations (multiple statements on one line)
♻️ Proposed fixes
-def align_sort_order_gz_input(ref_file, target_file, output_file):
+def align_sort_order_gz_input(
+ ref_file: str, target_file: str, output_file: str
+) -> None:
print("Starting process...")
# 1. Load the reference file (2024 version)
# We read this to establish the "correct sort order" of mp_ids.
ref_data = load_json_smart(ref_file)
- if ref_data is None: return
+ if ref_data is None:
+ return
# 2. Load the target file (2025 version)
target_data = load_json_smart(target_file)
- if target_data is None: return
+ if target_data is None:
+ return🧰 Tools
🪛 GitHub Actions: Lint
[error] 37-37: E701 Multiple statements on one line (colon)
[error] 41-41: E701 Multiple statements on one line (colon)
🪛 Ruff (0.14.10)
38-38: Multiple statements on one line (colon)
(E701)
42-42: Multiple statements on one line (colon)
(E701)
🤖 Prompt for AI Agents
In @models/mace/convert_json_to_2024_format_mace_osaka24_large.py around lines
32 - 43, The function align_sort_order_gz_input is missing type annotations and
contains E701 violations; update its signature to include types for ref_file,
target_file, output_file (e.g., str) and an explicit return type (-> None), and
split any combined statements into separate lines (replace "if ref_data is None:
return" and "if target_data is None: return" with two-line if blocks that return
on their own line) so the code conforms to style and type-checking.
| model_name: mace-osaka24-large | ||
| model_key: mace-osaka24-large | ||
| model_version: v0.3.6 | ||
| date_added: '2025-12-23' |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual file content
cat -n models/mace/mace-osaka24-large.ymlRepository: janosh/matbench-discovery
Length of output: 7409
🏁 Script executed:
# Check git metadata for this file if available
git log --oneline models/mace/mace-osaka24-large.yml 2>/dev/null | head -5Repository: janosh/matbench-discovery
Length of output: 105
🏁 Script executed:
# Look for other model files to understand the dating pattern
fd -e yml models/mace/ | head -10Repository: janosh/matbench-discovery
Length of output: 422
🏁 Script executed:
# Get detailed commit information including the author date
git show 9067d82 --no-patch --format="%aI %s" 2>/dev/nullRepository: janosh/matbench-discovery
Length of output: 123
🏁 Script executed:
# Check if there are other model files and their date_added values to understand the pattern
fd . models/mace/ --full-path -e yml | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && grep -E "^date_added:|^date_published:" {}'Repository: janosh/matbench-discovery
Length of output: 342
Update date_added to match the actual commit date.
The date_added field is set to '2025-12-23', but the model was committed to the repository on 2026-01-09 (17 days later). The date_added should reflect when the model entry was added to this repository. Update this to '2026-01-09'.
🤖 Prompt for AI Agents
In @models/mace/mace-osaka24-large.yml at line 4, The YAML entry's date_added
value is incorrect; update the date_added field in the mace-osaka24-large model
entry (field name: date_added) from '2025-12-23' to the correct commit date
'2026-01-09' so the model metadata reflects when it was added to the repository.
| doi: https://doi.org/10.48550/arXiv.2412.13088 | ||
| paper: https://arxiv.org/pdf/2412.13088v2 | ||
| pypi: https://pypi.org/project/mace-torch | ||
| pr_url: ## I will add after pr. |
There was a problem hiding this comment.
Update the pr_url placeholder before merging.
The pr_url field contains a placeholder comment. This should be updated with the actual PR URL before the PR is merged, as it's a required metadata field per the model schema.
💡 Suggested update
Based on the PR objectives, the URL should be:
-pr_url: ## I will add after pr.
+pr_url: https://github.com/janosh/matbench-discovery/pull/302📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pr_url: ## I will add after pr. | |
| pr_url: https://github.com/janosh/matbench-discovery/pull/302 |
🤖 Prompt for AI Agents
In @models/mace/mace-osaka24-large.yml at line 45, The pr_url field currently
contains a placeholder comment; replace that placeholder in the
models/mace/mace-osaka24-large.yml entry for pr_url with the actual pull request
URL string (the real PR link) so the required metadata field is populated before
merging; ensure the value is a valid URL and follows the same quoting/format
style as other YAML fields in this file.
| kappa_103: | ||
| κ_SRME: 0.7785 | ||
| pred_file: models/mace/mace-osaka24-large/2025-11-21-kappa-103-FIRE-dist=0.01-fmax=0.0001-symprec=1e-05/reformated_force-sets.json.gz | ||
| pred_file_url:https://figshare.com/ndownloader/files/60020837 |
There was a problem hiding this comment.
Fix YAML syntax error: missing space after colon.
The pred_file_url field is missing a space after the colon, causing YAML parsing to fail. This is blocking the CI pipeline.
🐛 Proposed fix
- pred_file_url:https://figshare.com/ndownloader/files/60020837
+ pred_file_url: https://figshare.com/ndownloader/files/60020837📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pred_file_url:https://figshare.com/ndownloader/files/60020837 | |
| pred_file_url: https://figshare.com/ndownloader/files/60020837 |
🧰 Tools
🪛 GitHub Actions: Lint
[error] 91-91: YAML decode failed: could not parse key-value at line 91 (pred_file_url).
🪛 GitHub Actions: Tests
[error] 91-92: YAML syntax error: could not find expected ':' while scanning a simple key.
🤖 Prompt for AI Agents
In @models/mace/mace-osaka24-large.yml at line 91, The YAML has a syntax error
on the field "pred_file_url": add a space after the colon so the line reads
"pred_file_url: https://figshare.com/ndownloader/files/60020837" (ensure there
is exactly one space after the colon and no inline comments or stray characters)
so the file parses correctly.
|
Hello, thanks for the contribution. Looking over this I am hesitant about its value for inclusion in the leaderboard. The performance doesn't appear meaningfully different from MACE-MP-0 but this model is non-compliant. Compared to other models that use multi-head/hydra approaches to dealing with inconsistent training data seen for other non-compliant models it's not clear to me that the dataset fusion approach here is suitably explored to actually show value in a saturated field. |
ff888ab to
93cc690
Compare
76b10f0 to
e069cc4
Compare
441c1f6 to
d2838a7
Compare
11b8e76 to
1e4ecf1
Compare
610666a to
0edb8a4
Compare
Description
Dear Matbench Discovery Developers
This PR adds
MACE-Osaka24-large, a single-head universal MLIP trained on the integrated dataset of organic (OFF23) and inorganic (MPtrj) systems.Since our model is based on the
MACEarchitecture, we utilized the scripts in models/mace for benchmarking. However, we encountered a format mismatch error when evaluating thermal conductivity using models/mace/test_mace_kappa.py and scripts/evals/kappa.py. To address this, we created models/mace/convert_json_to_2024_format_mace_osaka24_large.py to convert the data format and successfully completed the evaluation.Best regards,
Tomoya Shiota
Checklist
Please open PRs as
draftand only mark asready to reviewafter checking off the following items:models/<arch_name>/<model_variant>.ymlfor my submission.arch_nameis the name of the architecture andmodel_variant.ymlincludes things like author details, training set names and important hyperparameters.Model.<model_variant>enum on theModelenum inenums.py.<yyyy-mm-dd>-<model_variant>-preds.csv.gz).<yyyy-mm-dd>-wbm-IS2RE-FIRE.jsonl.gz). JSON Lines allows fast loading of small numbers of structures withpandas.read_json(lines=True, nrows=100)for inspection.<yyyy-mm-dd>-kappa-103-FIRE-<values-of-dist|fmax|symprec>.gz).<yyyy-mm-dd>-diatomics.json.gz).models/<arch_name>/<model_variant>.yml). If not using Figshare I have included the urls to the cloud storage service in the description of the PR.test_<arch_name>_<task>.pyfortaskindiscovery,kappa,diatomics) that generated the prediction files.just prepare-model-submission <model_variant>as described in the contributing guide to check metadata and generate the plots needed for submission.prek installoruvx prek) and ensured all checks are passing.Marking your PR as
ready to reviewwill trigger an automated code review, please address any issues raised by the automated review. For the maintainers minimizing the final human review process enables us to merge your submissions much faster!Additional Information (Optional)
train_<arch_name>.py) if I trained a model specifically for this benchmark.readme.mdwith additional details about my model.Summary by CodeRabbit
New Features
✏️ Tip: You can customize this high-level summary in your review settings.