
feat: add noether-init for scaffolding#113

Open
kinggongzilla wants to merge 12 commits into main from feat-add-project-setup-cli

Conversation

@kinggongzilla (Contributor) commented Mar 10, 2026

What is new:
A `noether-init` CLI command to create a new project with the specified model and dataset. After running this command, the train command is printed, which the user can run to start training immediately.

This does the following:

  • copies Python files from a `template_files` directory, substituting the project name into imports
  • generates dynamic Python files (`__init__.py` files that import the chosen model, `any_model_config.py`, etc.)
  • copies YAML files and substitutes the project name, dataset path, optimizer, or tracker where necessary (e.g. `kind: __PROJECT__.model.Transformer`)

Example invocation: `noether-init my_project -m ab_upt -d shapenet_car --dataset-path /Users/davidhauser/emmi/shapenet`

Files that contain the logic inside the `src/noether/scaffold` folder:

  • `cli.py`: entry point for the CLI command
  • `config.py`: logic for mapping user inputs to a config object
  • `references/`: YAMLs that define which Noether YAMLs should be copied depending on the user's CLI inputs
  • `file_copier.py`: handles copying/generating all templates and performing the substitutions
  • `choices.py`: enums for the possible user inputs
  • `generator.py`: orchestrates project generation

Custom datasets and custom models are not yet supported. Support for generating their boilerplate files can be added later, but that will also require some more template files.

github-actions bot commented Mar 10, 2026

Coverage

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 1137 | 21 💤 | 0 ❌ | 0 🔥 | 32.072s ⏱️ |

@kinggongzilla marked this pull request as draft on March 10, 2026 13:35
@Ndles (Member) left a comment:

I left some comments. I didn't go over the templates themselves, only the logic that takes care of the CLI. I also didn't check the tests, as they are not final yet.

```python
_MODEL_CLASS_NAMES: dict[str, str] = {
    "transformer": "Transformer",
    "upt": "UPT",
    "ab_upt": "ABUPT",
```
Member:
ABUPT is the model from the tutorial, not the one from `noether.modeling.models`, where it's `AnchorBranchedUPT`. I think we need to decide which one to use when, as it may be confusing 🤔

Contributor Author:

Good point. For the scaffolding I would keep the ABUPT wrapper around AnchorBranchedUPT, because it gives users a starting point to add their own logic which I think is helpful.

Member:

Alright. Does it make sense to create a small README.md inside the project to highlight these details, or maybe to make it part of the documentation of the tool itself?

Contributor Author (kinggongzilla) commented Mar 11, 2026:

I think we definitely need documentation for noether-init and will add it there. I think that should be sufficient, but I'm open to anything.

Member:

I would call it file_manager.py, create a class in there with static methods, and extend it as needed. For example, if we expose extra flags in the CLI to do a cleanup, FileManager would take care of it.

@kinggongzilla force-pushed the feat-add-project-setup-cli branch from 065674e to 30a90f3 on March 11, 2026 10:27
Contributor Author (kinggongzilla):

I have updated the PR with:

  • Addressing PR feedback from Pavel
  • Adjusting templates for breaking changes from init method simplification
  • Adding proper tests
  • Adding documentation (Sphinx, and in README files where useful)

@kinggongzilla marked this pull request as ready for review on March 11, 2026 13:37
@MauritsBleeker (Collaborator) left a comment:

A few general comments:

  • How do we make this CLI scale to different model versions/datasets? I think we need to structure the files based on the problem.
  • Where are the pipeline(s) and trainers? Do we need those?
  • Why don't we import the dataset stats from the dataset zoo? I feel there is too much duplication going on.


> You can use `noether-init` to automatically scaffold a complete project with your choice of model, dataset, and configuration. See the [scaffolding tutorial](https://noether-docs.emmi.ai/tutorials/scaffolding_a_new_project.html) for details.

This folder contains skeleton/boilerplate code for a minimal working `Noether` training pipeline, including all required components.
Collaborator:

What about pipelines, trainers, etc.? Those are also quite dataset/model-specific. Do we add those later?

Comment on lines +83 to +91
- ``project_name`` (positional) — project name, must be a valid Python identifier (no hyphens)
- ``--model, -m`` — model architecture (``transformer``, ``upt``, ``ab_upt``, ``transolver``)
- ``--dataset, -d`` — dataset (``shapenet_car``, ``drivaernet``, ``drivaerml``, ``ahmedml``, ``emmi_wing``)
- ``--dataset-path`` — path to dataset on disk

**Optional arguments:**

- ``--optimizer, -o`` — optimizer, default: ``adamw`` (also: ``lion``)
- ``--tracker, -t`` — experiment tracker, default: ``disabled`` (also: ``wandb``, ``trackio``, ``tensorboard``)
Collaborator (MauritsBleeker) commented Mar 12, 2026:
These examples are all very specific to what we have right now. The number of models, etc. will grow in the relatively near future. Maybe we should not hardcode everything here, but provide a reference to the dataset/model zoo instead. Maybe just one or two examples.

Contributor Author:

Okay, good point. I have updated this section to give only one example for each parameter, and added a link to the new 'Scaffolding a Project' page, where we list all available parameters.

Comment on lines +1 to +3
Scaffolding a New Project
=========================

Collaborator:
We need to rethink the role of the trainer/pipeline within this noether-init.

Contributor Author:
What exactly do you mean by this?

Collaborator:
As I understand it, you init a project with a dataset and a model. But you still need a pipeline/trainer to run anything, right? Do we implement that manually?

Contributor Author:
The pipeline, trainer, etc. needed for a specific dataset are specified in the YAML files for each dataset inside scaffold/references. So the user selects a dataset, and the reference YAML maps to the correct pipeline, trainer, etc.
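For illustration, such a reference file might look roughly like this (the field names here are invented, not the actual schema):

```yaml
# scaffold/references/shapenet_car.yaml (hypothetical structure)
dataset: shapenet_car
pipeline: noether.pipelines.SurfacePipeline   # invented name
trainer: noether.training.Trainer             # invented name
templates:
  - datasets/shapenet_car.yaml
  - callbacks/default.yaml
```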

Comment on lines +1 to +2
```python
# Copyright © 2025 Emmi AI GmbH. All rights reserved.
import torch
```
Collaborator:

We need to rethink the folder structure.

This is not the template for the AB-UPT. Especially with the BaseModel implementation, this is tailored to aerodynamic CFD (for cars). I think we should reflect this in the file structure.

Contributor Author:

This is also related to my comment below and to how we can handle future versions of ab_upt, etc.

When we do have different AB-UPT implementations in the future, we can reflect that in the folder structure here. To keep things simple, I would suggest keeping it like this for now, until we actually need to change it. However, if you think it makes more sense to adapt the structure immediately, we can do that as well.

Collaborator:

I also get your point. I would do it directly, so that we enforce this pattern a bit, since we know this will come eventually anyway. However, I also understand keeping it simple for now.

Contributor Author:

Maurits and I discussed this and agreed that we will do the renaming and grouping of the different (AB-UPT) models by task/domain in the future, once we have other AB-UPT (or UPT, etc.) implementations.

**Required arguments:**

- ``project_name`` (positional) — project name, must be a valid Python identifier (no hyphens)
- ``--model, -m`` — model architecture (``transformer``, ``upt``, ``ab_upt``, ``transolver``)
Collaborator:

This works for now, but soon we will have a different AB-UPT for a different task. How will we handle the naming then? This AB-UPT implementation won't work as a universal model class.

Contributor Author (kinggongzilla) commented Mar 12, 2026:

For each dataset we have reference files in scaffold/references. These define which pipeline, trainer, callback templates, etc. are copied for a specific dataset.

If, in the future, we have other implementations of AB-UPT, UPT, etc. for different tasks/datasets, we can add a mapping in these reference files so that the correct UPT or AB-UPT implementation is automatically copied from the templates.

Does this make sense?
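The mapping described above could, purely for illustration, be a nested lookup keyed by dataset and model (all names and paths here are invented):

```python
# Invented example: per-dataset reference data selecting which model template to copy.
REFERENCE_TEMPLATES: dict[str, dict[str, str]] = {
    "shapenet_car": {"ab_upt": "templates/ab_upt_cfd.py"},
    "drivaernet": {"ab_upt": "templates/ab_upt_cfd.py"},
}


def resolve_model_template(dataset: str, model: str) -> str:
    """Return the template path registered for a (dataset, model) pair."""
    return REFERENCE_TEMPLATES[dataset][model]
```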

Comment on lines +16 to +20
```python
TRANSFORMER = "transformer"
UPT = "upt"
AB_UPT = "ab_upt"
TRANSOLVER = "transolver"
```

Collaborator:

As mentioned earlier. This won't work in the relatively near future when new versions of models are added that solve a different task.

Comment on lines +27 to +38
```python
model: Annotated[ModelChoice, typer.Option("--model", "-m", help="Model architecture")] = ...,  # type: ignore[assignment]
dataset: Annotated[DatasetChoice, typer.Option("--dataset", "-d", help="Dataset")] = ...,  # type: ignore[assignment]
dataset_path: Annotated[str, typer.Option("--dataset-path", help="Path to dataset")] = ...,  # type: ignore[assignment]
optimizer: Annotated[OptimizerChoice, typer.Option("--optimizer", "-o", help="Optimizer")] = OptimizerChoice.ADAMW,
tracker: Annotated[
    TrackerChoice, typer.Option("--tracker", "-t", help="Experiment tracker")
] = TrackerChoice.DISABLED,
hardware: Annotated[HardwareChoice, typer.Option("--hardware", help="Hardware target")] = HardwareChoice.GPU,
project_dir: Annotated[Path, typer.Option("--project-dir", "-l", help="Where to create project dir")] = Path("."),
wandb_entity: Annotated[
    str | None, typer.Option("--wandb-entity", help="W&B entity, e.g. 'my-team' (defaults to your W&B username)")
] = None,
```
Collaborator:

Why are the trainer/pipeline not part of the options?

Contributor Author:

So, as far as I understand, the pipelines and trainers depend on the dataset being used. In this regard they are fixed and do not need to be selected by the user.
