feat: add noether-init for scaffolding #113
base: main
Changes from all commits
08cecb1
86a27d7
dab3d65
30a90f3
f23360a
9b94346
a14adb8
5414ab4
5978a57
7482ed4
d20ab4a
2c059e3
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -183,6 +183,7 @@
    "**/*.ipynb",
    "**/*.md",
    "**/.venv/**",
    "**/scaffold/template_files/**",
]
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -0,0 +1,109 @@
Scaffolding a New Project
=========================
Comment on lines +1 to +3
Collaborator
We need to rethink the role of the trainer/pipeline within this noether-init
Contributor
Author
How exactly do you mean this?
Collaborator
As I understand it, you init a project with a dataset and a model. But you still need a pipeline/trainer to run anything, right? Do we implement that manually?
Contributor
Author
The pipeline, trainer etc you need for a specific dataset are specified in the YAML files for each dataset inside |
The ``noether-init`` command generates a complete, ready-to-train Noether project for
models and datasets supported out of the box by the framework. It creates all required Python modules, Hydra configuration
files, schemas, data pipelines, trainers, and callbacks, giving you a working starting point that you
can adapt to your own use case.

Prerequisites
-------------

Before scaffolding, download and preprocess the dataset you want to use. Each dataset has its own
fetching and preprocessing instructions; see the
`Dataset Zoo README <https://github.com/Emmi-AI/noether/blob/main/src/noether/data/datasets/README.md>`_
for an overview and links to dataset-specific guides.

Example Usage
-------------

.. code-block:: bash

   uv run noether-init my_project \
       --model upt \
       --dataset shapenet_car \
       --dataset-path /path/to/shapenet_car

This creates a ``my_project/`` directory in the current working directory with a UPT model and the ``shapenet_car`` dataset.
After completion, ``noether-init`` prints a summary of the configuration and the corresponding
``noether-train`` command to start training.

Arguments
---------

.. list-table::
   :header-rows: 1
   :widths: 25 50 25

   * - Option
     - Values
     - Default
   * - ``project_name`` *(required)*
     - Positional argument. Must be a valid Python identifier (no hyphens).
     -
   * - ``--model, -m`` *(required)*
     - ``transformer``, ``upt``, ``ab_upt``, ``transolver``
     -
   * - ``--dataset, -d`` *(required)*
     - ``shapenet_car``, ``drivaernet``, ``drivaerml``, ``ahmedml``, ``emmi_wing``
     -
   * - ``--dataset-path`` *(required)*
     - Path to the dataset on disk
     -
   * - ``--optimizer, -o``
     - ``adamw``, ``lion``
     - ``adamw``
   * - ``--tracker, -t``
     - ``wandb``, ``trackio``, ``tensorboard``, ``disabled``
     - ``disabled``
   * - ``--hardware``
     - ``gpu``, ``mps``, ``cpu``
     - ``gpu``
   * - ``--project-dir, -l``
     - Parent directory for the project folder
     - current directory
   * - ``--wandb-entity``
     - W&B entity name (only with ``--tracker wandb``)
     - your W&B username

Generated Project Structure
---------------------------

The generated project contains:

.. code-block:: text

   my_project/
   ├── configs/
   │   ├── callbacks/            # Training callback configs
   │   ├── data_specs/           # Data specification configs
   │   ├── dataset_normalizers/
   │   ├── dataset_statistics/
   │   ├── datasets/             # Dataset configs
   │   ├── experiment/           # Experiment configs (one per model)
   │   ├── model/                # Model architecture config
   │   ├── optimizer/            # Optimizer config
   │   ├── pipeline/             # Data pipeline config
   │   ├── tracker/              # Experiment tracker config
   │   ├── trainer/              # Trainer config
   │   └── train.yaml            # Main training config
   ├── model/                    # Model implementation
   ├── schemas/                  # Configuration dataclasses
   ├── pipeline/                 # Data processing (collators, sample processors)
   ├── trainers/                 # Training loop implementation
   └── callbacks/                # Training callbacks

All Python files are wired up with correct imports for your chosen model, and all Hydra configs reference
your dataset path, optimizer, and tracker selections.

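One way to sanity-check a scaffolded project is to compare it against the layout listed under "Generated Project Structure". The sketch below is not part of `noether-init`. The helper name `missing_entries` is hypothetical, and the expected entries are copied from the directory tree in the documentation, so they are only as accurate as that listing.

```python
from pathlib import Path

# Entries copied from the "Generated Project Structure" tree in the docs.
# These lists are an assumption based on that listing, not read from the tool.
EXPECTED_CONFIG_DIRS = [
    "callbacks",
    "data_specs",
    "dataset_normalizers",
    "dataset_statistics",
    "datasets",
    "experiment",
    "model",
    "optimizer",
    "pipeline",
    "tracker",
    "trainer",
]
EXPECTED_TOP_LEVEL_DIRS = ["model", "schemas", "pipeline", "trainers", "callbacks"]


def missing_entries(project_root: Path) -> list[str]:
    """Return the expected paths (relative to project_root) that do not exist."""
    expected = [Path("configs") / name for name in EXPECTED_CONFIG_DIRS]
    expected.append(Path("configs") / "train.yaml")
    expected.extend(Path(name) for name in EXPECTED_TOP_LEVEL_DIRS)
    return [p.as_posix() for p in expected if not (project_root / p).exists()]


if __name__ == "__main__":
    # An empty list means every documented entry is present.
    print(missing_entries(Path("my_project")))
```

An empty result only confirms that the documented directories and `train.yaml` exist; it says nothing about whether the generated configs are valid.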
Running Training
----------------

After scaffolding, start training with:

.. code-block:: bash

   uv run noether-train \
       --config-dir my_project/configs \
       --config-name train \
       +experiment=upt
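The training command follows a fixed pattern: the `configs` directory of the generated project, the `train` config name, and an `+experiment=<model>` override. As a minimal sketch, the hypothetical helper `build_train_command` below (not part of the PR) assembles the same arguments shown in the documentation so they can be passed to `subprocess.run` or printed for copy-paste.

```python
import shlex
from pathlib import Path


def build_train_command(project_dir: Path, model: str) -> list[str]:
    """Assemble the noether-train invocation as an argument list.

    Hypothetical helper: it mirrors the flags from the docs above and
    does not call any Noether API.
    """
    return [
        "uv",
        "run",
        "noether-train",
        "--config-dir",
        str(project_dir / "configs"),
        "--config-name",
        "train",
        f"+experiment={model}",
    ]


if __name__ == "__main__":
    cmd = build_train_command(Path("my_project"), "upt")
    # shlex.join quotes arguments only where the shell would need it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems when the project path contains spaces.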
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -0,0 +1 @@
# Copyright © 2025 Emmi AI GmbH. All rights reserved.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -0,0 +1,61 @@
# Copyright © 2025 Emmi AI GmbH. All rights reserved.

from __future__ import annotations

from enum import StrEnum

_MODEL_CLASS_NAMES: dict[str, str] = {
    "transformer": "Transformer",
    "upt": "UPT",
    "ab_upt": "ABUPT",
Member
Contributor
Author
Good point. For the scaffolding I would keep the
Member
Alright, does it make sense to create a small README.md inside the project to highlight these details, or maybe make it part of the documentation of the tool itself?
Contributor
Author
I think we definitely need documentation for |
    "transolver": "Transolver",
}


class ModelChoice(StrEnum):
    TRANSFORMER = "transformer"
    UPT = "upt"
    AB_UPT = "ab_upt"
    TRANSOLVER = "transolver"
Comment on lines +16 to +20
Collaborator
As mentioned earlier. This won't work in the relatively near future when new versions of models are added that solve a different task.
    @property
    def class_name(self) -> str:
        return _MODEL_CLASS_NAMES[self.value]

    @property
    def module_name(self) -> str:
        return self.value

    @property
    def schema_module(self) -> str:
        return f"{self.value}_config"

    @property
    def config_class_name(self) -> str:
        return f"{self.class_name}Config"


class DatasetChoice(StrEnum):
    SHAPENET_CAR = "shapenet_car"
    DRIVAERNET = "drivaernet"
    DRIVAERML = "drivaerml"
    AHMEDML = "ahmedml"
    EMMI_WING = "emmi_wing"


class OptimizerChoice(StrEnum):
    ADAMW = "adamw"
    LION = "lion"


class TrackerChoice(StrEnum):
    WANDB = "wandb"
    TRACKIO = "trackio"
    TENSORBOARD = "tensorboard"
    DISABLED = "disabled"


class HardwareChoice(StrEnum):
    GPU = "gpu"
    MPS = "mps"
    CPU = "cpu"
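To make the naming convention concrete, here is a standalone sketch of how the `ModelChoice` properties derive module, schema, and config class names from a single enum value. It mirrors the diff above but uses the `str` + `Enum` mixin instead of `StrEnum` so that it also runs on Python versions before 3.11; that substitution is an assumption made for portability, not what the PR does.

```python
from enum import Enum

# Same mapping as in the diff: CLI value -> model class name.
_MODEL_CLASS_NAMES = {
    "transformer": "Transformer",
    "upt": "UPT",
    "ab_upt": "ABUPT",
    "transolver": "Transolver",
}


class ModelChoice(str, Enum):
    # The PR uses enum.StrEnum; (str, Enum) is a portability stand-in here.
    TRANSFORMER = "transformer"
    UPT = "upt"
    AB_UPT = "ab_upt"
    TRANSOLVER = "transolver"

    @property
    def class_name(self) -> str:
        return _MODEL_CLASS_NAMES[self.value]

    @property
    def module_name(self) -> str:
        return self.value

    @property
    def schema_module(self) -> str:
        return f"{self.value}_config"

    @property
    def config_class_name(self) -> str:
        return f"{self.class_name}Config"


if __name__ == "__main__":
    for choice in ModelChoice:
        print(choice.value, choice.module_name, choice.schema_module, choice.config_class_name)
```

For example, `ModelChoice("ab_upt").config_class_name` evaluates to `"ABUPTConfig"`. Because every derived name hangs off one string value, the scheme has no room for several versions of a model sharing a name, which is the concern raised in the review comment above.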
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -0,0 +1,96 @@
# Copyright © 2025 Emmi AI GmbH. All rights reserved.

from pathlib import Path
from typing import Annotated

import typer

from .choices import DatasetChoice, HardwareChoice, ModelChoice, OptimizerChoice, TrackerChoice
from .config import ScaffoldConfig, resolve_config
from .generator import generate_project

app = typer.Typer(
    name="noether-init",
    help="Scaffold a new Noether training project.",
    add_completion=False,
)


@app.command()
def main(
    project_name: Annotated[
        str,
        typer.Argument(
            help="Project name (valid Python identifier, e.g. 'my_project' or 'MyProject1'). No hyphens allowed."
        ),
    ],
    model: Annotated[ModelChoice, typer.Option("--model", "-m", help="Model architecture")] = ...,  # type: ignore[assignment]
    dataset: Annotated[DatasetChoice, typer.Option("--dataset", "-d", help="Dataset")] = ...,  # type: ignore[assignment]
    dataset_path: Annotated[str, typer.Option("--dataset-path", help="Path to dataset")] = ...,  # type: ignore[assignment]
    optimizer: Annotated[OptimizerChoice, typer.Option("--optimizer", "-o", help="Optimizer")] = OptimizerChoice.ADAMW,
    tracker: Annotated[
        TrackerChoice, typer.Option("--tracker", "-t", help="Experiment tracker")
    ] = TrackerChoice.DISABLED,
    hardware: Annotated[HardwareChoice, typer.Option("--hardware", help="Hardware target")] = HardwareChoice.GPU,
    project_dir: Annotated[Path, typer.Option("--project-dir", "-l", help="Where to create project dir")] = Path("."),
    wandb_entity: Annotated[
        str | None, typer.Option("--wandb-entity", help="W&B entity, e.g. 'my-team' (defaults to your W&B username)")
    ] = None,
Comment on lines +27 to +38
Collaborator
Why are the trainer/pipeline not part of the options?
Contributor
Author
So as far as I understand the pipelines and trainers depend on the dataset that is being used. So in this regard they are fixed and do not need to be selected by the user.
) -> None:
    """Scaffold a new Noether training project."""
    # Validate project name
    if not project_name.isidentifier():
        typer.echo(f"Error: '{project_name}' is not a valid Python identifier.", err=True)
        raise typer.Exit(1)

    # Resolve to absolute path
    project_dir = (project_dir / project_name).resolve()

    # Check if project dir already exists
    if project_dir.exists():
        typer.echo(f"Error: Directory already exists: {project_dir}", err=True)
        raise typer.Exit(1)

    # Build config
    config = resolve_config(
        project_name=project_name,
        model=model,
        dataset=dataset,
        dataset_path=dataset_path,
        optimizer=optimizer,
        tracker=tracker,
        hardware=hardware,
        project_dir=project_dir,
        wandb_entity=wandb_entity,
    )

    # Generate
    typer.echo(f"Creating project '{project_name}' at {project_dir}")
    generate_project(config)

    # Print summary
    _print_summary(config)


def _print_summary(config: ScaffoldConfig) -> None:
    typer.echo(
        "\nProject created successfully!\n"
        "Configuration:\n"
        f" Project: {config.project_name}\n"
        f" Model: {config.model.value}\n"
        f" Dataset: {config.dataset.value}\n"
        f" Optimizer: {config.optimizer.value}\n"
        f" Tracker: {config.tracker.value}\n"
        f" Hardware: {config.hardware.value}\n"
        f" Path: {config.project_dir}\n"
    )
    # Suggest run command
    typer.echo(
        "To train, run:\n"
        f" uv run noether-train --config-dir {config.project_dir}/configs --config-name train +experiment={config.model.value}\n\n"
        "Experiment configs for all models are in configs/experiment/."
    )


if __name__ == "__main__":
    app()
What about pipelines, trainers, etc., those are also quite dataset/model specific? Do we add those later?
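A side note on the project-name check in `main`, which relies on `str.isidentifier()`: reserved words such as `class` also pass that test, so a stricter check could combine it with `keyword.iskeyword()`. The sketch below illustrates the difference with a hypothetical helper, `is_valid_project_name`; it is a suggestion under that assumption and not code from the PR.

```python
import keyword


def is_valid_project_name(name: str) -> bool:
    """Accept names that are identifiers and not reserved Python keywords.

    Hypothetical helper: the PR itself only calls str.isidentifier().
    """
    return name.isidentifier() and not keyword.iskeyword(name)


if __name__ == "__main__":
    for candidate in ["my_project", "MyProject1", "my-project", "1project", "class"]:
        print(
            f"{candidate!r}: isidentifier={candidate.isidentifier()}, "
            f"strict={is_valid_project_name(candidate)}"
        )
```

With this helper, `my-project` and `1project` are rejected by both checks, while `class` is rejected only by the stricter one. Whether that matters depends on whether the project name is ever used as a Python module or package name.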