
Resuming training from a checkpoint: Incorrect function call and misspelt attribute in ckpt loading logic #71

@DhruvaRajwade


Describe the bug
The checkpoint loading logic contains two bugs due to incorrect function and attribute references, preventing proper resumption of training from a saved checkpoint.

  1. In examples/text/logic/state.py, line 62: self._data_state.test.load_state_dict(loaded_state["test_sampler"])
    (FIX: self._data_state.test.sampler.load_state_dict(loaded_state["test_sampler"]) )
    Here, the call targets the Dataset object, which does not define load_state_dict; the resumable state lives on its sampler, so the state dict must be loaded through the .sampler attribute.

  2. In examples/text/main_train.py, line 27: cfg = checkpointing.load_hydra_config_from_run(cfg.load_dir)
    (FIX: cfg = checkpointing.load_cfg_from_path(cfg.load_dir) )
    Here, the function name is incorrect; the function exists under a different name in utils/checkpointing.py.
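A minimal standalone sketch of bug 1 (these stand-in classes are hypothetical, not the repo's actual Dataset/sampler implementations; only the attribute path matches the report):

```python
class StatefulSampler:
    """Stand-in for a checkpointable sampler exposing a state_dict API."""
    def __init__(self):
        self.epoch = 0

    def state_dict(self):
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]


class Dataset:
    """Stand-in for the object held at self._data_state.test.

    Note it defines no load_state_dict itself; the resumable
    state belongs to the sampler it carries.
    """
    def __init__(self, sampler):
        self.sampler = sampler


test = Dataset(StatefulSampler())
loaded_state = {"test_sampler": {"epoch": 3}}

# Buggy call from state.py: Dataset has no load_state_dict,
# so this raises AttributeError.
# test.load_state_dict(loaded_state["test_sampler"])

# Fixed call: route through the sampler attribute.
test.sampler.load_state_dict(loaded_state["test_sampler"])
```

Running the fixed call restores the sampler's epoch counter, which is what lets training resume at the right point in the data order.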

To Reproduce
Set load_dir = 'path_to_ckpt_parent' in examples/text/configs/config.yaml and run examples/text/run_train.py

Expected behavior
The checkpoint gets picked up, and training resumes.
