
[Faithfulness] Major fault in gradient propagation #9

@weipeilun

Description


First, there is no meta-learning at all in your implementation, and no real 'levels' either.
You are actually updating the 'initial weights' in the outer loop at different frequencies per layer, pretending that this is the inner-loop optimization process of meta-learning. To quote the NL paper:
"Meta learning paradigm (or learning to learn) (Schmidhuber et al. 1996; Finn et al. 2017; Akyürek et al. 2022) aim to automate a part of such decisions by modeling it as a two-level optimization procedure, in which the outer model aims to learn to set parameters for the inner procedure to maximize the performance across a set of tasks."
Where is the inner procedure? It's completely missing!
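To make "two-level optimization" concrete, here is a minimal sketch of what an inner procedure would look like. This is a toy Reptile-style first-order approximation on scalar quadratic tasks, not the NL paper's algorithm; all names (`inner_loop`, `outer_step`, `inner_lr`, `meta_lr`) and the task setup are illustrative assumptions of mine.

```python
def inner_loop(theta, task, inner_lr=0.1, steps=3):
    """Inner procedure: adapt a copy of the meta-parameters to one task.
    task = (a, b): loss(w) = (a*w - b)**2, so grad = 2*a*(a*w - b)."""
    w = theta
    for _ in range(steps):
        a, b = task
        grad = 2 * a * (a * w - b)
        w = w - inner_lr * grad
    return w

def outer_step(theta, tasks, meta_lr=0.05):
    """Outer procedure: move theta so the *adapted* weights do well across
    tasks (first-order / Reptile-style: step toward the adapted weights)."""
    update = 0.0
    for task in tasks:
        w_adapted = inner_loop(theta, task)
        update += (w_adapted - theta)
    return theta + meta_lr * update / len(tasks)

theta = 0.0
tasks = [(1.0, 2.0), (1.0, 3.0)]  # per-task optima at w=2 and w=3
for _ in range(100):
    theta = outer_step(theta, tasks)
# theta drifts toward an initialization from which both tasks adapt quickly
```

The point is structural: without a distinct inner procedure whose result feeds the outer update, there is no second level to speak of.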

Second, you simply can't do backpropagation like that.
Let's set aside the whole teacher-student mechanism and focus on the "vanilla pretraining stage" (sorry, I use this term just for simplicity) - after all, the NL paper didn't mention semi-supervised training, right?
For each level 0 <= l < L, the gradients are backpropagated from the final loss all the way down to the parameters of level l. And there cannot be any explicit loss / surprise for a level l < L - 1 (at least for FFN levels).
Now consider the update frequencies between levels. Say level l updates every 64 steps and level l - 1 every 128. Since level l - 1 backpropagates only once for every two of level l's updates, that single backward pass needs the gradients from both chunks - and those two chunks' gradients were backpropagated through different versions of the higher levels' parameters.
The whole training process is therefore a completely different story.
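A toy numerical sketch of that accumulation (everything here - the two-variable "model", the period of 2 chunks instead of 64/128, the variable names - is illustrative, not from either repository): the slow level must sum gradients that were each computed against a different version of the fast level's weights.

```python
fast_w = 1.0        # level l: updated every chunk
slow_w = 1.0        # level l-1: updated every 2 chunks
slow_grad_acc = 0.0
lr = 0.1
fast_versions_seen = []  # which fast_w each slow gradient flowed through

for chunk in range(4):
    x = 1.0
    # forward: slow level feeds the fast level, loss = (fast_w * y - 2)**2
    y = slow_w * x
    err = fast_w * y - 2.0
    # both gradients come from the shared final loss
    g_fast = 2 * err * y
    g_slow = 2 * err * fast_w       # depends on the *current* fast_w
    fast_versions_seen.append(fast_w)
    slow_grad_acc += g_slow         # slow level accumulates across chunks
    fast_w -= lr * g_fast           # fast level steps every chunk
    if (chunk + 1) % 2 == 0:        # slow level steps every 2nd chunk
        slow_w -= lr * slow_grad_acc
        slow_grad_acc = 0.0
```

Each accumulated slow-level gradient mixes terms evaluated at distinct `fast_w` values, which is exactly why the training dynamics differ from updating "initial weights" at staggered frequencies.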

Third, the DeepMomentum optimizer here is basically Adam... Well, you can't say Adam is not "deep", but it is strictly a special variant of DGD based on the Hebbian rule. The so-called "preconditioner" in your optimizer is not the preconditioner of the NL paper, where it specifically refers to "mapping to a proper orthogonal space" (see sections 4.2 and 4.3).
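For reference, here is the standard Adam update (Kingma & Ba), written out so the comparison is concrete: its "preconditioning" is the elementwise `1 / (sqrt(v_hat) + eps)` scaling, i.e. a diagonal rescale of the gradient, not a mapping into an orthogonal space.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g    # second moment (diagonal curvature proxy)
    m_hat = m / (1 - b1 ** t)        # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # diagonal "preconditioner"
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    g = 2 * (w - 3.0)                # gradient of (w - 3)^2
    w, m, v = adam_step(w, g, m, v, t)
```

If an optimizer reduces to these two exponential moving averages plus an elementwise rescale, calling its scaling term "the preconditioner" in the NL sense is a stretch.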

Last but not least, let me present "the real Nested Learning implementation": https://github.com/weipeilun/Nested-Learning-Pytorch. It may be a little trickier to get started with, but I believe it captures the essence of Google's NL paper.
Despite all of the above, my sincerest thanks to this repository for providing a good structure and a really nice starting point :)
