I am generally fine with (and quite happy about) the feature extraction support in timm, as described here: https://rwightman.github.io/pytorch-image-models/feature_extraction/
However, I can't figure out how to solve the following in a way that works with any backbone:
Think of a multimodal semantic segmentation task where we have RGB and depth (D) as inputs. RGB and D are each sent through their own encoder branch (one backbone per modality), we fuse the features at strides 2, 4, 8, etc., and apply some decoder. This approach is very easy to implement.
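For reference, here is roughly how I handle the two-branch case (just a sketch; element-wise addition stands in for whatever fusion block is actually used):

```python
import torch
import torch.nn as nn
import timm

# Rough sketch of the two-branch case: one backbone per modality,
# features taken at matching strides, fused by element-wise addition
# (a stand-in for the real fusion block).
class RGBDFusionEncoder(nn.Module):
    def __init__(self, backbone="resnet50"):
        super().__init__()
        self.rgb = timm.create_model(
            backbone, features_only=True, out_indices=(0, 1, 2, 3, 4))
        self.depth = timm.create_model(
            backbone, features_only=True, out_indices=(0, 1, 2, 3, 4), in_chans=1)

    def forward(self, rgb, depth):
        rgb_feats = self.rgb(rgb)      # feature maps at strides 2, 4, 8, 16, 32
        d_feats = self.depth(depth)
        return [r + d for r, d in zip(rgb_feats, d_feats)]

fused = RGBDFusionEncoder()(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print([f.shape for f in fused])
```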
Now imagine that we add a third encoder branch, which does not have any input of its own. Instead, it takes the fusion product of the RGB and D branches at stride 2 as its stride-2 input, runs all the modules needed to get to stride 4, fuses its stride-4 features with the stride-4 features of the RGB and D branches, continues to stride 8, and so on.
As far as I am aware, timm generally assumes that there is a single input at the start of the network, and we parametrize timm to give us outputs at specific strides. I'm afraid there is no feature that lets me say "here is the input at stride 2, give me the output at stride 4". Or is there a way that I simply failed to find? It would be really useful for supporting fusion architectures like this.
I would welcome any ideas on how to do that, if it's possible.
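For what it's worth, the closest workaround I can think of is slicing a backbone into its stages by hand and driving the third branch stage by stage. A rough sketch for a ResNet-style model (this relies on resnet50's conv1/bn1/act1/maxpool/layer1..layer4 attributes, so it is not backbone-agnostic, which is exactly the problem):

```python
import torch
import torch.nn as nn
import timm

# Hypothetical workaround (not, as far as I can tell, a built-in timm feature):
# slice a ResNet-style backbone into its stages so that the "input-less" third
# branch can be driven stage by stage from externally supplied fused features.
backbone = timm.create_model("resnet50")

stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.act1)  # stride 1 -> 2
stages = nn.ModuleList([
    nn.Sequential(backbone.maxpool, backbone.layer1),  # stride 2 -> 4
    backbone.layer2,                                   # stride 4 -> 8
    backbone.layer3,                                   # stride 8 -> 16
    backbone.layer4,                                   # stride 16 -> 32
])

# Feed a fused stride-2 map in, get the stride-4 output back. The channel
# count has to match what layer1 expects (64 after resnet50's stem).
fused_s2 = torch.randn(1, 64, 112, 112)    # stand-in for the fused stride-2 features
out_s4 = stages[0](fused_s2)
print(out_s4.shape)                        # torch.Size([1, 256, 56, 56])
```

This works for ResNets, but the attribute names differ across architectures, so it is not a general solution; something backbone-agnostic (perhaps building on feature_info) is really what I am asking about.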
Thanks a lot,
Balint