Hi, and thank you for your great work!
I was wondering whether the early-exit techniques introduced in the paper can be extended to language modeling, or whether they only apply to classification tasks. I think the main differences are that (1) language modeling has a much larger answer space, with a vocabulary of tens of thousands of tokens, and (2) language models output a probability distribution that is then sampled from. Could it be that the conservative predictions are no longer tight enough when there are so many possible sampling outcomes?
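To make the concern concrete, here is a minimal sketch (my own illustration, not the paper's method) of a per-layer confidence check for next-token prediction, where the "answer space" is the full vocabulary rather than a small label set. All names, sizes, and the threshold are hypothetical.

```python
import torch
import torch.nn.functional as F

vocab_size = 32_000   # typical LM vocabulary, far larger than a classification label set
hidden_dim = 768
num_layers = 12
threshold = 0.9       # hypothetical confidence threshold for exiting early

# Pretend these are the hidden states of one token position after each layer.
hidden_states = [torch.randn(hidden_dim) for _ in range(num_layers)]
# Shared output head projecting a hidden state to next-token logits.
lm_head = torch.nn.Linear(hidden_dim, vocab_size)

exit_layer = num_layers
for layer_idx, h in enumerate(hidden_states, start=1):
    probs = F.softmax(lm_head(h), dim=-1)
    # With tens of thousands of possible outcomes, the top-1 probability is
    # rarely this concentrated at early layers, so the exit condition may
    # fire late or not at all.
    if probs.max().item() >= threshold:
        exit_layer = layer_idx
        break

print(f"Exited at layer {exit_layer} of {num_layers}")
```

My guess is that a calibrated threshold over such a large distribution ends up being very conservative, which is what I was asking about above.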
I see that you have a later work (CALM) that addresses language models by enforcing the early-exit objective during training, but I think the approach used in CATs is more desirable because it is distribution-free and model-agnostic.
Thank you for your time!