Paper: https://arxiv.org/abs/2602.03846
PLATE (Plasticity-Tunable Efficient Adapters) is a parameter-efficient fine-tuning method that reduces catastrophic forgetting in continual learning by combining neuron selection with an orthogonal input-basis computation. The method does not require any pre-training (old-task) data or features.
cd PLATE
pip install .

See requirements.txt for full dependencies. Minimum requirements:
- Python >= 3.8
- PyTorch >= 2.0.0
- peft >= 0.15.0
- transformers >= 4.40.0
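A quick post-install sanity check (a sketch; it only uses the entry points shown in the quick start below):

# Verify that the package and its main dependencies import cleanly.
import torch, transformers, peft
from plate import PLATEConfig, get_plate_model

print("torch", torch.__version__, "| transformers", transformers.__version__, "| peft", peft.__version__)
print("plate entry points:", PLATEConfig.__name__, get_plate_model.__name__)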
from transformers import AutoModelForCausalLM
from plate import PLATEConfig, get_plate_model
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")
# Configure PLATE
config = PLATEConfig(
    r=64,               # Number of trainable neurons
    col_tau=0.9,        # Input orthogonality threshold
    plate_alpha=1.0,    # Scaling factor
    max_rank=512,       # Maximum input basis dimension
    plate_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
# Apply PLATE adapter
model = get_plate_model(model, config)
model.print_trainable_parameters()

PLATE adapts a frozen pretrained linear layer W using a structured PEFT-style update W' = W + ρ * (B A Q^T).
Only A is trained; B and Q are computed once from the frozen pretrained weights and then kept fixed.
- A (trainable): A ∈ R^{r×k}. The only learned parameters per layer (initialized to zeros). Trainable parameter count per layer is r * k.
- B (frozen output selector): B ∈ R^{d_out×r}. Selects which output neurons are allowed to change. PLATE chooses the r most redundant output rows of W, then uses the corresponding identity columns as a selector.
- Q (frozen input basis): Q ∈ R^{d_in×k}. Orthonormal basis for a low-energy input subspace computed from the frozen rows of W (the rows not selected by B). It constrains updates to directions that weakly excite old-task data/features, reducing drift on old behavior.
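To make the shapes concrete, here is a toy numerical sketch of the three pieces and the update. The row-redundancy score (a plain row norm) and the energy-threshold convention used to pick k are placeholder assumptions for illustration, not PLATE's actual selection rules; only the shapes, the identity-column selector, and the update W' = W + ρ * (B A Q^T) follow the description above.

import torch

torch.manual_seed(0)
d_out, d_in = 16, 32                       # toy layer: W is (d_out x d_in)
r, max_rank, col_tau, rho = 4, 8, 0.9, 1.0

W = torch.randn(d_out, d_in)               # frozen pretrained weight

# B: frozen output selector. "Most redundant" is approximated here by smallest row norm.
row_scores = W.norm(dim=1)
selected = torch.argsort(row_scores)[:r]   # r output rows allowed to change
B = torch.zeros(d_out, r)
B[selected, torch.arange(r)] = 1.0         # identity columns for the selected rows

# Q: frozen input basis from the rows NOT selected by B.
keep = torch.ones(d_out, dtype=torch.bool)
keep[selected] = False
frozen_rows = W[keep]                                      # (d_out - r, d_in)
_, S, Vh = torch.linalg.svd(frozen_rows, full_matrices=True)
energy = torch.cumsum(S**2, dim=0) / (S**2).sum()
k_high = int((energy < col_tau).sum()) + 1                 # directions carrying ~col_tau of the energy
k = min(d_in - k_high, max_rank)                           # keep low-energy directions, capped by max_rank
Q = Vh[d_in - k:, :].T                                     # (d_in, k), orthonormal columns (assumed convention)

# A: the only trainable parameter, initialized at zero as in the quick start.
A = torch.zeros(r, k, requires_grad=True)
with torch.no_grad():
    A += 0.01 * torch.randn_like(A)                        # pretend a few optimizer steps happened

# PLATE update: only the r selected rows can change, and only along the k low-energy input directions.
W_new = W + rho * B @ A @ Q.T

changed = (W_new - W).abs().sum(dim=1) > 0
print("trainable params per layer:", r * k)
print("rows that changed:", torch.nonzero(changed).flatten().tolist())
print("selected rows:     ", sorted(selected.tolist()))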
Key hyperparameters:
- r: number of trainable output neurons per layer. Higher r ⇒ more capacity for the new task, but typically more forgetting risk.
- col_tau: input energy threshold controlling the size of the low-energy subspace (i.e., k). Higher col_tau ⇒ smaller k (stricter constraint).
- max_rank: cap on k (maximum input basis dimension).
- rho: scaling factor in W' = W + ρ * (B A Q^T).

These hyperparameters mainly control the input/output dimensions of the trainable adapter parameters (through r and k); see the configuration sketch below.
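As a rough guide, the two ends of the plasticity-stability trade-off look like this in configuration form. The specific values are illustrative (taken from the sweep ranges reported below), not recommendations, and unspecified fields fall back to whatever defaults PLATEConfig defines:

from plate import PLATEConfig

target = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# More plastic: many trainable neurons, loose input constraint (lower col_tau => larger k).
plastic = PLATEConfig(r=256, col_tau=0.70, plate_alpha=1.0, max_rank=512,
                      plate_dropout=0.0, target_modules=target)

# More stable: few trainable neurons, strict input constraint (higher col_tau => smaller k).
stable = PLATEConfig(r=32, col_tau=0.98, plate_alpha=1.0, max_rank=512,
                     plate_dropout=0.0, target_modules=target)

# In both cases the trainable parameter count per adapted layer is r * k,
# with k determined per layer by col_tau and capped by max_rank.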
See examples/run_example.py for a complete training example.
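examples/run_example.py is the reference; the loop below is only a minimal, hypothetical sketch of how the adapted model drops into a standard causal-LM fine-tuning loop (the dataset, learning rate, and batch handling are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from plate import PLATEConfig, get_plate_model

model_name = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = get_plate_model(model, PLATEConfig(
    r=64, col_tau=0.9, plate_alpha=1.0, max_rank=512, plate_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
))

# Only the PLATE A matrices require gradients; everything else stays frozen.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()

new_task_texts = ["Whan that Aprille with his shoures soote ..."]  # placeholder new-task data
for step, text in enumerate(new_task_texts):
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss = {loss.item():.3f}")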
Qwen 2.5-3B - PLATE sweep across (r,τ): Columns fix the PLATE output rank r ∈ {32,64,128,256} and sweep τ ∈ {0.70,0.80,0.90,0.98} (green, solid) against LoRA baselines with varying ranks (blue, dashed). Top row reports WikiText-2 perplexity (forgetting) and bottom row reports Middle English perplexity (task learning), both over training steps.
Local-geometry view of forgetting - PLATE sweep across (r,τ): We sweep PLATE's two knobs on a two-moons continual-learning toy. Blue points denote the old-task dataset.
If you use PLATE in your work, please cite:
@misc{cosentino2026plateplasticitytunableefficientadapters,
title={PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning},
author={Romain Cosentino},
year={2026},
eprint={2602.03846},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.03846},
}
