HuggingFace Integration Friction: Weight Tying, Generation, Trainer, and Dataset Collator Alignment #17

@david-thrower

Background

The avant-garde model core (recurrent heterogeneous graph, LTI, ACT, Mamba-2 SSD, Titans) is stable and working. The HF wrapper layer (hf_model.py, trainer.py) has accumulated friction points that break standard HF workflows (Trainer, pipeline, AutoModelForCausalLM.from_pretrained, model.generate()). A previous patch fixed the redundant lm_head dead-weight bug; this issue completes the V5 integration.

Problems Identified

| # | Problem | Location | Impact |
|---|---------|----------|--------|
| 1 | Weight tying bypasses HF standard mechanism | hf_model.py HelixPreTrainedModel | _tied_weights_keys = {} and get_expanded_tied_weights_keys override kill model.tie_weights() and from_pretrained weight restoration. |
| 2 | Custom generate_ext() instead of standard generate() | hf_model.py HelixForCausalLM | Cannot use pipeline("text-generation", ...) or standard StoppingCriteria; agent must re-implement stop strings, temperature, top-p manually. |
| 3 | Custom Trainer reinvents transformers.Trainer | trainer.py | Duplicates AMP, gradient accumulation, scheduler, logging. Loses DeepSpeed/FSDP support. |
| 4 | Dataset overlap masking at risk from standard collators | dataset.py / trainer.py | DataCollatorForLanguageModeling regenerates labels and destroys custom -100 overlap masks. |
| 5 | Tokenizer wrapper adds friction | tokenizer.py | HelixTokenizer is not a first-class PreTrainedTokenizer; passing it to HF Trainer requires ._backend indirection. |
| 6 | Auto-registration wrapped in silent try/except | hf_model.py bottom | Registration failures are swallowed. |
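
To make problem 4 concrete, here is a minimal sketch of a collator that pads a batch while leaving the dataset's own labels untouched. It is not the project's implementation. It assumes each example is a dict with input_ids and labels as plain Python lists of equal length; the real DocumentAwareDataset may return tensors or use different field names. The function name collate_preserving_labels and the pad_token_id default are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss is expected to skip


def collate_preserving_labels(features, pad_token_id=0):
    """Right-pad a batch without regenerating labels.

    Assumption: each feature is {"input_ids": [...], "labels": [...]},
    where labels already carry -100 on overlap positions.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        if len(f["labels"]) != n:
            raise ValueError("input_ids and labels must have the same length")
        pad = max_len - n
        batch["input_ids"].append(list(f["input_ids"]) + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
        # Existing labels are copied verbatim; only the padding tail is masked.
        batch["labels"].append(list(f["labels"]) + [IGNORE_INDEX] * pad)
    return batch
```

A collator of this shape could be passed to a trainer as its data_collator after converting the lists to tensors. The point is only that labels are copied from the dataset rather than rebuilt from input_ids.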

Architectural Constraints (Non-Negotiable)

The following are sacred and must not be modified:

  • helix_lm/graph.py — heterogeneous graph wiring
  • helix_lm/recurrent.py — LTI injection, ACT halting, recurrence
  • helix_lm/nodes.py — all node implementations (attention variants, SwiGLU, SSM, Titans, gates)
  • helix_lm/mamba2.py — Mamba-2 SSD parallel scan
  • helix_lm/rope.py — rotary embeddings

The following behaviors must be preserved exactly:

  • Document-aware chunking with overlap masking: DocumentAwareDataset must continue to return labels with -100 on overlap heads and padding tails. Standard DataCollatorForLanguageModeling must not be used; it overwrites these labels.
  • No KV-cache / no past_key_values state: The recurrent graph re-initializes node_states = {} on every forward call. Generation must pass the full sequence (or last seq_len window) on every step, never just input_ids[:, -1:]. config.use_cache must remain False.
  • Parameter count stability: After all fixes, HelixForCausalLM(HelixConfig.tiny(vocab_size=50257)) must report exactly 13,347,974 parameters.
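
The no-KV-cache constraint above can be illustrated with a small greedy decoding loop. This is a sketch under stated assumptions, not the wrapper's generate() path: next_token_fn is a hypothetical stand-in for a full forward pass plus argmax, and token ids are plain Python ints. What it shows is that every step receives the last seq_len tokens of the running sequence, never only the newest token.

```python
def greedy_generate(next_token_fn, input_ids, max_new_tokens, seq_len,
                    eos_token_id=None):
    """Greedy decoding for a model that keeps no state between calls.

    next_token_fn (hypothetical): takes a list of token ids and returns
    the next token id. It is called with the trailing window each step.
    """
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        # Re-feed the last seq_len tokens; the model has no cached state.
        window = ids[-seq_len:]
        next_id = next_token_fn(window)
        ids.append(next_id)
        if eos_token_id is not None and next_id == eos_token_id:
            break
    return ids
```

With a stateful cache the window would shrink to a single token after the first step. Here it stays at up to seq_len tokens on every call, which is the behavior the constraint requires.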

Proposed Fix Stages

See attached agent prompt package for executable stage-by-stage tasks.
