Description

I saved the model in main.py and reloaded it to do text generation (see the save/reload sketch after the architecture printout). The reloaded model's architecture looks like this:
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x QLlamaDecoderLayer(
        (self_attn): QLlamaAttention(
          (q_proj): QLinearLayer()
          (k_proj): QLinearLayer()
          (v_proj): QLinearLayer()
          (o_proj): QLinearLayer()
          (rotary_emb): LlamaRotaryEmbedding()
          (act_quant): Quantizer()
          (v_quant): Quantizer()
          (k_quant): Quantizer()
        )
        (mlp): QLlamaMLP(
          (gate_proj): QLinearLayer()
          (down_proj): QLinearLayer()
          (up_proj): QLinearLayer()
          (act_fn): SiLU()
          (act_quant): Quantizer()
        )
        (input_layernorm): QLlamaRMSNorm(
          (originalNorm): LlamaRMSNorm()
          (act_quant): Quantizer()
        )
        (post_attention_layernorm): QLlamaRMSNorm(
          (originalNorm): LlamaRMSNorm()
          (act_quant): Quantizer()
        )
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
```
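For context, here is roughly what I mean by save/reload (a minimal sketch, assuming whole-module serialization with torch.save; the path atom_llama.pt is a placeholder). Since QLinearLayer, Quantizer, etc. are custom wrapper classes, from_pretrained cannot reconstruct them, so the whole module object is pickled:

```python
import torch

# In main.py, after quantization: pickle the whole module, because the
# Q* wrapper classes are not registered with transformers and cannot be
# restored via from_pretrained. (Assumption: this is the save path used.)
torch.save(model, "atom_llama.pt")  # placeholder path

# Later, in the generation script: the Q* class definitions must be
# importable here, otherwise unpickling fails with an AttributeError.
loaded_model = torch.load("atom_llama.pt", map_location="cpu")
```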
When I call generate() from the transformers library with use_cache=True:

```python
import torch
from transformers import AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

tokenizer = AutoTokenizer.from_pretrained("../../llama-2-7b-hf")

# loaded_model is the quantized model reloaded from disk (saved in main.py)
loaded_model.eval()
loaded_model = loaded_model.to(device)

input_text = "explain what is AI"
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=True).to(device)

output = loaded_model.generate(
    inputs.input_ids,
    max_length=50,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    use_cache=True,  # True raises the KeyError below; False runs but produces gibberish
)
generated_text = tokenizer.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print("Generated text:", generated_text)
```
it raises:

```
KeyError                                  Traceback (most recent call last)
...
File ~/miniconda3/envs/atom_env/lib/python3.10/site-packages/transformers/cache_utils.py:97, in DynamicCache.__getitem__(self, layer_idx)
     95     return (self.key_cache[layer_idx], self.value_cache[layer_idx])
     96 else:
---> 97     raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")

KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
```
The error message suggests the KV cache is never populated. When setting use_cache=False instead, generation runs, but the text it produces doesn't make sense.
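To check whether the gibberish comes from generate()'s cache handling or from the quantized model itself, a cache-free manual greedy loop can be used (a minimal sketch, assuming the model returns standard CausalLM logits; it re-runs the full sequence every step, so it is slow but never touches DynamicCache):

```python
import torch

# Greedy decoding without any KV cache: re-run the full sequence each step.
input_ids = tokenizer("explain what is AI", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    for _ in range(50):
        logits = loaded_model(input_ids, use_cache=False).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

If this also produces nonsense, the problem would be in the reloaded weights/quantizers rather than in generate().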

How can I use the Atom model to do text generation? How would you suggest I approach this? Thank you.