
feat: batch forward for prompt #25

Open
rain1201 wants to merge 1 commit into RightNow-AI:main from rain1201:pr

Conversation

@rain1201 commented Mar 24, 2026

What does this PR do?

Add a batch forward pass that processes the full prompt in a single pass, reading the model weights only once instead of once per prompt token.

Type of change

  • Performance improvement

Testing

  • Tested on x86-64 (Linux/macOS/Windows)
  • Tested with TinyLlama 1.1B Q4_K_M

Test command:

./picolm model.gguf -p "The capital of France is" -n 20 -t 0

Output:

Loading model: ../../tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
Model config:
  n_embd=2048, n_ffn=5632, n_heads=32, n_kv_heads=4
  n_layers=22, vocab_size=32000, max_seq=2048
  head_dim=64, rope_base=10000.0
Allocating 1.17 MB for runtime state (+ 44.00 MB FP16 KV cache)
Tokenizer loaded: 32000 tokens, bos=1, eos=2
Prompt: 6 tokens, generating up to 20 (temp=0.00, top_p=0.90, threads=4)
---
Paris.

2. B.C. The capital of ancient Rome was Rome.

3
---

Checklist

  • Code compiles without warnings (make native)
  • No new dependencies added
  • Memory usage not increased (check stderr output)

