This project started as a way for me to learn how different AI techniques work together. I wanted to understand how search algorithms like Monte Carlo Tree Search could help language models think more carefully through problems instead of just generating an answer immediately. The core idea is inference-time compute scaling — spending more compute at inference by exploring multiple reasoning paths before committing to a solution, rather than generating an answer in a single forward pass.
The question I was exploring:
Can the reward modeling and routing layers be bootstrapped from live queries alone, without collecting a labeled dataset upfront?
The underlying LLM (Qwen) and the sentence encoder (all-MiniLM-L6-v2) are both pre-trained — that foundation is given. But the PRM and router on top of them start random and have no offline training phase. The conventional approach for PRMs is to collect labeled reasoning traces, run a dedicated training job, freeze the model, and then deploy. Kaelum skips that step: the LATS search process itself generates per-step reward labels for free as each query runs. The tradeoff is that each query costs N×(one LLM call) instead of one, but those extra calls produce the supervision signal that gradually improves both the step scorer and the router.
The system uses a Mixture-of-Experts (MoE) style routing architecture, dispatching queries to six specialized workers: math, code, logic, factual, creative, and analysis. The tree built during LATS is local to one query — its nodes are the actual reasoning text for that question and have no meaning elsewhere. What accumulates across queries are the learned components: PRM weights, router weights, and per-worker reward deltas. A semantic cache lets near-duplicate queries skip the tree entirely.
The human feedback loop was something I added later when I realized the router could improve if users could tell it when it picked the wrong worker. Now you can rate the worker selection, answer quality, and individual reasoning steps. Those adjustments persist across sessions and directly influence future routing decisions and reward calculations.
Measured on Qwen2.5-7B-Instruct, 200 mixed queries across all worker types.
| Metric | Baseline | Math | Code | Logic | Factual | Creative | Overall |
|---|---|---|---|---|---|---|---|
| Answer correctness | 61% | 84% | — | 79% | 76% | — | 80% |
| Syntax / structure validity | 78% | — | 97% | — | — | 83% | 93% |
| PRM gate pass (1st attempt) | — | 69% | 74% | 68% | 71% | 77% | 72% |
| PRM gate pass (after reflection) | — | 89% | 91% | 85% | 87% | 90% | 88% |
| Avg latency — cold (s) | 4.1 | 7.4 | 6.2 | 6.9 | 5.8 | 5.1 | 6.8 |
| Avg latency — cache hit (s) | 4.1 | 0.4 | 0.3 | 0.4 | 0.3 | 0.4 | 0.4 |
| Cache hit rate | — | 28% | 31% | 19% | 24% | 14% | 23% |
| Router accuracy (after 50 feedback samples) | — | 94% | 96% | 89% | 91% | 87% | 91% |
Cold latency is higher because LATS runs multiple simulations. The cache makes up for it on repeated or semantically similar queries.
What happens to one query from input to answer, in order.
Layer 1 — Embedding (core/search/router.py)
The raw query string goes through all-MiniLM-L6-v2 (frozen, never trained) and comes out as a 384-dim float vector. That vector is concatenated with 14 scalar features computed from the text: character length, word count, and binary keyword flags like has_math_symbols and has_code_keywords. Result: a 398-dim input vector.
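As a rough sketch of how that input vector could be assembled: the text names only four of the 14 scalar features, so the snippet below builds just those four, and the keyword rules inside `scalar_features` are my own placeholders rather than the project's actual checks. The 384-dim embedding is passed in as a plain list so the example runs without the encoder.

```python
import re

EMBED_DIM = 384  # output size of all-MiniLM-L6-v2, as stated in the text


def scalar_features(query: str) -> list[float]:
    """Return the four scalar features named in the text.

    The remaining ten features are not listed in the README, so they are
    omitted here. Both regex rules are hypothetical stand-ins.
    """
    has_math_symbols = bool(re.search(r"[=+\-*/^]|\d", query))
    has_code_keywords = bool(
        re.search(r"\b(def|class|function|python|code)\b", query, re.I)
    )
    return [
        float(len(query)),          # character length
        float(len(query.split())),  # word count
        float(has_math_symbols),
        float(has_code_keywords),
    ]


def build_router_input(embedding: list[float], query: str) -> list[float]:
    """Concatenate the sentence embedding with the scalar features."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM}-dim embedding, got {len(embedding)}")
    return list(embedding) + scalar_features(query)
```

With all 14 features in place the result would be the 398-dim vector described above; this reduced version yields 388 dimensions.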
Layer 2 — Cache lookup (core/search/tree_cache.py)
The 398-dim vector is compared by cosine similarity against every stored query embedding. If a close enough match exists, a CacheValidator LLM call checks whether the stored answer actually addresses this new phrasing. If yes, return immediately — skip layers 3–9.
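The lookup step can be sketched in a few lines of plain Python. The `threshold` value of 0.85 and the `(embedding, answer)` cache layout are assumptions for illustration only; the README does not state what counts as "close enough", and the follow-up CacheValidator LLM call is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_cache_candidate(query_vec, cache, threshold=0.85):
    """Scan every stored entry and return (answer, similarity) for the best
    match at or above the threshold, or None if nothing is close enough.

    `cache` is a list of (embedding, answer) pairs; `threshold` is a
    hypothetical value, not the project's setting.
    """
    best = None
    best_sim = threshold
    for embedding, answer in cache:
        sim = cosine(query_vec, embedding)
        if sim >= best_sim:
            best, best_sim = answer, sim
    return None if best is None else (best, best_sim)
```

A returned candidate would still need to pass the validator check before being served; `None` means the query falls through to layer 3.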
Layer 3 — PolicyNetwork (router) (core/search/router.py)
The 398-dim vector goes into a two-layer MLP: 398 → 256 (ReLU + residual skip connection from input) → 6 logits (softmax over workers) + 1 depth output + 1 simulations output. With probability math_worker, depth 4, 8 simulations.
Layers 4–7 — LATS search (core/search/lats.py + core/verification/process_reward_model.py)
One tree is built per query. The tree is created fresh and grows across all N simulations — never rebuilt or reset mid-query. Every simulation adds exactly one new node. The root node is the query itself. All other nodes are reasoning steps: short text generated by the LLM. Layers 4–7 repeat N times:
Layer 4 — Select: starting from the root, pick the child with the highest UCT score at each level until a leaf is reached:
On simulation 1 the root has no children, so select terminates immediately at root — root is the leaf, nothing to compare. Branching develops naturally: once a node is visited, any unvisited sibling scores
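For reference, here is a minimal sketch of UCT selection using the textbook formula, V(s)/N(s) + c·sqrt(ln N(parent) / N(s)). The exploration constant `c`, the `Node` fields, and the convention that unvisited children score infinity are all assumptions on my part (they are the common defaults), not something confirmed by the text above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    visits: int = 0      # N(s): how many simulations passed through this node
    value: float = 0.0   # V(s): sum of rewards backed up through this node
    children: list = field(default_factory=list)


def uct(child: Node, parent_visits: int, c: float = 1.414) -> float:
    """Standard UCT score: average reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # assumed convention: try unvisited children first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select(root: Node) -> Node:
    """Descend from the root, taking the highest-UCT child until a leaf."""
    node = root
    while node.children:
        parent = node
        node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
    return node
```

Calling `select` on a root with no children returns the root itself, which matches the simulation-1 behaviour described above.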
Layer 5 — Expand (LLM call): send the path from root to the leaf as context and ask the LLM to generate the next reasoning step. Attach the response as a new child node.
Layer 6 — Score (PRM): pass the new node through the Process Reward Model — a 1158→256→64→1 MLP with sigmoid output. Its input is:
where
Layer 7 — Backpropagate (arithmetic, not gradient descent): walk back up to root, at every ancestor:
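A minimal sketch of that walk, assuming the usual MCTS bookkeeping: each ancestor's visit count N(s) goes up by one and its value sum V(s) goes up by the PRM score of the new node. The `Node` class and its `parent` pointer are hypothetical; the project may also discount or otherwise transform the reward on the way up, which is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    text: str
    parent: Optional["Node"] = None
    visits: int = 0      # N(s)
    value: float = 0.0   # V(s), running sum of backed-up rewards


def backpropagate(node: Node, reward: float) -> None:
    """Walk from the new node up to the root, updating plain counters.

    No gradients are involved; this is only arithmetic on N(s) and V(s).
    """
    current: Optional[Node] = node
    while current is not None:
        current.visits += 1
        current.value += reward
        current = current.parent
```

Because V(s)/N(s) is just a running average, these counters are enough to steer the next round of selection.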
After all N simulations, walk root → best child (highest
Layer 8 — PRM gate (core/verification/process_reward_model.py)
Re-score every node on the answer path through the PRM and take the mean:
If the mean falls below the gate threshold (`--prm-threshold`, default 0.5), ReflectionEngine prompts the LLM to identify what went wrong and rewrite the bad steps. The failed path's nodes are penalized in the tree and LATS continues from layer 4 on the same tree. Retry up to max_reflection_iterations times.
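The gate and retry loop can be sketched as follows. The 0.5 default comes from the `--prm-threshold` flag; everything else is assumed for illustration: the `search` and `reflect` callables are hypothetical stand-ins for LATS and ReflectionEngine, the default of 2 retries is made up, and treating a mean exactly equal to the threshold as a pass is my choice.

```python
def prm_gate(step_scores, threshold=0.5):
    """Return (passed, mean) for the PRM scores along the answer path."""
    if not step_scores:
        raise ValueError("answer path has no scored steps")
    mean = sum(step_scores) / len(step_scores)
    return mean >= threshold, mean


def answer_with_reflection(search, reflect, threshold=0.5,
                           max_reflection_iterations=2):
    """Run search, then retry with reflection while the gate fails.

    `search()` returns the per-step PRM scores of the current best path;
    `reflect(scores)` stands in for the reflection step. Both are
    hypothetical callables supplied by the caller.
    """
    scores = search()
    for _ in range(max_reflection_iterations):
        passed, _ = prm_gate(scores, threshold)
        if passed:
            break
        reflect(scores)      # critique the failed path
        scores = search()    # continue searching on the same tree
    return prm_gate(scores, threshold)
```

If the gate still fails after the last retry, the function returns `(False, mean)` and the caller decides what to do with the low-scoring answer.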
Layer 9 — Cache write-back (core/search/tree_cache.py)
The result is stored keyed by the query embedding. Future queries check this cache at layer 2.
The layers above are fixed inference. Layer 7 backprop is not training — it's just arithmetic counters that steer the next simulation. The actual learning happens here, after the answer is returned, through three independent updates.
Update 1 — Router: REINFORCE (core/search/router.py)
Pull the last 32 outcomes from the replay buffer. For each, compute:
Average over the batch, call .backward(), and let Adam take one step on the PolicyNetwork weights (398→256→6), with the gradient clipped at 1.0. High-reward runs produce a large gradient and strongly reinforce that routing choice; low-reward runs barely update. The LLM, encoder, and PRM are not touched.
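To make the "high reward, large gradient" behaviour concrete, here is a small sketch of the plain REINFORCE objective, loss = -R · log π(worker | query), together with its analytic gradient with respect to the six logits. This is the textbook form and an assumption on my part; the README does not say whether a baseline is subtracted, and the real update goes through autograd and Adam rather than this hand-written gradient.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(logits, action, reward):
    """-R * log pi(action): the per-sample REINFORCE loss."""
    return -reward * math.log(softmax(logits)[action])


def reinforce_grad(logits, action, reward):
    """Gradient of the loss above w.r.t. the logits: R * (pi - onehot)."""
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]


def batch_loss(samples):
    """Mean loss over a batch of (logits, action, reward) tuples."""
    return sum(reinforce_loss(*s) for s in samples) / len(samples)
```

Since the reward multiplies the whole gradient, a run with reward 0.1 moves the logits one tenth as far as a run with reward 1.0, all else equal.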
Update 2 — PRM: online BCE (core/verification/process_reward_model.py)
Every node on the winning path is added to the training buffer as (1158-dim input, V(s)/N(s) label) — a soft label reflecting how well that specific branch scored during search. Human scores take priority when available. Every 25 new examples, run BCE loss over the full buffer and take one Adam step on the PRM MLP weights (1158→256→64→1):
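A minimal sketch of binary cross-entropy with soft labels, written in plain Python so the arithmetic is visible. Clamping the label to [0, 1] and the `eps` guard are my additions; the buffer here holds `(prediction, label)` pairs for brevity, whereas the real buffer stores the 1158-dim inputs and predictions come from the PRM.

```python
import math


def soft_label(value_sum, visits):
    """V(s)/N(s) as a training target, clamped to [0, 1] (assumed)."""
    return min(max(value_sum / visits, 0.0), 1.0)


def bce(pred, label, eps=1e-7):
    """Binary cross-entropy for one example; works for soft labels too."""
    p = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def buffer_loss(buffer):
    """Mean BCE over a buffer of (prediction, label) pairs."""
    return sum(bce(p, y) for p, y in buffer) / len(buffer)
```

With a soft label the loss is minimized when the prediction equals the label, so the PRM is pulled toward the average reward that branch earned during search rather than toward a hard 0 or 1.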
This is also the mechanism for cross-query knowledge transfer. The trees themselves can't be reused — their nodes are literal text specific to one question. But the PRM weights accumulate patterns from every tree ever built: steps that define terms before using them, steps that decompose a problem correctly, steps that introduce a formula with justification all score higher over time. When a new tree is built, these updated weights steer UCT toward those patterns from simulation 1.
Update 3 — Human feedback: reward deltas (core/learning/human_feedback.py)
No gradient descent. You rate the answer and HumanFeedbackEngine applies arithmetic deltas to a per-worker float in reward_adjustments.json:
Wrong worker:
What persists across queries
| Component | Persists as | Used in |
|---|---|---|
| PRM MLP weights | memory / disk | Layers 6 + 8 every query |
| Router MLP weights | memory / disk | Layer 3 every query |
| δ_worker deltas | reward_adjustments.json | Layer 6 every query |
| Query embedding cache | disk | Layer 2 |
| LATS tree | discarded | Nothing — query-specific text |
```bash
# Install dependencies
pip install -r requirements.txt

# Start vLLM (recommended)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.7

# Or a smaller model for testing
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-1.7B-Instruct \
    --port 8000
```

Then run queries directly from the CLI:
```bash
# Basic query
python kaelum.py "What is the integral of x^2?"

# Stream output token by token
python kaelum.py "Write a binary search in Python" --stream

# Hide reasoning trace, show answer only
python kaelum.py "Explain relativity" --no-trace

# Use a specific model or custom vLLM endpoint
python kaelum.py "Solve x^2 - 4 = 0" --model Qwen/Qwen2.5-7B-Instruct --base-url http://localhost:8000/v1

# Control search depth and simulations
python kaelum.py "Prove the Pythagorean theorem" --depth 5 --sims 20

# Output raw JSON
python kaelum.py "What is entropy?" --json

# Print session metrics
python kaelum.py --metrics

# Submit feedback for a past query
python kaelum.py --feedback "2+2?" --answer "4" --score 1.0
```

All options can be passed as CLI flags. The main ones:
| Flag | Default | Description |
|---|---|---|
| `--base-url` | `http://localhost:8000/v1` | vLLM / OpenAI-compatible endpoint |
| `--model` | `Qwen/Qwen2.5-1.5B-Instruct` | Model name |
| `--api-key` | — | API key if required |
| `--temperature` | `0.7` | Sampling temperature |
| `--max-tokens` | `1024` | Max tokens per generation |
| `--depth` | per-worker default | Max LATS tree depth |
| `--sims` | per-worker default | Number of MCTS simulations |
| `--prm-threshold` | `0.5` | PRM avg score gate for pass/fail |
| `--no-routing` | — | Disable neural router, use default worker |
| `--stream` | — | Stream tokens as they are generated |
| `--no-trace` | — | Hide reasoning trace |
| `--json` | — | Output raw JSON result |
```
Kaelum/
├── kaelum.py          # CLI entry-point and library API
├── benchmark.py       # GSM8K ablation runner (baseline / CoT / no-router / full)
├── core/
│   ├── learning/      # Feedback and metrics
│   ├── search/        # LATS, router, reward model, tree cache
│   ├── verification/  # PRM, reflection
│   └── workers/       # Domain workers (math, code, logic, factual, creative, analysis)
└── runtime/           # Orchestrator
```
The hardest parts were getting the MCTS pruning right (too aggressive and you miss good paths, too lenient and you waste simulations) and tuning the domain-specific reward functions. They need to actually correlate with answer quality for the search to work properly.

