@@ -186,23 +186,29 @@ This document lives in `foundational_brain/BEHIND_THE_SCENES.md` and explains th
 ## Appendix A (v0.2 pipeline specifics)

 ### A1. JSONL → vector mapping
+
 - Each record has a free‑text symptom map (Name → Severity 0–10) and a `label_name`.
 - We map names to fixed symptom IDs; build x = [presence; severity], where presence_i = 1 if severity_i > 0 else 0 and severity_i ∈ [0, 1].
 - Label is mapped to a class index and then one‑hot y.

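A minimal sketch of the A1 mapping, for orientation only. The vocabulary `SYMPTOM_IDS`, the class table `CLASS_IDS`, and the record field name `symptoms` are hypothetical; only `label_name` appears in the text above. Dividing the 0–10 score by 10 to reach [0, 1] is also an assumption about how the rescaling is done.

```python
# Hypothetical symptom vocabulary and class table; the real IDs live in the project code.
SYMPTOM_IDS = {"fever": 0, "cough": 1, "dysuria": 2, "frequency": 3}
CLASS_IDS = {"respiratory": 0, "uti": 1}


def record_to_xy(record):
    """Map one JSONL record to (x, y) with x = [presence; severity]."""
    n = len(SYMPTOM_IDS)
    presence = [0.0] * n
    severity = [0.0] * n
    for name, score in record["symptoms"].items():
        idx = SYMPTOM_IDS.get(name.strip().lower())
        if idx is None:
            continue  # unknown free-text name: skipped in this sketch
        # Assumption: the 0-10 score is clamped, then rescaled to [0, 1].
        severity[idx] = min(max(score, 0), 10) / 10.0
        presence[idx] = 1.0 if severity[idx] > 0 else 0.0
    y = [0.0] * len(CLASS_IDS)
    y[CLASS_IDS[record["label_name"]]] = 1.0  # one-hot label
    return presence + severity, y


example = {"symptoms": {"Fever": 7, "Cough": 0}, "label_name": "respiratory"}
x, y = record_to_xy(example)
```

Here `x` holds the four presence flags followed by the four rescaled severities, so a symptom reported with severity 0 contributes 0 to both halves.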
 ### A2. Class balance and explicit negatives
+
 - Balanced per‑class counts (or class weights) suppress prior skew.
 - Explicit negatives encode “absence of key symptoms” (e.g., dysuria=0, frequency=0 in respiratory cases), teaching strong negative evidence.

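One common way to derive class weights when per-class counts are not already balanced is the inverse-frequency heuristic w_c = N / (K · n_c). The sketch below illustrates that heuristic only; the text above does not say which weighting the pipeline uses, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


# Three UTI records and one respiratory record: the rarer class gets the larger weight.
weights = balanced_class_weights(["uti", "uti", "uti", "respiratory"])
```

With these weights each class contributes the same total weight to the loss, which is the sense in which the prior skew is suppressed.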
 ### A3. Training objective (softmax + cross‑entropy)
+
 - Same equations as main text; we optimize NLL with SGD.
 - Validation split for early stopping; select the best epoch by lowest val loss.
+ Implementation note: v2 applies gradient descent (subtracting gradients) for both layers; the foundational demo used MSE with a toy update.

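The pieces named in A3 can be sketched in a few lines. This is not the v2 network: it only shows the softmax, the NLL, the standard logit gradient p − y for softmax with cross-entropy, a descent step that subtracts the gradient, and best-epoch selection by lowest validation loss. All function names are hypothetical, and the step is applied to the logits directly to keep the example small.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def nll(z, y_onehot):
    """Negative log-likelihood (cross-entropy) of a one-hot target."""
    p = softmax(z)
    return -sum(t * math.log(pi) for t, pi in zip(y_onehot, p) if t > 0)


def logit_gradient(z, y_onehot):
    """For softmax + cross-entropy, dL/dz = p - y."""
    return [pi - t for pi, t in zip(softmax(z), y_onehot)]


def sgd_step(params, grads, lr=0.1):
    """Gradient descent: subtract the gradient, scaled by the learning rate."""
    return [w - lr * g for w, g in zip(params, grads)]


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


z, y = [0.0, 0.0], [1.0, 0.0]
z_next = sgd_step(z, logit_gradient(z, y))
```

One step moves the logits toward the true class, so `nll(z_next, y)` is lower than `nll(z, y)`.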
 ### A4. Probability calibration (temperature scaling)
- - Pick T* on the validation set by minimizing NLL(softmax(z/T)).
- - At inference, ŷ = softmax(z/T*). This improves reliability (confidence ≈ accuracy).
+
+ - Pick T\* on the validation set by minimizing NLL(softmax(z/T)).
+ - At inference, ŷ = softmax(z/T\*). This improves reliability (confidence ≈ accuracy).

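A minimal sketch of temperature scaling as described in A4, assuming a simple grid search over T. The grid range, the helper names, and the toy validation logits are all assumptions; the actual pipeline may use a different optimizer.

```python
import math


def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 flattens, T < 1 sharpens."""
    scaled = [v / T for v in z]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits, labels, T):
    """Average negative log-likelihood of the true class at temperature T."""
    losses = [-math.log(softmax(z, T)[y]) for z, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def fit_temperature(val_logits, val_labels, grid=None):
    """Pick T* from a grid by minimizing validation NLL (the grid is an assumption)."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda T: mean_nll(val_logits, val_labels, T))


# Toy validation set: four confident predictions, one of them wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
T_star = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T_star)
```

On this toy set only three of four confident predictions are correct, so the selected T\* is above 1 and the calibrated top probability drops toward the observed accuracy.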
 ### A5. Expected Information Gain (EIG) in adaptive questioning
+
 - Current posterior P(d) (after clinical rules).
 - For a candidate symptom s, approximate P(yes|d) from disease symptom frequencies.
 - P(yes) = Σ_d P(d) P(yes|d); P(no) = 1 − P(yes).
@@ -211,14 +217,17 @@ This document lives in `foundational_brain/BEHIND_THE_SCENES.md` and explains th
 - EIG(s) = H(P) − [P(yes) H(P(d|yes)) + P(no) H(P(d|no))]. We ask the s with highest EIG.

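The EIG computation above can be sketched as follows. The lines between the two hunks are not shown here, so two details are assumptions: the conditional posteriors come from Bayes' rule, P(d|yes) ∝ P(d) P(yes|d), and H is Shannon entropy in bits. Disease names and frequencies are made-up illustration values.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a distribution given as {disease: prob}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def posterior(prior, likelihood):
    """Bayes update: P(d | answer) is proportional to P(d) * P(answer | d)."""
    joint = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(joint.values())
    if total == 0:
        return dict(prior)  # answer has zero probability; its weight is 0 anyway
    return {d: v / total for d, v in joint.items()}


def expected_information_gain(prior, p_yes_given_d):
    p_yes = sum(prior[d] * p_yes_given_d[d] for d in prior)
    p_no = 1.0 - p_yes
    post_yes = posterior(prior, p_yes_given_d)
    post_no = posterior(prior, {d: 1.0 - p for d, p in p_yes_given_d.items()})
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))


prior = {"uti": 0.5, "respiratory": 0.5}
candidates = {
    "dysuria": {"uti": 0.9, "respiratory": 0.05},  # discriminating
    "fatigue": {"uti": 0.5, "respiratory": 0.5},   # uninformative
}
best = max(candidates, key=lambda s: expected_information_gain(prior, candidates[s]))
```

A symptom that is equally likely under every disease has zero EIG, so the selection favours `dysuria` in this toy example.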
 ### A6. Evidence‑aware stop and triage
+
 - We stop only if (a) top‑1 probability ≥ threshold and (b) minimal supporting evidence exists (e.g., at least one GU key for UTI).
 - First question is selected from a small triage set (respiratory vs GU vs GI discriminators) to reduce early ambiguity.

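A sketch of the two-part stop rule in A6. The `KEY_SYMPTOMS` table, the 0.8 threshold, and the function name are hypothetical; the text only states that both a probability threshold and minimal supporting evidence are required.

```python
# Hypothetical key-symptom table; the real lists live in the project code.
KEY_SYMPTOMS = {
    "uti": {"dysuria", "frequency"},
    "respiratory": {"cough"},
}


def should_stop(posterior, confirmed, threshold=0.8, min_keys=1):
    """Stop only if the top class is confident AND has supporting evidence."""
    top, p_top = max(posterior.items(), key=lambda kv: kv[1])
    support = KEY_SYMPTOMS.get(top, set()) & set(confirmed)
    return p_top >= threshold and len(support) >= min_keys


confident = {"uti": 0.9, "respiratory": 0.1}
stop_without_evidence = should_stop(confident, {"fever"})
stop_with_evidence = should_stop(confident, {"fever", "dysuria"})
```

The first call keeps asking despite a 0.9 posterior because no GU key symptom has been confirmed; the second call stops.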
 ### A7. Quick metrics
+
 - Confusion matrix and ECE bins provide a fast snapshot of class separation and calibration.
 - ECE = Σ_bins (n_bin/N) |mean_conf − mean_acc|.

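The ECE formula above can be computed as in the sketch below. Equal-width confidence bins on [0, 1] and the default of 10 bins are assumptions; the text does not specify the binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (n_bin / N) * |mean_conf - mean_acc|."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        mean_acc = sum(1.0 if ok else 0.0 for _, ok in items) / len(items)
        ece += len(items) / n * abs(mean_conf - mean_acc)
    return ece


# Four predictions at 0.95 confidence, three of them correct.
ece = expected_calibration_error([0.95] * 4, [True, True, True, False])
```

All four predictions land in one bin with mean confidence 0.95 and accuracy 0.75, so the gap between the two is what the metric reports.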
 ### A8. Where to look
+
 - Data gen: `medical_diagnosis_model/data/generate_v02.py`
 - Train from JSONL: `medical_diagnosis_model/versions/v2/medical_neural_network_v2.py`
 - Pipeline + metrics: `medical_diagnosis_model/tools/train_pipeline.py`