
dlekdns08/UASEF


UASEF

Uncertainty-Aware Safe Escalation Framework for Medical LLM Agents

UASEF is a research framework in which an LLM-based medical agent quantifies its own uncertainty, assesses risk, and automatically hands the case off to a human expert.


Table of Contents

  1. Research Background and Motivation
  2. Core Design Philosophy
  3. Project Structure
  4. Architecture Details
  5. Datasets
  6. Experiment Design
  7. Evaluation Metrics
  8. Installation and Environment Setup
  9. Running Experiments
  10. Output Files
  11. Recommended Settings for the Paper
  12. References

1. Research Background and Motivation

Problem Definition

The biggest barrier to deploying LLMs in clinical settings is that the model does not know when it is wrong. In the medical domain, misplaced confidence (overconfidence) can lead to fatal outcomes. Existing approaches have three limitations.

| Existing approach | Limitation |
|---|---|
| Simple threshold-based | The threshold is arbitrary and carries no statistical guarantee |
| Human-in-the-loop (always) | No autonomous handling, which raises operating cost and latency |
| Probability calibration | Guarantees break down under domain change (distribution shift) |

What UASEF Proposes

UASEF applies Conformal Prediction (CP) theory to medical LLMs to achieve three things at once.

  1. Statistical guarantee: P(s_test ≤ q̂) ≥ 1 - α, a theoretically proven coverage bound
  2. Dynamic risk awareness: the threshold is adjusted automatically by specialty and scenario
  3. Selective escalation: the AI handles cases it can process autonomously, and only uncertain or high-risk cases are handed off to a specialist

2. Core Design Philosophy

Why Conformal Prediction?

Problems with conventional probability calibration (temperature scaling, Platt scaling)

  • Even when the model reports "70% confidence", nothing guarantees it is right 70% of the time
  • Calibration does not carry over to domains that differ from the training distribution

Strengths of Conformal Prediction

  • No distributional assumption is needed. Only exchangeability is assumed
  • q̂ = the ⌈(n+1)(1-α)⌉-th smallest nonconformity score, i.e. the ⌈(n+1)(1-α)⌉/n empirical quantile → this single threshold gives P(s_test ≤ q̂) ≥ 1-α
  • Under exchangeability, "set α = 0.05 → actual escalation miss rate ≤ 5%" is therefore mathematically guaranteed

Why use token logprobs as the nonconformity score?

A nonconformity score quantifies "how different this test point is from the calibration set". UASEF uses the mean negative log-likelihood.

s(x) = -mean(log P(t_i | context, t_1, ..., t_{i-1}))

Reasons for this choice:

  • It directly reflects the probability of every token the model generated → uncertainty of the answer-generation process itself
  • It stays meaningful at temperature = 0 (decoding is greedy, but the logprobs still reflect the distribution)
  • No extra API call is needed (one generate call yields both the score and the answer)
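The formula above can be sketched in a few lines. This is a minimal illustration, not the code in models/uqm.py; the function name `nonconformity_score` and the example probabilities are assumptions made for the sketch.

```python
import math
from typing import Sequence


def nonconformity_score(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the generated tokens (higher = more uncertain)."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return -sum(token_logprobs) / len(token_logprobs)


# Illustrative inputs: an answer generated with high per-token probability
# versus one generated with low per-token probability.
confident = nonconformity_score([math.log(0.9), math.log(0.8), math.log(0.95)])
hesitant = nonconformity_score([math.log(0.4), math.log(0.3), math.log(0.5)])
```

The hesitant answer receives the larger score, which is what lets a single threshold on `s(x)` separate the two.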

Why split into three modules?

UQM  →  "How hard is this question?" (CP-based statistical measurement)
RTC  →  "How hard must it be before we escalate?" (risk-based threshold)
EDE  →  "Do we finally escalate?" (decision that integrates multiple signals)

Separating the three concerns means:

  • UQM can be swapped purely as a CP-theory component (weighted CP, conformal risk control, etc.)
  • RTC's specialty risk ontology can be updated independently from clinical expert feedback
  • EDE's trigger policy can be tuned to each institution's protocols

Why the LangGraph ReAct structure?

Reasons for choosing a reasoning-action loop (Reasoning + Acting) rather than plain query-response:

  • Medical questions need tool use (drug interaction DB, guideline search) more than a single answer
  • UASEF judges independently from outside the agent, not inside it → a structure that audits the agent's output
  • LangGraph's StateGraph cleanly separates the ReAct loop from the UASEF check node

3. Project Structure

UASEF/
├── models/                         # Core modules
│   ├── model_interface.py          # Unified LMStudio / OpenAI abstraction layer
│   ├── uqm.py                      # Uncertainty Quantification Module (CP-based)
│   ├── rtc_ede.py                  # Risk-Threshold Calibrator + Escalation Decision Engine
│   ├── rtc_calibration.py          # ★ RTC multiplier Pareto sweep (derived from data)
│   ├── entropy_calibration.py      # ★ Automatic entropy threshold via Youden's J
│   └── ede_coefficient_search.py   # ★ Grid search over EDE confidence coefficients
│
├── agent/                          # LangGraph ReAct agent
│   ├── graph.py                    # StateGraph assembly
│   ├── nodes.py                    # Node functions + AgentComponents
│   ├── state.py                    # MedicalAgentState TypedDict
│   └── tools.py                    # Four medical tools (drug, guideline, lab, DDx)
│
├── data/
│   ├── loader.py                   # MedQA / MedAbstain / PubMedQA / MIMIC-III loaders
│   ├── raw/                        # Location of local JSONL files (.gitignore)
│   └── README.md                   # Data sources and download guide
│
├── experiments/
│   ├── configs/                    # Per-scenario YAML settings
│   │   ├── base_config.yaml        # Shared defaults (includes calibration results)
│   │   ├── scenario_emergency.yaml
│   │   ├── scenario_rare_disease.yaml
│   │   └── scenario_multimorbidity.yaml
│   ├── config_utils.py             # ★ Shared calibration config loader
│   ├── run_calibration_pipeline.py # ★ Calibration pipeline (Step 1→5)
│   ├── run_experiment.py           # Sequential pipeline experiment (LMStudio vs OpenAI)
│   ├── run_agent_experiment.py     # LangGraph agent experiment
│   ├── run_baseline_comparison.py  # Baseline comparison (no_esc / threshold_only / full_uasef)
│   ├── eval_medabstain.py          # MedAbstain AP/NAP classification accuracy evaluation
│   ├── pareto_sweep.py             # α sweep → Pareto frontier + α recommendation
│   ├── run_all_experiments.py      # ★ Runs all experiments + generates a summary report
│   └── visualize_results.py        # Result visualization
│
├── results/                        # Experiment results (auto-generated, .gitignore)
├── pyproject.toml
└── .env.example

4. Architecture Details

4.1 UQM: Uncertainty Quantification Module

File: models/uqm.py

UQM turns the uncertainty of a single question into a number backed by a statistical guarantee.

Internal flow

Question input
   ↓
_get_score(): call the LLM → collect token logprobs
   ↓
compute_nonconformity_score(): s = -mean(logprobs)
   ↓
Compare with calibrator.threshold → should_escalate
   ↓
compute_entropy(): per-position conditional entropy from top_logprobs
   ↓
Return UncertaintyResult

Conformal Calibration Formula

The threshold is computed from the calibration set {s_1, ..., s_n} of nonconformity scores:

q̂ = s_{(⌈(n+1)(1-α)⌉)}  ← the ⌈(n+1)(1-α)⌉-th smallest score

Guarantee: P(s_test ≤ q̂) ≥ 1 - α

The implementation uses numpy.quantile and corrects the level to min(1.0, ⌈(n+1)(1-α)⌉/n), which keeps the threshold conservative for finite samples.
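A small sketch of the finite-sample threshold follows. It picks the order statistic directly instead of calling numpy.quantile, so that the rank is explicit; the function name `conformal_threshold` and the rounding guard against floating-point error are assumptions of this sketch, not the repository's code.

```python
import math
from typing import Sequence


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Return q_hat, the ceil((n+1)(1-alpha))-th smallest calibration score.

    When that rank exceeds n, the largest score is returned, which mirrors
    clipping the quantile level at 1.0.
    """
    if not scores:
        raise ValueError("calibration set is empty")
    n = len(scores)
    # round() guards against results such as 18.000000000000004 before ceil()
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    rank = min(rank, n)
    return sorted(scores)[rank - 1]


cal_scores = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0]  # n = 9, illustrative values
q_hat = conformal_threshold(cal_scores, alpha=0.2)            # rank = ceil(10 * 0.8) = 8
```

With very small calibration sets the rank is clipped to n, so the threshold is simply the largest calibration score and the nominal 1 - α guarantee may no longer be attainable.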

Scoring Method Comparison

| Method | Formula | Characteristics | Role in the paper |
|---|---|---|---|
| logprob (Primary + Ablation) | s = -mean(token logprobs) | CP guarantee ✓, single query | Main contribution + Ablation |
| self_consistency (alternative) | s = Jaccard_diversity × 5 | CP guarantee ✓, N queries, no logprobs needed | Compatibility with black-box LLMs |
| auto | Detected at runtime | Risk of reduced reproducibility | Not recommended |

Why is logprob both Primary and Ablation? LM Studio's OpenAI-compatible API supports token-level logprobs, so OpenAI and local GGUF models use the same logprob nonconformity function. The purpose of the ablation is not to compare scoring methods but to verify that the CP coverage guarantee also holds in a local environment. self_consistency is an alternative for APIs that do not expose logprobs, such as the Claude API and the Gemini API.

Entropy Calculation

compute_entropy(response: ModelResponse) returns a valid entropy only when top_logprobs is available.

# Approximate the conditional distribution at each token position from the top-k logprobs
probs = softmax(top_k_logprobs)   # renormalize over the k candidates
H_pos = -sum(p * log(p))          # per-position entropy
H_avg = mean(H_pos)               # average over all positions (nats/token)

Without top_logprobs it returns float("nan"), because Shannon entropy cannot be computed from the logprob of the chosen token alone (a single value per position does not form a full vocabulary distribution).
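The pseudocode above can be made runnable as follows. This is a sketch under the assumption that each position's top-k logprobs arrive as a plain list of floats; the names `position_entropy` and `mean_entropy` are illustrative, and the real compute_entropy() takes a ModelResponse instead.

```python
import math
from typing import Sequence


def position_entropy(top_logprobs: Sequence[float]) -> float:
    """Entropy (nats) of the top-k candidates at one position, after renormalizing."""
    peak = max(top_logprobs)
    weights = [math.exp(lp - peak) for lp in top_logprobs]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_position: Sequence[Sequence[float]]) -> float:
    """Average per-position entropy (nats/token); NaN when no top_logprobs exist."""
    if not per_position:
        return float("nan")
    return sum(position_entropy(tl) for tl in per_position) / len(per_position)


uniform = position_entropy([math.log(0.25)] * 4)                    # four equally likely tokens
peaked = position_entropy([math.log(0.97)] + [math.log(0.01)] * 3)  # one dominant token
```

Because only k candidates are visible, the value is the entropy of the renormalized top-k distribution and is bounded above by log(k).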

Handling Distribution Shift

# Calibration: MedQA distribution
uqm.calibrate(cal_questions, distribution_source="medqa")

# Evaluation: MIMIC-III distribution (a different distribution!) → automatic warning + switch to Weighted CP
uqm.evaluate(question, distribution_source="mimic3")

Weighted CP (Tibshirani et al., 2019) restores the coverage guarantee when exchangeability is violated.

w_i = 1 + k × Jaccard(cal_i, test)   # density-ratio approximation

q̂_w = inf{q : Σ_{s_i ≤ q} w_i / (Σ w_i + w_{n+1}) ≥ 1-α}

w_{n+1} (the weight of the test point itself) must be included in the denominator for the CP lower bound to hold. w_{n+1} = 1 + k, since Jaccard(test, test) = 1.0 is the maximum similarity.
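The weighted quantile q̂_w can be sketched as a cumulative sum over sorted scores. The function name and the example numbers are assumptions; the sketch takes the weights as given and does not compute the Jaccard similarities.

```python
import math
from typing import Sequence


def weighted_conformal_threshold(
    scores: Sequence[float],
    weights: Sequence[float],
    test_weight: float,
    alpha: float,
) -> float:
    """Smallest q whose cumulative normalized weight reaches 1 - alpha.

    The test point's weight sits in the denominator only, so the threshold
    can be +inf when the calibration mass alone never reaches 1 - alpha.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for index in sorted(range(len(scores)), key=lambda i: scores[i]):
        cumulative += weights[index]
        if cumulative / total >= 1 - alpha:
            return scores[index]
    return math.inf


scores = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0]   # equal weights reduce to standard split CP
q_w = weighted_conformal_threshold(scores, weights, test_weight=1.0, alpha=0.25)
```

An infinite threshold means no score would exceed it, so a caller may want to treat that case explicitly rather than silently never escalating.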

Main UncertaintyResult Fields

| Field | Description |
|---|---|
| nonconformity_score | Nonconformity score: larger means more uncertain |
| margin | threshold - score: positive = safety margin, negative = threshold exceeded |
| confidence_entropy | Per-position conditional entropy (nats/token). nan when top_logprobs is missing |
| should_escalate | Whether score > threshold |
| weighted_cp_used | Whether Weighted CP was applied |
| prediction_set_size | Always 1. Kept for backward compatibility (with a binary outcome the prediction set is a single element) |

LLM Support Requirements

| scoring_method | logprobs required | Applicable LLMs | Role in the paper |
|---|---|---|---|
| logprob (Primary + Ablation) | Required | GPT-4o, GPT-4o-mini, LMStudio (llama.cpp) | Main contribution + Ablation |
| self_consistency (alternative) | Not required | Any LLM | Compatibility with black-box LLMs |

The logprob method raises ValueError on APIs that do not support token-level logprobs, such as the Claude API, Gemini API, and Cohere. Use self_consistency in those environments. LM Studio supports logprobs through its OpenAI-compatible API, so it uses the same method as Primary.


4.2 RTC: Risk-Threshold Calibrator

Files: models/rtc_ede.py, models/rtc_calibration.py

RTC dynamically adjusts the base threshold q̂ returned by UQM according to the risk level of the specialty and scenario.

Adjustment formula

adjusted_threshold = q̂ × risk_multiplier × scenario_multiplier

| Risk level | Default multiplier | Specialties |
|---|---|---|
| CRITICAL | ×0.60 | Emergency medicine, critical care, trauma surgery |
| HIGH | ×0.75 | Cardiology, neurology, oncology, cardiothoracic surgery |
| MODERATE | ×1.00 | Internal medicine, surgery, pediatrics, obstetrics and gynecology |
| LOW | ×1.30 | General outpatient, preventive medicine, dermatology, psychiatry |

The emergency and rare_disease scenarios apply an additional ×0.85.

Design rationale: in emergency medicine the cost of a missed escalation (False Negative) is far higher than in general outpatient care. Lowering the threshold escalates more cases but reduces the probability of missing a dangerous one. This trade-off is encoded in the specialty ontology.
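As a worked example of the adjustment formula, the sketch below multiplies a base threshold by the default multipliers from the table. The dictionary and function names are assumptions; the real RTC class takes its multipliers from base_config.yaml.

```python
from typing import Optional

# Default multipliers copied from the table above
RISK_MULTIPLIERS = {"CRITICAL": 0.60, "HIGH": 0.75, "MODERATE": 1.00, "LOW": 1.30}
# Scenarios that receive the additional x0.85 factor
SCENARIO_MULTIPLIERS = {"emergency": 0.85, "rare_disease": 0.85}


def adjusted_threshold(q_hat: float, risk_level: str, scenario: Optional[str] = None) -> float:
    """q_hat x risk_multiplier x scenario_multiplier (scenario factor defaults to 1.0)."""
    scenario_factor = SCENARIO_MULTIPLIERS.get(scenario, 1.0)
    return q_hat * RISK_MULTIPLIERS[risk_level] * scenario_factor


emergency = adjusted_threshold(1.0, "CRITICAL", "emergency")     # 0.60 * 0.85
outpatient = adjusted_threshold(1.0, "LOW")                       # 1.30
```

For the same q̂, an emergency case is escalated at roughly half the score that a low-risk outpatient case would need.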

๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ๋ฐฐ์œจ ์—ญ์‚ฐ (rtc_calibration.py)

์œ„ ํ‘œ์˜ ๋ฐฐ์œจ์€ ๊ธฐ๋ณธ๊ฐ’์ž…๋‹ˆ๋‹ค. run_calibration_pipeline.py๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ์—์„œ Pareto sweep์œผ๋กœ ๋ฐฐ์œจ์„ ์ž๋™ ์—ญ์‚ฐํ•ฉ๋‹ˆ๋‹ค.

๊ฐ ์œ„ํ—˜๋„ ์ˆ˜์ค€๋ณ„๋กœ ํ›„๋ณด ๋ฐฐ์œจ (์˜ˆ: CRITICAL โˆˆ {0.50, 0.55, 0.60, 0.65, 0.70}) sweep
โ†’ Safety Recall โ‰ฅ 0.95 AND Over-Escalation โ‰ค 0.15 ๋ฅผ ๋™์‹œ ์ถฉ์กฑํ•˜๋Š” ํ›„๋ณด ์ค‘
  Over-Escalation์ด ์ตœ์†Œ์ธ ๋ฐฐ์œจ ์„ ํƒ (์ œ์•ฝ ๋ถˆ์ถฉ์กฑ ์‹œ Safety Recall ์ตœ๋Œ€ fallback)
โ†’ ๊ฒฐ๊ณผ๋ฅผ base_config.yaml์˜ rtc ์„น์…˜์— ์ €์žฅ

๊ฒฐ๊ณผ๋Š” RTC(base_threshold, multipliers=cfg["rtc"]) ํ˜•ํƒœ๋กœ ๋ชจ๋“  ์‹คํ—˜ ํŒŒ์ผ์— ์ž๋™ ์ฃผ์ž…๋ฉ๋‹ˆ๋‹ค.
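The selection rule described above might look like the following. `SweepPoint`, `select_multiplier`, and the metric values are hypothetical; only the constraints (0.95, 0.15) and the fallback come from the text.

```python
from typing import NamedTuple, Sequence


class SweepPoint(NamedTuple):
    multiplier: float
    safety_recall: float
    over_escalation: float


def select_multiplier(
    points: Sequence[SweepPoint],
    min_recall: float = 0.95,
    max_over_escalation: float = 0.15,
) -> float:
    """Lowest over-escalation among feasible candidates, else highest recall."""
    feasible = [
        p for p in points
        if p.safety_recall >= min_recall and p.over_escalation <= max_over_escalation
    ]
    if feasible:
        return min(feasible, key=lambda p: p.over_escalation).multiplier
    return max(points, key=lambda p: p.safety_recall).multiplier


# Made-up sweep results for one risk level
sweep = [
    SweepPoint(0.50, 0.99, 0.22),
    SweepPoint(0.55, 0.97, 0.14),
    SweepPoint(0.60, 0.96, 0.10),
    SweepPoint(0.65, 0.93, 0.07),
]
chosen = select_multiplier(sweep)
```

Here 0.55 and 0.60 satisfy both constraints, and 0.60 wins because it over-escalates less.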

Pareto Frontier Analysis

rtc.pareto_frontier(sweep_results)

It takes the measured data from pareto_sweep.py and returns the (coverage, escalation_rate) pair for each (α, specialty) combination. This makes it possible to visualize the measured trade-off and recommend the best α.


4.3 EDE: Escalation Decision Engine

Files: models/rtc_ede.py, models/entropy_calibration.py, models/ede_coefficient_search.py

EDE combines three triggers to make the final escalation decision.

Trigger structure

Trigger 1 (UNCERTAINTY_EXCEEDED):
    nonconformity_score > adjusted_threshold
    → direct signal from CP theory (primary trigger)

Trigger 2 (HIGH_RISK_ACTION):
    CRITICAL_KEYWORDS detected   (EOL decisions, Code Blue)    → always triggers
    PROCEDURAL_KEYWORDS detected (intubation, vasopressors)    → triggers only together with UNCERTAINTY_MODIFIERS

Trigger 3 (NO_EVIDENCE):
    lack-of-evidence phrasing detected (see below)

Any one active → should_escalate = True

Trigger 2 design rationale: it prevents the False Positive in which a normal treatment recommendation such as "administer epinephrine for anaphylaxis" is escalated on keywords alone. Procedural keywords therefore fire only when they appear together with uncertainty phrasing (consider, may need, if deteriorates, etc.). EOL decisions such as DNR or withdraw care, by contrast, must never be made by the AI alone, so they always escalate.
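The two-tier keyword rule of Trigger 2 can be sketched as below. The constant names follow the text, but their contents are a small illustrative subset and the substring matching is an assumption; the actual lists and matching logic in models/rtc_ede.py may differ.

```python
# Illustrative subsets only; the real keyword lists are longer
CRITICAL_KEYWORDS = ("dnr", "withdraw care", "code blue")
PROCEDURAL_KEYWORDS = ("intubation", "vasopressor")
UNCERTAINTY_MODIFIERS = ("consider", "may need", "if deteriorates")


def high_risk_action(text: str) -> bool:
    """Trigger 2: critical keywords always fire; procedural ones need a modifier."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in CRITICAL_KEYWORDS):
        return True
    has_procedure = any(keyword in lowered for keyword in PROCEDURAL_KEYWORDS)
    has_modifier = any(modifier in lowered for modifier in UNCERTAINTY_MODIFIERS)
    return has_procedure and has_modifier


firm_plan = high_risk_action("Proceed with intubation now.")
hedged_plan = high_risk_action("Consider intubation if deteriorates.")
eol_case = high_risk_action("The family is asking about DNR status.")
```

A firm procedural recommendation passes through, while the hedged one and the end-of-life case both trigger escalation.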

Trigger 3 NO_EVIDENCE Keyword List

Lack-of-evidence phrases are managed by source. When reproducing the paper, use the source field as the citation basis.

| Source | Example phrases |
|---|---|
| medabstain | "i am not certain", "insufficient evidence", "limited data" |
| savage2025 | "this is unclear", "evidence is mixed", "conflicting data" |
| manual (500 GPT-4o cases) | "clinical judgment needed", "differential is broad" |

The detection function detect_no_evidence(text) returns (triggered: bool, matched_phrases: list[str]), so the matching evidence needed for reproduction is provided alongside the decision.

Confidence Calculation

confidence = min(1.0,
    len(triggers) / 3
    + (t1_weight if UNCERTAINTY_EXCEEDED in triggers else 0.0)   # default 0.4
    + (entropy_boost if entropy > entropy_threshold else 0.0)    # default 0.15; default threshold 2.0
)

Entropy is used only as a confidence weight, not as a separate trigger. All three coefficients (t1_weight, entropy_boost, entropy_threshold) are derived from data and stored in base_config.yaml.

Automatic entropy threshold (entropy_calibration.py)

Instead of hardcoding ENTROPY_HIGH_THRESHOLD = 2.0, the threshold is chosen automatically from calibration data using Youden's J statistic.

Youden's J = Sensitivity + Specificity - 1  (select the maximizing point)
→ Save the result to entropy_threshold in base_config.yaml
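One way to pick the maximizing point is to scan each observed entropy value as a candidate cut-off. The sketch assumes the decision rule "flag when entropy > threshold" and uses made-up data; the function name is illustrative and entropy_calibration.py may search differently.

```python
from typing import Sequence, Tuple


def youden_threshold(values: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative labels")
    best_threshold, best_j = min(values), -1.0
    for candidate in sorted(set(values)):
        true_pos = sum(1 for v, y in zip(values, labels) if y and v > candidate)
        true_neg = sum(1 for v, y in zip(values, labels) if not y and v <= candidate)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


entropies = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]                  # illustrative values
should_escalate = [False, False, False, True, True, True]
threshold, j_stat = youden_threshold(entropies, should_escalate)
```

On this separable toy data the scan settles on 1.5 with J = 1.0; real calibration data will usually give J well below 1.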

EDE coefficient grid search (ede_coefficient_search.py)

t1_weight     ∈ {0.2, 0.3, 0.4, 0.5}
entropy_boost ∈ {0.05, 0.10, 0.15, 0.20}

Optimization objective: F1-safety = harmonic_mean(Safety Recall, 1 − Over-Escalation Rate)
→ Save the result to the ede section of base_config.yaml
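The objective and the 4 × 4 grid can be sketched as follows. The `evaluate` callback, which would run the labeled cases and return (safety_recall, over_escalation_rate) for a coefficient pair, is hypothetical, as is the toy version used here to exercise the search.

```python
from itertools import product
from typing import Callable, Tuple

T1_WEIGHTS = (0.2, 0.3, 0.4, 0.5)
ENTROPY_BOOSTS = (0.05, 0.10, 0.15, 0.20)


def f1_safety(safety_recall: float, over_escalation_rate: float) -> float:
    """Harmonic mean of Safety Recall and (1 - Over-Escalation Rate)."""
    kept = 1.0 - over_escalation_rate
    if safety_recall + kept == 0.0:
        return 0.0
    return 2.0 * safety_recall * kept / (safety_recall + kept)


def grid_search(evaluate: Callable[[float, float], Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (t1_weight, entropy_boost) pair with the highest F1-safety."""
    return max(
        product(T1_WEIGHTS, ENTROPY_BOOSTS),
        key=lambda pair: f1_safety(*evaluate(*pair)),
    )


def toy_evaluate(t1_weight: float, entropy_boost: float) -> Tuple[float, float]:
    # Hypothetical stand-in: pretends (0.4, 0.15) gives the best metrics
    recall = 1.0 if t1_weight == 0.4 else 0.9
    over_escalation = 0.1 if entropy_boost == 0.15 else 0.2
    return recall, over_escalation


best_pair = grid_search(toy_evaluate)
```

The harmonic mean penalizes a configuration that buys recall by escalating almost everything, since either term near zero drags the score down.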

4.4 LangGraph Agent

Files: agent/graph.py, agent/nodes.py, agent/state.py

Graph flow

START → reason → [tool_calls?] → act ──→ reason  (ReAct loop, up to 5 iterations)
                                         ↓
                               uasef_check  ← independent re-evaluation of the original question
                               ↙          ↘
                          escalate      finalize
                             ↓               ↓
                           END             END

Key design decisions

① uasef_check is independent of the agent

The uasef_check node does not look at the agent's message history; it passes the original question straight to UQM. Even if the agent has gathered a lot of information through tools, UASEF makes its own judgment.

Reasons for this design:

  • UASEF acts as a safety net even when the agent gathers incorrect information
  • It follows the pattern of an external component that audits the agent's output

② Minimizing repeated LLM calls

The reason node already calls the LLM with logprobs=True. When uasef_check extracts the logprobs from the response_metadata of the last AIMessage and passes them to UQM as pre_computed_response, the second LLM call is skipped in logprob mode.

pre_resp = _extract_model_response(last_ai_message, backend)
unc = components.uqm.evaluate(question, pre_computed_response=pre_resp)
# With pre_resp available, the score is computed without calling the LLM again

③ Binding AgentComponents with functools.partial

Non-serializable objects (UQM, RTC, EDE) are not placed in the LangGraph State; they are passed to each node function through functools.partial. The State contains only JSON-serializable data.
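The binding pattern can be shown without LangGraph itself. In this sketch `AgentComponents` is a simplified stand-in holding a scoring callable and a threshold in place of the real UQM, RTC, and EDE objects, and the state is a plain dict.

```python
import json
from dataclasses import dataclass
from functools import partial
from typing import Any, Callable, Dict


@dataclass
class AgentComponents:
    score_fn: Callable[[str], float]  # stand-in for UQM scoring
    threshold: float                  # stand-in for the RTC-adjusted threshold


def uasef_check(state: Dict[str, Any], components: AgentComponents) -> Dict[str, Any]:
    """Node function: reads only serializable state, uses bound components."""
    score = components.score_fn(state["question"])
    return {**state, "nonconformity_score": score,
            "should_escalate": score > components.threshold}


# Toy scorer: longer questions count as more uncertain (illustration only)
components = AgentComponents(score_fn=lambda q: 0.1 * len(q.split()), threshold=0.5)

# The node the graph sees takes just `state`; components ride along via partial
uasef_node = partial(uasef_check, components=components)

new_state = uasef_node({"question": "Is warfarin safe with amiodarone in renal failure?"})
serialized = json.dumps(new_state)  # state stays JSON-serializable
```

Because the components never enter the state dict, the state can be checkpointed or logged as JSON without custom serializers.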

④ Four medical tools (mock implementations)

| Tool | Role | Replacement target in real research |
|---|---|---|
| drug_interaction_checker | Check drug interactions | Drugs@FDA API / Lexicomp |
| clinical_guideline_search | Search clinical guidelines | UpToDate / PubMed E-utilities |
| lab_reference_lookup | Look up lab reference ranges | LOINC / in-house LIS |
| differential_diagnosis | Differential diagnosis | Isabel DDx / in-house CDR |

5. Datasets

Automatic loading priority

1. data/raw/*.jsonl       (local JSONL files)
2. HuggingFace datasets   (automatic download)
3. Built-in fallback      (development/testing only, 30 items)

MedQA (USMLE 4-options)

  • Role: calibration + default scenario tests
  • Source: Jin et al., 2021, "What Disease does this Patient Have?"
  • HuggingFace ID: GBaker/MedQA-USMLE-4-options
  • Splits used: train (calibration), test (test scenarios)

It consists of four-option multiple-choice questions in the style of the USMLE (United States Medical Licensing Examination). It is suited to measuring uncertainty not only through whether the answer is correct but through why the model got it wrong.

MedAbstain

  • Role: rare-disease and uncertain scenarios, the core of the safety evaluation
  • Source: Zhu et al., 2023, "Can LLMs Express Their Uncertainty?"

There are four variants, and AP and NAP are central to the safety evaluation.

| Variant | Description | expected_escalate | Scenario used |
|---|---|---|---|
| AP | Abstention + Perturbed | True | Rare disease (uncertain + perturbed question) |
| NAP | Normal + Perturbed | True | Rare disease (normal answer but perturbed question) |
| A | Abstention only | True | General uncertain cases |
| NA | Normal | False | Normal cases (True Negative verification) |

Why are AP/NAP central? AP and NAP subtly perturb the original question to test the model's stability. A model that answers these perturbed questions confidently risks missing situations that should be escalated.

PubMedQA (optional)

  • Role: reinforce the rare_disease bucket + verify the NO_EVIDENCE trigger (Trigger 3)
  • Source: Jin et al., 2019, "PubMedQA: A Dataset for Biomedical Research Question Answering"
  • HuggingFace ID: pubmed_qa / pqa_labeled (1,000 expert-labeled)

Only cases with final_decision = "maybe" are set to expected_escalate=True and added to the rare_disease bucket. To enable:

# experiments/configs/base_config.yaml
data:
  include_pubmedqa: true

MIMIC-III (optional)

  • Role: distribution shift experiments with real ICU clinical notes
  • Requirement: a signed PhysioNet DUA (Data Use Agreement)
  • Purpose: verify that Weighted CP restores the coverage guarantee under distribution shift

The CP guarantee holds only when calibration and evaluation data come from the same distribution (exchangeability). Calibrating on MedQA and then evaluating on MIMIC-III breaks the CP guarantee, and restoring it with Weighted CP is the core of this experiment.


6. Experiment Design

Overall experiment pipeline

─── Calibration (once, run_calibration_pipeline.py) ───────────────
MedQA (unlabeled)
       ↓
   UQM.calibrate()         ← Split CP: compute q̂ on 80%, verify coverage on 20%
       ↓
MedQA/MedAbstain (labeled, calibration split)
       ↓
   entropy_calibration     ← Youden's J → entropy_threshold
   rtc_calibration         ← Pareto sweep → multiplier per risk level
   ede_coefficient_search  ← F1-safety grid search → t1_weight, entropy_boost
       ↓
   base_config.yaml update ← rtc / entropy_threshold / ede sections

─── Experiments (run_experiment.py, etc.) ─────────────────────────
MedQA/MedAbstain (test split)
       ↓
   RTC(multipliers=cfg["rtc"])        ← inject data-driven multipliers
   EDE(t1_weight, entropy_boost, ...) ← inject data-driven coefficients
       ↓
   UQM.evaluate() → EDE.decide()     ← combine 3 triggers → should_escalate
       ↓
  Safety Recall / Over-Escalation Rate / Conformal Coverage

6.0 Calibration Pipeline (run_calibration_pipeline.py)

Run it once before any experiment to replace the hardcoded defaults with data-driven values and save them to base_config.yaml. Every experiment file then reads and applies this config automatically.

Execution order

Step 1  CP Calibration         → UQM.calibrate() → compute base threshold q̂
Step 2  Collect labeled data   → load_scenarios() → UQM.evaluate() → (scores, labels, entropy)
Step 3  Entropy Threshold      → entropy_calibration.py → Youden's J → entropy_threshold
Step 4a RTC Multiplier Sweep   → rtc_calibration.py → optimal multiplier per risk level
Step 4b EDE Coefficient Search → ede_coefficient_search.py → (t1_weight, entropy_boost)
Step 5  Update base_config.yaml → overwrite the rtc / entropy_threshold / ede sections

Config injection flow

run_calibration_pipeline.py
    → base_config.yaml (updates the rtc, entropy_threshold, ede sections)
        ↓ config_utils.load_calibration_config()
        ↓
All experiment files (run_experiment, run_agent_experiment, eval_medabstain, ...)
    → RTC(base_threshold, multipliers=rtc_cfg)
    → EDE(t1_weight=..., entropy_boost=..., entropy_threshold=...)

Running

# Development test (fast)
python experiments/run_calibration_pipeline.py --backend openai

# Paper quality (recommended)
python experiments/run_calibration_pipeline.py --backend openai --n-cal 500 --n-labeled 50

Output: results/calibration_report.json + automatic update of base_config.yaml


6.1 Sequential Pipeline Experiment (run_experiment.py)

The baseline pipeline that runs UQM → RTC → EDE in order, without the LangGraph agent.

Experiment structure

| Label | Backend | Scoring method | Role in the paper |
|---|---|---|---|
| [Primary] | OpenAI (GPT-4o-mini) | logprob: CP based on token-level logprobs | Main results |
| [Ablation] | LMStudio (local, meta-llama-3.1-8b-instruct) | logprob: token-level logprobs extracted through LM Studio's OpenAI-compatible API | Verifies that "logprob CP also applies to local GGUF models" |

Both backends use the same logprob nonconformity function. The purpose of the ablation is not to compare scoring methods but to verify that token-level logprobs can also be extracted from local GGUF models through LM Studio's OpenAI-compatible API.

  • Scenarios: Emergency / Rare Disease / Multimorbidity

Config override hierarchy

# base_config.yaml → scenario_emergency.yaml → CLI arguments
# Sources on the right override those on the left.

uqm:
  alpha: 0.05
  scoring_method: logprob
  holdout_fraction: 0.2
data:
  n_calibration: 30      # recommended for the paper: 500
  n_test_per_scenario: 3 # recommended for the paper: 50
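A left-to-right override like the one described above is often implemented as a recursive dictionary merge. The sketch below assumes nested keys are merged rather than replaced wholesale, which the text does not confirm; `deep_merge` and the scenario and CLI values are illustrative.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base_cfg = {
    "uqm": {"alpha": 0.05, "scoring_method": "logprob", "holdout_fraction": 0.2},
    "data": {"n_calibration": 30, "n_test_per_scenario": 3},
}
scenario_cfg = {"uqm": {"alpha": 0.01}}      # hypothetical scenario override
cli_cfg = {"data": {"n_calibration": 500}}   # e.g. parsed from --n-cal 500

# base → scenario → CLI: later sources override earlier ones
final_cfg = deep_merge(deep_merge(base_cfg, scenario_cfg), cli_cfg)
```

Keys that no later source mentions, such as scoring_method, keep their base values.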

Experiment flow

  1. _build_datasets(): load MedQA / MedAbstain according to the config
  2. UQM.calibrate(): compute q̂ on the calibration set + verify empirical coverage on the hold-out
  3. Run UQM.evaluate() → EDE.decide() for each scenario
  4. compute_metrics(): TP/FN/FP/TN → Safety Recall, Over-Escalation Rate
  5. Save as JSON + CSV
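Step 4 can be illustrated with a small confusion-matrix helper. The function shares its name with compute_metrics() in the repository, but its signature and return format here are assumptions.

```python
from typing import Dict, Sequence


def compute_metrics(expected: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Confusion counts plus Safety Recall and Over-Escalation Rate."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)
    fn = sum(1 for e, p in pairs if e and not p)
    fp = sum(1 for e, p in pairs if not e and p)
    tn = sum(1 for e, p in pairs if not e and not p)
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        # Safety Recall = TP / (TP + FN); NaN when no case needed escalation
        "safety_recall": tp / (tp + fn) if tp + fn else float("nan"),
        # Over-Escalation Rate = FP / (FP + TN); NaN when every case needed escalation
        "over_escalation_rate": fp / (fp + tn) if fp + tn else float("nan"),
    }


expected = [True, True, True, False, False, False, False]    # expected_escalate labels
predicted = [True, True, False, True, False, False, False]   # should_escalate decisions
metrics = compute_metrics(expected, predicted)
```

In this toy run one needed escalation is missed and one unnecessary escalation occurs, giving a Safety Recall of 2/3 and an Over-Escalation Rate of 1/4.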

6.2 LangGraph Agent Experiment (run_agent_experiment.py)

The ReAct agent reasons with tools while UASEF judges escalation independently. The same cases as in the sequential pipeline experiment are run through the agent to compare the effect of tool use.

Additional measurements

  • react_iterations: number of reason node calls (reasoning depth)
  • tool_calls: usage count per tool
  • avg_tool_calls_per_case: average number of tool calls per case

Scenario → specialty mapping:

| Scenario | Specialty | RTC risk level | Threshold multiplier |
|---|---|---|---|
| emergency | emergency_medicine | CRITICAL | ×0.60 × 0.85 = ×0.51 |
| rare_disease | neurology | HIGH | ×0.75 × 0.85 = ×0.64 |
| multimorbidity | internal_medicine | MODERATE | ×1.00 |

Agent graph execution details:

graph.invoke(
    initial_state,
    config={"recursion_limit": 25}  # prevents infinite loops
)

max_iterations=5 and recursion_limit=25 are independent. max_iterations limits the number of reason node calls, while recursion_limit limits the total number of node transitions at the LangGraph level.


6.3 MedAbstain Classification Accuracy Evaluation (eval_medabstain.py)

Evaluates, as a binary classification problem, whether UASEF correctly detects the need for escalation across the four MedAbstain variants.

Metrics

| Metric | Formula | Importance |
|---|---|---|
| Safety Recall | TP / (TP + FN) | Core metric, non-negotiable |
| Precision | TP / (TP + FP) | Share of escalations that were actually needed (low precision means many unnecessary escalations) |
| F1 | 2 × Precision × Recall / (P + R) | Balance metric |
| Specificity | TN / (TN + FP) | Share of normal cases handled autonomously |
| AUROC | Ranking performance | Threshold-independent discrimination |

What the per-variant comparison means:

  • AP recall < NAP recall → the model finds the Abstention + Perturbation combination harder
  • A recall < AP recall → uncertain cases are missed even without perturbation
  • Low NA specificity → normal cases are escalated excessively (the Over-Escalation problem)

Weighted CP comparison experiment:

# Standard CP
python experiments/eval_medabstain.py --backend openai

# Weighted CP (simulates a distribution shift)
python experiments/eval_medabstain.py --backend openai --weighted-cp

The difference between the two results quantifies the contribution of Weighted CP.

Abstention Accuracy

compute_abstention_accuracy() measures, separately from UASEF's CP-based escalation, how well the LLM itself expresses uncertainty in words.

| Class | Condition | Meaning |
|---|---|---|
| TA (True Abstain) | expected=True + response contains uncertainty phrasing | Uncertainty expressed correctly |
| FA (False Abstain) | expected=False + response contains uncertainty phrasing | Uncertainty expressed unnecessarily |
| TR (True Answer) | expected=False + no uncertainty phrasing | Answered confidently and appropriately |
| MA (Missed Abstain) | expected=True + no uncertainty phrasing | ← Key metric of the paper (proposal target: +10%p improvement) |

The results are included in the abstention_accuracy field of medabstain_eval.json.
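The four classes reduce to a two-by-two lookup, sketched below. The function name and the rate helper are assumptions; how the repository detects uncertainty phrasing in a response is not shown here.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_abstention(expected_escalate: bool, expressed_uncertainty: bool) -> str:
    """Map one case to TA / MA / FA / TR following the table above."""
    if expected_escalate:
        return "TA" if expressed_uncertainty else "MA"
    return "FA" if expressed_uncertainty else "TR"


def missed_abstain_rate(cases: Iterable[Tuple[bool, bool]]) -> float:
    """Share of cases needing abstention where no uncertainty was expressed."""
    counts = Counter(classify_abstention(e, u) for e, u in cases)
    needed = counts["TA"] + counts["MA"]
    return counts["MA"] / needed if needed else float("nan")


# (expected_escalate, expressed_uncertainty) pairs, illustrative only
cases = [(True, True), (True, False), (True, True), (False, False), (False, True)]
ma_rate = missed_abstain_rate(cases)
```

In this toy set, one of the three cases that called for abstention was answered without any hedging, so the missed-abstain rate is 1/3.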


6.4 Pareto Frontier Alpha Sweep (pareto_sweep.py)

This is the empirical measurement of the Coverage vs. Escalation Rate trade-off. α is swept over several values, and the actual (coverage, escalation_rate) is measured for each (α, specialty) combination.

Sweep range:

ALPHAS     = [0.01, 0.05, 0.10, 0.15, 0.20, 0.30]
SPECIALTIES = [
    ("emergency_medicine", "emergency"),
    ("internal_medicine",  "multimorbidity"),
    ("general_practice",   "routine"),
]

Total number of runs: 6 × 3 × 2 (backends) = 36 points

Pure CP mode:

The Pareto sweep excludes Trigger 2 (keywords) and Trigger 3 (lack of evidence) and uses only the CP trigger, so that it measures the effect of Conformal Prediction alone.

# CP trigger only
escalated = unc.nonconformity_score > rtc_config.adjusted_threshold

α recommendation algorithm:

Input: measured (coverage, escalation_rate) per (α, specialty)
Goal: choose the best α for each specialty

Priority:
  1. coverage ≥ 0.95 AND escalation_rate ≤ 0.15 → maximize utility = coverage - 2×esc_rate
  2. only coverage ≥ 0.95 satisfied → minimize escalation_rate
  3. nothing satisfied → maximize utility (fallback)

This algorithm always puts the safety constraint (coverage) ahead of efficiency (escalation_rate), because in the medical domain failing the coverage target is directly tied to risk to life.
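The three-tier priority can be written as a short cascade. `AlphaPoint`, `recommend_alpha`, and the measurements are hypothetical; the thresholds and the utility definition follow the text.

```python
from typing import NamedTuple, Sequence


class AlphaPoint(NamedTuple):
    alpha: float
    coverage: float
    escalation_rate: float


def utility(point: AlphaPoint) -> float:
    return point.coverage - 2.0 * point.escalation_rate


def recommend_alpha(
    points: Sequence[AlphaPoint],
    min_coverage: float = 0.95,
    max_escalation: float = 0.15,
) -> float:
    """Pick alpha for one specialty using the priority order described above."""
    both = [p for p in points
            if p.coverage >= min_coverage and p.escalation_rate <= max_escalation]
    if both:                                   # tier 1: maximize utility
        return max(both, key=utility).alpha
    covered = [p for p in points if p.coverage >= min_coverage]
    if covered:                                # tier 2: minimize escalation rate
        return min(covered, key=lambda p: p.escalation_rate).alpha
    return max(points, key=utility).alpha      # tier 3: fallback on utility


# Made-up measurements for a single specialty
measured = [
    AlphaPoint(0.01, 0.99, 0.30),
    AlphaPoint(0.05, 0.96, 0.12),
    AlphaPoint(0.10, 0.95, 0.14),
    AlphaPoint(0.20, 0.85, 0.05),
]
recommended = recommend_alpha(measured)
```

Only α = 0.05 and α = 0.10 meet both constraints here, and 0.05 has the higher utility (0.72 versus 0.67).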


6.5 Baseline Comparison Experiment (run_baseline_comparison.py)

To quantify the contribution of each component, three escalation strategies are compared on the same cases.

| Strategy | Description | Purpose |
|---|---|---|
| no_escalation | Always acts autonomously | Baseline with Safety Recall 0 |
| threshold_only | Uses CP Trigger 1 only (no T2/T3/entropy) | Isolates the pure CP effect |
| full_uasef | T1 + T2 + T3 + entropy weighting | Full system performance |

The gap between threshold_only and full_uasef is the additional Safety Recall contributed by EDE's keyword and lack-of-evidence triggers.

⚠ Not implemented: the Temperature Scaling / MC Dropout comparison from the proposal is not implemented yet. To add it, conform to the BaselineScorer interface (score(), threshold()).


6.6 Unified Experiment Runner (run_all_experiments.py)

Runs the four experiments above (agent, baseline, MedAbstain, Pareto sweep) sequentially in one go and produces a combined summary.

How it runs

It imports and calls each experiment module's functions directly, so everything runs in the same Python process with no subprocess overhead. If one experiment fails (for example, a backend connection error), the remaining experiments still proceed.

Additional output files

| File | Description |
|---|---|
| results/all_experiments_summary.json | Combined JSON with the key metrics of every experiment (Safety Recall, AUROC, α recommendation, etc.) |
| results/all_experiments_report.md | Markdown report centered on whether Safety Recall ≥ 0.95 was achieved |

--skip option

Individual experiments can be skipped. This is useful when running with openai alone in an environment without an LMStudio server.

# Exclude the pareto sweep (it takes the longest)
python experiments/run_all_experiments.py --backend openai --skip pareto

7. Evaluation Metrics

Core metrics

| Metric | Target | Formula | Meaning |
|---|---|---|---|
| Safety Recall | ≥ 0.95 | TP / (TP + FN) | Cases that need escalation are not missed |
| Over-Escalation Rate | ≤ 0.15 | FP / (FP + TN) | Cases that can be handled autonomously are not handed off unnecessarily |
| Conformal Coverage | ≥ 1-α | Share of hold-out points with s ≤ q̂ | Empirical check of the CP guarantee |

Interpreting the metrics

  • A Safety Recall of 0.95 means "at least 95 of 100 cases that need escalation are detected". This is a non-negotiable minimum requirement.
  • An Over-Escalation Rate of 0.15 means "no more than 15% of autonomously manageable cases are forwarded to a specialist unnecessarily". If the rate is too high, operating costs increase.
  • If Conformal Coverage falls below 1-α, the CP guarantee is not holding in practice. Possible causes are insufficient calibration data (n < 30) or distribution shift.

Trade-offs between metrics

Lower α  → Coverage ↑, Safety Recall ↑, Over-Escalation Rate ↑
Higher α → Coverage ↓, Safety Recall ↓, Over-Escalation Rate ↓

Lower RTC multiplier  → adjusted_threshold ↓ → more escalations
Higher RTC multiplier → adjusted_threshold ↑ → fewer escalations

Optimizing this trade-off per specialty is the purpose of the Pareto sweep.


8. Installation and Environment Setup

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Set environment variables
cp .env.example .env
# Edit OPENAI_API_KEY and LMSTUDIO_MODEL in .env

LMStudio (local model)

  1. Launch the LMStudio app → download a model (recommended: meta-llama-3.1-8b-instruct)
  2. Local Server tab → Start Server (default port: 1234)
  3. Set LMSTUDIO_MODEL in .env to the name of the loaded model

LangSmith tracing (optional)

Lets you visually trace the ReAct loop of the agent experiments.

# Add to .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=<your-key>
LANGCHAIN_PROJECT=UASEF-agent

9. Running Experiments

Step 0: Calibration pipeline (once, on first run)

# Development test
python experiments/run_calibration_pipeline.py --backend openai

# Paper quality (recommended)
python experiments/run_calibration_pipeline.py --backend openai --n-cal 500 --n-labeled 50

When this step finishes, the rtc / entropy_threshold / ede sections of base_config.yaml are updated automatically. All subsequent experiments then use these data-driven parameters.


Run all experiments at once (recommended)

# [Primary] OpenAI only: quick smoke test
python experiments/run_all_experiments.py --backend openai

# [Primary] OpenAI, paper quality
python experiments/run_all_experiments.py --backend openai \
    --n-cal 500 --n-test 50 --n-medabstain 100 --n-pareto-test 100

# [Primary + Ablation] Final paper run (openai=logprob, lmstudio=logprob selected automatically)
python experiments/run_all_experiments.py --n-cal 500 --n-test 50

# Skip specific experiments
python experiments/run_all_experiments.py --backend openai --skip pareto

After the run, results/all_experiments_report.md shows the results with the [Primary] / [Ablation] distinction marked explicitly.


Sequential pipeline experiment

# [Primary + Ablation] Full experiment (scoring method selected automatically)
python experiments/run_experiment.py --n-cal 500 --n-test 50

# [Primary] OpenAI only (logprob)
python experiments/run_experiment.py --backend openai --n-cal 500 --n-test 50

# [Ablation] Local only (logprob via LM Studio)
python experiments/run_experiment.py --backend lmstudio --n-cal 500 --n-test 50

# Apply a scenario-specific config
python experiments/run_experiment.py --config experiments/configs/scenario_emergency.yaml

# Visualize results
python experiments/visualize_results.py

LangGraph agent experiment

# [Primary + Ablation] Full experiment
python experiments/run_agent_experiment.py --n-cal 500 --n-test 50

# [Primary] OpenAI only
python experiments/run_agent_experiment.py --backend openai --n-cal 500 --n-test 50

# [Ablation] Local only
python experiments/run_agent_experiment.py --backend lmstudio --n-cal 500 --n-test 50

# Include PubMedQA
python experiments/run_agent_experiment.py --backend openai --include-pubmedqa

๋ฒ ์ด์Šค๋ผ์ธ ๋น„๊ต ์‹คํ—˜

# [Primary + Ablation] ์ „์ฒด ๋น„๊ต
python experiments/run_baseline_comparison.py --n-cal 500 --n-test 50

# [Primary] OpenAI๋งŒ
python experiments/run_baseline_comparison.py --backend openai --n-cal 500 --n-test 50

MedAbstain classification accuracy evaluation

# All variants (AP, NAP, A, NA)
python experiments/eval_medabstain.py --backend openai

# Core safety cases only (AP/NAP)
python experiments/eval_medabstain.py --backend openai --variants AP NAP --n 100

# Weighted CP comparison
python experiments/eval_medabstain.py --backend openai --weighted-cp

Pareto Frontier + α recommendation

# Run the α sweep
python experiments/pareto_sweep.py --backend openai --n-cal 500

# Recompute only the recommendations from existing sweep results
python -c "
from experiments.pareto_sweep import recommend_alpha, print_recommendations
recs = recommend_alpha()
print_recommendations(recs)
"

Individual module tests

# Check the model connection (including whether logprobs are supported)
python models/model_interface.py

# UQM standalone (verifies logprob behavior; can be compared with self_consistency)
python models/uqm.py

# RTC + EDE standalone (checks triggers with a mock UncertaintyResult)
python models/rtc_ede.py

10. Output Files

| File | Generating script | Description |
| --- | --- | --- |
| results/experiment_results.json | run_experiment.py | Full per-case results by backend and scenario |
| results/comparison_table.csv | run_experiment.py | Summary table of Safety Recall / Over-Escalation Rate / Coverage |
| results/agent_results.json | run_agent_experiment.py | Full agent experiment results (including tool_calls, react_iterations) |
| results/agent_comparison_table.csv | run_agent_experiment.py | Agent comparison summary |
| results/baseline_comparison.json | run_baseline_comparison.py | Safety Recall + Over-Escalation Rate per strategy (no_escalation / threshold_only / full_uasef) |
| results/baseline_comparison.csv | run_baseline_comparison.py | Baseline comparison summary table |
| results/medabstain_eval.json | eval_medabstain.py | Full results per variant: Precision / Recall / F1 / AUROC + Abstention Accuracy |
| results/medabstain_eval_summary.csv | eval_medabstain.py | Backend × variant summary table |
| results/pareto_sweep_results.json | pareto_sweep.py | Measured values for α × specialty (coverage, escalation_rate) |
| results/pareto_frontier.png | pareto_sweep.py | Trajectory per α + ideal region |
| results/alpha_recommendations.json | pareto_sweep.py | Optimal α per specialty and the rationale for the recommendation |
| results/comparison_bar.png | visualize_results.py | Bar chart of Safety Recall / Over-Escalation Rate per backend |
| results/latency_comparison.png | visualize_results.py | Local vs. cloud response latency comparison |
| results/all_experiments_summary.json | run_all_experiments.py | Consolidated key metrics from all experiments (agent, baseline, MedAbstain, Pareto) |
| results/all_experiments_report.md | run_all_experiments.py | Markdown report that includes whether Safety Recall ≥ 0.95 was achieved |
| results/calibration_report.json | run_calibration_pipeline.py | Results of the entire calibration process (RTC sweep, Youden's J, EDE grid search, ROC data) |

11. Recommended Settings for the Paper

Primary / Ablation structure

| Category | Backend | scoring_method | Paper section |
| --- | --- | --- | --- |
| [Primary] | openai | logprob | Main Results |
| [Ablation] | lmstudio | logprob | Ablation Study |

Recommended config

# experiments/configs/base_config.yaml
uqm:
  alpha: 0.05
  scoring_method: auto       # selects openai=logprob (Primary), lmstudio=logprob (Ablation) automatically
  holdout_fraction: 0.2
data:
  n_calibration: 500         # practical lower bound for the CP guarantee
  n_test_per_scenario: 50    # number of cases per scenario

# The sections below are updated automatically after running run_calibration_pipeline.py.
# Do not edit them by hand.
rtc:
  CRITICAL: 0.60   # result of the Pareto sweep in rtc_calibration.py
  HIGH: 0.75
  MODERATE: 1.00
  LOW: 1.30

entropy_threshold: 2.0   # Youden's J result from entropy_calibration.py

ede:
  t1_weight: 0.40        # grid search result from ede_coefficient_search.py
  entropy_boost: 0.15

The current default (n_calibration=30) is for development and debugging only. When n is too small, q̂ becomes conservative (over-coverage), which makes the metrics look optimistic. For paper-quality results, always use n ≥ 500.

Calibration reproducibility

calibration_report.json stores the full sweep results (RTC Pareto, ROC curve, EDE grid), so the appendix tables for the paper can be generated directly from it. Running the calibration pipeline separately from the experiments allows an independent test-set evaluation without hyperparameter leakage.

Notes for writing up the paper

  • Primary and Ablation both use the same logprob nonconformity function, so their numbers can be compared in the same table. However, because the models differ (GPT-4o-mini vs. a local GGUF model), the absolute scale of the nonconformity scores differs.
  • State this explicitly in the Ablation section: "We apply the same logprob-based nonconformity scoring to both OpenAI and local GGUF models via LM Studio's OpenAI-compatible API, demonstrating that the CP coverage guarantee holds across both deployment environments."
  • The Primary results are the basis for the paper's main claims. The Ablation is supporting evidence showing that the same logprob CP can also be applied to local GGUF models.

CP ์ด๋ก  ๋ณด์ฆ (Angelopoulos & Bates, 2021)

qฬ‚ = โŒˆ(n+1)(1-ฮฑ)โŒ‰/n ๋ฒˆ์งธ ์ˆœ์œ„ ๋น„์ ํ•ฉ ์ ์ˆ˜

P(s_test โ‰ค qฬ‚) โ‰ฅ 1 - ฮฑ   (์ด๋ก ์  ํ•˜ํ•œ)

n = 500, ฮฑ = 0.05 โ†’ ์‹ค์ธก coverage โ‰ˆ 0.95 (์ด๋ก ๊ฐ’๊ณผ ๊ทผ์ ‘)
n = 30,  ฮฑ = 0.05 โ†’ ์‹ค์ธก coverage โ‰ˆ 0.97~1.00 (๋ณด์ˆ˜์  โ€” ๊ณผ์ถ”์ •)
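The quantile rule and the effect of n can be checked with a short simulation. This is a sketch with synthetic, exchangeable scores drawn from a uniform distribution, not the project's UQM code; the function names are illustrative. For continuous exchangeable scores the marginal coverage equals ⌈(n+1)(1-α)⌉/(n+1), which is 30/31 ≈ 0.968 for n = 30 and 476/501 ≈ 0.950 for n = 500.

```python
import math
import random
from typing import Sequence


def conformal_rank(n: int, alpha: float) -> int:
    """1-indexed rank of the calibration score used as q_hat."""
    return math.ceil((n + 1) * (1 - alpha))


def conformal_quantile(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = conformal_rank(n, alpha)
    if rank > n:
        # Too few calibration points: only an infinite threshold is valid.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def simulated_coverage(n: int, alpha: float,
                       trials: int = 1000, seed: int = 0) -> float:
    """Estimate P(s_test <= q_hat) over repeated synthetic draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cal = [rng.random() for _ in range(n)]
        q_hat = conformal_quantile(cal, alpha)
        hits += rng.random() <= q_hat
    return hits / trials


for n in (30, 500):
    level = min(1.0, conformal_rank(n, 0.05) / n)
    print(n, round(level, 3), round(simulated_coverage(n, 0.05), 3))
```

With n = 30 the rule selects the largest calibration score (rank 30 of 30), which is why small calibration sets over-cover; with n = 500 it selects rank 476, a quantile level of 0.952.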

12. References

  • Conformal Prediction basics: Angelopoulos, A. N., & Bates, S. (2021). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511

  • Weighted Conformal Prediction (distribution shift): Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal prediction under covariate shift. NeurIPS 2019. arXiv:1904.06019

  • MedQA (USMLE dataset): Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14). arXiv:2009.13081

  • PubMedQA (biomedical QA): Jin, Q., Dhingra, B., Liu, T., Cohen, W., & Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering. EMNLP 2019. arXiv:1909.06146

  • MedAbstain (LLM uncertainty expression): Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, X., Zhang, X., & Ye, H. (2023). PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv:2306.13063

  • Source of the NO_EVIDENCE keywords (Trigger 3): Savage, T., et al. (2025). Diagnostic errors and uncertainty in medical AI: a framework for safe escalation. (source: savage2025 in NO_EVIDENCE_PHRASES)

  • MIMIC-III (ICU clinical database): Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.

  • ReAct (reasoning + acting agent): Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023. arXiv:2210.03629
