CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Jitian Zhao*, Changho Shin*, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala

Paper Link: TBD

Install

pip install -r requirements.txt

Run pipeline

1) Generate LLM judge outputs

python scripts/save_judge_outputs.py \
  --datasets asset_ratings civilcomments_binary allenai_preference_test_sets/pku_better_binary \
  --mode gaussian_mixture

Output path example: judge_outputs/fully_gaussian/asset/Qwen3-8B.csv
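Before moving on to step 2, it can help to confirm that step 1 actually wrote a file. The following is a minimal sketch, not part of the repository: the helper name `peek_judge_output` is hypothetical, and it assumes only that the output is a CSV whose first row is a header. No column names are assumed.

```python
import csv
from pathlib import Path


def peek_judge_output(path, n_rows=3):
    """Return (header, first n_rows data rows) of a judge-output CSV.

    Hypothetical helper for illustration; it assumes only that the file
    is a CSV with a header row, nothing about its columns.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No judge output at {path}; run step 1 first.")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        # zip stops after n_rows without reading the rest of the file
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Example call, using the example path above:
# header, rows = peek_judge_output("judge_outputs/fully_gaussian/asset/Qwen3-8B.csv")
```

If the file is missing, the helper raises `FileNotFoundError` rather than returning an empty result, so a failed step 1 is caught before the aggregation scripts run.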

2) Run aggregations

Fully Gaussian (Table 1 experiment):

python scripts/fully_gaussian_main.py --seed 2024

Gaussian mixture (Table 2 experiment):

python scripts/gaussian_mixture_main.py --seed 42 --datasets civilcomments pku_better
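The steps above can also be chained from a single driver script. This is a sketch under stated assumptions: `pipeline_commands` and `run_pipeline` are hypothetical names, the script is assumed to be launched from the repository root, and the arguments are copied verbatim from the commands above.

```python
import subprocess
import sys


def pipeline_commands(python=sys.executable):
    """Return the three commands from this README as argument lists, in order."""
    return [
        # 1) Generate LLM judge outputs
        [python, "scripts/save_judge_outputs.py",
         "--datasets", "asset_ratings", "civilcomments_binary",
         "allenai_preference_test_sets/pku_better_binary",
         "--mode", "gaussian_mixture"],
        # 2a) Fully Gaussian aggregation
        [python, "scripts/fully_gaussian_main.py", "--seed", "2024"],
        # 2b) Gaussian mixture aggregation
        [python, "scripts/gaussian_mixture_main.py", "--seed", "42",
         "--datasets", "civilcomments", "pku_better"],
    ]


def run_pipeline(dry_run=False):
    """Print and run each step in order, stopping at the first failure."""
    for cmd in pipeline_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` only prints the commands, which is a cheap way to check the arguments before starting the judge-generation step.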

Citation

TBD