diff --git a/environments/medexqa/README.md b/environments/medexqa/README.md
new file mode 100644
index 00000000..79115a63
--- /dev/null
+++ b/environments/medexqa/README.md
@@ -0,0 +1,249 @@
+# medexqa-env by mnishant2
+
+### Overview
+- **Environment ID**: `medexqa`
+- **Short description**: Medical QA with multiple-choice questions and explanations across five underrepresented medical specialties
+- **Tags**: medical, clinical, single-turn, multiple-choice, explanations, evaluation
+
+### Datasets
+- **Primary dataset(s)**: MedExQA
+- **Source links**: [Paper](https://arxiv.org/abs/2406.06331), [HuggingFace Dataset](https://huggingface.co/datasets/bluesky333/MedExQA), [GitHub](https://github.com/knowlab/MedExQA)
+- **Split sizes**:
+
+    | Specialty                   | Dev | Test | Total |
+    | --------------------------- | --- | ---- | ----- |
+    | Biomedical Engineering      | 4   | 144  | 148   |
+    | Clinical Laboratory Science | 9   | 368  | 377   |
+    | Clinical Psychology         | 3   | 108  | 111   |
+    | Occupational Therapy        | 5   | 189  | 194   |
+    | Speech Language Pathology   | 4   | 131  | 135   |
+    | **Total**                   | **25** | **940** | **965** |
+
+### Task
+- **Type**: single-turn
+- **Prompting**: Uses the authors' instruction embedded in the user message; options A/B/C/D are included.
+  ```
+  The following is a multiple-choice question. Please choose the most suitable one among A, B, C and D as the answer to this question. Your answer should be paired with an explanation why you chose that answer.
+  ```
+- **Answer extraction** ([authors' logic](https://github.com/knowlab/MedExQA/blob/9a5b34af103b0c8ba0c00906e278f6572249fafa/evaluate_pipe_MedExQA.py)):
+  - Canonical letter extraction using a sequence of regex patterns (e.g., explicit "Answer is A:", leading letter, etc.)
+  - If no explicit letter is found, fuzzy matching (thefuzz) maps the generated text to the closest option and returns the corresponding letter
+- **Parser**: `Parser` or `ThinkParser` with `extract_fn=extract_boxed_answer` supported for think-mode; MCQ scoring uses the authors' extraction logic above.
+- **Specialty selection**: run the evaluation on a single specialty or on several specialties at once.
+- **Explanation evaluation**: use lexical metrics (`rougeL`, `bleu`, `bertscore`, `meteor`) or an LLM-as-a-judge.
+- **Rubric overview**:
+  - MCQ accuracy: 0 or 100 per example
+  - Explanation score: 0–100 per example (lexical metrics average); 0 if the answer is wrong
+  - Combined score: weighted average of MCQ and explanation (`mcq_weight`, `explanation_weight`)
+- **Model download**:
+  On the first run, the environment downloads the NLTK resources (including `wordnet`) and the SciBERT model needed by the lexical metrics.
+
+### Quickstart
+
+- Run MCQ-only (no explanation scoring):
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": false}'
+```
+
+- Run with explanation scoring (lexical metrics):
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": true}'
+```
+
+- Use LLM-as-judge for explanations (instead of lexical metrics):
+```bash
+export JUDGE_API_KEY=sk-...
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": true, "use_judge": true, "judge_mode": "g-eval", "judge_model": "gpt-4o-mini"}'
+```
+
+- Configure sampling and rollouts:
+```bash
+uv run vf-eval medexqa \
+  -m gpt-4.1-mini \
+  -n -1 -r 3 -t 1024 -T 0.7 \
+  -a '{"use_think": false, "use_explanations": true, "mcq_weight": 0.5, "explanation_weight": 0.5}'
+```
+
+### Environment Arguments
+
+| Arg                    | Type                   | Default        | Description |
+| ---------------------- | ---------------------- | -------------- | ----------- |
+| `specialty`            | list[str] \| str \| None | `None`         | Select one or more specialties. Codes: `BE`, `CLS`, `CP`, `OT`, `SLP`. `None` or `"ALL"` loads all. |
+| `use_think`            | bool                   | `False`        | Use `ThinkParser` to support `<think>...</think>` blocks. |
+| `use_explanations`     | bool                   | `True`         | Whether to compute explanation scores. |
+| `explanation_metrics`  | list[str] \| str \| None | `None`         | Lexical metrics to use: any of `rougeL`, `bleu`, `meteor`, `bertscore`. `None` or `"all"` averages all four. |
+| `mcq_weight`           | float                  | `0.5`          | Weight for MCQ accuracy in the combined score. |
+| `explanation_weight`   | float                  | `0.5`          | Weight for explanation in the combined score. |
+| `use_judge`            | bool                   | `False`        | Use LLM-as-judge for explanations instead of lexical metrics. |
+| `judge_mode`           | str \| None            | `None`         | Judge mode: `"g-eval"` or `"factscore"`. Required when `use_judge=True`. |
+| `judge_model`          | str                    | `gpt-4o-mini`  | Judge model name. |
+| `judge_base_url`       | str \| None            | `None`         | Judge API base URL. |
+| `judge_api_key`        | str \| None            | `None`         | Judge API key (falls back to `JUDGE_API_KEY` or `OPENAI_API_KEY`). |
+| `use_coverage`         | bool                   | `False`        | For FactScore only: enable coverage calculation (measures recall). Default is support-only (precision). Increases API calls from ~6-8 to ~12-15 per example. |
+| `seed`                 | int \| None            | `None`         | When multiple specialties are selected, shuffles the combined eval set with this seed. |
+
+### Metrics
+
+- **Answer accuracy (per example)**: 0 or 100. Uses authors' regex+fuzzy logic to extract a letter.
+- **Explanation score (per example)**: 0–100. If the answer is wrong, the explanation score is 0.
+  - Lexical metrics supported: `rougeL`, `bleu`, `meteor`, `bertscore` (w/ SciBERT `allenai/scibert_scivocab_uncased`).
+  - Selection via `explanation_metrics` (list or `'all'`/`None` to average all four).
+- **Combined score**: `mcq_weight * accuracy + explanation_weight * explanation`.
+
+Optional LLM-as-judge for explanations:
+- Set `use_explanations=true` and `use_judge=true` to replace lexical metrics with judge scoring (0–100 after scaling).
+- Must specify `judge_mode` when using LLM-as-judge. Two modes are supported:
+
+#### Judge Modes
+
+**G-Eval Mode** (`judge_mode="g-eval"`):
+- Uses Chain-of-Thought evaluation with 7 structured steps to assess explanation quality.
+- Evaluates medical accuracy, correct option justification, distractor analysis, reference alignment, reasoning clarity, and completeness.
+- Outputs structured JSON with detailed step-by-step analysis and a final score (0-100).
+
+**FactScore Mode** (`judge_mode="factscore"`):
+- Decomposes explanations into atomic medical claims (5-7 key claims) using specialty-specific few-shot examples.
+- Verifies each claim against reference explanations using 3-level support scoring (FULLY_SUPPORTED=1.0, PARTIALLY_SUPPORTED=0.5, NOT_SUPPORTED=0.0).
+- Optionally computes coverage, i.e. whether the model's explanation covers the key concepts in the references (disabled by default; enable with `use_coverage=True` for a combined precision and recall evaluation).
+
+Usage examples:
+```bash
+# G-Eval mode
+export JUDGE_API_KEY=sk-...
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": true, "use_judge": true, "judge_mode": "g-eval", "judge_model": "gpt-4o-mini"}'
+
+# FactScore mode (support-only, fast)
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": true, "use_judge": true, "judge_mode": "factscore", "judge_model": "gpt-4o-mini"}'
+
+# FactScore mode with coverage (support+coverage, comprehensive)
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"use_explanations": true, "use_judge": true, "judge_mode": "factscore", "judge_model": "gpt-4o-mini", "use_coverage": true}'
+```
+
+#### Judge Rescoring Tool
+
+The `judge_rescore.py` script allows you to re-evaluate previously generated model completions with LLM judges and save detailed results to CSV files. This is useful for experimenting with different judge configurations without re-running model inference.
+
+**Usage:**
+```bash
+# Rescore with FactScore (fast mode: support-only, ~6-8 API calls per example)
+export OPENAI_API_KEY=sk-...
+uv run python environments/medexqa/tools/judge_rescore.py \
+  --judge factscore \
+  --base https://openrouter.ai/api/v1 \
+  --model openai/gpt-4o-mini \
+  --key_var OPENAI_API_KEY \
+  --input_glob 'environments/medexqa/outputs/evals/**/results.jsonl' \
+  --out_csv_prefix 'environments/medexqa/outputs/judge_scores/medexqa_' \
+  --sleep_ms 2000 \
+  --max_retries 6 \
+  --max_tokens 512 \
+  --verbose
+
+# Rescore with FactScore + coverage (comprehensive: support+coverage, ~12-15 API calls per example)
+uv run python environments/medexqa/tools/judge_rescore.py \
+  --judge factscore \
+  --use_coverage \
+  --model openai/gpt-4o-mini \
+  ... # other args same as above
+
+# Rescore with G-Eval (1 API call per example)
+uv run python environments/medexqa/tools/judge_rescore.py \
+  --judge geval \
+  --model openai/gpt-4o-mini \
+  --max_tokens 768 \
+  ... # other args same as above
+
+# Rescore with both judges
+uv run python environments/medexqa/tools/judge_rescore.py \
+  --judge both \
+  --model openai/gpt-4o-mini \
+  ... # other args same as above
+```
+
+**Output:** Creates CSV files with detailed judge outputs:
+- `medexqa_factscore.csv`: Contains extracted claims, support/coverage rates, and final scores
+- `medexqa_geval.csv`: Contains structured JSON evaluation details and scores
+
+### Specialty Selection and Macro Average
+
+- Single specialty by code:
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"specialty": "CLS"}'
+```
+
+- Multiple specialties:
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"specialty": ["CLS", "CP"], "seed": 42}'
+```
+
+- All specialties:
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -a '{"specialty": "ALL"}'
+```
+
+#### Important: macro-average accuracy (as reported in the paper)
+- Run each specialty separately and average the per-run mean answer accuracies; or
+- Run multiple specialties with `-s` to save results. Each saved example includes its `specialty` in `info`, along with the per-example `answer_accuracy_reward`. Use the saved JSONL to compute per-specialty accuracies, then take the unweighted mean across specialties.
+
+### Testing Instructions
+
+#### 1. Environment Setup
+```bash
+# Navigate to the repository root (adjust the path to your local clone)
+cd /path/to/med-lm-envs
+
+# Sync uv environment
+uv sync
+```
+
+#### 2. Quick Validation Test (MCQ-only)
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -n 5 -a '{"use_explanations": false}'
+```
+
+#### 3. Full Evaluation with Save
+```bash
+export OPENAI_API_KEY=sk-...
+uv run vf-eval medexqa -m gpt-4.1-mini -n -1 -s -a '{"specialty": "ALL", "use_explanations": true}'
+```
+
+#### 4. LLM-as-Judge for Explanations
+```bash
+export JUDGE_API_KEY=sk-...
+uv run vf-eval medexqa -m gpt-4.1-mini -n -1 -s -a '{"use_explanations": true, "use_judge": true, "judge_mode": "g-eval", "mcq_weight": 0.5, "explanation_weight": 0.5}'
+```
+
+#### 5. With Think Tags
+```bash
+uv run vf-eval medexqa -m gpt-4.1-mini -n -1 -a '{"use_think": true}'
+```
+
+#### 6. Example Run with OpenRouter
+```bash
+export OPENROUTER_API_KEY=....
+uv run vf-eval medexqa -m openai/gpt-oss-20b:free -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY -n 10 -r 1 -c 1 -a '{"use_explanations": true, "explanation_metrics": "all", "specialty": ["BE", "OT"]}' -s
+```
+Example output:
+```text
+Rewards:
+reward: avg - 59.416, std - 19.928
+r1: [67.79, 65.809, 64.158, 66.619, 69.124, 0.0, 66.957, 66.327, 66.87, 60.503]
+answer_accuracy_reward: avg - 90.000, std - 30.000
+r1: [100.0, 100.0, 100.0, 100.0, 100.0, 0.0, 100.0, 100.0, 100.0, 100.0]
+explanation_reward: avg - 28.832, std - 10.577
+r1: [35.58, 31.618, 28.316, 33.239, 38.249, 0.0, 33.915, 32.653, 33.741, 21.006]
+```
+### Citation
+
+```bibtex
+@article{kim2024medexqa,
+  title={MedExQA: Medical Question Answering Benchmark with Multiple Explanations},
+  author={Kim, Yunsoo and Wu, Jinge and Abdulle, Yusuf and Wu, Honghan},
+  journal={arXiv preprint arXiv:2406.06331},
+  year={2024}
+}
+```
+### Authors
+This environment has been put together by:
+
+Nishant Mishra - ([mnishant2](https://github.com/mnishant2))
\ No newline at end of file
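Review note on the README's scoring rules: the sketch below illustrates, under stated assumptions, the combined score (`mcq_weight * accuracy + explanation_weight * explanation`) and the macro-average procedure the README describes. The helper names `combined_score` and `macro_average_accuracy` are hypothetical, and the record layout (a dict with `info["specialty"]` and `answer_accuracy_reward`) is assumed from the README's description of the saved JSONL, not taken from the actual file format.

```python
from collections import defaultdict
from statistics import mean


def combined_score(accuracy, explanation, mcq_weight=0.5, explanation_weight=0.5):
    """Weighted sum of the two per-example rewards, as stated in the README."""
    return mcq_weight * accuracy + explanation_weight * explanation


def macro_average_accuracy(records):
    """Unweighted mean of the per-specialty mean accuracies."""
    by_specialty = defaultdict(list)
    for rec in records:
        # Assumption: each saved record exposes these two fields in this shape.
        by_specialty[rec["info"]["specialty"]].append(rec["answer_accuracy_reward"])
    per_specialty = {name: mean(values) for name, values in by_specialty.items()}
    return mean(list(per_specialty.values())), per_specialty


# Hypothetical records, not real evaluation results.
records = [
    {"info": {"specialty": "BE"}, "answer_accuracy_reward": 100.0},
    {"info": {"specialty": "BE"}, "answer_accuracy_reward": 0.0},
    {"info": {"specialty": "OT"}, "answer_accuracy_reward": 100.0},
]
macro, per_specialty = macro_average_accuracy(records)
print(macro, per_specialty)
```

With these hypothetical records the macro average is 75.0, while pooling all three examples would give about 66.7; that gap is why the README asks for per-specialty averaging. The formula is also consistent with the first rollout in the example output: `combined_score(100.0, 35.58)` gives 67.79.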
diff --git a/environments/medexqa/medexqa/__init__.py b/environments/medexqa/medexqa/__init__.py
new file mode 100644
index 00000000..f01e6615
--- /dev/null
+++ b/environments/medexqa/medexqa/__init__.py
@@ -0,0 +1,5 @@
+"""MedExQA environment package."""
+
+from medexqa.main import load_environment
+
+__all__ = ["load_environment"]
diff --git a/environments/medexqa/medexqa/factscore_judge/__init__.py b/environments/medexqa/medexqa/factscore_judge/__init__.py
new file mode 100644
index 00000000..1149e4e5
--- /dev/null
+++ b/environments/medexqa/medexqa/factscore_judge/__init__.py
@@ -0,0 +1,7 @@
+"""FactScore judge for MedExQA explanations."""
+
+from .atomic_facts_judge import create_factscore_judge_rubric, explanation_factscore_reward
+from .atomic_facts_generator import AtomicFactGenerator
+
+__all__ = ["create_factscore_judge_rubric", "explanation_factscore_reward", "AtomicFactGenerator"]
+
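Review note on the FactScore package: the README describes 3-level support scoring (FULLY_SUPPORTED=1.0, PARTIALLY_SUPPORTED=0.5, NOT_SUPPORTED=0.0), and the judge module's docstring says the support rate in [0, 1] is scaled to [0, 100]. A minimal sketch of that aggregation follows, assuming exact label matching for brevity (the patch itself uses substring checks). The names `support_rate` and `factscore_reward` are hypothetical and the coverage path is left out.

```python
SUPPORT_SCORES = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.5,
    "NOT_SUPPORTED": 0.0,
}


def support_rate(verdicts):
    """Mean per-claim support in [0, 1].

    Unrecognized verdicts contribute 0 but still count toward the total,
    mirroring how the patch increments the claim count before parsing.
    """
    if not verdicts:
        return 0.0
    total = sum(SUPPORT_SCORES.get(v.strip().upper(), 0.0) for v in verdicts)
    return total / len(verdicts)


def factscore_reward(verdicts):
    """Support rate scaled to the 0-100 reward range (support-only mode)."""
    return 100.0 * support_rate(verdicts)


# Hypothetical judge verdicts for four extracted claims.
example = ["FULLY_SUPPORTED", "PARTIALLY_SUPPORTED", "NOT_SUPPORTED", "unclear"]
print(factscore_reward(example))
```

Counting unparseable verdicts in the denominator means a flaky judge lowers the score rather than inflating it, which seems to be the intent of the loop in `explanation_factscore_reward`.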
diff --git a/environments/medexqa/medexqa/factscore_judge/atomic_facts_generator.py b/environments/medexqa/medexqa/factscore_judge/atomic_facts_generator.py
new file mode 100644
index 00000000..b82d1e4d
--- /dev/null
+++ b/environments/medexqa/medexqa/factscore_judge/atomic_facts_generator.py
@@ -0,0 +1,215 @@
+import json
+from typing import List
+
+from openai import AsyncOpenAI
+
+
+class AtomicFactGenerator:
+    """
+    MedExQA-specific atomic facts generator.
+
+    Extracts concise, checkable medical claims from an MCQA explanation that
+    support the chosen option and, when useful, refute key distractors.
+    Returns a Python list of strings (facts), not raw model text.
+    """
+
+    def __init__(self, async_openai_client: AsyncOpenAI | None, model_name: str = "gpt-4o-mini") -> None:
+        self.client = async_openai_client
+        self.model_name = model_name
+
+    async def run(self, explanation_text: str, state: dict | None = None) -> List[str]:
+        """
+        Extract atomic facts from an MCQA explanation.
+        
+        Args:
+            explanation_text: The explanation text to extract claims from
+            state: Optional state dict for token tracking
+        """
+        explanation = (explanation_text or "").strip()
+        if not explanation:
+            return []
+
+        primary = await self._extract_json_claims(explanation, state=state)
+        if primary:
+            return primary
+
+        fallback = await self._extract_json_claims(explanation, fallback=True, state=state)
+        return fallback or []
+
+    async def _extract_json_claims(self, explanation: str, fallback: bool = False, state: dict | None = None) -> List[str]:
+        if self.client is None:
+            return []
+
+        if not fallback:
+            prompt = (
+                "You are a medical expert evaluating MCQA (multiple-choice question) explanations.\n"
+                "Extract atomic, checkable medical claims that: (1) justify why the correct option is right, "
+                "(2) when applicable, explain why key distractors are wrong, (3) preserve medical terminology.\n\n"
+                "Rules:\n"
+                "- Output a strict JSON array of strings ONLY (no extra text).\n"
+                "- Extract 5-7 MOST IMPORTANT claims (prioritize key medical concepts).\n"
+                "- Each claim ≤ 30 words; no duplicates; no vague statements.\n"
+                "- Preserve technical terms and abbreviations (e.g., 'DEXA', 'PTFE', 'AAC').\n"
+                "- If no checkable medical content, return [].\n\n"
+                "Few-shot examples (imitate format exactly):\n\n"
+                "# Biomedical Engineering Example 1:\n"
+                "Explanation: Membrane oxygenators require materials with high gas permeability for O2 and CO2 exchange. "
+                "Silicone rubber, polypropylene, and Teflon are highly permeable polymers. Ceramic membranes are dense, "
+                "brittle, have poor gas permeability, and can cause hemolysis.\n"
+                "Claims JSON: [\n"
+                "  \"Membrane oxygenators require high gas permeability for O2 and CO2 exchange.\",\n"
+                "  \"Silicone rubber has excellent gas permeability and biocompatibility.\",\n"
+                "  \"Polypropylene provides high gas transfer and is durable.\",\n"
+                "  \"Teflon (PTFE) is chemically inert with good blood contact properties.\",\n"
+                "  \"Ceramic membranes have poor gas permeability compared to polymers.\",\n"
+                "  \"Ceramic membranes are brittle and can cause hemolysis.\"\n"
+                "]\n\n"
+                "# Biomedical Engineering Example 2:\n"
+                "Explanation: Thermographic cameras detect infrared radiation emitted by objects due to temperature. "
+                "All objects above absolute zero emit infrared radiation. X-rays and UV are higher-energy and not used for thermal imaging. "
+                "Microwaves are used for radar, not temperature scanning.\n"
+                "Claims JSON: [\n"
+                "  \"Thermographic cameras detect infrared radiation from objects.\",\n"
+                "  \"All objects above absolute zero emit infrared radiation.\",\n"
+                "  \"Infrared is ideal for measuring surface temperatures.\",\n"
+                "  \"X-rays are too high-energy for conventional thermal imaging.\",\n"
+                "  \"Microwaves are used for radar applications, not thermal cameras.\"\n"
+                "]\n\n"
+                "# Clinical Laboratory Science Example 1:\n"
+                "Explanation: Hemoglobin A1c measures average blood glucose over 2-3 months by detecting glycated hemoglobin. "
+                "Fasting glucose only reflects current levels. Random glucose varies throughout the day. Oral glucose tolerance test is diagnostic but not for monitoring.\n"
+                "Claims JSON: [\n"
+                "  \"Hemoglobin A1c measures average blood glucose over 2-3 months.\",\n"
+                "  \"A1c detects glycated hemoglobin formed by glucose binding.\",\n"
+                "  \"Fasting glucose only reflects current blood glucose levels.\",\n"
+                "  \"Random glucose varies throughout the day and is unreliable for averages.\",\n"
+                "  \"OGTT is diagnostic but not suitable for long-term monitoring.\"\n"
+                "]\n\n"
+                "# Clinical Laboratory Science Example 2:\n"
+                "Explanation: Gram staining differentiates bacteria by cell wall structure. Gram-positive bacteria have thick peptidoglycan walls "
+                "that retain crystal violet stain. Gram-negative bacteria have thin peptidoglycan and outer membranes, appearing pink after counterstaining.\n"
+                "Claims JSON: [\n"
+                "  \"Gram staining differentiates bacteria by cell wall structure.\",\n"
+                "  \"Gram-positive bacteria have thick peptidoglycan cell walls.\",\n"
+                "  \"Thick peptidoglycan retains crystal violet stain in Gram-positive bacteria.\",\n"
+                "  \"Gram-negative bacteria have thin peptidoglycan and outer membranes.\",\n"
+                "  \"Gram-negative bacteria appear pink after safranin counterstaining.\"\n"
+                "]\n\n"
+                "# Clinical Psychology Example 1:\n"
+                "Explanation: Cognitive-behavioral therapy (CBT) is first-line for generalized anxiety disorder, with strong evidence for efficacy. "
+                "Psychodynamic therapy lacks robust evidence for GAD. Exposure therapy is specific to phobias. Supportive therapy alone is insufficient for GAD.\n"
+                "Claims JSON: [\n"
+                "  \"CBT is first-line treatment for generalized anxiety disorder.\",\n"
+                "  \"CBT has strong evidence for efficacy in treating GAD.\",\n"
+                "  \"Psychodynamic therapy lacks robust evidence for GAD treatment.\",\n"
+                "  \"Exposure therapy is specific to phobias, not GAD.\",\n"
+                "  \"Supportive therapy alone is insufficient for GAD management.\"\n"
+                "]\n\n"
+                "# Clinical Psychology Example 2:\n"
+                "Explanation: The PHQ-9 is a validated 9-item screening tool for major depressive disorder with scores 0-27. "
+                "Scores ≥10 indicate moderate depression requiring clinical evaluation. It assesses DSM-5 criteria for MDD.\n"
+                "Claims JSON: [\n"
+                "  \"PHQ-9 is a validated screening tool for major depressive disorder.\",\n"
+                "  \"PHQ-9 contains 9 items with total scores ranging 0-27.\",\n"
+                "  \"Scores ≥10 indicate moderate depression needing evaluation.\",\n"
+                "  \"PHQ-9 assesses DSM-5 diagnostic criteria for MDD.\"\n"
+                "]\n\n"
+                "# Occupational Therapy Example 1:\n"
+                "Explanation: The Barthel Index measures independence in activities of daily living (ADL) across 10 domains. "
+                "Scores range 0-100, with higher scores indicating greater independence. It's reliable for tracking functional recovery post-stroke.\n"
+                "Claims JSON: [\n"
+                "  \"Barthel Index measures independence in activities of daily living.\",\n"
+                "  \"The index assesses 10 functional domains.\",\n"
+                "  \"Scores range from 0 (dependent) to 100 (independent).\",\n"
+                "  \"Higher Barthel Index scores indicate greater functional independence.\",\n"
+                "  \"Barthel Index is reliable for tracking post-stroke recovery.\"\n"
+                "]\n\n"
+                "# Occupational Therapy Example 2:\n"
+                "Explanation: Adaptive utensils with built-up handles improve grip for patients with arthritis by reducing required pinch force. "
+                "Weighted utensils help tremor patients. Angled utensils assist those with limited wrist mobility. Standard utensils lack these modifications.\n"
+                "Claims JSON: [\n"
+                "  \"Built-up handle utensils improve grip for arthritis patients.\",\n"
+                "  \"Built-up handles reduce required pinch force during eating.\",\n"
+                "  \"Weighted utensils help stabilize tremors during eating.\",\n"
+                "  \"Angled utensils assist patients with limited wrist mobility.\",\n"
+                "  \"Standard utensils lack these adaptive modifications.\"\n"
+                "]\n\n"
+                "# Speech Pathology Example 1:\n"
+                "Explanation: Videofluoroscopic swallow study (VFSS) is the gold standard for dysphagia evaluation, visualizing all swallowing phases. "
+                "It detects aspiration, penetration, and pharyngeal residue. Clinical swallow exam cannot visualize aspiration. Endoscopy misses oral phase.\n"
+                "Claims JSON: [\n"
+                "  \"VFSS is the gold standard for dysphagia evaluation.\",\n"
+                "  \"VFSS visualizes all phases of swallowing in real-time.\",\n"
+                "  \"VFSS can detect aspiration, penetration, and pharyngeal residue.\",\n"
+                "  \"Clinical swallow examination cannot visualize aspiration.\",\n"
+                "  \"Fiberoptic endoscopic evaluation misses the oral phase of swallowing.\"\n"
+                "]\n\n"
+                "# Speech Pathology Example 2:\n"
+                "Explanation: The Peabody Picture Vocabulary Test (PPVT) assesses receptive vocabulary in children and adults. "
+                "It requires pointing to pictures, not verbal responses, making it suitable for nonverbal individuals. Expressive language tests require speech production.\n"
+                "Claims JSON: [\n"
+                "  \"PPVT assesses receptive vocabulary in children and adults.\",\n"
+                "  \"PPVT requires pointing to pictures, not verbal responses.\",\n"
+                "  \"PPVT is suitable for assessing nonverbal individuals.\",\n"
+                "  \"Expressive language tests require speech production.\"\n"
+                "]\n\n"
+                "Now extract claims for the explanation below.\n\n"
+                f"Explanation:\n{explanation}\n\n"
+                "Claims JSON:"
+            )
+        else:
+            prompt = (
+                "Extract atomic, checkable medical claims from this MCQA explanation.\n"
+                "Return ONLY a JSON array of 4–10 strings; each ≤ 30 words. If none, return [].\n\n"
+                f"Explanation:\n{explanation}\n\n"
+                "Claims JSON:"
+            )
+
+        try:
+            resp = await self.client.chat.completions.create(
+                model=self.model_name,
+                messages=[{"role": "user", "content": prompt}],
+                temperature=0.0,
+                max_tokens=512,
+            )
+            text = (resp.choices[0].message.content or "").strip()
+            return _parse_json_list(text)
+        except Exception:
+            return []
+
+
+def _parse_json_list(text: str) -> List[str]:
+    try:
+        data = json.loads(text)
+        if isinstance(data, list):
+            out = []
+            for x in data:
+                s = str(x).strip()
+                if s:
+                    out.append(s)
+            # de-duplicate while preserving first-seen order
+            seen = set()
+            uniq = []
+            for s in out:
+                if s not in seen:
+                    uniq.append(s)
+                    seen.add(s)
+            return uniq
+        return []
+    except Exception:
+        # fallback: find bracketed content
+        try:
+            start = text.find("[")
+            end = text.rfind("]")
+            if start != -1 and end != -1 and end > start:
+                data = json.loads(text[start : end + 1])
+                if isinstance(data, list):
+                    return list(dict.fromkeys(str(x).strip() for x in data if str(x).strip()))
+        except Exception:
+            pass
+        return []
+
+
+
+
+
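Review note on `_parse_json_list`: the helper first tries strict JSON and then falls back to the outermost bracketed span, which matters because chat models often wrap the array in prose. The standalone sketch below shows that behavior in isolation. `parse_claims` is a hypothetical name, the sample text is invented, and this version de-duplicates regardless of which path parsed the text.

```python
import json


def parse_claims(text):
    """Parse a JSON array of strings, tolerating prose around the array."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take the span from the first "[" to the last "]".
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            data = json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            return []
    if not isinstance(data, list):
        return []
    cleaned = (str(item).strip() for item in data)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(c for c in cleaned if c))


# Invented model reply with text before and after the array.
wrapped = (
    'Here are the claims: ["A1c reflects 2-3 months.", '
    '"A1c reflects 2-3 months.", " OGTT is diagnostic. "] Hope this helps.'
)
print(parse_claims(wrapped))
```

The bracket fallback fails if the surrounding prose itself contains stray brackets, in which case an empty list is returned and the caller treats the explanation as having no claims.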
diff --git a/environments/medexqa/medexqa/factscore_judge/atomic_facts_judge.py b/environments/medexqa/medexqa/factscore_judge/atomic_facts_judge.py
new file mode 100644
index 00000000..6125e742
--- /dev/null
+++ b/environments/medexqa/medexqa/factscore_judge/atomic_facts_judge.py
@@ -0,0 +1,246 @@
+"""
+FactScore-style judge for MedExQA explanations (reference-only, no external retrieval).
+
+Two-step process:
+1) Extract atomic medical claims from the model's explanation.
+2) Verify each claim against available references: question, correct option text, exp0, exp1.
+
+Returns support rate in [0, 1], scaled to [0, 100] for reward.
+"""
+
+import json
+import re
+import verifiers as vf
+from .atomic_facts_generator import AtomicFactGenerator
+
+
+JUDGE_TEMPLATE = """You are a medical knowledge verification expert. Evaluate if the Passage supports the Claim.
+
+PASSAGE:
+{response}
+
+CLAIM TO VERIFY:
+{answer}
+
+INSTRUCTIONS:
+1. Check if the claim is FULLY supported by the passage with explicit evidence
+2. Check if the claim is PARTIALLY supported (implied/inferable but not explicit)
+3. Check if the claim is NOT supported (no evidence or contradicts passage)
+
+Respond with EXACTLY ONE of:
+- "FULLY_SUPPORTED" - explicit evidence exists in passage
+- "PARTIALLY_SUPPORTED" - implied/inferable from passage
+- "NOT_SUPPORTED" - no evidence or contradicts passage
+
+Your response:""".strip()
+
+
+def extract_support_level(text: str) -> tuple[float, bool]:
+    """
+    Extract support level from LLM judge response.
+
+    Returns:
+        (score, valid): score is 0.0, 0.5, or 1.0; valid indicates if parsing succeeded
+    """
+    cleaned_text = (text or "").strip().upper()
+
+    # Check for 3-level responses
+    if "FULLY_SUPPORTED" in cleaned_text or "FULLY SUPPORTED" in cleaned_text:
+        return (1.0, True)
+    if "PARTIALLY_SUPPORTED" in cleaned_text or "PARTIALLY SUPPORTED" in cleaned_text:
+        return (0.5, True)
+    if "NOT_SUPPORTED" in cleaned_text or "NOT SUPPORTED" in cleaned_text:
+        return (0.0, True)
+
+    # Fallback to old binary format for backwards compatibility
+    cleaned_lower = cleaned_text.lower()
+    has_true = "true" in cleaned_lower
+    has_false = "false" in cleaned_lower
+    if has_true and not has_false:
+        return (1.0, True)
+    if has_false and not has_true:
+        return (0.0, True)
+
+    # Ambiguous response
+    return (0.0, False)
+
+
+async def explanation_factscore_reward(
+    judge,
+    prompt,
+    completion,
+    answer,
+    state,
+    **kwargs,
+) -> float:
+    # parse explanation text
+    if isinstance(completion, list) and completion:
+        explanation = completion[-1].get("content", "") or ""
+    else:
+        explanation = str(completion)
+
+    info = kwargs.get("info", {}) or {}
+    options = {k: info.get(k, "") for k in ["A", "B", "C", "D"]}
+    question = info.get("question", "")
+    exp0 = info.get("exp0", "")
+    exp1 = info.get("exp1", "")
+    correct_letter = (answer or "").strip().upper()
+    correct_option_text = options.get(correct_letter, "")
+
+    # Gate explanation to zero if predicted MCQ answer is wrong
+    # Parse answer first (extracts from \boxed{} in think mode, returns raw text in normal mode)
+    parser = kwargs.get("parser")
+    if parser:
+        parsed = parser.parse_answer(completion) or ""
+    else:
+        parsed = explanation
+
+    from medarc_verifiers.rewards.multiple_choice_accuracy import multiple_choice_accuracy
+
+    is_correct = multiple_choice_accuracy(
+        llm_answer=parsed,
+        answer_letter=correct_letter,
+        answer_text=correct_option_text,
+        accept_answer_text=True,
+        strip_tex=False,
+    )
+
+    if not is_correct:
+        return 0.0
+
+    # Build references block
+    refs = (
+        f"Question: {question}\n"
+        f"Correct option ({correct_letter}): {correct_option_text}\n"
+        f"Reference Explanation 1: {exp0}\n"
+        f"Reference Explanation 2: {exp1}"
+    )
+
+    # Initialize generator (reuse medredqa style)
+    llm_client = kwargs.get("judge_client")
+    llm_model = kwargs.get("judge_model", "gpt-4o-mini")
+    generator = AtomicFactGenerator(llm_client, model_name=llm_model)
+
+    # Extract atomic claims from model explanation
+    try:
+        if llm_client is None:
+            # No client available - cannot extract claims
+            return 0.0
+        claims = await generator.run(explanation, state=state)
+    except Exception as e:
+        # Log extraction error for debugging
+        import sys
+        print(f"Warning: Atomic facts extraction failed: {e}", file=sys.stderr)
+        claims = []
+    if not claims:
+        return 0.0
+
+    # Step 2a: verify each model claim against references (support_rate)
+    # One call per claim like MedRedQA approach
+    support_score = 0.0
+    total = 0
+
+    for claim in claims:
+        total += 1
+        # Call judge like medredqa does: judge(prompt, completion, answer, state, **kwargs)
+        # prompt is not used in template, completion becomes {response}, answer becomes {answer}
+        judge_response = await judge(prompt, refs, str(claim), state, **kwargs)
+        score, ok = extract_support_level(judge_response)
+        if ok:
+            support_score += score
+
+    support_rate = (support_score / total) if total > 0 else 0.0
+
+    # Step 2b: Coverage rate - DISABLED by default for speed
+    # This measures recall: does the model explanation cover key reference concepts?
+    # Enable with use_coverage=True in kwargs for balanced precision+recall evaluation
+    use_coverage = kwargs.get("use_coverage", False)
+    coverage_rate = 0.0
+
+    if use_coverage and llm_client is not None:
+        # Extract claims from both reference explanations
+        all_ref_claims: list[str] = []
+
+        # Extract from reference 1
+        if (exp0 or "").strip():
+            try:
+                ref0_claims = await generator.run(exp0, state=state)
+                all_ref_claims.extend(ref0_claims)
+            except Exception:
+                pass
+
+        # Extract from reference 2
+        if (exp1 or "").strip():
+            try:
+                ref1_claims = await generator.run(exp1, state=state)
+                all_ref_claims.extend(ref1_claims)
+            except Exception:
+                pass
+
+        # Remove duplicates while preserving order
+        seen = set()
+        unique_ref_claims = []
+        for claim in all_ref_claims:
+            if claim not in seen:
+                unique_ref_claims.append(claim)
+                seen.add(claim)
+
+        # Verify each reference claim against model explanation
+        coverage_score = 0.0
+        coverage_total = 0
+
+        for ref_claim in unique_ref_claims:
+            coverage_total += 1
+            # Check if model explanation supports this reference claim
+            # Call judge: passage=explanation, claim=ref_claim
+            cov_response = await judge(prompt, explanation, str(ref_claim), state, **kwargs)
+            score, ok = extract_support_level(cov_response)
+            if ok:
+                coverage_score += score
+
+        coverage_rate = (coverage_score / coverage_total) if coverage_total > 0 else 0.0
+
+    # Combine support and coverage (if enabled)
+    if use_coverage:
+        # Use weighted combination when coverage is enabled
+        w_support = float(kwargs.get("support_weight", 0.5))
+        w_coverage = float(kwargs.get("coverage_weight", 0.5))
+        denom = w_support + w_coverage if (w_support + w_coverage) > 0 else 1.0
+        final = (w_support * support_rate + w_coverage * coverage_rate) / denom
+    else:
+        # Coverage disabled: use support_rate only
+        final = support_rate
+
+    # Stash structured details for external loggers. Keep the caller's dict when one is passed:
+    # `state or {}` would swap an empty caller dict for a new one and silently drop the details.
+    state = {} if state is None else state
+    try:
+        state["factscore_details"] = {
+            "support_rate": float(support_rate),
+            "coverage_rate": float(coverage_rate),
+        }
+    except Exception:
+        pass
+
+    return float(final * 100.0)
+
+
+def create_factscore_judge_rubric(
+    parser: vf.Parser,
+    judge_client,
+    judge_model: str = "gpt-4o-mini",
+    use_coverage: bool = False,
+    explanation_weight: float = 1.0,
+) -> vf.JudgeRubric:
+    # Pass judge_prompt like medredqa does - uses standard {response} and {answer} placeholders
+    rubric = vf.JudgeRubric(
+        judge_client=judge_client,
+        judge_model=judge_model,
+        judge_prompt=JUDGE_TEMPLATE,
+        parser=parser,
+        use_coverage=use_coverage,  # Pass through to reward function via kwargs
+    )
+    rubric.add_reward_func(explanation_factscore_reward, weight=explanation_weight)
+    return rubric
+
+
diff --git a/environments/medexqa/medexqa/geval_judge/__init__.py b/environments/medexqa/medexqa/geval_judge/__init__.py
new file mode 100644
index 00000000..faa0103a
--- /dev/null
+++ b/environments/medexqa/medexqa/geval_judge/__init__.py
@@ -0,0 +1,6 @@
+"""G-Eval judge for MedExQA explanations."""
+
+from .geval_judge import create_geval_judge_rubric, explanation_geval_reward
+
+__all__ = ["create_geval_judge_rubric", "explanation_geval_reward"]
+
diff --git a/environments/medexqa/medexqa/geval_judge/geval_judge.py b/environments/medexqa/medexqa/geval_judge/geval_judge.py
new file mode 100644
index 00000000..d83e90eb
--- /dev/null
+++ b/environments/medexqa/medexqa/geval_judge/geval_judge.py
@@ -0,0 +1,214 @@
+import re
+import json
+import verifiers as vf
+
+
+GEVAL_CRITERIA = """You are evaluating a medical MCQA explanation for quality and correctness.
+
+Assess the explanation across these dimensions:
+
+1. MEDICAL ACCURACY: Does the explanation contain factually correct medical information that aligns with the correct option and reference explanations? Are there any medical errors or contradictions?
+
+2. CORRECT OPTION JUSTIFICATION: Does the explanation clearly explain WHY the correct answer is medically appropriate using valid clinical/scientific reasoning?
+
+3. DISTRACTOR ANALYSIS (when applicable): Does the explanation explain why incorrect options are wrong? Note: Not all explanations need this, but it enhances quality when present.
+
+4. REFERENCE ALIGNMENT: Do the explanation's core medical claims align with the key concepts in both reference explanations?
+
+5. REASONING CLARITY: Is the medical reasoning easy to follow with logical flow from evidence to conclusion?
+
+6. COMPLETENESS: Does the explanation cover the essential medical concepts without major omissions?"""
+
+
+GEVAL_EVALUATION_STEPS = [
+    "Extract all medical claims from the actual explanation and list them explicitly",
+    "Compare each claim against the correct option text and both reference explanations. Mark claims as: ALIGNED (matches references), CONTRADICTS (conflicts with references), or NEW_INFO (additional but not contradictory)",
+    "Identify if the explanation justifies WHY the correct option is right (not just states it is correct)",
+    "Check for any major medical errors, inaccuracies, or unsupported claims that could mislead",
+    "Assess whether distractor refutation is present and accurate (if applicable to this question)",
+    "Evaluate overall reasoning clarity, logical flow, and completeness",
+    "Synthesize findings into a score using this rubric: 0.0-0.2 (major errors/irrelevant/contradicts references), 0.2-0.4 (significant gaps/multiple minor errors), 0.4-0.6 (acceptable but incomplete/some inaccuracies), 0.6-0.8 (good quality with minor issues), 0.8-1.0 (excellent: comprehensive, accurate, well-reasoned)"
+]
+
+
+GEVAL_PROMPT_TEMPLATE = """You are a strict medical explanation evaluator following a structured evaluation process.
+
+CRITERIA:
+{criteria}
+
+EVALUATION STEPS (follow these in order):
+{evaluation_steps}
+
+OUTPUT FORMAT:
+Respond with a JSON object containing your step-by-step analysis and final score. Use this exact structure:
+{{
+  "step1_claims_extracted": ["claim1", "claim2", ...],
+  "step2_alignment_analysis": {{
+    "aligned_claims": [...],
+    "contradicting_claims": [...],
+    "new_info_claims": [...]
+  }},
+  "step3_correct_option_justified": true/false,
+  "step4_medical_errors_found": ["error description"] or [],
+  "step5_distractor_refutation": "present_and_accurate" / "present_but_weak" / "absent" / "not_applicable",
+  "step6_reasoning_assessment": "clear" / "somewhat_clear" / "confusing",
+  "step7_final_score": 0.XX,
+  "score_justification": "Brief 1-2 sentence explanation of the score"
+}}
+
+QUESTION CONTEXT:
+Question: {question}
+Options:
+{options}
+Correct Answer: {correct_answer}
+
+REFERENCE EXPLANATIONS:
+Reference 1: {ref_exp1}
+Reference 2: {ref_exp2}
+
+MODEL EXPLANATION TO EVALUATE:
+{model_explanation}
+
+Provide your evaluation as JSON:"""
+
+
+def _extract_score_from_json(text: str) -> tuple[float, dict]:
+    """
+    Extract score from JSON response.
+
+    Returns:
+        (score, parsed_dict): score is 0.0-1.0; parsed_dict contains full evaluation
+    """
+    try:
+        # Try to parse as JSON first
+        # Find JSON object in response (may have extra text before/after)
+        json_match = re.search(r'\{.*\}', text, re.DOTALL)
+        if json_match:
+            json_str = json_match.group(0)
+            data = json.loads(json_str)
+
+            # Extract score from step7_final_score or final_score
+            score = float(data.get("step7_final_score", data.get("final_score", 0.0)))
+            score = max(0.0, min(1.0, score))
+            return score, data
+    except Exception:
+        pass
+
+    # Fallback: plain "final_score: X" or a truncated JSON fragment such as "step7_final_score": X
+    try:
+        m = re.search(r'final_score"?\s*:\s*(\d+\.\d+|\d+)', text, flags=re.IGNORECASE)
+        if m:
+            score = float(m.group(1))
+            return max(0.0, min(1.0, score)), {}
+    except Exception:
+        pass
+
+    # Last fallback: extract any number
+    try:
+        m = re.search(r"(\d+\.\d+|\d+)", text.strip())
+        if m:
+            val = float(m.group(1))
+            return max(0.0, min(1.0, val)), {}
+    except Exception:
+        pass
+
+    return 0.0, {}
+
+
+async def explanation_geval_reward(
+    judge,
+    prompt,
+    completion,
+    answer,
+    state,
+    **kwargs,
+) -> float:
+    # Extract the last assistant message content as the explanation text
+    if isinstance(completion, list) and completion:
+        completion_text = completion[-1].get("content", "") or ""
+    else:
+        completion_text = str(completion)
+
+    info = kwargs.get("info", {}) or {}
+    options = {k: info.get(k, "") for k in ["A", "B", "C", "D"]}
+    question = info.get("question", "")
+    exp0 = info.get("exp0", "")
+    exp1 = info.get("exp1", "")
+    correct_letter = (answer or "").strip().upper()
+
+    # Gate explanation to zero if predicted MCQ answer is wrong
+    # Parse answer first (XMLParser extracts the <answer> field in think mode; raw text in normal mode)
+    parser = kwargs.get("parser")
+    if parser:
+        parsed = parser.parse_answer(completion) or ""
+    else:
+        parsed = completion_text
+
+    from medarc_verifiers.rewards.multiple_choice_accuracy import multiple_choice_accuracy
+
+    correct_option_text = options.get(correct_letter, "")
+    is_correct = multiple_choice_accuracy(
+        llm_answer=parsed,
+        answer_letter=correct_letter,
+        answer_text=correct_option_text,
+        accept_answer_text=True,
+        strip_tex=False,
+    )
+
+    if not is_correct:
+        return 0.0
+
+    # Format options string
+    opts_str = "\n".join(f"{k}. {options.get(k, '')}" for k in ["A", "B", "C", "D"])
+
+    # Format evaluation steps for prompt
+    steps_formatted = "\n".join([f"{i+1}. {step}" for i, step in enumerate(GEVAL_EVALUATION_STEPS)])
+
+    # Build the full prompt using the new template
+    full_prompt = GEVAL_PROMPT_TEMPLATE.format(
+        criteria=GEVAL_CRITERIA,
+        evaluation_steps=steps_formatted,
+        question=question,
+        options=opts_str,
+        correct_answer=f"{correct_letter} ({options.get(correct_letter, '')})",
+        ref_exp1=exp0,
+        ref_exp2=exp1,
+        model_explanation=completion_text
+    )
+
+    # Call judge with structured prompt requesting JSON
+    judge_response = await judge([
+        {"role": "system", "content": "You are a strict, deterministic medical evaluator. Follow the evaluation steps carefully and output valid JSON only."},
+        {"role": "user", "content": full_prompt}
+    ], "", "", state, **kwargs)
+
+    # Parse JSON response and extract score
+    txt = str(judge_response)
+    score, eval_details = _extract_score_from_json(txt)
+
+    # Optionally store evaluation details in state for debugging/logging
+    if state is not None and eval_details:
+        try:
+            state["geval_details"] = eval_details
+        except Exception:
+            pass
+
+    return float(score * 100.0)
+
+
+def create_geval_judge_rubric(
+    parser: vf.Parser,
+    judge_client,
+    judge_model: str = "gpt-4o-mini",
+    explanation_weight: float = 1.0,
+) -> vf.JudgeRubric:
+    rubric = vf.JudgeRubric(
+        judge_client=judge_client,
+        judge_model=judge_model,
+        judge_prompt="{question}",  # not used directly; reward builds full prompt
+        parser=parser,
+    )
+    rubric.add_reward_func(explanation_geval_reward, weight=explanation_weight)
+    return rubric
+
+
diff --git a/environments/medexqa/medexqa/main.py b/environments/medexqa/medexqa/main.py
new file mode 100644
index 00000000..45440b7d
--- /dev/null
+++ b/environments/medexqa/medexqa/main.py
@@ -0,0 +1,351 @@
+import os
+import re
+
+import verifiers as vf
+from datasets import Dataset, concatenate_datasets
+import pandas as pd
+import evaluate
+from openai import AsyncOpenAI
+
+from medarc_verifiers.rewards.multiple_choice_accuracy import multiple_choice_accuracy
+from medarc_verifiers.utils.randomize_multiple_choice import randomize_multiple_choice
+
+
+# MedExQA specialties
+SPECIALTIES = [
+    "biomedical_engineer",
+    "clinical_laboratory_scientist",
+    "clinical_psychologist",
+    "occupational_therapist",
+    "speech_pathologist",
+]
+
+# Authors' prompt, taken directly from https://github.com/knowlab/MedExQA/blob/9a5b34af103b0c8ba0c00906e278f6572249fafa/evaluate_pipe_MedExQA.py#L32
+def _build_question_str(question: str, options: dict[str, str], use_think: bool = False) -> str:
+    """Build user prompt with authors' instruction embedded (as in their script).
+
+    The instruction lives in the user message; the system prompt remains empty in
+    normal mode. In think mode, system prompt instructs use of <think> and <answer> tags.
+    """
+    instruction = (
+        "The following is a multiple-choice question. Please choose the most suitable one "
+        "among A, B, C and D as the answer to this question. Your answer should be paired "
+        "with an explanation why you chose that answer.\n\n"
+    )
+    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
+    return f"{instruction}{question}\n{opts}\nAnswer:"
+
+
+def _to_vf_format(ds: Dataset, shuffle_answers: bool, shuffle_seed: int | None) -> Dataset:
+    """Normalize raw rows into the fields expected by SingleTurnEnv.
+
+    Produces rows of the form:
+      - question: string containing authors' instruction, question, and options
+      - answer: gold letter (A/B/C/D) - shuffled if shuffle_answers=True
+      - info: original fields including exp0/exp1 and specialty, plus shuffled options
+    """
+    def _format_row(row: dict) -> dict:
+        question = row.get("question", "") or ""
+
+        # Build options dict from A, B, C, D columns
+        opts = {
+            "A": row.get("A", ""),
+            "B": row.get("B", ""),
+            "C": row.get("C", ""),
+            "D": row.get("D", ""),
+        }
+
+        # Get answer letter
+        answer_letter = (row.get("answer") or "").strip().upper()
+        if answer_letter not in ("A", "B", "C", "D"):
+            return None
+
+        # Shuffle options if requested
+        if shuffle_answers:
+            opts, answer_letter, _ = randomize_multiple_choice(
+                options=opts,
+                answer_choice=answer_letter,
+                seed=shuffle_seed,
+                row_id=question,  # Use question text for deterministic per-row shuffling
+            )
+
+        question_str = _build_question_str(question, opts)
+
+        # Keep original data in info
+        info = dict(row)
+
+        # Update info with shuffled values
+        if shuffle_answers:
+            info["A"] = opts["A"]
+            info["B"] = opts["B"]
+            info["C"] = opts["C"]
+            info["D"] = opts["D"]
+            info["answer"] = answer_letter
+
+        return {
+            "question": question_str,
+            "answer": answer_letter,
+            "info": info,
+        }
+
+    # Disable cache when shuffling to ensure fresh randomization
+    load_from_cache_file = not shuffle_answers
+    # Drop invalid rows before mapping; a filter applied after map never sees a None row.
+    return ds.filter(
+        lambda row: str(row["answer"] or "").strip().upper() in ("A", "B", "C", "D"),
+        load_from_cache_file=load_from_cache_file,
+    ).map(_format_row, remove_columns=ds.column_names, load_from_cache_file=load_from_cache_file)
+
+
+def load_environment(
+    use_think: bool = False,
+    use_explanations: bool = True,
+    mcq_weight: float = 0.5,
+    explanation_weight: float = 0.5,
+    specialty: list[str] | str | None = None,  # list of short codes or full names; None/"ALL" => all
+    explanation_metrics: list[str] | str | None = None,  # None/"all" => average of all four
+    # MCQ shuffling
+    shuffle_answers: bool = False,
+    shuffle_seed: int | None = 1618,
+    # Optional judge settings
+    use_judge: bool = False,
+    judge_mode: str | None = None,  # "g-eval" | "factscore"
+    judge_model: str = "gpt-4o-mini",
+    judge_base_url: str | None = None,
+    judge_api_key: str | None = None,
+    use_coverage: bool = False,  # For FactScore: enable coverage calculation (slower but comprehensive)
+    **kwargs
+) -> vf.Environment:
+    """
+    Single-turn MedExQA environment using HuggingFace `bluesky333/MedExQA` dataset
+
+    Key behaviors:
+      - User prompt embeds the authors' instruction and the options (authors' format).
+      - System prompt: empty (normal) or a <think>/<answer> tag instruction (think mode).
+      - Specialty selection: accepts list or string; loads requested specialties (None/ALL => all).
+      - MCQ accuracy: scored with `multiple_choice_accuracy` (letter or option-text match); returns 0 or 100.
+      - Explanation score: lexical metrics (ROUGE-L, BLEU, METEOR, BERTScore) averaged 0–100; 0 if answer wrong.
+      - Optional judge mode: explanation scored by JudgeRubric (0–100).
+    """
+
+    # Load specialties (one or more)
+    # Note: MedExQA only has dev and test splits, no train split
+    # Load TSV files directly since HF dataset has column name issues
+
+    # Resolve allowed specialties up-front and only load those files
+    code_map = {
+        "BE": "biomedical_engineer",
+        "CLS": "clinical_laboratory_scientist",
+        "CP": "clinical_psychologist",
+        "OT": "occupational_therapist",
+        "SLP": "speech_pathologist",
+        "ALL": "all",
+    }
+    allowed_names: set[str]
+    if specialty is None or (isinstance(specialty, str) and (specialty.upper() in ("ALL", ""))):
+        allowed_names = set(SPECIALTIES)
+    elif isinstance(specialty, str):
+        allowed_names = {code_map.get(specialty.upper(), specialty)}
+    else:
+        tmp = set()
+        for s in specialty:
+            name = code_map.get((s or "").upper(), s)
+            if name and name != "all":
+                tmp.add(name)
+        allowed_names = tmp if tmp else set(SPECIALTIES)
+    macro_active = len(allowed_names) > 1
+
+    # Load all requested specialties
+    test_datasets = []
+    for sp_name in SPECIALTIES:
+        if sp_name not in allowed_names:
+            continue
+        try:
+            url = f"https://huggingface.co/datasets/bluesky333/MedExQA/resolve/main/test/{sp_name}_test.tsv"
+            df = pd.read_csv(
+                url,
+                sep='\t',
+                header=None,
+                names=["question", "A", "B", "C", "D", "exp0", "exp1", "answer"]
+            )
+            df['specialty'] = sp_name
+            ds_part = Dataset.from_pandas(df, preserve_index=False)
+            test_datasets.append(ds_part)
+        except Exception as e:
+            print(f"Warning: Could not load {sp_name}: {e}")
+            continue
+
+    # Concatenate and format for verifiers - no training dataset available
+    test_combined = concatenate_datasets(test_datasets) if test_datasets else None
+    test_ds = _to_vf_format(test_combined, shuffle_answers, shuffle_seed) if test_combined else None
+
+    # Shuffle examples if multiple specialties were selected
+    if macro_active and test_ds is not None:
+        try:
+            test_ds = test_ds.shuffle(seed=int(kwargs.get("seed", 0)))
+        except Exception:
+            pass
+
+    # Setup system prompt and parser - standardized with medredqa approach
+    # - Normal mode: No system prompt, parser returns raw text
+    # - Think mode: XML system prompt, XMLParser extracts from <answer> tags
+    if use_think:
+        # Like medredqa: think in <think> tags, answer+explanation in <answer> tags
+        system_prompt = (
+            "Think step-by-step inside <think>...</think> tags. "
+            "Then, inside <answer>...</answer> tags, provide your final answer choice (A, B, C, or D) "
+            "followed by an explanation of why you chose that answer."
+        )
+        parser = vf.XMLParser(fields=["think", "answer"], answer_field="answer")
+    else:
+        # Normal mode: no system prompt, parser returns raw text for multiple_choice_accuracy
+        system_prompt = ""
+        parser = vf.Parser()
+
+    # (shuffling handled above when multiple specialties)
+
+    # Lexical Metrics selection; pass individually or None/'all'/'overall' => average of all four
+    base_metrics = ["rougeL", "bleu", "meteor", "bertscore"]
+    if explanation_metrics is None:
+        selected_metrics = base_metrics
+    else:
+        if isinstance(explanation_metrics, str) and explanation_metrics.lower() in ("all", "overall"):
+            selected_metrics = base_metrics
+        elif isinstance(explanation_metrics, list) and any(str(m).lower() in ("all", "overall") for m in explanation_metrics):
+            selected_metrics = base_metrics
+        else:
+            selected_metrics = explanation_metrics
+
+    def compute_metric_score(metric_name: str, prediction: str, refs: list[str]) -> float:
+        try:
+            name = metric_name.lower()
+            if name in ("rouge", "rougel"):
+                rouge = evaluate.load("rouge")
+                res = rouge.compute(predictions=[prediction], references=[refs])
+                return float(res.get("rougeL", 0.0)) * 100.0
+            if name == "bleu":
+                bleu = evaluate.load("bleu")
+                res = bleu.compute(predictions=[prediction], references=[refs])
+                sc = float(res.get("bleu", 0.0))
+                return sc * 100.0 if sc <= 1.0 else sc
+            if name == "meteor":
+                meteor = evaluate.load("meteor")
+                res = meteor.compute(predictions=[prediction], references=[refs])
+                sc = float(res.get("meteor", 0.0))
+                return sc * 100.0 if sc <= 1.0 else sc
+            if name == "bertscore":
+                bscore = evaluate.load("bertscore")
+                res = bscore.compute(
+                    predictions=[prediction],
+                    references=[refs],
+                    model_type="allenai/scibert_scivocab_uncased",
+                    lang="en",
+                    rescale_with_baseline=False,
+                )
+                f1_list = res.get("f1", [])
+                return (float(f1_list[0]) * 100.0) if f1_list else 0.0
+            return 0.0
+        except Exception:
+            return 0.0
+
+    def compute_expl_score(pred: str, exp0: str, exp1: str) -> float:
+        refs = [exp0 or "", exp1 or ""]
+        metric_vals = [compute_metric_score(m, pred, refs) for m in selected_metrics]
+        metric_vals = [v for v in metric_vals if v is not None]
+        if not metric_vals:
+            return 0.0
+        # always average across selected metrics
+        return (sum(metric_vals) / len(metric_vals))
+
+    # Note: No per-example macro scaling.
+
+    def _get_completion_text(completion_obj) -> str:
+        if isinstance(completion_obj, list) and completion_obj:
+            return completion_obj[-1].get("content", "") or ""
+        return completion_obj if isinstance(completion_obj, str) else str(completion_obj)
+
+    def answer_accuracy_reward(parser, completion, answer, **kwargs) -> float:
+        # Parse answer first (XMLParser extracts the <answer> field in think mode; raw text in normal mode)
+        parsed = parser.parse_answer(completion) or ""
+        info = kwargs.get("info", {}) or {}
+
+        # Get answer_text for fallback matching
+        options = {"A": info.get("A", ""), "B": info.get("B", ""), "C": info.get("C", ""), "D": info.get("D", "")}
+        answer_text = options.get(answer, "")
+
+        is_correct = multiple_choice_accuracy(
+            llm_answer=parsed,
+            answer_letter=answer,
+            answer_text=answer_text,
+            accept_answer_text=True,
+            strip_tex=False,  # MedExQA doesn't use LaTeX
+        )
+        return 100.0 if is_correct else 0.0
+
+    def explanation_reward(parser, completion, answer, **kwargs) -> float:
+        # Parse answer first (XMLParser extracts the <answer> field in think mode; raw text in normal mode)
+        parsed = parser.parse_answer(completion) or ""
+        info = kwargs.get("info", {}) or {}
+
+        # Get answer_text for fallback matching
+        options = {"A": info.get("A", ""), "B": info.get("B", ""), "C": info.get("C", ""), "D": info.get("D", "")}
+        answer_text = options.get(answer, "")
+
+        # Check if answer is correct using multiple_choice_accuracy
+        is_correct = multiple_choice_accuracy(
+            llm_answer=parsed,
+            answer_letter=answer,
+            answer_text=answer_text,
+            accept_answer_text=True,
+            strip_tex=False,
+        )
+
+        if not is_correct:
+            return 0.0
+        else:
+            # For lexical metrics, use the raw completion text (not parsed)
+            completion_text = _get_completion_text(completion)
+            return compute_expl_score(completion_text, info.get("exp0", ""), info.get("exp1", ""))
+
+    # Optional: Use LLM-as-judge for explanation instead of lexical metrics
+    if use_explanations and use_judge:
+        api_key = judge_api_key if judge_api_key else os.getenv("JUDGE_API_KEY") or os.getenv("OPENAI_API_KEY")
+        judge_client = AsyncOpenAI(base_url=judge_base_url, api_key=api_key) if api_key else None
+        if judge_mode is None:
+            raise ValueError("use_judge=True requires judge_mode to be one of {'g-eval','factscore'}")
+        if judge_mode not in ("g-eval", "factscore"):
+            raise ValueError("judge_mode must be 'g-eval' or 'factscore'")
+
+        if judge_mode == "g-eval":
+            from medexqa.geval_judge.geval_judge import create_geval_judge_rubric
+            judge_rubric = create_geval_judge_rubric(parser=parser, judge_client=judge_client, judge_model=judge_model, explanation_weight=explanation_weight)
+            # Combine answer accuracy with the judge-based explanation score
+            judge_rubric.add_reward_func(answer_accuracy_reward, weight=mcq_weight)
+            rubric = judge_rubric
+        elif judge_mode == "factscore":
+            from medexqa.factscore_judge.atomic_facts_judge import create_factscore_judge_rubric
+            judge_rubric = create_factscore_judge_rubric(parser=parser, judge_client=judge_client, judge_model=judge_model, use_coverage=use_coverage, explanation_weight=explanation_weight)
+            judge_rubric.add_reward_func(answer_accuracy_reward, weight=mcq_weight)
+            rubric = judge_rubric
+    else:
+        # Lexical metrics for explanations (or MCQ-only if use_explanations=False)
+        if use_explanations:
+            rubric = vf.Rubric(
+                funcs=[answer_accuracy_reward, explanation_reward],
+                weights=[mcq_weight, explanation_weight],
+                parser=parser
+            )
+        else:
+            # MCQ-only mode
+            rubric = vf.Rubric(funcs=[answer_accuracy_reward], weights=[1.0], parser=parser)
+
+    env = vf.SingleTurnEnv(
+        dataset=None,  # No training split available
+        eval_dataset=test_ds,
+        system_prompt=system_prompt,
+        parser=parser,
+        rubric=rubric,
+        **kwargs
+    )
+
+    return env
diff --git a/environments/medexqa/pyproject.toml b/environments/medexqa/pyproject.toml
new file mode 100644
index 00000000..ffa5a7d4
--- /dev/null
+++ b/environments/medexqa/pyproject.toml
@@ -0,0 +1,33 @@
+[project]
+name = "medexqa"
+version = "0.1.0"
+description = "MedExQA Evaluation - Medical QA with Multiple Explanations"
+readme = "README.md"
+requires-python = ">=3.11"
+authors = [
+    { name = "Nishant Mishra", email = "mnishant2@gmail.com" },
+]
+dependencies = [
+    "datasets>=4.0.0",
+    "verifiers>=0.1.2.post0",
+    "pandas>=2.0.0",
+    "evaluate>=0.4.0",
+    "rouge-score>=0.1.2",
+    "sacrebleu>=2.4.0",
+    "bert-score>=0.3.13",
+    "openai>=1.0.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build]
+include = ["medexqa/**"]
+packages = ["medexqa"]
+
+[tool.prime.environment]
+# lets Prime/vf-eval know where the loader lives in a flat repo
+loader = "medexqa:load_environment"
+display_name = "MedExQA"
+visibility = "PUBLIC"
diff --git a/environments/medexqa/tools/judge_rescore.py b/environments/medexqa/tools/judge_rescore.py
new file mode 100644
index 00000000..7f5ada8a
--- /dev/null
+++ b/environments/medexqa/tools/judge_rescore.py
@@ -0,0 +1,350 @@
+import argparse
+import asyncio
+import csv
+import glob
+import json
+import os
+import re
+from typing import Any, Dict, List, Tuple
+
+from openai import AsyncOpenAI
+
+# Reuse existing judge implementations from the installed `medexqa` package
+from medexqa.geval_judge.geval_judge import (
+    explanation_geval_reward as geval_reward,
+)
+from medexqa.factscore_judge.atomic_facts_judge import (
+    explanation_factscore_reward as factscore_reward,
+)
+
+
+def _extract_numeric(text: str) -> float:
+    m = re.search(r"(\d+\.\d+|\d+)", (text or "").strip())
+    if not m:
+        return 0.0
+    try:
+        val = float(m.group(1))
+        return max(0.0, min(1.0, val))
+    except Exception:
+        return 0.0
+
+
+def _read_results(paths: List[str]) -> List[Tuple[str, Dict[str, Any]]]:
+    rows: List[Tuple[str, Dict[str, Any]]] = []
+    for p in paths:
+        try:
+            with open(p, "r") as f:
+                for line in f:
+                    line = line.strip()
+                    if not line:
+                        continue
+                    rec = json.loads(line)
+                    rows.append((p, rec))
+        except Exception as e:
+            print(f"Warning: failed to read {p}: {e}")
+    return rows
+
+
+class JudgeRecorder:
+    def __init__(self, client: AsyncOpenAI, model: str, sleep_ms: int = 500, max_retries: int = 5, verbose: bool = True, max_tokens: int = 384):
+        self.client = client
+        self.model = model
+        self.sleep_ms = max(0, int(sleep_ms))
+        self.max_retries = max(1, int(max_retries))
+        self.verbose = verbose
+        self.max_tokens = max_tokens
+        self.logs: List[Dict[str, str]] = []
+
+    async def __call__(self, messages, *_args, **_kwargs) -> str:
+        # messages is a list of {role, content}
+        content = messages[-1].get("content", "") if messages else ""
+        attempt = 0
+        delay = self.sleep_ms / 1000.0
+        while True:
+            try:
+                if self.verbose:
+                    print(f"[judge] calling model={self.model}, tokens<={self.max_tokens}")
+                resp = await self.client.chat.completions.create(
+                    model=self.model,
+                    messages=messages,
+                    temperature=0,
+                    max_tokens=self.max_tokens,
+                )
+                out = resp.choices[0].message.content or ""
+                self.logs.append({"prompt": content, "response": out})
+                # throttle between calls
+                if self.sleep_ms > 0:
+                    await asyncio.sleep(self.sleep_ms / 1000.0)
+                return out
+            except Exception as e:
+                attempt += 1
+                msg = str(e)
+                if self.verbose:
+                    print(f"[judge] error on attempt {attempt}: {msg}")
+                # retry on rate limit or transient errors
+                if attempt < self.max_retries:
+                    # exponential backoff with floor at configured delay
+                    backoff = delay * (2 ** (attempt - 1))
+                    await asyncio.sleep(backoff)
+                    continue
+                # record failure
+                self.logs.append({"prompt": content, "response": f"<ERROR>: {msg}"})
+                return ""
+
+
+async def judge_geval(
+    client: AsyncOpenAI,
+    model: str,
+    rec: Dict[str, Any],
+    *,
+    sleep_ms: int = 500,
+    max_retries: int = 5,
+    verbose: bool = True,
+    max_tokens: int = 384,
+) -> Tuple[float, str, str]:
+    info = rec.get("info", {}) or {}
+    question = info.get("question", "")
+    options = {k: info.get(k, "") for k in ["A", "B", "C", "D"]}
+    exp0 = info.get("exp0", "")
+    exp1 = info.get("exp1", "")
+    answer = rec.get("answer", "")
+    completion_msgs = rec.get("completion", [])
+
+    jr = JudgeRecorder(client, model, sleep_ms=sleep_ms, max_retries=max_retries, verbose=verbose, max_tokens=max_tokens)
+    score = await geval_reward(jr, None, completion_msgs, answer, state={}, info=info, judge_client=client, judge_model=model)
+    # Last log entry contains the overall prompt/response
+    judge_output = jr.logs[-1]["response"] if jr.logs else ""
+    refs = (
+        f"Question: {question}\n"
+        f"Correct answer: {answer} ({options.get(answer,'')})\n"
+        f"Ref1: {exp0}\n"
+        f"Ref2: {exp1}"
+    )
+    return float(score), judge_output, refs
+
+
+async def judge_factscore(
+    client: AsyncOpenAI,
+    model: str,
+    rec: Dict[str, Any],
+    *,
+    sleep_ms: int = 500,
+    max_retries: int = 5,
+    verbose: bool = True,
+    max_tokens: int = 384,
+    use_coverage: bool = False,
+) -> Tuple[float, str, str]:
+    info = rec.get("info", {}) or {}
+    question = info.get("question", "")
+    options = {k: info.get(k, "") for k in ["A", "B", "C", "D"]}
+    exp0 = info.get("exp0", "")
+    exp1 = info.get("exp1", "")
+    answer = rec.get("answer", "")
+    completion_msgs = rec.get("completion", [])
+
+    jr = JudgeRecorder(client, model, sleep_ms=sleep_ms, max_retries=max_retries, verbose=verbose, max_tokens=max_tokens)
+    score = await factscore_reward(jr, None, completion_msgs, answer, state={}, info=info, judge_client=client, judge_model=model, use_coverage=use_coverage)
+
+    # Parse logs to reconstruct claim labels and track extraction outcomes
+    labels: List[Tuple[str, str]] = []  # (claim, label) where passage=references (support)
+    coverage_labels: List[Tuple[str, str]] = []  # (ref_claim, label) where passage=explanation (coverage)
+    extraction_responses: List[str] = []
+    for entry in jr.logs:
+        prompt = entry.get("prompt", "") or ""
+        response = entry.get("response", "") or ""
+
+        # Look for new format: "CLAIM TO VERIFY:\n<claim>"
+        m = re.search(r"CLAIM TO VERIFY:\s*\n(.+?)(?:\n\nINSTRUCTIONS:)", prompt, flags=re.DOTALL)
+        if m:
+            claim = m.group(1).strip()
+            # Heuristic: prompts containing "PASSAGE (Combined References):" belong to support
+            # Prompts with "MODEL EXPLANATION:" belong to coverage
+            if "PASSAGE (Combined References):" in prompt:
+                labels.append((claim, response.strip().upper()))
+            elif "MODEL EXPLANATION:" in prompt:
+                coverage_labels.append((claim, response.strip().upper()))
+
+        # Fallback: old format "Fact:\n<fact>"
+        if not m:
+            m_old = re.search(r"Fact:\s*\n(.+)$", prompt, flags=re.DOTALL)
+            if m_old:
+                claim = m_old.group(1).strip()
+                if "Question:" in prompt:
+                    labels.append((claim, response.strip().upper()))
+                else:
+                    coverage_labels.append((claim, response.strip().upper()))
+
+        if "Claims JSON:" in prompt:
+            extraction_responses.append(response)
+
+    refs = (
+        f"Question: {question}\n"
+        f"Correct: ({answer}) {options.get(answer,'')}\n"
+        f"Ref1: {exp0}\n"
+        f"Ref2: {exp1}"
+    )
+    # Derive error tag for extraction phase
+    err_tag = ""
+    if extraction_responses:
+        last_extraction = extraction_responses[-1]
+        try:
+            parsed = json.loads(last_extraction)
+            if isinstance(parsed, list) and len(parsed) == 0:
+                err_tag = "empty_extraction"
+        except Exception:
+            err_tag = "extraction_error"
+    elif not labels and not coverage_labels:
+        err_tag = "empty_extraction"
+
+    # Compute support/coverage rates from labels (handle 3-level format)
+    def _rate(pairs: List[Tuple[str, str]]) -> float:
+        if not pairs:
+            return 0.0
+        total_score = 0.0
+        for _, lbl in pairs:
+            lbl_clean = (lbl or "").strip().upper()
+            if "FULLY_SUPPORTED" in lbl_clean or "FULLY SUPPORTED" in lbl_clean:
+                total_score += 1.0
+            elif "PARTIALLY_SUPPORTED" in lbl_clean or "PARTIALLY SUPPORTED" in lbl_clean:
+                total_score += 0.5
+            elif lbl_clean.startswith("TRUE"):  # Fallback for old format
+                total_score += 1.0
+            # NOT_SUPPORTED or FALSE = 0.0
+        return float(total_score) / float(len(pairs))
+
+    support_rate = _rate(labels)
+    coverage_rate = _rate(coverage_labels)
+
+    details = json.dumps({
+        "claims": labels,
+        "coverage_labels": coverage_labels,
+        "support_rate": support_rate,
+        "coverage_rate": coverage_rate,
+    }, ensure_ascii=False)
+    if err_tag:
+        refs = refs + f"\n<ERR>: {err_tag}"
+    return float(score), details, refs
+
+
+async def main():
+    ap = argparse.ArgumentParser(description="Re-score saved MedExQA completions with LLM judges.")
+    ap.add_argument("--base", default="https://openrouter.ai/api/v1", help="Judge API base URL")
+    ap.add_argument("--model", default="openai/gpt-oss-20b:free", help="Judge model id")
+    ap.add_argument("--key_var", default="OPENAI_API_KEY", help="Env var name holding the API key")
+    ap.add_argument("--input_glob", default="environments/medexqa/outputs/evals/**/results.jsonl", help="Glob to results.jsonl files")
+    ap.add_argument("--out_csv_prefix", default="environments/medexqa/outputs/judge_scores/medexqa_", help="Output CSV prefix (will append judge name)")
+    ap.add_argument("--sleep_ms", type=int, default=500, help="Sleep/throttle between judge calls (ms)")
+    ap.add_argument("--max_retries", type=int, default=5, help="Max retries on judge call errors")
+    ap.add_argument("--max_tokens", type=int, default=384, help="Max tokens per judge response")
+    ap.add_argument("--verbose", action="store_true", help="Verbose logging")
+    ap.add_argument("--judge", choices=["geval", "factscore", "both"], default="both", help="Which judge(s) to run")
+    ap.add_argument("--use_coverage", action="store_true", help="Enable coverage calculation for FactScore (slower but more comprehensive)")
+    args = ap.parse_args()
+
+    api_key = os.getenv(args.key_var)
+    if not api_key:
+        raise SystemExit(f"Missing API key in env var {args.key_var}")
+
+    client = AsyncOpenAI(base_url=args.base, api_key=api_key)
+
+    # Discover saved runs
+    paths = sorted(glob.glob(args.input_glob, recursive=True))
+    if args.verbose:
+        print(f"Scanning {len(paths)} results.jsonl files...")
+    rows = _read_results(paths)
+    if not rows:
+        print("No results found to re-score.")
+        return
+
+    os.makedirs(os.path.dirname(args.out_csv_prefix) or ".", exist_ok=True)
+
+    # Prepare CSV writers conditionally
+    gwriter = None
+    fwriter = None
+    geval_path = args.out_csv_prefix + "geval.csv"
+    fact_path = args.out_csv_prefix + "factscore.csv"
+    if args.judge in ("geval", "both"):
+        gf = open(geval_path, "w", newline="", encoding="utf-8")
+        gwriter = csv.writer(gf)
+        gwriter.writerow(["run_file", "specialty", "question", "A", "B", "C", "D", "answer", "completion", "judge_model_output", "judge_score", "references", "error"])
+    if args.judge in ("factscore", "both"):
+        ff = open(fact_path, "w", newline="", encoding="utf-8")
+        fwriter = csv.writer(ff)
+        fwriter.writerow(["run_file", "specialty", "question", "A", "B", "C", "D", "answer", "completion", "claims_labels_json", "support_rate", "coverage_labels_json", "coverage_rate", "final_score", "references", "error"])
+
+    # Process sequentially to keep it simple
+    for idx, (run_file, rec) in enumerate(rows, start=1):
+        info = rec.get("info", {}) or {}
+        spec = info.get("specialty", "")
+        question = info.get("question", "")
+        A = info.get("A", "")
+        B = info.get("B", "")
+        C = info.get("C", "")
+        D = info.get("D", "")
+        answer = rec.get("answer", "")
+        completion_msgs = rec.get("completion", [])
+        completion_text = completion_msgs[-1].get("content", "") if completion_msgs else ""
+
+        if args.verbose:
+            print(f"[{idx}/{len(rows)}] {run_file} | spec={spec} | len(question)={len(question)} | len(completion)={len(completion_text)}")
+
+        # G-Eval
+        if args.judge in ("geval", "both") and gwriter is not None:
+            if args.verbose:
+                print("  -> G-Eval judging...")
+            g_score, g_out, g_refs = await judge_geval(
+                client,
+                args.model,
+                rec,
+                sleep_ms=args.sleep_ms,
+                max_retries=args.max_retries,
+                verbose=args.verbose,
+                max_tokens=args.max_tokens,
+            )
+            # detect errors in logs
+            g_err = ""
+            if g_out.strip().startswith("<ERROR>") or g_out.strip() == "":
+                g_err = "empty_or_error"
+            gwriter.writerow([run_file, spec, question, A, B, C, D, answer, completion_text, g_out, f"{g_score:.3f}", g_refs, g_err])
+
+        # FactScore
+        if args.judge in ("factscore", "both") and fwriter is not None:
+            if args.verbose:
+                print("  -> FactScore judging...")
+            f_score, f_details, f_refs = await judge_factscore(
+                client,
+                args.model,
+                rec,
+                sleep_ms=args.sleep_ms,
+                max_retries=args.max_retries,
+                verbose=args.verbose,
+                max_tokens=args.max_tokens,
+                use_coverage=args.use_coverage,
+            )
+            f_err = ""
+            support_rate = ""
+            coverage_rate = ""
+            coverage_labels_json = "[]"
+            try:
+                dd = json.loads(f_details)
+                support_rate = f"{float(dd.get('support_rate', 0.0)):.3f}"
+                coverage_rate = f"{float(dd.get('coverage_rate', 0.0)):.3f}"
+                coverage_labels_json = json.dumps(dd.get("coverage_labels", []), ensure_ascii=False)
+            except Exception:
+                pass
+            if f_details.strip() in ("", "{}") or "<ERR>:" in f_refs:
+                f_err = "empty_or_error"
+            fwriter.writerow([run_file, spec, question, A, B, C, D, answer, completion_text, f_details, support_rate, coverage_labels_json, coverage_rate, f"{f_score:.3f}", f_refs, f_err])
+
+    if gwriter is not None:
+        gf.close()
+        print(f"Wrote: {geval_path}")
+    if fwriter is not None:
+        ff.close()
+        print(f"Wrote: {fact_path}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+
+