# TabReX

This repository contains reference code for TabReX, a referenceless, explainable evaluation framework for LLM-generated tables using graph-based reasoning and rubric-aware scoring.

Evaluating tables generated by large language models is hard: many metrics flatten tables into text (losing structure), while reference-based metrics require a gold table, limiting generalization to new schemas and valid alternative table layouts. TabReX addresses this with a referenceless, property-driven evaluation pipeline that provides interpretable scores and cell-level error traces.
TabReX works in three key steps:

1. **Canonicalize to knowledge graphs.** Convert the source text and the generated table into canonical knowledge graphs to preserve structure and factual relations.
2. **LLM-guided graph alignment.** Align nodes/edges between the two graphs using an LLM-guided matching procedure to robustly handle paraphrases, schema variation, and re-orderings.
3. **Rubric-aware scoring + explanations.** Compute interpretable, property-based scores that quantify structural fidelity and factual correctness, producing fine-grained diagnostics (e.g., which cells/relations are unsupported or mismatched).
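The scoring idea in step 3 can be illustrated with a toy sketch. This is *not* the TabReX implementation (which uses LLM-guided alignment and rubric-aware properties); it only shows the underlying notion of comparing two graphs, represented as sets of (subject, relation, object) triples, and emitting error traces alongside the scores:

```python
# Toy illustration of triple-level scoring with error traces.
# NOT the TabReX implementation; real alignment is LLM-guided and
# tolerant of paraphrases, whereas this sketch uses exact matching.

def score_triples(source, generated):
    """Return precision/recall over exact triple matches plus error traces."""
    src, gen = set(source), set(generated)
    matched = src & gen
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(src) if src else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "unsupported": sorted(gen - src),  # in the table but not the source
        "missing": sorted(src - gen),      # in the source but not the table
    }

source = [("Paris", "capital_of", "France"),
          ("Berlin", "capital_of", "Germany")]
generated = [("Paris", "capital_of", "France"),
             ("Rome", "capital_of", "France")]
report = score_triples(source, generated)
print(report["precision"], report["recall"])  # 0.5 0.5
print(report["unsupported"])  # [('Rome', 'capital_of', 'France')]
```

The "unsupported"/"missing" lists are what makes such a score explainable: they point to the exact relations (and hence cells) responsible for a low score.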
To systematically test evaluation robustness, we introduce TabReX-Bench, a benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers, enabling stress-testing of table evaluation metrics under controlled shifts.
Overall, TabReX is designed to be trustworthy and explainable: it delivers human-aligned judgments, remains stable under harder perturbations, and enables detailed model/prompt analysis for structured generation systems.
This repository contains the code, data, and analysis pipelines for TabReX, a benchmark and evaluation suite for table robustness and table-to-table similarity metrics. It includes:
- The TabReX benchmark (original tables + 12 perturbations per item)
- Implementations/wrappers for multiple metrics (EM/ROUGE/chrF/BERTScore, BLEURT, HScore, PScore, TabEval, TabXEval/TabScore, QuestEval)
- Human-correlation scripts (Spearman/Kendall/Weighted Kendall/RBO, top-k overlap)
- A table-to-graph pipeline for analysis and alignment
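As a toy illustration of what a table→graph conversion produces (hypothetical; the repository's `rule_md`, `rule_html`, and `llm_html` converters are more sophisticated), a Markdown table row can be read as (row-entity, column-header, cell) triples:

```python
# Toy markdown-table -> triples sketch (hypothetical; NOT the repo's
# rule_md converter). Treats the first column as the row entity.

def md_table_to_triples(md):
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    split = lambda l: [c.strip() for c in l.strip("|").split("|")]
    header = split(lines[0])
    triples = []
    for line in lines[2:]:       # skip the |---|---| separator row
        cells = split(line)
        subject = cells[0]       # first column as the row entity
        for col, cell in zip(header[1:], cells[1:]):
            triples.append((subject, col, cell))
    return triples

md = """
| City   | Country | Population |
|--------|---------|------------|
| Paris  | France  | 2.1M       |
"""
print(md_table_to_triples(md))
# [('Paris', 'Country', 'France'), ('Paris', 'Population', '2.1M')]
```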
## Repository Structure

```
.
├── TabReX/                       # Core pipeline and table→graph converters
│   ├── TabRex.py                 # Main driver for graph generation and scoring
│   ├── __init__.py
│   └── table_to_graph_modules/   # rule_md, rule_html, llm_html converters
├── metrics/                      # Metric wrappers and TabXEval integration
│   ├── em_chrf_rouge_bert.py     # EM/ROUGE-L/chrF/BERTScore
│   ├── bleurt_metric.py          # BLEURT (evaluate/TF)
│   ├── hscore_metric.py          # HScore (format/content similarity)
│   ├── pscore_metric.py          # PScore (LLM-based)
│   ├── tabeval_metric.py         # TabEval (unroll + NLI)
│   ├── TabXEval_metric.py        # TabScore via TabXEval pipeline
│   ├── questeval_metric.py       # QuestEval-style QA F1
│   ├── tabxeval.py               # Local TabXEval score_calc hook
│   └── TabXEval/                 # Prompts, pipelines, and examples
├── correlation/                  # Correlation analyses vs. human ranking
│   ├── correlation.py            # 7-way (GT/easy/medium/hard×2) correlation
│   └── t2t_correlation/          # Flat-12 text2table ranking correlation utilities
│       ├── human_ranking.json
│       ├── perturb_mapping.jsonl
│       └── t2t_correlation.py
├── perturbation/                 # Perturbation planning utilities (optional)
│   └── perturbation_planning.py
├── data/                         # Datasets
│   ├── TabReX_Bench.json         # Default dataset (original + 12 perturbations)
│   └── original_710_tbl.json     # 710 original tables (JSON array)
├── requirements.txt              # Project dependencies
└── README.md                     # This file
```
## Setup

1. Clone and create a virtual environment:

   ```bash
   git clone https://github.com/CoRAL-ASU/TabReX.git
   cd TabReX
   python -m venv .venv && source .venv/bin/activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Optional, if you use the TabXEval extras:

   ```bash
   pip install -r metrics/TabXEval/requirements.txt
   ```

3. Environment variables. Create a `.env` in the repo root and set:

   ```
   OPENAI_API_KEY=...   # required for PScore/TabEval and some TabReX steps
   GEMINI_API_KEY=...   # only if you plan to run Gemini-specific tools
   ```
## Data

- All metric wrappers default to `data/TabReX_Bench.json` unless otherwise noted.
- File shape: a JSON array of items with keys `original` and `perturbation1`..`perturbation12`, whose values are either strings or `{table: ..., metadata: ...}` objects.
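A minimal loader sketch under the file-shape assumption above (illustrative only, not the repository's loading code; the helper names are hypothetical):

```python
import json

# Minimal loader sketch for the dataset shape described above: a JSON array
# of items, each with "original" and "perturbation1".."perturbation12" keys
# whose values are either raw table strings or {"table": ..., "metadata": ...}
# objects. Illustrative only; not the repository's loading code.

def table_text(value):
    """Normalize a field to its table string regardless of wrapping."""
    return value["table"] if isinstance(value, dict) else value

def iter_pairs(items):
    """Yield (original, perturbed, perturbation_name) triples per item."""
    for item in items:
        original = table_text(item["original"])
        for i in range(1, 13):
            key = f"perturbation{i}"
            if key in item:
                yield original, table_text(item[key]), key

# Tiny in-memory sample standing in for data/TabReX_Bench.json:
sample = json.loads('[{"original": "| a | b |", '
                    '"perturbation1": {"table": "| b | a |", "metadata": {}}}]')
print(list(iter_pairs(sample)))
# [('| a | b |', '| b | a |', 'perturbation1')]
```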
## Running TabReX

The core driver is `TabReX/TabRex.py`; it converts tables to knowledge graphs, aligns summary/table triplets, and computes TabReX scores.

Minimal example (single index):

```bash
python TabReX/TabRex.py --index 0 --table-converter rule_md --output TabRex_out.json --output-pkl TabRex_out.pkl
```

Note: the pipeline uses OpenAI APIs for certain steps; ensure `OPENAI_API_KEY` is set.
## Running Metrics

- HScore

  ```bash
  python metrics/hscore_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
  ```

- EM/ROUGE-L/chrF/BERTScore (reads `data/TabReX_Bench.json` by default)

  ```bash
  python metrics/em_chrf_rouge_bert.py
  ```

- BLEURT

  ```bash
  python metrics/bleurt_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
  ```

- PScore (requires `OPENAI_API_KEY`)

  ```bash
  python metrics/pscore_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex --workers 8
  ```

- TabEval (`OPENAI_API_KEY` for unroll, Transformers for NLI)

  ```bash
  python metrics/tabeval_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
  ```

  Env overrides: `TABEVAL_NLI_MODEL` (default: `roberta-large-mnli`), `TABEVAL_UNROLL_WORKERS` (default: 100).

- TabXEval/TabScore

  ```bash
  python metrics/TabXEval_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex --workers 16
  ```

- QuestEval (uses defaults; requires `OPENAI_API_KEY`)

  ```bash
  python metrics/questeval_metric.py
  ```

Metrics write outputs to `--out_dir` (recommended: `results/metrics`).
## Correlation Analyses

- 7-way correlation (EM/HScore/etc. PKLs vs. human ranking heuristics)

  ```bash
  python correlation/correlation.py --in-dir correlation/pkls --out results/correlation/summary.json
  ```

- Flat-12 text2table correlation (model/prompt flattening)

  ```bash
  python correlation/t2t_correlation/t2t_correlation.py \
    --human correlation/t2t_correlation/human_ranking.json \
    --mapping correlation/t2t_correlation/perturb_mapping.jsonl \
    --scores-dir results/metrics \
    --out-dir results/correlation/flat12
  ```
## Notes

- Outputs: prefer a unified results layout, e.g., `results/metrics` and `results/correlation`.
- Models: several scripts have a default model name; you can override it via code/env if needed.
- Heavy dependencies: BLEURT/TensorFlow and Transformers/torch can be large; consider using a GPU environment (optional).
## Citation

If you use this repository in your research, please cite the accompanying paper (TabReX):

```bibtex
@misc{anvekar2025tabrextabularreferenceless,
  title={TabReX : Tabular Referenceless eXplainable Evaluation},
  author={Tejas Anvekar and Juhna Park and Aparna Garimella and Vivek Gupta},
  year={2025},
  eprint={2512.15907},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.15907},
}
```

## License

Please see the LICENSE file if provided. If absent, contact the authors for licensing information.
## Contributing

Contributions are welcome. Please open an issue or a pull request for fixes and improvements.
