TabReX: Referenceless, explainable evaluation for LLM-generated tables via graph alignment + property-driven scoring (with TabReX-Bench)

TabReX: Tabular Referenceless eXplainable Evaluation

arXiv Project Page License: MIT Python 3.9+

This repository contains reference code for TabReX, a referenceless, explainable evaluation framework for LLM-generated tables using graph-based reasoning and rubric-aware scoring.


Overview

(Figure: TabReX architecture overview)

Evaluating tables generated by large language models is hard: many metrics flatten tables into text (losing structure), while reference-based metrics require a gold table, which limits generalization to new schemas and to valid alternative table layouts. TabReX addresses this with a referenceless, property-driven evaluation pipeline that produces interpretable scores and cell-level error traces.

TabReX works in three key steps:

  1. Canonicalize to Knowledge Graphs
    Convert the source text and the generated table into canonical knowledge graphs to preserve structure and factual relations.

  2. LLM-guided Graph Alignment
    Align nodes/edges between the two graphs using an LLM-guided matching procedure to robustly handle paraphrases, schema variation, and re-orderings.

  3. Rubric-aware Scoring + Explanations
    Compute interpretable, property-based scores that quantify structural fidelity and factual correctness, producing fine-grained diagnostics (e.g., which cells/relations are unsupported or mismatched).
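
The three steps above can be sketched as a toy pipeline. This is a minimal illustration only: it represents a "knowledge graph" as a set of (row, column, value) triples and uses exact-match alignment, whereas the actual pipeline uses LLM-guided matching and richer rubric properties. The function names (`table_to_triples`, `align_triples`, `score`) are illustrative, not the repo's API.

```python
# Toy sketch of the TabReX three-step flow: canonicalize -> align -> score.
# Triples play the role of the canonical knowledge graph.

def table_to_triples(table):
    """Canonicalize a {header: [values]} table into (row_id, column, value) triples."""
    triples = set()
    n_rows = len(next(iter(table.values()), []))
    for r in range(n_rows):
        for col in table:
            triples.add((f"row{r}", col, table[col][r]))
    return triples

def align_triples(src, gen):
    """Exact-match alignment; the real pipeline uses LLM-guided matching
    to handle paraphrases, schema variation, and re-orderings."""
    return src & gen

def score(src, gen):
    """Property-style scores: factual precision/recall over aligned triples,
    plus a cell-level trace of unsupported entries."""
    aligned = align_triples(src, gen)
    precision = len(aligned) / len(gen) if gen else 0.0
    recall = len(aligned) / len(src) if src else 0.0
    unsupported = sorted(gen - aligned)  # generated cells with no support in the source
    return {"precision": precision, "recall": recall, "unsupported": unsupported}
```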

To systematically test evaluation robustness, we introduce TabReX-Bench, a benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers, enabling stress-testing of table evaluation metrics under controlled shifts.

Overall, TabReX is designed to be trustworthy and explainable: it delivers human-aligned judgments, remains stable under harder perturbations, and enables detailed model/prompt analysis for structured generation systems.


This repository contains the code, data, and analysis pipelines for TabReX, a benchmark and evaluation suite for table robustness and table-to-table similarity metrics. It includes:

  • The TabReX benchmark (original tables + 12 perturbations per item)
  • Implementations/wrappers for multiple metrics (EM/ROUGE/chrF/BERTScore, BLEURT, HScore, PScore, TabEval, TabXEval/TabScore, QuestEval)
  • Human-correlation scripts (Spearman/Kendall/Weighted Kendall/RBO, top-k overlap)
  • A table-to-graph pipeline for analysis and alignment
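
The human-correlation scripts compare metric rankings against human rankings using rank statistics such as Spearman's rho. As a reference for what is being computed, here is a minimal pure-Python Spearman rho for tie-free score lists (the repo's scripts likely use library implementations; this is only a sketch):

```python
def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length score lists (assumes no ties)."""
    n = len(x)
    rank = lambda values: {v: i for i, v in enumerate(sorted(values))}
    rx, ry = rank(x), rank(y)
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```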

Repository Structure

.
├── TabReX/                      # Core pipeline and table→graph converters
│   ├── TabRex.py               # Main driver for graph generation and scoring
│   ├── __init__.py
│   └── table_to_graph_modules/ # rule_md, rule_html, llm_html converters
├── metrics/                     # Metric wrappers and TabXEval integration
│   ├── em_chrf_rouge_bert.py   # EM/ROUGE-L/chrF/BERTScore
│   ├── bleurt_metric.py        # BLEURT (evaluate/TF)
│   ├── hscore_metric.py        # HScore (format/content similarity)
│   ├── pscore_metric.py        # PScore (LLM-based)
│   ├── tabeval_metric.py       # TabEval (unroll + NLI)
│   ├── TabXEval_metric.py      # TabScore via TabXEval pipeline
│   ├── questeval_metric.py     # QuestEval-style QA F1
│   ├── tabxeval.py             # Local TabXEval score_calc hook
│   └── TabXEval/               # Prompts, pipelines and examples
├── correlation/                 # Correlation analyses vs. human ranking
│   ├── correlation.py          # 7-way (GT/easy/medium/hard×2) correlation
│   └── t2t_correlation/        # Flat-12 text2table ranking correlation utilities
│       ├── human_ranking.json
│       ├── perturb_mapping.jsonl
│       └── t2t_correlation.py
├── perturbation/                # Perturbation planning utilities (optional)
│   └── perturbation_planning.py
├── data/                      # Datasets
│   ├── TabReX_Bench.json       # Default dataset (original + 12 perturbations)
│   └── original_710_tbl.json   # 710 original tables (JSON array)
├── requirements.txt            # Project dependencies
└── README.md                   # This file

Setup

  1. Clone and create a virtual environment
  • git clone https://github.com/CoRAL-ASU/TabReX.git
  • cd TabReX
  • python -m venv .venv && source .venv/bin/activate
  2. Install dependencies
  • pip install -r requirements.txt
  • (Optional, if you use TabXEval extras) pip install -r metrics/TabXEval/requirements.txt
  3. Environment variables
  • Create a .env in the repo root and set:
    • OPENAI_API_KEY=... (required for PScore/TabEval and some TabReX steps)
    • GEMINI_API_KEY=... (only if you plan to run Gemini-specific tools)
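
If you prefer not to add a dotenv dependency, a stdlib-only loader for the KEY=VALUE format above looks like this (illustrative sketch, not the repo's own loader):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments.
    Does not override variables already set in the environment."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
    except FileNotFoundError:
        pass  # no .env present; rely on the ambient environment
```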

Usage

Datasets

  • All metric wrappers default to data/TabReX_Bench.json unless otherwise noted.
  • File shape: a JSON array of records with keys original and perturbation1..perturbation12, where each value is either a raw table string or a {table: ..., metadata: ...} object.
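
Given that shape, a record's table text can be read uniformly with a small helper (a sketch assuming the string-or-object shape described above; `get_table`/`iter_tables` are illustrative names):

```python
def get_table(entry):
    """Return the table text whether the field is a raw string or a {table, metadata} object."""
    return entry["table"] if isinstance(entry, dict) else entry

def iter_tables(record):
    """Yield (name, table_text) for 'original' and 'perturbation1'..'perturbation12'."""
    keys = ["original"] + [f"perturbation{i}" for i in range(1, 13)]
    for k in keys:
        if k in record:
            yield k, get_table(record[k])
```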

TabReX Pipeline (main)

  • The core driver is TabReX/TabRex.py; it converts tables to knowledge graphs, aligns summary/table triplets, and computes TabReX scores.
  • Minimal example (single index):
    • python TabReX/TabRex.py --index 0 --table-converter rule_md --output TabRex_out.json --output-pkl TabRex_out.pkl
  • Notes:
    • The pipeline uses OpenAI APIs for certain steps; ensure OPENAI_API_KEY is set.

Optional Metrics

  • HScore

    • python metrics/hscore_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
  • EM/ROUGE-L/chrF/BERTScore

    • Reads data/TabReX_Bench.json by default.
    • python metrics/em_chrf_rouge_bert.py
  • BLEURT

    • python metrics/bleurt_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
  • PScore (requires OPENAI_API_KEY)

    • python metrics/pscore_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex --workers 8
  • TabEval (OPENAI_API_KEY for unroll, Transformers for NLI)

    • python metrics/tabeval_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex
    • Env overrides:
      • TABEVAL_NLI_MODEL (default: roberta-large-mnli)
      • TABEVAL_UNROLL_WORKERS (default: 100)
  • TabXEval/TabScore

    • python metrics/TabXEval_metric.py --input data/TabReX_Bench.json --out_dir results/metrics --prefix tabrex --workers 16
  • QuestEval

    • python metrics/questeval_metric.py (uses defaults; requires OPENAI_API_KEY)

Metrics write outputs to --out_dir (recommended: results/metrics).
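
To feed these outputs into the correlation step, a simple collector can load everything under the output directory. This sketch assumes one JSON payload per metric file (the actual file naming within results/metrics may differ):

```python
import json
import pathlib

def collect_scores(out_dir="results/metrics"):
    """Load every *.json in the metrics output directory into {metric_name: payload}."""
    scores = {}
    for path in pathlib.Path(out_dir).glob("*.json"):
        with open(path) as f:
            scores[path.stem] = json.load(f)  # file stem used as the metric name
    return scores
```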

Human Correlation

  • 7-way correlation (EM/HScore/etc. PKLs vs. human ranking heuristics)

    • python correlation/correlation.py --in-dir correlation/pkls --out results/correlation/summary.json
  • Flat-12 text2table correlation (model/prompt flattening)

    • python correlation/t2t_correlation/t2t_correlation.py \
        --human correlation/t2t_correlation/human_ranking.json \
        --mapping correlation/t2t_correlation/perturb_mapping.jsonl \
        --scores-dir results/metrics \
        --out-dir results/correlation/flat12

Notes & Recommendations

  • Outputs: Prefer a unified results layout, e.g., results/metrics and results/correlation.
  • Models: Several scripts have a default model name; you can override via code/env if needed.
  • Heavy deps: BLEURT (TensorFlow) and Transformers/torch are large installs; a GPU environment is optional but recommended for the neural metrics.

Citation

If you use this repository in your research, please cite the accompanying paper (TabReX).

@misc{anvekar2025tabrextabularreferenceless,
      title={TabReX : Tabular Referenceless eXplainable Evaluation}, 
      author={Tejas Anvekar and Juhna Park and Aparna Garimella and Vivek Gupta},
      year={2025},
      eprint={2512.15907},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.15907}, 
}

License

Please see the LICENSE file if provided. If absent, contact the authors for licensing information.

Contributing

Contributions are welcome. Please open an issue or a pull request for fixes and improvements.
