This project provides a complete pipeline for analyzing and comparing peer reviews generated by large language models (LLMs) with human-written reviews. It includes modules for review generation, semantic similarity computation, knowledge graph construction, and structural analysis.
- The required environment and dependencies are listed in `requirements.txt`. To replicate the environment, run:

  ```bash
  pip install -r requirements.txt
  ```
The `Code/` directory contains all scripts used for data collection, filtering, and analysis. The full pipeline for preparing consistent and labeled paper–review pairs includes:

- **Paper fetching and rating extraction** (`Paper_fetcher.py`)
  Retrieves all submissions and corresponding review ratings from OpenReview for a given conference and year (see the fetching sketch after this list).
- **Consistency filtering using rating variance** (`KDE_filter.py`)
  Applies kernel density estimation (KDE) to identify papers with low inter-review disagreement, i.e. a low standard deviation in ratings (see the filtering sketch after this list).
- **Paper selection and quality labeling** (`Paper_select.py`)
  Selects consistent papers and categorizes them into `good`, `borderline`, or `bad` based on rating quantiles.
- **Full paper download** (`Paper_download.py`)
  Downloads the PDF files for all selected papers in each quality group.
- **Review fetching** (`Paper_review.py`, `Paper_review_process.py`)
  Extracts the full set of human-written reviews associated with the selected papers.
- **Semantic similarity analysis** (`similarity.py`; see the similarity sketch after this list)
  - Loads pre-segmented IMRaD sections (abstract, introduction, related work, method, results, conclusion) encoded by BGE-M3.
  - Computes cosine similarity between embeddings of each review component (summary, strengths, weaknesses, questions) and each paper section.
  - Saves per-paper similarity scores (real vs. LLM reviews) as JSON files under `../Data/<Conference>/<Year>/similarity_results/`.
- **Knowledge graph construction and metrics** (`knowledge_graph_construct.py`; see the graph-metrics sketch after this list)
  - Builds a directed graph for each review segment using PL-Marker predictions (entities + relations).
  - Computes structural metrics (node count, edge count, average degree, label entropy) on each graph.
  - Aligns real and LLM question nodes by filtering them to matching counts, then saves all graph metrics to CSV under `../Data/Knowledge_Graph/<Conference>/<Year>/<Category>/graph_metrics_clean.csv`.
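The fetching step can be reproduced roughly as follows. This is a minimal sketch assuming the public OpenReview REST API (v1) and an ICLR 2023-style invitation string; `Paper_fetcher.py` may instead use the official `openreview-py` client and different invitations.

```python
# Minimal sketch of paper fetching and rating extraction. The invitation
# string is an illustrative assumption (ICLR 2023, API v1).
import requests

API = "https://api.openreview.net/notes"

def fetch_all_notes(invitation):
    """Page through every note matching the given invitation."""
    notes, offset = [], 0
    while True:
        batch = requests.get(
            API,
            params={"invitation": invitation, "offset": offset, "limit": 1000},
            timeout=30,
        ).json()["notes"]
        if not batch:
            return notes
        notes.extend(batch)
        offset += len(batch)

def extract_rating(review_note):
    # Ratings arrive as strings such as "6: marginally above the acceptance
    # threshold"; keep only the leading integer.
    return int(review_note["content"]["rating"].split(":")[0])

submissions = fetch_all_notes("ICLR.cc/2023/Conference/-/Blind_Submission")
```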
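The consistency filter admits several threshold rules; the sketch below keeps papers whose rating standard deviation lies at or below the density peak of the std distribution. That cutoff rule is an illustrative assumption about `KDE_filter.py`, not a description of it.

```python
# Minimal sketch of KDE-based consistency filtering over rating variance.
import numpy as np
from scipy.stats import gaussian_kde

def filter_consistent(ratings_per_paper, n_grid=200):
    """ratings_per_paper: dict mapping paper id -> list of review ratings."""
    stds = {pid: float(np.std(r)) for pid, r in ratings_per_paper.items()}
    # Fit a KDE over the distribution of per-paper rating stds.
    kde = gaussian_kde(list(stds.values()))
    grid = np.linspace(0.0, max(stds.values()), n_grid)
    cutoff = grid[np.argmax(kde(grid))]  # std value at the density peak
    # Keep papers with low inter-review disagreement.
    return [pid for pid, s in stds.items() if s <= cutoff]
```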
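The similarity computation reduces to embedding each review component and paper section and taking cosine similarity. Loading BGE-M3 through `sentence-transformers` is an assumption for this sketch; `similarity.py` may use the FlagEmbedding package instead.

```python
# Minimal sketch of review-component vs. paper-section cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def section_similarities(review_components, paper_sections):
    """Both arguments map name -> text. Returns cosine similarity between
    every review component and every paper section."""
    r_names, s_names = list(review_components), list(paper_sections)
    # With normalize_embeddings=True the dot product equals cosine similarity.
    r_emb = model.encode([review_components[n] for n in r_names],
                         normalize_embeddings=True)
    s_emb = model.encode([paper_sections[n] for n in s_names],
                         normalize_embeddings=True)
    sims = r_emb @ s_emb.T
    return {r: {s: float(sims[i, j]) for j, s in enumerate(s_names)}
            for i, r in enumerate(r_names)}
```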
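The graph metrics can be computed with `networkx` as sketched below. The exact shape of the PL-Marker output tuples and the choice to take entropy over node (entity-type) labels are assumptions about `knowledge_graph_construct.py`.

```python
# Minimal sketch of graph construction and structural metrics.
import math
from collections import Counter
import networkx as nx

def build_graph(entities, relations):
    """entities: iterable of (span_text, entity_type);
    relations: iterable of (head_span, relation_type, tail_span)."""
    g = nx.DiGraph()
    for span, etype in entities:
        g.add_node(span, label=etype)
    for head, rtype, tail in relations:
        g.add_edge(head, tail, label=rtype)
    return g

def graph_metrics(g):
    counts = Counter(nx.get_node_attributes(g, "label").values())
    total = sum(counts.values())
    # Shannon entropy of the node-label distribution (assumes a non-empty graph).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "avg_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        "label_entropy": entropy,
    }
```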
All processed data, including real human reviews, LLM-generated reviews, and the inputs used for semantic similarity analysis and knowledge graph construction, are available via the following link:
🔗 Download Data (Google Drive)
We provide standardized prompt templates used for LLM-based review generation, aligned with official review rubrics from ICLR and NeurIPS. These templates help ensure consistency in review outputs across different models and conferences.
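For illustration only, a rubric-aligned template has roughly the shape below; the field names are hypothetical, and the templates shipped in this repository are the authoritative versions.

```python
# Hypothetical shape of a rubric-aligned review-generation prompt template.
REVIEW_PROMPT = """You are a reviewer for {conference} {year}. Read the paper
below and write a review following the official rubric, with the sections:
Summary, Strengths, Weaknesses, Questions, and a Rating on the {scale} scale.

Paper:
{paper_text}
"""
```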

