PluRule is a multilingual, multimodal benchmark for Reddit rule-violation detection: 13,371 discussion instances drawn from the Pushshift archives, each pairing a rule-violating thread with a compliant thread from the same submission, labeled against the community's own rules.
This repository contains the full construction pipeline, the scripts used to hydrate the released dataset from IDs, and the evaluation harness used in the paper.
Paper: PLURULE: A Benchmark for Moderating Pluralistic Communities on Social Media
A PluRule example: GPT-5.2 (high reasoning) is given the target comment with full context — subreddit description, rules, submission, and discussion thread — and asked to pick which rule, if any, was violated. The correct answer is (e); GPT-5.2 picks (c).
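The multiple-choice setup in this example can be sketched as follows. The prompt wording and option labels are illustrative only, not the exact template used by the eval harness (see `eval/README.md` for that):

```python
import string

def build_prompt(comment: str, rules: list[str]) -> str:
    """Assemble a rule-violation multiple-choice prompt (illustrative wording)."""
    # One lettered option per community rule...
    options = [f"({letter}) {rule}"
               for letter, rule in zip(string.ascii_lowercase, rules)]
    # ...plus a final "no violation" option, mirroring the benchmark's baseline choice.
    options.append(f"({string.ascii_lowercase[len(rules)]}) No rules broken")
    return (
        "Which rule, if any, does the following comment violate?\n\n"
        f"Comment: {comment}\n\n" + "\n".join(options)
    )

prompt = build_prompt("Buy my course!", ["Be civil", "No self-promotion"])
print(prompt.splitlines()[-1])  # (c) No rules broken
```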
| Split | Instances | Comments | Images | Subreddits / Clusters | Rules / Clusters | Languages |
|---|---|---|---|---|---|---|
| Train | 9,155 | 51,968 | 2,077 | 861 / 25 | 1,336 / 27 | 9 |
| Val | 1,382 | 7,631 | 376 | 537 / 25 | 586 / 27 | 9 |
| Test | 2,834 | 13,076 | 1,190 | 1,989 / 25 | 2,039 / 27 | 9 |
| Total | 13,371 | 72,675 | 3,643 | 1,989 / 25 | 2,885 / 27 | 9 |
Every instance contains (a) a root-to-leaf discussion thread where a moderator cited a rule on the leaf comment, (b) a compliant sibling thread from the same submission, (c) the submission itself with any images, and (d) the subreddit's full rule set.
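In code, an instance might look like the following. The field names here are hypothetical, chosen only to mirror the four components above; consult the dataset card on HuggingFace for the real schema:

```python
# Illustrative shape of one PluRule instance; all field names are
# hypothetical -- the released dataset card documents the real schema.
instance = {
    "subreddit": "r/example",
    "rules": [
        {"short_name": "Be civil", "description": "No personal attacks."},
    ],
    "submission": {"id": "t3_abc", "title": "...", "image_ids": []},
    "violating_thread": ["t1_a1", "t1_a2"],  # root-to-leaf comment IDs
    "compliant_thread": ["t1_b1"],           # sibling thread, same submission
    "violated_rule_index": 0,                # index into `rules`
}

# The leaf of the violating thread is the comment the moderator acted on.
leaf = instance["violating_thread"][-1]
print(leaf)  # t1_a2
```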
2D UMAP of (a) 1,989 subreddits and (b) 2,885 rules, colored by HDBSCAN cluster. Grey points are unclustered ("other"). Right: distributions of the 13,371 instances across (c) 25 subreddit clusters and (d) 27 rule clusters.
Accuracy (%) across models and context levels on the test set. Numbers in parentheses show the delta from the previous row. Bold is best per model. 95% CIs are within ±1.3% everywhere. The "No rules broken" baseline is 50%.
| Context | 4B Inst. | 4B Think. | 8B Inst. | 8B Think. | 30B Inst. | 30B Think. | GPT-5.2 Low | GPT-5.2 High |
|---|---|---|---|---|---|---|---|---|
| Comment only | **49.6** | 37.4 | **51.0** | 40.3 | 50.2 | 46.1 | 54.1 | 55.0 |
| + Discussion | 49.2 (−0.4) | 39.8 (+2.4) | 50.7 (−0.3) | 43.9 (+3.6) | 51.0 (+0.8) | 48.2 (+2.1) | 55.3 (+1.2) | 56.2 (+1.2) |
| + Submission | 48.3 (−0.9) | 44.9 (+5.1) | 49.2 (−1.5) | **47.2** (+3.3) | 51.1 (+0.1) | 49.1 (+0.9) | 56.8 (+1.5) | 57.3 (+1.1) |
| + User | 48.9 (+0.6) | **45.0** (+0.1) | 50.0 (+0.8) | 46.7 (−0.5) | **52.4** (+1.3) | 49.4 (+0.3) | **57.4** (+0.6) | **57.7** (+0.4) |
| + Images | 48.4 (−0.5) | **45.0** (+0.0) | 49.8 (−0.2) | 44.9 (−1.8) | 52.3 (−0.1) | **49.5** (+0.1) | **57.4** (+0.0) | 57.6 (−0.1) |
Even the best model (GPT-5.2 high reasoning with full context) only reaches 57.7% — less than 8 points above the trivial baseline. Adding context (discussion thread, submission, user identifiers, images) helps by at most 2–3 points. Open-weight models (Qwen3-VL-Instruct / -Thinking) don't beat baseline at all.
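The parenthesized numbers in the table are plain row-over-row differences within a column. For example, for the GPT-5.2 High column:

```python
# Accuracy at each context level for the GPT-5.2 High column (from the table).
levels = [55.0, 56.2, 57.3, 57.7, 57.6]

# Each table delta is the gain from adding one more context component.
deltas = [round(b - a, 1) for a, b in zip(levels, levels[1:])]
print(deltas)  # [1.2, 1.1, 0.4, -0.1]
```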
Accuracy by (a) subreddit cluster and (b) rule cluster with 95% CI. Dashed line is the 50% baseline. Universal violations (civility, self-promotion) are solved well; context-dependent rules (low-effort, evidence-based, relevance) fall below baseline.
Start here if you want to evaluate a model on PluRule.
- Grab the three dehydrated split files from [huggingface.co/datasets/osome-iu/PluRule](https://huggingface.co/datasets/osome-iu/PluRule) and place them under `./data/`.
- Follow `hydrate/README.md` to fill in comments, submissions, and media from the Pushshift archives (~a few hours, no GPU).
- Run your model through `eval/README.md` — supports vLLM (Qwen-VL, LLaVA, Llama-Vision) and API models (Claude, GPT-4V) out of the box.
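Conceptually, hydration joins the released comment IDs against bodies recovered from your local Pushshift dump. A minimal sketch, with hypothetical field names (the real logic and schema live in `hydrate/README.md`):

```python
def hydrate_instance(instance: dict, comment_lookup: dict) -> dict:
    """Fill in comment bodies for a dehydrated instance.

    `instance` and `comment_lookup` shapes are illustrative, not the
    released schema -- see hydrate/README.md for the real field names.
    """
    hydrated = dict(instance)
    hydrated["comments"] = [comment_lookup[cid] for cid in instance["comment_ids"]]
    return hydrated

# Toy example: two comments recovered from a local Pushshift dump.
lookup = {"c1": {"id": "c1", "body": "hello"},
          "c2": {"id": "c2", "body": "world"}}
inst = {"instance_id": "x", "comment_ids": ["c1", "c2"]}
print(hydrate_instance(inst, lookup)["comments"][1]["body"])  # world
```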
Start here if you want to reproduce the dataset end to end, tweak thresholds, or extend the pipeline.
Follow pipeline/README.md. Budget 1–2 days and
multiple GPUs: embedding matcher (Qwen3-Embedding-8B), LLM judge
(Qwen3-30B-A3B-Instruct), and cluster labeler (Qwen3-30B-A3B-Thinking) are
all run locally via vLLM.
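The embedding matcher pairs a moderator's removal comment with the rule it cites by nearest-neighbor search in embedding space. A toy sketch of that idea, with 3-d vectors standing in for Qwen3-Embedding-8B outputs (the actual pipeline code is in `pipeline/`):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_rule(comment_vec: list[float], rule_vecs: list[list[float]]) -> int:
    """Return the index of the rule embedding closest to the mod-comment embedding."""
    sims = [cosine(comment_vec, rv) for rv in rule_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings standing in for real Qwen3-Embedding-8B vectors.
rules = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(match_rule([0.1, 0.9, 0.0], rules))  # 1
```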
See eval/human_eval/ for the Google Forms
annotation protocol used in Section 5.4 of the paper (96% overall
agreement with the pipeline's labels on a 100-instance audit).
```bash
git clone https://github.com/osome-iu/PluRule.git
cd PluRule

# Pick the env that matches your goal:
conda env create -f environment-hydrate.yml   # minimal, hydration only (no GPU)
conda env create -f environment-pipeline.yml  # end-to-end reconstruction (GPUs)
conda env create -f environment-eval.yml      # benchmark evaluation (GPU or API keys)
```

For API-model evaluation, copy `credentials/.env.template` to `credentials/.env` and fill in your `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`.
```
PluRule/
├── hydrate/                  # 3 scripts to reconstitute the released dataset
├── pipeline/                 # end-to-end reconstruction from Pushshift (paper §5)
├── eval/                     # benchmark evaluation harness
│   └── human_eval/           # human annotation reproduction
├── utils/                    # shared helpers (zst I/O, Pushshift torrent, media, …)
├── config.py                 # base paths + thresholds (edit before running)
├── credentials/              # API key templates (.env, Reddit, Google)
├── environment-hydrate.yml   # hydration-only conda env
├── environment-pipeline.yml  # reconstruction conda env
└── environment-eval.yml      # evaluation conda env
```
```bibtex
@misc{plurule2025,
  title  = {PLURULE: A Benchmark for Moderating Pluralistic Communities
            on Social Media},
  author = {Kachwala, Zoher and Truong, Bao Tran and Muralidharan, Rasika and
            Kwak, Haewoon and An, Jisun and Menczer, Filippo},
  year   = {2025},
  note   = {arXiv preprint},
}
```

Code in this repository is released under the MIT License — see `LICENSE`. The PluRule dataset itself is distributed separately on HuggingFace under its own license; the underlying moderator comments and submissions come from the publicly archived Pushshift Reddit corpus and are bound by Reddit's terms of service.


