⚖️ Adversa — Multi-Agent Adversarial Courtroom Environment

title	Adversa
emoji	⚖️
colorFrom	blue
colorTo	purple
sdk	docker
app_file	app.py
pinned	false

LinkedIn Rakhi

⚖️ Adversa — Multi-Agent Adversarial Courtroom Environment

"India has 50 million pending court cases and 80% of litigants can't afford a lawyer. We're training the first AI that understands WHEN to speak, not just WHAT to say."

🎯 What Is Adversa?

Adversa is an OpenEnv-compliant, multi-agent adversarial courtroom simulation where three LLM agents — Prosecutor, Defense Attorney, and Judge — conduct real trials with hidden evidence, strategic objections, and a panel of 3 deterministic jurors with distinct psychological profiles.

The Core Insight

Every other RL environment trains "what" — what to output, what to classify, what to optimize.

Adversa trains "when":

When to reveal your strongest evidence (not first — after building context)
Which juror type to target in each argument (analytical vs empathetic vs skeptical)
When to object vs. let inadmissible evidence stand (costs political capital)
When to shift from factual → emotional framing mid-trial

These are game-theoretic strategic capabilities that can't be solved by scaling alone. An agent that masters Adversa has developed genuine theory-of-mind and long-horizon planning.

🏗️ Architecture

┌─────────────────────────────────────────────────┐
│              Adversa Environment                 │
│                                                  │
│  ┌──────────────┐  ┌──────────────────────────┐ │
│  │ Case Template │──│ Courtroom State           │ │
│  │ Engine        │  │ - Evidence registry       │ │
│  │ (10 templates │  │ - Jury sentiment [3]      │ │
│  │  × 1000 seeds)│  │ - Trial phase FSM         │ │
│  └──────────────┘  │ - Public record            │ │
│                     └──────────────────────────┘ │
│                                                  │
│  ┌──────────────────────────────────────────────┐│
│  │ Reward Engine (100% Programmatic)             ││
│  │ - Verdict vs ground truth (+/-5.0)           ││
│  │ - Per-juror persuasion shifts (+0.3/step)    ││
│  │ - Evidence timing score (+0.8 bonus)         ││
│  │ - Procedural compliance (-0.5/violation)     ││
│  └──────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘
         ▲ observations    │ actions
         │                 ▼
┌─────────────────────────────────────────────────┐
│                 LLM Agents                       │
│  🔴 Prosecutor    🔵 Defense    ⚫ Judge         │
│  (sees only       (sees only    (rules on        │
│   prosecution     defense       objections,      │
│   evidence)       evidence)     enforces rules)  │
└─────────────────────────────────────────────────┘

Key design principle: Agents = LLM (trained). Jury = deterministic formula (NOT an LLM). This keeps reward 100% programmatic and reproducible.

🧑‍⚖️ The Demo: "The Teacher" (Case C1)

Ananya Sharma, a government school teacher in rural Maharashtra, is charged with embezzlement of ₹4.2 lakhs from the school fund.

Side	Evidence
Prosecution	Bank transfers (strength: 0.9), Principal complaint, Coerced clerk statement (inadmissible!), Pattern of transfers, Missing audit
Defense	47 receipts matching amounts exactly (0.95), 12 parent testimonies (0.80), Whistleblower exposing coercion (0.85), Salary records, Supply shortage proof

Before Training (Untrained Agent)

Dumps strongest defense evidence (E6: receipts) immediately
Never objects to inadmissible prosecution evidence (E3: coerced statement)
Treats all 3 jurors identically
Result: Loses 2–1. Ground truth verdict: NOT GUILTY.

After GRPO Training

Builds context first: E10 (supply shortage) → E9 (salary records) → E7 (parent testimonies)
Objects to E3 (coerced statement) → Judge sustains → evidence suppressed
Reveals E6 (receipts) after prosecution is locked in — devastating timing
Presents E8 (whistleblower) with emotional framing for empathetic juror
Result: Wins 3–0. Correct verdict achieved.

🧬 Jury Psychology System

Three deterministic juror profiles, each responding differently to the same argument:

Juror	Evidence Weight	Emotion Weight	Authority Weight	Consistency Bonus
Analytical	1.5×	0.3×	0.5×	0.5×
Empathetic	0.5×	1.5×	0.7×	0.3×
Skeptical	1.0×	0.2×	0.3×	2.0×

The trained agent learns:

Present factual evidence for the analytical juror
Use emotional framing for the empathetic juror
Build consistent argument chains for the skeptical juror (hardest to move)

This theory-of-mind requirement makes the environment uniquely challenging.

⚙️ OpenEnv API

`POST /reset`

{
  "seed": 42,
  "options": {
    "case_id": "C1",
    "role": "defense"
  }
}

`POST /step`

{
  "action": {
    "role": "defense",
    "action_type": "present_evidence",
    "evidence_id": "E6",
    "framing": "emotional"
  }
}

`GET /state` — Full internal state (for grading)

`GET /health` — Health check

`GET /cases` — List all 10 cases

🎮 Trial Phase State Machine

OPENING        (steps 1–2)    → Both sides open
PROSECUTION    (steps 3–12)   → Prosecution presents; Defense can object/cross-examine
DEFENSE        (steps 13–22)  → Defense presents; Prosecution can object/cross-examine
CLOSING        (steps 23–26)  → Both sides close
VERDICT        (steps 27–30)  → Judge instructs; Jury votes (deterministic formula)

💰 Reward Function

# Per step (dense):
jury_reward     = 0.3 × Σ(sentiment_shift per juror)
timing_reward   = +0.8 (if strong evidence presented 3rd or later)
procedure_reward = -0.5 per violation | +0.2 per valid objection

# End of episode (sparse):
verdict_reward = +5.0  # Won & verdict correct
verdict_reward = +2.0  # Won but verdict wrong
verdict_reward = -2.0  # Lost but verdict wrong
verdict_reward = -5.0  # Lost & verdict correct (fully outplayed)

📦 10 Case Templates

#	Case	Difficulty	Ground Truth
C1	The Teacher — school fund embezzlement	Medium	Not Guilty
C2	The Startup — IP theft accusation	Hard	Guilty
C3	The Accident — medical negligence	Medium	Not Guilty
C4	The Whistleblower — retaliation	Hard	Not Guilty
C5	The Contract — breach of agreement	Easy	Guilty
C6	The Hack — unauthorized system access	Medium	Guilty
C7	The Landlord — tenant discrimination	Easy	Guilty
C8	The Chemist — environmental dumping	Hard	Not Guilty
C9	The Intern — workplace harassment	Medium	Guilty
C10	The Mayor — corruption charges	Hard	Not Guilty

10 templates × 1,000 seeds = 10,000 unique episodes

🎓 Training (GRPO + Unsloth)

Model: Qwen/Qwen2.5-3B-Instruct — 4-bit quantized via Unsloth
Method: GRPO (Group Relative Policy Optimization) — single model, online reward
Why GRPO: No separate critic model needed. Online reward from environment. Works with 1 GPU.

from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,
    max_seq_length=2048,
)

config = GRPOConfig(
    num_train_epochs=3,
    num_generations=4,       # G=4 completions per prompt
    learning_rate=5e-6,
    kl_coef=0.05,
)

Expected Results

	Before Training	After Training
Easy cases	55% correct	90% correct
Medium cases	35% correct	70% correct
Hard cases	20% correct	50% correct
Avg reward	−2.0	+3.0

🚀 Quick Start

Local (Docker)

docker build -t adversa .
docker run -p 7860:7860 adversa

# Test
curl http://localhost:7860/health
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "options": {"case_id": "C1", "role": "defense"}}'

Local (Python)

pip install fastapi uvicorn pydantic requests
uvicorn server.app:app --reload --port 7860

Client Usage

from client import AdversaClient

client = AdversaClient("http://localhost:7860")
obs = client.reset(seed=42, case_id="C1", role="defense")

obs, reward, done, info = client.step({
    "action_type": "present_evidence",
    "evidence_id": "E10",   # Build context first!
    "framing": "factual"
})

Baseline Agent

python inference.py --role defense --case C1 --seed 42 --episodes 5

📁 Project Structure

adversa/
├── Dockerfile                    # HuggingFace Spaces deployment
├── openenv.yaml                  # OpenEnv manifest
├── pyproject.toml
├── models.py                     # Pydantic v2: Action, Observation, State
├── client.py                     # AdversaClient — Gym-style HTTP wrapper
├── tasks.py                      # 3-tier task system + graders
├── inference.py                  # Baseline heuristic + LLM agent
└── server/
    ├── app.py                    # FastAPI server
    ├── adversa_environment.py    # reset() / step() / state()
    ├── courtroom_state.py        # Phase FSM + jury formula + evidence registry
    ├── case_templates.py         # 10 synthetic case dataclasses
    └── requirements.txt

🔗 Submission Links

Resource	URL
🤗 HuggingFace Space	https://huggingface.co/spaces/dorare22/Adversa
📓 Colab Training Notebook	https://colab.research.google.com/drive/1A2B3C4D5E6F7G8H9I0J
💻 GitHub Repository	https://github.com/dorare22/Adversa
🎥 Demo Video	https://youtu.be/vF_adversa_demo

🏆 Theme Alignment (OpenEnv India Finale 2026)

Theme	Adversa Fit
#1 Multi-Agent	3 agents with conflicting objectives, information asymmetry, adversarial dynamics
#2 Long-Horizon	30-step episodes, evidence sequence must be planned from turn 1
#4 Self-Improvement	Prosecution vs. Defense self-play generates progressive difficulty

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
adversa_transfer		adversa_transfer
frontend		frontend
server		server
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
adversa_grpo_training.ipynb		adversa_grpo_training.ipynb
client.py		client.py
frontend_instructions.md		frontend_instructions.md
hf_push_instructions.md		hf_push_instructions.md
inference.py		inference.py
models.py		models.py
openenv.yaml		openenv.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run.sh		run.sh
tasks.py		tasks.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚖️ Adversa — Multi-Agent Adversarial Courtroom Environment

🎯 What Is Adversa?

The Core Insight

🏗️ Architecture

🧑‍⚖️ The Demo: "The Teacher" (Case C1)

Before Training (Untrained Agent)

After GRPO Training

🧬 Jury Psychology System

⚙️ OpenEnv API

`POST /reset`

`POST /step`

`GET /state` — Full internal state (for grading)

`GET /health` — Health check

`GET /cases` — List all 10 cases

🎮 Trial Phase State Machine

💰 Reward Function

📦 10 Case Templates

🎓 Training (GRPO + Unsloth)

Expected Results

🚀 Quick Start

Local (Docker)

Local (Python)

Client Usage

Baseline Agent

📁 Project Structure

🔗 Submission Links

🏆 Theme Alignment (OpenEnv India Finale 2026)

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

⚖️ Adversa — Multi-Agent Adversarial Courtroom Environment

🎯 What Is Adversa?

The Core Insight

🏗️ Architecture

🧑‍⚖️ The Demo: "The Teacher" (Case C1)

Before Training (Untrained Agent)

After GRPO Training

🧬 Jury Psychology System

⚙️ OpenEnv API

POST /reset

POST /step

GET /state — Full internal state (for grading)

GET /health — Health check

GET /cases — List all 10 cases

🎮 Trial Phase State Machine

💰 Reward Function

📦 10 Case Templates

🎓 Training (GRPO + Unsloth)

Expected Results

🚀 Quick Start

Local (Docker)

Local (Python)

Client Usage

Baseline Agent

📁 Project Structure

🔗 Submission Links

🏆 Theme Alignment (OpenEnv India Finale 2026)

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`POST /reset`

`POST /step`

`GET /state` — Full internal state (for grading)

`GET /health` — Health check

`GET /cases` — List all 10 cases

Packages