Research-grade jailbreaking & hardening platform for Frontier LLMs.
⚠️ Disclaimer: This tool is for authorized security testing and educational purposes only. The authors are not responsible for misuse.
DO NOT DEPLOY THIS APPLICATION TO A PUBLIC URL. This application exposes powerful LLM capabilities and API endpoints that can be used to generate harmful content or incur significant costs. It is intended for local use only or deployment within a strictly controlled, authenticated private network.
- No built-in authentication: The API routes are unprotected by default.
- Cost risk: Malicious actors could trigger expensive fine-tuning or batch inference jobs.
- Safety risk: The tool is designed to bypass safety filters; public exposure allows anyone to generate harmful content.
Frontier models (DeepSeek, Llama 3, GPT-4) claim to be "aligned." They aren't.
Most red-teaming tools use toy prompts ("DAN mode") that are easily patched. JailbreakLLM implements 39 advanced attack vectors from top academic papers (USENIX '25, NeurIPS '24) to expose the real cracks in your model's safety.
We implement the most effective attacks from recent literature:
- Knowledge Decomposition (KDA): Breaks harmful tasks into benign sub-steps (96% Success).
- Dual Intention Escape: Hides harm in benign "engineering briefs."
- Chaos Chain: Iterative de-obfuscation that breaks reasoning models.
- System Policy Override: Fakes "admin mode" privileges.
We use a StrongREJECT-aligned evaluation system. Our dedicated LLM judge detects partial leaks and assigns granular risk scores (0-100) based on research benchmarks, ensuring that "successful" jailbreaks actually contain harmful information, not just refusals or incoherent text.
Single-shot testing misses 40% of vulnerabilities. We allow customizable attack volume (default 10, up to 50 attempts) with research-optimized parameters (temp=0.2, top_p=0.95) to catch stochastic failures while managing API costs.
👉 See Parameter Optimization Guide
Don't just break it. Fix it.
- Generate Dataset: Converts successful jailbreaks into synthetic refusal samples.
- Fine-Tune: One-click export to LoRA/SFT pipelines (via Nebius Token Factory).
- Verify: Re-test the hardened model to ensure the vulnerability is closed.
git clone https://github.com/demianarc/jailbreakllm.git
cd jailbreakllm
npm installCreate a .env.local file:
# Required for inference, fine-tuning, and judging
NEBIUS_API_KEY=your_key_here
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/
# Optional: Basic Auth for public deployment (Recommended)
BASIC_AUTH_USER=admin
BASIC_AUTH_PASSWORD=your_secure_passwordnpm run devOpen http://localhost:3000 and navigate to the Red Team Arsenal.
Our approach is grounded in "System 2 Thinking" and "Tree of Thoughts" methodologies applied to security testing:
- Deep Analysis: We don't just spam prompts; we analyze the model's cognitive architecture (Reasoning vs Chat).
- Iterative Refinement: Attacks like Chaos Chain and Reason Step-by-Step force the model to iterate on its own output, bypassing "System 1" safety filters.
- Comprehensive Coverage: From simple fuzzing to complex persona hijacking, we test the entire surface area.
We welcome contributions! If you've found a new jailbreak vector:
- Add the definition to
src/components/workflow/red-team-arsenal.tsx. - Implement the prompt generator in
src/lib/pipeline.ts. - Submit a PR!
Please read CONTRIBUTING.md for details.
MIT License. Hack responsibly.