From 7752efd0b6d99eca32a29b8ae0f9f04b7a6f8c63 Mon Sep 17 00:00:00 2001
From: Deepak Chander <172876867+DeepakChander@users.noreply.github.com>
Date: Sun, 17 May 2026 13:51:33 +0530
Subject: [PATCH] docs(ai): add AI Co-pilot system design (agentic RAG for CA)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adapts the Agentic RAG architecture (Coordinator → Retrieval → Rerank →
Context → Generation → Validation → Export) to a CA firm's needs:

- Per-firm Qdrant collection (structural tenant isolation)
- Two-collection retrieval: firm corpus + shared regulatory corpus
- New Compliance Critic agent (validates statute, math, UDIN status)
- Citation-must-resolve-to-primary-source rule
- Stakes-tiered validation (Low / Medium / High / Blocked)
- UDIN + DSC gates (AI never marks output "final")
- India-resident inference only (Qwen 2.5 via vLLM on Indian GPU)
- Mock providers kept for zero-GPU local dev
- 6 use cases mapped: Doc AI, Ask-your-books, Notice intelligence,
  Certificate draft, WP draft, Article training Q&A
- Phase 3A–3F sub-phasing plan with exit criteria

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/README.md                                |   1 +
 docs/architecture/ai-copilot-system-design.md | 539 ++++++++++++++++++
 2 files changed, 540 insertions(+)
 create mode 100644 docs/architecture/ai-copilot-system-design.md

diff --git a/docs/README.md b/docs/README.md
index ce30733..cceae65 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -49,6 +49,7 @@ If you have 10 minutes, read these in order:
 - [Data model](./architecture/data-model.md)
 - [Security and data residency](./architecture/security-and-data-residency.md)
 - [Deployment](./architecture/deployment.md)
+- [AI Co-pilot — system design](./architecture/ai-copilot-system-design.md)
 
 ### Integrations
 - [Tally integration](./integrations/tally-integration.md)
diff --git a/docs/architecture/ai-copilot-system-design.md b/docs/architecture/ai-copilot-system-design.md
new file mode 100644
index 0000000..05f5908
--- /dev/null
+++ b/docs/architecture/ai-copilot-system-design.md
@@ -0,0 +1,539 @@
+# AI Co-pilot — System Design
+
+This document is the concrete architectural blueprint for the **Phase 3 AI Co-pilot** module. It adapts the proven **Agentic RAG** pattern (Coordinator → Retrieval → Rerank → Context → Generation → Validation → Export) to the realities of an Indian CA firm: regulatory citations, ICAI rules, UDIN/DSC, DPDP Act, India-only inference, and strict per-firm + per-client tenant isolation.
+
+Read alongside:
+- [`../features/ai-copilot.md`](../features/ai-copilot.md) — product-facing capabilities
+- [`tech-stack.md`](./tech-stack.md) — overall stack
+- [`security-and-data-residency.md`](./security-and-data-residency.md) — encryption, residency
+- [`../compliance/dpdp-act.md`](../compliance/dpdp-act.md) — privacy obligations
+- [`../compliance/icai-regulatory.md`](../compliance/icai-regulatory.md) — UDIN / certificate rules
+
+---
+
+## 1. Why an agentic system, not a single LLM call
+
+A CA firm's questions decompose into multiple steps that benefit from **explicit planning, retrieval, validation, and repair**:
+
+> "Show me all RCM transactions in Q1 FY26 that haven't been offset against ITC, and check if any are at risk of being denied under Sec 17(5)."
+
+This is:
+- A retrieval task (find vouchers + ledger rows)
+- A calculation task (offset math)
+- A legal classification task (Sec 17(5) check)
+- A grounding task (cite each claim to a primary source)
+- A validation task (numbers must match the books exactly)
+
+A single LLM call can hallucinate at any of these steps, and gives us no way to tell which one failed. An agent graph lets us **isolate the failure mode, validate each step, and repair before showing the user.**
+
+---
+
+## 2. High-level architecture
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                              ONLINE PIPELINE                                  │
+│                                                                               │
+│  User query ──► Coordinator ──► Retrieval ──► Rerank ──► Context Builder      │
+│                     │                                          │              │
+│                     │                                          ▼              │
+│                     │                                  ┌──────────────┐       │
+│                     │                                  │ Context Critic│      │
+│                     │            ┌─── retry ◄──────────┤  sufficient? │      │
+│                     │            ▼                     └──────┬───────┘      │
+│                     │       Retrieval (refined)               │              │
+│                     │                                          ▼              │
+│                     │                                  ┌──────────────┐       │
+│                     ▼                                  │  Generation  │       │
+│              ┌──────────────┐                          │ (Qwen / vLLM)│       │
+│              │ Cost & Audit │                          └──────┬───────┘      │
+│              │   Tracker    │                                  │              │
+│              └──────┬───────┘                                  ▼              │
+│                     │                                  ┌──────────────┐       │
+│                     │                                  │  Validation  │       │
+│                     │            ┌── repair ◄──────────┤  + Grounding │       │
+│                     │            ▼                     └──────┬───────┘      │
+│                     │      Generation (refined)               │              │
+│                     │                                          ▼              │
+│                     │                                  ┌──────────────┐       │
+│                     │                                  │ Compliance   │       │
+│                     │                                  │ Critic (CA)  │       │
+│                     │                                  └──────┬───────┘      │
+│                     │                                          ▼              │
+│                     │                                  ┌──────────────┐       │
+│                     └──────────────────────────────────► Export Agent│       │
+│                                                        └──────┬───────┘      │
+│                                                               │              │
+└───────────────────────────────────────────────────────────────┼──────────────┘
+                                                                ▼
+                                                        Structured output
+                                                        + citations
+                                                        + confidence score
+                                                        + audit log entry
+```
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                          OFFLINE INDEXING PIPELINE                            │
+│                                                                               │
+│  Source events                                                                │
+│  ├─ Tally voucher created/modified (via Bridge)                               │
+│  ├─ Document uploaded to vault (WhatsApp/email/web/mobile)                    │
+│  ├─ Return filed (GSTR/ITR/MCA)                                               │
+│  ├─ Notice received + OCR'd                                                   │
+│  ├─ Regulatory corpus update (CBIC/CBDT/MCA/ICAI feed)                        │
+│       │                                                                       │
+│       ▼                                                                       │
+│  Ingestion Service                                                            │
+│       │                                                                       │
+│       ▼                                                                       │
+│  ┌─────────────┐    ┌──────────────┐    ┌────────────┐    ┌───────────────┐  │
+│  │   Clean &   │ ─► │   Chunker    │ ─► │  Embedder  │ ─► │ Per-firm      │  │
+│  │  Normalize  │    │ (semantic +  │    │  (BGE-M3,  │    │ Qdrant        │  │
+│  │  (redact)   │    │ structural)  │    │   India)   │    │ collection    │  │
+│  └─────────────┘    └──────────────┘    └────────────┘    └───────────────┘  │
+│                                                ▲                              │
+│                                                │                              │
+│                                          Postgres metadata                    │
+│                                          (jobs, hashes, lineage)              │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 3. Tenant isolation — the most important design rule
+
+A leaked answer from Firm A's books to Firm B = end of business. Isolation must be **structural**, not just a metadata filter.
+
+| Boundary | Mechanism |
+|----------|-----------|
+| **Per-firm Qdrant collection** | One collection per firm (e.g., `firm_42_chunks`). Query agent literally cannot query another firm's collection — different connection scope. |
+| **Per-client filter** | Inside a firm's collection, every chunk's payload has `client_id`. Retrieval API hard-rejects queries that don't specify `client_id` for client-scoped questions. |
+| **Per-firm Postgres schema (or strict RLS)** | Job table, metadata, lineage — same row-level security model as the rest of the app. |
+| **Per-firm KMS data key** | Stored chunk text is encrypted with the firm's DEK. Even an ops engineer with DB access cannot read it. |
+| **Per-firm AI quota + cost ledger** | Every generation call attributed to a firm; quota enforced at the gateway. |
+| **No cross-firm training, ever** | All foundation models used in inference-only mode; provider contracts explicitly forbid training on our traffic. |
+| **No cross-firm embeddings** | Even though embedding space is mathematically shared, no chunk is co-located with another firm's chunk in any index. |
+
+**Regulatory corpus is the exception**: ICAI Standards, GST law, the Income-tax Act, MCA forms, and CBIC/CBDT circulars are public and shared. These live in a single global collection `regulatory_corpus` that every firm can query. Per-firm collections never contain a copy of public law.
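+
+The first two rows of the table can be sketched as a small scope builder. This is a minimal illustration rather than the service's actual code: `SearchScope`, `ScopeError`, and `firm_scope` are hypothetical names, and the sketch assumes the firm id comes from the authenticated session, never from the request body.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+class ScopeError(ValueError):
+    """Raised when a query would be run without a valid tenant scope."""
+
+
+@dataclass(frozen=True)
+class SearchScope:
+    collection: str
+    client_id: Optional[str]
+
+
+def firm_scope(firm_id: int, client_id: Optional[str], client_scoped: bool) -> SearchScope:
+    # Hard-reject client-scoped questions that do not name a client.
+    if client_scoped and not client_id:
+        raise ScopeError("client_id is required for client-scoped queries")
+    # The collection name is derived from the authenticated firm only.
+    return SearchScope(collection=f"firm_{firm_id}_chunks", client_id=client_id)
+```
+
+For example, `firm_scope(42, "client_ABC", True)` yields the collection `firm_42_chunks`, while `firm_scope(42, None, True)` raises `ScopeError` before any search is issued.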
+
+---
+
+## 4. Agent inventory
+
+| Agent | Responsibility | Implementation |
+|-------|---------------|----------------|
+| **Coordinator** | Parses the user request, classifies the use case, plans the agent graph, routes execution, tracks cost + audit | Python class with deterministic dispatch + LLM fallback for ambiguous queries |
+| **Retrieval** | Query rewriting, filter construction, Qdrant search; returns top-K candidates from per-firm collection + (optionally) regulatory corpus | Async service; talks to Qdrant HTTP API |
+| **Reranker** | Re-orders candidates by semantic match to the rewritten query; trims to top-N | BGE Reranker; CPU-fine to start, GPU when volume demands |
+| **Context Builder** | Merges chunks, deduplicates, preserves source IDs, formats for the model's context window | Pure Python; no LLM |
+| **Context Critic** | "Is this context sufficient to answer?" If not, requests a refined retrieval (different filter / different query) | Small LLM call (Qwen 2.5 1.5B) or rules + heuristics |
+| **Generation** | Produces the structured output (JSON) with inline citations | Qwen 7B/14B via vLLM on India GPU host |
+| **Validation** | Schema validation, duplicate removal, citation resolution, hallucination check (every cited source must exist) | Deterministic + small LLM for nuanced checks |
+| **Compliance Critic** *(CA-specific)* | Verifies cited sections are current law, cross-checks numbers against books, flags UDIN-bearing outputs, enforces "draft for partner review" labels | Rules engine + curated knowledge base + LLM for narrative |
+| **Export** | Writes the result to the right destination (DB, audit log, queue for partner review, draft document), emits webhook | Pure Python; no LLM |
+| **Cost & Audit Tracker** | Logs every step's tokens, latency, model, and outcome to the firm's audit log; enforces per-firm quota | Cross-cutting middleware |
+
+Each agent is **independently testable** with mock providers — see [§ 12](#12-local-development).
+
+---
+
+## 5. The six CA use cases and the agents they activate
+
+| Use case | Coordinator plan | Critical agent |
+|----------|------------------|----------------|
+| **Doc AI extraction** (invoice, bank statement, Form 16, contract) | Skip retrieval; structured extraction directly from uploaded doc | Generation + Validation (schema-strict) |
+| **Ask your books** | Retrieval (firm + client filter) → Generation | Context Critic (often retries) |
+| **Notice intelligence** | OCR notice → Retrieval (regulatory corpus + client's filing history) → Generation → draft reply | Compliance Critic (must classify correctly) |
+| **Auto-draft certificate** | Retrieval (client's books + template + relevant sections) → Generation → UDIN gate | Compliance Critic (UDIN-bearing flag) |
+| **Audit working paper draft** | Retrieval (client's books + SAs + prior year WPs) → Generation → cross-link to assertions | Validation (must cite SA + WP ref) |
+| **Article training Q&A** | Retrieval (firm's prior WPs + ICAI material) → Generation (MCQs) → difficulty calibration | Validation (no duplicates, balanced topics) |
+
+Same chassis. Six different specialised Coordinator plans + prompt templates.
+
+---
+
+## 6. Indexing pipeline — what gets embedded
+
+### 6.1 Sources, by priority
+
+| Source | Frequency | Chunk strategy |
+|--------|-----------|----------------|
+| Tally voucher (sales, purchase, journal, etc.) | Streamed via Bridge, near-real-time | One chunk per voucher with structured payload (date, party, amount, GST split, narration) |
+| Bank statement entry | Real-time on upload + recon match | One chunk per row, joined with matched voucher |
+| Document vault entry (invoice, contract, board resolution) | On upload | Chunk per page OR semantic chunk; OCR if PDF/image |
+| Filed return (GSTR, ITR, MCA) | On filing | Whole-return chunk + per-table chunks |
+| Notice received | On upload, after OCR | Full notice + extracted entities (section, demand, due date) |
+| Communication thread (email, WhatsApp) | On ingest | Per-message; PII-redacted at chunk time |
+| Working paper templates + firm's prior WPs | On upload / on save | Semantic chunks per heading |
+| **Regulatory corpus** (global, shared) | Updated weekly via curated feed | Per-section chunks; one collection serves all firms |
+
+### 6.2 Chunk payload
+
+Every chunk has a structured payload (`client_id` is `null` for regulatory chunks):
+
+```json
+{
+  "chunk_id": "v_42_abc_v0234_p0",
+  "firm_id": "firm_42",
+  "client_id": "client_ABC",
+  "source_type": "tally_voucher",
+  "source_id": "V/0234",
+  "doc_hash": "sha256:...",
+  "fy": "2026-27",
+  "period": "2026-Q1",
+  "date": "2026-04-15",
+  "tags": ["RCM", "purchase", "GST"],
+  "version": 3,
+  "indexed_at": "2026-05-17T08:00:00Z",
+  "ciphertext_ref": "s3://...firm_42/v_42_abc_v0234.enc"
+}
+```
+
+Chunk text is not stored in Qdrant, only the embedding and payload. The text is stored encrypted in S3 and is fetched and decrypted on demand by the Context Builder using the firm's DEK.
+
+### 6.3 Reindexing triggers
+
+- Voucher edited → reindex that voucher's chunk + dependent recons
+- Return filed → new chunk for the return + invalidate stale "draft return" chunks
+- Document re-uploaded with a new `doc_hash` → reindex + supersede prior
+- Regulatory update (new circular, rate change) → reindex affected sections of corpus + flag dependent firm-side answers as "may be stale, re-verify"
+
+---
+
+## 7. Online generation pipeline — detail
+
+### 7.1 Coordinator decision tree (simplified)
+
+```
+incoming query
+  ├─ classify intent
+  │   ├─ "extract X from this doc"       → Doc AI plan
+  │   ├─ "tell me about my books"        → Ask-your-books plan
+  │   ├─ "respond to this notice"        → Notice plan
+  │   ├─ "draft a certificate"           → Certificate plan
+  │   ├─ "draft working paper"           → WP plan
+  │   └─ "create training questions"     → Training plan
+  ├─ check firm AI quota → reject / queue / proceed
+  ├─ check feature flag for this firm / client → may be disabled
+  ├─ stamp request with audit context (user, IP, time, model intended)
+  └─ dispatch plan
+```
+
+### 7.2 Retrieval with two-collection query
+
+For most CA queries, retrieve from **two** Qdrant collections in parallel:
+- `firm_42_chunks` — the firm's own data, filtered by client_id and period
+- `regulatory_corpus` — global, filtered by relevant tags (GST, IT, MCA, ICAI)
+
+Results are merged before reranking. This gives the LLM both **what the books say** and **what the law says** in the same context.
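+
+A minimal sketch of the parallel query, assuming a hypothetical async `search(collection, query, filters)` callable that wraps the vector store. The function name `retrieve_both` and the filter keys are illustrative, not the service's real API.
+
+```python
+import asyncio
+from typing import Awaitable, Callable
+
+# Hypothetical search signature: (collection, query, filters) -> list of hits.
+SearchFn = Callable[[str, str, dict], Awaitable[list]]
+
+
+async def retrieve_both(search: SearchFn, firm_id: int, query: str,
+                        client_id: str, period: str, tags: list) -> list:
+    # Both collections are queried concurrently.
+    firm_hits, reg_hits = await asyncio.gather(
+        search(f"firm_{firm_id}_chunks", query, {"client_id": client_id, "period": period}),
+        search("regulatory_corpus", query, {"tags": tags}),
+    )
+    # Merge and de-duplicate by chunk_id; ordering is left to the reranker.
+    merged = {hit["chunk_id"]: hit for hit in firm_hits + reg_hits}
+    return list(merged.values())
+```
+
+Because the merge keeps no ranking of its own, the reranker remains the only place where firm and regulatory chunks are scored against each other.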
+
+### 7.3 Rerank
+
+BGE Reranker scores each chunk against the rewritten query. Keep the top-N chunks, where N is dictated by the model's context budget (e.g., 20 chunks ≈ 8K tokens, which leaves ample headroom in a 32K context window).
+
+### 7.4 Context Builder
+
+- Sort by source type then date
+- Inline source IDs as bracketed citations: `[V/0234]`, `[Sec 17(5)(g)]`, `[GSTR-3B 2026-04]`
+- Strip / mask PII not relevant to the query (Aadhaar, full bank a/c numbers)
+- Hard-cap at the model's safe context window
+
+### 7.5 Context Critic
+
+Asks: *given this query and this assembled context, can a reasonable answer be produced with cited grounding?*
+
+If no:
+- Identify what's missing (e.g., "no Q1 GSTR-3B chunk in context")
+- Request refined retrieval with explicit hints (e.g., add `period:2026-Q1 AND source_type:gstr_3b`)
+- Retry up to 3 times; then surface a "context insufficient" response to the user
+
+This loop is what makes the system *agentic*, not just RAG.
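+
+The loop can be sketched as follows. `retrieve` and `critique` are hypothetical callables standing in for the Retrieval agent and the Context Critic; the sketch assumes the critic returns `None` when the context is sufficient and a retrieval hint otherwise.
+
+```python
+from typing import Callable, Optional
+
+MAX_RETRIES = 3  # retry budget from the list above
+
+
+def gather_context(query: str,
+                   retrieve: Callable[[str, Optional[str]], list],
+                   critique: Callable[[str, list], Optional[str]]) -> Optional[list]:
+    """Return a sufficient context, or None to signal "context insufficient"."""
+    hint = None
+    for _ in range(1 + MAX_RETRIES):  # one initial attempt plus up to 3 retries
+        context = retrieve(query, hint)
+        hint = critique(query, context)  # e.g. "period:2026-Q1 AND source_type:gstr_3b"
+        if hint is None:
+            return context
+    return None
+```
+
+Returning `None` rather than the last context keeps the "context insufficient" response explicit, so Generation never runs on a context the critic has rejected.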
+
+### 7.6 Generation
+
+- Qwen 7B (latency-sensitive) or Qwen 14B (quality-sensitive) via vLLM
+- Output: strict JSON schema per use case
+- Temperature: 0.0–0.2 (we want near-deterministic output, not creativity)
+- Token budget: capped per call; overruns abort with a clear error
+- Streaming for interactive flows ("Ask your books"); batched for bulk (training Q&A generation)
+
+### 7.7 Validation
+
+Three layers:
+
+| Layer | Checks |
+|-------|--------|
+| **Schema** | JSON parses; required fields present; types correct |
+| **Grounding** | Every cited source_id resolves to a real chunk; cited fact appears in that chunk; numbers in the output match a primary source within rounding tolerance |
+| **Quality** | No duplicates; difficulty matches request (training Q&A); language matches firm preference; no PII leaked beyond scope |
+
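+Failures trigger a **repair loop**: regenerate with the validator's error appended to the prompt. The repair budget is set by the stakes tier (see § 10), with 2 attempts at the Medium tier. Once the budget is spent, Low and Medium outputs are handed to the user partially validated with explicit flags; High-stakes outputs abort.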
+
+### 7.8 Compliance Critic (CA-specific)
+
+Distinct from generic validation. Asks:
+
+- Are cited statutory sections still in force as of the query date? (Cross-check against a curated "effective period" index of sections.)
+- Are numerical thresholds current? (₹1 Cr turnover threshold for tax audit under Sec 44AB, or ₹10 Cr where cash transactions stay within 5%; ₹5 Cr for e-invoicing; etc.)
+- Is this output a **certificate** (UDIN-bearing)? If yes, force the "draft for partner review" label; never mark "final".
+- Is this output making a **legal opinion** that requires partner sign-off?
+- Does the output cite the firm's prior position on a similar matter and contradict it? (Surface the inconsistency for partner review.)
+
+### 7.9 Export
+
+- Persist output + full chain (query → retrieved chunks → generated → validated → critic notes)
+- Write structured audit log entry
+- Update firm's AI usage ledger (cost, tokens, latency)
+- Webhook to UI / mobile / queue partner notification
+
+---
+
+## 8. Citation model — "must resolve to a primary source"
+
+Every claim in the output must cite a `source_id` from a **primary source category**:
+
+| Category | Example source_id format |
+|----------|--------------------------|
+| Tally voucher | `voucher://firm_42/client_ABC/V0234` |
+| Bank statement row | `bank://firm_42/client_ABC/HDFC-2026-04/row-127` |
+| Filed return | `gstr3b://firm_42/client_ABC/2026-04` |
+| Form-16 | `form16://firm_42/client_ABC/employee_E10/AY-2026-27` |
+| Notice | `notice://firm_42/client_ABC/DRC01-2026-05` |
+| Statutory section | `statute://CGST/sec-17/5/g` |
+| ICAI standard | `icai://SA-230/para-8` |
+| MCA form schema | `mca://AOC-4/v3/clause-12` |
+| CBIC circular | `cbic://circular-204-2023` |
+
+A citation that fails to resolve in our authoritative index = hallucination. Validation rejects the output.
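+
+A minimal sketch of the resolution check, assuming the authoritative index can be queried as a set of known `source_id` strings. The scheme list is taken from the table above; `unresolved_citations` is a hypothetical helper name.
+
+```python
+import re
+
+# Schemes taken from the table above.
+PRIMARY_SCHEMES = {"voucher", "bank", "gstr3b", "form16", "notice",
+                   "statute", "icai", "mca", "cbic"}
+_SOURCE_ID = re.compile(r"^(?P<scheme>[a-z0-9]+)://(?P<path>\S+)$")
+
+
+def unresolved_citations(cited: list, index: set) -> list:
+    """Return the citations that do not resolve to a known primary source."""
+    bad = []
+    for source_id in cited:
+        match = _SOURCE_ID.match(source_id)
+        # Reject malformed ids, non-primary schemes, and ids missing from the index.
+        if not match or match["scheme"] not in PRIMARY_SCHEMES or source_id not in index:
+            bad.append(source_id)
+    return bad
+```
+
+An empty result means every citation resolved; any returned id would be fed back into the repair loop as the validator's error.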
+
+In the UI, every citation is **clickable** — partner sees the original voucher / section / notice in a side panel.
+
+---
+
+## 9. UDIN / DSC integration
+
+| Output type | AI's role | Final step |
+|-------------|-----------|------------|
+| Internal note, anomaly flag, ask-your-books answer | Final-ish (user can act on it) | None |
+| Notice reply draft | Draft | Partner edits + signs (DSC) |
+| Working paper draft | Draft | Preparer edits → reviewer → partner |
+| Certificate (15CB, turnover, net worth) | Draft | Partner reviews → generates **UDIN** → signs (DSC) → files |
+
+The AI never:
+- Generates a UDIN
+- Invokes a DSC
+- Marks an output as "final" or "issued"
+
+These transitions are gated by an explicit partner action in the UI, recorded in audit log, often with a step-up MFA challenge.
+
+---
+
+## 10. Validation tier by "stakes" (replaces EdTech's "difficulty")
+
+| Stakes | Examples | Min context recall | Max repair attempts | Compliance Critic strictness |
+|--------|----------|--------------------|---------------------|------------------------------|
+| Low | Internal Q&A, search summary, "explain this voucher" | 0.6 | 1 | Lenient |
+| Medium | Working paper draft, notice classification | 0.8 | 2 | Strict |
+| High | Certificate draft, return draft, opinion memo | 0.95 | 3 (then abort) | Maximum + cross-check |
+| Blocked | Final UDIN-bearing issuance, DSC-bound output | — | — | AI cannot produce final; partner gate enforced |
+
+The Coordinator picks the tier at plan time based on intent classification.
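+
+One way to encode the table is a static policy map that the Coordinator consults at plan time. The thresholds below are copied from the table; the intent labels and the fallback for unknown intents are assumptions made for illustration, since the classifier's real output labels are not specified here.
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+from typing import Optional
+
+
+class Stakes(Enum):
+    LOW = "low"
+    MEDIUM = "medium"
+    HIGH = "high"
+    BLOCKED = "blocked"
+
+
+@dataclass(frozen=True)
+class TierPolicy:
+    min_context_recall: Optional[float]
+    repair_attempts: Optional[int]
+    abort_after_repairs: bool
+
+
+# BLOCKED has no thresholds because the AI never produces that output.
+POLICIES = {
+    Stakes.LOW: TierPolicy(0.6, 1, False),
+    Stakes.MEDIUM: TierPolicy(0.8, 2, False),
+    Stakes.HIGH: TierPolicy(0.95, 3, True),
+    Stakes.BLOCKED: TierPolicy(None, None, True),
+}
+
+# Hypothetical intent labels.
+INTENT_TO_STAKES = {
+    "ask_your_books": Stakes.LOW,
+    "wp_draft": Stakes.MEDIUM,
+    "notice_classification": Stakes.MEDIUM,
+    "certificate_draft": Stakes.HIGH,
+}
+
+
+def policy_for(intent: str) -> TierPolicy:
+    # Assumption: unknown intents fall back to the strictest drafting tier.
+    return POLICIES[INTENT_TO_STAKES.get(intent, Stakes.HIGH)]
+```
+
+Defaulting unknown intents to High trades some latency for safety; a misclassified query then gets more validation rather than less.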
+
+---
+
+## 11. Model & infrastructure choices
+
+| Layer | Choice | Why |
+|-------|--------|-----|
+| **Generation LLM** | Qwen 2.5 7B (hot path) + Qwen 2.5 14B (high-stakes) | Open weights, strong on structured outputs, can self-host in India |
+| **Inference server** | vLLM | High throughput, batching, KV cache reuse |
+| **GPU host** | NVIDIA H100 / GH200 in **Tata Comm / Yotta / E2E Cloud / NxtGen** (Indian) | India residency; tier-1 datacentre |
+| **Embeddings** | BGE-M3 | Multilingual (English + Hindi + others), strong on technical text |
+| **Embedding host** | CPU initially → small GPU later | Embedding is sporadic, CPU OK to ~1000 chunks/min |
+| **Reranker** | BGE Reranker v2 | Industry-standard, good quality/cost |
+| **Vector DB** | Qdrant | Open source, strong filters, in-memory + on-disk modes |
+| **Vector DB host** | Self-hosted on EKS in ap-south-1 | Avoid cross-region calls; data residency |
+| **Metadata DB** | Postgres (same as main app) | Co-locate with existing infra; transactional joins |
+| **Queue** | Celery + Redis (existing infra) | Reuse |
+| **API** | FastAPI for AI service | Async, typed, fits Python ML ecosystem |
+| **Gateway** | LiteLLM in front of vLLM (and optional hosted fallback) | Provider-agnostic routing, cost tracking, retries |
+| **Provider fallback** | Hosted Indian providers (Sarvam, Krutrim, hosted Qwen) | Capacity overflow; never out of region |
+
+We **explicitly avoid**:
+- US-region OpenAI / Anthropic / Gemini for any client data
+- Any provider that trains on inputs
+- Any provider that doesn't sign a DPA with India residency clauses
+
+For the *firm's own training corpus* (article Q&A use case), we have more flexibility — but defaults stay India-resident for consistency.
+
+---
+
+## 12. Local development
+
+The system **must run with zero GPU** on a developer laptop, or velocity dies.
+
+Mock providers (provided in the AI service):
+- `MockGenerationProvider` — returns deterministic JSON shaped to the expected schema
+- `MockEmbeddingProvider` — returns hashed-to-vector embeddings (consistent across runs)
+- `MockReranker` — returns the input candidates in a slightly shuffled order
+- `MockQdrant` — in-memory dict keyed by collection
+
+Switched via env var `AI_PROVIDER=mock|local-vllm|prod-vllm`.
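+
+As an illustration of "hashed-to-vector", a deterministic mock embedder can be built from the standard library alone. This is a sketch of how such a provider might look, not the class shipped in the AI service; the `embed` method name and the default dimension of 1024 (the dense size of BGE-M3) are assumptions.
+
+```python
+import hashlib
+import struct
+
+
+class MockEmbeddingProvider:
+    """Deterministic stand-in for the embedder: same text, same vector."""
+
+    def __init__(self, dim: int = 1024):
+        self.dim = dim
+
+    def embed(self, text: str) -> list:
+        values = []
+        counter = 0
+        while len(values) < self.dim:
+            digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
+            # Each 32-byte digest yields eight unsigned 32-bit integers,
+            # scaled here into the range [-1, 1].
+            values.extend(n / 0xFFFFFFFF * 2 - 1 for n in struct.unpack(">8I", digest))
+            counter += 1
+        return values[: self.dim]
+```
+
+The vectors carry no semantic meaning, so retrieval quality cannot be judged against this mock; it only guarantees that indexing and search paths run reproducibly without a GPU.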
+
+Docker Compose service list for local dev (images, ports, and volumes omitted):
+```yaml
+services:
+  api:          # FastAPI
+  postgres:     # metadata
+  redis:        # queue
+  qdrant:       # vector store
+  ai-mock:      # Python service implementing the mock providers
+```
+
+This setup is the single biggest reason we keep the EdTech architecture as-is — it already nailed dev ergonomics.
+
+---
+
+## 13. Cost model
+
+Per-call cost = embedding cost + retrieval cost (~free at our scale) + LLM tokens × price.
+
+| Use case | Avg tokens in | Avg tokens out | Cost / call (₹) at hosted-India pricing |
+|----------|---------------|----------------|-----------------------------------------|
+| Doc AI extraction | 4K | 0.5K | ~₹1–3 |
+| Ask your books | 6K | 0.5K | ~₹2–4 |
+| Notice intelligence | 10K | 2K | ~₹6–12 |
+| Certificate draft | 8K | 1.5K | ~₹5–10 |
+| WP draft | 12K | 3K | ~₹10–20 |
+| Training Q&A bulk (20 Qs) | 8K | 4K | ~₹15–25 |
+
+Self-hosted GPU amortises differently — fixed ~₹1L–1.5L/month for the GPU, marginal cost approaches zero. Crossover happens at ~50K calls/month → switch from hosted to self-hosted.
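+
+The crossover is simple break-even arithmetic: fixed monthly GPU cost divided by the hosted cost per call. The sketch below assumes a blended hosted cost of ₹2–3 per call, which is an assumption about the traffic mix (mostly Doc AI and Ask-your-books calls), not a figure stated in the table.
+
+```python
+def break_even_calls(fixed_monthly_inr: float, hosted_cost_per_call_inr: float) -> float:
+    """Monthly call volume at which self-hosting matches hosted spend."""
+    return fixed_monthly_inr / hosted_cost_per_call_inr
+
+
+# Fixed GPU cost range from the paragraph above; per-call cost is assumed.
+for fixed in (100_000, 150_000):
+    for per_call in (2.0, 3.0):
+        calls = break_even_calls(fixed, per_call)
+        print(f"fixed INR {fixed:,}/month at INR {per_call}/call: {calls:,.0f} calls")
+```
+
+Under these assumptions the break-even lands between roughly 33K and 75K calls per month, which brackets the ~50K figure; a mix heavier in WP drafts or notice replies would move the crossover lower.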
+
+Per-firm quota enforcement in Coordinator → never silently blow the budget.
+
+---
+
+## 14. Data residency in summary
+
+| Asset | Location |
+|-------|----------|
+| Qdrant collections | AWS Mumbai (ap-south-1), encrypted volumes |
+| Postgres metadata | AWS Mumbai |
+| Chunk text (encrypted at rest) | S3 Mumbai, per-firm KMS DEK |
+| Audit log of AI calls | AWS Mumbai, 8-year retention |
+| Inference GPUs | Indian cloud (Tata Comm / Yotta / E2E / NxtGen) |
+| Hosted-provider fallback | India-region endpoints only |
+| Embedding model weights | Cached in Mumbai region |
+
+No bytes of any firm's client data ever leave India.
+
+---
+
+## 15. Failure modes & their handling
+
+| Failure | Detection | Handling |
+|---------|-----------|----------|
+| Hallucinated citation | Citation doesn't resolve in index | Repair loop; if fails, return partial output with flag |
+| Wrong number (doesn't match books) | Compliance Critic cross-check | Repair; if fails, abort and surface to partner |
+| Stale statutory citation | Compliance Critic effective-period check | Replace with current; flag old |
+| Context window exceeded | Pre-flight token count | Aggressive rerank → reduce N; if still over, return "scope too broad" |
+| Model unavailable / vLLM down | Health check fails | Fallback to hosted Indian provider; if both down, queue request, notify firm |
+| Embedding model down | Indexer health check | Pause new indexing, alert; existing queries unaffected |
+| Qdrant down | Retrieval timeout | Open circuit; fall back to keyword search (Postgres / OpenSearch) with a "degraded mode" badge |
+| Per-firm quota exceeded | Pre-call check | Hard stop with clear message + upgrade link |
+| Compliance Critic disagrees with Generation | Different outputs | Always trust Critic; force repair or partner review |
+| PII leaked into output | Validation scanner | Block output, raise incident, audit |
+| Adversarial prompt from user | Prompt injection detector | Strip / refuse; log; rate-limit user |
+| Tenant boundary error (chunk from wrong firm) | Pre-output check that all chunk firm_ids match request | Hard abort, paged alert, freeze for incident review |
+
+---
+
+## 16. Phasing plan — sub-phases of Phase 3
+
+This is what actually ships, in order:
+
+### Phase 3A — Doc AI (extraction only)
+- Generation + Validation agents
+- No retrieval (single doc in context)
+- Use case: structured extraction from invoices, bank statements, Form 16, contracts
+- Lowest hallucination risk; builds trust
+- **Exit criteria**: extraction accuracy >92% on a curated test set
+
+### Phase 3B — Ask your books (read-only)
+- Full agent chain, but output is limited to read-only summarisation with citations
+- No drafting / no certificate generation
+- **Exit criteria**: citation resolution rate ≥98%; partner NPS for the feature ≥40
+
+### Phase 3C — Notice intelligence
+- Adds OCR-on-notice + classification + draft reply
+- Compliance Critic must be solid before this ships
+- **Exit criteria**: classification accuracy ≥95% on top-10 notice types
+
+### Phase 3D — Auto-draft certificates
+- UDIN gate enforced
+- Stakes-tier "High" validation
+- **Exit criteria**: zero "incorrect certificate filed" incidents in 90-day beta
+
+### Phase 3E — Audit working paper drafting
+- WP chunking + SA cross-linking
+- Reviewer/partner workflow integration
+- **Exit criteria**: partner-acceptance rate of generated WP sections ≥70%
+
+### Phase 3F — Article training Q&A
+- The original EdTech use case
+- Lower stakes; can ship earlier if firm-side demand exists
+- **Exit criteria**: educator (CA-mentor) approval rating ≥75%
+
+---
+
+## 17. Mapping to the Agentic RAG reference architecture
+
+For the record, this design **inherits**:
+
+| From EdTech Agentic RAG | Kept | Changed for CA |
+|--------------------------|------|----------------|
+| Coordinator agent | ✅ | Plus per-firm quota, audit context, stakes tiering |
+| Retrieval + filter + Qdrant search | ✅ | Plus two-collection (firm + regulatory) parallel retrieval |
+| BGE-M3 embeddings | ✅ | India-hosted; multilingual including Hindi |
+| BGE Reranker | ✅ | Same |
+| Context Builder | ✅ | Plus PII masking at context-build time (on top of chunk-time redaction in the indexer) |
+| Context Critic feedback loop | ✅ | Same — this is the agentic heart |
+| Generation via Qwen / vLLM | ✅ | India-hosted GPU only |
+| Validation (grounding + dup-removal) | ✅ | Plus citation must resolve to primary source |
+| Repair loop | ✅ | Stakes-tier-aware budget |
+| Export to DB + API | ✅ | Plus audit log + UDIN/DSC gate |
+| Mock providers for dev | ✅ | Same — kept exactly |
+| FastAPI + Postgres + Qdrant + Celery | ✅ | Same — exact stack alignment |
+
+**Newly added agents and rules** specifically for CA:
+- Compliance Critic agent
+- Stakes tiering system
+- UDIN / DSC enforcement gate
+- Two-collection retrieval (firm + regulatory)
+- Citation-must-resolve-to-primary-source rule
+- Per-firm Qdrant collection (structural isolation)
+- Per-firm KMS-encrypted chunk plaintext
+- 8-year audit retention per ICAI
+- Regulatory corpus versioning + "may be stale" propagation
+
+---
+
+## 18. Open questions to settle before Phase 3 build
+
+1. **Self-hosted vLLM vs hosted Qwen** — when do we flip? Decide at ~50K calls/month milestone.
+2. **Qwen 7B vs 14B for high-stakes outputs** — benchmark on Notice intelligence + Certificate drafting before locking in.
+3. **Whether to fine-tune** Qwen on a firm's prior outputs (premium tier feature?) — defer until 100+ firms on AI plan.
+4. **Document AI: rules + ML hybrid vs pure ML** — likely hybrid (forms have known layouts), but quantify.
+5. **Indian GPU provider selection** — RFP at start of Phase 3.
+6. **Reranker on GPU vs CPU** — start CPU, move to GPU at ~100K reranks/day.
+7. **Multi-language outputs** (Hindi, Tamil, Gujarati for client portal) — Phase 4 by default; revisit if customer demand earlier.
+
+---
+
+## 19. What this doc does not cover
+
+- Detailed prompt templates (lives in code, not docs)
+- Specific JSON schemas per use case (also in code; auto-generated docs from schemas)
+- Test harness design for AI quality regression (separate doc: `ai-quality-testing.md`, to be written when build begins)
+- Pricing of AI as a premium add-on (in [`../business/pricing.md`](../business/pricing.md))