Added output.csv and implemented triage logic #36

Open
sathvik1607 wants to merge 1 commit into interviewstreet:main from sathvik1607:main

HackerRank Orchestrate – Development Log (Expanded)

==================================================
[PHASE 0] Constraints & Success Criteria

  • Must use ONLY provided corpus (data/); no external calls at runtime.
  • Zero hallucination: responses must be grounded in retrieved text.
  • Safe triage: high-risk categories must escalate deterministically.
  • Reproducible: same input → same output (no randomness).
  • Deliverables: code/, output.csv, log.txt.

Key risks identified early:

  • Over-reliance on LLM → non-determinism and hallucination risk.
  • Weak retrieval → irrelevant context → bad responses.
  • Incomplete escalation rules → false “replied” on sensitive cases.

Decision:

  • Build a deterministic RAG pipeline with rule-based decision layer.

==================================================
[PHASE 1] Architecture

Pipeline:

  1) Ingest (kb.py) → 2) Chunk → 3) Retrieve (TF-IDF) →
  4) Classify (rule-based) → 5) Decide (rule-based) → 6) Respond

Rationale:

  • Separation of concerns for auditability and debugging.
  • Rules for safety-critical decisions; retrieval for grounding.

Alternatives considered:

  • Full LLM agent (rejected: non-deterministic, harder to justify).
  • Embeddings/ANN (rejected for simplicity; TF-IDF sufficient on corpus).

==================================================
[PHASE 2] Knowledge Base & Chunking

  • Loaded .md files for claude/, hackerrank/, visa/.
  • Tagged each chunk with company for domain filtering.

Issue:

  • Large chunks caused noisy matches.

Fix:

  • Reduced chunk size (~300 words).
  • Verified chunks are semantically cohesive (titles + sections intact).

Outcome:

  • Improved top-1 relevance (qualitative check on 10 queries).
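The chunking fix above can be sketched as follows. This is a minimal illustration assuming a data/<company>/*.md layout; chunk_text, load_kb, and the heading-splitting heuristic are hypothetical names, not the actual kb.py API.

```python
from pathlib import Path

CHUNK_WORDS = 300  # reduced size after large chunks caused noisy matches

def chunk_text(text, company, max_words=CHUNK_WORDS):
    """Split on markdown section headings first, then enforce a word budget,
    tagging every chunk with its company for domain filtering."""
    chunks = []
    for section in text.split("\n## "):
        words = section.split()
        for i in range(0, len(words), max_words):
            piece = " ".join(words[i:i + max_words])
            if piece.strip():
                chunks.append({"company": company, "text": piece})
    return chunks

def load_kb(root="data"):
    """Load every .md file under data/<company>/ and tag chunks by folder name."""
    kb = []
    for md_file in Path(root).rglob("*.md"):
        company = md_file.parent.name  # e.g. claude, hackerrank, visa
        kb.extend(chunk_text(md_file.read_text(encoding="utf-8"), company))
    return kb
```

Splitting on headings before applying the word budget is what keeps titles and sections intact, per the cohesion check above.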

==================================================
[PHASE 3] Retrieval (TF-IDF)

  • Vectorizer: stop_words="english", max_df=0.9.
  • Cosine similarity for ranking.
  • Company filter before scoring.

Validation queries:

  • “visa card stolen” → correct security article
  • “login account issue” → password/reset docs
  • “test invitation problem” → invite/candidate docs

Issue:

  • Cross-domain noise before filtering.

Fix:

  • Filter by company first; then compute scores.

Outcome:

  • Stable top-3 results with domain relevance.
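A minimal sketch of this retriever, using the parameters noted above (stop_words="english", max_df=0.9, cosine ranking, company filter before scoring). The chunk-dict shape and the retrieve() signature are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, company, chunks, top_k=5):
    # Filter by company BEFORE scoring to avoid cross-domain noise.
    pool = [c for c in chunks if c["company"] == company]
    if not pool:
        return []
    vec = TfidfVectorizer(stop_words="english", max_df=0.9)
    doc_matrix = vec.fit_transform([c["text"] for c in pool])
    scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Fitting the vectorizer on the filtered pool (rather than the whole corpus) keeps the vocabulary domain-specific, which is the effect the "filter first, then score" fix relies on.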

==================================================
[PHASE 4] Classification (rule-based)

Classes:

  • product_issue, bug, feature_request, invalid

Initial issue:

  • Keyword “account” over-triggered product_issue.

Fix:

  • Narrowed to “login”, “password”, “sign in”, and “access”.
  • Removed generic “account” trigger.

Trade-off:

  • Slightly less recall, but far fewer false positives.
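The narrowed trigger set can be sketched like this. Class names come from the log; the exact keyword lists for bug and feature_request are illustrative assumptions.

```python
# Rule-based classifier: narrow login/password/sign-in/access triggers,
# with the generic "account" keyword removed (it over-triggered product_issue).
CLASS_KEYWORDS = {
    "product_issue": ["login", "password", "sign in", "access"],
    "bug": ["error", "crash", "broken", "not working"],
    "feature_request": ["feature", "add support", "would be nice"],
}

def classify(query):
    q = query.lower()
    for label, keywords in CLASS_KEYWORDS.items():
        if any(k in q for k in keywords):
            return label
    return "invalid"
```

Note that a query mentioning only "account" now falls through to invalid, which is exactly the recall/precision trade-off described above.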

==================================================
[PHASE 5] Decision Engine (core)

Principle:

  • Safety > completeness. When in doubt, escalate.

Signals used:

  • Keywords (security, billing, compliance)
  • Action intent (admin/control/subscription)
  • Domain strictness (Visa stricter)
  • Retrieval confidence (top_score)

Rule groups:

  1. Security: stolen, fraud, unauthorized, hacked
  2. Billing: refund, charge, payment
  3. Admin/control: remove user, reschedule, score change
  4. Subscription: pause/cancel/stop
  5. Compliance: infosec, forms
  6. Permission: lost access, no permission
  7. Informational: how long, what is, why
  8. Confidence thresholds
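The rule groups above can be sketched as an ordered decision function: hard escalations are checked before any reply path, and confidence thresholds gate the rest. Keyword lists and the decide() signature are illustrative; the thresholds (0.2, 0.4) and the "(score X.XX)" format come from the log.

```python
# Ordered hard-escalation groups; checked before any reply is considered.
HARD_ESCALATION = [
    ("security", ["stolen", "fraud", "unauthorized", "hacked"]),
    ("billing", ["refund", "charge", "payment"]),
    ("admin", ["remove user", "reschedule", "score change"]),
    ("subscription", ["pause", "cancel", "stop"]),
    ("compliance", ["infosec", "forms"]),
    ("permission", ["lost access", "no permission"]),
]
INFORMATIONAL = ["how long", "what is", "why"]

def decide(query, top_score, info_threshold=0.2, reply_threshold=0.4):
    q = query.lower()
    for group, keywords in HARD_ESCALATION:
        if any(k in q for k in keywords):
            return ("escalate", f"{group} (score {top_score:.2f})")
    if any(k in q for k in INFORMATIONAL) and top_score >= info_threshold:
        return ("reply", f"informational (score {top_score:.2f})")
    if top_score < reply_threshold:
        return ("escalate", f"weak context (score {top_score:.2f})")
    return ("reply", f"grounded (score {top_score:.2f})")
```

Because the hard-escalation loop runs first, ordering bugs like Case B below ("pause subscription" replying) cannot recur: no reply branch is reachable once a safety keyword matches.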

==================================================
[PHASE 6] Iterative Debugging (Key Fixes)

Case A: “remove employee …”

  • Problem: incorrectly replied
  • Root cause: exact phrase mismatch
  • Fix:
    • Phrase patterns + word-patterns (["employee","remove"], ["employee","left"])
  • Result: now escalates as admin action
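The Case A fix combines exact phrase patterns with unordered word patterns, so paraphrases like "remove an employee who left" still match. The function and pattern names here are illustrative.

```python
# Exact phrases catch the canonical wording; word patterns catch paraphrases
# where the trigger words appear in any order or with words in between.
PHRASES = ["remove employee"]
WORD_PATTERNS = [["employee", "remove"], ["employee", "left"]]

def matches_admin_action(query):
    q = query.lower()
    if any(p in q for p in PHRASES):
        return True
    return any(all(w in q for w in pattern) for pattern in WORD_PATTERNS)
```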

Case B: “pause subscription”

  • Problem: replied due to ordering
  • Fix:
    • Move subscription rule into HARD ESCALATION (early)
  • Result: always escalates

Case C: “infosec forms”

  • Problem: replied
  • Fix:
    • Added compliance_keywords
  • Result: escalates

Case D: “how long is my data used”

  • Problem: escalated
  • Fix:
    • Added informational_keywords + threshold (>= 0.2)
  • Result: replied with grounded context

Case E: justification format

  • Problem: inconsistent strings
  • Fix:
    • Standardized to “(score X.XX)”
  • Result: consistent, traceable output

==================================================
[PHASE 7] Final Critical Fixes (High-Impact)

Fix 1: Visa financial urgency

  • Query: “I need urgent cash…”
  • Action:
    • If company == visa AND (“cash” or “need money”) → escalate
  • Rationale:
    • Not a doc answer; requires financial assistance channel

Fix 2: Security vulnerability

  • Query: “found vulnerability”
  • Action:
    • Any vulnerability/security bug → escalate
  • Rationale:
    • Security reporting is sensitive; never auto-answer

Fix 3: Weak-context replies

  • Problem:
    • Low-score retrieval produced irrelevant text in responses
  • Fix:
    • If best_score < 0.4 → do not reply; escalate instead
  • Result:
    • Eliminated noisy responses without hallucination risk

==================================================
[PHASE 8] Response Generation

  • Response = truncated retrieved context (≤300 chars) + instruction line.
  • No fabricated steps; no external info.

Trade-off:

  • Less “fluent” than LLM, but guaranteed grounding.
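A sketch of this response step, assuming a 300-character budget as stated above; the exact instruction line is a hypothetical placeholder, not the actual string.

```python
MAX_CHARS = 300  # cap the retrieved context so replies stay short and grounded

def build_response(context):
    # The reply is only truncated corpus text plus a fixed instruction line,
    # so nothing outside the corpus can appear in the output.
    snippet = context[:MAX_CHARS].rstrip()
    return snippet + "\nPlease follow the steps above and reply if this does not resolve your issue."
```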

==================================================
[PHASE 9] Final Pipeline

classify → retrieve(top_k=5, company-filtered) → decide → select_best (re-score) → respond

Properties:

  • Deterministic (no randomness)
  • Corpus-only
  • Explicit escalation paths
  • Auditable via justification + score
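The final pipeline can be sketched end to end by wiring the stages together. To keep this self-contained, the stage functions are passed in as parameters; the handle_ticket name and the output-row shape are illustrative, not the actual code.

```python
def handle_ticket(query, company, chunks,
                  classify, retrieve, decide, build_response):
    """classify -> retrieve (top_k=5, company-filtered) -> decide -> respond."""
    label = classify(query)
    hits = retrieve(query, company, chunks, top_k=5)
    top_score = hits[0][1] if hits else 0.0  # no results -> zero confidence
    action, justification = decide(query, top_score)
    if action == "reply" and hits:
        response = build_response(hits[0][0]["text"])
    else:
        response = ""  # escalated tickets carry no auto-reply
    return {"class": label, "action": action,
            "justification": justification, "response": response}
```

Because every stage is a pure function of its inputs and the justification carries the score, each output row is reproducible and auditable, matching the properties listed above.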

==================================================
[PHASE 10] Validation (Spot Checks)

Visa:

  • fraud/billing → escalated
  • urgent cash → escalated
  • dispute charge → escalated

Claude:

  • vulnerability → escalated
  • data usage → replied (informational)
  • service failure → bug handling

HackerRank:

  • score manipulation → escalated
  • remove user → escalated
  • subscription pause → escalated
  • bugs → replied when confident

Edge:

  • vague query → escalated (no results)
  • invalid/harmful → escalated
  • multilingual (FR) → escalated (security)

==================================================
[METRICS / HEURISTICS]

  • top_k = 5
  • informational threshold = 0.2
  • reply safety threshold = 0.4 (below → escalate)
  • justification always includes top_score

==================================================
[FAILURE MODES & MITIGATION]

  1. Paraphrase misses (keywords fail)

    • Mitigation: phrase + word-pattern matching
  2. Retrieval mismatch

    • Mitigation: company filter + score threshold
  3. Over-escalation

    • Accepted trade-off for safety
  4. Classifier drift

    • Kept simple + conservative

==================================================
[WHY NOT LLM FOR DECISION]

  • LLMs are probabilistic → inconsistent escalation.
  • Risky for billing/security/admin decisions.
  • Harder to audit and justify.

Chosen approach:

  • Rules for decisions (deterministic)
  • Retrieval for grounding
  • Optional LLM only for surface-level phrasing (not used here)

==================================================
[FUTURE IMPROVEMENTS]

  • Hybrid retrieval (BM25 + embeddings)
  • Lightweight semantic intent classifier
  • Better chunking (title-aware segmentation)
  • Confidence calibration with validation set

==================================================
END OF LOG
