MuhammadAnas4774/HSRIS_CustomerSupport_NLP


🚀 Hybrid Semantic Retrieval & Intelligence System (HSRIS)

Academic Assignment — End-to-end NLP pipeline for customer support ticket retrieval using PyTorch, built entirely from scratch (no sklearn).


📌 Overview

HSRIS is a multi-stage NLP retrieval system that combines:

| Stage | Method | Output |
|---|---|---|
| Encoders | Label Encoding + One-Hot Encoding | Priority integers, Channel binary vectors |
| Sparse Retrieval | Bag-of-Words, N-Grams, TF-IDF (sparse tensors) | Keyword similarity scores |
| Dense Retrieval | GloVe 300-d + TF-IDF-weighted averaging | Semantic similarity scores |
| Hybrid Search | α·TF-IDF + (1-α)·GloVe | Ranked ticket results |
| Dual-GPU | torch.nn.DataParallel | Batch of 100 queries on Tesla T4 ×2 |
| Evaluation | Precision@5 | Quantitative comparison of all methods |
| UI | Gradio App | Interactive live demo |
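
As a rough illustration of the sparse retrieval stage, the sketch below computes TF-IDF vectors and cosine similarity without scikit-learn. It is not the repository's code: it uses plain Python dictionaries instead of PyTorch sparse tensors to stay dependency-free, and the function names, the smoothed IDF formula, and the sample tickets are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tickets, not taken from the Kaggle dataset.
tickets = [
    "Cannot reset my account password",
    "Refund request for damaged laptop",
    "Laptop battery drains quickly",
]
docs = [tokenize(t) for t in tickets]
idf = fit_idf(docs)
doc_vecs = [tfidf_vector(d, idf) for d in docs]

query_vec = tfidf_vector(tokenize("password reset"), idf)
scores = [cosine(query_vec, d) for d in doc_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
```

Only the first ticket shares terms with the query, so it is the only one with a non-zero score. A tensor-based version would hold the same weights in a sparse matrix and compute all similarities in one matrix product.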

🗂️ Repository Structure

HSRIS/
├── HSRIS_FINAL.ipynb          ← ✅ Upload this to Kaggle (46 cells)
├── HSRIS_COMPLETE.py          ← Full single Python source file
├── build_notebook.py          ← Assembles .ipynb from source
├── make_single_notebook.py    ← Converts merged .py → .ipynb
├── hsris_part1_encoders.py    ← Label + One-Hot encoders
├── hsris_part2_sparse_retrieval.py  ← BoW, N-Grams, TF-IDF
├── hsris_part3_dense_layer.py ← GloVe, nn.Embedding, cosine sim
├── hsris_part4_hybrid_eval.py ← Hybrid search, Dual-GPU, Precision@5
├── hsris_part5_gradio_app.py  ← Gradio UI
└── README.md

⚙️ Environment

| Setting | Value |
|---|---|
| Platform | Kaggle Notebook |
| GPU | Tesla T4 ×2 (Dual GPU) |
| Python | 3.10+ |
| Libraries | PyTorch · NumPy · Pandas · Regex · Gradio · Matplotlib |
| Forbidden | scikit-learn (TfidfVectorizer, LabelEncoder, etc.) |
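
Because scikit-learn's LabelEncoder is off limits, the encoders have to be written by hand. The following is a minimal sketch of how label and one-hot encoding could look in plain Python; the helper names and the sample priority and channel values are hypothetical, and the repository's own version may differ (for example by returning PyTorch tensors).

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(values)))}


def label_encode(values, mapping):
    """Replace each category with its integer id."""
    return [mapping[v] for v in values]


def one_hot_encode(values, mapping):
    """Replace each category with a binary vector holding a single 1."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        row[mapping[v]] = 1
        rows.append(row)
    return rows


# Hypothetical column values for illustration only.
priorities = ["Low", "High", "Critical", "High"]
channels = ["Email", "Chat", "Phone", "Email"]

priority_map = fit_label_encoder(priorities)
priority_ids = label_encode(priorities, priority_map)

channel_map = fit_label_encoder(channels)
channel_onehot = one_hot_encode(channels, channel_map)
```

Sorting the categories before assigning ids keeps the mapping deterministic across runs, which matters when the encoded columns are compared or cached.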

📦 Kaggle Datasets Required

| Dataset | Kaggle Slug |
|---|---|
| Customer Support Tickets | suraj520/customer-support-ticket-dataset |
| GloVe 6B 300d | thanakomsn/glove6b300dtxt |

🚀 How to Run on Kaggle

  1. Upload HSRIS_FINAL.ipynb → Kaggle → New Notebook → Import
  2. Add datasets (right panel → + Add Data):
    • suraj520/customer-support-ticket-dataset
    • thanakomsn/glove6b300dtxt
  3. Enable Dual GPU: Settings → Accelerator → GPU T4 x2
  4. Run All — Gradio public URL appears at the bottom

📐 Architecture

Query String
     │
     ├──► Tokenize ──► TF-IDF Vector ──► Cosine Sim ──► TF-IDF Scores
     │                                                        │
     └──► Tokenize ──► GloVe Embedding ──► Cosine Sim ──► GloVe Scores
                                                              │
                          α × TF-IDF + (1-α) × GloVe  ◄──────┘
                                      │
                               Top-K Ranked Results
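
The fusion step in the diagram can be sketched as follows. This is an illustrative version only: the function names, the default α of 0.5, and the sample scores are assumptions, and it presumes both score lists are cosine similarities over the same tickets in the same order.

```python
def hybrid_scores(tfidf_scores, glove_scores, alpha=0.5):
    """Blend per-ticket scores: alpha * TF-IDF + (1 - alpha) * GloVe."""
    if len(tfidf_scores) != len(glove_scores):
        raise ValueError("score lists must cover the same tickets")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * g for t, g in zip(tfidf_scores, glove_scores)]


def top_k(scores, k):
    """Indices of the k highest-scoring tickets, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical similarity scores for four tickets.
tfidf = [0.9, 0.1, 0.0, 0.4]
glove = [0.2, 0.8, 0.7, 0.4]

balanced = top_k(hybrid_scores(tfidf, glove, alpha=0.5), k=2)
keyword_only = top_k(hybrid_scores(tfidf, glove, alpha=1.0), k=2)
semantic_only = top_k(hybrid_scores(tfidf, glove, alpha=0.0), k=2)
```

Setting α to 1 reduces the ranking to pure keyword matching and setting it to 0 reduces it to pure semantic matching, so α acts as a single dial between the two retrievers.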

📊 Deliverables

| # | Deliverable | Location |
|---|---|---|
| 1 | Jupyter Notebook (all tasks) | HSRIS_FINAL.ipynb |
| 2 | Execution Time vs Batch Size Plot | Cell 27 → execution_time_plot.png |
| 3 | Precision@5 Report Table | Cell 28 |
| 4 | Precision@5 Bar Chart | Cell 28 → precision_at5_chart.png |
| 5 | 5 qualitative examples where GloVe outperforms TF-IDF | Cell 29 |
| 6 | Gradio App Live Link | Cell 32 → printed URL |
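
For readers unfamiliar with the metric behind deliverables 3 and 4, here is a small sketch of Precision@5. The README does not say how relevance is judged, so this example assumes a retrieved ticket counts as relevant when its category label matches the query's label; the function names and sample labels are hypothetical.

```python
def precision_at_k(retrieved_labels, query_label, k=5):
    """Fraction of the top-k retrieved tickets whose label matches the query."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for label in retrieved_labels[:k] if label == query_label)
    return hits / k


def mean_precision_at_k(results, k=5):
    """Average precision@k over (retrieved_labels, query_label) pairs."""
    return sum(precision_at_k(r, q, k) for r, q in results) / len(results)


# Hypothetical ranked results for two queries.
results = [
    (["Billing", "Billing", "Technical", "Billing", "Refund", "Billing"], "Billing"),
    (["Technical"] * 5, "Technical"),
]
first = precision_at_k(*results[0])
overall = mean_precision_at_k(results)
```

In the first query three of the top five tickets match, giving 0.6; the sixth result is ignored because it falls outside the cutoff.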

🔑 Key Design Decisions

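  • No sklearn: TF-IDF, label encoding, and one-hot encoding are all built from scratch with NumPy/PyTorch
  • Sparse tensors: the TF-IDF matrix is stored as a torch.sparse_coo_tensor, which keeps only non-zero entries and so saves GPU VRAM
  • TF-IDF-weighted GloVe: each word vector is weighted by its TF-IDF score before averaging, so informative terms count for more than frequent filler words; intended to be more accurate than simple mean pooling (SIF-inspired)
  • nn.DataParallel: automatically splits each batch of queries across both T4 GPUs
  • Gradio: runs inside a Kaggle notebook, and share=True produces a public URL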
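
To make the TF-IDF-weighted GloVe idea concrete, the sketch below pools word vectors into one ticket vector, weighting each term by its count times its IDF. It is an assumption-laden toy: the function name is hypothetical, the vectors are 3-dimensional stand-ins rather than real 300-d GloVe entries, and the IDF values are made up.

```python
from collections import Counter


def tfidf_weighted_embedding(tokens, embeddings, idf, dim):
    """Weighted mean of word vectors, with weight = term count * IDF."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    pooled = [0.0] * dim
    total_weight = 0.0
    for term, count in counts.items():
        weight = count * idf[term]
        total_weight += weight
        for i, x in enumerate(embeddings[term]):
            pooled[i] += weight * x
    if total_weight == 0.0:
        return pooled  # no known tokens: fall back to the zero vector
    return [x / total_weight for x in pooled]


# Toy 3-d vectors and IDF values (hypothetical, not real GloVe data).
embeddings = {"the": [1.0, 0.0, 0.0], "refund": [0.0, 1.0, 0.0]}
idf = {"the": 1.0, "refund": 3.0}

vec = tfidf_weighted_embedding(["the", "refund"], embeddings, idf, dim=3)
empty = tfidf_weighted_embedding(["unknown"], embeddings, idf, dim=3)
```

With weights 1 and 3, the pooled vector is [0.25, 0.75, 0.0], so the rarer word "refund" dominates. Plain mean pooling would give [0.5, 0.5, 0.0] and treat both words as equally informative.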

👤 Author

Built as part of an academic NLP assignment on Information Retrieval systems.

About

HSRIS (Hybrid Semantic Retrieval & Intelligence System) is an end-to-end, high-performance NLP pipeline built from the ground up for intelligent customer support ticket retrieval.
