This project covers a full-stack Machine Learning pipeline, progressing from a traditional NLP baseline model (TF-IDF + Logistic Regression) to a modern Transformer-based model (DistilBERT) for sentiment classification. The final model is containerized using Docker and deployed via a Flask API for real-time inference.
The goal is to build and deploy a robust sentiment classification service that determines whether an IMDB movie review is "Positive" or "Negative."
The project follows a three-stage development process:
- Exploration & Baseline: EDA to understand the word-count distribution and the most frequent words, followed by baseline modeling (TF-IDF + Logistic Regression) to establish a performance reference.
- Advanced Modeling: Fine-tuning a deep learning model (DistilBERT).
- Deployment: Containerizing the solution for portability and reliable API serving.
- Source: IMDB Dataset of 50k Movie Reviews
- Subset Used: 10,000 samples were used for the baseline model to ensure fast iteration.
- Target: Binary Classification (`0` for Negative, `1` for Positive).
- Data/ML: `pandas`, `numpy`, `scikit-learn`
- NLP/Deep Learning: `transformers`, `torch`, `datasets` (for efficient handling of large text data)
- Deployment: `Flask`, `Docker`
- EDA: `matplotlib`, `seaborn`, `nltk`
- Class Balance: The data showed a near-perfect 50/50 balance between Positive and Negative classes. Stratified sampling was still applied, and F1-score and accuracy were used as fair metrics.
- Sequence Length: Word counts follow a long-tail distribution, with the 95th percentile around 590 words.
  - Decision: Given the 512-token limit of the standard BERT architecture, `MAX_LEN` was set to 128 to strike a balance between training efficiency and information capture.
- File: `text_eda.ipynb`
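The sequence-length check can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the notebook's code: `word_count_percentile` and `share_longer_than` are hypothetical helpers, whitespace splitting is an assumed way of counting words, and the nearest-rank percentile is only one of several common definitions. Word counts also only approximate token counts, since a subword tokenizer usually emits more tokens than words.

```python
import math

def word_count_percentile(reviews, pct=95):
    """Nearest-rank percentile of whitespace-delimited word counts."""
    if not reviews:
        raise ValueError("reviews must not be empty")
    counts = sorted(len(text.split()) for text in reviews)
    # Nearest-rank: smallest count with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct * len(counts) / 100))
    return counts[rank - 1]

def share_longer_than(reviews, limit):
    """Fraction of reviews whose word count exceeds `limit`."""
    return sum(len(text.split()) > limit for text in reviews) / len(reviews)

# Toy stand-in for the review column (illustrative data only).
sample_reviews = ["great movie"] * 19 + ["a very long and winding review " * 30]

p95 = word_count_percentile(sample_reviews, pct=95)
truncated_share = share_longer_than(sample_reviews, limit=128)
```

Running the second helper with `limit=128` gives a rough idea of how many reviews would be truncated at the chosen `MAX_LEN`.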
- File: `logRrg_TFIDF.py`
| Technique | Parameter/Reasoning |
|---|---|
| Feature Extraction | TF-IDF Vectorizer (min_df=5, ngram_range=(1, 2), stop_words='english') |
| Model | Logistic Regression (solver='liblinear', random_state=42) |
| Data Handling | Stratified Sampling (stratify=y) |
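The settings in the table above map onto a scikit-learn pipeline roughly as follows. This is a minimal sketch rather than the contents of `logRrg_TFIDF.py`: the `build_baseline` helper, the toy corpus, and the 80/20 split are assumptions made for illustration; only the vectorizer options, solver, seed, and stratification come from the table.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def build_baseline():
    """TF-IDF features followed by logistic regression, as listed in the table."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words="english")),
        ("clf", LogisticRegression(solver="liblinear", random_state=42)),
    ])

# Toy corpus (illustrative only); repeated so every term clears min_df=5.
texts = ["wonderful acting and a great story"] * 20 + ["boring plot and terrible acting"] * 20
labels = [1] * 20 + [0] * 20

# Stratified split keeps the 50/50 class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = build_baseline().fit(X_train, y_train)
predictions = model.predict(X_test)
test_f1 = f1_score(y_test, predictions)
```

Wrapping the vectorizer and classifier in one `Pipeline` means the TF-IDF vocabulary is fitted on the training split only, which avoids leaking test data into the features.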
| Metric | Negative (0) | Positive (1) | Overall |
|---|---|---|---|
| Precision | - | ||
| Recall | - | ||
| F1-Score | |||
| Accuracy | - | - |
Conclusion: The baseline model performed exceptionally well, achieving
- Files: `src/DistilBERT_base`
| Component | Technology | Purpose | Key Files |
|---|---|---|---|
| Model | DistilBERT (Base Uncased) | Fast, light, and high-performing NLP model for sequence classification. | `model.py`, `train.py` |
| Training | PyTorch, HuggingFace Transformers, AdamW | Custom training loop with advanced scheduler and optimization. | `engine.py`, `train.py` |
| Serving | Flask API | Provides a low-latency REST API endpoint for real-time predictions. | `api.py` |
| Deployment | Docker | Containerizes the application, dependencies, and model for consistent, environment-agnostic deployment. | `Dockerfile`, `config.py` |
- Hyperparameters: Key parameters were `MAX_LEN=128`, `TRAIN_BATCH_SIZE=16` (limited by VRAM), and, most critically, `LEARNING_RATE=3e-5`.
- Reasoning:
  - Transformers learn rich contextual embeddings, offering superior performance over traditional feature engineering, especially for complex sentiment nuances.
  - DistilBERT was chosen over the full BERT-Base model due to its efficiency. It is 40% smaller, 60% faster, and retains approximately 97% of BERT's language understanding capabilities, making it ideal for low-latency, production-level serving in a containerized environment.
- Key Fine-Tuning Techniques:
  - The fine-tuning process involved adding a simple linear classification head on top of the DistilBERT encoder.
    - Structure: [CLS] Output (768 features) $\rightarrow$ Dropout Layer $\rightarrow$ Linear Layer (Output: 1 Logit) $\rightarrow$ Sigmoid (for probability at inference).
  - Optimizer: Used `AdamW` (Adam with the weight decay fix), the standard choice for Transformer models.
  - Learning Rate Schedule: Implemented a linear scheduler with warmup to stabilize initial training and ensure robust convergence.
  - The weights of the entire pre-trained DistilBERT encoder were unfrozen (`param.requires_grad = True`) to allow the model to fully adapt to the sentiment task, maximizing its classification accuracy.
- Expected Performance: The F1-score was expected to exceed **90%**.
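The linear schedule with warmup listed under the fine-tuning techniques can be written as a simple multiplier on the base learning rate. The sketch below is a framework-free illustration, not the project's `train.py`: `linear_warmup_multiplier` is a hypothetical helper, and `TOTAL_STEPS` and `WARMUP_STEPS` are assumed values. The shape matches what Hugging Face Transformers provides through `get_linear_schedule_with_warmup`, though the source does not state which implementation was used.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 back to 0 at the final step.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

LEARNING_RATE = 3e-5   # value listed in the hyperparameters
TOTAL_STEPS = 1000     # assumed for illustration
WARMUP_STEPS = 100     # assumed for illustration

schedule = [
    LEARNING_RATE * linear_warmup_multiplier(step, WARMUP_STEPS, TOTAL_STEPS)
    for step in range(TOTAL_STEPS + 1)
]
```

The small initial learning rate limits how far the first noisy updates can move the pre-trained weights, which is the stabilizing effect the warmup is meant to provide.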
The final fine-tuned model weights were integrated into a production-ready API:
- Framework: Flask was used to create a simple `/predict` endpoint.
- Inference Pipeline: The API loads the BERT model and tokenizer once at startup, accepts review text via a POST request, and returns the predicted sentiment and confidence score.
- Containerization: A Dockerfile was used to package the Python environment, dependencies, Flask application (`api.py`), and the trained model (`model.bin`) into a single portable image.
- Key Path: All deployment assets were mapped to the internal container path `/root/docker_data/`, ensuring environment independence.
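Since the classification head emits a single logit, the last step of the inference pipeline is turning that logit into a sentiment label and a confidence score. The following is a minimal sketch of one way to do that, not the code in `api.py`: `logit_to_response`, the 0.5 threshold, the three-decimal rounding, and the definition of confidence as the probability of the predicted class are all assumptions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def logit_to_response(logit, threshold=0.5):
    """Map a raw model logit to a sentiment label and confidence score."""
    probability = sigmoid(logit)  # probability of the Positive class
    is_positive = probability >= threshold
    # Assumed convention: confidence is the probability of the predicted class.
    confidence = probability if is_positive else 1.0 - probability
    return {
        "sentiment": "Positive" if is_positive else "Negative",
        "confidence": round(confidence, 3),
    }
```

In a Flask handler, a dictionary like this could be passed to `jsonify` to build the JSON response.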
- VRAM Management: The greatest hurdle was that training large models like DistilBERT requires significant GPU VRAM. To work around this, I used gradient accumulation (`ACCUMULATION_STEPS`) over batches, implemented in `engine.py`, which allowed stable training convergence without hitting VRAM limits.
- Robust Pathing and Configuration: I designed path-finding logic in `config.py` using Python's `pathlib.Path` and a Docker environment variable (`IN_DOCKER=true`), so that the model can find its files correctly no matter where it runs (a local path or the `/app` Docker WORKDIR).
- Dependency and Deployment Consistency: The Dockerfile pins the application to the `python:3.9-slim` base image and explicitly defines all dependencies in `requirements.txt`. This keeps the execution environment consistent from training to production.
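The path-finding idea described above can be sketched with `pathlib.Path` and the `IN_DOCKER` variable. This is an illustration under stated assumptions, not the actual `config.py`: `resolve_base_dir` and its parameters are hypothetical, and the fallback to the current working directory is a guess at what "local path" means.

```python
import os
from pathlib import Path

def resolve_base_dir(environ=None, local_root=None):
    """Pick the directory that holds the model and other assets."""
    environ = os.environ if environ is None else environ
    in_docker = environ.get("IN_DOCKER", "").strip().lower() == "true"
    if in_docker:
        # Inside the container, files live under the Docker WORKDIR.
        return Path("/app")
    if local_root is not None:
        return Path(local_root)
    # Assumed local fallback: wherever the process was started.
    return Path.cwd()

# model.bin is the file name mentioned in the deployment section.
MODEL_PATH = resolve_base_dir() / "model.bin"
```

Passing the environment in as a parameter keeps the function easy to test without modifying the real process environment.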
Because this project uses Docker to package the fine-tuned BERT model and Flask API, you can launch the complete containerized service locally with just two commands:
- Prerequisite: Docker installed and running on your system.
- Build the Docker Image (execute this command in the project root directory where the `Dockerfile` resides):
  `docker build -t sentiment-bert-api .`
- Run the Container (this launches the API and maps the container's internal port 5000 to your local machine's port 8080):
  `docker run -d -p 8080:5000 --name sentiment_service sentiment-bert-api`
- Test the Endpoint (use `curl` or Postman to send a test review):
  `curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"text": "This movie was an absolute masterpiece, the best film I have seen all year!"}'`
Expected Output:
{"confidence": 0.998, "sentiment": "Positive", "model": "bert-base-uncased"}
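The same request can be sent from Python using only the standard library. This is a minimal client sketch: `build_predict_request` and `predict` are hypothetical helper names, and the base URL assumes the port mapping used in the `docker run` command above.

```python
import json
import urllib.request

def build_predict_request(text, base_url="http://localhost:8080"):
    """Build the POST request for the /predict endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(text, base_url="http://localhost:8080", timeout=10):
    """Send a review to the running service and return the decoded JSON reply."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `predict("This movie was an absolute masterpiece!")` should return a dictionary with the same keys as the expected output above.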