This repository contains a lightweight, efficient text classification system designed to detect signs of depression primarily from text data. It achieves this using a classical Natural Language Processing (NLP) approach, leveraging Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and a Logistic Regression model for classification.
Unlike computationally expensive deep learning models, this system provides a rapid and relatively accurate baseline for text classification tasks. The system is split into two main components:
- Training Script (Training.py): Processes the dataset, vectorizes the text using TF-IDF, trains a Logistic Regression classifier, and saves the trained model and vectorizer for future use.
- Prediction Script (predict.py): Loads the pre-trained model and vectorizer to evaluate new text inputs, providing both a classification (Depressed / Not Depressed) and a confidence probability.
- Efficient Text Vectorization: Utilizes TF-IDF with up to 5,000 features (unigrams and bigrams) to capture meaningful word combinations.
- Logistic Regression Classifier: Employs a classical, easily interpretable machine learning algorithm.
- Pre-processing Integration: Handles basic text cleaning (lowercasing, punctuation removal) during inference.
- Interactive & CLI Modes: The prediction script supports both command-line arguments and an interactive prompt.
Python 3.x is required. It is highly recommended to use a virtual environment. Install the necessary dependencies via pip:
$ pip install pandas scikit-learn joblib numpy

To train the model from scratch, you must provide a dataset named depression_dataset.csv in the root directory. The dataset should contain at least the following two columns:
- clean_text: The textual data to be analyzed.
- is_depression: The target label (1 for depressed, 0 for not depressed).
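Before training, it can help to confirm that the file has the expected shape. The sketch below is one possible check, not part of the repository: `load_dataset` is a hypothetical helper, and dropping empty rows and casting the label to an integer are assumptions about sensible cleanup rather than documented behavior of Training.py. An in-memory CSV stands in for depression_dataset.csv.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["clean_text", "is_depression"]


def load_dataset(source):
    """Read the CSV and return only the two required columns (hypothetical helper)."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    # Assumption: rows with an empty text or label are dropped,
    # and the label is stored as an integer (1 or 0).
    return df[REQUIRED_COLUMNS].dropna().astype({"is_depression": int})


# In-memory stand-in for depression_dataset.csv, with made-up rows.
sample_csv = io.StringIO(
    "clean_text,is_depression\n"
    "i feel hopeless and tired,1\n"
    "had a nice walk today,0\n"
)
df = load_dataset(sample_csv)
print(df["is_depression"].value_counts())
```

With the real file, the call would be `load_dataset("depression_dataset.csv")`.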
Run the training script to build the model based on your dataset:
$ python Training.py

This script splits the data, fits the TF-IDF vectorizer and the Logistic Regression model, prints the accuracy and a classification report, and saves two files relative to the script:
- depression_model.pkl
- vectorizer.pkl
(Note: If the scripts save to or load from a models directory, make sure that directory exists, and that the save paths in Training.py match the load paths in predict.py.)
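The training flow described above (split, vectorize, fit, report, save) might look roughly like the following. This is a sketch under stated assumptions, not the contents of Training.py: the eight rows are made up, the split ratio, `random_state`, and `max_iter` are illustrative choices, and the files are written to a temporary directory rather than next to the script.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Made-up rows standing in for depression_dataset.csv.
df = pd.DataFrame({
    "clean_text": [
        "i feel sad and lonely", "nothing matters anymore",
        "i cannot stop crying", "i feel empty inside",
        "what a great day", "i love my new job",
        "dinner with friends was fun", "excited for the weekend",
    ],
    "is_depression": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Assumed split: 25% held out, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["is_depression"],
    test_size=0.25, random_state=42, stratify=df["is_depression"],
)

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))

# Save both artifacts; a temporary directory is used here for illustration.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "depression_model.pkl")
vectorizer_path = os.path.join(out_dir, "vectorizer.pkl")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)
```

Saving the vectorizer alongside the model matters because the classifier's coefficients are tied to the column order of the fitted vocabulary; a freshly fitted vectorizer would produce incompatible features.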
Once the model is trained and saved, you can use the prediction script to classify new sentences. You can run it interactively:
$ python predict.py
Enter text: I am feeling so sad and lonely
Prediction: Depressed
Confidence: 0.8421

Or you can pass the text directly as command-line arguments:
$ python predict.py I am feeling so sad and lonely
Prediction: Depressed
Confidence: 0.8421

- Text Preprocessing: During inference, the system automatically converts text to lowercase and strips non-alphabetic characters.
- Threshold Adjustment: The prediction script uses a custom threshold (0.3 probability) instead of the usual 0.5 to favor recall, making it more sensitive to signs of depression at the cost of more false positives.
- Artifact Generation: Uses joblib to serialize the trained model and vectorizer, so that predictions use exactly the vocabulary learned during training.
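The preprocessing and threshold notes above can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions rather than documented behavior of predict.py: non-alphabetic characters are replaced with spaces (rather than deleted outright), and a probability exactly equal to the threshold counts as Depressed.

```python
import re

THRESHOLD = 0.3  # decision threshold stated in this README


def clean_text(text):
    """Lowercase the text and strip non-alphabetic characters."""
    text = text.lower()
    # Assumption: non-letters become spaces so adjacent words do not merge.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


def label_from_probability(p_depressed, threshold=THRESHOLD):
    """Map the model's probability for the positive class to a label."""
    return "Depressed" if p_depressed >= threshold else "Not Depressed"


print(clean_text("I'm SO sad... 24/7!"))
print(label_from_probability(0.35))                 # above the 0.3 threshold
print(label_from_probability(0.35, threshold=0.5))  # same score, default-style cutoff
```

A score of 0.35 is labeled Depressed at the 0.3 threshold but Not Depressed at 0.5, which is the recall-oriented bias the note describes.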
Please refer to the repository's licensing information for usage rights and restrictions.