Adarsh-Aravind/TF_IDF-Depression-Detection

Depression Detection using TF-IDF & Machine Learning

This repository contains a lightweight text classification system that detects signs of depression in text. It uses a classical Natural Language Processing (NLP) approach: Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and a Logistic Regression model for classification.

Overview

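Unlike computationally expensive deep learning models, this system trains and predicts quickly, which makes it a practical baseline for text classification tasks. Training.py prints the accuracy and a classification report, so the baseline can be measured on your own data. The system is split into two main components: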

  1. Training Script (Training.py): Processes the dataset, vectorizes the text using TF-IDF, trains a Logistic Regression classifier, and saves the trained model and vectorizer for future use.
  2. Prediction Script (predict.py): Loads the pre-trained model and vectorizer to evaluate new text inputs, providing both a classification (Depressed / Not Depressed) and a confidence probability.

Key Features

  • Efficient Text Vectorization: Utilizes TF-IDF with up to 5,000 features (unigrams and bigrams) to capture meaningful word combinations.
  • Logistic Regression Classifier: Employs a classical, easily interpretable machine learning algorithm.
  • Pre-processing Integration: Applies basic text cleaning (lowercasing and removal of non-alphabetic characters such as punctuation) during inference.
  • Interactive & CLI Modes: The prediction script supports both command-line arguments and an interactive prompt.

Prerequisites

Python 3.x is required. It is highly recommended to use a virtual environment. Install the necessary dependencies via pip:

pip install pandas scikit-learn joblib numpy

Getting Started

1. Dataset Requirements

To train the model from scratch, you must provide a dataset named depression_dataset.csv in the root directory. The dataset should contain at least the following two columns:

  • clean_text: The textual data to be analyzed.
  • is_depression: The target label (1 for depressed, 0 for not depressed).
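
Before training, it can help to confirm that the CSV has the expected shape. The sketch below is an assumption about how such a check might look. The helper `load_dataset` is hypothetical and not part of the repository. It reads an in-memory sample instead of depression_dataset.csv so that it runs on its own.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["clean_text", "is_depression"]


def load_dataset(source):
    """Hypothetical helper: read the CSV and keep valid rows of the two required columns."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Drop rows with empty text or an empty label.
    df = df[REQUIRED_COLUMNS].dropna().copy()
    # Keep only rows whose label is 0 or 1.
    df = df[df["is_depression"].isin([0, 1])].copy()
    df["is_depression"] = df["is_depression"].astype(int)
    return df.reset_index(drop=True)


# Invented sample rows; the last row has empty text and is dropped.
sample = io.StringIO(
    "clean_text,is_depression\n"
    "i feel hopeless and tired,1\n"
    "great day at the park,0\n"
    ",1\n"
)
df = load_dataset(sample)
print(len(df))
print(df["is_depression"].tolist())
```

To check the real file, pass the path instead: `load_dataset("depression_dataset.csv")`.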

2. Training the Model

Run the training script to build the model based on your dataset:

python Training.py

This script splits the data, fits the TF-IDF vectorizer, trains the Logistic Regression model, prints the accuracy and a classification report, and saves two files at paths relative to the script:

  • depression_model.pkl
  • vectorizer.pkl

(Note: predict.py must load these files from the same location where Training.py saves them. If the scripts write to a models directory, make sure that directory exists before training.)
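
The training steps described above can be sketched roughly as below. This is not the contents of Training.py. The function `train_and_save`, the split ratio, `random_state`, `max_iter`, and the toy sentences are all assumptions made for illustration. Only the TF-IDF settings and the two output file names come from this README.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def train_and_save(texts, labels, out_dir):
    """Hypothetical helper: split, vectorize, train, report, and save both artifacts."""
    # test_size and random_state are assumptions, not values from the repository.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=42, stratify=labels
    )
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)  # learn vocabulary on training data only
    X_test_vec = vectorizer.transform(X_test)        # reuse that vocabulary for the test split

    model = LogisticRegression(max_iter=1000)        # max_iter is an assumption
    model.fit(X_train_vec, y_train)

    y_pred = model.predict(X_test_vec)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))

    joblib.dump(model, os.path.join(out_dir, "depression_model.pkl"))
    joblib.dump(vectorizer, os.path.join(out_dir, "vectorizer.pkl"))
    return model, vectorizer


# Invented toy data, far too small for a meaningful model.
texts = [
    "i feel hopeless and empty",
    "nothing matters anymore",
    "i cannot stop crying at night",
    "so tired of feeling alone",
    "had a great lunch with friends",
    "excited about the weekend trip",
    "the weather is lovely today",
    "finished my project and feel proud",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

out_dir = tempfile.mkdtemp()  # temporary directory so the sketch leaves no files behind
model, vectorizer = train_and_save(texts, labels, out_dir)
saved = sorted(os.listdir(out_dir))
print(saved)
```

Fitting the vectorizer on the training split only, then calling `transform` on the test split, keeps test vocabulary from leaking into training.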

3. Making Predictions

Once the model is trained and saved, you can use the prediction script to classify new sentences. You can run it interactively:

$ python predict.py
Enter text: I am feeling so sad and lonely
Prediction: Depressed
Confidence: 0.8421

Or you can pass the text directly as command-line arguments:

$ python predict.py I am feeling so sad and lonely
Prediction: Depressed
Confidence: 0.8421
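
The two input modes shown above could be handled along these lines. This is a sketch under assumptions: `get_input_text` is a hypothetical name, and the actual argument handling in predict.py may differ. The `prompt` parameter exists only so the interactive path can be exercised without a terminal.

```python
import sys


def get_input_text(argv, prompt=input):
    """Hypothetical helper: take text from CLI arguments, else ask interactively."""
    if len(argv) > 1:
        # All words after the script name are joined into one sentence.
        return " ".join(argv[1:]).strip()
    return prompt("Enter text: ").strip()


# CLI mode: the shell splits the sentence into separate arguments.
cli_text = get_input_text(
    ["predict.py", "I", "am", "feeling", "so", "sad", "and", "lonely"]
)
print(cli_text)

# Interactive mode, simulated with a stand-in for input().
interactive_text = get_input_text(["predict.py"], prompt=lambda _: " I feel fine \n")
print(interactive_text)
```

In a real script the call would be `get_input_text(sys.argv)`.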

Workflow Details

  1. Text Preprocessing: During inference, the system automatically converts text to lowercase and strips non-alphabetic characters.
  2. Threshold Adjustment: The prediction script uses a custom probability threshold of 0.3 instead of the usual 0.5. Text is flagged as Depressed at a lower predicted probability, which raises recall (fewer missed cases) at the cost of more false positives.
  3. Artifact Generation: The trained model and the fitted vectorizer are serialized with joblib, so predictions use the same vocabulary and TF-IDF weights that were learned during training.
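
The preprocessing and threshold steps above can be sketched as two small functions. The names `clean_text` and `label_from_probability` are hypothetical. Two details are assumptions rather than facts from the repository: removed characters are replaced with a space instead of being deleted, and a probability exactly equal to the threshold counts as Depressed.

```python
import re

THRESHOLD = 0.3  # value stated in this README


def clean_text(text):
    """Lowercase the text and strip non-alphabetic characters."""
    text = text.lower()
    # Assumption: removed characters become spaces so neighbouring words stay separate.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def label_from_probability(p_depressed, threshold=THRESHOLD):
    """Map the predicted probability of the depressed class to a label."""
    # Assumption: the comparison is inclusive (>=).
    return "Depressed" if p_depressed >= threshold else "Not Depressed"


print(clean_text("I'm SO tired... 24/7!"))  # i m so tired
print(label_from_probability(0.35))         # Depressed
print(label_from_probability(0.25))         # Not Depressed
```

A probability of 0.35 would be labeled Not Depressed under a 0.5 cutoff, which shows how the lower threshold makes the classifier more sensitive.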

License

Please refer to the repository's licensing information for usage rights and restrictions.
