This project aims to create an unsupervised model that classifies job listings into their respective departments based solely on the "Job Description" column. This approach can be particularly useful for organizing unstructured job data into coherent categories.
The dataset used for this project can be found on Kaggle: Booking.com Jobs EDA & NLP Ensemble Modeling
- Perform text pre-processing steps on the "Job Description" column and explain the utility of each step in the context of this task.
- Identify the number of natural clusters present in the data.
- Train an unsupervised model to classify the jobs into their respective departments using only the "Job Description" column.
- Identify key words from each cluster that are indicative of the department.
- Deploy the trained model in two different ways, including as a REST API endpoint.
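To make the cluster-identification step concrete, here is a minimal sketch using scikit-learn (assumed available via `requirements.txt`): candidate values of k are scored with the silhouette coefficient over a TF-IDF representation. The toy corpus and the k range are invented for illustration, not taken from the dataset.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus standing in for the "Job Description" column
docs = [
    "build scalable backend services in python",
    "develop APIs and microservices for the platform",
    "design marketing campaigns and brand strategy",
    "manage social media content and advertising",
    "analyze customer data and build dashboards",
    "create reports and data pipelines for analytics",
]

X = TfidfVectorizer().fit_transform(docs)

# Pick k with the best silhouette score over a small candidate range
best_k, best_score = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k} (silhouette = {best_score:.3f})")
```

On the real dataset the same loop would run over a wider range of k, and the elbow of the inertia curve or topic coherence (for BERTopic) could be checked alongside the silhouette score.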
- Python 3.7 or higher
- Required Python packages (listed in `requirements.txt`)
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/department-classifier.git
  cd department-classifier
  ```

- Set up a virtual environment and activate it (optional but recommended):

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows, use "venv\Scripts\activate"
  ```

- Install the required packages:

  ```shell
  pip install -r requirements.txt
  ```
- Create a `.env` file in the root directory of the project. This file should contain any environment variables required for the application. For example:

  ```shell
  FLASK_APP=app.py
  FLASK_ENV=development
  ```

- Ensure the `.env` file is correctly configured with all necessary values. The required environment variables are listed in the `.env_example` file.
- To start the REST API server, run:

  ```shell
  flask run  # or simply: python app_kmeans.py
  ```
The application should now be running at http://127.0.0.1:6000/.
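With the server running, the endpoint can also be called from Python. This is a hedged sketch using only the standard library; the field names follow the payload format documented here, but the exact URL and route are assumptions to adjust to your deployment.

```python
import json
from urllib import request

API_URL = "http://127.0.0.1:6000/"  # assumed endpoint; adjust to the deployed route


def classify_job(description, url=API_URL):
    """POST a job description to the running API and return the parsed JSON reply."""
    payload = json.dumps({"job_description": description}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires the Flask server to be running
        return json.loads(resp.read().decode("utf-8"))


# Example usage (with the server running):
# result = classify_job("We are hiring a data engineer to build pipelines...")
# print(result["department"], result["keywords"])
```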
The input payload should be in the following format:

```json
{
  "job_description": "Your job description"
}
```

The expected output will be in the following format:

```json
{
  "department": "Engineering",
  "keywords": [
    "work",
    "experi",
    "product",
    "world",
    "travel",
    "develop",
    "manag",
    "team",
    "opportun",
    "data"
  ]
}
```

- job_description: The job description text to be classified.
- department: The department to which the job description is assigned.
- keywords: A list of key words/phrases indicative of the department. These keywords are extracted from the job description and reflect common terms associated with the identified department. Note that they appear in stemmed form (e.g. "experi", "opportun"), as stemming is part of the pre-processing pipeline.
The following tasks were performed in this repository:

- Text Pre-Processing: Steps such as tokenization, stop word removal, stemming, and lemmatization were performed on the "Job Description" column to prepare the text data for clustering.
- Cluster Identification: Various clustering methods were explored to identify natural clusters in the data.
- Model Training: Three clustering methods were tried: KMeans, DBSCAN, and BERTopic. Among these, BERTopic produced significantly more coherent and meaningful clusters.
- Keyword Identification: Key words from each cluster were identified to indicate the department, which helps in understanding and labeling the clusters effectively.
- Model Deployment: The trained model was deployed as a REST API endpoint, with deployment guidelines provided.
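The pre-processing step above can be illustrated with a small pure-Python sketch. The stop-word list and suffix-stripping rules here are deliberately tiny placeholders; the actual pipeline would typically use NLTK's tokenizers, stop-word corpus, and `PorterStemmer` (which produces stems like "experi" and "opportun" seen in the API output).

```python
import re

# Tiny illustrative stop-word set; a real pipeline would use NLTK's corpus
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "on", "with"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and crudely stem a job description."""
    # Lowercase and tokenize on alphabetic runs
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude suffix stripping as a stand-in for Porter stemming
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ment", "ence", "ers", "er", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```

For example, `preprocess("Developing products with the team")` yields `["develop", "product", "team"]`; the same normalized tokens feed the TF-IDF and clustering stages.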
For any questions or further assistance, please open an issue on the GitHub repository or contact the maintainer at sumedh.bhalerao07@gmail.com.