GitHub - Kavayk29/Quora-Duplicate-Question-Pair: This project improves information retrieval by detecting duplicate question pairs in the Quora dataset using data exploration, text preprocessing, feature engineering, and models like Random Forest and LSTM, aiming to streamline question-answering.

Project Overview

This project aims to detect duplicate question pairs in the Quora dataset. By identifying similar questions, the system can help streamline the question-answering process and improve the efficiency of information retrieval on the platform.

Key Features:
Data Exploration: Load and explore the Quora dataset to understand its structure and characteristics.
Text Preprocessing: Clean and preprocess the text data, including the removal of HTML tags and special characters.
Feature Engineering: Extract features from the text data to improve the model's ability to detect duplicate questions.
Modeling: Apply machine learning models such as Random Forest and XGBoost, as well as deep learning models like LSTM and BiLSTM, to predict duplicate question pairs.
Evaluation: Assess model performance using metrics like accuracy, precision, and recall.

Technologies Used:
Python: Core language for data processing and modeling.
Pandas: For handling and manipulating data structures.
NumPy: For numerical operations and array management.
Seaborn & Matplotlib: For data visualization and analysis.
BeautifulSoup: For text cleaning and preprocessing.

How to Use:
Load the Dataset: Begin by loading the Quora dataset using the provided code.
Preprocess the Data: Clean and prepare the text data for modeling.
Train the Models: Use the provided scripts to train and evaluate different models on the dataset.
Analyze Results: Review the model performance metrics and visualizations to understand the results.

Conclusion:
This project covers the full workflow for detecting duplicate questions on Quora: data preprocessing, feature engineering, and both classical and deep learning models. Together, these steps support better information retrieval on the platform.
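
The preprocessing, feature engineering, and evaluation steps described above could look roughly like the sketch below. It is not taken from the repository: the function names (clean_question, pair_features, precision_recall_accuracy) and the specific features are hypothetical, chosen only to illustrate the idea. To stay dependency-free, it strips HTML tags with a regular expression, whereas the project lists BeautifulSoup for that step.

```python
import html
import re


def clean_question(text: str) -> str:
    """Lowercase a question, strip HTML tags and special characters."""
    text = html.unescape(text)                         # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)               # regex stand-in for BeautifulSoup
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # drop special characters
    return " ".join(text.split())                      # collapse whitespace


def pair_features(q1: str, q2: str) -> dict:
    """Hypothetical hand-crafted features for one question pair."""
    w1 = set(clean_question(q1).split())
    w2 = set(clean_question(q2).split())
    common = len(w1 & w2)           # unique words shared by both questions
    total = len(w1) + len(w2)       # unique words in q1 plus unique words in q2
    return {
        "common_words": common,
        "total_words": total,
        "word_share": common / total if total else 0.0,
        "len_diff": abs(len(w1) - len(w2)),
    }


def precision_recall_accuracy(y_true, y_pred):
    """Binary metrics, where label 1 means 'duplicate'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    print(clean_question("<b>What's</b> Python &amp; AI?"))
    print(pair_features("How do I learn Python?",
                        "How can I learn Python quickly?"))
    print(precision_recall_accuracy([1, 0, 1, 1], [1, 1, 0, 1]))
```

Feature dictionaries like these can be collected into a Pandas DataFrame and passed to a classifier such as Random Forest. Sequence models like LSTM would instead consume the cleaned token sequences directly.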

Folders and files (25 commits):
code
dataset
README.md
requirements.txt
