Welcome to Unlocking Yelp with AI, where we leverage the power of machine learning and artificial intelligence to analyze Yelp business data and predict ratings. Dive in to explore how data science can help unlock insights into customer reviews and business success. 🌟
This project aims to predict Yelp business ratings using machine learning models such as Random Forest and K-Nearest Neighbors (KNN). We use Yelp's extensive dataset, perform Exploratory Data Analysis (EDA), clean the data, and build predictive models to understand what makes a business stand out.
- Predict Yelp business ratings based on customer reviews and features.
- Analyze key factors contributing to business ratings.
- Visualize the distribution and trends in Yelp data.
- data/: Contains raw and processed data files.
  - raw/: Original Yelp dataset files (not tracked).
  - processed/: Cleaned and preprocessed data files.
- notebooks/: Jupyter notebooks for EDA, feature engineering, and feature selection.
- models/: Stored machine learning model.
- visualizations/: Images and HTML files of visualizations.
- README.md: Project overview and documentation (you're here!).
Make sure you have Python 3.8+ installed. You'll also need to install the required packages listed in requirements.txt:
pip install -r requirements.txt
Clone the Repository:
git clone https://github.com/your-username/Unlocking-Yelp-with-AI.git
cd Unlocking-Yelp-with-AI
Install Dependencies:
pip install -r requirements.txt
Explore the Data:
- Run the notebooks in the notebooks/ folder to explore the data and perform EDA.
The dataset used for this project is the Yelp Academic Dataset, which contains information about businesses, reviews, and users. Due to its size, the raw data files are not included in the repository, but you can download them directly from Yelp's website or access the processed files provided.
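Because the raw files are large, it can help to stream them instead of loading everything at once. The sketch below assumes the business file is newline-delimited JSON (one record per line), as in the public Yelp Academic Dataset release. The helper name `iter_businesses` and the selected field names are illustrative assumptions, not code taken from this repository.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Sequence

# Assumed field names, based on the public Yelp business file; adjust as needed.
DEFAULT_FIELDS = ("business_id", "name", "city", "stars", "review_count")


def iter_businesses(path, fields: Sequence[str] = DEFAULT_FIELDS) -> Iterator[Dict]:
    """Yield one dict per business, keeping only the requested fields.

    Reading line by line keeps memory use low even for very large files.
    Fields missing from a record are returned as None.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield {key: record.get(key) for key in fields}
```

For example, `list(iter_businesses("data/raw/yelp_academic_dataset_business.json"))` would collect the records into a list that can be passed to `pandas.DataFrame`. The path shown is a hypothetical location inside data/raw/.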
We have created several visualizations to help understand the data better, including:
- Geographical maps showing business distribution.
- Rating trends over time.
- Feature correlations to understand what factors influence ratings the most.
Visualizations can be found in the visualizations/ directory.
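The feature-correlation view mentioned above can be approximated with a few lines of Pandas. This is a minimal sketch, not the notebook code: the function name `rank_feature_correlations` is hypothetical, and the target column name `stars` is an assumption based on the Yelp business file.

```python
import pandas as pd


def rank_feature_correlations(df: pd.DataFrame, target: str = "stars") -> pd.Series:
    """Return the Pearson correlation of each numeric feature with the target,
    ordered from strongest to weakest absolute correlation."""
    numeric = df.select_dtypes(include="number")  # drop text columns such as names
    correlations = numeric.corr()[target].drop(target)
    return correlations.sort_values(key=lambda s: s.abs(), ascending=False)


# Tiny made-up example; the values are for illustration only.
example = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d", "e"],
        "review_count": [10, 20, 30, 40, 50],
        "noise": [3, 1, 4, 1, 5],
        "stars": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
ranking = rank_feature_correlations(example)
```

The resulting Series can be passed to a Seaborn or Matplotlib bar plot. Note that correlation only captures linear association, so a weakly correlated feature may still be useful to a non-linear model.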
We use machine learning models like Random Forest and K-Nearest Neighbors (KNN) to predict business ratings. Model training and evaluation can be found in the notebooks/03_Modeling.ipynb notebook.
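As a rough illustration of how the two model families can be compared, the sketch below fits a Random Forest and a KNN model with Scikit-learn and reports the mean absolute error on a held-out split. Several things here are assumptions rather than details from the notebook: ratings are treated as a regression target, the features are synthetic stand-ins, and the helper name `compare_models` and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, random_state=0):
    """Fit both models on a train split and return test-set MAE per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    models = {
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=random_state
        ),
        # KNN is distance-based, so features are standardized first.
        "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (NOT Yelp data): three features, ratings clipped to 1-5.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.clip(
    3.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300), 1.0, 5.0
)
scores = compare_models(X, y)
```

With real data, `X` would hold the engineered features and `y` the business star ratings. If ratings are instead modeled as discrete classes, `RandomForestClassifier` and `KNeighborsClassifier` would be the corresponding Scikit-learn estimators.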
- Python: Core programming language.
- Pandas and NumPy: For data manipulation and processing.
- Scikit-learn: For machine learning models.
- Matplotlib and Seaborn: For data visualization.
- GeoPandas: For geographical data representation.
- SciPy: For scientific and technical computing.
We welcome contributions! Feel free to fork this repository, create a branch, and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
If you have any questions or feedback, please feel free to reach out via email at victoriiavu@g.ucla.edu.
- Yelp for providing the dataset.
- Scikit-learn and Pandas for their powerful data processing and machine learning tools.
If you found this project interesting, don't forget to star ⭐ the repository!