Welcome to the Sentiment Analysis using YouTube repository! 🎉
This project is a collaborative initiative brought to you by SuperDataScience, a thriving community dedicated to advancing the fields of data science, machine learning, and AI. We are excited to have you join us in this journey of learning, experimentation, and growth.
This project focuses on performing sentiment analysis on comments collected from YouTube channels. It targets beginner to intermediate-level data scientists and involves building a machine learning pipeline to extract, analyze, and predict the sentiment of comments on YouTube videos. An ETL (Extract, Transform, Load) pipeline will be orchestrated using Apache Airflow to ensure seamless data management, while a machine learning model will be developed and deployed using Streamlit for real-time sentiment analysis.
- Use the YouTube API to collect comments from recent/relevant videos of a YouTube channel.
- Store the collected comments in a structured database for analysis.
- Build an ETL pipeline using Apache Airflow to automate the process of fetching, cleaning, and storing comment data.
- Perform data preprocessing and exploratory data analysis (EDA).
- Use a machine learning model from Huggingface to classify the sentiment of comments as positive, negative, or neutral.
- Deploy the sentiment analysis model using Streamlit/Huggingface to provide an interactive web application for real-time analysis.
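As a starting point for the first objective, here is a minimal sketch of fetching comments with `google-api-python-client`. It assumes an API key in a `YOUTUBE_API_KEY` environment variable; the function names (`fetch_comments`, `extract_comments`) and the video ID are illustrative, not part of this project's codebase.

```python
# Sketch: fetch top-level comments for a video via the YouTube Data API v3.
# Assumes YOUTUBE_API_KEY is set; helper names are our own.
import os


def extract_comments(response: dict) -> list[dict]:
    """Pull the fields we care about out of a commentThreads.list response."""
    comments = []
    for item in response.get("items", []):
        snippet = item["snippet"]["topLevelComment"]["snippet"]
        comments.append({
            "author": snippet["authorDisplayName"],
            "text": snippet["textDisplay"],
            "published_at": snippet["publishedAt"],
            "like_count": snippet["likeCount"],
        })
    return comments


def fetch_comments(video_id: str, max_results: int = 100) -> list[dict]:
    # Imported here so the parsing helper above stays usable offline.
    from googleapiclient.discovery import build  # pip install google-api-python-client
    youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=max_results,
        textFormat="plainText",
    )
    return extract_comments(request.execute())


if __name__ == "__main__":
    for c in fetch_comments("dQw4w9WgXcQ"):  # placeholder video ID
        print(c["author"], "->", c["text"][:60])
```

Keeping the response parsing in a separate pure function makes it easy to unit-test without network access or API quota.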
- API Integration: Google Developer Portal, YouTube API.
- ETL Pipeline: Apache Airflow, Pandas.
- Database: PostgreSQL, SQLite, MySQL, etc.
- Model Development: scikit-learn, TensorFlow/PyTorch, Huggingface Transformers.
- Deployment: Streamlit.
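Of the database options above, SQLite ships with Python and is the easiest to start with. A minimal storage sketch might look like the following; the table schema and column names are assumptions to adapt to your data.

```python
# Sketch: persist fetched comments in SQLite (schema is illustrative only).
import sqlite3


def init_db(path: str = "comments.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS comments (
            comment_id   TEXT PRIMARY KEY,
            video_id     TEXT NOT NULL,
            author       TEXT,
            text         TEXT NOT NULL,
            published_at TEXT,
            sentiment    TEXT            -- filled in later by the model
        )
    """)
    return conn


def save_comments(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    # INSERT OR IGNORE keys on comment_id, so re-running the ETL pipeline
    # over the same videos does not create duplicate rows.
    conn.executemany(
        "INSERT OR IGNORE INTO comments "
        "(comment_id, video_id, author, text, published_at) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```

Swapping SQLite for PostgreSQL or MySQL later mainly means changing the connection setup; the SQL itself stays close to identical.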
- Python 3.8+
- Libraries: pandas, google-api-python-client, scikit-learn, transformers, streamlit, airflow.
- Follow the Intro to Git & GitHub tutorial on SDS to get started with cloning the SDS GitHub repo to your laptop/desktop.
- Set up a Google Developer account, using the following video for guidance (https://www.youtube.com/watch?v=th5_9woFJmk).
- Register and authenticate with YouTube API.
- Write Python scripts to fetch and store comments.
- Build Airflow DAGs for automated comment extraction, cleaning, and storage.
- Test and optimize data loading into a database.
- Conduct EDA and preprocess comments data.
- Evaluate a sentiment analysis model.
- Build a Streamlit app that lets users choose a YouTube channel or video whose comments to retrieve and analyze.
- Deploy the app on Streamlit/Huggingface.
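For the preprocessing step above, a small cleaning helper is often enough to start with. The exact rules here (unescaping HTML entities, dropping URLs, collapsing whitespace) are assumptions; adapt them to what EDA reveals about the raw comments.

```python
# Illustrative comment-cleaning helper for the preprocessing/EDA step.
import html
import re

URL_RE = re.compile(r"https?://\S+")


def clean_comment(text: str) -> str:
    text = html.unescape(text)        # "&amp;" -> "&"
    text = URL_RE.sub("", text)       # drop links
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()
```

Cleaned text can then be fed to a Huggingface sentiment model in the modeling phase.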
| Phase | Task | Duration |
|---|---|---|
| Phase 1: Setup | Set up GitHub repo & API credentials | Week 1 |
| Phase 2: Data | Data collection via YouTube API | Week 2 |
| Phase 3: ETL | Build and test Airflow pipeline | Week 3 |
| Phase 4: Model | Train and evaluate sentiment model | Week 4 |
| Phase 5: Deployment | Deploy Streamlit app | Week 5 |
Follow these steps to set up the project locally:
To work on your own copy of this project:
- Navigate to the SDS GitHub repository for this project.
- Click the Fork button in the top-right corner of the repository page.
- This will create a copy of the repository under your GitHub account.
After forking the repository:
- Open a terminal on your local machine.
- Clone your forked repository by running:
  ```bash
  git clone https://github.com/<your-username>/<repository-name>.git
  ```
- Navigate to the project directory:
  ```bash
  cd <repository-name>
  ```
Set up a virtual environment to isolate project dependencies:
- Run the following command in the terminal to create a virtual environment:
  ```bash
  python3 -m venv .venv
  ```
- Activate the virtual environment:
  - On macOS/Linux:
    ```bash
    source .venv/bin/activate
    ```
  - On Windows:
    ```bash
    .venv\Scripts\activate
    ```
- Verify the virtual environment is active (the shell prompt should show `(.venv)`).
Install the required libraries for the project:
- Run the following command in the terminal to install dependencies from the requirements.txt file:
  ```bash
  pip install -r requirements.txt
  ```
Once the setup is complete, you can proceed with building your project.