This project analyzes relationships between different movie features such as budget, gross revenue, votes, score and runtime using Python.
The main goal is to understand which factors are most strongly correlated with a movie’s gross revenue.
- Source: Kaggle Movies Dataset (https://www.kaggle.com/danielgrijalvas/movies)
- Contains information on movies, including:
  - name, rating, genre, year, released
  - score, votes
  - director, writer, star
  - country, budget, gross, company, runtime
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Jupyter Notebook (VS Code)
- Loaded and explored the dataset
- Checked missing values
- Cleaned data (dropped missing values for analysis)
- Converted data types where required
- Extracted the correct year from the `released` column
- Created scatter plot of budget vs gross
- Built regression plot to observe trend
- Generated correlation matrix for numeric features
- Identified highly correlated feature pairs
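The cleaning steps above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the three rows are hypothetical stand-ins for `movies.csv`, the `released` format (`"June 13, 1980 (United States)"`) is an assumption about the dataset, and `year_correct` is an illustrative column name.

```python
import pandas as pd

# Hypothetical rows standing in for movies.csv (values are illustrative only).
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C"],
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1981 (United States)",
        "May 16, 1986 (United States)",
    ],
    "budget": [19000000.0, None, 15000000.0],
    "gross": [46998772.0, 58853106.0, 357288178.0],
    "votes": [927000.0, 65000.0, 557000.0],
})

# Check missing values per column.
print(df.isnull().sum())

# Drop rows with any missing value (the simple strategy described above).
clean = df.dropna().copy()

# Convert float columns to integers once no NaN values remain.
for col in ["budget", "gross", "votes"]:
    clean[col] = clean[col].astype("int64")

# Extract the first four-digit number from the release string as the year.
clean["year_correct"] = (
    clean["released"].str.extract(r"(\d{4})", expand=False).astype(int)
)

print(clean[["name", "budget", "year_correct"]])
```

With the real dataset, the hand-built frame would be replaced by `pd.read_csv("movies.csv")`.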
- Budget and gross revenue show a strong positive correlation
- Votes also correlate strongly with gross revenue
- Other numeric features show moderate or weak relationships
- Correlation analysis helps identify key drivers of movie revenue
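One way to arrive at findings like these is to compute a Pearson correlation matrix over the numeric columns and list the pairs above a threshold. The sketch below uses synthetic data (not the Kaggle dataset), and the 0.5 cutoff is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: gross is built to depend on budget.
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1e6, 2e8, n)
df = pd.DataFrame({
    "budget": budget,
    "gross": budget * 2.5 + rng.normal(0, 2e7, n),
    "runtime": rng.uniform(80, 180, n),
    "name": [f"movie_{i}" for i in range(n)],  # non-numeric, excluded below
})

# Keep only numeric columns, then compute the Pearson correlation matrix.
corr = df.select_dtypes(include="number").corr()

# Flatten the matrix into (feature, feature) pairs.
pairs = corr.unstack()

# Keep each unordered pair once and drop the diagonal (self-correlation).
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]

# Assumed threshold: treat |r| > 0.5 as "highly correlated".
high = pairs[pairs.abs() > 0.5].sort_values(ascending=False)
print(high)
```

Correlation only measures linear association, so a high value here does not by itself show that budget causes higher revenue.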
- Scatter plot (Budget vs Gross)
- Regression plot
- Heatmap of correlation matrix
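The regression plot and heatmap listed above can be sketched with Seaborn as follows. The data is synthetic and the styling choices (colors, figure size) are assumptions, not taken from the notebook. The `Agg` backend is set only so the sketch runs without a display; in Jupyter that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the cleaned dataset (illustrative values only).
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1e6, 2e8, n)
df = pd.DataFrame({
    "budget": budget,
    "gross": budget * 2.5 + rng.normal(0, 2e7, n),
    "votes": rng.uniform(1e3, 1e6, n),
})

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scatter plot with a fitted regression line: budget vs gross.
sns.regplot(
    data=df, x="budget", y="gross", ax=axes[0],
    scatter_kws={"alpha": 0.5}, line_kws={"color": "red"},
)
axes[0].set_title("Budget vs Gross")
axes[0].set_xlabel("Budget")
axes[0].set_ylabel("Gross")

# Heatmap of the correlation matrix for the numeric features.
corr = df.corr()
sns.heatmap(corr, annot=True, ax=axes[1])
axes[1].set_title("Correlation matrix")

fig.tight_layout()
```

`sns.regplot` draws both the scatter points and the regression line, so a separate `plt.scatter` call is optional.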
Only numeric features were used for correlation analysis to ensure meaningful results.
- `movies_analysis_correlation.ipynb` → main project notebook
- `movies.csv` → dataset
- Download the repository
- Open the notebook in VS Code or Jupyter
- Install required libraries
- Run all cells
- Handle missing data more strategically (for example, imputation) instead of dropping rows
- Apply log transformation for skewed variables
- Build predictive models for gross revenue
- Explore genre/company-level insights
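As an illustration of the log-transformation idea, the sketch below applies `np.log1p` to a right-skewed column and compares skewness before and after. The `gross` values are drawn from a log-normal distribution purely as an assumption about what skewed revenue data might look like, and `log_gross` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "gross" values (illustrative, not real data).
rng = np.random.default_rng(1)
df = pd.DataFrame({"gross": rng.lognormal(mean=17, sigma=1.2, size=500)})

# log1p computes log(1 + x), which is also defined when x is 0.
df["log_gross"] = np.log1p(df["gross"])

# Compare skewness before and after the transform.
print("skew before:", df["gross"].skew())
print("skew after: ", df["log_gross"].skew())

# expm1 inverts log1p, so the original scale can be recovered.
recovered = np.expm1(df["log_gross"])
```

Correlations computed on the transformed column describe a relationship on the log scale, so they are not directly comparable to those on the raw values.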
Akhtar R Khan