# ============================================
# 1. Task Description
# Perform exploratory data analysis on Netflix movie data to:
# • filter to Movies from the 1990s (1990 <= release_year < 2000),
# • visualize the distribution of movie durations in that decade,
# • save an approximate most frequent duration into `duration`,
# • count how many Action movies are "short" (< 90 minutes) and
# save the result into `short_movie_count`.
#
# 2. Topics Covered
# - pandas filtering and boolean masks (decade subsetting)
# - exploratory plotting (histogram with matplotlib)
# - counting via iteration and via vectorized boolean sums
# - chainable, readable DataFrame operations
# ============================================
# 3. Python Script
# Importing pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Read in the Netflix CSV as a DataFrame
netflix_df = pd.read_csv("netflix_data.csv")
# Subset the DataFrame for type "Movie"
netflix_subset = netflix_df[netflix_df["type"] == "Movie"]
# Filter to keep only movies released in the 1990s
# Start by keeping the movies that were released in or after 1990
subset = netflix_subset[(netflix_subset["release_year"] >= 1990)]
# Then keep only the movies released before 2000
movies_1990s = subset[(subset["release_year"] < 2000)]
# Another way to do this step is to use the & operator which allows you to do this type of filtering in one step
# movies_1990s = netflix_subset[(netflix_subset["release_year"] >= 1990) & (netflix_subset["release_year"] < 2000)]
# Visualize the duration column of the filtered data to see the distribution of movie durations
# Note which bar is the tallest and save its approximate duration value; this doesn't need to be exact
plt.hist(movies_1990s["duration"])
plt.title('Distribution of Movie Durations in the 1990s')
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of Movies')
plt.show()
duration = 100
# Filter the data again to keep only the Action movies
action_movies_1990s = movies_1990s[movies_1990s["genre"] == "Action"]
# Use a for loop and a counter to count how many short action movies there were in the 1990s
# Start the counter
short_movie_count = 0
# Iterate over the labels and rows of the DataFrame; if the duration is less than 90,
# add 1 to the counter, otherwise leave the counter unchanged
for label, row in action_movies_1990s.iterrows():
    if row["duration"] < 90:
        short_movie_count += 1
print(short_movie_count)
# A quicker way of counting values in a column is to use .sum() on the desired column
# (action_movies_1990s["duration"] < 90).sum()
# ============================================
# 4. Additional Notes
# - If you prefer a *programmatic* mode estimate (instead of eyeballing),
# you could bin durations with numpy and take the bin center with max count,
# or use pandas' value_counts().idxmax() on integer-rounded durations:
# duration = movies_1990s["duration"].round().value_counts().idxmax()
# - Vectorized counting with `.sum()` on a boolean mask is the idiomatic
# and efficient approach in pandas; prefer it over iterrows() for scale.
# - Consider handling missing durations:
# movies_1990s = movies_1990s.dropna(subset=["duration"])
# ============================================
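# 5. Sketch: Programmatic Mode Estimate
# A minimal sketch of the numpy binning idea mentioned in the notes above; it is
# an illustration, not part of the original solution. The helper name
# `estimate_modal_duration` is hypothetical, and the default of 10 bins is an
# assumption chosen to match the default number of bins used by plt.hist().
import numpy as np
import pandas as pd


def estimate_modal_duration(durations, bins=10):
    """Return the center of the histogram bin that holds the most values."""
    # Coerce to numeric and drop missing values, since np.histogram raises on NaN
    values = pd.to_numeric(pd.Series(durations), errors="coerce").dropna().to_numpy()
    if values.size == 0:
        raise ValueError("no non-missing durations to bin")
    counts, edges = np.histogram(values, bins=bins)
    tallest = counts.argmax()  # index of the first bin with the highest count
    return (edges[tallest] + edges[tallest + 1]) / 2


# Possible usage with the DataFrame built earlier in this script:
# duration = estimate_modal_duration(movies_1990s["duration"])
# The result is a bin center, so it is an approximation of the most frequent
# duration and depends on the number of bins chosen.
# ============================================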