# Source: data-analysis-in-python-studies/Project 04.py (main branch), BrunoPerciani/data-analysis-in-python-studies, GitHub

# ============================================
# 1. Task Description
# Explore Los Angeles crime data to answer:
# 1) Which hour has the highest frequency of crimes? (peak_crime_hour)
# 2) Which area has the largest frequency of night crimes (10pm–3:59am)?
#    (peak_night_crime_location)
# 3) What is the distribution of crimes by victim age group? (victim_ages)
#
# 2. Topics Covered
# - String parsing for time features
# - Boolean filtering for time-of-day slices
# - Groupby counts and sorting
# - Binning ages into custom ranges with pd.cut()
# - Basic visual checks with seaborn
# ============================================

# 3. Python Script

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read in and preview the dataset
crimes = pd.read_csv("crimes.csv", dtype={"TIME OCC": str})
print(crimes.head())

## Which hour has the highest frequency of crimes? Store as an integer variable called peak_crime_hour

# Extract the first two digits from "TIME OCC", representing the hour,
# and convert to integer data type
crimes["HOUR OCC"] = crimes["TIME OCC"].str[:2].astype(int)

# Preview the DataFrame to confirm the new column is correct
print(crimes.head())
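
# A minimal sketch, not part of the original solution: slicing the first two
# characters only works when "TIME OCC" keeps its leading zeros, which is why
# the column is read as str above. If the column were loaded as integers
# instead (an assumption for illustration), 635 would mean 06:35, and
# zero-padding to four characters recovers the hour. The helper name
# extract_hour and the sample values are hypothetical.
import pandas as pd


def extract_hour(time_occ: pd.Series) -> pd.Series:
    """Return the hour (0-23) from HHMM values stored as int or str."""
    # Pad to four characters so 635 becomes "0635" and 5 becomes "0005"
    padded = time_occ.astype(str).str.zfill(4)
    return padded.str[:2].astype(int)


sample_times = pd.Series([635, 2230, 5, 1200])
sample_hours = extract_hour(sample_times)  # hours 6, 22, 0 and 12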

# Produce a countplot to find the largest frequency of crimes by hour
sns.countplot(data=crimes, x="HOUR OCC")
plt.show()

# The count plot peaks at hour 12 (midday), so store that hour as an integer
peak_crime_hour = 12
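
# A hedged alternative to reading the peak off the plot: value_counts() tallies
# each hour and idxmax() returns the hour with the highest count. The helper
# name find_peak_hour and the demo data are hypothetical; on the real data the
# same call would be applied to crimes["HOUR OCC"].
import pandas as pd


def find_peak_hour(hours: pd.Series) -> int:
    """Return the most frequent hour as a plain Python int."""
    return int(hours.value_counts().idxmax())


demo_hours = pd.Series([12, 12, 12, 20, 20, 3])
demo_peak_hour = find_peak_hour(demo_hours)  # hour 12 occurs most often here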

## Which area has the largest frequency of night crimes (crimes committed between 10pm and 3:59am)?
## Save as a string variable called peak_night_crime_location
# Filter for the night-time hours (22:00 to 03:59)
# Hour 0 is midnight; hour 3 covers 03:00 to 03:59, so hour 4 is excluded
night_time = crimes[crimes["HOUR OCC"].isin([22,23,0,1,2,3])]

# Group by "AREA NAME" and count occurrences, filtering for the largest value and saving the "AREA NAME"
peak_night_crime_location = (
    night_time.groupby("AREA NAME", as_index=False)["HOUR OCC"]
    .count()
    .sort_values("HOUR OCC", ascending=False)
    .iloc[0]["AREA NAME"]
)
# Print the peak night crime location
print(f"The area with the largest volume of night crime is {peak_night_crime_location}")
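
# A minimal sketch of the same night-time logic on made-up data. Because the
# window wraps past midnight, the mask is "hour >= start OR hour < end" rather
# than a single between() range. The helper name is_night_hour, its default
# arguments, and the demo rows are assumptions for illustration only.
import pandas as pd


def is_night_hour(hours: pd.Series, start: int = 22, end: int = 4) -> pd.Series:
    """Return True where the hour falls in a window that wraps past midnight."""
    return (hours >= start) | (hours < end)


demo_crimes = pd.DataFrame({
    "AREA NAME": ["Central", "Central", "Hollywood", "Central", "Hollywood"],
    "HOUR OCC": [23, 2, 22, 12, 4],
})
demo_night = demo_crimes[is_night_hour(demo_crimes["HOUR OCC"])]

# value_counts().idxmax() is a shorter route to the most frequent area name
demo_peak_area = demo_night["AREA NAME"].value_counts().idxmax()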

## Identify the number of crimes committed against victims by age group (0-17, 18-25, 26-34, 35-44, 45-54, 55-64, 65+)
## Save as a pandas Series called victim_ages
# Create bins and labels for victim age ranges
age_bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
age_labels = ["0-17", "18-25", "26-34", "35-44", "45-54", "55-64", "65+"]

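# Add a new column using pd.cut() to bin values into discrete intervals.
# Intervals are right-inclusive by default: (0, 17], (17, 25], ..., (64, inf],
# so an age of exactly 0 (or any negative age) falls outside every bin and becomes NaN.
crimes["Age Bracket"] = pd.cut(crimes["Vict Age"],
                               bins=age_bins,
                               labels=age_labels)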

# Count the number of crimes in each victim age bracket (sorted by frequency)
victim_ages = crimes["Age Bracket"].value_counts()
print(victim_ages)
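
# A small illustration of how pd.cut() treats bin edges, using made-up ages.
# With the default right=True the first interval is (0, 17], so age 0 is not
# binned; include_lowest=True closes the left edge of the first interval.
# Whether age 0 should count as a real age is a data-quality decision this
# sketch does not settle. All demo_* names are hypothetical.
import numpy as np
import pandas as pd

demo_ages = pd.Series([0, 17, 18, 65, -1])
demo_bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
demo_labels = ["0-17", "18-25", "26-34", "35-44", "45-54", "55-64", "65+"]

# Default behaviour: ages 0 and -1 both fall outside every interval
demo_default = pd.cut(demo_ages, bins=demo_bins, labels=demo_labels)

# include_lowest=True puts age 0 in "0-17"; -1 is still left unbinned
demo_with_zero = pd.cut(demo_ages, bins=demo_bins, labels=demo_labels,
                        include_lowest=True)

# sort=False skips the frequency sort, unlike the default value_counts()
demo_counts = demo_with_zero.value_counts(sort=False)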

# ============================================
# 4. Additional Notes
# - TIME OCC is in 24-hour format; slicing the first two characters obtains the hour.
# - Night-time window is defined as 22:00–03:59 (hours 22, 23, 0, 1, 2, 3).
# - pd.cut() creates categorical bins for victim age groups. value_counts()
#   sorts by frequency, so chain .sort_index() if the printed Series should
#   follow the label order instead.
# - peak_crime_hour is hard-coded from the count plot; it can also be computed
#   with crimes["HOUR OCC"].value_counts().idxmax().
# - Be mindful of potential missing or malformed values in Vict Age or TIME OCC.
# ============================================
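
# A minimal sketch of one way to guard against malformed "Vict Age" values,
# shown on made-up data. pd.to_numeric(errors="coerce") turns unparseable
# entries into NaN; treating negative ages as invalid is an assumption made
# here, not something the original script does.
import pandas as pd

raw_ages = pd.Series(["34", "abc", None, "-2", "71"])

# Non-numeric strings and missing values become NaN
clean_ages = pd.to_numeric(raw_ages, errors="coerce")

# Keep only non-negative ages; everything else becomes NaN
clean_ages = clean_ages.where(clean_ages >= 0)

# Three of the five sample values are unusable ("abc", None, "-2")
n_unusable_ages = int(clean_ages.isna().sum())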