-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathliam.qmd
More file actions
290 lines (230 loc) · 12.3 KB
/
liam.qmd
File metadata and controls
290 lines (230 loc) · 12.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
# Results
## Clean data
Since "Facility Type", "Vehicle Type", "Due Date" columns are all having >90% missing values (missing value analysis), we will remove them.
From the dataset description, they refer to...
Facility Type: If available, this field describes the type of city facility associated to the SR
Vehicle Type: If the incident is a taxi, this field describes the type of TLC vehicle.
Due Date: Date when responding agency is expected to update the SR. This is based on the Complaint Type and internal Service Level Agreements (SLAs).
"Landmark" has >40% missing values. But since other address columns (eg. Longitude, Latitude, Incident Address) are proxies of it, we can remove "Landmark".
Landmark: If the incident location is identified as a Landmark the name of the landmark will display here
Community Boards (each taking charge of 1 Community District, or CD), handle the "day-to-day processing of citizen complaints and requests for municipal services". At the sub-borough level, we will look at how well each CD manages their municipal issues.
["link"](https://www.nyc.gov/site/queenscb6/about/frequently-asked-questions-faq.page)
Since we're only looking at the New York City dataset, the City column is redundant, and can be removed too.
There are some rows with Borough and Status values being "Unspecified". But they take up such a small % of the total number of rows (~0% from missing value analysis), so we will remove these rows too.
```{r message = FALSE}
df <- df[, !(colnames(df) %in% c("Vehicle Type", "Due Date", "Facility Type", "Landmark", "City"))]
# see count of number of values. in Borough, Status column
table(df[["Borough"]])
table(df[["Status"]])
# remove rows with Borough / Status = "Unspecified"
df <- df[!grepl("Unspecified", df$Borough),]
df <- df[!grepl("Unspecified", df$Status),]
```
## 1. Which borough has the most complaints, and the least?
```{r message = FALSE}
library(ggplot2)
library(forcats)
library(dplyr)
# specify colors for boroughs
custom_colors <- c("QUEENS" = "yellow", "MANHATTAN" = "blue", "BRONX" = "brown", "STATEN ISLAND" = "green", "BROOKLYN" = "red")
# plot bar chart
ggplot(df %>%
group_by(Borough) %>% # to include another dimension (but it tends to be too cluttered), add eg: , `Complaint Type`
summarise(Total_Complaints = n()),
aes(x = reorder(Borough, -Total_Complaints), y = Total_Complaints/1000, fill = Borough)) +
geom_bar(stat = "identity") +
labs(title = "Total Complaints by NYC Borough",
x = "Borough",
y = "Number of Complaints (000s)") +
scale_fill_manual(values = custom_colors) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_minimal() # facetgrid: + facet_grid(Borough ~ `Complaint Type`)
```
Brooklyn appears to receive the most 311 complaints, followed by Queens, Manhattan, the Bronx. Staten Island seems to receive the least complaints.
If we zoom into each Borough, and see which areas within them receive more complaints (see q5).
## 2. Which complaint types are the most numerous, and the least?
```{r message = FALSE, fig.height=20, fig.width=10}
ggplot(df %>%
mutate(`Complaint Type` = tolower(`Complaint Type`)) %>%
group_by(`Complaint Type`) %>%
summarise(Total_Complaints = n()),
aes(x = Total_Complaints/1000, y = reorder(`Complaint Type`, Total_Complaints), fill = `Complaint Type`)) +
geom_col(position = position_dodge(width = 20)) +
labs(title = "Total Complaints by Complaint Type",
x = "Number of Complaints (000s)",
y = "Complaint Type") +
scale_fill_manual(values = custom_colors) +
theme(axis.text.y = element_text(vjust = 0.5)) # Adjust hjust and vjust as needed +
theme_minimal()
```
On an aggregate level across all boroughs, the most common complaints range from illegal parking, noise in residential spaces, (presumably, the lack of) hot water, to also blocked driveways. In general, noise/cleanliness/water issues are the most frequent complaints, as seen from the top 10-20 bars.
Towards the bottom of the bar plot, we see that certain issues amount to fewer cases of complaints. These include highway/snow/oil spill issues. This is understandable, given that they are less "day-to-day" issues.
To drill in further, we can facet by boroughs. However, there are clearly too many complaint types. Focusing on just the top 5, for each borough...
```{r message = FALSE, fig.height=6, fig.width=8}
library(tidytext)
df_top <- df %>%
mutate(`Complaint Type` = tolower(`Complaint Type`)) %>%
group_by(Borough, `Complaint Type`) %>%
summarise(Total_Complaints = n()) %>%
group_by(Borough) %>%
top_n(5, wt = Total_Complaints) %>%
arrange(Borough, desc(Total_Complaints)) %>%
mutate(`Complaint Type` = factor(`Complaint Type`, levels = unique(`Complaint Type`)))
# Get a vector of unique Boroughs ordered by the total complaints
ordered_boroughs <- df_top %>%
arrange(Borough) %>%
group_by(Borough) %>%
summarise(total_complaints = sum(Total_Complaints)) %>%
arrange(desc(total_complaints)) %>%
pull(Borough)
# Use reorder_within to sort bars within each facet
ggplot(df_top, aes(x = Total_Complaints, y = reorder_within(`Complaint Type`, Total_Complaints, Borough), fill = `Complaint Type`)) +
geom_col() +
labs(title = "Top 5 Complaint Types by Borough",
x = "Number of Complaints",
y = "Complaint Type") +
scale_fill_manual(values = custom_colors) +
theme_minimal() +
facet_wrap(~Borough, scales = "free_y", ncol = 1, strip.position = "bottom") +
theme(axis.text.y = element_text(vjust = 0.5)) +
theme_minimal()
```
## 3. What are the locations that most complaints are associated with?
We explore locations by their Street Names (eg. Seventh Avenue), not Incident Address (eg. 31 Seventh Avenue, 139 Seventh Avenue), as the latter would have too many unique categories. For the initial plots here, we do not use zip code or longitude/latitude as they are less interpretable (we will plot that on a chloropleth later).
Since the number of missing rows in Street Names is low, as seen from the missing vales analysis (~3.7% rows missing), we shall omit it from this section's analysis.
```{r message = FALSE}
# remove rows with missing Street Name
df_temp <- df[complete.cases(df$`Street Name`), ]
```
```{r message = FALSE}
df_top <- df_temp %>%
mutate(`Street Name` = tolower(`Street Name`)) %>%
group_by(Borough, `Street Name`) %>%
summarise(Total_Complaints = n()) %>%
group_by(Borough) %>%
top_n(5, wt = Total_Complaints) %>%
arrange(Borough, desc(Total_Complaints)) %>%
mutate(`Complaint Type` = factor(`Street Name`, levels = unique(`Street Name`)))
# Get a vector of unique Boroughs ordered by the total complaints
ordered_boroughs <- df_top %>%
arrange(Borough) %>%
group_by(Borough) %>%
summarise(total_complaints = sum(Total_Complaints)) %>%
arrange(desc(total_complaints)) %>%
pull(Borough)
# Use reorder_within to sort bars within each facet
ggplot(df_top, aes(x = Total_Complaints, y = reorder_within(`Street Name`, Total_Complaints, Borough), fill = `Street Name`)) +
geom_col() +
labs(title = "Top 5 Complaint Types by Borough",
x = "Number of Complaints",
y = "Complaint Type") +
scale_fill_manual(values = custom_colors) +
theme_minimal() +
facet_wrap(~Borough, scales = "free_y", ncol = 1, strip.position = "bottom") +
theme(axis.text.y = element_text(vjust = 0.5)) +
theme_minimal()
```
We see from here some of the most complained streets in each borough, eg. Ocean Avenue in Brooklyn. But this can also be high, due to the fact that some of them are long, main roads in each Borough (eg. Broadway, Amsterdam Avenue in Manhattan). We can plot a chloropleth to get a better, geographical view of problem areas.
## 4. What is the proportion of complaints that were handled by different departments?
```{r message = FALSE}
ggplot(df %>%
group_by(Agency) %>%
summarise(Total_Complaints = n()),
aes(x = reorder(Agency, -Total_Complaints), y = Total_Complaints/1000, fill = Agency)) +
geom_bar(stat = "identity") +
labs(title = "Total Complaints by NYC Agencies",
x = "Agency",
y = "Number of Complaints (000s)") +
scale_fill_manual(values = custom_colors) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_minimal()
```
We see here that the number of complaints managed by the NYPD was the most, followed by the HPD, and then the DSNY. DHS, TLC, EDC, DCWP, and DOE seem to handle the least number of complaints. The various departments and their abbreviations are as follows...
```{r message = FALSE}
library(knitr)
# Remove duplicates
df_agencies <- distinct(select(df, Agency, `Agency Name`))
# Display the table
kable(df_agencies, caption = "Agencies", format = "html")
```
```{r message = FALSE}
library(ggalluvial)
library(vcd)
library(alluvial)
ggplot(df, aes(axis1 = Borough, axis2 = Agency, y = ..count.., label = Agency)) +
geom_alluvium(aes(fill = Status)) +
geom_stratum(aes(fill = Status)) +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
theme_minimal()
df$Status <- factor(df$Status, levels = c("Open", "Assigned", "Pending", "Started", "In Progress", "Closed"))
# too many cols messy
mosaic(Status ~ Borough + Agency,
labeling = labeling_border(
rot_labels=c(0,0,45)
),
data = df,
shade = TRUE,
direction = c("v","v","h"),
main = "Mosaic Plot of Boroughs, and the \nstatuses of their complaints' from each Agency")
alluvial_data <- as.data.frame(table(df$Borough, df$Agency, df$Status))
# Plot the alluvial diagram
alluvial(alluvial_data, freq = alluvial_data$Freq,
# col = ifelse(alluvial_data$`df$Status` == "Closed", "blue", "red"),
# border = ifelse(alluvial_data$`df$Status` == "Closed", "darkblue", "darkred"),
cex = 0.7)
ggplot(df, aes(axis1 = Borough, axis2 = Agency, y = after_stat(stratum), label = after_stat(stratum))) +
geom_flow(aes(fill = Status)) +
geom_stratum(aes(fill = Status)) +
# geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
geom_text(stat = "stratum", aes(label = ifelse(after_stat(stratum) == "Borough", as.character(Borough), "")), size = 3) +
scale_x_discrete(limits = c("Borough", "Agency", "Status")) +
labs(title = "Alluvial Plot",
subtitle = "Flow of Counts from Borough to Agency to Status",
x = NULL, y = NULL) +
theme_minimal()
```
```{r message = FALSE}
df_top2 <- df %>%
group_by(Borough, `Street Name`, `Complaint Type`, Status) %>%
summarise(Count = n()) %>%
group_by(Borough, `Street Name`, Status) %>%
top_n(2, wt = Count) %>%
ungroup()
# Create the alluvial plot
ggplot(df_top2, aes(axis1 = Borough, axis2 = `Street Name`, y = Count, fill = Status)) +
geom_flow(aes(fill = Status)) +
geom_stratum() +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
theme_void() +
labs(title = "Alluvial Diagram of Complaints Flow",
subtitle = "Top 2 Complaint Types per Borough",
x = NULL, y = NULL)
```
## 5. What types of complaints get closed the fastest/slowest, by using the mean resolution time to close a request?
```{r message = FALSE}
# Specify the desired order of the x-axis values
custom_order <- c("Open", "Assigned", "Pending", "Started", "In Progress", "Closed")
ggplot(df %>%
group_by(Status) %>%
summarise(Total_Complaints = n()),
aes(x = factor(Status, levels = custom_order), y = Total_Complaints/1000, fill = Status)) +
geom_bar(stat = "identity") +
labs(title = "Total Complaints by Status",
x = "Complaint Status",
y = "Number of Complaints (000s)") +
scale_fill_manual(values = custom_colors) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_minimal()
```
The [data dictionary](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) provided did not explicitly state what each status means, but a sensible inference would be that they are ordinal categories, going from Open, to being Assigned to some agency. It would presumably stay in a Pending status, before work on it gets Started. It'll remain In Progress, until it gets Closed by the responsible agency.
## 6. What are the trends in service requests over time?
```{r message = FALSE}
```
## template section
# Interactive component in D3
## template section
# Conclusion
## template section