This project analyzes how air quality trends vary across regions and sub-regions and which factors might drive those variations. I ran statistical analyses and built visualizations using datasets on air quality, country information, and population statistics.
Caution: This project is intended solely for practice and should not be used to draw real-life conclusions.
- Air Quality Data: Contains daily air quality index (AQI) values for various countries.
- Country Data: Includes information on regions and sub-regions for each country.
- Population Data: Provides population statistics and yearly growth rates for each country.
- Population Data for 2022 and 2023: Retrieved from Worldometers.
- Air Quality Index: Retrieved from Kaggle.
- Country Data: Retrieved from Kaggle.
- Yearly Growth Rate: Calculated in Excel from the Worldometers population data.
- Population Data Extraction: Collected population figures for 2022 and 2023 from Worldometers.
- Growth Rate Calculation: Computed the yearly growth rate based on the population figures for 2022 and 2023 using Excel.
- Population Projection for 2024: Estimated the population for 2024 using the calculated yearly growth rate in R.
Note: The population projection for 2024 is an estimate and may not be highly accurate. The methodology relies on historical growth rates, which may not account for sudden changes in demographic trends.
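The growth-rate and projection steps above can be sketched in R as follows. This is a minimal illustration, not the project's actual script: the data frame, its column names (`pop_2022`, `pop_2023`, `yearly_growth`, `pop_2024`), and the figures are hypothetical, and it assumes the growth rate is the simple relative change between the two years, applied once more to reach 2024.

```r
# Hypothetical population figures; names and values are assumptions for illustration.
population <- data.frame(
  country  = c("A", "B"),
  pop_2022 = c(1000000, 2500000),
  pop_2023 = c(1020000, 2475000)
)

# Yearly growth rate as the relative change from 2022 to 2023
# (assumed formula; expressed as a fraction, not a percentage).
population$yearly_growth <-
  (population$pop_2023 - population$pop_2022) / population$pop_2022

# Project 2024 by applying the same rate for one more year.
population$pop_2024 <- population$pop_2023 * (1 + population$yearly_growth)

print(population)
```

Because the projection reuses a single year-over-year rate, any one-off change between 2022 and 2023 carries straight into the 2024 estimate, which is the limitation the note above describes.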
Here are some of the key visualizations used in this analysis:
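# Packages used by the code in this analysis
library(dplyr)      # %>%, group_by(), summarise(), filter(), mutate()
library(ggplot2)    # plotting
library(lubridate)  # floor_date(), year()
library(gridExtra)  # grid.arrange()

# Count of observations per region, stacked by AQI status category
ggplot(data = final_df) +
  geom_bar(mapping = aes(x = region, fill = Status)) +
  scale_fill_manual(values = c(
    "Good" = "green",
    "Moderate" = "yellow",
    "Unhealthy for Sensitive Groups" = "orange",
    "Unhealthy" = "red",
    "Very Unhealthy" = "purple",
    "Hazardous" = "brown"
  )) +
  ggtitle("AQI by Region")

I plotted the distribution of AQI values by region as a box plot and showed each region's mean AQI in the legend.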

# Mean AQI per region, used for the legend labels
mean_aqi_by_region <- final_df %>%
  group_by(region) %>%
  summarise(mean_aqi = mean(`AQI_Value`, na.rm = TRUE)) %>%
  ungroup()

mean_aqi_vector <- setNames(round(mean_aqi_by_region$mean_aqi, 2),
                            mean_aqi_by_region$region)

# A single fill scale: adding both scale_fill_brewer() and scale_fill_manual()
# makes the second replace the first, which drops the legend title
ggplot(final_df, aes(x = reorder(region, `AQI_Value`),
                     y = `AQI_Value`, fill = region)) +
  geom_boxplot() +
  labs(title = "AQI_Values by Region",
       x = "Region",
       y = "AQI_Value") +
  theme_minimal() +
  scale_fill_manual(
    name = "Region (Mean AQI)",
    values = scales::brewer_pal(palette = "Set3")(length(mean_aqi_vector)),
    labels = paste0(names(mean_aqi_vector), " (Mean AQI: ", mean_aqi_vector, ")")
  )

I analyzed the trend of monthly mean AQI values over time for each region.

# Monthly mean AQI per region, stored in a new data frame so that
# final_df keeps its original rows for the later plots and the model
monthly_aqi_df <- final_df %>%
  mutate(Month = floor_date(Date, "month")) %>%
  group_by(Month, region) %>%
  summarise(mean_aqi = mean(`AQI_Value`, na.rm = TRUE)) %>%
  ungroup()

ggplot(monthly_aqi_df, aes(x = Month, y = mean_aqi, color = region)) +
  geom_line() +
  labs(title = "Trend Analysis of Mean AQI_Values by Region",
       x = "Date",
       y = "Mean AQI_Value",
       color = "Region") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, NA))

I created a function to show box plots of AQI values by sub-region within a specified region.

create_region_box_plot <- function(region_name) {
  # Keep only the rows for the requested region
  region_df <- final_df %>%
    filter(region == region_name)

  # Mean AQI per sub-region, used for the legend labels
  mean_aqi_by_sub_region <- region_df %>%
    group_by(sub_region) %>%
    summarise(mean_aqi = mean(`AQI_Value`, na.rm = TRUE)) %>%
    ungroup()

  mean_aqi_sub_vector <- setNames(round(mean_aqi_by_sub_region$mean_aqi, 2),
                                  mean_aqi_by_sub_region$sub_region)

  # A single fill scale carries the colors, the labels, and the legend title
  ggplot(region_df, aes(x = reorder(sub_region, `AQI_Value`),
                        y = `AQI_Value`,
                        fill = sub_region)) +
    geom_boxplot() +
    labs(title = paste("AQI_Values by Sub-region in", region_name),
         x = "Sub-region",
         y = "AQI_Value") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_fill_manual(
      name = "Sub-region (Mean AQI)",
      values = scales::brewer_pal(palette = "Set3")(length(mean_aqi_sub_vector)),
      labels = paste0(names(mean_aqi_sub_vector), " (Mean AQI: ", mean_aqi_sub_vector, ")")
    )
}

create_region_box_plot("Asia")

visualize_trends_by_region <- function(data, region_name) {
  region_data <- data %>% filter(region == region_name)
  sub_regions <- unique(region_data$sub_region)
  plot_list <- list()

  # One plot per sub-region, with AQI and population (in millions) on a shared axis
  for (sub_region_name in sub_regions) {
    sub_region_data <- region_data %>% filter(sub_region == sub_region_name)

    # linewidth requires ggplot2 >= 3.4.0; use size = 1 on older versions
    p <- ggplot(sub_region_data, aes(x = year(Date))) +
      geom_line(aes(y = AQI_Value, color = "AQI Value"), linewidth = 1) +
      geom_line(aes(y = Population / 1e6, color = "Population (Millions)"),
                linewidth = 1) +
      labs(title = paste("AQI and Population Trends for", sub_region_name),
           x = "Year",
           y = "") +
      scale_y_continuous(
        # Secondary axis converts millions back to raw population counts
        sec.axis = sec_axis(~ . * 1e6, name = "Population")
      ) +
      theme_minimal() +
      theme(legend.position = "bottom") +
      scale_color_manual(values = c("AQI Value" = "blue",
                                    "Population (Millions)" = "red"))

    plot_list[[sub_region_name]] <- p
  }

  # Stack the sub-region plots in a single column
  grid.arrange(grobs = plot_list, ncol = 1)
}

visualize_trends_by_region(final_df, "Asia")

I fitted a multiple linear regression model of AQI on yearly population growth, population, region, and sub-region.

model <- lm(AQI_Value ~ Yearly_Growth + Population + factor(region) +
              factor(sub_region), data = final_df)
# Summarize the model to see the results
summary(model)

- Intercept (85.95): This is the expected AQI value for the baseline region and sub-region when yearly growth and population are both zero. It is the model's reference point rather than a realistic AQI reading.
- Yearly Growth (-11.98): The coefficient for yearly growth is -11.98, indicating that a one-unit increase in the yearly population growth rate is associated with an 11.98-unit decrease in AQI, holding the other variables constant. This suggests a negative relationship between population growth rate and AQI.
- Population (8.49e-08): The coefficient for population is 8.49e-08, meaning that each additional person is associated with an AQI increase of 8.49e-08, or roughly 0.085 units per additional million people. This shows a positive but very weak relationship between population size and AQI.
- Region and Sub-region Factors: The model includes categorical variables for region and sub-region, so it controls for the effects of different regions and sub-regions on AQI. For example, the coefficient for the Americas region is -37.63, indicating that the AQI in the Americas is, on average, 37.63 units lower than in the baseline region.
- Significance: The p-values (Pr(>|t|)) for most coefficients are below 0.05, indicating that these variables contribute significantly to predicting AQI values. For instance, the population and region variables have p-values < 2e-16, showing strong significance.
- Model Fit: The R-squared value is 0.4927, which means that approximately 49.27% of the variability in AQI values is explained by the model. The adjusted R-squared value (0.4871) penalizes that figure for the number of predictors in the model.
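
To make the reading of factor coefficients and the adjusted R-squared more concrete, here is a small sketch on R's built-in `mtcars` data. It is not the project's data or model. It only illustrates that the first level of a factor has no coefficient of its own because it is absorbed into the intercept, and how the adjusted R-squared follows from the R-squared, the sample size, and the number of predictors.

```r
# Toy illustration on R's built-in mtcars data, not the project's data.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
print(round(coefs, 4))

# The first factor level (cyl = 4) has no row of its own: it is the baseline
# absorbed into the intercept, and the other levels are offsets from it.
print(levels(factor(mtcars$cyl)))

# R-squared and adjusted R-squared as reported by summary()
fit_summary <- summary(fit)
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# Adjusted R-squared recomputed by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(reported = adj_r2, manual = adj_r2_manual))
```

The same reading applies to the AQI model above: each region or sub-region coefficient is an offset from whichever level R treats as the baseline, which is the first factor level by default.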




