Merged
1 change: 1 addition & 0 deletions data.Rmd
@@ -16,6 +16,7 @@
| [Climate change disasters](https://daseh.org/data/Yearly_CC_disasters_total_affected.csv) | Data about the number of people affected by total disasters (including droughts, extreme temperatures, floods, landslides, storms, and wildfires) by country and year. | • Manipulating Data in R Lab | [International Monetary Fund (IMF)](https://climatedata.imf.org/datasets/b13b69ee0dde43a99c811f592af4e821_0/about) |
| [CO heat-related ER visits](https://daseh.org/data/CO_ER_heat_visits.csv) | Age-adjusted visit rates and total number of visits for all genders by Colorado county for 2011-2023, collected by the Colorado Environmental Public Health Tracking program | • Data Input <br> • Subsetting Data in R <br> • Data Summarization <br> • Data Cleaning <br> • Intro to Data Visualization <br> • Data Visualization <br> • Factors <br> • Statistics <br> • Data Output <br> • Functions | https://coepht.colorado.gov/heat-related-illness |
| [COVID wastewater surveillance](https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv) | SARS-CoV-2 viral load measured in wastewater between 2022 and 2024, collected by the National Wastewater Surveillance System | • Data Classes | https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data |
| [CDC socioeconomic themes](https://daseh.org/data/socioeco_dat.csv) | 2018 socioeconomic theme data created by the Centers for Disease Control and Prevention (CDC) / Agency for Toxic Substances and Disease Registry (ATSDR) / Geospatial Research, Analysis, and Services Program (GRASP) | • Statistics | https://hub.scag.ca.gov/datasets/18981b657cf04f2dbe0df065f20581db_5/about |
| [Flu internet searches](https://daseh.org/data/Wojcik_2021_flu.csv) | Data from a study on the use of internet search data to track prevalence of Influenza-Like Illness (ILI) | • Statistics | https://www.nature.com/articles/s41467-020-20206-z |
| [Nitrate exposure](https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv) | The number of people in Washington exposed to excess levels of nitrate in their water between 1999 and 2020 by quarter, collected by the Washington Tracking Network | • Manipulating Data in R | https://doh.wa.gov/data-and-statistical-reports/washington-tracking-network-wtn/drinking-water |
| [Weather on Mars](https://daseh.org/data/kaggleMars_Dataset.csv) | Information about temperature measures from the Rover Environmental Monitoring Station (REMS) on Mars, collected by Spain and Finland | • Homework 2 | https://www.kaggle.com/datasets/deepcontractor/mars-rover-environmental-monitoring-station/data |
186 changes: 113 additions & 73 deletions modules/Statistics/Statistics.Rmd
@@ -379,7 +379,7 @@ where:

* $y_i$ is the outcome for person i
* $\alpha$ is the intercept
* $\beta_1$, $\beta_2$, $\beta_3$ are the slopes/coefficients for variables $x_{i1}$, $x_{i2}$, $x_{i3}$ - average difference in y for a unit change (or each value) in x while accounting for other variables
* $x_{i1}$, $x_{i2}$, $x_{i3}$ are the predictors for person i
* $\varepsilon_i$ is the residual variation for person i

@@ -452,34 +452,31 @@ For example, if we want to fit a regression model where outcome is `income` and

## Linear regression example

Let's look at variables that might be able to predict the number of "crowded" households.

We'll use a dataset that has socioeconomic measures from the CDC. Find out more at https://daseh.org/data.

It has already been filtered to include a few counties from Washington State.

Each row represents a census tract/area.

## Linear regression example{.codesmall}

```{r}
sp_dat <- read_csv(file = "https://daseh.org/data/socioeco_cdc.csv")

sp_dat
```

## Linear regression: model fitting{.codesmall}

For this model, we will use two variables:

- **crowd** - At household level (occupied housing units), more people than rooms ("crowded" households)
- **hu** - Number of housing units

```{r}
fit <- glm(crowd ~ hu, data = sp_dat)
fit
```
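Once the model is fit, you can also generate predictions from it. A minimal sketch (assuming the `fit` object from the chunk above; the `newdata` housing-unit counts are hypothetical values chosen for illustration):

```r
# Fitted values for the observed data (assumes `fit` from the chunk above)
head(predict(fit))

# Predictions for hypothetical counts of housing units
predict(fit, newdata = data.frame(hu = c(1000, 2000)))
```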

@@ -497,18 +494,18 @@ The broom package can help us here too!

The estimate is the coefficient or slope.

For every 1 additional housing unit, we see 0.035 more crowded households (about 1 additional crowded household per ~29 housing units). This relationship appears to be quite strong, with a p-value of 7.96e-44!

```{r}
tidy(fit) |> glimpse()
```
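If you also want interval estimates for the coefficients, a sketch (assuming the `fit` object from the earlier chunk and the broom package are loaded):

```r
# 95% confidence intervals alongside the coefficient estimates
tidy(fit, conf.int = TRUE, conf.level = 0.95)

# Base R equivalent
confint(fit)
```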

## Linear regression: multiple predictors {.smaller}

Let's try adding another explanatory variable to our model, average per capita income for each census area (`pci`).

```{r}
fit2 <- glm(crowd ~ hu + pci, data = sp_dat)
summary(fit2)
```

@@ -526,53 +523,39 @@ fit2 |>

Factors get special treatment in regression models - the lowest level of the factor is the comparison group, and all other levels are estimated **relative** to it.

Let's add the county (`county`) as a factor to our model. We'll need to convert it to a factor first.

```{r}
sp_dat <- sp_dat |> mutate(county = factor(county))
```


## Linear regression: factors {.smaller}

The comparison group that is not listed is treated as the intercept. All other estimates are relative to the intercept.

```{r regressbaseline, comment="", fig.height=4,fig.width=8}

fit3 <- glm(crowd ~ hu + pci + county, data = sp_dat)
summary(fit3)
```

## Linear regression: factors {.smaller}

Maybe we want to use King County as our reference. We can relevel the factor.

The counties are relative to the level that is not listed.

```{r}
sp_dat <-
  sp_dat |>
  mutate(county = factor(county,
    levels = c("King", "Clark", "Pierce", "Snohomish", "Spokane")
  ))

fit4 <- glm(crowd ~ hu + pci + county, data = sp_dat)
summary(fit4)
```


## Linear regression: interactions

```{r, fig.alt="Statistical interaction showing the relationship between cookie yield, temperature, and cooking duration.", out.width = "70%", echo = FALSE, fig.align='center'}
Expand All @@ -583,21 +566,20 @@ knitr::include_graphics("images/interaction.png")

## Linear regression: interactions {.smaller}

You can also specify interactions between variables in a formula with `*` (in R, `y ~ x1 * x2` expands to `y ~ x1 + x2 + x1:x2`). This allows not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.

```{r fig.height=4, fig.width=8}
fit5 <- glm(crowd ~ hu + pci * county, data = sp_dat)
summary(fit5)
```

## Linear regression: interactions {.smaller}

By default, `ggplot` with a factor added as a color will include the interaction term. Notice the different intercept and slope of the lines.

```{r fig.height=3.5, fig.width=7, warning=FALSE}
ggplot(sp_dat, aes(x = pci, y = hu, color = county)) +
  geom_point(size = 1, alpha = 0.2) +
  geom_smooth(method = "glm", se = FALSE) +
  theme_classic()
```
@@ -620,37 +602,35 @@ See `?family` documentation for details of family functions.

## Logistic regression {.smaller}

Let's look at a logistic regression example. We'll use the `sp_dat` dataset again with a different variable.

- **f_crowd** - Flag for whether the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no)

There are 36 census tracts in the 90th percentile for crowded households.

```{r}
sp_dat |> count(f_crowd)
```


## Logistic regression {.smaller}

Let's explore how `hu`, `pci`, and `county` might predict `f_crowd`.

```
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
```

```{r regress7, comment="", fig.height=4,fig.width=8}
binom_fit <- glm(f_crowd ~ hu + pci + county,
  data = sp_dat, family = binomial(link = "logit"))
summary(binom_fit)
```

## Logistic Regression

See this [case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/#Logistic_regression_%E2%80%9Cby_hand%E2%80%9D_and_by_model) for more information.


## Odds ratios

> An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
@@ -661,34 +641,27 @@ Use `oddsratio(x, y)` from the `epitools` package to calculate odds ratios.
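To see what an odds ratio measures, here is a by-hand sketch with hypothetical 2x2 counts (made up for illustration, not from our dataset):

```r
# Hypothetical 2x2 counts: exposure status by outcome status
exp_yes <- 20;   exp_no <- 80    # exposed: with outcome / without outcome
unexp_yes <- 10; unexp_no <- 90  # unexposed: with outcome / without outcome

# OR = (odds of outcome when exposed) / (odds of outcome when unexposed)
or_by_hand <- (exp_yes / exp_no) / (unexp_yes / unexp_no)
or_by_hand  # 2.25
```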

## Odds ratios {.smaller}

Let's see if a high prevalence of no-vehicle homes can predict a high prevalence of crowded homes.

- **f_noveh** - Flag for whether the percentage of households with no vehicles is in the 90th percentile (1 = yes, 0 = no)
- **f_crowd** - Flag for whether the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no)

## Odds ratios

In this case, we're calculating the odds ratio for census tracts: whether a high prevalence of no-vehicle households is associated with a high prevalence of crowded households.

```{r}
library(epitools)

response <- sp_dat |> pull(f_crowd)
predictor <- sp_dat |> pull(f_noveh)
```

## Odds ratios {.smaller}

The Odds Ratio is 3.33.

When the predictor is 1 (the census tract is in the 90th percentile for no-vehicle households), the odds of the response (being in the 90th percentile for crowded households) are 3.33 times greater than when it is 0.

```{r echo = FALSE}
oddsratio(predictor, response)
@@ -738,6 +711,12 @@ Image by <a href="https://pixabay.com/users/geralt-9301/?utm_source=link-attribu

# Extra Slides

## Model Selection

Check out the `leaps` package and other code snippets here: https://r-statistics.co/Model-Selection-in-R.html

# More tests!

## Wilcoxon Test

The Wilcoxon test is a good alternative to the t-test when the normal distribution of the differences between paired individuals cannot be assumed.
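A minimal sketch of a paired Wilcoxon signed-rank test with simulated data (the subjects and measurements are made up for illustration):

```r
set.seed(123)

# Hypothetical paired measurements, e.g. before/after for 20 subjects
before <- rnorm(20, mean = 50, sd = 10)
after  <- before + rnorm(20, mean = 3, sd = 5)

# Paired Wilcoxon signed-rank test - no normality assumption on the differences
wilcox.test(before, after, paired = TRUE)
```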
@@ -782,4 +761,65 @@ For balanced designs, determine if multiple variables influence a dependent variable

`aov()`
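A minimal two-way ANOVA sketch using R's built-in `ToothGrowth` data:

```r
# Does tooth growth (len) depend on supplement type (supp) and dose?
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # treat dose as a factor, not a number

fit_aov <- aov(len ~ supp + dose, data = tg)
summary(fit_aov)
```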

## More on Linear regression: factors {.smaller}

You can view estimates for the comparison group by removing the intercept in the GLM formula

`y ~ x - 1`

*Caveat* is that the p-values change, and interpretation is often confusing.

```{r regress9, comment="", fig.height=4, fig.width=8}
fit_force_intercept <-
glm(crowd ~ pci + sngpnt + county - 1, data = sp_dat)
summary(fit_force_intercept)
```

<!-- Raw data from https://hub.scag.ca.gov/datasets/18981b657cf04f2dbe0df065f20581db_5/about : -->

<!-- ```{r} -->
<!-- wa_dat2 <- -->
<!-- read_csv(file = "CDC_Social_Vulnerability_Index_2018_271880276067137134.csv") %>% -->
<!-- select( -->
<!-- "State Name", -->
<!-- "COUNTY", -->
<!-- "FIPS", -->
<!-- "Text description of tract, county, state", -->
<!-- "Per capita income estimate, 2014-2018 ACS", -->
<!-- "Housing units estimate, 2014-2018 ACS", -->
<!-- "Housing in structures with 10 or more units estimate, 2014-2018 ACS", -->
<!-- "Single parent household with children under 18 estimate, 2014-2018 ACS", -->
<!-- "At household level (occupied housing units), more people than rooms estimate, 2014-2018 ACS", -->
<!-- "Households with no vehicle available estimate, 2014-2018 ACS", -->
<!-- "Flag - the percentage of single parent households is in the 90th percentile (1 = yes, 0 = no)", -->
<!-- "Flag - the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no)", -->
<!-- "Flag - the percentage of households with no vehicles is in the 90th percentile (1 = yes, 0 = no)" -->
<!-- ) %>% -->
<!-- rename( -->
<!-- state = "State Name", -->
<!-- county = "COUNTY", -->
<!-- description = "Text description of tract, county, state", -->
<!-- pci = "Per capita income estimate, 2014-2018 ACS", -->
<!-- hu = "Housing units estimate, 2014-2018 ACS", -->
<!-- crowd = "At household level (occupied housing units), more people than rooms estimate, 2014-2018 ACS", -->
<!-- munit = "Housing in structures with 10 or more units estimate, 2014-2018 ACS", -->
<!-- sngpnt = "Single parent household with children under 18 estimate, 2014-2018 ACS", -->
<!-- noveh = "Households with no vehicle available estimate, 2014-2018 ACS", -->
<!-- f_sngpnt = "Flag - the percentage of single parent households is in the 90th percentile (1 = yes, 0 = no)", -->
<!-- f_crowd = "Flag - the percentage of crowded households is in the 90th percentile (1 = yes, 0 = no)", -->
<!-- f_noveh = "Flag - the percentage of households with no vehicles is in the 90th percentile (1 = yes, 0 = no)" -->
<!-- ) -->

<!-- sp_dat <- -->
<!-- janitor::clean_names(wa_dat2) %>% -->
<!-- filter(county %in% c("King", "Pierce", "Snohomish", "Spokane", "Clark"), state == "WASHINGTON") %>% -->
<!-- mutate(across(where(is.numeric), ~na_if(., -999))) %>% -->
<!-- select(!c(state)) %>% -->
<!-- drop_na() -->

<!-- write_csv(sp_dat, "socioeco_cdc.csv") -->
<!-- ``` -->




5 changes: 5 additions & 0 deletions resources/dictionary.txt
@@ -9,6 +9,7 @@ Amador
anewma
annualDosage
anonymize
ATSDR
autauga
ava
ay
@@ -103,6 +104,7 @@ fwang
Gardiner
gdp
geoms
Geospatial
Gerd
gg
ggplot
@@ -123,12 +125,14 @@ hrbr
http
HTTPS
https
hu
Humphries
HW
hydrocodone
ide
ifelse
Ihaka
ILI
IMG
inclusivity
inute
@@ -181,6 +185,7 @@ NISSANs
nizovatina
NonCommercial
nonconfidential
noveh
NWSS
obert
ocs