1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@
*.DS_Store
docs/
..Rcheck/
/.claude
77 changes: 77 additions & 0 deletions ceps/cep-007-diet/PR-148-review-summary.md
@@ -0,0 +1,77 @@
# PR #148 review summary: diet score

**PR**: #148 (diet score)
**Author**: caitlink12, rafdoodle (latest commit)
**Target branch**: adl-additions
**Review date**: 2026-02-11
**Commit reviewed**: 16a8f3ab

## Scope

Reviewed 9 derived FVC/diet variables (FVCDFRU, FVCDSAL, FVCDPOT, FVCDCAR, FVCDVEG, FVCDJUI, FVCDTOT, diet_score, diet_score_cat3) and 30 raw FVC variables (FVC_1A through FVC_6E) across PUMF and Master databases, cycles 2001 through 2017-2018.

## Changes in this PR

1. Added `_m` (master) databases for all FVC/diet variables (cchs2001_m through cchs2017_2018_m)
2. Added `_s` (deprecated share) databases
3. Added explicit `variableStart` mappings for `_m` databases (pre-2007 cycle letters, 2015+ renames)
4. Added `units` field to FVCD* variables (times/day)
5. Fixed `cchs20013_2014_m` typo to `cchs2013_2014_m` (FVCDPOT, FVCDJUI)
6. Added `cchs2017_2018_m` to diet_score and diet_score_cat3
7. Expanded FVC_*A-E raw variables from `_s`-only to full master cycle coverage
8. ADL variables also modified (outside stated diet scope, not reviewed)

## Post-approval commit

yulric approved on 2025-12-04 at commit a612bdee. rafdoodle pushed commit 16a8f3ab on 2026-02-10 (after approval), adding cchs2017_2018_m and units fields.

## Checks performed

### L3-L5 worksheet checks

| Check | Result |
|-------|--------|
| Era boundary defaults | PASS - All FVCD* variables have explicit 2015+ and pre-2007 mappings; `[VAR]` default only covers 2007-2014 |
| databaseStart consistency | PASS - variables.csv and variable_details.csv match for all FVC/diet vars |
| PUMF/Master naming | PASS - _m databases use correct ungrouped names |
| Pre-2007 cycle letters | PASS - A (2001), C (2003), E (2005) correctly applied |
| Known error patterns | One issue found (see below) |
| DV specification review | PASS - diet_score_fun() inputs match worksheet; diet_score_fun_cat() correctly chains |
| Unit tests | Exist but minimal (2 tests each for diet_score_fun and diet_score_fun_cat) |

### L6 PUMF integration test

`rec_with_table()` ran successfully for all 12 PUMF cycles. Cross-cycle prevalence for the 9 main cycles (the annual 2010, 2012, and 2014 cycles are omitted from this table):

| Cycle | N | diet_score valid % | diet_score_cat3 distribution |
|-------|---|-------------------|------------------------------|
| cchs2001_p | 200 | 99.5% | 43 poor, 145 fair, 11 adequate |
| cchs2003_p | 200 | 94.0% | 24 poor, 139 fair, 25 adequate |
| cchs2005_p | 200 | 56.0% | 11 poor, 92 fair, 9 adequate (optional content) |
| cchs2007_2008_p | 200 | 94.5% | 26 poor, 137 fair, 26 adequate |
| cchs2009_2010_p | 200 | 93.5% | 21 poor, 142 fair, 24 adequate |
| cchs2011_2012_p | 200 | 88.5% | 15 poor, 128 fair, 34 adequate |
| cchs2013_2014_p | 200 | 91.0% | 14 poor, 136 fair, 32 adequate |
| cchs2015_2016_p | 200 | 95.5% | 7 poor, 161 fair, 23 adequate |
| cchs2017_2018_p | 200 | 1.0% | 2 fair, 198 NA(a) (optional content) |

No step changes at era boundaries. The 2005 and 2017-2018 dips are expected (FVC was optional content in those cycles). The 2014-2015 transition is clean, confirming 2015+ variable renames (FVCDVFRU, FVCDVGRN, etc.) are correctly mapped.

Master (`_m`) mappings validated by worksheet checks only -- no runtime data available for L6 testing.
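
The valid % column can be cross-checked against the category counts, because every valid `diet_score` should fall into exactly one `diet_score_cat3` category. A minimal sketch in R, with the values transcribed by hand from the table above; the data frame `l6` and its column names are illustrative and are not part of the PR:

```r
# Values transcribed from the PUMF integration test table above.
# The object name and column names are illustrative only.
l6 <- data.frame(
  cycle     = c("cchs2001_p", "cchs2003_p", "cchs2005_p", "cchs2007_2008_p",
                "cchs2009_2010_p", "cchs2011_2012_p", "cchs2013_2014_p",
                "cchs2015_2016_p", "cchs2017_2018_p"),
  n         = 200,
  valid_pct = c(99.5, 94.0, 56.0, 94.5, 93.5, 88.5, 91.0, 95.5, 1.0),
  poor      = c(43, 24, 11, 26, 21, 15, 14, 7, 0),
  fair      = c(145, 139, 92, 137, 142, 128, 136, 161, 2),
  adequate  = c(11, 25, 9, 26, 24, 34, 32, 23, 0)
)

# The category counts should reproduce the valid % for every cycle.
l6$cat_pct    <- 100 * (l6$poor + l6$fair + l6$adequate) / l6$n
l6$consistent <- abs(l6$cat_pct - l6$valid_pct) < 1e-8

l6[, c("cycle", "valid_pct", "cat_pct", "consistent")]
```

A mismatch in any row would suggest that some valid scores are not being assigned a category, or that the two columns were produced from different runs.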

## Issues found

### Issue 1: `chs` typo in FVC_* database names (confidence: 100)

All 30 raw FVC variables (FVC_1A through FVC_6E) use `chs2011_2012_m` and `chs2013_2014_m` instead of `cchs2011_2012_m` and `cchs2013_2014_m` in both `variables.csv` and `variable_details.csv`. The leading `c` is missing.

- This typo was **introduced by this PR** (the target branch does not have it for FVC_* variables)
- The same typo already exists in other variables (ADL etc.) on the target branch, which is likely where it was copied from
- Impact: These database names will fail to match any actual CCHS database, causing FVC_* variables to be unavailable when processing master data for 2011-2012 and 2013-2014 cycles
- Fix: Replace `chs2011_2012_m` with `cchs2011_2012_m` and `chs2013_2014_m` with `cchs2013_2014_m` in all 30 FVC_* variables across both CSV files
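
One way to apply and verify the fix is a lookbehind pattern that matches `chs` followed by a four-digit year only when no letter precedes it, so correctly spelled `cchs...` names are untouched. The sketch below runs on a made-up excerpt. The `databaseStart` column name comes from the worksheet checks above, but the row contents are invented, and the real fix would have to be applied to both CSV files:

```r
# Hypothetical excerpt standing in for variables.csv; row contents are invented.
vars <- data.frame(
  variable = c("FVC_1A", "FVC_6E"),
  databaseStart = c(
    "cchs2009_2010_m, chs2011_2012_m, chs2013_2014_m, cchs2015_2016_m",
    "chs2011_2012_m, cchs2013_2014_m"
  ),
  stringsAsFactors = FALSE
)

# Match "chs" + 4-digit year only when it is NOT preceded by a letter,
# so correctly spelled "cchs..." names are left alone.
typo_pattern <- "(?<![A-Za-z])chs(?=\\d{4})"

# Record which rows are affected, then repair them.
has_typo <- grepl(typo_pattern, vars$databaseStart, perl = TRUE)
vars$databaseStart <- gsub(typo_pattern, "cchs", vars$databaseStart, perl = TRUE)

# After the fix, no row should match the pattern any more.
remaining <- sum(grepl(typo_pattern, vars$databaseStart, perl = TRUE))
```

Because the same pattern is reported for ADL variables on the target branch, running the detection step over every row (not only `FVC_*`) would show how far the typo has spread.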

## CEP artifacts

- `ceps/cep-007-diet/PR-148-review-summary.md` (this file)
- `ceps/cep-007-diet/integration-test-diet.R` (executable PUMF integration test)
- `ceps/cep-007-diet/diet-pumf-integration-test.csv` (test results)
15 changes: 15 additions & 0 deletions ceps/cep-007-diet/_quarto.yml
@@ -0,0 +1,15 @@
project:
  type: website
  output-dir: _site

website:
  title: "CEP-007: Diet variables"
  navbar:
    left:
      - text: "Cross-cycle prevalence"
        href: cep-007-diet.qmd

format:
  html:
    toc: true
    code-fold: true
206 changes: 206 additions & 0 deletions ceps/cep-007-diet/cep-007-diet.qmd
@@ -0,0 +1,206 @@
---
title: "CEP-007: Diet variable cross-cycle prevalence"
author: "cchsflow review (PR #148)"
date: "2026-02-11"
format:
  html:
    toc: true
    code-fold: true
---

## Overview

PR #148 adds master (`_m`) database support and 2017-2018 coverage for 9 diet
variables: FVCDFRU, FVCDSAL, FVCDPOT, FVCDCAR, FVCDVEG, FVCDJUI (continuous
inputs), FVCDTOT, diet_score (derived), and diet_score_cat3 (categorical derived).

This document visualises the L6 integration test results from
`rec_with_table()` run against all PUMF cycles. The purpose is to detect
era boundary harmonization errors — sudden shifts in exposure distributions
at the 2007 or 2015 transitions that would signal a mapping or naming problem.

**Known data patterns:**

- **2005**: FVC was optional content (BC, ON, AB, PE only) — expect ~56% valid
- **2017-2018**: FVC was optional content — expect ~1% valid
- **2015 variable rename**: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, etc.
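
The era boundary check can also be expressed numerically. The sketch below compares the last comparable cycle before each boundary with the first one at or after it, skipping the two optional-content cycles. It uses the `diet_score` valid % values from the PR-148 review summary. The helper `boundary_shift()` and the 10-point threshold are assumptions made for illustration, not project conventions:

```r
# diet_score valid % by cycle start year, from the PR-148 review summary.
valid <- data.frame(
  year      = c(2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017),
  valid_pct = c(99.5, 94.0, 56.0, 94.5, 93.5, 88.5, 91.0, 95.5, 1.0)
)

# Hypothetical helper: difference in valid % across one era boundary,
# ignoring cycles where FVC was optional content.
boundary_shift <- function(boundary, d, skip_years = c(2005, 2017)) {
  d <- d[!d$year %in% skip_years, ]
  before <- d[d$year < boundary, ]
  after  <- d[d$year >= boundary, ]
  after$valid_pct[which.min(after$year)] -
    before$valid_pct[which.max(before$year)]
}

shifts <- sapply(c(`2007` = 2007, `2015` = 2015), boundary_shift, d = valid)

threshold <- 10  # assumed tolerance in percentage points
flagged <- abs(shifts) > threshold
```

With these values the shifts are 0.5 points at 2007 (2003 to 2007) and 4.5 points at 2015 (2013 to 2015), so neither boundary is flagged under the assumed threshold.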

```{r}
#| label: setup
#| message: false
#| warning: false

results <- read.csv("diet-pumf-integration-test.csv", stringsAsFactors = FALSE)

# Extract start year from cycle name for x-axis
results$year <- as.numeric(gsub("cchs(\\d{4}).*", "\\1", results$cycle))

# Keep only the main biennial cycles (drop annual: 2010, 2012, 2014)
main_cycles <- c("cchs2001_p", "cchs2003_p", "cchs2005_p",
                 "cchs2007_2008_p", "cchs2009_2010_p",
                 "cchs2011_2012_p", "cchs2013_2014_p",
                 "cchs2015_2016_p", "cchs2017_2018_p")
main <- results[results$cycle %in% main_cycles, ]
```

## FVC input variables: valid % across cycles

Each FVC input variable should track closely. A divergence between variables
within the same cycle would indicate a variable-specific mapping error.

```{r}
#| label: fig-fvc-valid-pct
#| fig-cap: "Valid % for FVC input variables across CCHS PUMF cycles"
#| fig-width: 9
#| fig-height: 5

fvc_vars <- c("FVCDFRU", "FVCDSAL", "FVCDPOT", "FVCDCAR", "FVCDVEG", "FVCDJUI")
fvc_colours <- c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#a65628")

plot(NULL, xlim = c(2001, 2018), ylim = c(0, 105),
     xlab = "CCHS cycle start year", ylab = "Valid %",
     main = "FVC input variables: cross-cycle valid %",
     xaxt = "n")
axis(1, at = unique(main$year), labels = unique(main$year), las = 2)

# Era boundary reference lines
abline(v = 2007, lty = 2, col = "grey60")
abline(v = 2015, lty = 2, col = "grey60")
# pos = 2 places the label to the left of the 2007 line, over the pre-2007 cycles
text(2007, 103, "pre-2007 era", pos = 2, cex = 0.7, col = "grey40")
text(2015, 103, "2015+ era", pos = 4, cex = 0.7, col = "grey40")

for (i in seq_along(fvc_vars)) {
  v <- fvc_vars[i]
  d <- main[main$variable == v, ]
  d <- d[order(d$year), ]
  lines(d$year, d$valid_pct, type = "b", pch = 19, col = fvc_colours[i], lwd = 1.5)
}

legend("bottomleft", legend = fvc_vars, col = fvc_colours,
       lwd = 1.5, pch = 19, cex = 0.7, bg = "white")
```

**Interpretation:** All six FVC variables track together across cycles, which is
expected since they come from the same survey module. The 2005 dip (~57-60%) and
2017-2018 dip (~1%) reflect optional content — not harmonization errors. The
2014 → 2015 transition is clean with no step change, confirming the 2015+
variable renames (FVCDVFRU, FVCDVGRN, etc.) are correctly mapped.
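
The "track together" claim can be checked by computing the spread of valid % across input variables within each cycle. The sketch below uses the same column names the report reads from `diet-pumf-integration-test.csv` (`cycle`, `variable`, `valid_pct`), but the numbers are invented to show one passing and one failing cycle, and the 5-point tolerance is an assumption:

```r
# Illustrative long-format results; the values are invented, not test output.
demo <- data.frame(
  cycle     = rep(c("cchs2013_2014_p", "cchs2015_2016_p"), each = 3),
  variable  = rep(c("FVCDFRU", "FVCDSAL", "FVCDPOT"), times = 2),
  valid_pct = c(96.0, 95.5, 95.0,   # inputs agree closely
                97.0, 96.5, 60.0),  # one input diverges
  stringsAsFactors = FALSE
)

# Spread = max minus min valid % across variables within each cycle.
spread <- tapply(demo$valid_pct, demo$cycle, function(x) diff(range(x)))

max_spread <- 5  # assumed tolerance in percentage points
divergent <- names(spread)[as.vector(spread) > max_spread]
```

Any cycle listed in `divergent` would point to a variable-specific mapping problem rather than a module-wide pattern such as optional content.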

## diet_score: cross-cycle valid %

The derived `diet_score` should have lower valid % than individual FVC inputs
(since all inputs must be non-missing for the score to compute), but should
follow the same cross-cycle pattern.

```{r}
#| label: fig-diet-score-valid
#| fig-cap: "diet_score valid % compared to FVCDFRU input"
#| fig-width: 9
#| fig-height: 5

ds <- main[main$variable == "diet_score", ]
ds <- ds[order(ds$year), ]
fru <- main[main$variable == "FVCDFRU", ]
fru <- fru[order(fru$year), ]

plot(ds$year, ds$valid_pct, type = "b", pch = 19, col = "#e41a1c", lwd = 2,
     xlim = c(2001, 2018), ylim = c(0, 105),
     xlab = "CCHS cycle start year", ylab = "Valid %",
     main = "diet_score vs FVCDFRU: cross-cycle valid %",
     xaxt = "n")
axis(1, at = unique(ds$year), labels = unique(ds$year), las = 2)
lines(fru$year, fru$valid_pct, type = "b", pch = 17, col = "#377eb8", lwd = 1.5)

abline(v = c(2007, 2015), lty = 2, col = "grey60")

legend("bottomleft", legend = c("diet_score", "FVCDFRU (input)"),
       col = c("#e41a1c", "#377eb8"), pch = c(19, 17), lwd = c(2, 1.5),
       cex = 0.8, bg = "white")
```

**Interpretation:** diet_score tracks slightly below its FVC inputs (as expected
— all 6 inputs must be valid). The gap is small (1-6 percentage points),
confirming the DV function is not excessively dropping valid cases.
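
The reason the derived score sits below its inputs is that a respondent counts as valid only when every input is non-missing, so the score's valid % can never exceed that of the least complete input. A toy sketch with invented respondent-level values and only three of the six inputs:

```r
# Toy respondent-level data; NA marks a missing input. Values are invented.
inputs <- data.frame(
  FVCDFRU = c(1.0, 2.0, NA, 1.5, 0.5, 2.0, 1.0, 3.0, 1.0, 2.5),
  FVCDSAL = c(0.5, NA, 0.3, 0.4, 0.2, 0.6, 0.1, 0.5, 0.4, 0.3),
  FVCDPOT = c(0.2, 0.3, 0.1, 0.4, 0.2, 0.3, NA, 0.2, 0.1, 0.3)
)

# Valid % of each input on its own.
input_valid_pct <- 100 * colMeans(!is.na(inputs))

# Valid % of a score that needs every input: the share of complete rows.
score_valid_pct <- 100 * mean(complete.cases(inputs))

# Gap between the least complete input and the derived score.
gap <- min(input_valid_pct) - score_valid_pct
```

Here each input is 90% valid but the missing values fall on different respondents, so only 70% of rows are complete. The gap grows when missingness is spread across respondents and shrinks when the same respondents are missing every input, which is the case for optional-content cycles.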

## diet_score_cat3: category distribution across cycles

This is the most important harmonization check for the categorical derived variable.
A sudden shift in the poor/fair/adequate distribution at an era boundary would
signal a recoding or mapping error.

```{r}
#| label: fig-diet-cat-distribution
#| fig-cap: "diet_score_cat3 distribution across cycles (excluding NA)"
#| fig-width: 9
#| fig-height: 6

# The CSV stores only valid counts per variable, not per-category counts,
# so the category counts below are hard-coded from the PR-148 review summary
# rather than recomputed with rec_with_table().
# Category labels as used in this report:
# Poor (1): 0-2, Fair (2): 2-8, Adequate (3): 8-10

cat_data <- data.frame(
  cycle = c("2001", "2003", "2005", "2007", "2009", "2011", "2013", "2015", "2017"),
  poor = c(43, 24, 11, 26, 21, 15, 14, 7, 0),
  fair = c(145, 139, 92, 137, 142, 128, 136, 161, 2),
  adequate = c(11, 25, 9, 26, 24, 34, 32, 23, 0),
  stringsAsFactors = FALSE
)

# Calculate proportions (among valid respondents only)
cat_data$total <- cat_data$poor + cat_data$fair + cat_data$adequate
cat_data$pct_poor <- round(100 * cat_data$poor / cat_data$total, 1)
cat_data$pct_fair <- round(100 * cat_data$fair / cat_data$total, 1)
cat_data$pct_adequate <- round(100 * cat_data$adequate / cat_data$total, 1)

# Exclude 2017-2018 from the bar chart (only 2 valid respondents)
cat_plot <- cat_data[cat_data$cycle != "2017", ]

# Stacked bar chart
bar_matrix <- rbind(cat_plot$pct_poor, cat_plot$pct_fair, cat_plot$pct_adequate)
colnames(bar_matrix) <- cat_plot$cycle

barplot(bar_matrix,
        col = c("#d73027", "#fee08b", "#1a9850"),
        border = NA,
        main = "diet_score_cat3: exposure distribution across cycles\n(% of valid respondents, N=200 per cycle)",
        ylab = "Percentage",
        xlab = "CCHS cycle start year",
        ylim = c(0, 110),
        las = 1)

# Add era boundary markers
abline(v = 3.7, lty = 2, col = "grey40")  # Between 2005 and 2007
abline(v = 8.5, lty = 2, col = "grey40")  # Between 2013 and 2015

legend("topright",
       legend = c("Adequate (8-10)", "Fair (2-8)", "Poor (0-2)"),
       fill = c("#1a9850", "#fee08b", "#d73027"),
       border = NA, cex = 0.8, bg = "white")
```

**Interpretation:** The distribution is stable across cycles: fair diet (score
2-8) dominates at about 72-84% of valid respondents, consistent with the
population distribution. There is no step change at the 2015 era boundary. The
gradual increase in "adequate" from 2001 (5.5%) to 2013 (17.6%) may reflect real
dietary trends. The 2005 cycle is excluded from trend interpretation due to
optional content (only 112 valid respondents from 4 provinces). The 2017-2018
cycle is excluded entirely (only 2 valid respondents).

## Summary

| Check | Result |
|-------|--------|
| Era boundary (2015) | **Clean** - no step change in valid % or distributions |
| Era boundary (2007) | **Clean** - pre-2007 cycle letter mappings correct |
| FVC input consistency | **PASS** - all 6 inputs track together |
| DV completeness gap | **Acceptable** - diet_score 1-6 percentage points below inputs |
| Category stability | **Stable** - gradual trends, no abrupt shifts |
| Optional content | **Expected** - 2005 and 2017-2018 dips documented |

**Issue found:** 30 raw FVC variables (FVC_1A through FVC_6E) have `chs2011_2012_m`
and `chs2013_2014_m` typos (missing leading `c`). See
[PR-148-review-summary.md](PR-148-review-summary.md) for details.