---
title: An introduction to the NSRR harmonized variable documentation
author: Ying Zhang
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  github_document
---

```{r setup,  include=FALSE}
library(nomnoml)
library(DiagrammeR)
options(knitr.table.format = 'html')
knitr::opts_chunk$set(message = FALSE)
```

This repository contains documentation for the phenotype, sleep monitoring (e.g., polysomnography, polygraphy, actigraphy), and survey questionnaire data harmonized by the [National Sleep Research Resource (NSRR)](https://www.sleepdata.org/), including the harmonization standards, rationale, and scripts for recreating the harmonized data.

## Repository contents

| file | description  |
|---|---|
| `non-sleep-phenotype/`               | A directory containing documentation for harmonized non-sleep covariates. |
| `sleep-monitoring/` | A directory containing documentation and scripts for creating harmonized sleep monitoring data. |
| `sleep-questionnaire/`            |  A directory containing documentation and scripts for creating harmonized sleep questionnaire data.  |
| `harmonized_variables/variables/`  | A directory containing JSON objects for NSRR harmonized variables. |
| `harmonized_variables/domains/`  | A directory containing JSON objects of coded values for categorical harmonized variables. |
| `README.md`  | This document. |
| `README.Rmd`  | Source for this document. |

__Table of Contents__

- [Overview](#overview)
- [Non-Sleep Phenotype Data](#non-sleep-phenotype-data)
- [Polysomnography and Polygraphy Data](#polysomnography-and-polygraphy-data)
- [Sleep Questionnaire Data](#sleep-questionnaire-data)
- [Resources](#resources)

---

## Overview
The objectives of data harmonization are (1) to identify core sleep terms that are semantically similar, (2) to assess the heterogeneity in the study-level and variable-level metadata, (3) to improve the comparability of the harmonized terms, and (4) to make harmonized terms inferentially equivalent when possible. We identified three types of data currently shared on the NSRR website that are most likely to be valuable to our users once harmonized: commonly used non-sleep covariates, sleep monitoring data (e.g., polysomnography/polygraphy), and sleep questionnaire data. We've developed domain-specific approaches to address the challenges in harmonizing each of these types of data.
Harmonized data were created using SAS and saved as a separate data file in each dataset on the NSRR website. All SAS scripts are publicly available on [github.com/nsrr](https://github.com/nsrr).

Note that the current harmonization process is not data-driven; NSRR plans to assess how similar the harmonized data are across demographically similar datasets.

## Non-Sleep Phenotype Data
[TOPMed DCC's harmonized phenotypes](https://github.com/UW-GAC/topmed-dcc-harmonized-phenotypes) and the [BioDataCatalyst Common Data Model](https://github.com/uc-cdis/gtex-dictionary/tree/master/gdcdictionary/schemas/) were identified as the harmonization standards for the non-sleep phenotype data. We have harmonized 9 non-sleep phenotype variables: age (`nsrr_age`), sex (`nsrr_sex`), race (`nsrr_race`), ethnicity (`nsrr_ethnicity`), body mass index (`nsrr_bmi`), current smoker (`nsrr_current_smoker`), ever smoker (`nsrr_ever_smoker`), and resting systolic (`nsrr_bp_systolic`) and diastolic (`nsrr_bp_diastolic`) blood pressure.

### Data format
The original data format was preserved for numeric variables; categorical variables (e.g., sex, race) were recoded as strings. All permissible values for the harmonized categorical variables are available as JSON files in `non-sleep-phenotype/`.

### Missing value
For numerical variables, missing values are represented as empty cells in the data files. For categorical variables, explicit missing codes are used to comply with the TOPMed/BioDataCatalyst harmonization standards. Unspecified missing values are coded as "not reported" for most categorical variables.
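
As a rough illustration of these two conventions, the sketch below recodes a categorical value to a string label, uses "not reported" when the value is missing, and leaves a missing numeric value as an empty cell. It is written in Python for brevity rather than in the SAS used for the actual harmonization. The source codes and labels in `SEX_LABELS` and both function names are hypothetical; the real permissible values are those in the JSON files.

```python
# Hypothetical source coding for a categorical variable; the real permissible
# values are defined in the JSON domain files, not here.
SEX_LABELS = {1: "male", 2: "female"}


def harmonize_categorical(value, labels, missing_label="not reported"):
    """Recode a source value to its string label, with an explicit missing code."""
    if value is None:
        return missing_label
    # An unmapped code raises KeyError so it gets reviewed rather than hidden.
    return labels[value]


def harmonize_numeric(value):
    """Keep the numeric value as-is; a missing value becomes an empty cell."""
    return "" if value is None else value


# Hypothetical source row with both values missing.
row = {"sex": None, "bmi": None}
harmonized = {
    "nsrr_sex": harmonize_categorical(row["sex"], SEX_LABELS),
    "nsrr_bmi": harmonize_numeric(row["bmi"]),
}
print(harmonized)
```

Raising on an unmapped code, instead of silently falling back to "not reported", is a design choice of this sketch so that unexpected source values are noticed during review.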

## Polysomnography and Polygraphy Data
Retrospective polysomnography/polygraphy (PSG) data harmonization focuses on summary PSG data submitted by the data owners, which were either manually scored by scorers or automatically scored by the device. Prospective PSG data harmonization using EDF files can be done with the NSRR [NAP pipeline](https://zzz.bwh.harvard.edu/luna/nap/).
As illustrated in the process diagram below, we took an iterative approach to developing harmonized summary PSG terms. We first identified the initial target terms (e.g., AHI3%, AHI4%, total sleep duration, overall arousal index, and sleep/wake signal quality flag). We then mapped out the key steps in the PSG data generation process, which tend to be the major sources of heterogeneity. After identifying the varying components in the study-level and variable-level metadata that most affect the comparability of PSG data, we went back to the initial target terms and decided whether (and how) to modify and expand them to accommodate the heterogeneity. We then mapped the data deposited at the NSRR to the refined target terms and assessed whether any additional sources of heterogeneity should be accounted for. During the process we also standardized the metadata for the target terms, which includes harmonized term definitions, provenance information (how to document the source data and any processing of the source data), and the standardized NSRR tags for "forward-compatible" harmonization.

```{nomnoml echo=F}
#stroke: #16733e
#direction: down
#.box: fill=#8f8 dashed visual=ellipse direction=right
#zoom: 1
[<frame>Development of Harmonized Terms|
  [Identify|Target term(s)]
  [Outline| Data generation process]
  [Identify] -:> [Outline]
  [Assess| Source of heterogeneity]
  [Outline]-:>[Assess]
  [Study-level| e.g. sleep test type]
  [Variable-level| e.g.%desat]
  [Assess]+->  [Study-level]
  [Assess]+->  [Variable-level]
  [Refine| Target term(s)]
  [Study-level]o->[Refine]
  [Variable-level]o->[Refine]
  [Map| Deposited Data]
  [Map]<:--[Refine]
  [Map]--:>[Assess]
  [Refine]-[<note>Standardized metadata||- Definition|- Provenance|- NSRR tags]
]
```

Using the NSRR tags, we can leverage the data/metadata curation that we do routinely at NSRR to quickly map and create new harmonized data. As shown in the diagram below, once the raw data are curated and each variable is assigned an NSRR tag based on a compositional coding scheme we developed at NSRR, we can simply search the new dataset for any of the eligible tags listed in the metadata of the harmonized term(s). When there is a match, we conduct an in-depth review of all the associated metadata to make sure the harmonized term(s) are mapped correctly, and once the mapping is confirmed we create a harmonized version with a new variable name and metadata.

```{nomnoml echo=F}
#stroke: #16733e
#direction: down
#zoom: 1
[<frame>Harmonizing Data|
  [Raw Data]
  [<abstract>Study-level Metadata|- NSRR metadata intake form|- MOP|...]
  [<abstract>Variable-level Metadata|- Data dictionary|- Codebook|...]
  [Curated Data||NSRR tags|- Standardized|- Compositional|- Expandable|- Machine-readable]
  [Raw Data]-:>[Curated Data]
  [Curated Data]<--incorporate[Study-level Metadata]
  [Curated Data]<--incorporate[Variable-level Metadata]
  [Harmonized Term||Eligible NSRR tags|- tag1|- tag2|...]
  [Matched Variables|var_name:tag2]
  [Curated Data]-:>[Matched Variables]
  [Harmonized Term]map-->[Matched Variables]
  [Matched Variables]-:>[Review Metadata]
  [Review Metadata]-:>[Create Harmonized Data|- Recoding|- Renaming(nsrr_var)|- Standardizing metadata]
]
```
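
The tag search step can be sketched as a simple lookup. The Python below is a minimal illustration, not the NSRR tooling: the three `ELIGIBLE_TAGS` entries are copied from the "Eligible NSRR tags" tables elsewhere in this document, while the function name `find_candidates`, the source variable names, and the non-matching tag in `curated` are made up for the example.

```python
# A few entries copied from the "Eligible NSRR tags" tables in this document.
ELIGIBLE_TAGS = {
    "nsrr_ahi_hp3r_aasm15": ["ahi_ap0uhp3x3r_f1t1", "ahi_ap0uhp3x3r_f1t2"],
    "nsrr_ahi_hp4u_aasm15": ["ahi_ap0uhp3x4u_f1t1", "ahi_ap0uhp3x4u_f1t2"],
    "nsrr_ttldursp_f1": ["ttldursp_f1t1", "ttldursp_f1t2", "ttldursp_f1t3"],
}


def find_candidates(curated_tags, eligible_tags=ELIGIBLE_TAGS):
    """Return {harmonized term: [source variables]} for variables whose tag matches.

    curated_tags maps a source variable name to its assigned NSRR tag.
    A match is only a candidate: the metadata review still has to confirm it.
    """
    # Invert the table so each eligible tag points at its harmonized term.
    tag_to_term = {
        tag: term for term, tags in eligible_tags.items() for tag in tags
    }
    candidates = {}
    for var_name, tag in curated_tags.items():
        term = tag_to_term.get(tag)
        if term is not None:
            candidates.setdefault(term, []).append(var_name)
    return candidates


# Hypothetical curated dataset; variable names and the last tag are invented.
curated = {
    "ahi_rec": "ahi_ap0uhp3x3r_f1t1",
    "tst_min": "ttldursp_f1t1",
    "neck_cm": "some_unrelated_tag",
}
print(find_candidates(curated))
```

The lookup only narrows down which variables to review; recoding, renaming, and standardizing the metadata happen after the mapping has been confirmed by hand.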


### Apnea-hypopnea Indices
After carefully evaluating the main sources of heterogeneity in PSG data generation and processing across datasets, we've identified four varying components in the metadata: sleep test type (i.e., type I or II polysomnography, type III or IV polygraphy/home sleep apnea test), flow reduction, oxygen desaturation, and arousal. The diagram below shows how the indices commonly referred to as AHI3% and AHI4% can vary across these components and produce many more permutations.

```{r,echo=FALSE}
grViz("
digraph boxes_and_circles {

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # 'node' statements
  node [shape = box,
        fontname = Helvetica]
  type_I_II[label = 'Polysomnography (Type I or II)'];type_III_IV[label = 'Polygraphy/HSAT (Type III or IV)'];
  ahi[label = 'Apnea-Hypopnea Index'];rei[label = 'Respiratory Event Index'];ap[label = 'Apneas'];hp[label = 'Hypopneas'];
  flow_90[label = '>=90% flow reduction'];
  flow_50[label = '>=50% flow reduction'];flow_30[label = '>=30% flow reduction'];flow_0[label = '>50% flow reduction\n or discernible flow reduction'];
  desat_3[label = '>=3% oxygen desat'];desat_4[label = '>=4% oxygen desat'];desat_03[label = '>=3% oxygen desat'];
  arousal_or[label = 'or with arousal'];arousal_un[label = 'w/ or w/o arousal'];arousal_0[label = 'or with arousal'];

  # 'edge' statements
  edge [color = grey, arrowhead = none, arrowtail = none]
  type_I_II->ahi type_III_IV->rei ahi->ap rei->ap ap->flow_90 flow_90->hp
  hp->flow_50 hp->flow_30 hp->flow_0
  flow_50->desat_3 flow_50->desat_4 flow_30->desat_3 flow_30->desat_4 flow_0->desat_03
  desat_3->arousal_or desat_3->arousal_un desat_4->arousal_or desat_4->arousal_un desat_03->arousal_0
}
")
```
Among the 13 permutations of AHI3% and AHI4% illustrated in the diagram above, we've mapped a subset to the American Academy of Sleep Medicine (AASM) clinical guidelines and consolidated the permutations into 10 harmonized variables: 7 AHIs and 3 REIs.

| Year | Version  | Definition for Hypopnea Events | Event Duration | Reduction Duration Criteria | Consolidated Coding |
|---|---|---|---|---|---|
| 1999 |   | >50% nasal cannula reduction or discernible nasal cannula reduction with (>=3% desat or arousal) | >=10 sec | | chicago1999 |
| 2007 | Recommended | >=30% nasal cannula [or alternative sensor] reduction with >=4% desat | >=10 sec | >=90% of the event duration | same as hp4u_aasm15 |
| 2007 | Alternative | >=50% nasal cannula [or alternative sensor] reduction with  (>=3% desat or arousal) |  >=10 sec | >=90% of the event duration | hp3r_aasm07 |
| 2012 | Recommended | >=30% nasal cannula [or alternative sensor] reduction with  (>=3% desat or arousal) |  >=10 sec | >=10 sec | same as hp3r_aasm15 |
| 2012 | Alternative | >=30% nasal cannula [or alternative sensor] reduction  with >=4% desat |  >=10 sec | >=10 sec | same as hp4u_aasm15 |
| 2015 | Recommended | >=30% nasal cannula [or alternative sensor] reduction with (>=3% desat or arousal) |  >=10 sec | >=10 sec | hp3r_aasm15 |
| 2015 | Acceptable | >=30% nasal cannula [or alternative sensor] reduction with (>=4% desat) |  >=10 sec | >=10 sec | hp4u_aasm15 |


```{r,echo=FALSE}
library(DiagrammeR)

grViz("
digraph boxes_and_circles {

  graph [overlap = true, fontsize = 15,layout = dot, rankdir = LR]

  node [shape = box,
        fixedsize = FALSE,
        fontname = Helvetica]
  ahi[label = 'Apnea-Hypopnea Index'];
  ahi_5x[label = '>=50% flow reduction']; ahi_3x[label = '>=30% flow reduction']; ahi_5x_0x[label = '>50% flow reduction\n or discernible flow reduction'];
  ahi_5x3[label = '>=3% desat']; ahi_5x4[label = '>=4% desat']; ahi_3x3[label = '>=3% desat']; ahi_3x4[label = '>=4% desat'];
  ahi_5x0_0x3[label = '>=3% desat'];
  ahi_5x3u[label = 'w/ or w/o arousal'];ahi_5x3r[label = 'or with arousal'];ahi_5x4u[label = 'w/ or w/o arousal'];ahi_5x4r[label = 'or with arousal'];ahi_3x3u[label = 'w/ or w/o arousal'];ahi_3x3r[label = 'or with arousal'];ahi_3x4u[label = 'w/ or w/o arousal'];ahi_3x4r[label = 'or with arousal'];ahi_5x0u_0x3r[label = 'or with arousal'];
   node [shape = oval,
        fixedsize = FALSE,
        fillcolor=grey,
        width = 2.5]
  tag_hp3u[label = 'nsrr_ahi_hp3u',fillcolor=grey, style=filled];tag_hp3r_aasm07[label = 'nsrr_ahi_hp3r_aasm07',fillcolor=grey, style=filled];tag_hp4u[label = 'nsrr_ahi_hp4u',fillcolor=grey, style=filled];tag_hp4r[label = 'nsrr_ahi_hp4r',fillcolor=grey, style=filled];tag_hp3r_aasm15[label = 'nsrr_ahi_hp3r_aasm15',fillcolor=grey, style=filled];tag_hp4u_aasm15[label = 'nsrr_ahi_hp4u_aasm15',fillcolor=grey, style=filled];tag_chicago1999[label = 'nsrr_ahi_chicago1999',fillcolor=grey, style=filled];

  ahi->{ahi_5x ahi_3x ahi_5x_0x}
  ahi_5x->{ahi_5x3 ahi_5x4} ahi_3x->{ahi_3x3 ahi_3x4} ahi_5x_0x->ahi_5x0_0x3
  ahi_5x3->{ahi_5x3u ahi_5x3r} ahi_5x4->{ahi_5x4u ahi_5x4r} ahi_3x3->{ahi_3x3u ahi_3x3r} ahi_3x4->{ahi_3x4u ahi_3x4r} ahi_5x0_0x3->ahi_5x0u_0x3r
 ahi_5x3u->tag_hp3u ahi_5x3r->tag_hp3r_aasm07 ahi_5x4u->tag_hp4u ahi_5x4r->tag_hp4r ahi_3x3u->tag_hp3u ahi_3x3r->tag_hp3r_aasm15 ahi_3x4u->tag_hp4u_aasm15 ahi_3x4r->tag_hp4r ahi_5x0u_0x3r->tag_chicago1999
}
")
```

```{r,echo=FALSE}
grViz("
digraph boxes_and_circles {

  graph [overlap = true, fontsize = 15,layout = dot, rankdir = LR]

  node [shape = box,
        fixedsize = FALSE,
        fontname = Helvetica]
  rei[label = 'Respiratory Event Index'];
  rei_5x[label = '>=50% flow reduction'];rei_3x[label = '>=30% flow reduction'];
  rei_5x3[label = '>=3% desat'];rei_5x4[label = '>=4% desat'];rei_3x3[label = '>=3% desat'];rei_3x4[label = '>=4% desat'];

  node [shape = oval,
        fixedsize = FALSE,
        fillcolor=grey,
        width = 2.5]
  tag_rei_hp3n[label = 'nsrr_rei_hp3n',fillcolor=grey, style=filled];tag_rei_hp4n[label = 'nsrr_rei_hp4n',fillcolor=grey, style=filled];tag_rei_hp4n_aasm15[label = 'nsrr_rei_hp4n_aasm15',fillcolor=grey, style=filled];

  rei->{rei_5x rei_3x}
  rei_5x->{rei_5x3 rei_5x4} rei_3x->{rei_3x3 rei_3x4}
  rei_5x3->tag_rei_hp3n rei_5x4->tag_rei_hp4n rei_3x3->tag_rei_hp3n rei_3x4->tag_rei_hp4n_aasm15
}
")
```
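
The consolidation drawn in the two diagrams above can also be written as a pair of lookup tables. The sketch below transcribes the diagram edges into Python dictionaries; the key layout (flow reduction threshold, desaturation threshold, arousal rule) and the function name `harmonized_term` are illustrative choices, not part of the NSRR tooling. The Chicago 1999 branch is left out because its flow criterion is not a single threshold.

```python
# Keys: (minimum % flow reduction, minimum % desaturation, arousal rule).
# "r" = "or with arousal", "u" = "w/ or w/o arousal", as in the AHI diagram.
AHI_TERMS = {
    (50, 3, "u"): "nsrr_ahi_hp3u",
    (50, 3, "r"): "nsrr_ahi_hp3r_aasm07",
    (50, 4, "u"): "nsrr_ahi_hp4u",
    (50, 4, "r"): "nsrr_ahi_hp4r",
    (30, 3, "u"): "nsrr_ahi_hp3u",
    (30, 3, "r"): "nsrr_ahi_hp3r_aasm15",
    (30, 4, "u"): "nsrr_ahi_hp4u_aasm15",
    (30, 4, "r"): "nsrr_ahi_hp4r",
}

# REI keys have no arousal component: (flow threshold, desaturation threshold).
REI_TERMS = {
    (50, 3): "nsrr_rei_hp3n",
    (50, 4): "nsrr_rei_hp4n",
    (30, 3): "nsrr_rei_hp3n",
    (30, 4): "nsrr_rei_hp4n_aasm15",
}


def harmonized_term(index, flow, desat, arousal=None):
    """Look up the harmonized variable for one combination of criteria."""
    if index == "ahi":
        return AHI_TERMS[(flow, desat, arousal)]
    if index == "rei":
        return REI_TERMS[(flow, desat)]
    raise ValueError(f"unknown index type: {index!r}")


print(harmonized_term("ahi", 30, 3, "r"))
print(harmonized_term("rei", 50, 4))
```

Several combinations share a target, which is how the permutations collapse: the eight AHI combinations listed here map to six distinct variables (seven with `nsrr_ahi_chicago1999`), and the four REI combinations map to three.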

Harmonized AHI/REI terms are shown as grey ovals in the diagrams above and can be searched on the NSRR website. Links to the source variables are available in the metadata.

| Harmonized terms | Eligible NSRR tags |
|---|---|
| nsrr_ahi_hp3u | ahi_ap0uhp5x3u_f1t1, ahi_ap0uhp5x3u_f1t2, ahi_ap0uhp3x3u_f1t1, ahi_ap0uhp3x3u_f1t2 |
| nsrr_ahi_hp3r_aasm07 | ahi_ap0uhp5x3r_f1t1, ahi_ap0uhp5x3r_f1t2 |
| nsrr_ahi_hp3r_aasm15 | ahi_ap0uhp3x3r_f1t1, ahi_ap0uhp3x3r_f1t2 |
| nsrr_ahi_hp4u | ahi_ap0uhp5x4u_f1t1, ahi_ap0uhp5x4u_f1t2 |
| nsrr_ahi_hp4u_aasm15 | ahi_ap0uhp3x4u_f1t1, ahi_ap0uhp3x4u_f1t2 |
| nsrr_ahi_hp4r | ahi_ap0uhp5x4r_f1t1, ahi_ap0uhp5x4r_f1t2, ahi_ap0uhp3x4r_f1t1, ahi_ap0uhp3x4r_f1t2 |
| nsrr_ahi_chicago1999 | ahi_ap0uhp5x0u_ap0uhp0x3r_f1t1, ahi_ap0uhp5x0u_ap0uhp0x3r_f1t2 |
| nsrr_rei_hp4n | rei_ap0nhp5x4n_f1t3, rei_ap0nhp5x4n_f1t4 |
| nsrr_rei_hp4n_aasm15 | rei_ap0nhp3x4n_f1t3, rei_ap0nhp3x4n_f1t4 |
| nsrr_rei_hp3n | rei_ap0nhp5x3n_f1t3, rei_ap0nhp5x3n_f1t4, rei_ap0nhp3x3n_f1t3, rei_ap0nhp3x3n_f1t4 |

### Total Sleep Duration

Terms that describe time intervals versus points in time are one of the major sources of confusion and ambiguity in PSG sleep architecture metadata. Depending on the context and usage, “time” might refer to a specific point in time or to an interval between two time points (e.g., study start time versus total sleep time). “Duration” and “period” can both be taken to mean intervals, but they were often used interchangeably in protocol descriptions and data dictionaries, without any indication of whether they referred to intervals between designated time points or to the specific intervals during which subjects were determined to be awake or asleep. We've therefore developed a set of standardized terms to address the inconsistent use of "time", "period", and "duration" in the NSRR's internal metadata documentation.

```{r,echo=FALSE}
mermaid("
sequenceDiagram
  Recording start time->>Inbed time: Lights on
  Inbed time->>Sleep onset: Sleep onset latency
  Sleep onset->>Wake: Sleep duration
  Wake->>Fall back to sleep: WASO
  Fall back to sleep->>Sleep offset: Sleep duration
  Sleep offset->>Outbed time: Snooze
  Outbed time->>Recording end time: Lights on
  Sleep onset->>Sleep offset: Sleep period
  Inbed time->>Outbed time: Inbed period
  Recording start time->>Recording end time: Recording period
")
```

Total sleep duration (i.e., total sleep time) is defined as the portion of the interval between sleep onset and sleep offset during which the subject was asleep; it is an important measure used in calculating various indices. Among studies with type I/II polysomnography data, we've mapped all total sleep duration variables and created a harmonized variable `nsrr_ttldursp_f1`, in which `f1` refers to the source of the data (i.e., f1 = polysomnography/polygraphy, f2 = actigraphy, f3 = sleep questionnaire).

| Harmonized terms | Eligible NSRR tags |
|---|---|
|nsrr_ttldursp_f1|ttldursp_f1t1, ttldursp_f1t2, ttldursp_f1t3|
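
To make the distinction between sleep period and total sleep duration concrete, the sketch below derives both from a list of per-epoch sleep/wake labels. It is a hypothetical Python illustration of the standardized terms, not how the deposited summary variables were computed: the 30-second epoch length, the "S"/"W" labels, and the function name `sleep_intervals` are all assumptions.

```python
EPOCH_SEC = 30  # assumed epoch length; not specified in this document


def sleep_intervals(epochs, epoch_sec=EPOCH_SEC):
    """Summarize per-epoch labels ("S" = asleep, "W" = awake), in minutes."""
    asleep = [i for i, label in enumerate(epochs) if label == "S"]
    if not asleep:
        return {"sleep_period": 0.0, "waso": 0.0, "total_sleep_duration": 0.0}
    onset = asleep[0]        # first sleep epoch = sleep onset
    offset = asleep[-1] + 1  # end of the last sleep epoch = sleep offset
    sleep_period = (offset - onset) * epoch_sec / 60
    total_sleep = len(asleep) * epoch_sec / 60
    return {
        "sleep_period": sleep_period,         # onset to offset, wake included
        "waso": sleep_period - total_sleep,   # wake after sleep onset
        "total_sleep_duration": total_sleep,  # asleep epochs only
    }


# Awake in bed for 2 epochs, asleep 4, awake 2 (WASO), asleep 2, awake 1.
epochs = ["W", "W", "S", "S", "S", "S", "W", "W", "S", "S", "W"]
print(sleep_intervals(epochs))
```

In this toy recording the sleep period spans 8 epochs (4 minutes), of which 6 epochs (3 minutes) are total sleep duration and 2 epochs (1 minute) are WASO.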

### Overall Arousal Index
The overall arousal index is defined as the total number of arousals divided by the total sleep duration (i.e., total sleep time). We've mapped all overall arousal indices from datasets with type I/II polysomnography and created a harmonized variable `nsrr_phrnumar_f1`.

| Harmonized terms | Eligible NSRR tags |
|---|---|
|nsrr_phrnumar_f1| phrnumar_f1t1, phrnumar_f1t2|
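
As a worked example of the definition, the Python sketch below divides an arousal count by the total sleep duration. The units are an assumption, since they are not stated here: the sketch takes sleep duration in minutes and reports arousals per hour of sleep. The function name `arousal_index` is hypothetical.

```python
def arousal_index(n_arousals, total_sleep_minutes):
    """Arousals per hour of sleep; None when sleep duration is missing or not positive."""
    if total_sleep_minutes is None or total_sleep_minutes <= 0:
        return None
    return n_arousals / (total_sleep_minutes / 60)


# 84 arousals over 420 minutes (7 hours) of sleep.
print(arousal_index(84, 420))  # 12.0
```

Returning `None` for a zero or missing denominator keeps the result consistent with the empty-cell convention for missing numeric values.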

### Sleep/Wake Signal Quality Flag
We selected the sleep/wake signal quality flag because it is commonly misunderstood and misused by NSRR users. When the EEG was of insufficient quality to distinguish sleep stages, all sleep stages were scored using a default of “Stage 2” and no arousals were scored. Studies and individuals with "sleep/wake only" instead of "full scoring" for `nsrr_flag_spsw` should not be used to analyze sleep architecture or arousals.
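
In practice this means filtering on the flag before any sleep architecture or arousal analysis. The Python sketch below shows one way to do that, assuming the flag is stored with the literal values "full scoring" and "sleep/wake only" quoted above; the record layout, the `id` field, and the function name `usable_for_architecture` are hypothetical.

```python
def usable_for_architecture(records, flag_field="nsrr_flag_spsw"):
    """Keep only records whose sleep/wake flag indicates full scoring."""
    return [r for r in records if r.get(flag_field) == "full scoring"]


# Hypothetical records; real data files carry many more variables.
records = [
    {"id": "a", "nsrr_flag_spsw": "full scoring"},
    {"id": "b", "nsrr_flag_spsw": "sleep/wake only"},
    {"id": "c"},  # flag missing: excluded rather than assumed to be fully scored
]
print([r["id"] for r in usable_for_architecture(records)])
```

Records with a missing flag are dropped here as a conservative default; whether that is appropriate depends on how the flag was populated in a given dataset.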

## Sleep Questionnaire Data