From a1d09fd5e23a4bfed96c4fefe0823e854975e32f Mon Sep 17 00:00:00 2001
From: elenacandellone
Date: Thu, 18 Jun 2026 11:26:29 +0200
Subject: [PATCH 1/4] network project updated

---
 projects/networks.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/projects/networks.md b/projects/networks.md
index 1dafda6..fc74d3f 100644
--- a/projects/networks.md
+++ b/projects/networks.md
@@ -10,10 +10,10 @@
 ## Tutorial framing
-Network data are complex because observations are connected through ties, direction, weights, missing nodes, and dependence between relations rather than arriving as independent rows in a single analysis-ready table.
+Network data are complex because observations are connected through ties, direction, weights, missing nodes and ties, and dependence between relations rather than data structured as independent rows in a single analysis-ready table.
 Students should learn three main things about these data:
-1. How networks are represented through nodes, edges, edge lists, adjacency matrices, sparse matrices, GraphML, and choices about direction, weight, time, and isolates.
+1. How networks are represented through nodes, edges, edge lists, adjacency matrices, sparse matrices, GraphML, and how to make critical choices about direction, weight, time, and isolates.
 2. How to turn raw graph files into a clean network object while documenting what counts as a node, what counts as a tie, and which representation best matches the research question.
 3. How network dependence affects standard statistical assumptions, and how network statistics, reference models, permutation tests, or clustering can support claims about homophily, polarization, centrality, or other network structures.
@@ -39,28 +39,28 @@ Students should learn three main things about these data:
 ### Knowledge sources
-- C/R/Python packages `igraph`,
+- C/R/Python packages `igraph`
 - Introduction to networks
   - Chapter 0 of "A First Course in Network Science": https://github.com/CambridgeUniversityPress/FirstCourseNetworkScience/blob/master/sample/chapters/chapter0.pdf
   - App: https://javier.science/marimo_intro_networks/
 - Guide for reference models: https://pubmed.ncbi.nlm.nih.gov/34216192/
-- Observed network vs latent network: https://www.nature.com/articles/s41467-022-34267-9
+- Observed vs latent networks: https://www.nature.com/articles/s41467-022-34267-9
 ## Week-by-week
 ### Week 1:
 Begin with raw repository files and explain what the network is, who generated it, for what purpose, and the different storage formats.
 - Explain the underlying network in substantive terms: what the nodes and ties represent, and whether the graph is directed or undirected, weighted or unweighted, static or temporal.
-- What is GraphML? How does it relate to XML?
+- What is the GraphML data type? How does it relate to XML? How is this different from other network data types?
 - Are adjacency matrices sparse or dense?
-- Read about different layout algorithms.
+- Read about different visualization layout algorithms. Explore static/interactive visualization tools.
 Prepare for roundtable in week 2:
 - What is a network and why is it a useful representation of data?
-- What are the main ways to represent a network: edge lists, adjacency matrices, and XML or GraphML-like
+- What are the main ways to represent a network: edge lists, adjacency matrices, and XML or GraphML-like?
 - What are the advantages and disadvantages of adjacency matrices over edge lists? How do sparse matrices fix this and what are they?
-- How do you visualize a network?
+- How do you visualize a network? What could be the pitfalls of having your analysis based on the network visualization only?
 ### Week 2:
@@ -74,7 +74,7 @@ Operationalize the research question by turning raw graph files into a clean fil
 Prepare for roundtable in week 3:
-- Be able to describe three analyses typically done on networks (e.g. assortativity, centrality, clustering) at a conceptual level, so the rest of the class understands the landscape — but your own project should report only the one statistic and one permutation comparison committed to above.
+- Be able to describe three analyses typically done on networks (e.g. assortativity, centrality, clustering) at a conceptual level, so the rest of the class understands the landscape, but your own project should report only the one statistic and one permutation comparison committed to above.
 - Explain the selection vs influence debate in networks.
From ba983cd6c6e60e1913de6e836fe1ef52c83e0343 Mon Sep 17 00:00:00 2001
From: elenacandellone
Date: Fri, 19 Jun 2026 16:51:40 +0200
Subject: [PATCH 2/4] geospatial project

---
 projects/geospatial.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/projects/geospatial.md b/projects/geospatial.md
index bc4ebb8..c0f5d3c 100644
--- a/projects/geospatial.md
+++ b/projects/geospatial.md
@@ -9,7 +9,7 @@
 ## Tutorial framing
-Geospatial data are complex because observations are tied to coordinate systems, geometric boundaries, raster surfaces, and spatial dependence rather than arriving as independent rows in a single analysis-ready table.
+Geospatial data are complex because observations are tied to coordinate systems, geometric boundaries, raster surfaces, and spatial dependence rather than arriving as independent rows in a single, analysis-ready, table.
 Students should learn three main things about these data:
 1. How spatial data are represented through vector geometries, raster grids, coordinate reference systems, spatial identifiers, and formats or services such as GeoJSON, Shapefiles, GeoTIFF, WFS, and WMS.
@@ -29,22 +29,22 @@ Students should learn three main things about these data:
 ## Resources
 ### Data sources
-- [PDOK (Public services on the map)](https://www.pdok.nl/), specifically:
+- [PDOK (Publieke Dienstverlening Op de Kaart, Public Services On the Map)](https://www.pdok.nl/), specifically:
   - [Statistics Netherlands' areal boundaries data](https://www.pdok.nl/introductie/-/article/cbs-gebiedsindelingen)
   - [Wageningen university's land-use data](https://www.pdok.nl/introductie/-/article/landelijk-grondgebruik-nederland-lgn-)
-- [Statistics Netherlands core figures](https://www.cbs.nl/nl-nl/maatwerk/2025/40/kerncijfers-wijken-en-buurten-2025)
+- [Statistics Netherlands Key figures for districts and neighborhoods](https://www.cbs.nl/nl-nl/maatwerk/2025/40/kerncijfers-wijken-en-buurten-2025)
-Feel free to use different sources if you want.
+Feel free to use additional sources if you want.
 ### Knowledge sources
 - R packages `sf` and `terra`
 - The book [Geocomputation with R](https://r.geocompx.org/) (e.g. chapter on raster-vector interactions and data I/O)
-- Find your own resources on spatial autoregressive models: CAR.
+- Find your own resources on spatial autoregressive models: conditional autoregressive model (CAR) and simultaneously autoregressive model (SAR).
 ## Week-by-week
 ### Week 1:
-Start from raw spatial files or web services, identify the data generating process, and explain vector/raster or point/polygon structure before doing any modeling.
+Start from raw spatial files or web services, identify the data generating/collection process, and explain vector/raster or point/polygon structure before doing any modeling. Visualize the data in the most appropriate way.
 - What is the standard key identifier for municipalities in the Netherlands?
 - Can we connect directly to PDOK from R to retrieve all municipalities' boundaries? Or can we download the information?
 - Can we connect to PDOK from R to retrieve land-use information?
@@ -57,9 +57,10 @@ Prepare for the roundtable of week 2:
 ### Week 2
 Operationalize the research question by turning raw geometry-linked files into one analysis table, and document why the data were stored in that format.
-- How can we create a tidy dataset of municipalities with their land-use and population characteristics to perform statistical modeling?
 - What, exactly, does land-use mean?
 - What dimensions of population composition do we find relevant?
+- How can we create a tidy dataset of municipalities with their land-use and population characteristics to perform statistical modeling?
+
 Prepare for the roundtable of week 3:
 - Explain the main spatial operations: spatial joins, aggregation from grid or point data, etc.
@@ -71,7 +72,7 @@ Fit models, explain preprocessing decisions, and show one sensitivity check to s
 - Do we need to do some transformations, what type, GLM? Or just linear model?
 - Fit a baseline (non-spatial) model first, then test residual spatial dependence (e.g. Moran's I on residuals). Only escalate to SAR/CAR if the baseline residuals show meaningful spatial structure.
 - Which parameters, specifically, answer our research question?
-- Sensitivity check: show one Modifiable Areal Unit Problem (MAUP) sensitivity — re-run the analysis at a different aggregation level (e.g. neighbourhood vs municipality) or with a different boundary definition, and report whether the conclusion changes.
+- Sensitivity check: show one Modifiable Areal Unit Problem (MAUP) sensitivity, i.e., re-run the analysis at a different aggregation level (e.g. neighbourhood vs municipality) or with a different boundary definition, and report whether the conclusion changes.
 Prepare for the roundtable of week 4:
From 15c90bf16b9b211037ff34608e5e2ee186fb163d Mon Sep 17 00:00:00 2001
From: elenacandellone
Date: Wed, 1 Jul 2026 14:52:48 +0200
Subject: [PATCH 3/4] web text

---
 projects/messy_web_text.md | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/projects/messy_web_text.md b/projects/messy_web_text.md
index 655f2d6..4bd2326 100644
--- a/projects/messy_web_text.md
+++ b/projects/messy_web_text.md
@@ -12,7 +12,7 @@
 Web text is complex because the data arrive wrapped in markup, navigation, scripts, boilerplate, duplicated page elements, and inconsistent page structure rather than as analysis-ready documents.
 Students should learn three main things about these data:
-1. How web text is produced and represented through HTML, DOM trees, URLs, HTTP requests and responses, CSS, JavaScript, metadata, and page templates.
+1. How web text is produced and represented through HTML, Document Object Model (DOM) trees, URLs, HTTP requests and responses, CSS, JavaScript, metadata, and page templates.
 2. How to turn raw pages into a clean corpus or analysis table by choosing a unit of analysis, extracting meaningful text, removing boilerplate, preserving source metadata, and documenting text-cleaning choices.
 3. How extraction, tokenization, repeated page elements, and publisher purpose affect linguistic features, models, visualizations, and the claims that can be made from a small web corpus.
@@ -21,7 +21,7 @@ Students should learn three main things about these data:
 | Dimension | This project teaches |
 |---|---|
 | Data structure | HTML documents, DOM trees, text corpus, nested page metadata, and document-feature or document-term representations (which are sparse matrices — the same data structure used for network adjacency). |
-| Storage system | Raw downloaded HTML files, with NoSQL/document-store storage such as MongoDB discussed as an optional comparison rather than a required implementation. |
+| Storage system | Raw downloaded HTML files, with NoSQL/document-store storage such as MongoDB (optional comparison rather than a required implementation). |
 | File formats | HTML, JSON metadata or exports, TXT, and CSV/RDS-style clean analysis outputs. |
 | Encoding | UTF-8 text, HTML markup, and JSON serialization for metadata or document-style records. |
 | Model | Group comparison, logistic or linear regression, clustering, or another small interpretable model using transparent text features. |
@@ -30,19 +30,20 @@ Students should learn three main things about these data:
 ## Resources
 ### Data sources
 - Raw HTML pages from corporate sustainability pages and public-interest climate information pages.
-- Possible corporate sources: Shell, ExxonMobil, TotalEnergies, or other firms identified through Orbis or a similar source. [TODO: to be downloaded before the course]
-- Possible public-interest sources: UN climate pages, National Geographic, government climate pages, or climate-focused NGOs. [TODO]
+- Possible corporate sources: Shell, ExxonMobil, TotalEnergies, Siemens, Philips, or other firms identified through Orbis or a similar source. [TODO: to be downloaded before the course]
+- Possible public-interest sources: UN climate pages, National Geographic, government climate pages, or climate-focused NGOs (e.g., IPCC, Carbon Brief, EU Climate Pact). [TODO]
 ### Knowledge sources
-- Basic HTML and DOM tutorials.
-- Python packages such as `requests`, `webSweep`, `beautifulsoup4`, TBD
-- R packages such as `rvest`, TBD
+- Basic HTML and DOM tutorials (e.g. [https://www.geeksforgeeks.org/html/html-tutorial/](https://www.geeksforgeeks.org/html/html-tutorial/), [https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/DOM_scripting](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/DOM_scripting)).
+- Python packages such as `requests`, `webSweep`, `beautifulsoup4`
+- R packages such as `rvest`, `httr`
 ## Week-by-week
 ### Week 1
 Inspect raw HTML, explain who published it and why, and identify the DOM structure and markup noise that matter for extraction.
 - What is HTML? How does it relate to CSS, JavaScript, and server-side systems such as PHP?
+- What are the main uses of HTML compared to other markup languages such as LaTeX, XML?
 - How do users and scripts interact with websites through HTTP or HTTPS requests?
 - What is the unit of raw data in this project: a page, a paragraph, a text block, a sentence, or something else?
@@ -55,9 +56,9 @@ Prepare for roundtable in week 2:
 ### Week 2
 Operationalize the question by turning raw pages into one analysis table with transparent text-cleaning choices.
 - What, exactly, counts as corporate climate communication or public-interest climate communication?
-- Which parts of each page should be kept or removed: headers, menus, cookie banners, captions, footers, links, boilerplate, and repeated slogans?
+- Which parts of each page should be kept or removed (e.g., headers, menus, cookie banners, captions, footers, links, boilerplate, and repeated slogans)?
 - Which text representation fits the research question? Start with a transparent count-based or TF-IDF representation. Embeddings are an **optional** extension only if time allows and only after a count/TF-IDF baseline has been built and interpreted.
-- The document-term matrix you build is typically extremely sparse (most documents do not contain most terms). This is the same sparse-matrix concept that the Networks group teaches with adjacency matrices — note this connection so the two groups can teach it jointly.
+- The document-term matrix you build is typically extremely sparse (most documents do not contain most terms). This is the same sparse-matrix concept that the Networks group teaches with adjacency matrices. Note this connection so the two groups can teach it jointly.
 - **Optional cross-modality reflection:** how is turning text into a model table similar to turning images, audio, or video into model inputs (pixels, spectrograms, frames, embeddings, labels, or extracted features)? Skip if time is tight.
 - What source metadata should stay attached to each unit, such as publisher type, URL, date collected, page title, or page section?
From 654fc39d08986f507492a22200163e6dbad20d0a Mon Sep 17 00:00:00 2001
From: elenacandellone
Date: Fri, 3 Jul 2026 13:15:19 +0200
Subject: [PATCH 4/4] time series

---
 projects/time_series.md | 300 +++++++++++++++++++++++++++++----------
 1 file changed, 222 insertions(+), 78 deletions(-)

diff --git a/projects/time_series.md b/projects/time_series.md
index f73484b..5f0e1aa 100644
--- a/projects/time_series.md
+++ b/projects/time_series.md
@@ -1,122 +1,266 @@
-# Time Series Project: Scientific Data Standards and Temporal Signals
+# Time Series Project: Eye Tracking Signals and Filtering
-- Project name: `scientific_time_series`
-- Research question (example): __How does a signal change over time within a participant, task, or experimental condition?__
-- Programming language: **`Python` suggested for raw-data processing** (BIDS/NIfTI/event-file handling: `nibabel`, `pybids`, `nilearn`, `numpy`, `pandas`, and the NSD-specific `nsdcode` / `nsd_access`). **`R` suggested for the modeling and analysis stage** (`lme4`, `nlme`, `tidyverse`) once Week 2 has produced the participant-session-ROI panel. Students may stay in one language throughout if they prefer, but the default split is Python → panel → R.
-- Expert contact: TBD, Ben Harvey?
+- Project name: `nsd_eye_tracking_time_series`
+- Research question: __During repeated Natural Scenes Dataset (NSD) image presentations, is gaze movement lower while the target image is on screen than during nearby periods when that target is not on screen?__
+- Optional extension: __Does the filtered pupil-size signal change in the seconds after image onset?__
+- Programming language: `R` suggested.
+- Expert contact: TBD, Roy Hessels?
 > **Canonical course conventions live in [project_guidelines.md](../project_guidelines.md).** That file is the source of truth for the four required workflow files (`week1_explore.qmd`, `week2_operationalize_clean.qmd`, `week3_model.qmd`, `week4_storytelling.qmd`), the `data/model_data.rds` -> `data/model_results.rds` pipeline, the raw-data policy, quality-check requirements, decision logs, and contribution tracking. Read it before starting and treat anything below as project-specific guidance on top of those conventions.
+![NSD eye-tracking movement over repeated image presentations](/assets/img/projects/nsd_eye_tracking_repetition_trace_check.png)
+
+*Example from an NSD eye-tracking run: gaze traces from six usable 3-second repeated presentations of the same target image. Each panel maps gaze onto the actual image as a 4.0 x 4.0 degree square, matching the helper relationship `x_plot = (x + 2) / 4` and `y_plot = (2 - y) / 4`. Color shows seconds after image onset; white and black dots mark the first and last usable samples.*
+
 ## Tutorial framing
-Scientific time-series data are complex because observations are ordered, repeated, metadata-dependent, and often stored in domain-specific standards designed for reproducibility rather than immediate analysis as a flat table.
+Eye-tracking is a good time-series project because the raw object is a dense
+signal: time, horizontal gaze, vertical gaze, pupil area, blinks, saccades, and
+task messages. The small scientific question is about movement during image
+viewing. The programming lesson is how to access and turn raw signal files into a regular, filtered, event-aligned time series.
+
+Students should learn four things:
+
+1. How eye-tracking data are represented as samples, event intervals, task
+   messages, device-specific files, and inspection plots.
+2. Why file formats matter: raw EyeLink EDF files, NSD's MATLAB `.mat`
+   preprocessing, and a student-created RDS cache expose different parts of the
+   provenance chain.
+3. How filters work on a noisy signal. Students should compare a raw trace with
+   at least two simple filters, then choose one filter for the analysis.
+4. How to fit a tiny time-series model that accounts for autocorrelation instead
+   of assuming that each sample is independent.
-Students should learn three main things about these data:
-1. How scientific time-series data are represented through samples, events, timestamps, participant metadata, task metadata, calibration or acquisition settings, and standards such as BIDS-style folder structures, NIfTI, EDF/ASC, TSV sidecars, JSON metadata, HDF5, or NetCDF.
-2. How to turn a raw temporal scientific object into an analysis-ready panel or time-series table by defining the signal, unit of analysis, time window, alignment rule, missing-data rule, and feature extraction choices.
-3. How temporal dependence, sampling rate, smoothing, aggregation, lag construction, and scientific metadata affect modeling, visualization, assumptions, and the claims that can be made from the data.
+The core research question is intentionally modest:
+
+> Is the filtered gaze-velocity signal lower during target-image viewing windows
+> than during nearby periods when that target is not on screen?
+
+The project should not become a full psychology project about why someone looked
+at a particular surfer, object, or region. It should also not make preprocessing
+the research question. Filtering is part of the method. A filter-width change can
+be a sensitivity check, but the main question is about gaze movement during an
+experimental event.
+
+The fixation/saccade literature is still useful, but mainly as a warning about
+language. If students label low-velocity periods, they should call them
+computational candidates and report the rule. They should not claim that their
+code has discovered true fixations.
 ## Peer-teaching checklist
 | Dimension | This project teaches |
 |---|---|
-| Data structure | Time-indexed samples or events, multivariate time series, participant/task metadata, and possibly spatiotemporal arrays. |
-| Storage system | Scientific repository or instructor-provided raw dataset organized through a scientific data standard. |
-| File formats | One chosen standard such as BIDS with NIfTI/TSV/JSON sidecars, EDF/ASC eye-tracking exports, HDF5, NetCDF, or comparable domain files. |
-| Encoding | Text metadata or event files, JSON sidecars, and binary scientific signal formats. |
-| Model | Group comparison of extracted temporal features, linear or mixed model, lagged regression, simple classifier, or time-window comparison. |
-| Key aspects to explain | Temporal order, sampling rate, alignment, smoothing, aggregation windows, missing segments, lag construction, scientific metadata, and sensitivity to preprocessing choices. |
+| Data structure | Regularly sampled gaze and pupil time series, missing samples, event windows, blink/saccade intervals, and run-level metadata. |
+| Storage system | Scientific repository on AWS plus a small local RDS cache created from the NSD MATLAB file. |
+| File formats | EyeLink `.edf`, MATLAB `.mat`, JPG inspection plots, PNG stimulus images, and RDS/CSV outputs created by students. |
+| Encoding | Binary eye-tracker files, MATLAB arrays, numeric time-series tables, and image-based quality-control plots. |
+| Model | A small AR(1)/ARIMA-style model with an event indicator: filtered gaze velocity as the outcome and `image_on` as an external regressor. |
+| Key aspects to explain | Sampling rate, missing samples, blinks, filtering, velocity, event alignment, autocorrelation, AR(1) errors, aggregation to 100 ms bins, one continuous modeling segment, and sensitivity to one filtering choice. |
 ## Resources
 ### Data source
-The practical is built around fMRI data. European fMRI datasets are difficult to share publicly: anything that reveals the detailed structure of an individual brain — including raw fMRI volumes — is typically considered individually identifiable under the GDPR and cannot be released openly. The practical therefore uses an American dataset that is shareable.
-
-Primary dataset: **Natural Scenes Dataset (NSD)** — a high-resolution 7T fMRI dataset of individuals viewing thousands of natural images, with raw BIDS files, prepared NIfTI files, repeated scan sessions, visual ROI masks, behavioral/task event files, and extensive documentation. Access is public through AWS Open Data after signing the NSD data access agreement.
+Use the [**Natural Scenes Dataset (NSD)**](https://naturalscenesdataset.org/) eye-tracking data so this project shares
+provenance with the neuroimaging project but teaches a different data structure.
+NSD access requires accepting the [NSD data terms](https://docs.google.com/forms/d/e/1FAIpQLSduTPeZo54uEMKD-ihXmRhx0hBDdLHNsVyeo_kCb8qbyAkXuQ/viewform).
-- Dataset and documentation: https://naturalscenesdataset.org/
-- Main reference paper: Allen et al. (2022), Nature Neuroscience. https://doi.org/10.1038/s41593-021-00962-x
-- Session-drift / repeated-measures reference (a useful precedent for the kind of question students can replicate): https://doi.org/10.1038/s41467-023-40144-w
+Start with one subject, one run, and one repeated target image. A good teaching
+subset is:
-### Candidate research question
+- Preprocessed subject-level MATLAB file:
+  `s3://natural-scenes-dataset/nsddata_timeseries/ppdata/subj01/eyedata_preprocessed.mat`
+- Raw EyeLink folder to list and discuss, not fully parse:
+  `s3://natural-scenes-dataset/nsddata_timeseries/ppdata/subj01/eyedata/`
+- Eye-tracking inspection plots:
+  `s3://natural-scenes-dataset/nsddata/inspections/eyetrackinginspections/pupil_subj01_nsdimagery_run01.jpg`
+  and
+  `s3://natural-scenes-dataset/nsddata/inspections/eyetrackinginspections/XY_subj01_nsdimagery_run01.jpg`
+- NSD imagery design files:
+  `s3://natural-scenes-dataset/nsddata/experiments/nsdimagery/designmatrixGLMsingle.mat`
+  and the relevant pair-list file, such as `B_pair_list.mat`
+- One or a few small target images from:
+  `s3://natural-scenes-dataset/nsddata/experiments/nsdimagery/rawtargetimages/`
-Good fMRI research has moved well beyond simple summaries — current work uses complex models of neural responsivity, not toy questions. Students do not need to invent a new contribution. Instead they can replicate one of two well-established demonstrations, both supported directly by NSD:
+Here "repeated target image" means that the same stimulus appears multiple times
+within the run. In the example figure, `shared0385_nsd28752.png` is scheduled at
+eight separate onsets in run 2. Each onset starts a 3-second image-presentation
+period, followed by a 1-second rest/fixation period. These are repeated
+presentations of the same image, not eight different screen regions.
-1. Response amplitudes in a visual ROI vary across scan sessions for one participant (the session-drift / repeated-measures phenomenon documented in the reference above).
-2. Animate versus inanimate object categories produce distinguishable responses in many brain areas.
+Do **not** download the full 37 GB `nsd_stimuli.hdf5`, all subjects, all EDF
+files, or any fMRI beta files for this project.
-Either question keeps the project at a defensible size, foregrounds the BIDS/NIfTI raw object, and gives students something real to learn rather than a manufactured small question.
+The raw `.edf` files are the device-native EyeLink recordings. They are important
+for provenance and for the Week 1 open-format discussion. They are not the
+recommended main input because direct EDF parsing in R adds too much tool
+friction. The `.mat` file is the practical starting point because it preserves
+the time-series structure students need while keeping the course workflow small.
-### Alternative: NSD eye-tracking data
-
-If a group has a strong eye-tracking reason to deviate, NSD also includes eye-tracking data on AWS, which keeps the dataset and provenance story consistent:
+### Knowledge sources
-- Raw EyeLink files per run: `s3://natural-scenes-dataset/nsddata_timeseries/ppdata/subj01/eyedata/` (e.g. `eyedata_nsdimagery_run01.edf`)
-- Preprocessed eye-tracking file per subject: `s3://natural-scenes-dataset/nsddata_timeseries/ppdata/subj01/eyedata_preprocessed.mat` (~162 MB for `subj01`)
-- Eye-tracking inspection plots: `s3://natural-scenes-dataset/nsddata/inspections/eyetrackinginspections/` (e.g. `pupil_subj01_nsdimagery_run01.jpg`)
+- Roy Hessels and Ignace Hooge PEP assignments 6 and 7: gaze traces, velocity,
+  filtering, and careful inspection.
+- Hessels et al. (2018), "Is the eye-movement field confused about fixations and
+  saccades?", doi: [https://doi.org/10.1098/rsos.180502](https://doi.org/10.1098/rsos.180502), for the warning that fixation/saccade definitions must be explicit.
+- Hooge et al. (2022), "Fixation classification: how to merge and select fixation
+  candidates", doi: [https://doi.org/10.3758/s13428-021-01723-1](https://doi.org/10.3758/s13428-021-01723-1), for why selection rules should be reported if candidates are used.
+- R packages: `R.matlab`, `dplyr`, `tidyr`, `ggplot2`, `readr`.
+- Useful base R functions: `diff()`, `stats::filter()`, `stats::runmed()`,
+  `stats::acf()`, `stats::arima()`, `is.finite()`, and `aggregate()`.
+- Optional package if students want a more familiar ARIMA interface: `forecast`.
-This is the fallback path, not the default. The main practical is fMRI.
+### Filter choices
-### Knowledge sources
-- BIDS documentation for neuroimaging data organization and metadata.
-- Basic introductions to NIfTI, JSON sidecars, events files, and participant metadata.
-- NSD documentation, the main paper (https://doi.org/10.1038/s41593-021-00962-x), and the session-drift paper (https://doi.org/10.1038/s41467-023-40144-w) on the dataset page.
-- Python packages for raw-data processing: `nibabel` and `nilearn` for NIfTI/ROI handling, `pybids` for BIDS queries, `numpy`, `pandas`, `matplotlib`, the official NSD `nsdcode`, and community helpers such as `nsd_access` or `nsdget`.
-- R packages for the modeling stage (after the panel is built): `lme4` or `nlme` for mixed models, `broom.mixed` for tidy output, `tidyverse` for wrangling, and `ggplot2` for visualization.
+Students should learn what filters do before applying one:
-### Teaching angle
-- Week 1: inspect BIDS metadata, events TSV files, NIfTI headers, ROI masks, and the AWS scientific repository structure.
-- Week 2: extract a participant-session-ROI panel from NIfTI arrays.
-- Week 3: fit a within-subject model that addresses one of the two candidate questions (session drift or animate-vs-inanimate distinction) and one sensitivity check tied to the operationalization.
-- Week 4: visualize the result and explain what was gained and lost by reducing voxelwise fMRI maps to the summary used.
+- A **moving average** smooths high-frequency jitter but blurs fast movements and
+  creates edge artifacts.
+- A **median filter** is robust to isolated spikes but can flatten sharp changes.
+- A **low-pass filter** keeps slow movement and removes fast jitter, but students
+  must explain the cutoff frequency if they use one.
+For the class version, require one simple filter for the final analysis. A
+centered moving average over 5 to 11 samples is enough. The sensitivity check can
+be a second window width, not a large preprocessing contest.
 ## Week-by-week
 ### Week 1
-Start from the raw scientific files, identify the data-generating process, and explain why the data are stored in a standard rather than in one analysis-ready table.
-- What is the scientific object: gaze samples, fixation events, fMRI volumes, task events, or participant-level metadata?
-- What is the storage standard or raw format, and which files belong together?
-- What is the sampling rate or temporal resolution, and how is time represented?
-- Which metadata are required to interpret the signal correctly?
+
+Start from the AWS repository and the downloaded `.mat` file. The goal is to
+understand what the raw scientific object is before filtering anything.
+
+Week 1 exact data checklist:
+
+- Read and accept the [NSD data terms](https://docs.google.com/forms/d/e/1FAIpQLSduTPeZo54uEMKD-ihXmRhx0hBDdLHNsVyeo_kCb8qbyAkXuQ/viewform).
+- Download `eyedata_preprocessed.mat` for `subj01`.
+- Download the two inspection JPGs for one run.
+- List the raw EDF folder, but do not download every EDF file.
+- Download only the small `nsdimagery` design/pair-list files needed to identify
+  one repeated target-image window.
+- Download one small target PNG if the group wants to make an overlay.
+- Save a small cached extract such as `data/model_data.rds` only after students
+  have documented which raw fields it came from.
+
+Week 1 questions:
+
+- What is one row in the sample table?
+- What is the sampling rate after preprocessing?
+- Is the `valid_ratio_pct` reported for this run compatible with the actual missingness?
+- Which columns represent time, x gaze, y gaze, and pupil area?
+- Which file tells us when the target image is on screen? +- What is a target-image presentation, and how is it different from a screen + region or image file? +- Which data are samples, which are events, and which are inspection plots? Prepare for roundtable in week 2: -- Explain why temporal order is itself a data structure and why it cannot be treated like independent rows. -- Explain what a scientific data standard is and why standards such as BIDS, NIfTI plus JSON sidecars, EDF/ASC exports, HDF5, or NetCDF exist. -- Explain the difference between raw measurements, task events, derived features, and analysis-ready summaries. -- Explain one provenance or power issue: who was measured, under what task or device constraints, and what is invisible in the recorded signal? + +- Explain why eye tracking is a time-series data structure rather than an + independent-row table. +- Explain the provenance chain EDF -> `.mat` -> RDS. Which decisions are visible + at each step, and which are harder to audit? +- Explain why blinks and tracking loss are not ordinary missing values. +- What does the device-native EDF file preserve, what does the NSD `.mat` + preprocessing make easier, and what are the consequences of relying on + proprietary binary formats rather than open, documented, analysis-ready + formats? +- Explain why a project can analyze gaze velocity without claiming to classify + true fixations or saccades. ### Week 2 -Operationalize the research question by turning the raw scientific files into one analysis-ready time-series or panel object. -- What, exactly, is the outcome signal: gaze position, fixation duration, pupil size, regional fMRI signal, task response, or another feature? -- What is the unit of analysis: sample, event, time window, trial, participant, region, or participant-condition? -- How should time be aligned across participants, trials, regions, or task events? 
-- How should gaps, blinks, missing volumes, noisy segments, or implausible values be handled? + +Operationalize the research question by building one small, regular time-series +table. + +- Choose one subject and one run. +- Use the `nsdimagery` design file to create an `image_on` indicator for the + selected target-image windows. +- Convert time to seconds from run start. +- Mark valid samples where x, y, and pupil area are finite. +- Compute gaze displacement and velocity from x/y using `diff()`. +- Aggregate or resample to 100 ms bins to keep the model small. +- For plotting, keep small event windows around presentations, such as 3 seconds + before image onset through 3 seconds after image offset. +- Plot raw velocity and at least two filtered versions. +- For the AR(1) model, keep one continuous segment spanning the first selected + target onset through the last selected target offset, plus a small margin. Do + not paste separate event windows together and then treat them as adjacent time + points. +- Choose one filter for the final model, such as an 11-sample moving average. +- Create `log_velocity_filtered = log1p(velocity_filtered)` so the highly + skewed velocity signal is easier to model. +- Save `data/model_data.rds` with only the columns needed for Week 3: + `time_sec`, `event_id`, `time_from_onset`, `image_on`, `valid_fraction`, + `velocity_raw`, `velocity_filtered`, `log_velocity_filtered`, and optional + `pupil_filtered`. Prepare for roundtable in week 3: -- Explain how aggregation, smoothing, filtering, baseline correction, lag construction, or feature extraction changed the raw signal. -- Explain what is gained and lost when a rich temporal object is reduced to windows, averages, slopes, or event-level summaries. -- Explain one alternative cleaning choice and how it could affect the result. + +- Explain which filters are available, what each one did to the trace, and why the chosen one is reasonable.
+- Explain why filtering can remove jitter but can also blur fast movements. +- Explain how the `image_on` variable was made from the design file. +- Explain why nearby periods from the same continuous run are a better + comparison than unrelated parts of the recording. ### Week 3 -Fit a simple within-subject model on the panel from Week 2, evaluate it, and show one sensitivity check to a processing choice that is actually present in your pipeline. The specific model depends on which of the two candidate RQs the group chose: -- If the question is **session drift in a visual ROI** (RQ 1): fit a linear or mixed model of mean ROI beta on session number for one participant, and the key parameter is the session slope. Sensitivity: alternative ROI definition (V1v vs. V1d vs. combined V1), alternative session-aggregation window, or alternative missing-session rule. -- If the question is **animate vs. inanimate distinguishability** (RQ 2): fit a group comparison or a simple classifier on trial- or condition-level ROI responses, and the key parameter is the contrast / classification metric. Sensitivity: alternative ROI choice, alternative trial selection, or alternative animacy labeling rule. +Fit a small time-series model. Do not fit ordinary sample-level OLS as the main +model, because adjacent samples are autocorrelated. + +Recommended model: + +```r +fit_data <- model_data[ + is.finite(model_data$log_velocity_filtered) & + is.finite(model_data$image_on), +] + +fit <- arima( + fit_data$log_velocity_filtered, + order = c(1, 0, 0), + xreg = fit_data$image_on +) +``` + +Here `order = c(1, 0, 0)` is an AR(1) model: the current value is allowed to +depend on the previous value. The `image_on` coefficient answers the simple +research question. A negative coefficient means gaze movement is lower while the +target image is on screen, after accounting for short-range autocorrelation. +Use a continuous, equally spaced time series for this model. 
Event-aligned +windows are useful for visualization, but they should not be concatenated for the +AR(1) fit. -Common prompts for both RQs: -- Is the goal association, prediction, or causal effect? (For both candidate RQs this is a descriptive within-subject question.) -- Which model is small enough to explain clearly given that the temporal index is sessions or trials, not raw samples? -- Which parameter answers the substantive research question, and what would a null result actually look like? +Show: + +- the autocorrelation plot of `log_velocity_filtered`; +- a naive mean difference for intuition; +- the AR(1) estimate for `image_on`; +- one sensitivity check using a different filter width. + +Save the results as `model_results.rds` for next week's storytelling. + +Avoid a black-box `auto.arima()` search unless the group can explain why it chose +the final model. A fixed AR(1) is enough for this course. Prepare for roundtable in week 4: -- Explain why the temporal index here is session order (RQ 1) or trial structure (RQ 2), not within-trial autocorrelation, and what that implies for which "time-series" concepts apply and which don't. -- Explain how within-subject repeated measurement creates dependence that ordinary i.i.d. regression ignores, and how a mixed model or paired comparison addresses it. -- Explain how the model uses the extracted signal (session-mean beta in an ROI, or condition-level ROI response) and what parts of the original scientific object (voxel-level structure, trial-level events, full BOLD time series) it ignores. -- Explain why sensitivity to ROI choice, aggregation level, and labeling rules is central rather than optional in this kind of work. + +- Explain what autocorrelation means in this signal. +- Explain why an AR(1) model is already more time-series-aware than ordinary OLS. +- Explain which parameter answers the research question and why. +- Explain what changed, if anything, when the filter width changed.
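The Week 2 filtering steps and the Week 3 AR(1) fit with its sensitivity check can be sketched end to end on simulated data. This is a minimal sketch under stated assumptions, not the course solution: the gaze trace is a simulated random walk rather than NSD data, `moving_average()` and `fit_ar1()` are hypothetical helper names, and `method = "ML"` is a choice made here for the sketch rather than something the tutorial prescribes. It uses only base R functions (`diff()`, `stats::filter()`, `arima()`).

```r
# Simulated stand-in for one continuous run. All values and helper names are
# hypothetical; nothing here is read from the NSD files.
set.seed(1)
n <- 600                                   # 600 bins of 100 ms = 60 s
bin <- seq_len(n) - 1
dt <- 0.1                                  # bin width in seconds
time_sec <- bin * dt
image_on <- as.integer((bin %% 120) >= 60) # image on for 6 s of every 12 s

# Gaze position as a random walk whose steps shrink while the image is on.
step_sd <- ifelse(image_on == 1, 0.5, 1.5)
x <- cumsum(rnorm(n, sd = step_sd))
y <- cumsum(rnorm(n, sd = step_sd))

# Velocity: distance between consecutive samples divided by the time step.
# The first sample has no predecessor, so its velocity is NA.
velocity_raw <- c(NA, sqrt(diff(x)^2 + diff(y)^2) / dt)

# Centered moving average; stats::filter() leaves NA at both edges.
moving_average <- function(v, width) {
  as.numeric(stats::filter(v, rep(1 / width, width), sides = 2))
}

# Fit the AR(1) model with image_on as regressor for one filter width and
# return the image_on coefficient.
fit_ar1 <- function(width) {
  log_velocity_filtered <- log1p(moving_average(velocity_raw, width))
  keep <- is.finite(log_velocity_filtered)  # drops only the NA edges
  fit <- arima(
    log_velocity_filtered[keep],
    order = c(1, 0, 0),
    xreg = cbind(image_on = image_on[keep]),
    method = "ML"                           # assumption made for this sketch
  )
  unname(coef(fit)["image_on"])
}

# Sensitivity check: same model, two filter widths.
estimates <- c(width_5 = fit_ar1(5), width_11 = fit_ar1(11))
print(estimates)
```

Because the simulated steps are smaller while `image_on` is 1, both estimates should come out negative here; with real gaze data the sign and size of the coefficient are the empirical question, and the comparison between the two widths is the sensitivity check.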
### Week 4 -Visualize and tell a story about the within-subject result while making the data standard, preprocessing, and model assumptions explicit. -- What is the context? What is the main result? Why is it important? -- Which visualizations best separate raw NIfTI data, the ROI-aggregated summary, and the fitted model? For RQ 1: a per-session line/dot plot of mean beta per ROI with the fitted slope and an uncertainty band. For RQ 2: ROI-level mean response by animacy category, with appropriate uncertainty. -- Which scientific metadata or preprocessing choices (BIDS structure, ROI definition, beta version, session/trial filter, animacy labels) are necessary for someone else to reproduce the result? -- What are the assumptions and limitations of your design, especially the move from voxelwise fMRI to ROI-level summaries? + +Visualize and tell a story about the time-series pipeline. + +- Show the raw gaze trace or velocity trace. +- Show the chosen filtered trace. +- Show the target-image windows as shaded regions on the time axis. +- Show the event-aligned average of filtered velocity around image onset. +- Show the AR(1) model results and explain them in plain language. +- Optionally show the gaze overlay on the target image as a sanity check. + +The final story should make a course-level argument: + +> A time-series result is not only a model output. It depends on the raw file +> format, sampling rate, missing-data handling, filtering, event alignment, +> autocorrelation, and the exact comparison window. \ No newline at end of file