diff --git a/projects/geospatial.md b/projects/geospatial.md
index bc4ebb8..c0f5d3c 100644
--- a/projects/geospatial.md
+++ b/projects/geospatial.md
@@ -9,7 +9,7 @@
 ## Tutorial framing
-Geospatial data are complex because observations are tied to coordinate systems, geometric boundaries, raster surfaces, and spatial dependence rather than arriving as independent rows in a single analysis-ready table.
+Geospatial data are complex because observations are tied to coordinate systems, geometric boundaries, raster surfaces, and spatial dependence rather than arriving as independent rows in a single, analysis-ready table.
 Students should learn three main things about these data:
 1. How spatial data are represented through vector geometries, raster grids, coordinate reference systems, spatial identifiers, and formats or services such as GeoJSON, Shapefiles, GeoTIFF, WFS, and WMS.
@@ -29,22 +29,22 @@ Students should learn three main things about these data:
 ## Resources
 ### Data sources
-- [PDOK (Public services on the map)](https://www.pdok.nl/), specifically:
+- [PDOK (Publieke Dienstverlening Op de Kaart, Public Services On the Map)](https://www.pdok.nl/), specifically:
 - [Statistics Netherlands' areal boundaries data](https://www.pdok.nl/introductie/-/article/cbs-gebiedsindelingen)
 - [Wageningen university's land-use data](https://www.pdok.nl/introductie/-/article/landelijk-grondgebruik-nederland-lgn-)
-- [Statistics Netherlands core figures](https://www.cbs.nl/nl-nl/maatwerk/2025/40/kerncijfers-wijken-en-buurten-2025)
+- [Statistics Netherlands Key figures for districts and neighborhoods](https://www.cbs.nl/nl-nl/maatwerk/2025/40/kerncijfers-wijken-en-buurten-2025)
-Feel free to use different sources if you want.
+Feel free to use additional sources if you want.
 ### Knowledge sources
 - R packages `sf` and `terra`
 - The book [Geocomputation with R](https://r.geocompx.org/) (e.g.
chapter on raster-vector interactions and data I/O)
-- Find your own resources on spatial autoregressive models: CAR.
+- Find your own resources on spatial autoregressive models: conditional autoregressive (CAR) and simultaneous autoregressive (SAR) models.
 ## Week-by-week
 ### Week 1:
-Start from raw spatial files or web services, identify the data generating process, and explain vector/raster or point/polygon structure before doing any modeling.
+Start from raw spatial files or web services, identify the data generating/collection process, and explain vector/raster or point/polygon structure before doing any modeling. Visualize the data in the most appropriate way.
 - What is the standard key identifier for municipalities in the Netherlands?
 - Can we connect directly to PDOK from R to retrieve all municipalities' boundaries? Or can we download the information?
 - Can we connect to PDOK from R to retrieve land-use information?
@@ -57,9 +57,10 @@ Prepare for the roundtable of week 2:
 ### Week 2
 Operationalize the research question by turning raw geometry-linked files into one analysis table, and document why the data were stored in that format.
-- How can we create a tidy dataset of municipalities with their land-use and population characteristics to perform statistical modeling?
 - What, exactly, does land-use mean?
 - What dimensions of population composition do we find relevant?
+- How can we create a tidy dataset of municipalities with their land-use and population characteristics to perform statistical modeling?
+
 Prepare for the roundtable of week 3:
 - Explain the main spatial operations: spatial joins, aggregation from grid or point data, etc.
@@ -71,7 +72,7 @@ Fit models, explain preprocessing decisions, and show one sensitivity check to s
 - Do we need to do some transformations, what type, GLM? Or just linear model?
 - Fit a baseline (non-spatial) model first, then test residual spatial dependence (e.g. Moran's I on residuals).
Only escalate to SAR/CAR if the baseline residuals show meaningful spatial structure.
 - Which parameters, specifically, answer our research question?
-- Sensitivity check: show one Modifiable Areal Unit Problem (MAUP) sensitivity — re-run the analysis at a different aggregation level (e.g. neighbourhood vs municipality) or with a different boundary definition, and report whether the conclusion changes.
+- Sensitivity check: show one Modifiable Areal Unit Problem (MAUP) sensitivity, i.e., re-run the analysis at a different aggregation level (e.g. neighbourhood vs municipality) or with a different boundary definition, and report whether the conclusion changes.
 Prepare for the roundtable of week 4:
diff --git a/projects/messy_web_text.md b/projects/messy_web_text.md
index 655f2d6..4bd2326 100644
--- a/projects/messy_web_text.md
+++ b/projects/messy_web_text.md
@@ -12,7 +12,7 @@
 Web text is complex because the data arrive wrapped in markup, navigation, scripts, boilerplate, duplicated page elements, and inconsistent page structure rather than as analysis-ready documents.
 Students should learn three main things about these data:
-1. How web text is produced and represented through HTML, DOM trees, URLs, HTTP requests and responses, CSS, JavaScript, metadata, and page templates.
+1. How web text is produced and represented through HTML, Document Object Model (DOM) trees, URLs, HTTP requests and responses, CSS, JavaScript, metadata, and page templates.
 2. How to turn raw pages into a clean corpus or analysis table by choosing a unit of analysis, extracting meaningful text, removing boilerplate, preserving source metadata, and documenting text-cleaning choices.
 3. How extraction, tokenization, repeated page elements, and publisher purpose affect linguistic features, models, visualizations, and the claims that can be made from a small web corpus.
@@ -21,7 +21,7 @@ Students should learn three main things about these data:
 | Dimension | This project teaches |
 |---|---|
 | Data structure | HTML documents, DOM trees, text corpus, nested page metadata, and document-feature or document-term representations (which are sparse matrices — the same data structure used for network adjacency). |
-| Storage system | Raw downloaded HTML files, with NoSQL/document-store storage such as MongoDB discussed as an optional comparison rather than a required implementation. |
+| Storage system | Raw downloaded HTML files, with NoSQL/document-store storage such as MongoDB (optional comparison rather than a required implementation). |
 | File formats | HTML, JSON metadata or exports, TXT, and CSV/RDS-style clean analysis outputs. |
 | Encoding | UTF-8 text, HTML markup, and JSON serialization for metadata or document-style records. |
 | Model | Group comparison, logistic or linear regression, clustering, or another small interpretable model using transparent text features. |
@@ -30,19 +30,20 @@ Students should learn three main things about these data:
 ## Resources
 ### Data sources
 - Raw HTML pages from corporate sustainability pages and public-interest climate information pages.
-- Possible corporate sources: Shell, ExxonMobil, TotalEnergies, or other firms identified through Orbis or a similar source. [TODO: to be downloaded before the course]
-- Possible public-interest sources: UN climate pages, National Geographic, government climate pages, or climate-focused NGOs. [TODO]
+- Possible corporate sources: Shell, ExxonMobil, TotalEnergies, Siemens, Philips, or other firms identified through Orbis or a similar source. [TODO: to be downloaded before the course]
+- Possible public-interest sources: UN climate pages, National Geographic, government climate pages, or climate-focused NGOs (e.g., IPCC, Carbon Brief, EU Climate Pact). [TODO]
 ### Knowledge sources
-- Basic HTML and DOM tutorials.
-- Python packages such as `requests`, `webSweep`, `beautifulsoup4`, TBD
-- R packages such as `rvest`, TBD
+- Basic HTML and DOM tutorials (e.g. [https://www.geeksforgeeks.org/html/html-tutorial/](https://www.geeksforgeeks.org/html/html-tutorial/), [https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/DOM_scripting](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/DOM_scripting)).
+- Python packages such as `requests`, `webSweep`, `beautifulsoup4`
+- R packages such as `rvest`, `httr`
 ## Week-by-week
 ### Week 1
 Inspect raw HTML, explain who published it and why, and identify the DOM structure and markup noise that matter for extraction.
 - What is HTML? How does it relate to CSS, JavaScript, and server-side systems such as PHP?
+- What are the main uses of HTML compared to other markup languages such as LaTeX or XML?
 - How do users and scripts interact with websites through HTTP or HTTPS requests?
 - What is the unit of raw data in this project: a page, a paragraph, a text block, a sentence, or something else?
@@ -55,9 +56,9 @@ Prepare for roundtable in week 2:
 ### Week 2
 Operationalize the question by turning raw pages into one analysis table with transparent text-cleaning choices.
 - What, exactly, counts as corporate climate communication or public-interest climate communication?
-- Which parts of each page should be kept or removed: headers, menus, cookie banners, captions, footers, links, boilerplate, and repeated slogans?
+- Which parts of each page should be kept or removed (e.g., headers, menus, cookie banners, captions, footers, links, boilerplate, and repeated slogans)?
 - Which text representation fits the research question? Start with a transparent count-based or TF-IDF representation. Embeddings are an **optional** extension only if time allows and only after a count/TF-IDF baseline has been built and interpreted.
-- The document-term matrix you build is typically extremely sparse (most documents do not contain most terms). This is the same sparse-matrix concept that the Networks group teaches with adjacency matrices — note this connection so the two groups can teach it jointly.
+- The document-term matrix you build is typically extremely sparse (most documents do not contain most terms). This is the same sparse-matrix concept that the Networks group teaches with adjacency matrices. Note this connection so the two groups can teach it jointly.
 - **Optional cross-modality reflection:** how is turning text into a model table similar to turning images, audio, or video into model inputs (pixels, spectrograms, frames, embeddings, labels, or extracted features)? Skip if time is tight.
 - What source metadata should stay attached to each unit, such as publisher type, URL, date collected, page title, or page section?
diff --git a/projects/networks.md b/projects/networks.md
index 1dafda6..fc74d3f 100644
--- a/projects/networks.md
+++ b/projects/networks.md
@@ -10,10 +10,10 @@
 ## Tutorial framing
-Network data are complex because observations are connected through ties, direction, weights, missing nodes, and dependence between relations rather than arriving as independent rows in a single analysis-ready table.
+Network data are complex because observations are connected through ties, direction, weights, missing nodes and ties, and dependence between relations rather than being structured as independent rows in a single analysis-ready table.
 Students should learn three main things about these data:
-1. How networks are represented through nodes, edges, edge lists, adjacency matrices, sparse matrices, GraphML, and choices about direction, weight, time, and isolates.
+1. How networks are represented through nodes, edges, edge lists, adjacency matrices, sparse matrices, and GraphML, and how to make critical choices about direction, weight, time, and isolates.
 2.
How to turn raw graph files into a clean network object while documenting what counts as a node, what counts as a tie, and which representation best matches the research question.
 3. How network dependence affects standard statistical assumptions, and how network statistics, reference models, permutation tests, or clustering can support claims about homophily, polarization, centrality, or other network structures.
@@ -39,28 +39,28 @@ Students should learn three main things about these data:
 ### Knowledge sources
-- C/R/Python packages `igraph`,
+- C/R/Python packages `igraph`
 - Introduction to networks
 - Chapter 0 of "A First Course in Network Science": https://github.com/CambridgeUniversityPress/FirstCourseNetworkScience/blob/master/sample/chapters/chapter0.pdf
 - App: https://javier.science/marimo_intro_networks/
 - Guide for reference models: https://pubmed.ncbi.nlm.nih.gov/34216192/
-- Observed network vs latent network: https://www.nature.com/articles/s41467-022-34267-9
+- Observed vs latent networks: https://www.nature.com/articles/s41467-022-34267-9
 ## Week-by-week
 ### Week 1:
 Begin with raw repository files and explain what the network is, who generated it, for what purpose, and the different storage formats.
 - Explain the underlying network in substantive terms: what the nodes and ties represent, and whether the graph is directed or undirected, weighted or unweighted, static or temporal.
-- What is GraphML? How does it relate to XML?
+- What is the GraphML file format? How does it relate to XML? How does it differ from other network file formats?
 - Are adjacency matrices sparse or dense?
-- Read about different layout algorithms.
+- Read about different visualization layout algorithms. Explore static/interactive visualization tools.
 Prepare for roundtable in week 2:
 - What is a network and why is it a useful representation of data?
-- What are the main ways to represent a network: edge lists, adjacency matrices, and XML or GraphML-like
+- What are the main ways to represent a network: edge lists, adjacency matrices, and XML- or GraphML-like formats?
 - What are the advantages and disadvantages of adjacency matrices over edge lists? How do sparse matrices fix this and what are they?
-- How do you visualize a network?
+- How do you visualize a network? What are the pitfalls of basing your analysis on the network visualization alone?
 ### Week 2:
@@ -74,7 +74,7 @@ Operationalize the research question by turning raw graph files into a clean fil
 Prepare for roundtable in week 3:
-- Be able to describe three analyses typically done on networks (e.g. assortativity, centrality, clustering) at a conceptual level, so the rest of the class understands the landscape — but your own project should report only the one statistic and one permutation comparison committed to above.
+- Be able to describe three analyses typically done on networks (e.g. assortativity, centrality, clustering) at a conceptual level, so the rest of the class understands the landscape, but your own project should report only the one statistic and one permutation comparison committed to above.
 - Explain the selection vs influence debate in networks.