Merged
45 changes: 14 additions & 31 deletions planning/active/progress.md
@@ -1,38 +1,21 @@
# Progress: Issue #7 - .qmd to .py Migration
# Progress: Issue #13 - Source URL Accessibility Validation

## Session: 2026-02-17

### Completed
- [x] Created branch `7-migrate-qmd-to-py`
- [x] Analyzed stac_create_item.qmd (350 LOC, 3 modes, no real R dependency)
- [x] Analyzed stac_create_collection.qmd (270 LOC, R dependency on ngr for S3 fetch)
- [x] Identified 3x duplicated utility functions
- [x] Created planning files (task_plan.md, findings.md, progress.md)

### In Progress
- [x] Phase 1.1: Created `scripts/stac_utils.py` with shared functions
  - `date_extract_from_path()`, `datetime_parse_item()`, `check_geotiff_cog()`
  - `fix_url()`, `url_to_item_id()`, `get_output_dir()`
  - Path constants: `PATH_S3_STAC`, `PATH_S3_JSON`, `PATH_S3`, `PATH_RESULTS_CSV`, `BBOX_BC`
- [x] Phase 1.2: Updated `item_reprocess.py` to import from `stac_utils`
- [x] Phase 1.3: Verified imports and syntax

- [x] Phase 1.4: Renamed all scripts to `noun_verb.py` convention
- [x] Phase 1.5: Updated all cross-references (0 stale refs in code files)
- [x] Phase 2.1: Created `scripts/item_create.py` (argparse CLI, logging, imports stac_utils)
  - All 6 .py scripts pass syntax checks
  - Local PROJ conflict blocks `rio_stac` import (homebrew vs conda) — VM-only testing

- [x] Phase 3.1: Created `scripts/urls_fetch.R` (standalone R script for S3 key fetch)
- [x] Phase 3.2: Created `scripts/collection_create.py` (argparse CLI, logging, imports stac_utils)
- [x] Phase 4.1: Updated `scripts/build_safe.sh` to use new scripts (no more quarto render)
  - Added urls_fetch.R step, validation step with item_validate.py

### Next Up
- [ ] Phase 4.2: Archive .qmd files
- [ ] Phase 4.3: Update CLAUDE.md
- [ ] Phase 4.4: Update README.md
- [ ] Test equivalence on VM
- [x] Archived issue #7 planning files
- [x] Created fresh planning files for issue #13
- [x] Explored pipeline and validation patterns
- [x] Step 2: Added `check_url_accessible()` to stac_utils.py
- [x] Step 3: Created `scripts/urls_check_access.py`
- [x] Step 4: Updated `scripts/build_safe.sh` with Step 3.5
- [x] Step 5: Tested — 5 URLs (3 good + 2 known-bad 092p045) all return 200
  - Note: 092p045 permissions appear to be fixed upstream by GeoBC
  - CSV output format verified, incremental cache working

### Findings
- The 6 known-bad 092p045 URLs now return HTTP 200 (permissions fixed upstream)
- Script still valuable for ongoing monitoring of new URLs

---

156 changes: 28 additions & 128 deletions planning/active/task_plan.md
@@ -1,145 +1,45 @@
# Task Plan: Migrate .qmd to Standalone Scripts (Issue #7)
# Task Plan: Source URL Accessibility Validation (Issue #13)

**Status:** Phase 2 - In Progress
**Branch:** `7-migrate-qmd-to-py`
**Status:** In Progress
**Branch:** `13-fix-s3-permissions`
**Started:** 2026-02-17
**Issue:** https://github.com/NewGraphEnvironment/stac_dem_bc/issues/7
**Issue:** https://github.com/NewGraphEnvironment/stac_dem_bc/issues/13

## Goal

Migrate `stac_create_collection.qmd` and `stac_create_item.qmd` to standalone scripts for production automation. Keep .qmd files as archive/reference. Enable clean `python script.py` and `Rscript script.R` execution for Phase 3 VM cron jobs.
Create a validation script that checks source GeoTIFF URL accessibility on the BC objectstore, produces a CSV report shareable with GeoBC, and integrates into the build pipeline.

## Naming Convention
## Steps

Scripts use `noun_verb.py` pattern for alphabetical grouping:
### Step 1: Archive old planning files ✅ complete
- [x] Move issue #7 planning to `planning/archive/2026-02-issue-7-qmd-to-py/`
- [x] Create fresh planning files for issue #13

| Old Name | New Name |
|----------|----------|
| `validate_stac_items.py` | `item_validate.py` |
| `reprocess_invalid_items.py` | `item_reprocess.py` |
| `extract_invalid_urls.py` | `item_extract_invalid.py` |
| `qa_update_catalogue.py` | `catalogue_qa.py` |
| _(new)_ | `item_create.py` |
| _(new)_ | `collection_create.py` |
### Step 2: Add `check_url_accessible()` to stac_utils.py ✅ complete
- [x] HTTP HEAD request helper with timeout
- [x] Returns dict: `{url, status_code, accessible, error, last_checked}`
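
A minimal sketch of what such a helper could look like, assuming only the standard library. The PR does not show the real `check_url_accessible()` body, so the `opener` parameter (added here to allow offline testing), the default timeout, and the rule that only 2xx counts as accessible are all assumptions; only the function name and the five dict keys come from the plan above.

```python
from datetime import datetime, timezone
import urllib.error
import urllib.request


def check_url_accessible(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send an HTTP HEAD request and describe the outcome as a dict.

    `opener` is a hypothetical injection point (not from the PR) so the
    helper can be exercised without network access.
    """
    result = {
        "url": url,
        "status_code": None,
        "accessible": False,
        "error": None,
        "last_checked": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    # HEAD avoids downloading the GeoTIFF body; only headers come back.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            result["status_code"] = response.status
            # Assumption: any 2xx status counts as accessible.
            result["accessible"] = 200 <= response.status < 300
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (for example 403 on a permissions problem) land here.
        result["status_code"] = exc.code
        result["error"] = str(exc.reason)
    except OSError as exc:
        # URLError, DNS failures and timeouts are all OSError subclasses.
        result["error"] = str(exc)
    return result
```

Returning a flat dict with fixed keys keeps each result directly writable as one CSV row.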

## Current State
### Step 3: Create `scripts/urls_check_access.py` ✅ complete
- [x] argparse CLI (`--urls-file`, `--recheck`, `--workers`, `--timeout`)
- [x] Incremental: load cache from `data/urls_access_checks.csv`, skip known URLs
- [x] Parallel HEAD requests (ThreadPoolExecutor)
- [x] CSV output shareable with GeoBC
- [x] Summary logging
- [x] Exit code 1 if any inaccessible
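
The checklist above can be sketched roughly as follows. This is an illustrative outline, not the contents of `scripts/urls_check_access.py`: the four flag names and the cache path `data/urls_access_checks.csv` come from the plan, while the default values, the helper names `parse_args`, `load_cache` and `run`, and the injected `check` callable are assumptions.

```python
import argparse
import csv
import logging
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FIELDS = ["url", "status_code", "accessible", "error", "last_checked"]
log = logging.getLogger("urls_check_access")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Check source URL accessibility")
    parser.add_argument("--urls-file", default="data/urls_list.txt")  # assumed default
    parser.add_argument("--recheck", action="store_true", help="ignore cached results")
    parser.add_argument("--workers", type=int, default=8)  # assumed default
    parser.add_argument("--timeout", type=float, default=10.0)  # assumed default
    return parser.parse_args(argv)


def load_cache(path):
    """Return {url: row} from a previous run, or {} if no cache exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="") as handle:
        return {row["url"]: row for row in csv.DictReader(handle)}


def run(urls, cache_path, check, recheck=False, workers=8):
    """Check URLs missing from the cache, rewrite the CSV, return an exit code."""
    results = {} if recheck else load_cache(cache_path)
    # dict.fromkeys drops duplicate URLs while preserving input order.
    pending = [u for u in dict.fromkeys(urls) if u not in results]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for row in pool.map(check, pending):
            results[row["url"]] = row
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(results.values())
    # Cached rows hold strings and fresh rows hold booleans, so compare as text.
    bad = [u for u, r in results.items() if str(r["accessible"]) != "True"]
    log.info("checked=%d cached=%d inaccessible=%d",
             len(pending), len(results) - len(pending), len(bad))
    return 1 if bad else 0
```

In a real script, `check` would wrap the HEAD-request helper with the chosen `--timeout`, and the caller would pass the return value of `run()` to `sys.exit()`. Threads suit this workload because each check spends nearly all of its time waiting on the network.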

**Existing .py scripts (renamed):**
- `scripts/item_validate.py` (244 LOC)
- `scripts/item_reprocess.py` (196 LOC) — updated to use stac_utils
- `scripts/item_extract_invalid.py` (89 LOC)
- `scripts/catalogue_qa.py` (243 LOC)
- `scripts/stac_utils.py` (NEW — shared utilities)
- `scripts/build_safe.sh` (218 LOC)
### Step 4: Update `scripts/build_safe.sh` ✅ complete
- [x] Add accessibility check step after URL fetch (Step 3.5)
- [x] Warning only (don't block build — GeoTIFF validation handles skipping)

**Still in .qmd (need migration):**
- `stac_create_collection.qmd` — R chunk (S3 key fetch via `ngr`) + Python chunks (collection creation)
- `stac_create_item.qmd` — R chunk (conda env) + Python chunks (validation, item creation)

**R-only scripts (keep as-is):**
- `scripts/detect_changes.R` — Uses `ngr::ngr_s3_keys_get()`, pure R is correct
- `scripts/s3_sync.R` — AWS CLI wrapper, fine as R
- `scripts/functions.R` — `vm_upload_run()` utility
- `scripts/benchmark_fetch.R` — Dev tool
- `scripts/footprint_visualize.R` — Exploratory (Issue #2)

## Phases

### Phase 1: Extract shared utilities ✅ COMPLETE
**Goal:** Create shared Python module to eliminate duplication

- [x] **1.1** Create `scripts/stac_utils.py` with shared functions
- [x] **1.2** Update `item_reprocess.py` to import from `stac_utils.py`
- [x] **1.3** Verify imports and syntax
- [x] **1.4** Rename all scripts to `noun_verb.py` convention
- [x] **1.5** Update all cross-references (scripts, CLAUDE.md, .qmd)

---

### Phase 2: Migrate stac_create_item.qmd → scripts/item_create.py ⬜ pending
**Goal:** Standalone Python script for item creation

- [ ] **2.1** Create `scripts/item_create.py` with:
  - argparse CLI (`--test`, `--test-count N`, `--incremental`, `--reprocess-invalid`)
  - Python logging module (not print statements)
  - Import shared utils from `stac_utils.py`
  - All functionality from stac_create_item.qmd Python chunks
- [ ] **2.2** Test equivalence: run both .qmd and .py, compare output
- [ ] **2.3** Update `scripts/build_safe.sh` to call .py instead of `quarto render`

**Verify:** Create 10 test items with .py script, diff against .qmd output

---

### Phase 3: Migrate stac_create_collection.qmd → split R/Python ⬜ pending
**Goal:** Standalone scripts for collection creation

The collection .qmd has two distinct parts:
1. **R chunk:** Fetches S3 keys via `ngr::ngr_s3_keys_get()` → `data/urls_list.txt`
2. **Python chunks:** Creates collection JSON from urls_list.txt

Migration approach:
- [ ] **3.1** Create `scripts/urls_fetch.R` — Standalone R script for S3 key fetching
  - Takes `--test` flag, outputs to `data/urls_list.txt`
  - Replaces the R chunk in collection.qmd
- [ ] **3.2** Create `scripts/collection_create.py` — Standalone Python script
- Reads `data/urls_list.txt` (produced by urls_fetch.R or detect_changes.R)
  - argparse CLI (`--test`, `--test-count N`)
  - Temporal extent calculation, spatial extent (hardcoded BC bbox)
  - Collection creation and validation
- [ ] **3.3** Test equivalence: compare collection.json from both approaches

**Verify:** `Rscript scripts/urls_fetch.R && python scripts/collection_create.py` produces identical collection.json

---

### Phase 4: Update build_safe.sh and documentation ⬜ pending
**Goal:** Wire everything together for production

- [ ] **4.1** Update `scripts/build_safe.sh` to use new scripts:
  - `Rscript scripts/urls_fetch.R` (or `Rscript scripts/detect_changes.R`)
  - `python scripts/collection_create.py`
  - `python scripts/item_create.py`
  - `python scripts/item_validate.py`
- [ ] **4.2** Archive .qmd files (move to `archive/` or add deprecation header)
- [ ] **4.3** Update CLAUDE.md with new script paths and workflow
- [ ] **4.4** Update README.md usage examples

**Verify:** Full `build_safe.sh` run in test mode produces valid catalog

---

## Critical Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Naming convention | `noun_verb.py` | Groups related scripts alphabetically (item_*, collection_*) |
| Keep R for S3 key fetch | Yes | `ngr::ngr_s3_keys_get()` has no Python equivalent, already works |
| Shared Python module | `stac_utils.py` | Eliminates 3x duplication of date functions |
| CLI interface | argparse | Standard, scriptable, supports `--test` flags |
| Logging | Python `logging` module | Proper log levels, file output, captures in cron |
| Archive .qmd | Keep in repo (header note) | Reference for literate programming approach |

## Risks

| Risk | Mitigation |
|------|-----------|
| Breaking production pipeline | Branch-based development, .qmd still works on main |
| ngr R dependency hard to replace | Keep R script for S3 fetch, don't force all-Python |
| Subtle behavior differences | Side-by-side output comparison before merging |
### Step 5: Test and verify ✅ complete
- [x] Test against known-bad 092p045 URLs (now return 200 — fixed upstream)
- [x] Test against known-good URLs
- [x] Verify CSV output format
- [x] Verify incremental cache works

## SRED Tracking

- Primary: NewGraphEnvironment/sred-2025-2026#8
- Secondary: NewGraphEnvironment/sred-2025-2026#3

---

## Errors Encountered

| Error | Phase | Resolution |
|-------|-------|------------|
| PROJ env conflict | 1.3 | Local homebrew/conda conflict — not our bug, works on VM |
- Relates to NewGraphEnvironment/sred-2025-2026#3

---

39 changes: 39 additions & 0 deletions planning/archive/2026-02-issue-7-qmd-to-py/progress.md
@@ -0,0 +1,39 @@
# Progress: Issue #7 - .qmd to .py Migration

## Session: 2026-02-17

### Completed
- [x] Created branch `7-migrate-qmd-to-py`
- [x] Analyzed stac_create_item.qmd (350 LOC, 3 modes, no real R dependency)
- [x] Analyzed stac_create_collection.qmd (270 LOC, R dependency on ngr for S3 fetch)
- [x] Identified 3x duplicated utility functions
- [x] Created planning files (task_plan.md, findings.md, progress.md)

### In Progress
- [x] Phase 1.1: Created `scripts/stac_utils.py` with shared functions
  - `date_extract_from_path()`, `datetime_parse_item()`, `check_geotiff_cog()`
  - `fix_url()`, `url_to_item_id()`, `get_output_dir()`
  - Path constants: `PATH_S3_STAC`, `PATH_S3_JSON`, `PATH_S3`, `PATH_RESULTS_CSV`, `BBOX_BC`
- [x] Phase 1.2: Updated `item_reprocess.py` to import from `stac_utils`
- [x] Phase 1.3: Verified imports and syntax

- [x] Phase 1.4: Renamed all scripts to `noun_verb.py` convention
- [x] Phase 1.5: Updated all cross-references (0 stale refs in code files)
- [x] Phase 2.1: Created `scripts/item_create.py` (argparse CLI, logging, imports stac_utils)
  - All 6 .py scripts pass syntax checks
  - Local PROJ conflict blocks `rio_stac` import (homebrew vs conda) — VM-only testing

- [x] Phase 3.1: Created `scripts/urls_fetch.R` (standalone R script for S3 key fetch)
- [x] Phase 3.2: Created `scripts/collection_create.py` (argparse CLI, logging, imports stac_utils)
- [x] Phase 4.1: Updated `scripts/build_safe.sh` to use new scripts (no more quarto render)
  - Added urls_fetch.R step, validation step with item_validate.py

### Next Up
- [ ] Phase 4.2: Archive .qmd files
- [ ] Phase 4.3: Update CLAUDE.md
- [ ] Phase 4.4: Update README.md
- [ ] Test equivalence on VM

---

**Last updated:** 2026-02-17