diff --git a/planning/active/progress.md b/planning/active/progress.md
index 7717f4b..f456714 100644
--- a/planning/active/progress.md
+++ b/planning/active/progress.md
@@ -1,38 +1,21 @@
-# Progress: Issue #7 - .qmd to .py Migration
+# Progress: Issue #13 - Source URL Accessibility Validation
 
 ## Session: 2026-02-17
 
 ### Completed
-- [x] Created branch `7-migrate-qmd-to-py`
-- [x] Analyzed stac_create_item.qmd (350 LOC, 3 modes, no real R dependency)
-- [x] Analyzed stac_create_collection.qmd (270 LOC, R dependency on ngr for S3 fetch)
-- [x] Identified 3x duplicated utility functions
-- [x] Created planning files (task_plan.md, findings.md, progress.md)
-
-### In Progress
-- [x] Phase 1.1: Created `scripts/stac_utils.py` with shared functions
-  - `date_extract_from_path()`, `datetime_parse_item()`, `check_geotiff_cog()`
-  - `fix_url()`, `url_to_item_id()`, `get_output_dir()`
-  - Path constants: `PATH_S3_STAC`, `PATH_S3_JSON`, `PATH_S3`, `PATH_RESULTS_CSV`, `BBOX_BC`
-- [x] Phase 1.2: Updated `item_reprocess.py` to import from `stac_utils`
-- [x] Phase 1.3: Verified imports and syntax
-
-- [x] Phase 1.4: Renamed all scripts to `noun_verb.py` convention
-- [x] Phase 1.5: Updated all cross-references (0 stale refs in code files)
-- [x] Phase 2.1: Created `scripts/item_create.py` (argparse CLI, logging, imports stac_utils)
-- All 6 .py scripts pass syntax checks
-- Local PROJ conflict blocks `rio_stac` import (homebrew vs conda) — VM-only testing
-
-- [x] Phase 3.1: Created `scripts/urls_fetch.R` (standalone R script for S3 key fetch)
-- [x] Phase 3.2: Created `scripts/collection_create.py` (argparse CLI, logging, imports stac_utils)
-- [x] Phase 4.1: Updated `scripts/build_safe.sh` to use new scripts (no more quarto render)
-  - Added urls_fetch.R step, validation step with item_validate.py
-
-### Next Up
-- [ ] Phase 4.2: Archive .qmd files
-- [ ] Phase 4.3: Update CLAUDE.md
-- [ ] Phase 4.4: Update README.md
-- [ ] Test equivalence on VM
+- [x] Archived issue #7 planning files
+- [x] Created fresh planning files for issue #13
+- [x] Explored pipeline and validation patterns
+- [x] Step 2: Added `check_url_accessible()` to stac_utils.py
+- [x] Step 3: Created `scripts/urls_check_access.py`
+- [x] Step 4: Updated `scripts/build_safe.sh` with Step 3.5
+- [x] Step 5: Tested — 5 URLs (3 good + 2 known-bad 092p045) all return 200
+  - Note: 092p045 permissions appear to be fixed upstream by GeoBC
+  - CSV output format verified, incremental cache working
+
+### Findings
+- The 6 known-bad 092p045 URLs now return HTTP 200 (permissions fixed upstream)
+- Script still valuable for ongoing monitoring of new URLs
 
 ---
 
diff --git a/planning/active/task_plan.md b/planning/active/task_plan.md
index ae6e105..b4245dc 100644
--- a/planning/active/task_plan.md
+++ b/planning/active/task_plan.md
@@ -1,145 +1,45 @@
-# Task Plan: Migrate .qmd to Standalone Scripts (Issue #7)
+# Task Plan: Source URL Accessibility Validation (Issue #13)
 
-**Status:** Phase 2 - In Progress
-**Branch:** `7-migrate-qmd-to-py`
+**Status:** In Progress
+**Branch:** `13-fix-s3-permissions`
 **Started:** 2026-02-17
-**Issue:** https://github.com/NewGraphEnvironment/stac_dem_bc/issues/7
+**Issue:** https://github.com/NewGraphEnvironment/stac_dem_bc/issues/13
 
 ## Goal
 
-Migrate `stac_create_collection.qmd` and `stac_create_item.qmd` to standalone scripts for production automation. Keep .qmd files as archive/reference. Enable clean `python script.py` and `Rscript script.R` execution for Phase 3 VM cron jobs.
+Create a validation script that checks source GeoTIFF URL accessibility on the BC objectstore, produces a CSV report shareable with GeoBC, and integrates into the build pipeline.
 
-## Naming Convention
+## Steps
 
-Scripts use `noun_verb.py` pattern for alphabetical grouping:
+### Step 1: Archive old planning files ⬜ pending
+- [x] Move issue #7 planning to `planning/archive/2026-02-issue-7-qmd-to-py/`
+- [x] Create fresh planning files for issue #13
 
-| Old Name | New Name |
-|----------|----------|
-| `validate_stac_items.py` | `item_validate.py` |
-| `reprocess_invalid_items.py` | `item_reprocess.py` |
-| `extract_invalid_urls.py` | `item_extract_invalid.py` |
-| `qa_update_catalogue.py` | `catalogue_qa.py` |
-| _(new)_ | `item_create.py` |
-| _(new)_ | `collection_create.py` |
+### Step 2: Add `check_url_accessible()` to stac_utils.py ✅ complete
+- [x] HTTP HEAD request helper with timeout
+- [x] Returns dict: `{url, status_code, accessible, error, last_checked}`
 
-## Current State
+### Step 3: Create `scripts/urls_check_access.py` ✅ complete
+- [x] argparse CLI (`--urls-file`, `--recheck`, `--workers`, `--timeout`)
+- [x] Incremental: load cache from `data/urls_access_checks.csv`, skip known URLs
+- [x] Parallel HEAD requests (ThreadPoolExecutor)
+- [x] CSV output shareable with GeoBC
+- [x] Summary logging
+- [x] Exit code 1 if any inaccessible
 
-**Existing .py scripts (renamed):**
-- `scripts/item_validate.py` (244 LOC)
-- `scripts/item_reprocess.py` (196 LOC) — updated to use stac_utils
-- `scripts/item_extract_invalid.py` (89 LOC)
-- `scripts/catalogue_qa.py` (243 LOC)
-- `scripts/stac_utils.py` (NEW — shared utilities)
-- `scripts/build_safe.sh` (218 LOC)
+### Step 4: Update `scripts/build_safe.sh` ✅ complete
+- [x] Add accessibility check step after URL fetch (Step 3.5)
+- [x] Warning only (don't block build — GeoTIFF validation handles skipping)
 
-**Still in .qmd (need migration):**
-- `stac_create_collection.qmd` — R chunk (S3 key fetch via `ngr`) + Python chunks (collection creation)
-- `stac_create_item.qmd` — R chunk (conda env) + Python chunks (validation, item creation)
-
-**R-only scripts (keep as-is):**
-- `scripts/detect_changes.R` — Uses `ngr::ngr_s3_keys_get()`, pure R is correct
-- `scripts/s3_sync.R` — AWS CLI wrapper, fine as R
-- `scripts/functions.R` — `vm_upload_run()` utility
-- `scripts/benchmark_fetch.R` — Dev tool
-- `scripts/footprint_visualize.R` — Exploratory (Issue #2)
-
-## Phases
-
-### Phase 1: Extract shared utilities ✅ COMPLETE
-**Goal:** Create shared Python module to eliminate duplication
-
-- [x] **1.1** Create `scripts/stac_utils.py` with shared functions
-- [x] **1.2** Update `item_reprocess.py` to import from `stac_utils.py`
-- [x] **1.3** Verify imports and syntax
-- [x] **1.4** Rename all scripts to `noun_verb.py` convention
-- [x] **1.5** Update all cross-references (scripts, CLAUDE.md, .qmd)
-
----
-
-### Phase 2: Migrate stac_create_item.qmd → scripts/item_create.py ⬜ pending
-**Goal:** Standalone Python script for item creation
-
-- [ ] **2.1** Create `scripts/item_create.py` with:
-  - argparse CLI (`--test`, `--test-count N`, `--incremental`, `--reprocess-invalid`)
-  - Python logging module (not print statements)
-  - Import shared utils from `stac_utils.py`
-  - All functionality from stac_create_item.qmd Python chunks
-- [ ] **2.2** Test equivalence: run both .qmd and .py, compare output
-- [ ] **2.3** Update `scripts/build_safe.sh` to call .py instead of `quarto render`
-
-**Verify:** Create 10 test items with .py script, diff against .qmd output
-
----
-
-### Phase 3: Migrate stac_create_collection.qmd → split R/Python ⬜ pending
-**Goal:** Standalone scripts for collection creation
-
-The collection .qmd has two distinct parts:
-1. **R chunk:** Fetches S3 keys via `ngr::ngr_s3_keys_get()` → `data/urls_list.txt`
-2. **Python chunks:** Creates collection JSON from urls_list.txt
-
-Migration approach:
-- [ ] **3.1** Create `scripts/urls_fetch.R` — Standalone R script for S3 key fetching
-  - Takes `--test` flag, outputs to `data/urls_list.txt`
-  - Replaces the R chunk in collection.qmd
-- [ ] **3.2** Create `scripts/collection_create.py` — Standalone Python script
-  - Reads `data/urls_list.txt` (produced by urls_fetch.R or detect_changes.R)
-  - argparse CLI (`--test`, `--test-count N`)
-  - Temporal extent calculation, spatial extent (hardcoded BC bbox)
-  - Collection creation and validation
-- [ ] **3.3** Test equivalence: compare collection.json from both approaches
-
-**Verify:** `Rscript scripts/urls_fetch.R && python scripts/collection_create.py` produces identical collection.json
-
----
-
-### Phase 4: Update build_safe.sh and documentation ⬜ pending
-**Goal:** Wire everything together for production
-
-- [ ] **4.1** Update `scripts/build_safe.sh` to use new scripts:
-  - `Rscript scripts/urls_fetch.R` (or `Rscript scripts/detect_changes.R`)
-  - `python scripts/collection_create.py`
-  - `python scripts/item_create.py`
-  - `python scripts/item_validate.py`
-- [ ] **4.2** Archive .qmd files (move to `archive/` or add deprecation header)
-- [ ] **4.3** Update CLAUDE.md with new script paths and workflow
-- [ ] **4.4** Update README.md usage examples
-
-**Verify:** Full `build_safe.sh` run in test mode produces valid catalog
-
----
-
-## Critical Decisions
-
-| Decision | Choice | Rationale |
-|----------|--------|-----------|
-| Naming convention | `noun_verb.py` | Groups related scripts alphabetically (item_*, collection_*) |
-| Keep R for S3 key fetch | Yes | `ngr::ngr_s3_keys_get()` has no Python equivalent, already works |
-| Shared Python module | `stac_utils.py` | Eliminates 3x duplication of date functions |
-| CLI interface | argparse | Standard, scriptable, supports `--test` flags |
-| Logging | Python `logging` module | Proper log levels, file output, captures in cron |
-| Archive .qmd | Keep in repo (header note) | Reference for literate programming approach |
-
-## Risks
-
-| Risk | Mitigation |
-|------|-----------|
-| Breaking production pipeline | Branch-based development, .qmd still works on main |
-| ngr R dependency hard to replace | Keep R script for S3 fetch, don't force all-Python |
-| Subtle behavior differences | Side-by-side output comparison before merging |
+### Step 5: Test and verify ✅ complete
+- [x] Test against known-bad 092p045 URLs (now return 200 — fixed upstream)
+- [x] Test against known-good URLs
+- [x] Verify CSV output format
+- [x] Verify incremental cache works
 
 ## SRED Tracking
 
-- Primary: NewGraphEnvironment/sred-2025-2026#8
-- Secondary: NewGraphEnvironment/sred-2025-2026#3
-
----
-
-## Errors Encountered
-
-| Error | Phase | Resolution |
-|-------|-------|------------|
-| PROJ env conflict | 1.3 | Local homebrew/conda conflict — not our bug, works on VM |
+- Relates to NewGraphEnvironment/sred-2025-2026#3
 
 ---
 
diff --git a/planning/active/findings.md b/planning/archive/2026-02-issue-7-qmd-to-py/findings.md
similarity index 100%
rename from planning/active/findings.md
rename to planning/archive/2026-02-issue-7-qmd-to-py/findings.md
diff --git a/planning/archive/2026-02-issue-7-qmd-to-py/progress.md b/planning/archive/2026-02-issue-7-qmd-to-py/progress.md
new file mode 100644
index 0000000..7717f4b
--- /dev/null
+++ b/planning/archive/2026-02-issue-7-qmd-to-py/progress.md
@@ -0,0 +1,39 @@
+# Progress: Issue #7 - .qmd to .py Migration
+
+## Session: 2026-02-17
+
+### Completed
+- [x] Created branch `7-migrate-qmd-to-py`
+- [x] Analyzed stac_create_item.qmd (350 LOC, 3 modes, no real R dependency)
+- [x] Analyzed stac_create_collection.qmd (270 LOC, R dependency on ngr for S3 fetch)
+- [x] Identified 3x duplicated utility functions
+- [x] Created planning files (task_plan.md, findings.md, progress.md)
+
+### In Progress
+- [x] Phase 1.1: Created `scripts/stac_utils.py` with shared functions
+  - `date_extract_from_path()`, `datetime_parse_item()`, `check_geotiff_cog()`
+  - `fix_url()`, `url_to_item_id()`, `get_output_dir()`
+  - Path constants: `PATH_S3_STAC`, `PATH_S3_JSON`, `PATH_S3`, `PATH_RESULTS_CSV`, `BBOX_BC`
+- [x] Phase 1.2: Updated `item_reprocess.py` to import from `stac_utils`
+- [x] Phase 1.3: Verified imports and syntax
+
+- [x] Phase 1.4: Renamed all scripts to `noun_verb.py` convention
+- [x] Phase 1.5: Updated all cross-references (0 stale refs in code files)
+- [x] Phase 2.1: Created `scripts/item_create.py` (argparse CLI, logging, imports stac_utils)
+- All 6 .py scripts pass syntax checks
+- Local PROJ conflict blocks `rio_stac` import (homebrew vs conda) — VM-only testing
+
+- [x] Phase 3.1: Created `scripts/urls_fetch.R` (standalone R script for S3 key fetch)
+- [x] Phase 3.2: Created `scripts/collection_create.py` (argparse CLI, logging, imports stac_utils)
+- [x] Phase 4.1: Updated `scripts/build_safe.sh` to use new scripts (no more quarto render)
+  - Added urls_fetch.R step, validation step with item_validate.py
+
+### Next Up
+- [ ] Phase 4.2: Archive .qmd files
+- [ ] Phase 4.3: Update CLAUDE.md
+- [ ] Phase 4.4: Update README.md
+- [ ] Test equivalence on VM
+
+---
+
+**Last updated:** 2026-02-17
diff --git a/planning/archive/2026-02-issue-7-qmd-to-py/task_plan.md b/planning/archive/2026-02-issue-7-qmd-to-py/task_plan.md
new file mode 100644
index 0000000..ae6e105
--- /dev/null
+++ b/planning/archive/2026-02-issue-7-qmd-to-py/task_plan.md
@@ -0,0 +1,146 @@
+# Task Plan: Migrate .qmd to Standalone Scripts (Issue #7)
+
+**Status:** Phase 2 - In Progress
+**Branch:** `7-migrate-qmd-to-py`
+**Started:** 2026-02-17
+**Issue:** https://github.com/NewGraphEnvironment/stac_dem_bc/issues/7
+
+## Goal
+
+Migrate `stac_create_collection.qmd` and `stac_create_item.qmd` to standalone scripts for production automation. Keep .qmd files as archive/reference. Enable clean `python script.py` and `Rscript script.R` execution for Phase 3 VM cron jobs.
+
+## Naming Convention
+
+Scripts use `noun_verb.py` pattern for alphabetical grouping:
+
+| Old Name | New Name |
+|----------|----------|
+| `validate_stac_items.py` | `item_validate.py` |
+| `reprocess_invalid_items.py` | `item_reprocess.py` |
+| `extract_invalid_urls.py` | `item_extract_invalid.py` |
+| `qa_update_catalogue.py` | `catalogue_qa.py` |
+| _(new)_ | `item_create.py` |
+| _(new)_ | `collection_create.py` |
+
+## Current State
+
+**Existing .py scripts (renamed):**
+- `scripts/item_validate.py` (244 LOC)
+- `scripts/item_reprocess.py` (196 LOC) — updated to use stac_utils
+- `scripts/item_extract_invalid.py` (89 LOC)
+- `scripts/catalogue_qa.py` (243 LOC)
+- `scripts/stac_utils.py` (NEW — shared utilities)
+- `scripts/build_safe.sh` (218 LOC)
+
+**Still in .qmd (need migration):**
+- `stac_create_collection.qmd` — R chunk (S3 key fetch via `ngr`) + Python chunks (collection creation)
+- `stac_create_item.qmd` — R chunk (conda env) + Python chunks (validation, item creation)
+
+**R-only scripts (keep as-is):**
+- `scripts/detect_changes.R` — Uses `ngr::ngr_s3_keys_get()`, pure R is correct
+- `scripts/s3_sync.R` — AWS CLI wrapper, fine as R
+- `scripts/functions.R` — `vm_upload_run()` utility
+- `scripts/benchmark_fetch.R` — Dev tool
+- `scripts/footprint_visualize.R` — Exploratory (Issue #2)
+
+## Phases
+
+### Phase 1: Extract shared utilities ✅ COMPLETE
+**Goal:** Create shared Python module to eliminate duplication
+
+- [x] **1.1** Create `scripts/stac_utils.py` with shared functions
+- [x] **1.2** Update `item_reprocess.py` to import from `stac_utils.py`
+- [x] **1.3** Verify imports and syntax
+- [x] **1.4** Rename all scripts to `noun_verb.py` convention
+- [x] **1.5** Update all cross-references (scripts, CLAUDE.md, .qmd)
+
+---
+
+### Phase 2: Migrate stac_create_item.qmd → scripts/item_create.py ⬜ pending
+**Goal:** Standalone Python script for item creation
+
+- [ ] **2.1** Create `scripts/item_create.py` with:
+  - argparse CLI (`--test`, `--test-count N`, `--incremental`, `--reprocess-invalid`)
+  - Python logging module (not print statements)
+  - Import shared utils from `stac_utils.py`
+  - All functionality from stac_create_item.qmd Python chunks
+- [ ] **2.2** Test equivalence: run both .qmd and .py, compare output
+- [ ] **2.3** Update `scripts/build_safe.sh` to call .py instead of `quarto render`
+
+**Verify:** Create 10 test items with .py script, diff against .qmd output
+
+---
+
+### Phase 3: Migrate stac_create_collection.qmd → split R/Python ⬜ pending
+**Goal:** Standalone scripts for collection creation
+
+The collection .qmd has two distinct parts:
+1. **R chunk:** Fetches S3 keys via `ngr::ngr_s3_keys_get()` → `data/urls_list.txt`
+2. **Python chunks:** Creates collection JSON from urls_list.txt
+
+Migration approach:
+- [ ] **3.1** Create `scripts/urls_fetch.R` — Standalone R script for S3 key fetching
+  - Takes `--test` flag, outputs to `data/urls_list.txt`
+  - Replaces the R chunk in collection.qmd
+- [ ] **3.2** Create `scripts/collection_create.py` — Standalone Python script
+  - Reads `data/urls_list.txt` (produced by urls_fetch.R or detect_changes.R)
+  - argparse CLI (`--test`, `--test-count N`)
+  - Temporal extent calculation, spatial extent (hardcoded BC bbox)
+  - Collection creation and validation
+- [ ] **3.3** Test equivalence: compare collection.json from both approaches
+
+**Verify:** `Rscript scripts/urls_fetch.R && python scripts/collection_create.py` produces identical collection.json
+
+---
+
+### Phase 4: Update build_safe.sh and documentation ⬜ pending
+**Goal:** Wire everything together for production
+
+- [ ] **4.1** Update `scripts/build_safe.sh` to use new scripts:
+  - `Rscript scripts/urls_fetch.R` (or `Rscript scripts/detect_changes.R`)
+  - `python scripts/collection_create.py`
+  - `python scripts/item_create.py`
+  - `python scripts/item_validate.py`
+- [ ] **4.2** Archive .qmd files (move to `archive/` or add deprecation header)
+- [ ] **4.3** Update CLAUDE.md with new script paths and workflow
+- [ ] **4.4** Update README.md usage examples
+
+**Verify:** Full `build_safe.sh` run in test mode produces valid catalog
+
+---
+
+## Critical Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Naming convention | `noun_verb.py` | Groups related scripts alphabetically (item_*, collection_*) |
+| Keep R for S3 key fetch | Yes | `ngr::ngr_s3_keys_get()` has no Python equivalent, already works |
+| Shared Python module | `stac_utils.py` | Eliminates 3x duplication of date functions |
+| CLI interface | argparse | Standard, scriptable, supports `--test` flags |
+| Logging | Python `logging` module | Proper log levels, file output, captures in cron |
+| Archive .qmd | Keep in repo (header note) | Reference for literate programming approach |
+
+## Risks
+
+| Risk | Mitigation |
+|------|-----------|
+| Breaking production pipeline | Branch-based development, .qmd still works on main |
+| ngr R dependency hard to replace | Keep R script for S3 fetch, don't force all-Python |
+| Subtle behavior differences | Side-by-side output comparison before merging |
+
+## SRED Tracking
+
+- Primary: NewGraphEnvironment/sred-2025-2026#8
+- Secondary: NewGraphEnvironment/sred-2025-2026#3
+
+---
+
+## Errors Encountered
+
+| Error | Phase | Resolution |
+|-------|-------|------------|
+| PROJ env conflict | 1.3 | Local homebrew/conda conflict — not our bug, works on VM |
+
+---
+
+**Last updated:** 2026-02-17
diff --git a/scripts/build_safe.sh b/scripts/build_safe.sh
index 940e49c..7dfd6ab 100755
--- a/scripts/build_safe.sh
+++ b/scripts/build_safe.sh
@@ -116,6 +116,22 @@ else
 fi
 log ""
 
+# =============================================================================
+# Step 3.5: Check URL Accessibility
+# =============================================================================
+
+log "Step 3.5: Checking source URL accessibility..."
+ACCESS_LOG="${LOG_DIR}/${TIMESTAMP}_urls_access.log"
+
+if python scripts/urls_check_access.py 2>&1 | tee "$ACCESS_LOG"; then
+    log "✓ All source URLs accessible"
+else
+    log "⚠️ Some source URLs are inaccessible (upstream permission issue)"
+    log " See data/urls_access_checks.csv for details to share with GeoBC"
+    log " Continuing build — GeoTIFF validation will skip unreadable files"
+fi
+log ""
+
 # =============================================================================
 # Step 4: Create Collection
 # =============================================================================
@@ -218,7 +234,7 @@ if [[ $CURRENT_COUNT -gt 0 ]]; then
     log "Previous count: $CURRENT_COUNT (${DELTA:+\+}$DELTA)"
     log "Backup location: $BACKUP_DIR"
 fi
-log "Logs: $URLS_LOG, $COLLECTION_LOG, $ITEMS_LOG, $VALIDATION_LOG"
+log "Logs: $URLS_LOG, $ACCESS_LOG, $COLLECTION_LOG, $ITEMS_LOG, $VALIDATION_LOG"
 log ""
 
 if [[ "$AUTO_PROMOTE" == false ]]; then
diff --git a/scripts/stac_utils.py b/scripts/stac_utils.py
index c14208b..e2eee40 100644
--- a/scripts/stac_utils.py
+++ b/scripts/stac_utils.py
@@ -12,6 +12,8 @@ import subprocess
 
 from datetime import datetime, timezone
 
+import requests
+
 
 # =============================================================================
 # Path Configuration
@@ -111,6 +113,30 @@ def check_geotiff_cog(url: str) -> dict:
 # URL Helpers
 # =============================================================================
 
+def check_url_accessible(url: str, timeout: int = 10) -> dict:
+    """Check if a URL is accessible via HTTP HEAD request.
+
+    Returns dict with url, status_code, accessible, error, last_checked.
+    """
+    try:
+        resp = requests.head(url, timeout=timeout, allow_redirects=True)
+        return {
+            "url": url,
+            "status_code": resp.status_code,
+            "accessible": resp.status_code == 200,
+            "error": "" if resp.status_code == 200 else resp.reason,
+            "last_checked": datetime.now(timezone.utc).isoformat(),
+        }
+    except requests.RequestException as e:
+        return {
+            "url": url,
+            "status_code": None,
+            "accessible": False,
+            "error": str(e),
+            "last_checked": datetime.now(timezone.utc).isoformat(),
+        }
+
+
 def fix_url(url: str) -> str:
     """Fix malformed URLs with single slash after https:."""
     if url.startswith("https:/") and not url.startswith("https://"):
diff --git a/scripts/urls_check_access.py b/scripts/urls_check_access.py
new file mode 100644
index 0000000..6d81a16
--- /dev/null
+++ b/scripts/urls_check_access.py
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+"""Check source GeoTIFF URL accessibility on BC objectstore.
+
+Performs HTTP HEAD requests against source URLs to detect permission issues
+(e.g., 403 Forbidden). Produces a CSV report shareable with GeoBC.
+
+Usage:
+    python scripts/urls_check_access.py  # Check new URLs only
+    python scripts/urls_check_access.py --urls-file data/urls_list.txt  # Specify URL file
+    python scripts/urls_check_access.py --recheck  # Re-check all URLs
+"""
+
+import argparse
+import concurrent.futures
+import logging
+import sys
+
+import pandas as pd
+from tqdm import tqdm
+
+from stac_utils import check_url_accessible, fix_url
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+
+PATH_CACHE = "data/urls_access_checks.csv"
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Check source URL accessibility")
+    parser.add_argument(
+        "--urls-file", default="data/urls_list.txt",
+        help="File containing URLs to check (default: data/urls_list.txt)",
+    )
+    parser.add_argument(
+        "--recheck", action="store_true",
+        help="Re-check all URLs, ignoring cache",
+    )
+    parser.add_argument(
+        "--workers", type=int, default=16,
+        help="Number of parallel workers (default: 16)",
+    )
+    parser.add_argument(
+        "--timeout", type=int, default=10,
+        help="HTTP timeout in seconds (default: 10)",
+    )
+    args = parser.parse_args()
+
+    # Load URLs
+    with open(args.urls_file) as f:
+        all_urls = [fix_url(line.strip()) for line in f if line.strip()]
+    logger.info("Loaded %d URLs from %s", len(all_urls), args.urls_file)
+
+    # Load cache
+    if not args.recheck:
+        try:
+            df_cached = pd.read_csv(PATH_CACHE)
+            cached_urls = set(df_cached["url"])
+            logger.info("Loaded %d cached results from %s", len(df_cached), PATH_CACHE)
+        except FileNotFoundError:
+            df_cached = pd.DataFrame()
+            cached_urls = set()
+    else:
+        df_cached = pd.DataFrame()
+        cached_urls = set()
+        logger.info("Recheck mode: ignoring cache")
+
+    # Determine which URLs need checking
+    urls_to_check = [u for u in all_urls if u not in cached_urls]
+    logger.info("%d URLs to check (%d already cached)", len(urls_to_check), len(all_urls) - len(urls_to_check))
+
+    if not urls_to_check:
+        logger.info("Nothing to check")
+        # Still report from cache
+        if len(df_cached) > 0:
+            n_inaccessible = (~df_cached["accessible"]).sum()
+            if n_inaccessible > 0:
+                logger.warning("%d URLs are inaccessible (from cache)", n_inaccessible)
+                sys.exit(1)
+        sys.exit(0)
+
+    # Run checks in parallel
+    def _check(url):
+        return check_url_accessible(url, timeout=args.timeout)
+
+    logger.info("Checking %d URLs with %d workers...", len(urls_to_check), args.workers)
+    with concurrent.futures.ThreadPoolExecutor(max_workers=args.workers) as executor:
+        results = list(tqdm(
+            executor.map(_check, urls_to_check),
+            total=len(urls_to_check),
+            desc="Checking URLs",
+        ))
+
+    # Combine with cache and save
+    df_new = pd.DataFrame(results)
+    df_all = pd.concat([df_cached, df_new], ignore_index=True) if len(df_cached) > 0 else df_new
+    df_all.to_csv(PATH_CACHE, index=False)
+    logger.info("Saved %d results to %s", len(df_all), PATH_CACHE)
+
+    # Summary
+    n_checked = len(df_new)
+    n_accessible = df_new["accessible"].sum()
+    n_inaccessible = n_checked - n_accessible
+
+    logger.info("Results: %d accessible, %d inaccessible (out of %d checked)", n_accessible, n_inaccessible, n_checked)
+
+    if n_inaccessible > 0:
+        logger.warning("Inaccessible URLs:")
+        for _, row in df_new[~df_new["accessible"]].iterrows():
+            logger.warning("  %s → %s (%s)", row["url"], row["status_code"], row["error"])
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
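
Review note (not part of the patch): the incremental cache and exit-code behaviour of `scripts/urls_check_access.py` can be sketched without pandas or network access. This is a minimal illustration under stated assumptions, not the script's implementation: the helper names `select_urls_to_check` and `exit_status` are hypothetical, cached rows are modelled as plain dicts, and the example URLs are placeholders. As written in the diff, a run that checks new URLs derives its exit code from the newly checked rows only. The last line of the demo shows how the status would differ if cached rows were included as well, which may or may not be the intended behaviour.

```python
from __future__ import annotations

from typing import Iterable


def select_urls_to_check(
    all_urls: Iterable[str], cached_rows: list[dict], recheck: bool = False
) -> list[str]:
    """Return URLs with no cached result, or every URL when recheck is set.

    Hypothetical helper: mirrors the cache filter in the diff using plain dicts.
    """
    cached = set() if recheck else {row["url"] for row in cached_rows}
    return [u for u in all_urls if u not in cached]


def exit_status(rows: Iterable[dict]) -> int:
    """Return 1 if any row is marked inaccessible, otherwise 0 (hypothetical helper)."""
    return 1 if any(not row["accessible"] for row in rows) else 0


if __name__ == "__main__":
    # Placeholder cache: one good URL and one previously failing URL.
    cache = [
        {"url": "https://example.com/a.tif", "accessible": True},
        {"url": "https://example.com/b.tif", "accessible": False},
    ]
    urls = [
        "https://example.com/a.tif",
        "https://example.com/b.tif",
        "https://example.com/c.tif",
    ]

    todo = select_urls_to_check(urls, cache)
    # Stand-in for real HEAD results: assume every new URL is reachable.
    new_rows = [{"url": u, "accessible": True} for u in todo]

    print("to check:", todo)
    print("status from new rows only:", exit_status(new_rows))
    print("status from cache + new rows:", exit_status(cache + new_rows))
```

If the build should keep warning about failures recorded in earlier runs, computing the status over the merged rows is one option; `--recheck` already covers the case where cached failures may have been fixed upstream.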