README.md (31 additions, 10 deletions)

```bash
make api  # then open http://127.0.0.1:8000/ui/
```
- FastAPI app with `GET /health`, `GET /models`, `POST /analyze`, and
`POST /eval`.
- One-page Web UI at `/ui/` for the main local review workflow, including
model selection and opt-in Local model review controls with timeout recovery.
- YAML policy files under `adlint/policies/`, plus custom policy paths.
- Deterministic rule engine with policy-module, platform, and industry filters,
including `platform: "all"` for broad cross-platform preflight checks.
- Transparent score thresholds for `approved`, `needs_review`, and `high_risk`.
- Optional `scoring.yml` threshold and weight overrides for team calibration.
- JSON stdout, Markdown stdout, and paired JSON/Markdown report files.
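
The `scoring.yml` override idea can be pictured with a small sketch (key names here are hypothetical, not the shipped schema):

```yaml
# Hypothetical scoring.yml override; real key names depend on AdLint's schema.
thresholds:
  approved: 85       # scores at or above this stay "approved"
  needs_review: 60   # scores below this become "high_risk"
weights:
  brand_safety: 1.5  # up-weight brand-safety hits for this team
```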

Optional input fields include `target_age_range`, `landing_page_url`,
`model_enabled`, `model_affects_score`, `ollama_model`, `logging_enabled`,
`log_path`, `storage_enabled`, and `storage_path`.

Use `platform: "all"` when you want one broad preflight pass across the
platform-scoped policy modules AdLint currently ships. This is useful for early
creative review before a channel is final, but it is not a platform-parity
claim; use a specific platform value such as `google`, `meta`, `tiktok`, or
`linkedin` when checking channel-specific launch risk.
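
A hedged example `POST /analyze` input combining a broad preflight with the optional fields above (creative field names and values are illustrative, not the exact request schema):

```json
{
  "platform": "all",
  "industry": "wellness",
  "headline": "Feel better in 30 days",
  "target_age_range": "18-65",
  "landing_page_url": "https://example.com/offer",
  "model_enabled": false
}
```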

## Scoring configuration

`POST /eval` rows each carry an `input` object and an optional
`expected_decision`.

The Web UI starts in rule-only mode. Users can opt into Local model review, and
separately opt into score impact. The model selector is populated from local
Ollama tags when available, filters obvious embedding-only models, and falls
back to known review-model options. API callers can omit `model_enabled` or set
it to `false` for a rule-only run, set `model_enabled: true` for metadata-only
model notes, or set `model_affects_score: true` when valid model findings
should join `policy_hits` and affect the final score. `ollama_model` overrides
the default local model for that request.

Run the seed eval:

```bash
make eval
```
The seed dataset has 58 examples across health, wellness, finance, SaaS,
creator disclosure, privacy, landing-page mismatch, brand-safety, and Meta
platform-policy cases. It is a development sanity check, not a production
benchmark. The current PR #16-era local validation baseline is 1.000 decision
accuracy with no policy/category false-positive or false-negative notes.

Run the larger deterministic benchmark:

```bash
make benchmark
```

The synthetic policy regression benchmark currently has 213 examples and is
intended to catch deterministic rule regressions before release work.

Refresh or validate the policy coverage matrix:

```bash

The default Ollama endpoint is `http://localhost:11434/api/chat`. Set
`ADLINT_OLLAMA_URL` to point AdLint at a different Ollama-compatible chat
endpoint. `ADLINT_OLLAMA_TIMEOUT` bounds generation time, and
`ADLINT_OLLAMA_NUM_PREDICT` can cap model output length for slower local
models or eval runs.
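
For example, a bounded local eval run can combine both variables with the values used in the PR #16 matrix (adjust for your hardware):

```bash
ADLINT_OLLAMA_TIMEOUT=180 ADLINT_OLLAMA_NUM_PREDICT=1024 make eval
```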

AdLint sends deterministic local classifier calls with JSON formatting,
`temperature: 0`, and `think: false` where supported, so reasoning chatter is
less likely to break the strict response schema. Fenced or wrapped JSON is
accepted when the enclosed object validates.
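
Concretely, an Ollama-style `/api/chat` request with these settings might look like the following sketch (exact payload fields depend on the Ollama version in use):

```json
{
  "model": "gpt-oss-safeguard:20b",
  "stream": false,
  "format": "json",
  "think": false,
  "options": {"temperature": 0, "num_predict": 1024},
  "messages": [{"role": "user", "content": "Classify this ad against the policy schema."}]
}
```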

If the model endpoint is unavailable, AdLint still returns rule-based findings
and marks the model status as `unavailable`. A browser-side timeout in the Web
UI aborts the request before any payload arrives, so that path shows an error
state rather than rule findings.
Invalid model JSON or schema violations are marked `invalid_response` and are
ignored for scoring. Landing-page excerpts are treated as untrusted evidence in
docs/local_models.md (39 additions, 25 deletions)

AdLint adds an Ollama-compatible model pass as decision-support metadata; it
does not replace the rule-based findings or provide legal advice.

The Web UI starts in rule-only mode and lets users opt into Local model review.
The model selector is populated from local Ollama tags when available, filters
obvious embedding-only models, and falls back to known review-model options.
Client-side timeout recovery keeps a slow/stuck local model call from leaving
the form disabled forever. API callers control the same behavior with
`model_enabled`, `model_affects_score`, and `ollama_model` in `POST /analyze`.
CLI users can pass `--enable-model`, `--model-affects-score`, and
`--ollama-model`.

Example:
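
A hypothetical invocation using those flags (the actual CLI entry point and positional arguments may differ):

```bash
adlint analyze ad.json \
  --enable-model \
  --model-affects-score \
  --ollama-model gpt-oss-safeguard:20b
```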

Environment variables:

- `ADLINT_OLLAMA_TIMEOUT`: bounds local model generation time.
  The default is 45 seconds; eval targets use 300 seconds for slow local rows.
- `ADLINT_OLLAMA_NUM_PREDICT`: optional positive token cap for local model
generation. Leave unset for Ollama defaults; set it for evals when a model
times out while producing verbose JSON. The development API target uses a
bounded default so local review runs fail closed instead of hanging forever.

API fields:

Local model output is treated as untrusted runtime metadata until it passes the
response schema checks:
- `categories` and `evidence` are arrays of strings.
- `recommended_action`, when present, is a string.
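
Assuming the model's `decision` mirrors AdLint's `approved` / `needs_review` / `high_risk` set, the checks above can be sketched as follows (the real validator may differ):

```python
# Minimal schema gate for a local model reply; returns (ok, reason).
ALLOWED_DECISIONS = {"approved", "needs_review", "high_risk"}  # assumed decision set

def validate_reply(reply: dict) -> tuple[bool, str]:
    if reply.get("decision") not in ALLOWED_DECISIONS:
        return False, "unknown decision"
    for key in ("categories", "evidence"):
        value = reply.get(key, [])
        if not (isinstance(value, list) and all(isinstance(item, str) for item in value)):
            return False, f"{key} must be a list of strings"
    action = reply.get("recommended_action")
    if action is not None and not isinstance(action, str):
        return False, "recommended_action must be a string"
    return True, "ok"
```

A reply that fails any check is ignored rather than repaired.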

AdLint sends deterministic local classifier calls with JSON formatting,
`temperature: 0`, and `think: false` where supported. Some local models still
wrap JSON in Markdown fences or pre/post text, so AdLint extracts the enclosed
JSON object before schema validation.
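
One way to implement that unwrapping, as a sketch (AdLint's actual parser may differ):

```python
import json
import re

def extract_json_object(text: str):
    """Return the first parseable JSON object, unwrapping Markdown fences if present."""
    candidates = []
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Fall back to the outermost brace span in the raw text.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start : end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

Extraction only locates a candidate object; the schema checks above still decide whether it is usable.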

Invalid JSON, unknown decisions, and malformed fields produce
`status: invalid_response`, `valid_response: false`, and `ignored: true`. Those
responses add no findings and never affect scoring.
## Current Model Recommendation

Keep deterministic rules as the production baseline. The recommended local
model default remains `gpt-oss-safeguard:20b`. Use local model output as review
metadata only until measured runs show reliable decision accuracy, detailed
YAML policy-id recall, and acceptable false-review burden.

Run `make model-smoke` before spending time on a full live model-quality run.
The older 2026-05-04 smoke showed why score impact remains off by default:
models could add generic review notes or invalid JSON even when hybrid rule
accuracy stayed intact.

PR #16 improved runtime stability by disabling Ollama thinking output where
supported, accepting fenced JSON, and bounding local generation. With
`ADLINT_OLLAMA_TIMEOUT=180` and `ADLINT_OLLAMA_NUM_PREDICT=1024`, a live local
matrix reported `model.status: ok` for installed review models including
`gpt-oss-safeguard:20b`, `gpt-oss:20b`, `qwen3-coder:30b`,
`qwen3.5:35b-a3b`, and `gemma4:26b`. Treat that as runtime compatibility, not
proof of better policy judgment.

Model selection guidance:

- Prefer `gpt-oss-safeguard:20b` as the default local reviewer until newer
evals show another model has better useful-note precision and lower review
burden.
- Use `ADLINT_OLLAMA_TIMEOUT` and `ADLINT_OLLAMA_NUM_PREDICT` for slow models
during manual evals, but do not infer quality from `status: ok` alone.
- Keep `model_affects_score` off unless you are explicitly measuring whether a
model's findings improve outcomes without adding unacceptable false review
burden.