diff --git a/README.md b/README.md index b329a3f..5ecc7b9 100644 --- a/README.md +++ b/README.md @@ -63,9 +63,10 @@ make api # then open http://127.0.0.1:8000/ui/ - FastAPI app with `GET /health`, `GET /models`, `POST /analyze`, and `POST /eval`. - One-page Web UI at `/ui/` for the main local review workflow, including - model selection and opt-in Local model review controls. + model selection and opt-in Local model review controls with timeout recovery. - YAML policy files under `adlint/policies/`, plus custom policy paths. -- Deterministic rule engine with policy-module, platform, and industry filters. +- Deterministic rule engine with policy-module, platform, and industry filters, + including `platform: "all"` for broad cross-platform preflight checks. - Transparent score thresholds for `approved`, `needs_review`, and `high_risk`. - Optional `scoring.yml` threshold and weight overrides for team calibration. - JSON stdout, Markdown stdout, and paired JSON/Markdown report files. @@ -177,8 +178,14 @@ Example config: ``` Optional input fields include `target_age_range`, `landing_page_url`, -`model_enabled`, `ollama_model`, `logging_enabled`, `log_path`, -`storage_enabled`, and `storage_path`. +`model_enabled`, `model_affects_score`, `ollama_model`, `logging_enabled`, +`log_path`, `storage_enabled`, and `storage_path`. + +Use `platform: "all"` when you want one broad preflight pass across the +platform-scoped policy modules AdLint currently ships. This is useful for early +creative review before a channel is final, but it is not a platform-parity +claim; use a specific platform value such as `google`, `meta`, `tiktok`, or +`linkedin` when checking channel-specific launch risk. ## Scoring configuration @@ -246,7 +253,9 @@ Endpoints: `input` object and optional `expected_decision`. The Web UI starts in rule-only mode. Users can opt into Local model review, and -separately opt into score impact. 
API callers can omit `model_enabled` or set +separately opt into score impact. The model selector is populated from local +Ollama tags when available, filters obvious embedding-only models, and falls +back to known review-model options. API callers can omit `model_enabled` or set it to `false` for a rule-only run, set `model_enabled: true` for metadata-only model notes, or set `model_affects_score: true` when valid model findings should join `policy_hits` and affect the final score. `ollama_model` overrides @@ -337,7 +346,8 @@ make eval The seed dataset has 58 examples across health, wellness, finance, SaaS, creator disclosure, privacy, landing-page mismatch, brand-safety, and Meta platform-policy cases. It is a development sanity check, not a production -benchmark. +benchmark. The current PR #16-era local validation baseline is 1.000 decision +accuracy with no policy/category false-positive or false-negative notes. Run the larger deterministic benchmark: @@ -345,6 +355,9 @@ Run the larger deterministic benchmark: make benchmark ``` +The synthetic policy regression benchmark currently has 213 examples and is +intended to catch deterministic rule regressions before release work. + Refresh or validate the policy coverage matrix: ```bash @@ -467,10 +480,18 @@ ADLINT_OLLAMA_MODEL=gpt-oss-safeguard:20b \ The default Ollama endpoint is `http://localhost:11434/api/chat`. Set `ADLINT_OLLAMA_URL` to point AdLint at a different Ollama-compatible chat -endpoint. - -If the model endpoint is unavailable, AdLint still returns rule-based findings -and marks the model status as `unavailable`. +endpoint. `ADLINT_OLLAMA_TIMEOUT` bounds generation time, and +`ADLINT_OLLAMA_NUM_PREDICT` can cap model output length for slower local +models or eval runs. + +AdLint sends deterministic local classifier calls with JSON formatting, +`temperature: 0`, and `think: false` where supported, so reasoning chatter is +less likely to break the strict response schema. 
Fenced or wrapped JSON is +accepted when the enclosed object validates. + +If the model endpoint is unavailable or a browser request times out, AdLint +still returns rule-based findings when possible and marks the model status as +`unavailable`. Invalid model JSON or schema violations are marked `invalid_response` and are ignored for scoring. Landing-page excerpts are treated as untrusted evidence in diff --git a/docs/local_models.md b/docs/local_models.md index 20f75c9..1ab9238 100644 --- a/docs/local_models.md +++ b/docs/local_models.md @@ -5,9 +5,13 @@ AdLint adds an Ollama-compatible model pass as decision-support metadata; it does not replace the rule-based findings or provide legal advice. The Web UI starts in rule-only mode and lets users opt into Local model review. -API callers control the same behavior with `model_enabled`, -`model_affects_score`, and `ollama_model` in `POST /analyze`. CLI users can -pass `--enable-model`, `--model-affects-score`, and `--ollama-model`. +The model selector is populated from local Ollama tags when available, filters +obvious embedding-only models, and falls back to known review-model options. +Client-side timeout recovery keeps a slow/stuck local model call from leaving +the form disabled forever. API callers control the same behavior with +`model_enabled`, `model_affects_score`, and `ollama_model` in `POST /analyze`. +CLI users can pass `--enable-model`, `--model-affects-score`, and +`--ollama-model`. Example: @@ -26,7 +30,8 @@ Environment variables: The default is 45 seconds; eval targets use 300 seconds for slow local rows. - `ADLINT_OLLAMA_NUM_PREDICT`: optional positive token cap for local model generation. Leave unset for Ollama defaults; set it for evals when a model - times out while producing verbose JSON. + times out while producing verbose JSON. The development API target uses a + bounded default so local review runs fail closed instead of hanging forever. 
API fields: @@ -54,6 +59,11 @@ Local model output is treated as untrusted runtime metadata until it passes the - `categories` and `evidence` are arrays of strings. - `recommended_action`, when present, is a string. +AdLint sends deterministic local classifier calls with JSON formatting, +`temperature: 0`, and `think: false` where supported. Some local models still +wrap JSON in Markdown fences or pre/post text, so AdLint extracts the enclosed +JSON object before schema validation. + Invalid JSON, unknown decisions, and malformed fields produce `status: invalid_response`, `valid_response: false`, and `ignored: true`. Those responses add no findings and never affect scoring. @@ -116,26 +126,30 @@ is ready for score impact. ## Current Model Recommendation Keep deterministic rules as the production baseline. The recommended local -model default remains `gpt-oss-safeguard:20b` with the normal Ollama generation -settings. Use local model output as review metadata only until measured runs -show reliable decision accuracy and detailed YAML policy-id recall. +model default remains `gpt-oss-safeguard:20b`. Use local model output as review +metadata only until measured runs show reliable decision accuracy, detailed +YAML policy-id recall, and acceptable false-review burden. Run `make model-smoke` before spending time on a full live model-quality run. 
-The 2026-05-04 smoke used the first three seed rows and produced:
-
-| Configuration | Runtime | Model-only rows | Model-only accuracy | Hybrid accuracy | Model status | Generic review additions | Detailed policy-id additions | Rescued rule false negatives |
-| --- | ---: | ---: | ---: | ---: | --- | ---: | ---: | ---: |
-| `gpt-oss-safeguard:20b` default | 81.508s | 3 | 0.333 | 1.000 | `ok: 3` | 2 | 0 | 0 |
-| `gpt-oss-safeguard:20b`, `ADLINT_OLLAMA_NUM_PREDICT=128` | 28.081s | 0 | 0.000 | 1.000 | `invalid_response: 3` | 3 | 0 | 0 |
-
-The capped-token setting is faster, but it produced invalid structured
-responses on every model-required smoke row. Do not use it as a default quality
-setting.
-
-An additional installed model, `qwen3.5:35b-a3b`, was tested on the blind
-holdout as a manual diagnostic. The run took 2749.144 seconds, returned
-`invalid_response: 90` for model-only and hybrid model calls, scored 0 model-only
-rows, and reduced hybrid decision accuracy to 0.656 on the pre-triage blind
-baseline. A later smoke attempt was stopped after more than four minutes with no
-completed output. Treat this model as rejected for current AdLint eval use until
-its structured JSON behavior is fixed.
+The older 2026-05-04 smoke showed why score impact remains off by default:
+models could add generic review notes or invalid JSON even when hybrid rule
+accuracy stayed intact.
+
+PR #16 improved runtime stability by disabling Ollama thinking output where
+supported, accepting fenced JSON, and bounding local generation. With
+`ADLINT_OLLAMA_TIMEOUT=180` and `ADLINT_OLLAMA_NUM_PREDICT=1024`, a live local
+model matrix reported `model.status: ok` for installed review models including
+`gpt-oss-safeguard:20b`, `gpt-oss:20b`, `qwen3-coder:30b`,
+`qwen3.5:35b-a3b`, and `gemma4:26b`. Treat that as runtime compatibility, not
+proof of better policy judgment.
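For manual eval runs, the bounded settings above can be combined in one
invocation. This is a sketch only: `make model-smoke` is the documented smoke
entry point, the timeout and token-cap values mirror the matrix run described
above, and the right values depend on your local hardware:

```bash
ADLINT_OLLAMA_TIMEOUT=180 \
ADLINT_OLLAMA_NUM_PREDICT=1024 \
ADLINT_OLLAMA_MODEL=gpt-oss-safeguard:20b \
make model-smoke
```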
+ +Model selection guidance: + +- Prefer `gpt-oss-safeguard:20b` as the default local reviewer until newer + evals show another model has better useful-note precision and lower review + burden. +- Use `ADLINT_OLLAMA_TIMEOUT` and `ADLINT_OLLAMA_NUM_PREDICT` for slow models + during manual evals, but do not infer quality from `status: ok` alone. +- Keep `model_affects_score` off unless you are explicitly measuring whether a + model's findings improve outcomes without adding unacceptable false review + burden.
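As a closing illustration of that last point, a metadata-only model opt-in for
`POST /analyze` might look like the fragment below. The control-field names
come from this document; whether they sit at the top level of the request body
is an assumption here, and the required creative-content fields are elided
because their exact names are not listed in this section:

```json
{
  "platform": "meta",
  "model_enabled": true,
  "model_affects_score": false,
  "ollama_model": "gpt-oss-safeguard:20b"
}
```

With `model_affects_score` left `false`, valid model findings stay out of
`policy_hits` and the final score, matching the metadata-only default.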