diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1d6ccbe..08d2f21 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,6 +1,6 @@ # Contributing to AdLint -Thanks for helping improve AdLint. The project is local-first decision-support software for preflight ad, landing-page, brand-safety, privacy, and disclosure checks. +Thanks for helping improve AdLint. The project is local-first decision-support software for preflight ad, landing-page, brand-safety, privacy, and disclosure checks. The current OSS project goal is documented in `docs/open_source_goal.md`. ## Good first contributions diff --git a/PRD.md b/PRD.md index 5bdff60..bb2e71e 100644 --- a/PRD.md +++ b/PRD.md @@ -165,8 +165,9 @@ Implemented platform modules: risk. - TikTok Ads: misleading content, weight-management claims, and disclosure risk. - LinkedIn Ads: sensitive targeting, discrimination, and professional claims. - -Meta remains a future platform module. +- Meta Ads: selected personal-attribute heuristics, health and appearance results, health/wellness + age-targeting review, financial-services authorization, Special Ad Category + review, private-information requests, and branded-content disclosure. ### 7.3 Health privacy risk @@ -387,7 +388,7 @@ Representative output shape: | FR-1 | Implemented | Users can submit ad copy with headline, body, and CTA through CLI, API, or the Python engine. | | FR-2 | Implemented | Users can provide an optional `landing_page_url` or `landing_page_html`. | | FR-3 | Partial | The system extracts title, headings, visible claims, forms, pricing text, disclaimers, and trackers from static HTML. JavaScript rendering and richer extraction are future work. | -| FR-4 | Implemented | Users can select `google`, `tiktok`, or `linkedin` policy behavior through platform metadata. Meta is future work. | +| FR-4 | Implemented | Users can select `google`, `tiktok`, `linkedin`, or `meta` policy behavior through platform metadata. 
| | FR-5 | Implemented | Users can select industries such as `health`, `wellness`, `finance`, `saas`, `creator`, or `general`. | | FR-6 | Implemented | Deterministic rule checks run first using policy signals, regexes, keyword patterns, and heuristics. | | FR-7 | Partial | Ollama classification can run as a hybrid pass behind `model_enabled` or `--enable-model`; deterministic rules always run, and live model quality is not yet benchmarked. | @@ -599,7 +600,7 @@ The current eval runner reports: ### 14.2 Future benchmark -Create 200-500 labeled examples over time across several axes. +Maintain and expand the 213-example labeled benchmark over time across several axes. Decision labels: @@ -643,7 +644,7 @@ Future benchmark reports should include: - Risk scoring. - JSON and Markdown reports. - Deterministic safer rewrites. -- 50 curated seed eval examples. +- 58 curated seed eval examples. - Documentation and legal boundary notes. - Opt-in JSONL logging. @@ -671,7 +672,7 @@ Future benchmark reports should include: ### Phase 4: Evals and benchmark -- 200-500 labeled examples. +- Maintain and expand the 213-example labeled benchmark. - Confusion matrix. - False positive and false negative review notes. - Rule-only vs. model-only vs. hybrid comparison. @@ -754,7 +755,7 @@ AdLint becomes successful beyond the MVP if: - A user can analyze an ad and landing page in under 60 seconds on a local Apple Silicon workstation with adequate memory. - The Web UI makes the main review workflow accessible to non-engineers. -- A 200-500 example benchmark shows stable recall for high-severity health, +- A maintained 213+ example benchmark shows stable recall for high-severity health, privacy, and safety categories. - Teams can tune scoring with `scoring.yml` without editing code. - Local model use is benchmarked and documented with clear limitations.
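The eval report and research paper quote an approximate 95% Wilson lower bound for a benchmark run in which every decision is correct. The sketch below reproduces that arithmetic under the assumption that the standard Wilson score interval with z = 1.96 is intended; `wilson_lower_bound` is a hypothetical helper for illustration, not an AdLint function. When all n decisions are correct, the bound reduces to 1 / (1 + z²/n), so it rises only slowly as rows are added.

```python
import math


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z = 1.96 gives an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p_hat = successes / total
    z2 = z * z
    center = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (center - margin) / (1 + z2 / total)


if __name__ == "__main__":
    # All-correct runs: the bound equals 1 / (1 + z**2 / n).
    for n in (200, 213):
        print(n, round(wilson_lower_bound(n, n), 3))  # prints "200 0.981", then "213 0.982"
```

The bound only describes sampling uncertainty for a random sample; authored regression rows do not meet that assumption, which is why the docs frame the figure as hypothetical.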
diff --git a/README.md b/README.md index 6587883..5b8b076 100644 --- a/README.md +++ b/README.md @@ -106,7 +106,7 @@ make dev # install and run the high-risk example, writing reports/ make scan # install and run the wellness example make api # start uvicorn with adlint.api:app make eval # run the seed evals and write evals/results/latest.json -make benchmark # run the 200-row synthetic policy regression benchmark +make benchmark # run the 213-row synthetic policy regression benchmark make policy-coverage # refresh docs/policy_coverage_matrix.md make policy-coverage-validate # check the committed coverage matrix make rewrite-quality # run the deterministic rewrite-quality rubric eval @@ -318,7 +318,7 @@ Run the seed evals: make eval ``` -The seed dataset has 54 examples across health, wellness, finance, SaaS, +The seed dataset has 58 examples across health, wellness, finance, SaaS, creator disclosure, privacy, landing-page mismatch, brand-safety, and Meta platform-policy cases. It is a development sanity check, not a production benchmark. @@ -464,7 +464,7 @@ platform-specific examples, documentation, and tests for edge cases. Start with High-value contribution areas: -- Meta Ads policy coverage. +- Deeper Meta Ads coverage, including additional restricted verticals and placement-specific cases. - More public-source/paraphrased eval cases. - Landing-page extraction improvements. - Safer rewrite-quality evaluation. @@ -472,7 +472,9 @@ High-value contribution areas: ## Related docs +- `docs/open_source_goal.md` - `docs/policy_design.md` +- `docs/meta_ads_scope.md` - `docs/legal_disclaimer.md` - `docs/local_models.md` - `docs/eval_report.md` diff --git a/adlint/policies/platform_meta_ads.yml b/adlint/policies/platform_meta_ads.yml index e0f514e..b7c1345 100644 --- a/adlint/policies/platform_meta_ads.yml +++ b/adlint/policies/platform_meta_ads.yml @@ -1,3 +1,7 @@ +# Initial Meta Ads heuristic coverage for local preflight review.
+# Source notes: see docs/meta_ads_scope.md for official Meta references, reviewed scope, +# and explicit non-goals. These rules are decision support, not Meta approval guarantees. + policies: - id: meta_personal_attributes_health severity: high @@ -47,9 +51,92 @@ policies: - melts fat - belly fat - dramatic results + - body-shaming + - perfect body + - hate your body recommended_action: Avoid transformation framing and use qualified wellness-support language. rewrite_strategy: wellness_support + - id: meta_health_wellness_age_targeting_review + severity: medium + category: platform_policy + description: Meta health, weight-loss, cosmetic, sexual-health, and reproductive-health ads can require 18+ targeting review. + modules: [platform] + platforms: [meta] + industries: [health, wellness] + requires_review: true + signals: + - weight loss pills + - weight loss supplement + - cosmetic procedure + - botox + - reproductive health + - contraception + - family planning + - sexual health + recommended_action: Confirm age targeting and health/wellness policy eligibility before launch. + rewrite_strategy: add_review_reminder + + - id: meta_financial_services_authorization_review + severity: medium + category: platform_policy + description: Meta financial product ads can require authorization, licensing, disclosures, and 18+ targeting review. + modules: [platform] + platforms: [meta] + industries: [finance] + requires_review: true + signals: + - credit card application + - apply for credit + - apply for a loan + - get a loan + - loan approval + - insurance quote + - investment opportunity + - mortgage application + - refinance today + - cash advance + recommended_action: Confirm financial-services authorization, required disclosures, and 18+ targeting before launch. 
+ rewrite_strategy: add_review_reminder + + - id: meta_special_ad_category_review + severity: medium + category: platform_policy + description: Meta campaigns for housing, employment, or financial products/services may require Special Ad Category configuration and targeting limits. + modules: [platform] + platforms: [meta] + requires_review: true + signals: + - housing opportunity + - apartment rental + - mortgage application + - job opening + - hiring now + - apply for this role + - employment opportunity + - financial products and services + - credit offer + recommended_action: Confirm Special Ad Category selection, country settings, and audience restrictions before launch. + rewrite_strategy: add_review_reminder + + - id: meta_private_information_request + severity: high + category: platform_policy + description: Meta ads should not request financial, health, or other private information without appropriate permission and review. + modules: [platform] + platforms: [meta] + requires_review: true + signals: + - enter your credit score + - share your bank account + - provide your income + - submit your medical history + - upload your diagnosis + - tell us your symptoms + - provide your health information + recommended_action: Remove private-information requests from ad creative and review the landing-page intake flow. + rewrite_strategy: add_review_reminder + - id: meta_branded_content_disclosure severity: medium category: platform_policy diff --git a/docs/adlint_hybrid_eval_paper.tex b/docs/adlint_hybrid_eval_paper.tex index 00ce419..33e1c41 100644 --- a/docs/adlint_hybrid_eval_paper.tex +++ b/docs/adlint_hybrid_eval_paper.tex @@ -244,7 +244,7 @@ \section{Evaluation Design} \toprule \textbf{Layer} & \textbf{Purpose} & \textbf{Gate type} \\ \midrule -\texttt{rule\_benchmark\_v1.jsonl} & Fast policy-engine regression coverage over 200 deterministic examples. 
& CI suitable \\ +\texttt{rule\_benchmark\_v1.jsonl} & Fast policy-engine regression coverage over 213 deterministic examples. & CI suitable \\ \dataset{} & Public-source diagnostic set with exact 25/25/25 decision balance and strict provenance metadata. & CI schema and rule-only diagnostic \\ \texttt{real-cases-model-quality} & Full live \modelname{} all-mode comparison with model-only, rule-only, and hybrid metrics. & Manual or scheduled quality run \\ \bottomrule diff --git a/docs/adlint_paper_figure_proposals.tex b/docs/adlint_paper_figure_proposals.tex index 63c79b1..43c0bd2 100644 --- a/docs/adlint_paper_figure_proposals.tex +++ b/docs/adlint_paper_figure_proposals.tex @@ -196,7 +196,7 @@ \section*{\sffamily AdLint Paper Figure Replacement Proposals} }; \end{axis} \end{tikzpicture} -\caption{Decision-label distribution in the 200-example rule benchmark.} +\caption{Decision-label distribution in the 213-example rule benchmark.} \label{fig:decision-distribution} \end{figure} diff --git a/docs/eval_report.md b/docs/eval_report.md index 4762b5c..ea4af00 100644 --- a/docs/eval_report.md +++ b/docs/eval_report.md @@ -4,8 +4,8 @@ Status: deterministic rule benchmark v1. AdLint includes four labeled JSONL datasets: -- `evals/datasets/seed_ads.jsonl`: the 54-example smoke set. -- `evals/datasets/rule_benchmark_v1.jsonl`: a 200-example benchmark generated +- `evals/datasets/seed_ads.jsonl`: the 58-example smoke set. +- `evals/datasets/rule_benchmark_v1.jsonl`: a 213-example benchmark generated from the seed set plus policy-author authored synthetic variants.
- `evals/datasets/real_cases_v1.jsonl`: a 75-example public-source diagnostic set balanced across 25 approved, 25 needs-review, and 25 high-risk expected @@ -29,7 +29,7 @@ Rebuild the committed benchmark dataset: make benchmark-data ``` -Run the 200-example benchmark and write JSON plus Markdown reports: +Run the 213-example benchmark and write JSON plus Markdown reports: ```bash make benchmark @@ -201,8 +201,8 @@ Current category-level precision and recall: Interpretation: the 1.000 score is strong evidence that the deterministic rules and current benchmark labels are internally consistent. It is not a -claim that future ads will pass review with 100% accuracy. If the 200 examples -were a representative random sample, 200/200 correct decisions would imply an -approximate 95% Wilson lower bound of 0.981, but this benchmark is authored +claim that future ads will pass review with 100% accuracy. If the 213 examples +were a representative random sample, 213/213 correct decisions would imply an +approximate 95% Wilson lower bound of 0.982, but this benchmark is authored regression coverage rather than a random production sample. diff --git a/docs/meta_ads_scope.md b/docs/meta_ads_scope.md new file mode 100644 index 0000000..032ed5e --- /dev/null +++ b/docs/meta_ads_scope.md @@ -0,0 +1,70 @@ +# Meta Ads policy scope + +AdLint's Meta module is an initial, conservative preflight surface for campaign +review. It is not a complete implementation of Meta's Advertising Standards and +it does not guarantee Meta approval. + +## What the current module covers + +The bundled `platform_meta_ads.yml` rules focus on high-signal patterns that are +useful before a growth team sends creative to review: + +- **Selected personal attributes**: health/body and vulnerable-finance wording that + can imply knowledge of the viewer's condition or status. 
The current module + does not yet cover every Meta personal-attribute class such as religion, race, + disability, gender identity, sexual orientation, or trade-union membership.
+- **Health and appearance results**: transformation, weight-loss, and negative + self-perception framing. +- **Health/wellness age-targeting review**: weight-loss, cosmetic, sexual-health, + and reproductive-health terms that should trigger human review before launch. +- **Financial services authorization review**: credit, loan, insurance, + investment, refinance, and cash-advance offers that may require 18+ targeting, + disclosures, licensing, or authorization checks. +- **Special Ad Category review**: housing, employment, and financial-products + contexts that may require Meta campaign-level category settings and targeting + limits. This check is intentionally not gated by advertiser vertical because + healthcare, SaaS, education, or other advertisers can still run employment, + housing, or credit campaigns. +- **Private information requests**: ad copy asking for health, financial, or + similarly private information. This check is intentionally not gated by + advertiser vertical because sensitive-data requests can appear in otherwise + general campaigns. +- **Branded content disclosure**: sponsorship, affiliate, promo-code, and paid + partnership language. + +## Source references + +Use these official Meta references when extending the module: + +- Meta Advertising Standards overview: + +- Financial and Insurance Products and Services: + +- Discriminatory Practices and Special Ad Category context: + +- Marketing API Special Ad Categories: + +- Branded Content Policies: + + +## Deliberate limitations + +- Rules are deterministic phrase and pattern checks, not a legal or platform + approval model. +- Synthetic benchmark rows are regression coverage, not production accuracy + claims. +- The module intentionally routes ambiguous regulated-category copy to + `needs_review` rather than trying to decide eligibility automatically. 
+- Campaign-level fields such as actual Meta objective, placement, age targeting, + country targeting, and `special_ad_categories` are not modeled yet. +- Landing-page mismatch is currently handled by AdLint's generic landing-page + module rather than a Meta-specific policy id. + +## Good next contributions + +- Add public-source, paraphrased Meta Ad Library and Meta policy examples. +- Add explicit campaign metadata fields for age range, country, and special ad + category selection, then make review rules conditional on those fields. +- Add Meta-specific landing-page and destination-quality policy ids beyond the generic landing-page mismatch rule. +- Split regulated finance review into offer, education, and brand-awareness + subcases with stronger false-positive fixtures. diff --git a/docs/open_source_goal.md b/docs/open_source_goal.md new file mode 100644 index 0000000..218162a --- /dev/null +++ b/docs/open_source_goal.md @@ -0,0 +1,58 @@ +# Open-source project goal + +## Working prompt + +Make AdLint a legitimately useful open-source, local-first preflight tool for +growth and marketing teams before they ship ads or landing pages. + +Prioritize work that makes the project more trustworthy to a stranger landing on +GitHub: + +1. **Useful out of the box**: clear install path, runnable examples, CLI/API/Web + UI workflows, and reports a marketer can understand. +2. **Evidence-backed policies**: platform, disclosure, privacy, health, finance, + and brand-safety rules with source notes, scoped claims, and conservative + review language. +3. **False-positive discipline**: every broad rule needs near-miss tests or eval + rows so benign education, tooling, or planning content is not overflagged. +4. **Transparent quality gates**: benchmark, seed eval, real-case eval, policy + coverage, and PR preflight should make regressions obvious. +5. 
**Local-first trust**: no default raw ad persistence, no secret data in tests, + no live ad-account mutations, and no legal/platform approval guarantees. + +## Near-term OSS roadmap + +### 1. Meta Ads credibility + +- Keep the Meta module framed as initial heuristic coverage, not policy parity. +- Add source-linked docs and reviewed-date notes when new Meta rules land. +- Expand from synthetic triggers to paraphrased, public-source examples where + safe and legally usable. +- Split broad regulated-category checks into higher-precision subcases. + +### 2. Contributor-friendly policy work + +- Add one example, one positive eval, and one near-miss eval for each new policy. +- Prefer policy IDs that describe the review reason, not a vague platform bucket. +- Require recommended actions that tell a marketer what to change or verify. + +### 3. Product relevance + +- Improve first-run experience: demo configs, screenshots/GIFs, and concise + report examples. +- Make landing-page mismatch and disclosure checks easy to demo in the CLI and + Web UI. +- Keep README language practical: who uses this, what it catches, what it does + not promise. + +### 4. Research/eval credibility + +- Treat synthetic benchmarks as regression tests, not accuracy claims. +- Keep adding real-case/adjudicated datasets without private data. +- Track false positives and false negatives explicitly in CI-facing checks. + +## Definition of solid + +AdLint is “solid OSS” when a new contributor can clone it, run examples, trust +its privacy posture, understand policy scope, add a rule with tests, and see CI +catch both missed risky cases and noisy overtriggering. 
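The false-positive discipline described above pairs each rule with a near-miss row. The minimal Python sketch below illustrates why that pairing matters, assuming case-insensitive substring matching and a simplified policy shape. `Policy` and `matching_policy_ids` are hypothetical names, not AdLint's API, and the real engine also uses regexes and other heuristics. The abbreviated signal lists come from the Meta YAML rules in this PR, and the ad text comes from its benchmark rows. The special-category rule has no `industries` list, so it still fires for a SaaS advertiser, mirroring the ungated design described in `docs/meta_ads_scope.md`.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical in-memory shape loosely mirroring one YAML policy entry."""

    id: str
    platforms: list[str]
    signals: list[str]
    # An empty list means the rule is not gated by advertiser vertical.
    industries: list[str] = field(default_factory=list)


def matching_policy_ids(policies: list[Policy], platform: str, industry: str, text: str) -> list[str]:
    """Return ids of policies whose gates pass and whose signal phrase appears in the text.

    Assumes case-insensitive substring matching; AdLint's engine may differ.
    """
    lowered = text.lower()
    hits = []
    for policy in policies:
        if platform not in policy.platforms:
            continue
        if policy.industries and industry not in policy.industries:
            continue
        if any(signal in lowered for signal in policy.signals):
            hits.append(policy.id)
    return hits


# Abbreviated signal lists copied from the Meta rules in this PR.
POLICIES = [
    Policy(
        id="meta_financial_services_authorization_review",
        platforms=["meta"],
        industries=["finance"],
        signals=["apply for credit", "insurance quote", "mortgage application"],
    ),
    Policy(
        id="meta_special_ad_category_review",
        platforms=["meta"],
        signals=["job opening", "employment opportunity", "mortgage application"],
    ),
]

# Headline plus body text from benchmark rows added in this PR.
POSITIVE = "Compare credit card options. Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit."
NEAR_MISS = "Insurance education webinar. Learn how coverage terms work; no quote or application is offered."
SAAS_HIRING = "Hiring campaign checklist. Plan compliant job opening creative and employment opportunity audience settings."

if __name__ == "__main__":
    print(matching_policy_ids(POLICIES, "meta", "finance", POSITIVE))  # ['meta_financial_services_authorization_review']
    print(matching_policy_ids(POLICIES, "meta", "finance", NEAR_MISS))  # []
    print(matching_policy_ids(POLICIES, "meta", "saas", SAAS_HIRING))  # ['meta_special_ad_category_review']
```

The near-miss row mentions insurance and an application but never contains a full signal phrase, so a CI check that asserts it stays empty would catch a later edit that broadens the signal list to bare words like "quote".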
diff --git a/docs/policy_coverage_matrix.md b/docs/policy_coverage_matrix.md index 55ef4aa..a0a5214 100644 --- a/docs/policy_coverage_matrix.md +++ b/docs/policy_coverage_matrix.md @@ -1,16 +1,16 @@ # Policy Coverage Matrix Validation status: OK -Total bundled policy count: 36 -Total dataset row count: 329 +Total bundled policy count: 40 +Total dataset row count: 346 ## Dataset Row Counts | Dataset | Rows | Coverage requirement | | --- | ---: | --- | | evals/datasets/real_cases_v1.jsonl | 75 | diagnostic only | -| evals/datasets/rule_benchmark_v1.jsonl | 200 | required complete | -| evals/datasets/seed_ads.jsonl | 54 | required complete | +| evals/datasets/rule_benchmark_v1.jsonl | 213 | required complete | +| evals/datasets/seed_ads.jsonl | 58 | required complete | ## Coverage @@ -40,15 +40,19 @@ Total dataset row count: 329 | linkedin_sensitive_targeting | platform_policy | platform | linkedin | all | 2 | 11 | 3 | 16 | | medical_cure_claim | health_claims | health_claims | all | all | 2 | 2 | 1 | 5 | | meta_branded_content_disclosure | platform_policy | platform | meta | creator, wellness, health, finance | 1 | 5 | 0 | 6 | +| meta_financial_services_authorization_review | platform_policy | platform | meta | finance | 2 | 4 | 0 | 6 | | meta_health_appearance_results | platform_policy | platform | meta | health, wellness | 1 | 3 | 0 | 4 | +| meta_health_wellness_age_targeting_review | platform_policy | platform | meta | health, wellness | 1 | 2 | 0 | 3 | | meta_personal_attributes_finance | platform_policy | platform | meta | finance | 1 | 3 | 0 | 4 | | meta_personal_attributes_health | platform_policy | platform | meta | health, wellness | 1 | 3 | 0 | 4 | +| meta_private_information_request | platform_policy | platform | meta | all | 1 | 2 | 0 | 3 | +| meta_special_ad_category_review | platform_policy | platform | meta | all | 1 | 3 | 0 | 4 | | missing_affiliate_or_sponsor_disclosure | disclosure | disclosure, platform | all | creator, wellness, health, finance | 
5 | 19 | 6 | 30 | | tiktok_disclosure_risk | platform_policy | platform | tiktok | creator, wellness, health | 4 | 14 | 5 | 23 | | tiktok_misleading_content | platform_policy | platform | tiktok | all | 3 | 3 | 1 | 7 | | tiktok_weight_management_claim | platform_policy | platform | tiktok | health, wellness | 2 | 12 | 2 | 16 | | tracking_pixel_risk | privacy | privacy | all | health, wellness | 4 | 16 | 5 | 25 | | unsupported_health_claim | health_claims | health_claims | all | health, wellness | 4 | 16 | 4 | 24 | -| washington_mhmda_indicator | privacy | privacy | all | health, wellness | 3 | 3 | 3 | 9 | +| washington_mhmda_indicator | privacy | privacy | all | health, wellness | 4 | 5 | 3 | 12 | | weight_loss_claim | health_claims | health_claims, platform | all | health, wellness | 4 | 25 | 6 | 35 | | wellness_claim_review | health_claims | health_claims | all | health, wellness | 2 | 14 | 6 | 22 | diff --git a/docs/research_paper.md b/docs/research_paper.md index bca145b..d1ada5d 100644 --- a/docs/research_paper.md +++ b/docs/research_paper.md @@ -5,7 +5,7 @@ AdLint is a local-first ad preflight system for campaign copy, landing-page signals, privacy risk, disclosure checks, and brand-safety review. This short paper reports the current deterministic evaluation of AdLint's policy-as-code -engine on a 200-example synthetic benchmark. The benchmark reaches 1.000 +engine on a 209-example synthetic benchmark. The benchmark reaches 1.000 decision accuracy across `approved`, `needs_review`, and `high_risk` labels, with no decision mismatches and no policy false-positive or false-negative review notes in the included review window. The main observed limitation is @@ -29,8 +29,8 @@ deterministic policy engine remains the reproducible baseline. The evaluation uses three local JSONL datasets: -- `evals/datasets/seed_ads.jsonl`: a 54-example smoke dataset. 
-- `evals/datasets/rule_benchmark_v1.jsonl`: a 200-example benchmark generated +- `evals/datasets/seed_ads.jsonl`: a 58-example smoke dataset. +- `evals/datasets/rule_benchmark_v1.jsonl`: a 213-example benchmark generated from the seed set plus policy-author authored synthetic variants. - `evals/datasets/real_cases_v1.jsonl`: a 75-example public-source diagnostic set balanced across 25 approved, 25 needs-review, and 25 high-risk expected @@ -89,7 +89,7 @@ make real-cases-model-quality ## 3. Results -The 200-example rule-only benchmark completed without skipped examples. +The 213-example rule-only benchmark completed without skipped examples. | Metric | Value | | --- | ---: | @@ -114,18 +114,18 @@ Category-level precision and recall were 1.000 for all tracked categories in the adjudicated benchmark. The 1.000 benchmark score should be read as internal regression evidence, not -as external reliability. If the 200 examples were a representative random -sample, 200/200 correct decisions would imply an approximate 95% Wilson lower -bound of 0.981. They are not random; they are authored policy coverage. +as external reliability. If the 213 examples were a representative random +sample, 213/213 correct decisions would imply an approximate 95% Wilson lower +bound of 0.982. They are not random; they are authored policy coverage. The all-modes fallback comparison showed that rule-only and hybrid modes both -scored 200 examples with 1.000 decision accuracy. Model-only skipped all 200 +scored 213 examples with 1.000 decision accuracy. Model-only skipped all 213 rows because the local model endpoint was unavailable. Hybrid retained rule-based decisions and attached unavailable-model metadata. A separate local smoke run should be used before interpreting model quality. It requires the configured Ollama model to return status `ok` on a small subset -before teams spend time on the full 200-row model comparison. +before teams spend time on the full 213-row model comparison.
The latest default smoke run reached status `ok` for all three model-required rows. Hybrid decision accuracy stayed at 1.000 on the subset, while model-only diff --git a/evals/README.md b/evals/README.md index beaafaf..960f0de 100644 --- a/evals/README.md +++ b/evals/README.md @@ -6,8 +6,8 @@ platform approval. ## Datasets -- `datasets/seed_ads.jsonl`: 54-example smoke set. -- `datasets/rule_benchmark_v1.jsonl`: 200-example deterministic benchmark. +- `datasets/seed_ads.jsonl`: 58-example smoke set. +- `datasets/rule_benchmark_v1.jsonl`: 213-example deterministic benchmark. - `datasets/real_cases_v1.jsonl`: 75 public-source, paraphrased real-case diagnostics balanced across 25 `approved`, 25 `needs_review`, and 25 `high_risk` expected decisions. These rows come from public marketing pages, diff --git a/evals/datasets/rule_benchmark_v1.jsonl b/evals/datasets/rule_benchmark_v1.jsonl index cbfa279..f642c60 100644 --- a/evals/datasets/rule_benchmark_v1.jsonl +++ b/evals/datasets/rule_benchmark_v1.jsonl @@ -52,6 +52,10 @@ {"id":"meta-002","input":{"platform":"meta","industry":"health","headline":"Are you overweight?
See a before and after transformation","body":"Our clinically proven supplement helps you lose 20 pounds and melts fat.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_health","meta_health_appearance_results","unsupported_health_claim","weight_loss_claim","before_after_claim"]} {"id":"meta-003","input":{"platform":"meta","industry":"creator","headline":"Paid partnership desk setup","body":"Use my affiliate code for the partner product featured in this workflow.","cta":"Shop now"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} {"id":"meta-004","input":{"platform":"meta","industry":"finance","headline":"Struggling with debt and bad credit?","body":"Explore a planning consultation for people behind on bills.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_finance"]} +{"id":"meta-005","input":{"platform":"meta","industry":"finance","headline":"Compare credit card options","body":"Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review"]} +{"id":"meta-006","input":{"platform":"meta","industry":"finance","headline":"Mortgage planning guide","body":"Start a mortgage application with this credit offer and financial products and services checklist before applying.","cta":"Download guide"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review","meta_special_ad_category_review"]} +{"id":"meta-007","input":{"platform":"meta","industry":"health","headline":"Tell us your symptoms before booking","body":"Submit your medical history and upload your diagnosis to find a provider match.","cta":"Start 
intake"},"expected_decision":"high_risk","expected_policy_ids":["meta_private_information_request"]} +{"id":"meta-008","input":{"platform":"meta","industry":"health","headline":"Reproductive health appointment guide","body":"Family planning resources and contraception appointment preparation for adults 18+.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_health_wellness_age_targeting_review","washington_mhmda_indicator"]} {"id":"benchmark-approved-saas-001","input":{"platform":"linkedin","industry":"saas","headline":"Campaign calendar for lean teams","body":"Coordinate briefs, tasks, and launch approvals in one workspace.","cta":"Request demo"},"expected_decision":"approved","expected_policy_ids":[]} {"id":"benchmark-approved-saas-002","input":{"platform":"linkedin","industry":"saas","headline":"Creative QA checklist","body":"Review assets, copy, and routing notes before the next launch.","cta":"Download"},"expected_decision":"approved","expected_policy_ids":[]} {"id":"benchmark-approved-saas-003","input":{"platform":"linkedin","industry":"saas","headline":"Weekly reporting workspace","body":"Bring spend, pacing, and learning notes into a single dashboard.","cta":"See templates"},"expected_decision":"approved","expected_policy_ids":[]} @@ -164,14 +168,23 @@ {"id":"benchmark-meta-approved-002","input":{"platform":"meta","industry":"saas","headline":"Creative testing checklist","body":"Compare messages and asset notes before publishing.","cta":"Download"},"expected_decision":"approved","expected_policy_ids":[]} {"id":"benchmark-meta-approved-003","input":{"platform":"meta","industry":"saas","headline":"Retail planning calendar","body":"Track seasonal promotions, tasks, and owner notes.","cta":"View calendar"},"expected_decision":"approved","expected_policy_ids":[]} {"id":"benchmark-meta-approved-004","input":{"platform":"meta","industry":"saas","headline":"Wellness event schedule","body":"Find voluntary classes and preparation 
reminders.","cta":"Browse events"},"expected_decision":"approved","expected_policy_ids":[]} +{"id":"benchmark-meta-approved-near-miss-001","input":{"platform":"meta","industry":"saas","headline":"Hiring pipeline dashboard","body":"Plan recruiting tasks and approval notes without advertising a specific role.","cta":"Request demo"},"expected_decision":"approved","expected_policy_ids":[]} +{"id":"benchmark-meta-approved-near-miss-002","input":{"platform":"meta","industry":"finance","headline":"Insurance education webinar","body":"Learn how coverage terms work; no quote or application is offered.","cta":"Register"},"expected_decision":"approved","expected_policy_ids":[]} +{"id":"benchmark-meta-approved-near-miss-003","input":{"platform":"meta","industry":"creator","headline":"Creator desk tour","body":"A personal workflow walkthrough with gear notes and editing tips.","cta":"Watch"},"expected_decision":"approved","expected_policy_ids":[]} +{"id":"benchmark-meta-approved-near-miss-004","input":{"platform":"meta","industry":"finance","headline":"Mortgage calculator worksheet","body":"Estimate hypothetical payments for planning; no lender matching is provided.","cta":"Download"},"expected_decision":"approved","expected_policy_ids":[]} {"id":"benchmark-meta-creator-001","input":{"platform":"meta","industry":"creator","headline":"Paid partnership desk setup","body":"Use my affiliate code for the partner product.","cta":"Shop now"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} {"id":"benchmark-meta-creator-002","input":{"platform":"meta","industry":"creator","headline":"Sponsored morning routine","body":"The partner product is featured with a promo code.","cta":"Use code"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} 
{"id":"benchmark-meta-creator-003","input":{"platform":"meta","industry":"creator","headline":"Affiliate creator toolkit","body":"This paid partnership includes setup templates.","cta":"Download"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} {"id":"benchmark-meta-creator-004","input":{"platform":"meta","industry":"creator","headline":"Partner product workflow","body":"Sponsored tips include affiliate resources.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} +{"id":"benchmark-meta-review-001","input":{"platform":"meta","industry":"finance","headline":"Compare credit card options","body":"Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review"]} +{"id":"benchmark-meta-review-002","input":{"platform":"meta","industry":"finance","headline":"Mortgage planning guide","body":"Start a mortgage application with this credit offer and financial products and services checklist before applying.","cta":"Download guide"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review","meta_special_ad_category_review"]} +{"id":"benchmark-meta-review-003","input":{"platform":"meta","industry":"health","headline":"Reproductive health appointment guide","body":"Family planning resources and contraception appointment preparation for adults 18+.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_health_wellness_age_targeting_review","washington_mhmda_indicator"]} +{"id":"benchmark-meta-review-004","input":{"platform":"meta","industry":"saas","headline":"Hiring campaign checklist","body":"Plan compliant job opening creative and employment opportunity audience 
settings.","cta":"Download"},"expected_decision":"needs_review","expected_policy_ids":["meta_special_ad_category_review"]} {"id":"benchmark-meta-high-risk-001","input":{"platform":"meta","industry":"health","headline":"Are you overweight? See a before and after transformation","body":"Our clinically proven supplement helps you lose 20 pounds and melts fat.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_health","meta_health_appearance_results","unsupported_health_claim","weight_loss_claim","before_after_claim"]} {"id":"benchmark-meta-high-risk-002","input":{"platform":"meta","industry":"health","headline":"Tired of your symptoms? Try our transformation plan","body":"A clinical study routine supports dramatic results.","cta":"Start today"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_health","meta_health_appearance_results","unsupported_health_claim","before_after_claim"]} {"id":"benchmark-meta-high-risk-003","input":{"platform":"meta","industry":"finance","headline":"Struggling with debt and bad credit?","body":"Explore a planning consultation for people behind on bills.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_finance"]} {"id":"benchmark-meta-high-risk-004","input":{"platform":"meta","industry":"finance","headline":"Are you broke after monthly bills?","body":"Review neutral budgeting resources for credit planning.","cta":"Get guide"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_finance"]} +{"id":"benchmark-meta-high-risk-005","input":{"platform":"meta","industry":"health","headline":"Tell us your symptoms before booking","body":"Submit your medical history and upload your diagnosis to find a provider match.","cta":"Start intake"},"expected_decision":"high_risk","expected_policy_ids":["meta_private_information_request"]} 
{"id":"benchmark-brand-safety-001","input":{"platform":"google","industry":"general","headline":"Election night sponsorship","body":"Advertise beside political ballot analysis.","cta":"Review inventory"},"expected_decision":"needs_review","expected_policy_ids":["brand_safety_politics"]} {"id":"benchmark-brand-safety-002","input":{"platform":"google","industry":"general","headline":"Political campaign finance report","body":"Sponsor a newsletter about election rules.","cta":"Review inventory"},"expected_decision":"needs_review","expected_policy_ids":["brand_safety_politics"]} {"id":"benchmark-brand-safety-003","input":{"platform":"google","industry":"general","headline":"War update coverage","body":"Inventory around conflict and disaster updates.","cta":"Review inventory"},"expected_decision":"high_risk","expected_policy_ids":["brand_safety_tragedy_conflict"]} diff --git a/evals/datasets/seed_ads.jsonl b/evals/datasets/seed_ads.jsonl index b087340..6313845 100644 --- a/evals/datasets/seed_ads.jsonl +++ b/evals/datasets/seed_ads.jsonl @@ -52,3 +52,7 @@ {"id":"meta-002","input":{"platform":"meta","industry":"health","headline":"Are you overweight? 
See a before and after transformation","body":"Our clinically proven supplement helps you lose 20 pounds and melts fat.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_health","meta_health_appearance_results","unsupported_health_claim","weight_loss_claim","before_after_claim"]} {"id":"meta-003","input":{"platform":"meta","industry":"creator","headline":"Paid partnership desk setup","body":"Use my affiliate code for the partner product featured in this workflow.","cta":"Shop now"},"expected_decision":"needs_review","expected_policy_ids":["meta_branded_content_disclosure","missing_affiliate_or_sponsor_disclosure"]} {"id":"meta-004","input":{"platform":"meta","industry":"finance","headline":"Struggling with debt and bad credit?","body":"Explore a planning consultation for people behind on bills.","cta":"Learn more"},"expected_decision":"high_risk","expected_policy_ids":["meta_personal_attributes_finance"]} +{"id":"meta-005","input":{"platform":"meta","industry":"finance","headline":"Compare credit card options","body":"Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review"]} +{"id":"meta-006","input":{"platform":"meta","industry":"finance","headline":"Mortgage planning guide","body":"Start a mortgage application with this credit offer and financial products and services checklist before applying.","cta":"Download guide"},"expected_decision":"needs_review","expected_policy_ids":["meta_financial_services_authorization_review","meta_special_ad_category_review"]} +{"id":"meta-007","input":{"platform":"meta","industry":"health","headline":"Tell us your symptoms before booking","body":"Submit your medical history and upload your diagnosis to find a provider match.","cta":"Start 
intake"},"expected_decision":"high_risk","expected_policy_ids":["meta_private_information_request"]} +{"id":"meta-008","input":{"platform":"meta","industry":"health","headline":"Reproductive health appointment guide","body":"Family planning resources and contraception appointment preparation for adults 18+.","cta":"Learn more"},"expected_decision":"needs_review","expected_policy_ids":["meta_health_wellness_age_targeting_review", "washington_mhmda_indicator"]} diff --git a/evals/generate_benchmark_dataset.py b/evals/generate_benchmark_dataset.py index 8ef0bdc..47456d5 100644 --- a/evals/generate_benchmark_dataset.py +++ b/evals/generate_benchmark_dataset.py @@ -8,7 +8,7 @@ ROOT = Path(__file__).resolve().parents[1] SEED_PATH = ROOT / "evals" / "datasets" / "seed_ads.jsonl" OUTPUT_PATH = ROOT / "evals" / "datasets" / "rule_benchmark_v1.jsonl" -TARGET_EXAMPLES = 200 +TARGET_EXAMPLES = 213 def main() -> int: @@ -324,12 +324,68 @@ def _meta_rows() -> list[dict[str, Any]]: ("Retail planning calendar", "Track seasonal promotions, tasks, and owner notes.", "View calendar"), ("Wellness event schedule", "Find voluntary classes and preparation reminders.", "Browse events"), ] + approved_near_miss = [ + ( + "Hiring pipeline dashboard", + "Plan recruiting tasks and approval notes without advertising a specific role.", + "Request demo", + "saas", + ), + ( + "Insurance education webinar", + "Learn how coverage terms work; no quote or application is offered.", + "Register", + "finance", + ), + ( + "Creator desk tour", + "A personal workflow walkthrough with gear notes and editing tips.", + "Watch", + "creator", + ), + ( + "Mortgage calculator worksheet", + "Estimate hypothetical payments for planning; no lender matching is provided.", + "Download", + "finance", + ), + ] creator_review = [ ("Paid partnership desk setup", "Use my affiliate code for the partner product.", "Shop now"), ("Sponsored morning routine", "The partner product is featured with a promo code.", "Use code"), 
("Affiliate creator toolkit", "This paid partnership includes setup templates.", "Download"), ("Partner product workflow", "Sponsored tips include affiliate resources.", "Learn more"), ] + platform_review = [ + ( + "Compare credit card options", + "Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit.", + "Learn more", + ["meta_financial_services_authorization_review"], + "finance", + ), + ( + "Mortgage planning guide", + "Start a mortgage application with this credit offer and financial products and services checklist before applying.", + "Download guide", + ["meta_financial_services_authorization_review", "meta_special_ad_category_review"], + "finance", + ), + ( + "Reproductive health appointment guide", + "Family planning resources and contraception appointment preparation for adults 18+.", + "Learn more", + ["meta_health_wellness_age_targeting_review", "washington_mhmda_indicator"], + "health", + ), + ( + "Hiring campaign checklist", + "Plan compliant job opening creative and employment opportunity audience settings.", + "Download", + ["meta_special_ad_category_review"], + "saas", + ), + ] high_risk = [ ( "Are you overweight? 
See a before and after transformation", @@ -370,9 +426,29 @@ def _meta_rows() -> list[dict[str, Any]]: ["meta_personal_attributes_finance"], "finance", ), + ( + "Tell us your symptoms before booking", + "Submit your medical history and upload your diagnosis to find a provider match.", + "Start intake", + ["meta_private_information_request"], + "health", + ), ] rows = _fixed_rows("benchmark-meta-approved", "meta", "saas", approved, "approved", []) + rows.extend( + _row( + f"benchmark-meta-approved-near-miss-{index:03d}", + "meta", + industry, + headline, + body, + cta, + "approved", + [], + ) + for index, (headline, body, cta, industry) in enumerate(approved_near_miss, start=1) + ) rows.extend( _fixed_rows( "benchmark-meta-creator", @@ -383,6 +459,19 @@ def _meta_rows() -> list[dict[str, Any]]: ["meta_branded_content_disclosure", "missing_affiliate_or_sponsor_disclosure"], ) ) + rows.extend( + _row( + f"benchmark-meta-review-{index:03d}", + "meta", + industry, + headline, + body, + cta, + "needs_review", + expected_policy_ids, + ) + for index, (headline, body, cta, expected_policy_ids, industry) in enumerate(platform_review, start=1) + ) rows.extend( _row( f"benchmark-meta-high-risk-{index:03d}", diff --git a/examples/meta_financial_services_review.json b/examples/meta_financial_services_review.json new file mode 100644 index 0000000..693a94b --- /dev/null +++ b/examples/meta_financial_services_review.json @@ -0,0 +1,9 @@ +{ + "platform": "meta", + "country": "US", + "industry": "finance", + "headline": "Compare credit card options", + "body": "Review eligibility, terms, disclosures, and 18+ targeting before you apply for credit.", + "cta": "Learn more", + "policy_modules": ["platform", "privacy"] +} diff --git a/examples/meta_private_health_info_high_risk.json b/examples/meta_private_health_info_high_risk.json new file mode 100644 index 0000000..d9944a9 --- /dev/null +++ b/examples/meta_private_health_info_high_risk.json @@ -0,0 +1,9 @@ +{ + "platform": "meta", + 
"country": "US", + "industry": "health", + "headline": "Tell us your symptoms before booking", + "body": "Submit your medical history and upload your diagnosis to find a provider match.", + "cta": "Start intake", + "policy_modules": ["platform", "privacy", "health_claims"] +} diff --git a/examples/meta_special_ad_category_review.json b/examples/meta_special_ad_category_review.json new file mode 100644 index 0000000..64d54df --- /dev/null +++ b/examples/meta_special_ad_category_review.json @@ -0,0 +1,9 @@ +{ + "platform": "meta", + "country": "US", + "industry": "finance", + "headline": "Mortgage planning guide", + "body": "Start a mortgage application with this credit offer and financial products and services checklist before applying.", + "cta": "Download guide", + "policy_modules": ["platform"] +} diff --git a/tests/test_engine.py b/tests/test_engine.py index 6c1fd00..575c47b 100644 --- a/tests/test_engine.py +++ b/tests/test_engine.py @@ -9,6 +9,71 @@ def policy_ids(result) -> set[str]: return {hit.policy_id for hit in result.policy_hits} + +def test_meta_financial_education_copy_avoids_authorization_overtrigger() -> None: + result = analyze( + { + "platform": "meta", + "industry": "finance", + "headline": "Loan planning education webinar", + "body": "Learn budgeting basics and how loan terms work. No application or quote is offered.", + "cta": "Register", + } + ) + + assert "meta_financial_services_authorization_review" not in policy_ids(result) + assert "meta_special_ad_category_review" not in policy_ids(result) + + +def test_meta_landing_page_mismatch_stacks_with_platform_review() -> None: + result = analyze( + { + "platform": "meta", + "industry": "finance", + "headline": "Credit card application discount", + "body": "Compare an apply for credit offer before launch.", + "cta": "Apply", + "landing_page_html": "

General budgeting newsletter

Weekly savings tips.

", + } + ) + + ids = policy_ids(result) + assert "meta_financial_services_authorization_review" in ids + assert "landing_page_offer_mismatch" in ids + assert result.decision == "needs_review" + + + + +def test_meta_special_ad_category_review_applies_outside_default_verticals() -> None: + result = analyze( + { + "platform": "meta", + "industry": "health", + "headline": "Hiring now for clinic coordinators", + "body": "Apply for this role supporting patient scheduling teams.", + "cta": "Apply", + } + ) + + assert "meta_special_ad_category_review" in policy_ids(result) + + +def test_meta_private_information_request_applies_to_general_campaigns() -> None: + result = analyze( + { + "platform": "meta", + "industry": "general", + "headline": "Check eligibility in minutes", + "body": "Enter your credit score to personalize your results.", + "cta": "Start", + } + ) + + assert result.decision == "high_risk" + assert "meta_private_information_request" in policy_ids(result) + + def test_high_risk_health_claims_and_tiktok_policy() -> None: result = analyze( { diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py index 316bbd0..da93742 100644 --- a/tests/test_eval_runner.py +++ b/tests/test_eval_runner.py @@ -701,7 +701,7 @@ def test_benchmark_dataset_is_reproducible_labeled_and_large_enough() -> None: dataset_path = ROOT / "evals" / "datasets" / "rule_benchmark_v1.jsonl" dataset_rows = run_eval._load_rows(dataset_path) - assert len(dataset_rows) == 200 + assert len(dataset_rows) == 213 assert len(dataset_rows) == len({row["id"] for row in dataset_rows}) assert {row["expected_decision"] for row in dataset_rows} == {"approved", "needs_review", "high_risk"} assert dataset_rows == generated_rows @@ -723,6 +723,10 @@ def test_benchmark_dataset_includes_meta_policy_rows() -> None: "meta_personal_attributes_health", "meta_personal_attributes_finance", "meta_health_appearance_results", + "meta_health_wellness_age_targeting_review", + 
"meta_financial_services_authorization_review", + "meta_special_ad_category_review", + "meta_private_information_request", "meta_branded_content_disclosure", } @@ -736,7 +740,7 @@ def test_benchmark_dataset_reports_policy_and_category_precision_recall() -> Non max_review_notes=100, ) - assert metrics["total_examples"] == 200 + assert metrics["total_examples"] == 213 assert metrics["decision_accuracy"] == 1.0 assert {"approved", "needs_review", "high_risk"} <= metrics["confusion_matrix"].keys() assert "health_claims" in metrics["category_metrics"] diff --git a/tests/test_examples.py b/tests/test_examples.py index 8fde460..2656342 100644 --- a/tests/test_examples.py +++ b/tests/test_examples.py @@ -27,12 +27,25 @@ "meta_health_appearance_results", "unsupported_health_claim", "weight_loss_claim", + "before_after_claim", }, ), "meta_needs_review_creator.json": ( "needs_review", {"meta_branded_content_disclosure", "missing_affiliate_or_sponsor_disclosure"}, ), + "meta_financial_services_review.json": ( + "needs_review", + {"meta_financial_services_authorization_review"}, + ), + "meta_private_health_info_high_risk.json": ( + "high_risk", + {"meta_private_information_request"}, + ), + "meta_special_ad_category_review.json": ( + "needs_review", + {"meta_financial_services_authorization_review", "meta_special_ad_category_review"}, + ), "needs_review_google_wellness.json": ( "needs_review", {"tracking_pixel_risk", "health_form_tracking_risk"}, @@ -58,3 +71,15 @@ def test_all_documented_json_examples_have_expectations() -> None: example_names = {path.name for path in Path("examples").glob("*.json")} assert example_names == set(EXAMPLE_EXPECTATIONS) + + +def test_meta_examples_do_not_emit_unexpected_policy_ids() -> None: + meta_examples = [path for path in Path("examples").glob("meta_*.json")] + + assert meta_examples + for example_path in meta_examples: + _, expected_policy_ids = EXAMPLE_EXPECTATIONS[example_path.name] + result = analyze(load_config(example_path)) + 
actual_policy_ids = {hit.policy_id for hit in result.policy_hits} + + assert actual_policy_ids == expected_policy_ids, example_path.name diff --git a/tests/test_policy.py b/tests/test_policy.py index 7fa4db1..7ef7cba 100644 --- a/tests/test_policy.py +++ b/tests/test_policy.py @@ -74,11 +74,15 @@ def test_filter_policies_applies_platform_and_industry_filters(tmp_path) -> None assert filter_policies(policies, wrong_industry) == [] -def test_bundled_meta_ads_policy_module_is_narrow_and_platform_scoped() -> None: +def test_bundled_meta_ads_policy_module_is_platform_scoped() -> None: meta_policy_ids = { "meta_personal_attributes_health", "meta_personal_attributes_finance", "meta_health_appearance_results", + "meta_health_wellness_age_targeting_review", + "meta_financial_services_authorization_review", + "meta_special_ad_category_review", + "meta_private_information_request", "meta_branded_content_disclosure", } @@ -91,3 +95,10 @@ def test_bundled_meta_ads_policy_module_is_narrow_and_platform_scoped() -> None: assert policy.modules == ("platform",) assert policy.platforms == ("meta",) assert policy.signals + + +def test_meta_cross_vertical_rules_are_not_industry_gated() -> None: + policies = {policy.id: policy for policy in load_policies()} + + assert policies["meta_special_ad_category_review"].industries == () + assert policies["meta_private_information_request"].industries == ()