ShreyasSar26 · Jun 25, 2026
diff --git a/agent-readiness-eval/01-methodology.md b/agent-readiness-eval/01-methodology.md
@@ -0,0 +1,93 @@
+# 01 — Methodology
+
+## Goal
+
+Validate that the restructured, task-based SharePoint Embedded (SPE) documentation is
+**agent-ready**: when a user delegates an SPE question to a coding agent, the agent
+should retrieve the answer from the docs alone, answer correctly, and cite the right
+task article, without relying on the model's pre-training.
+
+## Two complementary evaluations
+
+1. **Structural scoring (deterministic).** A Python scorer grades every task article on
+   four dimensions (structure, correctness signals, agent-readiness, task framing).
+   This catches missing metadata, oversized articles, missing navigation, weak titles,
+   and broken cross-links. See [`02-structural-scoring.md`](02-structural-scoring.md).
+
+2. **Empirical Q&A (behavioral).** Real questions are answered by **doc-restricted
+   retrieval agents** that may only read files under `docs/embedded`. The agents
+   simulate a coding agent doing keyword retrieval: `grep`/`glob` to find a candidate
+   article, read it, then answer. Outside/world knowledge is prohibited, and the agent
+   must say "NOT IN DOCS" when the answer genuinely isn't present. See
+   [`03-qa-round1-21-queries.md`](03-qa-round1-21-queries.md) and
+   [`04-qa-stress-test-54-queries.md`](04-qa-stress-test-54-queries.md).
+
+## Why doc-restricted agents
+
+A general model can answer many SPE questions from pre-training, which would hide gaps
+in the docs. Restricting the agent to `docs/embedded` forces every answer to be
+*grounded in the documentation*, so a wrong or "NOT IN DOCS" result is a true signal of
+a documentation gap, not a model limitation.
+
+## Agent output contract
+
+Every Q&A agent returns one structured block per question:
+
+```
+QID: <id>
+ANSWER: <concise answer from docs, or "NOT IN DOCS">
+CITED: <relative path(s) under docs/embedded, or NONE>
+FINDABILITY: <EASY | HARD | NONE>   # how easily keyword search located it
+SUFFICIENCY: <COMPLETE | PARTIAL | MISSING>
+GAP: <what's missing or hard to find, or NONE>
+```
+
+`FINDABILITY` and `SUFFICIENCY` separate two distinct failure modes: the content can
+exist but be **hard to find** (retrieval problem) or be **findable but incomplete**
+(content problem). Both are actionable.
+
+## Grading scale
+
+| Score | Meaning |
+| --- | --- |
+| 3 | Complete + correct, cited the intended task article, easy to find |
+| 2 | Correct but partial, hard to find, or cited a legacy/secondary file |
+| 1 | Answer missing or materially incomplete (real content gap) |
+| 0 | Wrong / hallucinated answer |
+
+**Negative probes.** Some queries deliberately ask for facts not in the doc set (for
+example, SLA, region availability, customer-managed keys). These test *honesty*: an
+agent that answers "NOT IN DOCS" scores **3**; one that invents an answer scores **0**.
+
+## Loop
+
+1. Build a query bank with persona, question, expected file, and ground truth.
+2. Split queries into groups; run one doc-restricted agent per group in parallel.
+3. Grade each answer 0–3; record `FINDABILITY`/`SUFFICIENCY`/`GAP`.
+4. For every score < 3 that is a real gap, edit the doc to close it.
+5. Re-run the structural scorer and link check to confirm no regression.
+6. Re-test the fixed queries with a fresh agent.
+7. Commit and record results here.
+
+## Tooling
+
+- **Retrieval agents:** fast explore agents, scoped by prompt to `docs/embedded` only.
+- **Structural scorer:** `score_docs.py` (4 × 25-point rubric). Kept with the session
+  artifacts; summarized in [`02-structural-scoring.md`](02-structural-scoring.md).
+- **Query bank + grades:** stored in a SQLite session database (`qa` for the 21-query
+  campaign, `qa2` for the 54-query stress test) and exported into the records here.
+- **Link integrity:** a small Python pass over every relative Markdown link
+  (`[..](../x.md)` and `./x.md`) in `docs/embedded`, checked against the filesystem.
+
+## Reproducing
+
+```powershell
+# Structural score (from repo root)
+python <path-to>/score_docs.py
+
+# Relative-link integrity over docs/embedded
+# (walks every ../*.md and ./*.md link and checks it resolves on disk)
+```
+
+Q&A campaigns are reproduced by re-issuing the prompts in files 03 and 04 to a
+retrieval agent restricted to `docs/embedded`, then grading with the scale above.
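The agent output contract in `01-methodology.md` is regular enough to validate mechanically before grading. The sketch below is one possible way to do that, not the tooling used in this PR: the field names and the allowed `FINDABILITY`/`SUFFICIENCY` values come from the contract, while `parse_block`, `FIELDS`, and `ALLOWED` are hypothetical names introduced only for illustration.

```python
# Field names and enumerated values are taken from the output contract;
# the parsing approach and all identifiers here are assumptions.
FIELDS = ("QID", "ANSWER", "CITED", "FINDABILITY", "SUFFICIENCY", "GAP")
ALLOWED = {
    "FINDABILITY": {"EASY", "HARD", "NONE"},
    "SUFFICIENCY": {"COMPLETE", "PARTIAL", "MISSING"},
}


def parse_block(text: str) -> dict:
    """Parse one 'KEY: value' block and validate the enumerated fields."""
    record = {}
    current = None
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            # A recognized field starts a new entry.
            current = key.strip()
            record[current] = value.strip()
        elif current is not None:
            # Anything else is treated as a continuation of the previous field.
            record[current] += " " + line.strip()

    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    for name, allowed in ALLOWED.items():
        # Drop a trailing '# comment' like the one shown after FINDABILITY.
        record[name] = record[name].split("#", 1)[0].strip()
        if record[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return record
```

Rejecting a block with a missing field or an out-of-range value keeps a malformed agent reply from being graded as if it were a real "NOT IN DOCS" result.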
diff --git a/agent-readiness-eval/02-structural-scoring.md b/agent-readiness-eval/02-structural-scoring.md
@@ -0,0 +1,72 @@
+# 02 — Structural scoring
+
+A deterministic Python scorer grades every task article in `docs/embedded` on four
+25-point dimensions. It is the fast, repeatable gate that runs before and after every
+content change to prevent regressions.
+
+## Rubric (4 × 25 = 100)
+
+| Dimension | What it checks (25 pts each) |
+| --- | --- |
+| **Structure** | Single H1; YAML front matter with `title`/`description`/`ms.date`; heading hierarchy is well-formed; article is right-sized for a context window (not oversized). |
+| **Correctness signals** | Has an `**Applies to:**` audience line; uses concrete API/cmdlet/permission names; code fenced with a language; intro paragraph present before the first H2. |
+| **Agent-readiness** | Article fits comfortably in a context window; descriptive `description` for retrieval; explicit visible navigation (`## Next steps` and `## Related resources` / `See also`); stable headings usable as anchors. |
+| **Task framing** | Verb-first, task-oriented title (e.g., "Create and manage containers"); imperative steps; one primary task per article. Reference and concept pages (`ms.topic` / `task_type: concept`) are exempt from the verb-first rule. |
+
+### Scorer notes / fixes captured during development
+
+- The `**Applies to:**` audience line is matched with both bold and plain variants.
+- Intro detection must not be defeated by a `DOTALL` regex (a bug that was found and fixed).
+- Cross-link validity does not require a `../` prefix; same-folder `./x.md` links count.
+- Navigation recognizes the variants `Next step`/`Next steps`/`Related`/
+  `Related resources`/`See also`.
+
+## Result
+
+```
+structure   : 25.00 / 25
+correctness : 25.00 / 25
+agent       : 25.00 / 25
+task        : 25.00 / 25
+  TOTAL     : 100.00 / 100
+  Perfect articles: 43/43
+
+=== ARTICLES WITH DEDUCTIONS / ISSUES ===
+  (none)
+```
+
+The scorer was re-run after the stress-test fixes (see file 04); the result remained
+**100.00 / 100, 43/43 perfect**, confirming no regression.
+
+## Articles graded (43)
+
+Spread across the task-based information architecture:
+
+| Section | Count | Examples |
+| --- | --- | --- |
+| Overview | 3 | `overview.md`, `scenarios-and-use-cases.md`, `whats-new.md` |
+| Plan | 7 | `app-tenant-architecture.md`, `choose-app-model.md`, `authentication-permissions.md`, `limits-calling-patterns.md` |
+| Build | 16 | `quickstart-vscode.md`, `create-container-type.md`, `manage-files.md`, `open-office-files.md`, `search-containers-files.md`, `agent-experiences.md` |
+| Publish | 4 | `prepare-customer-installation.md`, `validate-customer-installation.md` |
+| Install & manage (admin) | 10 | `admin-overview.md`, `install-sharepoint-embedded-app.md`, `manage-containers-powershell.md`, `review-audit-events.md` |
+| Reference | 6 | `billing-meters.md`, `audit-events.md`, `troubleshooting.md`, `glossary.md` |
+
+## Link integrity
+
+A separate pass walks every relative Markdown link (`[..](../x.md)` and `./x.md`) under
+`docs/embedded` and resolves it against the filesystem.
+
+```
+relative md links checked = 297   broken = 0
+```
+
+Entry-link integrity was also verified separately:
+
+- `docs/embedded/overview.md` keeps H1 **"What is SharePoint Embedded?"**, so the
+  published `/sharepoint/dev/embedded/overview` URL and its `#what-is-sharepoint-embedded`
+  anchor do not break.
+- `docs/index.yml` (the SharePoint-dev landing hub) — both SPE entry cards resolve
+  (Overview → `/embedded/overview`, quickstart → `/embedded/build/quickstart-vscode`).
+  A pre-existing broken "Enable SharePoint Embedded" card was repointed to the existing
+  quickstart.
+- `docs/toc.yml` — 46/46 SPE hrefs resolve.
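The link-integrity pass is described in `01-methodology.md` and `02-structural-scoring.md`, but the script itself is not part of this diff. A minimal sketch of such a pass, assuming inline `[text](target)` links only and treating any target without a scheme or leading `/` as relative, might look like this. `LINK_RE`, `relative_md_targets`, and `check_links` are hypothetical names, not the author's code.

```python
import re
from pathlib import Path

# Captures the target of an inline Markdown link: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def relative_md_targets(markdown: str) -> list[str]:
    """Return relative link targets that point at .md files, anchors removed."""
    targets = []
    for target in LINK_RE.findall(markdown):
        path = target.split("#", 1)[0]
        if not path.endswith(".md"):
            continue  # images, bare anchors, non-Markdown files
        if "://" in path or path.startswith("/"):
            continue  # absolute URL or site-rooted path, not a relative link
        targets.append(path)
    return targets


def check_links(root: Path) -> tuple[int, list[tuple[Path, str]]]:
    """Resolve every relative .md link under root against the filesystem."""
    checked = 0
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in relative_md_targets(text):
            checked += 1
            # Links are resolved relative to the file that contains them.
            if not (md_file.parent / target).resolve().is_file():
                broken.append((md_file, target))
    return checked, broken
```

Calling `check_links(Path("docs/embedded"))` from the repo root returns the number of links checked and a list of `(file, target)` pairs that did not resolve, which is enough to print a summary line in the same shape as the one reported in the Link integrity section.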
diff --git a/agent-readiness-eval/03-qa-round1-21-queries.md b/agent-readiness-eval/03-qa-round1-21-queries.md
@@ -0,0 +1,70 @@
+# 03 — Q&A campaign: 21 persona queries
+
+The first empirical campaign tested whether doc-restricted agents could answer realistic
+questions from the core SPE personas (new developer, developer, architect, admin, ISV,
+compliance, billing, and an undecided "router" user). 21 queries were split into 5
+groups and run by parallel doc-restricted agents, then graded 0–3.
+
+## Agent prompt (template)
+
+Each agent received this instruction plus its group's questions:
+
+> You are evaluating whether a documentation set can answer SharePoint Embedded (SPE)
+> questions. You may ONLY read files under `docs/embedded`. Do NOT use outside knowledge.
+> Simulate a coding agent doing keyword retrieval: use grep/glob within that folder to
+> FIND the answer, then read the matching article. If the docs genuinely do not contain
+> the answer, say so honestly — do not invent it.
+>
+> For EACH question output one block:
+> `QID / ANSWER / CITED / FINDABILITY / SUFFICIENCY / GAP`.
+
+## Queries, expected article, and grades
+
+| QID | Persona | Question | Intended article | R1 | R2 |
+| --- | --- | --- | --- | --- | --- |
+| q01 | dev-new | I am brand new to SPE. How do I build my first app? | `build/quickstart-vscode.md` | 3 | 3 |
+| q02 | dev | How do I let my app users open and edit Word/Office files from my app? | `build/open-office-files.md` | 3 | 3 |
+| q03 | dev | What permission lets my app read content in containers and how do I grant it? | `build/register-application-permissions.md` | 3 | 3 |
+| q04 | dev | How do I get notified when a file changes in a container? | `build/respond-to-changes-webhooks.md` | 3 | 3 |
+| q05 | dev | How do I create a container type and what is its relationship to my app? | `build/create-container-type.md` | 3 | 3 |
+| q06 | dev | How do I search across files in my containers? | `build/search-containers-files.md` | 3 | 3 |
+| q07 | dev | How do I archive an inactive container and reactivate it later? | `build/archive-restore-containers.md` | 3 | 3 |
+| q08 | dev | How do I add custom metadata columns to containers and query them? | `build/container-metadata.md` | 2 | 3 |
+| q09 | architect | Single-tenant vs multitenant: which app model should I choose? | `plan/choose-app-model.md` | 3 | 3 |
+| q10 | architect | Who pays for storage, me or my customer? What billing models exist? | `plan/choose-billing-model.md` | 3 | 3 |
+| q11 | architect | What size and throughput limits should I design around? | `plan/limits-calling-patterns.md` | 2 | 3 |
+| q12 | admin | As a tenant admin, how do I install a vendor SPE app in my tenant? | `admin/install-sharepoint-embedded-app.md` | 3 | 3 |
+| q13 | admin | How do I set up billing for an SPE app in my tenant? | `admin/setup-billing-m365-admin-center.md` | 3 | 3 |
+| q14 | admin | How do I manage SPE containers with PowerShell? | `admin/manage-containers-powershell.md` | 3 | 3 |
+| q15 | admin | What admin role do I need to manage SharePoint Embedded? | `admin/admin-overview.md` | 2 | 3 |
+| q16 | compliance | How do I find audit events for SPE container type activity? | `admin/review-audit-events.md` | 3 | 3 |
+| q17 | isv | I built an app. How do I prepare it for a customer to install? | `publish/prepare-customer-installation.md` | 3 | 3 |
+| q18 | billing | What does SPE charge for? What are the billing meters? | `reference/billing-meters.md` | 3 | 3 |
+| q19 | compliance | How do I apply a sensitivity label or block download on a container? | `admin/apply-security-compliance-controls.md` | 3 | 3 |
+| q20 | dev | I get access denied / SubscriptionNotRegistered. How do I fix it? | `reference/troubleshooting.md` | 3 | 3 |
+| q21 | router | I do not know if I am a developer or admin. Where do I start? | `overview.md` | 2 | 3 |
+
+## Round 1 result
+
+- **21/21 answered correctly**, overall quality ≈ **93.7%**.
+- Four weaknesses surfaced (each affected query scored 2, "correct but partial / mis-cited"):
+
+| # | QID(s) | Weakness | Decision |
+| --- | --- | --- | --- |
+| A | several | Legacy folders (`development/`, `administration/`, `getting-started/`, `compliance/`) compete in keyword search and were sometimes cited instead of the new task article. | **Benign** — answers were still correct; legacy files are intentionally retained as deep-dive sources. Not deleted (out of scope, risky). |
+| B | q21 | Router page lacked an explicit developer-vs-admin role identifier. | **Fixed** — added a "Not sure which path?" tip that splits on "do you write code, or manage a tenant?" |
+| C | q08 | OWSTEXT custom-property search syntax lived only in the search article; the metadata article didn't cross-link to it. | **Fixed** — added a cross-link from `container-metadata.md` to `search-containers-files.md`. |
+| D | q11 | The word "throughput" was absent and the limit-increase process was undocumented. | **Fixed** — added throughput/resource-unit wording and limit-increase guidance to `limits-calling-patterns.md`. |
+
+## Fixes (committed)
+
+The B/C/D fixes were committed as `1288ec1c7`:
+
+- `docs/embedded/overview.md` — dev-vs-admin tip box before the routing table.
+- `docs/embedded/build/container-metadata.md` — cross-link to free-text/OWSTEXT search.
+- `docs/embedded/plan/limits-calling-patterns.md` — throughput + limit-increase guidance.
+
+## Round 2 result
+
+Re-test of the four weak queries (q08, q11, q15, q21) confirmed every gap closed:
+**21/21 full score, 100%.**
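The Round 1 figure of ≈ 93.7% in `03-qa-round1-21-queries.md` can be reproduced from the R1 column of the query table, assuming "overall quality" means total points divided by the maximum possible (3 per query). The file does not state that definition, so treat the formula below as an assumption that happens to match the reported number. `quality` and `weak_queries` are hypothetical helper names.

```python
# R1 grades for q01..q21, copied in order from the query table.
R1_SCORES = [3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2]
R1 = {f"q{i:02d}": score for i, score in enumerate(R1_SCORES, start=1)}

MAX_GRADE = 3  # top of the 0-3 grading scale


def quality(grades: dict) -> float:
    """Percentage of available points earned (assumed definition)."""
    return 100 * sum(grades.values()) / (MAX_GRADE * len(grades))


def weak_queries(grades: dict) -> list:
    """Query IDs that scored below the maximum and need a re-test."""
    return sorted(qid for qid, grade in grades.items() if grade < MAX_GRADE)


print(round(quality(R1), 1))  # 59 of 63 points -> 93.7
print(weak_queries(R1))       # ['q08', 'q11', 'q15', 'q21']
```

The four IDs returned by `weak_queries` are the same four queries listed for the Round 2 re-test.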