93 changes: 93 additions & 0 deletions agent-readiness-eval/01-methodology.md
@@ -0,0 +1,93 @@
# 01 — Methodology

## Goal

Validate that the restructured, task-based SharePoint Embedded (SPE) documentation is
**agent-ready**: a coding agent that a user delegates an SPE question to should be able
to retrieve the answer from the docs alone, answer correctly, and cite the right task
article — without relying on the model's pre-training.

## Two complementary evaluations

1. **Structural scoring (deterministic).** A Python scorer grades every task article on
four dimensions (structure, correctness signals, agent-readiness, task framing).
This catches missing metadata, oversized articles, missing navigation, weak titles,
and broken cross-links. See [`02-structural-scoring.md`](02-structural-scoring.md).

2. **Empirical Q&A (behavioral).** Real questions are answered by **doc-restricted
retrieval agents** that may only read files under `docs/embedded`. The agents
simulate a coding agent doing keyword retrieval: `grep`/`glob` to find a candidate
article, read it, then answer. Outside knowledge is prohibited, and the agent
must say "NOT IN DOCS" when the answer genuinely isn't present. See
[`03-qa-round1-21-queries.md`](03-qa-round1-21-queries.md) and
[`04-qa-stress-test-54-queries.md`](04-qa-stress-test-54-queries.md).

## Why doc-restricted agents

A general model can answer many SPE questions from pre-training, which would hide gaps
in the docs. Restricting the agent to `docs/embedded` forces every answer to be
*grounded in the documentation*, so a wrong or "NOT IN DOCS" result points to a gap in
the docs (missing or hard-to-find content), not to a gap in the model's knowledge.

## Agent output contract

Every Q&A agent returns one structured block per question:

```
QID: <id>
ANSWER: <concise answer from docs, or "NOT IN DOCS">
CITED: <relative path(s) under docs/embedded, or NONE>
FINDABILITY: <EASY | HARD | NONE> # how easily keyword search located it
SUFFICIENCY: <COMPLETE | PARTIAL | MISSING>
GAP: <what's missing or hard to find, or NONE>
```

`FINDABILITY` and `SUFFICIENCY` separate two distinct failure modes: the content can
exist but be **hard to find** (retrieval problem) or be **findable but incomplete**
(content problem). Both are actionable.
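
A minimal sketch of parsing and validating one block under this contract, assuming each field starts on its own line as `NAME: value` and that the agent emits no trailing comments. The names `parse_block`, `FIELDS`, and `ALLOWED` are hypothetical and are not part of the session tooling.

```python
import re

# Field names and allowed values are taken from the output contract above.
FIELDS = ("QID", "ANSWER", "CITED", "FINDABILITY", "SUFFICIENCY", "GAP")
ALLOWED = {
    "FINDABILITY": {"EASY", "HARD", "NONE"},
    "SUFFICIENCY": {"COMPLETE", "PARTIAL", "MISSING"},
}


def parse_block(text: str) -> dict[str, str]:
    """Parse one agent output block into a dict keyed by field name."""
    result: dict[str, str] = {}
    current = None
    for line in text.strip().splitlines():
        match = re.match(r"^([A-Z]+):\s*(.*)$", line)
        if match and match.group(1) in FIELDS:
            current = match.group(1)
            result[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Treat anything else as a continuation of the previous field.
            result[current] += " " + line.strip()

    missing = [name for name in FIELDS if name not in result]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name, allowed in ALLOWED.items():
        if result[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return result
```

Rejecting malformed blocks early keeps a formatting slip by the agent from being mistaken for a documentation gap.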

## Grading scale

| Score | Meaning |
| --- | --- |
| 3 | Complete + correct, cited the intended task article, easy to find |
| 2 | Correct but partial, hard to find, or cited a legacy/secondary file |
| 1 | Answer missing or materially incomplete (real content gap) |
| 0 | Wrong / hallucinated answer |

**Negative probes.** Some queries deliberately ask for facts not in the doc set (for
example, SLA, region availability, customer-managed keys). These test *honesty*: an
agent that answers "NOT IN DOCS" scores **3**; one that invents an answer scores **0**.
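
One possible reading of the scale and the negative-probe rule, written as a function. This is a sketch only: the records do not say that grading was automated, and the `grade` function, its parameters, and the order of its checks are assumptions made for illustration.

```python
def grade(
    correct: bool,
    sufficiency: str,
    findability: str,
    cited_intended: bool,
    negative_probe: bool = False,
    said_not_in_docs: bool = False,
) -> int:
    """Map one judged answer to the 0-3 scale (hypothetical helper)."""
    if negative_probe:
        # Honesty test: admitting the fact is absent earns full marks.
        return 3 if said_not_in_docs else 0
    if said_not_in_docs or sufficiency == "MISSING":
        return 1  # answer missing: a real content gap
    if not correct:
        return 0  # wrong or hallucinated
    if sufficiency == "PARTIAL" or findability == "HARD" or not cited_intended:
        return 2  # correct, but partial, hard to find, or secondary citation
    return 3
```

The `correct` and `cited_intended` inputs still come from comparing the answer with the ground truth and expected file in the query bank, so the judgment itself is not replaced.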

## Loop

1. Build a query bank with persona, question, expected file, and ground truth.
2. Split queries into groups; run one doc-restricted agent per group in parallel.
3. Grade each answer 0–3; record `FINDABILITY`/`SUFFICIENCY`/`GAP`.
4. For every score < 3 that is a real gap, edit the doc to close it.
5. Re-run the structural scorer and link check to confirm no regression.
6. Re-test the fixed queries with a fresh agent.
7. Commit and record results here.
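
Step 2 of the loop could be sketched as follows. The records do not say how queries were assigned to groups, so the round-robin split is an assumption, and `run_agent` is a hypothetical stand-in for whatever launches a doc-restricted agent on one group.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def split_into_groups(queries: list[str], n_groups: int) -> list[list[str]]:
    """Round-robin split so group sizes differ by at most one."""
    return [queries[i::n_groups] for i in range(n_groups)]


def run_campaign(
    queries: list[str],
    n_groups: int,
    run_agent: Callable[[list[str]], list[str]],
) -> list[str]:
    """Run one agent per group in parallel and flatten the results.

    run_agent is a hypothetical callable: it receives one group of
    questions and returns one output block per question.
    """
    groups = split_into_groups(queries, n_groups)
    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        per_group = list(pool.map(run_agent, groups))
    return [block for blocks in per_group for block in blocks]
```

With 21 queries and 5 groups, this split yields one group of five and four groups of four.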

## Tooling

- **Retrieval agents:** fast explore agents, scoped by prompt to `docs/embedded` only.
- **Structural scorer:** `score_docs.py` (4 × 25-point rubric). Kept with the session
artifacts; summarized in [`02-structural-scoring.md`](02-structural-scoring.md).
- **Query bank + grades:** stored in a SQLite session database (`qa` for the 21-query
campaign, `qa2` for the 54-query stress test) and exported into the records here.
- **Link integrity:** a small Python pass over every `[..](../x.md)` relative link in
`docs/embedded`, checked against the filesystem.

## Reproducing

```powershell
# Structural score (from repo root)
python <path-to>/score_docs.py

# Relative-link integrity over docs/embedded
# (walks every ../*.md and ./*.md link and checks it resolves on disk)
```

Q&A campaigns are reproduced by re-issuing the prompts in files 03 and 04 to a
retrieval agent restricted to `docs/embedded`, then grading with the scale above.
72 changes: 72 additions & 0 deletions agent-readiness-eval/02-structural-scoring.md
@@ -0,0 +1,72 @@
# 02 — Structural scoring

A deterministic Python scorer grades every task article in `docs/embedded` on four
25-point dimensions. It is the fast, repeatable gate that runs before and after every
content change to prevent regressions.

## Rubric (4 × 25 = 100)

| Dimension | What it checks (25 pts each) |
| --- | --- |
| **Structure** | Single H1; YAML front matter with `title`/`description`/`ms.date`; heading hierarchy is well-formed; article is right-sized for a context window (not oversized). |
| **Correctness signals** | Has an `**Applies to:**` audience line; uses concrete API/cmdlet/permission names; code fenced with a language; intro paragraph present before the first H2. |
| **Agent-readiness** | Article fits comfortably in a context window; descriptive `description` for retrieval; explicit visible navigation (`## Next steps` and `## Related resources` / `See also`); stable headings usable as anchors. |
| **Task framing** | Verb-first, task-oriented title (e.g., "Create and manage containers"); imperative steps; one primary task per article. Reference and concept pages (`ms.topic` / `task_type: concept`) are exempt from the verb-first rule. |

### Scorer notes / fixes captured during development

- The `**Applies to:**` audience line is matched with both bold and plain variants.
- Intro detection must not be defeated by a `DOTALL` regex, where `.` also matches newlines (this bug was found and fixed).
- Cross-link validity does not require a `../` prefix; same-folder `./x.md` links count.
- Navigation recognizes the variants `Next step`/`Next steps`/`Related`/
`Related resources`/`See also`.
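
The checks in these notes could look roughly like the sketch below. It is not the code in `score_docs.py`: the pattern names and `has_intro` are hypothetical, and the line-by-line intro scan is just one way to avoid the class of bug where a `DOTALL` pattern matches across headings.

```python
import re

# Matches "**Applies to:**" (bold) and "Applies to:" (plain) at line start.
APPLIES_TO = re.compile(
    r"^\s*(?:\*\*)?Applies to:(?:\*\*)?", re.IGNORECASE | re.MULTILINE
)

# Matches the navigation heading variants listed above, as H2 headings.
NAV_HEADING = re.compile(
    r"^##\s+(Next steps?|Related(?: resources)?|See also)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def has_intro(markdown: str) -> bool:
    """Return True if non-blank text appears between the H1 and the first H2.

    Scanning line by line means no pattern can span a heading boundary.
    Assumes the front matter contains no line starting with "# ".
    """
    seen_h1 = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            return False  # reached the first H2 without finding intro text
        if stripped.startswith("# "):
            seen_h1 = True
            continue
        if seen_h1 and stripped:
            return True
    return False
```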

## Result

```
structure : 25.00 / 25
correctness : 25.00 / 25
agent : 25.00 / 25
task : 25.00 / 25
TOTAL : 100.00 / 100
Perfect articles: 43/43

=== ARTICLES WITH DEDUCTIONS / ISSUES ===
(none)
```

The scorer was re-run after the stress-test fixes (see file 04) and the result remained
**100.00 / 100, 43/43 perfect**, confirming no regression.

## Articles graded (43)

Spread across the task-based information architecture:

| Section | Count | Examples |
| --- | --- | --- |
| Overview | 3 | `overview.md`, `scenarios-and-use-cases.md`, `whats-new.md` |
| Plan | 7 | `app-tenant-architecture.md`, `choose-app-model.md`, `authentication-permissions.md`, `limits-calling-patterns.md` |
| Build | 16 | `quickstart-vscode.md`, `create-container-type.md`, `manage-files.md`, `open-office-files.md`, `search-containers-files.md`, `agent-experiences.md` |
| Publish | 4 | `prepare-customer-installation.md`, `validate-customer-installation.md` |
| Install & manage (admin) | 10 | `admin-overview.md`, `install-sharepoint-embedded-app.md`, `manage-containers-powershell.md`, `review-audit-events.md` |
| Reference | 6 | `billing-meters.md`, `audit-events.md`, `troubleshooting.md`, `glossary.md` |

## Link integrity

A separate pass walks every relative Markdown link (`[..](../x.md)` and `./x.md`) under
`docs/embedded` and resolves it against the filesystem.

```
relative md links checked = 297 broken = 0
```
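
A pass like this could be sketched as below, assuming only inline `[text](target)` links to `.md` files need checking. `broken_links` and the `LINK` pattern are hypothetical names, not the script used to produce the count above.

```python
import re
from pathlib import Path

# Inline Markdown links: captures the target inside the parentheses.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(root: Path) -> list[tuple[Path, str]]:
    """Return (source file, link target) for each relative .md link
    under root that does not resolve to a file on disk."""
    broken = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK.findall(md.read_text(encoding="utf-8")):
            path = target.split("#", 1)[0]  # drop any #anchor
            # Skip external URLs, site-absolute paths, and non-Markdown targets.
            if "://" in path or path.startswith("/") or not path.endswith(".md"):
                continue
            if not (md.parent / path).resolve().exists():
                broken.append((md, target))
    return broken
```

Calling `broken_links(Path("docs/embedded"))` from the repo root would return an empty list when every relative link resolves.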

Entry-link integrity was also verified separately:

- `docs/embedded/overview.md` keeps H1 **"What is SharePoint Embedded?"**, so the
published `/sharepoint/dev/embedded/overview` URL and its `#what-is-sharepoint-embedded`
anchor do not break.
- `docs/index.yml` (the SharePoint-dev landing hub) — both SPE entry cards resolve
(Overview → `/embedded/overview`, quickstart → `/embedded/build/quickstart-vscode`).
A pre-existing broken "Enable SharePoint Embedded" card was repointed to the existing
quickstart.
- `docs/toc.yml` — 46/46 SPE hrefs resolve.
70 changes: 70 additions & 0 deletions agent-readiness-eval/03-qa-round1-21-queries.md
@@ -0,0 +1,70 @@
# 03 — Q&A campaign: 21 persona queries

The first empirical campaign tested whether doc-restricted agents could answer realistic
questions from the core SPE personas (new developer, developer, architect, admin, ISV,
compliance, billing, and an undecided "router" user). 21 queries were split into 5
groups and run by parallel doc-restricted agents, then graded 0–3.

## Agent prompt (template)

Each agent received this instruction plus its group's questions:

> You are evaluating whether a documentation set can answer SharePoint Embedded (SPE)
> questions. You may ONLY read files under `docs/embedded`. Do NOT use outside knowledge.
> Simulate a coding agent doing keyword retrieval: use grep/glob within that folder to
> FIND the answer, then read the matching article. If the docs genuinely do not contain
> the answer, say so honestly — do not invent it.
>
> For EACH question output one block:
> `QID / ANSWER / CITED / FINDABILITY / SUFFICIENCY / GAP`.

## Queries, expected article, and grades

| QID | Persona | Question | Intended article | R1 | R2 |
| --- | --- | --- | --- | --- | --- |
| q01 | dev-new | I am brand new to SPE. How do I build my first app? | `build/quickstart-vscode.md` | 3 | 3 |
| q02 | dev | How do I let my app users open and edit Word/Office files from my app? | `build/open-office-files.md` | 3 | 3 |
| q03 | dev | What permission lets my app read content in containers and how do I grant it? | `build/register-application-permissions.md` | 3 | 3 |
| q04 | dev | How do I get notified when a file changes in a container? | `build/respond-to-changes-webhooks.md` | 3 | 3 |
| q05 | dev | How do I create a container type and what is its relationship to my app? | `build/create-container-type.md` | 3 | 3 |
| q06 | dev | How do I search across files in my containers? | `build/search-containers-files.md` | 3 | 3 |
| q07 | dev | How do I archive an inactive container and reactivate it later? | `build/archive-restore-containers.md` | 3 | 3 |
| q08 | dev | How do I add custom metadata columns to containers and query them? | `build/container-metadata.md` | 2 | 3 |
| q09 | architect | Single-tenant vs multitenant: which app model should I choose? | `plan/choose-app-model.md` | 3 | 3 |
| q10 | architect | Who pays for storage, me or my customer? What billing models exist? | `plan/choose-billing-model.md` | 3 | 3 |
| q11 | architect | What size and throughput limits should I design around? | `plan/limits-calling-patterns.md` | 2 | 3 |
| q12 | admin | As a tenant admin, how do I install a vendor SPE app in my tenant? | `admin/install-sharepoint-embedded-app.md` | 3 | 3 |
| q13 | admin | How do I set up billing for an SPE app in my tenant? | `admin/setup-billing-m365-admin-center.md` | 3 | 3 |
| q14 | admin | How do I manage SPE containers with PowerShell? | `admin/manage-containers-powershell.md` | 3 | 3 |
| q15 | admin | What admin role do I need to manage SharePoint Embedded? | `admin/admin-overview.md` | 2 | 3 |
| q16 | compliance | How do I find audit events for SPE container type activity? | `admin/review-audit-events.md` | 3 | 3 |
| q17 | isv | I built an app. How do I prepare it for a customer to install? | `publish/prepare-customer-installation.md` | 3 | 3 |
| q18 | billing | What does SPE charge for? What are the billing meters? | `reference/billing-meters.md` | 3 | 3 |
| q19 | compliance | How do I apply a sensitivity label or block download on a container? | `admin/apply-security-compliance-controls.md` | 3 | 3 |
| q20 | dev | I get access denied / SubscriptionNotRegistered. How do I fix it? | `reference/troubleshooting.md` | 3 | 3 |
| q21 | router | I do not know if I am a developer or admin. Where do I start? | `overview.md` | 2 | 3 |

## Round 1 result

- **21/21 answered correctly**, overall quality ≈ **93.7%**.
- Four weaknesses surfaced (all scored 2, "correct but partial / mis-cited"):

| # | QID(s) | Weakness | Decision |
| --- | --- | --- | --- |
| A | several | Legacy folders (`development/`, `administration/`, `getting-started/`, `compliance/`) compete in keyword search and were sometimes cited instead of the new task article. | **Benign** — answers were still correct; legacy files are intentionally retained as deep-dive sources. Not deleted (out of scope, risky). |
| B | q21 | Router page lacked an explicit developer-vs-admin role identifier. | **Fixed** — added a "Not sure which path?" tip that splits on "do you write code, or manage a tenant?" |
| C | q08 | OWSTEXT custom-property search syntax lived only in the search article; the metadata article didn't cross-link to it. | **Fixed** — added a cross-link from `container-metadata.md` to `search-containers-files.md`. |
| D | q11 | The word "throughput" was absent and the limit-increase process was undocumented. | **Fixed** — added throughput/resource-unit wording and limit-increase guidance to `limits-calling-patterns.md`. |
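
The ≈ 93.7% figure is consistent with reading overall quality as points earned over points possible. That formula is an assumption (the record does not state it), and `overall_quality` is a hypothetical helper; the scores are the R1 column of the table above, q01 through q21.

```python
# R1 grades for q01..q21, copied from the queries table.
R1 = [3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2]


def overall_quality(scores: list[int], max_score: int = 3) -> float:
    """Percentage of available points earned (assumed definition)."""
    return 100 * sum(scores) / (max_score * len(scores))


# 17 queries at 3 plus 4 queries at 2 gives 59 of 63 points.
print(f"{overall_quality(R1):.1f}%")  # prints 93.7%
```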

## Fixes (committed)

The B/C/D fixes were committed as `1288ec1c7`:

- `docs/embedded/overview.md` — dev-vs-admin tip box before the routing table.
- `docs/embedded/build/container-metadata.md` — cross-link to free-text/OWSTEXT search.
- `docs/embedded/plan/limits-calling-patterns.md` — throughput + limit-increase guidance.

## Round 2 result

Re-test of the four weak queries (q08, q11, q15, q21) confirmed every gap closed:
**21/21 full score, 100%.**