diff --git a/docs/CONCURRENCY.md b/docs/CONCURRENCY.md index 59aa7a4..7736165 100644 --- a/docs/CONCURRENCY.md +++ b/docs/CONCURRENCY.md @@ -2,68 +2,92 @@ drift uses Zig 0.16's `std.Io` interface throughout. Every file read, subprocess spawn, and path resolution flows through an `Io` instance threaded from `std.process.Init` in `main` down through `CommandContext` to every command. -This design makes it straightforward to parallelize CPU- and I/O-bound work without changing the control flow of individual functions. Below is what's done today and what remains as viable follow-up. +This design makes it straightforward to parallelize CPU- and I/O-bound work without changing the control flow of individual functions. The work below is all implemented in `src/commands/lint.zig`. -## Implemented: per-doc parallelism (`Io.Group`) +## Layer 1: per-doc parallelism (`Io.Group`) `drift check` wraps the per-doc loop in `src/commands/lint.zig` in an `Io.Group`. Each doc's binding checks (file read → tree-sitter parse → hash → optional `git log` for blame) run as independent tasks on the thread pool backing `Io.Threaded`. Key design constraints: -- **Task-local `CommandContext`.** Each task builds its own `std.heap.ArenaAllocator` (child of `run_arena`) so `ctx.scratch()` / `ctx.resetScratch()` inside `checkBinding`, `checkDocLinks`, and `classifyRelativeLink` continue to work unchanged. +- **Task-local `CommandContext`.** Each task builds its own `std.heap.ArenaAllocator` (child of `run_arena`) so `ctx.scratch()` / `ctx.resetScratch()` inside `checkBinding` continue to work unchanged. - **Task-local `FileCache`.** `std.StringHashMap` is not thread-safe. Each task gets its own cache. Docs rarely share files with one another, so the lost hit rate is small and not worth a mutex. - **Pre-allocated result slots.** `results: []?DocCheckResult` is allocated once from `run_arena`.
Tasks write their own slot; the main thread merges in doc-order (docs are already sorted by `discoverDocGroups`) so output stays deterministic. - **Error handling.** `Io.Group.async` tasks cannot propagate errors back to the caller. A `checkOneDoc` wrapper catches errors and stores them as `error_message: ?[]const u8` on the result. The main thread prints and translates to `error.LintCheckFailed` during merge. -Local measurement on the drift repo itself: ~0.40 s → ~0.14 s (≈3×). +## Layer 2: per-binding parallel blame (`io.async`) -## Proposed: speculative blame via `io.async` +`checkBinding` is CPU-bound (tree-sitter parse + fingerprint). It no longer shells out to `git log` for blame; instead it reports `.result = .stale, .reason_code = .changed_after_baseline` with `blame = null`. A second phase inside `checkOneDocInner` iterates over the anchor rows, and for each `changed_after_baseline` entry fires `io.async(vcs.getLatestBlameInfo, …)`, collecting the futures. A follow-up loop awaits each future and fills in the blame info. -When a binding turns stale, `checkBinding` shells out to `git log` for blame info (`vcs.getLatestBlameInfo`). Today this is serial inside each task. Starting the blame query speculatively *before* the fingerprint comparison finishes removes its latency from the stale path: +Two properties matter: -```zig -var blame_future = io.async(vcs.getLatestBlameInfo, .{ ... }); -defer if (blame_future.cancel(io)) |_| {} else |_| {}; +- **No wasted work on fresh anchors.** Blame queries are only fired after we know a binding is stale, so the common case (all anchors fresh) pays nothing extra. An earlier proposal to fire blame *speculatively* before the staleness check was rejected because `git log -1` is ~20 ms — more than the ~5 ms staleness check — so speculation would slow down the common case. +- **Scales with stale count.** A doc with 5 stale anchors previously ran 5 sequential `git log` calls (~100 ms). 
Now they execute concurrently on the thread pool (~25 ms on a 4-core machine). -// ... compute fingerprint ... -if (is_fresh) return .{ .result = .fresh, ... }; // cancel fires on defer -const blame = try blame_future.await(io); -return .{ .result = .stale, .blame = blame, ... }; -``` + +Subprocess buffers for blame queries use `run_arena` (thread-safe, lives until the run ends) rather than the task-local scratch arena, because the task-local scratch is reset between bindings in phase 1 and the blame futures outlive that reset. -`io.async` is infallible on `Io.Threaded` (it runs inline on `Io.failing`), so the code reads the same on both backends. Cost: wasted work on fresh anchors. A `git log -1` against a single path is cheap; the tradeoff favours the stale path since that's what blocks the user. +## Layer 3: per-link parallel existence checks (`Io.Group`) -## Proposed: `Io.Batch` for link-existence checks +`checkDocLinks` previously walked each markdown link serially, doing `realPathFileAlloc` + `accessAbsolute` per link. Now each link is a task in a per-doc `Io.Group`; results fill a pre-allocated `?JsonLinkRow` slot array that's stable across parallel writes. After `group.await`, slots are appended to `out` in the order the links appear in the doc, so output is deterministic. -`checkDocLinks` in `src/commands/lint.zig` resolves each markdown link in a doc one at a time and calls `pathExists` (`accessAbsolute`) on each target. For docs with dense cross-links this is many sequential stat syscalls. +For docs with dense cross-links this collapses N sequential stats into a single pool round-trip. -`Io.Batch` is the low-level batching primitive (one layer below `Io.Group`) and supports stat-style operations: +**Note on `Io.Batch`.** An earlier proposal used `Io.Batch`, which is the lower-level primitive below `Io.Group`. As of Zig 0.16, `Io.Batch.Operation` only supports `file_read_streaming`, `file_write_streaming`, `device_io_control`, and `net_receive` — no stat/access operation.
If stdlib adds one, link checks could migrate to `Io.Batch` for better coalescing under `Io.Uring` specifically. On `Io.Threaded` (current backend) `Io.Group` is functionally equivalent. -```zig -var batch: Io.Batch = .init; -for (parsed.links.items) |link| batch.stat(io, absolute); -try batch.await(io); -``` +## Layer 4: overlapped startup (`io.async`) -On Linux with `Io.Uring` this becomes a single submission; on `Io.Threaded` it falls back to parallel thread-pool stats. Modest win on dense docs, larger win under `Io.Uring` once that backend stabilizes. +At the top of `lint.run`, `discoverDocGroups` (shells out to `git ls-files`) and `vcs.getRepoIdentity` (shells out to `git remote get-url origin`) are both independent subprocess calls. `getRepoIdentity` is now spawned as an `io.async` future before `discoverDocGroups` runs; the two `git` processes execute in parallel and we await the identity right after ls-files returns. -## Proposed: overlap startup shell-outs +Saves a few ms per run. Small absolute win, but free — both calls are already `io`-aware. -At the top of `lint.run`, `discoverDocGroups` calls `git ls-files` and `getRepoIdentity` calls `git remote get-url origin` back-to-back. Both take separate allocators and are trivially independent: +## Measurements -```zig -var identity_future = io.async(vcs.getRepoIdentity, .{ ctx.run_arena, ctx.scratch(), cwd_path }); -var doc_groups = try discoverDocGroups(...); -const repo_identity = identity_future.await(io) catch null; -``` +Benchmarks via `hyperfine --warmup 3` (20+ runs per variant) on an M-series Mac, `-Doptimize=ReleaseSafe` builds at the three points: sequential (main before #24), layer 1 only (main after #24), all four layers (after #26).
+### Small repo — drift itself (10 docs, 19 anchors, ~6 links) + +Fresh steady-state (no stale anchors): + +| Variant | Mean | vs sequential | +|---|---:|---:| +| sequential | 64.6 ± 2.4 ms | 1.00× | +| layer 1 only | 34.6 ± 3.4 ms | 1.87× | +| layers 1-4 | 26.4 ± 1.5 ms | **2.45×** | + +All 19 anchors forced stale (worst-case blame load): + +| Variant | Mean | vs sequential | +|---|---:|---:| +| sequential | ~280 ms | 1.00× | +| layer 1 only | 103.3 ± 3.2 ms | ~2.7× | +| layers 1-4 | 78.3 ± 12.7 ms | **~3.6×** | + +### Large repo — nocturne monorepo (247 docs, 107 anchors) -## Proposed: rework `GitCatFile` with `Io.Group` +Fresh steady-state: -`GitCatFile` (`src/vcs.zig`) keeps a persistent `git cat-file --batch` process alive to avoid spawn overhead for historical file reads. Today request/response is synchronous. Running request submission and response parsing as two concurrent tasks inside an `Io.Group` would let the next request go out while the current response is still being read — useful only if we shift to a use case that issues many queries, which today we don't. +| Variant | Mean | vs sequential | +|---|---:|---:| +| sequential | 1.426 ± 0.036 s | 1.00× | +| layer 1 only | 468.6 ± 65.0 ms | 3.04× | +| layers 1-4 | 424.3 ± 35.0 ms | **3.36×** | -Noting it as a shape that 0.16 now permits; not a priority. +All 107 anchors forced stale: + +| Variant | Mean | vs sequential | +|---|---:|---:| +| sequential | 2.876 ± 0.053 s | 1.00× | +| layer 1 only | 914.7 ± 132.8 ms | 3.14× | +| layers 1-4 | 672.1 ± 26.9 ms | **4.28×** | + +### Reading the numbers + +- **Layer 1 is the largest win** and scales with doc count: 1.87× on a 10-doc repo, 3.04× on a 247-doc repo. +- **Layer 2 (parallel blame) is a 1.32–1.36× multiplier when many anchors are stale** (`git log` runs serialize inside a doc before layer 2, parallelize after). On fresh runs it costs nothing because blame is only fired once staleness is confirmed — not speculatively. 
+- **Layer 3 (parallel link checks)** fires independently of staleness; on drift's repo with 6 links it's in the noise, on nocturne with 112 links it accounts for part of the 1.1× on top of layer 1 in the fresh case. +- **Layer 4 (overlapped startup)** saves one `git remote` subprocess latency (~10–20 ms) and is free. + +Combined, on the large-repo stale worst case, the pipeline went from **2.88 s to 672 ms (4.28×)**. ## What not to parallelize @@ -73,4 +97,8 @@ Noting it as a shape that 0.16 now permits; not a priority. ## Backends -All of the above runs unchanged on any `Io` implementation. Today drift uses `Io.Threaded` (the only feature-complete backend in 0.16). `Io.Uring` (Linux), `Io.Kqueue` (BSD/macOS), and `Io.Dispatch` (macOS) are proof-of-concept in 0.16 and will become interesting for `Io.Batch` work once they stabilize. +All of the above runs unchanged on any `Io` implementation. Today drift uses `Io.Threaded` (the only feature-complete backend in 0.16). `Io.Uring` (Linux), `Io.Kqueue` (BSD/macOS), and `Io.Dispatch` (macOS) are proof-of-concept in 0.16 and will become interesting for layer 3 specifically once `Io.Batch` grows a stat operation. + +## Proposed but not implemented + +**Rework `GitCatFile` with `Io.Group`.** `src/vcs.zig` keeps a persistent `git cat-file --batch` process alive to avoid spawn overhead for historical file reads. Running request submission and response parsing as two concurrent tasks inside an `Io.Group` would let the next request go out while the current response is still being read — useful only if we shift to a use case that issues many queries, which today we don't. 
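+
+A sketch of that shape (`submitRequests` and `collectResponses` are placeholder names; `GitCatFile` has no such split today, it does a synchronous write-then-read per query):
+
+```zig
+// Sketch only, not implemented. One task streams requests into the
+// persistent `git cat-file --batch` stdin while the other parses
+// responses off stdout, so request N+1 is in flight while response N
+// is still being read. Uses the same Io.Group pattern as layer 1/3.
+var group: std.Io.Group = .init;
+defer group.cancel(io);
+group.async(io, submitRequests, .{ &cat_file, io, oids });
+group.async(io, collectResponses, .{ &cat_file, io, results });
+try group.await(io);
+```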
diff --git a/drift.lock b/drift.lock index fb7fbd4..4b896d9 100644 --- a/drift.lock +++ b/drift.lock @@ -3,7 +3,7 @@ CLAUDE.md -> build.zig sig:7194b38f39dbadba CLAUDE.md -> src/main.zig sig:f2735440986d2477 docs/CLI.md -> src/commands/link.zig sig:7bd7f824afc30e0b -docs/CLI.md -> src/commands/lint.zig sig:e29be45c4d84759e +docs/CLI.md -> src/commands/lint.zig sig:b5afc0405907cf64 docs/CLI.md -> src/commands/refs.zig sig:f623b7774086094e docs/CLI.md -> src/commands/status.zig sig:eade166d24a20b81 docs/CLI.md -> src/commands/unlink.zig sig:0dbe1ee3315211b5 diff --git a/src/commands/lint.zig b/src/commands/lint.zig index dbddd7d..43a06e1 100644 --- a/src/commands/lint.zig +++ b/src/commands/lint.zig @@ -173,6 +173,12 @@ pub fn run( const lf = try lockfile.discover(ctx.io, ctx.run_arena, ctx.scratch(), cwd_path); ctx.resetScratch(); + // Kick off the `git remote get-url origin` query concurrently with the + // `git ls-files` run inside `discoverDocGroups`. Both are independent and + // both shell out to git, so the two subprocesses overlap. + var identity_future = ctx.io.async(vcs.getRepoIdentity, .{ ctx.io, ctx.run_arena, ctx.run_arena, cwd_path }); + defer _ = identity_future.cancel(ctx.io); + var doc_groups = try discoverDocGroups(ctx.io, ctx.run_arena, lf.root_path, lf.bindings.items); defer { for (doc_groups.items) |*doc| doc.bindings.deinit(ctx.run_arena); @@ -180,7 +186,7 @@ pub fn run( } const detected_vcs = vcs.detectVcs(); - const repo_identity = vcs.getRepoIdentity(ctx.io, ctx.run_arena, ctx.scratch(), cwd_path); + const repo_identity = identity_future.await(ctx.io); const normalized_changed = if (changed_path) |raw| try normalizeChangedPrefix(ctx, lf.root_path, cwd_path, raw) @@ -314,6 +320,8 @@ fn checkOneDocInner( var stale_count: usize = 0; var skip_count: usize = 0; + // Phase 1: determine staleness per binding. CPU-bound (tree-sitter parse + + // fingerprint). `checkBinding` no longer fetches blame — that's phase 2. 
for (doc.bindings.items) |binding| { task_ctx.resetScratch(); const parsed = target.parse(binding.target); @@ -324,7 +332,7 @@ fn checkOneDocInner( const is_local = if (repo_identity) |ri| std.mem.eql(u8, o, ri) else false; if (!is_local) break :blk AnchorOutcome{ .result = .skip, .reason_code = .origin_mismatch }; } - break :blk checkBinding(task_ctx, root_path, binding, parsed, &file_cache, detected_vcs) catch |err| { + break :blk checkBinding(task_ctx, root_path, binding, parsed, &file_cache) catch |err| { const msg = try std.fmt.allocPrint(run_arena, "error checking {s}: {s}\n", .{ binding.target, @errorName(err) }); doc_result.error_message = msg; out.* = doc_result; @@ -340,6 +348,47 @@ fn checkOneDocInner( } } + // Phase 2: fetch blame in parallel for every `changed_after_baseline` stale + // anchor. Each call shells out to `git log`; running them concurrently cuts + // doc check time on docs with multiple stale anchors from N× to ~1×. + // Previously these were serialized inside `checkBinding`. + const BlameReturn = @typeInfo(@TypeOf(vcs.getLatestBlameInfo)).@"fn".return_type.?; + const BlameJob = struct { + row_index: usize, + future: std.Io.Future(BlameReturn), + }; + var blame_jobs: std.ArrayList(BlameJob) = .empty; + defer blame_jobs.deinit(run_arena); + + for (doc_result.anchors.items, 0..) |row, i| { + if (!std.mem.eql(u8, row.wire.result, "stale")) continue; + const reason = row.wire.reason orelse continue; + if (!std.mem.eql(u8, reason.code, @tagName(ReasonCode.changed_after_baseline))) continue; + + const future = io.async(vcs.getLatestBlameInfo, .{ + io, run_arena, run_arena, root_path, row.wire.path, detected_vcs, + }); + blame_jobs.append(run_arena, .{ .row_index = i, .future = future }) catch { + // If we can't record the future, at least drain it so it doesn't leak. 
+ var drop = future; + _ = drop.cancel(io) catch null; + continue; + }; + } + for (blame_jobs.items) |*job| { + const blame = (job.future.await(io)) catch null; + if (blame) |b| { + const row = &doc_result.anchors.items[job.row_index]; + row.blame_storage = b; + row.wire.blame = .{ + .author = b.author, + .commit = b.commit_hash, + .date = b.date, + .subject = b.subject, + }; + } + } + try checkDocLinks(task_ctx, root_path, doc.path, &file_cache, &doc_result.links); var broken_links: usize = 0; @@ -501,50 +550,65 @@ fn checkDocLinks( var parsed = (try markdown.parseDocument(ctx.run_arena, content)) orelse return; defer parsed.deinit(); - for (parsed.links.items) |link| { - const checked = try classifyRelativeLink(ctx, root_path, doc_path, link.target) orelse continue; - try out.append(ctx.run_arena, .{ - .display_target = checked.display_target, - .wire = .{ - .target = link.target, - .line = link.line, - .result = linkResultStr(if (checked.exists) .ok else .broken), - .reason = if (checked.exists) null else driftReason(.link_target_not_found), - }, + if (parsed.links.items.len == 0) return; + + // Resolve the doc's real path once; every link check needs the same dir. + const raw_absolute_doc = try std.Io.Dir.path.resolve(ctx.run_arena, &.{ root_path, doc_path }); + const real_doc_path = std.Io.Dir.cwd().realPathFileAlloc(ctx.io, raw_absolute_doc, ctx.run_arena) catch raw_absolute_doc; + const doc_dir = std.Io.Dir.path.dirname(real_doc_path) orelse root_path; + + // Parallelize link-existence checks. Each task is a `statFile` syscall + // wrapped in `pathExists`. On docs with many cross-doc links this + // collapses N sequential stats into a single pool round-trip. 
+ const slots = try ctx.run_arena.alloc(?JsonLinkRow, parsed.links.items.len); + @memset(slots, null); + + var group: std.Io.Group = .init; + defer group.cancel(ctx.io); + + for (parsed.links.items, slots) |link, *slot| { + group.async(ctx.io, classifyLinkTask, .{ + ctx.io, ctx.run_arena, root_path, doc_dir, link, slot, }); } -} + try group.await(ctx.io); -const CheckedLink = struct { - display_target: []const u8, - exists: bool, -}; + for (slots) |maybe_row| { + const row = maybe_row orelse continue; + try out.append(ctx.run_arena, row); + } +} -fn classifyRelativeLink( - ctx: CommandContext, +fn classifyLinkTask( + io: std.Io, + run_arena: std.mem.Allocator, root_path: []const u8, - doc_path: []const u8, - raw_target: []const u8, -) !?CheckedLink { - const trimmed = std.mem.trim(u8, raw_target, " \t\r\n"); - if (trimmed.len == 0) return null; - if (trimmed[0] == '#') return null; - if (std.Io.Dir.path.isAbsolute(trimmed)) return null; - if (hasUriScheme(trimmed)) return null; + doc_dir: []const u8, + link: markdown.Link, + slot: *?JsonLinkRow, +) void { + const trimmed = std.mem.trim(u8, link.target, " \t\r\n"); + if (trimmed.len == 0) return; + if (trimmed[0] == '#') return; + if (std.Io.Dir.path.isAbsolute(trimmed)) return; + if (hasUriScheme(trimmed)) return; const path_part = if (std.mem.findScalar(u8, trimmed, '#')) |idx| trimmed[0..idx] else trimmed; - if (path_part.len == 0) return null; + if (path_part.len == 0) return; - // Resolve symlinks on the doc path so relative links are computed from the real location. 
- const raw_absolute_doc = try std.Io.Dir.path.resolve(ctx.scratch(), &.{ root_path, doc_path }); - const real_doc_path = std.Io.Dir.cwd().realPathFileAlloc(ctx.io, raw_absolute_doc, ctx.scratch()) catch raw_absolute_doc; - const doc_dir = std.Io.Dir.path.dirname(real_doc_path) orelse root_path; - const absolute = try std.Io.Dir.path.resolve(ctx.scratch(), &.{ doc_dir, path_part }); - const relative = try std.Io.Dir.path.relative(ctx.run_arena, "", null, root_path, absolute); - const exists = pathExists(ctx.io, absolute); - ctx.resetScratch(); + const absolute = std.Io.Dir.path.resolve(run_arena, &.{ doc_dir, path_part }) catch return; + const relative = std.Io.Dir.path.relative(run_arena, "", null, root_path, absolute) catch return; + const exists = pathExists(io, absolute); - return .{ .display_target = relative, .exists = exists }; + slot.* = .{ + .display_target = relative, + .wire = .{ + .target = link.target, + .line = link.line, + .result = linkResultStr(if (exists) .ok else .broken), + .reason = if (exists) null else driftReason(.link_target_not_found), + }, + }; } fn hasUriScheme(target_text: []const u8) bool { @@ -595,7 +659,6 @@ fn checkBinding( binding: *const lockfile.Binding, parsed: target.ParsedTarget, file_cache: *FileCache, - detected_vcs: vcs.VcsKind, ) !AnchorOutcome { const sig_hex = binding.fieldValue("sig") orelse return .{ .result = .stale, .reason_code = .baseline_unavailable }; @@ -631,8 +694,10 @@ fn checkBinding( return .{ .result = .fresh, .reason_code = .none }; } - const blame = try vcs.getLatestBlameInfo(ctx.io, ctx.run_arena, ctx.scratch(), root_path, parsed.file_path, detected_vcs); - return .{ .result = .stale, .reason_code = .changed_after_baseline, .blame = blame }; + // Blame lookup happens in a second parallel phase in `checkOneDocInner`. + // Leaving it unset here so multiple stale anchors in the same doc don't + // serialize on one-at-a-time `git log` invocations. 
+ return .{ .result = .stale, .reason_code = .changed_after_baseline }; } fn driftProvenance(sig: ?[]const u8) ?drift_check_v1.Provenance {