docling bridge: wire IngestStatementOp PDF branch + DoclingProcessSurface b00t attestation #60

@promptexecutionerr

Description

What already exists (do not re-implement)

The MCP contract, request shape, and pipeline status tracking are complete:

| Symbol | Location | State |
| --- | --- | --- |
| proxy_docling_ingest_pdf MCP tool | crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:132 | Registered, routes to handle_ingest_pdf() |
| handle_ingest_pdf<T>() | crates/ledgerr-mcp/src/mcp_adapter.rs:999 | Implemented; calls ingest_statement_rows() with pre-parsed extracted_rows |
| IngestPdfRequest | crates/ledgerr-mcp/src/lib.rs:91 | { pdf_path: String, journal_path, workbook_path, ontology_path, raw_context_bytes, extracted_rows: Vec<TransactionInput> } |
| docling_ready: bool | mcp_adapter.rs:193 | In get_pipeline_status(); hardcoded true at call site bin/ledgerr-mcp-server.rs:130 |
| DocumentChunk | crates/ledger-core/src/rule_registry.rs:122 | NDJSON sidecar output type: { node_id, text, parent_id, semantic_id, anchors: Vec<[u32;2]> } |
| IngestStatementOp | crates/ledger-core/src/ledger_ops.rs:169 | Handles CSV/XLSX via calamine; PDF branch missing, returns error on .pdf input |
| test_ingest_statement_via_pdf_sidecar | crates/ledger-core/src/integration_tests.rs:73 | #[ignore]; contract written, awaiting implementation |
| ProcessSurface + Requirement::BinaryOnPath | crates/b00t-iface/src/core/surface.rs:14–66 | Surface lifecycle trait with typed requirement declarations |
| HandshakeSurface / HandshakeDocument | crates/b00t-iface/src/handshake/mod.rs:63 | Writes _b00t_/handshake/l3dg3rr.json; surfaces: Vec<String> field carries capability advertisement |

What needs to be built

1. IngestStatementOp::execute() — PDF branch (ledger-core/src/ledger_ops.rs)

Currently execute() opens input_path via calamine::open_workbook_auto(), which cannot read .pdf input and fails with an error. Add a PDF branch before the calamine block:

if matches!(doc_type, DocType::Pdf) {
    return ingest_pdf_via_docling(input_path, &account_id, ctx);
}

ingest_pdf_via_docling(path, account_id, ctx) must:

  1. Check which::which("docling").is_ok() — return LedgerOpError::MissingDependency("docling not on PATH") if absent (not panic).
  2. Spawn: std::process::Command::new("docling").args(["convert", "--to", "json", path]).output()
  3. Deserialize stdout as DoclingDocument (see schema below).
  4. Map DoclingDocument.tables[*].data.grid rows → TransactionInput { account_id, date, amount, description, source_ref }.
  5. amount must be rust_decimal::Decimal::from_str() — never f64.
  6. Compute Blake3 content-hash ID per row: blake3(account_id + date + amount_str + description).
  7. Return OperationResult::success("ingest-statement", rows.len()).
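The spawn step and the no-panic requirement can be sketched with the standard library alone. This is a hedged illustration, not the project's implementation: ConvertError and run_converter are hypothetical names, and the argument list is supplied by the caller because the exact docling invocation should be checked against the installed CLI. The sketch complements the which probe in step 1 by mapping a NotFound spawn failure to a missing-dependency error, so a binary that disappears between the probe and the spawn still yields an error rather than a panic.

```rust
use std::io::ErrorKind;
use std::process::Command;

/// Hypothetical error type standing in for `LedgerOpError`.
#[derive(Debug, PartialEq)]
pub enum ConvertError {
    /// The binary could not be found when spawning.
    MissingDependency(String),
    /// The process could not be started for some other reason.
    SpawnFailed(String),
    /// The process ran but exited unsuccessfully; carries captured stderr.
    Failed(String),
}

/// Runs `binary` with `args` and returns its stdout bytes without panicking.
pub fn run_converter(binary: &str, args: &[&str]) -> Result<Vec<u8>, ConvertError> {
    let output = Command::new(binary).args(args).output().map_err(|e| {
        if e.kind() == ErrorKind::NotFound {
            ConvertError::MissingDependency(format!("{binary} not on PATH"))
        } else {
            ConvertError::SpawnFailed(e.to_string())
        }
    })?;

    // A non-zero exit is reported with stderr so the caller can log the cause.
    if !output.status.success() {
        return Err(ConvertError::Failed(
            String::from_utf8_lossy(&output.stderr).into_owned(),
        ));
    }
    Ok(output.stdout)
}
```

The returned bytes would then be handed to the deserialization step; keeping the spawn separate from parsing makes each failure mode testable on its own.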

2. DoclingDocument deserialization target (ledger-core/src/ingest.rs or new ledger-core/src/docling.rs)

Docling 2.78.0 JSON schema (relevant subset):

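use serde::Deserialize;

/// Top-level Docling document; only the fields needed for ingestion are modeled.
#[derive(Debug, Deserialize)]
pub struct DoclingDocument {
    pub tables: Vec<DoclingTable>,
}

#[derive(Debug, Deserialize)]
pub struct DoclingTable {
    pub data: DoclingTableData,
}

/// Row-major grid of cells; the first row is treated as the header.
#[derive(Debug, Deserialize)]
pub struct DoclingTableData {
    pub grid: Vec<Vec<DoclingCell>>,
}

#[derive(Debug, Deserialize)]
pub struct DoclingCell {
    pub text: String,
    // A cell with no explicit span covers one column and one row,
    // so a missing field defaults to 1 rather than u32's default of 0.
    #[serde(default = "default_span")]
    pub col_span: u32,
    #[serde(default = "default_span")]
    pub row_span: u32,
}

fn default_span() -> u32 {
    1
}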

Column heuristic: first row of grid is the header. Map headers to date, amount, description via DocumentShape::column_map (same path as the XLSX branch at ledger_ops.rs:257–271).
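The header heuristic can be sketched as follows. This is only an illustration under stated assumptions: the real mapping goes through DocumentShape::column_map, whose matching rules are not shown in this issue, so the column_map function below is a hypothetical stand-in that matches the exact header names "date", "amount", and "description" case-insensitively. The grid is modeled as plain strings (the text of each cell), and amounts are kept as strings so the caller can parse them with rust_decimal::Decimal::from_str().

```rust
/// Column positions of the three fields needed for a transaction row.
#[derive(Debug, PartialEq)]
pub struct ColumnMap {
    pub date: usize,
    pub amount: usize,
    pub description: usize,
}

/// Hypothetical stand-in for `DocumentShape::column_map`: matches header
/// text case-insensitively against exact field names.
pub fn column_map(header: &[String]) -> Option<ColumnMap> {
    let find = |name: &str| {
        header.iter().position(|h| h.trim().eq_ignore_ascii_case(name))
    };
    Some(ColumnMap {
        date: find("date")?,
        amount: find("amount")?,
        description: find("description")?,
    })
}

/// Raw (date, amount, description) strings for one data row.
pub type RawRow = (String, String, String);

/// Treats the first grid row as the header and extracts the mapped cells
/// from every following row. Rows too short to hold all mapped columns
/// are skipped instead of causing a panic.
pub fn map_grid(grid: &[Vec<String>]) -> Vec<RawRow> {
    let Some((header, body)) = grid.split_first() else {
        return Vec::new();
    };
    let Some(cols) = column_map(header) else {
        return Vec::new();
    };
    body.iter()
        .filter_map(|row| {
            Some((
                row.get(cols.date)?.trim().to_string(),
                row.get(cols.amount)?.trim().to_string(),
                row.get(cols.description)?.trim().to_string(),
            ))
        })
        .collect()
}
```

Looking columns up by header position rather than assuming a fixed order lets statements that list Description before Amount, or the reverse, pass through the same path.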

3. docling_ready real probe (ledgerr-mcp/src/bin/ledgerr-mcp-server.rs)

Replace hardcoded true at line 130:

// Before:
mcp_adapter::handle_pipeline_status(true, true, true, Vec::new())
// After:
let docling_ready = which::which("docling").is_ok();
mcp_adapter::handle_pipeline_status(true, true, docling_ready, Vec::new())

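The which crate should already be in the workspace (confirm in Cargo.lock); if it is absent, fall back to std::process::Command::new("which").arg("docling").status().map(|s| s.success()).unwrap_or(false). Note that this fallback depends on a which executable being installed, which is not the case on stock Windows.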
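If neither the which crate nor a which executable can be relied on, the probe can be approximated with the standard library. The sketch below is an assumption-laden simplification, not a drop-in replacement for which::which: find_in and binary_on_path are hypothetical names, and the search only checks that a regular file with the given name exists in a PATH directory. It does not check the executable bit or Windows extensions such as .exe.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Searches each directory in `path_var` for a regular file named `binary`.
/// Simplified: ignores the executable bit and Windows file extensions.
pub fn find_in(path_var: &OsStr, binary: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(binary))
        .find(|candidate| candidate.is_file())
}

/// Probe that could feed `docling_ready`; false when PATH is unset.
pub fn binary_on_path(binary: &str) -> bool {
    env::var_os("PATH")
        .map(|p| find_in(&p, binary).is_some())
        .unwrap_or(false)
}
```

Taking the PATH value as a parameter in find_in keeps the search testable without mutating the process environment.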

4. DoclingProcessSurface — b00t attestation (crates/b00t-iface/src/ or crates/ledgerr-mcp/src/)

Implement ProcessSurface for Docling as a b00t datum. This is the node-level attestation: when the binary was compiled/optimized for this system, b00t can verify Docling is operational before l3dg3rr claims docling_ready: true.

pub struct DoclingProcessSurface;

impl ProcessSurface for DoclingProcessSurface {
    type Config = ();
    type Error = DoclingError;
    type Handle = ();

    fn capability(&self) -> SurfaceCapability {
        SurfaceCapability {
            name: "docling",
            requirements: vec![
                Requirement::BinaryOnPath("docling".into()),
            ],
            governance: GovernancePolicy::default(),
        }
    }

    fn init(&mut self, _config: ()) -> Result<(), DoclingError> {
        which::which("docling").map(|_| ()).map_err(|_| DoclingError::NotOnPath)
    }

    fn operate(&mut self) -> Result<(), DoclingError> {
        // Smoke: docling --version
        let out = std::process::Command::new("docling")
            .arg("--version")
            .output()
            .map_err(|e| DoclingError::SpawnFailed(e.to_string()))?;
        if out.status.success() { Ok(()) } else { Err(DoclingError::VersionCheckFailed) }
    }

    fn maintain(&mut self) -> MaintenanceAction { MaintenanceAction::NoOp }
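    // Placeholder so the sketch compiles; the AuditRecord contents are still to be defined.
    fn terminate(&mut self, _: ()) -> AuditRecord { todo!("construct the AuditRecord for this surface") }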
}

On HandshakeSurface::operate(), append "docling" to surfaces in the HandshakeDocument written to _b00t_/handshake/l3dg3rr.json only when DoclingProcessSurface.init() succeeds. This makes the datum self-attesting: the handshake file's surfaces array is the proof that Docling is operational on this node.
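The conditional advertisement can be sketched in isolation. The HandshakeDocument below is a hypothetical minimal mirror that models only the surfaces field named in this issue, and advertise_if_ready is an illustrative helper rather than an existing b00t-iface function. It takes the result of the surface's init() so the same logic can be exercised without a docling binary present.

```rust
/// Hypothetical minimal mirror of `HandshakeDocument`; only `surfaces`
/// is taken from the issue text.
#[derive(Debug, Default)]
pub struct HandshakeDocument {
    pub surfaces: Vec<String>,
}

/// Adds `name` to `surfaces` only when `init` succeeded, and never twice.
pub fn advertise_if_ready<E>(doc: &mut HandshakeDocument, name: &str, init: Result<(), E>) {
    if init.is_ok() && !doc.surfaces.iter().any(|s| s == name) {
        doc.surfaces.push(name.to_string());
    }
}
```

Guarding against duplicates keeps the surfaces array stable if operate() runs more than once on the same document.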

Acceptance Criteria

  • IngestStatementOp::execute() with a .pdf input and docling on $PATH produces ≥ 1 TransactionInput with non-None date and Decimal amount.
  • IngestStatementOp::execute() with docling absent returns LedgerOpError::MissingDependency — no panic.
  • get_pipeline_status(true, true, false, vec![]) returns blockers: ["docling_unreachable"] (existing test at mcp_adapter_contract.rs:65 must still pass).
  • test_ingest_statement_via_pdf_sidecar (currently #[ignore]) has its #[ignore] attribute removed and passes when docling is on $PATH; in CI it is skipped unless the DOCLING_INTEGRATION=1 env var is set.
  • DoclingProcessSurface::init() returns Err(DoclingError::NotOnPath) when docling is absent.
  • HandshakeDocument { surfaces } includes "docling" iff DoclingProcessSurface.init().is_ok().
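The DOCLING_INTEGRATION guard from the criteria above could look roughly like this. The function names are hypothetical, and treating only the exact value "1" as enabled is an assumption drawn from the wording of the criterion.

```rust
use std::env;

/// Returns true only when the variable's value is exactly "1".
pub fn flag_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads `DOCLING_INTEGRATION` from the process environment.
pub fn docling_integration_enabled() -> bool {
    flag_enabled(env::var("DOCLING_INTEGRATION").ok().as_deref())
}
```

The integration test would start with an early return when docling_integration_enabled() is false. One trade-off of this approach is that a skipped run is reported as passed, so printing a skip message is worth considering.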

Files

| File | Change |
| --- | --- |
| crates/ledger-core/src/ledger_ops.rs:188 | Add PDF branch in IngestStatementOp::execute() |
| crates/ledger-core/src/ingest.rs or new docling.rs | DoclingDocument, DoclingTable, DoclingCell deserialization; ingest_pdf_via_docling() |
| crates/ledgerr-mcp/src/bin/ledgerr-mcp-server.rs:130 | Replace docling_ready: true with which::which("docling").is_ok() |
| crates/b00t-iface/src/ (new file) | DoclingProcessSurface implementing ProcessSurface |
| crates/b00t-iface/src/handshake/mod.rs | Append "docling" to HandshakeDocument::surfaces when surface is ready |
| crates/ledger-core/src/integration_tests.rs:73 | Remove #[ignore] gate; add DOCLING_INTEGRATION env guard |

Dependency

Independent of #55#57. IngestPdfRequest.extracted_rows is the output that feeds #55 (TransactionFacts population) — implementing this makes the PDF → legal verification path end-to-end.

Metadata

Assignees

No one assigned

    Labels

    docling (Document extraction bridge), enhancement (New feature or request), pipeline (Transaction pipeline)
