4 changes: 4 additions & 0 deletions .env.example
@@ -28,6 +28,10 @@ LLM_PROVIDER=openai
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.0
LLM_API_KEY=your_openai_api_key_here
# LLM_CACHE_ENABLED=true
# LLM_CACHE_READ_ONLY=false
# LLM_MAX_INFLIGHT=16
# MAX_CONCURRENT_PROCESSES=4

# Ollama Configuration (Alternative)
# LLM_PROVIDER=ollama
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -8,10 +8,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- **Facts precision/recall/F1** on `POST /match/evaluate` (`fact_precision`, `fact_recall`, `fact_f1` and counts): relational triples only, excluding schema predicates and triples with ontological class/concept nodes in subject or object position.
- **Anthropic (Claude) and Google (Gemini) LLM providers** via `LLM_PROVIDER=anthropic|google`, with `ClaudeModel` and `GeminiModel` config enums.
- **Token usage reporting** in `BudgetTracker` when providers return `usage_metadata` on LLM responses (character counts remain the universal fallback).
- **LLM disk cache controls** on `LLMConfig`: `LLM_CACHE_ENABLED` (default on), `LLM_CACHE_READ_ONLY`, and in-memory plus on-disk stats via `LLMTool.get_cache_stats()`; `GET /info` exposes `llm_cache`.
- **Global LLM in-flight limit** (`LLM_MAX_INFLIGHT`, default 16) — shared semaphore caps concurrent provider requests across parallel unit workers.
- **Optional process concurrency cap** (`MAX_CONCURRENT_PROCESSES`) — limits simultaneous `/process` and `/process_unit` handlers (additional requests wait for a slot).
- **OpenAI Batch API helpers** (`ontocast.tool.llm_batch`) to export chat batch JSONL and import completed results into the LLM disk cache for offline benchmark pre-warming.
- **`BudgetTracker.cache_hits`** — disk-cache hits count toward character totals but not `calls_count`; included in budget summaries when non-zero.
- **Structured-document preprocessing** for heading-structured text (papers, reports): optional **Tag Sections** node detects academic-style headings, assigns **section-aligned labels** to semantic chunks via character-span overlap, and `target_sections` filters units before extraction.
- **Optional chunk summarization** — `summarize_sections` and `summary_max_sentences` on `/process` and CLI (`--summarize-sections`, `--summary-max-sentences`) run a **Summarize Chunks** graph node; ontology/facts render and critic prompts use `ContentUnit.extraction_text` (summary when present, else full chunk text).

### Changed
- **LLM caching path** — `complete`, `extract`, `__call__`, and `acall` share one `_invoke_cached` implementation with consistent cache keys (normalized prompt text), optional disable/read-only modes, and provider calls gated by the global in-flight semaphore.
- **Facts extraction prompts** (`facts_guidelines.py`): clearer two-namespace contract — domain ontology is read-only schema plus optional **reference individuals**; all text-derived occurrences use `cd:` with `lowercase_snake_case` local names. New rules separate **classes** from **instances** (no PascalCase class IRIs in subject/object slots), forbid typing `cd:` entities as `rdfs:Class` / `rdf:Property`, and add a final structural validation checklist before output.

### Fixed
@@ -20,6 +29,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Documentation
- User guide: facts two-namespace model (`concepts.md`), facts guidelines vs `facts_user_instruction` (`user_instructions.md`), entity alignment and evaluate semantics (`aggregation.md`, `api.md`, `workflow.md`).
- User guide: LLM cache configuration, in-flight/process limits, batch pre-warming, and `/info` cache stats (`llm_caching.md`, `configuration.md`, `api.md`, `concepts.md`, `workflow.md`).
- User guide: structured documents — section tagging, section-aligned chunk labels, `target_sections` / `summarize_sections` (`concepts.md`, `workflow.md`, `api.md`, `configuration.md`).

## [0.4.0] - 2026-05-26

Binary file modified docs/assets/graph.lr.png
462 changes: 254 additions & 208 deletions docs/assets/graph.lr.svg
Binary file modified docs/assets/graph.png
Binary file modified docs/assets/graph.preview.png
511 changes: 280 additions & 231 deletions docs/assets/graph.svg
5 changes: 3 additions & 2 deletions docs/index.md
@@ -28,7 +28,8 @@ OntoCast extracts semantic triples from documents using an agentic, ontology-dri
- **Triple store integration** — Fuseki, Neo4j (n10s), or filesystem fallback
- **Tenancy** — partition datasets/collections by tenant and project
- **REST API** — document processing, ontology catalog management, graph matching
- **Automatic LLM caching** — built-in response caching
- **Automatic LLM caching** — disk cache with optional read-only mode, global in-flight limiting, and OpenAI Batch API pre-warming for benchmarks
- **Structured documents** — optional section tagging, section-aligned chunk labels, section filtering, and LLM summarization before extraction

---

@@ -101,7 +102,7 @@ Document-level pipeline (regenerated via `uv run plot-graph`):

Landscape variant: [graph.lr.png](assets/graph.lr.png). Per-unit render/critic loops are documented in [Workflow](user_guide/workflow.md#per-unit-atomic-loop).

1. Convert → chunk document
1. Convert → optional tag sections → chunk (semantic) → optional summarize chunks
2. Parallel ontology render per unit → normalize → optional consolidate → validate
3. Parallel facts render per unit → merge with disambiguation
4. Serialize to triple store; return Turtle in API response
4 changes: 3 additions & 1 deletion docs/user_guide/aggregation.md
@@ -44,7 +44,9 @@ For evaluation against ground truth, use the match endpoints (see [API Endpoints

- Align entities across multiple graphs globally
- Derive pairwise predicted↔GT mappings
- Compute triple and entity precision/recall/F1
- Compute triple, facts, and entity precision/recall/F1

**Facts vs triple metrics:** triple-level scores count typing and taxonomy (`rdf:type`, `rdfs:subClassOf`, …). **Facts** scores measure only instance-to-instance relations (e.g. book → character via an ontology property), excluding schema predicates and triples that touch class/concept nodes in subject or object position. Relation property IRIs in predicate position still count toward facts.

Entity match payloads accept IRI strings or `URIRef` values; evaluation normalizes to `URIRef` for projection. **Entity false positives/negatives** count unmatched entities in each graph (set difference), so a shared ontology vocabulary IRI matched once is not also counted as an extra false positive on the other side.
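
The set-based entity counts described above can be sketched as follows. This is a minimal illustration, assuming entities are plain IRI strings and matches are 1:1 `(predicted, ground_truth)` pairs; the function name `entity_prf` and its return shape are hypothetical and are not OntoCast's API.

```python
def entity_prf(predicted, ground_truth, matches):
    """Set-based entity precision/recall/F1 (illustrative sketch only)."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    # Keep only matches whose endpoints actually occur in each graph.
    accepted = {(p, g) for p, g in matches if p in predicted and g in ground_truth}
    matched_pred = {p for p, _ in accepted}
    matched_gt = {g for _, g in accepted}

    tp = len(accepted)                    # accepted entity matches
    fp = len(predicted - matched_pred)    # predicted entities left unmatched
    fn = len(ground_truth - matched_gt)   # ground-truth entities left unmatched

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# A shared vocabulary IRI matched to itself is counted once as a true positive.
scores = entity_prf(
    predicted={"ex:a", "ex:b", "voc:Book"},
    ground_truth={"gt:a", "gt:c", "voc:Book"},
    matches=[("ex:a", "gt:a"), ("voc:Book", "voc:Book")],
)
```

Because false positives and false negatives are set differences, the matched `voc:Book` IRI does not reappear as an extra error on either side.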

50 changes: 45 additions & 5 deletions docs/user_guide/api.md
@@ -10,15 +10,25 @@ Returns service health. Use for load balancers and readiness probes.

### `GET /info`

Returns version, configuration summary, and active backend information.
Returns service metadata, including:

| Field | Description |
|-------|-------------|
| `version` | Package version |
| `llm_cache` | When the LLM tool is initialized: in-memory hit/miss counters plus on-disk cache file stats (`cache_hits`, `cache_misses`, `disk`) |
| `max_concurrent_processes` | Configured cap on simultaneous `/process` and `/process_unit` handlers, if `MAX_CONCURRENT_PROCESSES` is set |

```bash
curl http://localhost:8999/info
```

---

## Document Processing

### `POST /process`

Runs the full document pipeline: convert → chunk → ontology map/reduce → facts map/reduce → serialize.
Runs the full document pipeline: convert → [tag sections] → chunk → [summarize chunks] → ontology map/reduce → facts map/reduce → serialize. Bracketed stages run only when structured-document parameters are set.

**Content types:**

@@ -40,6 +50,18 @@ Runs the full document pipeline: convert → chunk → ontology map/reduce → f
| `ontology_user_instruction` | Guide ontology extraction |
| `ontology_selection_user_instruction` | Guide catalog ontology selection |
| `facts_user_instruction` | Guide facts extraction |
| `target_sections` | Comma-separated or JSON list; keep only these sections (enables section tagging) |
| `summarize_sections` | Sections to summarize before extraction; omit to skip. `*` or empty = all chunks |
| `summary_max_sentences` | Max sentences per summary when summarization runs (default `5`) |

**CLI file processing** (`ontocast --input-path …`) accepts the same structured-document flags:

```bash
ontocast --input-path ./papers/ \
--target-sections results,methods \
--summarize-sections results \
--summary-max-sentences 5
```

**Examples:**

@@ -60,9 +82,18 @@ curl -X POST "http://localhost:8999/process?strip_provenance=true" \
# Multi-tenant request
curl -X POST "http://localhost:8999/process?tenant=acme&project=reports" \
-F "file=@document.pdf"

# Structured paper: keep Results/Methods, summarize Results only
curl -X POST "http://localhost:8999/process?target_sections=results,methods&summarize_sections=results&summary_max_sentences=5" \
-F "file=@paper.pdf"

# JSON body with section lists
curl -X POST http://localhost:8999/process \
-H "Content-Type: application/json" \
-d '{"text": "# Introduction\n...\n## Results\n...", "target_sections": ["results"], "summarize_sections": ["*"], "summary_max_sentences": 5}'
```

**Response:** JSON with `data.facts` (Turtle), `data.ontology_artifacts` (list of ontology TTL payloads), and `metadata` (status, chunk counts, budget).
**Response:** JSON with `data.facts` (Turtle), `data.ontology_artifacts` (list of ontology TTL payloads), and `metadata` (status, chunk counts, budget including `cache_hits` when applicable).

---

@@ -154,9 +185,13 @@ Derive 1:1 predicted↔ground-truth entity matches for one graph pair from align

### `POST /match/evaluate`

Compute triple and entity precision/recall/F1 given graphs and entity matches. Label triples (`rdfs:label`) are excluded from triple metrics.
Compute triple, **facts**, and entity precision/recall/F1 given graphs and entity matches.

- **Triple metrics** — all triples except `rdfs:label` (includes `rdf:type` and other schema assertions).
- **Facts metrics** — relational assertions only: excludes schema predicates (`rdf:type`, `rdfs:subClassOf`, `rdfs:comment`) and any triple whose subject or object is an ontological (class/concept) URIRef. Ontology **relation** IRIs used only as predicates (e.g. `.../relations#P674`) are not treated as ontological entities.
- **Entity metrics** — true positives = number of accepted entity matches; false positives = predicted entities not in the matched set; false negatives = ground-truth entities not in the matched set (set-based, so correctly matched shared vocabulary IRIs are not double-penalized).

Entity metrics: true positives = number of accepted entity matches; false positives = predicted entities not in the matched set; false negatives = ground-truth entities not in the matched set (set-based, so correctly matched shared vocabulary IRIs are not double-penalized).
Response fields: `precision` / `recall` / `f1` (triples), `fact_precision` / `fact_recall` / `fact_f1` (facts), `entity_precision` / `entity_recall` / `entity_f1` (entities), plus TP/FP/FN counts for each tier.
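
The facts tier can be pictured as a filter over the triple set. The sketch below is an assumption-laden illustration: triples are tuples of prefixed-name strings, the schema predicate set contains only the three predicates named above (the real list may be longer), and the set of ontological class/concept nodes is passed in explicitly because how the service detects them is not described here. `fact_triples` is a hypothetical name.

```python
# Schema predicates taken from the description above; possibly incomplete.
SCHEMA_PREDICATES = frozenset({"rdf:type", "rdfs:subClassOf", "rdfs:comment"})


def fact_triples(triples, class_nodes):
    """Keep relational triples only (illustrative sketch, not OntoCast's code)."""
    class_nodes = set(class_nodes)
    return {
        (s, p, o)
        for s, p, o in triples
        if p not in SCHEMA_PREDICATES   # drop typing and taxonomy assertions
        and s not in class_nodes        # drop triples with a class as subject
        and o not in class_nodes        # drop triples with a class as object
    }


triples = {
    ("cd:moby_dick", "rel:P674", "cd:ishmael"),     # instance-to-instance relation
    ("cd:moby_dick", "rdf:type", "onto:Book"),      # schema predicate
    ("cd:ishmael", "rel:relatedTo", "onto:Person"), # touches a class node
}
facts = fact_triples(triples, class_nodes={"onto:Book", "onto:Person"})
```

The relation IRI `rel:P674` appears only in predicate position, so it does not disqualify the first triple.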

**Standalone CLI:**

@@ -178,6 +213,9 @@ match-dirs \
| `400` | Invalid parameters (e.g. missing fixed ontology id) |
| `409` | Vector store unavailable when vector ontology mode requested |
| `500` | Processing or store errors |
| `503` | Health check: LLM not initialized or health probe failure |

When `MAX_CONCURRENT_PROCESSES` is set, additional `/process` and `/process_unit` requests **wait** until a handler slot is free (they are not rejected with 503).
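
The wait-for-a-slot behaviour can be illustrated with a plain `asyncio.Semaphore`. This is a sketch under stated assumptions, not the server's implementation: `run_with_cap` and `handler` are hypothetical names, and the short sleep stands in for pipeline work.

```python
import asyncio


async def run_with_cap(n_requests, cap):
    """Run n_requests fake handlers with at most `cap` active at once."""
    slots = asyncio.Semaphore(cap)
    active = 0
    peak = 0

    async def handler(i):
        nonlocal active, peak
        async with slots:              # extra requests wait here; none are rejected
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for document processing
            active -= 1
        return i

    results = await asyncio.gather(*(handler(i) for i in range(n_requests)))
    return results, peak


results, peak = asyncio.run(run_with_cap(n_requests=10, cap=4))
```

Every request eventually completes, and the number of concurrently active handlers never exceeds the cap.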

Vector mode unavailable:

@@ -193,5 +231,7 @@ Vector mode unavailable:
## Related

- [Configuration](configuration.md) — server and tool settings
- [LLM Caching](llm_caching.md) — disk cache, in-flight limits, batch pre-warming
- [User Instructions](user_instructions.md) — guiding extraction
- [Workflow](workflow.md) — what happens inside `/process`
- [Structured documents](concepts.md#structured-documents-optional) — section tagging and summarization
31 changes: 28 additions & 3 deletions docs/user_guide/concepts.md
@@ -31,6 +31,30 @@ OntoCast uses **pyoxigraph** for RDF 1.2 quoted-triple syntax and separates prov

See [Workflow](workflow.md#4-ontology-reduce-document-level).

## Structured documents (optional)

For papers and other heading-structured Markdown text, `/process` and `ontocast --input-path` accept optional parameters. When both `target_sections` and `summarize_sections` are omitted, the pipeline stays `convert → chunk → extract` with no extra graph nodes.

### Section tagging and section-aligned chunks

1. **Tag Sections** (when `target_sections` or `summarize_sections` is set) scans converted text for academic-style headings (`introduction`, `methods`, `results`, `discussion`, `conclusion`, `future_work`, `limitations`, `related_work`, `background`, and numbered variants).
2. **Chunk** still uses the semantic `ChunkerTool`; each content unit then gets a `section_label` by **maximum character-span overlap** with detected section ranges (section-aligned labeling, not a separate chunker mode).
3. **`target_sections`** drops units whose label is not in the allowlist (case-insensitive).

Recognized labels match normalized heading text (underscore form), e.g. `results`, `future_work`.
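
The labeling and filtering steps above can be sketched as follows. This is a minimal illustration assuming half-open character spans `(start, end)` and units represented as dictionaries; `label_chunk` and `filter_units` are hypothetical helpers, not functions from the OntoCast codebase.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of characters shared by two half-open spans."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_chunk(chunk_span, sections):
    """Return the label of the section that overlaps the chunk most, or None."""
    best_label, best_overlap = None, 0
    for label, start, end in sections:
        shared = overlap(chunk_span[0], chunk_span[1], start, end)
        if shared > best_overlap:
            best_label, best_overlap = label, shared
    return best_label


def filter_units(units, target_sections):
    """Keep units whose label is in the allowlist (case-insensitive)."""
    allow = {name.lower() for name in target_sections}
    return [u for u in units if (u["section_label"] or "").lower() in allow]


sections = [("introduction", 0, 100), ("results", 100, 300)]
units = [
    {"span": (0, 90), "section_label": label_chunk((0, 90), sections)},
    {"span": (80, 200), "section_label": label_chunk((80, 200), sections)},
    {"span": (400, 500), "section_label": label_chunk((400, 500), sections)},
]
kept = filter_units(units, ["Results"])
```

A chunk that straddles two sections takes the label of the section covering more of its characters, and a chunk outside every detected range stays unlabeled and is dropped by any allowlist.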

### Optional summarization

When `summarize_sections` is present (including empty or `*` for all units), the **Summarize Chunks** node runs an LLM pass per selected unit (bounded by `PARALLEL_WORKERS`). Summaries are stored on `ContentUnit.summary`; render and critic agents read `extraction_text`, which prefers the summary over the raw chunk.

| Parameter | Default | Effect |
|-----------|---------|--------|
| `target_sections` | omitted | Enable tagging; keep only listed sections (e.g. `results,methods`) |
| `summarize_sections` | omitted | Enable tagging + summarization node; omit to skip summaries. `*` or empty = all chunks |
| `summary_max_sentences` | `5` | Max sentences per summary when summarization runs |

Section lists accept comma-separated values or a JSON array in query, form, or JSON body fields.
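
One way to accept both input shapes is sketched below. It is an illustration only: `parse_section_list` is a hypothetical helper, and lowercasing the entries is an assumption based on the case-insensitive allowlist described above.

```python
import json


def parse_section_list(raw):
    """Parse a section list given as a list, a JSON array string, or comma-separated text.

    Returns None when the parameter is omitted and [] when it is present but empty.
    """
    if raw is None:
        return None
    if isinstance(raw, (list, tuple)):
        items = list(raw)
    else:
        text = raw.strip()
        # A leading bracket is treated as a JSON array; anything else as CSV.
        items = json.loads(text) if text.startswith("[") else text.split(",")
    return [str(item).strip().lower() for item in items if str(item).strip()]
```

Keeping `None` distinct from `[]` matters for `summarize_sections`, where an omitted value skips summarization but an empty value selects all chunks.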

## Parallel Map/Reduce

Document processing uses a **parallel map/reduce** architecture:
@@ -96,11 +120,12 @@ Details: [Tenancy](tenancy.md).

## Budget Tracking

- **LLM Statistics**: API calls, characters sent/received
- **LLM Statistics**: API calls, characters sent/received; optional token counts when the provider reports usage metadata
- **Cache hits**: Disk-cache hits increment `cache_hits` and character totals but **not** `calls_count` (no provider tokens are consumed)
- **Triple Metrics**: Ontology and facts triples per operation
- **Summary Reports**: Logged at end of processing:
```
LLM: X calls, Y sent, Z received | Triples: A ontology, B facts
LLM: X calls, Y sent, Z received, N cache hits | Triples: A ontology, B facts
```
- **BudgetTracker** lives on `AgentState` and per-unit states; merged at reduce stages
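
The accounting rule for cache hits can be sketched with a small stand-in class. `MiniBudget` is hypothetical and much simpler than the real `BudgetTracker`; its field names other than `calls_count` and `cache_hits` are assumptions, and the summary covers only the LLM part of the log line.

```python
from dataclasses import dataclass


@dataclass
class MiniBudget:
    """Hypothetical stand-in for a budget tracker; not OntoCast's BudgetTracker."""
    calls_count: int = 0
    cache_hits: int = 0
    chars_sent: int = 0
    chars_received: int = 0

    def record(self, prompt, response, from_cache):
        # Characters are counted whether or not the provider was called.
        self.chars_sent += len(prompt)
        self.chars_received += len(response)
        if from_cache:
            self.cache_hits += 1    # served from disk cache: not a provider call
        else:
            self.calls_count += 1

    def summary(self):
        line = (f"LLM: {self.calls_count} calls, "
                f"{self.chars_sent} sent, {self.chars_received} received")
        if self.cache_hits:         # mention cache hits only when non-zero
            line += f", {self.cache_hits} cache hits"
        return line


budget = MiniBudget()
budget.record("abc", "de", from_cache=False)
budget.record("abc", "de", from_cache=True)
```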

@@ -114,7 +139,7 @@ Details: [Tenancy](tenancy.md).
| `UnitOntologyState` / `UnitFactsState` | Per-unit loop state |
| `ToolBox` | LLM, triple store, chunking, vector store, cache |
| `GraphUpdate` | Structured SPARQL operations from the LLM |
| `ContentUnit` | One chunk's ontology/facts outputs |
| `ContentUnit` | One chunk's text, optional `section_label` / `summary`, and ontology/facts outputs (`extraction_text` for LLM prompts) |

## Next Steps

32 changes: 32 additions & 0 deletions docs/user_guide/configuration.md
@@ -51,6 +51,14 @@ LLM_BASE_URL=http://localhost:11434 # optional (ollama; anthropic proxy URL)

OntoCast uses `LLM_API_KEY` for all cloud providers (not `ANTHROPIC_API_KEY` / `GOOGLE_API_KEY`).

**Disk cache and provider concurrency** (see [LLM Caching](llm_caching.md)):

```bash
LLM_CACHE_ENABLED=true # read/write disk cache (default true)
LLM_CACHE_READ_ONLY=false # use cache without writing new entries
LLM_MAX_INFLIGHT=16 # max concurrent provider requests (all documents)
```

```bash
# Anthropic Claude
LLM_PROVIDER=anthropic
@@ -79,6 +87,7 @@ PARALLEL_WORKERS=4
PARALLEL_FACTS_RETRIES=3
PARALLEL_ONTOLOGY_RETRIES=3
ENABLE_ONTOLOGY_CONSOLIDATION=false
# MAX_CONCURRENT_PROCESSES=4 # optional cap on simultaneous /process and /process_unit handlers
```

### Chunking
@@ -90,6 +99,27 @@ CHUNK_MIN_SIZE=3000
CHUNK_MAX_SIZE=12000
```

Semantic chunking is configured here. **Section-aligned labels** and filtering are not chunker settings: they run when `/process` or CLI file mode passes `target_sections` and/or `summarize_sections` (see [Structured documents](concepts.md#structured-documents-optional)).

### Structured documents (per request)

These settings have no environment variables. Pass them per request on `POST /process` (query string, multipart form, or JSON body) or in CLI batch mode:

| Parameter | CLI flag | Description |
|-----------|----------|-------------|
| `target_sections` | `--target-sections` | Comma-separated or JSON list; enables tagging and keeps only these sections |
| `summarize_sections` | `--summarize-sections` | Enables tagging + summarization; `*` or empty = all chunks |
| `summary_max_sentences` | `--summary-max-sentences` | Max sentences per summary (default `5`) |

```bash
ontocast --input-path ./papers/ \
--target-sections results,methods \
--summarize-sections results \
--summary-max-sentences 5
```

Details: [API Endpoints](api.md#post-process), [Workflow](workflow.md#2-chunking-and-optional-structured-preprocessing).

### Triple Stores

```bash
@@ -247,6 +277,8 @@ Entity alignment and evaluation endpoints are documented in [API Endpoints](api.
- `MAX_VISITS` is supported as an alias for `max_visits_per_node`.
- `RECURSION_LIMIT` was renamed to `BASE_RECURSION_LIMIT`.
- `WEB_SEARCH_ALLOWED_DOMAINS` and `WEB_SEARCH_BLOCKED_DOMAINS` accept comma-separated values.
- `LLM_CACHE_ENABLED` and `LLM_CACHE_READ_ONLY` control disk cache read/write behavior.
- `LLM_MAX_INFLIGHT` must be ≥ 1; `MAX_CONCURRENT_PROCESSES` must be ≥ 1 when set.

## Recommended Workflow
