diff --git a/CHANGELOG.md b/CHANGELOG.md index 2fe54fee..cee3c5a1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,37 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [1.7.30] + +### Added + +- **`tolerate_transform_errors`** on **`ResourceConfig`** (default **`true`**) — a failing transform step sets its declared output fields to **`None`**, records a **`failure_kind=transform`** row in the doc error sink, and the rest of the resource pipeline (vertices, edges, later transforms) continues for that document. Set **`tolerate_transform_errors: false`** to fail fast on transform exceptions. + +### Changed + +- **`VertexActor` + `from_doc`** — transform-buffer projection is selective: only **`TransformPayload`** entries whose **`named`** keys cover the **`from_doc`** source fields are consumed, so dressed or pivot outputs for other vertex types are not stolen. Dressed dict payloads (`__transformed_value#*`) are handled consistently with passthrough from the merged observation doc. +- **Blank vertices in `VertexConfig`** — mark placeholder types with **`blank: true`** on each **`Vertex`** (identity defaults to **`id`**). **`VertexConfig.blank_vertices`** is now a derived name list, not a separate manifest field. Runtime **`ResourceRuntime`** scopes **`VertexConfig`** to vertices referenced by the resource pipeline only; unreferenced blank types are no longer injected automatically. +- **Ingestion contract layout** — declarative **`ResourceConfig`** lives under **`graflo.architecture.contract.ingestion`**; schema-bound execution is **`ResourceRuntime`** / **`build_resource_runtime`** under **`graflo.architecture.contract.runtime`**. **`Resource`** remains an internal alias for **`ResourceConfig`**. 
+ +### Breaking + +- **Top-level `blank_vertices` on `vertex_config`** — no longer read from manifests; set **`blank: true`** on the corresponding **`vertices`** entries instead (silent ignore under `extra="ignore"` if the old key is left in place). +- **Runtime blank vertex scope** — blank vertex types must appear in the resource pipeline (or edge inference selectors) to be present in the per-resource runtime **`VertexConfig`**; relying on schema-wide blank placeholders without a matching actor step will not add them at cast time. +- **Imports** — prefer **`ResourceConfig`** from **`graflo.architecture.contract`** (or **`graflo.architecture.contract.ingestion`**); **`graflo.architecture.contract.declarations.resource`** is not the canonical module path. + +### Documentation + +- **[Document cast errors](docs/concepts/ingestion_doc_errors.md)** — **`tolerate_transform_errors`** and transform failure records. +- **[Core components](docs/concepts/core_components.md)** — **`ResourceConfig`** / **`ResourceRuntime`**, per-vertex **`blank`**, **`from_doc`** with dressed transforms, identity defaults. +- **[Architecture diagrams](docs/concepts/architecture_diagrams.md)** — contract and blank-vertex model aligned with 1.7.30. +- **[Creating a manifest](docs/getting_started/creating_manifest.md)** — **`tolerate_transform_errors`** and blank vertex YAML. + +## [1.7.29] + +### Added + +- **Empty-identity filter on cast batches** — after resource casting, **`Caster`** can drop vertex docs and edge tuples whose schema identity fields are all missing, `null`, or `""` before **`DBWriter`** (identity rules from **`VertexConfig`**, not **`GraphContainer`**). Controlled by **`IngestionParams.drop_empty_identity_docs`** (default **`true`**). Blank vertex collections are exempt. 
+ ## [1.7.27] ### Added @@ -12,7 +43,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`ColumnTimeFilter`** — shared pandas-like time window on a single column (`column`, optional `start` / `end`, optional `interval` as a **`pandas.Timedelta`** string such as `"7D"` or `"2h"` for day/hour windows, optional `not_equals`, optional `start_inclusive` / `end_inclusive`). Rendered to SQL via **`FilterExpression`** (same path as other pushdown filters). Calendar-style offsets (for example month arithmetic) are not supported when `pandas.Timedelta` rejects the string; use explicit `start` / `end` ISO bounds instead. - **`FileConnector.time_filter`** and **`TableConnector.time_filter`** — canonical field replacing duplicated `date_field` / `date_filter` / `date_range_*` fields on the wire. - **Bindings — runtime connector patches**: **`ConnectorUpdate`**, **`Bindings.apply_connector_update`**, and **`Bindings.replace_connector`** so defining-field changes re-hash and reindex correctly while preserving **`conn_proxy`** wiring. Patches are applied **after** manifest load (not stored on `GraphManifest`). -- **Empty-identity filter on cast batches** — after resource casting, **`Caster`** can drop vertex docs and edge tuples whose schema identity fields are all missing, `null`, or `""` before **`DBWriter`** (identity rules from **`VertexConfig`**, not **`GraphContainer`**). Controlled by **`IngestionParams.drop_empty_identity_docs`** (default **`true`**). Blank vertex collections are exempt. 
### Breaking diff --git a/docs/concepts/architecture_diagrams.md b/docs/concepts/architecture_diagrams.md index 3abe7160..da1e8ad6 100644 --- a/docs/concepts/architecture_diagrams.md +++ b/docs/concepts/architecture_diagrams.md @@ -122,10 +122,10 @@ classDiagram } class IngestionModel { - +resources: list~Resource~ + +resources: list~ResourceConfig~ +transforms: list~ProtoTransform~ +finish_init(core_schema) - +fetch_resource(name) Resource + +fetch_resource_config(name) ResourceConfig } class GraphMetadata { @@ -136,13 +136,15 @@ classDiagram class VertexConfig { +vertices: list~Vertex~ - +blank_vertices: list~Vertex~ + +identity_from_all_properties: bool + +blank_vertices: list~str~ } class Vertex { +name: str +identity: list~str~ +properties: list~Field~ + +blank: bool +filters: FilterExpression? } @@ -165,11 +167,16 @@ classDiagram +filters: FilterExpression? } - class Resource { + class ResourceConfig { +name: str - +root: ActorWrapper + +pipeline: list~dict~ + +tolerate_transform_errors: bool + } + + class ResourceRuntime { + +config: ResourceConfig + +vertex_config: VertexConfig +executor: ActorExecutor - +finish_init(vertex_config, edge_config, transforms) } class ActorWrapper { @@ -224,7 +231,7 @@ classDiagram Schema *-- CoreSchema : core_schema CoreSchema *-- VertexConfig : vertex_config CoreSchema *-- EdgeConfig : edge_config - IngestionModel *-- "0..*" Resource : resources + IngestionModel *-- "0..*" ResourceConfig : resources IngestionModel *-- "0..*" ProtoTransform : transforms VertexConfig *-- "0..*" Vertex : vertices @@ -235,8 +242,9 @@ classDiagram Edge *-- "0..*" Field : properties Edge --> FilterExpression : filters - Resource *-- ActorWrapper : root - Resource *-- ActorExecutor : runtime orchestration + ResourceRuntime *-- ResourceConfig : config + ResourceRuntime *-- ActorWrapper : root + ResourceRuntime *-- ActorExecutor : runtime orchestration ActorWrapper --> Actor : actor ActorExecutor ..> ExtractionContext : produces ActorExecutor ..> 
AssemblyContext : consumes @@ -378,5 +386,5 @@ These are the two key abstractions that decouple *data retrieval* from *graph tr - **DataSources** (`AbstractDataSource` subclasses) — handle *where* and *how* data is read. Each carries a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). Many DataSources can bind to the same Resource by name via the `DataSourceRegistry`. -- **Resources** (`Resource`) — handle *what* the data becomes in the LPG. Each Resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint. +- **Resources** (`ResourceConfig` → `ResourceRuntime`) — handle *what* the data becomes in the LPG. Each resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint. - Optional **`drop_trivial_input_fields`** (default `false` on the model): when `true`, each record is preprocessed by dropping **top-level** keys whose value is `null` or the empty string `""` before actors run. This trims sparse wide rows (many unused columns) without extra transforms; nested dicts and lists are not walked. diff --git a/docs/concepts/core_components.md b/docs/concepts/core_components.md index 81ab8aa2..6685c192 100644 --- a/docs/concepts/core_components.md +++ b/docs/concepts/core_components.md @@ -83,11 +83,14 @@ A `Vertex` describes vertices and their logical identity. 
It supports: - If one duplicate is typed and the other is untyped, the typed definition wins - Conflicting non-null types for the same field name are rejected - Filtering conditions -- Optional blank vertex configuration +- **`blank: true`** — placeholder vertex with no natural key; identity defaults to **`id`** when omitted -Identity defaults are strict by default at schema level: -- `VertexConfig.identity_from_all_properties: false` (default) do not require explicit vertex `identity`, defaults to all properties -- `VertexConfig.identity_from_all_properties: false` disables compatibility fallback where missing identity uses all property names +Identity defaults at schema level (`VertexConfig`): + +- **`identity_from_all_properties: true`** (default) — vertices without explicit **`identity`** use all **`properties`** names as the logical key. +- **`identity_from_all_properties: false`** — each non-blank vertex must declare **`identity`** explicitly; blank vertices still default to **`id`**. + +**Blank vertices:** set **`blank: true`** on the vertex entry under **`schema.graph.vertex_config.vertices`**. **`VertexConfig.blank_vertices`** is a derived list of names (not a separate YAML field). At runtime, **`ResourceRuntime`** keeps only vertex types referenced by that resource’s pipeline (and edge-inference selectors); blank types that are declared in the schema but not used by the resource are not injected automatically—include a **`vertex`** (or edge) step when the placeholder must be populated. ### Edge An `Edge` describes edges and their logical identities. It allows: @@ -166,19 +169,20 @@ An `AbstractDataSource` subclass defines where data comes from and how it is ret Data sources handle retrieval only. They bind to Resources by name via the `DataSourceRegistry`, so the same `Resource` can ingest data from multiple sources without modification. -### Resource -A `Resource` is the central abstraction that bridges data sources and the graph schema. 
Each Resource defines a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements: +### Resource (`ResourceConfig` / `ResourceRuntime`) + +Ingestion resources split into two layers: -- How data structures map to vertices and edges -- What transformations to apply -- The actor pipeline for processing documents +- **`ResourceConfig`** — declarative contract in **`ingestion_model.resources`** (YAML/Python): pipeline steps, encoding, type casters, edge-inference flags, **`tolerate_transform_errors`**, and related options. Serialized in manifests; validated by **`IngestionModel`**. +- **`ResourceRuntime`** — schema-bound executor built via **`build_resource_runtime`**: filtered **`VertexConfig`**, bound transforms, and **`ActorExecutor`** for document casting. -Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, a SQL table, or a SPARQL endpoint. +The name **`Resource`** in manifests and docs usually means **`ResourceConfig`**. Data sources bind to resources by name, so the same pipeline applies whether data arrives from a file, API, SQL table, or SPARQL endpoint. -Resource-level edge inference controls: +Resource-level controls: - **`infer_edges`**: Global toggle for inferred edge emission during assembly (default: `true`). - **`infer_edge_only`**: Allow-list of inferred edges (`source`, `target`, optional `relation`). - **`infer_edge_except`**: Deny-list of inferred edges (`source`, `target`, optional `relation`). +- **`tolerate_transform_errors`** (default **`true`**): on transform failure, null declared outputs and continue the pipeline; see [Document cast errors](ingestion_doc_errors.md). - `infer_edge_only` and `infer_edge_except` are mutually exclusive and validated against declared schema edges. - These controls apply to inferred edges only; explicit edge actors in the pipeline are still emitted. 
- **Auto-exclusion**: When a resource pipeline contains any EdgeActor for edges of type `(source, target)`, `(source, target, None)` is automatically added to `infer_edge_except` for that resource, so inferred edges do not duplicate edges produced by explicit edge actors. @@ -192,7 +196,7 @@ An `Actor` describes how the current level of the document should be mapped/tran - `TransformActor`: Applies data transformations - `VertexActor`: Creates vertices from the current level. Key options: - **`role`** (optional): named accumulator slot. When set the vertex is stored at `lindex.extend((role, 0))` instead of bare `lindex`, so multiple vertices of the same type in one row (e.g. `role: self`, `role: parent`, `role: child`) occupy distinct slots and can be addressed individually by a downstream edge step. - - **`from`**: rename map `{vertex_field: doc_field}`. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc automatically (passthrough). + - **`from`** (`from_doc`): rename map `{vertex_field: doc_field}`. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc and transform buffer automatically (passthrough). When multiple **`TransformPayload`** entries share a location, **`from_doc`** consumes only payloads whose **`named`** keys include all mapped source fields—so dressed metrics or pivot rows for other vertex types are left for their own **`vertex`** steps. - **`keep_fields`**: restrict passthrough to this field subset. Use on role-vertex steps to prevent shared row columns from leaking into placeholder vertices that only carry an ID. - `EdgeActor`: Creates edges between vertices. Operates in three modes: - **Static mode** (`from`/`to` set on both sides): vertex types declared at config time. 
diff --git a/docs/concepts/features_and_practices.md b/docs/concepts/features_and_practices.md index 72f512ea..4d57a400 100644 --- a/docs/concepts/features_and_practices.md +++ b/docs/concepts/features_and_practices.md @@ -106,7 +106,7 @@ Schema comparison gives you a predictable transition path between versions. Inst ## Best Practices 1. Use compound identity fields for natural keys, and **`schema.db_profile`** secondary indexes for query performance -2. Leverage blank vertices for complex relationship modeling +2. Leverage blank vertices (`blank: true` on the vertex definition) for complex relationship modeling; include them in the resource pipeline when they must be populated at cast time 3. Define reusable transforms in **`ingestion_model.transforms`** and reference them from resource steps 4. Configure appropriate batch sizes based on your data volume 5. Enable parallel processing for large datasets diff --git a/docs/concepts/index.md b/docs/concepts/index.md index d040948c..3d0d57b4 100644 --- a/docs/concepts/index.md +++ b/docs/concepts/index.md @@ -96,10 +96,10 @@ flowchart LR - **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints). Multiple connectors may attach to the same ingestion resource name; optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** by **connector `name` or `hash`** (not by resource name). The `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free. - **DataSources** (`AbstractDataSource` subclasses) handle *how* to read data in batches. Each carries a `DataSourceType` and is registered in the `DataSourceRegistry`. -- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. 
Optional **`drop_trivial_input_fields`: `true`** removes top-level keys whose value is `null` or `""` **before** actors run (shallow only; `0` and `false` stay). **TigerGraph** physical defaults for missing attributes belong in **`schema.db_profile.default_property_values`** (GSQL `DEFAULT` at DDL time), not in the covariant `GraphContainer` assembly path. +- **Resources** define *what* to extract — each **`ResourceConfig`** (manifest `ingestion_model.resources`) is a reusable actor pipeline (descend → transform → vertex → edge) executed at cast time by **`ResourceRuntime`**. Optional **`drop_trivial_input_fields`: `true`** removes top-level keys whose value is `null` or `""` **before** actors run (shallow only; `0` and `false` stay). Optional **`tolerate_transform_errors`: `true`** (default) continues the pipeline when a transform step fails. **TigerGraph** physical defaults for missing attributes belong in **`schema.db_profile.default_property_values`** (GSQL `DEFAULT` at DDL time), not in the covariant `GraphContainer` assembly path. - **GraphContainer** (covariant graph representation) collects the resulting vertices and edges in a database-independent format. - **DBWriter** pushes the graph data into the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph). -- **Document cast errors** — when a single source document fails inside a resource, **`IngestionParams.on_doc_error`** chooses skip vs fail-the-batch; optional **gzip JSONL** persistence uses **`doc_error_sink_path`** (CLI **`ingest --doc-error-sink`**). Details: [Document cast errors and doc error sink](ingestion_doc_errors.md). +- **Document cast errors** — when a single source document fails inside a resource, **`IngestionParams.on_doc_error`** chooses skip vs fail-the-batch; optional **gzip JSONL** persistence uses **`doc_error_sink_path`** (CLI **`ingest --doc-error-sink`**). 
Per-resource **`tolerate_transform_errors`** (default **`true`**) lets a single transform step fail without aborting the rest of the pipeline for that document. Details: [Document cast errors and doc error sink](ingestion_doc_errors.md). ### Minimal canonical config contract diff --git a/docs/concepts/ingestion_doc_errors.md b/docs/concepts/ingestion_doc_errors.md index 827d947d..417c638a 100644 --- a/docs/concepts/ingestion_doc_errors.md +++ b/docs/concepts/ingestion_doc_errors.md @@ -11,6 +11,8 @@ When a **resource** maps a **source document** (one item from a batch: a JSON ob Set **`IngestionParams.doc_error_sink_path`** to a filesystem path (convention: **`*.jsonl.gz`**). The caster appends **gzip-compressed JSONL**: each line is one JSON object matching **`DocCastFailure`** (resource name, **`doc_index`** within the batch, exception type, message, traceback, optional document preview). Writes are serialized with an internal async lock so concurrent batches do not corrupt the file. +Records use **`failure_kind`**: **`document`** (default) when the whole source document failed, or **`transform`** when a single transform step failed but the document was still ingested (see below). Transform rows also include **`location_path`**, **`transform_label`**, and **`nulled_fields`**. + Each append may add a new gzip member to the file (normal for log-style gzip). Tools such as **`zcat`**, **`gzip -dc`**, or **`pigz -dc`** stream all concatenated members, for example: ```bash @@ -23,7 +25,13 @@ If **`doc_error_sink_path`** is **`None`**, skipped failures are emitted as stru ## Optional caps -- **`max_doc_errors`**: if the **total** number of persisted document failures across the run exceeds this limit, ingestion raises **`DocErrorBudgetExceeded`** (after writing the failures that pushed over the limit). Use this to stop a bad source early. 
+- **`max_doc_errors`**: if the **total** number of persisted failure records across the run (document **and** transform) exceeds this limit, ingestion raises **`DocErrorBudgetExceeded`** (after writing the failures that pushed the count over the limit). Use this to stop a bad source early. + +## Per-transform tolerance: `tolerate_transform_errors` + +On each ingestion resource (**`ResourceConfig`** in YAML under **`ingestion_model.resources`**), **`tolerate_transform_errors`** defaults to **`True`**. When enabled, a failing transform step sets its declared output fields to **`None`**, records a **`failure_kind=transform`** row in the doc error sink, and the rest of the pipeline (vertices, edges, later transforms) still runs for that document. Set **`tolerate_transform_errors: false`** on a resource to restore fail-fast behavior for transform exceptions (the whole document is lost unless **`on_doc_error=skip`** at the caster). + +Transform failures are persisted through the same **`doc_error_sink_path`** and count toward **`max_doc_errors`** just as full document failures do. With **`on_doc_error=fail`**, tolerated transform errors do not fail the batch; only unhandled document-level exceptions do. - **`doc_error_preview_max_bytes`** and **`doc_error_preview_keys`**: bound the size and shape of the **`doc_preview`** field on **`DocCastFailure`** so logs and files stay readable and bounded. @@ -54,6 +62,18 @@ ingestion_params = IngestionParams( ) ``` +Per-resource transform tolerance in YAML: + +```yaml +ingestion_model: + resources: + - name: metrics + tolerate_transform_errors: true + apply: + - transform: {call: {use: parse_metric}} + - vertex: Metric +``` + ## Extensibility Additional sink types can implement the **`DocErrorSink`** protocol (**`async write_failures(failures)`**) and be wired from your own orchestration code; the built-in path is **`JsonlGzDocErrorSink`** behind **`doc_error_sink_path`**. 
diff --git a/docs/getting_started/creating_manifest.md b/docs/getting_started/creating_manifest.md index 68e5eb9f..aec93874 100644 --- a/docs/getting_started/creating_manifest.md +++ b/docs/getting_started/creating_manifest.md @@ -62,7 +62,7 @@ bindings: {} Defines the graph contract. - `metadata`: human-facing identity (`name`, optional `version`) -- `graph.vertex_config`: vertex types, **`properties`**, identity keys +- `graph.vertex_config`: vertex types, **`properties`**, identity keys; optional **`blank: true`** for placeholder vertices (auto **`id`** identity) - `graph.edge_config`: source/target relationships, optional `relation`, edge **`properties`**, `identities` - `db_profile`: DB-specific physical behavior (indexes, naming, **`default_property_values`** for TigerGraph GSQL `DEFAULT` on vertex/edge attributes, backend details) @@ -74,7 +74,9 @@ Defines ingestion behavior. - `resources`: named pipelines (`name`) with ordered actor steps - `transforms`: reusable named transforms as a **list** (each entry must define `name`) and referenced from resources via `transform.call.use` -- Optional per-resource flags include **`drop_trivial_input_fields`** (default `false`): when `true`, top-level keys whose value is `null` or `""` are removed **before** the actor pipeline runs. Only the top-level dict is filtered (nested structures are not recursed); numeric zero and boolean false are kept. Useful for sparse wide tables (CSV/SQL) without custom transforms. +- Optional per-resource flags include: + - **`drop_trivial_input_fields`** (default `false`): when `true`, top-level keys whose value is `null` or `""` are removed **before** the actor pipeline runs. Only the top-level dict is filtered (nested structures are not recursed); numeric zero and boolean false are kept. Useful for sparse wide tables (CSV/SQL) without custom transforms. 
+ - **`tolerate_transform_errors`** (default `true`): when `true`, a failing transform nulls its declared outputs and the pipeline continues; when `false`, transform exceptions fail the document (subject to caster **`on_doc_error`**). See [Document cast errors](../concepts/ingestion_doc_errors.md). **TigerGraph attribute defaults (schema / `db_profile`, not ingestion):** under `schema.db_profile`, optional **`default_property_values`** declares GSQL `DEFAULT` literals per logical vertex property and per logical edge type, for example: diff --git a/docs/guides/tigergraph_bulk_load.md b/docs/guides/tigergraph_bulk_load.md index 9eb7a75f..2c3d2789 100644 --- a/docs/guides/tigergraph_bulk_load.md +++ b/docs/guides/tigergraph_bulk_load.md @@ -50,7 +50,7 @@ Ingestion coordinates begin/finalize through the backend-agnostic **`BulkSession ## Limitations (current release) -- **`blank_vertices`** in the logical schema are rejected at `bulk_load_begin`. +- Vertices with **`blank: true`** (blank placeholders) in the logical schema are rejected at `bulk_load_begin`. - Resources with **`extra_weights`** (DB lookups during ingest) cannot use bulk for that resource; use REST ingest or remove extra weights for those resources. Upsert semantics differ from REST: native **LOAD** is oriented toward **append** semantics; plan idempotency and clears according to your operations model. diff --git a/docs/index.md b/docs/index.md index 4d4f2626..860c7639 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,7 +15,7 @@ It is a **Python package** and **Graph Schema & Transformation Language (GSTL)** - **One pipeline, several graph databases** — The same manifest targets ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph; `DatabaseProfile` and DB-aware types absorb naming, defaults, and indexing differences. - **Explicit identities** — Vertex identity fields and indexes back upserts so reloads merge on keys instead of blindly duplicating nodes. 
-- **Reusable ingestion** — `Resource` actor pipelines (including **vertex** / **vertex_router** / **edge** steps) bind to files, SQL, SPARQL/RDF, APIs, or in-memory batches via `Bindings` and the `DataSourceRegistry`. A single flat row can populate multiple same-type vertices in distinct named slots (`role`) and emit multiple edges in one `edge: links` step. +- **Reusable ingestion** — `ResourceConfig` actor pipelines (including **vertex** / **vertex_router** / **edge** steps) bind to files, SQL, SPARQL/RDF, APIs, or in-memory batches via `Bindings` and the `DataSourceRegistry`. A single flat row can populate multiple same-type vertices in distinct named slots (`role`) and emit multiple edges in one `edge: links` step. Per-resource **`tolerate_transform_errors`** (default on) keeps ingestion moving when an individual transform step fails. - **Manifest-first sanitization** — `Sanitizer` (backed by `graflo.architecture.evolution` **`SanitizeOp`**) normalizes schema identifiers (reserved words, TigerGraph relation/index constraints) and synchronizes related ingestion mappings via `sanitize_manifest(GraphManifest)`. `GraphEngine.infer_manifest(...)` applies it automatically; lower-level `SQLInferenceManager` does not—sanitize the manifest yourself when assembling contracts outside the engine. ### What’s in the manifest @@ -36,7 +36,7 @@ It is a **Python package** and **Graph Schema & Transformation Language (GSTL)** |-------|------|------| | **Logical graph schema** | Manifest `schema`: vertex/edge definitions, identities, typed **properties**, DB profile. Constrains pipeline output and projection; not a separate queue between steps. | `Schema`, `VertexConfig`, `EdgeConfig` (under `core_schema`). | | **Source instance** | Concrete input: file, SQL table, SPARQL endpoint, API payload, in-memory rows. | `AbstractDataSource` + `DataSourceType`. | -| **Resource** | Ordered actors; resources are looked up by name when sources are registered. 
| `Resource` in `IngestionModel`. | +| **Resource** | Ordered actors; resources are looked up by name when sources are registered. | `ResourceConfig` in `IngestionModel`; `ResourceRuntime` at cast time. | | **Covariant graph** (`GraphContainer`) | Batches of vertices/edges before load. | `GraphContainer`. | | **DB-aware projection** | Physical names, defaults, indexes for the target. | `Schema.resolve_db_aware()`, `VertexConfigDBAware`, `EdgeConfigDBAware`. | | **Graph DB** | Target LPG; each `DBType` has its own connector, orchestrated the same way. | `ConnectionManager`, `DBWriter`, per-backend `Connection`. | diff --git a/docs/reference/architecture/contract/bindings/column_time_filter.md b/docs/reference/architecture/contract/bindings/column_time_filter.md new file mode 100644 index 00000000..af96304e --- /dev/null +++ b/docs/reference/architecture/contract/bindings/column_time_filter.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.bindings.column_time_filter` + +::: graflo.architecture.contract.bindings.column_time_filter diff --git a/docs/reference/architecture/contract/declarations/__init__.md b/docs/reference/architecture/contract/declarations/__init__.md deleted file mode 100644 index d2981ef6..00000000 --- a/docs/reference/architecture/contract/declarations/__init__.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations` - -::: graflo.architecture.contract.declarations diff --git a/docs/reference/architecture/contract/declarations/edge_derivation_registry.md b/docs/reference/architecture/contract/declarations/edge_derivation_registry.md deleted file mode 100644 index d8d17a78..00000000 --- a/docs/reference/architecture/contract/declarations/edge_derivation_registry.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations.edge_derivation_registry` - -::: graflo.architecture.contract.declarations.edge_derivation_registry diff --git 
a/docs/reference/architecture/contract/declarations/ingestion_model/__init__.md b/docs/reference/architecture/contract/declarations/ingestion_model/__init__.md deleted file mode 100644 index e0587ec1..00000000 --- a/docs/reference/architecture/contract/declarations/ingestion_model/__init__.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations.ingestion_model` - -::: graflo.architecture.contract.declarations.ingestion_model diff --git a/docs/reference/architecture/contract/declarations/ingestion_model/model.md b/docs/reference/architecture/contract/declarations/ingestion_model/model.md deleted file mode 100644 index e528db5e..00000000 --- a/docs/reference/architecture/contract/declarations/ingestion_model/model.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations.ingestion_model.model` - -::: graflo.architecture.contract.declarations.ingestion_model.model diff --git a/docs/reference/architecture/contract/declarations/resource.md b/docs/reference/architecture/contract/declarations/resource.md deleted file mode 100644 index aaf1f305..00000000 --- a/docs/reference/architecture/contract/declarations/resource.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations.resource` - -::: graflo.architecture.contract.declarations.resource diff --git a/docs/reference/architecture/contract/declarations/transform.md b/docs/reference/architecture/contract/declarations/transform.md deleted file mode 100644 index 15167c41..00000000 --- a/docs/reference/architecture/contract/declarations/transform.md +++ /dev/null @@ -1,3 +0,0 @@ -# `graflo.architecture.contract.declarations.transform` - -::: graflo.architecture.contract.declarations.transform diff --git a/docs/reference/architecture/contract/ingestion/__init__.md b/docs/reference/architecture/contract/ingestion/__init__.md new file mode 100644 index 00000000..83cbf45c --- /dev/null +++ b/docs/reference/architecture/contract/ingestion/__init__.md @@ -0,0 +1,3 @@ 
+# `graflo.architecture.contract.ingestion` + +::: graflo.architecture.contract.ingestion diff --git a/docs/reference/architecture/contract/ingestion/model.md b/docs/reference/architecture/contract/ingestion/model.md new file mode 100644 index 00000000..431700db --- /dev/null +++ b/docs/reference/architecture/contract/ingestion/model.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.ingestion.model` + +::: graflo.architecture.contract.ingestion.model diff --git a/docs/reference/architecture/contract/ingestion/resource.md b/docs/reference/architecture/contract/ingestion/resource.md new file mode 100644 index 00000000..96ec5230 --- /dev/null +++ b/docs/reference/architecture/contract/ingestion/resource.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.ingestion.resource` + +::: graflo.architecture.contract.ingestion.resource diff --git a/docs/reference/architecture/contract/ingestion/transform.md b/docs/reference/architecture/contract/ingestion/transform.md new file mode 100644 index 00000000..12d80efa --- /dev/null +++ b/docs/reference/architecture/contract/ingestion/transform.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.ingestion.transform` + +::: graflo.architecture.contract.ingestion.transform diff --git a/docs/reference/architecture/contract/runtime/__init__.md b/docs/reference/architecture/contract/runtime/__init__.md new file mode 100644 index 00000000..9b9abc91 --- /dev/null +++ b/docs/reference/architecture/contract/runtime/__init__.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.runtime` + +::: graflo.architecture.contract.runtime diff --git a/docs/reference/architecture/contract/runtime/edge_derivation.md b/docs/reference/architecture/contract/runtime/edge_derivation.md new file mode 100644 index 00000000..01b76a47 --- /dev/null +++ b/docs/reference/architecture/contract/runtime/edge_derivation.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.runtime.edge_derivation` + +::: graflo.architecture.contract.runtime.edge_derivation diff --git 
a/docs/reference/architecture/contract/runtime/resource.md b/docs/reference/architecture/contract/runtime/resource.md new file mode 100644 index 00000000..06269553 --- /dev/null +++ b/docs/reference/architecture/contract/runtime/resource.md @@ -0,0 +1,3 @@ +# `graflo.architecture.contract.runtime.resource` + +::: graflo.architecture.contract.runtime.resource diff --git a/docs/reference/architecture/evolution/__init__.md b/docs/reference/architecture/evolution/__init__.md new file mode 100644 index 00000000..5e7ea584 --- /dev/null +++ b/docs/reference/architecture/evolution/__init__.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution` + +::: graflo.architecture.evolution diff --git a/docs/reference/architecture/evolution/apply.md b/docs/reference/architecture/evolution/apply.md new file mode 100644 index 00000000..cdc72b20 --- /dev/null +++ b/docs/reference/architecture/evolution/apply.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.apply` + +::: graflo.architecture.evolution.apply diff --git a/docs/reference/architecture/evolution/db_profile.md b/docs/reference/architecture/evolution/db_profile.md new file mode 100644 index 00000000..311c0629 --- /dev/null +++ b/docs/reference/architecture/evolution/db_profile.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.db_profile` + +::: graflo.architecture.evolution.db_profile diff --git a/docs/reference/architecture/evolution/merge_core.md b/docs/reference/architecture/evolution/merge_core.md new file mode 100644 index 00000000..804c93df --- /dev/null +++ b/docs/reference/architecture/evolution/merge_core.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.merge_core` + +::: graflo.architecture.evolution.merge_core diff --git a/docs/reference/architecture/evolution/ops.md b/docs/reference/architecture/evolution/ops.md new file mode 100644 index 00000000..10918dac --- /dev/null +++ b/docs/reference/architecture/evolution/ops.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.ops` + +::: 
graflo.architecture.evolution.ops diff --git a/docs/reference/architecture/evolution/rewrite.md b/docs/reference/architecture/evolution/rewrite.md new file mode 100644 index 00000000..b29ed496 --- /dev/null +++ b/docs/reference/architecture/evolution/rewrite.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.rewrite` + +::: graflo.architecture.evolution.rewrite diff --git a/docs/reference/architecture/evolution/sanitize.md b/docs/reference/architecture/evolution/sanitize.md new file mode 100644 index 00000000..3d8a64b3 --- /dev/null +++ b/docs/reference/architecture/evolution/sanitize.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.sanitize` + +::: graflo.architecture.evolution.sanitize diff --git a/docs/reference/architecture/evolution/version.md b/docs/reference/architecture/evolution/version.md new file mode 100644 index 00000000..5ee23ff1 --- /dev/null +++ b/docs/reference/architecture/evolution/version.md @@ -0,0 +1,3 @@ +# `graflo.architecture.evolution.version` + +::: graflo.architecture.evolution.version diff --git a/docs/reference/hq/document_caster.md b/docs/reference/hq/document_caster.md new file mode 100644 index 00000000..22dbb189 --- /dev/null +++ b/docs/reference/hq/document_caster.md @@ -0,0 +1,3 @@ +# `graflo.hq.document_caster` + +::: graflo.hq.document_caster diff --git a/docs/reference/util/casting.md b/docs/reference/util/casting.md new file mode 100644 index 00000000..5d4ff3b4 --- /dev/null +++ b/docs/reference/util/casting.md @@ -0,0 +1,3 @@ +# `graflo.util.casting` + +::: graflo.util.casting diff --git a/docs/reference/util/data_normalize.md b/docs/reference/util/data_normalize.md new file mode 100644 index 00000000..ebfa436a --- /dev/null +++ b/docs/reference/util/data_normalize.md @@ -0,0 +1,3 @@ +# `graflo.util.data_normalize` + +::: graflo.util.data_normalize diff --git a/graflo/architecture/contract/__init__.py b/graflo/architecture/contract/__init__.py index d46a7716..6dfad3dd 100644 --- 
a/graflo/architecture/contract/__init__.py +++ b/graflo/architecture/contract/__init__.py @@ -10,11 +10,14 @@ SparqlConnector, TableConnector, ) -from .declarations import ( +from .ingestion import ( IngestionModel, ProtoTransform, Resource, + ResourceConfig, + ResourceRuntime, Transform, + build_resource_runtime, ) from .manifest import GraphManifest @@ -28,7 +31,10 @@ "JoinClause", "ProtoTransform", "Resource", + "ResourceConfig", + "ResourceRuntime", "ResourceConnector", + "build_resource_runtime", "SparqlConnector", "TableConnector", "Transform", diff --git a/graflo/architecture/contract/bindings/__init__.py b/graflo/architecture/contract/bindings/__init__.py index 093fd464..760d9fcd 100644 --- a/graflo/architecture/contract/bindings/__init__.py +++ b/graflo/architecture/contract/bindings/__init__.py @@ -1,6 +1,12 @@ """Resource connectors and named binding collections.""" -from .core import Bindings, ResourceConnectorBinding, StagingProxyBinding +from .core import ( + Bindings, + BindingsConfig, + BindingsRegistry, + ResourceConnectorBinding, + StagingProxyBinding, +) from .column_time_filter import ColumnTimeFilter from .connectors import ( BoundSourceKind, @@ -14,6 +20,8 @@ __all__ = [ "Bindings", + "BindingsConfig", + "BindingsRegistry", "BoundSourceKind", "ColumnTimeFilter", "ConnectorUpdate", diff --git a/graflo/architecture/contract/bindings/core.py b/graflo/architecture/contract/bindings/core.py index d9a56da7..1b4db48a 100644 --- a/graflo/architecture/contract/bindings/core.py +++ b/graflo/architecture/contract/bindings/core.py @@ -47,8 +47,8 @@ class StagingProxyBinding(ConfigBaseModel): conn_proxy: str -class Bindings(ConfigBaseModel): - """Named resource connectors with explicit resource linkage.""" +class BindingsConfig(ConfigBaseModel): + """Declarative bindings contract (connectors and resource wiring).""" connectors: list[FileConnector | TableConnector | SparqlConnector] = Field( default_factory=list @@ -332,6 +332,42 @@ def 
get_conn_proxy_for_connector( """Return the mapped runtime proxy name for a given connector.""" return self._connector_to_conn_proxy.get(connector.hash) + def get_connectors_for_resource( + self, resource_name: str + ) -> list[TableConnector | FileConnector | SparqlConnector]: + """Return connectors bound to *resource_name*, in binding order (unique by hash).""" + result: list[TableConnector | FileConnector | SparqlConnector] = [] + for h in self._resource_to_connector_hashes.get(resource_name, []): + c = self._connectors_index.get(h) + if isinstance(c, (TableConnector, FileConnector, SparqlConnector)): + result.append(c) + return result + + @classmethod + def from_dict(cls, data: dict[str, Any] | list[Any]) -> Self: + if isinstance(data, list): + raise ValueError( + "Bindings.from_dict expects a mapping with 'connectors' and optional " + "'resource_connector'. List-style connector payloads are not supported." + ) + legacy_keys = { + "postgres_connections", + "table_connectors", + "file_connectors", + "sparql_connectors", + } + found_legacy = sorted(k for k in legacy_keys if k in data) + if found_legacy: + raise ValueError( + "Legacy Bindings init keys are not supported. " + f"Unsupported keys: {', '.join(found_legacy)}." + ) + return cls.model_validate(data) + + +class BindingsRegistry(BindingsConfig): + """Mutable bindings registry for programmatic connector updates.""" + def bind_connector_to_conn_proxy( self, connector: TableConnector | FileConnector | SparqlConnector, @@ -373,27 +409,6 @@ def bind_connector_to_conn_proxy( self._rebuild_connector_to_conn_proxy() - @classmethod - def from_dict(cls, data: dict[str, Any] | list[Any]) -> Self: - if isinstance(data, list): - raise ValueError( - "Bindings.from_dict expects a mapping with 'connectors' and optional " - "'resource_connector'. List-style connector payloads are not supported." 
- ) - legacy_keys = { - "postgres_connections", - "table_connectors", - "file_connectors", - "sparql_connectors", - } - found_legacy = sorted(k for k in legacy_keys if k in data) - if found_legacy: - raise ValueError( - "Legacy Bindings init keys are not supported. " - f"Unsupported keys: {', '.join(found_legacy)}." - ) - return cls.model_validate(data) - def apply_connector_update(self, update: ConnectorUpdate) -> None: """Patch a connector in-place in this binding (re-hashes and reindexes). @@ -506,13 +521,5 @@ def bind_resource( # Keep the public contract field in sync for serialization / downstream. self.resource_connector = list(self._resource_connector_typed) - def get_connectors_for_resource( - self, resource_name: str - ) -> list[TableConnector | FileConnector | SparqlConnector]: - """Return connectors bound to *resource_name*, in binding order (unique by hash).""" - result: list[TableConnector | FileConnector | SparqlConnector] = [] - for h in self._resource_to_connector_hashes.get(resource_name, []): - c = self._connectors_index.get(h) - if isinstance(c, (TableConnector, FileConnector, SparqlConnector)): - result.append(c) - return result + +Bindings = BindingsRegistry diff --git a/graflo/architecture/contract/declarations/__init__.py b/graflo/architecture/contract/declarations/__init__.py deleted file mode 100644 index c039dc3d..00000000 --- a/graflo/architecture/contract/declarations/__init__.py +++ /dev/null @@ -1,22 +0,0 @@ -"""Ingestion declarations: resources, transforms, and ingestion model.""" - -from .ingestion_model import IngestionModel -from .resource import EdgeInferSpec, Resource -from .transform import ( - DressConfig, - KeySelectionConfig, - ProtoTransform, - Transform, - TransformException, -) - -__all__ = [ - "DressConfig", - "EdgeInferSpec", - "IngestionModel", - "KeySelectionConfig", - "ProtoTransform", - "Resource", - "Transform", - "TransformException", -] diff --git 
a/graflo/architecture/contract/declarations/ingestion_model/__init__.py b/graflo/architecture/contract/declarations/ingestion_model/__init__.py deleted file mode 100644 index 484f3625..00000000 --- a/graflo/architecture/contract/declarations/ingestion_model/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .model import IngestionModel - -__all__ = ["IngestionModel"] diff --git a/graflo/architecture/contract/declarations/resource.py b/graflo/architecture/contract/declarations/resource.py deleted file mode 100644 index 59f5518a..00000000 --- a/graflo/architecture/contract/declarations/resource.py +++ /dev/null @@ -1,519 +0,0 @@ -"""Resource management and processing for graph databases. - -This module provides the core resource handling functionality for graph databases. -It defines how data resources are processed, transformed, and mapped to graph -structures through a system of actors and transformations. - -Key Components: - - Resource: Main class for resource processing and transformation - - ActorWrapper: Wrapper for processing actors - - ActionContext: Context for processing actions - -The resource system allows for: - - Data encoding and transformation - - Vertex and edge creation - - Weight management - - Collection merging - - Type casting and validation - - Dynamic vertex-type routing via VertexRouterActor in the pipeline - -Example: - >>> resource = Resource( - ... name="users", - ... pipeline=[{"vertex": "user"}, {"edge": {"from": "user", "to": "user"}}], - ... encoding=EncodingType.UTF_8 - ... 
) - >>> result = resource(doc) -""" - -from __future__ import annotations - -import builtins -import logging -from collections import defaultdict -from typing import TYPE_CHECKING, Any, Callable - -from pydantic import AliasChoices, Field as PydanticField, PrivateAttr, model_validator - -from graflo.architecture.base import ConfigBaseModel -from graflo.architecture.graph_types import ( - EdgeId, - EncodingType, - GraphEntity, - Weight, -) -from graflo.architecture.schema.edge import Edge, EdgeConfig -from graflo.architecture.schema.vertex import VertexConfig -from graflo.onto import DBType - -from .edge_derivation_registry import EdgeDerivationRegistry -from .transform import ProtoTransform - -if TYPE_CHECKING: - from graflo.architecture.pipeline.runtime.actor import ( - ActorWrapper, - ) - from graflo.architecture.pipeline.runtime.executor import ActorExecutor - -logger = logging.getLogger(__name__) - -_SAFE_TYPE_CASTERS: dict[str, Callable[..., Any]] = { - "str": str, - "int": int, - "float": float, - "bool": bool, - "bytes": bytes, - "list": list, - "dict": dict, - "tuple": tuple, - "set": set, -} - - -def _resolve_type_caster(name: str) -> Callable[..., Any] | None: - """Resolve a type caster by name from a strict allowlist.""" - if not isinstance(name, str): - return None - candidate = _SAFE_TYPE_CASTERS.get(name) - if candidate is not None: - return candidate - # Support "builtins.int" style entries without evaluating expressions. - if "." 
in name: - module_name, attr_name = name.split(".", 1) - if module_name == "builtins": - builtin_attr = getattr(builtins, attr_name, None) - if callable(builtin_attr) and attr_name in _SAFE_TYPE_CASTERS: - return _SAFE_TYPE_CASTERS[attr_name] - return None - - -def _strip_trivial_top_level_fields(doc: dict[str, Any]) -> dict[str, Any]: - """Return a shallow copy of *doc* without None or empty-string values.""" - return {k: v for k, v in doc.items() if v is not None and v != ""} - - -def _filter_vertex_config_by_allowed( - vertex_config: VertexConfig, - *, - allowed_vertex_names: set[str] | None, -) -> VertexConfig: - """Derive a filtered VertexConfig for runtime actor execution. - - This intentionally filters only the vertex collections present in - *allowed_vertex_names*; it does not attempt to rewrite edge configs. - """ - if allowed_vertex_names is None: - return vertex_config - - allowed = allowed_vertex_names - filtered_vertices = [v for v in vertex_config.vertices if v.name in allowed] - filtered_blank_vertices = [b for b in vertex_config.blank_vertices if b in allowed] - filtered_force_types = { - name: types - for name, types in vertex_config.force_types.items() - if name in allowed - } - return VertexConfig( - vertices=filtered_vertices, - blank_vertices=filtered_blank_vertices, - force_types=filtered_force_types, - ) - - -class EdgeInferSpec(ConfigBaseModel): - """Selector for controlling inferred edge emission.""" - - source: str = PydanticField(..., description="Edge source vertex name.") - target: str = PydanticField(..., description="Edge target vertex name.") - relation: str | None = PydanticField( - default=None, - description=( - "Optional relation discriminator. If omitted, selector applies to all relations " - "for (source, target)." 
- ), - ) - - @property - def edge_id(self) -> EdgeId: - return self.source, self.target, self.relation - - def matches(self, edge_id: EdgeId) -> bool: - source, target, relation = edge_id - return ( - self.source == source - and self.target == target - and (self.relation is None or self.relation == relation) - ) - - -class ResourceExtraWeightEntry(ConfigBaseModel): - """Schema edge plus optional vertex-derived weight rules for DB enrichment.""" - - edge: Edge - vertex_weights: list[Weight] = PydanticField(default_factory=list) - - @model_validator(mode="before") - @classmethod - def _from_yaml(cls, data: Any) -> Any: - if data is None: - return data - if isinstance(data, Edge): - return {"edge": data, "vertex_weights": []} - if not isinstance(data, dict): - raise TypeError( - f"extra_weights item must be dict or Edge, got {type(data)}" - ) - d = dict(data) - vw_raw = d.pop("vertex_weights", None) or [] - if not isinstance(vw_raw, list): - vw_raw = [vw_raw] - v_w = [Weight.model_validate(x) for x in vw_raw] - if "edge" in d and isinstance(d["edge"], dict): - edge = Edge.model_validate(dict(d.pop("edge"))) - if d: - raise ValueError( - f"extra_weights entry has unexpected keys with 'edge': {sorted(d)}" - ) - return {"edge": edge, "vertex_weights": v_w} - edge = Edge.model_validate(d) - return {"edge": edge, "vertex_weights": v_w} - - -class Resource(ConfigBaseModel): - """Resource configuration and processing. - - Represents a data resource that can be processed and transformed into graph - structures. Manages the processing pipeline through actors and handles data - encoding, transformation, and mapping. Suitable for LLM-generated schema - constituents. - - Dynamic vertex-type routing is handled by ``vertex_router`` steps in the - pipeline (see :class:`~graflo.architecture.pipeline.runtime.actor.VertexRouterActor`). 
- Per-row relationship labels and location matching for edges belong on - ``edge`` pipeline steps (:class:`~graflo.architecture.edge_derivation.EdgeDerivation`), - not on ``Resource``. - """ - - model_config = {"extra": "forbid"} - - name: str = PydanticField( - ..., - description="Name of the resource (e.g. table or file identifier).", - ) - pipeline: list[dict[str, Any]] = PydanticField( - ..., - description="Pipeline of actor steps to apply in sequence (vertex, edge, transform, descend). " - 'Each step is a dict, e.g. {"vertex": "user"} or {"edge": {"from": "a", "to": "b"}}.', - validation_alias=AliasChoices("pipeline", "apply"), - ) - encoding: EncodingType = PydanticField( - default=EncodingType.UTF_8, - description="Character encoding for input/output (e.g. utf-8, ISO-8859-1).", - ) - merge_collections: list[str] = PydanticField( - default_factory=list, - description="List of collection names to merge when writing to the graph.", - ) - extra_weights: list[ResourceExtraWeightEntry] = PydanticField( - default_factory=list, - description="Additional edge attribute / vertex-weight enrichment for this resource.", - ) - types: dict[str, str] = PydanticField( - default_factory=dict, - description='Field name to Python type expression for casting (e.g. {"amount": "float"}).', - ) - infer_edges: bool = PydanticField( - default=True, - description=( - "If True, infer edges from current vertex population. " - "If False, emit only edges explicitly declared as edge actors in the pipeline." - ), - ) - infer_edge_only: list[EdgeInferSpec] = PydanticField( - default_factory=list, - description=( - "Optional allow-list for inferred edges. Applies only to inferred (greedy) edges, " - "not explicit edge actors." - ), - ) - infer_edge_except: list[EdgeInferSpec] = PydanticField( - default_factory=list, - description=( - "Optional deny-list for inferred edges. Applies only to inferred (greedy) edges, " - "not explicit edge actors." 
- ), - ) - drop_trivial_input_fields: bool = PydanticField( - default=False, - description=( - "If True, remove top-level input keys whose value is None or the empty string before " - "the actor pipeline runs. Only the outer dict is filtered: nested dicts and list " - "elements are left unchanged, and keys whose values are containers (dict/list) are " - "kept even when empty. Numeric 0 and boolean False are kept. Use with wide or " - "sparse tabular rows so VertexActor projection sees fewer irrelevant columns." - ), - ) - skip_actors_on_missing_input_keys: bool | None = PydanticField( - default=None, - description=( - "If True, actors that declare required input keys may skip execution when keys are " - "missing in the current document instead of raising indexing errors. " - "If None, defaults to drop_trivial_input_fields." - ), - ) - - _root: ActorWrapper = PrivateAttr() - _types: dict[str, Callable[..., Any]] = PrivateAttr(default_factory=dict) - _vertex_config: VertexConfig = PrivateAttr() - _edge_config: EdgeConfig = PrivateAttr() - _executor: ActorExecutor = PrivateAttr() - _initialized: bool = PrivateAttr(default=False) - _edge_derivation_registry: EdgeDerivationRegistry | None = PrivateAttr(default=None) - - @model_validator(mode="after") - def _build_root_and_types(self) -> Resource: - """Build root ActorWrapper and resolve safe cast functions.""" - from graflo.architecture.pipeline.runtime.actor import ActorWrapper - from graflo.architecture.pipeline.runtime.executor import ActorExecutor - - object.__setattr__(self, "_root", ActorWrapper(*self.pipeline)) - object.__setattr__(self, "_executor", ActorExecutor(self._root)) - object.__setattr__(self, "_types", {}) - for k, v in self.types.items(): - caster = _resolve_type_caster(v) - if caster is not None: - self._types[k] = caster - else: - logger.error( - "For resource %s for field %s failed to resolve cast type %s", - self.name, - k, - v, - ) - # Placeholders until schema binds real configs. 
- object.__setattr__(self, "_vertex_config", VertexConfig(vertices=[])) - object.__setattr__(self, "_edge_config", EdgeConfig()) - object.__setattr__(self, "_initialized", False) - self._validate_infer_edge_spec_policy() - return self - - def _validate_infer_edge_spec_policy(self) -> None: - if self.infer_edge_only and self.infer_edge_except: - raise ValueError( - "Resource infer_edge_only and infer_edge_except are mutually exclusive." - ) - - def _validate_infer_edge_spec_targets(self, edge_config: EdgeConfig) -> None: - known_edge_ids = {edge_id for edge_id, _ in edge_config.items()} - - def _validate_list(field_name: str, specs: list[EdgeInferSpec]) -> None: - unknown: list[EdgeId] = [] - for spec in specs: - if not any(spec.matches(edge_id) for edge_id in known_edge_ids): - unknown.append(spec.edge_id) - if unknown: - raise ValueError( - f"Resource {field_name} contains unknown edge selectors: {unknown}" - ) - - _validate_list("infer_edge_only", self.infer_edge_only) - _validate_list("infer_edge_except", self.infer_edge_except) - - @property - def vertex_config(self) -> VertexConfig: - """Vertex configuration (set by Schema.finish_init).""" - return self._vertex_config - - @property - def edge_config(self) -> EdgeConfig: - """Edge configuration (set by Schema.finish_init).""" - return self._edge_config - - @property - def root(self) -> ActorWrapper: - """Root actor wrapper for the processing pipeline.""" - return self._root - - def finish_init( - self, - vertex_config: VertexConfig, - edge_config: EdgeConfig, - transforms: dict[str, ProtoTransform], - *, - strict_references: bool = False, - dynamic_edge_feedback: bool = False, - allowed_vertex_names: set[str] | None = None, - target_db_flavor: DBType | None = None, - ) -> None: - """Complete resource initialization. - - Initializes the resource with vertex and edge configurations, - and sets up the processing pipeline. Called by Schema after load. 
- - Args: - vertex_config: Configuration for vertices - edge_config: Configuration for edges - transforms: Dictionary of available transforms - target_db_flavor: Target graph DB flavor (for ingestion-time defaults, e.g. TigerGraph). - """ - self._rebuild_runtime( - vertex_config=vertex_config, - edge_config=edge_config, - transforms=transforms, - strict_references=strict_references, - dynamic_edge_feedback=dynamic_edge_feedback, - allowed_vertex_names=allowed_vertex_names, - target_db_flavor=target_db_flavor, - ) - - def _edge_ids_from_edge_actors(self) -> set[EdgeId]: - """Collect (source, target, None) for every EdgeActor in this resource's pipeline. - - Used to auto-add to infer_edge_except so inferred edges do not duplicate - edges produced by explicit edge actors. - """ - from graflo.architecture.pipeline.runtime.actor import EdgeActor - - edge_actors = [ - a for a in self.root.collect_actors() if isinstance(a, EdgeActor) - ] - # Dynamic EdgeActors (ea.edge is None) resolve types at row time; - # exclude them from static inference suppression. - return { - (ea.edge.source, ea.edge.target, None) - for ea in edge_actors - if ea.edge is not None - } - - def _validate_dynamic_edge_vertices_exist( - self, vertex_config: VertexConfig - ) -> None: - """Ensure all vertices implied by dynamic edge controls are declared.""" - known_vertices = set(vertex_config.vertex_set) - referenced_vertices: set[str] = set() - - for spec in self.infer_edge_only: - referenced_vertices.add(spec.source) - referenced_vertices.add(spec.target) - - for spec in self.infer_edge_except: - referenced_vertices.add(spec.source) - referenced_vertices.add(spec.target) - - for source, target, _ in self._edge_ids_from_edge_actors(): - referenced_vertices.add(source) - referenced_vertices.add(target) - - missing_vertices = sorted(referenced_vertices - known_vertices) - if missing_vertices: - raise ValueError( - "Resource dynamic edge references undefined vertices: " - f"{missing_vertices}. 
" - "Declare these vertices in vertex_config before using dynamic/inferred edges." - ) - - def _rebuild_runtime( - self, - *, - vertex_config: VertexConfig, - edge_config: EdgeConfig, - transforms: dict[str, ProtoTransform], - strict_references: bool = False, - dynamic_edge_feedback: bool = False, - allowed_vertex_names: set[str] | None = None, - target_db_flavor: DBType | None = None, - ) -> None: - """Rebuild runtime actor initialization state from typed context.""" - # Keep the full schema vertex_config for correctness validations, but - # use the filtered runtime vertex_config for actor execution. - runtime_vertex_config = _filter_vertex_config_by_allowed( - vertex_config, allowed_vertex_names=allowed_vertex_names - ) - object.__setattr__(self, "_vertex_config", runtime_vertex_config) - # Runtime actors may register dynamic edges; keep per-resource edge state. - local_edge_config = EdgeConfig.model_validate( - edge_config.to_dict(skip_defaults=False) - ) - object.__setattr__(self, "_edge_config", local_edge_config) - self._validate_dynamic_edge_vertices_exist(vertex_config) - self._validate_infer_edge_spec_targets(self._edge_config) - - baseline_edge_ids = {edge_id for edge_id, _ in edge_config.items()} - infer_edge_except = {spec.edge_id for spec in self.infer_edge_except} - # When not using infer_edge_only, auto-add (s,t,None) to infer_edge_except - # for any edge type handled by explicit EdgeActors in this resource. 
- if not self.infer_edge_only: - infer_edge_except |= self._edge_ids_from_edge_actors() - - from graflo.architecture.pipeline.runtime.actor import ActorInitContext - - edge_derivation_registry = EdgeDerivationRegistry() - object.__setattr__(self, "_edge_derivation_registry", edge_derivation_registry) - - logger.debug("total resource actor count : %s", self.root.count()) - skip_on_missing_input_keys = ( - self.skip_actors_on_missing_input_keys - if self.skip_actors_on_missing_input_keys is not None - else self.drop_trivial_input_fields - ) - init_ctx = ActorInitContext( - vertex_config=runtime_vertex_config, - edge_config=self._edge_config, - transforms=transforms, - edge_derivation=edge_derivation_registry, - allowed_vertex_names=allowed_vertex_names, - infer_edges=self.infer_edges, - infer_edge_only={spec.edge_id for spec in self.infer_edge_only}, - infer_edge_except=infer_edge_except, - strict_references=strict_references, - skip_actors_on_missing_input_keys=skip_on_missing_input_keys, - target_db_flavor=target_db_flavor, - ) - self.root.finish_init(init_ctx=init_ctx) - object.__setattr__(self, "_initialized", True) - - if dynamic_edge_feedback: - # Edge actors register static edge definitions into the resource-local edge - # config during finish_init(). Optionally propagate newly discovered edges - # to the shared schema-level edge_config so schema definition and DB - # writers can see them. 
- for edge_id, edge in self._edge_config.items(): - if edge_id in baseline_edge_ids: - continue - edge_config.update_edges( - edge.model_copy(deep=True), vertex_config=vertex_config - ) - - logger.debug("total resource actor count (after finit): %s", self.root.count()) - - reg = self._edge_derivation_registry - for entry in self.extra_weights: - entry.edge.finish_init(vertex_config) - if reg is not None and entry.vertex_weights: - reg.merge_vertex_weights(entry.edge.edge_id, entry.vertex_weights) - - def __call__(self, doc: dict) -> defaultdict[GraphEntity, list]: - """Process a document through the resource pipeline. - - Args: - doc: Document to process - - Returns: - defaultdict[GraphEntity, list]: Processed graph entities - """ - if not self._initialized: - raise RuntimeError( - f"Resource '{self.name}' must be initialized via finish_init() before use." - ) - work_doc: dict[str, Any] = ( - _strip_trivial_top_level_fields(doc) - if self.drop_trivial_input_fields - else doc - ) - extraction_ctx = self._executor.extract(work_doc) - result = self._executor.assemble_result(extraction_ctx) - return result.entities - - def count(self) -> int: - """Total number of actors in the resource pipeline.""" - return self.root.count() diff --git a/graflo/architecture/contract/ingestion/__init__.py b/graflo/architecture/contract/ingestion/__init__.py new file mode 100644 index 00000000..8e9dfddd --- /dev/null +++ b/graflo/architecture/contract/ingestion/__init__.py @@ -0,0 +1,43 @@ +"""Declarative ingestion contract: resources, transforms, and ingestion model.""" + +from .model import IngestionModel +from .resource import ( + EdgeInferSpec, + Resource, + ResourceConfig, + ResourceExtraWeightEntry, + collect_vertex_names_from_pipeline, +) +from .transform import ( + DressConfig, + KeySelectionConfig, + ProtoTransform, + Transform, + TransformException, +) +from ..runtime.resource import ( + ResourceRuntime, + build_resource_runtime, + filter_vertex_config_for_resource, + 
strip_trivial_top_level_fields, +) +from graflo.util.casting import resolve_type_caster + +__all__ = [ + "DressConfig", + "EdgeInferSpec", + "IngestionModel", + "KeySelectionConfig", + "ProtoTransform", + "Resource", + "ResourceConfig", + "ResourceExtraWeightEntry", + "ResourceRuntime", + "Transform", + "TransformException", + "build_resource_runtime", + "collect_vertex_names_from_pipeline", + "filter_vertex_config_for_resource", + "resolve_type_caster", + "strip_trivial_top_level_fields", +] diff --git a/graflo/architecture/contract/declarations/ingestion_model/model.py b/graflo/architecture/contract/ingestion/model.py similarity index 66% rename from graflo/architecture/contract/declarations/ingestion_model/model.py rename to graflo/architecture/contract/ingestion/model.py index 0f69a811..ee6657a7 100644 --- a/graflo/architecture/contract/declarations/ingestion_model/model.py +++ b/graflo/architecture/contract/ingestion/model.py @@ -8,11 +8,13 @@ from pydantic import Field as PydanticField, PrivateAttr, model_validator from graflo.architecture.base import ConfigBaseModel +from graflo.architecture.pipeline.runtime.actor import ActorWrapper from graflo.onto import DBType -from ..edge_derivation_registry import EdgeDerivationRegistry -from ..resource import Resource -from ..transform import ProtoTransform +from ..runtime.edge_derivation import EdgeDerivationRegistry +from ..runtime.resource import ResourceRuntime +from .resource import ResourceConfig +from .transform import ProtoTransform if TYPE_CHECKING: from graflo.architecture.schema import CoreSchema @@ -26,12 +28,10 @@ class IngestionModel(ConfigBaseModel): description=( "How batch edge writes tolerate an already-matching edge. Passed through to " ":meth:`~graflo.db.conn.Connection.insert_edges_batch` where the target backend " - "supports it. Today ArangoDB maps ``ignore`` to INSERT with ignoreErrors and " - "``upsert`` to AQL UPSERT (with schema merge keys as ``uniq_weight_fields`` when " - "present). 
Other databases may interpret the same values later." + "supports it." ), ) - resources: list[Resource] = PydanticField( + resources: list[ResourceConfig] = PydanticField( default_factory=list, description="List of resource definitions (data pipelines mapping to vertices/edges).", ) @@ -40,7 +40,8 @@ class IngestionModel(ConfigBaseModel): description="List of named transforms available to resources.", ) - _resources: dict[str, Resource] = PrivateAttr() + _resources: dict[str, ResourceConfig] = PrivateAttr() + _runtimes: dict[str, ResourceRuntime] = PrivateAttr(default_factory=dict) _transforms: dict[str, ProtoTransform] = PrivateAttr(default_factory=dict) _combined_edge_derivation: EdgeDerivationRegistry = PrivateAttr( default_factory=EdgeDerivationRegistry @@ -49,7 +50,7 @@ class IngestionModel(ConfigBaseModel): @model_validator(mode="after") def _init_model(self) -> IngestionModel: """Build transform and resource lookup maps.""" - self._rebuild_runtime_state() + self._rebuild_config_state() return self def _rebuild_resource_map(self) -> None: @@ -91,10 +92,12 @@ def finish_init( allowed_vertex_names: set[str] | None = None, target_db_flavor: DBType | None = None, ) -> None: - """Initialize resources against graph model and transform library.""" - self._rebuild_runtime_state() - for r in self.resources: - r.finish_init( + """Build per-resource runtimes against graph model and transform library.""" + self._rebuild_config_state() + runtimes: dict[str, ResourceRuntime] = {} + for config in self.resources: + runtimes[config.name] = ResourceRuntime( + config, vertex_config=core_schema.vertex_config, edge_config=core_schema.edge_config, transforms=self._transforms, @@ -103,37 +106,35 @@ def finish_init( allowed_vertex_names=allowed_vertex_names, target_db_flavor=target_db_flavor, ) + object.__setattr__(self, "_runtimes", runtimes) - def _rebuild_runtime_state(self) -> None: + def _rebuild_config_state(self) -> None: """Rebuild transform and resource lookup maps.""" 
self._rebuild_transform_map() self._rebuild_resource_map() - def fetch_resource(self, name: str | None = None) -> Resource: - """Fetch a resource by name or get the first available resource. - - Args: - name: Optional name of the resource to fetch - - Returns: - Resource: The requested resource - - Raises: - ValueError: If the requested resource is not found or if no resources exist - """ - _current_resource = None - + def fetch_resource(self, name: str | None = None) -> ResourceRuntime: + """Fetch an initialized runtime resource by name.""" if name is not None: - if name in self._resources: - _current_resource = self._resources[name] - else: + runtime = self._runtimes.get(name) + if runtime is None: raise ValueError(f"Resource {name} not found") - else: - if self._resources: - _current_resource = self.resources[0] - else: - raise ValueError("Empty resource container :(") - return _current_resource + return runtime + if self._runtimes: + return next(iter(self._runtimes.values())) + if self.resources: + raise RuntimeError( + "IngestionModel resources exist but runtimes were not built; " + "call finish_init() first." 
+ ) + raise ValueError("Empty resource container :(") + + def fetch_resource_config(self, name: str) -> ResourceConfig: + """Fetch declarative resource config by name.""" + config = self._resources.get(name) + if config is None: + raise ValueError(f"Resource {name} not found") + return config def prune_to_graph( self, core_schema: CoreSchema, disconnected: set[str] | None = None @@ -146,19 +147,22 @@ def prune_to_graph( if not disconnected: return - def _mentions_disconnected(wrapper) -> bool: + def _mentions_disconnected(wrapper: ActorWrapper) -> bool: return bool(wrapper.actor.references_vertices() & disconnected) - to_drop: list[Resource] = [] - for resource in self.resources: - root = resource.root + to_drop: list[ResourceConfig] = [] + for resource_config in self.resources: + root = ActorWrapper(*resource_config.pipeline) if _mentions_disconnected(root): - to_drop.append(resource) + to_drop.append(resource_config) continue root.remove_descendants_if(_mentions_disconnected) if not any(a.references_vertices() for a in root.collect_actors()): - to_drop.append(resource) - - for r in to_drop: - self.resources.remove(r) - self._resources.pop(r.name, None) + to_drop.append(resource_config) + + for dropped in to_drop: + self.resources.remove(dropped) + self._resources.pop(dropped.name, None) + self._runtimes.pop(dropped.name, None) + if to_drop: + self._rebuild_config_state() diff --git a/graflo/architecture/contract/ingestion/resource.py b/graflo/architecture/contract/ingestion/resource.py new file mode 100644 index 00000000..787e8040 --- /dev/null +++ b/graflo/architecture/contract/ingestion/resource.py @@ -0,0 +1,228 @@ +"""Declarative resource configuration (YAML/manifest contract).""" + +from __future__ import annotations + +import logging +from typing import Any + +from pydantic import AliasChoices, Field as PydanticField, model_validator + +from graflo.architecture.base import ConfigBaseModel +from graflo.architecture.graph_types import EdgeId, EncodingType, 
Weight +from graflo.architecture.pipeline.runtime.actor.config.normalize import ( + normalize_actor_step, +) +from graflo.architecture.schema.edge import Edge + +logger = logging.getLogger(__name__) + + +def collect_vertex_names_from_pipeline(steps: list[Any]) -> set[str]: + """Collect vertex names referenced by pipeline steps (including nested descend).""" + names: set[str] = set() + for step in steps: + if not isinstance(step, dict): + continue + normalized = normalize_actor_step(dict(step)) + step_type = normalized.get("type") + if step_type == "vertex" and isinstance(normalized.get("vertex"), str): + names.add(normalized["vertex"]) + elif step_type == "vertex_router": + type_map = normalized.get("type_map") + if isinstance(type_map, dict): + for value in type_map.values(): + if isinstance(value, str): + names.add(value) + vertex_from_map = normalized.get("vertex_from_map") + if isinstance(vertex_from_map, dict): + for key in vertex_from_map: + if isinstance(key, str): + names.add(key) + elif step_type == "edge": + source = normalized.get("source") or normalized.get("from") + target = normalized.get("target") or normalized.get("to") + if isinstance(source, str): + names.add(source) + if isinstance(target, str): + names.add(target) + vertex_weights = normalized.get("vertex_weights") + if isinstance(vertex_weights, list): + for weight in vertex_weights: + if isinstance(weight, dict) and isinstance(weight.get("name"), str): + names.add(weight["name"]) + elif step_type == "descend": + sub_pipeline = normalized.get("pipeline") + if isinstance(sub_pipeline, list): + names |= collect_vertex_names_from_pipeline(sub_pipeline) + return names + + +class EdgeInferSpec(ConfigBaseModel): + """Selector for controlling inferred edge emission.""" + + source: str = PydanticField(..., description="Edge source vertex name.") + target: str = PydanticField(..., description="Edge target vertex name.") + relation: str | None = PydanticField( + default=None, + description=( + "Optional 
relation discriminator. If omitted, selector applies to all relations " + "for (source, target)." + ), + ) + + @property + def edge_id(self) -> EdgeId: + return self.source, self.target, self.relation + + def matches(self, edge_id: EdgeId) -> bool: + source, target, relation = edge_id + return ( + self.source == source + and self.target == target + and (self.relation is None or self.relation == relation) + ) + + +class ResourceExtraWeightEntry(ConfigBaseModel): + """Schema edge plus optional vertex-derived weight rules for DB enrichment.""" + + edge: Edge + vertex_weights: list[Weight] = PydanticField(default_factory=list) + + @model_validator(mode="before") + @classmethod + def _from_yaml(cls, data: Any) -> Any: + if data is None: + return data + if isinstance(data, Edge): + return {"edge": data, "vertex_weights": []} + if not isinstance(data, dict): + raise TypeError( + f"extra_weights item must be dict or Edge, got {type(data)}" + ) + d = dict(data) + vw_raw = d.pop("vertex_weights", None) or [] + if not isinstance(vw_raw, list): + vw_raw = [vw_raw] + v_w = [Weight.model_validate(x) for x in vw_raw] + if "edge" in d and isinstance(d["edge"], dict): + edge = Edge.model_validate(dict(d.pop("edge"))) + if d: + raise ValueError( + f"extra_weights entry has unexpected keys with 'edge': {sorted(d)}" + ) + return {"edge": edge, "vertex_weights": v_w} + edge = Edge.model_validate(d) + return {"edge": edge, "vertex_weights": v_w} + + +class ResourceConfig(ConfigBaseModel): + """Declarative resource definition (serializable contract).""" + + model_config = {"extra": "forbid"} + + name: str = PydanticField( + ..., + description="Name of the resource (e.g. table or file identifier).", + ) + pipeline: list[dict[str, Any]] = PydanticField( + ..., + description="Pipeline of actor steps to apply in sequence (vertex, edge, transform, descend). " + 'Each step is a dict, e.g. 
{"vertex": "user"} or {"edge": {"from": "a", "to": "b"}}.', + validation_alias=AliasChoices("pipeline", "apply"), + ) + encoding: EncodingType = PydanticField( + default=EncodingType.UTF_8, + description="Character encoding for input/output (e.g. utf-8, ISO-8859-1).", + ) + merge_collections: list[str] = PydanticField( + default_factory=list, + description="List of collection names to merge when writing to the graph.", + ) + extra_weights: list[ResourceExtraWeightEntry] = PydanticField( + default_factory=list, + description="Additional edge attribute / vertex-weight enrichment for this resource.", + ) + types: dict[str, str] = PydanticField( + default_factory=dict, + description='Field name to Python type expression for casting (e.g. {"amount": "float"}).', + ) + infer_edges: bool = PydanticField( + default=True, + description=( + "If True, infer edges from current vertex population. " + "If False, emit only edges explicitly declared as edge actors in the pipeline." + ), + ) + infer_edge_only: list[EdgeInferSpec] = PydanticField( + default_factory=list, + description=( + "Optional allow-list for inferred edges. Applies only to inferred (greedy) edges, " + "not explicit edge actors." + ), + ) + infer_edge_except: list[EdgeInferSpec] = PydanticField( + default_factory=list, + description=( + "Optional deny-list for inferred edges. Applies only to inferred (greedy) edges, " + "not explicit edge actors." + ), + ) + drop_trivial_input_fields: bool = PydanticField( + default=False, + description=( + "If True, remove top-level input keys whose value is None or the empty string before " + "the actor pipeline runs." + ), + ) + skip_actors_on_missing_input_keys: bool | None = PydanticField( + default=None, + description=( + "If True, actors that declare required input keys may skip execution when keys are " + "missing in the current document instead of raising indexing errors. " + "If None, defaults to drop_trivial_input_fields." 
+ ), + ) + tolerate_transform_errors: bool = PydanticField( + default=True, + description=( + "If True, a failing transform step sets its declared output fields to None, " + "records the error, and continues the pipeline." + ), + ) + + @model_validator(mode="after") + def _validate_policy(self) -> ResourceConfig: + if self.infer_edge_only and self.infer_edge_except: + raise ValueError( + "Resource infer_edge_only and infer_edge_except are mutually exclusive." + ) + return self + + def collect_vertex_names(self) -> set[str]: + """Vertex types referenced by this resource (pipeline and related config).""" + names = collect_vertex_names_from_pipeline(self.pipeline) + names.update(self.merge_collections) + for spec in self.infer_edge_only: + names.add(spec.source) + names.add(spec.target) + for spec in self.infer_edge_except: + names.add(spec.source) + names.add(spec.target) + for entry in self.extra_weights: + names.add(entry.edge.source) + names.add(entry.edge.target) + for weight in entry.vertex_weights: + if weight.name is not None: + names.add(weight.name) + return names + + def pipeline_actor_count(self) -> int: + """Count actors in the pipeline without binding schema context.""" + from graflo.architecture.pipeline.runtime.actor import ActorWrapper + + return ActorWrapper(*self.pipeline).count() + + +# Internal-only alias; prefer ResourceConfig in new code. 
+Resource = ResourceConfig diff --git a/graflo/architecture/contract/declarations/transform.py b/graflo/architecture/contract/ingestion/transform.py similarity index 95% rename from graflo/architecture/contract/declarations/transform.py rename to graflo/architecture/contract/ingestion/transform.py index 298c15a0..33e6033e 100644 --- a/graflo/architecture/contract/declarations/transform.py +++ b/graflo/architecture/contract/ingestion/transform.py @@ -370,10 +370,9 @@ def _normalize_fields(cls, data: Any) -> Any: def _init_derived(self) -> Self: explicit_map = bool(self.rename) object.__setattr__(self, "functional_transform", self._foo is not None) - next_input, next_output, next_map = self._derive_effective_io_and_map() + next_input, next_output, _next_map = self._derive_effective_io_and_map() object.__setattr__(self, "input", next_input) object.__setattr__(self, "output", next_output) - object.__setattr__(self, "map", next_map) self._validate_configuration(explicit_map=explicit_map) return self @@ -541,14 +540,11 @@ def _validate_configuration(self, *, explicit_map: bool) -> None: ) def _refresh_derived(self) -> None: - """Re-run derived state (e.g. map from input/output) after mutating attributes.""" + """Re-run derived input/output after mutating attributes (merge_from).""" if self.rename or not self.input or not self.output: return if len(self.input) != len(self.output): return - object.__setattr__( - self, "map", {src: dst for src, dst in zip(self.input, self.output)} - ) def __call__(self, *nargs: Any, **kwargs: Any) -> dict[str, Any] | Any: """Execute the transform. 
@@ -680,6 +676,41 @@ def is_mapping(self) -> bool: """True when the transform is pure mapping (no function).""" return self._foo is None + def planned_output_field_names( + self, doc: dict[str, Any] | None = None + ) -> tuple[str, ...]: + """Return output field names this transform would write on success.""" + if self.target == "keys": + if doc is None: + return () + return tuple(sorted(self._selected_keys(doc))) + + if self.input_groups: + if self.output_groups: + names: list[str] = [] + for group in self.output_groups: + names.extend(group) + return tuple(dict.fromkeys(names)) + if self.output: + return self.output + scalar_names: list[str] = [] + for group in self.input_groups: + if len(group) != 1: + return () + scalar_names.append(group[0]) + return tuple(scalar_names) + + if self.dress is not None: + return (self.dress.key, self.dress.value) + + if self.rename: + return tuple(self.rename.values()) + + if self.output: + return self.output + + return () + def _dress_as_dict(self, transform_result: Any) -> dict[str, Any]: """Convert transform result to dictionary format. 
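Reviewer note on planned_output_field_names: the resolution order in the hunk above (grouped outputs first, then rename targets, then plain output) can be illustrated with a much smaller stand-in. This is only a sketch under stated assumptions: `planned_outputs` and its keyword arguments are hypothetical names, not part of graflo, and the `target == "keys"`, `dress`, and scalar `input_groups` branches are deliberately left out.

```python
from __future__ import annotations


def planned_outputs(
    *,
    rename: dict[str, str] | None = None,
    output: tuple[str, ...] = (),
    output_groups: tuple[tuple[str, ...], ...] = (),
) -> tuple[str, ...]:
    """Hypothetical simplification: field names a transform step would write."""
    if output_groups:
        flat = [name for group in output_groups for name in group]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        return tuple(dict.fromkeys(flat))
    if rename:
        # A rename maps source field -> destination field; only destinations are written.
        return tuple(rename.values())
    return tuple(output)


if __name__ == "__main__":
    print(planned_outputs(output_groups=(("a", "b"), ("b", "c"))))
    print(planned_outputs(rename={"x": "y"}))
```

Returning an empty tuple when nothing can be determined mirrors the fall-through `return ()` in the hunk: a caller that nulls fields on failure then has nothing to null.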
diff --git a/graflo/architecture/contract/manifest.py b/graflo/architecture/contract/manifest.py index 93d61cbc..85fd8c85 100644 --- a/graflo/architecture/contract/manifest.py +++ b/graflo/architecture/contract/manifest.py @@ -10,7 +10,7 @@ from graflo.architecture.schema import Schema from .bindings import Bindings -from .declarations.ingestion_model import IngestionModel +from .ingestion.model import IngestionModel class GraphManifest(ConfigBaseModel): diff --git a/graflo/architecture/contract/runtime/__init__.py b/graflo/architecture/contract/runtime/__init__.py new file mode 100644 index 00000000..85b2c18a --- /dev/null +++ b/graflo/architecture/contract/runtime/__init__.py @@ -0,0 +1,28 @@ +"""Schema-bound runtime executors (non-serializable).""" + +from __future__ import annotations + +from typing import Any + +from .edge_derivation import EdgeDerivationRegistry + +__all__ = [ + "EdgeDerivationRegistry", + "ResourceRuntime", + "build_resource_runtime", + "filter_vertex_config_for_resource", + "strip_trivial_top_level_fields", +] + + +def __getattr__(name: str) -> Any: + if name in { + "ResourceRuntime", + "build_resource_runtime", + "filter_vertex_config_for_resource", + "strip_trivial_top_level_fields", + }: + from . import resource as _resource + + return getattr(_resource, name) + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/graflo/architecture/contract/declarations/edge_derivation_registry.py b/graflo/architecture/contract/runtime/edge_derivation.py similarity index 98% rename from graflo/architecture/contract/declarations/edge_derivation_registry.py rename to graflo/architecture/contract/runtime/edge_derivation.py index 377c9e0d..6867ac4e 100644 --- a/graflo/architecture/contract/declarations/edge_derivation_registry.py +++ b/graflo/architecture/contract/runtime/edge_derivation.py @@ -11,7 +11,7 @@ class EdgeDerivationRegistry: """Mutable store for ingestion-time edge behavior keyed by :class:`EdgeId`. 
- Lives under the ingestion layer (typically one instance per :class:`Resource`), + Lives under the ingestion layer (typically one instance per :class:`ResourceRuntime`), not on :class:`~graflo.architecture.schema.core.CoreSchema`. """ diff --git a/graflo/architecture/contract/runtime/resource.py b/graflo/architecture/contract/runtime/resource.py new file mode 100644 index 00000000..b646d09d --- /dev/null +++ b/graflo/architecture/contract/runtime/resource.py @@ -0,0 +1,303 @@ +"""Runtime resource executor (schema-bound, not serializable).""" + +from __future__ import annotations + +import logging +from collections import defaultdict +from typing import Any, Callable + +from graflo.architecture.graph_types import EdgeId, GraphEntity, ResourceCastResult +from graflo.architecture.pipeline.runtime.actor import ( + ActorInitContext, + ActorWrapper, + EdgeActor, +) +from graflo.architecture.pipeline.runtime.executor import ActorExecutor +from graflo.architecture.schema.edge import EdgeConfig +from graflo.architecture.schema.vertex import VertexConfig +from graflo.onto import DBType +from graflo.util.casting import apply_type_casters, resolve_type_casters + +from ..ingestion.resource import EdgeInferSpec, ResourceConfig +from ..ingestion.transform import ProtoTransform +from .edge_derivation import EdgeDerivationRegistry + +logger = logging.getLogger(__name__) + + +def strip_trivial_top_level_fields(doc: dict[str, Any]) -> dict[str, Any]: + """Return a shallow copy of *doc* without None or empty-string values.""" + return {k: v for k, v in doc.items() if v is not None and v != ""} + + +def filter_vertex_config_for_resource( + vertex_config: VertexConfig, + *, + resource_vertex_names: set[str], + allowed_vertex_names: set[str] | None, +) -> VertexConfig: + """Derive a filtered VertexConfig for runtime actor execution.""" + if resource_vertex_names: + effective = set(resource_vertex_names) + if allowed_vertex_names is not None: + effective &= allowed_vertex_names + elif 
allowed_vertex_names is not None: + effective = set(allowed_vertex_names) + else: + return vertex_config + filtered_vertices = [v for v in vertex_config.vertices if v.name in effective] + filtered_force_types = { + name: types + for name, types in vertex_config.force_types.items() + if name in effective + } + return VertexConfig( + vertices=filtered_vertices, + force_types=filtered_force_types, + identity_from_all_properties=vertex_config.identity_from_all_properties, + ) + + +class ResourceRuntime: + """Fully initialized resource executor for document casting.""" + + def __init__( + self, + config: ResourceConfig, + vertex_config: VertexConfig, + edge_config: EdgeConfig, + transforms: dict[str, ProtoTransform], + *, + strict_references: bool = False, + dynamic_edge_feedback: bool = False, + allowed_vertex_names: set[str] | None = None, + target_db_flavor: DBType | None = None, + ) -> None: + self.config = config + self._type_casters = resolve_type_casters(config.types) + self._root = ActorWrapper(*config.pipeline) + self._executor = ActorExecutor(self._root) + + runtime_vertex_config, local_edge_config = self._filter_vertex_edge_configs( + vertex_config, + edge_config, + allowed_vertex_names=allowed_vertex_names, + ) + self._vertex_config = runtime_vertex_config + self._edge_config = local_edge_config + + self._validate_vertex_references(vertex_config) + self._validate_infer_edge_spec_targets(self._edge_config) + + edge_derivation_registry = EdgeDerivationRegistry() + self._edge_derivation_registry = edge_derivation_registry + + infer_edge_except = self._build_infer_except() + init_ctx = self._build_init_context( + transforms=transforms, + edge_derivation=edge_derivation_registry, + infer_edge_except=infer_edge_except, + strict_references=strict_references, + allowed_vertex_names=allowed_vertex_names, + target_db_flavor=target_db_flavor, + ) + logger.debug("total resource actor count : %s", self._root.count()) + self._root.finish_init(init_ctx=init_ctx) + + if 
dynamic_edge_feedback: + self._propagate_dynamic_edges(edge_config, vertex_config=vertex_config) + + logger.debug("total resource actor count (after init): %s", self._root.count()) + self._init_extra_weights(vertex_config) + + @property + def name(self) -> str: + return self.config.name + + @property + def vertex_config(self) -> VertexConfig: + return self._vertex_config + + @property + def edge_config(self) -> EdgeConfig: + return self._edge_config + + @property + def root(self) -> ActorWrapper: + return self._root + + @property + def type_casters(self) -> dict[str, Callable[..., Any]]: + return self._type_casters + + def collect_vertex_names(self) -> set[str]: + return self.config.collect_vertex_names() + + def count(self) -> int: + return self._root.count() + + @staticmethod + def edge_ids_from_pipeline(pipeline: list[dict[str, Any]]) -> set[EdgeId]: + """Collect (source, target, None) for every static EdgeActor in *pipeline*.""" + root = ActorWrapper(*pipeline) + edge_actors = [a for a in root.collect_actors() if isinstance(a, EdgeActor)] + return { + (ea.edge.source, ea.edge.target, None) + for ea in edge_actors + if ea.edge is not None + } + + def _filter_vertex_edge_configs( + self, + vertex_config: VertexConfig, + edge_config: EdgeConfig, + *, + allowed_vertex_names: set[str] | None, + ) -> tuple[VertexConfig, EdgeConfig]: + runtime_vertex_config = filter_vertex_config_for_resource( + vertex_config, + resource_vertex_names=self.collect_vertex_names(), + allowed_vertex_names=allowed_vertex_names, + ) + local_edge_config = EdgeConfig.model_validate( + edge_config.to_dict(skip_defaults=False) + ) + return runtime_vertex_config, local_edge_config + + def _validate_vertex_references(self, vertex_config: VertexConfig) -> None: + known_vertices = set(vertex_config.vertex_set) + referenced_vertices: set[str] = set() + + for spec in self.config.infer_edge_only: + referenced_vertices.add(spec.source) + referenced_vertices.add(spec.target) + for spec in 
self.config.infer_edge_except: + referenced_vertices.add(spec.source) + referenced_vertices.add(spec.target) + for source, target, _ in self.edge_ids_from_pipeline(self.config.pipeline): + referenced_vertices.add(source) + referenced_vertices.add(target) + + missing_vertices = sorted(referenced_vertices - known_vertices) + if missing_vertices: + raise ValueError( + "Resource dynamic edge references undefined vertices: " + f"{missing_vertices}. " + "Declare these vertices in vertex_config before using dynamic/inferred edges." + ) + + def _validate_infer_edge_spec_targets(self, edge_config: EdgeConfig) -> None: + known_edge_ids = {edge_id for edge_id, _ in edge_config.items()} + + def _validate_list(field_name: str, specs: list[EdgeInferSpec]) -> None: + unknown: list[EdgeId] = [] + for spec in specs: + if not any(spec.matches(edge_id) for edge_id in known_edge_ids): + unknown.append(spec.edge_id) + if unknown: + raise ValueError( + f"Resource {field_name} contains unknown edge selectors: {unknown}" + ) + + _validate_list("infer_edge_only", self.config.infer_edge_only) + _validate_list("infer_edge_except", self.config.infer_edge_except) + + def _build_infer_except(self) -> set[EdgeId]: + infer_edge_except = {spec.edge_id for spec in self.config.infer_edge_except} + if not self.config.infer_edge_only: + infer_edge_except |= self.edge_ids_from_pipeline(self.config.pipeline) + return infer_edge_except + + def _build_init_context( + self, + *, + transforms: dict[str, ProtoTransform], + edge_derivation: EdgeDerivationRegistry, + infer_edge_except: set[EdgeId], + strict_references: bool, + allowed_vertex_names: set[str] | None, + target_db_flavor: DBType | None, + ) -> ActorInitContext: + skip_on_missing_input_keys = ( + self.config.skip_actors_on_missing_input_keys + if self.config.skip_actors_on_missing_input_keys is not None + else self.config.drop_trivial_input_fields + ) + return ActorInitContext( + vertex_config=self._vertex_config, + edge_config=self._edge_config, + 
transforms=transforms, + edge_derivation=edge_derivation, + allowed_vertex_names=allowed_vertex_names, + infer_edges=self.config.infer_edges, + infer_edge_only={spec.edge_id for spec in self.config.infer_edge_only}, + infer_edge_except=infer_edge_except, + strict_references=strict_references, + skip_actors_on_missing_input_keys=skip_on_missing_input_keys, + tolerate_transform_errors=self.config.tolerate_transform_errors, + target_db_flavor=target_db_flavor, + ) + + def _propagate_dynamic_edges( + self, + edge_config: EdgeConfig, + *, + vertex_config: VertexConfig, + ) -> None: + baseline_edge_ids = {edge_id for edge_id, _ in edge_config.items()} + for edge_id, edge in self._edge_config.items(): + if edge_id in baseline_edge_ids: + continue + edge_config.update_edges( + edge.model_copy(deep=True), vertex_config=vertex_config + ) + + def _init_extra_weights(self, vertex_config: VertexConfig) -> None: + reg = self._edge_derivation_registry + for entry in self.config.extra_weights: + entry.edge.finish_init(vertex_config) + if reg is not None and entry.vertex_weights: + reg.merge_vertex_weights(entry.edge.edge_id, entry.vertex_weights) + + def cast_document(self, doc: dict) -> ResourceCastResult: + """Process a document and return entities plus any tolerated transform failures.""" + work_doc: dict[str, Any] = ( + strip_trivial_top_level_fields(doc) + if self.config.drop_trivial_input_fields + else dict(doc) + ) + if self._type_casters: + apply_type_casters(work_doc, self._type_casters) + extraction_ctx = self._executor.extract(work_doc) + result = self._executor.assemble_result(extraction_ctx) + return ResourceCastResult( + entities=result.entities, + transform_failures=list(extraction_ctx.transform_failures), + ) + + def __call__(self, doc: dict) -> defaultdict[GraphEntity, list]: + return self.cast_document(doc).entities + + +def build_resource_runtime( + config: ResourceConfig, + vertex_config: VertexConfig, + edge_config: EdgeConfig, + transforms: dict[str, 
ProtoTransform] | None = None, + *, + strict_references: bool = False, + dynamic_edge_feedback: bool = False, + allowed_vertex_names: set[str] | None = None, + target_db_flavor: DBType | None = None, +) -> ResourceRuntime: + """Construct a fully initialized :class:`ResourceRuntime` from declarative config.""" + return ResourceRuntime( + config, + vertex_config, + edge_config, + transforms or {}, + strict_references=strict_references, + dynamic_edge_feedback=dynamic_edge_feedback, + allowed_vertex_names=allowed_vertex_names, + target_db_flavor=target_db_flavor, + ) diff --git a/graflo/architecture/edge_derivation.py b/graflo/architecture/edge_derivation.py index a873a3cf..19ad582e 100644 --- a/graflo/architecture/edge_derivation.py +++ b/graflo/architecture/edge_derivation.py @@ -6,7 +6,7 @@ :func:`~graflo.architecture.pipeline.runtime.actor.edge_render.render_edge`. When :attr:`EdgeDerivation.relation_from_key` is true, the ingestion -:class:`~graflo.architecture.contract.declarations.edge_derivation_registry.EdgeDerivationRegistry` +:class:`~graflo.architecture.contract.runtime.edge_derivation.EdgeDerivationRegistry` records the edge id so :class:`~graflo.architecture.schema.db_aware.EdgeConfigDBAware` (with overlay) can align TigerGraph DDL with runtime. 
""" diff --git a/graflo/architecture/evolution/apply.py b/graflo/architecture/evolution/apply.py index a80d7b6d..8302ea5d 100644 --- a/graflo/architecture/evolution/apply.py +++ b/graflo/architecture/evolution/apply.py @@ -5,8 +5,9 @@ import logging from typing import Any, Literal, Sequence -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.contract.manifest import GraphManifest +from graflo.architecture.pipeline.runtime.actor import ActorWrapper from graflo.architecture.database_features import DatabaseProfile from graflo.architecture.schema import Schema from graflo.architecture.schema.core import CoreSchema @@ -84,7 +85,7 @@ def _prune_ingestion_for_removed_vertices( if pipeline_mentions_any_vertex(resource.pipeline, removed): to_drop.append(resource) continue - root = resource.root + root = ActorWrapper(*resource.pipeline) if _actor_wrapper_mentions_removed(root, removed): to_drop.append(resource) continue @@ -218,26 +219,18 @@ def _build_merged_vertex_config( seen_ft.add(x) deduped_ft.append(x) - new_blank = [b for b in vc.blank_vertices if b not in sset] - was_blank = any(b in sset for b in vc.blank_vertices) or ( - into_exists and into in vc.blank_vertices - ) - if was_blank and into not in new_blank: - new_blank.append(into) - new_force = {k: v for k, v in vc.force_types.items() if k not in sset and k != into} if deduped_ft: new_force[into] = deduped_ft return VertexConfig( vertices=new_vertices, - blank_vertices=new_blank, force_types=new_force, ) def _rewrite_ingestion_for_merge(im: IngestionModel, mapping: dict[str, str]) -> None: - from graflo.architecture.contract.declarations.resource import Resource + from graflo.architecture.contract.ingestion.resource import Resource new_resources: list[Resource] = [] for r in im.resources: @@ -342,7 +335,7 @@ def _rebuild_ingestion_with_pipeline_rewrite( """ if manifest.ingestion_model is 
None: return - from graflo.architecture.contract.declarations.resource import Resource + from graflo.architecture.contract.ingestion.resource import Resource renames_ctx = vertex_field_renames if vertex_field_renames else {} @@ -532,13 +525,6 @@ def _apply_rename_entities( vertex["name"], vertex["name"] ) - blank_vertices = vertex_config.get("blank_vertices") - if isinstance(blank_vertices, list): - vertex_config["blank_vertices"] = [ - vertex_map.get(name, name) if isinstance(name, str) else name - for name in blank_vertices - ] - force_types = vertex_config.get("force_types") if isinstance(force_types, dict): vertex_config["force_types"] = { @@ -701,7 +687,7 @@ def apply_remove_edges(manifest: GraphManifest, op: RemoveEdgesOp) -> None: schema.finish_init() if manifest.ingestion_model is not None: - from graflo.architecture.contract.declarations.resource import Resource + from graflo.architecture.contract.ingestion.resource import Resource resources: list[Resource] = [] for resource in manifest.ingestion_model.resources: diff --git a/graflo/architecture/evolution/merge_core.py b/graflo/architecture/evolution/merge_core.py index 70698f2b..0e73e0d9 100644 --- a/graflo/architecture/evolution/merge_core.py +++ b/graflo/architecture/evolution/merge_core.py @@ -57,6 +57,7 @@ def merge_vertex_models(vertices: list[Vertex], into_name: str) -> Vertex: identity=identity_out, filters=filters_out, description=desc_out, + blank=any(v.blank for v in vertices), ) diff --git a/graflo/architecture/graph_types.py b/graflo/architecture/graph_types.py index a03ee7c2..8ee29760 100644 --- a/graflo/architecture/graph_types.py +++ b/graflo/architecture/graph_types.py @@ -437,6 +437,17 @@ class TransformObservation(ConfigBaseModel): provenance: ProvenancePath +class TransformCastFailure(ConfigBaseModel): + """One transform step that failed during extraction (tolerance mode).""" + + location: LocationIndex + transform_label: str + exception_type: str + message: str + traceback: str = "" + 
nulled_fields: tuple[str, ...] = Field(default_factory=tuple) + + class EdgeIntent(ConfigBaseModel): """Typed edge assembly request emitted during extraction.""" @@ -529,6 +540,10 @@ def _default_edge_intents() -> list[EdgeIntent]: return [] +def _default_transform_failures() -> list[TransformCastFailure]: + return [] + + class ExtractionContext(ConfigBaseModel): """Extraction-phase context. @@ -555,6 +570,9 @@ class ExtractionContext(ConfigBaseModel): default_factory=_default_transform_observations ) edge_intents: list[EdgeIntent] = Field(default_factory=_default_edge_intents) + transform_failures: list[TransformCastFailure] = Field( + default_factory=_default_transform_failures + ) def record_vertex_observation( self, *, vertex_name: str, location: LocationIndex, vertex: dict, ctx: dict @@ -596,6 +614,26 @@ def record_edge_intent( ) ) + def record_transform_failure( + self, + *, + location: LocationIndex, + transform_label: str, + exc: BaseException, + traceback_text: str, + nulled_fields: tuple[str, ...], + ) -> None: + self.transform_failures.append( + TransformCastFailure( + location=location, + transform_label=transform_label, + exception_type=type(exc).__name__, + message=str(exc), + traceback=traceback_text, + nulled_fields=nulled_fields, + ) + ) + class AssemblyContext(ConfigBaseModel): """Assembly-phase context built from extraction outputs.""" @@ -632,6 +670,15 @@ class GraphAssemblyResult(ConfigBaseModel): entities: Any = Field(default_factory=dd_factory) +class ResourceCastResult(ConfigBaseModel): + """Outcome of casting one document through a resource pipeline.""" + + model_config = ConfigDict(arbitrary_types_allowed=True) + + entities: Any + transform_failures: list[TransformCastFailure] = Field(default_factory=list) + + class ActionContext(ExtractionContext): """Backward-compatible extraction+assembly context. 
diff --git a/graflo/architecture/pipeline/runtime/actor/base.py b/graflo/architecture/pipeline/runtime/actor/base.py index 1e5e3532..5104c489 100644 --- a/graflo/architecture/pipeline/runtime/actor/base.py +++ b/graflo/architecture/pipeline/runtime/actor/base.py @@ -5,12 +5,12 @@ from abc import ABC, abstractmethod from dataclasses import dataclass, field -from graflo.architecture.contract.declarations.edge_derivation_registry import ( +from graflo.architecture.contract.runtime.edge_derivation import ( EdgeDerivationRegistry, ) from graflo.architecture.schema.edge import EdgeConfig from graflo.architecture.graph_types import EdgeId, ExtractionContext, LocationIndex -from graflo.architecture.contract.declarations.transform import ProtoTransform +from graflo.architecture.contract.ingestion.transform import ProtoTransform from graflo.architecture.schema.vertex import VertexConfig from graflo.onto import DBType @@ -38,6 +38,7 @@ class ActorInitContext: infer_edge_except: set[EdgeId] = field(default_factory=set) strict_references: bool = False skip_actors_on_missing_input_keys: bool = False + tolerate_transform_errors: bool = True target_db_flavor: DBType | None = None diff --git a/graflo/architecture/pipeline/runtime/actor/config/models.py b/graflo/architecture/pipeline/runtime/actor/config/models.py index 4d932ad7..a60c3eb7 100644 --- a/graflo/architecture/pipeline/runtime/actor/config/models.py +++ b/graflo/architecture/pipeline/runtime/actor/config/models.py @@ -7,7 +7,7 @@ from pydantic import Field as PydanticField, TypeAdapter, model_validator from graflo.architecture.base import ConfigBaseModel -from graflo.architecture.contract.declarations.transform import DressConfig +from graflo.architecture.contract.ingestion.transform import DressConfig from graflo.architecture.edge_derivation import EdgeDerivation from .normalize import normalize_actor_step diff --git a/graflo/architecture/pipeline/runtime/actor/transform.py 
b/graflo/architecture/pipeline/runtime/actor/transform.py index 23cebf6c..2a6fad72 100644 --- a/graflo/architecture/pipeline/runtime/actor/transform.py +++ b/graflo/architecture/pipeline/runtime/actor/transform.py @@ -3,6 +3,7 @@ from __future__ import annotations import logging +import traceback from typing import Any from .base import Actor, ActorInitContext @@ -12,7 +13,7 @@ LocationIndex, TransformPayload, ) -from graflo.architecture.contract.declarations.transform import ( +from graflo.architecture.contract.ingestion.transform import ( KeySelectionConfig, ProtoTransform, Transform, @@ -29,6 +30,7 @@ def __init__(self, config: TransformActorConfig): self.call_use: str | None = None self._call_config = None self._skip_on_missing_input_keys = False + self._tolerate_transform_errors = True self._required_doc_keys: frozenset[str] = frozenset() if config.rename is not None: @@ -86,6 +88,7 @@ def __init__(self, config: TransformActorConfig): def _refresh_missing_key_guard(self, init_ctx: ActorInitContext) -> None: self._skip_on_missing_input_keys = init_ctx.skip_actors_on_missing_input_keys + self._tolerate_transform_errors = init_ctx.tolerate_transform_errors if ( not self._skip_on_missing_input_keys or self.t.target == "keys" @@ -223,6 +226,20 @@ def _extract_doc(self, nargs: tuple[Any, ...], **kwargs: Any) -> dict[str, Any]: def _format_transform_result(self, result: Any) -> TransformPayload: return TransformPayload.from_result(result) + def _transform_label(self) -> str: + if self.call_use: + return self.call_use + if self.t.foo and self.t.module: + return f"{self.t.module}.{self.t.foo}" + if self.t.foo: + return self.t.foo + if self.t.name: + return self.t.name + return type(self.t).__name__ + + def _format_traceback(self, exc: BaseException) -> str: + return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)) + def __call__( self, ctx: ExtractionContext, lindex: LocationIndex, *nargs: Any, **kwargs: Any ) -> ExtractionContext: @@ -231,7 
+248,24 @@ def __call__( if self._skip_on_missing_input_keys and self._required_doc_keys: if not self._required_doc_keys.issubset(doc): return ctx - transform_result = self.t(doc) + try: + transform_result = self.t(doc) + except Exception as exc: + if not self._tolerate_transform_errors: + raise + nulled_fields = self.t.planned_output_field_names(doc) + if nulled_fields: + payload = TransformPayload(named={k: None for k in nulled_fields}) + ctx.transform_buffer[lindex].append(payload) + ctx.record_transform_observation(location=lindex, payload=payload) + ctx.record_transform_failure( + location=lindex, + transform_label=self._transform_label(), + exc=exc, + traceback_text=self._format_traceback(exc), + nulled_fields=nulled_fields, + ) + return ctx _update_doc = self._format_transform_result(transform_result) ctx.transform_buffer[lindex].append(_update_doc) ctx.record_transform_observation(location=lindex, payload=_update_doc) diff --git a/graflo/architecture/pipeline/runtime/actor/vertex.py b/graflo/architecture/pipeline/runtime/actor/vertex.py index c3c64531..ce9e8844 100644 --- a/graflo/architecture/pipeline/runtime/actor/vertex.py +++ b/graflo/architecture/pipeline/runtime/actor/vertex.py @@ -158,14 +158,44 @@ def __call__( agg = [] if self.from_doc: - projected = { - v_f: effective_doc.get(d_f) for v_f, d_f in self.from_doc.items() - } - if any(v is not None for v in projected.values()): - agg.append(projected) + source_keys = set(self.from_doc.values()) + consumed_from_buffer = False + for item in ctx.transform_buffer[lindex]: + if isinstance(item, TransformPayload) and source_keys.issubset( + item.named + ): + projected = { + v_f: item.named[d_f] for v_f, d_f in self.from_doc.items() + } + if any(v is not None for v in projected.values()): + agg.append(projected) + for k in source_keys: + item.named.pop(k, None) + consumed_from_buffer = True + ctx.transform_buffer[lindex] = [ + item + for item in ctx.transform_buffer[lindex] + if not ( + isinstance(item, 
TransformPayload) + and not item.named + and not item.positional + ) + and not (isinstance(item, dict) and not item) + ] + if not consumed_from_buffer: + projected = { + v_f: effective_doc.get(d_f) for v_f, d_f in self.from_doc.items() + } + if any(v is not None for v in projected.values()): + agg.append(projected) + buffer_vertex_keys = tuple(k for k in vertex_keys if k not in self.from_doc) + else: + buffer_vertex_keys = vertex_keys agg.extend( - self._process_transformed_items(ctx, lindex, effective_doc, vertex_keys) + self._process_transformed_items( + ctx, lindex, effective_doc, buffer_vertex_keys + ) ) if self.extraction_scope == "full": diff --git a/graflo/architecture/pipeline/runtime/assemble.py b/graflo/architecture/pipeline/runtime/assemble.py index a84bdf8f..1007c739 100644 --- a/graflo/architecture/pipeline/runtime/assemble.py +++ b/graflo/architecture/pipeline/runtime/assemble.py @@ -5,7 +5,7 @@ from typing import Any from .actor.edge_render import render_edge, render_weights -from graflo.architecture.contract.declarations.edge_derivation_registry import ( +from graflo.architecture.contract.runtime.edge_derivation import ( EdgeDerivationRegistry, ) from graflo.architecture.schema.edge import ( diff --git a/graflo/architecture/schema/edge.py b/graflo/architecture/schema/edge.py index e8e0dfae..aff939c5 100644 --- a/graflo/architecture/schema/edge.py +++ b/graflo/architecture/schema/edge.py @@ -101,7 +101,7 @@ class Edge(ConfigBaseModel): description=( "Edge property names/types (relationship properties). " "Vertex-derived bindings belong in ingestion (:class:`~graflo.architecture.contract." - "declarations.edge_derivation_registry.EdgeDerivationRegistry`)." + "runtime.edge_derivation.EdgeDerivationRegistry`)." 
), ) diff --git a/graflo/architecture/schema/vertex.py b/graflo/architecture/schema/vertex.py index 4eb01fca..41522a1d 100644 --- a/graflo/architecture/schema/vertex.py +++ b/graflo/architecture/schema/vertex.py @@ -303,6 +303,12 @@ class Vertex(ConfigBaseModel): default=None, description="Optional semantic description of the vertex meaning, role, and intended interpretation.", ) + blank: bool = PydanticField( + default=False, + description=( + "True when this vertex has no natural identity and gets an auto-generated ID." + ), + ) @field_validator("properties", mode="before") @classmethod @@ -374,7 +380,6 @@ class VertexConfig(ConfigBaseModel): Attributes: vertices: List of vertex configurations - blank_vertices: List of blank vertex names force_types: Dictionary mapping vertex names to type lists """ @@ -385,10 +390,6 @@ class VertexConfig(ConfigBaseModel): ..., description="List of vertex type definitions (name, properties, identity, filters).", ) - blank_vertices: list[str] = PydanticField( - default_factory=list, - description="Vertex names that may be created without explicit data (e.g. 
placeholders).", - ) force_types: dict[str, list] = PydanticField( default_factory=dict, description="Override mapping: vertex name -> list of field type names for type inference.", @@ -404,27 +405,28 @@ class VertexConfig(ConfigBaseModel): _vertex_numeric_fields_map: dict[str, object] | None = PrivateAttr(default=None) @model_validator(mode="after") - def build_vertices_map_and_validate_blank(self) -> "VertexConfig": + def build_vertices_map(self) -> "VertexConfig": object.__setattr__( self, "_vertices_map", {item.name: item for item in self.vertices}, ) object.__setattr__(self, "_vertex_numeric_fields_map", {}) - if set(self.blank_vertices) - set(self.vertex_set): - raise ValueError( - f" Blank vertices {self.blank_vertices} are not defined as vertices" - ) self._normalize_vertex_identities() return self + @property + def blank_vertices(self) -> list[str]: + """Vertex names marked blank (no natural identity; auto-generated ID).""" + return [v.name for v in self.vertices if v.blank] + def _normalize_vertex_identities( self, ) -> None: blank_id_field = "id" for vertex in self.vertices: if not vertex.identity: - if vertex.name in self.blank_vertices: + if vertex.blank: vertex.identity = [blank_id_field] elif self.identity_from_all_properties: vertex.identity = list(vertex.property_names) @@ -534,8 +536,7 @@ def filters(self, vertex_name) -> list[FilterExpression]: def remove_vertices(self, names: set[str]) -> None: """Remove vertices by name. - Removes vertices from the configuration and from blank_vertices - when present. Mutates the instance in place. + Removes vertices from the configuration. Mutates the instance in place. Args: names: Set of vertex names to remove @@ -546,7 +547,6 @@ def remove_vertices(self, names: set[str]) -> None: m = self._get_vertices_map() for n in names: m.pop(n, None) - self.blank_vertices[:] = [b for b in self.blank_vertices if b not in names] def update_vertex(self, v: Vertex): """Update vertex configuration. 
diff --git a/graflo/db/postgres/resource_mapping.py b/graflo/db/postgres/resource_mapping.py index f40881b4..0502bb8e 100644 --- a/graflo/db/postgres/resource_mapping.py +++ b/graflo/db/postgres/resource_mapping.py @@ -13,7 +13,7 @@ import logging from typing import TYPE_CHECKING, Any -from graflo.architecture.contract.declarations.resource import Resource +from graflo.architecture.contract.ingestion.resource import Resource from graflo.architecture.schema.vertex import VertexConfig from .conn import EdgeTableInfo, SchemaIntrospectionResult from .inference_utils import ( diff --git a/graflo/hq/auto_join.py b/graflo/hq/auto_join.py index 00a3ba78..84d49e3f 100644 --- a/graflo/hq/auto_join.py +++ b/graflo/hq/auto_join.py @@ -13,7 +13,7 @@ from typing import TYPE_CHECKING from graflo.architecture.pipeline.runtime.actor import ActorWrapper, EdgeActor -from graflo.architecture.contract.declarations.resource import Resource +from graflo.architecture.contract.runtime import ResourceRuntime from graflo.filter.onto import ComparisonOperator, FilterExpression from graflo.architecture.contract.bindings import JoinClause, TableConnector @@ -29,7 +29,7 @@ def enrich_edge_connector_with_joins( - resource: Resource, + resource: ResourceRuntime, connector: TableConnector, bindings: Bindings, vertex_config: VertexConfig, diff --git a/graflo/hq/caster.py b/graflo/hq/caster.py index 76ea289d..ab19af1c 100644 --- a/graflo/hq/caster.py +++ b/graflo/hq/caster.py @@ -1,30 +1,14 @@ """Data casting and ingestion system for graph databases. -This module provides functionality for casting and ingesting data into graph databases. -It handles batch processing, file discovery, and database operations for both ArangoDB -and Neo4j. 
- -Key Components: - - Caster: Main class for data casting and ingestion - - FileConnector: Connector matching for file discovery - - Connectors: Collection of file connectors for different resources - -Ingestion paths (:meth:`ingest`, :meth:`ingest_data_sources`, :meth:`process_resource`, -:meth:`process_data_source`, queue workers) all route batches through -:meth:`process_batch` → :meth:`cast_normal_resource`, which loads the named -``Resource`` from the :class:`~graflo.architecture.contract.declarations.ingestion_model.IngestionModel` -and invokes :meth:`~graflo.architecture.contract.declarations.resource.Resource.__call__` per source document. - -Example: - >>> caster = Caster(schema=schema) - >>> caster.ingest(path="data/", conn_conf=db_config) +Orchestration (batching, DB writes, queues) lives in :class:`Caster`. +Pure document casting is delegated to :class:`~graflo.hq.document_caster.DocumentCaster`. """ +from __future__ import annotations + import asyncio -import json import logging import sys -import traceback from pathlib import Path from typing import Any, cast @@ -32,10 +16,10 @@ from suthing import Timer -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel -from graflo.architecture.graph_types import EncodingType, GraphContainer +from graflo.architecture.contract.bindings import Bindings +from graflo.architecture.contract.ingestion import IngestionModel +from graflo.architecture.graph_types import EncodingType from graflo.architecture.schema import Schema -from graflo.architecture.schema.vertex import VertexConfig from graflo.data_source import ( AbstractDataSource, DataSourceFactory, @@ -43,180 +27,38 @@ ) from graflo.db.connection import DBConfig from graflo.hq.bulk_session import BulkSessionCoordinator -from graflo.hq.db_writer import DBWriter -from graflo.hq.registry_builder import RegistryBuilder -from graflo.util.chunker import ChunkerType -from graflo.architecture.contract.bindings import Bindings from 
graflo.hq.connection_provider import ConnectionProvider, EmptyConnectionProvider +from graflo.hq.db_writer import DBWriter from graflo.hq.doc_error_sink import failure_sinks_from_ingestion_params +from graflo.hq.document_caster import DocumentCaster from graflo.hq.ingestion_parameters import ( CastBatchResult, DocCastFailure, DocErrorBudgetExceeded, IngestionParams, ) +from graflo.hq.registry_builder import RegistryBuilder +from graflo.util.chunker import ChunkerType +from graflo.util.data_normalize import normalize_rows logger = logging.getLogger(__name__) -_DOC_CAST_ERROR_TRACEBACK_MAX_CHARS = 16_384 - - -def _filter_graph_container_by_vertices_inplace( - gc: GraphContainer, *, allowed_vertex_names: set[str] | None -) -> None: - """Restrict persistence to a subset of vertex types. - - Mutates *gc* in-place, removing: - - vertex collections whose names are not in *allowed_vertex_names* - - edge collections whose source/target vertex names are not allowed - """ - - if allowed_vertex_names is None: - return - - gc.vertices = { - vcol: items - for vcol, items in gc.vertices.items() - if vcol in allowed_vertex_names - } - gc.edges = { - (vfrom, vto, rel): items - for (vfrom, vto, rel), items in gc.edges.items() - if vfrom in allowed_vertex_names and vto in allowed_vertex_names - } - - -def _identity_value_is_empty(value: Any) -> bool: - return value is None or value == "" - - -def _vertex_doc_has_empty_identity( - doc: dict[str, Any], identity_fields: list[str] -) -> bool: - if not identity_fields: - return False - return all(_identity_value_is_empty(doc.get(field)) for field in identity_fields) - - -def _filter_graph_container_drop_empty_identity_inplace( - gc: GraphContainer, *, vertex_config: VertexConfig -) -> None: - """Remove vertex docs and edge tuples with no usable schema identity. - - Identity rules come from *vertex_config*; :class:`GraphContainer` is unchanged - as a type. Blank vertex collections are skipped (empty identity before DB assign). 
- """ - blank = set(vertex_config.blank_vertices) - vertex_set = vertex_config.vertex_set - - for vcol, docs in list(gc.vertices.items()): - if vcol in blank or vcol not in vertex_set: - continue - id_fields = vertex_config.identity_fields(vcol) - gc.vertices[vcol] = [ - d for d in docs if not _vertex_doc_has_empty_identity(d, id_fields) - ] - - for edge_id, docs in list(gc.edges.items()): - vfrom, vto, _rel = edge_id - if vfrom not in vertex_set or vto not in vertex_set: - continue - if vfrom in blank or vto in blank: - continue - src_ids = vertex_config.identity_fields(vfrom) - tgt_ids = vertex_config.identity_fields(vto) - kept = [ - t - for t in docs - if not _vertex_doc_has_empty_identity(t[0], src_ids) - and not _vertex_doc_has_empty_identity(t[1], tgt_ids) - ] - if kept: - gc.edges[edge_id] = kept - else: - del gc.edges[edge_id] - - -def _format_traceback(exc: BaseException) -> str: - tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)) - if len(tb) > _DOC_CAST_ERROR_TRACEBACK_MAX_CHARS: - return tb[:_DOC_CAST_ERROR_TRACEBACK_MAX_CHARS] + "\n...(traceback truncated)" - return tb - - -def _build_doc_preview( - doc: dict[str, Any], - keys: tuple[str, ...] | None, - max_bytes: int, -) -> Any: - if keys is not None: - preview_obj: Any = {k: doc[k] for k in keys if k in doc} - else: - preview_obj = doc - raw = json.dumps(preview_obj, default=str, sort_keys=True) - encoded = raw.encode("utf-8") - if len(encoded) <= max_bytes: - return json.loads(raw) - cut = raw.encode("utf-8")[:max_bytes].decode("utf-8", errors="replace") - return f"{cut}...(doc preview truncated)" - - -def _doc_failure_from_exception( - *, - resource_name: str, - doc_index: int, - doc: dict[str, Any], - exc: BaseException, - doc_keys: tuple[str, ...] 
| None, - doc_preview_max_bytes: int, -) -> DocCastFailure: - return DocCastFailure( - resource_name=resource_name, - doc_index=doc_index, - exception_type=type(exc).__name__, - message=str(exc), - traceback=_format_traceback(exc), - doc_preview=_build_doc_preview(doc, doc_keys, doc_preview_max_bytes), - ) - class Caster: - """Main class for data casting and ingestion. - - This class handles the process of casting data into graph structures and - ingesting them into the database. It supports batch processing, parallel - execution, and various data formats. - - Attributes: - schema: Schema configuration for the graph - ingestion_params: IngestionParams instance controlling ingestion behavior - """ + """Ingestion orchestrator: cast documents and write graph batches to the database.""" def __init__( self, schema: Schema, ingestion_model: IngestionModel, ingestion_params: IngestionParams | None = None, - **kwargs, ): - """Initialize the caster with schema and configuration. - - Args: - schema: Schema configuration for the graph - ingestion_params: IngestionParams instance with ingestion configuration. 
- If None, creates IngestionParams from kwargs or uses defaults - **kwargs: Additional configuration options (for backward compatibility): - - clear_data: Whether to clear existing data before ingestion - - n_cores: Number of CPU cores/threads to use for parallel processing - - max_items: Maximum number of items to process - - batch_size: Size of batches for processing - - dry: Whether to perform a dry run - """ if ingestion_params is None: - ingestion_params = IngestionParams(**kwargs) + ingestion_params = IngestionParams() self.ingestion_params = ingestion_params self.schema = schema self.ingestion_model = ingestion_model + self._document_caster = DocumentCaster(ingestion_model) self._allowed_vertex_names: set[str] | None = None self._doc_cast_error_total = 0 self._doc_cast_error_io_lock = asyncio.Lock() @@ -225,12 +67,7 @@ def __init__( self._ingest_bindings: Bindings | None = None self._connection_provider: ConnectionProvider = EmptyConnectionProvider() - # ------------------------------------------------------------------ - # Casting - # ------------------------------------------------------------------ - async def _ensure_bulk_session(self, conn_conf: DBConfig) -> str | None: - """Return active native bulk session id, starting one if needed.""" return await self._bulk_coordinator.ensure_session(conn_conf) async def _finalize_bulk_session(self, conn_conf: DBConfig) -> None: @@ -272,89 +109,15 @@ async def _persist_doc_failures(self, failures: list[DocCastFailure]) -> None: async def cast_normal_resource( self, data, resource_name: str | None = None ) -> CastBatchResult: - """Cast data into a graph container using a resource. - - Args: - data: Iterable of documents to cast - resource_name: Optional name of the resource to use - - Returns: - CastBatchResult with graph and any per-document failures (empty when - ``on_doc_error`` is ``fail`` and the batch succeeds). 
- """ - rr = self.ingestion_model.fetch_resource(resource_name) - resolved_name = rr.name - params = self.ingestion_params - - semaphore = asyncio.Semaphore(params.n_cores) - - async def process_doc(doc: dict[str, Any]) -> Any: - async with semaphore: - return await asyncio.to_thread(rr, doc) - - if params.on_doc_error == "fail": - coros = [process_doc(doc) for doc in data] - docs = await asyncio.gather(*coros) - graph = GraphContainer.from_docs_list(docs) - _filter_graph_container_by_vertices_inplace( - graph, allowed_vertex_names=self._allowed_vertex_names - ) - if params.drop_empty_identity_docs: - _filter_graph_container_drop_empty_identity_inplace( - graph, - vertex_config=self.schema.core_schema.vertex_config, - ) - return CastBatchResult(graph=graph, failures=[]) - - doc_list = list(data) - raw = await asyncio.gather( - *[process_doc(doc) for doc in doc_list], - return_exceptions=True, - ) - docs: list[Any] = [] - failures: list[DocCastFailure] = [] - for i, item in enumerate(raw): - doc_raw = doc_list[i] - doc = ( - doc_raw - if isinstance(doc_raw, dict) - else {"_source_repr": repr(doc_raw)} - ) - - if isinstance(item, asyncio.CancelledError): - raise item - if isinstance(item, (KeyboardInterrupt, SystemExit)): - raise item - if isinstance(item, BaseException): - failures.append( - _doc_failure_from_exception( - resource_name=resolved_name, - doc_index=i, - doc=doc, - exc=item, - doc_keys=params.doc_error_preview_keys, - doc_preview_max_bytes=params.doc_error_preview_max_bytes, - ) - ) - continue - docs.append(item) - - await self._persist_doc_failures(failures) - - graph = GraphContainer.from_docs_list(docs) - _filter_graph_container_by_vertices_inplace( - graph, allowed_vertex_names=self._allowed_vertex_names + """Cast data into a graph container using a resource.""" + result = await self._document_caster.cast_batch( + data, + resource_name, + params=self.ingestion_params, + allowed_vertex_names=self._allowed_vertex_names, ) - if 
params.drop_empty_identity_docs: - _filter_graph_container_drop_empty_identity_inplace( - graph, - vertex_config=self.schema.core_schema.vertex_config, - ) - return CastBatchResult(graph=graph, failures=failures) - - # ------------------------------------------------------------------ - # Processing pipeline - # ------------------------------------------------------------------ + await self._persist_doc_failures(result.failures) + return result async def process_batch( self, @@ -362,13 +125,6 @@ async def process_batch( resource_name: str | None, conn_conf: None | DBConfig = None, ): - """Process a batch of data. - - Args: - batch: Batch of data to process - resource_name: Optional name of the resource to use - conn_conf: Optional database connection configuration - """ result = await self.cast_normal_resource(batch, resource_name=resource_name) if result.failures: logger.warning( @@ -396,16 +152,8 @@ async def process_data_source( resource_name: str | None = None, conn_conf: None | DBConfig = None, ): - """Process a data source. - - Args: - data_source: Data source to process - resource_name: Optional name of the resource (overrides data_source.resource_name) - conn_conf: Optional database connection configuration - """ actual_resource_name = resource_name or data_source.resource_name - # Same semantics as AbstractDataSource.iter_batches(limit=...). limit = self.ingestion_params.max_items batch_prefetch = self.ingestion_params.batch_prefetch queue: asyncio.Queue[list[dict] | object] = asyncio.Queue( @@ -475,25 +223,6 @@ async def process_resource( conn_conf: None | DBConfig = None, **kwargs, ): - """Process a resource instance from configuration or direct data. - - This method accepts either: - 1. A configuration dictionary with 'source_type' and data source parameters - 2. A file path (Path or str) - creates FileDataSource - 3. 
In-memory data (list[dict], list[list], or pd.DataFrame) - creates InMemoryDataSource - - Args: - resource_instance: Configuration dict, file path, or in-memory data. - Configuration dict format: - - {"source_type": "file", "path": "data.json"} - - {"source_type": "api", "config": {"url": "https://..."}} - - {"source_type": "sql", "config": {"connection_string": "...", "query": "..."}} - - {"source_type": "in_memory", "data": [...]} - resource_name: Optional name of the resource - conn_conf: Optional database connection configuration - **kwargs: Additional arguments passed to data source creation - (e.g., columns for list[list], encoding for files) - """ if isinstance(resource_instance, dict): config = resource_instance.copy() config.update(kwargs) @@ -529,19 +258,9 @@ async def process_resource( conn_conf=conn_conf, ) - # ------------------------------------------------------------------ - # Queue-based processing - # ------------------------------------------------------------------ - async def process_with_queue( self, tasks: asyncio.Queue, conn_conf: DBConfig | None = None ): - """Process tasks from a queue. - - Args: - tasks: Async queue of tasks to process - conn_conf: Optional database connection configuration - """ SENTINEL = None while True: @@ -569,37 +288,12 @@ async def process_with_queue( tasks.task_done() break - # ------------------------------------------------------------------ - # Normalization utility - # ------------------------------------------------------------------ - @staticmethod def normalize_resource( data: pd.DataFrame | list[list] | list[dict], columns: list[str] | None = None ) -> list[dict]: - """Normalize resource data into a list of dictionaries. 
- - Args: - data: Data to normalize (DataFrame, list of lists, or list of dicts) - columns: Optional column names for list data - - Returns: - list[dict]: Normalized data as list of dictionaries - - Raises: - ValueError: If columns is not provided for list data - """ - if isinstance(data, pd.DataFrame): - columns = data.columns.tolist() - _data = data.values.tolist() - elif data and isinstance(data[0], list): - _data = cast(list[list], data) - if columns is None: - raise ValueError("columns should be set") - else: - return cast(list[dict], data) - rows_dressed = [{k: v for k, v in zip(columns, item)} for item in _data] - return rows_dressed + """Normalize resource data into a list of dictionaries.""" + return normalize_rows(data, columns=columns) async def ingest_data_sources( self, @@ -610,23 +304,11 @@ async def ingest_data_sources( bindings: Bindings | None = None, connection_provider: ConnectionProvider | None = None, ): - """Ingest data from data sources in a registry. - - Note: Schema definition should be handled separately via GraphEngine.define_schema() - before calling this method. - - Args: - data_source_registry: Registry containing data sources mapped to resources - conn_conf: Database connection configuration - ingestion_params: IngestionParams instance with ingestion configuration. - If None, uses default IngestionParams() - bindings: Optional manifest bindings (used to resolve S3 staging proxies). - connection_provider: Runtime credential provider for source connectors and S3. - """ if ingestion_params is None: ingestion_params = IngestionParams() self.ingestion_params = ingestion_params + self._document_caster = DocumentCaster(self.ingestion_model) self._doc_cast_error_total = 0 init_only = ingestion_params.init_only @@ -684,21 +366,6 @@ def ingest( ingestion_params: IngestionParams | None = None, connection_provider: ConnectionProvider | None = None, ): - """Ingest data into the graph database. 
- - This is the main ingestion method that takes: - - Schema: Graph structure (already set in Caster) - - OutputConfig: Target graph database configuration - - Bindings: Mapping of resources to physical data sources - - IngestionParams: Parameters controlling the ingestion process - - Args: - target_db_config: Target database connection configuration (for writing graph) - bindings: Bindings instance mapping resources to data sources - If None, defaults to empty Bindings() - ingestion_params: IngestionParams instance with ingestion configuration. - If None, uses default IngestionParams() - """ bindings = bindings or Bindings() ingestion_params = ingestion_params or IngestionParams() @@ -715,6 +382,7 @@ def ingest( allowed_vertex_names=self._allowed_vertex_names, target_db_flavor=db_flavor, ) + self._document_caster = DocumentCaster(self.ingestion_model) registry = RegistryBuilder(self.schema, self.ingestion_model).build( bindings, @@ -734,17 +402,9 @@ def ingest( ) ) - # ------------------------------------------------------------------ - # Internal helpers - # ------------------------------------------------------------------ - def _resolve_ingestion_scope( self, ingestion_params: IngestionParams ) -> set[str] | None: - """Resolve and validate resource/vertex filters for ingestion. - - Resolution order is resources first, then vertices. 
- """ if ingestion_params.resources is not None: known_resources = set(self.ingestion_model._resources.keys()) requested_resources = set(ingestion_params.resources) @@ -776,7 +436,6 @@ def _resolve_ingestion_scope( return allowed_resource_names def _make_db_writer(self) -> DBWriter: - """Create a :class:`DBWriter` from the current ingestion params.""" max_concurrent = ( self.ingestion_params.max_concurrent_db_ops if self.ingestion_params.max_concurrent_db_ops is not None diff --git a/graflo/hq/db_writer.py b/graflo/hq/db_writer.py index 13f64650..edcb1785 100644 --- a/graflo/hq/db_writer.py +++ b/graflo/hq/db_writer.py @@ -13,7 +13,7 @@ from graflo.architecture.schema.edge import Edge from graflo.architecture.schema import EdgeRuntime, SchemaDBAware -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.graph_types import GraphContainer from graflo.architecture.schema import Schema from graflo.db.connection import DBConfig @@ -101,7 +101,7 @@ def _validate_bulk_resource(self, resource_name: str | None) -> None: if resource_name is None: return resource = self.ingestion_model.fetch_resource(resource_name) - if resource.extra_weights: + if resource.config.extra_weights: raise ValueError( "Native bulk ingest does not support resources with extra_weights " "(those require DB round-trips). Use REST ingest or disable extra_weights." 
@@ -215,7 +215,7 @@ async def _enrich_extra_weights( def _sync(): with ConnectionManager(connection_config=conn_conf) as db: - for entry in resource.extra_weights: + for entry in resource.config.extra_weights: edge = entry.edge if not entry.vertex_weights: continue diff --git a/graflo/hq/document_caster.py b/graflo/hq/document_caster.py new file mode 100644 index 00000000..a2ddf831 --- /dev/null +++ b/graflo/hq/document_caster.py @@ -0,0 +1,300 @@ +"""Stateless document-to-graph casting (no I/O).""" + +from __future__ import annotations + +import asyncio +import json +import traceback +from collections.abc import Iterable +from typing import Any, Literal + +from graflo.architecture.contract.ingestion import IngestionModel +from graflo.architecture.contract.runtime import ResourceRuntime +from graflo.architecture.graph_types import ( + GraphContainer, + ResourceCastResult, + TransformCastFailure, +) +from graflo.architecture.schema.vertex import VertexConfig +from graflo.hq.ingestion_parameters import ( + CastBatchResult, + DocCastFailure, + IngestionParams, +) + +_DOC_CAST_ERROR_TRACEBACK_MAX_CHARS = 16_384 + + +def cast_vertex_filter( + resource_vertex_names: set[str], + *, + allowed_vertex_names: set[str] | None, +) -> set[str]: + """Vertex names to retain after casting for a single resource.""" + if allowed_vertex_names is None: + return resource_vertex_names + return resource_vertex_names & allowed_vertex_names + + +def filter_graph_container_by_vertices_inplace( + gc: GraphContainer, *, allowed_vertex_names: set[str] | None +) -> None: + """Restrict persistence to a subset of vertex types (in-place).""" + if allowed_vertex_names is None: + return + gc.vertices = { + vcol: items + for vcol, items in gc.vertices.items() + if vcol in allowed_vertex_names + } + gc.edges = { + (vfrom, vto, rel): items + for (vfrom, vto, rel), items in gc.edges.items() + if vfrom in allowed_vertex_names and vto in allowed_vertex_names + } + + +def _identity_value_is_empty(value: 
Any) -> bool: + return value is None or value == "" + + +def _vertex_doc_has_empty_identity( + doc: dict[str, Any], identity_fields: list[str] +) -> bool: + if not identity_fields: + return False + return all(_identity_value_is_empty(doc.get(field)) for field in identity_fields) + + +def filter_graph_container_drop_empty_identity_inplace( + gc: GraphContainer, *, vertex_config: VertexConfig +) -> None: + """Remove vertex docs and edge tuples with no usable schema identity.""" + blank = set(vertex_config.blank_vertices) + vertex_set = vertex_config.vertex_set + + for vcol, docs in list(gc.vertices.items()): + if vcol in blank or vcol not in vertex_set: + continue + id_fields = vertex_config.identity_fields(vcol) + gc.vertices[vcol] = [ + d for d in docs if not _vertex_doc_has_empty_identity(d, id_fields) + ] + + for edge_id, docs in list(gc.edges.items()): + vfrom, vto, _rel = edge_id + if vfrom not in vertex_set or vto not in vertex_set: + continue + if vfrom in blank or vto in blank: + continue + src_ids = vertex_config.identity_fields(vfrom) + tgt_ids = vertex_config.identity_fields(vto) + kept = [ + t + for t in docs + if not _vertex_doc_has_empty_identity(t[0], src_ids) + and not _vertex_doc_has_empty_identity(t[1], tgt_ids) + ] + if kept: + gc.edges[edge_id] = kept + else: + del gc.edges[edge_id] + + +def _format_traceback(exc: BaseException) -> str: + tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)) + if len(tb) > _DOC_CAST_ERROR_TRACEBACK_MAX_CHARS: + return tb[:_DOC_CAST_ERROR_TRACEBACK_MAX_CHARS] + "\n...(traceback truncated)" + return tb + + +def _build_doc_preview( + doc: dict[str, Any], + keys: tuple[str, ...] 
| None, + max_bytes: int, +) -> Any: + if keys is not None: + preview_obj: Any = {k: doc[k] for k in keys if k in doc} + else: + preview_obj = doc + raw = json.dumps(preview_obj, default=str, sort_keys=True) + encoded = raw.encode("utf-8") + if len(encoded) <= max_bytes: + return json.loads(raw) + cut = raw.encode("utf-8")[:max_bytes].decode("utf-8", errors="replace") + return f"{cut}...(doc preview truncated)" + + +def _doc_failure_from_exception( + *, + resource_name: str, + doc_index: int, + doc: dict[str, Any], + exc: BaseException, + doc_keys: tuple[str, ...] | None, + doc_preview_max_bytes: int, +) -> DocCastFailure: + return DocCastFailure( + resource_name=resource_name, + doc_index=doc_index, + exception_type=type(exc).__name__, + message=str(exc), + traceback=_format_traceback(exc), + doc_preview=_build_doc_preview(doc, doc_keys, doc_preview_max_bytes), + ) + + +def _doc_failure_from_transform( + *, + resource_name: str, + doc_index: int, + doc: dict[str, Any], + fail: TransformCastFailure, + doc_keys: tuple[str, ...] | None, + doc_preview_max_bytes: int, +) -> DocCastFailure: + tb = fail.traceback + if len(tb) > _DOC_CAST_ERROR_TRACEBACK_MAX_CHARS: + tb = tb[:_DOC_CAST_ERROR_TRACEBACK_MAX_CHARS] + "\n...(traceback truncated)" + return DocCastFailure( + resource_name=resource_name, + doc_index=doc_index, + failure_kind="transform", + exception_type=fail.exception_type, + message=fail.message, + traceback=tb, + doc_preview=_build_doc_preview(doc, doc_keys, doc_preview_max_bytes), + location_path=fail.location.path, + transform_label=fail.transform_label, + nulled_fields=fail.nulled_fields, + ) + + +def _transform_failures_to_doc_cast_failures( + *, + resource_name: str, + doc_index: int, + doc: dict[str, Any], + transform_failures: list[TransformCastFailure], + doc_keys: tuple[str, ...] 
| None, + doc_preview_max_bytes: int, +) -> list[DocCastFailure]: + return [ + _doc_failure_from_transform( + resource_name=resource_name, + doc_index=doc_index, + doc=doc, + fail=fail, + doc_keys=doc_keys, + doc_preview_max_bytes=doc_preview_max_bytes, + ) + for fail in transform_failures + ] + + +def _coerce_doc(doc_raw: Any) -> dict[str, Any]: + if isinstance(doc_raw, dict): + return doc_raw + return {"_source_repr": repr(doc_raw)} + + +class DocumentCaster: + """Cast source documents to :class:`GraphContainer` via ingestion resources.""" + + def __init__(self, ingestion_model: IngestionModel) -> None: + self.ingestion_model = ingestion_model + + async def cast_batch( + self, + data: Iterable[Any], + resource_name: str | None, + *, + params: IngestionParams, + allowed_vertex_names: set[str] | None = None, + ) -> CastBatchResult: + runtime = self.ingestion_model.fetch_resource(resource_name) + resolved_name = runtime.name + vertex_filter = cast_vertex_filter( + runtime.collect_vertex_names(), + allowed_vertex_names=allowed_vertex_names, + ) + + doc_list = list(data) + cast_results, failures = await self._gather_cast_results( + runtime, + doc_list, + on_doc_error=params.on_doc_error, + resolved_name=resolved_name, + params=params, + ) + + graph = GraphContainer.from_docs_list( + [r.entities for r in cast_results if isinstance(r, ResourceCastResult)] + ) + filter_graph_container_by_vertices_inplace( + graph, allowed_vertex_names=vertex_filter + ) + if params.drop_empty_identity_docs: + filter_graph_container_drop_empty_identity_inplace( + graph, + vertex_config=runtime.vertex_config, + ) + return CastBatchResult(graph=graph, failures=failures) + + async def _gather_cast_results( + self, + runtime: ResourceRuntime, + doc_list: list[Any], + *, + on_doc_error: Literal["fail", "skip"], + resolved_name: str, + params: IngestionParams, + ) -> tuple[list[ResourceCastResult | BaseException], list[DocCastFailure]]: + semaphore = asyncio.Semaphore(params.n_cores) + + async 
def process_doc(doc: dict[str, Any]) -> ResourceCastResult: + async with semaphore: + return await asyncio.to_thread(runtime.cast_document, doc) + + if on_doc_error == "fail": + raw = await asyncio.gather( + *[process_doc(_coerce_doc(doc)) for doc in doc_list] + ) + else: + raw = await asyncio.gather( + *[process_doc(_coerce_doc(doc)) for doc in doc_list], + return_exceptions=True, + ) + + cast_results: list[ResourceCastResult | BaseException] = [] + failures: list[DocCastFailure] = [] + for i, item in enumerate(raw): + doc = _coerce_doc(doc_list[i]) + if isinstance(item, asyncio.CancelledError): + raise item + if isinstance(item, (KeyboardInterrupt, SystemExit)): + raise item + if isinstance(item, BaseException): + failures.append( + _doc_failure_from_exception( + resource_name=resolved_name, + doc_index=i, + doc=doc, + exc=item, + doc_keys=params.doc_error_preview_keys, + doc_preview_max_bytes=params.doc_error_preview_max_bytes, + ) + ) + continue + failures.extend( + _transform_failures_to_doc_cast_failures( + resource_name=resolved_name, + doc_index=i, + doc=doc, + transform_failures=item.transform_failures, + doc_keys=params.doc_error_preview_keys, + doc_preview_max_bytes=params.doc_error_preview_max_bytes, + ) + ) + cast_results.append(item) + return cast_results, failures diff --git a/graflo/hq/graph_engine.py b/graflo/hq/graph_engine.py index 6b670566..fab1a5d8 100644 --- a/graflo/hq/graph_engine.py +++ b/graflo/hq/graph_engine.py @@ -9,7 +9,7 @@ import logging from graflo.architecture.contract.manifest import GraphManifest -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.schema import Schema from graflo.onto import DBType from graflo.architecture.onto_sql import SchemaIntrospectionResult diff --git a/graflo/hq/ingestion_parameters.py b/graflo/hq/ingestion_parameters.py index 6652010c..98d634b1 100644 --- 
a/graflo/hq/ingestion_parameters.py +++ b/graflo/hq/ingestion_parameters.py @@ -41,6 +41,7 @@ class DocCastFailure(BaseModel): doc_index: int exception_type: str message: str + failure_kind: Literal["document", "transform"] = "document" traceback: str = Field( default="", description="Formatted traceback, truncated to the configured max length.", @@ -49,6 +50,18 @@ class DocCastFailure(BaseModel): default=None, description="Subset or truncated JSON of the source document for debugging.", ) + location_path: tuple[str | int | None, ...] | None = Field( + default=None, + description="Extraction location path when failure_kind is transform.", + ) + transform_label: str | None = Field( + default=None, + description="Transform name or module.foo when failure_kind is transform.", + ) + nulled_fields: tuple[str, ...] | None = Field( + default=None, + description="Output fields set to None when failure_kind is transform.", + ) class CastBatchResult(BaseModel): diff --git a/graflo/hq/rdf_inferencer.py b/graflo/hq/rdf_inferencer.py index 64690fa5..f1533f41 100644 --- a/graflo/hq/rdf_inferencer.py +++ b/graflo/hq/rdf_inferencer.py @@ -27,8 +27,8 @@ from graflo.architecture.schema.edge import Edge, EdgeConfig from graflo.architecture.database_features import DatabaseProfile -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel -from graflo.architecture.contract.declarations.resource import Resource +from graflo.architecture.contract.ingestion import IngestionModel +from graflo.architecture.contract.ingestion.resource import Resource from graflo.architecture.schema import ( CoreSchema, GraphMetadata, diff --git a/graflo/hq/registry_builder.py b/graflo/hq/registry_builder.py index 4bcde9a7..5b1f80eb 100644 --- a/graflo/hq/registry_builder.py +++ b/graflo/hq/registry_builder.py @@ -11,7 +11,7 @@ from pathlib import Path from typing import TYPE_CHECKING -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from 
graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.schema import Schema from graflo.data_source import DataSourceFactory, DataSourceRegistry from graflo.data_source.sql import SQLConfig, SQLDataSource diff --git a/graflo/hq/sql_inferencer.py b/graflo/hq/sql_inferencer.py index 96c01a73..24966f27 100644 --- a/graflo/hq/sql_inferencer.py +++ b/graflo/hq/sql_inferencer.py @@ -10,7 +10,7 @@ from dataclasses import dataclass from graflo.architecture import Resource -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.onto_sql import SchemaIntrospectionResult from graflo.architecture.schema import Schema from graflo.db.postgres.conn import PostgresConnection diff --git a/graflo/migrate/io.py b/graflo/migrate/io.py index 8387fc23..074726ad 100644 --- a/graflo/migrate/io.py +++ b/graflo/migrate/io.py @@ -10,7 +10,7 @@ from suthing import FileHandle from graflo.architecture.contract.manifest import GraphManifest -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.schema import Schema diff --git a/graflo/plot/plotter.py b/graflo/plot/plotter.py index 5ef4d8bb..d8982a75 100644 --- a/graflo/plot/plotter.py +++ b/graflo/plot/plotter.py @@ -367,9 +367,8 @@ def _discover_edges_from_resources( relation_source_by_edge_id: dict[EdgeId, str] = {} relation_from_key_by_edge_id: dict[EdgeId, bool] = {} - for resource in self.ingestion_model.resources: - # Collect all actors from the resource's ActorWrapper - actors = resource.root.collect_actors() + for resource_config in self.ingestion_model.resources: + actors = ActorWrapper(*resource_config.pipeline).collect_actors() for actor in actors: if isinstance(actor, EdgeActor): @@ -705,12 +704,12 @@ def plot_resources(self): ) kwargs = {"vertex_sh": 
vertex_prefix_dict, "resource_sh": resource_prefix_dict} - for resource in self.ingestion_model.resources: - kwargs["resource"] = resource.name + for resource_config in self.ingestion_model.resources: + kwargs["resource"] = resource_config.name assemble_tree( - resource.root, + ActorWrapper(*resource_config.pipeline), self._figure_path( - f"{self.schema.metadata.name}.resource-{resource.name}" + f"{self.schema.metadata.name}.resource-{resource_config.name}" ), output_format=self.output_format, output_dpi=self.output_dpi, @@ -757,7 +756,7 @@ def _extract_resource_vertex_reasons(self, resource) -> dict[str, set[str]]: """Collect vertex references for a resource with lightweight reason labels.""" vertex_reasons: dict[str, set[str]] = {} known_vertices = set(self.schema.core_schema.vertex_config.vertex_set) - actors = resource.root.collect_actors() + actors = ActorWrapper(*resource.pipeline).collect_actors() def _add(vertex_name: str, reason: str) -> None: if vertex_name not in known_vertices: diff --git a/graflo/util/casting.py b/graflo/util/casting.py new file mode 100644 index 00000000..fc59c9ff --- /dev/null +++ b/graflo/util/casting.py @@ -0,0 +1,58 @@ +"""Safe document field type casting for ingestion resources.""" + +from __future__ import annotations + +import builtins +from typing import Any, Callable + +SAFE_TYPE_CASTERS: dict[str, Callable[..., Any]] = { + "str": str, + "int": int, + "float": float, + "bool": bool, + "bytes": bytes, + "list": list, + "dict": dict, + "tuple": tuple, + "set": set, +} + + +def resolve_type_caster(name: str) -> Callable[..., Any] | None: + """Resolve a type caster by name from a strict allowlist.""" + if not isinstance(name, str): + return None + candidate = SAFE_TYPE_CASTERS.get(name) + if candidate is not None: + return candidate + if "." 
in name: + module_name, attr_name = name.split(".", 1) + if module_name == "builtins": + builtin_attr = getattr(builtins, attr_name, None) + if callable(builtin_attr) and attr_name in SAFE_TYPE_CASTERS: + return SAFE_TYPE_CASTERS[attr_name] + return None + + +def resolve_type_casters( + types: dict[str, str], +) -> dict[str, Callable[..., Any]]: + """Resolve declared field types to callables, skipping unknown names.""" + resolved: dict[str, Callable[..., Any]] = {} + for field_name, type_name in types.items(): + caster = resolve_type_caster(type_name) + if caster is not None: + resolved[field_name] = caster + return resolved + + +def apply_type_casters( + doc: dict[str, Any], casters: dict[str, Callable[..., Any]] +) -> dict[str, Any]: + """Apply configured type casters to top-level document fields in place.""" + if not casters: + return doc + for field_name, caster in casters.items(): + if field_name in doc: + doc[field_name] = caster(doc[field_name]) + return doc diff --git a/graflo/util/data_normalize.py b/graflo/util/data_normalize.py new file mode 100644 index 00000000..7890604b --- /dev/null +++ b/graflo/util/data_normalize.py @@ -0,0 +1,23 @@ +"""Normalize tabular input into document rows.""" + +from __future__ import annotations + +from typing import cast + +import pandas as pd + + +def normalize_rows( + data: pd.DataFrame | list[list] | list[dict], columns: list[str] | None = None +) -> list[dict]: + """Normalize resource data into a list of dictionaries.""" + if isinstance(data, pd.DataFrame): + columns = data.columns.tolist() + _data = data.values.tolist() + elif data and isinstance(data[0], list): + _data = cast(list[list], data) + if columns is None: + raise ValueError("columns should be set") + else: + return cast(list[dict], data) + return [{k: v for k, v in zip(columns, item)} for item in _data] diff --git a/pyproject.toml b/pyproject.toml index 21f7f583..8bae0b20 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -40,7 +40,7 @@ description = "A 
framework for transforming tabular (CSV, SQL) and hierarchical name = "graflo" readme = "README.md" requires-python = ">=3.11" -version = "1.7.29" +version = "1.7.30" [project.optional-dependencies] dev = [ diff --git a/test/architecture/_test_blank_vertices.py b/test/architecture/_test_blank_vertices.py index 7c3deb12..c9e8e353 100644 --- a/test/architecture/_test_blank_vertices.py +++ b/test/architecture/_test_blank_vertices.py @@ -15,10 +15,9 @@ @pytest.fixture() def schema_ibes_vertices(): tc = yaml.safe_load(""" - blank_vertices: - - publication vertices: - name: publication + blank: true fields: - datetime_review - datetime_announce diff --git a/test/architecture/test_actor.py b/test/architecture/test_actor.py index b2f4b0ff..ed80b361 100644 --- a/test/architecture/test_actor.py +++ b/test/architecture/test_actor.py @@ -11,13 +11,19 @@ VertexActor, ) from graflo.architecture.schema.edge import EdgeConfig -from graflo.architecture.graph_types import ActionContext, LocationIndex, VertexRep +from graflo.architecture.graph_types import ( + ActionContext, + ExtractionContext, + LocationIndex, + TransformPayload, + VertexRep, +) from graflo.architecture.pipeline.runtime.actor.config import ( VertexActorConfig, normalize_actor_step, validate_actor_step, ) -from graflo.architecture.contract.declarations.transform import ( +from graflo.architecture.contract.ingestion.transform import ( DressConfig, KeySelectionConfig, ProtoTransform, @@ -1322,3 +1328,64 @@ def test_extraction_context_records_observations( assert all( obs.provenance.path == obs.location.path for obs in ctx.transform_observations ) + + +def test_vertex_from_doc_does_not_steal_other_vertex_buffer_payloads() -> None: + """from_doc Identifier must not consume pivot payloads meant for Metric.""" + vc = VertexConfig.model_validate( + { + "vertices": [ + { + "name": "Identifier", + "properties": ["type", "value"], + "identity": ["type", "value"], + }, + { + "name": "Metric", + "properties": ["type", "value"], 
+ "identity": ["type", "value"], + }, + ] + } + ) + init = ActorInitContext( + vertex_config=vc, + edge_config=EdgeConfig(), + transforms={}, + ) + identifier = VertexActor.from_config( + VertexActorConfig( + type="vertex", + vertex="Identifier", + from_doc={"type": "itype", "value": "ivalue"}, + ) + ) + metric = VertexActor.from_config( + VertexActorConfig(type="vertex", vertex="Metric"), + ) + identifier.finish_init(init) + metric.finish_init(init) + + loc = LocationIndex(()) + ctx = ExtractionContext() + ctx.transform_buffer[loc].extend( + [ + TransformPayload(named={"type": "VOL", "value": 93115.0}), + TransformPayload(named={"type": "PRC", "value": 42.5}), + TransformPayload(named={"itype": "CUSIP", "ivalue": "03073T10"}), + TransformPayload(named={"itype": "TICKER", "ivalue": "AMGP"}), + ] + ) + + identifier(ctx, loc, doc={}) + metric(ctx, loc, doc={}) + + id_docs = [rep.vertex for rep in ctx.acc_vertex["Identifier"][loc]] + metric_docs = [rep.vertex for rep in ctx.acc_vertex["Metric"][loc]] + + assert len(id_docs) == 2 + assert {"type": "TICKER", "value": "AMGP"} in id_docs + assert {"type": "CUSIP", "value": "03073T10"} in id_docs + assert len(metric_docs) == 2 + assert {"type": "VOL", "value": 93115.0} in metric_docs + assert {"type": "PRC", "value": 42.5} in metric_docs diff --git a/test/architecture/test_edge.py b/test/architecture/test_edge.py index be883c26..677a2c55 100644 --- a/test/architecture/test_edge.py +++ b/test/architecture/test_edge.py @@ -12,7 +12,7 @@ from graflo.architecture.schema import EdgeConfigDBAware, VertexConfigDBAware from graflo.architecture.graph_types import Weight from graflo.architecture.schema.vertex import VertexConfig -from graflo.architecture.contract.declarations.edge_derivation_registry import ( +from graflo.architecture.contract.runtime.edge_derivation import ( EdgeDerivationRegistry, ) from graflo.onto import DBType diff --git a/test/architecture/test_evolution_sanitize.py 
b/test/architecture/test_evolution_sanitize.py index 9c8dc2e8..c5b38eb2 100644 --- a/test/architecture/test_evolution_sanitize.py +++ b/test/architecture/test_evolution_sanitize.py @@ -38,7 +38,6 @@ def _build_manifest( Vertex(name="users", properties=user_props, identity=identity), Vertex(name="orders", properties=[Field(name="id")], identity=["id"]), ], - blank_vertices=[], force_types={}, ) ec = EdgeConfig(edges=[Edge(source="users", target="orders", relation=None)]) @@ -498,7 +497,6 @@ def _build_multi_relation_manifest( identity=["tid"], ), ], - blank_vertices=[], force_types={}, ) ec = EdgeConfig( diff --git a/test/architecture/test_manifest_canonical_contract.py b/test/architecture/test_manifest_canonical_contract.py index 40571be3..8f57fa8d 100644 --- a/test/architecture/test_manifest_canonical_contract.py +++ b/test/architecture/test_manifest_canonical_contract.py @@ -1,5 +1,5 @@ from graflo.architecture.contract.bindings import Bindings -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.contract.manifest import GraphManifest from graflo.architecture.schema import Schema from graflo.hq.caster import IngestionParams @@ -123,8 +123,8 @@ def test_resource_finish_init_does_not_mutate_shared_schema_edge_config() -> Non # Shared logical schema stays untouched. assert len(schema.core_schema.edge_config.edges) == 0 # Runtime resource edge configs receive local dynamic edge registrations. 
- assert len(ingestion_model.resources[0].edge_config.edges) == 1 - assert len(ingestion_model.resources[1].edge_config.edges) == 1 + assert len(ingestion_model.fetch_resource("r1").edge_config.edges) == 1 + assert len(ingestion_model.fetch_resource("r2").edge_config.edges) == 1 def test_bindings_reject_inline_credentials_payload() -> None: diff --git a/test/architecture/test_manifest_evolution.py b/test/architecture/test_manifest_evolution.py index 45708137..4c700e8a 100644 --- a/test/architecture/test_manifest_evolution.py +++ b/test/architecture/test_manifest_evolution.py @@ -26,7 +26,6 @@ def _minimal_manifest() -> GraphManifest: Vertex(name="a", properties=[Field(name="id")], identity=["id"]), Vertex(name="b", properties=[Field(name="id")], identity=["id"]), ], - blank_vertices=[], force_types={}, ) ec = EdgeConfig( diff --git a/test/architecture/test_manifest_rename.py b/test/architecture/test_manifest_rename.py index 251fdb3d..cd257b9a 100644 --- a/test/architecture/test_manifest_rename.py +++ b/test/architecture/test_manifest_rename.py @@ -23,9 +23,13 @@ def _sample_manifest_payload() -> dict: "vertex_config": { "vertices": [ {"name": "person", "identity": ["id"], "properties": ["id"]}, - {"name": "company", "identity": ["id"], "properties": ["id"]}, + { + "name": "company", + "identity": ["id"], + "properties": ["id"], + "blank": True, + }, ], - "blank_vertices": ["company"], "force_types": {"person": ["STRING"]}, }, "edge_config": { diff --git a/test/architecture/test_resource.py b/test/architecture/test_resource.py index 8b80964a..04906ea4 100644 --- a/test/architecture/test_resource.py +++ b/test/architecture/test_resource.py @@ -5,37 +5,54 @@ from graflo.architecture.graph_types import ExtractionContext from graflo.architecture.schema.edge import EdgeConfig -from graflo.architecture.contract.declarations.resource import ( - Resource, - _resolve_type_caster, +from graflo.architecture.contract.ingestion.resource import Resource +from 
graflo.architecture.contract.runtime import ( + ResourceRuntime, + build_resource_runtime, ) from graflo.architecture.schema.vertex import VertexConfig +from graflo.util.casting import resolve_type_caster logger = logging.getLogger(__name__) +def _runtime( + data: dict[str, Any], + vertex_config: VertexConfig, + edge_config: EdgeConfig, + transforms: dict | None = None, + **kwargs: Any, +) -> ResourceRuntime: + config = Resource.from_dict(data) + return build_resource_runtime( + config, + vertex_config, + edge_config, + transforms or {}, + **kwargs, + ) + + def test_schema_tree(schema): sch = schema("kg") mn = Resource.from_dict(sch["ingestion_model"]["resources"][0]) - assert mn.count() == 14 + assert mn.pipeline_actor_count() == 14 def test_resolve_type_caster_allowlist(): - assert _resolve_type_caster("int") is int - assert _resolve_type_caster("float") is float - assert _resolve_type_caster("builtins.str") is str + assert resolve_type_caster("int") is int + assert resolve_type_caster("float") is float + assert resolve_type_caster("builtins.str") is str def test_resolve_type_caster_rejects_expressions(): - assert _resolve_type_caster("__import__('os').system") is None + assert resolve_type_caster("__import__('os').system") is None def test_resource_drop_trivial_input_fields_strips_none_and_empty_string(): - from graflo.architecture.contract.declarations.resource import ( - _strip_trivial_top_level_fields, - ) + from graflo.architecture.contract.runtime import strip_trivial_top_level_fields - assert _strip_trivial_top_level_fields( + assert strip_trivial_top_level_fields( {"a": 1, "b": None, "c": "", "d": "x", "nested": {"e": None}} ) == {"a": 1, "d": "x", "nested": {"e": None}} @@ -43,13 +60,6 @@ def test_resource_drop_trivial_input_fields_strips_none_and_empty_string(): def test_resource_drop_trivial_input_fields_passes_stripped_doc_to_executor( monkeypatch: pytest.MonkeyPatch, ) -> None: - resource = Resource.from_dict( - { - "name": "wide_row", - "pipeline": 
[{"vertex": "person"}], - "drop_trivial_input_fields": True, - } - ) vc = VertexConfig.from_dict( { "vertices": [ @@ -62,7 +72,15 @@ def test_resource_drop_trivial_input_fields_passes_stripped_doc_to_executor( } ) ec = EdgeConfig.from_dict({"edges": []}) - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) + resource = _runtime( + { + "name": "wide_row", + "pipeline": [{"vertex": "person"}], + "drop_trivial_input_fields": True, + }, + vc, + ec, + ) doc = {"id": "1", "note": "hi", "empty": "", "nullish": None, "keep": 0} real_extract = resource._executor.extract snapshots: list[dict[str, Any]] = [] @@ -81,13 +99,6 @@ def capturing_extract(work: dict[str, Any]) -> ExtractionContext: def test_resource_drop_trivial_input_fields_false_passes_doc_unchanged( monkeypatch: pytest.MonkeyPatch, ) -> None: - resource = Resource.from_dict( - { - "name": "wide_row", - "pipeline": [{"vertex": "person"}], - "drop_trivial_input_fields": False, - } - ) vc = VertexConfig.from_dict( { "vertices": [ @@ -100,7 +111,15 @@ def test_resource_drop_trivial_input_fields_false_passes_doc_unchanged( } ) ec = EdgeConfig.from_dict({"edges": []}) - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) + resource = _runtime( + { + "name": "wide_row", + "pipeline": [{"vertex": "person"}], + "drop_trivial_input_fields": False, + }, + vc, + ec, + ) doc = {"id": "1", "empty": ""} expected_at_extract_entry = dict(doc) real_extract = resource._executor.extract @@ -120,7 +139,19 @@ def capturing_extract(work: dict[str, Any]) -> ExtractionContext: def test_resource_skip_actors_on_missing_input_keys_true_skips_missing_transform() -> ( None ): - resource = Resource.from_dict( + vc = VertexConfig.from_dict( + { + "vertices": [ + { + "name": "person", + "properties": ["id", "age"], + "identity": ["id"], + } + ] + } + ) + ec = EdgeConfig.from_dict({"edges": []}) + resource = _runtime( { "name": "skip_missing_transform", "pipeline": [ @@ -137,8 +168,18 @@ def 
test_resource_skip_actors_on_missing_input_keys_true_skips_missing_transform {"vertex": "person", "from": {"id": "id"}}, ], "skip_actors_on_missing_input_keys": True, - } + }, + vc, + ec, ) + + entities = resource({"id": "u-1"}) + assert entities["person"] == [{"id": "u-1"}] + + +def test_resource_drop_trivial_input_fields_large_doc_auto_skips_missing_transform( + monkeypatch: pytest.MonkeyPatch, +) -> None: vc = VertexConfig.from_dict( { "vertices": [ @@ -151,17 +192,7 @@ def test_resource_skip_actors_on_missing_input_keys_true_skips_missing_transform } ) ec = EdgeConfig.from_dict({"edges": []}) - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) - - # missing_age is absent, transform should be skipped (not raise KeyError) - entities = resource({"id": "u-1"}) - assert entities["person"] == [{"id": "u-1"}] - - -def test_resource_drop_trivial_input_fields_large_doc_auto_skips_missing_transform( - monkeypatch: pytest.MonkeyPatch, -) -> None: - resource = Resource.from_dict( + resource = _runtime( { "name": "wide_row", "pipeline": [ @@ -178,21 +209,10 @@ def test_resource_drop_trivial_input_fields_large_doc_auto_skips_missing_transfo {"vertex": "person", "from": {"id": "id"}}, ], "drop_trivial_input_fields": True, - } + }, + vc, + ec, ) - vc = VertexConfig.from_dict( - { - "vertices": [ - { - "name": "person", - "properties": ["id", "age"], - "identity": ["id"], - } - ] - } - ) - ec = EdgeConfig.from_dict({"edges": []}) - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) large_doc: dict[str, Any] = {f"empty_{i}": "" for i in range(1000)} large_doc.update({"id": "u-2", "age_raw": "", "keep_zero": 0}) @@ -205,7 +225,6 @@ def capturing_extract(work: dict[str, Any]) -> ExtractionContext: monkeypatch.setattr(resource._executor, "extract", capturing_extract) - # age_raw is stripped as trivial; auto-enabled missing-key skip should prevent failure. 
entities = resource(large_doc) assert entities["person"] == [{"id": "u-2"}] assert snapshots and "age_raw" not in snapshots[0] @@ -213,15 +232,22 @@ def capturing_extract(work: dict[str, Any]) -> ExtractionContext: def test_resource_types_uses_safe_caster_resolution(): - resource = Resource.from_dict( + config = Resource.from_dict( { "name": "typed_resource", "pipeline": [{"vertex": "person"}], "types": {"age": "int", "unsafe": "__import__('os').system"}, } ) - assert resource._types["age"] is int - assert "unsafe" not in resource._types + runtime = build_resource_runtime( + config, + VertexConfig.from_dict( + {"vertices": [{"name": "person", "properties": ["id"], "identity": ["id"]}]} + ), + EdgeConfig.from_dict({"edges": []}), + ) + assert runtime.type_casters["age"] is int + assert "unsafe" not in runtime.type_casters def test_resource_infer_edge_selectors_are_mutually_exclusive(): @@ -237,43 +263,44 @@ def test_resource_infer_edge_selectors_are_mutually_exclusive(): def test_resource_infer_edge_selector_references_unknown_edge(): - resource = Resource.from_dict( - { - "name": "typed_resource", - "pipeline": [{"vertex": "person"}], - "infer_edge_only": [{"source": "a", "target": "b"}], - } - ) vc = VertexConfig.from_dict( {"vertices": [{"name": "person", "properties": ["id"], "identity": ["id"]}]} ) ec = EdgeConfig.from_dict({"edges": [{"source": "person", "target": "person"}]}) with pytest.raises(ValueError, match="undefined vertices"): - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) + _runtime( + { + "name": "typed_resource", + "pipeline": [{"vertex": "person"}], + "infer_edge_only": [{"source": "a", "target": "b"}], + }, + vc, + ec, + ) def test_resource_dynamic_edge_vertices_must_be_declared(): - resource = Resource.from_dict( - { - "name": "dynamic_edges", - "pipeline": [ - {"vertex": "person"}, - {"edge": {"from": "person", "to": "company"}}, - ], - } - ) vc = VertexConfig.from_dict( {"vertices": [{"name": "person", "properties": 
["id"], "identity": ["id"]}]} ) ec = EdgeConfig.from_dict({"edges": []}) with pytest.raises(ValueError, match="undefined vertices"): - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) + _runtime( + { + "name": "dynamic_edges", + "pipeline": [ + {"vertex": "person"}, + {"edge": {"from": "person", "to": "company"}}, + ], + }, + vc, + ec, + ) def test_resource_auto_adds_edge_actor_types_to_infer_edge_except(): - """When a Resource has EdgeActors for (s,t), (s,t, None) is auto-added to infer_edge_except.""" - resource = Resource.from_dict( + config = Resource.from_dict( { "name": "test", "pipeline": [ @@ -284,12 +311,11 @@ def test_resource_auto_adds_edge_actor_types_to_infer_edge_except(): ], } ) - ids = resource._edge_ids_from_edge_actors() + ids = ResourceRuntime.edge_ids_from_pipeline(config.pipeline) assert ids == {("a", "b", None)} def test_resource_infer_edge_except_excludes_edges_handled_by_edge_actors(): - """Resource with EdgeActor for (a,b) does not infer (a,b); (a,c) is still inferred.""" from graflo.architecture.graph_types import ActionContext vc = VertexConfig.from_dict( @@ -309,8 +335,7 @@ def test_resource_infer_edge_except_excludes_edges_handled_by_edge_actors(): ] } ) - # EdgeActor for (a,b) is inside a descend that never runs (doc has no "nested" key) - resource = Resource.from_dict( + resource = _runtime( { "name": "test", "pipeline": [ @@ -322,14 +347,98 @@ def test_resource_infer_edge_except_excludes_edges_handled_by_edge_actors(): "apply": [{"edge": {"from": "a", "to": "b"}}], }, ], - } + }, + vc, + ec, ) - resource.finish_init(vertex_config=vc, edge_config=ec, transforms={}) anw = resource.root ctx = ActionContext() ctx = anw(ctx, doc={"a": "1", "b": "2", "c": "3"}) acc = anw.assemble(ctx) - # (a,b) has EdgeActor so it's in infer_edge_except - not inferred assert ("a", "b", "ab") not in acc - # (a,c) has no EdgeActor - inferred assert len(acc[("a", "c", "ac")]) == 1 + + +def _person_vertex_config() -> VertexConfig: + return 
VertexConfig.from_dict( + { + "vertices": [ + { + "name": "person", + "properties": ["id", "age"], + "identity": ["id"], + } + ] + } + ) + + +def test_resource_tolerate_transform_errors_continues_pipeline() -> None: + ec = EdgeConfig.from_dict({"edges": []}) + resource = _runtime( + { + "name": "tolerant", + "pipeline": [ + { + "transform": { + "call": { + "module": "builtins", + "foo": "int", + "input": ["age_raw"], + "output": ["age"], + } + } + }, + {"vertex": "person", "from": {"id": "id"}}, + ], + "tolerate_transform_errors": True, + }, + _person_vertex_config(), + ec, + ) + + result = resource.cast_document({"id": "u-1", "age_raw": "not-a-number"}) + person = result.entities["person"][0] + assert person["id"] == "u-1" + assert person.get("age") is None + assert len(result.transform_failures) == 1 + assert result.transform_failures[0].nulled_fields == ("age",) + assert result.transform_failures[0].exception_type == "ValueError" + + +def test_resource_tolerate_transform_errors_false_raises() -> None: + ec = EdgeConfig.from_dict({"edges": []}) + resource = _runtime( + { + "name": "strict", + "pipeline": [ + { + "transform": { + "call": { + "module": "builtins", + "foo": "int", + "input": ["age_raw"], + "output": ["age"], + } + } + }, + {"vertex": "person", "from": {"id": "id"}}, + ], + "tolerate_transform_errors": False, + }, + _person_vertex_config(), + ec, + ) + + with pytest.raises(ValueError): + resource.cast_document({"id": "u-1", "age_raw": "not-a-number"}) + + +def test_resource_tolerate_transform_errors_defaults_true() -> None: + resource = Resource.from_dict( + { + "name": "default_tolerant", + "pipeline": [{"vertex": "person"}], + } + ) + assert resource.tolerate_transform_errors is True diff --git a/test/architecture/test_resource_filters.py b/test/architecture/test_resource_filters.py index ef9e78ee..2e34ca3d 100644 --- a/test/architecture/test_resource_filters.py +++ b/test/architecture/test_resource_filters.py @@ -446,9 +446,7 @@ class 
TestAutoJoin: def _make_schema_and_patterns(self): """Build a minimal Schema + Connectors for the CMDB-like scenario.""" - from graflo.architecture.contract.declarations.ingestion_model import ( - IngestionModel, - ) + from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.schema import Schema schema = Schema.model_validate( diff --git a/test/architecture/test_transform_planned_output.py b/test/architecture/test_transform_planned_output.py new file mode 100644 index 00000000..19610f68 --- /dev/null +++ b/test/architecture/test_transform_planned_output.py @@ -0,0 +1,54 @@ +"""Tests for Transform.planned_output_field_names.""" + +from __future__ import annotations + +from graflo.architecture.contract.ingestion.transform import ( + DressConfig, + KeySelectionConfig, + Transform, +) + + +def test_planned_output_field_names_from_output() -> None: + t = Transform( + module="builtins", + foo="int", + input=("a",), + output=("age",), + ) + assert t.planned_output_field_names() == ("age",) + + +def test_planned_output_field_names_from_dress() -> None: + t = Transform( + module="builtins", + foo="str", + input=("Open",), + dress=DressConfig(key="name", value="value"), + ) + assert t.planned_output_field_names() == ("name", "value") + + +def test_planned_output_field_names_from_rename() -> None: + t = Transform(rename={"src": "dst"}) + assert t.planned_output_field_names() == ("dst",) + + +def test_planned_output_field_names_from_output_groups() -> None: + t = Transform( + module="builtins", + foo="int", + input_groups=(("a", "b"),), + output_groups=(("x", "y"),), + ) + assert t.planned_output_field_names() == ("x", "y") + + +def test_planned_output_field_names_target_keys() -> None: + t = Transform( + module="builtins", + foo="str", + target="keys", + keys=KeySelectionConfig(mode="include", names=("a", "c")), + ) + assert t.planned_output_field_names({"a": 1, "b": 2, "c": 3}) == ("a", "c") diff --git a/test/architecture/test_vertex.py 
b/test/architecture/test_vertex.py index e49c71b3..ab498d24 100644 --- a/test/architecture/test_vertex.py +++ b/test/architecture/test_vertex.py @@ -383,13 +383,12 @@ def test_vertex_config_properties_with_db_flavor(): def test_vertex_config_remove_vertices(): - """Test VertexConfig.remove_vertices removes vertices and updates blank_vertices.""" + """Test VertexConfig.remove_vertices removes vertices and blank_vertices property.""" v1 = Vertex.from_dict({"name": "a", "properties": ["id"]}) - v2 = Vertex.from_dict({"name": "b", "properties": ["id"]}) + v2 = Vertex.from_dict({"name": "b", "properties": ["id"], "blank": True}) v3 = Vertex.from_dict({"name": "c", "properties": ["id"]}) config = VertexConfig( vertices=[v1, v2, v3], - blank_vertices=["b"], identity_from_all_properties=True, ) assert config.vertex_set == {"a", "b", "c"} @@ -408,8 +407,8 @@ def test_vertex_config_identity_fallback_when_flag_enabled(): def test_blank_vertex_defaults_to_id_identity(): """Blank vertices still default to id identity when omitted.""" - blank = Vertex(name="placeholder", properties=[]) - cfg = VertexConfig(vertices=[blank], blank_vertices=["placeholder"]) + blank = Vertex(name="placeholder", properties=[], blank=True) + cfg = VertexConfig(vertices=[blank]) assert cfg.identity_fields("placeholder") == ["id"] assert cfg.property_names("placeholder") == ["id"] @@ -442,3 +441,29 @@ def test_vertex_properties_conflicting_duplicate_types_raise(): ], identity=["id"], ) + + +def test_resource_runtime_vertex_config_excludes_unreferenced_blank_vertices(): + """Blank vertices outside the resource pipeline are not in runtime config.""" + from graflo.architecture.contract.ingestion.resource import Resource + from graflo.architecture.contract.runtime import build_resource_runtime + from graflo.architecture.schema.edge import EdgeConfig + + schema_vc = VertexConfig( + vertices=[ + Vertex(name="ticker", properties=["cusip"], identity=["cusip"]), # type: ignore[arg-type] + 
Vertex(name="publication", properties=[], blank=True), + ] + ) + config = Resource( + name="ibes", + pipeline=[{"vertex": "ticker"}], + ) + resource = build_resource_runtime( + config, + vertex_config=schema_vc, + edge_config=EdgeConfig(), + transforms={}, + ) + assert resource.vertex_config.vertex_set == {"ticker"} + assert resource.vertex_config.blank_vertices == [] diff --git a/test/config/schema/ibes.yaml b/test/config/schema/ibes.yaml index c9f818f9..871dcd37 100644 --- a/test/config/schema/ibes.yaml +++ b/test/config/schema/ibes.yaml @@ -3,10 +3,9 @@ schema: name: ibes graph: vertex_config: - blank_vertices: - - publication vertices: - name: publication + blank: true properties: - datetime_review - datetime_announce diff --git a/test/data_source/test_api_data_source.py b/test/data_source/test_api_data_source.py index 77504074..e8375495 100644 --- a/test/data_source/test_api_data_source.py +++ b/test/data_source/test_api_data_source.py @@ -9,6 +9,7 @@ from test.conftest import fetch_manifest_obj from graflo.db import PostgresConfig from graflo.hq.caster import Caster +from graflo.hq.ingestion_parameters import IngestionParams from graflo.data_source import ( APIConfig, APIDataSource, @@ -56,8 +57,12 @@ def test_api_data_source_basic(mock_api_server, api_mode, current_path, reset): api_source = DataSourceFactory.create_api_data_source(api_config) api_source.resource_name = resource_name - # Create caster and process - caster = Caster(schema, ingestion_model, n_cores=1) + ingestion_model.finish_init(schema.core_schema) + caster = Caster( + schema, + ingestion_model, + ingestion_params=IngestionParams(n_cores=1), + ) asyncio.run( caster.process_data_source(data_source=api_source, resource_name=resource_name) ) @@ -77,8 +82,12 @@ def test_api_data_source_via_process_resource( schema = manifest.require_schema() ingestion_model = manifest.require_ingestion_model() - # Create caster - caster = Caster(schema, ingestion_model, n_cores=1) + 
ingestion_model.finish_init(schema.core_schema) + caster = Caster( + schema, + ingestion_model, + ingestion_params=IngestionParams(n_cores=1), + ) # Process using configuration dict resource_config = { diff --git a/test/db/tigergraphs/test_reserved_words.py b/test/db/tigergraphs/test_reserved_words.py index bb6be6aa..c88c5378 100644 --- a/test/db/tigergraphs/test_reserved_words.py +++ b/test/db/tigergraphs/test_reserved_words.py @@ -73,7 +73,9 @@ def test_edges_sanitization_for_tigergraph(schema_with_incompatible_edges): # ) # ) - assert ingestion_model.resources[-1].root.actor.descendants[0].actor.t.rename == { + ingestion_model.finish_init(sanitized_schema.core_schema) + last_resource = ingestion_model.fetch_resource(ingestion_model.resources[-1].name) + assert last_resource.root.actor.descendants[0].actor.t.rename == { "container_name": "id" } diff --git a/test/hq/test_db_writer.py b/test/hq/test_db_writer.py index 95a69c03..f32b62fb 100644 --- a/test/hq/test_db_writer.py +++ b/test/hq/test_db_writer.py @@ -4,7 +4,7 @@ from graflo.architecture.schema.edge import Edge, EdgeConfig from graflo.architecture.database_features import DatabaseProfile -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.graph_types import GraphContainer from graflo.architecture.schema import ( CoreSchema, @@ -44,10 +44,9 @@ def __exit__(self, exc_type, exc, tb): def _build_schema() -> Schema: vertex_config = VertexConfig( vertices=[ - Vertex(name="blank_v", properties=[], identity=[]), + Vertex(name="blank_v", properties=[], identity=[], blank=True), Vertex(name="target_v", properties=[Field(name="id")], identity=["id"]), ], - blank_vertices=["blank_v"], ) edge_config = EdgeConfig(edges=[Edge(source="blank_v", target="target_v")]) schema = Schema( @@ -113,12 +112,10 @@ def test_resolve_blank_edges_prefers_identity_join_over_zip(): def 
test_blank_vertex_default_identity_depends_on_db_flavor(): arango_cfg = VertexConfig( - vertices=[Vertex(name="blank_v", properties=[], identity=[])], - blank_vertices=["blank_v"], + vertices=[Vertex(name="blank_v", properties=[], identity=[], blank=True)], ) neo4j_cfg = VertexConfig( - vertices=[Vertex(name="blank_v", properties=[], identity=[])], - blank_vertices=["blank_v"], + vertices=[Vertex(name="blank_v", properties=[], identity=[], blank=True)], ) arango_cfg.finish_init() neo4j_cfg.finish_init() diff --git a/test/plot/test_plotter.py b/test/plot/test_plotter.py index c76ca198..7fed6855 100644 --- a/test/plot/test_plotter.py +++ b/test/plot/test_plotter.py @@ -1,4 +1,4 @@ -from types import SimpleNamespace +from types import MethodType, SimpleNamespace from typing import cast import networkx as nx @@ -10,6 +10,7 @@ ) from graflo.architecture.pipeline.runtime.actor.wrapper import ActorWrapper from graflo.architecture.pipeline.runtime.actor.config import EdgeActorConfig +from graflo.architecture.graph_types import EdgeId from graflo.architecture.schema.edge import Edge from graflo.plot.plotter import ManifestPlotter, assemble_tree, fillcolor_palette @@ -148,16 +149,35 @@ def test_plot_vc2vc_preserves_labels_and_partition_grouping(monkeypatch): {"from": "b", "to": "c", "relation_from_key": True} ) ) - resource = SimpleNamespace() - resource.root = SimpleNamespace( - collect_actors=lambda: [edge_ab_actor, edge_bc_actor] - ) + edge_ab = edge_ab_actor.edge + edge_bc = edge_bc_actor.edge + assert edge_ab is not None + assert edge_bc is not None plotter = _build_plotter( configured_edges={}, vertex_set={"a", "b", "c"}, ) - plotter.ingestion_model = SimpleNamespace(resources=[resource]) + plotter.ingestion_model = SimpleNamespace( + resources=[SimpleNamespace(name="r1", pipeline=[])] + ) + + def _discover_edges( + self: ManifestPlotter, + ) -> tuple[dict[EdgeId, Edge], dict[EdgeId, str], dict[EdgeId, bool]]: + discovered = { + edge_ab.edge_id: edge_ab, + 
edge_bc.edge_id: edge_bc, + } + relation_source = {edge_ab.edge_id: "edge_kind"} + relation_from_key = {edge_bc.edge_id: True} + return discovered, relation_source, relation_from_key + + monkeypatch.setattr( + plotter, + "_discover_edges_from_resources", + MethodType(_discover_edges, plotter), + ) captured = {} diff --git a/test/routing/test_ingestion_subset.py b/test/routing/test_ingestion_subset.py index 77425617..159e66a2 100644 --- a/test/routing/test_ingestion_subset.py +++ b/test/routing/test_ingestion_subset.py @@ -1,14 +1,30 @@ from __future__ import annotations -from graflo.architecture.contract.declarations.resource import Resource +from graflo.architecture.contract.ingestion.resource import Resource +from graflo.architecture.contract.runtime import build_resource_runtime from graflo.architecture.schema.edge import EdgeConfig from graflo.architecture.schema.vertex import VertexConfig from graflo.architecture.graph_types import GraphContainer -from graflo.hq.caster import ( - IngestionParams, - _filter_graph_container_by_vertices_inplace, - _filter_graph_container_drop_empty_identity_inplace, +from graflo.hq.document_caster import ( + filter_graph_container_by_vertices_inplace, + filter_graph_container_drop_empty_identity_inplace, ) +from graflo.hq.ingestion_parameters import IngestionParams + + +def _runtime( + data: dict, + vertex_config: VertexConfig, + edge_config: EdgeConfig, + **kwargs, +): + return build_resource_runtime( + Resource.from_dict(data), + vertex_config, + edge_config, + {}, + **kwargs, + ) def test_filter_graph_container_by_vertices_keeps_allowed_vertices() -> None: @@ -25,7 +41,7 @@ def test_filter_graph_container_by_vertices_keeps_allowed_vertices() -> None: linear=[], ) - _filter_graph_container_by_vertices_inplace(gc, allowed_vertex_names={"A", "B"}) + filter_graph_container_by_vertices_inplace(gc, allowed_vertex_names={"A", "B"}) assert set(gc.vertices.keys()) == {"A", "B"} assert set(gc.edges.keys()) == {("A", "B", None)} @@ -64,7 
+80,7 @@ def test_filter_graph_container_drops_empty_identity_vertices_and_edges() -> Non linear=[], ) - _filter_graph_container_drop_empty_identity_inplace(gc, vertex_config=vc) + filter_graph_container_drop_empty_identity_inplace(gc, vertex_config=vc) assert gc.vertices["modifier"] == [{"modifier_id": 123}] assert gc.vertices["metric"] == [{"metric_id": 981}] @@ -86,7 +102,7 @@ def test_filter_graph_container_by_vertices_empty_ingests_nothing() -> None: linear=[], ) - _filter_graph_container_by_vertices_inplace(gc, allowed_vertex_names=set()) + filter_graph_container_by_vertices_inplace(gc, allowed_vertex_names=set()) assert gc.vertices == {} assert gc.edges == {} @@ -120,19 +136,16 @@ def test_vertex_actor_early_exit_skips_disallowed_vertices() -> None: vc = _vertex_config_a_b_c() ec = EdgeConfig.from_dict({"edges": []}) - resource = Resource.from_dict( + resource = _runtime( { "name": "r", "pipeline": [ {"vertex": "A", "from": {"id": "a_id"}}, {"vertex": "B", "from": {"id": "b_id"}}, ], - } - ) - resource.finish_init( - vertex_config=vc, - edge_config=ec, - transforms={}, + }, + vc, + ec, allowed_vertex_names={"A"}, ) @@ -145,7 +158,7 @@ def test_vertex_router_early_exit_skips_disallowed_types() -> None: vc = _vertex_config_a_b_c() ec = EdgeConfig.from_dict({"edges": []}) - resource = Resource.from_dict( + resource = _runtime( { "name": "r", "pipeline": [ @@ -159,13 +172,9 @@ def test_vertex_router_early_exit_skips_disallowed_types() -> None: } } ], - } - ) - - resource.finish_init( - vertex_config=vc, - edge_config=ec, - transforms={}, + }, + vc, + ec, allowed_vertex_names={"A"}, ) @@ -183,7 +192,7 @@ def test_dynamic_edge_early_exit_skips_disallowed_endpoints() -> None: vc = _vertex_config_a_b_c() ec = EdgeConfig.from_dict({"edges": []}) - resource = Resource.from_dict( + resource = _runtime( { "name": "r", "pipeline": [ @@ -206,13 +215,9 @@ def test_dynamic_edge_early_exit_skips_disallowed_endpoints() -> None: } }, ], - } - ) - - resource.finish_init( - 
vertex_config=vc, - edge_config=ec, - transforms={}, + }, + vc, + ec, allowed_vertex_names={"A", "B"}, ) @@ -245,7 +250,7 @@ def test_edge_inference_skips_edges_with_disallowed_vertices() -> None: ec = _edge_config_a_b_and_a_c() # Include an explicit A->C edge actor to ensure EdgeActor also early-exits. - resource = Resource.from_dict( + resource = _runtime( { "name": "r", "pipeline": [ @@ -254,13 +259,9 @@ def test_edge_inference_skips_edges_with_disallowed_vertices() -> None: {"vertex": "C", "from": {"id": "c_id"}}, {"edge": {"from": "A", "to": "C"}}, ], - } - ) - - resource.finish_init( - vertex_config=vc, - edge_config=ec, - transforms={}, + }, + vc, + ec, allowed_vertex_names={"A", "B"}, ) diff --git a/test/routing/test_resource_dynamic.py b/test/routing/test_resource_dynamic.py index e6abd81a..334decfe 100644 --- a/test/routing/test_resource_dynamic.py +++ b/test/routing/test_resource_dynamic.py @@ -15,7 +15,7 @@ from sqlalchemy import create_engine, text -from graflo.architecture.contract.declarations.ingestion_model import IngestionModel +from graflo.architecture.contract.ingestion import IngestionModel from graflo.architecture.schema import Schema from graflo.data_source.sql import SQLConfig, SQLDataSource from graflo.filter.onto import ComparisonOperator, FilterExpression diff --git a/test/test_caster.py b/test/test_caster.py index 09d99a9b..46410f97 100644 --- a/test/test_caster.py +++ b/test/test_caster.py @@ -8,7 +8,9 @@ import pytest from suthing import FileHandle +from graflo.architecture.pipeline.runtime.actor import ActorWrapper from graflo.hq.caster import Caster +from graflo.hq.ingestion_parameters import IngestionParams logger = logging.getLogger(__name__) @@ -40,11 +42,19 @@ def cast(modes, current_path, level, reset, n_cores=1): from graflo.plot.plotter import assemble_tree for r in ingestion_model.resources: - assemble_tree(r.root, f"{output_dir}/{mode}.resource-{r.name}.pdf") + assemble_tree( + ActorWrapper(*r.pipeline), + 
f"{output_dir}/{mode}.resource-{r.name}.pdf", + ) except ImportError: # graphviz/pygraphviz not available, skip visualization logger.debug("graphviz not available, skipping tree visualization") - caster = Caster(schema, ingestion_model, n_cores=n_cores) + ingestion_model.finish_init(schema.core_schema) + caster = Caster( + schema, + ingestion_model, + ingestion_params=IngestionParams(n_cores=n_cores), + ) if level == 0: fname = os.path.join( diff --git a/test/test_caster_doc_errors.py b/test/test_caster_doc_errors.py index af9f145e..ff8e7bae 100644 --- a/test/test_caster_doc_errors.py +++ b/test/test_caster_doc_errors.py @@ -14,12 +14,15 @@ import pytest from graflo.data_source.base import AbstractDataSource, DataSourceType +from graflo.architecture.graph_types import LocationIndex, TransformCastFailure from graflo.hq.caster import ( CastBatchResult, Caster, DocErrorBudgetExceeded, IngestionParams, ) +from graflo.architecture.graph_types import ResourceCastResult +from graflo.architecture.schema.vertex import Field, Vertex, VertexConfig def _read_all_jsonl_gz_lines(path: Path) -> list[str]: @@ -44,12 +47,41 @@ class _FakeResource: name = "fake_resource" - def __call__(self, doc: dict) -> defaultdict: + @property + def vertex_config(self) -> VertexConfig: + return VertexConfig( + vertices=[ + Vertex( + name="v_test", + properties=[Field(name="id")], + identity=["id"], + ) + ] + ) + + def collect_vertex_names(self) -> set[str]: + return {"v_test"} + + def cast_document(self, doc: dict) -> ResourceCastResult: if doc.get("_fail"): raise ValueError("intentional document failure") out: defaultdict = defaultdict(list) out["v_test"] = [{"id": doc.get("id")}] - return out + failures: list[TransformCastFailure] = [] + if doc.get("_xform_fail"): + failures.append( + TransformCastFailure( + location=LocationIndex(path=()), + transform_label="builtins.int", + exception_type="ValueError", + message="invalid literal for int()", + nulled_fields=("age",), + ) + ) + return 
ResourceCastResult(entities=out, transform_failures=failures) + + def __call__(self, doc: dict) -> defaultdict: + return self.cast_document(doc).entities @pytest.fixture @@ -95,6 +127,60 @@ def test_skip_continues_batch_and_doc_error_sink( assert rec["exception_type"] == "ValueError" +def test_transform_failure_written_to_sink( + mock_schema: MagicMock, + mock_ingestion_model: MagicMock, + tmp_path: Path, +) -> None: + sink = tmp_path / "errors.jsonl.gz" + params = IngestionParams( + n_cores=1, + on_doc_error="skip", + doc_error_sink_path=sink, + ) + caster = Caster(mock_schema, mock_ingestion_model, ingestion_params=params) + data = [{"id": 1, "_xform_fail": True}, {"id": 2}] + + result = asyncio.run(caster.cast_normal_resource(data)) + + assert len(result.failures) == 1 + assert result.failures[0].failure_kind == "transform" + assert result.failures[0].doc_index == 0 + assert result.failures[0].transform_label == "builtins.int" + assert result.failures[0].nulled_fields == ("age",) + assert len(result.graph.vertices["v_test"]) == 2 + + lines = _read_all_jsonl_gz_lines(sink) + assert len(lines) == 1 + rec = json.loads(lines[0]) + assert rec["failure_kind"] == "transform" + assert rec["nulled_fields"] == ["age"] + + +def test_transform_failure_counts_toward_max_doc_errors( + mock_schema: MagicMock, + mock_ingestion_model: MagicMock, + tmp_path: Path, +) -> None: + sink = tmp_path / "errors.jsonl.gz" + params = IngestionParams( + n_cores=1, + on_doc_error="skip", + max_doc_errors=1, + doc_error_sink_path=sink, + ) + caster = Caster(mock_schema, mock_ingestion_model, ingestion_params=params) + data = [{"id": 1, "_xform_fail": True}, {"id": 2, "_xform_fail": True}] + + with pytest.raises(DocErrorBudgetExceeded) as exc_info: + asyncio.run(caster.cast_normal_resource(data)) + + assert exc_info.value.total_failures == 2 + lines = _read_all_jsonl_gz_lines(sink) + assert len(lines) == 2 + assert all(json.loads(ln)["failure_kind"] == "transform" for ln in lines) + + def 
test_fail_propagates( mock_schema: MagicMock, mock_ingestion_model: MagicMock, diff --git a/test/transform/test_transform.py b/test/transform/test_transform.py index 6c840009..83a66a4d 100644 --- a/test/transform/test_transform.py +++ b/test/transform/test_transform.py @@ -2,7 +2,7 @@ import pytest -from graflo.architecture.contract.declarations.transform import ( +from graflo.architecture.contract.ingestion.transform import ( DressConfig, KeySelectionConfig, ProtoTransform, diff --git a/uv.lock b/uv.lock index d734af0f..3935902a 100644 --- a/uv.lock +++ b/uv.lock @@ -434,7 +434,7 @@ dependencies = [ ] name = "graflo" source = {editable = "."} -version = "1.7.29" +version = "1.7.30" [package.metadata] provides-extras = ["dev", "docs", "plot"]
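
Reviewer note on the `tolerate_transform_errors` tests above: the behaviour they assert (null the declared output field, record a failure, keep processing the document) can be sketched standalone as below. This is a minimal illustration under stated assumptions, not graflo's implementation. `TransformFailure`, `CastResult`, and `apply_transform` are hypothetical names invented for this sketch; only the field names `exception_type`, `nulled_fields`, and `transform_label` are taken from the assertions in the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class TransformFailure:
    """Hypothetical failure record; field names mirror the test assertions."""

    transform_label: str
    exception_type: str
    message: str
    nulled_fields: tuple[str, ...]


@dataclass
class CastResult:
    """Hypothetical result holder: the transformed doc plus any recorded failures."""

    doc: dict[str, Any]
    transform_failures: list[TransformFailure] = field(default_factory=list)


def apply_transform(
    doc: dict[str, Any],
    func: Callable[[Any], Any],
    input_key: str,
    output_key: str,
    *,
    label: str,
    tolerate: bool = True,
) -> CastResult:
    """Apply func to doc[input_key] and store the result under output_key.

    With tolerate=True, an exception sets the output field to None and is
    recorded as a failure; with tolerate=False, the exception propagates.
    """
    result = CastResult(doc=dict(doc))  # copy so the caller's doc is untouched
    try:
        result.doc[output_key] = func(doc[input_key])
    except Exception as exc:
        if not tolerate:
            raise
        result.doc[output_key] = None
        result.transform_failures.append(
            TransformFailure(
                transform_label=label,
                exception_type=type(exc).__name__,
                message=str(exc),
                nulled_fields=(output_key,),
            )
        )
    return result
```

In tolerant mode, a caller would inspect `transform_failures` after the cast instead of wrapping the call in `try`/`except`, which is the same shape the new `cast_document` tests rely on.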