Merged
32 changes: 31 additions & 1 deletion CHANGELOG.md
@@ -5,14 +5,44 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.7.30]

### Added

- **`tolerate_transform_errors`** on **`ResourceConfig`** (default **`true`**) — a failing transform step sets its declared output fields to **`None`**, records a **`failure_kind=transform`** row in the doc error sink, and the rest of the resource pipeline (vertices, edges, later transforms) continues for that document. Set **`tolerate_transform_errors: false`** to fail fast on transform exceptions.

### Changed

- **`VertexActor` + `from_doc`** — transform-buffer projection is selective: only **`TransformPayload`** entries whose **`named`** keys cover the **`from_doc`** source fields are consumed, so dressed or pivot outputs for other vertex types are not stolen. Dressed dict payloads (`__transformed_value#*`) are handled consistently with passthrough from the merged observation doc.
- **Blank vertices in `VertexConfig`** — mark placeholder types with **`blank: true`** on each **`Vertex`** (identity defaults to **`id`**). **`VertexConfig.blank_vertices`** is now a derived name list, not a separate manifest field. Runtime **`ResourceRuntime`** scopes **`VertexConfig`** to vertices referenced by the resource pipeline only; unreferenced blank types are no longer injected automatically.
- **Ingestion contract layout** — declarative **`ResourceConfig`** lives under **`graflo.architecture.contract.ingestion`**; schema-bound execution is **`ResourceRuntime`** / **`build_resource_runtime`** under **`graflo.architecture.contract.runtime`**. **`Resource`** remains an internal alias for **`ResourceConfig`**.

### Breaking

- **Top-level `blank_vertices` on `vertex_config`** — no longer read from manifests; set **`blank: true`** on the corresponding **`vertices`** entries instead (silent ignore under `extra="ignore"` if the old key is left in place).
- **Runtime blank vertex scope** — blank vertex types must appear in the resource pipeline (or edge inference selectors) to be present in the per-resource runtime **`VertexConfig`**; relying on schema-wide blank placeholders without a matching actor step will not add them at cast time.
- **Imports** — prefer **`ResourceConfig`** from **`graflo.architecture.contract`** (or **`graflo.architecture.contract.ingestion`**); **`graflo.architecture.contract.declarations.resource`** is not the canonical module path.

### Documentation

- **[Document cast errors](docs/concepts/ingestion_doc_errors.md)** — **`tolerate_transform_errors`** and transform failure records.
- **[Core components](docs/concepts/core_components.md)** — **`ResourceConfig`** / **`ResourceRuntime`**, per-vertex **`blank`**, **`from_doc`** with dressed transforms, identity defaults.
- **[Architecture diagrams](docs/concepts/architecture_diagrams.md)** — contract and blank-vertex model aligned with 1.7.30.
- **[Creating a manifest](docs/getting_started/creating_manifest.md)** — **`tolerate_transform_errors`** and blank vertex YAML.

## [1.7.29]

### Added

- **Empty-identity filter on cast batches** — after resource casting, **`Caster`** can drop vertex docs and edge tuples whose schema identity fields are all missing, `null`, or `""` before **`DBWriter`** (identity rules from **`VertexConfig`**, not **`GraphContainer`**). Controlled by **`IngestionParams.drop_empty_identity_docs`** (default **`true`**). Blank vertex collections are exempt.

## [1.7.27]

### Added

- **`ColumnTimeFilter`** — shared pandas-like time window on a single column (`column`, optional `start` / `end`, optional `interval` as a **`pandas.Timedelta`** string such as `"7D"` or `"2h"` for day/hour windows, optional `not_equals`, optional `start_inclusive` / `end_inclusive`). Rendered to SQL via **`FilterExpression`** (same path as other pushdown filters). Calendar-style offsets (for example month arithmetic) are not supported when `pandas.Timedelta` rejects the string; use explicit `start` / `end` ISO bounds instead.
- **`FileConnector.time_filter`** and **`TableConnector.time_filter`** — canonical field replacing duplicated `date_field` / `date_filter` / `date_range_*` fields on the wire.
- **Bindings — runtime connector patches**: **`ConnectorUpdate`**, **`Bindings.apply_connector_update`**, and **`Bindings.replace_connector`** so defining-field changes re-hash and reindex correctly while preserving **`conn_proxy`** wiring. Patches are applied **after** manifest load (not stored on `GraphManifest`).
- **Empty-identity filter on cast batches** — after resource casting, **`Caster`** can drop vertex docs and edge tuples whose schema identity fields are all missing, `null`, or `""` before **`DBWriter`** (identity rules from **`VertexConfig`**, not **`GraphContainer`**). Controlled by **`IngestionParams.drop_empty_identity_docs`** (default **`true`**). Blank vertex collections are exempt.

### Breaking

28 changes: 18 additions & 10 deletions docs/concepts/architecture_diagrams.md
@@ -122,10 +122,10 @@ classDiagram
}

class IngestionModel {
+resources: list~Resource~
+resources: list~ResourceConfig~
+transforms: list~ProtoTransform~
+finish_init(core_schema)
+fetch_resource(name) Resource
+fetch_resource_config(name) ResourceConfig
}

class GraphMetadata {
@@ -136,13 +136,15 @@ classDiagram

class VertexConfig {
+vertices: list~Vertex~
+blank_vertices: list~Vertex~
+identity_from_all_properties: bool
+blank_vertices: list~str~
}

class Vertex {
+name: str
+identity: list~str~
+properties: list~Field~
+blank: bool
+filters: FilterExpression?
}

@@ -165,11 +167,16 @@ classDiagram
+filters: FilterExpression?
}

class Resource {
class ResourceConfig {
+name: str
+root: ActorWrapper
+pipeline: list~dict~
+tolerate_transform_errors: bool
}

class ResourceRuntime {
+config: ResourceConfig
+vertex_config: VertexConfig
+executor: ActorExecutor
+finish_init(vertex_config, edge_config, transforms)
}

class ActorWrapper {
@@ -224,7 +231,7 @@ classDiagram
Schema *-- CoreSchema : core_schema
CoreSchema *-- VertexConfig : vertex_config
CoreSchema *-- EdgeConfig : edge_config
IngestionModel *-- "0..*" Resource : resources
IngestionModel *-- "0..*" ResourceConfig : resources
IngestionModel *-- "0..*" ProtoTransform : transforms

VertexConfig *-- "0..*" Vertex : vertices
@@ -235,8 +242,9 @@ classDiagram
Edge *-- "0..*" Field : properties
Edge --> FilterExpression : filters

Resource *-- ActorWrapper : root
Resource *-- ActorExecutor : runtime orchestration
ResourceRuntime *-- ResourceConfig : config
ResourceRuntime *-- ActorWrapper : root
ResourceRuntime *-- ActorExecutor : runtime orchestration
ActorWrapper --> Actor : actor
ActorExecutor ..> ExtractionContext : produces
ActorExecutor ..> AssemblyContext : consumes
@@ -378,5 +386,5 @@ These are the two key abstractions that decouple *data retrieval* from *graph tr

- **DataSources** (`AbstractDataSource` subclasses) — handle *where* and *how* data is read. Each carries a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). Many DataSources can bind to the same Resource by name via the `DataSourceRegistry`.

- **Resources** (`Resource`) — handle *what* the data becomes in the LPG. Each Resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
- **Resources** (`ResourceConfig` → `ResourceRuntime`) — handle *what* the data becomes in the LPG. Each resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
- Optional **`drop_trivial_input_fields`** (default `false` on the model): when `true`, each record is preprocessed by dropping **top-level** keys whose value is `null` or the empty string `""` before actors run. This trims sparse wide rows (many unused columns) without extra transforms; nested dicts and lists are not walked.
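The shallow preprocessing described for `drop_trivial_input_fields` can be sketched as below. This is an illustration of the documented behaviour only, not graflo's implementation; the helper name `drop_trivial_top_level` and the sample record are hypothetical.

```python
from typing import Any


def drop_trivial_top_level(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: drop top-level keys whose value is None or "".

    Nested dicts and lists are returned untouched (they are not walked),
    and falsy-but-meaningful values such as 0 and False are kept.
    """
    return {
        key: value
        for key, value in record.items()
        if value is not None and value != ""
    }


# Sample sparse row (illustrative data only).
row = {
    "id": 7,
    "name": "",              # dropped: empty string
    "note": None,            # dropped: null
    "score": 0,              # kept: 0 is not trivial
    "active": False,         # kept: False is not trivial
    "meta": {"tag": ""},     # kept as-is: nested values are not inspected
}
cleaned = drop_trivial_top_level(row)
```

Keeping the filter shallow means its cost is linear in the number of columns, and nested structures reach the actors unchanged.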
28 changes: 16 additions & 12 deletions docs/concepts/core_components.md
@@ -83,11 +83,14 @@ A `Vertex` describes vertices and their logical identity. It supports:
- If one duplicate is typed and the other is untyped, the typed definition wins
- Conflicting non-null types for the same field name are rejected
- Filtering conditions
- Optional blank vertex configuration
- **`blank: true`** — placeholder vertex with no natural key; identity defaults to **`id`** when omitted

Identity defaults are strict by default at schema level:
- `VertexConfig.identity_from_all_properties: false` (default) do not require explicit vertex `identity`, defaults to all properties
- `VertexConfig.identity_from_all_properties: false` disables compatibility fallback where missing identity uses all property names
Identity defaults at schema level (`VertexConfig`):

- **`identity_from_all_properties: true`** (default) — vertices without explicit **`identity`** use all **`properties`** names as the logical key.
- **`identity_from_all_properties: false`** — each non-blank vertex must declare **`identity`** explicitly; blank vertices still default to **`id`**.

**Blank vertices:** set **`blank: true`** on the vertex entry under **`schema.graph.vertex_config.vertices`**. **`VertexConfig.blank_vertices`** is a derived list of names (not a separate YAML field). At runtime, **`ResourceRuntime`** keeps only vertex types referenced by that resource’s pipeline (and edge-inference selectors); blank types that are declared in the schema but not used by the resource are not injected automatically—include a **`vertex`** (or edge) step when the placeholder must be populated.
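
The identity rules above can be summarised in a small sketch. It assumes a simplified stand-in class (`VertexSketch`) and two hypothetical helpers (`resolve_identity`, `blank_vertex_names`); none of these are graflo names, and the real validation may behave differently in edge cases, for example in the exception type it raises.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VertexSketch:
    """Hypothetical stand-in for a vertex definition (not graflo's Vertex)."""

    name: str
    properties: list[str] = field(default_factory=list)
    identity: Optional[list[str]] = None
    blank: bool = False


def resolve_identity(vertex: VertexSketch, identity_from_all_properties: bool) -> list[str]:
    """Pick the logical key following the rules described above."""
    if vertex.identity:
        # An explicit identity always wins.
        return list(vertex.identity)
    if vertex.blank:
        # Blank (placeholder) vertices default to "id".
        return ["id"]
    if identity_from_all_properties:
        # Fallback: use every property name as the key.
        return list(vertex.properties)
    raise ValueError(f"vertex {vertex.name!r} must declare identity explicitly")


def blank_vertex_names(vertices: list[VertexSketch]) -> list[str]:
    """Derived name list, analogous in spirit to VertexConfig.blank_vertices."""
    return [v.name for v in vertices if v.blank]


vertices = [
    VertexSketch("person", properties=["name", "email"]),
    VertexSketch("account", properties=["iban", "owner"], identity=["iban"]),
    VertexSketch("hub", blank=True),
]
keys = {v.name: resolve_identity(v, identity_from_all_properties=True) for v in vertices}
```

Deriving the blank name list from the per-vertex flag keeps a single source of truth, which matches the move away from a separate manifest field.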

### Edge
An `Edge` describes edges and their logical identities. It allows:
@@ -166,19 +169,20 @@ An `AbstractDataSource` subclass defines where data comes from and how it is ret

Data sources handle retrieval only. They bind to Resources by name via the `DataSourceRegistry`, so the same `Resource` can ingest data from multiple sources without modification.

### Resource
A `Resource` is the central abstraction that bridges data sources and the graph schema. Each Resource defines a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements:
### Resource (`ResourceConfig` / `ResourceRuntime`)

Ingestion resources split into two layers:

- How data structures map to vertices and edges
- What transformations to apply
- The actor pipeline for processing documents
- **`ResourceConfig`** — declarative contract in **`ingestion_model.resources`** (YAML/Python): pipeline steps, encoding, type casters, edge-inference flags, **`tolerate_transform_errors`**, and related options. Serialized in manifests; validated by **`IngestionModel`**.
- **`ResourceRuntime`** — schema-bound executor built via **`build_resource_runtime`**: filtered **`VertexConfig`**, bound transforms, and **`ActorExecutor`** for document casting.

Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, a SQL table, or a SPARQL endpoint.
The name **`Resource`** in manifests and docs usually means **`ResourceConfig`**. Data sources bind to resources by name, so the same pipeline applies whether data arrives from a file, API, SQL table, or SPARQL endpoint.

Resource-level edge inference controls:
Resource-level controls:
- **`infer_edges`**: Global toggle for inferred edge emission during assembly (default: `true`).
- **`infer_edge_only`**: Allow-list of inferred edges (`source`, `target`, optional `relation`).
- **`infer_edge_except`**: Deny-list of inferred edges (`source`, `target`, optional `relation`).
- **`tolerate_transform_errors`** (default **`true`**): on transform failure, null declared outputs and continue the pipeline; see [Document cast errors](ingestion_doc_errors.md).
- `infer_edge_only` and `infer_edge_except` are mutually exclusive and validated against declared schema edges.
- These controls apply to inferred edges only; explicit edge actors in the pipeline are still emitted.
- **Auto-exclusion**: When a resource pipeline contains any EdgeActor for edges of type `(source, target)`, `(source, target, None)` is automatically added to `infer_edge_except` for that resource, so inferred edges do not duplicate edges produced by explicit edge actors.
@@ -192,7 +196,7 @@ An `Actor` describes how the current level of the document should be mapped/tran
- `TransformActor`: Applies data transformations
- `VertexActor`: Creates vertices from the current level. Key options:
- **`role`** (optional): named accumulator slot. When set the vertex is stored at `lindex.extend((role, 0))` instead of bare `lindex`, so multiple vertices of the same type in one row (e.g. `role: self`, `role: parent`, `role: child`) occupy distinct slots and can be addressed individually by a downstream edge step.
- **`from`**: rename map `{vertex_field: doc_field}`. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc automatically (passthrough).
- **`from`** (`from_doc`): rename map `{vertex_field: doc_field}`. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc and transform buffer automatically (passthrough). When multiple **`TransformPayload`** entries share a location, **`from_doc`** consumes only payloads whose **`named`** keys include all mapped source fields—so dressed metrics or pivot rows for other vertex types are left for their own **`vertex`** steps.
- **`keep_fields`**: restrict passthrough to this field subset. Use on role-vertex steps to prevent shared row columns from leaking into placeholder vertices that only carry an ID.
- `EdgeActor`: Creates edges between vertices. Operates in three modes:
- **Static mode** (`from`/`to` set on both sides): vertex types declared at config time.
2 changes: 1 addition & 1 deletion docs/concepts/features_and_practices.md
@@ -106,7 +106,7 @@ Schema comparison gives you a predictable transition path between versions. Inst

## Best Practices
1. Use compound identity fields for natural keys, and **`schema.db_profile`** secondary indexes for query performance
2. Leverage blank vertices for complex relationship modeling
2. Leverage blank vertices (`blank: true` on the vertex definition) for complex relationship modeling; include them in the resource pipeline when they must be populated at cast time
3. Define reusable transforms in **`ingestion_model.transforms`** and reference them from resource steps
4. Configure appropriate batch sizes based on your data volume
5. Enable parallel processing for large datasets
4 changes: 2 additions & 2 deletions docs/concepts/index.md
@@ -96,10 +96,10 @@ flowchart LR

- **Bindings** (`FileConnector`, `TableConnector`, `SparqlConnector`) describe *where* data comes from (file paths, SQL tables, SPARQL endpoints). Multiple connectors may attach to the same ingestion resource name; optional **`connector_connection`** entries assign each SQL/SPARQL connector a **`conn_proxy`** by **connector `name` or `hash`** (not by resource name). The `ConnectionProvider` turns that label into real connection config at runtime so manifests stay credential-free.
- **DataSources** (`AbstractDataSource` subclasses) handle *how* to read data in batches. Each carries a `DataSourceType` and is registered in the `DataSourceRegistry`.
- **Resources** define *what* to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Optional **`drop_trivial_input_fields`: `true`** removes top-level keys whose value is `null` or `""` **before** actors run (shallow only; `0` and `false` stay). **TigerGraph** physical defaults for missing attributes belong in **`schema.db_profile.default_property_values`** (GSQL `DEFAULT` at DDL time), not in the covariant `GraphContainer` assembly path.
- **Resources** define *what* to extract — each **`ResourceConfig`** (manifest `ingestion_model.resources`) is a reusable actor pipeline (descend → transform → vertex → edge) executed at cast time by **`ResourceRuntime`**. Optional **`drop_trivial_input_fields`: `true`** removes top-level keys whose value is `null` or `""` **before** actors run (shallow only; `0` and `false` stay). Optional **`tolerate_transform_errors`: `true`** (default) continues the pipeline when a transform step fails. **TigerGraph** physical defaults for missing attributes belong in **`schema.db_profile.default_property_values`** (GSQL `DEFAULT` at DDL time), not in the covariant `GraphContainer` assembly path.
- **GraphContainer** (covariant graph representation) collects the resulting vertices and edges in a database-independent format.
- **DBWriter** pushes the graph data into the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).
- **Document cast errors** — when a single source document fails inside a resource, **`IngestionParams.on_doc_error`** chooses skip vs fail-the-batch; optional **gzip JSONL** persistence uses **`doc_error_sink_path`** (CLI **`ingest --doc-error-sink`**). Details: [Document cast errors and doc error sink](ingestion_doc_errors.md).
- **Document cast errors** — when a single source document fails inside a resource, **`IngestionParams.on_doc_error`** chooses skip vs fail-the-batch; optional **gzip JSONL** persistence uses **`doc_error_sink_path`** (CLI **`ingest --doc-error-sink`**). Per-resource **`tolerate_transform_errors`** (default **`true`**) lets a single transform step fail without aborting the rest of the pipeline for that document. Details: [Document cast errors and doc error sink](ingestion_doc_errors.md).
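
The per-resource tolerance described above (failed transform outputs become `None`, a `failure_kind=transform` record is written, and the pipeline continues) might look roughly like the following. This is a sketch under stated assumptions: `run_transform_step`, the in-memory `errors` list, and every record field other than `failure_kind` are hypothetical and do not reflect graflo's actual error-sink schema.

```python
from typing import Any, Callable

Doc = dict[str, Any]


def run_transform_step(
    doc: Doc,
    transform: Callable[[Doc], Doc],
    outputs: list[str],
    errors: list[dict[str, Any]],
    tolerate: bool = True,
) -> Doc:
    """Hypothetical step runner illustrating tolerate_transform_errors."""
    try:
        doc.update(transform(doc))
    except Exception as exc:
        if not tolerate:
            # Fail fast: propagate the transform exception to the caller.
            raise
        # Tolerant mode: null the declared outputs and record the failure.
        for name in outputs:
            doc[name] = None
        errors.append(
            {
                "failure_kind": "transform",
                "outputs": list(outputs),  # hypothetical record field
                "message": str(exc),       # hypothetical record field
            }
        )
    return doc


def parse_price(doc: Doc) -> Doc:
    # float() raises ValueError on non-numeric input such as "n/a".
    return {"price": float(doc["raw_price"])}


errors: list[dict[str, Any]] = []
result = run_transform_step(
    {"sku": "A1", "raw_price": "n/a"}, parse_price, ["price"], errors
)
# Later steps (vertices, edges, further transforms) would still see `result`,
# with "price" set to None.
```

Calling the same step with `tolerate=False` re-raises the `ValueError`, which corresponds to the fail-fast setting.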

### Minimal canonical config contract
