diff --git a/CHANGELOG.md b/CHANGELOG.md
index 36723ab507..9673b646d9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ### Added
 
+- **GFQL schema inference API (#1338)**: Added experimental `graphistry.infer_schema(g)`, `g.infer_schema()`, and `g.bind(infer_schema=True)` for opt-in public `GraphSchema` inference from bound local graph data. Inference derives node/edge property logical types, presence/nullability report details, `label__*` node and relationship labels, and source/destination topology when node label evidence is available. Inferred schemas carry descriptive `GraphSchema.metadata` provenance (`source="inferred"` or `source="mixed"`). Declared schemas remain explicit and take precedence when passed to `infer_schema(..., schema=...)`; `bind(schema=..., infer_schema=True)` is rejected instead of silently merging contracts.
 - **GFQL NetworkX CALL parity (#1058)**: Expanded the local Cypher `graphistry.nx.*` CALL surface with explicit NetworkX dispatch for `degree_centrality`, `closeness_centrality`, `eigenvector_centrality`, `katz_centrality`, `connected_components`, `strongly_connected_components`, `core_number`, and multi-output `hits`, including row and `.write()` coverage.
 - **NetworkX/SciPy optional dependency policy (#1618)**: Declared supported `networkx>=2.5,<4` and optional `scipy>=1.5,<2` ranges for NetworkX-backed GFQL CALL procedures, with runtime version guards and a focused lower/current-upper CI matrix.
 - **GFQL schema Arrow boundary APIs (#1339)**: Added experimental public schema↔Arrow import/export helpers, graph-level Arrow declaration payloads, and opt-in `schema_validate='strict'|'autofix'` enforcement for `plot()`, `upload()`, `to_arrow()`, and `validate_arrow_schema()` when a `GraphSchema` is bound.
diff --git a/docs/source/gfql/schema.rst b/docs/source/gfql/schema.rst
index d85c319a28..a32ea28f9b 100644
--- a/docs/source/gfql/schema.rst
+++ b/docs/source/gfql/schema.rst
@@ -5,8 +5,9 @@ GFQL accepts public schema declarations through the stable ``graphistry.schema``
 import path. Use this when application code owns a graph contract and wants
 Cypher preflight checks to fail before query execution.
 The API is experimental in this release: the import path and core declaration
-objects are intended to be stable, while inference, coercion, remote transport,
-and planner use are still follow-on surfaces.
+objects are intended to be stable, while coercion, remote transport, and
+planner use are still follow-on surfaces. Inference is also experimental and
+must be requested explicitly.
 
 The schema is optional. When you provide one, PyGraphistry uses it as the
 declared contract for local GFQL validation. When you do not provide one,
@@ -95,6 +96,8 @@ Schema Objects
   ``GraphSchemaCatalog`` used by binder/preflight validation. ``strict=False``
   makes schema-bound ``g.gfql_validate(...)`` permissive by default; callers can
   still override per call with ``g.gfql_validate(..., strict=True)``.
+  ``metadata`` is descriptive provenance for callers and exports; it is not part
+  of validation semantics.
 
 ``NodeType.to_arrow()`` and ``EdgeType.to_arrow()``
   Export declarations as ``pyarrow.Schema`` objects through GFQL's row-schema
@@ -128,8 +131,8 @@ Invalid queries raise ``GFQLValidationError`` with structured context.
 This is a correctness and documentation surface first: applications can state
 what labels, relationship types, properties, and topology they expect, then
 validate user-authored or generated Cypher before running it. The same typed
-contract is also the foundation for later inference, coercion, remote transport,
-and planner/performance work, but this page covers the declared local contract.
+contract is also used by inference and is the foundation for later coercion,
+remote transport, and planner/performance work.
 
 Arrow Boundary Validation
 -------------------------
@@ -163,22 +166,116 @@ boundaries. This is off by default so existing ``plot()``, ``upload()``, and
 Provided vs. Inferred Schema
 ----------------------------
 
-In this release, schemas are **provided**, not inferred. You create
-``NodeType``, ``EdgeType``, and ``GraphSchema`` objects directly and attach them
-with ``graphistry.bind(..., schema=schema)`` or ``g.bind(schema=schema)``.
+You can provide a schema directly or infer one from bound local data.
 
-Without an explicit ``GraphSchema``:
+Use a provided schema when application code owns the contract:
 
-* ``g.gfql_validate(...)`` can still use local dataframe columns already bound
-  on ``g._nodes`` and ``g._edges`` for schema-aware checks.
-* It does not infer node types, edge types, Arrow dtypes, nullability, or
-  topology from data.
+.. code-block:: python
+
+    declared_g = (
+        graphistry
+        .edges(edges_df, "src", "dst")
+        .nodes(nodes_df, "id")
+        .bind(schema=schema)
+    )
 
+Use inference when the graph data should define the first draft contract:
+
+.. code-block:: python
+
+    inferred_base_g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
+    inferred_schema = inferred_base_g.infer_schema()
+    inferred_g = inferred_base_g.bind(schema=inferred_schema)
 
+For one-step local binding, use:
+
+.. code-block:: python
+
+    inferred_g = (
+        graphistry
+        .edges(edges_df, "src", "dst")
+        .nodes(nodes_df, "id")
+        .bind(infer_schema=True)
+    )
 
+Inference is opt-in. ``graphistry.bind(...)`` and ``g.bind(...)`` do not infer a
+schema unless ``infer_schema=True`` is passed.
+
+Inference Rules
+---------------
+
+``graphistry.infer_schema(g)`` and ``g.infer_schema()`` return a public
+``GraphSchema``. They inspect currently bound ``nodes`` and ``edges`` dataframes:
+
+* Node types come from boolean ``label__