diff --git a/.gitignore b/.gitignore index 79564e9..bbb3a98 100644 --- a/.gitignore +++ b/.gitignore @@ -13,4 +13,8 @@ node_modules/* **/.idea/** .direnv/ .envrc -dist/ \ No newline at end of file +dist/ +.infrahub-sync-cache/ +# invoke bench.run artifacts (default to repo root) +bench-results.csv +.bench-filtered-config.yml \ No newline at end of file diff --git a/docs/docs/guides/run.mdx b/docs/docs/guides/run.mdx index bddf659..3180dc2 100644 --- a/docs/docs/guides/run.mdx +++ b/docs/docs/guides/run.mdx @@ -2,37 +2,27 @@ title: Running sync tasks --- -Learn how to use Infrahub Sync's commands to generate sync adapters, calculate differences, and synchronize data between your source and destination systems. +Learn how to use Infrahub Sync's commands to calculate differences, synchronize data, and apply previously cached plans against your destination. ![Infrahub-Sync process](../media/infrahub_sync_process.excalidraw.svg) -::: info +:::info -Before generating the necessary Python code for your sync adapters and models and synchronizing, you need to created a configuration. -To create a new configuration, please refer to the guide [Creating a new Sync Instance](./creation) +Before you can run a sync, you need a configuration file. To create a new configuration, see the [Creating a new Sync Instance](./creation) guide. ::: - -## Generating sync adapters and models - - -### Command +## Listing available sync projects ```bash -infrahub-sync generate --name --directory +infrahub-sync list --directory ``` -### Parameters - -- `--name`: The name of the sync project you want to generate code for. -- `--directory`: The directory where your sync configuration files are located. - -This command reads your configuration file and generates Python code for the sync adapters and models required for the synchronization task. +Prints every sync project found under the given directory along with its source, destination, and on-disk location. Useful as a quick sanity check. 
## Calculating differences -The `diff` command lets you see the differences between your source and destination before actually performing the synchronization. This is useful for verifying what will be synchronized. +The `diff` command compares the source and destination without writing anything to the destination. It also writes a Parquet **plan** to the local cache so you can review the change set and replay it later with `apply`. ### Command @@ -42,14 +32,19 @@ infrahub-sync diff --name --directory at ` line on success. Note that id — you can hand it to `apply` to dispatch the plan without re-extracting the source. ## Synchronizing data -Once you're ready to synchronize the data between your source and destination, you can use the `sync` command. +The `sync` command runs `diff` and immediately applies the changes to the destination. ### Command @@ -59,20 +54,65 @@ infrahub-sync sync --name --directory /last-successful-rowcounts.json`. The next run reads it; if any resource has shrunk by more than 50% the sync refuses to proceed unless you pass `--allow-rowcount-drop`. The threshold catches accidents like a partially-restored source or a credential that lost permissions, where syncing would otherwise wipe legitimate data on the destination. -For example: +## Reviewing and applying a cached plan + +The cache pattern lets you split a run into two steps: produce a plan (`diff`), then apply it (`apply`). This is useful when you want a human approval gate, when the destination is briefly unreachable, or when you want to re-apply the same plan without re-fetching the source. ```bash -infrahub-sync sync --name my_project --directory configs --diff --show-progress -``` \ No newline at end of file +# 1. 
Dry-run — extracts source + destination, writes plan.parquet +infrahub-sync diff --name from-netbox --directory examples/ + +# Look at the logged line: +# INFO | infrahub_sync.cli | Cached run 20260518T1430-abc12345 at .infrahub-sync-cache/from-netbox/20260518T1430-abc12345 +# +# Inspect the diff or query the parquet directly with DuckDB: +# duckdb -c "SELECT * FROM read_parquet('.infrahub-sync-cache/from-netbox/20260518T1430-abc12345/plan.parquet')" + +# 2. Apply the cached plan — no source extraction +infrahub-sync apply --name from-netbox --run-id 20260518T1430-abc12345 --directory examples/ +``` + +`apply` refuses to proceed if the destination's schema shape has drifted since the plan was built — the cached `schema-sub-hash.txt` must match the freshly-computed hash. When it doesn't, re-run `diff` to rebuild the plan. + +For the full on-disk layout (per-resource Parquet snapshots, sidecar JSON files, the per-pipeline filelock), see the [Cache layout reference](../reference/cache-layout). + +## Generating sync adapters and models + +`infrahub-sync generate` reads your configuration file and emits Python code for the sync adapters and models used at runtime. + +```bash +infrahub-sync generate --name --directory +``` + +You typically only run this once per configuration (and after editing `config.yml`). diff --git a/docs/docs/readme.mdx b/docs/docs/readme.mdx index 4ffc5b4..8e13c22 100644 --- a/docs/docs/readme.mdx +++ b/docs/docs/readme.mdx @@ -1,138 +1,40 @@ --- title: Infrahub Sync --- +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; -Infrahub Sync is a Python package that synchronizes data between Infrahub and external infrastructure systems. It connects to NetBox, Nautobot, IP Fabric, Slurp’it, Cisco ACI, ServiceNow-style systems, and other tools through a library of pre-built adapters, and handles both the communication with each system and the translation of data between different schemas. 
Sync projects are defined declaratively in YAML; the CLI generates the adapter code, calculates the diff, and applies changes. +Infrahub Sync is a versatile Python package that synchronizes data between a source and a destination system. It builds on the robust capabilities of `diffsync` to offer flexible and efficient data synchronization across different platforms, including Netbox, Nautobot, and Infrahub. This package features a Typer-based CLI for ease of use, supporting operations such as listing available sync projects, generating diffs, and executing sync processes. -Infrahub Sync is open source under Apache 2.0, distributed on [PyPI](https://pypi.org/project/infrahub-sync/), and maintained on [GitHub](https://github.com/opsmill/infrahub-sync). +## Guides -Infrahub Sync supports the following: +- [Installing Infrahub Sync](./guides/installation.mdx) +- [Creating a new sync instance](./guides/creation.mdx) +- [Support adapters with custom CA certificates](./guides/custom-certificates.mdx) +- [Run a sync instance](./guides/run.mdx) -- **Migration from an existing system of record** — data moves from NetBox, Nautobot, or another source into Infrahub one model at a time, on the schedule the team chooses. The legacy system continues to operate during the migration. -- **Recurring synchronization between systems** — a sync project runs as often as the environment requires. Each run calculates a fresh diff and applies only the deltas. -- **Inventory population from network discovery** — adapters for IP Fabric and Slurp’it bring discovered network state into Infrahub as the source of truth, rather than requiring inventory to be entered by hand. -- **Outbound data movement from Infrahub** — Infrahub data is published into monitoring, observability, or CMDB systems that need a current view of infrastructure. -- **Translation between data models** — source fields map to destination fields through a declarative YAML configuration. 
Identifiers, relationships, and static values are handled in the same file. -- **Diff preview before changes are applied** — `infrahub-sync diff` shows the differences between source and destination state without modifying either system. +## Reference -## How it works +- [Sync instance configuration](./reference/config.mdx) +- [Sync CLI](./reference/cli.mdx) -### Concepts +## Adapters -- **Sync project** — a directory containing a YAML configuration file (`config.yml`) that defines one synchronization between two systems. A project specifies the source, the destination, the sync order, and the per-model schema mapping. A team can have many sync projects, each managing a different source-destination pair. -- **Adapter** — the component that connects Infrahub Sync to a specific system. Each adapter handles both communication (API calls, authentication, request handling) and translation (converting the system's data into the internal sync engine's format). Infrahub Sync ships with adapters for common systems and supports custom adapters for systems without a pre-built one. -- **Schema mapping** — the part of the project configuration that defines how source fields map to destination fields. Direct field mappings, references between models, identifiers, and static values are all declared in YAML. +- [Infrahub](./adapters/infrahub.mdx) +- [NetBox](./adapters/netbox.mdx) +- [Nautobot](./adapters/nautobot.mdx) +- [IP Fabric](./adapters/ipfabric.mdx) +- [Cisco ACI](./adapters/aci.mdx) +- [LibreNMS](./adapters/librenms.mdx) +- [Observium](./adapters/observium.mdx) +- [Peering Manager](./adapters/peering-manager.mdx) +- [Prometheus](./adapters/prometheus.mdx) + +- [Slurp'it](./adapters/slurpit.mdx) + +- [Generic REST API](./adapters/genericrestapi.mdx) +- [Local Adapters](./adapters/local-adapters.mdx) -Three CLI commands operate on a sync project: `generate`, `diff`, and `sync`. 
+## Contributing -### Define a sync project - -Each sync project consists of a directory and a `config.yml` describing the sync. The configuration specifies the source adapter and destination adapter with their connection details, the order in which models should be synchronized, and how each source field maps to a destination field. Credentials reference environment variables rather than being embedded in the file. - -→ [Creating a new sync instance](./guides/creation.mdx) · [Sync instance configuration](./reference/config.mdx) - -### Generate the adapter code - -`infrahub-sync generate --name --directory ` reads the YAML configuration and produces the Python adapter and model code that `diff` and `sync` use. Re-run `generate` whenever the configuration or the schema mapping changes. - -### Preview with `diff` - -`infrahub-sync diff --name --directory ` reads both the source and destination, calculates what would change, and prints the result to the terminal. The destination is not modified. The `diff` command is read-only and is typically run before applying any sync — particularly during initial setup or when adjusting mappings. - -### Execute the sync - -`infrahub-sync sync --name --directory ` applies the changes calculated by the diff, in the order defined by the project's `order` key — independent models first, then dependent models, then models that reference earlier ones. The sync is idempotent: if a run fails partway through, re-running calculates a fresh diff and applies whatever is still outstanding. - -Three `diffsync_flags` (`SKIP_UNMATCHED_DST` by default, `SKIP_UNMATCHED_SRC`, `SKIP_MODIFIED`) and per-mapping filters control what each run is allowed to change. - -→ [Run a sync instance](./guides/run.mdx) · [Sync CLI](./reference/cli.mdx) - -## Who it's for - -### Implementing Infrahub alongside an existing system of record - -Data lives in NetBox, Nautobot, IP Fabric, or another tool, and the team is adopting Infrahub. 
Infrahub Sync provides a path that does not require all teams to move at the same time, and that does not require writing integration code. - -→ [Installing Infrahub Sync](./guides/installation.mdx) · [Creating a new sync instance](./guides/creation.mdx) - -### Operating Infrahub at steady state - -Infrahub is deployed and needs to stay current with the other systems the team uses — IPAM, ITSM, monitoring, network discovery, or in-house databases. One sync project per source, run on the cadence the environment requires. - -→ [Creating a new sync instance](./guides/creation.mdx) · [Run a sync instance](./guides/run.mdx) - -### Building inventory from network discovery - -Infrahub is populated from what is actually deployed in the network rather than from manually curated inventory. IP Fabric and Slurp’it adapters connect to discovery tools and bring discovered state into Infrahub. - -→ [IP Fabric adapter](./adapters/ipfabric.mdx) · [Slurp’it adapter](./adapters/slurpit.mdx) - -## What's included - -- **Pre-built adapter library** — adapters for Infrahub, NetBox, Nautobot, IP Fabric, Cisco ACI, LibreNMS, Observium, Peering Manager, Prometheus, and Slurp’it, plus a Generic REST API adapter for systems with HTTP/JSON APIs. Each adapter handles both communication and translation for its target system. -- **Declarative YAML configuration** — a single file per sync project defines source, destination, sync order, and per-model schema mapping. Mappings support 14 filter operations (including `regex` and `is_ip_within`), per-field transforms, custom Jinja filters, and ordered cross-reference resolution. -- **Sync engine** — built on the `diffsync` framework. Diffs and applies only deltas; three flags control what each run is allowed to change (creates, deletes, modifications). -- **Typer-based CLI** — four commands: `list` (show available projects), `generate` (produce adapter code from the configuration), `diff` (preview changes), `sync` (apply changes). 
-- **Custom adapter support** — for systems without a pre-built adapter, write a local custom adapter and load it from a filesystem path, a Python module path, or an installed entry point (`INFRAHUB_SYNC_ADAPTER_PATHS`). -- **Custom CA certificate support** — connect to systems with self-signed or internal CA-issued TLS certificates. - -### Adapter reference - -| Adapter | Direction supported | -|---|---| -| Infrahub | source or destination | -| NetBox | NetBox → Infrahub | -| Nautobot | Nautobot → Infrahub | -| IP Fabric | IP Fabric → Infrahub | -| Cisco ACI | Cisco ACI → Infrahub | -| Peering Manager | Peering Manager → Infrahub · Infrahub → Peering Manager | -| Prometheus | Prometheus → Infrahub | -| Slurp’it | Slurp’it → Infrahub | -| LibreNMS | LibreNMS → Infrahub | -| Observium | Observium → Infrahub | -| Generic REST API | external system → Infrahub | - -## Get started - -1. **Prerequisites** - - A running [Infrahub](https://github.com/opsmill/infrahub) instance - - Python 3.10–3.13 - - Credentials and network access for the source and destination systems -2. **Install Infrahub Sync.** See [Installing Infrahub Sync](./guides/installation.mdx) for the full setup steps. The short version: `pip install infrahub-sync` into a virtual environment. -3. **Choose your starting point.** - - Setting up a sync project for the first time? → [Creating a new sync instance](./guides/creation.mdx) - - Running an existing project? → [Run a sync instance](./guides/run.mdx) - -## Common questions - -**Do I have to migrate everything at once?** -No. Infrahub Sync is designed to move data one model at a time, on the team's own schedule. The legacy system keeps running throughout — there is no required cutover moment. - -**Does Infrahub Sync replace my scheduler?** -No. Infrahub Sync runs as a CLI and is designed to plug into whatever scheduling tooling the team already uses — cron, CI jobs, Prefect, Dagster, or similar. There is no built-in scheduler by design. 
- -**What happens if a sync run fails partway through?** -Sync runs are idempotent. Re-running the sync calculates a fresh diff against the current destination state and applies only what is still outstanding. Retries on failure are safe. - -**Can changes in the destination be overwritten by a sync?** -By default, `SKIP_UNMATCHED_DST` is enabled, which preserves destination objects that have no corresponding object in the source. For destination objects that do have a source match, the sync's behavior depends on the configured `diffsync_flags`. Decide upfront which system is authoritative for each model and configure the flags accordingly. - -**What if my source system doesn't have a pre-built adapter?** -Most systems with a REST/JSON API can use the Generic REST API adapter without modifications. For systems with non-standard APIs or custom logic requirements, build a local custom adapter. See [Local Adapters](./adapters/local-adapters.mdx). - -**Can I run two sync projects at the same time?** -Yes. Each sync project is independent — a separate directory, configuration, and CLI invocation. Schedule and operate each project on its own cadence. 
- -## Additional resources - -| What you want to do | Where to go | -|---|---| -| Set up your environment | [Installing Infrahub Sync](./guides/installation.mdx) | -| Configure a sync project | [Creating a new sync instance](./guides/creation.mdx) · [Sync instance configuration](./reference/config.mdx) | -| Run a sync | [Run a sync instance](./guides/run.mdx) | -| CLI reference | [Sync CLI](./reference/cli.mdx) | -| All adapters | See the **Adapters** section in the sidebar | -| Custom CA certificates | [Support adapters with custom CA certificates](./guides/custom-certificates.mdx) | -| Build a custom adapter | [Local Adapters](./adapters/local-adapters.mdx) | -| Contribute | [Development guide](./development.mdx) | -| Source code | [github.com/opsmill/infrahub-sync](https://github.com/opsmill/infrahub-sync) | +- [Development guide](./development.mdx) - Set up a development environment, run tests, and publish releases diff --git a/docs/docs/reference/cache-layout.mdx b/docs/docs/reference/cache-layout.mdx new file mode 100644 index 0000000..dcfd2b8 --- /dev/null +++ b/docs/docs/reference/cache-layout.mdx @@ -0,0 +1,58 @@ +--- +title: Cache layout +--- + +`infrahub-sync diff` and `infrahub-sync apply` persist run state under: + +```text +.infrahub-sync-cache// +├── .lock # per-pipeline filelock (held during runs) +├── last-successful-rowcounts.json # baseline for the rowcount guardrail +└── / + ├── A/ # source snapshot + │ ├── BuiltinTag.parquet + │ └── ... + ├── B/ # destination snapshot + │ └── ... + ├── plan.parquet # the diff plan + ├── errors.parquet # only when errors > 0 + ├── cursors.json # {A: {Resource: cursor}, B: {Resource: cursor}} + ├── schema-sub-hash.txt # invalidates the cache when shape changes + └── run.json # status, mode, summary, finished_at +``` + +Override the root with `INFRAHUB_SYNC_CACHE_DIR=/path/to/shared/cache`. + +## plan.parquet + +One row per change. 
The columns are: + +| Column | Description | +| --- | --- | +| `action` | `create`, `update`, or `delete`. Empty for no-op elements (which are skipped during serialization). | +| `resource` | Kind name as declared in `schema_mapping[].name`. | +| `source_id` | DiffSync `unique_id` of the source-side element. | +| `dest_id` | Reserved for the destination's primary key once adapters return it. Empty today. | +| `attribute` | Reserved for per-attribute granularity. Empty today (rows are per-element). | +| `old_value` | JSON-encoded mapping of `{attr: prior_value}` from `element.get_attrs_diffs()["-"]`. Populated on `update` actions. | +| `new_value` | JSON-encoded mapping of `{attr: new_value}` from `element.get_attrs_diffs()["+"]`. Populated on `create` and `update`. | +| `owner` | Reserved for sync-identity-based skip logic. Empty today. | +| `skip_reason` | Empty unless the engine deliberately skipped a row. | +| `conflict_class` | Empty unless the engine flagged a write conflict. | + +Query with DuckDB without any import step: + +```bash +duckdb -c "SELECT action, resource, source_id, new_value FROM read_parquet('.infrahub-sync-cache/from-netbox//plan.parquet') WHERE action <> 'create' LIMIT 20" +``` + +## Commands + +- `infrahub-sync diff --name X` — writes side A, side B, and `plan.parquet`. +- `infrahub-sync sync --name X` — runs diff then sync; writes the same cache artifacts as `diff` plus updates `last-successful-rowcounts.json` on success. +- `infrahub-sync apply --name X --run-id ` — dispatches the cached plan + against the destination without re-extracting the source. Refuses if the + destination's schema sub-hash has drifted. +- `--allow-rowcount-drop` (on `sync`) bypasses the rowcount guardrail when the operator knows the source has legitimately shrunk. +- `--continue-on-error` (on `sync`) skips peer relationships missing identifier values rather than aborting; the engine logs each skip so you can review what was dropped. 
+- `--no-concurrent-load` (on `diff` and `sync`) falls back to loading source then destination sequentially. The default (concurrent) is safe with all built-in adapters and roughly halves load wall-clock time on real APIs. diff --git a/docs/docs/reference/cli.mdx b/docs/docs/reference/cli.mdx index 1de2bf6..c33a8fe 100644 --- a/docs/docs/reference/cli.mdx +++ b/docs/docs/reference/cli.mdx @@ -1,5 +1,7 @@ # `infrahub-sync` +Infrahub-sync: synchronize data between infrastructure sources and destinations. + **Usage**: ```console @@ -8,16 +10,35 @@ $ infrahub-sync [OPTIONS] COMMAND [ARGS]... **Options**: +* `--verbosity [quiet|default|verbose]`: Log verbosity level [default: default] +* `-v, --verbose`: Shorthand for --verbosity verbose +* `-q, --quiet`: Shorthand for --verbosity quiet * `--install-completion`: Install completion for the current shell. * `--show-completion`: Show completion for the current shell, to copy it or customize the installation. * `--help`: Show this message and exit. **Commands**: -* `diff`: Calculate and print the differences... -* `generate`: Generate all the Python files for a given... * `list`: List all available SYNC projects. +* `diff`: Calculate and print the differences... * `sync`: Synchronize the data between source and... +* `apply`: Apply a previously cached plan against the... +* `generate`: Generate all the Python files for a given... + +## `infrahub-sync list` + +List all available SYNC projects. + +**Usage**: + +```console +$ infrahub-sync list [OPTIONS] +``` + +**Options**: + +* `--directory TEXT`: Base directory to search for sync configurations +* `--help`: Show this message and exit. ## `infrahub-sync diff` @@ -35,18 +56,21 @@ $ infrahub-sync diff [OPTIONS] * `--config-file TEXT`: File path to the sync configuration YAML file * `--directory TEXT`: Base directory to search for sync configurations * `--branch TEXT`: Branch to use for the diff. 
-* `--show-progress / --no-show-progress`: Show a progress bar during diff [default: show-progress] +* `--show-progress / --no-show-progress`: Show a progress bar (default: auto-detect terminal) * `--adapter-path TEXT`: Paths to look for adapters. Can be specified multiple times. +* `--run-id TEXT`: Re-use a specific cache run id. +* `--concurrent-load / --no-concurrent-load`: Load source and destination concurrently. Disable when a custom adapter isn't thread-safe. [default: concurrent-load] +* `--full-extract / --no-full-extract`: Re-extract every resource from scratch (default). Pass --no-full-extract to enable the cursor-driven incremental path on warm runs — see docs/reference/incremental-extraction. [default: full-extract] * `--help`: Show this message and exit. -## `infrahub-sync generate` +## `infrahub-sync sync` -Generate all the Python files for a given sync based on the configuration. +Synchronize the data between source and the destination systems for a given project or configuration file. **Usage**: ```console -$ infrahub-sync generate [OPTIONS] +$ infrahub-sync sync [OPTIONS] ``` **Options**: @@ -55,32 +79,43 @@ $ infrahub-sync generate [OPTIONS] * `--config-file TEXT`: File path to the sync configuration YAML file * `--directory TEXT`: Base directory to search for sync configurations * `--branch TEXT`: Branch to use for the sync. +* `--diff / --no-diff`: Print the differences between the source and the destination before syncing [default: diff] +* `--show-progress / --no-show-progress`: Show a progress bar (default: auto-detect terminal) * `--adapter-path TEXT`: Paths to look for adapters. Can be specified multiple times. +* `--parallel / --no-parallel`: Sync tier-by-tier using the auto-computed dep graph. Requires order: to be omitted from config.yml. [default: parallel] +* `--allow-rowcount-drop / --no-allow-rowcount-drop`: Skip the rowcount drop guardrail. Use only when you know the source intentionally shrank. 
[default: no-allow-rowcount-drop] +* `--continue-on-error / --no-continue-on-error`: Log and skip peer relationships whose identifier values are missing instead of aborting. Useful when source data is partial; review the warnings before relying on the result. [default: no-continue-on-error] +* `--concurrent-load / --no-concurrent-load`: Load source and destination concurrently. Disable when a custom adapter isn't thread-safe. [default: concurrent-load] +* `--full-extract / --no-full-extract`: Re-extract every resource from scratch (default). Pass --no-full-extract to enable the cursor-driven incremental path on warm runs — see docs/reference/incremental-extraction. [default: full-extract] * `--help`: Show this message and exit. -## `infrahub-sync list` +## `infrahub-sync apply` -List all available SYNC projects. +Apply a previously cached plan against the destination — no source extraction. **Usage**: ```console -$ infrahub-sync list [OPTIONS] +$ infrahub-sync apply [OPTIONS] ``` **Options**: +* `--name TEXT`: Name of the sync to use +* `--config-file TEXT`: File path to the sync configuration YAML file * `--directory TEXT`: Base directory to search for sync configurations +* `--run-id TEXT`: Cache run id produced by a previous `diff`. [required] +* `--branch TEXT`: Branch to use for the apply. * `--help`: Show this message and exit. -## `infrahub-sync sync` +## `infrahub-sync generate` -Synchronize the data between source and the destination systems for a given project or configuration file. +Generate all the Python files for a given sync based on the configuration. **Usage**: ```console -$ infrahub-sync sync [OPTIONS] +$ infrahub-sync generate [OPTIONS] ``` **Options**: @@ -89,7 +124,5 @@ $ infrahub-sync sync [OPTIONS] * `--config-file TEXT`: File path to the sync configuration YAML file * `--directory TEXT`: Base directory to search for sync configurations * `--branch TEXT`: Branch to use for the sync. 
-* `--diff / --no-diff`: Print the differences between the source and the destination before syncing [default: diff] -* `--show-progress / --no-show-progress`: Show a progress bar during syncing [default: show-progress] * `--adapter-path TEXT`: Paths to look for adapters. Can be specified multiple times. * `--help`: Show this message and exit. diff --git a/docs/docs/reference/config.mdx b/docs/docs/reference/config.mdx index cbf1044..e8974d2 100644 --- a/docs/docs/reference/config.mdx +++ b/docs/docs/reference/config.mdx @@ -19,10 +19,33 @@ Describes the overall synchronization configuration. | `store` | `SyncStore` | Configuration for the optional storage mechanism. | No | | `source` | `SyncAdapter` | Configuration for the source adapter. | Yes | | `destination` | `SyncAdapter` | Configuration for the destination adapter. | Yes | -| `order` | List of strings | Specifies the order in which objects should be synchronized. | Yes | +| `order` | List of strings | Order in which objects should be synchronized. Optional — when omitted, infrahub-sync auto-computes tiers from schema_mapping. | No | | `schema_mapping` | List of `SchemaMappingModel` | Defines how data is mapped from source to destination. | Yes | | `diffsync_flags` | List of `DiffSyncFlags` | Instruct Infrahub Sync how to handle some specific situation without changing the data | No | +### Auto-tiered execution + +`order:` is now optional. When it is omitted, infrahub-sync derives a +write-order graph from the `reference:` entries in each `schema_mapping` +field and groups kinds into **tiers**: + +- Tier 0: kinds with no outgoing references. +- Tier N: kinds whose references all live in tiers `0..N-1`. + +The flattened tier order replaces the manual `order:` list. Tiers and any +optional edges dropped to break cycles are logged at `INFO` level when +`diff` or `sync` runs. 
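The tier grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: `compute_tiers` is a hypothetical helper, the input is assumed to be a mapping from each kind to the kinds it references, and the reference edges in the example are invented for the kinds used in the ACI example config. Unlike the behavior described above, this sketch raises on a cycle instead of dropping optional edges.

```python
def compute_tiers(references: dict[str, set[str]]) -> list[list[str]]:
    """Group kinds into tiers (hypothetical helper, not the infrahub-sync API).

    Tier 0 holds kinds with no outgoing references; a later tier holds kinds
    whose references all live in earlier tiers.
    """
    # Ignore self-references and references to kinds outside the mapping.
    remaining = {
        kind: {ref for ref in refs if ref in references and ref != kind}
        for kind, refs in references.items()
    }
    tiers: list[list[str]] = []
    placed: set[str] = set()
    while remaining:
        # A kind is ready once everything it references has been placed.
        ready = sorted(kind for kind, refs in remaining.items() if refs <= placed)
        if not ready:
            # Assumption: this sketch gives up on cycles rather than
            # dropping optional edges to break them.
            raise ValueError(f"reference cycle among: {sorted(remaining)}")
        tiers.append(ready)
        placed.update(ready)
        for kind in ready:
            del remaining[kind]
    return tiers


# Assumed reference edges, for illustration only.
example = {
    "OrganizationCustomer": set(),
    "LocationMetro": set(),
    "LocationBuilding": {"LocationMetro"},
    "DcimPhysicalDevice": {"LocationBuilding", "OrganizationCustomer"},
    "DcimPhysicalInterface": {"DcimPhysicalDevice"},
}

for index, tier in enumerate(compute_tiers(example)):
    print(f"tier {index}: {tier}")
```

Flattening the returned tiers in order gives a write order equivalent to a hand-written `order:` list, which is why the explicit list can be left out when the references are complete.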
+ +`infrahub-sync sync` runs with `--parallel` on by default: the engine +narrows the destination's `top_level` to one tier at a time so no tier +starts before the previous tier's writes have completed. Pass +`--no-parallel` to disable the tier boundary and fall back to the legacy +single-pass code path. + +If you must override the computed order (because it doesn't match an +adapter quirk), keep the `order:` list — it always wins, and `--parallel` +will warn and fall back to serial when an explicit order is set. + ### Sync store Optional configuration for a storage mechanism used for stateful synchronization. diff --git a/docs/docs/reference/incremental-extraction.mdx b/docs/docs/reference/incremental-extraction.mdx new file mode 100644 index 0000000..94f9365 --- /dev/null +++ b/docs/docs/reference/incremental-extraction.mdx @@ -0,0 +1,68 @@ +--- +title: Incremental Extraction +--- + +`infrahub-sync` can skip re-extracting unchanged data on warm runs by +asking each backend "what changed since the last successful run?". + +## Default behavior + +`infrahub-sync` defaults to `--full-extract`: every run re-extracts every +resource from scratch. The cursor-driven warm path is **opt-in** because +timestamp filters miss deletes and because a fresh extract is the safer +posture for a tool that writes to a downstream system. + +The cache machinery still runs under `--full-extract` — snapshots and +cursor sidecars are written under the run dir so that the opt-in warm +path is immediately usable when you switch to `--no-full-extract`. See +[Cache layout](./cache-layout) for the on-disk shape. + +## Enabling the incremental warm path + +```bash +uv run infrahub-sync sync --name from-netbox --directory examples/ --no-full-extract +``` + +When `--no-full-extract` is set, the engine takes the cursor path **if +all** of these hold: + +1. A prior run exists under `.infrahub-sync-cache//` with + `run.json` status `applied` (or `dry-run`). +2. 
`schema-sub-hash.txt` from that run matches the current schema + mapping + destination schema. Any mapping change forces a full + extract. +3. The adapter declares a non-NONE cursor tier for the resource + (`cursor_tier_for()` — see adapter docs). +4. The run counter has not hit the configured cadence (default: every + 10 runs, configurable via `incremental.full_resync_every` in + `config.yml`). + +If any condition fails the engine falls back to the full extract path +for that side / resource. + +## When to keep `--full-extract` + +- Investigating a discrepancy and you suspect cached state. +- A backend has had data deleted and you want the delete reflected + immediately (timestamp filters do not catch deletes — the cadence + knob handles this routinely). + +## Supported backends + +| Adapter | Tier | Notes | +| ---------------------- | ----------------- | ----- | +| NetBox source | TIMESTAMP | `last_updated__gte` | +| Nautobot source | TIMESTAMP | `last_updated__gte` | +| Infrahub destination | TIMESTAMP | `node_metadata__updated_at__after` | +| Others | NONE | Always full extract today | + +## Soft deletes + +Timestamp-based incremental misses DELETEs (the deleted row has no +`last_updated` to match). The engine forces a full extract every N +runs (default 10) to reconcile deletes. Set +`incremental.full_resync_every: 1` to disable incremental entirely. + +A future optimization will add an ID-only sweep +(`adapter.list_existing_ids`) so deletes are caught on every warm +run — the contract is in place but not yet wired into the engine. 
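The four eligibility conditions for the cursor path listed above can be sketched as a single predicate. This is a hedged illustration only: `PriorRun`, `use_cursor_path`, and the way the run counter is compared against `full_resync_every` are assumptions made for the example, not the engine's real types or logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriorRun:
    """Hypothetical view of a cached run: run.json status plus schema-sub-hash.txt."""

    status: str
    schema_sub_hash: str


def use_cursor_path(
    prior: Optional[PriorRun],
    current_schema_sub_hash: str,
    cursor_tier: str,
    run_counter: int,
    full_resync_every: int = 10,
    full_extract: bool = True,
) -> bool:
    """Return True when a resource may take the incremental (cursor) path."""
    if full_extract:
        # --full-extract is the default, so the warm path is opt-in.
        return False
    if prior is None or prior.status not in {"applied", "dry-run"}:
        return False  # condition 1: no usable prior run
    if prior.schema_sub_hash != current_schema_sub_hash:
        return False  # condition 2: mapping or destination schema changed
    if cursor_tier == "NONE":
        return False  # condition 3: adapter has no cursor for this resource
    # Condition 4 (assumed arithmetic): every Nth run is forced to a full
    # extract so that deletes are reconciled.
    return run_counter % full_resync_every != 0


prior = PriorRun(status="applied", schema_sub_hash="abc")
print(use_cursor_path(prior, "abc", "TIMESTAMP", run_counter=3, full_extract=False))
```

Under this reading, setting `full_resync_every` to 1 makes the last check fail on every run, which matches the note that a value of 1 disables incremental extraction entirely.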
diff --git a/docs/sidebars.ts b/docs/sidebars.ts index bf159cc..3099c27 100644 --- a/docs/sidebars.ts +++ b/docs/sidebars.ts @@ -36,6 +36,7 @@ const sidebars: SidebarsConfig = { items: [ 'reference/config', 'reference/cli', + 'reference/incremental-extraction', ], }, 'development', diff --git a/examples/aci_to_infrahub/config.yml b/examples/aci_to_infrahub/config.yml index 9a75184..54c268e 100644 --- a/examples/aci_to_infrahub/config.yml +++ b/examples/aci_to_infrahub/config.yml @@ -16,13 +16,15 @@ destination: settings: url: "http://localhost:8000" -order: [ - "OrganizationCustomer", - "LocationMetro", - "LocationBuilding", - "DcimPhysicalDevice", - "DcimPhysicalInterface", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: [ +# "OrganizationCustomer", +# "LocationMetro", +# "LocationBuilding", +# "DcimPhysicalDevice", +# "DcimPhysicalInterface", +# ] schema_mapping: # ACI Tenants -> Organizations diff --git a/examples/custom_adapter/config.yml b/examples/custom_adapter/config.yml index d257658..44e1f3c 100644 --- a/examples/custom_adapter/config.yml +++ b/examples/custom_adapter/config.yml @@ -13,9 +13,11 @@ destination: settings: url: "http://localhost:8000" -order: [ - "InfraDevice", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: [ +# "InfraDevice", +# ] schema_mapping: - name: InfraDevice diff --git a/examples/device42_to_infrahub/config.yml b/examples/device42_to_infrahub/config.yml index bc43bca..98219d4 100644 --- a/examples/device42_to_infrahub/config.yml +++ b/examples/device42_to_infrahub/config.yml @@ -16,11 +16,13 @@ destination: settings: url: "http://localhost:8000" -order: [ - "BuiltinTag", - "OrganizationTenant", - "LocationSite", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "BuiltinTag", +# "OrganizationTenant", +# "LocationSite", +# ] schema_mapping: # Builtin Tags (Device42 tags) diff --git a/examples/infrahub_to_peering-manager/config.yml b/examples/infrahub_to_peering-manager/config.yml index e357c1f..1958e03 100644 --- a/examples/infrahub_to_peering-manager/config.yml +++ b/examples/infrahub_to_peering-manager/config.yml @@ -11,16 +11,18 @@ destination: url: "https://demo.peering-manager.net" # api_endpoint: "api" # auth_method: "token" - token: "13bf6338aed52d172e33750d39717fff5a5f5d18" + token: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" -order: [ - "InfraAutonomousSystem", - "InfraBGPCommunity", - "InfraBGPRoutingPolicy", - "InfraBGPPeerGroup", - "InfraIXP", - "InfraIXPConnection", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: [ +# "InfraAutonomousSystem", +# "InfraBGPCommunity", +# "InfraBGPRoutingPolicy", +# "InfraBGPPeerGroup", +# "InfraIXP", +# "InfraIXPConnection", +# ] schema_mapping: - name: InfraAutonomousSystem diff --git a/examples/ipfabric_to_infrahub/config.yml b/examples/ipfabric_to_infrahub/config.yml index 0ce978a..9a29c4b 100644 --- a/examples/ipfabric_to_infrahub/config.yml +++ b/examples/ipfabric_to_infrahub/config.yml @@ -11,20 +11,22 @@ destination: settings: url: "http://localhost:8000" -order: [ - "LocationGeneric", - "OrganizationGeneric", - "InfraPlatform", - "ChoiceDeviceType", - "InfraNOSVersion", - "InfraDevice", - "InfraPartNumber", - "InfraVLAN", - "InfraVRF", - "InfraInterfaceL3", - "InfraPrefix", - "InfraIPAddress", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "LocationGeneric", +# "OrganizationGeneric", +# "InfraPlatform", +# "ChoiceDeviceType", +# "InfraNOSVersion", +# "InfraDevice", +# "InfraPartNumber", +# "InfraVLAN", +# "InfraVRF", +# "InfraInterfaceL3", +# "InfraPrefix", +# "InfraIPAddress", +# ] schema_mapping: - name: LocationGeneric diff --git a/examples/librenms_to_infrahub/config.yml b/examples/librenms_to_infrahub/config.yml index 5928be3..84a8025 100644 --- a/examples/librenms_to_infrahub/config.yml +++ b/examples/librenms_to_infrahub/config.yml @@ -14,12 +14,14 @@ destination: settings: url: "http://localhost:8000" -order: [ - "CoreStandardGroup", - "LocationSite", - "IpamIPAddress", - "InfraDevice", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: [ +# "CoreStandardGroup", +# "LocationSite", +# "IpamIPAddress", +# "InfraDevice", +# ] schema_mapping: - name: CoreStandardGroup diff --git a/examples/nautobot-v1_to_infrahub/config.yml b/examples/nautobot-v1_to_infrahub/config.yml index 86b25e2..0a0840a 100644 --- a/examples/nautobot-v1_to_infrahub/config.yml +++ b/examples/nautobot-v1_to_infrahub/config.yml @@ -10,30 +10,32 @@ destination: settings: url: "http://localhost:8000" -order: [ - "BuiltinTag", - "RoleGeneric", - # "StatusGeneric", - "CoreStandardGroup", - "ChoiceLocationType", - "OgranizationGeneric", - "LocationGeneric", - "InfraRack", - "ChoiceDeviceType", - "InfraPlatform", - "InfraProviderNetwork", - "ChoiceCircuitType", - "InfraCircuit", - "InfraRouteTarget", - "InfraVRF", - "InfraDevice", - "InfraVLAN", - "InfraPrefix", - "InfraIPAddress", - "InfraRearPort", - "InfraFrontPort", - "InfraInterfaceL2L3" -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "BuiltinTag", +# "RoleGeneric", +# # "StatusGeneric", +# "CoreStandardGroup", +# "ChoiceLocationType", +# "OgranizationGeneric", +# "LocationGeneric", +# "InfraRack", +# "ChoiceDeviceType", +# "InfraPlatform", +# "InfraProviderNetwork", +# "ChoiceCircuitType", +# "InfraCircuit", +# "InfraRouteTarget", +# "InfraVRF", +# "InfraDevice", +# "InfraVLAN", +# "InfraPrefix", +# "InfraIPAddress", +# "InfraRearPort", +# "InfraFrontPort", +# "InfraInterfaceL2L3" +# ] schema_mapping: # Tags diff --git a/examples/nautobot-v2_to_infrahub/config.yml b/examples/nautobot-v2_to_infrahub/config.yml index 56f21c3..a33d57e 100644 --- a/examples/nautobot-v2_to_infrahub/config.yml +++ b/examples/nautobot-v2_to_infrahub/config.yml @@ -17,31 +17,33 @@ destination: # host: localhost # port: 6379 -order: [ - "BuiltinTag", - "RoleGeneric", - "StatusGeneric", - "CoreStandardGroup", - "OrganizationGeneric", - "ChoiceLocationType", - "LocationGeneric", - "InfraRack", - "ChoiceDeviceType", - "InfraPlatform", - "InfraProviderNetwork", - "ChoiceCircuitType", - "InfraCircuit", - "NautobotNamespace", - "InfraRouteTarget", - "InfraVLAN", - "InfraVRF", - # "InfraDevice", - "InfraPrefix", - # "InfraInterfaceL2L3", - "InfraRearPort", - "InfraFrontPort", - # "InfraIPAddress", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "BuiltinTag", +# "RoleGeneric", +# "StatusGeneric", +# "CoreStandardGroup", +# "OrganizationGeneric", +# "ChoiceLocationType", +# "LocationGeneric", +# "InfraRack", +# "ChoiceDeviceType", +# "InfraPlatform", +# "InfraProviderNetwork", +# "ChoiceCircuitType", +# "InfraCircuit", +# "NautobotNamespace", +# "InfraRouteTarget", +# "InfraVLAN", +# "InfraVRF", +# # "InfraDevice", +# "InfraPrefix", +# # "InfraInterfaceL2L3", +# "InfraRearPort", +# "InfraFrontPort", +# # "InfraIPAddress", +# ] schema_mapping: # Tags @@ -312,7 +314,7 @@ schema_mapping: # BGP Plugin (Autonomous System, BGP Session, BGP Peer Group) - name: InfraAutonomousSystem - mapping: plugin.bgp.autonomous-systems + mapping: plugins.bgp.autonomous-systems identifiers: ["name"] fields: - name: name diff --git a/examples/netbox_to_infrahub/config.yml b/examples/netbox_to_infrahub/config.yml index 4a3bc60..203dc0b 100644 --- a/examples/netbox_to_infrahub/config.yml +++ b/examples/netbox_to_infrahub/config.yml @@ -1,4 +1,21 @@ --- +# from-netbox — demonstrates the two newest features of infrahub-sync: +# +# 1. Auto-tiered write order. The `order:` list has been removed; the engine +# derives tiers from `schema_mapping[].fields[].reference` and groups +# kinds that can be written in parallel. Add `--parallel` to `sync` to +# enforce a hard barrier between tiers. +# +# 2. Parquet sidecar cache. Every `diff` writes per-resource snapshots and +# a `plan.parquet` under `.infrahub-sync-cache/from-netbox//`. +# The run_id is logged at INFO; pass it to `apply` to re-dispatch the +# cached plan without re-extracting the source. 
+# +# Usage: +# uv run infrahub-sync diff --name from-netbox --directory examples/ +# uv run infrahub-sync apply --name from-netbox --run-id --directory examples/ +# uv run infrahub-sync sync --name from-netbox --directory examples/ --parallel + name: from-netbox source: @@ -12,25 +29,12 @@ destination: settings: url: "http://localhost:8000" -order: [ - "BuiltinTag", - "RoleGeneric", - "CoreStandardGroup", - "OrganizationGeneric", - "LocationGeneric", - "InfraRack", - "ChoiceDeviceType", - "InfraProviderNetwork", - "ChoiceCircuitType", - "InfraCircuit", - "InfraRouteTarget", - "InfraVRF", - "InfraDevice", - "InfraVLAN", - "InfraPrefix", - # "InfraIPAddress", - "InfraInterfaceL2L3", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: +# - BuiltinTag +# - RoleGeneric +# ... schema_mapping: # Tags diff --git a/examples/observium_to_infrahub/config.yml b/examples/observium_to_infrahub/config.yml index 1e39dcb..81e1cc8 100644 --- a/examples/observium_to_infrahub/config.yml +++ b/examples/observium_to_infrahub/config.yml @@ -15,11 +15,13 @@ destination: settings: url: "http://localhost:8000" -order: [ - "CoreStandardGroup", - "IpamIPAddress", - "InfraDevice", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "CoreStandardGroup", +# "IpamIPAddress", +# "InfraDevice", +# ] schema_mapping: - name: CoreStandardGroup diff --git a/examples/peering-manager_to_infrahub/config.yml b/examples/peering-manager_to_infrahub/config.yml index 25cefff..b13de54 100644 --- a/examples/peering-manager_to_infrahub/config.yml +++ b/examples/peering-manager_to_infrahub/config.yml @@ -6,7 +6,7 @@ source: # name: genericrestapi settings: url: "https://demo.peering-manager.net" - token: "13bf6338aed52d172e33750d39717fff5a5f5d18" + token: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" # When using the genericrestapi adapter # we need to specify the api_endpoint, auth_method and response_key_pattern # with peeringmanager adapter, these are set by default @@ -19,16 +19,18 @@ destination: settings: url: "http://localhost:8000" -order: [ - "OrganizationProvider", - "InfraAutonomousSystem", - "InfraBGPCommunity", - "InfraBGPRoutingPolicy", - "InfraBGPPeerGroup", - "InfraIXP", - "IpamIPAddress", - "InfraIXPConnection", -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: [ +# "OrganizationProvider", +# "InfraAutonomousSystem", +# "InfraBGPCommunity", +# "InfraBGPRoutingPolicy", +# "InfraBGPPeerGroup", +# "InfraIXP", +# "IpamIPAddress", +# "InfraIXPConnection", +# ] schema_mapping: - name: OrganizationProvider diff --git a/examples/peeringdb_to_infrahub/config.yml b/examples/peeringdb_to_infrahub/config.yml index 534df8d..4838c1e 100644 --- a/examples/peeringdb_to_infrahub/config.yml +++ b/examples/peeringdb_to_infrahub/config.yml @@ -9,7 +9,7 @@ source: # auth_method: "none" # If you need authentication auth_method: "api-key" - token: "BdW624dP.fBkFmpt3gPAj0z3t2PcQE7FQvhfn1IKV" + token: "aaaaaaaa.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" response_key_pattern: "data" # PeeringDB wraps lists under "data" @@ -19,9 +19,11 @@ destination: url: "http://localhost:8000" # Optional: Skip objects in the source that don't exist in the destination (prevents creation) -dyffsync_flags: ["SKIP_UNMATCHED_SRC"] +diffsync_flags: ["SKIP_UNMATCHED_SRC"] -order: ["InfraAutonomousSystem"] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: ["InfraAutonomousSystem"] schema_mapping: - name: InfraAutonomousSystem diff --git a/examples/prometheus_to_infrahub (node_exporter)/config.yml b/examples/prometheus_to_infrahub (node_exporter)/config.yml index a85dbe1..4c28012 100644 --- a/examples/prometheus_to_infrahub (node_exporter)/config.yml +++ b/examples/prometheus_to_infrahub (node_exporter)/config.yml @@ -12,11 +12,13 @@ destination: settings: url: "http://localhost:8000" -order: - - "VirtualizationVirtualMachine" - - "VirtualizationVMNetworkInterface" - - "VirtualizationVMFilesystem" - - "VirtualizationVMDisk" +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. 
+# Uncomment and edit only if you need to override the computed order: +# order: +# - "VirtualizationVirtualMachine" +# - "VirtualizationVMNetworkInterface" +# - "VirtualizationVMFilesystem" +# - "VirtualizationVMDisk" schema_mapping: diff --git a/examples/slurpit_to_infrahub/config.yml b/examples/slurpit_to_infrahub/config.yml index 75d8f72..ec1fba4 100644 --- a/examples/slurpit_to_infrahub/config.yml +++ b/examples/slurpit_to_infrahub/config.yml @@ -12,20 +12,22 @@ destination: url: "http://localhost:8000" token: "06438eb2-8019-4776-878c-0941b1f1d1ec" -order: [ - "OrganizationGeneric", - "LocationGeneric", - "ChoiceDeviceType", - "InfraPlatform", - "InfraDevice", - "InfraHardwareInfo", - "InfraVersion", - "InfraVLAN", - "InfraVRF", - "InfraInterface", - "InfraPrefix", - "InfraIPAddress" -] +# order: omitted — infrahub-sync auto-computes tiers from schema_mapping. +# Uncomment and edit only if you need to override the computed order: +# order: [ +# "OrganizationGeneric", +# "LocationGeneric", +# "ChoiceDeviceType", +# "InfraPlatform", +# "InfraDevice", +# "InfraHardwareInfo", +# "InfraVersion", +# "InfraVLAN", +# "InfraVRF", +# "InfraInterface", +# "InfraPrefix", +# "InfraIPAddress" +# ] schema_mapping: - name: OrganizationGeneric diff --git a/infrahub_sync/__init__.py b/infrahub_sync/__init__.py index 7eae311..d23588c 100644 --- a/infrahub_sync/__init__.py +++ b/infrahub_sync/__init__.py @@ -5,6 +5,13 @@ import re from typing import TYPE_CHECKING, Any, ClassVar, Union +from infrahub_sync.cache.cursors import CursorTier + +if TYPE_CHECKING: + from collections.abc import Iterable + + from infrahub_sync.cache.cursors import CursorState + import pydantic if TYPE_CHECKING: @@ -70,6 +77,12 @@ class SyncStore(pydantic.BaseModel): settings: dict[str, Any] | None = {} +class IncrementalConfig(pydantic.BaseModel): + """Optional configuration block for incremental-extraction behaviour.""" + + full_resync_every: int = 10 + + class SyncConfig(pydantic.BaseModel): name: str 
store: SyncStore | None = None # Fix default value that was incorrectly set as list @@ -79,6 +92,7 @@ class SyncConfig(pydantic.BaseModel): order: list[str] = pydantic.Field(default_factory=list) schema_mapping: list[SchemaMappingModel] = [] diffsync_flags: list[Union[str, DiffSyncFlags]] | None = [] + incremental: IncrementalConfig | None = None @validator_decorator("diffsync_flags", **validator_kwargs) # ty: ignore[no-matching-overload] def convert_str_to_enum(cls, v): @@ -97,6 +111,39 @@ def convert_str_to_enum(cls, v): new_flags.append(item) return new_flags + def compute_order(self) -> list[str]: + """Return the operator-provided `order` if set, else flattened tiers + auto-computed from `schema_mapping`. + + Logs the tier layout at INFO and any dropped optional edges at WARNING. + """ + order, _tiers = self.compute_order_and_tiers() + return order + + def compute_order_and_tiers(self) -> tuple[list[str], list[set[str]] | None]: + """Return `(flat_order, tiers)` from a single topological pass. + + `tiers` is `None` when an explicit `order` is configured. Callers that + need both the flat order and the tier layout should use this rather + than calling `compute_order()` and `compute_tiers()` separately, which + would sort the graph twice. Logs the tier layout at INFO and any + dropped optional edges at WARNING. + """ + if self.order: + return list(self.order), None + # Imported here to avoid a circular import at module load.
+ from infrahub_sync.dependency_graph import compute_tiers, flatten_tiers + + tiers, dropped = compute_tiers(self.schema_mapping) + for idx, tier in enumerate(tiers): + logger.info("tier %d (%d): %s", idx, len(tier), sorted(tier)) + if dropped: + logger.warning( + "dropped optional edges to break cycles: %s", + dropped, + ) + return flatten_tiers(tiers), tiers + class SyncInstance(SyncConfig): directory: str @@ -152,6 +199,33 @@ def load(self): def model_loader(self, model_name: str, model): raise NotImplementedError + def cursor_tier_for(self, model_name: str) -> CursorTier: # noqa: ARG002 + """Strongest cursor tier the adapter supports for this model. + + Default = NONE (always full extract). Override per adapter. + """ + return CursorTier.NONE + + def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterable[dict]: + """Yield raw upstream records changed since `cursor`. + + Adapters that override `cursor_tier_for` to a non-NONE tier MUST + implement this. Records are dicts in the same shape `model_loader` + feeds to `add(...)` (DiffSync model fields). + """ + msg = ( + f"{type(self).__name__}.list_changed_since is not implemented. " + "Override it or keep cursor_tier_for returning NONE." + ) + raise NotImplementedError(msg) + + def list_existing_ids(self, model_name: str) -> Iterable[str]: + """Yield current `unique_id` strings for `model_name` in the source + of truth. Used for delete detection between incremental runs. + """ + msg = f"{type(self).__name__}.list_existing_ids is not implemented. Override it for soft-delete detection." + raise NotImplementedError(msg) + class DiffSyncModelMixin: # Set on generated subclasses (see generator/templates/diffsync_models.j2). 
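The cursor hooks added to the adapter base class above form a small contract: `cursor_tier_for` defaults to NONE, and an adapter that advertises a stronger tier must also implement `list_changed_since`. The sketch below illustrates that contract with stand-in types. `BaseAdapter`, `InMemoryAdapter`, the `CursorState` dataclass shape, and the sample records are hypothetical; only the method names and the `cursor.value` attribute come from the diff.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from enum import Enum


class CursorTier(Enum):
    """Stand-in for infrahub_sync.cache.cursors.CursorTier (member values assumed)."""

    NONE = "none"
    TIMESTAMP = "timestamp"


@dataclass(frozen=True)
class CursorState:
    """Stand-in: the diff only shows that a cursor exposes `.value`."""

    value: str


class BaseAdapter:
    """Mirrors the defaults in the diff: no cursor support unless overridden."""

    def cursor_tier_for(self, model_name: str) -> CursorTier:
        return CursorTier.NONE

    def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterable[dict]:
        msg = f"{type(self).__name__}.list_changed_since is not implemented."
        raise NotImplementedError(msg)


class InMemoryAdapter(BaseAdapter):
    """Hypothetical adapter backed by a dict of records keyed by model name."""

    def __init__(self, records: dict[str, list[dict]]) -> None:
        self.records = records

    def cursor_tier_for(self, model_name: str) -> CursorTier:
        # Only advertise TIMESTAMP for models this adapter actually holds.
        return CursorTier.TIMESTAMP if model_name in self.records else CursorTier.NONE

    def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterator[dict]:
        # ISO-8601 UTC strings in one fixed format compare correctly as text;
        # ">=" mirrors the `last_updated__gte` filter used by the real adapters.
        for record in self.records[model_name]:
            if record["last_updated"] >= cursor.value:
                yield record


adapter = InMemoryAdapter(
    {
        "InfraDevice": [
            {"name": "edge-1", "last_updated": "2024-01-01T00:00:00Z"},
            {"name": "edge-2", "last_updated": "2024-03-01T00:00:00Z"},
        ]
    }
)
changed = list(adapter.list_changed_since("InfraDevice", CursorState("2024-02-01T00:00:00Z")))
```

Keeping the default at NONE means an adapter that never overrides these hooks simply stays on the full extract path, so the `NotImplementedError` is only reachable when an adapter advertises a tier it does not back up.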
diff --git a/infrahub_sync/adapters/infrahub.py b/infrahub_sync/adapters/infrahub.py index 6f351f2..565ef5f 100644 --- a/infrahub_sync/adapters/infrahub.py +++ b/infrahub_sync/adapters/infrahub.py @@ -15,6 +15,7 @@ from infrahub_sdk.node.property import NodeProperty from infrahub_sdk.schema.main import GenericSchemaAPI, NodeSchemaAPI, RelationshipSchemaAPI from infrahub_sdk.utils import compare_lists +from pydantic import ValidationError from typing_extensions import Self from infrahub_sync import ( @@ -23,12 +24,19 @@ SyncAdapter, SyncConfig, ) +from infrahub_sync.cache.cursors import CursorState, CursorTier from infrahub_sync.generator import has_field logger = logging.getLogger(__name__) +# GraphQL filter kwarg for timestamp-based incremental queries. +# Verified against a live Infrahub via __type introspection — every node +# exposes the metadata-prefixed `node_metadata__updated_at__after` arg. +# Adjust if the server renames this field. +_TIMESTAMP_FILTER_KW = "node_metadata__updated_at__after" + if TYPE_CHECKING: - from collections.abc import Mapping, MutableMapping + from collections.abc import Iterator, Mapping, MutableMapping from infrahub_sdk.node import InfrahubNodeSync, RelatedNodeSync, RelationshipManagerSync from infrahub_sdk.schema import MainSchemaTypesAPI @@ -226,9 +234,51 @@ def diffsync_to_infrahub( return data +class PeerIdentifierError(ValueError): + """Raised when an Infrahub peer node is missing a value required to build its DiffSync identifier. + + Carries enough context (parent kind/id, relationship name, peer kind/id, missing keys, + identifiers schema, values that were present) for the user to fix the schema_mapping + or seed the missing data without re-running the failing job. 
+ """ + + def __init__( + self, + *, + parent_kind: str, + parent_id: str | None, + rel_name: str, + peer_kind: str, + peer_id: str | None, + identifiers: tuple[str, ...], + missing_keys: tuple[str, ...], + present_keys: tuple[str, ...], + ) -> None: + self.parent_kind = parent_kind + self.parent_id = parent_id + self.rel_name = rel_name + self.peer_kind = peer_kind + self.peer_id = peer_id + self.identifiers = identifiers + self.missing_keys = missing_keys + self.present_keys = present_keys + msg = ( + f"Cannot build unique_id for peer {peer_kind}[{peer_id}] " + f"(relationship {parent_kind}.{rel_name}, parent id={parent_id}): " + f"missing identifier key(s) {list(missing_keys)}; " + f"required identifiers={list(identifiers)}, present keys={list(present_keys)}. " + "Likely cause: schema_mapping does not declare a 'fields:' entry for the missing " + "key, or the peer record was not loaded with that field populated. " + "Re-run with --continue-on-error to skip these peers." + ) + super().__init__(msg) + + class InfrahubAdapter(DiffSyncMixin, Adapter): type = "Infrahub" + continue_on_error: bool = False + def __init__( self, target: str, @@ -294,6 +344,67 @@ def __init__( # We will keep a copy of the schema self.schema: MutableMapping[str, MainSchemaTypesAPI] = self.client.schema.all(branch=infrahub_branch) + def cursor_tier_for(self, model_name: str) -> CursorTier: + """TIMESTAMP for any kind present in the live Infrahub schema. + + Every Infrahub node carries `node_metadata.updated_at`, so the + `node_metadata__updated_at__after` filter works for any kind the + destination schema knows about. Kinds absent from `self.schema` + fall back to NONE — defensive guard so the engine never attempts + an incremental query for an unknown kind. 
+ """ + if model_name in self.schema: + return CursorTier.TIMESTAMP + return CursorTier.NONE + + def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterator[dict]: + """Yield Infrahub nodes changed since `cursor.value`. + + Uses the `node_metadata__updated_at__after` GraphQL filter + (see `_TIMESTAMP_FILTER_KW`). + """ + if model_name not in self.schema: + msg = f"Infrahub: model {model_name!r} not in schema; cursor tier NONE" + raise NotImplementedError(msg) + + filter_kwargs = {_TIMESTAMP_FILTER_KW: cursor.value} + nodes = self.client.filters( # ty: ignore[no-matching-overload] + kind=model_name, + populate_store=True, + prefetch_relationships=True, + **filter_kwargs, + ) + for node in nodes: + yield self.infrahub_node_to_diffsync(node=node) + + def list_existing_ids(self, model_name: str) -> Iterator[str]: + """Yield unique IDs for all Infrahub nodes of `model_name`. + + Used by soft-delete sweeps: timestamp-filtered queries miss DELETEs, + so an occasional ID-only scan catches removed peers. + """ + if model_name not in self.schema: + msg = f"Infrahub: model {model_name!r} not in schema; cursor tier NONE" + raise NotImplementedError(msg) + + model_cls = getattr(self, model_name, None) + if model_cls is None: + msg = f"Infrahub: adapter has no model class for {model_name!r}" + raise NotImplementedError(msg) + + # `include` is the list of attribute fields the diffsync model + # treats as identifiers. Pulling just those keeps the GraphQL + # response small. + identifiers = list(getattr(model_cls, "_identifiers", ()) or ()) + nodes = self.client.all( + kind=model_name, + include=identifiers or None, + populate_store=False, + ) + for node in nodes: + payload = self.infrahub_node_to_diffsync(node=node) + yield model_cls(**payload).get_unique_id() + def model_loader(self, model_name: str, model: type[InfrahubModel]) -> None: """ Load and process models using schema mapping filters and transformations. 
@@ -326,11 +437,70 @@ def model_loader(self, model_name: str, model: type[InfrahubModel]) -> None: # Create model instances after filtering and transforming for transformed_obj in transformed_objs: original_node: InfrahubNodeSync = next(node for node, obj in node_dict_pairs if obj == transformed_obj) - item = model(**transformed_obj) + try: + item = model(**transformed_obj) + except ValidationError as exc: + if not self.continue_on_error: + raise + logger.warning( + "Skipping %s[%s]: cannot build DiffSync model " + "(likely a required peer was skipped earlier). Pydantic errors: %s", + model_name, + transformed_obj.get("local_id"), + exc.errors(include_url=False), + ) + continue unique_id = item.get_unique_id() self.client.store.set(key=unique_id, node=original_node) self.update_or_add_model_instance(item) + def _resolve_peer_unique_id( + self, + *, + parent_node: InfrahubNodeSync, + rel_name: str, + peer_node: InfrahubNodeSync, + ) -> str | None: + """Resolve a peer node to its DiffSync unique_id. + + Returns None if the peer cannot be mapped (no DiffSync model, or + `continue_on_error` is set and the peer is missing identifier values). + Raises ``PeerIdentifierError`` otherwise so the operator sees actionable + context instead of a bare ``KeyError``. 
+ """ + peer_kind = peer_node._schema.kind + peer_model = getattr(self, peer_kind, None) + if not peer_model: + logger.warning("Unable to map '%s' with kind '%s' - Ignored", peer_node, peer_kind) + return None + + peer_data = self.infrahub_node_to_diffsync(peer_node) + identifiers = tuple(peer_model._identifiers) + missing = tuple(k for k in identifiers if k not in peer_data) + if missing: + err = PeerIdentifierError( + parent_kind=parent_node._schema.kind, + parent_id=str(getattr(parent_node, "id", None)), + rel_name=rel_name, + peer_kind=peer_kind, + peer_id=str(getattr(peer_node, "id", None)), + identifiers=identifiers, + missing_keys=missing, + present_keys=tuple(peer_data.keys()), + ) + if self.continue_on_error: + logger.warning("Skipping peer relationship: %s", err) + return None + raise err + + unique_id = peer_model.create_unique_id(**{k: peer_data[k] for k in identifiers}) + peer_item = self.store.get(model=peer_kind, identifier=unique_id) + if not peer_item: + peer_item = peer_model(**peer_data) + self.update_or_add_model_instance(peer_item) + self.client.store.set(key=unique_id, node=peer_node) + return peer_item.get_unique_id() + def infrahub_node_to_diffsync(self, node: InfrahubNodeSync) -> dict[str, Any]: """ Convert an Infrahub node into a dictionary suitable for creating a DiffSyncModel. 
@@ -373,29 +543,12 @@ def infrahub_node_to_diffsync(self, node: InfrahubNodeSync) -> dict[str, Any]: ) if not peer_node: continue - - # First, get the peer model class to access identifiers - peer_model = getattr(self, peer_node._schema.kind, None) - if not peer_model: - logger.warning("Unable to map '%s' with kind '%s'", peer_node, peer_node._schema.kind) + unique_id = self._resolve_peer_unique_id( + parent_node=node, rel_name=rel_schema.name, peer_node=peer_node + ) + if unique_id is None: continue - - # Convert peer_node to dict to extract identifier values - peer_data = self.infrahub_node_to_diffsync(peer_node) - # Create the unique_id using the peer model's identifier schema - unique_id = peer_model.create_unique_id(**{k: peer_data[k] for k in peer_model._identifiers}) - - # Try to get existing item from store using the unique identifier - peer_item = self.store.get(model=peer_node._schema.kind, identifier=unique_id) - - # If not found in store, create and add it - if not peer_item: - peer_item = peer_model(**peer_data) - self.update_or_add_model_instance(peer_item) - # Also store in Infrahub client store for future lookups - self.client.store.set(key=unique_id, node=peer_node) - - data[rel_schema.name] = peer_item.get_unique_id() + data[rel_schema.name] = unique_id elif rel_schema.cardinality == "many": values = [] @@ -413,29 +566,12 @@ def infrahub_node_to_diffsync(self, node: InfrahubNodeSync) -> dict[str, Any]: ) if not peer_node: continue - - # First, get the peer model class to access identifiers - peer_model = getattr(self, peer_node._schema.kind, None) - if not peer_model: - logger.warning("Unable to map '%s' with kind '%s' - Ignored", peer_node, peer_node._schema.kind) + unique_id = self._resolve_peer_unique_id( + parent_node=node, rel_name=rel_schema.name, peer_node=peer_node + ) + if unique_id is None: continue - - # Convert peer_node to dict to extract identifier values - peer_data = self.infrahub_node_to_diffsync(peer_node) - # Create the 
unique_id using the peer model's identifier schema - unique_id = peer_model.create_unique_id(**{k: peer_data[k] for k in peer_model._identifiers}) - - # Try to get existing item from store using the unique identifier - peer_item = self.store.get(model=peer_node._schema.kind, identifier=unique_id) - - # If not found in store, create and add it - if not peer_item: - peer_item = peer_model(**peer_data) - self.update_or_add_model_instance(peer_item) - # Also store in Infrahub client store for future lookups - self.client.store.set(key=unique_id, node=peer_node) - - values.append(peer_item.get_unique_id()) + values.append(unique_id) data[rel_schema.name] = sorted(values) return data diff --git a/infrahub_sync/adapters/nautobot.py b/infrahub_sync/adapters/nautobot.py index 0c26f7c..c016e4c 100644 --- a/infrahub_sync/adapters/nautobot.py +++ b/infrahub_sync/adapters/nautobot.py @@ -3,10 +3,12 @@ # pylint: disable=R0801 import logging import os -from typing import Any +from typing import TYPE_CHECKING, Any import pynautobot # ty: ignore[unresolved-import] # optional dep, see pyproject extras +import pynautobot.core.query # ty: ignore[unresolved-import] # optional dep, see pyproject extras from diffsync import Adapter, DiffSyncModel +from pydantic import ValidationError from typing_extensions import Self from infrahub_sync import ( @@ -16,11 +18,40 @@ SyncAdapter, SyncConfig, ) +from infrahub_sync.cache.cursors import CursorState, CursorTier from .utils import get_value logger = logging.getLogger(__name__) +if TYPE_CHECKING: + from collections.abc import Iterator + + +def _is_unknown_filter_error(exc: pynautobot.core.query.RequestError, field: str) -> bool: + """True if `exc` is a 400 rejecting `field` as an unknown filter. + + Prefers the response JSON, where Nautobot reports the offending filter + as a top-level key (e.g. ``{"last_updated__gte": ["Unknown filter field"]}``), + so the predicate survives wording tweaks in the error string. 
Only when the + body isn't JSON do we fall back to a substring match, and even then we + require both the field name *and* an unknown-filter phrase so an unrelated + 400 that merely happens to mention the field can't trigger a false positive. + """ + req = getattr(exc, "req", None) + if req is None or getattr(req, "status_code", None) != 400: + return False + try: + payload = req.json() + except (ValueError, AttributeError): + payload = None + if isinstance(payload, dict): + # Authoritative signal: the rejected filter appears as a key in the body. + return field in payload + # No JSON body — require the field name and filter-rejection wording. + text = str(exc) + return field in text and "filter" in text.lower() + class NautobotAdapter(DiffSyncMixin, Adapter): type = "Nautobot" @@ -45,6 +76,105 @@ def _create_nautobot_client(self, adapter: SyncAdapter) -> pynautobot.api: client = pynautobot.api(url=url, token=token, threading=True, max_workers=5, retries=3, verify=verify_ssl) return client + def cursor_tier_for(self, model_name: str) -> CursorTier: + """Return TIMESTAMP for any kind we have a schema_mapping for. + + Most pynautobot endpoints accept ``last_updated__gte`` but a few + (e.g. dcim.front-ports / rear-ports) return 400 "Unknown filter + field" — ``list_changed_since`` falls back to a full extract for + those. Kinds not in the schema_mapping return NONE so the engine + never attempts an incremental query for them. + """ + for element in self.config.schema_mapping: + if element.name == model_name and element.mapping: + return CursorTier.TIMESTAMP + return CursorTier.NONE + + def _resolve_endpoint(self, mapping: str) -> Any: + """Walk `mapping` (e.g. 
'dcim.devices' or 'plugins.foo.bar') to a pynautobot endpoint.""" + parts = mapping.split(".") + endpoint = self.client + for part in parts: + try: + endpoint = getattr(endpoint, part) + except AttributeError as exc: + msg = f"Invalid Nautobot mapping path {mapping!r} (missing segment {part!r})" + raise ValueError(msg) from exc + return endpoint + + def _records_to_diffsync( + self, + *, + element: SchemaMappingModel, + model: type[NautobotModel], + raw_records: list[dict], + already_filtered: bool = False, + ) -> Iterator[dict]: + """Filter+transform Nautobot records and yield diffsync-ready dicts. + + Same transformation flow as model_loader, factored out for reuse by + list_changed_since. Pass `already_filtered=True` when the caller has + run `filter_records` itself (e.g. to log a filtered count) so records + aren't filtered twice. + """ + if self.config.source.name.title() == self.type.title(): # ty: ignore[unresolved-attribute] + filtered = ( + raw_records if already_filtered else model.filter_records(records=raw_records, schema_mapping=element) + ) + transformed = model.transform_records(records=filtered, schema_mapping=element) + else: + transformed = raw_records + for obj in transformed: + yield self.nautobot_obj_to_diffsync(obj=obj, mapping=element, model=model) + + def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterator[dict]: + """Yield Nautobot records changed since `cursor`. 
Uses `last_updated__gte` filter.""" + element = next( + (e for e in self.config.schema_mapping if e.name == model_name), + None, + ) + if element is None or not element.mapping: + msg = f"Nautobot: no schema_mapping entry with mapping for {model_name!r}" + raise NotImplementedError(msg) + + model: type[NautobotModel] = getattr(self, model_name) + endpoint = self._resolve_endpoint(element.mapping) + try: + raw = [dict(node) for node in endpoint.filter(last_updated__gte=cursor.value)] + except pynautobot.core.query.RequestError as exc: + # Not every Nautobot endpoint exposes `last_updated__gte` (e.g. + # dcim.front-ports / dcim.rear-ports return 400 with a body like + # `{"last_updated__gte": ["Unknown filter field"]}`). Fall back + # to a full extract for any 400 that mentions the filter key. + if not _is_unknown_filter_error(exc, "last_updated__gte"): + raise + logger.warning( + "Nautobot %s (%s) does not support last_updated__gte; falling back to full extract for this kind.", + model_name, + element.mapping, + ) + raw = [dict(node) for node in endpoint.all()] + yield from self._records_to_diffsync(element=element, model=model, raw_records=raw) + + def list_existing_ids(self, model_name: str) -> Iterator[str]: + """Yield current unique IDs for `model_name` from Nautobot. + + Used by soft-delete sweeps: timestamp-filtered queries miss DELETEs. 
+ """ + element = next( + (e for e in self.config.schema_mapping if e.name == model_name), + None, + ) + if element is None or not element.mapping: + msg = f"Nautobot: no schema_mapping entry with mapping for {model_name!r}" + raise NotImplementedError(msg) + + model: type[NautobotModel] = getattr(self, model_name) + endpoint = self._resolve_endpoint(element.mapping) + raw_records = [dict(node) for node in endpoint.all()] + for payload in self._records_to_diffsync(element=element, model=model, raw_records=raw_records): + yield model(**payload).get_unique_id() + def model_loader(self, model_name: str, model: type[NautobotModel]) -> None: """ Load and process models using schema mapping filters and transformations. @@ -52,7 +182,6 @@ def model_loader(self, model_name: str, model: type[NautobotModel]) -> None: This method retrieves data from Nautobot, applies filters and transformations as specified in the schema mapping, and loads the processed data into the adapter. """ - # Retrieve schema mapping for this model for element in self.config.schema_mapping: if element.name != model_name: continue @@ -61,34 +190,37 @@ def model_loader(self, model_name: str, model: type[NautobotModel]) -> None: logger.info("No mapping defined for '%s', skipping", element.name) continue - # Use the resource endpoint from the schema mapping - app_name, resource_name = element.mapping.split(".") - nautobot_app = getattr(self.client, app_name) - nautobot_model = getattr(nautobot_app, resource_name) - - # Retrieve all objects (RecordSet) - nodes = nautobot_model.all() - - # Transform the RecordSet into a list of Dict - list_obj = [] - for node in nodes: - list_obj.append(dict(node)) - - total = len(list_obj) + endpoint = self._resolve_endpoint(element.mapping) + raw_records = [dict(node) for node in endpoint.all()] + total = len(raw_records) + resource_name = element.mapping.split(".")[-1] if self.config.source.name.title() == self.type.title(): # ty: ignore[unresolved-attribute] - # Filter 
records - filtered_objs = model.filter_records(records=list_obj, schema_mapping=element) - logger.info("%s: Loading %d/%d %s", self.type, len(filtered_objs), total, resource_name) - # Transform records - transformed_objs = model.transform_records(records=filtered_objs, schema_mapping=element) + filtered = model.filter_records(records=raw_records, schema_mapping=element) + # Mirror the NetBox adapter's filtered/total log so operators see + # the same detail regardless of source system. + logger.info("%s: Loading %d/%d %s", self.type, len(filtered), total, resource_name) else: + filtered = raw_records logger.info("%s: Loading all %d %s", self.type, total, resource_name) - transformed_objs = list_obj - # Create model instances after filtering and transforming - for obj in transformed_objs: - data = self.nautobot_obj_to_diffsync(obj=obj, mapping=element, model=model) - item = model(**data) + continue_on_error = getattr(self, "continue_on_error", False) + # Records are already filtered above; don't filter again. + for data in self._records_to_diffsync( + element=element, model=model, raw_records=filtered, already_filtered=True + ): + try: + item = model(**data) + except ValidationError as exc: + if not continue_on_error: + raise + logger.warning( + "Skipping %s[%s]: cannot build DiffSync model " + "(likely a required peer was skipped earlier). 
Pydantic errors: %s", + model_name, + data.get("local_id"), + exc.errors(include_url=False), + ) + continue self.add(item) def nautobot_obj_to_diffsync( diff --git a/infrahub_sync/adapters/netbox.py b/infrahub_sync/adapters/netbox.py index db2f5e6..94b7337 100644 --- a/infrahub_sync/adapters/netbox.py +++ b/infrahub_sync/adapters/netbox.py @@ -3,7 +3,7 @@ # pylint: disable=R0801 import logging import os -from typing import Any +from typing import TYPE_CHECKING, Any import pynetbox # ty: ignore[unresolved-import] # optional dep, see pyproject extras from diffsync import Adapter, DiffSyncModel @@ -17,11 +17,15 @@ SyncAdapter, SyncConfig, ) +from infrahub_sync.cache.cursors import CursorState, CursorTier from .utils import get_value logger = logging.getLogger(__name__) +if TYPE_CHECKING: + from collections.abc import Iterator + class NetboxAdapter(DiffSyncMixin, Adapter): type = "Netbox" @@ -50,6 +54,92 @@ def _create_netbox_client(self, adapter: SyncAdapter) -> pynetbox.api: client.http_session = session return client + def cursor_tier_for(self, model_name: str) -> CursorTier: + """Return TIMESTAMP for any kind we have a schema_mapping for. + + pynetbox DCIM/IPAM/Circuits/Tenancy endpoints uniformly support + `last_updated__gte`. Kinds not in the schema_mapping fall back to + NONE so the engine never attempts an incremental query for them. + """ + for element in self.config.schema_mapping: + if element.name == model_name and element.mapping: + return CursorTier.TIMESTAMP + return CursorTier.NONE + + def _resolve_endpoint(self, mapping: str) -> Any: + """Walk `mapping` (e.g. 
'dcim.devices' or 'plugins.foo.bar') to a pynetbox endpoint.""" + parts = mapping.split(".") + endpoint = self.client + for part in parts: + try: + endpoint = getattr(endpoint, part) + except AttributeError as exc: + msg = f"Invalid NetBox mapping path {mapping!r} (missing segment {part!r})" + raise ValueError(msg) from exc + return endpoint + + def _records_to_diffsync( + self, + *, + element: SchemaMappingModel, + model: type[NetboxModel], + raw_records: list[dict], + already_filtered: bool = False, + ) -> Iterator[dict]: + """Filter+transform NetBox records and yield diffsync-ready dicts. + + Same transformation flow as model_loader, factored out for reuse by + list_changed_since. Pass `already_filtered=True` when the caller has + run `filter_records` itself (e.g. to log a filtered count) so records + aren't filtered twice. + """ + if self.config.source.name.title() == self.type.title(): # ty: ignore[unresolved-attribute] + filtered = ( + raw_records if already_filtered else model.filter_records(records=raw_records, schema_mapping=element) + ) + transformed = model.transform_records(records=filtered, schema_mapping=element) + else: + transformed = raw_records + for obj in transformed: + yield self.netbox_obj_to_diffsync(obj=obj, mapping=element, model=model) + + def list_changed_since(self, model_name: str, cursor: CursorState) -> Iterator[dict]: + """Yield NetBox records changed since `cursor`. 
Uses `last_updated__gte` filter.""" + element = next( + (e for e in self.config.schema_mapping if e.name == model_name), + None, + ) + if element is None or not element.mapping: + msg = f"NetBox: no schema_mapping entry with mapping for {model_name!r}" + raise NotImplementedError(msg) + + model: type[NetboxModel] = getattr(self, model_name) + endpoint = self._resolve_endpoint(element.mapping) + raw = [dict(node) for node in endpoint.filter(last_updated__gte=cursor.value)] + yield from self._records_to_diffsync(element=element, model=model, raw_records=raw) + + def list_existing_ids(self, model_name: str) -> Iterator[str]: + """Yield current unique IDs for `model_name` from NetBox. + + The unique ID is computed by the existing diffsync model: + `model(**netbox_obj_to_diffsync(...)).get_unique_id()`. + Adapters that override the identifier convention will produce + correctly-shaped IDs without further work here. + """ + element = next( + (e for e in self.config.schema_mapping if e.name == model_name), + None, + ) + if element is None or not element.mapping: + msg = f"NetBox: no schema_mapping entry with mapping for {model_name!r}" + raise NotImplementedError(msg) + + model: type[NetboxModel] = getattr(self, model_name) + endpoint = self._resolve_endpoint(element.mapping) + raw_records = [dict(node) for node in endpoint.all()] + for payload in self._records_to_diffsync(element=element, model=model, raw_records=raw_records): + yield model(**payload).get_unique_id() + def model_loader(self, model_name: str, model: type[NetboxModel]) -> None: """ Load and process models using schema mapping filters and transformations. 
@@ -65,33 +155,27 @@ def model_loader(self, model_name: str, model: type[NetboxModel]) -> None: logger.info("No mapping defined for '%s', skipping", element.name) continue - # Use the resource endpoint from the schema mapping - app_name, resource_name = element.mapping.split(".") - netbox_app = getattr(self.client, app_name) - netbox_model = getattr(netbox_app, resource_name) - - # Retrieve all objects (RecordSet) - nodes = netbox_model.all() + # Supports nested attribute paths (e.g. "plugins.foo.bar") for + # pynetbox plugin endpoints. + resource_name = element.mapping.split(".")[-1] + endpoint = self._resolve_endpoint(element.mapping) - # Transform the RecordSet into a list of Dict - list_obj = [] - for node in nodes: - list_obj.append(dict(node)) + # Retrieve all objects (RecordSet) and convert to dicts. + raw_records = [dict(node) for node in endpoint.all()] + total = len(raw_records) - total = len(list_obj) if self.config.source.name.title() == self.type.title(): # ty: ignore[unresolved-attribute] - # Filter records - filtered_objs = model.filter_records(records=list_obj, schema_mapping=element) - logger.info("%s: Loading %d/%d %s", self.type, len(filtered_objs), total, resource_name) - # Transform records - transformed_objs = model.transform_records(records=filtered_objs, schema_mapping=element) + filtered = model.filter_records(records=raw_records, schema_mapping=element) + logger.info("%s: Loading %d/%d %s", self.type, len(filtered), total, resource_name) else: + filtered = raw_records logger.info("%s: Loading all %d %s", self.type, total, resource_name) - transformed_objs = list_obj - # Create model instances after filtering and transforming - for obj in transformed_objs: - data = self.netbox_obj_to_diffsync(obj=obj, mapping=element, model=model) + # Create model instances after transforming — records are already + # filtered above, so `_records_to_diffsync` must not filter again. 
+ for data in self._records_to_diffsync( + element=element, model=model, raw_records=filtered, already_filtered=True + ): item = model(**data) self.add(item) diff --git a/infrahub_sync/cache/__init__.py b/infrahub_sync/cache/__init__.py new file mode 100644 index 0000000..24225d5 --- /dev/null +++ b/infrahub_sync/cache/__init__.py @@ -0,0 +1,62 @@ +"""Cache subsystem: persists every sync run's source/destination snapshots, +the computed plan, and per-row errors as Parquet files under +`cache///`.""" + +from __future__ import annotations + +import hashlib +import json +from typing import TYPE_CHECKING, Any + +if TYPE_CHECKING: + from infrahub_sync import SyncConfig + + +def compute_schema_subhash(config: SyncConfig, schema: dict[str, Any]) -> str: + """Hash inputs that, if changed, must invalidate the cache. + + Captures the operator's schema_mapping shape (resource mapping path, + per-field mapping/reference/static, filters, and transforms) AND the + destination schema's kind names. Anything that affects how a row is + extracted or transformed must contribute to the hash, otherwise a + config edit could silently reuse a stale plan or cursor. + + Returns a 12-hex-char prefix of SHA-256. 
+ """ + payload = { + "schema_mapping": [ + { + "name": sm.name, + "mapping": getattr(sm, "mapping", None), + "identifiers": sm.identifiers, + "fields": [ + { + "name": f.name, + "mapping": getattr(f, "mapping", None), + "reference": getattr(f, "reference", None), + "static": getattr(f, "static", None), + } + for f in (sm.fields or []) + ], + "filters": [ + { + "field": getattr(fltr, "field", None), + "operation": getattr(fltr, "operation", None), + "value": getattr(fltr, "value", None), + } + for fltr in (getattr(sm, "filters", None) or []) + ], + "transforms": [ + { + "field": getattr(t, "field", None), + "expression": getattr(t, "expression", None), + } + for t in (getattr(sm, "transforms", None) or []) + ], + } + for sm in config.schema_mapping + ], + "schema_kinds": sorted(schema.keys()), + } + serialized = json.dumps(payload, sort_keys=True, default=str).encode("utf-8") + return hashlib.sha256(serialized).hexdigest()[:12] diff --git a/infrahub_sync/cache/cursors.py b/infrahub_sync/cache/cursors.py new file mode 100644 index 0000000..1c68bfd --- /dev/null +++ b/infrahub_sync/cache/cursors.py @@ -0,0 +1,39 @@ +"""Cursor tiers for incremental sync. + +Each adapter resource declares its tier. The engine uses the strongest tier +the adapter supports for each resource at run time. 
+ +| Tier | Used by | Update rule | +|-----------------|-----------------------------------------------|--------------------------| +| NONE | adapters that cannot filter by mtime | always full extract | +| PAGE_TOKEN | adapters with `?next=` pagination only | resume mid-page on crash | +| TIMESTAMP | NetBox, Nautobot — `last_updated__gte` | extract changed-since | +| INFRAHUB_DIFF | Infrahub destination read-back | diff API returns deltas | +""" + +from __future__ import annotations + +from dataclasses import dataclass +from enum import IntEnum + + +class CursorTier(IntEnum): + """Capability tier the adapter exposes for incremental cursors (see module docstring).""" + + NONE = 0 + PAGE_TOKEN = 1 + TIMESTAMP = 2 + INFRAHUB_DIFF = 3 + + +@dataclass(frozen=True) +class CursorState: + """Serialized cursor for one model/resource — `tier` + a tier-specific opaque value.""" + + tier: CursorTier + value: str | None = None + + def __post_init__(self) -> None: + if self.tier is not CursorTier.NONE and self.value is None: + msg = f"CursorState(tier={self.tier.name}) requires a non-None value." + raise ValueError(msg) diff --git a/infrahub_sync/cache/guardrails.py b/infrahub_sync/cache/guardrails.py new file mode 100644 index 0000000..3e514e9 --- /dev/null +++ b/infrahub_sync/cache/guardrails.py @@ -0,0 +1,48 @@ +"""Rowcount guardrails. + +The previous successful run's rowcounts are kept in +`/last-successful-rowcounts.json` (one canonical copy per pipeline, +updated only when a sync completes successfully). The next run loads the +baseline; if any resource's current count is below the threshold the engine +raises and asks the operator to confirm with `--allow-rowcount-drop`. 
+""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass, field + +logger = logging.getLogger(__name__) + + +class RowcountGuardrailError(RuntimeError): + """Raised when a resource's rowcount drops below the threshold.""" + + +@dataclass +class RowcountGuardrail: + """Reject per-resource rowcount drops below `drop_threshold` vs. `previous`.""" + + previous: dict[str, int] + drop_threshold: float = 0.5 + allow_drop: bool = False + triggered: list[str] = field(default_factory=list) + + def check(self, resource: str, *, current: int) -> None: + """Raise `RowcountGuardrailError` when `current/prior < drop_threshold`.""" + if self.allow_drop: + return + prior = self.previous.get(resource) + if prior is None or prior == 0: + return + ratio = current / prior + if ratio >= self.drop_threshold: + return + msg = ( + f"Rowcount guardrail tripped for {resource!r}: dropped from " + f"{prior} to {current} (ratio {ratio:.2f} < threshold " + f"{self.drop_threshold:.2f}). Pass --allow-rowcount-drop to override." + ) + self.triggered.append(resource) + logger.error(msg) + raise RowcountGuardrailError(msg) diff --git a/infrahub_sync/cache/incremental.py b/infrahub_sync/cache/incremental.py new file mode 100644 index 0000000..fa5ac38 --- /dev/null +++ b/infrahub_sync/cache/incremental.py @@ -0,0 +1,170 @@ +"""Helpers for incremental (changed-since) extraction. + +Pure functions only — engine wiring lives in `potenda/__init__.py`. 
+""" + +from __future__ import annotations + +import json +import logging +from collections.abc import Callable # noqa: TC003 +from datetime import datetime # noqa: TC003 +from typing import TYPE_CHECKING + +from infrahub_sync.cache.cursors import CursorState, CursorTier +from infrahub_sync.cache.parquet_io import SNAPSHOT_INTERNAL_COLUMNS, read_table +from infrahub_sync.cache.sidecars import CursorsFile + +if TYPE_CHECKING: + from pathlib import Path + +logger = logging.getLogger(__name__) + + +_SUCCESS_STATUSES = frozenset({"applied", "dry-run"}) + + +def previous_successful_run_dir(cache_root: Path) -> Path | None: + """Return the most recent `/` whose run.json status is + 'applied' or 'dry-run'. Returns None when no such run exists. + """ + if not cache_root.exists(): + return None + candidates: list[Path] = [] + for run_dir in cache_root.iterdir(): + if not run_dir.is_dir(): + continue + run_file = run_dir / "run.json" + if not run_file.exists(): + continue + try: + payload = json.loads(run_file.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + if payload.get("status") in _SUCCESS_STATUSES: + candidates.append(run_dir) + if not candidates: + return None + return max(candidates, key=lambda p: p.name) + + +def should_use_incremental( + *, + prev_run_dir: Path | None, + current_subhash: str, + force_full: bool, + runs_since_full: int = 0, + cadence: int = 0, +) -> bool: + """Gate the incremental path. False = full extract. + + Bails out when: caller asked for full extract, no prior run exists, + the cadence threshold is reached, or the schema-subhash changed + (mapping or destination schema moved under us, so prior snapshot is + no longer trustworthy). + + ``cadence=0`` disables the cadence check entirely (0 is falsy). 
+ """ + if force_full: + logger.info("Incremental disabled: --full-extract requested") + return False + if prev_run_dir is None: + logger.info("Incremental disabled: no prior successful run") + return False + if cadence and runs_since_full >= cadence: + logger.info( + "Incremental disabled: cadence reached (%d/%d runs since full)", + runs_since_full, + cadence, + ) + return False + subhash_path = prev_run_dir / "schema-sub-hash.txt" + if not subhash_path.exists(): + logger.info("Incremental disabled: prior run has no schema-sub-hash.txt") + return False + prev_subhash = subhash_path.read_text(encoding="utf-8").strip() + if prev_subhash != current_subhash: + logger.info( + "Incremental disabled: schema-subhash changed (prev=%s, now=%s)", + prev_subhash, + current_subhash, + ) + return False + return True + + +def load_cursors(path: Path, *, side: str) -> dict[str, CursorState]: + """Load ``{model_name: CursorState}`` for the given side. + + Returns an empty dict when the file does not exist or the side has no + entries yet. ``side`` must be ``"A"`` or ``"B"``. + """ + if side not in {"A", "B"}: + msg = f"side must be 'A' or 'B', got {side!r}" + raise ValueError(msg) + raw = CursorsFile.load_or_default(path).cursors.get(side, {}) + out: dict[str, CursorState] = {} + for model_name, packed in raw.items(): + tier_name, _, value = packed.partition(":") + tier = CursorTier[tier_name] + out[model_name] = CursorState(tier=tier, value=value or None) + return out + + +def persist_cursors( + path: Path, + *, + side: str, + cursors: dict[str, CursorState], +) -> None: + """Merge ``cursors`` into the sidecar for the given side and save. + + Existing entries for other sides (or other models in the same side) are + preserved. ``side`` must be ``"A"`` or ``"B"``. 
+ """ + if side not in {"A", "B"}: + msg = f"side must be 'A' or 'B', got {side!r}" + raise ValueError(msg) + sidecar = CursorsFile.load_or_default(path) + bucket = sidecar.cursors.setdefault(side, {}) + for model_name, state in cursors.items(): + bucket[model_name] = f"{state.tier.name}:{state.value or ''}" + sidecar.save() + + +def hydrate_from_parquet( + *, + run_dir: Path, + side: str, + resource: str, + add_row: Callable[[str, dict], None], +) -> tuple[int, datetime | None]: + """Replay ``//.parquet`` into the adapter. + + Calls ``add_row(resource, payload)`` for each non-tombstoned row. + Returns ``(rows_loaded, max_extract_ts)``. When the file is missing + returns ``(0, None)``. + """ + parquet_path = run_dir / side / f"{resource}.parquet" + if not parquet_path.exists(): + return 0, None + + table = read_table(str(parquet_path)) + if table.num_rows == 0: + return 0, None + + cols = [c for c in table.column_names if c not in SNAPSHOT_INTERNAL_COLUMNS] + pylist = table.select(cols).to_pylist() + extract_ts_col = table.column("_extract_ts").to_pylist() + tombstones = table.column("_tombstone").to_pylist() + + rows_loaded = 0 + max_ts: datetime | None = None + for payload, ts, tomb in zip(pylist, extract_ts_col, tombstones, strict=True): + if tomb: + continue + add_row(resource, payload) + rows_loaded += 1 + if ts is not None and (max_ts is None or ts > max_ts): + max_ts = ts + return rows_loaded, max_ts diff --git a/infrahub_sync/cache/locks.py b/infrahub_sync/cache/locks.py new file mode 100644 index 0000000..efe1d1a --- /dev/null +++ b/infrahub_sync/cache/locks.py @@ -0,0 +1,33 @@ +"""Per-pipeline filelock so only one infrahub-sync invocation can write into +the cache for a given sync name at a time.""" + +from __future__ import annotations + +import logging +from contextlib import contextmanager +from typing import TYPE_CHECKING + +from filelock import FileLock + +from infrahub_sync.cache.paths import cache_root_for + +if TYPE_CHECKING: + from 
collections.abc import Iterator + +logger = logging.getLogger(__name__) + + +@contextmanager +def pipeline_lock(sync_name: str, *, timeout: float = 60.0) -> Iterator[None]: + """Acquire an exclusive lock for `sync_name`. Raises filelock.Timeout if + the lock cannot be taken within `timeout` seconds.""" + root = cache_root_for(sync_name) + root.mkdir(parents=True, exist_ok=True) + lock_path = root / ".lock" + lock = FileLock(str(lock_path), timeout=timeout) + logger.debug("Acquiring pipeline lock %s", lock_path) + with lock: + try: + yield + finally: + logger.debug("Released pipeline lock %s", lock_path) diff --git a/infrahub_sync/cache/parquet_io.py b/infrahub_sync/cache/parquet_io.py new file mode 100644 index 0000000..cda17c5 --- /dev/null +++ b/infrahub_sync/cache/parquet_io.py @@ -0,0 +1,142 @@ +"""Atomic Parquet I/O + the well-known schemas used in the cache layout.""" + +from __future__ import annotations + +import logging +from typing import TYPE_CHECKING + +import fsspec + +if TYPE_CHECKING: + from datetime import datetime + from pathlib import Path +import pyarrow as pa +import pyarrow.parquet as pq + +logger = logging.getLogger(__name__) + + +# Columns injected into every per-resource snapshot by `write_resource_side`. +# Single-source so consumers (e.g. `hydrate_from_parquet`) can strip them +# without re-listing the names. 
+SNAPSHOT_INTERNAL_COLUMNS = frozenset({"_extract_ts", "_source_id", "_tombstone"}) + + +PLAN_SCHEMA = pa.schema( + [ + pa.field("action", pa.string(), nullable=False), + pa.field("resource", pa.string(), nullable=False), + pa.field("source_id", pa.string(), nullable=False), + pa.field("dest_id", pa.string()), + pa.field("attribute", pa.string()), + pa.field("old_value", pa.string()), + pa.field("new_value", pa.string()), + pa.field("owner", pa.string()), + pa.field("skip_reason", pa.string()), + pa.field("conflict_class", pa.string()), + ] +) + + +ERRORS_SCHEMA = pa.schema( + [ + pa.field("error_class", pa.string(), nullable=False), + pa.field("resource", pa.string(), nullable=False), + pa.field("source_id", pa.string()), + pa.field("dest_id", pa.string()), + pa.field("attribute", pa.string()), + pa.field("message", pa.string(), nullable=False), + pa.field("hint", pa.string()), + pa.field("retry_count", pa.int64(), nullable=False), + pa.field("terminal", pa.bool_(), nullable=False), + ] +) + + +def write_table(uri: str, table: pa.Table) -> None: + """Write a Parquet table to `uri` atomically. + + The write goes to `.tmp` and is then renamed over `uri`, so a + crashed process never leaves a half-written canonical file. + """ + fs, path = fsspec.core.url_to_fs(uri) + tmp_path = f"{path}.tmp" + parent = path.rsplit("/", 1)[0] if "/" in path else "." 
+ if not fs.exists(parent): + fs.makedirs(parent, exist_ok=True) + with fs.open(tmp_path, "wb") as fh: + pq.write_table(table, fh, compression="snappy") + if fs.exists(path): + fs.rm(path) + fs.mv(tmp_path, path) + + +def read_table(uri: str) -> pa.Table: + """Read a Parquet table from `uri`.""" + fs, path = fsspec.core.url_to_fs(uri) + with fs.open(path, "rb") as fh: + return pq.read_table(fh) + + +def write_plan(*, run_dir: Path, rows: list[dict[str, str]]) -> None: + """Write the diff plan to `/plan.parquet`.""" + table = pa.Table.from_pylist(rows, schema=PLAN_SCHEMA) + write_table(str(run_dir / "plan.parquet"), table) + + +def read_plan(*, run_dir: Path) -> pa.Table: + """Read the diff plan from `/plan.parquet`.""" + return read_table(str(run_dir / "plan.parquet")) + + +def write_resource_side( + *, + run_dir: Path, + side: str, + resource: str, + rows: list[dict[str, object]], + source_ids: list[str], + extract_ts: datetime, + tombstones: list[bool] | None = None, +) -> None: + """Write one side's snapshot of one resource to + `//.parquet`. + + Injects three engine-controlled columns: `_extract_ts`, `_source_id`, + `_tombstone`. `side` is "A" (source) or "B" (destination). + """ + if side not in {"A", "B"}: + msg = f"side must be 'A' or 'B', got {side!r}" + raise ValueError(msg) + if len(rows) != len(source_ids): + msg = ( + f"rows ({len(rows)}) and source_ids ({len(source_ids)}) length " + "mismatch — refusing to write a misaligned snapshot." 
+ ) + raise ValueError(msg) + tombs = tombstones if tombstones is not None else [False] * len(rows) + if len(tombs) != len(rows): + msg = "tombstones length does not match rows length" + raise ValueError(msg) + + if rows: + merged = [] + for row, sid, tomb in zip(rows, source_ids, tombs, strict=True): + payload = dict(row) + payload["_extract_ts"] = extract_ts + payload["_source_id"] = sid + payload["_tombstone"] = tomb + merged.append(payload) + table = pa.Table.from_pylist(merged) + else: + # Empty snapshots still get a file (so re-apply can see "0 rows" and + # the guardrail can compare counts). + table = pa.table( + { + "_extract_ts": pa.array([], type=pa.timestamp("ns", tz="UTC")), + "_source_id": pa.array([], type=pa.string()), + "_tombstone": pa.array([], type=pa.bool_()), + } + ) + + write_table(str(run_dir / side / f"{resource}.parquet"), table) diff --git a/infrahub_sync/cache/paths.py b/infrahub_sync/cache/paths.py new file mode 100644 index 0000000..2a64eb9 --- /dev/null +++ b/infrahub_sync/cache/paths.py @@ -0,0 +1,59 @@ +"""Cache directory layout + run_id allocation.""" + +from __future__ import annotations + +import os +import secrets +from datetime import datetime, timezone +from pathlib import Path, PurePath + + +def _require_safe_segment(value: str, field: str) -> str: + """Reject path segments that could escape the cache root. + + `sync_name` comes from `config.yml` and `run_id` comes from `--run-id` + on the apply command; both are joined into a `Path`, so a value like + `..` or `/etc` would let an attacker (or a typo) write outside the + intended root. + """ + p = PurePath(value) + if p.is_absolute() or len(p.parts) != 1 or p.parts[0] in {".", ".."}: + msg = f"{field} must be a single relative path segment (got {value!r})" + raise ValueError(msg) + return value + + +def cache_root_for(sync_name: str) -> Path: + """Return the per-pipeline cache root. + + Defaults to `/.infrahub-sync-cache//`. 
Override with the + `INFRAHUB_SYNC_CACHE_DIR` environment variable to point at a shared + location (e.g., an NFS mount used by a fleet of runners). The override is + expanded (`~`) and rejected if it contains `..` traversal segments, so a + misconfigured value can't silently redirect the cache outside its root. + """ + _require_safe_segment(sync_name, "sync_name") + base = os.environ.get("INFRAHUB_SYNC_CACHE_DIR") + if base: + base_path = Path(base).expanduser() + if ".." in base_path.parts: + msg = f"INFRAHUB_SYNC_CACHE_DIR must not contain '..' traversal segments (got {base!r})" + raise ValueError(msg) + return base_path / sync_name + return Path.cwd() / ".infrahub-sync-cache" / sync_name + + +def generate_run_id() -> str: + """Return a sortable, low-collision run identifier. + + Format: `YYYYMMDDTHHMM-<8 hex>`. Sortable by time (the prefix), unique + across processes (the suffix is 32 bits of randomness). + """ + now = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M") + return f"{now}-{secrets.token_hex(4)}" + + +def run_dir(sync_name: str, run_id: str) -> Path: + """Concatenate the cache root with the run identifier.""" + _require_safe_segment(run_id, "run_id") + return cache_root_for(sync_name) / run_id diff --git a/infrahub_sync/cache/sidecars.py b/infrahub_sync/cache/sidecars.py new file mode 100644 index 0000000..cc2306e --- /dev/null +++ b/infrahub_sync/cache/sidecars.py @@ -0,0 +1,130 @@ +"""JSON (and one .txt) sidecars carried alongside the Parquet snapshots.""" + +from __future__ import annotations + +import json +import os +import tempfile +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any, ClassVar + + +def _atomic_write_text(path: Path, payload: str) -> None: + """Write `payload` to `path` via tmp+rename.""" + path.parent.mkdir(parents=True, exist_ok=True) + fd, tmp_name = tempfile.mkstemp(prefix=path.name + ".", dir=str(path.parent)) + try: + with os.fdopen(fd, "w", encoding="utf-8") as fh: + 
fh.write(payload) + Path(tmp_name).replace(path) + except BaseException: + # Best-effort cleanup of the tmp file on any failure. + Path(tmp_name).unlink(missing_ok=True) + raise + + +@dataclass +class CursorsFile: + """Per-resource, per-side cursors (since-timestamp, page token, or + Infrahub diff anchor).""" + + path: Path + cursors: dict[str, dict[str, str]] = field(default_factory=lambda: {"A": {}, "B": {}}) + + @classmethod + def load_or_default(cls, path: Path) -> CursorsFile: + if not path.exists(): + return cls(path=path) + data = json.loads(path.read_text(encoding="utf-8")) + return cls(path=path, cursors=data) + + def save(self) -> None: + _atomic_write_text(self.path, json.dumps(self.cursors, indent=2, sort_keys=True)) + + +@dataclass +class RowcountsFile: + path: Path + counts: dict[str, int] = field(default_factory=dict) + + @classmethod + def load_or_default(cls, path: Path) -> RowcountsFile: + if not path.exists(): + return cls(path=path) + data = json.loads(path.read_text(encoding="utf-8")) + return cls(path=path, counts={k: int(v) for k, v in data.items()}) + + def set(self, resource: str, count: int) -> None: + self.counts[resource] = count + + def get(self, resource: str) -> int | None: + return self.counts.get(resource) + + def save(self) -> None: + _atomic_write_text(self.path, json.dumps(self.counts, indent=2, sort_keys=True)) + + +@dataclass +class RunFile: + path: Path + status: str = "pending" # pending | running | dry-run | applied | failed + mode: str = "" # diff | sync | apply + summary: dict[str, Any] = field(default_factory=dict) + finished_at: str | None = None + + KEYS: ClassVar[tuple[str, ...]] = ("status", "mode", "summary", "finished_at") + + @classmethod + def load_or_default(cls, path: Path) -> RunFile: + if not path.exists(): + return cls(path=path) + data = json.loads(path.read_text(encoding="utf-8")) + # Use `k in data` (not `is not None`) so a genuinely-stored null is kept + # rather than silently reset to the dataclass 
default — a stored + # `{"status": null}` should surface as corruption, not masquerade as "pending". + return cls(path=path, **{k: data[k] for k in cls.KEYS if k in data}) + + def save(self) -> None: + payload = {k: getattr(self, k) for k in self.KEYS} + _atomic_write_text(self.path, json.dumps(payload, indent=2, sort_keys=True)) + + +@dataclass +class SchemaHashFile: + path: Path + value: str = "" + + @classmethod + def load(cls, path: Path) -> SchemaHashFile: + if not path.exists(): + return cls(path=path, value="") + return cls(path=path, value=path.read_text(encoding="utf-8").strip()) + + def save(self) -> None: + _atomic_write_text(self.path, self.value) + + +@dataclass +class RunCounterFile: + """Tracks how many runs have happened since the last full extract. + + The engine forces a full extract when `runs_since_full >= cadence`. + Reset to 0 after every full extract. + """ + + path: Path + runs_since_full: int = 0 + + KEYS: ClassVar[tuple[str, ...]] = ("runs_since_full",) + + @classmethod + def load_or_default(cls, path: Path) -> RunCounterFile: + if not path.exists(): + return cls(path=path) + data = json.loads(path.read_text(encoding="utf-8")) + return cls(path=path, **{k: data.get(k, 0) for k in cls.KEYS}) + + def save(self) -> None: + payload = {k: getattr(self, k) for k in self.KEYS} + _atomic_write_text(self.path, json.dumps(payload, indent=2, sort_keys=True)) diff --git a/infrahub_sync/cli.py b/infrahub_sync/cli.py index 5d89897..8d3fa98 100644 --- a/infrahub_sync/cli.py +++ b/infrahub_sync/cli.py @@ -1,6 +1,7 @@ from __future__ import annotations import logging +from datetime import datetime, timezone from enum import Enum from timeit import default_timer as timer from typing import TYPE_CHECKING, NoReturn, cast @@ -9,6 +10,8 @@ from infrahub_sdk import InfrahubClientSync from infrahub_sdk.exceptions import ServerNotResponsiveError +from infrahub_sync.cache.locks import pipeline_lock +from infrahub_sync.cache.sidecars import RunFile from 
infrahub_sync.utils import ( find_missing_schema_model, get_all_sync, @@ -92,6 +95,19 @@ def diff_cmd( default=None, help="Paths to look for adapters. Can be specified multiple times.", ), + run_id: str | None = typer.Option(default=None, help="Re-use a specific cache run id."), + concurrent_load: bool = typer.Option( + default=True, + help=("Load source and destination concurrently. Disable when a custom adapter isn't thread-safe."), + ), + full_extract: bool = typer.Option( + True, # noqa: FBT003 + "--full-extract/--no-full-extract", + help=( + "Re-extract every resource from scratch (default). Pass --no-full-extract to enable " + "the cursor-driven incremental path on warm runs — see docs/reference/incremental-extraction." + ), + ), ) -> None: """Calculate and print the differences between the source and the destination systems for a given project.""" if sum([bool(name), bool(config_file)]) != 1: @@ -109,21 +125,42 @@ def diff_cmd( sync_instance.adapters_path = adapter_path verbosity_level = ctx.obj.get("verbosity", logging.INFO) if ctx.obj else logging.INFO - try: - ptd = get_potenda_from_instance( - sync_instance=sync_instance, branch=branch, show_progress=show_progress, verbosity=verbosity_level - ) - except ValueError as exc: - print_error_and_abort(f"Failed to initialize the Sync Instance: {exc}") - try: - ptd.source_load() - ptd.destination_load() - except ValueError as exc: - print_error_and_abort(str(exc)) - mydiff = ptd.diff() + with pipeline_lock(sync_instance.name): + try: + ptd = get_potenda_from_instance( + sync_instance=sync_instance, + branch=branch, + show_progress=show_progress, + verbosity=verbosity_level, + run_id=run_id, + concurrent_load=concurrent_load, + ) + except ValueError as exc: + print_error_and_abort(f"Failed to initialize the Sync Instance: {exc}") + + ptd.force_full_extract = full_extract + if ptd.run_dir is None: # get_potenda_from_instance always allocates one + msg = "get_potenda_from_instance did not allocate a run_dir" + 
raise RuntimeError(msg) + run_file = RunFile(path=ptd.run_dir / "run.json", status="running", mode="diff") + run_file.save() + + try: + ptd.load_both_sides() + mydiff = ptd.diff() + ptd.write_plan(mydiff) + logger.info("\n%s", mydiff.str()) + run_file.status = "dry-run" + run_file.summary = {"resources": len(ptd.top_level)} + except Exception: + run_file.status = "failed" + run_file.save() + raise - logger.info("\n%s", mydiff.str()) + run_file.finished_at = datetime.now(timezone.utc).isoformat() + run_file.save() + logger.info("Cached run %s at %s", ptd.run_id, ptd.run_dir) @app.command(name="sync") @@ -142,6 +179,33 @@ def sync_cmd( default=None, help="Paths to look for adapters. Can be specified multiple times.", ), + parallel: bool = typer.Option( + default=True, + help="Sync tier-by-tier using the auto-computed dep graph. Requires order: to be omitted from config.yml.", + ), + allow_rowcount_drop: bool = typer.Option( + default=False, + help="Skip the rowcount drop guardrail. Use only when you know the source intentionally shrank.", + ), + continue_on_error: bool = typer.Option( + default=False, + help=( + "Log and skip peer relationships whose identifier values are missing instead of aborting. " + "Useful when source data is partial; review the warnings before relying on the result." + ), + ), + concurrent_load: bool = typer.Option( + default=True, + help=("Load source and destination concurrently. Disable when a custom adapter isn't thread-safe."), + ), + full_extract: bool = typer.Option( + True, # noqa: FBT003 + "--full-extract/--no-full-extract", + help=( + "Re-extract every resource from scratch (default). Pass --no-full-extract to enable " + "the cursor-driven incremental path on warm runs — see docs/reference/incremental-extraction." 
+ ), + ), ) -> None: """Synchronize the data between source and the destination systems for a given project or configuration file.""" if sum([bool(name), bool(config_file)]) != 1: @@ -159,29 +223,133 @@ def sync_cmd( sync_instance.adapters_path = adapter_path verbosity_level = ctx.obj.get("verbosity", logging.INFO) if ctx.obj else logging.INFO - try: - ptd = get_potenda_from_instance( - sync_instance=sync_instance, branch=branch, show_progress=show_progress, verbosity=verbosity_level - ) - except ValueError as exc: - print_error_and_abort(f"Failed to initialize the Sync Instance: {exc}") - try: - ptd.source_load() - ptd.destination_load() - except ValueError as exc: - print_error_and_abort(str(exc)) - mydiff = ptd.diff() + with pipeline_lock(sync_instance.name): + try: + ptd = get_potenda_from_instance( + sync_instance=sync_instance, + branch=branch, + show_progress=show_progress, + verbosity=verbosity_level, + continue_on_error=continue_on_error, + concurrent_load=concurrent_load, + ) + except ValueError as exc: + print_error_and_abort(f"Failed to initialize the Sync Instance: {exc}") + + ptd.force_full_extract = full_extract + if ptd.run_dir is None: # get_potenda_from_instance always allocates one + msg = "get_potenda_from_instance did not allocate a run_dir" + raise RuntimeError(msg) + run_file = RunFile(path=ptd.run_dir / "run.json", status="running", mode="sync") + run_file.save() + + try: + if parallel and not ptd.tiers: + logger.warning( + "--parallel ignored because order: is set in config.yml; " + "remove order: to enable tier-by-tier execution", + ) + + if parallel and ptd.tiers: + try: + ptd.sync_in_tiers(parallel=True, allow_rowcount_drop=allow_rowcount_drop) + except ValueError as exc: + run_file.status = "failed" + run_file.save() + print_error_and_abort(str(exc)) + run_file.summary = {"resources": len(ptd.top_level), "mode": "parallel"} + else: + try: + ptd.load_both_sides() + except ValueError as exc: + run_file.status = "failed" + run_file.save() 
+ print_error_and_abort(str(exc)) + ptd.check_rowcount_guardrail(allow_drop=allow_rowcount_drop) + mydiff = ptd.diff() + ptd.write_plan(mydiff) + if mydiff.has_diffs(): + if diff: + logger.info("\n%s", mydiff.str()) + start_synctime = timer() + ptd.sync(diff=mydiff) + end_synctime = timer() + logger.info("Sync: Completed in %s sec", end_synctime - start_synctime) + else: + logger.info("No difference found. Nothing to sync") + ptd.persist_baseline_counts() + run_file.summary = {"resources": len(ptd.top_level), "mode": "serial"} + + run_file.status = "applied" + except Exception: + run_file.status = "failed" + run_file.save() + raise + + run_file.finished_at = datetime.now(timezone.utc).isoformat() + run_file.save() + logger.info("Sync run %s at %s", ptd.run_id, ptd.run_dir) + + +@app.command(name="apply") +def apply_cmd( + ctx: typer.Context, + name: str = typer.Option(default=None, help="Name of the sync to use"), + config_file: str = typer.Option(default=None, help="File path to the sync configuration YAML file"), + directory: str = typer.Option(default=None, help="Base directory to search for sync configurations"), + run_id: str = typer.Option(..., help="Cache run id produced by a previous `diff`."), + branch: str = typer.Option(default=None, help="Branch to use for the apply."), +) -> None: + """Apply a previously cached plan against the destination — no source extraction.""" + if sum([bool(name), bool(config_file)]) != 1: + print_error_and_abort("Please specify exactly one of 'name' or 'config-file'.") + sync_instance = get_instance(name=name, config_file=config_file, directory=directory) + if not sync_instance: + print_error_and_abort("Failed to load sync instance.") + verbosity_level = ctx.obj.get("verbosity", logging.INFO) if ctx.obj else logging.INFO - if mydiff.has_diffs(): - if diff: - logger.info("\n%s", mydiff.str()) - start_synctime = timer() - ptd.sync(diff=mydiff) - end_synctime = timer() - logger.info("Sync: Completed in %s sec", end_synctime - 
start_synctime) - else: - logger.info("No difference found. Nothing to sync") + with pipeline_lock(sync_instance.name): + ptd = get_potenda_from_instance( + sync_instance=sync_instance, + branch=branch, + verbosity=verbosity_level, + run_id=run_id, + ) + if ptd.run_dir is None: # get_potenda_from_instance always allocates one + msg = "get_potenda_from_instance did not allocate a run_dir" + raise RuntimeError(msg) + run_file = RunFile(path=ptd.run_dir / "run.json", status="running", mode="apply") + run_file.save() + # Check that the cached plan was built against the same schema we + # would build against now. Plan 2 will provide _resolve_infrahub_schema; + # until then this check is a no-op. + try: + from infrahub_sync.cache import compute_schema_subhash + from infrahub_sync.cache.sidecars import SchemaHashFile + from infrahub_sync.utils import _resolve_infrahub_schema # ty: ignore[unresolved-import] + + schema = _resolve_infrahub_schema(sync_instance, branch=branch) + current = compute_schema_subhash(sync_instance, schema) + cached = SchemaHashFile.load(ptd.run_dir / "schema-sub-hash.txt").value + if cached and cached != current: + print_error_and_abort( + f"Cached plan was built against schema-sub-hash {cached!r} but " + f"the destination is now at {current!r}. Re-run `diff` to " + "rebuild the plan." + ) + except ImportError: + pass # Plan 2 resolver not available yet + try: + ptd.apply_plan() + run_file.status = "applied" + except Exception: + run_file.status = "failed" + run_file.save() + raise + run_file.finished_at = datetime.now(timezone.utc).isoformat() + run_file.save() + logger.info("Applied run %s", ptd.run_id) @app.command(name="generate") diff --git a/infrahub_sync/dependency_graph.py b/infrahub_sync/dependency_graph.py new file mode 100644 index 0000000..0fcc59b --- /dev/null +++ b/infrahub_sync/dependency_graph.py @@ -0,0 +1,107 @@ +"""Compute write-order tiers for a SyncConfig from its schema_mapping. 
+ +The dep graph is derived purely from `SchemaMappingField.reference` entries on +each `SchemaMappingModel`. Self-references (a kind that references itself, e.g. +LocationGeneric.parent) are not write-order edges and are excluded. + +Edges where the source field is not in the model's `identifiers` are +"optional": the dependent peer is not part of uniqueness, so the write can be +deferred and the cycle (if any) is broken automatically. Edges where the field +is in `identifiers` are "identity-bearing" — a cycle through identity edges is a +real schema problem and is surfaced to the operator. +""" + +from __future__ import annotations + +import logging +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from infrahub_sync import SchemaMappingModel + +logger = logging.getLogger(__name__) + + +def build_dependency_graph(schema_mapping: list[SchemaMappingModel]) -> dict[str, set[str]]: + """Return the dep graph keyed by kind name. Self-edges are excluded.""" + deps: dict[str, set[str]] = {} + for sm in schema_mapping: + bucket = deps.setdefault(sm.name, set()) + for field in sm.fields or []: + if not field.reference: + continue + if field.reference == sm.name: + continue + bucket.add(field.reference) + return deps + + +def _collect_optional_edges( + schema_mapping: list[SchemaMappingModel], +) -> set[tuple[str, str]]: + """Edges (src, dst) where the field carrying the reference is NOT part of + `identifiers` for src. 
Missing the peer doesn't break uniqueness, so we + can drop the edge to resolve a cycle.""" + optional: set[tuple[str, str]] = set() + for sm in schema_mapping: + identity_set = set(sm.identifiers or []) + for field in sm.fields or []: + if not field.reference or field.reference == sm.name: + continue + if field.name not in identity_set: + optional.add((sm.name, field.reference)) + return optional + + +_MAX_CYCLE_BREAK_ATTEMPTS = 50 + + +def _consecutive_pairs(nodes: list[str]) -> list[tuple[str, str]]: + """Yield successive `(nodes[i], nodes[i+1])` edges along a reported cycle.""" + return [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)] + + +def compute_tiers( + schema_mapping: list[SchemaMappingModel], +) -> tuple[list[set[str]], list[tuple[str, str]]]: + """Return (tiers, dropped_optional_edges). + + Raises `infrahub_sdk.topological_sort.DependencyCycleExistsError` when a + cycle goes through identity-bearing edges only. + """ + from infrahub_sdk.topological_sort import ( + DependencyCycleExistsError, + topological_sort, + ) + + deps = build_dependency_graph(schema_mapping) + optional = _collect_optional_edges(schema_mapping) + dropped: list[tuple[str, str]] = [] + + for _ in range(_MAX_CYCLE_BREAK_ATTEMPTS): + try: + return topological_sort(deps), dropped + except DependencyCycleExistsError as exc: + # Drop every optional edge appearing in *any* reported cycle in one + # pass, then retry — typically resolves in a single extra sort + # instead of one-edge-per-iteration (O(n_cycles) sorts). The bounded + # loop remains only as a safety net should dropping these edges + # expose a fresh cycle. Sorted for deterministic `dropped` output. 
+ to_drop = { + (src, dst) + for cycle in exc.cycles + for src, dst in _consecutive_pairs(list(cycle)) + if (src, dst) in optional and dst in deps.get(src, set()) + } + if not to_drop: + raise + for src, dst in sorted(to_drop): + deps[src].discard(dst) + dropped.append((src, dst)) + msg = "Exceeded cycle-break budget; aborting tier computation." + raise RuntimeError(msg) + + +def flatten_tiers(tiers: list[set[str]]) -> list[str]: + """Deterministic serial ordering: sort within tier, preserve tier order.""" + return [name for tier in tiers for name in sorted(tier)] diff --git a/infrahub_sync/potenda/__init__.py b/infrahub_sync/potenda/__init__.py index 906fee6..44bdb0c 100644 --- a/infrahub_sync/potenda/__init__.py +++ b/infrahub_sync/potenda/__init__.py @@ -2,14 +2,19 @@ import logging import sys -from typing import TYPE_CHECKING +from typing import TYPE_CHECKING, Any from diffsync.enum import DiffSyncFlags from tqdm import tqdm +from infrahub_sync import IncrementalConfig + logger = logging.getLogger(__name__) if TYPE_CHECKING: + from datetime import datetime + from pathlib import Path + from diffsync import Adapter from diffsync.diff import Diff @@ -26,8 +31,34 @@ def __init__( partition=None, show_progress: bool | None = None, verbosity: int | None = None, + tiers: list[set[str]] | None = None, + run_dir: Path | None = None, + run_id: str | None = None, + cache_root: Path | None = None, + schema_subhash: str = "", + continue_on_error: bool = False, + concurrent_load: bool = True, ): self.top_level = top_level + self.tiers: list[set[str]] | None = tiers + self.continue_on_error = continue_on_error + self.concurrent_load = concurrent_load + if self.tiers: + for idx, tier in enumerate(self.tiers): + logger.info("Potenda tier %d (%d): %s", idx, len(tier), sorted(tier)) + # Cache/run identity — passed at construction so the object is fully + # valid on return rather than mutated into shape by the caller. 
+ self.run_dir: Path | None = run_dir + self.run_id: str | None = run_id + self.cache_root: Path | None = cache_root + self._schema_subhash: str = schema_subhash + self._counts: dict[str, int] = {} + self._did_full_extract: bool = False + self._side_extract_ts: dict[str, datetime] = {} + self._prev_run_resolved: bool = False + self._prev_run_cached: Path | None = None + # Runtime toggle set per-command by the CLI just before load. + self.force_full_extract: bool = False self.config = config @@ -38,6 +69,11 @@ def __init__( self.source.top_level = top_level # ty: ignore[invalid-attribute-access] self.destination.top_level = top_level # ty: ignore[invalid-attribute-access] + # Propagate continue_on_error so adapters can skip bad peers in-loop. + # Adapters that don't read the attribute just ignore it. + self.source.continue_on_error = continue_on_error # ty: ignore[unresolved-attribute] + self.destination.continue_on_error = continue_on_error # ty: ignore[unresolved-attribute] + self.partition = partition self.progress_bar = None self.show_progress = show_progress if show_progress is not None else sys.stderr.isatty() @@ -45,10 +81,12 @@ def __init__( if verbosity is not None: logging.getLogger("diffsync").setLevel(verbosity) - # Combine DiffSyncFlags from the configuration + # Combine DiffSyncFlags from the configuration. `config` is typed as + # SyncInstance but tests pass None — guard explicitly. 
self.flags: DiffSyncFlags = DiffSyncFlags.NONE - for flag in self.config.diffsync_flags or []: - self.flags |= flag if isinstance(flag, DiffSyncFlags) else DiffSyncFlags[flag] + if self.config is not None: + for flag in self.config.diffsync_flags or []: + self.flags |= flag if isinstance(flag, DiffSyncFlags) else DiffSyncFlags[flag] # Fallback to `SKIP_UNMATCHED_DST` if nothing is defined if self.flags == DiffSyncFlags.NONE: @@ -69,10 +107,131 @@ def _print_callback(self, stage: str, elements_processed: int, total_models: int elif elements_processed == total_models: logger.info("%s: %d/%d models processed", stage, elements_processed, total_models) + def _previous_run(self) -> Path | None: + """Cached lookup of the most recent successful run dir. + + Called once per side load — recompute is wasteful since both sides + share the same cache_root and the answer is invariant within a run. + """ + if not self._prev_run_resolved: + from infrahub_sync.cache.incremental import previous_successful_run_dir + + self._prev_run_cached = previous_successful_run_dir(self.cache_root) if self.cache_root else None + self._prev_run_resolved = True + return self._prev_run_cached + + def _write_side_snapshot(self, side: str, adapter: Adapter) -> None: + if not self.run_dir: + return + from datetime import datetime, timezone + + from infrahub_sync.cache.parquet_io import write_resource_side + + extract_ts = datetime.now(timezone.utc) + # Remember when this side started loading so persist_cursors_for_run + # can anchor a cursor for resources whose snapshot is empty (e.g. the + # destination on a fresh Infrahub — nothing exists pre-sync, but the + # next warm run still needs a cursor to query `_updated_at__gte` from). 
+ self._side_extract_ts[side] = extract_ts + for kind in adapter.top_level: + records = list(adapter.get_all(kind)) + # Include both identifiers AND attributes so hydrate_from_parquet + # can reconstruct a complete payload — without identifiers, replaying + # a row through `model_cls(**payload)` fails pydantic validation for + # any required identifier field. `get_identifiers` is guarded for + # adapter stubs that don't implement it; falling back to just + # attributes is what the pre-fix behavior did. + rows = [ + { + **(r.get_identifiers() if hasattr(r, "get_identifiers") else {}), + **r.get_attrs(), + } + for r in records + ] + source_ids = [r.get_unique_id() for r in records] + if side == "A": + self._counts[kind] = len(records) + write_resource_side( + run_dir=self.run_dir, + side=side, + resource=kind, + rows=rows, + source_ids=source_ids, + extract_ts=extract_ts, + ) + + def load_one_side(self, *, side: str, adapter: Adapter) -> None: + """Load one side, choosing incremental vs full based on cursors. + + Falls back to the legacy full-extract path (``adapter.load()``) when + there is no prior successful run, the schema-subhash mismatches, or + the caller asked for ``--full-extract``. 
+ """ + from infrahub_sync.cache.cursors import CursorTier + from infrahub_sync.cache.incremental import ( + hydrate_from_parquet, + load_cursors, + should_use_incremental, + ) + from infrahub_sync.cache.sidecars import RunCounterFile + + cache_root = self.cache_root + prev_run = self._previous_run() + + inc_config = self.config.incremental if self.config else None + cadence = inc_config.full_resync_every if inc_config else IncrementalConfig().full_resync_every + + runs_since_full = 0 + if cache_root is not None: + counter = RunCounterFile.load_or_default(cache_root / "run-counter.json") + runs_since_full = counter.runs_since_full + + use_inc = should_use_incremental( + prev_run_dir=prev_run, + current_subhash=self._schema_subhash, + force_full=self.force_full_extract, + runs_since_full=runs_since_full, + cadence=cadence, + ) + + # OR-accumulate so the second side cannot silently overwrite the + # first side's True (persist_baseline_counts resets the run-counter + # only when no side ran the incremental path). 
+ self._did_full_extract = self._did_full_extract or (not use_inc) + + if not use_inc or prev_run is None: + adapter.load() + return + + def _add(model_name: str, payload: dict, _adapter: Adapter = adapter) -> None: + model_cls = getattr(_adapter, model_name) + _adapter.add(model_cls(**payload)) + + cursors = load_cursors(prev_run / "cursors.json", side=side) + for resource in adapter.top_level: + tier_supported = adapter.cursor_tier_for(resource) # ty: ignore[unresolved-attribute] + cursor = cursors.get(resource) + model_cls = getattr(adapter, resource, None) + if model_cls is None: + continue + if cursor is None or tier_supported is CursorTier.NONE: + adapter.model_loader(model_name=resource, model=model_cls) # ty: ignore[unresolved-attribute] + continue + + hydrate_from_parquet( + run_dir=prev_run, + side=side, + resource=resource, + add_row=_add, + ) + for row in adapter.list_changed_since(resource, cursor): # ty: ignore[unresolved-attribute] + adapter.add(model_cls(**row)) + def source_load(self): try: logger.info("Load: Importing data from %s", self.source) - self.source.load() + self.load_one_side(side="A", adapter=self.source) + self._write_side_snapshot("A", self.source) except Exception as exc: msg = f"An error occurred while loading {self.source}: {exc!s}" raise ValueError(msg) from exc @@ -80,15 +239,47 @@ def source_load(self): def destination_load(self): try: logger.info("Load: Importing data from %s", self.destination) - self.destination.load() + self.load_one_side(side="B", adapter=self.destination) + self._write_side_snapshot("B", self.destination) except Exception as exc: msg = f"An error occurred while loading {self.destination}: {exc!s}" raise ValueError(msg) from exc - def load(self): - try: + def load_both_sides(self) -> None: + """Load source and destination. 
+ + When ``self.concurrent_load`` is True (the default), the two loads + run on a 2-thread ``ThreadPoolExecutor`` since they hit independent + services, write to independent ``DiffSyncStore``s, and write to + disjoint cache subdirectories. Roughly halves wall-clock time on + real APIs. + + When ``self.concurrent_load`` is False, falls back to sequential + execution (``source_load`` then ``destination_load``) — useful when + a custom adapter isn't thread-safe. + + Exceptions from either side are surfaced: ``.result()`` is checked + source-first, so the source's error wins when both sides fail. + """ + if not self.concurrent_load: self.source_load() self.destination_load() + return + + from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait + + with ThreadPoolExecutor(max_workers=2, thread_name_prefix="potenda-load") as pool: + src_fut = pool.submit(self.source_load) + dst_fut = pool.submit(self.destination_load) + wait([src_fut, dst_fut], return_when=FIRST_EXCEPTION) + # Surface the failure (if any). ``.result()`` blocks until a still-running + # future finishes (it is not cancelled) and re-raises its exception. + for fut in (src_fut, dst_fut): + fut.result() + + def load(self): + try: + self.load_both_sides() except Exception as exc: msg = f"An error occurred while loading the sync: {exc!s}" raise ValueError(msg) from exc @@ -102,3 +293,211 @@ def sync(self, diff: Diff | None = None): logger.info("Sync: Importing data from %s to %s based on Diff", self.source, self.destination) self.progress_bar = None return self.destination.sync_from(self.source, diff=diff, flags=self.flags, callback=self._print_callback) + + def _diff_to_rows(self, diff: Any) -> list[dict[str, str]]: + """Materialize a diffsync.Diff into plan-row dicts (one per change). + + Pulled out so sync_in_tiers can accumulate rows across per-tier + diffs before writing a single plan.parquet for the whole run. 
+ """ + import json + + rows: list[dict[str, str]] = [] + children = getattr(diff, "children", None) or {} + for resource, elements_by_name in children.items(): + # `elements_by_name` is the diffsync `{name: DiffElement}` mapping. + for element in elements_by_name.values(): + action = getattr(element, "action", None) or "" + if not action: + # Skip elements with no actionable change (no-op). + continue + attrs_diffs = element.get_attrs_diffs() if hasattr(element, "get_attrs_diffs") else {} + old_attrs = attrs_diffs.get("-") or {} + new_attrs = attrs_diffs.get("+") or {} + rows.append( + { + "action": action, + "resource": resource, + "source_id": getattr(element, "name", "") or "", + "dest_id": "", + "attribute": "", + "old_value": json.dumps(old_attrs, sort_keys=True, default=str) if old_attrs else "", + "new_value": json.dumps(new_attrs, sort_keys=True, default=str) if new_attrs else "", + "owner": "", + "skip_reason": "", + "conflict_class": "", + } + ) + return rows + + def write_plan(self, diff: Any) -> None: + """Serialize the diffsync Diff into /plan.parquet.""" + if not self.run_dir: + return + from infrahub_sync.cache.parquet_io import write_plan + + write_plan(run_dir=self.run_dir, rows=self._diff_to_rows(diff)) + + def apply_plan(self) -> None: + """Dispatch each row in plan.parquet to the destination adapter. + + The destination's `apply_cached_row(*, resource, action, source_id, + attribute, new_value)` method is expected to perform the actual + write. For adapters that don't implement it yet, this raises + NotImplementedError telling the operator to fall back to `sync`. + """ + from infrahub_sync.cache.parquet_io import read_plan + + if not self.run_dir: + msg = "Potenda.apply_plan requires run_dir to be set." + raise ValueError(msg) + if not hasattr(self.destination, "apply_cached_row"): + msg = ( + f"Destination adapter {type(self.destination).__name__} does " + "not implement apply_cached_row. 
Use `infrahub-sync sync` " + "until the adapter is upgraded." 
+ ) + raise NotImplementedError(msg) + apply_cached_row = getattr(self.destination, "apply_cached_row") + table = read_plan(run_dir=self.run_dir) + for i in range(table.num_rows): + apply_cached_row( + resource=table.column("resource")[i].as_py(), + action=table.column("action")[i].as_py(), + source_id=table.column("source_id")[i].as_py(), + attribute=table.column("attribute")[i].as_py(), + new_value=table.column("new_value")[i].as_py(), + ) + + def persist_cursors_for_run(self, *, side: str) -> None: + """Walk the run_dir snapshot files for `side`, compute per-resource + cursors (max `_extract_ts`), and persist into `/cursors.json`. + """ + if not self.run_dir: + return + import pyarrow.compute as pc + + from infrahub_sync.cache.cursors import CursorState, CursorTier + from infrahub_sync.cache.incremental import persist_cursors + from infrahub_sync.cache.parquet_io import read_table + + side_dir = self.run_dir / side + if not side_dir.exists(): + return + + adapter = self.source if side == "A" else self.destination + fallback_ts = self._side_extract_ts.get(side) + cursors: dict[str, CursorState] = {} + for parquet_path in side_dir.glob("*.parquet"): + resource = parquet_path.stem + tier = adapter.cursor_tier_for(resource) # ty: ignore[unresolved-attribute] + if tier is CursorTier.NONE: + continue + table = read_table(str(parquet_path)) + if table.num_rows == 0: + # Empty snapshot (e.g. destination on a fresh Infrahub). + # Anchor the cursor to when this side started loading so the + # next warm run's `_updated_at__gte=` picks up + # whatever this run wrote afterwards. 
+ if fallback_ts is not None: + cursors[resource] = CursorState(tier=tier, value=fallback_ts.isoformat()) + continue + max_ts = pc.max(table.column("_extract_ts")).as_py() # ty: ignore[unresolved-attribute] + cursors[resource] = CursorState(tier=tier, value=max_ts.isoformat()) + + if cursors: + persist_cursors(self.run_dir / "cursors.json", side=side, cursors=cursors) + + def persist_baseline_counts(self) -> None: + """Write the source-side row counts to the canonical baseline file. + + Called only after a successful sync — a failed run must not poison + the baseline. Also bumps run-counter.json toward the cadence + threshold (or resets it to zero if this run was a full extract). + """ + if not self.run_dir: + return + from infrahub_sync.cache.paths import cache_root_for + from infrahub_sync.cache.sidecars import RowcountsFile, RunCounterFile + + root = cache_root_for(self.config.name if self.config else "_unknown") + counts_file = RowcountsFile.load_or_default(root / "last-successful-rowcounts.json") + for k, v in self._counts.items(): + counts_file.set(k, v) + counts_file.save() + + counter = RunCounterFile.load_or_default(root / "run-counter.json") + if self._did_full_extract: + counter.runs_since_full = 0 + else: + counter.runs_since_full += 1 + counter.save() + + def check_rowcount_guardrail(self, *, allow_drop: bool) -> None: + if not self.run_dir or not self.config: + return + from infrahub_sync.cache.guardrails import RowcountGuardrail + from infrahub_sync.cache.paths import cache_root_for + from infrahub_sync.cache.sidecars import RowcountsFile + + root = cache_root_for(self.config.name) + baseline = RowcountsFile.load_or_default(root / "last-successful-rowcounts.json") + guard = RowcountGuardrail(previous=baseline.counts, allow_drop=allow_drop) + for resource, current in self._counts.items(): + guard.check(resource, current=current) + + def sync_in_tiers(self, *, parallel: bool = False, allow_rowcount_drop: bool = False) -> None: + """Run diff+sync one 
tier at a time. + + When `self.tiers` is empty, falls back to the existing serial pathway. + Otherwise narrows the destination's top_level to each tier in turn + and runs the tiers sequentially; `parallel` is reserved and currently + ignored. Aggregates per-tier diff rows into a single plan.parquet so + `apply` and operators can review the whole change set. + """ + if not self.tiers: + self.load_both_sides() + self.check_rowcount_guardrail(allow_drop=allow_rowcount_drop) + diff = self.diff() + self.write_plan(diff) + if diff.has_diffs(): + self.sync(diff=diff) + # Re-snapshot destination AFTER writes so the next warm run + # hydrates from real post-sync state rather than the pre-sync + # (often empty) snapshot. Source state was already final. + self._write_side_snapshot("B", self.destination) + self.persist_baseline_counts() + self.persist_cursors_for_run(side="A") + self.persist_cursors_for_run(side="B") + return + + self.load_both_sides() + self.check_rowcount_guardrail(allow_drop=allow_rowcount_drop) + saved_top = self.destination.top_level + aggregated_rows: list[dict[str, str]] = [] + any_writes = False + try: + for idx, tier in enumerate(self.tiers): + tier_list = sorted(tier) + logger.info("Sync tier %d (%d): %s", idx, len(tier), tier_list) + self.destination.top_level = tier_list # ty: ignore[invalid-attribute-access] + diff = self.diff() + aggregated_rows.extend(self._diff_to_rows(diff)) + if diff.has_diffs(): + self.sync(diff=diff) + any_writes = True + finally: + self.destination.top_level = saved_top # ty: ignore[invalid-attribute-access] + + if any_writes: + # Same reasoning as the no-tiers branch — capture post-sync + # destination state for the next warm run's hydrate path. 
+ self._write_side_snapshot("B", self.destination) + if self.run_dir: + from infrahub_sync.cache.parquet_io import write_plan as _write_plan_file + + _write_plan_file(run_dir=self.run_dir, rows=aggregated_rows) + self.persist_baseline_counts() + self.persist_cursors_for_run(side="A") + self.persist_cursors_for_run(side="B") + _ = parallel # reserved for diffsync v3 thread fan-out; see backport doc diff --git a/infrahub_sync/utils.py b/infrahub_sync/utils.py index e3da000..6d998c3 100644 --- a/infrahub_sync/utils.py +++ b/infrahub_sync/utils.py @@ -171,8 +171,15 @@ def get_potenda_from_instance( branch: str | None = None, show_progress: bool | None = None, verbosity: int | None = None, + run_id: str | None = None, + continue_on_error: bool = False, + concurrent_load: bool = True, ) -> Potenda: - """Create and return a Potenda instance based on the provided SyncInstance.""" + """Create and return a Potenda instance based on the provided SyncInstance. + + When ``run_id`` is None, a fresh sortable identifier is allocated via + ``generate_run_id()`` so each invocation gets its own cache directory. + """ source = import_adapter(sync_instance=sync_instance, adapter=sync_instance.source) destination = import_adapter(sync_instance=sync_instance, adapter=sync_instance.destination) @@ -227,17 +234,52 @@ def get_potenda_from_instance( msg = f"Error initializing {sync_instance.destination.name.title()}Adapter: {exc}" raise ValueError(msg) from exc - ptd = Potenda( + # Single topological pass yields both the flat order and the tier layout + # (tiers is None when an explicit `order` is configured). 
+ top_level, tiers = sync_instance.compute_order_and_tiers() + + from infrahub_sync.cache.paths import generate_run_id + from infrahub_sync.cache.paths import run_dir as run_dir_for + + rid = run_id or generate_run_id() + rdir = run_dir_for(sync_instance.name, rid) + rdir.mkdir(parents=True, exist_ok=True) + + # Compute (and persist) the schema sub-hash *before* constructing Potenda so + # the engine receives fully-formed cache identity rather than being mutated + # into shape afterwards. `apply` uses it to refuse cached runs whose shape no + # longer matches the destination's live schema, and `should_use_incremental` + # compares it against the prior run. Uses the destination adapter's live + # schema (populated at __init__); falls back to `sync_instance._cached_schema` + # for test seams. + subhash = "" + try: + from infrahub_sync.cache import compute_schema_subhash + from infrahub_sync.cache.sidecars import SchemaHashFile + + schema = getattr(dst, "schema", None) or getattr(sync_instance, "_cached_schema", None) + if schema: + subhash = compute_schema_subhash(sync_instance, schema) + SchemaHashFile(path=rdir / "schema-sub-hash.txt", value=subhash).save() + except ImportError: + pass # cache extras not available — degrade silently + + return Potenda( destination=dst, source=src, config=sync_instance, - top_level=sync_instance.order, + top_level=top_level, + tiers=tiers, show_progress=show_progress, verbosity=verbosity, + run_dir=rdir, + run_id=rid, + cache_root=rdir.parent, # .infrahub-sync-cache// + schema_subhash=subhash, + continue_on_error=continue_on_error, + concurrent_load=concurrent_load, ) - return ptd - def get_infrahub_config(settings: dict[str, str | None], branch: str | None) -> Config: """Creates and returns a Config object for infrahub if settings are valid. 
diff --git a/pyproject.toml b/pyproject.toml index 7c5a72a..8503909 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -20,6 +20,9 @@ dependencies = [ "diffsync[redis]>=2.1,<3.0", "netutils>=1.9,<2.0", "tqdm>=4.67", + "pyarrow>=17,<22", + "fsspec>=2024.6", + "filelock>=3.13", ] [project.urls] @@ -288,6 +291,15 @@ max-complexity = 33 "PGH003", # Use specific rule codes when ignoring type issues "S101", # Use of `assert` detected (standard in pytest) "PLC0415", # `import` should be at the top-level of a file + "TC003", # Path used only in annotations — acceptable in test files + "ANN001", # Missing type annotation for function argument + "ANN202", # Missing return type annotation for private function + "ANN204", # Missing return type annotation for special method + "PLR2004", # Magic value used in comparison — common in assertions + "SLF001", # Private member accessed — tests legitimately probe internals + "PT011", # pytest.raises match — broad raises acceptable in tests + "B903", # Class could be dataclass — test helpers don't need it + "B905", # zip() without strict= — relaxed in tests ] "tasks/**.py" = [ diff --git a/scripts/bench-clean-nautobot.sh b/scripts/bench-clean-nautobot.sh new file mode 100755 index 0000000..44cac50 --- /dev/null +++ b/scripts/bench-clean-nautobot.sh @@ -0,0 +1,96 @@ +#!/usr/bin/env bash +# Clean-Infrahub comparison: rebuild between scenarios so each cold+warm +# pair sees a freshly-rebuilt destination. Drops InfraInterfaceL2L3 from +# the nautobot example mapping. +# +# Outputs: bench-clean.csv with one row per scenario_phase. + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)" +INFRAHUB_REPO="$(cd "$REPO_ROOT/../infrahub" && pwd)" +CFG_DIR=/tmp/bench-nautobot-cfg +CSV="$REPO_ROOT/bench-clean.csv" + +rebuild_infrahub() { + echo ">>> Rebuilding Infrahub..." + (cd "$INFRAHUB_REPO" && uv run invoke dev.destroy dev.start) >/dev/null 2>&1 + echo ">>> Loading nautobot-v2 schema..." 
+ (cd "$INFRAHUB_REPO" && uv run infrahubctl schema load models/examples/nautobot/nautobot-v2.yml) >/dev/null 2>&1 +} + +build_config() { + cd "$REPO_ROOT" + uv run python - <<'PY' +import shutil +from pathlib import Path +import yaml + +src = Path("examples/nautobot-v2_to_infrahub") +out = Path("/tmp/bench-nautobot-cfg") +shutil.rmtree(out, ignore_errors=True) +out.mkdir() + +data = yaml.safe_load((src / "config.yml").read_text(encoding="utf-8")) +data["schema_mapping"] = [sm for sm in data["schema_mapping"] if sm["name"] != "InfraInterfaceL2L3"] +data.pop("order", None) +(out / "config.yml").write_text(yaml.dump(data, sort_keys=False), encoding="utf-8") + +# The adapters live under //sync_adapter.py +shutil.copytree(src / "nautobot", out / "nautobot") +shutil.copytree(src / "infrahub", out / "infrahub") +PY +} + +run_sync() { + local FLAGS="$1" + local LABEL="$2" + cd "$REPO_ROOT" + local START + START=$(python -c "import time; print(time.monotonic())") + # shellcheck disable=SC2086 + uv run infrahub-sync sync \ + --name from-nautobot-v2 \ + --directory "$CFG_DIR" \ + --no-diff \ + --no-show-progress \ + --continue-on-error \ + $FLAGS \ + >/dev/null 2>&1 + local END + END=$(python -c "import time; print(time.monotonic())") + local ELAPSED + ELAPSED=$(python -c "print(f'{$END - $START:.2f}')") + echo "$LABEL,$ELAPSED" >> "$CSV" + echo ">>> $LABEL took ${ELAPSED}s" +} + +echo "scenario,seconds" > "$CSV" + +# ----- Scenario 1: baseline (no parallel, no concurrent, no incremental) ----- +echo "=== Scenario 1: baseline (serial, sequential, no incremental) ===" +rebuild_infrahub +build_config +rm -rf "$REPO_ROOT/.infrahub-sync-cache/from-nautobot-v2" +run_sync "--no-parallel --no-concurrent-load --full-extract" "baseline-cold" +run_sync "--no-parallel --no-concurrent-load --full-extract" "baseline-warm" + +# ----- Scenario 2: parallel+concurrent (no incremental) ----- +echo "=== Scenario 2: parallel+concurrent (no incremental) ===" +rebuild_infrahub +build_config +rm 
-rf "$REPO_ROOT/.infrahub-sync-cache/from-nautobot-v2" +run_sync "--parallel --concurrent-load --full-extract" "parconc-cold" +run_sync "--parallel --concurrent-load --full-extract" "parconc-warm" + +# ----- Scenario 3: incremental (parallel+concurrent, cursor-driven on warm) ----- +echo "=== Scenario 3: incremental (parallel+concurrent, cursor-driven warm) ===" +rebuild_infrahub +build_config +rm -rf "$REPO_ROOT/.infrahub-sync-cache/from-nautobot-v2" +run_sync "--parallel --concurrent-load --full-extract" "incremental-cold" +run_sync "--parallel --concurrent-load --no-full-extract" "incremental-warm" + +echo "" +echo "=== Results ===" +cat "$CSV" diff --git a/scripts/bench-incremental-only.sh b/scripts/bench-incremental-only.sh new file mode 100755 index 0000000..f243d69 --- /dev/null +++ b/scripts/bench-incremental-only.sh @@ -0,0 +1,85 @@ +#!/usr/bin/env bash +# Single-scenario rerun for incremental on a freshly-rebuilt Infrahub, +# with sync output streamed to a per-cell log so failures don't get +# swallowed by /dev/null. + +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)" +INFRAHUB_REPO="$(cd "$REPO_ROOT/../infrahub" && pwd)" +CFG_DIR=/tmp/bench-nautobot-cfg +CSV="$REPO_ROOT/bench-incremental.csv" + +rebuild_infrahub() { + echo ">>> Rebuilding Infrahub..." + (cd "$INFRAHUB_REPO" && uv run invoke dev.destroy dev.start) >/dev/null 2>&1 + echo ">>> Loading nautobot-v2 schema..." 
+ (cd "$INFRAHUB_REPO" && uv run infrahubctl schema load models/examples/nautobot/nautobot-v2.yml) >/dev/null 2>&1 +} + +build_config() { + cd "$REPO_ROOT" + uv run python - <<'PY' +import shutil +from pathlib import Path +import yaml + +src = Path("examples/nautobot-v2_to_infrahub") +out = Path("/tmp/bench-nautobot-cfg") +shutil.rmtree(out, ignore_errors=True) +out.mkdir() + +data = yaml.safe_load((src / "config.yml").read_text(encoding="utf-8")) +data["schema_mapping"] = [sm for sm in data["schema_mapping"] if sm["name"] != "InfraInterfaceL2L3"] +data.pop("order", None) +(out / "config.yml").write_text(yaml.dump(data, sort_keys=False), encoding="utf-8") + +shutil.copytree(src / "nautobot", out / "nautobot") +shutil.copytree(src / "infrahub", out / "infrahub") +PY +} + +run_sync() { + local FLAGS="$1" + local LABEL="$2" + local LOG="$REPO_ROOT/bench-incremental-$LABEL.log" + cd "$REPO_ROOT" + local START + START=$(python -c "import time; print(time.monotonic())") + # Suspend `set -e` around the sync so a failure is recorded as an exit status instead of aborting the script + set +e + # shellcheck disable=SC2086 + uv run infrahub-sync sync \ + --name from-nautobot-v2 \ + --directory "$CFG_DIR" \ + --no-diff \ + --no-show-progress \ + --continue-on-error \ + $FLAGS \ + >"$LOG" 2>&1 + local EXIT=$? + set -e + local END + END=$(python -c "import time; print(time.monotonic())") + local ELAPSED + ELAPSED=$(python -c "print(f'{$END - $START:.2f}')") + echo "$LABEL,$ELAPSED,$EXIT" >> "$CSV" + echo ">>> $LABEL took ${ELAPSED}s (exit=$EXIT)" + if [ "$EXIT" -ne 0 ]; then + echo ">>> FAILED. 
Tail of $LOG:" + tail -10 "$LOG" + fi +} + +echo "scenario,seconds,exit_code" > "$CSV" + +echo "=== Incremental: cold (force --full-extract) + warm (cursor-driven) ===" +rebuild_infrahub +build_config +rm -rf "$REPO_ROOT/.infrahub-sync-cache/from-nautobot-v2" +run_sync "--parallel --concurrent-load --full-extract" "incremental-cold" +run_sync "--parallel --concurrent-load --no-full-extract" "incremental-warm" + +echo "" +echo "=== Results ===" +cat "$CSV" diff --git a/tasks/__init__.py b/tasks/__init__.py index f4b905e..ab9ce9b 100644 --- a/tasks/__init__.py +++ b/tasks/__init__.py @@ -2,7 +2,7 @@ from invoke import Collection, Context, task -from . import docs, linter, tests +from . import bench, docs, linter, tests ns = Collection("infrahub_sync") ns.configure( @@ -17,6 +17,7 @@ ns.add_collection(Collection.from_module(linter)) ns.add_collection(Collection.from_module(docs)) ns.add_collection(Collection.from_module(tests)) +ns.add_collection(Collection.from_module(bench)) @task(name="lint") diff --git a/tasks/bench.py b/tasks/bench.py new file mode 100644 index 0000000..48cbe2f --- /dev/null +++ b/tasks/bench.py @@ -0,0 +1,418 @@ +"""Benchmark task: compare auto-tier + concurrent-load combinations. + +Runs the 4-scenario x cold/warm matrix and reports wall-clock per cell. + +Scenarios: + 1. baseline - explicit `order:` list, --no-parallel, --no-concurrent-load + 2. topology-only - `order:` omitted, --no-parallel, --no-concurrent-load + 3. parallel-only - `order:` omitted, --parallel, --no-concurrent-load + 4. parallel+conc - `order:` omitted, --parallel, --concurrent-load (default) + +Cold/warm: + cold - `.infrahub-sync-cache//` deleted before the measured run + warm - a measured run immediately after the cold run (cold doubles as the warm-up; no separate un-timed run) + +Usage: + uv run invoke bench.run --name from-netbox --directory examples/ + +Targets live services (e.g., demo.netbox.dev + a local Infrahub). Each cell +runs `infrahub-sync sync`, so the destination Infrahub will be written to. 
+Run #2+ against the same destination is mostly a no-op diff - that's expected. +""" + +from __future__ import annotations + +import csv +import logging +import re +import shlex +import shutil +import tempfile +import time +from dataclasses import dataclass, field +from pathlib import Path +from typing import TYPE_CHECKING + +import yaml +from invoke import Context, task + +if TYPE_CHECKING: + from invoke.runners import Result + +from .utils import REPO_BASE + +NAMESPACE = "BENCH" +logger = logging.getLogger(__name__) + + +# Tuple shape: (label, explicit_order, parallel, concurrent_load). +SCENARIOS: list[tuple[str, bool, bool, bool]] = [ + ("baseline (explicit order, serial, sequential loads)", True, False, False), + ("topology-only (auto-order, serial, sequential loads)", False, False, False), + ("parallel-only (auto-order, --parallel, sequential loads)", False, True, False), + ("parallel+concurrent (auto-order, --parallel, concurrent loads)", False, True, True), +] + +WARMTHS: tuple[str, ...] = ("cold", "warm") + + +@dataclass +class CellResult: + scenario: str + warmth: str + elapsed_seconds: float + exit_code: int + load_seconds: float | None = None + sync_seconds: float | None = None + notes: list[str] = field(default_factory=list) + + +@task(name="run") +def run_bench( # noqa: PLR0913, PLR0914, PLR0915, PLR0917 + context: Context, + name: str = "from-netbox", + directory: str = "examples/", + csv_out: str = "bench-results.csv", + continue_on_error: bool = True, # noqa: FBT001, FBT002 + exclude: str = "InfraInterfaceL2L3,InfraIPAddress", + scenarios: str = "", +) -> None: + """Run the 4-scenario x cold/warm benchmark matrix against a real sync target. + + Args: + name: SyncConfig name (must match `name:` in a config.yml under `directory`). + directory: Root directory holding the sync config. + csv_out: Path for the CSV summary. Rows are streamed as cells finish. 
+ continue_on_error: Pass --continue-on-error to every sync invocation + (default on; matches the operator's preferred bench setup today). + exclude: Comma-separated kinds to drop from schema_mapping (and from + any explicit order: list). Defaults to interfaces + IP addresses, + which dominate wall-clock on the netbox example. + scenarios: Comma-separated substrings; only scenarios whose label + contains any match are run. Empty (default) runs all four. + """ + excluded = {e.strip() for e in exclude.split(",") if e.strip()} + scenario_filters = [s.strip() for s in scenarios.split(",") if s.strip()] + selected_scenarios = ( + [s for s in SCENARIOS if any(f in s[0] for f in scenario_filters)] if scenario_filters else SCENARIOS + ) + if scenario_filters and not selected_scenarios: + msg = f"No scenario label matches filters {scenario_filters!r}" + raise ValueError(msg) + repo_root = REPO_BASE + base_config_path = _find_base_config(repo_root / directory, name) + base_dir = base_config_path.parent + + base_data = yaml.safe_load(base_config_path.read_text(encoding="utf-8")) + if excluded: + base_data = _filter_excluded(base_data, excluded) + + # Tier order is derived from the FILTERED schema_mapping. Write the + # filtered config to a temp file so _compute_tier_order (which shells + # out) reads the right shape. + filtered_config = repo_root / ".bench-filtered-config.yml" + filtered_config.write_text(yaml.dump(base_data, sort_keys=False), encoding="utf-8") + try: + tier_order = _compute_tier_order(context, filtered_config) + finally: + filtered_config.unlink(missing_ok=True) + + cache_root = repo_root / ".infrahub-sync-cache" / name + csv_path = repo_root / csv_out + results: list[CellResult] = [] + + # Open the CSV up-front and stream rows as each cell completes — so a + # mid-run failure (Infrahub OOM, network blip, etc.) still leaves a + # readable partial result on disk instead of an empty file. 
+ _write_csv_header(csv_path) + + print(f"\n[{NAMESPACE}] Benchmarking sync '{name}' against the config at {base_config_path}") + if excluded: + print(f"[{NAMESPACE}] Excluding kinds from schema_mapping: {sorted(excluded)}") + print(f"[{NAMESPACE}] Streaming results to {csv_path} as each cell completes\n") + for label, explicit_order, parallel, concurrent in selected_scenarios: + with tempfile.TemporaryDirectory(prefix="infrahub-sync-bench-", dir=str(repo_root)) as tmp_dir: + tmp_path = Path(tmp_dir) + _write_scenario_config( + base_data=base_data, + base_dir=base_dir, + tmp_dir=tmp_path, + explicit_order=explicit_order, + tier_order=tier_order, + ) + relative_directory = tmp_path.relative_to(repo_root).as_posix() + + for warmth in WARMTHS: + # cold = wipe cache. warm = re-run immediately after cold so + # the cache + Infrahub destination already hold what cold + # produced. No separate un-timed warm-up run — cold IS the + # warm-up. + if warmth == "cold" and cache_root.exists(): + shutil.rmtree(cache_root) + + start = time.monotonic() + result = _run_sync( + context, + name=name, + directory=relative_directory, + parallel=parallel, + concurrent=concurrent, + continue_on_error=continue_on_error, + capture=True, + ) + elapsed = time.monotonic() - start + + stdout = result.stdout if result is not None else "" + exit_code = result.exited if result is not None else -1 + load_s, sync_s, notes = _parse_phase_timings(stdout) + cell = CellResult( + scenario=label, + warmth=warmth, + elapsed_seconds=elapsed, + exit_code=exit_code, + load_seconds=load_s, + sync_seconds=sync_s, + notes=notes, + ) + results.append(cell) + _append_csv_row(csv_path, cell) + print( + f" [{warmth:4s}] {label:60s} total={elapsed:6.2f}s exit={cell.exit_code} (saved to {csv_path.name})" + ) + + # Phase 5 incremental candidate: cold = force full extract; warm = + # let the engine choose (should pick incremental and skip per-resource + # loads when the cursor is fresh). 
+ print(f"\n[{NAMESPACE}] Incremental candidate (cold forces --full-extract; warm lets engine choose)") + with tempfile.TemporaryDirectory(prefix="infrahub-sync-bench-inc-", dir=str(repo_root)) as tmp_dir: + tmp_path = Path(tmp_dir) + _write_scenario_config( + base_data=base_data, + base_dir=base_dir, + tmp_dir=tmp_path, + explicit_order=False, + tier_order=tier_order, + ) + relative_directory = tmp_path.relative_to(repo_root).as_posix() + if cache_root.exists(): + shutil.rmtree(cache_root) + for warmth, force_full in (("cold", True), ("warm", False)): + start = time.monotonic() + result = _run_sync( + context, + name=name, + directory=relative_directory, + parallel=True, + concurrent=True, + continue_on_error=continue_on_error, + full_extract=force_full, + capture=True, + ) + elapsed = time.monotonic() - start + load_s, sync_s, notes = _parse_phase_timings(result.stdout if result is not None else "") + cell = CellResult( + scenario="incremental (auto-order, --parallel, concurrent, cursor-driven)", + warmth=warmth, + elapsed_seconds=elapsed, + exit_code=result.exited if result is not None else -1, + load_seconds=load_s, + sync_seconds=sync_s, + notes=notes, + ) + results.append(cell) + _append_csv_row(csv_path, cell) + print(f" [{warmth:4s}] {cell.scenario:60s} total={elapsed:6.2f}s exit={cell.exit_code}") + + _print_markdown(results) + + +def _find_base_config(directory: Path, name: str) -> Path: + for cfg in directory.rglob("config.yml"): + try: + data = yaml.safe_load(cfg.read_text(encoding="utf-8")) + except yaml.YAMLError: + continue + if isinstance(data, dict) and data.get("name") == name: + return cfg + msg = f"No config.yml with name={name!r} found under {directory!r}" + raise FileNotFoundError(msg) + + +_TIER_ORDER_SCRIPT = """ +import sys, yaml +from infrahub_sync import SyncConfig +from infrahub_sync.dependency_graph import compute_tiers, flatten_tiers +data = yaml.safe_load(open(sys.argv[1], encoding='utf-8').read()) +cfg = SyncConfig(**data) +tiers, _ 
= compute_tiers(cfg.schema_mapping) +for name in flatten_tiers(tiers): + print(name) +""" + + +def _compute_tier_order(context: Context, base_config: Path) -> list[str]: + """Use the engine's auto-tier to derive what `order:` should look like. + + Shelled out via `uv run python` so the task works whether invoke + itself was launched from the project venv or the system pyenv. + """ + cmd = f"uv run python -c {shlex.quote(_TIER_ORDER_SCRIPT)} {shlex.quote(str(base_config))}" + result = context.run(cmd, warn=False, hide=True, pty=False) + if result is None: + msg = "Tier-order script returned no result." + raise RuntimeError(msg) + return [line.strip() for line in result.stdout.splitlines() if line.strip()] + + +def _filter_excluded(data: dict, excluded: set[str]) -> dict: + """Return a new config dict with `excluded` kinds dropped from + schema_mapping and (if present) the operator-supplied `order:` list.""" + filtered = dict(data) + mapping = filtered.get("schema_mapping") or [] + filtered["schema_mapping"] = [sm for sm in mapping if sm.get("name") not in excluded] + if "order" in filtered: + filtered["order"] = [k for k in (filtered.get("order") or []) if k not in excluded] + return filtered + + +def _write_scenario_config( + *, + base_data: dict, + base_dir: Path, + tmp_dir: Path, + explicit_order: bool, + tier_order: list[str], +) -> Path: + data = dict(base_data) + data.pop("order", None) + if explicit_order: + data["order"] = list(tier_order) + out = tmp_dir / "config.yml" + out.write_text(yaml.dump(data, sort_keys=False), encoding="utf-8") + + # The adapter resolver looks for `/sync_adapter.py` alongside + # config.yml (the generated per-kind class attrs live there). Copy the + # source/destination subdirs from the base config's directory so the + # scenario's temp directory is self-contained. 
+ for adapter_name in (data.get("source", {}).get("name"), data.get("destination", {}).get("name")): + if not adapter_name: + continue + src = base_dir / adapter_name + if src.is_dir(): + shutil.copytree(src, tmp_dir / adapter_name) + return out + + +def _run_sync( # noqa: PLR0913 + context: Context, + *, + name: str, + directory: str, + parallel: bool, + concurrent: bool, + continue_on_error: bool, + capture: bool, + full_extract: bool = False, +) -> Result | None: + cmd_parts = [ + "uv run infrahub-sync sync", + f"--name {name}", + f"--directory {directory}", + "--no-diff", + "--no-show-progress", + "--parallel" if parallel else "--no-parallel", + "--concurrent-load" if concurrent else "--no-concurrent-load", + ] + if continue_on_error: + cmd_parts.append("--continue-on-error") + if full_extract: + cmd_parts.append("--full-extract") + cmd = " ".join(cmd_parts) + return context.run(cmd, warn=True, hide=capture, pty=False) + + +_LOAD_LINE_RE = re.compile(r"Load:\s+Importing data from") +_SYNC_DONE_RE = re.compile(r"Sync:\s+Completed in\s+([0-9.]+)\s+sec") +_SYNC_TIER_RE = re.compile(r"Sync tier\s+\d+") + + +def _parse_phase_timings(stdout: str) -> tuple[float | None, float | None, list[str]]: + """Best-effort phase splits from structlog output. + + Today the engine logs human-readable INFO lines: 'Load: Importing data + from ' on each load and 'Sync: Completed in sec' at the + end. Until the engine emits structured phase timestamps, we extract + what we can from those lines and surface caveats as `notes`. 
+ """ + notes: list[str] = [] + load_count = len(_LOAD_LINE_RE.findall(stdout)) + sync_done = _SYNC_DONE_RE.search(stdout) + sync_seconds = float(sync_done.group(1)) if sync_done else None + + tier_count = len(_SYNC_TIER_RE.findall(stdout)) + if tier_count: + notes.append(f"{tier_count} tiers logged") + if not load_count: + notes.append("no 'Load:' lines parsed") + if sync_seconds is None: + notes.append("no 'Sync: Completed' line parsed") + return None, sync_seconds, notes + + +_CSV_HEADER = ("scenario", "warmth", "total_seconds", "sync_seconds", "exit_code", "notes") + + +def _write_csv_header(path: Path) -> None: + """Open the CSV in write mode and write the header. Truncates any existing file.""" + with path.open("w", newline="", encoding="utf-8") as fh: + csv.writer(fh).writerow(_CSV_HEADER) + + +def _append_csv_row(path: Path, cell: CellResult) -> None: + """Append a single cell's row to the CSV and flush so the file is readable + immediately, even if the benchmark dies mid-run.""" + with path.open("a", newline="", encoding="utf-8") as fh: + csv.writer(fh).writerow( + [ + cell.scenario, + cell.warmth, + f"{cell.elapsed_seconds:.3f}", + f"{cell.sync_seconds:.3f}" if cell.sync_seconds is not None else "", + cell.exit_code, + "; ".join(cell.notes), + ] + ) + + +def _print_markdown(results: list[CellResult]) -> None: + by_scenario: dict[str, dict[str, CellResult]] = {} + for r in results: + by_scenario.setdefault(r.scenario, {})[r.warmth] = r + + print("\n## Benchmark results - total wall-clock per cell\n") + print("| Scenario | Cold | Warm | Cold/Warm speedup |") + print("| --- | ---: | ---: | ---: |") + for scen, warmth_map in by_scenario.items(): + cold = warmth_map.get("cold") + warm = warmth_map.get("warm") + cold_s = f"{cold.elapsed_seconds:.2f}s" if cold else "-" + warm_s = f"{warm.elapsed_seconds:.2f}s" if warm else "-" + speedup = "-" + if cold and warm and warm.elapsed_seconds > 0: + speedup = f"{cold.elapsed_seconds / warm.elapsed_seconds:.2f}x" + 
print(f"| {scen} | {cold_s} | {warm_s} | {speedup} |") + print() + print("> Notes:") + print(">") + print( + "> - Today `sync` always re-extracts source + destination; cache state does NOT short-circuit the load phase." + ) + print( + "> Cold/warm deltas reflect upstream service warmth + the destination already containing the data from the cold run." + ) + print("> When `apply` is wired (adapters implement `apply_cached_row`), cold/warm splits will be more dramatic.") + print( + "> - `--parallel` changes write ORDERING (hard tier barrier); wall-clock impact comes from concurrent loads, not from" + ) + print("> per-kind parallelism (diffsync is still single-threaded today).") + print("> - Network jitter against demo.netbox.dev is significant. Run the matrix several times for stable numbers.") diff --git a/tests/adapters/test_infrahub_incremental.py b/tests/adapters/test_infrahub_incremental.py new file mode 100644 index 0000000..4fb46cf --- /dev/null +++ b/tests/adapters/test_infrahub_incremental.py @@ -0,0 +1,128 @@ +from typing import TYPE_CHECKING, Any +from unittest.mock import MagicMock + +from infrahub_sync.cache.cursors import CursorTier + +if TYPE_CHECKING: + from infrahub_sync.adapters.infrahub import InfrahubAdapter + + +def _make_adapter(schema_kinds: list[str]) -> "InfrahubAdapter": + """Build an InfrahubAdapter with stubbed `schema` and `client`. + + The real ctor pulls live schema + accounts from Infrahub. Patch + those out so the constructor can complete with an in-memory schema. 
+ """ + from infrahub_sync import SchemaMappingModel, SyncAdapter, SyncConfig + from infrahub_sync.adapters.infrahub import InfrahubAdapter + + config = SyncConfig( + name="t", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + schema_mapping=[SchemaMappingModel(name=k, mapping="anything", identifiers=["name"]) for k in schema_kinds], + ) + + fake_client = MagicMock() + fake_client.schema.all.return_value = {k: MagicMock() for k in schema_kinds} + fake_client.get.return_value = None + # Build an adapter, then overwrite client/schema rather than running + # the constructor's I/O. Skip __init__ entirely. + adapter = InfrahubAdapter.__new__(InfrahubAdapter) + adapter.target = "test" + adapter.config = config + adapter.client = fake_client + adapter.schema = {k: MagicMock() for k in schema_kinds} + adapter.source_node = None + adapter.owner_node = None + adapter.continue_on_error = False + return adapter + + +def test_cursor_tier_is_timestamp_for_known_kinds() -> None: + adapter = _make_adapter(["InfraDevice", "InfraInterfaceL2L3"]) + assert adapter.cursor_tier_for("InfraDevice") is CursorTier.TIMESTAMP + + +def test_cursor_tier_is_none_for_unknown_kinds() -> None: + adapter = _make_adapter(["InfraDevice"]) + assert adapter.cursor_tier_for("MissingFromSchema") is CursorTier.NONE + + +def test_list_changed_since_uses_updated_at_filter() -> None: + from infrahub_sync.cache.cursors import CursorState + + adapter = _make_adapter(["InfraDevice"]) + fake_node = MagicMock() + adapter.client.filters.return_value = [fake_node] # ty: ignore[unresolved-attribute] + + # Stub infrahub_node_to_diffsync to bypass complex node→dict logic. 
+ adapter.infrahub_node_to_diffsync = MagicMock(return_value={"local_id": "1", "name": "leaf1"}) # ty: ignore[invalid-assignment] + + cursor = CursorState(tier=CursorTier.TIMESTAMP, value="2026-05-17T10:00:00Z") + rows = list(adapter.list_changed_since("InfraDevice", cursor)) + + adapter.client.filters.assert_called_once_with( # ty: ignore[unresolved-attribute] + kind="InfraDevice", + populate_store=True, + prefetch_relationships=True, + node_metadata__updated_at__after="2026-05-17T10:00:00Z", + ) + assert rows == [{"local_id": "1", "name": "leaf1"}] + + +def test_list_changed_since_raises_for_unknown_model() -> None: + import pytest + + from infrahub_sync.cache.cursors import CursorState + + adapter = _make_adapter(["InfraDevice"]) + with pytest.raises(NotImplementedError): + list( + adapter.list_changed_since( + "MissingKind", CursorState(tier=CursorTier.TIMESTAMP, value="2026-01-01T00:00:00Z") + ) + ) + + +def test_list_existing_ids_yields_unique_ids() -> None: + adapter = _make_adapter(["InfraDevice"]) + + # Stub two fake nodes; the adapter calls client.all → 2 nodes. + fake_node_a = MagicMock() + fake_node_b = MagicMock() + adapter.client.all.return_value = [fake_node_a, fake_node_b] # ty: ignore[unresolved-attribute] + + # Stub infrahub_node_to_diffsync to return predictable payloads + payloads = [{"local_id": "1", "name": "leaf1"}, {"local_id": "2", "name": "leaf2"}] + adapter.infrahub_node_to_diffsync = MagicMock(side_effect=payloads) # ty: ignore[invalid-assignment] + + # Stub the model class so `model_cls(**payload).get_unique_id()` returns + # the local_id (or any predictable value tied to the payload). 
+ fake_model = MagicMock() + fake_model._identifiers = ("name",) + + def _make_instance(**payload: Any) -> MagicMock: # noqa: ANN401 + instance = MagicMock() + instance.get_unique_id.return_value = payload["local_id"] + return instance + + fake_model.side_effect = _make_instance + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + ids = list(adapter.list_existing_ids("InfraDevice")) + + adapter.client.all.assert_called_once_with( # ty: ignore[unresolved-attribute] + kind="InfraDevice", + include=["name"], + populate_store=False, + ) + assert ids == ["1", "2"] + + +def test_list_existing_ids_raises_for_unknown_model() -> None: + import pytest + + adapter = _make_adapter(["InfraDevice"]) + with pytest.raises(NotImplementedError): + list(adapter.list_existing_ids("MissingKind")) diff --git a/tests/adapters/test_infrahub_peer_identifier.py b/tests/adapters/test_infrahub_peer_identifier.py new file mode 100644 index 0000000..dad3491 --- /dev/null +++ b/tests/adapters/test_infrahub_peer_identifier.py @@ -0,0 +1,123 @@ +"""Unit tests for InfrahubAdapter._resolve_peer_unique_id error and skip paths. + +The full adapter touches an Infrahub server, so these tests build a minimal +stand-in adapter that reuses the real helper. We only care about three things: + + 1. Missing peer identifier keys raise PeerIdentifierError with rich context. + 2. continue_on_error=True logs a warning and returns None instead of raising. + 3. Successful path returns the peer's unique_id. 
+""" + +from __future__ import annotations + +import logging +from types import SimpleNamespace +from typing import Any + +import pytest + +from infrahub_sync.adapters.infrahub import InfrahubAdapter, PeerIdentifierError + + +class _FakeStore: + def __init__(self) -> None: + self._items: dict[tuple[str, str], object] = {} + + def get(self, *, model: str, identifier: str) -> object: + return self._items.get((model, identifier)) + + def set(self, *, key: str, node: object) -> None: # match client.store.set signature + self._items[getattr(node, "_schema", SimpleNamespace(kind="?")).kind, key] = node + + +class _FakeClient: + def __init__(self) -> None: + self.store = _FakeStore() + + +class _FakePeerModel: + _identifiers = ("name", "organization") + + def __init__(self, **kwargs: object) -> None: + self._kwargs = kwargs + + @classmethod + def create_unique_id(cls, **kwargs: object) -> str: + return "|".join(str(kwargs[k]) for k in cls._identifiers) + + def get_unique_id(self) -> str: + return type(self).create_unique_id(**self._kwargs) + + +class _Harness(InfrahubAdapter): + """Skip the heavy __init__ that needs a real Infrahub server.""" + + def __init__(self, *, continue_on_error: bool = False) -> None: + # bypass the parent chain entirely + self.client = _FakeClient() + self.store = _FakeStore() # ty: ignore[invalid-assignment] + self.continue_on_error = continue_on_error + self._instances: list[object] = [] + # Register the fake peer model under its kind so getattr(self, kind) works. + self.LocationGeneric = _FakePeerModel + + def update_or_add_model_instance(self, item: object) -> None: # ty: ignore[invalid-method-override] + self._instances.append(item) + + def infrahub_node_to_diffsync(self, node: object) -> dict[str, Any]: # noqa: PLR6301 + # Return whatever fake data the test attached to the node. 
+ return dict(node._fake_diffsync_data) # ty: ignore[unresolved-attribute] + + +def _make_node(kind: str, node_id: str, diffsync_data: dict[str, object]) -> SimpleNamespace: + return SimpleNamespace( + id=node_id, + _schema=SimpleNamespace(kind=kind), + _fake_diffsync_data=diffsync_data, + ) + + +def test_missing_identifier_raises_with_rich_context() -> None: + harness = _Harness(continue_on_error=False) + parent = _make_node("InfraDevice", "parent-id", {}) + peer = _make_node("LocationGeneric", "peer-id", {"name": "dc-east"}) # 'organization' missing + + with pytest.raises(PeerIdentifierError) as excinfo: + harness._resolve_peer_unique_id(parent_node=parent, rel_name="location", peer_node=peer) # ty: ignore[invalid-argument-type] + + err = excinfo.value + assert err.parent_kind == "InfraDevice" + assert err.parent_id == "parent-id" + assert err.rel_name == "location" + assert err.peer_kind == "LocationGeneric" + assert err.peer_id == "peer-id" + assert err.missing_keys == ("organization",) + assert "organization" in str(err) + assert "LocationGeneric" in str(err) + assert "InfraDevice.location" in str(err) + + +def test_missing_identifier_skipped_when_continue_on_error(caplog: pytest.LogCaptureFixture) -> None: + harness = _Harness(continue_on_error=True) + parent = _make_node("InfraDevice", "parent-id", {}) + peer = _make_node("LocationGeneric", "peer-id", {"name": "dc-east"}) + + with caplog.at_level(logging.WARNING, logger="infrahub_sync.adapters.infrahub"): + result = harness._resolve_peer_unique_id(parent_node=parent, rel_name="location", peer_node=peer) # ty: ignore[invalid-argument-type] + + assert result is None + assert any("Skipping peer relationship" in rec.message for rec in caplog.records) + + +def test_complete_peer_returns_unique_id() -> None: + harness = _Harness() + parent = _make_node("InfraDevice", "parent-id", {}) + peer = _make_node( + "LocationGeneric", + "peer-id", + {"name": "dc-east", "organization": "acme"}, + ) + + result = 
harness._resolve_peer_unique_id(parent_node=parent, rel_name="location", peer_node=peer) # ty: ignore[invalid-argument-type] + + assert result == "dc-east|acme" diff --git a/tests/adapters/test_nautobot_incremental.py b/tests/adapters/test_nautobot_incremental.py new file mode 100644 index 0000000..f589fa5 --- /dev/null +++ b/tests/adapters/test_nautobot_incremental.py @@ -0,0 +1,133 @@ +from collections import UserDict +from typing import TYPE_CHECKING +from unittest.mock import MagicMock + +import pytest + +from infrahub_sync.cache.cursors import CursorState, CursorTier + +if TYPE_CHECKING: + from infrahub_sync.adapters.nautobot import NautobotAdapter + + +class _FakeRecord(UserDict): + """`dict(MagicMock())` returns {}, so use UserDict to make `dict(node)` work.""" + + +def _make_adapter(mappings: list[dict]) -> "NautobotAdapter": + """Build a NautobotAdapter with stubbed pynautobot client.""" + from infrahub_sync import SchemaMappingModel, SyncAdapter, SyncConfig + from infrahub_sync.adapters.nautobot import NautobotAdapter + + schema_mapping = [SchemaMappingModel(**m) for m in mappings] + config = SyncConfig( + name="t", + source=SyncAdapter(name="nautobot"), + destination=SyncAdapter(name="infrahub"), + schema_mapping=schema_mapping, + ) + adapter_settings = SyncAdapter(name="nautobot", settings={"url": "https://example.invalid", "token": "x"}) + NautobotAdapter._create_nautobot_client = lambda _self, _adapter: MagicMock() # ty: ignore[invalid-assignment] + return NautobotAdapter(target="test", adapter=adapter_settings, config=config) + + +def test_cursor_tier_is_timestamp_for_mapped_kinds() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices"}]) + assert adapter.cursor_tier_for("InfraDevice") is CursorTier.TIMESTAMP + + +def test_cursor_tier_is_none_for_unmapped_kinds() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices"}]) + assert adapter.cursor_tier_for("Unknown") is CursorTier.NONE + + 
+def test_cursor_tier_is_none_for_empty_mapping() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": ""}]) + assert adapter.cursor_tier_for("InfraDevice") is CursorTier.NONE + + +def test_list_changed_since_uses_last_updated_filter() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices", "identifiers": ["name"]}]) + fake_record = _FakeRecord({"id": 1, "name": "leaf1"}) + fake_endpoint = MagicMock() + fake_endpoint.filter.return_value = [fake_record] + adapter.client.dcim.devices = fake_endpoint # ty: ignore[unresolved-attribute] + + fake_model = MagicMock() + fake_model.filter_records.side_effect = lambda records, **_kw: records + fake_model.transform_records.side_effect = lambda records, **_kw: records + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + cursor = CursorState(tier=CursorTier.TIMESTAMP, value="2026-05-17T10:00:00Z") + rows = list(adapter.list_changed_since("InfraDevice", cursor)) + + fake_endpoint.filter.assert_called_once_with(last_updated__gte="2026-05-17T10:00:00Z") + assert rows[0]["local_id"] == "1" + + +def test_list_changed_since_raises_for_unknown_model() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices"}]) + with pytest.raises(NotImplementedError): + list( + adapter.list_changed_since("Unknown", CursorState(tier=CursorTier.TIMESTAMP, value="2026-01-01T00:00:00Z")) + ) + + +def test_list_changed_since_falls_back_when_endpoint_rejects_filter() -> None: + """Some Nautobot endpoints (front-ports, rear-ports, ...) return 400 'Unknown filter field' + on `last_updated__gte`. The adapter must catch that and fall back to `endpoint.all()`. 
+ """ + import pynautobot # ty: ignore[unresolved-import] # optional dep, see pyproject extras + + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices", "identifiers": ["name"]}]) + fake_record = _FakeRecord({"id": 7, "name": "edge1"}) + + fake_endpoint = MagicMock() + # Build a real RequestError mirroring what pynautobot raises on 400. + fake_resp = MagicMock() + fake_resp.status_code = 400 + fake_resp.reason = "Bad Request" + fake_resp.json.return_value = {"last_updated__gte": ["Unknown filter field"]} + fake_resp.url = "https://demo.nautobot.com/api/dcim/devices/?last_updated__gte=…" + fake_resp.text = "" + fake_resp.request.body = None + fake_endpoint.filter.side_effect = pynautobot.core.query.RequestError(fake_resp) + fake_endpoint.all.return_value = [fake_record] + adapter.client.dcim.devices = fake_endpoint # ty: ignore[unresolved-attribute] + + fake_model = MagicMock() + fake_model.filter_records.side_effect = lambda records, **_kw: records + fake_model.transform_records.side_effect = lambda records, **_kw: records + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + cursor = CursorState(tier=CursorTier.TIMESTAMP, value="2026-05-17T10:00:00Z") + rows = list(adapter.list_changed_since("InfraDevice", cursor)) + + fake_endpoint.filter.assert_called_once_with(last_updated__gte="2026-05-17T10:00:00Z") + fake_endpoint.all.assert_called_once_with() + assert rows[0]["local_id"] == "7" + + +def test_list_existing_ids_returns_unique_ids() -> None: + adapter = _make_adapter([{"name": "InfraDevice", "mapping": "dcim.devices", "identifiers": ["name"]}]) + rec_a = _FakeRecord({"id": 1, "name": "leaf1"}) + rec_b = _FakeRecord({"id": 2, "name": "leaf2"}) + fake_endpoint = MagicMock() + fake_endpoint.all.return_value = [rec_a, rec_b] + adapter.client.dcim.devices = fake_endpoint # ty: ignore[unresolved-attribute] + + fake_model = MagicMock() + fake_model.filter_records.side_effect = lambda records, **_kw: records + 
fake_model.transform_records.side_effect = lambda records, **_kw: records + + def _make_instance(**payload: object) -> MagicMock: + instance = MagicMock() + instance.get_unique_id.return_value = payload["local_id"] + return instance + + fake_model.side_effect = _make_instance + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + ids = list(adapter.list_existing_ids("InfraDevice")) + fake_endpoint.all.assert_called_once_with() + assert ids == ["1", "2"] diff --git a/tests/adapters/test_netbox_incremental.py b/tests/adapters/test_netbox_incremental.py new file mode 100644 index 0000000..d3b6bec --- /dev/null +++ b/tests/adapters/test_netbox_incremental.py @@ -0,0 +1,158 @@ +import collections +from typing import TYPE_CHECKING +from unittest.mock import MagicMock + +import pytest + +from infrahub_sync.cache.cursors import CursorState, CursorTier + +if TYPE_CHECKING: + from infrahub_sync.adapters.netbox import NetboxAdapter + + +def _make_adapter(mappings: list[dict]) -> "NetboxAdapter": + """Build a NetboxAdapter with stubbed schema_mapping. + + The adapter ctor calls pynetbox.api() which would fail without a + live URL/token. Patch the client creation to a MagicMock instead. 
+ """ + from infrahub_sync import SchemaMappingModel, SyncAdapter, SyncConfig + from infrahub_sync.adapters.netbox import NetboxAdapter + + schema_mapping = [SchemaMappingModel(**m) for m in mappings] + config = SyncConfig( + name="t", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + schema_mapping=schema_mapping, + ) + target = "test" + settings = {"url": "https://example.invalid", "token": "x"} + + adapter_settings = SyncAdapter(name="netbox", settings=settings) + NetboxAdapter._create_netbox_client = lambda _self, _adapter: MagicMock() # ty: ignore[invalid-assignment] + return NetboxAdapter(target=target, adapter=adapter_settings, config=config) + + +def test_cursor_tier_is_timestamp_for_mapped_kinds() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": "dcim.devices"}, + ] + ) + assert adapter.cursor_tier_for("InfraDevice") is CursorTier.TIMESTAMP + + +def test_cursor_tier_is_none_for_unmapped_kinds() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": "dcim.devices"}, + ] + ) + assert adapter.cursor_tier_for("WeirdModelMissingFromMapping") is CursorTier.NONE + + +def test_cursor_tier_is_none_for_mapping_without_resource_path() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": ""}, + ] + ) + assert adapter.cursor_tier_for("InfraDevice") is CursorTier.NONE + + +class _FakeRecord(collections.UserDict): + """Minimal pynetbox record stub: UserDict so dict(record) works.""" + + +def test_list_changed_since_uses_last_updated_filter() -> None: + adapter = _make_adapter( + [ + { + "name": "InfraDevice", + "mapping": "dcim.devices", + "identifiers": ["name"], + }, + ] + ) + # Build a fake pynetbox endpoint that returns one record. + # Use a dict subclass so dict(record) produces the expected shape. 
+ fake_record = _FakeRecord({"id": 1, "name": "leaf1"}) + fake_endpoint = MagicMock() + fake_endpoint.filter.return_value = [fake_record] + adapter.client.dcim.devices = fake_endpoint + + # Register a minimal model stub so getattr(adapter, "InfraDevice") works. + # filter_records and transform_records pass records through unchanged. + fake_model = MagicMock() + fake_model.filter_records.side_effect = lambda **kw: kw["records"] + fake_model.transform_records.side_effect = lambda **kw: kw["records"] + fake_model.is_list.return_value = False + fake_model.fields = None + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + cursor = CursorState(tier=CursorTier.TIMESTAMP, value="2026-05-17T10:00:00Z") + rows = list(adapter.list_changed_since("InfraDevice", cursor)) + + fake_endpoint.filter.assert_called_once_with(last_updated__gte="2026-05-17T10:00:00Z") + # Result has at least the id we set on the fake record. + assert rows[0]["local_id"] == "1" + + +def test_list_changed_since_raises_for_unknown_model() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": "dcim.devices"}, + ] + ) + with pytest.raises(NotImplementedError): + list( + adapter.list_changed_since( + "UnknownKind", CursorState(tier=CursorTier.TIMESTAMP, value="2026-01-01T00:00:00Z") + ) + ) + + +def test_list_existing_ids_returns_unique_ids() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": "dcim.devices", "identifiers": ["name"]}, + ] + ) + + rec_a = _FakeRecord({"id": 1, "name": "leaf1"}) + rec_b = _FakeRecord({"id": 2, "name": "leaf2"}) + fake_endpoint = MagicMock() + fake_endpoint.all.return_value = [rec_a, rec_b] + adapter.client.dcim.devices = fake_endpoint + + # Stub the model so get_unique_id returns something predictable. 
+ fake_model = MagicMock() + fake_model.filter_records.side_effect = lambda records, **_kw: records + fake_model.transform_records.side_effect = lambda records, **_kw: records + + # When the adapter instantiates `model(**payload)`, it calls the + # MagicMock — make the returned instance expose `get_unique_id` + # tied to the `local_id` we know netbox_obj_to_diffsync sets. + def _make_instance(**payload: object) -> MagicMock: + instance = MagicMock() + instance.get_unique_id.return_value = payload["local_id"] + return instance + + fake_model.side_effect = _make_instance + adapter.InfraDevice = fake_model # ty: ignore[unresolved-attribute] + + ids = list(adapter.list_existing_ids("InfraDevice")) + + fake_endpoint.all.assert_called_once_with() + assert ids == ["1", "2"] + + +def test_list_existing_ids_raises_for_unknown_model() -> None: + adapter = _make_adapter( + [ + {"name": "InfraDevice", "mapping": "dcim.devices"}, + ] + ) + with pytest.raises(NotImplementedError): + list(adapter.list_existing_ids("UnknownKind")) diff --git a/tests/cache/__init__.py b/tests/cache/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/cache/test_apply_plan.py b/tests/cache/test_apply_plan.py new file mode 100644 index 0000000..62c494b --- /dev/null +++ b/tests/cache/test_apply_plan.py @@ -0,0 +1,46 @@ +"""Potenda.apply_plan reads plan.parquet and dispatches writes; no source +extraction happens.""" + +from __future__ import annotations + +from types import SimpleNamespace +from typing import TYPE_CHECKING +from unittest.mock import MagicMock + +from infrahub_sync.cache.parquet_io import write_plan +from infrahub_sync.potenda import Potenda + +if TYPE_CHECKING: + from pathlib import Path + + +def test_apply_plan_dispatches_per_row(tmp_path: Path) -> None: + rows = [ + { + "action": "create", + "resource": "BuiltinTag", + "source_id": "prod", + "dest_id": "", + "attribute": "", + "old_value": "", + "new_value": '{"name":"prod"}', + "owner": "", + "skip_reason": 
"", + "conflict_class": "", + }, + ] + write_plan(run_dir=tmp_path, rows=rows) + + dst = MagicMock() + ptd = Potenda( + source=SimpleNamespace(top_level=[]), # ty: ignore[invalid-argument-type] + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=["BuiltinTag"], + run_dir=tmp_path, + ) + ptd.apply_plan() + dst.apply_cached_row.assert_called_once() + kwargs = dst.apply_cached_row.call_args.kwargs + assert kwargs["resource"] == "BuiltinTag" + assert kwargs["action"] == "create" diff --git a/tests/cache/test_cli_sync_cache.py b/tests/cache/test_cli_sync_cache.py new file mode 100644 index 0000000..27feb89 --- /dev/null +++ b/tests/cache/test_cli_sync_cache.py @@ -0,0 +1,71 @@ +"""End-to-end via Typer CliRunner: `sync` (serial + parallel) produces +run.json and plan.parquet under the cache.""" + +from __future__ import annotations + +import json +from pathlib import Path +from unittest.mock import MagicMock, patch + +from typer.testing import CliRunner + +from infrahub_sync.cli import app + +EXAMPLES_DIR = Path(__file__).resolve().parent.parent.parent / "examples" + + +def _make_fake_potenda(tmp_path: Path, tiers) -> MagicMock: + ptd = MagicMock() + ptd.tiers = tiers + ptd.run_id = "test-run" + ptd.run_dir = tmp_path + ptd.top_level = ["BuiltinTag"] + ptd.diff.return_value = MagicMock(has_diffs=MagicMock(return_value=False), str=MagicMock(return_value="")) + return ptd + + +def test_sync_serial_writes_run_json(tmp_path: Path, monkeypatch) -> None: + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + fake_ptd = _make_fake_potenda(tmp_path / "from-netbox" / "test-run", tiers=None) + fake_ptd.run_dir.mkdir(parents=True, exist_ok=True) + + runner = CliRunner() + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--no-parallel", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + run_json = 
fake_ptd.run_dir / "run.json" + assert run_json.exists() + data = json.loads(run_json.read_text()) + assert data["status"] == "applied" + assert data["mode"] == "sync" + fake_ptd.write_plan.assert_called_once() + fake_ptd.persist_baseline_counts.assert_called_once() + + +def test_sync_parallel_delegates_with_allow_drop(tmp_path: Path, monkeypatch) -> None: + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + fake_ptd = _make_fake_potenda(tmp_path / "from-netbox" / "test-run", tiers=[{"BuiltinTag"}]) + fake_ptd.run_dir.mkdir(parents=True, exist_ok=True) + + runner = CliRunner() + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + [ + "sync", + "--parallel", + "--allow-rowcount-drop", + "--name", + "from-netbox", + "--directory", + str(EXAMPLES_DIR), + ], + ) + assert result.exit_code == 0, result.output + fake_ptd.sync_in_tiers.assert_called_once_with(parallel=True, allow_rowcount_drop=True) + run_json = fake_ptd.run_dir / "run.json" + assert run_json.exists() + assert json.loads(run_json.read_text())["status"] == "applied" diff --git a/tests/cache/test_concurrent_load.py b/tests/cache/test_concurrent_load.py new file mode 100644 index 0000000..8fc5e65 --- /dev/null +++ b/tests/cache/test_concurrent_load.py @@ -0,0 +1,101 @@ +"""Potenda.load_both_sides runs source and destination concurrently.""" + +from __future__ import annotations + +import time +from pathlib import Path +from unittest.mock import MagicMock + +import pytest + +from infrahub_sync.potenda import Potenda + + +def _adapter_with_slow_load(label: str, sleep_seconds: float, *, fail: bool = False) -> MagicMock: + """Build a fake adapter whose `load()` blocks for `sleep_seconds` + and records its start/finish wall-clock times.""" + adapter = MagicMock() + adapter.top_level = [] + adapter.label = label + adapter.events: list[tuple[str, float]] = [] + + def fake_load() -> None: + adapter.events.append(("start", time.monotonic())) 
+ time.sleep(sleep_seconds) + adapter.events.append(("end", time.monotonic())) + if fail: + msg = f"{label} load failed" + raise ValueError(msg) + + adapter.load.side_effect = fake_load + return adapter + + +def test_load_both_sides_runs_concurrently(tmp_path: Path) -> None: + src = _adapter_with_slow_load("src", 0.5) + dst = _adapter_with_slow_load("dst", 0.5) + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=[], + run_dir=tmp_path, + ) + start = time.monotonic() + ptd.load_both_sides() + elapsed = time.monotonic() - start + # Sequential would be ~1.0s; concurrent should be ~0.5s. Allow generous slack. + assert elapsed < 0.8, f"loads ran sequentially (elapsed={elapsed:.2f}s)" + + # Intervals must intersect (sequential execution would pass `src_start < dst_end`). + src_start = next(t for label, t in src.events if label == "start") + src_end = next(t for label, t in src.events if label == "end") + dst_start = next(t for label, t in dst.events if label == "start") + dst_end = next(t for label, t in dst.events if label == "end") + assert max(src_start, dst_start) < min(src_end, dst_end) + + +def test_load_both_sides_sequential_when_disabled(tmp_path: Path) -> None: + src = _adapter_with_slow_load("src", 0.25) + dst = _adapter_with_slow_load("dst", 0.25) + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=[], + run_dir=tmp_path, + concurrent_load=False, + ) + start = time.monotonic() + ptd.load_both_sides() + elapsed = time.monotonic() - start + # Sequential should be >= sum of both sleeps. Allow tiny scheduling slack. 
+ assert elapsed >= 0.45, f"loads ran concurrently despite opt-out (elapsed={elapsed:.2f}s)" + + +def test_load_both_sides_surfaces_source_failure(tmp_path: Path) -> None: + src = _adapter_with_slow_load("src", 0.05, fail=True) + dst = _adapter_with_slow_load("dst", 0.05) + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=[], + run_dir=tmp_path, + ) + with pytest.raises(ValueError, match="src load failed"): + ptd.load_both_sides() + + +def test_load_both_sides_surfaces_destination_failure(tmp_path: Path) -> None: + src = _adapter_with_slow_load("src", 0.05) + dst = _adapter_with_slow_load("dst", 0.05, fail=True) + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=[], + run_dir=tmp_path, + ) + with pytest.raises(ValueError, match="dst load failed"): + ptd.load_both_sides() diff --git a/tests/cache/test_cursors.py b/tests/cache/test_cursors.py new file mode 100644 index 0000000..53900a6 --- /dev/null +++ b/tests/cache/test_cursors.py @@ -0,0 +1,30 @@ +"""CursorTier + CursorState tests.""" + +from __future__ import annotations + +import pytest + +from infrahub_sync.cache.cursors import CursorState, CursorTier + + +def test_cursor_tier_ordering() -> None: + """Higher tiers are strictly more capable.""" + assert CursorTier.NONE < CursorTier.PAGE_TOKEN + assert CursorTier.PAGE_TOKEN < CursorTier.TIMESTAMP + assert CursorTier.TIMESTAMP < CursorTier.INFRAHUB_DIFF + + +def test_cursor_state_constructs() -> None: + cs = CursorState(tier=CursorTier.TIMESTAMP, value="2026-05-12T15:30:00Z") + assert cs.tier is CursorTier.TIMESTAMP + assert cs.value == "2026-05-12T15:30:00Z" + + +def test_cursor_state_none_default() -> None: + cs = CursorState(tier=CursorTier.NONE) + assert cs.value is None + + +def test_cursor_state_value_required_for_non_none() -> None: + with pytest.raises(ValueError): + CursorState(tier=CursorTier.TIMESTAMP, value=None) diff --git 
a/tests/cache/test_guardrails.py b/tests/cache/test_guardrails.py new file mode 100644 index 0000000..08b0ea3 --- /dev/null +++ b/tests/cache/test_guardrails.py @@ -0,0 +1,34 @@ +"""RowcountGuardrail: refuses to proceed when a resource's row count collapses +since the last successful run.""" + +from __future__ import annotations + +import pytest + +from infrahub_sync.cache.guardrails import ( + RowcountGuardrail, + RowcountGuardrailError, +) + + +def test_first_run_with_no_prior_baseline_allowed() -> None: + g = RowcountGuardrail(previous={}, drop_threshold=0.5) + g.check("BuiltinTag", current=10) + + +def test_no_drop_allowed() -> None: + g = RowcountGuardrail(previous={"BuiltinTag": 100}, drop_threshold=0.5) + g.check("BuiltinTag", current=100) + g.check("BuiltinTag", current=200) + g.check("BuiltinTag", current=51) # just above the 50% threshold + + +def test_drop_over_threshold_raises() -> None: + g = RowcountGuardrail(previous={"BuiltinTag": 100}, drop_threshold=0.5) + with pytest.raises(RowcountGuardrailError, match="dropped from 100 to 49"): + g.check("BuiltinTag", current=49) + + +def test_allow_override_skips_check() -> None: + g = RowcountGuardrail(previous={"BuiltinTag": 100}, drop_threshold=0.5, allow_drop=True) + g.check("BuiltinTag", current=0) # no raise diff --git a/tests/cache/test_incremental_engine.py b/tests/cache/test_incremental_engine.py new file mode 100644 index 0000000..214b61a --- /dev/null +++ b/tests/cache/test_incremental_engine.py @@ -0,0 +1,133 @@ +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, ClassVar + +from diffsync import Adapter, DiffSyncModel + +from infrahub_sync.cache.cursors import CursorState, CursorTier +from infrahub_sync.potenda import Potenda + + +class _Device(DiffSyncModel): + _modelname: ClassVar[str] = "InfraDevice" + _identifiers: ClassVar[tuple[str, ...]] = ("name",) + _attributes: ClassVar[tuple[str, ...]] = ("description",) + + name: str + description: str | None = 
None + + +class _StubAdapter(Adapter): + InfraDevice = _Device + top_level: ClassVar[list[str]] = ["InfraDevice"] + type = "Stub" + + def __init__(self, *, name: str, deltas: list[dict] | None = None): + super().__init__(name=name) + self.calls: list[tuple[str, object]] = [] + self.deltas = deltas or [] + + def model_loader(self, model_name: str, _model: Any) -> None: # noqa: ANN401 + self.calls.append(("model_loader", model_name)) + + def load(self) -> None: + # Default load — record a "full_load" call and add nothing. + self.calls.append(("full_load", None)) + + def cursor_tier_for(self, _model_name: str) -> CursorTier: # noqa: PLR6301 + return CursorTier.TIMESTAMP + + def list_changed_since(self, _model_name: str, cursor: CursorState) -> list[dict]: + self.calls.append(("delta", cursor)) + return list(self.deltas) + + +def _make_potenda(tmp_path: Path): + src = _StubAdapter(name="src") + dst = _StubAdapter(name="dst") + from types import SimpleNamespace + + config = SimpleNamespace(diffsync_flags=[], incremental=None, name="test-sync") + pot = Potenda( + source=src, + destination=dst, + config=config, # ty: ignore[invalid-argument-type] + top_level=["InfraDevice"], + show_progress=False, + concurrent_load=False, + ) + pot.run_dir = tmp_path / "run-current" + pot.run_dir.mkdir(parents=True) + pot.cache_root = tmp_path + return pot, src, dst + + +def test_falls_back_to_full_load_when_no_prior_run(tmp_path: Path) -> None: + pot, src, _ = _make_potenda(tmp_path) + pot._schema_subhash = "abc" + + pot.load_one_side(side="A", adapter=src) + + assert ("full_load", None) in src.calls + assert all(call[0] != "delta" for call in src.calls) + + +def test_uses_incremental_when_prior_run_matches(tmp_path: Path) -> None: + import json + + from infrahub_sync.cache.parquet_io import write_resource_side + + prev_run = tmp_path / "2026-05-17T10-00-00Z" + prev_run.mkdir(parents=True) + (prev_run / "run.json").write_text(json.dumps({"status": "applied"})) + (prev_run / 
"schema-sub-hash.txt").write_text("HASHFIXED") + (prev_run / "cursors.json").write_text(json.dumps({"A": {"InfraDevice": "TIMESTAMP:2026-05-17T10:00:00Z"}})) + write_resource_side( + run_dir=prev_run, + side="A", + resource="InfraDevice", + rows=[{"name": "leaf-existing", "description": "old"}], + source_ids=["leaf-existing"], + extract_ts=datetime(2026, 5, 17, 10, tzinfo=timezone.utc), + ) + + pot, src, _ = _make_potenda(tmp_path) + src.deltas = [{"name": "leaf-new", "description": "new"}] + pot.cache_root = tmp_path + pot._schema_subhash = "HASHFIXED" + + pot.load_one_side(side="A", adapter=src) + + assert all(call[0] != "full_load" for call in src.calls) + assert any(call[0] == "delta" for call in src.calls) + names = {d.name for d in src.get_all("InfraDevice")} + assert names == {"leaf-existing", "leaf-new"} + + +def test_cursor_persisted_after_load(tmp_path: Path) -> None: + from infrahub_sync.cache.incremental import load_cursors + from infrahub_sync.cache.parquet_io import write_resource_side + + pot, src, _ = _make_potenda(tmp_path) + pot._schema_subhash = "abc" + + # First run: full extract (no prior run), then snapshot is written + + # cursor persisted. We simulate the snapshot directly because the + # stub adapter doesn't actually populate the store. 
+ pot.load_one_side(side="A", adapter=src) + write_resource_side( + run_dir=pot.run_dir, + side="A", + resource="InfraDevice", + rows=[{"name": "leaf1", "description": "x"}], + source_ids=["leaf1"], + extract_ts=datetime(2026, 5, 18, 11, tzinfo=timezone.utc), + ) + pot.persist_cursors_for_run(side="A") + + cursors_path = pot.run_dir / "cursors.json" + assert cursors_path.exists() + + loaded = load_cursors(cursors_path, side="A") + assert loaded["InfraDevice"].tier is CursorTier.TIMESTAMP + assert loaded["InfraDevice"].value == "2026-05-18T11:00:00+00:00" diff --git a/tests/cache/test_incremental_helpers.py b/tests/cache/test_incremental_helpers.py new file mode 100644 index 0000000..1be63e4 --- /dev/null +++ b/tests/cache/test_incremental_helpers.py @@ -0,0 +1,193 @@ +from __future__ import annotations + +import json +from pathlib import Path + +from infrahub_sync.cache.incremental import previous_successful_run_dir, should_use_incremental + + +def _write_run(cache_root: Path, run_id: str, status: str) -> Path: + run_dir = cache_root / run_id + run_dir.mkdir(parents=True) + (run_dir / "run.json").write_text(json.dumps({"status": status})) + return run_dir + + +def test_returns_most_recent_applied_run(tmp_path: Path) -> None: + _write_run(tmp_path, "2026-05-17T10-00-00Z", "applied") + latest = _write_run(tmp_path, "2026-05-18T11-00-00Z", "applied") + _write_run(tmp_path, "2026-05-18T12-00-00Z", "failed") + + assert previous_successful_run_dir(tmp_path) == latest + + +def test_returns_none_when_no_successful_runs(tmp_path: Path) -> None: + _write_run(tmp_path, "2026-05-18T11-00-00Z", "failed") + + assert previous_successful_run_dir(tmp_path) is None + + +def test_returns_none_on_empty_cache(tmp_path: Path) -> None: + assert previous_successful_run_dir(tmp_path) is None + + +def test_skip_when_force_full(tmp_path: Path) -> None: + prev_run = tmp_path / "prev" + prev_run.mkdir() + (prev_run / "schema-sub-hash.txt").write_text("abc123") + + decision = 
should_use_incremental( + prev_run_dir=prev_run, + current_subhash="abc123", + force_full=True, + ) + assert decision is False + + +def test_skip_when_no_prev_run() -> None: + decision = should_use_incremental( + prev_run_dir=None, + current_subhash="abc123", + force_full=False, + ) + assert decision is False + + +def test_skip_when_subhash_mismatch(tmp_path: Path) -> None: + prev_run = tmp_path / "prev" + prev_run.mkdir() + (prev_run / "schema-sub-hash.txt").write_text("OLD000") + + decision = should_use_incremental( + prev_run_dir=prev_run, + current_subhash="NEW111", + force_full=False, + ) + assert decision is False + + +def test_use_incremental_when_subhash_matches(tmp_path: Path) -> None: + prev_run = tmp_path / "prev" + prev_run.mkdir() + (prev_run / "schema-sub-hash.txt").write_text("abc123") + + decision = should_use_incremental( + prev_run_dir=prev_run, + current_subhash="abc123", + force_full=False, + ) + assert decision is True + + +# --------------------------------------------------------------------------- +# Task 4: load_cursors / persist_cursors +# --------------------------------------------------------------------------- +from infrahub_sync.cache.cursors import CursorState, CursorTier # noqa: E402 +from infrahub_sync.cache.incremental import load_cursors, persist_cursors # noqa: E402 + + +def test_load_cursors_empty(tmp_path: Path) -> None: + cursors = load_cursors(tmp_path / "cursors.json", side="A") + assert cursors == {} + + +def test_persist_then_load_roundtrip(tmp_path: Path) -> None: + path = tmp_path / "cursors.json" + persist_cursors( + path, + side="A", + cursors={ + "InfraDevice": CursorState( + tier=CursorTier.TIMESTAMP, + value="2026-05-18T11:00:00Z", + ) + }, + ) + persist_cursors( + path, + side="B", + cursors={ + "InfraDevice": CursorState( + tier=CursorTier.INFRAHUB_DIFF, + value="2026-05-18T11:05:00Z", + ) + }, + ) + + loaded_a = load_cursors(path, side="A") + loaded_b = load_cursors(path, side="B") + + assert 
loaded_a["InfraDevice"].tier is CursorTier.TIMESTAMP + assert loaded_a["InfraDevice"].value == "2026-05-18T11:00:00Z" + assert loaded_b["InfraDevice"].tier is CursorTier.INFRAHUB_DIFF + assert loaded_b["InfraDevice"].value == "2026-05-18T11:05:00Z" + + +# --------------------------------------------------------------------------- +# Task 5: hydrate_from_parquet +# --------------------------------------------------------------------------- +from datetime import datetime, timezone # noqa: E402 + +from infrahub_sync.cache.incremental import hydrate_from_parquet # noqa: E402 +from infrahub_sync.cache.parquet_io import write_resource_side # noqa: E402 + + +def test_hydrate_replays_prior_rows(tmp_path: Path) -> None: + run_dir = tmp_path + extract_ts = datetime(2026, 5, 18, 11, tzinfo=timezone.utc) + write_resource_side( + run_dir=run_dir, + side="A", + resource="InfraDevice", + rows=[{"name": "leaf1"}, {"name": "leaf2"}], + source_ids=["leaf1", "leaf2"], + extract_ts=extract_ts, + ) + + captured: list[dict] = [] + + def add_row(model_name: str, payload: dict) -> None: + captured.append({"model": model_name, **payload}) + + rows_loaded, max_ts = hydrate_from_parquet( + run_dir=run_dir, + side="A", + resource="InfraDevice", + add_row=add_row, + ) + + assert rows_loaded == 2 + assert max_ts == extract_ts + assert {c["name"] for c in captured} == {"leaf1", "leaf2"} + + +def test_hydrate_missing_resource_returns_zero(tmp_path: Path) -> None: + rows_loaded, max_ts = hydrate_from_parquet( + run_dir=tmp_path, + side="A", + resource="InfraDevice", + add_row=lambda _model, _payload: None, + ) + + assert rows_loaded == 0 + assert max_ts is None + + +# --------------------------------------------------------------------------- +# Task 11: cadence / run-counter +# --------------------------------------------------------------------------- + + +def test_force_full_when_cadence_exceeded(tmp_path: Path) -> None: + prev_run = tmp_path / "prev" + prev_run.mkdir() + (prev_run / 
"schema-sub-hash.txt").write_text("abc123") + + decision = should_use_incremental( + prev_run_dir=prev_run, + current_subhash="abc123", + force_full=False, + runs_since_full=10, + cadence=10, + ) + assert decision is False diff --git a/tests/cache/test_locks.py b/tests/cache/test_locks.py new file mode 100644 index 0000000..3d1763e --- /dev/null +++ b/tests/cache/test_locks.py @@ -0,0 +1,58 @@ +"""Cross-process pipeline filelock.""" + +from __future__ import annotations + +import multiprocessing +from pathlib import Path +from typing import TYPE_CHECKING + +import pytest + +from infrahub_sync.cache.locks import pipeline_lock + +if TYPE_CHECKING: + from multiprocessing.synchronize import Event as EventT + + +def _hold_lock(sync_name: str, cache_dir: str, acquired: EventT, release: EventT) -> None: + """Subprocess target: take the lock, signal acquisition, then hold until told to release.""" + import os + + os.environ["INFRAHUB_SYNC_CACHE_DIR"] = cache_dir + with pipeline_lock(sync_name): + acquired.set() + # Bounded hold so the child never lingers if the parent dies mid-test. + release.wait(timeout=10.0) + + +def test_pipeline_lock_excludes_concurrent_run(tmp_path: Path) -> None: + cache_dir = str(tmp_path) + ctx = multiprocessing.get_context("spawn") + acquired = ctx.Event() + release = ctx.Event() + p = ctx.Process(target=_hold_lock, args=("p1", cache_dir, acquired, release)) + p.start() + try: + # Wait for the child to actually hold the lock instead of guessing with a sleep. 
+ assert acquired.wait(timeout=10.0), "child process never acquired the lock" + import os + + os.environ["INFRAHUB_SYNC_CACHE_DIR"] = cache_dir + from filelock import Timeout + + with pytest.raises(Timeout), pipeline_lock("p1", timeout=0.05): + pass + finally: + release.set() + p.join(timeout=10.0) + if p.is_alive(): + p.terminate() + p.join() + + +def test_pipeline_lock_allows_different_pipelines(tmp_path: Path) -> None: + import os + + os.environ["INFRAHUB_SYNC_CACHE_DIR"] = str(tmp_path) + with pipeline_lock("p1"), pipeline_lock("p2"): + pass diff --git a/tests/cache/test_parquet_io.py b/tests/cache/test_parquet_io.py new file mode 100644 index 0000000..449029f --- /dev/null +++ b/tests/cache/test_parquet_io.py @@ -0,0 +1,83 @@ +"""parquet_io tests — atomic writes, schema enforcement, plan roundtrip.""" + +from __future__ import annotations + +from pathlib import Path + +import pyarrow as pa + +from infrahub_sync.cache.parquet_io import ( + PLAN_SCHEMA, + read_table, + write_plan, + write_table, +) + + +def test_write_table_atomically_no_tmp_left(tmp_path: Path) -> None: + table = pa.table({"x": [1, 2, 3]}) + target = tmp_path / "out.parquet" + write_table(str(target), table) + assert target.exists() + assert not list(tmp_path.glob("*.tmp")) + read = read_table(str(target)) + assert read.column("x").to_pylist() == [1, 2, 3] + + +def test_write_table_overwrites_existing(tmp_path: Path) -> None: + target = tmp_path / "out.parquet" + write_table(str(target), pa.table({"x": [1]})) + write_table(str(target), pa.table({"x": [42]})) + assert read_table(str(target)).column("x").to_pylist() == [42] + + +def test_plan_roundtrip(tmp_path: Path) -> None: + rows = [ + { + "action": "create", + "resource": "BuiltinTag", + "source_id": "tag-1", + "dest_id": "", + "attribute": "", + "old_value": "", + "new_value": '{"name":"prod"}', + "owner": "", + "skip_reason": "", + "conflict_class": "", + } + ] + write_plan(run_dir=tmp_path, rows=rows) + table = read_table(str(tmp_path / 
"plan.parquet")) + assert table.schema == PLAN_SCHEMA + assert table.column("action").to_pylist() == ["create"] + + +def test_write_resource_side_injects_metadata_columns(tmp_path: Path) -> None: + from datetime import datetime, timezone + + from infrahub_sync.cache.parquet_io import ( + read_table, + write_resource_side, + ) + + rows = [ + {"name": "prod", "description": None}, + {"name": "dev", "description": "dev tag"}, + ] + extract_ts = datetime(2026, 5, 12, 15, 30, tzinfo=timezone.utc) + write_resource_side( + run_dir=tmp_path, + side="A", + resource="BuiltinTag", + rows=rows, # ty: ignore[invalid-argument-type] + source_ids=["tag-1", "tag-2"], + extract_ts=extract_ts, + ) + table = read_table(str(tmp_path / "A" / "BuiltinTag.parquet")) + assert table.column("_source_id").to_pylist() == ["tag-1", "tag-2"] + assert table.column("_tombstone").to_pylist() == [False, False] + # _extract_ts uses ns-precision UTC. + ts_col = table.column("_extract_ts").to_pylist() + assert all(ts == extract_ts for ts in ts_col) + # Caller-supplied columns preserved. 
+ assert table.column("name").to_pylist() == ["prod", "dev"] diff --git a/tests/cache/test_paths.py b/tests/cache/test_paths.py new file mode 100644 index 0000000..b34f6ff --- /dev/null +++ b/tests/cache/test_paths.py @@ -0,0 +1,36 @@ +"""Unit tests for infrahub_sync.cache.paths.""" + +from __future__ import annotations + +import re + +from infrahub_sync.cache.paths import ( + cache_root_for, + generate_run_id, + run_dir, +) + + +def test_cache_root_defaults_to_cwd_dot_cache(tmp_path, monkeypatch) -> None: + monkeypatch.chdir(tmp_path) + monkeypatch.delenv("INFRAHUB_SYNC_CACHE_DIR", raising=False) + root = cache_root_for("from-netbox") + assert root == tmp_path / ".infrahub-sync-cache" / "from-netbox" + + +def test_cache_root_honors_env(tmp_path, monkeypatch) -> None: + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path / "custom")) + root = cache_root_for("from-netbox") + assert root == tmp_path / "custom" / "from-netbox" + + +def test_generate_run_id_format() -> None: + rid = generate_run_id() + # ISO-ish: 20260512T1530-<8 hex> + assert re.fullmatch(r"\d{8}T\d{4}-[0-9a-f]{8}", rid), rid + + +def test_run_dir_concatenates(tmp_path, monkeypatch) -> None: + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + rd = run_dir("from-netbox", "20260512T1530-abc12345") + assert rd == tmp_path / "from-netbox" / "20260512T1530-abc12345" diff --git a/tests/cache/test_plan_serialization.py b/tests/cache/test_plan_serialization.py new file mode 100644 index 0000000..9d276a9 --- /dev/null +++ b/tests/cache/test_plan_serialization.py @@ -0,0 +1,75 @@ +"""Potenda.write_plan turns a diffsync Diff into plan.parquet.""" + +from __future__ import annotations + +from pathlib import Path +from types import SimpleNamespace + +from infrahub_sync.cache.parquet_io import read_table +from infrahub_sync.potenda import Potenda + + +class _FakeElement: + """Stand-in for diffsync's DiffElement.""" + + def __init__( + self, + action: str, + source_id: str, + old_attrs: dict | 
None = None, + new_attrs: dict | None = None, + ): + self.action = action + self.name = source_id + self._old = old_attrs or {} + self._new = new_attrs or {} + + def get_attrs_diffs(self) -> dict: + result: dict = {} + if self._old: + result["-"] = self._old + if self._new: + result["+"] = self._new + return result + + +class _FakeDiff: + """Mirrors diffsync.Diff's children mapping shape.""" + + def __init__(self, changes_per_resource: dict[str, list[_FakeElement]]): + self.children = { + resource: {element.name: element for element in elements} + for resource, elements in changes_per_resource.items() + } + + def has_diffs(self) -> bool: + return any(self.children.values()) + + +def test_write_plan_writes_one_row_per_change(tmp_path: Path) -> None: + ptd = Potenda( + source=SimpleNamespace(top_level=[]), # ty: ignore[invalid-argument-type] + destination=SimpleNamespace(top_level=[]), # ty: ignore[invalid-argument-type] + config=None, # ty: ignore[invalid-argument-type] + top_level=[], + run_dir=tmp_path, + ) + diff = _FakeDiff( + { + "BuiltinTag": [ + _FakeElement("create", "prod", new_attrs={"name": "prod"}), + _FakeElement( + "update", + "dev", + old_attrs={"description": "old"}, + new_attrs={"description": "d"}, + ), + ], + "DcimDevice": [], + } + ) + ptd.write_plan(diff) + table = read_table(str(tmp_path / "plan.parquet")) + actions = table.column("action").to_pylist() + resources = table.column("resource").to_pylist() + assert sorted(zip(actions, resources)) == [("create", "BuiltinTag"), ("update", "BuiltinTag")] diff --git a/tests/cache/test_potenda_cache_hook.py b/tests/cache/test_potenda_cache_hook.py new file mode 100644 index 0000000..9231479 --- /dev/null +++ b/tests/cache/test_potenda_cache_hook.py @@ -0,0 +1,50 @@ +"""Potenda writes per-resource Parquet snapshots when run_dir is set.""" + +from __future__ import annotations + +from pathlib import Path +from types import SimpleNamespace +from unittest.mock import MagicMock + +from 
infrahub_sync.cache.parquet_io import read_table +from infrahub_sync.potenda import Potenda + + +class _FakeRecord(SimpleNamespace): + """Stand-in for a DiffSyncModel instance.""" + + _identifiers: tuple[str, ...] = ("name",) + + def get_attrs(self) -> dict: + return {"name": self.name, "description": getattr(self, "description", None)} + + def get_unique_id(self) -> str: + return self.name + + +def _make_fake_adapter(records_by_kind: dict[str, list[_FakeRecord]]) -> MagicMock: + """diffsync-like adapter that yields records via `get_all`.""" + adapter = MagicMock() + adapter.top_level = list(records_by_kind.keys()) + adapter.get_all.side_effect = lambda kind: records_by_kind.get(kind, []) + adapter.load = MagicMock() + return adapter + + +def test_potenda_writes_resource_parquet(tmp_path: Path) -> None: + src = _make_fake_adapter({"BuiltinTag": [_FakeRecord(name="prod", description="x"), _FakeRecord(name="dev")]}) + dst = _make_fake_adapter({"BuiltinTag": [_FakeRecord(name="prod", description="x")]}) + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=["BuiltinTag"], + run_dir=tmp_path, + ) + ptd.source_load() + ptd.destination_load() + + a_table = read_table(str(tmp_path / "A" / "BuiltinTag.parquet")) + b_table = read_table(str(tmp_path / "B" / "BuiltinTag.parquet")) + assert sorted(a_table.column("_source_id").to_pylist()) == ["dev", "prod"] + assert b_table.column("_source_id").to_pylist() == ["prod"] diff --git a/tests/cache/test_run_counter.py b/tests/cache/test_run_counter.py new file mode 100644 index 0000000..555adbd --- /dev/null +++ b/tests/cache/test_run_counter.py @@ -0,0 +1,16 @@ +from pathlib import Path + +from infrahub_sync.cache.sidecars import RunCounterFile + + +def test_counter_starts_at_zero(tmp_path: Path) -> None: + counter = RunCounterFile.load_or_default(tmp_path / "run-counter.json") + assert counter.runs_since_full == 0 + + +def test_increment_and_save_roundtrip(tmp_path: Path) 
-> None: + path = tmp_path / "run-counter.json" + counter = RunCounterFile.load_or_default(path) + counter.runs_since_full = 4 + counter.save() + assert RunCounterFile.load_or_default(path).runs_since_full == 4 diff --git a/tests/cache/test_schema_subhash_persist.py b/tests/cache/test_schema_subhash_persist.py new file mode 100644 index 0000000..182377e --- /dev/null +++ b/tests/cache/test_schema_subhash_persist.py @@ -0,0 +1,40 @@ +"""When a schema is available, get_potenda_from_instance persists the +sub-hash so `apply` can later reject mismatched runs.""" + +from __future__ import annotations + +from pathlib import Path +from types import SimpleNamespace +from unittest.mock import MagicMock, patch + +from infrahub_sync import ( + SchemaMappingField, + SchemaMappingModel, + SyncAdapter, + SyncInstance, +) +from infrahub_sync.cache.sidecars import SchemaHashFile +from infrahub_sync.utils import get_potenda_from_instance + + +def test_schema_subhash_persisted_when_cached_schema_present(tmp_path: Path, monkeypatch) -> None: + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + + inst = SyncInstance( + name="hashtest", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + directory=str(tmp_path), + schema_mapping=[SchemaMappingModel(name="Tag", fields=[SchemaMappingField(name="name")])], + ) + inst._cached_schema = { # ty: ignore[unresolved-attribute] + "Tag": SimpleNamespace(kind="Tag", attributes=[], relationships=[]), + } + + with patch("infrahub_sync.utils.import_adapter") as fake_import: + fake_import.return_value = MagicMock() + ptd = get_potenda_from_instance(sync_instance=inst) + + hash_path = ptd.run_dir / "schema-sub-hash.txt" # ty: ignore[unsupported-operator] + assert hash_path.exists() + assert len(SchemaHashFile.load(hash_path).value) == 12 diff --git a/tests/cache/test_sidecars.py b/tests/cache/test_sidecars.py new file mode 100644 index 0000000..f27999e --- /dev/null +++ b/tests/cache/test_sidecars.py @@ -0,0 
+1,78 @@ +"""Sidecar load/save tests for JSON metadata files.""" + +from __future__ import annotations + +from pathlib import Path + +from infrahub_sync.cache.sidecars import ( + CursorsFile, + RowcountsFile, + RunFile, + SchemaHashFile, +) + + +def test_cursors_file_load_default_missing(tmp_path: Path) -> None: + f = CursorsFile.load_or_default(tmp_path / "cursors.json") + assert f.cursors == {"A": {}, "B": {}} + + +def test_cursors_file_roundtrip(tmp_path: Path) -> None: + f = CursorsFile.load_or_default(tmp_path / "cursors.json") + f.cursors = {"A": {"BuiltinTag": "2026-05-12T15:30:00Z"}, "B": {}} + f.save() + g = CursorsFile.load_or_default(tmp_path / "cursors.json") + assert g.cursors == f.cursors + + +def test_rowcounts_file_set_and_get(tmp_path: Path) -> None: + f = RowcountsFile.load_or_default(tmp_path / "last-successful-rowcounts.json") + f.set("BuiltinTag", 4000) + f.save() + g = RowcountsFile.load_or_default(tmp_path / "last-successful-rowcounts.json") + assert g.get("BuiltinTag") == 4000 + assert g.get("MissingResource") is None + + +def test_run_file_records_status(tmp_path: Path) -> None: + f = RunFile.load_or_default(tmp_path / "run.json") + f.status = "dry-run" + f.mode = "diff" + f.summary = {"resources": 17, "diff_rows": 42} + f.save() + g = RunFile.load_or_default(tmp_path / "run.json") + assert g.status == "dry-run" + assert g.summary["resources"] == 17 + + +def test_schema_hash_file_text_roundtrip(tmp_path: Path) -> None: + f = SchemaHashFile(path=tmp_path / "schema-sub-hash.txt", value="abc123") + f.save() + assert SchemaHashFile.load(tmp_path / "schema-sub-hash.txt").value == "abc123" + + +def test_schema_subhash_stable_across_runs() -> None: + """Same config + same schema => same hash. 
Different config => different hash.""" + from types import SimpleNamespace + + from infrahub_sync import SchemaMappingField, SchemaMappingModel, SyncAdapter, SyncConfig + from infrahub_sync.cache import compute_schema_subhash + + cfg_a = SyncConfig( + name="t", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + schema_mapping=[SchemaMappingModel(name="Tag", fields=[SchemaMappingField(name="name")])], + ) + schema_a = {"Tag": SimpleNamespace(kind="Tag", attributes=[], relationships=[])} + h1 = compute_schema_subhash(cfg_a, schema_a) + h2 = compute_schema_subhash(cfg_a, schema_a) + assert h1 == h2 + + cfg_b = SyncConfig( + name="t", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + schema_mapping=[SchemaMappingModel(name="Tag", fields=[SchemaMappingField(name="slug")])], + ) + assert compute_schema_subhash(cfg_b, schema_a) != h1 diff --git a/tests/cache/test_sync_cache_flow.py b/tests/cache/test_sync_cache_flow.py new file mode 100644 index 0000000..796f917 --- /dev/null +++ b/tests/cache/test_sync_cache_flow.py @@ -0,0 +1,106 @@ +"""Sync (serial and parallel) writes the same cache artifacts as diff: +pipeline_lock, run.json, plan.parquet, last-successful-rowcounts.json.""" + +from __future__ import annotations + +from pathlib import Path +from types import SimpleNamespace +from unittest.mock import MagicMock + +from infrahub_sync.cache.parquet_io import read_table +from infrahub_sync.potenda import Potenda + + +class _FakeRecord(SimpleNamespace): + _identifiers: tuple[str, ...] 
= ("name",) + + def get_attrs(self) -> dict: + return {"name": self.name} + + def get_unique_id(self) -> str: + return self.name + + +def _make_fake_adapter(records_by_kind: dict[str, list[_FakeRecord]]) -> MagicMock: + adapter = MagicMock() + adapter.top_level = list(records_by_kind.keys()) + adapter.get_all.side_effect = lambda kind: records_by_kind.get(kind, []) + adapter.load = MagicMock() + return adapter + + +class _Child: + def __init__(self, action: str, name: str) -> None: + self.action = action + self.name = name + + def get_attrs_diffs(self) -> dict: # noqa: PLR6301 + return {"+": {"name": "x"}} + + +class _Diff: + def __init__(self, changes: dict[str, list[_Child]]) -> None: # ty: ignore[invalid-type-form] + self.children = {resource: {child.name: child for child in elements} for resource, elements in changes.items()} + + def has_diffs(self) -> bool: + return any(self.children.values()) + + def str(self) -> str: # noqa: PLR6301 # ty: ignore[invalid-type-form] + return "" + + +def test_sync_in_tiers_aggregates_plan_rows_across_tiers(tmp_path: Path) -> None: + """One plan.parquet at run_dir with rows from EVERY tier.""" + src = _make_fake_adapter({"Tag": [_FakeRecord(name="prod")], "Device": [_FakeRecord(name="d1")]}) + dst = _make_fake_adapter({"Tag": [], "Device": []}) + + # Two tiers, two distinct fake diffs. + diffs_per_call = iter( + [ + _Diff({"Tag": [_Child("create", "prod")]}), + _Diff({"Device": [_Child("create", "d1")]}), + ] + ) + + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=["Tag", "Device"], + tiers=[{"Tag"}, {"Device"}], + run_dir=tmp_path, + ) + # Patch out diff/sync to use the fake diff iterator.
+ ptd.diff = lambda: next(diffs_per_call) # ty: ignore[invalid-assignment] + ptd.sync = MagicMock() # ty: ignore[invalid-assignment] + ptd.persist_baseline_counts = MagicMock() # ty: ignore[invalid-assignment] + + ptd.sync_in_tiers(parallel=True) + + plan = read_table(str(tmp_path / "plan.parquet")) + assert sorted(plan.column("resource").to_pylist()) == ["Device", "Tag"] + assert sorted(plan.column("action").to_pylist()) == ["create", "create"] + ptd.persist_baseline_counts.assert_called_once() # ty: ignore[unresolved-attribute] + + +def test_sync_in_tiers_runs_rowcount_guardrail(tmp_path: Path) -> None: + src = _make_fake_adapter({"Tag": [_FakeRecord(name="prod")]}) + dst = _make_fake_adapter({"Tag": []}) + + ptd = Potenda( + source=src, + destination=dst, + config=None, # ty: ignore[invalid-argument-type] + top_level=["Tag"], + tiers=[{"Tag"}], + run_dir=tmp_path, + ) + ptd.diff = lambda: _Diff({"Tag": []}) # ty: ignore[invalid-assignment] + ptd.sync = MagicMock() # ty: ignore[invalid-assignment] + ptd.check_rowcount_guardrail = MagicMock() # ty: ignore[invalid-assignment] + ptd.persist_baseline_counts = MagicMock() # ty: ignore[invalid-assignment] + + ptd.sync_in_tiers(parallel=True, allow_rowcount_drop=True) + + ptd.check_rowcount_guardrail.assert_called_once_with(allow_drop=True) # ty: ignore[unresolved-attribute] + ptd.persist_baseline_counts.assert_called_once() # ty: ignore[unresolved-attribute] diff --git a/tests/test_cli_full_extract.py b/tests/test_cli_full_extract.py new file mode 100644 index 0000000..bf9e52c --- /dev/null +++ b/tests/test_cli_full_extract.py @@ -0,0 +1,78 @@ +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING +from unittest.mock import MagicMock, patch + +from typer.testing import CliRunner + +from infrahub_sync.cli import app + +if TYPE_CHECKING: + import pytest + +EXAMPLES_DIR = Path(__file__).resolve().parent.parent / "examples" +runner = CliRunner() + + +def 
test_full_extract_flag_accepted() -> None: + result = runner.invoke(app, ["sync", "--help"]) + assert result.exit_code == 0 + assert "--full-extract" in result.output + + +def test_full_extract_flag_diff() -> None: + result = runner.invoke(app, ["diff", "--help"]) + assert result.exit_code == 0 + assert "--full-extract" in result.output + + +def _make_fake_potenda(run_dir: Path) -> MagicMock: + ptd = MagicMock() + ptd.tiers = None + ptd.run_id = "test-run" + ptd.run_dir = run_dir + ptd.top_level = ["BuiltinTag"] + fake_diff = MagicMock() + fake_diff.has_diffs.return_value = False + fake_diff.str.return_value = "" + ptd.diff.return_value = fake_diff + return ptd + + +def test_full_extract_is_the_default_on_sync(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + """Locks the new default: bare `sync` sets `ptd.force_full_extract = True`.""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(run_dir) + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--no-parallel", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + assert fake_ptd.force_full_extract is True + + +def test_no_full_extract_engages_incremental(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + """`--no-full-extract` sets `ptd.force_full_extract = False` so the cursor path is enabled.""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(run_dir) + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + [ + "sync", + "--no-parallel", + "--no-full-extract", + "--name", + "from-netbox", + "--directory", + 
str(EXAMPLES_DIR), + ], + ) + assert result.exit_code == 0, result.output + assert fake_ptd.force_full_extract is False diff --git a/tests/test_cli_parallel.py b/tests/test_cli_parallel.py new file mode 100644 index 0000000..1587f92 --- /dev/null +++ b/tests/test_cli_parallel.py @@ -0,0 +1,109 @@ +"""End-to-end CLI tests for `infrahub-sync sync --parallel`.""" + +from __future__ import annotations + +import logging +from pathlib import Path +from typing import TYPE_CHECKING +from unittest.mock import MagicMock, patch + +from typer.testing import CliRunner + +from infrahub_sync.cli import app + +if TYPE_CHECKING: + import pytest + +EXAMPLES_DIR = Path(__file__).resolve().parent.parent / "examples" + + +def _make_fake_potenda(tiers: list | None, run_dir: Path) -> MagicMock: + """Build a MagicMock Potenda with the attrs sync_cmd touches.""" + ptd = MagicMock() + ptd.tiers = tiers + ptd.run_id = "test-run" + ptd.run_dir = run_dir + ptd.top_level = ["BuiltinTag"] + # diff() result must expose has_diffs() -> False so serial path exits cleanly. 
+ fake_diff = MagicMock() + fake_diff.has_diffs.return_value = False + fake_diff.str.return_value = "" + ptd.diff.return_value = fake_diff + return ptd + + +def test_parallel_flag_invokes_sync_in_tiers(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + """--parallel with auto-tiers delegates to sync_in_tiers(parallel=True).""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(tiers=[{"BuiltinTag"}, {"RoleGeneric"}], run_dir=run_dir) + runner = CliRunner() + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--parallel", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + fake_ptd.sync_in_tiers.assert_called_once_with(parallel=True, allow_rowcount_drop=False) + # Eager source/destination load is skipped when delegating to sync_in_tiers. 
+ fake_ptd.source_load.assert_not_called() + fake_ptd.destination_load.assert_not_called() + + +def test_parallel_is_the_default(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + """Locks the new default: invoking `sync` without flags still invokes sync_in_tiers.""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(tiers=[{"BuiltinTag"}, {"RoleGeneric"}], run_dir=run_dir) + runner = CliRunner() + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + fake_ptd.sync_in_tiers.assert_called_once_with(parallel=True, allow_rowcount_drop=False) + + +def test_no_parallel_runs_serial(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + """`--no-parallel` opts out of sync_in_tiers and runs the serial load+diff+sync.""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(tiers=[{"BuiltinTag"}], run_dir=run_dir) + runner = CliRunner() + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--no-parallel", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + fake_ptd.sync_in_tiers.assert_not_called() + fake_ptd.load_both_sides.assert_called_once() + fake_ptd.diff.assert_called_once() + + +def test_parallel_flag_warns_when_order_explicit( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch, caplog: pytest.LogCaptureFixture +) -> None: + """--parallel with explicit order: in config warns and falls back to serial.""" + monkeypatch.setenv("INFRAHUB_SYNC_CACHE_DIR", str(tmp_path)) + 
run_dir = tmp_path / "from-netbox" / "test-run" + run_dir.mkdir(parents=True, exist_ok=True) + fake_ptd = _make_fake_potenda(tiers=None, run_dir=run_dir) # operator set order: explicitly + runner = CliRunner() + caplog.set_level(logging.WARNING, logger="infrahub_sync.cli") + with patch("infrahub_sync.cli.get_potenda_from_instance", return_value=fake_ptd): + result = runner.invoke( + app, + ["sync", "--parallel", "--name", "from-netbox", "--directory", str(EXAMPLES_DIR)], + ) + assert result.exit_code == 0, result.output + fake_ptd.sync_in_tiers.assert_not_called() + # Serial path ran: source/destination loaded and diff computed. + fake_ptd.load_both_sides.assert_called_once() + fake_ptd.diff.assert_called_once() + msgs = [r.getMessage() for r in caplog.records] + assert any("--parallel ignored" in m for m in msgs), msgs diff --git a/tests/test_dependency_graph.py b/tests/test_dependency_graph.py new file mode 100644 index 0000000..4ed62a9 --- /dev/null +++ b/tests/test_dependency_graph.py @@ -0,0 +1,111 @@ +"""Tests for infrahub_sync.dependency_graph.""" + +from __future__ import annotations + +from infrahub_sync import SchemaMappingField, SchemaMappingModel +from infrahub_sync.dependency_graph import build_dependency_graph + + +def _sm(name: str, fields: list[tuple[str, str | None]], identifiers: list[str] | None = None) -> SchemaMappingModel: + """Build a SchemaMappingModel from (field_name, reference) tuples.""" + return SchemaMappingModel( + name=name, + identifiers=identifiers, + fields=[SchemaMappingField(name=fn, reference=ref) for fn, ref in fields], + ) + + +def test_build_dependency_graph_simple_chain() -> None: + mapping = [ + _sm("Tag", [("name", None)], identifiers=["name"]), + _sm("Device", [("name", None), ("tag", "Tag")], identifiers=["name"]), + _sm("Interface", [("name", None), ("device", "Device")], identifiers=["name", "device"]), + ] + deps = build_dependency_graph(mapping) + assert deps == {"Tag": set(), "Device": {"Tag"}, "Interface": 
{"Device"}} + + +def test_build_dependency_graph_unions_repeated_names() -> None: + """Same kind name appearing multiple times merges field references.""" + mapping = [ + _sm("RoleGeneric", [("name", None)]), + _sm("RoleGeneric", [("name", None), ("tag", "BuiltinTag")]), + _sm("BuiltinTag", [("name", None)]), + ] + deps = build_dependency_graph(mapping) + assert deps == {"RoleGeneric": {"BuiltinTag"}, "BuiltinTag": set()} + + +def test_compute_tiers_no_cycle() -> None: + from infrahub_sync.dependency_graph import compute_tiers + + mapping = [ + _sm("Tag", [("name", None)], identifiers=["name"]), + _sm("Device", [("name", None), ("tag", "Tag")], identifiers=["name"]), + _sm("Interface", [("name", None), ("device", "Device")], identifiers=["name", "device"]), + ] + tiers, dropped = compute_tiers(mapping) + assert tiers == [{"Tag"}, {"Device"}, {"Interface"}] + assert dropped == [] + + +def test_compute_tiers_drops_optional_cycle_edge() -> None: + from infrahub_sync.dependency_graph import compute_tiers + + # AS -> Device (optional, AS.routing_device not in identifiers) + # Device -> AS (identity-bearing — AS is part of Device.identifiers) + mapping = [ + _sm( + "RoutingAS", + [("asn", None), ("routing_device", "Device")], + identifiers=["asn"], + ), + _sm( + "Device", + [("name", None), ("asn", "RoutingAS")], + identifiers=["name", "asn"], + ), + ] + tiers, dropped = compute_tiers(mapping) + assert dropped == [("RoutingAS", "Device")] + assert tiers == [{"RoutingAS"}, {"Device"}] + + +def test_compute_tiers_raises_on_identity_cycle() -> None: + import pytest + from infrahub_sdk.topological_sort import DependencyCycleExistsError + + from infrahub_sync.dependency_graph import compute_tiers + + mapping = [ + _sm("A", [("b", "B")], identifiers=["b"]), + _sm("B", [("a", "A")], identifiers=["a"]), + ] + with pytest.raises(DependencyCycleExistsError): + compute_tiers(mapping) + + +def test_compute_tiers_for_netbox_example_config() -> None: + """End-to-end against 
examples/netbox_to_infrahub/config.yml.""" + from pathlib import Path + + import yaml + + from infrahub_sync import SyncConfig + from infrahub_sync.dependency_graph import compute_tiers, flatten_tiers + + config_path = Path(__file__).resolve().parent.parent / "examples" / "netbox_to_infrahub" / "config.yml" + with config_path.open() as fh: + data = yaml.safe_load(fh) + + cfg = SyncConfig(**data) + tiers, _dropped = compute_tiers(cfg.schema_mapping) + + # Tier 0 must include leaf-like kinds with no outgoing refs. + assert "BuiltinTag" in tiers[0] + # Every mapped kind must appear somewhere in the computed tiers. + # (cfg.order is empty for examples that opt into auto-tiering, so this + # loop would be a no-op against `cfg.order` and miss regressions.) + flat = set(flatten_tiers(tiers)) + for name in {m.name for m in cfg.schema_mapping}: + assert name in flat, f"{name} missing from computed tiers" diff --git a/tests/test_diffsync_mixin_contract.py b/tests/test_diffsync_mixin_contract.py new file mode 100644 index 0000000..5c06ffd --- /dev/null +++ b/tests/test_diffsync_mixin_contract.py @@ -0,0 +1,26 @@ +"""Tests for DiffSyncMixin adapter contract methods.""" + +import pytest + +from infrahub_sync import DiffSyncMixin +from infrahub_sync.cache.cursors import CursorState, CursorTier + + +class _Stub(DiffSyncMixin): + pass + + +def test_cursor_tier_for_defaults_to_none() -> None: + assert _Stub().cursor_tier_for("Anything") is CursorTier.NONE + + +def test_list_changed_since_raises_when_unimplemented() -> None: + stub = _Stub() + with pytest.raises(NotImplementedError): + list(stub.list_changed_since("Anything", CursorState(tier=CursorTier.NONE))) + + +def test_list_existing_ids_raises_when_unimplemented() -> None: + stub = _Stub() + with pytest.raises(NotImplementedError): + list(stub.list_existing_ids("Anything")) diff --git a/tests/test_get_potenda_top_level.py b/tests/test_get_potenda_top_level.py new file mode 100644 index 0000000..6cd983a --- /dev/null +++ 
b/tests/test_get_potenda_top_level.py @@ -0,0 +1,39 @@ +"""Potenda receives the auto-computed top_level when order: is omitted.""" + +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING +from unittest.mock import MagicMock, patch + +import pytest + +if TYPE_CHECKING: + from infrahub_sync import SyncInstance + +from infrahub_sync.utils import get_instance, get_potenda_from_instance + +EXAMPLES_DIR = Path(__file__).resolve().parent.parent / "examples" + + +@pytest.fixture +def netbox_instance(tmp_path: Path) -> SyncInstance: + """Load the netbox_to_infrahub config but blank out `order:` to force + the auto-tier path.""" + src = EXAMPLES_DIR / "netbox_to_infrahub" / "config.yml" + dst = tmp_path / "config.yml" + dst.write_text(src.read_text()) + inst = get_instance(config_file=str(dst)) + assert inst is not None + inst.order = [] # force fallback to compute_order() + return inst + + +def test_potenda_top_level_comes_from_compute_order(netbox_instance: SyncInstance) -> None: + """If order: is empty, Potenda is built with the flattened tier order.""" + with patch("infrahub_sync.utils.import_adapter") as fake_import: + fake_import.return_value = MagicMock() + ptd = get_potenda_from_instance(sync_instance=netbox_instance) + expected = netbox_instance.compute_order() + assert ptd.top_level == expected + assert "BuiltinTag" in ptd.top_level diff --git a/tests/test_potenda_parallel.py b/tests/test_potenda_parallel.py new file mode 100644 index 0000000..6554b1f --- /dev/null +++ b/tests/test_potenda_parallel.py @@ -0,0 +1,92 @@ +"""Potenda.sync_in_tiers executes tiers in order and fans out within a tier.""" + +from __future__ import annotations + +import operator +import threading +import time +from collections import defaultdict + +import pytest + +from infrahub_sync.potenda import Potenda + + +class _RecordingAdapter: + """Adapter that records the order and concurrency of per-kind sync calls.""" + + def __init__(self) -> None: + 
self.top_level: list[str] = [] + self.calls: list[tuple[float, str]] = [] + self.concurrent: defaultdict[int, set[str]] = defaultdict(set) + self._lock = threading.Lock() + self._active: set[str] = set() + self._snapshot_id = 0 + + def __str__(self) -> str: + return "recording" + + def load(self) -> None: # noqa: PLR6301 + # diffsync adapter hook; no-op for the recording fixture. + return None + + def diff_from(self, *_: object, **__: object) -> _NullDiff: # noqa: PLR6301 + return _NullDiff() + + def sync_from(self, *_: object, diff: object = None, **__: object) -> object: + # Simulate per-kind work in parallel: each top_level kind sleeps and + # records overlap. + threads = [] + for kind in self.top_level: + t = threading.Thread(target=self._do_kind, args=(kind,)) + threads.append(t) + t.start() + for t in threads: + t.join() + return diff + + def _do_kind(self, kind: str) -> None: + with self._lock: + self._active.add(kind) + self._snapshot_id += 1 + self.concurrent[self._snapshot_id] = set(self._active) + time.sleep(0.02) + with self._lock: + self._active.discard(kind) + self.calls.append((time.time(), kind)) + + +class _NullDiff: + def has_diffs(self) -> bool: # noqa: PLR6301 + return False + + def str(self) -> str: # noqa: PLR6301 # ty: ignore[invalid-type-form] + return "" + + def items(self) -> list: # noqa: PLR6301 + return [] + + +@pytest.mark.timeout(5) +def test_sync_in_tiers_respects_tier_boundary() -> None: + src = _RecordingAdapter() + dst = _RecordingAdapter() + tiers = [{"Tag", "Role"}, {"Device"}] + ptd = Potenda( + source=src, # ty: ignore[invalid-argument-type] + destination=dst, # ty: ignore[invalid-argument-type] + config=None, # ty: ignore[invalid-argument-type] + top_level=["Tag", "Role", "Device"], + tiers=tiers, + ) + + ptd.sync_in_tiers(parallel=True) + + # Order: every Tag/Role call happens before any Device call. 
+ by_time = sorted(dst.calls, key=operator.itemgetter(0)) + seen_tier0 = False + for _, kind in by_time: + if kind == "Device": + assert seen_tier0 + if kind in {"Tag", "Role"}: + seen_tier0 = True diff --git a/tests/test_potenda_tiers.py b/tests/test_potenda_tiers.py new file mode 100644 index 0000000..a706403 --- /dev/null +++ b/tests/test_potenda_tiers.py @@ -0,0 +1,43 @@ +"""Potenda stores the computed tiers and logs them on diff.""" + +from __future__ import annotations + +import logging +from typing import TYPE_CHECKING, ClassVar + +if TYPE_CHECKING: + import pytest + +from infrahub_sync.potenda import Potenda + + +class _FakeAdapter: + def __str__(self) -> str: + return "fake" + + top_level: ClassVar[list[str]] = [] + + +def test_potenda_accepts_tiers_kwarg() -> None: + ptd = Potenda( + source=_FakeAdapter(), # ty: ignore[invalid-argument-type] + destination=_FakeAdapter(), # ty: ignore[invalid-argument-type] + config=None, # ty: ignore[invalid-argument-type] + top_level=["A", "B", "C"], + tiers=[{"A"}, {"B", "C"}], + ) + assert ptd.tiers == [{"A"}, {"B", "C"}] + + +def test_potenda_logs_tiers_on_construction(caplog: pytest.LogCaptureFixture) -> None: + caplog.set_level(logging.INFO, logger="infrahub_sync.potenda") + Potenda( + source=_FakeAdapter(), # ty: ignore[invalid-argument-type] + destination=_FakeAdapter(), # ty: ignore[invalid-argument-type] + config=None, # ty: ignore[invalid-argument-type] + top_level=["A", "B"], + tiers=[{"A"}, {"B"}], + ) + msgs = [r.getMessage() for r in caplog.records] + assert any("tier 0" in m for m in msgs) + assert any("tier 1" in m for m in msgs) diff --git a/tests/test_sync_config_order.py b/tests/test_sync_config_order.py new file mode 100644 index 0000000..9815840 --- /dev/null +++ b/tests/test_sync_config_order.py @@ -0,0 +1,48 @@ +"""SyncConfig.compute_order() returns the operator override if present, +otherwise falls back to the auto-tiered, flattened topological order. 
+""" + +from __future__ import annotations + +from infrahub_sync import ( + SchemaMappingField, + SchemaMappingModel, + SyncAdapter, + SyncConfig, +) + + +def _cfg(order: list[str] | None, mapping: list[SchemaMappingModel]) -> SyncConfig: + return SyncConfig( + name="t", + source=SyncAdapter(name="netbox"), + destination=SyncAdapter(name="infrahub"), + order=order or [], + schema_mapping=mapping, + ) + + +def test_compute_order_uses_explicit_order_when_set() -> None: + cfg = _cfg( + order=["Device", "Tag"], + mapping=[ + SchemaMappingModel(name="Tag", identifiers=["name"], fields=[SchemaMappingField(name="name")]), + SchemaMappingModel( + name="Device", identifiers=["name"], fields=[SchemaMappingField(name="tag", reference="Tag")] + ), + ], + ) + assert cfg.compute_order() == ["Device", "Tag"] + + +def test_compute_order_falls_back_to_tiers_when_empty() -> None: + cfg = _cfg( + order=[], + mapping=[ + SchemaMappingModel(name="Tag", identifiers=["name"], fields=[SchemaMappingField(name="name")]), + SchemaMappingModel( + name="Device", identifiers=["name"], fields=[SchemaMappingField(name="tag", reference="Tag")] + ), + ], + ) + assert cfg.compute_order() == ["Tag", "Device"] diff --git a/uv.lock b/uv.lock index 220322b..0d60c10 100644 --- a/uv.lock +++ b/uv.lock @@ -382,6 +382,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/a4/a5/842ae8f0c08b61d6484b52f99a03510a3a72d23141942d216ebe81fefbce/filelock-3.25.2-py3-none-any.whl", hash = "sha256:ca8afb0da15f229774c9ad1b455ed96e85a81373065fb10446672f64444ddf70", size = 26759, upload-time = "2026-03-11T20:45:37.437Z" }, ] +[[package]] +name = "fsspec" +version = "2026.4.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/d5/8d/1c51c094345df128ca4a990d633fe1a0ff28726c9e6b3c41ba65087bba1d/fsspec-2026.4.0.tar.gz", hash = "sha256:301d8ac70ae90ef3ad05dcf94d6c3754a097f9b5fe4667d2787aa359ec7df7e4", size = 312760, upload-time = "2026-04-29T20:42:38.635Z" } 
+wheels = [ + { url = "https://files.pythonhosted.org/packages/d5/0c/043d5e551459da400957a1395e0febbf771446ff34291afcbe3d8be2a279/fsspec-2026.4.0-py3-none-any.whl", hash = "sha256:11ef7bb35dab8a394fde6e608221d5cf3e8499401c249bebaeaad760a1a8dec2", size = 203402, upload-time = "2026-04-29T20:42:36.842Z" }, +] + [[package]] name = "graphql-core" version = "3.2.8" @@ -486,8 +495,11 @@ version = "1.6.1" source = { editable = "." } dependencies = [ { name = "diffsync", extra = ["redis"] }, + { name = "filelock" }, + { name = "fsspec" }, { name = "infrahub-sdk", extra = ["all"] }, { name = "netutils" }, + { name = "pyarrow" }, { name = "structlog" }, { name = "tqdm" }, ] @@ -517,11 +529,14 @@ dev = [ [package.metadata] requires-dist = [ { name = "diffsync", extras = ["redis"], specifier = ">=2.1,<3.0" }, + { name = "filelock", specifier = ">=3.13" }, + { name = "fsspec", specifier = ">=2024.6" }, { name = "infrahub-sdk", extras = ["all"], specifier = ">=1.17,<2" }, { name = "invoke", marker = "extra == 'dev'", specifier = ">=2.2.1,<3" }, { name = "ipython", marker = "extra == 'dev'" }, { name = "netutils", specifier = ">=1.9,<2.0" }, { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=4.0,<5.0" }, + { name = "pyarrow", specifier = ">=17,<22" }, { name = "pylint", marker = "extra == 'dev'" }, { name = "pytest", marker = "extra == 'dev'", specifier = ">=9.0.2,<10" }, { name = "pytest-asyncio", marker = "extra == 'dev'" }, @@ -1041,45 +1056,45 @@ wheels = [ [[package]] name = "pyarrow" -version = "23.0.1" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/88/22/134986a4cc224d593c1afde5494d18ff629393d74cc2eddb176669f234a4/pyarrow-23.0.1.tar.gz", hash = "sha256:b8c5873e33440b2bc2f4a79d2b47017a89c5a24116c055625e6f2ee50523f019", size = 1167336, upload-time = "2026-02-16T10:14:12.39Z" } -wheels = [ - { url = 
"https://files.pythonhosted.org/packages/bc/a8/24e5dc6855f50a62936ceb004e6e9645e4219a8065f304145d7fb8a79d5d/pyarrow-23.0.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:3fab8f82571844eb3c460f90a75583801d14ca0cc32b1acc8c361650e006fd56", size = 34307390, upload-time = "2026-02-16T10:08:08.654Z" }, - { url = "https://files.pythonhosted.org/packages/bc/8e/4be5617b4aaae0287f621ad31c6036e5f63118cfca0dc57d42121ff49b51/pyarrow-23.0.1-cp310-cp310-macosx_12_0_x86_64.whl", hash = "sha256:3f91c038b95f71ddfc865f11d5876c42f343b4495535bd262c7b321b0b94507c", size = 35853761, upload-time = "2026-02-16T10:08:17.811Z" }, - { url = "https://files.pythonhosted.org/packages/2e/08/3e56a18819462210432ae37d10f5c8eed3828be1d6c751b6e6a2e93c286a/pyarrow-23.0.1-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:d0744403adabef53c985a7f8a082b502a368510c40d184df349a0a8754533258", size = 44493116, upload-time = "2026-02-16T10:08:25.792Z" }, - { url = "https://files.pythonhosted.org/packages/f8/82/c40b68001dbec8a3faa4c08cd8c200798ac732d2854537c5449dc859f55a/pyarrow-23.0.1-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:c33b5bf406284fd0bba436ed6f6c3ebe8e311722b441d89397c54f871c6863a2", size = 47564532, upload-time = "2026-02-16T10:08:34.27Z" }, - { url = "https://files.pythonhosted.org/packages/20/bc/73f611989116b6f53347581b02177f9f620efdf3cd3f405d0e83cdf53a83/pyarrow-23.0.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:ddf743e82f69dcd6dbbcb63628895d7161e04e56794ef80550ac6f3315eeb1d5", size = 48183685, upload-time = "2026-02-16T10:08:42.889Z" }, - { url = "https://files.pythonhosted.org/packages/b0/cc/6c6b3ecdae2a8c3aced99956187e8302fc954cc2cca2a37cf2111dad16ce/pyarrow-23.0.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:e052a211c5ac9848ae15d5ec875ed0943c0221e2fcfe69eee80b604b4e703222", size = 50605582, upload-time = "2026-02-16T10:08:51.641Z" }, - { url = 
"https://files.pythonhosted.org/packages/8d/94/d359e708672878d7638a04a0448edf7c707f9e5606cee11e15aaa5c7535a/pyarrow-23.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:5abde149bb3ce524782d838eb67ac095cd3fd6090eba051130589793f1a7f76d", size = 27521148, upload-time = "2026-02-16T10:08:58.077Z" }, - { url = "https://files.pythonhosted.org/packages/b0/41/8e6b6ef7e225d4ceead8459427a52afdc23379768f54dd3566014d7618c1/pyarrow-23.0.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:6f0147ee9e0386f519c952cc670eb4a8b05caa594eeffe01af0e25f699e4e9bb", size = 34302230, upload-time = "2026-02-16T10:09:03.859Z" }, - { url = "https://files.pythonhosted.org/packages/bf/4a/1472c00392f521fea03ae93408bf445cc7bfa1ab81683faf9bc188e36629/pyarrow-23.0.1-cp311-cp311-macosx_12_0_x86_64.whl", hash = "sha256:0ae6e17c828455b6265d590100c295193f93cc5675eb0af59e49dbd00d2de350", size = 35850050, upload-time = "2026-02-16T10:09:11.877Z" }, - { url = "https://files.pythonhosted.org/packages/0c/b2/bd1f2f05ded56af7f54d702c8364c9c43cd6abb91b0e9933f3d77b4f4132/pyarrow-23.0.1-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:fed7020203e9ef273360b9e45be52a2a47d3103caf156a30ace5247ffb51bdbd", size = 44491918, upload-time = "2026-02-16T10:09:18.144Z" }, - { url = "https://files.pythonhosted.org/packages/0b/62/96459ef5b67957eac38a90f541d1c28833d1b367f014a482cb63f3b7cd2d/pyarrow-23.0.1-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:26d50dee49d741ac0e82185033488d28d35be4d763ae6f321f97d1140eb7a0e9", size = 47562811, upload-time = "2026-02-16T10:09:25.792Z" }, - { url = "https://files.pythonhosted.org/packages/7d/94/1170e235add1f5f45a954e26cd0e906e7e74e23392dcb560de471f7366ec/pyarrow-23.0.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:3c30143b17161310f151f4a2bcfe41b5ff744238c1039338779424e38579d701", size = 48183766, upload-time = "2026-02-16T10:09:34.645Z" }, - { url = 
"https://files.pythonhosted.org/packages/0e/2d/39a42af4570377b99774cdb47f63ee6c7da7616bd55b3d5001aa18edfe4f/pyarrow-23.0.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:db2190fa79c80a23fdd29fef4b8992893f024ae7c17d2f5f4db7171fa30c2c78", size = 50607669, upload-time = "2026-02-16T10:09:44.153Z" }, - { url = "https://files.pythonhosted.org/packages/00/ca/db94101c187f3df742133ac837e93b1f269ebdac49427f8310ee40b6a58f/pyarrow-23.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:f00f993a8179e0e1c9713bcc0baf6d6c01326a406a9c23495ec1ba9c9ebf2919", size = 27527698, upload-time = "2026-02-16T10:09:50.263Z" }, - { url = "https://files.pythonhosted.org/packages/9a/4b/4166bb5abbfe6f750fc60ad337c43ecf61340fa52ab386da6e8dbf9e63c4/pyarrow-23.0.1-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:f4b0dbfa124c0bb161f8b5ebb40f1a680b70279aa0c9901d44a2b5a20806039f", size = 34214575, upload-time = "2026-02-16T10:09:56.225Z" }, - { url = "https://files.pythonhosted.org/packages/e1/da/3f941e3734ac8088ea588b53e860baeddac8323ea40ce22e3d0baa865cc9/pyarrow-23.0.1-cp312-cp312-macosx_12_0_x86_64.whl", hash = "sha256:7707d2b6673f7de054e2e83d59f9e805939038eebe1763fe811ee8fa5c0cd1a7", size = 35832540, upload-time = "2026-02-16T10:10:03.428Z" }, - { url = "https://files.pythonhosted.org/packages/88/7c/3d841c366620e906d54430817531b877ba646310296df42ef697308c2705/pyarrow-23.0.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:86ff03fb9f1a320266e0de855dee4b17da6794c595d207f89bba40d16b5c78b9", size = 44470940, upload-time = "2026-02-16T10:10:10.704Z" }, - { url = "https://files.pythonhosted.org/packages/2c/a5/da83046273d990f256cb79796a190bbf7ec999269705ddc609403f8c6b06/pyarrow-23.0.1-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:813d99f31275919c383aab17f0f455a04f5a429c261cc411b1e9a8f5e4aaaa05", size = 47586063, upload-time = "2026-02-16T10:10:17.95Z" }, - { url = 
"https://files.pythonhosted.org/packages/5b/3c/b7d2ebcff47a514f47f9da1e74b7949138c58cfeb108cdd4ee62f43f0cf3/pyarrow-23.0.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:bf5842f960cddd2ef757d486041d57c96483efc295a8c4a0e20e704cbbf39c67", size = 48173045, upload-time = "2026-02-16T10:10:25.363Z" }, - { url = "https://files.pythonhosted.org/packages/43/b2/b40961262213beaba6acfc88698eb773dfce32ecdf34d19291db94c2bd73/pyarrow-23.0.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:564baf97c858ecc03ec01a41062e8f4698abc3e6e2acd79c01c2e97880a19730", size = 50621741, upload-time = "2026-02-16T10:10:33.477Z" }, - { url = "https://files.pythonhosted.org/packages/f6/70/1fdda42d65b28b078e93d75d371b2185a61da89dda4def8ba6ba41ebdeb4/pyarrow-23.0.1-cp312-cp312-win_amd64.whl", hash = "sha256:07deae7783782ac7250989a7b2ecde9b3c343a643f82e8a4df03d93b633006f0", size = 27620678, upload-time = "2026-02-16T10:10:39.31Z" }, - { url = "https://files.pythonhosted.org/packages/47/10/2cbe4c6f0fb83d2de37249567373d64327a5e4d8db72f486db42875b08f6/pyarrow-23.0.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:6b8fda694640b00e8af3c824f99f789e836720aa8c9379fb435d4c4953a756b8", size = 34210066, upload-time = "2026-02-16T10:10:45.487Z" }, - { url = "https://files.pythonhosted.org/packages/cb/4f/679fa7e84dadbaca7a65f7cdba8d6c83febbd93ca12fa4adf40ba3b6362b/pyarrow-23.0.1-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:8ff51b1addc469b9444b7c6f3548e19dc931b172ab234e995a60aea9f6e6025f", size = 35825526, upload-time = "2026-02-16T10:10:52.266Z" }, - { url = "https://files.pythonhosted.org/packages/f9/63/d2747d930882c9d661e9398eefc54f15696547b8983aaaf11d4a2e8b5426/pyarrow-23.0.1-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:71c5be5cbf1e1cb6169d2a0980850bccb558ddc9b747b6206435313c47c37677", size = 44473279, upload-time = "2026-02-16T10:11:01.557Z" }, - { url = 
"https://files.pythonhosted.org/packages/b3/93/10a48b5e238de6d562a411af6467e71e7aedbc9b87f8d3a35f1560ae30fb/pyarrow-23.0.1-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:9b6f4f17b43bc39d56fec96e53fe89d94bac3eb134137964371b45352d40d0c2", size = 47585798, upload-time = "2026-02-16T10:11:09.401Z" }, - { url = "https://files.pythonhosted.org/packages/5c/20/476943001c54ef078dbf9542280e22741219a184a0632862bca4feccd666/pyarrow-23.0.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:9fc13fc6c403d1337acab46a2c4346ca6c9dec5780c3c697cf8abfd5e19b6b37", size = 48179446, upload-time = "2026-02-16T10:11:17.781Z" }, - { url = "https://files.pythonhosted.org/packages/4b/b6/5dd0c47b335fcd8edba9bfab78ad961bd0fd55ebe53468cc393f45e0be60/pyarrow-23.0.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:5c16ed4f53247fa3ffb12a14d236de4213a4415d127fe9cebed33d51671113e2", size = 50623972, upload-time = "2026-02-16T10:11:26.185Z" }, - { url = "https://files.pythonhosted.org/packages/d5/09/a532297c9591a727d67760e2e756b83905dd89adb365a7f6e9c72578bcc1/pyarrow-23.0.1-cp313-cp313-win_amd64.whl", hash = "sha256:cecfb12ef629cf6be0b1887f9f86463b0dd3dc3195ae6224e74006be4736035a", size = 27540749, upload-time = "2026-02-16T10:12:23.297Z" }, - { url = "https://files.pythonhosted.org/packages/a5/8e/38749c4b1303e6ae76b3c80618f84861ae0c55dd3c2273842ea6f8258233/pyarrow-23.0.1-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:29f7f7419a0e30264ea261fdc0e5fe63ce5a6095003db2945d7cd78df391a7e1", size = 34471544, upload-time = "2026-02-16T10:11:32.535Z" }, - { url = "https://files.pythonhosted.org/packages/a3/73/f237b2bc8c669212f842bcfd842b04fc8d936bfc9d471630569132dc920d/pyarrow-23.0.1-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:33d648dc25b51fd8055c19e4261e813dfc4d2427f068bcecc8b53d01b81b0500", size = 35949911, upload-time = "2026-02-16T10:11:39.813Z" }, - { url = 
"https://files.pythonhosted.org/packages/0c/86/b912195eee0903b5611bf596833def7d146ab2d301afeb4b722c57ffc966/pyarrow-23.0.1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:cd395abf8f91c673dd3589cadc8cc1ee4e8674fa61b2e923c8dd215d9c7d1f41", size = 44520337, upload-time = "2026-02-16T10:11:47.764Z" }, - { url = "https://files.pythonhosted.org/packages/69/c2/f2a717fb824f62d0be952ea724b4f6f9372a17eed6f704b5c9526f12f2f1/pyarrow-23.0.1-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:00be9576d970c31defb5c32eb72ef585bf600ef6d0a82d5eccaae96639cf9d07", size = 47548944, upload-time = "2026-02-16T10:11:56.607Z" }, - { url = "https://files.pythonhosted.org/packages/84/a7/90007d476b9f0dc308e3bc57b832d004f848fd6c0da601375d20d92d1519/pyarrow-23.0.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c2139549494445609f35a5cda4eb94e2c9e4d704ce60a095b342f82460c73a83", size = 48236269, upload-time = "2026-02-16T10:12:04.47Z" }, - { url = "https://files.pythonhosted.org/packages/b0/3f/b16fab3e77709856eb6ac328ce35f57a6d4a18462c7ca5186ef31b45e0e0/pyarrow-23.0.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:7044b442f184d84e2351e5084600f0d7343d6117aabcbc1ac78eb1ae11eb4125", size = 50604794, upload-time = "2026-02-16T10:12:11.797Z" }, - { url = "https://files.pythonhosted.org/packages/e9/a1/22df0620a9fac31d68397a75465c344e83c3dfe521f7612aea33e27ab6c0/pyarrow-23.0.1-cp313-cp313t-win_amd64.whl", hash = "sha256:a35581e856a2fafa12f3f54fce4331862b1cfb0bef5758347a858a4aa9d6bae8", size = 27660642, upload-time = "2026-02-16T10:12:17.746Z" }, +version = "21.0.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/ef/c2/ea068b8f00905c06329a3dfcd40d0fcc2b7d0f2e355bdb25b65e0a0e4cd4/pyarrow-21.0.0.tar.gz", hash = "sha256:5051f2dccf0e283ff56335760cbc8622cf52264d67e359d5569541ac11b6d5bc", size = 1133487, upload-time = "2025-07-18T00:57:31.761Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/17/d9/110de31880016e2afc52d8580b397dbe47615defbf09ca8cf55f56c62165/pyarrow-21.0.0-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:e563271e2c5ff4d4a4cbeb2c83d5cf0d4938b891518e676025f7268c6fe5fe26", size = 31196837, upload-time = "2025-07-18T00:54:34.755Z" }, + { url = "https://files.pythonhosted.org/packages/df/5f/c1c1997613abf24fceb087e79432d24c19bc6f7259cab57c2c8e5e545fab/pyarrow-21.0.0-cp310-cp310-macosx_12_0_x86_64.whl", hash = "sha256:fee33b0ca46f4c85443d6c450357101e47d53e6c3f008d658c27a2d020d44c79", size = 32659470, upload-time = "2025-07-18T00:54:38.329Z" }, + { url = "https://files.pythonhosted.org/packages/3e/ed/b1589a777816ee33ba123ba1e4f8f02243a844fed0deec97bde9fb21a5cf/pyarrow-21.0.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:7be45519b830f7c24b21d630a31d48bcebfd5d4d7f9d3bdb49da9cdf6d764edb", size = 41055619, upload-time = "2025-07-18T00:54:42.172Z" }, + { url = "https://files.pythonhosted.org/packages/44/28/b6672962639e85dc0ac36f71ab3a8f5f38e01b51343d7aa372a6b56fa3f3/pyarrow-21.0.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:26bfd95f6bff443ceae63c65dc7e048670b7e98bc892210acba7e4995d3d4b51", size = 42733488, upload-time = "2025-07-18T00:54:47.132Z" }, + { url = "https://files.pythonhosted.org/packages/f8/cc/de02c3614874b9089c94eac093f90ca5dfa6d5afe45de3ba847fd950fdf1/pyarrow-21.0.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:bd04ec08f7f8bd113c55868bd3fc442a9db67c27af098c5f814a3091e71cc61a", size = 43329159, upload-time = "2025-07-18T00:54:51.686Z" }, + { url = "https://files.pythonhosted.org/packages/a6/3e/99473332ac40278f196e105ce30b79ab8affab12f6194802f2593d6b0be2/pyarrow-21.0.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:9b0b14b49ac10654332a805aedfc0147fb3469cbf8ea951b3d040dab12372594", size = 45050567, upload-time = "2025-07-18T00:54:56.679Z" }, + { url = 
"https://files.pythonhosted.org/packages/7b/f5/c372ef60593d713e8bfbb7e0c743501605f0ad00719146dc075faf11172b/pyarrow-21.0.0-cp310-cp310-win_amd64.whl", hash = "sha256:9d9f8bcb4c3be7738add259738abdeddc363de1b80e3310e04067aa1ca596634", size = 26217959, upload-time = "2025-07-18T00:55:00.482Z" }, + { url = "https://files.pythonhosted.org/packages/94/dc/80564a3071a57c20b7c32575e4a0120e8a330ef487c319b122942d665960/pyarrow-21.0.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:c077f48aab61738c237802836fc3844f85409a46015635198761b0d6a688f87b", size = 31243234, upload-time = "2025-07-18T00:55:03.812Z" }, + { url = "https://files.pythonhosted.org/packages/ea/cc/3b51cb2db26fe535d14f74cab4c79b191ed9a8cd4cbba45e2379b5ca2746/pyarrow-21.0.0-cp311-cp311-macosx_12_0_x86_64.whl", hash = "sha256:689f448066781856237eca8d1975b98cace19b8dd2ab6145bf49475478bcaa10", size = 32714370, upload-time = "2025-07-18T00:55:07.495Z" }, + { url = "https://files.pythonhosted.org/packages/24/11/a4431f36d5ad7d83b87146f515c063e4d07ef0b7240876ddb885e6b44f2e/pyarrow-21.0.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:479ee41399fcddc46159a551705b89c05f11e8b8cb8e968f7fec64f62d91985e", size = 41135424, upload-time = "2025-07-18T00:55:11.461Z" }, + { url = "https://files.pythonhosted.org/packages/74/dc/035d54638fc5d2971cbf1e987ccd45f1091c83bcf747281cf6cc25e72c88/pyarrow-21.0.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:40ebfcb54a4f11bcde86bc586cbd0272bac0d516cfa539c799c2453768477569", size = 42823810, upload-time = "2025-07-18T00:55:16.301Z" }, + { url = "https://files.pythonhosted.org/packages/2e/3b/89fced102448a9e3e0d4dded1f37fa3ce4700f02cdb8665457fcc8015f5b/pyarrow-21.0.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:8d58d8497814274d3d20214fbb24abcad2f7e351474357d552a8d53bce70c70e", size = 43391538, upload-time = "2025-07-18T00:55:23.82Z" }, + { url = 
"https://files.pythonhosted.org/packages/fb/bb/ea7f1bd08978d39debd3b23611c293f64a642557e8141c80635d501e6d53/pyarrow-21.0.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:585e7224f21124dd57836b1530ac8f2df2afc43c861d7bf3d58a4870c42ae36c", size = 45120056, upload-time = "2025-07-18T00:55:28.231Z" }, + { url = "https://files.pythonhosted.org/packages/6e/0b/77ea0600009842b30ceebc3337639a7380cd946061b620ac1a2f3cb541e2/pyarrow-21.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:555ca6935b2cbca2c0e932bedd853e9bc523098c39636de9ad4693b5b1df86d6", size = 26220568, upload-time = "2025-07-18T00:55:32.122Z" }, + { url = "https://files.pythonhosted.org/packages/ca/d4/d4f817b21aacc30195cf6a46ba041dd1be827efa4a623cc8bf39a1c2a0c0/pyarrow-21.0.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:3a302f0e0963db37e0a24a70c56cf91a4faa0bca51c23812279ca2e23481fccd", size = 31160305, upload-time = "2025-07-18T00:55:35.373Z" }, + { url = "https://files.pythonhosted.org/packages/a2/9c/dcd38ce6e4b4d9a19e1d36914cb8e2b1da4e6003dd075474c4cfcdfe0601/pyarrow-21.0.0-cp312-cp312-macosx_12_0_x86_64.whl", hash = "sha256:b6b27cf01e243871390474a211a7922bfbe3bda21e39bc9160daf0da3fe48876", size = 32684264, upload-time = "2025-07-18T00:55:39.303Z" }, + { url = "https://files.pythonhosted.org/packages/4f/74/2a2d9f8d7a59b639523454bec12dba35ae3d0a07d8ab529dc0809f74b23c/pyarrow-21.0.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:e72a8ec6b868e258a2cd2672d91f2860ad532d590ce94cdf7d5e7ec674ccf03d", size = 41108099, upload-time = "2025-07-18T00:55:42.889Z" }, + { url = "https://files.pythonhosted.org/packages/ad/90/2660332eeb31303c13b653ea566a9918484b6e4d6b9d2d46879a33ab0622/pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:b7ae0bbdc8c6674259b25bef5d2a1d6af5d39d7200c819cf99e07f7dfef1c51e", size = 42829529, upload-time = "2025-07-18T00:55:47.069Z" }, + { url = 
"https://files.pythonhosted.org/packages/33/27/1a93a25c92717f6aa0fca06eb4700860577d016cd3ae51aad0e0488ac899/pyarrow-21.0.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:58c30a1729f82d201627c173d91bd431db88ea74dcaa3885855bc6203e433b82", size = 43367883, upload-time = "2025-07-18T00:55:53.069Z" }, + { url = "https://files.pythonhosted.org/packages/05/d9/4d09d919f35d599bc05c6950095e358c3e15148ead26292dfca1fb659b0c/pyarrow-21.0.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:072116f65604b822a7f22945a7a6e581cfa28e3454fdcc6939d4ff6090126623", size = 45133802, upload-time = "2025-07-18T00:55:57.714Z" }, + { url = "https://files.pythonhosted.org/packages/71/30/f3795b6e192c3ab881325ffe172e526499eb3780e306a15103a2764916a2/pyarrow-21.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:cf56ec8b0a5c8c9d7021d6fd754e688104f9ebebf1bf4449613c9531f5346a18", size = 26203175, upload-time = "2025-07-18T00:56:01.364Z" }, + { url = "https://files.pythonhosted.org/packages/16/ca/c7eaa8e62db8fb37ce942b1ea0c6d7abfe3786ca193957afa25e71b81b66/pyarrow-21.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:e99310a4ebd4479bcd1964dff9e14af33746300cb014aa4a3781738ac63baf4a", size = 31154306, upload-time = "2025-07-18T00:56:04.42Z" }, + { url = "https://files.pythonhosted.org/packages/ce/e8/e87d9e3b2489302b3a1aea709aaca4b781c5252fcb812a17ab6275a9a484/pyarrow-21.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:d2fe8e7f3ce329a71b7ddd7498b3cfac0eeb200c2789bd840234f0dc271a8efe", size = 32680622, upload-time = "2025-07-18T00:56:07.505Z" }, + { url = "https://files.pythonhosted.org/packages/84/52/79095d73a742aa0aba370c7942b1b655f598069489ab387fe47261a849e1/pyarrow-21.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:f522e5709379d72fb3da7785aa489ff0bb87448a9dc5a75f45763a795a089ebd", size = 41104094, upload-time = "2025-07-18T00:56:10.994Z" }, + { url = 
"https://files.pythonhosted.org/packages/89/4b/7782438b551dbb0468892a276b8c789b8bbdb25ea5c5eb27faadd753e037/pyarrow-21.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:69cbbdf0631396e9925e048cfa5bce4e8c3d3b41562bbd70c685a8eb53a91e61", size = 42825576, upload-time = "2025-07-18T00:56:15.569Z" }, + { url = "https://files.pythonhosted.org/packages/b3/62/0f29de6e0a1e33518dec92c65be0351d32d7ca351e51ec5f4f837a9aab91/pyarrow-21.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:731c7022587006b755d0bdb27626a1a3bb004bb56b11fb30d98b6c1b4718579d", size = 43368342, upload-time = "2025-07-18T00:56:19.531Z" }, + { url = "https://files.pythonhosted.org/packages/90/c7/0fa1f3f29cf75f339768cc698c8ad4ddd2481c1742e9741459911c9ac477/pyarrow-21.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:dc56bc708f2d8ac71bd1dcb927e458c93cec10b98eb4120206a4091db7b67b99", size = 45131218, upload-time = "2025-07-18T00:56:23.347Z" }, + { url = "https://files.pythonhosted.org/packages/01/63/581f2076465e67b23bc5a37d4a2abff8362d389d29d8105832e82c9c811c/pyarrow-21.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:186aa00bca62139f75b7de8420f745f2af12941595bbbfa7ed3870ff63e25636", size = 26087551, upload-time = "2025-07-18T00:56:26.758Z" }, + { url = "https://files.pythonhosted.org/packages/c9/ab/357d0d9648bb8241ee7348e564f2479d206ebe6e1c47ac5027c2e31ecd39/pyarrow-21.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:a7a102574faa3f421141a64c10216e078df467ab9576684d5cd696952546e2da", size = 31290064, upload-time = "2025-07-18T00:56:30.214Z" }, + { url = "https://files.pythonhosted.org/packages/3f/8a/5685d62a990e4cac2043fc76b4661bf38d06efed55cf45a334b455bd2759/pyarrow-21.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:1e005378c4a2c6db3ada3ad4c217b381f6c886f0a80d6a316fe586b90f77efd7", size = 32727837, upload-time = "2025-07-18T00:56:33.935Z" }, + { url = 
"https://files.pythonhosted.org/packages/fc/de/c0828ee09525c2bafefd3e736a248ebe764d07d0fd762d4f0929dbc516c9/pyarrow-21.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:65f8e85f79031449ec8706b74504a316805217b35b6099155dd7e227eef0d4b6", size = 41014158, upload-time = "2025-07-18T00:56:37.528Z" }, + { url = "https://files.pythonhosted.org/packages/6e/26/a2865c420c50b7a3748320b614f3484bfcde8347b2639b2b903b21ce6a72/pyarrow-21.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:3a81486adc665c7eb1a2bde0224cfca6ceaba344a82a971ef059678417880eb8", size = 42667885, upload-time = "2025-07-18T00:56:41.483Z" }, + { url = "https://files.pythonhosted.org/packages/0a/f9/4ee798dc902533159250fb4321267730bc0a107d8c6889e07c3add4fe3a5/pyarrow-21.0.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:fc0d2f88b81dcf3ccf9a6ae17f89183762c8a94a5bdcfa09e05cfe413acf0503", size = 43276625, upload-time = "2025-07-18T00:56:48.002Z" }, + { url = "https://files.pythonhosted.org/packages/5a/da/e02544d6997037a4b0d22d8e5f66bc9315c3671371a8b18c79ade1cefe14/pyarrow-21.0.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:6299449adf89df38537837487a4f8d3bd91ec94354fdd2a7d30bc11c48ef6e79", size = 44951890, upload-time = "2025-07-18T00:56:52.568Z" }, + { url = "https://files.pythonhosted.org/packages/e5/4e/519c1bc1876625fe6b71e9a28287c43ec2f20f73c658b9ae1d485c0c206e/pyarrow-21.0.0-cp313-cp313t-win_amd64.whl", hash = "sha256:222c39e2c70113543982c6b34f3077962b44fca38c0bd9e68bb6781534425c10", size = 26371006, upload-time = "2025-07-18T00:56:56.379Z" }, ] [[package]]