Conversation

@fordN fordN commented Nov 7, 2025

This PR adds Python bindings for the Amp Admin API, enabling programmatic dataset registration, deployment, and monitoring.

Key Features

  • Unified Client: Single Client class supports both Flight SQL queries and admin operations
  • Auto-generated Models: Pydantic v2 models generated from OpenAPI spec for type safety
  • Fluent API: Chain from SQL query → manifest generation → registration → deployment
  • Schema Validation: Automatic Arrow schema inference via /schema endpoint
  • Job Monitoring: Poll and wait for deployment completion with typed error handling (see the sketch after the Quick Example)

Quick Example

from amp import Client

client = Client(
    query_url="grpc://localhost:8815",
    admin_url="http://localhost:8080"
)

# One-liner: Query → Register → Deploy
job = (
    client.query("SELECT * FROM eth.blocks")
    .with_dependency('eth', '_/eth_firehose@1.0.0')
    .register_as('_', 'my_dataset', '1.0.0', 'blocks', 'mainnet')
    .deploy(parallelism=4, wait=True)
)
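
Expanding on the Job Monitoring bullet above, here is a sketch of a non-blocking deployment with explicit polling and typed error handling. The exception name, the wait=False flag, and the jobs.wait() signature are assumptions based on this PR's description, not verbatim library API:

from amp import Client
from amp.exceptions import AmpAdminError  # hypothetical import path for the typed hierarchy

client = Client(
    query_url="grpc://localhost:8815",
    admin_url="http://localhost:8080"
)

try:
    # Deploy without blocking, then poll for completion explicitly
    job = (
        client.query("SELECT * FROM eth.blocks")
        .with_dependency('eth', '_/eth_firehose@1.0.0')
        .register_as('_', 'my_dataset', '1.0.0', 'blocks', 'mainnet')
        .deploy(parallelism=4, wait=False)  # wait=False is assumed here
    )
    client.jobs.wait(job.id, timeout=600)  # polling with a configurable timeout, per the commit notes
except AmpAdminError as exc:
    # API error codes map to typed exception classes in the hierarchy
    print(f"deployment failed: {exc}")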

Documentation

  • User Guide: docs/admin_client_guide.md
  • API Reference: docs/api/admin_api.md

@fordN fordN self-assigned this Nov 7, 2025
@fordN fordN force-pushed the ford/dataset-management branch from 2633560 to 00d08f7 on November 7, 2025 22:32
@fordN fordN force-pushed the ford/snowpipe-and-streaming-state branch from 68bbdb3 to e0e5765 on November 7, 2025 22:36
@fordN fordN force-pushed the ford/dataset-management branch from 00d08f7 to 26372d5 on November 8, 2025 18:24
@fordN fordN added the enhancement label Nov 8, 2025
@fordN fordN force-pushed the ford/dataset-management branch 3 times, most recently from a7d0be3 to 36f0af5 on November 11, 2025 05:29
Base automatically changed from ford/snowpipe-and-streaming-state to main November 12, 2025 18:51
- Add pydantic>=2.0,<2.12 for data models (constrained for PyIceberg)
- Add datamodel-code-generator for model generation from OpenAPI
- Add respx for HTTP mocking in tests
- Add Makefile target for regenerating models
- Generate 838 lines of Pydantic v2 models from OpenAPI spec
- Add test manifest files for dataset registration testing
- Create AdminClient base class with HTTP request handling and error mapping
- Implement DatasetsClient with register/deploy/list/delete operations
- Implement JobsClient with get/list/wait/stop/delete operations
- Implement SchemaClient for SQL validation and schema inference
- Create DeploymentContext for chainable deployment workflows
- Add exception hierarchy with 30+ typed error classes mapped from API codes (sketched after these notes)
- Support automatic job polling with configurable timeout
- Add query_url and admin_url parameters to Client (backward compatible with url)
- Add datasets, jobs, schema properties for admin operations
- Extend QueryBuilder with with_dependency() for manifest dependencies
- Add to_manifest() for generating dataset manifests from SQL queries
- Add register_as() for one-line registration returning DeploymentContext
- Support fluent API: query → with_dependency → register_as → deploy
- Maintain backward compatibility (existing Client(url=...) still works)
- Add 10 unit tests for error mapping and exception hierarchy
- Add 10 unit tests for Pydantic model validation
- Add 10 integration tests for AdminClient HTTP operations
- Add 10 integration tests for DatasetsClient operations
- Add 18 integration tests for JobsClient operations including polling
- All 48 tests use respx for HTTP mocking (no real server required)
- 0.65s execution time on dev machine
- Add admin client to feature list
- Add quick start examples for admin operations
- Add links to admin client guide and API reference
- Update overview to highlight dataset management capabilities
- Add comprehensive admin_client_guide.md with usage patterns and best practices
- Add complete API reference in docs/api/admin_api.md
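
As a rough illustration of the error-mapping approach described in these notes, a minimal sketch follows; the class names, error codes, and helper are invented for illustration and do not mirror the actual module:

class AmpAdminError(Exception):
    """Hypothetical base class for all admin API errors."""

class DatasetNotFoundError(AmpAdminError):
    pass

class JobTimeoutError(AmpAdminError):
    pass

# Map API error codes from response bodies to typed exceptions
ERROR_CODE_MAP = {
    "DATASET_NOT_FOUND": DatasetNotFoundError,
    "JOB_TIMEOUT": JobTimeoutError,
}

def raise_for_error(payload: dict) -> None:
    # Unmapped codes fall back to the base class
    exc_cls = ERROR_CODE_MAP.get(payload.get("code"), AmpAdminError)
    raise exc_cls(payload.get("message", "unknown admin API error"))
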
# Register and deploy a dataset
job = (
    client.query("SELECT block_num, hash FROM eth.blocks")
    .with_dependency('eth', '_/eth_firehose@1.0.0')

I don't understand what with_dependency is doing here in the context of that query. Does the eth_firehose dataset contain eth.blocks as well as logs and transactions datasets, etc.? Is it possible to automatically detect and populate dependencies based on the SQL, or to construct the SQL query in a way that automatically pulls in the dependency?

fordN (Contributor, Author)

In this case the dataset's fully qualified name (FQN) is _/eth_firehose@1.0.0 (<namespace>/<dataset>@<version>), and it's being aliased as simply eth. blocks is a table in the dataset, along with transactions and logs.

Agreed that this probably exposes more of the internals to the user than necessary. It reflects the structure of the manifest, which requires a section listing all dependencies used in the SQL.

I think the dependencies list could be generated from the SQL, but I'm going to let the structure of these datasets and the recommended way to specify dependencies stabilize before making assumptions on the user's behalf. Right now we have a few ways of setting up these derived dataset dependencies: using the FQN everywhere, specifying an alias like this, or simply using the dataset name without namespace and version (which defaults to latest). The server-side work for validating and deploying datasets with all of these options is still WIP and stabilizing.
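
For illustration only, the three dependency styles described above might look like this; the exact call forms are assumptions based on this thread, not a stable API:

# 1. FQN everywhere: use the fully qualified name as the reference itself
query.with_dependency('_/eth_firehose@1.0.0', '_/eth_firehose@1.0.0')

# 2. Short alias bound to the FQN (the style used in this PR's examples)
query.with_dependency('eth', '_/eth_firehose@1.0.0')

# 3. Bare dataset name without namespace or version (defaults to latest)
query.with_dependency('eth', 'eth_firehose')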

query = query.with_dependency('eth', '_/eth_firehose@1.0.0')

# Generate manifest
manifest = query.to_manifest(

Nice

# Generate manifest
manifest = query.to_manifest(
    table_name='blocks',
    network='mainnet'

Maybe the networks should be typed/enums since there are only a few of them?
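
A minimal sketch of what that could look like; the member names here are illustrative, since the actual network list would come from the server's spec:

from enum import Enum

class Network(str, Enum):
    # Hypothetical members; the str mixin keeps serialization as plain strings
    MAINNET = 'mainnet'
    SEPOLIA = 'sepolia'
    HOLESKY = 'holesky'

manifest = query.to_manifest(table_name='blocks', network=Network.MAINNET)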

@@ -0,0 +1,2586 @@
{

Can this spec be generated by the server to avoid drift?

fordN (Contributor, Author)

This was generated by the server. I should add a process to keep it in sync with the server's spec, since this copy was brought over manually.

Yeah, that would be a nice-to-have.
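
One possible shape for that sync step, building on the Makefile target for regenerating models mentioned in the commit notes; the spec URL and output path are placeholders:

# sync_models.py -- hypothetical helper: fetch the spec, then regenerate models
import subprocess
import urllib.request

SPEC_URL = "http://localhost:8080/openapi.json"  # assumed server endpoint
SPEC_PATH = "openapi.json"

# Fetch the current spec from the running server
with urllib.request.urlopen(SPEC_URL) as resp:
    spec = resp.read()
with open(SPEC_PATH, "wb") as f:
    f.write(spec)

# datamodel-code-generator produces the Pydantic v2 models
subprocess.run(
    [
        "datamodel-codegen",
        "--input", SPEC_PATH,
        "--input-file-type", "openapi",
        "--output", "amp/models.py",  # placeholder path
        "--output-model-type", "pydantic_v2.BaseModel",
    ],
    check=True,
)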

@fordN fordN force-pushed the ford/dataset-management branch from 36f0af5 to 48e4160 on November 17, 2025 03:01
@fordN fordN merged commit f538843 into main Nov 17, 2025
9 checks passed
@fordN fordN deleted the ford/dataset-management branch November 17, 2025 03:36