feat(iceberg): back the PyIceberg sink with real MinIO object store#92
Merged
Merged
Conversation
The Iceberg REST catalog wrote table data/metadata to an ephemeral
/tmp/warehouse via HadoopFileIO, so the events.validated -> Iceberg path
was not backed by the object store the rest of the stack uses. Point it
at the same MinIO `agentflow-lake` S3 bucket as the Flink jobs.
- docker-compose.iceberg.yml: self-contained MinIO + bucket-init + REST
catalog (S3FileIO, s3://agentflow-lake/warehouse), mirroring the image
tags/credentials in docker-compose.yml. Init retries the alias instead
of gating on the healthcheck so it comes up regardless of curl.
- config/iceberg.yaml: warehouse s3://agentflow-lake/warehouse + s3
catalog_properties, env-overridable via ${VAR:-default} (prod injects
real credentials; defaults match the local MinIO compose).
- iceberg_sink.py: minimal ${VAR}/${VAR:-default} env expansion applied
to catalog_uri, warehouse, and catalog_properties (no-op without "${").
- tests: 2 no-Docker unit tests (env expansion + s3 props passthrough,
asserting an s3:// warehouse never triggers a local mkdir); the
requires_docker integration test now exercises the S3-backed catalog.
- docs/architecture.md: catalog description updated to MinIO-backed.
Live-validated on Mac/colima: brought the stack up, wrote two order
batches through the real IcebergSink (env-repointed to in-network
services), confirmed partitioned parquet + Iceberg metadata/manifests
landed under s3://agentflow-lake/warehouse in MinIO, row_counts 1 -> 2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DORA Metrics
|
brownjuly2003-code
added a commit
that referenced
this pull request
Jun 27, 2026
…nts.txt header (#95) * fix(cache): use redis set(ex=) instead of deprecated setex `Redis.setex` is `@deprecated_function` in redis-py 8.0.0 ("Use 'set' instead") and emits a DeprecationWarning from the query-cache hot path on every cached write. Switch `QueryCache.set` to `set(key, value, ex=ttl)`, which is behaviorally identical (int seconds), and drop the now-unused `timedelta` import. All six in-repo Redis test doubles (unit cache/entity_cache/versioning, integration tenant-isolation, chaos RESP client) implemented `setex`; they move to `set(self, key, value, ex=None)` with the matching argument order, and the chaos RESP client now issues `SET … EX` over the wire. The two `set_calls` ttl assertions compare the integer `ex` directly. Verified no-Docker: ruff + mypy clean, full unit suite 1096 passed / 1 skipped (the redis.setex DeprecationWarning is gone). The integration and chaos doubles change symmetrically and are validated by their CI jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: sync version story to v1.5.0 and clarify requirements.txt The README Status section and release badge still described v1.4.0 as the current line even though v1.5.0 is tagged and published, and the CHANGELOG [Unreleased] section did not record the DV2 re-architecture already on main. - README: badge v1.4 -> v1.5, "current release line" -> v1.5.0, extend the release arc to five increments with a v1.5.0 bullet (argon2id O(1) key hashing, NL->SQL guard bypass fix, strict-mypy expansion), and add a note that main carries post-v1.5.0 work pending the next tag. - CHANGELOG [Unreleased]: document the DV2 raw vault migration ClickHouse -> PostgreSQL (#91), the PyIceberg sink backed by real MinIO (#92), the LISTEN/NOTIFY OLTP->vault freshness (#93), and the dependency batch (#94). - requirements.txt: add a header explaining it is a supplemental OTel pin set installed on top of the pyproject package by the e2e/mutation/staging workflows and the security Safety scan, not the full dependency set (pyproject.toml is the source of truth). load_requirements() skips comment lines and `pip -r` ignores them, so the header is non-breaking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: JuliaEdom <uedomskikh@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
The Iceberg REST catalog wrote table data/metadata to an ephemeral
/tmp/warehouseviaHadoopFileIO, so theevents.validated → Icebergpath was not backed by the object store the rest of the stack uses. This points the PyIceberg sink at the same MinIOagentflow-lakeS3 bucket the Flink jobs already write to.Changes
docker-compose.iceberg.yml— self-contained MinIO + bucket-init + REST catalog (S3FileIO,s3://agentflow-lake/warehouse), mirroring the image tags / credentials indocker-compose.yml. The init job retries themcalias instead of gating on the MinIO healthcheck, so the stack comes up regardless of whether the server image shipscurl.config/iceberg.yaml—warehouse: s3://agentflow-lake/warehouse+s3.*catalog_properties, env-overridable via${VAR:-default}(production injects real credentials; defaults match the local MinIO compose).src/processing/iceberg_sink.py— minimal${VAR}/${VAR:-default}env expansion applied tocatalog_uri,warehouse, andcatalog_propertiesvalues (no-op when the value contains no${).s3://warehouse never triggers a localmkdir); therequires_dockerintegration test now exercises the S3-backed catalog.docs/architecture.md— catalog description updated to MinIO-backed.Verification
ruff/ruff format/mypyclean;tests/unit/test_iceberg_sink.py(4) and the non-dockertests/integration/test_iceberg_sink.py(5) pass.IcebergSink(env-repointed to in-network services), and confirmed partitioned parquet + Iceberg metadata/manifests landed unders3://agentflow-lake/warehousein MinIO, withrow_countsgoing1 → 2.🤖 Generated with Claude Code