
feat(iceberg): back the PyIceberg sink with real MinIO object store#92

Merged: brownjuly2003-code merged 1 commit into main from feat/iceberg-minio-object-store on Jun 27, 2026

@brownjuly2003-code

What

The Iceberg REST catalog wrote table data/metadata to an ephemeral /tmp/warehouse via HadoopFileIO, so the events.validated → Iceberg path was not backed by the object store the rest of the stack uses. This points the PyIceberg sink at the same MinIO agentflow-lake S3 bucket the Flink jobs already write to.

Changes

  • docker-compose.iceberg.yml — self-contained MinIO + bucket-init + REST catalog (S3FileIO, s3://agentflow-lake/warehouse), mirroring the image tags / credentials in docker-compose.yml. The init job retries the mc alias instead of gating on the MinIO healthcheck, so the stack comes up regardless of whether the server image ships curl.
  • config/iceberg.yaml: warehouse set to s3://agentflow-lake/warehouse plus s3.* catalog_properties, env-overridable via ${VAR:-default} (production injects real credentials; defaults match the local MinIO compose).
  • src/processing/iceberg_sink.py — minimal ${VAR} / ${VAR:-default} env expansion applied to catalog_uri, warehouse, and catalog_properties values (no-op when the value contains no ${).
  • tests — 2 no-Docker unit tests (env expansion + S3 props passthrough, asserting an s3:// warehouse never triggers a local mkdir); the requires_docker integration test now exercises the S3-backed catalog.
  • docs/architecture.md — catalog description updated to MinIO-backed.
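The `${VAR}` / `${VAR:-default}` expansion described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in src/processing/iceberg_sink.py: the function name `expand_env`, the regex, and the choice to leave an unresolved `${VAR}` untouched are all assumptions, and `S3_ENDPOINT` is a hypothetical variable name.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${VAR} or ${VAR:-default}; the default part may be empty.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand ${VAR} and ${VAR:-default} references inside a string."""
    if "${" not in value:
        return value  # no-op fast path when there is nothing to expand
    source = os.environ if env is None else env

    def _substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        current = source.get(name)
        if default is not None:
            # Shell-like ':-' semantics: fall back when unset or empty.
            return current if current else default
        # Assumption: an unset ${VAR} without a default is left as written.
        return current if current is not None else match.group(0)

    return _ENV_PATTERN.sub(_substitute, value)


if __name__ == "__main__":
    env = {"S3_ENDPOINT": "http://minio:9000"}  # hypothetical variable name
    print(expand_env("${S3_ENDPOINT:-http://localhost:9000}", env))
    print(expand_env("${MISSING:-http://localhost:9000}", env))
```

Passing the environment as an explicit mapping keeps the helper testable without Docker or real credentials, which fits the no-Docker unit tests mentioned above.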

Verification

  • No-Docker gate: ruff / ruff format / mypy clean; tests/unit/test_iceberg_sink.py (4) and the non-docker tests/integration/test_iceberg_sink.py (5) pass.
  • Live-validated on Docker/colima: brought the stack up, wrote two order batches through the real IcebergSink (env-repointed to in-network services), and confirmed partitioned parquet + Iceberg metadata/manifests landed under s3://agentflow-lake/warehouse in MinIO, with row_counts going 1 → 2.
warehouse/agentflow/orders/data/created_at_day=2026-06-27/00000-0-...parquet   (x2 appends)
warehouse/agentflow/orders/metadata/00000..00002-...metadata.json
warehouse/agentflow/orders/metadata/snap-...-.avro  + ...-m0.avro manifests
warehouse/agentflow/{payments,clickstream,inventory,dead_letter}/metadata/...

🤖 Generated with Claude Code

The Iceberg REST catalog wrote table data/metadata to an ephemeral
/tmp/warehouse via HadoopFileIO, so the events.validated -> Iceberg path
was not backed by the object store the rest of the stack uses. Point it
at the same MinIO `agentflow-lake` S3 bucket as the Flink jobs.

- docker-compose.iceberg.yml: self-contained MinIO + bucket-init + REST
  catalog (S3FileIO, s3://agentflow-lake/warehouse), mirroring the image
  tags/credentials in docker-compose.yml. Init retries the alias instead
  of gating on the healthcheck so it comes up regardless of curl.
- config/iceberg.yaml: warehouse s3://agentflow-lake/warehouse + s3
  catalog_properties, env-overridable via ${VAR:-default} (prod injects
  real credentials; defaults match the local MinIO compose).
- iceberg_sink.py: minimal ${VAR}/${VAR:-default} env expansion applied
  to catalog_uri, warehouse, and catalog_properties (no-op without "${").
- tests: 2 no-Docker unit tests (env expansion + s3 props passthrough,
  asserting an s3:// warehouse never triggers a local mkdir); the
  requires_docker integration test now exercises the S3-backed catalog.
- docs/architecture.md: catalog description updated to MinIO-backed.
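The "s3:// warehouse never triggers a local mkdir" assertion from the tests item above could be exercised against a guard along these lines. This is a sketch under stated assumptions: `is_local_warehouse` and `ensure_warehouse_dir` are hypothetical names, the injectable `mkdir` callable exists only to make the behavior observable, and POSIX-style paths are assumed.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def is_local_warehouse(warehouse: str) -> bool:
    """True for plain paths and file:// URIs, False for s3:// and other schemes."""
    return urlparse(warehouse).scheme in ("", "file")


def _default_mkdir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)


def ensure_warehouse_dir(
    warehouse: str, mkdir: Callable[[Path], None] = _default_mkdir
) -> bool:
    """Create the warehouse directory only when it is local.

    Returns True if a directory creation was attempted, False otherwise.
    """
    if not is_local_warehouse(warehouse):
        return False  # object-store warehouses are never created on local disk
    parsed = urlparse(warehouse)
    mkdir(Path(parsed.path if parsed.scheme == "file" else warehouse))
    return True


if __name__ == "__main__":
    calls = []
    ensure_warehouse_dir("s3://agentflow-lake/warehouse", mkdir=calls.append)
    print(calls)  # stays empty: no local mkdir for an s3:// warehouse
```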

Live-validated on Mac/colima: brought the stack up, wrote two order
batches through the real IcebergSink (env-repointed to in-network
services), confirmed partitioned parquet + Iceberg metadata/manifests
landed under s3://agentflow-lake/warehouse in MinIO, row_counts 1 -> 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 181 total / 42.23 per week
  • Lead time for changes: avg 0.23h / median 0.0h
  • Change failure rate: 65.19% (118/181)
  • MTTR: 0.47h across 7 incident(s)
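
The per-week and percentage figures above are consistent with simple ratios over the 30-day window. The sketch below reproduces them; how the metrics bot actually computes its numbers is not shown in this thread, so the formulas and function names are assumptions.

```python
def deployments_per_week(total: int, window_days: int) -> float:
    """Average weekly deployment count over a window, rounded to 2 places."""
    return round(total / window_days * 7, 2)


def change_failure_rate(failed: int, total: int) -> float:
    """Share of deployments counted as failures, as a percentage."""
    return round(failed / total * 100, 2)


if __name__ == "__main__":
    print(deployments_per_week(181, 30))   # 42.23
    print(change_failure_rate(118, 181))   # 65.19
```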

@brownjuly2003-code brownjuly2003-code merged commit c50a4e3 into main Jun 27, 2026
24 of 25 checks passed
@brownjuly2003-code brownjuly2003-code deleted the feat/iceberg-minio-object-store branch June 27, 2026 06:14
brownjuly2003-code added a commit that referenced this pull request Jun 27, 2026
…nts.txt header (#95)

* fix(cache): use redis set(ex=) instead of deprecated setex

`Redis.setex` is `@deprecated_function` in redis-py 8.0.0 ("Use 'set'
instead") and emits a DeprecationWarning from the query-cache hot path on
every cached write. Switch `QueryCache.set` to `set(key, value, ex=ttl)`,
which is behaviorally identical (int seconds), and drop the now-unused
`timedelta` import.

All six in-repo Redis test doubles (unit cache/entity_cache/versioning,
integration tenant-isolation, chaos RESP client) implemented `setex`; they
move to `set(self, key, value, ex=None)` with the matching argument order,
and the chaos RESP client now issues `SET … EX` over the wire. The two
`set_calls` ttl assertions compare the integer `ex` directly.
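
A rough sketch of the shape this change describes: a cache wrapper calling `set(key, value, ex=ttl)` and an in-memory double with the matching signature. The `FakeRedis` and `QueryCache` classes below are simplified stand-ins written for illustration, not the repository's classes; only the `set(name, value, ex=...)` call shape is taken from redis-py.

```python
from typing import Dict, List, Optional, Tuple


class FakeRedis:
    """In-memory test double mirroring the set(key, value, ex=None) shape."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.set_calls: List[Tuple[str, str, Optional[int]]] = []

    def set(self, key: str, value: str, ex: Optional[int] = None) -> bool:
        self.store[key] = value
        self.set_calls.append((key, value, ex))  # record ttl as plain int seconds
        return True


class QueryCache:
    """Hypothetical minimal cache wrapper; the real class is not shown here."""

    def __init__(self, client, ttl_seconds: int = 60) -> None:
        self._client = client
        self._ttl = ttl_seconds

    def set(self, key: str, value: str) -> None:
        # Before: self._client.setex(key, self._ttl, value)
        # After: same expiry, expressed through the ex= keyword.
        self._client.set(key, value, ex=self._ttl)


if __name__ == "__main__":
    fake = FakeRedis()
    QueryCache(fake, ttl_seconds=30).set("q:1", "payload")
    print(fake.set_calls)
```

Note the argument order differs between the two calls (`setex(name, time, value)` versus `set(name, value, ex=...)`), which is why the doubles had to change along with the call site.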

Verified no-Docker: ruff + mypy clean, full unit suite 1096 passed / 1
skipped (the redis.setex DeprecationWarning is gone). The integration and
chaos doubles change symmetrically and are validated by their CI jobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: sync version story to v1.5.0 and clarify requirements.txt

The README Status section and release badge still described v1.4.0 as the
current line even though v1.5.0 is tagged and published, and the CHANGELOG
[Unreleased] section did not record the DV2 re-architecture already on main.

- README: badge v1.4 -> v1.5, "current release line" -> v1.5.0, extend the
  release arc to five increments with a v1.5.0 bullet (argon2id O(1) key
  hashing, NL->SQL guard bypass fix, strict-mypy expansion), and add a note
  that main carries post-v1.5.0 work pending the next tag.
- CHANGELOG [Unreleased]: document the DV2 raw vault migration ClickHouse ->
  PostgreSQL (#91), the PyIceberg sink backed by real MinIO (#92), the
  LISTEN/NOTIFY OLTP->vault freshness (#93), and the dependency batch (#94).
- requirements.txt: add a header explaining it is a supplemental OTel pin
  set installed on top of the pyproject package by the e2e/mutation/staging
  workflows and the security Safety scan, not the full dependency set
  (pyproject.toml is the source of truth). load_requirements() skips comment
  lines and `pip -r` ignores them, so the header is non-breaking.
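
To illustrate why a comment header is non-breaking, here is a minimal sketch of a loader that skips comment and blank lines. It is a hypothetical reimplementation, not the repository's `load_requirements()`, and the pinned versions in the example are invented for illustration.

```python
from typing import Iterable, List


def load_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement specifiers, ignoring blank lines and '#' comments."""
    requirements: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # header comments and spacing never reach the installer
        requirements.append(line)
    return requirements


if __name__ == "__main__":
    sample = [
        "# Supplemental OTel pins; pyproject.toml is the source of truth.",
        "",
        "opentelemetry-api==1.0.0",  # hypothetical pin
        "opentelemetry-sdk==1.0.0",  # hypothetical pin
    ]
    print(load_requirements(sample))
```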

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: JuliaEdom <uedomskikh@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>