feat: add load_to_lakehouse Temporal activity for Iceberg ingestion #1134

Open
AtMrun wants to merge 46 commits into main from feat/load-to-lakehouse-activity

AtMrun (Collaborator) commented Mar 19, 2026

RFC: entity_raw Namespace — Per-Application Raw Data Tables

PRs:

Author: Mrunmayi Tripathi
Date: 2026-03-23
Status: In Review


Problem

Today, the metadata lakehouse only stores transformed/enriched data in entity_metadata. The raw records from source systems are discarded after transformation. This creates four key gaps:

  1. No auditability or compliance — we cannot trace back to exactly what the source system sent. If a customer disputes a metadata value, there's no source-of-truth to verify against. For regulated environments, there may be requirements to retain source data for audit trails — without raw storage, MDLH cannot serve as the system of record for metadata provenance.

  2. No reprocessing or debugging — if transformation logic changes (bug fix, schema evolution, new fields), we cannot re-derive transformed data from source records. Re-extraction from the source is expensive and may not reproduce historical state. Investigating data quality issues (missing assets, wrong counts, stale data) also requires going back to the source system, which is slow, often requires customer credentials, and the source state may have changed since the original extraction.

  3. No diffing — we cannot compare what changed between two extraction runs at the raw level. Identifying whether a data discrepancy is caused by a source-side change vs. a transformation bug is guesswork without access to both the before and after raw records.

  4. No cross-connector visibility — there's no unified place to query raw data across connectors for features like raw-to-transformed lineage, coverage reports, or extraction health dashboards. Connector developers also have no visibility into what their extraction actually produced, independent of the transformation layer, making it hard to validate and debug connector output.

Proposal

Introduce a new Iceberg namespace entity_raw with one table per registered Application (e.g. entity_raw.snowflake, entity_raw.redshift).

Each table stores the full raw record as a JSON string (raw_record column) alongside common metadata columns (typename, connection_qualified_name, workflow_run_id, extracted_at, tenant_id). These metadata columns are intentionally aligned with entity_metadata fields so that raw and transformed data can be correlated with a simple equi-join — enabling debugging, diffing, and lineage across the two layers.

Table names are not arbitrary — they must match a registered Application entity in Atlas. MDLH proactively creates tables for all known applications at startup and every 10 minutes, and also validates on-demand creation requests against the Atlas application registry. This prevents namespace pollution while keeping the system self-service for onboarded connectors.

The feature is opt-in per connector via a single environment variable (ENABLE_LAKEHOUSE_LOAD=true). When disabled, behavior is unchanged — no raw data is written, no new API calls are made.

This is a coordinated change across three repos:

  • MDLH — creates and manages the entity_raw tables in Iceberg, validates /load requests against the Atlas application registry
  • Application SDK — new Temporal activities that convert raw parquet files into the common entity_raw schema and submit load jobs to MDLH via the /load REST API
  • Connector apps (e.g. Redshift) — register the new SDK activities in their workflow and configure env vars. No extraction logic changes needed.

Design

Schema

Namespace: entity_raw | Table name: application name (e.g. snowflake, redshift)

| Column                    | Type   | Required | Purpose                                             |
|---------------------------|--------|----------|-----------------------------------------------------|
| typename                  | string | yes      | Partition key: entity type (e.g. table, column)     |
| connection_qualified_name | string | yes      | Join key to entity_metadata.connectionqualifiedname |
| workflow_id               | string | yes      | Workflow that produced this record                  |
| workflow_run_id           | string | yes      | Join key to entity_metadata.lastsyncrun             |
| extracted_at              | long   | yes      | Epoch millis when the record was extracted          |
| tenant_id                 | string | yes      | Tenant identifier                                   |
| entity_name               | string | no       | Best-effort entity name from raw data               |
| raw_record                | string | yes      | Full raw row serialised as JSON                     |
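
As a rough illustration of the schema above, the following Python sketch wraps one extracted row into an entity_raw record and serialises it as a JSONL line. It is not the SDK's implementation: `RawRecord`, `wrap_row`, `to_jsonl_line`, and the `NAME_KEYS` lookup used for the best-effort `entity_name` are hypothetical names chosen for this example.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Optional

# Hypothetical candidate keys for the best-effort entity_name lookup.
NAME_KEYS = ("name", "table_name", "column_name")


@dataclass(frozen=True)
class RawRecord:
    """One row of an entity_raw table, mirroring the schema table."""

    typename: str
    connection_qualified_name: str
    workflow_id: str
    workflow_run_id: str
    extracted_at: int  # epoch milliseconds
    tenant_id: str
    raw_record: str  # full source row serialised as JSON
    entity_name: Optional[str] = None  # optional, best effort


def wrap_row(
    row: dict[str, Any],
    *,
    typename: str,
    connection_qualified_name: str,
    workflow_id: str,
    workflow_run_id: str,
    tenant_id: str,
    extracted_at: Optional[int] = None,
) -> RawRecord:
    """Attach the common metadata columns to a single raw row."""
    # Take the first non-null candidate key as the entity name, if any.
    entity_name = next(
        (str(row[key]) for key in NAME_KEYS if row.get(key) is not None), None
    )
    return RawRecord(
        typename=typename,
        connection_qualified_name=connection_qualified_name,
        workflow_id=workflow_id,
        workflow_run_id=workflow_run_id,
        extracted_at=extracted_at if extracted_at is not None else int(time.time() * 1000),
        tenant_id=tenant_id,
        # default=str keeps non-JSON types (dates, decimals) from raising.
        raw_record=json.dumps(row, default=str, sort_keys=True),
        entity_name=entity_name,
    )


def to_jsonl_line(record: RawRecord) -> str:
    """Serialise one record as a single JSONL line (no trailing newline)."""
    return json.dumps(asdict(record))
```

Keeping `raw_record` as an opaque JSON string means the table schema stays fixed even when source systems add or rename columns.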

Raw and transformed data can be correlated via:

SELECT r.raw_record, em.*
FROM   entity_raw.snowflake r
JOIN   entity_metadata.table em
  ON   r.connection_qualified_name = em.connectionqualifiedname
 AND   r.workflow_run_id           = em.lastsyncrun
 AND   LOWER(r.typename)           = LOWER(em.typename)

Table Lifecycle

Tables are named after registered Application entities in Atlas.

Proactive (startup + every 10 min) — MDLH queries Atlas for all ACTIVE Application entities and pre-creates a table for each. This runs during MDLH init (first install) and on every Notification Processor cycle (*/10 * * * *). Any new Application registered in Atlas gets its table within 10 minutes.

Reactive (on /load request, guarded) — if a /load request targets a non-existent table in entity_raw, MDLH checks whether the table name matches a registered Application. If yes, it auto-creates. If not, it rejects with a clear error listing valid applications. This handles race conditions where a new app sends data before the 10-minute scheduler catches up, while preventing arbitrary table names from polluting the namespace.

End-to-End Data Flow

┌──────────────┐     raw parquet        ┌──────────────┐    JSONL POST /load    ┌──────────────┐
│  Source DB   │ ─────────────────────▶ │  App (SDK)   │ ─────────────────────▶ │    MDLH      │
│  (Redshift)  │  fetch & extract       │  Temporal    │  prepare + load        │  Temporal    │
└──────────────┘                        │  Activities  │                        │  Workflows   │
                                        └──────────────┘                        └──────┬───────┘
                                                                                       │
                     ┌─────────────────────────────────────────────────────────────────┘
                     ▼
          ┌────────────────────────────────────────────────────────────┐
          │                      Iceberg Catalog                       │
          │                                                            │
          │   entity_raw/                    entity_metadata/           │
          │   ├─ redshift  (raw JSON)        ├─ database               │
          │   ├─ snowflake                   ├─ schema                 │
          │   └─ bigquery                    ├─ table                  │
          │                                  └─ column                 │
          │         ↕ JOIN on workflow_run_id + connection_qn ↕        │
          └────────────────────────────────────────────────────────────┘

MDLH table management

Atlas ──query registered apps──▶ ensureRawMetadataTables()
(Application typedef)            ├─ on startup (init workflow)
                                 └─ every 10 min (notification processor)

POST /load ──▶ Validator
               ├─ table exists? → proceed
               ├─ registered app? → auto-create, then proceed
               └─ unknown name? → reject
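
The three-way validator decision above can be sketched in Python as follows. This is only an illustration of the described behaviour; `Decision`, `validate_load_target`, and the error wording are hypothetical, and the sets stand in for the Iceberg catalog lookup and the Atlas application registry query.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    CREATE_THEN_PROCEED = "auto-create, then proceed"
    REJECT = "reject"


def validate_load_target(
    table: str,
    existing_tables: set[str],
    registered_apps: set[str],
) -> tuple[Decision, str]:
    """Decide how to handle a /load request targeting entity_raw.<table>.

    existing_tables and registered_apps are stand-ins for the catalog and
    the Atlas registry; a real service would query those systems.
    """
    if table in existing_tables:
        return Decision.PROCEED, ""
    if table in registered_apps:
        # Covers the race where an app loads before the 10-minute scheduler runs.
        return Decision.CREATE_THEN_PROCEED, ""
    valid = ", ".join(sorted(registered_apps))
    return (
        Decision.REJECT,
        f"unknown table '{table}' in entity_raw; valid applications: {valid}",
    )
```

For example, `validate_load_target("bigquery", {"redshift"}, {"redshift", "bigquery"})` returns the auto-create decision, while an unregistered name is rejected with the list of valid applications.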

Application SDK & Connector Integration

What the SDK does

The SDK adds two new Temporal activities available to all connectors:

  1. prepare_raw_for_lakehouse: reads raw parquet files produced during extraction and wraps each row into the entity_raw schema as JSONL, adding metadata columns (typename, connection_qualified_name, workflow_id, workflow_run_id, extracted_at, tenant_id) alongside the original row as a raw_record JSON string.

  2. load_to_lakehouse: submits a load job to the MDLH /load API with an S3 glob pattern, then polls the status endpoint until completion or terminal failure.
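
The polling half of load_to_lakehouse could look roughly like the sketch below. It is a minimal, synchronous illustration, not the SDK code: the status strings (`COMPLETED`, `FAILED`, `CANCELLED`) and the `poll_until_done` helper are assumptions, and the status lookup is injected as a callable so the example does not depend on the actual MDLH response format.

```python
import time
from typing import Callable

# Assumed status values; the real MDLH status endpoint may use other names.
TERMINAL_SUCCESS = {"COMPLETED"}
TERMINAL_FAILURE = {"FAILED", "CANCELLED"}


def poll_until_done(
    get_status: Callable[[], str],
    *,
    interval_seconds: float = 10,
    max_attempts: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a load job until it succeeds, fails, or the attempt budget runs out."""
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status in TERMINAL_SUCCESS:
            return status
        if status in TERMINAL_FAILURE:
            raise RuntimeError(f"load job ended with status {status}")
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"load job not finished after {max_attempts} attempts")
```

The defaults mirror LH_LOAD_POLL_INTERVAL_SECONDS and LH_LOAD_MAX_POLL_ATTEMPTS. Inside a Temporal activity, an async variant that heartbeats between polls would probably be preferable so that a stuck worker is detected.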

Workflow execution order

preflight_check → get_workflow_args
  ↓
asyncio.gather(fetch_databases, fetch_schemas, fetch_tables, fetch_columns, ...)
  ↓
prepare_raw_for_lakehouse        ← NEW: raw parquet → common-schema JSONL
load_to_lakehouse (raw)          ← NEW: JSONL → entity_raw.{app_name}
  ↓
upload_to_atlan                  (existing)
load_to_lakehouse (transformed)  ← NEW: JSONL → entity_metadata.{typename}
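
To make the ordering and the opt-in gating concrete, here is a small asyncio simulation. It uses stub coroutines rather than real Temporal activities, and `run_workflow` and `FETCH_STEPS` are hypothetical names; only the step names come from the flow above.

```python
import asyncio

FETCH_STEPS = ("fetch_databases", "fetch_schemas", "fetch_tables", "fetch_columns")


async def run_workflow(enable_lakehouse_load: bool) -> list[str]:
    """Run stub activities in the documented order and return the call log."""
    calls: list[str] = []

    async def activity(name: str) -> None:
        await asyncio.sleep(0)  # stand-in for real activity work
        calls.append(name)

    await activity("preflight_check")
    await activity("get_workflow_args")
    # Extraction steps run concurrently, as in the flow above.
    await asyncio.gather(*(activity(step) for step in FETCH_STEPS))

    if enable_lakehouse_load:
        # Raw load happens only after every fetch has finished.
        await activity("prepare_raw_for_lakehouse")
        await activity("load_to_lakehouse (raw)")

    await activity("upload_to_atlan")

    if enable_lakehouse_load:
        await activity("load_to_lakehouse (transformed)")
    return calls
```

With the flag off, the call log contains only the pre-existing steps, which matches the claim that behaviour is unchanged when the feature is disabled.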

Configuration

All lakehouse loading is controlled by environment variables. Connector apps need no code changes beyond registering the two activities and adding the connection metadata normalizer described under "What connector apps need to do":

| Variable                      | Default                                       | Purpose                           |
|-------------------------------|-----------------------------------------------|-----------------------------------|
| ENABLE_LAKEHOUSE_LOAD         | false                                         | Master switch                     |
| MDLH_BASE_URL                 | http://lakehouse.atlas.svc.cluster.local:4541 | MDLH service URL                  |
| LH_LOAD_RAW_NAMESPACE         | entity_raw                                    | Raw table namespace               |
| LH_LOAD_RAW_TABLE_NAME        | APPLICATION_NAME                              | Raw table name (e.g. redshift)    |
| LH_LOAD_RAW_MODE              | APPEND                                        | Write mode for raw data           |
| LH_LOAD_TRANSFORMED_NAMESPACE | entity_metadata                               | Transformed table namespace       |
| LH_LOAD_TRANSFORMED_MODE      | APPEND                                        | Write mode for transformed data   |
| LH_LOAD_POLL_INTERVAL_SECONDS | 10                                            | Status poll interval              |
| LH_LOAD_MAX_POLL_ATTEMPTS     | 360                                           | Max poll attempts (1 hour at 10s) |
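
A hedged sketch of how these variables and their defaults might be read in Python follows. `LakehouseLoadConfig` and `load_config` are hypothetical names, not the SDK's. The case-insensitive parsing of `true` is an assumption, as is reading the raw table name from an `APPLICATION_NAME` environment variable when `LH_LOAD_RAW_TABLE_NAME` is unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LakehouseLoadConfig:
    enabled: bool
    mdlh_base_url: str
    raw_namespace: str
    raw_table_name: str
    raw_mode: str
    transformed_namespace: str
    transformed_mode: str
    poll_interval_seconds: int
    max_poll_attempts: int

    @property
    def max_wait_seconds(self) -> int:
        """Upper bound on polling time: interval times attempts."""
        return self.poll_interval_seconds * self.max_poll_attempts


def load_config(env: Mapping[str, str] = os.environ) -> LakehouseLoadConfig:
    """Build the config from an environment mapping, applying the listed defaults."""
    return LakehouseLoadConfig(
        # Assumption: only the literal "true" (any case) enables the feature.
        enabled=env.get("ENABLE_LAKEHOUSE_LOAD", "false").strip().lower() == "true",
        mdlh_base_url=env.get(
            "MDLH_BASE_URL", "http://lakehouse.atlas.svc.cluster.local:4541"
        ),
        raw_namespace=env.get("LH_LOAD_RAW_NAMESPACE", "entity_raw"),
        # Assumption: fall back to the application's own name when unset.
        raw_table_name=env.get("LH_LOAD_RAW_TABLE_NAME") or env.get("APPLICATION_NAME", ""),
        raw_mode=env.get("LH_LOAD_RAW_MODE", "APPEND"),
        transformed_namespace=env.get("LH_LOAD_TRANSFORMED_NAMESPACE", "entity_metadata"),
        transformed_mode=env.get("LH_LOAD_TRANSFORMED_MODE", "APPEND"),
        poll_interval_seconds=int(env.get("LH_LOAD_POLL_INTERVAL_SECONDS", "10")),
        max_poll_attempts=int(env.get("LH_LOAD_MAX_POLL_ATTEMPTS", "360")),
    )
```

With the defaults, `max_wait_seconds` is 10 × 360 = 3600 seconds, which is the one-hour budget noted in the table.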

What connector apps need to do

Minimal — register the two SDK activities in the workflow's activity list and add a connection metadata normalizer to handle different connection object shapes. The Redshift app PR (#184) serves as the reference implementation.

Rollout

All changes are additive and idempotent. No migration needed.

  1. MDLH deploys first — raw tables created at init (new installs) or within 10 min (existing installs)
  2. SDK publishes — new version with lakehouse activities
  3. Connectors opt in — set ENABLE_LAKEHOUSE_LOAD=true per-connector, per-environment

Rollback

  • Set ENABLE_LAKEHOUSE_LOAD=false → stops writing immediately
  • Existing raw tables remain (no data loss), can be cleaned up later

Security

  • Table names validated against registered Application entities in Atlas — no arbitrary creation
  • Apps communicate with MDLH over internal K8s service mesh
  • MDLH /load API requires X-Atlan-Tenant-Id header

Open Questions

  1. Retention policy — should raw data have a TTL / automatic cleanup schedule?
  2. Partitioning — currently by typename; add time-based partitioning on extracted_at?
  3. Compression — should raw_record use ZSTD-compressed bytes instead of plain JSON?
  4. Backfill — backfill raw records for existing connectors, or capture going forward only?
  5. Size limits — max size for raw_record to prevent extremely large JSON blobs?

Add a new Temporal activity that calls the MDLH REST API to load
extracted data files into Iceberg lakehouse tables. Raw parquet files
are loaded after extraction completes, and transformed jsonl files are
loaded during exit activities. Both loads are gated behind
ENABLE_LAKEHOUSE_LOAD and per-table env var configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

snykgituser commented Mar 19, 2026

Snyk checks have passed. No issues have been found so far.

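| Status | Scan Engine          | Critical | High | Medium | Low | Total (0) |
|--------|----------------------|----------|------|--------|-----|-----------|
|        | Open Source Security | 0        | 0    | 0      | 0   | 0 issues  |
|        | Licenses             | 0        | 0    | 0      | 0   | 0 issues  |
|        | Code Security        | 0        | 0    | 0      | 0   | 0 issues  |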


github-actions Bot (Contributor) commented Mar 19, 2026

📜 Docstring Coverage Report

RESULT: PASSED (minimum: 30.0%, actual: 79.9%)

Detailed Coverage Report
======= Coverage for /home/runner/work/application-sdk/application-sdk/ ========
----------------------------------- Summary ------------------------------------
| Name                                                                              | Total | Miss | Cover | Cover% |
|-----------------------------------------------------------------------------------|-------|------|-------|--------|
| application_sdk/__init__.py                                                       |     1 |    0 |     1 |   100% |
| application_sdk/constants.py                                                      |     2 |    0 |     2 |   100% |
| application_sdk/version.py                                                        |     1 |    0 |     1 |   100% |
| application_sdk/worker.py                                                         |     8 |    1 |     7 |    88% |
| application_sdk/activities/__init__.py                                            |    10 |    0 |    10 |   100% |
| application_sdk/activities/lock_management.py                                     |     3 |    0 |     3 |   100% |
| application_sdk/activities/common/__init__.py                                     |     1 |    1 |     0 |     0% |
| application_sdk/activities/common/models.py                                       |     8 |    2 |     6 |    75% |
| application_sdk/activities/common/sql_utils.py                                    |     6 |    1 |     5 |    83% |
| application_sdk/activities/common/utils.py                                        |    11 |    2 |     9 |    82% |
| application_sdk/activities/metadata_extraction/__init__.py                        |     1 |    1 |     0 |     0% |
| application_sdk/activities/metadata_extraction/base.py                            |     8 |    1 |     7 |    88% |
| application_sdk/activities/metadata_extraction/incremental.py                     |    19 |    0 |    19 |   100% |
| application_sdk/activities/metadata_extraction/lakehouse.py                       |     4 |    0 |     4 |   100% |
| application_sdk/activities/metadata_extraction/rest.py                            |     1 |    1 |     0 |     0% |
| application_sdk/activities/metadata_extraction/sql.py                             |    22 |    3 |    19 |    86% |
| application_sdk/activities/query_extraction/__init__.py                           |     1 |    1 |     0 |     0% |
| application_sdk/activities/query_extraction/sql.py                                |    13 |    1 |    12 |    92% |
| application_sdk/application/__init__.py                                           |    15 |    6 |     9 |    60% |
| application_sdk/application/metadata_extraction/sql.py                            |    12 |    4 |     8 |    67% |
| application_sdk/clients/__init__.py                                               |     4 |    0 |     4 |   100% |
| application_sdk/clients/atlan.py                                                  |     5 |    3 |     2 |    40% |
| application_sdk/clients/atlan_auth.py                                             |    10 |    0 |    10 |   100% |
| application_sdk/clients/base.py                                                   |     6 |    1 |     5 |    83% |
| application_sdk/clients/mdlh.py                                                   |    11 |    1 |    10 |    91% |
| application_sdk/clients/models.py                                                 |     3 |    0 |     3 |   100% |
| application_sdk/clients/redis.py                                                  |    27 |    0 |    27 |   100% |
| application_sdk/clients/sql.py                                                    |    23 |    0 |    23 |   100% |
| application_sdk/clients/temporal.py                                               |    15 |    1 |    14 |    93% |
| application_sdk/clients/utils.py                                                  |     2 |    1 |     1 |    50% |
| application_sdk/clients/workflow.py                                               |     9 |    2 |     7 |    78% |
| application_sdk/clients/azure/__init__.py                                         |     1 |    0 |     1 |   100% |
| application_sdk/clients/azure/auth.py                                             |     7 |    0 |     7 |   100% |
| application_sdk/clients/azure/client.py                                           |    13 |    0 |    13 |   100% |
| application_sdk/common/__init__.py                                                |     1 |    1 |     0 |     0% |
| application_sdk/common/aws_utils.py                                               |    10 |    1 |     9 |    90% |
| application_sdk/common/error_codes.py                                             |    14 |    2 |    12 |    86% |
| application_sdk/common/file_converter.py                                          |     9 |    5 |     4 |    44% |
| application_sdk/common/file_ops.py                                                |    16 |    1 |    15 |    94% |
| application_sdk/common/path.py                                                    |     2 |    1 |     1 |    50% |
| application_sdk/common/types.py                                                   |     2 |    1 |     1 |    50% |
| application_sdk/common/utils.py                                                   |    17 |    2 |    15 |    88% |
| application_sdk/common/incremental/__init__.py                                    |     1 |    1 |     0 |     0% |
| application_sdk/common/incremental/helpers.py                                     |    12 |    0 |    12 |   100% |
| application_sdk/common/incremental/marker.py                                      |     5 |    0 |     5 |   100% |
| application_sdk/common/incremental/models.py                                      |    11 |    0 |    11 |   100% |
| application_sdk/common/incremental/column_extraction/__init__.py                  |     1 |    0 |     1 |   100% |
| application_sdk/common/incremental/column_extraction/analysis.py                  |     3 |    0 |     3 |   100% |
| application_sdk/common/incremental/column_extraction/backfill.py                  |     3 |    0 |     3 |   100% |
| application_sdk/common/incremental/state/__init__.py                              |     1 |    1 |     0 |     0% |
| application_sdk/common/incremental/state/ancestral_merge.py                       |     3 |    0 |     3 |   100% |
| application_sdk/common/incremental/state/incremental_diff.py                      |     4 |    0 |     4 |   100% |
| application_sdk/common/incremental/state/state_reader.py                          |     2 |    0 |     2 |   100% |
| application_sdk/common/incremental/state/state_writer.py                          |     9 |    0 |     9 |   100% |
| application_sdk/common/incremental/state/table_scope.py                           |     8 |    0 |     8 |   100% |
| application_sdk/common/incremental/storage/__init__.py                            |     1 |    1 |     0 |     0% |
| application_sdk/common/incremental/storage/duckdb_utils.py                        |    12 |    2 |    10 |    83% |
| application_sdk/common/incremental/storage/rocksdb_utils.py                       |     3 |    0 |     3 |   100% |
| application_sdk/decorators/__init__.py                                            |     1 |    1 |     0 |     0% |
| application_sdk/decorators/locks.py                                               |     3 |    2 |     1 |    33% |
| application_sdk/decorators/mcp_tool.py                                            |     3 |    1 |     2 |    67% |
| application_sdk/docgen/__init__.py                                                |     5 |    2 |     3 |    60% |
| application_sdk/docgen/exporters/__init__.py                                      |     1 |    1 |     0 |     0% |
| application_sdk/docgen/exporters/mkdocs.py                                        |     7 |    3 |     4 |    57% |
| application_sdk/docgen/models/__init__.py                                         |     1 |    1 |     0 |     0% |
| application_sdk/docgen/models/export/__init__.py                                  |     1 |    1 |     0 |     0% |
| application_sdk/docgen/models/export/page.py                                      |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/__init__.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/customer.py                                |     3 |    1 |     2 |    67% |
| application_sdk/docgen/models/manifest/internal.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/metadata.py                                |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/page.py                                    |     2 |    1 |     1 |    50% |
| application_sdk/docgen/models/manifest/section.py                                 |     2 |    1 |     1 |    50% |
| application_sdk/docgen/parsers/__init__.py                                        |     1 |    1 |     0 |     0% |
| application_sdk/docgen/parsers/directory.py                                       |    13 |    2 |    11 |    85% |
| application_sdk/docgen/parsers/manifest.py                                        |     6 |    1 |     5 |    83% |
| application_sdk/handlers/__init__.py                                              |     8 |    1 |     7 |    88% |
| application_sdk/handlers/base.py                                                  |     7 |    1 |     6 |    86% |
| application_sdk/handlers/sql.py                                                   |    19 |    6 |    13 |    68% |
| application_sdk/interceptors/__init__.py                                          |     1 |    1 |     0 |     0% |
| application_sdk/interceptors/activity_failure_logging.py                          |     8 |    0 |     8 |   100% |
| application_sdk/interceptors/cleanup.py                                           |     7 |    1 |     6 |    86% |
| application_sdk/interceptors/correlation_context.py                               |    13 |    0 |    13 |   100% |
| application_sdk/interceptors/events.py                                            |     9 |    1 |     8 |    89% |
| application_sdk/interceptors/lock.py                                              |    10 |    2 |     8 |    80% |
| application_sdk/interceptors/models.py                                            |    13 |    1 |    12 |    92% |
| application_sdk/io/__init__.py                                                    |    25 |    0 |    25 |   100% |
| application_sdk/io/json.py                                                        |    15 |    1 |    14 |    93% |
| application_sdk/io/parquet.py                                                     |    22 |    1 |    21 |    95% |
| application_sdk/io/utils.py                                                       |     8 |    1 |     7 |    88% |
| application_sdk/observability/__init__.py                                         |     1 |    1 |     0 |     0% |
| application_sdk/observability/context.py                                          |     1 |    0 |     1 |   100% |
| application_sdk/observability/logger_adaptor.py                                   |    35 |    2 |    33 |    94% |
| application_sdk/observability/metrics_adaptor.py                                  |    12 |    1 |    11 |    92% |
| application_sdk/observability/models.py                                           |     5 |    1 |     4 |    80% |
| application_sdk/observability/observability.py                                    |    25 |    1 |    24 |    96% |
| application_sdk/observability/segment_client.py                                   |    14 |    2 |    12 |    86% |
| application_sdk/observability/traces_adaptor.py                                   |    16 |    1 |    15 |    94% |
| application_sdk/observability/utils.py                                            |     4 |    1 |     3 |    75% |
| application_sdk/observability/decorators/observability_decorator.py               |     7 |    4 |     3 |    43% |
| application_sdk/server/__init__.py                                                |     4 |    0 |     4 |   100% |
| application_sdk/server/fastapi/__init__.py                                        |    26 |    5 |    21 |    81% |
| application_sdk/server/fastapi/models.py                                          |    32 |   28 |     4 |    12% |
| application_sdk/server/fastapi/utils.py                                           |     5 |    0 |     5 |   100% |
| application_sdk/server/fastapi/middleware/logmiddleware.py                        |     4 |    4 |     0 |     0% |
| application_sdk/server/fastapi/middleware/metrics.py                              |     3 |    3 |     0 |     0% |
| application_sdk/server/fastapi/routers/__init__.py                                |     1 |    1 |     0 |     0% |
| application_sdk/server/fastapi/routers/server.py                                  |     8 |    2 |     6 |    75% |
| application_sdk/server/mcp/__init__.py                                            |     1 |    1 |     0 |     0% |
| application_sdk/server/mcp/models.py                                              |     2 |    2 |     0 |     0% |
| application_sdk/server/mcp/server.py                                              |     5 |    0 |     5 |   100% |
| application_sdk/services/__init__.py                                              |     1 |    0 |     1 |   100% |
| application_sdk/services/_utils.py                                                |     2 |    1 |     1 |    50% |
| application_sdk/services/atlan_storage.py                                         |     5 |    0 |     5 |   100% |
| application_sdk/services/eventstore.py                                            |     5 |    0 |     5 |   100% |
| application_sdk/services/objectstore.py                                           |    17 |    0 |    17 |   100% |
| application_sdk/services/secretstore.py                                           |    14 |    0 |    14 |   100% |
| application_sdk/services/statestore.py                                            |     9 |    1 |     8 |    89% |
| application_sdk/test_utils/__init__.py                                            |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/workflow_monitoring.py                                 |     3 |    0 |     3 |   100% |
| application_sdk/test_utils/e2e/__init__.py                                        |    14 |    2 |    12 |    86% |
| application_sdk/test_utils/e2e/base.py                                            |    16 |    2 |    14 |    88% |
| application_sdk/test_utils/e2e/client.py                                          |    10 |    2 |     8 |    80% |
| application_sdk/test_utils/e2e/conftest.py                                        |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/e2e/utils.py                                           |     3 |    1 |     2 |    67% |
| application_sdk/test_utils/hypothesis/__init__.py                                 |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/__init__.py                      |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/sql_client.py                    |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/temporal.py                      |     6 |    1 |     5 |    83% |
| application_sdk/test_utils/hypothesis/strategies/clients/__init__.py              |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/clients/sql.py                   |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/common/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/common/logger.py                 |     3 |    0 |     3 |   100% |
| application_sdk/test_utils/hypothesis/strategies/handlers/__init__.py             |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/__init__.py         |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/sql_metadata.py     |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/handlers/sql/sql_preflight.py    |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/inputs/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/inputs/json_input.py             |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/inputs/parquet_input.py          |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/outputs/__init__.py              |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/outputs/json_output.py           |     2 |    1 |     1 |    50% |
| application_sdk/test_utils/hypothesis/strategies/outputs/statestore.py            |     3 |    1 |     2 |    67% |
| application_sdk/test_utils/hypothesis/strategies/server/__init__.py               |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/hypothesis/strategies/server/fastapi/__init__.py       |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/__init__.py                       |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/config_loader.py                  |    10 |    4 |     6 |    60% |
| application_sdk/test_utils/scale_data_generator/data_generator.py                 |    10 |    3 |     7 |    70% |
| application_sdk/test_utils/scale_data_generator/driver.py                         |     3 |    3 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/__init__.py        |     1 |    1 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/base.py            |     7 |    3 |     4 |    57% |
| application_sdk/test_utils/scale_data_generator/output_handler/csv_handler.py     |     5 |    5 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/json_handler.py    |     5 |    5 |     0 |     0% |
| application_sdk/test_utils/scale_data_generator/output_handler/parquet_handler.py |     6 |    6 |     0 |     0% |
| application_sdk/transformers/__init__.py                                          |     3 |    1 |     2 |    67% |
| application_sdk/transformers/atlas/__init__.py                                    |     6 |    1 |     5 |    83% |
| application_sdk/transformers/atlas/sql.py                                         |    25 |    4 |    21 |    84% |
| application_sdk/transformers/common/__init__.py                                   |     1 |    1 |     0 |     0% |
| application_sdk/transformers/common/utils.py                                      |     6 |    0 |     6 |   100% |
| application_sdk/transformers/query/__init__.py                                    |    11 |    2 |     9 |    82% |
| application_sdk/workflows/__init__.py                                             |     4 |    0 |     4 |   100% |
| application_sdk/workflows/metadata_extraction/__init__.py                         |     3 |    1 |     2 |    67% |
| application_sdk/workflows/metadata_extraction/incremental_sql.py                  |     5 |    0 |     5 |   100% |
| application_sdk/workflows/metadata_extraction/lakehouse.py                        |     4 |    0 |     4 |   100% |
| application_sdk/workflows/metadata_extraction/sql.py                              |     7 |    0 |     7 |   100% |
| application_sdk/workflows/query_extraction/__init__.py                            |     2 |    2 |     0 |     0% |
| application_sdk/workflows/query_extraction/sql.py                                 |     4 |    0 |     4 |   100% |
| examples/application_custom_fastapi.py                                            |    14 |   14 |     0 |     0% |
| examples/application_fastapi.py                                                   |     9 |    9 |     0 |     0% |
| examples/application_hello_world.py                                               |     7 |    7 |     0 |     0% |
| examples/application_sql.py                                                       |     5 |    4 |     1 |    20% |
| examples/application_sql_miner.py                                                 |     5 |    4 |     1 |    20% |
| examples/application_sql_with_custom_pyatlan_transformer.py                       |    11 |    9 |     2 |    18% |
| examples/application_sql_with_custom_transformer.py                               |     9 |    8 |     1 |    11% |
| examples/application_sql_with_lakehouse_load.py                                   |     5 |    3 |     2 |    40% |
| examples/run_examples.py                                                          |     2 |    1 |     1 |    50% |
| tests/__init__.py                                                                 |     1 |    1 |     0 |     0% |
| tests/conftest.py                                                                 |     4 |    0 |     4 |   100% |
| tests/unit/__init__.py                                                            |     1 |    1 |     0 |     0% |
| tests/unit/test_worker.py                                                         |    28 |    8 |    20 |    71% |
| tests/unit/activities/__init__.py                                                 |     1 |    1 |     0 |     0% |
| tests/unit/activities/test_activities.py                                          |    41 |    3 |    38 |    93% |
| tests/unit/activities/test_base_metadata_extraction_activities.py                 |     7 |    0 |     7 |   100% |
| tests/unit/activities/test_connection_normalization.py                            |    25 |    7 |    18 |    72% |
| tests/unit/activities/test_load_to_lakehouse.py                                   |    38 |   19 |    19 |    50% |
| tests/unit/activities/test_lock_management.py                                     |    12 |    0 |    12 |   100% |
| tests/unit/activities/common/__init__.py                                          |     1 |    1 |     0 |     0% |
| tests/unit/activities/common/test_sql_utils.py                                    |     4 |    1 |     3 |    75% |
| tests/unit/activities/common/test_utils.py                                        |    39 |   13 |    26 |    67% |
| tests/unit/activities/metadata_extraction/__init__.py                             |     1 |    1 |     0 |     0% |
| tests/unit/activities/metadata_extraction/test_credential_loading.py              |    14 |    4 |    10 |    71% |
| tests/unit/activities/metadata_extraction/test_sql.py                             |    56 |   38 |    18 |    32% |
| tests/unit/activities/query_extraction/__init__.py                                |     1 |    1 |     0 |     0% |
| tests/unit/application/__init__.py                                                |     1 |    1 |     0 |     0% |
| tests/unit/application/test_application.py                                        |    44 |    3 |    41 |    93% |
| tests/unit/application/test_manifest.py                                           |    15 |    3 |    12 |    80% |
| tests/unit/application/metadata_extraction/test_sql.py                            |    36 |    6 |    30 |    83% |
| tests/unit/clients/__init__.py                                                    |     1 |    1 |     0 |     0% |
| tests/unit/clients/test_async_sql_client.py                                       |    15 |   14 |     1 |     7% |
| tests/unit/clients/test_atlan_auth.py                                             |    11 |    0 |    11 |   100% |
| tests/unit/clients/test_atlan_client.py                                           |     7 |    7 |     0 |     0% |
| tests/unit/clients/test_atlanauth.py                                              |    11 |    1 |    10 |    91% |
| tests/unit/clients/test_azure_auth.py                                             |    14 |    0 |    14 |   100% |
| tests/unit/clients/test_azure_client.py                                           |    19 |    0 |    19 |   100% |
| tests/unit/clients/test_base_client.py                                            |    23 |    1 |    22 |    96% |
| tests/unit/clients/test_redis_client.py                                           |    40 |    0 |    40 |   100% |
| tests/unit/clients/test_sql_client.py                                             |    28 |    6 |    22 |    79% |
| tests/unit/clients/test_temporal_client.py                                        |    24 |    4 |    20 |    83% |
| tests/unit/common/test_aws_utils.py                                               |    30 |    1 |    29 |    97% |
| tests/unit/common/test_column_extraction.py                                       |    10 |    0 |    10 |   100% |
| tests/unit/common/test_credential_utils.py                                        |    30 |    1 |    29 |    97% |
| tests/unit/common/test_file_converter.py                                          |    29 |    0 |    29 |   100% |
| tests/unit/common/test_file_ops.py                                                |    21 |    0 |    21 |   100% |
| tests/unit/common/test_path.py                                                    |     6 |    0 |     6 |   100% |
| tests/unit/common/test_utils.py                                                   |    74 |    6 |    68 |    92% |
| tests/unit/common/test_utils_file_discovery.py                                    |    11 |    0 |    11 |   100% |
| tests/unit/common/incremental/__init__.py                                         |     1 |    1 |     0 |     0% |
| tests/unit/common/incremental/test_helpers.py                                     |    39 |    1 |    38 |    97% |
| tests/unit/common/incremental/test_marker.py                                      |    18 |    2 |    16 |    89% |
| tests/unit/common/incremental/test_models.py                                      |    15 |    0 |    15 |   100% |
| tests/unit/common/incremental/test_state_reader.py                                |     8 |    2 |     6 |    75% |
| tests/unit/common/incremental/test_state_writer.py                                |    22 |    1 |    21 |    95% |
| tests/unit/decorators/__init__.py                                                 |     1 |    1 |     0 |     0% |
| tests/unit/decorators/test_mcp_tool.py                                            |    56 |    4 |    52 |    93% |
| tests/unit/docgen/parsers/test_directory_parser.py                                |    14 |    3 |    11 |    79% |
| tests/unit/docgen/parsers/test_manifest_parser.py                                 |    12 |   12 |     0 |     0% |
| tests/unit/handlers/__init__.py                                                   |     1 |    1 |     0 |     0% |
| tests/unit/handlers/test_base_handler.py                                          |    26 |    2 |    24 |    92% |
| tests/unit/handlers/test_handler_configmap.py                                     |    11 |    0 |    11 |   100% |
| tests/unit/handlers/sql/test_auth.py                                              |    10 |    4 |     6 |    60% |
| tests/unit/handlers/sql/test_check_schemas_and_databases.py                       |    14 |    4 |    10 |    71% |
| tests/unit/handlers/sql/test_extract_allowed_schemas.py                           |    11 |    3 |     8 |    73% |
| tests/unit/handlers/sql/test_metadata.py                                          |    27 |   10 |    17 |    63% |
| tests/unit/handlers/sql/test_preflight_check.py                                   |    16 |   15 |     1 |     6% |
| tests/unit/handlers/sql/test_prepare_metadata.py                                  |    14 |    4 |    10 |    71% |
| tests/unit/handlers/sql/test_tables_check.py                                      |     9 |    6 |     3 |    33% |
| tests/unit/handlers/sql/test_validate_filters.py                                  |    12 |    4 |     8 |    67% |
| tests/unit/interceptors/__init__.py                                               |     1 |    1 |     0 |     0% |
| tests/unit/interceptors/test_activity_failure_logging.py                          |    27 |    1 |    26 |    96% |
| tests/unit/interceptors/test_correlation_context.py                               |    44 |    0 |    44 |   100% |
| tests/unit/io/test_base_io.py                                                     |    28 |    3 |    25 |    89% |
| tests/unit/io/test_writer_data_integrity.py                                       |    12 |    5 |     7 |    58% |
| tests/unit/io/readers/test_json_reader.py                                         |    38 |   19 |    19 |    50% |
| tests/unit/io/readers/test_parquet_reader.py                                      |    60 |   38 |    22 |    37% |
| tests/unit/io/writers/test_json_writer.py                                         |     7 |    6 |     1 |    14% |
| tests/unit/io/writers/test_parquet_writer.py                                      |    57 |   10 |    47 |    82% |
| tests/unit/observability/__init__.py                                              |     1 |    1 |     0 |     0% |
| tests/unit/observability/test_logger_adaptor.py                                   |    54 |    4 |    50 |    93% |
| tests/unit/observability/test_metrics_adaptor.py                                  |    17 |    1 |    16 |    94% |
| tests/unit/observability/test_traces_adaptor.py                                   |    10 |    1 |     9 |    90% |
| tests/unit/server/__init__.py                                                     |     1 |    1 |     0 |     0% |
| tests/unit/server/fastapi/test_fastapi.py                                         |    77 |   27 |    50 |    65% |
| tests/unit/server/fastapi/test_fastapi_utils.py                                   |    34 |    0 |    34 |   100% |
| tests/unit/server/fastapi/test_manifest_and_configmaps.py                         |    17 |    7 |    10 |    59% |
| tests/unit/server/fastapi/routers/__init__.py                                     |     1 |    1 |     0 |     0% |
| tests/unit/server/fastapi/routers/server.py                                       |     1 |    1 |     0 |     0% |
| tests/unit/server/mcp/__init__.py                                                 |     1 |    1 |     0 |     0% |
| tests/unit/server/mcp/test_mcp_server.py                                          |    24 |    1 |    23 |    96% |
| tests/unit/services/test_atlan_storage.py                                         |    10 |    0 |    10 |   100% |
| tests/unit/services/test_eventstore.py                                            |    18 |    0 |    18 |   100% |
| tests/unit/services/test_objectstore.py                                           |    47 |    5 |    42 |    89% |
| tests/unit/services/test_statestore.py                                            |    14 |    0 |    14 |   100% |
| tests/unit/services/test_statestore_path_traversal.py                             |    23 |   17 |     6 |    26% |
| tests/unit/transformers/__init__.py                                               |     1 |    1 |     0 |     0% |
| tests/unit/transformers/atlas/__init__.py                                         |     1 |    1 |     0 |     0% |
| tests/unit/transformers/atlas/test_column.py                                      |    17 |    6 |    11 |    65% |
| tests/unit/transformers/atlas/test_database.py                                    |     8 |    6 |     2 |    25% |
| tests/unit/transformers/atlas/test_function.py                                    |     9 |    5 |     4 |    44% |
| tests/unit/transformers/atlas/test_procedure.py                                   |     7 |    6 |     1 |    14% |
| tests/unit/transformers/atlas/test_schema.py                                      |     8 |    6 |     2 |    25% |
| tests/unit/transformers/atlas/test_table.py                                       |    13 |    6 |     7 |    54% |
| tests/unit/transformers/query/test_sql_transformer.py                             |    16 |    4 |    12 |    75% |
| tests/unit/transformers/query/test_sql_transformer_output_validation.py           |     5 |    2 |     3 |    60% |
| tests/unit/workflows/metadata_extraction/test_base_workflow.py                    |    12 |    0 |    12 |   100% |
| tests/unit/workflows/metadata_extraction/test_sql_output_paths.py                 |    10 |    0 |    10 |   100% |
| tests/unit/workflows/metadata_extraction/test_sql_workflow.py                     |     9 |    4 |     5 |    56% |
| tests/unit/workflows/query_extraction/__init__.py                                 |     1 |    1 |     0 |     0% |
| tests/unit/workflows/query_extraction/test_sql.py                                 |     8 |    3 |     5 |    62% |
|-----------------------------------------------------------------------------------|-------|------|-------|--------|
| TOTAL                                                                             |  3053 |  721 |  2332 |  76.4% |
---------------- RESULT: PASSED (minimum: 30.0%, actual: 76.4%) ----------------

@github-actions
Contributor

github-actions Bot commented Mar 19, 2026

📦 Trivy Vulnerability Scan Results

| Schema Version | Created At | Artifact | Type |
|----------------|------------|----------|------|
| 2 | 2026-03-25T06:15:47.453868898Z | . | repository |

Report Summary

Could not generate summary table (data length mismatch: 9 vs 8).

Scan Result Details

requirements.txt
uv.lock

@github-actions
Contributor

github-actions Bot commented Mar 19, 2026

📦 Trivy Secret Scan Results

| Schema Version | Created At | Artifact | Type |
|----------------|------------|----------|------|
| 2 | 2026-03-25T06:15:55.921422104Z | . | repository |

Report Summary

Could not generate summary table (data length mismatch: 9 vs 8).

Scan Result Details

requirements.txt
uv.lock

@github-actions
Contributor

github-actions Bot commented Mar 19, 2026

@atlan-ci
Collaborator

atlan-ci commented Mar 19, 2026

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 9296 | 6431 | 69% | 0% | 🟢 |

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: b59f5ef by action🐍

@github-actions
Contributor

github-actions Bot commented Mar 19, 2026

🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/1134

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AtMrun and others added 16 commits March 19, 2026 16:04
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…s handling, and session-per-poll

- Add 30s HTTP timeout to all aiohttp sessions to prevent indefinite blocking
- Fail fast on non-retryable poll status codes (4xx) instead of burning all attempts
- Create fresh aiohttp session per poll iteration to avoid stale connections
- Rename _do_lakehouse_load → do_lakehouse_load (public cross-module API)
- Add correlation headers (X-Atlan-Tenant-Id, X-Lakehouse-Job-Id) for debugging
- Add cross-field validation on LhLoadRequest (require file_keys or patterns)
- Catch asyncio.TimeoutError alongside aiohttp.ClientError during polling
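The polling behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration and not the SDK's implementation: `PollError`, `poll_job`, and the injected `fetch_status` callable are hypothetical names, the `FAILED` terminal state is an assumption (only `COMPLETED` appears in this PR), and `ConnectionError` stands in for `aiohttp.ClientError` so the sketch stays stdlib-only.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

# Hypothetical names: PollError, poll_job and fetch_status are illustrative only.
class PollError(RuntimeError):
    """Raised when polling cannot succeed."""

# fetch_status performs one status request and returns (http_status, job_state).
FetchStatus = Callable[[], Awaitable[Tuple[int, str]]]

async def poll_job(
    fetch_status: FetchStatus,
    max_attempts: int = 5,
    interval_s: float = 5.0,
    timeout_s: float = 30.0,
) -> str:
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        try:
            # Bound every request so a hung connection cannot block forever.
            code, state = await asyncio.wait_for(fetch_status(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Transient failure: remember it and try again.
            last_error = exc
        else:
            if 400 <= code < 500:
                # 4xx will not fix itself; fail fast instead of burning attempts.
                raise PollError(f"non-retryable status {code}")
            if code < 400 and state in ("COMPLETED", "FAILED"):
                return state
            # 5xx or a non-terminal state: fall through and poll again.
        await asyncio.sleep(interval_s)
    raise PollError(f"gave up after {max_attempts} attempts") from last_error
```

Injecting `fetch_status` keeps the retry policy separate from the HTTP client, so the choice between a fresh session per poll and a reused session can be made inside that callable.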
Transformed data loading:
- Load transformed data into per-entity-type Iceberg tables in
  entity_metadata (e.g. entity_metadata.database, entity_metadata.table)
  instead of a single hardcoded table
- TYPENAME_TO_ICEBERG_TABLE maps SDK typenames to MDLH table names
- fetch_and_transform now returns typename for downstream routing
- Remove LH_LOAD_TRANSFORMED_TABLE_NAME (derived from typename)

Raw data loading:
- New prepare_raw_for_lakehouse activity converts raw parquet to JSONL
  with common metadata columns (typename, connection_qualified_name,
  workflow_id, workflow_run_id, extracted_at, tenant_id, entity_name,
  raw_record as JSON string)
- Per-connector raw table: LH_LOAD_RAW_TABLE_NAME defaults to
  APPLICATION_NAME (e.g. raw_metadata.redshift)
- Enables join between raw and transformed data via shared fields
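As a rough illustration of the common-schema row described above, the sketch below wraps one source record with the listed metadata columns and serializes the original record into `raw_record`. The function name `to_raw_row`, the `name_field` parameter, the way `entity_name` is derived, and the ISO-8601 timestamp format are all assumptions made for the example; the PR text only names the columns.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict

def to_raw_row(
    record: Dict[str, Any],
    *,
    typename: str,
    connection_qualified_name: str,
    workflow_id: str,
    workflow_run_id: str,
    tenant_id: str,
    name_field: str = "name",  # assumption: which source field holds the entity name
) -> Dict[str, Any]:
    """Wrap one source record in the common raw-table columns (hypothetical helper)."""
    return {
        "typename": typename,
        "connection_qualified_name": connection_qualified_name,
        "workflow_id": workflow_id,
        "workflow_run_id": workflow_run_id,
        # Assumed format: UTC timestamp in ISO-8601.
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "entity_name": str(record.get(name_field, "")),
        # The untouched source record, kept as a JSON string; default=str
        # covers values such as datetimes that json cannot encode natively.
        "raw_record": json.dumps(record, sort_keys=True, default=str),
    }

def to_jsonl_line(row: Dict[str, Any]) -> str:
    """One JSONL line per row."""
    return json.dumps(row)
```

Because the raw row carries the same `typename` and `connection_qualified_name` as the transformed record, the two can be joined on those shared fields.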

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move from standalone @activity.defn function to a method on
BaseMetadataExtractionActivities, so connector apps don't need
to import and register it separately — it's available as
activities.prepare_raw_for_lakehouse just like load_to_lakehouse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hardcoded map

Replace TYPENAME_TO_ICEBERG_TABLE dict with _resolve_iceberg_table()
that defaults to typename.lower() — matching MDLH's naming convention
(lowercase of Atlas typedef). This works for all connectors (SQL, Looker,
Snowflake, etc.) without needing a per-connector mapping.

Only "extras-procedure" → "procedure" is kept as an override for the
SDK-specific naming quirk.
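The rule in this commit message is small enough to sketch directly. The version below is an illustration of the described behaviour rather than the code in the PR; in particular, applying the override after lowercasing is an assumption.

```python
# Only the override named in the commit message; everything else is lowercased.
_TABLE_OVERRIDES = {"extras-procedure": "procedure"}

def resolve_iceberg_table(typename: str) -> str:
    """Map an SDK typename to an Iceberg table name (illustrative sketch)."""
    key = typename.lower()
    return _TABLE_OVERRIDES.get(key, key)
```

With this shape, a new connector needs no mapping entry unless its typename diverges from the lowercase convention.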

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match connector-framework convention:
http://lakehouse.atlas.svc.cluster.local:4541

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e URL

- Add prepare_raw_for_lakehouse to BaseSQLMetadataExtractionActivities
  (separate class hierarchy from BaseMetadataExtractionActivities)
- Fix test_sql_workflow: assert 11 activities, include prepare_raw_for_lakehouse
- Fix example: use correct default MDLH URL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all lakehouse loading logic into MetadataExtractionWorkflow:
- load_raw_to_lakehouse(): prepare + load raw data (was inline in sql.py)
- load_transformed_to_lakehouse(): per-typename load (was _load_transformed_to_lakehouse)
- _submit_lakehouse_load(): private helper (was _execute_lakehouse_load)

sql.py run() is now a one-liner: await self.load_raw_to_lakehouse(...)
All env var checks, config building, and MDLH interaction live in the
base workflow — subclasses just call the public methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The retry policy is shared across upload_to_atlan and lakehouse
activities — it's not lakehouse-specific.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AtMrun and others added 17 commits March 21, 2026 11:32
- do_lakehouse_load -> submit_and_poll_mdlh_load
- _do_prepare_raw_for_lakehouse -> convert_raw_parquet_to_jsonl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New files:
- activities/metadata_extraction/lakehouse.py
  All lakehouse implementation: submit_and_poll_mdlh_load,
  convert_raw_parquet_to_jsonl

- workflows/metadata_extraction/lakehouse.py
  LakehouseLoadMixin with load_raw_to_lakehouse,
  load_transformed_to_lakehouse, _submit_lakehouse_load,
  resolve_iceberg_table

Existing files now only contain thin delegation:
- base.py: activity methods delegate to lakehouse.py functions
- sql.py: same delegation for SQL activity class
- __init__.py: MetadataExtractionWorkflow inherits LakehouseLoadMixin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…AD is true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove raw_lakehouse_config dict — the activity reads workflow_id,
workflow_run_id, output_path, connection_qualified_name directly
from workflow_args. Only typenames is passed via _extracted_typenames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
os.listdir only works locally, not on S3 via Dapr. typenames are
always provided by _extracted_typenames from fetch_and_transform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e_load

The load_to_lakehouse activity only reads lh_load_config — no need
to pass the entire workflow_args through Temporal serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep run_exit_activities and workflow_success outside the guard
so they always run regardless of lakehouse config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The asyncio.gather call was inside the ENABLE_LAKEHOUSE_LOAD block,
causing extraction to silently skip when lakehouse loading is disabled.
Fix all @patch targets from ...base.X to ...lakehouse.X so mocks
actually intercept the right module references. Add tests for
convert_raw_parquet_to_jsonl, resolve_iceberg_table, and
load_raw_to_lakehouse that were previously uncovered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AtMrun added 5 commits March 23, 2026 13:20
MDLH LhLoadActivityImpl.load() calls request.getFileKeys().size()
without a null check. When the SDK sends only patterns (no fileKeys),
exclude_none=True omits the field entirely, causing MDLH to
deserialize it as null and NPE on the log line.

Send file_keys=[] explicitly so the serialized payload always
includes "fileKeys": [] — works around the MDLH bug.
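The difference between an omitted field and an empty list can be shown with a stdlib-only stand-in for the `exclude_none=True` serialization. `dump_excluding_none` is a hypothetical helper that mimics that option, and the glob pattern is made up for the example.

```python
import json
from typing import Any, Dict

def dump_excluding_none(payload: Dict[str, Any]) -> str:
    """Serialize a dict, dropping keys whose value is None (mimics exclude_none=True)."""
    return json.dumps({k: v for k, v in payload.items() if v is not None})

# fileKeys left as None: the key disappears from the payload, so the receiver
# deserializes it as null.
without_keys = dump_excluding_none({"fileKeys": None, "patterns": ["raw/*.jsonl"]})

# fileKeys set to []: an empty list is not None, so the key survives.
with_empty = dump_excluding_none({"fileKeys": [], "patterns": ["raw/*.jsonl"]})
```

The workaround relies only on the fact that `[]` is not `None`, so the field is always present in the request body regardless of how the receiving side handles missing values.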
prepare_raw_for_lakehouse writes JSONL files to local disk but never
uploads them to S3. MDLH resolves the glob pattern against S3 and
finds 0 files, resulting in 0 rows loaded into the Iceberg table.

Add ObjectStore.upload_prefix call after JSONL generation so files
are available in S3 when MDLH processes the load request.
@AtMrun
Collaborator Author

AtMrun commented Mar 24, 2026

Code Review

This PR adds lakehouse loading capabilities to the metadata extraction workflow. It introduces two new Temporal activities (load_to_lakehouse and prepare_raw_for_lakehouse) that convert raw parquet files to common-schema JSONL and submit load jobs to the MDLH REST API. The implementation includes a pre-flight health check against MDLH's actuator endpoint to gracefully skip loading on tenants without lakehouse deployed, Pydantic request/response models for the MDLH API contract, and comprehensive polling with error handling. The change spans 13 files across models, activities, workflows, constants, tests, and an example.

Confidence Score: 3/5

  • Well-structured implementation with clean separation (models, activities, workflow mixin, constants), good test coverage for the new code, and proper error handling with SDK error codes
  • Checked for: bugs, security (no secrets in logs, no injection vectors), CLAUDE.md compliance, performance patterns, Temporal workflow determinism, exception handling
  • Points deducted for: redundant health check on every load call (N+1 pattern for transformed loads), new aiohttp session per poll iteration instead of reusing, file handle not using SafeFileOps for writes, and missing documentation update per documentation.mdc mapping

Important Files Changed

| File | Change | Risk |
|------|--------|------|
| application_sdk/activities/metadata_extraction/lakehouse.py | Added (new) | High |
| application_sdk/workflows/metadata_extraction/lakehouse.py | Added (new) | Medium |
| application_sdk/activities/common/models.py | Modified | Low |
| application_sdk/activities/metadata_extraction/base.py | Modified | Medium |
| application_sdk/activities/metadata_extraction/sql.py | Modified | Low |
| application_sdk/workflows/metadata_extraction/__init__.py | Modified | Medium |
| application_sdk/workflows/metadata_extraction/sql.py | Modified | Medium |
| application_sdk/common/error_codes.py | Modified | Low |
| application_sdk/constants.py | Modified | Low |
| tests/unit/activities/test_load_to_lakehouse.py | Added (new) | Low |
| tests/unit/workflows/metadata_extraction/test_base_workflow.py | Modified | Low |
| tests/unit/workflows/metadata_extraction/test_sql_workflow.py | Modified | Low |
| examples/application_sql_with_lakehouse_load.py | Added (new) | Low |

Change Flow

sequenceDiagram
    participant WF as SQL Workflow
    participant Mixin as LakehouseLoadMixin
    participant PrepAct as prepare_raw_for_lakehouse
    participant LoadAct as load_to_lakehouse
    participant HealthChk as check_lakehouse_enabled
    participant MDLH as MDLH REST API

    WF->>Mixin: load_raw_to_lakehouse()
    Mixin->>PrepAct: convert parquet -> JSONL
    PrepAct-->>Mixin: raw_lakehouse dir
    Mixin->>LoadAct: submit load (raw)
    LoadAct->>HealthChk: GET /actuator/health
    HealthChk-->>LoadAct: healthy?
    LoadAct->>MDLH: POST /load (JSONL pattern)
    MDLH-->>LoadAct: 202 + jobId
    LoadAct->>MDLH: GET /load/{jobId}/status (poll)
    MDLH-->>LoadAct: COMPLETED

    WF->>Mixin: load_transformed_to_lakehouse()
    loop For each typename
        Mixin->>LoadAct: submit load (transformed)
        LoadAct->>HealthChk: GET /actuator/health
        LoadAct->>MDLH: POST /load + poll
    end

Findings

| # | Severity | File | Issue |
|---|----------|------|-------|
| 1 | Warning | application_sdk/activities/metadata_extraction/lakehouse.py:116 | Health check runs on every submit_and_poll_mdlh_load call. For transformed loads, this means N health checks (one per typename) within the same workflow run. The health check should run once and cache the result, or be called at the workflow mixin level before the typename loop. |
| 2 | Warning | application_sdk/activities/metadata_extraction/lakehouse.py:157 | A new aiohttp.ClientSession is created for each poll iteration. Per performance.mdc rule "Database connections reused (connection pooling)" and aiohttp best practices, the session should be created once and reused across poll iterations to avoid TCP connection overhead on every poll. |
| 3 | Warning | application_sdk/activities/metadata_extraction/lakehouse.py:269 | Raw file writes use open(out_file, "wb") directly instead of SafeFileOps which is already imported. The rest of the codebase uses SafeFileOps for file operations to ensure consistent path safety. |
| 4 | Info | application_sdk/activities/metadata_extraction/lakehouse.py | Per documentation.mdc, changes to application_sdk/activities/** should update docs/concepts/activities.md, and changes to application_sdk/workflows/** should update docs/concepts/workflows.md. Neither doc was updated. |
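One way to address finding 1 is to hoist the health check out of the per-typename loop. The sketch below shows that shape under stated assumptions: `load_all`, `check_health`, and `submit_load` are hypothetical names with injected callables, not the mixin's actual methods.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List

async def load_all(
    typenames: Iterable[str],
    check_health: Callable[[], Awaitable[bool]],
    submit_load: Callable[[str], Awaitable[None]],
) -> List[str]:
    """Check lakehouse health once, then submit one load per typename."""
    # Single pre-flight check for the whole run instead of one per load.
    if not await check_health():
        return []  # lakehouse not deployed or unhealthy: skip every load
    loaded: List[str] = []
    for typename in typenames:
        await submit_load(typename)
        loaded.append(typename)
    return loaded
```

The trade-off is that a lakehouse outage that begins mid-loop is no longer caught by the pre-flight check and surfaces through the load call itself.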

AtMrun added 6 commits March 24, 2026 20:53
… remove transformed loads

- Rename convert_raw_parquet_to_jsonl → convert_raw_parquet_to_parquet
- Replace Daft + orjson row loop with DuckDB to_json() for ~50x faster
  raw_record serialization (C++ vectorized, zero Python object creation)
- Fix DuckDB SQL injection: escape column names in struct literals
- Fix DuckDB connection leak: wrap in try/finally
- Skip upload_prefix when no parquet files were produced
- Remove load_transformed_to_lakehouse, resolve_iceberg_table, and
  LH_LOAD_TRANSFORMED_* constants — only raw lakehouse load for now
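The column-name escaping mentioned above might look something like the following. This is a guess at the approach, not the code from the commit: `quote_ident`, `quote_literal`, and `raw_record_expr` are hypothetical helpers that only build the SQL text, using the standard SQL conventions of doubling `"` inside quoted identifiers and `'` inside string literals.

```python
from typing import Iterable

def quote_ident(name: str) -> str:
    """Quote a column reference: wrap in double quotes, doubling embedded ones."""
    return '"' + name.replace('"', '""') + '"'

def quote_literal(value: str) -> str:
    """Quote a string literal: wrap in single quotes, doubling embedded ones."""
    return "'" + value.replace("'", "''") + "'"

def raw_record_expr(columns: Iterable[str]) -> str:
    """Build a to_json({...}) expression over a struct literal of the columns.

    Each struct key is a string literal and each value a column reference,
    so a column name containing a quote cannot terminate its literal early.
    """
    fields = ", ".join(f"{quote_literal(c)}: {quote_ident(c)}" for c in columns)
    return f"to_json({{{fields}}})"
```

For example, `raw_record_expr(["id", "na'me"])` yields `to_json({'id': "id", 'na''me': "na'me"})`. Callers would still need to handle the case of an empty column list separately.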
4 participants