feat: add load_to_lakehouse Temporal activity for Iceberg ingestion#1134
Add a new Temporal activity that calls the MDLH REST API to load extracted data files into Iceberg lakehouse tables. Raw parquet files are loaded after extraction completes, and transformed jsonl files are loaded during exit activities. Both loads are gated behind ENABLE_LAKEHOUSE_LOAD and per-table env var configuration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s handling, and session-per-poll

- Add 30s HTTP timeout to all aiohttp sessions to prevent indefinite blocking
- Fail fast on non-retryable poll status codes (4xx) instead of burning all attempts
- Create a fresh aiohttp session per poll iteration to avoid stale connections
- Rename _do_lakehouse_load -> do_lakehouse_load (public cross-module API)
- Add correlation headers (X-Atlan-Tenant-Id, X-Lakehouse-Job-Id) for debugging
- Add cross-field validation on LhLoadRequest (require file_keys or patterns)
- Catch asyncio.TimeoutError alongside aiohttp.ClientError during polling
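The fail-fast rule described in this commit (client errors are terminal, server errors and timeouts are retryable) can be sketched as a small classification helper. The function name and return values here are illustrative, not the SDK's actual API:

```python
def classify_poll_response(status_code: int) -> str:
    """Decide how a poll loop should react to an HTTP status code.

    Hypothetical helper: 4xx responses will not heal on retry, so the
    loop aborts immediately instead of burning all poll attempts, while
    5xx responses are treated as transient and retried.
    """
    if 200 <= status_code < 300:
        return "ok"          # inspect the job status in the response body
    if 400 <= status_code < 500:
        return "fail_fast"   # non-retryable: bad request, auth, missing job
    return "retry"           # 5xx or anything else: transient, try again
```

A timeout (asyncio.TimeoutError) would map to the same "retry" outcome as a 5xx, which is why the commit catches it alongside aiohttp.ClientError.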
Transformed data loading:

- Load transformed data into per-entity-type Iceberg tables in entity_metadata (e.g. entity_metadata.database, entity_metadata.table) instead of a single hardcoded table
- TYPENAME_TO_ICEBERG_TABLE maps SDK typenames to MDLH table names
- fetch_and_transform now returns typename for downstream routing
- Remove LH_LOAD_TRANSFORMED_TABLE_NAME (derived from typename)

Raw data loading:

- New prepare_raw_for_lakehouse activity converts raw parquet to JSONL with common metadata columns (typename, connection_qualified_name, workflow_id, workflow_run_id, extracted_at, tenant_id, entity_name, raw_record as JSON string)
- Per-connector raw table: LH_LOAD_RAW_TABLE_NAME defaults to APPLICATION_NAME (e.g. raw_metadata.redshift)
- Enables joins between raw and transformed data via shared fields
Move from a standalone @activity.defn function to a method on BaseMetadataExtractionActivities, so connector apps don't need to import and register it separately — it's available as activities.prepare_raw_for_lakehouse just like load_to_lakehouse.
…hardcoded map

Replace the TYPENAME_TO_ICEBERG_TABLE dict with _resolve_iceberg_table(), which defaults to typename.lower() — matching MDLH's naming convention (lowercase of the Atlas typedef). This works for all connectors (SQL, Looker, Snowflake, etc.) without needing a per-connector mapping. Only "extras-procedure" → "procedure" is kept as an override for the SDK-specific naming quirk.
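The resolution rule described in this commit is small enough to sketch directly; the function body below is an assumption based on the commit message, not the SDK's actual code:

```python
# Single override for the SDK-specific naming quirk mentioned above.
_OVERRIDES = {"extras-procedure": "procedure"}

def resolve_iceberg_table(typename: str) -> str:
    """Map an SDK typename to an MDLH Iceberg table name.

    Defaults to typename.lower(), matching MDLH's convention of using
    the lowercase Atlas typedef name, so no per-connector map is needed.
    """
    lowered = typename.lower()
    return _OVERRIDES.get(lowered, lowered)
```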
Match connector-framework convention: http://lakehouse.atlas.svc.cluster.local:4541
…e URL

- Add prepare_raw_for_lakehouse to BaseSQLMetadataExtractionActivities (separate class hierarchy from BaseMetadataExtractionActivities)
- Fix test_sql_workflow: assert 11 activities, include prepare_raw_for_lakehouse
- Fix example: use correct default MDLH URL
Move all lakehouse loading logic into MetadataExtractionWorkflow:

- load_raw_to_lakehouse(): prepare + load raw data (was inline in sql.py)
- load_transformed_to_lakehouse(): per-typename load (was _load_transformed_to_lakehouse)
- _submit_lakehouse_load(): private helper (was _execute_lakehouse_load)

sql.py run() is now a one-liner: await self.load_raw_to_lakehouse(...). All env var checks, config building, and MDLH interaction live in the base workflow — subclasses just call the public methods.
The retry policy is shared across upload_to_atlan and lakehouse activities — it's not lakehouse-specific.
- do_lakehouse_load -> submit_and_poll_mdlh_load
- _do_prepare_raw_for_lakehouse -> convert_raw_parquet_to_jsonl
New files:

- activities/metadata_extraction/lakehouse.py — all lakehouse implementation: submit_and_poll_mdlh_load, convert_raw_parquet_to_jsonl
- workflows/metadata_extraction/lakehouse.py — LakehouseLoadMixin with load_raw_to_lakehouse, load_transformed_to_lakehouse, _submit_lakehouse_load, resolve_iceberg_table

Existing files now only contain thin delegation:

- base.py: activity methods delegate to lakehouse.py functions
- sql.py: same delegation for the SQL activity class
- __init__.py: MetadataExtractionWorkflow inherits LakehouseLoadMixin
…AD is true
Remove the raw_lakehouse_config dict — the activity reads workflow_id, workflow_run_id, output_path, and connection_qualified_name directly from workflow_args. Only typenames is passed via _extracted_typenames.
os.listdir only works locally, not on S3 via Dapr. typenames are always provided by _extracted_typenames from fetch_and_transform.
…e_load

The load_to_lakehouse activity only reads lh_load_config — no need to pass the entire workflow_args through Temporal serialization.
Keep run_exit_activities and workflow_success outside the guard so they always run regardless of lakehouse config.
The asyncio.gather call was inside the ENABLE_LAKEHOUSE_LOAD block, causing extraction to silently skip when lakehouse loading is disabled.
Fix all @patch targets from ...base.X to ...lakehouse.X so mocks actually intercept the right module references. Add tests for convert_raw_parquet_to_jsonl, resolve_iceberg_table, and load_raw_to_lakehouse that were previously uncovered.
MDLH LhLoadActivityImpl.load() calls request.getFileKeys().size() without a null check. When the SDK sends only patterns (no fileKeys), exclude_none=True omits the field entirely, causing MDLH to deserialize it as null and NPE on the log line. Send file_keys=[] explicitly so the serialized payload always includes "fileKeys": [] — works around the MDLH bug.
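The workaround is easiest to see with a toy serializer (hypothetical; the SDK presumably uses Pydantic's exclude_none, but the effect is the same): dropping None-valued fields omits fileKeys from the payload entirely, while defaulting to an empty list keeps the key present so MDLH never sees null:

```python
import json

def serialize_load_request(patterns, file_keys=None, drop_none=True):
    """Toy serializer mimicking exclude_none=True behavior."""
    payload = {"patterns": patterns, "fileKeys": file_keys}
    if drop_none:
        # With file_keys=None, "fileKeys" disappears from the payload,
        # and MDLH deserializes the missing field as null -> NPE.
        payload = {k: v for k, v in payload.items() if v is not None}
    return json.dumps(payload, sort_keys=True)

# Buggy shape: no "fileKeys" key at all.
serialize_load_request(["raw/*.jsonl"])
# Workaround: send an explicit empty list so "fileKeys": [] is serialized.
serialize_load_request(["raw/*.jsonl"], file_keys=[])
```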
prepare_raw_for_lakehouse writes JSONL files to local disk but never uploads them to S3. MDLH resolves the glob pattern against S3 and finds 0 files, resulting in 0 rows loaded into the Iceberg table. Add ObjectStore.upload_prefix call after JSONL generation so files are available in S3 when MDLH processes the load request.
…without lakehouse
Code Review: This PR adds lakehouse loading capabilities to the metadata extraction workflow. It introduces two new Temporal activities (prepare_raw_for_lakehouse and load_to_lakehouse). Confidence Score: 3/5
Change Flow

sequenceDiagram
participant WF as SQL Workflow
participant Mixin as LakehouseLoadMixin
participant PrepAct as prepare_raw_for_lakehouse
participant LoadAct as load_to_lakehouse
participant HealthChk as check_lakehouse_enabled
participant MDLH as MDLH REST API
WF->>Mixin: load_raw_to_lakehouse()
Mixin->>PrepAct: convert parquet -> JSONL
PrepAct-->>Mixin: raw_lakehouse dir
Mixin->>LoadAct: submit load (raw)
LoadAct->>HealthChk: GET /actuator/health
HealthChk-->>LoadAct: healthy?
LoadAct->>MDLH: POST /load (JSONL pattern)
MDLH-->>LoadAct: 202 + jobId
LoadAct->>MDLH: GET /load/{jobId}/status (poll)
MDLH-->>LoadAct: COMPLETED
WF->>Mixin: load_transformed_to_lakehouse()
loop For each typename
Mixin->>LoadAct: submit load (transformed)
LoadAct->>HealthChk: GET /actuator/health
LoadAct->>MDLH: POST /load + poll
end
…-file processing, ensure base_dir exists
…ypename processing
… remove transformed loads

- Rename convert_raw_parquet_to_jsonl → convert_raw_parquet_to_parquet
- Replace the Daft + orjson row loop with DuckDB to_json() for ~50x faster raw_record serialization (C++ vectorized, zero Python object creation)
- Fix DuckDB SQL injection: escape column names in struct literals
- Fix DuckDB connection leak: wrap in try/finally
- Skip upload_prefix when no parquet files were produced
- Remove load_transformed_to_lakehouse, resolve_iceberg_table, and the LH_LOAD_TRANSFORMED_* constants — only raw lakehouse load for now
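The injection fix mentioned above likely amounts to escaping column names when building the DuckDB struct literal. A sketch of that SQL construction as pure string building (the helper names and exact query shape are assumptions, not the SDK's code):

```python
def quote_ident(name: str) -> str:
    """Escape a column name as a double-quoted SQL identifier."""
    return '"' + name.replace('"', '""') + '"'

def build_raw_record_sql(columns, parquet_glob: str) -> str:
    """Build a query serializing each parquet row into a JSON raw_record.

    Uses a DuckDB struct literal {'key': col, ...} fed to to_json(), so
    serialization happens inside the engine rather than a Python loop.
    Both the struct keys (string literals) and the column references
    (identifiers) are escaped to block injection via column names.
    """
    parts = []
    for c in columns:
        key = c.replace("'", "''")  # escape for the single-quoted struct key
        parts.append(f"'{key}': {quote_ident(c)}")
    struct = ", ".join(parts)
    return (
        f"SELECT to_json({{{struct}}}) AS raw_record "
        f"FROM read_parquet('{parquet_glob}')"
    )
```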
…ntirely if unavailable
RFC: entity_raw Namespace — Per-Application Raw Data Tables

PRs:
Author: Mrunmayi Tripathi
Date: 2026-03-23
Status: In Review
Problem
Today, the metadata lakehouse only stores transformed/enriched data in `entity_metadata`. The raw records from source systems are discarded after transformation. This creates four key gaps:

1. No auditability or compliance — we cannot trace back to exactly what the source system sent. If a customer disputes a metadata value, there's no source of truth to verify against. For regulated environments, there may be requirements to retain source data for audit trails — without raw storage, MDLH cannot serve as the system of record for metadata provenance.
2. No reprocessing or debugging — if transformation logic changes (bug fix, schema evolution, new fields), we cannot re-derive transformed data from source records. Re-extraction from the source is expensive and may not reproduce historical state. Investigating data quality issues (missing assets, wrong counts, stale data) also requires going back to the source system, which is slow, often requires customer credentials, and the source state may have changed since the original extraction.
3. No diffing — we cannot compare what changed between two extraction runs at the raw level. Without access to both the before and after raw records, deciding whether a discrepancy is caused by a source-side change or a transformation bug is guesswork.
4. No cross-connector visibility — there's no unified place to query raw data across connectors for features like raw-to-transformed lineage, coverage reports, or extraction health dashboards. Connector developers also have no visibility into what their extraction actually produced, independent of the transformation layer, making it hard to validate and debug connector output.
Proposal
Introduce a new Iceberg namespace `entity_raw` with one table per registered Application (e.g. `entity_raw.snowflake`, `entity_raw.redshift`).

Each table stores the full raw record as a JSON string (`raw_record` column) alongside common metadata columns (`typename`, `connection_qualified_name`, `workflow_run_id`, `extracted_at`, `tenant_id`). These metadata columns are intentionally aligned with `entity_metadata` fields so that raw and transformed data can be correlated with a simple equi-join — enabling debugging, diffing, and lineage across the two layers.

Table names are not arbitrary — they must match a registered `Application` entity in Atlas. MDLH proactively creates tables for all known applications at startup and every 10 minutes, and also validates on-demand creation requests against the Atlas application registry. This prevents namespace pollution while keeping the system self-service for onboarded connectors.

The feature is opt-in per connector via a single environment variable (`ENABLE_LAKEHOUSE_LOAD=true`). When disabled, behavior is unchanged — no raw data is written, no new API calls are made.

This is a coordinated change across three repos:

- MDLH: creates `entity_raw` tables in Iceberg and validates `/load` requests against the Atlas application registry
- Application SDK: prepares raw data in the `entity_raw` schema and submits load jobs to MDLH via the `/load` REST API
- Connector apps: register the SDK activities and enable the feature per environment

Design
Schema
Namespace: `entity_raw` | Table name: application name (e.g. `snowflake`, `redshift`)

| Column | Notes |
| --- | --- |
| `typename` | entity type (e.g. `table`, `column`) |
| `connection_qualified_name` | aligns with `entity_metadata.connectionqualifiedname` |
| `workflow_id` | |
| `workflow_run_id` | aligns with `entity_metadata.lastsyncrun` |
| `extracted_at` | |
| `tenant_id` | |
| `entity_name` | |
| `raw_record` | full source record as a JSON string |

Raw and transformed data can be correlated via the shared `connection_qualified_name` and `workflow_run_id` fields.
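As an illustration of that correlation, a join between the raw and transformed layers might look like the following. This SQL is illustrative only; the table choice (`snowflake` raw table joined to the transformed `table` entity table) and the `entity_metadata` column names follow the join notes in the schema table above:

```sql
-- Illustrative equi-join between raw and transformed layers.
SELECT
    r.typename,
    r.entity_name,
    r.raw_record,
    m.*
FROM entity_raw.snowflake AS r
JOIN entity_metadata."table" AS m
  ON r.connection_qualified_name = m.connectionqualifiedname
 AND r.workflow_run_id = m.lastsyncrun;
```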
Table Lifecycle
Tables are named after registered Application entities in Atlas.
Proactive (startup + every 10 min) — MDLH queries Atlas for all ACTIVE `Application` entities and pre-creates a table for each. This runs during MDLH init (first install) and on every Notification Processor cycle (`*/10 * * * *`). Any new Application registered in Atlas gets its table within 10 minutes.

Reactive (on `/load` request, guarded) — if a `/load` request targets a non-existent table in `entity_raw`, MDLH checks whether the table name matches a registered Application. If yes, it auto-creates the table. If not, it rejects the request with a clear error listing valid applications. This handles race conditions where a new app sends data before the 10-minute scheduler catches up, while preventing arbitrary table names from polluting the namespace.

End-to-End Data Flow
MDLH table management
Application SDK & Connector Integration
What the SDK does
The SDK adds two new Temporal activities available to all connectors:
- `prepare_raw_for_lakehouse` — reads raw parquet files produced during extraction and wraps each row into the `entity_raw` schema as JSONL, adding metadata columns (`typename`, `connection_qualified_name`, `workflow_run_id`, `extracted_at`, `tenant_id`) alongside the original row as a `raw_record` JSON string.
- `load_to_lakehouse` — submits a load job to the MDLH `/load` API with an S3 glob pattern, then polls the status endpoint until completion or terminal failure.

Workflow execution order
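The row-wrapping step might look like this minimal sketch. The field sources are assumptions based on this RFC (in particular, deriving `entity_name` from the row's `name` field is a guess), not the SDK's actual implementation:

```python
import json
from datetime import datetime, timezone

def wrap_raw_row(row: dict, *, typename: str, connection_qualified_name: str,
                 workflow_id: str, workflow_run_id: str, tenant_id: str) -> dict:
    """Wrap one extracted row into the entity_raw schema.

    The original row survives verbatim as a JSON string in raw_record;
    the metadata columns enable the equi-join with entity_metadata.
    """
    return {
        "typename": typename,
        "connection_qualified_name": connection_qualified_name,
        "workflow_id": workflow_id,
        "workflow_run_id": workflow_run_id,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        # Assumption: entity name comes from the raw row's "name" field.
        "entity_name": row.get("name", ""),
        "raw_record": json.dumps(row),
    }
```

Each wrapped dict would then be written out as one JSONL line per row.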
Configuration
All lakehouse loading is controlled by environment variables — no code changes needed in connector apps beyond registering the two activities:
| Variable | Default |
| --- | --- |
| `ENABLE_LAKEHOUSE_LOAD` | `false` |
| `MDLH_BASE_URL` | `http://lakehouse.atlas.svc.cluster.local:4541` |
| `LH_LOAD_RAW_NAMESPACE` | `entity_raw` |
| `LH_LOAD_RAW_TABLE_NAME` | `APPLICATION_NAME` (e.g. `redshift`) |
| `LH_LOAD_RAW_MODE` | `APPEND` |
| `LH_LOAD_TRANSFORMED_NAMESPACE` | `entity_metadata` |
| `LH_LOAD_TRANSFORMED_MODE` | `APPEND` |
| `LH_LOAD_POLL_INTERVAL_SECONDS` | `10` |
| `LH_LOAD_MAX_POLL_ATTEMPTS` | `360` |

What connector apps need to do
Minimal — register the two SDK activities in the workflow's activity list and add a connection metadata normalizer to handle different connection object shapes. The Redshift app PR (#184) serves as the reference implementation.
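The environment variables above can be read into a plain settings dict. The loader function below is an illustration built from the configuration table (the function itself is not the SDK's config API; defaults are taken from the table):

```python
import os

def load_lakehouse_config(env=os.environ) -> dict:
    """Read lakehouse settings from env vars, using the RFC's defaults."""
    return {
        "enabled": env.get("ENABLE_LAKEHOUSE_LOAD", "false").lower() == "true",
        "base_url": env.get(
            "MDLH_BASE_URL", "http://lakehouse.atlas.svc.cluster.local:4541"
        ),
        "raw_namespace": env.get("LH_LOAD_RAW_NAMESPACE", "entity_raw"),
        # Raw table name falls back to the application name (e.g. redshift).
        "raw_table": env.get(
            "LH_LOAD_RAW_TABLE_NAME", env.get("APPLICATION_NAME", "")
        ),
        "raw_mode": env.get("LH_LOAD_RAW_MODE", "APPEND"),
        "poll_interval_s": int(env.get("LH_LOAD_POLL_INTERVAL_SECONDS", "10")),
        "max_poll_attempts": int(env.get("LH_LOAD_MAX_POLL_ATTEMPTS", "360")),
    }
```

With the defaults of 10 seconds per poll and 360 attempts, a load job is given up to an hour to complete.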
Rollout
All changes are additive and idempotent. No migration needed.
- Enable with `ENABLE_LAKEHOUSE_LOAD=true` per-connector, per-environment

Rollback
- Set `ENABLE_LAKEHOUSE_LOAD=false` → stops writing immediately

Security
- The `/load` API requires the `X-Atlan-Tenant-Id` header

Open Questions
- … `typename`; add time-based partitioning on `extracted_at`?
- `raw_record`: use ZSTD-compressed bytes instead of plain JSON?
- Limits on `raw_record` to prevent extremely large JSON blobs?