RULES.md — Qlik Load Script Optimization Protocol (QIX Engine)

Bulletproof ruleset for a coding agent generating high-performance Qlik scripts

Role: Senior Qlik Data Architect + ETL Engineer (Qlik Sense / QlikView)
Primary objective: Minimize RAM footprint (symbol tables + bit-width), minimize peak reload RAM, maximize reload throughput (optimized QVD loads), and keep interactive latency low (inference + calculation).
Secondary objective: Preserve traceability and governance (lineage, audit, security safety).
Non-negotiable: The script must be deterministic, idempotent, and safe to rerun.

This ruleset assumes you are building an enterprise-grade, layered QVD pipeline (Extract → Transform → Consume).
If you’re building a single-script “everything-in-one” app, still follow the same rules—just treat internal sections as layers.


0) Agent behavior contract (how you must act)

0.1 Fail-fast philosophy

If any forbidden rule is violated, stop output and explain:

  • what rule was violated
  • why it matters (RAM/CPU/latency/security)
  • what the compliant alternative is
  • the expected impact change

0.2 Always output these three blocks (mandatory)

Every generated script must include:

  1. Header: purpose, layer, sources, outputs, watermark strategy, assumptions.
  2. Instrumentation: basic logging variables + reload checkpoints.
  3. Validation: post-load sanity checks (row counts, distinct keys, null keys, fan-trap detectors).

0.3 Deterministic & idempotent

  • The same inputs must produce the same outputs (except explicit time-window increments).
  • Incremental pipelines must be safe on retry (no duplication, no partial writes without rollback strategy).

1) Core engine constraints you must optimize for (mental model)

1.1 What costs RAM

  • Symbol tables (distinct values per field)
  • Bit-width of indices for those fields (cardinality → pointer width)
  • Temporary duplicates during reload (resident loads, joins, concatenations)
  • Wide fact tables (many fields x many rows)

1.2 What costs interactive latency

  • Messy association topology (unexpected synthetic keys, circular references, unnecessary link paths)
  • Expressions that force temporary tables (cross-table arithmetic)
  • Too many objects/hypercubes (frontend, but you influence it by precomputing flags and shaping model)

1.3 The biggest reload killers

  • Pulling too much data over the network (no pushdown)
  • Non-optimized QVD loads in the final app layer
  • Peak RAM spikes due to keeping staging tables alive too long

2) Absolute prohibitions (HARD FAIL)

2.1 SQL & extraction

  • FORBIDDEN: SELECT * on any production source.
  • FORBIDDEN: Filtering in LOAD ... WHERE ... when the same filter could be pushed down into SQL (except WHERE Exists() and specific QVD patterns).
  • FORBIDDEN: Joining large tables in SQL to “flatten everything” unless explicitly justified with impact analysis.

2.2 Cardinality landmines

  • FORBIDDEN: Loading full high-cardinality timestamps (YYYY-MM-DD hh:mm:ss.fff) into the final associative model.
  • FORBIDDEN: Keeping unique row identifiers (transaction IDs, GUIDs) in the final model unless a documented analytic use-case requires it.

2.3 QVD optimization

  • FORBIDDEN: Transformations in the final app load (Layer 3) that break optimized loads:
    • calculations
    • arbitrary WHERE filters (except Exists()-based)
    • joins
    • resident transformations
  • FORBIDDEN: STORE and immediately reload the same QVD in the same script unless explicitly trading I/O for RAM to avoid memory failure (must be documented).

2.4 Model integrity

  • FORBIDDEN: Unresolved circular references.
  • FORBIDDEN: Unexpected synthetic keys (synthetic keys must be intentional and documented).
  • FORBIDDEN: OMIT in Section Access on fields used as link keys between tables.

2.5 Memory governance during reload

  • FORBIDDEN: Leaving large staging tables resident after their STORE step (must STORE + DROP immediately).

3) Mandatory rules (MUST DO)

3.1 Pushdown protocol (minimize ingestion payload)

You MUST implement all feasible dataset reduction at the source:

A. Vertical partitioning (columns):

  • Only select the columns required for the model + audit. Nothing else.

B. Horizontal partitioning (rows):

  • Apply WHERE clauses in SQL whenever possible.

C. Granularity reduction (aggregation):

  • If the analytical requirement is monthly, aggregate monthly in SQL.
  • Never load 100M rows into Qlik to later summarize to 10K rows.
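A minimal pushdown sketch for case C (SQL Server date syntax shown; table and field names are illustrative): the engine receives pre-aggregated monthly rows instead of raw transactions.

```
// Granularity reduction in SQL: Qlik ingests the 10K summary, not the 100M detail
MonthlySales:
LOAD MonthStart, Region, Revenue;
SQL SELECT
  DATEFROMPARTS(YEAR(OrderTS), MONTH(OrderTS), 1) AS MonthStart,
  Region,
  SUM(Amount) AS Revenue
FROM dbo.Orders
GROUP BY DATEFROMPARTS(YEAR(OrderTS), MONTH(OrderTS), 1), Region;
```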

3.2 Timestamp decoupling (mandatory)

For every timestamp field destined for the final model:

  • Split into DateKey and TimeKey (bucketed to business precision)

Standard pattern:

Date(Floor(TS)) as DateKey,
Time(Floor(Frac(TS), 1/24/60), 'hh:mm') as TimeKeyMinute  // Frac() strips the date part, so the key holds at most 1,440 distinct values

Allowed alternatives:

  • second-level if genuinely required (1/24/60/60)
  • 15-min buckets for ops analytics (1/24/4)
  • keep raw timestamp only in staging QVDs for lineage/debug, then drop in transform layer
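The allowed alternatives can be sketched the same way (Frac() strips the date part so the time-key cardinality stays bounded regardless of how many days are loaded):

```
Time(Floor(Frac(TS), 1/24/60/60), 'hh:mm:ss') as TimeKeySecond,     // second-level: at most 86,400 values
Time(Floor(Frac(TS), 1/24/4), 'hh:mm') as TimeKeyQuarterHour        // 15-min buckets: at most 96 values
```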

3.3 Composite key hashing (mandatory)

All composite keys used for associations MUST be integerized.

Pattern:

AutoNumber(Key1 & '|' & Key2 & '|' & Key3, 'LinkKey') as %LinkKey

Rules:

  • Use a stable delimiter (|) that cannot appear in the data or sanitize it.
  • Preserve lineage by keeping the original natural key fields in staging or a trace table (not in the final associative model unless required).
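If the delimiter may occur in the data, one defensive option is to strip it before hashing (a sketch; PurgeChar() removes every occurrence of the listed character):

```
AutoNumber(
  PurgeChar(Key1, '|') & '|' & PurgeChar(Key2, '|') & '|' & PurgeChar(Key3, '|'),
  'LinkKey'
) as %LinkKey
```

This trades a small amount of CPU during reload for a guarantee that distinct key combinations cannot collide through delimiter collisions.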

3.4 Drop strategy (mandatory)

  • Drop staging tables immediately after STORE.
  • Drop intermediate fields immediately after use.

Pattern:

STORE T INTO [lib://.../T.qvd] (qvd);
DROP TABLE T;

3.5 ApplyMap for N:1 lookups (mandatory default)

Use mapping loads for single-value lookups, not joins.

Pattern:

MapX:
MAPPING LOAD Key, Value FROM ...;

LOAD
  ...,
  ApplyMap('MapX', Key, 'Unknown') as Value
;

3.6 Final app load must be optimized (mandatory)

Layer 3 “consume” scripts must be optimized QVD loads:

  • no calculations
  • no joins
  • no arbitrary filters (except Exists() patterns)
  • no resident transformations

If optimization must be broken, you MUST:

  • justify why (business requirement)
  • quantify impact (reload time/RAM)
  • isolate the non-optimized part to the smallest table possible

4) Layered architecture rules (Extract → Transform → Consume)

4.1 Layer 1 — Extract QVDs (raw-ish)

Goal: fast pull, minimal transformations, consistent typing.

Allowed:

  • rename fields for consistency
  • light cleaning (Trim, NullAsValue handling)
  • record-source lineage fields (SourceSystem, ExtractTS)
  • basic timestamp decoupling if it reduces raw volume without breaking traceability

Must include:

  • extraction watermark policy (full vs incremental)
  • row-count logging
  • error handling (connection failures)

4.2 Layer 2 — Transform QVDs (business rules + model shaping)

Goal: reduce cardinality, build conformed dims, generate surrogate keys, enforce association design.

Allowed:

  • AutoNumber keys
  • ApplyMap enrichment
  • dimension flattening (snowflake reduction)
  • flags for UI performance
  • dropping high-cardinality fields

Must include:

  • key uniqueness validation
  • null-key detection and handling
  • fan-trap / join duplication detection if joins exist

4.3 Layer 3 — Consume (app model load)

Goal: optimized loads + clean associative model.

Allowed:

  • LOAD * FROM ... (qvd); style loads (ideally)
  • WHERE Exists(KeyField)
  • field aliasing only if it does not break optimization for your QVD structure (be conservative)

Forbidden:

  • any heavy transforms
  • joins into facts
  • resident loads
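A consume-layer sketch of the allowed Exists() pattern (paths and field names are illustrative). A single-argument Exists() on a field that is already loaded preserves the optimized QVD load:

```
// load the key universe first, then filter the fact against it
Calendar:
LOAD DateKey FROM [lib://Data/L2/Calendar.qvd] (qvd);

Orders:
LOAD * FROM [lib://Data/L2/Orders_Model.qvd] (qvd)
WHERE Exists(DateKey);
```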

5) Data model topology rules

5.1 Association control via field naming

  • Same field name = association contract.
  • If you do not want association, rename/qualify.

Default defensive pattern:

QUALIFY *;
UNQUALIFY DateKey, %CustomerKey, %ProductKey, %LinkKey;

Only use this if you truly intend a restricted association surface; otherwise prefer explicit renaming on conflicts.

5.2 Synthetic keys rules

  • Synthetic keys are allowed only when intentional and documented.
  • If synthetic keys appear unexpectedly:
    • identify shared fields causing it
    • decide which fields should associate
    • rename/qualify the rest

5.3 Circular reference rules

Circular references are a structural failure unless explicitly modeled with a known technique.

Fix options (in order):

  1. Link table (hub for shared dimensions)
  2. Concatenate tables if they represent the same entity split across sources
  3. Rename fields to break accidental cycles

You must produce a before/after association explanation whenever you fix a loop.
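A link-table sketch for fix option 1 (all table and field names are hypothetical; here Sales and Budget share %CustomerKey and MonthKey, which closes the loop):

```
LinkTable:
LOAD DISTINCT
  AutoNumber(%CustomerKey & '|' & MonthKey, 'Link') as %LinkKey,
  %CustomerKey,
  MonthKey
RESIDENT Sales;

CONCATENATE (LinkTable)
LOAD DISTINCT
  AutoNumber(%CustomerKey & '|' & MonthKey, 'Link') as %LinkKey,
  %CustomerKey,
  MonthKey
RESIDENT Budget;

// the data tables keep only %LinkKey (added in their own LOADs), breaking the cycle
DROP FIELDS %CustomerKey, MonthKey FROM Sales;
DROP FIELDS %CustomerKey, MonthKey FROM Budget;
```

If both sources contribute overlapping key combinations, dedupe the combined link table (a final LOAD DISTINCT ... RESIDENT pass) before dropping the source fields.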

5.4 Star vs snowflake (Qlik-optimized guidance)

  • Prefer conformed dimensions (flatten small lookups into one dimension) to reduce hops.
  • Avoid “mega-wide” facts created by joining dimensions into the fact—this duplicates dimension attributes per row.

6) Join rules (when joins are allowed)

6.1 Default position: avoid joins into facts

Joins physically expand data tables and can duplicate rows.

6.2 Joins are allowed only if ALL conditions hold

  • join cardinality is guaranteed (1:1 or many:1 where you join onto the many side in a controlled way)
  • keys are clean (no nulls, no duplicates on the “1” side)
  • duplication risk is tested
  • ApplyMap is insufficient (need many fields and mapping tables would be unwieldy)

6.3 Mandatory join safety checks (if you join)

Before join:

  • distinct key count on the “1” side equals row count
  • no nulls in join key

After join:

  • row count of target table is unchanged (unless documented)
  • sum of a stable measure unchanged (reconciliation)
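The row-count check can be scripted around the join itself (a sketch; table, field, and path names are hypothetical):

```
LET vRowsBefore = NoOfRows('Fact');

LEFT JOIN (Fact)
LOAD CustomerID, Segment
FROM [lib://Data/L2/CustomerDim.qvd] (qvd);

LET vRowsAfter = NoOfRows('Fact');

If $(vRowsAfter) <> $(vRowsBefore) Then
  TRACE JOIN CHECK FAILED: Fact row count changed from $(vRowsBefore) to $(vRowsAfter);
End If
```

The same pattern applies to the reconciliation measure: capture Sum(Amount) via a one-row resident aggregate before and after, and compare.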

7) Incremental load rules (watermarks and dedup)

7.1 You must implement one of these two patterns for large facts (>1M rows)

Pattern A: Insert + Update

  • load delta since watermark from source
  • concatenate history QVD with WHERE NOT Exists(PK)
  • store updated QVD

Pattern B: Insert + Update + Delete

  • A, plus: fetch current PK list (or use soft deletes)
  • inner join / semi-join to remove deleted rows
  • store

7.2 Watermark policy must be explicit

  • Field used (ModifiedTS, CDC LSN, ingestion timestamp)
  • Timezone handling
  • Late arriving data strategy (backfill window)
  • Failure recovery (don’t advance watermark unless STORE succeeded)

7.3 Dedup must be O(1)

Use Exists(PK)/Where Not Exists(PK) patterns, not brute force comparisons.
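Pattern A can be sketched end to end (names are illustrative; in practice the watermark is read from, and written back to, a control QVD):

```
LET vWatermark = '2024-06-01 00:00:00';   // last successful high-water mark

// 1) delta from source (watermark pushed down into SQL)
Orders_Hist:
LOAD OrderID, OrderTS, Amount;
SQL SELECT OrderID, OrderTS, Amount
FROM dbo.Orders
WHERE ModifiedTS > '$(vWatermark)';

// 2) append history, skipping PKs re-delivered in the delta (update wins)
CONCATENATE (Orders_Hist)
LOAD OrderID, OrderTS, Amount
FROM [lib://Data/L2/Orders_Hist.qvd] (qvd)
WHERE NOT Exists(OrderID);

// 3) persist, drop, and only then advance the watermark
STORE Orders_Hist INTO [lib://Data/L2/Orders_Hist.qvd] (qvd);
DROP TABLE Orders_Hist;
```

Exists() is a hash lookup per incoming row, which is what makes this dedup effectively O(1) per row.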


8) Memory peak control rules

8.1 Resident loads duplicate memory

Any LOAD ... RESIDENT creates a second table while the first still exists.

Mandatory mitigation:

  • minimize width of the resident source table (only needed fields)
  • drop source table immediately after resident transform completes

8.2 “Atomic staging” is mandatory

If a table is only needed to create a QVD, STORE + DROP it immediately.

8.3 High-cardinality ops isolation

If you must perform operations on high-cardinality fields (distinct counts, timestamp manipulation):

  • isolate into a narrow temporary table
  • compute
  • join results back (or map)
  • drop temp immediately
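A sketch of the isolation pattern (hypothetical names): the distinct count runs over a narrow two-field table, and the result is mapped back rather than joined.

```
// narrow temp table: only the fields the computation needs
TxnCounts:
LOAD CustomerID,
     Count(DISTINCT TransactionID) as TxnCount
RESIDENT Fact
GROUP BY CustomerID;

MapTxnCount:
MAPPING LOAD CustomerID, TxnCount RESIDENT TxnCounts;
DROP TABLE TxnCounts;

// enrich the dimension via the map (N:1, no join), then the raw TransactionID
// field can be dropped from the model per rule 2.2
Customers2:
LOAD *, ApplyMap('MapTxnCount', CustomerID, 0) as TxnCount
RESIDENT Customers;
DROP TABLE Customers;
RENAME TABLE Customers2 TO Customers;
```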

9) SQL vs LOAD decision matrix (where logic belongs)

9.1 Put in SQL (default)

  • row filtering (WHERE)
  • heavy aggregation (GROUP BY)
  • UNION ALL for same-structure sets (same DB)
  • leveraging DB indexes

9.2 Put in LOAD (default)

  • timestamp splitting (Floor/Frac/Time bucketing)
  • AutoNumber key generation
  • ApplyMap enrichments
  • sequential transformations (Previous(), Peek(), IterNo())
  • consistent field aliasing across DB dialects

9.3 Preceding load is the canonical pattern

Fact:
LOAD
  Date(Floor(TS)) as DateKey,
  AutoNumber(ID & '|' & Region, 'Link') as %Link,
  Amount
;
SQL SELECT TS, ID, Region, Amount
FROM ...
WHERE TS >= ...
;

10) Section Access rules (load-script implications)

10.1 Uppercase law (mandatory)

All values in Section Access are uppercased by Qlik. You MUST normalize reduction field values in the data model using Upper().

10.2 Safety defaults

  • Always include reload/service accounts (e.g., INTERNAL\SA_SCHEDULER where applicable).
  • Never OMIT link keys.
  • Always develop in a copy; keep a recovery path (“open without data” where available).
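A minimal sketch (INLINE is used only for illustration; load the access table from a governed source in production; account and field names are examples):

```
Section Access;
LOAD * INLINE [
ACCESS, USERID, REGION
ADMIN, INTERNAL\SA_SCHEDULER, *
USER, AD_DOMAIN\JDOE, EMEA
];
Section Application;

// reduction field normalized to match Section Access uppercasing (rule 10.1)
Customers:
LOAD CustomerID,
     Upper(Region) as REGION
FROM [lib://Data/L2/CustomerDim.qvd] (qvd);
```

Note that `*` in Section Access means “all values listed in the access table,” not all values in the data.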

11) Output requirements (what your generated script must contain)

11.1 Standard header block (mandatory)

Include:

  • layer (Extract/Transform/Consume)
  • source(s) + connection lib names
  • output QVD paths
  • watermark strategy (if incremental)
  • field naming conventions (keys, dates)
  • association intent (which keys link which tables)

11.2 Instrumentation block (mandatory)

At minimum:

  • start time / end time variables
  • row counts per major table
  • QVD write success markers (optional but recommended)
  • watermark write only after success
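A minimal instrumentation sketch (variable and table names are illustrative):

```
LET vReloadStart = Now();

// ... main loads here ...

LET vOrdersRows = NoOfRows('Orders');
TRACE [checkpoint] Orders rows: $(vOrdersRows);

LET vReloadEnd = Now();
TRACE [checkpoint] Reload window: $(vReloadStart) -> $(vReloadEnd);
```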

11.3 Validation block (mandatory)

At minimum:

  • row count checks
  • distinct PK count checks
  • null key checks
  • join duplication checks if joins exist
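The minimum checks can be sketched as follows (hypothetical table and field names). The If() condition returns the else-branch for nulls, so pure-null and empty-string keys are both counted:

```
LET vRows = NoOfRows('Orders_Model');

PKCheck:
LOAD
  Count(DISTINCT OrderID) as DistinctPK,
  Sum(If(Len(Trim(OrderID)) > 0, 0, 1)) as NullKeys
RESIDENT Orders_Model;

LET vDistinctPK = Peek('DistinctPK', 0, 'PKCheck');
LET vNullKeys   = Peek('NullKeys', 0, 'PKCheck');
DROP TABLE PKCheck;

If $(vDistinctPK) <> $(vRows) or $(vNullKeys) > 0 Then
  TRACE VALIDATION FAILED: PK duplicates or null keys detected in Orders_Model;
End If
```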

12) Examples (approved patterns)

12.1 Approved: Extract with pushdown + preceding load

// L1 Extract: Orders_Extract.qvd
LET vFrom = '2024-01-01';

Orders_Extract:
LOAD
  OrderID,
  Timestamp(OrderTS) as OrderTS,        // keep raw in extract if needed
  CustomerID,
  Amount
;
SQL SELECT
  OrderID,
  OrderTS,
  CustomerID,
  Amount
FROM dbo.Orders
WHERE OrderTS >= '$(vFrom)';
STORE Orders_Extract INTO [lib://Data/L1/Orders_Extract.qvd] (qvd);
DROP TABLE Orders_Extract;

12.2 Approved: Transform with timestamp split + surrogate keys + ApplyMap

// L2 Transform: Orders_Model.qvd
MapCustomerRegion:
MAPPING LOAD CustomerID, Upper(Region) as Region
FROM [lib://Data/L2/CustomerDim.qvd] (qvd);

Orders_Model:
LOAD
  OrderID,
  Date(Floor(OrderTS)) as OrderDate,
  Time(Floor(Frac(OrderTS), 1/24/60), 'hh:mm') as OrderTimeMinute,
  AutoNumber(CustomerID, 'Customer') as %CustomerKey,
  ApplyMap('MapCustomerRegion', CustomerID, 'UNKNOWN') as Region,
  Amount
FROM [lib://Data/L1/Orders_Extract.qvd] (qvd);

// OrderTS is consumed by the expressions above but never aliased through,
// so the high-cardinality raw timestamp does not reach the final model

STORE Orders_Model INTO [lib://Data/L2/Orders_Model.qvd] (qvd);
DROP TABLE Orders_Model;

12.3 Approved: Consume with optimized loads

// L3 App: optimized loads only
Orders:
LOAD * FROM [lib://Data/L2/Orders_Model.qvd] (qvd);

Customers:
LOAD * FROM [lib://Data/L2/CustomerDim.qvd] (qvd);

13) Anti-patterns (examples you must never generate)

13.1 SELECT *

SQL SELECT * FROM dbo.FactSales;

Why it fails: RAM bloat + uncontrolled field proliferation + breaks lineage discipline.

13.2 Late filtering in LOAD instead of SQL pushdown

T:
LOAD *;
SQL SELECT ... FROM ...;  // no WHERE pushdown

// then later:
LOAD * RESIDENT T WHERE Year(Date) = 2024;

Why it fails: network + RAM wasted; filtering occurs too late.

13.3 Join dimension into big fact without proof

LEFT JOIN (Fact)
LOAD CustomerID, CustomerName, Segment RESIDENT Customers;

Why it fails: duplicates attributes per fact row; fan trap risk; memory inflation.


14) Acceptance criteria (the “done means done” checklist)

14.1 Performance criteria

  • No SELECT * anywhere
  • Source filtering and aggregation pushed down where appropriate
  • Final app load is optimized QVD loads (or justified exceptions)
  • STORE + DROP staging tables immediately
  • Timestamps split in final model
  • Composite keys AutoNumbered
  • ApplyMap used for N:1 lookups by default

14.2 Data integrity criteria

  • PK uniqueness validated (distinct PK = rows where expected)
  • No null keys in link fields (or documented handling)
  • No unexpected synthetic keys
  • No circular references
  • Joins do not change row counts unless documented

14.3 Operational criteria

  • Watermark strategy explicit and retry-safe
  • Logging present (row counts, checkpoints)
  • Script is rerunnable without duplicating data

15) When to break rules (and how)

Rules can be broken only if:

  1. Business requirement forces it, and
  2. You provide an impact analysis (RAM/CPU/reload/latency), and
  3. You isolate the exception to the smallest scope possible, and
  4. You add validation to prevent silent corruption.

Examples of justified exceptions:

  • Keeping a transaction ID field because the UX requires drill-through to transaction detail.
  • Non-optimized load for a small dimension because it must compute a derived field at consumption time (rare; usually fix in Transform layer).
  • STORE→DROP→reload in a single script to reduce peak RAM when memory is constrained.

End of ruleset.