I'm using NiFi as a batch ingest tool and orchestrator for a Data Warehouse.
## Principles
From two years of working on this and trying to make the workflow as generic, modular, and simple to understand as possible, here are the principles I used to achieve it:
- Every logical step should output only one flow file, e.g. one Processor group that ingests data from a source system and outputs a single flow file into either a success or an error connection.
- Specify credentials only once, e.g. if you use an InvokeHTTP processor, design the workflow so that there are never two InvokeHTTP processors with the same credentials set.
- If you can, create a Processor group that acts as a black box: show the user what is going on without going into unnecessary detail.
## Naming Convention
| Component | Naming Convention | Example / Note |
| --- | --- | --- |
| flow file attribute | lowerCamelCase | thisIsAnExampleForYou |
| connection | all lowercase, separated by spaces | this is an example for you |
| processor group | Initial Caps | This Is An Example For You |
| input port | in | always use only this one name |
| output port | success, error | always use only these two names |
| connection | success, error (whenever possible) | rule-breaking example: parsing log files and routing each line by log level: Info, Warn, Error |
## Postgres

### 3 schemas
| Name | Usage |
| --- | --- |
| stage | data loaded as-is from source systems |
| core | cleaned, integrated (foreign keys) data |
| mart | very flat tables, reports |
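
A minimal DDL sketch of this layout, assuming PostgreSQL; the table names in the comments are hypothetical and only illustrate where tables would live:

```sql
-- Create the three warehouse schemas
CREATE SCHEMA IF NOT EXISTS stage; -- data loaded as-is from source systems
CREATE SCHEMA IF NOT EXISTS core;  -- cleaned, integrated data
CREATE SCHEMA IF NOT EXISTS mart;  -- very flat tables, reports

-- Hypothetical examples of where tables would live:
--   stage.crm_customer_i  raw customer extract from a CRM source
--   core.customer_t       cleaned table holding the current customer data
--   mart.monthly_sales    flat reporting table
```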
### 4 table types (defined by postfixes)
| Postfix | Name | Usage |
| --- | --- | --- |
| _i | input | contains temporary data used during ETL |
| _t | today | always holds the current data |
| _ih | input historized | SCD2 historization of _i tables (see the SQL sketch below) |
| _d | backup | exact copy (snapshot) of _i tables |
| (no postfix) | report | only in the mart schema (will be changed to _r) |
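
To make the postfix convention concrete, here is a sketch for a hypothetical customer entity, assuming PostgreSQL. The columns, keys, and load statements are illustrative assumptions, not DDL from this repository; the _ih load shows one common way to implement SCD2 historization (close changed versions, then insert new ones):

```sql
-- _i: temporary input table, reloaded on every ETL run
CREATE TABLE stage.customer_i (
    customer_id integer,
    name        text,
    city        text
);

-- _t: always holds the current data
CREATE TABLE core.customer_t (LIKE stage.customer_i);

-- _d: exact copy (snapshot) of the _i table, kept as a backup
CREATE TABLE stage.customer_d (LIKE stage.customer_i);

-- _ih: SCD2 historization of the _i table
CREATE TABLE core.customer_ih (
    customer_id integer,
    name        text,
    city        text,
    valid_from  timestamptz NOT NULL,
    valid_to    timestamptz NOT NULL DEFAULT 'infinity'
);

-- SCD2 load, one common variant:
-- 1) close open versions whose attributes changed in the new input
UPDATE core.customer_ih h
SET    valid_to = now()
FROM   stage.customer_i i
WHERE  h.customer_id = i.customer_id
  AND  h.valid_to = 'infinity'
  AND  (h.name, h.city) IS DISTINCT FROM (i.name, i.city);

-- 2) insert a new open version for every new or changed row
INSERT INTO core.customer_ih (customer_id, name, city, valid_from)
SELECT i.customer_id, i.name, i.city, now()
FROM   stage.customer_i i
LEFT JOIN core.customer_ih h
       ON h.customer_id = i.customer_id
      AND h.valid_to = 'infinity'
WHERE  h.customer_id IS NULL;
```

Deletes in the source are not handled here; a real load would also close the open versions of rows that disappear from the _i table.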
## Metabase