Metrics are the main working unit of the DataQuality framework. With them you can calculate multiple KPIs and get an analytical overview of your sources.
Tip: All metrics return a tuple of [Double, String]. For most of them the second part is empty.
The configuration is unified for all metrics and looks as follows:
{
id: "1011"
name: "TOP_N" // metric name from the list
type: "COLUMN" // base metric type
description: "median"
config: {
file: "USGS_2005", // source name
columns: ["Magnitude"], // Only for column metrics
params: {targetNumber:4, maxCapacity: 100} // check metric description for details
}
}

Composed metrics are a way to combine the results of multiple metrics. They are calculated after the basic metrics.
Formulas support the basic math operators (+, -, *, /), exponentiation (^) and grouping with parentheses.
Example:
{
id: "0.5_tdigest_check"
name: "Q2 error"
description: "Determines null values percentage on column X"
formula: "$td05 - ($108 - 0.675*$109)" //use $metric_id to pull metric result in formula
}

Tip: To use a TOP_N metric result in your formula, use the template metric_id + _ + rating_position. It resolves to the frequency of that result.
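To make the substitution and arithmetic concrete, here is a minimal Python sketch of how such a formula could be evaluated. It is not the framework's own parser: the function name `evaluate_formula`, the `results` dictionary and the sample values are assumptions made for illustration, and metric ids are assumed to consist of letters, digits and underscores only.

```python
import ast
import operator
import re

# Operators allowed in a formula, mapped to their Python implementations.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_formula(formula: str, results: dict) -> float:
    """Replace every $metric_id with its value, then evaluate the arithmetic."""

    def substitute(match):
        metric_id = match.group(1)
        if metric_id not in results:
            raise KeyError(f"unknown metric id: {metric_id}")
        # Parentheses keep negative values from merging with nearby operators.
        return f"({float(results[metric_id])!r})"

    # Assumption: ids are made of letters, digits and underscores.
    expression = re.sub(r"\$(\w+)", substitute, formula).replace("^", "**")

    def walk(node):
        # Only numbers, + - * / ** and unary signs are accepted.
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression in formula")

    return walk(ast.parse(expression, mode="eval").body)


# Hypothetical metric results, keyed by metric id (TOP_N results use id_position).
results = {"td05": 5.0, "108": 6.0, "109": 2.0, "1011_1": 42.0}
q2_error = evaluate_formula("$td05 - ($108 - 0.675*$109)", results)
top_squared = evaluate_formula("$1011_1 ^ 2", results)
```

Walking the parsed expression instead of calling `eval` keeps the sketch restricted to the operators the formula language lists. Operator precedence here is Python's, which may differ from the framework's for `^`.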
Returns the number of rows in the file.
Parameters: none
Tip: Requires the "columns" field inside the config object
Returns the approximate top N most frequent values in the column
Tip: "TOP_N" metrics are special: they return multiple results of [Frequency, String], each named metric_id + _ + rating_position
Parameters:
- optional Int maxCapacity: Size of aggregation collection
- Int targetNumber: Requested N
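The naming scheme for TOP_N results can be illustrated with a short Python sketch. It counts exactly with `collections.Counter` rather than approximately, so it only shows the shape of the output, not how the framework aggregates. The function name `top_n_results` and the use of a raw count as the frequency are assumptions.

```python
from collections import Counter


def top_n_results(metric_id, values, target_number):
    """Return {result_name: (frequency, value)} for the N most frequent values."""
    most_common = Counter(values).most_common(target_number)
    return {
        # Result names follow metric_id + _ + rating_position, starting at 1.
        f"{metric_id}_{position}": (float(count), str(value))
        for position, (value, count) in enumerate(most_common, start=1)
    }


# Hypothetical column content for metric id "1011" with targetNumber = 2.
named = top_n_results("1011", ["a", "b", "a", "c", "a", "b"], 2)
```

Here `named` holds the entries `1011_1` and `1011_2`, which is the form a composed metric would reference as `$1011_1` and `$1011_2`.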
Returns the approximate top N most frequent values in the column
Tip: This metric uses probabilistic data structures from Twitter's Algebird API. Use it when the column contains a large variety of values.
Parameters:
- optional Int maxCapacity: Size of aggregation collection
- Int targetNumber: Requested N
DataQuality also supports metrics that compare two columns of one source. Each of these metrics returns the number of tuples that did not pass the metric-specific constraint.
Such as:
- "COLUMN_EQ": Value of column A should be equal to value of column B
- "DAY_DISTANCE": Values of col.A and col.B should be in the provided Date format and distance between them should not exceed defined distance (in days)
- "LEVENSHTEIN_DISTANCE": Calculated Levenshtein distance between col.A and col.B should not exceed defined threshold.
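The "count the tuples that fail a constraint" idea behind these three metrics can be sketched in Python as follows. This is an illustration under assumptions, not the framework's code: the helper names are hypothetical, rows with unparseable dates are assumed to count as failures, and Python's `%Y-%m-%d` stands in for the Java-style pattern `yyyy-MM-dd`.

```python
from datetime import datetime


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def within_days(a, b, threshold, date_format="%Y-%m-%d"):
    """True if both values parse as dates and lie at most `threshold` days apart."""
    try:
        first = datetime.strptime(a, date_format)
        second = datetime.strptime(b, date_format)
    except (TypeError, ValueError):
        return False  # assumption: unparseable or missing values fail the check
    return abs((first - second).days) <= threshold


def count_failures(pairs, constraint):
    """Number of (col_a, col_b) tuples that do not satisfy the constraint."""
    return sum(1 for a, b in pairs if not constraint(a, b))


# Hypothetical rows of two columns.
dates = [("2020-01-01", "2020-01-03"), ("2020-01-01", "2020-01-04"), ("n/a", "2020-01-01")]
names = [("kitten", "sitting"), ("data", "date")]

failed_day_distance = count_failures(dates, lambda a, b: within_days(a, b, 2))
failed_levenshtein = count_failures(names, lambda a, b: levenshtein(a, b) <= 1)
failed_equality = count_failures(names, lambda a, b: a == b)
```

Each metric differs only in the constraint passed to `count_failures`, which mirrors how the three metrics share one configuration shape and differ in `params`.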
Example:
{
id: "DAY_DIST_id0"
name: "DAY_DISTANCE"
type: "COLUMN"
description: "Test example of DAY_DISTANCE"
config: {
file: "SURVEY_FEEDBACK"
columns: ["QUESTIONNAIRE_DATE","ANSWER_DATE"]
params: {
threshold:"2"
dateFormat:"yyyy-MM-dd"
}
}
}

Tip: Requires the "columns" field inside the config object
Returns the number of distinct values in the column (for small variance of data)
Parameters: none
Returns the approximate number of distinct values in the column (for small variance of data)
Parameters:
- optional Double accuracyError: Less is better
Returns the number of null values
Parameters: none
Returns the number of empty values
Parameters: none
Returns the minimum of the column
Parameters: none
Returns the maximum of the column
Parameters: none
Returns the sum of all numerical values in the column
Parameters: none
Returns the sum of all numerical values in the column
For financial use cases, this metric is recommended over SUM_NUMBER
Parameters: none
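A likely reason for that recommendation is binary floating-point rounding, which can be shown with a small Python sketch. It assumes that the recommended variant uses exact decimal arithmetic; the text above does not state this, and the sample amounts are made up.

```python
from decimal import Decimal

# Ten payments of 0.1, as they might appear in a text source (made-up data).
amounts = ["0.1"] * 10

float_sum = sum(float(a) for a in amounts)      # binary floating point
decimal_sum = sum(Decimal(a) for a in amounts)  # exact decimal arithmetic

print(float_sum == 1.0)                # False: rounding error accumulates
print(decimal_sum == Decimal("1.0"))   # True: the decimal sum is exact
```

Errors of this size rarely matter for descriptive statistics, but they can break equality checks on monetary totals.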
Returns the average of all numerical values in the column
Parameters: none
Returns the average of all numerical values in the column
For financial use cases, this metric is recommended over AVG_NUMBER
Parameters: none
Returns the standard deviation of all numerical values in the column
Parameters: none
Returns the standard deviation of all numerical values in the column
For financial use cases, this metric is recommended over STD_NUMBER
Parameters: none
Returns the string minimum of the column (mostly useful for comparing dates)
Parameters: none
Returns the string maximum of the column (mostly useful for comparing dates)
Parameters: none
Returns the average string length
Parameters: none
Returns the number of values in the provided format
Parameters:
- String dateFormat: Format for date
##### "FORMATTED_NUMBER"
Returns the number of values in the provided format
Parameters:
- Double precision: Provided precision
- Double scale: Provided scale
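As an illustration of what a precision/scale check can look like, here is a minimal Python sketch. It assumes the SQL-style reading of the two parameters (precision is the total number of digits, scale is the number of digits after the decimal point); the text above does not confirm this interpretation, and `fits_format` is a hypothetical helper name.

```python
from decimal import Decimal, InvalidOperation


def fits_format(value: str, precision: int, scale: int) -> bool:
    """True if `value` is a number that fits NUMERIC(precision, scale)."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return False  # not a number at all
    if not number.is_finite():
        return False  # NaN and infinities never match
    _, digits, exponent = number.as_tuple()
    fractional_digits = max(-exponent, 0)
    integer_digits = max(len(digits) + exponent, 0)
    # Assumption: integer part may use at most (precision - scale) digits.
    return fractional_digits <= scale and integer_digits <= precision - scale


# Hypothetical column content checked against precision 5, scale 2.
column = ["123.45", "1234.5", "0.5", "abc"]
formatted_count = sum(1 for v in column if fits_format(v, 5, 2))
```

With these inputs, "1234.5" is rejected because it has four integer digits and "abc" because it is not numeric.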
Returns the number of values in the provided format
Parameters:
- String length: Length for strings
Returns the number of castable values in the column
Parameters: none
Returns the number of numerical values inside the provided domain
Parameters:
- Set String domainSet: Domain for checking
Returns the number of numerical values outside the provided domain
Parameters:
- Set String domainSet: Domain for checking
Returns the number of values inside the provided domain
Parameters:
- Set String domainSet: Domain for checking (example {domainSet: "NST:VST:BTH:ANY"})
Returns the number of values outside the provided domain
Parameters:
- Set String domainSet: Domain for checking
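The domain metrics can be pictured with the following Python sketch. It assumes that `domainSet` is a colon-separated string, as in the example `"NST:VST:BTH:ANY"`, and that missing values count as outside the domain; the helper names are hypothetical.

```python
def parse_domain(domain_set: str) -> set:
    """Split a colon-separated domainSet string into a set of allowed values."""
    return set(domain_set.split(":"))


def count_in_domain(values, domain_set: str) -> int:
    """Number of values that belong to the domain."""
    domain = parse_domain(domain_set)
    return sum(1 for value in values if value in domain)


def count_out_domain(values, domain_set: str) -> int:
    """Number of values that do not belong to the domain (None included)."""
    domain = parse_domain(domain_set)
    return sum(1 for value in values if value not in domain)


# Hypothetical column content.
column = ["NST", "XXX", "ANY", None, "BTH"]
inside = count_in_domain(column, "NST:VST:BTH:ANY")
outside = count_out_domain(column, "NST:VST:BTH:ANY")
```

The two counts always add up to the number of rows, so a composed metric can turn either one into a percentage.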
Returns how many times the given string occurs in the column
Parameters:
- String compareValue: String for search
Returns how many times the given number occurs in the column
Parameters:
- Double compareValue: Number for search
Returns the median (second quartile) of a numerical column
Parameters:
- optional Double accuracyError: A smaller error gives a more accurate result
Returns the first quartile of a numerical column
Parameters:
- optional Double accuracyError: A smaller error gives a more accurate result
Returns the third quartile of a numerical column
Parameters:
- optional Double accuracyError: A smaller error gives a more accurate result
Returns a custom quantile
Parameters:
- optional Double accuracyError: A smaller error gives a more accurate result
- Double targetSideNumber: Provided value for quantile (should be in [0,1])
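The relationship between these quantile metrics can be shown with an exact Python computation. The framework returns approximations controlled by `accuracyError`, so this sketch only illustrates what is being estimated; the function name `quantile` and the linear interpolation between neighbouring values are assumptions.

```python
def quantile(values, q: float) -> float:
    """Exact quantile of `values` for q in [0, 1], with linear interpolation."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)        # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical column content.
column = [5, 1, 4, 2, 3]
first_quartile = quantile(column, 0.25)
median = quantile(column, 0.5)
third_quartile = quantile(column, 0.75)
```

The median and the first and third quartiles are the custom quantile evaluated at 0.5, 0.25 and 0.75 respectively.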
Returns a custom percentile
Parameters:
- optional Double accuracyError: A smaller error gives a more accurate result
- Double targetSideNumber: Provided value for percentile