Implementations of the $k$-Nearest Neighbors and AutoRegression
algorithms for anomaly detection (AD) on tabular and time-series data,
respectively.
What is Unsupervised AD?
Given a training set $D$ of $m$ vectors in $\mathbb{R}^n$, a contamination
hyperparameter $c \in [0, 0.5)$, and a testing set $D'$ of $m'$ vectors in
$\mathbb{R}^n$, an unsupervised AD algorithm $\mathcal{A}(D \mid D', c)$ will
classify each vector $\mathbf{d_i'} \in D'$ as normal or anomalous as follows:
Let $f_\mathcal{A} \colon \mathbb{R}^n \to \mathbb{R}$ map an $n$-dimensional
vector $\mathbf{v}$ to its corresponding anomaly score, which is an
algorithm-specific scalar that quantifies "how anomalous" $\mathbf{v}$ is.
Let $\mathbf{d_i} \in D$. Compute a vector of training anomaly scores,
$\mathbf{s} \in \mathbb{R}^m$, where the $i\text{th}$ component of $\mathbf{s}$
is equal to $f_\mathcal{A}(\mathbf{d_i})$.
Compute a vector $\mathbf{s_{max}}$ which contains the
${\lceil c \cdot m \rceil}$ highest anomaly scores in $\mathbf{s}$.
Define the threshold score, $s_t$, as the minimal anomaly score in
$\mathbf{s_{max}}$.
Using $s_t$, any $\mathbf{d_i'} \in D'$ such that
$f_\mathcal{A}(\mathbf{d_i'}) \geq s_t$ will be classified as anomalous, and any
$\mathbf{d_i'}$ such that $f_\mathcal{A}(\mathbf{d_i'}) < s_t$ will be
classified as normal.
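The thresholding procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names `threshold_score` and `classify` are hypothetical, the anomaly scores are made-up numbers, and the handling of $c = 0$ (where $\mathbf{s_{max}}$ would be empty) is an assumption of the sketch.

```python
import math

def threshold_score(train_scores, c):
    """Return s_t: the smallest of the ceil(c * m) highest training scores."""
    m = len(train_scores)
    n_top = math.ceil(c * m)
    if n_top == 0:
        # c = 0 leaves s_max empty; treating that as "nothing is anomalous"
        # is a choice made for this sketch only.
        return math.inf
    s_max = sorted(train_scores, reverse=True)[:n_top]
    return min(s_max)

def classify(test_scores, s_t):
    """True marks an anomalous vector (score >= s_t), False a normal one."""
    return [score >= s_t for score in test_scores]

# Made-up training scores for m = 10 vectors.
train_scores = [0.2, 0.1, 0.9, 0.4, 0.3, 0.25, 0.15, 0.35, 0.5, 0.05]
s_t = threshold_score(train_scores, c=0.2)   # s_max holds the 2 highest scores
labels = classify([0.45, 0.5, 1.2], s_t)
```

Because the comparison uses $\geq$, a testing score exactly equal to $s_t$ is classified as anomalous.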
$k$-Nearest Neighbors (KNN) for AD
Within the context of anomaly detection, the KNN algorithm defines its anomaly
score function $f_\mathcal{A}$ as the arithmetic mean of the $k$ shortest
Euclidean distances between an input vector $\mathbf{v}$ and the training
vectors. In other words, if we let $B$ be the set containing the $k$ training
vectors from $D$ which are closest to $\mathbf{v}$, then we may formally define
$f_\mathcal{A}$ as follows:

$$f_\mathcal{A}(\mathbf{v}) = \frac{1}{k} \sum_{\mathbf{b} \in B} \lVert \mathbf{v} - \mathbf{b} \rVert_2$$
The following visualization in $\mathbb{R}^2$, where $k=3$, illustrates that
points close to many training vectors have a small anomaly score, whereas
points far from most training vectors have a larger anomaly score.
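As a rough sketch of the KNN anomaly score described in this section, the following Python computes the mean Euclidean distance to the $k$ nearest training vectors. The function name `knn_anomaly_score` and the sample points are hypothetical, and a brute-force sort stands in for whatever neighbor search the implementation actually uses.

```python
import math

def knn_anomaly_score(v, train, k):
    """Mean Euclidean distance from v to its k nearest training vectors."""
    distances = sorted(math.dist(v, d) for d in train)
    return sum(distances[:k]) / k

# Four made-up training vectors on the unit square in R^2.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = knn_anomaly_score((0.5, 0.5), train, k=3)   # centre of the square
far = knn_anomaly_score((4.0, 5.0), train, k=3)    # well outside the square
```

Note that this sketch does not exclude a vector from its own neighbor set, so scoring a training vector against $D$ counts its zero distance to itself. Whether the original implementation does the same is not stated here.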
AutoRegression (AutoReg) for AD
The AutoRegression algorithm adapts the traditional linear regression model,
which is designed for tabular datasets, to operate on time-series data.
Theoretical background
The following steps formalize how the AutoReg algorithm performs anomaly
detection on a time-series dataset.
1. Given an ordered collection of real-valued, consecutive, and univariate time-series training data $X_1, X_2, \dots, X_n$ and a window-length hyperparameter $p$, compute an $(n - p) \times (p + 1)$ matrix, $D$, of the following form:
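A layout consistent with the stated $(n - p) \times (p + 1)$ shape would be the one below. It assumes that the last column is a column of ones paired with the bias term and that each window is written in chronological order; neither detail is confirmed here.

$$
D = \begin{bmatrix}
X_1 & X_2 & \cdots & X_p & 1 \\
X_2 & X_3 & \cdots & X_{p+1} & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X_{n-p} & X_{n-p+1} & \cdots & X_{n-1} & 1
\end{bmatrix}
$$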
2. Each row of the above matrix holds a $p$-dimensional vector $\mathbf{x_j}$, a window of length $p$ that is used to predict the next value in the time series, $X_{j}$, according to the following formula, where $a_i$ represents the $i$th least-squares regression coefficient, $c$ is the bias term, and $\epsilon_t$ represents the error term for the time-series point $X_t$:
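Under the assumption that $a_i$ multiplies the $i$th entry of a chronologically ordered window $(X_{t-p}, \dots, X_{t-1})$, the prediction formula can be written as:

$$X_t = \sum_{i=1}^{p} a_i X_{t-p+i-1} + c + \epsilon_t$$

Many references index the coefficients in the opposite direction and write $a_i X_{t-i}$ instead; the two forms differ only in how the coefficients are labeled.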
3. Compute regression coefficients $a_1, a_2, \dots, a_p$ and the bias term $c$ using least squares regression in accordance with the analytical solution:
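In its usual closed form, the least-squares solution is given by the normal equations. Writing $\mathbf{y} = (X_{p+1}, \dots, X_n)^\top$ for the values being predicted (a symbol introduced here for illustration) and assuming the last column of $D$ is a column of ones for the bias term, it reads:

$$\begin{bmatrix} a_1 & a_2 & \cdots & a_p & c \end{bmatrix}^\top = (D^\top D)^{-1} D^\top \mathbf{y}$$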
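4. Compute the vector of training error terms, $\boldsymbol{\epsilon}$, and take the absolute value of each error term, $|\epsilon_t|$, as the anomaly score of the point $X_t$. Note that $\boldsymbol{\epsilon}$ is an $(n - p)$-dimensional column vector, so the first $p$ points cannot be assigned anomaly scores (they lack the previous context necessary for calculating one).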
5. Let $d$ be a hyperparameter in the range $(0, 1)$ corresponding to the fraction of data points that should be considered anomalies. Define a list $\mathbf{x_d}$ which contains the $\lceil d \cdot n \rceil$ highest anomaly scores of the training data points. Define the threshold anomaly score, $|\epsilon_d|$, as the minimal anomaly score in $\mathbf{x_d}$. Using $|\epsilon_d|$, points with an anomaly score greater than or equal to $|\epsilon_d|$ will be classified as anomalous, and points with a lower anomaly score will be classified as normal.
6. Given an ordered collection of real-valued, consecutive, and univariate time-series testing data $X_1', X_2', \dots, X_m'$, where $m > p$, predict whether each testing point $X_i'$, where $p < i \leq m$, is anomalous or normal according to the following procedure:
Compute an $(m - p) \times (p + 1)$ matrix, $D'$, of the following form:
Compute an $(m - p)$-dimensional column vector of error terms, $\boldsymbol{\epsilon'}$, using matrix multiplication of $D'$ with the regression coefficients:
For each testing point $X_i'$ where $p < i \leq m$, compare the corresponding anomaly score $|\epsilon_i'|$ with $|\epsilon_d|$. If $|\epsilon_i'|$ is greater than or equal to $|\epsilon_d|$, classify $X_i'$ as anomalous. Otherwise, classify $X_i'$ as normal.
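The six steps can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions rather than the repository's code: the helper names (`window_matrix`, `solve`, `fit`, `anomaly_scores`, `threshold`) are hypothetical, the bias is modeled as a trailing column of ones, the normal equations are solved with a small Gauss-Jordan routine in place of a linear-algebra library, and the sample series are made up.

```python
import math

def window_matrix(series, p):
    """Return rows [X_{j-p}, ..., X_{j-1}, 1] and the matching targets X_j."""
    rows = [list(series[j - p:j]) + [1.0] for j in range(p, len(series))]
    targets = [series[j] for j in range(p, len(series))]
    return rows, targets

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    size = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(size):
        # Partial pivoting: use the largest remaining entry in this column.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("D^T D is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]

def fit(series, p):
    """Least-squares coefficients [a_1, ..., a_p, c] via the normal equations."""
    D, y = window_matrix(series, p)
    cols = range(p + 1)
    DtD = [[sum(row[i] * row[j] for row in D) for j in cols] for i in cols]
    Dty = [sum(row[i] * t for row, t in zip(D, y)) for i in cols]
    return solve(DtD, Dty)

def anomaly_scores(series, beta):
    """Absolute error |epsilon| for every point after the first p."""
    D, y = window_matrix(series, len(beta) - 1)
    return [abs(t - sum(b * x for b, x in zip(beta, row))) for row, t in zip(D, y)]

def threshold(scores, d, n):
    """Smallest of the ceil(d * n) highest training scores."""
    return min(sorted(scores, reverse=True)[:math.ceil(d * n)])

# Made-up training series with one small bump (the value 3).
train = [0, 1, 2, 2, 1, 0, 0, 1, 2, 3, 1, 0, 0, 1, 2, 2, 1, 0]
p, d = 2, 0.1
beta = fit(train, p)
train_scores = anomaly_scores(train, beta)
thr = threshold(train_scores, d, len(train))

# Made-up testing series with a spike (the value 9) at X'_8.
test = [0, 1, 2, 2, 1, 0, 0, 9, 2, 2, 1, 0]
flags = [score >= thr for score in anomaly_scores(test, beta)]  # flags[0] is X'_3
```

Since each score depends on the preceding window, a spike in the testing data also enters the windows of the next $p$ points and can raise their scores as well.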
Visualization
The animation below demonstrates each of the six steps previously outlined on
a small example of univariate time-series data.