Linear and Nonlinear Granger Causality using Machine Learning Techniques
mlcausality is a Python library for linear and nonlinear Granger causality analysis. Given time-series X and y, if the lags of both X and y provide a better prediction for the current value of y than the lags of y alone, then X is said to Granger cause y. Note that Granger causality is a misnomer: no actual causality is implied because Granger causality is entirely grounded in prediction.
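The definition above can be illustrated with a minimal sketch. This is not how mlcausality performs its test. The example below simulates data in which one lag of x drives y, fits a restricted model (one lag of y) and an unrestricted model (one lag of y and of x) by least squares without an intercept, and compares their in-sample mean squared errors. The simulated coefficients, the helper names `dot` and `mse`, and the single-lag setup are all assumptions made for illustration.

```python
import random

random.seed(0)
n = 300
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0] * n
for t in range(1, n):
    # y depends on its own previous value and on the previous value of x
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.1)

target = y[1:]   # current values of y
y_lag = y[:-1]   # one lag of y
x_lag = x[:-1]   # one lag of x


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def mse(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


# Restricted model: y_t = a * y_{t-1}
a_r = dot(y_lag, target) / dot(y_lag, y_lag)
mse_restricted = mse([a_r * v for v in y_lag], target)

# Unrestricted model: y_t = a * y_{t-1} + b * x_{t-1}, solved via the 2x2 normal equations
syy, sxx, sxy = dot(y_lag, y_lag), dot(x_lag, x_lag), dot(x_lag, y_lag)
byt, bxt = dot(y_lag, target), dot(x_lag, target)
det = syy * sxx - sxy ** 2
a_u = (byt * sxx - bxt * sxy) / det
b_u = (bxt * syy - byt * sxy) / det
mse_unrestricted = mse([a_u * v + b_u * w for v, w in zip(y_lag, x_lag)], target)

print(mse_restricted, mse_unrestricted)
```

In this simulated setting the unrestricted model predicts y far better, which is the pattern that a Granger causality test looks for.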
The mlcausality package provides a new way for establishing such Granger causal links using machine learning techniques. Thanks to the usage of the sign test and the Wilcoxon signed rank test, mlcausality is extremely flexible and can be used with a multitude of machine learning regressors. By default, kernel ridge regression is used, but the mlcausality package can use other regressors such as:
- support vector regressor (SVR)
- random forest regressor
- XGBoost regressor
- LightGBM regressor
- CatBoost regressor
- linear regressor
and more!
When used correctly, mlcausality has exhibited leading performance both in terms of accuracy and execution speed. With mlcausality, batteries are always included and preprocessing is typically not necessary. Have a look at the Usage guide below to get an overview of the powerful capabilities that mlcausality provides.
The following presents the easiest way to get mlcausality up and running on your local computer.
Installation requires Python and the pip package installer.
In order to function correctly mlcausality requires the following Python packages:
mlcausality also has the following optional prerequisites which you should install if you plan to use the relevant regressors:
There are several installation options depending on the number of dependencies that need to be installed. Note that, for all options below, missing and optional dependencies will be installed using pip, except for cuML which does not have a pip installation path; if you wish to use the cuML library you have to install it separately.
From the list of options below, choose the one that suits your needs best:
- To make a minimal install with just the core prerequisites, run:
  pip install mlcausality@git+https://github.com/WojtekFulmyk/mlcausality.git
- To install core prerequisites plus XGBoost, run:
  pip install mlcausality[xgboost]@git+https://github.com/WojtekFulmyk/mlcausality.git
- To install core prerequisites plus LightGBM, run:
  pip install mlcausality[lightgbm]@git+https://github.com/WojtekFulmyk/mlcausality.git
- To install core prerequisites plus CatBoost, run:
  pip install mlcausality[catboost]@git+https://github.com/WojtekFulmyk/mlcausality.git
- To install core prerequisites plus XGBoost, LightGBM and CatBoost, run:
  pip install mlcausality[all]@git+https://github.com/WojtekFulmyk/mlcausality.git
The mlcausality package provides the following functions:
- mlcausality: Tests whether X Granger-causes y. This is a low-level function that only tests whether X Granger-causes y but does not test whether y Granger-causes X.
- mlcausality_splits_loop: Runs mlcausality but for different train-test split combinations supplied by the user.
- bivariate_mlcausality: Runs mlcausality to test for all bivariate Granger causal relationships amongst the time-series passed to the parameter data.
- loco_mlcausality: Runs mlcausality to test for all Granger causal relationships amongst the time-series passed to the parameter data by successively leaving one column out (loco). This function tests for Granger causality in the presence of exogenous time-series whereas bivariate_mlcausality only tests for bivariate combinations.
- multireg_mlcausality: A multiregression analogue of mlcausality. Most users will probably never have to run multireg_mlcausality directly; rather, it is expected that multiloco_mlcausality will be run instead. Currently multireg_mlcausality only supports the kernel ridge regressor and the CatBoost regressor.
- multiloco_mlcausality: A multiregression analogue of loco_mlcausality. This function uses multireg_mlcausality under the hood and is therefore currently supported for the kernel ridge regressor and the CatBoost regressor only. If you would like to recover Granger causal connections for an entire network efficiently using kernel ridge regression this is the function you want to use.
Suppose you have just 2 time-series of equal length, X and y, and you would like to find out whether X Granger-causes y. Then you can run:
import mlcausality
import numpy as np
import pandas as pd
X = np.random.random([500,1])
y = np.random.random([500,1])
z = mlcausality.mlcausality(X=X,y=y,lag=5)
#print(z)
The p-values of the sign test and the Wilcoxon signed rank test are output to z (and stdout in some cases depending on the function and parameters chosen). Granger causality can be established on the basis of these p-values and your desired significance level. For instance, if you prefer the sign test over the Wilcoxon signed rank test and your desired significance level is 0.05, then if the p-value from the sign test is below 0.05 you would reject the null hypothesis of no causality and conclude that X Granger-causes y.
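To make the role of the sign test more concrete, here is a minimal, generic sketch of a one-sided sign test on paired prediction errors. It is an assumption that absolute errors are compared and that ties are dropped; this README does not describe how mlcausality constructs its test internally, and the function name `sign_test_pvalue` is hypothetical.

```python
from math import comb


def sign_test_pvalue(errors_restricted, errors_unrestricted):
    """One-sided sign test: is the unrestricted model's absolute error
    smaller than the restricted model's more often than chance (p = 0.5)?"""
    pairs = list(zip(errors_restricted, errors_unrestricted))
    wins = sum(abs(u) < abs(r) for r, u in pairs)    # unrestricted is better
    losses = sum(abs(u) > abs(r) for r, u in pairs)  # restricted is better
    n = wins + losses                                # ties are dropped
    if n == 0:
        return 1.0
    # P(K >= wins) for K ~ Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


errors_restricted = [1.0] * 10
errors_unrestricted = [0.5] * 9 + [1.5]  # unrestricted wins 9 times out of 10
p_value = sign_test_pvalue(errors_restricted, errors_unrestricted)
print(p_value)  # 11/1024, about 0.0107
```

With a significance level of 0.05, a p-value of this size would lead to rejecting the null hypothesis in this toy example.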
Note that both X and y can be multivariate, meaning that they can take multiple time-series. If X is multivariate then the Granger-causality test is run with respect to the lags of all time-series in X; in other words, the null hypothesis is that the time-series in X do not collectively Granger-cause y. If y is multivariate then the target time-series is the first column and all additional columns in y are exogenous time-series whose lags are kept in both the restricted and unrestricted models when conducting the Granger causality test.
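The column roles described above can be sketched as follows. This is a conceptual illustration only: the helper `lagged_rows` and the toy series are hypothetical, and the actual layout of the design matrices inside mlcausality may differ.

```python
def lagged_rows(columns, lag):
    """Build one feature row per time step t >= lag, holding the `lag`
    previous values of every column."""
    n = len(columns[0])
    rows = []
    for t in range(lag, n):
        row = []
        for col in columns:
            row.extend(col[t - lag:t])
        rows.append(row)
    return rows


target = [1, 2, 3, 4, 5, 6]               # first column of y
exog = [10, 20, 30, 40, 50, 60]           # additional column of y (exogenous)
x_cand = [100, 200, 300, 400, 500, 600]   # X, the candidate cause
lag = 2

# Restricted model: lags of the target and of the exogenous series only
restricted = lagged_rows([target, exog], lag)
# Unrestricted model: the same lags plus the lags of X
unrestricted = lagged_rows([target, exog, x_cand], lag)
# Values to predict; the first `lag` observations are lost
labels = target[lag:]

print(restricted[0], unrestricted[0], labels[0])
```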
Now suppose that, instead of just being interested in whether one time series Granger causes another, you would like to instead find all Granger-causal relationships amongst several time-series. In that case, you can run:
import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.multiloco_mlcausality(data, lags=[5,10])
print(z)
The above code will check for all Granger-causal connections amongst all time-series in data by successively leaving one column out in the restricted model. Note that the above code uses the multiloco_mlcausality multiregression function which will yield identical results to the loco_mlcausality function only if the regressor is kernel ridge (the default) but will do so significantly faster than loco_mlcausality.
The syntax of the mlcausality package is internally consistent. If you would like to use loco_mlcausality instead of multiloco_mlcausality for the code block above, just replace multiloco_mlcausality with loco_mlcausality to obtain an equivalent but slower solution. Moreover, if instead of finding Granger-causal relationships by leaving one column out you wanted to test for Granger-causal relationships in a bivariate fashion, you can replace multiloco_mlcausality with bivariate_mlcausality.
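The difference between the leave-one-column-out and the bivariate approach can be sketched by listing which columns enter the restricted and unrestricted models for each (candidate cause, target) pair. This is a conceptual sketch with hypothetical helper names; it says nothing about how the library orders or stores columns.

```python
def loco_column_sets(n_columns):
    """Leave-one-column-out: the unrestricted model sees all columns,
    the restricted model sees all columns except the candidate cause."""
    out = {}
    for target in range(n_columns):
        for cause in range(n_columns):
            if cause != target:
                unrestricted = list(range(n_columns))
                restricted = [c for c in range(n_columns) if c != cause]
                out[(cause, target)] = (restricted, unrestricted)
    return out


def bivariate_column_sets(n_columns):
    """Bivariate: only the target and the candidate cause are involved."""
    out = {}
    for target in range(n_columns):
        for cause in range(n_columns):
            if cause != target:
                out[(cause, target)] = ([target], [target, cause])
    return out


loco = loco_column_sets(5)
bivariate = bivariate_column_sets(5)
print(len(loco), loco[(0, 1)], bivariate[(0, 1)])
```

For 5 time-series both approaches examine 20 directed pairs; they differ only in whether the remaining series are kept in both models as exogenous information.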
The functions mlcausality and multireg_mlcausality largely share the same parameter spaces so in most cases calls to these two functions can be made with the same parameters.
The functions bivariate_mlcausality, loco_mlcausality and multiloco_mlcausality largely share the same parameter spaces so in most cases calls to these three functions can be made with the same parameters. Moreover, bivariate_mlcausality and loco_mlcausality admit the parameters that mlcausality accepts and multiloco_mlcausality admits the parameters that multireg_mlcausality accepts. So, for instance, if one wishes to call loco_mlcausality, which uses mlcausality internally, with a specific set of parameters that one would like to pass to the inner mlcausality function, then one simply needs to pass those parameters to the loco_mlcausality function:
import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.loco_mlcausality(data, lags=[5,10],
    regressor='catboostregressor')
print(z)
The above code recovers the whole network using CatBoost instead of kernel ridge (the default). Note that the parameter regressor is not defined for the loco_mlcausality function but it is defined for mlcausality, thus the parameter regressor is passed through to mlcausality.
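Parameter pass-through of this kind is commonly implemented in Python with keyword-argument forwarding. The sketch below shows the general pattern only; `inner_test` and `outer_network` are hypothetical stand-ins, and it is an assumption that mlcausality forwards parameters in a similar way.

```python
def inner_test(X, y, lag, regressor="krr"):
    """Hypothetical inner function that knows about `regressor`."""
    return {"lag": lag, "regressor": regressor}


def outer_network(data, lags, **kwargs):
    """Hypothetical outer function: it does not define `regressor` itself
    and forwards any extra keyword arguments to the inner function."""
    results = []
    for lag in lags:
        results.append(inner_test(X=data, y=data, lag=lag, **kwargs))
    return results


out = outer_network([[0.1, 0.2]], lags=[5, 10], regressor="catboostregressor")
print(out)
```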
mlcausality, mlcausality_splits_loop, bivariate_mlcausality and loco_mlcausality admit the following regressors:
- 'krr' : Kernel ridge regressor
- 'catboostregressor' : CatBoost regressor
- 'xgbregressor' : XGBoost regressor
- 'lgbmregressor' : LightGBM regressor
- 'randomforestregressor' : Random forest regressor
- 'cuml_randomforestregressor' : Random forest regressor using the cuML library
- 'linearregression' : Linear regressor
- 'classic' : Linear regressor in the classic sense (train == test == all data)
- 'svr' : Epsilon Support Vector Regressor
- 'nusvr' : Nu Support Vector Regressor
- 'cuml_svr' : Epsilon Support Vector Regressor using the cuML library
- 'knn' : Regression based on k-nearest neighbors
- 'gaussianprocessregressor' : Gaussian process regressor
- 'gradientboostingregressor' : Gradient boost regressor
- 'histgradientboostingregressor' : Histogram-based Gradient Boosting Regression Tree
- 'default' : kernel ridge regressor with the RBF kernel (default)
multireg_mlcausality and multiloco_mlcausality admit the following regressors:
- 'krr' : Kernel ridge regressor
- 'catboostregressor' : CatBoost regressor
- 'default' : kernel ridge regressor with the RBF kernel (default)
Note that the CatBoost regressor option in multireg_mlcausality (and by extension multiloco_mlcausality) uses a different objective (MultiRMSEWithMissingValues) than the objectives available for the CatBoost regressor in mlcausality-derived functions (most notably RMSE). Hence multiloco_mlcausality with regressor='catboostregressor' will not be identical to loco_mlcausality with regressor='catboostregressor'; for details read the CatBoost documentation.
Regressors can be called with regressor-specific parameters using the regressor_params option and they can be fitted with regressor-specific fit parameters using the regressor_fit_params option. For instance, the following recovers a network using CatBoost with 99 iterations (instead of the CatBoost default of 1000) and no verbosity:
import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.loco_mlcausality(data, lags=[5,10],
    regressor='catboostregressor',
    regressor_params={'iterations':99},
    regressor_fit_params={'verbose':False})
print(z)
mlcausality comes with batteries included and you will typically not have to engage in substantial preprocessing before using the package. All functions support the usage of scalers or transformers from the scikit-learn package at various stages of the Granger causality testing process:
- parameters scaler_init_1 and scaler_init_2 apply scalers to the input data and those transformations persist throughout the analysis. Predictions generated by the restricted and unrestricted models are not inverse-transformed and the Granger causality analysis will be performed on the raw (that is, non-inverse-transformed) predictions and errors.
- parameters scaler_prelogdiff_1, scaler_prelogdiff_2, scaler_postlogdiff_1, scaler_postlogdiff_2, scaler_postsplit_1 and scaler_postsplit_2 apply scalers to the input data and those transformations do not persist throughout the analysis. Predictions generated by the restricted and unrestricted models are inverse-transformed and the Granger causality analysis is performed on the inverse-transformed predictions and errors. The names of the parameters indicate the stage at which the transformation occurs with respect to taking a logdiff (see below) or splitting the dataset into a train and test set.
- parameter logdiff transforms, in a reversible way, the data by taking a log difference of all time-series. Predictions generated by the restricted and unrestricted models are inverse-transformed and the Granger causality analysis is performed on the inverse-transformed predictions and errors.
- parameters scaler_dm_1 and scaler_dm_2 apply scalers to the design matrices of the train and test data.
For additional clarity, note that the order in which transformations are applied is as follows: init --> prelogdiff --> logdiff --> postlogdiff --> (data split occurs into test and train) --> postsplit --> dm
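To illustrate what it means for the log difference to be reversible, here is a minimal pure-Python sketch. The function names are hypothetical and the way mlcausality implements the inverse transformation is not documented here; the sketch only shows that a series of strictly positive values can be recovered from its first value and its log differences.

```python
import math


def logdiff(series):
    """Log difference: log(x_t) - log(x_{t-1}); one observation is lost."""
    return [math.log(b) - math.log(a) for a, b in zip(series, series[1:])]


def inverse_logdiff(first_value, diffs):
    """Rebuild the original series from its first value and its log differences."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] * math.exp(d))
    return out


series = [2.0, 2.5, 2.2, 3.0]
diffs = logdiff(series)
recovered = inverse_logdiff(series[0], diffs)
print(diffs)
print(recovered)
```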
The following scalers and transformers are currently supported:
- 'maxabsscaler'
- 'minmaxscaler'
- 'powertransformer'
- 'quantiletransformer'
- 'robustscaler'
- 'standardscaler'
- 'normalizer' : Only available for scaler_dm_1 and scaler_dm_2.
Finally, note that parameters for the above scalers are available in parameters named *_params where * stands for the name of the scaler. So scaler_dm_1_params are the params for scaler_dm_1 etc.
The following usage example recovers a network by first applying a 'minmaxscaler' in the [2,3] range on the input data and then taking a log difference. Note that a scaler such as the MinMaxScaler is needed whenever the input data can contain zero or negative values, because a log difference is undefined for non-positive values; rescaling to the [2,3] range makes every value strictly positive:
import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.multiloco_mlcausality(data,
    lags=[5,10],
    scaler_prelogdiff_1='minmaxscaler',
    scaler_prelogdiff_1_params={'feature_range':(2,3)},
    logdiff=True)
print(z)
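The reason for rescaling before a log difference can be seen in a small pure-Python sketch. The helper `minmax_scale` is hypothetical and only mirrors the standard min-max formula for a single series; the toy data below deliberately contains negative and zero values.

```python
import math


def minmax_scale(values, feature_range=(2.0, 3.0)):
    """Linearly map values so that min(values) -> lower bound and
    max(values) -> upper bound of feature_range."""
    lo, hi = feature_range
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


raw = [-1.0, 0.0, 0.5, 1.0]

# A logarithm of a non-positive value is undefined
try:
    math.log(raw[0])
    raw_log_ok = True
except ValueError:
    raw_log_ok = False

scaled = minmax_scale(raw, feature_range=(2.0, 3.0))
log_diffs = [math.log(b) - math.log(a) for a, b in zip(scaled, scaled[1:])]
print(scaled)     # [2.0, 2.5, 2.75, 3.0]
print(log_diffs)
```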
Data can be split into a test and train set using the train_size or split or splits parameters, as appropriate.
If split(s) is None and train_size is a float then train_size indicates the fraction of the data on which training occurs with the rest of the data ending up in the test set. Moreover, if split(s) is None and train_size is an integer greater than 1 then train_size indicates the number of observations to include in the training set. Finally, if train_size is equal to 1 then the train set and the test set are identical and equal to all the available data.
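The rules above can be sketched with a small helper. The function `train_test_indices` is hypothetical; placing the training set at the start of the series and truncating fractional sizes with `int()` are assumptions made for illustration, not documented behavior of mlcausality.

```python
def train_test_indices(n_obs, train_size):
    """Return (train_indices, test_indices) following the train_size rules."""
    all_idx = list(range(n_obs))
    if train_size == 1:
        # Train and test are identical and equal to all available data
        return all_idx, all_idx
    if isinstance(train_size, float):
        # Fraction of the data used for training (rounding rule is an assumption)
        n_train = int(n_obs * train_size)
    else:
        # Integer greater than 1: number of training observations
        n_train = train_size
    return all_idx[:n_train], all_idx[n_train:]


print(train_test_indices(8, 0.75))  # ([0, 1, 2, 3, 4, 5], [6, 7])
print(train_test_indices(8, 5))     # ([0, 1, 2, 3, 4], [5, 6, 7])
print(train_test_indices(8, 1))     # both lists contain all 8 indices
```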
Other than controlling the data split using train_size one can instead provide a list of 2 lists to the split or splits parameter as appropriate. The first of the 2 lists provides the indices for the training set, while the second of the 2 lists provides the indices for the test set. All index lists must contain consecutive indices with no gaps or holes; otherwise lags will not be constructed correctly.
Note that both train and test, after lags are taken, always decrease in size by the number of lags. Moreover, if logdiff is True, an additional observation is lost from train and test because of the differencing operation.
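A short sketch of the two points above: checking that an index list is consecutive, and computing how many observations remain after lags (and optionally a log difference) are taken. Both helper names are hypothetical.

```python
def is_consecutive(indices):
    """True if every index is exactly one more than the previous one."""
    indices = list(indices)
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


def usable_rows(n_obs, lag, logdiff=False):
    """Observations left after losing `lag` rows to lagging and,
    if logdiff is used, one more row to differencing."""
    return n_obs - lag - (1 if logdiff else 0)


train_idx = list(range(0, 400))
test_idx = list(range(400, 500))

print(is_consecutive(train_idx), is_consecutive([0, 1, 3]))  # True False
print(usable_rows(len(train_idx), lag=5))                    # 395
print(usable_rows(len(test_idx), lag=5, logdiff=True))       # 94
```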
The following provides an example of how to correctly use the split parameter:
import mlcausality
import numpy as np
import pandas as pd
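from sklearn.model_selection import TimeSeriesSplit
data = np.random.random([500,5])
# TimeSeriesSplit yields (train indices, test indices) pairs of consecutive indices
tscv = TimeSeriesSplit()
splits = list(tscv.split(data))
# Use the first train/test pair
split = splits[0]
z = mlcausality.multiloco_mlcausality(data,
    lags=[5,10], split=split)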
print(z)
y_bounds_violation_sign_drop is an important Boolean parameter with implications for testing Granger causality using the sign test and the Wilcoxon signed rank test. If True, observations in the test set whose target values are outside [min(train), max(train)] are not used when calculating the test statistics and p-values of the sign and Wilcoxon tests (note: this also requires y_bounds_error to not be set to 'raise' in the mlcausality function). If False, then the sign and Wilcoxon test statistics and p-values are calculated using all observations in the test set. The default is set to True because some models, especially tree-based models, extrapolate very poorly outside the range of target values that were seen in train.
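The filtering rule described above can be sketched in a few lines. The helper `within_train_bounds` is hypothetical and only illustrates which test observations would be kept when the option is True; it is not the library's code.

```python
def within_train_bounds(y_train, y_test):
    """Mark test targets that lie inside [min(train), max(train)]."""
    lo, hi = min(y_train), max(y_train)
    return [lo <= v <= hi for v in y_test]


y_train = [1.0, 2.0, 3.0]
y_test = [0.5, 1.5, 3.0, 3.5]

mask = within_train_bounds(y_train, y_test)
kept = [v for v, keep in zip(y_test, mask) if keep]
print(mask)  # [False, True, True, False]
print(kept)  # [1.5, 3.0]
```

Only the kept observations would enter the sign and Wilcoxon tests in this sketch; the two values outside the training range are dropped.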
Less commonly used features are documented using help(). For instance, the following provides more information about the loco_mlcausality function inside an interactive Python shell:
import mlcausality
help(mlcausality.loco_mlcausality)
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.