
ruvilonix/mlcausality

 
 


mlcausality

Linear and Nonlinear Granger Causality using Machine Learning Techniques
Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License

About The Project

mlcausality is a Python library for linear and nonlinear Granger causality analysis. Given time-series X and y, if the lags of both X and y provide a better prediction for the current value of y than the lags of y alone, then X is said to Granger cause y. Note that Granger causality is a misnomer: no actual causality is implied because Granger causality is entirely grounded in prediction.
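The definition above can be illustrated with a minimal sketch that does not use mlcausality at all. It fits two ordinary least squares models with NumPy, one on the lags of y only (restricted) and one on the lags of both y and x (unrestricted), and compares their errors on a held-out block. The helper names `lagged` and `holdout_mse`, the simulated data and the 400-row split are assumptions made for illustration only.

```python
import numpy as np

def lagged(series, lag):
    """Return a matrix whose row t holds series[t-lag], ..., series[t-1]."""
    n = len(series)
    return np.column_stack([series[k:n - lag + k] for k in range(lag)])

rng = np.random.default_rng(0)
n, lag = 600, 2
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # y depends on its own past and on the past of x
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

target = y[lag:]
restricted = np.column_stack([np.ones(n - lag), lagged(y, lag)])  # lags of y only
unrestricted = np.column_stack([restricted, lagged(x, lag)])      # plus lags of x

split = 400  # first 400 rows are used for training, the rest for testing

def holdout_mse(design):
    """Fit OLS on the training rows and return the mean squared test error."""
    coef, *_ = np.linalg.lstsq(design[:split], target[:split], rcond=None)
    return np.mean((target[split:] - design[split:] @ coef) ** 2)

mse_restricted = holdout_mse(restricted)
mse_unrestricted = holdout_mse(unrestricted)
print(mse_restricted, mse_unrestricted)
```

Because x really does drive y in this simulation, the unrestricted model predicts far better, which is the pattern a Granger causality test looks for.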

The mlcausality package provides a new way of establishing such Granger causal links using machine learning techniques. Because it relies on the sign test and the Wilcoxon signed rank test, mlcausality is extremely flexible and can be used with a multitude of machine learning regressors. By default, kernel ridge regression is used, but the mlcausality package can use other regressors such as XGBoost, LightGBM and CatBoost, and more!

When used correctly, mlcausality has exhibited leading performance both in terms of accuracy and execution speed. With mlcausality, batteries are always included, and preprocessing is typically not necessary. Have a look at the Usage guide below to get an overview of the powerful capabilities that mlcausality provides.

(back to top)

Getting Started

The following presents the easiest way to get mlcausality up and running on your local computer.

Prerequisites

Installation requires Python and the pip package installer.

In order to function correctly mlcausality requires the following Python packages:

mlcausality also has the following optional prerequisites which you should install if you plan to use the relevant regressors:

Installation

There are several installation options depending on the number of dependencies that need to be installed. Note that, for all options below, missing and optional dependencies will be installed using pip, except for cuML which does not have a pip installation path; if you wish to use the cuML library you have to install it separately.

From the list of options below, choose the one that suits your needs best:

  1. To make a minimal install with just the core prerequisites, run:

    pip install mlcausality@git+https://github.com/WojtekFulmyk/mlcausality.git
  2. To install core prerequisites plus XGBoost, run:

    pip install mlcausality[xgboost]@git+https://github.com/WojtekFulmyk/mlcausality.git
  3. To install core prerequisites plus LightGBM, run:

    pip install mlcausality[lightgbm]@git+https://github.com/WojtekFulmyk/mlcausality.git
  4. To install core prerequisites plus CatBoost, run:

    pip install mlcausality[catboost]@git+https://github.com/WojtekFulmyk/mlcausality.git
  5. To install core prerequisites plus XGBoost, LightGBM and CatBoost, run:

    pip install mlcausality[all]@git+https://github.com/WojtekFulmyk/mlcausality.git

(back to top)

Usage

Overview of available functions

The mlcausality package provides the following functions:

  • mlcausality : Tests whether X Granger-causes y. This is a low-level function that tests one direction only; it does not test whether y Granger-causes X.
  • mlcausality_splits_loop : Runs mlcausality but for different train-test split combinations supplied by the user.
  • bivariate_mlcausality : Runs mlcausality to test for all bivariate Granger causal relationships amongst the time-series passed to the parameter data.
  • loco_mlcausality : Runs mlcausality to test for all Granger causal relationships amongst the time-series passed to the parameter data by successively leaving one column out (loco). This function tests for Granger causality in the presence of exogenous time-series whereas bivariate_mlcausality only tests for bivariate combinations.
  • multireg_mlcausality : A multiregression analogue of mlcausality. Most users will probably never have to run multireg_mlcausality directly; rather, it is expected that multiloco_mlcausality will be run instead. Currently multireg_mlcausality only supports the kernel ridge regressor and the CatBoost regressor.
  • multiloco_mlcausality : A multiregression analogue of loco_mlcausality. This function uses multireg_mlcausality under the hood and is therefore currently supported for the kernel ridge regressor and the CatBoost regressor only. If you would like to recover Granger causal connections for an entire network efficiently using kernel ridge regression this is the function you want to use.

Basic usage

Suppose you have just 2 time-series of equal length, X and y, and you would like to find out whether X Granger-causes y. Then you can run:

import mlcausality
import numpy as np
import pandas as pd
X = np.random.random([500,1])
y = np.random.random([500,1])
z = mlcausality.mlcausality(X=X,y=y,lag=5)
#print(z)

The p-values of the sign test and the Wilcoxon signed rank test are output to z (and stdout in some cases depending on the function and parameters chosen). Granger causality can be established on the basis of these p-values and your desired significance level. For instance, if you prefer the sign test over the Wilcoxon signed rank test and your desired significance level is 0.05, then if the p-value from the sign test is below 0.05 you would reject the null hypothesis of no Granger causality and conclude that X Granger-causes y.
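To make the decision rule concrete, here is a rough sketch of a one-sided sign test on paired prediction errors, written with the standard library only. It is not the package's implementation: the function name `sign_test_pvalue`, the use of absolute errors, the one-sided alternative and the example error values are all assumptions chosen for illustration.

```python
from math import comb

def sign_test_pvalue(err_restricted, err_unrestricted):
    """One-sided sign test: is the restricted error larger more often than chance?"""
    diffs = [abs(r) - abs(u) for r, u in zip(err_restricted, err_unrestricted)]
    diffs = [d for d in diffs if d != 0]  # ties carry no sign information
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)      # cases where the unrestricted model did better
    # Probability of at least `wins` successes in n fair coin flips
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Made-up test-set errors for the restricted and unrestricted models
err_r = [0.9, -1.1, 0.7, 1.3, -0.8, 1.0, -1.2, 0.6, 0.95, -1.05]
err_u = [0.1, -0.2, 0.3, 0.2, -0.1, 0.4, -0.3, 0.2, 0.15, -0.25]

p = sign_test_pvalue(err_r, err_u)
alpha = 0.05
print(p, "reject null" if p < alpha else "fail to reject null")
```

A test of this kind only needs paired errors from two fitted models and makes no assumption about the model class, which is consistent with the flexibility across regressors described earlier.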

Note that both X and y can be multivariate, meaning that they can take multiple time-series. If X is multivariate then the Granger-causality test is run with respect to the lags of all time-series in X; in other words, the null hypothesis is that the time-series in X do not collectively Granger-cause y. If y is multivariate then the target time-series is the first column and all additional columns in y are exogenous time-series whose lags are kept in both the restricted and unrestricted models when conducting the Granger causality test.
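The column layout described above can be sketched with plain NumPy. This is only an illustration of which lags enter which model, not the package's internal code; the helper `lag_block` and the array sizes are assumptions.

```python
import numpy as np

def lag_block(data, lag):
    """Stack lags 1..lag of every column of `data` side by side."""
    n = data.shape[0]
    return np.hstack([data[lag - k:n - k] for k in range(1, lag + 1)])

rng = np.random.default_rng(1)
lag = 3
X = rng.random((50, 2))   # two candidate series, tested jointly
y = rng.random((50, 3))   # column 0 is the target, columns 1-2 are exogenous

target = y[lag:, 0]
# Restricted model: lags of the target and of the exogenous series
restricted = lag_block(y, lag)
# Unrestricted model: the same columns plus the lags of every series in X
unrestricted = np.hstack([restricted, lag_block(X, lag)])

print(target.shape, restricted.shape, unrestricted.shape)
```

The exogenous columns of y appear in both design matrices, so any improvement of the unrestricted model can only come from the lags of X.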

Now suppose that, instead of just being interested in whether one time series Granger causes another, you would like to instead find all Granger-causal relationships amongst several time-series. In that case, you can run:

import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.multiloco_mlcausality(data, lags=[5,10])
print(z)

The above code will check for all Granger-causal connections amongst all time-series in data by successively leaving one column out in the restricted model. Note that the above code uses the multiloco_mlcausality multiregression function which will yield identical results to the loco_mlcausality function only if the regressor is kernel ridge (the default) but will do so significantly faster than loco_mlcausality.

The syntax of the mlcausality package is internally consistent. If you would like to use loco_mlcausality instead of multiloco_mlcausality for the code block above, just replace multiloco_mlcausality with loco_mlcausality to obtain an equivalent but slower solution. Moreover, if instead of finding Granger-causal relationships by leaving one column out you wanted to test for Granger-causal relationships in a bivariate fashion, you can replace multiloco_mlcausality with bivariate_mlcausality.
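The difference between leaving one column out and testing in a bivariate fashion comes down to which series contribute lags to each model. The sketch below enumerates those column sets for every ordered (cause, target) pair. The function names are hypothetical and the code illustrates the idea only; it says nothing about how the package organises its computations internally.

```python
from itertools import permutations

def loco_models(n_cols):
    """For each (cause, target) pair, list the columns whose lags enter each model."""
    models = {}
    for cause, target in permutations(range(n_cols), 2):
        unrestricted = list(range(n_cols))                    # lags of every series
        restricted = [c for c in unrestricted if c != cause]  # leave the candidate out
        models[(cause, target)] = (restricted, unrestricted)
    return models

def bivariate_models(n_cols):
    """Same enumeration, but only the two series of each pair are ever used."""
    models = {}
    for cause, target in permutations(range(n_cols), 2):
        models[(cause, target)] = ([target], [target, cause])
    return models

print(loco_models(3)[(0, 2)])       # does series 0 Granger-cause series 2?
print(bivariate_models(3)[(0, 2)])
```

In the leave-one-column-out setting the remaining series act as exogenous controls, whereas the bivariate setting ignores them entirely.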

Setting parameters

The functions mlcausality and multireg_mlcausality largely share the same parameter spaces so in most cases calls to these two functions can be made with the same parameters.

The functions bivariate_mlcausality, loco_mlcausality and multiloco_mlcausality largely share the same parameter spaces so in most cases calls to these three functions can be made with the same parameters. Moreover, bivariate_mlcausality and loco_mlcausality admit the parameters that mlcausality accepts and multiloco_mlcausality admits the parameters that multireg_mlcausality accepts. So, for instance, if one wishes to call loco_mlcausality, which uses mlcausality internally, with a specific set of parameters that one would like to pass to the inner mlcausality function, then one simply needs to pass those parameters to the loco_mlcausality function:

import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.loco_mlcausality(data, lags=[5,10],
    regressor='catboostregressor')
print(z)

The above code recovers the whole network using CatBoost instead of kernel ridge (the default). Note that the parameter regressor is not defined for the loco_mlcausality function but it is defined for mlcausality, thus the parameter regressor is passed through to mlcausality.
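One common way to obtain this kind of pass-through in Python is keyword-argument forwarding. The sketch below uses two hypothetical stand-in functions, `low_level_call` and `network_wrapper`; it shows the general pattern under the assumption that forwarding works through `**kwargs`, which the text above does not confirm.

```python
def low_level_call(X, y, lag, regressor="default", **other):
    """Hypothetical stand-in for a low-level test function."""
    return {"lag": lag, "regressor": regressor}

def network_wrapper(data, lags, **kwargs):
    """Hypothetical stand-in for a network-level wrapper.

    `regressor` is not one of its named parameters; it travels inside
    **kwargs and reaches low_level_call untouched.
    """
    results = []
    for lag in lags:
        results.append(low_level_call(X=data, y=data, lag=lag, **kwargs))
    return results

out = network_wrapper([[0.1, 0.2]], lags=[5, 10], regressor="catboostregressor")
print(out)
```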

Available regressors

mlcausality, mlcausality_splits_loop, bivariate_mlcausality and loco_mlcausality admit the following regressors:

multireg_mlcausality and multiloco_mlcausality currently admit the kernel ridge regressor and the CatBoost regressor only.

Note that the CatBoost regressor option in multireg_mlcausality (and by extension multiloco_mlcausality) uses a different objective (MultiRMSEWithMissingValues) than the objectives available for the CatBoost regressor in mlcausality-derived functions (most notably RMSE). Hence multiloco_mlcausality with regressor='catboostregressor' will not be identical to loco_mlcausality with regressor='catboostregressor'; for details read the CatBoost documentation.

Regressors can be called with regressor-specific parameters using the regressor_params option and they can be fitted with regressor-specific fit parameters using the regressor_fit_params option. For instance, the following recovers a network using CatBoost with 99 iterations (instead of the CatBoost default of 1000) and no verbosity:

import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.loco_mlcausality(data, lags=[5,10],
    regressor='catboostregressor',
    regressor_params={'iterations':99},
    regressor_fit_params={'verbose':False})
print(z)
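The split between the two dictionaries mirrors the usual estimator pattern: one set of options configures the regressor when it is created, the other is supplied when it is fitted. The sketch below uses a made-up `ToyRegressor` class to show where each dictionary would end up; it is an assumption about the general pattern, not a description of the package's internals.

```python
class ToyRegressor:
    """Hypothetical regressor used only to show where each dict ends up."""

    def __init__(self, iterations=1000):
        self.iterations = iterations   # set at construction time

    def fit(self, X, y, verbose=True):
        self.verbose = verbose         # set at fit time
        return self

regressor_params = {"iterations": 99}        # constructor options
regressor_fit_params = {"verbose": False}    # fit options

model = ToyRegressor(**regressor_params).fit([[0.0]], [0.0], **regressor_fit_params)
print(model.iterations, model.verbose)
```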

Data preprocessing

mlcausality comes with batteries included and you will typically not have to engage in substantial preprocessing before using the package. All functions support the usage of scalers or transformers from the scikit-learn package at various stages of the Granger causality testing process:

  • parameters scaler_init_1 and scaler_init_2 apply scalers to the input data and those transformations persist throughout the analysis. Predictions generated by the restricted and unrestricted models are not inverse-transformed and the Granger causality analysis is performed on the raw (that is, non-inverse-transformed) predictions and errors.
  • parameters scaler_prelogdiff_1, scaler_prelogdiff_2, scaler_postlogdiff_1, scaler_postlogdiff_2, scaler_postsplit_1 and scaler_postsplit_2 apply scalers to the input data and those transformations do not persist throughout the analysis. Predictions generated by the restricted and unrestricted models are inverse-transformed and the Granger causality analysis is performed on the inverse-transformed predictions and errors. The names of the parameters indicate the stage at which the transformation occurs with respect to taking a logdiff (see below) or splitting the dataset into a train and test set.
  • parameter logdiff transforms, in a reversible way, the data by taking a log difference of all time-series. Predictions generated by the restricted and unrestricted models are inverse-transformed and the Granger causality analysis is performed on the inverse-transformed predictions and errors.
  • parameters scaler_dm_1 and scaler_dm_2 apply scalers to the design matrices of the train and test data.

For additional clarity, note that the order in which transformations are applied is as follows: init --> prelogdiff --> logdiff --> postlogdiff --> (data split occurs into test and train) --> postsplit --> dm

The following scalers and transformers are currently supported:

Finally, note that parameters for the above scalers are available in parameters named *_params where * stands for the name of the scaler. So scaler_dm_1_params are the params for scaler_dm_1 etc.

The following usage example recovers a network by first applying a 'minmaxscaler' in the [2,3] range on the input data and then taking a log difference. The scaler is used here because a log difference requires strictly positive data: the random input lies in [0,1) and can contain values at or very close to zero, and real data may also contain negative values:

import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
z = mlcausality.multiloco_mlcausality(data,
    lags=[5,10],
    scaler_prelogdiff_1='minmaxscaler',
    scaler_prelogdiff_1_params={'feature_range':(2,3)},
    logdiff=True)
print(z)
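To see what the scale-then-logdiff step does to the data, and why it is reversible, the following NumPy sketch reproduces the two transformations by hand and then undoes them. It applies the standard min-max formula directly rather than calling scikit-learn, and it is an illustration only; the variable names are assumptions and the package's own code may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.random((500, 5))   # values in [0, 1), possibly very close to 0

# Min-max scale each column into [2, 3]
lo, hi = data.min(axis=0), data.max(axis=0)
scaled = 2 + (data - lo) / (hi - lo)

# Log difference: one observation is lost per series
logdiff = np.diff(np.log(scaled), axis=0)

# Invert the log difference starting from the first scaled row,
# then undo the scaling to get back to the original units
rebuilt_scaled = np.exp(np.log(scaled[0]) + np.cumsum(logdiff, axis=0))
rebuilt = (rebuilt_scaled - 2) * (hi - lo) + lo

print(logdiff.shape, np.allclose(rebuilt, data[1:]))
```

Since every scaled value is at least 2, the logarithm is always finite, and the inverse recovers the original observations from the second row onwards.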

Data splits

Data can be split into a test and train set using the train_size or split or splits parameters, as appropriate.

If split(s) is None and train_size is a float then train_size indicates the fraction of the data on which training occurs with the rest of the data ending up in the test set. Moreover, if split(s) is None and train_size is an integer greater than 1 then train_size indicates the number of observations to include in the training set. Finally, if train_size is equal to 1 then the train set and the test set are identical and equal to all the available data.
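The three train_size cases can be summarised in a small helper. This is a sketch of the rules as stated above, not the package's code: the function name `resolve_train_size` is hypothetical, and truncating the fractional case with `int()` is an assumption.

```python
def resolve_train_size(n_obs, train_size):
    """Return (train_indices, test_indices) following the rules described above."""
    if train_size == 1:
        # Train and test are identical and cover all the data
        everything = list(range(n_obs))
        return everything, everything
    if isinstance(train_size, float):
        n_train = int(n_obs * train_size)   # fraction of the data; rounding is an assumption
    else:
        n_train = train_size                # integer > 1: number of training observations
    return list(range(n_train)), list(range(n_train, n_obs))

train, test = resolve_train_size(500, 0.8)
print(len(train), len(test))
```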

Other than controlling the data split using train_size, one can instead provide a list of 2 lists to the split or splits parameter as appropriate. The first of the 2 lists provides the indices for the training set, while the second of the 2 lists provides the indices for the test set. All index lists must contain consecutive indices with no gaps or holes, otherwise lags will not be constructed correctly.

Note that both train and test, after lags are taken, always decrease in size by the number of lags. Moreover, if logdiff is True, an additional observation is lost from train and test because of the differencing operation.
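Both constraints are easy to check before calling the package. The two helpers below are hypothetical and simply restate the rules above in code: one verifies that an index list has no gaps, the other computes how many rows remain in a train or test block after lagging and optional differencing.

```python
def is_consecutive(indices):
    """True when the indices increase by exactly 1 at every step."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

def usable_rows(n_obs, lag, logdiff=False):
    """Rows left in a train or test block after lagging (and differencing)."""
    return n_obs - lag - (1 if logdiff else 0)

train_idx = list(range(0, 85))
test_idx = list(range(85, 168))

print(is_consecutive(train_idx), is_consecutive(test_idx))
print(usable_rows(len(train_idx), lag=10), usable_rows(len(test_idx), lag=10, logdiff=True))
```

A block must therefore contain more observations than the largest lag (plus one if logdiff is True), otherwise no rows are left to fit or evaluate on.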

The following provides an example of how to correctly use the split parameter:

import mlcausality
import numpy as np
import pandas as pd
data = np.random.random([500,5])
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit()
splits = list(tscv.split(data))
split=splits[0]
z = mlcausality.multiloco_mlcausality(data,
    lags=[5,10], split=split)
print(z)

Other important parameters

y_bounds_violation_sign_drop is an important Boolean parameter with implications for testing Granger causality using the sign test and the Wilcoxon signed rank test. If True, observations in the test set whose target values are outside [min(train), max(train)] are not used when calculating the test statistics and p-values of the sign and Wilcoxon tests (note: this also requires y_bounds_error to not be set to 'raise' in the mlcausality function). If False, then the sign and Wilcoxon test statistics and p-values are calculated using all observations in the test set. The default is set to True because some models, especially tree-based models, extrapolate very poorly outside the range of target values that were seen in train.
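The filtering rule can be pictured as a boolean mask over the test targets. The NumPy sketch below uses made-up values and illustrates the rule as described, not the package's own code.

```python
import numpy as np

y_train = np.array([1.0, 2.5, 3.0, 2.0])
y_test = np.array([0.5, 1.5, 2.9, 3.4, 3.0])

# Keep only test targets inside [min(train), max(train)]
in_bounds = (y_test >= y_train.min()) & (y_test <= y_train.max())
kept = y_test[in_bounds]

print(in_bounds)
print(kept)
```

Here 0.5 and 3.4 fall outside the range seen in training, so the corresponding observations would be left out of the sign and Wilcoxon statistics when the parameter is True.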

Additional help and documentation

Less commonly used features are documented using help(). For instance, the following provides more information about the loco_mlcausality function inside an interactive Python shell:

import mlcausality
help(mlcausality.loco_mlcausality)

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)
