2) Algorithms

Rémi Sultan edited this page Oct 11, 2021 · 20 revisions

Regression

Linear Regression

    var df = Dataframes.create(
        new Column<>("y",
            rangeClosed(0, 10).mapToDouble(l -> l).boxed().collect(toList())),
        new Column<>("x",
            rangeClosed(0, 10).mapToDouble(l -> l + (nextBoolean() ? 1 : -1) * nextDouble(0, 0.5))
                .boxed()
                .collect(toList())),
        new Column<>("x²",
            rangeClosed(0, 10).mapToDouble(l -> l * l + (nextBoolean() ? 1 : -1) * nextDouble(0, 0.5))
                .boxed()
                .collect(toList()))
    );

    var linearRegression = new LinearRegression()
        .setResponseVariableName("y")
        .setPredictorNames("x", "x²")
        .train(df)
        .showMetrics();

    linearRegression.predict(df).tail();
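
To make the idea behind the fit concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Java. It is not the library's implementation: the class `SimpleLeastSquares` and its `fit` method are hypothetical names used only for illustration, and the example above fits two predictors (`x` and `x²`) rather than one.

```java
import java.util.Arrays;

final class SimpleLeastSquares {

    /** Returns {intercept, slope} of the line y = intercept + slope * x that minimises the squared error. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().orElseThrow();
        double meanY = Arrays.stream(y).average().orElseThrow();
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x must not be constant");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    public static void main(String[] args) {
        double[] line = fit(new double[] {0, 1, 2, 3}, new double[] {1, 3, 5, 7});
        System.out.println("intercept=" + line[0] + ", slope=" + line[1]);
    }
}
```

With several predictors the same criterion is solved with matrix algebra or gradient descent instead of this closed form.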

Classification

We're using the IRIS dataset:

Logistic Regression

Here's how you can perform Logistic regression.

        // We are using the IRIS dataset
        var df = Dataframes.csv("/path/to/iris.data", ",", false);
        
        // Number of training iterations
        int epoch = 100;
        // Learning rate
        double lr = 0.1;
        // Loss offset for Gradient Descent
        int alpha = 100;
        // Regularization (RIDGE or LASSO; the default is no regularization)
        Regularization ridge = RIDGE;
        // Regularization parameter
        double lambda = 0.019;
        // Label for Logistic regression, other labels will be annotated "Not"+ label
        String label = "Iris-setosa"; 
        var setosaRegression = new LogisticRegression(epoch, lr)
                .setResponseVariableName("c4")
                .setPredictorNames("c0", "c1", "c2", "c3")
                .setLossAccuracyOffset(alpha)
                .setRegularization(ridge)
                .setLambda(lambda)
                .setLabel(label)
                .train(df);
        setosaRegression.getHistory().tail();
        setosaRegression.predict(df).show(2000);
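
The parameters above (learning rate, ridge regularization, lambda) can be illustrated with one batch gradient-descent update for a binary logistic model. This is a sketch under assumptions, not the library's code: `LogisticStepSketch` is a hypothetical class, labels are assumed to be encoded as 1 for the chosen label and 0 for every "Not" + label row, and the bias term is assumed to be left unpenalised.

```java
final class LogisticStepSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * One batch gradient-descent update with an L2 (ridge) penalty.
     * w[0] is the bias; w[j + 1] is the weight of feature j.
     */
    static double[] step(double[][] x, int[] y, double[] w, double lr, double lambda) {
        int n = x.length;
        int d = w.length - 1;
        double[] grad = new double[w.length];
        for (int i = 0; i < n; i++) {
            double z = w[0];
            for (int j = 0; j < d; j++) {
                z += w[j + 1] * x[i][j];
            }
            double error = sigmoid(z) - y[i]; // predicted probability minus true label
            grad[0] += error;
            for (int j = 0; j < d; j++) {
                grad[j + 1] += error * x[i][j];
            }
        }
        double[] next = new double[w.length];
        next[0] = w[0] - lr * grad[0] / n; // bias: no penalty (assumption)
        for (int j = 1; j <= d; j++) {
            next[j] = w[j] - lr * (grad[j] / n + lambda * w[j]);
        }
        return next;
    }

    public static void main(String[] args) {
        double[] w = step(new double[][] {{1}, {-1}}, new int[] {1, 0}, new double[] {0, 0}, 0.1, 0.0);
        System.out.println("bias=" + w[0] + ", weight=" + w[1]);
    }
}
```

Repeating `step` for the chosen number of epochs gives the training loop; a LASSO penalty would replace `lambda * w[j]` with `lambda * Math.signum(w[j])`.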

Softmax Regression

        // The difference with Logistic regression is that
        // Softmax regression handles several labels at a time

        var df = Dataframes.csv("/path/to/iris.data", ",", false);

        var softmaxRegression = new SoftmaxRegression(1000, 0.1)
                .setResponseVariableName("c4")
                .setPredictorNames("c0", "c1", "c2", "c3")
                .setRegularization(RIDGE)
                .setLambda(0.0014)
                .setLossAccuracyOffset(100)
                .train(df);
        softmaxRegression.getHistory().tail();
        softmaxRegression.predict(df).show(2000);

        softmaxRegression = new SoftmaxRegression(1000, 0.1)
                .setResponseVariableName("c4")
                .setPredictorNames("c0", "c1", "c2", "c3")
                .setRegularization(LASSO)
                .setLambda(0.0014)
                .setLossAccuracyOffset(100)
                .train(df);
        softmaxRegression.getHistory().tail();
        softmaxRegression.predict(df).show(2000);
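
What lets softmax regression handle several labels at once is the softmax function, which turns one score per label into a probability distribution. The sketch below is a generic, numerically stable version in plain Java; `SoftmaxSketch` is a hypothetical class and is not taken from the library.

```java
final class SoftmaxSketch {

    /** Converts raw scores into probabilities that sum to 1. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max); // subtracting the max avoids overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(softmax(new double[] {1.0, 2.0, 3.0})));
    }
}
```

The predicted label is the one with the highest probability, here one of the three Iris species.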

Clustering

Still using the IRIS dataset

Kmeans

    var df = Dataframes
        .csv("/path/to/iris.data", ",", false)
        .mapWithout("c4");

    // There are 3 species of Iris, so we ask for 3 clusters
    int K = 3;
    // Training on 10 iterations
    int epochs = 10;
    // Strategy of initialisation (RANDOM or PLUS_PLUS)
    InitialisationStrategy strategy = PLUS_PLUS;

    // var kMeans = new KMeans(K, epochs); will default to PLUS_PLUS strategy
    var kMeans = new KMeans(K, epochs, strategy).train(df).showMetrics();
    kMeans.predict(df).show(20);
    System.out.println(kMeans.getCentroids());

Kmedians

The same dataset can be used with KMedians. The only difference from KMeans is the distance computation: KMedians uses the Manhattan distance, whereas KMeans uses the Euclidean distance.

    int K = 3;
    int epochs = 10;
    InitialisationStrategy strategy = PLUS_PLUS;

    var kMedians = new KMedians(K, epochs, strategy);
    kMedians.train(df).showMetrics();

    kMedians.predict(df).show(20);
    System.out.println(kMedians.getCentroids());
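
The two distances mentioned for KMeans and KMedians can be written in a few lines. This is a generic sketch; `DistanceSketch` is a hypothetical helper, not a class of the library.

```java
final class DistanceSketch {

    /** Straight-line distance: square root of the sum of squared differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of absolute differences along each axis. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {0, 0};
        double[] b = {3, 4};
        System.out.println("euclidean=" + euclidean(a, b) + ", manhattan=" + manhattan(a, b));
    }
}
```

The median minimises the sum of absolute deviations, which is why the Manhattan distance pairs with medians while the Euclidean distance pairs with means.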

KMedoid Evaluator

This evaluator helps you find the best number of clusters for your KMeans / KMedians.

    var df = Dataframes.csv("/path/to/iris.data", ",", false);

    var kmedoidEvaluator = new KMedoidEvaluator(1, 14, MEDIAN);
    kmedoidEvaluator.evaluate(df).show(20);

The resulting dataframe holds the Weighted Sum Of Squares obtained for each number of clusters tested.

You can then apply the Elbow Method to determine the optimal number of clusters; the final choice remains a matter of interpretation.

An example of chart here
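
As an illustration of what the Elbow Method compares, the sketch below computes a sum of squared distances between points and their assigned centroid. It assumes the evaluator's metric is the usual within-cluster sum of squares; the page does not spell out the exact formula, and `WithinClusterSumOfSquares` is a hypothetical name.

```java
final class WithinClusterSumOfSquares {

    /** Sums the squared Euclidean distance of every point to the centroid of its cluster. */
    static double wss(double[][] points, int[] assignment, double[][] centroids) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            double[] centroid = centroids[assignment[i]];
            for (int j = 0; j < points[i].length; j++) {
                double diff = points[i][j] - centroid[j];
                total += diff * diff;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0}, {2}, {10}, {12}};
        double oneCluster = wss(points, new int[] {0, 0, 0, 0}, new double[][] {{6}});
        double twoClusters = wss(points, new int[] {0, 0, 1, 1}, new double[][] {{1}, {11}});
        System.out.println("K=1: " + oneCluster + ", K=2: " + twoClusters);
    }
}
```

The value drops sharply while adding a cluster still separates real groups and flattens afterwards; the bend in that curve is the "elbow".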

Mean Shift

    var df = Dataframes.csv("/path/to/iris.data", ",", false);

    int bandwidth = 2;
    int epochs = 20;
    var meanShift = new MeanShift(bandwidth, epochs).train(df);
    meanShift.predict(df).tail();
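
To show what the bandwidth controls, here is a sketch of a single mean-shift update with a flat kernel: a point moves to the mean of all points lying within the bandwidth. The library's kernel and stopping rule are not documented here, so treat this as an assumption; `MeanShiftStepSketch` is a hypothetical class.

```java
final class MeanShiftStepSketch {

    /** Moves a point to the mean of the data points within the given bandwidth (flat kernel). */
    static double[] shift(double[] point, double[][] data, double bandwidth) {
        double[] sum = new double[point.length];
        int count = 0;
        for (double[] other : data) {
            double squared = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = other[j] - point[j];
                squared += diff * diff;
            }
            if (Math.sqrt(squared) <= bandwidth) {
                for (int j = 0; j < point.length; j++) {
                    sum[j] += other[j];
                }
                count++;
            }
        }
        if (count == 0) {
            return point.clone(); // no neighbour in range: the point stays where it is
        }
        for (int j = 0; j < sum.length; j++) {
            sum[j] /= count;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {1}, {2}, {10}};
        System.out.println(shift(new double[] {0}, data, 2.0)[0]);
    }
}
```

Applying the update repeatedly makes points drift toward dense regions; points that end up at the same location form a cluster.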

Median Shift

    var df = Dataframes.csv("/path/to/iris.data", ",", false);

    int bandwidth = 3;
    int epochs = 20;
    var medianShift = new MedianShift(bandwidth, epochs).train(df);
    medianShift.predict(df).tail();

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) defines clusters based on the density of points in the dataset. Still using the IRIS dataset:

    var df = Dataframes.csv("/path/to/iris.data", ",", "\"", false);

    // Radius to search around the evaluated point
    int radius = 1;
    // Minimum number of neighbours around the evaluated point
    int minSamples = 5;
    var predicted = new DBSCAN(radius, minSamples)
        .predict(df.mapWithout("c4"));

    predicted.addColumn(df.getColumns()[4]).show(150);
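
The two parameters above drive the central test of DBSCAN: a point is a "core point" when enough neighbours lie within the radius. The sketch below shows that test only. It assumes the point itself is not counted among its neighbours, which varies between implementations, and `CorePointSketch` is a hypothetical class.

```java
final class CorePointSketch {

    /** Returns true when at least minSamples other points lie within the radius of data[index]. */
    static boolean isCorePoint(int index, double[][] data, double radius, int minSamples) {
        int neighbours = 0;
        for (int i = 0; i < data.length; i++) {
            if (i == index) {
                continue; // assumption: a point is not its own neighbour
            }
            double squared = 0.0;
            for (int j = 0; j < data[i].length; j++) {
                double diff = data[i][j] - data[index][j];
                squared += diff * diff;
            }
            if (Math.sqrt(squared) <= radius) {
                neighbours++;
            }
        }
        return neighbours >= minSamples;
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {0.5}, {1}, {5}};
        System.out.println(isCorePoint(0, data, 1.0, 2) + " " + isCorePoint(3, data, 1.0, 2));
    }
}
```

Core points that are within reach of each other are merged into one cluster, and points reachable from no core point are labelled as noise.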

Trees

Decision Trees

You can build a decision tree to predict either a class (classification) or a value (regression).

    int treeDepth = 5;
    var impurityStrategy = ImpurityStrategy.GINI;
    var decisionTreeClassifier = new DecisionTreeClassifier(treeDepth, impurityStrategy);
    var dataframe = Dataframes.csvTrainTest("/path/to/iris.data", ",", "\"", false).shuffle();
    var dfSplit = dataframe.setSplitValue(0.4).split();
    decisionTreeClassifier.setResponseVariableName("c4").train(dfSplit.train());
    var newDf = decisionTreeClassifier.predict(dfSplit.test());
    newDf.show(0, 150);
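
The impurity strategy decides how a candidate split is scored. As a generic illustration (not the library's code; `ImpuritySketch` is a hypothetical class), Gini impurity and entropy can be computed from the class counts of a node as follows.

```java
final class ImpuritySketch {

    /** Gini impurity: 1 minus the sum of squared class proportions. */
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** Shannon entropy in bits: minus the sum of p * log2(p) over non-empty classes. */
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println("gini=" + gini(new int[] {5, 5}) + ", entropy=" + entropy(new int[] {5, 5}));
    }
}
```

Both measures are 0 for a pure node and largest when classes are evenly mixed; a tree picks the split that lowers the weighted impurity of the children the most.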

Here we are using the wine quality dataset:

    int treeDepth = 5;
    var decisionTreeRegression = new DecisionTreeRegressor(treeDepth);
    var dataframe = Dataframes.csvTrainTest("/path/to/winequality-red.csv", ";").shuffle();
    var dfSplit = dataframe.setSplitValue(0.5).split();
    decisionTreeRegression
        .setResponseVariableName("quality")
        .train(dfSplit.train());
    var predictionDf = decisionTreeRegression.predict(dfSplit.test());
    predictionDf.show(0, 15000);

Random Forests

Random forests expose the same API as decision trees. They reduce the variance of your model while keeping bias low by training several decision trees and combining their predictions.

    int numberOfEstimators = 100;
    var impurityStrategy = ENTROPY;
    var randomForestClassifier = new RandomForestClassifier(numberOfEstimators, impurityStrategy)
        .setTreeDepth(5)
        .setSampleSizeRatio(0.4)
        .setResponseVariableName("c4");
    var dataframe = Dataframes.csvTrainTest("/path/to/iris.data", ",", "\"", false).shuffle();
    var dfSplit = dataframe.setSplitValue(0.4).split();
    randomForestClassifier.train(dfSplit.train());
    var newDf = randomForestClassifier.predict(dfSplit.test());
    newDf.show(0, 150);

For regression, here with the wine quality dataset:

    int numberOfEstimators = 10;
    var randomForestRegression = new RandomForestRegressor(numberOfEstimators)
        .setTreeDepth(3)
        .setSampleFeatureSize(2)
        .setSampleSizeRatio(0.4);
    var dataframe = Dataframes.csvTrainTest("/path/to/winequality-red.csv", ";").shuffle();
    var dfSplit = dataframe.setSplitValue(0.5).split();
    randomForestRegression
        .setResponseVariableName("quality")
        .train(dfSplit.train());
    var predictions = randomForestRegression.predict(dfSplit.test());
    predictions.show(0, 15000);
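
How a forest combines its trees can be sketched in a few lines: a classifier typically takes a majority vote and a regressor typically averages. The library's exact aggregation is not shown on this page, so this is an assumption; `ForestAggregationSketch` is a hypothetical class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ForestAggregationSketch {

    /** Classification: returns the label predicted by the most trees (ties are broken arbitrarily). */
    static String majorityVote(List<String> treePredictions) {
        if (treePredictions.isEmpty()) {
            throw new IllegalArgumentException("at least one prediction is required");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String label : treePredictions) {
            counts.merge(label, 1, Integer::sum);
        }
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Regression: returns the mean of the values predicted by the trees. */
    static double average(double[] treePredictions) {
        double sum = 0.0;
        for (double value : treePredictions) {
            sum += value;
        }
        return sum / treePredictions.length;
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(List.of("Iris-setosa", "Iris-virginica", "Iris-setosa")));
        System.out.println(average(new double[] {5.0, 6.0, 7.0}));
    }
}
```

Because each tree is trained on a different random sample, their individual errors partly cancel out in the vote or the average.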

Isolation Forests

Isolation Forests are an unsupervised ensemble technique built from trees, like random forests. They can be used to detect anomalies within a dataset or to remove outliers before fitting another model.

You can use the http dataset (there is an http_reduced.csv for faster training and evaluation).

    var df = Dataframes.csv("/path/to/", ",", "\"", true);
    var trainTestDataframe = Dataframes.trainTest(df.getColumns()).setSplitValue(0.5);

    int nbTrees = 200;
    var model = new IsolationForest(nbTrees).train(df.mapWithout("attack"));
    // Here we are using the TPRThresholdEvaluator in order to find the best threshold for our model
    // But you can directly set your own threshold with the setAnomalyThreshold method
    var evaluator = new TPRThresholdEvaluator("attack", "anomalies").setDesiredTPR(0.99).setLearningRate(0.02);
    Double threshold = evaluator.evaluate(model, trainTestDataframe);
    System.out.println("threshold = " + threshold);
    evaluator.showMetrics();
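
To clarify what the threshold is compared against, here is a sketch of the anomaly score commonly used with isolation forests: 2 raised to minus the mean path length of a point divided by the average path length expected for the sample size. Whether the library computes its score exactly this way is an assumption, and `IsolationScoreSketch` is a hypothetical class.

```java
final class IsolationScoreSketch {

    static final double EULER_MASCHERONI = 0.5772156649;

    /** Average path length of an unsuccessful search in a binary search tree of n points. */
    static double averagePathLength(int n) {
        if (n <= 1) {
            return 0.0;
        }
        if (n == 2) {
            return 1.0;
        }
        double harmonic = Math.log(n - 1) + EULER_MASCHERONI; // approximation of H(n - 1)
        return 2.0 * harmonic - 2.0 * (n - 1) / n;
    }

    /** Score in (0, 1]: close to 1 for points isolated quickly, around 0.5 or below for ordinary points. */
    static double anomalyScore(double meanPathLength, int sampleSize) {
        if (sampleSize < 2) {
            throw new IllegalArgumentException("sampleSize must be at least 2");
        }
        return Math.pow(2.0, -meanPathLength / averagePathLength(sampleSize));
    }

    public static void main(String[] args) {
        System.out.println(anomalyScore(3.0, 256));
        System.out.println(anomalyScore(12.0, 256));
    }
}
```

A point whose score exceeds the chosen threshold is flagged as an anomaly, so lowering the threshold raises the true positive rate at the cost of more false alarms.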

Saving models

You may want to save your model for later reuse:

    File file = new File("/path/to/your/file.gz");
    RandomForestClassifier randomForestClassifier;
    var dataframe = Dataframes.csvTrainTest(arg, ",", "\"", false).shuffle();
    var dfSplit = dataframe.setSplitValue(0.4).split();
    if (!file.exists()) {
      randomForestClassifier = new RandomForestClassifier(100, ENTROPY)
          .setTreeDepth(5)
          .setSampleSizeRatio(0.4)
          .setResponseVariableName("c4")
          .train(dfSplit.train());
      // Train first and then write your model
      Models.write(file.toPath(), randomForestClassifier);
    } else {
      // Read your already existing model
      randomForestClassifier = Models.read(file.toPath());
    }

    var newDf = randomForestClassifier.predict(dfSplit.test());
    newDf.show(0, 150);
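
For readers curious about how such a round trip can work, the sketch below writes any `Serializable` object to a gzip-compressed file and reads it back using only the JDK. The `.gz` extension above suggests compression, but the actual format used by `Models.write` and `Models.read` is not documented on this page; `GzipModelStore` is a hypothetical class and an assumption, not the library's implementation.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipModelStore {

    /** Serializes the object with Java serialization and compresses it with gzip. */
    static void write(Path path, Serializable model) throws IOException {
        try (var out = new ObjectOutputStream(new GZIPOutputStream(Files.newOutputStream(path)))) {
            out.writeObject(model);
        }
    }

    /** Reads back an object written by write; the caller chooses the expected type. */
    @SuppressWarnings("unchecked")
    static <T> T read(Path path) throws IOException, ClassNotFoundException {
        try (var in = new ObjectInputStream(new GZIPInputStream(Files.newInputStream(path)))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("model", ".gz");
        write(file, new double[] {0.1, 0.2, 0.3});
        double[] weights = read(file);
        System.out.println(weights.length + " weights restored");
        Files.delete(file);
    }
}
```

Java serialization is sensitive to class changes, so a model saved with one version of a class may fail to load with another.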