2) Algorithms
var df = Dataframes.create(
    new Column<>("y",
        rangeClosed(0, 10).mapToDouble(l -> l).boxed().collect(toList())),
    new Column<>("x",
        rangeClosed(0, 10).mapToDouble(l -> l + (nextBoolean() ? 1 : -1) * nextDouble(0, 0.5))
            .boxed()
            .collect(toList())),
    new Column<>("x²",
        rangeClosed(0, 10).mapToDouble(l -> l * l + (nextBoolean() ? 1 : -1) * nextDouble(0, 0.5))
            .boxed()
            .collect(toList()))
);
var linearRegression = new LinearRegression()
    .setResponseVariableName("y")
    .setPredictorNames("x", "x²")
    .train(df)
    .showMetrics();
linearRegression.predict(df).tail();
Here's how you can perform logistic regression, using the IRIS dataset:
// Load the IRIS dataset
var df = Dataframes.csv("/path/to/iris.data", ",", false);
// Number of train iteration
int epoch = 100;
// Learning Rate
double lr = 0.1;
// Loss offset for Gradient Descent
int alpha = 100;
// Regularization (RIDGE or LASSO; the default is no regularization)
Regularization ridge = RIDGE;
// Regularization parameter
double lambda = 0.019;
// Label for Logistic regression, other labels will be annotated "Not"+ label
String label = "Iris-setosa";
var setosaRegression = new LogisticRegression(epoch, lr)
    .setResponseVariableName("c4")
    .setPredictorNames("c0", "c1", "c2", "c3")
    .setLossAccuracyOffset(alpha)
    .setRegularization(ridge)
    .setLambda(lambda)
    .setLabel(label)
    .train(df);
setosaRegression.getHistory().tail();
// testDf is assumed to be a held-out dataframe with the same columns as df
setosaRegression.predict(testDf).show(2000);
// The difference from logistic regression is that
// softmax regression handles several labels at a time
var df = Dataframes.csv("/path/to/iris.data", ",", false);
var softmaxRegression = new SoftmaxRegression(1000, 0.1)
    .setResponseVariableName("c4")
    .setPredictorNames("c0", "c1", "c2", "c3")
    .setRegularization(RIDGE)
    .setLambda(0.0014)
    .setLossAccuracyOffset(100)
    .train(df);
softmaxRegression.getHistory().tail();
softmaxRegression.predict(df).show(2000);
softmaxRegression = new SoftmaxRegression(1000, 0.1)
    .setResponseVariableName("c4")
    .setPredictorNames("c0", "c1", "c2", "c3")
    .setRegularization(LASSO)
    .setLambda(0.0014)
    .setLossAccuracyOffset(100)
    .train(df);
softmaxRegression.getHistory().tail();
softmaxRegression.predict(df).show(2000);
Still using the IRIS dataset:
var df = Dataframes
    .csv("/path/to/iris.data", ",", false)
    .mapWithout("c4");
// There are 3 types of Iris so let's make a cluster of 3
int K = 3;
// Training on 10 iterations
int epochs = 10;
// Strategy of initialisation (RANDOM or PLUS_PLUS)
InitialisationStrategy strategy = PLUS_PLUS;
// var kMeans = new KMeans(K, epochs); will default to PLUS_PLUS strategy
var kMeans = new KMeans(K, epochs, strategy).train(df).showMetrics();
kMeans.predict(df).show(20);
System.out.println(kMeans.getCentroids());
In the same way, we can use the same dataset with KMedians. The only difference from KMeans is the distance computation: KMedians uses the Manhattan distance, whereas KMeans uses the Euclidean distance.
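To make the distinction above concrete, here is a minimal sketch of the two distance measures in plain Java. The class and method names (`DistanceSketch`, `euclidean`, `manhattan`) are hypothetical and are not part of the library's API; how the library computes distances internally is not shown here.

```java
// Illustrative sketch only: these helpers are not part of the library.
final class DistanceSketch {

    // Euclidean distance: square root of the sum of squared coordinate differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println("euclidean = " + euclidean(p, q)); // prints euclidean = 5.0
        System.out.println("manhattan = " + manhattan(p, q)); // prints manhattan = 7.0
    }
}
```

The KMedians call itself mirrors the KMeans one: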
int K = 3;
int epochs = 10;
InitialisationStrategy strategy = PLUS_PLUS;
var kMedians = new KMedians(K, epochs, strategy);
kMedians.train(df).showMetrics();
kMedians.predict(df).show(20);
System.out.println(kMedians.getCentroids());
This evaluator helps you find the best number of clusters for your KMeans / KMedians:
var df = Dataframes.csv("/path/to/iris.data", ",", false);
var kmedoidEvaluator = new KMedoidEvaluator(1, 14, MEDIAN);
kmedoidEvaluator.evaluate(df).show(20);
The result printed from the dataframe is the Weighted Sum of Squares for each number of clusters.
You can then use the Elbow Method to determine the optimal number of clusters. The final choice is up to your interpretation.
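The Elbow Method is usually applied by eye, but one simple way to automate it is to look for the cluster count where the curve bends the most. The sketch below uses the largest discrete second difference as that heuristic. Both the `ElbowSketch` class and the sample values are hypothetical: they are not produced by the library and are only there to show the idea. Other heuristics exist and may pick a different point.

```java
// Illustrative sketch only: a simple elbow heuristic, not part of the library.
final class ElbowSketch {

    // values[i] is the score measured for (kMin + i) clusters.
    // Returns the cluster count with the largest second difference,
    // i.e. the point where the decrease slows down the most.
    static int elbow(double[] values, int kMin) {
        if (values.length < 3) {
            throw new IllegalArgumentException("need at least 3 values");
        }
        int bestIndex = 1;
        double bestBend = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < values.length - 1; i++) {
            double bend = values[i - 1] - 2 * values[i] + values[i + 1];
            if (bend > bestBend) {
                bestBend = bend;
                bestIndex = i;
            }
        }
        return kMin + bestIndex;
    }

    public static void main(String[] args) {
        // Hypothetical scores for K = 1..6 (made up for illustration).
        double[] scores = {100, 60, 20, 17, 15, 14};
        System.out.println("elbow at K = " + elbow(scores, 1)); // prints elbow at K = 3
    }
}
```

Treat the returned value as a starting point and check it against the printed curve.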
var df = Dataframes.csv("/path/to/iris.data", ",", false);
int bandwidth = 2;
int epochs = 20;
var meanShift = new MeanShift(bandwidth, epochs).train(df);
meanShift.predict(df).tail();

var df = Dataframes.csv("/path/to/iris.data", ",", false);
int bandwidth = 3;
int epochs = 20;
var medianShift = new MedianShift(bandwidth, epochs).train(df);
medianShift.predict(df).tail();
Still using the IRIS dataset, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) lets you define clusters based on the density of points in the dataset rather than on a preset number of clusters.
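As a rough illustration of what "density" means here, the sketch below checks whether a point is a core point, that is, whether enough other points lie within a given radius of it. The `CorePointSketch` class is hypothetical and not part of the library. It assumes Euclidean distance and does not count the point itself as its own neighbour; the library may follow different conventions.

```java
// Illustrative sketch only: the core-point test behind density-based clustering.
final class CorePointSketch {

    // Euclidean distance between two points of equal dimension (an assumption).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Counts the other points lying within `radius` of points[index].
    static int neighbours(double[][] points, int index, double radius) {
        int count = 0;
        for (int j = 0; j < points.length; j++) {
            if (j != index && distance(points[index], points[j]) <= radius) {
                count++;
            }
        }
        return count;
    }

    // A point is a core point when it has at least minSamples neighbours.
    static boolean isCorePoint(double[][] points, int index, double radius, int minSamples) {
        return neighbours(points, index, radius) >= minSamples;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}};
        System.out.println(isCorePoint(points, 0, 1.5, 3)); // prints true
        System.out.println(isCorePoint(points, 4, 1.5, 3)); // prints false
    }
}
```

The library call that follows takes the same two kinds of parameters, a radius and a minimum number of neighbours.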
var df = Dataframes.csv("/path/to/iris.data", ",", "\"", false);
//Radius to look for when evaluating a point
int radius = 1;
//Minimum number of neighbours around the evaluated point
int minSamples = 5;
var predicted = new DBSCAN(radius, minSamples)
    .predict(df.mapWithout("c4"));
predicted.addColumn(df.getColumns()[4]).show(150);
You can build a decision tree to predict a class (classification) or a value (regression).
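The classifier examples in this section take an impurity strategy (GINI or ENTROPY). As background, here is a minimal sketch of the standard textbook definitions of both measures, computed from per-class counts at a tree node. The `ImpuritySketch` class is hypothetical and not part of the library, whose implementation may differ in its details.

```java
// Illustrative sketch only: textbook impurity measures, not the library's code.
final class ImpuritySketch {

    // Gini impurity: 1 minus the sum of squared class proportions.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    // Shannon entropy in bits; empty classes contribute nothing.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        int[] mixed = {5, 5};  // two classes, evenly split
        int[] pure = {10, 0};  // a single class
        System.out.println("gini(mixed) = " + gini(mixed)); // prints gini(mixed) = 0.5
        System.out.println("gini(pure) = " + gini(pure));   // prints gini(pure) = 0.0
        System.out.println("entropy(mixed) = " + entropy(mixed));
    }
}
```

Both measures are zero for a pure node and largest for an even split, which is why a tree prefers splits that lower them. A classification example with the library: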
int treeDepth = 5;
var impurityStrategy = ImpurityStrategy.GINI;
var decisionTreeClassifier = new DecisionTreeClassifier(treeDepth, impurityStrategy);
var dataframe = Dataframes.csvTrainTest("/path/to/iris.data", ",", "\"", false).shuffle();
var dfSplit = dataframe.setSplitValue(0.4).split();
decisionTreeClassifier.setResponseVariableName("c4").train(dfSplit.train());
var newDf = decisionTreeClassifier.predict(dfSplit.test());
newDf.show(0, 150);
Here we are using the wine quality dataset:
int treeDepth = 5;
var decisionTreeRegression = new DecisionTreeRegressor(treeDepth);
var dataframe = Dataframes.csvTrainTest("/path/to/winequality-red.csv", ";").shuffle();
var dfSplit = dataframe.setSplitValue(0.5).split();
decisionTreeRegression
    .setResponseVariableName("quality")
    .train(dfSplit.train());
var predictionDf = decisionTreeRegression.predict(dfSplit.test());
predictionDf.show(0, 15000);
Random forests work the same way as decision trees in terms of API. They reduce the variance of your model while keeping bias low by training several decision trees and combining their predictions.
int numberOfEstimators = 100;
var impurityStrategy = ENTROPY;
var randomForestClassifier = new RandomForestClassifier(numberOfEstimators, impurityStrategy)
    .setTreeDepth(5)
    .setSampleSizeRatio(0.4)
    .setResponseVariableName("c4");
var dataframe = Dataframes.csvTrainTest("/path/to/iris.data", ",", "\"", false).shuffle();
var dfSplit = dataframe.setSplitValue(0.4).split();
randomForestClassifier.train(dfSplit.train());
var newDf = randomForestClassifier.predict(dfSplit.test());
newDf.show(0, 150);

int numberOfEstimators = 10;
var randomForestRegression = new RandomForestRegressor(numberOfEstimators)
    .setTreeDepth(3)
    .setSampleFeatureSize(2)
    .setSampleSizeRatio(0.4);
var dataframe = Dataframes.csvTrainTest("/path/to/winequality-red.csv", ";").shuffle();
var dfSplit = dataframe.setSplitValue(0.5).split();
randomForestRegression
    .setResponseVariableName("quality")
    .train(dfSplit.train());
var predictions = randomForestRegression.predict(dfSplit.test());
predictions.show(0, 15000);
Isolation Forest is an ensemble technique built from trees, like random forests. It can be used to detect anomalies within a dataset or to remove outliers before fitting another model.
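For intuition about how an anomaly threshold can be interpreted, the sketch below computes the anomaly score as defined in the original Isolation Forest paper: points that are isolated after few splits get a score close to 1, and points with an average path length get 0.5. The `IsolationScoreSketch` class is hypothetical, and whether this library scores points in exactly this way is an assumption, not something stated here.

```java
// Illustrative sketch only: the anomaly score from the Isolation Forest paper.
final class IsolationScoreSketch {

    static final double EULER_MASCHERONI = 0.5772156649;

    // Average path length of an unsuccessful search in a binary search tree
    // built from n points; used to normalise path lengths.
    static double averagePathLength(int n) {
        if (n <= 1) {
            return 0.0;
        }
        if (n == 2) {
            return 1.0;
        }
        double harmonic = Math.log(n - 1) + EULER_MASCHERONI; // approximates H(n - 1)
        return 2.0 * harmonic - 2.0 * (n - 1) / n;
    }

    // Score in (0, 1]: shorter mean path lengths give scores closer to 1.
    static double anomalyScore(double meanPathLength, int sampleSize) {
        return Math.pow(2.0, -meanPathLength / averagePathLength(sampleSize));
    }

    public static void main(String[] args) {
        int sampleSize = 256;
        double typical = averagePathLength(sampleSize);
        System.out.println("c(256) = " + typical);
        System.out.println("score for a short path = " + anomalyScore(2.0, sampleSize));
        System.out.println("score for a typical path = " + anomalyScore(typical, sampleSize));
    }
}
```

Under this definition a threshold above 0.5 flags points that are easier to isolate than average.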
You can use the http dataset (there is a http_reduced.csv for faster training and evaluation)
var df = Dataframes.csv("/path/to/", ",", "\"", true);
var trainTestDataframe = Dataframes.trainTest(df.getColumns()).setSplitValue(0.5);
int nbTrees = 200;
var model = new IsolationForest(nbTrees).train(df.mapWithout("attack"));
// Here we are using the TPRThresholdEvaluator in order to find the best threshold for our model
// But you can directly set your own threshold with the setAnomalyThreshold method
var evaluator = new TPRThresholdEvaluator("attack", "anomalies").setDesiredTPR(0.99).setLearningRate(0.02);
Double threshold = evaluator.evaluate(model, trainTestDataframe);
System.out.println("threshold = " + threshold);
evaluator.showMetrics();
You may want to save your model for later reuse:
File file = new File("/path/to/your/file.gz");
RandomForestClassifier randomForestClassifier;
var dataframe = Dataframes.csvTrainTest(arg, ",", "\"", false).shuffle();
var dfSplit = dataframe.setSplitValue(0.4).split();
if (!file.exists()) {
    randomForestClassifier = new RandomForestClassifier(100, ENTROPY)
        .setTreeDepth(5)
        .setSampleSizeRatio(0.4)
        .setResponseVariableName("c4")
        .train(dfSplit.train());
    // Train first and then write your model
    Models.write(file.toPath(), randomForestClassifier);
} else {
    // Read your already existing model
    randomForestClassifier = Models.read(file.toPath());
}
var newDf = randomForestClassifier.predict(dfSplit.test());
newDf.show(0, 150);