doddle-model is an in-memory machine learning library that can be summed up with three main characteristics:
- it is built on top of Breeze
- it provides immutable estimators that are a doddle to use in parallel code
- it exposes its functionality through a scikit-learn-like API [2] in idiomatic Scala using typeclasses
doddle-model aims to fill the role of scikit-learn in the Scala ecosystem and is consequently much more lightweight than e.g. Spark ML. Fitted models can be deployed anywhere, from simple applications to concurrent, distributed systems built with Akka, Apache Beam or a framework of your choice. Training of estimators happens in-memory, which is advantageous unless you are dealing with enormous datasets that absolutely cannot fit into RAM.
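The "immutable estimators with a typeclass-based API" idea can be sketched in a few lines of plain Scala. This is a simplified illustration of the pattern, not the library's actual API; the names (Regressor, MeanRegressor) are hypothetical:

```scala
// A trivial immutable model: the learned state lives in an immutable case class.
// MeanRegressor is a hypothetical name used only for this illustration.
final case class MeanRegressor(mean: Option[Double])

// The typeclass: fitting returns a NEW model instead of mutating the old one,
// which is what makes estimators safe to share across parallel code.
trait Regressor[A] {
  def fit(model: A, y: Seq[Double]): A
  def predict(model: A, numExamples: Int): Seq[Double]
}

implicit val meanRegressor: Regressor[MeanRegressor] = new Regressor[MeanRegressor] {
  def fit(model: MeanRegressor, y: Seq[Double]): MeanRegressor =
    MeanRegressor(Some(y.sum / y.length))
  def predict(model: MeanRegressor, numExamples: Int): Seq[Double] =
    Seq.fill(numExamples)(model.mean.getOrElse(sys.error("model is not fitted")))
}

val model = MeanRegressor(None)
val trained = implicitly[Regressor[MeanRegressor]].fit(model, Seq(1.0, 2.0, 3.0))
// model is left untouched; trained holds the learned state
```

Because the unfitted `model` is never mutated, it can be fitted concurrently on different datasets without any locking.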
You can chat with us on Gitter.
The project is published for Scala versions 2.11, 2.12 and 2.13. Add the dependency to your SBT project definition:
```scala
libraryDependencies ++= Seq(
  "io.github.picnicml" %% "doddle-model" % "<latest_version>",
  // optional: utilizes native libraries for a significant performance boost
  "org.scalanlp" %% "breeze-natives" % "1.0"
)
```

Note that the latest version is displayed in the Maven Central badge above and that the v prefix should be removed from the SBT definition.
This is a complete list of code examples. For an example of how to serve a trained doddle-model in a pipeline implemented with Apache Beam, see doddle-beam-example.
- Standard Scaler
- Range Scaler
- Binarizer
- Normalizer
- One-Hot Encoder
- Mean Value Imputation
- Most Frequent Value Imputation
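The preprocessors above all follow the same immutable fit/transform shape. A minimal standalone sketch of a standard scaler in that style (an illustration of the pattern, not the library's actual implementation, shown for a single feature):

```scala
// Hypothetical single-feature standard scaler: fit computes the statistics and
// returns a NEW instance; transform is a pure function of the fitted state.
final case class StandardScaler(mean: Option[Double], std: Option[Double]) {
  def fit(x: Seq[Double]): StandardScaler = {
    val m = x.sum / x.length
    val s = math.sqrt(x.map(v => (v - m) * (v - m)).sum / x.length)
    StandardScaler(Some(m), Some(s))
  }

  def transform(x: Seq[Double]): Seq[Double] = (mean, std) match {
    case (Some(m), Some(s)) => x.map(v => (v - m) / s)
    case _                  => sys.error("scaler has not been fitted")
  }
}

val scaler = StandardScaler(None, None).fit(Seq(1.0, 2.0, 3.0))
val scaled = scaler.transform(Seq(1.0, 2.0, 3.0))
// scaled now has zero mean and unit variance
```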
Want to help us? We have a document that will make deciding how to do that much easier. Be sure to also check the roadmap.
doddle-model is developed with performance in mind.
Breeze utilizes netlib-java to access hardware-optimised linear algebra libraries (note that the breeze-natives dependency needs to be added to the SBT project definition). TL;DR: seeing something like
INFO: successfully loaded /var/folders/9h/w52f2svd3jb750h890q1x4j80000gn/T/jniloader3358656786070405996netlib-native_system-osx-x86_64.jnilib
means that BLAS/LAPACK/ARPACK implementations are used. For more information see the Breeze documentation.
If you encounter java.lang.OutOfMemoryError: Java heap space, increase the heap size with the -Xms (initial) and -Xmx (maximum) JVM options, e.g. use -Xms8192m -Xmx8192m for an initial and maximum heap of 8 GB. Note that the maximum heap for a 32-bit JVM is limited to 4 GB (at least in theory), so make sure to use a 64-bit JVM if more memory is needed. If the error still occurs and you are using hyperparameter search or cross validation, see the next section.
To limit the number of threads running at any one time (and thus memory consumption) when doing cross validation and hyperparameter search, a FixedThreadPool executor is used. By default the maximum number of threads is set to the number of the system's cores; set the -DmaxNumThreads JVM property to change that, e.g. use -DmaxNumThreads=16 to allow 16 threads.
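Putting the heap and thread settings together, a JVM invocation might look like this (app.jar is a placeholder for your own artifact):

```shell
# 8 GB initial and maximum heap, at most 16 concurrent threads for
# cross validation and hyperparameter search
java -Xms8192m -Xmx8192m -DmaxNumThreads=16 -jar app.jar
```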
All experiments were run multiple times (iterations) for both implementations, with fixed hyperparameters selected so that the models yielded similar test set performance.
- dataset with 150000 training examples and 27147 test examples (10 features)
- each experiment ran for 100 iterations
- scikit-learn code, doddle-model code
| Implementation | RMSE | Training Time | Prediction Time |
|---|---|---|---|
| scikit-learn | 3.0936 | 0.042s (+/- 0.014s) | 0.002s (+/- 0.002s) |
| doddle-model | 3.0936 | 0.053s (+/- 0.061s) | 0.002s (+/- 0.004s) |
- dataset with 80000 training examples and 20000 test examples (250 features)
- each experiment ran for 100 iterations
- scikit-learn code, doddle-model code
| Implementation | Accuracy | Training Time | Prediction Time |
|---|---|---|---|
| scikit-learn | 0.8389 | 2.789s (+/- 0.090s) | 0.005s (+/- 0.006s) |
| doddle-model | 0.8377 | 3.080s (+/- 0.665s) | 0.025s (+/- 0.025s) |
- MNIST dataset with 60000 training examples and 10000 test examples (784 features)
- each experiment ran for 50 iterations
- scikit-learn code, doddle-model code
| Implementation | Accuracy | Training Time | Prediction Time |
|---|---|---|---|
| scikit-learn | 0.9234 | 21.243s (+/- 0.303s) | 0.074s (+/- 0.018s) |
| doddle-model | 0.9223 | 25.749s (+/- 1.813s) | 0.042s (+/- 0.032s) |
This is a collaborative project which wouldn't be possible without all the awesome contributors. The core team currently consists of the following developers:
- [1] Pattern Recognition and Machine Learning, Christopher Bishop
- [2] API design for machine learning software: experiences from the scikit-learn project, L. Buitinck et al.
- [3] UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, Dua, D. and Karra Taniskidou, E.