callumcurtis/tenadiff

TenaDiff

"Ten[sors with] a[utomatic] Diff[erentiation]"

The TenaDiff library implements a tensor abstraction that automatically builds compute graphs and handles reverse-mode automatic differentiation, allowing users to easily build custom neural networks.

The library is implemented from scratch in C++23 and is designed to align with the PyTorch API.

Basic Example

Let's take a look at a simple example: performing a single step of gradient descent while fitting a model to a line.

// Create parameters for bias and slope.
// We will fit the line using a model of the form y = bias + slope * x.
// Each parameter will be of shape (1,).
// The bias will be initialized to -0.03.
// The slope will be initialized to 0.01.
auto bias = Tensor<double>::create(Shape{1}, {-0.03});
auto slope = Tensor<double>::create(Shape{1}, {0.01});

// Set each tensor to require gradients so that
// they accumulate gradients during the backward pass.
bias->set_requires_grad(true);
slope->set_requires_grad(true);

// Create tensors to store intermediate results during the forward
// and backward passes.
// The input will be of shape (batch_size,).
// We will need to broadcast the bias and slope parameters
// to match the shape of the input.
// We will use `bias_broadcasted` and `slope_broadcasted` to store
// the broadcasted bias and slope.
// We will use `mul_out` and `add_out` to store the intermediate
// results of the multiplication and addition.
auto bias_broadcasted = Tensor<double>::create(Shape{batch_size});
auto slope_broadcasted = Tensor<double>::create(Shape{batch_size});
auto mul_out = Tensor<double>::create(Shape{batch_size});
auto add_out = Tensor<double>::create(Shape{batch_size});

// Use mean squared error loss.
// The input to the loss function will be of shape (batch_size,).
auto mse = MeanSquaredErrorLoss<double>(Shape{batch_size});

// Use stochastic gradient descent with a learning rate of 0.05.
auto optimizer = SGD<double>(0.05);

// Conduct the forward pass.
// Tensor operations are read from left to right.
// For instance, `inputs->mul(*slope_broadcasted, *mul_out)`
// describes element-wise multiplying the `inputs` tensor by the
// `slope_broadcasted` tensor and storing the result in the `mul_out` tensor.
bias->broadcast(*bias_broadcasted);
slope->broadcast(*slope_broadcasted);
inputs->mul(*slope_broadcasted, *mul_out);
auto &predictions = mul_out->add(*bias_broadcasted, *add_out);
auto &loss = mse(MeanSquaredErrorLoss<double>::Predictions(predictions),
                 MeanSquaredErrorLoss<double>::Targets(*targets));

// Retrieve the compute graph.
// The compute graph is a directed acyclic graph (DAG)
// that represents the computation of a neural network.
// It is constructed during the forward pass and is used
// to compute the gradients during the backward pass.
auto compute_graph = loss.compute_graph();

// Conduct the backward pass.
// The reverse mode of automatic differentiation is used.
// Starting at the loss, the compute graph is traversed
// backwards using differentiation rules to
// propagate the gradients back to the parameters.
compute_graph.backward();

// Update the parameters.
// The optimizer uses the gradients from the backward pass
// to update the parameters.
optimizer.step(compute_graph);

// Reset the compute graph.
// This is necessary to detach the nodes in the compute graph
// and zero the gradients from the previous backward pass.
compute_graph.reset();

Expanding this snippet to iterate over multiple mini-batches, we can fit our model.

> line 0.86 4.3
Batch 0 | Loss: 9.19552 | Bias: 0.406993 | Slope: 0.741539
Batch 10 | Loss: 0.317648 | Bias: 0.875876 | Slope: 3.37687
Batch 20 | Loss: 0.0273313 | Bias: 0.863401 | Slope: 4.06392
Batch 30 | Loss: 0.0013757 | Bias: 0.86017 | Slope: 4.24451
Batch 40 | Loss: 7.21439e-05 | Bias: 0.859589 | Slope: 4.28779
Batch 50 | Loss: 3.51138e-06 | Bias: 0.859878 | Slope: 4.29712
Batch 60 | Loss: 2.22195e-07 | Bias: 0.860021 | Slope: 4.29938
Batch 70 | Loss: 8.30594e-09 | Bias: 0.86 | Slope: 4.29985
Batch 80 | Loss: 6.48038e-10 | Bias: 0.859999 | Slope: 4.29996
Batch 90 | Loss: 2.9512e-11 | Bias: 0.86 | Slope: 4.29999
Batch 100 | Loss: 2.03897e-12 | Bias: 0.86 | Slope: 4.3
Batch 110 | Loss: 1.19486e-13 | Bias: 0.86 | Slope: 4.3
Batch 120 | Loss: 5.47383e-15 | Bias: 0.86 | Slope: 4.3
Batch 130 | Loss: 2.82228e-16 | Bias: 0.86 | Slope: 4.3
----------------------------------------
Converged after 139 batches.
Loss: 1.86122e-17
Bias: 0.86
Slope: 4.3

You can find the complete source code for this line fitting example in example/line/line.cpp. Try it out for yourself using the make line command.

Advanced Example

Let's take a look at a more interesting example: training a four-layer feed-forward neural network to fit a heightmap.

First, let's export a heightmap from the Rocky Mountains as a PNG using unrealheightmap.

Rocky Mountains heightmap (example/heightmap/data/png/b42eb40cc0d4.png)

Convert the PNG to PGM format using the example/heightmap/scripts/png_to_pgm script included in this repository. See example/heightmap/data/pgm/b42eb40cc0d4.pgm for the resulting PGM file.

From here, we can write our training loop.

// Assume that we have an `inputs` tensor of shape (epoch_size, 2)
// that contains the normalized x and y coordinates of the heightmap.
// Also assume we have a `targets` tensor of shape (epoch_size, 1) that
// contains the normalized height values of the heightmap for each sample
// in the `inputs` tensor.

// Define the neural network.
// We will use a four-layer feed-forward neural network.
// This time, let's use the entire epoch as input in each iteration (batch gradient descent).
const dimension_size_t epoch_size = inputs->shape().at(0);
const auto hidden_layer_sizes = std::array<dimension_size_t, 3>{128, 64, 32};
auto lin1 =     Linear<double>(epoch_size, inputs->shape().at(1), hidden_layer_sizes[0]);
auto lin1relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[0]});
auto lin2 =     Linear<double>(epoch_size, hidden_layer_sizes[0], hidden_layer_sizes[1]);
auto lin2relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[1]});
auto lin3 =     Linear<double>(epoch_size, hidden_layer_sizes[1], hidden_layer_sizes[2]);
auto lin3relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[2]});
auto lin4 =     Linear<double>(epoch_size, hidden_layer_sizes[2], 1);

// Use mean squared error loss.
// The input to the loss function will be of shape (epoch_size, 1),
// the same as the output of the last layer in our neural network.
auto mse = MeanSquaredErrorLoss<double>(Shape{epoch_size, 1});

// Use stochastic gradient descent with momentum.
auto optimizer = SGDWithMomentum<double>(
    SGDWithMomentum<double>::LearningRate(0.05),
    SGDWithMomentum<double>::Momentum(0.9));

// Training loop.
for (auto i = 0; i < 5000; ++i) {
    // Conduct the forward pass.
    lin1(*inputs).relu(*lin1relu);
    lin2(*lin1relu).relu(*lin2relu);
    lin3(*lin2relu).relu(*lin3relu);
    auto &predictions = lin4(*lin3relu);
    auto &loss = mse(MeanSquaredErrorLoss<double>::Predictions(predictions),
                     MeanSquaredErrorLoss<double>::Targets(*targets));

    // Retrieve the compute graph.
    auto compute_graph = loss.compute_graph();

    // Conduct the backward pass.
    compute_graph.backward();

    // Update the parameters.
    optimizer.step(compute_graph);

    // Reset the compute graph.
    compute_graph.reset();
}

With some additional boilerplate for reading and writing PGM files, we can fit our neural network to the heightmap.

> heightmap example/heightmap/data/pgm/b42eb40cc0d4.pgm example/heightmap/output/

If we output PGM files representing the model's predictions over the epochs, we can visualize the model's training progress.

Training progress animation (example/heightmap/data/gif/b42eb40cc0d4.gif)

Nice! We've trained a feed-forward neural network to fit a heightmap.

You can find the complete source code for this heightmap example in example/heightmap/heightmap.cpp. Try it out for yourself using the make heightmap command.

Getting Started

First, install the library.

> make install

After installation, you can write your own neural networks using the API described below. For examples of how to use the API, see the example directory. See example/CMakeLists.txt for an example of how to link against the library. To run the examples, use the following commands.

> make line
> make heightmap

API

As we saw in the examples, the atomic unit of the library is Tensor. On top of Tensor, there are a number of additional utilities.

Tensor

#include <tenadiff/tensor.hpp>

Tensor is responsible for performing the forward pass and building the compute graph while doing so. The compute graph is constructed fresh for each forward pass at runtime (like PyTorch and TensorFlow V2).

| Method | Description |
| --- | --- |
| Tensor::create | Create a new tensor. |
| Tensor::shape | Get the shape of the tensor. |
| Tensor::at | Get the element at a given multi-dimensional index. |
| Tensor::broadcast | Broadcast the tensor to an expanded shape. |
| Tensor::transpose | Transpose the tensor. |
| Tensor::matmul | Matrix multiply with another tensor. |
| Tensor::add | Element-wise add with another tensor. |
| Tensor::sub | Element-wise subtract another tensor from the current tensor. |
| Tensor::relu | ReLU activation. |
| Tensor::softmax | Apply the softmax function to the tensor. |
| Tensor::pow | Element-wise power of the tensor. |
| Tensor::log | Element-wise log of the tensor. |
| Tensor::mul | Element-wise multiply with another tensor. |
| Tensor::neg | Element-wise negation of the tensor. |
| Tensor::sum | Sum the tensor into a 1-dimensional tensor. |
| Tensor::mean | Take the mean of the tensor into a 1-dimensional tensor. |
| Tensor::set_requires_grad | Set whether the tensor requires a gradient during the backward pass. |
| Tensor::requires_grad | Check whether the tensor requires a gradient during the backward pass. |
| Tensor::grad | Get the gradient of the tensor from the last backward pass. |
| Tensor::zero_grad | Zero the gradient of the tensor. |
| Tensor::compute_graph | Get the compute graph rooted at the tensor. |
| Tensor::update | Update the tensor in-place using the gradient. |

Tensor::Init

#include <tenadiff/tensor.hpp>

Tensor::Init is a nested utility class that provides static methods for initializing new tensors using random number distributions.

| Method | Description |
| --- | --- |
| Tensor::Init::uniform | Initialize a new tensor with a uniform distribution. |
| Tensor::Init::kaiming_uniform | Initialize a new tensor with a Kaiming uniform distribution. |

ComputeGraph

#include <tenadiff/compute_graph.hpp>

ComputeGraph represents the computation of a neural network. The graph consists of linked ComputeGraphData and ComputeGraphOperation nodes. ComputeGraphData represents a Tensor in the graph. ComputeGraphOperation represents an operation between data nodes, such as Tensor::add. Tensor is responsible for building the graph during the forward pass. ComputeGraph is responsible for resetting the graph between passes, and handles propagating gradients backwards through the graph.

Check out the include/compute_graph.hpp header for details on the graph's structure and ownership model.

| Method | Description |
| --- | --- |
| ComputeGraph::backward | Propagate gradients backwards through the graph. |
| ComputeGraph::root | Get the root ComputeGraphData node of the graph. |
| ComputeGraph::parameters | Get the parameter ComputeGraphData nodes of the graph. |
| ComputeGraph::requires_backward | Check whether the given ComputeGraphData node requires a gradient during the backward pass. |
| ComputeGraph::reset | Detach all nodes and zero any gradients from the previous backward pass. |

In the majority of cases, you will only need to interact with ComputeGraph::backward and ComputeGraph::reset. Since you are unlikely to interact with ComputeGraphData and ComputeGraphOperation nodes directly, their descriptions are omitted here for brevity.

Linear

#include <tenadiff/linear.hpp>

Linear represents a linear layer in a neural network. It applies an affine linear transformation to an input tensor of shape (batch_size, input_size).

| Method | Description |
| --- | --- |
| Linear::weights | Get the weights of the linear layer. |
| Linear::biases | Get the biases of the linear layer. |
| Linear::operator() | Apply an affine linear transformation to the input tensor. |

CrossEntropyLoss

#include <tenadiff/loss.hpp>

CrossEntropyLoss represents the cross-entropy loss function. It is used to measure the loss of a multi-class classification model with an arbitrary number of classes. The predictions must contain the unnormalized logits for each class, with shape (batch_size, num_classes). The targets must be one-hot encoded, with shape (batch_size, num_classes).

| Method | Description |
| --- | --- |
| CrossEntropyLoss::operator() | Compute the cross-entropy loss between the predictions and targets. |

MeanSquaredErrorLoss

#include <tenadiff/loss.hpp>

MeanSquaredErrorLoss represents the mean squared error loss function. It is used to measure the loss of a regression model.

| Method | Description |
| --- | --- |
| MeanSquaredErrorLoss::operator() | Compute the mean squared error between the predictions and targets. |

SGD

#include <tenadiff/optimizer.hpp>

SGD represents the stochastic gradient descent optimizer. Optimizers are responsible for updating the parameters of a neural network after a backward pass.

| Method | Description |
| --- | --- |
| SGD::step() | Perform a single optimization step on the parameters of a compute graph. |

SGDWithMomentum

#include <tenadiff/optimizer.hpp>

SGDWithMomentum represents the stochastic gradient descent optimizer with momentum. It extends the SGD optimizer with a momentum term.

| Method | Description |
| --- | --- |
| SGDWithMomentum::step() | Perform a single optimization step on the parameters of a compute graph. |

Performance

Profiling the heightmap example shows that the Tensor::matmul forward and backward passes are the most expensive operations.

> perf record -F 99 -p <pid> -g -- sleep 60
> perf script > scratch/out.perf
> stackcollapse-perf.pl scratch/out.perf > scratch/out.folded
> flamegraph.pl scratch/out.folded > example/heightmap/data/flamegraph.svg

Flamegraph (example/heightmap/data/flamegraph.svg)

There is clear room to improve performance; further optimization is left as future work.

Future Work

  • Optimize Tensor::matmul
  • Gradient clipping
  • Adam optimizer
  • Classifier example
  • Fused kernel for cross-entropy loss
  • Fused kernel for log softmax
  • Tensor storage layouts in addition to strided (e.g., sparse)
  • Device types in addition to CPU (e.g., GPU)
  • Compute graph export to Mermaid diagram
  • Python bindings
  • Utilities for exporting training metrics to TensorBoard
  • Utilities for data loading and preprocessing
  • Convolutional layers
  • Recurrent layers
  • Attention layers

