"Ten[sors with] a[utomatic] Diff[erentiation]"
The TenaDiff library implements a tensor abstraction that automatically builds compute graphs and handles reverse-mode automatic differentiation, allowing users to easily build custom neural networks.
The library is implemented from scratch in C++23 and is designed to align with the PyTorch API.
Let's take a look at a simple example: performing a single step of gradient descent while fitting a model to a line.
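For reference, here is the math this snippet implements. With inputs $x_i$, targets $y_i$, and a batch of size $n$, the model and mean squared error loss are

$$
\hat{y}_i = \text{bias} + \text{slope} \cdot x_i,
\qquad
L = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2,
$$

and the gradients that the backward pass computes for the two parameters are

$$
\frac{\partial L}{\partial \text{bias}} = \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right),
\qquad
\frac{\partial L}{\partial \text{slope}} = \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) x_i.
$$

The optimizer then applies $\theta \leftarrow \theta - \eta \, \partial L / \partial \theta$ to each parameter $\theta$, with learning rate $\eta = 0.05$ in this example.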
```cpp
// Create parameters for bias and slope.
// We will fit the line using a model of the form y = bias + slope * x.
// Each parameter will be of shape (1,).
// The bias will be initialized to -0.03.
// The slope will be initialized to 0.01.
auto bias = Tensor<double>::create(Shape{1}, {-0.03});
auto slope = Tensor<double>::create(Shape{1}, {0.01});
// Set each tensor to require gradients so that
// they accumulate gradients during the backward pass.
bias->set_requires_grad(true);
slope->set_requires_grad(true);
// Create tensors to store intermediate results during the forward
// and backward passes.
// The input will be of shape (batch_size,).
// We will need to broadcast the bias and slope parameters
// to match the shape of the input.
// We will use `bias_broadcasted` and `slope_broadcasted` to store
// the broadcasted bias and slope.
// We will use `mul_out` and `add_out` to store the intermediate
// results of the multiplication and addition.
auto bias_broadcasted = Tensor<double>::create(Shape{batch_size});
auto slope_broadcasted = Tensor<double>::create(Shape{batch_size});
auto mul_out = Tensor<double>::create(Shape{batch_size});
auto add_out = Tensor<double>::create(Shape{batch_size});
// Use mean squared error loss.
// The input to the loss function will be of shape (batch_size,).
auto mse = MeanSquaredErrorLoss<double>(Shape{batch_size});
// Use stochastic gradient descent with a learning rate of 0.05.
auto optimizer = SGD<double>(0.05);
// Conduct the forward pass.
// Tensor operations are read from left to right.
// For instance, `inputs->mul(*slope_broadcasted, *mul_out)`
// describes element-wise multiplying the `inputs` tensor by the
// `slope_broadcasted` tensor and storing the result in the `mul_out` tensor.
bias->broadcast(*bias_broadcasted);
slope->broadcast(*slope_broadcasted);
inputs->mul(*slope_broadcasted, *mul_out);
auto &predictions = mul_out->add(*bias_broadcasted, *add_out);
auto &loss = mse(MeanSquaredErrorLoss<double>::Predictions(predictions),
MeanSquaredErrorLoss<double>::Targets(*targets));
// Retrieve the compute graph.
// The compute graph is a directed acyclic graph (DAG)
// that represents the computation of a neural network.
// It is constructed during the forward pass and is used
// to compute the gradients during the backward pass.
auto compute_graph = loss.compute_graph();
// Conduct the backward pass.
// The reverse mode of automatic differentiation is used.
// Starting at the loss, the compute graph is traversed
// backwards using differentiation rules to
// propagate the gradients back to the parameters.
compute_graph.backward();
// Update the parameters.
// The optimizer uses the gradients from the backward pass
// to update the parameters.
optimizer.step(compute_graph);
// Reset the compute graph.
// This is necessary to detach the nodes in the compute graph
// and zero the gradients from the previous backward pass.
compute_graph.reset();
```
Expanding this snippet to iterate over multiple mini-batches, we can fit our model.
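A rough sketch of that loop is shown below; it reuses the tensors and optimizer from the snippet above. The `load_batch` helper and the batch count are illustrative placeholders rather than part of the library; the actual loop lives in `example/line/line.cpp`.

```cpp
// Illustrative sketch of repeating the single step above over mini-batches.
// `load_batch` is a hypothetical helper that fills `inputs` and `targets`
// with fresh samples from the target line; it is not part of TenaDiff.
for (auto batch = 0; batch < 200; ++batch) {
  load_batch(*inputs, *targets);
  // Forward pass (same operations as the single-step snippet).
  bias->broadcast(*bias_broadcasted);
  slope->broadcast(*slope_broadcasted);
  inputs->mul(*slope_broadcasted, *mul_out);
  auto &predictions = mul_out->add(*bias_broadcasted, *add_out);
  auto &loss = mse(MeanSquaredErrorLoss<double>::Predictions(predictions),
                   MeanSquaredErrorLoss<double>::Targets(*targets));
  // Backward pass, parameter update, and graph reset.
  auto compute_graph = loss.compute_graph();
  compute_graph.backward();
  optimizer.step(compute_graph);
  compute_graph.reset();
}
```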
```
> line 0.86 4.3
Batch 0 | Loss: 9.19552 | Bias: 0.406993 | Slope: 0.741539
Batch 10 | Loss: 0.317648 | Bias: 0.875876 | Slope: 3.37687
Batch 20 | Loss: 0.0273313 | Bias: 0.863401 | Slope: 4.06392
Batch 30 | Loss: 0.0013757 | Bias: 0.86017 | Slope: 4.24451
Batch 40 | Loss: 7.21439e-05 | Bias: 0.859589 | Slope: 4.28779
Batch 50 | Loss: 3.51138e-06 | Bias: 0.859878 | Slope: 4.29712
Batch 60 | Loss: 2.22195e-07 | Bias: 0.860021 | Slope: 4.29938
Batch 70 | Loss: 8.30594e-09 | Bias: 0.86 | Slope: 4.29985
Batch 80 | Loss: 6.48038e-10 | Bias: 0.859999 | Slope: 4.29996
Batch 90 | Loss: 2.9512e-11 | Bias: 0.86 | Slope: 4.29999
Batch 100 | Loss: 2.03897e-12 | Bias: 0.86 | Slope: 4.3
Batch 110 | Loss: 1.19486e-13 | Bias: 0.86 | Slope: 4.3
Batch 120 | Loss: 5.47383e-15 | Bias: 0.86 | Slope: 4.3
Batch 130 | Loss: 2.82228e-16 | Bias: 0.86 | Slope: 4.3
----------------------------------------
Converged after 139 batches.
Loss: 1.86122e-17
Bias: 0.86
Slope: 4.3
```
You can find the complete source code for this line fitting example in `example/line/line.cpp`. Try it out for yourself using the `make line` command.
Let's take a look at a more interesting example: training a four-layer feed-forward neural network to fit a heightmap.
First, let's export a heightmap from the Rocky Mountains as a PNG using unrealheightmap.
`example/heightmap/data/png/b42eb40cc0d4.png`
Convert the PNG to PGM format using the `example/heightmap/scripts/png_to_pgm` script included in this repository.
See `example/heightmap/data/pgm/b42eb40cc0d4.pgm` for the resulting PGM file.
From here, we can write our training loop.
```cpp
// Assume that we have an `inputs` tensor of shape (epoch_size, 2)
// that contains the normalized x and y coordinates of the heightmap.
// Also assume we have a `targets` tensor of shape (epoch_size, 1) that
// contains the normalized height values of the heightmap for each sample
// in the `inputs` tensor.
// Define the neural network.
// We will use a four-layer feed-forward neural network.
// This time, let's use the entire epoch as input in each iteration (batch gradient descent).
const dimension_size_t epoch_size = inputs->shape().at(0);
const auto hidden_layer_sizes = std::array<dimension_size_t, 3>{128, 64, 32};
auto lin1 = Linear<double>(epoch_size, inputs->shape().at(1), hidden_layer_sizes[0]);
auto lin1relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[0]});
auto lin2 = Linear<double>(epoch_size, hidden_layer_sizes[0], hidden_layer_sizes[1]);
auto lin2relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[1]});
auto lin3 = Linear<double>(epoch_size, hidden_layer_sizes[1], hidden_layer_sizes[2]);
auto lin3relu = Tensor<double>::create(Shape{epoch_size, hidden_layer_sizes[2]});
auto lin4 = Linear<double>(epoch_size, hidden_layer_sizes[2], 1);
// Use mean squared error loss.
// The input to the loss function will be of shape (epoch_size, 1),
// the same as the output of the last layer in our neural network.
auto mse = MeanSquaredErrorLoss<double>(Shape{epoch_size, 1});
// Use stochastic gradient descent with momentum.
auto optimizer = SGDWithMomentum<double>(
SGDWithMomentum<double>::LearningRate(0.05),
SGDWithMomentum<double>::Momentum(0.9));
// Training loop.
for (auto i = 0; i < 5000; ++i) {
// Conduct the forward pass.
lin1(*inputs).relu(*lin1relu);
lin2(*lin1relu).relu(*lin2relu);
lin3(*lin2relu).relu(*lin3relu);
auto &predictions = lin4(*lin3relu);
auto &loss = mse(MeanSquaredErrorLoss<double>::Predictions(predictions),
MeanSquaredErrorLoss<double>::Targets(*targets));
// Retrieve the compute graph.
auto compute_graph = loss.compute_graph();
// Conduct the backward pass.
compute_graph.backward();
// Update the parameters.
optimizer.step(compute_graph);
// Reset the compute graph.
compute_graph.reset();
}
```
With some additional boilerplate for reading and writing PGM files, we can fit our neural network to the heightmap.
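As a rough idea of what that boilerplate involves, here is a minimal sketch of reading an ASCII (P2) PGM file; the actual reader and writer used by the example live in `example/heightmap/heightmap.cpp` and may differ.

```cpp
// Minimal sketch of a P2 (ASCII) PGM reader that normalizes pixel values
// to [0, 1]. For brevity it does not handle '#' comment lines or binary
// (P5) files.
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Pgm {
  std::size_t width = 0;
  std::size_t height = 0;
  std::vector<double> pixels;  // normalized heights in [0, 1]
};

inline Pgm read_pgm(const std::string &path) {
  std::ifstream in(path);
  std::string magic;
  in >> magic;
  if (!in || magic != "P2") {
    throw std::runtime_error("expected an ASCII PGM (P2) file: " + path);
  }
  Pgm pgm;
  double max_value = 0.0;
  in >> pgm.width >> pgm.height >> max_value;
  pgm.pixels.reserve(pgm.width * pgm.height);
  for (std::size_t i = 0; i < pgm.width * pgm.height; ++i) {
    double value = 0.0;
    in >> value;
    pgm.pixels.push_back(value / max_value);  // normalize to [0, 1]
  }
  if (!in) {
    throw std::runtime_error("truncated PGM file: " + path);
  }
  return pgm;
}
```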
```
> heightmap example/heightmap/data/pgm/b42eb40cc0d4.pgm example/heightmap/output/
```
If we output PGM files representing the model's predictions over the epochs, we can visualize the model's training progress.
`example/heightmap/data/gif/b42eb40cc0d4.gif`
Nice! We've trained a feed-forward neural network to fit a heightmap.
You can find the complete source code for this heightmap example in `example/heightmap/heightmap.cpp`. Try it out for yourself using the `make heightmap` command.
First, install the library.
```
> make install
```
After installation, you can write your own neural networks using the API described below. For examples of how to use the API, see the `example` directory. See `example/CMakeLists.txt` for an example of how to link against the library. To run the examples, use the following commands.
```
> make line
> make heightmap
```
As we saw in the examples, the atomic unit of the library is Tensor.
On top of Tensor, there are a number of additional utilities.
```cpp
#include <tenadiff/tensor.hpp>
```
Tensor is responsible for performing the forward pass and building the compute
graph while doing so. The compute graph is constructed fresh for each forward pass
at runtime (like PyTorch and TensorFlow V2).
| Method | Description |
|---|---|
| `Tensor::create` | Create a new tensor. |
| `Tensor::shape` | Get the shape of the tensor. |
| `Tensor::at` | Get the element at a given multi-dimensional index. |
| `Tensor::broadcast` | Broadcast the tensor to an expanded shape. |
| `Tensor::transpose` | Transpose the tensor. |
| `Tensor::matmul` | Matrix multiply with another tensor. |
| `Tensor::add` | Element-wise add with another tensor. |
| `Tensor::sub` | Element-wise subtract another tensor from the current tensor. |
| `Tensor::relu` | ReLU activation. |
| `Tensor::softmax` | Apply the softmax function to the tensor. |
| `Tensor::pow` | Element-wise power of the tensor. |
| `Tensor::log` | Element-wise log of the tensor. |
| `Tensor::mul` | Element-wise multiply with another tensor. |
| `Tensor::neg` | Element-wise negation of the tensor. |
| `Tensor::sum` | Sum the tensor into a 1-dimensional tensor. |
| `Tensor::mean` | Take the mean of the tensor into a 1-dimensional tensor. |
| `Tensor::set_requires_grad` | Set whether the tensor requires a gradient during the backward pass. |
| `Tensor::requires_grad` | Check whether the tensor requires a gradient during the backward pass. |
| `Tensor::grad` | Get the gradient of the tensor from the last backward pass. |
| `Tensor::zero_grad` | Zero the gradient of the tensor. |
| `Tensor::compute_graph` | Get the compute graph rooted at the tensor. |
| `Tensor::update` | Update the tensor in-place using the gradient. |
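As a small sketch of how these methods compose, the snippet below multiplies two tensors, reduces the result to a sum, and inspects the gradient of one input. It follows the out-parameter pattern from the examples above; the exact `sum` signature is an assumption.

```cpp
// Sketch: element-wise multiply, reduce to a sum, back-propagate, and read
// the gradient of `x`. The `sum` out-parameter signature is assumed to
// follow the same pattern as the other operations.
auto x = Tensor<double>::create(Shape{3}, {1.0, 2.0, 3.0});
auto y = Tensor<double>::create(Shape{3}, {4.0, 5.0, 6.0});
x->set_requires_grad(true);
auto product = Tensor<double>::create(Shape{3});
auto sum_out = Tensor<double>::create(Shape{1});
x->mul(*y, *product);
auto &result = product->sum(*sum_out);
auto graph = result.compute_graph();
graph.backward();  // d(sum(x * y))/dx = y, so x's gradient is {4, 5, 6}
// x->grad() now holds that gradient; reset the graph before the next pass.
graph.reset();
```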
```cpp
#include <tenadiff/tensor.hpp>
```
Tensor::Init provides static methods for initializing new tensors using random number distributions.
| Method | Description |
|---|---|
| `Tensor::Init::uniform` | Initialize a new tensor with a uniform distribution. |
| `Tensor::Init::kaiming_uniform` | Initialize a new tensor with a Kaiming uniform distribution. |
```cpp
#include <tenadiff/compute_graph.hpp>
```
ComputeGraph represents the computation of a neural network. The graph consists
of linked ComputeGraphData and ComputeGraphOperation nodes. ComputeGraphData
represents a Tensor in the graph. ComputeGraphOperation represents an operation
between data nodes, such as Tensor::add. Tensor is responsible for building the
graph during the forward pass. ComputeGraph is responsible for resetting the graph
between passes, and handles propagating gradients backwards through the graph.
Check out the `include/compute_graph.hpp` header for details on the graph's structure and ownership model.
| Method | Description |
|---|---|
| `ComputeGraph::backward` | Propagate gradients backwards through the graph. |
| `ComputeGraph::root` | Get the root ComputeGraphData node of the graph. |
| `ComputeGraph::parameters` | Get the parameter ComputeGraphData nodes of the graph. |
| `ComputeGraph::requires_backward` | Check whether the given ComputeGraphData node requires a gradient during the backward pass. |
| `ComputeGraph::reset` | Detach all nodes and zero any gradients from the previous backward pass. |
In the majority of cases, you will only need to interact with ComputeGraph::backward and
ComputeGraph::reset. As you are unlikely to require interacting with ComputeGraphData and
ComputeGraphOperation nodes directly, their descriptions are omitted here for brevity.
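In practice, that interaction is the per-iteration lifecycle already shown in the examples above:

```cpp
// Typical per-iteration lifecycle, assuming `loss` is the tensor produced by
// a forward pass and `optimizer` is one of the optimizers described below.
auto compute_graph = loss.compute_graph();  // graph was built during the forward pass
compute_graph.backward();                   // propagate gradients to the parameters
optimizer.step(compute_graph);              // update the parameters using the gradients
compute_graph.reset();                      // detach nodes and zero gradients for reuse
```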
```cpp
#include <tenadiff/linear.hpp>
```
Linear represents a linear layer in a neural network.
It applies an affine linear transformation to an input tensor of shape (batch_size, input_size).
| Method | Description |
|---|---|
| `Linear::weights` | Get the weights of the linear layer. |
| `Linear::biases` | Get the biases of the linear layer. |
| `Linear::operator()` | Apply an affine linear transformation to the input tensor. |
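For reference, constructing and applying a layer follows the pattern from the heightmap example: the constructor takes the batch size, input size, and output size, and `operator()` returns a reference to the output tensor.

```cpp
// A linear layer mapping 2 input features to 16 outputs for a batch of 8.
auto layer = Linear<double>(8, 2, 16);
auto input = Tensor<double>::create(Shape{8, 2});
auto &output = layer(*input);  // output has shape (8, 16)
// layer.weights() and layer.biases() expose the layer's trainable parameters.
```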
```cpp
#include <tenadiff/loss.hpp>
```
CrossEntropyLoss represents the cross-entropy loss function.
It is used to measure the loss of a multi-class classification model
with an arbitrary number of classes. The predictions must contain the
unnormalized logits for each class, with shape (batch_size, num_classes).
The targets must be one-hot encoded, with shape (batch_size, num_classes).
| Method | Description |
|---|---|
| `CrossEntropyLoss::operator()` | Compute the cross-entropy loss between the predictions and targets. |
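CrossEntropyLoss does not appear in the examples above; assuming it follows the same call pattern as MeanSquaredErrorLoss (a constructor taking the prediction shape and an `operator()` taking wrapped predictions and targets), usage would look roughly like this:

```cpp
// Sketch only: assumes CrossEntropyLoss mirrors the MeanSquaredErrorLoss call
// pattern used in the examples. `logits` holds unnormalized class scores of
// shape (batch_size, num_classes) and `one_hot_targets` holds the one-hot
// labels of the same shape; both are assumed to exist already.
const dimension_size_t batch_size = 32;
const dimension_size_t num_classes = 10;
auto cross_entropy = CrossEntropyLoss<double>(Shape{batch_size, num_classes});
auto &loss = cross_entropy(
    CrossEntropyLoss<double>::Predictions(*logits),
    CrossEntropyLoss<double>::Targets(*one_hot_targets));
```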
```cpp
#include <tenadiff/loss.hpp>
```
MeanSquaredErrorLoss represents the mean squared error loss function.
It is used to measure the loss of a regression model.
| Method | Description |
|---|---|
| `MeanSquaredErrorLoss::operator()` | Compute the mean squared error between the predictions and targets. |
```cpp
#include <tenadiff/optimizer.hpp>
```
SGD represents the stochastic gradient descent optimizer.
Optimizers are responsible for updating the parameters of a neural network
after a backward pass.
| Method | Description |
|---|---|
| `SGD::step()` | Perform a single optimization step on the parameters of a compute graph. |
```cpp
#include <tenadiff/optimizer.hpp>
```
SGDWithMomentum represents the stochastic gradient descent optimizer with momentum.
It extends the SGD optimizer with a momentum term.
| Method | Description |
|---|---|
| `SGDWithMomentum::step()` | Perform a single optimization step on the parameters of a compute graph. |
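For reference, one common formulation of the momentum update is shown below; the exact variant TenaDiff implements may differ (for example, in whether the learning rate scales the velocity or the gradient).

$$
v \leftarrow \mu v + \nabla_\theta L,
\qquad
\theta \leftarrow \theta - \eta \, v,
$$

where $\mu$ is the momentum coefficient and $\eta$ is the learning rate.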
Based on profiling of the heightmap example, we can see that the Tensor::matmul forward and backward
passes are the most expensive operations.
```
> perf record -F 99 -p <pid> -g -- sleep 60
> perf script > scratch/out.perf
> stackcollapse-perf.pl scratch/out.perf > scratch/out.folded
> flamegraph.pl scratch/out.folded > example/heightmap/data/flamegraph.svg
```
`example/heightmap/data/flamegraph.svg`
There are definitely ways to improve performance. Further optimization will be left as future work.
- Optimize `Tensor::matmul`
- Gradient clipping
- Adam optimizer
- Classifier example
- Fused kernel for cross-entropy loss
- Fused kernel for log softmax
- Tensor storage layouts in addition to strided (e.g., sparse)
- Device types in addition to CPU (e.g., GPU)
- Compute graph export to Mermaid diagram
- Python bindings
- Utilities for exporting training metrics to TensorBoard
- Utilities for data loading and preprocessing
- Convolutional layers
- Recurrent layers
- Attention layers
Repositories:
- https://github.com/karpathy/llm.c
- https://github.com/karpathy/nanoGPT
- https://github.com/karpathy/micrograd
- https://github.com/martin226/litenet
Readings:
- https://udlbook.github.io/udlbook
- https://blog.ezyang.com/2019/05/pytorch-internals
- https://pytorch.org/blog/computational-graphs-constructed-in-pytorch
- https://pytorch.org/blog/how-computational-graphs-are-executed-in-pytorch
- https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative

