This document provides technical implementation details for Catalyst's design.
Catalyst uses explicit abstractions in the high-level design. I deliberately built it more flexibly than I needed to, to better understand the structures at play in realistic systems. For example:
- Computational graphs are represented explicitly as DAGs, rather than building DAGs implicitly with default types
- I implemented runtime hardware selection with a strategy pattern
- I handled memory management explicitly using smart pointers
- I used a strategy pattern for tensor operations to prepare the architecture for custom GPU kernels.
Model class: Top-level container for deep networks exposed to users of the library
- Composes Block objects, holding layers and activations
- Manages training state, including epoch, batch, and hardware choice
- Holds a LogBook for diagnostics.

Model model(storage_dir, device, name);
model.add_block(std::make_shared<NeuralLayer>(width));
model.train(dataloader, batch_size, epochs);

CGGraph: DAG of computations
- Adjacency list of nodes and connections: std::unordered_map<SharedCGNodePtr, std::vector<SharedCGNodePtr>>
- DFS-based topological sort for forward/backward passes, with caching and cache invalidation
- Activations are computed node-by-node, allowing batching. Gradients are computed in reverse topological order.
CGNode: Graph node, holding in this case a neuron, but flexible enough to allow optimisations, e.g. operator fusion
- Stores weights, biases, activations, gradients
- Forward: h = σ(Wx + b), where σ is ReLU
- Backward: Computes ∂L/∂w using the chain rule
- Update: w ← w − η∇L.
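The three equations above can be sketched for a single neuron. This is an illustrative stand-in, not Catalyst's actual CGNode API: plain doubles replace tensors, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-neuron node mirroring the equations above.
struct NeuronSketch {
    std::vector<double> w;
    double b;
    double z = 0.0; // pre-activation, cached for the backward pass

    // Forward: h = ReLU(w·x + b)
    double forward(const std::vector<double>& x) {
        z = b;
        for (size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
        return z > 0.0 ? z : 0.0;
    }

    // Backward: chain rule through ReLU gives dL/dw_i = dL/dh · 1[z>0] · x_i
    std::vector<double> grad_w(double dL_dh, const std::vector<double>& x) const {
        std::vector<double> g(w.size(), 0.0);
        if (z > 0.0)
            for (size_t i = 0; i < w.size(); ++i) g[i] = dL_dh * x[i];
        return g;
    }

    // Update: w ← w − η∇L
    void update(const std::vector<double>& g, double eta) {
        for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * g[i];
    }
};
```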
Key Algorithm: Topological Sort
std::vector<SharedCGNodePtr> topo_sort_() {
    if (is_topo_cache_valid_) return topo_forward_ordering_cache_;
    std::stack<SharedCGNodePtr> st;
    std::unordered_map<SharedCGNodePtr, bool> visited;
    std::function<void(SharedCGNodePtr)> dfs_visit = [&](SharedCGNodePtr node) {
        visited[node] = true;
        for (auto& child : graph_adj_list_[node]) {
            if (!visited[child]) dfs_visit(child);
        }
        st.push(node); // Post-order insertion
    };
    for (auto& pair : graph_adj_list_) {
        if (!visited[pair.first]) dfs_visit(pair.first);
    }
    // Stack is reverse topological order; invert for forward order
    std::vector<SharedCGNodePtr> forward_ordering;
    while (!st.empty()) {
        forward_ordering.push_back(st.top());
        st.pop();
    }
    topo_forward_ordering_cache_ = forward_ordering;
    is_topo_cache_valid_ = true;
    return forward_ordering;
}

Cache Invalidation: Any graph modification sets is_topo_cache_valid_ = false.
Strategy Pattern for Hardware Abstraction

// Interface
class TensorStrategy {
public:
    virtual ~TensorStrategy() = default; // needed: deleted through a base-class pointer
    virtual TensorPtr matmul(TensorPtr a, TensorPtr b) = 0;
    virtual TensorPtr relu(TensorPtr a) = 0;
    // ... other operations
};

// Selector
class TensorOpsSelector {
    std::unique_ptr<TensorStrategy> strategy;
public:
    TensorOpsSelector(bool use_gpu) {
        // if/else rather than ?:, since unique_ptr<GPUStrategy> and
        // unique_ptr<CPUStrategy> share no common type in a conditional expression
        if (use_gpu) strategy = std::make_unique<GPUStrategy>();
        else         strategy = std::make_unique<CPUStrategy>();
    }
};

CPUStrategy: Delegates to LibTorch CPU operations. GPUStrategy: Intended for custom OpenCL kernels; currently delegates to LibTorch as a placeholder.
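A minimal runnable sketch of the selector delegating to a strategy, with plain std::vector<double> standing in for LibTorch tensors. The class names here are illustrative stand-ins, not Catalyst's real types.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Vec = std::vector<double>;

class TensorStrategySketch {
public:
    virtual ~TensorStrategySketch() = default;
    virtual Vec relu(const Vec& a) = 0;
};

class CPUStrategySketch : public TensorStrategySketch {
public:
    Vec relu(const Vec& a) override {
        Vec out(a.size());
        for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] > 0.0 ? a[i] : 0.0;
        return out;
    }
};

// Placeholder GPU strategy that, like Catalyst's, currently reuses the CPU path
class GPUStrategySketch : public CPUStrategySketch {};

class SelectorSketch {
    std::unique_ptr<TensorStrategySketch> strategy_;
public:
    explicit SelectorSketch(bool use_gpu) {
        if (use_gpu) strategy_ = std::make_unique<GPUStrategySketch>();
        else         strategy_ = std::make_unique<CPUStrategySketch>();
    }
    // The selector forwards each op to the active strategy
    Vec relu(const Vec& a) { return strategy_->relu(a); }
};
```

Callers only ever see the selector, so the backend can be chosen at runtime without recompiling client code.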
Kernel: Abstract base for OpenCL kernels
- Stores source code, workgroup size/count
- Virtual compile() method for JIT compilation.

KernelFactory: Creates kernel instances
- Static method build_kernel(type, workgroup_size, workgroup_count)
- Returns the requested subclass (MatmulKernel, ReluKernel) instantiated on CPU or GPU.
KernelManagerSingleton: A global kernel cache
- Device querying and workgroup optimisation
- Kernel compilation and caching (avoids recompilation)
- Enqueue/dequeue for GPU execution.
Note: Kernel source (.cl) files are designed and stubbed but not implemented.
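The factory-plus-cache pattern described above can be sketched as follows. All names here are hypothetical stand-ins for the real Kernel classes, and compile() is a no-op placeholder for OpenCL JIT compilation.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct KernelSketch {
    std::string source;
    bool compiled = false;
    void compile() { compiled = true; } // stands in for OpenCL JIT compilation
};

class KernelManagerSketch {
    std::map<std::string, std::shared_ptr<KernelSketch>> cache_;
public:
    // Compiles a kernel on first request, then serves the cached instance
    std::shared_ptr<KernelSketch> get(const std::string& type) {
        auto it = cache_.find(type);
        if (it != cache_.end()) return it->second; // cache hit: no recompilation
        auto k = std::make_shared<KernelSketch>();
        k->source = "/* .cl source for " + type + " */";
        k->compile();
        cache_[type] = k;
        return k;
    }
};
```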
Dataset: Loads CSV, splits into train/val/test
- Constructor: Dataset(path, target_variable_index)
- Split: split(train_frac, val_frac, test_frac) returns an array of 3 datasets
- Handles integer underflow from float→int conversion.
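One way the fraction→count conversion can be handled, sketched below: truncation in the float→int cast can drop rows, so the remainder goes to the last split. This is an assumed policy for illustration; Catalyst's exact remainder handling may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Convert split fractions to row counts without losing rows to truncation.
std::array<std::size_t, 3> split_counts(std::size_t n, double train_frac, double val_frac) {
    auto n_train = static_cast<std::size_t>(n * train_frac); // truncates toward zero
    auto n_val   = static_cast<std::size_t>(n * val_frac);   // truncates toward zero
    std::size_t n_test = n - n_train - n_val;                // remainder: no rows lost
    return {n_train, n_val, n_test};
}
```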
DataLoader: Iterator over dataset
- get_next_observation() returns {input: Tensor, target: Tensor}
- Automatic reset when exhausted.
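The auto-resetting iterator behaviour can be sketched like this; Observation and the member names are illustrative, and plain doubles replace tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Observation { double input; double target; };

class DataLoaderSketch {
    std::vector<Observation> data_;
    std::size_t cursor_ = 0;
public:
    explicit DataLoaderSketch(std::vector<Observation> data) : data_(std::move(data)) {}
    Observation get_next_observation() {
        if (cursor_ >= data_.size()) cursor_ = 0; // automatic reset when exhausted
        return data_[cursor_++];
    }
};
```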
Async Design: Logging doesn't block training

void log_async(const LogEntry& log_entry) {
    auto log_task = std::async(std::launch::async, [this, log_entry]() {
        add_log_to_map(log_entry);       // In-memory
        write_log_to_storage(log_entry); // Disk
    });
    // Caveat: a std::async future blocks in its destructor, so log_task must be
    // retained (e.g. in a member container) for this call to return immediately
}

Thread Safety: Mutex on shared log map

void add_log_to_map(const LogEntry& log) {
    // logs_mutex_ is a class member; a function-local mutex would protect nothing
    std::lock_guard<std::mutex> lock(logs_mutex_);
    logs_map[log.get_type()].push_back(log);
}

LogEntry Types: SystemLogEntry, CheckpointLogEntry, EvaluationLogEntry, DebugLogEntry.
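Because a std::future returned by std::async joins in its destructor, discarding it makes log_async effectively synchronous. One fix, sketched here with a simplified string log entry and hypothetical names, is to retain the futures and join them explicitly:

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <string>
#include <vector>

class AsyncLoggerSketch {
    std::vector<std::future<void>> pending_; // retained so log_async doesn't block
    std::mutex logs_mutex_;                  // member mutex guards the shared log store
    std::vector<std::string> logs_;
public:
    void log_async(std::string entry) {
        pending_.push_back(std::async(std::launch::async, [this, entry] {
            std::lock_guard<std::mutex> lock(logs_mutex_);
            logs_.push_back(entry);
        }));
    }
    void flush() { // join all outstanding log tasks
        for (auto& f : pending_) f.get();
        pending_.clear();
    }
    std::size_t count() {
        std::lock_guard<std::mutex> lock(logs_mutex_);
        return logs_.size();
    }
};
```

Note that pending_ itself is only touched from the caller's thread in this sketch; it would need its own lock if log_async were called from multiple threads.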
Serialisation: Uses Cereal library for JSON output.
Smart Pointers Throughout: Memory is managed entirely through smart pointers; there is no new/delete in the application code.
- std::shared_ptr<CGNode> for graph nodes (ownership is shared to allow asynchronous execution on distributed setups)
- std::shared_ptr<torch::Tensor> for tensor data
- std::unique_ptr<TensorStrategy> for the strategy (single ownership)
Unit Tests (using Google Test):
- test_dataset.cpp: Split logic, underflow handling
- test_dataloader.cpp: Iterator behaviour, reset
- test_logbook.cpp: Async logging, serialisation
- test_TensorOpsSelector_using_CPUStrategy.cpp: All tensor ops
- test_TensorOpsSelector_using_GPUStrategy.cpp: GPU dispatch
Integration Tests: Not implemented. Planned for testing the full training cycle.
Known Issues:
- CGGraph Segfaults: Pointer invalidation during multi-layer traversal
  - May be caused by incorrect shared_ptr lifetimes
  - Needs review of pointers in layer construction.
- Type Mismatches: In gradient updates.
- No Graph Validation: Accepts cyclic graphs (will loop forever)
  - This was a design choice, as we only deal with feedforward networks, but we could add cycle detection and connectivity checks if the system were expanded.
- Kernel Module Incomplete:
  - MatmulKernel and ReluKernel headers exist
  - OpenCL source files are empty
  - KernelManager device queries are unfinished.
- Missing Error Handling: No input validation; valid data is assumed.
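The cycle detection mentioned above could look like the following DFS sketch, using int node ids in place of SharedCGNodePtr: a node encountered again while still on the current DFS path indicates a back edge, i.e. a cycle.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Adj = std::unordered_map<int, std::vector<int>>;

bool has_cycle_from(int node, const Adj& adj, std::unordered_set<int>& done,
                    std::unordered_set<int>& on_path) {
    if (on_path.count(node)) return true; // back edge: cycle found
    if (done.count(node)) return false;   // subtree already fully explored
    on_path.insert(node);
    auto it = adj.find(node);
    if (it != adj.end())
        for (int child : it->second)
            if (has_cycle_from(child, adj, done, on_path)) return true;
    on_path.erase(node);
    done.insert(node);
    return false;
}

bool has_cycle(const Adj& adj) {
    std::unordered_set<int> done, on_path;
    for (const auto& p : adj)
        if (has_cycle_from(p.first, adj, done, on_path)) return true;
    return false;
}
```

Running this once per graph mutation (alongside the existing cache invalidation) would turn the infinite loop into a reportable error.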
Why an explicit DAG:
- A DAG is a natural representation of data dependencies, making distributed execution easier to handle
- Topological sort determines execution order
- Reverse topological sort enables backpropagation
- It supports future graph optimisations (fusion, pruning)

Why a strategy pattern:
- Runtime device selection without recompilation, as in PyTorch
- Easily extensible to new backends (TPU, custom accelerators)
- Trivial testing of our custom implementation against major backends
Why LibTorch:
- I used the tensor type from LibTorch (the PyTorch backend) because I knew it would provide the tensor operations I needed. I considered Eigen as well, but I wasn't experienced enough with deep learning systems to determine whether it was suitable
- It provides a correct reference implementation for testing
Why async logging:
- Async logging avoids holding up the training pipeline
- Simple implementation with std::async and a mutex
Why this project:
- I wanted to understand C++ deep learning backends and high-performance systems programming

Having learned what I came for, I moved on before finishing the ambitiously-scoped project, so the system is not optimised for performance:
- No benchmarking was done
- Graph optimisations like operator fusion are yet to be implemented
- No mixed precision (FP16 forward pass, FP32 gradients)
- No gradient accumulation