This document provides technical implementation details for Catalyst's design.
Catalyst uses explicit abstractions in the high-level design. I deliberately built it more flexibly than I needed to, to better understand the structures at play in realistic systems. For example:
- Computational graphs are represented explicitly as DAGs, rather than building DAGs implicitly with default types
- I implemented runtime hardware selection with a strategy pattern
- I handled memory management explicitly using smart pointers
- I used a strategy pattern for tensor operations to prepare the architecture for custom GPU kernels.
Model class: Top-level container for deep networks exposed to users of the library
- Composes Block objects, holding layers and activations
- Manages training state, including epoch, batch, and hardware choice
- Holds a LogBook for diagnostics.

Model model(storage_dir, device, name);
model.add_block(std::make_shared<NeuralLayer>(width));
model.train(dataloader, batch_size, epochs);

CGGraph: DAG of computations
- Adjacency list of nodes and connections: std::unordered_map<SharedCGNodePtr, std::vector<SharedCGNodePtr>>
- DFS-based topological sort for forward/backward passes, with caching and cache invalidation
- Activations are computed node-by-node, allowing batching. Gradients are computed in reverse topological order.
CGNode: Graph node, holding in this case a neuron, but flexible enough to allow optimisations, e.g. operator fusion
- Stores weights, biases, activations, gradients
- Forward: h = σ(Wx + b), where σ is ReLU
- Backward: Computes ∂L/∂w using the chain rule
- Update: w ← w − η∇L.
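The three equations above can be sketched for a single neuron. This is an illustrative stand-in, not Catalyst's actual CGNode API: plain doubles replace tensors, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-neuron node mirroring the equations above.
struct NeuronSketch {
    std::vector<double> w;
    double b;
    double z = 0.0; // pre-activation, cached for the backward pass

    // Forward: h = ReLU(w·x + b)
    double forward(const std::vector<double>& x) {
        z = b;
        for (size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
        return z > 0.0 ? z : 0.0;
    }

    // Backward: chain rule through ReLU gives dL/dw_i = dL/dh · 1[z>0] · x_i
    std::vector<double> grad_w(double dL_dh, const std::vector<double>& x) const {
        std::vector<double> g(w.size(), 0.0);
        if (z > 0.0)
            for (size_t i = 0; i < w.size(); ++i) g[i] = dL_dh * x[i];
        return g;
    }

    // Update: w ← w − η∇L
    void update(const std::vector<double>& g, double eta) {
        for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * g[i];
    }
};
```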
Key Algorithm: Topological Sort
std::vector<SharedCGNodePtr> topo_sort_() {
    if (is_topo_cache_valid_) return topo_forward_ordering_cache_;
    std::stack<SharedCGNodePtr> st;
    std::unordered_map<SharedCGNodePtr, bool> visited;
    std::function<void(SharedCGNodePtr)> dfs_visit = [&](SharedCGNodePtr node) {
        visited[node] = true;
        for (auto& child : graph_adj_list_[node]) {
            if (!visited[child]) dfs_visit(child);
        }
        st.push(node); // Post-order insertion
    };
    for (auto& pair : graph_adj_list_) {
        if (!visited[pair.first]) dfs_visit(pair.first);
    }
    // Stack is reverse topological order; invert for forward order
    std::vector<SharedCGNodePtr> forward_ordering;
    while (!st.empty()) {
        forward_ordering.push_back(st.top());
        st.pop();
    }
    topo_forward_ordering_cache_ = forward_ordering;
    is_topo_cache_valid_ = true;
    return forward_ordering;
}

Cache Invalidation: Any graph modification sets is_topo_cache_valid_ = false.
Strategy Pattern for Hardware Abstraction

// Interface
class TensorStrategy {
public:
    virtual ~TensorStrategy() = default; // needed: deleted through a base-class pointer
    virtual TensorPtr matmul(TensorPtr a, TensorPtr b) = 0;
    virtual TensorPtr relu(TensorPtr a) = 0;
    // ... other operations
};

// Selector
class TensorOpsSelector {
    std::unique_ptr<TensorStrategy> strategy;
public:
    TensorOpsSelector(bool use_gpu) {
        // if/else rather than ?:, since unique_ptr<GPUStrategy> and
        // unique_ptr<CPUStrategy> share no common type in a conditional expression
        if (use_gpu) strategy = std::make_unique<GPUStrategy>();
        else         strategy = std::make_unique<CPUStrategy>();
    }
};

CPUStrategy: Delegates to LibTorch CPU operations. GPUStrategy: Intended for custom OpenCL kernels; currently delegates to LibTorch as a placeholder.
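A minimal runnable sketch of the selector delegating to a strategy, with plain std::vector<double> standing in for LibTorch tensors. The class names here are illustrative stand-ins, not Catalyst's real types.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Vec = std::vector<double>;

class TensorStrategySketch {
public:
    virtual ~TensorStrategySketch() = default;
    virtual Vec relu(const Vec& a) = 0;
};

class CPUStrategySketch : public TensorStrategySketch {
public:
    Vec relu(const Vec& a) override {
        Vec out(a.size());
        for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] > 0.0 ? a[i] : 0.0;
        return out;
    }
};

// Placeholder GPU strategy that, like Catalyst's, currently reuses the CPU path
class GPUStrategySketch : public CPUStrategySketch {};

class SelectorSketch {
    std::unique_ptr<TensorStrategySketch> strategy_;
public:
    explicit SelectorSketch(bool use_gpu) {
        if (use_gpu) strategy_ = std::make_unique<GPUStrategySketch>();
        else         strategy_ = std::make_unique<CPUStrategySketch>();
    }
    // The selector forwards each op to the active strategy
    Vec relu(const Vec& a) { return strategy_->relu(a); }
};
```

Callers only ever see the selector, so the backend can be chosen at runtime without recompiling client code.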
Kernel: Abstract base for OpenCL kernels
- Stores source code, workgroup size/count
- Virtual compile() method for JIT compilation.

KernelFactory: Creates kernel instances
- Static method build_kernel(type, workgroup_size, workgroup_count)
- Returns the requested subclass (MatmulKernel, ReluKernel) instantiated on CPU or GPU.
KernelManagerSingleton: A global kernel cache
- Device querying and workgroup optimisation
- Kernel compilation and caching (avoids recompilation)
- Enqueue/dequeue for GPU execution.
Note: Kernel source (.cl) files are designed and stubbed but not implemented.
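The factory-plus-cache pattern described above can be sketched as follows. All names here are hypothetical stand-ins for the real Kernel classes, and compile() is a no-op placeholder for OpenCL JIT compilation.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct KernelSketch {
    std::string source;
    bool compiled = false;
    void compile() { compiled = true; } // stands in for OpenCL JIT compilation
};

class KernelManagerSketch {
    std::map<std::string, std::shared_ptr<KernelSketch>> cache_;
public:
    // Compiles a kernel on first request, then serves the cached instance
    std::shared_ptr<KernelSketch> get(const std::string& type) {
        auto it = cache_.find(type);
        if (it != cache_.end()) return it->second; // cache hit: no recompilation
        auto k = std::make_shared<KernelSketch>();
        k->source = "/* .cl source for " + type + " */";
        k->compile();
        cache_[type] = k;
        return k;
    }
};
```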
Dataset: Loads CSV, splits into train/val/test
- Constructor: Dataset(path, target_variable_index)
- Split: split(train_frac, val_frac, test_frac) returns an array of 3 datasets
- Handles integer underflow from float→int conversion.
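One way the fraction→count conversion can be handled, sketched below: truncation in the float→int cast can drop rows, so the remainder goes to the last split. This is an assumed policy for illustration; Catalyst's exact remainder handling may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Convert split fractions to row counts without losing rows to truncation.
std::array<std::size_t, 3> split_counts(std::size_t n, double train_frac, double val_frac) {
    auto n_train = static_cast<std::size_t>(n * train_frac); // truncates toward zero
    auto n_val   = static_cast<std::size_t>(n * val_frac);   // truncates toward zero
    std::size_t n_test = n - n_train - n_val;                // remainder: no rows lost
    return {n_train, n_val, n_test};
}
```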
DataLoader: Iterator over dataset
- get_next_observation() returns {input: Tensor, target: Tensor}
- Automatic reset when exhausted.
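The auto-resetting iterator behaviour can be sketched like this; Observation and the member names are illustrative, and plain doubles replace tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Observation { double input; double target; };

class DataLoaderSketch {
    std::vector<Observation> data_;
    std::size_t cursor_ = 0;
public:
    explicit DataLoaderSketch(std::vector<Observation> data) : data_(std::move(data)) {}
    Observation get_next_observation() {
        if (cursor_ >= data_.size()) cursor_ = 0; // automatic reset when exhausted
        return data_[cursor_++];
    }
};
```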
Async Design: Logging doesn't block training

void log_async(const LogEntry& log_entry) {
    auto log_task = std::async(std::launch::async, [this, log_entry]() {
        add_log_to_map(log_entry);       // In-memory
        write_log_to_storage(log_entry); // Disk
    });
    // Caveat: a std::async future blocks in its destructor, so log_task must be
    // retained (e.g. in a member container) for this call to return immediately
}

Thread Safety: Mutex on shared log map

void add_log_to_map(const LogEntry& log) {
    // logs_mutex_ is a class member; a function-local mutex would protect nothing
    std::lock_guard<std::mutex> lock(logs_mutex_);
    logs_map[log.get_type()].push_back(log);
}

LogEntry Types: SystemLogEntry, CheckpointLogEntry, EvaluationLogEntry, DebugLogEntry.
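Because a std::future returned by std::async joins in its destructor, discarding it makes log_async effectively synchronous. One fix, sketched here with a simplified string log entry and hypothetical names, is to retain the futures and join them explicitly:

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <string>
#include <vector>

class AsyncLoggerSketch {
    std::vector<std::future<void>> pending_; // retained so log_async doesn't block
    std::mutex logs_mutex_;                  // member mutex guards the shared log store
    std::vector<std::string> logs_;
public:
    void log_async(std::string entry) {
        pending_.push_back(std::async(std::launch::async, [this, entry] {
            std::lock_guard<std::mutex> lock(logs_mutex_);
            logs_.push_back(entry);
        }));
    }
    void flush() { // join all outstanding log tasks
        for (auto& f : pending_) f.get();
        pending_.clear();
    }
    std::size_t count() {
        std::lock_guard<std::mutex> lock(logs_mutex_);
        return logs_.size();
    }
};
```

Note that pending_ itself is only touched from the caller's thread in this sketch; it would need its own lock if log_async were called from multiple threads.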
Serialisation: Uses Cereal library for JSON output.
Smart Pointers Throughout: Memory is managed entirely through smart pointers; there is no new/delete in the application code.
- std::shared_ptr<CGNode> for graph nodes (ownership is shared to allow asynchronous execution on distributed setups)
- std::shared_ptr<torch::Tensor> for tensor data
- std::unique_ptr<TensorStrategy> for the strategy (single ownership)
Unit Tests (using Google Test):
- test_dataset.cpp: Split logic, underflow handling
- test_dataloader.cpp: Iterator behaviour, reset
- test_logbook.cpp: Async logging, serialisation
- test_TensorOpsSelector_using_CPUStrategy.cpp: All tensor ops
- test_TensorOpsSelector_using_GPUStrategy.cpp: GPU dispatch
Integration Tests: Not implemented. Planned for testing the full training cycle.
Known Issues:
- CGGraph Segfaults: Pointer invalidation during multi-layer traversal
  - May be caused by incorrect shared_ptr lifetimes
  - Needs review of pointers in layer construction.
- Type Mismatches: In gradient updates.
- No Graph Validation: Accepts cyclic graphs (will loop forever)
  - This was a design choice, as we only deal with feedforward networks, but we could add cycle detection and connectivity checks if the system were expanded.
- Kernel Module Incomplete:
  - MatmulKernel and ReluKernel headers exist
  - OpenCL source files are empty
  - KernelManager device queries are unfinished.
- Missing Error Handling: No input validation; valid data is assumed.
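The cycle detection mentioned above could look like the following DFS sketch, using int node ids in place of SharedCGNodePtr: a node encountered again while still on the current DFS path indicates a back edge, i.e. a cycle.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Adj = std::unordered_map<int, std::vector<int>>;

bool has_cycle_from(int node, const Adj& adj, std::unordered_set<int>& done,
                    std::unordered_set<int>& on_path) {
    if (on_path.count(node)) return true; // back edge: cycle found
    if (done.count(node)) return false;   // subtree already fully explored
    on_path.insert(node);
    auto it = adj.find(node);
    if (it != adj.end())
        for (int child : it->second)
            if (has_cycle_from(child, adj, done, on_path)) return true;
    on_path.erase(node);
    done.insert(node);
    return false;
}

bool has_cycle(const Adj& adj) {
    std::unordered_set<int> done, on_path;
    for (const auto& p : adj)
        if (has_cycle_from(p.first, adj, done, on_path)) return true;
    return false;
}
```

Running this once per graph mutation (alongside the existing cache invalidation) would turn the infinite loop into a reportable error.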
Why an explicit DAG:
- A DAG is a natural representation of data dependencies, making distributed execution easier to handle
- Topological sort determines execution order
- Reverse topological sort enables backpropagation
- It supports future graph optimisations (fusion, pruning)

Why a strategy pattern:
- Runtime device selection without recompilation, as in PyTorch
- Easily extensible to new backends (TPU, custom accelerators)
- Trivial testing of our custom implementation against major backends
Why LibTorch:
- I used the tensor type from LibTorch (the PyTorch backend) because I knew it would provide the tensor operations I needed. I considered Eigen as well, but I wasn't experienced enough with deep learning systems to determine whether it was suitable
- It provides a correct reference implementation for testing
Why async logging:
- Async logging avoids holding up the training pipeline
- Simple implementation with std::async and a mutex
Why this project:
- I wanted to understand C++ deep learning backends and high-performance systems programming

Having learned what I came for, I moved on before finishing the ambitiously-scoped project, so the system is not optimised for performance:
- No benchmarking was done
- Graph optimisations like operator fusion are yet to be implemented
- No mixed precision (FP16 forward pass, FP32 gradients)
- No gradient accumulation