floDl

A Rust-native deep learning framework built on libtorch.
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.

PyTorch Users • Getting Started • Graph Builder • Graph Tree • Training • Multi-GPU • Parity • Benchmarks • Migration Guide • Data Loading

If You Know PyTorch, You Know floDl

PyTorch:

model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)

pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()

floDl:

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;

Same concepts, same names, same GPU kernels underneath. The ? operator propagates errors explicitly: every fallible call returns a Result, and the compiler requires you to handle it, so failures are never silent. Drop replaces the garbage collector. The full migration guide covers every op, module, and pattern.
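For readers new to Rust, here is a minimal std-only sketch of what `?` does. The helpers `parse_dim` and `param_count` are hypothetical and are not part of floDl; they only show how an error from an inner call is returned early to the caller.

```rust
use std::num::ParseIntError;

// Hypothetical helper (not floDl API): parse a layer width from text.
fn parse_dim(s: &str) -> Result<usize, ParseIntError> {
    s.trim().parse::<usize>()
}

// `?` returns early with the Err if a parse fails; otherwise it unwraps the Ok value.
fn param_count(input: &str, output: &str) -> Result<usize, ParseIntError> {
    let i = parse_dim(input)?;
    let o = parse_dim(output)?;
    Ok(i * o + o) // weights plus biases of a dense layer
}

fn main() {
    match param_count("2", "16") {
        Ok(n) => println!("parameters: {n}"),
        Err(e) => println!("bad dimension: {e}"),
    }
    // A malformed dimension surfaces as an Err value instead of being ignored.
    assert!(param_count("2", "sixteen").is_err());
}
```

The caller decides what to do with the error; forgetting to look at a Result triggers a compiler warning, which is the property the comparison above relies on.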

New to Rust? Read Rust for PyTorch Users — 10 patterns in 15 minutes.

Getting Started

With the CLI (recommended, no Rust needed):

curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl
./fdl setup          # detect hardware, download libtorch, configure build environment
./fdl init my-proj   # scaffold a new project with training template

The fdl script auto-downloads a pre-compiled CLI binary (~750KB, pure Rust, no libtorch dependency). It detects your GPUs, downloads the right libtorch variant, and configures Docker or native builds. See the full CLI reference for all commands.

One-liner with Docker (no Rust, no setup):

curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
make build    # first build (~5 min, downloads libtorch)
make run      # train the model

Native -- Rust 1.85+ and libtorch:

./fdl libtorch download    # auto-detects CPU or CUDA
cargo add flodl && cargo build

For CUDA: cargo add flodl --features cuda + CUDA toolkit.

The CLI and Docker paths each generate an annotated training template. Edit src/main.rs to build your model:

use flodl::*;

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)     // residual connection
    .through(Linear::new(16, 2)?)
    .build()?;

let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);
model.train();

for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}

The Graph Builder

floDl's fluent graph builder lets you describe complex architectures as readable data flow — no boilerplate, no nn.Module subclassing.

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                        // activation
    .through(LayerNorm::new(16)?)         // normalization
    .also(Linear::new(16, 16)?)           // residual connection
    .through(Linear::new(16, 2)?)         // output projection
    .build()?;

build() returns a Graph that implements Module — you can nest it inside other graphs. Things get interesting when architectures get complex:

let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;

Every construct — split/merge, also, loop_body, gate, switch, map, tag/using — composes cleanly. Forward references (using before tag) carry state across calls, enabling recurrent architectures without special-casing.

| Method | What it does |
|---|---|
| `from(m).through(m)` | Linear chain |
| `also(m)` | Residual: `input + m(input)` |
| `fork(m)` | Side branch: capture output as tag, stream continues |
| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |
| `tag(name)` / `using(refs)` | Named references: backward or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond` / `until_cond` | Conditional loops |
| `gate(router, modules![...])` | Soft routing: weighted combination |
| `switch(selector, modules![...])` | Hard routing: only selected branch |
| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |
| `input(names)` | Auxiliary graph inputs for multi-input architectures |
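To make two rows of the table concrete, the sketch below reproduces the arithmetic of a residual connection (`input + m(input)`) and of a `Mean` merge on plain vectors. It uses only the standard library; the free functions `also` and `merge_mean` and the closures standing in for modules are illustrative assumptions, not floDl's implementation.

```rust
// Residual connection: output = input + f(input), element-wise.
fn also<F: Fn(&[f32]) -> Vec<f32>>(input: &[f32], f: F) -> Vec<f32> {
    let branch = f(input);
    input.iter().zip(branch.iter()).map(|(x, y)| x + y).collect()
}

// Mean merge: element-wise average of equally sized branch outputs.
// Assumes at least one branch and identical lengths.
fn merge_mean(branches: &[Vec<f32>]) -> Vec<f32> {
    let n = branches.len() as f32;
    let len = branches[0].len();
    (0..len)
        .map(|i| branches.iter().map(|b| b[i]).sum::<f32>() / n)
        .collect()
}

fn main() {
    let x = [1.0_f32, 2.0, 3.0];
    // A stand-in "module" that doubles its input.
    let residual = also(&x, |v| v.iter().map(|a| a * 2.0).collect());
    println!("residual: {residual:?}");

    let merged = merge_mean(&[vec![1.0, 2.0], vec![3.0, 6.0]]);
    println!("merged: {merged:?}");
}
```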

See the Graph Builder Tutorial and the full showcase.

Graph Tree: Hierarchical Composition

This is where floDl goes beyond PyTorch. Graphs nest inside graphs with label-path addressing — dot-separated paths that let you reach into any subgraph from the root. Train components independently, compose them into larger architectures, and control training phases declaratively.

// Build components independently
let scan = FlowBuilder::from(scan_net).tag("hidden")
    .label("scan").build()?;

let read = FlowBuilder::from(read_net).tag("confidence")
    .label("read").build()?;

let encoder = FlowBuilder::from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into full model
let model = FlowBuilder::from(encoder)
    .through(classifier)
    .build()?;

Dotted paths reach anywhere

Every tag and subgraph is addressable through dotted paths from the root:

model.validate_path("encoder")?;                 // -> Subgraph
model.validate_path("encoder.scan.hidden")?;      // -> Tag (three levels deep)
model.validate_path("encoder.read.confidence")?;  // -> Tag
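As a rough mental model of dotted-path addressing, the std-only sketch below walks a nested tree one segment at a time and reports whether the path ends at a subgraph or at a tag. The `Node` and `Target` types and the `resolve` function are hypothetical; floDl's actual data structures and return types may differ.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a graph node: labeled child subgraphs plus local tags.
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    tags: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum Target {
    Subgraph,
    Tag,
}

// Walk the path segment by segment; only the last segment may name a tag.
fn resolve(root: &Node, path: &str) -> Option<Target> {
    let mut node = root;
    let mut segments = path.split('.').peekable();
    while let Some(seg) = segments.next() {
        let is_last = segments.peek().is_none();
        if let Some(child) = node.children.get(seg) {
            node = child;
            if is_last {
                return Some(Target::Subgraph);
            }
        } else if is_last && node.tags.iter().any(|t| t == seg) {
            return Some(Target::Tag);
        } else {
            return None;
        }
    }
    None
}

// Mirrors the encoder/scan/read layout used in the example above.
fn sample_tree() -> Node {
    let scan = Node { tags: vec!["hidden".into()], ..Default::default() };
    let read = Node { tags: vec!["confidence".into()], ..Default::default() };
    let mut encoder = Node::default();
    encoder.children.insert("scan".into(), scan);
    encoder.children.insert("read".into(), read);
    let mut root = Node::default();
    root.children.insert("encoder".into(), encoder);
    root
}

fn main() {
    let root = sample_tree();
    for path in ["encoder", "encoder.scan.hidden", "encoder.missing"] {
        println!("{path} -> {:?}", resolve(&root, path));
    }
}
```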

Declarative training phases

Freeze and thaw entire subtrees by path — no manual parameter iteration:

// Phase 1: train only the classifier, encoder is frozen
model.freeze("encoder")?;
let fresh_params = model.parameters();  // only unfrozen params
let mut opt = Adam::new(&fresh_params, 1e-3);
// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw("encoder.scan")?;
let mut opt = Adam::with_groups()
    .group(&model.parameters_at("encoder.scan")?, 1e-4)  // low LR
    .group(&model.parameters_at("classifier")?, 1e-3)
    .build();

Subgraph checkpoints

Train a component standalone, save it, load it into a larger model:

// Pre-trained encoder saved earlier
encoder.save_checkpoint("encoder_v1.fdl.gz")?;

// Load into the composed model — namespace + hash validated
model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;
model.freeze("encoder.read")?;  // lock what's proven

Cross-boundary observation

Metrics flow up through the tree automatically:

model.record_at("encoder.scan.loss", scan_loss)?;
model.record_at("encoder.read.accuracy", read_acc)?;
model.record_scalar("total_loss", total)?;

model.flush(&[]);  // single call flushes the entire tree

// Trends across boundaries — drive training decisions
if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {
    model.thaw("encoder.read")?;  // scan stalled, unfreeze read
}

// Monitor sees all metrics with dotted names automatically
monitor.log(epoch, elapsed, &model);
// -> total_loss, encoder.scan.loss, encoder.read.accuracy

This is progressive model composition: each component is trained and validated independently before becoming a building block in a larger architecture. Checkpoints, metrics, and training phases compose just like the graphs themselves.

See the full Graph Tree Tutorial.

The Training Experience

Training Monitor

Drop-in monitor with adaptive ETA, resource tracking, and a live web dashboard — no external dependencies, no separate process.

use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...
    monitor.log(epoch, t.elapsed(), &model);  // sees entire graph tree
}
monitor.finish();

  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]
  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)
  training complete in 2.8s  | loss: 0.0012

Interactive benchmark dashboard — real data from a 100-epoch training run

The live dashboard updates via Server-Sent Events (no WebSocket, no npm), tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and all past epochs backfill instantly.

monitor.save_html("training_report.html");  // self-contained archive
monitor.export_csv("training.csv")?;         // for external analysis

Observation and Trend Queries

Tags double as observation points. Collect metrics during training and use trend queries to make programmatic training decisions:

for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        let loss = mse_loss(&pred, &target)?;        // same loss as in Getting Started
        graph.collect(&["hidden"])?;                 // from graph tag
        graph.record_scalar("loss", loss.item()?);   // external metric
    }
    graph.flush(&["hidden", "loss"]);

    // Programmatic training control
    if graph.trend("loss").stalled(5, 1e-4) {
        optimizer.set_lr(optimizer.lr() * 0.5);      // decay LR
    }
    if graph.trend("loss").converged(5, 1e-5) {
        break;                                        // early stopping
    }
}

| Method | What it does |
|---|---|
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric aggregation |
| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |
| `g.trend(tag).slope(n)` | OLS slope over last n epochs |
| `g.trend(tag).stalled(n, tol)` | Is \|slope\| below tolerance? |
| `g.trend(tag).improving(n)` | Is loss decreasing? |
| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |
| `g.trends(tags).all_improving(n)` | Group queries across branches |
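The queries in the table reduce to a few lines of statistics over the per-epoch history. The sketch below is a std-only illustration of an ordinary-least-squares slope, a `stalled` test on its magnitude, and a `converged` test on the window variance. The function signatures, the use of population variance, and the handling of short histories are assumptions; floDl's own definitions may differ in detail.

```rust
// Last `n` values of the history, if there are enough of them.
fn window(history: &[f64], n: usize) -> Option<&[f64]> {
    if n == 0 || history.len() < n {
        None
    } else {
        Some(&history[history.len() - n..])
    }
}

// OLS slope of the last `n` values against epoch index 0..n-1.
fn slope(history: &[f64], n: usize) -> Option<f64> {
    if n < 2 {
        return None; // a slope needs at least two points
    }
    let w = window(history, n)?;
    let mean_x = (n as f64 - 1.0) / 2.0;
    let mean_y = w.iter().sum::<f64>() / n as f64;
    let mut num = 0.0;
    let mut den = 0.0;
    for (i, y) in w.iter().enumerate() {
        let dx = i as f64 - mean_x;
        num += dx * (y - mean_y);
        den += dx * dx;
    }
    Some(num / den)
}

// Stalled: the absolute slope is below the tolerance.
fn stalled(history: &[f64], n: usize, tol: f64) -> bool {
    slope(history, n).map_or(false, |s| s.abs() < tol)
}

// Converged: the (population) variance of the window is below the tolerance.
fn converged(history: &[f64], n: usize, tol: f64) -> bool {
    match window(history, n) {
        Some(w) => {
            let mean = w.iter().sum::<f64>() / n as f64;
            let var = w.iter().map(|y| (y - mean).powi(2)).sum::<f64>() / n as f64;
            var < tol
        }
        None => false,
    }
}

fn main() {
    // Made-up loss values, for illustration only.
    let falling = [3.0, 2.0, 1.0];
    let flat = [0.5, 0.5, 0.5, 0.5];
    println!("slope(falling) = {:?}", slope(&falling, 3));
    println!("stalled(flat) = {}", stalled(&flat, 4, 1e-4));
    println!("converged(flat) = {}", converged(&flat, 4, 1e-5));
}
```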

Visualization

let svg = g.svg(Some("model.svg"))?;              // architecture diagram
g.svg_with_profile(Some("profile.svg"))?;          // timing heatmap
g.plot_html("training.html", &["loss", "head"])?;  // interactive curves

See the Training Monitor Tutorial and the Observation example.

Multi-GPU Training

Ddp::setup() gives you transparent heterogeneous multi-GPU training with zero changes to your training loop. floDl detects your GPUs, picks the best strategy, and balances work automatically: the slowest GPU anchors the pace while faster ones run ahead intelligently.

Graph DDP -- one line to go from single-GPU to multi-GPU:

// Detect GPUs, replicate model, set optimizer, enable training
Ddp::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

// Training loop is IDENTICAL for 1 or N GPUs
for batch in model.epoch(0) {
    let loss = model.forward_batch(&batch?)?;
    model.step()?;  // AllReduce + sync + optimizer + zero_grad
}

DDP Builder -- thread-per-GPU, works with any Module:

let state = Ddp::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .policy(ApplyPolicy::Cadence)       // ElChe for mixed GPUs
    .backend(AverageBackend::Nccl)      // or Cpu for A/B testing
    .run()?
    .join()?;

| | Graph DDP | DDP Builder |
|---|---|---|
| Works with | `Graph` builder | Any `Module` |
| GPU model | Scatter per batch | Thread per GPU (Local SGD) |
| Mixed GPUs | El Che auto-enabled | `ApplyPolicy` x `AverageBackend` |
| Setup | One line (`Ddp::setup`) | Builder pattern |
| Dashboard | Integrated | Stderr logging |

A/B testing: swap AverageBackend::Nccl for AverageBackend::Cpu with one line. If loss curves match, you have validated the cheaper backend for your workload.
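One way to decide whether two loss curves "match" is to compare them epoch by epoch and look at the largest relative gap. The sketch below is a std-only illustration; the function name `max_relative_gap`, the choice of metric, and any threshold you compare it against are assumptions, not something floDl prescribes.

```rust
// Largest relative gap between two per-epoch loss curves of equal length.
// Returns None if the curves are empty or have different lengths.
fn max_relative_gap(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let gap = a
        .iter()
        .zip(b.iter())
        .map(|(x, y)| {
            // Guard against division by zero when both losses are exactly 0.
            let scale = x.abs().max(y.abs()).max(f64::MIN_POSITIVE);
            (x - y).abs() / scale
        })
        .fold(0.0, f64::max);
    Some(gap)
}

fn main() {
    // Made-up loss values for two hypothetical runs, for illustration only.
    let run_a = [1.0, 0.5, 0.25];
    let run_b = [1.0, 0.5, 0.26];
    match max_relative_gap(&run_a, &run_b) {
        Some(gap) => println!("largest relative gap: {gap:.3}"),
        None => println!("curves are not comparable"),
    }
}
```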

See the Multi-GPU Tutorial, DDP Builder Tutorial, Data Loading Tutorial, and DDP Reference.

PyTorch Parity

floDl covers the modules, losses, and optimizers you actually use:

| Category | Count | Highlights |
|---|---|---|
| NN Modules | 30+ | `Linear`, `Conv1d`/`2d`/`3d` + transpose, `GRU`/`LSTM`, `MultiheadAttention`, `Bilinear`, all norms (`Layer`/`RMS`/`Group`/`Batch`/`Instance`), all pooling, `Embedding`/`EmbeddingBag`, `PixelShuffle`, `Upsample`, `Unfold`/`Fold` |
| Activations | 17 | `ReLU`, `LeakyReLU`, `ELU`, `GELU`, `SiLU`, `Mish`, `SELU`, `Softplus`, `Hardswish`, `PReLU`, `Softmax`, ... |
| Losses | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |
| Optimizers | 7 | `SGD`, `Adam`, `AdamW`, `RMSprop`, `Adagrad`, `RAdam`, `NAdam` (all with parameter groups) |
| Schedulers | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |
| Init | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |
| Tensor Ops | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |
| Autograd | 90+ | Differentiable backward for every op above |

Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient clipping via foreach ops. Mixed precision with AutocastGuard + GradScaler. CUDA Graphs for replay-based training.
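For readers unfamiliar with dynamic loss scaling, the sketch below shows the general technique in plain Rust: skip the step and halve the scale when a gradient overflows, and double the scale after a run of clean steps. The `LossScaler` type, its constants, and its method names are hypothetical and do not describe floDl's `GradScaler`.

```rust
// Hypothetical scaler illustrating the general dynamic loss scaling scheme.
struct LossScaler {
    scale: f32,
    growth_interval: u32, // clean steps required before the scale grows
    clean_steps: u32,
}

impl LossScaler {
    fn new(scale: f32, growth_interval: u32) -> Self {
        LossScaler { scale, growth_interval, clean_steps: 0 }
    }

    // Inspect the (already unscaled) gradients and adjust the scale.
    // Returns true if the optimizer step should be applied.
    fn update(&mut self, grads: &[f32]) -> bool {
        let overflow = grads.iter().any(|g| !g.is_finite());
        if overflow {
            self.scale *= 0.5; // back off and skip this step
            self.clean_steps = 0;
            return false;
        }
        self.clean_steps += 1;
        if self.clean_steps == self.growth_interval {
            self.scale *= 2.0; // enough clean steps: try a larger scale
            self.clean_steps = 0;
        }
        true
    }
}

fn main() {
    let mut scaler = LossScaler::new(1024.0, 2);
    let steps: [&[f32]; 3] = [&[f32::INFINITY, 0.1], &[0.1, 0.2], &[0.3, 0.1]];
    for grads in steps {
        let applied = scaler.update(grads);
        println!("step applied: {applied}, scale now {}", scaler.scale);
    }
}
```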

The full migration guide has side-by-side code for every op, module, and pattern.

Performance

Same CUDA kernels as PyTorch, so the difference comes from what happens between kernel launches. Ten models, ten interleaved rounds, locked GPU clocks (RTX 5060 Ti, floDl v0.3.0 vs PyTorch 2.10.0). Times are in milliseconds; lower is better:

| Model | PyTorch | flodl | Delta |
|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% |
| mlp | 291.1 ms | 207.0 ms | -29% |
| residual_tower | 406.9 ms | 309.7 ms | -24% |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% |
| gated_routing | 248.0 ms | 217.3 ms | -12% |
| iterative_refine | 230.7 ms | 206.0 ms | -11% |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% |
| lstm_seq | 692.3 ms | 692.3 ms | 0% |
| convnet | 1298.0 ms | 1298.2 ms | 0% |

Wins 8 of 10, ties 2, zero regressions. The ties (convnet, lstm_seq) are compute-bound -- both frameworks saturate the GPU, confirming identical CUDA kernels. The gap appears where framework overhead matters: dispatch-bound architectures (transformer -31%, mlp -29%), graph routing (residual_tower -24%), and recurrent loops (feedback_fixed -16%).
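The Delta column appears to be the relative change of floDl's time against PyTorch's, rounded to a whole percent. That formula is an assumption, but it reproduces the rows of the table, as this small check on three of them shows:

```rust
// Relative change of the floDl time against the PyTorch time, in percent.
// Negative values mean floDl was faster.
fn delta_percent(pytorch_ms: f64, flodl_ms: f64) -> f64 {
    (flodl_ms - pytorch_ms) / pytorch_ms * 100.0
}

fn main() {
    // (model, PyTorch ms, floDl ms), copied from the table above.
    let rows = [
        ("transformer", 3183.0, 2199.8),
        ("mlp", 291.1, 207.0),
        ("convnet", 1298.0, 1298.2),
    ];
    for (name, pytorch_ms, flodl_ms) in rows {
        println!("{name}: {:.1}%", delta_percent(pytorch_ms, flodl_ms));
    }
}
```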

Benchmark Report | Interactive dashboard

Why Rust for Deep Learning?

Deterministic memory. Python adds ~3-5 µs of framework overhead per GPU op. Go's GC can't manage VRAM: an earlier Go implementation required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM budgets, pending-free queues). Rust replaces all of that with impl Drop for Tensor. Memory is freed the instant a tensor leaves scope.
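The scope-based release described above can be observed with a few lines of standard Rust. In this sketch, `Buffer` is a hypothetical stand-in for a tensor: it owns no real memory and only records when its `Drop` implementation runs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for a device buffer; it only logs when it is released.
struct Buffer {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Buffer {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("freed {}", self.name));
    }
}

// Returns the order of events so the timing of the drop is visible.
fn run() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = Buffer { name: "activations", log: Rc::clone(&log) };
        log.borrow_mut().push("using activations".to_string());
    } // `_a` goes out of scope here, so its Drop runs before the next line
    log.borrow_mut().push("after scope".to_string());
    let events = log.borrow().clone();
    events
}

fn main() {
    for event in run() {
        println!("{event}");
    }
}
```

The release happens at a known point in the program, the closing brace, rather than whenever a collector next runs.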

Zero-cost safety. Every op returns Result<T> — no silent failures. Ownership ensures tensors are freed exactly once. The borrow checker prevents data races at compile time.

Same GPU kernels. floDl binds libtorch — the C++ library under PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch path, autograd tracking, and graph execution.

Features Reference

Training Tools

| Tool | What it does |
|---|---|
| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |
| `migrate_checkpoint` | Remap parameter names across versions |
| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |
| `GradScaler` | Dynamic loss scaling for fp16 training |
| `cast_parameters` | Cast model parameters to any dtype |
| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |
| `CudaGraph` | Capture/replay training steps for fixed-shape models |
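As a reference for what norm-based clipping computes, here is a std-only sketch of clipping by global L2 norm: one norm over all gradients, then one rescale if that norm exceeds the limit. The function `clip_by_global_norm` is illustrative; it works on plain vectors and says nothing about how floDl fuses the operation on the GPU.

```rust
// Rescale all gradients so their combined L2 norm does not exceed `max_norm`.
// Returns the norm measured before clipping.
fn clip_by_global_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Pass 1: a single norm over every gradient element.
    let total = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|v| v * v)
        .sum::<f32>()
        .sqrt();
    // Pass 2: scale everything by the same factor, only if needed.
    if total > max_norm {
        let factor = max_norm / total;
        for g in grads.iter_mut() {
            for v in g.iter_mut() {
                *v *= factor;
            }
        }
    }
    total
}

fn main() {
    // Two hypothetical parameter gradients with a combined norm of 5.
    let mut grads = vec![vec![3.0_f32], vec![4.0_f32]];
    let norm = clip_by_global_norm(&mut grads, 1.0);
    println!("norm before clipping: {norm}");
    println!("clipped gradients: {grads:?}");
}
```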

Module Traits

Beyond forward/parameters, Module provides optional methods the graph recognizes automatically:

| Method | What happens |
|---|---|
| `as_named_input()` | `using()` refs arrive as a named map |
| `reset()` | Loops auto-call before iterating; clears per-forward state |
| `detach_state()` | Break gradient chains on retained state |
| `sub_modules()` | Recursive device placement, training mode, parameter collection |

Build Profiles

# Optimize floDl in dev builds — your code stays fast to compile.
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1

| Profile | flodl | Your code | Typical rebuild |
|---|---|---|---|
| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |
| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |

Multi-GPU (DDP)

| Component | What it does |
|---|---|
| `Ddp::setup` | One-liner: detect GPUs, distribute, set optimizer, train |
| `Ddp::builder` | Thread-per-GPU with Local SGD, any Module |
| `ApplyPolicy` | Sync / Cadence / Async (when to average) |
| `AverageBackend` | Nccl / Cpu (how to average, A/B testable) |
| `ElChe` | Heterogeneous GPU cadence strategy |
| `NcclComms` / `NcclRankComm` | NCCL AllReduce, Broadcast, abort handles |
| `CudaEvent` / `CudaStream` | Async GPU-CPU pipeline, timing |
| `DataLoader` | Resident/streaming/distributed, VRAM-aware prefetch, auto OOM fallback |

Numerical Verification

Every differentiable path is verified against finite-difference gradients:

117 autograd op-level checks (every op + compositions)
Module-level checks (every NN module, input + parameter gradients)
Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)
1027 library tests, zero clippy warnings — all tests run on both CPU and CUDA
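A finite-difference check compares an analytic gradient with a numerical estimate obtained by nudging each input. The sketch below shows the idea on a sum-of-squares function using central differences. It is a std-only illustration; the function names, step size, and tolerance are assumptions and not floDl's test harness.

```rust
// Central finite difference: (f(x + h) - f(x - h)) / (2h), one coordinate at a time.
fn numeric_grad<F: Fn(&[f64]) -> f64>(f: F, x: &[f64], h: f64) -> Vec<f64> {
    let mut point = x.to_vec();
    (0..x.len())
        .map(|i| {
            point[i] = x[i] + h;
            let up = f(&point);
            point[i] = x[i] - h;
            let down = f(&point);
            point[i] = x[i]; // restore before moving to the next coordinate
            (up - down) / (2.0 * h)
        })
        .collect()
}

// Function under test: f(x) = sum of x_i^2.
fn sum_squares(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum()
}

// Its hand-derived gradient: df/dx_i = 2 * x_i.
fn analytic_grad(x: &[f64]) -> Vec<f64> {
    x.iter().map(|v| 2.0 * v).collect()
}

// Largest absolute difference between two gradients of equal length.
fn max_abs_diff(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(p, q)| (p - q).abs()).fold(0.0, f64::max)
}

fn main() {
    let x = [1.0, -2.0, 0.5];
    let numeric = numeric_grad(sum_squares, &x, 1e-5);
    let analytic = analytic_grad(&x);
    let err = max_abs_diff(&numeric, &analytic);
    println!("max abs error: {err:e}");
    assert!(err < 1e-6, "gradient check failed");
}
```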

Hardware Compatibility

Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell (RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl links libtorch's stable C API, which supports every architecture the driver supports. If nvidia-smi works, floDl trains on it.

Documentation

Choose your path

| Background | Start here |
|---|---|
| New to Rust | Rust for PyTorch Users (10 patterns in 15 minutes) |
| Know Rust, new to DL | Tensors then Training |
| Know PyTorch | Porting Guide (or `/port` with AI) then Graph Builder |
| Scaling to multi-GPU | Multi-GPU Training then DDP Builder |
| Just show me code | `quickstart` or `showcase` |

Tutorials

Rust for PyTorch Users — 10 Rust patterns in 15 minutes
Tensors — creation, ops, memory, CUDA
Autograd — variables, gradients, backward
Modules — all layers, convolutions, RNNs, attention, normalization
Training — losses, optimizers, mixed precision, full loop
Graph Builder — fluent API from simple to complex
Advanced Graphs — forward refs, loops, gates, switches
Visualization — DOT/SVG, profiling heatmaps
Utilities — checkpoints, clipping, freezing, initialization, scheduling
Training Monitor — ETA, resource tracking, live dashboard
Graph Tree — hierarchical composition, freeze/thaw, subgraph checkpoints
Multi-GPU Training — Ddp::setup, El Che, auto-balancing, DataLoader integration
DDP Builder — thread-per-GPU, Local SGD, A/B testable backends
Data Loading — DataLoader, resident/streaming modes, VRAM-aware prefetch, DDP integration

Examples

quickstart — build, train, and monitor a model with residual connections
sine_wave — sine regression with monitor, checkpoint round-trip
mixed_precision — float16 training with GradScaler
transfer_learning — checkpoint, partial load, freeze, fine-tune
schedulers — warmup + cosine + plateau composition
observation — collect, flush, trend queries, early stopping
showcase — every graph builder method in one graph

Porting from PyTorch

Porting Guide — module mapping, FlowBuilder patterns, training loop translation
AI-assisted porting — point any AI coding assistant at the skill guide for automated translation. With Claude Code: /port my_model.py
fdl api-ref — generate a structured API reference for your flodl version. Used by AI tools and useful on its own.

Architecture

+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  monitor/  ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
|  graph/    Fluent builder, graph tree, execution, DOT/SVG |
+-----------------------------------------------------------+
|  data/     DataLoader, resident/streaming, prefetch       |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, DDP, NCCL         |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
|  flodl-sys   FFI bindings to libtorch C++ shim            |
+-----------------------------------------------------------+
|  libtorch / CUDA / NCCL                                   |
+-----------------------------------------------------------+

Story

floDl started as a question: what would a deep learning framework look like if you designed it around Rust's ownership model instead of fighting a garbage collector?

An earlier attempt in Go proved the architecture — the graph builder, the module system, the observation engine — but hit a wall: Go's GC cannot manage GPU memory deterministically. That required building five layers of memory management infrastructure on top of the language, not with it.

Rust solved this at the language level. impl Drop for Tensor replaced hundreds of lines of lifecycle management. The graph builder, module composition, and design philosophy carried forward; the memory fights didn't.

License

floDl is open-sourced software licensed under the MIT license.
