YOrch is a task-oriented orchestration and execution library. It takes advantage of compile-time expansion and optimization opportunities, and is currently header-only.
Here is a minimal example:
#include "yorch/yorch.hpp"
#include <iostream>
int main() {
auto tree = yorch::task_tree
.root([]() noexcept {
std::cout << "TaskA\n";
})()
.node<1>([]() noexcept {
std::cout << "TaskB\n";
})()
.node<1>([]() noexcept {
std::cout << "TaskC\n";
})()
.node<2>([]() noexcept {
std::cout << "TaskD\n";
})();
auto plan = yorch::compile_plan(tree);
const auto result = yorch::run_plan(plan);
return result.ok() ? 0 : 1;
}Here .root(...)() registers the root task TaskA, .node<1>(...)()
registers first-level child tasks, and .node<2>(...)() registers a deeper
task. The current task_tree represents a tree-shaped dependency structure.
Output:
TaskA
TaskB
TaskC
TaskD
YOrch accepts C++ callable objects as tasks.
- Free functions
void run(int)
- Free function pointers
void (*fp)(int) = run
- Free function references
void (&fr)(int) = run
- Non-generic lambdas
[](int x) { return x + 1; }
- Function objects with one unambiguous
operator()struct F { void operator()(int) const; }
- Function objects callable through
const[](int x) { return x; }
- Function wrapper objects with one concrete call signature
std::function<int(int)>
- Non-static member function pointers
&Worker::run
constnon-static member function pointers&Worker::read
noexceptfunctions andnoexceptmember functionsvoid run() noexceptvoid Worker::run() noexcept
- Static member functions
&Worker::make
- CUDA kernels
my_kernel<<<grid, block>>>(args...)
- Overloaded function names
- passing
rundirectly when bothvoid run(int)andvoid run(double)exist
- passing
- Generic lambdas
[](auto x) { return x; }
- Function objects with templated
operator()struct F { template <class T> void operator()(T); }
- Function objects with multiple overloaded
operator()struct F { void operator()(int); void operator()(double); }
- Dynamic callables whose signature is only known at runtime
- type-erased wrappers whose target signature is runtime-dependent
- C-style variadic functions
void log(const char* fmt, ...)
- Ref-qualified member function pointers
void Worker::run() &
volatileorconst volatilemember function pointersvoid Worker::run() volatile
The task registration APIs are described in Register Tasks.
YOrch supports different task structures. According to the number of direct dependencies a task may have, they can be grouped as:
- Tree
- DAG, Directed Acyclic Graph (WIP)
A task tree is a hierarchical dependency model. It is not an arbitrary graph; instead, it starts from one root task and expands downward level by level.
Its core properties are:
- Every task except the root has exactly one direct parent.
- A parent task runs before its child tasks.
- One parent task may expand into multiple child tasks to express later branches.
- If a task fails, the whole subtree below that task stops executing.
You can think of it as this structure:
load_config
init_runtime
load_assets
build_scene
build_index
start_workers
worker_a
worker_b
This pseudocode means:
init_runtimehasload_configas its direct parent.load_assetsandstart_workersboth haveinit_runtimeas their direct parent.build_sceneandbuild_indexonly depend onload_assets.worker_aandworker_bonly depend onstart_workers.
If load_assets fails, the subtree containing build_scene and build_index
stops. Whether the start_workers side continues depends on whether its own
parent chain succeeds.
If init_runtime fails, all following tasks below it lose their execution
precondition, including load_assets, start_workers, and their respective
children.
This model fits orchestration flows with clear stages and explicit upstream / downstream relationships. It gives up arbitrary DAG wiring in exchange for simpler dependency sources, clearer failure propagation paths, and easier-to-read execution plans.
The task tree construction APIs are described in Build a Task Tree.
Still in development.
The task structure describes "what to do" and "how tasks depend on each other".
The executor is responsible for actually running that structure. In YOrch, a
task tree is first compiled into a plan with compile_plan(...), and then passed
to run_plan(...). A plan can be understood as the static instruction sheet used
during execution: it records task nodes, parent-child relationships, argument
sources, slot layout, and constraints that must be checked before execution.
Execution also involves several concepts: context stores shared data for one
execution and can be passed to run_plan(...); policies select optional
execution details such as slot layout policy and execution policy; and execution
results are returned through step_result, which tells whether the whole plan
completed successfully.
The currently supported mode is synchronous serial DFS execution for task trees. The executor starts from the root task, enters child tasks only after their parent succeeds, and follows depth-first order. If a task returns a failure state, the executor stops the current branch or terminates the whole execution according to the result semantics. In this mode, YOrch tries to use compile-time plan information for optimization and expansion, avoiding extra runtime overhead for task scheduling.
Future execution modes will include async serial and async parallel execution, so the same task structure can be reused across more runtime models.
Executor concepts and APIs are described in Executor.
Use CMake presets:
cmake --list-presets
cmake --preset dev
cmake --build --preset dev
ctest --preset devYOrch uses Google Benchmark for runtime performance tests. Because the current
library mainly supports static task tree construction, task structures are
usually fixed into a plan through compile_plan(...) before execution. The
current benchmarks therefore focus on the execution cost of an already compiled
plan in run_plan(...), rather than the cost of building a task tree or calling
compile_plan(...).
Build the benchmark preset:
cmake --preset benchmark
cmake --build --preset benchmarkRun the run_plan benchmark:
./build/benchmark/benchmarks/bench_run_plan --benchmark_min_time=0.05sThe current bench_run_plan outputs two categories:
RunPlan/Runtime/...: realistic execution baseline. The benchmark usesDoNotOptimize(plan),DoNotOptimize(sink), andClobberMemory()to reduce over-optimization and better measure the actual cost ofrun_plan(...)dispatch, slots, prev-access, and policy paths.RunPlan/Optimized/...: optimization ceiling. The benchmark does not add extra barriers around the plan, allowing the compiler to inline, fold, and eliminate abstractions as much as possible. This shows how far YOrch's static execution path can be optimized under ideal conditions.
Benchmark names use this structure:
RunPlan/<Mode>/<Topology>/<Payload>/<NodeCount>/<SlotLayout>/<ExecPolicy>
For example:
RunPlan/Runtime/Chain32/Int/32/Compact/HeapStack
This means the benchmark uses Runtime mode to test a 32-node chain plan, the
payload is int, the slot layout is Compact, and the execution policy uses an
explicit heap stack.
Current topology coverage includes:
Chain8andChain32: chain-shaped task trees, used to observe execution cost as depth grows.Wide8: multiple siblings under one root, used to observe fanout and sibling traversal cost.Balanced15: a 15-node balanced tree, used to observe a mixed structure closer to typical task trees.FanoutConsumeCopies3: usesfanout_consume_with_copies_policy, with oneconsume_prevchild and twocopy_prevchildren, used to observe staging and copy cost.
Current policy combinations include:
OneToOne/RecursiveCompact/RecursiveOneToOne/HeapStackCompact/HeapStack
Recursive means exec_serial_dfs_recursive_policy, which uses compile-time
recursive expansion for synchronous DFS. HeapStack means
exec_serial_dfs_explicit_heap_stack_policy, which stores traversal state in an
explicit heap stack. OneToOne is the default layout with one physical slot per
node. Compact reuses slots according to lifetimes under synchronous serial DFS.
Benchmark numbers should only be treated as order-of-magnitude references. Different CPUs, compilers, options, system load, thermal state, and OS scheduling can all affect absolute nanosecond-level results. For serious comparisons, record the hardware, operating system, compiler version, CMake preset, build type, and runtime parameters, and prefer comparing relative differences between policies or topologies in the same environment.
The environment for this example result is:
- CPU: Apple M4
- OS: macOS
- Compiler:
AppleClang - Build:
benchmarkpreset,Release - Command:
./build/benchmark/benchmarks/bench_run_plan --benchmark_min_time=0.05s
In one typical result, the Recursive path in Runtime mode is roughly:
RunPlan/Runtime/Chain8/Int/8/OneToOne/Recursive 11.2 ns
RunPlan/Runtime/Chain32/Int/32/OneToOne/Recursive 50.6 ns
RunPlan/Runtime/Balanced15/Int/15/OneToOne/Recursive 20.8 ns
Converted per node, the Recursive path is roughly in the 1.4 ns/node to
1.6 ns/node range. This means the current static execution path for
synchronous serial DFS is very light. Its overhead is close to directly writing
the necessary static if checks and inline function-call chain by hand. It is
not dynamic scheduler-style overhead; it is closer to structured function calls
after template expansion.
In the same test batch, the HeapStack path is roughly:
RunPlan/Runtime/Chain8/Int/8/OneToOne/HeapStack 40.7 ns
RunPlan/Runtime/Chain32/Int/32/OneToOne/HeapStack 186 ns
RunPlan/Runtime/Balanced15/Int/15/OneToOne/HeapStack 76.8 ns
Converted per node, the HeapStack path is roughly in the 5 ns/node to
6 ns/node range. It is still light, but clearly more expensive than
Recursive, because it needs to maintain runtime frames, a heap stack, node
indices, and a dispatch table. Its main value is avoiding deep recursive call
stack risk, not chasing the lowest possible latency.
In Optimized mode, some Recursive cases can approach:
RunPlan/Optimized/Chain8/Int/8/OneToOne/Recursive 1.59 ns
RunPlan/Optimized/Wide8/Int/9/OneToOne/Recursive 1.59 ns
RunPlan/Optimized/Balanced15/Int/15/OneToOne/Recursive 1.59 ns
This does not mean real business execution necessarily takes only 1.59 ns.
Rather, it means that under ideal, statically analyzable conditions, the compiler
can see through much of YOrch's abstraction and inline or fold away the
Recursive static execution path. That is the meaning of the Optimized
results: they show the optimization ceiling of the zero-cost abstraction.
It is worth noting that Optimized/Chain32/Recursive is still close to
Runtime in the current result:
RunPlan/Optimized/Chain32/Int/32/OneToOne/Recursive 50.6 ns
This shows that the compiler does not eliminate every shape to the same degree. A longer dependency chain may hit inline or optimization thresholds, preserving more of the actual execution path. This difference helps identify which task shapes are easier for the optimizer to fully consume.
Based on the current results, the framework-level noise floor of YOrch is low.
For normal real-time applications, game frame loops, audio/video surrounding
logic, high-frequency business scheduling, and similar scenarios, if the tasks
themselves perform memory access, computation, logging, locking, IO, or other
business logic, the scheduling cost of run_plan(...) usually should not be the
main bottleneck.
If tasks on the hot path are extremely small, for example each task only performs a few integer operations, the framework cost becomes visible. In that case, prefer:
yorch::run_plan<
yorch::slot_layout_one_to_one_policy,
yorch::exec_serial_dfs_recursive_policy
>(plan);That is the default slot layout plus the Recursive execution policy. Consider
HeapStack only when the plan may be very deep and call stack depth risk matters
more than the lowest latency.
For extreme hot paths such as hard real-time code, audio callbacks, trading tick kernels, or tight loops at tens of millions of iterations per second, you should still measure with the real business payload, real task tree shape, and target machine. The current microbenchmark shows that YOrch framework overhead is in a very low range, but it cannot replace workload-specific benchmarks.
On macOS, benchmark runs may print output like:
Unable to determine clock rate from sysctl: hw.cpufrequency
This does not affect benchmark measurements, only the metadata output.
***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
This means Google Benchmark could not read the exact CPU frequency or pin
thread affinity. It does not affect the Time and CPU timing measurements
themselves, but metadata such as Run on (10 X 24 MHz CPU s) should not be
interpreted as the real CPU frequency. For serious comparisons, run multiple
rounds and focus on trends, medians, and stable differences between policies.
