Skip to content

rdsea/ROHE

Repository files navigation

ROHE


Documentation PyPI - Status PyPI - Wheel PyPI - Version PyPI - Python Version Code style: Ruff GitHub Actions Workflow Status License


ROHE is a platform for orchestrating end-to-end machine learning inference pipelines on heterogeneous edge clusters. It provides quality-aware orchestration, runtime observation, and contract-driven SLA enforcement.

Features:


High-level view

ROHE High-level View

Fig. ROHE High-level View

Installation

# Install with uv (recommended)
git clone https://github.com/rdsea/ROHE.git
cd ROHE
uv sync

# Or from PyPI
pip install rohe

Note: Due to the continuous development of the required Python libraries, the installation may have some dependency conflicts.

Structure of the repository

The repository is structured as follows:

ROHE/
├── src/rohe/                 # Core platform
│   ├── api/                  # FastAPI endpoints
│   ├── cli/                  # Typer CLI
│   ├── common/               # Data models, abstractions, utilities
│   ├── export/               # Experiment data export
│   ├── experiment/           # Experiment lifecycle management
│   ├── external/             # External integrations (YOLO models)
│   ├── lib/                  # Deployment utilities
│   ├── messaging/            # Message bus abstractions
│   ├── models/               # Pydantic domain models (ExecutionPlan, contracts, metrics)
│   ├── monitoring/           # rohe-sdk, inference reporter, transport
│   ├── observation/          # Observation agents, metric collection
│   ├── orchestration/        # Inference orchestration (v2, adaptive, DREAM, LLF)
│   ├── quality/              # Quality evaluation (rules, anomaly, LLM diagnosis)
│   ├── registry/             # Service discovery (K8s, HTTP)
│   ├── repositories/         # Data access (MongoDB, Redis)
│   ├── service/              # FastAPI service factories
│   ├── service_registry/     # Consul integration
│   └── storage/              # Storage connectors (MongoDB, MinIO, S3)
├── examples/applications/    # 4 reference applications
│   ├── bts/                  # Building Time Series (4 models)
│   ├── cctvs/                # CCTV Surveillance (5 models)
│   ├── object_classification/# Image Classification (4 models)
│   ├── smart_building/       # Multi-modal Activity Recognition (8 models)
│   └── common/               # Shared service factories
├── experiments/              # Experiment scenarios and analysis
├── deployment/               # Infrastructure (Redis, Grafana)
├── userModule/               # User-extensible algorithms
├── datasets/                 # Experiment data
├── docs/                     # Documentation
└── tests/                    # Unit tests (260+ tests)

Publications

On Optimizing Resources for Real-Time End-to-End Machine Learning in Heterogeneous Edges: Pdf

Implementation:

Note: Other publications reuse most parts of this implementation.

Citation:

@article{nguyen2025optimizing,
  title={On Optimizing Resources for Real-Time End-to-End Machine Learning in Heterogeneous Edges},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh},
  journal={Software: Practice and Experience},
  volume={55},
  number={3},
  pages={541--558},
  year={2025},
  publisher={Wiley Online Library}
}

Novel contract-based runtime explainability framework for end-to-end ensemble machine learning serving: Pdf

  • This publication uses ROHE as the orchestration framework with Observation Service for monitoring and explainability.
  • The core abstraction of ML contract can be found in QoA4ML
  • Example Application: Malware Detection and CCTVS
  • Sample Data

Citation:

@inproceedings{nguyen2024novel,
  title={Novel contract-based runtime explainability framework for end-to-end ensemble machine learning serving},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh and Truong-Huu, Tram},
  booktitle={Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI},
  pages={234--244},
  year={2024}
}

Security orchestration with explainability for digital twins-based smart systems: Pdf

  • This publication also uses ROHE as the orchestration framework with Observation Service for monitoring and explainability.
  • The core abstraction of ML contract can be found in RXOMS
  • Example Application: Security in Digital Twins Network

Citation:

@inproceedings{nguyen2024security,
  title={Security orchestration with explainability for digital twins-based smart systems},
  author={Nguyen, Minh-Tri and Lam, An Ngoc and Nguyen, Phu and Truong, Hong-Linh},
  booktitle={2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)},
  pages={1194--1203},
  year={2024},
  organization={IEEE}
}

Optimizing Multiple Consumer-specific Objectives in End-to-End Ensemble Machine Learning Serving: Pdf

Implementation:

Citation:

@inproceedings{nguyen2024optimizing,
  title={Optimizing Multiple Consumer-specific Objectives in End-to-End Ensemble Machine Learning Serving},
  author={Nguyen, Minh-Tri and Truong, Hong-Linh and Arcaini, Paolo and Ishikawa, Fuyuki},
  booktitle={2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC)},
  pages={103--108},
  year={2024},
  organization={IEEE}
}

IoT Jounal submission (on going)

1. Observation Service

1.1 User Guide

  • Prerequisite: before using Observation Agent, users need:
    • Database service (e.g., MongoDB)
    • Communication service (e.g., AMQP message broker)
    • Container environment (e.g., Docker)
    • Visualization service (e.g., Prometheus, Grafana - optional)
  • Observation Service includes registration service and agent manager. Users can modify Observation Service configurations in $ROHE_PATH/config/observationConfig.yaml. The configuration defines: - Protocols with default configurations for public (connector) and consume (collector) metrics. - Database configuration where metrics and application data/metadata are stored. - Container Image of the Observation Agent - Logging Level (debugging, warning, etc)
  • To deploy Observation Service, use rohe:
$ rohe start observation-service
  • Application Registration
    • Users can register the application using rohe. Application metadata and related configurations will be saved to the Database
    • When register an end-to-end ML application, the users must provide application name (app_name - string), run ID (run_id - string), user ID (user_id - string), and send registration request to the Observation Service via its url.
    • The Observation Service will generate:
    • Application ID: appID
    • Database name: db for saving metric reports in runtime
    • Qoa configuration: qoa_config for reporting metrics

Example

$ rohe observation register-app --app <application_name> --run <run_ID> --user <user_ID> --url <registration_service_url> --output-dir <folder_path_to_save_app_metadata>
  • Then, users must implement QoA probes manually into the application. Probes use this metadata to register with the observation service. The metadata can be extended with information like stage_id microserviceID, method, role, etc. After the registration, the probes will receive communication protocol & configurations to report metrics.

  • While the applications are running, the reported metrics are processed by an Observation Agent. The Agent must be configured with application name, command, stream configuration including: - Processing window: interval, size - Processing module: specify parser and function names to process metric reports. User must define these processing moduled in $ROHE_PATH/userModule (e.g., userModule/common), including metric parser for parsing metric reports and function for window processing.

  • To start the Agent, the user can use rohe:

$ rohe observation start-agent --app <application_name> --conf <path_to_agent_configuration> --url <registration_service_url>
  • The Observation service will start the Agent on a container (e.g., Docker container). Metric processing results from the Agent are saved to files or database or message broker (developing) or Prometheus/Grafana (developing) depending on Agent configuration

  • To stop the Agent, the user can also use rohe:

$ rohe observation stop-agent --app <application_name> --conf <path_to_agent_configuration> --url <registration_service_url>
  • To delete/unregister an application using rohe:
$ rohe observation delete-app --app <application_name> --url <registration_service_url>

1.2 Development Guide

1.2.1 Registration Service

  • This service allows users to register and unregister applications. It is served by the FastAPI application in src/rohe/service/observation_service_fastapi.py, backed by routers under src/rohe/api/routes/.
  • Currently this service supports MongoDB as database and AMQP as communication protocol. The service will also support other communication protocols and databases

1.2.2 Observation Agent

  • Agents are currently deployed on the local docker environment via src/rohe/observation/analysis/containerized_agent/.
  • Remote deployment is supported via K8s manifests in each application's k8s/ directory.

2. Orchestration Service

2.1 User Guide

  • Prerequisite: before using Orchestration Service, users need:
    • Database service (e.g., MongoDB)
  • The Orchestration Service allocate service instances on edge nodes base on a specific orchestration algorithm (currently using scoring algorithm). Users can modify Orchestration Service configurations in $ROHE_PATH/config/orchestrationConfig.yaml. The configuration defines: - Database configuration where metrics and application data/metadata are stored. - Service queue priority - Orchestration algorithm
  • To deploy Orchestration Service, use rohe.
$ rohe start orchestration-service
  • Add nodes to the orchestration system
    • add-node-from-file takes a positional node-configuration file and an --url option pointing to the management endpoint.
    • A template for the configuration is in $ROHE_PATH/templates/orchestration_command/add_node.yaml.

Example

$ rohe orchestration add-node-from-file <configuration_path> --url <orchestration_service_url>
  • Add service to the orchestration system
    • add-service-from-file takes a positional service-configuration file and an --url option.
    • Template at $ROHE_PATH/templates/orchestration_command/add_service.yaml.

Example

$ rohe orchestration add-service-from-file <configuration_path> --url <orchestration_service_url>
  • List nodes / services registered with the orchestration system
$ rohe orchestration get-nodes --url <orchestration_service_url>
$ rohe orchestration get-services --url <orchestration_service_url>
  • Remove all nodes / services from the orchestration system
$ rohe orchestration remove-all-nodes --url <orchestration_service_url>
$ rohe orchestration remove-all-services --url <orchestration_service_url>
  • Start / stop the Orchestration Agent
    • The agent continuously checks the service queue for services waiting to be allocated, and places them on available nodes.
$ rohe orchestration start-agent --url <orchestration_service_url>
$ rohe orchestration stop-agent --url <orchestration_service_url>

2.2 Development Guide

2.2.1 Resource Management

The module provides the abstract class/object to manage the infrastructure resource by Node; application by Deployment; network routine by Service; and environment variable by ConfigMap.

  • Node: physical node
  • Deployment: each application has multiple microservices. Each microservice has its own Deployment setup specify: image, resource requirement, replicas, etc
  • Microservice: each microservice is advertised with a microservice name within K3s network so that other microservices can communicate with it.
  • ConfigMap: provide initial environment variable for docker containers of each deployment when starting.
  • resource: provide abstract, high-level class to manage resources (Microservice Queue and Node Collection).

2.2.2 Deployment Management

  • Provide utilities for generating deployment files from template ($ROHE_PATH/templates/deployment_template.yaml)
  • Deploy microservices, pod based on generated deployment files
  • See orchestration/allocation/algorithms/ for pluggable algorithm implementations.

2.2.3 Algorithm

This module provides functions to select resources to allocate microservices.

Current implementation: Scoring Algorithm

  • Input:

    • Microservice from a microservice Queue (queue of microservice need to be allocated), each microservice in the queue include
      • the number of instances (replicas/scales)
      • CPU requirement (array of CPU requirements on every CPU core). Example: [100,50,50,50] - the microservice use 4 CPUs with 100, 50, 50, and 50 millicpu on each core respectively.
      • Memory requirement (rss, vms - MByte)
      • Accelerator requirement (GPU - %)
      • Sensitivity: 0 - Not sensitive; 1 - CPU sensitive; 2 - Memory sensitive; 3 CPU & Memory sensitive
      • Other metadata: microservice name, ID, status, node (existing deployment), running (existing running instance), container image, ports configuration

    Example:

    {
        "EW:VE:TW:WQ:01":{
            "microservice_name":"object_detection_web_service",
            "node": {},
            "status": "queueing",
            "instance_ids": [],
            "running": 0,
            "image": "rdsea/od_web:2.0",
            "ports": [4002],
            "port_mapping": [{
                "con_port": 4002,
                "phy_port": 4002
            },{
                "con_port": 4003,
                "phy_port": 4003
            }],
            "cpu": 550,
            "accelerator": {
                "gpu": 0
            },
            "memory": {
                "rss": 200,
                "vms": 500
            },
            "processor": [500,50],
            "sensitivity": 0,
            "replicas": 2
    }
    • Node Collection: list of available nodes for allocating microservices each node includes information of capacity and used resources:
      • CPU (millicpu)
      • Memory (rss, vms - MByte)
      • Accelerator (GPU - %)

    Example

    "node1":{
        "node_name":"RaspberryPi_01",
        "MAC":"82:ae:30:11:38:01",
        "status": "running",
        "frequency": 1.5,
        "accelerator":{},
        "cpu": {
            "capacity": 4000,
            "used": 0
        },
        "memory": {
            "capacity": {
                "rss": 4096,
                "vms": 4096
            },
            "used": {
                "rss": 0,
                "vms": 0
            }
        },
        "processor": {
            "capacity": [1000,1000,1000,1000],
            "used": [0,0,0,0]
        }
    }, ...

Workflow of Scoring Algorithm: Scoring Workflow

  • Updating Microservice Queue
  • Filtering Nodes from the Node Collection
  • Scoring filtered node
  • Selecting node based on the score, applying different strategies: first/best/worst-fit

Available orchestration algorithms (all selectable via create_orchestrator(algorithm=…)):

  • v2: Async production orchestrator with ExecutionPlan and DataHub (requires a ServiceRegistry)
  • adaptive: Original multimodal orchestrator
  • dream: DREAM deadline-aware allocation
  • llf: Least-laxity-first scheduling

Example applications can be deployed on K8s:

bash examples/applications/bts/scripts/build.sh rdsea 0.0.1 true
bash examples/applications/bts/scripts/deploy.sh --local --load-images

Authors/Contributors

  • Minh-Tri Nguyen
  • Hong-Linh Truong
  • Vuong Nguyen
  • Anh-Dung Nguyen

License

Apache License

About

An orchestration framework for End-to-End Machine Learning Serving with Resource Optimization on Heterogeneous Edge

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages