Commit bc0245b

[Feat] Support using same monitor instance and regist python stats
1 parent 523bbc4 commit bc0245b

18 files changed

+503
-181
lines changed

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# Custom Metrics
2+
UCM supports custom metrics that can be updated from both the Python and C++ runtimes. A unified monitoring interface lets either runtime update stats through a shared metrics registry.
3+
4+
## Architecture Overview
5+
The metrics system consists of the following components:
6+
- **metrics** : Central stats registry that manages all metric lifecycle operations (registration, creation, updates, queries)
7+
- **observability.py** : Prometheus integration layer that handles metric exposition and multiprocess collection
8+
- **metrics_config.yaml** : Declarative configuration that defines which custom metrics to register and their properties
9+
10+
## Getting Started
11+
### Define Metrics in YAML Configuration
12+
Prometheus defines four core metric types: Counter, Gauge, Histogram, and Summary. UCM implements wrappers for Counter, Gauge, and Histogram. Declare new metrics in the YAML configuration as shown below; see the [example YAML](https://github.com/ModelEngine-Group/unified-cache-management/blob/develop/examples/metrics/metrics_configs.yaml) for more detail.
13+
```yaml
14+
log_interval: 5 # Interval in seconds for logging metrics
15+
16+
prometheus:
17+
multiproc_dir: "/vllm-workspace" # Directory for Prometheus multiprocess mode
18+
19+
metric_prefix: "ucm:"
20+
21+
# Enable/disable metrics by category
22+
enabled_metrics:
23+
counters: true
24+
gauges: true
25+
histograms: true
26+
27+
# Counter metrics configuration
28+
counters:
29+
- name: "received_requests"
30+
documentation: "Total number of requests sent to ucm"
31+
32+
# Gauge metrics configuration
33+
gauges:
34+
- name: "lookup_hit_rate"
35+
documentation: "Hit rate of ucm lookup requests"
36+
multiprocess_mode: "livemostrecent"
37+
38+
# Histogram metrics configuration
39+
histograms:
40+
- name: "load_requests_num"
41+
documentation: "Number of requests loaded from ucm"
42+
buckets: [1, 5, 10, 20, 50, 100, 200, 500, 1000]
43+
```
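As a rough illustration of how a loader might interpret this configuration, the sketch below takes the relevant keys as a plain Python dict and lists the metric names that would be registered. The helper `enabled_metric_names`, the flat key layout, and the assumption that `metric_prefix` is prepended verbatim to each `name` are all hypothetical; the actual loader in `observability.py` may behave differently.

```python
from typing import Any, Dict, List

# Hypothetical in-memory form of the YAML above (key nesting is assumed flat).
config: Dict[str, Any] = {
    "metric_prefix": "ucm:",
    "enabled_metrics": {"counters": True, "gauges": True, "histograms": True},
    "counters": [{"name": "received_requests"}],
    "gauges": [{"name": "lookup_hit_rate", "multiprocess_mode": "livemostrecent"}],
    "histograms": [
        {"name": "load_requests_num", "buckets": [1, 5, 10, 20, 50, 100, 200, 500, 1000]}
    ],
}


def enabled_metric_names(cfg: Dict[str, Any]) -> List[str]:
    """Return prefixed names for every metric whose category is enabled."""
    prefix = cfg.get("metric_prefix", "")
    enabled = cfg.get("enabled_metrics", {})
    names: List[str] = []
    for category in ("counters", "gauges", "histograms"):
        if not enabled.get(category, False):
            continue  # the whole category is switched off
        for spec in cfg.get(category, []):
            names.append(f"{prefix}{spec['name']}")
    return names


names = enabled_metric_names(config)
print(names)
```

Setting a category to `false` under `enabled_metrics` would, under these assumptions, drop every metric in that category without removing its definition from the file.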
44+
45+
### Use Monitor APIs to Update Stats
46+
The monitor provides a unified interface for metric operations. Note that the workflow requires registering a stats class before creating an instance.
47+
:::::{tab-set}
48+
:sync-group: install
49+
50+
::::{tab-item} Python side interfaces
51+
:selected:
52+
:sync: py
53+
**Lifecycle Methods**
54+
- `register_stats(name, stats_obj)`: Register a new stats class implementation.
55+
- `create_stats(name)`: Create and initialize a registered stats object.
56+
57+
**Operation Methods**
58+
- `update_stats(name, dict)`: Update specific fields of a specific stats object.
59+
- `get_stats(name)`: Retrieve current values of a specific stats object.
60+
- `get_stats_and_clear(name)`: Retrieve and reset a specific stats object.
61+
- `get_all_stats_and_clear()`: Retrieve and reset all stats objects.
62+
- `reset_stats(name)`: Reset a specific stats object to initial state.
63+
- `reset_all()`: Reset all stats registered in monitor.
64+
65+
**Example:** Using built-in ConnStats
66+
```python
67+
from ucm.integration.vllm.conn_stats import ConnStats
68+
from ucm.shared.metrics import ucmmonitor
69+
70+
conn_stats = ConnStats(name="ConnStats")
71+
ucmmonitor.register_stats("ConnStats", conn_stats) # Register stats
72+
ucmmonitor.create_stats("ConnStats") # Create a stats obj
73+
74+
# Update stats (external_hit_blocks and ucm_block_ids come from the connector's lookup step)
75+
ucmmonitor.update_stats(
76+
"ConnStats",
77+
{"interval_lookup_hit_rates": external_hit_blocks / len(ucm_block_ids)},
78+
)
79+
80+
```
81+
See the [test case](https://github.com/ModelEngine-Group/unified-cache-management/tree/develop/ucm/shared/test/example) for a more detailed example.
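The register, create, update, and get-and-clear sequence can be tried without building the native module. The sketch below uses `InMemoryMonitor`, a hypothetical pure-Python stand-in that is not the real `ucmmonitor` binding; it only mimics the ordering rule (register before create) and the get-and-clear behavior described above. `ConnStats` mirrors the class added in this commit.

```python
from typing import Any, Dict


class ConnStats:
    """Mirrors ucm/integration/vllm/conn_stats.py from this commit."""

    def __init__(self, name: str = "ConnStats"):
        self._name = name
        self._data: Dict[str, list] = {}

    def Name(self) -> str:
        return self._name

    def Update(self, params: Dict[str, Any]) -> None:
        for k, v in params.items():
            self._data.setdefault(k, []).append(v)

    def Reset(self) -> None:
        self._data.clear()

    def Data(self) -> Dict[str, list]:
        return self._data.copy()


class InMemoryMonitor:
    """Hypothetical stand-in for ucmmonitor; NOT the real binding."""

    def __init__(self) -> None:
        self._registered: Dict[str, Any] = {}
        self._active: Dict[str, Any] = {}

    def register_stats(self, name: str, stats: Any) -> None:
        self._registered[name] = stats

    def create_stats(self, name: str) -> None:
        if name not in self._registered:
            raise KeyError(f"stats {name!r} must be registered before creation")
        self._active[name] = self._registered[name]

    def update_stats(self, name: str, params: Dict[str, Any]) -> None:
        self._active[name].Update(params)

    def get_stats(self, name: str) -> Dict[str, list]:
        return self._active[name].Data()

    def get_stats_and_clear(self, name: str) -> Dict[str, list]:
        data = self._active[name].Data()  # snapshot first
        self._active[name].Reset()        # then reset the live object
        return data


monitor = InMemoryMonitor()
monitor.register_stats("ConnStats", ConnStats(name="ConnStats"))
monitor.create_stats("ConnStats")
monitor.update_stats("ConnStats", {"interval_lookup_hit_rates": 0.5})
monitor.update_stats("ConnStats", {"interval_lookup_hit_rates": 0.75})

snapshot = monitor.get_stats_and_clear("ConnStats")
remaining = monitor.get_stats("ConnStats")
print(snapshot, remaining)
```

With this stand-in, the snapshot holds both recorded values and a later read returns an empty dict, which is the behavior a periodic logger relies on to avoid reporting the same samples twice.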
82+
83+
::::
84+
85+
::::{tab-item} C++ side interfaces
86+
:sync: cc
87+
**Lifecycle Methods**
88+
- `RegistStats(std::string name, Creator creator)`: Register a new stats class implementation.
89+
- `CreateStats(const std::string& name)`: Create and initialize a registered stats object.
90+
91+
**Operation Methods**
92+
- `UpdateStats(const std::string& name, const std::unordered_map<std::string, double>& params)`: Update specific fields of a specific stats object.
93+
- `ResetStats(const std::string& name)`: Reset a specific stats object to initial state.
94+
- `ResetAllStats()`: Reset all stats registered in monitor.
95+
- `GetStats(const std::string& name)`: Retrieve current values of a specific stats object.
96+
- `GetStatsAndClear(const std::string& name)`: Retrieve and reset a specific stats object.
97+
- `GetAllStatsAndClear()`: Retrieve and reset all stats objects.
98+
99+
**Example:** Implementing custom stats in C++
100+
To add custom stats in C++, follow these steps:
101+
- Step 1: Link the static library `monitor_static`
102+
```cmake
103+
target_link_libraries(xxxstore PUBLIC storeinfra monitor_static)
104+
```
105+
- Step 2: Inherit from the `IStats` class to implement a custom stats class
106+
- Step 3: Register the stats class with the monitor
107+
- Step 4: Create the stats object in the monitor
108+
- Step 5: Update or retrieve stats using the operation methods above
109+
110+
See the [test case](https://github.com/ModelEngine-Group/unified-cache-management/tree/develop/ucm/shared/test/case) for a more detailed example.
111+
112+
::::
113+
:::::

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ user-guide/metrics/metrics
6363
:caption: Developer Guide
6464
:maxdepth: 1
6565
developer-guide/contribute
66+
developer-guide/add_metrics
6667
:::
6768

6869
:::{toctree}

ucm/integration/vllm/conn_stats.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
from ucm.shared.metrics import ucmmonitor
2+
3+
4+
class ConnStats:
5+
def __init__(self, name: str = "PyStats1"):
6+
self._name = name
7+
self._data = {}
8+
9+
def Name(self) -> str:
10+
return self._name
11+
12+
def Update(self, params):
13+
for k, v in params.items():
14+
self._data.setdefault(k, []).append(v)
15+
16+
def Reset(self):
17+
self._data.clear()
18+
19+
def Data(self):
20+
return self._data.copy()
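One detail of this class worth checking is that `Data()` returns a shallow copy: the returned dict is new, but the lists inside it are the same objects held by the instance. The sketch below reproduces the class as added (without the unused `ucmmonitor` import) and exercises that behavior; it makes no claim about how the native monitor consumes the result.

```python
class ConnStats:
    """Same body as the class added in this commit."""

    def __init__(self, name: str = "PyStats1"):
        self._name = name
        self._data = {}

    def Name(self) -> str:
        return self._name

    def Update(self, params):
        for k, v in params.items():
            self._data.setdefault(k, []).append(v)

    def Reset(self):
        self._data.clear()

    def Data(self):
        return self._data.copy()


stats = ConnStats(name="ConnStats")
stats.Update({"load_requests_num": 3})

snapshot = stats.Data()
snapshot["new_key"] = [1]                # top-level change: not visible to stats
snapshot["load_requests_num"].append(9)  # nested change: the list is shared

has_new_key = "new_key" in stats.Data()
shared_list = stats.Data()["load_requests_num"]

stats.Reset()                            # clears the internal dict only
after_reset = stats.Data()
print(has_new_key, shared_list, after_reset, snapshot)
```

Because `Reset()` clears the internal dict rather than the lists, a snapshot taken before the reset keeps its values, which suits a get-then-clear pattern. Callers that mutate the returned lists before a reset would, however, also change the live stats.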

ucm/integration/vllm/ucm_connector.py

Lines changed: 8 additions & 5 deletions
@@ -17,9 +17,10 @@
1717
from vllm.platforms import current_platform
1818
from vllm.v1.core.sched.output import SchedulerOutput
1919

20+
from ucm.integration.vllm.conn_stats import ConnStats
2021
from ucm.logger import init_logger
22+
from ucm.observability import UCMStatsLogger
2123
from ucm.shared.metrics import ucmmonitor
22-
from ucm.shared.metrics.observability import UCMStatsLogger
2324
from ucm.store.factory import UcmConnectorFactory
2425
from ucm.store.ucmstore import Task, UcmKVStoreBase
2526
from ucm.utils import Config
@@ -172,12 +173,14 @@ def __init__(self, vllm_config: "VllmConfig", role: KVConnectorRole):
172173

173174
self.metrics_config = self.launch_config.get("metrics_config_path", "")
174175
if self.metrics_config:
176+
conn_stats = ConnStats(name="ConnStats")
177+
ucmmonitor.register_stats("ConnStats", conn_stats)
178+
ucmmonitor.create_stats("ConnStats")
175179
self.stats_logger = UCMStatsLogger(
176180
vllm_config.model_config.served_model_name,
177181
self.global_rank,
178182
self.metrics_config,
179183
)
180-
self.monitor = ucmmonitor.StatsMonitor.get_instance()
181184

182185
self.synchronize = (
183186
torch.cuda.synchronize
@@ -236,7 +239,7 @@ def get_num_new_matched_tokens(
236239
f"hit external: {external_hit_blocks}"
237240
)
238241
if self.metrics_config:
239-
self.monitor.update_stats(
242+
ucmmonitor.update_stats(
240243
"ConnStats",
241244
{"interval_lookup_hit_rates": external_hit_blocks / len(ucm_block_ids)},
242245
)
@@ -532,7 +535,7 @@ def start_load_kv(self, forward_context: "ForwardContext", **kwargs) -> None:
532535
/ 1024
533536
) # GB/s
534537
if self.metrics_config and is_load:
535-
self.monitor.update_stats(
538+
ucmmonitor.update_stats(
536539
"ConnStats",
537540
{
538541
"load_requests_num": num_loaded_request,
@@ -622,7 +625,7 @@ def wait_for_save(self) -> None:
622625
/ 1024
623626
) # GB/s
624627
if self.metrics_config and is_save:
625-
self.monitor.update_stats(
628+
ucmmonitor.update_stats(
626629
"ConnStats",
627630
{
628631
"save_requests_num": num_saved_request,
Lines changed: 4 additions & 6 deletions
@@ -195,7 +195,7 @@ def log_prometheus(self, stats: Any):
195195
try:
196196
metric_mapped = self.metric_mappings[stat_name]
197197
if metric_mapped is None:
198-
logger.warning(f"Stat {stat_name} not initialized.")
198+
logger.debug(f"Stat {stat_name} not initialized.")
199199
continue
200200
metric_obj = getattr(self, metric_mapped["attr"], None)
201201
metric_type = metric_mapped["type"]
@@ -213,8 +213,8 @@ def log_prometheus(self, stats: Any):
213213
else:
214214
value = []
215215
self._log_histogram(metric_obj, value)
216-
except Exception as e:
217-
logger.warning(f"Failed to log metric {stat_name}: {e}")
216+
except Exception:
217+
logger.debug(f"Failed to log metric {stat_name}")
218218

219219
@staticmethod
220220
def _metadata_to_labels(metadata: UCMEngineMetadata):
@@ -267,8 +267,6 @@ def __init__(self, model_name: str, rank: int, config_path: str = ""):
267267
# Load configuration
268268
config = self._load_config(config_path)
269269
self.log_interval = config.get("log_interval", 10)
270-
271-
self.monitor = ucmmonitor.StatsMonitor.get_instance()
272270
self.prometheus_logger = PrometheusLogger.GetOrCreate(self.metadata, config)
273271
self.is_running = True
274272

@@ -296,7 +294,7 @@ def _load_config(self, config_path: str) -> Dict[str, Any]:
296294
def log_worker(self):
297295
while self.is_running:
298296
# Use UCMStatsMonitor.get_states_and_clear() from external import
299-
stats = self.monitor.get_stats_and_clear("ConnStats")
297+
stats = ucmmonitor.get_all_stats_and_clear().data
300298
self.prometheus_logger.log_prometheus(stats)
301299
time.sleep(self.log_interval)
302300

ucm/shared/metrics/CMakeLists.txt

Lines changed: 25 additions & 8 deletions
@@ -1,15 +1,32 @@
1-
file(GLOB_RECURSE CORE_SRCS CONFIGURE_DEPENDS
2-
"${CMAKE_CURRENT_SOURCE_DIR}/cc/stats/*.cc"
3-
"${CMAKE_CURRENT_SOURCE_DIR}/cc/*.cc")
4-
add_library(monitor_static STATIC ${CORE_SRCS})
1+
file(GLOB_RECURSE DOMAIN_SRCS
2+
"${CMAKE_CURRENT_SOURCE_DIR}/cc/domain/*.cc"
3+
"${CMAKE_CURRENT_SOURCE_DIR}/cc/domain/stats/*.cc"
4+
)
5+
6+
file(GLOB_RECURSE API_SRCS
7+
"${CMAKE_CURRENT_SOURCE_DIR}/cc/api/*.cc"
8+
)
9+
10+
add_library(monitor_static STATIC
11+
${DOMAIN_SRCS}
12+
${API_SRCS}
13+
)
14+
515
set_property(TARGET monitor_static PROPERTY POSITION_INDEPENDENT_CODE ON)
16+
617
target_include_directories(monitor_static PUBLIC
7-
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/cc>
8-
$<INSTALL_INTERFACE:include>)
18+
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/cc/domain>
19+
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/cc/api>
20+
$<INSTALL_INTERFACE:include>
21+
)
22+
923
set_target_properties(monitor_static PROPERTIES OUTPUT_NAME monitor)
1024

1125
file(GLOB_RECURSE BINDINGS_SRCS CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/cpy/*.cc")
1226
pybind11_add_module(ucmmonitor ${BINDINGS_SRCS})
13-
target_link_libraries(ucmmonitor PRIVATE -Wl,--whole-archive monitor_static -Wl,--no-whole-archive)
14-
target_include_directories(ucmmonitor PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/cc)
27+
target_link_libraries(ucmmonitor PRIVATE monitor_static)
28+
29+
target_include_directories(ucmmonitor PRIVATE
30+
"${CMAKE_CURRENT_SOURCE_DIR}/cc/api"
31+
)
1532
set_target_properties(ucmmonitor PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
/**
2+
* MIT License
3+
*
4+
* Copyright (c) 2025 Huawei Technologies Co., Ltd. All rights reserved.
5+
*
6+
* Permission is hereby granted, free of charge, to any person obtaining a copy
7+
* of this software and associated documentation files (the "Software"), to deal
8+
* in the Software without restriction, including without limitation the rights
9+
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10+
* copies of the Software, and to permit persons to whom the Software is
11+
* furnished to do so, subject to the following conditions:
12+
*
13+
* The above copyright notice and this permission notice shall be included in all
14+
* copies or substantial portions of the Software.
15+
*
16+
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17+
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18+
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19+
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20+
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21+
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22+
* SOFTWARE.
23+
* */
24+
#include "stats_monitor_api.h"
25+
namespace UC::Metrics {
26+
27+
void RegistStats(std::string name, Creator creator)
28+
{
29+
StatsRegistry::GetInstance().RegisterStats(name, creator);
30+
}
31+
32+
void CreateStats(const std::string& name) { StatsMonitor::GetInstance().CreateStats(name); }
33+
34+
void UpdateStats(const std::string& name, const std::unordered_map<std::string, double>& params)
35+
{
36+
StatsMonitor::GetInstance().UpdateStats(name, params);
37+
}
38+
39+
void ResetStats(const std::string& name) { StatsMonitor::GetInstance().ResetStats(name); }
40+
41+
void ResetAllStats() { StatsMonitor::GetInstance().ResetAllStats(); }
42+
43+
StatsResult GetStats(const std::string& name)
44+
{
45+
StatsResult result;
46+
result.data = StatsMonitor::GetInstance().GetStats(name);
47+
return result;
48+
}
49+
50+
StatsResult GetStatsAndClear(const std::string& name)
51+
{
52+
StatsResult result;
53+
result.data = StatsMonitor::GetInstance().GetStatsAndClear(name);
54+
return result;
55+
}
56+
57+
StatsResult GetAllStatsAndClear()
58+
{
59+
StatsResult result;
60+
result.data = StatsMonitor::GetInstance().GetAllStatsAndClear();
61+
return result;
62+
}
63+
64+
} // namespace UC::Metrics
