
Dev report #1 (Open)

ZhiruiLiang wants to merge 43 commits into `master` from `dev_report`

Conversation

@ZhiruiLiang (Collaborator)

Include all the latest example scripts and results for report writing.

jaywonchung and others added 7 commits March 4, 2026 21:55
…Add functions to sweep ofo parameters, sweep DC locations, find DC hosting capacity, and optimize PV locations and capacities. Add IEEE 13, 34, 123 test feeders and example scripts. Include simulation outputs for IEEE 13, 34, 123 under multiple scenarios.
@jaywonchung (Member)

Please remove all output files that can be generated from running the examples. Data files do not belong in the code repository except when it's necessary for examples to run.

@jaywonchung (Member) commented Mar 18, 2026

Also, please update the code so that formatting, code checking, and testing pass. You can run them with `bash scripts/lint.sh` and `pytest`.

In general, I would back up a copy of the directory elsewhere so that you don't lose any files, and then ask Claude Code to clean up the branch so that it's suitable for a PR.

@jaywonchung (Member) left a comment

Thank you for your work. This PR will mark a substantial advancement in what OpenG2G can do!

There are many structural issues -- please take your time to address them, and please let me know if anything is unclear.

Member

  1. I think you'll need to update `_zensical.toml` to include these example guides as a new section -- basically "Getting Started," "Guide," "Examples," and "API Reference," in order.
  2. Please remove the number prefix at the beginning of each example description Markdown file. These numbers show up in the documentation website's URLs and signal that there is an ordering of examples, but in fact I don't think the examples have any particularly natural ordering.
  3. Please preview the documentation in your browser after running `bash scripts/docs.sh serve` and check that everything renders properly (e.g., no Markdown issues, all links working).

Member

I think this should be removed for now. We never tested it.


- **`target`**: Fraction of initial servers to activate. `0.5` = half the servers; `1.0` = all servers (default); `1.5` = 50% more servers than `initial_num_replicas`.
- **`model`**: When set, the ramp applies only to that model. When `None` (default), it applies to all models in the datacenter.
- **Scale-up** (target > 1.0): The datacenter pre-allocates extra servers at construction time based on the peak target in the schedule. At `t=0`, only the initial server count is active; the extra servers activate when the fraction exceeds 1.0.
Member

I was reading this and I think this API is weird. 1.0 (meaning 100%) should be a fixed ceiling; I don't think we want a constantly moving ceiling. For that, I think `InferenceRamp(target=0.5).at(t=0)` (with composition via `|`) is the right API. This way, we're specifying (time, target load) points across the whole timeline and we don't have to do any dynamic ceiling adjustment at all. Please refactor `InferenceRamp`.
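A minimal sketch of what that composition API could look like. Only `InferenceRamp`, `.at(t=...)`, and composition with `|` come from the comment above; all internals here are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class InferenceRamp:
    """Hypothetical sketch: each ramp pins a target load fraction
    (at most 1.0, a fixed ceiling) at a timestep, and ramps compose
    with `|` into one piecewise-constant schedule."""

    target: float
    time: int = 0
    points: dict[int, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 0.0 <= self.target <= 1.0:
            raise ValueError("target must be in [0, 1]; 1.0 is the fixed ceiling")
        self.points.setdefault(self.time, self.target)

    def at(self, t: int) -> "InferenceRamp":
        # Anchor this ramp's target at timestep t.
        return InferenceRamp(self.target, t, {t: self.target})

    def __or__(self, other: "InferenceRamp") -> "InferenceRamp":
        # Merge the (time, target) points of both ramps into one schedule.
        merged = {**self.points, **other.points}
        first = min(merged)
        return InferenceRamp(merged[first], first, merged)

    def target_at(self, t: int) -> float:
        # Latest scheduled point at or before t.
        times = [p for p in self.points if p <= t]
        return self.points[max(times)] if times else self.target


schedule = InferenceRamp(target=0.5).at(t=0) | InferenceRamp(target=1.0).at(t=60)
print(schedule.target_at(0))   # 0.5
print(schedule.target_at(90))  # 1.0
```

The point is that the schedule is just a set of (time, target) pairs under a fixed 1.0 ceiling, so no dynamic ceiling adjustment is ever needed.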

Collaborator Author

It could be misleading since I changed the original `num_replicas` to `initial_num_replicas`, which is just the number of active replicas at the beginning of the simulation, not the ceiling of available servers. So targets above 1.0 represent normal load fluctuation (e.g., demand increases), not a special "scale-up" mode. I have revised the documents and added a note explaining: "When a ramp target exceeds 1.0, additional servers are allocated to accommodate the extra replicas. Users must ensure that the datacenter's `total_gpu_capacity` is sufficient for the peak demand across all models."

Updated four docs files:

  • building-simulators.md: Clarified that `initial_num_replicas` is the starting active count (not a ceiling) and that targets above 1.0 are normal load fluctuation; renamed the "Scale-up" code comment. Added a GPU capacity planning warning.
  • data-pipeline.md: Replaced "scale beyond" with the neutral "activate additional replicas beyond this count".
  • concepts.md: Removed the "(ramp-down and scale-up)" parenthetical that implied asymmetric behavior.
  • building-simulators.md (OfflineDatacenter section): Replaced "support scale-up beyond" with the neutral "support targets above 1.0".
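For illustration, the check that the added note asks users to perform could be sketched as follows. Apart from `total_gpu_capacity` and `initial_num_replicas`, which appear above, all names here are hypothetical and not the real config schema:

```python
def check_gpu_capacity(total_gpu_capacity: float, models: list[dict]) -> float:
    """Sketch of the capacity check the note describes: peak replica
    demand across all models must fit in the datacenter's GPU capacity.
    Field names are illustrative, not the real config schema."""
    peak_gpus = sum(
        m["initial_num_replicas"] * max(m["ramp_targets"]) * m["gpus_per_replica"]
        for m in models
    )
    if peak_gpus > total_gpu_capacity:
        raise ValueError(
            f"peak demand {peak_gpus:.0f} GPUs exceeds capacity {total_gpu_capacity}"
        )
    return peak_gpus


models = [
    {"initial_num_replicas": 10, "ramp_targets": [0.5, 1.0, 1.5], "gpus_per_replica": 8},
    {"initial_num_replicas": 4, "ramp_targets": [1.0], "gpus_per_replica": 8},
]
print(check_gpu_capacity(total_gpu_capacity=160, models=models))  # 152.0
```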

Member

Um, I actually don't see the changed code/docs. Perhaps you haven't pushed your changes to GitHub. So, just from reading the text, my reaction is:

My original design of `num_replicas` plus a floating point number to express the fraction of active servers wasn't very good in the first place. It should rather have been:

  • In datacenter config, the site's max num servers is defined.
  • In inference model spec, num replicas (or initial num replicas) do NOT exist
  • The classes that define inference workload (the inference ramp class) holds the number of replicas active at timestep t.

This way, when someone looks at an inference ramp class, they don't even have to wonder what 0.9 or 1.5 means. They don't have to look up the model spec's `num_replicas` and multiply by the fraction. And we don't have to fight over what 1.0 means either. During simulation, the datacenter can also easily reason about whether it ended up with more active servers than its max number of servers at each timestep it runs.

I will do this refactor later. Well, of course I won't object if you would do it 😋 but please feel free to resolve this comment as you see fit.
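A rough sketch of that design, with illustrative names (the real config classes differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """The datacenter config owns the hard ceiling (hypothetical name)."""
    max_num_servers: int


@dataclass(frozen=True)
class ReplicaSchedule:
    """The workload class holds absolute replica counts per timestep,
    so readers never multiply a fraction by a model-spec field."""
    replicas_at: dict[int, int]

    def active_at(self, t: int) -> int:
        # Piecewise-constant: latest scheduled point at or before t.
        times = [p for p in self.replicas_at if p <= t]
        return self.replicas_at[max(times)] if times else 0


def validate(site: SiteConfig, sched: ReplicaSchedule, horizon: int) -> None:
    # The datacenter can reason directly about over-allocation per timestep.
    for t in range(horizon):
        active = sched.active_at(t)
        if active > site.max_num_servers:
            raise ValueError(
                f"t={t}: {active} replicas exceed {site.max_num_servers} servers"
            )


site = SiteConfig(max_num_servers=8)
sched = ReplicaSchedule({0: 4, 10: 8})
validate(site, sched, horizon=20)
print(sched.active_at(15))  # 8
```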

```
Attributes:
    dc_states: Every datacenter state produced by the datacenter (flat list, all sites).
    dc_states_by_site: Per-site datacenter states (multi-DC mode).
```
Member

Multiple DC should be a generalization of the original single DC. That is, single DC mode should be programmatically equivalent to multiple DC mode, except that there is only one DC site. Not touching the original single DC code and data path and slapping on multi-DC code and data on top of it is poor design.

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Coordinator refactoring — unified single/multi-DC code path:

  1. Removed the `_single_dc_mode` flag: single DC is now just `datacenters={_DEFAULT_SITE: dc}`, with no branching
  2. Removed the `self.datacenter` legacy property, replaced with a `_resolve_dc(site_id)` helper that looks up by site ID or falls back to the first site
  3. Grid always receives a dict: eliminated the `if _single_dc_mode: pass flat list else: pass dict` branch in the run loop
  4. `record_datacenter` always gets a `site_id`: `dc_states_by_site` is always populated, not just in multi-DC mode
  5. Added a `datacenters` property for external access to the site dict
  6. Simplified command routing: uses `_resolve_dc(target_site_id)` instead of `if target_site and target_site in self._datacenters ... else self.datacenter`

Member

I also don't see the code, but is this basically looking up the datacenter object based on a string `site_id` inside the coordinator?

Ideally I want to avoid hardcoding site IDs inside the coordinator's source code. What if someone wants to name their three datacenter sites, "Alice," "Bob," and "Eve"? What if another person wants "one," "two," and "three"? The coordinator's source code shouldn't change due to datacenter site renaming, and ideally also not the controller source code. I'll look more into the code when it's pushed.

Comment on lines +448 to +466
```python
# For 3-phase regulators without explicit phase suffix,
# try to infer phase from the regcontrol name (e.g. creg1a -> A=1)
if phase not in (1, 2, 3):
    name_lower = rc_name.lower()
    if name_lower.endswith("a"):
        phase = 1
    elif name_lower.endswith("b"):
        phase = 2
    elif name_lower.endswith("c"):
        phase = 3
    elif n_phases == 3:
        # 3-phase ganged regulator — assign to phase 1 as primary
        phase = 1
        logger.info(
            "RegControl '%s' is 3-phase (buses=%s); treating as phase A.",
            rc_name,
            bus_names,
        )
```

Member

Matching based on arbitrary string name patterns is highly error prone. Please make it type safe by generalizing the original pattern of having fields a, b, and c.

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Regulator naming cleanup

Problem: The code inferred regulator phase from name suffixes (e.g., endswith("a") → phase A) — a brittle heuristic scattered across opendss.py and example scripts. IEEE13 used generic names (Reg1, Reg2, Reg3) that didn't follow any convention, requiring special-case fallbacks like rn == "reg1" throughout.

DSS file changes:

  • ieee13: Renamed regulators from Reg1/Reg2/Reg3 → reg1a/reg1b/reg1c (transformers) and creg1a/creg1b/creg1c (regcontrols), matching the creg{bank}{phase} convention already used by ieee34 and ieee123
  • ieee123: Added explicit phase suffix to the 3-phase ganged regulator bus ([150 150r] → [150.1.2.3 150r.1.2.3]) so phase is always determinable from bus data. (it's a single 3-phase ganged regulator with one regcontrol. This means one tap value controls all three phases simultaneously, so _build_phase_to_reg_map maps it to phase 1 only (from 150.1.2.3), and phases 2 and 3 have no independent control.)

Code changes in opendss.py:

  • _cache_regcontrol_map: Simplified to return only (transformer, winding) — phase is no longer stored here. Removed all name-based phase guessing
  • _build_phase_to_reg_map: Now determines phase solely from OpenDSS bus node data (e.g., bus.1 → phase 1). Regulators whose phase can't be determined from bus data are skipped with a debug log — users must address those by regulator name in TapPosition

Config and example script changes:

  • Updated config_ieee13.json: reg1/reg2/reg3 → creg1a/creg1b/creg1c
  • Removed all rn == "reg1" / rn == "reg2" / rn == "reg3" special-case branches from sweep_dc_locations.py and sweep_hosting_capacities.py
  • Updated hardcoded fallback TapPosition defaults from reg1/reg2/reg3 → creg1a/creg1b/creg1c
  • Updated test assertions accordingly

Naming convention across all three systems:

  • Transformer: reg{bank}{phase} (e.g., reg1a, reg2c)
  • RegControl: creg{bank}{phase} (e.g., creg1a, creg2c)
  • Phase always determinable from bus node suffix in the DSS definition

Comment on lines 44 to +47
```python
a: float | None = None
b: float | None = None
c: float | None = None
regulators: dict[str, float] = field(default_factory=dict)
```
Member

Same complaint. Generalizing single -> multi should be a type-safe structural shift, not spraying escape hatch dicts on top of existing APIs that were built assuming single.
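One possible shape of that structural shift, sketched with illustrative names (not the actual `TapPosition` definition):

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    A = 1
    B = 2
    C = 3


@dataclass
class TapPosition:
    """Sketch of the type-safe shift: one mapping keyed by an enum
    replaces optional a/b/c fields plus a stringly typed escape-hatch
    dict. Names here are illustrative."""
    taps: dict[Phase, float] = field(default_factory=dict)

    def get(self, phase: Phase, default: float = 0.0) -> float:
        return self.taps.get(phase, default)


# Single-phase is just a one-entry mapping; multi-phase is the same type.
single = TapPosition({Phase.A: 1.0})
multi = TapPosition({Phase.A: 1.0, Phase.B: 0.5, Phase.C: -0.5})
print(single.get(Phase.B))  # 0.0 (no entry)
print(multi.get(Phase.C))   # -0.5
```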

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Resolved with the comment above

Comment on lines +69 to +74
```
Supports both single-DC (legacy) and multi-DC modes:

- **Single-DC**: Pass ``dc_bus``, ``dc_bus_kv``, ``connection_type``.
- **Multi-DC**: Pass ``dc_loads`` dict mapping site IDs to
  :class:`DCLoadSpec`.
```

Member
The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Collaborator Author

Cleaned up the OpenDSSGrid docstring and comments:

  • Replaced "Supports both single-DC (legacy) and multi-DC modes" with a single sentence describing `dc_loads` as the primary interface, with `dc_bus`/`dc_bus_kv` as convenience shorthand
  • Updated Args descriptions to remove the "(single-DC mode)" / "(multi-DC mode)" labels
  • Cleaned up the `step()` docstring and internal comments to remove "legacy" language
  • Removed the "Build dc_loads dict (multi-DC or legacy single-DC)" comment

Member
The reason will be displayed to describe this comment to others. Learn more.

Many changes were made to the offline datacenter. Please review feature parity between the offline and online datacenters. For instance,

  • Did some new feature get introduced in the offline datacenter that the online datacenter doesn't handle, even though it could?
  • Did some configuration class change for a feature introduced in the offline datacenter, while the online datacenter silently ignores that configuration or, even worse, errors out?
  • The offline and online datacenters used to share a lot of behavior through shared code paths. Did any assumptions based on that shared code change only in the offline datacenter without being updated in the online one?
