Conversation
…Add functions to sweep ofo parameters, sweep DC locations, find DC hosting capacity, and optimize PV locations and capacities. Add IEEE 13, 34, 123 test feeders and example scripts. Include simulation outputs for IEEE 13, 34, 123 under multiple scenarios.
Please remove all output files that can be generated from running the examples. Data files do not belong in the code repository except when it's necessary for examples to run.

Also, please update code so that formatting, code checking, and testing pass. You can run them with `bash scripts/lint.sh` and `pytest`. In general, I would back up a copy of the dir elsewhere so that you don't lose any files, and then ask claude code to clean up the branch so that it's suitable to post a PR.
jaywonchung
left a comment
Thank you for your work. This PR will mark a substantial advancement in what OpenG2G can do!
There are many structural issues -- please take your time to address them, and please let me know if anything is unclear.
- I think you'll need to update `_zensical.toml` to include these example guides as a new section -- basically "Getting Started," "Guide," "Examples," and "API Reference" in order.
- Please remove the number prefix in the beginning of each example description markdown file. These numbers will be shown in the documentation website's URL and signal that there is an ordering of examples, but in fact, I don't think the examples have any particularly natural ordering.
- Please preview the documentation on your browser after running `bash scripts/docs.sh serve` and see if everything is rendered properly (e.g., no Markdown issues, links are all working).
I think this should be removed for now. We never tested it.
- **`target`**: Fraction of initial servers to activate. `0.5` = half the servers; `1.0` = all servers (default); `1.5` = 50% more servers than the initial `initial_num_replicas`.
- **`model`**: When set, the ramp applies only to that model. When `None` (default), it applies to all models in the datacenter.
- **Scale-up** (target > 1.0): The datacenter pre-allocates extra servers at construction time based on the peak target in the schedule. At `t=0`, only the initial server count is active; the extra servers activate when the fraction exceeds 1.0.
I was reading this and I think this API is weird. 1.0 (meaning 100%) should be a fixed ceiling. I don't think we want a constantly moving ceiling. For that, I think `InferenceRamp(target=0.5).at(t=0)` (with composition via `|`) is the right API. This way, we're specifying (time, target load) points across the whole timeline and we don't have to do this dynamic ceiling adjustment at all. Please refactor `InferenceRamp`.
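A minimal sketch of the API being proposed here. All names and signatures (`InferenceRamp`, `.at()`, `target_at`) are assumptions for illustration, not the project's actual code: the idea is simply that a ramp pins its target at a timestep, and `|` merges ramps into one schedule of (time, target) points.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceRamp:
    """Hypothetical ramp: a set of (time, target) points over the timeline."""

    target: float
    points: list = field(default_factory=list)

    def at(self, t: float) -> "InferenceRamp":
        # Pin this ramp's target at timestep t.
        self.points.append((t, self.target))
        return self

    def __or__(self, other: "InferenceRamp") -> "InferenceRamp":
        # Compose two ramps into a single sorted schedule of points.
        merged = InferenceRamp(target=self.target)
        merged.points = sorted(self.points + other.points)
        return merged

    def target_at(self, t: float) -> float:
        # Step function: the latest point at or before t wins.
        current = self.points[0][1]
        for time, tgt in self.points:
            if time <= t:
                current = tgt
        return current


schedule = InferenceRamp(target=0.5).at(t=0) | InferenceRamp(target=1.0).at(t=3600)
print(schedule.target_at(0))     # 0.5
print(schedule.target_at(3600))  # 1.0
```

With explicit (time, target) points there is no moving ceiling to reason about: the schedule is the whole timeline.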
It could be misleading since I changed the original "num_replicas" to "initial_num_replicas", which is just the number of active replicas at the beginning of the simulation, not the ceiling of available servers. So targets above 1.0 represent normal load fluctuation (e.g., demand increases), not a special "scale-up" mode. I have revised the documents and added a note explaining: "When a ramp target exceeds 1.0, additional servers are allocated to accommodate the extra replicas. Users must ensure that the datacenter's total_gpu_capacity is sufficient for the peak demand across all models."
Updated four docs files:
- building-simulators.md: Clarified that initial_num_replicas is the starting active count (not a ceiling) and that targets above 1.0 are normal load fluctuation; renamed the "Scale-up" code comment. Added a GPU capacity planning warning.
- data-pipeline.md: Replaced "scale beyond" with the neutral "activate additional replicas beyond this count".
- concepts.md: Removed the "(ramp-down and scale-up)" parenthetical that implied asymmetric behavior.
- building-simulators.md (OfflineDatacenter section): Replaced "support scale-up beyond" with the neutral "support targets above 1.0".
Um, I actually don't see the changed code/docs. Perhaps you haven't pushed your changes to GitHub. So, just from reading the text, my reaction is:
My original design of num_replicas plus a floating-point number to express the fraction of active servers wasn't very good in the first place. It should rather have been:
- In datacenter config, the site's max num servers is defined.
- In inference model spec, num replicas (or initial num replicas) do NOT exist
- The classes that define inference workload (the inference ramp class) holds the number of replicas active at timestep t.
This way, when someone looks at an inference ramp class, they don't even have to wonder what 0.9 or 1.5 means. They don't have to look up the model spec's num_replicas and multiply by the fraction. And we don't have to be fighting over what 1.0 means either. During simulation, the datacenter can also easily reason about whether it ended up with more active servers than its max number of servers at each timestep it runs.
I will do this refactor later. Well, of course I won't object if you would do it 😋 but please feel free to resolve this comment as you see fit.
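The three bullets above can be sketched roughly as follows. `SiteConfig`, `ReplicaSchedule`, and their fields are hypothetical names for illustration only; the point is that the workload class holds absolute replica counts per timestep, and the site ceiling lives in the datacenter config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    # Hypothetical: the site's server ceiling is defined in the datacenter config.
    max_num_servers: int


@dataclass
class ReplicaSchedule:
    # Hypothetical workload class: absolute replica counts at each timestep,
    # so no fraction ever needs to be multiplied by a model-spec field.
    points: dict  # timestep (float) -> active replicas (int)

    def replicas_at(self, t: float) -> int:
        # Step function: the latest point at or before t wins.
        active = 0
        for time in sorted(self.points):
            if time <= t:
                active = self.points[time]
        return active


site = SiteConfig(max_num_servers=100)
schedule = ReplicaSchedule(points={0.0: 50, 3600.0: 120})

for t in (0.0, 3600.0):
    n = schedule.replicas_at(t)
    # The datacenter can reason directly about oversubscription at each timestep.
    if n > site.max_num_servers:
        print(f"t={t}: {n} replicas exceed site capacity {site.max_num_servers}")
```

Because counts are absolute, the capacity check is a plain integer comparison per timestep, with no ambiguity about what 1.0 means.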
Attributes:
    dc_states: Every datacenter state produced by the datacenter (flat list, all sites).
    dc_states_by_site: Per-site datacenter states (multi-DC mode).
Multiple DC should be a generalization of the original single DC. That is, single DC mode should be programmatically equivalent to multiple DC mode, except that there is only one DC site. Not touching the original single DC code and data path and slapping on multi-DC code and data on top of it is poor design.
Coordinator refactoring — unified single/multi-DC code path:
- Removed the `_single_dc_mode` flag — single DC is now just `datacenters={_DEFAULT_SITE: dc}`, no branching
- Removed the `self.datacenter` legacy property — replaced with a `_resolve_dc(site_id)` helper that looks up by site ID or falls back to the first site
- Grid always receives a dict — eliminated the "if `_single_dc_mode`: pass flat list, else: pass dict" branch in the run loop
- `record_datacenter` always gets a `site_id` — `dc_states_by_site` is always populated, not just in multi-DC mode
- Added a `datacenters` property for external access to the site dict
- Command routing simplified — uses `_resolve_dc(target_site_id)` instead of `if target_site and target_site in self._datacenters ... else self.datacenter`
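Since the refactored code isn't pushed yet, here is a minimal sketch of what the unified path described above might look like. The constructor signature, `_DEFAULT_SITE`, and `_resolve_dc` are assumptions based on the bullet list, not the actual coordinator source.

```python
class Coordinator:
    """Sketch of a unified single/multi-DC coordinator (names are assumed)."""

    _DEFAULT_SITE = "default"

    def __init__(self, datacenters=None, datacenter=None):
        if datacenters is None:
            # Single-DC callers are folded into the same dict-shaped path,
            # so downstream code never branches on a mode flag.
            datacenters = {self._DEFAULT_SITE: datacenter}
        self._datacenters = dict(datacenters)

    @property
    def datacenters(self):
        # External access to the site dict.
        return self._datacenters

    def _resolve_dc(self, site_id=None):
        # Look up by site ID; fall back to the first site when the ID is
        # missing or unknown, so single-DC code needs no special casing.
        if site_id is not None and site_id in self._datacenters:
            return self._datacenters[site_id]
        return next(iter(self._datacenters.values()))
```

Note that this sketch treats site IDs as opaque dict keys, which also addresses the next comment: nothing in the coordinator depends on what the sites are named.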
I also don't see the code but is this basically looking up the datacenter object based on a string name site_id inside the coordinator?
Ideally I want to avoid hardcoding site IDs inside the coordinator's source code. What if someone wants to name their three datacenter sites, "Alice," "Bob," and "Eve"? What if another person wants "one," "two," and "three"? The coordinator's source code shouldn't change due to datacenter site renaming, and ideally also not the controller source code. I'll look more into the code when it's pushed.
```python
# For 3-phase regulators without explicit phase suffix,
# try to infer phase from the regcontrol name (e.g. creg1a -> A=1)
if phase not in (1, 2, 3):
    name_lower = rc_name.lower()
    if name_lower.endswith("a"):
        phase = 1
    elif name_lower.endswith("b"):
        phase = 2
    elif name_lower.endswith("c"):
        phase = 3
    elif n_phases == 3:
        # 3-phase ganged regulator — assign to phase 1 as primary
        phase = 1
        logger.info(
            "RegControl '%s' is 3-phase (buses=%s); treating as phase A.",
            rc_name,
            bus_names,
        )
```
Matching based on arbitrary string name patterns is highly error prone. Please make it type safe by generalizing the original pattern of having fields a, b, and c.
Regulator naming cleanup
Problem: The code inferred regulator phase from name suffixes (e.g., endswith("a") → phase A) — a brittle heuristic scattered across opendss.py and example scripts. IEEE13 used generic names (Reg1, Reg2, Reg3) that didn't follow any convention, requiring special-case fallbacks like rn == "reg1" throughout.
DSS file changes:
- ieee13: Renamed regulators from Reg1/Reg2/Reg3 → reg1a/reg1b/reg1c (transformers) and creg1a/creg1b/creg1c (regcontrols), matching the creg{bank}{phase} convention already used by ieee34 and ieee123
- ieee123: Added explicit phase suffix to the 3-phase ganged regulator bus ([150 150r] → [150.1.2.3 150r.1.2.3]) so phase is always determinable from bus data. (it's a single 3-phase ganged regulator with one regcontrol. This means one tap value controls all three phases simultaneously, so _build_phase_to_reg_map maps it to phase 1 only (from 150.1.2.3), and phases 2 and 3 have no independent control.)
Code changes in opendss.py:
- _cache_regcontrol_map: Simplified to return only (transformer, winding) — phase is no longer stored here. Removed all name-based phase guessing
- _build_phase_to_reg_map: Now determines phase solely from OpenDSS bus node data (e.g., bus.1 → phase 1). Regulators whose phase can't be determined from bus data are skipped with a debug log — users must address those by regulator name in TapPosition
Config and example script changes:
- Updated config_ieee13.json: reg1/reg2/reg3 → creg1a/creg1b/creg1c
- Removed all rn == "reg1" / rn == "reg2" / rn == "reg3" special-case branches from sweep_dc_locations.py and sweep_hosting_capacities.py
- Updated hardcoded fallback TapPosition defaults from reg1/reg2/reg3 → creg1a/creg1b/creg1c
- Updated test assertions accordingly
Naming convention across all three systems:
- Transformer: reg{bank}{phase} (e.g., reg1a, reg2c)
- RegControl: creg{bank}{phase} (e.g., creg1a, creg2c)
- Phase always determinable from bus node suffix in the DSS definition
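The bus-node-based approach described for `_build_phase_to_reg_map` can be sketched like this. The helper name `phases_from_bus` is hypothetical; it only illustrates how OpenDSS bus node suffixes (e.g. `150.1.2.3`) determine phases without any name-pattern guessing.

```python
def phases_from_bus(bus: str) -> list:
    """Parse OpenDSS bus node suffixes, e.g. '150.1.2.3' -> [1, 2, 3].

    A bare bus name ('150') carries no node information, so the phase
    cannot be determined; per the description above, such regulators are
    skipped with a debug log and must be addressed by regulator name.
    """
    _name, _sep, nodes = bus.partition(".")
    if not nodes:
        return []
    # Keep only phase conductors 1-3 (node 0 is the neutral).
    return [int(n) for n in nodes.split(".") if n in ("1", "2", "3")]


print(phases_from_bus("150.1.2.3"))  # [1, 2, 3]
print(phases_from_bus("670.2"))      # [2]
print(phases_from_bus("150"))        # []
```

This is why the ieee123 DSS change above (adding `.1.2.3` to the ganged regulator's buses) matters: it makes the phase recoverable from bus data alone.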
```python
a: float | None = None
b: float | None = None
c: float | None = None
regulators: dict[str, float] = field(default_factory=dict)
```
Same complaint. Generalizing single -> multi should be a type-safe structural shift, not spraying escape-hatch dicts on top of existing APIs that were built assuming the single case.
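One possible type-safe shape for this, sketched under assumptions (the `Phase` enum and `TapPosition` fields shown here are illustrative, not the project's actual classes): replace the parallel `a`/`b`/`c` floats plus a stringly-keyed `regulators` dict with a single typed mapping.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    A = 1
    B = 2
    C = 3


@dataclass(frozen=True)
class TapPosition:
    # One structure for any number of phases/regulators; no escape-hatch
    # dict keyed by arbitrary strings alongside optional per-phase floats.
    positions: dict  # Phase -> tap value

    def get(self, phase, default=None):
        return self.positions.get(phase, default)


taps = TapPosition(positions={Phase.A: 1.0, Phase.B: 0.9875})
print(taps.get(Phase.A))  # 1.0
print(taps.get(Phase.C))  # None
```

The enum keeps phase references checkable by type tools, and the single mapping means the multi-regulator case is not a bolt-on but the only representation.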
Resolved with the comment above
Supports both single-DC (legacy) and multi-DC modes:

- **Single-DC**: Pass ``dc_bus``, ``dc_bus_kv``, ``connection_type``.
- **Multi-DC**: Pass ``dc_loads`` dict mapping site IDs to :class:`DCLoadSpec`.
Cleaned up the OpenDSSGrid docstring and comments:
- Replaced "Supports both single-DC (legacy) and multi-DC modes" with a single sentence describing dc_loads as the primary interface, with dc_bus/dc_bus_kv as convenience shorthand
- Updated Args descriptions to remove "(single-DC mode)" / "(multi-DC mode)" labels
- Cleaned up step() docstring and internal comments to remove "legacy" language
- Removed the "Build dc_loads dict (multi-DC or legacy single-DC)" comment
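The "primary interface plus convenience shorthand" pattern described above might look roughly like this. `DCLoadSpec`'s fields, `build_dc_loads`, and the default site name are assumptions for illustration; the actual OpenDSSGrid code may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DCLoadSpec:
    # Assumed fields, mirroring the single-DC constructor arguments.
    bus: str
    bus_kv: float
    connection_type: str = "wye"


def build_dc_loads(dc_loads=None, dc_bus=None, dc_bus_kv=None,
                   connection_type="wye", default_site="dc0"):
    """dc_loads is the primary interface; dc_bus/dc_bus_kv is shorthand
    that gets normalized into a one-entry dict (names are assumptions)."""
    if dc_loads is not None:
        return dict(dc_loads)
    if dc_bus is not None:
        return {default_site: DCLoadSpec(dc_bus, dc_bus_kv, connection_type)}
    return {}


loads = build_dc_loads(dc_bus="675", dc_bus_kv=4.16)
print(loads["dc0"].bus)  # 675
```

Normalizing the shorthand at the boundary means everything downstream sees only the dict form, which matches the coordinator cleanup above.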
Many changes were made to the offline datacenter. Please review feature parity between the offline and online datacenters. For instance:
- Did some new feature get introduced in the offline datacenter that the online datacenter doesn't handle, even though it could?
- Did some configuration class change for a feature introduced in the offline datacenter, while the online datacenter silently ignores that configuration or, even worse, errors out?
- The offline and online datacenters used to share a lot of behavior through shared code paths. Did any assumption based on that shared code and behavior change only in the offline datacenter without being updated in the online one?
Include all the latest example scripts and results for report writing.