
Dev report #1 (Open)

ZhiruiLiang wants to merge 43 commits into `master` from `dev_report`

Conversation

@ZhiruiLiang (Collaborator)

Include all the latest example scripts and results for report writing.

jaywonchung and others added 7 commits March 4, 2026 21:55
…Add functions to sweep ofo parameters, sweep DC locations, find DC hosting capacity, and optimize PV locations and capacities. Add IEEE 13, 34, 123 test feeders and example scripts. Include simulation outputs for IEEE 13, 34, 123 under multiple scenarios.
@jaywonchung (Member)

Please remove all output files that can be generated from running the examples. Data files do not belong in the code repository except when it's necessary for examples to run.

@jaywonchung (Member) commented Mar 18, 2026

Also, please update the code so that formatting, code checking, and testing pass. You can run them with `bash scripts/lint.sh` and `pytest`.

In general, I would back up a copy of the directory elsewhere so that you don't lose any files, and then ask Claude Code to clean up the branch so that it's suitable for a PR.

@jaywonchung (Member) left a comment

Thank you for your work. This PR will mark a substantial advancement in what OpenG2G can do!

There are many structural issues -- please take your time to address them, and please let me know if anything is unclear.

Member

  1. I think you'll need to update `_zensical.toml` to include these example guides as a new section -- basically "Getting Started," "Guide," "Examples," and "API Reference," in order.
  2. Please remove the number prefix at the beginning of each example description Markdown file. These numbers show up in the documentation website's URLs and signal that there is an ordering of examples, but in fact I don't think the examples have any particularly natural ordering.
  3. Please preview the documentation in your browser after running `bash scripts/docs.sh serve` and check that everything renders properly (e.g., no Markdown issues, all links working).

Member

I think this should be removed for now. We never tested it.


- **`target`**: Fraction of initial servers to activate. `0.5` = half the servers; `1.0` = all servers (default); `1.5` = 50% more servers than `initial_num_replicas`.
- **`model`**: When set, the ramp applies only to that model. When `None` (default), it applies to all models in the datacenter.
- **Scale-up** (target > 1.0): The datacenter pre-allocates extra servers at construction time based on the peak target in the schedule. At `t=0`, only the initial server count is active; the extra servers activate when the fraction exceeds 1.0.
Member

I was reading this and I think this API is weird. 1.0 (meaning 100%) should be a fixed ceiling; I don't think we want a constantly moving ceiling. For that, I think `InferenceRamp(target=0.5).at(t=0)` (with composition via `|`) is the right API. This way, we're specifying (time, target load) points across the whole timeline and we don't have to do any dynamic ceiling adjustment at all. Please refactor `InferenceRamp`.
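A minimal sketch of what that composition API could look like. Only `InferenceRamp`, `.at(t=...)`, and composition with `|` come from the comment above; all internals here are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class InferenceRamp:
    """Hypothetical sketch: each ramp pins a target load fraction
    (at most 1.0, a fixed ceiling) at a timestep, and ramps compose
    with `|` into one piecewise-constant schedule."""

    target: float
    time: int = 0
    points: dict[int, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 0.0 <= self.target <= 1.0:
            raise ValueError("target must be in [0, 1]; 1.0 is the fixed ceiling")
        self.points.setdefault(self.time, self.target)

    def at(self, t: int) -> "InferenceRamp":
        # Anchor this ramp's target at timestep t.
        return InferenceRamp(self.target, t, {t: self.target})

    def __or__(self, other: "InferenceRamp") -> "InferenceRamp":
        # Merge the (time, target) points of both ramps into one schedule.
        merged = {**self.points, **other.points}
        first = min(merged)
        return InferenceRamp(merged[first], first, merged)

    def target_at(self, t: int) -> float:
        # Latest scheduled point at or before t.
        times = [p for p in self.points if p <= t]
        return self.points[max(times)] if times else self.target


schedule = InferenceRamp(target=0.5).at(t=0) | InferenceRamp(target=1.0).at(t=60)
print(schedule.target_at(0))   # 0.5
print(schedule.target_at(90))  # 1.0
```

The point is that the schedule is just a set of (time, target) pairs under a fixed 1.0 ceiling, so no dynamic ceiling adjustment is ever needed.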

Collaborator Author

It could be misleading since I changed the original `num_replicas` to `initial_num_replicas`, which is just the number of active replicas at the beginning of the simulation, not the ceiling of available servers. So targets above 1.0 represent normal load fluctuation (e.g., demand increases), not a special "scale-up" mode. I have revised the documents and added a note explaining: "When a ramp target exceeds 1.0, additional servers are allocated to accommodate the extra replicas. Users must ensure that the datacenter's `total_gpu_capacity` is sufficient for the peak demand across all models."

Updated four docs files:

  • building-simulators.md: Clarified that `initial_num_replicas` is the starting active count (not a ceiling) and that targets above 1.0 are normal load fluctuation; renamed the "Scale-up" code comment. Added a GPU capacity planning warning.
  • data-pipeline.md: Replaced "scale beyond" with the neutral "activate additional replicas beyond this count".
  • concepts.md: Removed the "(ramp-down and scale-up)" parenthetical that implied asymmetric behavior.
  • building-simulators.md (OfflineDatacenter section): Replaced "support scale-up beyond" with the neutral "support targets above 1.0".
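For illustration, the check that the added note asks users to perform could be sketched as follows. Apart from `total_gpu_capacity` and `initial_num_replicas`, which appear above, all names here are hypothetical and not the real config schema:

```python
def check_gpu_capacity(total_gpu_capacity: float, models: list[dict]) -> float:
    """Sketch of the capacity check the note describes: peak replica
    demand across all models must fit in the datacenter's GPU capacity.
    Field names are illustrative, not the real config schema."""
    peak_gpus = sum(
        m["initial_num_replicas"] * max(m["ramp_targets"]) * m["gpus_per_replica"]
        for m in models
    )
    if peak_gpus > total_gpu_capacity:
        raise ValueError(
            f"peak demand {peak_gpus:.0f} GPUs exceeds capacity {total_gpu_capacity}"
        )
    return peak_gpus


models = [
    {"initial_num_replicas": 10, "ramp_targets": [0.5, 1.0, 1.5], "gpus_per_replica": 8},
    {"initial_num_replicas": 4, "ramp_targets": [1.0], "gpus_per_replica": 8},
]
print(check_gpu_capacity(total_gpu_capacity=160, models=models))  # 152.0
```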

Member

Um, I actually don't see the changed code/docs. Perhaps you haven't pushed your changes to GitHub. So, just from reading the text, my reaction is:

My original design of `num_replicas` plus a floating point number to express the fraction of active servers wasn't very good in the first place. It should rather have been:

  • In datacenter config, the site's max num servers is defined.
  • In inference model spec, num replicas (or initial num replicas) do NOT exist
  • The classes that define inference workload (the inference ramp class) holds the number of replicas active at timestep t.

This way, when someone looks at an inference ramp class, they don't even have to wonder what 0.9 or 1.5 means. They don't have to look up the model spec's `num_replicas` and multiply by the fraction. And we don't have to fight over what 1.0 means either. During simulation, the datacenter can also easily reason about whether it ended up with more active servers than its max number of servers at each timestep it runs.

I will do this refactor later. Well, of course I won't object if you would do it 😋 but please feel free to resolve this comment as you see fit.
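A rough sketch of that design, with illustrative names (the real config classes differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """The datacenter config owns the hard ceiling (hypothetical name)."""
    max_num_servers: int


@dataclass(frozen=True)
class ReplicaSchedule:
    """The workload class holds absolute replica counts per timestep,
    so readers never multiply a fraction by a model-spec field."""
    replicas_at: dict[int, int]

    def active_at(self, t: int) -> int:
        # Piecewise-constant: latest scheduled point at or before t.
        times = [p for p in self.replicas_at if p <= t]
        return self.replicas_at[max(times)] if times else 0


def validate(site: SiteConfig, sched: ReplicaSchedule, horizon: int) -> None:
    # The datacenter can reason directly about over-allocation per timestep.
    for t in range(horizon):
        active = sched.active_at(t)
        if active > site.max_num_servers:
            raise ValueError(
                f"t={t}: {active} replicas exceed {site.max_num_servers} servers"
            )


site = SiteConfig(max_num_servers=8)
sched = ReplicaSchedule({0: 4, 10: 8})
validate(site, sched, horizon=20)
print(sched.active_at(15))  # 8
```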

```
Attributes:
    dc_states: Every datacenter state produced by the datacenter (flat list, all sites).
    dc_states_by_site: Per-site datacenter states (multi-DC mode).
```
Member

Multiple DC should be a generalization of the original single DC. That is, single DC mode should be programmatically equivalent to multiple DC mode, except that there is only one DC site. Not touching the original single DC code and data path and slapping on multi-DC code and data on top of it is poor design.

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Coordinator refactoring — unified single/multi-DC code path:

  1. Removed the `_single_dc_mode` flag: single DC is now just `datacenters={_DEFAULT_SITE: dc}`, with no branching
  2. Removed the `self.datacenter` legacy property, replaced with a `_resolve_dc(site_id)` helper that looks up by site ID or falls back to the first site
  3. Grid always receives a dict: eliminated the `if _single_dc_mode: pass flat list else: pass dict` branch in the run loop
  4. `record_datacenter` always gets a `site_id`: `dc_states_by_site` is always populated, not just in multi-DC mode
  5. Added a `datacenters` property for external access to the site dict
  6. Simplified command routing: uses `_resolve_dc(target_site_id)` instead of `if target_site and target_site in self._datacenters ... else self.datacenter`

Member

I also don't see the code, but is this basically looking up the datacenter object based on a string `site_id` inside the coordinator?

Ideally I want to avoid hardcoding site IDs inside the coordinator's source code. What if someone wants to name their three datacenter sites, "Alice," "Bob," and "Eve"? What if another person wants "one," "two," and "three"? The coordinator's source code shouldn't change due to datacenter site renaming, and ideally also not the controller source code. I'll look more into the code when it's pushed.

Comment on lines +448 to +466
```python
# For 3-phase regulators without explicit phase suffix,
# try to infer phase from the regcontrol name (e.g. creg1a -> A=1)
if phase not in (1, 2, 3):
    name_lower = rc_name.lower()
    if name_lower.endswith("a"):
        phase = 1
    elif name_lower.endswith("b"):
        phase = 2
    elif name_lower.endswith("c"):
        phase = 3
    elif n_phases == 3:
        # 3-phase ganged regulator — assign to phase 1 as primary
        phase = 1
        logger.info(
            "RegControl '%s' is 3-phase (buses=%s); treating as phase A.",
            rc_name,
            bus_names,
        )
```

Member

Matching based on arbitrary string name patterns is highly error prone. Please make it type safe by generalizing the original pattern of having fields a, b, and c.

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Regulator naming cleanup

Problem: The code inferred regulator phase from name suffixes (e.g., endswith("a") → phase A) — a brittle heuristic scattered across opendss.py and example scripts. IEEE13 used generic names (Reg1, Reg2, Reg3) that didn't follow any convention, requiring special-case fallbacks like rn == "reg1" throughout.

DSS file changes:

  • ieee13: Renamed regulators from Reg1/Reg2/Reg3 → reg1a/reg1b/reg1c (transformers) and creg1a/creg1b/creg1c (regcontrols), matching the creg{bank}{phase} convention already used by ieee34 and ieee123
  • ieee123: Added explicit phase suffix to the 3-phase ganged regulator bus ([150 150r] → [150.1.2.3 150r.1.2.3]) so phase is always determinable from bus data. (it's a single 3-phase ganged regulator with one regcontrol. This means one tap value controls all three phases simultaneously, so _build_phase_to_reg_map maps it to phase 1 only (from 150.1.2.3), and phases 2 and 3 have no independent control.)

Code changes in opendss.py:

  • _cache_regcontrol_map: Simplified to return only (transformer, winding) — phase is no longer stored here. Removed all name-based phase guessing
  • _build_phase_to_reg_map: Now determines phase solely from OpenDSS bus node data (e.g., bus.1 → phase 1). Regulators whose phase can't be determined from bus data are skipped with a debug log — users must address those by regulator name in TapPosition

Config and example script changes:

  • Updated config_ieee13.json: reg1/reg2/reg3 → creg1a/creg1b/creg1c
  • Removed all rn == "reg1" / rn == "reg2" / rn == "reg3" special-case branches from sweep_dc_locations.py and sweep_hosting_capacities.py
  • Updated hardcoded fallback TapPosition defaults from reg1/reg2/reg3 → creg1a/creg1b/creg1c
  • Updated test assertions accordingly

Naming convention across all three systems:

  • Transformer: reg{bank}{phase} (e.g., reg1a, reg2c)
  • RegControl: creg{bank}{phase} (e.g., creg1a, creg2c)
  • Phase always determinable from bus node suffix in the DSS definition

Comment on lines 44 to +47
```python
a: float | None = None
b: float | None = None
c: float | None = None
regulators: dict[str, float] = field(default_factory=dict)
```
Member

Same complaint. Generalizing single -> multi should be a type-safe structural shift, not spraying escape hatch dicts on top of existing APIs that were built assuming single.
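One possible shape of that structural shift, sketched with illustrative names (not the actual `TapPosition` definition):

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    A = 1
    B = 2
    C = 3


@dataclass
class TapPosition:
    """Sketch of the type-safe shift: one mapping keyed by an enum
    replaces optional a/b/c fields plus a stringly typed escape-hatch
    dict. Names here are illustrative."""
    taps: dict[Phase, float] = field(default_factory=dict)

    def get(self, phase: Phase, default: float = 0.0) -> float:
        return self.taps.get(phase, default)


# Single-phase is just a one-entry mapping; multi-phase is the same type.
single = TapPosition({Phase.A: 1.0})
multi = TapPosition({Phase.A: 1.0, Phase.B: 0.5, Phase.C: -0.5})
print(single.get(Phase.B))  # 0.0 (no entry)
print(multi.get(Phase.C))   # -0.5
```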

@ZhiruiLiang (Collaborator Author) commented Mar 31, 2026

Resolved with the comment above

Comment on lines +69 to +74
```
Supports both single-DC (legacy) and multi-DC modes:

- **Single-DC**: Pass ``dc_bus``, ``dc_bus_kv``, ``connection_type``.
- **Multi-DC**: Pass ``dc_loads`` dict mapping site IDs to
  :class:`DCLoadSpec`.
```

Member
The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Collaborator Author

Cleaned up the OpenDSSGrid docstring and comments:

  • Replaced "Supports both single-DC (legacy) and multi-DC modes" with a single sentence describing `dc_loads` as the primary interface, with `dc_bus`/`dc_bus_kv` as convenience shorthand
  • Updated Args descriptions to remove the "(single-DC mode)" / "(multi-DC mode)" labels
  • Cleaned up the `step()` docstring and internal comments to remove "legacy" language
  • Removed the "Build dc_loads dict (multi-DC or legacy single-DC)" comment

Member
The reason will be displayed to describe this comment to others. Learn more.

Many changes were made to the offline datacenter. Please review feature parity between the offline and online datacenters. For instance,

  • Did some new feature get introduced in the offline datacenter that the online datacenter doesn't handle, even though it could?
  • Did some configuration class change for a feature introduced in the offline datacenter, while the online datacenter silently ignores that configuration or, even worse, errors out?
  • The offline and online datacenters used to share a lot of behavior through shared code paths. Did any assumptions based on that shared code change only in the offline datacenter without being updated in the online one?
