follow-up to #922 — an idle dashboard sits at ~700MB before a compile even runs. on the 2GB HA-addon VM in that issue there's no headroom for gcc.
three big eager loads contribute (sizes on disk):
definitions/components.json — 22MB, ~14k mashumaro instances after parse
definitions/automations.json — 15MB
firmware-job persistence — restores up to 55 historical jobs × 2000 lines of build output (~15MB worst case) at startup
most of this is wasted RAM: list/search endpoints don't need config_entries, users mostly open one component card at a time, and old build-logs are rarely re-opened.
approach — slim-index + lazy-body, keeping orjson + mashumaro on the fast path:
catalogs split build-time into an index.json (id / name / description / category / etc.) loaded eagerly, plus per-entry body files loaded on demand into a small LRU (~64 entries)
firmware logs split into job-metadata (in the existing metadata blob) + a per-job sidecar log file under ext_storage_path("dashboard-jobs")/<job_id>.log — FirmwareJob.output populates lazily when the frontend opens a job detail view
ruled out:
sqlite — read-only catalog regenerated by sync script, no relational queries we can't do in python, would lose the mashumaro fast path
__slots__ + drop DataClassORJSONMixin: would save ~4MB pre-lazy-body but only ~400KB once the LRU is bounded. not worth giving up schema consistency and mashumaro's wire-codegen, or taking on a hand-rolled to_dict() per leaf model that has to stay in lockstep with field additions.
estimated saving — ~50-80MB resident off the 700MB baseline. headroom for #922's 2GB VM to stop sigkilling during compiles.
checklist
PR 1 — measurement scaffolding: tracemalloc-based memory benchmark in tests/benchmarks/test_catalog_memory.py covering all three catalog loaders + the firmware job restore path. regression gate, no production change.
PR 2 — sys.intern() on closed-vocabulary strings in _load_component / _load_config_entry (category / platform_type / supported_platforms / references_component). free win, no wire change. estimated 3-8MB.
PR 3 — components: slim index + lazy bodies. build-time: sync_components.py emits definitions/components.index.json + definitions/components/<id>.json. runtime: ComponentCatalog.load() parses index only, get_body(id) reads body on demand through bounded LRU. _FeaturedRecord stores underlying_id: str. get_components wire shape drops config_entries — coordinate with frontend repo PR same release window.
PR 4 — automations: same split shape as components. slim index must keep id + domain + applies_to + is_device_level because triggers_for_domains / actions_for_domains partition by domain on every request. refactor @functools.cache module-global to an AutomationsCatalog controller owned by DeviceBuilder.automations.
PR 5 — firmware logs: split FirmwareJob.output to per-job sidecar file. load_jobs restores metadata only. _ingest_output_line appends to in-memory list (for live followers) AND to the sidecar. persist_jobs stops rewriting outputs — sidecar is append-only. new firmware/get_job_output WS command (or flag on existing firmware/get_job).
boards (3MB) stays untouched — below the threshold for the complexity cost.
a diagnostic helper for separately-reported runtime memory growth landed in #935 (debug/memory_snapshot WS command) so future leak reports can carry tracemalloc diffs.
full design notes
Context
an idle device-builder instance currently sits at ~700MB resident — before a single firmware compile launches. the new dashboard ships three eagerly-loaded read-only catalogs that the old esphome dashboard did not carry in memory:
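| file | disk | resident objects | loader |
|---|---|---|---|
| definitions/components.json | 22MB | ~904 ComponentCatalogEntry + ~13k ConfigEntry (no __slots__) | controllers/components.py:86 (ComponentCatalog.load()) |
| definitions/automations.json | 15MB | | controllers/automations/catalog.py:32 (@functools.cache at module-import) |
| definitions/boards.json | 3MB | 490 boards | BoardCatalog.load() |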
disk → python object explosion for catalogs of this shape is typically 3-5×: each JSON string becomes a str object (~50 bytes overhead), every list/dict gets PyObject headers, and the ~14k mashumaro dataclass instances each carry a ~280-byte __dict__. that's consistent with the 700MB baseline once you add esphome-library imports (esphome.config, esphome.codegen, the platform-specific component modules) which the dashboard's validation paths pull in at startup.
issue #922 reports SIGKILL on a 2GB HA-addon VM the moment a compile launches — gcc/g++ on its own can claim 300-500MB per invocation, and there's no headroom because the dashboard's permanent budget already consumed 700MB.
today's access pattern already does most of the work for us:
components/get_components (paginated list / search) → list view does not need config_entries
components/get_categories, components/get_integration_docs → only id + a couple of flat fields
components/get_component (detail view) → needs the full entry
add_component, resolve_default_components → full entry for one component at a time
catalogs are never pushed in subscribe_events.initial_state
so the bulk of every dashboard load is paying for ~13k ConfigEntry trees that the typical session only opens for a handful of components.
why not the alternatives
sqlite: read-only catalog regenerated by a script, no relational queries, no FTS we can't do cheaper in python. adds wheel-bundling complexity (binary blob), gives up mashumaro type validation, costs a cursor + row→dataclass adapter on every read. the only argument is "single file, atomic regen" and the slim index gives us that for free.
__slots__ + drop DataClassORJSONMixin: @dataclass(slots=True) on ConfigEntry only saves memory if we also drop the mixin (a non-slotted base silently re-adds __dict__ via MRO). pre-lazy-body that would have saved ~4MB across ~14k resident instances, which was worth the cost. post-lazy-body only ~64 bodies live in the LRU at a time, so the win collapses to ~400KB. not worth giving up schema consistency and mashumaro's wire-codegen, or taking on a hand-rolled to_dict() per leaf model that has to stay in lockstep with field additions. worth revisiting if mashumaro upstream ever gets a slot-friendly mixin shape.
single packed file + offset index (mmap + seek): one file instead of ~900 saves a bit of inode overhead but pushes complexity into the sync script (compute / verify offsets) and adds a torn-write failure mode the per-entry shape sidesteps. per-entry files have precedent in definitions/boards/<id>/manifest.yaml (493 dirs).
wire shape
components/get_components (list) returns slim entries — no config_entries. frontend coordination required: the frontend repo PR drops any list-view reads of config_entries. same release window.
components/get_component (detail) keeps the full shape — the form renderer relies on config_entries here.
per-PR detail
PR 1 — measurement scaffolding
regression gate so future PRs are measured against a baseline:
new tests/benchmarks/test_catalog_memory.py with a tracemalloc snapshot around ComponentCatalog.load(), automations.catalog.load_catalog(), BoardCatalog.load(), and the firmware job restore path.
snapshot per-loader resident bytes; assert against a generous ceiling so a regression on main after a sync_components run surfaces immediately.
no production code change.
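a rough sketch of what such a gate could look like. note tracemalloc counts bytes allocated through Python's allocator, which approximates but is not the same as process RSS. traced_bytes, fake_catalog_loader and CEILING_BYTES are hypothetical names; the real test would call ComponentCatalog.load() and friends instead of the stand-in loader.

```python
import json
import tracemalloc
from typing import Any, Callable


def traced_bytes(loader: Callable[[], Any]) -> tuple[Any, int]:
    """Run loader() and return (result, bytes still allocated afterwards)."""
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        result = loader()  # keep the result alive while measuring
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, after - before


def fake_catalog_loader() -> list[dict[str, Any]]:
    """Stand-in for a catalog loader: builds and parses a small JSON catalog."""
    entries = [
        {
            "id": f"c{i}",
            "category": "sensor",
            "config_entries": [{"key": f"k{j}"} for j in range(10)],
        }
        for i in range(200)
    ]
    return json.loads(json.dumps(entries))


CEILING_BYTES = 5 * 1024 * 1024  # generous, hypothetical ceiling


def test_catalog_memory_under_ceiling() -> None:
    catalog, used = traced_bytes(fake_catalog_loader)
    assert len(catalog) == 200
    assert 0 < used < CEILING_BYTES
```

keeping the ceiling generous at first means only a real regression trips it; later PRs can lower the constant as the baseline drops.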
PR 2 — sys.intern() on closed-vocabulary strings
free win, no API change, low risk — lands while PR 3 is in review.
in _load_component / _load_config_entry, intern category, platform_type, supported_platforms members, references_component. closed vocabularies are ~20 categories × ~10 platforms × ~30 entry types — currently duplicated across 13k ConfigEntry instances.
estimated saving: 3-8MB. no wire change.
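a minimal sketch of the interning step, using stdlib json as a stand-in for orjson and plain dicts in place of the mashumaro models. intern_closed_vocab and the sample values are hypothetical; the field names come from the list above.

```python
import json
import sys
from typing import Any


def intern_closed_vocab(entry: dict[str, Any]) -> dict[str, Any]:
    """Replace closed-vocabulary strings with their interned instance."""
    for key in ("category", "platform_type", "references_component"):
        value = entry.get(key)
        if isinstance(value, str):
            entry[key] = sys.intern(value)
    platforms = entry.get("supported_platforms")
    if isinstance(platforms, list):
        entry["supported_platforms"] = [sys.intern(p) for p in platforms]
    return entry


raw = json.dumps(
    [
        {"id": f"c{i}", "category": "sensor", "supported_platforms": ["esp32", "esp8266"]}
        for i in range(3)
    ]
)
entries = [intern_closed_vocab(e) for e in json.loads(raw)]

if __name__ == "__main__":
    # after interning, equal vocabulary strings are one shared object
    print(entries[0]["category"] is entries[2]["category"])
```

open-ended fields such as id or description are left alone on purpose: interning only pays off where the same value repeats many times.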
PR 3 — components: slim-index + bodies on disk
build-time changes in script/sync_components.py:
emit definitions/components.index.json: a list of ComponentCatalogIndexEntry carrying id, name, description, category, docs_url, image_url, dependencies, multi_conf, supported_platforms. that is every field referenced by the get_components filter, get_categories, get_integration_docs, _categories_for_board or the featured-registry build path.
emit definitions/components/<id>.json per component: the full shape including config_entries.
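atomic regen: write the new tree to definitions/components.next/, swap it into place with os.replace(), and write a single components.index.json last. os.replace() raises OSError when the destination is a non-empty directory, so the swap has to rename the old tree aside first (or flip a symlink). a Ctrl-C mid-regen must not leave a torn catalog. validate via a manifest hash carried in the index header.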
wheel packaging: the new files have to be covered by pyproject.toml's tool.setuptools.package-data glob.
runtime changes in controllers/components.py:
ComponentCatalogIndexEntry (slim shape) in models/components.py.
ComponentCatalog.load() parses components.index.json only.
ComponentCatalog.get_body(id) -> ComponentCatalogEntry reads components/<id>.json on demand, hydrates with mashumaro, returns through a bounded LRU (maxsize=64).
body reads hop to a thread (asyncio.to_thread) to keep blockbuster happy and stay off the event loop.
_build_featured_registry() becomes index-only: _FeaturedRecord stores underlying_id: str, not the body. bodies are fetched at _materialise_featured time. most invasive bit of this PR — flag explicitly in PR description.
WS surface: get_components returns the slim type. get_component / add_component / resolve_default_components go through get_body. get_categories / get_integration_docs stay on the index.
test fixture work: tests/conftest.py materialises a tiny mock components/ directory at test setup (a couple of fixture components + an index file) rather than stubbing _COMPONENTS_JSON.
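the runtime half of PR 3 could look roughly like this. a sketch only: LazyBodyCatalog, build_demo_tree and demo are hypothetical names, stdlib json and plain dicts stand in for orjson + mashumaro, and the file layout follows the components.index.json + components/<id>.json shape described above.

```python
import asyncio
import json
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Any


class LazyBodyCatalog:
    """Eager slim index; bodies read on demand through a bounded LRU."""

    def __init__(self, root: Path, maxsize: int = 64) -> None:
        self._root = root
        self._maxsize = maxsize
        self._bodies: OrderedDict[str, dict[str, Any]] = OrderedDict()
        self.index: dict[str, dict[str, Any]] = {}

    def load(self) -> None:
        entries = json.loads((self._root / "components.index.json").read_text())
        self.index = {entry["id"]: entry for entry in entries}

    def _read_body(self, component_id: str) -> dict[str, Any]:
        path = self._root / "components" / f"{component_id}.json"
        return json.loads(path.read_text())

    async def get_body(self, component_id: str) -> dict[str, Any]:
        # only ids present in the index are ever turned into file paths
        if component_id not in self.index:
            raise KeyError(component_id)
        if component_id in self._bodies:
            self._bodies.move_to_end(component_id)  # mark as most recent
            return self._bodies[component_id]
        # blocking file read runs in a worker thread, off the event loop
        body = await asyncio.to_thread(self._read_body, component_id)
        self._bodies[component_id] = body
        self._bodies.move_to_end(component_id)
        while len(self._bodies) > self._maxsize:
            self._bodies.popitem(last=False)  # evict least recently used
        return body


def build_demo_tree(root: Path, count: int = 3) -> None:
    """Write a tiny index + per-component body files for the demo."""
    (root / "components").mkdir()
    index = []
    for i in range(count):
        cid = f"comp{i}"
        index.append({"id": cid, "name": cid.upper()})
        body = {"id": cid, "config_entries": [{"key": "pin"}]}
        (root / "components" / f"{cid}.json").write_text(json.dumps(body))
    (root / "components.index.json").write_text(json.dumps(index))


async def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_tree(root)
        catalog = LazyBodyCatalog(root, maxsize=2)
        catalog.load()
        for cid in ("comp0", "comp1", "comp2"):
            await catalog.get_body(cid)
        return list(catalog._bodies)  # ids currently held in the LRU


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

two concurrent first reads of the same id can both hit the disk in this sketch; the result is identical, so the duplicate read is tolerated rather than locked against.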
PR 4 — automations
same split shape as components, but the access pattern differs: triggers_for_domains / actions_for_domains / conditions_for_domains walk the whole catalog every request to partition core entries first. the slim index must therefore keep id + domain + applies_to + is_device_level so domain filtering stays index-only. bodies (the config_entries and option schemas) go behind the LRU.
the automations module currently loads at module-import time via @functools.cache (a global) — move that ownership onto an AutomationsCatalog controller object that mirrors ComponentCatalog's shape and is owned by DeviceBuilder.automations.
estimated saving: comparable to PR 3 (~15MB raw → ~3-5MB index + LRU).
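to illustrate why those four fields are enough for the per-request walk, a small sketch of index-only domain partitioning. AutomationIndexEntry, partition_for_domains and the sample ids are hypothetical; the type and meaning of applies_to and is_device_level are not specified in these notes, so they are carried through untouched here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomationIndexEntry:
    id: str
    domain: str
    applies_to: tuple[str, ...]  # assumed type, not confirmed by the notes
    is_device_level: bool


def partition_for_domains(
    index: list[AutomationIndexEntry], domains: list[str]
) -> list[AutomationIndexEntry]:
    """Return core entries first, then entries for the requested domains."""
    wanted = set(domains) - {"core"}
    core = [entry for entry in index if entry.domain == "core"]
    scoped = [entry for entry in index if entry.domain in wanted]
    return core + scoped


# made-up sample entries, only for the demonstration
INDEX = [
    AutomationIndexEntry("light.turn_on", "light", ("action",), False),
    AutomationIndexEntry("delay", "core", ("action",), False),
    AutomationIndexEntry("sensor.on_value", "sensor", ("trigger",), False),
]

if __name__ == "__main__":
    print([entry.id for entry in partition_for_domains(INDEX, ["sensor"])])
```

no body file is touched at any point in this walk; bodies would only be fetched once the user picks one of the returned entries.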
PR 5 — firmware-job output: lazy restore
controllers/firmware/persistence.py:69 (load_jobs) hydrates each FirmwareJob including its output: list[str] field. limits today:
_MAX_OUTPUT_LINES_RETAINED = 2000 per job
_MAX_PRIMARY_TERMINAL_JOBS = 50, _MAX_AUX_TERMINAL_JOBS = 5
worst case: 55 × 2000 × ~150 bytes ≈ 15MB of build logs resident the user mostly never looks at, plus secondary churn: persist_jobs rewrites the whole jobs dict (outputs included) on every persist call.
approach:
split persistence: keep job metadata (everything but output) in the existing metadata_transaction blob; sidecar each job's output to a per-job log file under ext_storage_path("dashboard-jobs")/<job_id>.log. resolve through ext_storage_path, never reconstruct paths — per the deployment-modes invariant.
load_jobs restores metadata only; FirmwareJob.output starts empty.
new firmware/get_job_output WS command (or flag on existing firmware/get_job) reads the sidecar file when the frontend opens a job-detail view.
live jobs append both to the in-memory output list (for subscribe_events follower frames) AND to the sidecar log file.
pruning deletes the corresponding sidecar log when a job falls off history.
estimated saving: ~15MB at idle. bonus: persist write amplification drops from "rewrite all 55 outputs every line trim" to "append to one sidecar".
most localized of the lazy-load changes — small blast radius, single controller, no wire-shape change beyond an additive firmware/get_job_output command. could land in parallel with PR 3.
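the sidecar mechanics could be sketched as below. JobOutputStore and its method names are hypothetical, the directory is passed in rather than resolved through ext_storage_path, and job ids are assumed to be safe as file names. whether the sidecar itself is ever trimmed to the 2000-line cap is not stated above, so this sketch leaves it append-only.

```python
import tempfile
from pathlib import Path


class JobOutputStore:
    """Live lines stay in memory; every line is also appended to a sidecar."""

    def __init__(self, jobs_dir: Path, max_live_lines: int = 2000) -> None:
        self._dir = jobs_dir
        self._dir.mkdir(parents=True, exist_ok=True)
        self._max = max_live_lines
        self._live: dict[str, list[str]] = {}

    def _sidecar(self, job_id: str) -> Path:
        return self._dir / f"{job_id}.log"

    def ingest_line(self, job_id: str, line: str) -> None:
        live = self._live.setdefault(job_id, [])
        live.append(line)
        del live[: -self._max]  # keep only the newest lines in memory
        # reopening per line keeps the sketch simple; a real job would
        # probably hold the handle open for its lifetime
        with self._sidecar(job_id).open("a", encoding="utf-8") as handle:
            handle.write(line.rstrip("\n") + "\n")

    def live_output(self, job_id: str) -> list[str]:
        """In-memory lines for live followers; empty after a restart."""
        return list(self._live.get(job_id, []))

    def read_output(self, job_id: str) -> list[str]:
        """Lazy read of the full log, e.g. when a job-detail view opens."""
        path = self._sidecar(job_id)
        if not path.exists():
            return []
        return path.read_text(encoding="utf-8").splitlines()

    def prune(self, job_id: str) -> None:
        """Drop the job's live lines and delete its sidecar."""
        self._live.pop(job_id, None)
        self._sidecar(job_id).unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = JobOutputStore(Path(tmp) / "dashboard-jobs", max_live_lines=2)
        for text in ("configure", "compile", "link"):
            store.ingest_line("job1", text)
        print(store.live_output("job1"), store.read_output("job1"))
```

a fresh JobOutputStore pointed at the same directory starts with no live lines, which mirrors load_jobs restoring metadata only.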
boards untouched
3MB raw is below the threshold; complicating definitions/__init__.py for a 1-2MB save isn't worth it.
risks / gotchas
atomic regen of the catalog tree. emit to a sibling directory then os.replace(). a torn write would leave the runtime resolving fresh body files against a stale index (or vice versa).
wheel bundling. ~900 small files compress ~5-10% less efficiently than one packed JSON. acceptable.
blockbuster. body reads must go through asyncio.to_thread or blockbuster will complain about sync I/O on the event loop in tests.
featured-component first-touch. each board's featured_components references underlying components by id. with lazy bodies, opening a featured card pays one disk read the first time. not worth pre-warming — the LRU absorbs subsequent hits.
frontend coordination. get_components slim shape needs the frontend repo PR landing in the same release window. PR 3's description must include the matching frontend PR link before merge.
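for the atomic-regen risk, one way the directory swap could be sketched. os.replace() raises OSError if the destination is a non-empty directory, so this version renames the old tree aside before moving the new one in; that leaves a brief window where the live path is missing, which a symlink flip would avoid. publish_tree and the .next / .old suffixes are assumptions for illustration, not the sync script's actual code.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def publish_tree(live: Path, build: Callable[[Path], None]) -> None:
    """Build a fresh tree next to `live`, then swap it into place."""
    staging = live.with_name(live.name + ".next")
    retired = live.with_name(live.name + ".old")
    # clear leftovers from an interrupted earlier run
    for leftover in (staging, retired):
        shutil.rmtree(leftover, ignore_errors=True)
    staging.mkdir(parents=True)
    build(staging)  # an interruption here leaves `live` untouched
    if live.exists():
        os.replace(live, retired)  # step 1: move the old tree aside
    os.replace(staging, live)  # step 2: move the new tree in
    shutil.rmtree(retired, ignore_errors=True)


def write_body(name: str, text: str) -> Callable[[Path], None]:
    """Tiny build callback for the demo: writes one file into the tree."""
    def build(target: Path) -> None:
        (target / name).write_text(text)
    return build


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "components"
        publish_tree(live, write_body("a.json", "1"))
        publish_tree(live, write_body("b.json", "2"))
        print(sorted(path.name for path in live.iterdir()))
```

the index file would still be written last, after the swap, so a reader never sees a new index pointing at an old tree.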
verification
unit tests: pytest tests/test_components* round-trips against the mock catalog tree; assert get_component / get_components / add_component / resolve_default_components shapes are unchanged end-to-end (modulo the deliberately removed config_entries on get_components list responses).
memory benchmark (PR 1): pytest tests/benchmarks/test_catalog_memory.py asserts resident bytes after load() are under the new ceiling; PR 3 / PR 4 each tighten that ceiling.
runtime benchmark: pytest --codspeed tests/benchmarks/test_startup.py shows ComponentCatalog.load() faster end-to-end (smaller parse). add a per-body decode benchmark to track the new lazy path.
end-to-end smoke: run the dashboard against a real config dir, open a few component detail views, confirm responses are shape-identical against main for the detail path and the list path drops only config_entries. watch ps -o rss on the process — target idle resident: under 600MB after PR 3, under 500MB after PR 4.
issue #922 ([Bug] Crash loop when compiling firmware) repro: ideally the reporter (or CI on a memory-capped runner) confirms the compile-loop no longer SIGKILLs on a 2GB VM once PR 3 + PR 4 ship.