
cut static memory footprint via lazy-loading catalogs + firmware logs #934

@marcelveldt

follow-up to #922 — an idle dashboard sits at ~700MB before a compile even runs. on the 2GB HA-addon VM in that issue there's no headroom for gcc.

three big eager loads contribute (sizes on disk):

  • definitions/components.json — 22MB, ~14k mashumaro instances after parse
  • definitions/automations.json — 15MB
  • firmware-job persistence — restores up to 55 historical jobs × 2000 lines of build output (~15MB worst case) at startup

most of this is wasted RAM: list/search endpoints don't need config_entries, users mostly open one component card at a time, and old build-logs are rarely re-opened.

approach — slim-index + lazy-body, keeping orjson + mashumaro on the fast path:

  • catalogs split build-time into an index.json (id / name / description / category / etc.) loaded eagerly, plus per-entry body files loaded on demand into a small LRU (~64 entries)
  • firmware logs split into job-metadata (in the existing metadata blob) + a per-job sidecar log file under ext_storage_path("dashboard-jobs")/<job_id>.log. FirmwareJob.output populates lazily when the frontend opens a job detail view

ruled out:

  • sqlite — read-only catalog regenerated by sync script, no relational queries we can't do in python, would lose the mashumaro fast path
  • __slots__ + drop DataClassORJSONMixin — would save ~4MB pre-lazy-body but only ~400KB once the LRU is bounded. not worth losing schema consistency, mashumaro's wire-codegen, and gaining a hand-rolled to_dict() per leaf model that has to stay in lockstep with field additions.

estimated saving — ~50-80MB resident off the 700MB baseline. headroom for #922's 2GB VM to stop sigkilling during compiles.

checklist

  • PR 1 — measurement scaffolding: tracemalloc-based memory benchmark in tests/benchmarks/test_catalog_memory.py covering all three catalog loaders + the firmware job restore path. regression gate, no production change.
  • PR 2 — sys.intern() on closed-vocabulary strings in _load_component / _load_config_entry (category / platform_type / supported_platforms / references_component). free win, no wire change. estimated 3-8MB.
  • PR 3 — components: slim index + lazy bodies. build-time: sync_components.py emits definitions/components.index.json + definitions/components/<id>.json. runtime: ComponentCatalog.load() parses index only, get_body(id) reads body on demand through bounded LRU. _FeaturedRecord stores underlying_id: str. get_components wire shape drops config_entries — coordinate with frontend repo PR same release window.
  • PR 4 — automations: same split shape as components. slim index must keep id + domain + applies_to + is_device_level because triggers_for_domains / actions_for_domains partition by domain on every request. refactor @functools.cache module-global to an AutomationsCatalog controller owned by DeviceBuilder.automations.
  • PR 5 — firmware logs: split FirmwareJob.output to per-job sidecar file. load_jobs restores metadata only. _ingest_output_line appends to in-memory list (for live followers) AND to the sidecar. persist_jobs stops rewriting outputs — sidecar is append-only. new firmware/get_job_output WS command (or flag on existing firmware/get_job).

boards (3MB) stays untouched — below the threshold for the complexity cost.

a diagnostic helper for separately-reported runtime memory growth landed in #935 (debug/memory_snapshot WS command) so future leak reports can carry tracemalloc diffs.

full design notes

Context

an idle device-builder instance currently sits at ~700MB resident — before a single firmware compile launches. the new dashboard ships three eagerly-loaded read-only catalogs that the old esphome dashboard did not carry in memory:

| file | disk | resident objects | loader |
| --- | --- | --- | --- |
| definitions/components.json | 22MB | ~904 ComponentCatalogEntry + ~13k ConfigEntry (no __slots__) | controllers/components.py:86 (ComponentCatalog.load()) |
| definitions/automations.json | 15MB | triggers / actions / conditions / light_effects | controllers/automations/catalog.py:32 (@functools.cache at module-import) |
| definitions/boards.json | 3MB | 490 boards | BoardCatalog.load() |

disk → python object explosion for catalogs of this shape is typically 3-5×: each JSON string becomes a str object (~50 bytes overhead), every list/dict gets PyObject headers, and the ~14k mashumaro dataclass instances each carry a ~280-byte __dict__. that's consistent with the 700MB baseline once you add esphome-library imports (esphome.config, esphome.codegen, the platform-specific component modules) which the dashboard's validation paths pull in at startup.

issue #922 reports SIGKILL on a 2GB HA-addon VM the moment a compile launches — gcc/g++ on its own can claim 300-500MB per invocation, and there's no headroom because the dashboard's permanent budget already consumed 700MB.

today's access pattern already does most of the work for us:

  • components/get_components (paginated list / search) → list view does not need config_entries
  • components/get_categories, components/get_integration_docs → only id + a couple of flat fields
  • components/get_component (detail view) → needs the full entry
  • add_component, resolve_default_components → full entry for one component at a time
  • catalogs are never pushed in subscribe_events.initial_state

so the bulk of every dashboard load is paying for ~13k ConfigEntry trees, while the typical session only opens a handful of components.

why not the alternatives

  • sqlite: read-only catalog regenerated by a script, no relational queries, no FTS we can't do cheaper in python. adds wheel-bundling complexity (binary blob), gives up mashumaro type validation, costs a cursor + row→dataclass adapter on every read. the only argument is "single file, atomic regen" and the slim index gives us that for free.
  • __slots__ + drop DataClassORJSONMixin: @dataclass(slots=True) on ConfigEntry only saves memory if we also drop the mixin (a non-slotted base silently re-adds __dict__ via MRO). pre-lazy-body that would have saved ~4MB across ~14k resident instances, which was worth the cost. post-lazy-body only ~64 bodies live in the LRU at a time, so the win collapses to ~400KB. not worth losing schema consistency and mashumaro's wire-codegen, or gaining a hand-rolled to_dict() per leaf model that has to stay in lockstep with field additions. worth revisiting if mashumaro upstream ever gets a slot-friendly mixin shape.
  • single packed file + offset index (mmap + seek): one file instead of ~900 saves a bit of inode overhead but pushes complexity into the sync script (compute / verify offsets) and adds a torn-write failure mode the per-entry shape sidesteps. per-entry files have precedent in definitions/boards/<id>/manifest.yaml (493 dirs).

wire shape

  • components/get_components (list) returns slim entries — no config_entries. frontend coordination required: the frontend repo PR drops any list-view reads of config_entries. same release window.
  • components/get_component (detail) keeps the full shape — the form renderer relies on config_entries here.

per-PR detail

PR 1 — measurement scaffolding

regression gate so future PRs are measured against a baseline:

  • new tests/benchmarks/test_catalog_memory.py with a tracemalloc snapshot around ComponentCatalog.load(), automations.catalog.load_catalog(), BoardCatalog.load(), and the firmware job restore path.
  • snapshot per-loader resident bytes; assert against a generous ceiling so a regression on main after a sync_components run surfaces immediately.
  • no production code change.
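a rough sketch of what the per-loader measurement could look like. measure_allocated_bytes, fake_catalog_loader and CEILING_BYTES are hypothetical names for illustration, and the synthetic loader only stands in for the real ComponentCatalog.load(). note that tracemalloc counts python-level allocations, which is related to but not the same number as process RSS.

```python
import json
import tracemalloc
from typing import Callable


def measure_allocated_bytes(loader: Callable[[], object]) -> tuple[object, int]:
    """Run loader() and return (result, net bytes still allocated afterwards)."""
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        result = loader()  # keep the reference so the loaded objects stay alive
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, after - before


def fake_catalog_loader() -> list[dict]:
    # Hypothetical stand-in for a real catalog loader: parse a synthetic JSON blob.
    raw = json.dumps(
        [{"id": f"comp_{i}", "config_entries": list(range(50))} for i in range(1000)]
    )
    return json.loads(raw)


CEILING_BYTES = 50 * 1024 * 1024  # generous ceiling; later PRs would tighten it

catalog, used = measure_allocated_bytes(fake_catalog_loader)
assert 0 < used < CEILING_BYTES, f"catalog load retained {used} bytes"
```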

PR 2 — sys.intern() on closed-vocabulary strings

free win, no API change, low risk — lands while PR 3 is in review.

  • in _load_component / _load_config_entry, intern category, platform_type, supported_platforms members, references_component. closed vocabularies are ~20 categories × ~10 platforms × ~30 entry types — currently duplicated across 13k ConfigEntry instances.
  • estimated saving: 3-8MB. no wire change.
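a minimal sketch of the interning idea, using plain dicts and stdlib json instead of the real mashumaro models. intern_fields and the sample rows are hypothetical; the real change would live inside _load_component / _load_config_entry.

```python
import json
import sys

CLOSED_VOCAB_FIELDS = ("category", "platform_type", "supported_platforms", "references_component")


def intern_fields(row: dict, fields: tuple[str, ...] = CLOSED_VOCAB_FIELDS) -> dict:
    """Replace closed-vocabulary string values with their interned copies, in place."""
    for name in fields:
        value = row.get(name)
        if type(value) is str:
            row[name] = sys.intern(value)
        elif isinstance(value, list):
            row[name] = [sys.intern(v) if type(v) is str else v for v in value]
    return row


raw = json.dumps(
    [{"category": "sensor", "supported_platforms": ["esp32", "esp8266"]} for _ in range(3)]
)
rows = [intern_fields(row) for row in json.loads(raw)]

# After interning, equal strings across rows are one shared object.
print(rows[0]["category"] is rows[2]["category"])  # True
```

interning only pays off for values that repeat many times, which is why it is limited to the closed vocabularies and not applied to names or descriptions.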

PR 3 — components: slim-index + bodies on disk

build-time changes in script/sync_components.py:

  • emit definitions/components.index.json: a list of ComponentCatalogIndexEntry carrying id, name, description, category, docs_url, image_url, dependencies, multi_conf, supported_platforms. that is every field referenced by the get_components filter, get_categories, get_integration_docs, _categories_for_board or the featured-registry build path.
  • emit definitions/components/<id>.json per component: the full shape including config_entries.
  • atomic regen: write the new tree to definitions/components.next/ then os.replace() the directory + write a single components.index.json last. a Ctrl-C mid-regen must not leave a torn catalog. validate via a manifest hash carried in the index header.
  • update pyproject.toml's tool.setuptools.package-data glob.
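one possible shape for the regen step, as a sketch only. publish_catalog, the components.old directory and the index layout (a manifest_hash header plus an entries list) are assumptions for illustration. one detail worth knowing: os.replace() cannot overwrite a non-empty directory, so this sketch moves the live tree aside before renaming the staged one into place, which leaves a short window where no components/ directory exists.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path


def publish_catalog(root: Path, components: dict[str, dict]) -> str:
    """Stage body files, swap the tree into place, then write the index last."""
    live = root / "components"
    staging = root / "components.next"
    retired = root / "components.old"  # hypothetical parking spot for the old tree

    for leftover in (staging, retired):
        shutil.rmtree(leftover, ignore_errors=True)
    staging.mkdir(parents=True)

    digest = hashlib.sha256()
    entries = []
    for comp_id in sorted(components):
        body = json.dumps(components[comp_id], sort_keys=True).encode()
        (staging / f"{comp_id}.json").write_bytes(body)
        digest.update(comp_id.encode())
        digest.update(body)
        entries.append({"id": comp_id, "name": components[comp_id].get("name", comp_id)})

    # Move the old tree aside first: os.replace() fails on a non-empty target dir.
    if live.exists():
        os.replace(live, retired)
    os.replace(staging, live)
    shutil.rmtree(retired, ignore_errors=True)

    manifest_hash = digest.hexdigest()
    tmp_index = root / "components.index.json.tmp"
    tmp_index.write_text(json.dumps({"manifest_hash": manifest_hash, "entries": entries}))
    os.replace(tmp_index, root / "components.index.json")  # single-file swap, written last
    return manifest_hash


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = publish_catalog(root, {"wifi": {"name": "WiFi"}})
    second = publish_catalog(root, {"wifi": {"name": "WiFi"}, "api": {"name": "API"}})
    body_files = sorted(p.name for p in (root / "components").iterdir())
    index = json.loads((root / "components.index.json").read_text())
```

a runtime that recomputes the hash over the body files could detect an index that does not match the tree, at the cost of reading every body once; whether that check is worth running at startup is left open here.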

runtime changes in controllers/components.py:

  • new model: ComponentCatalogIndexEntry (slim shape) in models/components.py.
  • ComponentCatalog.load() parses components.index.json only.
  • ComponentCatalog.get_body(id) -> ComponentCatalogEntry reads components/<id>.json on demand, hydrates with mashumaro, returns through a bounded LRU (maxsize=64).
  • body reads hop to a thread (asyncio.to_thread) to keep blockbuster happy and stay off the event loop.
  • _build_featured_registry() becomes index-only: _FeaturedRecord stores underlying_id: str, not the body. bodies are fetched at _materialise_featured time. most invasive bit of this PR — flag explicitly in PR description.
  • WS surface: get_components returns the slim type. get_component / add_component / resolve_default_components go through get_body. get_categories / get_integration_docs stay on the index.
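a minimal sketch of the index-only load plus get_body path described above. it uses stdlib json and plain dicts where the real code would keep orjson + mashumaro, and LazyBodyCatalog, cached_ids and the index layout (an entries list) are hypothetical. the LRU is a small OrderedDict so the eviction order is visible.

```python
import asyncio
import json
import tempfile
from collections import OrderedDict
from pathlib import Path


class LazyBodyCatalog:
    def __init__(self, root: Path, maxsize: int = 64) -> None:
        self._root = root
        self._maxsize = maxsize
        self._bodies: OrderedDict[str, dict] = OrderedDict()
        # Eager part: only the slim index is parsed at load time.
        raw_index = json.loads((root / "components.index.json").read_text())
        self.index = {entry["id"]: entry for entry in raw_index["entries"]}

    def _read_body(self, comp_id: str) -> dict:
        return json.loads((self._root / "components" / f"{comp_id}.json").read_text())

    async def get_body(self, comp_id: str) -> dict:
        if comp_id not in self.index:  # also keeps unknown ids off the filesystem
            raise KeyError(comp_id)
        if comp_id in self._bodies:
            self._bodies.move_to_end(comp_id)  # mark as most recently used
            return self._bodies[comp_id]
        # Blocking file read runs in a worker thread, off the event loop.
        body = await asyncio.to_thread(self._read_body, comp_id)
        self._bodies[comp_id] = body
        while len(self._bodies) > self._maxsize:
            self._bodies.popitem(last=False)  # evict least recently used
        return body

    def cached_ids(self) -> list[str]:
        return list(self._bodies)


async def _fetch_all(catalog: LazyBodyCatalog, ids: list[str]) -> None:
    for comp_id in ids:
        await catalog.get_body(comp_id)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "components").mkdir()
    ids = ["wifi", "logger", "api"]
    for comp_id in ids:
        body = {"id": comp_id, "config_entries": []}
        (root / "components" / f"{comp_id}.json").write_text(json.dumps(body))
    (root / "components.index.json").write_text(json.dumps({"entries": [{"id": i} for i in ids]}))

    catalog = LazyBodyCatalog(root, maxsize=2)
    asyncio.run(_fetch_all(catalog, ids))
    cached = catalog.cached_ids()
    print(cached)  # ['logger', 'api']
```

two concurrent first-touch requests for the same id would each read the file once in this sketch; that is harmless for read-only bodies but could be deduplicated with a per-id lock if it ever matters.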

test fixture work: tests/conftest.py materialises a tiny mock components/ directory at test setup (a couple of fixture components + an index file) rather than stubbing _COMPONENTS_JSON.

PR 4 — automations

same split shape as components, but the access pattern differs: triggers_for_domains / actions_for_domains / conditions_for_domains walk the whole catalog every request to partition core entries first. the slim index must therefore keep id + domain + applies_to + is_device_level so domain filtering stays index-only. bodies (the config_entries and option schemas) go behind the LRU.
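to illustrate why those four fields have to stay in the index, here is a sketch of a domain filter that never touches a body. AutomationIndexEntry, entries_for_domains and the sample data are hypothetical, and the sketch only filters on domain; how applies_to and is_device_level feed into the real partitioning is not spelled out here, so they are carried but not interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomationIndexEntry:
    # Slim index shape: just enough to filter without loading a body.
    id: str
    domain: str
    applies_to: tuple[str, ...]
    is_device_level: bool


def entries_for_domains(
    index: list[AutomationIndexEntry], domains: set[str]
) -> list[AutomationIndexEntry]:
    """Walk the index only; bodies stay on disk until a caller asks for one."""
    return [entry for entry in index if entry.domain in domains]


index = [
    AutomationIndexEntry("light.turn_on", "light", ("light",), False),
    AutomationIndexEntry("switch.toggle", "switch", ("switch",), False),
    AutomationIndexEntry("delay", "core", (), True),
]

matches = entries_for_domains(index, {"light", "core"})
print([entry.id for entry in matches])  # ['light.turn_on', 'delay']
```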

the automations module currently loads at module-import time via @functools.cache (a global) — move that ownership onto an AutomationsCatalog controller object that mirrors ComponentCatalog's shape and is owned by DeviceBuilder.automations.

estimated saving: comparable to PR 3 (~15MB raw → ~3-5MB index + LRU).

PR 5 — firmware-job output: lazy restore

controllers/firmware/persistence.py:69 (load_jobs) hydrates each FirmwareJob including its output: list[str] field. limits today:

  • _MAX_OUTPUT_LINES_RETAINED = 2000 per job
  • _MAX_PRIMARY_TERMINAL_JOBS = 50, _MAX_AUX_TERMINAL_JOBS = 5

worst case: 55 × 2000 × ~150 bytes ≈ 15MB of build logs resident the user mostly never looks at, plus secondary churn: persist_jobs rewrites the whole jobs dict (outputs included) on every persist call.
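the worst-case figure can be checked with the limits listed above. the ~150 bytes per line is the rough average quoted here, not a measurement, and the total covers raw text only, without per-str object overhead.

```python
jobs = 50 + 5            # _MAX_PRIMARY_TERMINAL_JOBS + _MAX_AUX_TERMINAL_JOBS
lines_per_job = 2000     # _MAX_OUTPUT_LINES_RETAINED
avg_line_bytes = 150     # rough average, assumed

total_bytes = jobs * lines_per_job * avg_line_bytes
print(total_bytes)                    # 16500000
print(round(total_bytes / 2**20, 1))  # 15.7 (MiB)
```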

approach:

  • split persistence: keep job metadata (everything but output) in the existing metadata_transaction blob; sidecar each job's output to a per-job log file under ext_storage_path("dashboard-jobs")/<job_id>.log. resolve through ext_storage_path, never reconstruct paths — per the deployment-modes invariant.
  • load_jobs restores metadata only; FirmwareJob.output starts empty.
  • new firmware/get_job_output WS command (or flag on existing firmware/get_job) reads the sidecar file when the frontend opens a job-detail view.
  • live jobs append both to the in-memory output list (for subscribe_events follower frames) AND to the sidecar log file.
  • pruning deletes the corresponding sidecar log when a job falls off history.
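the approach above could be sketched roughly as follows. JobLogStore and its method names are hypothetical, the log directory is passed in rather than resolved through ext_storage_path, and job ids are assumed to be safe filename tokens. whether the sidecar itself should be capped at 2000 lines is not decided here; this sketch only caps the in-memory buffer.

```python
import tempfile
from collections import deque
from pathlib import Path


class JobLogStore:
    def __init__(self, log_dir: Path, max_live_lines: int = 2000) -> None:
        self._log_dir = log_dir
        self._max_live_lines = max_live_lines
        self._live: dict[str, deque[str]] = {}  # only jobs seen since startup
        log_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, job_id: str) -> Path:
        return self._log_dir / f"{job_id}.log"

    def ingest_line(self, job_id: str, line: str) -> None:
        # In-memory buffer serves live followers; the sidecar is append-only.
        buffer = self._live.setdefault(job_id, deque(maxlen=self._max_live_lines))
        buffer.append(line)
        with self._path(job_id).open("a", encoding="utf-8") as fh:
            fh.write(line.rstrip("\n") + "\n")

    def live_lines(self, job_id: str) -> list[str]:
        return list(self._live.get(job_id, ()))

    def read_output(self, job_id: str) -> list[str]:
        # Lazy path: read the sidecar only when a job-detail view asks for it.
        path = self._path(job_id)
        if not path.exists():
            return []
        return path.read_text(encoding="utf-8").splitlines()

    def prune(self, job_id: str) -> None:
        self._live.pop(job_id, None)
        self._path(job_id).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    store = JobLogStore(Path(tmp) / "dashboard-jobs", max_live_lines=2)
    for line in ("configure", "compile", "link"):
        store.ingest_line("job-1", line)
    live = store.live_lines("job-1")      # ['compile', 'link']
    full = store.read_output("job-1")     # ['configure', 'compile', 'link']
    store.prune("job-1")
    after_prune = store.read_output("job-1")  # []
```

in the real controller the file append and read would also need to stay off the event loop, the same way body reads do.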

estimated saving: ~15MB at idle. bonus: persist write amplification drops from "rewrite all 55 outputs every line trim" to "append to one sidecar".

most localized of the lazy-load changes — small blast radius, single controller, no wire-shape change beyond an additive firmware/get_job_output command. could land in parallel with PR 3.

boards untouched

3MB raw is below the threshold; complicating definitions/__init__.py for a 1-2MB save isn't worth it.

risks / gotchas

  • atomic regen of the catalog tree. emit to a sibling directory then os.replace(). a torn write would leave the runtime resolving fresh body files against a stale index (or vice versa).
  • wheel bundling. ~900 small files compress ~5-10% less efficiently than one packed JSON. acceptable.
  • blockbuster. body reads must go through asyncio.to_thread or blockbuster will complain about sync I/O on the event loop in tests.
  • featured-component first-touch. each board's featured_components references underlying components by id. with lazy bodies, opening a featured card pays one disk read the first time. not worth pre-warming — the LRU absorbs subsequent hits.
  • frontend coordination. get_components slim shape needs the frontend repo PR landing in the same release window. PR 3's description must include the matching frontend PR link before merge.

verification

  1. unit tests: pytest tests/test_components* round-trips against the mock catalog tree; assert get_component / get_components / add_component / resolve_default_components shapes are unchanged end-to-end (modulo the deliberately removed config_entries on get_components list responses).
  2. memory benchmark (PR 1): pytest tests/benchmarks/test_catalog_memory.py asserts resident bytes after load() are under the new ceiling; PR 3 / PR 4 each tighten that ceiling.
  3. runtime benchmark: pytest --codspeed tests/benchmarks/test_startup.py shows ComponentCatalog.load() faster end-to-end (smaller parse). add a per-body decode benchmark to track the new lazy path.
  4. end-to-end smoke: run the dashboard against a real config dir, open a few component detail views, confirm responses are shape-identical against main for the detail path and the list path drops only config_entries. watch ps -o rss on the process — target idle resident: under 600MB after PR 3, under 500MB after PR 4.
  5. issue #922 ([Bug] Crash loop when compiling firmware) repro: ideally the reporter (or CI on a memory-capped runner) confirms the compile-loop no longer SIGKILLs on a 2GB VM once PR 3 + PR 4 ship.
