Add 'Recompute Bonds' extension and worker resource limits for large systems #909

@PythonFZ

Description

Problem

When a frame has more than 100 atoms (or 1000 on the client side), bond computation is skipped for performance reasons. This is the right default, but once the user has loaded a large system, there is no way to trigger bond recomputation from the UI — bonds are simply absent.

Additionally, recomputing bonds (neighbor lists via vesin) on millions of atoms can be very expensive. Running this on an internal worker risks unbounded memory usage and long execution times that could kill the worker process.

Current thresholds

| Location | Threshold (atoms) | File |
| --- | --- | --- |
| Trajectory upload | 100 | src/zndraw/routes/trajectory.py:49 |
| enrich_raw_frame | 100 | src/zndraw/enrichment.py:44 |
| Frame source provider | 100 | src/zndraw/providers/frame_source.py:71 |
| Client serialization | 1000 | src/zndraw/client/serialization.py:21 |

Proposal

1. Add a "Recompute Bonds" modifier extension

A simple modifier that calls add_connectivity() on the current frame(s), allowing the user to explicitly trigger bond computation for large systems via the UI.

This mirrors the existing recompute_bonds parameter already available on Wrap and Center modifiers, but as a standalone action.
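A rough sketch of what such a modifier could look like is given below. It does not use the real ZnDraw extension API: `Frame`, `add_connectivity_stub`, the `run` signature and the `max_atoms` field are all hypothetical stand-ins so the example runs on its own. The stub uses a naive O(N^2) distance cutoff in place of the actual neighbor-list computation, and `max_atoms` illustrates the atom-count guard discussed under the resource-limit options.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Frame:
    """Hypothetical stand-in for a frame: atom positions plus an optional bond list."""

    positions: List[Tuple[float, float, float]]
    bonds: Optional[List[Tuple[int, int]]] = None


def add_connectivity_stub(frame: Frame, cutoff: float = 1.2) -> None:
    """Placeholder for add_connectivity(): naive O(N^2) distance cutoff."""
    frame.bonds = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(frame.positions), 2)
        if math.dist(a, b) <= cutoff
    ]


@dataclass
class RecomputeBonds:
    """Sketch of a standalone modifier that recomputes bonds on request."""

    max_atoms: int = 100_000  # assumed default, not a value from the codebase

    def run(self, frames: List[Frame]) -> None:
        for frame in frames:
            n_atoms = len(frame.positions)
            # Refuse before starting the expensive computation.
            if n_atoms > self.max_atoms:
                raise ValueError(
                    f"Frame has {n_atoms} atoms, above the limit of {self.max_atoms}"
                )
            add_connectivity_stub(frame)


# Usage: a water-like geometry (O at the origin, two H atoms about 0.96 away).
water = Frame(positions=[(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)])
RecomputeBonds().run([water])
print(water.bonds)  # [(0, 1), (0, 2)]
```

Because the modifier is triggered explicitly by the user, it would bypass the automatic thresholds listed above; the guard is the only thing standing between a click and a very large computation.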

2. Worker resource limits for expensive tasks

Currently workers only have:

  • Heartbeat timeout: 60 s (ZNDRAW_JOBLIB_WORKER_TIMEOUT_SECONDS)
  • Internal task timeout: 3600 s (ZNDRAW_JOBLIB_INTERNAL_TASK_TIMEOUT_SECONDS)

There are no memory limits. A neighbor list computation on millions of atoms can OOM the worker (and potentially the whole server if running in-process).

Options to consider:

  • Per-task timeout — allow extensions to declare a max execution time (taskiq supports task-level timeouts)
  • Memory limit — harder in-process; easier with external workers (Docker --memory, cgroups). Could also add a soft check via resource.getrusage or psutil and abort if RSS exceeds a threshold
  • Configurable per server — expose ZNDRAW_JOBLIB_TASK_MEMORY_LIMIT_MB and/or per-extension timeout overrides in settings
  • Atom count guard in the extension itself — the "Recompute Bonds" extension could refuse or warn above a configurable atom count to prevent accidental OOM
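The soft memory check from the list above could be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not existing ZnDraw code: `ZNDRAW_JOBLIB_TASK_MEMORY_LIMIT_MB` is the setting name proposed in this issue, and the helper names are made up here. Note that `resource.getrusage` reports peak RSS rather than current RSS, so once the limit has been crossed in a process every later check will also fail; `psutil` would be needed for a current-RSS reading. The `resource` module is only available on Unix.

```python
import os
import resource
import sys
from typing import Mapping, Optional


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def memory_limit_mb(env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Read the proposed (not yet existing) limit setting; None means no limit."""
    raw = env.get("ZNDRAW_JOBLIB_TASK_MEMORY_LIMIT_MB")
    return float(raw) if raw else None


def check_memory(limit_mb: Optional[float]) -> None:
    """Abort the current task if peak RSS is above the configured limit."""
    if limit_mb is None:
        return
    used = peak_rss_mb()
    if used > limit_mb:
        raise MemoryError(
            f"Peak RSS {used:.0f} MiB exceeds the limit of {limit_mb:.0f} MiB"
        )


# Usage: call between chunks of work inside an expensive task.
check_memory(memory_limit_mb())
```

A check like this is cooperative: it only fires where the task calls it, so a single large allocation inside a native library can still exceed the limit before the next check. Hard limits (Docker `--memory`, cgroups) remain the more reliable option for external workers.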

Tasks

  • Add RecomputeBonds modifier extension
  • Investigate taskiq per-task timeout / memory limit options
  • Add configurable resource limits to worker settings
  • Add atom-count guard or warning to expensive extensions
