aneoconsulting · AncientPatata · Apr 26, 2026
diff --git a/.docs/conf.py b/.docs/conf.py
@@ -51,9 +51,9 @@
 
 # -- Options to show "Edit on GitHub" button ---------------------------------
 html_context = {
-    "display_github": True,  # Integrate GitHub
-    "github_user": "aneoconsulting",  # Username
-    "github_repo": "ArmoniK.Api",  # Repo name
-    "github_version": "main",  # Version
-    "conf_py_path": "/.docs/",  # Path in the checkout to the docs root
+    "display_github": True,
+    "github_user": "aneoconsulting",
+    "github_repo": "PymoniK",
+    "github_version": "main",
+    "conf_py_path": "/.docs/",
 }
diff --git a/.docs/development/contribution.md b/.docs/development/contribution.md
@@ -1,21 +1,104 @@
 # Contributing
 
-This doesn't differ from our other projects ([Read ArmoniK.CLI's CONTRIBUTING.md](https://github.com/aneoconsulting/ArmoniK.CLI/blob/main/CONTRIBUTING.md)).
-
-## Open Issues
-
-Here's a non-exhaustive list of things that are outright/partially missing from PymoniK that we'd like to see added in: 
-
-- **Unit tests, end-to-end tests** : This project doesn't have any testing associated to it, 0% code coverage. We'd like to change this.
-- **More sophisticated examples** : We'd like to add even more examples and tutorials of common use cases that use ArmoniK under the hood.
-- **PymoniK logger** : The session created/closed/cancelled prints and the different errors should be logged instead of being printed.
-- **Local PymoniK** : Once of the big advantages of the way things have been coded is being able to switch from/to a remote/local context by just removing the `invoke` methods. This could be done even better, by adding a `local=True` flag to PymoniK that makes it so invokes are executed as regular function calls that run locally. The challenge is mainly handling `Pymonik.put`s and the map_invoke. 
-- **Rename `ResultHandle` to `ObjectHandle`** : The naming of `ResultHandle` was choosen because invocations return results, but as it turns out, these results are also then served as inputs. It'd be better for naming (especially since you can put things onto ArmoniK) to rename `ResultHandle` to the more generic `ObjectHandle`.
-- **Cleaner sub-tasking** : Subtasking requires the user to pass in a `delegate=True` flag to invokes, this isn't particularly clean or nice. There must be a better way of doing it. 
-- **Tasks returning multiple results**: As of right now, a task can only return a single result object (even when you return a tuple). There should be support for cases where you'd like to return multiple results from a task and not have them be grouped up into one (multiple smaller objects being passed onto multiple tasks). We think this change would be easier to implement once sub-tasking is in place, since it also involves analyzing what the user will return. There should be pre-execution tests to check if the user is returning a different number of results in different branches of the task and that should result in a failure (invalid task). 
-- **`results.as_completed`** : For more sophisticated and better time-to-execute, we'd like to implement a method for `MultiResultHandle` that allows the user to for instance loop through a `MultiResultHandle` and have the code execute as the result is done/retrieved. Moreover, as a side-effect, having this feature would allow for the usage of `tqdm` to create progress bars which is also really nice. 
-- **Intermediate Objects** : Support being able to create/download ArmoniK Objects (Intermediate results) within tasks. The download part would require `GetDirectData` in the Python API.   
-- **Remote to local error propagation** : Supply an additional "error name" to created tasks, when a result creation fails, we create this result, locally we can retrieve the remote stack trace using `my_result.error()` if we try-except, either that are we enrich the local result failure grpc exception with the remote one. 
-- **Test JIT-ing** : jitting tasks using namba/taichi if they're pure for additional performance. (Should just test if it'd work as intended..)
-- **More sophisticated result deserialization**: Right now it's just first-level depickling, it'd be nice to be able to pass in a dict that has ResultHandle values for example and be able to dynamically fetch those but that'd add a lot of complexity to data dependencies that'd need to be handled. There has to be a nice way of doing this and it's worth exploring
-- **PymoniK Visualizer**: With the current implementation of invoke/map_invoke, we can make it so you can surround your PymoniK context with a Grapher context that dynamically builds up a visualization of your task graph that you can then save and look at/analyze/vizualize later.  
+Thank you for considering a contribution. PymoniK is a small,
+opinionated SDK and we'd like to keep it that way — but there's
+plenty to do, and outside perspectives are welcome.
+
+For repo-wide conventions, see ANEO's
+[contribution guidelines on ArmoniK.CLI](https://github.com/aneoconsulting/ArmoniK.CLI/blob/main/CONTRIBUTING.md);
+PymoniK follows the same shape.
+
+## Before you start
+
+- For non-trivial changes, open an issue to discuss the approach
+  before writing code. Saves rounds.
+
+## What we'd particularly like help with
+
+Roughly in priority order — none of these are claimed; happy to
+talk through any of them.
+
+### Production-readiness
+
+- **`pymonik image build` CLI.** Read the user's `pyproject.toml`,
+  render a Dockerfile from a template, run `docker build`, print the
+  tag. The most-asked-for missing piece.
+- **OIDC / bearer-token credentials.** The `Credentials` class only
+  handles mTLS. Ship a `BearerCredentials(token_provider=...)` that
+  plugs into the gRPC channel via the metadata callback.
+- **`pymonik doctor` CLI.** Hits the cluster's `Versions` and
+  `Health` services, reports cluster compatibility with the local
+  pymonik version, surfaces obvious misconfigs (events stream
+  reachable, partition exists, AKCONFIG sane).
+
+### Observability
+
+- **Wire OTel into ArmoniK upstream** so cluster-side spans (polling
+  agent, control plane, agent sidecar) chain into PymoniK's. The W3C
+  trace context already propagates; the cluster just needs to emit
+  spans under it. This is an upstream contribution, not a PymoniK PR.
+- **Notebook display hooks.** `Future.__repr__` rendering a progress
+  bar in Jupyter; `FutureList` showing a per-task heatmap.
+
+### Performance
+
+- **Cross-session blob reuse.** Use ArmoniK's `Results.import_data` to
+  bind a fresh result id to data already uploaded in a prior session,
+  driven by a local `~/.cache/pymonik/blobs/` hash-to-opaque-id index.
+- **Per-session warm subprocess** for `deps=` + `isolate=True`. Spawn
+  one child Python at session-open time, feed tasks through a Unix
+  socket. Drops per-task startup from ~500 ms to ~1 ms while
+  preserving subprocess isolation.
+
+### Async core
+
+- Drop the threading completion loop, port the events stream to
+  `grpc.aio`, unify `Future` on a single `anyio.Event`. The threading
+  bridge in `Future` is the largest piece of accidental complexity in
+  the codebase.
+
+### Tests and examples
+
+- More end-to-end tests against a `testcontainers`-spun ArmoniK.
+- `hypothesis` round-trip tests for the envelope and refs.
+
+### Documentation
+
+- This doc tree is a fresh rewrite; it'll have rough edges. Reading
+  through any of the guides and filing an issue (or PR draft) for
+  things that confused you is genuinely valuable.
+- Worked examples for fault tolerance — show what happens when a
+  worker pod gets evicted mid-task, and how `retries=` covers it.
+
+## Small but appreciated
+
+- Typo fixes, dead-link fixes, doctest fixes.
+- Ruff / pyright cleanups in `_internal/`.
+- More attribute coverage on existing OTel spans (anything that'd
+  help filter in a UI).
+
+## What we generally don't want
+
+- **Major API churn** — the public surface (decorator, `.spawn`,
+  `.map`, `Future`, `Blob`, `Materialize`) is mostly settled.
+  Suggest naming changes via an issue first; don't rename in a PR.
+- **Adding heavy dependencies** to the runtime. The current set is
+  deliberate. New deps need a strong "it would be much worse to
+  hand-roll this" argument.
+- **Hiding ArmoniK from users.** PymoniK wraps the lower-level
+  `armonik` package; it doesn't try to replace it. Anywhere you'd
+  reach for `armonik.client.*`, that should still work alongside the
+  PymoniK API.
+
+## How to ship a PR (when you have permissions)
+
+The maintainer's workflow:
+
+1. Local commits, no force pushes to shared branches.
+2. Run the fast suite: `uv run pytest -m "not slow"`.
+3. Run pyright: `uv run basedpyright src/pymonik`.
+4. Run ruff: `uv run ruff check && uv run ruff format`.
+5. If you touched `worker.py` or anything in `_internal/`, rebuild
+   the worker image and restart the partition; rerun a
+   representative example end-to-end against the rebuilt cluster.
+6. Open the PR with a clear description.
diff --git a/.docs/development/development.md b/.docs/development/development.md
@@ -1,31 +1,118 @@
 # Developing PymoniK
 
-We'll be covering some basic information to help you in working on and developing PymoniK
+This page covers what you need to know to work *on* PymoniK itself —
+not on top of it.
 
-## Requirements
+## Prerequisites
 
-We're using `uv` throughout the project, so please make sure that you have it installed. You can refer to their [official `uv` installation guide](https://docs.astral.sh/uv/getting-started/installation/) 
+- Python 3.11. PymoniK pins to 3.11 because cloudpickle isn't
+  cross-minor-compatible with the worker image — tests and
+  `LocalCluster` need to match what the worker runs.
+- [`uv`](https://docs.astral.sh/uv/) for project management.
+- Docker, if you'll touch worker images or run integration tests
+  against a real cluster.
 
-## Test client
+## Layout
 
-The test client contains some basic examples of working with PymoniK, PymoniK is installed in editable mode `uv add ../pymonik --editable`, it's useful to just create a python file there for testing and then `uv run`ning it to quickly iterate on PymoniK. Keep in mind that if you make changes that affect how the worker functions (obviously like making a change to the `worker.py` file), you'll have to reload the worker image. You can do this by running the following command:
+```
+src/pymonik/                 # the package
+  __init__.py                #   public API re-exports
+  client.py                  #   PymonikClient
+  session.py                 #   Session + completion loops
+  task.py                    #   @task decorator, Task wrapper
+  future.py                  #   Future, FutureList
+  options.py                 #   TaskOpts, merge semantics
+  envelope.py                #   wire format (msgspec)
+  blob.py                    #   Blob, Materialize
+  worker.py                  #   pymonik-worker entrypoint
+  worker_session.py          #   from-inside-a-worker submission
+  context.py                 #   pymonik.current() / WorkerContext
+  errors.py                  #   PymonikError hierarchy
+  composition.py             #   gather, as_completed
+  testing/                   #   LocalCluster
+  cli/                       #   pymonik CLI (stub today)
+  _internal/                 #   not part of the public API
+    submit.py                #     shared submission pipeline
+    refs.py                  #     FutureRef / BlobRef / MaterializeRef
+    env_builder.py           #     uv venv + flock for runtime deps
+    subprocess_dispatch.py   #     deps + isolate=True path
+    task_runner.py           #     subprocess child entrypoint
+    exec_cache.py            #     local result cache
+    query.py                 #     fluent introspection
+    info.py                  #     TaskInfo / ResultInfo / ...
+    channel.py               #     gRPC channel helpers
+    _otel.py                 #     OpenTelemetry helper
+    _logging.py              #     opt-in structlog setup
+examples/                     # runnable examples (also CI-gated)
+tests/                        # pytest suite
+.docs/                        # this documentation (Sphinx + MyST)
+worker-image/                 # Dockerfile for the harmonic_snake worker
+```
+
+The `_internal/` prefix marks code that may change without notice.
+Anything re-exported from `pymonik/__init__.py` is part of the public
+API and follows semver-ish rules within the alpha.
 
-```bash
-kubectl rollout restart deployment/compute-plane-pymonik #(1) -n armonik #(2)
+## Running tests
+
+```sh
+uv sync                      # one-time install
+uv run pytest                # everything
+uv run pytest -m "not slow"  # skip slow integration tests (no `uv` install)
+uv run pytest tests/test_otel.py -v   # one file
 ```
 
-1. This should be compute-plane-(NAME OF YOUR PYMONIK PARTITION). 
-2. If you're deploying locally the namespace is typically armonik, otherwise use the namespace of your kubernetes cluster
+The test suite is divided:
+
+- **Fast tests** (~30) — pure unit tests, no network, no `uv venv`
+  builds. Run in seconds. These are what CI runs on every push.
+- **Slow tests** marked `@pytest.mark.slow` — exercise the runtime
+  deps path with a real `uv` install. Need `uv` on `PATH`. Skip on
+  Windows (the subprocess wire is POSIX-only for now).
+
+The `LocalCluster` exercises the same submission pipeline as the
+real client, so most behavioural tests don't need a cluster. Only
+tests that depend on cluster-side behaviour (partition scheduling,
+events stream over the network) need a live ArmoniK; mark those
+`@pytest.mark.e2e` and run them separately when you have a deploy.
 
+## Type checking
 
-## Automation Script (`automation.py`)
+```sh
+uv run basedpyright src/pymonik
+```
 
-The `automation.py` script at the root of the project should help you realize most of your development tasks, it also auto-installs development dependencies.
+The codebase aims for `typeCheckingMode = "standard"`. New code
+should be fully annotated; private helpers may skip annotations when
+obvious.
 
-For example, if you want to access the documentation offline of if you're working on it (thank you!) then you can use the `serve-docs` command.
+A few upstream-typing quirks (anyio's `to_thread.run_sync` overload
+resolution, armonik's `Result` field types) produce false positives
+in `Session` / `WorkerSession` / `task.py`. These predate the
+revamp and aren't from new changes — leave them be unless you're
+fixing them upstream.
 
-To see a list of all available commands and their general descriptions, you can run: 
+## Linting and formatting
 
+```sh
+uv run ruff check
+uv run ruff format
 ```
-uv run automation.py --help
+
+Ruff replaces black + flake8 + isort. Configuration is in
+`pyproject.toml` under `[tool.ruff]`.
+
+## Working against a cluster
+
+If you're touching code that affects the worker (anything in
+`worker.py` or `_internal/`), you'll need to rebuild the image and
+restart the partition:
+
+```sh
+docker build -t my-org/harmonic_snake:dev worker-image/
+docker push my-org/harmonic_snake:dev    # or load directly into your kind cluster
+kubectl rollout restart deployment/compute-plane-pymonik -n armonik
 ```
+
+For client-only changes, just `uv sync` (or run with the editable
+install) and re-run your client script — no image rebuild needed.
diff --git a/.docs/examples/monte_carlo.md b/.docs/examples/monte_carlo.md
@@ -1,3 +1,104 @@
-# Distributed Monte Carlo for PI Estimation
+# Monte Carlo: estimating π
 
-This page hasn't been written yet, but the example code for it is in `test_client/estimate_pi.py`.
+A short example that hits every basic primitive: a worker function,
+`map` for fan-out, `spawn` for fan-in.
+
+The full source lives at `examples/estimate_pi.py`.
+
+## The idea
+
+Monte Carlo estimation of π: throw N random points in the unit square;
+the fraction inside the unit quarter-circle is approximately π/4. The
+more points, the better the estimate.
+
+It's embarrassingly parallel — every chunk of points is independent,
+and the only fan-in is a sum. Perfect for ArmoniK.
+
+## The code
+
+```python
+import random
+from pymonik import PymonikClient, task
+
+
+@task
+def count_inside(n: int, seed: int) -> int:
+    """How many of `n` random points fall inside the unit quarter circle?"""
+    rng = random.Random(seed)
+    inside = 0
+    for _ in range(n):
+        x = rng.random()
+        y = rng.random()
+        if x * x + y * y <= 1.0:
+            inside += 1
+    return inside
+
+
+@task
+def estimate(total_inside: int, total_points: int) -> float:
+    """Combine the per-shard counts into a single π estimate."""
+    return 4.0 * total_inside / total_points
+
+
+@task
+def add_all(xs: list[int]) -> int:
+    return sum(xs)
+
+
+def run(total_points: int = 10_000_000, num_tasks: int = 32) -> float:
+    points_per_task = total_points // num_tasks
+
+    with PymonikClient() as client:
+        with client.session(partition="pymonik") as s:
+            shards = count_inside.starmap(
+                (points_per_task, i) for i in range(num_tasks)
+            )
+            total_inside = add_all.spawn(shards)
+            pi = estimate.spawn(total_inside, num_tasks * points_per_task)
+            return pi.result(timeout=120)
+
+
+if __name__ == "__main__":
+    print(f"π ≈ {run()}")
+```
+
+## What's happening
+
+- `count_inside.starmap(...)` submits 32 tasks in one gRPC call. Each
+  one runs in parallel on whatever workers ArmoniK schedules. Returns
+  a `FutureList[int]`. We use `starmap` because we already have arg
+  tuples; if the per-task args were single values, `count_inside.map(iter)`
+  would be cleaner.
+- `add_all.spawn(shards)` passes the `FutureList` directly. PymoniK
+  rewrites each upstream future as a data dependency — `add_all` won't
+  run until every `count_inside` has finished. The client doesn't
+  block.
+- `estimate.spawn(total_inside, ...)` chains again: `estimate` waits
+  for `add_all` via the same mechanism.
+- Only `pi.result(timeout=120)` blocks. By the time the client wakes
+  up, the entire DAG (32 + 1 + 1 = 34 tasks) has run.
+
+## Tweaking
+
+- **More accuracy?** Bigger `total_points`. Each shard is independent,
+  so increasing the count just makes the leaves heavier.
+- **More parallelism?** Bigger `num_tasks`. Each shard is small enough
+  that the overhead of submission per task starts to matter at a few
+  hundred — you'll see diminishing returns.
+- **Reproducible?** Drop the seed indirection and pass a fixed seed.
+  The worker uses Python's stdlib `random`, which is deterministic
+  given a seed.
+
+## Variations worth trying
+
+1. Replace `count_inside` with one that uses `numpy.random` for
+   speed — declare `client.session(deps=["numpy"])` to make numpy
+   available without rebuilding the image. See
+   [Runtime environment](../guides/runtime-environment.md).
+2. Use the local exec cache to skip re-running shards:
+   `PymonikClient(cache=True)` + `@task(cache=True)` on
+   `count_inside`. Re-running the script with the same args returns
+   instantly on the second invocation.
+3. Use `LocalCluster` to run the whole thing in-process for a unit
+   test — no cluster needed. See
+   [Local testing](../guides/local-testing.md).