From 136a95ce66ac88fbebdae79ff8f365bff87edbd6 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 10:22:34 +0100 Subject: [PATCH 01/21] feat: ETS-backed LB with per-request pick, remove persistent_term writes on reconcile MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Moves the client load-balancer pick onto the RPC hot path via an ETS-backed RoundRobin strategy, so gRPC clients now round-robin across backends per request instead of caching one pick for 15 seconds. Behaviour changes (GRPC.Client.LoadBalancing): - init/1 now takes :channels (full Channel structs), not :addresses - pick/1 returns {:ok, Channel.t(), new_state} directly — no second lookup - adds optional update/2 for in-place reconciliation and shutdown/1 for cleanup - @optional_callbacks keeps simple strategies compliant without stubs RoundRobin rewrite: - ETS table (:set, :public, read_concurrency: true) owned by the Connection - pick: :ets.update_counter/3 + :ets.lookup + elem/2 — lock-free, no GenServer - update: mutates channels tuple in place, no table churn across reconciles - shutdown: :ets.delete, idempotent PickFirst: trivially updated to the new channels-shaped signature. Connection rewire: - pick_channel/2 reads persistent_term once, dispatches to lb_mod.pick per request - persistent_term is written exactly once (init) and erased once (disconnect) — no more global GC pass across every BEAM process on DNS re-resolution - :refresh timer removed (obsolete: per-request pick rotates automatically) - build_balanced_state passes connected Channels to lb_mod.init - build_direct_state uses PickFirst for uniformity (single persistent_term shape) - reconcile calls lb_mod.update(lb_state, connected_channels) — update/2 allows empty lists so pick returns :no_channels when all backends fail (LB's ETS is now the source of truth; persistent_term stays stable) Architecture follows sleipnir's review of #513: - Connection stays as orchestrator, LB owns pick state via opaque tid - No new processes — ETS table lifecycle piggybacks on Connection GenServer - LB module is the only thing that touches the ETS table Tests: 275 total (was 266), 0 failures across 3 seeds. - 11 new RoundRobin unit tests (init/pick/update/shutdown/concurrency) - 8 new PickFirst unit tests - 1 new integration test proving per-request rotation end-to-end Micro-benchmark (bench/lb_pick.exs, Apple M1 Max): Single-process pick: persistent_term.get 13.34M ips 75 ns (cached pick, no rotation) ETS RoundRobin.pick 3.42M ips 293 ns (per-request rotation) GenServer.call pick 0.98M ips 1020 ns (serialized via mailbox) 32-process concurrent pick: persistent_term.get 6.6M ips 0.15 μs (read-scales linearly) GenServer.call pick 26.6K ips 37 μs (mailbox-bound) ETS RoundRobin.pick 9.9K ips 101 μs (counter contention) Single-process reconciliation write: persistent_term.put 5.95M ips 168 ns (+ global GC pass, not measured) ETS RoundRobin.update 1.51M ips 663 ns (local, no side effects) The comparison is not apples-to-apples: persistent_term.get returns the same cached pick every call (no load distribution), ETS does per-request rotation. At realistic gRPC traffic rates (1K-10K picks/sec bounded by network latency), neither the 293ns pick cost nor the counter contention is observable — RPC latency dominates by 4+ orders of magnitude. Follow-up to #520. Can drop the benchee dep + bench script if desired. --- grpc/bench/lb_pick.exs | 122 +++++++++++ grpc/lib/grpc/client/connection.ex | 196 +++++++----------- grpc/lib/grpc/client/load_balacing.ex | 29 ++- .../grpc/client/load_balacing/pick_first.ex | 24 ++- .../grpc/client/load_balacing/round_robin.ex | 59 +++++- grpc/mix.exs | 1 + grpc/mix.lock | 3 + grpc/test/grpc/client/dns_resolver_test.exs | 32 +++ .../client/load_balacing/pick_first_test.exs | 59 ++++++ .../client/load_balacing/round_robin_test.exs | 159 ++++++++++++++ 10 files changed, 543 insertions(+), 141 deletions(-) create mode 100644 grpc/bench/lb_pick.exs create mode 100644 grpc/test/grpc/client/load_balacing/pick_first_test.exs create mode 100644 grpc/test/grpc/client/load_balacing/round_robin_test.exs diff --git a/grpc/bench/lb_pick.exs b/grpc/bench/lb_pick.exs new file mode 100644 index 000000000..13d5ae272 --- /dev/null +++ b/grpc/bench/lb_pick.exs @@ -0,0 +1,122 @@ +# Benchmark: per-request pick throughput for the client load balancer. +# +# Compares three strategies: +# 1. persistent_term.get — the pre-refactor pick_channel read path +# 2. ETS RoundRobin.pick — the current lock-free round-robin pick +# 3. GenServer.call — upper-bound cost of serialising picks through +# a process mailbox (the anti-pattern) +# +# Also measures concurrent pick throughput with many client processes +# hitting the same LB, which is representative of RPC traffic. +# +# Run with: mix run bench/lb_pick.exs + +alias GRPC.Channel +alias GRPC.Client.LoadBalancing.RoundRobin + +# Build N synthetic channels. +build_channels = fn n -> + for i <- 1..n do + %Channel{host: "backend-#{i}", port: 50050 + i, ref: {:ref, i}} + end +end + +channels = build_channels.(4) + +# --- Setup for each strategy --------------------------------------------- + +# 1. persistent_term — stores a single pre-picked channel, like the old code +pt_key = {__MODULE__, :pick, :bench} +:persistent_term.put(pt_key, hd(channels)) + +# 2. ETS RoundRobin +{:ok, rr_state} = RoundRobin.init(channels: channels) + +# 3. GenServer serving picks from its mailbox +defmodule PickServer do + use GenServer + + def start_link(channels), do: GenServer.start_link(__MODULE__, channels) + def pick(pid), do: GenServer.call(pid, :pick) + + @impl true + def init(channels), do: {:ok, {List.to_tuple(channels), 0}} + + @impl true + def handle_call(:pick, _from, {tuple, idx}) do + ch = elem(tuple, rem(idx, tuple_size(tuple))) + {:reply, {:ok, ch}, {tuple, idx + 1}} + end +end + +{:ok, gs_pid} = PickServer.start_link(channels) + +# --- Benchmarks ---------------------------------------------------------- + +IO.puts("\n=== Single-process pick throughput ===\n") + +Benchee.run( + %{ + "persistent_term.get" => fn -> + :persistent_term.get(pt_key) + end, + "ETS RoundRobin.pick" => fn -> + RoundRobin.pick(rr_state) + end, + "GenServer.call pick" => fn -> + PickServer.pick(gs_pid) + end + }, + time: 3, + warmup: 1, + print: [fast_warning: false] +) + +IO.puts("\n=== 32-process concurrent pick throughput ===\n") + +Benchee.run( + %{ + "persistent_term.get" => fn -> + :persistent_term.get(pt_key) + end, + "ETS RoundRobin.pick" => fn -> + RoundRobin.pick(rr_state) + end, + "GenServer.call pick" => fn -> + PickServer.pick(gs_pid) + end + }, + time: 3, + warmup: 1, + parallel: 32, + print: [fast_warning: false] +) + +# --- Write-path cost: reconciliation -------------------------------------- +# +# persistent_term.put triggers a global GC pass across every BEAM process. +# ETS :ets.insert is local to the table. This is the decisive difference +# under frequent DNS re-resolution. + +IO.puts("\n=== Reconciliation write cost ===\n") + +new_channels = build_channels.(8) + +Benchee.run( + %{ + "persistent_term.put" => fn -> + :persistent_term.put(pt_key, hd(new_channels)) + end, + "ETS RoundRobin.update" => fn -> + RoundRobin.update(rr_state, new_channels) + end + }, + time: 3, + warmup: 1, + print: [fast_warning: false] +) + +# Cleanup +RoundRobin.shutdown(rr_state) +:persistent_term.erase(pt_key) +GenServer.stop(gs_pid) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 7e7a6aca4..cefe76e6d 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -79,7 +79,6 @@ defmodule GRPC.Client.Connection do @insecure_scheme "http" @secure_scheme "https" - @refresh_interval 15_000 @default_resolve_interval 30_000 @default_max_resolve_interval 300_000 @default_min_resolve_interval 5_000 @@ -122,15 +121,11 @@ defmodule GRPC.Client.Connection do def init(%__MODULE__{} = state) do Process.flag(:trap_exit, true) - # only now persist the chosen channel (which should already have adapter_payload - # because build_initial_state connected real channels and set virtual_channel) :persistent_term.put( {__MODULE__, :lb_state, state.virtual_channel.ref}, - state.virtual_channel + {state.lb_mod, state.lb_state} ) - Process.send_after(self(), :refresh, @refresh_interval) - state = if function_exported?(state.resolver, :init, 2) do {:ok, resolver_state} = @@ -245,11 +240,14 @@ defmodule GRPC.Client.Connection do @spec pick_channel(Channel.t(), keyword()) :: {:ok, Channel.t()} | {:error, term()} def pick_channel(%Channel{ref: ref} = _channel, _opts \\ []) do case :persistent_term.get({__MODULE__, :lb_state, ref}, nil) do - nil -> - {:error, :no_connection} + {lb_mod, lb_state} when not is_nil(lb_mod) -> + case lb_mod.pick(lb_state) do + {:ok, %Channel{} = channel, _new_state} -> {:ok, channel} + {:error, _} -> {:error, :no_connection} + end - %Channel{} = channel -> - {:ok, channel} + _ -> + {:error, :no_connection} end end @@ -279,6 +277,8 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end + shutdown_lb(state.lb_mod, state.lb_state) + resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} :persistent_term.erase({__MODULE__, :lb_state, channel.ref}) @@ -301,36 +301,6 @@ defmodule GRPC.Client.Connection do end @impl GenServer - def handle_info( - :refresh, - %{lb_mod: lb_mod, lb_state: lb_state, real_channels: channels, virtual_channel: vc} = - state - ) - when not is_nil(lb_mod) do - {:ok, {prefer_host, prefer_port}, new_lb_state} = lb_mod.pick(lb_state) - - channel_key = build_address_key(prefer_host, prefer_port) - - case Map.get(channels, channel_key) do - {:connected, %Channel{} = picked_channel} -> - :persistent_term.put({__MODULE__, :lb_state, vc.ref}, picked_channel) - - Process.send_after(self(), :refresh, @refresh_interval) - {:noreply, %{state | lb_state: new_lb_state, virtual_channel: picked_channel}} - - _nil_or_failed -> - # LB picked a channel that is missing or in {:failed, _} state. - # Don't update persistent_term — keep serving from the current - # virtual_channel until re-resolution provides healthy backends. - Logger.warning("LB picked #{channel_key}, but channel is unavailable") - - Process.send_after(self(), :refresh, @refresh_interval) - {:noreply, %{state | lb_state: new_lb_state}} - end - end - - def handle_info(:refresh, state), do: {:noreply, state} - def handle_info({:resolver_update, result}, state) do state = handle_resolve_result(result, state) {:noreply, state} @@ -377,7 +347,8 @@ defmodule GRPC.Client.Connection do end @impl GenServer - def terminate(_reason, %{virtual_channel: %{ref: ref}}) do + def terminate(_reason, %{virtual_channel: %{ref: ref}} = state) do + shutdown_lb(Map.get(state, :lb_mod), Map.get(state, :lb_state)) :persistent_term.erase({__MODULE__, :lb_state, ref}) rescue _ -> :ok @@ -385,6 +356,17 @@ defmodule GRPC.Client.Connection do def terminate(_reason, _state), do: :ok + defp shutdown_lb(lb_mod, lb_state) + when not is_nil(lb_mod) and not is_nil(lb_state) do + if function_exported?(lb_mod, :shutdown, 1) do + lb_mod.shutdown(lb_state) + end + rescue + _ -> :ok + end + + defp shutdown_lb(_lb_mod, _lb_state), do: :ok + defp handle_resolve_result({:ok, %{addresses: []}}, state), do: state defp handle_resolve_result({:ok, %{addresses: new_addresses}}, state) do @@ -445,64 +427,36 @@ defmodule GRPC.Client.Connection do end) end - defp rebalance_after_reconcile(new_addresses, real_channels, state) do - if state.lb_mod do - case state.lb_mod.init(addresses: new_addresses) do - {:ok, new_lb_state} -> - {:ok, {host, port}, picked_lb_state} = state.lb_mod.pick(new_lb_state) - key = build_address_key(host, port) - - case Map.get(real_channels, key) do - {:connected, picked_channel} -> - maybe_update_persistent_term(state.virtual_channel, picked_channel) + defp rebalance_after_reconcile(_new_addresses, real_channels, state) do + connected = connected_channels(real_channels) - %{ - state - | real_channels: real_channels, - lb_state: picked_lb_state, - virtual_channel: picked_channel - } - - _ -> - fallback_to_healthy_channel(state, real_channels, picked_lb_state) - end - - {:error, _} -> - fallback_to_healthy_channel(state, real_channels, state.lb_state) + new_lb_state = + if state.lb_mod do + case reconcile_lb(state.lb_mod, state.lb_state, connected) do + {:ok, s} -> s + {:error, _} -> state.lb_state + end + else + state.lb_state end - else - fallback_to_healthy_channel(state, real_channels, state.lb_state) - end - end - defp fallback_to_healthy_channel(state, real_channels, lb_state) do - ref = state.virtual_channel.ref - - case Enum.find_value(real_channels, fn {_k, v} -> match?({:connected, _}, v) && v end) do - {:connected, healthy_channel} -> - maybe_update_persistent_term(state.virtual_channel, healthy_channel) + if connected == [] do + Logger.warning("No healthy channels available after re-resolution") + end - %{ - state - | real_channels: real_channels, - lb_state: lb_state, - virtual_channel: healthy_channel - } + %{state | real_channels: real_channels, lb_state: new_lb_state} + end - nil -> - Logger.warning("No healthy channels available after re-resolution") - :persistent_term.erase({__MODULE__, :lb_state, ref}) - %{state | real_channels: real_channels, lb_state: lb_state} + defp reconcile_lb(lb_mod, lb_state, new_channels) do + if lb_state != nil and function_exported?(lb_mod, :update, 2) do + lb_mod.update(lb_state, new_channels) + else + lb_mod.init(channels: new_channels) end end - defp maybe_update_persistent_term(current_channel, new_channel) do - if current_channel != new_channel do - :persistent_term.put( - {__MODULE__, :lb_state, new_channel.ref}, - new_channel - ) - end + defp connected_channels(real_channels) do + for {_key, {:connected, ch}} <- real_channels, do: ch end defp channel_alive?({:connected, %{adapter_payload: %{conn_pid: pid}}}) when is_pid(pid) do @@ -625,44 +579,22 @@ defmodule GRPC.Client.Connection do lb_mod = choose_lb(lb_policy) - case lb_mod.init(addresses: addresses) do - {:ok, lb_state} -> - {:ok, {prefer_host, prefer_port}, new_lb_state} = lb_mod.pick(lb_state) - - real_channels = - build_real_channels(addresses, base_state.virtual_channel, norm_opts, adapter) - - key = build_address_key(prefer_host, prefer_port) - - with {:connected, ch} <- Map.get(real_channels, key, {:failed, :no_channel}) do - {:ok, - %__MODULE__{ - base_state - | lb_mod: lb_mod, - lb_state: new_lb_state, - virtual_channel: ch, - real_channels: real_channels - }} - else - {:failed, reason} -> {:error, reason} - end + real_channels = + build_real_channels(addresses, base_state.virtual_channel, norm_opts, adapter) - {:error, :no_addresses} -> - {:error, :no_addresses} - end - end + connected = connected_channels(real_channels) - defp build_direct_state(%__MODULE__{} = base_state, norm_target, norm_opts, adapter) do - {host, port} = split_host_port(norm_target) - vc = base_state.virtual_channel + case lb_mod.init(channels: connected) do + {:ok, lb_state} -> + {:ok, %Channel{} = picked, _} = lb_mod.pick(lb_state) - case connect_real_channel(vc, host, port, norm_opts, adapter) do - {:ok, ch} -> {:ok, %__MODULE__{ base_state - | virtual_channel: ch, - real_channels: %{"#{host}:#{port}" => {:connected, ch}} + | lb_mod: lb_mod, + lb_state: lb_state, + virtual_channel: picked, + real_channels: real_channels }} {:error, reason} -> @@ -670,6 +602,24 @@ defmodule GRPC.Client.Connection do end end + defp build_direct_state(%__MODULE__{} = base_state, norm_target, norm_opts, adapter) do + {host, port} = split_host_port(norm_target) + vc = base_state.virtual_channel + lb_mod = GRPC.Client.LoadBalancing.PickFirst + + with {:ok, ch} <- connect_real_channel(vc, host, port, norm_opts, adapter), + {:ok, lb_state} <- lb_mod.init(channels: [ch]) do + {:ok, + %__MODULE__{ + base_state + | virtual_channel: ch, + real_channels: %{build_address_key(host, port) => {:connected, ch}}, + lb_mod: lb_mod, + lb_state: lb_state + }} + end + end + defp build_real_channels(addresses, %Channel{} = virtual_channel, norm_opts, adapter) do Map.new(addresses, fn %{port: port, address: host} -> case connect_real_channel( diff --git a/grpc/lib/grpc/client/load_balacing.ex b/grpc/lib/grpc/client/load_balacing.ex index fb3a58d1c..7eef66bf5 100644 --- a/grpc/lib/grpc/client/load_balacing.ex +++ b/grpc/lib/grpc/client/load_balacing.ex @@ -2,11 +2,34 @@ defmodule GRPC.Client.LoadBalancing do @moduledoc """ Load balancing behaviour for gRPC clients. - This module defines the behaviour that load balancing strategies must implement. + A load-balancing strategy owns the per-request pick decision over a set of + `GRPC.Channel` structs. It operates on already-connected channels handed to + it by `GRPC.Client.Connection` — it is not responsible for establishing or + tearing down transport. + + Required callbacks: + + * `init/1` — build initial state from `[channels: [Channel.t()]]`. + * `pick/1` — choose a `Channel` for the next request. + + Optional callbacks (used by strategies that maintain live state, e.g. ETS): + + * `update/2` — reconcile the state with a new channel list in place, + without tearing it down. Called by `Connection` on DNS re-resolution. + * `shutdown/1` — release any resources held by the strategy. Called on + disconnect. """ + alias GRPC.Channel + @callback init(opts :: keyword()) :: {:ok, state :: any()} | {:error, reason :: any()} @callback pick(state :: any()) :: - {:ok, {host :: String.t(), port :: non_neg_integer()}, new_state :: any()} - | {:error, reason :: any()} + {:ok, Channel.t(), new_state :: any()} | {:error, reason :: any()} + + @callback update(state :: any(), new_channels :: [Channel.t()]) :: + {:ok, new_state :: any()} | {:error, reason :: any()} + + @callback shutdown(state :: any()) :: :ok + + @optional_callbacks update: 2, shutdown: 1 end diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index 14e17ca03..dd438e450 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -1,16 +1,28 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do + @moduledoc """ + Pick-first load balancer: always returns the first channel in the list. + + Stateless and process-local — no ETS, no GenServer. `update/2` swaps the + current pick when the channel list changes. + """ @behaviour GRPC.Client.LoadBalancing @impl true def init(opts) do - case Keyword.get(opts, :addresses, []) do - [] -> {:error, :no_addresses} - addresses -> {:ok, %{addresses: addresses, current: hd(addresses)}} + case Keyword.get(opts, :channels, []) do + [] -> {:error, :no_channels} + [first | _] -> {:ok, %{current: first}} end end @impl true - def pick(%{current: %{address: host, port: port}} = state) do - {:ok, {host, port}, state} - end + def pick(%{current: nil}), do: {:error, :no_channels} + def pick(%{current: channel} = state), do: {:ok, channel, state} + + @impl true + def update(_state, [first | _]), do: {:ok, %{current: first}} + def update(_state, []), do: {:ok, %{current: nil}} + + @impl true + def shutdown(_state), do: :ok end diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index af47bf5bb..859b632a5 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -1,22 +1,63 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do + @moduledoc """ + Round-robin load balancer backed by an ETS table. + + The pick path is lock-free: `:ets.update_counter/3` atomically advances the + cursor and `elem/2` indexes the channel tuple in constant time. No GenServer + sits in front of the pick, so RPC-time picks are a single atomic increment + plus a lookup. + + The ETS table is owned by whichever process calls `init/1` (normally the + `GRPC.Client.Connection` GenServer), so when that process dies the table is + reclaimed automatically — no separate supervision needed. `shutdown/1` is + provided for graceful disconnects. + """ @behaviour GRPC.Client.LoadBalancing + @channels_key :channels + @index_key :index + @impl true def init(opts) do - addresses = Keyword.get(opts, :addresses, []) + case Keyword.get(opts, :channels, []) do + [] -> + {:error, :no_channels} + + channels -> + tid = :ets.new(:grpc_lb_round_robin, [:set, :public, read_concurrency: true]) + + :ets.insert(tid, {@channels_key, List.to_tuple(channels)}) + :ets.insert(tid, {@index_key, -1}) + + {:ok, %{tid: tid}} + end + end + + @impl true + def pick(%{tid: tid} = state) do + case :ets.lookup(tid, @channels_key) do + [{@channels_key, channels}] when tuple_size(channels) > 0 -> + idx = :ets.update_counter(tid, @index_key, {2, 1}) + channel = elem(channels, rem(idx, tuple_size(channels))) + {:ok, channel, state} - if addresses == [] do - {:error, :no_addresses} - else - {:ok, %{addresses: addresses, index: 0, n: length(addresses)}} + _ -> + {:error, :no_channels} end end @impl true - def pick(%{addresses: addresses, index: idx, n: n} = state) do - %{address: host, port: port} = Enum.fetch!(addresses, idx) + def update(%{tid: tid} = state, new_channels) do + :ets.insert(tid, {@channels_key, List.to_tuple(new_channels)}) + :ets.insert(tid, {@index_key, -1}) + {:ok, state} + end - new_state = %{state | index: rem(idx + 1, n)} - {:ok, {host, port}, new_state} + @impl true + def shutdown(%{tid: tid}) do + :ets.delete(tid) + :ok + rescue + ArgumentError -> :ok end end diff --git a/grpc/mix.exs b/grpc/mix.exs index 24fa3c60b..72d6ff64a 100644 --- a/grpc/mix.exs +++ b/grpc/mix.exs @@ -38,6 +38,7 @@ defmodule GRPC.MixProject do {:ex_doc, "~> 0.39", only: [:dev, :docs], runtime: false}, {:ex_parameterized, "~> 1.3.7", only: :test}, {:mox, "~> 1.2", only: :test}, + {:benchee, "~> 1.3", only: :dev, runtime: false}, # {:grpc_server, path: "../grpc_server", only: :test} {:grpc_server, "~> 1.0.0-rc.1", only: :test} ] diff --git a/grpc/mix.lock b/grpc/mix.lock index d6252163f..ca0489da0 100644 --- a/grpc/mix.lock +++ b/grpc/mix.lock @@ -1,7 +1,9 @@ %{ + "benchee": {:hex, :benchee, "1.5.0", "4d812c31d54b0ec0167e91278e7de3f596324a78a096fd3d0bea68bb0c513b10", [:mix], [{:deep_merge, "~> 1.0", [hex: :deep_merge, repo: "hexpm", optional: false]}, {:statistex, "~> 1.1", [hex: :statistex, repo: "hexpm", optional: false]}, {:table, "~> 0.1.0", [hex: :table, repo: "hexpm", optional: true]}], "hexpm", "5b075393aea81b8ae74eadd1c28b1d87e8a63696c649d8293db7c4df3eb67535"}, "castore": {:hex, :castore, "1.0.16", "8a4f9a7c8b81cda88231a08fe69e3254f16833053b23fa63274b05cbc61d2a1e", [:mix], [], "hexpm", "33689203a0eaaf02fcd0e86eadfbcf1bd636100455350592e7e2628564022aaf"}, "cowboy": {:hex, :cowboy, "2.14.2", "4008be1df6ade45e4f2a4e9e2d22b36d0b5aba4e20b0a0d7049e28d124e34847", [:make, :rebar3], [{:cowlib, ">= 2.16.0 and < 3.0.0", [hex: :cowlib, repo: "hexpm", optional: false]}, {:ranch, ">= 1.8.0 and < 3.0.0", [hex: :ranch, repo: "hexpm", optional: false]}], "hexpm", "569081da046e7b41b5df36aa359be71a0c8874e5b9cff6f747073fc57baf1ab9"}, "cowlib": {:hex, :cowlib, "2.16.0", "54592074ebbbb92ee4746c8a8846e5605052f29309d3a873468d76cdf932076f", [:make, :rebar3], [], "hexpm", "7f478d80d66b747344f0ea7708c187645cfcc08b11aa424632f78e25bf05db51"}, + "deep_merge": {:hex, :deep_merge, "1.0.0", "b4aa1a0d1acac393bdf38b2291af38cb1d4a52806cf7a4906f718e1feb5ee961", [:mix], [], "hexpm", "ce708e5f094b9cd4e8f2be4f00d2f4250c4095be93f8cd6d018c753894885430"}, "earmark_parser": {:hex, :earmark_parser, "1.4.44", "f20830dd6b5c77afe2b063777ddbbff09f9759396500cdbe7523efd58d7a339c", [:mix], [], "hexpm", "4778ac752b4701a5599215f7030989c989ffdc4f6df457c5f36938cc2d2a2750"}, "ex_doc": {:hex, :ex_doc, "0.39.1", "e19d356a1ba1e8f8cfc79ce1c3f83884b6abfcb79329d435d4bbb3e97ccc286e", [:mix], [{:earmark_parser, "~> 1.4.44", [hex: :earmark_parser, repo: "hexpm", optional: false]}, {:makeup_c, ">= 0.1.0", [hex: :makeup_c, repo: "hexpm", optional: true]}, {:makeup_elixir, "~> 0.14 or ~> 1.0", [hex: :makeup_elixir, repo: "hexpm", optional: false]}, {:makeup_erlang, "~> 0.1 or ~> 1.0", [hex: :makeup_erlang, repo: "hexpm", optional: false]}, {:makeup_html, ">= 0.1.0", [hex: :makeup_html, repo: "hexpm", optional: true]}], "hexpm", "8abf0ed3e3ca87c0847dfc4168ceab5bedfe881692f1b7c45f4a11b232806865"}, "ex_parameterized": {:hex, :ex_parameterized, "1.3.7", "801f85fc4651cb51f11b9835864c6ed8c5e5d79b1253506b5bb5421e8ab2f050", [:mix], [], "hexpm", "1fb0dc4aa9e8c12ae23806d03bcd64a5a0fc9cd3f4c5602ba72561c9b54a625c"}, @@ -22,5 +24,6 @@ "nimble_parsec": {:hex, :nimble_parsec, "1.4.2", "8efba0122db06df95bfaa78f791344a89352ba04baedd3849593bfce4d0dc1c6", [:mix], [], "hexpm", "4b21398942dda052b403bbe1da991ccd03a053668d147d53fb8c4e0efe09c973"}, "protobuf": {:hex, :protobuf, "0.15.0", "c9fc1e9fc1682b05c601df536d5ff21877b55e2023e0466a3855cc1273b74dcb", [:mix], [{:jason, "~> 1.2", [hex: :jason, repo: "hexpm", optional: true]}], "hexpm", "5d7bb325319db1d668838d2691c31c7b793c34111aec87d5ee467a39dac6e051"}, "ranch": {:hex, :ranch, "2.2.0", "25528f82bc8d7c6152c57666ca99ec716510fe0925cb188172f41ce93117b1b0", [:make, :rebar3], [], "hexpm", "fa0b99a1780c80218a4197a59ea8d3bdae32fbff7e88527d7d8a4787eff4f8e7"}, + "statistex": {:hex, :statistex, "1.1.0", "7fec1eb2f580a0d2c1a05ed27396a084ab064a40cfc84246dbfb0c72a5c761e5", [:mix], [], "hexpm", "f5950ea26ad43246ba2cce54324ac394a4e7408fdcf98b8e230f503a0cba9cf5"}, "telemetry": {:hex, :telemetry, "1.3.0", "fedebbae410d715cf8e7062c96a1ef32ec22e764197f70cda73d82778d61e7a2", [:rebar3], [], "hexpm", "7015fc8919dbe63764f4b4b87a95b7c0996bd539e0d499be6ec9d7f3875b79e6"}, } diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index a0e49cada..617cf0ddc 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -431,6 +431,38 @@ defmodule GRPC.Client.ReResolveTest do end end + describe "pick_channel per-request rotation" do + test "round-robin rotates across all backends on successive picks", ctx do + {:ok, channel} = + connect_with_resolver( + ctx.ref, + ctx.resolver, + ctx.adapter, + [ + %{address: "10.0.0.1", port: 50051}, + %{address: "10.0.0.2", port: 50051}, + %{address: "10.0.0.3", port: 50051} + ], + lb_policy: :round_robin + ) + + hosts = + for _ <- 1..9 do + {:ok, picked} = Connection.pick_channel(channel) + picked.host + end + + assert Enum.sort(Enum.uniq(hosts)) == ["10.0.0.1", "10.0.0.2", "10.0.0.3"] + + counts = Enum.frequencies(hosts) + assert counts["10.0.0.1"] == 3 + assert counts["10.0.0.2"] == 3 + assert counts["10.0.0.3"] == 3 + + disconnect_and_wait(channel) + end + end + describe "pick_channel after full backend replacement" do test "picks a channel from the new backend set", ctx do {:ok, channel} = diff --git a/grpc/test/grpc/client/load_balacing/pick_first_test.exs b/grpc/test/grpc/client/load_balacing/pick_first_test.exs new file mode 100644 index 000000000..1b212f841 --- /dev/null +++ b/grpc/test/grpc/client/load_balacing/pick_first_test.exs @@ -0,0 +1,59 @@ +defmodule GRPC.Client.LoadBalancing.PickFirstTest do + use ExUnit.Case, async: true + + alias GRPC.Channel + alias GRPC.Client.LoadBalancing.PickFirst + + defp channels(pairs), + do: Enum.map(pairs, fn {h, p} -> %Channel{host: h, port: p, ref: {h, p}} end) + + describe "init/1" do + test "returns the first channel as current" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}, {"b", 2}])) + assert state == %{current: %Channel{host: "a", port: 1, ref: {"a", 1}}} + end + + test "rejects empty channel lists" do + assert {:error, :no_channels} = PickFirst.init(channels: []) + end + + test "rejects missing :channels option" do + assert {:error, :no_channels} = PickFirst.init([]) + end + end + + describe "pick/1" do + test "always returns the current channel" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}, {"b", 2}])) + + for _ <- 1..3 do + assert {:ok, %Channel{host: "a", port: 1}, ^state} = PickFirst.pick(state) + end + end + + test "returns :no_channels when current is nil" do + assert {:error, :no_channels} = PickFirst.pick(%{current: nil}) + end + end + + describe "update/2" do + test "swaps current to the first of the new channels" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + {:ok, new_state} = PickFirst.update(state, channels([{"x", 9}, {"y", 8}])) + + assert {:ok, %Channel{host: "x", port: 9}, _} = PickFirst.pick(new_state) + end + + test "sets current to nil on empty list" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + assert {:ok, %{current: nil}} = PickFirst.update(state, []) + end + end + + describe "shutdown/1" do + test "returns :ok (no-op, no external resources)" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + assert :ok = PickFirst.shutdown(state) + end + end +end diff --git a/grpc/test/grpc/client/load_balacing/round_robin_test.exs b/grpc/test/grpc/client/load_balacing/round_robin_test.exs new file mode 100644 index 000000000..ed58ab02b --- /dev/null +++ b/grpc/test/grpc/client/load_balacing/round_robin_test.exs @@ -0,0 +1,159 @@ +defmodule GRPC.Client.LoadBalancing.RoundRobinTest do + @moduledoc """ + Tests for the ETS-backed round-robin load balancer. + + Covers: + 1. init/1 creates an ETS table and returns it in state + 2. init/1 rejects empty channel lists + 3. pick/1 rotates through channels in order + 4. pick/1 wraps around after the last channel + 5. update/2 replaces channels in place without creating a new table + 6. update/2 rejects empty channel lists + 7. update/2 resets the cursor so the first pick is the first channel + 8. shutdown/1 deletes the ETS table + 9. shutdown/1 is idempotent (safe on already-deleted tables) + 10. pick/1 is safe under concurrent access (many processes, one table) + """ + use ExUnit.Case, async: true + + alias GRPC.Channel + alias GRPC.Client.LoadBalancing.RoundRobin + + defp channels(pairs), + do: Enum.map(pairs, fn {h, p} -> %Channel{host: h, port: p, ref: {h, p}} end) + + describe "init/1" do + test "creates an ETS table and returns the tid in state" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) + assert %{tid: tid} = state + assert is_reference(tid) + assert :ets.info(tid) != :undefined + RoundRobin.shutdown(state) + end + + test "rejects empty channel lists" do + assert {:error, :no_channels} = RoundRobin.init(channels: []) + end + + test "rejects missing :channels option" do + assert {:error, :no_channels} = RoundRobin.init([]) + end + end + + describe "pick/1" do + test "rotates through channels in order" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}, {"b", 2}, {"c", 3}])) + + assert {:ok, %Channel{host: "a", port: 1}, _} = RoundRobin.pick(state) + assert {:ok, %Channel{host: "b", port: 2}, _} = RoundRobin.pick(state) + assert {:ok, %Channel{host: "c", port: 3}, _} = RoundRobin.pick(state) + assert {:ok, %Channel{host: "a", port: 1}, _} = RoundRobin.pick(state) + + RoundRobin.shutdown(state) + end + + test "wraps around with a single channel" do + {:ok, state} = RoundRobin.init(channels: channels([{"only", 1}])) + + for _ <- 1..5 do + assert {:ok, %Channel{host: "only"}, _} = RoundRobin.pick(state) + end + + RoundRobin.shutdown(state) + end + end + + describe "update/2" do + test "replaces channels in place without changing the tid" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}, {"b", 2}])) + original_tid = state.tid + + {:ok, new_state} = RoundRobin.update(state, channels([{"x", 9}, {"y", 8}, {"z", 7}])) + assert new_state.tid == original_tid + + assert {:ok, %Channel{host: "x", port: 9}, _} = RoundRobin.pick(new_state) + assert {:ok, %Channel{host: "y", port: 8}, _} = RoundRobin.pick(new_state) + assert {:ok, %Channel{host: "z", port: 7}, _} = RoundRobin.pick(new_state) + + RoundRobin.shutdown(new_state) + end + + test "accepts empty channel lists; pick then returns :no_channels" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) + assert {:ok, ^state} = RoundRobin.update(state, []) + assert {:error, :no_channels} = RoundRobin.pick(state) + RoundRobin.shutdown(state) + end + + test "resets cursor so the first pick after update starts at the first channel" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}, {"b", 2}])) + {:ok, _, _} = RoundRobin.pick(state) + {:ok, _, _} = RoundRobin.pick(state) + + {:ok, state} = RoundRobin.update(state, channels([{"new-first", 1}, {"new-second", 2}])) + assert {:ok, %Channel{host: "new-first"}, _} = RoundRobin.pick(state) + + RoundRobin.shutdown(state) + end + end + + describe "shutdown/1" do + test "deletes the ETS table" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) + tid = state.tid + assert :ets.info(tid) != :undefined + + :ok = RoundRobin.shutdown(state) + assert :ets.info(tid) == :undefined + end + + test "is idempotent on already-deleted tables" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) + :ok = RoundRobin.shutdown(state) + assert :ok = RoundRobin.shutdown(state) + end + end + + describe "concurrency" do + test "pick/1 is safe under many concurrent processes" do + chs = channels(for i <- 1..4, do: {"host#{i}", 1000 + i}) + {:ok, state} = RoundRobin.init(channels: chs) + + parent = self() + picks_per_proc = 250 + procs = 16 + + for _ <- 1..procs do + spawn_link(fn -> + picks = + for _ <- 1..picks_per_proc do + {:ok, %Channel{host: host}, _} = RoundRobin.pick(state) + host + end + + send(parent, {:picks, picks}) + end) + end + + all_picks = + for _ <- 1..procs, reduce: [] do + acc -> + receive do + {:picks, picks} -> picks ++ acc + end + end + + assert length(all_picks) == procs * picks_per_proc + + counts = Enum.frequencies(all_picks) + avg = div(length(all_picks), length(chs)) + + for {_, c} <- counts do + assert abs(c - avg) <= 1, + "uneven pick distribution: #{inspect(counts)}" + end + + RoundRobin.shutdown(state) + end + end +end From 9b58997e4c672ce6426ac4ee4d49fad753fd918d Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 10:23:37 +0100 Subject: [PATCH 02/21] chore: drop benchee dep and bench/lb_pick.exs from LB PR The benchmark served its purpose: informing the design choice in the previous commit. Keeping benchee as a first-party dep for one script isn't justified, and the numbers belong in the PR description anyway (preserved in c3c3ebe's commit message for history). No code or test changes. Dep tree is back to the pre-LB state. --- grpc/bench/lb_pick.exs | 122 ----------------------------------------- grpc/mix.exs | 1 - grpc/mix.lock | 3 - 3 files changed, 126 deletions(-) delete mode 100644 grpc/bench/lb_pick.exs diff --git a/grpc/bench/lb_pick.exs b/grpc/bench/lb_pick.exs deleted file mode 100644 index 13d5ae272..000000000 --- a/grpc/bench/lb_pick.exs +++ /dev/null @@ -1,122 +0,0 @@ -# Benchmark: per-request pick throughput for the client load balancer. -# -# Compares three strategies: -# 1. persistent_term.get — the pre-refactor pick_channel read path -# 2. ETS RoundRobin.pick — the current lock-free round-robin pick -# 3. GenServer.call — upper-bound cost of serialising picks through -# a process mailbox (the anti-pattern) -# -# Also measures concurrent pick throughput with many client processes -# hitting the same LB, which is representative of RPC traffic. -# -# Run with: mix run bench/lb_pick.exs - -alias GRPC.Channel -alias GRPC.Client.LoadBalancing.RoundRobin - -# Build N synthetic channels. -build_channels = fn n -> - for i <- 1..n do - %Channel{host: "backend-#{i}", port: 50050 + i, ref: {:ref, i}} - end -end - -channels = build_channels.(4) - -# --- Setup for each strategy --------------------------------------------- - -# 1. persistent_term — stores a single pre-picked channel, like the old code -pt_key = {__MODULE__, :pick, :bench} -:persistent_term.put(pt_key, hd(channels)) - -# 2. ETS RoundRobin -{:ok, rr_state} = RoundRobin.init(channels: channels) - -# 3. GenServer serving picks from its mailbox -defmodule PickServer do - use GenServer - - def start_link(channels), do: GenServer.start_link(__MODULE__, channels) - def pick(pid), do: GenServer.call(pid, :pick) - - @impl true - def init(channels), do: {:ok, {List.to_tuple(channels), 0}} - - @impl true - def handle_call(:pick, _from, {tuple, idx}) do - ch = elem(tuple, rem(idx, tuple_size(tuple))) - {:reply, {:ok, ch}, {tuple, idx + 1}} - end -end - -{:ok, gs_pid} = PickServer.start_link(channels) - -# --- Benchmarks ---------------------------------------------------------- - -IO.puts("\n=== Single-process pick throughput ===\n") - -Benchee.run( - %{ - "persistent_term.get" => fn -> - :persistent_term.get(pt_key) - end, - "ETS RoundRobin.pick" => fn -> - RoundRobin.pick(rr_state) - end, - "GenServer.call pick" => fn -> - PickServer.pick(gs_pid) - end - }, - time: 3, - warmup: 1, - print: [fast_warning: false] -) - -IO.puts("\n=== 32-process concurrent pick throughput ===\n") - -Benchee.run( - %{ - "persistent_term.get" => fn -> - :persistent_term.get(pt_key) - end, - "ETS RoundRobin.pick" => fn -> - RoundRobin.pick(rr_state) - end, - "GenServer.call pick" => fn -> - PickServer.pick(gs_pid) - end - }, - time: 3, - warmup: 1, - parallel: 32, - print: [fast_warning: false] -) - -# --- Write-path cost: reconciliation -------------------------------------- -# -# persistent_term.put triggers a global GC pass across every BEAM process. -# ETS :ets.insert is local to the table. This is the decisive difference -# under frequent DNS re-resolution. - -IO.puts("\n=== Reconciliation write cost ===\n") - -new_channels = build_channels.(8) - -Benchee.run( - %{ - "persistent_term.put" => fn -> - :persistent_term.put(pt_key, hd(new_channels)) - end, - "ETS RoundRobin.update" => fn -> - RoundRobin.update(rr_state, new_channels) - end - }, - time: 3, - warmup: 1, - print: [fast_warning: false] -) - -# Cleanup -RoundRobin.shutdown(rr_state) -:persistent_term.erase(pt_key) -GenServer.stop(gs_pid) diff --git a/grpc/mix.exs b/grpc/mix.exs index 72d6ff64a..24fa3c60b 100644 --- a/grpc/mix.exs +++ b/grpc/mix.exs @@ -38,7 +38,6 @@ defmodule GRPC.MixProject do {:ex_doc, "~> 0.39", only: [:dev, :docs], runtime: false}, {:ex_parameterized, "~> 1.3.7", only: :test}, {:mox, "~> 1.2", only: :test}, - {:benchee, "~> 1.3", only: :dev, runtime: false}, # {:grpc_server, path: "../grpc_server", only: :test} {:grpc_server, "~> 1.0.0-rc.1", only: :test} ] diff --git a/grpc/mix.lock b/grpc/mix.lock index ca0489da0..d6252163f 100644 --- a/grpc/mix.lock +++ b/grpc/mix.lock @@ -1,9 +1,7 @@ %{ - "benchee": {:hex, :benchee, "1.5.0", "4d812c31d54b0ec0167e91278e7de3f596324a78a096fd3d0bea68bb0c513b10", [:mix], [{:deep_merge, "~> 1.0", [hex: :deep_merge, repo: "hexpm", optional: false]}, {:statistex, "~> 1.1", [hex: :statistex, repo: "hexpm", optional: false]}, {:table, "~> 0.1.0", [hex: :table, repo: "hexpm", optional: true]}], "hexpm", "5b075393aea81b8ae74eadd1c28b1d87e8a63696c649d8293db7c4df3eb67535"}, "castore": {:hex, :castore, "1.0.16", "8a4f9a7c8b81cda88231a08fe69e3254f16833053b23fa63274b05cbc61d2a1e", [:mix], [], "hexpm", "33689203a0eaaf02fcd0e86eadfbcf1bd636100455350592e7e2628564022aaf"}, "cowboy": {:hex, :cowboy, "2.14.2", "4008be1df6ade45e4f2a4e9e2d22b36d0b5aba4e20b0a0d7049e28d124e34847", [:make, :rebar3], [{:cowlib, ">= 2.16.0 and < 3.0.0", [hex: :cowlib, repo: "hexpm", optional: false]}, {:ranch, ">= 1.8.0 and < 3.0.0", [hex: :ranch, repo: "hexpm", optional: false]}], "hexpm", "569081da046e7b41b5df36aa359be71a0c8874e5b9cff6f747073fc57baf1ab9"}, "cowlib": {:hex, :cowlib, "2.16.0", "54592074ebbbb92ee4746c8a8846e5605052f29309d3a873468d76cdf932076f", [:make, :rebar3], [], "hexpm", "7f478d80d66b747344f0ea7708c187645cfcc08b11aa424632f78e25bf05db51"}, - "deep_merge": {:hex, :deep_merge, "1.0.0", "b4aa1a0d1acac393bdf38b2291af38cb1d4a52806cf7a4906f718e1feb5ee961", [:mix], [], "hexpm", "ce708e5f094b9cd4e8f2be4f00d2f4250c4095be93f8cd6d018c753894885430"}, "earmark_parser": {:hex, :earmark_parser, "1.4.44", "f20830dd6b5c77afe2b063777ddbbff09f9759396500cdbe7523efd58d7a339c", [:mix], [], "hexpm", "4778ac752b4701a5599215f7030989c989ffdc4f6df457c5f36938cc2d2a2750"}, "ex_doc": {:hex, :ex_doc, "0.39.1", "e19d356a1ba1e8f8cfc79ce1c3f83884b6abfcb79329d435d4bbb3e97ccc286e", [:mix], [{:earmark_parser, "~> 1.4.44", [hex: :earmark_parser, repo: "hexpm", optional: false]}, {:makeup_c, ">= 0.1.0", [hex: :makeup_c, repo: "hexpm", optional: true]}, {:makeup_elixir, "~> 0.14 or ~> 1.0", [hex: :makeup_elixir, repo: "hexpm", optional: false]}, {:makeup_erlang, "~> 0.1 or ~> 1.0", [hex: :makeup_erlang, repo: "hexpm", optional: false]}, {:makeup_html, ">= 0.1.0", [hex: :makeup_html, repo: "hexpm", optional: true]}], "hexpm", "8abf0ed3e3ca87c0847dfc4168ceab5bedfe881692f1b7c45f4a11b232806865"}, "ex_parameterized": {:hex, :ex_parameterized, "1.3.7", "801f85fc4651cb51f11b9835864c6ed8c5e5d79b1253506b5bb5421e8ab2f050", [:mix], [], "hexpm", "1fb0dc4aa9e8c12ae23806d03bcd64a5a0fc9cd3f4c5602ba72561c9b54a625c"}, @@ -24,6 +22,5 @@ "nimble_parsec": {:hex, :nimble_parsec, "1.4.2", "8efba0122db06df95bfaa78f791344a89352ba04baedd3849593bfce4d0dc1c6", [:mix], [], "hexpm", "4b21398942dda052b403bbe1da991ccd03a053668d147d53fb8c4e0efe09c973"}, "protobuf": {:hex, :protobuf, "0.15.0", "c9fc1e9fc1682b05c601df536d5ff21877b55e2023e0466a3855cc1273b74dcb", [:mix], [{:jason, "~> 1.2", [hex: :jason, repo: "hexpm", optional: true]}], "hexpm", "5d7bb325319db1d668838d2691c31c7b793c34111aec87d5ee467a39dac6e051"}, "ranch": {:hex, :ranch, "2.2.0", "25528f82bc8d7c6152c57666ca99ec716510fe0925cb188172f41ce93117b1b0", [:make, :rebar3], [], "hexpm", "fa0b99a1780c80218a4197a59ea8d3bdae32fbff7e88527d7d8a4787eff4f8e7"}, - "statistex": {:hex, :statistex, "1.1.0", "7fec1eb2f580a0d2c1a05ed27396a084ab064a40cfc84246dbfb0c72a5c761e5", [:mix], [], "hexpm", "f5950ea26ad43246ba2cce54324ac394a4e7408fdcf98b8e230f503a0cba9cf5"}, "telemetry": {:hex, :telemetry, "1.3.0", "fedebbae410d715cf8e7062c96a1ef32ec22e764197f70cda73d82778d61e7a2", [:rebar3], [], "hexpm", "7015fc8919dbe63764f4b4b87a95b7c0996bd539e0d499be6ec9d7f3875b79e6"}, } From 1ddd9ec6e9f6807f36e0898e9833c4a86f4a3984 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:08:10 +0100 Subject: [PATCH 03/21] fix(lb): prevent pick/1 from crashing callers when the table is gone MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If a caller reads {lb_mod, lb_state} from persistent_term and a concurrent disconnect runs lb_mod.shutdown before the caller reaches RoundRobin.pick/1, both :ets.lookup and :ets.update_counter raise ArgumentError on the deleted table — crashing the calling RPC process rather than surfacing a tagged error. The pre-refactor code couldn't hit this path (persistent_term.get with a default of nil always returned safely), so this is a regression introduced by moving the pick onto the hot path. Rescue ArgumentError and map it to {:error, :no_channels}, matching the empty-table branch. Adds a unit test that shuts the table down and then picks, which reproduced the crash before the rescue. --- grpc/lib/grpc/client/load_balacing/round_robin.ex | 5 +++++ grpc/test/grpc/client/load_balacing/round_robin_test.exs | 9 +++++++++ 2 files changed, 14 insertions(+) diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index 859b632a5..4fa7514ba 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -34,6 +34,9 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do end @impl true + # A concurrent `disconnect/1` can delete the ETS table between a caller's + # `:persistent_term.get` and this lookup, turning the ETS BIFs into + # ArgumentError. Callers expect a tagged error, not a crashed RPC. def pick(%{tid: tid} = state) do case :ets.lookup(tid, @channels_key) do [{@channels_key, channels}] when tuple_size(channels) > 0 -> @@ -44,6 +47,8 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do _ -> {:error, :no_channels} end + rescue + ArgumentError -> {:error, :no_channels} end @impl true diff --git a/grpc/test/grpc/client/load_balacing/round_robin_test.exs b/grpc/test/grpc/client/load_balacing/round_robin_test.exs index ed58ab02b..3adb0f515 100644 --- a/grpc/test/grpc/client/load_balacing/round_robin_test.exs +++ b/grpc/test/grpc/client/load_balacing/round_robin_test.exs @@ -97,6 +97,15 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do end end + describe "pick/1 race with shutdown" do + test "returns :no_channels instead of raising when the table was deleted" do + {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) + :ok = RoundRobin.shutdown(state) + + assert {:error, :no_channels} = RoundRobin.pick(state) + end + end + describe "shutdown/1" do test "deletes the ETS table" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) From 48b9975f7a960915d8291efe8135a2f63ad4f809 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:09:07 +0100 Subject: [PATCH 04/21] test(connection): cover lb_mod.shutdown on disconnect and terminate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The LB owns its ETS table; Connection.shutdown_lb/2 is the only place that calls lb_mod.shutdown/1. Nothing in the suite currently breaks if that call is removed, so a refactor could silently leak an ETS table per gracefully- closed connection (and another per supervised-shutdown crash). Two regression tests: * disconnect/1 frees the table — captures the tid via :sys.get_state before disconnecting, asserts :ets.info returns :undefined afterwards. * terminate/2 frees the table — same shape, but stops the GenServer via GenServer.stop to exercise the crash path. Both rely on :round_robin being the only policy with an ETS table today; if PickFirst grows one, these tests still pass as-is. --- grpc/test/grpc/client/connection_test.exs | 42 +++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index 3d29e3276..5a592d47d 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -111,4 +111,46 @@ defmodule GRPC.Client.ConnectionTest do assert {:error, :no_connection} = Connection.pick_channel(channel) end end + + describe "LB ETS table lifecycle" do + test "disconnect/1 calls lb_mod.shutdown so the ETS table is freed", %{ + ref: ref, + target: target, + adapter: adapter + } do + {:ok, channel} = + Connection.connect(target, adapter: adapter, name: ref, lb_policy: :round_robin) + + tid = lb_tid(ref) + assert :ets.info(tid) != :undefined + + {:ok, _} = Connection.disconnect(channel) + + assert :ets.info(tid) == :undefined + end + + test "terminate/2 frees the ETS table when the process is killed", %{ + ref: ref, + target: target, + adapter: adapter + } do + {:ok, _channel} = + Connection.connect(target, adapter: adapter, name: ref, lb_policy: :round_robin) + + tid = lb_tid(ref) + pid = :global.whereis_name({Connection, ref}) + ref_mon = Process.monitor(pid) + + GenServer.stop(pid, :shutdown) + assert_receive {:DOWN, ^ref_mon, :process, ^pid, :shutdown}, 500 + + assert :ets.info(tid) == :undefined + end + end + + defp lb_tid(ref) do + pid = :global.whereis_name({Connection, ref}) + %{lb_state: %{tid: tid}} = :sys.get_state(pid) + tid + end end From 6bba30f95ec963c4586b24c17248ebbb2db8c722 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:10:23 +0100 Subject: [PATCH 05/21] docs(connection): drop references to the removed :refresh timer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous commit that moved the pick onto the hot path removed the 15s :refresh timer, but two pieces of module docs still described it: * "The orchestrator periodically refreshes the LB decision to adapt to changes." — replaced with a one-liner describing per-request dispatch. * "The orchestrator refreshes the LB pick every 15 seconds." — removed entirely along with its standalone Notes section. Docs now match behaviour. --- grpc/lib/grpc/client/connection.ex | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index cefe76e6d..497710b01 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -20,7 +20,8 @@ defmodule GRPC.Client.Connection do * The target string is resolved using a [Resolver](GRPC.Client.Resolver). * Depending on the target and service config, a load-balancing module is chosen (e.g. `PickFirst`, `RoundRobin`). - * The orchestrator periodically refreshes the LB decision to adapt to changes. + * Each call to `pick/2` dispatches to the LB module, which selects a channel + per request. DNS re-resolution reconciles the LB's channel list in place. ## Target syntax @@ -68,9 +69,6 @@ defmodule GRPC.Client.Connection do iex> GRPC.Client.Connection.disconnect(ch) {:ok, %GRPC.Channel{...}} - ## Notes - - * The orchestrator refreshes the LB pick every 15 seconds. """ use GenServer alias GRPC.Channel From 292d9584552baa53893b038e9ce5815daf8c48d6 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:11:06 +0100 Subject: [PATCH 06/21] test(connection): restore the real persistent_term shape in already_started test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This test erases the persistent_term entry, confirms Connection.connect returns :no_connection on the already_started branch, then puts an entry back so the subsequent disconnect completes cleanly. It used to restore a bare Channel struct, which predates the ETS refactor. The current shape is {lb_mod, lb_state}. The test was passing by accident because disconnect only erases the key (shape-agnostic) — but a future refactor that actually reads the restored entry would trip on the mismatch. Capture the real entry before erasing and restore it verbatim. --- grpc/test/grpc/client/connection_test.exs | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index 5a592d47d..b36536936 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -57,14 +57,14 @@ defmodule GRPC.Client.ConnectionTest do } do {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) - # Remove the persistent_term entry to simulate a missing LB state + # Capture the real entry so we can restore its shape for disconnect. + entry = :persistent_term.get({Connection, :lb_state, ref}) :persistent_term.erase({Connection, :lb_state, ref}) # Calling connect again will hit already_started → pick_channel → :no_connection assert {:error, :no_connection} = Connection.connect(target, adapter: adapter, name: ref) - # Restore entry so disconnect works cleanly - :persistent_term.put({Connection, :lb_state, ref}, channel) + :persistent_term.put({Connection, :lb_state, ref}, entry) Connection.disconnect(channel) end end From d98c47470c4ce69768bbe8bbe984aee0793de8d4 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:28:14 +0100 Subject: [PATCH 07/21] fix(lb): prevent PickFirst from going stale after DNS re-resolution MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The persistent_term entry stores {lb_mod, lb_state} written once at init and never rewritten during the connection's life (that's the whole point of the ETS refactor). This works as long as lb_mod.update/2 mutates external state and returns a state value the caller already holds — which is how RoundRobin behaves (the tid is stable). PickFirst used to return a NEW state map from update/2: def update(_state, [first | _]), do: {:ok, %{current: first}} The Connection GenServer's own state got the new map, but persistent_term kept the original %{current: first_backend_at_connect_time}. So pick_channel silently kept returning the dead backend forever after any DNS change. Not covered by tests because every re_resolve_test scenario passes lb_policy: :round_robin. Fix: give PickFirst the same ETS-backed state shape as RoundRobin. init creates a one-row table {:current, channel}, update mutates it in place, pick reads it. State is %{tid: tid} — stable across update/2. Tests: * Rewrites pick_first_test.exs against the new shape (11 tests). * Adds a re-resolve integration test using the default LB (PickFirst): connect with one backend, swap the DNS answer, assert pick_channel returns the new backend. Reproduced the bug before this fix. --- .../grpc/client/load_balacing/pick_first.ex | 45 +++++++++++++---- grpc/test/grpc/client/dns_resolver_test.exs | 27 ++++++++++ .../client/load_balacing/pick_first_test.exs | 50 ++++++++++++++++--- 3 files changed, 106 insertions(+), 16 deletions(-) diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index dd438e450..9c1043646 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -2,27 +2,54 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do @moduledoc """ Pick-first load balancer: always returns the first channel in the list. - Stateless and process-local — no ETS, no GenServer. `update/2` swaps the - current pick when the channel list changes. + Stores the current pick in an ETS table so `update/2` can swap it without + returning a new state — callers that captured `lb_state` (the Connection + GenServer, the shared persistent_term entry) keep seeing the right channel. """ @behaviour GRPC.Client.LoadBalancing + @current_key :current + @impl true def init(opts) do case Keyword.get(opts, :channels, []) do - [] -> {:error, :no_channels} - [first | _] -> {:ok, %{current: first}} + [] -> + {:error, :no_channels} + + [first | _] -> + tid = :ets.new(:grpc_lb_pick_first, [:set, :public, read_concurrency: true]) + :ets.insert(tid, {@current_key, first}) + {:ok, %{tid: tid}} end end @impl true - def pick(%{current: nil}), do: {:error, :no_channels} - def pick(%{current: channel} = state), do: {:ok, channel, state} + def pick(%{tid: tid} = state) do + case :ets.lookup(tid, @current_key) do + [{@current_key, nil}] -> {:error, :no_channels} + [{@current_key, channel}] -> {:ok, channel, state} + [] -> {:error, :no_channels} + end + rescue + ArgumentError -> {:error, :no_channels} + end @impl true - def update(_state, [first | _]), do: {:ok, %{current: first}} - def update(_state, []), do: {:ok, %{current: nil}} + def update(%{tid: tid} = state, [first | _]) do + :ets.insert(tid, {@current_key, first}) + {:ok, state} + end + + def update(%{tid: tid} = state, []) do + :ets.insert(tid, {@current_key, nil}) + {:ok, state} + end @impl true - def shutdown(_state), do: :ok + def shutdown(%{tid: tid}) do + :ets.delete(tid) + :ok + rescue + ArgumentError -> :ok + end end diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 617cf0ddc..5b186651d 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -431,6 +431,33 @@ defmodule GRPC.Client.ReResolveTest do end end + describe "pick_first reconciliation" do + test "pick_channel returns the new backend after DNS replaces it", ctx do + {:ok, channel} = + connect_with_resolver( + ctx.ref, + ctx.resolver, + ctx.adapter, + [%{address: "10.0.0.1", port: 50051}], + [] + ) + + {:ok, before} = Connection.pick_channel(channel) + assert before.host == "10.0.0.1" + + stub(ctx.resolver, :resolve, fn _target -> + {:ok, %{addresses: [%{address: "10.0.0.9", port: 50051}], service_config: nil}} + end) + + Process.sleep(@wait) + + {:ok, picked} = Connection.pick_channel(channel) + assert picked.host == "10.0.0.9" + + disconnect_and_wait(channel) + end + end + describe "pick_channel per-request rotation" do test "round-robin rotates across all backends on successive picks", ctx do {:ok, channel} = diff --git a/grpc/test/grpc/client/load_balacing/pick_first_test.exs b/grpc/test/grpc/client/load_balacing/pick_first_test.exs index 1b212f841..1e7207142 100644 --- a/grpc/test/grpc/client/load_balacing/pick_first_test.exs +++ b/grpc/test/grpc/client/load_balacing/pick_first_test.exs @@ -8,9 +8,18 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do do: Enum.map(pairs, fn {h, p} -> %Channel{host: h, port: p, ref: {h, p}} end) describe "init/1" do - test "returns the first channel as current" do + test "creates an ETS table and returns the tid in state" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}, {"b", 2}])) - assert state == %{current: %Channel{host: "a", port: 1, ref: {"a", 1}}} + assert %{tid: tid} = state + assert is_reference(tid) + assert :ets.info(tid) != :undefined + PickFirst.shutdown(state) + end + + test "seeds the table with the first channel" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}, {"b", 2}])) + assert {:ok, %Channel{host: "a", port: 1}, ^state} = PickFirst.pick(state) + PickFirst.shutdown(state) end test "rejects empty channel lists" do @@ -29,30 +38,57 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do for _ <- 1..3 do assert {:ok, %Channel{host: "a", port: 1}, ^state} = PickFirst.pick(state) end + + PickFirst.shutdown(state) end test "returns :no_channels when current is nil" do - assert {:error, :no_channels} = PickFirst.pick(%{current: nil}) + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + {:ok, _} = PickFirst.update(state, []) + assert {:error, :no_channels} = PickFirst.pick(state) + PickFirst.shutdown(state) + end + + test "returns :no_channels instead of raising when the table was deleted" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + :ok = PickFirst.shutdown(state) + assert {:error, :no_channels} = PickFirst.pick(state) end end describe "update/2" do - test "swaps current to the first of the new channels" do + test "swaps current in place without changing the tid" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + original_tid = state.tid + {:ok, new_state} = PickFirst.update(state, channels([{"x", 9}, {"y", 8}])) + assert new_state.tid == original_tid assert {:ok, %Channel{host: "x", port: 9}, _} = PickFirst.pick(new_state) + PickFirst.shutdown(new_state) end - test "sets current to nil on empty list" do + test "clears current to nil on empty list" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) - assert {:ok, %{current: nil}} = PickFirst.update(state, []) + {:ok, state} = PickFirst.update(state, []) + assert {:error, :no_channels} = PickFirst.pick(state) + PickFirst.shutdown(state) end end describe "shutdown/1" do - test "returns :ok (no-op, no external resources)" do + test "deletes the ETS table" do + {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + tid = state.tid + assert :ets.info(tid) != :undefined + + :ok = PickFirst.shutdown(state) + assert :ets.info(tid) == :undefined + end + + test "is idempotent on already-deleted tables" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) + :ok = PickFirst.shutdown(state) assert :ok = PickFirst.shutdown(state) end end From 88b3de14220d30cfd4d2308a94b8d0005065d7a8 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 12:47:31 +0100 Subject: [PATCH 08/21] refactor: swap persistent_term for a shared ETS registry MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Follow-up on @sleipnir's comment: "we can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module." https://github.com/elixir-grpc/grpc/pull/520#issuecomment-4255477950 His sample used ETS keyed by ref with no persistent_term. My earlier commit kept persistent_term as a write-once pointer — worked fine, but wasn't what he showed. This matches the sample. New module: GRPC.Client.LoadBalancing.Registry. Small GenServer supervised by the client Application. Owns a named public ETS table (:grpc_client_lb_registry). Three functions — put/2, lookup/1, delete/1. Lookup rescues ArgumentError so a racing :ets.delete on shutdown gives :error instead of crashing the RPC. Connection now: * Registry.put on init (ref -> {lb_mod, lb_state}) * Registry.lookup in pick_channel * Registry.delete in disconnect and terminate No persistent_term left in the LB path. LB behaviour unchanged, per-LB ETS tables stay as they were. 282 tests pass. connection_test.exs pokes Registry directly where it used to poke persistent_term. BTW — also considered putting {lb_mod, lb_state} straight on the Channel struct. Drops both persistent_term and the registry, same runtime behaviour, one fewer indirection on every pick. Flagged in the PR for @sleipnir — happy to swap if that reads cleaner to him. --- grpc/lib/grpc/client/application.ex | 1 + grpc/lib/grpc/client/connection.ex | 14 ++++----- .../grpc/client/load_balacing/pick_first.ex | 2 +- .../lib/grpc/client/load_balacing/registry.ex | 31 +++++++++++++++++++ .../grpc/client/load_balacing/round_robin.ex | 4 +-- grpc/test/grpc/client/connection_test.exs | 25 +++++++-------- grpc/test/grpc/client/dns_resolver_test.exs | 2 +- 7 files changed, 53 insertions(+), 26 deletions(-) create mode 100644 grpc/lib/grpc/client/load_balacing/registry.ex diff --git a/grpc/lib/grpc/client/application.ex b/grpc/lib/grpc/client/application.ex index 16d4a230f..6077fb5ed 100644 --- a/grpc/lib/grpc/client/application.ex +++ b/grpc/lib/grpc/client/application.ex @@ -4,6 +4,7 @@ defmodule GRPC.Client.Application do def start(_type, _args) do children = [ + GRPC.Client.LoadBalancing.Registry, {DynamicSupervisor, [name: GRPC.Client.Supervisor]} ] diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 497710b01..033e390fb 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -72,6 +72,7 @@ defmodule GRPC.Client.Connection do """ use GenServer alias GRPC.Channel + alias GRPC.Client.LoadBalancing.Registry require Logger @@ -119,10 +120,7 @@ defmodule GRPC.Client.Connection do def init(%__MODULE__{} = state) do Process.flag(:trap_exit, true) - :persistent_term.put( - {__MODULE__, :lb_state, state.virtual_channel.ref}, - {state.lb_mod, state.lb_state} - ) + Registry.put(state.virtual_channel.ref, {state.lb_mod, state.lb_state}) state = if function_exported?(state.resolver, :init, 2) do @@ -237,8 +235,8 @@ defmodule GRPC.Client.Connection do """ @spec pick_channel(Channel.t(), keyword()) :: {:ok, Channel.t()} | {:error, term()} def pick_channel(%Channel{ref: ref} = _channel, _opts \\ []) do - case :persistent_term.get({__MODULE__, :lb_state, ref}, nil) do - {lb_mod, lb_state} when not is_nil(lb_mod) -> + case Registry.lookup(ref) do + {:ok, {lb_mod, lb_state}} when not is_nil(lb_mod) -> case lb_mod.pick(lb_state) do {:ok, %Channel{} = channel, _new_state} -> {:ok, channel} {:error, _} -> {:error, :no_connection} @@ -278,7 +276,7 @@ defmodule GRPC.Client.Connection do shutdown_lb(state.lb_mod, state.lb_state) resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} - :persistent_term.erase({__MODULE__, :lb_state, channel.ref}) + Registry.delete(channel.ref) if Map.has_key?(state, :real_channels) do Enum.map(state.real_channels, fn @@ -347,7 +345,7 @@ defmodule GRPC.Client.Connection do @impl GenServer def terminate(_reason, %{virtual_channel: %{ref: ref}} = state) do shutdown_lb(Map.get(state, :lb_mod), Map.get(state, :lb_state)) - :persistent_term.erase({__MODULE__, :lb_state, ref}) + Registry.delete(ref) rescue _ -> :ok end diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index 9c1043646..e679c1534 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -4,7 +4,7 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do Stores the current pick in an ETS table so `update/2` can swap it without returning a new state — callers that captured `lb_state` (the Connection - GenServer, the shared persistent_term entry) keep seeing the right channel. + GenServer, the LB registry) keep seeing the right channel. """ @behaviour GRPC.Client.LoadBalancing diff --git a/grpc/lib/grpc/client/load_balacing/registry.ex b/grpc/lib/grpc/client/load_balacing/registry.ex new file mode 100644 index 000000000..ae373434a --- /dev/null +++ b/grpc/lib/grpc/client/load_balacing/registry.ex @@ -0,0 +1,31 @@ +defmodule GRPC.Client.LoadBalancing.Registry do + @moduledoc false + use GenServer + + @table :grpc_client_lb_registry + + @spec start_link(any()) :: GenServer.on_start() + def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__) + + @spec put(reference() | term(), {module(), term()}) :: true + def put(ref, value), do: :ets.insert(@table, {ref, value}) + + @spec lookup(reference() | term()) :: {:ok, {module(), term()}} | :error + def lookup(ref) do + case :ets.lookup(@table, ref) do + [{^ref, value}] -> {:ok, value} + [] -> :error + end + rescue + ArgumentError -> :error + end + + @spec delete(reference() | term()) :: true + def delete(ref), do: :ets.delete(@table, ref) + + @impl true + def init(_) do + :ets.new(@table, [:set, :public, :named_table, read_concurrency: true]) + {:ok, nil} + end +end diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index 4fa7514ba..360514a96 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -35,8 +35,8 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do @impl true # A concurrent `disconnect/1` can delete the ETS table between a caller's - # `:persistent_term.get` and this lookup, turning the ETS BIFs into - # ArgumentError. Callers expect a tagged error, not a crashed RPC. + # registry lookup and this pick, turning the ETS BIFs into ArgumentError. + # Callers expect a tagged error, not a crashed RPC. def pick(%{tid: tid} = state) do case :ets.lookup(tid, @channels_key) do [{@channels_key, channels}] when tuple_size(channels) > 0 -> diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index b36536936..dbd88b117 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -3,6 +3,7 @@ defmodule GRPC.Client.ConnectionTest do alias GRPC.Channel alias GRPC.Client.Connection + alias GRPC.Client.LoadBalancing.Registry setup do %{ @@ -14,13 +15,13 @@ defmodule GRPC.Client.ConnectionTest do end describe "pick_channel/2" do - test "returns {:error, :no_connection} when no persistent_term entry exists", %{ref: ref} do + test "returns {:error, :no_connection} when the ref is not registered", %{ref: ref} do channel = %Channel{ref: ref} assert {:error, :no_connection} = Connection.pick_channel(channel) end - test "returns {:ok, channel} when a channel is stored in persistent_term", %{ + test "returns {:ok, channel} once a connection has registered its LB state", %{ ref: ref, target: target, adapter: adapter @@ -50,21 +51,17 @@ defmodule GRPC.Client.ConnectionTest do Connection.disconnect(first_channel) end - test "returns {:error, :no_connection} when already_started but no persistent_term entry", %{ - ref: ref, - target: target, - adapter: adapter - } do + test "returns {:error, :no_connection} when already_started but the registry entry is missing", + %{ref: ref, target: target, adapter: adapter} do {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) - # Capture the real entry so we can restore its shape for disconnect. - entry = :persistent_term.get({Connection, :lb_state, ref}) - :persistent_term.erase({Connection, :lb_state, ref}) + {:ok, entry} = Registry.lookup(ref) + Registry.delete(ref) # Calling connect again will hit already_started → pick_channel → :no_connection assert {:error, :no_connection} = Connection.connect(target, adapter: adapter, name: ref) - :persistent_term.put({Connection, :lb_state, ref}, entry) + Registry.put(ref, entry) Connection.disconnect(channel) end end @@ -85,7 +82,7 @@ defmodule GRPC.Client.ConnectionTest do assert_receive {:DOWN, ^ref_mon, :process, ^pid, _reason}, 500 end - test "pick_channel returns {:error, :no_connection} after disconnect (persistent_term is erased)", + test "pick_channel returns {:error, :no_connection} after disconnect (registry entry is deleted)", %{ref: ref, target: target, adapter: adapter} do {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) @@ -95,8 +92,8 @@ defmodule GRPC.Client.ConnectionTest do end end - describe "terminate/2 - persistent_term cleanup on process kill" do - test "persistent_term is erased when process is killed without disconnect", %{ + describe "terminate/2 - registry cleanup on process kill" do + test "registry entry is deleted when process is killed without disconnect", %{ ref: ref, target: target, adapter: adapter diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 5b186651d..78aaac153 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -898,7 +898,7 @@ defmodule GRPC.Client.ReResolveTest do end end - describe "stale persistent_term prevention" do + describe "unhealthy-pick fallback" do setup ctx do Application.put_env(:grpc, :grpc_test_failing_hosts, ["10.0.0.99"]) on_exit(fn -> Application.delete_env(:grpc, :grpc_test_failing_hosts) end) From 1f5d7e88cc589ff976c757a8da2f5eb646107270 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 13:06:48 +0100 Subject: [PATCH 09/21] test(registry): direct unit coverage for put/lookup/delete MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Registry is only exercised through Connection today — if someone regresses lookup/1 or delete/1 in isolation, no test fails. Add direct round-trip tests for the public API (arbitrary key shapes, overwrite, missing key returns :error, delete on unknown key is safe). The ArgumentError rescue in lookup/1 is covered end-to-end by the concurrent pick+disconnect test — not duplicated here. --- .../client/load_balacing/registry_test.exs | 62 +++++++++++++++++++ 1 file changed, 62 insertions(+) create mode 100644 grpc/test/grpc/client/load_balacing/registry_test.exs diff --git a/grpc/test/grpc/client/load_balacing/registry_test.exs b/grpc/test/grpc/client/load_balacing/registry_test.exs new file mode 100644 index 000000000..2454c0bc3 --- /dev/null +++ b/grpc/test/grpc/client/load_balacing/registry_test.exs @@ -0,0 +1,62 @@ +defmodule GRPC.Client.LoadBalancing.RegistryTest do + @moduledoc """ + Direct tests for the shared LB registry. + + The registry is a process-backed named ETS table started by + GRPC.Client.Application. These tests exercise the public API + (put/2, lookup/1, delete/1) and the ArgumentError rescue path in + lookup/1 that guards against races where the table is gone. + """ + use ExUnit.Case, async: false + + alias GRPC.Client.LoadBalancing.Registry + + describe "put/2 + lookup/1" do + test "round-trips a {module, state} value keyed by a reference" do + ref = make_ref() + Registry.put(ref, {SomeLB, %{tid: :fake_tid}}) + + assert {:ok, {SomeLB, %{tid: :fake_tid}}} = Registry.lookup(ref) + + Registry.delete(ref) + end + + test "accepts arbitrary keys (reference, tuple, atom)" do + for key <- [make_ref(), {:conn, 1}, :atom_key] do + Registry.put(key, {LB, :state}) + assert {:ok, {LB, :state}} = Registry.lookup(key) + Registry.delete(key) + end + end + + test "overwrites an existing entry" do + ref = make_ref() + Registry.put(ref, {LB, :v1}) + Registry.put(ref, {LB, :v2}) + + assert {:ok, {LB, :v2}} = Registry.lookup(ref) + + Registry.delete(ref) + end + end + + describe "lookup/1 when the entry is missing" do + test "returns :error" do + assert :error = Registry.lookup(make_ref()) + end + end + + describe "delete/1" do + test "removes the entry; subsequent lookup returns :error" do + ref = make_ref() + Registry.put(ref, {LB, :state}) + Registry.delete(ref) + + assert :error = Registry.lookup(ref) + end + + test "is safe on an unknown key" do + assert true = Registry.delete(make_ref()) + end + end +end From b3b0d98430ac17f1999f5a0d8caf6707ddffe63d Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 13:15:05 +0100 Subject: [PATCH 10/21] test(connection): cover pick_channel races with disconnect MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Concurrent RPCs must not crash when disconnect is tearing down LB state. This is the scenario the ArgumentError rescues in Registry.lookup/1 and the LB modules' pick/1 exist to handle — but nothing actually proved it end-to-end. Spawns 50 processes doing 100 picks each, fires disconnect mid-flight. Every result must be either {:ok, %Channel{}} or {:error, :no_connection}. Anything else means we crashed a caller. --- grpc/test/grpc/client/connection_test.exs | 49 +++++++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index dbd88b117..b328ca27a 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -145,6 +145,55 @@ defmodule GRPC.Client.ConnectionTest do end end + describe "pick_channel/2 races with disconnect/1" do + # Concurrent RPCs must never crash when a disconnect is tearing down the + # LB state. This exercises both the Registry.lookup ArgumentError rescue + # (table gone) and the per-LB pick rescue (per-connection table gone). + test "many concurrent picks complete without crashing while disconnect runs", %{ + ref: ref, + target: target, + adapter: adapter + } do + {:ok, channel} = + Connection.connect(target, adapter: adapter, name: ref, lb_policy: :round_robin) + + parent = self() + picker_count = 50 + picks_per_proc = 100 + + pickers = + for i <- 1..picker_count do + spawn_link(fn -> + results = + for _ <- 1..picks_per_proc do + Connection.pick_channel(channel) + end + + send(parent, {:done, i, results}) + end) + end + + # Give pickers a head start so some land before, during, and after. + Process.sleep(2) + {:ok, _} = Connection.disconnect(channel) + + for _ <- 1..picker_count do + assert_receive {:done, _, results}, 2_000 + + # Every result must be well-shaped — either a valid channel or the + # documented error. Anything else means we crashed a caller. + for r <- results do + assert match?({:ok, %Channel{}}, r) or r == {:error, :no_connection} + end + end + + # And confirm the pickers actually ran to completion. + for pid <- pickers do + refute Process.alive?(pid) + end + end + end + defp lb_tid(ref) do pid = :global.whereis_name({Connection, ref}) %{lb_state: %{tid: tid}} = :sys.get_state(pid) From 189f46e9094edeac6d8af50fff8c62b9d077c98a Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 13:16:34 +0100 Subject: [PATCH 11/21] test(connection): assert no ETS leaks across 500 connect/disconnect cycles Mirrors the regression pattern from PR #509, which added a 100-cycle test to prove the persistent_term leak fix. Same idea, applied to the tables the refactor introduced: the shared Registry and the per-LB ETS tables. Snapshots :ets.all() and the Registry size before the loop, cycles 500 connect+disconnect pairs, asserts the Registry returns to its starting size and no more than a handful of new tables exist (VM may create a few incidentally during the run). --- grpc/test/grpc/client/connection_test.exs | 29 +++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index b328ca27a..59c764f9e 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -194,6 +194,35 @@ defmodule GRPC.Client.ConnectionTest do end end + describe "resource leaks over repeated connect/disconnect" do + # Mirrors the regression style used in #509 for the persistent_term leak: + # cycle connect/disconnect many times and assert zero growth in the + # long-lived tables we own (the registry + the count of ETS tables). + test "500 cycles leave the registry empty and no per-LB tables leak", %{ + target: target, + adapter: adapter + } do + before_table_count = length(:ets.all()) + before_registry_size = :ets.info(:grpc_client_lb_registry, :size) + + for _ <- 1..500 do + ref = make_ref() + {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) + {:ok, _} = Connection.disconnect(channel) + end + + after_registry_size = :ets.info(:grpc_client_lb_registry, :size) + after_table_count = length(:ets.all()) + + assert after_registry_size == before_registry_size, + "registry leaked: before=#{before_registry_size} after=#{after_registry_size}" + + # Allow small slop for tables the VM may have created incidentally. + assert after_table_count - before_table_count <= 5, + "ETS tables leaked: before=#{before_table_count} after=#{after_table_count}" + end + end + defp lb_tid(ref) do pid = :global.whereis_name({Connection, ref}) %{lb_state: %{tid: tid}} = :sys.get_state(pid) From 43c98d299568c61c89aeb6046ed7e2cc77fdee8e Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 13:17:54 +0100 Subject: [PATCH 12/21] test(re_resolve): cover pick races with a reconcile that shrinks backends MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Derived from research into long-lived gRPC clients (GCP, Temporal, LiveKit, our own unleash_fresha) — they all hit rapid DNS churn where the backend count drops between picks. The risk is that a concurrent RoundRobin pick indexes into a tuple that update/2 just replaced with a smaller one. Starts with 8 backends, launches 30 processes doing 200 picks each, triggers a DNS reconcile down to 2 backends mid-flight. Every pick must return {:ok, %Channel{}} — no :badarg from elem/2, no stale index arithmetic, no crash. After the reconcile settles, all picks must come from the two survivors only. --- grpc/test/grpc/client/dns_resolver_test.exs | 70 +++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 78aaac153..6b0118f7a 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -27,6 +27,7 @@ defmodule GRPC.Client.ReResolveTest do use GRPC.Client.DataCase, async: false import Mox + alias GRPC.Channel alias GRPC.Client.Connection @resolve_interval 200 @@ -458,6 +459,75 @@ defmodule GRPC.Client.ReResolveTest do end end + describe "pick_channel during shrinking reconcile" do + test "concurrent picks stay correct while backends are removed", ctx do + large = + for i <- 1..8, do: %{address: "10.0.0.#{i}", port: 50051} + + {:ok, channel} = + connect_with_resolver( + ctx.ref, + ctx.resolver, + ctx.adapter, + large, + lb_policy: :round_robin + ) + + small = [ + %{address: "10.0.0.1", port: 50051}, + %{address: "10.0.0.2", port: 50051} + ] + + parent = self() + picker_count = 30 + picks_per_proc = 200 + + pickers = + for i <- 1..picker_count do + spawn_link(fn -> + results = + for _ <- 1..picks_per_proc do + Connection.pick_channel(channel) + end + + send(parent, {:done, i, results}) + end) + end + + stub(ctx.resolver, :resolve, fn _target -> + {:ok, %{addresses: small, service_config: nil}} + end) + + Process.sleep(@wait) + + for _ <- 1..picker_count do + assert_receive {:done, _, results}, 2_000 + + for r <- results do + assert match?({:ok, %Channel{}}, r), + "pick returned #{inspect(r)} — expected {:ok, %Channel{}}" + end + end + + Process.sleep(@wait) + + hosts = + for _ <- 1..20 do + {:ok, picked} = Connection.pick_channel(channel) + picked.host + end + + assert Enum.all?(hosts, &(&1 in ["10.0.0.1", "10.0.0.2"])), + "picked from an already-removed backend: #{inspect(hosts)}" + + for pid <- pickers do + refute Process.alive?(pid) + end + + disconnect_and_wait(channel) + end + end + describe "pick_channel per-request rotation" do test "round-robin rotates across all backends on successive picks", ctx do {:ok, channel} = From dbab9470806627a2da225dd9194b34bb1c48802e Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 14:22:58 +0100 Subject: [PATCH 13/21] fix(lb): make update/2 required; remove broken init/1 fallback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The earlier refactor marked update/2 as @optional_callbacks and added a reconcile_lb fallback that called init/1 when the LB skipped update/2. The fallback was broken: init/1 creates a fresh ETS table and the previous lb_state's table orphans, leaking one table per DNS reconcile. For our own LBs the fallback was dead code (both implement update/2), but the trap was real for any third party writing a custom LB. update/2 is the only way to reconcile a stateful LB without leaking — "optional" was the wrong signal. Make it required so the compiler warns anyone who skips it, delete the fallback, let reconcile_lb be the single line it should have been. shutdown/1 stays optional — strategies with no external resources can genuinely omit it. --- grpc/lib/grpc/client/connection.ex | 6 +----- grpc/lib/grpc/client/load_balacing.ex | 10 +++++----- 2 files changed, 6 insertions(+), 10 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 033e390fb..2b554eab6 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -444,11 +444,7 @@ defmodule GRPC.Client.Connection do end defp reconcile_lb(lb_mod, lb_state, new_channels) do - if lb_state != nil and function_exported?(lb_mod, :update, 2) do - lb_mod.update(lb_state, new_channels) - else - lb_mod.init(channels: new_channels) - end + lb_mod.update(lb_state, new_channels) end defp connected_channels(real_channels) do diff --git a/grpc/lib/grpc/client/load_balacing.ex b/grpc/lib/grpc/client/load_balacing.ex index 7eef66bf5..17c93ac70 100644 --- a/grpc/lib/grpc/client/load_balacing.ex +++ b/grpc/lib/grpc/client/load_balacing.ex @@ -11,13 +11,13 @@ defmodule GRPC.Client.LoadBalancing do * `init/1` — build initial state from `[channels: [Channel.t()]]`. * `pick/1` — choose a `Channel` for the next request. - - Optional callbacks (used by strategies that maintain live state, e.g. ETS): - * `update/2` — reconcile the state with a new channel list in place, without tearing it down. Called by `Connection` on DNS re-resolution. + + Optional callbacks: + * `shutdown/1` — release any resources held by the strategy. Called on - disconnect. + disconnect. Strategies that hold no external resources can omit it. """ alias GRPC.Channel @@ -31,5 +31,5 @@ defmodule GRPC.Client.LoadBalancing do @callback shutdown(state :: any()) :: :ok - @optional_callbacks update: 2, shutdown: 1 + @optional_callbacks shutdown: 1 end From 06ad6be19d15a09e53cbda207e5ce233d8d6314a Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 21:52:10 +0100 Subject: [PATCH 14/21] refactor(lb): drop shutdown/1, let BEAM reclaim ETS; stop Connection on resolver worker crash MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit LB strategies' ETS tables are owned by the Connection GenServer (lb_mod.init now runs inside init/1), so BEAM reclaims them automatically on exit — no explicit shutdown callback needed. Merged the two connect paths through a single finalize_connection/2 helper. On resolver worker death, stop the Connection instead of re-initializing. The old re-init path created a crash-loop when init itself kept failing (bad config, unreachable control plane). Unrelated :EXIT signals are now logged and ignored so stray links from user code can't take Connection down. Co-Authored-By: Claude Opus 4.7 (1M context) --- grpc/lib/grpc/client/connection.ex | 132 +++++++++--------- grpc/lib/grpc/client/load_balacing.ex | 17 +-- .../grpc/client/load_balacing/pick_first.ex | 14 +- .../grpc/client/load_balacing/round_robin.ex | 13 +- grpc/test/grpc/client/connection_test.exs | 13 +- grpc/test/grpc/client/dns_resolver_test.exs | 41 ++++-- .../client/load_balacing/pick_first_test.exs | 29 +--- .../client/load_balacing/round_robin_test.exs | 43 ++---- 8 files changed, 126 insertions(+), 176 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 2b554eab6..afc46feac 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -120,22 +120,32 @@ defmodule GRPC.Client.Connection do def init(%__MODULE__{} = state) do Process.flag(:trap_exit, true) - Registry.put(state.virtual_channel.ref, {state.lb_mod, state.lb_state}) + connected = connected_channels(state.real_channels) - state = - if function_exported?(state.resolver, :init, 2) do - {:ok, resolver_state} = - state.resolver.init(state.resolver_target, - connection_pid: self(), - connect_opts: state.connect_opts - ) + case state.lb_mod.init(channels: connected) do + {:ok, lb_state} -> + state = %{state | lb_state: lb_state} + Registry.put(state.virtual_channel.ref, {state.lb_mod, lb_state}) - %{state | resolver_state: resolver_state} - else - state - end + state = + if function_exported?(state.resolver, :init, 2) do + {:ok, resolver_state} = + state.resolver.init(state.resolver_target, + connection_pid: self(), + connect_opts: state.connect_opts + ) + + %{state | resolver_state: resolver_state} + else + state + end + + {:ok, state} - {:ok, state} + {:error, reason} -> + disconnect_real_channels(state.real_channels, state.adapter) + {:stop, reason} + end end @doc """ @@ -176,16 +186,10 @@ defmodule GRPC.Client.Connection do case DynamicSupervisor.start_child(GRPC.Client.Supervisor, child_spec(initial_state)) do {:ok, _pid} -> - {:ok, ch} + finalize_connection(ch, opts) {:error, {:already_started, _pid}} -> - case pick_channel(ch, opts) do - {:ok, %Channel{} = channel} -> - {:ok, channel} - - _ -> - {:error, :no_connection} - end + finalize_connection(ch, opts) {:error, reason} -> {:error, reason} @@ -273,8 +277,6 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end - shutdown_lb(state.lb_mod, state.lb_state) - resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} Registry.delete(channel.ref) @@ -302,22 +304,18 @@ defmodule GRPC.Client.Connection do {:noreply, state} end - def handle_info({:EXIT, _pid, reason}, %{resolver: resolver, resolver_state: rs} = state) - when not is_nil(rs) and reason != :normal do - Logger.warning("Resolver worker exited: #{inspect(reason)}, re-initializing") + def handle_info({:EXIT, pid, reason}, %{resolver_state: %{worker_pid: pid}} = state) + when reason != :normal do + Logger.warning("Resolver worker exited: #{inspect(reason)}, stopping Connection") + {:stop, {:resolver_exited, reason}, state} + end - state = - if function_exported?(resolver, :init, 2) do - case resolver.init(state.resolver_target, - connection_pid: self(), - connect_opts: state.connect_opts - ) do - {:ok, new_rs} -> %{state | resolver_state: new_rs} - {:error, _} -> %{state | resolver_state: nil} - end - else - %{state | resolver_state: nil} - end + def handle_info({:EXIT, _pid, :normal}, state), do: {:noreply, state} + + def handle_info({:EXIT, pid, reason}, state) do + Logger.warning( + "#{inspect(__MODULE__)} received :EXIT from #{inspect(pid)} reason: #{inspect(reason)}" + ) {:noreply, state} end @@ -343,8 +341,7 @@ defmodule GRPC.Client.Connection do end @impl GenServer - def terminate(_reason, %{virtual_channel: %{ref: ref}} = state) do - shutdown_lb(Map.get(state, :lb_mod), Map.get(state, :lb_state)) + def terminate(_reason, %{virtual_channel: %{ref: ref}}) do Registry.delete(ref) rescue _ -> :ok @@ -352,16 +349,21 @@ defmodule GRPC.Client.Connection do def terminate(_reason, _state), do: :ok - defp shutdown_lb(lb_mod, lb_state) - when not is_nil(lb_mod) and not is_nil(lb_state) do - if function_exported?(lb_mod, :shutdown, 1) do - lb_mod.shutdown(lb_state) + defp finalize_connection(%Channel{} = ch, opts) do + case pick_channel(ch, opts) do + {:ok, %Channel{} = channel} -> {:ok, channel} + _ -> {:error, :no_connection} end - rescue - _ -> :ok end - defp shutdown_lb(_lb_mod, _lb_state), do: :ok + defp disconnect_real_channels(real_channels, adapter) when is_map(real_channels) do + Enum.each(real_channels, fn + {_key, {:connected, ch}} -> do_disconnect(adapter, ch) + _ -> :ok + end) + end + + defp disconnect_real_channels(_real_channels, _adapter), do: :ok defp handle_resolve_result({:ok, %{addresses: []}}, state), do: state @@ -574,23 +576,18 @@ defmodule GRPC.Client.Connection do real_channels = build_real_channels(addresses, base_state.virtual_channel, norm_opts, adapter) - connected = connected_channels(real_channels) - - case lb_mod.init(channels: connected) do - {:ok, lb_state} -> - {:ok, %Channel{} = picked, _} = lb_mod.pick(lb_state) + case connected_channels(real_channels) do + [] -> + disconnect_real_channels(real_channels, adapter) + {:error, :no_channels} + _connected -> {:ok, %__MODULE__{ base_state | lb_mod: lb_mod, - lb_state: lb_state, - virtual_channel: picked, real_channels: real_channels }} - - {:error, reason} -> - {:error, reason} end end @@ -599,16 +596,17 @@ defmodule GRPC.Client.Connection do vc = base_state.virtual_channel lb_mod = GRPC.Client.LoadBalancing.PickFirst - with {:ok, ch} <- connect_real_channel(vc, host, port, norm_opts, adapter), - {:ok, lb_state} <- lb_mod.init(channels: [ch]) do - {:ok, - %__MODULE__{ - base_state - | virtual_channel: ch, - real_channels: %{build_address_key(host, port) => {:connected, ch}}, - lb_mod: lb_mod, - lb_state: lb_state - }} + case connect_real_channel(vc, host, port, norm_opts, adapter) do + {:ok, ch} -> + {:ok, + %__MODULE__{ + base_state + | real_channels: %{build_address_key(host, port) => {:connected, ch}}, + lb_mod: lb_mod + }} + + {:error, reason} -> + {:error, reason} end end diff --git a/grpc/lib/grpc/client/load_balacing.ex b/grpc/lib/grpc/client/load_balacing.ex index 17c93ac70..0d9a61e77 100644 --- a/grpc/lib/grpc/client/load_balacing.ex +++ b/grpc/lib/grpc/client/load_balacing.ex @@ -7,17 +7,16 @@ defmodule GRPC.Client.LoadBalancing do it by `GRPC.Client.Connection` — it is not responsible for establishing or tearing down transport. - Required callbacks: + Callbacks: * `init/1` — build initial state from `[channels: [Channel.t()]]`. * `pick/1` — choose a `Channel` for the next request. - * `update/2` — reconcile the state with a new channel list in place, - without tearing it down. Called by `Connection` on DNS re-resolution. + * `update/2` — swap in a new channel list when DNS re-resolution + discovers new or removed backends. - Optional callbacks: - - * `shutdown/1` — release any resources held by the strategy. Called on - disconnect. Strategies that hold no external resources can omit it. + `init/1` runs inside the `GRPC.Client.Connection` GenServer. Any ETS + tables or other process-owned resources the strategy creates are owned + by that GenServer and are released automatically when it exits. """ alias GRPC.Channel @@ -28,8 +27,4 @@ defmodule GRPC.Client.LoadBalancing do @callback update(state :: any(), new_channels :: [Channel.t()]) :: {:ok, new_state :: any()} | {:error, reason :: any()} - - @callback shutdown(state :: any()) :: :ok - - @optional_callbacks shutdown: 1 end diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index e679c1534..2f58db006 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -2,9 +2,9 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do @moduledoc """ Pick-first load balancer: always returns the first channel in the list. - Stores the current pick in an ETS table so `update/2` can swap it without - returning a new state — callers that captured `lb_state` (the Connection - GenServer, the LB registry) keep seeing the right channel. + Stores the current pick in an ETS table so `update/2` can swap it in + place without allocating a new state. The `tid` is stable for the life + of the strategy, which lets callers cache `lb_state` across reconciles. """ @behaviour GRPC.Client.LoadBalancing @@ -44,12 +44,4 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do :ets.insert(tid, {@current_key, nil}) {:ok, state} end - - @impl true - def shutdown(%{tid: tid}) do - :ets.delete(tid) - :ok - rescue - ArgumentError -> :ok - end end diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index 360514a96..16b5a22cc 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -8,9 +8,8 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do plus a lookup. The ETS table is owned by whichever process calls `init/1` (normally the - `GRPC.Client.Connection` GenServer), so when that process dies the table is - reclaimed automatically — no separate supervision needed. `shutdown/1` is - provided for graceful disconnects. + `GRPC.Client.Connection` GenServer), so when that process dies the table + is reclaimed automatically. """ @behaviour GRPC.Client.LoadBalancing @@ -57,12 +56,4 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do :ets.insert(tid, {@index_key, -1}) {:ok, state} end - - @impl true - def shutdown(%{tid: tid}) do - :ets.delete(tid) - :ok - rescue - ArgumentError -> :ok - end end diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index 59c764f9e..16f99c65f 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -110,7 +110,10 @@ defmodule GRPC.Client.ConnectionTest do end describe "LB ETS table lifecycle" do - test "disconnect/1 calls lb_mod.shutdown so the ETS table is freed", %{ + # The Connection GenServer owns the LB's ETS table (lb_mod.init runs in + # the GenServer), so BEAM reclaims the table automatically when the + # process exits — no explicit shutdown callback needed. + test "disconnect/1 exits the GenServer and the ETS table is freed", %{ ref: ref, target: target, adapter: adapter @@ -121,8 +124,16 @@ defmodule GRPC.Client.ConnectionTest do tid = lb_tid(ref) assert :ets.info(tid) != :undefined + pid = :global.whereis_name({Connection, ref}) + ref_mon = Process.monitor(pid) + {:ok, _} = Connection.disconnect(channel) + # disconnect replies before the GenServer terminates (via {:continue, :stop}), + # so wait for the process to actually exit before asserting BEAM has reclaimed + # the table. + assert_receive {:DOWN, ^ref_mon, :process, ^pid, _reason}, 500 + assert :ets.info(tid) == :undefined end diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 6b0118f7a..9af85f0f5 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -1267,9 +1267,9 @@ defmodule GRPC.Client.ReResolveTest do end end - describe "resolver worker crash recovery" do - test "Connection re-initializes resolver when worker crashes", ctx do - {:ok, channel} = + describe "resolver worker crash" do + test "Connection stops when the resolver worker dies", ctx do + {:ok, _channel} = connect_with_resolver( ctx.ref, ctx.resolver, @@ -1279,22 +1279,33 @@ defmodule GRPC.Client.ReResolveTest do ) state = get_state(ctx.ref) - original_pid = state.resolver_state.worker_pid - assert Process.alive?(original_pid) - - # Kill the worker — Connection traps exits and should re-init - Process.exit(original_pid, :kill) - Process.sleep(100) + worker_pid = state.resolver_state.worker_pid + assert Process.alive?(worker_pid) conn_pid = :global.whereis_name({Connection, ctx.ref}) - assert Process.alive?(conn_pid) + mon = Process.monitor(conn_pid) - # resolver_state should have a NEW worker pid - state = get_state(ctx.ref) - assert state.resolver_state != nil - assert state.resolver_state.worker_pid != original_pid - assert Process.alive?(state.resolver_state.worker_pid) + Process.exit(worker_pid, :kill) + + assert_receive {:DOWN, ^mon, :process, ^conn_pid, {:resolver_exited, :killed}}, 500 + end + + test "unrelated :EXIT signals don't stop the Connection", ctx do + {:ok, channel} = + connect_with_resolver( + ctx.ref, + ctx.resolver, + ctx.adapter, + [%{address: "10.0.0.1", port: 50051}], + lb_policy: :round_robin + ) + + conn_pid = :global.whereis_name({Connection, ctx.ref}) + stray_pid = spawn(fn -> :ok end) + send(conn_pid, {:EXIT, stray_pid, :boom}) + Process.sleep(50) + assert Process.alive?(conn_pid) assert {:ok, _} = Connection.pick_channel(channel) disconnect_and_wait(channel) diff --git a/grpc/test/grpc/client/load_balacing/pick_first_test.exs b/grpc/test/grpc/client/load_balacing/pick_first_test.exs index 1e7207142..1d20162fd 100644 --- a/grpc/test/grpc/client/load_balacing/pick_first_test.exs +++ b/grpc/test/grpc/client/load_balacing/pick_first_test.exs @@ -13,13 +13,11 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do assert %{tid: tid} = state assert is_reference(tid) assert :ets.info(tid) != :undefined - PickFirst.shutdown(state) end test "seeds the table with the first channel" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}, {"b", 2}])) assert {:ok, %Channel{host: "a", port: 1}, ^state} = PickFirst.pick(state) - PickFirst.shutdown(state) end test "rejects empty channel lists" do @@ -38,20 +36,20 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do for _ <- 1..3 do assert {:ok, %Channel{host: "a", port: 1}, ^state} = PickFirst.pick(state) end - - PickFirst.shutdown(state) end test "returns :no_channels when current is nil" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) {:ok, _} = PickFirst.update(state, []) assert {:error, :no_channels} = PickFirst.pick(state) - PickFirst.shutdown(state) end + # Guards the race where the owning Connection GenServer dies (and BEAM + # reclaims the ETS table) between a caller's registry lookup and this + # pick — the rescue should turn the BIF crash into a tagged error. test "returns :no_channels instead of raising when the table was deleted" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) - :ok = PickFirst.shutdown(state) + :ets.delete(state.tid) assert {:error, :no_channels} = PickFirst.pick(state) end end @@ -65,31 +63,12 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do assert new_state.tid == original_tid assert {:ok, %Channel{host: "x", port: 9}, _} = PickFirst.pick(new_state) - PickFirst.shutdown(new_state) end test "clears current to nil on empty list" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) {:ok, state} = PickFirst.update(state, []) assert {:error, :no_channels} = PickFirst.pick(state) - PickFirst.shutdown(state) - end - end - - describe "shutdown/1" do - test "deletes the ETS table" do - {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) - tid = state.tid - assert :ets.info(tid) != :undefined - - :ok = PickFirst.shutdown(state) - assert :ets.info(tid) == :undefined - end - - test "is idempotent on already-deleted tables" do - {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) - :ok = PickFirst.shutdown(state) - assert :ok = PickFirst.shutdown(state) end end end diff --git a/grpc/test/grpc/client/load_balacing/round_robin_test.exs b/grpc/test/grpc/client/load_balacing/round_robin_test.exs index 3adb0f515..d5ebf7403 100644 --- a/grpc/test/grpc/client/load_balacing/round_robin_test.exs +++ b/grpc/test/grpc/client/load_balacing/round_robin_test.exs @@ -8,11 +8,10 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do 3. pick/1 rotates through channels in order 4. pick/1 wraps around after the last channel 5. update/2 replaces channels in place without creating a new table - 6. update/2 rejects empty channel lists + 6. update/2 accepts empty channel lists 7. update/2 resets the cursor so the first pick is the first channel - 8. shutdown/1 deletes the ETS table - 9. shutdown/1 is idempotent (safe on already-deleted tables) - 10. pick/1 is safe under concurrent access (many processes, one table) + 8. pick/1 is safe under concurrent access (many processes, one table) + 9. pick/1 returns :no_channels (not a crash) when the table is gone """ use ExUnit.Case, async: true @@ -28,7 +27,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do assert %{tid: tid} = state assert is_reference(tid) assert :ets.info(tid) != :undefined - RoundRobin.shutdown(state) end test "rejects empty channel lists" do @@ -48,8 +46,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do assert {:ok, %Channel{host: "b", port: 2}, _} = RoundRobin.pick(state) assert {:ok, %Channel{host: "c", port: 3}, _} = RoundRobin.pick(state) assert {:ok, %Channel{host: "a", port: 1}, _} = RoundRobin.pick(state) - - RoundRobin.shutdown(state) end test "wraps around with a single channel" do @@ -58,8 +54,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do for _ <- 1..5 do assert {:ok, %Channel{host: "only"}, _} = RoundRobin.pick(state) end - - RoundRobin.shutdown(state) end end @@ -74,15 +68,12 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do assert {:ok, %Channel{host: "x", port: 9}, _} = RoundRobin.pick(new_state) assert {:ok, %Channel{host: "y", port: 8}, _} = RoundRobin.pick(new_state) assert {:ok, %Channel{host: "z", port: 7}, _} = RoundRobin.pick(new_state) - - RoundRobin.shutdown(new_state) end test "accepts empty channel lists; pick then returns :no_channels" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) assert {:ok, ^state} = RoundRobin.update(state, []) assert {:error, :no_channels} = RoundRobin.pick(state) - RoundRobin.shutdown(state) end test "resets cursor so the first pick after update starts at the first channel" do @@ -92,37 +83,21 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do {:ok, state} = RoundRobin.update(state, channels([{"new-first", 1}, {"new-second", 2}])) assert {:ok, %Channel{host: "new-first"}, _} = RoundRobin.pick(state) - - RoundRobin.shutdown(state) end end - describe "pick/1 race with shutdown" do + # Guards the race where the owning Connection GenServer dies (and BEAM + # reclaims the ETS table) between a caller's registry lookup and this + # pick — the rescue should turn the BIF crash into a tagged error. + describe "pick/1 race with table deletion" do test "returns :no_channels instead of raising when the table was deleted" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) - :ok = RoundRobin.shutdown(state) + :ets.delete(state.tid) assert {:error, :no_channels} = RoundRobin.pick(state) end end - describe "shutdown/1" do - test "deletes the ETS table" do - {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) - tid = state.tid - assert :ets.info(tid) != :undefined - - :ok = RoundRobin.shutdown(state) - assert :ets.info(tid) == :undefined - end - - test "is idempotent on already-deleted tables" do - {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) - :ok = RoundRobin.shutdown(state) - assert :ok = RoundRobin.shutdown(state) - end - end - describe "concurrency" do test "pick/1 is safe under many concurrent processes" do chs = channels(for i <- 1..4, do: {"host#{i}", 1000 + i}) @@ -161,8 +136,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do assert abs(c - avg) <= 1, "uneven pick distribution: #{inspect(counts)}" end - - RoundRobin.shutdown(state) end end end From f543e1a4d33079946f6d77f213c06229462b2a4f Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 23 Apr 2026 21:56:34 +0100 Subject: [PATCH 15/21] docs: tighten comments on EXIT handler and PickFirst moduledoc Rescope the catch-all :EXIT comment so the :normal clause's silent ignore isn't surprising, and replace "reconciles" with "DNS re-resolutions" in the PickFirst moduledoc for readers who aren't already deep in the LB code path. --- grpc/lib/grpc/client/load_balacing/pick_first.ex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index 2f58db006..73f6e2ec5 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -4,7 +4,8 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do Stores the current pick in an ETS table so `update/2` can swap it in place without allocating a new state. The `tid` is stable for the life - of the strategy, which lets callers cache `lb_state` across reconciles. + of the strategy, which lets callers cache `lb_state` across DNS + re-resolutions that swap the backend list. """ @behaviour GRPC.Client.LoadBalancing From 9537b6419d1ef40631a4c1cfe7fcc5bb57f23634 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Fri, 24 Apr 2026 23:11:12 +0100 Subject: [PATCH 16/21] fix(connection): keep Registry consistent across init, reconcile, disconnect MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three small fixes from review feedback on the ETS LB work: - init/1: publish to the Registry last. init crashes don't run terminate/2, so registering before resolver.init/2 would strand the entry if resolver init raised. - rebalance_after_reconcile/3: republish {lb_mod, lb_state} after lb_mod.update/2 so pick_channel/2 reads the fresh state returned by update/2, not the state captured at init. - handle_call({:disconnect, ...}): linearized to Registry.delete → transport close → reply. Dropped the dead Map.drop branch; the struct always has :real_channels. --- grpc/lib/grpc/client/connection.ex | 35 +++++++++++++++--------------- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index afc46feac..1660a2231 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -125,7 +125,6 @@ defmodule GRPC.Client.Connection do case state.lb_mod.init(channels: connected) do {:ok, lb_state} -> state = %{state | lb_state: lb_state} - Registry.put(state.virtual_channel.ref, {state.lb_mod, lb_state}) state = if function_exported?(state.resolver, :init, 2) do @@ -140,6 +139,11 @@ defmodule GRPC.Client.Connection do state end + # Publish to the Registry last. init/1 crashes don't run terminate/2, + # so anything registered before this point would leak if a later step + # raised. + Registry.put(state.virtual_channel.ref, {state.lb_mod, lb_state}) + {:ok, state} {:error, reason} -> @@ -277,25 +281,13 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end - resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} + # Delete the Registry entry before closing transports so in-flight picks + # fail fast with :no_connection instead of racing against a dying adapter. Registry.delete(channel.ref) + disconnect_real_channels(state.real_channels, adapter) - if Map.has_key?(state, :real_channels) do - Enum.map(state.real_channels, fn - {_key, {:connected, ch}} -> - do_disconnect(adapter, ch) - - _ -> - :ok - end) - - keys_to_delete = [:real_channels, :virtual_channel] - new_state = Map.drop(state, keys_to_delete) - - {:reply, resp, new_state, {:continue, :stop}} - else - {:reply, resp, state, {:continue, :stop}} - end + resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} + {:reply, resp, state, {:continue, :stop}} end @impl GenServer @@ -438,6 +430,13 @@ defmodule GRPC.Client.Connection do state.lb_state end + # pick_channel/2 reads {lb_mod, lb_state} from the Registry on every RPC, + # so the Registry is the source of truth for picks. Keep it in sync with + # whatever update/2 returned. + if state.lb_mod do + Registry.put(state.virtual_channel.ref, {state.lb_mod, new_lb_state}) + end + if connected == [] do Logger.warning("No healthy channels available after re-resolution") end From d1be507ddfb3919f233a9d4c3150284c92bec878 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Tue, 28 Apr 2026 18:39:38 +0100 Subject: [PATCH 17/21] docs: tighten comments on Registry ordering and pick race Co-Authored-By: Claude Opus 4.7 (1M context) --- grpc/lib/grpc/client/connection.ex | 10 ++-------- grpc/lib/grpc/client/load_balacing/round_robin.ex | 4 +--- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 1660a2231..6e266182a 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -139,9 +139,7 @@ defmodule GRPC.Client.Connection do state end - # Publish to the Registry last. init/1 crashes don't run terminate/2, - # so anything registered before this point would leak if a later step - # raised. + # init/1 crashes skip terminate/2, so register only after fallible steps. Registry.put(state.virtual_channel.ref, {state.lb_mod, lb_state}) {:ok, state} @@ -281,8 +279,7 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end - # Delete the Registry entry before closing transports so in-flight picks - # fail fast with :no_connection instead of racing against a dying adapter. + # Delete first so in-flight picks fail fast instead of racing a dying adapter. Registry.delete(channel.ref) disconnect_real_channels(state.real_channels, adapter) @@ -430,9 +427,6 @@ defmodule GRPC.Client.Connection do state.lb_state end - # pick_channel/2 reads {lb_mod, lb_state} from the Registry on every RPC, - # so the Registry is the source of truth for picks. Keep it in sync with - # whatever update/2 returned. if state.lb_mod do Registry.put(state.virtual_channel.ref, {state.lb_mod, new_lb_state}) end diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index 16b5a22cc..0f6884689 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -33,9 +33,7 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do end @impl true - # A concurrent `disconnect/1` can delete the ETS table between a caller's - # registry lookup and this pick, turning the ETS BIFs into ArgumentError. - # Callers expect a tagged error, not a crashed RPC. + # `disconnect/1` may delete the table between a caller's registry lookup and pick. def pick(%{tid: tid} = state) do case :ets.lookup(tid, @channels_key) do [{@channels_key, channels}] when tuple_size(channels) > 0 -> From 648f80d3ee850733fa8deaa90acb6d13067e7589 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Tue, 28 Apr 2026 19:12:44 +0100 Subject: [PATCH 18/21] fix(connection): re-init resolver in place, don't stop the Connection MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This was a bad call on my part. Stopping the Connection when the resolver worker dies means a network blip or brief DNS hiccup tears down every transport channel and reconnects from scratch. We can't have network jitter force a restart. The reasoning I had behind the original change ("let-it-crash, trust the supervisor") doesn't hold up either: - The supervisor restart uses the original initial_state snapshot. real_channels in there points to conn_pids from first connect, so the new Connection inherits stale references. - terminate/2 only deletes the Registry entry; transports leak on abnormal exit. - The "avoid crash-loop" justification was wrong. The old code had a fallback to resolver_state: nil — it degraded gracefully, it didn't crash-loop. Restore the original in-place re-init. --- grpc/lib/grpc/client/connection.ex | 22 +++++++++++++++---- grpc/test/grpc/client/dns_resolver_test.exs | 24 ++++++++++++++------- 2 files changed, 34 insertions(+), 12 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 6e266182a..c27e500c6 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -293,10 +293,24 @@ defmodule GRPC.Client.Connection do {:noreply, state} end - def handle_info({:EXIT, pid, reason}, %{resolver_state: %{worker_pid: pid}} = state) - when reason != :normal do - Logger.warning("Resolver worker exited: #{inspect(reason)}, stopping Connection") - {:stop, {:resolver_exited, reason}, state} + def handle_info({:EXIT, _pid, reason}, %{resolver: resolver, resolver_state: rs} = state) + when not is_nil(rs) and reason != :normal do + Logger.warning("Resolver worker exited: #{inspect(reason)}, re-initializing") + + state = + if function_exported?(resolver, :init, 2) do + case resolver.init(state.resolver_target, + connection_pid: self(), + connect_opts: state.connect_opts + ) do + {:ok, new_rs} -> %{state | resolver_state: new_rs} + {:error, _} -> %{state | resolver_state: nil} + end + else + %{state | resolver_state: nil} + end + + {:noreply, state} end def handle_info({:EXIT, _pid, :normal}, state), do: {:noreply, state} diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 9af85f0f5..e3c8cea24 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -1267,9 +1267,9 @@ defmodule GRPC.Client.ReResolveTest do end end - describe "resolver worker crash" do - test "Connection stops when the resolver worker dies", ctx do - {:ok, _channel} = + describe "resolver worker crash recovery" do + test "Connection re-initializes resolver when worker crashes", ctx do + {:ok, channel} = connect_with_resolver( ctx.ref, ctx.resolver, @@ -1279,15 +1279,23 @@ defmodule GRPC.Client.ReResolveTest do ) state = get_state(ctx.ref) - worker_pid = state.resolver_state.worker_pid - assert Process.alive?(worker_pid) + original_pid = state.resolver_state.worker_pid + assert Process.alive?(original_pid) + + Process.exit(original_pid, :kill) + Process.sleep(100) conn_pid = :global.whereis_name({Connection, ctx.ref}) - mon = Process.monitor(conn_pid) + assert Process.alive?(conn_pid) + + state = get_state(ctx.ref) + assert state.resolver_state != nil + assert state.resolver_state.worker_pid != original_pid + assert Process.alive?(state.resolver_state.worker_pid) - Process.exit(worker_pid, :kill) + assert {:ok, _} = Connection.pick_channel(channel) - assert_receive {:DOWN, ^mon, :process, ^conn_pid, {:resolver_exited, :killed}}, 500 + disconnect_and_wait(channel) end test "unrelated :EXIT signals don't stop the Connection", ctx do From 5334433ff3284ee48393e76d3982c1559a35a512 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Thu, 30 Apr 2026 13:25:05 +0100 Subject: [PATCH 19/21] fix: erase persistent_term leak in GRPC.Client.Connection on disconnect MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the shared ETS Registry with :persistent_term keyed by {Connection, ref} and move the RoundRobin cursor from an ETS counter row to :atomics. Three problems collapse into one design: 1. Hot-path counter contention. :ets.update_counter/3 locks the key's hash bucket, so 32 concurrent picks bottleneck on a single counter (~10K ips in the bench). :atomics.add_get/3 is a hardware CAS — no table-level or key-level lock — and scales linearly across schedulers. 2. Two ETS tables where one is enough. The shared registry table is gone; the per-Connection LB table now holds only the channels tuple. Lookup of {lb_mod, lb_state} is a :persistent_term.get, which is faster than an :ets.lookup and allocation-free. 3. Steady-state global GC. The published lb_state contains a stable tid + atomics ref for the life of the connection. Reconcile mutates ETS + atomics in place and does NOT republish, so :persistent_term.put fires once per connect — not every 30s — and the global GC pass it triggers stays off the hot loop. Lifecycle: init/1 put once, after lb_mod.init + resolver.init succeed pick_channel/2 get on every RPC disconnect/1 erase first (in-flight picks fail fast), then close terminate/2 erase (idempotent — safe even if disconnect erased) The terminate/2 erase covers the abnormal-exit path; the only leak window is brutal kill, which already existed with Registry.delete and is not regressed by this change. The failure modes the previous PR introduced are all preserved: - rescue ArgumentError in lb_mod.pick — table reclaimed mid-pick - persistent_term.get/2 fallback to nil — entry erased mid-pick - erase BEFORE closing transports — picks fail with :no_connection rather than racing a dying adapter Tests: 286/286 pass (was 292; six Registry tests removed). The 500-cycle leak regression now counts {Connection, _}-keyed persistent_term entries instead of registry-table size. Co-Authored-By: Claude Opus 4.7 (1M context) --- grpc/lib/grpc/client/application.ex | 1 - grpc/lib/grpc/client/connection.ex | 27 ++++---- .../lib/grpc/client/load_balacing/registry.ex | 31 ---------- .../grpc/client/load_balacing/round_robin.ex | 35 ++++++----- grpc/test/grpc/client/connection_test.exs | 40 +++++++----- .../client/load_balacing/registry_test.exs | 62 ------------------- .../client/load_balacing/round_robin_test.exs | 17 +++-- 7 files changed, 69 insertions(+), 144 deletions(-) delete mode 100644 grpc/lib/grpc/client/load_balacing/registry.ex delete mode 100644 grpc/test/grpc/client/load_balacing/registry_test.exs diff --git a/grpc/lib/grpc/client/application.ex b/grpc/lib/grpc/client/application.ex index 6077fb5ed..16d4a230f 100644 --- a/grpc/lib/grpc/client/application.ex +++ b/grpc/lib/grpc/client/application.ex @@ -4,7 +4,6 @@ defmodule GRPC.Client.Application do def start(_type, _args) do children = [ - GRPC.Client.LoadBalancing.Registry, {DynamicSupervisor, [name: GRPC.Client.Supervisor]} ] diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index c27e500c6..7ebad4d7a 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -72,7 +72,6 @@ defmodule GRPC.Client.Connection do """ use GenServer alias GRPC.Channel - alias GRPC.Client.LoadBalancing.Registry require Logger @@ -139,8 +138,12 @@ defmodule GRPC.Client.Connection do state end - # init/1 crashes skip terminate/2, so register only after fallible steps. - Registry.put(state.virtual_channel.ref, {state.lb_mod, lb_state}) + # init/1 crashes skip terminate/2, so publish only after fallible steps. + # lb_state (tid + atomics ref) is stable for the life of the connection; + # reconcile mutates ETS/atomics in place rather than republishing, so + # persistent_term.put fires once per connect — not every 30s — keeping + # the global GC pass off the steady-state path. + :persistent_term.put({__MODULE__, state.virtual_channel.ref}, {state.lb_mod, lb_state}) {:ok, state} @@ -241,8 +244,8 @@ defmodule GRPC.Client.Connection do """ @spec pick_channel(Channel.t(), keyword()) :: {:ok, Channel.t()} | {:error, term()} def pick_channel(%Channel{ref: ref} = _channel, _opts \\ []) do - case Registry.lookup(ref) do - {:ok, {lb_mod, lb_state}} when not is_nil(lb_mod) -> + case :persistent_term.get({__MODULE__, ref}, nil) do + {lb_mod, lb_state} when not is_nil(lb_mod) -> case lb_mod.pick(lb_state) do {:ok, %Channel{} = channel, _new_state} -> {:ok, channel} {:error, _} -> {:error, :no_connection} @@ -279,8 +282,8 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end - # Delete first so in-flight picks fail fast instead of racing a dying adapter. - Registry.delete(channel.ref) + # Erase first so in-flight picks fail fast instead of racing a dying adapter. + :persistent_term.erase({__MODULE__, channel.ref}) disconnect_real_channels(state.real_channels, adapter) resp = {:ok, %Channel{channel | adapter_payload: %{conn_pid: nil}}} @@ -345,7 +348,8 @@ defmodule GRPC.Client.Connection do @impl GenServer def terminate(_reason, %{virtual_channel: %{ref: ref}}) do - Registry.delete(ref) + :persistent_term.erase({__MODULE__, ref}) + :ok rescue _ -> :ok end @@ -431,6 +435,9 @@ defmodule GRPC.Client.Connection do defp rebalance_after_reconcile(_new_addresses, real_channels, state) do connected = connected_channels(real_channels) + # update/2 mutates ETS/atomics in place; the persistent_term entry stays + # valid because lb_state's tid + atomics ref don't change. No republish, + # no global GC, every 30s reconcile. new_lb_state = if state.lb_mod do case reconcile_lb(state.lb_mod, state.lb_state, connected) do @@ -441,10 +448,6 @@ defmodule GRPC.Client.Connection do state.lb_state end - if state.lb_mod do - Registry.put(state.virtual_channel.ref, {state.lb_mod, new_lb_state}) - end - if connected == [] do Logger.warning("No healthy channels available after re-resolution") end diff --git a/grpc/lib/grpc/client/load_balacing/registry.ex b/grpc/lib/grpc/client/load_balacing/registry.ex deleted file mode 100644 index ae373434a..000000000 --- a/grpc/lib/grpc/client/load_balacing/registry.ex +++ /dev/null @@ -1,31 +0,0 @@ -defmodule GRPC.Client.LoadBalancing.Registry do - @moduledoc false - use GenServer - - @table :grpc_client_lb_registry - - @spec start_link(any()) :: GenServer.on_start() - def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__) - - @spec put(reference() | term(), {module(), term()}) :: true - def put(ref, value), do: :ets.insert(@table, {ref, value}) - - @spec lookup(reference() | term()) :: {:ok, {module(), term()}} | :error - def lookup(ref) do - case :ets.lookup(@table, ref) do - [{^ref, value}] -> {:ok, value} - [] -> :error - end - rescue - ArgumentError -> :error - end - - @spec delete(reference() | term()) :: true - def delete(ref), do: :ets.delete(@table, ref) - - @impl true - def init(_) do - :ets.new(@table, [:set, :public, :named_table, read_concurrency: true]) - {:ok, nil} - end -end diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index 0f6884689..b69bea6f9 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -1,20 +1,24 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do @moduledoc """ - Round-robin load balancer backed by an ETS table. + Round-robin load balancer. - The pick path is lock-free: `:ets.update_counter/3` atomically advances the - cursor and `elem/2` indexes the channel tuple in constant time. No GenServer - sits in front of the pick, so RPC-time picks are a single atomic increment - plus a lookup. + The pick path is fully lock-free: + + * `:atomics.add_get/3` advances the cursor with a hardware CAS — no + table-level or key-level lock, unlike `:ets.update_counter/3` which + serialises writers on a single key's hash bucket. + * The channel tuple lives in a `read_concurrency: true` ETS table and + is replaced wholesale on reconcile, so concurrent picks never see a + torn list. The ETS table is owned by whichever process calls `init/1` (normally the `GRPC.Client.Connection` GenServer), so when that process dies the table - is reclaimed automatically. + is reclaimed automatically. The atomics ref has no owner — it is reclaimed + by the GC when the last reference is dropped. """ @behaviour GRPC.Client.LoadBalancing @channels_key :channels - @index_key :index @impl true def init(opts) do @@ -24,21 +28,22 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do channels -> tid = :ets.new(:grpc_lb_round_robin, [:set, :public, read_concurrency: true]) + aref = :atomics.new(1, signed: false) :ets.insert(tid, {@channels_key, List.to_tuple(channels)}) - :ets.insert(tid, {@index_key, -1}) - {:ok, %{tid: tid}} + {:ok, %{tid: tid, atomics: aref}} end end @impl true - # `disconnect/1` may delete the table between a caller's registry lookup and pick. - def pick(%{tid: tid} = state) do + # `disconnect/1` may delete the table between a caller's persistent_term + # lookup and pick — the rescue turns the BIF crash into a tagged error. + def pick(%{tid: tid, atomics: aref} = state) do case :ets.lookup(tid, @channels_key) do [{@channels_key, channels}] when tuple_size(channels) > 0 -> - idx = :ets.update_counter(tid, @index_key, {2, 1}) - channel = elem(channels, rem(idx, tuple_size(channels))) + idx = :atomics.add_get(aref, 1, 1) + channel = elem(channels, rem(idx - 1, tuple_size(channels))) {:ok, channel, state} _ -> @@ -49,9 +54,9 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do end @impl true - def update(%{tid: tid} = state, new_channels) do + def update(%{tid: tid, atomics: aref} = state, new_channels) do :ets.insert(tid, {@channels_key, List.to_tuple(new_channels)}) - :ets.insert(tid, {@index_key, -1}) + :atomics.put(aref, 1, 0) {:ok, state} end end diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index 16f99c65f..1931f88e4 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -3,7 +3,6 @@ defmodule GRPC.Client.ConnectionTest do alias GRPC.Channel alias GRPC.Client.Connection - alias GRPC.Client.LoadBalancing.Registry setup do %{ @@ -21,7 +20,7 @@ defmodule GRPC.Client.ConnectionTest do assert {:error, :no_connection} = Connection.pick_channel(channel) end - test "returns {:ok, channel} once a connection has registered its LB state", %{ + test "returns {:ok, channel} once a connection has published its LB state", %{ ref: ref, target: target, adapter: adapter @@ -51,17 +50,18 @@ defmodule GRPC.Client.ConnectionTest do Connection.disconnect(first_channel) end - test "returns {:error, :no_connection} when already_started but the registry entry is missing", + test "returns {:error, :no_connection} when already_started but the persistent_term entry is missing", %{ref: ref, target: target, adapter: adapter} do {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) - {:ok, entry} = Registry.lookup(ref) - Registry.delete(ref) + key = {Connection, ref} + entry = :persistent_term.get(key) + :persistent_term.erase(key) # Calling connect again will hit already_started → pick_channel → :no_connection assert {:error, :no_connection} = Connection.connect(target, adapter: adapter, name: ref) - Registry.put(ref, entry) + :persistent_term.put(key, entry) Connection.disconnect(channel) end end @@ -82,7 +82,7 @@ defmodule GRPC.Client.ConnectionTest do assert_receive {:DOWN, ^ref_mon, :process, ^pid, _reason}, 500 end - test "pick_channel returns {:error, :no_connection} after disconnect (registry entry is deleted)", + test "pick_channel returns {:error, :no_connection} after disconnect (persistent_term entry is erased)", %{ref: ref, target: target, adapter: adapter} do {:ok, channel} = Connection.connect(target, adapter: adapter, name: ref) @@ -92,8 +92,8 @@ defmodule GRPC.Client.ConnectionTest do end end - describe "terminate/2 - registry cleanup on process kill" do - test "registry entry is deleted when process is killed without disconnect", %{ + describe "terminate/2 - persistent_term cleanup on process kill" do + test "persistent_term entry is erased when process is killed without disconnect", %{ ref: ref, target: target, adapter: adapter @@ -158,8 +158,9 @@ defmodule GRPC.Client.ConnectionTest do describe "pick_channel/2 races with disconnect/1" do # Concurrent RPCs must never crash when a disconnect is tearing down the - # LB state. This exercises both the Registry.lookup ArgumentError rescue - # (table gone) and the per-LB pick rescue (per-connection table gone). + # LB state. The pick path's :persistent_term.get falls back to nil when + # the entry is gone, and the per-LB pick rescues ArgumentError if the + # ETS table is reclaimed mid-pick. test "many concurrent picks complete without crashing while disconnect runs", %{ ref: ref, target: target, @@ -208,13 +209,14 @@ defmodule GRPC.Client.ConnectionTest do describe "resource leaks over repeated connect/disconnect" do # Mirrors the regression style used in #509 for the persistent_term leak: # cycle connect/disconnect many times and assert zero growth in the - # long-lived tables we own (the registry + the count of ETS tables). - test "500 cycles leave the registry empty and no per-LB tables leak", %{ + # long-lived stores we own (Connection-keyed persistent_term entries + + # the count of ETS tables). + test "500 cycles leave persistent_term clean and no per-LB tables leak", %{ target: target, adapter: adapter } do before_table_count = length(:ets.all()) - before_registry_size = :ets.info(:grpc_client_lb_registry, :size) + before_pt_count = connection_pt_count() for _ <- 1..500 do ref = make_ref() @@ -222,11 +224,11 @@ defmodule GRPC.Client.ConnectionTest do {:ok, _} = Connection.disconnect(channel) end - after_registry_size = :ets.info(:grpc_client_lb_registry, :size) + after_pt_count = connection_pt_count() after_table_count = length(:ets.all()) - assert after_registry_size == before_registry_size, - "registry leaked: before=#{before_registry_size} after=#{after_registry_size}" + assert after_pt_count == before_pt_count, + "persistent_term leaked: before=#{before_pt_count} after=#{after_pt_count}" # Allow small slop for tables the VM may have created incidentally. assert after_table_count - before_table_count <= 5, @@ -234,6 +236,10 @@ defmodule GRPC.Client.ConnectionTest do end end + defp connection_pt_count do + Enum.count(:persistent_term.get(), &match?({{Connection, _}, _}, &1)) + end + defp lb_tid(ref) do pid = :global.whereis_name({Connection, ref}) %{lb_state: %{tid: tid}} = :sys.get_state(pid) diff --git a/grpc/test/grpc/client/load_balacing/registry_test.exs b/grpc/test/grpc/client/load_balacing/registry_test.exs deleted file mode 100644 index 2454c0bc3..000000000 --- a/grpc/test/grpc/client/load_balacing/registry_test.exs +++ /dev/null @@ -1,62 +0,0 @@ -defmodule GRPC.Client.LoadBalancing.RegistryTest do - @moduledoc """ - Direct tests for the shared LB registry. - - The registry is a process-backed named ETS table started by - GRPC.Client.Application. These tests exercise the public API - (put/2, lookup/1, delete/1) and the ArgumentError rescue path in - lookup/1 that guards against races where the table is gone. - """ - use ExUnit.Case, async: false - - alias GRPC.Client.LoadBalancing.Registry - - describe "put/2 + lookup/1" do - test "round-trips a {module, state} value keyed by a reference" do - ref = make_ref() - Registry.put(ref, {SomeLB, %{tid: :fake_tid}}) - - assert {:ok, {SomeLB, %{tid: :fake_tid}}} = Registry.lookup(ref) - - Registry.delete(ref) - end - - test "accepts arbitrary keys (reference, tuple, atom)" do - for key <- [make_ref(), {:conn, 1}, :atom_key] do - Registry.put(key, {LB, :state}) - assert {:ok, {LB, :state}} = Registry.lookup(key) - Registry.delete(key) - end - end - - test "overwrites an existing entry" do - ref = make_ref() - Registry.put(ref, {LB, :v1}) - Registry.put(ref, {LB, :v2}) - - assert {:ok, {LB, :v2}} = Registry.lookup(ref) - - Registry.delete(ref) - end - end - - describe "lookup/1 when the entry is missing" do - test "returns :error" do - assert :error = Registry.lookup(make_ref()) - end - end - - describe "delete/1" do - test "removes the entry; subsequent lookup returns :error" do - ref = make_ref() - Registry.put(ref, {LB, :state}) - Registry.delete(ref) - - assert :error = Registry.lookup(ref) - end - - test "is safe on an unknown key" do - assert true = Registry.delete(make_ref()) - end - end -end diff --git a/grpc/test/grpc/client/load_balacing/round_robin_test.exs b/grpc/test/grpc/client/load_balacing/round_robin_test.exs index d5ebf7403..55902eb18 100644 --- a/grpc/test/grpc/client/load_balacing/round_robin_test.exs +++ b/grpc/test/grpc/client/load_balacing/round_robin_test.exs @@ -1,13 +1,14 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do @moduledoc """ - Tests for the ETS-backed round-robin load balancer. + Tests for the round-robin load balancer (ETS for the channel tuple, + `:atomics` for the cursor). Covers: - 1. init/1 creates an ETS table and returns it in state + 1. init/1 creates an ETS table + atomics ref and returns them in state 2. init/1 rejects empty channel lists 3. pick/1 rotates through channels in order 4. pick/1 wraps around after the last channel - 5. update/2 replaces channels in place without creating a new table + 5. update/2 replaces channels in place without creating a new table or atomics ref 6. update/2 accepts empty channel lists 7. update/2 resets the cursor so the first pick is the first channel 8. pick/1 is safe under concurrent access (many processes, one table) @@ -22,11 +23,13 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do do: Enum.map(pairs, fn {h, p} -> %Channel{host: h, port: p, ref: {h, p}} end) describe "init/1" do - test "creates an ETS table and returns the tid in state" do + test "creates an ETS table + atomics ref and returns both in state" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) - assert %{tid: tid} = state + assert %{tid: tid, atomics: aref} = state assert is_reference(tid) + assert is_reference(aref) assert :ets.info(tid) != :undefined + assert :atomics.get(aref, 1) == 0 end test "rejects empty channel lists" do @@ -58,12 +61,14 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do end describe "update/2" do - test "replaces channels in place without changing the tid" do + test "replaces channels in place without changing the tid or atomics ref" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}, {"b", 2}])) original_tid = state.tid + original_aref = state.atomics {:ok, new_state} = RoundRobin.update(state, channels([{"x", 9}, {"y", 8}, {"z", 7}])) assert new_state.tid == original_tid + assert new_state.atomics == original_aref assert {:ok, %Channel{host: "x", port: 9}, _} = RoundRobin.pick(new_state) assert {:ok, %Channel{host: "y", port: 8}, _} = RoundRobin.pick(new_state) From 38a032e6b28157a6113046f29a1078fc28113ae8 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Fri, 1 May 2026 12:17:53 +0100 Subject: [PATCH 20/21] docs: aggressive comment cleanup, match #520 style MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The resolver code from #520 ships with essentially zero inline comments and tight one-line @moduledocs. Match that here: - Slash multi-paragraph @moduledoc blocks down to one-liners on the LB behaviour, RoundRobin, PickFirst, and the LB test moduledocs. - Remove the multi-line WHY comments in Connection (init publish ordering, disconnect erase ordering, reconcile in-place mutation). - Remove enumerate-every-test moduledoc from RoundRobin tests. - Remove the "guards the race where..." race-test comments — the test name already says what it does, and the rescue is in the LB code. - Remove the connection_test commentary on the already_started path, the LB ETS lifecycle setup, the concurrent-pick rationale, and the ETS-table-slop justification. Co-Authored-By: Claude Opus 4.7 (1M context) --- grpc/lib/grpc/client/connection.ex | 9 -------- grpc/lib/grpc/client/load_balacing.ex | 19 +--------------- .../grpc/client/load_balacing/pick_first.ex | 8 +------ .../grpc/client/load_balacing/round_robin.ex | 21 ++---------------- grpc/test/grpc/client/connection_test.exs | 22 ------------------- .../client/load_balacing/pick_first_test.exs | 3 --- .../client/load_balacing/round_robin_test.exs | 18 --------------- 7 files changed, 4 insertions(+), 96 deletions(-) diff --git a/grpc/lib/grpc/client/connection.ex b/grpc/lib/grpc/client/connection.ex index 7ebad4d7a..75ace192a 100644 --- a/grpc/lib/grpc/client/connection.ex +++ b/grpc/lib/grpc/client/connection.ex @@ -138,11 +138,6 @@ defmodule GRPC.Client.Connection do state end - # init/1 crashes skip terminate/2, so publish only after fallible steps. - # lb_state (tid + atomics ref) is stable for the life of the connection; - # reconcile mutates ETS/atomics in place rather than republishing, so - # persistent_term.put fires once per connect — not every 30s — keeping - # the global GC pass off the steady-state path. :persistent_term.put({__MODULE__, state.virtual_channel.ref}, {state.lb_mod, lb_state}) {:ok, state} @@ -282,7 +277,6 @@ defmodule GRPC.Client.Connection do state.resolver.shutdown(state.resolver_state) end - # Erase first so in-flight picks fail fast instead of racing a dying adapter. :persistent_term.erase({__MODULE__, channel.ref}) disconnect_real_channels(state.real_channels, adapter) @@ -435,9 +429,6 @@ defmodule GRPC.Client.Connection do defp rebalance_after_reconcile(_new_addresses, real_channels, state) do connected = connected_channels(real_channels) - # update/2 mutates ETS/atomics in place; the persistent_term entry stays - # valid because lb_state's tid + atomics ref don't change. No republish, - # no global GC, every 30s reconcile. new_lb_state = if state.lb_mod do case reconcile_lb(state.lb_mod, state.lb_state, connected) do diff --git a/grpc/lib/grpc/client/load_balacing.ex b/grpc/lib/grpc/client/load_balacing.ex index 0d9a61e77..de55e9cb3 100644 --- a/grpc/lib/grpc/client/load_balacing.ex +++ b/grpc/lib/grpc/client/load_balacing.ex @@ -1,23 +1,6 @@ defmodule GRPC.Client.LoadBalancing do - @moduledoc """ - Load balancing behaviour for gRPC clients. + @moduledoc "Load balancing behaviour for gRPC clients." - A load-balancing strategy owns the per-request pick decision over a set of - `GRPC.Channel` structs. It operates on already-connected channels handed to - it by `GRPC.Client.Connection` — it is not responsible for establishing or - tearing down transport. - - Callbacks: - - * `init/1` — build initial state from `[channels: [Channel.t()]]`. - * `pick/1` — choose a `Channel` for the next request. - * `update/2` — swap in a new channel list when DNS re-resolution - discovers new or removed backends. - - `init/1` runs inside the `GRPC.Client.Connection` GenServer. Any ETS - tables or other process-owned resources the strategy creates are owned - by that GenServer and are released automatically when it exits. - """ alias GRPC.Channel @callback init(opts :: keyword()) :: {:ok, state :: any()} | {:error, reason :: any()} diff --git a/grpc/lib/grpc/client/load_balacing/pick_first.ex b/grpc/lib/grpc/client/load_balacing/pick_first.ex index 73f6e2ec5..9f37b3603 100644 --- a/grpc/lib/grpc/client/load_balacing/pick_first.ex +++ b/grpc/lib/grpc/client/load_balacing/pick_first.ex @@ -1,12 +1,6 @@ defmodule GRPC.Client.LoadBalancing.PickFirst do - @moduledoc """ - Pick-first load balancer: always returns the first channel in the list. + @moduledoc "Pick-first load balancer: always returns the first channel in the list." - Stores the current pick in an ETS table so `update/2` can swap it in - place without allocating a new state. The `tid` is stable for the life - of the strategy, which lets callers cache `lb_state` across DNS - re-resolutions that swap the backend list. - """ @behaviour GRPC.Client.LoadBalancing @current_key :current diff --git a/grpc/lib/grpc/client/load_balacing/round_robin.ex b/grpc/lib/grpc/client/load_balacing/round_robin.ex index b69bea6f9..4ad29a3f6 100644 --- a/grpc/lib/grpc/client/load_balacing/round_robin.ex +++ b/grpc/lib/grpc/client/load_balacing/round_robin.ex @@ -1,21 +1,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do - @moduledoc """ - Round-robin load balancer. - - The pick path is fully lock-free: - - * `:atomics.add_get/3` advances the cursor with a hardware CAS — no - table-level or key-level lock, unlike `:ets.update_counter/3` which - serialises writers on a single key's hash bucket. - * The channel tuple lives in a `read_concurrency: true` ETS table and - is replaced wholesale on reconcile, so concurrent picks never see a - torn list. - - The ETS table is owned by whichever process calls `init/1` (normally the - `GRPC.Client.Connection` GenServer), so when that process dies the table - is reclaimed automatically. The atomics ref has no owner — it is reclaimed - by the GC when the last reference is dropped. - """ + @moduledoc "Round-robin load balancer (ETS for the channel tuple, `:atomics` for the cursor)." + @behaviour GRPC.Client.LoadBalancing @channels_key :channels @@ -37,8 +22,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobin do end @impl true - # `disconnect/1` may delete the table between a caller's persistent_term - # lookup and pick — the rescue turns the BIF crash into a tagged error. def pick(%{tid: tid, atomics: aref} = state) do case :ets.lookup(tid, @channels_key) do [{@channels_key, channels}] when tuple_size(channels) > 0 -> diff --git a/grpc/test/grpc/client/connection_test.exs b/grpc/test/grpc/client/connection_test.exs index 1931f88e4..259abbe2d 100644 --- a/grpc/test/grpc/client/connection_test.exs +++ b/grpc/test/grpc/client/connection_test.exs @@ -39,8 +39,6 @@ defmodule GRPC.Client.ConnectionTest do adapter: adapter } do {:ok, first_channel} = Connection.connect(target, adapter: adapter, name: ref) - - # Connecting again with the same ref triggers the :already_started path {:ok, second_channel} = Connection.connect(target, adapter: adapter, name: ref) assert first_channel.ref == second_channel.ref @@ -58,7 +56,6 @@ defmodule GRPC.Client.ConnectionTest do entry = :persistent_term.get(key) :persistent_term.erase(key) - # Calling connect again will hit already_started → pick_channel → :no_connection assert {:error, :no_connection} = Connection.connect(target, adapter: adapter, name: ref) :persistent_term.put(key, entry) @@ -110,9 +107,6 @@ defmodule GRPC.Client.ConnectionTest do end describe "LB ETS table lifecycle" do - # The Connection GenServer owns the LB's ETS table (lb_mod.init runs in - # the GenServer), so BEAM reclaims the table automatically when the - # process exits — no explicit shutdown callback needed. test "disconnect/1 exits the GenServer and the ETS table is freed", %{ ref: ref, target: target, @@ -129,9 +123,6 @@ defmodule GRPC.Client.ConnectionTest do {:ok, _} = Connection.disconnect(channel) - # disconnect replies before the GenServer terminates (via {:continue, :stop}), - # so wait for the process to actually exit before asserting BEAM has reclaimed - # the table. assert_receive {:DOWN, ^ref_mon, :process, ^pid, _reason}, 500 assert :ets.info(tid) == :undefined @@ -157,10 +148,6 @@ defmodule GRPC.Client.ConnectionTest do end describe "pick_channel/2 races with disconnect/1" do - # Concurrent RPCs must never crash when a disconnect is tearing down the - # LB state. The pick path's :persistent_term.get falls back to nil when - # the entry is gone, and the per-LB pick rescues ArgumentError if the - # ETS table is reclaimed mid-pick. test "many concurrent picks complete without crashing while disconnect runs", %{ ref: ref, target: target, @@ -185,21 +172,17 @@ defmodule GRPC.Client.ConnectionTest do end) end - # Give pickers a head start so some land before, during, and after. Process.sleep(2) {:ok, _} = Connection.disconnect(channel) for _ <- 1..picker_count do assert_receive {:done, _, results}, 2_000 - # Every result must be well-shaped — either a valid channel or the - # documented error. Anything else means we crashed a caller. for r <- results do assert match?({:ok, %Channel{}}, r) or r == {:error, :no_connection} end end - # And confirm the pickers actually ran to completion. for pid <- pickers do refute Process.alive?(pid) end @@ -207,10 +190,6 @@ defmodule GRPC.Client.ConnectionTest do end describe "resource leaks over repeated connect/disconnect" do - # Mirrors the regression style used in #509 for the persistent_term leak: - # cycle connect/disconnect many times and assert zero growth in the - # long-lived stores we own (Connection-keyed persistent_term entries + - # the count of ETS tables). test "500 cycles leave persistent_term clean and no per-LB tables leak", %{ target: target, adapter: adapter @@ -230,7 +209,6 @@ defmodule GRPC.Client.ConnectionTest do assert after_pt_count == before_pt_count, "persistent_term leaked: before=#{before_pt_count} after=#{after_pt_count}" - # Allow small slop for tables the VM may have created incidentally. assert after_table_count - before_table_count <= 5, "ETS tables leaked: before=#{before_table_count} after=#{after_table_count}" end diff --git a/grpc/test/grpc/client/load_balacing/pick_first_test.exs b/grpc/test/grpc/client/load_balacing/pick_first_test.exs index 1d20162fd..52ba1c30b 100644 --- a/grpc/test/grpc/client/load_balacing/pick_first_test.exs +++ b/grpc/test/grpc/client/load_balacing/pick_first_test.exs @@ -44,9 +44,6 @@ defmodule GRPC.Client.LoadBalancing.PickFirstTest do assert {:error, :no_channels} = PickFirst.pick(state) end - # Guards the race where the owning Connection GenServer dies (and BEAM - # reclaims the ETS table) between a caller's registry lookup and this - # pick — the rescue should turn the BIF crash into a tagged error. test "returns :no_channels instead of raising when the table was deleted" do {:ok, state} = PickFirst.init(channels: channels([{"a", 1}])) :ets.delete(state.tid) diff --git a/grpc/test/grpc/client/load_balacing/round_robin_test.exs b/grpc/test/grpc/client/load_balacing/round_robin_test.exs index 55902eb18..f8ee17af5 100644 --- a/grpc/test/grpc/client/load_balacing/round_robin_test.exs +++ b/grpc/test/grpc/client/load_balacing/round_robin_test.exs @@ -1,19 +1,4 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do - @moduledoc """ - Tests for the round-robin load balancer (ETS for the channel tuple, - `:atomics` for the cursor). - - Covers: - 1. init/1 creates an ETS table + atomics ref and returns them in state - 2. init/1 rejects empty channel lists - 3. pick/1 rotates through channels in order - 4. pick/1 wraps around after the last channel - 5. update/2 replaces channels in place without creating a new table or atomics ref - 6. update/2 accepts empty channel lists - 7. update/2 resets the cursor so the first pick is the first channel - 8. pick/1 is safe under concurrent access (many processes, one table) - 9. pick/1 returns :no_channels (not a crash) when the table is gone - """ use ExUnit.Case, async: true alias GRPC.Channel @@ -91,9 +76,6 @@ defmodule GRPC.Client.LoadBalancing.RoundRobinTest do end end - # Guards the race where the owning Connection GenServer dies (and BEAM - # reclaims the ETS table) between a caller's registry lookup and this - # pick — the rescue should turn the BIF crash into a tagged error. describe "pick/1 race with table deletion" do test "returns :no_channels instead of raising when the table was deleted" do {:ok, state} = RoundRobin.init(channels: channels([{"a", 1}])) From 8af2f5fa5490ada753a7513bc23f1e8ae70dabe8 Mon Sep 17 00:00:00 2001 From: Chris Greeno Date: Wed, 13 May 2026 20:01:27 +0100 Subject: [PATCH 21/21] test(dns_resolver): disconnect channel at end of failing-channels test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Test leaked a Connection + DNSResolver pair with resolve_interval: 200ms. After the test exited, the leaked resolver kept calling MockResolver.resolve/1, which Mox rejected (UnexpectedCallError). Connection.handle_info re-init'd the resolver, MockResolver.init/2 also raised, Connection crashed, DynamicSupervisor restarted it, max_restarts tripped, and the supervisor shut down — breaking later tests that try to start_child (notably the new 500-cycle leak test). --- grpc/test/grpc/client/dns_resolver_test.exs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/grpc/test/grpc/client/dns_resolver_test.exs b/grpc/test/grpc/client/dns_resolver_test.exs index 68fd41130..972ed5e21 100644 --- a/grpc/test/grpc/client/dns_resolver_test.exs +++ b/grpc/test/grpc/client/dns_resolver_test.exs @@ -1061,6 +1061,8 @@ defmodule GRPC.Client.ReResolveTest do Process.sleep(@wait) assert {:error, :no_connection} = Connection.pick_channel(channel) + + disconnect_and_wait(channel) end end