Commit ecb4f49

Reuse cached corpus snapshot in direct rust fallback
1 parent 5e1dfe1 commit ecb4f49

5 files changed

Lines changed: 42 additions & 10 deletions

docs/modernization-handoff.md

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ Delivered:
 - Rust session execution refactor to shared session corpus storage + borrowed orchestration helpers, eliminating deep corpus clone overhead on each session EM run
 - unified Rust session orchestration API (`run_em_on_session`) that applies settings + runs EM in one call inside Rust session orchestration
 - `Lda::Backends::Rust` now routes session-path EM through managed Rust session orchestration (`run_em_on_session_with_corpus`), leaving session reuse/recovery decisions in Rust; when session orchestration is unavailable it still prefers direct Rust non-session orchestration (`run_em_with_start_seed`) before legacy Ruby-side beta-input fallback (`run_em`)
+- direct non-session Rust orchestration now reuses the backend's cached Rust corpus snapshot instead of rebuilding corpus arrays from `@corpus` on each fallback invocation
 - Rust managed-session orchestration API (`run_em_on_session_with_corpus`) added to recreate missing sessions and run EM in one Rust call
 - Rust session lifecycle replacement API (`replace_corpus_session`) added so corpus reassignment can update existing Rust sessions in place (config reset + corpus swap) instead of Ruby-side drop/recreate
 - `Lda::Backends::Rust` now keeps session-based orchestration on the managed Rust path (`run_em_on_session_with_corpus`) even when sessions are dropped externally
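
The snapshot reuse described in the added bullet can be pictured with a minimal Ruby sketch. Everything in it is hypothetical: the `SnapshotCache` class, its `build_count` counter, and the hash-per-document corpus shape are illustrative assumptions, not the gem's API. It only shows the general idea of building the flattened corpus arrays once and reusing them until the corpus is reassigned.

```ruby
# Hypothetical illustration of snapshot reuse; not the lda-ruby implementation.
class SnapshotCache
  attr_reader :build_count

  def initialize(documents)
    @documents = documents
    @build_count = 0
    @snapshot = nil
  end

  # Reassigning the corpus invalidates the cached snapshot.
  def documents=(documents)
    @documents = documents
    @snapshot = nil
  end

  # Build the flattened arrays only when no snapshot is cached.
  def snapshot
    @snapshot ||= build_snapshot
  end

  private

  # Each document is assumed to be a { term_id => count } hash.
  def build_snapshot
    @build_count += 1
    {
      document_words: @documents.map(&:keys),
      document_counts: @documents.map(&:values),
      document_lengths: @documents.map { |doc| doc.values.sum }
    }
  end
end

cache = SnapshotCache.new([{ 0 => 2, 3 => 1 }, { 1 => 4 }])
first = cache.snapshot
second = cache.snapshot # served from the cache, no rebuild
```

Repeated calls return the same object, so a fallback path that reads the snapshot pays the flattening cost once per corpus assignment rather than once per invocation.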

docs/porting-strategy.md

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ Completed in `codex/experiment-ruby3-modernization`:
 - Rust session orchestration now runs on shared Rust-side corpus session data via borrowed execution helpers, avoiding deep corpus array cloning on each session EM call.
 - Unified Rust session API added (`run_em_on_session`) to apply settings and execute EM in one call inside Rust session orchestration.
 - `Lda::Backends::Rust` now prefers direct Rust non-session orchestration (`run_em_with_start_seed`) before legacy `run_em(initial_beta, ...)` compatibility fallback when a session path is unavailable.
+- Direct non-session Rust orchestration now reuses the backend's cached Rust corpus snapshot instead of rebuilding corpus arrays from `@corpus` on each fallback invocation.
 - Rust managed-session orchestration API added (`run_em_on_session_with_corpus`) to recreate missing sessions and execute EM in one Rust call.
 - Rust session lifecycle replacement API added (`replace_corpus_session`) so corpus reassignment can update existing Rust sessions in place (config reset + corpus swap) instead of Ruby-side drop/recreate.
 - `Lda::Backends::Rust` now routes session-path EM through `run_em_on_session_with_corpus`, leaving session reuse/recovery decisions in Rust and reducing fallback to non-session orchestration when sessions are externally dropped.

docs/rust-orchestration-guardrails.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Current parity expectations:
 - Session-based orchestration paths (`run_em_on_session`, `run_em_on_session_with_start_seed`, `run_em_on_session_start`, `run_em_on_session_with_corpus`) must match direct non-session orchestration for equivalent settings/seeds.
 - `Lda::Backends::Rust` session-path EM should prefer the managed Rust session entrypoint (`run_em_on_session_with_corpus`) rather than branching in Ruby between session-only and recovery paths.
 - `Lda::Backends::Rust` non-session fallback should prefer Rust start-aware orchestration (`run_em_with_start_seed`) before legacy beta-input orchestration (`run_em`).
+- Direct non-session fallback should reuse the backend's cached Rust corpus snapshot rather than rebuilding corpus arrays from `@corpus` for each invocation.
 - Rust backend corpus/session lifecycle must not leak session count across corpus replacement.
 - Missing-session recovery in managed session orchestration (`run_em_on_session_with_corpus`) must recreate a usable session and keep parity with direct orchestration.
 - Corpus reassignment through Rust session replacement lifecycle (`replace_corpus_session`) must preserve stable session count and route subsequent EM runs over updated corpus data.
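
The preference order in these guardrails (managed session path, then start-aware direct path, then legacy beta-input path) amounts to a "first strategy that succeeds wins" chain. The sketch below is a hypothetical illustration of that ordering only: `run_with_fallbacks` and the lambdas are invented stand-ins, and the real backend may structure its dispatch differently.

```ruby
# Hypothetical sketch of an ordered fallback chain; names are illustrative.
# Each strategy returns true when it handled the run, false when unavailable.
def run_with_fallbacks(strategies)
  strategies.each do |name, strategy|
    return name if strategy.call
  end
  nil
end

attempted = []
strategies = {
  session: -> { attempted << :session; false },      # session path unavailable
  start_seed: -> { attempted << :start_seed; true }, # direct path succeeds
  legacy: -> { attempted << :legacy; true }          # never reached
}

chosen = run_with_fallbacks(strategies)
```

Ruby hashes preserve insertion order, so the order in which the strategies are listed is the order in which they are tried.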

lib/lda-ruby/backends/rust.rb

Lines changed: 17 additions & 10 deletions
@@ -176,31 +176,32 @@ def rust_orchestrated_em_with_session(start)
       def rust_orchestrated_em_with_start_seed(start)
         return false unless defined?(::Lda::RustBackend)
         return false unless ::Lda::RustBackend.respond_to?(:run_em_with_start_seed)
+        return false unless ensure_rust_corpus_snapshot
 
-        em_input = rust_em_corpus_input
-        return false if em_input.nil?
+        topics = Integer(num_topics)
+        return false unless topics.positive?
 
         random_seed = Integer(next_random_seed)
         output = ::Lda::RustBackend.run_em_with_start_seed(
           start.to_s,
-          em_input.fetch(:document_words),
-          em_input.fetch(:document_counts),
-          Integer(em_input.fetch(:topics)),
-          Integer(em_input.fetch(:terms)),
+          @rust_document_words,
+          @rust_document_counts,
+          topics,
+          Integer(@rust_corpus_terms),
           Integer(max_iter),
           Float(convergence),
           Integer(em_max_iter),
           Float(em_convergence),
           Float(init_alpha),
-          Float(em_input.fetch(:min_probability)),
+          MIN_PROBABILITY,
           random_seed
         )
 
         return false unless valid_rust_em_output?(
           output,
-          em_input.fetch(:document_lengths),
-          em_input.fetch(:topics),
-          em_input.fetch(:terms)
+          @rust_document_lengths,
+          topics,
+          Integer(@rust_corpus_terms)
         )
 
         beta_probabilities, beta_log, gamma, phi = output
@@ -349,6 +350,12 @@ def register_rust_corpus_session(previous_session_id = nil)
       end
 
       def ensure_rust_corpus_session
+        ensure_rust_corpus_snapshot
+      rescue StandardError
+        false
+      end
+
+      def ensure_rust_corpus_snapshot
         has_session_data = @rust_corpus_terms && @rust_document_lengths && @rust_document_words && @rust_document_counts
         return true if has_session_data
 
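
The new `ensure_rust_corpus_session` wrapper uses a method-level `rescue` to turn any `StandardError` into a plain `false`. A minimal, self-contained sketch of that shape follows. `GuardedPreparer` and its block-based builder are hypothetical names introduced only for illustration; they do not appear in the gem.

```ruby
# Hypothetical sketch of a boolean guard built on a method-level rescue.
class GuardedPreparer
  def initialize(&builder)
    @builder = builder
  end

  # Same shape as the wrapper in the diff: delegate, rescue, return false.
  def ensure_ready
    prepare
  rescue StandardError
    false
  end

  # Unguarded variant: errors raised by the builder reach the caller.
  def prepare
    @builder.call
    true
  end
end

ok = GuardedPreparer.new { :built }
broken = GuardedPreparer.new { raise ArgumentError, "bad corpus" }

ok_result = ok.ensure_ready         # builder succeeded
broken_result = broken.ensure_ready # error swallowed by the guard
```

Callers that go through the guarded method can treat failure as "path unavailable" and move on to the next fallback, while callers of the unguarded method still see the original exception.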

test/rust_orchestration_test.rb

Lines changed: 22 additions & 0 deletions
@@ -667,6 +667,28 @@ def test_rust_backend_non_session_fallback_prefers_run_em_with_start_seed
     backend&.corpus = nil
   end
 
+  def test_rust_backend_direct_non_session_path_reuses_cached_corpus_snapshot
+    backend = Lda::Backends::Rust.new(random_seed: 1234)
+    backend.corpus = Lda::TextCorpus.new(FIXTURE_DOCUMENTS)
+    backend.verbose = false
+    backend.num_topics = @topics
+    backend.max_iter = @max_iter
+    backend.convergence = @convergence
+    backend.em_max_iter = @em_max_iter
+    backend.em_convergence = @em_convergence
+    backend.init_alpha = @init_alpha
+
+    backend.define_singleton_method(:rust_orchestrated_em_with_session) { |_start| false }
+    backend.define_singleton_method(:rust_em_corpus_input) do
+      raise "direct non-session path should reuse cached corpus snapshot"
+    end
+
+    backend.em("random")
+    assert_equal @topics, backend.gamma.first.size
+  ensure
+    backend&.corpus = nil
+  end
+
   def test_rust_backend_session_path_prefers_managed_session_entrypoint
     backend = nil
     rust_singleton = nil
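
The new test proves a code path is not taken by overriding the unwanted method on a single instance so that it raises. The sketch below isolates that technique with a hypothetical `Pipeline` class (not part of the gem) standing in for a backend with a cached fast path and a slower rebuild path.

```ruby
# Hypothetical stand-in for a backend with a cached path and a rebuild path.
class Pipeline
  def initialize
    @cached = [1, 2, 3]
  end

  # Uses the cached data when present, otherwise falls back to rebuild.
  def run
    data = @cached || rebuild
    data.sum
  end

  def rebuild
    [0]
  end
end

pipeline = Pipeline.new

# Booby-trap the rebuild path on this one instance only.
pipeline.define_singleton_method(:rebuild) do
  raise "rebuild should not be called when a cached value exists"
end

result = pipeline.run # completes only if rebuild was never invoked
```

Because `define_singleton_method` affects just the receiver, other instances keep the original behavior and no global stubbing or teardown is needed.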
