Commit aa975a8

Reuse cached snapshot in beta fallback
1 parent 1af9aa9 commit aa975a8

8 files changed

Lines changed: 146 additions & 10 deletions

docs/modernization-handoff.md

Lines changed: 1 addition & 0 deletions
@@ -97,6 +97,7 @@ Delivered:
 - unified Rust session orchestration API (`run_em_on_session`) that applies settings + runs EM in one call inside Rust session orchestration
 - `Lda::Backends::Rust` now routes cached-corpus EM through managed Rust orchestration (`run_em_on_session_with_corpus`), leaving session reuse/recovery decisions in Rust and preferring that managed path even when no active session id is cached locally; when the managed API is unavailable it still prefers direct Rust non-session orchestration (`run_em_with_start_seed`) before legacy Ruby-side beta-input fallback (`run_em`)
 - direct non-session Rust orchestration now reuses the backend's cached Rust corpus snapshot instead of rebuilding corpus arrays from `@corpus` on each fallback invocation
+- legacy Rust beta-input compatibility fallback now also reuses the backend's cached Rust corpus snapshot, only asking the pure-Ruby backend to synthesize the initial beta matrix
 - Rust managed-session orchestration API (`run_em_on_session_with_corpus`) added to recreate missing sessions and run EM in one Rust call, and now directly falls back to start-aware array execution inside Rust if session-backed execution cannot be used
 - Rust session lifecycle replacement API (`replace_corpus_session`) added so corpus reassignment can update existing Rust sessions in place (config reset + corpus swap) instead of Ruby-side drop/recreate
 - `Lda::Backends::Rust` now keeps session-based orchestration on the managed Rust path (`run_em_on_session_with_corpus`) even when sessions are dropped externally

docs/porting-strategy.md

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ Completed in `codex/experiment-ruby3-modernization`:
 - Rust session lifecycle replacement API added (`replace_corpus_session`) so corpus reassignment can update existing Rust sessions in place (config reset + corpus swap) instead of Ruby-side drop/recreate.
 - `Lda::Backends::Rust` now routes cached-corpus EM through `run_em_on_session_with_corpus`, leaving session reuse/recovery decisions in Rust, preferring that managed path even when no active session id is cached locally, and reducing Ruby-side fallback branching when sessions are externally dropped.
 - `run_em_on_session_with_corpus` now acts as a unified Rust managed-corpus entrypoint: it attempts session-backed execution first, then falls back to direct start-aware array execution inside Rust when a managed session cannot be used.
+- Legacy `run_em(initial_beta, ...)` compatibility fallback now reuses the Rust backend's cached corpus snapshot and only relies on the pure-Ruby backend to synthesize the initial beta matrix.
 - Dockerized rust runtime workflow added for local parity with CI (`Dockerfile.rust`, `bin/docker-test-rust`).
 - Gem packaging now excludes local Rust cargo build artifacts (`target/**`) for clean release builds.
 - Backend benchmark driver added (`bin/benchmark-backends`) to track pure/native/rust runtime deltas.

docs/rust-orchestration-guardrails.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ Current parity expectations:
 - `Lda::Backends::Rust` cached-corpus EM should prefer the managed Rust session entrypoint (`run_em_on_session_with_corpus`) even when no active session id is cached locally, rather than branching in Ruby between session-only, recovery, and direct paths.
 - `Lda::Backends::Rust` non-session fallback should prefer Rust start-aware orchestration (`run_em_with_start_seed`) before legacy beta-input orchestration (`run_em`).
 - Direct non-session fallback should reuse the backend's cached Rust corpus snapshot rather than rebuilding corpus arrays from `@corpus` for each invocation.
+- Legacy beta-input compatibility fallback should also reuse the backend's cached Rust corpus snapshot rather than rebuilding full EM corpus input in Ruby.
 - Rust backend corpus/session lifecycle must not leak session count across corpus replacement.
 - Missing-session recovery in managed session orchestration (`run_em_on_session_with_corpus`) must recreate a usable session and keep parity with direct orchestration.
 - Managed Rust corpus orchestration (`run_em_on_session_with_corpus`) must keep parity with direct orchestration even when it falls back internally from session-backed execution to start-seeded array execution.
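
The preference order these guardrails describe (managed corpus entrypoint, then start-seeded direct orchestration, then the legacy beta-input path) can be pictured as a first-success chain. The following is only a minimal sketch under stated assumptions: `run_first_available` and the path labels are hypothetical names invented for illustration, and each path is modelled as a callable that returns `true` when it handled the run and `false` when it is unavailable, which mirrors how the tests in this commit stub the earlier paths to return `false`.

```ruby
# Hypothetical sketch: none of these names come from the lda-ruby source.
# Each path is a callable returning true when it handled the EM run and
# false when it is unavailable, so the next path gets a chance.
def run_first_available(paths, start)
  paths.each do |name, path|
    return name if path.call(start)
  end
  nil
end

# Stand-in paths in preference order (Ruby hashes keep insertion order).
# The first two report themselves unavailable, so the legacy path runs.
paths = {
  managed_corpus: ->(_start) { false },
  start_seed: ->(_start) { false },
  legacy_beta: ->(_start) { true }
}

handled_by = run_first_available(paths, "random")
```

With these stand-ins the chain reaches the last entry; if the first callable returned `true`, the later ones would never be invoked.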

ext/lda-ruby-rust/README.md

Lines changed: 5 additions & 1 deletion
@@ -21,6 +21,7 @@ Current scope:
 - `Lda::RustBackend.seeded_topic_term_probabilities(document_words, document_counts, topics, terms, min_probability)`
 - `Lda::RustBackend.random_topic_term_probabilities(topics, terms, min_probability, random_seed)`
 - `Lda::RustBackend.create_corpus_session(document_words, document_counts, terms)`
+- `Lda::RustBackend.replace_corpus_session(session_id, document_words, document_counts, terms)`
 - `Lda::RustBackend.drop_corpus_session(session_id)`
 - `Lda::RustBackend.configure_corpus_session(session_id, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability)`
 - `Lda::RustBackend.run_em(initial_beta, document_words, document_counts, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability)`
@@ -29,6 +30,7 @@ Current scope:
 - `Lda::RustBackend.run_em_on_session(session_id, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
 - `Lda::RustBackend.run_em_on_session_start(session_id, start, random_seed)`
 - `Lda::RustBackend.run_em_on_session_with_start_seed(session_id, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
+- `Lda::RustBackend.run_em_on_session_with_corpus(session_id, document_words, document_counts, terms, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`

 Hot-path kernels currently executed in Rust when `backend: :rust` is active:
 - topic weights for a word across topics
@@ -46,7 +48,9 @@ Hot-path kernels currently executed in Rust when `backend: :rust` is active:
 - unified session-settings orchestration (`run_em_on_session`) that applies settings and executes EM in one call
 - session-based EM orchestration against Rust-managed corpus lifecycle (`create_corpus_session` + `run_em_on_session_with_start_seed`)
 - settings-aware session orchestration (`configure_corpus_session` + `run_em_on_session_start`)
-- `Lda::Backends::Rust` prefers `run_em_with_start_seed` for direct non-session orchestration when session orchestration is unavailable
+- managed corpus orchestration (`run_em_on_session_with_corpus`) that can recreate missing sessions and, if session-backed execution cannot be used, falls back internally to direct start-aware execution inside Rust
+- `Lda::Backends::Rust` prefers `run_em_on_session_with_corpus` whenever a cached Rust corpus snapshot is available, even if no session id is currently cached locally
+- direct and legacy beta-input compatibility fallbacks both reuse the backend's cached Rust corpus snapshot instead of rebuilding corpus arrays in Ruby
 - unknown EM start modes in seed-aware orchestration follow Ruby's non-seeded fallback behavior (seeded by explicit `random_seed`)

 Remaining numeric LDA kernels are still provided by the pure Ruby backend and will move incrementally.

lib/lda-ruby/backends/pure_ruby.rb

Lines changed: 24 additions & 8 deletions
@@ -59,6 +59,23 @@ def rust_em_input(start)
         build_em_input(start)
       end

+      # Returns only the initial beta matrix for Rust compatibility paths that
+      # already hold a cached corpus snapshot.
+      def rust_initial_beta_probabilities(start, document_words, document_counts, topics, terms)
+        start_mode = start.to_s
+
+        if start_mode.strip.casecmp("seeded").zero? || start_mode.strip.casecmp("deterministic").zero?
+          seeded_topic_term_probabilities(
+            Integer(topics),
+            Integer(terms),
+            document_words,
+            document_counts
+          )
+        else
+          initial_topic_term_probabilities(Integer(topics), Integer(terms))
+        end
+      end
+
       def em_from_input(em_input)
         return nil if em_input.nil?

@@ -129,21 +146,20 @@ def build_em_input(start)
         document_words = @corpus.documents.map { |document| document.words.map(&:to_i) }
         document_counts = @corpus.documents.map { |document| document.counts.map(&:to_f) }

-        initial_beta_probabilities =
-          if start.to_s.strip.casecmp("seeded").zero? || start.to_s.strip.casecmp("deterministic").zero?
-            seeded_topic_term_probabilities(topics, terms, document_words, document_counts)
-          else
-            initial_topic_term_probabilities(topics, terms)
-          end
-
         {
           topics: topics,
           terms: terms,
           document_words: document_words,
           document_counts: document_counts,
           document_totals: document_counts.map { |counts| counts.sum.to_f },
           document_lengths: document_words.map(&:length),
-          initial_beta_probabilities: initial_beta_probabilities,
+          initial_beta_probabilities: rust_initial_beta_probabilities(
+            start,
+            document_words,
+            document_counts,
+            topics,
+            terms
+          ),
           min_probability: MIN_PROBABILITY
         }
       end
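
The start-mode check that `rust_initial_beta_probabilities` extracts is a whitespace-tolerant, case-insensitive match against two mode names. A minimal sketch of just that predicate follows; `seeded_start?` is a hypothetical helper name, not something defined in this commit, and only the condition itself is taken from the diff.

```ruby
# Hypothetical helper name; the condition mirrors the one in the diff.
# Anything that is not "seeded" or "deterministic" (after trimming and
# ignoring case) takes the non-seeded branch, including nil.
def seeded_start?(start)
  mode = start.to_s.strip
  mode.casecmp("seeded").zero? || mode.casecmp("deterministic").zero?
end

seeded_start?(" Seeded ")      # trimmed and case-folded before comparing
seeded_start?(:deterministic)  # symbols work because of to_s
seeded_start?("random")        # takes the non-seeded branch
```

Because `nil.to_s` is an empty string, a missing start mode quietly selects the non-seeded branch rather than raising.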

lib/lda-ruby/backends/rust.rb

Lines changed: 26 additions & 1 deletion
@@ -221,7 +221,32 @@ def rust_orchestrated_em_with_beta(start)
         return false unless defined?(::Lda::RustBackend)
         return false unless ::Lda::RustBackend.respond_to?(:run_em)

-        em_input = @fallback.rust_em_input(start)
+        em_input =
+          if ensure_rust_corpus_snapshot && @fallback.respond_to?(:rust_initial_beta_probabilities)
+            topics = Integer(num_topics)
+            terms = Integer(@rust_corpus_terms)
+            initial_beta_probabilities = @fallback.rust_initial_beta_probabilities(
+              start,
+              @rust_document_words,
+              @rust_document_counts,
+              topics,
+              terms
+            )
+
+            {
+              topics: topics,
+              terms: terms,
+              document_words: @rust_document_words,
+              document_counts: @rust_document_counts,
+              document_totals: @rust_document_counts.map { |counts| counts.sum.to_f },
+              document_lengths: @rust_document_lengths,
+              initial_beta_probabilities: initial_beta_probabilities,
+              min_probability: MIN_PROBABILITY
+            }
+          else
+            @fallback.rust_em_input(start)
+          end
+
         return true if em_input.nil?

         output = ::Lda::RustBackend.run_em(
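
To make the shape of the snapshot-backed `em_input` hash concrete, here is a small standalone sketch. It assumes a hypothetical `Snapshot` struct in place of the backend's instance variables, a hypothetical `em_input_from_snapshot` helper, and an arbitrary placeholder for the minimum probability; only the hash keys and the derivation of totals and lengths follow the diff.

```ruby
# Hypothetical stand-in for the backend's cached snapshot; the field names
# are illustrative and are not the backend's instance variables.
Snapshot = Struct.new(:document_words, :document_counts, :terms)

def em_input_from_snapshot(snapshot, topics, initial_beta, min_probability)
  {
    topics: Integer(topics),
    terms: Integer(snapshot.terms),
    # The cached arrays are passed through by reference, not rebuilt.
    document_words: snapshot.document_words,
    document_counts: snapshot.document_counts,
    # Per-document totals and lengths are derived from the cached arrays.
    document_totals: snapshot.document_counts.map { |counts| counts.sum.to_f },
    document_lengths: snapshot.document_words.map(&:length),
    initial_beta_probabilities: initial_beta,
    min_probability: min_probability
  }
end

# Two tiny documents over a three-term vocabulary.
snapshot = Snapshot.new([[0, 2], [1]], [[1.0, 3.0], [2.0]], 3)
uniform_beta = Array.new(2) { Array.new(3, 1.0 / 3) }
# 1e-10 is an arbitrary placeholder, not the library's MIN_PROBABILITY.
em_input = em_input_from_snapshot(snapshot, 2, uniform_beta, 1e-10)
```

The only per-call work left in this shape is the initial beta matrix and the cheap derived totals, which is the point of reusing the snapshot.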

test/pure_ruby_orchestration_test.rb

Lines changed: 20 additions & 0 deletions
@@ -36,6 +36,26 @@ def test_em_from_input_matches_seeded_em_output
     assert_nested_close(direct.compute_phi, from_input.compute_phi, 1e-9)
   end

+  def test_rust_initial_beta_probabilities_matches_rust_em_input_for_random_start
+    from_helper = build_backend
+    from_input = build_backend
+
+    document_words = from_helper.corpus.documents.map { |document| document.words.map(&:to_i) }
+    document_counts = from_helper.corpus.documents.map { |document| document.counts.map(&:to_f) }
+    terms = from_helper.corpus.documents.flat_map(&:words).max + 1
+
+    helper_beta = from_helper.rust_initial_beta_probabilities(
+      "random",
+      document_words,
+      document_counts,
+      from_helper.num_topics,
+      terms
+    )
+    em_input = from_input.rust_em_input("random")
+
+    assert_nested_close(helper_beta, em_input[:initial_beta_probabilities], 1e-12)
+  end
+
   def test_apply_em_state_sets_outputs
     backend = build_backend


test/rust_orchestration_test.rb

Lines changed: 68 additions & 0 deletions
@@ -689,6 +689,74 @@ def test_rust_backend_direct_non_session_path_reuses_cached_corpus_snapshot
     backend&.corpus = nil
   end

+  def test_rust_backend_beta_fallback_reuses_cached_corpus_snapshot
+    backend = nil
+    fallback = nil
+    fallback_singleton = nil
+    rust_em_input_alias = :__test_original_rust_em_input_for_beta_snapshot__
+    rust_initial_beta_alias = :__test_original_rust_initial_beta_for_beta_snapshot__
+
+    backend = Lda::Backends::Rust.new(random_seed: 1234)
+    backend.corpus = Lda::TextCorpus.new(FIXTURE_DOCUMENTS)
+    backend.verbose = false
+    backend.num_topics = @topics
+    backend.max_iter = @max_iter
+    backend.convergence = @convergence
+    backend.em_max_iter = @em_max_iter
+    backend.em_convergence = @em_convergence
+    backend.init_alpha = @init_alpha
+
+    backend.define_singleton_method(:rust_orchestrated_em_with_managed_corpus) { |_start| false }
+    backend.define_singleton_method(:rust_orchestrated_em_with_start_seed) { |_start| false }
+
+    cached_document_words = backend.instance_variable_get(:@rust_document_words)
+    cached_document_counts = backend.instance_variable_get(:@rust_document_counts)
+    cached_terms = backend.instance_variable_get(:@rust_corpus_terms)
+    expected_topics = @topics
+
+    fallback = backend.instance_variable_get(:@fallback)
+    fallback_singleton = fallback.singleton_class
+    used_cached_snapshot = false
+
+    silence_redefinition_warnings do
+      fallback_singleton.send(:alias_method, rust_em_input_alias, :rust_em_input)
+      fallback_singleton.send(:alias_method, rust_initial_beta_alias, :rust_initial_beta_probabilities)
+
+      fallback_singleton.send(:define_method, :rust_em_input) do |_start|
+        raise "beta fallback should not rebuild full rust_em_input when snapshot is cached"
+      end
+
+      fallback_singleton.send(:define_method, :rust_initial_beta_probabilities) do |start, document_words, document_counts, topics, terms|
+        used_cached_snapshot =
+          document_words.equal?(cached_document_words) &&
+          document_counts.equal?(cached_document_counts) &&
+          topics == expected_topics &&
+          terms == cached_terms
+        public_send(rust_initial_beta_alias, start, document_words, document_counts, topics, terms)
+      end
+    end
+
+    backend.em("random")
+    assert_equal true, used_cached_snapshot
+    assert_equal @topics, backend.gamma.first.size
+  ensure
+    silence_redefinition_warnings do
+      if defined?(fallback_singleton) && fallback_singleton.method_defined?(rust_initial_beta_alias)
+        fallback_singleton.send(:remove_method, :rust_initial_beta_probabilities)
+        fallback_singleton.send(:alias_method, :rust_initial_beta_probabilities, rust_initial_beta_alias)
+        fallback_singleton.send(:remove_method, rust_initial_beta_alias)
+      end
+
+      if defined?(fallback_singleton) && fallback_singleton.method_defined?(rust_em_input_alias)
+        fallback_singleton.send(:remove_method, :rust_em_input)
+        fallback_singleton.send(:alias_method, :rust_em_input, rust_em_input_alias)
+        fallback_singleton.send(:remove_method, rust_em_input_alias)
+      end
+    end
+
+    backend&.corpus = nil
+  end
+
   def test_rust_backend_prefers_managed_corpus_entrypoint_without_active_session
     backend = nil
     rust_singleton = nil
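
The new test relies on a singleton-class spy: alias the original method, install a replacement that records whether it received the very same object (`equal?`, not `==`), delegate to the alias, then restore in `ensure`. A reduced, self-contained sketch of that pattern is below. `Helper`, `build`, and `:__original_build__` are hypothetical names, and the restore step is simplified relative to the test, since here removing the singleton-level override is enough to expose the class-level method again.

```ruby
# Hypothetical class and method names, used only to illustrate the pattern.
class Helper
  def build(list)
    list.sum
  end
end

helper = Helper.new
cached = [1, 2, 3]
singleton = helper.singleton_class
original_alias = :__original_build__
saw_cached_object = false
result = nil

begin
  # Keep the original reachable under an alias, then install the spy.
  singleton.send(:alias_method, original_alias, :build)
  singleton.send(:define_method, :build) do |list|
    saw_cached_object = list.equal?(cached) # object identity, not equality
    public_send(original_alias, list)
  end

  result = helper.build(cached)
ensure
  # Remove only what was actually defined on the singleton class, so the
  # cleanup is safe even if setup failed partway through.
  [:build, original_alias].each do |name|
    singleton.send(:remove_method, name) if singleton.instance_methods(false).include?(name)
  end
end
```

Checking identity is what distinguishes "reused the cached snapshot" from "rebuilt an equal-looking array", which an equality comparison could not tell apart.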
