Conversation
❌ 4 Tests Failed:
```python
rng = np.random.default_rng(rng)
if not isinstance(rng, _FakeRandomGen) or not isinstance(
    rng._arg, int | np.random.RandomState
):
    # TODO: remove this error and if we don't have a _FakeRandomGen,
    # just use rng.integers to make a seed farther down
    msg = f"rng needs to be an int or a np.random.RandomState, not a {type(rng).__name__} when passing a dask array"
```
reminder to self: do this TODO
```python
for rowidx, sub_rng in zip(
    under_target, rng.spawn(len(under_target)), strict=True
):
    _downsample_array(
```
I think `spawn` is more for parallel independent streams. This is correct for reproducibility, but it may be paying a performance overhead for an independence that is usually not needed. So far I've only seen `spawn` used either when the `rng` is handed to another function while the current code flow needs to keep its own stream independent, or in parallel cases, not in a sequential for loop. My argument goes like this: if this for loop were unrolled and the code were written sequentially, would we still use `spawn`? I think not, because we don't do that in sparse multiply, for example.
In fact I ran an AI-generated script, and these were the results. For some reason `_FakeRandomGen` is super slow, but they also show that `spawn` has a real overhead.
```
❯ python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (random_state=0)          20.07 ms
  real Generator (with spawn)       8.30 ms
  real Generator (no spawn)         4.53 ms
  --> spawn overhead: +83%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (random_state=0)         399.50 ms
  real Generator (with spawn)      86.80 ms
  real Generator (no spawn)        53.74 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (random_state=0)        2968.35 ms
  real Generator (with spawn)     320.06 ms
  real Generator (no spawn)       185.98 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
```
❯ python bench_downsample.py
Warming up numba JIT...
Done.

counts_per_cell=50 n_obs=500 n_vars=200 nnz=29840
  legacy (_FakeRandomGen)          18.96 ms
  real Generator (with spawn)       7.77 ms
  real Generator (no spawn)         4.22 ms
  --> spawn overhead: +84%   real vs legacy: -59%

counts_per_cell=100 n_obs=5000 n_vars=500 nnz=746289
  legacy (_FakeRandomGen)         412.81 ms
  real Generator (with spawn)      88.76 ms
  real Generator (no spawn)        54.64 ms
  --> spawn overhead: +62%   real vs legacy: -78%

counts_per_cell=50 n_obs=20000 n_vars=1000 nnz=5969861
  legacy (_FakeRandomGen)        2956.91 ms
  real Generator (with spawn)     322.48 ms
  real Generator (no spawn)       187.00 ms
  --> spawn overhead: +72%   real vs legacy: -89%
```
the script: https://gist.github.com/selmanozleyen/2046c849bb10f541642cea2ac7daa0db
Is this true? I was looking into your comment and found this:

```python
import numpy as np

# High quality entropy created with: f"0x{secrets.randbits(128):x}"
entropy = 0x3034c61a9ae04ff8cb62ab8ec2c4b501

rng = np.random.default_rng(entropy)
# Generate a random number as the first step
first_random = rng.uniform()

rng = np.random.default_rng(entropy)
# Now, spawn as the first step
child_rng1, child_rng2 = rng.spawn(2)
# And then do RNG
assert first_random == rng.uniform()
```

Also, reading through the linked thread, it seems the consensus was …
@ilan-gold yeah, I'm sure! I was confused: they have two parts of internal state. Spawning advances the SeedSequence. I can do a writeup when I'm back working.
Ok @flying-sheep, that makes a load of sense. I assumed as much was going on (how else could what I posted be true, unless they were literally tracking the number of children created?) and it turns out that is exactly what is happening. Like literally: after spawning, only the spawn counter advances, so the entropy is unchanged.
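The two parts of internal state can be seen directly on `SeedSequence`: spawning leaves the entropy untouched and only advances the spawn counter:

```python
import numpy as np

ss = np.random.SeedSequence(42)
children = ss.spawn(2)

# the parent's entropy is unchanged by spawning...
assert ss.entropy == 42
# ...but the parent counts how many children it has produced,
# so the next spawn call yields fresh, non-colliding streams
assert ss.n_children_spawned == 2
# each child is identified by its position in the spawn tree
assert children[0].spawn_key == (0,)
assert children[1].spawn_key == (1,)
```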
Should we store the new RNG in adata.uns? If no, this fixes #1131
No opinion, seems moderately unrelated. I guess with the new structure it's a bit meaningless: seeds have no impact if you're passing in a Generator, and you can't serialize a Generator.
Should we keep random_state in the docs?
No
How should we annotate that the default rng isn’t actually None but “a new instance of _LegacyRandom(0)” but people can pass rng=None to get the future default behavior?
I noticed there is no default docstring for `rng` like there is for `random_state`. Why not? It seems like the long-term goal is to unify public APIs. Is there a way to create a default docstring that generalizes, but can refer to an old version of the package/docs as a disclaimer for the "default" behavior? Maybe not. It's hard for me to hold everything in my head.
Should I handle passing rng to neighbors transformer?
It seems like we don't do it now unless it's a self-constructed transformer, so I probably wouldn't. Is this what you mean?
| """ | ||
| rng_init, rng_umap = np.random.default_rng(rng).spawn(2) | ||
| meta_random_state = ( | ||
| dict(random_state=rng._arg) if isinstance(rng, _FakeRandomGen) else {} |
There was a problem hiding this comment.
I guess it's to distinguish it from all the attributes inherited from Generator, but that's probably not necessary.
It's a private class, so once we've established that an object has this class, accessing its fields might as well be public.
| """Return `self` `n_children` times. | ||
|
|
||
| In a real generator, the spawned children are independent, | ||
| but for backwards compatibility we return the same instance. |
There was a problem hiding this comment.
| but for backwards compatibility we return the same instance. | |
| but for backwards compatibility we return the same instance so that its internal state is advanced by each child. |
```python
meta_random_state = (
    dict(random_state=rng._arg) if isinstance(rng, _FakeRandomGen) else {}
)
rng = _if_legacy_apply_global(rng)
```
Is this not a behavior change? For example, the `init_pos in adata.obsm` branch combined with `layout == "fa"` would never set the global seed?
this kind of comment is exactly why I was looking forward to more eyes on this PR, thanks!
```python
if isinstance(rng, _FakeRandomGen):
    from sklearn.random_projection import sample_without_replacement

    idx = sample_without_replacement(np.prod(dims), nsamp, random_state=rng._arg)
```
Again, it seems like `_arg` should be public.
```python
    return _downsample_array_inner(col, cumcounts, sample, inplace=inplace)


# TODO: can/should this be parallelized?
```
Seems like a good candidate!
yeah, I just left it alone to not cause any more changes in this PR than there already were.
```python
rng = np.random.default_rng(rng)
# this exists only to be stored in our PCA container object
random_state_meta = _legacy_random_state(rng)
[rng_init, rng_svds] = rng.spawn(2)
```
I see you have "we use spawn to pass sub-generators to each subtask, so we can add subtasks without affecting subsequent calls e.g." in your comment on the numpy issue but why do we want this property of independent subtasks?
See my comment there: numpy/numpy#24086 (comment)
I'm torn between these two arguments:

- when libraries change from under us, things change anyway, so you can't expect your script to produce the exact same result after updating your packages, even when passing a seed, and
- some steps aren't impactful enough to have qualitative changes, and spawning allows us to pass `rng` to more things that we previously didn't pass it to (see example in that comment)
```diff
 [rng_root] = rng.spawn(1)
-[rng_eigsh] = rng_root.spawn(1)
+rng_eigsh, rng_other_stuff = rng_root.spawn(2)
 random_init = rng_root.uniform(...)
 eigsh(A, k, v0=random_init, rng=rng_eigsh)
+do_thing(rng_other_stuff)
```

I think in your example it only matters if you don't know whether `eigsh` still works under the hood in parallel after it returns. But we know it doesn't, right? So continuing with `rng_other_stuff` or `rng_eigsh` doesn't matter.
I think it only matters for either:
- abstraction reasons: when you don't know or don't want to know what the function you call does with the rng, how long it keeps it etc.
- or writing parallel code
Not in AnnData, but being serialized in general is an explicit feature of them.
great point, I’ll add that, and that’ll make documenting this easy.
I think we do it when the user passed a transformer class, or when explicitly using a transformer that supports it (sklearn?). My point was: sklearn doesn't support `numpy.random.Generator`.

#3371

The idea is to make our code behave as closely as possible to how it did, but make it read like modern code:

- `random_state` into `rng` argument (deduplicating them and using 0 by default)
- … `RandomState`)
- `_FakeRandomGen.wrap_global` and `_if_legacy_apply_global` to conditionally replace `np.random.seed` and other global-state-mutating functions
- `_legacy_random_state` to get back a `random_state` argument for APIs that don't take `rng`, e.g. scikit-learn stuff and `"random_state"` in `adata.uns[thing]["params"]`

I also didn't convert the `transformer` argument to `neighbors` (yet?), or deprecated stuff like `louvain` or the `external` APIs.

Reviewing

First, some short info about how Generators work:

- … `spawn` independent children (doing so advances their internal state once)
- … `rng: SeedLike | RNGLike | None = None` for the argument, `None` meaning random initialization

Now, questions to the reviewers:

- Should we store the new RNG in `adata.uns`? If no, this fixes "Passing a RandomState instance can cause failures to save" #1131
- Should we keep `random_state` in the docs?
- How should we annotate that the default `rng` isn't actually `None` but "a new instance of `_LegacyRandom(0)`", while people can pass `rng=None` to get the future default behavior?
- Should I handle passing `rng` to neighbors transformer?
- … `rng` can be passed?
- … `np.random.seed()`

TODO:

- … `np.random.seed` calls (maybe `if isinstance(rng, _FakeRandomGen): gen = legacy_numpy_gen(rng)` or so?)
- add `spawn` to `_FakeRandomGen` (that does nothing) and use `spawn` for a tree structure
- `rng.clone()` or similar? numpy/numpy#24086 (comment)