This repository was archived by the owner on Jan 22, 2026. It is now read-only.
Commit 2f030ed
Use threads instead of processes in Dataset.summaries
Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/
This is a possible source of observed deadlocks during calls to Dataset.records.
Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.1 parent fb68074 commit 2f030ed
1 file changed
Lines changed: 2 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
137 | 137 | | |
138 | 138 | | |
139 | 139 | | |
140 | | - | |
| 140 | + | |
141 | 141 | | |
142 | 142 | | |
143 | 143 | | |
| |||
283 | 283 | | |
284 | 284 | | |
285 | 285 | | |
286 | | - | |
| 286 | + | |
287 | 287 | | |
288 | 288 | | |
289 | 289 | | |
| |||
0 commit comments