quad_utils_mod search: replace 3D scratch arrays with compressed sparse row format#1102
quad_utils_mod search: replace 3D scratch arrays with compressed sparse row format#1102hkershaw-brown wants to merge 1 commit into
Conversation
See #979 The GRID_QUAD_FULLY_IRREGULAR coarse-index build previously allocated reg_list_lon/lat(nrx, nry, max_reg_list_num) as a 3D temporary, reaching ~5.5 GB at nrx=nry=900. It also errored out if any coarse box accumulated more than max_reg_list_num=800 candidate quads. Replace with a two-pass compressed sparse row build: - Pass 1: count overlaps into grid_num; peak memory ~11 MB - Prefix sum: derive grid_start and exact total entry count - Pass 2: fill flat lists using a cursor array; no cap, no abort Replace hardcoded num_reg with calculation based on grid num_reg = max(10, min(nx_or_ny, int(sqrt(nx*ny / TARGET_CANDIDATES)))) where TARGET_CANDIDATES=8, giving ~8 candidate quads per coarse box regardless of grid resolution. Remove no longer used max_reg_list_num from quad_irreg_grid_coords
|
Q. global vs. regional. wrap where there is no wrap. (mad wrapping of region around globe) How bad is it to have mad-wrap? <-- --> DART/models/utilities/quad_utils_mod.f90 Lines 764 to 769 in acf0a14 |
|
@mgharamti this is the quad search I am playing with. I think the main concerns are calculating the course grid from the actual grid rather than the hardcoded 3 cases, and the search structure size. But open to any input. DART/models/utilities/quad_utils_mod.f90 Lines 730 to 751 in acf0a14 |
mgharamti
left a comment
There was a problem hiding this comment.
@hkershaw-brown, thanks for the work on this. I was always on the edge with the previous style of 3 hard-coded cases but didn't know how to go around that. I think I understand your design and strategy. CSR clearly avoids working with huge 3D arrays.
My first lazy question is why the 8 targets on average? I guess how did you choose it? It seems to me that search cost per obs is ~O(num_grids) and number of bins/boxes is ~O(N/TARGET_CANDIDATES) and so it's a performance tuning parameter. At least we need to explain it in the docs if you decide to go with it.
Second, could we add a post-fill consistency check that verifies fill_pos(i,j) == grid_start(i,j) + grid_num(i,j) for every coarse bin? This would catch any future divergence between the counting pass and filling pass, especially around edge cases such as longitude wrapping, masked/invalid quads, or boundary-touching cells.
| endif | ||
|
|
||
| deallocate(reg_list_lon, reg_list_lat) | ||
| deallocate(fill_pos) |
There was a problem hiding this comment.
The new new approach is based on the assumption (fact?) that what we counted in the first pass is equal to what we filled in the second pass. I'd be more comfortable if we added a sanity check after filling (similar to the check in the old code). You could do something like this:
do i = 1, nrx
do j = 1, nry
if (fill_pos(i,j) /= h%ii%grid_start(i,j) + h%ii%grid_num(i,j)) then
call error_handler(E_ERR, 'init_irreg_interp', &
'count/fill mismatch in irregular grid interpolation lists')
endif
enddo
enddo
There was a problem hiding this comment.
this is a great idea, & the check is in init so not a problem runtime wise to add.
| call get_quad_corners(h%ii%lons_2d, i, j, cyclic, pole, nx, ny, u_c_lons, istatus) | ||
| if (istatus /= 0) print *, 'get_quad_corners for lons returns failure' | ||
|
|
||
| call get_quad_corners(h%ii%lats_2d, i, j, cyclic, pole, nx, ny, u_c_lats, istatus) | ||
| if (istatus /= 0) print *, 'get_quad_corners for lats returns failure' |
There was a problem hiding this comment.
Do we need to cycle here rather than just print an error? If a failure produces bad corners, both passes may count/fill bad bins or one failure path could behave differently depending on the state (right?)
There was a problem hiding this comment.
I will take a closer look - I think I am missing the logic in the original code.
The loop limits are the x,y bounds of the lon and lat arrays.
Is the print just checking if the loop limits are wrong?
If you can go off the end of the arrays,
it seems like both should fail, or both should pass.
If one fails and one passes I think this should be a catastrophic error rather than a print, since lon,lat are for the same point.
My lazy choice: Dr. Google for what was a good choice for this kind of algorithm. I think the equivalent average target (
The grids are not uniform so 'average' is probably a disgusting assumption. I think it is well worth doing some runs with the test code and different (and variable) grid resolutions to count the actual number of quads per box with the new code (and counting the per course box with the 3 hardcoded POP cases). Maybe we have 799 quads in one box, 3 in another. Currently the 8 is just an ok choice where I was finding comparable speeds to the current version, but it is definitely a magic number that needs justifying. |
| h%ii%grid_num = 0 | ||
|
|
||
| do i = 1, xlim | ||
| ! There's no wraparound in y, one box less than grid boundaries |
There was a problem hiding this comment.
this is a good comment, put back in @hkershaw-brown
There was a problem hiding this comment.
actually can you use quad_utils with a periodic y domain? e.g. wrf idealized. Not at the moment I guess.


Description:
Quad_utils_mod box search for fully irregular grids:
The runtime calculation of storage requirements is needed in particular for regional CESM where people are creating their own regional grids, but is applicable to any model using quad_utils.
The per-code memory is quite high for the box storage (see #979 & https://github.com/hkershaw-brown/quad_search_perf/blob/main/README.md). This pull request does not address distributing memory for the search storage (may be needed later), but does reduce the per core memory.
Compressed Sparse Row format:
The GRID_QUAD_FULLY_IRREGULAR coarse-index search-info array setup previously allocated reg_list_lon/lat(nrx, nry, max_reg_list_num) as a 3D temporary, reaching ~5.5 GB at nrx=nry=900. It also errored out if any coarse box accumulated more than max_reg_list_num=800 candidate quads. Noted in issue #979
Replace with a two-pass compressed sparse row build:
Replace hardcoded num_reg with calculation based on grid
num_reg = max(10, min(nx_or_ny, int(sqrt(nx*ny / TARGET_CANDIDATES))))
where TARGET_CANDIDATES=8, giving ~8 candidate quads per coarse box regardless of grid resolution.
Remove no longer used max_reg_list_num from quad_irreg_grid_coords
Fixes issue
Fixes #979
Types of changes
Documentation changes needed?
Tests
Please describe any tests you ran to verify your changes.
Looking at brute force vs. current method vs new method: https://github.com/hkershaw-brown/quad_search_perf
I've only created global test grids.
Quad_util_developer tests bitwise, but this are fairly limited.
Currently running ROMS_rutgers case to compare memory (still in queue)
Checklist for merging
Checklist for release
Testing Datasets