Skip to content

quad_utils_mod search: replace 3D scratch arrays with compressed sparse row format#1102

Draft
hkershaw-brown wants to merge 1 commit into
mainfrom
quad_search
Draft

quad_utils_mod search: replace 3D scratch arrays with compressed sparse row format#1102
hkershaw-brown wants to merge 1 commit into
mainfrom
quad_search

Conversation

@hkershaw-brown
Copy link
Copy Markdown
Member

@hkershaw-brown hkershaw-brown commented Apr 27, 2026

Description:

Quad_utils_mod box search for fully irregular grids:

  • Determine the search storage requirement at run time from the grid. Currently there are 3 hardcoded cases, leading to people adding "big" numbers to the module to get their code running vs erroring out. The hardcoded cases I believe are based on the POP workhorse and high res grids ~2015 (looking at POP/model_mod.f90 blame)
  • Replaces the quad_utils_mod search storage (3D arrays) for fully irregular grids with a compressed sparse row format.

The runtime calculation of storage requirements is needed in particular for regional CESM where people are creating their own regional grids, but is applicable to any model using quad_utils.
The per-code memory is quite high for the box storage (see #979 & https://github.com/hkershaw-brown/quad_search_perf/blob/main/README.md). This pull request does not address distributing memory for the search storage (may be needed later), but does reduce the per core memory.

Compressed Sparse Row format:
The GRID_QUAD_FULLY_IRREGULAR coarse-index search-info array setup previously allocated reg_list_lon/lat(nrx, nry, max_reg_list_num) as a 3D temporary, reaching ~5.5 GB at nrx=nry=900. It also errored out if any coarse box accumulated more than max_reg_list_num=800 candidate quads. Noted in issue #979

Replace with a two-pass compressed sparse row build:

  • Pass 1: count overlaps into grid_num; peak memory ~11 MB
  • Prefix sum: derive grid_start and exact total entry count
  • Pass 2: fill flat lists using a cursor array

Replace hardcoded num_reg with calculation based on grid
num_reg = max(10, min(nx_or_ny, int(sqrt(nx*ny / TARGET_CANDIDATES))))
where TARGET_CANDIDATES=8, giving ~8 candidate quads per coarse box regardless of grid resolution.

Remove no longer used max_reg_list_num from quad_irreg_grid_coords

Fixes issue

Fixes #979

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Documentation changes needed?

  • My change requires a change to the documentation.
    • I have updated the documentation accordingly.

Tests

Please describe any tests you ran to verify your changes.
Looking at brute force vs. current method vs new method: https://github.com/hkershaw-brown/quad_search_perf
I've only created global test grids.
Quad_util_developer tests bitwise, but this are fairly limited.

Currently running ROMS_rutgers case to compare memory (still in queue)

Checklist for merging

  • Updated changelog entry
  • Documentation updated
  • Update conf.py

Checklist for release

  • Merge into main
  • Create release from the main branch with appropriate tag
  • Delete feature-branch

Testing Datasets

  • Dataset needed for testing available upon request
  • Dataset download instructions included
  • No dataset needed

See #979

The GRID_QUAD_FULLY_IRREGULAR coarse-index build previously allocated
reg_list_lon/lat(nrx, nry, max_reg_list_num) as a 3D temporary, reaching
~5.5 GB at nrx=nry=900. It also errored out if any coarse box
accumulated more than max_reg_list_num=800 candidate quads.

Replace with a two-pass compressed sparse row build:
- Pass 1: count overlaps into grid_num; peak memory ~11 MB
- Prefix sum: derive grid_start and exact total entry count
- Pass 2: fill flat lists using a cursor array; no cap, no abort

Replace hardcoded num_reg with calculation based on grid
  num_reg = max(10, min(nx_or_ny, int(sqrt(nx*ny / TARGET_CANDIDATES))))
where TARGET_CANDIDATES=8, giving ~8 candidate quads per coarse box
regardless of grid resolution.

Remove no longer used max_reg_list_num from quad_irreg_grid_coords
@hkershaw-brown
Copy link
Copy Markdown
Member Author

Q. global vs. regional.
if you initialize the interpolation as global, but the grid is regional how does this affect things?

wrap where there is no wrap. (mad wrapping of region around globe)
global, sparse -> slow search
global, not sparse, too many points in a box. -> error out or slow search

How bad is it to have mad-wrap? <-- -->

! for a global grid, the initial values have already been set in
! the derived type. otherwise, find the min/max of lons and lats.
if (.not. h%opt%global_grid) then
h%ii%min_lon = minval(h%ii%lons_2d)
h%ii%max_lon = maxval(h%ii%lons_2d)
h%ii%lon_width = h%ii%max_lon - h%ii%min_lon ! FIXME: wrap?

@hkershaw-brown hkershaw-brown marked this pull request as draft April 28, 2026 12:41
@hkershaw-brown
Copy link
Copy Markdown
Member Author

or regional refinements in global grids:

Screenshot 2026-04-28 at 9 59 45 AM Screenshot 2026-04-28 at 9 59 41 AM

@hkershaw-brown
Copy link
Copy Markdown
Member Author

@mgharamti this is the quad search I am playing with. I think the main concerns are calculating the course grid from the actual grid rather than the hardcoded 3 cases, and the search structure size. But open to any input.

!> Build the data structure for interpolation for an irregular quad grid
subroutine init_irreg_interp(h)
type(quad_interp_handle), intent(inout) :: h
character(len=*), parameter :: routine = 'init_irreg_interp'
! Need a temporary data structure to build this.
! These arrays keep a list of the x and y indices of dipole quads
! that potentially overlap the regular boxes.
integer, allocatable :: reg_list_lon(:,:,:)
integer, allocatable :: reg_list_lat(:,:,:)
real(r8) :: u_c_lons(4), u_c_lats(4), pole_row_lon
integer :: i, j, k, pindex, nx, ny, nrx, nry, istatus
integer :: reg_lon_ind(2), reg_lat_ind(2), u_total, u_index
logical :: cyclic, pole
integer :: xlim
allocate(reg_list_lon(h%ii%num_reg_x, h%ii%num_reg_y, h%ii%max_reg_list_num))
allocate(reg_list_lat(h%ii%num_reg_x, h%ii%num_reg_y, h%ii%max_reg_list_num))

Copy link
Copy Markdown
Contributor

@mgharamti mgharamti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hkershaw-brown, thanks for the work on this. I was always on the edge with the previous style of 3 hard-coded cases but didn't know how to go around that. I think I understand your design and strategy. CSR clearly avoids working with huge 3D arrays.

My first lazy question is why the 8 targets on average? I guess how did you choose it? It seems to me that search cost per obs is ~O(num_grids) and number of bins/boxes is ~O(N/TARGET_CANDIDATES) and so it's a performance tuning parameter. At least we need to explain it in the docs if you decide to go with it.

Second, could we add a post-fill consistency check that verifies fill_pos(i,j) == grid_start(i,j) + grid_num(i,j) for every coarse bin? This would catch any future divergence between the counting pass and filling pass, especially around edge cases such as longitude wrapping, masked/invalid quads, or boundary-touching cells.

endif

deallocate(reg_list_lon, reg_list_lat)
deallocate(fill_pos)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new new approach is based on the assumption (fact?) that what we counted in the first pass is equal to what we filled in the second pass. I'd be more comfortable if we added a sanity check after filling (similar to the check in the old code). You could do something like this:

do i = 1, nrx
   do j = 1, nry
      if (fill_pos(i,j) /= h%ii%grid_start(i,j) + h%ii%grid_num(i,j)) then
         call error_handler(E_ERR, 'init_irreg_interp', &
              'count/fill mismatch in irregular grid interpolation lists')
      endif
   enddo
enddo

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great idea, & the check is in init so not a problem runtime wise to add.

Comment on lines 800 to 803
call get_quad_corners(h%ii%lons_2d, i, j, cyclic, pole, nx, ny, u_c_lons, istatus)
if (istatus /= 0) print *, 'get_quad_corners for lons returns failure'

call get_quad_corners(h%ii%lats_2d, i, j, cyclic, pole, nx, ny, u_c_lats, istatus)
if (istatus /= 0) print *, 'get_quad_corners for lats returns failure'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to cycle here rather than just print an error? If a failure produces bad corners, both passes may count/fill bad bins or one failure path could behave differently depending on the state (right?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a closer look - I think I am missing the logic in the original code.

The loop limits are the x,y bounds of the lon and lat arrays.
Is the print just checking if the loop limits are wrong?

If you can go off the end of the arrays,
it seems like both should fail, or both should pass.

If one fails and one passes I think this should be a catastrophic error rather than a print, since lon,lat are for the same point.

@hkershaw-brown
Copy link
Copy Markdown
Member Author

My first lazy question is why the 8 targets on average? I guess how did you choose it? It seems to me that search cost per obs is ~O(num_grids) and number of bins/boxes is ~O(N/TARGET_CANDIDATES) and so it's a performance tuning parameter. At least we need to explain it in the docs if you decide to go with it.

My lazy choice: Dr. Google for what was a good choice for this kind of algorithm.
I was looking at brute force (best memory, but the worst case timing ~50,000 slower) vs. current hard coded method (max memory, target is to go as fast as that).
#979 (comment).

I think the equivalent average target (nx*ny/num_reg*num_reg)is ~10ish for the hard coded versions for the POP grids:

grid nx ny num_reg average quads per coarse box
gx1 (3deg?) 320 384 90 15.17037037
tx0.5 0.5deg 720 480 180 10.66666667
tx0.1 0.1 deg 3600 2400 900 10.66666667

The grids are not uniform so 'average' is probably a disgusting assumption.

I think it is well worth doing some runs with the test code and different (and variable) grid resolutions to count the actual number of quads per box with the new code (and counting the per course box with the 3 hardcoded POP cases). Maybe we have 799 quads in one box, 3 in another. Currently the 8 is just an ok choice where I was finding comparable speeds to the current version, but it is definitely a magic number that needs justifying.

h%ii%grid_num = 0

do i = 1, xlim
! There's no wraparound in y, one box less than grid boundaries
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good comment, put back in @hkershaw-brown

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can you use quad_utils with a periodic y domain? e.g. wrf idealized. Not at the moment I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

quad_utils max_reg_list_num

2 participants