
Conversation

Contributor

@cwdunkerly cwdunkerly commented Dec 19, 2024

This branch consists of two updates:

  1. Implementing the EE high-volume endpoint API and pulling data directly through a getInfo() call. (Previously this required access to a storage bucket that files were saved to before being downloaded locally.)

There have been changes to the libraries used (multiprocessing was added), but I don't believe they require changes to the installation setup. It has been a while since this first task was done, so it might be worth double-checking.

I used this version of the code to download gridded data for over 5,000 points across 11 collections, and it has worked without issue, so I think it's time to merge this version into master.

  2. Re-adding precipitation as a valid option for comparison, and including precipitation on both the daily_comparison and monthly_comparison time series plots.

Note that the precipitation bias calculation (and monthly plot) uses average precipitation. If you would prefer deltas, let me know and I'll fix it.
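
For reference, a minimal pandas sketch of the two options, using a hypothetical daily frame with station and gridded precipitation columns (the column names are illustrative, not the actual gridwxcomp variable names):

import pandas as pd

# Hypothetical daily data; column names are illustrative only.
daily = pd.DataFrame(
    {'station_prcp': [0.0, 2.5, 1.0], 'gridded_prcp': [0.1, 2.0, 1.4]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

# Current behavior: monthly average precipitation for each source.
monthly_avg = daily.resample('MS').mean()

# Delta option: monthly average of the daily gridded-minus-station difference.
monthly_delta = (daily['gridded_prcp'] - daily['station_prcp']).resample('MS').mean()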

EDIT: I had to update the tests in a third push; I believe all tests are now passing.

ee_download.py now requests the time series data directly instead of going through cloud storage. Additionally, the downloads are now parallelized in a task pool that scales to use half of the user's CPU cores.

Removed all references to GCS.
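
For reviewers, this is roughly the pattern ee_download.py now follows: each station is handled by _download_point_data(), which reduces the image collection at the station point and pulls the whole time series back with a single getInfo() call, while download_grid_data() fans those calls out over a multiprocessing pool sized to half the CPU cores. This is a condensed sketch, not the module verbatim; the reducer choice, scale handling, and param_dict keys shown here are simplified assumptions.

import os
from multiprocessing import Pool

import ee
import pandas as pd

def _download_point_data(param_dict):
    # Reduce each image at the station point, then fetch the full time series
    # in one request instead of exporting files to a storage bucket.
    ic = ee.ImageCollection(param_dict['collection']).select(param_dict['bands'])
    point = ee.Geometry.Point(param_dict['lon'], param_dict['lat'])

    def _reduce_point_img(img):
        stats = img.reduceRegion(ee.Reducer.mean(), point, param_dict['scale'])
        return ee.Feature(None, stats).set('date', img.date().format('YYYY-MM-dd'))

    def _summary_feature_col(ftr):
        output_list = [ftr.get('date'), param_dict['station_name']] + [
            ftr.get(b) for b in param_dict['bands']]
        return ftr.set({'output': output_list})

    output_stats = (ee.FeatureCollection(ic.map(_reduce_point_img))
                    .map(_summary_feature_col))
    output_timeseries = output_stats.aggregate_array('output').getInfo()
    column_names = ['date', 'station_name'] + param_dict['bands']
    return pd.DataFrame(data=output_timeseries, columns=column_names)

def download_grid_data(iterable_list):
    # The real function takes metadata/config paths and builds iterable_list
    # itself; simplified here. Assumes ee.Initialize() has already been run
    # (e.g. inherited by workers via fork).
    thread_count = int(os.cpu_count() / 2)
    pool = Pool(thread_count)
    pool.map(_download_point_data, iterable_list)
    pool.close()
    pool.join()
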
@cwdunkerly cwdunkerly requested a review from JohnVolk December 19, 2024 05:45
@JohnVolk
Collaborator

@cwdunkerly Thanks for working on this and trying to improve the speed here!

I've tested locally and all the results look correct. After adding some minor changes to the docs, I will merge shortly.

Based on the test results, there are two things I would appreciate you considering and helping with; the first is the higher priority for now:

  1. Please change the monthly precipitation comparison plots to use monthly total precipitation as opposed to monthly averages (for scatter and time series plots).

  2. Maybe improve the error handling in the download script based on the following scenario I ran into:

When testing locally with the agera5 dataset, the download got about halfway through the stations (2 of the 4 stations' total data) in about 4 minutes per site, but then ran into an HTTP error. Afterwards I reran the download function and it completed the other two sites much more quickly (about a minute each) and successfully updated the metadata file with the paths to the paired gridded data, without any errors. Below is the traceback of the original error:

download_grid_data(
    merged_input_path,
    config_path=agera5_config,
    force_download=False)
gridwxcomp/my_tests/agera5/agera5_loa.csv downloaded in 4.25 minutes.

gridwxcomp/my_tests/agera5/agera5_castle_valley_near_moab.csv downloaded in 4.27 minutes.
---------------------------------------------------------------------------
HttpError                                 Traceback (most recent call last)
File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:402, in _execute_cloud_call(call, num_retries)
    401 try:
--> 402   return call.execute(num_retries=num_retries)
    403 except googleapiclient.errors.HttpError as e:

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/googleapiclient/_helpers.py:130, in positional.<locals>.positional_decorator.<locals>.positional_wrapper(*args, **kwargs)
    129         logger.warning(message)
--> 130 return wrapped(*args, **kwargs)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/googleapiclient/http.py:938, in HttpRequest.execute(self, http, num_retries)
    937 if resp.status >= 300:
--> 938     raise HttpError(resp, content, uri=self.uri)
    939 return self.postproc(resp, content)

HttpError: <HttpError 400 when requesting https://earthengine.googleapis.com/v1/projects/earthengine-legacy/value:compute?prettyPrint=false&alt=json returned "Computation timed out.". Details: "Computation timed out.">

During handling of the above exception, another exception occurred:

EEException                               Traceback (most recent call last)
Cell In[5], line 2
      1 ee.Initialize()
----> 2 download_grid_data(
      3     merged_input_path,
      4     config_path=agera5_config,
      5     force_download=False)

File ~/gridwxcomp/gridwxcomp/ee_download.py:189, in download_grid_data(metadata_path, config_path, local_folder, force_download)
    187 thread_count = int(os.cpu_count() / 2)
    188 pool = Pool(thread_count)
--> 189 pool.map(_download_point_data, iterable_list)
    190 pool.close()
    191 pool.join()

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize)
    362 def map(self, func, iterable, chunksize=None):
    363     '''
    364     Apply `func` to each element in `iterable`, collecting the results
    365     in a list that is returned.
    366     '''
--> 367     return self._map_async(func, iterable, mapstar, chunksize).get()

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:48, in mapstar(args)
     47 def mapstar(args):
---> 48     return list(map(*args))

File ~/gridwxcomp/gridwxcomp/ee_download.py:104, in _download_point_data(param_dict)
    100     return ftr.set({'output': output_list})
    102 output_stats = (ee.FeatureCollection(ic.map(_reduce_point_img))
    103                 .map(_summary_feature_col))
--> 104 output_timeseries = output_stats.aggregate_array('output').getInfo()
    105 column_names = ['date', 'station_name'] + bands
    106 output_df = pd.DataFrame(data=output_timeseries, columns=column_names)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/computedobject.py:107, in ComputedObject.getInfo(self)
    101 def getInfo(self) -> Optional[Any]:
    102   """Fetch and return information about this object.
    103 
    104   Returns:
    105     The object can evaluate to anything.
    106   """
--> 107   return data.computeValue(self)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:1107, in computeValue(obj)
   1104 body = {'expression': serializer.encode(obj, for_cloud_api=True)}
   1105 _maybe_populate_workload_tag(body)
-> 1107 return _execute_cloud_call(
   1108     _get_cloud_projects()
   1109     .value()
   1110     .compute(body=body, project=_get_projects_path(), prettyPrint=False)
   1111 )['result']

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:404, in _execute_cloud_call(call, num_retries)
    402   return call.execute(num_retries=num_retries)
    403 except googleapiclient.errors.HttpError as e:
--> 404   raise _translate_cloud_exception(e)

EEException: Computation timed out.

This looks like an API delay or overload based on the request. I wanted to put this on your radar: if it comes up often, we might want to reduce the load placed on the GEE API to prevent this, e.g. remove the multiprocessing call, put the request in a retry loop, or something else.
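
If a retry loop is the direction we go, one possible shape (a minimal sketch, assuming we only want to absorb transient EEExceptions like the timeout above; the helper name, attempt count, and backoff values are placeholders):

import time

import ee

def _getinfo_with_retry(computed_obj, max_attempts=3, backoff_seconds=30):
    # Hypothetical helper: retry a getInfo() call with a growing pause so a
    # transient "Computation timed out." does not fail the whole pool.map().
    for attempt in range(1, max_attempts + 1):
        try:
            return computed_obj.getInfo()
        except ee.ee_exception.EEException:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)

_download_point_data() could then call _getinfo_with_retry(output_stats.aggregate_array('output')) in place of the direct getInfo() call, and the pool size could also be reduced below half the cores if the timeouts persist.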

@JohnVolk JohnVolk merged commit 1e68f53 into master Dec 19, 2024
1 check passed