Merge EE high endpoint branch #29
Conversation
ee_download.py now requests the timeseries data directly instead of going through cloud storage. Additionally, downloading is now parallelized in a task pool that scales to use half of the user's CPU cores. Removed all references to GCS.
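The parallelization described above can be sketched roughly as follows. This is a minimal illustration, not the actual module: `_download_point_data` here is a stand-in for the real worker in `ee_download.py`, and `download_all` is a hypothetical driver; only the pool-sizing logic (`os.cpu_count() / 2`, as seen in the traceback below) mirrors the PR.

```python
import os
from multiprocessing import Pool


def _download_point_data(param_dict):
    # Stand-in worker: in the real module this requests one station's
    # timeseries from the Earth Engine API and writes it to CSV.
    # Here we just echo the station name for illustration.
    return param_dict["station"]


def download_all(iterable_list):
    # Size the pool to half the available CPU cores, as described in
    # the PR (with a floor of one worker).
    thread_count = max(1, os.cpu_count() // 2)
    with Pool(thread_count) as pool:
        return pool.map(_download_point_data, iterable_list)


if __name__ == "__main__":
    stations = [{"station": f"site_{i}"} for i in range(4)]
    download_all(stations)
```

Using a `with Pool(...)` context manager handles the `close()`/`join()` calls automatically when the block exits.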
@cwdunkerly Thanks for working on this and trying to improve the speed here! I've tested locally and all the results look correct. After adding some minor changes to the docs I will merge shortly. There are two things I would appreciate your help with based on the test results; the first one is the higher priority for now:
When testing locally with the agera5 dataset, it got about halfway through the stations (downloaded two of the four stations' data, at about 4 minutes per site) but then ran into an HTTP error. Afterwards I reran the download function and it completed the other two sites much more quickly (about a minute each) and successfully updated the metadata file with the paths to the paired gridded data without any errors. Below is the traceback of the original error:

```
download_grid_data(
    merged_input_path,
    config_path=agera5_config,
    force_download=False)

gridwxcomp/my_tests/agera5/agera5_loa.csv downloaded in 4.25 minutes.
gridwxcomp/my_tests/agera5/agera5_castle_valley_near_moab.csv downloaded in 4.27 minutes.
---------------------------------------------------------------------------
HttpError                                 Traceback (most recent call last)
File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:402, in _execute_cloud_call(call, num_retries)
    401 try:
--> 402   return call.execute(num_retries=num_retries)
    403 except googleapiclient.errors.HttpError as e:

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/googleapiclient/_helpers.py:130, in positional.<locals>.positional_decorator.<locals>.positional_wrapper(*args, **kwargs)
    129     logger.warning(message)
--> 130 return wrapped(*args, **kwargs)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/googleapiclient/http.py:938, in HttpRequest.execute(self, http, num_retries)
    937 if resp.status >= 300:
--> 938     raise HttpError(resp, content, uri=self.uri)
    939 return self.postproc(resp, content)

HttpError: <HttpError 400 when requesting https://earthengine.googleapis.com/v1/projects/earthengine-legacy/value:compute?prettyPrint=false&alt=json returned "Computation timed out.". Details: "Computation timed out.">

During handling of the above exception, another exception occurred:

EEException                               Traceback (most recent call last)
Cell In[5], line 2
      1 ee.Initialize()
----> 2 download_grid_data(
      3     merged_input_path,
      4     config_path=agera5_config,
      5     force_download=False)

File ~/gridwxcomp/gridwxcomp/ee_download.py:189, in download_grid_data(metadata_path, config_path, local_folder, force_download)
    187 thread_count = int(os.cpu_count() / 2)
    188 pool = Pool(thread_count)
--> 189 pool.map(_download_point_data, iterable_list)
    190 pool.close()
    191 pool.join()

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize)
    362 def map(self, func, iterable, chunksize=None):
    363     '''
    364     Apply `func` to each element in `iterable`, collecting the results
    365     in a list that is returned.
    366     '''
--> 367     return self._map_async(func, iterable, mapstar, chunksize).get()

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/multiprocessing/pool.py:48, in mapstar(args)
     47 def mapstar(args):
---> 48     return list(map(*args))

File ~/gridwxcomp/gridwxcomp/ee_download.py:104, in _download_point_data(param_dict)
    100     return ftr.set({'output': output_list})
    102 output_stats = (ee.FeatureCollection(ic.map(_reduce_point_img))
    103     .map(_summary_feature_col))
--> 104 output_timeseries = output_stats.aggregate_array('output').getInfo()
    105 column_names = ['date', 'station_name'] + bands
    106 output_df = pd.DataFrame(data=output_timeseries, columns=column_names)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/computedobject.py:107, in ComputedObject.getInfo(self)
    101 def getInfo(self) -> Optional[Any]:
    102     """Fetch and return information about this object.
    103
    104     Returns:
    105       The object can evaluate to anything.
    106     """
--> 107     return data.computeValue(self)

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:1107, in computeValue(obj)
   1104 body = {'expression': serializer.encode(obj, for_cloud_api=True)}
   1105 _maybe_populate_workload_tag(body)
-> 1107 return _execute_cloud_call(
   1108     _get_cloud_projects()
   1109     .value()
   1110     .compute(body=body, project=_get_projects_path(), prettyPrint=False)
   1111 )['result']

File ~/anaconda3/envs/gridwxcomp_eedev/lib/python3.11/site-packages/ee/data.py:404, in _execute_cloud_call(call, num_retries)
    402     return call.execute(num_retries=num_retries)
    403 except googleapiclient.errors.HttpError as e:
--> 404     raise _translate_cloud_exception(e)

EEException: Computation timed out.
```

This looks like an API delay or overload given the size of the request. I wanted to put this on your radar: if it comes up often, we might want to reduce the load placed on the GEE API to prevent it, e.g. remove the multiprocessing call, put the request in a retry loop, or something else.
This branch consists of two updates:
There have been changes to the libraries used (multiprocessing was added), but since multiprocessing is part of the Python standard library, I don't believe this requires changes to the installation setup. It has been a while since that first task was done, so it might be worth double-checking.
I used this version of the code to download gridded data for over 5000 points across 11 collections and it has worked without issue, so I think it's time to merge this version into master.
Note that the precipitation bias calculation (and the monthly plot) uses average precipitation. If you would prefer this to be deltas, let me know and I'll fix it.
EDIT: I had to update the tests on a third push; I believe all tests are now passing.