Add CDA-ETL module with initial implementation and basic functionality#1732
RyanM-RMA wants to merge 3 commits into
Includes the addition of the following:
- Dockerfile and docker-compose for containerizing the service.
- Gradle build configuration for the module.
- Core ETL pipeline components: configuration, session management, location, project, and timeseries processing.
- Environment variable management with `etl.env.example`.
- Utility functions and cache handling.
- Initial set of unit tests for basic validation (e.g., configuration handling).
    @@ -0,0 +1,14 @@
    services:
Why is this separate from the root docker-compose.yml? I'd think the defaults should be CWBI Test as the source and the CDA service container as the destination.
    final def envFile = 'etl.env'
    final def reqFile = 'requirements.txt'

    tasks.register('installRequirements', Exec) {
Take a look at what Stephen did on: https://github.com/DOI-BOR/WTMP-Python-Plotting/blob/main/build.gradle using a gradle plugin to manage python.
which we already do for Node JS. So makes sense to use a plugin for python as well.
    # SOFTWARE.
    import cwms

    class SessionManager:
take a look at python's contextmanager as I think it will simplify the session management. See my regi-python PR as an example:
- usage: https://github.com/USACE-WaterManagement/regi-python/pull/1/changes#diff-884cdfd74221e802652f52dace72c4e14e4c0d67ad1a57a2f8b02b4155362786R41
- context definition: https://github.com/USACE-WaterManagement/regi-python/pull/1/changes#diff-48673334bb966021f118fc5fd7b8632e14b648abc64e10d3554f38060c0a65e8R23
Might be some ideas in here we can use with cwms-cli. I thought this was a novel idea: setting locations in the env for reuse, i.e.
- Split core functionality into modular components for improved clarity and maintainability, including separate processing for locations, projects, and timeseries.
- Introduced caching logic in `cache_util.py` for optimized data retrieval and storage.
- Added threading utilities for concurrent task execution in `threading_util.py`.
- Enhanced `SessionManager` logic with dynamic session initialization.
- Updated Gradle build to include a `runEtlUnitTests` task for streamlined testing.
- Improved environment variable examples in `etl.env.example`.
- Introduced comprehensive unit tests for locations, projects, and timeseries modules.
- Various bug fixes and restructured imports for consistency.
…s, projects, and timeseries
- Replaced inline processing calls with modular `cache_*` and `store_cached_*` workflows.
- Added caching and validation logic for projects, locations, and timeseries.
- Consolidated threading and retrieval utilities for improved reliability.
- Updated example environment variables and tests for modular workflows.
- Improved logging for debugging and transparency during execution.
    LOG_LEVEL=INFO

    # Data retrieval
    LOCATIONS=SWT.EUFA-Dam
the list of data to load should not be defined by environment variables. Use an external config file (json, yml, etc)
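One possible shape for such a file is sketched below. It uses JSON only because the Python standard library can parse it without extra dependencies; the layout (keyed by office id), the `OfficeSelection` class, the function names, and the example timeseries id are all assumptions made for illustration, not part of this PR.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class OfficeSelection:
    """Data to copy for one office (hypothetical structure)."""
    locations: list = field(default_factory=list)
    timeseries: list = field(default_factory=list)


def parse_selection(text):
    """Parse a JSON document keyed by office id into OfficeSelection objects."""
    raw = json.loads(text)
    return {
        office: OfficeSelection(
            locations=list(entry.get("locations", [])),
            timeseries=list(entry.get("timeseries", [])),
        )
        for office, entry in raw.items()
    }


def load_selection(path):
    """Read the selection file from disk."""
    return parse_selection(Path(path).read_text(encoding="utf-8"))


# Hypothetical file contents; the timeseries id is made up for the example.
EXAMPLE = """
{
  "SWT": {
    "locations": ["EUFA-Dam"],
    "timeseries": ["EUFA-Dam.Elev.Inst.1Hour.0.Example-Version"]
  }
}
"""

selection = parse_selection(EXAMPLE)
print(selection["SWT"].locations)
```

Keying the file by office keeps the office id out of each individual identifier, and the env file is left holding only connection settings.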
    #

    # Required settings
    SOURCE_CDA_URL=https://cwms-data-test.cwbi.us/cwms-data/
Is the toggle to re-extract data from source based on the existence of this env variable? if so, it isn't a required variable
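If the answer is yes, the variable could be read as optional along these lines. This is only a sketch of that assumption: `resolve_source_url` and `should_extract` are hypothetical names, and the actual toggle in the module may work differently.

```python
import os


def resolve_source_url(env=None):
    """Return the source CDA URL, or None when it is unset or blank."""
    env = os.environ if env is None else env
    url = env.get("SOURCE_CDA_URL", "").strip()
    return url or None


def should_extract(env=None):
    """Re-extract from the source only when a source URL is configured."""
    return resolve_source_url(env) is not None
```

With this reading, `SOURCE_CDA_URL` would belong under the optional settings in `etl.env.example`.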
    # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    # SOFTWARE.
    import pytest
    from unittest.mock import MagicMock, patch
I think maintaining mock data for unit tests at this layer is unnecessary maintenance overhead. Integration tests against the real destination data source are much more valuable.
    return

    # Retrieval
    cwms.api.API_VERSION = 1
what's with these api versions being set to 1 and 2?
This is actually a cwms-python bug that I was trying to work around. I've created an issue to track it. The code has been removed.
    timeseries.cache_timeseries(config.timeseries, config.start_time, config.end_time)

    session_manager.use_dest_session()
    # Store cached data, so we're not keeping it all in memory
This comment implies that the data is in memory now. It should be on the file system at this point, not in memory, right?
    session_manager.use_source_session()

    # Read and cache data
    location.cache_locations(config.locations)
we aren't caching data here, we're storing and preserving that data in source control
    for ts in timeseries:
        splits = ts.split(".")
        if len(splits) != 7:
            logger.warning(f"Invalid time series identifier '{ts}' encountered. Expected format is '[office_id].[location].[parameter].[parameter_type].[interval].[duration].[version]'")
why is office id on the time series id? we really don't need any validation here - CDA itself will validate the data requests
This will be resolved with the change to YAML based configuration.
    threading_util.execute_tasks(_store_one_ts_data, ts_info)


    def _retrieve_one_ts_identifier(ts_info):
I'm confused by this method. If cache_data exists, do we just drop it on the floor?
    ts_id = ts_info[1]

    cache_data = cache_util.get_from_cache(office_id, "Timeseries Identifiers", ts_id, "id")
    cwms.store_timeseries_identifier(cache_data)
the store ts call will create the identifier, no need for a separate method call
Summary
Add a module for Extract, Transform, and Load (ETL) between CDA APIs.
Related Issue
Closes https://jira.hecdev.net/browse/REGI-481
Validation
Tested by running the Gradle and docker-compose processes and verifying that the data is valid by running CWMSVue and REGI.
Checklist