Add CDA-ETL module with initial implementation and basic functionality by RyanM-RMA · Pull Request #1732 · USACE/cwms-data-api

RyanM-RMA · 2026-05-15T17:19:37Z

Includes the addition of the following:

Dockerfile and docker-compose for containerizing the service.
Gradle build configuration for the module.
Core ETL pipeline components: configuration, session management, location, project, and timeseries processing.
Environment variable management with etl.env.example.
Utility functions and cache handling.
Initial set of unit tests for basic validation (e.g., configuration handling).

Summary

Add a module for Extract, Transform, and Load (ETL) between CDA APIs.

Related Issue

Closes https://jira.hecdev.net/browse/REGI-481

Validation

Tested by running the Gradle and docker-compose processes, then verifying that the loaded data is valid by running CWMSVue and REGI against it.

Checklist

AI tools used


adamkorynta · 2026-05-15T18:22:54Z

@@ -0,0 +1,14 @@
+services:


Why is this separate from the root docker-compose.yml? I'd think the defaults should use CWBI Test as the source and the CDA service container as the destination.

adamkorynta · 2026-05-15T18:25:52Z

+final def envFile = 'etl.env'
+final def reqFile = 'requirements.txt'
+
+tasks.register('installRequirements', Exec) {


Take a look at what Stephen did in https://github.com/DOI-BOR/WTMP-Python-Plotting/blob/main/build.gradle, using a Gradle plugin to manage Python.

We already do this for Node.js, so it makes sense to use a plugin for Python as well.

adamkorynta · 2026-05-15T18:31:34Z

+#  SOFTWARE.
+import cwms
+
+class SessionManager:


take a look at python's contextmanager as I think it will simplify the session management. See my regi-python PR as an example:

usage: https://github.com/USACE-WaterManagement/regi-python/pull/1/changes#diff-884cdfd74221e802652f52dace72c4e14e4c0d67ad1a57a2f8b02b4155362786R41

context definition: https://github.com/USACE-WaterManagement/regi-python/pull/1/changes#diff-48673334bb966021f118fc5fd7b8632e14b648abc64e10d3554f38060c0a65e8R23
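For readers without access to the linked PR, here is a minimal sketch of the pattern being suggested, not the regi-python code itself. The names `cda_session`, `current_root`, and the `_active` dictionary are hypothetical stand-ins; a real version would call into cwms-python where the sketch only swaps a value.

```python
from contextlib import contextmanager

# Hypothetical stand-in for whatever holds the active API connection.
# The real module would initialize a cwms-python session here instead.
_active = {"api_root": None}


@contextmanager
def cda_session(api_root):
    """Point the active session at api_root, restoring the previous one on exit."""
    previous = _active["api_root"]
    _active["api_root"] = api_root
    try:
        yield api_root
    finally:
        # Runs even if the body raises, so the prior session is always restored.
        _active["api_root"] = previous


def current_root():
    """Return the API root of the session that is currently active."""
    return _active["api_root"]
```

With this shape, the source and destination phases become two `with cda_session(...)` blocks, and there is no separate `use_source_session()` / `use_dest_session()` call to forget.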

krowvin · 2026-05-15T19:30:12Z

@Enovotny

Might be some ideas in here we can use with cwms-cli

I thought this was a novel idea: setting locations in the env for reuse.

I.e.
self.locations = os.getenv("LOCATIONS", "").split(",")
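One caveat with the one-liner above: when `LOCATIONS` is unset or empty, `"".split(",")` returns `[""]`, a list holding one empty string, not an empty list. A small sketch of a more defensive parse follows; `env_list` is a hypothetical helper name, not something in this PR.

```python
import os


def env_list(name, default=""):
    """Read a comma-separated environment variable as a list of trimmed, non-empty items."""
    raw = os.getenv(name, default)
    # Strip whitespace around each entry and drop blanks left by
    # trailing commas or an unset variable.
    return [item.strip() for item in raw.split(",") if item.strip()]
```

With `LOCATIONS=SWT.EUFA-Dam` this yields a one-element list, and with the variable unset it yields an empty list.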

- Split core functionality into modular components for improved clarity and maintainability, including separate processing for locations, projects, and timeseries.
- Introduced caching logic in `cache_util.py` for optimized data retrieval and storage.
- Added threading utilities for concurrent task execution in `threading_util.py`.
- Enhanced `SessionManager` logic with dynamic session initialization.
- Updated Gradle build to include a `runEtlUnitTests` task for streamlined testing.
- Improved environment variable examples in `etl.env.example`.
- Introduced comprehensive unit tests for locations, projects, and timeseries modules.
- Various bug fixes and restructured imports for consistency.

…s, projects, and timeseries
- Replaced inline processing calls with modular `cache_*` and `store_cached_*` workflows.
- Added caching and validation logic for projects, locations, and timeseries.
- Consolidated threading and retrieval utilities for improved reliability.
- Updated example environment variables and tests for modular workflows.
- Improved logging for debugging and transparency during execution.

adamkorynta · 2026-05-21T00:05:13Z

+LOG_LEVEL=INFO
+
+# Data retrieval
+LOCATIONS=SWT.EUFA-Dam


the list of data to load should not be defined by environment variables. Use an external config file (json, yml, etc)
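As a rough illustration of what that could look like, the sketch below reads the selection from a JSON file using only the standard library. The file layout, the `EtlSelection` class, and `load_selection` are all assumptions made for the example; a YAML file would work the same way with a YAML parser swapped in.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EtlSelection:
    """Hypothetical container for the data the ETL run should load."""
    locations: list = field(default_factory=list)
    projects: list = field(default_factory=list)
    timeseries: list = field(default_factory=list)


def load_selection(path):
    """Load the selection from a JSON file; keys that are absent default to empty lists."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return EtlSelection(
        locations=list(data.get("locations", [])),
        projects=list(data.get("projects", [])),
        timeseries=list(data.get("timeseries", [])),
    )
```

Connection settings such as URLs and log level can stay in `etl.env`, while the list of what to load lives in a file that is easier to review and diff.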

adamkorynta · 2026-05-21T00:06:10Z

+#
+
+# Required settings
+SOURCE_CDA_URL=https://cwms-data-test.cwbi.us/cwms-data/


Is the toggle to re-extract data from the source based on the existence of this env variable? If so, it isn't a required variable.
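If presence of the variable is in fact the toggle (an assumption; the thread does not confirm it), the check could be made explicit along these lines. `should_extract` is a hypothetical helper name.

```python
import os


def should_extract(environ=None):
    """Return True only when SOURCE_CDA_URL is set to a non-blank value."""
    env = os.environ if environ is None else environ
    # An absent or whitespace-only value means "skip extraction and
    # reuse the data already stored", so the variable is optional.
    return bool(env.get("SOURCE_CDA_URL", "").strip())
```

Under that reading, `etl.env.example` would list `SOURCE_CDA_URL` under optional settings.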

adamkorynta · 2026-05-21T00:07:42Z

+#  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+#  SOFTWARE.
+import pytest
+from unittest.mock import MagicMock, patch


I think maintaining mock data for unit tests at this layer is unnecessary maintenance overhead. Integration tests against the real destination data source are much more valuable.

adamkorynta · 2026-05-21T00:09:06Z

+        return
+
+    # Retrieval
+    cwms.api.API_VERSION = 1


what's with these api versions being set to 1 and 2?

This is actually a cwms-python bug that I was trying to work around. I've created an issue to track it. The code has been removed.

HydrologicEngineeringCenter/cwms-python#289

adamkorynta · 2026-05-21T00:11:06Z

+    timeseries.cache_timeseries(config.timeseries, config.start_time, config.end_time)
+
+    session_manager.use_dest_session()
+    # Store cached data, so we're not keeping it all in memory


this comment implies that it is in memory now - it should be on the file system at this point right, not in memory?

adamkorynta · 2026-05-21T00:20:14Z

+    session_manager.use_source_session()
+
+    # Read and cache data
+    location.cache_locations(config.locations)


we aren't caching data here, we're storing and preserving that data in source control

adamkorynta · 2026-05-21T00:22:41Z

+    for ts in timeseries:
+        splits = ts.split(".")
+        if len(splits) != 7:
+            logger.warning(f"Invalid time series identifier '{ts}' encountered.  Expected format is '[office_id].[location].[parameter].[parameter_type].[interval].[duration].[version]'")


why is office id on the time series id? we really don't need any validation here - CDA itself will validate the data requests

This will be resolved with the change to YAML based configuration.

adamkorynta · 2026-05-21T00:24:04Z

+    threading_util.execute_tasks(_store_one_ts_data, ts_info)
+
+
+def _retrieve_one_ts_identifier(ts_info):


I'm confused by this method, if cache_data exists, we just drop it on the floor?

adamkorynta · 2026-05-21T00:24:30Z

+    ts_id = ts_info[1]
+
+    cache_data = cache_util.get_from_cache(office_id, "Timeseries Identifiers", ts_id, "id")
+    cwms.store_timeseries_identifier(cache_data)


the store ts call will create the identifier, no need for a separate method call

adamkorynta reviewed May 15, 2026

RyanM-RMA added 2 commits May 15, 2026 20:03

RyanM-RMA marked this pull request as ready for review May 20, 2026 17:18

adamkorynta requested changes May 21, 2026

4 participants
