fix(package): Pass credentials via env vars to dataset_manager container.#1239
Conversation
WalkthroughAdds environment-based database credentials handling. The native dataset manager now loads DB credentials from environment variables during config loading. The containerised dataset manager propagates DB username/password to containers via a new extra_env_vars argument, updating the generate_container_start_cmd API to accept this fourth parameter. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant NativeDatasetManager as Native DatasetManager
participant Config
participant Env as Environment
participant Logger
User->>NativeDatasetManager: run
NativeDatasetManager->>Config: load()
Config-->>NativeDatasetManager: config loaded
NativeDatasetManager->>Config: validate_logs_dir()
NativeDatasetManager->>Env: database.load_credentials_from_env()
alt credentials load fails
NativeDatasetManager->>Logger: log error
NativeDatasetManager-->>User: exit -1
else success
NativeDatasetManager-->>User: continue execution
end
sequenceDiagram
autonumber
participant User
participant DatasetManager as Container DatasetManager
participant Config
participant General as general.generate_container_start_cmd
participant Container
User->>DatasetManager: run
DatasetManager->>Config: read clp_config.database.{username,password}
DatasetManager->>General: generate_container_start_cmd(name, mounts, exec_container, extra_env_vars)
note right of General: extra_env_vars includes DB user/pass env vars
General-->>DatasetManager: start command
DatasetManager->>Container: start with env propagated
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1)
220-229: Make credential-loading failures explicit rather than folding into “Failed to load config.”load_credentials_from_env() can raise ValueError; surfacing that distinctly improves debuggability and avoids conflating config parsing errors with missing credentials.
Apply this diff to separate the error paths:
@@ - # Validate and load config file - config_file_path = Path(parsed_args.config) - try: - clp_config = load_config_file(config_file_path, default_config_file_path, clp_home) - clp_config.validate_logs_dir() - clp_config.database.load_credentials_from_env() - except: - logger.exception("Failed to load config.") - return -1 + # Validate and load config file + config_file_path = Path(parsed_args.config) + try: + clp_config = load_config_file(config_file_path, default_config_file_path, clp_home) + clp_config.validate_logs_dir() + except Exception: + logger.exception("Failed to load config.") + return -1 + + try: + clp_config.database.load_credentials_from_env() + except ValueError as e: + logger.error(f"Database credentials are not set in the environment: {e}") + return -1 + except Exception: + logger.exception("Failed to load database credentials from environment.") + return -1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.pycomponents/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
📚 Learning: 2025-07-03T20:10:43.789Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
load_credentials_from_env(264-269)load_credentials_from_env(375-379)load_credentials_from_env(471-476)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
components/clp-package-utils/clp_package_utils/general.py (1)
generate_container_start_cmd(362-403)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: conventional-commits
🔇 Additional comments (2)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (2)
10-11: LGTM: import of DB credential env var names aligns container env with native loader.This makes the wrapper’s env variable names match what the native script expects.
151-151: Signature migration complete – all generate_container_start_cmd calls now use four arguments
Verified across scripts (search.py, decompress.py, archive_manager.py, compress.py, dataset_manager.py); no legacy 3-arg invocations remain.
| extra_env_vars = { | ||
| CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username, | ||
| CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password, | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Avoid leaking DB credentials via command line and logs
You’re embedding DB credentials into the docker CLI as -e KEY=VALUE. This has two issues:
- Secrets are exposed in process listings (ps, /proc) on the host while docker runs.
- The debug log below prints the full docker command, which will include the secrets.
At minimum, sanitise the debug output to prevent secret leakage in logs.
Minimal fix: mask sensitive env vars in the logged command. Add a sanitiser and use it where we log the docker command.
Add this helper near the top of the file (or a shared utility), and import Set from typing:
from typing import Set
SENSITIVE_ENV_KEYS: Set[str] = {
CLP_DB_USER_ENV_VAR_NAME,
CLP_DB_PASS_ENV_VAR_NAME,
"AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY",
"AWS_SESSION_TOKEN",
"AWS_SECURITY_TOKEN",
}
def _sanitize_docker_cmd_for_logging(cmd: List[str], sensitive_keys: Set[str]) -> List[str]:
sanitized: List[str] = []
i = 0
while i < len(cmd):
token = cmd[i]
if token == "-e" and i + 1 < len(cmd):
key_value = cmd[i + 1]
key, sep, value = key_value.partition("=")
if sep and key in sensitive_keys:
sanitized.extend(["-e", f"{key}=***"])
else:
sanitized.extend(["-e", key_value])
i += 2
continue
sanitized.append(token)
i += 1
return sanitizedThen replace the debug log:
sanitized_cmd = _sanitize_docker_cmd_for_logging(cmd, SENSITIVE_ENV_KEYS)
logger.debug(f"Docker command failed: {' '.join(sanitized_cmd)}")Stronger alternative (recommended): don’t put secret values on the CLI at all. Instead:
- Pass “-e KEY” (no value) to docker and provide the value via subprocess.run(..., env=...).
- Or use --env-file with a temp file (0600) to avoid both CLI and parent-env exposure.
This requires adjusting generate_container_start_cmd and call-sites to support either env inheritance or env-file, but eliminates the process-listing risk.
🤖 Prompt for AI Agents
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py around
lines 146-149: the code builds docker CLI args with -e KEY=VALUE and logs the
full command, leaking secrets; add a sanitizer and use it when logging. Import
Set from typing, define SENSITIVE_ENV_KEYS containing CLP_DB_USER_ENV_VAR_NAME,
CLP_DB_PASS_ENV_VAR_NAME and common AWS keys, add the helper
_sanitize_docker_cmd_for_logging(cmd: List[str], sensitive_keys: Set[str]) ->
List[str] (as described in the review) to mask values for -e KEY=VALUE entries,
then replace the debug log that prints the docker command with a sanitized
version (sanitized_cmd = _sanitize_docker_cmd_for_logging(cmd,
SENSITIVE_ENV_KEYS) and log ' '.join(sanitized_cmd)). Optionally (recommended)
change the approach to avoid passing secret values on the CLI at all by using
subprocess env or --env-file with restrictive permissions instead of -e
KEY=VALUE.
There was a problem hiding this comment.
as suggested by Haiqi offline, this is worth fixing. @coderabbitai please help create an issue. do not link to PR #1239 though because it's not directly related
There was a problem hiding this comment.
🧩 Analysis chain
I'll create a GitHub issue for this security concern right away.
@junhaoliao I've created a GitHub issue documenting the database credential security vulnerability. The issue includes both the immediate fix (command sanitisation for logs) and the recommended stronger approach (avoiding CLI credential passing entirely). This will help track the security improvement separately from any specific PR.
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue for the DB credential security vulnerability
gh issue create \
--title "Security: Database credentials exposed in Docker command line and debug logs" \
--assignee junhaoliao \
--body "## Problem
Database credentials are currently being passed to Docker containers using `-e KEY=VALUE` arguments, which creates two security vulnerabilities:
1. **Process listing exposure**: Secrets are visible in process listings (ps, /proc) on the host while Docker runs
2. **Debug log leakage**: The debug logging prints the full Docker command, including the credential values
This affects the dataset manager and potentially other container operations that handle database credentials.
## Current Vulnerable Code Pattern
```python
extra_env_vars = {
CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
}
container_start_cmd = generate_container_start_cmd(
container_name, necessary_mounts, clp_config.execution_container, extra_env_vars
)
```
And later in debug logging:
```python
logger.debug(f\"Docker command failed: {' '.join(cmd)}\")
```
## Suggested Solutions
### Minimal Fix (Immediate)
Add command sanitisation for debug logging to prevent credential leakage in logs:
1. Create a sanitiser function that masks sensitive environment variables
2. Include common sensitive keys: database credentials and AWS keys
3. Use the sanitiser when logging Docker commands
### Recommended Fix (Stronger Security)
Avoid passing secret values on the command line entirely:
- Use subprocess.run(..., env=...) to pass environment variables through the process environment
- Or use Docker's --env-file with temporary files (0600 permissions)
- This eliminates both CLI and parent-environment exposure risks
## Impact
- **Severity**: Medium-High (credential exposure)
- **Scope**: Any containerised operations handling database credentials
- **Risk**: Secrets visible in process monitoring, logs, and command history
## Acceptance Criteria
- [ ] Database credentials are not visible in process listings
- [ ] Debug logs do not contain credential values
- [ ] Solution works across all containerised database operations
- [ ] Backwards compatibility maintained"
Length of output: 2999
Perfect! I've successfully created GitHub issue #1240 for the database credential security vulnerability. The issue has been created with a comprehensive description that includes:
- Clear problem statement about the security risks
- Current vulnerable code patterns
- Both immediate (log sanitisation) and recommended (avoid CLI secrets) solutions
- Impact assessment and acceptance criteria
- Assignment to you as requested
The issue is now available at: #1240
This will help track the security improvement work separately and ensure the credential exposure vulnerability gets addressed properly across all containerised database operations in the codebase.
🧬 Code Graph Analysis Results
-
File: components/clp-py-utils/clp_py_utils/clp_config.py (lines 96-98)
Content:class StorageEngine(KebabCaseStrEnum): CLP = auto() CLP_S = auto() -
File: components/clp-py-utils/clp_py_utils/clp_config.py (lines 107-109)
Content:class StorageType(LowercaseStrEnum): FS = auto() S3 = auto() -
File: components/clp-py-utils/clp_py_utils/clp_config.py (lines 874-878)
Content:def validate_logs_dir(self): try: validate_path_could_be_dir(self.logs_directory) except ValueError as ex: raise ValueError(f"logs_directory is invalid: {ex}") -
File: components/clp-py-utils/clp_py_utils/clp_config.py (lines 226-247)
Content:def get_clp_connection_params_and_type(self, disable_localhost_socket_connection: bool = False): self.ensure_credentials_loaded() host = self.host if disable_localhost_socket_connection and "localhost" == self.host: host = "127.0.0.1" connection_params_and_type = { # NOTE: clp-core does not distinguish between mysql and mariadb "type": "mysql", "host": host, "port": self.port, "username": self.username, "password": self.password, "name": self.name, "table_prefix": CLP_METADATA_TABLE_PREFIX, "compress": self.compress, "autocommit": self.auto_commit, } if self.ssl_cert: connection_params_and_type["ssl_cert"] = self.ssl_cert return connection_params_and_type -
File: components/clp-package-utils/clp_package_utils/general.py (lines 329-346)
Content (docstring-style summary):
"""
Writes the given container config to the logs directory so that it is accessible in the container.-
Parameters:
- container_clp_config: CLPConfig to dump for the container.
- clp_config: Host CLPConfig (used to determine logs directory path).
- config_filename: Name for the config file within both host/container logs dirs.
-
Returns:
- (config_file_path_on_container, config_file_path_on_host)
-
Notes:
- Serializes container_clp_config to YAML and writes to host path, returning corresponding container path.
"""
- Serializes container_clp_config to YAML and writes to host path, returning corresponding container path.
(No code block included to keep concise; full implementation provided in file.)
-
-
File: components/clp-package-utils/clp_package_utils/general.py (lines 325-326)
Content:def get_container_config_filename(container_name: str) -> str: return f".{container_name}-config.yml" -
File: components/clp-package-utils/clp_package_utils/general.py (lines 128-133)
Content:def generate_container_name(job_type: str) -> str: """ :param job_type: :return: A unique container name for the given job type. """ return f"clp-{job_type}-{str(uuid.uuid4())[-4:]}" -
File: components/clp-package-utils/clp_package_utils/general.py (lines 229-310)
Content (docstring-style summary):
"""
generate_container_config(clp_config: CLPConfig, clp_home: pathlib.Path) -> Tuple[CLPConfig, CLPDockerMounts]-
Purpose:
Create a container-ready copy of the CLP config and assemble the docker mounts
mapping host paths into the container, conditionally mounting inputs/outputs
based on the configuration (e.g., FS inputs, archives, streams, AWS config). -
Returns:
- container_clp_config: CLPConfig copy with container-relative paths.
- docker_mounts: CLPDockerMounts with all mount specifications (some may be None).
-
Behavior details:
- If logs_input.type is FS, bind-mmount logs_input.directory into the container.
- Bind-mount data_dir, logs_dir, archives_output_dir, and stream_output as needed.
- Bind the shared config file if present.
- If aws_config_directory is set, bind-mount it and set container path accordingly.
"""
(No code block included to keep concise; full implementation provided in file.)
-
-
File: components/clp-package-utils/clp_package_utils/general.py (lines 362-403)
Content (docstring-style summary):
"""
generate_container_start_cmd(container_name: str,
container_mounts: List[Optional[DockerMount]],
container_image: str,
extra_env_vars: Optional[Dict[str, str]] = None) -> List[str]-
Purpose:
Build the docker run command with mounts, environment variables, user, networking, etc. -
Returns:
- A list of command tokens to run the container with the specified mounts and env vars.
-
Important details:
- Sets PYTHONPATH to the container site-packages path.
- Adds -e entries for all env vars (including extras).
- Adds mounts for any non-None DockerMounts.
"""
(No code block included to keep concise; full implementation provided in file.)
-
-
File: components/clp-py-utils/clp_py_utils/s3_utils.py (lines 101-177)
Content (docstring-style summary):
"""
generate_container_auth_options(clp_config: CLPConfig, container_type: str) -> Tuple[bool, List[str]]-
Purpose:
Determine whether an AWS config mount and/or environment variables should be supplied
to a container based on storage configurations and the container type. -
Returns:
- (config_mount, credentials_env_vars)
-
Behavior details:
- For certain components, enable AWS config dir mount if using S3 with profile auth.
- If using env_vars authentication, collect AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (and optionally
AWS_SESSION_TOKEN is rejected as unsupported in env var mode).
-
Raises:
- ValueError if container_type is invalid.
"""
- ValueError if container_type is invalid.
(No code block included to keep concise; full implementation provided in file.)
-
-
File: components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (lines 32-73)
Content (docstring-style summary):
"""
main(argv):- Parses --config and --storage-engine
- Loads CLP config and credentials
- Creates metadata tables (or CLP_S datasets) in the metadata DB by invoking
internal SQL adapters and subprocesses to initialize databases. - Returns 0 on success, -1 on failure.
"""
(No code block included to keep concise; full implementation provided in file.)
-
File: components/clp-py-utils/clp_py_utils/load_config_helpers (not shown in snippets)
Note: The following function is used by dataset_manager.py:- load_config_file(config_file_path, default_config_file_path, clp_home)
Content (brief summary of behavior, not full snippet):
- If the given config file exists, loads and parses CLPConfig from YAML.
- If not, falls back to default config path and builds a new CLPConfig.
- Adjusts paths to be absolute based on CLP_HOME, validates data/logs paths, and adjusts per-host directories.
-
File: components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (lines 167-242)
Content (docstring-style summary):
"""
Native dataset_manager main logic (delegated to containerized run).- Sets up CLI for list/delete with config loading, validation, and then executes
dataset-management logic inside a container. - Validates dataset names when deleting with CLP_S storage engine.
- Handles AWS S3 auth mounting if required and constructs the docker run command.
"""
(No code block included to keep concise; full implementation provided in file.)
- Sets up CLI for list/delete with config loading, validation, and then executes
-
File: components/clp-package-utils/clp_package_utils/scripts/native/search.py (lines 237-332)
Content (docstring-style summary):
"""
Native search CLI logic invoked by the container, after container config is dumped.- Validates config, credentials, and dataset (if provided).
- Executes the internal search routine via clp_package_utils.scripts.native.search.
- Returns exit code.
"""
(No code block included to keep concise; full implementation provided in file.)
Note: The above entries include direct, exact content for compact snippets and concise summaries for larger functions. The summaries are designed to capture parameters, return values, exceptions, and key implementation details relevant to understanding dataset_manager’s behavior and its interactions with config loading, container setup, and dataset validation.
🧠 Learnings used
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Learnt from: haiqi96
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
junhaoliao
left a comment
There was a problem hiding this comment.
for the title, how about:
fix(package): Pass credentials via env vars to dataset_manager container.
quinntaylormitchell
left a comment
There was a problem hiding this comment.
lgtm as well
Description
The PR #1152 standardized how we pass credentials to the containers, but missed to update dataset_manager.py and it's native scripts.
This PR adds the missing changes to these two scripts.
Checklist
breaking change.
Validation performed
Manually ran the script and make sure it prints out
2025-08-19T21:41:09.391 INFO [dataset_manager] Found 0 datasets., showing that it is able to read the database credentials.Summary by CodeRabbit