feat: add safe gate privacy checks#206
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces privacy-focused security checks to the PR gate script, identifying risky file paths (e.g., .env, SSH keys) and sensitive content markers (e.g., API keys). Feedback from the review identifies a logic error where deleted files would still trigger warnings, potentially blocking the removal of secrets. Additionally, refactoring is suggested to optimize path resolution and improve code readability by passing Path objects directly to internal helper functions.
| def _privacy_problems(changed_paths: tuple[str, ...]) -> tuple[str, ...]: | ||
| problems: list[str] = [] | ||
| for path in sorted(changed_paths): | ||
| if _is_risky_path(path): | ||
| problems.append(f"privacy_risky_path:{path}") | ||
|
|
||
| text = _read_changed_text(path) | ||
| if text is None: | ||
| continue | ||
| for marker in PRIVATE_MARKERS: | ||
| if marker in text: | ||
| problems.append(f"privacy_marker:{marker}:{path}") | ||
| return tuple(problems) |
There was a problem hiding this comment.
The current implementation flags risky paths even if the file has been deleted in the PR. This is a significant issue as it blocks users from committing the removal of secrets or sensitive files. By moving the existence check to the start of the loop, we avoid false positives for deleted files and can pass the resolved Path to _read_changed_text to eliminate redundant system calls.
def _privacy_problems(changed_paths: tuple[str, ...]) -> tuple[str, ...]:
problems: list[str] = []
for path in sorted(changed_paths):
repo_path = _repo_relative_path(path)
if repo_path is None or not repo_path.is_file():
continue
if _is_risky_path(path):
problems.append(f"privacy_risky_path:{path}")
text = _read_changed_text(repo_path)
if text is None:
continue
for marker in PRIVATE_MARKERS:
if marker in text:
problems.append(f"privacy_marker:{marker}:{path}")
return tuple(problems)| def _read_changed_text(path: str) -> str | None: | ||
| candidate = _repo_relative_path(path) | ||
| if candidate is None or not candidate.is_file(): | ||
| return None | ||
| try: | ||
| data = candidate.read_bytes() | ||
| except OSError: | ||
| return None | ||
| if b"\0" in data: | ||
| return None | ||
| try: | ||
| return data.decode("utf-8") | ||
| except UnicodeDecodeError: | ||
| return None |
There was a problem hiding this comment.
Refactor _read_changed_text to accept a Path object directly. This avoids redundant path resolution and existence checks when called from _privacy_problems. The simplified error handling also makes the function more readable while maintaining the same logic for binary and encoding detection.
| def _read_changed_text(path: str) -> str | None: | |
| candidate = _repo_relative_path(path) | |
| if candidate is None or not candidate.is_file(): | |
| return None | |
| try: | |
| data = candidate.read_bytes() | |
| except OSError: | |
| return None | |
| if b"\0" in data: | |
| return None | |
| try: | |
| return data.decode("utf-8") | |
| except UnicodeDecodeError: | |
| return None | |
| def _read_changed_text(candidate: Path) -> str | None: | |
| try: | |
| data = candidate.read_bytes() | |
| if b"\0" in data: | |
| return None | |
| return data.decode("utf-8") | |
| except (OSError, UnicodeDecodeError): | |
| return None |
Adds minimal deterministic privacy/secret boundary checks to safe_pr_gate.
Includes:
.env,*.pem,*.key,id_rsa,id_ed25519BEGIN PRIVATE KEY,GITHUB_TOKEN=,OPENAI_API_KEY=,GEMINI_API_KEY=Scope
Local-only boundary checks. Not a broad secret scanner. No network, no external APIs, no binary scanning, and no broad refactors.
Validation
python -m compileall -q scripts/safe_pr_gate.pypytest tests/test_safe_pr_gate.py -q