Pseudonymisation and FTPS #25
Merged
20 commits
a21ce7f First go at FTPS upload. Works! (jeremyestein)
210556b Document that settings should live outside of any repo so they don't … (jeremyestein)
1cf2e8c Control the PIXL install from our pyproject so the dependencies are all (jeremyestein)
8f96e8a Make FTPS upload more usable and check that file to be uploaded is (jeremyestein)
b64b299 Convert CSV into parquet with the appropriate decimal array type. (jeremyestein)
9edb4ff Make separate exporter container that runs cron to anonymise and export (jeremyestein)
129f8a6 Write a toy hasher so we can develop the rest of the pipeline in the (jeremyestein)
e64d155 Merge branch 'dev' into jeremy/pseudon (jeremyestein)
e0d64a4 Consistently use Python-style variable names (jeremyestein)
7f7dd8a Move config around to reflect recent container changes (jeremyestein)
f6b1a67 Make log message more useful (jeremyestein)
00167a9 many env files now (jeremyestein)
4ef9683 Document manual pipeline calls (jeremyestein)
32ccc74 Fix all but one linting error (jeremyestein)
b54d1fb Remove __init__.py that I had accidentally introduced at the top level, (jeremyestein)
9b39a47 Ignore missing import errors (jeremyestein)
48e2a83 Example crontab timing value was invalid (too many fields) (jeremyestein)
0ceaa39 Apply suggestions from code review (jeremyestein)
dfcb845 Clarify pyarrow data type usage (jeremyestein)
b4c7405 Add mention of PIXL to readme (jeremyestein)
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,3 +11,6 @@ wheels/ | |
|
|
||
| # IDEs | ||
| .idea/ | ||
|
|
||
| # settings files (should not be in the source tree anyway, but just in case) | ||
| *.env | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1 @@ | ||
| 3.11 | ||
| 3.13 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,20 @@ | ||
| FROM python:3.14-slim-bookworm | ||
| FROM python:3.13-slim-bookworm AS waveform_base | ||
| LABEL authors="Stephen Thompson, Jeremy Stein" | ||
| # Cron is really small. For the sake of not having to reinstall it all the time, | ||
| # put it on both images even though we only need it on exporter. | ||
| RUN export DEBIAN_FRONTEND=noninteractive && \ | ||
| apt-get update && \ | ||
| apt-get install --yes --no-install-recommends cron && \ | ||
| apt-get autoremove --yes && apt-get clean --yes && rm -rf /var/lib/apt/lists/* | ||
| COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ | ||
| WORKDIR /app | ||
| ARG UVCACHE=/root/.cache/uv | ||
| COPY pyproject.toml uv.lock* /app/ | ||
| COPY PIXL /PIXL | ||
| WORKDIR /app | ||
| COPY waveform-controller/pyproject.toml waveform-controller/uv.lock /app/ | ||
| RUN --mount=type=cache,target=${UVCACHE} uv pip install --system . | ||
| COPY . /app/ | ||
| COPY waveform-controller/. /app/ | ||
| RUN --mount=type=cache,target=${UVCACHE} uv pip install --system . | ||
| FROM waveform_base AS waveform_controller | ||
| CMD ["emap-extract-waveform"] | ||
| FROM waveform_base AS waveform_exporter | ||
| ENTRYPOINT ["/app/exporter-scripts/entrypoint.sh"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # This is an EXAMPLE file, do not put real secrets in here. | ||
| # Copy it to ../config/exporter.env and then DELETE THIS COMMENT. | ||
| # When does the exporter run | ||
| EXPORTER_CRON_SCHEDULE="14 5 * * *" | ||
| FTPS_HOST=myftps.example.com | ||
| FTPS_PORT=990 | ||
| FTPS_USERNAME= | ||
| FTPS_PASSWORD= |
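
The uploader in this PR reads `settings.FTPS_HOST`, `settings.FTPS_PORT` and friends, but the `settings` module itself is not visible in this diff. As a rough sketch only, the values from `exporter.env` could be read from the environment along these lines. `FtpsSettings` and `load_ftps_settings` are hypothetical names chosen for illustration, not part of the PR:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FtpsSettings:
    """Hypothetical container for the FTPS_* values in exporter.env."""

    host: str
    port: int
    username: str
    password: str


def load_ftps_settings(env=None) -> FtpsSettings:
    """Read the FTPS_* variables, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    names = ("FTPS_HOST", "FTPS_PORT", "FTPS_USERNAME", "FTPS_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return FtpsSettings(
        host=env["FTPS_HOST"],
        port=int(env["FTPS_PORT"]),  # environment values are always strings
        username=env["FTPS_USERNAME"],
        password=env["FTPS_PASSWORD"],
    )
```

Failing at startup on an empty `FTPS_USERNAME` or `FTPS_PASSWORD` (both blank in the example file) may be easier to diagnose than a rejected login later on.
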
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| # This is an EXAMPLE file, do not put real secrets in here. | ||
| # Copy it to ../config/hasher.env and then DELETE THIS COMMENT. | ||
| HASHER_API_AZ_CLIENT_ID= | ||
| HASHER_API_AZ_CLIENT_PASSWORD= | ||
| HASHER_API_AZ_TENANT_ID= | ||
| HASHER_API_AZ_KEY_VAULT_NAME= |
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,15 +1,46 @@ | ||
| services: | ||
| waveform-controller: | ||
| build: | ||
| context: . | ||
| dockerfile: Dockerfile | ||
| context: .. | ||
| dockerfile: waveform-controller/Dockerfile | ||
| target: waveform_controller | ||
| args: | ||
| HTTP_PROXY: ${HTTP_PROXY} | ||
| http_proxy: ${http_proxy} | ||
| HTTPS_PROXY: ${HTTPS_PROXY} | ||
| https_proxy: ${https_proxy} | ||
| # ideally we'd use docker secrets but it's not enabled currently | ||
| env_file: | ||
| - ./config/settings.env | ||
| - ../config/controller.env | ||
| volumes: | ||
| - ../waveform-export:/app/waveform-export | ||
| - ../waveform-export:/waveform-export | ||
| waveform-exporter: | ||
| build: | ||
| context: .. | ||
| dockerfile: waveform-controller/Dockerfile | ||
| target: waveform_exporter | ||
| args: | ||
| HTTP_PROXY: ${HTTP_PROXY} | ||
| http_proxy: ${http_proxy} | ||
| HTTPS_PROXY: ${HTTPS_PROXY} | ||
| https_proxy: ${https_proxy} | ||
| # ideally we'd use docker secrets but it's not enabled currently | ||
| env_file: | ||
| - ../config/exporter.env | ||
| volumes: | ||
| - ../waveform-export:/waveform-export | ||
| waveform-hasher: | ||
| build: | ||
| context: ../PIXL | ||
|
||
| dockerfile: ./docker/pixl-python/Dockerfile | ||
| target: hasher_api | ||
| args: | ||
| PIXL_PACKAGE_DIR: hasher | ||
| HTTP_PROXY: ${HTTP_PROXY} | ||
| http_proxy: ${http_proxy} | ||
| HTTPS_PROXY: ${HTTPS_PROXY} | ||
| https_proxy: ${https_proxy} | ||
| ports: | ||
| - "127.0.0.1:${HASHER_API_PORT}:8000" | ||
| env_file: | ||
| - ../config/hasher.env | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| #!/bin/bash | ||
|
|
||
| # (can't use -u because need to check for potentially unset var) | ||
| set -eo pipefail | ||
|
|
||
| # Set up cron schedule according to the environment variable | ||
| if [ -z "$EXPORTER_CRON_SCHEDULE" ]; then | ||
| echo "You must set EXPORTER_CRON_SCHEDULE when running this container" | ||
| exit 1 | ||
| fi | ||
| set -x | ||
| cat <<EOF | crontab - | ||
| PATH=/usr/local/bin:/usr/bin:/bin | ||
| SHELL=/usr/bin/bash | ||
| $EXPORTER_CRON_SCHEDULE /app/exporter-scripts/scheduled-script.sh | ||
| EOF | ||
|
||
|
|
||
| # cron scheduler is PID 1 in this container | ||
| exec cron -f | ||
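
The entrypoint only checks that `EXPORTER_CRON_SCHEDULE` is non-empty, and one commit in this PR fixed an example value that had too many fields. A field-count pre-check could catch that class of mistake before the value reaches `crontab`. The sketch below is an assumption about what such a check might look like: `check_cron_schedule` is a hypothetical helper, it does not validate field ranges, and it does not handle `@daily`-style shortcuts.

```python
def check_cron_schedule(schedule: str) -> str:
    """Return the schedule with normalised spacing, or raise if it does not
    have the five time fields (minute, hour, day of month, month, weekday)."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError(
            f"Expected 5 cron time fields, got {len(fields)}: {schedule!r}"
        )
    return " ".join(fields)


# The example value from exporter.env passes; a six-field value is rejected.
print(check_cron_schedule("14 5 * * *"))
try:
    check_cron_schedule("0 14 5 * * *")
except ValueError as err:
    print("rejected:", err)
```
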
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| #!/bin/bash | ||
|
|
||
| # Run by the cron scheduler | ||
| # Probably want snakemake instead... | ||
| emap-csv-pseudon --help | ||
| emap-send-ftps --help |
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| import argparse | ||
| import logging | ||
| import os | ||
| from pathlib import Path | ||
|
|
||
| import settings | ||
| from core.uploader._ftps import _connect_to_ftp, _create_and_set_as_cwd | ||
|
|
||
| from locations import WAVEFORM_PSEUDONYMISED_PARQUET | ||
|
|
||
| logging.basicConfig(format="%(levelname)s:%(asctime)s: %(message)s") | ||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def do_upload(): | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument( | ||
| "file_to_upload", | ||
| type=Path, | ||
| help="file to upload relative to pseudonymised folder", | ||
| ) | ||
| args = parser.parse_args() | ||
| do_upload_inner(args.file_to_upload) | ||
|
|
||
|
|
||
| def do_upload_inner(rel_file_to_upload: Path): | ||
| # Absolute paths override the base path, so disallow that (abspath1 / abspath2 == abspath2) | ||
| if rel_file_to_upload.is_absolute(): | ||
| raise ValueError("File must be relative to pseudonymised folder") | ||
| WAVEFORM_PSEUDONYMISED_PARQUET.mkdir(parents=False, exist_ok=True) | ||
| file_to_upload = (WAVEFORM_PSEUDONYMISED_PARQUET / rel_file_to_upload).resolve( | ||
| strict=True | ||
| ) | ||
| # Check that even after evaluating ".." and symlinks, the file is still under the "safe" directory | ||
| # for upload. | ||
| if not file_to_upload.is_relative_to(WAVEFORM_PSEUDONYMISED_PARQUET): | ||
| raise ValueError( | ||
| f"File {file_to_upload} must be under {WAVEFORM_PSEUDONYMISED_PARQUET}. " | ||
| f"If this is unexpected, maybe you are using symlinks or '..' in the path?" | ||
| ) | ||
| if not file_to_upload.exists(): | ||
| raise ValueError(f"File {file_to_upload} does not exist") | ||
| logger.info( | ||
| "Connecting to FTPS server %s:%s, with username %s", | ||
| settings.FTPS_HOST, | ||
| settings.FTPS_PORT, | ||
| settings.FTPS_USERNAME, | ||
| ) | ||
| ftp = _connect_to_ftp( | ||
| settings.FTPS_HOST, | ||
| settings.FTPS_PORT, | ||
| settings.FTPS_USERNAME, | ||
| settings.FTPS_PASSWORD, | ||
| ) | ||
| remote_project_dir = "waveform-export" | ||
| _create_and_set_as_cwd(ftp, remote_project_dir) | ||
| remote_filename = os.path.basename(file_to_upload) | ||
| command = f"STOR {remote_filename}" | ||
| logger.info("Uploading file %s", file_to_upload) | ||
| with open(file_to_upload, "rb") as file_to_upload_fh: | ||
| ftp.storbinary(command, file_to_upload_fh) | ||
| print("Directory listing: ") | ||
| ftp.dir() | ||
| ftp.quit() |
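
The two path checks in `do_upload_inner` guard against different things: joining a base path with an absolute path silently discards the base, and a relative path containing `..` or a symlink can resolve to somewhere outside the base. A self-contained sketch of the same idea, runnable against a temporary directory, is shown below. `resolve_under` is a hypothetical helper name, and unlike the PR code it also resolves the base directory, which is an assumption made so the comparison still works when the base itself sits behind a symlink.

```python
import tempfile
from pathlib import Path


def resolve_under(base: Path, rel: Path) -> Path:
    """Return base/rel fully resolved, refusing anything that escapes base."""
    if rel.is_absolute():
        # base / "/etc/passwd" == Path("/etc/passwd"): the base is discarded
        raise ValueError("Path must be relative")
    base = base.resolve()
    candidate = (base / rel).resolve()  # evaluates ".." and symlinks
    if not candidate.is_relative_to(base):  # needs Python 3.9+
        raise ValueError(f"{candidate} is not under {base}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    safe_dir = Path(tmp) / "pseudonymised"
    safe_dir.mkdir()
    (safe_dir / "ok.parquet").touch()

    print(resolve_under(safe_dir, Path("ok.parquet")).name)
    for bad in (Path("../secret.csv"), Path("/etc/passwd")):
        try:
            resolve_under(safe_dir, bad)
        except ValueError as err:
            print("rejected:", err)
```
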
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| from pathlib import Path | ||
|
|
||
| WAVEFORM_EXPORT_BASE = Path("/waveform-export") | ||
| WAVEFORM_ORIGINAL_CSV = WAVEFORM_EXPORT_BASE / "original-csv" | ||
| WAVEFORM_ORIGINAL_PARQUET = WAVEFORM_EXPORT_BASE / "original-parquet" | ||
| WAVEFORM_PSEUDONYMISED_PARQUET = WAVEFORM_EXPORT_BASE / "pseudonymised" |
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| from functools import lru_cache | ||
|
|
||
|
|
||
| @lru_cache | ||
| def do_hash(type_prefix, value: str): | ||
| """Stub implementation of deidentification function for testing purposes. | ||
|
|
||
| Not that I think this will happen in practice, but we'd want the CSN | ||
| "1234" to hash to a different value than the MRN "1234", so prefix | ||
| each value with its type. | ||
| """ | ||
| SALT = "waveform-exporter" | ||
| full_value_to_hash = f"{SALT}:{type_prefix}:{value}" | ||
| hash_str = f"{hash(full_value_to_hash) & 0xFFFFFFFF:08x}" | ||
| return hash_str |
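
One property of the stub worth keeping in mind: Python's built-in `hash()` for strings is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the same input will generally map to a different value in each run. That is fine for exercising the pipeline within one process, but outputs from separate runs would not line up. If run-to-run stability were wanted while the real hasher is unavailable, a `hashlib`-based variant is one option. This is a sketch under that assumption: `stable_toy_hash` is a hypothetical name, and like the stub it is for testing only, not real de-identification.

```python
import hashlib

SALT = "waveform-exporter"  # same toy salt as the stub; not a secret


def stable_toy_hash(type_prefix: str, value: str) -> str:
    """Deterministic 8-hex-digit toy hash, stable across processes."""
    full_value_to_hash = f"{SALT}:{type_prefix}:{value}"
    digest = hashlib.sha256(full_value_to_hash.encode("utf-8")).hexdigest()
    return digest[:8]  # keep 32 bits, matching the stub's 08x formatting


# The type prefix keeps an MRN and a CSN with the same digits apart.
print(stable_toy_hash("mrn", "1234"), stable_toy_hash("csn", "1234"))
```
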