Integration with GreatExpectations, validation of lab models #217

Open
julianam-w wants to merge 28 commits into main from gx-lab-validation
Conversation

julianam-w (Collaborator) commented Aug 24, 2025

MAUI-5788

  • Code review
  • Testing

Setup script

  • Compile expectations to json
  • Create batch definitions that cover the whole table
  • Create a suite of expectations
  • Create validation definitions
  • Create checkpoint from validation definitions

Validate script

  • Execute validation and generate report
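
As a rough illustration of the first setup step ("Compile expectations to json"), the sketch below serializes a Python-defined suite to a JSON file. It is not the PR's implementation: `compile_suite`, `write_suite`, the `id` column, and the simplified JSON layout are all assumptions made for the example, and the real Great Expectations suite format contains more fields.

```python
import json
import tempfile
from pathlib import Path


def compile_suite(name, expectations):
    """Bundle a suite name and its expectations into a JSON-serializable dict.

    Hypothetical helper: the layout is a simplified stand-in, not the exact
    Great Expectations suite schema.
    """
    return {"name": name, "expectations": list(expectations)}


def write_suite(suite, out_dir):
    """Write one suite to <out_dir>/<suite name>.json and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{suite['name']}.json"
    # sort_keys keeps the output stable, so diffs only show real changes.
    path.write_text(json.dumps(suite, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path


# The column name "id" is an assumption for illustration only.
suite = compile_suite(
    "ds__lab_requests_expectations",
    [{"type": "expect_column_values_to_not_be_null", "kwargs": {"column": "id"}}],
)
out_path = write_suite(suite, Path(tempfile.mkdtemp()) / "expectations")
print(out_path.name)
```

Writing with sorted keys and a fixed indent is a small design choice that keeps regenerated files byte-stable across runs.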

julianam-w changed the base branch from main to maui-5623-split-sensitive-lab on August 24, 2025 13:18
gemini-code-assist (Bot, Contributor) left a comment

Summary of Changes

Hello @julianam-w, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces Great Expectations to the project, establishing a robust data validation framework for lab-related data models. It sets up a comprehensive system for defining, running, and reporting on data quality expectations, ensuring the reliability of ds__lab_requests and ds__lab_tests datasets, including their sensitive counterparts.

Highlights

  • Great Expectations Framework: A new gx/ directory is added, containing Python scripts (setup.py, validate.py) and configuration files (great_expectations.yml, checkpoint, expectation suites) to integrate Great Expectations for data quality checks.
  • Lab Data Model Validation: Dedicated expectation suites (ds__lab_requests_expectations, ds__lab_tests_expectations, and their sensitive versions) are introduced to validate the integrity and consistency of lab request and lab test data.
  • Refactored dbt Models: Existing ds__lab_requests.sql and ds__lab_tests.sql dbt models are refactored to leverage new macros (lab_requests_dataset, lab_tests_dataset), enabling better separation and validation of sensitive and non-sensitive lab data.
  • New Sensitive Data Models: New dbt models (ds__sensitive_lab_requests.sql, ds__sensitive_lab_tests.sql) are created specifically for sensitive lab data, along with their corresponding YAML schemas.
  • Reporting Updates: SQL reports that consume lab data are updated to correctly reference the new sensitive/non-sensitive lab data models, ensuring accurate reporting based on the validated datasets.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a comprehensive data validation setup using Great Expectations, which is a fantastic addition for ensuring data quality. The modular structure with separate scripts for setup and validation is well-designed. However, there are a few critical issues to address, most importantly a security vulnerability involving hardcoded credentials. I've also included suggestions to improve error handling, repository hygiene, and code robustness.
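
One common way to address the hardcoded-credentials concern is to read connection settings from the environment at runtime. The sketch below is only an illustration under stated assumptions: the review does not say which database is used, so the PostgreSQL URL scheme, the `GX_DB_*` variable names, and the `build_connection_string` helper are all hypothetical.

```python
import os
from urllib.parse import quote_plus

# Hypothetical variable names; the PR does not specify any.
REQUIRED_VARS = ("GX_DB_USER", "GX_DB_PASSWORD", "GX_DB_HOST", "GX_DB_NAME")


def build_connection_string(env=None):
    """Build a connection URL from environment variables instead of literals.

    Assumes a PostgreSQL database reached through SQLAlchemy/psycopg2,
    which is an assumption for this example rather than a fact from the PR.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Fail early with the variable names, never with secret values.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    user = quote_plus(env["GX_DB_USER"])
    password = quote_plus(env["GX_DB_PASSWORD"])  # escape characters such as '@'
    port = env.get("GX_DB_PORT", "5432")
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{env['GX_DB_HOST']}:{port}/{env['GX_DB_NAME']}"
    )
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment. Great Expectations can also substitute `${VAR}` placeholders in its YAML configuration, which may be a simpler fit if the connection string lives in great_expectations.yml.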

Comment thread gx/great_expectations.yml Outdated
Comment thread gx/complete_validation_results.json Outdated
Comment thread gx/expectations/ds__lab_requests_expectations.py Outdated
Comment thread gx/setup.py Outdated
Comment thread gx/utils/asset_manager.py
Comment thread gx/utils/asset_manager.py
Comment thread gx/utils/validation_manager.py Outdated
Comment thread gx/utils/validation_manager.py
Comment thread gx/validate.py Outdated
julianam-w and others added 7 commits August 24, 2025 23:27
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: julianam-w <77252335+julianam-w@users.noreply.github.com>
julianam-w (Collaborator, Author) commented

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This PR introduces a comprehensive data validation framework using Great Expectations, which is a great addition for ensuring data quality. The separation of setup and validation logic into distinct scripts (setup.py, validate.py) is well-thought-out. The use of Python scripts to define expectation suites is also a good practice for maintainability.

My review focuses on improving code quality, fixing potential bugs, and ensuring consistency. I've identified a critical syntax error, some uses of bad practices like bare except clauses, and opportunities to reduce code duplication. I've also noted that the version of great-expectations is quite old and suggest an update.

One architectural point to consider is the dual representation of expectation suites (both .py and .json files are checked in). This can lead to them getting out of sync. It would be better to either have the Python scripts as the single source of truth and generate the JSON files at runtime (and add gx/expectations/*.json to .gitignore), or commit only the JSON files if they are to be edited manually.
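
If both representations stay in the repository, a small drift check run in CI is one way to catch the out-of-sync case described above. This is a sketch, not part of the PR: `find_drift` is a hypothetical helper, the suites are simplified dicts rather than real Great Expectations objects, and the demo files are throwaway examples.

```python
import json
import tempfile
from pathlib import Path


def find_drift(suites, json_dir):
    """Return the names of suites whose committed JSON is missing or differs.

    `suites` maps a suite name to the dict generated from its Python definition.
    Parsed JSON is compared, so whitespace and key order do not count as drift.
    """
    drifted = []
    for name, generated in suites.items():
        path = json_dir / f"{name}.json"
        if not path.exists():
            drifted.append(name)
            continue
        committed = json.loads(path.read_text(encoding="utf-8"))
        if committed != generated:
            drifted.append(name)
    return sorted(drifted)


# Demo with throwaway files: suite_a matches, suite_b differs, suite_c is missing.
demo_dir = Path(tempfile.mkdtemp())
in_sync = {"name": "suite_a", "expectations": []}
changed = {
    "name": "suite_b",
    "expectations": [{"type": "expect_column_to_exist", "kwargs": {"column": "id"}}],
}
(demo_dir / "suite_a.json").write_text(json.dumps(in_sync), encoding="utf-8")
(demo_dir / "suite_b.json").write_text(
    json.dumps({"name": "suite_b", "expectations": []}), encoding="utf-8"
)
drifted = find_drift({"suite_a": in_sync, "suite_b": changed, "suite_c": in_sync}, demo_dir)
print(drifted)
```

A check like this is only needed when both file types are committed; generating the JSON at runtime and ignoring it in version control removes the problem entirely.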

Addressing these points will make the new validation framework more robust and easier to maintain.

Comment thread gx/utils/validation_manager.py
Comment thread gx/expectations/ds__lab_tests_expectations.py Outdated
Comment thread gx/expectations/ds__sensitive_lab_requests_expectations.py Outdated
Comment thread gx/expectations/ds__sensitive_lab_tests_expectations.py Outdated
Comment thread pyproject.toml
Comment thread gx/validate.py Outdated
Comment thread gx/utils/asset_manager.py
julianam-w and others added 6 commits August 25, 2025 00:34
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: julianam-w <77252335+julianam-w@users.noreply.github.com>
julianam-w (Collaborator, Author) commented

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a comprehensive data validation framework using Great Expectations, which is a fantastic step towards ensuring data quality for the lab models. The changes are well-structured, with separate scripts for setup and validation, and clear documentation in the new gx/README.md.

My review focuses on a few critical correctness issues in the expectation definitions and some medium-severity improvements related to code practices, such as avoiding broad exception handling and wildcard imports. I've also noted that the Great Expectations version is quite old and recommend planning an upgrade. Addressing these points will make the new validation system more robust and reliable.
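
To make the point about broad exception handling concrete, the sketch below catches only the failures the caller can act on and lets everything else propagate. `load_results` is a hypothetical helper written for this example and is not taken from the PR's gx/utils modules.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def load_results(path: Path) -> Optional[dict]:
    """Load a validation results file, handling only the expected failures.

    A bare `except:` here would also swallow KeyboardInterrupt and genuine
    bugs; naming the exceptions keeps unexpected errors visible.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        # A missing file is a normal condition before the first run.
        logger.warning("No results file at %s", path)
        return None
    except json.JSONDecodeError as exc:
        # A corrupt file is an error; re-raise with context, keeping the cause.
        raise ValueError(f"Results file {path} is not valid JSON") from exc
```

The same idea applies to imports: naming each imported symbol, as done here, makes it clear where every name comes from, which a wildcard import hides.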

Comment thread gx/expectations/ds__lab_requests_expectations.py
Comment thread gx/expectations/ds__lab_tests_expectations.py Outdated
Comment thread gx/expectations/ds__sensitive_lab_requests_expectations.py
Comment thread pyproject.toml
Comment thread gx/expectations/ds__lab_requests_expectations.py
Comment thread gx/setup.py Outdated
Comment thread gx/utils/asset_manager.py
Comment thread gx/validate.py Outdated
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a robust data validation framework using Great Expectations, which is a fantastic addition for ensuring data quality in the lab models. The structure with separate setup and validation scripts, along with modular utility functions, is well-designed. I've identified a critical logical error in two of the expectation queries that needs to be addressed. Additionally, there's a potential security risk in how new datasources are created, and a few medium-severity issues related to code maintainability and a typo. Overall, this is a great feature, and with these fixes, it will be a solid and secure implementation.

Comment thread gx/expectations/ds__lab_requests_expectations.py
Comment thread gx/expectations/ds__sensitive_lab_requests_expectations.py
Comment thread gx/utils/asset_manager.py Outdated
Comment thread gx/expectations/ds__lab_tests_expectations.py Outdated
Comment thread gx/setup.py Outdated
Comment thread gx/utils/docs_manager.py Outdated
Comment thread gx/utils/result_extractor.py Outdated
Comment thread gx/utils/validation_manager.py
Comment thread gx/validate.py Outdated
julianam-w and others added 6 commits August 25, 2025 08:50
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: julianam-w <77252335+julianam-w@users.noreply.github.com>
Base automatically changed from maui-5623-split-sensitive-lab to main August 26, 2025 22:09