Integration with GreatExpectations, validation of lab models #217
julianam-w wants to merge 28 commits into `main` from `gx-lab-validation`
Conversation
Summary of Changes
Hello @julianam-w, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces Great Expectations to the project, establishing a robust data validation framework for lab-related data models. It sets up a comprehensive system for defining, running, and reporting on data quality expectations, ensuring the reliability of ds__lab_requests and ds__lab_tests datasets, including their sensitive counterparts.
Highlights
- Great Expectations Framework: A new gx/ directory is added, containing Python scripts (setup.py, validate.py) and configuration files (great_expectations.yml, a checkpoint, and expectation suites) to integrate Great Expectations for data quality checks.
- Lab Data Model Validation: Dedicated expectation suites (ds__lab_requests_expectations, ds__lab_tests_expectations, and their sensitive versions) are introduced to validate the integrity and consistency of lab request and lab test data.
- Refactored dbt Models: Existing ds__lab_requests.sql and ds__lab_tests.sql dbt models are refactored to leverage new macros (lab_requests_dataset, lab_tests_dataset), enabling better separation and validation of sensitive and non-sensitive lab data.
- New Sensitive Data Models: New dbt models (ds__sensitive_lab_requests.sql, ds__sensitive_lab_tests.sql) are created specifically for sensitive lab data, along with their corresponding YAML schemas.
- Reporting Updates: SQL reports that consume lab data are updated to correctly reference the new sensitive/non-sensitive lab data models, ensuring accurate reporting based on the validated datasets.
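For context on what an expectation suite in this setup might contain, here is a minimal sketch in Great Expectations' JSON suite format, built as a plain Python dict. The column names and choice of expectations are illustrative assumptions, not copied from the PR's actual gx/expectations/ files:

```python
import json

# Hypothetical suite for ds__lab_requests; column names and expectation
# choices are illustrative, not the PR's real suite contents.
suite = {
    "expectation_suite_name": "ds__lab_requests_expectations",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "lab_request_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": "lab_request_id"},
        },
    ],
}

# Round-trip through JSON, as the checked-in suite files would be stored.
print(json.dumps(suite, indent=2))
```

Each entry pairs an `expectation_type` with the `kwargs` it validates against; the checked-in JSON files under gx/expectations/ follow this same shape.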
Code Review
This pull request introduces a comprehensive data validation setup using Great Expectations, which is a fantastic addition for ensuring data quality. The modular structure with separate scripts for setup and validation is well-designed. However, there are a few critical issues to address, most importantly a security vulnerability involving hardcoded credentials. I've also included suggestions to improve error handling, repository hygiene, and code robustness.
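On the hardcoded-credentials point, one conventional fix is to build the datasource connection string from environment variables. A minimal sketch, where the variable names are assumptions and not taken from the PR:

```python
import os

def pg_connection_string() -> str:
    """Build the warehouse connection string from environment variables
    rather than hardcoding credentials in great_expectations.yml or
    setup.py. The TAMANU_DB_* variable names are assumptions."""
    user = os.environ["TAMANU_DB_USER"]          # required: fail fast if unset
    password = os.environ["TAMANU_DB_PASSWORD"]  # required: never checked in
    host = os.environ.get("TAMANU_DB_HOST", "localhost")
    port = os.environ.get("TAMANU_DB_PORT", "5432")
    dbname = os.environ.get("TAMANU_DB_NAME", "tamanu")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"
```

Required variables use direct indexing so a missing secret raises `KeyError` immediately instead of silently connecting with a default.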
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: julianam-w <77252335+julianam-w@users.noreply.github.com>
…l/tamanu-source-dbt into gx-lab-validation
/gemini review
Code Review
This PR introduces a comprehensive data validation framework using Great Expectations, which is a great addition for ensuring data quality. The separation of setup and validation logic into distinct scripts (setup.py, validate.py) is well-thought-out. The use of Python scripts to define expectation suites is also a good practice for maintainability.
My review focuses on improving code quality, fixing potential bugs, and ensuring consistency. I've identified a critical syntax error, some bad practices such as bare `except` clauses, and opportunities to reduce code duplication. I've also noted that the version of great-expectations is quite old and suggest an update.
One architectural point to consider is the dual representation of expectation suites (both .py and .json files are checked in). This can lead to them getting out of sync. It would be better to either have the Python scripts as the single source of truth and generate the JSON files at runtime (and add gx/expectations/*.json to .gitignore), or commit only the JSON files if they are to be edited manually.
Addressing these points will make the new validation framework more robust and easier to maintain.
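The "Python as single source of truth" option suggested above could be sketched as a small exporter that regenerates the JSON suite files from Python definitions, so the two representations cannot drift apart. The suite contents and function name below are placeholders, not the PR's actual code:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

# Placeholder suite definitions; the real ones would live in the Python
# setup scripts the review describes.
SUITES = {
    "ds__lab_requests_expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "lab_request_id"}},
    ],
    "ds__lab_tests_expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "lab_test_id"}},
    ],
}

def export_suites(output_dir: Path) -> list[Path]:
    """Write each suite to <output_dir>/<name>.json and return the paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, expectations in sorted(SUITES.items()):
        payload = {"expectation_suite_name": name, "expectations": expectations}
        path = output_dir / f"{name}.json"
        path.write_text(json.dumps(payload, indent=2))
        written.append(path)
    return written

# Write to a temporary directory here; the real target would be gx/expectations/.
paths = export_suites(Path(mkdtemp()) / "expectations")
```

With this layout, gx/expectations/*.json could be added to .gitignore and regenerated as part of the setup step.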
/gemini review
Code Review
This pull request introduces a comprehensive data validation framework using Great Expectations, which is a fantastic step towards ensuring data quality for the lab models. The changes are well-structured, with separate scripts for setup and validation, and clear documentation in the new gx/README.md.
My review focuses on a few critical correctness issues in the expectation definitions and some medium-severity improvements related to code practices, such as avoiding broad exception handling and wildcard imports. I've also noted that the Great Expectations version is quite old and recommend planning an upgrade. Addressing these points will make the new validation system more robust and reliable.
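On the broad-exception-handling point raised here and in the earlier review, the usual fix is to catch only the failure modes the caller can handle and let everything else propagate. A sketch against a hypothetical checkpoint call (the function name and caught exception types are illustrative, not the PR's actual validate.py):

```python
import sys

def run_checkpoint(run_fn) -> bool:
    """Run a checkpoint callable and report expected failures without a
    bare `except`. `run_fn` stands in for the PR's actual checkpoint
    invocation; the caught exception types are illustrative."""
    try:
        result = run_fn()
    except (ConnectionError, TimeoutError) as exc:
        # Expected operational failures: report and signal a failed run.
        print(f"validation run failed: {exc}", file=sys.stderr)
        return False
    # Anything else (e.g. a programming error) propagates unchanged.
    return bool(result)
```

Programming errors such as a `TypeError` still surface with a full traceback instead of being swallowed, which is the main hazard of a bare `except`.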
Code Review
This pull request introduces a robust data validation framework using Great Expectations, which is a fantastic addition for ensuring data quality in the lab models. The structure with separate setup and validation scripts, along with modular utility functions, is well-designed. I've identified a critical logical error in two of the expectation queries that needs to be addressed. Additionally, there's a potential security risk in how new datasources are created, and a few medium-severity issues related to code maintainability and a typo. Overall, this is a great feature, and with these fixes, it will be a solid and secure implementation.
MAUI-5788
Setup script
Validate script