Conversation
Summary of Changes

Hello @julianam-w, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the dbt project's capabilities by introducing a robust mechanism to reconstruct operational database schemas from change logs. This new 'replica' layer provides a near real-time view of source tables, which is crucial for advanced analytics and historical data analysis. The changes include new dbt profiles, macros, automated model generation scripts, and updated documentation to support this new data flow.
Code Review
This pull request introduces a new reconstructs layer in the dbt project, designed to create replica tables from the logs.changes data. This is a significant architectural addition, enabling analytics on a replicated database. The changes include new dbt macros for dynamic table reconstruction, a Python script to automate model creation, and updates to configuration and documentation.
My review focuses on improving the robustness and maintainability of the new implementation. Key points include:
- Strengthening the logic for discovering table columns in the reconstruction macro.
- Making the Python script for model generation more resilient to changes in dbt's output.
- Adhering to architectural principles by moving filtering logic out of the raw reconstruction layer.
- Avoiding hardcoded configuration values.
Overall, this is a great initiative. The feedback provided aims to make the new functionality more robust and easier to maintain.
```sql
WITH latest_version AS (
    SELECT record_data,
           row_number() over (order by string_to_array(version, '.')::int[] desc) as rn
    FROM {{ source("logs__tamanu", "changes") }}
    WHERE table_name = '{{ table_name }}'
      and version != 'unknown'
)
SELECT DISTINCT key
FROM latest_version,
     LATERAL jsonb_each_text(record_data)
WHERE rn = 1
```
The current logic for discovering columns is not robust. It relies on finding a single record with the highest software version and extracts JSON keys only from that record. This can lead to missing columns if:
- Other records for the same table have different (e.g., optional) fields.
- The `version` string is not in a format that can be sorted correctly (e.g., contains non-numeric parts).

A more reliable approach is to scan all records for the given `table_name` and collect all distinct keys that have ever appeared. This ensures the reconstructed table schema is complete.
```sql
SELECT DISTINCT key
FROM {{ source("logs__tamanu", "changes") }},
     LATERAL jsonb_object_keys(record_data) AS keys(key)
WHERE table_name = '{{ table_name }}'
```

```jinja
{% for table_name in table_names %}
{{ print(table_name) }}
{% endfor %}
```
Returning values from a dbt operation by printing them line-by-line is fragile and can break if dbt's logging output changes. A more robust approach is to print the result as a single JSON string. This makes parsing on the client-side (in your Python script) much more reliable.
```jinja
{{ print(tojson(table_names)) }}
```

```python
# Parse output to extract table names
table_names = []
lines = result.stdout.split("\n")

for line in lines:
    line = line.strip()

    if (
        line
        and not line.startswith("Running")
        and not line.startswith("Completed")
        and not line.startswith("Found")
        and not line.startswith("Registered")
        and not ":" in line
    ):  # Skip timestamps
        # This should be a table name - validate it
        if line.replace("_", "").isalnum() and len(line) > 0:
            table_names.append(line)

return table_names
```
Parsing dbt's stdout is brittle. Since the `get_table_names` macro should be modified to return a JSON array (as per my other comment), this function should be updated to parse that JSON. This makes the script much more robust to changes in dbt's logging format. You'll also need to `import json` at the top of the file.

```python
# Parse JSON output to extract table names
# (requires `import json` at the top of the file)
for line in result.stdout.split("\n"):
    line = line.strip()
    if line.startswith("["):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            cprint(f"Failed to parse JSON from dbt output: {line}", "error")
            return []
cprint("Could not find JSON output from dbt macro.", "error")
return []
```
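As a self-contained sketch of that parsing approach (the sample stdout below is illustrative, not actual dbt log output):

```python
import json


def parse_table_names(stdout: str) -> list[str]:
    """Find the JSON array printed by the get_table_names macro and decode it.

    Lines that do not start with '[' (dbt's own log lines, timestamps,
    etc.) are skipped, so changes to dbt's log formatting don't break us.
    """
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith("["):
            try:
                return json.loads(line)
            except json.JSONDecodeError:
                return []
    return []


# Illustrative stdout, assuming the macro prints tojson(table_names)
sample = '12:00:01  Running with dbt\n["patients", "encounters"]\n'
print(parse_table_names(sample))  # ['patients', 'encounters']
```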
```markdown
## Project Overview

This dbt project transforms Tamanu healthcare system data into optimised datasets following a structured data flow: **sources/logs → bases → datasets → reports**. The architecture supports both reporting and analytics use cases while maintaining data governance and privacy standards.
```
Could we say EMR instead of healthcare system? I have heard PMs referring to it as EMR in their presentations.
Contentious to call it an EMR or EHR, I would rather avoid it.
Maybe we could rename this macro to `get_tables_with_metadata()`. It would return the table name and an array of keys, which we could then pass to another macro in a loop to create models:
```sql
WITH latest_records AS (
    SELECT DISTINCT ON (table_oid)
        table_name,
        record_data
    FROM logs.changes
    WHERE version != 'unknown'
    ORDER BY table_oid, string_to_array(version, '.')::int[] DESC
)
SELECT
    table_name,
    ARRAY_AGG(DISTINCT key) AS keys
FROM latest_records,
     LATERAL jsonb_object_keys(record_data::jsonb) AS key
GROUP BY table_name;
```
The reason for keeping it separate is that I anticipate customisation of the tables, for example `patient_program_registration` and `patient_program_registration_conditions`. So this list of tables is only used to check for missing tables and to generate a template.

If it were part of a macro that loops through, it would override any custom logic applied. I've used a Python script to determine the new tables that need a template to be generated.
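That missing-table check can be sketched as a set difference between the discovered table names and the model files already on disk (the directory layout and function name here are assumptions for illustration, not the actual script):

```python
from pathlib import Path


def find_missing_tables(table_names: list[str], models_dir: str) -> list[str]:
    """Return tables with no corresponding model file yet.

    Only these get a generated template; hand-written models with
    custom logic are left untouched.
    """
    # Model files are assumed to be named <table_name>.sql
    existing = {p.stem for p in Path(models_dir).glob("*.sql")}
    return sorted(t for t in table_names if t not in existing)
```

A generator script would then render a template only for the returned names, never overwriting existing models.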
```yaml
- name: reporting
  description: Reporting schema models built as views
  default: true
  definition:
```
Does `default: true` here mean that if we don't pass a selector it will run this set by default?
That's correct.
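For reference, a minimal `selectors.yml` shape (the `definition` body below is a hypothetical example; dbt falls back to the selector marked `default: true` whenever no `--selector` or `--select` flag is passed):

```yaml
selectors:
  - name: reporting
    description: Reporting schema models built as views
    default: true        # used when no --selector/--select is given
    definition:
      method: path
      value: models/reporting
```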
If we update the other macro to return the table name and keys, we could simplify this macro like this:
```sql
{% macro jsonb_to_columns_dynamic(table_name, keys) %}
WITH latest_changes AS (
    SELECT DISTINCT ON (record_id)
        record_updated_at,
        record_data
    FROM {{ source('logs__tamanu', 'changes') }}
    WHERE table_name = '{{ table_name }}'
    {% if is_incremental() %}
      AND record_updated_at > (
          SELECT COALESCE(MAX(record_updated_at), '1900-01-01')
          FROM {{ this }}
      )
    {% endif %}
    ORDER BY record_id, record_updated_at DESC
)
SELECT
    record_updated_at,
    {% if keys | length > 0 %}
    {% for key in keys %}
    record_data->>'{{ key }}' AS {{ key }}{% if not loop.last %},{% endif %}
    {% endfor %}
    {% else %}
    NULL AS placeholder
    {% endif %}
FROM latest_changes
{% endmacro %}
```

(Note the `table_name` filter sits in an unconditional `WHERE` clause so the SQL stays valid on non-incremental runs, with the `record_updated_at` predicate added only when `is_incremental()`.)