
DM-54070: Add support for APDB record updates#33

Open
JeremyMcCormick wants to merge 26 commits into main from
tickets/DM-54070

Conversation

JeremyMcCormick (Contributor) commented Feb 11, 2026

Overview

This is a major update to the repository, adding support for propagating APDB update records to the PPDB in BigQuery. The commit history was completely rebuilt and consolidated, so it may be a useful starting point for review. In particular, 9e056c4 and a78ee70 contain the majority of the changes implementing the new functionality.

As discussed previously, applying record updates with the typical RDBMS pattern of individual SQL UPDATE statements would be infeasible given BigQuery's limitations in this area. In particular, the quota documentation indicates that there is a daily limit on the number of UPDATE statements that may be executed. Since a single replica chunk may contain millions of individual update records, applying them one statement at a time would far exceed that quota. A different approach was therefore needed, in which the updates are batched together and applied with a single MERGE statement.

The process for applying the updates implemented on this branch is as follows:

  1. During replication, the update records are serialized to a JSON file in the local replica chunk directory.
  2. The uploader process of the replication application uploads this JSON file to cloud storage alongside the parquet files.
  3. During the promotion step, the JSON files containing the updates are read in from cloud storage for the relevant chunks and the data is copied into an "expanded updates table", which represents the data generically as a combination of table name, field name, record ID, and new value, along with other necessary information like the replica chunk ID.
  4. The records in the expanded updates table are "deduplicated" so that only the most recent version of an update changing the same table name, field name, and record ID is included in the batch. The most recent update is determined by ordering on replica chunk, update time, and update order, in that order. (Strictly speaking, this is not deduplication: it selects only the most recent update of the same kind to a particular record.)
  5. The deduplicated records are merged into production during the promotion step, after a temporary table has been created from the current production table plus the new records. The updates are applied to this combined table, so it does not matter whether the target records were in the staging or production tables.
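For illustration, the selection in step 4 can be sketched in plain Python. The record fields and ordering mirror the description above, but the names are hypothetical and not the actual module API:

```python
from collections import namedtuple

# Hypothetical expanded update record; field names are illustrative only.
ExpandedUpdate = namedtuple(
    "ExpandedUpdate",
    ["table", "field", "record_id", "chunk_id", "update_time", "update_order", "value"],
)


def latest_updates(records):
    """Keep only the most recent update per (table, field, record_id).

    Recency is determined by (chunk_id, update_time, update_order),
    matching the ordering described in step 4 above.
    """
    best = {}
    for rec in records:
        key = (rec.table, rec.field, rec.record_id)
        rank = (rec.chunk_id, rec.update_time, rec.update_order)
        current = best.get(key)
        if current is None or rank > (current.chunk_id, current.update_time, current.update_order):
            best[key] = rec
    return list(best.values())
```

In the actual implementation this selection is done in BigQuery SQL over the expanded updates table rather than in Python, but the semantics are the same.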

Implementing this process required some major changes and additions to the existing infrastructure, outlined below.

Major Changes

  • A new updates package was added under bigquery, containing the following modules:
    • update_records - Pydantic model for packaging a set of update records for a single replica chunk
    • expanded_update_record - expanded update record representing the changes in a common form within the initial BigQuery table
    • updates_table - encapsulation of the initial target BQ table which contains the update information (also contains functionality for "deduplicating" the update records, described above)
    • updates_merger - tool for merging the records in the updates table into the target APDB tables (DiaObject, etc.)
    • updates_manager - manages the overall process of applying the updates during promotion
  • Existing classes and modules in the bigquery package were modified to support the new functionality:
    • ppdb_bigquery
      • Added support for reading Postgres password from Google Secrets Manager
      • Added support for handling update records in the store method by serializing them to a local JSON file in the chunk directory
      • Ported several methods related to chunk "promotion" from the db module in dax_ppdb_gcp to this class so that they are more easily accessible
  • A few core framework classes, or portions of them, were ported from lsst-dm/dax_ppdbx_gcp into this repository.
    • Functionality for interacting with the SQL database for replica chunk management was copied from the db module into ppdb_bigquery.
    • The replica_chunk_promoter module for promoting replica chunk data from staging to production was copied into the bigquery package, modified, and renamed as chunk_promoter.
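As a rough sketch of the serialization handled by the new update_records module: the real module uses a Pydantic model, so the dataclass version below and its field names are only an illustration of the JSON round-trip, not the actual schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class UpdateRecord:
    # Illustrative fields; the actual record schema comes from dax_apdb.
    table: str
    record_id: int
    updates: dict


@dataclass
class UpdateRecords:
    """Packages the update records for a single replica chunk."""

    chunk_id: int
    records: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialized to a JSON file in the local replica chunk directory.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, data: str) -> "UpdateRecords":
        raw = json.loads(data)
        return cls(
            chunk_id=raw["chunk_id"],
            records=[UpdateRecord(**r) for r in raw["records"]],
        )
```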

Minor Changes

  • Dependencies were reorganized and simplified.
    • The google-cloud-bigquery package was added as a primary dependency. The repository now depends heavily on this library after the updates in this ticket, so keeping it optional no longer makes sense for dependency management.
    • The lsst-dax-ppdbx-gcp library was also made a required dependency. It is similarly a core dependency now and does not make sense to have as optional.
    • Checks for formerly optional dependencies were stripped from the codebase, in particular, from the ppdb_bigquery module and all test modules.
  • Tests support usage of Google Cloud services now.
    • A check was added to skip tests where a Google Cloud environment is required but not available.
    • Certain tests will now (lightly) utilize cloud services such as Google Cloud Storage, BigQuery, etc., where appropriate.
      • This was implemented carefully to not use existing names for production objects such as datasets or buckets. The tests will typically create and then teardown these objects themselves.
  • The config module was renamed to ppdb_config, because the primary class it defines is PpdbConfig (following DM standards for module naming).
  • Added a new sql_resource module for loading SQL from a package resource
    • New SQL files for the updates were placed under a resources directory in the Python source tree
  • Moved test schema from tests directory into resources directory where it is more easily accessible
  • Reorganized some of the test utilities in the tests package to make certain methods available, e.g., for generating test data
  • See full commit history for a few additional, minor changes which were not listed.
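The new sql_resource module presumably wraps importlib.resources; a minimal sketch might look like the following. The default package path is an assumption, not necessarily where the resources directory actually lives:

```python
from importlib.resources import files


def load_sql(name: str, package: str = "lsst.dax.ppdb.config.sql") -> str:
    """Load a SQL file shipped as package data.

    The default package path is an assumption for illustration; adjust it
    to wherever the SQL resources directory actually lives in the tree.
    """
    return files(package).joinpath(name).read_text(encoding="utf-8")
```

Shipping SQL as package data keeps the statements editable without touching Python code, at the cost of resolving them through the import system at runtime.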

Known Issues

These are known issues that will be resolved on separate tickets.

  • There is a lot of duplication and overlap of the database-related (SQL/Postgres) methods and functionality in the ppdb_bigquery module. Some of this was preexisting and other overlapping methods were introduced in this PR. This can be cleaned up and consolidated (DM-54522).
  • Determining which records are updated in the cloud pipeline is currently done by reading the chunk manifests from cloud storage during replication. It would be better if a field in the PpdbReplicaChunk table, perhaps an update_count field, tracked whether the chunk has updates, making the manifest scan unnecessary. I will make this improvement on this PR.
  • The tests and test coverage in this package could use some improvements, though I did add test cases for the new modules in the updates package, as well as a few core, existing classes like chunk_uploader and ppdb_bigquery (DM-54536).
    • Test coverage should be much better now than it was before but still could be improved.
    • We could use some better utilities/classes for generating test data. Some exist, but there is overlap, and this in general could be improved and consolidated. The tests for the new updates support could use a more extensive set of records, e.g., for testing the "deduplication" process, etc.
    • There aren't any existing tools for creating a BigQuery dataset from a Felis YAML schema, which would be particularly useful for standing up test datasets (DM-49220).
    • The tests package in the Python source tree has some new modules, though they are a bit disorganized and miscellaneous. These should be cleaned up and consolidated. It is also possible that some of the classes should just be included into the test modules, even if there is minor duplication as a result.
    • End-to-end testing on the full replication process is not implemented in this repository, and is difficult to do without some additional functionality which doesn't exist yet. I will likely create a follow-up ticket for deploying and testing the new "updates" functionality to the cloud pipeline and RSP. It is beyond the scope of this repository's functionality to fully test that ingestion pipeline at this time.
    • Some usage of cloud services needs to be cleaned up so that resources are torn down after they are used. All of the resources created are marked so that they will be automatically deleted eventually, but ideally they should be deleted by the test case itself.
  • There are a large number of miscellaneous issues related to improvements to this repository's Python codebase (DM-54522).
    • These issues/bullets will likely be broken out into separate, more manageable child tickets which include a few related issues together.

Additional Notes

This ticket includes some refactoring, porting of classes from other repositories, etc. that I now realize was excessive to include and should have been done on separate tickets. I will keep this in mind for the future and try to make changes on ticket branches more targeted, in particular for ease of review and testing. The commit history of this PR should be useful for disambiguating some of this work, though beb2eab includes refactoring alongside changes supporting the new functionality (these would have been difficult to separate when I rebuilt the commit history).

Disclosure on LLM Usage

I used Copilot extensively during the development of this ticket, in particular, for the following:

  • Creation of test cases for new modules
  • Design and implementation of the algorithm and class structure for applying updates
  • Code refactoring, cleanup, and bug-fixing

I reviewed all of the AI-generated code multiple times and made many changes to it. The test cases, in particular, could use further attention and consolidation (see above).

Copilot AI left a comment

Pull request overview

Adds end-to-end support for exporting, uploading, deduplicating, and merging APDB update records into PPDB BigQuery tables, along with supporting SQL resources and test coverage updates.

Changes:

  • Introduces BigQuery “updates” subsystem (expand → load → deduplicate → merge) with SQL MERGE resources.
  • Extends chunk export/upload metadata to track update-record presence and GCS location (gcs_uri), and adds promotable-chunk SQL/query utilities.
  • Refactors tests/config to use a resource-based test schema path and adds multiple BigQuery/GCS integration tests.

Reviewed changes

Copilot reviewed 36 out of 41 changed files in this pull request and generated 10 comments.

File Description
tests/test_updates_table.py BigQuery UpdatesTable integration tests
tests/test_updates_merger.py BigQuery MERGE integration tests
tests/test_updates_manager.py End-to-end updates manager test
tests/test_update_records.py UpdateRecords JSON + GCS uploader tests
tests/test_update_record_expander.py Unit tests for update expansion
tests/test_ppdb_sql.py Switch tests to schema resource URI
tests/test_ppdb_bigquery.py Adds BigQuery test case scaffolding
requirements.txt Removes ppdbx-gcp from base reqs
python/lsst/dax/ppdb/tests/config/__init__.py Test config package init
python/lsst/dax/ppdb/tests/_updates.py Shared synthetic update-record fixtures
python/lsst/dax/ppdb/tests/_ppdb.py Adds test schema URI + mixin refactor
python/lsst/dax/ppdb/tests/_bigquery.py BigQuery test mixins + uploader stub helpers
python/lsst/dax/ppdb/sql/_ppdb_sql.py Engine creation API change for DB init
python/lsst/dax/ppdb/sql/_ppdb_sql_base.py Refactors engine/connect-args helpers
python/lsst/dax/ppdb/ppdb.py Imports config from new module
python/lsst/dax/ppdb/ppdb_config.py New pydantic-based config loader
python/lsst/dax/ppdb/config/sql/select_promotable_chunks.sql New SQL for promotable chunk selection
python/lsst/dax/ppdb/config/sql/merge_diasource_updates.sql New MERGE SQL for DiaSource updates
python/lsst/dax/ppdb/config/sql/merge_diaobject_updates.sql New MERGE SQL for DiaObject updates
python/lsst/dax/ppdb/config/sql/merge_diaforcedsource_updates.sql New MERGE SQL for DiaForcedSource updates
python/lsst/dax/ppdb/config/schemas/test_apdb_schema.yaml Adds test schema as package data
python/lsst/dax/ppdb/bigquery/updates/updates_table.py Creates/loads/dedups updates table
python/lsst/dax/ppdb/bigquery/updates/updates_merger.py Merger classes for applying updates
python/lsst/dax/ppdb/bigquery/updates/updates_manager.py Orchestrates download/expand/load/merge
python/lsst/dax/ppdb/bigquery/updates/update_records.py Pydantic model for update records JSON
python/lsst/dax/ppdb/bigquery/updates/update_record_expander.py Expands logical updates into field updates
python/lsst/dax/ppdb/bigquery/updates/expanded_update_record.py Model for a single expanded update row
python/lsst/dax/ppdb/bigquery/updates/__init__.py Exports updates public API
python/lsst/dax/ppdb/bigquery/sql_resource.py Loads SQL from package resources
python/lsst/dax/ppdb/bigquery/replica_chunk_promoter.py Promotion workflow for staged chunks
python/lsst/dax/ppdb/bigquery/query_runner.py Utility for running/logging BQ jobs
python/lsst/dax/ppdb/bigquery/ppdb_replica_chunk_extended.py Adds gcs_uri to chunk metadata
python/lsst/dax/ppdb/bigquery/ppdb_bigquery.py Writes update_records.json; adds promotable-chunks query
python/lsst/dax/ppdb/bigquery/manifest.py Tracks whether updates are included
python/lsst/dax/ppdb/bigquery/chunk_uploader.py Uploads update_records.json; stores gcs_uri
python/lsst/dax/ppdb/bigquery/__init__.py Exports ChunkUploader
python/lsst/dax/ppdb/_factory.py Updates config import location
python/lsst/dax/ppdb/__init__.py Re-exports new config module
pyproject.toml Adds package data + gcp extra dep
docker/Dockerfile.replication Adds build deps for Python packages
.gitignore Ignores .scratch directory
Comments suppressed due to low confidence (1)

python/lsst/dax/ppdb/tests/_bigquery.py:33

  • This module imports google.cloud.storage unconditionally. Because _bigquery.py is used by multiple test cases/mixins, this can break test collection in environments where optional GCP dependencies aren’t installed. Wrap these imports in try/except (similar to other tests) and skip/disable GCP-dependent helpers when unavailable.
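The guarded-import pattern this comment suggests might look like the sketch below; the module-level flag and mixin names are illustrative, not the code in this PR:

```python
import unittest

try:
    from google.cloud import storage  # optional GCP dependency
    HAVE_GCS = True
except ImportError:
    storage = None
    HAVE_GCS = False


class BigQueryTestMixin(unittest.TestCase):
    """Hypothetical mixin that skips GCP-dependent tests when the
    optional google-cloud packages are not installed."""

    def setUp(self):
        if not HAVE_GCS:
            self.skipTest("google-cloud-storage is not installed")
```

This keeps test collection working in environments without the optional dependencies, while still running the integration tests where they are available.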


@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from fc6daf7 to 0519130 Compare March 19, 2026 22:47
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from 6c7a5d3 to 4e54fc1 Compare March 21, 2026 02:52
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 8 times, most recently from 039bc08 to 0be5f74 Compare March 31, 2026 00:59
@JeremyMcCormick JeremyMcCormick marked this pull request as ready for review March 31, 2026 21:01
andy-slac (Collaborator) left a comment

I checked what I could, but it is too much to read in one go. Anyway, I left a bunch of comments and suggestions. My main issues:

  • The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.
  • Error handling needs more attention, I think there are a lot of possible leaking low-level exceptions everywhere.
  • Dumping update records to JSON will make it harder for other people to reuse those files, I think Parquet will be easier to read.
  • You want to extend dax_apdb API instead of doing getattr on hardcoded field names of update records.

This will need to be used directly within `dax_ppdb`, so it is ported
from `dax_ppdbx_gcp` with a few minor changes.

The name of the manifest was simplified to not include the chunk ID,
as this unnecessarily complicates data processing. A few utility
methods were added along with a flag indicating if the replica chunk
includes update records in a separate JSON file.

The password for the Cloud SQL connection is currently provided via
Google Secrets Manager, so this implements a hook in the base SQL class
for inserting the password into the database engine URL, as it will not
be present in a Postgres password file within the cloud function
environments.
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 2 times, most recently from 1647aeb to 74f2313 Compare April 3, 2026 00:08
JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

@andy-slac

The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.

I can rework the classes for generating the merge. Are you suggesting multiple BigQuery MERGE operations, one for each update type, rather than one per table? I think that should be possible, if not the most efficient: it might result in a full table scan for each update type. That could be acceptable if the Google Cloud project is configured with reserved slot capacity rather than dynamic. We just have to keep in mind that this will run every day, and full table scans of the bigger tables like DiaSource and DiaForcedSource will add up, so it would be best if the merge could be done with a single statement per table.
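For reference, a single per-table MERGE can be assembled by pivoting the generic (field, value) rows into one row per record ID. The sketch below is only an illustration of that idea; the table, column, and function names are assumptions, not the code in this PR:

```python
def build_merge_sql(table: str, id_column: str, fields: list[str],
                    updates_table: str = "expanded_updates") -> str:
    """Assemble one BigQuery MERGE statement covering all updated fields
    of a single target table. All names here are illustrative.
    """
    # Pivot the generic (field, new_value) rows into one row per record ID.
    pivots = ",\n    ".join(
        f"MAX(IF(field = '{f}', new_value, NULL)) AS {f}" for f in fields
    )
    # Only overwrite a column when an update for it is actually present.
    sets = ",\n    ".join(f"T.{f} = COALESCE(S.{f}, T.{f})" for f in fields)
    return f"""
MERGE `{table}` AS T
USING (
  SELECT record_id,
    {pivots}
  FROM `{updates_table}`
  WHERE table_name = '{table}'
  GROUP BY record_id
) AS S
ON T.{id_column} = S.record_id
WHEN MATCHED THEN UPDATE SET
    {sets}
""".strip()
```

With this shape, each target table is scanned once per promotion regardless of how many update types contributed rows.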

  • Error handling needs more attention, I think there are a lot of possible leaking low-level exceptions everywhere.

The context in which this runs is during promotion, which fails as an entire process, so it may not matter much if exceptions are unhandled in practice. The promotion function will catch and log them and then fail the entire batch of chunks.

However, I can look over especially the modules in the updates package and see if it makes sense to add more exception handling and some specific exception types like you suggested. This might help with debugging and control flow.

  • Dumping update records to JSON will make it harder for other people to reuse those files, I think Parquet will be easier to read.

I like this idea a lot better than writing JSON and will work on implementing it once all the minor suggested changes are made, based on #33 (comment).

  • You want to extend dax_apdb API instead of doing getattr on hardcoded field names of update records.

I agree with this as well and will work on some extensions to the ApdbUpdateRecord interface based on your suggestion from #33 (comment).

JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

@andy-slac

As far as other major rewrites, I am wondering if I should rework how the replica chunks with updates are found in updates_manager. Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged. I had planned to update that in a follow-on ticket, but maybe it makes sense to do here instead. I have a feeling that reading all of the chunk manifests from cloud storage will be inefficient, and it could be useful to be able to easily see which chunks have updates by reading this info from the db instead.

andy-slac (Collaborator) replied

Are you suggesting to have multiple BigQuery MERGE operations, one for each update type, rather than doing it by table?

We cannot do that, we need to keep relative order of updates as different update types can update the same field. It has to be by table.

Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged.

I want to stress that every replica chunk needs to be applied as a whole, including regular inserts and update records, before you move to the next chunk. I do not understand what you mean by scanning manifests for update records.

JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

Are you suggesting to have multiple BigQuery MERGE operations, one for each update type, rather than doing it by table?

We cannot do that, we need to keep relative order of updates as different update types can update the same field. It has to be by table.

Okay, it can be done by table as it is now, but I don't fully understand how you want the code restructured. The SQL files work fine, but you requested it be done differently. I need more information on what you're suggesting, like a sketch of some new classes and how they would build the merge statement. I believe you suggested having one "merge" class per update type. Can you outline how these classes would work together to build the complete merge statement?

This is a big change and will require rewriting a lot of the code in this PR, so I want to make sure that I understand what you are intending before starting this.

Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged.

I want to stress that every replica chunk needs to be applied as a whole, including regular inserts and update records before you move to the next chunk.

It is not feasible to apply the replica chunks one by one, or at least it would likely be very inefficient. The way it already works in promotion, the chunks are copied into the production tables together as a batch, provided that all of the prior chunks were loaded successfully. The updates seem as if they can be applied in the same way, as a batch. There should be no difference in the result from applying them all together. What would be the point in instead applying them one by one?

I do not understand what you mean by scanning manifests for update records.

In this PR, the manifest indicates with a boolean flag whether the chunk has updates, so all of the manifests have to be read from cloud storage to determine whether update records need to be loaded, including manifests for chunks without updates.

I think this is inefficient and would be better done by using the database. I'd need to add a new field to PpdbReplicaChunk to make this work. I thought about doing this on a follow-on ticket as mentioned in the PR description, but if I am making a lot of improvements to this PR based on your requested changes, I can do it here instead.
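The database-driven lookup proposed here could be as simple as a query against the new column. Below is a sketch using SQLite as a stand-in for the Postgres PpdbReplicaChunk table; the column names are assumptions:

```python
import sqlite3

# Stand-in for the Postgres PpdbReplicaChunk table; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE PpdbReplicaChunk (
        apdb_replica_chunk INTEGER PRIMARY KEY,
        update_count INTEGER NOT NULL DEFAULT 0
    )"""
)
conn.executemany(
    "INSERT INTO PpdbReplicaChunk VALUES (?, ?)",
    [(1, 0), (2, 512), (3, 0), (4, 7)],
)

# Chunks with updates can now be found without reading any manifests
# from cloud storage.
chunks_with_updates = [
    row[0]
    for row in conn.execute(
        "SELECT apdb_replica_chunk FROM PpdbReplicaChunk "
        "WHERE update_count > 0 ORDER BY apdb_replica_chunk"
    )
]
```

An `update_count` also carries slightly more information than a boolean `has_updates` flag at no extra cost, which may help with monitoring.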

- Add a few missing statuses
- Update the manifest name based on changes to `Manifest` class
- Provide a method for creating a new object with an updated GCS URI
The `test_ppdbBigQuery` module was also renamed to
`tests_ppdb_bigquery` following snakecase conventions.

This was previously referred to as "deduplication," which was
misleading, because the records do not represent duplicates. They are
updates on the same combination of `(table, field, record_id)` where
only the latest one should be kept.