
DM-54070: Add support for APDB record updates#33

Open
JeremyMcCormick wants to merge 26 commits into main from
tickets/DM-54070

Conversation

JeremyMcCormick (Contributor) commented Feb 11, 2026

Overview

This is a major update to the repository, adding support for propagating APDB update records to the PPDB in BigQuery. The commit history was completely rebuilt and consolidated, so it may be a useful starting point for review. In particular, 9e056c4 and a78ee70 contain the majority of the changes implementing the new functionality.

As discussed previously, applying record updates with the typical RDBMS pattern of individual SQL UPDATE statements would be infeasible given BigQuery's limitations in this area. In particular, the quota documentation indicates that there is a daily limit on the number of UPDATE statements that may be executed. Since a single replica chunk may contain millions of individual update records, applying them one statement at a time would far exceed that quota. A different approach was therefore needed, in which the updates are batched together and applied with a single MERGE statement.

The process for applying the updates implemented on this branch is as follows:

  1. During replication, the update records are serialized to a JSON file in the local replica chunk directory.
  2. The uploader process of the replication application uploads this JSON file to cloud storage alongside the parquet files.
  3. During the promotion step, the JSON files containing the updates are read in from cloud storage for the relevant chunks and the data is copied into an "expanded updates table", which represents the data generically as a combination of table name, field name, record ID, and new value, along with other necessary information like the replica chunk ID.
  4. The records in the expanded updates table are "deduplicated" so that only the most recent version of an update changing the same table name, field name, and record ID is included in the batch. The most recent update is determined by ordering on replica chunk, update time, and update order, in that order. (Strictly speaking, this is not deduplication: it selects only the most recent update of the same kind to a particular record.)
  5. The deduplicated records are merged into production during the promotion step, after a temporary table has been created from the current production table plus the new records. The updates are applied to this combined table, so it does not matter whether the target records were in the staging or production tables.
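For illustration, the selection in step 4 can be sketched in plain Python. The record fields and ordering mirror the description above, but the names are hypothetical and not the actual module API:

```python
from collections import namedtuple

# Hypothetical expanded update record; field names are illustrative only.
ExpandedUpdate = namedtuple(
    "ExpandedUpdate",
    ["table", "field", "record_id", "chunk_id", "update_time", "update_order", "value"],
)


def latest_updates(records):
    """Keep only the most recent update per (table, field, record_id).

    Recency is determined by (chunk_id, update_time, update_order),
    matching the ordering described in step 4 above.
    """
    best = {}
    for rec in records:
        key = (rec.table, rec.field, rec.record_id)
        rank = (rec.chunk_id, rec.update_time, rec.update_order)
        current = best.get(key)
        if current is None or rank > (current.chunk_id, current.update_time, current.update_order):
            best[key] = rec
    return list(best.values())
```

In the actual implementation this selection is done in BigQuery SQL over the expanded updates table rather than in Python, but the semantics are the same.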

Implementing this process required some major changes and additions to the existing infrastructure, outlined below.

Major Changes

  • A new updates package was added under bigquery, containing the following modules:
    • update_records - Pydantic model for packaging a set of update records for a single replica chunk
    • expanded_update_record - expanded update record representing the changes in a common form within the initial BigQuery table
    • updates_table - encapsulation of the initial target BQ table which contains the update information (also contains functionality for "deduplicating" the update records, described above)
    • updates_merger - tool for merging the records in the updates table into the target APDB tables (DiaObject, etc.)
    • updates_manager - manages the overall process of applying the updates during promotion
  • Existing classes and modules in the bigquery package were modified to support the new functionality:
    • ppdb_bigquery
      • Added support for reading Postgres password from Google Secrets Manager
      • Added support for handling update records in the store method by serializing them to a local JSON file in the chunk directory
      • Ported several methods related to chunk "promotion" from the db module in dax_ppdb_gcp to this class so that they are more easily accessible
  • A few core framework classes, or portions of them, were ported from lsst-dm/dax_ppdbx_gcp into this repository.
    • Functionality for interacting with the SQL database for replica chunk management was copied from the db module into ppdb_bigquery.
    • The replica_chunk_promoter module for promoting replica chunk data from staging to production was copied into the bigquery package, modified, and renamed as chunk_promoter.
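As a rough sketch of the serialization handled by the new update_records module: the real module uses a Pydantic model, so the dataclass version below and its field names are only an illustration of the JSON round-trip, not the actual schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class UpdateRecord:
    # Illustrative fields; the actual record schema comes from dax_apdb.
    table: str
    record_id: int
    updates: dict


@dataclass
class UpdateRecords:
    """Packages the update records for a single replica chunk."""

    chunk_id: int
    records: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialized to a JSON file in the local replica chunk directory.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, data: str) -> "UpdateRecords":
        raw = json.loads(data)
        return cls(
            chunk_id=raw["chunk_id"],
            records=[UpdateRecord(**r) for r in raw["records"]],
        )
```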

Minor Changes

  • Dependencies were reorganized and simplified.
    • The google-cloud-bigquery package was added as a primary dependency. The repository now depends heavily on this library after the updates in this ticket, so keeping it optional no longer makes sense for dependency management.
    • The lsst-dax-ppdbx-gcp library was also made a required dependency. It is similarly a core dependency now and does not make sense to have as optional.
    • Checks for formerly optional dependencies were stripped from the codebase, in particular, from the ppdb_bigquery module and all test modules.
  • Tests support usage of Google Cloud services now.
    • A check was added to skip tests where a Google Cloud environment is required but not available.
    • Certain tests will now (lightly) utilize cloud services such as Google Cloud Storage, BigQuery, etc., where appropriate.
      • This was implemented carefully to not use existing names for production objects such as datasets or buckets. The tests will typically create and then teardown these objects themselves.
  • The config module was renamed to ppdb_config, because the primary class it defines is PpdbConfig (following DM standards for module naming).
  • Added a new sql_resource module for loading SQL from a package resource
    • New SQL files for the updates were placed under a resources directory in the Python source tree
  • Moved test schema from tests directory into resources directory where it is more easily accessible
  • Reorganized some of the test utilities in the tests package to make certain methods available, e.g., for generating test data
  • See full commit history for a few additional, minor changes which were not listed.
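The new sql_resource module presumably wraps importlib.resources; a minimal sketch might look like the following. The default package path is an assumption, not necessarily where the resources directory actually lives:

```python
from importlib.resources import files


def load_sql(name: str, package: str = "lsst.dax.ppdb.config.sql") -> str:
    """Load a SQL file shipped as package data.

    The default package path is an assumption for illustration; adjust it
    to wherever the SQL resources directory actually lives in the tree.
    """
    return files(package).joinpath(name).read_text(encoding="utf-8")
```

Shipping SQL as package data keeps the statements editable without touching Python code, at the cost of resolving them through the import system at runtime.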

Known Issues

These are known issues that will be resolved on separate tickets.

  • There is a lot of duplication and overlap of the database-related (SQL/Postgres) methods and functionality in the ppdb_bigquery module. Some of this was preexisting and other overlapping methods were introduced in this PR. This can be cleaned up and consolidated (DM-54522).
  • Determining which records are updated in the cloud pipeline is currently done by reading the chunk manifests from cloud storage during replication. It would be better if a field in the PpdbReplicaChunk table, perhaps an update_count field, tracked whether the chunk has updates, making the manifest scan unnecessary. I will make this improvement on this PR.
  • The tests and test coverage in this package could use some improvements, though I did add test cases for the new modules in the updates package, as well as a few core, existing classes like chunk_uploader and ppdb_bigquery (DM-54536).
    • Test coverage should be much better now than it was before but still could be improved.
    • We could use some better utilities/classes for generating test data. Some exist, but there is overlap, and this in general could be improved and consolidated. The tests for the new updates support could use a more extensive set of records, e.g., for testing the "deduplication" process, etc.
    • There aren't any existing tools for creating a BigQuery dataset from a Felis YAML schema, which would be particularly useful for standing up test datasets (DM-49220).
    • The tests package in the Python source tree has some new modules, though they are a bit disorganized and miscellaneous. These should be cleaned up and consolidated. It is also possible that some of the classes should just be included into the test modules, even if there is minor duplication as a result.
    • End-to-end testing on the full replication process is not implemented in this repository, and is difficult to do without some additional functionality which doesn't exist yet. I will likely create a follow-up ticket for deploying and testing the new "updates" functionality to the cloud pipeline and RSP. It is beyond the scope of this repository's functionality to fully test that ingestion pipeline at this time.
    • Some usage of cloud services needs to be cleaned up so that resources are torn down after they are used. All of the resources created are marked so that they will be automatically deleted eventually, but ideally they should be deleted by the test case itself.
  • There are a large number of miscellaneous issues related to improvements to this repository's Python codebase (DM-54522).
    • These issues/bullets will likely be broken out into separate, more manageable child tickets which include a few related issues together.

Additional Notes

This ticket includes some refactoring, porting of classes from other repositories, etc. that I now realize was excessive to include and should have been done on separate tickets. I will keep this in mind for the future and try to make changes on ticket branches more targeted, in particular for ease of review and testing. The commit history of this PR should be useful for disambiguating some of this work, though beb2eab includes refactoring alongside changes supporting the new functionality (these would have been difficult to separate when I rebuilt the commit history).

Disclosure on LLM Usage

I used Copilot extensively during the development of this ticket, in particular, for the following:

  • Creation of test cases for new modules
  • Design and implementation of the algorithm and class structure for applying updates
  • Code refactoring, cleanup, and bug-fixing

I reviewed all of the AI-generated code multiple times and made many changes to it. The test cases, in particular, could use further attention and consolidation (see above).

Copilot AI left a comment

Pull request overview

Adds end-to-end support for exporting, uploading, deduplicating, and merging APDB update records into PPDB BigQuery tables, along with supporting SQL resources and test coverage updates.

Changes:

  • Introduces BigQuery “updates” subsystem (expand → load → deduplicate → merge) with SQL MERGE resources.
  • Extends chunk export/upload metadata to track update-record presence and GCS location (gcs_uri), and adds promotable-chunk SQL/query utilities.
  • Refactors tests/config to use a resource-based test schema path and adds multiple BigQuery/GCS integration tests.

Reviewed changes

Copilot reviewed 36 out of 41 changed files in this pull request and generated 10 comments.

File Description
tests/test_updates_table.py BigQuery UpdatesTable integration tests
tests/test_updates_merger.py BigQuery MERGE integration tests
tests/test_updates_manager.py End-to-end updates manager test
tests/test_update_records.py UpdateRecords JSON + GCS uploader tests
tests/test_update_record_expander.py Unit tests for update expansion
tests/test_ppdb_sql.py Switch tests to schema resource URI
tests/test_ppdb_bigquery.py Adds BigQuery test case scaffolding
requirements.txt Removes ppdbx-gcp from base reqs
python/lsst/dax/ppdb/tests/config/__init__.py Test config package init
python/lsst/dax/ppdb/tests/_updates.py Shared synthetic update-record fixtures
python/lsst/dax/ppdb/tests/_ppdb.py Adds test schema URI + mixin refactor
python/lsst/dax/ppdb/tests/_bigquery.py BigQuery test mixins + uploader stub helpers
python/lsst/dax/ppdb/sql/_ppdb_sql.py Engine creation API change for DB init
python/lsst/dax/ppdb/sql/_ppdb_sql_base.py Refactors engine/connect-args helpers
python/lsst/dax/ppdb/ppdb.py Imports config from new module
python/lsst/dax/ppdb/ppdb_config.py New pydantic-based config loader
python/lsst/dax/ppdb/config/sql/select_promotable_chunks.sql New SQL for promotable chunk selection
python/lsst/dax/ppdb/config/sql/merge_diasource_updates.sql New MERGE SQL for DiaSource updates
python/lsst/dax/ppdb/config/sql/merge_diaobject_updates.sql New MERGE SQL for DiaObject updates
python/lsst/dax/ppdb/config/sql/merge_diaforcedsource_updates.sql New MERGE SQL for DiaForcedSource updates
python/lsst/dax/ppdb/config/schemas/test_apdb_schema.yaml Adds test schema as package data
python/lsst/dax/ppdb/bigquery/updates/updates_table.py Creates/loads/dedups updates table
python/lsst/dax/ppdb/bigquery/updates/updates_merger.py Merger classes for applying updates
python/lsst/dax/ppdb/bigquery/updates/updates_manager.py Orchestrates download/expand/load/merge
python/lsst/dax/ppdb/bigquery/updates/update_records.py Pydantic model for update records JSON
python/lsst/dax/ppdb/bigquery/updates/update_record_expander.py Expands logical updates into field updates
python/lsst/dax/ppdb/bigquery/updates/expanded_update_record.py Model for a single expanded update row
python/lsst/dax/ppdb/bigquery/updates/__init__.py Exports updates public API
python/lsst/dax/ppdb/bigquery/sql_resource.py Loads SQL from package resources
python/lsst/dax/ppdb/bigquery/replica_chunk_promoter.py Promotion workflow for staged chunks
python/lsst/dax/ppdb/bigquery/query_runner.py Utility for running/logging BQ jobs
python/lsst/dax/ppdb/bigquery/ppdb_replica_chunk_extended.py Adds gcs_uri to chunk metadata
python/lsst/dax/ppdb/bigquery/ppdb_bigquery.py Writes update_records.json; adds promotable-chunks query
python/lsst/dax/ppdb/bigquery/manifest.py Tracks whether updates are included
python/lsst/dax/ppdb/bigquery/chunk_uploader.py Uploads update_records.json; stores gcs_uri
python/lsst/dax/ppdb/bigquery/__init__.py Exports ChunkUploader
python/lsst/dax/ppdb/_factory.py Updates config import location
python/lsst/dax/ppdb/__init__.py Re-exports new config module
pyproject.toml Adds package data + gcp extra dep
docker/Dockerfile.replication Adds build deps for Python packages
.gitignore Ignores .scratch directory
Comments suppressed due to low confidence (1)

python/lsst/dax/ppdb/tests/_bigquery.py:33

  • This module imports google.cloud.storage unconditionally. Because _bigquery.py is used by multiple test cases/mixins, this can break test collection in environments where optional GCP dependencies aren’t installed. Wrap these imports in try/except (similar to other tests) and skip/disable GCP-dependent helpers when unavailable.
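The guarded-import pattern this comment suggests might look like the sketch below; the module-level flag and mixin names are illustrative, not the code in this PR:

```python
import unittest

try:
    from google.cloud import storage  # optional GCP dependency
    HAVE_GCS = True
except ImportError:
    storage = None
    HAVE_GCS = False


class BigQueryTestMixin(unittest.TestCase):
    """Hypothetical mixin that skips GCP-dependent tests when the
    optional google-cloud packages are not installed."""

    def setUp(self):
        if not HAVE_GCS:
            self.skipTest("google-cloud-storage is not installed")
```

This keeps test collection working in environments without the optional dependencies, while still running the integration tests where they are available.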


@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from fc6daf7 to 0519130 Compare March 19, 2026 22:47
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from 6c7a5d3 to 4e54fc1 Compare March 21, 2026 02:52
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 8 times, most recently from 039bc08 to 0be5f74 Compare March 31, 2026 00:59
@JeremyMcCormick JeremyMcCormick marked this pull request as ready for review March 31, 2026 21:01
andy-slac (Collaborator) left a comment

I checked what I could, but it is too much to read in one go. Anyway, I left a bunch of comments and suggestions. My main issues:

  • The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.
  • Error handling needs more attention, I think there are a lot of possible leaking low-level exceptions everywhere.
  • Dumping update records to JSON will make it harder for other people to reuse those files, I think Parquet will be easier to read.
  • You want to extend dax_apdb API instead of doing getattr on hardcoded field names of update records.

This will need to be used directly within `dax_ppdb`, so it is ported
from `dax_ppdbx_gcp` with a few minor changes.

The name of the manifest was simplified to not include the chunk ID,
as this unnecessarily complicates data processing. A few utility
methods were added along with a flag indicating if the replica chunk
includes update records in a separate JSON file.

The password for the Cloud SQL connection is currently provided via
Google Secrets Manager, so this implements a hook in the base SQL class
for inserting the password into the database engine URL, as it will not
be present in a Postgres password file within the cloud function
environments.
@JeremyMcCormick JeremyMcCormick force-pushed the tickets/DM-54070 branch 2 times, most recently from 1647aeb to 74f2313 Compare April 3, 2026 00:08
JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

@andy-slac

The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.

I can rework the classes for generating the merge. Are you suggesting multiple BigQuery MERGE operations, one for each update type, rather than one per table? I think that should be possible, if not the most efficient: it might result in a full table scan for each update type. That could be acceptable if the Google Cloud project is configured with reserved slot capacity rather than dynamic. We just have to keep in mind that this will run every day, and full table scans of the bigger tables like DiaSource and DiaForcedSource will add up, so it would be best if the merge could be done with a single statement per table.
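For reference, a single per-table MERGE can be assembled by pivoting the generic (field, value) rows into one row per record ID. The sketch below is only an illustration of that idea; the table, column, and function names are assumptions, not the code in this PR:

```python
def build_merge_sql(table: str, id_column: str, fields: list[str],
                    updates_table: str = "expanded_updates") -> str:
    """Assemble one BigQuery MERGE statement covering all updated fields
    of a single target table. All names here are illustrative.
    """
    # Pivot the generic (field, new_value) rows into one row per record ID.
    pivots = ",\n    ".join(
        f"MAX(IF(field = '{f}', new_value, NULL)) AS {f}" for f in fields
    )
    # Only overwrite a column when an update for it is actually present.
    sets = ",\n    ".join(f"T.{f} = COALESCE(S.{f}, T.{f})" for f in fields)
    return f"""
MERGE `{table}` AS T
USING (
  SELECT record_id,
    {pivots}
  FROM `{updates_table}`
  WHERE table_name = '{table}'
  GROUP BY record_id
) AS S
ON T.{id_column} = S.record_id
WHEN MATCHED THEN UPDATE SET
    {sets}
""".strip()
```

With this shape, each target table is scanned once per promotion regardless of how many update types contributed rows.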

  • Error handling needs more attention, I think there are a lot of possible leaking low-level exceptions everywhere.

The context in which this runs is during promotion, which fails as an entire process, so it may not matter much if exceptions are unhandled in practice. The promotion function will catch and log them and then fail the entire batch of chunks.

However, I can look over especially the modules in the updates package and see if it makes sense to add more exception handling and some specific exception types like you suggested. This might help with debugging and control flow.

  • Dumping update records to JSON will make it harder for other people to reuse those files, I think Parquet will be easier to read.

I like this idea a lot better than writing JSON and will work on implementing it once all the minor suggested changes are made, based on #33 (comment).

  • You want to extend dax_apdb API instead of doing getattr on hardcoded field names of update records.

I agree with this as well and will work on some extensions to the ApdbUpdateRecord interface based on your suggestion from #33 (comment).

JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

@andy-slac

As far as other major rewrites, I am wondering if I should rework how the replica chunks with updates are found in updates_manager. Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged. I had planned to update that in a follow-on ticket, but maybe it makes sense to do here instead. I have a feeling that reading all of the chunk manifests from cloud storage will be inefficient, and it could be useful to be able to easily see which chunks have updates by reading this info from the db instead.

andy-slac (Collaborator) replied

Are you suggesting to have multiple BigQuery MERGE operations, one for each update type, rather than doing it by table?

We cannot do that, we need to keep relative order of updates as different update types can update the same field. It has to be by table.

Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged.

I want to stress that every replica chunk needs to be applied as a whole, including regular inserts and update records, before you move to the next chunk. I do not understand what you mean by scanning manifests for update records.

JeremyMcCormick (Contributor, Author) commented Apr 3, 2026

Are you suggesting to have multiple BigQuery MERGE operations, one for each update type, rather than doing it by table?

We cannot do that, we need to keep relative order of updates as different update types can update the same field. It has to be by table.

Okay, it can be done by table as it is now, but I don't fully understand how you want the code restructured. The SQL files work fine, but you requested it be done differently. I need more information on what you're suggesting, like a sketch of some new classes and how they would build the merge statement. I believe you suggested having one "merge" class per update type. Can you outline how these classes would work together to build the complete merge statement?

This is a big change and will require rewriting a lot of the code in this PR, so I want to make sure that I understand what you are intending before starting this.

Currently, it scans all of the chunk manifests from cloud storage to see if they have any update records, but it would be more efficient and cleaner if PpdbReplicaChunk had an update_count field (or a has_updates flag) so that the chunks with updates could be more easily retrieved and flagged.

I want to stress that every replica chunk needs to be applied as a whole, including regular inserts and update records before you move to the next chunk.

It is not feasible to apply the replica chunks one by one, or at least it would likely be very inefficient. The way it already works in promotion, the chunks are copied into the production tables together as a batch, provided that all of the prior chunks were loaded successfully. The updates seem as if they can be applied in the same way, as a batch. There should be no difference in the result from applying them all together. What would be the point in instead applying them one by one?

I do not understand what you mean by scanning manifests for update records.

In this PR, the manifest indicates with a boolean flag whether the chunk has updates, so all of the manifests have to be read from cloud storage to determine whether update records need to be loaded, including manifests for chunks without updates.

I think this is inefficient and would be better done by using the database. I'd need to add a new field to PpdbReplicaChunk to make this work. I thought about doing this on a follow-on ticket as mentioned in the PR description, but if I am making a lot of improvements to this PR based on your requested changes, I can do it here instead.
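The database-driven lookup proposed here could be as simple as a query against the new column. Below is a sketch using SQLite as a stand-in for the Postgres PpdbReplicaChunk table; the column names are assumptions:

```python
import sqlite3

# Stand-in for the Postgres PpdbReplicaChunk table; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE PpdbReplicaChunk (
        apdb_replica_chunk INTEGER PRIMARY KEY,
        update_count INTEGER NOT NULL DEFAULT 0
    )"""
)
conn.executemany(
    "INSERT INTO PpdbReplicaChunk VALUES (?, ?)",
    [(1, 0), (2, 512), (3, 0), (4, 7)],
)

# Chunks with updates can now be found without reading any manifests
# from cloud storage.
chunks_with_updates = [
    row[0]
    for row in conn.execute(
        "SELECT apdb_replica_chunk FROM PpdbReplicaChunk "
        "WHERE update_count > 0 ORDER BY apdb_replica_chunk"
    )
]
```

An `update_count` also carries slightly more information than a boolean `has_updates` flag at no extra cost, which may help with monitoring.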

- Add a few missing statuses
- Update the manifest name based on changes to `Manifest` class
- Provide a method for creating a new object with an updated GCS URI
The `test_ppdbBigQuery` module was also renamed to
`tests_ppdb_bigquery` following snakecase conventions.

This was previously referred to as "deduplication," which was
misleading, because the records do not represent duplicates. They are
updates on the same combination of `(table, field, record_id)` where
only the latest one should be kept.