Create tutorial to get ZTF light curves from HATS catalog #325

Open
jaladh-singhal wants to merge 4 commits into Caltech-IPAC:main from jaladh-singhal:ztf-lightcurves

@jaladh-singhal
Member

Fixes IRSA-7768

@jaladh-singhal jaladh-singhal self-assigned this May 27, 2026
@jaladh-singhal jaladh-singhal added the content Content related issues/PRs. label May 27, 2026

with Client(n_workers=get_nworkers(ztf_lc_cone),
            threads_per_worker=1,
            memory_limit=None  # each partition can be several GB; avoid per-worker cap
Member Author

@jaladh-singhal jaladh-singhal May 27, 2026

It was running out of memory locally without this. Let's see how it performs on CI.

right_on='oid',
how='inner'
)
combined_df
Member Author

Just FYI, I tried 3 more approaches before settling on this one (which takes 2m±5s for both compute calls combined):

  1. ztf_objects_cone.join(ztf_lc_cone, ...).compute() takes 2m±5s
  2. ztf_objects_cone.merge(ztf_lc_cone, ...).compute() takes 2m±5s
  3. ztf_objects_cone.compute(); ztf_lc_cone.id_search(values={'objectid': list(ztf_objects_cone_df.oid)}).compute() took 2m50s

I kept this approach because of time as well as to maintain the narrative of keeping objects table search optional.

@bsipocz
Member

bsipocz commented May 28, 2026

I've pushed a commit directly that fixes the oldestdeps job failure; a review will follow separately.

@bsipocz bsipocz added the GHA buildhtml Enable extra buildhtml job on GHA label May 28, 2026
@bsipocz
Member

bsipocz commented May 28, 2026

I'm not sure what is going on in CircleCI. We may actually be hitting the memory limit even though the graph doesn't show it (the graph's resolution is pretty poor, so memory is still my prime suspect for the failure). Keep an eye on the GHA buildhtml job instead for now.

@jaladh-singhal
Member Author

@bsipocz buildhtml job is getting skipped (I tried re-triggering it)

Contributor

@troyraen troyraen left a comment

Thanks @jaladh-singhal! I'm requesting changes that are important but small. The meat of this is great. Thanks for putting it together so fast.

I ran this on Fornax and found that we can reduce memory usage to a max of <8G by using a dask client for the index search (in addition to the others that you already noticed). Hopefully that will be enough for it to run in CI. 🤞

I flagged two things that often confuse users (how objects are defined, and how the object ID column is named) and suggest we add admonitions for those.

There are two wording issues I think we should change. I commented on only the first instance of each, so look for other instances throughout the notebook.

  • I think "object" or "target" will be clearer to ZTF users than "source". "Source" means different things to different people throughout astronomy, so your usage isn't wrong per se, but in my experience with time-domain use cases and surveys like ZTF it almost always means a single observation (i.e., one point in the light curve).

  • "Objects Table", "Lightcurves", and "HATS Collection" are proper nouns, so use those spellings and capitalizations consistently. (In case it's confusing: "Lightcurves" is the name of the catalog, while a "light curve" is a time series of data points for a given object (and can also be plural). It's usually obvious which is meant, so use that spelling. Where a sentence or phrase is equally correct either way, just pick one.
    ...And then there's also the column name, which is spelled "lightcurve" 😅.)
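
To make the terminology concrete, here is a toy sketch (not notebook code) of the distinction between a ZTF object and its individual observations. The flat, one-row-per-epoch table and all of its values are made up for illustration; the real Lightcurves catalog stores epochs in the nested `lightcurve` column instead. Only the column names (`objectid`, `filtercode`, `hmjd`, `mag`, `catflags`) are taken from this PR.

```python
import pandas as pd

# Hypothetical data: one row per observation (epoch), flattened for illustration.
epochs = pd.DataFrame({
    "objectid":   [1, 1, 2, 2, 3, 4],
    "filtercode": ["zg", "zg", "zr", "zr", "zg", "zi"],
    "hmjd":       [58300.1, 58301.2, 58300.3, 58302.4, 58303.5, 58304.6],
    "mag":        [18.2, 18.3, 17.9, 18.0, 18.25, 17.5],
    "catflags":   [0, 0, 0, 32768, 0, 0],
})

# Keep only clean epochs, as the notebook does with query("catflags == 0").
clean = epochs.query("catflags == 0")

# One time-ordered light curve per filter, pooling every ZTF object in that filter.
per_filter = {
    filtercode: group.sort_values("hmjd").reset_index(drop=True)
    for filtercode, group in clean.groupby("filtercode")
}

for filtercode, lc in per_filter.items():
    n_objects = lc["objectid"].nunique()
    print(f"{filtercode}: {n_objects} ZTF object(s), {len(lc)} clean epoch(s)")
```

Each row here is what many time-domain users would call a "source" (a single point), while `objectid` identifies the ZTF object whose light curve those points belong to.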


```{code-cell} ipython3
# Uncomment the next line to install dependencies if needed.
# !pip install s3fs "lsdb>=0.6.6,<0.8" pyarrow pandas astropy matplotlib
Contributor

Is lsdb<0.8 just for our own CI and due to other notebooks? I wonder if/how we can handle that without implying to end users that this particular notebook requires lsdb<0.8.

Comment on lines +86 to +87
ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats" # Light curves catalog
ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats" # Objects table
Contributor

Suggested change
ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats" # Light curves catalog
ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats" # Objects table
ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats" # Lightcurves catalog
ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats" # Objects Table

Use the proper nouns. (ZTF named these products "Lightcurves" and "Objects Table".)

ztf_lc_schema_df
```

Notice the `lightcurve` column — this is a **nested column** that stores the full photometric time series for each ZTF object.
Contributor

Suggested change
Notice the `lightcurve` column — this is a **nested column** that stores the full photometric time series for each ZTF object.
Notice the `lightcurve` column — this is a **[nested](https://nested-pandas.readthedocs.io/) column** that stores the full photometric time series for each ZTF object.

Maybe link here. I find myself needing to refer to those docs to figure out/remember how to work with nested columns.


### 5.1 Explore the Objects Table Schema

The Objects Table contains per-band summary statistics for each ZTF source.
Contributor

Suggested change
The Objects Table contains per-band summary statistics for each ZTF source.
The Objects Table contains summary statistics for each ZTF object.

Comment on lines +208 to +209
ztf_lcs_by_id_df = ztf_lcs_by_id.compute()
ztf_lcs_by_id_df
Contributor

I ran this on Fornax to check memory usage and found that Dask holds onto the memory it grabs here. Adding del ztf_lcs_by_id_df and similar for the catalog objects doesn't help. Wrapping this in a client context does help. (same as you figured out for sec. 4 below)

11.5G = Max memory usage of notebook without Client here
7.1G = Max memory usage of notebook with Client here

I used this:

Suggested change
ztf_lcs_by_id_df = ztf_lcs_by_id.compute()
ztf_lcs_by_id_df
def get_nworkers(object_ids):
    return min(os.cpu_count(), len(object_ids))

with Client(n_workers=get_nworkers(object_ids),
            threads_per_worker=1,
            memory_limit=None  # each partition can be several GB; avoid per-worker cap
            ) as client:
    print(f"You can monitor progress in the Dask dashboard at {client.dashboard_link}")
    ztf_lcs_by_id_df = ztf_lcs_by_id.compute()
ztf_lcs_by_id_df
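
For reference, a minimal standalone sketch of a slightly more defensive worker-count helper. It relies only on the standard library and leaves the Dask `Client` out entirely. The `max_workers` parameter and the floor of one worker are additions for illustration, not part of the suggestion above; the guard on `os.cpu_count()` is there because that function can return `None`.

```python
import os


def get_nworkers(items, max_workers=None):
    """Return a worker count between 1 and min(CPU count, len(items))."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    # max_workers is a hypothetical extra cap, e.g. for memory-constrained CI.
    cap = cpu_count if max_workers is None else min(cpu_count, max_workers)
    return max(1, min(cap, len(items)))


object_ids = [101, 102, 103]  # hypothetical ZTF object IDs
n_workers = get_nworkers(object_ids)
print(f"Would start {n_workers} worker(s)")
```

The result could then be passed as `n_workers` in the `Client(...)` call shown in the suggestion.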

We save the list of columns interesting to us for later use when opening the catalog with `lsdb`:

```{code-cell} ipython3
ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs", "lightcurve"]
Contributor

I hoped we could save more memory by selecting only the lightcurve columns that actually get used, but I tried this and the savings are insignificant (we need 4 out of the 5 columns, so I suppose that makes sense). Leaving this suggestion in case you want to show that it's possible to load only some of the nested columns. Also fine to ignore this.

Suggested change
ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs", "lightcurve"]
ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs",
                  "lightcurve.hmjd", "lightcurve.mag", "lightcurve.magerr", "lightcurve.catflags"]

## 4. Get Light Curves by Sky Position

If you have sky coordinates and want all ZTF sources within a given area, use a cone search.

Contributor

@troyraen troyraen Jun 2, 2026

Let's add this admonition (or similar) to describe how ZTF objects are defined. This is the first place where it's directly relevant, but you could move it up to the introduction if you prefer.

Suggested change
:::{important} ZTF objects are defined per (filter, field, quadrant)
ZTF objects (i.e., unique object IDs) are defined _per_ (filter, field, quadrant).
This means that observations of a single _astrophysical_ object are usually spread out amongst several different _ZTF_ objects.
In the simplest case, a given astrophysical object will be represented by up to 3 ZTF objects, one per filter (g, r, and i).
The per-filter observations may themselves be separated into additional ZTF objects if the astrophysical object lies near the boundary of a ZTF field and/or quadrant.
ZTF's pixel scale is 1"/pixel (see [ZTF Technical Specifications](https://www.ptf.caltech.edu/page/ztf_technical)), so combining all ZTF objects within a 1" cone search may be reasonable for a given astrophysical object.
:::

You can choose whether you want to change the actual code below here or not. I think it's fine to leave as-is as long as we provide this admonition. If changing, I would probably reduce search_radius to 1 arcsec, remove the ["nepochs", ">", 50] row filter, and group by filter (i.e., band) before plotting the light curves. (FWIW, our Fornax light-curve-collector notebook justifies the 1" cone search by citing Graham et al., 2024. Unfortunately, ADS returns 79 papers for "Graham et al., 2024" and I don't know which one that came from.)


for ax, (_, row) in zip(axs, most_variable.iterrows()):
    lc = row['lightcurve'].query("catflags == 0")  # to keep only clean epochs
    title = (f"ZTF Object {row['objectid']} ({row['filtercode']} band)\n"
Contributor

Maybe we should stick with "filter". Most people will understand that band == filter but newbies may be confused if we don't explain it.

Suggested change
title = (f"ZTF Object {row['objectid']} ({row['filtercode']} band)\n"
title = (f"ZTF Object {row['objectid']} ({row['filtercode']} filter)\n"

Comment on lines +28 to +30
- Retrieve light curves for specific sources by ZTF object IDs using an index search.
- Retrieve light curves for sources in a sky region using a cone search.
- Cross-reference the Objects Table to enrich cone search results with per-source variability statistics.
Contributor

I think we should use either "object" or "target" instead of "source".

Suggested change
- Retrieve light curves for specific sources by ZTF object IDs using an index search.
- Retrieve light curves for sources in a sky region using a cone search.
- Cross-reference the Objects Table to enrich cone search results with per-source variability statistics.
- Retrieve light curves for specific objects by ZTF object IDs using an index search.
- Retrieve light curves for objects in a sky region using a cone search.
- Cross-reference the Objects Table to enrich cone search results with per-object variability statistics.

```

We'll select a subset of columns useful for characterizing and annotating variable sources:

Contributor

This is another thing that trips people up, so let's add an admonition. Could move this up to the intro (or elsewhere) if you prefer.

Suggested change
:::{important} `objectid` == `oid`
ZTF's object ID column is named `objectid` in Lightcurves and `oid` in Objects Table.
Despite this difference, the two columns are the same and can be used to join the catalogs.
:::
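
A small sketch of what that join looks like once both results are in memory as plain pandas DataFrames. The two toy tables and the `example_stat` column are hypothetical; only the key names `objectid` and `oid` and the `left_on`/`right_on`/`how='inner'` arguments come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for computed Lightcurves cone-search results.
lightcurves_df = pd.DataFrame({
    "objectid": [10, 11, 12],
    "nepochs": [120, 75, 300],
})

# Hypothetical stand-in for computed Objects Table results.
objects_df = pd.DataFrame({
    "oid": [11, 12, 13],
    "example_stat": [0.12, 0.45, 0.07],  # placeholder for a variability statistic
})

# Join on the differently named ID columns; the inner join keeps only shared IDs.
combined_df = lightcurves_df.merge(
    objects_df,
    left_on="objectid",
    right_on="oid",
    how="inner",
)

# Both key columns survive the merge, so drop the redundant one.
combined_df = combined_df.drop(columns="oid")
print(combined_df)
```

The merge keeps both key columns when their names differ, which is why the sketch drops `oid` afterwards.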

@troyraen troyraen added content: parquet Content related issues/PRs for notebooks with parquet/HATS relevance content: ztf Content related issues/PRs for notebooks with ZTF relevance labels Jun 2, 2026