Create tutorial to get ZTF light curves from HATS catalog #325
jaladh-singhal wants to merge 4 commits into
| with Client(n_workers=get_nworkers(ztf_lc_cone), | ||
| threads_per_worker=1, | ||
| memory_limit=None # each partition can be several GB; avoid per-worker cap |
It was running out of memory locally without this. Let's see how it performs on CI.
| right_on='oid', | ||
| how='inner' | ||
| ) | ||
| combined_df |
Just FYI, I tried 3 more approaches before settling on this one (which takes 2m±5s for both compute calls combined):
- `ztf_objects_cone.join(ztf_lc_cone, ...).compute()` takes 2m±5s
- `ztf_objects_cone.merge(ztf_lc_cone, ...).compute()` takes 2m±5s
- `ztf_objects_cone.compute(); ztf_lc_cone.id_search(values={'objectid': list(ztf_objects_cone_df.oid)}).compute()` took 2m50s
I kept this approach both for its run time and to maintain the narrative that searching the Objects Table is optional.
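For readers following this thread, here is a minimal sketch of the inner merge on `objectid`/`oid` discussed above, run on toy pandas frames rather than the computed catalogs. The frame contents and the `magrms` column are illustrative assumptions, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-in for the computed Lightcurves cone-search result (ID column: `objectid`).
ztf_lc_cone_df = pd.DataFrame({
    "objectid": [101, 102, 103],
    "filterid": [1, 2, 1],
    "nepochs": [120, 75, 60],
})

# Toy stand-in for the computed Objects Table cone-search result (ID column: `oid`).
ztf_objects_cone_df = pd.DataFrame({
    "oid": [101, 103, 104],
    "magrms": [0.05, 0.21, 0.02],  # hypothetical statistic column
})

# Inner merge keeps only the objects present in both frames.
combined_df = ztf_lc_cone_df.merge(
    ztf_objects_cone_df,
    left_on="objectid",
    right_on="oid",
    how="inner",
)
print(combined_df)
```

Because the merge is `how="inner"`, objects missing from either side (102 and 104 here) drop out of the result.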
I've pushed a commit directly that fixes the oldestdeps job failure; a review will come separately.

I'm not sure what goes on in CircleCI. We may actually be hitting that memory limit even though the graph doesn't show it (the graph's resolution is pretty bad, so that's still my prime suspect for the failure). Keep an eye on the GHA buildhtml job instead for now.

@bsipocz buildhtml job is getting skipped (I tried re-triggering it) |
troyraen left a comment
Thanks @jaladh-singhal! I'm requesting changes that are important but small. The meat of this is great. Thanks for putting it together so fast.
I ran this on Fornax and found that we can reduce memory usage to a max of <8G by using a dask client for the index search (in addition to the others that you already noticed). Hopefully that will be enough for it to run in CI. 🤞
I flagged two things that often confuse users (how objects are defined, and how the object ID column is named) and suggest we add admonitions for those.
There are two wording things I think we should change. I commented on only the first instance of each, so look for other instances throughout the notebook.
- I think "object" or "target" will be clearer to ZTF users than "source". "Source" means different things to different people throughout astronomy, so your usage isn't wrong per se, but in my experience with time-domain use cases and surveys like ZTF it almost always means a single observation (i.e., one point in the light curve).
- "Objects Table", "Lightcurves", and "HATS Collection" are proper nouns, so use those spellings and capitalizations consistently. (In case it's confusing, "Lightcurves" is the name of the catalog, while a "light curve" is a time series of data points for a given object (and can also be plural). It's usually obvious which is meant, so use the matching spelling. But there are cases where a sentence/phrase is equally correct either way, so then you can just pick one. ...And then there's also the column name, which is spelled "lightcurve" 😅.)
| ```{code-cell} ipython3 | ||
| # Uncomment the next line to install dependencies if needed. | ||
| # !pip install s3fs "lsdb>=0.6.6,<0.8" pyarrow pandas astropy matplotlib |
Is lsdb<0.8 just for our own CI and due to other notebooks? I wonder if/how we can handle that without implying to end users that this particular notebook requires lsdb<0.8.
| ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats" # Light curves catalog | ||
| ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats" # Objects table |
Suggested change:
- ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats"  # Light curves catalog
- ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats"  # Objects table
+ ztf_lc_hats_prefix = "ztf/enhanced/dr24/lc/hats"  # Lightcurves catalog
+ ztf_objects_hats_prefix = "ztf/enhanced/dr24/objects/hats"  # Objects Table
Use the proper nouns. (ZTF named these products "Lightcurves" and "Objects Table".)
| ztf_lc_schema_df | ||
| ``` | ||
| Notice the `lightcurve` column — this is a **nested column** that stores the full photometric time series for each ZTF object. |
| Notice the `lightcurve` column — this is a **nested column** that stores the full photometric time series for each ZTF object. | |
| Notice the `lightcurve` column — this is a **[nested](https://nested-pandas.readthedocs.io/) column** that stores the full photometric time series for each ZTF object. |
Maybe link here. I find myself needing to refer to those docs to figure out/remember how to work with nested columns.
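As a rough illustration of what "nested" means here, the sketch below models each ZTF object as a row whose `lightcurve` entry holds that object's per-epoch records. It is a plain-Python stand-in for the concept only; it does not use the nested-pandas API, and the values are made up.

```python
# Each row is one ZTF object; `lightcurve` holds that object's per-epoch records.
rows = [
    {"objectid": 1, "nepochs": 2, "lightcurve": [
        {"hmjd": 58200.1, "mag": 17.20, "magerr": 0.02, "catflags": 0},
        {"hmjd": 58201.1, "mag": 17.90, "magerr": 0.30, "catflags": 32768},
    ]},
    {"objectid": 2, "nepochs": 1, "lightcurve": [
        {"hmjd": 58200.2, "mag": 18.50, "magerr": 0.05, "catflags": 0},
    ]},
]


def clean_epochs(lightcurve):
    """Keep only epochs with catflags == 0 (i.e., no quality flags set)."""
    return [epoch for epoch in lightcurve if epoch["catflags"] == 0]


for row in rows:
    kept = clean_epochs(row["lightcurve"])
    print(row["objectid"], len(kept), "clean epochs")
```

The notebook's `row['lightcurve'].query("catflags == 0")` plays the same role as `clean_epochs` in this toy version.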
| ### 5.1 Explore the Objects Table Schema | ||
| The Objects Table contains per-band summary statistics for each ZTF source. |
Suggested change:
- The Objects Table contains per-band summary statistics for each ZTF source.
+ The Objects Table contains summary statistics for each ZTF object.
| ztf_lcs_by_id_df = ztf_lcs_by_id.compute() | ||
| ztf_lcs_by_id_df |
I ran this on Fornax to check memory usage and found that Dask holds onto the memory it grabs here. Adding `del ztf_lcs_by_id_df` (and similar for the catalog objects) doesn't help. Wrapping this in a client context does help, same as you figured out for Sec. 4 below.
11.5G = Max memory usage of notebook without Client here
7.1G = Max memory usage of notebook with Client here
I used this:
Suggested change:
- ztf_lcs_by_id_df = ztf_lcs_by_id.compute()
- ztf_lcs_by_id_df
+ def get_nworkers(object_ids):
+     return min(os.cpu_count(), len(object_ids))
+
+ with Client(n_workers=get_nworkers(object_ids),
+             threads_per_worker=1,
+             memory_limit=None  # each partition can be several GB; avoid per-worker cap
+             ) as client:
+     print(f"You can monitor progress in the Dask dashboard at {client.dashboard_link}")
+     ztf_lcs_by_id_df = ztf_lcs_by_id.compute()
+ ztf_lcs_by_id_df
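One possible hardening of the `get_nworkers` helper, offered as a sketch rather than a required change: `os.cpu_count()` can return `None` on some platforms, and an empty input would otherwise yield zero workers. The `max_workers` parameter is a hypothetical addition for capping workers on shared CI runners.

```python
import os


def get_nworkers(items, max_workers=None):
    """Return a worker count between 1 and the number of CPUs.

    Never requests more workers than there are items to process.
    `max_workers` is an optional, hypothetical cap (not in the original helper).
    """
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    limit = cpu_count if max_workers is None else min(cpu_count, max_workers)
    return max(1, min(limit, len(items)))


print(get_nworkers([686103400034440, 686103400106565]))  # at most 2 workers
```

The result can be passed to `Client(n_workers=...)` exactly as in the suggestion above; only the edge cases differ.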
| We save the list of columns interesting to us for later use when opening the catalog with `lsdb`: | ||
| ```{code-cell} ipython3 | ||
| ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs", "lightcurve"] |
I hoped we could save more memory by selecting only the lightcurve columns that actually get used, but I tried this and the savings are insignificant (we need 4 out of the 5 columns, so I suppose that makes sense). Leaving this suggestion in case you want to show that it's possible to load only some of the nested columns. Also fine to ignore this.
Suggested change:
- ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs", "lightcurve"]
+ ztf_lc_columns = ["objectid", "objra", "objdec", "filterid", "nepochs",
+                   "lightcurve.hmjd", "lightcurve.mag", "lightcurve.magerr", "lightcurve.catflags"]
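If the notebook does adopt the dotted sub-column names, the list could be built from its two parts so the nested fields are easy to adjust. This is only a sketch of the list construction; the variable names `base_columns` and `nested_fields` are illustrative, and how the list is consumed is left to the notebook.

```python
# Top-level columns of the Lightcurves catalog used by the notebook.
base_columns = ["objectid", "objra", "objdec", "filterid", "nepochs"]

# Sub-columns of the nested `lightcurve` column that the notebook actually uses.
nested_fields = ["hmjd", "mag", "magerr", "catflags"]

# Dotted names select individual sub-columns of the nested column.
ztf_lc_columns = base_columns + [f"lightcurve.{field}" for field in nested_fields]
print(ztf_lc_columns)
```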
| ## 4. Get Light Curves by Sky Position | ||
| If you have sky coordinates and want all ZTF sources within a given area, use a cone search. | ||
Let's add this admonition (or similar) to describe how ZTF objects are defined. This is the first place where it's directly relevant, but you could move it up to the introduction if you prefer.
Suggested change:
+ :::{important} ZTF objects are defined per (filter, field, quadrant)
+ ZTF objects (i.e., unique object IDs) are defined _per_ (filter, field, quadrant).
+ This means that observations of a single _astrophysical_ object are usually spread out amongst several different _ZTF_ objects.
+ An astrophysical object observed in all three filters (g, r, and i) will be represented by at least 3 ZTF objects, one per filter.
+ The per-filter observations may themselves be separated into additional ZTF objects if the astrophysical object lies near the boundary of a ZTF field and/or quadrant.
+ ZTF's pixel scale is 1"/pixel (see [ZTF Technical Specifications](https://www.ptf.caltech.edu/page/ztf_technical)), so combining all ZTF objects within a 1" cone search may be reasonable for a given astrophysical object.
+ :::
You can choose whether you want to change the actual code below here or not. I think it's fine to leave as-is as long as we provide this admonition. If changing, I would probably reduce search_radius to 1 arcsec, remove the ["nepochs", ">", 50] row filter, and group by filter (ie, band) before plotting the light curves. (FWIW, our Fornax light-curve-collector notebook justifies the 1" cone search by citing Graham et al., 2024. Unfortunately, ADS returns 79 papers for "Graham et al., 2024" and I don't know which one that came from.)
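To make the "1 arcsec cone, then group by filter" idea concrete, here is a minimal plain-Python sketch on toy data. It does not use lsdb's cone search; the helper names, target position, coordinates, and filter IDs are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (inputs in degrees), via the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def group_by_filter(ztf_objects, target_ra, target_dec, radius_arcsec=1.0):
    """Map filter ID -> ZTF object IDs lying within the radius of the target."""
    groups = defaultdict(list)
    for obj in ztf_objects:
        sep = angular_separation_arcsec(target_ra, target_dec, obj["objra"], obj["objdec"])
        if sep <= radius_arcsec:
            groups[obj["filterid"]].append(obj["objectid"])
    return dict(groups)


# Toy ZTF objects around a hypothetical target at (150.0, 2.0) degrees.
arcsec = 1 / 3600
toy_objects = [
    {"objectid": 1, "filterid": 1, "objra": 150.0, "objdec": 2.0 + 0.3 * arcsec},
    {"objectid": 2, "filterid": 2, "objra": 150.0, "objdec": 2.0 - 0.5 * arcsec},
    {"objectid": 3, "filterid": 1, "objra": 150.0, "objdec": 2.0 + 0.8 * arcsec},  # same filter, e.g. another field
    {"objectid": 4, "filterid": 2, "objra": 150.0, "objdec": 2.0 + 5.0 * arcsec},  # outside the cone
]

groups = group_by_filter(toy_objects, target_ra=150.0, target_dec=2.0)
print(groups)
```

Objects 1 and 3 land in the same filter group, which is the case the admonition describes: one astrophysical object split across several ZTF objects in a single filter.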
| for ax, (_, row) in zip(axs, most_variable.iterrows()): | ||
| lc = row['lightcurve'].query("catflags == 0") # to keep only clean epochs | ||
| title = (f"ZTF Object {row['objectid']} ({row['filtercode']} band)\n" |
Maybe we should stick with "filter". Most people will understand that band == filter but newbies may be confused if we don't explain it.
Suggested change:
- title = (f"ZTF Object {row['objectid']} ({row['filtercode']} band)\n"
+ title = (f"ZTF Object {row['objectid']} ({row['filtercode']} filter)\n"
| - Retrieve light curves for specific sources by ZTF object IDs using an index search. | ||
| - Retrieve light curves for sources in a sky region using a cone search. | ||
| - Cross-reference the Objects Table to enrich cone search results with per-source variability statistics. |
I think we should use either "object" or "target" instead of "source".
Suggested change:
- - Retrieve light curves for specific sources by ZTF object IDs using an index search.
- - Retrieve light curves for sources in a sky region using a cone search.
- - Cross-reference the Objects Table to enrich cone search results with per-source variability statistics.
+ - Retrieve light curves for specific objects by ZTF object IDs using an index search.
+ - Retrieve light curves for objects in a sky region using a cone search.
+ - Cross-reference the Objects Table to enrich cone search results with per-object variability statistics.
| ``` | ||
| We'll select a subset of columns useful for characterizing and annotating variable sources: | ||
This is another thing that trips people up, so let's add an admonition. Could move this up to the intro (or elsewhere) if you prefer.
Suggested change:
+ :::{important} `objectid` == `oid`
+ ZTF's object ID column is named `objectid` in Lightcurves and `oid` in Objects Table.
+ Despite this difference, the two columns are the same and can be used to join the catalogs.
+ :::
Fixes IRSA-7768