Adding Database side analysis functions #3

Open

diveshjain-phy wants to merge 54 commits into main from database_side_analysis

Conversation

diveshjain-phy (Member Author) commented Aug 27, 2025

@JBorrow, this is ongoing work to address the analysis-function issue on lightserve. It took some time to understand and resolve the primary-key and foreign-key requirements of the tables when creating hypertables. Once I’ve added the analysis functions, I’ll request a review.

JBorrow (Member) commented Aug 27, 2025

Cool! So this actually works with TimescaleDB like this?

diveshjain-phy (Member Author)

Yes, it seems to be working. I had to add 'time' to the primary key of the FluxMeasurement table, since TimescaleDB requires the time column to be part of the primary key when creating hypertables chunked in time, and then carry that change forward through the schema.

JBorrow (Member) commented Aug 27, 2025

Very interesting that it requires time in the primary key... As long as it's OK with a composite primary key that includes the id field we're fine; otherwise we might need to rethink things.

diveshjain-phy (Member Author) commented Aug 27, 2025

Yes, I checked; that works.
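
As a minimal sketch of the arrangement being discussed (the exact DDL is assumed; only the TimescaleDB requirement that the partitioning column appear in any primary key is documented behaviour):

-- Composite primary key that includes the time column, then convert to a hypertable.
-- Table and column names follow the PR; the statements actually used may differ.
CREATE TABLE IF NOT EXISTS flux_measurements (
    id     SERIAL,
    time   TIMESTAMPTZ NOT NULL,
    i_flux DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (id, time)  -- 'time' must be part of any primary key on a hypertable
);

SELECT create_hypertable('flux_measurements', 'time');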

JBorrow (Member) commented Aug 27, 2025

Very interesting. Keep exploring this direction... Do we need a separate table for each time range? Or does TimescaleDB have some functionality for arbitrary time ranges too?

diveshjain-phy (Member Author)

As far as I have read, we don't need to worry about separate tables for different time ranges. TimescaleDB uses one table and automatically partitions data into time-based chunks. When we query any time range, it automatically finds the relevant chunks.
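
To illustrate the chunking behaviour described above, a sketch (the query is only an example; show_chunks is a standard TimescaleDB helper):

-- An ordinary range query against the single hypertable; TimescaleDB prunes to the
-- chunks that overlap the requested interval (chunk exclusion), so no per-range tables.
SELECT time, i_flux
FROM flux_measurements
WHERE source_id = 1
  AND time BETWEEN '2025-03-01' AND '2025-09-02';

-- Inspect which time-based chunks currently back the hypertable.
SELECT show_chunks('flux_measurements');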

JBorrow (Member) commented Aug 27, 2025

Ah, I understand, so the chunk_time_interval => INTERVAL '6 months' is more of an optimization thing?

diveshjain-phy (Member Author) commented Aug 28, 2025

Yes. Best practice is to set chunk_time_interval so that one chunk of data takes up about 25% of RAM. Most examples use 7–14 days as a starting point, with TimescaleDB's default being 7 days.
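
For reference, a sketch of how the chunk interval is set (both calls are standard TimescaleDB; the specific intervals are just the values mentioned here):

-- Set the chunk interval when creating the hypertable (7 days is the default)...
SELECT create_hypertable('flux_measurements', 'time',
                         chunk_time_interval => INTERVAL '7 days');

-- ...or change it later; the new interval applies only to chunks created afterwards.
SELECT set_chunk_time_interval('flux_measurements', INTERVAL '6 months');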

diveshjain-phy (Member Author)

Right now I have set the aggregate bucketing to 1 month. Every row in the aggregate table then corresponds to a month start, so a query range includes every month whose bucket starts inside that range. For example, asking for 1 Mar–2 Sep returns the September bucket that starts on 1 Sep, and because that bucket covers the whole month, the results effectively run through 30 Sep. If you don't anticipate any issues, we can combine information from the aggregate table and the raw table to get exact results at the range edges. Alternatively, we can use shorter buckets, but that requires tuning unless we know what users are comfortable with.
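
A hedged sketch of the kind of monthly continuous aggregate described above (view and column names are illustrative, not necessarily those used in the PR; monthly buckets require a TimescaleDB version where time_bucket supports month intervals):

-- Monthly continuous aggregate over the raw flux measurements (illustrative names).
CREATE MATERIALIZED VIEW IF NOT EXISTS flux_monthly_stats
WITH (timescaledb.continuous) AS
SELECT
    source_id,
    band_name,
    time_bucket(INTERVAL '1 month', time) AS bucket,
    AVG(i_flux) AS mean_flux,
    MIN(i_flux) AS min_flux,
    MAX(i_flux) AS max_flux,
    COUNT(*)    AS n_points
FROM flux_measurements
GROUP BY source_id, band_name, bucket;

-- A range query then matches every month whose bucket start falls inside the range,
-- e.g. 1 Mar - 2 Sep picks up the bucket starting 1 Sep, which covers the whole month.
SELECT * FROM flux_monthly_stats
WHERE source_id = 1 AND band_name = 'f145'
  AND bucket BETWEEN '2025-03-01' AND '2025-09-02';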

diveshjain-phy (Member Author) commented Oct 6, 2025

@JBorrow I've implemented performance tests for the aggregate statistics endpoints to compare continuous aggregates vs raw queries:

  • /analysis/aggregate/{source_id}/{band_name} (with continuous aggregates)
  • /analysis/wo_ca/aggregate/{source_id}/{band_name} (without continuous aggregates)
============================================================
Testing: GET http://localhost:8000/analysis/aggregate/1/f145
============================================================
Mean: 12.94 ms
Std:  7.83 ms
Min:  8.40 ms
Max:  117.82 ms

============================================================
Testing: GET http://localhost:8000/analysis/wo_ca/aggregate/1/f145
============================================================
Mean: 68.66 ms
Std:  56.27 ms
Min:  57.14 ms
Max:  2525.97 ms

Ratio of mean times with aggregate to without aggregate calls: 0.18845746342388867

Is this the right way to approach the tests?

perf_test.py

diveshjain-phy (Member Author)

@JBorrow

In addition to the client and models layers, I have added the storage layer to the codebase. The backend can be specified via Settings in config.py. The client layer has been fully migrated and tested, with the exception of cutout operations, which remain on the SQLAlchemy implementation. Client interaction with the storage layer is storage-agnostic. The CLI tools (setup.py, ephemeral.py) and the simulations work with the new backend.

JBorrow (Member) left a comment

Looking good, some minor comments on the postgres setup.

SOURCES_TABLE = """
CREATE TABLE IF NOT EXISTS sources (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),

JBorrow (Member):

VARCHAR confers no performance improvement over TEXT and is more limiting.

Comment on lines +23 to +25
    name VARCHAR(50) PRIMARY KEY,
    telescope VARCHAR(100) NOT NULL,
    instrument VARCHAR(100) NOT NULL,

JBorrow (Member):

Same comments as above.

Comment on lines +84 to +141
    async def get_band_data(self, source_id: int, band_name: str) -> LightcurveBandData:
        """
        Get all measurements as arrays using database-side aggregation.
        """
        query = """
            SELECT
                COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) as ids,
                COALESCE(ARRAY_AGG(time ORDER BY time), ARRAY[]::TIMESTAMPTZ[]) as times,
                COALESCE(ARRAY_AGG(ra ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra,
                COALESCE(ARRAY_AGG(dec ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec,
                COALESCE(ARRAY_AGG(ra_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra_uncertainty,
                COALESCE(ARRAY_AGG(dec_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec_uncertainty,
                COALESCE(ARRAY_AGG(i_flux ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_flux,
                COALESCE(ARRAY_AGG(i_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_uncertainty
            FROM flux_measurements
            WHERE source_id = %(source_id)s AND band_name = %(band_name)s
        """

        async with self.conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(query, {"source_id": source_id, "band_name": band_name})
            row = await cur.fetchone()
            return LightcurveBandData(**row)

    async def get_time_range(
        self,
        source_id: int,
        band_name: str,
        start_time: datetime,
        end_time: datetime
    ) -> LightcurveBandData:
        """
        Get measurements in a time range.
        """
        query = """
            SELECT
                COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) as ids,
                COALESCE(ARRAY_AGG(time ORDER BY time), ARRAY[]::TIMESTAMPTZ[]) as times,
                COALESCE(ARRAY_AGG(ra ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra,
                COALESCE(ARRAY_AGG(dec ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec,
                COALESCE(ARRAY_AGG(ra_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra_uncertainty,
                COALESCE(ARRAY_AGG(dec_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec_uncertainty,
                COALESCE(ARRAY_AGG(i_flux ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_flux,
                COALESCE(ARRAY_AGG(i_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_uncertainty
            FROM flux_measurements
            WHERE source_id = %(source_id)s
              AND band_name = %(band_name)s
              AND time BETWEEN %(start_time)s AND %(end_time)s
        """

        async with self.conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(query, {
                "source_id": source_id,
                "band_name": band_name,
                "start_time": start_time,
                "end_time": end_time
            })
            row = await cur.fetchone()
            return LightcurveBandData(**row)

JBorrow (Member):

These two could be the same function but with the time range values accepting None as an input maybe?
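
One possible way to fold the two into a single query, sketched here with optional time bounds (this is an assumption about the approach, not how the PR currently does it):

-- NULL start/end means "unbounded"; the Python caller would pass None for both
-- in the "all measurements" case (parameter handling here is assumed).
SELECT
    COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) AS ids,
    COALESCE(ARRAY_AGG(i_flux ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) AS i_flux
FROM flux_measurements
WHERE source_id = %(source_id)s
  AND band_name = %(band_name)s
  AND (%(start_time)s::timestamptz IS NULL OR time >= %(start_time)s)
  AND (%(end_time)s::timestamptz IS NULL OR time <= %(end_time)s);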

Comment on lines +30 to +49
FLUX_MEASUREMENTS_TABLE = """
CREATE TABLE IF NOT EXISTS flux_measurements (
    id SERIAL PRIMARY KEY,
    band_name VARCHAR(50) NOT NULL REFERENCES bands(name),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    time TIMESTAMPTZ NOT NULL,
    ra DOUBLE PRECISION NOT NULL CHECK (ra >= -180 AND ra <= 180),
    dec DOUBLE PRECISION NOT NULL CHECK (dec >= -90 AND dec <= 90),
    ra_uncertainty DOUBLE PRECISION,
    dec_uncertainty DOUBLE PRECISION,
    i_flux DOUBLE PRECISION NOT NULL,
    i_uncertainty DOUBLE PRECISION,
    extra JSONB
);

CREATE INDEX IF NOT EXISTS idx_flux_source_band_time
    ON flux_measurements (source_id, band_name, time DESC);

CREATE INDEX IF NOT EXISTS idx_flux_time
    ON flux_measurements (time DESC);

JBorrow (Member):

Could consider partitioning the flux_measurements table by source_id? https://www.postgresql.org/docs/current/ddl-partitioning.html
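
For reference, a sketch of declarative partitioning by source in plain PostgreSQL (illustrative only; a declaratively partitioned table cannot also be a hypertable, so with TimescaleDB the closer analogue would be adding source_id as a hash space dimension via add_dimension):

-- Plain PostgreSQL list partitioning by source_id (names and columns abbreviated).
CREATE TABLE flux_measurements (
    id        SERIAL,
    source_id INTEGER NOT NULL,
    time      TIMESTAMPTZ NOT NULL,
    i_flux    DOUBLE PRECISION NOT NULL
) PARTITION BY LIST (source_id);

-- One partition per source (or per group of sources), created as sources are added.
CREATE TABLE flux_measurements_src_1
    PARTITION OF flux_measurements FOR VALUES IN (1);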
