Conversation

@JBorrow, this is ongoing work to address the analysis function issue on lightserve. It took some time to understand and resolve the primary key and foreign key requirements of the tables when creating hypertables. Once I've added the analysis functions, I'll request a review.

Cool! So this actually works with TimescaleDB like this?

Yes, it seems to be working. I had to add 'time' to the primary key of the FluxMeasurement table, since that was a requirement for creating hypertables chunked in time, and then carry the changes forward.

Very interesting that it requires time as a primary key... As long as it's OK with a composite primary key including the id field we're fine; otherwise we might need to rethink things.

Yes, I checked that it works.

Very interesting. Keep exploring this direction... Do we need a separate table for each time range? Or does TimescaleDB have some functionality for arbitrary time ranges too?

As far as I have read, we don't need to worry about separate tables for different time ranges. TimescaleDB uses one table and automatically partitions data into time-based chunks. When we query any time range, it automatically finds the relevant chunks.

Ah, I understand, so the chunk_time_interval => INTERVAL '6 months' is more of an optimization thing?

Yes. Best practice is to set chunk_time_interval so that one chunk of data takes up about 25% of RAM. Most examples show 7-14 days as starting points, with TimescaleDB's default being 7 days.
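The 25%-of-RAM rule of thumb above can be turned into a rough back-of-envelope estimate. This is only a sketch with hypothetical ingest figures (rows/day and bytes/row would come from measuring the real flux_measurements table):

```python
# Back-of-envelope chunk sizing: pick chunk_time_interval so one chunk's
# data (rows * bytes/row, including index overhead) fits in ~25% of RAM.
# All input figures below are hypothetical placeholders.

def chunk_interval_days(
    ram_bytes: int,
    rows_per_day: int,
    bytes_per_row: int,
    ram_fraction: float = 0.25,
) -> float:
    """Days of data that fit within the target fraction of RAM."""
    budget = ram_bytes * ram_fraction
    return budget / (rows_per_day * bytes_per_row)

# e.g. 16 GiB RAM, 1M rows/day, ~200 bytes/row with index overhead
days = chunk_interval_days(16 * 1024**3, 1_000_000, 200)
print(f"~{days:.0f} days per chunk")
```

At higher ingest rates the interval shrinks accordingly, which is why the common starting points cluster around 7-14 days.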

Right now I have set the aggregate bucketing to 1 month. This way every row in the table corresponds to a month start, so any query range includes entire months whose buckets start inside that range. For example, asking for 1 Mar–2 Sep returns the September bucket that starts on 1 Sep, and because that bucket covers the whole month, the results run through 30 Sep. If you don't anticipate any issues, we can combine information from the Aggregate Table and Raw Table for accurate results. Alternatively, we can set shorter buckets, but this requires optimisation unless we know what users are comfortable with.
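The bucket-boundary behaviour described above can be sketched with plain dates (function names here are illustrative, not part of the codebase):

```python
from datetime import date

def month_start(d: date) -> date:
    """First day of the month containing d."""
    return d.replace(day=1)

def next_month_start(d: date) -> date:
    """First day of the month after the one containing d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def buckets_covering(start: date, end: date) -> tuple[date, date]:
    """Span of monthly buckets whose start falls inside [start, end].

    Returns (first bucket start, exclusive end): because the last
    bucket covers its whole month, results run to that month's end.
    """
    first = start if start == month_start(start) else next_month_start(start)
    return first, next_month_start(month_start(end))

# Asking for 1 Mar - 2 Sep pulls in the whole September bucket,
# i.e. data through 30 Sep (exclusive end is 1 Oct):
lo, hi = buckets_covering(date(2024, 3, 1), date(2024, 9, 2))
```

This is why combining the Aggregate Table with the Raw Table (to trim the partial months at the edges) gives exact results for arbitrary ranges.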
… database while testing

@JBorrow I've implemented performance tests for the aggregate statistics endpoints to compare continuous aggregates vs raw queries.
Is this the right way to approach the tests?
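One common shape for such a comparison is to time both code paths over repeated runs and compare medians. The workloads below are stand-ins; the real tests would call the continuous-aggregate and raw statistics endpoints:

```python
import statistics
import time

def benchmark(fn, *, repeats: int = 20) -> float:
    """Median wall-clock time of fn() over several runs, in seconds.

    The median is less noisy than the mean for micro-benchmarks.
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Stand-in workloads; in the real tests these would issue the
# continuous-aggregate query and the equivalent raw-table query.
def query_continuous_aggregate():
    sum(range(1_000))

def query_raw():
    sum(range(100_000))

agg_t = benchmark(query_continuous_aggregate)
raw_t = benchmark(query_raw)
print(f"aggregate: {agg_t:.6f}s, raw: {raw_t:.6f}s")
```

Asserting on an absolute speedup factor tends to make such tests flaky in CI; asserting that both paths return identical results, plus reporting timings, is usually more robust.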

In addition to the client and models layers, I have added the storage layer to the codebase. The backend can be specified via Settings in config.py. The client layer has been fully migrated and tested, with the exception of cutout operations, which remain on the SQLAlchemy implementation. Client interaction with the storage layer is storage agnostic. The CLI tools (setup.py, ephemeral.py) and simulations work with the new backend.
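Selecting a backend via settings usually reduces to a small factory keyed on the configured name. This is only a sketch of that pattern; the actual Settings class in config.py and the storage class names may differ:

```python
from dataclasses import dataclass

# Placeholder storage implementations; the real ones would wrap
# a psycopg connection, SQLAlchemy engine, etc.
class PostgresStorage: ...
class SqliteStorage: ...

@dataclass
class Settings:
    """Hypothetical stand-in for the real Settings in config.py."""
    storage_backend: str = "postgres"

def make_storage(settings: Settings):
    """Instantiate the storage implementation named in settings."""
    backends = {
        "postgres": PostgresStorage,
        "sqlite": SqliteStorage,
    }
    try:
        return backends[settings.storage_backend]()
    except KeyError:
        raise ValueError(f"unknown backend: {settings.storage_backend!r}")
```

Keeping the client talking only to the factory's return type is what makes the client layer storage agnostic.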
JBorrow left a comment:

Looking good, some minor comments on the postgres setup.
lightcurvedb/storage/base/schema.py (outdated diff):

```python
SOURCES_TABLE = """
CREATE TABLE IF NOT EXISTS sources (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
```
VARCHAR confers no performance improvement over TEXT and is more limiting.
lightcurvedb/storage/base/schema.py (outdated diff):

```python
    name VARCHAR(50) PRIMARY KEY,
    telescope VARCHAR(100) NOT NULL,
    instrument VARCHAR(100) NOT NULL,
```
```python
async def get_band_data(self, source_id: int, band_name: str) -> LightcurveBandData:
    """
    Get all measurements as arrays using database-side aggregation.
    """
    query = """
        SELECT
            COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) as ids,
            COALESCE(ARRAY_AGG(time ORDER BY time), ARRAY[]::TIMESTAMPTZ[]) as times,
            COALESCE(ARRAY_AGG(ra ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra,
            COALESCE(ARRAY_AGG(dec ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec,
            COALESCE(ARRAY_AGG(ra_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra_uncertainty,
            COALESCE(ARRAY_AGG(dec_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec_uncertainty,
            COALESCE(ARRAY_AGG(i_flux ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_flux,
            COALESCE(ARRAY_AGG(i_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_uncertainty
        FROM flux_measurements
        WHERE source_id = %(source_id)s AND band_name = %(band_name)s
    """

    async with self.conn.cursor(row_factory=dict_row) as cur:
        await cur.execute(query, {"source_id": source_id, "band_name": band_name})
        row = await cur.fetchone()
        return LightcurveBandData(**row)

async def get_time_range(
    self,
    source_id: int,
    band_name: str,
    start_time: datetime,
    end_time: datetime,
) -> LightcurveBandData:
    """
    Get measurements within a time range.
    """
    query = """
        SELECT
            COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) as ids,
            COALESCE(ARRAY_AGG(time ORDER BY time), ARRAY[]::TIMESTAMPTZ[]) as times,
            COALESCE(ARRAY_AGG(ra ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra,
            COALESCE(ARRAY_AGG(dec ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec,
            COALESCE(ARRAY_AGG(ra_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as ra_uncertainty,
            COALESCE(ARRAY_AGG(dec_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as dec_uncertainty,
            COALESCE(ARRAY_AGG(i_flux ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_flux,
            COALESCE(ARRAY_AGG(i_uncertainty ORDER BY time), ARRAY[]::DOUBLE PRECISION[]) as i_uncertainty
        FROM flux_measurements
        WHERE source_id = %(source_id)s
          AND band_name = %(band_name)s
          AND time BETWEEN %(start_time)s AND %(end_time)s
    """

    async with self.conn.cursor(row_factory=dict_row) as cur:
        await cur.execute(query, {
            "source_id": source_id,
            "band_name": band_name,
            "start_time": start_time,
            "end_time": end_time,
        })
        row = await cur.fetchone()
        return LightcurveBandData(**row)
```
These two could be the same function, perhaps with the time range values accepting None as input?
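One way to realise that suggestion is to build the WHERE clause conditionally, so a single function serves both the full-history and time-range cases while the psycopg execution code stays unchanged. A sketch of just the query assembly (column list abbreviated; names mirror the diff above but this is not the merged implementation itself):

```python
from datetime import datetime
from typing import Optional

# Abbreviated version of the SELECT in the diff above; the real query
# aggregates all eight columns the same way.
SELECT_ARRAYS = """
SELECT
    COALESCE(ARRAY_AGG(id ORDER BY time), ARRAY[]::INTEGER[]) as ids,
    COALESCE(ARRAY_AGG(time ORDER BY time), ARRAY[]::TIMESTAMPTZ[]) as times
FROM flux_measurements
WHERE source_id = %(source_id)s AND band_name = %(band_name)s
"""

def band_data_query(
    source_id: int,
    band_name: str,
    start_time: Optional[datetime] = None,
    end_time: Optional[datetime] = None,
) -> tuple[str, dict]:
    """Build one query serving both the all-time and time-range cases."""
    query = SELECT_ARRAYS
    params: dict = {"source_id": source_id, "band_name": band_name}
    if start_time is not None and end_time is not None:
        # Only constrain by time when both bounds are provided.
        query += " AND time BETWEEN %(start_time)s AND %(end_time)s"
        params |= {"start_time": start_time, "end_time": end_time}
    return query, params
```

The caller then keeps a single `cur.execute(*band_data_query(...))`-style path instead of two near-duplicate methods.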
lightcurvedb/storage/base/schema.py (outdated diff):

```python
FLUX_MEASUREMENTS_TABLE = """
CREATE TABLE IF NOT EXISTS flux_measurements (
    id SERIAL PRIMARY KEY,
    band_name VARCHAR(50) NOT NULL REFERENCES bands(name),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    time TIMESTAMPTZ NOT NULL,
    ra DOUBLE PRECISION NOT NULL CHECK (ra >= -180 AND ra <= 180),
    dec DOUBLE PRECISION NOT NULL CHECK (dec >= -90 AND dec <= 90),
    ra_uncertainty DOUBLE PRECISION,
    dec_uncertainty DOUBLE PRECISION,
    i_flux DOUBLE PRECISION NOT NULL,
    i_uncertainty DOUBLE PRECISION,
    extra JSONB
);

CREATE INDEX IF NOT EXISTS idx_flux_source_band_time
    ON flux_measurements (source_id, band_name, time DESC);

CREATE INDEX IF NOT EXISTS idx_flux_time
    ON flux_measurements (time DESC);
"""
```
Could consider partitioning the flux_measurements table by source_id? https://www.postgresql.org/docs/current/ddl-partitioning.html
Addressing simonsobs/lightserve#11