Commit 40fa46d

Time zone localization (#61)
This commit introduces time zone localization functionality to handle conversion of timezone-naive (TIMESTAMP_NTZ) timestamps to timezone-aware (TIMESTAMP_TZ) timestamps. The changes include new localizer classes, refactoring of time configuration models, updates to test utilities, and bug fixes.

Changes:
- Added TimeZoneLocalizer and TimeZoneLocalizerByColumn classes for localizing tz-naive timestamps to standard time zones
- Refactored time configuration models to include a dtype field and consolidated the IndexTimeRange classes
- Updated test utilities and fixed API inconsistencies (np.concat → np.concatenate)
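As a rough illustration of what this localization does, here is a minimal pandas sketch (not chronify's API; the `EST` zone and variable names are placeholders):

```python
# Sketch only: localize tz-naive timestamps to a standard (fixed-offset) time zone.
# Chronify's new TimeZoneLocalizer classes perform this kind of NTZ -> TZ conversion.
import pandas as pd

naive = pd.Series(pd.date_range("2020-01-01", periods=3, freq="h"))  # TIMESTAMP_NTZ-like
aware = naive.dt.tz_localize("EST")  # TIMESTAMP_TZ-like; "EST" is a fixed UTC-5 offset, no DST
print(aware)
```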

63 files changed

Lines changed: 11625 additions & 0 deletions


.buildinfo

Lines changed: 4 additions & 0 deletions
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 516f3e47e59ddbc79ec112ccfed74fbf
tags: 645f666f9bcd5a90fca523b33c5a78b7

.nojekyll

Whitespace-only changes.

_sources/explanation/index.md.txt

Lines changed: 11 additions & 0 deletions
```{eval-rst}
.. _explanation-page:
```

# Explanation

```{eval-rst}
.. toctree::
   :maxdepth: 2
   :caption: Contents:
```
Lines changed: 9 additions & 0 deletions
# Getting Started

```{eval-rst}
.. toctree::
   :maxdepth: 2

   installation
   quick_start
```
Lines changed: 47 additions & 0 deletions
```{eval-rst}
.. _installation:
```

# Installation

1. Install Python 3.11 or later.

2. Create a Python 3.11+ virtual environment. This example uses the ``venv`` module in the standard
   library to create a virtual environment in the current directory. You may prefer a single
   `python-envs` directory in your home directory instead of the current directory. You may also
   prefer ``conda`` or ``mamba``.

```{eval-rst}
.. code-block:: console

    $ python -m venv env
```

3. Activate the virtual environment.

```{eval-rst}
.. code-block:: console

    $ source env/bin/activate
```

Whenever you are done using chronify, you can deactivate the environment by running ``deactivate``.

4. Install the Python package `chronify`.

To use DuckDB or SQLite as the backend:

```{eval-rst}
.. code-block:: console

    $ pip install chronify
```

To use Apache Spark via the Apache Thrift Server as the backend, you must install pyhive.
This command will install the necessary dependencies:

```{eval-rst}
.. code-block:: console

    $ pip install "chronify[spark]"
```
Lines changed: 43 additions & 0 deletions
# Quick Start

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from chronify import DatetimeRange, Store, TableSchema

store = Store.create_file_db(file_path="time_series.db")
resolution = timedelta(hours=1)
time_range = pd.date_range("2020-01-01", "2020-12-31 23:00:00", freq=resolution)
store.ingest_tables(
    (
        pd.DataFrame({"timestamp": time_range, "value": np.random.random(8784), "id": 1}),
        pd.DataFrame({"timestamp": time_range, "value": np.random.random(8784), "id": 2}),
    ),
    TableSchema(
        name="devices",
        value_column="value",
        time_config=DatetimeRange(
            time_column="timestamp",
            start=datetime(2020, 1, 1, 0),
            length=8784,
            resolution=timedelta(hours=1),
        ),
        time_array_id_columns=["id"],
    )
)
query = "SELECT timestamp, value FROM devices WHERE id = ?"
df = store.read_query("devices", query, params=(2,))
df.head()
```

```
            timestamp     value  id
0 2020-01-01 00:00:00  0.594748   2
1 2020-01-01 01:00:00  0.608295   2
2 2020-01-01 02:00:00  0.297535   2
3 2020-01-01 03:00:00  0.870238   2
4 2020-01-01 04:00:00  0.376144   2
```

_sources/how_tos/index.md.txt

Lines changed: 15 additions & 0 deletions
```{eval-rst}
.. _how-tos-page:
```

# How Tos

```{eval-rst}
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   getting_started/index
   ingest_multiple_tables
   map_time_config
   spark_backend
```
Lines changed: 72 additions & 0 deletions
# How to Ingest Multiple Tables Efficiently

There are a few important considerations when ingesting many tables:

- Use one database connection.
- Avoid loading all tables into memory at once, if possible.
- Ensure additions are atomic. If anything fails, the final state should be the same as the initial
  state.

**Setup**

The input data are in CSV files. Each file contains a timestamp column and one value column per
device.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from chronify import DatetimeRange, Store, TableSchema, CsvTableSchema
# Note: ColumnDType, DateTime, and Double (used below) also need to be imported.

store = Store.create_in_memory_db()
resolution = timedelta(hours=1)
time_config = DatetimeRange(
    time_column="timestamp",
    start=datetime(2020, 1, 1, 0),
    length=8784,
    resolution=timedelta(hours=1),
)
src_schema = CsvTableSchema(
    time_config=time_config,
    column_dtypes=[
        ColumnDType(name="timestamp", dtype=DateTime(timezone=False)),
        ColumnDType(name="device1", dtype=Double()),
        ColumnDType(name="device2", dtype=Double()),
        ColumnDType(name="device3", dtype=Double()),
    ],
    value_columns=["device1", "device2", "device3"],
    pivoted_dimension_name="device",
)
dst_schema = TableSchema(
    name="devices",
    value_column="value",
    time_array_id_columns=["id"],
)
```

## Automated through chronify

Chronify will manage the database connection and errors.

```python
store.ingest_from_csvs(
    src_schema,
    dst_schema,
    (
        "/path/to/file1.csv",
        "/path/to/file2.csv",
        "/path/to/file3.csv",
    ),
)
```

## Self-Managed

Open one connection to the database for the duration of your additions. Handle errors.

```python
with store.engine.connect() as conn:
    try:
        store.ingest_from_csv(src_schema, dst_schema, "/path/to/file1.csv")
        store.ingest_from_csv(src_schema, dst_schema, "/path/to/file2.csv")
        store.ingest_from_csv(src_schema, dst_schema, "/path/to/file3.csv")
    except Exception:
        # Roll back any partial additions so the database returns to its initial state.
        conn.rollback()
```
Lines changed: 90 additions & 0 deletions
# How to Map Time

This recipe demonstrates how to map a table's time configuration from one type to another.

**Source table**: data is stored in representative time, with one week of data per month, by hour,
for one year.

**Destination table**: data is stored with `datetime` timestamps for each hour of the year.

**Workflow**:
- Add the source table to the database.
- Call `Store.map_table_time_config()`.
- Chronify adds the destination table to the database.

This example creates a representative time table used in chronify's tests.

1. Ingest the source data.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

from chronify import (
    DatetimeRange,
    RepresentativePeriodFormat,
    RepresentativePeriodTimeNTZ,
    Store,
    CsvTableSchema,
    TableSchema,
)

src_table_name = "ev_charging"
dst_table_name = "ev_charging_datetime"
hours_per_year = 12 * 7 * 24
num_time_arrays = 3
df = pd.DataFrame({
    "id": np.concatenate([np.repeat(i, hours_per_year) for i in range(1, 1 + num_time_arrays)]),
    "month": np.tile(np.repeat(range(1, 13), 7 * 24), num_time_arrays),
    "day_of_week": np.tile(np.tile(np.repeat(range(7), 24), 12), num_time_arrays),
    "hour": np.tile(np.tile(range(24), 12 * 7), num_time_arrays),
    "value": np.random.random(hours_per_year * num_time_arrays),
})
schema = TableSchema(
    name=src_table_name,
    value_column="value",
    time_config=RepresentativePeriodTimeNTZ(
        time_format=RepresentativePeriodFormat.ONE_WEEK_PER_MONTH_BY_HOUR,
    ),
    time_array_id_columns=["id"],
)
store = Store.create_in_memory_db()
store.ingest_table(df, schema)
store.read_query(src_table_name, f"SELECT * FROM {src_table_name} LIMIT 5").head()
```

```
   id  month  day_of_week  hour     value
0   1      1            0     0  0.578496
1   1      1            0     1  0.092271
2   1      1            0     2  0.111521
3   1      1            0     3  0.671668
4   1      1            0     4  0.782365
```

2. Map the table's time to datetime.

```python
dst_schema = TableSchema(
    name=dst_table_name,
    value_column="value",
    time_array_id_columns=["id"],
    time_config=DatetimeRange(
        time_column="timestamp",
        start=datetime(2020, 1, 1, 0),
        length=8784,
        resolution=timedelta(hours=1),
    )
)
store.map_table_time_config(src_table_name, dst_schema)
store.read_query(dst_table_name, f"SELECT * FROM {dst_table_name} LIMIT 5").head()
```

```
   id     value           timestamp
0   3  0.006213 2020-01-01 00:00:00
1   3  0.865765 2020-01-01 01:00:00
2   3  0.187256 2020-01-01 02:00:00
3   3  0.336157 2020-01-01 03:00:00
4   3  0.582281 2020-01-01 04:00:00
```
Lines changed: 96 additions & 0 deletions
# Apache Spark Backend

Download Spark from https://spark.apache.org/downloads.html and install it. Spark provides startup
scripts for UNIX operating systems (not Windows).

## Install chronify with Spark support
```
$ pip install "chronify[spark]"
```

## Installation on a development computer
Installation can be as simple as
```
$ tar -xzf spark-4.0.1-bin-hadoop3.tgz
$ export SPARK_HOME=$(pwd)/spark-4.0.1-bin-hadoop3
```

Start a Thrift server. This allows JDBC clients to send SQL queries to an in-process Spark cluster
running in local mode.
```
$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077
```

The URL to connect to this server is `hive://localhost:10000/default`.

## Installation on an HPC
The chronify development team uses these
[scripts](https://github.com/NREL/HPC/tree/master/applications/spark) to run Spark on NREL's HPC.

## Chronify Usage
This example creates a chronify Store with Spark as the backend and then adds a view to a Parquet
file. Chronify will run its normal time checks.

First, create the Parquet file and chronify schema.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from chronify import DatetimeRange, Store, TableSchema, CsvTableSchema

initial_time = datetime(2020, 1, 1)
end_time = datetime(2020, 12, 31, 23)
resolution = timedelta(hours=1)
timestamps = pd.date_range(initial_time, end_time, freq=resolution, unit="us")
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(
        {
            "timestamp": timestamps,
            "id": i,
            "value": np.random.random(len(timestamps)),
        }
    )
    dfs.append(df)
df = pd.concat(dfs)
df.to_parquet("data.parquet", index=False)
schema = TableSchema(
    name="devices",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)
```

```python
from chronify import Store

store = Store.create_new_hive_store("hive://localhost:10000/default")
store.create_view_from_parquet("data.parquet")
```

Verify the data:
```python
store.read_table(schema.name).head()
```
```
            timestamp  id     value
0 2020-01-01 00:00:00   1  0.785399
1 2020-01-01 01:00:00   1  0.102756
2 2020-01-01 02:00:00   1  0.178587
3 2020-01-01 03:00:00   1  0.326194
4 2020-01-01 04:00:00   1  0.994851
```

## Time configuration mapping
The primary use case for Spark is to map datasets that are larger than can be processed by DuckDB
on one computer. In such a workflow a user would call
```python
store.map_table_time_config(src_table_name, dst_schema, output_file="mapped_data.parquet")
```
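
The mapped output can then be inspected with any Parquet reader. A minimal sketch, assuming the call above wrote `mapped_data.parquet`:

```python
import pandas as pd

# Sketch: read back the Parquet file written by map_table_time_config above.
pd.read_parquet("mapped_data.parquet").head()
```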
