171 changes: 65 additions & 106 deletions nasdaq/README.md
@@ -1,155 +1,114 @@
# Replaying the NASDAQ order book

This is an example project live-replaying the complete NASDAQ exchange orders from January 30 2020 with CedarDB.
For an overview of the dataset, take a look at [our example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).
This example live-replays the complete NASDAQ order stream from January 30, 2020, with CedarDB. For dataset background, see [the NASDAQ example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).

What's especially noteworthy here is that CedarDB is not only running the **transactional query workload**,
inserting thousands of events every 100 ms, but also the **complex analytical queries** which feed the various
views in the Grafana dashboard. It's an excellent illustration of the power of Hybrid Transactional/Analytical
Processing (HTAP).
![Grafana](./grafana.png)

This example consists of separate applications:
The setup is fully dockerized. The demo stack contains:

1. A parser written in Python that parses NASDAQ's proprietary ITCHv5 protocol into human-readable CSV files.
2. A C++ client connecting to CedarDB and live-replaying all orders.
3. A Grafana Dashboard displaying live analytics (pictured below).
1. `parser`: downloads the NASDAQ ITCH dump and converts it into CSV files.
2. `cedar`: runs CedarDB and stores the parsed data on a Docker volume.
3. `client`: creates the schema, loads reference and pre-market data, and replays the live market stream in 100 ms batches.
4. `grafana`: shows live analytics on top of the replay.
5. `aichat`: optional web UI for natural-language questions over the same database.
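
As a rough illustration of what replaying "in 100 ms batches" means, the sketch below groups a sorted event stream into consecutive 100 ms windows. It is not the actual `client` code, which this README does not show. The function name `batch_events`, the `(timestamp, payload)` tuple layout, and the assumption that timestamps are in nanoseconds are all made up for illustration.

```python
from itertools import groupby

# 100 ms expressed in nanoseconds (assumed timestamp unit).
BATCH_NS = 100 * 1_000_000


def batch_events(events):
    """Group (timestamp_ns, payload) tuples into consecutive 100 ms batches.

    `events` must already be sorted by timestamp, as a replayed stream would be.
    Returns a list of (window_index, [payloads]) pairs.
    """
    batches = []
    for window, group in groupby(events, key=lambda e: e[0] // BATCH_NS):
        batches.append((window, [payload for _, payload in group]))
    return batches


# Hypothetical events: two fall into the first 100 ms window, one into the next.
sample = [(10_000_000, "order A"), (90_000_000, "order B"), (150_000_000, "order C")]
print(batch_events(sample))
```

A real replay client would additionally wait until each window's wall-clock deadline before sending the batch; that pacing is left out here.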

In comparison mode, the stack also starts PostgreSQL and replays the same workload into both databases.

![Grafana](./grafana.png)
## Getting started

In addition to Grafana, you can also issue queries yourself to get insight into the market state.
This guide will show you how to do both using `docker compose`.
Prerequisites:

1. Docker with Compose support.
2. A stable internet connection to pull the required Docker images and download the NASDAQ dataset on first run.

## Getting started
Optional:

This guide assumes you already have a cedardb docker image, i.e. have completed [this guide](https://cedardb.com/docs/getting_started/running_docker_image/) up to step two.
1. A CedarDB license at `db-config/cedar/license.env`. You can sign up for a trial at https://console.cedardb.com/signup.

### 1. Prepare the data
Execute the `prepare.sh` script:
```shell
./prepare.sh
```
It downloads the raw binary package capture that NASDAQ provides, extracts it and transforms it into CSV files.
This downloads about 3.3 GB and writes ~16 GB CSV files.
The license is needed to create the dedicated `grafana` database user and grant the required user permissions cleanly. It also enables database statistics in comparison mode.

If no license is present, `demo.sh` falls back to using the `postgres` admin user for Grafana access because the dedicated `grafana` user cannot be granted the required read permissions.

## Run the demo

You should now have a set of files in the data directory containing the stock exchange events:
Use `demo.sh` as the entrypoint for the stack:

```shell
du -h data/*.csv
./demo.sh start
```

```
5,3G data/cancellations.csv
181M data/cancellationsPreMarket.csv
337M data/executions.csv
2,7M data/executionsPreMarket.csv
7,5M data/marketMakers.csv
9,8G data/orders.csv
279M data/ordersPreMarket.csv
516K data/stocks.csv
```
This starts the normal stack in the background with `docker compose up -d --build`. On the first run, the parser container:

1. downloads the NASDAQ archive, about 3.3 GB compressed,
2. extracts it,
3. parses it into roughly 16 GB of CSV data,
4. stores everything in the Docker volume `data`.

Depending on your connection and machine, the initial download and parsing step can take around 10 to 15 minutes.

After the parser finishes, the client loads the schema and pre-market data, then begins the timed replay. The replay starts 10 minutes after market open, so the initial database state corresponds to 9:40 AM market time. If it has been running for 20 minutes, the database state represents 10:00 AM market time.
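
The mapping from replay runtime to market time can be written down as a small calculation. This is a sketch under two assumptions: the replay clock starts at 9:40 AM, and it advances at real-time speed. The names `REPLAY_START` and `market_time` are hypothetical and not part of the demo.

```python
from datetime import datetime, timedelta

# Assumption: replay begins at 9:40 AM market time on the replayed day,
# i.e. 10 minutes after the 9:30 AM market open.
REPLAY_START = datetime(2020, 1, 30, 9, 40)


def market_time(elapsed_minutes):
    """Return the market time represented after `elapsed_minutes` of replay."""
    return REPLAY_START + timedelta(minutes=elapsed_minutes)


# After 20 minutes of replay the database state corresponds to 10:00 market time.
print(market_time(20).strftime("%H:%M"))
```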

Useful lifecycle commands:

### 2. Run the application
```shell
docker compose build client
docker compose up
./demo.sh stop
./demo.sh clean
./demo.sh pull
```

While the client is running, it replays the live exchange data in 100 ms batches, treating the moment the program was started as 9:30 AM, i.e., the exact instant the market opens.
In the first minute, the client catches up to the live transaction stream and starts inserting many events.
Afterward, you should get batches of a couple of thousand events per 100ms.
So, if you run the client for 30 minutes, the database state will represent the state of the NASDAQ exchange 30 minutes after market open, i.e., 10:00 AM.
`clean` removes the Docker volumes, including the parsed dataset.

## Access the services

You can stop the application via `CTRL+C` followed by `docker compose down`
Grafana is exposed on http://localhost:3000.

### 3. Connect to Grafana
You can now browse to Grafana at http://localhost:3000, log in with username `admin` and password `admin`, and view the NASDAQ dashboard.
Authentication is disabled for the UI, so opening the page is enough. The dashboard is provisioned automatically.

![Grafana Instructions](./grafana_instructions.png)

The AI chat UI is exposed on http://localhost:8080.

### 4. Query the data
Alternatively, you can run your own queries. This requires installation of the `psql` PostgreSQL command line interface.
Note that, for the `Time:` values to appear, you need to either run `\timing on` from within the session or
have a `$HOME/.psqlrc` file containing at least the following line: `\timing on`.
By default, the container starts with:

```shell
PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
OPENROUTER_API_KEY={your_api_key_here}
LLM_MODEL=anthropic/claude-sonnet-4.5
```

Here are some example queries to get you started:
Set `OPENROUTER_API_KEY` before `./demo.sh start` if you want the chat UI to be functional.

```sql
postgres=#
select count(*) from orders;
count
----------
11019259
(1 row)
## Query the data

Time: 5.316 ms
```
The best way to run ad hoc SQL in this setup is through Grafana Explore.

Open http://localhost:3000/explore, select the provisioned PostgreSQL-compatible data source, and run SQL directly there.

Example queries:

```sql
postgres=#
select count(*) from orders;
select avg(price) from executions;
avg
-----------------------------
140.21785151844912886904428
(1 row)

Time: 15.681 ms
```

The following query calculates the new orders created per second averaged over the last 10 seconds.
The following query calculates new orders per second averaged over the last 10 seconds:

```sql
client=#
select count(*) / 10 as new -- averaged over 10 seconds
from orders o
where prevOrder is null -- == new order
and o.timestamp > (select max(e.timestamp) from executions e) - 10::bigint * 1000 * 1000 * 1000; -- averaged over 10 seconds
new
------
8285
(1 row)

Time: 32.514 ms
select count(*) / 10 as new
from orders o
where prevOrder is null
and o.timestamp > (
select max(e.timestamp) from executions e
) - 10::bigint * 1000 * 1000 * 1000;
```
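
The arithmetic in this query can be checked outside the database. The sketch below assumes, as the `10 * 1000 * 1000 * 1000` factor suggests, that timestamps are stored in nanoseconds. The 10-second window is then 10^10 ns, which no longer fits into a 32-bit signed integer, presumably the reason for the `::bigint` cast. The helper `new_orders_per_second` is a hypothetical stand-in for the query and expects the timestamps of new orders only (those with `prevOrder is null`).

```python
INT32_MAX = 2**31 - 1              # largest 32-bit signed integer
NS_PER_SECOND = 1000 * 1000 * 1000

# Mirrors `10::bigint * 1000 * 1000 * 1000` from the query: 10 seconds in ns.
window_ns = 10 * NS_PER_SECOND


def new_orders_per_second(new_order_timestamps_ns, latest_execution_ns):
    """Count new orders inside the last 10 seconds and average per second.

    Uses integer division, analogous to `count(*) / 10` on an integer count.
    """
    cutoff = latest_execution_ns - window_ns
    recent = [t for t in new_order_timestamps_ns if t > cutoff]
    return len(recent) // 10


# Hypothetical data: 25 orders inside the window, 5 older ones outside it.
latest = 100 * NS_PER_SECOND
timestamps = [95 * NS_PER_SECOND] * 25 + [80 * NS_PER_SECOND] * 5
print(window_ns > INT32_MAX, new_orders_per_second(timestamps, latest))
```

Note that the integer division drops the fractional part, so 25 orders in 10 seconds are reported as 2 per second in this sketch.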

You can find some more complex queries in the `sql` subdirectory.

## Load everything
More analytical queries are available in [`sql/`](./sql).

Start the Docker image, mounting the `./data` directory containing the CSV data:
## Comparison mode

```shell
docker run --rm -p 5432:5432 -e CEDAR_PASSWORD=postgres -v ./data:/data --name cedardb cedardb
```
Comparison mode starts CedarDB and PostgreSQL with the same CPU and memory limits, then replays the same workload into both systems.

Connect to CedarDB via the `psql` CLI:
It requires `DB_CPU_LIMIT` and `DB_MEM_LIMIT`:

```shell
PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
DB_CPU_LIMIT=4 DB_MEM_LIMIT=8g ./demo.sh --comparison start
```

Using the `psql` client, run the DDL and then directly copy the CSV data:

```sql
\i client/schema.sql
copy stocks from '/data/stocks.csv' with(format text, delimiter ';', null '', header true);
copy marketmakers from '/data/marketMakers.csv' with(format text, delimiter ';', null '', header true);
copy orders from '/data/ordersPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy orders from '/data/orders.csv' with(format text, delimiter ';', null '', header true);
copy executions from '/data/executionsPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy executions from '/data/executions.csv' with(format text, delimiter ';', null '', header true);
copy cancellations from '/data/cancellationsPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy cancellations from '/data/cancellations.csv' with(format text, delimiter ';', null '', header true);
```

Try running some ad hoc SQL queries.

Please note that this does not maintain the orderbook, which would be maintained by the client.

6 changes: 4 additions & 2 deletions nasdaq/compose.yml
@@ -35,7 +35,9 @@ services:
     depends_on:
       - cedar
     volumes:
-      - ./grafana:/etc/grafana/provisioning
+      - ./grafana/dashboards/automatic.yml:/etc/grafana/provisioning/dashboards/automatic.yml
+      - ./grafana/dashboards/NASDAQ-1732292249298.json:/etc/grafana/provisioning/dashboards/NASDAQ-1732292249298.json
+      - ./grafana/datasources:/etc/grafana/provisioning/datasources
   aichat:
     build:
       context: aichat
@@ -49,7 +51,7 @@ services:
       DB_HOST: cedar
       DB_NAME: postgres
       DB_USER: postgres
-      DB_PASSWORD: postgres
+      DB_PASSWORD: ${ADMIN_PWD}
     ports:
       - 127.0.0.1:8080:8080
       - "[::1]:8080:8080"