From 1a0aaa057497229b56832410e7746176134d213e Mon Sep 17 00:00:00 2001
From: Victor Pacyna
Date: Wed, 20 May 2026 11:02:30 +0200
Subject: [PATCH] update README.md

---
 nasdaq/README.md   | 171 +++++++++++++++++----------------------------
 nasdaq/compose.yml |   6 +-
 2 files changed, 69 insertions(+), 108 deletions(-)

diff --git a/nasdaq/README.md b/nasdaq/README.md
index 0108ef2..0e9cf20 100644
--- a/nasdaq/README.md
+++ b/nasdaq/README.md
@@ -1,155 +1,114 @@
 # Replaying the NASDAQ order book
 
-This is an example project live-replaying the complete NASDAQ exchange orders from January 30 2020 with CedarDB.
-For an overview of the dataset, take a look at [our example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).
+This example live-replays the complete NASDAQ order stream from January 30, 2020, with CedarDB. For dataset background, see [the NASDAQ example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).
 
-What's especially noteworthy here is that CedarDB is not only running the **transactional query workload**,
-inserting thousands of events every 100 ms, but also the **complex analytical queries** which feed the various
-views in the Grafana dashboard. It's an excellent illustration of the power of Hybrid Transactional/Analytical
-Processing (HTAP).
+![Grafana](./grafana.png)
 
-This example consists of separate applications:
+The setup is fully dockerized. The demo stack contains:
 
-1. A parser written in Python that parses NASDAQ's proprietary ITCHv5 protocol into human-readable CSV files.
-2. A C++ client connecting to CedarDB and live-replaying all orders.
-3. A Grafana Dashboard displaying live analytics (pictured below).
+1. `parser`: downloads the NASDAQ ITCH dump and converts it into CSV files.
+2. `cedar`: runs CedarDB and stores the parsed data on a Docker volume.
+3. `client`: creates the schema, loads reference and pre-market data, and replays the live market stream in 100 ms batches.
+4. `grafana`: shows live analytics on top of the replay.
+5. `aichat`: optional web UI for natural-language questions over the same database.
 
+In comparison mode, the stack also starts PostgreSQL and replays the same workload into both databases.
 
-![Grafana](./grafana.png)
+## Getting started
 
-In addition to Grafana, you can also issue queries yourself to get insight into the market state.
-This guide will show you how to do both using `docker compose`.
+Prerequisites:
 
+1. Docker with Compose support.
+2. A stable internet connection to pull the required Docker images and download the NASDAQ dataset on first run.
 
-## Getting started
+Optional:
 
-This guide assumes you already have a cedardb docker image, i.e. have completed [this guide](https://cedardb.com/docs/getting_started/running_docker_image/) up to step two.
+1. A CedarDB license at `db-config/cedar/license.env`. You can sign up for a trial at https://console.cedardb.com/signup.
 
-### 1. Prepare the data
-Execute the `prepare.sh` script:
-```shell
-./prepare.sh
-```
-It downloads the raw binary package capture that NASDAQ provides, extracts it and transforms it into CSV files.
-This downloads about 3.3 GB and writes ~16 GB CSV files.
+The license is needed to create the dedicated `grafana` database user and grant the required user permissions cleanly. It also enables database statistics in comparison mode.
+
+If no license is present, `demo.sh` falls back to using the `postgres` admin user for Grafana access because the dedicated `grafana` user cannot be granted the required read permissions.
+
+## Run the demo
 
-You should now have a set of files in the data directory containing the stock exchange events:
+Use `demo.sh` as the entrypoint for the stack:
 
 ```shell
-du -h data/*.csv
+./demo.sh start
 ```
-```
-5,3G data/cancellations.csv
-181M data/cancellationsPreMarket.csv
-337M data/executions.csv
-2,7M data/executionsPreMarket.csv
-7,5M data/marketMakers.csv
-9,8G data/orders.csv
-279M data/ordersPreMarket.csv
-516K data/stocks.csv
-```
 
+This starts the normal stack in the background with `docker compose up -d --build`. On the first run, the parser container:
+
+1. downloads the NASDAQ archive, about 3.3 GB compressed,
+2. extracts it,
+3. parses it into roughly 16 GB of CSV data,
+4. stores everything in the Docker volume `data`.
+
+Depending on your connection and machine, the initial download and parsing step can take around 10 to 15 minutes.
+
+After the parser finishes, the client loads the schema and pre-market data, then begins the timed replay. The replay starts 10 minutes after market open, so the initial database state corresponds to 9:40 AM market time. If it has been running for 20 minutes, the database state represents 10:00 AM market time.
+
+Useful lifecycle commands:
 
-### 2. Run the application
 ```shell
-docker compose build client
-docker compose up
+./demo.sh stop
+./demo.sh clean
+./demo.sh pull
 ```
 
-While the client is running, it replays the live exchange data in 100ms batches, treating the point in time the program was started as 9:30 AM, i.e. the exact instance the market opens.
-In the first minute, the client catches up to the live transaction stream and starts inserting many events.
-Afterward, you should get batches of a couple of thousand events per 100ms.
-So, if you run the client for 30 minutes, the database state will represent the state of the NASDAQ exchange 30 minutes after market open, i.e., 10:00 AM.
+`clean` removes the Docker volumes, including the parsed dataset.
 
+## Access the services
 
-You can stop the application via `CTRL+C` followed by `docker compose down`
+Grafana is exposed on http://localhost:3000.
 
-### 3. Connect to Grafana
-You can now browse to Grafana at http://localhost:3000, log in with username `admin` and password `admin`, and view the NASDAQ dashboard.
+Authentication is disabled for the UI, so opening the page is enough. The dashboard is provisioned automatically.
 
 ![Grafana Instructions](./grafana_instructions.png)
 
+The AI chat UI is exposed on http://localhost:8080.
 
-### 4. Query the data
-Alternatively, you can run your own queries. This requires installation of the `psql` PostgreSQL command line interface.
-Note that, for the `Time:` values to appear, you need to either run `\timing on` from within the session or
-have a `$HOME/.psqlrc` file containing at least the following line: `\timing on`.
+By default, the container starts with:
 
 ```shell
-PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
+OPENROUTER_API_KEY={your_api_key_here}
+LLM_MODEL=anthropic/claude-sonnet-4.5
 ```
 
-Here are some example queries to get you started:
+Set `OPENROUTER_API_KEY` before `./demo.sh start` if you want the chat UI to be functional.
 
-```sql
-postgres=#
-select count(*) from orders;
- count
-----------
- 11019259
-(1 row)
+## Query the data
 
-Time: 5.316 ms
-```
+The best way to run ad hoc SQL in this setup is through Grafana Explore.
+
+Open http://localhost:3000/explore, select the provisioned PostgreSQL-compatible data source, and run SQL directly there.
+
+Example queries:
 
 ```sql
-postgres=#
+select count(*) from orders;
 select avg(price) from executions;
- avg
------------------------------
- 140.21785151844912886904428
-(1 row)
-
-Time: 15.681 ms
 ```
 
-The following query calculates the new orders created per second averaged over the last 10 seconds.
+The following query calculates new orders per second averaged over the last 10 seconds:
 
 ```sql
-client=#
-select count(*) / 10 as new -- averaged over 10 seconds
-from orders o
-where prevOrder is null -- == new order
-and o.timestamp > (select max(e.timestamp) from executions e) - 10::bigint * 1000 * 1000 * 1000; -- averaged over 10 seconds
- new
-------
- 8285
-(1 row)
-
-Time: 32.514 ms
+select count(*) / 10 as new
+from orders o
+where prevOrder is null
+  and o.timestamp > (
+    select max(e.timestamp) from executions e
+  ) - 10::bigint * 1000 * 1000 * 1000;
 ```
 
-You can find some more complex queries in the `sql` subdirectory.
-
-## Load everything
+More analytical queries are available in [`sql/`](./sql).
 
-Start the Docker image, mounting the `./data` directory containing the CSV data:
+## Comparison mode
 
-```shell
-docker run --rm -p 5432:5432 -e CEDAR_PASSWORD=postgres -v ./data:/data --name cedardb cedardb
-```
+Comparison mode starts CedarDB and PostgreSQL with the same CPU and memory limits, then replays the same workload into both systems.
 
-Connect to CedarDB via the `psql` CLI:
+It requires `DB_CPU_LIMIT` and `DB_MEM_LIMIT`:
 
 ```shell
-PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
+DB_CPU_LIMIT=4 DB_MEM_LIMIT=8g ./demo.sh --comparison start
 ```
-
-Using the `psql` client, run the DDL and then directly copy the CSV data:
-
-```sql
-\i client/schema.sql
-copy stocks from '/data/stocks.csv' with(format text, delimiter ';', null '', header true);
-copy marketmakers from '/data/marketMakers.csv' with(format text, delimiter ';', null '', header true);
-copy orders from '/data/ordersPreMarket.csv' with(format text, delimiter ';', null '', header true);
-copy orders from '/data/orders.csv' with(format text, delimiter ';', null '', header true);
-copy executions from '/data/executionsPreMarket.csv' with(format text, delimiter ';', null '', header true);
-copy executions from '/data/executions.csv' with(format text, delimiter ';', null '', header true);
-copy cancellations from '/data/cancellationsPreMarket.csv' with(format text, delimiter ';', null '', header true);
-copy cancellations from '/data/cancellations.csv' with(format text, delimiter ';', null '', header true);
-```
-
-Try running some ad hoc SQL queries.
-
-Please note that this does not maintain the orderbook, which would be maintained by the client.
-
diff --git a/nasdaq/compose.yml b/nasdaq/compose.yml
index 61ff987..410a8ec 100644
--- a/nasdaq/compose.yml
+++ b/nasdaq/compose.yml
@@ -35,7 +35,9 @@ services:
     depends_on:
       - cedar
     volumes:
-      - ./grafana:/etc/grafana/provisioning
+      - ./grafana/dashboards/automatic.yml:/etc/grafana/provisioning/dashboards/automatic.yml
+      - ./grafana/dashboards/NASDAQ-1732292249298.json:/etc/grafana/provisioning/dashboards/NASDAQ-1732292249298.json
+      - ./grafana/datasources:/etc/grafana/provisioning/datasources
   aichat:
     build:
       context: aichat
@@ -49,7 +51,7 @@ services:
       DB_HOST: cedar
       DB_NAME: postgres
       DB_USER: postgres
-      DB_PASSWORD: postgres
+      DB_PASSWORD: ${ADMIN_PWD}
     ports:
       - 127.0.0.1:8080:8080
       - "[::1]:8080:8080"
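
Note on the `DB_PASSWORD: ${ADMIN_PWD}` change: Compose interpolates `${ADMIN_PWD}` from the shell environment or an `.env` file, and an unset variable is replaced by an empty string with only a warning. The patch does not show where `ADMIN_PWD` is defined, so the following is a minimal sketch under the assumption that someone runs `docker compose` directly rather than through `./demo.sh`. `require_env` is a hypothetical helper and is not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): fail early when a variable
# that compose.yml interpolates, such as ADMIN_PWD, is unset or empty.
require_env() {
    # $1 is the variable name; eval reads its current value indirectly.
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "error: $1 is not set" >&2
        return 1
    fi
}

# Usage sketch, assuming Compose is invoked by hand instead of ./demo.sh:
#   require_env ADMIN_PWD && docker compose up -d
```

Checking in the shell keeps `compose.yml` unchanged. If `demo.sh` already exports `ADMIN_PWD`, the check is redundant for the normal entrypoint.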