
Local banzai setup for running at site #427

Merged
cmccully merged 22 commits into main from banzai-local
Feb 12, 2026

Conversation

@timbeccue
Contributor

@timbeccue commented Aug 22, 2025

Changes in service of #422

This PR introduces the local setup we'll use for site deployments of banzai, sans calibration caching which will be added in a subsequent PR.

Local Banzai Notes

To run:

docker compose -f docker-compose-site.yml --env-file .site-banzai-env up -d --build

This requires an env file called .site-banzai-env that should look like this:

# .site-banzai-env

# Database Configuration
DB_ADDRESS=sqlite:////data/banzai.db    # Path for the docker container, not the host.
CAL_DB_ADDRESS=""                       # This should be the address to the AWS banzai database where we get calibrations
SITE_ID=

# API Configuration
API_ROOT=https://archive-api.lco.global/
AUTH_TOKEN=

# Data Paths
HOST_DATA_DIR=./site_banzai # this maps to /data in the container, and should contain unprocessed data in a subdirectory `raw`
HOST_PROCESSED_DIR=./site_banzai/output # path where processed data will be saved on the host

# Container Networking
FITS_BROKER=rabbitmq
FITS_BROKER_URL=amqp://rabbitmq:5672
FITS_EXCHANGE=fits_files
TASK_HOST=redis://redis:6379/0

# Celery Configuration
CELERY_TASK_QUEUE_NAME=e2e_task_queue
CELERY_LARGE_TASK_QUEUE_NAME=e2e_large_task_queue

# Worker Configuration
BANZAI_WORKER_LOGLEVEL=debug
OMP_NUM_THREADS=2
OPENTSDB_PYTHON_METRICS_TEST_MODE=1

In order to send images to be processed, run:

python queue_images.py <host_data_dir>/raw

The data to be processed should be in the directory ${HOST_DATA_DIR}/raw. The output will be saved in ${HOST_PROCESSED_DIR}.

@timbeccue timbeccue linked an issue Aug 22, 2025 that may be closed by this pull request
6 tasks
@timbeccue
Contributor Author

Is the prior docker-compose.yml file important to preserve? If not, I'll replace it with my version, currently docker-compose.local.yml

- Remove outdated docker-compose.yml
- Rename docker-compose.local.yml to docker-compose-site.yml
- Rename default local banzai directory from local_banzai to site_banzai
- Rename default db name from local-banzai.db to site-banzai.db
- Added line to cache sync daemon to create calibrations_cache directory if needed
@timbeccue timbeccue marked this pull request as ready for review October 28, 2025 23:56
Comment thread .site-banzai-env.default
Comment thread README.rst Outdated
Comment thread README.rst Outdated
Comment thread banzai/context.py Outdated
args_dict = args

# If a separate calibration db address is not provided, fall back to using the primary db address
if 'cal_db_address' not in args_dict or args_dict.get('cal_db_address') is None:
Collaborator

This feels like it should be in the main.py parse args code rather than the context. The context stuff doesn't really care what we put into the object itself.

Comment thread banzai/dbs.py Outdated
Comment thread banzai/dbs.py
Comment thread banzai/main.py
default='sqlite:///banzai-test.db',
help='Database address: Should be in SQLAlchemy form')
parser.add_argument('--calibration-db-address', dest='cal_db_address',
help='Optional separate database address for getting calibration files. Defaults to using the same address as --db-address.')
Collaborator

This is where the cal-db-address should be set. Default the arg to None and then check for None in this function.
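A minimal sketch of the suggested approach, using argparse with the flag names from the diff above (the surrounding parser options in banzai's real main.py are omitted here):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--db-address', dest='db_address',
                        default='sqlite:///banzai-test.db',
                        help='Database address: Should be in SQLAlchemy form')
    # Default to None so we can detect "not provided" and fall back below.
    parser.add_argument('--calibration-db-address', dest='cal_db_address',
                        default=None,
                        help='Optional separate database address for getting '
                             'calibration files. Defaults to --db-address.')
    args = parser.parse_args(argv)
    # If no separate calibration db was given, fall back to the primary db.
    if args.cal_db_address is None:
        args.cal_db_address = args.db_address
    return args
```

This keeps the fallback in one place at argument-parsing time, so downstream code can assume `cal_db_address` is always set.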

Contributor Author

I tried your suggestion of removing the cal_db_address fallback from context.py but that caused issues in the e2e tests because some parts of the code (e2e tests and celery workers) don’t use parse_args to set up the context, and were therefore failing to set the cal_db_address fallback to db_address.

Seems like the solution is either

  1. add the init logic back to context.py
  2. use access patterns like getattr(runtime_context, 'cal_db_address', runtime_context.db_address) everywhere cal_db_address is used (messy)
  3. require setting cal_db_address explicitly (might break existing banzai setups)
  4. get rid of the separate cal_db_address entirely

Any of these fixes is relatively easy to implement but option 4 seems best from an overall complexity standpoint if we are ok with the docker-compose-local.yml setup requiring users to set up their own db from scratch.
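For concreteness, option 2's access pattern would look something like the sketch below (context objects are stand-ins here; this also treats a None attribute the same as a missing one, which the bare getattr form in the list above would not):

```python
from types import SimpleNamespace


def get_cal_db_address(runtime_context):
    # Every call site would need to repeat this fallback, which is
    # why option 2 is described as messy.
    return (getattr(runtime_context, 'cal_db_address', None)
            or runtime_context.db_address)


# A context built without a cal_db_address falls back to db_address.
context = SimpleNamespace(db_address='sqlite:////data/banzai.db')
```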

Contributor Author

I've gone ahead with option 1 (adding init logic back to context.py) just to get tests working again

Contributor Author

Moved the fallback logic from context.py to settings.py

Collaborator

And the tests build a mock context object right?

Comment thread banzai/utils/fits_utils.py Outdated
raise FrameNotAvailableError(f"Frame {frame_id} not found in archive")

# Check if 'url' field exists in the response
if 'url' not in response_data:
Collaborator

What is the error trying to check? This feels too specific and like you are trying to solve something else here.

Contributor Author

I think I encountered some files that were missing the s3 url. Something from when I was working on the old cache setup a while back. Unfortunately I'm fuzzy on the details, but here's an example log with a file that prompted this:

2025-10-22 17:36:36.624 | 2025-10-22 21:36:36.620    ERROR:            sync: Unexpected error downloading lsc0m412-sq35-20240315-bpm-central30x30.fits.fz (frameid: 69366609, type: BPM): 'url' | {"processName": "MainProcess"}
...
2025-10-22 17:36:36.624 |     bytes = buffer.write(requests.get(response.json()['url'], stream=True, timeout=60).content)
2025-10-22 17:36:36.624 | KeyError: 'url'

I don't think the code is hitting this file anymore, and I tried running without this block and there were no issues. So maybe best to remove it?

Collaborator

I think so. If we don't know why it's there, I'm inclined to remove it.

Comment thread docs/example_reduction.ipynb
Comment thread docs/example_reduction.ipynb Outdated
Comment thread queue_images.py Outdated
from kombu import Connection, Exchange


def post_to_processing_queue(filename, path, broker_url, exchange_name, **kwargs):
Collaborator

This feels redundant with the file utils function. What new things does this add?

Contributor Author

This uses the file path rather than the frameid with the intended use case of wanting to process a file that exists on the local disk. Maybe it would be cleaner to modify the file utils function to accept a frameid or path?
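The consolidated shape that the later change notes describe (accepting either a frameid or a path, and validating presence before publishing) could be sketched as below. The function name and message keys here are illustrative; the real helper also publishes the body to RabbitMQ via kombu, which is omitted:

```python
def build_queue_message(frameid=None, path=None):
    """Build a processing-queue message body from either an archive
    frameid or a path to a file on local disk."""
    if frameid is None and path is None:
        # Validate before publishing, rather than sending an empty body.
        raise ValueError('Either frameid or path must be provided')
    body = {}
    if frameid is not None:
        body['frameid'] = frameid
    if path is not None:
        body['path'] = path
    return body
```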

Contributor Author

I am leaning towards keeping post_to_processing_queue separate from post_to_archive_queue in file utils because of the different intended workflows (local disk vs s3), and because this is a small function that's pretty limited in scope and maybe not worth abstracting elsewhere. Let me know if you disagree.

Collaborator

I think if we keep them both we should rename them to cover what they actually do. I think the post to archive queue originally used paths as well. I'm just not a fan of the code redundancy.

Comment thread queue_images.py Outdated
Comment thread queue_images.py Outdated
Collaborator

@cmccully left a comment


Some recommendations for cleanup, but overall looks fine.

@cmccully
Collaborator

You also need to add a change log message and update the pyproject version number.

  - Add changelog entry and bump version to 1.28.0
  - Rename .site-banzai-env to site-banzai-env (remove hidden prefix)
  - Change "Running at Site" to "Running Locally" in README
  - Use SQLAlchemy make_url() instead of string splitting in dbs.py
  - Add argparse to queue_images.py and use named FITS extension
  - Remove cal_db_address fallback from context.py (already in main.py)
  - Fix kernel name in example notebook
@timbeccue
Contributor Author

Added the requested changes. Two of your comments I did not change but replied with details; let me know if there's anything else needed for those to be resolved.

Centralize the DB_ADDRESS and CAL_DB_ADDRESS configuration in settings.py
using environment variables, with CAL_DB_ADDRESS defaulting to DB_ADDRESS
if not set. This simplifies context.py to be a pure immutable container.
@timbeccue timbeccue requested a review from cmccully January 22, 2026 05:36
- Move db_address/cal_db_address configuration from settings.py to main.py
  parse_args, with cal_db_address defaulting to db_address when not set
- Consolidate post_to_archive_queue to accept either frameid or path via
  kwargs, replacing the duplicate post_to_processing_queue in queue_images.py
- Remove defensive url check in fits_utils.py download_from_s3
- Fix README: add banzai_create_local_db entry point, correct command name
  banzai_stack_calibrations -> banzai_make_master_calibrations
Copilot AI left a comment

Pull request overview

Adds a self-contained “BANZAI-at-site” local deployment flow (docker-compose + env defaults + helper script) and introduces configuration for using a separate calibration database, along with a new CLI to create/populate a local DB from a remote calibration DB.

Changes:

  • Added docker-compose-site.yml, site-banzai-env.default, and queue_images.py to support local/site deployments.
  • Introduced CAL_DB_ADDRESS / --calibration-db-address plumbing and a new banzai_create_local_db entrypoint.
  • Updated archive download retry behavior (don’t retry missing frames) and updated docs/versioning/deps accordingly.

Reviewed changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 11 comments.

Summary per file:

  • site-banzai-env.default: Adds default env vars for site deployments (DB/broker/paths).
  • queue_images.py: Helper script to enqueue local FITS files via RabbitMQ.
  • pyproject.toml: Version bump + torch source/marker changes + new console script entrypoint.
  • poetry.lock: Dependency lock updates (notably torch/CUDA-related packages).
  • docs/example_reduction.ipynb: Updates example instrument lookup call.
  • docker-compose.yml: Removes legacy compose configuration.
  • docker-compose-site.yml: New compose stack for redis/rabbitmq/listener/workers and host volume mappings.
  • banzai/utils/fits_utils.py: Adds FrameNotAvailableError + retry filtering + request timeouts.
  • banzai/utils/file_utils.py: Updates RabbitMQ publish helper to support either frameid or path.
  • banzai/tests/utils.py: Extends FakeContext with cal_db_address.
  • banzai/tests/test_end_to_end.py: Updates queue publishing call signature + adds cal_db_address into runtime context.
  • banzai/main.py: Adds env-driven defaults + new --calibration-db-address + new create_local_db CLI.
  • banzai/frames.py: Routes calibration DB writes via cal_db_address.
  • banzai/exceptions.py: Adds FrameNotAvailableError exception.
  • banzai/dbs.py: Adds create_local_db() and helpers to replicate site/instruments from remote DB.
  • banzai/context.py: Minor formatting change.
  • banzai/calibrations.py: Uses cal_db_address when fetching master cal record and when stacking input query.
  • README.rst: Updates docs for new local/site compose workflow + new local DB creation command.
  • CHANGES.md: Adds 1.28.0 release notes.
  • .gitignore: Adds ignores for local/site runtime directories and venvs.
  • .dockerignore: Excludes local/site runtime directories and venvs from docker build context.


Comment thread site-banzai-env.default Outdated
Comment thread docker-compose-site.yml
Comment on lines +7 to +13
rabbitmq:
image: rabbitmq:3.12-management
container_name: banzai-rabbitmq
ports:
- "5672:5672"
- "15672:15672"
restart: unless-stopped
Copilot AI Feb 4, 2026

RabbitMQ is started with default credentials, but FITS_BROKER_URL in the env example is amqp://rabbitmq:5672 (i.e., guest/guest). The default guest account is restricted to localhost in RabbitMQ, so other containers (listener/workers) typically cannot authenticate. Configure an explicit user/pass (e.g., via RABBITMQ_DEFAULT_USER / RABBITMQ_DEFAULT_PASS and optionally a vhost) and require FITS_BROKER_URL to include those credentials.
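A sketch of that suggestion; the user and password values below are placeholders, not anything from this PR:

```yaml
# docker-compose-site.yml (sketch): give RabbitMQ explicit credentials
rabbitmq:
  image: rabbitmq:3.12-management
  environment:
    RABBITMQ_DEFAULT_USER: banzai
    RABBITMQ_DEFAULT_PASS: change-me
```

The env file would then need the matching URL, e.g. FITS_BROKER_URL=amqp://banzai:change-me@rabbitmq:5672.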

Comment thread banzai/utils/fits_utils.py Outdated
Comment thread banzai/calibrations.py Outdated
Comment thread banzai/frames.py Outdated
Comment thread pyproject.toml
Comment on lines 100 to +106

[tool.poetry.dependencies]
# Before:
torch = [
    { markers = "sys_platform == 'darwin'", source = "PyPI"},
    { markers = "sys_platform != 'darwin' and extra == 'cpu' and extra != 'cuda'", source = "pytorch-cpu"},
    { markers = "sys_platform != 'darwin' and extra == 'cuda' and extra != 'cpu'", source = "pytorch-cuda"},
]
# After:
torch = [
    { version = "^2.3", source = "PyPI", markers = "sys_platform=='darwin'" },
    { version = "^2.3", source = "pytorch-cpu", markers = "sys_platform!='darwin' and extra!='cuda'" },
    { version = "^2.3", source = "pytorch-cuda", markers = "sys_platform!='darwin' and extra=='cuda'" }
]
Copilot AI Feb 4, 2026

The torch dependency selection via extra markers looks inconsistent with the resulting poetry.lock (which contains overlapping CPU/CUDA torch entries for the same platforms). One contributing factor is that torch is also listed unconditionally under [project].dependencies, so Poetry may resolve torch outside the conditional [tool.poetry.dependencies] matrix. Consider removing the unconditional torch from [project].dependencies and gating CPU vs CUDA wheels solely via mutually-exclusive markers/extras, then re-generating the lockfile to ensure only one torch variant can match a given environment.

Comment thread docker-compose-site.yml Outdated
Comment thread banzai/utils/file_utils.py
Comment thread README.rst Outdated
Comment thread banzai/main.py Outdated
Collaborator

@jchate6 left a comment

I have looked at this, and nothing really jumps out at me as being obviously problematic, but I am the first to admit that it isn't clear to me that this will have no impact on normal operations.

I'm also not clear on exactly why we are running banzai at site, and how this will affect our ability to maintain tools and address banzai issues when they arise. There are tools and documentation for checking logs in the current banzai-workers, or re-queuing frames that failed to process. This happens a lot. Is there any coverage in those tools or documentation for what happens if these things break at a site? I know I'm a little late to this party, and it's possible all of these questions have been addressed elsewhere.

Comment thread scripts/queue_images.py
for pattern in ['*.fits', '*.fits.fz']:
fits_files.extend(glob.glob(os.path.join(args.directory, pattern)))

print(f'Files to process: {len(fits_files)}')
Collaborator

Should this be logged somewhere rather than just printed? This might be helpful for debugging if it's captured somewhere.

The unpinned `pip install poetry` now pulls 2.3+, which is incompatible
with the 2.1.3-generated lockfile format and breaks docker builds.
- Remove quoted empty strings in site-banzai-env.default to fix Docker env_file handling
- Stop logging auth token in fits_utils.py download_from_s3
- Query individual cal frames from db_address (not cal_db_address) in make_master_calibrations
- Write calibration records to db_address (not cal_db_address) in CalibrationFrame.write
- Restore consistent sqlite:///banzai-test.db default for --db-address in parse_args
- Fix --processed-path to /output to match volume mount in docker-compose-site.yml
- Validate frameid/path presence in post_to_archive_queue before publishing
- Fix README CLI flags to match actual banzai_create_local_db arguments
- Return boolean from dbs.create_local_db to avoid false-positive success log
@markBowman

I'm also not clear on exactly why we are running banzai at site, and how this will affect our ability to maintain tools and address banzai issues when they arise. There are tools and documentation for checking logs in the current banzai-workers, or re-queuing frames that failed to process. This happens a lot. Is there any coverage in those tools or documentation for what happens if these things break at a site? I know I'm a little late to this party, and it's possible all of these questions have been addressed elsewhere.

Great question. We already have several pieces of 'flash reduction' at site. Smart stacking, occultation, and other observing techniques mean we need more image processing capabilities local to the data. It seems to have reached a threshold where using banzai makes more sense than adding more ad-hoc applications at site. That being said, we will definitely have to address SciOps visibility and capability as the project progresses. It is unlikely to be worse than the current myriad of technical-debt-laced Docker containers, 'understood' by Steve and nobody else. In most cases there will be very little opportunity to reprocess data later as the underlying frames are considered volatile - the banzai image is the raw data product to upload to the archive.

@jchate6
Collaborator

jchate6 commented Feb 10, 2026

In most cases there will be very little opportunity to reprocess data later as the underlying frames are considered volatile - the banzai image is the raw data product to upload to the archive.

This seems bad. While the banzai-reduced image is good enough for many purposes, there are still many cases where that reduction fails or isn't what the user requires. Raw images are necessary to recover the data. We get bad flats, or poor solves all the time that have to be re-done either by us or by users. It feels like we are breaking a contract with the user to consider the raw images volatile in "most cases".

@cmccully cmccully merged commit 3e10f83 into main Feb 12, 2026
10 of 13 checks passed
@cmccully cmccully deleted the banzai-local branch February 12, 2026 22:27


Development

Successfully merging this pull request may close these issues.

Initial setup for BANZAI-at-site

5 participants