This folder contains a script and config to extract unique error signatures from Observe log datasets across multiple services and regions. Output is printed to the terminal and written to error_report.txt (IST timestamps, counts, error messages, and deep links to the Observe log explorer).
New here? → See Setup (step-by-step) to run locally, or Running with Docker to run in a container.
| Item | Description |
|---|---|
| extract_errors.py | Main script: calls Observe API, runs OPAL pipelines, prints/writes results. |
| env.sample | Sample environment variables. Copy to .env and set values (do not commit real keys). |
| config/services.sample.json | List of services (name, workspace_id, dataset_id, pipeline_file) for multi-service runs. |
| pipelines/ | OPAL pipeline files (one per service or shared). Use {{REGION}} for cluster filter; script replaces it at runtime. |
| error_report.txt | Written on each run: table(s) of unique errors (and links). |
| app.py | Flask web app: dashboard check, hostname lookup, Fix (single-error analysis), and Send to Slack (formats report via Gemini and POSTs to webhook). |
| test.sh | Runs Cursor agent to analyze errors and suggest fixes. Used by the Fix button or manually with --error-file. |
| docs/RUNBOOK.md | Known-error runbook: maps common errors (e.g. DeploymentControllerRMQ) to root causes and fixes. |
- Launch Management
- Launch Management Background Jobs Service
- Launch Logs service
- Launch Logs bg service
- Launch telemetry service
- Launch logs-bg-exporter-service
- Launch Nginx service
- Launch Deployment Agent
Each entry can override workspace_id, dataset_id, and pipeline_file. Pipeline files live under pipelines/ and must output columns: latest_timestamp, total_occurrences, error_msg, context.
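For reference, a services config entry might look like the following (the IDs and filename are taken from examples elsewhere in this README; check config/services.sample.json for the authoritative schema):

```json
[
  {
    "name": "Launch Nginx service",
    "workspace_id": "41096433",
    "dataset_id": "41250854",
    "pipeline_file": "pipelines/launch_nginx_errors.opal"
  }
]
```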
Observe/
├── app.py # Flask web app (dashboard check, Fix, Slack)
├── extract_errors.py # Error extraction script
├── test.sh # Fix flow: runs Cursor agent on errors
├── config/
│ └── services.sample.json # Service definitions for multi-service runs
├── docs/
│ └── RUNBOOK.md # Known-error runbook
├── output/ # Generated files (gitignored)
│ ├── error_report.txt # Full report from extract_errors
│ ├── error_to_fix.txt # Single error for Fix button
│ └── agent_analysis.md # Cursor agent analysis output
├── pipelines/ # OPAL pipeline files per service
├── static/
│ └── index.html # Web UI
├── env.sample # Environment template (copy to .env)
├── requirements.txt
├── Dockerfile # Web app
├── Dockerfile.cli # CLI (extract_errors)
└── README.md
- Open your Observe workspace in the browser, e.g. `https://143110822295.eu-1.observeinc.com/workspace/41096433/home?tab=Favorites`.
- The Customer ID is the first segment after `https://`, i.e. the subdomain before `.observeinc.com`. From `https://143110822295.eu-1.observeinc.com/workspace/...`, the Customer ID is `143110822295`.
- Set this in .env as OBSERVE_CUSTOMER_ID.
- Go to the API tokens page in your Observe instance: `https://143110822295.eu-1.observeinc.com/settings/my-api-tokens` (replace `143110822295` and `eu-1` with your own customer ID and cluster if different).
- Create a new API token (or use an existing one). Copy the token value once; it may not be shown again.
- Set it in your environment or `.env` as OBSERVE_API_KEY (see Environment variables).
The script sends: Authorization: Bearer <OBSERVE_CUSTOMER_ID> <OBSERVE_API_KEY>.
Ensure the token’s user has dataset:view (or equivalent) on the datasets you query.
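For illustration, building that header in Python looks like this (a hypothetical helper — extract_errors.py constructs the header internally):

```python
import os

def observe_headers() -> dict:
    """Build the Authorization header sent to the Observe API:
    Bearer <OBSERVE_CUSTOMER_ID> <OBSERVE_API_KEY>."""
    customer_id = os.environ["OBSERVE_CUSTOMER_ID"]
    api_key = os.environ["OBSERVE_API_KEY"]
    return {"Authorization": f"Bearer {customer_id} {api_key}"}

# Example values for demonstration only
os.environ.setdefault("OBSERVE_CUSTOMER_ID", "143110822295")
os.environ.setdefault("OBSERVE_API_KEY", "example-token")
print(observe_headers()["Authorization"])
```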
Follow this flow to get running from scratch.
- Python 3 (3.8+ recommended)
- Access to your Observe instance (Customer ID and API token)
Clone or open the repo and go to the project root (the folder that contains the Observe directory):

```bash
cd /path/to/Observe-automation
```

From the project root (where requirements.txt lives):

```bash
pip install -r requirements.txt
```

Or from inside Observe/:

```bash
cd Observe
pip install -r ../requirements.txt
```

(Optional) Use a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

- Go into the Observe folder (if not already there): `cd Observe`
- Copy the sample env file and edit it with your values:

```bash
cp env.sample .env
```
- In .env, set at least:
  - OBSERVE_CUSTOMER_ID – from your Observe URL (see How to get your Customer ID and API token)
  - OBSERVE_API_KEY – from Observe → Settings → My API tokens
Optionally set OBSERVE_CLUSTER, OBSERVE_WORKSPACE_ID, OBSERVE_DATASET_ID, and REGION as needed.
Do not commit .env (it contains secrets).
From the Observe folder, load your env and run the script:
```bash
# Load environment variables (choose one)
export $(grep -v '^#' .env | xargs)
# Or: set -a && source .env && set +a

# Single service (default workspace/dataset from .env)
python3 extract_errors.py

# All services from config/services.sample.json
python3 extract_errors.py --all-services

# All services × all regions (full report)
python3 extract_errors.py --auto
```

Results are printed to the terminal and written to error_report.txt in the same folder.
A simple frontend lets you set env vars and run the same checks from the browser:
```bash
cd Observe
pip install -r requirements.txt  # includes Flask
python3 app.py
```

Open http://localhost:5000 (or http://localhost:5001 if 5000 is in use). Enter OBSERVE_CUSTOMER_ID and OBSERVE_API_KEY (required), and optionally expand and set cluster, workspace, dataset, region, and time range. Choose a run mode (Single service, All services, All regions, or Auto) and click Run dashboard check. The report appears on the page; you can copy or download it.
After running a dashboard check, you can format the report and send it to Slack:
- Set SLACK_WEBHOOK_URL in .env (e.g. a Slack Incoming Webhook URL or Contentstack Automations API URL).
- Set GEMINI_API_KEY in .env — the app uses Gemini to format the report for Slack. Get a free key at Google AI Studio.
- Run a dashboard check, then click Send me a Slack.
- Optionally enter a Channel ID (e.g. C1234567890 or D086ZCDT6B0) in the UI to override the webhook’s default channel.
The formatted message is POSTed to your webhook as JSON (text, blocks, and optionally channel). If no webhook is configured, the formatted message is copied to your clipboard instead.
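A minimal sketch of assembling that payload (illustrative only — the app's actual message content is formatted by Gemini first):

```python
import json
from typing import Optional

def build_slack_payload(text: str, blocks: list, channel: Optional[str] = None) -> dict:
    """Assemble the JSON body POSTed to the webhook: text, blocks,
    and optionally channel (overrides the webhook's default channel)."""
    payload = {"text": text, "blocks": blocks}
    if channel:
        payload["channel"] = channel
    return payload

body = build_slack_payload(
    "Daily error report",
    [{"type": "section", "text": {"type": "mrkdwn", "text": "*3 new errors*"}}],
    channel="C1234567890",
)
print(json.dumps(body))
```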
The Fix button runs the Cursor agent to analyze a single error and suggest concrete fixes in your codebase.
- Cursor CLI – Install the Cursor agent CLI and ensure `agent` is in your PATH.
- Workspace setup – The agent can only modify code it can see. Set AGENT_WORKSPACE to a directory that contains both this Observe app and the repos that produce the errors (e.g. contentfly-management-background-jobs-service).

- Run a dashboard check and wait for the report.
- Click Fix next to the error you want to analyze.
- The app writes output/error_to_fix.txt and starts test.sh automatically.
- Output appears in your terminal and in output/agent_analysis.md.
| Variable | Purpose | Default |
|---|---|---|
| AGENT_MODEL | Cursor agent model. Run agent models to list options. | composer-1.5 |
| AGENT_WORKSPACE | Root directory the agent can read and modify. Must include the repos with the code that throws the errors. | ../ (parent of Observe) |
Example: If your layout is:
/Users/you/
├── Observe-automation/Observe/ ← this app
└── contentfly-management-background-jobs-service/ ← service that produces errors
Set in .env:
AGENT_WORKSPACE=/Users/you
Then the agent can suggest and apply fixes in both repos.
You can also run the fix flow manually:
```bash
# After clicking Fix, or if you have output/error_to_fix.txt:
./test.sh --error-file output/error_to_fix.txt

# With a different model:
./test.sh --error-file output/error_to_fix.txt --model <model-name>

# Full report (runs extract_errors.py first, then agent):
./test.sh
```

| Step | What to do |
|---|---|
| Install | pip install -r requirements.txt (from project root) |
| Config | cd Observe → cp env.sample .env → set OBSERVE_CUSTOMER_ID and OBSERVE_API_KEY |
| Run | Load .env, then python3 extract_errors.py --all-services (or --auto) |
| Output | Terminal + Observe/error_report.txt |
| Web UI | cd Observe && python3 app.py → open http://localhost:5000 (details) |
| Send to Slack | Set SLACK_WEBHOOK_URL + GEMINI_API_KEY in .env → run dashboard check → click Send me a Slack (details) |
| Fix flow | Set AGENT_WORKSPACE (and optionally AGENT_MODEL) in .env → run dashboard check → click Fix next to an error (details) |
| Deploy on Render | Push to GitHub → connect repo at Render → deploy (details) |
| Docker | Web app: docker build -t observe . then docker run -p 5000:5000 --env-file .env observe. CLI: docker build -f Dockerfile.cli -t observe-cli . (details) |
You can host the web UI on Render for free (with limits).
1. Push your code to a GitHub (or GitLab) repository. Ensure the repo root contains the Observe folder and the root render.yaml.
2. Create a Web Service on Render:
   - Go to dashboard.render.com → New → Web Service.
   - Connect your repository.
   - If you use the repo’s Blueprint (render.yaml), Render will create the service from it. Otherwise set:
     - Root Directory: Observe
     - Runtime: Python 3
     - Build Command: pip install -r requirements.txt
     - Start Command: gunicorn --bind 0.0.0.0:$PORT app:app
3. Deploy. Render will build and run the app. Your URL will be like https://observe-dashboard-check.onrender.com.
4. Credentials: The app does not store Observe credentials on the server. Users enter OBSERVE_CUSTOMER_ID and OBSERVE_API_KEY in the browser (and can save them in localStorage).
5. Slack (optional): To enable "Send me a Slack", add SLACK_WEBHOOK_URL and GEMINI_API_KEY as environment variables in Render’s dashboard.
Note: On the free tier, requests may time out after ~30–60 seconds. For long “Run dashboard check” runs (e.g. All services × All regions), use a single service or fewer regions, or consider a paid plan for longer timeouts.
Two Dockerfiles: Dockerfile (web app) and Dockerfile.cli (CLI only).
```bash
cd Observe
docker build -t observe .
docker run -p 5000:5000 --env-file .env observe
```

Open http://localhost:5000. Credentials are entered in the browser (not stored on the server).
Note: The Fix flow requires Cursor CLI and runs outside the container, so it does not work when the app runs in Docker. Use the app locally for Fix.
```bash
cd Observe
docker build -f Dockerfile.cli -t observe-cli .
docker run --rm --env-file .env observe-cli --all-services
docker run --rm --env-file .env observe-cli --auto
```

Mount a directory to get output/error_report.txt on your machine:

```bash
docker run --rm --env-file .env -v "$(pwd)/output:/app/output" observe-cli --auto
```

Then open ./output/error_report.txt.
Copy env.sample to .env in this folder (or export in the shell). Load before running, e.g.:

```bash
set -a && source .env && set +a && python3 extract_errors.py --all-services
# or
export $(grep -v '^#' .env | xargs) && python3 extract_errors.py --all-services
```

| Variable | Purpose | Default (if any) |
|---|---|---|
| OBSERVE_CUSTOMER_ID | Your Observe customer ID (in the URL). | 143110822295 |
| OBSERVE_API_KEY | API token for authentication. | (none – set this) |
| OBSERVE_CLUSTER | Regional cluster (e.g. eu-1). Base URL: https://<customer>.<cluster>.observeinc.com/... | eu-1 |
| OBSERVE_WORKSPACE_ID | Default workspace for single-service or fallback in config. | 41096433 |
| OBSERVE_DATASET_ID | Default dataset when running a single service. | (e.g. 41249174) |
| REGION | Value for {{REGION}} in OPAL (cluster filter, e.g. label(^Cluster) = "{{REGION}}"). | aws-na |
| START_IST | Start of time window in IST (YYYY-MM-DD HH:MM:SS). Optional; with END_IST overrides “last 24h”. | — |
| END_IST | End of time window in IST. Optional. | — |
| SLACK_WEBHOOK_URL | Webhook URL for "Send me a Slack" (Slack Incoming Webhook, Contentstack Automations API, or any HTTP endpoint). If set, the formatted report is POSTed here. | — |
| GEMINI_API_KEY | Google Gemini API key for formatting the report as a Slack message. Get a free key at Google AI Studio. | — |
| GEMINI_MODEL | Gemini model name. | gemini-1.5-flash |
| AGENT_MODEL | Cursor agent model for Fix flow. Run agent models to list options. | composer-1.5 |
| AGENT_WORKSPACE | Root directory for Fix flow. Must include repos with the code that produces the errors. | ../ |
| OBSERVE_LOOKUP_DAYS | For hostname lookup: time range = past N days. If unset, uses 15 minutes. | — |
| OBSERVE_LOOKUP_TIMEOUT_SEC | HTTP timeout (seconds) for Observe API calls in hostname lookup. Increase if queries time out with large OBSERVE_LOOKUP_DAYS. | 300 |
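Resolving these variables with their documented defaults can be sketched as follows (a simplified sketch mirroring the table above, not the script's exact code):

```python
import os

# Defaults match the table above
cluster = os.environ.get("OBSERVE_CLUSTER", "eu-1")
region = os.environ.get("REGION", "aws-na")
lookup_timeout = int(os.environ.get("OBSERVE_LOOKUP_TIMEOUT_SEC", "300"))
customer_id = os.environ.get("OBSERVE_CUSTOMER_ID", "143110822295")

# The regional base URL is assembled from customer ID and cluster
base_url = f"https://{customer_id}.{cluster}.observeinc.com"
```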
Valid regions (for REGION or --all-regions):
aws-na, aws-eu, aws-au, azure-na, azure-eu, gcp-na, gcp-eu
Run from the Observe folder (or pass correct paths to --config / --pipeline-file).
```bash
python3 extract_errors.py
```

Uses OBSERVE_WORKSPACE_ID, OBSERVE_DATASET_ID, REGION, and the last 24 hours.

```bash
python3 extract_errors.py -d <dataset_id> -p pipelines/<pipeline>.opal
```

Example (Nginx only):

```bash
python3 extract_errors.py -d 41250854 -p pipelines/launch_nginx_errors.opal
```

```bash
python3 extract_errors.py --all-services
```

Finds config/services.sample.json and runs every service in it.

```bash
python3 extract_errors.py --config path/to/services.json
```

Runs the same run (all services or single) for each region and concatenates the output:

```bash
python3 extract_errors.py --all-services --all-regions
```

Or with a custom config:

```bash
python3 extract_errors.py --config services.json --all-regions
```

Equivalent to --all-services --all-regions:

```bash
python3 extract_errors.py --auto
```

Override the default “last 24 hours” by setting both start and end in IST:

```bash
python3 extract_errors.py --all-services --start "2026-02-13 00:00:00" --end "2026-02-14 12:00:00"
```

Or use env: START_IST and END_IST.
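For reference, converting an IST window to UTC can be sketched with zoneinfo (illustrative only; the script's internal time handling may differ):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

IST = ZoneInfo("Asia/Kolkata")  # IST is UTC+05:30

def ist_window_to_utc(start_ist: str, end_ist: str):
    """Convert 'YYYY-MM-DD HH:MM:SS' IST strings to UTC datetimes."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(start_ist, fmt).replace(tzinfo=IST)
    end = datetime.strptime(end_ist, fmt).replace(tzinfo=IST)
    return start.astimezone(ZoneInfo("UTC")), end.astimezone(ZoneInfo("UTC"))

start_utc, end_utc = ist_window_to_utc("2026-02-13 00:00:00", "2026-02-14 12:00:00")
# Midnight IST is 18:30 the previous day in UTC
```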
```bash
python3 extract_errors.py -w <workspace_id> -d <dataset_id>
```

| Flag | Short | Description |
|---|---|---|
| --workspace | -w | Override workspace ID. |
| --dataset | -d | Override dataset ID (single-service). |
| --start | — | Start time in IST (YYYY-MM-DD HH:MM:SS). Env: START_IST. |
| --end | — | End time in IST. Env: END_IST. |
| --pipeline-file | -p | Path to OPAL pipeline file (single-service). |
| --config | -c | Path to JSON file listing services (name, workspace_id, dataset_id, pipeline_file). |
| --all-services | — | Use config/services.sample.json as config. |
| --all-regions | — | Run for every region (aws-na, aws-eu, …) and print/write combined output. |
| --auto | — | Same as --all-services --all-regions (all services × all regions). |
- Terminal: Progress lines like 🚀 Extracting unique error signatures from <service> [region: <region>]... and one table per service (and per region when using --all-regions).
- output/error_report.txt: Same table(s) in one file (IST timestamp, count, error & context, link to Observe log explorer).
- Pipeline files under pipelines/ define the OPAL (filters, make_col, statsby, etc.).
- The script replaces {{REGION}} in the pipeline with the current region (env REGION or the loop value when using --all-regions).
- Custom pipelines must output: latest_timestamp, total_occurrences, error_msg, context so the script can build the table and links.
- To copy OPAL from the Observe UI: Worksheet → query editor → OPAL tab; or Log Explorer → query builder → OPAL. Use only the pipeline part (no interface "..." line).
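The {{REGION}} replacement is a plain placeholder substitution; a minimal sketch:

```python
def render_pipeline(opal_template: str, region: str) -> str:
    """Replace the {{REGION}} placeholder with the current region."""
    return opal_template.replace("{{REGION}}", region)

opal = 'filter label(^Cluster) = "{{REGION}}"'
print(render_pipeline(opal, "aws-eu"))
# → filter label(^Cluster) = "aws-eu"
```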
For full steps see Setup (step-by-step). Short version:
- From project root: pip install -r requirements.txt
- cd Observe → copy env.sample to .env and set OBSERVE_CUSTOMER_ID and OBSERVE_API_KEY
- Load env: export $(grep -v '^#' .env | xargs) (or set -a && source .env && set +a)
- Run: python3 extract_errors.py --all-services or python3 extract_errors.py --auto
- Check output/error_report.txt for the full report