Commit c815435

Enhance health monitoring and status reporting in NGINX configuration and scripts
1 parent bf997d5 commit c815435

3 files changed: 298 additions and 19 deletions


README.md

Lines changed: 46 additions & 6 deletions
@@ -39,13 +39,48 @@ services:
 
 ```bash
 curl http://localhost/health
-# Returns: upstream response or "UPSTREAM_UNAVAILABLE" (503) if upstream is down
-# Includes X-Upstream-Status header showing actual upstream response code
+# Returns: OK
 ```
 
-The health endpoint now proxies to the upstream server to verify connectivity. If the upstream is unavailable, it returns 503 with "UPSTREAM_UNAVAILABLE".
+`/health` is a lightweight liveness check for NGINX itself. Use `/status` for upstream reachability, cache totals, and connection counters.
 
-**Health Monitoring**: A background process logs warnings every 5 minutes if the upstream server becomes unreachable, but the container continues running to serve cached content.
+### Status Endpoint
+
+```bash
+curl http://localhost/status
+```
+
+Example response:
+
+```json
+{
+  "updated_at": "2026-03-24T12:00:00Z",
+  "health": {
+    "nginx": true,
+    "upstream": true
+  },
+  "upstream": {
+    "host": "owl.virtualflybrain.org",
+    "port": 80
+  },
+  "cache": {
+    "source": "access_log",
+    "total": 120,
+    "hit": 113,
+    "miss": 7
+  },
+  "connections": {
+    "active": 3,
+    "reading": 0,
+    "writing": 1,
+    "waiting": 2
+  }
+}
+```
+
+`/status` is refreshed by a background monitor that reads the access log for cache totals and samples NGINX `stub_status` for connection counters.
+
+**Health Monitoring**: A background process logs warnings when the upstream server becomes unreachable, but the container continues running to serve cached content.
 
 ## Configuration
 
@@ -55,12 +90,14 @@ The health endpoint now proxies to the upstream server to verify connectivity. I
 - `CACHE_MAX_SIZE`: Maximum cache size on disk (default: `20g`, accepts NGINX size units like `1t` for 1TB)
 - `CACHE_STALE_TIME`: How long a cached response is considered fresh (default: `6M`). After this time the entry is served stale while being refreshed in the background. Accepts NGINX time units: `s`, `m`, `h`, `d`, `w`, `M` (30 days), `y` (365 days).
 - `DNS_RESOLVER`: DNS resolver servers (default: `8.8.8.8`, space-separated list). Check `cat /etc/resolv.conf` in your container to find the correct value for your environment.
+- `STATUS_POLL_INTERVAL`: Seconds between `/status` refreshes (default: `5`)
+- `HEALTH_LOG_INTERVAL`: Seconds between periodic upstream health log lines when state is unchanged (default: `300`)
 
 ### Cache Headers
 
 The proxy adds helpful headers to responses:
 
-- `X-Cache-Status`: `HIT`, `MISS`, `EXPIRED`, or `STALE`
+- `X-Cache-Status`: `HIT`, `MISS`, `EXPIRED`, `STALE`, `UPDATING`, or `REVALIDATED`
 - `X-Cache-Key`: The cache key used for the request
 
 ## Performance
@@ -80,7 +117,8 @@ The proxy adds helpful headers to responses:
 - **Cache storage**: `/var/cache/nginx/owlery` with 1:2 directory levels
 - **Cache zone**: 100MB in-memory metadata zone
 - **Max cache size**: 20GB on disk (configurable via `CACHE_MAX_SIZE` environment variable)
-- **Health monitoring**: Background process checks upstream connectivity every 5 minutes and logs warnings
+- **Status monitoring**: Background process updates `/var/run/nginx/status.json` from the access log and NGINX `stub_status`
+- **Health monitoring**: Background process checks upstream connectivity and logs warnings without taking the cache offline
 
 ### Caching Behavior
 
@@ -95,6 +133,7 @@ The proxy adds helpful headers to responses:
 ### Networking
 
 - **Listen ports**: 80 and 8080 (both ports handle requests identically)
+- **Status endpoints**: `/health` for liveness, `/status` for JSON metrics, and internal-only `/__nginx_status` for raw NGINX counters
 - **DNS resolver**: Configurable via `DNS_RESOLVER` (default: Google Public DNS `8.8.8.8` with 30s TTL for fast upstream IP updates). Check `cat /etc/resolv.conf` in your container for the correct value.
 - **Host-agnostic**: Ignores Host header for routing
 - **Connection pooling**: 16 keep-alive connections to backend
@@ -147,3 +186,4 @@ Set these in your GitHub repository secrets:
 - **Subsequent Requests**: Cache HIT → Return cached result (<10ms) with X-Cache-Status: HIT
 - **Expired Cache**: Return stale content immediately with X-Cache-Status: UPDATING + background refresh
 - **Backend Errors**: Forward errors to client without caching, allowing retries to succeed
+- **Status Reporting**: `/status` shows current hit/miss/total counts from the access log plus sampled connection counters
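
Note on consuming `/status`: the README change documents raw `hit`, `miss`, and `total` counters but no derived ratio. A minimal sketch of how a client might compute a hit percentage from the response is shown here. The function name `hit_ratio_from_status` is hypothetical and not part of the commit, and the parsing assumes the one-key-per-line layout of the example response; a real JSON parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): print the cache hit
# percentage from a /status document read on stdin.
# Assumption: each key sits on its own line, as in the README example.
hit_ratio_from_status() {
    awk '
        # Keep only the digits that follow the colon on the matching line.
        /"total":/ { v = $0; sub(/.*:/, "", v); gsub(/[^0-9]/, "", v); total = v }
        /"hit":/   { v = $0; sub(/.*:/, "", v); gsub(/[^0-9]/, "", v); hit = v }
        END {
            # Avoid dividing by zero before any cacheable request is logged.
            if (total + 0 == 0) { print "n/a"; exit }
            printf "%.1f\n", (hit / total) * 100
        }
    '
}
```

Possible usage, assuming the proxy is reachable on localhost: `curl -s http://localhost/status | hit_ratio_from_status`.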

health-monitor.sh

Lines changed: 221 additions & 13 deletions
@@ -1,24 +1,232 @@
 #!/bin/sh
 
-# Health monitoring script for upstream server
-# Logs warnings but doesn't exit container
+# Poll NGINX and the access log to keep /status current while also logging
+# upstream reachability changes.
 
-UPSTREAM_HOST=$(echo $UPSTREAM_SERVER | cut -d: -f1)
-UPSTREAM_PORT=$(echo $UPSTREAM_SERVER | cut -d: -f2)
+ACCESS_LOG=${ACCESS_LOG:-/var/log/nginx/access.log}
+STATUS_FILE=${STATUS_FILE:-/var/run/nginx/status.json}
+NGINX_STATUS_URL=${NGINX_STATUS_URL:-http://127.0.0.1:8080/__nginx_status}
+STATUS_POLL_INTERVAL=${STATUS_POLL_INTERVAL:-5}
+HEALTH_LOG_INTERVAL=${HEALTH_LOG_INTERVAL:-300}
 
-# Default to port 80 if no port specified
-if [ "$UPSTREAM_HOST" = "$UPSTREAM_PORT" ]; then
+UPSTREAM_HOST=$(printf '%s' "$UPSTREAM_SERVER" | cut -d: -f1)
+UPSTREAM_PORT=$(printf '%s' "$UPSTREAM_SERVER" | cut -d: -f2)
+
+# Default to port 80 if no port is specified.
+if [ -z "$UPSTREAM_HOST" ]; then
+    UPSTREAM_HOST=unknown
+fi
+
+if [ -z "$UPSTREAM_PORT" ] || [ "$UPSTREAM_HOST" = "$UPSTREAM_PORT" ]; then
     UPSTREAM_PORT=80
 fi
 
-echo "Monitoring upstream server: $UPSTREAM_HOST:$UPSTREAM_PORT"
+STATUS_DIR=$(dirname "$STATUS_FILE")
+ACCESS_LOG_DIR=$(dirname "$ACCESS_LOG")
 
-while true; do
-    if nc -z -w3 $UPSTREAM_HOST $UPSTREAM_PORT 2>/dev/null; then
-        echo "$(date): Upstream server is healthy"
+total_requests=0
+hit_requests=0
+miss_requests=0
+access_log_size=0
+
+active_connections=
+reading_connections=
+writing_connections=
+waiting_connections=
+
+nginx_healthy=false
+upstream_healthy=false
+last_upstream_state=
+last_health_log_epoch=0
+
+json_escape() {
+    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
+}
+
+json_number_or_null() {
+    if [ -n "$1" ]; then
+        printf '%s' "$1"
     else
-        echo "$(date): WARNING - Upstream server $UPSTREAM_HOST:$UPSTREAM_PORT is unreachable"
+        printf 'null'
+    fi
+}
+
+get_file_size() {
+    if [ ! -f "$1" ]; then
+        printf '0'
+        return
+    fi
+
+    if size=$(stat -c %s "$1" 2>/dev/null); then
+        printf '%s' "$size"
+    else
+        wc -c < "$1" 2>/dev/null | tr -d ' '
+    fi
+}
+
+recount_access_log() {
+    if [ ! -f "$ACCESS_LOG" ]; then
+        total_requests=0
+        hit_requests=0
+        miss_requests=0
+        access_log_size=0
+        return
+    fi
+
+    counts=$(awk '
+        BEGIN { total = 0; hit = 0; miss = 0 }
+        {
+            status = ""
+            if (NF >= 3) {
+                status = $(NF - 2)
+            }
+            if (status ~ /^(HIT|MISS|BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED)$/) {
+                total++
+                if (status == "HIT") {
+                    hit++
+                } else if (status == "MISS") {
+                    miss++
+                }
+            }
+        }
+        END { printf "%d %d %d", total, hit, miss }
+    ' "$ACCESS_LOG" 2>/dev/null)
+
+    set -- $counts
+    total_requests=${1:-0}
+    hit_requests=${2:-0}
+    miss_requests=${3:-0}
+    access_log_size=$(get_file_size "$ACCESS_LOG")
+}
+
+update_access_log_counts() {
+    current_size=$(get_file_size "$ACCESS_LOG")
+
+    if [ "$current_size" -lt "$access_log_size" ]; then
+        recount_access_log
+        return
+    fi
+
+    if [ "$current_size" -eq "$access_log_size" ]; then
+        return
     fi
 
-    sleep 300  # Check every 5 minutes
-done
+    if [ "$access_log_size" -eq 0 ]; then
+        recount_access_log
+        return
+    fi
+
+    start_byte=$((access_log_size + 1))
+    counts=$(tail -c +"$start_byte" "$ACCESS_LOG" 2>/dev/null | awk '
+        BEGIN { total = 0; hit = 0; miss = 0 }
+        {
+            status = ""
+            if (NF >= 3) {
+                status = $(NF - 2)
+            }
+            if (status ~ /^(HIT|MISS|BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED)$/) {
+                total++
+                if (status == "HIT") {
+                    hit++
+                } else if (status == "MISS") {
+                    miss++
+                }
+            }
+        }
+        END { printf "%d %d %d", total, hit, miss }
+    ')
+
+    set -- $counts
+    total_requests=$((total_requests + ${1:-0}))
+    hit_requests=$((hit_requests + ${2:-0}))
+    miss_requests=$((miss_requests + ${3:-0}))
+    access_log_size=$current_size
+}
+
+update_upstream_health() {
+    current_epoch=$(date +%s)
+
+    if nc -z -w3 "$UPSTREAM_HOST" "$UPSTREAM_PORT" 2>/dev/null; then
+        upstream_healthy=true
+        upstream_state=healthy
+        upstream_message="Upstream server is healthy"
+    else
+        upstream_healthy=false
+        upstream_state=unreachable
+        upstream_message="WARNING - Upstream server $UPSTREAM_HOST:$UPSTREAM_PORT is unreachable"
+    fi
+
+    if [ "$upstream_state" != "$last_upstream_state" ] || [ $((current_epoch - last_health_log_epoch)) -ge "$HEALTH_LOG_INTERVAL" ]; then
+        echo "$(date): $upstream_message"
+        last_upstream_state=$upstream_state
+        last_health_log_epoch=$current_epoch
+    fi
+}
+
+update_connection_stats() {
+    status_text=$(wget -q -O - "$NGINX_STATUS_URL" 2>/dev/null || true)
+
+    if [ -n "$status_text" ]; then
+        active_connections=$(printf '%s\n' "$status_text" | awk '/Active connections:/ { print $3 }')
+        reading_connections=$(printf '%s\n' "$status_text" | awk '/Reading:/ { print $2 }')
+        writing_connections=$(printf '%s\n' "$status_text" | awk '/Writing:/ { print $4 }')
+        waiting_connections=$(printf '%s\n' "$status_text" | awk '/Waiting:/ { print $6 }')
+        nginx_healthy=true
+    else
+        active_connections=
+        reading_connections=
+        writing_connections=
+        waiting_connections=
+        nginx_healthy=false
+    fi
+}
+
+write_status_file() {
+    timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
+    escaped_host=$(json_escape "$UPSTREAM_HOST")
+    tmp_file="${STATUS_FILE}.tmp"
+
+    cat > "$tmp_file" <<EOF
+{
+  "updated_at": "$timestamp",
+  "health": {
+    "nginx": $nginx_healthy,
+    "upstream": $upstream_healthy
+  },
+  "upstream": {
+    "host": "$escaped_host",
+    "port": $UPSTREAM_PORT
+  },
+  "cache": {
+    "source": "access_log",
+    "total": $total_requests,
+    "hit": $hit_requests,
+    "miss": $miss_requests
+  },
+  "connections": {
+    "active": $(json_number_or_null "$active_connections"),
+    "reading": $(json_number_or_null "$reading_connections"),
+    "writing": $(json_number_or_null "$writing_connections"),
+    "waiting": $(json_number_or_null "$waiting_connections")
+  }
+}
+EOF
+
+    mv "$tmp_file" "$STATUS_FILE"
+}
+
+mkdir -p "$STATUS_DIR" "$ACCESS_LOG_DIR"
+umask 022
+
+recount_access_log
+write_status_file
+
+echo "Monitoring upstream server: $UPSTREAM_HOST:$UPSTREAM_PORT"
+
+while true; do
+    update_access_log_counts
+    update_upstream_health
+    update_connection_stats
+    write_status_file
+    sleep "$STATUS_POLL_INTERVAL"
+done
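
Note on the log counting: `recount_access_log` and `update_access_log_counts` carry the same awk program twice, once over the whole file and once over the bytes appended since the last poll. A minimal sketch of one way to share that logic is shown here. The names `count_cache_statuses` and `count_since_offset` are hypothetical and not part of the commit, and the sketch assumes the `cache` log format ends with `$upstream_cache_status $request_time $upstream_response_time`, as in `nginx.conf.template`.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the commit).

# Read access-log lines on stdin and print "total hit miss".
# The cache status is taken as the third field from the end of each line.
count_cache_statuses() {
    awk '
        BEGIN { total = 0; hit = 0; miss = 0 }
        NF >= 3 {
            status = $(NF - 2)
            if (status ~ /^(HIT|MISS|BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED)$/) {
                total++
                if (status == "HIT") {
                    hit++
                } else if (status == "MISS") {
                    miss++
                }
            }
        }
        END { printf "%d %d %d\n", total, hit, miss }
    '
}

# Count only what was appended after a recorded byte offset.
# tail -c +N starts output at byte N (1-based), hence offset + 1.
# $1 = log file, $2 = number of bytes already processed.
count_since_offset() {
    tail -c +"$(( $2 + 1 ))" "$1" 2>/dev/null | count_cache_statuses
}
```

With helpers like these, a full recount would be `count_cache_statuses < "$ACCESS_LOG"` and an incremental update would be `count_since_offset "$ACCESS_LOG" "$access_log_size"`. As in the committed script, an offset that falls in the middle of a partially written line can still miscount that line.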

nginx.conf.template

Lines changed: 31 additions & 0 deletions
@@ -17,6 +17,7 @@ http {
         '$upstream_cache_status $request_time $upstream_response_time';
 
     access_log /dev/stdout cache;
+    access_log /var/log/nginx/access.log cache;
     error_log /dev/stderr warn;
 
     proxy_cache_path /var/cache/nginx/owlery levels=1:2 keys_zone=owlery_cache:100m max_size=${CACHE_MAX_SIZE} inactive=5y use_temp_path=off;
@@ -39,6 +40,21 @@ http {
     server {
         listen 80;
 
+        location = /status {
+            access_log off;
+            default_type application/json;
+            add_header Cache-Control "no-store" always;
+            alias /var/run/nginx/status.json;
+        }
+
+        location = /__nginx_status {
+            access_log off;
+            stub_status;
+            allow 127.0.0.1;
+            allow ::1;
+            deny all;
+        }
+
         location /health {
             access_log off;
             return 200 "OK\n";
@@ -79,6 +95,21 @@ http {
     server {
         listen 8080;
 
+        location = /status {
+            access_log off;
+            default_type application/json;
+            add_header Cache-Control "no-store" always;
+            alias /var/run/nginx/status.json;
+        }
+
+        location = /__nginx_status {
+            access_log off;
+            stub_status;
+            allow 127.0.0.1;
+            allow ::1;
+            deny all;
+        }
+
         location /health {
             access_log off;
             return 200 "OK\n";
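
Note on `/__nginx_status`: `update_connection_stats` in `health-monitor.sh` pipes the `stub_status` text through awk four times, once per counter. A minimal sketch of extracting all four counters in a single pass is shown here. The function name `parse_stub_status` is hypothetical and not part of the commit, and the sketch assumes the standard `stub_status` text layout with an `Active connections:` line and a `Reading: … Writing: … Waiting: …` line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): read stub_status text on
# stdin and print "active reading writing waiting" on one line.
parse_stub_status() {
    awk '
        # "Active connections: N" puts the count in the third field.
        /^Active connections:/ { active = $3 }
        # "Reading: R Writing: W Waiting: Q" alternates labels and values.
        /Reading:/ { reading = $2; writing = $4; waiting = $6 }
        END { printf "%s %s %s %s\n", active, reading, writing, waiting }
    '
}
```

Possible usage from inside the container, where the location is allowed for loopback only: `wget -q -O - http://127.0.0.1:8080/__nginx_status | parse_stub_status`.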
