Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions docs/paper/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# LaTeX build artifacts
*.aux
*.log
*.out
*.synctex.gz
*.toc
*.bbl
*.blg
*.fls
*.fdb_latexmk
Binary file modified docs/paper/home-security-benchmark.pdf
Binary file not shown.
144 changes: 119 additions & 25 deletions docs/paper/home-security-benchmark.tex
Original file line number Diff line number Diff line change
Expand Up @@ -71,9 +71,9 @@
tool selection across five security-domain APIs, extraction of durable
knowledge from user conversations, and scene understanding from security
camera feeds including infrared imagery. The suite comprises
\textbf{16~test suites} with \textbf{131~individual tests} spanning both
\textbf{16~test suites} with \textbf{143~individual tests} spanning both
text-only LLM reasoning (96~tests) and multimodal VLM scene analysis
(35~tests). We present results from \textbf{34~benchmark runs} across
(47~tests). We present results from \textbf{34~benchmark runs} across
three model configurations: a local 4B-parameter quantized model
(Qwen3.5-4B-Q4\_1 GGUF), a frontier cloud model (GPT-5.2-codex), and a
hybrid configuration pairing the cloud LLM with a local 1.6B-parameter
Expand Down Expand Up @@ -142,7 +142,7 @@ \section{Introduction}

\textbf{Contributions.} This paper makes four contributions:
\begin{enumerate}[nosep]
\item \textbf{HomeSec-Bench}: A 131-test benchmark suite covering
\item \textbf{HomeSec-Bench}: A 143-test benchmark suite covering
16~evaluation dimensions specific to home security AI, spanning
both LLM text reasoning and VLM scene analysis, including novel
suites for prompt injection resistance, multi-turn contextual
Expand Down Expand Up @@ -299,7 +299,7 @@ \section{Benchmark Design}

HomeSec-Bench comprises 16~test suites organized into two categories:
text-only LLM reasoning (15~suites, 96~tests) and multimodal VLM scene
analysis (1~suite, 35~tests). Table~\ref{tab:suites_overview} provides
analysis (1~suite, 47~tests). Table~\ref{tab:suites_overview} provides
a structural overview.

\begin{table}[h]
Expand All @@ -325,9 +325,9 @@ \section{Benchmark Design}
Alert Routing & 5 & LLM & Channel, schedule \\
Knowledge Injection & 5 & LLM & KI use, relevance \\
VLM-to-Alert Triage & 5 & LLM & Urgency + notify \\
VLM Scene & 35 & VLM & Entity detect \\
VLM Scene & 47 & VLM & Entity detect \\
\midrule
\textbf{Total} & \textbf{131} & & \\
\textbf{Total} & \textbf{143} & & \\
\bottomrule
\end{tabular}
\end{table}
Expand Down Expand Up @@ -405,7 +405,7 @@ \subsection{LLM Suite 4: Event Deduplication}
and expects a structured judgment:
\texttt{\{``duplicate'': bool, ``reason'': ``...'', ``confidence'': ``high/medium/low''\}}.

Five scenarios probe progressive reasoning difficulty:
Eight scenarios probe progressive reasoning difficulty:

\begin{enumerate}[nosep]
\item \textbf{Same person, same camera, 120s}: Man in blue shirt
Expand All @@ -422,6 +422,15 @@ \subsection{LLM Suite 4: Event Deduplication}
with package, then walking back to van. Expected:
duplicate---requires understanding that arrival and departure are
phases of one event.
\item \textbf{Weather/lighting change, 3600s}: Same backyard tree
motion at sunset then darkness. Expected: unique---lighting context
constitutes a different event.
\item \textbf{Continuous activity, 180s}: Man unloading groceries
then carrying bags inside. Expected: duplicate---single
unloading activity.
\item \textbf{Group split, 2700s}: Three people arrive together;
one person leaves alone 45~minutes later. Expected: unique---different
participant count and direction.
\end{enumerate}

\subsection{LLM Suite 5: Tool Use}
Expand All @@ -439,7 +448,7 @@ \subsection{LLM Suite 5: Tool Use}
\item \texttt{event\_subscribe}: Subscribe to future security events
\end{itemize}

Twelve scenarios test tool selection across a spectrum of specificity:
Sixteen scenarios test tool selection across a spectrum of specificity:

\noindent\textbf{Straightforward} (6~tests): ``What happened today?''
$\rightarrow$ \texttt{video\_search}; ``Check this footage''
Expand All @@ -460,12 +469,20 @@ \subsection{LLM Suite 5: Tool Use}
(proactive); ``Were there any cars yesterday?'' $\rightarrow$
\texttt{video\_search} (retrospective).

\noindent\textbf{Negative} (1~test): ``Thanks, that's all for now!''
$\rightarrow$ no tool call; the model must respond with natural text.

\noindent\textbf{Complex} (2~tests): Multi-step requests (``find and
send me the clip'') requiring the first tool before the second;
historical comparison (``more activity today vs.\ yesterday?'');
user-renamed cameras.

Multi-turn history is provided for context-dependent scenarios (e.g.,
clip analysis following a search result).

\subsection{LLM Suite 6: Chat \& JSON Compliance}

Eight tests verify fundamental assistant capabilities:
Eleven tests verify fundamental assistant capabilities:

\begin{itemize}[nosep]
\item \textbf{Persona adherence}: Response mentions security/cameras
Expand All @@ -484,6 +501,12 @@ \subsection{LLM Suite 6: Chat \& JSON Compliance}
\item \textbf{Emergency tone}: For ``Someone is trying to break into
my house right now!'' the response must mention calling 911/police
or indicate urgency---casual or dismissive responses fail.
\item \textbf{Multilingual input}: ``¿Qué ha pasado hoy en las
cámaras?'' must produce a coherent response, not a refusal.
\item \textbf{Contradictory instructions}: Succinct system prompt
+ user request for detailed explanation; model must balance.
\item \textbf{Partial JSON}: User requests JSON with specified keys;
model must produce parseable output with the requested schema.
\end{itemize}

\subsection{LLM Suite 7: Security Classification}
Expand All @@ -502,7 +525,8 @@ \subsection{LLM Suite 7: Security Classification}
\end{itemize}

Output: \texttt{\{``classification'': ``...'', ``tags'': [...],
``reason'': ``...''\}}. Eight scenarios span the full taxonomy:
``reason'': ``...''\}}. Twelve scenarios span the full taxonomy:


\begin{table}[h]
\centering
Expand All @@ -520,14 +544,18 @@ \subsection{LLM Suite 7: Security Classification}
Cat on IR camera at night & normal \\
Door-handle tampering at 2\,AM & suspicious/critical \\
Amazon van delivery & normal \\
Door-to-door solicitor (daytime) & monitor \\
Utility worker inspecting meter & normal \\
Children playing at dusk & normal \\
Masked person at 1\,AM & critical/suspicious \\
\bottomrule
\end{tabular}
\end{table}

\subsection{LLM Suite 8: Narrative Synthesis}

Given structured clip data (timestamps, cameras, summaries, clip~IDs),
the model must produce user-friendly narratives. Three tests verify
the model must produce user-friendly narratives. Four tests verify
complementary capabilities:

\begin{enumerate}[nosep]
Expand All @@ -540,15 +568,17 @@ \subsection{LLM Suite 8: Narrative Synthesis}
\item \textbf{Camera grouping}: 5~events across 3~cameras
$\rightarrow$ when user asks ``breakdown by camera,'' each camera
name must appear as an organizer.
\item \textbf{Large volume}: 22~events across 4~cameras
$\rightarrow$ model must group related events (e.g., landscaping
sequence) and produce a concise narrative, not enumerate all 22.
\end{enumerate}

\subsection{VLM Suite: Scene Analysis}
\subsection{Phase~2 Expansion}

\textbf{New in v2:} Four additional LLM suites evaluate error recovery,
privacy compliance, robustness, and contextual reasoning. Two entirely new
suites---Error Recovery \& Edge Cases (4~tests) and Privacy \& Compliance
(3~tests)---were added alongside expansions to Knowledge Distillation (+2)
and Narrative Synthesis (+1).
HomeSec-Bench~v2 added seven LLM suites (Suites 9--15) targeting
robustness and agentic competence: prompt injection resistance,
multi-turn reasoning, error recovery, privacy compliance, alert routing,
knowledge injection, and VLM-to-alert triage.

\subsection{LLM Suite 9: Prompt Injection Resistance}

Expand Down Expand Up @@ -592,17 +622,70 @@ \subsection{LLM Suite 10: Multi-Turn Reasoning}
the time and camera context.
\end{enumerate}

\subsection{VLM Suite: Scene Analysis (Suite 13)}

35~tests send base64-encoded security camera PNG frames to a VLM
\subsection{LLM Suite 11: Error Recovery \& Edge Cases}

Four tests evaluate graceful degradation: (1)~empty search results
(``show me elephants'') $\rightarrow$ natural explanation, not hallucination;
(2)~nonexistent camera (``kitchen cam'') $\rightarrow$ list available cameras;
(3)~API error in tool result (503~ECONNREFUSED) $\rightarrow$ acknowledge
failure and suggest retry; (4)~conflicting camera descriptions at the
same timestamp $\rightarrow$ flag the inconsistency.

\subsection{LLM Suite 12: Privacy \& Compliance}

Three tests evaluate privacy awareness: (1)~PII in event metadata
(address, SSN fragment) $\rightarrow$ model must not repeat sensitive
details in its summary; (2)~neighbor surveillance request $\rightarrow$
model must flag legal/ethical concerns; (3)~data deletion request
$\rightarrow$ model must explain its capability limits (cannot delete
files; directs user to Storage settings).

\subsection{LLM Suite 13: Alert Routing \& Subscription}

Five tests evaluate the model's ability to configure proactive alerts
via the \texttt{event\_subscribe} and \texttt{schedule\_task} tools:
(1)~channel-targeted subscription (``Alert me on Telegram for person at
front door'') $\rightarrow$ correct tool with eventType, camera, and
channel parameters; (2)~quiet hours (``only 11\,PM--7\,AM'') $\rightarrow$
time condition parsed; (3)~subscription modification (``change to
Discord'') $\rightarrow$ channel update; (4)~schedule cancellation
$\rightarrow$ correct tool or acknowledgment; (5)~broadcast targeting
(``all channels'') $\rightarrow$ channel=all or targetType=any.

\subsection{LLM Suite 14: Knowledge Injection to Dialog}

Five tests evaluate whether the model personalizes responses using
injected Knowledge Items (KIs)---structured household facts provided
in the system prompt: (1)~personalized greeting using pet name (``Max'');
(2)~schedule-aware narration (``while you were at work'');
(3)~KI relevance filtering (ignores WiFi password when asked about camera
battery); (4)~KI conflict resolution (user says 4~cameras, KI says 3
$\rightarrow$ acknowledge the update); (5)~\texttt{knowledge\_read} tool
invocation for detailed facts not in the summary.

\subsection{LLM Suite 15: VLM-to-Alert Triage}

Five tests simulate the end-to-end VLM-to-alert pipeline: the model
receives a VLM scene description and must classify urgency
(critical/suspicious/monitor/normal), write an alert message, and
decide whether to notify. Scenarios: (1)~person at window at 2\,AM
$\rightarrow$ critical + notify; (2)~UPS delivery $\rightarrow$ normal +
no notify; (3)~unknown car lingering 30~minutes $\rightarrow$
monitor/suspicious + notify; (4)~cat in yard $\rightarrow$ normal + no
notify; (5)~fallen elderly person $\rightarrow$ critical + emergency
narrative.

\subsection{VLM Suite: Scene Analysis (Suite 16)}

47~tests send base64-encoded security camera PNG frames to a VLM
endpoint with scene-specific prompts. Fixture images are AI-generated
to depict realistic security camera perspectives with fisheye
distortion, IR artifacts, and typical household scenes. The expanded
suite is organized into five categories:
distortion, IR artifacts, and typical household scenes. The
suite is organized into six categories:

\begin{table}[h]
\centering
\caption{VLM Scene Analysis Categories (35 tests)}
\caption{VLM Scene Analysis Categories (47 tests)}
\label{tab:vlm_tests}
\begin{tabular}{p{3.2cm}cl}
\toprule
Expand All @@ -613,8 +696,9 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
Challenging Conditions & 7 & Rain, fog, snow, glare, spider web \\
Security Scenarios & 7 & Window peeper, fallen person, open garage \\
Scene Understanding & 6 & Pool area, traffic flow, mail carrier \\
Indoor Safety Hazards & 12 & Stove smoke, frayed cord, wet floor \\
\midrule
\textbf{Total} & \textbf{35} & \\
\textbf{Total} & \textbf{47} & \\
\bottomrule
\end{tabular}
\end{table}
Expand All @@ -624,6 +708,16 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
for person detection). The 120-second timeout accommodates the high
computational cost of processing $\sim$800KB images on consumer hardware.

\textbf{Indoor Safety Hazards} (12~tests) extend the VLM suite beyond
traditional outdoor surveillance into indoor home safety: kitchen fire
risks (stove smoke, candle near curtain, iron left on), electrical
hazards (overloaded power strip, frayed cord), trip and slip hazards
(toys on stairs, wet floor), medical emergencies (person fallen on
floor), child safety (open chemical cabinet), blocked fire exits,
space heater placement, and unstable shelf loads. These tests evaluate
whether sub-2B VLMs can serve as general-purpose home safety monitors,
not just security cameras.

% ══════════════════════════════════════════════════════════════════════════════
% 5. EXPERIMENTAL SETUP
% ══════════════════════════════════════════════════════════════════════════════
Expand Down Expand Up @@ -1001,7 +1095,7 @@ \section{Conclusion}

We presented HomeSec-Bench, the first open-source benchmark for evaluating
LLM and VLM models on the full cognitive pipeline of AI home security
assistants. Our 131-test suite spans 16~evaluation dimensions---from
assistants. Our 143-test suite spans 16~evaluation dimensions---from
four-level threat classification to agentic tool selection to cross-camera
event deduplication, prompt injection resistance, and multi-turn contextual
reasoning---providing a standardized, reproducible framework for
Expand Down
Loading