Conversation


@SandeepChauhan00 SandeepChauhan00 commented Jan 13, 2026

🐛 Problem

The get_all_face_embeddings() function was previously designed to group results by image_id.

  • The Bug: The check if image_id not in images_dict caused the loop to skip every face after the first whenever a single image contained multiple faces.
  • The Impact: This caused data loss in the frontend, as only one face per image was ever indexed or displayed.

🛠️ Solution

I have refactored backend/app/database/faces.py with the following improvements:

1. Logic Fix (Multi-face Retrieval)

  • Removed the dictionary keying by image_id.
  • Updated the retrieval logic to return a flat list of Face objects. Now, if an image has 3 faces, 3 distinct records are returned, all correctly linked to the same parent image metadata.
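The grouping bug can be reproduced in miniature (the row data and dict shapes below are hypothetical; only the shape of the logic mirrors faces.py):

```python
# Illustrative sketch: grouping rows by image_id drops every face after
# the first, while a flat list keeps all of them.
rows = [
    ("face1", "img1"), ("face2", "img1"), ("face3", "img1"),
    ("face4", "img2"),
]

# Buggy grouping: only the first face per image survives
images_dict = {}
for face_id, image_id in rows:
    if image_id not in images_dict:
        images_dict[image_id] = {"face_id": face_id}

# Fixed flat list: one entry per face, all linked to the parent image
faces = []
for face_id, image_id in rows:
    faces.append({"face_id": face_id, "id": image_id})

print(len(images_dict))  # 2 (one face lost per extra face in an image)
print(len(faces))        # 4 (every face retained)
```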

2. Database Safety (Resource Management)

  • Added try...finally blocks to all database functions.
  • Ensures conn.close() is always called, preventing "Database is locked" errors and memory leaks during high-concurrency operations.
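A minimal sketch of the pattern, using an in-memory database and a placeholder query rather than the real faces.py functions:

```python
import sqlite3

DATABASE_PATH = ":memory:"  # assumption: real code uses the configured DB path

def db_fetch_one() -> int:
    """Sketch of the try/except/finally pattern: the connection is
    closed even when the query raises."""
    conn = None
    try:
        conn = sqlite3.connect(DATABASE_PATH)
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        return cursor.fetchone()[0]
    except sqlite3.Error as e:
        print(f"Database error: {e}")
        return 0
    finally:
        if conn is not None:
            conn.close()

print(db_fetch_one())  # 1
```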

3. Type Safety

  • Added robust type checking in db_insert_face_embeddings_by_image_id.
  • The code now safely handles both single np.ndarray inputs and List[np.ndarray] inputs, preventing crashes if the AI model output format varies.
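A sketch of the kind of normalization involved (the helper name normalize_embeddings and the 2D-batch case are illustrative, not the exact faces.py code):

```python
import numpy as np
from typing import List, Union

def normalize_embeddings(
    embeddings: Union[np.ndarray, List[np.ndarray]],
) -> List[np.ndarray]:
    """Accept a single embedding, a list of embeddings, or a 2D batch
    array, and always return a list so downstream code handles one shape."""
    if isinstance(embeddings, list):
        return embeddings
    if isinstance(embeddings, np.ndarray) and embeddings.ndim == 2:
        # batch output (N faces x D dims) -> list of N 1-D arrays
        return [row for row in embeddings]
    return [embeddings]

single = np.zeros(4)
batch = [np.zeros(4), np.ones(4)]
print(len(normalize_embeddings(single)))          # 1
print(len(normalize_embeddings(batch)))           # 2
print(len(normalize_embeddings(np.zeros((3, 4)))))  # 3
```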

📸 Technical Implementation

Before (buggy logic):

```python
# Skipped faces if image_id was already processed
if image_id not in images_dict:
    images_dict[image_id] = { ... }
```

After (fixed logic):

```python
# Appends every face found, regardless of image_id
faces.append({
    "face_id": face_id,
    "id": image_id,
    ...
})
```


Testing

  • Multi-face Test: Uploaded an image with 4 faces; verified that get_all_face_embeddings returns 4 distinct entries.
  • Tags: Verified that tags are correctly mapped to the image metadata for all faces.
  • Regression Test: Verified that single-face images still load correctly.
  • Clean Scope: Reverted accidental frontend dependency changes to keep this PR focused solely on the backend.
Closes #1027

## Summary by CodeRabbit

* **New Features / Improvements**
  * Insert single or multiple face embeddings with per-item metadata (confidence, bounding box, optional cluster); embeddings/bboxes stored as JSON.
  * Retrieval now returns structured embeddings, image metadata, optional cluster names, and cluster mean aggregations.

* **Bug Fixes**
  * Improved error handling, transaction rollbacks and connection cleanup for DB operations.
  * Safer handling and validation of missing/invalid embedding or bbox data.

* **Refactor**
  * Expanded public data shapes and standardized serialization/deserialization for face-related operations.



coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

This change refactors face-related DB code in backend/app/database/faces.py: expands typing, centralizes DB connections, stores embeddings/bboxes as JSON, hardens error handling and transactions, and broadens insert/read/update APIs for single and batch face embeddings.

Changes

Face DB: types, schema, insert, read, update — backend/app/database/faces.py

  • Expanded public type aliases (FaceData, FaceId, ImageId, ClusterId, BoundingBox, FaceEmbedding).
  • Added get_db_conn() and updated db_create_faces_table() (image_id NOT NULL, embeddings/bbox stored as JSON, FK on image_id).
  • Widened insert APIs (db_insert_face_embeddings, db_insert_face_embeddings_by_image_id) to accept single or multiple embeddings with per-item confidence/bbox/cluster_id.
  • Added robust JSON serialization/deserialization, guards for invalid data, and try/except with rollback and connection lifecycle handling.
  • Updated retrievals (get_all_face_embeddings, db_get_all_faces_with_cluster_names, db_get_faces_unassigned_clusters, db_get_cluster_mean_embeddings) to return typed lists/dicts and handle corrupted or missing JSON.
  • db_update_face_cluster_ids_batch now accepts None cluster_ids and an optional external cursor with proper rollback.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I hopped through rows of JSON dreams,

tucked embeddings into tidy seams,
rolled back tumbleweeds when errors peeped,
kept clusters snug where secrets sleep,
a rabbit's nibble — safe data, sweet.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title clearly and specifically describes the main changes: fixing multi-face retrieval and improving database resource management, directly addressing issue #1027.
  • Linked Issues Check — ✅ Passed. The PR addresses the faces.py type mismatches from #1027 by fixing embeddings/bbox/cluster_id handling and the return types of insertion/retrieval functions, improving resource management with try/finally blocks, and handling edge-case null values.
  • Out of Scope Changes Check — ✅ Passed. All changes to faces.py are within the scope of #1027, focusing on type corrections, multi-face retrieval, database resource management, and edge-case handling as specified in the issue.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/database/faces.py (3)

110-186: db_insert_face_embeddings_by_image_id(): empty-list and 2D ndarray cases cause crashes or data corruption.

  • When embeddings=[] is passed, the list-detection condition at lines 133–136 fails (len check), falling into the "single face" path and crashing at db_insert_face_embeddings() line 84 (empty list has no .tolist() method).
  • When a 2D np.ndarray (N faces × D dims) is passed, it bypasses list detection (not a list), is treated as a single embedding, and .tolist() produces a nested list stored in a single database row instead of N separate rows for each face.

Additionally, type hints for confidence, bbox, and cluster_id parameters should use List[Optional[...]] instead of List[...] since the code handles None elements within lists.

Proposed fix
def db_insert_face_embeddings_by_image_id(
    image_id: ImageId,
    embeddings: Union[FaceEmbedding, List[FaceEmbedding]],
-   confidence: Optional[Union[float, List[float]]] = None,
-   bbox: Optional[Union[BoundingBox, List[BoundingBox]]] = None,
-   cluster_id: Optional[Union[ClusterId, List[ClusterId]]] = None,
+   confidence: Optional[Union[float, List[Optional[float]]]] = None,
+   bbox: Optional[Union[BoundingBox, List[Optional[BoundingBox]]]] = None,
+   cluster_id: Optional[Union[ClusterId, List[Optional[ClusterId]]]] = None,
 ) -> Union[Optional[FaceId], List[Optional[FaceId]]]:
     """..."""
 
+    # Handle empty list
+    if isinstance(embeddings, list) and len(embeddings) == 0:
+        return []
+
+    # Normalize 2D array to list of 1D arrays
+    if isinstance(embeddings, np.ndarray) and embeddings.ndim == 2:
+        embeddings = [row for row in embeddings]
+
     # Handle multiple faces in one image (list of numpy arrays)
     if (
         isinstance(embeddings, list)
         and len(embeddings) > 0
         and isinstance(embeddings[0], np.ndarray)
     ):

Consider a follow-up optimization: for multiple faces, reuse one DB connection/transaction instead of opening one connection per face insert to reduce lock contention.


29-56: Replace direct sqlite3.connect() calls with the centralized get_db_connection() context manager to ensure foreign key constraints are enforced.

The db_insert_face_embeddings() and db_update_face_cluster_ids_batch() functions perform INSERT/UPDATE operations without enabling PRAGMA foreign_keys. Since this pragma is per-connection in SQLite, FK constraints are silently ignored in these functions. A centralized get_db_connection() context manager already exists in backend/app/database/connection.py (which enables FK constraints and other integrity pragmas on every connection) and is used by albums.py. Refactor faces.py to use this pattern instead of direct sqlite3.connect() calls.
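A minimal sketch of such a context manager, assuming an in-memory path; the real helper in connection.py may set additional integrity pragmas:

```python
import sqlite3
from contextlib import contextmanager

DATABASE_PATH = ":memory:"  # assumption: real code uses the configured DB path

@contextmanager
def get_db_connection():
    """Centralized connection helper: PRAGMA foreign_keys is
    per-connection in SQLite, so it must be re-enabled every time."""
    conn = sqlite3.connect(DATABASE_PATH)
    conn.execute("PRAGMA foreign_keys = ON")
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

with get_db_connection() as conn:
    fk_enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]
print(fk_enabled)  # 1 (FK enforcement is on for this connection)
```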


188-292: get_all_face_embeddings() can return None embeddings, causing runtime crashes in faceSearch.py.
The function correctly returns per-face dicts (compatible with actual usage in faceSearch.py line 70-81), but embeddings can be None when the database column is null. Line 79 of faceSearch.py passes image["embeddings"] directly to FaceNet_util_cosine_similarity(), which calls np.dot() and np.linalg.norm() — both will crash with TypeError if embeddings is None. Add a null check before computing similarity or ensure embeddings is always a list.

Separately, silently dropping faces via continue on JSONDecodeError (line 266) without logging the face_id obscures data corruption. Add logging to surface which faces failed to decode.
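A sketch of the suggested guard (cosine_similarity and safe_similarity are hypothetical stand-ins for the faceSearch.py code):

```python
import numpy as np
from typing import Optional

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def safe_similarity(query: np.ndarray, stored) -> Optional[float]:
    """Skip faces whose embeddings failed to load (None) instead of
    letting np.dot raise a TypeError."""
    if stored is None:
        return None
    return cosine_similarity(query, np.asarray(stored))

q = np.array([1.0, 0.0])
print(safe_similarity(q, [1.0, 0.0]))  # 1.0
print(safe_similarity(q, None))        # None
```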

🧹 Nitpick comments (5)
backend/app/database/faces.py (5)

59-108: db_insert_face_embeddings(): good close/rollback, but validate non-numpy inputs before .tolist().
If a caller accidentally passes a plain list (or np.ndarray with unexpected shape), .tolist() or downstream logic may behave unexpectedly; given the PR’s “type mismatch / robustness” objective, a small guard helps.

Proposed guard + FK enable if you don't add a shared helper
@@
-        conn = sqlite3.connect(DATABASE_PATH)
+        conn = sqlite3.connect(DATABASE_PATH, timeout=30)
+        conn.execute("PRAGMA foreign_keys = ON")
         cursor = conn.cursor()
 
-        embeddings_json = json.dumps(embeddings.tolist())
+        if not isinstance(embeddings, np.ndarray):
+            raise TypeError("embeddings must be a numpy.ndarray")
+        embeddings_json = json.dumps(embeddings.tolist())

294-324: Good: connection always closes; consider catching JSON decode errors too.
json.loads(embeddings_json) can raise JSONDecodeError, which isn’t a sqlite3.Error; right now that will bubble up (but at least the finally will still close the connection).


326-371: Good: connection always closes; same JSONDecodeError robustness applies here.
If a row contains malformed embeddings, json.loads() will raise and you’ll lose the whole response rather than skipping/logging one bad row.


373-426: Batch update: validate face_id presence before executing.
Right now missing/None face_id entries will produce (cluster_id, None) updates that no-op silently; that can hide upstream bugs. Consider raising if face_id is None.


428-483: Mean embeddings: np.stack can be memory-heavy; also add JSONDecodeError handling.
If clusters get large, stacking every embedding can spike memory; incremental mean (or np.mean over a preallocated array) can help later. Separately, malformed JSON will currently raise (not caught as sqlite3.Error).
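An incremental running mean can be sketched like this, holding only one vector in memory at a time rather than stacking the whole cluster:

```python
import numpy as np

def incremental_mean(vectors) -> np.ndarray:
    """Running mean over an iterable of vectors; avoids materializing
    np.stack(all_vectors) for large clusters."""
    mean = None
    count = 0
    for v in vectors:
        count += 1
        if mean is None:
            mean = np.asarray(v, dtype=np.float64).copy()
        else:
            mean += (np.asarray(v, dtype=np.float64) - mean) / count
    return mean

vecs = [np.array([1.0, 3.0]), np.array([3.0, 5.0])]
print(incremental_mean(vecs))  # [2. 4.]
```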

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3af96c and 4cb7a38.

📒 Files selected for processing (1)
  • backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (3)
backend/app/utils/faceSearch.py (1)
  • BoundingBox (11-15)
backend/test.py (1)
  • get_all_face_embeddings (8-32)
backend/app/utils/images.py (1)
  • image_util_parse_metadata (496-513)
🔇 Additional comments (1)
backend/app/database/faces.py (1)

4-4: Typing import expansion is fine, but prefer minimizing Any creep.
No blocking issues here; just keep an eye on Dict[str, Any] APIs becoming de-facto untyped surfaces.

@SandeepChauhan00
Author

Thanks for the review! I have pushed a follow-up commit that addresses the critical issues:

  1. Numpy Shape Fix: Normalizes 2D numpy arrays to lists to prevent shape mismatch errors if the model outputs batch data.
  2. Data Integrity: Explicitly enables PRAGMA foreign_keys = ON to ensure orphaned face data isn't left behind if an image is deleted.
  3. Crash Prevention: Added null-checks for embeddings to prevent crashes during retrieval if the JSON is corrupted or missing.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/database/faces.py (3)

1-28: Critical: File contains duplicate imports and inconsistent type definitions.

The file has duplicate import statements (lines 1-5 vs 17-21) and duplicate type definitions with conflicting values:

| Type | Line 10/11 | Line 26/28 |
| --- | --- | --- |
| ClusterId | int | str |
| BoundingBox | Dict[str, Union[int, float]] | Dict[str, int] |

This appears to be a merge conflict or copy-paste error. The entire module is duplicated with slight variations, which will cause runtime errors due to redefined symbols and type mismatches.

Recommended fix: Remove duplicate imports and reconcile type definitions
 import sqlite3
 import json
 import numpy as np
 from typing import Optional, List, Dict, Union, TypedDict, Any
 from app.config.settings import DATABASE_PATH

 # Type definitions
 FaceId = int
 ImageId = str
-ClusterId = int
-BoundingBox = Dict[str, Union[int, float]]
+ClusterId = str  # Use str consistently per DB schema (TEXT)
+BoundingBox = Dict[str, Union[int, float]]  # Keep flexible for coordinates
 FaceEmbedding = np.ndarray
-
-
-class FaceData(TypedDict):
-    """Represents the full faces table structure"""
-import sqlite3
-import json
-import numpy as np
-from typing import Optional, List, Dict, Union, TypedDict, Any
-from app.config.settings import DATABASE_PATH
-
-# Type definitions
-FaceId = int
-ImageId = str
-ClusterId = str
-FaceEmbedding = np.ndarray  # 512-dim vector
-BoundingBox = Dict[str, int]  # {'x': int, 'y': int, 'width': int, 'height': int}

456-494: Critical: Duplicate function definitions will cause the first implementations to be overwritten.

Starting at line 456, there's a partial FaceData remnant followed by a complete second implementation of all database functions. Python will use the last definition of each function, meaning:

  1. The improved get_db_conn() helper (lines 40-44) exists but is never used by the second implementations
  2. The 2D numpy array handling in db_insert_face_embeddings_by_image_id (lines 130-133) is overwritten
  3. Foreign key enabling via get_db_conn() is lost in favor of direct sqlite3.connect() calls

The second db_create_faces_table also has a different schema (lines 467-494 uses INTEGER for cluster_id and TEXT for embeddings, while lines 47-72 uses TEXT for cluster_id and JSON for embeddings).

Remove the duplicate implementations (lines 456-921) and keep only the first, improved version (lines 40-455).


1-921: File has critical syntax errors and duplicate implementations—remove malformed section and keep only the improved first implementation.

This file is syntactically invalid and won't parse. It contains two complete duplicate implementations with conflicting schemas:

First implementation (lines 40-455) has improvements the PR intends:

  • ✅ get_db_conn() helper with PRAGMA foreign_keys
  • ✅ Null checks in get_all_face_embeddings() to prevent downstream crashes
  • ✅ 2D numpy array handling in db_insert_face_embeddings_by_image_id()
  • ✅ Schema: cluster_id TEXT matching ClusterId = str

Second implementation (lines 467-920) is outdated:

  • ❌ Direct sqlite3.connect() without helper
  • ❌ No null checks; will crash on missing embeddings
  • ❌ No 2D array handling
  • ❌ Schema: cluster_id INTEGER conflicts with string type definition

To fix:

  1. Remove the malformed orphaned class members at lines 452-460 (incomplete TypedDict fields hanging between functions)
  2. Delete the entire second implementation (lines 467-920)
  3. Keep lines 1-451 (first implementation), ensuring ClusterId = str remains at line 26
🤖 Fix all issues with AI agents
In @backend/app/database/faces.py:
- Around line 750-753: The second implementation of
db_get_faces_unassigned_clusters (and the similar duplicate implementations in
db_get_all_faces_with_cluster_names and db_get_cluster_mean_embeddings) will
crash when embeddings_json is None; add the same defensive check used in the
first implementation: if not embeddings_json: continue before calling
json.loads(embeddings_json) so you never pass None into
np.array(json.loads(...)), or remove the duplicate/second implementation
entirely and keep only the already-correct first implementation to eliminate the
bug and duplication.
🧹 Nitpick comments (2)
backend/app/database/faces.py (2)

67-68: Consider: Use structured logging instead of print statements.

Throughout this file, errors are logged via print() (lines 68, 105, 268, 292, 319, 358, 404, 451, etc.). For better observability in production, consider using Python's logging module with appropriate log levels.

import logging
logger = logging.getLogger(__name__)

# Instead of:
print(f"Error creating faces table: {e}")

# Use:
logger.error(f"Error creating faces table: {e}")

272-274: Minor: Empty list check may be unnecessary.

Given the query at lines 221-230 filters with WHERE m.name IS NOT NULL, any image_id in image_tags should have at least one tag. An empty list shouldn't occur, making this check redundant:

tags = image_tags.get(image_id)
if tags is not None and len(tags) == 0:
    tags = None

However, keeping it as defensive code is fine.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb7a38 and 630a0fb.

📒 Files selected for processing (1)
  • backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (5)
backend/app/utils/faceSearch.py (1)
  • BoundingBox (11-15)
backend/app/models/ObjectClassifier.py (1)
  • close (27-33)
backend/app/models/YOLO.py (1)
  • close (34-36)
backend/app/models/FaceNet.py (1)
  • close (27-29)
backend/app/utils/images.py (1)
  • image_util_parse_metadata (496-513)
🔇 Additional comments (5)
backend/app/database/faces.py (5)

40-44: Good addition: Centralized connection helper with foreign key enforcement.

The get_db_conn() helper ensures PRAGMA foreign_keys = ON is consistently applied, which is important for data integrity with the ON DELETE CASCADE constraints.


114-164: Good handling of multi-face embeddings and 2D arrays.

The logic correctly addresses the PR objectives:

  1. Empty list returns early (line 127-128)
  2. 2D numpy arrays (N, 512) are converted to a list of N arrays (lines 132-133)
  3. Per-embedding extraction of confidence, bbox, and cluster_id handles both list and scalar inputs

One minor observation: ClusterId is defined as str (line 26), but line 159 checks isinstance(cluster_id, str) which is correct. However, in the second duplicate implementation (line 595-596), it checks isinstance(cluster_id, int) - another inconsistency to address when removing duplicates.


190-296: LGTM - Correctly returns flat list and handles corrupted data.

This implementation properly addresses the core issue from #1027:

  • Returns one entry per face (not grouped by image_id), so multi-face images are fully represented
  • Lines 256-265 gracefully skip faces with null/empty embeddings, preventing downstream crashes
  • The two-query approach for tags (lines 220-231) avoids a Cartesian product

104-111: Good resource management pattern.

The try/except/finally pattern with explicit conn.rollback() on error and conn.close() in finally blocks addresses the "Database is locked" errors mentioned in the PR objectives. This pattern is consistently applied across all database functions.


365-408: Good batch update with flexible cursor handling.

The function properly supports both:

  1. External cursor (for use within larger transactions)
  2. Creating its own connection when called standalone

The defensive check at lines 380-381 is technically redundant after lines 376-378, but serves as a safety net for future refactoring.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @backend/app/database/faces.py:
- Line 395: The return type for db_get_cluster_mean_embeddings is wrong: replace
the int union with the ClusterId alias (a str) so the signature reads a list of
dicts mapping "cluster_id" to ClusterId and "mean_embedding" to FaceEmbedding
(np.ndarray); update the type annotation in the db_get_cluster_mean_embeddings
definition to use ClusterId instead of int and ensure ClusterId and
FaceEmbedding are imported/available in that module.
- Around line 260-272: The faces.append block currently stores embeddings_json
(a Python list from json.loads) which is inconsistent with other functions like
db_get_faces_unassigned_clusters and db_get_all_faces_with_cluster_names that
return numpy arrays; change the value stored under "embeddings" in that
faces.append (where embeddings_json is created) to a numpy array (e.g.,
np.array(json.loads(...))) so all getters return the same type and callers
receive np.ndarray consistently.
- Around line 10-12: Update the BoundingBox type alias from Dict[str, int] to
use floats to match the FaceDetector outputs and the Pydantic model: change
BoundingBox to Dict[str, float] (or a more specific TypedDict with float-typed
'x', 'y', 'width', 'height') and ensure any imports (e.g., typing.Dict or
typing.TypedDict) are adjusted accordingly; keep the existing ClusterId and
FaceEmbedding aliases unchanged.
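The suggested float-typed TypedDict could look like this (the field names are assumed from the review's description of the detector output):

```python
from typing import TypedDict

class BoundingBox(TypedDict):
    """Float-typed bbox matching FaceDetector outputs (assumed fields)."""
    x: float
    y: float
    width: float
    height: float

box: BoundingBox = {"x": 10.5, "y": 4.0, "width": 32.0, "height": 48.0}
print(box["width"])  # 32.0
```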
🧹 Nitpick comments (3)
backend/app/database/faces.py (3)

24-28: Add return type annotation for consistency.

The function is missing a return type annotation. For consistency with the rest of the codebase and to help static analysis tools:

-def get_db_conn():
+def get_db_conn() -> sqlite3.Connection:

69-76: Potential AttributeError if embeddings is None or invalid.

Line 74 calls embeddings.tolist() without validation. If a caller passes None or a non-ndarray value (despite the type hint), this will raise an AttributeError. Given the PR's goal of preventing crashes on edge-case inputs, consider adding a guard:

     try:
         conn = get_db_conn()
         cursor = conn.cursor()
 
+        if embeddings is None or not isinstance(embeddings, np.ndarray):
+            print(f"Invalid embeddings for image {image_id}")
+            return None
+
         embeddings_json = json.dumps(embeddings.tolist())

217-217: Consider moving import to top of file if no circular dependency.

The import from app.utils.images import image_util_parse_metadata inside the function body suggests a circular import concern. If this isn't the case, moving it to the top of the file would be cleaner. If circular imports are a concern, this is an acceptable pattern.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 630a0fb and 034b58c.

📒 Files selected for processing (1)
  • backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (2)
backend/app/utils/faceSearch.py (1)
  • BoundingBox (11-15)
backend/app/utils/images.py (1)
  • image_util_parse_metadata (496-513)
🔇 Additional comments (5)
backend/app/database/faces.py (5)

31-56: LGTM!

Good use of the centralized get_db_conn() helper, proper try/except/finally for resource management, and appropriate NOT NULL constraints with foreign key cascade delete.


116-120: Edge case: 1D numpy array with single embedding might not be handled correctly.

The check at line 120 uses isinstance(embeddings, list), but a single 1D np.ndarray (shape (512,)) won't match this condition and will fall through to the else branch. This is the intended behavior. However, if someone passes a 2D array with shape (1, 512) (one face), line 116-117 converts it to a list of one array, then line 120's is_list_input becomes True, and it iterates correctly.

The logic appears sound, but the condition at line 120 could be simplified or documented more clearly since the 2D→list conversion makes the subsequent isinstance(embeddings, list) check the key discriminator.


283-307: LGTM!

Good null-check for embeddings_json, proper error handling, and consistent use of np.array() for embeddings conversion.


310-346: LGTM!

Proper null-check, consistent embeddings conversion to np.array, and good error handling with resource cleanup.


349-392: LGTM!

Good transaction management pattern: commits/rollbacks only when owning the connection, and properly delegates transaction control to the caller when an external cursor is provided. The defensive check at lines 364-365 is unreachable in normal flow but adds safety.
