fix(#1027): handle multi-face retrieval and improve DB resource management #1036
Conversation
📝 Walkthrough: This change refactors the face-related DB code in backend/app/database/faces.py.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/app/database/faces.py (3)
110-186: db_insert_face_embeddings_by_image_id(): empty-list and 2D ndarray cases cause crashes or data corruption.

- When embeddings=[] is passed, the list-detection condition at lines 133–136 fails (len check), falling into the "single face" path and crashing at db_insert_face_embeddings() line 84 (an empty list has no .tolist() method).
- When a 2D np.ndarray (N faces × D dims) is passed, it bypasses list detection (not a list), is treated as a single embedding, and .tolist() produces a nested list stored in a single database row instead of N separate rows, one per face.

Additionally, the type hints for the confidence, bbox, and cluster_id parameters should use List[Optional[...]] instead of List[...], since the code handles None elements within lists.

Proposed fix:

```diff
 def db_insert_face_embeddings_by_image_id(
     image_id: ImageId,
     embeddings: Union[FaceEmbedding, List[FaceEmbedding]],
-    confidence: Optional[Union[float, List[float]]] = None,
-    bbox: Optional[Union[BoundingBox, List[BoundingBox]]] = None,
-    cluster_id: Optional[Union[ClusterId, List[ClusterId]]] = None,
+    confidence: Optional[Union[float, List[Optional[float]]]] = None,
+    bbox: Optional[Union[BoundingBox, List[Optional[BoundingBox]]]] = None,
+    cluster_id: Optional[Union[ClusterId, List[Optional[ClusterId]]]] = None,
 ) -> Union[Optional[FaceId], List[Optional[FaceId]]]:
     """..."""
+    # Handle empty list
+    if isinstance(embeddings, list) and len(embeddings) == 0:
+        return []
+
+    # Normalize 2D array to list of 1D arrays
+    if isinstance(embeddings, np.ndarray) and embeddings.ndim == 2:
+        embeddings = [row for row in embeddings]
+
     # Handle multiple faces in one image (list of numpy arrays)
     if (
         isinstance(embeddings, list)
         and len(embeddings) > 0
         and isinstance(embeddings[0], np.ndarray)
     ):
```

Consider a follow-up optimization: for multiple faces, reuse one DB connection/transaction instead of opening one connection per face insert to reduce lock contention.
29-56: Replace direct sqlite3.connect() calls with the centralized get_db_connection() context manager to ensure foreign key constraints are enforced.

The db_insert_face_embeddings() and db_update_face_cluster_ids_batch() functions perform INSERT/UPDATE operations without enabling PRAGMA foreign_keys. Since this pragma is per-connection in SQLite, FK constraints are silently ignored in these functions. A centralized get_db_connection() context manager already exists in backend/app/database/connection.py (which enables FK constraints and other integrity pragmas on every connection) and is used by albums.py. Refactor faces.py to use this pattern instead of direct sqlite3.connect() calls.
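As a rough illustration only, a context manager of the kind described could look like the sketch below. The actual get_db_connection() in connection.py may differ; the timeout value and pragma list here are assumptions.

```python
import sqlite3
from contextlib import contextmanager

# Sketch only: the real helper lives in backend/app/database/connection.py
# and may enable additional integrity pragmas.
@contextmanager
def get_db_connection(db_path: str):
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        # Foreign keys must be enabled per connection in SQLite.
        conn.execute("PRAGMA foreign_keys = ON")
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```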
188-292: get_all_face_embeddings() can return None embeddings, causing runtime crashes in faceSearch.py.

The function correctly returns per-face dicts (compatible with the actual usage in faceSearch.py lines 70-81), but embeddings can be None when the database column is null. Line 79 of faceSearch.py passes image["embeddings"] directly to FaceNet_util_cosine_similarity(), which calls np.dot() and np.linalg.norm(); both will crash with TypeError if embeddings is None. Add a null check before computing similarity, or ensure embeddings is always a list.

Separately, silently dropping faces via continue on JSONDecodeError (line 266) without logging the face_id obscures data corruption. Add logging to surface which faces failed to decode.
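A minimal sketch of the suggested caller-side guard follows. cosine_similarity here is a local stand-in for FaceNet_util_cosine_similarity, and the per-face dict shape follows the description above; both are assumptions, not the repository's code.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Stand-in with the same np.dot / np.linalg.norm core that would fail on None.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_embedding: np.ndarray, faces: list) -> float:
    """Sketch of the suggested guard: skip faces whose embeddings are None."""
    best = -1.0
    for face in faces:
        embedding = face.get("embeddings")
        if embedding is None:
            # NULL embeddings column: skip instead of crashing inside np.dot().
            continue
        best = max(best, cosine_similarity(query_embedding, np.asarray(embedding)))
    return best
```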
🧹 Nitpick comments (5)
backend/app/database/faces.py (5)
59-108: db_insert_face_embeddings(): good close/rollback, but validate non-numpy inputs before .tolist().

If a caller accidentally passes a plain list (or an np.ndarray with an unexpected shape), .tolist() or downstream logic may behave unexpectedly; given the PR's "type mismatch / robustness" objective, a small guard helps.

Proposed guard + FK enable if you don't add a shared helper:

```diff
@@
-    conn = sqlite3.connect(DATABASE_PATH)
+    conn = sqlite3.connect(DATABASE_PATH, timeout=30)
+    conn.execute("PRAGMA foreign_keys = ON")
     cursor = conn.cursor()
-    embeddings_json = json.dumps(embeddings.tolist())
+    if not isinstance(embeddings, np.ndarray):
+        raise TypeError("embeddings must be a numpy.ndarray")
+    embeddings_json = json.dumps(embeddings.tolist())
```
294-324: Good: connection always closes; consider catching JSON decode errors too.

json.loads(embeddings_json) can raise JSONDecodeError, which isn't a sqlite3.Error; right now that will bubble up (but at least the finally will still close the connection).
326-371: Good: connection always closes; the same JSONDecodeError robustness applies here.

If a row contains malformed embeddings, json.loads() will raise and you'll lose the whole response rather than skipping/logging one bad row.
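One way the multi-row getters could degrade gracefully is sketched below; the (face_id, embeddings_json) row shape and the helper name are assumptions for illustration.

```python
import json
import logging
import numpy as np

logger = logging.getLogger(__name__)

def decode_rows(rows):
    """Sketch: rows are assumed to be (face_id, embeddings_json) tuples from the cursor."""
    faces = []
    for face_id, embeddings_json in rows:
        try:
            embedding = np.array(json.loads(embeddings_json))
        except (json.JSONDecodeError, TypeError):
            # Log and skip the corrupted row instead of losing the whole response.
            logger.warning("Skipping face %s: malformed embeddings JSON", face_id)
            continue
        faces.append({"face_id": face_id, "embeddings": embedding})
    return faces
```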
373-426: Batch update: validate face_id presence before executing.

Right now missing/None face_id entries will produce (cluster_id, None) updates that no-op silently; that can hide upstream bugs. Consider raising if face_id is None.
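A small sketch of the suggested validation; the (cluster_id, face_id) pair layout and the helper name are assumptions based on the comment above.

```python
def validate_cluster_updates(updates):
    """Sketch: updates are assumed to be (cluster_id, face_id) pairs fed to executemany()."""
    for cluster_id, face_id in updates:
        if face_id is None:
            # A None face_id silently matches no rows in the UPDATE; surface it instead.
            raise ValueError(f"Missing face_id for cluster {cluster_id!r}")
    return updates
```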
428-483: Mean embeddings: np.stack can be memory-heavy; also add JSONDecodeError handling.

If clusters get large, stacking every embedding can spike memory; an incremental mean (or np.mean over a preallocated array) can help later. Separately, malformed JSON will currently raise (not caught as a sqlite3.Error).
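A sketch of the incremental-mean idea, assuming an iterable of equal-length 1D vectors per cluster; this avoids materializing np.stack(embeddings).

```python
import numpy as np

def incremental_mean(embeddings):
    """Sketch: running mean over an iterable of equal-length 1D vectors."""
    mean = None
    count = 0
    for emb in embeddings:
        count += 1
        vec = np.asarray(emb, dtype=np.float64)
        if mean is None:
            mean = vec.copy()
        else:
            # Running update: mean += (x - mean) / n
            mean += (vec - mean) / count
    return mean
```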
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (3)
- backend/app/utils/faceSearch.py (1): BoundingBox (11-15)
- backend/test.py (1): get_all_face_embeddings (8-32)
- backend/app/utils/images.py (1): image_util_parse_metadata (496-513)
🔇 Additional comments (1)
backend/app/database/faces.py (1)
4-4: Typing import expansion is fine, but prefer minimizing Any creep.

No blocking issues here; just keep an eye on Dict[str, Any] APIs becoming de-facto untyped surfaces.
Thanks for the review! I have pushed a follow-up commit that addresses the critical issues.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/app/database/faces.py (3)
1-28: Critical: File contains duplicate imports and inconsistent type definitions.

The file has duplicate import statements (lines 1-5 vs 17-21) and duplicate type definitions with conflicting values:

| Type | Line 10/11 | Line 26/28 |
| --- | --- | --- |
| ClusterId | int | str |
| BoundingBox | Dict[str, Union[int, float]] | Dict[str, int] |

This appears to be a merge conflict or copy-paste error. The entire module is duplicated with slight variations, which will cause runtime errors due to redefined symbols and type mismatches.

Recommended fix: remove the duplicate imports and reconcile the type definitions:

```diff
 import sqlite3
 import json
 import numpy as np
 from typing import Optional, List, Dict, Union, TypedDict, Any
 from app.config.settings import DATABASE_PATH

 # Type definitions
 FaceId = int
 ImageId = str
-ClusterId = int
-BoundingBox = Dict[str, Union[int, float]]
+ClusterId = str  # Use str consistently per DB schema (TEXT)
+BoundingBox = Dict[str, Union[int, float]]  # Keep flexible for coordinates
 FaceEmbedding = np.ndarray
-
-
-class FaceData(TypedDict):
-    """Represents the full faces table structure"""
-import sqlite3
-import json
-import numpy as np
-from typing import Optional, List, Dict, Union, TypedDict, Any
-from app.config.settings import DATABASE_PATH
-
-# Type definitions
-FaceId = int
-ImageId = str
-ClusterId = str
-FaceEmbedding = np.ndarray  # 512-dim vector
-BoundingBox = Dict[str, int]  # {'x': int, 'y': int, 'width': int, 'height': int}
```
456-494: Critical: Duplicate function definitions will cause the first implementations to be overwritten.

Starting at line 456, there's a partial FaceData remnant followed by a complete second implementation of all database functions. Python will use the last definition of each function, meaning:

- The improved get_db_conn() helper (lines 40-44) exists but is never used by the second implementations.
- The 2D numpy array handling in db_insert_face_embeddings_by_image_id (lines 130-133) is overwritten.
- Foreign key enabling via get_db_conn() is lost in favor of direct sqlite3.connect() calls.

The second db_create_faces_table also has a different schema (lines 467-494 use INTEGER for cluster_id and TEXT for embeddings, while lines 47-72 use TEXT for cluster_id and JSON for embeddings).

Remove the duplicate implementations (lines 456-921) and keep only the first, improved version (lines 40-455).
1-921: File has critical syntax errors and duplicate implementations; remove the malformed section and keep only the improved first implementation.

This file is syntactically invalid and won't parse. It contains two complete duplicate implementations with conflicting schemas.

First implementation (lines 40-455) has the improvements the PR intends:

- ✅ get_db_conn() helper with PRAGMA foreign_keys
- ✅ Null checks in get_all_face_embeddings() to prevent downstream crashes
- ✅ 2D numpy array handling in db_insert_face_embeddings_by_image_id()
- ✅ Schema: cluster_id TEXT matching ClusterId = str

Second implementation (lines 467-920) is outdated:

- ❌ Direct sqlite3.connect() without the helper
- ❌ No null checks; will crash on missing embeddings
- ❌ No 2D array handling
- ❌ Schema: cluster_id INTEGER conflicts with the string type definition

To fix:

- Remove the malformed orphaned class members at lines 452-460 (incomplete TypedDict fields hanging between functions).
- Delete the entire second implementation (lines 467-920).
- Keep lines 1-451 (the first implementation), ensuring ClusterId = str remains at line 26.
🤖 Fix all issues with AI agents
In @backend/app/database/faces.py:
- Around line 750-753: The second implementation of
db_get_faces_unassigned_clusters (and the similar duplicate implementations in
db_get_all_faces_with_cluster_names and db_get_cluster_mean_embeddings) will
crash when embeddings_json is None; add the same defensive check used in the
first implementation: if not embeddings_json: continue before calling
json.loads(embeddings_json) so you never pass None into
np.array(json.loads(...)), or remove the duplicate/second implementation
entirely and keep only the already-correct first implementation to eliminate the
bug and duplication.
🧹 Nitpick comments (2)
backend/app/database/faces.py (2)
67-68: Consider: Use structured logging instead of print statements.

Throughout this file, errors are logged via print() (lines 68, 105, 268, 292, 319, 358, 404, 451, etc.). For better observability in production, consider using Python's logging module with appropriate log levels:

```python
import logging

logger = logging.getLogger(__name__)

# Instead of:
print(f"Error creating faces table: {e}")
# Use:
logger.error(f"Error creating faces table: {e}")
```
272-274: Minor: Empty list check may be unnecessary.

Given the query at lines 221-230 filters with WHERE m.name IS NOT NULL, any image_id in image_tags should have at least one tag. An empty list shouldn't occur, making this check redundant:

```python
tags = image_tags.get(image_id)
if tags is not None and len(tags) == 0:
    tags = None
```

However, keeping it as defensive code is fine.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (5)
- backend/app/utils/faceSearch.py (1): BoundingBox (11-15)
- backend/app/models/ObjectClassifier.py (1): close (27-33)
- backend/app/models/YOLO.py (1): close (34-36)
- backend/app/models/FaceNet.py (1): close (27-29)
- backend/app/utils/images.py (1): image_util_parse_metadata (496-513)
🔇 Additional comments (5)
backend/app/database/faces.py (5)
40-44: Good addition: Centralized connection helper with foreign key enforcement.

The get_db_conn() helper ensures PRAGMA foreign_keys = ON is consistently applied, which is important for data integrity with the ON DELETE CASCADE constraints.
114-164: Good handling of multi-face embeddings and 2D arrays.

The logic correctly addresses the PR objectives:

- An empty list returns early (lines 127-128).
- 2D numpy arrays of shape (N, 512) are converted to a list of N arrays (lines 132-133).
- Per-embedding extraction of confidence, bbox, and cluster_id handles both list and scalar inputs.

One minor observation: ClusterId is defined as str (line 26), and line 159 checks isinstance(cluster_id, str), which is correct. However, the second duplicate implementation (lines 595-596) checks isinstance(cluster_id, int), another inconsistency to address when removing the duplicates.
190-296: LGTM - Correctly returns a flat list and handles corrupted data.

This implementation properly addresses the core issue from #1027:

- Returns one entry per face (not grouped by image_id), so multi-face images are fully represented.
- Lines 256-265 gracefully skip faces with null/empty embeddings, preventing downstream crashes.
- The two-query approach for tags (lines 220-231) avoids a Cartesian product.
104-111: Good resource management pattern.

The try/except/finally pattern with an explicit conn.rollback() on error and conn.close() in finally blocks addresses the "Database is locked" errors mentioned in the PR objectives. This pattern is consistently applied across all database functions.
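In schematic form, the pattern being praised looks roughly like the sketch below. The SQL, table/column names, and function name are placeholders, and get_db_conn() is assumed to be the module helper described above; this is not the actual code.

```python
import sqlite3

def db_example_update(face_id, cluster_id):
    """Schematic of the resource-management pattern; names and SQL are placeholders."""
    conn = get_db_conn()  # module helper, assumed in scope
    cursor = conn.cursor()
    try:
        cursor.execute(
            "UPDATE faces SET cluster_id = ? WHERE id = ?", (cluster_id, face_id)
        )
        conn.commit()
    except sqlite3.Error as e:
        conn.rollback()
        print(f"Error updating face {face_id}: {e}")
    finally:
        # Always release the connection so other writers aren't blocked.
        conn.close()
```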
365-408: Good batch update with flexible cursor handling.

The function properly supports both:

- An external cursor (for use within larger transactions)
- Creating its own connection when called standalone

The defensive check at lines 380-381 is technically redundant after lines 376-378, but serves as a safety net for future refactoring.
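A paraphrased sketch of that dual-mode pattern follows; the function name, signature, and SQL are assumptions, not the repository's implementation.

```python
import sqlite3

def batch_update_sketch(updates, cursor=None):
    """Use the caller's cursor when given one; otherwise open (and own) a connection."""
    owns_connection = cursor is None
    conn = None
    if owns_connection:
        conn = get_db_conn()  # module helper, assumed in scope
        cursor = conn.cursor()
    try:
        cursor.executemany("UPDATE faces SET cluster_id = ? WHERE id = ?", updates)
        if owns_connection:
            # Only commit when we own the transaction; otherwise the caller decides.
            conn.commit()
    except sqlite3.Error:
        if owns_connection:
            conn.rollback()
        raise
    finally:
        if owns_connection:
            conn.close()
```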
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @backend/app/database/faces.py:
- Line 395: The return type for db_get_cluster_mean_embeddings is wrong: replace
the int union with the ClusterId alias (a str) so the signature reads a list of
dicts mapping "cluster_id" to ClusterId and "mean_embedding" to FaceEmbedding
(np.ndarray); update the type annotation in the db_get_cluster_mean_embeddings
definition to use ClusterId instead of int and ensure ClusterId and
FaceEmbedding are imported/available in that module.
- Around line 260-272: The faces.append block currently stores embeddings_json
(a Python list from json.loads) which is inconsistent with other functions like
db_get_faces_unassigned_clusters and db_get_all_faces_with_cluster_names that
return numpy arrays; change the value stored under "embeddings" in that
faces.append (where embeddings_json is created) to a numpy array (e.g.,
np.array(json.loads(...))) so all getters return the same type and callers
receive np.ndarray consistently.
- Around line 10-12: Update the BoundingBox type alias from Dict[str, int] to
use floats to match the FaceDetector outputs and the Pydantic model: change
BoundingBox to Dict[str, float] (or a more specific TypedDict with float-typed
'x', 'y', 'width', 'height') and ensure any imports (e.g., typing.Dict or
typing.TypedDict) are adjusted accordingly; keep the existing ClusterId and
FaceEmbedding aliases unchanged.
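If the more specific TypedDict route suggested above is taken, it could look like this sketch; this is an illustration, not the definition currently in the file.

```python
from typing import TypedDict

class BoundingBox(TypedDict):
    """Sketch of a float-typed bounding box matching the FaceDetector's float outputs."""
    x: float
    y: float
    width: float
    height: float
```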
🧹 Nitpick comments (3)
backend/app/database/faces.py (3)
24-28: Add return type annotation for consistency.

The function is missing a return type annotation. For consistency with the rest of the codebase and to help static analysis tools:

```diff
-def get_db_conn():
+def get_db_conn() -> sqlite3.Connection:
```
69-76: Potential AttributeError if embeddings is None or invalid.

Line 74 calls embeddings.tolist() without validation. If a caller passes None or a non-ndarray value (despite the type hint), this will raise an AttributeError. Given the PR's goal of preventing crashes on edge-case inputs, consider adding a guard:

```diff
     try:
         conn = get_db_conn()
         cursor = conn.cursor()
+        if embeddings is None or not isinstance(embeddings, np.ndarray):
+            print(f"Invalid embeddings for image {image_id}")
+            return None
+
         embeddings_json = json.dumps(embeddings.tolist())
```
217-217: Consider moving the import to the top of the file if there's no circular dependency.

The import from app.utils.images import image_util_parse_metadata inside the function body suggests a circular import concern. If that isn't the case, moving it to the top of the file would be cleaner. If circular imports are a concern, this is an acceptable pattern.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/database/faces.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/database/faces.py (2)
- backend/app/utils/faceSearch.py (1): BoundingBox (11-15)
- backend/app/utils/images.py (1): image_util_parse_metadata (496-513)
🔇 Additional comments (5)
backend/app/database/faces.py (5)
31-56: LGTM!

Good use of the centralized get_db_conn() helper, proper try/except/finally for resource management, and appropriate NOT NULL constraints with foreign key cascade delete.
116-120: Edge case: a 1D numpy array with a single embedding might not be handled correctly.

The check at line 120 uses isinstance(embeddings, list), but a single 1D np.ndarray (shape (512,)) won't match this condition and will fall through to the else branch. This is the intended behavior. However, if someone passes a 2D array with shape (1, 512) (one face), lines 116-117 convert it to a list of one array, then line 120's is_list_input becomes True, and it iterates correctly.

The logic appears sound, but the condition at line 120 could be simplified or documented more clearly, since the 2D-to-list conversion makes the subsequent isinstance(embeddings, list) check the key discriminator.
283-307: LGTM!

Good null-check for embeddings_json, proper error handling, and consistent use of np.array() for embeddings conversion.
310-346: LGTM!

Proper null-check, consistent conversion of embeddings to np.array, and good error handling with resource cleanup.
349-392: LGTM!

Good transaction management pattern: commits/rollbacks only when owning the connection, and properly delegates transaction control to the caller when an external cursor is provided. The defensive check at lines 364-365 is unreachable in normal flow but adds safety.
🐛 Problem

The get_all_face_embeddings() function was previously designed to group results by image_id. The if image_id not in images_dict check caused the loop to skip the 2nd and 3rd faces if a single image contained multiple faces.

🛠️ Solution

I have refactored backend/app/database/faces.py with the following improvements:

1. Logic Fix (Multi-face Retrieval): Results are no longer grouped by image_id; the function returns one entry per face, so every face in a multi-face image is retrieved.

2. Database Safety (Resource Management): Added try...finally blocks to all database functions so that conn.close() is always called, preventing "Database is locked" errors and memory leaks during high-concurrency operations.

3. Type Safety: db_insert_face_embeddings_by_image_id now handles both single np.ndarray inputs and List[np.ndarray] inputs, preventing crashes if the AI model output format varies.

📸 Technical Implementation

Before (Buggy Logic):
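The original snippet is not reproduced here; the following is a paraphrased sketch of the described behavior, with the row shape and function name assumed for illustration.

```python
import json

def get_all_face_embeddings_buggy(rows):
    """Paraphrased sketch of the old grouping logic; rows are assumed to be
    (face_id, image_id, embeddings_json) tuples from the cursor."""
    images_dict = {}
    for face_id, image_id, embeddings_json in rows:
        if image_id not in images_dict:  # only the first face per image passes
            images_dict[image_id] = {
                "image_id": image_id,
                "embeddings": json.loads(embeddings_json),
            }
        # 2nd and 3rd faces for the same image are silently skipped here
    return list(images_dict.values())
```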