Adaptive Density-Aware Face Clustering #1271

@rohan-pandeyy

Background

A contributor reported an unexpected clustering inconsistency: the same two people's images produce two separate clusters when processed alongside a larger photo library, but collapse into a single (incorrect) cluster when processed in isolation. Same images, different result, purely depending on folder composition.

The root cause is a known DBSCAN limitation called the Global vs. Local Density Problem (Gan et al., 2022, Information Sciences). A single static eps radius is applied globally, so whether it suits a run depends on the density of the whole embedding set being clustered. In a sparse isolated dataset, the same eps spans across identity boundaries it would not touch in a denser combined dataset. A follow-up audit of the codebase confirmed that several of the fixes this problem calls for are not yet implemented.

Current State

| Item | Status |
| --- | --- |
| min_samples=1 chaining problem | Fixed: now 2 in face_clusters.py |
| Face quality gate | Not implemented |
| Adaptive eps via k-NN estimation | Not implemented |
| Clustering params in config | All hardcoded as Python defaults |
| Clustering regression tests | None exist |

The five affected parameters - eps, min_samples, similarity_threshold, merge_threshold, conf_threshold - are scattered as hardcoded defaults across face_clusters.py and FaceDetector.__init__(). None are wired to settings.py or overridable without editing source code.
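As a rough illustration of what wiring these parameters to configuration could look like, here is a minimal sketch using a frozen dataclass with environment-variable overrides. The class layout, the `from_env` helper, and the `PICTOPY_CLUSTER_` prefix are assumptions made for this example, not existing code. Only `eps = 0.75` and `min_samples = 2` come from this issue; the other three defaults are placeholders and would need to be copied from the current source.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class ClusteringConfig:
    eps: float = 0.75                   # current hardcoded value per this issue
    min_samples: int = 2                # current value per this issue
    similarity_threshold: float = 0.85  # placeholder, not the real default
    merge_threshold: float = 0.70       # placeholder, not the real default
    conf_threshold: float = 0.45        # placeholder, not the real default

    @classmethod
    def from_env(
        cls,
        env: Mapping[str, str] = os.environ,
        prefix: str = "PICTOPY_CLUSTER_",  # hypothetical env-var prefix
    ) -> "ClusteringConfig":
        """Build a config, letting e.g. PICTOPY_CLUSTER_EPS override `eps`."""
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                # Cast the string to the type of the field's default value.
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)
```

With this shape, `ClusteringConfig.from_env()` returns the defaults when no variables are set, so call sites can switch to it without any change in behaviour.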

Proposed Solution

9.1 - Face quality gate

Low-quality "bridge point" embeddings are a direct contributor to incorrect cluster merges. This adds a pre-clustering quality gate that runs after YOLO detection and before the FaceNet embedding step. Three checks are applied per detected face:

  • Blur: Laplacian variance on the cropped face region, rejected below a configurable threshold
  • Size: Bounding box area, rejected below a configurable minimum
  • Completeness: YOLO confidence score, rejected below conf_threshold (reuses the existing detector output, so no new model dependency)

Rejected faces are counted and surfaced to the user via a non-blocking UI alert ("X faces skipped due to quality") rather than silently discarded.
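The three checks and the skipped-face count could be sketched roughly as follows. This is a dependency-free illustration, not the proposed implementation: the `Detection` type, the function signature, and all threshold defaults are assumptions, and the image is taken to be a grayscale 2D list. In practice the blur score would more likely come from OpenCV, for example `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import List, Sequence, Tuple

Gray = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


@dataclass
class Detection:  # hypothetical stand-in for one YOLO face detection
    box: Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels
    confidence: float


def laplacian_variance(gray: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    if len(gray) < 3 or len(gray[0]) < 3:
        return 0.0  # too small to have interior pixels; treat as blurry
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def face_quality_gate(
    gray: Gray,
    detections: Sequence[Detection],
    blur_threshold: float = 100.0,  # placeholder value
    min_area: int = 1600,           # placeholder value (40 x 40 px)
    conf_threshold: float = 0.45,   # placeholder value
) -> Tuple[List[Detection], int]:
    """Return (faces that pass all three checks, number of faces skipped)."""
    kept: List[Detection] = []
    faces_skipped = 0
    for det in detections:
        x1, y1, x2, y2 = det.box
        area = max(0, x2 - x1) * max(0, y2 - y1)
        crop = [row[x1:x2] for row in gray[y1:y2]]
        if (
            det.confidence < conf_threshold
            or area < min_area
            or laplacian_variance(crop) < blur_threshold
        ):
            faces_skipped += 1
        else:
            kept.append(det)
    return kept, faces_skipped
```

Returning the count alongside the surviving detections keeps the gate side-effect free and lets the caller decide how to surface it.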

9.2 - Adaptive eps via k-NN distance estimation

Replaces the single hardcoded eps = 0.75 with a per-run estimation step. After the quality gate, k-NN distances are computed across the current embedding set using k = min_samples. A dataset-appropriate eps is derived from the distance distribution, making cluster quality a function of embedding geometry rather than folder composition. Falls back to the config default if the embedding set is too small for reliable estimation.
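One way the estimation step might look is sketched below. The issue does not specify how eps is derived from the distance distribution, so this example assumes a simple rule: take a high percentile of each point's distance to its k-th nearest neighbour. The percentile, the `min_points` cutoff, and the use of Euclidean distance are all assumptions for illustration; a knee-point search on the sorted k-distance curve is a common alternative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def kth_neighbor_distances(embeddings: Sequence[Vector], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    result = []
    for i, a in enumerate(embeddings):
        dists = sorted(
            math.dist(a, b) for j, b in enumerate(embeddings) if j != i
        )
        result.append(dists[k - 1])
    return result


def estimate_eps(
    embeddings: Sequence[Vector],
    min_samples: int = 2,
    default_eps: float = 0.75,  # config default used as the fallback
    min_points: int = 10,       # assumed cutoff for "too small to estimate"
    percentile: float = 0.9,    # assumed rule, not taken from the issue
) -> float:
    """Derive eps from the k-NN distance distribution, with k = min_samples."""
    n = len(embeddings)
    if n < max(min_points, min_samples + 1):
        return default_eps
    k_dists = sorted(kth_neighbor_distances(embeddings, min_samples))
    index = min(n - 1, int(percentile * n))
    return k_dists[index]
```

The brute-force neighbour search is quadratic in the number of faces, which is fine for a sketch; a real implementation would probably use an indexed k-NN query instead.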

An in-depth explanation of each of these implementations can be found at: https://www.notion.so/Google-Summer-of-Code-Project-Proposal-for-AOSSIE-PictoPy-31f0567ec53f80bebad2c92acfe5f429?source=copy_link

Implementation Plan

  • Phase 1 - Centralise all five clustering parameters into a ClusteringConfig block in settings.py with env-var overrides. No functional change, refactor only.
  • Phase 2 - Implement face_quality_gate() in a new face_quality.py utility. Insert it between the YOLO output and the FaceNet crop/embed step. Return the faces_skipped count up the call stack instead of discarding it.
  • Phase 3 - Implement estimate_eps() using k-NN distance distribution on the post-gate embedding set. k = min_samples read from config.
  • Phase 4 - Extend GlobalReclusterData Pydantic model with faces_skipped: Optional[int]. Wire the count from utility functions through the FastAPI route into the response envelope.
  • Phase 5 - Add faces_skipped to the frontend TypeScript type. Trigger a ShadCN corner toast in ApplicationControlsCard.tsx when faces_skipped > 0.
  • Phase 6 - Open a new issue to add pytest regression tests using synthetic embedding fixtures (no image assets). Assert the folder-size bug is resolved and cluster count is stable across dataset sizes.
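For Phase 6, a synthetic-fixture regression test might take roughly the following shape. Everything here is illustrative: `make_identity`, `identities_found`, and the test name are hypothetical, and `threshold_cluster` is only a stand-in clusterer (connected components of an eps-neighbour graph) so that the sketch runs on its own. The real test would call the project's clustering entry point instead.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def make_identity(axis: int, n: int, rng: random.Random,
                  dim: int = 8, noise: float = 0.02) -> List[Vector]:
    """n unit vectors scattered tightly around basis vector `axis`."""
    faces = []
    for _ in range(n):
        v = [rng.gauss(0.0, noise) for _ in range(dim)]
        v[axis] += 1.0
        norm = math.sqrt(sum(x * x for x in v))
        faces.append([x / norm for x in v])
    return faces


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Inputs are unit vectors, so the dot product is the cosine similarity.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def threshold_cluster(points: Sequence[Vector], eps: float = 0.3) -> List[int]:
    """Stand-in clusterer: label points by eps-graph connected component."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if cosine_distance(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def identities_found(cluster_fn: Callable[[Sequence[Vector]], List[int]],
                     target: List[Vector], library: List[Vector]) -> int:
    """Distinct labels given to the target faces when clustered with library."""
    labels = cluster_fn(target + library)
    return len(set(labels[: len(target)]))


def test_cluster_count_is_stable_across_dataset_sizes() -> None:
    rng = random.Random(0)  # fixed seed keeps the fixture deterministic
    target = make_identity(0, 5, rng) + make_identity(1, 5, rng)
    library = [f for axis in range(2, 8) for f in make_identity(axis, 20, rng)]
    alone = identities_found(threshold_cluster, target, [])
    combined = identities_found(threshold_cluster, target, library)
    assert alone == combined == 2
```

Because the fixtures are generated from a seeded random generator, the test needs no image assets and gives the same embeddings on every run.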
