Adaptive Density-Aware Face Clustering

### Background

A contributor reported an unexpected clustering inconsistency: the same two people's images produce two separate clusters when processed alongside a larger photo library, but collapse into a single (incorrect) cluster when processed in isolation. Same images, different result, purely depending on folder composition.

The root cause is a known DBSCAN limitation called the **Global vs. Local Density Problem** (Gan et al., 2022, *Information Sciences*). A static `eps` radius is only appropriate relative to the density of the embedding set it is applied to. In a sparse isolated dataset, the same `eps` spans across identity boundaries it would not touch in a denser combined dataset. A follow-up audit of the codebase confirmed that several of the fixes this problem calls for are not yet implemented.

### Current State

| Item | Status |
| --- | --- |
| `min_samples=1` chaining problem | Fixed: now `2` in `face_clusters.py` |
| Face quality gate | Not implemented |
| Adaptive `eps` via k-NN estimation | Not implemented |
| Clustering params in config | All hardcoded as Python defaults |
| Clustering regression tests | None exist |

The five affected parameters - `eps`, `min_samples`, `similarity_threshold`, `merge_threshold`, `conf_threshold` - are scattered as hardcoded defaults across `face_clusters.py` and `FaceDetector.__init__()`. None are wired to `settings.py` or overridable without editing source code.
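
As a rough illustration of what a centralised block could look like (the shape Phase 1 of the plan below proposes), here is a minimal sketch. The `PICTOPY_CLUSTERING_` env-var prefix and the `from_env` helper are hypothetical, and only the `eps` and `min_samples` defaults come from this issue; the other three defaults are placeholders, not the project's actual values.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClusteringConfig:
    eps: float = 0.75          # current hardcoded value cited in this issue
    min_samples: int = 2       # current value in face_clusters.py per this issue
    # Placeholder defaults below: NOT the project's real values.
    similarity_threshold: float = 0.5
    merge_threshold: float = 0.5
    conf_threshold: float = 0.5

    @classmethod
    def from_env(cls, prefix="PICTOPY_CLUSTERING_", environ=None):
        """Build a config, letting env vars such as PICTOPY_CLUSTERING_EPS override defaults."""
        env = os.environ if environ is None else environ
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                # Cast the string using the type of the field's default (float or int).
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)


# Usage: defaults apply unless an override is present in the environment.
config = ClusteringConfig.from_env(environ={"PICTOPY_CLUSTERING_EPS": "0.6"})
```

Keeping the dataclass frozen means a run cannot mutate its parameters halfway through, which makes clustering results easier to reproduce.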

### Proposed Solution

**9.1 - Face quality gate**

Low-quality "bridge point" embeddings are a direct contributor to incorrect cluster merges. This adds a pre-clustering quality gate that runs after YOLO detection and before the FaceNet embedding step. Three checks are applied per detected face: 
- **Blur**: Laplacian variance on the cropped face region, rejected below a configurable threshold
- **Size**: Bounding box area, rejected below a configurable minimum
- **Completeness**: YOLO confidence score, rejected below `conf_threshold` (no new model dependency)

Rejected faces are counted and surfaced to the user via a non-blocking UI alert ("X faces skipped due to quality") rather than silently discarded.
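
The three checks could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the `Detection` container, the argument names, and the grayscale-array input are hypothetical, and the Laplacian is computed with a plain NumPy 4-neighbour stencil to keep the example dependency-free (in practice `cv2.Laplacian(gray, cv2.CV_64F).var()` is the usual way to get this score).

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixels; hypothetical layout
    confidence: float   # detector confidence score


def laplacian_variance(gray):
    """Variance of a 4-neighbour discrete Laplacian; low values suggest blur."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def face_quality_gate(gray_image, detections, blur_threshold, min_area, conf_threshold):
    """Return (kept_detections, faces_skipped) after size, confidence and blur checks."""
    kept = []
    faces_skipped = 0
    for det in detections:
        x1, y1, x2, y2 = det.box
        crop = gray_image[y1:y2, x1:x2]
        area = max(0, x2 - x1) * max(0, y2 - y1)
        # Crops under 3 px on a side have no interior pixels for the stencil.
        too_small = area < min_area or crop.shape[0] < 3 or crop.shape[1] < 3
        if (too_small
                or det.confidence < conf_threshold
                or laplacian_variance(crop) < blur_threshold):
            faces_skipped += 1   # counted, not silently dropped
        else:
            kept.append(det)
    return kept, faces_skipped
```

Returning the skipped count alongside the kept detections is what lets the caller pass it up to the UI alert rather than losing it inside the utility.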

**9.2 - Adaptive `eps` via k-NN distance estimation**

Replaces the single hardcoded `eps = 0.75` with a per-run estimation step. After the quality gate, k-NN distances are computed across the current embedding set using `k = min_samples`. A dataset-appropriate `eps` is derived from the distance distribution, making cluster quality a function of embedding geometry rather than folder composition. Falls back to the config default if the embedding set is too small for reliable estimation.
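
A minimal sketch of the estimation step is shown below. The issue does not fix how `eps` is derived from the distance distribution, so the choices here are assumptions made for illustration: Euclidean distance, a fixed percentile of the k-th neighbour distances (a simple stand-in for knee detection on the sorted k-distance curve), and a `min_points` cutoff for the fallback. The parameter names other than `eps` and `min_samples` are hypothetical.

```python
import numpy as np


def estimate_eps(embeddings, min_samples, default_eps, percentile=90.0, min_points=10):
    """Estimate eps from the distribution of k-th nearest-neighbour distances.

    Falls back to default_eps when the set is too small for a reliable estimate.
    percentile and min_points are illustrative assumptions, not project values.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    n = X.shape[0]
    if n < max(min_points, min_samples + 1):
        return default_eps

    # Pairwise Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))   # clip tiny negatives from rounding
    np.fill_diagonal(dist, 0.0)

    # After sorting each row, column 0 is the point itself, so column k is
    # the distance to the k-th nearest neighbour with k = min_samples.
    dist.sort(axis=1)
    kth_distances = dist[:, min_samples]
    return float(np.percentile(kth_distances, percentile))
```

Because the estimate is taken from the data itself, uniformly scaling the embedding set scales the resulting `eps` by the same factor. The dense distance matrix is O(n²) in memory; for large libraries something like scikit-learn's `NearestNeighbors.kneighbors` would likely be a better fit.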

<img width="513" height="534" alt="Image" src="https://github.com/user-attachments/assets/f0361705-47c3-495c-ad10-7e65f569fb31" />

An in-depth explanation of each of these implementations can be found at: https://www.notion.so/Google-Summer-of-Code-Project-Proposal-for-AOSSIE-PictoPy-31f0567ec53f80bebad2c92acfe5f429?source=copy_link

### Implementation Plan

- [ ] **Phase 1** - Centralise all five clustering parameters into a `ClusteringConfig` block in `settings.py` with env-var overrides. No functional change, refactor only.
- [ ] **Phase 2** - Implement `face_quality_gate()` in a new `face_quality.py` utility. Insert between YOLO output and FaceNet crop/embed. Return `faces_skipped` count up the call stack instead of discarding it.
- [ ] **Phase 3** - Implement `estimate_eps()` using k-NN distance distribution on the post-gate embedding set. `k = min_samples` read from config.
- [ ] **Phase 4** - Extend `GlobalReclusterData` Pydantic model with `faces_skipped: Optional[int]`. Wire the count from utility functions through the FastAPI route into the response envelope.
- [ ] **Phase 5** - Add `faces_skipped` to the frontend TypeScript type. Trigger a ShadCN corner toast in `ApplicationControlsCard.tsx` when `faces_skipped > 0`.
- [ ] **Phase 6** - Open a new issue to add pytest regression tests using synthetic embedding fixtures (no image assets). Assert the folder-size bug is resolved and cluster count is stable across dataset sizes.
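
For Phase 6, synthetic fixtures could be generated along these lines. This is only a sketch: the function names, the 128-dimensional default, and the `spread` value are assumptions, and `cluster_fn` stands in for whatever clustering entry point the real tests would call.

```python
import numpy as np


def make_identity_embeddings(n_identities, per_identity, dim=128, spread=0.02, seed=0):
    """Return (embeddings, labels): unit-norm points scattered around random identity centres."""
    rng = np.random.default_rng(seed)
    centres = rng.normal(size=(n_identities, dim))
    centres /= np.linalg.norm(centres, axis=1, keepdims=True)
    labels = np.repeat(np.arange(n_identities), per_identity)
    points = centres[labels] + spread * rng.normal(size=(labels.size, dim))
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    return points, labels


def assert_cluster_count_stable(cluster_fn, sizes=(2, 10, 50), per_identity=20):
    """Check that cluster_fn finds one cluster per identity at every dataset size.

    cluster_fn maps an (n, dim) array to integer labels, with -1 meaning noise
    (the DBSCAN convention).
    """
    for n_identities in sizes:
        X, _ = make_identity_embeddings(n_identities, per_identity)
        predicted = np.asarray(cluster_fn(X))
        found = len(set(predicted[predicted >= 0].tolist()))
        assert found == n_identities, (
            f"expected {n_identities} clusters, got {found}"
        )
```

Running the same assertion at two identities and at fifty targets the reported failure mode directly: the cluster count for a given set of people should not depend on how many other people share the folder.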

### References

- Gan et al. (2022). *A hybrid method combining DBSCAN and K-nearest neighbors for improved density-based clustering.* Information Sciences. https://www.sciencedirect.com/science/article/abs/pii/S0020025521008367
- Discord bug report: https://discord.com/channels/1022871757289422898/1311271974630330388/1452731993254002779
