Merged
Changes from 14 commits
Commits
24 commits
9e89446
Add mpp correction factor and highest res .dcm selection.
blanca-pablos Oct 21, 2025
55c4343
Update wsi info command in README
blanca-pablos Nov 20, 2025
7fa9997
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
helmut-hoffer-von-ankershoffen Nov 22, 2025
8f9c048
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
blanca-pablos Nov 28, 2025
d13b8a1
Add tests for DCM pyramid selection
blanca-pablos Nov 28, 2025
ed6d28e
Fix mpp factor being applied to DICOM
blanca-pablos Nov 29, 2025
fc728b5
Remove line recalculating mpp factor
blanca-pablos Nov 29, 2025
477333d
Make variable lowercase to pass ruff
blanca-pablos Nov 29, 2025
9d08efd
docs(wsi): Update with DICOM filtering logic.
blanca-pablos Nov 29, 2025
e03e722
fix(application): Fix wrong image size in multi-file DICOM filtering.
blanca-pablos Dec 1, 2025
d17fdcd
fix(wsi): Revert changes to apply MPP factor, not in scope.
blanca-pablos Dec 1, 2025
98179aa
fix(application): Lint
blanca-pablos Dec 1, 2025
ea448b2
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
helmut-hoffer-von-ankershoffen Dec 2, 2025
ec0c6de
fix(application): Filter out non-WSI .dcm files
blanca-pablos Dec 2, 2025
291f1e6
task(application): Address Oliver's review.
blanca-pablos Dec 4, 2025
326dfa6
task(wsi): Move DICOM filtering logic to pydicom handler, align CLI.
blanca-pablos Dec 5, 2025
fa54a37
task(docs): Update docs after move to WSI
blanca-pablos Dec 5, 2025
5adaeee
task(testing): Clean up redundant tests
blanca-pablos Dec 5, 2025
a8729c4
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
helmut-hoffer-von-ankershoffen Dec 7, 2025
97c5bfe
chore(wsi): Refactor scan_files to pass SonarCloud complexity check
blanca-pablos Dec 8, 2025
ec357f8
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
olivermeyer Dec 9, 2025
2a83702
Merge branch 'main' into fix/legacy-mpp-multiple-dcm-files
blanca-pablos Dec 10, 2025
a13ce8b
fix(wsi): Remove PydicomHandler dependency from dcm file scanning
blanca-pablos Dec 10, 2025
dedb618
Update doc
blanca-pablos Dec 10, 2025
2 changes: 1 addition & 1 deletion src/aignostics/CLAUDE.md
@@ -246,7 +246,7 @@ aignostics application run submit --application-id heta --files slide.svs
aignostics dataset download --collection-id TCGA-LUAD --output-dir ./data

# Get WSI info
-aignostics wsi info slide.svs
+aignostics wsi inspect slide.svs
```

## GUI Launch
76 changes: 76 additions & 0 deletions src/aignostics/application/CLAUDE.md
@@ -163,6 +163,82 @@ APPLICATION_RUN_UPLOAD_CHUNK_SIZE = 1024 * 1024 # 1MB
APPLICATION_RUN_DOWNLOAD_SLEEP_SECONDS = 5 # Wait between status checks
```

### DICOM Pyramid Handling

**Multi-File DICOM Pyramid Filtering (`_service.py`):**

The service automatically handles multi-file DICOM pyramids (whole slide images stored across multiple DICOM instances) by selecting only the highest resolution file from each pyramid. This prevents redundant processing since OpenSlide can automatically find related pyramid files in the same directory.
```python
@staticmethod
def _filter_dicom_pyramid_files(source_directory: Path) -> set[Path]:
    """Filter DICOM files to keep only one representative per pyramid.

    For multi-file DICOM pyramids (WSI images split across multiple instances),
    keeps only the highest resolution file. Excludes segmentations, annotations,
    thumbnails, and other non-image DICOM files.

    Filtering Strategy:
    1. SOPClassUID filtering - Only process VL Whole Slide Microscopy Image Storage
       - Include: 1.2.840.10008.5.1.4.1.1.77.1.6 (VL WSI)
       - Exclude: 1.2.840.10008.5.1.4.1.1.66.4 (Segmentation Storage)
       - Exclude: Other non-WSI DICOM types

    2. ImageType filtering - Exclude auxiliary images
       - THUMBNAIL, LABEL, OVERVIEW, MACRO, ANNOTATION, LOCALIZER

    3. PyramidUID grouping - Group multi-file pyramids
       - Files with same PyramidUID are part of one logical WSI
       - Files without PyramidUID are treated as standalone WSIs

    4. Resolution selection - Keep highest resolution per pyramid
       - Based on TotalPixelMatrixRows × TotalPixelMatrixColumns
       - Excludes all lower resolution levels

    Used automatically in: generate_metadata_from_source_directory()
    """
```

**Key Behaviors:**

- **SOPClassUID validation**: Only processes VL Whole Slide Microscopy Image Storage files (1.2.840.10008.5.1.4.1.1.77.1.6)
- **Non-WSI exclusion**: Automatically excludes segmentations (1.2.840.10008.5.1.4.1.1.66.4), annotations, and other DICOM object types
- **ImageType filtering**: Excludes THUMBNAIL, LABEL, OVERVIEW, MACRO, ANNOTATION, and LOCALIZER image types
- **PyramidUID grouping**: Groups files by PyramidUID (DICOM tag identifying multi-resolution pyramids)
- **Resolution selection**: For each pyramid, keeps only the file with largest TotalPixelMatrixRows × TotalPixelMatrixColumns
- **Standalone handling**: Files without PyramidUID are treated as standalone WSI images and preserved
- **Graceful degradation**: Files with missing attributes are logged and treated as standalone (not excluded)
- **Debug logging**: Excluded files are logged at DEBUG level with pyramid/exclusion details
- **Pre-processing filter**: Filtering occurs before metadata generation, checksum calculation, or upload
- **Format-specific**: Only affects DICOM files (`.dcm` extension); other formats (SVS, TIFF) unaffected
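
The SOPClassUID and ImageType checks listed above can be illustrated with a small predicate. This is a minimal sketch that works on plain values rather than a pydicom dataset; `is_primary_wsi`, `VL_WSI_SOP_CLASS_UID`, and `AUXILIARY_IMAGE_TYPES` are hypothetical names chosen for illustration and are not part of the SDK.

```python
from __future__ import annotations

from collections.abc import Iterable

# UID and ImageType values are taken from the behaviors listed above.
VL_WSI_SOP_CLASS_UID = "1.2.840.10008.5.1.4.1.1.77.1.6"
AUXILIARY_IMAGE_TYPES = frozenset(
    {"THUMBNAIL", "LABEL", "OVERVIEW", "MACRO", "ANNOTATION", "LOCALIZER"}
)


def is_primary_wsi(sop_class_uid: str, image_type: Iterable[str] | None) -> bool:
    """Return True if the header values describe a primary WSI image.

    Hypothetical helper: a file passes only if it has the VL WSI SOP class
    and none of its ImageType values marks it as an auxiliary image.
    """
    if sop_class_uid != VL_WSI_SOP_CLASS_UID:
        return False
    if image_type is None:
        # No ImageType attribute: nothing marks the file as auxiliary.
        return True
    values = {value.upper() for value in image_type}
    return values.isdisjoint(AUXILIARY_IMAGE_TYPES)
```

The comparison is case-insensitive, mirroring the `upper()` normalization in the service code.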

**DICOM WSI Structure:**

In the DICOM Whole Slide Imaging standard:
- **PyramidUID**: Uniquely identifies a single multi-resolution pyramid that may span multiple files
- **SeriesInstanceUID**: Groups related images (may include multiple pyramids, thumbnails, labels)
- **TotalPixelMatrixRows/Columns**: Represents full image dimensions at the highest resolution level

**Example Scenario:**
```
Input Directory:
├── pyramid_level_0.dcm (10000×10000 px, PyramidUID: ABC123) ← KEPT
├── pyramid_level_1.dcm (5000×5000 px, PyramidUID: ABC123) ← EXCLUDED
├── pyramid_level_2.dcm (2500×2500 px, PyramidUID: ABC123) ← EXCLUDED
├── thumbnail.dcm (256×256 px, PyramidUID: ABC123, ImageType: THUMBNAIL) ← EXCLUDED
├── segmentation.dcm (10000×10000 px, SOPClassUID: Segmentation) ← EXCLUDED
└── standalone.dcm (8000×8000 px, No PyramidUID) ← KEPT

Result: Only pyramid_level_0.dcm and standalone.dcm are processed
```
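
The scenario above can be reproduced on plain records, without reading any DICOM files. The sketch below is an assumption-laden illustration: `DicomHeader` and `select_files_to_keep` are hypothetical names, and the function returns the files to keep, whereas `_filter_dicom_pyramid_files` in this PR returns the files to exclude.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DicomHeader:
    """Hypothetical stand-in for the header fields the filter looks at."""

    name: str
    rows: int
    cols: int
    pyramid_uid: str | None = None
    is_wsi: bool = True  # SOPClassUID is VL Whole Slide Microscopy Image Storage
    auxiliary: bool = False  # ImageType contains THUMBNAIL, LABEL, ...


def select_files_to_keep(headers: list[DicomHeader]) -> set[str]:
    """Keep standalone WSIs plus the largest file of each pyramid."""
    kept: set[str] = set()
    groups: dict[str, list[DicomHeader]] = defaultdict(list)
    for header in headers:
        if not header.is_wsi or header.auxiliary:
            continue  # segmentations, thumbnails, labels, ...
        if header.pyramid_uid is None:
            kept.add(header.name)  # standalone WSI
        else:
            groups[header.pyramid_uid].append(header)
    for members in groups.values():
        # Highest resolution = largest rows * cols.
        kept.add(max(members, key=lambda h: h.rows * h.cols).name)
    return kept


scenario = [
    DicomHeader("pyramid_level_0.dcm", 10000, 10000, "ABC123"),
    DicomHeader("pyramid_level_1.dcm", 5000, 5000, "ABC123"),
    DicomHeader("pyramid_level_2.dcm", 2500, 2500, "ABC123"),
    DicomHeader("thumbnail.dcm", 256, 256, "ABC123", auxiliary=True),
    DicomHeader("segmentation.dcm", 10000, 10000, is_wsi=False),
    DicomHeader("standalone.dcm", 8000, 8000),
]
print(sorted(select_files_to_keep(scenario)))
```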

**Error Handling:**

- Files with missing SOPClassUID raise `AttributeError`; they are logged at DEBUG level and are not added to the exclusion set
- Files with PyramidUID but missing TotalPixelMatrix* attributes are treated as standalone
- Files that cannot be read by pydicom are logged at DEBUG level and skipped by the filter (they are not added to the exclusion set)
- AttributeError and general exceptions are caught to prevent processing pipeline failure


### Progress State Management

**Actual DownloadProgress Model (`_models.py`):**
102 changes: 96 additions & 6 deletions src/aignostics/application/_service.py
@@ -3,20 +3,20 @@
import base64
import re
import time
from collections import defaultdict
from collections.abc import Callable, Generator
from http import HTTPStatus
from importlib.util import find_spec
from pathlib import Path
from typing import Any

import google_crc32c
import pydicom
import requests
from loguru import logger

from aignostics.bucket import Service as BucketService
-from aignostics.constants import (
-    TEST_APP_APPLICATION_ID,
-)
+from aignostics.constants import TEST_APP_APPLICATION_ID
from aignostics.platform import (
    LIST_APPLICATION_RUNS_MAX_PAGE_SIZE,
    ApiException,
@@ -32,9 +32,7 @@
    RunOutput,
    RunState,
)
-from aignostics.platform import (
-    Service as PlatformService,
-)
+from aignostics.platform import Service as PlatformService
from aignostics.utils import BaseService, Health, sanitize_path_component
from aignostics.wsi import Service as WSIService

@@ -312,6 +310,91 @@
for key_value in key_value_pairs:
    Service._process_key_value_pair(entry, key_value, external_id)

@staticmethod
def _filter_dicom_pyramid_files(source_directory: Path) -> set[Path]: # noqa: C901

Check failure on line 314 in src/aignostics/application/_service.py
SonarQubeCloud / SonarCloud Code Analysis: Refactor this function to reduce its Cognitive Complexity from 24 to the 15 allowed.
See more on https://sonarcloud.io/project/issues?id=aignostics_python-sdk&issues=AZrgLin3KV_f-Lpmb49u&open=AZrgLin3KV_f-Lpmb49u&pullRequest=270
    """Filter DICOM files to keep only one representative per pyramid.

    For multi-file DICOM pyramids (WSI images split across multiple DICOM instances),
    keeps only the highest resolution file. Excludes segmentations, annotations,
    thumbnails, labels, and other non-image DICOM files. OpenSlide will automatically
    find related pyramid files in the same directory when needed.

    Filtering Strategy:
    - Only processes VL Whole Slide Microscopy Image Storage
      (SOPClassUID 1.2.840.10008.5.1.4.1.1.77.1.6, see here:
      https://dicom.nema.org/medical/dicom/current/output/chtml/part04/sect_b.5.html)
    - Excludes thumbnails, labels, overviews by ImageType DICOM attribute
    - Groups files by PyramidUID (unique identifier for multi-resolution pyramids)
    - Selects highest resolution based on TotalPixelMatrixRows x TotalPixelMatrixColumns
    - Preserves standalone WSI files without PyramidUID

    Args:
        source_directory: The directory to scan.

    Returns:
        set[Path]: Set of DICOM files to exclude from processing.
    """
    dicom_files = list(source_directory.glob("**/*.dcm"))
    pyramid_groups: dict[str, list[tuple[Path, int, int]]] = defaultdict(list)
    files_to_exclude = set()

    # Group by PyramidUID with dimensions
    for dcm_file in dicom_files:
        try:
            ds = pydicom.dcmread(dcm_file, stop_before_pixels=True)

            # Exclude non-WSI image files by SOPClassUID
            # Only process VL Whole Slide Microscopy Image Storage (1.2.840.10008.5.1.4.1.1.77.1.6)
            if ds.SOPClassUID != "1.2.840.10008.5.1.4.1.1.77.1.6":
                logger.debug(f"Excluding {dcm_file.name} - not a WSI image (SOPClassUID: {ds.SOPClassUID})")
                files_to_exclude.add(dcm_file)
                continue

            # Exclude thumbnails, labels, and overview images by ImageType
            if hasattr(ds, "ImageType"):
                image_type = [t.upper() for t in ds.ImageType]
                exclude_types = {"THUMBNAIL", "LABEL", "OVERVIEW", "MACRO", "ANNOTATION", "LOCALIZER"}
                if any(excluded in image_type for excluded in exclude_types):
                    logger.debug(f"Excluding {dcm_file.name} - ImageType: {image_type}")
                    files_to_exclude.add(dcm_file)
                    continue

            # Now process valid WSI images with PyramidUID
            if not hasattr(ds, "PyramidUID"):
                logger.debug(f"DICOM {dcm_file.name} has no PyramidUID - treating as standalone")
                continue

            pyramid_uid = ds.PyramidUID

            # These represent the full image dimensions across all frames
            rows = int(ds.TotalPixelMatrixRows)
            cols = int(ds.TotalPixelMatrixColumns)

            pyramid_groups[pyramid_uid].append((dcm_file, rows, cols))
        except AttributeError as e:
            logger.debug(f"DICOM {dcm_file} missing required attributes: {e}")
        except Exception as e:
            logger.debug(f"Could not read DICOM {dcm_file}: {e}")

    # For each pyramid with multiple files, keep only the highest resolution one
    for pyramid_uid, files_with_dims in pyramid_groups.items():
        if len(files_with_dims) > 1:
            # Find the file with the largest dimensions (rows * cols = total pixels)
            highest_res_file = max(files_with_dims, key=lambda x: x[1] * x[2])
            file_to_keep, rows, cols = highest_res_file

            # Exclude all others
            for file_path, _, _ in files_with_dims:
                if file_path != file_to_keep:
                    files_to_exclude.add(file_path)

            logger.debug(
                f"DICOM pyramid {pyramid_uid}: keeping {file_to_keep.name} "
                f"({rows}x{cols}), excluding {len(files_with_dims) - 1} related files"
            )

    return files_to_exclude

@staticmethod
def generate_metadata_from_source_directory( # noqa: PLR0913, PLR0917
    source_directory: Path,
@@ -366,10 +449,17 @@

metadata = []

# Pre-filter: exclude redundant DICOM files from multi-file pyramids
dicom_files_to_exclude = Service._filter_dicom_pyramid_files(source_directory)

try:
    extensions = get_supported_extensions_for_application(application_id)
    for extension in extensions:
        for file_path in source_directory.glob(f"**/*{extension}"):
            # Skip excluded DICOM files
            if file_path in dicom_files_to_exclude:
                continue
Collaborator

This is somewhat inefficient:

  1. First we list all files in the directory and derive files to exclude
  2. Then we list all files in the directory again, and skip files from (1)

Instead we could have a single function which lists files in the directory and returns only those to include, and iterate over that?

Collaborator Author

yep fair, I thought it would be a bit of a cleaner separation this way and anyways for typical dataset sizes time savings on glob would be negligible, but agree 👍 refactored!
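
For illustration, the single-listing approach suggested in this thread might look roughly like the sketch below. It is not the refactor that landed in the PR: `iter_files_to_include` and the `is_excluded` predicate are hypothetical, and since pyramid selection has to see every member of a pyramid before deciding, a real implementation would still need to group DICOM headers before yielding.

```python
from __future__ import annotations

import tempfile
from collections.abc import Callable, Iterable, Iterator
from pathlib import Path


def iter_files_to_include(
    source_directory: Path,
    extensions: Iterable[str],
    is_excluded: Callable[[Path], bool],
) -> Iterator[Path]:
    """List the directory once and yield only the files to process."""
    suffixes = tuple(ext.lower() for ext in extensions)
    for path in sorted(source_directory.rglob("*")):
        if not path.is_file():
            continue
        if not path.name.lower().endswith(suffixes):
            continue  # unsupported extension
        if is_excluded(path):
            continue  # e.g. a lower resolution level of a DICOM pyramid
        yield path


# Usage on a throwaway directory with one excluded file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.dcm", "b.dcm", "c.svs", "notes.txt"):
        (root / name).touch()
    included = [
        p.name
        for p in iter_files_to_include(
            root, [".dcm", ".svs"], lambda p: p.name == "b.dcm"
        )
    ]
print(included)
```

Callers then iterate over one generator instead of globbing once per extension and checking each path against a separately built exclusion set.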


            # Generate CRC32C checksum with google_crc32c and encode as base64
            hash_sum = google_crc32c.Checksum() # type: ignore[no-untyped-call]
            with file_path.open("rb") as f:
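
The hunk is cut off here, so the rest of the checksum loop is not visible. As a rough, self-contained illustration of chunked CRC32C hashing with base64 encoding, the sketch below swaps in a pure-Python CRC32C (Castagnoli polynomial) for `google_crc32c`; `crc32c` and `crc32c_base64_of_chunks` are hypothetical names, and the big-endian byte order of the encoded digest is an assumption about how the service encodes it.

```python
from __future__ import annotations

import base64
from collections.abc import Iterable


def _make_table() -> list[int]:
    """Build the lookup table for the reflected Castagnoli polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Update a running CRC32C with data (pass the previous result as crc)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a stream chunk by chunk and base64-encode the 4-byte digest."""
    crc = 0
    for chunk in chunks:
        crc = crc32c(chunk, crc)
    # Assumption: the digest is serialized big-endian before encoding.
    return base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
```

Feeding the file in fixed-size chunks keeps memory use constant regardless of slide size.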