PLANAR: Planetary Latent Analysis & Representation

PLANAR is a bias-aware, stability-validated unsupervised morphology discovery framework for protoplanetary disk observations.

It is designed not merely to cluster images, but to test whether discovered structure reflects physical morphology rather than nuisance factors such as brightness scaling or orientation.

Reviewer Quickstart

This repository implements the EXXA GSoC Test pipeline for protoplanetary disk analysis.

1. Install dependencies

pip install -r requirements.txt

2. Run the full verification pipeline

python -m planar run --config configs/verify.yaml

Runtime: ~2 minutes

Outputs will appear in:

artifacts/
reports/PLANAR_REPORT.md

3. Run observational verification

python -m planar run --config configs/verify_observational.yaml

4. Test with new FITS data

python scripts/generate_synthetic_fits.py --out-dir data/withheld_fits --n-samples 10
python -m planar infer --config configs/verify.yaml --data-dir data/withheld_fits

5. Notebook version

Open:

notebooks/EXXA_test_submission.ipynb

Or view the executed notebook:

artifacts/notebooks/EXXA_test_submission.executed.ipynb

Pipeline Overview

FITS cube handling note:

Extract layer 0 of each cube so the (600, 600) image morphology is preserved.
Do not average over spatial axes; doing so collapses the structure.
When cubes include extra non-spatial axes, remove them with a controlled squeeze or by indexing data[0].
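
The note above can be sketched in a few lines of NumPy. This is a minimal illustration, not PLANAR's loader: the function name `extract_image_plane` is hypothetical, and the cube is a synthetic stand-in rather than real FITS data.

```python
import numpy as np


def extract_image_plane(data):
    """Return the 2-D image plane by taking index 0 on every leading axis."""
    plane = np.asarray(data)
    if plane.ndim < 2:
        raise ValueError(f"expected at least 2 dimensions, got shape {plane.shape}")
    # Peel off non-spatial axes one at a time. Indexing with [0] leaves the
    # trailing (ny, nx) axes untouched, whereas a mean over the wrong axis
    # would smear or collapse the disk structure.
    while plane.ndim > 2:
        plane = plane[0]
    return plane


# Synthetic stand-in for a cube shaped (layers, ny, nx).
cube = np.zeros((4, 600, 600), dtype=np.float32)
cube[0, 300, 300] = 1.0  # mark one pixel in layer 0 so we can find it again
image = extract_image_plane(cube)
print(image.shape)
```

With astropy installed, the input array would typically come from `astropy.io.fits.getdata(path)`; that call is left out here so the sketch runs with NumPy alone.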

Alignment With Terry et al. (2022)

This pipeline is aligned with Terry et al. (2022), which demonstrates that machine learning on synthetic ALMA-like observations can detect planet-induced disk structure and transfer to real data. Our approach likewise trains on synthetic continuum FITS observations, emphasizes morphology-driven clustering, and includes orientation/brightness bias checks plus observational evaluation to mirror the synthetic-to-real validation strategy.

Latent Space Representation

Figure: Latent space clustering of protoplanetary disk images. The convolutional autoencoder encodes each FITS disk image into a latent vector. A 2-D PCA projection of these latent vectors is shown, with colors representing cluster assignments produced by HDBSCAN (with KMeans fallback). Clusters correspond to distinct disk morphologies such as rings, gaps, and potential planet-induced structures.

Key Features

End-to-end pipeline for image representation learning, latent clustering, and transit classification.
Config-driven execution with deterministic seeding and structured artifact outputs.
Scientific diagnostics for brightness/orientation dominance (eta^2, Kruskal tests).
Stability analysis via perturbation-based pairwise ARI.
Morphology interpretation layer with radial derivative peaks and gap-width proxies.
PyPI-style package layout (src/), CLI entrypoint, tests, and documentation scaffold.
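
For readers unfamiliar with the eta^2 diagnostic listed above, here is a minimal standard-library sketch of how such a score can be computed: the fraction of variance in a nuisance proxy (for example total brightness) that is explained by cluster membership. The function name and the toy values are illustrative assumptions, not PLANAR's implementation.

```python
from collections import defaultdict


def eta_squared(values, labels):
    """Fraction of variance in `values` explained by the grouping in `labels`."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    if not values:
        raise ValueError("values must not be empty")
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(float(value))
    grand_mean = sum(float(v) for v in values) / len(values)
    ss_total = sum((float(v) - grand_mean) ** 2 for v in values)
    if ss_total == 0.0:
        return 0.0  # constant proxy: there is no variance to explain
    # Between-group sum of squares: group size times squared offset of the group mean.
    ss_between = sum(
        len(members) * (sum(members) / len(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return ss_between / ss_total


# Toy brightness proxy (made-up numbers) under two hypothetical clusterings.
brightness = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
dominated = eta_squared(brightness, [0, 0, 0, 1, 1, 1])  # clusters track brightness
mixed = eta_squared(brightness, [0, 1, 0, 1, 0, 1])      # clusters ignore brightness
print(round(dominated, 3), round(mixed, 3))
```

A value near 1 suggests the clusters mostly re-encode the nuisance variable, while a value near 0 suggests they do not.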

Scientific Motivation

Protoplanetary disk images contain both physically informative structure (rings, gaps, asymmetries) and nuisance variation (brightness scaling, inclination/orientation). PLANAR is designed to separate these effects by learning latent representations and explicitly auditing cluster bias against nuisance proxies. The goal is clustering that is scientifically meaningful, not merely visually separable.

Design Philosophy

PLANAR was built under three guiding principles:

Determinism before optimization.
Scientific validation before visual appeal.
Reproducibility before performance claims.

Every clustering result must pass:

Stability under perturbation
Bias audit against nuisance proxies
Transparent metric reporting
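
Stability under perturbation is reported elsewhere in this README as pairwise ARI. The sketch below shows that metric using only the standard library. The function names and the toy partitions are illustrative assumptions, not PLANAR's API. The step that perturbs latents and re-clusters is not shown, and HDBSCAN noise (-1) would simply be treated as one more label here.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of samples that are co-clustered in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


def mean_pairwise_ari(partitions):
    """Average ARI over every pair of partitions."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(partitions, 2)]
    return sum(scores) / len(scores)


# Hand-written stand-ins for cluster labels from three perturbed runs.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition with the label names swapped
    [0, 0, 1, 1, 1, 1],  # one sample moved to the other cluster
]
print(mean_pairwise_ari(runs))
```

ARI is invariant to label renaming, which is why the first two runs count as identical partitions.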

Architecture Overview

FITS Loader -> Preprocessing -> ConvAutoencoder -> Latent Vectors -> Clustering
   |              |                 |                 |                |
   |              |                 |                 |                +-> HDBSCAN / KMeans / GMM
   |              |                 |                 +-> Embedding (PCA/UMAP)
   |              |                 +-> Reconstructions
   |              +-> Radial averaging (optional)
   +-> Validation + shape checks

Transit Simulator -> 1D CNN Transit Classifier -> ROC/AUC + Stress Evaluation

Repository Layout

PLANAR/
├── src/planar/
│   ├── models/
│   ├── pipelines/
│   ├── cli.py
│   ├── config.py
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── metrics.py
│   ├── science_validation.py
│   └── transit_sim.py
├── configs/
├── scripts/
├── tests/
├── docs/
├── notebooks/
├── reports/
├── pyproject.toml
├── requirements.txt
└── environment.yml

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e .
planar run --config configs/default.yaml

CLI Usage

# Full pipeline
planar run --config configs/default.yaml

# Deterministic multi-seed reproducibility sweep
planar reproduce --config configs/reproduce.yaml

# Autoencoder only
planar train-ae --config configs/default.yaml

# Clustering with explicit checkpoint
planar cluster --config configs/default.yaml --model-path artifacts/autoencoder/autoencoder.pth

# Inference on new FITS folder
planar infer --config configs/default.yaml --data-dir path/to/fits

Sample Output Snippet

2026-03-13 03:51:10 | INFO | planar.pipelines.autoencoder | Autoencoder artifacts written to artifacts/autoencoder
2026-03-13 03:51:22 | INFO | planar.pipelines.clustering | Clustering complete: method=hdbscan reducer=pca silhouette=0.6976 noise=0.047
2026-03-13 03:57:44 | INFO | planar.pipelines.transit | Transit training complete. test_auc=0.9962 stress_auc=0.9612
2026-03-13 03:58:01 | INFO | planar.pipelines.full | Run report written to reports/PLANAR_REPORT.md

Benchmark Snapshot (Current Artifacts)

Metrics below are from the reviewer-facing radial/HDBSCAN configuration and the 3-seed reproducibility sweep generated in March 2026:

| Task | Configuration | Result |
| --- | --- | --- |
| Clustering (recommended run) | Radial preprocessing + HDBSCAN + PCA | Silhouette `0.6976` |
| Clustering reproducibility | 3-seed sweep | Silhouette `0.5275 ± 0.0075` |
| Clustering stability | 3-seed sweep | ARI mean `0.9482 ± 0.0285` |
| Bias audit | Brightness `eta^2` / Orientation `eta^2` across seeds | `0.0524 ± 0.0394` / `0.0169 ± 0.0154` |
| HDBSCAN behavior | Recommended radial run | Noise fraction `0.0467` |
| Transit classifier | 3-seed sweep mean test AUC | `0.9984` |
| Transit stress test | 3-seed sweep mean stress AUC | `0.9610` |
| Autoencoder | 3-seed sweep best val loss | `0.0379 ± 0.0013` |

Artifact sources: artifacts/reproducibility/, artifacts/clustering/, artifacts/transit/, reports/PLANAR_REPORT.md.

Pretrained checkpoint (generated during the run): artifacts/autoencoder/autoencoder.pth

Google Drive (pretrained models + artifacts): https://drive.google.com/drive/u/0/folders/1x3jiMVj2Iyeu9quEI53SVg6-EF7tMFqx

For reviewer-grade claims, use the reproducibility summary in artifacts/reproducibility/reproducibility_summary.json (mean ± std across seeds, plus negative controls).

Reproducibility

Python version is pinned to 3.11.9 in .python-version, pyproject.toml, and environment.yml.
All stages consume a YAML config (configs/default.yaml or configs/research_top.yaml).
Global seed is set once and propagated to NumPy and PyTorch.
Deterministic cuDNN mode is enabled when PyTorch is available.
Run outputs are written to versioned artifact folders with JSON summaries for auditability.
Reproducibility pipeline supports seed sweeps with aggregate statistics and controls: planar reproduce --config configs/reproduce.yaml.
Negative controls are recorded in clustering artifacts (negative_controls.json) to validate that structure exceeds shuffled/permuted baselines.
Train/validation/test leakage checks are emitted in stage summaries under split_integrity.
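
The seeding behaviour described above might look roughly like the following. This is a hedged sketch, not the code in src/planar: `set_global_seed` is a hypothetical name, and NumPy and PyTorch are seeded only if they can be imported.

```python
import random


def set_global_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning,
        # which can otherwise pick different algorithms between runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()
print(first == second)
```

Calling the function once at start-up, before any data shuffling or model initialisation, is what makes a seed sweep comparable across runs.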

How headline results are generated

planar reproduce --config configs/reproduce.yaml
planar report --config configs/research_top.yaml

Primary evidence files:

artifacts/reproducibility/reproducibility_summary.json
artifacts/clustering*/cluster_stability_summary.json
artifacts/clustering*/cluster_bias_summary.json
artifacts/transit/train_summary.json
reports/PLANAR_REPORT.md

Methodological Notes

Why radial averaging: suppresses azimuthal orientation effects so clustering emphasizes radial morphology (rings/gaps).
Why HDBSCAN: supports variable-density structure and marks ambiguous objects as noise (-1) rather than forcing assignment.
What silhouette means: measures within-cluster compactness versus between-cluster separation on a scale from -1 to 1 (higher is better).
What ARI stability means: agreement of cluster partitions under latent perturbations; ARI is 1 for identical partitions and near 0 for chance-level agreement, so high ARI indicates robust structure.
Why high noise fraction is not always bad: in density clustering, noise can represent rare/transitional morphologies rather than failure.
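
To make the radial-averaging point concrete, the sketch below computes an azimuthally averaged profile with NumPy and checks that rotating an asymmetric image leaves the profile unchanged. It assumes a face-on disk centred on the image, with no deprojection. The function name, the bin count, and the synthetic disk are illustrative assumptions rather than PLANAR's preprocessing.

```python
import numpy as np


def radial_profile(image, n_bins=50):
    """Mean intensity in concentric annuli around the image centre."""
    image = np.asarray(image, dtype=float)
    ny, nx = image.shape
    y, x = np.indices((ny, nx))
    cy, cx = (ny - 1) / 2.0, (nx - 1) / 2.0
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2)
    edges = np.linspace(0.0, r.max(), n_bins + 1)
    # Map each pixel to an annulus; the outermost pixel is clipped into the last bin.
    bin_index = np.clip(np.digitize(r.ravel(), edges) - 1, 0, n_bins - 1)
    sums = np.bincount(bin_index, weights=image.ravel(), minlength=n_bins)
    counts = np.bincount(bin_index, minlength=n_bins)
    return sums / np.maximum(counts, 1)  # guard against empty annuli


# Synthetic ring at radius 20 pixels plus one bright off-axis clump.
yy, xx = np.indices((101, 101))
rr = np.sqrt((yy - 50.0) ** 2 + (xx - 50.0) ** 2)
disk = np.exp(-((rr - 20.0) ** 2) / (2 * 3.0**2))
disk[50, 75] += 5.0
rotated = np.rot90(disk)  # the image changes, the radial structure does not

profile = radial_profile(disk)
print(np.allclose(profile, radial_profile(rotated)))
```

The trade-off is that azimuthal asymmetries such as spirals or clumps are averaged into their annulus, which is one reason the architecture lists radial averaging as optional.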

Why This Matters

Astrophysical ML often optimizes predictive performance without testing whether learned structure aligns with physical hypotheses. PLANAR addresses this gap by combining representation learning with explicit scientific validation layers, making unsupervised outcomes more useful for downstream disk-physics interpretation.

Limitations

Current transit data are simulated; domain shift to real survey light curves still requires calibration.
Ring/gap estimators are proxy-based and do not replace radiative transfer modeling.
HDBSCAN is sensitive to feature scaling and sample density, either of which can alter cluster counts.
Current latent model uses a single autoencoder family; contrastive/self-supervised alternatives are not yet integrated.

Future Work

Integrate contrastive pretraining and encoder ensembles.
Add uncertainty-aware clustering and calibrated outlier scoring.
Incorporate physically grounded simulators and real mission light curves in transit training.
Add benchmark datasets and continuous integration for regression tracking.
Expand docs with API references and experiment registry templates.

Development and Tests

pytest -q

License

This project is licensed under the MIT License. See LICENSE.

Citation

If you use PLANAR in research, please cite:

@software{planar2026,
  title        = {PLANAR: Planetary Latent Analysis \& Representation},
  author       = {Atharva Parande},
  year         = {2026},
  url          = {https://github.com/Atharva12081/PLANAR},
  version      = {0.1.0},
  license      = {MIT}
}
