Galaxy morphology in the local universe is quantitatively separable in mass–size space, with a geometric boundary slope ≈ 2 and systematic redshift evolution.
This repository presents a reproducible, hypothesis-driven study of galaxy structural classification using NASA–Sloan Atlas (NSA) data. By combining interpretable machine learning with geometric boundary analysis, we demonstrate that galaxy morphology correlates with a mass–size scaling relation consistent with a surface-density–like threshold — and that this threshold evolves with redshift.
- Linear mass–size boundary slope ≈ 2.01
- Cross-validation stability: std ≈ 0.003
- Robust under regularization changes (C = 0.1–10)
- Redshift evolution: slope decreases from ~2.40 (low z) to ~1.59 (high z)
- Logistic Regression ROC-AUC ≈ 0.84
- Random Forest ROC-AUC ≈ 0.88
Probability surface in log(M*)–log(Re) space for structural classification. The diagonal transition band shows that the predicted class depends on a mass–size tradeoff, not on mass alone.
Evolution of the mass–size boundary slope across redshift bins. The systematic decline suggests the morphology transition becomes increasingly compactness-regulated at later cosmic times.
Galaxy morphology (disk-dominated vs bulge-dominated) is strongly correlated with global structural properties such as stellar mass and effective radius. This project investigates the hypothesis:
Galaxy structure emerges from multivariate mass–size interaction rather than from a single dominant parameter.
Rather than maximizing classification accuracy, this study emphasizes geometric interpretability and robustness of structural boundaries as the primary scientific objectives.
Source: NASA–Sloan Atlas (NSA)
Redshift range: z < 0.08 (conservative reliability cut)
Final sample: ~287,000 galaxies
Extracted quantities:
- Stellar mass (M*; Sérsic-based, log-transformed)
- Effective radius (Re; Sérsic half-light radius, log-transformed)
- Spectroscopic redshift (z)
- Sérsic index (n) — used only to define structural class
Binary structural classification:
- Disk-dominated: n < 2.5
- Bulge-dominated: n ≥ 2.5
The Sérsic index is removed from the feature set to prevent target leakage.
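As a minimal sketch of the labelling and leakage step described above, the snippet below builds the binary target from the Sérsic index and then drops that column from the features. The column names (`log_mass`, `log_re`, `z`, `sersic_n`) and the toy values are hypothetical stand-ins, not the actual NSA field names or the repository's code.

```python
import pandas as pd

# Toy catalogue; column names are hypothetical stand-ins for the NSA fields.
catalog = pd.DataFrame({
    "log_mass": [9.8, 10.6, 11.1, 10.2],     # log10 stellar mass
    "log_re":   [0.55, 0.40, 0.70, 0.65],    # log10 effective radius
    "z":        [0.021, 0.047, 0.072, 0.033],
    "sersic_n": [1.2, 3.8, 2.5, 0.9],
})

SERSIC_THRESHOLD = 2.5  # n >= 2.5 is bulge-dominated, n < 2.5 is disk-dominated

# Binary target: 1 = bulge-dominated, 0 = disk-dominated.
y = (catalog["sersic_n"] >= SERSIC_THRESHOLD).astype(int)

# Drop the Sérsic index from the features so the label cannot leak in.
X = catalog.drop(columns=["sersic_n"])
```

Keeping the threshold in a named constant makes it easy to check how sensitive the downstream boundary is to the exact cut.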
- FITS ingestion with endian correction
- Conservative redshift filtering (z < 0.08)
- Removal of non-physical entries
- Log-transform of physical scale quantities
- Stratified 80/20 train-test split (class balance preserved)
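
The preprocessing steps listed above can be sketched roughly as follows. This is an illustration on synthetic arrays, not the contents of `src/data_cleaning.py`: the big-endian arrays only imitate what a FITS reader typically returns, and "non-physical" is assumed here to mean non-positive or non-finite mass or radius.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# FITS tables store numbers big-endian; imitate that with explicit '>f8' arrays.
mass = (10 ** rng.normal(10.0, 0.6, n)).astype(">f8")    # stellar mass
re_kpc = (10 ** rng.normal(0.5, 0.3, n)).astype(">f8")   # effective radius
z = rng.uniform(0.0, 0.15, n).astype(">f8")
sersic_n = rng.uniform(0.5, 6.0, n).astype(">f8")
mass[:5] = -1.0      # inject a few non-physical entries
re_kpc[5:10] = 0.0


def to_native(a):
    """Return a copy of `a` in the machine's native byte order."""
    return a.astype(a.dtype.newbyteorder("="))


mass, re_kpc, z, sersic_n = map(to_native, (mass, re_kpc, z, sersic_n))

# Conservative redshift cut plus removal of non-physical rows (assumed definition).
keep = (z < 0.08) & (mass > 0) & (re_kpc > 0)
keep &= np.isfinite(mass) & np.isfinite(re_kpc)

# Log-transform the physical scale quantities; redshift stays linear.
X = np.column_stack([np.log10(mass[keep]), np.log10(re_kpc[keep]), z[keep]])
y = (sersic_n[keep] >= 2.5).astype(int)

# Stratified 80/20 split keeps the class balance equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Filtering before taking logarithms avoids `-inf` and `nan` values ever entering the feature matrix.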
- Logistic Regression — interpretable linear baseline
- Random Forest — non-linear comparison
- 5-fold stratified cross-validation throughout
- Permutation importance analysis on test set
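
A minimal sketch of the modelling setup listed above, using scikit-learn on synthetic data. The data generator, hyperparameters, and variable names are assumptions made for illustration; the repository's `src/train_model.py` may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the cleaned catalogue (not NSA data).
rng = np.random.default_rng(1)
n = 3000
log_mass = rng.normal(10.0, 0.6, n)
log_re = rng.normal(0.5, 0.3, n)
z = rng.uniform(0.0, 0.08, n)
score = 3.0 * ((log_mass - 10.0) - 2.0 * (log_re - 0.5))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)
X = np.column_stack([log_mass, log_re, z])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_auc = {}
importances = {}
for name, model in models.items():
    # 5-fold stratified cross-validated ROC-AUC on the training split.
    cv_auc[name] = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    # Permutation importance is measured on the held-out test split.
    model.fit(X_train, y_train)
    result = permutation_importance(
        model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
    )
    importances[name] = result.importances_mean  # order: log_mass, log_re, z
```

Measuring permutation importance on the test split rather than the training split keeps the importance estimate from rewarding features the model has merely memorized.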
| Experiment | Features | Logistic AUC | RF AUC |
|---|---|---|---|
| Full structural model | M*, Re, Σ*, z | 0.840 | 0.877 |
| Remove surface density | M*, Re, z | 0.840 | 0.877 |
| Compactness only | Σ*, z | 0.800 | 0.784 |
The compactness-only model scores lower on both metrics (AUC down by about 0.04 for logistic regression and about 0.09 for the random forest), indicating that morphology depends on the geometry of mass–size space, not on a single compactness ratio.
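To make the relation between compactness and the mass–size plane concrete, the sketch below assumes the common definition Σ* = M* / (2π Re²), which the text above does not spell out. Under that assumption, log Σ* is an exact linear combination of log M* and log Re, so it adds no new direction for a model that is linear in log space.

```python
import numpy as np

rng = np.random.default_rng(2)
log_mass = rng.normal(10.0, 0.6, 500)   # log10 stellar mass (synthetic)
log_re = rng.normal(0.5, 0.3, 500)      # log10 effective radius (synthetic)

# Assumed definition: Sigma* = M* / (2 * pi * Re^2), written in log10 form.
log_sigma = log_mass - 2.0 * log_re - np.log10(2.0 * np.pi)

# Regress log Sigma* on (log M*, log Re, 1): the fit should be exact.
design = np.column_stack([log_mass, log_re, np.ones_like(log_mass)])
coef, *_ = np.linalg.lstsq(design, log_sigma, rcond=None)
residual = log_sigma - design @ coef
```

The recovered coefficients on log M* and log Re are 1 and -2, which is why a line of constant Σ* has slope 2 in the log Re versus log M* plane.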
Logistic regression coefficients were extracted to interpret the decision boundary directly.
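The boundary in log-space satisfies:

$$w_M \log M_* + w_R \log R_e + b = 0$$

where $w_M$ and $w_R$ are the fitted coefficients on $\log M_*$ and $\log R_e$, and $b$ collects the intercept and any redshift term at fixed $z$. The boundary slope is therefore $d\log M_* / d\log R_e = -w_R / w_M$.

Empirically measured:

$$-\frac{w_R}{w_M} \approx 2.01$$

This implies the morphology transition follows:

$$M_* \propto R_e^{2}$$

which corresponds to a stellar surface mass density threshold:

$$\Sigma_* \propto \frac{M_*}{R_e^{2}} \approx \text{const}$$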
This geometric result — slope ≈ 2 — emerged from the model without explicitly providing surface density as a feature. The model recovered the compactness scaling from mass and size alone.
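The slope extraction can be sketched as below, assuming the slope is defined as d log M* / d log Re along the decision boundary, which equals minus the ratio of the log Re coefficient to the log M* coefficient. The data are synthetic, generated with a known slope of 2, so the example only checks that the procedure recovers a planted slope; it does not reproduce the NSA measurement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20000
TRUE_SLOPE = 2.0  # slope planted in the synthetic data

log_mass = rng.normal(10.0, 0.6, n)
log_re = rng.normal(0.5, 0.3, n)

# Class probability depends only on the distance from a slope-2 line.
score = 3.0 * ((log_mass - 10.0) - TRUE_SLOPE * (log_re - 0.5))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)

# Centre the features so the intercept stays small; this leaves the slope unchanged.
X = np.column_stack([log_mass - log_mass.mean(), log_re - log_re.mean()])

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
w_mass, w_re = clf.coef_[0]

# Boundary: w_mass * x_M + w_re * x_R + b = 0, so d(log M*) / d(log Re) = -w_re / w_mass.
slope = -w_re / w_mass
```

Because the features are left unscaled, the coefficient ratio can be read directly as a slope in log space. If a scaler were applied first, each coefficient would need to be divided by that feature's scale before taking the ratio.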
| Test | Result |
|---|---|
| CV fold stability | slope = 2.008 ± 0.003 |
| Regularization (C = 0.1) | slope = 2.009 |
| Regularization (C = 1.0) | slope = 2.008 |
| Regularization (C = 10.0) | slope = 2.008 |
These checks indicate that the boundary geometry reflects the data rather than a numerical artifact of fold choice or regularization strength.
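A rough sketch of how such stability checks could be run: refit the logistic model on each cross-validation training fold and at each regularization strength, then compare the resulting slopes. The helper names `make_sample` and `boundary_slope` are hypothetical, and the synthetic data carry a planted slope of 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def make_sample(n, seed):
    """Synthetic (log M*, log Re) sample with a planted boundary slope of 2."""
    rng = np.random.default_rng(seed)
    x_mass = rng.normal(0.0, 0.6, n)   # log M* relative to a pivot mass
    x_re = rng.normal(0.0, 0.3, n)     # log Re relative to a pivot radius
    score = 3.0 * (x_mass - 2.0 * x_re)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)
    return np.column_stack([x_mass, x_re]), y


def boundary_slope(X, y, C=1.0):
    """Fit a logistic model and return d(log M*)/d(log Re) on its boundary."""
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    w_mass, w_re = clf.coef_[0]
    return -w_re / w_mass


X, y = make_sample(50000, seed=4)

# Slope on each of the 5 training folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_slopes = np.array(
    [boundary_slope(X[train_idx], y[train_idx]) for train_idx, _ in cv.split(X, y)]
)

# Slope at each regularization strength, fitted on the full sample.
c_slopes = {C: boundary_slope(X, y, C=C) for C in (0.1, 1.0, 10.0)}
```

`fold_slopes.std()` corresponds to the fold-stability row of the table, and the spread of `c_slopes` to the regularization rows.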
The sample was divided into three equal-sized redshift bins (~76,000 galaxies each).
| Redshift Bin | z Range | Slope |
|---|---|---|
| Low z | 0.000 – 0.042 | ~2.40 |
| Mid z | 0.042 – 0.064 | ~1.79 |
| High z | 0.064 – 0.080 | ~1.59 |
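
Equal-sized bins of this kind can be built from quantiles of the redshift distribution. The sketch below uses a synthetic redshift sample (an assumption, not NSA data), so its bin edges will not match the table.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic redshifts below 0.08, weighted toward higher z.
z = 0.08 * rng.random(9000) ** (1.0 / 3.0)

# Bin edges at the 0, 1/3, 2/3 and 1 quantiles give equal-count bins.
edges = np.quantile(z, [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])

# Assign each galaxy to bin 0 (low z), 1 (mid z) or 2 (high z).
bin_index = np.digitize(z, edges[1:-1])
counts = np.bincount(bin_index, minlength=3)
```

The boundary slope would then be fitted separately on the galaxies in each bin.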
Linear fit:
The systematic decline indicates the morphology transition becomes increasingly compactness-regulated toward lower redshift. This evolution is not driven by bin size imbalance and persists under all robustness checks.
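One way to summarize the trend is a straight-line fit of slope against redshift. The sketch below uses the approximate slopes from the table and takes the midpoint of each bin's z range as its representative redshift, which is an assumption (a per-bin median redshift would be a reasonable alternative). With only three points, the fit is indicative at best.

```python
import numpy as np

# Bin edges and approximate slopes copied from the table above.
z_edges = np.array([0.000, 0.042, 0.064, 0.080])
slopes = np.array([2.40, 1.79, 1.59])

# Assumption: represent each bin by the midpoint of its z range.
z_mid = 0.5 * (z_edges[:-1] + z_edges[1:])

# Least-squares line: slope(z) = gradient * z + intercept.
gradient, intercept = np.polyfit(z_mid, slopes, deg=1)
```

A negative `gradient` corresponds to the boundary slope decreasing toward higher redshift.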
- Galaxy morphology is strongly encoded in mass–size geometry.
- The structural transition approximates a surface-density threshold, recovered geometrically without explicit feature engineering.
- The boundary is intrinsically stable — insensitive to cross-validation fold, regularization strength, or feature set variations.
- The structural threshold evolves systematically with redshift, consistent with compactness-driven quenching becoming stronger at late cosmic times.
These results support the hypothesis that galaxy structure emerges from a multivariate mass–size interaction rather than from a threshold in any single parameter.
pip install -r requirements.txt
python src/data_cleaning.py
python src/split_data.py
python src/train_model.py

Raw NSA FITS data is not version-controlled. Place the file in data/raw/ before running.
galaxy-structure-inference/
│
├── README.md
├── requirements.txt
├── config.yaml
├── src/
│ ├── data_cleaning.py
│ ├── split_data.py
│ └── train_model.py
├── figures/
│ ├── mass_size_probability_surface.png
│ └── slope_vs_redshift.png
└── data/ (not version-controlled)
This repository emphasizes hypothesis-driven experimentation, strict leakage prevention, geometric interpretability of learned boundaries, and robustness testing over performance maximization. The objective is scientific understanding of structural scaling relations rather than model optimization.
Gnaneshwar G S
Computational galaxy evolution | Structural scaling relations | Statistical modeling in large survey datasets

