A rigorous, multi-dimensional study evaluating not just how accurately deep learning models classify chest X-rays — but whether their explanations can actually be trusted.
Department of Computer Science
Islamic University of Science and Technology, Awantipora, J&K
Dr. Owais · Shahid Ul Islam
Most deep learning studies ask: "How accurate is the model?"
This study asks a harder question: "Can we trust why the model made that decision?"
We train three architectures on a four-class chest X-ray dataset and discover a clinically significant paradox — the model that produces the most visually convincing heatmaps is the one whose explanations are causally hollow.
| Conventional Study | This Study |
|---|---|
| Reports accuracy only | Reports accuracy + faithfulness |
| One CAM method per model | Two CAM methods with inter-method agreement |
| Visual inspection of heatmaps | Quantitative pixel deletion (AOPC, AUC) |
| Single metric evaluation | Six-dimensional explainability framework |
| No statistical correction | Bonferroni-corrected non-parametric testing |
flowchart TD
A[(" 6,432 PA Chest X-Rays\nKaggle Dataset")] --> B[Preprocessing\n224×224 · RGB · ImageNet Norm]
B --> C{Architecture\nBranch}
C -->|Aggressive Aug\n±30° · blur · erasing| D[" VGG16\nTransfer Learning\nImageNet pretrained"]
C -->|Moderate Aug\n±15° · colour jitter| E[" ViT-B/16\nTransfer Learning\nImageNet pretrained"]
C -->|Standard Aug\nRandom init| F[" Custom CNN\nFrom Scratch\n4 conv blocks"]
D --> G["Class-Weighted\nCross-Entropy Loss\nwₖ = N / (K·nₖ)"]
E --> G
F --> G
G --> H[Early Stopping\nPatience P=3]
H --> I{Evaluation}
I --> J[Classification\nAccuracy · F1 · CM]
I --> K[Explainability\nGradCAM++ · EigenCAM]
K --> L[6-Metric\nFramework]
K --> M[Pixel Deletion\nAOPC · AUC]
style A fill:#1e3a5f,color:#fff,stroke:#3b82f6
style D fill:#7c2d12,color:#fff,stroke:#f97316
style E fill:#4a1d96,color:#fff,stroke:#a855f7
style F fill:#14532d,color:#fff,stroke:#22c55e
style G fill:#7f1d1d,color:#fff,stroke:#ef4444
style L fill:#0c4a6e,color:#fff,stroke:#38bdf8
style M fill:#0c4a6e,color:#fff,stroke:#38bdf8
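The preprocessing and augmentation branches above can be sketched with torchvision transforms. This is a minimal illustration, not the exact training code: the blur kernel size, jitter strength, and erasing probability are assumptions beyond what the diagram specifies.

```python
from torchvision import transforms

# Shared: 224×224, 3-channel tensor, ImageNet normalisation
IMAGENET_NORM = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

# VGG16 branch: aggressive augmentation (±30° rotation, blur, random erasing)
vgg_train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(30),
    transforms.GaussianBlur(kernel_size=3),   # kernel size is an assumption
    transforms.ToTensor(),
    IMAGENET_NORM,
    transforms.RandomErasing(p=0.25),         # probability is an assumption
])

# ViT-B/16 branch: moderate augmentation (±15° rotation, colour jitter)
vit_train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # strength is an assumption
    transforms.ToTensor(),
    IMAGENET_NORM,
])
```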
| Model | Strategy | Accuracy | Macro F1 | Wtd. F1 | Inference Time |
|---|---|---|---|---|---|
| VGG16 | Transfer Learning | 83% | 0.84 | 0.82 | 1.068s |
| ViT-B/16 | Transfer Learning | 82% | 0.84 | 0.82 | 0.977s |
| 🔹 Custom CNN | From Scratch | 74% | 0.75 | 0.74 | 0.768s |
Per-class F1 scores:

| Class | VGG16 | ViT-B/16 | Custom CNN |
|---|---|---|---|
| Normal | 0.91 | 0.91 | 0.80 |
| Bacterial Pneumonia | 0.83 | 0.81 | 0.74 |
| Viral Pneumonia | 0.64 | 0.64 | 0.55 |
| COVID-19 | 0.98 | 0.99 | 0.92 |
xychart-beta
title "Peak Validation Accuracy by Epoch"
x-axis ["Ep 1", "Ep 2", "Ep 3", "Ep 4", "Ep 5", "Ep 6", "Ep 7"]
y-axis "Validation Accuracy (%)" 70 --> 85
line [80.1, 81.3, 82.0, 82.9, 82.5, 83.3, 83.3]
line [77.8, 82.4, 81.1, 80.5, 71.5, 80.2, 82.4]
VGG16 stopped at epoch 6 · ViT-B/16 stopped at epoch 7 · Custom CNN ran all 15 epochs
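The early-stopping rule (patience P = 3 on validation accuracy, as in the pipeline diagram) can be sketched as follows; `train_one_epoch` and `evaluate` are hypothetical placeholders for the actual training-loop internals.

```python
def fit(model, train_one_epoch, evaluate, max_epochs=15, patience=3):
    """Stop when validation accuracy fails to improve for `patience` epochs."""
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_acc = evaluate(model)
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
            # checkpoint best weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # e.g. VGG16 halted at epoch 6, ViT-B/16 at epoch 7
    return best_acc
```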
The most visually convincing heatmaps are the least causally faithful ones.
This is the central finding of the study. We evaluate explanations using progressive pixel deletion — removing the pixels each model's heatmap marks as most important, in order of decreasing importance, and measuring the resulting change in model confidence.
graph LR
A["VGG16 Heatmaps\n✓ Visually intuitive\n✓ Anatomically broad\n✓ Clinically convincing"] -->|Faithfulness test| B["AOPC = −0.012\n Confidence RISES\nas pixels removed\nCausally UNFAITHFUL"]
C["ViT-B/16 Heatmaps\n~ Variable appearance\n~ Patchier coverage\n~ Less intuitive"] -->|Faithfulness test| D["AOPC = +0.199\n Confidence DROPS\nas pixels removed\nCausally FAITHFUL"]
style A fill:#fef3c7,stroke:#d97706,color:#000
style B fill:#fee2e2,stroke:#dc2626,color:#000
style C fill:#ede9fe,stroke:#7c3aed,color:#000
style D fill:#dcfce7,stroke:#16a34a,color:#000
| Model | AUC ↓ | AUC SD | AOPC ↑ | AOPC SD | Verdict |
|---|---|---|---|---|---|
| VGG16 | 0.828 | 0.119 | −0.012 | 0.140 | Unfaithful |
| ViT-B/16 | 0.588 | 0.076 | +0.199 | 0.143 | Faithful |
A lower AUC means confidence decays faster as important pixels are removed; a higher AOPC means a greater average confidence drop. Both indicate more faithful explanations.
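A minimal sketch of the deletion test in PyTorch is shown below. The deletion schedule (10 equal steps) and the zero-fill baseline are assumptions for illustration, not necessarily the paper's exact protocol; the AOPC line implements one common variant of the metric.

```python
import numpy as np
import torch

@torch.no_grad()
def deletion_curve(model, image, heatmap, target_class, steps=10, fill=0.0):
    """Delete pixels most-important-first and track target-class confidence.

    image: (3, H, W) normalised tensor · heatmap: (H, W) numpy array.
    """
    order = np.argsort(-heatmap.ravel())          # most important pixels first
    n = heatmap.size
    x = image.clone()
    confs = []
    for step in range(steps + 1):
        prob = torch.softmax(model(x.unsqueeze(0)), dim=1)[0, target_class]
        confs.append(prob.item())
        if step < steps:                           # delete the next pixel batch
            idx = order[step * n // steps:(step + 1) * n // steps]
            rows, cols = np.unravel_index(idx, heatmap.shape)
            x[:, torch.as_tensor(rows), torch.as_tensor(cols)] = fill
    confs = np.asarray(confs)
    aopc = float(np.mean(confs[0] - confs[1:]))    # average confidence drop
    auc = float(np.trapz(confs, dx=1.0 / steps))   # area under deletion curve
    return aopc, auc
```

A faithful explanation yields a steeply falling curve (low AUC, positive AOPC); a negative AOPC, as observed for VGG16, means confidence actually rises when the "important" pixels are removed.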
We evaluate every heatmap across six independent dimensions with Bonferroni-corrected statistical testing (α = 0.05/6 ≈ 0.0083):
| Dimension | VGG16 | ViT-B/16 |
|---|---|---|
| Visual Intuitiveness | █████████░ 0.90 | ██████░░░░ 0.60 |
| Activation Contrast | ████░░░░░░ 0.45 | ████████░░ 0.85 |
| Spatial Selectivity | ████████░░ 0.80 | ████░░░░░░ 0.45 |
| Perturbation Robustness ↑ | ███░░░░░░░ 0.35 | █████████░ 0.95 |
| Inter-Method Agreement ↑ | ░░░░░░░░░░ 0.05 | ███████░░░ 0.75 |
| Causal Faithfulness ↑ | ░░░░░░░░░░ 0.05 | ████████░░ 0.80 |
↑ = higher is better. VGG16 leads on surface metrics; ViT-B/16 leads on every trustworthiness dimension.
| Metric | VGG16 | ViT-B/16 | Winner |
|---|---|---|---|
| Entropy | 5.159 ± 0.034 | 4.987 ± 0.092 | — |
| Activation Std Dev | 0.216 ± 0.018 | 0.250 ± 0.024 | ViT |
| Sparsity | 0.466 ± 0.148 | 0.252 ± 0.116 | — |
| Top-k Mass | 16.350 ± 0.874 | 16.197 ± 0.861 | ≈ Tie |
| Robustness ↑ | 0.542 ± 0.215 | 0.809 ± 0.217 | ViT |
| Inter-Method ↑ | −0.309 ± 0.483 | +0.301 ± 0.406 | ViT |
All six comparisons reach statistical significance (p < 0.0083 after correction).
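A sketch of the corrected testing procedure, assuming paired per-image metric scores for the two models and SciPy's Wilcoxon signed-rank test (the specific non-parametric test is an assumption; the README only states "non-parametric testing"):

```python
from scipy.stats import wilcoxon

ALPHA, N_TESTS = 0.05, 6
threshold = ALPHA / N_TESTS          # Bonferroni: 0.05 / 6 ≈ 0.0083

def compare(metric_name, vgg_scores, vit_scores):
    """Paired non-parametric test on per-image metric scores."""
    stat, p = wilcoxon(vgg_scores, vit_scores)
    verdict = "significant" if p < threshold else "not significant"
    print(f"{metric_name}: W={stat:.1f}, p={p:.4g} -> {verdict} at α={threshold:.4f}")
```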
mindmap
root((Study\nFindings))
Transfer Learning
VGG16 converges in 6 epochs
ViT converges in 7 epochs
Custom CNN needs all 15 epochs
9-point accuracy gap vs scratch
Architecture Alignment
VGG16 excels at focal patterns
Bacterial pneumonia F1=0.83
ViT excels at global patterns
COVID-19 recall=0.99
Viral pneumonia hardest for all
Explainability Paradox
VGG16 AOPC = negative
Heatmaps misleading
ViT AOPC = positive
Heatmaps causally valid
Visual plausibility ≠ faithfulness
Clinical Implications
Standard visual inspection insufficient
Faithfulness testing is safety-critical
Three-layer validation framework proposed
| Property | Value |
|---|---|
| Source | Pneumonia & COVID-19 Image Dataset — GiBi13 on Kaggle |
| Total images | 6,432 posterior-anterior chest X-rays |
| Classes | Normal · Bacterial Pneumonia · Viral Pneumonia · COVID-19 |
| Split | 80% train / 10% validation / 10% test (stratified) |
| Imbalance handling | Class-weighted cross-entropy loss |
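The class-weighted loss follows the wₖ = N / (K·nₖ) rule from the pipeline diagram. A minimal PyTorch sketch, with hypothetical per-class counts (the actual training-split counts are not listed here):

```python
import torch
import torch.nn as nn

def class_weighted_ce(class_counts):
    """Build w_k = N / (K * n_k) weights and the matching CE loss."""
    counts = torch.tensor(class_counts, dtype=torch.float32)  # n_k per class
    N, K = counts.sum(), len(counts)
    weights = N / (K * counts)            # rarer classes get larger weights
    return nn.CrossEntropyLoss(weight=weights)

# Hypothetical counts for [Normal, Bacterial, Viral, COVID-19]
criterion = class_weighted_ce([1266, 2222, 1198, 460])
```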
Transfer-Learning/
├── .streamlit/ # Streamlit configuration
├── app/
│ ├── components/ # UI components
│ └── utils/ # Inference utilities
├── data/ # Dataset (not tracked)
├── models/ # Saved model checkpoints
├── notebooks/ # Training & evaluation notebooks
├── screenshots/ # App interface screenshots
├── presentation/ # Slide deck
├── reports/ # Figures, metrics, outputs
├── results/
├── src/ # Core training & evaluation code
├── requirements.txt
└── README.md
Note: Source code, notebooks, and trained model weights will be released publicly upon acceptance of the associated research paper, which is currently under peer review.
A full interactive diagnostic interface, built with Streamlit, is included, supporting:
- Single image upload and real-time classification
- Per-class confidence scores with visual breakdown
- GradCAM++ heatmap overlay on prediction
- Model selection (VGG16 / ViT-B/16)
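The heatmap overlay can be reproduced with the pytorch-grad-cam library. A minimal sketch, assuming a torchvision VGG16 as a stand-in for the fine-tuned checkpoint and the last convolutional block as the target layer:

```python
from torchvision.models import vgg16
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = vgg16(weights="IMAGENET1K_V1").eval()   # stand-in for the fine-tuned model
target_layers = [model.features[-1]]            # last conv block of VGG16

def gradcam_overlay(input_tensor, rgb_img, class_idx):
    """input_tensor: (1,3,224,224) normalised; rgb_img: HxWx3 float in [0,1]."""
    cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
    grayscale_cam = cam(input_tensor=input_tensor,
                        targets=[ClassifierOutputTarget(class_idx)])[0]
    return show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
```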
| Document | Description |
|---|---|
| `README.md` | Project overview (this file) |
| `CASE_STUDY.md` | Deep dive into the explainability paradox finding |
| `METHODOLOGY.md` | Full technical methodology and mathematical formulations |
git clone https://github.com/Khanz9664/Transfer-Learning-for-Respiratory-Disease-Classification.git
cd Transfer-Learning-for-Respiratory-Disease-Classification
pip install -r requirements.txt

Key dependencies (requirements.txt):

torch>=2.0.0
torchvision>=0.15.0
transformers>=4.30.0
pytorch-grad-cam>=1.4.0
streamlit>=1.25.0
numpy
pandas
scikit-learn
matplotlib
seaborn
pillow
If you find this work useful, please cite it once the paper is published. In the meantime you may reference the preprint or this repository:
@misc{owais2026beyondvisual,
author = {Owais and {Shahid Ul Islam}},
title = {Beyond Visual Plausibility: A Faithfulness-Aware Comparison of
CNNs and Vision Transformers for Multi-Class Chest X-Ray Classification},
year = {2026},
note = {Manuscript under peer review},
institution = {Islamic University of Science and Technology, Awantipora}
}

This project is released under the MIT License. The dataset is subject to its original Kaggle terms of use.
- Dataset: GiBi13 on Kaggle
- Pretrained weights: PyTorch/torchvision (VGG16) and HuggingFace Transformers (ViT-B/16)
- CAM implementations: pytorch-grad-cam
Department of Computer Science · Islamic University of Science and Technology
This is a research prototype. Not intended for clinical use.