A multimodal machine learning pipeline for detecting pneumonia and identifying its cause (Bacterial vs Viral) from chest X-rays combined with synthetic clinical data. A Genetic Algorithm (GA) is used for intelligent feature selection across the fused feature space.
This project demonstrates:
- Multimodal fusion: CNN image features (MobileNetV2) + clinical tabular features
- Genetic Algorithm for automated feature selection
- Binary classification: Normal vs Pneumonia
- Multi-class classification: Normal vs Bacterial Pneumonia vs Viral Pneumonia
Dataset: Chest X-Ray Images (Pneumonia) by Paul Mooney on Kaggle.
Note: Clinical features (temperature, WBC count, SpO2, etc.) are synthetically generated and correlated with the image labels to simulate a real multimodal dataset. They do not come from real patients.
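As a rough illustration of what "synthetically generated and correlated with the image labels" can look like, the sketch below draws each clinical value from a Gaussian whose mean depends on the label. The function name, the label strings, the column names, and every mean and spread are hypothetical values picked for illustration; they are not taken from download_data.py and have no clinical validity.

```python
import random

# Hypothetical (mean, std) per label and feature. Illustrative numbers only:
# not clinically validated and not the values used by download_data.py.
CLINICAL_PARAMS = {
    "normal":    {"temperature_c": (36.8, 0.3), "wbc_k_per_ul": (7.5, 1.5),  "spo2_pct": (98.0, 1.0)},
    "bacterial": {"temperature_c": (39.0, 0.6), "wbc_k_per_ul": (16.0, 3.0), "spo2_pct": (92.0, 2.5)},
    "viral":     {"temperature_c": (38.2, 0.5), "wbc_k_per_ul": (9.0, 2.0),  "spo2_pct": (94.0, 2.0)},
}


def synth_clinical_row(label, rng):
    """Draw one synthetic clinical record whose distribution depends on the label."""
    row = {name: rng.gauss(mu, sigma) for name, (mu, sigma) in CLINICAL_PARAMS[label].items()}
    row["spo2_pct"] = min(row["spo2_pct"], 100.0)  # oxygen saturation cannot exceed 100%
    row["label"] = label
    return row


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [
    synth_clinical_row(label, rng)
    for label in ("normal", "bacterial", "viral")
    for _ in range(200)
]
```

Because the features are generated from the labels, a classifier can learn the label partly from the clinical columns alone, which is worth keeping in mind when reading the reported accuracy.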
Mini_Proj/
│
├── data/                      # Created after running download_data.py
│   └── chest_xray/
│       ├── train/
│       │   ├── NORMAL/
│       │   └── PNEUMONIA/
│       ├── test/
│       ├── val/
│       └── clinical_data.csv  # Auto-generated synthetic clinical data
│
├── outputs/                   # Created after running train_evaluate.py
│   ├── cm_pneumonia.png
│   └── cm_cause.png
│
├── data_loader.py             # Loads images + clinical data, preprocesses
├── download_data.py           # Downloads Kaggle dataset + generates clinical data
├── feature_extractor.py       # MobileNetV2 CNN feature extraction + fusion
├── genetic_algorithm.py       # GA-based feature selection
├── train_evaluate.py          # Main training & evaluation pipeline
├── requirements.txt
├── README.md
└── .gitignore
git clone https://github.com/<your-username>/pediatric-pneumonia-detection.git
cd pediatric-pneumonia-detection

python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

pip install -r requirements.txt

- Go to kaggle.com → Account → Create API Token
- Place the downloaded kaggle.json in ~/.kaggle/ (Linux/macOS) or C:\Users\<user>\.kaggle\ (Windows)
- Set permissions:

chmod 600 ~/.kaggle/kaggle.json

python download_data.py

This downloads the Kaggle chest X-ray dataset and generates clinical_data.csv.

python train_evaluate.py

For a quick test with a small sample:

# In train_evaluate.py, change:
loader.load_data(max_samples=200) # Use a small number
# GA settings:
GeneticAlgorithmFeatureSelection(population_size=10, generations=5)

For a full run, set max_samples=None and increase GA population/generations.
        Chest X-Ray Images
                 │
                 ▼
MobileNetV2 (pretrained, ImageNet)
      GlobalAveragePooling2D
                 │
                 ▼                   Synthetic Clinical Features
       CNN Features (1280-d)     +  (temperature, WBC, SpO2, ...)
                 │                                │
                 └──────────── Fusion ────────────┘
                                 │
                                 ▼
                Genetic Algorithm Feature Selection
                                 │
                                 ▼
                      RandomForest Classifier
                          /              \
       Binary Classification         Multi-class Classification
       (Normal vs Pneumonia)        (Normal vs Bacteria vs Virus)
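The fusion step in the diagram can be sketched as a per-sample concatenation of the 1280-d image vector with the clinical values. This is a minimal, dependency-free illustration using plain Python lists; the function names are hypothetical, and standardizing the clinical columns before concatenation is an assumption made here, not something confirmed by feature_extractor.py.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (constant columns are only centered)."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)] for row in rows]


def fuse(cnn_features, clinical_features):
    """Concatenate per-sample CNN features with scaled clinical features."""
    if len(cnn_features) != len(clinical_features):
        raise ValueError("both modalities must describe the same samples, in the same order")
    clinical_scaled = standardize_columns(clinical_features)
    return [list(img) + clin for img, clin in zip(cnn_features, clinical_scaled)]


# Stand-in data: three samples with 1280-d image vectors and three clinical values each.
cnn_features = [[0.0] * 1280, [0.5] * 1280, [1.0] * 1280]
clinical_features = [[36.8, 7.5, 98.0], [39.0, 16.0, 92.0], [38.2, 9.0, 94.0]]
fused = fuse(cnn_features, clinical_features)  # 3 rows of 1280 + 3 = 1283 values
```

With a few clinical columns next to 1280 image dimensions, the fused space is dominated by CNN features, which is one reason a feature selection step follows.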
Outputs are saved to the outputs/ directory:
- cm_pneumonia.png: Confusion matrix for binary classification
- cm_cause.png: Confusion matrix for cause classification

Results vary depending on max_samples and GA parameters. Use max_samples=None to train on the full dataset, which generally gives the best accuracy.
| Component | Choice | Reason |
|---|---|---|
| CNN Backbone | MobileNetV2 | Lightweight, pretrained on ImageNet |
| Image Size | 128×128 | Balance between speed and detail |
| Feature Selection | Genetic Algorithm | Handles mixed (image+clinical) feature spaces |
| Classifier | Random Forest | Robust, interpretable, handles high dimensions |
| Clinical Data | Synthetic | Demonstrates multimodal pipeline without real EHR data |
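To make the Genetic Algorithm row concrete, here is a minimal sketch of GA-based feature selection over boolean feature masks. Only the parameter names population_size and generations come from this README; the function name, mutation_rate, seed, the choice of truncation selection with single-point crossover, and the toy fitness function are assumptions for illustration and may differ from genetic_algorithm.py. In the real pipeline the fitness would presumably be a classifier score on the selected columns.

```python
import random


def ga_select_features(n_features, fitness, population_size=10, generations=5,
                       mutation_rate=0.02, seed=0):
    """Evolve boolean feature masks and return the best mask found.

    `fitness` maps a mask (list of bools) to a score; higher is better.
    Requires n_features >= 2 so that a crossover point exists.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, population_size // 2)]  # truncation selection
        children = [scored[0][:]]  # elitism: carry the best mask over unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [(not gene) if rng.random() < mutation_rate else gene for gene in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


# Toy fitness (hypothetical): reward three "informative" indices, penalize mask size.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


mask = ga_select_features(n_features=12, fitness=toy_fitness,
                          population_size=20, generations=30)
selected = [i for i, keep in enumerate(mask) if keep]
```

The sketch re-evaluates fitness several times per mask for brevity; when fitness means training a classifier, caching scores per mask avoids most of that cost.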
- Clinical features are synthetic and not clinically validated
- Small sample sizes significantly affect accuracy
- The GA is computationally expensive; more generations can improve results at the cost of longer runtime
- Unknown pneumonia cause samples (those without bacteria/virus in filename) are excluded from multi-class training
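The filename rule in the last limitation can be sketched as follows. The function name and the returned label strings are hypothetical, the example paths are illustrative, and the actual logic in data_loader.py may differ; the sketch only assumes that the cause appears as "bacteria" or "virus" in the filename and that normal images live in a NORMAL/ folder.

```python
from pathlib import Path


def cause_label(image_path):
    """Return 'normal', 'bacterial', 'viral', or None when the cause is unknown."""
    path = Path(image_path)
    if path.parent.name.upper() == "NORMAL":
        return "normal"
    name = path.name.lower()
    if "bacteria" in name:
        return "bacterial"
    if "virus" in name:
        return "viral"
    return None  # pneumonia image with no cause marker in the filename


# Illustrative paths; the last filename is made up to show the "unknown" case.
paths = [
    "data/chest_xray/train/NORMAL/IM-0115-0001.jpeg",
    "data/chest_xray/train/PNEUMONIA/person1_bacteria_1.jpeg",
    "data/chest_xray/train/PNEUMONIA/person1000_virus_1681.jpeg",
    "data/chest_xray/train/PNEUMONIA/unlabelled_scan.jpeg",
]
labelled = [(p, cause_label(p)) for p in paths]
# Unknown-cause samples are dropped for multi-class training only.
multiclass = [(p, label) for p, label in labelled if label is not None]
```

For the binary task the unknown-cause images can still be used, since they are known to be pneumonia cases.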
- Python 3.8–3.10
- TensorFlow < 2.16
- See requirements.txt for the full list
This project is for educational purposes. The chest X-ray dataset is subject to Kaggle's terms of use.