Skip to content

pavanvattikala/MultiLabelVision--Unified-CNN-for-Multi-Class-Image-Understanding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 

Repository files navigation

🧠 Multi-Label, Multi-Class Image Classification

This project focuses on building a deep learning model that can predict multiple independent labels from a single image.

Instead of asking a simple yes/no question like β€œIs this a cat?”, we train the model to answer several orthogonal questions such as:

  • What type of object is in the image?
  • What is its specific identity or sub-type?
  • From what angle or orientation is it shown?

πŸ’‘ Why Multi-Label, Multi-Class?

Real-world images often represent more than one thing at once.
A picture isn’t just about what is in it β€” it’s also about what kind, in what form, and from what perspective.

For example:

  • 🐱 An image of a cat may involve:

    • Type: Cat
    • Breed: Siamese
    • Pose: Sitting
  • πŸ‘š In fashion:

    • Category: Dress
    • Style: Casual
    • Viewpoint: Side-view
  • πŸ›’ In product images:

    • Product type: Beverage
    • Brand: Coca-Cola
    • Packaging orientation: Front-facing

Each of these labels describes a different semantic axis, and we want the model to learn all of them β€” at the same time β€” from one image.


πŸ“‚ Dataset Used

This project uses a publicly available image dataset with multi-attribute annotations. You can access the dataset here:

πŸ”— Kaggle Mechanical Components Multi-Attribute Dataset

The dataset consists of object images annotated with type, part, and angle β€” suitable for training a multi-label, multi-class model.
You may replace this dataset with any similar dataset from fashion, product images, or other domains.


🧠 Model Overview

The model consists of:

  • A shared CNN backbone (like ConvNeXt) that learns a common feature representation from the image.
  • Multiple independent classification heads, each responsible for predicting a specific label group.
  • A combined loss function that encourages the model to jointly optimize all tasks.

Each head is trained with a separate classification objective, allowing the model to learn disentangled features relevant to each label set.


βš™οΈ Training Pipeline

  1. Preprocessing
    Images are preprocessed and each one is assigned multiple labels β€” one for each classification head.

  2. Label Encoding
    Each label group (e.g., category, style, angle) is encoded independently, maintaining separation between semantic tasks.

  3. Model Training

    • The shared CNN backbone extracts image features.
    • Each output head produces logits for one label set.
    • The total loss is computed as the sum of the losses across all heads.
  4. Evaluation
    While training, each head's performance can be tracked independently, providing insights into how well the model learns each dimension.


🌍 General Applicability

This design is domain-agnostic and works in any scenario where an image conveys multiple attributes. Example applications include:

  • πŸ‘• Fashion: predict type, fabric, sleeve length, and view
  • 🐢 Animals: predict species, breed, posture
  • πŸ“¦ Retail: predict category, brand, packaging style
  • 🩺 Medical imaging: predict organ, condition, scan orientation
  • πŸ“· Surveillance: predict object type, activity, direction

Any task where multiple independent labels must be predicted from one image can benefit from this architecture.


πŸ” Extendability

This model architecture is flexible:

  • Add or remove classification heads based on the number of attributes.
  • Swap out the CNN backbone with any other architecture (e.g., ResNet, ViT).
  • Easily plug into real-world pipelines involving detection, recommendation, tagging, and more.

🧬 Summary

This project demonstrates a scalable and adaptable way to approach multi-label, multi-class image understanding.
Rather than training multiple models in isolation, we unify the process into a single, end-to-end system capable of extracting rich, structured information from raw images.

It’s a foundational architecture with applications across industries β€” fashion, healthcare, manufacturing, retail, and beyond.


About

Multi head CNN Model

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors