Benchmarking LLMs on Indic Languages

Performed as part of the COMS6998 course at Columbia University.

Datasets

Samanantar : https://huggingface.co/datasets/ai4bharat/samanantar
IndicSentenceSummarization : https://huggingface.co/datasets/ai4bharat/IndicSentenceSummarization

Models Used

Bloomz-560m : https://huggingface.co/bigscience/bloomz-560m
mBART-large : https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt
IndicBART-XXEN : https://huggingface.co/ai4bharat/IndicBART-XXEN

Implementation

For each model, the existing code base available at https://huggingface.co/ is refactored to perform the following tasks:

  • Machine Translation
  • Summarization

Each output file is then processed to compute the mean value of each of the following metrics:

  • METEOR
  • BLEU
  • ROUGE
  • BERTScore
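
The averaging step can be sketched as follows. This is a minimal illustration, not the repository's actual script: it assumes each output file holds one JSON object per line with per-example scores, and the key names (`meteor`, `bleu`, `rouge`, `bertscore`) and the helpers `load_jsonl` and `mean_metrics` are hypothetical.

```python
import json
from statistics import fmean

# Hypothetical key names; the real output files may use different ones.
METRICS = ("meteor", "bleu", "rouge", "bertscore")


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def mean_metrics(records, metrics=METRICS):
    """Return the mean of each metric over the records that report it."""
    means = {}
    for name in metrics:
        values = [float(r[name]) for r in records if name in r]
        if values:  # skip metrics that never appear in the file
            means[name] = fmean(values)
    return means


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    sample = [
        {"meteor": 0.4, "bleu": 0.2},
        {"meteor": 0.6, "bleu": 0.4},
    ]
    print(mean_metrics(sample))
```

For a real run, `mean_metrics(load_jsonl(path))` would be called once per output file, with `path` pointing to wherever that file was written. Metrics missing from a file are left out of the result rather than reported as zero.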