
Paper: https://arxiv.org/abs/2601.07790 (arXiv:2601.07790)

Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification


Yahya Masri¹, Emily Ma¹, Zifu Wang², Joseph Rogers¹, Chaowei Yang¹

¹George Mason University, ²Harvard University


1. Introduction

System logs are crucial for monitoring and diagnosing modern computing infrastructure, making them an important setting for evaluating compact operational AI models. To study their ability to reason over real-world system behavior, we benchmark nine Small Language Models (SLMs) and Small Reasoning Language Models (SRLMs) on log severity classification using zero-shot, few-shot, and retrieval-augmented prompting on journalctl data. To advance log reasoning in resource-constrained environments, we examine how small models behave under strict output requirements, heterogeneous log structure, and retrieval conditions. Retrieval-aware evaluation is particularly important for digital twin (DT) systems, where models must operate under tight latency budgets and shifting log distributions. Our benchmark reveals clear stratification: Qwen3-4B reaches 95.64% accuracy with RAG, Gemma3-1B improves from 20.25% to 85.28%, and the tiny Qwen3-0.6B achieves 88.12%, while several SRLMs degrade under retrieval and others exhibit prohibitive latency. These findings establish the first dedicated evaluation of SLMs and SRLMs on Linux system logs and highlight how retrieval, architecture, and efficiency jointly shape model suitability for DT pipelines. While much work remains, the results suggest that principled, retrieval-aware evaluation of compact models is a promising direction for building more capable operational AI systems.
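The benchmark classifies journalctl entries by severity. As a minimal illustration of the task's ground truth (not the paper's pipeline), the sketch below labels `journalctl -o json` records using the standard journald `PRIORITY` field, which follows the syslog severity scale; the helper name is ours:

```python
import json

# Syslog severity levels carried in journald's PRIORITY field (0 = most severe).
SEVERITIES = {
    0: "emerg", 1: "alert", 2: "crit", 3: "err",
    4: "warning", 5: "notice", 6: "info", 7: "debug",
}

def label_journal_entry(line: str) -> tuple[str, str]:
    """Extract (message, severity label) from one `journalctl -o json` line."""
    entry = json.loads(line)
    priority = int(entry.get("PRIORITY", 6))  # journald treats info as the default level
    return entry.get("MESSAGE", ""), SEVERITIES[priority]

# Example journald record (fields abbreviated for illustration):
record = '{"PRIORITY": "3", "MESSAGE": "disk I/O error on sda1"}'
print(label_journal_entry(record))  # → ('disk I/O error on sda1', 'err')
```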


2. Evaluation Results

Accuracy Comparison

| Model | Zero-Shot | Few-Shot | RAG |
|---|---|---|---|
| Qwen3-4B | 27.12% | 56.01% | 95.64% |
| Qwen3-0.6B | 27.45% | 28.92% | 88.12% |
| Gemma3-1B | 0.14% | 20.25% | 85.28% |
| Gemma3-4B | 4.79% | 41.06% | 81.84% |
| Llama3.2-3B | 8.11% | 33.21% | 53.31% |
| Llama3.2-1B | 1.04% | 0.00% | 37.37% |
| Qwen3-1.7B | 33.61% | 43.30% | 28.96% |
| DeepSeek-R1-Distill-Qwen-1.5B | 11.54% | 17.63% | 3.17% |
| Phi-4-Mini-Reasoning | 9.20% | 0.00% | 0.00% |
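The table shows retrieval dominating the other prompting modes for most models. The repository defines the actual prompts used; as a generic illustration only, a retrieval-augmented classification prompt for this task could be assembled as below (the function name, wording, and constrained label set are our assumptions, not the paper's exact prompt):

```python
def build_rag_prompt(query_log: str, retrieved: list[tuple[str, str]]) -> str:
    """Assemble a retrieval-augmented severity-classification prompt:
    each retrieved (log, severity) pair becomes an in-context example,
    and the model is constrained to emit a single severity label."""
    lines = [
        "Classify the severity of the final log line.",
        "Answer with exactly one of: emerg, alert, crit, err, warning, notice, info, debug.",
        "",
    ]
    for log, severity in retrieved:
        lines.append(f"Log: {log}\nSeverity: {severity}")
    lines.append(f"Log: {query_log}\nSeverity:")
    return "\n".join(lines)

prompt = build_rag_prompt(
    "disk I/O error on sda1",
    retrieved=[
        ("out of memory: killed process 1234", "err"),
        ("session opened for user root", "info"),
    ],
)
print(prompt)
```

In a real RAG setup the `retrieved` pairs would come from a nearest-neighbor search over an embedded corpus of labeled logs; here they are hard-coded for illustration.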

3. Quick Start

git clone https://github.com/stccenter/Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification.git
cd Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification

pip install -r requirements.txt

4. License

This project is licensed under the MIT License - see the LICENSE file for details.


5. Citation

@misc{masri2025benchmarking,
  title={Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification},
  author={Masri, Yahya and Ma, Emily and Wang, Zifu and Rogers, Joseph and Yang, Chaowei},
  year={2025},
  eprint={2601.07790},
  archivePrefix={arXiv}
}

6. Contact

If you have any questions, please raise an issue or contact us at ymasri@gmu.edu.
