Paper: https://arxiv.org/abs/2601.07790 (arXiv:2601.07790)
Yahya Masri¹, Emily Ma¹, Zifu Wang², Joseph Rogers¹, Chaowei Yang¹
¹George Mason University, ²Harvard University
System logs are crucial for monitoring and diagnosing modern computing infrastructure, making them an important setting for evaluating compact operational AI models. To study their ability to reason over real-world system behavior, we benchmark nine Small Language Models (SLMs) and Small Reasoning Language Models (SRLMs) on log severity classification using zero-shot, few-shot, and retrieval-augmented prompting on journalctl data. To advance log reasoning in resource-constrained environments, we examine how small models behave under strict output requirements, heterogeneous log structure, and retrieval conditions. Retrieval-aware evaluation is particularly important for digital twin (DT) systems, where models must operate under tight latency budgets and shifting log distributions. Our benchmark reveals clear stratification: Qwen3-4B reaches 95.64% accuracy with RAG, Gemma3-1B improves from 20.25% to 85.28%, and the tiny Qwen3-0.6B achieves 88.12%, while several SRLMs degrade under retrieval and others exhibit prohibitive latency. These findings establish the first dedicated evaluation of SLMs and SRLMs on Linux system logs and highlight how retrieval, architecture, and efficiency jointly shape model suitability for DT pipelines. While much work remains, the results suggest that principled, retrieval-aware evaluation of compact models is a promising direction for building more capable operational AI systems.
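As a concrete illustration of the zero-shot setting (a sketch, not the paper's exact pipeline), a severity-classification prompt over a journalctl line might be constructed as below. The eight labels are the standard syslog priority levels that journalctl uses; the prompt template and function name are our own assumptions:

```python
# Hedged sketch: building a zero-shot prompt for log severity
# classification. The eight labels are the standard syslog/journalctl
# priority levels; the template itself is illustrative, not the exact
# prompt used in the paper.

SEVERITY_LEVELS = [
    "emerg", "alert", "crit", "err",
    "warning", "notice", "info", "debug",
]

def build_zero_shot_prompt(log_line: str) -> str:
    """Ask a model to answer with exactly one severity label."""
    labels = ", ".join(SEVERITY_LEVELS)
    return (
        "Classify the severity of the following Linux system log line.\n"
        f"Answer with exactly one of: {labels}.\n\n"
        f"Log: {log_line}\n"
        "Severity:"
    )

prompt = build_zero_shot_prompt(
    "kernel: Out of memory: Killed process 1234 (python3)"
)
```

Few-shot and RAG variants would prepend labeled example lines or retrieved similar logs, respectively, before the query.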
| Model | Zero-Shot | Few-Shot | RAG |
|---|---|---|---|
| Qwen3-4B | 27.12% | 56.01% | 95.64% |
| Qwen3-0.6B | 27.45% | 28.92% | 88.12% |
| Gemma3-1B | 0.14% | 20.25% | 85.28% |
| Gemma3-4B | 4.79% | 41.06% | 81.84% |
| Llama3.2-3B | 8.11% | 33.21% | 53.31% |
| Llama3.2-1B | 1.04% | 0.00% | 37.37% |
| Qwen3-1.7B | 33.61% | 43.30% | 28.96% |
| DeepSeek-R1-Distill-Qwen-1.5B | 11.54% | 17.63% | 3.17% |
| Phi-4-Mini-Reasoning | 9.20% | 0.00% | 0.00% |
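The strict output requirement mentioned above means a model's free-form reply must resolve to exactly one valid severity label. A minimal sketch of such a parser (our own illustration, assuming the standard syslog label set; the function name is hypothetical):

```python
# Hedged sketch of strict-output parsing: map a model's reply to one of
# the eight syslog severity labels, or None when the reply is ambiguous
# (no valid label, or more than one distinct label).
import re

SEVERITY_LEVELS = {"emerg", "alert", "crit", "err",
                   "warning", "notice", "info", "debug"}

def extract_severity(reply: str):
    tokens = re.findall(r"[a-z]+", reply.lower())
    matches = [t for t in tokens if t in SEVERITY_LEVELS]
    # Accept only replies that name exactly one distinct label.
    return matches[0] if len(set(matches)) == 1 else None
```

Under this kind of rule, verbose or hedging replies count as misclassifications, which helps explain the near-zero zero-shot scores of some models in the table.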
```shell
git clone https://github.com/stccenter/Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification.git
cd Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification
pip install -r requirements.txt
```

This project is licensed under the MIT License - see the LICENSE file for details.
```bibtex
@misc{masri2025benchmarking,
  title={Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification},
  author={Masri, Yahya and Ma, Emily and Wang, Zifu and Rogers, Joseph and Yang, Chaowei},
  year={2025},
  eprint={2601.07790},
  archivePrefix={arXiv}
}
```

If you have any questions, please raise an issue or contact us at ymasri@gmu.edu.
