Paper: https://arxiv.org/abs/2601.07790 (arXiv:2601.07790)
Yahya Masri¹, Emily Ma¹, Zifu Wang², Joseph Rogers¹, Chaowei Yang¹
¹George Mason University, ²Harvard University
System logs are crucial for monitoring and diagnosing modern computing infrastructure, making them an important setting for evaluating compact operational AI models. To study their ability to reason over real-world system behavior, we benchmark nine Small Language Models (SLMs) and Small Reasoning Language Models (SRLMs) on log severity classification using zero-shot, few-shot, and retrieval-augmented prompting on journalctl data. To advance log reasoning in resource-constrained environments, we examine how small models behave under strict output requirements, heterogeneous log structure, and retrieval conditions. Retrieval-aware evaluation is particularly important for digital twin (DT) systems, where models must operate under tight latency budgets and shifting log distributions. Our benchmark reveals clear stratification: Qwen3-4B reaches 95.64% accuracy with RAG, Gemma3-1B improves from 20.25% to 85.28%, and the tiny Qwen3-0.6B achieves 88.12%, while several SRLMs degrade under retrieval and others exhibit prohibitive latency. These findings establish the first dedicated evaluation of SLMs and SRLMs on Linux system logs and highlight how retrieval, architecture, and efficiency jointly shape model suitability for DT pipelines. While much work remains, the results suggest that principled, retrieval-aware evaluation of compact models is a promising direction for building more capable operational AI systems.
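As a concrete illustration of the zero-shot setting (a sketch, not the paper's exact pipeline), a severity-classification prompt over a journalctl line might be constructed as below. The eight labels are the standard syslog priority levels that journalctl uses; the prompt template and function name are our own assumptions:

```python
# Hedged sketch: building a zero-shot prompt for log severity
# classification. The eight labels are the standard syslog/journalctl
# priority levels; the template itself is illustrative, not the exact
# prompt used in the paper.

SEVERITY_LEVELS = [
    "emerg", "alert", "crit", "err",
    "warning", "notice", "info", "debug",
]

def build_zero_shot_prompt(log_line: str) -> str:
    """Ask a model to answer with exactly one severity label."""
    labels = ", ".join(SEVERITY_LEVELS)
    return (
        "Classify the severity of the following Linux system log line.\n"
        f"Answer with exactly one of: {labels}.\n\n"
        f"Log: {log_line}\n"
        "Severity:"
    )

prompt = build_zero_shot_prompt(
    "kernel: Out of memory: Killed process 1234 (python3)"
)
```

Few-shot and RAG variants would prepend labeled example lines or retrieved similar logs, respectively, before the query.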
| Model | Zero-Shot | Few-Shot | RAG |
|---|---|---|---|
| Qwen3-4B | 27.12% | 56.01% | 95.64% |
| Qwen3-0.6B | 27.45% | 28.92% | 88.12% |
| Gemma3-1B | 0.14% | 20.25% | 85.28% |
| Gemma3-4B | 4.79% | 41.06% | 81.84% |
| Llama3.2-3B | 8.11% | 33.21% | 53.31% |
| Llama3.2-1B | 1.04% | 0.00% | 37.37% |
| Qwen3-1.7B | 33.61% | 43.30% | 28.96% |
| DeepSeek-R1-Distill-Qwen-1.5B | 11.54% | 17.63% | 3.17% |
| Phi-4-Mini-Reasoning | 9.20% | 0.00% | 0.00% |
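The strict output requirement mentioned above means a model's free-form reply must resolve to exactly one valid severity label. A minimal sketch of such a parser (our own illustration, assuming the standard syslog label set; the function name is hypothetical):

```python
# Hedged sketch of strict-output parsing: map a model's reply to one of
# the eight syslog severity labels, or None when the reply is ambiguous
# (no valid label, or more than one distinct label).
import re

SEVERITY_LEVELS = {"emerg", "alert", "crit", "err",
                   "warning", "notice", "info", "debug"}

def extract_severity(reply: str):
    tokens = re.findall(r"[a-z]+", reply.lower())
    matches = [t for t in tokens if t in SEVERITY_LEVELS]
    # Accept only replies that name exactly one distinct label.
    return matches[0] if len(set(matches)) == 1 else None
```

Under this kind of rule, verbose or hedging replies count as misclassifications, which helps explain the near-zero zero-shot scores of some models in the table.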
```shell
git clone https://github.com/stccenter/Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification.git
cd Benchmarking-SLMs-and-SRLMs-on-System-Log-Severity-Classification
pip install -r requirements.txt
```

This project is licensed under the MIT License - see the LICENSE file for details.
```bibtex
@misc{masri2025benchmarking,
  title={Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification},
  author={Masri, Yahya and Ma, Emily and Wang, Zifu and Rogers, Joseph and Yang, Chaowei},
  year={2025},
  eprint={2601.07790},
  archivePrefix={arXiv}
}
```

If you have any questions, please raise an issue or contact us at ymasri@gmu.edu.
