Character-Leaderboard

AI-powered social apps have reached 100 million monthly active users and continue to grow. Real-person social apps, with billions of users, are also creating demand for AI-assisted social interaction. Together, these trends indicate vast market potential for LLM-based social intelligence.

The Qwen-Character model is a branch of the base Qwen model. Where the base model models "world knowledge," Qwen-Character models "humans." It covers typical application scenarios such as role-playing, emotional companionship, avatar replication, smart hardware, and digital employees. Qwen-Character is optimized around six core human capabilities (personality, emotion, memory, mindset, knowledge, and morality) and achieves leading results on them.

👏 You are welcome to try the Qwen-Character model through our Bailian service!

Character-Leaderboard

For performance evaluation, we selected eight representative datasets from publicly available industry benchmarks related to character analysis: CharacterEval, CharacterBench, CoSER, WikiRole, TomBench, OpenTom, EmoBench, and MemoryEval. Each benchmark has multiple evaluation dimensions, and these dimensions are not consistent from one benchmark to the next, so directly aggregating the results would obscure the model's performance on any individual dimension. We therefore reorganized and summarized the results into eight dimensions (basic dialogue, dialogue appeal, memory, knowledge, personality, emotion, mindset, and morality) to construct the Character-Leaderboard. The Qwen-Character model achieves leading performance across all eight dimensions.
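The regrouping step can be illustrated with a minimal Python sketch. It is not the code used to build the leaderboard: the sub-metric names, the benchmark-to-dimension mapping, and the unweighted mean are all assumptions made for illustration, and the sketch assumes every score has already been normalized to a common 0 to 1 scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from (benchmark, sub-metric) to a leaderboard dimension.
# The sub-metric names are placeholders, not the benchmarks' real metric names.
DIMENSION_OF = {
    ("CharacterEval", "fluency"): "basic dialogue",
    ("CharacterBench", "engagement"): "dialogue appeal",
    ("MemoryEval", "recall"): "memory",
    ("WikiRole", "qa"): "knowledge",
    ("EmoBench", "accuracy"): "emotion",
    ("TomBench", "accuracy"): "mindset",
    ("OpenTom", "accuracy"): "mindset",
}


def regroup(scores):
    """Group per-benchmark sub-scores by dimension and average each group.

    `scores` maps (benchmark, sub-metric) pairs to values on a shared 0-1 scale.
    Pairs with no entry in DIMENSION_OF are ignored.
    """
    buckets = defaultdict(list)
    for key, value in scores.items():
        dimension = DIMENSION_OF.get(key)
        if dimension is not None:
            buckets[dimension].append(value)
    # Unweighted mean per dimension (an assumption; the source does not say
    # how sub-scores are combined).
    return {dimension: mean(values) for dimension, values in buckets.items()}


if __name__ == "__main__":
    # Made-up placeholder scores, not real evaluation results.
    example = {
        ("TomBench", "accuracy"): 0.5,
        ("OpenTom", "accuracy"): 1.0,
        ("EmoBench", "accuracy"): 0.25,
    }
    print(regroup(example))
```

Reporting one number per dimension rather than one per benchmark keeps, for example, the two theory-of-mind benchmarks from being counted as unrelated results.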

The evaluation method depends on the benchmark. CharacterEval and CharacterBench use their dedicated benchmarking tools to score single-turn responses on dimensions such as dialogue, character design, and plot. CoSER uses GPT-4o to evaluate multi-turn dialogues. WikiRole and MemoryEval use a knowledge-based question-and-answer format scored by GPT-4o. TomBench, OpenTom, and EmoBench are evaluated with multiple-choice questions.
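For the multiple-choice benchmarks, scoring reduces to comparing the option a model picks with the gold option. The sketch below shows one simple way to do that in Python. The function names `extract_choice` and `accuracy` are hypothetical, the A-D option set is an assumption, and the answer-extraction rule is deliberately naive; the official TomBench, OpenTom, and EmoBench scripts may parse answers differently.

```python
import re


def extract_choice(response, options="ABCD"):
    """Return the first standalone option letter in `response`, or None.

    Naive rule: an uppercase letter from `options` bounded by non-word
    characters. A leading article such as "A ..." would be misread as option A.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def accuracy(responses, gold):
    """Fraction of responses whose extracted option equals the gold option."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Made-up model outputs and gold labels, for illustration only.
    outputs = ["The answer is B.", "(C) because she left the room", "no idea", "D"]
    labels = ["B", "C", "A", "A"]
    print(accuracy(outputs, labels))
```

A response from which no option can be extracted counts as wrong here, which is a conservative choice, since it avoids crediting a model for an answer it did not clearly give.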

News

[2026.02] P-GenRM has been accepted to ICLR 2026 as an oral presentation (Top 1%). P-GenRM turns user preference signals into structured evaluation chains and introduces test-time user-based scaling (with user prototypes) to improve personalization and generalization, achieving state-of-the-art results on personalized reward model benchmarks. [Code] [Paper]
[2026.01] iStar has been accepted to ICLR 2026! iStar learns implicit step rewards from trajectory preferences and improves credit assignment without step labels or extra rollouts. It achieves strong results on WebShop, VisualSokoban, and SOTOPIA. [Code] [Paper]

Folders and files

Character-GenRM-NLHF
Character-GenRM
CharacterRL-iStar
.gitignore
README.md
requirements.txt
