- 🎓 Undergraduate student at Beijing University of Posts and Telecommunications (BUPT), School of Computer Science
- 🔬 Research interests: RLVR · RLHF · Optimization Algorithms
- 🌱 Currently exploring the intersection of reinforcement learning and large language model alignment
- 📍 Beijing, China
| Area | Description |
|---|---|
| RLVR | Reinforcement Learning from Verifiable Rewards: scalable reward signals beyond human feedback |
| RLHF | Reinforcement Learning from Human Feedback: aligning LLMs with human preferences |
| Optimizer | Adaptive optimization methods (AdamW, Muon, Shampoo, etc.) for deep learning |
APO_OFFICAL: The official repository for Anchored Policy Optimization: Mitigating Exploration Collapse via Support-Constrained Rectification
⭐ 12 🍴 0
SPPO: [ACL 2026 Main] The official repository for SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
⭐ 2 🍴 2
No recent public activity.
- 2026-04-27: Uncertain Estimate
- 2026-04-27: Notes: Analysis of the Intrinsic [...] of DPO and GRPO
- 2026-04-27: Chapter 11: File System Implementation
- 2026-04-27: Chapter 8: Memory Management
- 2026-04-27: Chapter 9 & Chapter 10


