Welcome! This guide helps you navigate the repository based on your goals and background.
Start here → Archive/2017-Course-Notes
Learning sequence:
- Watch intro talks (main README)
- Read Sutton & Barto book chapters
- Follow David Silver's course
- Study CS294 notes
- Implement classic algorithms (DQN, A3C, PPO)
Time estimate: 1-2 months for foundations
Start here → Modern-RL-Research/RLHF-and-Alignment
Learning sequence:
- Understand RLHF basics (PPO, reward models)
- Learn about DPO as simpler alternative
- Study safety considerations
- Explore code generation applications
- Review recent papers (PAPERS.md files)
Time estimate: 2-3 weeks to get up to speed
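The PPO-vs-DPO distinction above comes down to the loss function: DPO skips the explicit reward model and optimizes pairwise preferences directly. A minimal sketch of the per-pair DPO loss in pure Python (the function name and the β = 0.1 default are illustrative, not from this repo):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    responses under the trained policy and the frozen reference model.
    """
    # Implicit reward margins relative to the reference model
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Logistic loss on the scaled margin difference
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

The loss falls as the policy (relative to the reference) assigns more probability to the chosen response than the rejected one; in practice this runs batched over token-level log-probs in PyTorch, as in TRL's `DPOTrainer`.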
Start here → Modern-RL-Research/LLM-Code-Generation
Learning sequence:
- Study AlphaCode and CodeRL papers
- Understand execution feedback as rewards
- Learn about safety and sandboxing
- Review benchmarks (HumanEval, MBPP)
- Experiment with TRL library
- Check PAPERS.md for latest research
Time estimate: 1-2 weeks for overview, ongoing for deep work
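"Execution feedback as rewards", mentioned above, can be as simple as running a candidate program against unit tests in a subprocess and mapping the outcome to a scalar. A hedged sketch loosely following CodeRL's reward scheme (+1 pass, -0.3 failed test, -0.6 error or timeout); the function name and exact values are illustrative, and a real system would add proper sandboxing:

```python
import os
import subprocess
import sys
import tempfile

def execution_reward(candidate_code, test_code, timeout=5.0):
    """Score a generated program by executing it against unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -0.6  # hung or too slow
    finally:
        os.unlink(path)
    if result.returncode == 0:
        return 1.0  # all asserts passed
    # Failed assertion vs. other runtime error
    return -0.3 if b"AssertionError" in result.stderr else -0.6
```

Running untrusted model output this way is exactly why the sandboxing material above matters: production setups isolate execution (containers, Judge0, gVisor) rather than the host interpreter.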
Start here → Modern-RL-Research/LLM-RL-Program-Synthesis
Key papers to read:
- AlphaCode (Science 2022) - Foundation paper
- CodeRL (NeurIPS 2022) - RL framework
- Process-supervised RL (2025) - Recent advances
- Browse PAPERS.md for latest work
Also check:
- Berkeley's safe execution work
- Test-time compute scaling (o1, DeepSeek R1)
Study-Reinforcement-Learning/
│
├── Archive/                          # Classic RL (2017)
│   └── 2017-Course-Notes/
│       ├── CS294-DeepRL-Berkeley/    # Levine, Schulman, Finn
│       └── Elements-Of-RL/           # Sutton & Barto concepts
│
├── Modern-RL-Research/               # Cutting-edge (2022-2025)
│   ├── LLM-RL-Program-Synthesis/     # AlphaCode, competitive coding
│   │   ├── README.md
│   │   └── PAPERS.md                 # 50 recent papers
│   │
│   ├── LLM-Code-Generation/          # Practical code generation
│   │   ├── README.md
│   │   └── PAPERS.md                 # 271 recent papers
│   │
│   └── RLHF-and-Alignment/           # PPO, DPO, GRPO
│       ├── README.md
│       └── PAPERS.md                 # 111 recent papers
│
├── scripts/                          # Automation tools
│   ├── arxiv_paper_collector.py      # Auto-fetch papers
│   └── papers_database.json          # Complete paper database
│
└── readme.md                         # Main entry point
- Test-Time Compute Scaling - o1, DeepSeek R1 approaches
- DPO Variants - Simpler alternatives to PPO
- Safe Code Generation - Sandboxing, Constitutional AI
- Multi-Modal Code - From diagrams/sketches to code
- Formal Verification + RL - Provably correct code
- 432 papers collected across all topics
- See PAPERS.md in each Modern-RL-Research subdirectory
- Organized by year (2025 → 2022)
- Implement Q-Learning on simple grid worlds
- Train DQN on Atari games
- Build Policy Gradient agent for CartPole
- Fine-tune small LLM with RLHF on toy task
- Implement DPO and compare with PPO
- Create code completion model with execution feedback
- Reproduce CodeRL results on HumanEval
- Build safe code executor with sandboxing
- Experiment with test-time compute scaling
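The first project above (Q-learning on a grid world) fits in a few lines of pure Python. A tabular sketch on a toy 1-D corridor, start at cell 0 with +1 reward for reaching the rightmost cell; the function name and hyperparameters are illustrative choices, not from the repo:

```python
import random

def q_learning_gridworld(size=5, episodes=500, alpha=0.5,
                         gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a 1-D corridor. Actions: 0=left, 1=right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(size)]  # Q[state][action]
    for _ in range(episodes):
        s = 0
        while s != size - 1:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(size - 1, s + 1)
            r = 1.0 if s2 == size - 1 else 0.0
            # Q-learning update toward the bootstrapped target
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q
```

After training, the greedy policy points right in every cell, and Q-values decay geometrically with distance from the goal, which makes discounting easy to see on paper before moving to DQN.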
- Sutton & Barto (2017) - The RL bible
- Spinning Up in Deep RL (OpenAI) - Practical guide
- CS294/285 lectures (Berkeley) - Academic depth
- InstructGPT paper (2022) - Started the RLHF trend
- DPO paper (2023) - Simpler alternative
- PAPERS.md files - Latest research
- AlphaCode (2022) - Breakthrough paper
- CodeRL (2022) - Framework
- Berkeley Safe Code (2025) - Safety focus
- Gymnasium (successor to OpenAI Gym) - Standard environments
- Stable-Baselines3 - Pre-built algorithms
- RLlib - Scalable RL library
- TRL (Transformer Reinforcement Learning) - Hugging Face library
- DeepSpeed - Efficient training
- Composer - MosaicML training framework
- HumanEval - Benchmark dataset
- Judge0 / Sphere Engine - Code execution
- Bandit / Semgrep - Security scanning
- HumanEval (164 problems) - Function-level
- MBPP (1000 problems) - Basic Python
- APPS (10K problems) - Competition-level
- LiveCodeBench - Continuously updated
- SWE-bench - Real GitHub issues
- GSM8K - Grade school math
- MATH - Competition mathematics
- Big-Bench Hard - Challenging tasks
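Most of the code benchmarks above are scored with pass@k: the probability that at least one of k sampled solutions passes the tests. The unbiased estimator from the HumanEval (Codex) paper, given n samples per problem of which c pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated, c passed, k drawn.

    pass@k = 1 - C(n - c, k) / C(n, k)
    i.e. one minus the probability that all k drawn samples fail.
    """
    if n - c < k:
        return 1.0  # fewer than k failures, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Reported scores average this over all problems in the benchmark; the naive estimator (run k samples, check any pass) is biased, which is why papers generate n > k samples and use this formula.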
Month 1 - Foundations:
- Week 1-2: Classic RL concepts, MDPs, value functions
- Week 3-4: Policy gradients, actor-critic methods
Month 2 - Deep RL:
- Week 1-2: DQN, A3C, PPO implementations
- Week 3-4: Advanced topics (TRPO, SAC, TD3)
Month 3 - LLM-era RL:
- Week 1-2: RLHF basics, reward modeling
- Week 3-4: Code generation, program synthesis
- Run arxiv_paper_collector.py monthly
- Follow researchers on Twitter/X
- Attend conference workshops
- Join r/reinforcementlearning
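The repo's arxiv_paper_collector.py is the maintained tool for this; as an illustration of what such a collector builds on, here is a sketch that constructs a query URL for the public arXiv Atom API (the function name and parameter choices are assumptions, not the script's actual interface):

```python
from urllib.parse import urlencode

def arxiv_query_url(search, max_results=25):
    """Build an arXiv API query URL for the most recent papers on a topic.

    Fetch the returned URL with urllib.request.urlopen and parse the
    Atom XML response with xml.etree.ElementTree to extract entries.
    """
    params = urlencode({
        "search_query": search,     # e.g. 'all:RLHF AND all:"code generation"'
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",  # newest first
        "sortOrder": "descending",
    })
    return "http://export.arxiv.org/api/query?" + params
```

The arXiv API asks clients to rate-limit (one request every few seconds), so a monthly batch run like the script's is well within bounds.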
- Implement, don't just read - Code algorithms from scratch
- Start simple - Master toy problems before complex tasks
- Join communities - Reddit, Discord, Twitter
- Read papers actively - Take notes, ask questions
- Reproduce results - Verify claims with your own experiments
- Focus on gaps - What problems remain unsolved?
- Build on existing work - Don't start from zero
- Collaborate - Find research groups, mentors
- Share findings - Blog posts, papers, code
- Stay updated - Use the arxiv script regularly
- Use pretrained models - Don't train from scratch
- Start small - Scale up gradually
- Track experiments - Use Weights & Biases, MLflow
- Version control - Git for code, DVC for data
- Document everything - Future you will thank you
Found something useful? Share it!
Ways to contribute:
- Add papers to collections
- Update scripts with new features
- Write tutorials or guides
- Fix errors or broken links
- Share your projects
How to contribute:
- Fork the repository
- Make your changes
- Submit a pull request
- Discuss in issues
- r/reinforcementlearning - Active subreddit
- r/MachineLearning - Broader ML community
- Discord servers for specific libraries (Hugging Face, etc.)
- The Batch (DeepLearning.AI) - Weekly ML news
- ImportAI - AI research summaries
- TLDR AI - Daily AI updates
- NeurIPS - December, largest ML conference
- ICLR - May, representation learning focus
- ICML - July, broad ML scope
- EMNLP/ACL - NLP conferences with code papers
Q: Should I learn classic RL first? A: Yes! Understanding MDPs, value functions, and policy gradients is essential before diving into LLM applications.
Q: Can I skip the math? A: Some math is necessary, but you can learn alongside practical implementation. Don't let math block your progress.
Q: What programming languages do I need? A: Python is essential. PyTorch or JAX for deep learning. Familiarity with Git and command line.
Q: How much compute do I need? A: For learning: Just a laptop. For research: GPU access helpful (Colab, cloud services). For production: Significant resources.
Q: Where do I find collaborators? A: Online communities, university research groups, open source projects, conference workshops.
This repository is actively maintained. To stay current:
- Star and watch this repo on GitHub
- Run the arxiv script monthly for new papers
- Check main README for announcements
- Follow the field via Twitter, Reddit, newsletters
Pick your path above and dive in! Remember:
- Start small, build gradually
- Implement what you learn
- Share your journey
- Ask questions
- Have fun!
Good luck on your RL journey!
Last Updated: 2025