NTU 2026 Sem 2.
- Lab 1
- Task 1: Using different search algorithms to find a path from start to end in a grid world.
- Task 2: Implementing Monte Carlo and Q-Learning Reinforcement Learning to solve a maze.
- Lab 2:
- Task 1: Encoding a scenario in First Order Logic and proving unethical behaviour using Prolog.
- Task 2: Modelling the British Royal Family succession rules as a Prolog KB.
- Contributors:
- Hung: Lab 1 Task 1, Lab 2 Task 2
- Allen: Lab 1 Task 2, Lab 2 Task 1
- Important notes:
  - As per the assignment manual, Lab 1 Task 2 uses a vertical x-axis and a horizontal y-axis.
- Lab 1 Detailed Description
- Lab 2 Detailed Description
- Author: Allen
- Task: Lab 1 Task 2.2
The standard definition of the Q-value is:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\; A_t = a \,\right]$$
My initial proposed approach was: in each episode, calculate 10 paths from start to end (ε-soft) and compute the average for each node. However, after researching the standard solution, the more common approach is to calculate Q-values using accumulated past paths (Sutton & Barto, 2018, pp. 100–101):
$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i(s, a)$$

where $G_i(s, a)$ is the return observed after the $i$-th visit to $(s, a)$, and $N(s, a)$ counts visits accumulated over all past episodes.
This does not strictly follow the definition above, since the mathematical definition averages Q only over paths generated by the current policy, not over accumulated past paths. It remains justifiable, however: each episode introduces only a small policy improvement, so the accumulated average still converges because the incremental changes are small.
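The accumulated-returns update can be sketched as below. This is a minimal illustration, not the repository's actual code; the episode format (a list of `(state, action, reward)` triples) and the discount factor are assumptions.

```python
from collections import defaultdict

def mc_update(q, counts, episode, gamma=0.9):
    """Every-visit Monte Carlo update: fold each observed return into
    the running average Q(s, a) accumulated across all past episodes."""
    g = 0.0
    # Walk the episode backwards so g is the discounted return from each step.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        counts[(state, action)] += 1
        n = counts[(state, action)]
        # Incremental mean Q <- Q + (g - Q) / n is equivalent to averaging
        # every return ever observed for (s, a), so no return list is rebuilt.
        q[(state, action)] += (g - q[(state, action)]) / n
    return q

q = defaultdict(float)
counts = defaultdict(int)
# Hypothetical two-step episode reaching the goal with reward 1.
mc_update(q, counts, [((0, 0), "right", 0.0), ((0, 1), "down", 1.0)])
```

The incremental-mean form avoids storing the returns themselves, but `counts` still grows without bound in the number of visits, which is exactly the memory concern discussed next.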
This solution is stored in ./Lab1/part2/task2-2, reaching 90.9% similarity with the optimal solution.
The standard cumulative method introduces two problems:

- Memory: as episodes increase, storage grows linearly, $O(n)$.
- Staleness: although policy changes are small per episode, accumulating all past returns can introduce errors over many episodes.
To address both issues, I further improved the solution to average only over the most recent 1000 returns:

$$Q(s, a) = \frac{1}{|W(s, a)|} \sum_{G \in W(s, a)} G, \qquad |W(s, a)| \le 1000$$

where $W(s, a)$ is the window of the most recent returns observed for $(s, a)$.
A similar idea of bounding stored experience to the most recent entries appears in deep RL work, such as the experience replay buffer of Mnih et al. (2013).
This solution is stored in ./Lab1/part2/task2-2-v2. The algorithm also reaches 90.9% similarity with the optimal solution, but in much less time: with window size $k = 1000$, both the time to average and the space to store the returns become constant, $O(k)$, per state-action pair.
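The sliding-window average can be sketched with a bounded deque. Again this is an illustrative sketch rather than the repository's code; the window size and function names are assumptions.

```python
from collections import defaultdict, deque

WINDOW = 1000  # keep only the most recent returns per (state, action)

# deque(maxlen=WINDOW) silently drops the oldest entry on append,
# so memory per (s, a) is bounded at O(WINDOW), i.e. constant.
returns = defaultdict(lambda: deque(maxlen=WINDOW))

def record_return(state, action, g):
    """Store one observed return for (state, action)."""
    returns[(state, action)].append(g)

def q_value(state, action):
    """Average over the most recent (at most WINDOW) returns."""
    window = returns[(state, action)]
    return sum(window) / len(window) if window else 0.0
```

Because stale returns generated under old policies eventually fall out of the window, the estimate also tracks the improving policy more closely than the fully accumulated average.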
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv. https://arxiv.org/abs/1312.5602