Part 1 explores uninformed and informed graph search algorithms on a real-world road network with the goal of finding optimal paths given multiple constraints.
- Graph: Real-world locations (nodes) connected by roads (weighted edges)
- Edge weights:
  - `Dist.json` — distance (in decimetres) for each edge
  - `Cost.json` — energy cost for each edge
- Search parameters:
- Start node: "1"
- Goal node: "50"
- Energy budget: 287,932 units
- Coordinates:
  - `Coord.json` stores GPS coordinates (lon, lat) as integers = degrees × 10⁶
Objective: Find the shortest path from start to goal ignoring energy cost.
- Priority queue: Ordered by accumulated distance (cost)
- Expansion: Always expands the node with minimum cumulative cost first
- Optimality: Guaranteed to find shortest path under non-negative costs
- State: Single node (no energy tracking needed)
Results:
- Shortest distance: 148,648.6 decimetres
- Path length: 122 nodes
- States visited: 5,304
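The Task 1 search above can be sketched as a standard uniform-cost loop. This is a minimal sketch with hypothetical container shapes (`graph` as an adjacency list, `dist` keyed by edge tuples), not the repository's actual code:

```python
import heapq

def ucs(graph, dist, start, goal):
    # graph: node -> list of neighbours; dist: (u, v) -> edge distance.
    pq = [(0.0, start, [start])]       # (accumulated distance, node, path)
    best = {start: 0.0}
    while pq:
        d, node, path = heapq.heappop(pq)
        if node == goal:
            return d, path             # first pop of goal is optimal
        if d > best.get(node, float("inf")):
            continue                   # stale queue entry, skip
        for nxt in graph[node]:
            nd = d + dist[(node, nxt)]
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(pq, (nd, nxt, path + [nxt]))
    return float("inf"), []
```

Because all edge weights are non-negative, the first time the goal is popped its accumulated distance is minimal, which is exactly the optimality guarantee stated above.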
Objective: Find the shortest path while staying within the energy budget of 287,932 units.
- Expanded state: `(node, energy_accumulated)`
- Priority queue: Ordered by accumulated distance
- Constraint: Prune edges that would exceed the energy budget
- Optimality technique — Pareto dominance:
  - At each node, maintain a Pareto front of non-dominated (distance, energy) labels
  - A new label $(d', e')$ is discarded if any existing label $(d, e)$ satisfies $d \le d'$ AND $e \le e'$
  - Labels dominated by the new one are removed
- This prunes suboptimal paths early, reducing the search space
Results:
- Shortest distance: 150,335.6 decimetres — slightly longer due to the energy constraint
- Total energy used: 259,087 units (within budget of 287,932)
- Path length: 122 nodes
- States visited: 30,267 — 5.7× more than Task 1 due to expanded state space
Objective: Find the shortest energy-constrained path using informed search with Haversine (great-circle) heuristic.
A* requires an admissible heuristic, i.e. one that never overestimates the true remaining distance to the goal.
Since the graph represents real-world locations with GPS coordinates, the Haversine formula computes the great-circle (straight-line) distance between two points on Earth's surface:
$$d = 2R \arcsin\left(\sqrt{\sin^2\tfrac{\Delta\phi}{2} + \cos\phi_1 \cos\phi_2 \sin^2\tfrac{\Delta\lambda}{2}}\right)$$

where $\phi_1, \phi_2$ are the two latitudes, $\Delta\phi$ and $\Delta\lambda$ are the latitude and longitude differences (in radians), and $R$ is Earth's radius (63,710,000 decimetres).
Admissibility: Road distances are always ≥ straight-line distances, so Haversine never overestimates → admissible.
```python
import json
import math

R = 63710000.0  # Earth's radius in decimetres

Coord = json.load(open("Coord.json"))   # node -> [lon, lat], degrees * 1e6
goal = "50"

lon2, lat2 = (v / 1e6 for v in Coord[goal])
lat2_r, lon2_r = math.radians(lat2), math.radians(lon2)

h = {}
for node, vals in Coord.items():
    lon1, lat1 = vals[0] / 1e6, vals[1] / 1e6
    lat1_r, lon1_r = math.radians(lat1), math.radians(lon1)
    dlat = lat2_r - lat1_r
    dlon = lon2_r - lon1_r
    a = math.sin(dlat/2)**2 + math.cos(lat1_r) * math.cos(lat2_r) * math.sin(dlon/2)**2
    h[node] = 2 * R * math.asin(math.sqrt(a))
```

The heuristic is precomputed once for all nodes → goal, giving constant-time lookups during the search.
- Priority queue: Ordered by $f(n) = g(n) + h(n)$, where $g(n)$ = accumulated distance
- Expanded state: `(node, energy_accumulated)`
- Pruning: Same Pareto dominance as Task 2
- Optimality: Guaranteed (Haversine is admissible and A* expands in best-first order)
Results:
- Shortest distance: 150,335.6 decimetres (same as Task 2 — optimal)
- Total energy used: 259,087 units
- Path length: 122 nodes
- States visited: 9,552 — 68.4% reduction vs UCS constrained
Objective: Find the shortest energy-constrained path using Euclidean distance heuristic (comparison).
Simple straight-line distance in (lon, lat) space:

$$h(n) = \sqrt{(\mathit{lon}_{goal} - \mathit{lon}_n)^2 + (\mathit{lat}_{goal} - \mathit{lat}_n)^2}$$

where coordinates are scaled by $10^{-6}$ from the stored integer values back to degrees.
Note: This heuristic is not truly admissible in geographic coordinates (Euclidean distance is not the true minimum), but it serves as a benchmark.
Same A* with expanded state and Pareto dominance as Task 3a, but uses Euclidean distance instead of Haversine.
Results:
- Shortest distance: 150,335.6 decimetres (same optimal path)
- Total energy used: 259,087 units
- Path length: 122 nodes
- States visited: 3,271 — 89.2% reduction vs UCS constrained
| Algorithm | States Visited | Reduction vs UCS | Path Optimality | Heuristic Quality |
|---|---|---|---|---|
| Task 1: UCS (relaxed) | 5,304 | -- | 100% (baseline) | None |
| Task 2: UCS (constrained) | 30,267 | -- | 100% (optimal under budget) | None |
| Task 3a: A* (Haversine) | 9,552 | 68.4% ↓ | 100% (optimal) | Haversine (admissible) |
| Task 3b: A* (Euclidean) | 3,271 | 89.2% ↓ | 100% (optimal) | Euclidean (tighter estimate) |
- Informed search beats uninformed: A* with Haversine reduces states expanded by 68% by guiding the search toward the goal.
- Heuristic quality matters: Even though Euclidean is not strictly admissible, it's a tighter estimate of road distance than Haversine and achieves 89% reduction.
- Pareto dominance pruning: Critical for multi-objective optimization (minimizing both distance and energy usage).
- Energy constraint adds complexity: the state space expands from single nodes to 30K+ (node, energy) tuples, yet informed search keeps expansions in the 3–10K range.
Part 2 explores different reinforcement learning and planning algorithms in a stochastic 5×5 grid world.
- Grid: 5×5, coordinates $(x, y)$, where x is the vertical axis (0 = bottom, 4 = top) and y is the horizontal axis (0 = left, 4 = right)
- Start: $(0, 0)$ — bottom-left
- Goal: $(4, 4)$ — top-right
- Obstacles: $(2, 1)$ and $(2, 3)$
- Step cost: $-1$; Goal reward: $+10$ (net $= +9$)
- Discount factor: $\gamma = 0.9$
- Stochastic transitions: 0.8 probability of the intended action, 0.1 each for the two perpendicular directions
All tasks use the same environment, defined in `scene_map.py`.
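The stochastic transition model can be sketched as follows. The grid bounds, obstacles, and 0.8/0.1/0.1 split come from the specification above; that blocked moves leave the agent in place is an assumption (names like `transitions` are hypothetical, not from `scene_map.py`):

```python
# Action deltas in (x, y) grid coordinates (x vertical, y horizontal).
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
PERP = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
        "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
OBSTACLES = {(2, 1), (2, 3)}

def transitions(state, action):
    # Returns a list of (probability, next_state) pairs; moves that leave
    # the 5x5 grid or hit an obstacle keep the agent in place (assumption).
    out = []
    for prob, a in [(0.8, action), (0.1, PERP[action][0]), (0.1, PERP[action][1])]:
        dx, dy = ACTIONS[a]
        nx, ny = state[0] + dx, state[1] + dy
        if not (0 <= nx <= 4 and 0 <= ny <= 4) or (nx, ny) in OBSTACLES:
            nx, ny = state
        out.append((prob, (nx, ny)))
    return out
```

For example, taking UP from the bottom-left corner slips LEFT off the grid with probability 0.1, so the agent stays put on that branch.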
Scenario: The agent knows the exact transition probabilities.
- Update rule: $V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\bigl[R(s') + \gamma V(s')\bigr]$
- Convergence threshold: $\delta < 10^{-6}$
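The update rule and threshold translate directly into a sweep loop. This is a minimal sketch over generic model containers (`P[s][a]` as a list of `(prob, s')` pairs, `R[s']` as the reward on entering `s'`, matching the formula above), not the project's actual code:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P: state -> {action: [(prob, next_state), ...]}; terminal states map to {}.
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue                      # terminal: value stays 0
            best = max(
                sum(p * (R[s2] + gamma * V[s2]) for p, s2 in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best                       # in-place (Gauss-Seidel) update
        if delta < tol:
            return V
```

The greedy policy is then read off by taking the arg-max action at each state.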
- Policy Evaluation: Solve $V^\pi(s) = \sum_a \pi(a|s)\sum_{s'} P(s'|s,a)\bigl[R(s') + \gamma V^\pi(s')\bigr]$
- Policy Improvement: $\pi'(s) = \arg\max_a Q^\pi(s,a)$
- Convergence: When the policy stops changing across a full sweep
Both algorithms produce the same optimal policy:
| X\Y | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 4 | RIGHT | RIGHT | RIGHT | RIGHT | GOAL |
| 3 | RIGHT | RIGHT | RIGHT | RIGHT | UP |
| 2 | UP | OBS | UP | OBS | UP |
| 1 | UP | RIGHT | UP | RIGHT | UP |
| 0 | UP | RIGHT | UP | RIGHT | UP |
Key insight: Policy Iteration typically converges in fewer iterations than Value Iteration; both yield the same optimal policy.
Scenario: Transition probabilities are unknown; the agent learns purely from sampled episodes.
- Exploration: $\epsilon$-greedy with $\epsilon = 0.1$
- Return: $G_t = \sum_{k=0}^{T-t} \gamma^k R_{t+k}$ (computed backwards per episode)
- Q-update: $Q(s,a) \leftarrow \text{mean of all returns collected for } (s,a)$
- Data structure: `returns[(s,a)] = [G_1, G_2, ...]` — unbounded list; all historical returns averaged
- Training: 20,000 episodes, $\epsilon = 0.1$
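The backward return computation and averaging step can be sketched per episode (hypothetical function name; a sketch of the every-visit scheme above):

```python
from collections import defaultdict

def mc_update(episode, returns, Q, gamma=0.9):
    # episode: list of (state, action, reward) triples in time order.
    # Walk backwards so each G_t = R_t + gamma * G_{t+1} is computed in O(1),
    # then set Q(s,a) to the mean of all returns collected so far.
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        returns[(s, a)].append(G)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
```

The backward pass is what makes the per-episode update linear rather than quadratic in episode length.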
- Policy match vs VI optimal: 90.9% (20 / 22 valid states)
-
Mismatched states:
$(0,2)$ learned RIGHT (optimal UP),$(1,2)$ learned RIGHT (optimal UP)
Scenario: Same as Task 2-2, but Q-values are estimated from only the most recent 1000 returns per state–action pair.
- Data structure: `collections.deque(maxlen=1000)` — the oldest return is auto-discarded when full
- Motivation: Averaging all historical returns weights early (suboptimal) policy samples equally with recent ones. The sliding window discards stale data so Q-values track the current near-optimal policy more faithfully.
- Training: 20,000 episodes, $\epsilon = 0.1$, window size = 1000
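The windowed estimator amounts to swapping the list for a bounded deque (illustrative values below; the state–action key is hypothetical):

```python
from collections import defaultdict, deque

# Only the 1000 most recent returns per (s, a) are kept; appending to a
# full deque silently drops the oldest entry.
returns = defaultdict(lambda: deque(maxlen=1000))

returns[("s", "UP")].extend(range(1500))   # 1500 returns arrive over training
window = returns[("s", "UP")]
q_value = sum(window) / len(window)        # mean over the last 1000 only
```

The first 500 returns (indices 0–499) are gone by the end, so the estimate reflects only the most recent behaviour.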
- Policy match vs VI optimal: 90.9% (20 / 22 valid states)
- Mismatched states: $(0,0)$ learned RIGHT (optimal UP); $(1,0)$ learned RIGHT (optimal UP)
Scenario: Off-policy TD learning — updates Q-values after every single step.
- Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[R(s') + \gamma \max_{a'} Q(s', a') - Q(s,a)\bigr]$
- Learning rate: $\alpha = 0.1$; Exploration: $\epsilon = 0.1$ (fixed)
- Off-policy: bootstraps greedily at the next state, decoupled from the current exploration policy
- Training: 50,000 episodes
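A single TD update under the rule above can be sketched with Q as a dict keyed by `(state, action)` (hypothetical function name and containers):

```python
def q_learning_step(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # One off-policy TD(0) update: the max over a' bootstraps greedily at s',
    # regardless of which action the epsilon-greedy policy actually takes there.
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```

Because the update runs after every step rather than at episode end, credit propagates back along the trajectory far sooner than in Monte Carlo.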
Convergence is declared when both of the following hold over a rolling 1500-episode window:
- Policy stability: every valid state's greedy action switches between at most 2 actions across the window
- Q-value stability: $|Q_{\text{end}}(s,a) - Q_{\text{start}}(s,a)| \leq 1$ for all $(s,a)$
Convergence point: Episode 5,300 (out of 50,000 trained)
- Policy match vs VI optimal: 90.9% (20 / 22 valid states)
- Mismatched states: $(0,0)$ learned RIGHT (optimal UP); $(0,2)$ learned RIGHT (optimal UP)
| X\Y | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 4 | RIGHT | RIGHT | RIGHT | RIGHT | GOAL |
| 3 | RIGHT | RIGHT | RIGHT | RIGHT | UP |
| 2 | UP | OBS | UP | OBS | UP |
| 1 | UP | RIGHT | UP | RIGHT | UP |
| 0 | RIGHT | RIGHT | RIGHT | RIGHT | UP |
To fairly compare Monte Carlo v2 and Q-Learning under identical conditions, both algorithms were trained 50 independent times with 20,000 episodes each and evaluated against the VI optimal policy (`compare_MC_QL.py`).
| Metric | Monte Carlo v2 | Q-Learning |
|---|---|---|
| Mean accuracy | 85.09% | 88.82% |
| Standard deviation | 10.12% | 6.25% |
| Variance | 102.51 | 39.02 |
| Min | 63.64% | 68.18% |
| Max | 100.00% | 100.00% |
| Median | 86.36% | 90.91% |
Key findings:
- Q-Learning is more accurate: +3.7 percentage points higher mean accuracy than MC v2 at equal episode counts.
- Q-Learning is more stable: variance 2.6× lower than MC v2, meaning its policy quality is far more consistent run-to-run.
- MC v2 has higher variance: averaging returns gathered under an evolving (often suboptimal) policy introduces noise; the sliding window mitigates this relative to plain MC, but Q-Learning's per-step TD updates still converge faster.
- Q-Learning converges earlier: it was declared convergent at episode 5,300, so the remaining 14,700 of the 20,000 episodes serve as consolidation, whereas MC v2 keeps accumulating returns throughout training.
| Algorithm | Model | Data | Update | Match (single) | Mean ± Std (50 runs) |
|---|---|---|---|---|---|
| Value Iteration | Known | DP | All states / iter | 100% (reference) | — |
| Policy Iteration | Known | DP | Policy sweeps | 100% (reference) | — |
| Monte Carlo (2-2) | Unknown | Full episodes | End of episode | 90.9% | — |
| Monte Carlo v2 (2-2-v2) | Unknown | Full episodes (window=1000) | End of episode | 90.9% | 85.1% ± 10.1% |
| Q-Learning (2-3) | Unknown | Single steps (TD) | Every step | 90.9% | 88.8% ± 6.3% |