Dyna-Q and Dyna-Q+

Comparing model-based RL agents in a changing environment.

The "Blocking Maze"

An agent must find the shortest path from Start (S) to Goal (G). After 3000 steps, the environment changes: the direct path is blocked, and a new, longer path opens up. This experiment tests how quickly different algorithms can adapt to this change.
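A minimal grid-world sketch of this setup, in Python. The 6x9 layout, start/goal coordinates, and gap positions are assumptions for illustration; only the behaviour described above (direct path blocked after 3000 steps, longer path opened) comes from the experiment description.

```python
class BlockingMaze:
    """Grid world whose wall layout changes after a fixed number of steps.

    Layout details (grid size, wall row, gap positions) are illustrative
    assumptions; the 3000-step switch mirrors the experiment above.
    """
    HEIGHT, WIDTH = 6, 9
    START, GOAL = (5, 3), (0, 8)
    SWITCH_STEP = 3000
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self):
        self.total_steps = 0
        self.pos = self.START

    def walls(self):
        # Before the switch the gap is on the right (short, direct path);
        # afterwards the right gap closes and only the left side is open.
        if self.total_steps < self.SWITCH_STEP:
            return {(3, c) for c in range(0, 8)}
        return {(3, c) for c in range(1, 9)}

    def reset(self):
        self.pos = self.START
        return self.pos

    def step(self, action):
        self.total_steps += 1
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.HEIGHT - 1)
        c = min(max(self.pos[1] + dc, 0), self.WIDTH - 1)
        if (r, c) not in self.walls():
            self.pos = (r, c)
        if self.pos == self.GOAL:
            return self.pos, 1.0, True   # reward only at the goal
        return self.pos, 0.0, False
```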

Algorithms

Dyna-Q (Standard): A model-based agent that learns a model of the world from experience and uses it for "planning" (simulated experience) to speed up learning.

Dyna-Q+ (Exploratory): An extension of Dyna-Q that adds an "exploration bonus" to state-action pairs that haven't been tried in a long time. This encourages the agent to keep re-exploring the world and discover changes, such as the new path opening up.
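A sketch of how both agents could be implemented as one tabular class: kappa = 0 gives standard Dyna-Q, while kappa > 0 adds a Dyna-Q+ style bonus of kappa * sqrt(tau) to simulated rewards, where tau is the time since the state-action pair was last tried. The hyperparameter defaults, and the simplification of applying the bonus only to pairs already in the model, are assumptions rather than details of the experiment above.

```python
import math
import random
from collections import defaultdict

class DynaQAgent:
    """Tabular Dyna-Q; set kappa > 0 to obtain a Dyna-Q+ style agent."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1,
                 planning_steps=50, kappa=0.0):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps, self.kappa = planning_steps, kappa
        self.q = defaultdict(float)         # (state, action) -> value
        self.model = {}                     # (state, action) -> (reward, next_state)
        self.last_visit = defaultdict(int)  # (state, action) -> step last taken
        self.t = 0

    def act(self, state):
        # Epsilon-greedy action selection over the learned Q-values.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(state, a)])

    def _q_update(self, s, a, r, s2):
        best_next = max(self.q[(s2, b)] for b in range(self.n_actions))
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def learn(self, s, a, r, s2):
        self.t += 1
        self.last_visit[(s, a)] = self.t
        self._q_update(s, a, r, s2)          # direct RL from real experience
        self.model[(s, a)] = (r, s2)         # update the learned model
        # Planning: replay simulated transitions drawn from the model.
        for _ in range(self.planning_steps):
            (ps, pa), (pr, ps2) = random.choice(list(self.model.items()))
            # Dyna-Q+ bonus grows with time since (ps, pa) was last tried.
            tau = self.t - self.last_visit[(ps, pa)]
            self._q_update(ps, pa, pr + self.kappa * math.sqrt(tau), ps2)
```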

Maze Environment

Shows the maze after the path has changed, with a sample optimal path found by Dyna-Q+.

Cumulative Rewards

The key result. A steeper slope means the agent is finding the goal more frequently. Note how Dyna-Q+ adapts after the 3000-step mark.
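For reference, a hypothetical driver that records the cumulative reward curve described above, using the sketches from earlier sections. The step budget and kappa value are illustrative, not the settings used in the experiment.

```python
def run(env, agent, max_steps=6000):
    """Run one agent and record cumulative reward at every step."""
    cumulative, total = [], 0.0
    state = env.reset()
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total += reward
        cumulative.append(total)
        state = env.reset() if done else next_state
    return cumulative

# Compare the two agents: kappa=0 is Dyna-Q, kappa>0 is Dyna-Q+ (illustrative value).
# curves = {name: run(BlockingMaze(), DynaQAgent(4, kappa=k))
#           for name, k in [("Dyna-Q", 0.0), ("Dyna-Q+", 1e-3)]}
```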