Learning outcomes
The learning outcomes of this chapter are:
- Manually apply n-step reinforcement learning to solve small-scale MDP problems.
- Design and implement n-step reinforcement learning to solve medium-scale MDP problems automatically.
- Argue the strengths and weaknesses of n-step reinforcement learning.
Overview
In the previous sections of this chapter, we looked at two fundamental temporal difference (TD) methods for reinforcement learning: Q-learning and SARSA.
These two methods have some weaknesses in their basic format:
- Unlike Monte-Carlo methods, which reach a reward and then backpropagate this reward, TD methods use bootstrapping (they estimate the future discounted reward using $Q(s,a)$), which means that for problems with sparse rewards, it can take a long time for rewards to propagate throughout a Q-function.
- Rewards can be sparse, meaning that there are few state-action pairs that lead to non-zero rewards. This is problematic because, initially, reinforcement learning algorithms behave entirely randomly and will struggle to find good rewards.
- Both methods estimate a Q-function $Q(s,a)$, and the simplest way to model this is via a Q-table. However, this requires us to maintain a table of size $|S| \times |A|$, which is prohibitively large for any non-trivial problem.
- Using a Q-table requires that we visit every reachable state many times and apply every action many times to get a good estimate of $Q(s,a)$. Thus, if we never visit a state $s$, we have no estimate of $Q(s,a)$, even if we have visited states that are very similar to $s$.
To get around limitations 1 and 2, we are going to look at n-step temporal difference learning: ‘Monte Carlo’ techniques execute entire episodes and then backpropagate the reward, while basic TD methods only look at the reward in the next step, estimating the future rewards. n-step methods instead look $n$ steps ahead for the reward before updating, and then estimate the remaining future reward.
n-step TD learning comes from the idea that Monte-Carlo methods can provide more information for Q-function updates than just temporal difference estimates. Monte-Carlo methods use ‘deep backups’, where entire episodes are executed and the reward is backpropagated. Methods such as Q-learning and SARSA use ‘shallow backups’, using only the reward from one step ahead. n-step learning finds a middle ground: only update the Q-function after having looked $n$ steps ahead.
n-step TD learning
We will look at n-step reinforcement learning, in which rewards are propagated back over the last $n$ steps rather than just one.
Both Q-learning and SARSA have an n-step version. We will look at n-step learning more generally, and then show an algorithm for n-step SARSA. The version for Q-learning is similar.
Intuition
The details and algorithm for n-step reinforcement learning make it seem more complicated than it really is. At an intuitive level, it is quite straightforward: at each step, instead of updating our Q-function or policy based on the reward received from the previous action plus the estimated discounted future rewards, we update it based on the last $n$ rewards plus the estimated discounted future reward $n$ steps ahead.
Consider an episode of five actions, and picture a bracket of size $n$ sliding along the sequence of steps. At each time step, the newly executed state, action, and reward are added to the front of the bracket. Once the bracket contains $n$ rewards, the state-action pair at the back of the bracket is updated using those $n$ rewards plus the estimated future value at the front of the bracket, and the bracket then slides forward one step. When the episode terminates, the bracket shrinks from the back: each remaining state-action pair is updated using whatever rewards are left in front of it.
So, we can see that, intuitively, each state-action pair is updated using the $n$ rewards (and the value estimate) that follow it, rather than just the single next reward.
Discounted Future Rewards (again)
When calculating a discounted reward over an episode, we simply sum up the discounted rewards over the episode:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \ldots$$

We can re-write this as:

$$G_t = r_{t+1} + \gamma \left( r_{t+2} + \gamma \left( r_{t+3} + \gamma \left( r_{t+4} + \ldots \right) \right) \right) = r_{t+1} + \gamma G_{t+1}$$

If $G_t$ is the discounted future reward from time step $t$, then it is just the next reward $r_{t+1}$ plus the discounted value of $G_{t+1}$, the discounted future reward from the next step.

In TD(0) methods such as Q-learning and SARSA, we do not know $G_{t+1}$ at the time of the update, so we estimate it using bootstrapping:

$$G_t \approx r_{t+1} + \gamma\, V(s_{t+1})$$

That is, the reward of the entire future from step $t$ is estimated as the reward at step $t+1$ plus the estimated (discounted) value of the state reached at $t+1$. In Q-learning, $V(s_{t+1})$ is estimated using $\max_{a'} Q(s_{t+1}, a')$; in SARSA, it is estimated using $Q(s_{t+1}, a_{t+1})$.
This is a one-step return.
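To make the recursion concrete, here is a small sketch (not from the original notes; the reward sequence and discount factor are made up) showing that summing the discounted rewards directly and unrolling $G_t = r_{t+1} + \gamma G_{t+1}$ give the same value:

# Hypothetical rewards r_1, ..., r_5 and discount factor
rewards = [0, 0, 1, 0, 2]
gamma = 0.9

# Direct sum: r_1 + gamma * r_2 + gamma^2 * r_3 + ...
direct = sum(gamma ** i * r for i, r in enumerate(rewards))

# Recursive form, unrolled backwards from the end of the episode
recursive = 0.0
for r in reversed(rewards):
    recursive = r + gamma * recursive

print(direct, recursive)  # both print approximately 2.1222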
Truncated Discounted Rewards
However, we can estimate a two-step return:

$$G^{(2)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})$$

a three-step return:

$$G^{(3)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V(s_{t+3})$$

or, in general, an n-step return:

$$G^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})$$

In the above expression, $G^{(n)}_t$ is the discounted future reward from time $t$, truncated after $n$ steps and completed with the estimate $V(s_{t+n})$.
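As a rough illustration of the formula above (the helper function and values here are our own, not from the notes), the truncated n-step return is just the discounted sum of the next $n$ rewards plus the discounted bootstrapped estimate of the state reached after them:

def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of the given rewards plus the discounted estimate
    of the value of the state reached after them."""
    g = sum(gamma ** i * r for i, r in enumerate(rewards))
    return g + gamma ** len(rewards) * bootstrap_value

gamma = 0.9
# Two-step return: rewards r_{t+1}, r_{t+2}, then bootstrap with V(s_{t+2}) = 5
print(n_step_return([0, 1], gamma, bootstrap_value=5))  # 0.9 + 0.81 * 5 = 4.95
# Five-step return with a single reward on the last step and V(s_{t+5}) = 0
print(n_step_return([0, 0, 0, 0, 1], gamma))            # 0.9^4 = 0.6561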
The basic idea of n-step reinforcement learning is that we do not update the Q-value immediately after executing an action: we wait $n$ steps and update the Q-value using the n-step return instead of the one-step return.
If the episode terminates less than $n$ steps after the action (that is, if $t + n \geq T$, where $T$ is the final time step of the episode), then we simply use the full discounted reward up to the end of the episode.
In Monte-Carlo methods, we go all the way to the end of an episode. Monte-Carlo Tree Search is one such Monte-Carlo method, but there are others that we do not cover.
Updating the Q-function
The update rule is then different. First, we need to calculate the truncated discounted reward for the $n$ steps after time $\tau$, where $\tau$ is the time step of the state-action pair being updated (that is, the action taken $n$ steps earlier):

$$G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\, T)} \gamma^{i-\tau-1} r_i$$

This just sums the discounted rewards from time step $\tau + 1$ until either $n$ steps have passed ($\tau + n$) or the episode has terminated ($T$), whichever comes first.

Then calculate the n-step expected reward:

$$\text{if } \tau + n < T \text{ then } G \leftarrow G + \gamma^{n} Q(s_{\tau+n}, a_{\tau+n})$$

This adds the estimated future reward if we are not at the end of the episode (that is, if $\tau + n < T$).

Finally, we update the Q-value:

$$Q(s_{\tau}, a_{\tau}) \leftarrow Q(s_{\tau}, a_{\tau}) + \alpha \left[ G - Q(s_{\tau}, a_{\tau}) \right]$$

In the update rule above, we are using a SARSA update, but a Q-learning update is similar: replace $Q(s_{\tau+n}, a_{\tau+n})$ with $\max_{a'} Q(s_{\tau+n}, a')$.
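Putting the three steps together, here is a minimal tabular sketch of the update for the state-action pair at time $\tau$ (this is our own illustration; the variable names and the dictionary-based Q-table are assumptions, not from the notes):

from collections import defaultdict

Q = defaultdict(float)      # tabular Q-function: Q[(state, action)]
alpha, gamma, n = 0.1, 0.9, 5

def n_step_sarsa_update(tau, states, actions, rewards, T):
    """Update Q(s_tau, a_tau) using rewards r_{tau+1}..r_{min(tau+n, T)} and,
    if the episode has not terminated, the bootstrap Q(s_{tau+n}, a_{tau+n}).
    Here rewards[i] holds r_{i+1}, while states[i]/actions[i] hold s_i/a_i."""
    g = sum(
        gamma ** (i - tau - 1) * rewards[i - 1]
        for i in range(tau + 1, min(tau + n, T) + 1)
    )
    if tau + n < T:
        g += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    Q[(states[tau], actions[tau])] += alpha * (g - Q[(states[tau], actions[tau])])

This sketch assumes the whole episode is available as lists; the implementation below instead keeps a sliding window of only the last $n$ states, actions, and rewards, which computes the same quantity incrementally.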
n-step SARSA
While conceptually this is not so difficult, an algorithm for doing n-step learning needs to store the rewards and observed states for the last $n$ steps, and needs some extra book-keeping to finish the remaining updates once an episode terminates.
Algorithm 6 (n-step SARSA)
This is similar to standard SARSA, except that we are storing the last $n$ states, actions, and rewards, and each update is applied to the state-action pair from $n$ steps earlier rather than to the most recent one.
As with SARSA and Q-learning, we iterate over each step in the episode. The first branch simply executes the selected action, selects a new action to apply, and stores the state, action, and reward.
It is the second branch where the actual learning happens. Instead of just updating with the 1-step reward, we update the oldest stored state-action pair using the discounted sum of the last $n$ rewards plus, if the episode has not yet terminated, the discounted estimate of the value $n$ steps ahead.
The final part of this branch removes the first element from the list of states, actions, and rewards, and moves on to the next state.
What is the effect of all of this computation? This algorithm differs from standard SARSA as follows: it only updates a state-action pair after it has seen the $n$ rewards that follow it (or the episode has terminated), so each update is informed by more of the actual reward received.
Computationally, this is not much worse than 1-step learning. We need to store the last $n$ states, actions, and rewards, but each iteration still performs only a single Q-function update, so the only overhead is the extra book-keeping.
Example – $n$-step SARSA update
Consider our simple 2D navigation task, in which we do not know the transition probabilities nor the rewards. Initially, the reinforcement learning algorithm will be required to search randomly until it finds a reward. Propagating this reward back $n$ steps will be helpful.
Imagine that the agent's first (very lucky!) episode consists of the following sequence of steps:

Assume that $n = 5$ and that all Q-values are initialised to 0.
For the first four steps, no Q-values are updated, because we have not yet collected $n$ rewards.
On step 5, we reach the end of our n-step window, and we start to update values because we now have $n$ stored rewards.
We can see that there are no rewards in the first five steps, so the Q-values of the first two state-action pairs in the sequence remain 0: their windows contain only zero rewards and bootstrapped estimates that are themselves still 0.
When we reach step 7, however, we receive a reward. We can then update the Q-value of the state-action pair from $n$ steps earlier, because the reward now falls inside its window.
From this point, the episode has terminated, so each of the remaining stored state-action pairs is updated in turn using the (discounted) rewards that remain in the window, and the reward is propagated back to the last $n$ state-action pairs of the episode.
At this point, there are no further states left to update, so the inner loop terminates, and we start a new episode.
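To make the numbers concrete, here is a hedged illustration with assumed values (the episode, discount factor, and learning rate used in the original example are not reproduced here): a 7-step episode with a single reward of 1 on the final step, $n = 5$, $\gamma = 0.9$, $\alpha = 0.1$, and all Q-values initialised to 0.

gamma, alpha, n = 0.9, 0.1, 5
rewards = [0, 0, 0, 0, 0, 0, 1]   # r_1, ..., r_7; the episode terminates at T = 7
T = len(rewards)
q = [0.0] * T                     # Q-value of the state-action pair taken at each time tau

for tau in range(T):
    # Discounted sum of the rewards that fall inside this pair's n-step window
    g = sum(gamma ** (i - tau - 1) * rewards[i - 1]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # The bootstrap term gamma^n * Q(s_{tau+n}, a_{tau+n}) is omitted here because
    # every Q-value is still 0 during this first episode
    q[tau] += alpha * (g - q[tau])

print([round(v, 4) for v in q])   # [0.0, 0.0, 0.0656, 0.0729, 0.081, 0.09, 0.1]

The first two state-action pairs stay at 0, and the reward is propagated back over the last five state-action pairs of the episode, matching the walkthrough above.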
Implementation
Below is a Python implementation of n-step temporal difference learning. We first implement an abstract superclass NStepReinforcementLearner, which contains most of the code we need, except the part that determines the value of the next state-action pair, which is what differs between n-step Q-learning and n-step SARSA.
class NStepReinforcementLearner:
    def __init__(self, mdp, bandit, qfunction, n):
        self.mdp = mdp
        self.bandit = bandit
        self.qfunction = qfunction
        self.n = n

    def execute(self, episodes=100):
        for _ in range(episodes):
            state = self.mdp.get_initial_state()
            actions = self.mdp.get_actions(state)
            action = self.bandit.select(state, actions, self.qfunction)

            # The sliding window of the last n states, actions, and rewards
            rewards = []
            states = [state]
            actions = [action]

            while len(states) > 0:
                # First branch: execute the selected action, select the next
                # action, and store the resulting state, action, and reward
                if not self.mdp.is_terminal(state):
                    (next_state, reward, done) = self.mdp.execute(state, action)
                    rewards += [reward]

                    next_actions = self.mdp.get_actions(next_state)
                    if not self.mdp.is_terminal(next_state):
                        next_action = self.bandit.select(
                            next_state, next_actions, self.qfunction
                        )
                        states += [next_state]
                        actions += [next_action]

                # Second branch: once we have n rewards (or the episode has
                # terminated), update the oldest stored state-action pair
                if len(rewards) == self.n or self.mdp.is_terminal(state):
                    # Discounted sum of the stored rewards
                    n_step_rewards = sum(
                        [
                            self.mdp.discount_factor ** i * rewards[i]
                            for i in range(len(rewards))
                        ]
                    )

                    # If the episode has not terminated, bootstrap with the
                    # estimated value of the state n steps ahead
                    if not self.mdp.is_terminal(state):
                        next_state_value = self.state_value(next_state, next_action)
                        n_step_rewards = (
                            n_step_rewards
                            + self.mdp.discount_factor ** self.n * next_state_value
                        )

                    q_value = self.qfunction.get_q_value(states[0], actions[0])
                    self.qfunction.update(
                        states[0],
                        actions[0],
                        n_step_rewards - q_value,
                    )

                    # Slide the window forward by dropping the oldest entries
                    rewards = rewards[1 : self.n + 1]
                    states = states[1 : self.n + 1]
                    actions = actions[1 : self.n + 1]

                state = next_state
                action = next_action

    def state_value(self, state, action):
        """ Get the value of a state (implemented by subclasses) """
        raise NotImplementedError()
We inherit from this class to implement the n-step Q-learning algorithm:
from n_step_reinforcement_learner import NStepReinforcementLearner


class NStepQLearning(NStepReinforcementLearner):
    def state_value(self, state, action):
        # Q-learning bootstraps with the maximum Q-value over all actions in the state
        max_q_value = self.qfunction.get_max_q(state, self.mdp.get_actions(state))
        return max_q_value
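As noted earlier, the SARSA version is similar. A minimal sketch (this subclass is not part of the original implementation, but assumes the same Q-function interface used above) bootstraps with the Q-value of the action that was actually selected for the state $n$ steps ahead:

from n_step_reinforcement_learner import NStepReinforcementLearner


class NStepSARSA(NStepReinforcementLearner):
    def state_value(self, state, action):
        # SARSA bootstraps with the Q-value of the action actually chosen
        # by the bandit for the state n steps ahead
        return self.qfunction.get_q_value(state, action)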
Example – 1-step Q-learning vs 5-step Q-learning
Comparing 1-step vs. 5-step Q-learning over the first 20 episodes of learning in the GridWorld task: with 1-step Q-learning, reaching the reward in the first episode only informs the state from which it was reached, whereas with 5-step Q-learning it informs the previous five states. Then, in the second episode, any action that reaches a previously visited state can use the TD estimate for that state; there are five such states with 5-step Q-learning, and just one with 1-step Q-learning. On all subsequent episodes, there is a greater chance of encountering a state with a TD estimate, and those estimates are better informed. The end result is that the estimates ‘spread’ throughout the Q-table more quickly.

Values of n
Can we just increase $n$ to cover the entire episode? If $n$ is at least the length of the episode, we are effectively back to a Monte-Carlo approach: rewards are propagated all the way back, but no updates happen until the episode terminates, and each update is based on a longer, noisier sample of behaviour.
What is the best value for $n$? Unfortunately, there is no value that is best for every domain: it depends on the particular problem and its reward structure, so in practice $n$ has to be chosen through experimentation and intuition.
Takeaways
- n-step reinforcement learning propagates rewards back $n$ steps to help with learning.
- It is conceptually quite simple, but the implementation requires a lot of ‘book-keeping’.
- Choosing a value of $n$ for a domain requires experimentation and intuition.