-
3. Week_3-3_Average Reward카테고리 없음 2021. 6. 21. 05:08
3. Week_3-3_Average Reward 3. Average Reward
Average Reward $\quad : \;$ A New Way of Formulating Control Problem
Satinder Singh $\quad : \;$ on Intrinsic Rewards
Week 3 Review
$\cdot$ Average Reward $\quad : \;$ A New Way of Formulating Control Problem
Describe the average reward setting
Explain when average reward optimal policies are different from policies obtained under discounting
Understand different value functions.
In countinuing task,
we might be interestend in extremely long horizon performance.
Up until now, we've used dincounting in continuing problems to balance short-term performance and long-term gain.However, this is not the only way to formulate the probelm.
Today, we'll learn about a new way of formulating continuing problems, called the average reward formulation.A simple example
Today is all about continuing tasks.
Imagine a simple task
where the states are arranged in two intersecting rings.
Let's call this the Nearsighted(근시안적인) MDP.State
In most states,
there's only one action, so there are no decisions to be made.
There's only one state, $S$, where a decision can be made.
In this state, the agent can decide which ring to traverse.Action
This means there are two deterministic policies,
traversing the left ring or
traversing the right ring.Reward
The reward is $0$ everywhere, except for in one transition in each ring.
In the left ring, the reward is $+1$ immediately after state $S$,
In the right ring, the reward is $+2$ immediately before state $S$.Intuitively,
you would pick the right action because you know will get $+2$ reward.But
" what would the value function tell us to do ? "If we use discounting ($\gamma$),
What are the values of state $S$ under these two different policies ?The policy that chooses the left action has a vlaue of $\frac{1}{1-\gamma^5}$
The policy that chooses the right action has a value of $\frac{2\gamma^4}{1-\gamma^5}$
How do we figure this out ?
If you write out the infinite discounted return,
you will see this is a fairly straightforward geometric series with a closed form solution.
( See if you can get the same answer that we did )1 2 Let's think of the value of state $S$ under these two part policies
for particular values of $\gamma$.If $\gamma=0.5$,
$v_L(S)$ is approximately $1$ and
$v_R(S)$ is approximately $0.1$.This means
the policy that takes the left action is preferable
under this more myopic(근시적인) discount $\gamma$ of $0.5$.Let's try larger value of $\gamma=0.9$.
$v_L(S)$ is approximately $2.4$ and
$v_R(S)$ is approximately $3.2$.So now we prefer the other policy ?!
In fact,
we can figure out the minimum value of $\gamma$ so that the agent prefers the policy that goes right.$\gamma$ needs to be at least $0.841$.
( for the state $S$ to be $v_R(S) > v_L(S)$ )So the problem here is that
the discount $\gamma$ magnitude depends on the problem.For this example, $\gamma = 0.85$ is sufficiently large.
But if the rings had 100 states each,
this discount factor $\gamma$ would need to be over $0.99$.Problem with large $\gamma$ in continuing task
In general,
the only way to ensure that the agent's actions maximize reward over time
is to keep increasing the discount factor $\gamma$ towards $1$ !Depending on the problem,
we might need $\gamma$ to be quite large.
Remember,
we can't set it to $\gamma = 1$ in a continuing setting
because then the return might be infinite.Now then,
" what's wrong with having large $\gamma$ ? "
Larger values of $\gamma$ can also result in larger and more variables sums,
which might be difficult to learn !" Is there an alternative ? "
$\rightarrow \quad$ The Average Reward objectiveThe Average Reward objective
Let's discuss a new objective, called the Average Reward.
Imagine the agent has interested with the world for $h$ steps.
$r(\pi) \quad \doteq \quad \displaystyle \lim_{h \rightarrow \infty} \sum_{t=1}^h \mathbb{E}\big[ R_t | S_0,A_{0:t-1}\sim\pi \big]$
This is the reward it has received on average across those $h$ steps.
( In other words, it's rate of reward )If the agent's goal is to maximize this average reward,
then " it cares equally about nearby and distant rewards. "We denote the average reward of a policy with $r(\pi)$.
More generally,
we can write the average reward using the state visitation, $\mu_{\pi}(s)$.
$r(\pi) \quad = \quad \displaystyle \lim_{h \rightarrow \infty} \sum_s \mu_{\pi}(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)r$This inner term is the expected reward in a state under policy $\pi$.
The outer sum takes the expectation over how frequently the policy is in that state.Together,
we get the expected reward across states !
( In other words, the Average Reward for a plicy )In the nearsighted example,
the two deterministic possible policised visit either the left loop or the right loop indefinitely.
In both cases, the five states in each loop are visited equally many times.In the left loop,
the immediate expected reward is $+0$ for all states,
except one which gets $+1$.This results in an average reward of $1$ every $5$ steps.
$r(\pi_L) = \frac{1}{5} = 0.2$In the right loop,
most states have $+0$ reward to expected reward,
but the last state gets $+2$.This gives an average reward of $2$ every $5$ steps
$r(\pi_R) = \frac{2}{5} = 0.4$$\Rightarrow \quad$ the Average Reward puts preference on the policy that receives more reward in total
$\qquad$ without having to consider larger and larger discounts !Returns for Average Reward
The Average Reward definition is intuitive for saying
" If one policy is better than another ".But
" How can we decide which actions from a state are better ? "What we need are action values for this new setting.
1 2 Return $$\downarrow$$ Differential Return The first step is to figure out what the retern is.
In the Average Reward setting,
returns are defined in terms of differences between rewards and the average reward $R_{\pi}$.
$G_t \quad = \quad R_{t+1}-r(\pi) \quad + \quad R_{t+2}-r(\pi) \quad + \quad R_{t+3}-r(\pi) \quad + \quad ...$This is called the differential return.
Example $\quad : \;$ nearsigted MDP
Let's look at what the differential returns are in our nearsighted MDP.
The differential return represents
how much more reward the agent will receive from the current state in action compared to the average reward of the policy.Left case
Let's loot at the differential return
starting in state $S \;\; \rightarrow \;\;$ first choosing action $L \;\; \rightarrow \;\;$ then following $\pi_L$ afterwards.The average reward for this policy $\pi_L$ is $0.2$.
$r(\pi_L) = 0.2$The differential return is
the sum of rewards into the future with the average reward subtracted from each one.This sum starts in state $S$ with the action $L$.
We can compute it by summing to some finite horizon $H$, then taking the limit as $H$ goes to infinity $\infty$.
We can simplifying things slightly with this limit notation.While notation provided works in many cases,
we need to use a different technique when the environment is periodic.
In this case, we compute the return using a more genreal limit the Cesaro Sum.
( but this technical detail is not critical )
( The main point here is the intuition )We find that the differential return is $0.4$.
Now,
let's look at the other action.This time we can break the differential return into two parts.
First,
the sum for a single trajectory through the right loop.
We can write the sum explicitly and it is equal to $1$.Then,
the sum corresponding to taking the left action indefinitely.
This sum is the same as the differential return we just computed, $0.4$.Adding the two parts together,
we find that the differential return is $1.4$.???
So if the agent's policy is to always take the left action,
it can observe it's differential returns and realize it should switch to taking the right action (???)
???Right case
Now,
let's look at the differential returns
if the agnet's policy is to always take the right action.This policy reusults in an average reward of $0.4$
and the differential return for the policy that takes the right action in state $S$ is $-0.8$.Now,
what's the differential return
for taking the left action once in state $S$ and
then taking the right action indefinitely ?Like before,
we break up the sum into two parts.
taking the left loop once results in a sum of $-1$ over the first five time steps.
Adding the differential return from following $\pi_R$ from state $S$, which we found to be $-0.8$ results in our answer of $-1.8$.Once again,
we see that the right action is preferred.You may have noticed that the differential returns for $\pi_R$ were lower than the differential returns for $\pi_L$,
even though $\pi_R$ has a higher average reward.This is because the differential return represents
" How much better it is to take an action in a state then on average under a certain policy "The differential return can only be used to compare actions,
if the same policy is followed on subsequent time step !To compare policies,
their average reward should be used instead.Interestingly,
the differential return is only a convergent sum
if the subtracted constant is equal to the true average reward.
If a lower or higher number is subtracted, the sum will diverge to positive or negative infinity !Value Functions for Average Reward
Now that we have a valid definitoin of the return for Average Reward,
$G_t \quad = \quad R_{t+1}-r(\pi) \quad + \quad R_{t+2}-r(\pi) \quad + \quad R_{t+3}-r(\pi) \quad + \quad ...$we define value functions in the usual way, as the expected return.
Similarly we can also define differential value functions as the expected differential return
under a policy from a given state or state-action pair.
$q_\pi(s,a) \quad = \quad \mathbb{E} \big[ G_t|S_t=s,A_t=a \big]$Like in the discounted setting,
differential value functions can be written as Bellman Equations.
Conveniently, they look like the previous ones we've seen.
The only differ in that they subtract $r_\pi$ from the immediate reward $r$, and there is no discounting $\gamma$.
$q_\pi(s,a) \quad = \quad \displaystyle \sum_{s',r} p(s',r|s,a) \big( r - r(\pi) + \sum_{a'} \pi(a'|s')q_\pi(s',a') \big)$This quantity captures
how much more reward the agent will get by starting in a particular state
than it would get on average over all states if it followed a fixed policy.Algorithm $\quad : \;$ Differential SARSA
Many algorithms from the discounted case can be rewritten to apply to the Average Reward case.
For example,
differential SARSA is very similar to the SARSA algorithm you've seen before.Let's step through the differences.
A key difference is that differential SARSA has to track an estimate of the Average Reward $\bar{R}$ under it's policy
and subtract it from the sample reward in it's update.This implementation does so
with the incremental averaging techniques we've seen throughout the course.
$\bar{R} \leftarrow \bar{R}+\beta(R-\bar{R})$Given this estimate,
it then subtracts $\bar{R}$ from the sampled reward $R$ in it's update.In practice,
we can get better performance with a slight modification to this algorithm.
Instead of the exponential average of the reward to compute $\bar{R}$,
We use this update which has lower variance.
$\bar{R} \leftarrow \bar{R}+\beta\delta$Summary
Introduced the average reward objective
Defined differential returns and differential value functions for this setting
Next week,
we'll talk about another way to optimize this average reward objective.