3. Week_4-3_Actor-Critic for continuing tasks
Uncategorized 2021. 6. 21. 05:12
3. Actor-Critic for Continuing Tasks
Estimating the Policy Gradient
Actor-Critic Algorithm
$\cdot$ Estimating the Policy Gradient
- Derive a sample-based estimate for the gradient of the average reward objective
We have an objective for policy optimization.
We also have the policy gradient theorem,
which gives us a simple expression for the gradient of that objective.
In this video,
we'll complete the puzzle by showing "how to estimate this gradient"
using the experience of an agent interacting with the environment.
Getting Stochastic Samples of the Gradient
We want to derive a stochastic gradient ascent algorithm for our policy.
We have our objective and its gradient due to the Policy Gradient Theorem.
Now, we need to figure out "how to approximate this gradient".
In fact, what we will do is get a stochastic sample of the gradient.
Recall
this expression for the gradient of the average reward.
$\nabla r(\pi) \quad = \quad \displaystyle \sum_s \mu(s) \sum_a \nabla \pi(a|s,\theta) q_{\pi}(s,a)$
Computing the sum over states is really impractical.
But we can do the same thing we did
when deriving our stochastic gradient descent rule for policy evaluation.
We simply make updates from states we observe while following policy $\pi$.
$S_0, A_0, R_1, S_1, A_1, \;\;\; \cdots \;\;\; , S_t, A_t, R_{t+1}, \;\;\; \cdots$
The gradient computed from state $S_t$ provides an approximation
to the gradient of the average reward $\nabla r(\pi)$.
$\nabla r(\pi) \quad \approx \quad \displaystyle \sum_a \nabla\pi(a|\color{brown}{S_t},\theta_t) q_{\pi}(\color{brown}{S_t},a)$
As we discussed before for stochastic gradient descent,
we can adjust the weights with this approximation
and still guarantee that we reach a stationary point.
This is what the stochastic gradient ascent update looks like
for the policy parameters.
$\theta_{t+1} \quad \doteq \quad \theta_t + \alpha \displaystyle \sum_a \nabla \pi(a|S_t,\theta_t) q_{\pi}(S_t,a)$
We could stop here,
but let's simplify this further.
Unbiasedness of the Stochastic Samples
Let's re-examine this from a perspective based on expectations.
This will help us simplify the update and give you more insight into why the update makes sense.
Notice that
the sum over states weighted by $\mu$
can be re-written as an expectation under $\mu$
(using a sampled state $S$ instead of every state $s$).
$\nabla r(\pi) = \displaystyle \sum_s \mu(s) \sum_a \nabla \pi(a|s,\theta) q_{\pi}(s,a)$
$= \mathbb{E}_{\mu} \big[ \displaystyle \sum_a \nabla \pi(a|S,\theta) q_{\pi}(S,a) \big]$
Recall that
$\mu$ is the stationary distribution for $\pi$, which reflects state visitation under $\pi$.
Q. What is a stationary distribution?
A. https://en.wikipedia.org/wiki/Stationary_distribution
In fact,
the states we observe while following $\pi$ are distributed according to $\mu$.
$S_t \sim \mu$
By computing the gradient from a state $S_t$,
we get an unbiased estimate of this expectation.
Thinking about our stochastic gradient as an unbiased estimate
suggests one other simplification!
Notice that
inside the expectation we have a sum over all actions $\displaystyle \sum_a$.
We want to make this term even simpler, and get rid of the sum over all actions!
If this was an expectation over actions,
we could get a stochastic sample of this too, and avoid summing over all actions!
Getting Stochastic Samples with One Action
Here we're going to see
how we can get an unbiased gradient estimate using only one action,
which is the action taken by the agent.
It would be nice if the sum over actions was weighted by $\pi$ and so was an expectation under $\pi$.
That way we could sample it
using the agent's action selection, which is distributed according to $\pi$.
It turns out this is an easy problem to solve!
To get a weighted sum corresponding to an expectation
we can just multiply and divide by $\pi(a|S,\theta)$.
$\displaystyle \sum_a \nabla \pi(a|S,\theta) q_{\pi}(S,a) = \sum_a \pi(a|S,\theta) \frac{\nabla \pi(a|S,\theta)}{\pi(a|S,\theta)} q_{\pi}(S,a)$
Now we have an expectation over actions drawn from $\pi$ for this term!
$= \mathbb{E}_{\pi} \big[ \frac{\nabla \pi(A|S,\theta)}{\pi(A|S,\theta)} q_{\pi}(S,A) \big]$
Stochastic Gradient Ascent for Policy Parameters
The new stochastic gradient ascent update now looks like this.
$\theta_{t+1} \quad \doteq \quad \theta_t + \alpha \displaystyle \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)} q_{\pi}(S_t,A_t)$
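The claim that the one-action estimate is unbiased can be checked numerically. The sketch below is only an illustration under assumptions that are not in the lecture: a single state, a softmax policy over raw preferences, and made-up values for `theta` and `q`. It compares the full sum over actions with the exact expectation of the single-action estimate.

```python
import math

def softmax(prefs):
    """Return softmax probabilities for a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_pi(theta, a):
    """Gradient of pi(a) w.r.t. theta for a softmax over raw preferences."""
    pi = softmax(theta)
    # d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k))
    return [pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for k in range(len(theta))]

def all_actions_gradient(theta, q):
    """sum_a grad pi(a) * q(a): the sum over all actions."""
    n = len(theta)
    total = [0.0] * n
    for a in range(n):
        g = grad_pi(theta, a)
        for k in range(n):
            total[k] += g[k] * q[a]
    return total

def expected_single_action_estimate(theta, q):
    """E_{A ~ pi}[ grad pi(A) / pi(A) * q(A) ], computed exactly."""
    pi = softmax(theta)
    n = len(theta)
    total = [0.0] * n
    for a in range(n):
        g = grad_pi(theta, a)
        for k in range(n):
            # weight each one-action estimate by the probability of sampling it
            total[k] += pi[a] * (g[k] / pi[a]) * q[a]
    return total

theta = [0.2, -0.5, 1.0]   # hypothetical policy parameters
q = [1.0, 3.0, -2.0]       # hypothetical action values for one state
full = all_actions_gradient(theta, q)
sampled_mean = expected_single_action_estimate(theta, q)
print(full)
print(sampled_mean)
```

The two printed vectors agree up to floating-point error, which is what "unbiased" means here: a single sampled action gives a noisy estimate whose average is the full sum.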
As an aside,
it is common to rewrite this gradient as the gradient of the natural logarithm of $\pi$!
This is based on a formula from calculus for the gradient of a logarithm.
$\nabla \ln \big(f(x)\big) = \displaystyle \frac{\nabla f(x)}{f(x)}$
Using this rule,
we get that the gradient of $\ln \pi$ equals the gradient of $\pi$ over $\pi$.
$\nabla \ln \pi(a|s,\theta) = \displaystyle \frac{\nabla \pi(a|s,\theta)}{\pi(a|s,\theta)}$
$\theta_{t+1} \quad \doteq \quad \theta_t + \alpha \nabla \ln \pi(A_t|S_t,\theta_t) q_{\pi}(S_t,A_t)$
So this update is equivalent to what we started with.
Why do we do this ?
One reason is that
it is actually simpler to compute the gradient of the logarithm of certain distributions!
The other, less important reason is that
it lets us write this gradient more compactly.
In the end,
it is just a mathematical trick,
so don't let it distract you from the underlying algorithm.
We now have something that looks like many of the learning rules used in this course.
We adjust the parameter $\theta$ proportionally to a stochastic gradient of the objective.
We use a step size parameter $\alpha$ to control the magnitude of the step in that direction.
So $\alpha$ has the same role it always has.
We now have a nice clean update rule to learn the policy parameters!
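As an illustration of why the log form is convenient, consider a softmax policy with linear action preferences $h(s,a,\theta) = \theta^\top x(s,a)$. This parameterization and the feature vectors below are assumptions made for the example, not something fixed by the lecture. In that case $\nabla \ln \pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta) x(s,b)$, and the sketch checks this closed form against finite differences.

```python
import math

def softmax_policy(theta, features):
    """pi(.|s) for linear preferences h(s, a) = theta . x(s, a)."""
    prefs = [sum(t * x for t, x in zip(theta, xa)) for xa in features]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, features, a):
    """Closed form: x(s, a) - sum_b pi(b|s) x(s, b)."""
    pi = softmax_policy(theta, features)
    d = len(theta)
    expected_x = [sum(pi[b] * features[b][k] for b in range(len(features)))
                  for k in range(d)]
    return [features[a][k] - expected_x[k] for k in range(d)]

def numeric_grad_log_pi(theta, features, a, eps=1e-6):
    """Central finite differences of ln pi(a|s) w.r.t. theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        diff = (math.log(softmax_policy(up, features)[a])
                - math.log(softmax_policy(down, features)[a]))
        grad.append(diff / (2 * eps))
    return grad

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical x(s, a) for 3 actions
theta = [0.3, -0.7]                              # hypothetical policy parameters
analytic = grad_log_pi(theta, features, a=2)
numeric = numeric_grad_log_pi(theta, features, a=2)
print(analytic)
print(numeric)
```

The closed form needs only the feature vectors and the action probabilities, with no division by $\pi$, which is one way to see why the logarithm is the more practical quantity to differentiate.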
Computing the Update
The last thing to talk about is
how to actually compute the stochastic gradient for a given state and action.
We just need two components,
- the gradient of the policy
$\nabla \ln \pi(A_t|S_t,\theta_t)$
- an estimate of the differential values
$q_{\pi}(S_t,A_t)$
The first is easy,
we know the policy and its parameterization,
and so can compute its gradient.
The second, the action value, can be approximated in a variety of ways.
For example,
we could use a TD algorithm that learns differential action-values.
In an upcoming video,
we will go through one particular choice in detail,
as well as how to compute the gradient for a specific policy parameterization.
Summary
- We derived a policy gradient learning rule for the average reward setting
In the next video,
we will see how to use this rule.
$\cdot$ Actor-Critic Algorithm
- Describe the actor-critic algorithm
for control with function approximation
for continuing tasks
Do we have to choose between directly learning the policy parameters and learning a value function ?
No !
Even within policy gradient methods, value-learning methods like TD still have an important role to play.
In this setup,
the parameterized policy plays the role of an actor,
while the value function plays the role of a critic,
evaluating the actions selected by the actor.
These so-called actor-critic methods
were some of the earliest TD-based methods introduced in Reinforcement Learning.
Approximating the Action-Value in the Policy Update
We finished off the last video with this expression for the policy gradient learning rule.
$\theta_{t+1} \quad = \quad \theta_t + \alpha \nabla \ln \pi (A_t | S_t,\theta_t) q_{\pi}(S_t,A_t)$
But we don't have access to $q_{\pi}$,
so we'll have to approximate it!
We can do the usual TD thing,
the one-step bootstrap return.
That is, the differential reward plus the value of the next state.
$R_{t+1} - \bar{R} + \hat{v}(S_{t+1},W)$
Critic part and Actor part of the actor-critic algorithm
Critic part of the actor-critic algorithm
As usual,
the parameterized function $\hat{v}(s,W)$ is a learned estimate of the value function.
In this case,
$\hat{v}(s,W)$ is the differential value function.
This is the critic part of the actor-critic algorithm.
The critic provides immediate feedback.
To train the critic,
we can use any state-value learning algorithm.
We will use the average reward version of semi-gradient TD(0).
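A minimal sketch of one critic step, assuming a linear value function $\hat{v}(s,W) = W^\top x(s)$. The linear form, the function name `critic_step`, and the feature vectors are assumptions chosen for illustration. It forms the differential TD error from the one-step bootstrap return, then updates the average reward estimate and the weights.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def critic_step(w, avg_reward, x_s, x_next, reward, alpha_w, alpha_r):
    """One average-reward semi-gradient TD(0) step for a linear v_hat.

    w          : weights of v_hat(s, W) = W . x(s)
    avg_reward : current estimate of the average reward (R-bar)
    x_s, x_next: feature vectors of the current and next state
    """
    # differential TD error: R - R_bar + v_hat(S') - v_hat(S)
    delta = reward - avg_reward + dot(w, x_next) - dot(w, x_s)
    # running estimate of the average reward
    new_avg_reward = avg_reward + alpha_r * delta
    # semi-gradient update: the gradient of a linear v_hat(S, W) is x(S)
    new_w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x_s)]
    return new_w, new_avg_reward, delta

# Hypothetical transition: one-hot features for two states, reward 1.
w, avg_reward, delta = critic_step(
    w=[0.0, 0.0], avg_reward=0.0,
    x_s=[1.0, 0.0], x_next=[0.0, 1.0],
    reward=1.0, alpha_w=0.5, alpha_r=0.1,
)
print(w, avg_reward, delta)
```

It is called "semi-gradient" because the target $R - \bar{R} + \hat{v}(S',W)$ is treated as a constant: only the gradient of $\hat{v}(S,W)$ appears in the weight update.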
Actor part of the actor-critic algorithm
The parameterized policy is the actor !
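It uses the policy gradient update shown here.
$\theta_{t+1} \quad = \quad \theta_t + \alpha \nabla \ln \pi (A_t | S_t,\theta_t) \big[ R_{t+1} - \bar{R} + \hat{v}(S_{t+1},W) \big]$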
Subtracting the Current State's Value Estimate
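Policy gradient update without baseline:
$\theta_{t+1} \quad = \quad \theta_t + \alpha \nabla \ln \pi (A_t | S_t,\theta_t) \big[ R_{t+1} - \bar{R} + \hat{v}(S_{t+1},W) \big]$
Policy gradient update with baseline:
$\theta_{t+1} \quad = \quad \theta_t + \alpha \nabla \ln \pi (A_t | S_t,\theta_t) \big[ R_{t+1} - \bar{R} + \hat{v}(S_{t+1},W) - \hat{v}(S_t,W) \big]$
We could use the first form of the update,
but there is one last thing we can do to improve the algorithm.
We can subtract off what is called a baseline!
$\hat{v}(S_t,W)$ is the baseline in this case.
Instead of using the one-step value estimate alone,
we can subtract the value estimate for the state $S_t$ to get the second form of the update.
Notice that
the bracketed expression in the second form is equal to the TD error $\delta$!
The expected value of this update is the same as the previous one.
Why is this?
Adding a Baseline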
Let's take the expectation of the update conditioned on a particular state $S_t$ at time $t$.
Taking the expectation of the sum is the same as the sum of the expectations.
We can use this
to separate out the expectation of our original term
from the expectation which involves the subtracted value function.
It turns out the expectation of the second term is $0$.
So we can add this baseline to the update without changing the expectation of the update.
You can verify this for yourself.
To start, write the expectation as a sum over actions,
and pull the $\hat{v}()$ term out of the sum.
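Before working through the algebra, the claim can be sanity-checked numerically. The sketch below makes simplifying assumptions that are not in the lecture: one state, a uniform softmax policy over three actions, and a fixed hypothetical one-step return estimate per action. It computes exact expectations over actions for the update with and without the baseline, and also the variance of the two single-sample estimates.

```python
import math

def softmax(prefs):
    """Return softmax probabilities for a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(pi, a):
    """grad ln pi(a) for a softmax over raw preferences: one_hot(a) - pi."""
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(pi))]

def mean_and_variance(pi, targets):
    """Exact mean and total variance of grad ln pi(A) * targets[A], A ~ pi."""
    n = len(pi)
    samples = [[g * targets[a] for g in grad_log_pi(pi, a)] for a in range(n)]
    mean = [sum(pi[a] * samples[a][k] for a in range(n)) for k in range(n)]
    var = sum(pi[a] * sum((samples[a][k] - mean[k]) ** 2 for k in range(n))
              for a in range(n))
    return mean, var

pi = softmax([0.0, 0.0, 0.0])                 # uniform policy (assumption)
returns = [10.0, 11.0, 12.0]                  # hypothetical per-action return estimates
baseline = sum(p * g for p, g in zip(pi, returns))  # plays the role of v_hat(S_t, W)

mean_plain, var_plain = mean_and_variance(pi, returns)
mean_base, var_base = mean_and_variance(pi, [g - baseline for g in returns])
baseline_term, _ = mean_and_variance(pi, [baseline] * len(pi))

print(mean_plain, mean_base)   # same expected update
print(baseline_term)           # expectation of the baseline term
print(var_plain, var_base)     # variance without and with the baseline
```

With these made-up numbers the expected updates coincide, the baseline term averages to zero, and the total variance drops from about 80.9 to about 0.22.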
(We leave this as an exercise.)
So why do we add this baseline if the update is the same in expectation?
Subtracting this baseline tends to reduce the variance of the update,
which results in faster learning!
How the Actor and the Critic Interact
This update makes sense intuitively.
After we execute an action,
we use the TD error to decide how good the action was compared to the average for that state.
If the TD error is positive,
then it means the selected action resulted in a higher value than expected.
Taking that action more often should improve our policy.
That is exactly what this update does.
It changes the policy parameters $\theta$
to increase the probability of actions that were better than expected according to the critic.
Correspondingly,
if the critic is disappointed and the TD error is negative,
then the probability of the action is decreased.
The actor and the critic learn at the same time, constantly interacting.
The actor is continually changing the policy to exceed the critic's expectation, and
the critic is constantly updating its value function to evaluate the actor's changing policy.
Actor-Critic algorithm
With the policy update in place,
we're ready to go through the full algorithm for average reward actor-critic.
To start,
we specify the policy parameterization and the value function parameterization.
Input : a differentiable policy parameterization $\pi(a|s,\theta)$
Input : a differentiable state-value function parameterization $\hat{v}(s,W)$
For example,
we might use Tile Coding to construct the approximate value function and a Soft-max policy parameterization.
We will need to maintain an estimate of the average reward $\bar{R}$ just like we did in the differential SARSA algorithm.
We initialize this to $0$.
Initialize $\bar{R} \in \mathbb{R}$ to $0$
We can initialize the weights and the policy parameters however we like.
Initialize state-value weights $W \in \mathbb{R}^d$ and policy parameter $\theta \in \mathbb{R}^{d'} \quad$ (e.g. to $0$)
We initialize the step size parameters for the value estimate, the policy, and the average reward,
and they could all be different.
Algorithm parameters : $\alpha^W > 0, \;\; \alpha^\theta > 0, \;\; \alpha^{\bar{R}} > 0$
We get the initial state from the environment, and then begin acting and learning.
Initialize $S \in \mathcal{S}$
Loop forever (for each time step)
On each time step,
we choose the action according to our policy
and receive the next state and reward from the environment.
Loop forever (for each time step)
$\quad A \sim \pi(\; \cdot \; | S,\theta)$
$\quad$ Take the action $A$, $\quad$ observe $S', \; R$
Using this information,
we compute the differential TD error
and update our running estimate of the average reward.
$\delta \quad \leftarrow \quad R - \bar{R} + \hat{v}(S',W) - \hat{v}(S,W)$
$\bar{R} \quad \leftarrow \quad \bar{R} + \alpha^{\bar{R}} \delta$
We update the value function weights using the TD update.
$W \quad \leftarrow \quad W + \alpha^W \delta \nabla \hat{v}(S,W)$
Finally,
we update the policy parameters using our policy gradient update!
$\theta \quad \leftarrow \quad \theta + \alpha^\theta \delta \nabla \ln \pi(A \; |S,\theta)$
$S \quad \leftarrow \quad S'$
That's it !
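The steps above can be put together in a short sketch. Everything specific here is an assumption made for illustration and does not come from the lecture: the two-state toy environment `toy_env_step`, the tabular value function and tabular softmax policy (instead of tile coding), and the step sizes.

```python
import math
import random

def softmax(prefs):
    """Return softmax probabilities for a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def toy_env_step(state, action):
    """Hypothetical continuing task with 2 states and 2 actions.

    Action 1 pays reward 1, action 0 pays 0; the next state is the action taken.
    """
    return action, float(action)

def actor_critic(num_steps=20000, alpha_w=0.1, alpha_theta=0.1,
                 alpha_r=0.01, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    avg_reward = 0.0                                      # R-bar
    w = [0.0] * n_states                                  # tabular v_hat(s, W) = W[s]
    theta = [[0.0] * n_actions for _ in range(n_states)]  # softmax preferences
    s = 0
    for _ in range(num_steps):
        # A ~ pi(.|S, theta)
        pi = softmax(theta[s])
        a = rng.choices(range(n_actions), weights=pi)[0]
        s_next, r = toy_env_step(s, a)
        # differential TD error
        delta = r - avg_reward + w[s_next] - w[s]
        # average reward estimate
        avg_reward += alpha_r * delta
        # critic: the gradient of a tabular v_hat is one-hot at S
        w[s] += alpha_w * delta
        # actor: grad ln pi(A|S) for a tabular softmax is one_hot(A) - pi
        for b in range(n_actions):
            grad_log = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += alpha_theta * delta * grad_log
        s = s_next
    return avg_reward, [softmax(row) for row in theta]

avg_reward, policy = actor_critic()
print(avg_reward)
print(policy)
```

In this toy task the best policy always picks action 1, so the average reward estimate should drift toward 1 and the probability of action 1 should grow in the states the agent keeps visiting.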
This algorithm is designed for continuing tasks.
So we can run it indefinitely and continue to improve the policy forever.
Summary
It is useful to learn a value function to estimate the gradient for the policy parameters
The actor-critic algorithm implements this idea,
with a critic that learns a value function for the actor