3. Week_4-2_Policy Gradient for Continuing Tasks
Uncategorized, 2021. 6. 21. 05:11
2. Policy Gradient for Continuing Tasks
The Objective for Learning Policies
The Policy Gradient Theorem
$\cdot$ The Objective for Learning Policies
- Describe the objective for policy gradient algorithms
Now that we've introduced the idea of parameterizing policies directly,
we're ready to talk about how we can learn to improve a parameterized policy.
Just like with action-value based methods,
the basic idea will be to specify an objective,
and figure out how to estimate the gradient of that objective from an agent's experience.
The Goal of Reinforcement Learning $\quad : \;$ Maximizing Reward in the Long Run
Formulating an objective for learning a parameterized policy
is in some sense more straightforward than it was for action-value based methods !
We've said many times that the ultimate goal of Reinforcement Learning
is to learn a policy that obtains as much reward as possible in the long run.
It turns out that when we parameterize our policy directly,
we can also use this goal directly as the learning objective.
Formalizing the Goal as an Objective
Throughout this specialization,
we've introduced a few different interpretations of
what it means to obtain as much reward as possible in the long run.
For the episodic case,
we can use the undiscounted return,
which is the sum of rewards over a whole episode.
For the continuing case,
we introduced the discounted return,
which places more emphasis on immediate reward in order to keep the sum finite.
Most recently, we introduced the average reward formulation.
Here, we maximize the long-term average of the reward.
The appropriate return here is the sum of the differences between the immediate reward and its average.
Because the long-term average is subtracted, this sum is finite even without discounting !
Here we have three potential problem formulations to consider.
We're going to restrict our attention to the continuing setting with average reward.
Remember that our aim here
is to find a way to directly optimize the parameters of a policy.
The first step is to write the average reward objective in a form that we can optimize.
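For reference, the three formulations can be written side by side. The symbols $G_t$, $R_{t+1}$, $\gamma$ and $T$ follow the course textbook and are not defined in these notes, so treat the notation as an assumption:
$\text{Episodic (undiscounted)} \quad : \quad G_t = R_{t+1} + R_{t+2} + \cdots + R_T$
$\text{Continuing (discounted)} \quad : \quad G_t = \displaystyle \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad 0 \le \gamma < 1$
$\text{Continuing (average reward)} \quad : \quad G_t = \displaystyle \sum_{k=0}^{\infty} \big( R_{t+k+1} - r(\pi) \big)$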
The Average Reward Objective
Previously, we estimated the average reward in order to learn action-values !
( in an average reward variant of SARSA )
Now our aim is to learn a policy
that directly optimizes the average reward.
Toward this aim, let's start by writing the average reward under a policy in a more useful form.
We can write the average reward $r(\pi)$ achieved by a particular policy $\pi$ like this.
$r(\pi) \quad = \quad \displaystyle \sum_s \mu(s) \sum_a \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)r$
Understanding the Average Reward Objective
Let's break this formula down starting from the inner sum and moving out.
The inner sum gives the expected reward if we start in state $s$ and take action $a$.
$\rightarrow \quad \mathbb{E} \big[ R_t|S_t=s,A_t=a \big]$
This is simply the sum over all rewards $r$ we might receive, weighted by their probability $p(s',r|s,a)$ from $s$ and $a$.
We sum over next states $s'$ to get the marginal probabilities over the reward.
The next level of the summation gives us the expected reward under the policy $\pi$ from a particular state $s$.
$\rightarrow \quad \mathbb{E}_{\pi} \big[ R_t | S_t = s \big]$
This is a sum over all possible actions $a$, weighted by their probability $\pi(a|s,\theta)$ under the policy $\pi$.
Finally, we get the overall average reward
by considering the fraction of time we spend in state $s$ under policy $\pi$.
$\rightarrow \quad \mathbb{E}_{\pi} \big[ R_t \big]$
The distribution $\mu$ provides these probabilities $\mu(s)$.
The expected reward across states $\mathbb{E}_{\pi} \big[ R_t \big]$ is a sum over $s$ of the expected reward in a state $\mathbb{E}_{\pi} \big[ R_t | S_t = s \big]$ weighted by $\mu(s)$.
$r(\pi)$ is our average reward learning objective !
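The three nested sums can be evaluated numerically on a toy problem. The sketch below is only an illustration: the 2-state, 2-action MDP, its transition table `P`, reward table `R`, and the policy `pi` are all made-up assumptions, and the reward is assumed to be deterministic given $(s, a, s')$ so that the sum over $r$ collapses into the sum over $s'$.

```python
import numpy as np

# Hypothetical 2-state, 2-action continuing MDP (all numbers are made up).
# P[s, a, s2] = probability of moving to state s2 after action a in state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
# R[s, a, s2] = reward received on that transition (assumed deterministic).
R = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[1.0, 0.0], [0.0, 3.0]]])
# pi[s, a] = pi(a|s): a fixed stochastic policy.
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])


def stationary_distribution(P_pi):
    """Solve mu^T P_pi = mu^T together with sum(mu) = 1 by least squares."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu


# Inner sum: E[R | s, a] = sum over s2 of p(s2|s,a) * r
r_sa = (P * R).sum(axis=2)
# Middle sum: E_pi[R | s] = sum over a of pi(a|s) * E[R | s, a]
r_s = (pi * r_sa).sum(axis=1)
# State-to-state transition matrix under pi, and its stationary distribution mu.
P_pi = np.einsum("sa,sat->st", pi, P)
mu = stationary_distribution(P_pi)
# Outer sum: r(pi) = sum over s of mu(s) * E_pi[R | s]
r_pi = float(mu @ r_s)
print("mu =", mu, " r(pi) =", r_pi)
```

Here `mu` is obtained from the transition model for convenience. An agent that only has experience would not have access to `P`, which is part of why the gradient of this objective needs more care.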
Optimizing The Average Reward Objective
Our goal in policy optimization
will be to find a policy which maximizes the average reward $r(\pi)$ !
Basic approach
to estimate the gradient of the objective $\nabla r(\pi)$ with respect to the policy parameters,
and adjust the parameters based on the estimate !
$\rightarrow \quad$ Policy Gradient method
$\qquad$ The class of methods that use this idea are often referred to as policy gradient methods.
Indirect learning policy in GPI v.s. Direct learning policy in PG
Up until now,
we've used a very different approach.
We used the Generalized Policy Iteration framework (GPI) to learn approximate action-values.
Then we used these approximate values indirectly to infer a good policy.
Now,
we're interested in learning policies directly !
There's also a superficial difference.
In GPI
minimizing the Mean Squared Error for learning
In PG
maximizing an objective for learning
That means
we will want to move in the direction of the gradient !
( rather than the negative gradient )
The Challenge of Policy Gradient Methods
However,
there are a few challenges in computing this gradient !
The main difficulty
is that modifying our policy changes the distribution $\mu$ !
$\rightarrow \quad$ $\mu(s)$ depends on $\theta$ !
This contrasts with value function approximation,
where we minimized the Mean Squared Value Error under a particular policy.
There
the distribution $\mu$ was fixed.
It does not change as the weights of the parameterized value function change.
$\rightarrow \quad \mu(s)$ is independent of $W$
We were therefore able to estimate gradients for states drawn from $\mu$ by simply following the policy.
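The dependence of $\mu$ on $\theta$ can be seen on a small example. This is a minimal sketch under assumptions: the transition table `P`, the tabular softmax parameterization, and both parameter settings are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P[s, a, s2] = p(s2 | s, a). Numbers are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])


def softmax_policy(theta):
    """Tabular softmax: pi(a|s) is proportional to exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def state_distribution(theta):
    """Stationary distribution mu of the Markov chain induced by pi(.|., theta)."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu


theta_a = np.zeros((2, 2))                    # uniform random policy
theta_b = np.array([[0.0, 2.0], [0.0, 2.0]])  # strongly prefers action 1
mu_a = state_distribution(theta_a)
mu_b = state_distribution(theta_b)
print("mu under theta_a:", mu_a)
print("mu under theta_b:", mu_b)
```

Changing only the policy parameters shifts how much time is spent in each state, while no value-function weight appears anywhere in `state_distribution`.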
The Policy Gradient Theorem $\quad : \;$ Solution for learning policy
For policy gradient methods this is less straightforward,
as $\mu$ itself depends on the policy we are optimizing.
Luckily,
there's an excellent theoretical answer to this challenge,
called the Policy Gradient Theorem !
( we'll talk more about that next )
Summary
- We can use the average reward as an objective for policy optimization
In upcoming lectures,
we will discuss how to actually optimize this objective from sampled experience.
$\cdot$ The Policy Gradient Theorem
- Describe the result of the policy gradient theorem
- Understand the importance of the policy gradient theorem
We just discussed an objective for policy optimization.
The next step is to figure out how the agent can optimize it based on its own experience.
In this video,
we will describe the Policy Gradient Theorem.
This is a key theoretical result.
It allows us to write the gradient of the average reward so that it is easier to estimate from experience !
Gradient Ascent
To optimize the Mean Squared Value Error,
we used methods based on Stochastic Gradient Descent.
We estimated the negative of the gradient of our objective,
and adjusted the weights of the value function in that direction.
Policy Gradient methods use a similar approach,
but with the average reward objective and the policy parameters $\theta$.
We want to maximize the average reward rather than minimize it !
This means we do Gradient Ascent, and move in the direction of the positive gradient.
Remember that
a simple recipe for solving a problem is
to first specify an objective,
then estimate the gradient of that objective,
and finally, adjust the weights in that direction.
Step 1 is done.
Now the next step is to estimate the gradient of our objective.
The Gradient of the Objective
Recall
the Average Reward Objective.
Let's compute the gradient.
We can apply the product rule of calculus to the objective to yield two terms.
This is pretty complex.
So let's look at it a bit more closely.
The first term involves the gradient of the stationary distribution over states.
$\nabla \mu(s)$
Unfortunately,
the gradient of $\mu$ is not straightforward to estimate.
The stationary distribution $\mu$ depends on a long-term interaction between the policy and the environment.
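Written out, applying the product rule to $r(\pi)$ gives the two terms below. This is a reconstruction from the objective rather than a copy of the lecture slide. It relies on the fact that the dynamics $p(s',r|s,a)$ do not depend on $\theta$, so in the second term the gradient acts only on $\pi$:
$\nabla r(\pi) \quad = \quad \displaystyle \sum_s \nabla \mu(s) \sum_a \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)r \; + \; \sum_s \mu(s) \sum_a \nabla \pi(a|s,\theta) \sum_{s',r} p(s',r|s,a)r$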
The Policy Gradient Theorem
Luckily,
the Policy Gradient Theorem gives us a simpler expression for this gradient.
Here we show the result of the theorem.
It's worth walking through this expression.
In the inner sum,
we have the gradient of the policy times the action-value function.
Let's try to understand this term a little better.
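For reference, the result of the theorem for the average reward objective, as given in the course textbook (Sutton and Barto), can be written as follows, where $q_{\pi}$ is the differential action-value function under $\pi$:
$\nabla r(\pi) \quad = \quad \displaystyle \sum_s \mu(s) \sum_a \nabla \pi(a|s,\theta) q_{\pi}(s,a)$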
Understanding $\displaystyle \sum_a \nabla \pi(a|s,\theta) q_{\pi}(s,a)$
The gradient of the policy $\pi$ is easy to compute
as long as our policy parameterization is differentiable !
The gradient of the policy
tells you how to adjust your parameters to increase the probability of a certain action.
Example $\quad$ for understanding the policy gradient
Consider the simple grid world shown here.
As usual, the agent can move up, down, left, or right.
For simplicity, let's assume the policy is controlled by just two parameters ($\theta_1$ and $\theta_2$).
The parameters are currently set to the point marked here.
The arrows indicate the agent's action probabilities in a particular state with the current parameter settings.
The policy gradient for the Up action case
The gradient for the up action might look something like this on the plot.
The gradient tells us
how to change the policy parameters to make that action more likely to be selected in the given state !
By moving the parameters in the direction of the gradient,
we increase the probability of the up action.
This necessarily means a decrease in the probability of some of the other actions !
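A small numerical sketch of this idea follows. The lecture does not specify the parameterization, so everything here is an assumption: a softmax over linear action preferences, a hypothetical 2-D feature vector `X[a]` per action (which gives exactly two parameters), an arbitrary current `theta`, and an arbitrary step size.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]
# Hypothetical 2-D feature vector per action, so the policy has two parameters.
X = np.array([[0.0, 1.0],    # up
              [0.0, -1.0],   # down
              [-1.0, 0.0],   # left
              [1.0, 0.0]])   # right


def policy(theta):
    """Softmax over linear preferences h(a) = theta . x(a) for one fixed state."""
    h = X @ theta
    z = np.exp(h - h.max())
    return z / z.sum()


def grad_action_prob(theta, a):
    """Gradient of pi(a|theta) w.r.t. theta: pi(a) * (x(a) - sum_b pi(b) x(b))."""
    pi = policy(theta)
    return pi[a] * (X[a] - pi @ X)


theta = np.array([0.5, 0.2])   # assumed current parameter setting
UP = ACTIONS.index("up")
step = 1.0                     # assumed step size
theta_new = theta + step * grad_action_prob(theta, UP)

pi_old, pi_new = policy(theta), policy(theta_new)
for name, p0, p1 in zip(ACTIONS, pi_old, pi_new):
    print(f"{name:>5}: {p0:.3f} -> {p1:.3f}")
```

Because the probabilities must still sum to one, the increase for the up action is paid for by a drop in the total probability of the remaining actions.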
The policy gradient for the Left action case
Different actions have different gradients.
The gradient of the left action probability may look like this.
Moving the parameters in that direction will increase the probability of the left action,
and decrease the probability of some of the other actions.
This might all sound a bit abstract.
We will show concrete examples of computing this gradient in the coming lectures.
The policy gradient over all actions
Now,
let's bring some reward into the situation
and think about what this whole term means.
This is a sum over the gradients of each action probability, weighted by the value of the associated action.
Imagine
we've added a rewarding state in the bottom right.
The up and left actions
move away from the rewarding state, and should have negative value.
The down and right actions
move toward the rewarding state, and should have positive value.
The precise values will depend on the current policy,
but let's say they look something like this.
The weighted sum
gives a direction to move the parameters that decreases the probability of moving up or left since their value is negative,
while increasing the probability of moving down or right since their value is positive.
That direction might look something like this on the plot.
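The weighted sum can also be computed directly. The sketch below is an illustration under assumptions: the same kind of hypothetical two-parameter softmax policy (one made-up feature vector `X[a]` per action), made-up action values `q` of -1 for up and left and +1 for down and right, and an arbitrary step size of 0.5.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]
# Hypothetical 2-D feature vector per action, so the policy has two parameters.
X = np.array([[0.0, 1.0],    # up
              [0.0, -1.0],   # down
              [-1.0, 0.0],   # left
              [1.0, 0.0]])   # right
# Assumed action values in this state: moving away from the reward is bad.
q = np.array([-1.0, 1.0, -1.0, 1.0])


def policy(theta):
    """Softmax over linear preferences h(a) = theta . x(a) for one fixed state."""
    h = X @ theta
    z = np.exp(h - h.max())
    return z / z.sum()


def grad_action_prob(theta, a):
    """Gradient of pi(a|theta) w.r.t. theta: pi(a) * (x(a) - sum_b pi(b) x(b))."""
    pi = policy(theta)
    return pi[a] * (X[a] - pi @ X)


theta = np.array([0.5, 0.2])   # assumed current parameter setting
# Weighted sum over actions: sum_a grad pi(a|s, theta) * q(s, a)
direction = sum(q[a] * grad_action_prob(theta, a) for a in range(len(ACTIONS)))
theta_new = theta + 0.5 * direction

pi_old, pi_new = policy(theta), policy(theta_new)
print("direction:", direction)
for name, p0, p1 in zip(ACTIONS, pi_old, pi_new):
    print(f"{name:>5}: {p0:.3f} -> {p1:.3f}")
```

With these particular numbers the step lowers the probabilities of up and left and raises those of down and right. For other parameter settings or larger steps, only the increase in the value-weighted sum is guaranteed to first order, not the sign of every individual change.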
Average Reward expression with the policy gradient
The expression for the gradient $\nabla r(\pi)$ given by the Policy Gradient Theorem
takes this term and sums it over each state, weighted by $\mu(s)$.
This gives the direction to move the policy parameters to most rapidly increase the overall average reward !
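As a sanity check, the theorem's expression can be compared with a brute-force numerical gradient of $r(\pi)$ on a toy problem. This is a sketch under stated assumptions, not the course's algorithm: the MDP is random and hypothetical, the policy is a tabular softmax, the model `P` is assumed known (an agent normally would not have it), and $q_{\pi}$ is taken to be the differential action-value function of the average reward setting.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
# Hypothetical random MDP with strictly positive transition probabilities.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s2] = p(s2 | s, a)
r_sa = rng.normal(size=(S, A))               # r_sa[s, a] = E[R | s, a]


def softmax_policy(theta):
    """Tabular softmax: pi(a|s) is proportional to exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate(theta):
    """Return pi, mu, r(pi) and the differential action values q_pi."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_s = (pi * r_sa).sum(axis=1)
    M = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    mu, *_ = np.linalg.lstsq(M, np.append(np.zeros(S), 1.0), rcond=None)
    r_bar = mu @ r_s
    # Differential state values: v = r_s - r_bar + P_pi v, normalized so mu . v = 0.
    v = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), mu), r_s - r_bar)
    q = r_sa - r_bar + P @ v
    return pi, mu, r_bar, q


def pg_theorem_gradient(theta):
    """sum_s mu(s) sum_a grad pi(a|s) q(s,a), written out for tabular softmax."""
    pi, mu, _, q = evaluate(theta)
    expected_q = (pi * q).sum(axis=1, keepdims=True)
    return mu[:, None] * pi * (q - expected_q)


def finite_difference_gradient(theta, eps=1e-6):
    """Central differences on r(pi), recomputing mu for every perturbation."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (evaluate(theta + d)[2] - evaluate(theta - d)[2]) / (2 * eps)
    return grad


theta = rng.normal(size=(S, A))
g_pgt = pg_theorem_gradient(theta)
g_fd = finite_difference_gradient(theta)
print("max abs difference:", np.max(np.abs(g_pgt - g_fd)))
```

The finite-difference version has to recompute $\mu$ for each perturbed $\theta$, while `pg_theorem_gradient` only uses $\mu$ at the current parameters and never differentiates it.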
The Policy Gradient Theorem
We now have a simple expression for the gradient !
Importantly,
this expression does not contain the gradient of the state distribution $\mu$ !
( which is challenging to estimate ! )
The proof of the policy gradient theorem
can be found in the course textbook.
As we will see in the next lecture,
this gradient is straightforward to estimate !
That means we will be able to build an incremental policy gradient algorithm using an agent's experience !
Summary
- The policy gradient theorem gives an expression for the gradient of the average reward
- Understand the terms in this gradient