3. Week_3-1_Episodic SARSA with Function Approximation :: 돗토리

ABOUT ME

-

Today: -

Yesterday: -

Total: -

3. Week_3-1_Episodic SARSA with Function Approximation

카테고리 없음 2021. 6. 21. 05:07
3. Week_3-1_Episodic SARSA with Function Approximation

Week_3

INDEX

Episodic SARSA with Function Approximation

Episodic SARSA with Function Approximation

Episodic SARSA in Mountain Car

Expected SARSA with Function Approximation

Exploration under Function Approximation

Exploration under Function Approximation

Average Reward

Average Reward $\quad : \;$ A New Way of Formulating Control Problem

Satinder Singh $\quad : \;$ On Intrinsic Rewards

Week 3 Review

1. Episodic SARSA with Function Approximation

Episodic SARSA with Function Approximation

Episodic SARSA in Mountain Car

Expected SARSA with Function Approximation

$\cdot$ Episodic SARSA with Function Approximation

Understand how to consturct action-dependent features for approximate action-value

Explain how to use SARSA in episodic tasks with function approximation

This week,
We'll talk about control with Function Approximation.

In this video, we'll start with a GPI ( Generalized Policy Iteration ) algorithmm,
Episodic SARSA.

State-values to action values

Recall that
the va lue function approximation has two components.

$W^T \quad : \;$ Weight vector

$X(s) \quad : \;$ Feature vector

In a given state $S$,
the value of estimate is the dot product between these two components.
$v_{\pi} \approx \hat{v}(s,W) \doteq W^T X(s)$

To move from TD to SARSA,
We need action-value functions.
So the feature representation has to represent actions as well.
$q_{\pi} \approx \hat{q}(s,a,W) \doteq W^T X(s,a)$

Representing actions

One way to do this is
to have a seperate function approximator for each action.
This can be accomplished by stacking the features.

Stacking the features
is we can use the same state features for each action,
but only activate the features corresponding to that action.

To understand this more clearly, let's look at an example.

Let's say there are four features $X(s)$ and three actions $A(s)$.
The four features represent the state you are in.

But we want to learn a function of both states and actions.

We can do this by repeating the four features for each action.
Now the feature vector has 12 components.

Here, each segment of four features corresponds to a seperate action $a_0, a_1, a_2$.

We call this feature representation stacked,
because the weights for each action are stacked on top of each other.

Thus,
only the features for the specified action will be active,
while thouse for the other actions will be set to $0$ !

Computing action-values

Let's see an example of how to calculate the action-values from a given state.

Let's say that
there are four features and three actions.
The weight vector looks like this.

With stacked features,
we get the following feature vector for action $a_0$ in state $s_0$.
We zero out the features for the other actions.

So we extract the segment of the weight vector corresponding to each action.
The action-values are then the dot products between each segment of the weight vector and the feature vector.

Computing Action-values with a Neural Network

You might think that stacking features to create action-values
is specific to linear function approximation.
But this is not ths case.

For example,
the common way to represent action-values with a neural network is
to gernerate multiple outputs, one for each action-value.

This, however, is equivalent to the stacking procedure we just described.

The neural network inputs the state, and the last hidden layer produces the state-features.
Each action-value is computed from an independent set of weights using those state-features !
The weights for one action-value do not interact with those of another action-value, just like in stacking.

We might want to generalize over actions
for the same reason generalizing over state can be useful.

" How might this work with a neural network ? "

We would input both the state and and the action to the network.
There would only be one output. The approximate action-value for that state and action ! $\hat{q}(s, a, W)$

We can do something similar with Tile Coding by passing both the state and action as input.

Episodic SARSA with function approximation

We now know how to handle action-values with function approximation.
Let's talk about using SARSA for control of function approximation.
The algorithm quites similar to the Tabular version (of SARSA),
so we will just review the differences.

We use parameterized action-value function for the action-value estimates.
$\hat{q}(S,A,W), \quad \gamma \hat{q}(S',A',W) - \hat{q}(S,A,W)$

The update also changes to use the gradient to update the weights similar to Semi-gradient TD.
$\nabla \hat{q}(S,A,W)$

That's it.
That's the SARSA control algorithm with function approximation !

Summary

Talked about action-dependent features

Introduced Episodic SARSA with function approximation

We will cover a few more control algorithms in the coming lectures.

$\cdot$ Episodic SARSA in Mountain Car

Gain experience analyzing the performance of an approximate TD control method

In this video,
we get to see Episodic SARSA in action, on a continuous state domain.

We'll visualize the learn values,
and gain some intuition about the solutions learned by SARSA.

Example $\quad : \;$ The Mountain Car environment

Let's looak at an example of episodic SARSA on a classic control domain called Mountain Car.

Mountin Car is an episodic task where the objective is to drive an underpowered car up the side of the mountain.
Gravity is stronger than the car's engine, so the agent cannot directly drive up the mountain.
The only way to escape is to first drive backwards up the left slop.
Then driving down the hill gives the car enough momentum to drive up the right slop and out.

The episode begincs with a car in a random location near the bottom of the valley.
It ends when the car reaches the flags at the top of the hill.

To encourage the agent to finish the episode quickly as possible,
The reward is $-1$ on each timestep,
and no discounting is issued $\gamma = 1.0$.

The agent can observe the current position and velocity of the car.
This is a 2-dimensional countinuous-valued state.

The agent has a choice between three actions :
accelerate forward, accelerate backward, or coast (no acceleration).

Feature representation

For this example, we jointly tile-code(?) the position and velocity to produce features.

We use eight tiles,
( which means we use 8 by 8 grids for the 2-dimensional input space. )

We use 8 tilings,
( which means we have 8 overlapping grids. )

We treat actions independently
by using a stacked feature representaion.

We initialize the weights to $0$.

This initalization is actually optimistic.
This is because a reward is $-1$ per step,
so the values under any policy will be less than $0$.

In this problem, these optimistic initial values cause extensive and systematic exploration.
Because of this, we can act greedily without any additional random exploration.

Learned values

Let's see what the vlue function looks like
after we run the agent for a very long time.

Ideally, we would like to plot the value function for every state.
However, there is an uncountably infinite number of states.
So we'll have to sample the set of states.

We'll use the max value in each sample state.
$\displaystyle \max_a Q(s,a,W)$

This number tells us the number of steps the agent thinks it will take to escape under it's greedy policy.

The reward is $-1$ for each step,
so we negate this number $\max_a Q(s,a,W)$
to produce the agent's estimate of the number of steps to go, from each state.

negate : 음수를 취하다 ?

Here's the agent's estimate of the expected number of steps after 9000 episodes.
This is a really long time for this problem, so we're pretty sure the value estimates are about as good as they're going to get.
But they're not perfect due to function approximation.

The green line represents the goal position
which does not depend on the velocity.

Near the goal,
if the velocity is large enough,
the agent can directly drive out.

However,
if the agent is near the goal but velocity is too low,
it will take many steps to escape.

This is because
the agent must first go back down hill,
and up the left side again to gain enough momentum to escape.

The pick corresponds to the start states,
where it takes around 120 steps to reach the flag.

This green trajectory reveals the path
taken through the state-space by the learn policy !

Learning Curves

The value function looks pretty interesting.
but let's look at some learning curves to get better insight into the speed of learning.

Let's try SARSA with three different values of $\alpha$,
so three variances of the algorithm.

Here we are interested in the steps to goal,
which in this case corresponds to the return per episode.

We expect the number of steps per episode to decrease with learning.
Lower on the x-axis corresponds to better policies,
policies that can reach the goal in the fewest steps on average.

As always, we average the performance over many independent runs.
Here are the results.

By 500 episodes, the number of steps per episode has roughly stabilized.
All the learning curves exhibit the familiar exponential profile.

The smaller step size parameter value of 0.1 results in slower learning,
while $\alpha$ of 0.5 allow the agent to learn more quickly and find a better policy over 500 episodes.

Notice
we divided each $\alpha$ value by $8$.

If we are not using a vector of step sizes,
we often scale the step size parameter by the norm of the feature vector.

Here we use $8$ tilings in our tile coder.
That means the number of $1$s in the feature vector is always $8$. (?)

As a simple exercise, prove yourslef that $8$ corresponds to the $L_1$ norm of the feature vector.

Summary

Evaluated Episodic SARSA with linear function approximation in Mountain Car

$\cdot$ Expected SARSA with function approximation

Explain the update of Expected SARSA with function approximation

Explain the update of Q-learning with function approximation

So far we've talked about using SARSA with function approximation.
Let's continue our journey through approximate control methods and look at Expected SARSA.

From SARSA to Expected SARSA

Let's see how we can turn SARSA into Expected SARSA
when using function approximation.

Recall the update equation for SARSA
SARSA'a update target includes the action value for the next state and action $Q(S_{t+1}, A_{t+1})$.
$Q(S_t, A_t) \quad \leftarrow \quad Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \color{brown}{Q(S_{t+1}, A_{t+1})} - Q(S_t, A_t) \big)$

Now recall the Expected SARSA instead using the expectation over it's target policy.
To compute the expectation,
we simply sum over the action values weighted by their probability under the target policy.
$Q(S_t, A_t) \quad \leftarrow \quad Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \color{brown}{\displaystyle \sum_{a'} \pi(a'|S_{t+1})Q(S_{t+1},a')} - Q(S_t, A_t) \big)$

Expected SARSA with Function Approximation

The same expectation can be computed
even when we use function approximation !

First, recall the update for SARSA with function approximation.
It looks similar to the Tabular setting
except the action value estimates are parameterized by the weight factor (vector?) $W$.
$\color{brown}{\hat{q}(S_{t+1},A_{t+1},W)}$
We also have a gradient term to distribute the error to the weights appropriately.
$\nabla \hat{q}(S_{t},A_{t},W)$

Expected SARSA with function approximation follows a similar structure.
We compute the action values from our weight vector for every action in the next state.
Then we compute this expectation under the target policy.
$\color{brown}{\displaystyle \sum_{a'} \pi(a'|S_{t+1}) \hat {q}(S_t,A_t,W)}$

That's all we have to do to change SARSA into Expected SARSA.

Expected SARSA to Q-learning

Now, how do we do this for Q-learning with function approximation ?

Fortunately,
Q-learning is a special case of Expected SARSA.

In Q-laerning,
the target policy is greedy with respect to the approximate action values.

$\rightarrow \quad$ Computing the expectation under greedy policy
$\qquad$ is the same es computing the maximum action value.

So the Q-learning update with function approximation is actually quite straightforwards.

$\rightarrow \quad$ We just use the max
$\qquad$ in place of the expectation.

Summary

Introduced Expected Sarsa with function approximation

Extended it to Q-learning with function approximation

Next time,
We'll think about how to do exploration with function approximation !

인기포스트

ABOUT ME

LINK

ADMIN

티스토리툴바