3. Week_3-2_Exploration under Function Approximation
2021. 6. 21. 05:08
2. Exploration under Function Approximation
- Describe how optimistic initial values and $\epsilon$-greedy can be used with function approximation.
The need to balance exploration and exploitation
is one of the defining characteristics of the sequential decision-making problem.
We've talked about several simple ways to promote exploration in Bandits and Tabular Reinforcement Learning.
"What about Function Approximation?"
"Is there anything special about exploration there?"
Today, we'll find out.
Optimistic Initial Values in the Tabular setting
Let's do a brief refresher on
how we use optimistic initial values in the Tabular setting.
We initialize our values to be greater than the true values.
This is like the agent imagining that
it can get more reward by taking that action than it actually can in reality.
Typically,
initializing the value function this way causes the agent to systematically explore the state-action space.
As the agent's values become more accurate,
they are impacted less and less by this initialization.
This is straightforward to implement in a Tabular setting
where the update to each state-action pair is independent of all the other state-action pairs.
How to Initialize Values Optimistically under Function Approximation
In Function Approximation,
optimistic initial values correspond to initializing the weights
such that the resulting values are optimistic.
Linear case
In some cases this is straightforward:
when the features are binary, we simply initialize each weight to be the largest possible return.
Then, as long as each state has at least one feature active, the value will be optimistic and likely overly so.
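The linear case can be sketched in a few lines. This is only an illustration under stated assumptions: the constants `R_MAX` and `GAMMA`, the toy feature map `features`, and the six-state problem are all hypothetical and not taken from the lecture.

```python
# Hypothetical problem constants (assumptions for illustration only).
R_MAX = 1.0    # largest reward obtainable on any single step
GAMMA = 0.9    # discount factor

# Largest possible return: R_MAX + GAMMA * R_MAX + ... = R_MAX / (1 - GAMMA).
G_MAX = R_MAX / (1.0 - GAMMA)

NUM_FEATURES = 4


def features(state):
    """Toy binary feature vector; every state has one or two active features."""
    x = [0] * NUM_FEATURES
    x[state % NUM_FEATURES] = 1
    x[(state // 2) % NUM_FEATURES] = 1
    return x


def v_hat(state, w):
    """Linear value estimate: dot product of the weights and the features."""
    return sum(wi * xi for wi, xi in zip(w, features(state)))


# Optimistic initialization: every weight starts at the largest possible return.
w = [G_MAX] * NUM_FEATURES

initial_values = [v_hat(s, w) for s in range(6)]
```

Every entry of `initial_values` is at least `G_MAX`, and states with two active features start at twice `G_MAX`, which is the "likely overly so" effect.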
Non-linear case
In many cases, however,
it is difficult to initialize optimistically.
For example, in a Neural Network,
the relationship between the final values and the features can be quite complicated.
Imagine a network composed of $\tanh$ activation functions.
The network could output negative values even with positive initial weights.
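A small sketch makes the $\tanh$ point concrete. The network shape, the weight values, and the input vectors below are assumptions chosen for illustration; the only claim is that positive weights alone do not guarantee positive outputs.

```python
import math


def tiny_tanh_net(x, w_hidden, w_out):
    """One hidden layer of tanh units followed by a linear output (no biases)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))


# All weights are positive (hypothetical values).
w_hidden = [[0.5, 0.8], [0.3, 0.9]]
w_out = [1.0, 2.0]

# Input features may be negative, e.g. a position to the left of the origin.
x_negative = [-1.0, -0.5]
x_positive = [1.0, 0.5]

value_negative_input = tiny_tanh_net(x_negative, w_hidden, w_out)
value_positive_input = tiny_tanh_net(x_positive, w_hidden, w_out)
```

Here `value_negative_input` is below zero even though every weight is positive, because $\tanh$ maps negative pre-activations to negative hidden values.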
How Optimism interacts with Generalization
But this isn't the whole story.
Depending on how our features generalize,
optimistic initial values may not result in the same kind of systematic exploration we see in the Tabular case.
Single feature
Consider an extreme example, where we have only one feature that is always $1$.
We can initialize optimistically, but every update will change the value for all states.
This means that before some states are even visited, the value will already have decreased
such that it is no longer optimistic ...
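The single-feature example can be simulated directly. This is a sketch under assumptions: the step size, the optimistic weight, the observed return, and the use of a gradient Monte Carlo style update toward a fixed return are all hypothetical choices, not the lecture's setup.

```python
ALPHA = 0.5          # step size (hypothetical)
OPTIMISTIC_W = 10.0  # optimistic initial weight (hypothetical)
TRUE_RETURN = 1.0    # return actually observed from the visited state (hypothetical)


def feature(state):
    """The single feature is 1 for every state."""
    return 1.0


def v_hat(state, w):
    """Linear value estimate with one shared weight."""
    return w * feature(state)


w = OPTIMISTIC_W
states = range(5)
before = [v_hat(s, w) for s in states]

# Only state 0 is ever visited; each visit moves the shared weight
# toward the observed return.
for _ in range(10):
    w += ALPHA * (TRUE_RETURN - v_hat(0, w)) * feature(0)

after = [v_hat(s, w) for s in states]
```

States 1 to 4 were never visited, yet their entries in `after` have dropped just as far as the entry for state 0, so nothing optimistic remains to drive the agent toward them.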
Tile coding
To facilitate systematic exploration,
changes to the value function need to be more localized.
For example,
function approximation with Tile Coding can produce such localized updates.
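The localized behaviour of tile coding can be illustrated with a toy one-dimensional tile coder. This is a minimal sketch, not a production tile coder: the number of tilings, the tile width, the offsets, the step size, and the fixed observed return of 1.0 are all assumptions.

```python
import math

NUM_TILINGS = 4
TILES_PER_TILING = 5                 # tiles needed to cover [0, 1) in one tiling
TILE_WIDTH = 1.0 / TILES_PER_TILING
OPTIMISTIC_VALUE = 10.0              # hypothetical optimistic starting value
ALPHA = 0.5                          # hypothetical step size


def active_tiles(x):
    """Index of the one active tile in each tiling for a state x in [0, 1)."""
    indices = []
    for t in range(NUM_TILINGS):
        offset = t * TILE_WIDTH / NUM_TILINGS
        tile = int(math.floor((x + offset) / TILE_WIDTH))
        # Each tiling gets one extra tile to cover its shifted range.
        indices.append(t * (TILES_PER_TILING + 1) + tile)
    return indices


def v_hat(x, w):
    """Value estimate: sum of the weights of the active tiles."""
    return sum(w[i] for i in active_tiles(x))


# Each state activates NUM_TILINGS tiles, so the optimistic value is split among them.
w = [OPTIMISTIC_VALUE / NUM_TILINGS] * (NUM_TILINGS * (TILES_PER_TILING + 1))

visited, nearby, far_away = 0.12, 0.16, 0.88

# Repeatedly visit one state and move its estimate toward an observed return of 1.0.
for _ in range(10):
    error = 1.0 - v_hat(visited, w)
    for i in active_tiles(visited):
        w[i] += (ALPHA / NUM_TILINGS) * error

v_visited = v_hat(visited, w)
v_nearby = v_hat(nearby, w)
v_far = v_hat(far_away, w)
```

The visited state loses its optimism, the nearby state loses part of it because it shares some tiles, and the far-away state keeps its full optimistic value, so the agent still has a reason to go there.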
Neural Network
Neural Networks also provide local updates,
but neural networks may also generalize aggressively.
In practice, without special consideration,
a neural network will lose its optimism relatively quickly.
$\epsilon$-greedy
Epsilon-greedy is generally applicable and easy to use
even in cases with Non-linear function approximation!
The only thing Epsilon-greedy needs is the action-value estimates, $\hat{q}(S_t, a, W)$,
independent of how they are initialized or approximated.
However, Epsilon-greedy is not a directed exploration method.
It relies on randomness to discover better actions near the states visited by the current policy.
It is therefore not as systematic as exploration methods that rely on optimism.
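Because Epsilon-greedy only needs action-value estimates, it can be written against any callable that returns them. The sketch below assumes a hypothetical `q_hat(state, action)` interface and a made-up stand-in `toy_q_hat`; a linear model or a neural network could be plugged in the same way.

```python
import random


def epsilon_greedy(q_hat, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one.

    q_hat is any callable (state, action) -> float; how it is approximated
    or initialized does not matter to this function.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    values = [q_hat(state, a) for a in actions]
    best = max(values)
    # Break ties among the greedy actions at random.
    return rng.choice([a for a, v in zip(actions, values) if v == best])


def toy_q_hat(state, action):
    """Hypothetical stand-in for a learned approximator."""
    return -abs(action - state)


ACTIONS = [0, 1, 2, 3]
rng = random.Random(0)

greedy_choice = epsilon_greedy(toy_q_hat, 2, ACTIONS, epsilon=0.0, rng=rng)
samples = [epsilon_greedy(toy_q_hat, 2, ACTIONS, epsilon=0.1, rng=rng)
           for _ in range(1000)]
```

With `epsilon=0.0` the choice is purely greedy; with `epsilon=0.1` most of `samples` are still the greedy action, and the remaining picks are undirected random exploration.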
Improving exploration in the function approximation setting remains an open research question.
So in this course, we'll stick with this simple strategy.
Summary
There are many subtleties when combining Optimistic Initial Values and Function Approximation
$\epsilon$-greedy can be combined with any Function Approximation