ECE 276 Assignment 2 : Tabular Methods

In this assignment, we will solve a simple grid world problem called 'FrozenLake-v0' in
OpenAI Gym using both model-based and model-free methods. To learn how to set up
the environment and interact with it, take a look at the OpenAI website. (More about the
environment can be found on the OpenAI Gym GitHub page.)
Note: Use the virtual environment from Assignment 1.
Question 1 – Model-based methods
1. Describe the environment's state space, action space, and reward function. Given a state
and an action, is the state transition deterministic?
2. Given a Markov Decision Process described by (S, A, R, P, γ), where S ∈ R^n is the
state space, A ∈ R^m is the action space, R : R^n × R^m × R^n → R is the reward function,
P : R^n × R^m × R^n → [0, 1] is the transition probability, and γ is the discount factor,
show that for a deterministic policy π(s), the value function v(s) can be expressed as:

v(s) = Σ_{s′ ∈ S} p(s′ | s, a) [r(s, a, s′) + γ v(s′)]        (1)

where a = π(s), p(s′ | s, a) ∈ P, and r(s, a, s′) ∈ R. Assume that the state and action
spaces are discrete.
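Equation (1) can also be checked numerically: at the fixed point of the Bellman backup, each v(s) must equal the right-hand side. The two-state MDP below (its transition probabilities, rewards, and γ) is entirely made up for illustration.

```python
# Numerical sanity check of the Bellman identity in Eq. (1) on a
# hypothetical two-state MDP (all numbers here are made up).
GAMMA = 0.9
STATES = [0, 1]
# p[(s, a)][s'] = transition probability, r[(s, a, s')] = reward
p = {(0, 0): {0: 0.2, 1: 0.8}, (1, 0): {0: 0.5, 1: 0.5}}
r = {(0, 0, 0): 0.0, (0, 0, 1): 1.0, (1, 0, 0): 0.0, (1, 0, 1): 2.0}
policy = {0: 0, 1: 0}            # deterministic: pi(s) = 0 for all s

def bellman_rhs(v, s):
    """Right-hand side of Eq. (1) for state s under the fixed policy."""
    a = policy[s]
    return sum(p[(s, a)][sp] * (r[(s, a, sp)] + GAMMA * v[sp]) for sp in STATES)

# Fixed-point iteration: repeatedly apply the backup until convergence.
v = {s: 0.0 for s in STATES}
for _ in range(1000):
    v = {s: bellman_rhs(v, s) for s in STATES}

# At the fixed point, v(s) equals the expression in Eq. (1) in every state.
for s in STATES:
    assert abs(v[s] - bellman_rhs(v, s)) < 1e-6
print(v[0], v[1])
```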
3. Write a function TestPolicy(policy) that returns the average rate of successful
episodes over 100 trials for a deterministic policy. What is the success rate of the policy
given by π(s) = (s + 1) % 4, where % is the modulus operator?
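A sketch of TestPolicy, assuming the usual Gym reset()/step() interface. So that the snippet runs standalone, a hypothetical 4-state deterministic chain stands in for FrozenLake-v0; with Gym installed you would pass env = gym.make('FrozenLake-v0') instead.

```python
# Sketch of TestPolicy over a Gym-style environment. ChainEnv is a made-up
# stand-in for FrozenLake-v0: action 1 moves right, action 0 moves left,
# and reaching state 3 counts as success (reward 1, episode ends).
class ChainEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 3) if a == 1 else max(self.s - 1, 0)
        done = self.s == 3
        return self.s, float(done), done, {}

def TestPolicy(env, policy, trials=100, max_steps=100):
    """Average rate of episodes that end with reward 1 over `trials` runs."""
    successes = 0
    for _ in range(trials):
        s = env.reset()
        for _ in range(max_steps):
            s, r, done, _ = env.step(policy[s])
            if done:
                successes += r == 1.0
                break
    return successes / trials

env = ChainEnv()
always_right = {s: 1 for s in range(4)}
print(TestPolicy(env, always_right))   # deterministic chain -> 1.0
```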
4. Write a function LearnModel that returns the transition probabilities p(s′ | s, a) and
the reward function r(s, a, s′). Estimate these values over 10^5 random samples.
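One way LearnModel might look: count transitions under random actions, then normalize the counts into probabilities and average the rewards. The "slippery" two-state environment here is hypothetical (its true transition probability of 0.7 is made up) so the estimate can be checked against a known value.

```python
import random
from collections import defaultdict

# Sketch of LearnModel, assuming a Gym-style reset()/step() API. SlipperyEnv
# is a made-up stand-in for FrozenLake-v0 with a known slip probability.
class SlipperyEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # The chosen direction succeeds with probability 0.7.
        target = 1 if a == 1 else 0
        self.s = target if random.random() < 0.7 else 1 - target
        done = self.s == 1 and a == 1   # reaching state 1 via 'right' ends it
        return self.s, float(done), done, {}

def LearnModel(env, nS, nA, samples=10**5):
    counts = defaultdict(int)           # visit counts per (s, a, s')
    rew_sum = defaultdict(float)        # summed reward per (s, a, s')
    totals = defaultdict(int)           # visit counts per (s, a)
    s = env.reset()
    for _ in range(samples):
        a = random.randrange(nA)
        s2, r, done, _ = env.step(a)
        counts[(s, a, s2)] += 1
        rew_sum[(s, a, s2)] += r
        totals[(s, a)] += 1
        s = env.reset() if done else s2
    # Normalize counts into p(s'|s,a) and average rewards into r(s,a,s').
    p = [[[0.0] * nS for _ in range(nA)] for _ in range(nS)]
    rwd = [[[0.0] * nS for _ in range(nA)] for _ in range(nS)]
    for (s, a, s2), c in counts.items():
        p[s][a][s2] = c / totals[(s, a)]
        rwd[s][a][s2] = rew_sum[(s, a, s2)] / c
    return p, rwd

random.seed(0)
p_hat, r_hat = LearnModel(SlipperyEnv(), nS=2, nA=2, samples=20000)
print(p_hat[0][1][1])   # should be close to the true value 0.7
```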
5. Write a function PolicyEval for evaluating a given deterministic policy, and with the
help of this function implement a policy iteration method to solve this environment
over 50 iterations. Plot the average rate of success of the learned policy at every
iteration.
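A sketch of PolicyEval plus policy iteration, assuming the model is available as p[s][a][s'] and r[s][a][s'] tables (e.g. from LearnModel). The tiny deterministic 3-state "step right" MDP below is hypothetical, just to exercise the code.

```python
# Hypothetical 3-state model: action 0 stays put, action 1 steps right,
# and entering the last state pays reward 1.
GAMMA, NS, NA = 0.99, 3, 2
p = [[[0.0] * NS for _ in range(NA)] for _ in range(NS)]
r = [[[0.0] * NS for _ in range(NA)] for _ in range(NS)]
for s in range(NS):
    p[s][0][s] = 1.0                     # action 0: stay
    s2 = min(s + 1, NS - 1)
    p[s][1][s2] = 1.0                    # action 1: step right
    if s2 == NS - 1 and s != s2:
        r[s][1][s2] = 1.0                # reward on entering the last state

def PolicyEval(policy, sweeps=500):
    """Iterative policy evaluation: repeated Bellman backups under `policy`."""
    v = [0.0] * NS
    for _ in range(sweeps):
        v = [sum(p[s][policy[s]][s2] * (r[s][policy[s]][s2] + GAMMA * v[s2])
                 for s2 in range(NS)) for s in range(NS)]
    return v

def PolicyIteration(iters=50):
    policy = [0] * NS
    for _ in range(iters):
        v = PolicyEval(policy)
        # Greedy improvement with respect to the evaluated values.
        policy = [max(range(NA), key=lambda a: sum(
            p[s][a][s2] * (r[s][a][s2] + GAMMA * v[s2]) for s2 in range(NS)))
                  for s in range(NS)]
    return policy

pi_star = PolicyIteration()
print(pi_star)   # states 0 and 1 should prefer 'step right' (action 1)
```

On FrozenLake itself, NS = 16 and NA = 4, and the success-rate plot would call a TestPolicy-style evaluator after each improvement step.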
6. Write a function ValueIter that returns a deterministic policy learned through value
iteration over 50 iterations. Plot the average rate of success of the learned policy at
every iteration.
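ValueIter can use the same kind of p[s][a][s'] / r[s][a][s'] tables; the hypothetical 3-state "step right" model below only serves to exercise the code.

```python
# Hypothetical 3-state model: action 0 stays, action 1 steps right,
# entering the last state pays reward 1.
GAMMA, NS, NA = 0.99, 3, 2
p = [[[0.0] * NS for _ in range(NA)] for _ in range(NS)]
r = [[[0.0] * NS for _ in range(NA)] for _ in range(NS)]
for s in range(NS):
    p[s][0][s] = 1.0
    s2 = min(s + 1, NS - 1)
    p[s][1][s2] = 1.0
    if s2 == NS - 1 and s != s2:
        r[s][1][s2] = 1.0

def q_value(v, s, a):
    return sum(p[s][a][s2] * (r[s][a][s2] + GAMMA * v[s2]) for s2 in range(NS))

def ValueIter(iters=50):
    v = [0.0] * NS
    for _ in range(iters):
        # Bellman optimality backup: take the max over actions.
        v = [max(q_value(v, s, a) for a in range(NA)) for s in range(NS)]
    # Extract a deterministic greedy policy from the final values.
    return [max(range(NA), key=lambda a: q_value(v, s, a)) for s in range(NS)]

vi_policy = ValueIter()
print(vi_policy)   # states 0 and 1 should prefer 'step right' (action 1)
```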
Question 2 – Model-free methods
1. Solve the environment using Q-learning over 5000 episodes. For exploration during
training, take random actions with probability 1 − e/5000, where e is the number of the
current episode. Plot the success rate of the learned policy at intervals of 100
episodes.
(a) Train the policy using the following learning rates with γ = 0.99. Report what you
observe.
α ∈ {0.05, 0.1, 0.25, 0.5}
(b) Train the policy using the following discount factors with α = 0.05. Report what
you observe.
γ ∈ {0.9, 0.95, 0.99}
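The Q-learning loop with the linearly annealed exploration above might be sketched as follows. A hypothetical deterministic 4-state chain again stands in for FrozenLake-v0 so the snippet is self-contained, and α = 0.1 (one of the listed values) is an arbitrary choice here.

```python
import random

# Tabular Q-learning with linearly annealed exploration: random action
# with probability 1 - e/episodes at episode e. ChainEnv is a made-up
# stand-in for FrozenLake-v0 (action 1 right, action 0 left, state 3 wins).
class ChainEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 3) if a == 1 else max(self.s - 1, 0)
        done = self.s == 3
        return self.s, float(done), done, {}

def q_learning(env, nS=4, nA=2, episodes=5000, alpha=0.1, gamma=0.99):
    Q = [[0.0] * nA for _ in range(nS)]
    for e in range(1, episodes + 1):
        eps = 1.0 - e / episodes          # linearly annealed exploration
        s, done, steps = env.reset(), False, 0
        while not done and steps < 100:
            a = random.randrange(nA) if random.random() < eps else \
                max(range(nA), key=lambda a_: Q[s][a_])
            s2, r, done, _ = env.step(a)
            # Standard one-step Q-learning update.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
            steps += 1
    # Return the greedy policy learned from Q.
    return [max(range(nA), key=lambda a_: Q[s][a_]) for s in range(nS)]

random.seed(0)
policy = q_learning(ChainEnv())
print(policy)   # states 0-2 should choose 'right' (action 1)
```

For the success-rate plot, one would evaluate the greedy policy (e.g. with TestPolicy) every 100 episodes instead of only at the end.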
2. In the previous question, the exploration was linearly annealed. Solve the environment
using Q-learning with a different exploration strategy of your own design. Find a suitable
α and γ for your method. Report your strategy and training results.
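One commonly used alternative (an illustration, not the required answer) is exponentially decaying ε-greedy exploration with a floor, so some exploration always remains. The constants below (decay 0.999, floor 0.01) are arbitrary choices for illustration.

```python
# Two exploration schedules compared: the linear annealing from Question 2.1
# and a hypothetical exponential decay with a minimum exploration floor.
def eps_linear(e, episodes=5000):
    return 1.0 - e / episodes

def eps_exponential(e, eps0=1.0, decay=0.999, eps_min=0.01):
    # eps0 * decay**e, floored at eps_min so exploration never fully stops.
    return max(eps_min, eps0 * decay ** e)

# Mid-training, the exponential schedule is already mostly greedy while
# the linear one still explores half the time.
print(eps_linear(2500), eps_exponential(2500))
```

Either schedule plugs into the same Q-learning loop; only the ε used when choosing between a random and a greedy action changes.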