SEMTM0016 Artificial Intelligence for Robotics
SEMTM0016 Coursework - Part B
Task Overview
You are the mighty HeroBot traversing the MazeDungeon where you will encounter many different entities.
You can find the MazeDungeon environment repository via this link: https://github.com/uobcll/SEMTM0016_DungeonMazeWorld
• You can follow the README of the repository for the basic components of the MazeDungeon environment.
• manual_control.py: The code in this file shows how to load the dungeon maze environment, how to reset it, and how to check the state, action, and reward from the environment. A minimal interaction sketch is given after this list.
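The snippet below is a minimal interaction sketch assuming a Gymnasium-style API; the import path, the class name DungeonMazeEnv, and the constructor arguments are placeholders rather than the repository's actual names, so follow manual_control.py and the README for the real calls.

# Minimal interaction sketch, assuming a Gymnasium-style API.
# The import path, class name and arguments below are placeholders;
# see manual_control.py in the repository for the actual interface.
from dungeon_maze_world import DungeonMazeEnv  # hypothetical import path

env = DungeonMazeEnv(grid_size=6, max_steps=100)  # hypothetical arguments
obs, info = env.reset()                           # reset the environment
action = env.action_space.sample()                # sample a random action
obs, reward, terminated, truncated, info = env.step(action)
print(obs, reward, terminated, truncated)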
You have three tasks to complete:
Q1: Load the environment and implement a rollout function and some simple policies.
Q2: Implement the model-based methods: Policy iteration and value iteration.
Q3: Implement the model-free methods: Monte Carlo and temporal-difference learning.
Question 1 - Simulation in the environment and policies (4 marks)
(1.1) Implement a rollout function with which you can sample a complete trajectory in the environment for a given policy. The function takes in the environment and a policy, and returns the full rollout trajectory (e.g., a list of transition tuples where each tuple is (state, action, reward)). The maximum number of environment steps is set to 100 and the grid size to 6.
(1.2) Implement the random policy (i.e. a uniform distribution over the action space for every state) to obtain the action the agent will take, and sample a trajectory using the rollout function from part (1.1) and the random policy.
(1.3) Implement the all-forward policy: take only the move-forward action in every state, and sample a trajectory using the rollout function from part (1.1) and the all-forward policy.
(1.4) Implement a customized policy: any policy of your choice in this environment, and sample a trajectory using the rollout function and your customized policy. (An illustrative sketch of the rollout function and the simple policies follows this list.)
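As a starting point, one possible shape for the rollout function and the simple policies is sketched below. The Gymnasium-style reset/step signatures, the number of actions, and the index of the move-forward action are assumptions; check the repository and manual_control.py for the actual interface.

import numpy as np

def rollout(env, policy, max_steps=100, seed=None):
    """Sample one trajectory and return a list of (state, action, reward) tuples.

    `policy` maps a state/observation to an action; a Gymnasium-style
    reset/step interface is assumed here.
    """
    trajectory = []
    state, _ = env.reset(seed=seed)
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if terminated or truncated:
            break
    return trajectory

def random_policy(state, n_actions=3):
    # Uniform distribution over the action space for every state
    # (n_actions=3 is an assumption; read it from env.action_space instead).
    return np.random.randint(n_actions)

def all_forward_policy(state, forward_action=2):
    # Always take the move-forward action (the index 2 is an assumption).
    return forward_action

# Example usage (environment construction omitted; see manual_control.py):
# trajectory = rollout(env, random_policy, max_steps=100)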
Question 2 - Model-based Methods (6 marks)
Let us set the grid size to 8, the discount factor to 1, and the initial policy to the random policy. Implement the policy iteration and value iteration algorithms to obtain the optimal policy for the given environment. Specifically, you need to complete the following sub-tasks (an illustrative sketch of both algorithms is given after the list):
(2.1) Initialize the value table. Implement policy iteration's policy evaluation process, conducting repeated Bellman equation updates until the value table converges.
(2.2) Implement the policy improvement process of policy iteration.
(2.3) Iteratively apply the policy evaluation from (2.1) and the policy improvement from (2.2) until the optimal policy and value table converge.
(2.4) Initialize the value table. Implement the value iteration process until the optimal value table converges.
(2.5) Derive the optimal policy from the optimal value table obtained in (2.4).
(2.6) Analyse how these two algorithms perform differently (e.g., convergence speed, numerical stability, sensitivity to the initial value table, and how the policy changes).
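A minimal sketch of one possible implementation of both algorithms is given below. It assumes you have first built a tabular model of the environment in the classic toy-text format P[s][a] = [(prob, next_state, reward, done), ...]; the MazeDungeon repository is not guaranteed to expose such a model directly, so treat the model format and function names as assumptions.

def policy_evaluation(P, policy, V, gamma=1.0, theta=1e-8):
    # Repeated Bellman expectation updates until the value table converges.
    while True:
        delta = 0.0
        for s in P:
            v = sum(policy[s][a] * sum(p * (r + gamma * (0.0 if done else V[ns]))
                                       for p, ns, r, done in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(P, V, gamma=1.0):
    # Greedy (deterministic) policy with respect to the current value table.
    policy = {}
    for s in P:
        q = {a: sum(p * (r + gamma * (0.0 if done else V[ns]))
                    for p, ns, r, done in P[s][a])
             for a in P[s]}
        best = max(q, key=q.get)
        policy[s] = {a: float(a == best) for a in P[s]}
    return policy

def policy_iteration(P, gamma=1.0):
    V = {s: 0.0 for s in P}
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}  # random policy
    while True:
        V = policy_evaluation(P, policy, V, gamma)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:      # policy is stable: stop
            return policy, V
        policy = new_policy

def value_iteration(P, gamma=1.0, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * (0.0 if done else V[ns]))
                        for p, ns, r, done in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            # Extract the greedy policy from the converged value table (sub-task 2.5).
            return policy_improvement(P, V, gamma), V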
Question 3 - Model-free Methods (7 marks)
Let us set the grid size to 10, the discount factor to 0.99, and the initial policy to the random policy. The maximum number of environment steps is set to 100. Suppose we do not have access to an explicit environment model and can only sample trajectories from it. Implement the Monte Carlo and temporal-difference learning algorithms. Specifically, you need to complete the following sub-tasks (an illustrative sketch of both algorithms is given after the list).
(3.1) Given the rollout function output, implement a cumulative reward calculation function to calculate the cumulative reward for one sampled trajectory.
(3.2) Implement the Monte Carlo sampling algorithm: sample some trajectories using the initial random policy and use the trajectories to obtain the Monte Carlo value estimate for each grid cell. (Initialise each grid cell's value to 0.)
(3.3) Implement the temporal-difference learning algorithm: initialise the value table and sample some trajectories using the initial random policy, then use the trajectories to conduct temporal-difference learning until the value table converges. (Note that terminal-state values are 0.)
(3.4) Justify your design choices when setting the hyperparameters for these two algorithms (e.g., number of trajectory samples, TD learning rate, value table initialisation, etc.) and analyse how these two algorithms perform differently (e.g., convergence speed, numerical stability, and hyperparameter sensitivity).
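The sketch below shows one possible shape for the cumulative reward calculation, an every-visit Monte Carlo estimator, and TD(0) evaluation from stored trajectories. It assumes states are hashable (e.g. (row, column, direction) tuples) and that trajectories are lists of (state, action, reward) tuples as produced by the Question 1 rollout function; it also treats the state after a trajectory's final transition as terminal, which is only approximate if the episode was cut off at the step limit.

from collections import defaultdict

def cumulative_reward(trajectory, gamma=0.99):
    # Discounted cumulative reward of one trajectory of (state, action, reward) tuples.
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))

def monte_carlo_evaluation(trajectories, gamma=0.99):
    # Every-visit Monte Carlo value estimate; all values start at 0.
    V = defaultdict(float)
    visit_counts = defaultdict(int)
    for trajectory in trajectories:
        G = 0.0
        for state, _, reward in reversed(trajectory):
            G = reward + gamma * G
            visit_counts[state] += 1
            # Incremental mean of the returns observed from this state.
            V[state] += (G - V[state]) / visit_counts[state]
    return V

def td_zero_evaluation(trajectories, gamma=0.99, alpha=0.1, n_sweeps=50):
    # TD(0) evaluation; values start at 0 and terminal states stay at 0.
    V = defaultdict(float)
    for _ in range(n_sweeps):
        for trajectory in trajectories:
            for t, (state, _, reward) in enumerate(trajectory):
                # Bootstrap from the next state in the trajectory, or 0 at the end.
                next_value = V[trajectory[t + 1][0]] if t + 1 < len(trajectory) else 0.0
                V[state] += alpha * (reward + gamma * next_value - V[state])
    return V

# Example usage (rollout and random_policy as in Question 1):
# trajectories = [rollout(env, random_policy, max_steps=100) for _ in range(500)]
# V_mc = monte_carlo_evaluation(trajectories)
# V_td = td_zero_evaluation(trajectories)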
Report
There are 3 marks for overall presentation of the report.
Your report should be no longer than six pages; shorter is fine. Use an 11 or 12 pt font and do not try tricks like shrinking the margins to fit in more text; shorter is better than longer.
Your report must be submitted as a PDF and should be prepared in LaTeX (Overleaf is a good approach), MS Word, or a similar text editor, then exported as a PDF document.
Your code will not be marked for elegance, but it should run correctly; it is expected you will use Python. Do not include screenshots of graphs; they should be imported directly. Resize figures to the correct size before importing them; if the labels are tiny, the graphs will not be marked. Make sure figure captions are descriptive: it is better to have some overlap between figure captions and the main text than to have figure captions that are not reasonably self-contained.
Avoid code snippets in the report unless that feels like the best way to illustrate some subtle aspect of an algorithm; always consider a mathematical description instead if possible. You will be asked to submit your code, and it will be tested to make sure it works and matches your report. It will not, however, itself be marked for quality.
The teaching assistants (TAs) are unable to answer questions about how to solve an exercise or what methods to use beyond what has been specified in the coursework document. However, if you need to know more about a method from a particular lab/worksheet in order to solve an exercise, do ask the TAs for help with that method.