
COMP9414 Artificial Intelligence

Assignment 2: Reinforcement Learning

Term 2, 2025

Due: Week 9, Friday, 1 August 2025, 5:00 PM AEST

Worth: 25 marks (25% of final grade)

Submission: Electronic submission via Moodle

1 Problem Overview

In this assignment, you will implement and compare two reinforcement learning algorithms: Q-learning and SARSA, within a static grid world environment.

The grid world consists of an 11 × 11 grid in which the agent must navigate from a random starting position to a designated goal while avoiding fixed obstacles arranged in two distinct patterns.

You will first develop the Q-learning and SARSA algorithms, implementing epsilon-greedy action selection to balance exploration and exploitation. To ensure a fair comparison between the algorithms, you must use identical hyperparameters across all experiments.

After training the baseline agents, you will simulate interactive reinforcement learning (IntRL) by introducing a teacher-student framework. In this setup, a pre-trained agent (the teacher) provides advice to a new agent (the student) during its training. Each algorithm will teach its own type: Q-learning teachers will guide Q-learning students, and SARSA teachers will guide SARSA students.

The teacher’s advice will be configurable in terms of its availability (probability of offering advice) and accuracy (probability that the advice is correct).

You will evaluate the impact of teacher feedback on the student’s learning performance by running experiments with varying levels of availability and accuracy. By maintaining consistent hyperparameters across all four tasks, you can meaningfully compare how each algorithm performs with and without teacher guidance.

The goal is to understand how teacher interaction influences the learning efficiency of each algorithm and to determine which algorithm benefits more from teacher guidance under identical conditions.

2 Environment

The environment is provided in the env.py file. This section provides a brief overview of the grid world environment you will be working with.

For detailed information about the environment, including setup instructions, key functions, agent movement and actions, movement constraints, and the reward structure, please refer to the Environment User Guide provided separately as a PDF file. Ensure you familiarise yourself with the environment before proceeding with the assignment.

2.1 Environment Specifications

The environment has the following specifications:

Grid Size: 11 × 11

Obstacles: 10 cells arranged in two patterns:

◦  L-shaped pattern:  (2,2), (2,3), (2,4), (3,2), (4,2)

◦  Cross pattern:  (5,4), (5,5), (5,6), (4,5), (6,5)

Goal Position: (10, 10)

Rewards:

◦  Reaching the goal: +25

◦ Hitting an obstacle: -10

◦  Each step: -1

Figure 1: The 11 × 11 grid world environment with visual elements. The agent must navigate from its starting position to the goal whilst avoiding the L-shaped and cross-shaped obstacle patterns.

Environment Elements:

–  Agent: grey robot that navigates the grid world

–  Goal: red flag at position (10, 10); reaching it rewards +25 points

–  Obstacles: construction barriers; hitting one incurs a penalty of -10 points

Status Bar Information:

-  Episode number

-  Teacher availability (%)

-  Teacher accuracy (%)

3 Hyperparameter Guidelines

For this assignment, you should use the following baseline parameters:

Learning rate (α): 0.1 to 0.5

Discount factor (γ): 0.9 to 0.99

Epsilon:

For exploration strategies with decay: Initial 0.8 to 1.0, Final 0.01 to 0.1

Decay strategy: Can use linear decay, exponential decay, or other decay strategies

For fixed epsilon (no decay): 0.1 to 0.3

Number of episodes:  300 to 1000

Maximum steps per episode: 50 to 100

You may experiment within these ranges, but must:

–  Use identical parameters across all four tasks (Tasks 1–4) to ensure fair comparison between Q-learning and SARSA, both with and without teacher guidance

–  Document your final chosen values and provide brief justification

–  For Tasks  3 and 4  (teacher  experiments), you may use fewer episodes to reduce computational time while maintaining meaningful results.
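As an illustration only, the chosen values can be gathered into one configuration that every task reuses, which makes the identical-parameters requirement easy to enforce. The numbers below are placeholders inside the permitted ranges, not recommended settings:

# Shared configuration reused verbatim by Tasks 1-4.
# Example values within the permitted ranges; choose and justify your own.
HYPERPARAMS = {
    "alpha": 0.2,            # learning rate (0.1 to 0.5)
    "gamma": 0.95,           # discount factor (0.9 to 0.99)
    "epsilon_start": 1.0,    # initial exploration rate (decay strategy)
    "epsilon_end": 0.05,     # final exploration rate
    "epsilon_decay": 0.995,  # per-episode multiplicative decay
    "n_episodes": 500,       # 300 to 1000
    "max_steps": 100,        # 50 to 100
}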

4 Task 1: Implement Q-learning

In this task, you will implement the Q-learning algorithm and train an agent in the provided environment.

Implementation Requirements

Your Q-learning implementation should:

–  Train the agent for the specified number of episodes using the hyperparameters from Section 3

–  Use epsilon-greedy action selection for exploration

–  Update Q-values according to the Q-learning update rule
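A minimal sketch of the two core pieces, epsilon-greedy selection and the Q-learning update, is given below. It assumes a tabular Q-table of shape (11, 11, 4) indexed by the agent's (row, column) position and an integer action; the action count of four and the Q-table layout are assumptions, so adapt them to the actual interface described in the Environment User Guide.

import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4  # assumed action space: up, down, left, right

def epsilon_greedy(q_table, state, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_table[state]))

def q_learning_update(q_table, s, a, r, s_next, alpha, gamma):
    """Off-policy TD update: bootstrap from the best action in the next state."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s][a] += alpha * (td_target - q_table[s][a])

q_table = np.zeros((11, 11, N_ACTIONS))  # one row of Q-values per grid cell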

Metrics to Track

During training, track these metrics for each episode:

Total Rewards per Episode:  The  cumulative reward accumulated during the episode

Steps per Episode: The number of steps taken to complete the episode

Successful Episodes: Whether the agent reached the goal

Required Outputs

After training is complete, you must produce the following:

–  Generate a plot that displays the episode rewards over time. The plot should include:

◦  Raw episode rewards (with transparency to show variance)

◦  A moving average line (e.g., 50-episode window) for smoothing

◦ A horizontal line at y=0 to indicate the transition between positive and negative rewards

◦ Appropriate labels, title, and legend

Figure 2 shows a sample of what your Q-learning performance plot should look like (generated with simulated data).

–  Calculate and report the Success Rate, Average Reward per Episode, and Average Learning Speed, using the following formulas:

Success Rate:

\text{Success Rate} = \frac{\text{Number of Successful Episodes}}{N} \times 100\% \qquad (1)

where N is the total number of episodes.

Average Reward per Episode:

\text{Average Reward} = \frac{1}{N} \sum_{i=1}^{N} R_i \qquad (2)

where R_i is the total reward in the i-th episode.

Average Learning Speed:

\text{Average Learning Speed} = \frac{1}{N} \sum_{i=1}^{N} S_i \qquad (3)

where S_i is the number of steps taken in the i-th episode.

– Keep track of the following outputs:

◦  The three calculated metrics: average reward, success rate, and average learning speed

◦ The trained Q-table, as it will be used as the teacher in Task 3

Figure 2:  Sample  Q-learning performance plot showing episode rewards and 50-episode moving average. This is generated with simulated data for demonstration purposes only.
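The three summary metrics and the rewards plot can be produced directly from the per-episode records. The sketch below assumes episode_rewards, episode_steps, and episode_success are Python lists of length N collected during training (the names are illustrative); the plot follows the requirements listed above.

import numpy as np
import matplotlib.pyplot as plt

def summarise(episode_rewards, episode_steps, episode_success):
    """Success rate, average reward, and average learning speed (Equations 1-3)."""
    n = len(episode_rewards)
    success_rate = 100.0 * sum(episode_success) / n
    avg_reward = sum(episode_rewards) / n
    avg_learning_speed = sum(episode_steps) / n
    return success_rate, avg_reward, avg_learning_speed

def plot_rewards(episode_rewards, window=50, title="Q-learning: episode rewards"):
    """Raw rewards (transparent), moving average, and a y = 0 reference line."""
    rewards = np.asarray(episode_rewards, dtype=float)
    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.3, label="Episode reward")
    if len(rewards) >= window:
        moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(np.arange(window - 1, len(rewards)), moving_avg,
                 label=f"{window}-episode moving average")
    plt.axhline(0, color="black", linewidth=0.8, label="Zero reward")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.title(title)
    plt.legend()
    plt.show()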

5 Task 2: Implement SARSA

Important: You must use the same hyperparameters  (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison between Q-learning and SARSA.

In this task, you will implement the SARSA algorithm and train an agent in the provided environment.

Implementation Requirements

Your SARSA implementation should:

–  Train the agent for the specified number of episodes using the same hyperparameters as Task 1

–  Update Q-values according to the SARSA update rule
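The only change from Task 1 is the update rule: SARSA is on-policy, so it bootstraps from the action actually chosen (epsilon-greedily) in the next state rather than the maximising one. A minimal sketch, reusing the tabular Q-table layout assumed in the Task 1 sketch:

def sarsa_update(q_table, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: bootstrap from the action actually taken next."""
    td_target = r + gamma * q_table[s_next][a_next]
    q_table[s][a] += alpha * (td_target - q_table[s][a])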

Metrics to Track

During training, track these metrics for each episode:

Total Rewards per Episode:  The  cumulative reward accumulated during the episode

Steps per Episode: The number of steps taken to complete the episode

Successful Episodes: Whether the agent reached the goal

Required Outputs

After training is complete, you must produce the following:

–  Generate a plot that displays the episode rewards over time. The plot should include:

◦  Raw episode rewards (with transparency to show variance)

◦  A moving average line (e.g., 50-episode window) for smoothing

◦ A horizontal line at y=0 to indicate the transition between positive and negative rewards

◦ Appropriate labels, title, and legend

Figure 3 shows a sample of what your SARSA performance plot should look like (generated with simulated data).

–  Calculate and report the Success Rate, Average Reward per Episode, and Average Learning Speed using the same formulas as in Task 1 (Equations 1, 2, and 3).

– Keep track of the following outputs:

◦  The three calculated metrics: average reward, success rate, and average learning speed

◦ The trained Q-table, as it will be used as the teacher in Task 4

Figure 3: Sample SARSA performance plot showing episode rewards and 50-episode moving average. This is generated with simulated data for demonstration purposes only.

6 Baseline Comparison

After completing Tasks 1 and 2, you should compare the baseline performance of Q-learning and SARSA. This comparison will help you understand the fundamental differences between the two algorithms before introducing teacher guidance.

Creating the Comparison

Generate comparison visualisations that include:

Learning Progress Comparison

◦ Episode rewards for both Q-learning and SARSA (with transparency)

◦  50-episode moving averages for both algorithms

◦ The y=0 reference line

◦ Average reward values for each algorithm

Success Rate Comparison

◦  Rolling success rates (50-episode window) for both algorithms

◦  Overall success rates for each algorithm

◦  Success rate ranging from 0 to 100%

Figure 4 shows a sample baseline comparison plot (generated with simulated data).

Figure 4: Sample baseline comparison showing Q-learning vs SARSA performance. Left: Episode rewards with moving averages. Right: Success rate over time. This is generated with simulated data for demonstration purposes only.
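The rolling success rate in the right-hand panel can be computed with the same sliding-window idea as the reward moving average. A minimal sketch, assuming episode_success is a list of 0/1 flags per episode:

import numpy as np

def rolling_success_rate(episode_success, window=50):
    """Percentage of successful episodes within a sliding window."""
    flags = np.asarray(episode_success, dtype=float)
    kernel = np.ones(window) / window
    return 100.0 * np.convolve(flags, kernel, mode="valid")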

7 Teacher Feedback Mechanism

A teacher feedback system is a valuable addition to the training process of agents using Q-learning or SARSA. In this system, a pre-trained agent (the teacher) assists a new agent by offering advice during training.  The advice provided by the teacher is based on two key probabilities:

– The availability factor determines whether advice is offered by the teacher at any step.

– The accuracy factor dictates whether the advice given is correct or incorrect.

7.1 How It Works

At each step, the system first determines whether the teacher provides advice (based on availability).  If advice is given, it then determines whether the advice is correct  (based on accuracy). These two checks ensure that advice is provided probabilistically and may not always be accurate.

The agent responds to the teacher’s advice as follows:

•  If the generated advice is correct (given the accuracy parameter), the agent follows the teacher’s recommended action (the action with highest Q-value in the teacher’s Q-table).

• If the generated advice is incorrect, the agent takes a random action, excluding the teacher’s best action.

• If no advice is given, the agent continues its independent learning using its exploration strategy (epsilon-greedy).

Figure 5 illustrates the complete decision process for the teacher feedback mechanism.

Figure 5: Flowchart showing the teacher feedback mechanism.  The student agent’s action selection depends on two probability checks:  availability  (whether the teacher provides advice) and accuracy (whether the advice is correct).
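A minimal sketch of the decision process in Figure 5 is shown below. It assumes the teacher is represented by its trained Q-table, that actions are the integers 0 to 3, and that rng and epsilon_greedy are the generator and helper from the Task 1 sketch; these names are illustrative rather than part of the provided environment.

import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4

def choose_student_action(teacher_q, student_q, state, epsilon, availability, accuracy):
    """Student action selection under the teacher feedback mechanism."""
    if rng.random() < availability:               # does the teacher give advice?
        best = int(np.argmax(teacher_q[state]))   # teacher's recommended action
        if rng.random() < accuracy:               # is the advice correct?
            return best
        # incorrect advice: a random action excluding the teacher's best action
        return int(rng.choice([a for a in range(N_ACTIONS) if a != best]))
    # no advice: the student explores on its own (epsilon-greedy)
    return epsilon_greedy(student_q, state, epsilon)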

8 Task 3: Teacher Advice Using Q-learning Agent

Important: You must use the same hyperparameters  (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison across all tasks.

In this task, you will implement the teacher-student framework where a pre-trained Q-learning agent (from Task 1) acts as a teacher to guide a new Q-learning student agent.

Implementation Requirements

Your implementation should:

–  Load the trained Q-table from Task 1 to use as the teacher

–  Train a new Q-learning student agent with teacher guidance

–  Implement the teacher feedback mechanism as described in Section 7

–  Test all combinations of teacher availability and accuracy parameters

Parameter Combinations

You must evaluate the following parameter combinations using nested loops:

Availability:  [0.1, 0.3, 0.5, 0.7, 1.0]

Accuracy:  [0.1, 0.3, 0.5, 0.7, 1.0]

This results in 25 different teacher configurations to test.
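The experiment loop is a pair of nested loops over the two parameter lists. The sketch below assumes a hypothetical train_student(...) routine that runs one full training with the given teacher configuration and returns the per-episode records, and reuses the summarise helper from the Task 1 sketch; both are placeholders for your own code.

availabilities = [0.1, 0.3, 0.5, 0.7, 1.0]
accuracies = [0.1, 0.3, 0.5, 0.7, 1.0]

results = []  # one entry per teacher configuration (25 in total)
for availability in availabilities:
    for accuracy in accuracies:
        # train_student is a placeholder for your own training routine
        rewards, steps, success = train_student(teacher_q, availability, accuracy)
        success_rate, avg_reward, avg_speed = summarise(rewards, steps, success)
        results.append({
            "Availability": availability,
            "Accuracy": accuracy,
            "Avg Reward": avg_reward,
            "Success Rate (%)": success_rate,
            "Avg Learning Speed": avg_speed,
        })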

Metrics to Track

For each teacher configuration, track these metrics during training:

Total Rewards per Episode:  The cumulative reward accumulated during each episode

Steps per Episode: The number of steps taken to complete each episode

Successful Episodes: Whether the agent reached the goal

Required Outputs

After training with all parameter combinations, you must:

–  Calculate performance metrics for each configuration:

Success Rate using Equation 1

Average Reward per Episode using Equation 2

Average Learning Speed using Equation 3

–  Store all results in a structured format with the following data:

◦ Availability

◦ Accuracy

◦ Avg Reward

◦  Success Rate (%)

◦ Avg Learning Speed

–  Generate a heatmap visualisation showing average rewards for all teacher configurations:

◦ X-axis: Availability values

◦ Y-axis: Accuracy values

◦  Colour intensity: Average reward achieved

◦ Include appropriate colour bar and labels

Figure 6 shows a sample of what your teacher performance heatmap should look like (generated with simulated data).

Figure 6: Sample heatmap showing Q-learning performance with different teacher configurations. Note that accuracy increases from bottom to top. This is generated with simulated data for demonstration purposes only.
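The heatmap itself can be built by arranging the stored average rewards into a 5 × 5 grid. A minimal matplotlib sketch, assuming results is the list of dictionaries produced by the loop sketched earlier in this task:

import numpy as np
import matplotlib.pyplot as plt

def plot_teacher_heatmap(results, availabilities, accuracies, title):
    """Average reward for each (accuracy, availability) teacher configuration."""
    grid = np.zeros((len(accuracies), len(availabilities)))
    for row in results:
        i = accuracies.index(row["Accuracy"])
        j = availabilities.index(row["Availability"])
        grid[i, j] = row["Avg Reward"]
    plt.figure(figsize=(6, 5))
    # origin="lower" places low accuracy at the bottom, as in Figure 6
    im = plt.imshow(grid, origin="lower", aspect="auto", cmap="viridis")
    plt.colorbar(im, label="Average reward")
    plt.xticks(range(len(availabilities)), availabilities)
    plt.yticks(range(len(accuracies)), accuracies)
    plt.xlabel("Teacher availability")
    plt.ylabel("Teacher accuracy")
    plt.title(title)
    plt.show()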

9 Task 4: Teacher Advice Using SARSA Agent

Important: You must use the same hyperparameters  (learning rate, discount factor, epsilon, episodes, and maximum steps) that you chose in Task 1 to ensure fair comparison across all tasks.

In this task, you will implement the teacher-student framework where a pre-trained SARSA agent (from Task 2) acts as a teacher to guide a new SARSA student agent.

Implementation Requirements

Your implementation should:

–  Load the trained Q-table from Task 2 to use as the teacher

–  Train a new SARSA student agent with teacher guidance

–  Implement the teacher feedback mechanism as described in Section 7

–  Test all combinations of teacher availability and accuracy parameters

–  Use the same implementation structure as Task 3 for consistency

Parameter Combinations

You must evaluate the following parameter combinations using nested loops:

Availability:  [0.1, 0.3, 0.5, 0.7, 1.0]

Accuracy:  [0.1, 0.3, 0.5, 0.7, 1.0]

This results in 25 different teacher configurations to test.

Metrics to Track

For each teacher configuration, track these metrics during training:

Total Rewards per Episode:  The cumulative reward accumulated during each episode

Steps per Episode: The number of steps taken to complete each episode

Successful Episodes: Whether the agent reached the goal

Required Outputs

After training with all parameter combinations, you must:

–  Calculate performance metrics for each configuration:

Success Rate using Equation 1

Average Reward per Episode using Equation 2

Average Learning Speed using Equation 3

–  Store all results in a structured format with the following data:

◦ Availability

◦ Accuracy

◦ Avg Reward

◦  Success Rate (%)

◦ Avg Learning Speed

–  Generate a heatmap visualisation showing average rewards for all teacher configurations:

◦ X-axis: Availability values

◦ Y-axis: Accuracy values

◦  Colour intensity: Average reward achieved

◦ Include appropriate colour bar and labels

This will allow direct comparison with the Q-learning teacher results from Task 3 to determine which algorithm provides better teaching capabilities.

10 Testing and Discussing Your Code

After completing all four tasks, you should analyse and compare your results to understand the impact of teacher guidance on reinforcement learning performance.

10.1 Required Analysis

You must perform the following analysis to demonstrate your understanding:

1. Teacher Impact on Learning Curves

For selected teacher availability levels (e.g., 0.1, 0.5, 1.0), create plots showing how different teacher accuracies affect learning progress compared to the baseline. Each plot should show:

◦  Episode rewards (50-episode moving average) on the y-axis

◦  Episodes on the x-axis

◦ Multiple lines for different accuracy levels

◦  Baseline performance for reference

Figure 7 shows a sample comparison for Q-learning with 50% teacher availability.

Figure 7: Sample comparison of teacher accuracy impact on Q-learning performance with 50% availability. Generated with simulated data.

2.  Teacher Effectiveness Summary

Create a comprehensive analysis comparing how teacher guidance affects both algorithms:

◦  Generate learning curves showing Q-learning and SARSA performance with selected combinations of teacher availability and accuracy values

◦  Include baseline performance (no teacher) as a reference line

◦  Show multiple combinations of teacher availability (e.g., 0.1, 0.5, 1.0) and accuracy (e.g., 0.3, 0.7, 1.0) to demonstrate the range of teacher impact

◦  Use moving averages to smooth the learning curves for clarity

Figure 8 shows a sample teacher effectiveness summary analysis.

Figure 8:  Sample teacher effectiveness summary showing the impact of different teacher configurations on both Q-learning and SARSA algorithms.  This analysis helps identify optimal teacher parameters and compare algorithm responsiveness to guidance.

