
CSE 4/546: Reinforcement Learning

Spring 2025

Assignment 3 - Actor-Critic

Checkpoint: November 18, Tue, 11:59 pm

Due Date: November 25, Tue, 11:59 pm

1 Assignment Overview

The goal of this assignment is to help you understand policy gradient algorithms, implement the actor-critic algorithm, and apply it to solve Gymnasium environments. We will train our networks on multiple RL environments.

For this assignment, libraries with built-in RL methods cannot be used. Submissions that use built-in RL methods (e.g., Stable-Baselines, Keras-RL, TF-Agents, etc.) will not be evaluated.

Part I [20 points] – Introduction to Actor-Critic

The goal of this part is to help you build a deeper understanding of the structure and mechanics behind Actor-Critic algorithms. You will explore how neural networks are adapted for different environment setups, how Actor and Critic components are architected, and how common practices like normalization and gradient clipping can stabilize training. This section will also prepare you to design more generalizable and reusable RL agents that can adapt to various environments.

Refer to the Jupyter notebook for more detailed instructions: UBlearns > Assignments > Assignment 3 > a3_part_1_UBIT1_UBIT2.ipynb

You are required to complete the following four tasks. Each task should be fully implemented in your notebook and supported with comments and brief explanations where appropriate.

TASKS

1. [5 points] Implement Actor-Critic Neural Networks

Create two versions of the actor-critic architecture:

(a) Completely separate actor and critic networks with no shared layers.

(b) A shared network with two output heads: one for the actor and one for the critic.

Discuss the motivation behind each setup and when it may be preferred in practice.
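For reference, here is a minimal PyTorch sketch of the two setups, assuming a flat observation vector of size obs_dim and a discrete action space with n_actions (both names are placeholders); your own architectures may differ:

import torch.nn as nn

class SeparateActorCritic:
    """Setup (a): two fully independent networks with no shared parameters."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))          # action logits
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                  # state value V(s)

class SharedActorCritic(nn.Module):
    """Setup (b): one shared trunk with separate actor and critic output heads."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor_head = nn.Linear(hidden, n_actions)   # action logits
        self.critic_head = nn.Linear(hidden, 1)          # state value V(s)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.actor_head(features), self.critic_head(features)

Separate networks keep the policy and value gradients from interfering with each other, while a shared trunk reuses features and is cheaper, which is especially common with image inputs; your discussion should cover trade-offs like these.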

2. [5 points] Auto-generate Actor-Critic Networks Based on Environment Spaces

Implement a general-purpose function create_shared_network(env) that dynamically builds a shared actor-critic network for any Gymnasium environment. Your function should handle variations in:

• Observation spaces: e.g., discrete, box (vector), image-based.

• Action spaces: discrete, continuous (Box), and multi-discrete.

Test your implementation on the following environments:

• CliffWalking-v0 (Observation: integer → one-hot encoded)

• LunarLander-v3

• PongNoFrameskip-v4 (use gym wrappers for preprocessing)

• HalfCheetah-v5 (MuJoCo)

Your code will be further tested with additional environments by the course staff. The design must be generalizable.
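One possible structure for create_shared_network(env) is sketched below: inspect env.observation_space and env.action_space and build a matching encoder plus actor/critic heads. This is only a sketch under simplifying assumptions (channel-first image observations, a Gaussian policy parameterized by mean and log-std for Box actions, one-hot encoding of Discrete observations done outside the network), not a complete solution:

import numpy as np
import torch.nn as nn
from gymnasium import spaces

def create_shared_network(env, hidden=256):
    """Sketch: build shared actor-critic modules from a Gymnasium env's spaces."""
    obs_space, act_space = env.observation_space, env.action_space

    # --- Observation encoder ---
    if isinstance(obs_space, spaces.Discrete):
        encoder = nn.Sequential(nn.Linear(obs_space.n, hidden), nn.ReLU())   # expects one-hot input
    elif isinstance(obs_space, spaces.Box) and len(obs_space.shape) == 3:
        c = obs_space.shape[0]            # assumes channel-first images (permute in a wrapper if needed)
        encoder = nn.Sequential(          # small Nature-CNN-style encoder
            nn.Conv2d(c, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU())
    elif isinstance(obs_space, spaces.Box):
        encoder = nn.Sequential(nn.Linear(int(np.prod(obs_space.shape)), hidden), nn.ReLU())
    else:
        raise NotImplementedError(f"Unsupported observation space: {type(obs_space)}")

    # --- Actor head, depending on the action space ---
    if isinstance(act_space, spaces.Discrete):
        actor_head = nn.Linear(hidden, act_space.n)                         # logits
    elif isinstance(act_space, spaces.MultiDiscrete):
        actor_head = nn.Linear(hidden, int(sum(act_space.nvec)))            # concatenated per-dimension logits
    elif isinstance(act_space, spaces.Box):
        actor_head = nn.Linear(hidden, 2 * int(np.prod(act_space.shape)))   # mean and log-std
    else:
        raise NotImplementedError(f"Unsupported action space: {type(act_space)}")

    critic_head = nn.Linear(hidden, 1)                                      # state value V(s)
    return nn.ModuleDict({"encoder": encoder, "actor": actor_head, "critic": critic_head})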

3. [5 points] Normalize Observations for Bounded Environments

Write a function normalize_observation(obs, env) that normalizes the input observation to a [−1, 1] range using the environment’s observation_space.low and high values. This normalization should be applied only for environments with Box observation spaces and defined bounds. Test this function using:

• LunarLander-v3

• PongNoFrameskip-v4
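A possible sketch of normalize_observation, which rescales only Box observations with finite bounds and returns everything else unchanged:

import numpy as np
from gymnasium import spaces

def normalize_observation(obs, env):
    """Scale a Box observation to [-1, 1] using the space's low/high bounds."""
    space = env.observation_space
    if not isinstance(space, spaces.Box):
        return obs                                        # only Box spaces are normalized
    low, high = space.low, space.high
    if not (np.all(np.isfinite(low)) and np.all(np.isfinite(high))):
        return obs                                        # skip spaces with unbounded dimensions
    obs = np.asarray(obs, dtype=np.float32)
    return 2.0 * (obs - low) / (high - low + 1e-8) - 1.0  # epsilon guards against zero-width bounds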

4. [5 points] Implement Gradient Clipping

Demonstrate how to apply gradient clipping in your training loop using PyTorch’s torch.nn.utils.clip_grad_norm_. Print the gradient norm before and after clipping to validate that the operation was applied correctly. You may use any of the models built in earlier tasks.
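A minimal, self-contained sketch of the clipping step; the tiny model and the placeholder loss only stand in for whatever you built in the earlier tasks:

import torch
import torch.nn as nn

# Stand-in model, optimizer, and batch, just to demonstrate the clipping step.
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(32, 8)).pow(2).mean()     # placeholder loss
loss.backward()

# clip_grad_norm_ returns the total gradient norm computed BEFORE clipping.
pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

# Recompute the norm AFTER clipping to confirm it is now at most max_norm.
post_clip = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None))
print(f"grad norm before clipping: {pre_clip:.4f}, after clipping: {post_clip:.4f}")

optimizer.step()
optimizer.zero_grad()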

Submission Format:

• Submit your Jupyter Notebook as a3_part_1_UBIT1_UBIT2.ipynb.

• Ensure all cells are executed and saved with output before submission.

• Include inline comments and explanations within your code.

• No separate PDF report is required for this part.

Part II [50 points] - Implementing Advantage Actor Critic (A2C/A3C)

In this part, we will implement an Advantage Actor-Critic (A2C/A3C) algorithm and test it on any simple environment. A2C is a synchronous version of the A3C method.

1. Implement the A2C algorithm. You are welcome to implement either the A2C or the A3C version of the algorithm; any other variations will not be evaluated. You may use any framework (TensorFlow/PyTorch). Implement at least 2 actor-learner threads.

2. Train your implemented algorithm on any environment. Check the "Suggested environments" section.

3. Show and discuss your results after applying the A2C/A3C implementation on the environment. Plots should include the total reward per episode.

4. Provide the evaluation results. Run your environment for at least 10 episodes, where the agent chooses only greedy actions from the learnt policy. The plot should include the total reward per episode.
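A greedy evaluation loop could look roughly like the sketch below; it assumes a discrete-action environment and a policy that returns (action logits, state value), and both env and policy are placeholders for your own objects:

import torch

def evaluate_greedy(env, policy, n_episodes=10):
    """Run n_episodes with greedy (argmax) actions and return the total reward per episode."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            with torch.no_grad():
                logits, _value = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = int(torch.argmax(logits))            # greedy action, no sampling
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        returns.append(total_reward)
    return returns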

Important Implementation Note: Multi-threaded Actor-Learner Setup

You are required to implement a multi-threaded version of the A2C or A3C algorithm that uses at least 2 actor-learner threads. These threads should operate independently but either synchronize periodically (A2C) or update asynchronously (A3C) using a shared model.

• Each thread must maintain its own environment instance.

• Each thread should collect experience independently (e.g., observations, actions, rewards).

• Actor and Critic gradients must either be:

– Computed and applied synchronously across threads (as in A2C), or

– Computed and applied asynchronously by each thread to a shared model (as in A3C).

For A2C (Synchronous Version):

• Threads run in parallel to collect experience.

• After a fixed number of steps or episodes, threads synchronize and share their gradients.

• A global model is updated using the average of these gradients.

For A3C (Asynchronous Version):

• Each thread maintains a local copy of the global model.

• Threads collect experience and compute gradients independently.

• The global model is updated asynchronously with gradients from each thread.

• Local models are periodically synced with the updated global model.

Implementation Tips:

• Use Python’s multiprocessing or torch.multiprocessing libraries to spawn processes (Gym environments are not thread-safe).

• Ensure the global model is safely shared and updated using shared memory or locks.

• Start with 2 threads for simplicity and debugging, then scale up if desired.

• Consider printing from each thread to confirm parallel execution and model updates.
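A minimal sketch of this process layout using torch.multiprocessing is shown below (A3C-style asynchronous updates). The tiny network and the one-step dummy loss are placeholders; in your implementation each worker would collect a rollout and compute the actual actor, critic, and entropy loss terms before pushing gradients to the shared global model:

import gymnasium as gym
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class TinyActorCritic(nn.Module):
    """Placeholder network, only to illustrate the shared-model mechanics."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)
        self.v = nn.Linear(hidden, 1)
    def forward(self, x):
        h = self.trunk(x)
        return self.pi(h), self.v(h)

def worker(rank, global_model, optimizer, env_id, n_updates=20):
    """One actor-learner process: own env, own local model, gradients pushed to the shared global model."""
    env = gym.make(env_id)                                      # each process owns its environment instance
    local_model = TinyActorCritic()
    for update in range(n_updates):
        local_model.load_state_dict(global_model.state_dict())  # sync local <- global
        obs, _ = env.reset(seed=rank)
        logits, value = local_model(torch.as_tensor(obs, dtype=torch.float32))
        loss = logits.pow(2).mean() + value.pow(2).mean()        # dummy loss standing in for the A2C loss
        optimizer.zero_grad()
        loss.backward()
        # Copy local gradients onto the shared global parameters, then step the shared optimizer.
        for lp, gp in zip(local_model.parameters(), global_model.parameters()):
            gp._grad = lp.grad
        optimizer.step()
        print(f"[worker {rank}] update {update}, loss {loss.item():.3f}")  # confirms parallel execution

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    global_model = TinyActorCritic()
    global_model.share_memory()                                 # make parameters visible to all processes
    optimizer = torch.optim.Adam(global_model.parameters(), lr=1e-3)
    processes = [mp.Process(target=worker, args=(rank, global_model, optimizer, "CartPole-v1"))
                 for rank in range(2)]                          # at least 2 actor-learner processes
    for p in processes:
        p.start()
    for p in processes:
        p.join()

For the synchronous A2C variant, the workers would instead return (or place into a queue) their gradients each step, and a central process would average them before applying a single update to the global model.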

What to Submit:

• Your code must clearly show the use of multiple threads or processes for actor-learners.

• In your report, explain how the threads interact with the global model and how synchronization is handled.

• Include plots showing the reward per episode for each thread, or the average across all threads.

• (Optional) Discuss any tradeoffs or bottlenecks you observed when increasing the number of threads.

In your report for Part II:

1. Discuss:

• What are the roles of the actor and the critic networks?

• What is the "advantage" function in A2C, and why is it important?

• Describe the loss functions used for training the actor and the critic in A2C.

2. Briefly describe the environment that you used (e.g., possible actions, states, agent, goal, rewards, etc). You can reuse related parts from your previous assignments.

3. Show and discuss your results after applying your A2C/A3C implementation on the environments. Plots should include the total reward per episode.

4. Provide the evaluation results. Run your agent on the environment for at least 10 episodes, where the agent chooses only greedy actions from the learnt policy. The plot should include the total reward per episode.

5. Run your trained agent for 1 episode where the agent chooses only greedy actions from the learned policy. Save a video of this episode with the provided code (save_rendered_episode.py), or write your own code to do so (see the recording sketch after this list), and include it as a clearly named video file in your submission (environment_id_a2c/a3c). Verify that the agent has completed all the required steps to solve the environment.

6. Provide your interpretation of the results.

7. Include all the references that have been used to complete this part
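If you write your own recording code instead of using the provided save_rendered_episode.py, one option is Gymnasium's RecordVideo wrapper, sketched below; policy stands in for your trained model, and the environment id is only an example:

import gymnasium as gym
import torch
from gymnasium.wrappers import RecordVideo

env = gym.make("LunarLander-v3", render_mode="rgb_array")        # any environment you trained on
env = RecordVideo(env, video_folder="videos", name_prefix="lunarlander_a2c")

obs, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        logits, _ = policy(torch.as_tensor(obs, dtype=torch.float32))   # your trained actor-critic model
    obs, reward, terminated, truncated, _ = env.step(int(torch.argmax(logits)))
    done = terminated or truncated
env.close()                                                      # finalizes and writes the video file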

Part II submission includes

• Report (as a PDF file)

• Jupyter Notebook (.ipynb) with all saved outputs. Do not combine with the Jupyter notebook from Part I.

• Saved weights of the trained models as pickle files, .h5, or .pth for each model and each environment.

• Saved video file or screenshots of one evaluation episode.

• If you are working in a team of two people, we expect equal contribution for the assignment. Provide contribution summary by each team member.

Part III [Total: 30 points] - Solving Complex Environments

In this part, test the A2C/A3C algorithm implemented in Part II on any other two complex environments.

1. Choose an environment. At least one of the environments has to be among "Suggested environments - Part III". An environment with multiple versions counts as a single environment.

2. Apply the A2C/A3C algorithm to solve it. You can adjust the neural network structure or hyperparameters from your base implementation.

3. Show and discuss your results after applying the A2C/A3C implementation on the environment. Plots should include the total reward per episode.

4. Provide the evaluation results. Run your environment for at least 10 episodes, where the agent chooses only greedy actions from the learnt policy. The plot should include the total reward per episode.

5. Run your trained agent for 1 episode where the agent chooses only greedy actions from the learned policy. Save a video of this episode with the provided code (save_rendered_episode.py), or write your own code to do so, and include it as a clearly named video file in your submission (environment_id_a2c/a3c). Verify that the agent has completed all the required steps to solve the environment.

6. Go to Step 1. In total, you need to provide the results for TWO environments.

In your report for Part III:

1. Briefly describe TWO environments that you used (e.g., possible actions, states, agent, goal, rewards, etc). You can reuse related parts from your previous assignments.

2. Show and discuss your results after training your Actor-Critic agent on each environment. Plots should include the reward per episode for TWO environments. Compare how the same algorithm behaves on different environments while training.

3. Provide the evaluation results for each environment that you used. Run your environments for at least 10 episodes, where the agent chooses only greedy actions from the learnt policy. The plot should include the total reward per episode.

4. If you are working in a team of two people, we expect equal contribution for the assignment. Provide contribution summary by each team member.

Part III submission includes

• Report (as a PDF file) – combined with your report from Part II

• Jupyter Notebook (.ipynb) with all saved outputs. It can be combined with the Jupyter notebook from Part II. Do not combine with the Jupyter notebook from Part I.

• Saved weights of the trained models as pickle files, .h5, or .pth for each model and each environment.

Suggested environments

Part II

• The grid world you defined in A1 or A2

• CartPole

• Acrobot

• Mountain Car

• Pendulum

• Lunar Lander

Part III

• Any multi-agent environment

• Car Racing

• Bipedal Walker

• MuJoCo Ant

• Any Atari env

• Any other complex environment that you will use for your Final Project

Extra Points [max +12 points]

Implement a Different Version of Actor-Critic [7 points]

STEPS:

1. Choose one of the following more complex actor-critic algorithms: PPO, TRPO, DDPG, TD3, or SAC. Implement this algorithm.

2. Apply your implementation of the chosen advanced actor-critic algorithm to the same THREE environments that you used earlier in this assignment.

3. In the report, create a new section titled "Bonus: Advanced Actor-Critic". Include the following:

• Present three reward dynamic plots, one for each environment. Each plot should show the learning curves for both the A2C/A3C algorithm you implemented previously and the new, improved actor-critic algorithm.

• Compare the performance of the two algorithms (A2C/A3C vs. the advanced version) across the three environments based on the plots.

• Provide a detailed analysis discussing the observed differences in performance, potential reasons for these differences, and any insights gained from comparing the two algorithms.

SUBMISSION:

Submit your results as a Jupyter Notebook file named: a3_bonus_advancedac_TEAMMATE1_TEAMMATE2.ipynb

Solve MuJoCo Environment [5 points]

STEPS:

1. Choose one environment from the MuJoCo suite (e.g., HalfCheetah-v3, Ant-v3, Hopper-v3).

2. Implement any Actor-Critic algorithm to solve the chosen MuJoCo environment. You can either use your existing A2C/A3C implementation or implement a more advanced version like PPO, TRPO, DDPG, TD3, or SAC.

3. Train your chosen Actor-Critic agent on the selected MuJoCo environment. Evaluate its performance using appropriate metrics (e.g., average reward over episodes).

4. In the report, create a new section titled "MuJoCo Environment". Include the following:

• Present a reward dynamic plot showing the learning curve of your agent on the MuJoCo environment.

• Describe the specific MuJoCo environment you chose and the Actor-Critic algorithm you implemented.

• Provide an analysis of the results, discussing the performance achieved and any challenges encountered during training.

SUBMISSION:

Submit your results as a Jupyter Notebook file named: a3_bonus_mujoco_TEAMMATE1_TEAMMATE2.ipynb

2 References

• NeurIPS Styles (docx, tex)

• Overleaf (LaTeX-based online document generator) - a free tool for creating professional reports

• Gymnasium environments

• Atari Environments

• Lecture slides

• Asynchronous Methods for Deep Reinforcement Learning

3 Assignment Steps

1(a). Register your team (Due date: November 6)

• You may work individually or in a team of up to 2 people. The evaluation will be the same for a team of any size.

• Register your team at UBlearns > Groups. You have to enroll in a team on UBLearns even if you are working individually. Your teammates for A2 and A3 should be different.

1(b). For a team of 2 students (Due date: November 6)

• Create a private GitHub repository for the project and add our course GitHub account as a collaborator: @ub-rl

• Each team member should regularly push their progress to the repository. For example, you can sync the repository daily to reflect any updates or improvements made

• In your report include a link to the repository along with the contribution table. Additionally, add screenshot(s) of your commit history.

• Report: The report should be delivered as a separate PDF file, and it is recommended that you use the NeurIPS template to structure your report. You may include comments in the Jupyter Notebook; however, you will need to duplicate the results in a separate PDF file. Name your report as:

a3_report_TEAMMATE1_TEAMMATE2.pdf

(e.g. a3_report_avereshc_nitinvis.pdf)

• Code: Python is the only language accepted for this project. You can submit the code as a Jupyter Notebook with the saved results. You can submit multiple files, but they all need to have clear names. When executed, the Jupyter Notebook should generate all the results and plots used in your report and should be printable in a clear manner.

Name your Jupyter Notebooks following the pattern:

a3_part_1_TEAMMATE1_TEAMMATE2.ipynb

and a3_part_2_TEAMMATE1_TEAMMATE2.ipynb

(e.g. a3_part_1_avereshc_nitinvis.ipynb)

• Model Parameters: Saved weights of the model(s) as a pickle file or .h5, so that the grader can fully replicate your results. Name your .pickle, .h5, or .pth files using the following pattern:

a3_part_2_a2c_cartpole_TEAMMATE1_TEAMMATE2.pickle

a3_part_2_a2c_lunarlander_TEAMMATE1_TEAMMATE2.pickle

2. Submit Checkpoint (Due date: November 18)

• Complete Part I & Part II.

• Submit your .ipynb with saved outputs, named as

a3_part_1_TEAMMATE1_TEAMMATE2.ipynb

a3_part_2_TEAMMATE1_TEAMMATE2.ipynb

e.g. a3_part_1_avereshc_nitinvis.ipynb

• Report for Checkpoint submission is optional

• Submit at UBLearns > Assignments > Assignment 3

3. Submit final results (Due date: November 25)

• Fully complete all parts of the assignment

• Add all your assignment files in a zip folder, including the .ipynb files for Part I, Part II, Part III, the bonus part (optional), and the report.

• Submit at UBLearns > Assignments

• Suggested file structure (bonus part is optional):

assignment_3_final_TEAMMATE1_TEAMMATE2.zip

– a3_part_1_TEAMMATE1_TEAMMATE2.ipynb

– a3_part_2_TEAMMATE1_TEAMMATE2.ipynb

– a3_part_2_a2c_ENV1_TEAMMATE1_TEAMMATE2.h5

– a3_part_3_TEAMMATE1_TEAMMATE2.ipynb

– a3_part_3_a2c_ENV2_TEAMMATE1_TEAMMATE2.h5

– a3_part_3_a2c_ENV3_TEAMMATE1_TEAMMATE2.h5

– a3_report_TEAMMATE1_TEAMMATE2.pdf

– a3_bonus_advanced_ac_TEAMMATE1_TEAMMATE2.ipynb

– a3_bonus_mujoco_TEAMMATE1_TEAMMATE2.ipynb

• The code of your implementations should be written in Python. You can submit multiple files, but they all need to be labeled clearly.

• Your Jupyter notebook should be saved with the results

• Include all the references that have been used to complete the assignment

• If you are working in a team of two, we expect equal contribution for the assignment. Each team member is expected to make a code-related contribution. Provide a contribution summary by each team member in the form of the table below. If the contribution is highly skewed, the scores of the team members may be scaled with respect to the contribution.

Team Member | Assignment Part | Contribution (%)
------------|-----------------|------------------
            |                 |

4 Grading Details

Checkpoint Submission

• Graded on a 0/1 scale.

• A grade of "1" is awarded for completing more than 80% of the checkpoint-related part, and "0" is assigned for all other cases.

• Receiving a "0" on a checkpoint submission results in a 30-point deduction from the final assignment grade.

Final Submission

• Graded on a scale of X out of 100 points, with potential bonus points (if applicable).

• All parts of the assignment are evaluated during final evaluation.

• Ensure your final submission includes the final version of all assignment parts.

Important Notes

1. Only files submitted on UBlearns are considered for evaluation.

2. Files from local devices, GitHub, Google Colab, Google Docs, or other locations are not accepted.

3. Regularly submit your work-in-progress on UBlearns and always verify submitted files by downloading and opening them.


