
CSE 6242 / CX 4242: Data and Visual Analytics | Georgia Tech | Spring 2025

HW 4: PageRank Algorithm, Random Forest, Scikit-learn

Download the HW4 Skeleton before you begin

Homework Overview

Data analytics and machine learning both revolve around using computational models to capture relationships between variables and outcomes. In this assignment, you will code and fit a range of well-known models from scratch and learn to use a popular Python library for machine learning.

In Q1, you will implement the famous PageRank algorithm from scratch. PageRank can be thought of as a model for a system in which a person is surfing the web by choosing uniformly at random a link to click on at each successive webpage they visit. Assuming this is how we surf the web, what is the probability that we are on a particular webpage at any given moment? The PageRank algorithm assigns values to each webpage according to this probability distribution.

In Q2, you will implement Random Forests, a very common and widely successful classification model, from scratch. Random Forest classifiers also describe probability distributions: the conditional probability of a sample belonging to a particular class given some or all of its features.

Finally, in Q3, you will use the Python scikit-learn library to specify and fit a variety of supervised and unsupervised machine learning models.

The maximum possible score for this homework is 100 points.

Important Notes

1.    Submit your work by the due date on the course schedule.

a.   Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.

b.   Before the grace period expires, you may resubmit as many times as needed.

c.   TA assistance is not guaranteed during the grace period.

d.   Submissions during the grace period will display as "late" but will not incur a penalty.

e. We will not accept any submissions made after the grace period ends.

2.  Always use the most up-to-date assignment (version number at bottom right of this document). The latest version will be listed in Ed Discussion.

3.   You may discuss ideas with other students at the "whiteboard" level (e.g., how cross-validation works, or when to use a HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit their own answers.

4.  All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute's Academic Integrity procedures, handled directly by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for the assignments concerned, and prohibition from withdrawing from the class.

Submission Notes

1.  All questions are graded on the Gradescope platform, accessible through Canvas.

2.   We will not accept submissions anywhere else outside of Gradescope.

3.   Submit all required files as specified in each question. Make sure they are named correctly.

4.  You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive.

5.   You must not use Gradescope as the primary way to test your code. It provides only a few test cases, and its error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a "final" check.

6. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify:

a.   The code is free of syntax errors (by running locally)

b.  All methods have been implemented

c.   The correct file was submitted with the correct name

d.   No extra packages or files were imported

7.   When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time.

8.   Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”).

Q1 [20 pts] Implementation of PageRank Algorithm

Technology: PageRank Algorithm; Graph; Python >=3.7.x (you must use Python >=3.7.x for this question)

Allowed Libraries: Do not modify the import statements; everything you need to complete this question has been imported for you. You must NOT use other libraries for this assignment.

Max runtime: 5 minutes

Deliverables: [Gradescope]

Q1.ipynb [12 pts]: your modified implementation

simplified_pagerank_iter{n}.txt: 2 files (as given below) containing the top 10 node IDs (w.r.t. the PageRank values) and their PageRank values for n iterations via the provided run() helper function

o simplified_pagerank_iter10.txt [2 pts]

o simplified_pagerank_iter25.txt [2 pts]

personalized_pagerank_iter{n}.txt: 2 files (as given below) containing the top 10 node IDs (w.r.t. the PageRank values) and their PageRank values for n iterations via the provided run() helper function

o personalized_pagerank_iter10.txt [2 pts]

o personalized_pagerank_iter25.txt [2 pts]

Important: Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

In this question, you will implement the PageRank algorithm in Python for a large graph network dataset.

The PageRank algorithm was first proposed to rank web pages in search results. The basic assumption is that more "important" web pages are referenced more often by other pages and thus are ranked higher. To estimate the importance of a page, the algorithm works by considering the number and "importance" of links pointing to the page. PageRank outputs a probability distribution over all web pages, representing the likelihood that a person randomly surfing the web (randomly clicking on links) would arrive at those pages.

As mentioned in the lectures, the PageRank values are the entries in the dominant eigenvector of the modified adjacency matrix in which each column's values add up to 1 (i.e., "column normalized"), and this eigenvector can be calculated by the power iteration method that you will implement in this question. This method iterates through the graph's edges multiple times to update the nodes' PageRank values ("pr_values" in Q1.ipynb) in each iteration. We recommend that you review the lecture video for PageRank and personalized PageRank before working on your implementation. At 9 minutes and 41 seconds of the video, the full PageRank algorithm is expressed in a matrix-vector form. Equivalently, the PageRank value of node vj at iteration t + 1 can also be expressed as (notation differs from the video's):

PR_{t+1}(vj) = (1 − d) · Pd(vj) + d · Σ_{vi → vj} PR_t(vi) / out_degree(vi)

where

• vj is node j

• vi is any node i that has a directed edge pointing to node j

• out_degree(vi) is the number of links going out of node vi

• PR_{t+1}(vj) is the PageRank value of node j at iteration t + 1

• PR_t(vi) is the PageRank value of node i at iteration t

• d is the damping factor, the probability that the surfer continues to follow links; set it to the common value of 0.85

• Pd(vj) is the probability of a random jump to node j, which can be personalized based on use cases
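To make the update concrete, below is a minimal sketch of one power-iteration step implementing the formula above. The variable names (edges, node_degree, node_weights, pr_values) mirror the terminology used here but are assumptions, not the skeleton's exact API; adapt the logic to Q1.ipynb's actual structure.

```python
# A minimal sketch of one power-iteration update. Assumes node ids are
# 0..max_node_id, edges is a list of (source, target) pairs, node_degree
# maps a node id to its out-degree, node_weights[j] holds Pd(vj), and
# pr_values[i] holds PR_t(vi). Illustrative names only.
def pagerank_step(edges, node_degree, node_weights, pr_values, d=0.85):
    # Every node starts with its (1 - d) * Pd(vj) random-jump mass.
    new_pr = [(1 - d) * w for w in node_weights]
    # Each edge (i -> j) passes d * PR_t(vi) / out_degree(vi) to node j.
    for src, tgt in edges:
        new_pr[tgt] += d * pr_values[src] / node_degree[src]
    return new_pr
```

Repeating this step for n iterations, starting from an initial pr_values vector, yields the PageRank values after n iterations.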

Tasks

You will be using the “network.tsv” graph network dataset in the hw4-skeleton/Q1 folder, which contains about 1 million nodes and 3 million edges. Each row in that file represents a directed edge in the graph. The edge’s source node id is stored in the first column of the file, and the target node id is stored in the second column.

Your code must NOT make any assumptions about the relative magnitude of the node ids of an edge. For example, even if the source node id is smaller than the target node id for most edges in a graph, you must NOT assume that this is always the case for all graphs (i.e., in other graphs, a source node id can be larger than a target node id).

You will complete the code in Q1.ipynb (guidelines also provided in the file).

1.   Calculate and store each node's out-degree and the graph's maximum node id in calculate_node_degree() (a minimal sketch appears after this task list)

a.  A node’s out-degree is its number of outgoing edges. Store the out-degree in instance variable "node_degree".

b.   max_node_id refers to the highest node id in the graph. For example, suppose a graph contains the two edges (1,4) and (2,3), in the format of (source, target); the max_node_id here is 4. Store the maximum node id in the instance variable max_node_id.

2.    Implement run_pagerank()

a.   For the simplified PageRank algorithm, Pd(vj) = 1/(max_node_id + 1) is provided as node_weights in the script, and you will submit the output of 10- and 25-iteration runs with a damping factor of 0.85. To help you verify your implementation, we provide the sample output of 5 iterations of simplified PageRank (simplified_pagerank_iter5_sample.txt).

b.   For personalized PageRank, the Pd() vector will be assigned values based on your 9-digit GTID (e.g., 987654321), and you will submit the output of 10- and 25-iteration runs with a damping factor of 0.85.

3.   Compare output

a.   Generate output text files by running the last cell of Q1.ipynb.

b. Note: When comparing your output for simplified_pagerank for 5 iterations with the given sample output, the absolute relative difference must be less than 5%. That is, absolute((SampleOutput - YourOutput) / SampleOutput) must be less than 0.05.
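For task 1 above, here is a minimal sketch of computing the out-degrees and the maximum node id directly from network.tsv. The variable names are illustrative, not the skeleton's exact instance variables, and it assumes each line holds a whitespace-separated (source, target) integer pair as described above.

```python
# Illustrative sketch for task 1 (adapt to the skeleton's instance variables).
node_degree = {}   # node id -> out-degree (number of outgoing edges)
max_node_id = 0
with open("network.tsv") as f:
    for line in f:
        source, target = (int(x) for x in line.split())
        node_degree[source] = node_degree.get(source, 0) + 1
        # The maximum id may appear on either side of an edge.
        max_node_id = max(max_node_id, source, target)
```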

Q2 [50 pts] Random Forest Classifier

Technology: Python >=3.7.x

Allowed Libraries: Do not modify the import statements; everything you need to complete this question has been imported for you. You must NOT use other libraries for this assignment.

Max runtime: 300 seconds

Deliverables: [Gradescope]

Q2.ipynb [45 pts]: your solution as a Jupyter notebook, developed by completing the provided skeleton code

o 10 points are awarded for the 2 utility functions: 5 points for entropy() and 5 points for information_gain()

o 35 points are awarded for successfully implementing your random forest

Random Forest Reflection [5 pts]: multiple-choice question completed on Gradescope.

Q2.1 - Random Forest Setup [45 pts]

Note: You must use Python >=3.7.x for this question.

You will implement a random forest classifier in Python via a Jupyter notebook. The performance of the classifier will be evaluated via the out-of-bag (OOB) error estimate using the provided dataset Wisconsin_breast_prognostic.csv, a comma-separated values (CSV) file in the Q2 folder. The features (attributes) were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. You must not modify the dataset. Each row describes one patient (a data point, or data record) and includes 31 columns. The first 30 columns are attributes. The 31st (the last column) is the label, and you must NOT treat it as an attribute. A value of one or zero in the last column indicates that the cancer is malignant or benign, respectively. You will perform binary classification on the dataset to determine whether a particular cancer is benign or malignant.
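As an illustration of the row layout described above (the skeleton already handles data loading; this sketch only shows the 30-attributes-plus-label split, using the csv module from the Python Standard Library):

```python
import csv

# Illustrative only: split each row into 30 attribute values and the label.
# If the file has a header row, skip it with next(reader) first.
X, y = [], []
with open("Wisconsin_breast_prognostic.csv") as f:
    for row in csv.reader(f):
        values = [float(v) for v in row]
        X.append(values[:30])       # the first 30 columns are attributes
        y.append(int(values[30]))   # last column: 1 = malignant, 0 = benign
```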

Important:

1.   Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

2.   You may only use the modules and libraries provided at the top of the notebook file included in the skeleton for Q2 and modules from the Python Standard Library. Python wrappers (or modules) must NOT be used for this assignment. Pandas must NOT be used. While we understand that these are useful libraries to learn, completing this question is not critically dependent on their functionality. In addition, to make grading more manageable and to enable our TAs to provide better, more consistent support to our students, we have decided to restrict the libraries accordingly.

Essential Reading

Decision Trees. To complete this question, you will develop a good understanding of how decision trees work. We recommend that you review the lecture on the decision tree. Specifically, review how to construct decision trees using Entropy and Information Gain to select the splitting attribute and split point for the selected attribute. These slides from CMU (also mentioned in the lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain. Note: there is a typo on page 10, containing the Entropy equation; ignore one negative sign (only one negative sign is needed).
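As a concrete reference, below is a minimal sketch of entropy and information gain in plain Python. The signatures are assumptions (entropy() taking a list of class labels, information_gain() taking the parent's labels and a list of child partitions); follow the skeleton's actual signatures and comments.

```python
from math import log2

def entropy(class_y):
    # Entropy of a list of class labels, e.g. entropy([0, 0, 1, 1]) == 1.0.
    n = len(class_y)
    if n == 0:
        return 0.0
    return -sum((class_y.count(c) / n) * log2(class_y.count(c) / n)
                for c in set(class_y))

def information_gain(previous_y, current_y):
    # Reduction in entropy after splitting previous_y into the partitions
    # listed in current_y (e.g., the left and right sides of a split).
    n = len(previous_y)
    remainder = sum(len(part) / n * entropy(part) for part in current_y)
    return entropy(previous_y) - remainder
```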

Random Forests. To refresh your memory about random forests, see Chapter 15 in the Elements of Statistical Learning book and the lecture on random forests. Here is a blog post that introduces random forests in a fun way, in layman's terms.

Out-of-Bag Error Estimate. In random forests, it is not necessary to perform explicit cross-validation or use a separate test set for performance evaluation. The out-of-bag (OOB) error estimate has been shown to be reasonably accurate and unbiased. Below, we summarize the key points about OOB from the original article by Breiman and Cutler.

Each tree in the forest is constructed using a different bootstrap sample from the original data. Each bootstrap sample is constructed by randomly sampling from the original dataset with replacement (usually, a bootstrap sample has the same size as the original dataset). Statistically, about one-third of the data records (or data points) are left out of the bootstrap sample and not used in the construction of the kth tree. Each data record that was not used in the construction of the kth tree can be classified by that tree. As a result, each record will have a "test set" classification by the subset of trees that treat the record as an out-of-bag sample. The majority vote for that record will be its predicted class. The proportion of times that a record's predicted class differs from the true class, averaged over all such records, is the OOB error estimate.
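The following sketch expresses that procedure in plain Python. It assumes each tree records the set of row indices in its bootstrap sample (here tree.bag) and exposes a classify(record) method; both names are illustrative, not the skeleton's API.

```python
from collections import Counter, defaultdict

def oob_error(trees, X, y):
    # For each record, collect votes only from trees whose bootstrap
    # sample did not contain that record (the out-of-bag trees).
    votes = defaultdict(list)
    for tree in trees:
        for i, record in enumerate(X):
            if i not in tree.bag:
                votes[i].append(tree.classify(record))
    # A record's prediction is the majority vote of its OOB trees; the
    # OOB error is the fraction of such records whose prediction is wrong.
    wrong = sum(1 for i, v in votes.items()
                if Counter(v).most_common(1)[0][0] != y[i])
    return wrong / len(votes) if votes else 0.0
```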

While splitting a tree node, make sure to randomly select a subset of attributes (e.g., the square root of the number of attributes) and pick the best splitting attribute (and splitting point of that attribute) from within this subset. This randomization is the main difference between random forests and bagged decision trees.
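A minimal sketch of that attribute randomization, under the assumption that attributes are addressed by column index:

```python
import math
import random

def candidate_attributes(num_attributes):
    # At each node, consider only a random subset of attribute indices
    # (square root of the attribute count) when searching for the best split.
    m = max(1, int(math.sqrt(num_attributes)))
    return random.sample(range(num_attributes), m)
```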

Starter Code

We have prepared some Python starter code to help you load the data and evaluate your model. The starter file, Q2.ipynb, has three classes:

●   Utility: contains utility functions that help you build a decision tree

●    DecisionTree: a decision tree class that you will use to build your random forest

●   RandomForest: a random forest class

What you will implement

Below, we have summarized what you will implement to solve this question. Note that you must use information gain to perform the splitting in the decision tree. The starter code has detailed comments on how to implement each function.

1.   Utility class: implement the functions to compute entropy, information gain, perform splitting, and find the best variable (attribute) and split point. You can add additional methods for convenience. Note: Do not round the output of any of your functions.

2.   DecisionTree class: implement the learn() method to build your decision tree using the utility functions above.

3.   DecisionTree class: implement the classify() method to predict the label of a test record using your decision tree.

4.   RandomForest class: implement the methods bootstrapping(), fitting(), voting() and user().

5.   get_random_seed(), get_forest_size(): implement the functions to return a random seed and forest size (number of decision trees) for your implementation.

Important:

1.   You must achieve a minimum accuracy of 90% for the random forest. If the accuracy is low, try adjusting the hyperparameters. If it is extremely low, revisit your best_split() and classify() methods.

2.   Your code must take no more than 5 minutes to execute (which is a very long time, given the low program complexity). Otherwise, it may time out on Gradescope. Code that takes longer than 5 minutes to run likely means you need to correct inefficiencies (or incorrect logic) in your program. We suggest that you check the hyperparameter choices (e.g., tree depth, number of trees) and code logic when figuring out how to reduce the runtime.

3.   The run() function is provided to test your random forest implementation; do NOT modify this function.

4. Note: In your implementation, use basic Python lists rather than the more complex NumPy data structures to reduce the chances of version-specific library conflicts with the grading scripts.

As you solve this question, consider the following design choices. Some may be more straightforward to determine than others (hint: study the lecture materials and essential reading above). For example:

●   Which attributes to use when building a tree?

●    How to determine the split point for an attribute?

●    How many trees should the forest contain?

●   You may implement your decision tree using the data structure of your choice (e.g., dictionary, list, class member variables). However, your implementation must still work within the DecisionTree class structure we have provided.

●   Your decision tree will be initialized using DecisionTree(max_depth=10) in the RandomForest class in the Jupyter notebook.

●   When do you stop splitting leaf nodes?

●   The depth found in the learn function is the depth of the current node/tree. You may want a check within the learn function that looks at the current depth and returns if the depth is greater than or equal to the max depth specified. Otherwise, you may continually split on nodes and create a messy tree. The max_depth parameter should be used as a stopping condition for when your tree should stop growing. Your decision tree will be instantiated with a depth of 0 (input to the learn() function in the Jupyter notebook). To comply with this, make sure you implement the decision tree such that the root node starts at a depth of 0 and is built with increasing depth. A minimal sketch of this stopping condition appears below.
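Below is a minimal sketch of the depth-based stopping test described in the last bullet, with hypothetical helper names; wire the equivalent logic into the skeleton's learn() method, checking the condition before splitting and recursing on each child with depth + 1.

```python
from collections import Counter

def should_stop(y, depth, max_depth):
    # Stop growing when the depth limit is reached or the node is pure
    # (all labels at this node agree).
    return depth >= max_depth or len(set(y)) <= 1

def leaf_label(y):
    # A leaf predicts the majority class of the records that reached it.
    return Counter(y).most_common(1)[0][0]
```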

Note that, as mentioned in the lecture, there are other approaches to implement random forests. For example, instead of information gain, other popular choices include the Gini index, random attribute selection (e.g., PERT - Perfect Random Tree Ensembles). We decided to ask everyone to use an information gain based approach in this question (instead of leaving it open-ended), because information gain is a useful machine learning concept to learn in general.

Q2.2 - Random Forest Reflection [5 pts]

On Gradescope, answer the following multiple-choice question. You can submit your answer only once. Clicking the “Save Answer” button on Gradescope WILL immediately submit your answer, making it final and unchangeable. Select all that apply; your answer must be completely correct to earn the points. No partial marks will be awarded if all correct options are NOT selected.

What are the main advantages of using a random forest versus a single decision tree?


