DSCI553 Foundations and Applications of Data Mining

FALL 2021

Assignment 3

Deadline: October 26th, 11:59 PM PST

1. Overview of the Assignment
In Assignment 3, you will complete two tasks. The goal is to familiarize you with Locality Sensitive
Hashing (LSH) and different types of collaborative-filtering recommendation systems. The dataset you
are going to use is a subset of the Yelp dataset used in the previous assignments.
2. Assignment Requirements
2.1 Programming Language and Library Requirements
a. You must use Python to implement all tasks. You can only use standard Python libraries (i.e., external
libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task (or case) if you
also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDD, so that you understand Spark operations. You will not receive any
points if you use Spark DataFrame or DataSet.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2
We will use these library versions to compile and test your code. You will receive no points if we cannot run
your code on Vocareum. On Vocareum, you can call `spark-submit` located at
`/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at
`/home/local/spark/latest/bin/spark-submit`, which is version 2.4.4.)
2.3 Write your own code
Do not share your code with other students!!
We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code
from this and other (previous) sections for plagiarism detection. We will report all the detected
plagiarism.
3. Yelp Data
In this assignment, the datasets you are going to use are from:
https://drive.google.com/drive/folders/1SufecRrgj1yWMOVdERmBBUnqz0EX7ARQ?usp=sharing
We generated the following two datasets from the original Yelp review dataset with some filters. We
randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and
20% of the data as the testing dataset.
a. yelp_train.csv: the training data, which includes only the columns user_id, business_id, and stars.
b. yelp_val.csv: the validation data, which is in the same format as the training data.
c. We are not sharing the test dataset.
d. other datasets: providing additional information (like the average star or location of a business)
4. Tasks
Note: This Assignment has been divided into 2 parts on Vocareum. This has been done to
provide more computational resources.
4.1 Task1: Jaccard based LSH (2 points)
In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using
yelp_train.csv.
In this task, we focus on the “0 or 1” ratings rather than the actual ratings/stars from the users.
Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the
user hasn’t rated the business, the contribution is 0. You need to identify similar businesses whose
similarity >= 0.5.
You can define any collection of hash functions that you think would result in a consistent permutation
of the row entries of the characteristic matrix. Some potential hash functions are:
f(x)= (ax + b) % m or f(x) = ((ax + b) % p) % m
where p is any prime number and m is the number of bins. Please carefully design your hash functions.
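As a rough illustration of the hash-function family above, the sketch below builds n functions of the form f(x) = ((ax + b) % p) % m and uses them to compute the min-hash signature of one business. It is plain Python for readability (the assignment itself requires Spark RDD operations), and the helper names, the prime 1000003, and the seed are assumptions made for this example only.

```python
import random

def make_hash_funcs(n, m, p=1000003, seed=553):
    """Return n functions f(x) = ((a*x + b) % p) % m with random a and b.

    p is a prime larger than m; the default value is only an example.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(n)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

def minhash_signature(row_indices, hash_funcs):
    """Signature of one business: for each hash, the minimum over its user rows."""
    return [min(h(x) for x in row_indices) for h in hash_funcs]

# Users are assumed to be mapped to integer row indices beforehand.
hash_funcs = make_hash_funcs(n=4, m=1000)
sig_a = minhash_signature({0, 2, 5}, hash_funcs)
sig_b = minhash_signature({0, 2, 5}, hash_funcs)
```

Two businesses rated by exactly the same users always get the same signature, which is why `sig_a` and `sig_b` match here.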
After you have defined all the hashing functions, you will build the signature matrix. Then you will divide
the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should
carefully select a good combination of b and r in your implementation (b>1 and r>1). Remember that
two items are a candidate pair if their signatures are identical in at least one band.
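The banding step can be sketched as follows. This is a minimal plain-Python version, assuming the signatures are already available as a dictionary from business id to a list of n = b x r integers; the function name and the toy signatures are hypothetical, and a real submission would express the same grouping with Spark RDD operations.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """Return the set of sorted id pairs that agree on at least one band."""
    buckets = defaultdict(list)
    for item, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError("signature length must equal b * r")
        for band in range(b):
            # The band index is part of the key, so bands never mix.
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(item)
    pairs = set()
    for items in buckets.values():
        for x, y in combinations(sorted(items), 2):
            pairs.add((x, y))
    return pairs

toy_signatures = {
    "b1": [1, 2, 3, 4],
    "b2": [1, 2, 9, 9],  # matches b1 in band 0 only
    "b3": [7, 7, 7, 7],
}
toy_pairs = candidate_pairs(toy_signatures, b=2, r=2)
```

Candidate pairs still have to be checked against the original Jaccard similarity, since agreeing on a band does not guarantee similarity >= 0.5.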
Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write
the final results into a CSV file according to the output format below.
Example of Jaccard Similarity:
            user1  user2  user3  user4
business1     0      1      1      1
business2     0      1      0      0
Jaccard Similarity (business1, business2) = #intersection / #union = 1/3
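The same example can be checked in a few lines of Python by representing each business as the set of users who rated it. The function name is an assumption made for this sketch.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity of two sets of user ids: |intersection| / |union|."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)

business1 = {"user2", "user3", "user4"}
business2 = {"user2"}
sim = jaccard(business1, business2)  # one shared user out of three in total
```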
Input format: (we will use the following command to execute your code)
Python: spark-submit task1.py <input_file_name> <output_file_name>
Scala: spark-submit --class task1 hw3.jar <input_file_name> <output_file_name>
Param: input_file_name: the name of the input file (yelp_train.csv), including the file path.
Param: output_file_name: the name of the output CSV file, including the file path.
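One possible way to read the two parameters inside task1.py is shown below. The helper name and the usage message are assumptions for this sketch; only the parameter names come from the description above.

```python
import sys

def parse_args(argv):
    """Return (input_file_name, output_file_name) from a sys.argv-style list."""
    if len(argv) != 3:
        raise SystemExit("usage: task1.py <input_file_name> <output_file_name>")
    return argv[1], argv[2]

# In the real script this would be parse_args(sys.argv).
input_file_name, output_file_name = parse_args(
    ["task1.py", "yelp_train.csv", "task1_output.csv"]
)
```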
Output format:
IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We
will not regrade because of formatting issues.
a. The output file is a CSV file containing all the business pairs you have found. The header is
“business_id_1, business_id_2, similarity”. The two business IDs within each pair must be in alphabetical order, and the entire
file must also be sorted in alphabetical order. There is no requirement for the number of decimals for
the similarity value. Please refer to the format in Figure 2.
Figure 2: a CSV output example for task1
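A sketch of the sorting and writing step follows. It assumes the header text quoted above is written literally and that data rows use plain commas; Figure 2 is not reproduced here, so the exact spacing should be checked against it. Python's default string ordering is used as a stand-in for "alphabetical order", and the function name and demo file name are hypothetical.

```python
import csv
import os
import tempfile

HEADER = "business_id_1, business_id_2, similarity"  # copied from the text above

def write_pairs(pairs, path):
    """Write (id_a, id_b, similarity) triples as a sorted CSV and return the rows."""
    # Order the two ids inside each pair, then sort all rows.
    rows = sorted((min(a, b), max(a, b), sim) for a, b, sim in pairs)
    with open(path, "w", newline="") as f:
        f.write(HEADER + "\n")
        csv.writer(f, lineterminator="\n").writerows(rows)
    return rows

demo_path = os.path.join(tempfile.gettempdir(), "task1_demo_output.csv")
demo_rows = write_pairs([("b2", "b1", 0.5), ("b1", "b3", 1.0)], demo_path)
```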
Grading:
We will compare your output file against the ground truth file using precision and recall metrics.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
The ground truth file is provided in the Google Drive folder and is named “pure_jaccard_similarity.csv”. You
can use this file to compare your results to the ground truth as well.
The ground truth dataset only contains the business pairs (from the yelp_train.csv) whose Jaccard
similarity >=0.5. The business pair itself is sorted in the alphabetical order, so each pair only appears
once in the file (i.e., if pair (a, b) is in the dataset, (b, a) will not be there).
In order to get full credit for this task you should have precision >= 0.99 and recall >= 0.97. If not, then
you will get only partial credit based on the formula:
(Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4
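To check your own output against the ground truth file before submitting, precision and recall can be computed on sets of sorted pairs. The sketch below is an assumption about how such a check might look, not the grading script; the function names and the toy pairs are made up, and the second function simply evaluates the partial-credit expression quoted above.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted pairs against ground-truth pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

def partial_credit_formula(precision, recall):
    """Value of (Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4."""
    return (precision / 0.99) * 0.4 + (recall / 0.97) * 0.4

predicted_pairs = {("a", "b"), ("a", "c"), ("b", "c")}
truth_pairs = {("a", "b"), ("a", "c"), ("c", "d"), ("d", "e")}
precision, recall = precision_recall(predicted_pairs, truth_pairs)
```

Here two of the three predicted pairs are correct and two of the four true pairs are found.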
Your runtime should be less than 100 seconds. If your runtime is more than or equal to 100 seconds, you
will not receive any points for this task.
4.2 Task2: Recommendation System (5 points)
In task 2, you are going to build different types of recommendation systems using the yelp_train.csv to
predict the ratings/stars for given user ids and business ids. You can make any improvement to your
recommendation system in terms of the speed and accuracy. You can use the validation dataset
(yelp_val.csv) to evaluate the accuracy of your recommendation systems, but please don’t include it as
your training data.
There are two ways to evaluate your recommendation systems. First, you can compare your results to the
corresponding ground truth and compute the absolute differences. You can divide the absolute
differences into 5 levels and count the number of predictions in each level as follows:
>=0 and <1: 12345
>=1 and <2: 123
>=2 and <3: 1234
>=3 and <4: 1234
>=4: 12
This means that there are 12345 predictions with < 1 difference from the ground truth. This way you will
be able to know the error distribution of your predictions and to improve the performance of your
recommendation systems.
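The five-level count can be produced with a short helper such as the one below. The function name and the toy numbers are assumptions for illustration.

```python
def error_levels(predictions, truths):
    """Count absolute differences in the levels [0,1), [1,2), [2,3), [3,4), >=4."""
    labels = [">=0 and <1", ">=1 and <2", ">=2 and <3", ">=3 and <4", ">=4"]
    counts = [0] * 5
    for pred, true in zip(predictions, truths):
        diff = abs(pred - true)
        counts[min(int(diff), 4)] += 1  # everything from 4 upward shares a level
    return dict(zip(labels, counts))

levels = error_levels([4.2, 3.0, 1.0, 5.0], [4.0, 5.0, 5.0, 1.5])
```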
Additionally, you can compute the RMSE (Root Mean Squared Error) using the following formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i} (\mathrm{Pred}_i - \mathrm{Rate}_i)^2}$$
where Pred_i is the prediction for business i, Rate_i is the true rating for business i, and n is the total
number of businesses you are predicting.
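RMSE is the square root of the mean squared difference between predictions and true ratings, which can be computed as in this small sketch. The function name and the toy values are assumptions.

```python
import math

def rmse(predictions, truths):
    """Root mean squared error between two equally long lists of ratings."""
    if len(predictions) != len(truths) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predictions, truths))
    return math.sqrt(squared / len(predictions))

# Squared errors are 1, 0 and 4, so the mean is 5/3.
example_rmse = rmse([3.0, 4.0, 5.0], [4.0, 4.0, 3.0])
```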
In this task, you are required to implement:
Case 1: Item-based CF recommendation system with Pearson similarity (2 points)
Case 2: Model-based recommendation system (1 point)
Case 3: Hybrid recommendation system (2 points)
4.2.1. Item-based CF recommendation system
Please strictly follow the slides to implement an item-based recommendation system with Pearson
similarity.
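The course slides are not reproduced in this document, so the sketch below shows only one common formulation of item-based CF and may differ from the slides in its details. It assumes that Pearson similarity is computed over co-rated users with means taken over those users, that only positively correlated neighbors are used, and that a fixed default of 3.0 is returned when no neighbor is available. All names are hypothetical.

```python
import math

def pearson(ratings_i, ratings_j):
    """Pearson similarity of two items given dicts user_id -> stars."""
    common = set(ratings_i) & set(ratings_j)
    if len(common) < 2:
        return 0.0  # assumption: too few co-raters means no evidence
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den = math.sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common)) * \
        math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    return num / den if den else 0.0

def predict(user, item, item_ratings, default=3.0):
    """Similarity-weighted average of the user's ratings on other items."""
    target = item_ratings.get(item, {})
    num = den = 0.0
    for other, ratings in item_ratings.items():
        if other == item or user not in ratings:
            continue
        w = pearson(target, ratings)
        if w > 0:  # assumption: ignore negatively correlated neighbors
            num += w * ratings[user]
            den += abs(w)
    return num / den if den else default

toy_ratings = {
    "A": {"u1": 5, "u2": 3, "u3": 1},
    "B": {"u1": 4, "u2": 3, "u3": 2, "u4": 4},
}
predicted_stars = predict("u4", "A", toy_ratings)
```

In the toy data, items A and B are perfectly correlated on their three shared users, so the prediction for u4 on A is driven entirely by u4's rating of B.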
4.2.2. Model-based recommendation system
You need to use XGBRegressor (a regressor based on decision trees) to train a model. Use the XGBRegressor class
in the xgboost package, documented at https://xgboost.readthedocs.io/en/latest/python/python_api.html.
Please choose your own features from the provided extra datasets, thinking about them from a customer’s
point of view. For example, the average stars given by a user and the number of reviews most likely
influence the prediction result. You need to select other features and train a model based on them. Use
the validation dataset to validate your results, and remember not to include it in your training data.
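A feature row for one (user, business) pair might be assembled as sketched below. The extra datasets are assumed here to be JSON-lines files whose records carry fields such as `average_stars`, `stars`, and `review_count`; those field names, the helper names, and the zero fill for unknown ids are assumptions to verify against the actual files. The xgboost call itself is left out so the sketch runs with the standard library only.

```python
import json

def load_features(lines, key, fields):
    """Parse JSON-lines text into {record[key]: [float(record[f]) for f in fields]}."""
    table = {}
    for line in lines:
        record = json.loads(line)
        table[record[key]] = [float(record[f]) for f in fields]
    return table

def build_row(user_id, business_id, user_feats, biz_feats, missing=(0.0, 0.0)):
    """Concatenate user and business features; unknown ids get placeholder values."""
    return list(user_feats.get(user_id, missing)) + list(biz_feats.get(business_id, missing))

user_lines = ['{"user_id": "u1", "average_stars": 3.5, "review_count": 10}']
biz_lines = ['{"business_id": "b1", "stars": 4.0, "review_count": 200}']
user_feats = load_features(user_lines, "user_id", ["average_stars", "review_count"])
biz_feats = load_features(biz_lines, "business_id", ["stars", "review_count"])
row = build_row("u1", "b1", user_feats, biz_feats)
```

A list of such rows, together with the matching star ratings, is the kind of input that `XGBRegressor.fit(X, y)` expects.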
4.2.3. Hybrid recommendation system
Now that you have the results from previous models, you will need to choose a way from the slides to
combine them together and design a better hybrid recommendation system.
Here are two examples of hybrid systems:
Example1:
You can combine them together as a weighted average, which means:
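$$\text{final\_score} = \alpha \times \text{score}_{\text{item\_based}} + (1 - \alpha) \times \text{score}_{\text{model\_based}}$$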
The key idea is that the CF focuses on the neighbors of the item, while the model-based RS focuses on the user
and the item themselves. Specifically, if the item has a smaller number of neighbors, then the weight of the
CF should be smaller. Meanwhile, if two restaurants both have 4 stars but the first one has 10
reviews and the second one has 1000 reviews, the average star rating of the second one is more trustworthy, so
the model-based RS score should carry more weight. You may need to find other features to build your own
weight function to combine the two scores.
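A weighted-average hybrid along these lines could look like the sketch below. The weight function is purely hypothetical: it lets the CF weight grow with the number of neighbors up to a cap, and the saturation point of 20 neighbors and the cap of 0.5 are placeholder values to be tuned on the validation data.

```python
def hybrid_score(cf_score, model_score, alpha):
    """Weighted average: alpha * CF score + (1 - alpha) * model-based score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * cf_score + (1 - alpha) * model_score

def alpha_from_neighbors(num_neighbors, saturation=20, max_alpha=0.5):
    """Hypothetical weight: fewer neighbors give the CF score less influence."""
    return max_alpha * min(num_neighbors, saturation) / saturation

alpha = alpha_from_neighbors(10)        # 0.5 * 10 / 20 = 0.25
final = hybrid_score(4.0, 3.0, alpha)   # 0.25 * 4.0 + 0.75 * 3.0
```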
Example2:
You can combine them together as a classification problem:
Again, the key idea is: the CF focuses on the neighbors of the item and the model-based RS focuses on
the user and items themselves. As a result, in our dataset, some item-user pairs are more suitable for the
CF while the others are not. You need to choose some features to classify which model you should
choose for each item-user pair.
If you train a classifier, you are allowed to upload the pre-trained classifier model, named “model.md”, to
save running time on Vocareum. You can use the pickle library, the joblib library, or others if you want. Here is an
example: https://scikit-learn.org/stable/modules/model_persistence.html.
You also need to upload the training script, named “train.py”, so that we can verify your model.
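Saving and reloading a model with pickle can be sketched as follows. The "model" here is only a stand-in dictionary holding a single threshold, and the selection rule is a made-up example of choosing between the two scores per item-user pair; a real classifier object would be saved and loaded the same way.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a trained model object to disk (as train.py might do)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load the serialized model back (as task2_3.py might do)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def choose_score(num_neighbors, cf_score, model_score, model):
    """Hypothetical rule: trust CF only when the item has enough neighbors."""
    if num_neighbors >= model["min_neighbors_for_cf"]:
        return cf_score
    return model_score

stand_in_model = {"min_neighbors_for_cf": 5}
model_path = os.path.join(tempfile.gettempdir(), "model.md")
save_model(stand_in_model, model_path)
restored = load_model(model_path)
```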
Some possible features (other features may also work):
- Average stars of a user, average stars of a business, the variance of the review history of a user or a business.
- Number of reviews of a user or a business.
- Yelp account starting date, number of fans.
- The number of people who think a user’s review is useful/funny/cool. Number of compliments. (Be careful
with these features. For example, sometimes when I visit a horrible restaurant, I will give full stars
because I hope I am not the only one who wasted money and time here. Sometimes people are satirical.
:-))
Input format: (we will use the following commands to execute your code)
Case1:
spark-submit task2_1.py <train_file_name> <test_file_name> <output_file_name>
Param: train_file_name: the name of the training file (e.g., yelp_train.csv), including the file path
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case2:
spark-submit task2_2.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case3:
spark-submit task2_3.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder.
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Output format:
a. The output file is a CSV file containing the prediction results for each user and business pair in the
validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement for the
order in this task, and no requirement for the number of decimals for the prediction values. Please
refer to the format in Figure 3.
Figure 3: Output example in CSV for task2
Grading:
We will compare your prediction results against the ground truth. We will grade on all the cases in Task2
based on your accuracy using RMSE. For your reference, the table below shows the RMSE baselines and
running time for predicting the validation data. The time limit of case3 is set to 30 minutes because we
hope you consider this factor and try to improve on it as much as possible (hint: this will help you a lot in
the competition project at the end of the semester).
              Case 1   Case 2   Case 3
RMSE          1.09     1.00     0.99
Running Time  130s     400s     1800s
For grading, we will use the testing data to evaluate your recommendation systems. If you can pass the
RMSE baselines in the above table, you should be able to pass the RMSE baselines for the testing data.
However, if your recommendation system only passes the RMSE baselines for the validation data, you
will receive 50% of the points for each case.
5. Submission
You need to submit the following files on Vocareum with exactly the same names:
a. Four Python scripts:
● task1.py
● task2_1.py
● task2_2.py
● task2_3.py
b. [OPTIONAL] hw3.jar and four Scala scripts:
● task1.scala
● task2_1.scala
● task2_2.scala
● task2_3.scala
6. Grading Criteria
(% penalty = % penalty of possible points you get)
1. You can use your free 5-day extension separately or together. (Google Forms link for extension:
https://docs.google.com/forms/d/e/1FAIpQLSeSHzGWzPi3iuS-zNYyDLb-hhP4ancMEZgKDiwYZLmhyYhKFw/viewform)
2. There will be a 10% bonus if you use both Scala and Python.
3. We will combine all the code we can find from the web (e.g., GitHub) as well as other students’ code
from this and other (previous) sections for plagiarism detection. If plagiarism is detected, you will
receive no points for the entire assignment and we will report all detected plagiarism.
4. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise
you won’t receive points even if the answer is correct.
5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
6. Do NOT use Spark DataFrame, DataSet, or sparksql.
7. We can regrade your assignments within seven days once the scores are released. We will not accept
any regrading requests after a week. There will be a 20% penalty if our grading is correct.
8. There will be a 20% penalty for late submissions within a week and no points after a week.
9. The bonus for using Scala will be calculated only if your results from Python are correct. No
partial points are awarded for Scala. See the example below:
Example situations
Task   Score for Python               Score for Scala (10% of previous column if correct)   Total
Task1  Correct: 3 points              Correct: 3 * 10%                                      3.3
Task1  Wrong: 0 points                Correct: 0 * 10%                                      0.0
Task1  Partially correct: 1.5 points  Correct: 1.5 * 10%                                    1.65
Task1  Partially correct: 1.5 points  Wrong: 0                                              1.5
7. Common problems causing failed submissions on Vocareum/FAQ
(If your program seems to run successfully on your local machine but fails on Vocareum, please check these.)
1. Try your program in the Vocareum terminal. Remember to set the Python version to python3.6
and use the latest Spark:
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
2. Check the input command line format.
3. Check the output format, for example, the header, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py, task2_1.py, etc.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.
7. If you implement the core part in Python instead of Spark, or implement it in a way with high time
complexity (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because
it runs too slowly.
8. You are required to use only Spark RDD in order to understand Spark operations more deeply. You will
not get any points if you use Spark DataFrame or DataSet. Don’t import sparksql.
