MKT 566 – Fall 2025

Homework 4: Predicting Yelp Review Ratings

Overview

In this assignment, you will use machine learning models to predict whether a Yelp review rating is greater than 3 (positive) or less than or equal to 3 (negative) based on review text and metadata features.

This project mirrors a real-world marketing analytics task: understanding how customers' language and review patterns relate to satisfaction. You'll practice data preprocessing, feature engineering, model training, evaluation, and interpretation.

Objective

● Practice an end-to-end ML workflow: cleaning, feature engineering, training, and evaluating models.

● Interpret model results and communicate insights clearly.

● Understand the importance of hyperparameter tuning, model comparison, and feature importance.

Dataset

You will receive two files in JSONL (JSON Lines) format:

● train.jsonl: Contains review text, metadata, and ratings (this is the file you will use to train and test your model).

● test_no_stars.jsonl: Contains the same fields but without ratings (this dataset is for the bonus competition only).

Each observation represents a Yelp review and contains:

● Review-level fields: review_id, text, stars, date, useful, funny, cool

● User-level metadata: user_id, user_review_count, user_average_stars, user_fans

● Business-level metadata: business_id, business_name, business_city, business_state, business_stars, business_review_count

Your target variable will be binary: y = 1 if stars > 3, else 0.
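As a minimal sketch of loading JSON Lines data and deriving the binary target, the snippet below reads a three-row stand-in for train.jsonl from memory. The sample rows are invented for illustration, and only a few of the fields listed above are included; with the real file you would pass the path "train.jsonl" to pd.read_json instead.

```python
import io

import pandas as pd

# Invented three-row stand-in for train.jsonl (one JSON object per line).
# With the real data: df = pd.read_json("train.jsonl", lines=True)
sample = io.StringIO(
    '{"review_id": "r1", "text": "Great tacos!", "stars": 5}\n'
    '{"review_id": "r2", "text": "Cold food.", "stars": 3}\n'
    '{"review_id": "r3", "text": "Never again.", "stars": 1}\n'
)
df = pd.read_json(sample, lines=True)

# Binary target as defined in the assignment: 1 if stars > 3, else 0.
df["y"] = (df["stars"] > 3).astype(int)
print(df[["review_id", "stars", "y"]])
```

Note that a 3-star review falls in the negative class, since the rule is a strict "greater than 3".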

Tasks

1. Data Preprocessing

● Load and clean the dataset.

● Handle missing values and inconsistent entries.

● Prepare the data for modeling (e.g., tokenize text, encode categorical variables).
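One possible cleaning pass is sketched below. The helper name clean_reviews and the specific choices (dropping duplicate review_id rows, treating missing text as an empty string, and filling numeric gaps with the column median) are assumptions for illustration, not requirements of the assignment.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper: dedupe, normalize text, impute numerics."""
    out = df.drop_duplicates(subset="review_id").copy()
    # Treat missing review text as empty and trim stray whitespace.
    out["text"] = out["text"].fillna("").astype(str).str.strip()
    # Median-impute numeric columns, but never the target column "stars".
    numeric_cols = out.select_dtypes(include="number").columns.difference(["stars"])
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out.reset_index(drop=True)


# Toy data invented for this example.
raw = pd.DataFrame({
    "review_id": ["a", "a", "b", "c"],
    "text": ["  good ", "  good ", None, "bad"],
    "useful": [1.0, 1.0, None, 3.0],
})
clean = clean_reviews(raw)
print(clean)
```

In a stricter setup, the medians would be computed on the training split only and then applied to the test split, so that no test information leaks into preprocessing.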

2. Exploratory Data Analysis (EDA)

Perform EDA to understand the data before modeling.

Your EDA should include:

● At least five visualizations (e.g., distribution of ratings, word frequencies, category breakdowns).

● Summary statistics for key variables.

● Insights: discuss any interesting trends or relationships.

Make sure your submission displays the generated figures, not just the code that produces them.

3. Feature Engineering

● Create at least ten features, justified by your EDA findings. Five should be derived from the text data and five from the metadata.

● Possible examples:

○ Average user rating or review count.

○ Text features (review length, sentiment, word embeddings, TF-IDF).

○ Business category or location variables.

● Explain the rationale behind each feature.

● Scale or encode features appropriately.

● Note: feature engineering is crucial for model performance, so be thorough and creative! Using more features is encouraged as long as they are justified.
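The sketch below shows what a few text-derived and metadata-derived features could look like. The function name add_features and the particular features are illustrative assumptions; the column names come from the field list in the Dataset section, and your own features should follow from your EDA.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper adding a handful of example features."""
    out = df.copy()
    text = out["text"].fillna("")
    # Text-derived features.
    out["n_chars"] = text.str.len()                # review length in characters
    out["n_words"] = text.str.split().str.len()    # whitespace-separated tokens
    out["n_exclaim"] = text.str.count("!")         # crude emphasis signal
    # Metadata-derived features.
    out["log_user_review_count"] = np.log1p(out["user_review_count"])
    out["user_minus_business_stars"] = (
        out["user_average_stars"] - out["business_stars"]
    )
    return out


# Toy rows invented for this example.
toy = pd.DataFrame({
    "text": ["Loved it!!", ""],
    "user_review_count": [0, 9],
    "user_average_stars": [4.5, 2.0],
    "business_stars": [4.0, 3.5],
})
feats = add_features(toy)
print(feats.drop(columns="text"))
```

The log transform is one way to tame heavily skewed count variables; whether it helps here is something to check in your EDA rather than assume.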

4. Train/Test Split

● Split train.jsonl into 80% training and 20% testing.

● Explain why this split is relevant for model evaluation.

● Set a seed for reproducibility.
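A minimal sketch of a seeded 80/20 split with scikit-learn follows. The synthetic arrays stand in for your feature matrix and target, the seed value 42 is arbitrary, and stratify=y is an optional extra (not required by the assignment) that keeps the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 70 positive and 30 negative labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

# test_size=0.2 gives the 80/20 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape, y_test.mean())
```

Rerunning with the same random_state reproduces the same rows in each part, which is what makes reported metrics comparable across runs.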

5. Model Training (optimize for best AUC)

Train at least three classification models, including:

● Logistic Regression (mandatory benchmark)

● Two others (e.g., Random Forest, SVM, Gradient Boosting, XGBoost, etc.)

● As we discussed in class, tuning model hyperparameters is important for improving performance. In R, the caret library can help you with that; see its documentation for a description of the package. If you are using Python, you can use GridSearchCV from sklearn.model_selection for hyperparameter tuning.

● Include a short explanation of why tuning matters for model performance.

● Report key parameters used to train the model.
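As a hedged illustration of tuning the logistic regression benchmark for AUC, the sketch below runs GridSearchCV over the regularization strength C. The synthetic data from make_classification, the grid values, and the 5-fold setting are all assumptions chosen to keep the example small; they are not recommended settings for the Yelp data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own
# training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",  # select hyperparameters by cross-validated AUC
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to the other two models by swapping the final pipeline step and its parameter grid; grid.best_params_ gives the key parameters to report.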

6. Model Performance and Evaluation

Evaluate the model performance on the test set:

● Report Accuracy, Precision, Recall, and AUC (ROC) of each model in a well-structured table. Explain what each metric indicates about model performance.

● Plot the ROC curve for each model and interpret the results.

● Report and discuss feature importance.

● Explain tradeoffs between models (e.g., interpretability vs. performance).
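The four metrics can be computed as in the sketch below. The helper name evaluate and the 0.5 decision threshold are assumptions; the labels and probabilities are a tiny hand-made example. Accuracy, precision, and recall depend on the threshold, while AUC is computed from the predicted probabilities directly.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Hypothetical helper returning the four required metrics as a dict."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)  # hard labels for the first three
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "auc": roc_auc_score(y_true, y_prob),    # uses probabilities, not labels
    }


# Hand-made example: 3 positives, 3 negatives.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1]
metrics = evaluate(y_true, y_prob)
print({name: round(value, 3) for name, value in metrics.items()})
```

Calling evaluate once per model and collecting the dicts into a table (for example with pandas.DataFrame) gives the comparison layout the report asks for.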

Deliverables

Submit the following to Brightspace:

1. Report (PDF): This file must include the figures and tables generated in your analysis, along with explanations and interpretations. The report should be well-organized and clearly written. Use the following structure:

1. Introduction: Describe the problem and dataset.

2. EDA Findings: Key patterns and visualizations.

3. Feature Engineering: Created features and their description and rationale for creating them.

4. Model Training: Algorithms used, parameter tuning, and justification.

5. Model Evaluation: Results, metrics, and interpretations.

6. Conclusion: Main takeaways and future improvements.

2. Code: This file should contain all the code used for data preprocessing, EDA, feature engineering, model training, and evaluation. Ensure that the code is well-commented and organized for readability. You can use a notebook format as we have been using during the course (Jupyter Notebook for Python or R Markdown for R), or a script format (.py or .R).

IMPORTANT: If you write the report in R Markdown or Jupyter Notebook, make sure to knit/export it to PDF before submission. In either case, the report must be submitted as a PDF file.

3. Predictions File (for Bonus Competition only): A CSV file with your predictions on the test_no_stars.jsonl dataset, with columns review_id and predicted_probability (see details below).

Bonus Competition (up to +5 points)

After training your models, apply your best-performing model based on AUC to the provided test_no_stars.jsonl dataset (which does not include ratings) and generate predictions for each review.

How It Works

● Each student submits one predictions file, named HW4_predictions.csv, to Brightspace. It contains your predictions for test_no_stars.jsonl.

● After the deadline, these predictions will be evaluated against the true labels (which are held out).

● The evaluation metric is AUC (Area Under the ROC Curve), so make sure that your model is optimized for AUC during training and tuning.

● Recall that AUC is a number between 0 and 1, and a higher AUC means your model better distinguishes between positive and negative reviews.

Scoring and Bonus Points

● The top 5 students with the highest AUC scores receive bonus points:

○ 1st place: +5 points

○ 2nd place: +4 points

○ 3rd place: +3 points

○ 4th place: +2 points

○ 5th place: +1 point

● Bonus points are added to your HW4 grade but capped at 100% total.

● If there's a tie, all students tied at that position receive the same bonus value.

Format:

review_id,predicted_probability

12345,0.87

67890,0.12

● If you do not submit this file, you will not be considered for the bonus competition.

● If your predictions file is incorrectly formatted, you will be disqualified from the bonus competition.
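A minimal sketch of producing a file in the format above is shown here. The IDs and probabilities are the placeholder values from the format example; it assumes predicted_probability is the model's probability of the positive class (for a scikit-learn classifier, typically the second column of predict_proba). The sketch writes to an in-memory buffer so it runs anywhere; for the real submission you would write to "HW4_predictions.csv".

```python
import io

import pandas as pd

# Placeholder values taken from the format example above. With a fitted
# scikit-learn model, probs would typically be model.predict_proba(X)[:, 1].
review_ids = ["12345", "67890"]
probs = [0.87, 0.12]

submission = pd.DataFrame({
    "review_id": review_ids,
    "predicted_probability": probs,
})

# Basic sanity checks before writing: valid probabilities, no duplicate IDs.
assert submission["predicted_probability"].between(0, 1).all()
assert submission["review_id"].is_unique

# For the real file: submission.to_csv("HW4_predictions.csv", index=False)
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra column
print(buffer.getvalue())
```

Leaving out index=False would add an unnamed index column, which is an easy way to end up with an incorrectly formatted file.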
