
AD699: Data Mining for Business Analytics

Individual Assignment #5

Fall 2024

You will submit two files:

(1)  a PDF with your write-up, along with

(2)  the script you used to generate your results.

You may wish to take a look at the dendrograms_deep_dive.R file, which can be found on Blackboard in the “Datasets & Scripts” folder.

This article about weighted hierarchical clustering could also be helpful for demonstrating the general principle of adjusting the feature weightings in a model like the one that you’ll build here.  While the code syntax shown in the article is based in Python, you will be able to do something similar in R.

Here is a book chapter about visualizing clusters.  Again, this is in Python but you can easily adapt the principles to R.

For the clustering model, use the name variable as your dataframe’s row names to track the items.  You can assume that the price variable in this dataset refers to the estimated retail price of the item.

Task 1: Hierarchical Clustering

1.    Read the dataset shipping-data.csv into your R environment.

2.   What are your dataset’s dimensions?

3.   Using any method in R for this purpose, randomly sample 20 rows from the entire group.  Those are the rows that you’ll use for this clustering.  You may set any seed value before sampling the data to get these 20 (I highly recommend using *some* seed value here, because otherwise you’ll get totally different results each time you run your script).
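
One way to handle steps 1–3 (a minimal sketch; the seed value 699 is arbitrary):

    # read the data and report its dimensions
    shipping <- read.csv("shipping-data.csv")
    dim(shipping)

    # fix the seed so the 20-row sample is reproducible, then draw it
    set.seed(699)
    shipping_sample <- shipping[sample(nrow(shipping), 20), ]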

4.   If any rows in your data sample contain NAs, just drop those rows entirely.

5.   After reading the dataset description, take a look at your data, either with the head() function or the View() function.  Should your numeric variables be scaled? Why or why not?  If so, then scale your data’s numeric variables.
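
A minimal sketch for steps 4–5, assuming you decide the numeric variables should be scaled (it also sets the name column as the row names, per the note above; this assumes the sampled names are unique):

    # drop any sampled rows that contain NAs
    shipping_sample <- na.omit(shipping_sample)

    # use each item's name as its row name so dendrogram leaves are labeled
    row.names(shipping_sample) <- shipping_sample$name

    # scale only the numeric columns; scale() returns a matrix,
    # so convert back to a data frame
    num_cols <- sapply(shipping_sample, is.numeric)
    shipping_scaled <- as.data.frame(scale(shipping_sample[, num_cols]))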

6.   Build a hierarchical clustering model for the dataset, using any method for inter-cluster dissimilarity (if you’re not sure which one to choose, you can experiment with the options from the textbook chapter).  Do not use any of your non-numeric variables to build this.  (A code sketch covering this step and its sub-parts appears after item f below.)

a.   Create and display a dendrogram for your model.  Be sure that you have done this in a way that displays the ‘name’ of each item.

b.   By looking at your dendrogram, how many clusters do you see here?  (There is not a single correct answer to this question, and not all people will answer it the same way; just describe the number of clusters that seem to be showing here.)

c.   Use the cutree() function to cut the records into clusters.  Specify your desired number of clusters, and show the resulting cluster assignments for each item.

d.   Attach the assigned cluster numbers back to the original dataset.  Use group_by() and summarize_all() from dplyr to generate per-cluster summary stats, and write 2-3 sentences about what you find.  What stands out here?  What do you notice about any unusual variables or clusters?

e.   Make any three simple visualizations to display the results of your clustering model.  Be sure that the variables depicted in your visualizations are actual variables from your dataset.  Simple visualizations can include things like scatterplots, barplots, histograms, boxplots, etc.

f.    Choose any item from among your sample.  What cluster did it fall into?  Write 2-3 sentences about the other members of its cluster (or if it’s a singleton, write a bit about why it is a singleton).
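
A minimal sketch covering steps 6a–6f; the linkage method, the value of k, and the plotted variable are illustrative placeholders, not required choices:

    library(dplyr)
    library(ggplot2)

    # cluster on Euclidean distances; "complete" linkage is one of several
    # valid options (single, average, ward.D2, ...)
    hc <- hclust(dist(shipping_scaled), method = "complete")

    # 6a: dendrogram; the row names (item names) label the leaves
    plot(hc, main = "Hierarchical clustering (unweighted)", cex = 0.8)

    # 6c: cut into k clusters, choosing k from what the dendrogram shows
    k <- 4                               # placeholder value
    clusters <- cutree(hc, k = k)
    clusters

    # 6d: attach the assignments and summarize each cluster
    shipping_sample$cluster <- clusters
    shipping_sample %>%
      select(where(is.numeric)) %>%
      group_by(cluster) %>%
      summarize_all(mean)

    # 6e: one example of a simple visualization (price by cluster)
    ggplot(shipping_sample, aes(x = factor(cluster), y = price)) +
      geom_boxplot()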

7.    In a previous step, you made the case for standardizing the variables.  Now they’re all on equal footing… but why might it be problematic to view these variables with equal weight?  (Note: You do not need any domain knowledge about worldwide shipping in order to answer this.)

8.   Now it’s time to fix that problem!  Come up with your own weighting system for these variables, and apply it here.  Be sure that you are working with a dataframe (if the data is not a dataframe, you can quickly fix that with as.data.frame()).  Multiply each column by the weight that you have assigned to it.  Please note that there are no rules to the weighting system: the weights do not need to add up to any particular value.

a.   Explain the weighting system in a short paragraph.  There is no single *right* or *wrong* way to do this, but your answer to this question should demonstrate that you’ve taken some time to put some thought into it.  One sentence per variable is enough to explain the weighting system.
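
A minimal sketch of the weighting step.  Apart from price, the column names below are hypothetical placeholders; substitute your dataset’s actual columns and your own weights:

    # make sure we are working with a data frame
    shipping_scaled <- as.data.frame(shipping_scaled)

    # hypothetical weights: replace the names and values with your own system
    wts <- c(price = 3, weight = 2, rating = 1)

    # multiply each column by its assigned weight
    shipping_weighted <- as.data.frame(
      sweep(as.matrix(shipping_scaled[names(wts)]), 2, wts, `*`)
    )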

9.   Now, generate  one more dendrogram, using your newly-rescaled set of variables (be sure that you’re not accidentally using the cluster assignments from a previous step as a clustering variable here).

a.   Once more, provide some description of what you see, and whether there are any noteworthy changes between this and the other dendrogram.

b.   Just as you did after the first hierarchical clustering, use the cutree() function to cut the records into clusters.  Specify your desired number of clusters, and show the resulting cluster assignments for each item.

c.   Attach the cluster assignments back to the original dataset.  Use group_by() and summarize_all() from dplyr to generate per-cluster summary stats, and write 2-3 sentences about what you find.

d.   Let’s check back in on that item that you selected during a previous step. Where is that item now, with this new model?  What else is in the same cluster?  In a few sentences, talk about what changed, and why, regarding this item’s cluster assignment.
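
A minimal sketch for step 9, reusing the weighted variables; note that the earlier cluster assignments are not among these columns:

    # re-cluster on the weighted variables only
    hc_w <- hclust(dist(shipping_weighted), method = "complete")
    plot(hc_w, main = "Hierarchical clustering (weighted)")

    # 9b/9c: cut, re-attach, and summarize as before
    clusters_w <- cutree(hc_w, k = 4)    # placeholder k
    shipping_sample$cluster_w <- clusters_w
    shipping_sample %>%
      select(where(is.numeric)) %>%
      group_by(cluster_w) %>%
      summarize_all(mean)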

Task 2: Text Mining

The book “Tidy Text Mining” can be a useful resource here.  So can the video on Blackboard in the Text Mining folder.

1.   Load the Office dataset into your R environment.  Filter the dataset so that it only contains your season and episode (please see attached file in Blackboard for the specific seasons and episodes).
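
A minimal sketch, assuming the posted file is a CSV with season and episode columns; the file name and the season/episode values are placeholders for your assigned ones:

    library(dplyr)

    office <- read.csv("the-office-lines.csv")   # placeholder file name
    my_episode <- office %>%
      filter(season == 3, episode == 10)         # placeholder values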

2.  Who are the ten characters, or entities*, that had the most lines in your episode? Create a barplot that depicts the count values for each of these top 10.  Orient this barplot horizontally, and be sure to order your bars, either from highest to lowest or lowest to highest.  (*If there are lines delivered by multiple characters together, they could count as their own group, or ‘entity’.)

a.   In a couple of sentences, what insights could be drawn from this plot?
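
A minimal sketch for the barplot, assuming the character column is named speaker:

    library(ggplot2)

    my_episode %>%
      count(speaker, sort = TRUE) %>%          # lines per character/entity
      slice_max(n, n = 10) %>%                 # keep the top 10
      ggplot(aes(x = reorder(speaker, n), y = n)) +
      geom_col() +
      coord_flip() +                           # horizontal orientation
      labs(x = "character", y = "number of lines")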

3.  Using the select() function from dplyr, generate a new dataframe for your episode that contains only the line_text column.

a.  Next, create a tidy version of your episode text.  Using unnest_tokens() should help you to convert your text into a dataframe in which each word occupies its own row.
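
A minimal sketch for steps 3 and 3a:

    library(tidytext)

    # keep only the dialogue column
    episode_lines <- my_episode %>%
      select(line_text)

    # tokenize: one word per row
    episode_words <- episode_lines %>%
      unnest_tokens(word, line_text)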

4.  What were the 10 most frequently used words in your episode? Show the code that you used to answer this question, along with your results.

a.  Why is this list of very limited value for any kind of analysis?

b.  Now, use the anti_join() function to remove stopwords.  Show the code that you used to do this. With the stopwords removed, what are the 10 most common words in your episode? Show them here.

c.  Do this again, but with bigrams instead of unigrams.

i.    How are bigrams different from unigrams?

ii.    How might bigram analysis yield different results than unigram analysis, in general?

d.  Write 1-2 sentences that speculate about why it might be useful/interesting to see this list of the most frequently-used words from your episode (you could mention the bigrams or the unigrams here).  What could someone do with it?  Use your imagination and creativity to answer this, and be sure to reference specific words from your episode.
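
A minimal sketch for step 4 and its sub-parts, continuing from the objects above (the stop_words list ships with tidytext):

    # 4: ten most frequent words, stopwords included
    episode_words %>%
      count(word, sort = TRUE) %>%
      slice_max(n, n = 10)

    # 4b: remove stopwords, then count again
    data("stop_words")
    episode_words %>%
      anti_join(stop_words, by = "word") %>%
      count(word, sort = TRUE) %>%
      slice_max(n, n = 10)

    # 4c: bigrams, i.e. two-word sequences instead of single words
    my_episode %>%
      select(line_text) %>%
      unnest_tokens(bigram, line_text, token = "ngrams", n = 2) %>%
      count(bigram, sort = TRUE) %>%
      slice_max(n, n = 10)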

5.   Generate a wordcloud based on your episode.  You may use any wordcloud package in R, and you may set this up any way you wish to.

a.   What does your wordcloud show you?  Describe it in a sentence or two.
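
One way to set this up with the wordcloud package (a sketch; max.words and the other settings are entirely up to you):

    library(wordcloud)

    word_counts <- episode_words %>%
      anti_join(stop_words, by = "word") %>%
      count(word, sort = TRUE)

    wordcloud(words = word_counts$word, freq = word_counts$n,
              max.words = 50, random.order = FALSE)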

6.   Next, let’s do some sentiment analysis.  We will use the bing lexicon for this purpose.

a.   What 10 words made the biggest sentiment contributions in your episode?  Show the code that you used to find this, along with your results.

b.   Of these top 10 words, how many were positive?  How many were negative?

c.   In a sentence or two, speculate about what this list suggests about your episode.
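
A minimal sketch for the bing analysis; get_sentiments() comes from tidytext:

    # join each word to its bing sentiment, then rank by contribution
    bing_counts <- episode_words %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(word, sentiment, sort = TRUE)

    slice_max(bing_counts, n, n = 10)    # 6a: the top 10 contributors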

7.    Now let’s take a look at how a different sentiment lexicon would view your episode.  Bring the afinn lexicon into your environment, and join it with the text from your episode.  Show the step(s) you used to do this.

a.   What were the three ‘worst’ words in your episode?  What were the three ‘best’ words in your episode?

b.   Sum all the values for your episode. What was the total?

c.   What does this sum suggest about your episode?  Why might this be helpful...but why might it also be incomplete or even misleading?
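
A minimal sketch for step 7; the afinn lexicon is fetched with get_sentiments(), and the first call may prompt a one-time download via the textdata package:

    afinn <- get_sentiments("afinn")

    # join afinn scores (they run from -5 to +5) to the episode's words
    episode_afinn <- episode_words %>%
      inner_join(afinn, by = "word")

    # 7a: the 'worst' and 'best' words by score
    episode_afinn %>% distinct(word, value) %>% slice_min(value, n = 3)
    episode_afinn %>% distinct(word, value) %>% slice_max(value, n = 3)

    # 7b: total sentiment for the episode
    sum(episode_afinn$value)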

