AD699: Data Mining for Business Analytics
Individual Assignment #5
Fall 2024
You will submit two files:
(1) a PDF with your write-up, along with
(2) the script you used to generate your results.
You may wish to take a look at the dendrograms_deep_dive.R file, which can be found on Blackboard in the “Datasets & Scripts” folder.
This article about weighted hierarchical clustering could also be helpful for demonstrating the general principle of adjusting the feature weightings in a model like the one that you’ll build here. While the code shown in the article is written in Python, you will be able to do something similar in R.
Here is a book chapter about visualizing clusters. Again, this is in Python but you can easily adapt the principles to R.
For the clustering model, use the name variable as your dataframe’s row names to track the items. You can assume that the price variable in this dataset refers to the estimated retail price of the item.
Task 1: Hierarchical Clustering
1. Read the dataset shipping-data.csv into your R environment.
2. What are your dataset’s dimensions?
3. Using any method in R for this purpose, randomly sample 20 rows from the entire group. Those are the rows that you’ll use for this clustering. You may set any seed value before sampling the data to get these 20 (I highly recommend using *some* seed value here, because otherwise you’ll get totally different results each time you run your script).
4. If any rows in your data sample contain NAs, just drop those rows entirely.
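Steps 2 through 4 can be sketched on a toy dataframe as follows. This is only a minimal illustration, not the expected solution: the toy data stands in for shipping-data.csv, the weight column and the seed value 699 are assumptions made up for the example, and na.omit() is just one of several ways to drop incomplete rows.

```r
# Toy stand-in for the real file; `weight` is a hypothetical column.
set.seed(699)  # any fixed seed makes the sample reproducible
toy <- data.frame(
  name   = paste0("item_", 1:50),
  price  = round(runif(50, 5, 500), 2),
  weight = round(runif(50, 0.1, 40), 1)
)
toy$weight[c(3, 17, 42)] <- NA  # inject a few NAs for illustration

dim(toy)  # number of rows, then number of columns

# Draw 20 row indices without replacement, then keep only those rows
sampled <- toy[sample(nrow(toy), 20), ]

# Drop any sampled row that contains at least one NA
sampled <- na.omit(sampled)
```

Because some of the 20 sampled rows may contain NAs, the final sample can end up with fewer than 20 rows.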
5. After reading the dataset description, take a look at your data, either with the head() function or the View() function. Should your numeric variables be scaled? Why or why not? If so, then scale your data’s numeric variables.
6. Build a hierarchical clustering model for the dataset, using any method for inter-cluster dissimilarity (if you’re not sure which one to choose you can experiment with the options from the textbook chapter). Do not use any of your non-numeric variables to build this.
a. Create and display a dendrogram for your model. Be sure that you have done this in a way that displays the ‘name’ of each item.
b. By looking at your dendrogram, how many clusters do you see here? (There is not a single correct answer to this question, and not all people will answer it the same way. Just describe the number of clusters that seem to be showing here.)
c. Use the cutree() function to cut the records into clusters. Specify your desired number of clusters, and show the resulting cluster assignments for each item.
d. Attach the assigned cluster numbers back to the original dataset. Use group_by() and summarize_all() from dplyr to generate per-cluster summary stats, and write 2-3 sentences about what you find. What stands out here? What do you notice about any unusual variables or clusters?
e. Make any three simple visualizations to display the results of your clustering model. Be sure that the variables depicted in your visualizations are actual variables from your dataset. Simple visualizations can include things like scatterplots, barplots, histograms, boxplots, etc.
f. Choose any item from among your sample. What cluster did it fall into? Write 2-3 sentences about the other members of its cluster (or if it’s a singleton, write a bit about why it is a singleton).
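The general workflow in steps 5 and 6 might look something like the sketch below. It runs on a tiny made-up dataframe, so the weight column, the choice of average linkage, and k = 3 are all assumptions for illustration rather than recommendations. Note that hclust() takes its leaf labels from the row names, which is why name is copied there first.

```r
library(dplyr)

# Made-up items; `weight` is a hypothetical numeric column.
items <- data.frame(
  name   = paste0("item_", 1:8),
  price  = c(10, 12, 11, 300, 310, 295, 50, 55),
  weight = c(1.0, 1.2, 0.9, 20, 22, 19, 5, 6),
  stringsAsFactors = FALSE
)
rownames(items) <- items$name  # row names become dendrogram labels

# Standardize numeric columns only (z-scores); row names are preserved
num_scaled <- scale(items[, c("price", "weight")])

# Euclidean distances, then agglomerative clustering with average linkage
hc <- hclust(dist(num_scaled), method = "average")
plot(hc, main = "Toy dendrogram", xlab = "", sub = "")

# Cut the tree into 3 clusters and attach the labels to the original data
clusters <- cutree(hc, k = 3)
items$cluster <- clusters

# Per-cluster means of every remaining (numeric) column
cluster_summary <- items %>%
  select(-name) %>%
  group_by(cluster) %>%
  summarize_all(mean)
print(cluster_summary)
```

The summary is computed on the original, unscaled values so that the per-cluster means stay in interpretable units.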
7. In a previous step, you made the case for standardizing the variables. Now they’re all on equal footing… but why might it be problematic to view these variables with equal weight? (Note: You do not need any domain knowledge about worldwide shipping in order to answer this)
8. Now it’s time to fix that problem! Come up with your own weighting system for these variables, and apply it here. Be sure that you are working with a dataframe (if the data is not a dataframe, you can quickly fix that with as.data.frame()). Multiply each column by the weight that you have assigned to it. Please note that there are no rules to the weighting system: the weights do not need to add up to any particular value.
a. Explain the weighting system in a short paragraph. There is no single *right* or *wrong* way to do this, but your answer to this question should demonstrate that you’ve taken some time to put some thought into it. One sentence per variable is enough to explain the weighting system.
9. Now, generate one more dendrogram, using your newly-rescaled set of variables (be sure that you’re not accidentally using the cluster assignments from a previous step as a clustering variable here).
a. Once more, provide some description of what you see, and whether there are any noteworthy changes between this and the other dendrogram.
b. Just as you did after the first hierarchical clustering, use the cutree() function to cut the records into clusters. Specify your desired number of clusters, and show the resulting cluster assignments for each item.
c. Attach the cluster assignments back to the original dataset. Use group_by() and summarize_all() from dplyr to generate per-cluster summary stats, and write 2-3 sentences about what you find.
d. Let’s check back in on that item that you selected during a previous step. Where is that item now, with this new model? What else is in the same cluster? In a few sentences, talk about what changed, and why, regarding this item’s cluster assignment.
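The column-weighting idea from steps 8 and 9 can be sketched as below. The data, the weight column, and the weight values (2 and 0.5) are invented purely to show the mechanics; your own weights should come from your own reasoning about the variables.

```r
# Made-up numeric data, standardized and converted back to a dataframe
scaled <- as.data.frame(scale(data.frame(
  price  = c(10, 12, 11, 300, 310, 295, 50, 55),
  weight = c(1.0, 1.2, 0.9, 20, 22, 19, 5, 6)   # hypothetical column
)))
rownames(scaled) <- paste0("item_", 1:8)

# Illustrative weights only; they do not need to sum to anything
w <- c(price = 2, weight = 0.5)

# Multiply each column by its weight (matched by column name)
weighted <- as.data.frame(
  sweep(as.matrix(scaled), 2, w[colnames(scaled)], `*`)
)

# Re-cluster on the weighted columns only (no old cluster labels included)
hc_w <- hclust(dist(weighted), method = "average")
plot(hc_w, main = "Toy dendrogram (weighted)", xlab = "", sub = "")
clusters_w <- cutree(hc_w, k = 3)
```

After standardizing, every column has a standard deviation of 1, so multiplying by a weight directly sets how much that column can contribute to the Euclidean distance.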
Task 2: Text Mining
The book “Tidy Text Mining” can be a useful resource here. So can the video on Blackboard in the Text Mining folder.
1. Load the Office dataset into your R environment. Filter the dataset so that it only contains your season and episode (please see attached file in Blackboard for the specific seasons and episodes).
2. Who are the ten characters, or entities*, that had the most lines in your episode? Create a barplot that depicts the count values for each of these top 10. Orient this barplot horizontally, and be sure to order your bars, either from highest to lowest or lowest to highest. (*If a line is delivered by multiple characters together, that group can count as its own ‘entity’.)
a. In a couple of sentences, what insights could be drawn from this plot?
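One possible way to count lines per speaker and draw an ordered horizontal barplot is sketched below in base R. The speaker column name and the toy rows are assumptions, since the real dataset's column names may differ, and ggplot2 would work equally well.

```r
# Made-up episode rows; `speaker` is a hypothetical column name.
ep <- data.frame(
  speaker   = c("A", "B", "A", "C", "A", "B", "A and B"),
  line_text = c("one", "two", "three", "four", "five", "six", "seven"),
  stringsAsFactors = FALSE
)

# Count lines per speaker and convert to a plain named vector
counts <- table(ep$speaker)
counts_vec <- setNames(as.integer(counts), names(counts))

# Keep the (up to) 10 largest counts, highest first
top10 <- head(sort(counts_vec, decreasing = TRUE), 10)

# barplot() draws bottom-to-top, so reverse to put the largest bar on top
barplot(rev(top10), horiz = TRUE, las = 1, xlab = "Number of lines")
```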
3. Using the select() function from dplyr, generate a new dataframe for your episode that only contains the line_text column.
a. Next, create a tidy version of your episode text. Using unnest_tokens() should help you to convert your text into a dataframe in which each word occupies its own row.
4. What were the 10 most frequently used words in your episode? Show the code that you used to answer this question, along with your results.
a. Why is this list of very limited value for any kind of analysis?
b. Now, use the anti_join() function to remove stopwords. Show the code that you used to do this. With the stopwords removed, what are the 10 most common words in your episode? Show them here.
c. Do this again, but this time with bigrams instead of unigrams.
i. How are bigrams different from unigrams?
ii. How might bigram analysis yield different results than unigram analysis, in general?
d. Write 1-2 sentences that speculate about why it might be useful/interesting to see this list of the most frequently-used words from your episode (you could mention the bigrams or the unigrams here). What could someone do with it?
Use your imagination and creativity to answer this and be sure to reference your specific words from your episode.
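The tokenizing, counting, stopword removal, and bigram steps in items 3 and 4 could be sketched as follows with dplyr and tidytext. The three lines of text are invented for the example and are not from the Office dataset; stop_words is the stopword table that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Made-up lines standing in for one episode's line_text column
lines_df <- tibble(line_text = c(
  "I am going to the office today",
  "The office party is going to be great",
  "That party was a great party"
))

# One row per word (lowercased, punctuation stripped by default)
tidy_words <- lines_df %>% unnest_tokens(word, line_text)

# Most frequent words, stopwords included
top_all <- tidy_words %>% count(word, sort = TRUE) %>% head(10)

# Same count after dropping every word found in stop_words
top_clean <- tidy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)

# One row per pair of adjacent words within a line, then counted
top_bigrams <- lines_df %>%
  unnest_tokens(bigram, line_text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  head(10)
```

Stopwords are not removed from the bigrams here; whether and how to filter them is a design choice you would need to make and explain.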
5. Generate a wordcloud based on your episode. You may use any wordcloud package in R, and you may set this up any way you wish to.
a. What does your wordcloud show you? Describe it in a sentence or two.
6. Next, let’s do some sentiment analysis. We will use the bing lexicon for this purpose.
a. What 10 words made the biggest sentiment contributions in your episode? Show the code that you used to find this, along with your results.
b. Of these top 10 words, how many were positive? How many were negative?
c. In a sentence or two, speculate about what this list suggests about your episode.
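A rough sketch of how words might be matched against the bing lexicon and counted is shown below. The word list is invented, and the approach (an inner join followed by count()) is one common pattern rather than the only acceptable one.

```r
library(dplyr)
library(tidytext)

# Made-up tidy word column standing in for one episode
tidy_words <- tibble(
  word = c("great", "party", "terrible", "great", "love", "awkward")
)

# Keep only words that appear in the bing lexicon, then count each
# word together with its positive/negative label
bing_counts <- tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

head(bing_counts, 10)
```

Words that are absent from the lexicon, such as "party" in this toy example, simply drop out of the join.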
7. Now let’s take a look at how a different sentiment lexicon would view your episode. Bring the afinn lexicon into your environment, and join it with the text from your episode. Show the step(s) you used to do this.
a. What were the three ‘worst’ words in your episode? What were the three ‘best’ words in your episode?
b. Sum all the values for your episode. What was the total?
c. What does this sum suggest about your episode? Why might this be helpful...but why might it also be incomplete or even misleading?
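The join-and-sum logic in item 7 can be illustrated with a small stand-in lexicon. The scores below are invented for the example and should not be read as actual AFINN values; in your own script the lexicon would come from get_sentiments("afinn"), which, as far as I know, relies on the textdata package and a one-time download.

```r
library(dplyr)

# Made-up tidy word column standing in for one episode
tidy_words <- tibble(
  word = c("great", "party", "terrible", "great", "love", "awkward")
)

# Hypothetical numeric lexicon (illustrative scores, NOT real AFINN values)
toy_lexicon <- tibble(
  word  = c("great", "terrible", "love", "awkward"),
  value = c(3, -3, 3, -2)
)

# Keep only words with a score, one row per occurrence
scored <- inner_join(tidy_words, toy_lexicon, by = "word")

# Lowest- and highest-scoring distinct words
worst <- scored %>% distinct(word, value) %>% arrange(value) %>% head(3)
best  <- scored %>% distinct(word, value) %>% arrange(desc(value)) %>% head(3)

# Overall total across every scored occurrence
total <- sum(scored$value)
```

Because the total counts every occurrence, a single frequently repeated word can dominate the sum, which is one reason the figure can be misleading on its own.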