Storytelling Prep Report
The goal is to present your business understanding and data understanding for your intended project. All CRISP-DM phases will be included later in your storytelling presentation. This is an individual assignment.
You are free to use any tools such as Excel, Mathematica, Matlab, Minitab, Python, R, SAS, SPSS, SQL, Tableau, etc.
Before you start, I recommend watching top storytelling presentations in:
Storytelling Hall of Fame
Guidelines
Dataset
You can use one or more dataset(s) of your choice. A few good places to find one are kaggle.com, data.gov, healthdata.gov, archive.ics.uci.edu, or kdnuggets.com. However, your dataset should:
● include at least 10 variables, not counting ID columns and unary (or nearly unary) variables.
● include at least 1,000 records (rows).
● be new to you. Using a dataset from your prior project is prohibited.
Any submission with a dataset that fails to meet these conditions will not be accepted.
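If you work in Python, the size conditions above can be checked with a short script before you commit to a dataset. The sketch below is only an illustration using the standard library. The function names, the id_columns argument, and the 95% cutoff for "nearly unary" are assumptions made for this example, since no exact threshold is fixed here. It expects the rows as a list of dictionaries, such as those produced by csv.DictReader.

```python
from collections import Counter


def usable_variables(rows, id_columns=(), near_unary_share=0.95):
    """Return the columns that are neither IDs nor (nearly) unary.

    A column is treated as nearly unary when its most frequent value
    covers at least `near_unary_share` of the rows. The 0.95 default
    is an assumed cutoff, not an official one.
    """
    if not rows:
        return []
    usable = []
    for col in rows[0].keys():
        if col in id_columns:
            continue  # ID columns do not count toward the minimum
        counts = Counter(row[col] for row in rows)
        top_share = counts.most_common(1)[0][1] / len(rows)
        if top_share < near_unary_share:
            usable.append(col)
    return usable


def meets_requirements(rows, id_columns=(), min_vars=10, min_rows=1000):
    """Check the two size conditions: enough rows and enough usable variables."""
    enough_rows = len(rows) >= min_rows
    enough_vars = len(usable_variables(rows, id_columns)) >= min_vars
    return enough_rows and enough_vars
```

To use it, load your file with csv.DictReader, pass the resulting list of rows, and name your ID column(s) in id_columns. Whether the dataset is new to you is something only you can confirm.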
The goal is to bring interesting insights of your own. To that end, I highly recommend choosing a larger dataset (e.g., with 20 columns and 10,000 rows), as long as you can manage it. With more columns, you may find more influential features, leading to more accurate models and richer insights. With more rows, you have more flexibility in partitioning the data, which helps if overfitting becomes a problem. Please avoid datasets in which many columns are highly multicollinear: most of those columns may have to be removed later, leaving few influential features. I also recommend choosing a dataset from your field, major, or area of interest, so that your business understanding is strong. You may want to start with some quick EDA on a few candidate datasets, see whether you can find interesting initial insights, and then choose one of them.
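One quick way to screen a candidate dataset for multicollinearity is to list the pairs of numeric columns whose Pearson correlation is very high. The following is a minimal sketch, not a prescribed method: the function names and the 0.9 threshold are assumptions for illustration, and pairwise correlation only catches collinearity between two columns at a time.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation is undefined, report 0
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps a column name to its list of numeric values.
    The 0.9 default is an assumed cutoff; choose one that suits your data.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

If many pairs come back, the dataset may leave you with few independent features after variable selection. Categorical columns need a different measure, such as a chi-square test.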
Format
There is no absolute limit to the length of the report. A typical report includes about 2 to 4 pages for the core content. Your report should be followed by an appendix of references, charts, tables, outputs, etc. The appendix may be as long as necessary, but make sure to refer to it when needed during the presentation. If it is too long, consider submitting it in a separate document.
Your report should cover the business understanding and data understanding phases. Below are the grading points I consider.
● Business understanding
○ Briefly describe the business context of the client. Note: If you do not have a direct client, assume potential clients who might use your analysis. For deeper business understanding, I recommend reading some articles on the domain.
○ Propose a few research questions and discuss why they are important to the client(s).
● Data understanding
○ Discuss the results of EDA. Note: EDA is never enough. Do extensive EDA and include the details in the appendix.
○ Variable selection – Discuss variable importance and multicollinearity. Note: Consider correlations, chi-square tests, etc.
○ Report skewed variables and their skewness.
○ Report outliers or confirm that your dataset is free of them. Note: I recommend box plots. Are they genuine outliers or data errors?
○ Data quality – Report data errors or duplicate records, or confirm that your dataset is free of them.
○ Report missing values or confirm that your dataset is free of them.
○ Identify promising subsets of rows, if needed. Note: Your business understanding may help. I also recommend clustering.
● Appendix — Note: Appendix is required, not optional.
○ Include references, charts, tables, outputs, etc.
○ The appendix may be as long as necessary, but make sure to refer to it when needed.
○ If it is too long, consider submitting it in a separate document.
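Several of the data understanding checks listed above (skewness, outliers, missing values, duplicate records) can be computed in a few lines. The sketch below uses only Python's standard library and is meant as a starting point under stated assumptions: the function names are hypothetical, the tokens treated as missing ("", "NA", None) are a guess about how your file encodes gaps, and the outlier rule is the common box-plot convention of 1.5 times the IQR beyond the quartiles.

```python
from statistics import mean, pstdev, quantiles


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return 0.0  # constant column has no skew
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Values beyond k * IQR from the quartiles (the box-plot whisker rule).

    Needs at least two values. Quartiles use the default method of
    statistics.quantiles, so results may differ slightly from other tools.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def missing_counts(rows, missing_tokens=("", "NA", None)):
    """Count missing entries per column; `missing_tokens` is an assumption."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            if val in missing_tokens:
                counts[col] = counts.get(col, 0) + 1
    return counts


def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```

These functions only flag candidates. Deciding whether a flagged value is a genuine outlier or a data error still requires looking at the record and using your business understanding.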
Submission
Submit a Word (or PDF) file and raw data to: Modules => Storytelling => Storytelling Prep Report.
● Use a font size of at least 11 points and line spacing of at least 1.15.
● Identify the project title and include your name at the top of the first page.
● Have your raw data file(s) ready. Note: Failure to submit the data file(s) will result in a zero mark.
● Name your file as: Storytelling_Prep_Report_First_Name_Last_Name. So, if your name is Satoshi Nakamoto, please name your files as: Storytelling_Prep_Report_Satoshi_Nakamoto.docx and Storytelling_Raw_Data_Satoshi_Nakamoto.xlsx (or .csv)
● Submit files. Links to external documents are not accepted.
● Unless you get my pre-approval in cases of technical or other issues, email submissions are not allowed.
Grading
Your report will be graded out of 3 points according to the criteria (and their relative importance) below.
● Information selection and organization (2)
○ Information amount ‒ Provides an ideal quantity of information.
○ Information relevance ‒ Includes relevant and necessary information at an appropriate level of detail.
○ Information sequence ‒ Sequences information logically and persuasively.
○ Information flow ‒ Uses linking statements to create a good flow of ideas.
● Design and delivery (1)
○ Engaging ‒ Captures and holds audience attention.
○ Professional ‒ Meets or exceeds professional expectations for the modality (written communication).
○ Visual aids ‒ Effectively designs and uses visual aids for message and audience.
○ Error-free ‒ Has no preventable mistakes or errors.
How to import a dataset into SAS EM
You can refer to the Income Analysis project in SAS EM Tutorial 1, but simpler instructions are given below.
● If your dataset is in Excel format, convert it to CSV format.
● Copy the CSV file into the Documents folder on OneDrive.
● Open the diagram of your SAS EM project.
● Drag a File Import node from the Sample tab into the diagram.
● Click the File Import node.
● In the Train property, click the ellipsis (…) next to Import File and select the path to the file.
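Before the File Import step, it can help to confirm that the converted CSV file parses cleanly and still has the expected columns and row count. A minimal sketch using Python's standard csv module follows. The name summarize_csv is a hypothetical helper, and the UTF-8 encoding is an assumption about how the file was saved.

```python
import csv


def summarize_csv(path, encoding="utf-8"):
    """Return (header, number_of_data_rows) for a CSV file.

    The first line is taken as the header; blank lines are not counted.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        n_rows = sum(1 for row in reader if row)
    return header, n_rows
```

If the header or the row count does not match what you saw in Excel, the conversion probably went wrong (for example, because of a different delimiter or encoding), and it is easier to fix that before importing the file.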