ECON30130, Spring Semester 2025
Data Science Project Guidelines
Rules & Guidelines
Ground Rules
This assignment counts for 25% of your final grade. You have to perform an econometric analysis using R, and write up your answers using Word, LaTeX, R Markdown, or similar. The rules are as follows:
• You have been given access to a restricted-use dataset, the Healthy Ireland Survey. Please refer to the “Data Access and Security” document on Brightspace for details on locating it in Google Drive. Do not upload the dataset to any website, especially ChatGPT or similar.
• The goal of the project is for you to conduct an econometric analysis using this dataset. More details below.
• You can do the project by yourself or in groups. The groups have been allocated as per Part 2 on Brightspace.
• The deadline for the Project is 5pm on Friday, April 18th. A late penalty of ten percentage points per 24-hour period will be applied. Don’t delay starting.
• Submissions should be in one PDF, and should include: 1) the write-up of the assignment, 2) the R code. Papers are to be submitted on Brightspace → Assessment → Assignments. Aim for about 6–10 pages total.
• If students work in a group, only one group member should submit the paper on Brightspace. On the first page of the paper it should be clearly stated that this was a group project and the names and student numbers of the group members should be given.
• UCD’s Academic Integrity Policy will apply. I will run plagiarism checks. Be very careful with LLMs, but they are allowed if used appropriately.
• A solution will not be provided after the deadline.
Grading
Students will receive a letter grade for this assignment. Grading is based on the following criteria:
• Correctness of the analysis and interpretations
• Writing (clear and concise)
• Exposition: are graphs and tables done well? They don’t need to look fancy, but it has to be clear what is shown. For regression tables, please use stargazer or alternative packages that give you professionally formatted regression tables.
• All graphs and tables should be programmed with R, i.e. ideally not copied and pasted from anywhere.
• All graphs should be done with ggplot or similar (but not with the default grey background)
• Tidyverse functions (especially the pipe operator) should be used to help clean the data.
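As a rough illustration of the last two criteria, the sketch below cleans a small simulated data set with the pipe operator and draws a bar chart without the default grey background. The data are simulated and the variable names (age, smoker) and the missing-value code 99 are invented for this example; they are not taken from the Healthy Ireland Survey, so check the Data Dictionary for the real names and codes.

```r
library(dplyr)    # pipe operator and data-cleaning verbs
library(ggplot2)  # plotting

# Simulated toy data: variable names and codes are hypothetical
set.seed(1)
raw <- tibble(
  age    = sample(18:80, 200, replace = TRUE),
  smoker = sample(c(1, 2, 99), 200, replace = TRUE,
                  prob = c(0.20, 0.75, 0.05))  # 99 = "refused" (assumed code)
)

clean <- raw %>%
  mutate(smoker = na_if(smoker, 99)) %>%   # recode the assumed missing code to NA
  filter(!is.na(smoker)) %>%               # drop respondents with no answer
  mutate(
    smoker    = if_else(smoker == 1, "Smoker", "Non-smoker"),
    age_group = cut(age, breaks = c(17, 34, 54, 80),
                    labels = c("18-34", "35-54", "55-80"))
  )

# Share of smokers by age group, with a plain theme instead of the grey default
p <- ggplot(clean, aes(x = age_group, fill = smoker)) +
  geom_bar(position = "fill") +
  labs(x = "Age group", y = "Share of respondents", fill = NULL) +
  theme_minimal()

print(p)
```

theme_minimal() is only one option; theme_bw() or theme_classic() would also remove the grey background.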
Do not forget to cite the data in your bibliography:
Department of Health (2024). Healthy Ireland Survey, 2023 [dataset]. Version 1. Irish Social Science Data Archive. SN: 0021-07. https://www.ucd.ie/issda/data/healthyireland/healthyireland2023/.
AI Policy
AI tools such as ChatGPT can be extremely useful for those who are able to use them. I encourage students to use ChatGPT and equivalent programmes to assist them with the assignment. ChatGPT can help you a lot with it, but an old rule holds here, too: garbage in, garbage out. I allow the use of AI software under the following conditions:
• You, and only you, are responsible for your assignment. I will not have debates about the correctness of an answer just because ChatGPT told you something is correct (it has no clue what is correct, trust me). Neither will I accept excuses for late submissions on the grounds that ChatGPT was down, or similar.
• If you use AI, please declare this at the beginning of the assignment. Explain briefly what you used it for by adding a section “AI Statement” and, for example, state “Our group used ChatGPT for language editing in parts XXX, and for correcting the R code in questions XXX.”
• In addition to declaring its use at the start, you must appropriately cite the use of AI in the bibliography. See the “Approaches to Teaching and Learning” section on the module course descriptor for more details.
Some tips
The aim of this assignment is to get students to figure things out. In the tutorials, clear instructions and coding examples were given along with a clean data set. However, this is far from the work data analysts actually do. Their projects typically have a clear goal, but the data are often messy and it is unclear how to reach the goal of the analysis. Simply put, the analyst has to figure things out: how to best clean the data set, how to best visualise the data, how to bring the data into a format that is suitable for visualisation and regression analysis, etc. If you’re working in a company, you can neither refuse to do a project because “we haven’t learned about a certain procedure in class”, nor run to your manager with every little error message you encounter. Ultimately, data analysts are paid for solving problems themselves or collaboratively with team members. The sooner you get into that mindset, the better. This assignment is similar to a project one would encounter in a data analytics job.
How to figure things out?
• Google is your friend. Get a strange error message? Type it into Google; chances are someone else had the problem before. You can also search Stack Overflow, the forum for all things programming (R, Python, C++, etc).
• If one solution doesn’t work, try another one. Solving problems is often frustrating; it takes time and a decent bit of grit. So if you encounter a problem, solve it or find a way around. There is always a solution!
Data Science Project
Part 1 of the project was to sketch an empirical model you found interesting. This encouraged you to think about the problem before looking at any data. In the Google Drive, you will find the full Data Dictionary for the dataset. The Data Dictionary lists all the variable names, the questions that were asked, and how the answers were recorded. You should set aside at least an hour to sit down and think about which variables you would like to include in your analysis. This analysis can look very different to what you initially found interesting in Part 1; that’s fine. Write out what you think the model should look like, and why you think the variables you chose are relevant. Sketch an Introduction (perhaps 500 words) as to why this question is interesting.
With your model in mind, run your analysis. The first big step will be data cleaning. For this, you will need to “figure things out” as above. It will likely take several hours before your data are in a format that is suitable for analysis. This is normal.
Perform an econometric analysis on the question you found interesting. Pretend that this is a real-world problem and that you are a data analytics consultant hired by the government’s Department of Social Analysis. You will be providing expert advice, and presenting your report directly to the Minister. He understands statistical reports but might need reminding of the details, so make sure your report is clear and presentable, and that you demonstrate understanding of anything you include.
Be sure to include a table of summary statistics (number of observations, mean, sd, min, max). You may want to include two or three helpful visualisations of the data, such as a scatterplot or bar chart. Run an OLS regression. Choose the independent variables you think are relevant. Discuss why you think they are relevant and/or why you include them (do you use theory, statistical tests, or a mix of both?). Consider your choice of standard errors, and whether you want to test anything related to that.
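To make these steps concrete, here is a minimal sketch on simulated data. It builds a summary statistics table, runs an OLS regression, tests for heteroskedasticity, and reports heteroskedasticity-robust standard errors. The variables (health_score, age, female) are hypothetical and not from the survey, and the sandwich and lmtest packages are one possible choice among several; treat this as a starting point rather than a template for your own model.

```r
library(sandwich)  # heteroskedasticity-robust covariance estimators
library(lmtest)    # coeftest() and bptest()

# Simulated data: all variable names are hypothetical
set.seed(42)
n  <- 500
df <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  female = rbinom(n, 1, 0.5)
)
# The error variance grows with age, so the errors are heteroskedastic by construction
df$health_score <- 70 - 0.2 * df$age + 2 * df$female +
  rnorm(n, sd = 2 + 0.1 * df$age)

# Summary statistics: one row per variable
summary_table <- data.frame(
  n    = sapply(df, function(x) sum(!is.na(x))),
  mean = sapply(df, mean, na.rm = TRUE),
  sd   = sapply(df, sd,   na.rm = TRUE),
  min  = sapply(df, min,  na.rm = TRUE),
  max  = sapply(df, max,  na.rm = TRUE)
)
print(round(summary_table, 2))

# OLS regression
ols <- lm(health_score ~ age + female, data = df)

# Breusch-Pagan test: a small p-value is evidence against homoskedasticity
print(bptest(ols))

# Same coefficients, but with robust (HC1) standard errors
robust <- coeftest(ols, vcov. = vcovHC(ols, type = "HC1"))
print(robust)
```

Robust standard errors leave the coefficient estimates unchanged; only the standard errors, t-statistics and p-values differ from the default lm() output.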
Feel free to consider alternative specifications. For example, you might include a table with 2–3 regression models presented, but with differing explanatory variables. If you wish to include these, explain why you do so, and what the implications are. Note also that the outcome may be binary, and this fact might affect how you approach the econometrics.
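If your outcome is binary, one common approach is to compare a linear probability model (OLS on a 0/1 outcome) with a logit. The sketch below does this on simulated data with hypothetical variable names (visits_gp, age, female). It is only one way to handle a binary outcome, and a probit would work along the same lines.

```r
# Simulated data: all variable names are hypothetical
set.seed(7)
n  <- 1000
df <- data.frame(
  age    = runif(n, 18, 80),
  female = rbinom(n, 1, 0.5)
)
# Binary outcome generated from a logit model
df$visits_gp <- rbinom(n, 1, plogis(-2 + 0.03 * df$age + 0.4 * df$female))

# Linear probability model: coefficients are changes in the probability.
# Its errors are heteroskedastic by construction, so robust standard errors are advisable.
lpm <- lm(visits_gp ~ age + female, data = df)

# Logit: coefficients are changes in the log-odds, not in the probability
logit <- glm(visits_gp ~ age + female,
             family = binomial(link = "logit"), data = df)

# Average marginal effect of age in the logit:
# mean over observations of the logistic density at x'b, times the age coefficient
ame_age <- mean(dlogis(predict(logit, type = "link"))) * coef(logit)["age"]

print(coef(lpm)["age"])  # LPM effect of one extra year of age
print(ame_age)           # logit average marginal effect, on the same scale
```

The raw logit coefficients cannot be compared directly with the LPM coefficients; the average marginal effect puts both on the probability scale.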
Discuss causality. In general, explain your econometric decisions in a clear way. Be sure to include a table of well-formatted regression results. Interpret the coefficients and statistical significance. If included, interpret the relevant diagnostics (e.g. R², F-test, etc).
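A regression table with several specifications side by side could be produced roughly as follows. This is a sketch using stargazer on simulated data; the variable names and labels are hypothetical, and other packages that produce formatted regression tables are equally acceptable.

```r
library(stargazer)

# Simulated data: all variable names are hypothetical
set.seed(123)
n  <- 400
df <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  income = rnorm(n, mean = 40, sd = 10)
)
df$health_score <- 60 - 0.15 * df$age + 1.5 * df$female +
  0.2 * df$income + rnorm(n, sd = 5)

# Three nested specifications, adding one explanatory variable at a time
m1 <- lm(health_score ~ age, data = df)
m2 <- lm(health_score ~ age + female, data = df)
m3 <- lm(health_score ~ age + female + income, data = df)

# type = "text" prints to the console; stargazer also returns the table
# invisibly as a character vector
tab <- stargazer(
  m1, m2, m3,
  type             = "text",
  title            = "OLS estimates (simulated data)",
  dep.var.labels   = "Health score",
  covariate.labels = c("Age", "Female", "Income (thousand euro)")
)
```

Switching to type = "latex" or type = "html" gives output that can be placed in a LaTeX or Word write-up. The table reports R², adjusted R² and the F-statistic by default, which are the diagnostics you would then interpret in the text.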
In a concluding section, discuss the implications of your findings. For example, does your analysis support government priorities? What would you recommend to the government? What are the limitations of your analysis?