IADS midterm
Please ensure all code is executed and the corresponding outputs are included. Write the code directly in this notebook rather than creating a new one.
Part 1: Multiple choice and theoretical questions
Please write your answer after each question.
Question 1. What would a p-value of 0.04 mean for a t-test comparing two samples of observations (select all that apply):
A) the sample averages are at least 4% different
B) the samples follow underlying distributions with the same mean
C) the samples follow underlying distributions with different means
D) one can reject the null hypothesis that the samples follow underlying distributions with the same mean at the 5% significance level (or with 95% confidence), since the p-value is below 0.05
E) one can't reject the null hypothesis that the samples follow underlying distributions with the same mean at the 5% significance level (or with 95% confidence), since the p-value does not reach 0.05
F) one can reject the null hypothesis that the samples follow underlying distributions with different means at the 5% significance level (or with 95% confidence)
G) the probability that the two samples have the same mean is 4%
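For reference, the sketch below shows how a two-sample p-value is obtained and compared with a significance level. It is a minimal illustration only: it uses a permutation test on the difference in means as a standard-library stand-in for the t-test named in the question, and the data, the function name `permutation_p_value`, and all parameter values are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in sample means.

    Note: this is a permutation test, not a t-test. It estimates how often
    a random relabelling of the pooled data gives a mean difference at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical data: two synthetic samples drawn with different means.
data_rng = random.Random(42)
sample_a = [data_rng.gauss(10.0, 2.0) for _ in range(40)]
sample_b = [data_rng.gauss(12.0, 2.0) for _ in range(40)]

alpha = 0.05  # significance level
p_value = permutation_p_value(sample_a, sample_b)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject null at alpha={alpha}: {reject_null}")
```

In a notebook with SciPy installed, `scipy.stats.ttest_ind` would be the usual way to run the t-test itself; the comparison of the resulting p-value against `alpha` is the same.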
Question 2. Which statements are true regarding normal and log-normal distributions:
A) Quantities following log-normal distributions have higher probabilities of outliers compared to normal distributions
B) Outliers significantly different from the mean are more common for normally distributed variables than for log-normally distributed variables
C) The logarithm of a normally distributed quantity follows a log-normal distribution
D) The logarithm of a log-normally distributed quantity follows a normal distribution
E) The probability density function of a log-normally distributed variable equals the logarithm of the probability density function of a normally distributed variable
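The statements above can be explored empirically. The sketch below is a small simulation using only the standard library: it draws log-normal samples, takes their logarithms, and prints summary statistics of both. The parameter values `mu` and `sigma`, the sample size, and the seed are arbitrary assumptions chosen for illustration.

```python
import math
import random
from statistics import fmean, median, stdev

rng = random.Random(0)
mu, sigma = 1.0, 0.5  # hypothetical parameters of the underlying normal

# random.lognormvariate(mu, sigma) returns exp(X) where X ~ Normal(mu, sigma).
log_normal_draws = [rng.lognormvariate(mu, sigma) for _ in range(20000)]
logs = [math.log(x) for x in log_normal_draws]

# Summary statistics of the logarithms, to compare against mu and sigma.
log_mean = fmean(logs)
log_std = stdev(logs)

# Summary statistics of the raw draws, to inspect the shape of the tail.
raw_mean = fmean(log_normal_draws)
raw_median = median(log_normal_draws)

print(f"logs: mean = {log_mean:.3f}, std = {log_std:.3f}")
print(f"raw:  mean = {raw_mean:.3f}, median = {raw_median:.3f}")
```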
Question 3.
Imagine training a model that considers multiple satellite images of urban traffic and tries to find groups of typical scenarios (repeated with minor deviations). How would you classify this problem from a Machine Learning perspective?
A) Supervised learning;
B) Unsupervised learning;
C) Semi-supervised learning;
D) Reinforcement learning.
Question 4.
Please explain why you would need separate training, validation and test samples to learn a model. In which cases might you need all three, including a validation sample?
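For reference, the mechanics of a three-way split can be sketched as follows. This is a minimal standard-library illustration, not a prescribed method: the function name `three_way_split`, the 60/20/20 proportions, and the seed are assumptions.

```python
import random


def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle items and partition them into train, validation and test lists.

    The three parts are disjoint and together contain every input item.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


# Hypothetical data: 100 record indices.
train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

In practice `sklearn.model_selection.train_test_split` applied twice gives the same kind of partition.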
Part 2: NYPD data analysis
In this part, you need to download the New York Police Department (NYPD) complaints data for 2019 and write code for the following three sections (each with its own sub-sections): Data cleaning, Exploratory analysis and Hypothesis testing.
Download the NYPD complaints data:
Two options:
1. download with curl or urllib methods
2. download with the API
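A minimal sketch of the urllib route is given below. Everything specific in it is an assumption that should be checked against the NYC Open Data portal before use: the endpoint in `BASE_URL` (assumed to be the Socrata CSV endpoint of the historic NYPD complaint dataset), the date column name `cmplnt_fr_dt`, and the helper names `build_query_url` and `download`. The download call is left commented out so that the cell does not hit the network unless you enable it.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# ASSUMPTION: endpoint of the NYPD complaint dataset on NYC Open Data.
# Verify the dataset identifier on the portal before relying on it.
BASE_URL = "https://data.cityofnewyork.us/resource/qgea-i56i.csv"


def build_query_url(year=2019, limit=1000, base_url=BASE_URL,
                    date_column="cmplnt_fr_dt"):
    """Build a query URL restricted to one calendar year.

    ASSUMPTION: the endpoint accepts SoQL-style `$where` and `$limit`
    parameters and `date_column` is the complaint start date column.
    """
    where = (
        f"{date_column} between '{year}-01-01T00:00:00' "
        f"and '{year}-12-31T23:59:59'"
    )
    params = {"$where": where, "$limit": limit}
    return f"{base_url}?{urlencode(params)}"


def download(url, path):
    """Save the response body of `url` to the local file `path`."""
    urlretrieve(url, path)
    return path


url = build_query_url()
print(url)
# Uncomment to fetch the data (requires network access):
# download(url, "nypd_complaints_2019.csv")
```

The default `limit` is kept small for a quick check; the full year of complaints needs a much larger limit or paging.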