STATS 3DA3
Homework Assignment 2
Instructions
• Due before 10:00 PM on Tuesday, February 11, 2025.
• Submit your solutions as a single PDF to Avenue to Learn. You don’t need to copy the questions into your answers.
• Late Penalty for Assignments: A 15% penalty will be applied for each day (rounded up) that an assignment is submitted past the due date, up to 72 hours. This includes accommodations for extended time through SAS.
• Assignments submitted more than 72 hours past the due date will receive a grade of zero.
• Your assignment must conform to the Assignment Standards listed below.
Assignment Standards
• Write your name and student number on the title page. We will not grade assignments without the title page.
• Using Quarto with a Jupyter notebook is strongly recommended.
• Eleven-point font (Times or similar) must be used with 1.5 line spacing and margins of at least 1 inch all around.
• Start the solution to each question (Questions 1, 2, and 3) on a new page (e.g., using \newpage).
• No screenshots are accepted for any reason.
• The writing and referencing should be appropriate to the undergraduate level.
• You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
• Various tools, including publicly available internet tools, may be used by the instructor to check the originality of submitted work.
• Assignment policy on the use of generative AI
– Generative AI is not permitted in the assignments, except for the use of GitHub Copilot as an assistant for coding.
– Clearly indicate in the code comments where GitHub Copilot was used as a coding assistant.
– In alignment with McMaster’s academic integrity policy, it “shall be an offence knowingly to submit academic work for assessment that was purchased or acquired from another source”. This includes work created by generative AI tools. Also stated in the policy is the following: “Contract Cheating is the act of ‘outsourcing of student work to third parties’ with or without payment.” Using generative AI tools is a form of contract cheating. Charges of academic dishonesty will be brought forward to the Office of Academic Integrity.
For all questions, use Python 3.11.5 in a virtual environment. Then install the required libraries for text mining and Shiny visualization.
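If helpful, the environment can be set up as follows. This is a sketch for macOS/Linux; the environment name and the suggested package list are illustrative assumptions, not part of the assignment.

```shell
# Create a virtual environment and activate it (macOS/Linux syntax).
python3 -m venv stats3da3-env
source stats3da3-env/bin/activate

# Then install the libraries the questions appear to need, for example
# (this package list is an assumption, not prescribed):
#   pip install pdfplumber pandas wordcloud matplotlib seaborn shiny
```

On Windows, the activation command is `stats3da3-env\Scripts\activate` instead.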
Question 1: Word Cloud Analysis
Let’s explore the article “Data Science and Engineering With Human in the Loop, Behind the Loop, and Above the Loop” by Xiao-Li Meng (2023). Follow the steps below to create and analyze a word cloud for pages 2–5 of the article.
(1) Add the article “Data Science and Engineering With Human in the Loop, Behind the Loop, and Above the Loop” by Xiao-Li Meng (2023) to your reference list.
(2) Download the PDF of the article.
Hint:
• Access the article via https://doi.org/10.1162/99608f92.68a012eb.
• Click the Download button in the top-right corner, and choose the PDF format.
• Move the downloaded file to your working folder and rename it as paper.pdf.
(3) Use pdfplumber.open() to open the PDF.
(4) Extract the text from pages 2 to 5.
(5) Combine the text from these pages into a single string.
(6) Split the string by lines using \n.
(7) Create a pandas data frame named df with a column labeled line containing the split lines.
(8) Break each line into individual words.
(9) Convert each word into a separate row in the data frame.
(10) Convert all words to lowercase.
(11) Remove stop words.
(12) Remove unsuitable words using the following steps:
Hint:
(i) Remove rows where the word column contains punctuation using:
• df = df[~df['word'].str.contains(r'[,.•‘”“:’;\(\)\[\]]', regex=True)]
(ii) Remove rows where the word column contains numbers using:
• df = df[~df['word'].str.contains(r'\d', regex=True)]
(iii) Remove rows where the word column contains single letters using:
• df = df[~df['word'].str.contains(r'^[a-z]$', regex=True)]
(13) Create a term-frequency data frame.
Hint:
(i) Calculate the frequency of each unique word using: df['word'].value_counts().reset_index()
(ii) Save the result in a DataFrame called freq.
(14) Generate a word cloud for the most frequently occurring words (e.g., the top 10 words).
(15) Write a summary paragraph (at least two statements) about your word cloud. The summary
can include any limitations of your analysis and provide context based on the chosen text.
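As a rough illustration, steps (5)–(13) above might be sketched as follows, with a toy string standing in for the text extracted by pdfplumber. The stop-word list here is a tiny stand-in; in your solution you would use a full list (e.g., from nltk or wordcloud's STOPWORDS).

```python
import pandas as pd

# Toy stand-in for the combined text of pages 2-5 (step 5)
text = "Data science with humans in the loop\nData science above the loop"

# (6)-(7): split by lines and build a data frame with a `line` column
df = pd.DataFrame({"line": text.split("\n")})

# (8)-(9): break each line into words, one word per row
df = df.assign(word=df["line"].str.split()).explode("word")

# (10): convert all words to lowercase
df["word"] = df["word"].str.lower()

# (11): remove stop words (tiny illustrative list only)
stop_words = {"the", "with", "in", "a", "of", "and"}
df = df[~df["word"].isin(stop_words)]

# (12): remove punctuation, numbers, and single letters
df = df[~df["word"].str.contains(r"[,.•‘”“:’;\(\)\[\]]", regex=True)]
df = df[~df["word"].str.contains(r"\d", regex=True)]
df = df[~df["word"].str.contains(r"^[a-z]$", regex=True)]

# (13): term-frequency data frame
freq = df["word"].value_counts().reset_index()
```

For step (14), one option is the wordcloud package, e.g. `WordCloud().generate_from_frequencies(...)` on the top words from `freq`.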
Question 2
Greenhouse gases (GHGs) play a significant role in global warming by capturing and retaining solar heat energy, leading to elevated global temperatures. In 2004, Canada launched the Greenhouse Gas Reporting Program (GHGRP) to monitor and record emissions from facilities that release 10 kilotonnes or more of greenhouse gases, measured in CO2-equivalent units. Facilities meeting this threshold are required to submit annual reports to Environment and Climate Change Canada. The dataset is publicly accessible through Canada’s Open Government Portal: Greenhouse Gas Reporting Program (GHGRP) - Facility Greenhouse Gas (GHG) Data.
For Question 2, we have downloaded the dataset PDGES-GHGRP-GHGEmissionsGES-2004-Present.csv from the portal.
This analysis focuses on creating a Shiny App to explore trends in greenhouse gas emissions across Canada’s provinces and territories, measured in CO2-equivalent units.
Data dictionary:
The dataset, spanning from 2004 to the present, includes emissions data (in tonnes and CO2-equivalent tonnes) for each facility, categorized by gas type, including carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), hydrofluorocarbons (HFCs), perfluorocarbons (PFCs), and sulphur hexafluoride (SF6). It also provides the province or territory where each facility is located. For further details, refer to the Greenhouse Gas Reporting Program (GHGRP) - Facility Greenhouse Gas (GHG) Data.
Pre-Processing Steps
To simplify the task of creating a Shiny App, we have pre-processed the data as follows: We start by importing the necessary libraries for data transformation:
import numpy as np
import pandas as pd
import re
Next, we read the downloaded dataset in CSV format with the specified encoding (latin1):
df = pd.read_csv("GHG_Emissions.csv", encoding='latin1')
The column names in the dataset are a mix of English and French. We use the clean_column_names() function to standardize the column names by removing French names, non-ASCII characters, and unnecessary symbols.
Here is the clean_column_names() function:
# clean_column_names function
def clean_column_names(column_names):
    cleaned_names = []
    # loop through each column name
    for name in column_names:
        # convert names to ASCII and remove non-ASCII characters
        name = name.encode('ascii', 'ignore').decode('ascii')
        # remove everything after '/' (French column name)
        name = re.sub(r'/.*', '', name)
        # remove parentheses
        name = re.sub(r'[()]', '', name)
        # remove extra whitespace
        name = ' '.join(name.split())
        cleaned_names.append(name)
    # return new column names
    return cleaned_names
We then apply this function to clean the column names in the DataFrame.
df.columns = clean_column_names(df.columns)
Next, we select the relevant columns for the analysis:
• Reference Year - the year GHG gas emission was recorded.
• GHGRP ID No. - the facility identity.
• Facility Province or Territory - province or territory of the facility.
• CO2 tonnes - emissions (in tonnes and tonnes of CO2 eq.) of carbon dioxide (CO2).
• CH4 tonnes - emissions (in tonnes and tonnes of CO2 eq.) of methane.
• N2O tonnes - emissions (in tonnes and tonnes of CO2 eq.) of nitrous oxide.
• SF6 tonnes - emissions (in tonnes and tonnes of CO2 eq.) of sulphur hexafluoride.
• HFC Total tonnes CO2e - total emissions (in tonnes of CO2 eq.) of hydrofluorocarbons.
• PFC Total tonnes CO2e - total emissions (in tonnes of CO2 eq.) of perfluorocarbons.
selected_cols = [
    "Reference Year", "GHGRP ID No.", "Facility Province or Territory",
    "CO2 tonnes", "CH4 tonnes", "N2O tonnes", "SF6 tonnes",
    "HFC Total tonnes CO2e", "PFC Total tonnes CO2e"
]
df = df[selected_cols]
We rename the columns to make them more concise and consistent:
df.rename(columns={
    "Reference Year": "Year",
    "GHGRP ID No.": "Facility_ID",
    "Facility Province or Territory": "Province_Territory",
    "CO2 tonnes": "CO2",
    "CH4 tonnes": "CH4",
    "N2O tonnes": "N2O",
    "SF6 tonnes": "SF6",
    "HFC Total tonnes CO2e": "HFC",
    "PFC Total tonnes CO2e": "PFC"
}, inplace=True)
print(df.head())
Finally, we save the pre-processed data to a new CSV file:
df.to_csv("cleaned_GHG_Emissions.csv", index=False)
The pre-processed dataset is now available for analysis and can be accessed at:
https://raw.githubusercontent.com/PratheepaJ/datasets/refs/heads/master/cleaned__GHG__Emissions.csv.
You will use this dataset for Question 2.
Next Steps
The following questions guide you through creating a Shiny App to explore trends in CO2, CH4, and N2O emissions across provinces and territories in Canada from 2004 to 2022.
(1) Read the pre-processed data from the provided link.
(2) Ensure that the Year variable is in the correct format. If not, convert it to date-time format and extract the year. Replace the original Year variable with the extracted year.
Hint: Use the following command to convert the year:
df['Year'] = pd.to_datetime(df['Year'], format='%Y').dt.year
(3) Some territories may have no facilities reported in early years. Group the data by Year and Province_Territory to count distinct Facility_ID values. Find which territories are missing in 2004.
Hint: Use the following code to group the data and find missing territories:
df.groupby(['Year', 'Province_Territory']).agg(
    facilities=('Facility_ID', 'nunique')
).reset_index()
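The grouped counts only list regions that actually reported, so finding the territories missing in 2004 still requires a comparison. One possible approach, shown on a toy data frame (the region names and IDs here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the pre-processed dataset
df = pd.DataFrame({
    "Year": [2004, 2004, 2005],
    "Province_Territory": ["Ontario", "Quebec", "Nunavut"],
    "Facility_ID": [1, 2, 3],
})

# Count distinct facilities per region and year
counts = df.groupby(["Year", "Province_Territory"]).agg(
    facilities=("Facility_ID", "nunique")
).reset_index()

# Regions that appear somewhere in the data but not in 2004
all_regions = set(counts["Province_Territory"])
in_2004 = set(counts.loc[counts["Year"] == 2004, "Province_Territory"])
missing_2004 = all_regions - in_2004
```

On the real dataset, `missing_2004` would list the territories with no reporting facilities in 2004.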
(4) Find the earliest and latest year emissions were recorded.
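One way to sketch this step, using toy values in place of the real Year column:

```python
import pandas as pd

# Toy stand-in for the pre-processed data frame
df = pd.DataFrame({"Year": [2005, 2004, 2022]})

# Earliest and latest recorded years
earliest, latest = df["Year"].min(), df["Year"].max()
print(earliest, latest)  # 2004 2022
```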
(5) Group the data by Year and Province_Territory and sum the emissions of CO2, CH4, and N2O for each province.
Hint: Use the following code to calculate the total emissions:
df.groupby(['Year', 'Province_Territory']).agg(
    CO2=('CO2', 'sum'),
    CH4=('CH4', 'sum'),
    N2O=('N2O', 'sum')
).reset_index()
(6) Plot the CO2 changes over the years for each province and territory, using colored lines to
differentiate between them.
Note: you will use the dataset obtained in (5) for this plot.
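As a rough illustration, one line per region can be drawn as follows; the toy `totals` frame stands in for the grouped data from (5), and the values are made up. (The Shiny template in question (8) uses seaborn for a similar plot; this sketch uses matplotlib only.)

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the grouped data frame from step (5)
totals = pd.DataFrame({
    "Year": [2004, 2005, 2004, 2005],
    "Province_Territory": ["Ontario", "Ontario", "Quebec", "Quebec"],
    "CO2": [100.0, 110.0, 80.0, 85.0],
})

# One colored line per province/territory
fig, ax = plt.subplots(figsize=(10, 6))
for region, grp in totals.groupby("Province_Territory"):
    ax.plot(grp["Year"], grp["CO2"], marker="o", label=region)
ax.set_xlabel("Year")
ax.set_ylabel("Total CO2 Emissions (Tonnes)")
ax.legend(title="Province/Territory")
```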
(7) Provide a description of the CO2 emission trend across provinces and territories based on the plot in (6).
(8) Develop a Shiny app that allows the user to input a start year (from 2004 to 2022), an end year (from 2004 to 2022), and select a gas type (CO2, CH4, N2O).
• Use ui.input_select to allow the user to specify the start year (between 2004 and 2022).
• Use ui.input_select to allow the user to specify the end year (between 2004 and 2022).
• Use ui.input_select to allow the user to select the gas type (CO2, CH4, or N2O).
You can start by using the following Shiny app template to structure your app. When writing the app in app.py, remove the template instructions and replace them with your implementation.
You will also need to copy your app.py code into your assignment answers, similar to the template provided here:
# load the required libraries

# define the UI for the Shiny app
app_ui = ui.page_fluid(
    ui.input_select(
        id='emissiontype',
        label='Choose emission type',
        # Add more gases as necessary in ...
        choices=['CO2', '...', '...'],
        selected='CO2'
    ),
    ui.input_select(
        "start_year",
        "Start Year",
        [str(year) for year in range(2004, 2023)]
    ),
    ui.input_select(
        "end_year",
        "End Year",
        [str(year) for year in range(2004, 2023)]
    ),
    ui.output_plot('myplot')
)

# define the server function for the Shiny app
def server(input, output, session):
    @output
    @render.plot
    def myplot():
        # Read the pre-processed data from the provided link
        df = ...
        # Convert 'Year' column to date-time format and extract the year
        df['Year'] = ...
        # Filter data based on the selected start and end year
        start_year = int(input.start_year())
        end_year = ...
        df = df[(df['Year'] >= start_year)
                & (df['Year'] <= end_year)]
        # Select the emission type based on user input
        emission_type = input.emissiontype()
        # Create a plot to visualize the emission trends
        plt.figure(figsize=(10, 6))
        sns.lineplot(data=df,
                     x='...',
                     y=emission_type,
                     # Color lines by province/territory
                     hue='...', marker='o')
        # complete the title with your choice of text
        plt.title(f'{emission_type} ...')
        plt.xlabel('...')
        plt.ylabel(f'Total {emission_type} Emissions (Tonnes)')
        plt.legend(title='Province/Territory',
                   bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.xticks(ticks=np.arange(df['Year'].min(),
                                   df['Year'].max() + 1, 1),
                   rotation=45)
        plt.grid(True)
        return plt.gcf()

# Run the app
app = App(app_ui, server)
(9) Deploy your Shiny App at https://www.shinyapps.io/. Then, provide the link to the App. For example, the link to my app is https://pratheepaj.shinyapps.io/my_app/.