DAT 5566: Big Data and Cloud Computing
Fall 2025, Mini B
Lab #3
INSTRUCTIONS
1. This is a group assignment to be completed during the lab.
2. ONLY utilize the codes we practice.
3. Due on Canvas.
4. Submit on Canvas.
ASSIGNMENT
In this assignment, we are going to work with a dataset called CarSale.csv located on our server.
The data has several columns, as explored in the previous assignments.
Answer questions on Canvas. No need to submit the codes.
Questions
Part 1- Linux command practices
1- Take a look at the data. Is there any car that is reported more than once? You need to answer. For this, we look at VIN of the car. Look at 2 rows to make sure there is no duplicated car.
2- Find the 5 cars that sit on the market for longest time. Report the days on market, make_name, and model_name. Make sure to have the appropriate header names too.
3- Let’s find the average price for all rows using AWK.
4- Let’s make the previous question more complex. Then let’s find the average price, and number of cars per year using AWK. In addition, add a header for your results. Do a sort according to years in ascending format.
5- Let’s make a smaller data by selecting only columns 1,2,4,12,13,15,18,19, and 23. Also consider only cars of years 2000 and beyond.
Part 2- Importing data into Hive/Impala
6- Creating table and uploading file into the table:
a. Upload the merged file into HDFS. Then create a table in Hive called your IDS_carsale. For example, in a group with two teammates with SIDS as 111111 and 222222, the table name is going to be 111111_222222_carsale. This table is going to include carsale data, then, make it corresponding to the data.
b. Upload the merged file into the created table.
Part 3- Query with Hive/Impala
7- Find the maximum, minimum, and average days on market per body type. Do not list null body types.
8- Find the average price and number of cars, per color, per condition (new or not). List only rows with more than 100 cars. Which combination is the most available ones?
9- Find the number of cars, average and maximum price per car maker, car model, and condition (new, used). If the car is new, tag it as “New”, otherwise as “Used”. Report only rows with more than 250 cars.
Part 4- Clean up the assignment
1- Remove the table that you created.
2- Remove both car sales files on home directory.