
FIT5202 S1 2021 Sample Exam Question

Question 1

Aditya and David are first-year data science students at Monash University. They are discussing how parallel and distributed processing can help data scientists perform computation faster. They would like your help to understand and answer the following questions:

1.   Using the current processing resources, we can finish processing 1TB (one terabyte) of data in 1 hour. Recently the volume of data has increased to 2TB and management has decided to double the processing resources. Using the new processing resources, we can finish processing the 2TB in 60 minutes. Aditya wants to know: (1 + 1 = 2 Marks)

a.   Is this speed-up or scale-up? Please explain your answer.

b.  Also, please explain what type of speed-up or scale-up it is (linear, superlinear or sub-linear).
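The arithmetic behind this scenario can be sketched as follows. This is only an illustration, assuming the common textbook definition of scale-up as the elapsed time of the small problem on the small system divided by the elapsed time of the large problem on the large system. The function names `scale_up` and `classify` are hypothetical and not part of the exam.

```python
def scale_up(small_system_minutes: float, large_system_minutes: float) -> float:
    """Ratio of small-system elapsed time to large-system elapsed time."""
    return small_system_minutes / large_system_minutes


def classify(ratio: float, tol: float = 1e-9) -> str:
    """Label the ratio as linear, superlinear or sub-linear."""
    if abs(ratio - 1.0) <= tol:
        return "linear"
    return "superlinear" if ratio > 1.0 else "sub-linear"


# 1TB on the original resources takes 60 minutes;
# 2TB on the doubled resources also takes 60 minutes.
ratio = scale_up(60, 60)
kind = classify(ratio)
print(ratio, kind)
```

Both the data volume and the resources double while the elapsed time stays the same, which is why the ratio is compared against 1 rather than against the number of processors.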

2.   David is using his iMac desktop to do parallel query processing. The iMac has the following specifications:

He wants to know what type of parallel database architecture he is using to do the parallel query processing. Please explain the reason for your answer. (2 Marks)

3.   David read in the textbook that “Random unequal partitioning is sometimes inevitable in parallel search.” Please explain to him what random unequal partitioning is. (2 Marks)

4.  Aditya now understands that skewness is the unevenness of workload and that a skewed workload distribution is generally undesirable. He found the figure below in the textbook, which shows a skewed workload distribution. He wants to know: (1 + 2 = 3 Marks)

a.   Does the figure below show processing skew or data skew? Please explain your reasoning.

b.   Is it possible to have an equal distribution of data? Please explain how.


5.   David was given a task to perform log analysis in the lab. The input data consisted of log messages of varying degrees of severity, along with some blank lines. He has to compute how many log messages appear at each level of severity. The contents of the “input.txt” file are shown below.

INFO This is a message with content

INFO This is some other content

(empty line)

INFO Here are more messages

WARN This is a warning

(empty line)

ERROR Something bad happened

WARN More details on the bad thing

INFO back to normal messages

The expected output of the operations is as below.

[('INFO', 4), ('WARN', 2), ('ERROR', 1)]

However, he is not sure how to begin. Please explain the following to him, assuming ‘sc’ is a SparkContext object. (1 + 2 = 3 Marks)

a.   What is an RDD?

b.   How can the “input.txt” file be read into an RDD?
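The counting task can be sketched in plain Python, so that it runs without a Spark cluster. In PySpark, `sc.textFile("input.txt")` returns an RDD with one element per line. The comments show the RDD operation that each step mirrors. The names `SAMPLE_LINES` and `count_levels` are hypothetical, and the sample lines are copied from the question. Note that the order of results collected from a real `reduceByKey` is not guaranteed.

```python
# Lines of "input.txt" as given in the question; "" stands for an empty line.
SAMPLE_LINES = [
    "INFO This is a message with content",
    "INFO This is some other content",
    "",
    "INFO Here are more messages",
    "WARN This is a warning",
    "",
    "ERROR Something bad happened",
    "WARN More details on the bad thing",
    "INFO back to normal messages",
]


def count_levels(lines):
    # Mirrors: rdd.filter(lambda line: len(line.strip()) > 0)
    non_empty = [line for line in lines if line.strip()]
    # Mirrors: rdd.map(lambda line: (line.split(" ")[0], 1))
    pairs = [(line.split(" ")[0], 1) for line in non_empty]
    # Mirrors: rdd.reduceByKey(lambda a, b: a + b)
    totals = {}
    for level, one in pairs:
        totals[level] = totals.get(level, 0) + one
    # Mirrors: rdd.collect()
    return list(totals.items())


result = count_levels(SAMPLE_LINES)
print(result)
```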


Question 2

Petadata is an enterprise software company that develops and sells database analytics software subscriptions. The company provides three main services: business analytics, cloud products, and consulting. It operates in North and Latin America, Europe, and Australia.

Petadata is headquartered in Melbourne, Victoria, and has additional major Australian locations in Sydney and Adelaide, where its data center research and development is housed. Peter Liu has served as the company's president and chief executive officer since 2014. The company reported $2.8 billion in revenue, with a net income of $112 million, and 15,026 employees globally, as of March 15, 2020.

Chin is a recent graduate of Monash University and is preparing for a job interview at Petadata. He needs your help to understand aspects of parallel processing, especially parallel joins and parallel sort.

1.   Using a more general notation, table R has |R| records and table S has |S| records. The first step of ROJA is to redistribute the records from both tables according to hash/range partitioning. What is the cost model of the Redistribution Step of ROJA? (4 marks)

Symbol    Description

Data Parameters
R         Size of table in bytes
Ri        Size of table fragment in bytes on processor i
|R|       Number of records in table R
|Ri|      Number of records in table R on processor i

Systems Parameters
N         Number of processors
P         Page size

Time Unit Cost
IO        Effective time to read a page from disk or write a page to disk
tr        Time to read a record in the main memory
tw        Time to write a record to the main memory
td        Time to compute destination

Communication Cost
mp        Message protocol cost per page
ml        Message latency for one page
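One plausible decomposition of the redistribution step, written with the symbols above, is sketched below. It assumes analogous symbols $S_i$ and $|S_i|$ for table S, which the table does not list, and it should be read as an illustration of how the parameters combine rather than as the official marking scheme.

```latex
\begin{align*}
\text{Scan cost} &= \left(\frac{R_i}{P} + \frac{S_i}{P}\right) \times IO \\
\text{Select cost} &= \left(|R_i| + |S_i|\right) \times (t_r + t_w) \\
\text{Finding destination cost} &= \left(|R_i| + |S_i|\right) \times t_d \\
\text{Data transfer cost} &= \left(\frac{R_i}{P} + \frac{S_i}{P}\right) \times (m_p + m_l) \\
\text{Receiving cost} &= \left(\frac{R_i}{P} + \frac{S_i}{P}\right) \times m_p
\end{align*}
```

The sizes $R_i$ and $S_i$ on the receiving side may differ from those on the sending side, because redistribution changes how many records each processor holds.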


2.   Given a data set D = {55; 30; 68; 39; 1; 4; 49; 90; 34; 76; 82; 56; 31; 25; 78; 56; 38; 32; 88; 9; 44; 98; 11; 70; 66; 89; 99; 22; 23; 26} and three processors, show step-by-step how the Parallel Redistribution Merge-All Sort works. (5 Marks)

Assume random equal partitioning has been applied, where each processor has 10 records. The first processor will get the first 10 records, etc.

Processor 1 = {55; 30; 68; 39; 1; 4; 49; 90; 34; 76}

Processor 2 = {82; 56; 31; 25; 78; 56; 38; 32; 88; 9}

Processor 3 = {44; 98; 11; 70; 66; 89; 99; 22; 23; 26}
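The phases of this sort can be sketched in Python using the three partitions above. This is a minimal sketch that simulates the processors with lists. The range boundaries (1 to 33, 34 to 66, 67 to 99) are an assumption, since the question does not specify them, and the names `upper_bounds` and `destination` are hypothetical.

```python
import heapq

processors = [
    [55, 30, 68, 39, 1, 4, 49, 90, 34, 76],
    [82, 56, 31, 25, 78, 56, 38, 32, 88, 9],
    [44, 98, 11, 70, 66, 89, 99, 22, 23, 26],
]

# Assumed range partitioning: inclusive upper bound of each processor's range.
upper_bounds = [33, 66, 99]


def destination(value):
    """Return the index of the processor whose range contains value."""
    for idx, upper in enumerate(upper_bounds):
        if value <= upper:
            return idx
    raise ValueError(f"{value} is outside the assumed ranges")


# Phase 1: each processor sorts its own partition locally.
local_runs = [sorted(p) for p in processors]

# Phase 2: each processor splits its sorted run by range and sends
# every piece to the processor that owns that range.
inboxes = [[] for _ in processors]
for run in local_runs:
    buckets = [[] for _ in processors]
    for value in run:
        buckets[destination(value)].append(value)
    for dest, bucket in enumerate(buckets):
        inboxes[dest].append(bucket)

# Phase 3: each processor merges the sorted runs it received.
merged = [list(heapq.merge(*runs)) for runs in inboxes]

# Final step: the ranges are disjoint and ordered, so the results
# are simply concatenated in processor order.
final = [value for part in merged for value in part]

for i, part in enumerate(merged, start=1):
    print(f"Processor {i}: {part}")
print(final)
```

With these assumed boundaries the processors end up with 11, 9 and 10 records respectively, which shows that range redistribution can leave the load slightly uneven.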

3.   Chin was thinking of using internal sorting to perform the sort. However, he read on the internet that “External Sorting is different from Internal Sorting. Therefore, external sorting cannot use any of the Internal sorting methods”. Is this statement True or False? Explain the reason as well. (3 Marks)



Question 3

2020 has been the year of Big Data: the year when big data and analytics made tremendous progress through innovative technologies, data-driven decision making and outcome-centric analytics. You are applying for a job as a Data Scientist. Mohammad is a senior lecturer and data scientist at Monash University, and a good friend of yours. He has prepared a list of questions regarding Apache Spark and Machine Learning to help you prepare for the job interview. Please answer the following questions.

1.   In Apache Spark, machine learning pipelines provide a uniform set of high-level APIs built on top of DataFrames. This makes it easier to combine multiple algorithms into a single pipeline, or workflow. The key concepts introduced by the Pipelines API are DataFrame, Transformer, Estimator, Pipeline, and Parameter.

a.  What is Machine Learning and why should you use machine learning with Spark? (2 Marks)

b.  What is a Transformer and an Estimator? (2 Marks)

2.  According to a McKinsey study, 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix is driven by machine learning–based product recommendations.

a.   Mohammad wants to know if you have understood how these recommendation systems work. So, please use the dataset below to recommend the Top-2 movies to Mohammad. Please show all the calculations. (4 Marks)

Name       StarTrek   StarWars   Superman   Batman   Hulk
Mohammad   4          2          ?          5        4
Paras      5          3          4          ?        3
Huashun    3          ?          4          4        3
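One way to set up the calculation is sketched below. It assumes user-based collaborative filtering with cosine similarity computed over co-rated movies and a similarity-weighted average for the prediction. The question does not state which similarity measure is expected, so this choice is an assumption, and the function names are hypothetical.

```python
import math

# Known ratings from the table; "?" entries are simply left out.
ratings = {
    "Mohammad": {"StarTrek": 4, "StarWars": 2, "Batman": 5, "Hulk": 4},
    "Paras": {"StarTrek": 5, "StarWars": 3, "Superman": 4, "Hulk": 3},
    "Huashun": {"StarTrek": 3, "Superman": 4, "Batman": 4, "Hulk": 3},
}


def cosine_similarity(a, b):
    """Cosine similarity restricted to the movies both users have rated."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    numerator = 0.0
    denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        numerator += sim * their_ratings[movie]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


sims = {
    other: cosine_similarity(ratings["Mohammad"], ratings[other])
    for other in ratings
    if other != "Mohammad"
}
prediction = predict("Mohammad", "Superman")
print(sims)
print(prediction)
```

Because both other users gave Superman the same rating, the weighted average does not depend on the similarity weights in this particular dataset.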

b.  You are given a dataset “ratings” which contains movie ratings consisting of user, movie, rating and timestamp columns. The column names are userId, movieId, rating and ts respectively. Write a basic Machine Learning program in PySpark to build the recommendation model and to make recommendations. Write the missing code snippets in the program given below. (4 Marks)

from pyspark.ml.recommendation import ________

# Task #1: Split the dataset into training and test data (80% training and 20% test)

(trainingData, testData) = ________

# Task #2: Build the recommendation model using ALS on the training data
# Use maxIter = 10, coldStartStrategy = "drop"

# make predictions

predictions = model.transform(testData)

# Task #3: Generate top 10 movie recommendations for each user

# Write code below



Question 4

StopHacking is a start-up incubated at Monash University to develop a cloud service that detects and stops computer hackers. Although they have some rule-based services to identify certain hacks, they would like to add machine learning models which can integrate with their Spark cluster to process large amounts of data and detect any potential hacks. The dataset contains an “attack” column representing whether the request was an attack or not.

They hired you as the Lead Data Scientist and Peter (your intern) to investigate the open data from the Cyber Range Labs of UNSW Canberra and build a model based on the data to identify abnormal system behaviour.

Before proceeding with the development of ML models, Peter has some questions in mind that he would like your input on.

1.   Peter is not sure whether this is a classification or a regression problem. Which is it? Briefly discuss, with examples, when we use classification and when we use regression. (2 Marks)

2.   Upon investigation of the data, Peter has found that the data is imbalanced. Please suggest ways to handle an imbalanced dataset. (2 Marks)

3.  You have prepared an estimator for the Decision Tree Model. Executing a Decision tree algorithm is a simple task. But, Peter still has some doubts. (2 + 3 = 5 Marks)

a.   How does tree splitting take place? Explain in the context of the ID3 algorithm.

b.   The models perform well on the training data but generalize poorly to new instances. Peter is not sure what is happening. Can you explain what is happening and suggest two possible solutions?

4.   What are False Positive (FP) and False Negative (FN) in a confusion matrix? Which value should we try to reduce in this scenario? Discuss briefly. (3 Marks)


Question 5

Spectroscopy products developed at Divergent Technologies generate a lot of performance and diagnostic data. The data is typically stored locally on the controlling PC’s hard disk drive and only analysed for the purpose of reviewing function and performance as part of short-term test requirements. Further analysis (such as trend analysis, predictive analytics, comparative studies, regression/correlation, etc.) is currently very challenging and is done manually on an as-needs basis.

You and Neha have been hired as summer interns to implement machine learning algorithms with the data generated by the spectroscopy products. These spectroscopy products have sensor arrays installed, and it is anticipated that ML techniques could prove extremely valuable by enabling timely preventative maintenance of the sensors and/or responsive, lower-cost repairs. Ultimately, this may lead to the development of a saleable product in this area, with potential use across the broader Divergent instrument portfolio.

You are working on streaming data from the sensors and Neha has some questions for you before she can develop the machine learning models.

1.   The spectroscopy product has multiple sensors attached to it that measure different things, for example light, gas and heat emission. Can you please explain two different methods that can be used to lower the granularity of the sensor arrays? (4 Marks)

2.   There are three main sensors in the spectroscopy products, so Neha is planning to send the data using three Kafka producers that share the same topic, “spectroscopy_streams”. The sensors produce data as key-value pairs in the format below, and the data is sent as bytes. (4 Marks)

"gas": 125

"light": 3298

"heat": 78

In Apache Spark Streaming, the received data looks like the following.

 

Please complete the code below for Apache Spark Streaming to find the average for each sensor every 10 seconds.

from pyspark.sql import SparkSession

spark = SparkSession. ...

# Task #1: Subscribe to the topic "spectroscopy_streams". The server is running on 192.168.0.10, port 9092.

# Task #2: Find the average for each sensor.

# Task #3: Start running the query that prints the running averages to the console every 10 seconds.

query.awaitTermination()



The output will be as shown in the example below.

 

3.   Is the windowing method mentioned in the question a time-based window or a tuple-based window? Please explain. How can you enable time-based overlapping sliding windows in Apache Spark Structured Streaming? (4 Marks)



