
STATS 769

STATISTICS

Data Science Practice

SECOND SEMESTER, 2017

1. [5 marks]

Figure 1 shows the content of an HTML file, "2017-07-29.html". This is a simplified extract from the HOT 100 songs web site that was used for the web scraping lab.

Write five R expressions that use functions from the xml2 package (and XPath expressions) to perform the following steps:

• Read the HTML file into R.

• Extract the song title from the HTML (output shown below).

[1] "Despacito"

• Extract the artist name from the HTML (output shown below; note that white space has been removed).

[1] "Luis Fonsi & Daddy Yankee Featuring Justin Bieber"

• Extract the song rank from the HTML (output shown below).

[1] "1"

• Extract the rank from the previous week from the HTML (output shown below; note that the result is more than one character value).

[1] "Last Week" "1"

<!doctype html>
<html class="" lang="">
<body>
<article class="chart-row chart-row--1" data-songtitle="Despacito">
  <div class="chart-row__primary">
    <div class="chart-row__history chart-row__history--steady"></div>
    <div class="chart-row__main-display">
      <div class="chart-row__rank">
        <span class="chart-row__current-week">1</span>
        <span class="chart-row__last-week">Last Week: 1</span>
      </div>
      <div class="chart-row__container">
        <div class="chart-row__title">
          <h2 class="chart-row__song">Despacito</h2>
          <a class="chart-row__artist" data-tracklabel="Artist Name">
            Luis Fonsi & Daddy Yankee Featuring Justin Bieber
          </a>
        </div>
      </div>
    </div>
  </div>
  <div id="chart-row-1-secondary" class="chart-row__secondary">
    <div class="chart-row__stats">
      <div class="chart-row__last-week">
        <span class="chart-row__label">Last Week</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__top-spot">
        <span class="chart-row__label">Peak Position</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__weeks-on-chart">
        <span class="chart-row__label">Wks on Chart</span>
        <span class="chart-row__value">26</span>
      </div>
    </div>
  </div>
</article>
</body>
</html>

Figure 1: The HTML file "2017-07-29.html".
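For reference, a minimal sketch of one possible set of expressions, assuming the file sits in the working directory; the XPath expressions here are based on the class names in Figure 1, and other formulations would work equally well.

library(xml2)

# Read the HTML file into R
html <- read_html("2017-07-29.html")

# Song title (the only h2 element in the file)
xml_text(xml_find_all(html, "//h2"))

# Artist name, with white space collapsed and trimmed
trimws(gsub("\\s+", " ",
            xml_text(xml_find_all(html, "//a[@data-tracklabel='Artist Name']"))))

# Current-week rank
xml_text(xml_find_all(html, "//span[contains(@class, 'current-week')]"))

# Rank from the previous week (label and value spans in the secondary section)
xml_text(xml_find_all(html, "//div[contains(@class, 'last-week')]/span"))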

2. [5 marks]

Write a paragraph explaining the purpose of the flatten() function from the jsonlite package. You should provide at least one example of its use.
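As a minimal illustration of the kind of example that would do (the JSON string is invented for this illustration): fromJSON() leaves nested objects as data-frame columns, and flatten() promotes them to top-level columns with dotted names.

library(jsonlite)

json <- '[{"name": "A", "stats": {"wins": 3, "losses": 1}},
          {"name": "B", "stats": {"wins": 2, "losses": 2}}]'

df <- fromJSON(json)
str(df)      # 'stats' is a nested data frame column

flat <- flatten(df)
str(flat)    # columns are now: name, stats.wins, stats.losses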

3. [10 marks]

Explain what each of the following shell commands is doing and, where there is output, what the output means.  These commands were all run on one of the virtual machines that were used in the course.

pmur002@stats769prd01:~/$ mkdir exam
pmur002@stats769prd01:~/$ cd exam
pmur002@stats769prd01:~/exam$ ls -1 /course/AT/BUSDATA/ | wc -l
98973
pmur002@stats769prd01:~/exam$ ls -l /course/AT/BUSDATA/ | awk '{ print($5) }' > sizes.txt
pmur002@stats769prd01:~/exam$ head sizes.txt
343
345
345
345
436
437
438
531
438
pmur002@stats769prd01:~/exam$ grep --no-filename ',6215,' \
>     /course/AT/BUSDATA/trip_updates_20170401*.csv > bus-6215-2017-04-01.csv
pmur002@stats769prd01:~/exam$ head bus-6215-2017-04-01.csv
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,252,6,7168,1490975922
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,293,9,8516,1490976233
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,299,NA,9,8516,1490976239
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,319,10,8524,1490976349
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,403,11,8532,1490976523

4. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to investigate how much memory would be required to load a large CSV file with several million rows into R. The plot that this code produces is also shown.

Explain what the code is doing and discuss whether this will lead to a good estimate of the memory required to read the complete CSV into R. Is there another way to estimate the memory required (without reading the entire CSV file into R)?

numLines <- 10^(1:5)

samples <- lapply(numLines,
                  function(i) {
                      read.csv("/course/AT/alldata.csv",
                               nrows=i, stringsAsFactors=FALSE)
                  })

plot(numLines, sapply(samples, object.size), log="xy",
     xlab="number of lines", ylab="data frame size")
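One alternative estimate (a sketch only; the total row count below is a placeholder, not a value taken from the file) is to extrapolate from the sampled object sizes, assuming memory use grows roughly linearly with the number of rows:

sizes <- sapply(samples, object.size)
fit <- lm(sizes ~ numLines)
# Placeholder for the true number of rows (e.g., obtained with 'wc -l' on the file)
totalRows <- 5e6
predict(fit, newdata = data.frame(numLines = totalRows))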

5. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to measure how much time is required to read different subsets of a large CSV file into R.

sapply(numLines,
       function(i) {
           system.time(read.csv("/course/AT/alldata.csv",
                                nrows=i, stringsAsFactors=FALSE))[1]
       })

The result of running this code is shown below.

user.self user.self user.self user.self user.self
    0.001     0.001     0.005     0.036     0.417

The following code was run to perform profiling.  The profiling result is shown below the code.

library(profvis)
p <- profvis({
    lapply(numLines,
           function(i) {
               read.csv("/course/AT/alldata.csv",
                        nrows=i, stringsAsFactors=FALSE)
           })
})
htmlwidgets::saveWidget(p, "profile.html")

Explain what the timing and profiling results mean.  Suggest how you could make the code run faster.
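One possible direction for a speed-up (a sketch only; assumes the data.table package is available on the VM):

# Supplying colClasses to read.csv() avoids repeated type guessing; using a
# faster CSV reader such as data.table::fread() is another option.
library(data.table)
dt <- fread("/course/AT/alldata.csv", nrows = 100000)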

6. [10 marks]

Write R code to perform a parallel version of the lapply() call from Question 4. Discuss the advantages and disadvantages of using the mclapply() (forking) approach compared to the makeCluster()  (socket) approach for this task.  Also discuss whether load balancing would make sense for this task.
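A minimal sketch of a forked parallel version of the Question 4 lapply() call (the core count here is an arbitrary assumption):

library(parallel)

samples <- mclapply(numLines,
                    function(i) {
                        read.csv("/course/AT/alldata.csv",
                                 nrows=i, stringsAsFactors=FALSE)
                    },
                    mc.cores = 4)  # forking; assumes a multi-core Linux VM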

7. [10 marks]

Suppose we estimate the regression coefficients in a linear regression by minimizing

Σ_{i=1}^{n} (yi − b0 − Σ_{j=1}^{p} bj xij)^2

subject to

Σ_{j=1}^{p} |bj| ≤ s

for a particular value of s. As we increase s from 0, indicate which of the following is correct. Justify your answer.

a. The training residual sum of squares (SS residual) will increase initially, and then eventually start decreasing in an inverted U shape.

b. The training SS residual will decrease initially, and then eventually start increasing in a U shape.

c. The training SS residual will steadily increase.

d. The training SS residual will steadily decrease.

e. The training SS residual will remain constant.

8. [10 marks]

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

obs   X1   X2   X3   Y
  1    0    3    0   Red
  2    2    0    0   Red
  3    0    1    3   Red
  4    0    1    2   Green
  5   -1    0    1   Green
  6    1    1    1   Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using a K-NN classifier.

a. Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0. (Recall that the Euclidean distance between two points (x1, x2, x3) and (y1, y2, y3) is sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2).)

b. What is our prediction with K = 1? Why?

c. What is our prediction with K = 3? Why?

d. If the Bayes decision boundary (optimal boundary) in this problem is highly non-linear, then would we expect the best  value for K to be large or small? Why?
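A short R sketch of the distance calculation in part (a); the data frame simply transcribes the table above:

train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                    X2 = c(3, 0, 1, 1, 0, 1),
                    X3 = c(0, 0, 3, 2, 1, 1),
                    Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
test <- c(0, 0, 0)

# Euclidean distance from each observation to the test point
d <- sqrt((train$X1 - test[1])^2 +
          (train$X2 - test[2])^2 +
          (train$X3 - test[3])^2)
round(d, 2)
# 3.00 2.00 3.16 2.24 1.41 1.73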

9. [10 marks]

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = b0 + b1X + b2X^2 + b3X^3 + ε.

a. Suppose that the true relationship between X and Y is linear, i.e. Y = b0 + b1X + ε. Consider the training residual sum of squares (SS residual) for the linear regression, and also the training SS residual for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

b. Answer part (a) using test SS residual  rather than training SS residual.

10. [10 marks]

Explain how k-fold cross-validation is implemented. What are the advantages and disadvantages of k-fold cross-validation relative to the validation set approach?
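A minimal sketch of the procedure in R (the data set mydata and the model y ~ x are hypothetical placeholders):

k <- 10
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mydata)))

cvErrors <- sapply(1:k, function(i) {
    # Fit on all folds except fold i, then predict the held-out fold
    fit <- lm(y ~ x, data = mydata[folds != i, ])
    pred <- predict(fit, newdata = mydata[folds == i, ])
    mean((mydata$y[folds == i] - pred)^2)
})

# The k-fold CV estimate of test error is the average over the folds
mean(cvErrors)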

11. [10 marks]

When the number of features p is large, there tends to be a deterioration in the performance of K-NN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large.

a. Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

b. Now suppose that we have a set of observations, each with measurements on p = 2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation's response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?

c. Now suppose that we have a set of observations on p = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

d. Using your answers to parts (a)–(c), argue that a drawback of K-NN when p is large is that there are very few training observations “near” any given test observation.
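The arithmetic behind parts (a)-(c) reduces to a single expression: with a 10% window per (independent, uniform) feature, the expected fraction of observations used is roughly 0.1^p.

0.1^1    # p = 1:   0.1  (about 10% of the observations)
0.1^2    # p = 2:   0.01 (about 1%)
0.1^100  # p = 100: 1e-100, essentially none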

 


