
STATS 769

STATISTICS

Data Science Practice

SECOND SEMESTER, 2017

1. [5 marks]

Figure 1 shows the content of an HTML file, "2017-07-29.html". This is a simplified extract from the HOT 100 songs web site that was used for the web scraping lab.

Write five R expressions that use functions from the xml2 package (and XPath expressions) to perform the following steps:

• Read the HTML file into R.

• Extract the song title from the HTML (output shown below).

[1] "Despacito"

• Extract the artist name from the HTML (output shown below; note that white space has been removed).

[1] "Luis Fonsi & Daddy Yankee Featuring Justin Bieber"

• Extract the song rank from the HTML (output shown below).

[1] "1"

• Extract the rank from the previous week from the HTML (output shown below; note that the result is more than one character value).

[1] "Last Week" "1"

<!doctype html>
<html class="" lang="">
<body>
<article class="chart-row chart-row--1" data-songtitle="Despacito">
  <div class="chart-row__primary">
    <div class="chart-row__history chart-row__history--steady"></div>
    <div class="chart-row__main-display">
      <div class="chart-row__rank">
        <span class="chart-row__current-week">1</span>
        <span class="chart-row__last-week">Last Week: 1</span>
      </div>
      <div class="chart-row__container">
        <div class="chart-row__title">
          <h2 class="chart-row__song">Despacito</h2>
          <a class="chart-row__artist" data-tracklabel="Artist Name">
            Luis Fonsi & Daddy Yankee Featuring Justin Bieber
          </a>
        </div>
      </div>
    </div>
  </div>
  <div id="chart-row-1-secondary" class="chart-row__secondary">
    <div class="chart-row__stats">
      <div class="chart-row__last-week">
        <span class="chart-row__label">Last Week</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__top-spot">
        <span class="chart-row__label">Peak Position</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__weeks-on-chart">
        <span class="chart-row__label">Wks on Chart</span>
        <span class="chart-row__value">26</span>
      </div>
    </div>
  </div>
</article>
</body>
</html>

Figure 1: The HTML file "2017-07-29.html".
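For reference, a minimal sketch of one possible set of expressions, assuming the file sits in the working directory; the XPath expressions here are based on the class names in Figure 1, and other formulations would work equally well.

library(xml2)

# Read the HTML file into R
html <- read_html("2017-07-29.html")

# Song title (the only h2 element in the file)
xml_text(xml_find_all(html, "//h2"))

# Artist name, with white space collapsed and trimmed
trimws(gsub("\\s+", " ",
            xml_text(xml_find_all(html, "//a[@data-tracklabel='Artist Name']"))))

# Current-week rank
xml_text(xml_find_all(html, "//span[contains(@class, 'current-week')]"))

# Rank from the previous week (label and value spans in the secondary section)
xml_text(xml_find_all(html, "//div[contains(@class, 'last-week')]/span"))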

2. [5 marks]

Write a paragraph explaining the purpose of the flatten() function from the jsonlite package. You should provide at least one example of its use.
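As a minimal illustration of the kind of example that would do (the JSON string is invented for this illustration): fromJSON() leaves nested objects as data-frame columns, and flatten() promotes them to top-level columns with dotted names.

library(jsonlite)

json <- '[{"name": "A", "stats": {"wins": 3, "losses": 1}},
          {"name": "B", "stats": {"wins": 2, "losses": 2}}]'

df <- fromJSON(json)
str(df)      # 'stats' is a nested data frame column

flat <- flatten(df)
str(flat)    # columns are now: name, stats.wins, stats.losses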

3. [10 marks]

Explain what each of the following shell commands is doing and, where there is output, what the output means.  These commands were all run on one of the virtual machines that were used in the course.

pmur002@stats769prd01:~/$ mkdir exam
pmur002@stats769prd01:~/$ cd exam
pmur002@stats769prd01:~/exam$ ls -1 /course/AT/BUSDATA/ | wc -l
98973
pmur002@stats769prd01:~/exam$ ls -l /course/AT/BUSDATA/ | awk '{ print($5) }' > sizes.txt
pmur002@stats769prd01:~/exam$ head sizes.txt
343
345
345
345
436
437
438
531
438
pmur002@stats769prd01:~/exam$ grep --no-filename ',6215,' \
>     /course/AT/BUSDATA/trip_updates_20170401*.csv > bus-6215-2017-04-01.csv
pmur002@stats769prd01:~/exam$ head bus-6215-2017-04-01.csv
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,252,6,7168,1490975922
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,293,9,8516,1490976233
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,299,NA,9,8516,1490976239
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,319,10,8524,1490976349
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,403,11,8532,1490976523

4. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to investigate how much memory would be required to load a large CSV file with several million rows into R. The plot that this code produces is also shown.

Explain what the code is doing and discuss whether this will lead to a good estimate of the memory required to read the complete CSV into R. Is there another way to estimate the memory required (without reading the entire CSV file into R)?

numLines <- 10^(1:5)

samples <- lapply(numLines,
                  function(i) {
                      read.csv("/course/AT/alldata.csv",
                               nrows=i, stringsAsFactors=FALSE)
                  })

plot(numLines, sapply(samples, object.size), log="xy",
     xlab="number of lines", ylab="data frame size")
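One alternative estimate (a sketch only; the total row count below is a placeholder, not a value taken from the file) is to extrapolate from the sampled object sizes, assuming memory use grows roughly linearly with the number of rows:

sizes <- sapply(samples, object.size)
fit <- lm(sizes ~ numLines)
# Placeholder for the true number of rows (e.g., obtained with 'wc -l' on the file)
totalRows <- 5e6
predict(fit, newdata = data.frame(numLines = totalRows))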

5. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to measure how much time is required to read different subsets of a large CSV file into R.

sapply(numLines,
       function(i) {
           system.time(read.csv("/course/AT/alldata.csv",
                                nrows=i, stringsAsFactors=FALSE))[1]
       })

The result of running this code is shown below.

user.self user.self user.self user.self user.self
    0.001     0.001     0.005     0.036     0.417

The following code was run to perform profiling.  The profiling result is shown below the code.

library(profvis)
p <- profvis({
    lapply(numLines,
           function(i) {
               read.csv("/course/AT/alldata.csv",
                        nrows=i, stringsAsFactors=FALSE)
           })
})
htmlwidgets::saveWidget(p, "profile.html")

Explain what the timing and profiling results mean.  Suggest how you could make the code run faster.
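One possible direction for a speed-up (a sketch only; assumes the data.table package is available on the VM):

# Supplying colClasses to read.csv() avoids repeated type guessing; using a
# faster CSV reader such as data.table::fread() is another option.
library(data.table)
dt <- fread("/course/AT/alldata.csv", nrows = 100000)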

6. [10 marks]

Write R code to perform a parallel version of the lapply() call from Question 4. Discuss the advantages and disadvantages of using the mclapply() (forking) approach compared to the makeCluster()  (socket) approach for this task.  Also discuss whether load balancing would make sense for this task.
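A minimal sketch of a forked parallel version of the Question 4 lapply() call (the core count here is an arbitrary assumption):

library(parallel)

samples <- mclapply(numLines,
                    function(i) {
                        read.csv("/course/AT/alldata.csv",
                                 nrows=i, stringsAsFactors=FALSE)
                    },
                    mc.cores = 4)  # forking; assumes a multi-core Linux VM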

7. [10 marks]

Suppose we estimate the regression coefficients in a linear regression by minimizing

Σ_{i=1}^{n} (yi − b0 − Σ_{j=1}^{p} bj xij)^2

subject to

Σ_{j=1}^{p} |bj| ≤ s

for a particular value of s. As we increase s from 0, indicate which of the following is correct. Justify your answer.

a. The training residual sum of squares (SS residual) will increase initially, and then eventually start decreasing in an inverted U shape.

b. The training SS residual will decrease initially, and then eventually start increasing in a U shape.

c. The training SS residual will steadily increase.

d. The training SS residual will steadily decrease.

e. The training SS residual will remain constant.

8. [10 marks]

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

obs   X1   X2   X3   Y
  1    0    3    0   Red
  2    2    0    0   Red
  3    0    1    3   Red
  4    0    1    2   Green
  5   -1    0    1   Green
  6    1    1    1   Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using a K-NN classifier.

a. Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0. (Recall that the Euclidean distance between two points (x1, x2, x3) and (y1, y2, y3) is sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2).)

b. What is our prediction with K = 1? Why?

c. What is our prediction with K = 3? Why?

d. If the Bayes decision boundary (optimal boundary) in this problem is highly non-linear, then would we expect the best  value for K to be large or small? Why?
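A short R sketch of the distance calculation in part (a); the data frame simply transcribes the table above:

train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                    X2 = c(3, 0, 1, 1, 0, 1),
                    X3 = c(0, 0, 3, 2, 1, 1),
                    Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
test <- c(0, 0, 0)

# Euclidean distance from each observation to the test point
d <- sqrt((train$X1 - test[1])^2 +
          (train$X2 - test[2])^2 +
          (train$X3 - test[3])^2)
round(d, 2)
# 3.00 2.00 3.16 2.24 1.41 1.73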

9. [10 marks]

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = b0 + b1X + b2X^2 + b3X^3 + ε.

a. Suppose that the true relationship between X and Y is linear, i.e. Y = b0 + b1X + ε. Consider the training residual sum of squares (SS residual) for the linear regression, and also the training SS residual for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

b. Answer part (a) using test SS residual  rather than training SS residual.

10. [10 marks]

Explain how k-fold cross-validation is implemented. What are the advantages and disadvantages of k-fold cross-validation relative to the validation set approach?
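A minimal sketch of the procedure in R (the data set mydata and the model y ~ x are hypothetical placeholders):

k <- 10
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mydata)))

cvErrors <- sapply(1:k, function(i) {
    # Fit on all folds except fold i, then predict the held-out fold
    fit <- lm(y ~ x, data = mydata[folds != i, ])
    pred <- predict(fit, newdata = mydata[folds == i, ])
    mean((mydata$y[folds == i] - pred)^2)
})

# The k-fold CV estimate of test error is the average over the folds
mean(cvErrors)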

11. [10 marks]

When the number of features p is large, there tends to be a deterioration in the performance of K-NN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large.

a. Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

b. Now suppose that we have a set of observations, each with measurements on p = 2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation's response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?

c. Now suppose that we have a set of observations on p = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

d. Using your answers to parts (a)–(c), argue that a drawback of K-NN when p is large is that there are very few training observations “near” any given test observation.
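The arithmetic behind parts (a)-(c) reduces to a single expression: with a 10% window per (independent, uniform) feature, the expected fraction of observations used is roughly 0.1^p.

0.1^1    # p = 1:   0.1  (about 10% of the observations)
0.1^2    # p = 2:   0.01 (about 1%)
0.1^100  # p = 100: 1e-100, essentially none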

 


