
STATS 769

STATISTICS

Data Science Practice

SECOND SEMESTER, 2016

1.                            [5 marks]


Figure 1 shows the content of a JSON file, "AT.json", and the following code reads this file into R and shows the resulting R object, trips.

> library(jsonlite)

> trips <- fromJSON("AT.json")

> trips

id stop_time_update.stop_sequence stop_time_update.stop_id timestamp

1 2928 20 7812 1474316682

2 2929 60 6569 1474316645

> dim(trips)

[1] 2 3

> names(trips)

[1] "vehicle" "stop_time_update" "timestamp"

Explain what sort of R object has been created and write R code to extract the stop_id information from the R object trips.

[

{

"vehicle": {

"id": "2928"

},

"stop_time_update": {

"stop_sequence": 20,

"stop_id": "7812"

},

"timestamp": 1474316682

},

{

"vehicle": {

"id": "2929"

},

"stop_time_update": {

"stop_sequence": 60,

"stop_id": "6569"

},

"timestamp": 1474316645

}

]

Figure 1: The JSON file "AT.json".
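A possible answer sketch: with jsonlite's default simplification, fromJSON() turns the JSON array into a two-row data frame whose "vehicle" and "stop_time_update" columns are themselves data frames (which is why the printed column headings differ from names(trips)). One way to extract the stop_id values:

# stop_time_update is a nested data-frame column, so reach inside it with $
trips$stop_time_update$stop_id
# equivalently
trips[["stop_time_update"]][["stop_id"]]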

2.                          [5 marks]

Figure 2 shows the content of an XML file, "IRD.xml".

Write down the result of the following R code:

> library(xml2)

> ird <- read_xml("IRD.xml")

> xml_text(xml_find_all(ird, "//td[@align = 'right']"))

Write R code to extract the first column of values from the table. Your code should produce the following result:

[1] "18 NCO Club" "1977 Masters Association"

[3] "1979 Reunion" "1993 Summer Camp Account"

[5] "1St Wainuiomata Venterer Unit" "44 South Travel"

[7] "81 Masters Association"

<table width="100%" cellpadding="0" cellspacing="0"><tbody>

<tr>

<td>18 NCO Club</td>

<td align="right">$142.03</td>

</tr>

<tr>

<td>1977 Masters Association</td>

<td align="right">$359.77</td>

</tr>

<tr>

<td>1979 Reunion</td>

<td align="right">$532.77</td>

</tr>

<tr>

<td>1993 Summer Camp Account</td>

<td align="right">$1,308.78</td>

</tr>

<tr>

<td>1St Wainuiomata Venterer Unit</td>

<td align="right">$431.14</td>

</tr>

<tr>

<td>44 South Travel</td>

<td align="right">$489.60</td>

</tr>

<tr>

<td>81 Masters Association</td>

<td align="right">$221.08</td>

</tr>

</tbody></table>

Figure 2: The XML file "IRD.xml".
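A possible answer sketch: the XPath query in the question matches the right-aligned cells, so xml_text() would return the seven dollar amounts ("$142.03" through "$221.08"); the first column of the table can be selected by taking the first <td> element of each row.

# First <td> in every <tr> gives the account names shown in the expected result
xml_text(xml_find_all(ird, "//tr/td[1]"))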

3.                            [5 marks]

Explain what each of the following shell commands is doing and, where there is output, what the output means:

pmur002@sc-stat-346130:/home/paul$ ssh stats769prd01.its.auckland.ac.nz

pmur002@stats769prd01:~$ mkdir exam

pmur002@stats769prd01:~$ cd exam

pmur002@stats769prd01:~/exam$ cp /course/data/Ass2/exreg-10000.* .

pmur002@stats769prd01:~/exam$ ls -lh

total 16M

-rw-rw---- 1 pmur002 pmur002 7.8M Sep 20 12:12 exreg-10000.bin

-rw-rw---- 1 pmur002 pmur002 473 Sep 20 12:12 exreg-10000.desc

-rw-rw---- 1 pmur002 pmur002 7.7M Sep 20 12:12 exreg-10000.txt

pmur002@stats769prd01:~/exam$ wc exreg-10000.txt

10000 1010000 7979235 exreg-10000.txt

pmur002@stats769prd01:~/exam$ awk 'NR < 1000' exreg-10000.txt > exreg-sub.txt

4.                                     [5 marks]

Explain the memory usage results from the following R code and output.

What is the significance of the NULL at the end of the function definition?

> gc(reset=TRUE)

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 326457 17.5 592000 31.7 326457 17.5

Vcells 535537 4.1 1023718 7.9 535537 4.1

> x <- rnorm(1000000)

> gc()

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 326423 17.5 592000 31.7 328238 17.6

Vcells 1535490 11.8 2613614 20.0 1537444 11.8

> f <- function() {

+ x <- rnorm(1000000)

+ NULL

+ }

> f()

NULL

> gc()

used (Mb) gc trigger (Mb) max used (Mb)

Ncells 326438 17.5 592000 31.7 330982 17.7

Vcells 1535523 11.8 2613614 20.0 2541622 19.4
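A quick arithmetic check (a sketch): one million doubles at 8 bytes each is roughly 8 MB, which matches the rise in Vcells from about 4.1 MB to about 11.8 MB after x is created, and the later jump in "max used" to about 19.4 MB is consistent with a second such vector being allocated inside f().

# Sanity check: one million doubles occupies about 8 MB
print(object.size(numeric(1000000)), units = "Mb")   # roughly 7.6 Mb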

5.                                 [5 marks]

Figure 3 shows the first few lines of a CSV file called "1987.csv". This file contains flight data from 1987 for flights within the USA.

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarr

ier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay

,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,Carrier

Delay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,

NA,NA,NA,NA,NA

1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,

NA,NA,NA,NA,NA

...

Figure 3: The first few lines of the CSV file "1987.csv".

The following R code and output show the time and memory requirements involved in naively reading the file "1987.csv" into an R data frame and then calculating the average flight departure delay for each day of the week.

> system.time(f1987 <- read.csv("1987.csv"))

user system elapsed

8.785 0.176 8.967

> object.size(f1987)

152204024 bytes

> system.time(delays <- aggregate(f1987["DepDelay"],

list(DoW=f1987$DayOfWeek),

mean, na.rm=TRUE))

user system elapsed

0.997 0.004 1.001

> delays

DoW DepDelay

1 1 7.827491

2 2 9.086585

3 3 9.364805

4 4 8.143775

5 5 7.411825

6 6 6.034632

7 7 8.408912

Write R code that calls the read.csv() function, but with additional arguments that would make the call run faster and create a smaller data frame than the code above. Explain why your code would be faster and use less memory.

Write R code that uses functions from the data.table package to read the file "1987.csv" into R much faster and to calculate the average departure delay for each day of the week much faster.
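A possible sketch for both parts, assuming the question allows keeping only the two columns needed for the aggregation (DayOfWeek and DepDelay):

# Part 1: colClasses stops read.csv() from guessing column types and lets us
# skip the 27 columns we do not use, so less data is parsed and stored.
classes <- rep("NULL", 29)
classes[c(4, 16)] <- "integer"          # DayOfWeek and DepDelay
f1987 <- read.csv("1987.csv", colClasses = classes)

# Part 2: data.table's fread() is a much faster parser, and grouped
# aggregation with data.table syntax avoids the overhead of aggregate().
library(data.table)
dt1987 <- fread("1987.csv", select = c("DayOfWeek", "DepDelay"))
delays <- dt1987[, .(DepDelay = mean(DepDelay, na.rm = TRUE)),
                 by = .(DoW = DayOfWeek)]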

6.                                  [5 marks]

The following R code and output show the number of cores and the memory capacity of one of the virtual machines used in this course.

> detectCores()

[1] 20

> system("free")

total used free shared buffers cached

Mem: 206350080 42760988 163589092 44 262884 40465868

-/+ buffers/cache: 2032236 204317844

Swap: 1949692 131744 1817948

The following R code and output show information about the full set of CSV files ("1987.csv" to "2008.csv") that contain US flight data over 22 years.

> system("ls -lh /course/data/ASADataExpo/*.csv")

-rw-r--r-- 1 pmur002 pmur002 122M Aug 4 14:48 /course/data/ASADataExpo/1987.csv

-rw-r--r-- 1 pmur002 pmur002 478M Aug 4 14:48 /course/data/ASADataExpo/1988.csv

-rw-r--r-- 1 pmur002 pmur002 464M Aug 4 14:48 /course/data/ASADataExpo/1989.csv

-rw-r--r-- 1 pmur002 pmur002 486M Aug 4 14:50 /course/data/ASADataExpo/1990.csv

-rw-r--r-- 1 pmur002 pmur002 469M Aug 4 14:47 /course/data/ASADataExpo/1991.csv

-rw-r--r-- 1 pmur002 pmur002 470M Aug 4 14:49 /course/data/ASADataExpo/1992.csv

-rw-r--r-- 1 pmur002 pmur002 469M Aug 4 14:49 /course/data/ASADataExpo/1993.csv

-rw-r--r-- 1 pmur002 pmur002 479M Aug 4 14:48 /course/data/ASADataExpo/1994.csv

-rw-r--r-- 1 pmur002 pmur002 507M Aug 4 14:48 /course/data/ASADataExpo/1995.csv

-rw-r--r-- 1 pmur002 pmur002 510M Aug 4 14:52 /course/data/ASADataExpo/1996.csv

-rw-r--r-- 1 pmur002 pmur002 516M Aug 4 14:49 /course/data/ASADataExpo/1997.csv

-rw-r--r-- 1 pmur002 pmur002 514M Aug 4 14:47 /course/data/ASADataExpo/1998.csv

-rw-r--r-- 1 pmur002 pmur002 528M Aug 4 14:48 /course/data/ASADataExpo/1999.csv

-rw-r--r-- 1 pmur002 pmur002 544M Aug 4 14:47 /course/data/ASADataExpo/2000.csv

-rw-r--r-- 1 pmur002 pmur002 573M Aug 4 14:48 /course/data/ASADataExpo/2001.csv

-rw-r--r-- 1 pmur002 pmur002 506M Aug 4 14:47 /course/data/ASADataExpo/2002.csv

-rw-r--r-- 1 pmur002 pmur002 598M Aug 4 14:50 /course/data/ASADataExpo/2003.csv

-rw-r--r-- 1 pmur002 pmur002 639M Aug 4 14:47 /course/data/ASADataExpo/2004.csv

-rw-r--r-- 1 pmur002 pmur002 640M Aug 4 14:49 /course/data/ASADataExpo/2005.csv

-rw-r--r-- 1 pmur002 pmur002 641M Aug 4 14:49 /course/data/ASADataExpo/2006.csv

-rw-r--r-- 1 pmur002 pmur002 671M Aug 4 14:50 /course/data/ASADataExpo/2007.csv

-rw-r--r-- 1 pmur002 pmur002 658M Aug 4 14:47 /course/data/ASADataExpo/2008.csv

The following R code could be used to read all 22 CSV files into R as a single data frame and calculate the average departure delay for each day of the week.

filenames <- paste0(1987:2008, ".csv")

flights <- do.call(rbind, lapply(filenames, read.csv))

aggregate(flights["DepDelay"],

list(DoW=flights$DayOfWeek),

mean, na.rm=TRUE)

Discuss the time and memory requirements that would be involved in running this R code and whether it would be able to run on the virtual machine.
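A back-of-envelope sketch of the memory side (assuming in-memory size scales roughly as it did for "1987.csv"):

# 1987.csv: 122 MB on disk became about 152 MB (object.size) in R, a factor of ~1.2.
# The 22 files total roughly 11.2 GB on disk, so the combined data frame would need
# on the order of 13-14 GB, and lapply()/rbind() temporarily hold the individual
# pieces and the combined result at the same time, so peak usage is higher still.
11.2 * (152204024 / (122 * 2^20))       # roughly 13 GB for the final data frame
# The machine reports about 206350080 KB (~197 GB) of RAM, so memory should suffice,
# but reading serially would take roughly 22 * 9 seconds plus the slow rbind() step.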

7.                              [10 marks]

The function sumFile() (code not shown) takes one argument, the name of a CSV file, and, for each day of the week, calculates the sum of the departure delay column and counts the number of non-NA values in that column. The result of calling this function on the file "1987.csv" is shown below.

> sumFile("1987.csv")

DoW sum count

1: 1 1457291 186176

2: 2 1692113 186221

3: 3 1757905 187714

4: 4 1619129 198818

5: 5 1356023 182954

6: 6 1037999 172007

7: 7 1498897 178251

Write R code that calls the sumFile() function in parallel across multiple cores to calculate sums and counts for all CSV files in the data set and then combines the results from all CSV files to calculate the average departure delay for each day of the week across all 22 files. The final result would look like the output below.

> meanDepDelay

[,1] [,2]

[1,] 1 7.850057

[2,] 2 6.855870

[3,] 3 7.651197

[4,] 4 9.246910

[5,] 5 10.151539

[6,] 6 6.887023

[7,] 7 8.409293

Discuss the best way to schedule the 22 calls to sumFile() across the multiple cores (hint: think about whether each call to sumFile(), which handles a different CSV file, will take the same amount of time to run).
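A possible sketch using the parallel package, assuming sumFile() is already defined, each result is ordered DoW = 1 to 7, and the files live at the paths listed in Question 6. Because the files differ substantially in size, dynamic (load-balanced) scheduling is used rather than pre-scheduling fixed chunks of files to cores:

library(parallel)
filenames <- paste0("/course/data/ASADataExpo/", 1987:2008, ".csv")
# mc.preschedule = FALSE hands each file to the next free core as it becomes
# available, which balances the unequal file sizes better than fixed chunks.
results <- mclapply(filenames, sumFile, mc.cores = 20, mc.preschedule = FALSE)
# Add up the per-file sums and counts for each day of the week, then divide.
totalSum   <- Reduce(`+`, lapply(results, function(r) r$sum))
totalCount <- Reduce(`+`, lapply(results, function(r) r$count))
meanDepDelay <- cbind(1:7, totalSum / totalCount)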

8.                           [5 marks]

The following code was used to profile the sumFile() function.

> Rprof("sumFile.log")

> sumFile("1987.csv")

> Rprof(NULL)

The profile results are summarised below using summaryRprof() ...

> summaryRprof("sumFile.log")

$by.self

self.time self.pct total.time total.pct

"fread" 1.38 97.18 1.38 97.18

"!" 0.02 1.41 0.02 1.41

"forderv" 0.02 1.41 0.02 1.41

$by.total

total.time total.pct self.time self.pct

"sumFile" 1.42 100.00 0.00 0.00

"fread" 1.38 97.18 1.38 97.18

"[" 0.04 2.82 0.00 0.00

"[.data.table" 0.04 2.82 0.00 0.00

"!" 0.02 1.41 0.02 1.41

"forderv" 0.02 1.41 0.02 1.41

$sample.interval

[1] 0.02

$sampling.time

[1] 1.42

... and using profReport().

> profReport("sumFile.log")

sumFile > fread

---------------

1.38

sumFile > [ > [.data.table > !

------------------------------

0.02

sumFile > [ > [.data.table > forderv

------------------------------------

0.02

Explain the profile results and what they tell us about how the sumFile() function works and where it spent most of its time.

9.                           [5 marks]

i. Explain why we might need to use the functions GET() or POST() from the httr package to access a web site, rather than just the download.file() function.

ii. Explain why we might need to use the functions makeCluster() and clusterApply(), instead of the mclapply() function.

10.                             [10 marks]

Table 1 shows the SSresidual and adjusted R² values for a model selection procedure applied to a linear regression problem with 6 input variables, x1, ..., x6. At step i, the model contains the input variable shown in the "Variables entered" column together with the variables entered at all previous steps. At step i, the SSresidual column shows the residual sum of squares after the variable in the "Variables entered" column has been added to the previous model (from step i − 1). This value is the smallest SSresidual among all possible models that add a single predictor to the previous model (from step i − 1).

1. Which of the model selection procedures is used in this problem?

2. Based on Table 1, what is the best linear predictor for this problem?

Step  Variables entered  SSresidual  Adjusted R²
 0    Intercept            34.026       0.700
 1    x4                   33.789       0.707
 2    x3                   33.584       0.703
 3    x2                   33.583       0.713
 4    x5                   33.586       0.714
 5    x1                   33.590       0.713
 6    x6                   33.595       0.701

Table 1: Summary of the model selection procedure

11.                        [10 marks]

The following code has been used to fit a k-nearest neighbor classifier to the given data.

> library(class)

> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])

> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])

> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))

> knnout <- knn(train, test, cl, k = 3)

> knnout

[1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v

[29] c c c c c v c c c c c c c c c c c c c c c c v c c v v v

[57] v v v v v v v c v v v v v v v v v v v

Levels: c s v

> table(knnout,cl)

cl

knnout c s v

c 23 0 3

s 0 25 0

v 2 0 22

Based on the output of the code, calculate

1. the misclassification rate for test data.

2. the sensitivity and specificity for test data.
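A possible sketch of the arithmetic, computed directly from the confusion matrix above (rows are the knn predictions, columns the true classes); sensitivity and specificity are computed per class in a one-versus-rest sense:

tab <- matrix(c(23, 0, 2,  0, 25, 0,  3, 0, 22), nrow = 3,
              dimnames = list(knnout = c("c", "s", "v"), cl = c("c", "s", "v")))
1 - sum(diag(tab)) / sum(tab)      # misclassification rate: (3 + 2) / 75, about 6.7%
diag(tab) / colSums(tab)           # sensitivity per class: c 23/25, s 25/25, v 22/25
sapply(1:3, function(i)            # specificity per class: c 47/50, s 50/50, v 48/50
    sum(tab[-i, -i]) / sum(tab[, -i]))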

12.                       [10 marks]

Kernel density estimation is an unsupervised learning procedure that estimates the probability density of a new observation x0 by counting observations close to it, with weights that decrease with distance from it. Formally, for a given univariate random sample x1, ..., xn drawn from a probability density g_X(x), kernel density estimation uses the following formula to estimate the probability density g_X(x0) of a new observation x0,

ĝ_X(x0) = (1 / (n h)) ∑_{i=1}^{n} K_h(x0, xi)

where n is the sample size and h is a tuning parameter for the kernel function K. Assume that a given learning sample L, with one input variable, contains 100 data points from population I and 100 data points from population II. We want to classify a new observation into one of these two populations based on this learning sample. This is a classification problem with one input variable and a response with two categories. Explain a classifier algorithm that uses kernel density estimation to classify the new observation.
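A minimal sketch of one such classifier, assuming equal class priors (100 observations in each population) and a Gaussian kernel; x1 and x2 below are hypothetical vectors holding the input variable for populations I and II:

kdeClassify <- function(x0, x1, x2, h) {
    # Kernel density estimate: (1 / (n h)) * sum_i K((x0 - xi) / h), Gaussian K
    g1 <- mean(dnorm((x0 - x1) / h)) / h    # estimated density under population I
    g2 <- mean(dnorm((x0 - x2) / h)) / h    # estimated density under population II
    # With equal priors, assign x0 to the population with the larger estimated density
    if (g1 > g2) "I" else "II"
}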

13.                     [10 marks]

Explain the k-nearest neighbor algorithm for fitting models (a) when doing prediction of a continuous response, and (b) when doing classification.
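A minimal illustration of the idea (a hypothetical helper, not course code): find the k training points closest to the new point, then either average their responses (regression) or take a majority vote (classification).

knnPredict <- function(x0, X, y, k = 3) {
    # Euclidean distance from x0 to every row of the training matrix X
    d <- sqrt(rowSums((X - matrix(x0, nrow(X), ncol(X), byrow = TRUE))^2))
    nn <- order(d)[1:k]                    # indices of the k nearest neighbors
    if (is.numeric(y)) {
        mean(y[nn])                        # (a) continuous response: average
    } else {
        names(which.max(table(y[nn])))     # (b) classification: majority vote
    }
}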

14.                     [10 marks]

Explain the steps of how you would build a model for a classification problem when you receive a data set of 30,000 observations, 100 inputs and a categorical response variable.




