High-Dimensional Analysis

Weighting: 20%

Late-onset Alzheimer’s disease (LOAD) is not well understood in terms of its causes and suitable treatments. A deep understanding of relevant genetic factors could allow the development of effective drugs or other interventions to treat or prevent this condition. Zhang et al. (2013) published a study which looked at gene expression in the prefrontal cortex (PFC) of a large number of deceased donor subjects, with slightly over half having had LOAD and the remainder being cognitively healthy. The brain samples were provided by the Harvard Brain Tissue Resource Center (HBTRC) at McLean Hospital in Belmont, Massachusetts, USA (Boston). Each subject had been diagnosed with respect to LOAD while still alive, with further extensive pathology examination after death.

The datasets were made public, with the brain dataset now available from https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE44772 and also from Blackboard. We ask that you use the Blackboard version, as it has had a small amount of imputation done and we will only consider part of the original dataset.

The patient condition labels are stored in brain_sample_descriptions_PFC.csv: normal (“N”) or with Alzheimer’s (“A”), along with each subject’s age in years at time of death and sex (“M” or “F”). The batch-normalised, logged and otherwise processed gene expression data is stored in braindat.csv. This processed dataset contains gene expression data derived from brain samples from 230 people, processed with microarrays to record values for 39280 probes, each of which is intended to represent a different gene or other transcript.

In addition to the scientific interest in the differences in gene expression between subjects with LOAD and those without, it is of interest to be able to determine whether a deceased subject would have been diagnosed with LOAD. In time, it might be possible to use gene expression profiles to try to predict this in living subjects as well.

Your tasks with the dataset are focused on classification of a sample as coming from a patient with late-onset Alzheimer’s disease or without, and identification of genes of potential interest.

You should select one classifier for the classification task that you have not used in previous assignments. Probability-based classifiers discussed in this course include linear, quadratic, mixture and kernel density discriminant analysis. Non-probability-based classifiers discussed include k nearest neighbours, neural networks, support vector machines, classification trees, random forests and boosted ensembles. All of these are implemented via various packages available in R and Python. If you wish to use a different method, please check with the lecturer. In addition, you will make use of lasso-penalised logistic regression. Note that you cannot choose another form of logistic regression as your other classifier.

The number of observations is less than the number of variables, so some form of dimensionality reduction is needed for most forms of probability-based classifier, and can be used if desired with the non-probability-based classifiers.

Here we consider analysis of this data to

(i) develop a model which is capable of accurately predicting the class (Alzheimer’s or normal) of new observations.

(ii) determine which genes are expressed differently between the two groups, individually, or as part of a combination.

Discriminant analysis/supervised classification can be applied to solve (i) and, in combination with feature (predictor) selection, can be used to provide a limited solution to (ii) also. Other methods such as single-variable analysis can also be applied to attempt to answer (ii). You should use R or Python for the assignment.

Tasks:

1.   (3 marks) Perform principal component analysis of the gene expression dataset and report and comment on the results. Detailed results should be submitted via a separate file, including what each principal component direction is composed of in terms of the (transformed) original explanatory variables, with some explanation in the main report about what is in the file. Give a plot or plots showing the individual proportions of variance explained by each component up to the first 30. Also produce and include another plot about the principal components which you think would be of interest to scientists and clinicians such as Zhang et al., along with some explanation and discussion. The R package FactoMineR is a good option for PCA.
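For those working in Python, the core of the PCA step might look like the minimal sketch below. It uses scikit-learn on synthetic data, not the real dataset: the matrix `X`, its orientation (subjects in rows, probes in columns) and all sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the expression matrix (assumption: rows are
# subjects, columns are probes; the real data would come from braindat.csv).
rng = np.random.default_rng(0)
n_subjects, n_probes = 60, 500
X = rng.normal(size=(n_subjects, n_probes))
X[:30, :20] += 2.0  # shift one group on a few probes to create structure

n_components = 30
pca = PCA(n_components=n_components)

# PCA centres each column internally before decomposing.
scores = pca.fit_transform(X)              # subject coordinates, shape (60, 30)
loadings = pca.components_                 # shape (30, 500): one row per PC direction
var_ratio = pca.explained_variance_ratio_  # individual proportion of variance per PC

for k in range(5):
    print(f"PC{k + 1}: {var_ratio[k]:.3f} of total variance")
```

Here `var_ratio` is what a scree-style bar plot of the first 30 components would display, and each row of `loadings` gives the weights of one component on the (centred) original variables. Whether the probes should also be scaled to unit variance beforehand is a choice the sketch does not make.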

2.   (3 marks) Perform single-variable analysis of the dataset with respect to the genetic probes, looking for a relationship with the response variable (the class). In doing so, use a linear model and adjust for both age and sex. Use the Benjamini-Hochberg (1995) or Benjamini-Yekutieli (2001) approach to control the false discovery rate to be at most 0.1 with respect to the probe hypothesis tests. Explain the assumptions of this approach and whether or not these are likely to be met by this dataset, along with possible consequences of any violations. Give pseudocode for the method you use. Give a plot of the log of gene index, ordered by p-value, versus the log of the unadjusted p-value, along with a line indicating the FDR control (similar to Figure 18.19 from Hastie et al., 2009). Determine which genes are then declared significant, along with the resulting threshold in the original p-values, and report the count of these. If you find more than 30 genes significant, list only the top 30 in your report, along with their p-values (adjusted for the two covariates mentioned, but not adjusted for the purposes of FDR).

Within the R stats package (built-in) is the function p.adjust, which offers this method. More advanced implementations include the fdrame package in Bioconductor.
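As a rough Python counterpart, the Benjamini-Hochberg step-up rule can be sketched directly in NumPy. The function name `benjamini_hochberg` and the ten p-values are made up for illustration; in the assignment the p-values would come from the per-probe linear models.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return (reject mask, p-value threshold) for the BH step-up rule at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Compare the k-th smallest p-value with its bound q * k / m.
    below = sorted_p <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if not below.any():
        return reject, 0.0
    k_max = np.nonzero(below)[0].max()  # largest rank that meets its bound
    threshold = sorted_p[k_max]
    # Step-up: reject everything ranked at or below k_max.
    reject[order[: k_max + 1]] = True
    return reject, threshold

# Hypothetical p-values, for illustration only.
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                  0.061, 0.074, 0.205, 0.212, 0.216])
reject, threshold = benjamini_hochberg(pvals, q=0.1)
print(reject.sum(), threshold)
```

In this toy example the third and fourth p-values exceed their own bounds (0.03 and 0.04) but are still rejected, because the fifth (0.042) falls below its bound of 0.05. That is the step-up behaviour: the threshold in the original p-values is the largest p-value that meets its bound.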

3.   (2 marks) Define binary logistic regression with a lasso penalty mathematically, including the function to be optimised, and briefly introduce a method that can be used to optimise it. Note that this might require a little research.
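As a starting point for that research, one common way of writing the objective is given below. The notation is an assumption, not taken from the course notes: responses y_i are coded 0/1, x_i is the vector of p predictors for subject i, β_0 is an unpenalised intercept, and λ ≥ 0 is the penalty weight, with the 1/n scaling following the convention used by glmnet.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \,(\beta_0 + x_i^{\top}\beta)
- \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
+ \lambda \sum_{j=1}^{p} |\beta_j| \right\}
```

The first term is the negative average Bernoulli log-likelihood and the second is the lasso penalty. Because the penalty is not differentiable at zero, methods such as cyclical coordinate descent (the approach used in glmnet) or proximal gradient descent are typically applied.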

4.   (1 mark) Explain the potential benefits and drawbacks of using PCA to reduce the dimensionality of the data before attempting to fit each type of classifier to this dataset. Decide whether you will use PCA to reduce dimensionality for each classifier and justify this decision.

5. Apply each classification method (your choice and lasso logistic regression) to the dataset using R or Python.

For lasso logistic regression in R, I suggest you use the glmnet package, available in CRAN, and make use of the function cv.glmnet and the family=“binomial” option. If you are interested, there is a recording of Trevor Hastie giving a tutorial on the lasso and glmnet at http://www.youtube.com/watch?v=BU2gjoLPfDc . There are other options in Python including in scikit-learn.
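For the scikit-learn route, a minimal sketch of a cross-validated search over the penalty could look like the following. The data are synthetic and the grid of values is arbitrary. Note that scikit-learn parameterises the penalty through `C`, an inverse regularisation strength (roughly λ = 1/(nC) in glmnet-style notation), and that the `liblinear` solver used here handles the intercept slightly differently from glmnet.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only the first 5 of 300 predictors carry signal.
rng = np.random.default_rng(1)
n, p = 120, 300
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta)))).astype(int)

def make_model(C):
    # Scaling sits inside the pipeline so it is re-fitted within each CV fold.
    return Pipeline([
        ("scale", StandardScaler()),
        ("lasso", LogisticRegression(penalty="l1", C=C, solver="liblinear")),
    ])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
Cs = np.logspace(-2, 0.5, 10)  # log-spaced grid of candidate values

# Mean cross-validated binomial deviance (log loss) for each candidate C.
cv_loss = np.array([
    -cross_val_score(make_model(C), X, y, cv=cv, scoring="neg_log_loss").mean()
    for C in Cs
])
best_C = Cs[cv_loss.argmin()]

final = make_model(best_C).fit(X, y)
coef = final.named_steps["lasso"].coef_.ravel()
selected = np.flatnonzero(coef)                          # predictors kept by the lasso
ranked = selected[np.argsort(-np.abs(coef[selected]))]   # largest |coefficient| first
print(best_C, ranked[:10])
```

Plotting `cv_loss` against `Cs` on a log scale gives the analogue of the cost-versus-λ graph, and `ranked` orders the retained predictors by absolute coefficient, which is one possible (not the only) notion of importance.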

a) (1 mark) For your chosen classifier, characterise each fitted class by reporting parameter estimates or a reasonable alternative.

b) (2 marks) For lasso logistic regression, you will need to use cross-validation to estimate the optimal value of λ. Explain how you plan to search over possible values. Then produce and explain a graph of your cost function versus λ. You should also produce a list, ordered by importance, of the genes included as predictor variables in the optimal classifier, along with their estimated coefficients.

For your chosen classifier, also determine an ordered list of the most important genes, stopping at 30, or earlier if justified. For each classifier, comment on any differences between the apparent and CV-derived overall error rates.

c) (2 marks) For both classifiers, give cross-validation (CV)-based estimates of the overall and class-specific error rates, obtained by training the classifier on a large fraction of the whole dataset, applying it to the remaining data and checking the error rates there. You may use K-fold CV with K ≥ 5 or leave-one-out cross-validation to estimate performance. Additionally, report the overall apparent error rates (when trained on all the data and applied back to it).
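The difference between apparent and CV-based error rates can be illustrated with a short sketch. It uses synthetic two-class data and k nearest neighbours purely as a placeholder classifier. The helper `error_rates` is a hypothetical name, and the class labels "N" and "A" simply mirror the coding described earlier.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption: 50 subjects per class, 40 predictors).
rng = np.random.default_rng(2)
n_per_class, p = 50, 40
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, p)),
    rng.normal(0.5, 1.0, size=(n_per_class, p)),
])
y = np.array(["N"] * n_per_class + ["A"] * n_per_class)

def error_rates(y_true, y_pred, labels=("N", "A")):
    """Overall error rate and a dict of class-specific error rates."""
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    overall = 1.0 - np.trace(cm) / cm.sum()
    per_class = {lab: 1.0 - cm[i, i] / cm[i].sum() for i, lab in enumerate(labels)}
    return overall, per_class

clf = KNeighborsClassifier(n_neighbors=5)

# CV estimate: every subject is predicted by a model that never saw it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_pred = cross_val_predict(clf, X, y, cv=cv)
cv_overall, cv_by_class = error_rates(y, cv_pred)

# Apparent estimate: train on all the data and predict the same data.
apparent_pred = clf.fit(X, y).predict(X)
apparent_overall, apparent_by_class = error_rates(y, apparent_pred)

print("CV:", cv_overall, cv_by_class)
print("Apparent:", apparent_overall, apparent_by_class)
```

Any data-dependent preprocessing (feature selection, PCA, tuning of λ) would need to be repeated inside each training fold for the CV estimate to remain honest; the sketch omits such steps.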

6. (3 marks) Compare the results from all approaches to analysis of the dataset (PCA, single-variable analysis and the two classifiers). Explain what each approach seems to offer, including consideration of these results as an example. In particular, if you had to suggest 10 genes for the biologists to study further for possible links to LOAD, which ones would you prioritise, and what makes you think they are worth studying further?

7. (3 marks) Mathematically define the partial correlation between two variables, assuming they come from a p-dimensional joint distribution. Consider only the probes which were selected using the lasso logistic regression. If you found more than 30, choose the 30 probes with the largest absolute value of coefficient. For this set of probes, estimate two Gaussian graphical models using the graphical lasso: one for subjects who were diagnosed with LOAD, and one for those who were not. Produce two graphs to represent the partial correlation matrices for each group, ignoring any partial correlations with absolute value less than 0.1. The graphs should include relevant edges and node labels, and indicate the strength of any dependence shown. What are the main differences between the two graphs? Note: you do not need to find out more about what each probe represents genetically.
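The link between the graphical lasso and partial correlations can be sketched as follows. Under a Gaussian model with precision matrix Ω, the partial correlation between variables i and j given all the others is −ω_ij / sqrt(ω_ii ω_jj). The code applies scikit-learn's `GraphicalLasso` to synthetic data for a single group. The penalty `alpha=0.05`, the six variables and the chain structure are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic data with a chain dependence: variable 0 -> 1 -> 2; 3..5 independent.
rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] += 0.8 * X[:, 1]
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each variable

gl = GraphicalLasso(alpha=0.05).fit(X)
omega = gl.precision_
omega = (omega + omega.T) / 2.0  # guard against tiny numerical asymmetry

# Partial correlation: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
d = np.sqrt(np.diag(omega))
partial_corr = -omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Keep only edges whose partial correlation is at least 0.1 in absolute value.
edges = [
    (i, j, partial_corr[i, j])
    for i in range(p)
    for j in range(i + 1, p)
    if abs(partial_corr[i, j]) >= 0.1
]
for i, j, r in edges:
    print(f"{i} -- {j}: {r:+.2f}")
```

The `edges` list is the kind of input a graph-drawing tool would need, with the stored value available for edge width or colour. For the assignment the same steps would be run separately on the LOAD and non-LOAD subjects, with the penalty chosen and justified rather than fixed as here.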



