STATS 4014 Advanced Data Science


Assignment 3
Jono Tuke
Semester 1 2019
CHECKLIST
- Have you shown all of your working, including probability notation where necessary?
- Have you given all numbers to 3 decimal places unless otherwise stated?
- Have you included all R output and plots to support your answers where necessary?
- Have you included all of your R code?
- Have you made sure that all plots and tables each have a caption?
- If before the deadline, have you submitted your assignment via the online submission on MyUni?
- Is your submission a single pdf file - correctly orientated, easy to read? If not, penalties apply.
- Penalties for more than one document: 10% of final mark for each extra document. Note that you may resubmit and your final version is marked, but the final document should be a single file.
- Penalties for late submission: within 24 hours, 40% of final mark. After 24 hours, the assignment is not marked and you get zero.
- Assignments emailed instead of submitted by the online submission on MyUni will not be marked and will receive zero.
- Have you checked that the assignment submitted is the correct one, as we cannot accept other submissions after the due date?
Due date: Friday 3rd May 2019 (Week 7), 5pm.
Q1. Bayesian connection to lasso and ridge regression
a. Suppose that
$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i,$$
where $\epsilon_i \sim$ iid $N(0, \sigma^2)$.
Write the likelihood for the data.
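As a sketch of the form the answer takes, assuming $n$ independent observations (the symbol $n$ is introduced here and is not in the question) and treating the $x_{ij}$ as fixed, the likelihood is a product of normal densities:

```latex
L(\beta_0, \dots, \beta_p, \sigma^2)
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ -\frac{1}{2\sigma^2}
      \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\}
```

Your own working should still show each step, with probability notation, as required by the checklist.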
b. Let $\beta_j$, $j = 1, \dots, p$ have priors that are iid with
$$p(\beta_j) = \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right),$$
i.e., they are i.i.d. with a double-exponential distribution with mean 0 and common scale parameter $b$.
Write out the posterior for β given the likelihood in Part a. Show that the lasso estimate is the mode
of the posterior.
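A numerical sanity check can complement the proof. The sketch below is not part of the assignment: it assumes a single predictor, no intercept, known $\sigma$, and simulated data, and all object names are made up for illustration. It minimises the negative log-posterior under a double-exponential prior and compares the result with the closed-form lasso solution for penalty $\lambda = 2\sigma^2 / b$.

```r
# Toy check: posterior mode under a Laplace prior vs. the lasso solution.
# Assumptions: one predictor, no intercept, sigma known, simulated data.
set.seed(1)
n     <- 50
x     <- rnorm(n)
sigma <- 1
b     <- 0.1                       # scale of the double-exponential prior
y     <- 2 * x + rnorm(n, sd = sigma)

# Negative log-posterior, up to an additive constant
neg_log_post <- function(beta) {
  sum((y - x * beta)^2) / (2 * sigma^2) + abs(beta) / b
}
beta_map <- optimize(neg_log_post, interval = c(-10, 10), tol = 1e-10)$minimum

# Lasso: minimise RSS + lambda * |beta| with lambda = 2 * sigma^2 / b.
# With one predictor the minimiser is a soft-thresholded least squares fit.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
lambda     <- 2 * sigma^2 / b
beta_lasso <- soft_threshold(sum(x * y), lambda / 2) / sum(x^2)

c(map = beta_map, lasso = beta_lasso)
```

The two numbers should agree up to the optimiser's tolerance, which is what the algebraic argument in part b predicts.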
c. Let $\beta_j$, $j = 1, \dots, p$ have priors that are i.i.d. normal with mean zero and variance $c$.
Write the posterior of $\beta_j$, $j = 1, \dots, p$. Hence show that the ridge regression estimate is both the mean and the
mode of the posterior.
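The same kind of numerical check works for the ridge case. This sketch assumes simulated data, no intercept, and known $\sigma^2$; the names `c_var`, `beta_ridge` and `beta_mode` are illustrative only. It compares the closed-form ridge estimate with penalty $\lambda = \sigma^2 / c$ against a numerically found posterior mode.

```r
# Toy check: posterior mode under a normal prior vs. the ridge estimate.
# Assumptions: no intercept, sigma^2 known, simulated data.
set.seed(2)
n      <- 40
p      <- 3
X      <- matrix(rnorm(n * p), n, p)
sigma2 <- 1
c_var  <- 0.5                      # prior variance c
y      <- drop(X %*% c(1, -2, 0.5)) + rnorm(n, sd = sqrt(sigma2))

# Ridge estimate with lambda = sigma^2 / c
lambda     <- sigma2 / c_var
beta_ridge <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

# Negative log-posterior, up to an additive constant
neg_log_post <- function(beta) {
  sum((y - X %*% beta)^2) / (2 * sigma2) + sum(beta^2) / (2 * c_var)
}
beta_mode <- optim(rep(0, p), neg_log_post, method = "BFGS",
                   control = list(reltol = 1e-12))$par

max(abs(beta_mode - beta_ridge))
```

Because the log-posterior is quadratic in $\beta$, the posterior is Gaussian, so its mean and mode coincide; the check above only confirms the mode numerically.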
Q2. Using data.table
In the following, you are advised to use data.table. Trying to use standard data manipulation may crash
your computer or take too long. The data in DNA_combined.csv is real data on DNA methylation in modern
and ancient DNA samples.
Each row in the dataset is a segment of DNA for which we have the following information:
chr: the chromosome the segment is from,
pos: the starting position of the segment on the chromosome,
N: the length of the segment in number of bases,
X: the number of the bases that are methylated,
type: whether the DNA is modern or ancient, and
ID: the ID of the individual that the DNA is from.
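A minimal sketch of reading a file with this layout using `data.table::fread`. Since DNA_combined.csv is not reproduced here, the example writes two made-up rows to a temporary file; the values are invented and only the column names follow the description above.

```r
library(data.table)

# Write a tiny stand-in for DNA_combined.csv (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("chr,pos,N,X,type,ID",
             "chr1,100,10,5,modern,A",
             "chr2,50,30,3,ancient,B"),
           tmp)

# fread() detects the separator and column types automatically
dna <- fread(tmp)
str(dna)
```

For the real data the call would presumably be `fread("DNA_combined.csv")`. The metadata spreadsheet needs a separate Excel reader; one option, assuming the readxl package is installed, is `readxl::read_excel()` followed by `as.data.table()`.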
Also we have a spreadsheet of metadata given in Data_Info.xlsx. Each row is an individual and we have
the following information:
Filename: the filename of the compressed file that had the data. I used this to get the samples for you,
SampleID: the ID of each individual,
Sex: the gender of the individual,
Tissue: the area of the body that the DNA was extracted from,
Type: whether the DNA is modern or ancient, and
Age_kyr: the age of the individual in thousands of years.
Our goal is to find, for each tissue / type combination, the proportion of samples whose methylation
proportion is higher than the mean for that combination.
Perform the following steps:
a. Read in both datasets.
b. Rename the SampleID column to ID in the metadata data.table.
c. Find which sample IDs are repeated in the metadata.
d. Remove from the metadata any samples that are not Hairpin.
e. What is the total number of samples? What is the total number of modern samples and the total
number of ancient samples?
f. Calculate the proportion of methylation for each sample.
g. What is the total number of samples for each combination of tissue and type?
h. Calculate the mean proportion of methylation for each combination of tissue and type.
i. What proportion of samples have a methylation proportion greater than the mean proportion of
methylation for each tissue / type combination?
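The steps above can be sketched on toy data as follows. Everything in the example is made up: the values, the file names, and two interpretations that the question leaves open. It assumes that "Hairpin" refers to a substring of the Filename column, and that a sample's methylation proportion is `sum(X) / sum(N)` over its segments. Check both against the real data before relying on them.

```r
library(data.table)

# Toy stand-ins for the two datasets (all values invented)
dna <- data.table(
  chr  = c("chr1", "chr1", "chr2", "chr1", "chr2", "chr1"),
  pos  = c(100L, 200L, 50L, 100L, 50L, 100L),
  N    = c(10L, 20L, 10L, 10L, 30L, 20L),
  X    = c(5L, 10L, 2L, 1L, 3L, 15L),
  type = c("modern", "modern", "modern", "ancient", "ancient", "modern"),
  ID   = c("A", "A", "A", "B", "B", "C")
)
meta <- data.table(
  Filename = c("A_Hairpin.gz", "A_Other.gz", "B_Hairpin.gz", "C_Hairpin.gz"),
  SampleID = c("A", "A", "B", "C"),
  Tissue   = rep("Bone", 4),
  Type     = c("modern", "modern", "ancient", "modern")
)

# b. Rename by reference
setnames(meta, "SampleID", "ID")

# c. IDs that occur more than once
repeated_ids <- meta[duplicated(ID), unique(ID)]

# d. Keep Hairpin rows only (assumption: identified via Filename)
meta <- meta[grepl("Hairpin", Filename)]

# e. Sample counts, overall and by type
n_samples <- uniqueN(meta$ID)
meta[, .N, by = Type]

# f. Methylation proportion per sample (assumption: pooled over segments)
per_sample <- dna[, .(prop = sum(X) / sum(N)), by = ID]
per_sample <- merge(per_sample, meta, by = "ID")

# g-i. Counts, group means, and share of samples above their group mean
per_sample[, group_mean := mean(prop), by = .(Tissue, Type)]
summary_dt <- per_sample[, .(n_samples  = .N,
                             mean_prop  = mean(prop),
                             prop_above = mean(prop > group_mean)),
                         by = .(Tissue, Type)]
summary_dt
```

Grouping with `by =` and adding columns with `:=` avoids copying the table, which is the main reason data.table copes with data of this size.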
Q3. Webscraping
In this question, we are going to webscrape data from the internet movie database. As before there are marks
for webscraping and cleaning the dataset, but if you prefer not to do this, the cleaned dataset is provided.
a. Webscraping the data. The main package for webscraping is rvest:
https://rvest.tidyverse.org/
Also, the Chrome extension SelectorGadget is really useful for identifying the parts of the webpage that contain
the information:
https://selectorgadget.com/
https://rvest.tidyverse.org/articles/selectorgadget.html
I have written a template function to start you off based on the following tutorial:
https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
The function is
## Get libs ----
pacman::p_load(
  rvest, tidyverse, glue
)

#' get top 100
#'
#' Take a year and get information from imdb on the top 100 movies for that year.
#'
#' @param year year to get the data from
#'
#' @return data frame with the information
#'
#' @author Jono Tuke
#'
#' Wednesday 27 Mar 2019
get_top_100 <- function(year){
  # Create url for the given year split to make easier reading
  url <- glue("https://www.imdb.com/search/title?",
              "count=100&release_date={year},{year}",
              "&title_type=feature")

  # Read in the webpage
  html <- try(read_html(url))
  if('try-error' %in% class(html)){
    cat("Cannot load webpage", url, "\n")
    return(NA)
  }

  # Get title of movies
  titles <- html %>%
    html_nodes(".lister-item-header a") %>%
    html_text()

  # Ratings
  ratings <-
    html %>%
    html_nodes(".ratings-imdb-rating strong") %>%
    html_text()

  ## Put together
  info <- tibble(
    year = year,
    title = titles,
    rating = ratings
  )
  return(info)
}
At present, it gets only year, title and ratings for the top 100 movies for a given year.
Write a function that will get the following
## # A tibble: 6 x 11
## year title description runtimes genre rating vote director actors
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1980 The ~ "\n A f~ 146 min "\nD~ 8.4 768,~ Stanley~ Jack ~
## 2 1980 Star~ "\n Aft~ 124 min "\nA~ 8.8 1,03~ Irvin K~ Mark ~
## 3 1980 The ~ "\n Jak~ 133 min "\nA~ 7.9 165,~ John La~ John ~
## 4 1980 Flyi~ "\n A m~ 88 min "\nC~ 7.8 188,~ Jim Abr~ Rober~
## 5 1980 Flas~ "\n A f~ 111 min "\nA~ 6.5 45,2~ Mike Ho~ Sam J~
## 6 1980 The ~ "\n In ~ 104 min "\nA~ 5.7 57,7~ Randal ~ Brook~
## # ... with 2 more variables: metascore <chr>, gross <chr>
Then webscrape the data for 1980 to 2018 inclusively.
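One difficulty when extending the template is that some movies lack fields such as metascore or gross, so selecting each field across the whole page gives vectors of different lengths. A possible way round this, sketched below on an inline HTML fragment rather than the live site, is to select one node per movie first and then use `html_node()` (singular) within each, which yields `NA` where a field is missing. The `.lister-item` and `.metascore` selectors and the fragment itself are assumptions for illustration; check the real selectors with SelectorGadget.

```r
library(rvest)

# Made-up fragment mimicking two list entries; the second has no metascore
toy <- read_html('
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie A</a></h3>
  <div class="ratings-imdb-rating"><strong>8.4</strong></div>
  <span class="metascore">66</span>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie B</a></h3>
  <div class="ratings-imdb-rating"><strong>6.5</strong></div>
</div>')

# One node per movie, then one (possibly missing) field per movie
items <- html_nodes(toy, ".lister-item")
get_field <- function(css) {
  html_text(html_node(items, css), trim = TRUE)
}

info <- data.frame(
  title     = get_field(".lister-item-header a"),
  rating    = get_field(".ratings-imdb-rating strong"),
  metascore = get_field(".metascore"),
  stringsAsFactors = FALSE
)
info
```

Once the per-year function returns a data frame, the years 1980 to 2018 can be collected with something like `lapply(1980:2018, get_top_100)` followed by `dplyr::bind_rows()`, after dropping any years where the page failed to load.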
b. Cleaning the data. I will leave the decisions on the cleaning to you, but so that you know - I kept the
top 20 most prolific directors and top 20 most prolific actors - the rest became Other. Also I created a
boolean column for each genre.
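A rough sketch of the kind of cleaning described, in base R on made-up rows: parse the numeric strings, lump all but the most prolific directors into Other, and add one logical column per genre. The data values and the choice of `k = 1` (instead of 20) are for illustration only.

```r
# Toy rows in the same raw shape as the scraped output (values invented)
movies <- data.frame(
  title    = c("A", "B", "C", "D"),
  runtimes = c("146 min", "124 min", "88 min", "111 min"),
  vote     = c("768,123", "1,030,000", "188,000", "45,200"),
  genre    = c("\nDrama, Horror ", "\nAction, Adventure ",
               "\nComedy ", "\nAction, Sci-Fi "),
  director = c("X", "Y", "X", "Z"),
  stringsAsFactors = FALSE
)

# Strip units and thousands separators, then convert to numeric
movies$runtimes <- as.numeric(sub(" min", "", movies$runtimes, fixed = TRUE))
movies$vote     <- as.numeric(gsub(",", "", movies$vote, fixed = TRUE))

# Keep the k most prolific directors; everyone else becomes "Other"
k <- 1
top_dirs <- names(sort(table(movies$director), decreasing = TRUE))[seq_len(k)]
movies$director <- ifelse(movies$director %in% top_dirs,
                          movies$director, "Other")

# One logical column per genre
genre_list <- strsplit(trimws(movies$genre), ",\\s*")
for (g in sort(unique(unlist(genre_list)))) {
  movies[[g]] <- vapply(genre_list, function(gs) g %in% gs, logical(1))
}
movies
```

The same steps apply to actors, metascore and gross; whatever you choose, explain the decision, since half the marks for this part are for the explanation.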
c. Which movies are the most highly rated and the most lowly rated?
d. Which director has the highest mean rating?
e. Fit a lasso regression to predict rating with the following predictors:
year,
runtimes,
vote,
metascore,
gross, and
Animation [1].
What is the best model? What is the first coefficient to be shrunk to zero as λ increases, and what is
the last coefficient?
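A sketch of how such a fit might look, assuming the glmnet package (the question does not name a package) and simulated predictors `x1` to `x4` in place of the movie data. It fits the lasso path, picks $\lambda$ by cross-validation, and records for each coefficient the largest $\lambda$ at which it is still non-zero, which gives the order in which coefficients are shrunk to zero.

```r
library(glmnet)

# Simulated stand-in data: x1 has the strongest effect, x4 has none
set.seed(3)
n <- 200
x <- matrix(rnorm(n * 4), n, 4,
            dimnames = list(NULL, c("x1", "x2", "x3", "x4")))
y <- 3 * x[, "x1"] + 1 * x[, "x2"] + 0.2 * x[, "x3"] + rnorm(n)

# Lasso path (alpha = 1) and cross-validated choice of lambda
fit    <- glmnet(x, y, alpha = 1)
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")

# Largest lambda at which each coefficient is non-zero.
# Smallest value: first to be shrunk to zero as lambda increases.
# Largest value: last to be shrunk to zero.
entry_lambda <- apply(as.matrix(fit$beta) != 0, 1, function(nz) {
  if (any(nz)) max(fit$lambda[nz]) else NA_real_
})
sort(entry_lambda)
```

`plot(fit, xvar = "lambda")` shows the same information graphically. For the movie data the predictors would first need to be numeric, for example via `model.matrix()`.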
[1] Just because I am obsessed with animation.
Mark scheme
Part   Marks  Difficulty  Area         Type            Comments
Q1
1a     4      0.00        Lasso/ridge  proof           4 for derivation
1b     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
1c     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
Total  18
Q2
2a     1      0.00        data.table   analysis        1 for coding
2b     1      1.00        data.table   analysis        1 for coding
2c     2      0.50        data.table   analysis        2 for coding
2d     1      0.00        data.table   analysis        1 for coding
2e     2      0.00        data.table   analysis        2 for coding
2f     1      0.00        data.table   analysis        1 for coding
2g     4      0.50        data.table   analysis        4 for coding
2h     2      0.00        data.table   analysis        2 for coding
2i     5      0.60        data.table   analysis        2 for coding; 3 for overall presentation of code in this Q
Total  19
Q3
3a     7      0.29        Lasso/ridge  coding          5 for coding; 2 for quality of code
3b     10     0.20        Lasso/ridge  analysis        5 for code; 5 for explanation of code
3c     2      0.00        Lasso/ridge  analysis        2 for code
3d     2      0.00        Lasso/ridge  analysis        2 for code
3e     8      0.38        Lasso/ridge  interpretation  4 for coding; 4 for interpretation of results
Total  29
Assignment total  66
