R/qtl2 - high-dimensional data & multi-parent populationskbroman/presentations/rqtl2_uncc2017.pdf17...

Post on 13-Jan-2020

0 views 0 download

Transcript of R/qtl2 - high-dimensional data & multi-parent populationskbroman/presentations/rqtl2_uncc2017.pdf17...

R/qtl2high-dimensional data & multi-parent populations

Karl Broman

Biostatistics & Medical Informatics, UW–Madison

kbroman.orggithub.com/kbroman

@kwbromanSlides: bit.ly/uncc2017

New

University of Wisconsin–MadisonPhD in Biomedical Data Science

bit.ly/MadBDS

2

17 years of R/qtl

0

5000

10000

15000

20000

25000

30000

35000

40000

●●

●●

●● ●●

●●

●●●●

●●

●●●●

●●

●●●●● ●●●●● ● ●●●●●

●●● ●●● ● ●

Year

Line

s of

cod

e

●● ● ●

● ●●●●

●●●●

●● ●●●●●

●●

●●●●● ●●●●●

● ●●●●●●●●● ●●● ● ●

●●

● ● ● ● ●● ●●●● ●●

●●●●●●●

●●●●●●● ●●●●● ● ●●●●●●●●● ●●● ● ●

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

idea svn git

R

C

doc

3

4

daviddeen.com

5

Intercross

P1 P2

F1 F1

F2

6

Data

7

QTL mapping

0

2

4

6

8

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

BB BR RR

0.8

0.9

1.0

1.1

8

Interactive plot

bit.ly/lod_and_effect

9

17 years of R/qtl

0

5000

10000

15000

20000

25000

30000

35000

40000

●●

●●

●● ●●

●●

●●●●

●●

●●●●

●●

●●●●● ●●●●● ● ●●●●●

●●● ●●● ● ●

Year

Line

s of

cod

e

●● ● ●

● ●●●●

●●●●

●● ●●●●●

●●

●●●●● ●●●●●

● ●●●●●●●●● ●●● ● ●

●●

● ● ● ● ●● ●●●● ●●

●●●●●●●

●●●●●●● ●●●●● ● ●●●●●●●●● ●●● ● ●

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

idea svn git

R

C

doc

10

Good things

▶ some of the code▶ basics of the user interface▶ diagnostics and data visualization▶ quite comprehensive▶ quite flexible

11

Good things▶ some of the code▶ basics of the user interface▶ diagnostics and data visualization▶ quite comprehensive▶ quite flexible

11

Bad things

12

Input file

BBBBSBSBBB1m260.4547.1215

BBSB−−−1m395.2551.4814

BBSB−−−1m275.6360.9413

SBSBSBBBSB1m133.0538.6412

SBSB−−−1m184.1265.6311

SBSBBBBBBB1m154.8859.4710

SSSS−−−1m191.5459.269

SBSB−−−1m181.0565.318

SSSSBBSBSB1m126.1378.067

SBSBSBSBBB1m131.91586

BBBB−−−1m178.5888.335

SBSBSBSBBB1m153.1661.924

48.138.3110.451.427.33

221112

D2Mit75D2Mit379D1Mit17D1Mit80D1Mit18pgmsexspleenliver1

IHGFEDCBA

13

Stupidest code ever

n <- ncol(data)temp <- rep(FALSE ,n)for(i in 1:n) {

temp[i] <- all(data[2,1:i]=="")if(!temp[i]) break

}if(!any(temp)) stop("...")n.phe <- max((1:n)[temp])

kbroman.org/blog/2011/08/17/the-stupidest-r-code-ever

14

Open source meanseveryone can see my stupid mistakes

Version control meanseveryone can see every stupid mistake I’ve evermade

15

Open source meanseveryone can see my stupid mistakes

Version control meanseveryone can see every stupid mistake I’ve evermade

15

Documentation

16

Support

17

QTL mapping

0

2

4

6

8

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

BB BR RR

0.8

0.9

1.0

1.1

18

Congenic line1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y

19

Improving precision

▶ more recombinations▶ more individuals▶ more precise phenotype▶ lower-level phenotypes

– transcripts, proteins, metabolites

20

Advanced intercross linesP

A B

F2

F3

F4

F7

F10

21

Recombinant inbred linesP1 P2

F1 F1

F2

F3

F4

F∞

22

Collaborative CrossG0

A B C D E F G H

A B C D E F G H

G1

G2

ABCD EFGH

G3

G4

G∞

23

Heterogeneous stockG0

A B C D E F G H

G1

G2

G10

G15

G20

24

Genome-scale phenotypes

Alan Attie

25

Challenges: diagnostics

kbroman.org/blog/2012/04/25/microarrays-suck26

Challenges: diagnostics

bit.ly/many_boxplots

27

Challenges: diagnostics

▶ What might have gone wrong?▶ How might it be revealed?▶ Make lots of graphs▶ Follow up artifacts

28

Challenges: scale of results

genotypes phenotypes

29

Challenges: scale of results

genotypes phenotypes

results

29

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: organizing, automating

genotypes phenotypes

30

Challenges: metadata

What the heck is "FAD_NAD SI 8.3_3.3G"?

31

What was the question again?

32

ropenscilabs.github.io/miner_book33

R/qtl2

▶ High-density genotypes▶ High-dimensional phenotypes▶ Multi-parent populations▶ Linear mixed models

kbroman.org/qtl2

34

R/qtl2: Let’s not make the same mistakes

▶ C++ and Rcpp▶ Roxygen2 for documentation▶ Unit tests▶ A single “switch” for cross type

▶ Split into multiple packages▶ Yet another data input format▶ Flatter data structures, but still complex

35

R/qtl2: Let’s not make the same mistakes

▶ C++ and Rcpp▶ Roxygen2 for documentation▶ Unit tests▶ A single “switch” for cross type▶ Split into multiple packages▶ Yet another data input format▶ Flatter data structures, but still complex

35

Sustainable academic software

36

Acknowledgments

Danny ArendsGary ChurchillNick FurlotteDan GattiRitsert JansenPjotr PrinsŚaunak SenPetr SimecekArtem TarasovHao WuBrian Yandell

Robert CortyTimothée FlutreLars RonnegardRohan ShahLaura ShannonQuoc TranAaron Wolen

NIH/NIGMS

37

Slides: bit.ly/uncc2017

kbroman.org

kbroman.org/qtl2

github.com/kbroman

@kwbroman

38