Bioinformatics: A perspective Dr. Matthew L. Settles Genome Center University of California, Davis [email protected]

Bioinformatics: A perspective

Dr. Matthew L. Settles

Genome Center, University of California, Davis

[email protected]

Outline

• The World we are presented with
• Advances in DNA Sequencing
• Bioinformatics as Data Science
• Viewport into bioinformatics
• Training
• Suggestions
• An introduction to the Core

Sequencing Costs

[Figure: Cost per Megabase of Sequence; dollars on a log scale from $0.1 to $1000, by year, 2005 to 2015]

[Figure: Cost per Human Sized Genome @ 30x; dollars on a log scale from $1000 to $10000000, by year, 2005 to 2015]

• Includes: labor, administration, management, utilities, reagents, consumables, instruments (amortized over 3 years), informatics related to sequence production, submission, indirect costs.

• http://www.genome.gov/sequencingcosts/

$0.014/Mb; $1245 per Human sized (30x) genome
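The two headline numbers can be related with a rough calculation. The sketch below is a back-of-envelope check only: the genome size of 3,000 Mb is an assumption that does not appear on the slide, and the quoted per-genome figure presumably rests on somewhat different assumptions, so the two values are not expected to match exactly.

```python
# Back-of-envelope check of the per-genome figure from the per-megabase cost.
# Assumption (not from the slide): a human-sized genome is taken as ~3,000 Mb.

COST_PER_MB_USD = 0.014   # from the slide
GENOME_SIZE_MB = 3_000    # assumed genome size, in megabases
COVERAGE = 30             # 30x, from the slide


def genome_cost(cost_per_mb: float, genome_mb: float, coverage: float) -> float:
    """Cost of sequencing genome_mb megabases to the given fold coverage."""
    return cost_per_mb * genome_mb * coverage


estimate = genome_cost(COST_PER_MB_USD, GENOME_SIZE_MB, COVERAGE)
print(f"Estimated cost at {COVERAGE}x: ${estimate:,.0f}")
```

Under these assumptions the estimate comes out at roughly $1,260, in the same range as the $1245 quoted above.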

Growth in Public Sequence Database

• http://www.ncbi.nlm.nih.gov/genbank/statistics

WGS > 1 trillion bp

[Figure: Bases by year, 1990 to 2010, log scale from 10^5 to 10^13; series: GenBank, WGS]

[Figure: Sequences by year, 1990 to 2010, log scale from 10^2 to 10^8; series: GenBank, WGS]

Short Read Archive (SRA)

[Figure: Growth of the Sequence Read Archive (SRA) over time; year, 2000 to 2015, against a log scale from 10^11 to 10^15; series: Bases, Bytes, Open Access Bases, Open Access Bytes]

> 1 quadrillion bp

http://www.ncbi.nlm.nih.gov/Traces/sra/

Increase in Genome Sequencing Projects

• JGI – Genomes Online Database (GOLD)
• 67,822 genome sequencing projects

Lists > 3700 unique genera

Sequencing Platforms

• 1986 – Dye terminator Sanger sequencing; the technology dominated until 2005, when “next generation sequencers” arrived, peaking at about 900 kb/day

‘Next’ Generation

• 2005 – ‘Next Generation Sequencing’ as massively parallel sequencing, both throughput and speed advances. The first was the Genome Sequencer (GS) instrument developed by 454 Life Sciences (later acquired by Roche). Pyrosequencing, 1.5 Gb/day

Discontinued

Illumina

• 2006 – The second ‘Next Generation Sequencing’ platform was Solexa (later acquired by Illumina). Now the dominant platform, with 75% market share of sequencers, and an estimated >90% of all bases sequenced are from an Illumina machine. Sequencing by Synthesis, >200 Gb/day.

Complete Genomics

• 2006 – Using DNA nanoball sequencing, has been a leader in Human genome resequencing, having sequenced over 20,000 genomes to date. In 2013 purchased by BGI and is now set to release their first commercial sequencer, the Revolocity. Throughput on par with HiSeq

Human genome/exomes only.

10,000 Human Genomes per year

Benchtop Sequencers

• Roche 454 Junior

• Life Technologies
• Ion Torrent
• Ion Proton

• Illumina MiSeq

The ‘Next Next’ Generation

• 2009 – Single Molecule Real Time sequencing by Pacific Biosciences, the most successful of the third generation sequencing platforms, ~2 Gb/day; the new PacBio Sequel, ~14 Gb/day.

Oxford Nanopore

• 2015 – Another 3rd generation sequencer, founded in 2005 and currently in beta testing. The sequencer uses nanopore technology developed in the 90’s to sequence single molecules. Throughput is about 500 Mb per flowcell.

FYI: 4th generation sequencing is being described as In-situ sequencing

Fun to play with but results are highly variable

Flexibility

[Figure: read layout on a DNA sequence: Read 1 (50-300bp); Read 2 (50-300bp) with Read 2 primer; Barcode (8bp) with Barcode Read primer]

[Figure: depth of coverage, from 1X to 100000X, against target size, from Whole Genome down to 1KB, for Whole Genome, Reduction Techniques (RADseq), Capture Techniques, Fluidigm Access Array Amplicons, and Few or Single Amplicons; multiplexing ranges from Single Multiplexing to Greater Multiplexing]

Genomic reduction allows for greater coverage and multiplexing of samples.

You can fine tune your depth of coverage needs and sample size with the reduction technique.
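The trade-off between target size, depth of coverage and multiplexing can be sketched with the usual approximation, depth = (number of reads × read length) / target size. The run yield and target sizes below are hypothetical values chosen for illustration; they are not taken from the slides or from any particular instrument.

```python
def depth_of_coverage(n_reads: int, read_length_bp: int, target_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by the size of the target."""
    return n_reads * read_length_bp / target_size_bp


def samples_per_run(run_yield_bp: int, target_size_bp: int, desired_depth: int) -> int:
    """How many samples fit in one run if each needs desired_depth over the target."""
    bases_per_sample = target_size_bp * desired_depth
    return run_yield_bp // bases_per_sample


RUN_YIELD_BP = 15_000_000_000  # hypothetical yield of one run (15 Gb)

# Hypothetical targets: a whole genome, a reduced representation, one amplicon.
targets = {
    "whole genome (3 Gb) at 30X": (3_000_000_000, 30),
    "reduced representation (30 Mb) at 30X": (30_000_000, 30),
    "single 1KB amplicon at 1000X": (1_000, 1_000),
}

for name, (size_bp, depth) in targets.items():
    print(f"{name}: {samples_per_run(RUN_YIELD_BP, size_bp, depth)} samples per run")

# An 8bp barcode allows at most 4**8 distinct sequences, which caps multiplexing.
print("Upper bound on 8bp barcodes:", 4 ** 8)
```

In this sketch the whole genome does not fit in the run at 30X, the reduced representation allows 16 samples, and the amplicon count is limited by available barcodes rather than by yield.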

Sequencing Libraries

• DNA-seq
• RNA-seq
• Amplicons
• ChIP-seq
• MeDiP-seq
• RAD-seq
• ddRAD-seq
• Pool-seq
• EnD-seq
• DNase-seq
• ATAC-seq
• MNase-seq
• FAIRE-seq
• Ribose-seq
• smRNA-seq
• mRNA-seq
• Tn-seq
• QTL-seq
• tagRNA-seq
• PAT-seq
• Structure-seq
• MPE-seq
• STARR-seq
• Mod-seq
• BrAD-seq
• SLAF-seq
• G&T-seq

omicsmaps.com

The data deluge

• Plucking the biology from the Noise

Reality

• It’s much more difficult than we may first think

The real cost of sequencing

[Figure: share of total cost, 0% to 100%, at three time points: Pre-NGS (Approximately 2000), Now (Approximately 2010), Future (Approximately 2020); categories: Sample collection and experimental design; Sequencing; Data reduction, Data management; Downstream analyses]

Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Bioinformatics is Data Science

[Figure: diagram relating Bioinformatics to Biology, Computer Science, Math, Statistics, Biostatistics and Computational Biology]

‘The data scientist role has been described as “part analyst, part artist.”’
Anjul Bhambhri, vice president of big data products at IBM

Data Science

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

Five Fundamental Concepts of Data Science
statisticsviews.com, November 11, 2013, by Kirk Borne

7 Stages to Data Science

1. Define the question of interest
2. Get the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible

1. Define the question of interest

Begin with the end in mind!
• what is the question
• how are we to know we are successful
• what are our expectations

The question dictates
• the data that should be collected
• the features being analyzed
• which algorithms should be used

2. Get the data
3. Clean the data
4. Explore the data

Know your data!
• know what the source was
• technical processing in producing data (bias, artifacts, etc.)
• “Data Profiling”

Data are never perfect but love your data anyway!
• the collection of massive data sets often leads to unusual, surprising, unexpected and even outrageous findings.
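As one illustration of “Data Profiling”, the sketch below summarises a small FASTQ input before any analysis is attempted. It is a minimal example under stated assumptions: records are exactly four lines each (no line wrapping), qualities are Phred+33 encoded, and the function name `profile_fastq` and the two example reads are made up for this illustration.

```python
import io
from statistics import mean


def profile_fastq(handle, phred_offset=33):
    """Summarise read count, read lengths, base qualities and N content."""
    lengths = []
    read_mean_quals = []
    n_calls = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of input (a blank line is also treated as the end here)
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or not seq or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header.strip()!r}")
        lengths.append(len(seq))
        # Assumes Phred+33 encoding: quality = ASCII code minus the offset.
        read_mean_quals.append(mean(ord(c) - phred_offset for c in qual))
        n_calls += seq.upper().count("N")
    if not lengths:
        raise ValueError("no reads found")
    return {
        "reads": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "mean_read_quality": mean(read_mean_quals),
        "n_fraction": n_calls / sum(lengths),
    }


# Two made-up reads, for illustration only.
EXAMPLE_FASTQ = "@r1\nACGTN\n+\nIIIII\n@r2\nACG\n+\n!!I\n"

profile = profile_fastq(io.StringIO(EXAMPLE_FASTQ))
for key, value in profile.items():
    print(f"{key}: {value}")
```

Comparing a summary like this against what the library preparation and sequencing run should have produced is a quick way to test expectations about the source and its technical artifacts.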

5. Fit statistical models

Overfitting is a sin against data science!

Models should not be over-complicated

• If the data scientist has done their job correctly the statistical models don't need to be incredibly complicated to identify important relationships

• In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer.

6. Communicate the results
7. Make your analysis reproducible

Remember that this is ‘science’!
We are experimenting with data selections, processing, algorithms, ensembles of algorithms, measurements, models. At some point these must all be tested for validity and applicability to the problem you are trying to solve.
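One small, concrete habit that supports reproducibility is to save a record of what went into each analysis run. The sketch below is only one possible approach: the function `provenance_record`, its field names, and the example file name and parameters are all hypothetical, chosen for illustration.

```python
import hashlib
import json
import platform


def sha256_hex(data: bytes) -> str:
    """Checksum used to confirm that an input has not changed between runs."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(inputs: dict, parameters: dict, seed: int) -> dict:
    """Collect the details needed to repeat a run: inputs, settings, environment."""
    return {
        "python_version": platform.python_version(),
        "input_sha256": {name: sha256_hex(data) for name, data in sorted(inputs.items())},
        "parameters": parameters,
        "random_seed": seed,
    }


# Hypothetical input and settings, for illustration only.
record = provenance_record(
    inputs={"reads.fastq": b"@r1\nACGT\n+\nIIII\n"},
    parameters={"min_quality": 20, "trim_adapters": True},
    seed=42,
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such a record next to the results makes it possible to check later whether two analyses really used the same data and settings.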

Data science done well looks easy – and that’s a big problem for data scientists

simplystatistics.org, March 3, 2015, by Jeff Leek

Training: Data Science Bias

Data Science (data analysis, bioinformatics) is most often taught through an apprentice model

Different disciplines/regions develop their own subcultures, and decisions are based on cultural conventions rather than empirical evidence.
• Programming languages
• Statistical models (Bayes vs Frequentist)
• Multiple testing correction
• Application choice, etc.
These (and other) decisions matter a lot in data analysis
"I saw it in a widely-cited paper in journal XX from my field"
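To see why one of these conventions matters, the sketch below applies two common multiple testing corrections, Bonferroni and Benjamini-Hochberg, to the same p-values. The p-values are made up for illustration, and the plain-Python implementations are a minimal sketch rather than a substitute for an established statistics package.

```python
def bonferroni(pvalues):
    """Family-wise error control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """False discovery rate control: step-up adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    return sum(p < alpha for p in pvalues)


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

n_raw = count_significant(pvals)
n_bonferroni = count_significant(bonferroni(pvals))
n_bh = count_significant(benjamini_hochberg(pvals))
print("unadjusted:", n_raw)
print("Bonferroni:", n_bonferroni)
print("Benjamini-Hochberg:", n_bh)
```

With these p-values and a 0.05 threshold, 5 tests pass unadjusted, 1 passes after Bonferroni and 2 pass after Benjamini-Hochberg, so the choice of convention changes the reported result.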

The Data Science in Bioinformatics

Bioinformatics is not something you are taught, it’s a way of life

Mick Watson – Roslin Institute

“The best bioinformaticians I know are problem solvers – they start the day not knowing something, and they enjoy finding out (themselves) how to do it. It’s a great skill to have, but for most, it’s not even a skill – it’s a passion, it’s a way of life, it’s a thrill. It’s what these people would do at the weekend (if their families let them).”

Models

• Workshops
• Often enrolled too late

• Collaborations
• More experienced persons

• Apprenticeships
• Previous lab personnel to young personnel

• Formal Education
• Most programs are Post-doc or graduate level
• Few Undergraduate

Substrate

Cluster Computing

Cloud Computing

BASTM Laptop & Desktop

LINUX

Environment

“Command Line” and “Programming Languages”

VS

Bioinformatics Software Suite

Bioinformatics

• Know and Understand the experiment
• “The Question of Interest”

• Build a set of assumptions/expectations
• Mix of technical and biological
• Spend your time testing your assumptions/expectations

• Don’t spend your time finding the “best” software
• Don’t under-estimate the time Bioinformatics may take

• Be prepared to accept ‘failed’ experiments

Bottom Line

The Bottom Line: Spend the time (and money) planning and producing good quality, accurate and sufficient data for your experiment.

Get to know your data, develop and test expectations.

As a result, you’ll spend much less time (and less money) extracting biological significance and results during analysis.

The mission of the Bioinformatics Core facility is to facilitate outstanding omics-scale research through these activities:

Data Analysis
The Bioinformatics Core promotes experimental design, advanced computation and informatics analysis of omics scale datasets that drives research forward.

Research Computing
Maintain and make available high-performance computing hardware and software necessary for today’s data-intensive bioinformatic analyses.

Training
The Core helps to educate the next generation of bioinformaticians through highly acclaimed training workshops, seminars and through direct participation in research activities.

UC Davis Bioinformatics Core in the Genome Center

-omics is “Collaborative Research”

• Today’s experiments are complex and getting more complex

• No one person, or even one group, typically has the needed capabilities in all areas

• M

Biologist

Molecular Expertise

Bioinformatics

Prerequisites

• Access to a multi-core (24 cpu or greater), ‘high’ memory (64 Gb or greater) Linux server.

• Familiarity with the ‘command line’ and at least one programming language.

• Basic knowledge of how to install software
• Basic knowledge of R (or equivalent) and statistical programming

• Basic knowledge of Statistics and model building
