Post on 01-Oct-2020

Bioinformatics: A perspective

Dr. Matthew L. Settles

Genome Center, University of California, Davis

settles@ucdavis.edu

Outline

• The World we are presented with
• Advances in DNA Sequencing
• Bioinformatics as Data Science
• Viewport into bioinformatics
• Training
• Suggestions
• An introduction to the Core

[Figure: Cost per Megabase of Sequence. x-axis: Year (2005 to 2015); y-axis: Dollars (log scale, $0.1 to $1000)]

[Figure: Cost per Human Sized Genome @ 30x. x-axis: Year (2005 to 2015); y-axis: log scale, $1000 to $10000000]

Sequencing Costs

• Includes: labor, administration, management, utilities, reagents, consumables, instruments (amortized over 3 years), informatics related to sequence production, submission, indirect costs.

• http://www.genome.gov/sequencingcosts/

$0.014/Mb; $1245 per Human sized (30x) genome

Growth in Public Sequence Database

• http://www.ncbi.nlm.nih.gov/genbank/statistics

WGS > 1 trillion bp

[Figure: Bases by Year (1990 to 2010), log scale 10^5 to 10^13; series: GenBank, WGS]

[Figure: Sequences by Year (1990 to 2010), log scale 10^2 to 10^8; series: GenBank, WGS]

Short Read Archive (SRA)

[Figure: Growth of the Sequence Read Archive (SRA) over time. x-axis: Year (2000 to 2015); y-axis: log scale, 10^11 to 10^15; series: Bases, Bytes, Open Access Bases, Open Access Bytes]

> 1 quadrillion bp

http://www.ncbi.nlm.nih.gov/Traces/sra/

Increase in Genome Sequencing Projects

• JGI – Genomes Online Database (GOLD)
• 67,822 genome sequencing projects

Lists > 3700 unique genera

Sequencing Platforms

• 1986 – Dye-terminator Sanger sequencing. The technology dominated until 2005, when “next generation sequencers” arrived, peaking at about 900 kb/day.

‘Next’ Generation

• 2005 – ‘Next Generation Sequencing’ as massively parallel sequencing, with advances in both throughput and speed. The first was the Genome Sequencer (GS) instrument developed by 454 Life Sciences (later acquired by Roche). Pyrosequencing, 1.5 Gb/day.

Discontinued

Illumina

• 2006 – The second ‘Next Generation Sequencing’ platform was Solexa (later acquired by Illumina). Now the dominant platform, with 75% market share of sequencers, and an estimated >90% of all bases sequenced are from an Illumina machine. Sequencing by Synthesis, >200 Gb/day.

Complete Genomics

• 2006 – Using DNA nanoball sequencing, Complete Genomics has been a leader in Human genome resequencing, having sequenced over 20,000 genomes to date. It was purchased by BGI in 2013 and is now set to release its first commercial sequencer, the Revolocity. Throughput on par with HiSeq.

Human genome/exomes only.

10,000 Human Genomes per year

Benchtop Sequencers

• Roche 454 Junior

• Life Technologies
  • Ion Torrent
  • Ion Proton

• Illumina MiSeq

The ‘Next Next’ Generation

• 2009 – Single Molecule Real Time sequencing by Pacific Biosciences, the most successful third generation sequencing platform, ~2 Gb/day. New PacBio Sequel ~14 Gb/day.

Oxford Nanopore

• 2015 – Another 3rd generation sequencer. The company was founded in 2005 and the sequencer is currently in beta testing. It uses nanopore technology developed in the 90’s to sequence single molecules. Throughput is about 500 Mb per flowcell.

FYI: 4th generation sequencing is being described as In-situ sequencing

Fun to play with but results are highly variable
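To put the quoted throughputs side by side, here is a rough back-of-the-envelope sketch of how many instrument-days one 30x human-sized genome would take at each rate. It assumes a genome of about 3 Gb and treats the quoted figures as sustained output of a single instrument, which is a simplification. Oxford Nanopore is left out because its figure is quoted per flowcell, not per day. The function and variable names are illustrative only.

```python
# Rough comparison of platform throughput, using the per-day figures quoted above.
# Assumption: a human-sized genome is ~3 Gb, so 30x coverage needs ~90 Gb of sequence.
HUMAN_GENOME_BP = 3_000_000_000
DEPTH = 30

# Throughput in bases per day, as quoted in the slides (Illumina is a lower bound).
THROUGHPUT_BP_PER_DAY = {
    "Sanger (dye terminator)": 900_000,
    "454 GS (pyrosequencing)": 1_500_000_000,
    "Illumina (sequencing by synthesis)": 200_000_000_000,
    "PacBio": 2_000_000_000,
    "PacBio Sequel": 14_000_000_000,
}


def days_for_coverage(bp_per_day, genome_bp=HUMAN_GENOME_BP, depth=DEPTH):
    """Days of sequencing needed to reach `depth` coverage of `genome_bp` bases."""
    return genome_bp * depth / bp_per_day


for platform_name, rate in THROUGHPUT_BP_PER_DAY.items():
    print(f"{platform_name}: {days_for_coverage(rate):,.2f} days")
```

The numbers ignore library preparation, run setup, and read quality, so they only indicate orders of magnitude.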

Flexibility

[Figure: library layout on the DNA Sequence. Read 1 (50-300bp); Read 2 (50-300bp) with Read 2 primer; Barcode (8bp) with Barcode Read primer]

[Figure: Depth of Coverage, from 1X to 100000X, across target sizes from Whole Genome to 1KB. Labels: Whole Genome; Reduction Techniques; RADseq; Capture Techniques; Fluidigm Access Array Amplicons; Few or Single Amplicons; Single Multiplexing to Greater Multiplexing]

Genomic reduction allows for greater coverage and multiplexing of samples.

You can fine-tune your depth of coverage needs and sample size with the reduction technique.
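The trade-off between target size, depth of coverage, and multiplexing can be sketched with simple arithmetic. The snippet below is a minimal illustration, assuming that depth is (number of reads × read length) / target size and that reads are spread evenly across samples. The run yield of 15 Gb and the function names are hypothetical, chosen only to make the example concrete.

```python
def depth_of_coverage(n_reads, read_length_bp, target_size_bp):
    """Average depth: total sequenced bases divided by the size of the target."""
    return n_reads * read_length_bp / target_size_bp


def samples_per_run(run_yield_bp, target_size_bp, desired_depth):
    """How many samples fit in one run if each needs `desired_depth` over the target."""
    return run_yield_bp // (target_size_bp * desired_depth)


# Hypothetical run yield of 15 Gb, used for every scenario below.
RUN_YIELD_BP = 15_000_000_000

scenarios = {
    "Whole genome (3 Gb target, 30X)": (3_000_000_000, 30),
    "Capture (50 Mb target, 100X)": (50_000_000, 100),
    "Single amplicon (1 kb target, 100000X)": (1_000, 100_000),
}

for label, (target_bp, depth) in scenarios.items():
    print(label, "->", samples_per_run(RUN_YIELD_BP, target_bp, depth), "samples per run")
```

In practice the number of distinct barcodes and uneven read distribution also limit multiplexing, which this sketch does not model.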

Sequencing Libraries

• DNA-seq
• RNA-seq
• Amplicons
• ChIP-seq
• MeDiP-seq
• RAD-seq
• ddRAD-seq
• Pool-seq
• EnD-seq
• DNase-seq
• ATAC-seq
• MNase-seq
• FAIRE-seq
• Ribose-seq
• smRNA-seq
• mRNA-seq
• Tn-seq
• QTL-seq
• tagRNA-seq
• PAT-seq
• Structure-seq
• MPE-seq
• STARR-seq
• Mod-seq
• BrAD-seq
• SLAF-seq
• G&T-seq

omicsmaps.com

The data deluge

• Plucking the biology from the Noise

Reality

• It’s much more difficult than we may first think

The real cost of sequencing

[Figure: share of total cost (0% to 100%) at three time points: Pre-NGS (approximately 2000), Now (approximately 2010), Future (approximately 2020). Categories: Sample collection and experimental design; Sequencing; Data reduction; Data management; Downstream analyses]

Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

[Figure: diagram with labels Bioinformatics, Biology, Computer Science, Math/Statistics, Biostatistics, Computational Biology]

‘The data scientist role has been described as “part analyst, part artist.”’
Anjul Bhambhri, vice president of big data products at IBM

Bioinformatics is Data Science

Data Science

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

Five Fundamental Concepts of Data Science, statisticsviews.com, November 11, 2013, by Kirk Borne

7 Stages to Data Science

1. Define the question of interest
2. Get the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible

1. Define the question of interest

Begin with the end in mind!
• what is the question
• how are we to know we are successful
• what are our expectations

The question dictates
• the data that should be collected
• the features being analyzed
• which algorithms should be used

2. Get the data
3. Clean the data
4. Explore the data

Know your data!
• know what the source was
• technical processing in producing data (bias, artifacts, etc.)
• “Data Profiling”

Data are never perfect but love your data anyway!
• the collection of massive data sets often leads to unusual, surprising, unexpected and even outrageous findings.
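As one possible illustration of “Data Profiling” for sequencing reads, the sketch below summarizes a FASTQ file: read count, read-length distribution, GC and N fractions, and mean base quality. It assumes plain four-line FASTQ records with Phred+33 qualities; the function name `profile_fastq` and the toy records are made up for the example and are not from any particular tool.

```python
import io
from collections import Counter


def profile_fastq(handle, phred_offset=33):
    """Summarize a four-line-per-record FASTQ stream (assumed Phred+33 qualities)."""
    lengths = Counter()
    n_reads = gc = n_bases = total_bases = qual_sum = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip().upper()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
        n_reads += 1
        lengths[len(seq)] += 1
        total_bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        n_bases += seq.count("N")
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return {
        "reads": n_reads,
        "length_counts": dict(lengths),
        "gc_fraction": gc / total_bases,
        "n_fraction": n_bases / total_bases,
        "mean_quality": qual_sum / total_bases,
    }


# Toy input: two made-up reads, one high quality and one low quality with Ns.
toy_fastq = "@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nGGCCNN\n+\n!!!!!!\n"
summary = profile_fastq(io.StringIO(toy_fastq))
for key, value in summary.items():
    print(key, value)
```

Comparing such a summary against what the library type should produce (expected read length, GC content of the organism, proportion of Ns) is one simple way to test expectations before any downstream analysis.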

5. Fit statistical models

Overfitting is a sin against data science!

Models should not be over-complicated

• If the data scientist has done their job correctly the statistical models don't need to be incredibly complicated to identify important relationships

• In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer.
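A small simulation can make the overfitting point concrete. In this sketch the data are simulated from a straight line plus noise (an assumption made purely for illustration), and a simple least-squares line is compared with a deliberately over-flexible model that memorizes the training data (1-nearest-neighbour). All names and parameter values are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Over-flexible model: predict the y of the closest training x (1-nearest-neighbour)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Simulated data: a linear signal plus Gaussian noise (illustrative assumption)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = simulate(300, rng)
test_x, test_y = simulate(300, rng)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in (("line", line), ("memorizer", memorizer)):
    print(name, "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer reproduces the training data perfectly, yet it should do worse than the simple line on held-out data, because it has fitted the noise as well as the signal.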

6. Communicate the results
7. Make your analysis reproducible

Remember that this is ‘science’!
We are experimenting with data selections, processing, algorithms, ensembles of algorithms, measurements, models. At some point these must all be tested for validity and applicability to the problem you are trying to solve.
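One modest step toward a reproducible analysis is to record what went into it. The sketch below writes a small manifest with input-file checksums, the parameters and random seed used, and the Python version. It is only one possible approach; the manifest layout, function names, and parameter names are assumptions for illustration.

```python
import hashlib
import json
import platform
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large sequencing files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, parameters, seed):
    """Collect the facts needed to rerun the analysis on identical inputs."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "parameters": parameters,
        "inputs": {path: sha256_of_file(path) for path in sorted(input_paths)},
    }


def write_manifest(manifest, out_path):
    """Save the manifest as sorted, indented JSON so that diffs stay readable."""
    with open(out_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A call such as `write_manifest(build_manifest(["reads.fastq"], {"min_quality": 20}, seed=1), "manifest.json")` (file and parameter names hypothetical) would then sit next to the results, so that the analysis can be rerun and checked later.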

Data science done well looks easy – and that’s a big problem for data scientists

simplystatistics.org, March 3, 2015, by Jeff Leek

Training: Data Science Bias

Data Science (data analysis, bioinformatics) is most often taught through an apprentice model

Different disciplines/regions develop their own subcultures, and decisions are based on cultural conventions rather than empirical evidence.
• Programming languages
• Statistical models (Bayes vs Frequentist)
• Multiple testing correction
• Application choice, etc.

These (and other) decisions matter a lot in data analysis.

"I saw it in a widely-cited paper in journal XX from my field"
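To show how much one of these conventions can matter, the sketch below applies two common multiple testing corrections, Bonferroni and Benjamini-Hochberg, to the same set of p-values. The p-values are made up for illustration, and the function names are hypothetical; the two procedures themselves are standard.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values from eight hypothetical tests.
pvalues = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]

print("Bonferroni rejections:", sum(bonferroni(pvalues)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvalues)))
```

On these values the two methods disagree about how many tests are significant, which is why the choice should be made deliberately and not inherited from convention.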

The Data Science in Bioinformatics

Bioinformatics is not something you are taught, it’s a way of life

Mick Watson – Roslin Institute

“The best bioinformaticians I know are problem solvers – they start the day not knowing something, and they enjoy finding out (themselves) how to do it. It’s a great skill to have, but for most, it’s not even a skill – it’s a passion, it’s a way of life, it’s a thrill. It’s what these people would do at the weekend (if their families let them).”

Models

• Workshops
  • Often enrolled too late

• Collaborations
  • More experienced persons

• Apprenticeships
  • Previous lab personnel to young personnel

• Formal Education
  • Most programs are Post-doc or graduate level
  • Few Undergraduate

Substrate

Cluster Computing

Cloud Computing

BASTM Laptop & Desktop

LINUX

Environment

“Command Line” and “Programming Languages”

VS

Bioinformatics Software Suite

Bioinformatics

• Know and Understand the experiment
  • “The Question of Interest”

• Build a set of assumptions/expectations
  • Mix of technical and biological
• Spend your time testing your assumptions/expectations

• Don’t spend your time finding the “best” software
• Don’t under-estimate the time Bioinformatics may take

• Be prepared to accept ‘failed’ experiments

Bottom Line

The Bottom Line: Spend the time (and money) planning and producing good quality, accurate and sufficient data for your experiment.

Get to know your data; develop and test expectations.

As a result, you’ll spend much less time (and less money) extracting biological significance and results during analysis.

The mission of the Bioinformatics Core facility is to facilitate outstanding omics-scale research through these activities:

• Data Analysis: The Bioinformatics Core promotes experimental design, advanced computation and informatics analysis of omics-scale datasets that drives research forward.

• Research Computing: Maintain and make available high-performance computing hardware and software necessary for today’s data-intensive bioinformatic analyses.

• Training: The Core helps to educate the next generation of bioinformaticians through highly acclaimed training workshops, seminars and through direct participation in research activities.

UC Davis Bioinformatics Core in the Genome Center

-omics is “Collaborative Research”

• Today’s experiments are complex and getting more complex

• No one person, or even one group, typically has the needed capabilities in all areas

• M

Biologist

Molecular Expertise

Bioinformatics

Prerequisites

• Access to a multi-core (24 cpu or greater), ‘high’ memory (64 GB or greater) Linux server.

• Familiarity with the ‘command line’ and at least one programming language.

• Basic knowledge of how to install software
• Basic knowledge of R (or equivalent) and statistical programming

• Basic knowledge of Statistics and model building
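For the first prerequisite, a quick check of the machine you are logged into might look like the sketch below. It assumes a Linux host (memory is read through `os.sysconf`, which is not available everywhere), and the thresholds simply restate the 24 CPU and 64 GB figures from the list. The function names are illustrative.

```python
import os

MIN_CPUS = 24        # from the prerequisites list
MIN_MEMORY_GB = 64   # from the prerequisites list


def meets_prerequisites(cpu_count, memory_gb,
                        min_cpus=MIN_CPUS, min_memory_gb=MIN_MEMORY_GB):
    """True when the machine has at least the listed CPUs and memory."""
    return cpu_count >= min_cpus and memory_gb >= min_memory_gb


def detect_resources():
    """Return (cpu_count, memory_gb) for this machine; memory is 0.0 if unknown."""
    cpus = os.cpu_count() or 0
    try:
        total_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        total_bytes = 0  # not a platform that exposes these sysconf names
    return cpus, total_bytes / 1024 ** 3


cpus, memory_gb = detect_resources()
print(f"CPUs: {cpus}, memory: {memory_gb:.1f} GB")
print("Meets prerequisites:", meets_prerequisites(cpus, memory_gb))
```

Note that a server sold as 64 GB usually reports slightly less usable memory, so treat a near miss on the memory threshold with some judgment.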