Bioinformatics: A perspective
Dr. Matthew L. Settles
Genome Center, University of California, Davis
Outline
• The World we are presented with
• Advances in DNA Sequencing
• Bioinformatics as Data Science
• Viewport into bioinformatics
• Training
• Suggestions
• An introduction to the Core
[Figure: Cost per Megabase of Sequence. Dollars (log scale, $0.1 to $1000) by year, 2005 to 2015]
[Figure: Cost per Human Sized Genome @ 30x. Dollars (log scale, $1000 to $10000000) by year, 2005 to 2015]
Sequencing Costs
• Includes: labor, administration, management, utilities, reagents, consumables, instruments (amortized over 3 years), informatics related to sequence production, submission, indirect costs.
• http://www.genome.gov/sequencingcosts/
$0.014/Mb; $1,245 per human-sized (30x) genome
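As a rough sanity check, the two headline numbers on this slide can be related with simple arithmetic: cost per megabase times genome size times depth of coverage. The sketch below is only a back-of-envelope estimate. The 3,000 Mb genome size is an assumption introduced here, and this is not necessarily how the published per-genome figure was derived.

```python
# Back-of-envelope check relating cost per megabase to cost per genome.
COST_PER_MB_USD = 0.014   # from the slide
COVERAGE = 30             # from the slide (30x)
GENOME_SIZE_MB = 3_000    # ASSUMPTION: a round ~3 Gb human-sized genome


def genome_cost(cost_per_mb, genome_size_mb, coverage):
    """Cost of sequencing one genome to the given depth of coverage."""
    return cost_per_mb * genome_size_mb * coverage


estimate = genome_cost(COST_PER_MB_USD, GENOME_SIZE_MB, COVERAGE)
print(f"Estimated cost per genome: ${estimate:,.0f}")
```

Under these assumptions the estimate comes to about $1,260, the same order as the $1,245 quoted on the slide; the gap is plausibly rounding of the per-megabase figure and the assumed genome size.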
Growth in Public Sequence Database
• http://www.ncbi.nlm.nih.gov/genbank/statistics
WGS > 1 trillion bp
[Figure: GenBank and WGS growth by year, 1990 to 2010. Two panels on log scales: bases and sequences]
Short Read Archive (SRA)
[Figure: Growth of the Sequence Read Archive (SRA) over time, 2000 to 2015, log scale. Series: bases, bytes, open access bases, open access bytes]
> 1 quadrillion bp
http://www.ncbi.nlm.nih.gov/Traces/sra/
Increase in Genome Sequencing Projects
• JGI – Genomes Online Database (GOLD)
• 67,822 genome sequencing projects
Lists > 3,700 unique genera
Sequencing Platforms
• 1986 – Dye-terminator Sanger sequencing. The technology dominated until 2005, when “next generation sequencers” arrived, peaking at about 900 kb/day.
‘Next’ Generation
• 2005 – ‘Next Generation Sequencing’ as massively parallel sequencing, with advances in both throughput and speed. The first was the Genome Sequencer (GS) instrument developed by 454 Life Sciences (later acquired by Roche). Pyrosequencing, 1.5 Gb/day.
Discontinued
Illumina
• 2006 – The second ‘Next Generation Sequencing’ platform was Solexa (later acquired by Illumina). Now the dominant platform, with 75% market share of sequencers and an estimated >90% of all bases sequenced coming from an Illumina machine. Sequencing by Synthesis, >200 Gb/day.
Complete Genomics
• 2006 – Using DNA nanoball sequencing, has been a leader in human genome resequencing, having sequenced over 20,000 genomes to date. In 2013 it was purchased by BGI and is now set to release its first commercial sequencer, the Revolocity. Throughput on par with HiSeq.
Human genome/exomes only.
10,000 Human Genomes per year
Benchtop Sequencers
• Roche 454 Junior
• Life Technologies
  • Ion Torrent
  • Ion Proton
• Illumina MiSeq
The ‘Next Next’ Generation
• 2009 – Single Molecule Real Time sequencing by Pacific Biosciences, the most successful third generation sequencing platform, ~2 Gb/day. New PacBio Sequel, ~14 Gb/day.
Oxford Nanopore
• 2015 – Another 3rd generation sequencer. The company was founded in 2005 and the sequencer is currently in beta testing. It uses nanopore technology developed in the 90’s to sequence single molecules. Throughput is about 500 Mb per flow cell.
FYI: 4th generation sequencing is being described as In-situ sequencing
Fun to play with but results are highly variable
Flexibility
[Figure: Read layout on the DNA sequence. Read 1 (50-300 bp); Read 2 (50-300 bp) with Read 2 primer; Barcode (8 bp) with Barcode Read primer]
[Figure: Depth of coverage, 1X to 100000X. Labels: Whole Genome; 1 KB; Reduction Techniques (RADseq); Capture Techniques; Fluidigm Access Array Amplicons; Few or Single Amplicons; Greater Multiplexing; Single Multiplexing]
Genomic reduction allows for greater coverage and multiplexing of samples.
You can fine-tune your depth of coverage needs and sample size with the reduction technique.
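The trade-off on this slide between target size, depth of coverage, and multiplexing comes down to simple arithmetic: sequenced bases per sample divided by the size of the target. The sketch below illustrates it. The run size (400 million reads of 150 bp), the target sizes, and the sample counts are hypothetical values chosen for illustration, not figures from the slides.

```python
def depth_of_coverage(total_reads, read_length_bp, target_size_bp, n_samples=1):
    """Average per-sample depth: sequenced bases per sample / target size."""
    bases_per_sample = total_reads * read_length_bp / n_samples
    return bases_per_sample / target_size_bp


def max_samples(total_reads, read_length_bp, target_size_bp, min_depth):
    """Largest number of samples that can share a run at >= min_depth each."""
    return int(total_reads * read_length_bp // (target_size_bp * min_depth))


# HYPOTHETICAL run: 400 million reads of 150 bp each.
RUN_READS = 400_000_000
READ_LEN = 150

# One sample, whole 3 Gb genome.
whole_genome = depth_of_coverage(RUN_READS, READ_LEN, 3_000_000_000)

# 96 multiplexed samples, target reduced to 1% of the genome (30 Mb).
reduced = depth_of_coverage(RUN_READS, READ_LEN, 30_000_000, n_samples=96)

# How many samples fit on the run at 20x over the reduced target?
fit = max_samples(RUN_READS, READ_LEN, 30_000_000, min_depth=20)

print(f"whole genome, 1 sample:  {whole_genome:.1f}x")
print(f"1% target, 96 samples:   {reduced:.1f}x")
print(f"samples at 20x (1% target): {fit}")
```

With these made-up numbers, a single whole genome and 96 samples over a reduced target reach a similar depth, which is the point of genomic reduction. Real runs lose reads to duplicates, off-target capture, and uneven barcode representation, so treat the result as an upper bound.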
Sequencing Libraries
• DNA-seq
• RNA-seq
• Amplicons
• ChIP-seq
• MeDiP-seq
• RAD-seq
• ddRAD-seq
• Pool-seq
• EnD-seq
DNase-seq, ATAC-seq, MNase-seq, FAIRE-seq, Ribose-seq, smRNA-seq, mRNA-seq, Tn-seq, QTL-seq
tagRNA-seq, PAT-seq, Structure-seq, MPE-seq, STARR-seq, Mod-seq, BrAD-seq, SLAF-seq, G&T-seq
omicsmaps.com
The data deluge
• Plucking the biology from the noise
Reality
• It’s much more difficult than we may first think
The real cost of sequencing
[Figure: Share of cost (0% to 100%) by stage at three time points: Pre-NGS (approximately 2000), Now (approximately 2010), Future (approximately 2020). Stages: sample collection and experimental design; sequencing; data reduction; data management; downstream analyses]
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125
Bioinformatics
[Figure: Overlapping fields. Biology; Computer Science; Math/Statistics; Biostatistics; Computational Biology]
‘The data scientist role has been described as “part analyst, part artist.”’
Anjul Bhambhri, vice president of big data products at IBM
Bioinformatics is Data Science
Data Science
Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.
Five Fundamental Concepts of Data Science, statisticsviews.com, November 11, 2013, by Kirk Borne
7 Stages to Data Science
1. Define the question of interest
2. Get the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible
1. Define the question of interest
Begin with the end in mind!
• What is the question?
• How are we to know we are successful?
• What are our expectations?
The question dictates:
• the data that should be collected
• the features being analyzed
• which algorithms should be used
2. Get the data
3. Clean the data
4. Explore the data
Know your data!
• Know what the source was
• Know the technical processing involved in producing the data (bias, artifacts, etc.)
• “Data Profiling”
Data are never perfect, but love your data anyway!
• The collection of massive data sets often leads to the unusual, surprising, unexpected, and even outrageous.
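As one concrete reading of “data profiling” for sequencing data, the sketch below summarizes a FASTQ file before any analysis: how many reads, how long, what average quality, and how many ambiguous bases. It assumes plain four-line FASTQ records and Phred+33 quality encoding. The function name `profile_fastq` and the two example reads are made up for illustration and do not come from the slides.

```python
import io
from collections import Counter


def profile_fastq(handle, phred_offset=33):
    """Summarize read count, read lengths, mean quality, and N content.

    ASSUMPTION: four lines per record (header, sequence, '+', qualities).
    """
    n_reads = 0
    lengths = Counter()
    qual_sum = 0
    base_count = 0
    n_count = 0
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near read {n_reads + 1}")
        n_reads += 1
        lengths[len(seq)] += 1
        base_count += len(seq)
        n_count += seq.upper().count("N")
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    return {
        "reads": n_reads,
        "length_counts": dict(lengths),
        "mean_quality": qual_sum / base_count if base_count else 0.0,
        "n_fraction": n_count / base_count if base_count else 0.0,
    }


# Made-up example data: two short reads, one with low-quality N calls.
EXAMPLE = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGNNCGTAC
+
IIII##IIII
"""

profile = profile_fastq(io.StringIO(EXAMPLE))
for key, value in profile.items():
    print(key, value)
```

A profile like this is cheap to compute and gives you expectations to test: unexpected read lengths, a drop in mean quality, or a high fraction of N bases are all worth understanding before fitting any model.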
5. Fit statistical models
Overfitting is a sin against data science!
Models should not be over-complicated
• If the data scientist has done their job correctly, the statistical models don't need to be incredibly complicated to identify important relationships.
• In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer.
6. Communicate the results
7. Make your analysis reproducible
Remember that this is ‘science’!
We are experimenting with data selections, processing, algorithms, ensembles of algorithms, measurements, models. At some point these must all be tested for validity and applicability to the problem you are trying to solve.
Data science done well looks easy – and that’s a big problem for data scientists
simplystatistics.org, March 3, 2015, by Jeff Leek
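One small, practical step toward stage 7 is to record what was run alongside the result: the software version, the random seed, the parameters, and a checksum of the input. The sketch below shows the idea using only the Python standard library. The function names, the toy “analysis” (a seeded random subsample), and the seed value are all hypothetical placeholders, not a prescribed workflow.

```python
import hashlib
import json
import platform
import random
import sys


def run_analysis(values, seed, k=3):
    """Toy stand-in for an analysis step with a random component."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the result
    return sorted(rng.sample(values, k=k))


def provenance(input_bytes, seed, parameters):
    """Collect the facts someone would need to rerun this analysis."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "parameters": parameters,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


values = list(range(100))                        # made-up input data
input_bytes = json.dumps(values).encode("utf-8")
SEED = 20150303                                  # arbitrary, but written down

first = run_analysis(values, SEED)
second = run_analysis(values, SEED)
record = provenance(input_bytes, SEED, {"k": 3})

print("same result on rerun:", first == second)
print(json.dumps(record, indent=2))
```

Saving a record like this next to every output file makes it possible to test later whether a result still holds when the data, parameters, or software change.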
Training: Data Science Bias
Data Science (data analysis, bioinformatics) is most often taught through an apprentice model.
Different disciplines/regions develop their own subcultures, and decisions are based on cultural conventions rather than empirical evidence.
• Programming languages
• Statistical models (Bayes vs Frequentist)
• Multiple testing correction
• Application choice, etc.
These (and other) decisions matter a lot in data analysis.
"I saw it in a widely-cited paper in journal XX from my field"
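To see why a choice such as multiple testing correction matters, the sketch below applies two common procedures, Bonferroni and Benjamini-Hochberg, to the same set of p-values and counts how many tests each calls significant. The p-values are invented for illustration, and the two functions are minimal textbook implementations rather than anything taken from the slides.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values from ten hypothetical tests.
PVALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

uncorrected = [p <= 0.05 for p in PVALUES]
bonf = bonferroni(PVALUES)
bh = benjamini_hochberg(PVALUES)

print("uncorrected:       ", sum(uncorrected))
print("Bonferroni:        ", sum(bonf))
print("Benjamini-Hochberg:", sum(bh))
```

On this example the three approaches flag five, one, and two tests respectively, so the “significant” list depends on a convention that is often inherited from a lab or field rather than chosen deliberately.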
The Data Science in Bioinformatics
Bioinformatics is not something you are taught, it’s a way of life
Mick Watson – Roslin Institute
“The best bioinformaticians I know are problem solvers – they start the day not knowing something, and they enjoy finding out (themselves) how to do it. It’s a great skill to have, but for most, it’s not even a skill – it’s a passion, it’s a way of life, it’s a thrill. It’s what these people would do at the weekend (if their families let them).”
Models
• Workshops
  • Often enrolled too late
• Collaborations
  • More experienced persons
• Apprenticeships
  • Previous lab personnel to young personnel
• Formal Education
  • Most programs are post-doc or graduate level
  • Few undergraduate
Substrate
Cluster Computing
Cloud Computing
BASTM Laptop & Desktop
LINUX
Environment
“Command Line” and “Programming Languages”
VS
Bioinformatics Software Suite
Bioinformatics
• Know and understand the experiment
  • “The Question of Interest”
• Build a set of assumptions/expectations
  • Mix of technical and biological
  • Spend your time testing your assumptions/expectations
• Don’t spend your time finding the “best” software
• Don’t under-estimate the time bioinformatics may take
• Be prepared to accept ‘failed’ experiments
Bottom Line
The Bottom Line: Spend the time (and money) planning and producing good quality, accurate and sufficient data for your experiment.
Get to know your data; develop and test expectations.
As a result, you’ll spend much less time (and less money) extracting biological significance and results during analysis.
The mission of the Bioinformatics Core facility is to facilitate outstanding omics-scale research through these activities:
Data Analysis
The Bioinformatics Core promotes experimental design, advanced computation and informatics analysis of omics-scale datasets that drive research forward.
Research Computing
Maintain and make available high-performance computing hardware and software necessary for today’s data-intensive bioinformatic analyses.
Training
The Core helps to educate the next generation of bioinformaticians through highly acclaimed training workshops, seminars and through direct participation in research activities.
UC Davis Bioinformatics Core in the Genome Center
-omics is “Collaborative Research”
• Today’s experiments are complex and getting more complex
• No one person, or even one group, typically has the needed capabilities in all areas
• M
[Figure: Biologist; Molecular Expertise; Bioinformatics]
Prerequisites
• Access to a multi-core (24 CPU or greater), ‘high’ memory (64 GB or greater) Linux server.
• Familiarity with the ‘command line’ and at least one programming language.
• Basic knowledge of how to install software
• Basic knowledge of R (or equivalent) and statistical programming
• Basic knowledge of statistics and model building