Standing on the shoulders of giants
German Demidov
Bioinformatics Summer School 2017
Biology and Big Data
> Discovering truth by building on previous discoveries
Why is it useful?
Just one example:
Using data from consortia
> Which types of data can you obtain from consortia? How do you access and download the data?
> How do you work as part of a consortium? Which problems might you face?
Important Remark
> Workshops on “How to use consortium_name” usually take ~3 days (e.g. https://www.encodeproject.org/tutorials/encode-meeting-2016/); we will try to give an overview in 1 hour
> However, if you want to find more information, google “consortium_name workshop”
> There are separate papers (e.g. Ewan Birney, 2012, Nature, about ENCODE)
GWAS Consortia
> http://www.wikigenes.org/e/art/e/185.html
> 500,000 genotyped people in the UK
EWAS Consortia
Genomics Consortia
> The Exome Aggregation Consortium
> 1000 Genomes
> Human Reference Genome
> International Cancer Genome Consortium
> The Cancer Genome Atlas
> Pan-Cancer Analysis of Whole Genomes
> GTEx
Epigenomics Consortia
> ENCODE
> Roadmap Epigenomics
> BluePrint
> International Human Epigenome Consortium
ExAC Overview
> http://exac.broadinstitute.org/about
> First thing to do: find and read the flagship paper!
> The dataset provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies.
ExAC: Why it is useful
It is used to
> calculate objective metrics of pathogenicity for sequence variants,
> identify genes subject to strong selection against various classes of mutation: 3,230 genes were identified with a near-complete reduction in the number of predicted protein-truncating variants, and 72% of these genes have no currently established human disease phenotype,
> filter candidate disease-causing variants efficiently
ExAC: Results
• ANNOVAR and ATAV were updated using ExAC data
• CADD scores were re-calculated
• Commercial tools such as Golden Helix and GeneTalk also incorporated ExAC data
ExAC: Download
> Download
ExAC: Methods
> Flagship paper, Methods section: a short description, with detailed pipelines in the Supplementary Information
> 91,796 individual exomes drawn from a wide range of primarily disease-focused consortia
ExAC Quality Assessment
> Comparison within trios: singleton transmission rate of 50.1% (expected: ~50%)
> >10,000 samples were checked with SNP arrays: 97-99% heterozygous concordance
> Platinum standard genome sequenced with 5 different technologies: 99.8% sensitivity, 0.056% FDR
> Comparison with 13 WGS samples (~30x, PCR-free)
> Indel FDR is higher (4.7%); singleton variants show higher FDR
> FDR differs between annotation classes (missense, synonymous, protein-truncating)
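The sensitivity and FDR figures on this slide follow from simple counts against a truth set. A minimal sketch of that arithmetic is below; the function name and the example counts are hypothetical and are not taken from the ExAC paper.

```python
def call_metrics(tp, fp, fn):
    """Compute sensitivity, PPV and FDR from variant-call counts.

    tp: called variants that are in the truth set
    fp: called variants that are not in the truth set
    fn: truth-set variants that were not called
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("need at least one true variant and one call")
    sensitivity = tp / (tp + fn)   # fraction of true variants recovered
    ppv = tp / (tp + fp)           # fraction of calls that are correct
    fdr = fp / (tp + fp)           # fraction of calls that are wrong (1 - PPV)
    return {"sensitivity": sensitivity, "ppv": ppv, "fdr": fdr}


# Hypothetical counts, for illustration only
metrics = call_metrics(tp=9980, fp=6, fn=20)
print(metrics)
```

Note that FDR is computed over the calls, not over the truth set, so a call set can have high sensitivity and still a noticeable FDR for a rare variant class such as indels or singletons.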
ExAC Sample Filtering
> Only 60,706 samples out of 91,796 passed QC
> A set of common SNPs (5,400) was selected, and samples with outlier heterozygosity were removed prior to PCA
> Per-sample metrics: number of variants, transition/transversion (Ti/Tv) ratio, alternate allele heterozygous/homozygous (Het/Hom) ratio and insertion/deletion (indel) ratio
> Close relatives were removed
> Final coverage: 80% of targeted bases >20x
> 77% were enriched with the Agilent kit (33 Mb target)
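The per-sample metrics above are easy to compute once the calls are in hand. The sketch below shows the Ti/Tv and Het/Hom ratios and one possible outlier rule. The median/MAD rule, its threshold and the sample values are assumptions for illustration, not the ExAC procedure.

```python
from statistics import median

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines or within pyrimidines (ref != alt assumed)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def titv_ratio(snvs):
    """Ti/Tv ratio for a list of (ref, alt) single-nucleotide variants."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes):
    """Ratio of heterozygous (0/1) to homozygous-alternate (1/1) genotypes."""
    het = sum(1 for g in genotypes if g == "0/1")
    hom = sum(1 for g in genotypes if g == "1/1")
    return het / hom if hom else float("inf")


def flag_outliers(values, n_mads=5.0):
    """Flag samples further than n_mads median absolute deviations from the median.

    The rule and the default threshold are illustrative assumptions.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    return {s for s, v in values.items() if mad and abs(v - med) > n_mads * mad}


# Hypothetical per-sample Ti/Tv values
per_sample_titv = {"s1": 2.9, "s2": 3.0, "s3": 3.1, "s4": 3.0, "s5": 1.2}
print(flag_outliers(per_sample_titv))
```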
1000GP
> http://www.internationalgenome.org
1000GP: Overview, goals
> http://www.internationalgenome.org/data-portal/sample
> A pretty convenient data portal with nice filtering options!
> The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied.
> The project planned to sequence each sample to 4x genome coverage; at this depth, sequencing cannot discover all variants in each sample, but can allow the detection of most variants with frequencies as low as 1%.
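Why 4x is too shallow for one sample but sufficient across a cohort can be illustrated with a deliberately simplified model. Assume read depth at a site is Poisson-distributed, each read from a heterozygous carrier shows the alternate allele with probability 0.5, and a variant counts as detected once a minimum number of alternate reads is seen. The model, the threshold and the function names are assumptions for illustration; this is not the power analysis from the 1000 Genomes papers.

```python
import math


def poisson_sf(k_min, lam):
    """P(X >= k_min) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(k_min))
    return 1.0 - cdf


def p_detect_single(depth, min_alt_reads):
    """Detection probability for one heterozygous carrier.

    Alternate-allele reads ~ Poisson(depth / 2) under the assumed model.
    """
    return poisson_sf(min_alt_reads, depth / 2)


def p_detect_pooled(depth, n_alt_copies, min_alt_reads):
    """Detection probability when reads from all carriers are pooled.

    Each copy of the alternate allele in the cohort contributes
    Poisson(depth / 2) alternate reads, and sums of Poissons are Poisson.
    """
    return poisson_sf(min_alt_reads, n_alt_copies * depth / 2)


# One carrier at 4x versus two allele copies in the cohort
# (e.g. 1% frequency among 100 diploid individuals), requiring 2 alt reads
single = p_detect_single(depth=4, min_alt_reads=2)
pooled = p_detect_pooled(depth=4, n_alt_copies=2, min_alt_reads=2)
print(f"single sample: {single:.3f}, pooled: {pooled:.3f}")
```

Under these assumptions the pooled probability is clearly higher than the single-sample one, which is the intuition behind low-coverage, many-sample designs.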
1000GP: Main Publications
> Pilot: A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (28 October 2010)
> Phase 1: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (01 November 2012)
> Phase 3: A global reference for human genetic variation. Nature 526, 68–74 (01 October 2015)
> An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (01 October 2015)
1000GP: Pipeline
1000GP: Power of Detection, Heterozygous Discordance, Sequencing Depth
1000GP: Results
1000GP: Variant Calling
1000GP: CNVs
1000GP: CNVs concordance
Pan-Cancer Analysis of Whole Genomes
> https://dcc.icgc.org/pcawg
Pan-Cancer Analysis of Whole Genomes (working groups)
1. Novel somatic mutation calling methods
2. Analysis of mutations in regulatory regions
3. Integration of the transcriptome and genome
4. Integration of the epigenome and genome
5. Consequences of somatic mutations on pathway and network activity
6. Patterns of structural variations, signatures, genomic correlations, retrotransposons and mobile elements
7. Mutation signatures and processes
8. Germline cancer genome
9. Inferring driver mutations and identifying cancer genes and pathways
10. Translating cancer genomes to the clinic
11. Evolution and heterogeneity
12. Portals, visualization and software infrastructure
13. Molecular subtypes and classification
14. Analysis of mutations in non-coding RNA
15. Mitochondrial
16. Pathogens
PCAWG, WG8: Validation
> High-coverage validation
> 3 main callers: Broad Institute (HaplotypeCaller), Annai-RTG (private company), FreeBayes (EMBL-DKFZ)
> 50 samples, 5,000 sites per sample sequenced at ~1000x depth
> ~2,300 SNVs, ~2,700 indels
> SNPs: recall/PPV/concordance ~0.995
> Indels: recall 0.94, PPV 0.91, concordance 0.88
PCAWG, WG8: CNVs
> CNVs
PCAWG, WG8: Results
> Sensitivity: deletions only ~60%, duplications ~40%!
Further Information
> The flagship paper is not informative :/
> 16 papers have been released on bioRxiv
GTEx
> The Genotype-Tissue Expression project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation
> Variations in gene expression that are highly correlated with genetic variation can be identified as expression quantitative trait loci, or eQTLs
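The idea of an eQTL can be shown with a toy calculation: regress expression of one gene on the alternate-allele dosage (0, 1 or 2) of one variant and look at the slope and correlation. The sketch below uses made-up numbers and plain least squares. Real eQTL mapping, including the GTEx pipeline, additionally involves normalization, covariates and multiple-testing correction, none of which is shown here.

```python
import math


def eqtl_fit(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on dosage."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx                   # expression change per alternate allele
    r = sxy / math.sqrt(sxx * syy)      # strength of the linear association
    return slope, r


# Hypothetical data: six individuals, one variant, one gene
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, r = eqtl_fit(dosages, expression)
print(f"slope per allele: {slope:.2f}, correlation: {r:.3f}")
```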
GTEx
> A lot of genetic changes associated with common human diseases, such as heart disease, cancer, diabetes, asthma, and stroke, lie outside of the protein-coding regions of genes
> The comprehensive identification of human eQTLs will greatly help to identify genes whose expression is affected by genetic variation
GTEx Data Overview
GTEx Scheme
GTEx: Causes of Death
ENCODE: Overview
> https://www.encodeproject.org
> Encyclopedia of DNA Elements
> The goal of ENCODE is to build a comprehensive parts list of functional elements in the human (mouse/fly/worm) genome
ENCODE Timeline
ENCODE as of 2012
ENCODE: Types of Data
> https://www.encodeproject.org
ENCODE: Data Matrix
ENCODE: Audit Category
Each sample can have multiple QC issues and still be available for downloading!
ENCODE: Result of Analysis
ENCODE: Ground Level
ENCODE: Mid-level
ENCODE: Top-Level
ENCODE publications
> Of course, one of the products is publications!
[Figure: Cumulative ENCODE Publications Over Time. Y-axis: Number of Publications (0 to 600). Series: Papers from Non-ENCODE Authors; Papers from ENCODE 2 Production Groups]
ENCODE standards
> Data Standards
BluePrint
> “BLUEPRINT is a large-scale research project receiving close to 30 million euro funding from the EU.”
> 42 leading European scientific centers
> The aim is to further the understanding of how genes are activated or repressed in both healthy and diseased human cells
> Focus on distinct types of haematopoietic cells from healthy individuals and on their malignant leukaemic counterparts
BluePrint
> http://www.blueprint-epigenome.eu
> Publications (Cell papers) & Data Portal
BluePrint
> http://dcc.blueprint-epigenome.eu/#/home
BluePrint
BluePrint
Roadmap Epigenomics
> The NIH Roadmap Epigenomics research effort aims to transform our understanding of how epigenetics contributes to disease
> The Consortium leverages experimental pipelines built around next-generation sequencing technologies to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease
Roadmap Epigenomics
Roadmap Epigenomics
Roadmap Epigenomics
It looks like we can get protocols by clicking on the link; however, there are not a lot of them there. The protocols are super outdated! (e.g. REMC STANDARDS AND GUIDELINES FOR CHIP-SEQ, DEC. 2, 2011, V1.0)
Roadmap Epigenomics
> If you want to work with these data, read the paper “Integrative analysis of 111 reference human epigenomes” (+16 ENCODE 2012; do not print the paper!)
> Go through the “Publications” list
Roadmap Epigenomics
The most useful section is Methods:
> RNA-seq uniform processing and quantification for consolidated epigenomes
> ChIP-seq and DNase-seq uniform reprocessing for consolidated epigenomes
> Methylation data cross-assay standardization and uniform processing for consolidated epigenomes
> Chromatin state learning
> Etc.
Roadmap Epigenomics
> Publications
Roadmap Epigenomics
> Histone mark combinations show distinct levels of DNA methylation and accessibility, and predict differences in RNA expression levels that are not reflected in either accessibility or methylation.
> Megabase-scale regions with distinct epigenomic signatures show strong differences in activity, gene density and nuclear lamina associations, suggesting distinct chromosomal domains.
> Approximately 5% of each reference epigenome shows enhancer and promoter signatures, which are twofold enriched for evolutionarily conserved non-exonic elements on average.
> Epigenomic data sets can be imputed at high resolution from existing data, completing missing marks in additional cell types, and providing a more robust signal even for observed data sets.
> Dynamics of epigenomic marks in their relevant chromatin states allow a data-driven approach to learn biologically meaningful relationships between cell types, tissues and lineages.
Working in Consortia
Working with Data
• Getting raw data
• Working with data from different consortia simultaneously: different QCs, different data analysis pipelines
• Tool versions are missing, or tools are outdated/unsupported: failure of replication!
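One cheap defence against missing tool versions is to write a small manifest next to every result. The sketch below records the Python version, the platform and the versions of a list of Python packages using only the standard library. The package names are placeholders, and command-line tools would need their own version query, which is not shown.

```python
import json
import platform
from importlib import metadata


def build_manifest(packages):
    """Record interpreter, platform and package versions for a results folder."""
    manifest = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of silently skipping it
            manifest["packages"][name] = None
    return manifest


# Placeholder package names; replace with what the analysis actually imports
manifest = build_manifest(["pip", "surely-not-installed-package-xyz"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Saving this JSON alongside the output files makes it possible to tell later which versions produced a given result.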
Working in Consortia I
• When your server goes down or all your data are accidentally removed
• Deadlines: add 3-6 months to the expected date!
• Communication: teleconferences
• Password renewal, access permissions
• Efficient data sharing: speed, reliability, confidentiality
Working in Consortia II
• Different naming of the same samples in different working groups/labs
• Wrong/missing identifiers (e.g. wrong cancer type or population); case: normal and somatic were actually swapped
• The same, but from clinicians
• Different labs, different library preparation (e.g. coverage depths after PCR-free and PCR-based WGS)
• Several tools can be used for the analysis: establish the best tool or generate a joint call set
• Multiple blacklists or outlier lists (every lab/group has its own, and they do not completely overlap)
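Two of the problems above, inconsistent sample naming and non-overlapping blacklists, can be tackled together: normalize identifiers first, then merge the lists while remembering who flagged what. The sketch below is one possible approach; the normalization rule, the lab names and the sample IDs are all hypothetical.

```python
import re


def normalize_sample_id(raw):
    """Hypothetical rule: trim, unify separators to '-', uppercase."""
    return re.sub(r"[\s_.]+", "-", raw.strip()).upper()


def merge_blacklists(blacklists):
    """Map each normalized sample ID to the set of groups that blacklisted it."""
    merged = {}
    for group, sample_ids in blacklists.items():
        for raw in sample_ids:
            merged.setdefault(normalize_sample_id(raw), set()).add(group)
    return merged


# Hypothetical blacklists from two labs using different naming styles
blacklists = {
    "lab_a": ["sample_001", "Sample 007"],
    "lab_b": ["SAMPLE-001", "sample.042"],
}
merged = merge_blacklists(blacklists)

union = set(merged)                                   # flagged by anyone
consensus = {s for s, groups in merged.items()
             if len(groups) == len(blacklists)}       # flagged by everyone
print(sorted(union), sorted(consensus))
```

Keeping the per-group provenance makes it possible to choose between the union (conservative) and the consensus (permissive) without re-collecting the lists.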
Working in Consortia III
• Unbalanced population structure
• Mix of different effects (e.g. cancer vs. population)
• Is your germline really germline?
Slide from AgENCODE, Ewan Birney
Thank you for your attention!