Bioinformatics for the 100,000 Genomes Project
Augusto Rendón [email protected] | Genomics England
Principal Research Associate | University of Cambridge
Barcelona, 2016-11-02
Outline
• Introduction to the UK’s 100,000 Genomes Project
• Analyses in rare diseases
• Analyses in cancer
• Bioinformatics platform
• Data models and flows
• Databases
• Interpretation
Inception of the 100,000 Genomes Project (2012, 2014)
“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)
“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)
• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered electronically, store it securely and analyse it within an English data centre (reading library)
• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation
• Create capacity, capability and legacy in personalised medicine for the UK
Goals of the Genomics England project
1. To bring benefit to NHS patients
2. To enable new scientific discovery and medical insights
3. To kickstart the development of a UK genomics industry
4. To create an ethical and transparent programme based on consent
Genomics England project
http://www.genomicsengland.co.uk/library-and-resources/
Recruitment and clinical interface via 13 “GMCs”, Scotland and Northern Ireland
• Genomic Medicine Centres (GMCs)
  • Networks of NHS hospitals including genomics labs
  • 13 “Lead organisations” plus 71 “Local Delivery Partners”
  • Contracted by NHS England
  • Cover recruitment, data and return of results
• Scotland
  • Doing own sequencing
• Northern Ireland
  • Similar to a GMC
  • Contracted by NI payer
Feedback to participants
Additional finding genes
Requirements:
• A treatable or preventable condition.
• Reliably detected by next generation sequencing.
• Each gene will have a curated list of high confidence, high penetrance variants.
Other conditions may be added if clinically appropriate and technically feasible.
Participants recruited in rare diseases (RD)
• About 400 RD participants currently recruited per week
• 5,000 participants recruited to the RD pilot
[Figure: family size. *Data from Main Programme]
Recruitment by tumour type (participants, share of total)
• Adult glioma: 19 (2%)
• Bladder: 28 (3%)
• Breast: 321 (29%)
• Childhood: 1 (0%)
• Colorectal: 264 (24%)
• Endometrial carcinoma: 20 (2%)
• Lung: 139 (13%)
• Malignant melanoma: 3 (0%)
• Ovarian: 91 (8%)
• Prostate: 95 (9%)
• Renal: 65 (6%)
• Sarcoma: 44 (4%)
• Testicular germ cell tumours: 1 (0%)
>14,896 genomes sequenced (Nov 1)
[Figure: % autosomal coverage >=15x (Q30, no dup) against N bases x10^9 (Q30, no dup); germline data only]
• Median % autosomal coverage >=15x = 97.4%
• About 1.4 PB of data
Analyses in Rare Diseases
Genetic based test
Checks of reported data vs genetics
• Sex checks
  • Coverage-based (WGS)
  • X chromosome heterozygosity and Y chromosome genotyping rate (array)
  • Predicted minor karyotypes include XO, XXY, XYY
• Relatedness checks
  • Mendelian inconsistency rate (where at least one parent sequenced)
  • Estimated identity by descent sharing for all pairs in cohort, and working on a family-only workflow (PLINK and PC-Relate)
  • Can identify rare phenomena, e.g. large-scale uniparental isodisomy
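A coverage-based sex check of the kind listed above can be sketched as follows. This is a minimal illustration and not the project's implementation: the function names, the use of X and Y coverage normalised to mean autosomal coverage, the idealised 0.5-per-copy ratio and the tolerance are all assumptions.

```python
# Assumed mapping from reported sex to the expected sex-chromosome karyotype.
EXPECTED = {"female": "XX", "male": "XY"}

def infer_sex_karyotype(x_ratio, y_ratio, tol=0.2):
    """Guess the karyotype from X and Y coverage divided by autosomal coverage.

    Idealised assumption: each chromosome copy contributes a ratio of 0.5.
    """
    x_copies = round(x_ratio / 0.5)
    y_copies = round(y_ratio / 0.5)
    # Ratios far from a whole number of copies are reported as ambiguous.
    off_x = abs(x_ratio - 0.5 * x_copies)
    off_y = abs(y_ratio - 0.5 * y_copies)
    if x_copies == 0 or off_x > tol or off_y > tol:
        return "ambiguous"
    if y_copies == 0:
        return "XO" if x_copies == 1 else "X" * x_copies
    return "X" * x_copies + "Y" * y_copies

def sex_check(reported_sex, x_ratio, y_ratio):
    """Return the inferred karyotype and whether it matches the reported sex."""
    karyotype = infer_sex_karyotype(x_ratio, y_ratio)
    return karyotype, EXPECTED.get(reported_sex) == karyotype
```

Under these assumptions a sample reported as male with X and Y ratios near 1.0 and 0.5 would be flagged, since the inferred karyotype is XXY rather than XY.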
Coverage-based sex checks
Relatedness checking
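The Mendelian inconsistency rate used in the relatedness checks can be illustrated with a short sketch. It assumes biallelic autosomal sites with genotypes coded as the number of alternate alleles (0, 1 or 2); the function names and the input layout are hypothetical, and sites with one missing parent are still counted, in line with "at least one parent sequenced".

```python
def transmissible(genotype):
    """Alt-allele counts a parent with this genotype can pass to a child."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def is_consistent(child, mother=None, father=None):
    """True if the child's genotype can arise from the available parents."""
    any_allele = {0, 1}  # an unsequenced parent could have passed either allele
    from_mother = transmissible(mother) if mother is not None else any_allele
    from_father = transmissible(father) if father is not None else any_allele
    return any(m + f == child for m in from_mother for f in from_father)

def mendelian_inconsistency_rate(sites):
    """sites: iterable of (child, mother, father); a missing parent is None."""
    informative = [s for s in sites if s[1] is not None or s[2] is not None]
    if not informative:
        return 0.0
    errors = sum(not is_consistent(*s) for s in informative)
    return errors / len(informative)
```

A rate well above the background genotyping error rate suggests that the reported pedigree does not match the genetics, or points to a rare event such as uniparental isodisomy.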
Analyses in Cancer
Assessing the quality of sample preparation protocols
Fresh Frozen (FF) vs Formalin Fixed Paraffin Embedded (FFPE)
FF
• Costly and not widely available
• Difficult to capture tumour
• High quality DNA
FFPE
• Routinely used
• Digital pathology for tumour selection
• Low quality and quantity of DNA
Sample        AT dropout   GC dropout   Comment
FF sample     0.00         0.06         low coverage for GC-rich regions
FFPE sample   0.16         -0.26        trend is reversed, with poor coverage of AT-rich regions
[Figure axis: AT rich to GC rich]
AT/GC dropout effect on copy number variant calling
[Figure panels: FFPE GC dropout, FF GC dropout, FFPE AT dropout, FF AT dropout]
FF AT drop   FF purity   FF RMSD cov   FFPE AT drop   FFPE purity   FFPE RMSD cov
4.7          0.6         13.1          5.8            0.6           18.9
4.0          0.4         13.2          5.4            0.4           24.5
4.3          0.5         14.3          6.6            0.5           22.4
4.4          0.4         12.9          15.8           NA            50.7
3.1          0.4         14.8          5.4            0.4           23.1
Fresh frozen and FFPE paired samples: ability to call CNVs
Overlapping SNVs in paired FF and FFPE samples (VAF <5% filtered out)
FFPE also affects small variant calling
[Figure: proportion of variants; legend: GMC 1, other GMCs]
Comparing sequence quality metrics across labs
After standardising on optimised FFPE protocol
Bioinformatics platform
GEL bioinformatics platform
Design goals
• Scalability: able to operate on several hundred whole genomes per day
• Traceability: able to keep the provenance of every artefact produced in the process
• Knowledge accumulation: able to capture and aggregate the knowledge and decisions made during interpretation in order to generate better knowledge bases
• Service oriented: components talk to each other via well defined APIs and data formats
[Architecture diagram; labels: Hospitals, Genomics England, Interpretation providers, Clinical data intake service, Genome intake service, Interpretation platform services, Workflow management, Metadata, Variants, Reference knowledge, GxP associations, Interpretation, Tracking]
Same data model, many manifestations
How to ensure that all the data is coherently stored and easily retrievable?
• Inspired by Model-Driven Architecture approaches
• Models (schemas) controlled in GitHub, including boilerplate functions to validate data against the model
• Documentation auto-generated out of the model
• Services communicate using JSON derived from the model
• Data written against the schema auto-generated from the model in the metadata store, using document stores
Data models in the platform
• Use of Avro for its interface definition language, JSON out of the box, and automatic code generation of classes to handle these data
• Models (and auxiliary libraries) available here: https://github.com/genomicsengland/GelReportModels/tree/releases/schemas/IDLs
• Documentation for the master branch here: https://genomicsengland.github.io/GelReportModels/index.html
• Bioinformatics models here: https://github.com/opencb/biodata
• For reads and variants we use protocol buffers compatible with GA4GH standards
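The idea of validating JSON messages against a model can be sketched with the standard library alone. This is not the GelReportModels code and does not use the Avro library: the record name, its fields and the `validate` helper are hypothetical, and only a few primitive types of an Avro-style record schema are handled.

```python
import json

# Hypothetical Avro-style record schema; not taken from GelReportModels.
SCHEMA = {
    "type": "record",
    "name": "ExampleVariant",
    "fields": [
        {"name": "chromosome", "type": "string"},
        {"name": "position", "type": "int"},
        {"name": "reference", "type": "string"},
        {"name": "alternate", "type": "string"},
    ],
}

PRIMITIVES = {"string": str, "int": int, "double": float, "boolean": bool}

def validate(record, schema):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    known = {field["name"] for field in schema["fields"]}
    for field in schema["fields"]:
        name, expected = field["name"], PRIMITIVES[field["type"]]
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for "int" fields.
        bad_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or bad_bool:
            problems.append(f"wrong type for {name}: expected {field['type']}")
    problems.extend(f"unexpected field: {k}" for k in sorted(record) if k not in known)
    return problems

message = json.loads(
    '{"chromosome": "17", "position": 270550, "reference": "A", "alternate": "T"}'
)
```

Calling `validate(message, SCHEMA)` on the message above returns an empty list; a service would reject any message for which the list is non-empty.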
InterpretedGenomeRD
Bertha: Distributed workflow management system (really an enterprise service bus for genomic data)
[Diagram: a Producer publishes to an Exchange, which routes to a Queue, from which a Consumer consumes]
[Components: Message Broker, Tracking DB, Job Scheduler, Dashboard, Delivery API, Auditor, Orchestrator, Grid Consumer]
• Restarts
• Scatter-gather
• Single and group processes
• Multiple concurrent workflows
(work in progress)
https://github.com/genomicsengland/bertha
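The producer, exchange, queue and consumer pattern in the Bertha diagram can be sketched in memory as follows. This is only an illustration of the messaging pattern, not Bertha's code or any broker's API: the class, the routing key and the message contents are hypothetical.

```python
from collections import defaultdict, deque

class Exchange:
    """Routes each published message to every queue bound to its routing key."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing key -> bound queues

    def bind(self, routing_key, queue):
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        for queue in self.bindings[routing_key]:
            queue.append(message)

def consume(queue, handler):
    """Drain a queue, applying the handler to each message in arrival order."""
    results = []
    while queue:
        results.append(handler(queue.popleft()))
    return results

exchange = Exchange()
qc_queue, tracking_queue = deque(), deque()
exchange.bind("sample.received", qc_queue)
exchange.bind("sample.received", tracking_queue)  # a second consumer sees the same events

# A producer publishes two events; the exchange copies each to both queues.
exchange.publish("sample.received", {"sample": "S1"})
exchange.publish("sample.received", {"sample": "S2"})

processed = consume(qc_queue, lambda m: f"intake QC done for {m['sample']}")
```

Because producers only know the exchange and routing key, new consumers (a dashboard or an auditor, say) can be added by binding another queue without changing the producers.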
bertha_default 1.1.0
[Workflow diagram. Stage headings: Intake QC; Single Sample QC & Processing; Multi Sample QC; Analysis. Check points: Intake QC Check Point; Single-Sample QC Check Point; Multi-Sample QC Check Point; Consent Check Point.]
[Step labels, in extraction order: Cross Sample Contamination; Identity by Descent; Mendelian Inconsistency Rate; Sex Check; Somatic VCF re-headering; Tumour Cross Sample Contamination; Cross Species Contamination; Depth of Coverage; Concordance check; Merge Array Genotypes; Variant Calling; Variant Normalisation; Tumour Ploidy; Tumour Purity; Tumour Clonality; Mutation Signature; Viral Insertions; Actionable Mutation Coverage; SNV & Indel Refinement; Mutation Burden; Inbreeding Coefficient; Homozygosity Runs; Variant Annotation; Variant Tiering; Interpretation Dispatch; Exomiser; Delivery API; Integrity Check; MD5 Check; Validate BAM Picard; Filtered Bamstats; Unfiltered Bamstats; Q30 Bamstats; VCF QC; Fix Permissions; Plot Filtered Bamstats; Generate Filtered Metrics Bamstats; Plot Unfiltered Bamstats; Generate Q30 Metrics Bamstats; QC Stats Post-processing]
Workflow diagram
[Diagram: Sequence received (Intake API); Data intake; Single-sample QC & processing; Multi-sample QC; Analysis; Interpretation Request Dispatched]
Interpretation approach
• Virtual gene panels
  • Initially assigned by a clinician
  • Working on automated panel suggestions
• Variant filtering
  • Allele frequency: variant is rare
  • Segregation: variant segregates with condition in family
  • Panel membership (including mode of inheritance)
  • Different for cancer
• Interpretation
  • Automated pathogenicity scoring
  • Manual review
• Several manual QC points
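The three rare disease filters above (allele frequency, segregation, panel membership) can be combined in a short sketch. It is deliberately simplified and is not the production tiering logic: the field names, the 0.001 frequency cut-off and the dominant-style segregation rule are assumptions, and mode of inheritance is not modelled.

```python
def passes_filters(variant, panel_genes, affected, unaffected, max_af=0.001):
    """Apply rarity, panel membership and a simplified segregation test.

    variant: dict with 'gene', 'allele_frequency' and 'carriers' (set of
    family member ids carrying the variant). All names are hypothetical.
    """
    rare = variant["allele_frequency"] <= max_af
    in_panel = variant["gene"] in panel_genes
    carriers = variant["carriers"]
    # Simplified segregation: every affected member carries the variant
    # and no unaffected member does.
    segregates = affected <= carriers and not (unaffected & carriers)
    return rare and in_panel and segregates

panel = {"G1", "G2", "G3"}
affected, unaffected = {"proband", "mother"}, {"father"}
candidates = [
    {"gene": "G1", "allele_frequency": 0.0001, "carriers": {"proband", "mother"}},
    {"gene": "G1", "allele_frequency": 0.05, "carriers": {"proband", "mother"}},
    {"gene": "G9", "allele_frequency": 0.0001, "carriers": {"proband", "mother"}},
    {"gene": "G2", "allele_frequency": 0.0, "carriers": {"proband", "mother", "father"}},
]
kept = [v for v in candidates if passes_filters(v, panel, affected, unaffected)]
```

Only the first candidate survives: the second is too common, the third is outside the panel and the fourth is carried by an unaffected relative.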
PanelApp: crowdsourcing curation of gene-disease associations
https://bioinfo.extge.co.uk/crowdsourcing/PanelApp/
Status of panels
• 190 panels
• 97 panels at version 1 or above
• 3,512 genes
• 435 registered reviewers
• 15,149 gene-level reviews
• Recognised by the UK Genetic Testing Network
• Curation reaching a point of diminishing returns
[Figure: number of reviews (log scale, 1 to 10,000) per reviewer (0 to 150)]
Automated panel suggestion
[Diagram: diseases, core HP terms and panels. HP terms HP1 to HP5 link diseases to Panel X (G1, G2, G3), Panel Y (G1, G4, G5) and Panel Z (G6, G7, G8)]
[Diagram: diseases, HP annotation and genes; labels HP2, HP3, HP4, G6, G7 and the same three panels]
Also get a QC score for how phenotypically similar the patient is to the recruited disease
RD pilot benchmarking
• 1,831 participants with HPO terms, assigned panels (2,674 total) and core disease
• 847/1831 (46%) have exactly the same panels
• 728/1831 (40%) have the same panels plus 1 or 2 extra
• 256/1831 (14%) are missing some of the medical review panels
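One way to suggest panels from a patient's HPO terms is to score the overlap with each disease's core terms. The sketch below is an assumption-laden toy, not the method used in the pilot: the disease-to-term mapping is invented, Jaccard similarity is a stand-in for whatever phenotype similarity measure was used, and the 0.5 threshold is arbitrary.

```python
# Hypothetical toy data; the real disease, term and panel links are not
# recoverable from the slides.
DISEASE_CORE_TERMS = {
    "Disease X": {"HP1", "HP2", "HP3"},
    "Disease Y": {"HP2", "HP3", "HP4"},
    "Disease Z": {"HP1", "HP5"},
}
DISEASE_PANEL = {"Disease X": "Panel X", "Disease Y": "Panel Y", "Disease Z": "Panel Z"}

def jaccard(a, b):
    """Size of the intersection over size of the union of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_panels(patient_terms, min_score=0.5):
    """Panels whose disease is phenotypically close enough, best match first."""
    scored = [
        (jaccard(patient_terms, terms), DISEASE_PANEL[disease])
        for disease, terms in DISEASE_CORE_TERMS.items()
    ]
    return [panel for score, panel in sorted(scored, reverse=True) if score >= min_score]

def recruitment_qc_score(patient_terms, recruited_disease):
    """Similarity between the patient's terms and the recruited disease."""
    return jaccard(patient_terms, DISEASE_CORE_TERMS[recruited_disease])
```

For a patient annotated with HP2, HP3 and HP4, this toy data suggests Panel Y first and Panel X second, and gives a QC score of 0.5 if the patient was recruited under Disease X.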
[Figure: histogram of gene gains or losses; x axis: bin (-900 to 900, plus "More"); y axis: frequency (0 to 1200)]
Filtering in the rare diseases programme
Domain 1
Variants in a virtual panel of actionable genes (between 20 and 40). Actionable genes are defined as genes with short variants associated with therapeutic, prognostic or diagnostic actions by GenomOncology (My Cancer Genome).
Matching at the variant level
Domain 2
Variants in the genes from the Cancer Gene Census (534 genes).
Domain 3
Variants in all other genes
Frequency filters: exclude common variants (1000G, ExAC, GEL)
Consequence filters: exclude synonymous variants
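The domain assignment described above can be sketched as a small function. The field names, the frequency cut-off and the choice to apply the frequency and consequence filters before domain assignment are assumptions; the variant identifiers and gene set are toy stand-ins for the actionable variant list and the Cancer Gene Census.

```python
def assign_domain(variant, actionable_variants, census_genes, max_af=0.01):
    """Return 1, 2 or 3 for the domain, or None if the variant is filtered out.

    variant: dict with 'id' (chrom:pos:ref:alt), 'gene', 'allele_frequency'
    and 'consequence'. All names here are hypothetical.
    """
    if variant["allele_frequency"] > max_af:
        return None  # frequency filter: common variant
    if variant["consequence"] == "synonymous_variant":
        return None  # consequence filter
    if variant["id"] in actionable_variants:
        return 1  # matched at the variant level
    if variant["gene"] in census_genes:
        return 2
    return 3

ACTIONABLE = {"1:100:A:T"}      # toy stand-in for the actionable variant list
CENSUS = {"GENE_A", "GENE_B"}   # toy stand-in for the Cancer Gene Census

def make(variant_id, gene, af=0.0, consequence="missense_variant"):
    return {"id": variant_id, "gene": gene, "allele_frequency": af, "consequence": consequence}

domains = [
    assign_domain(make("1:100:A:T", "GENE_A"), ACTIONABLE, CENSUS),
    assign_domain(make("1:200:C:G", "GENE_B"), ACTIONABLE, CENSUS),
    assign_domain(make("2:300:G:A", "GENE_C"), ACTIONABLE, CENSUS),
    assign_domain(make("2:400:G:A", "GENE_C", af=0.2), ACTIONABLE, CENSUS),
    assign_domain(make("1:500:T:C", "GENE_A", consequence="synonymous_variant"), ACTIONABLE, CENSUS),
]
```

The five toy variants land in domains 1, 2 and 3, with the last two removed by the frequency and consequence filters.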
Filtering in the cancer programme
Two-part reports: Actionable and “Interesting”
Supplementary analysis
• Structural variants
• Mutational density
• Coverage and copy number
• Mutational signatures
• Hypermutation rain plots
• Mutation context
CellBase
• Reference data store / annotation engine
OpenCGA
• Catalog: metadata and clinical data store
• Storage: variant database
Interpretation Platform
• Interpretation service: manages various producers and consumers
• Interpretation warehouse (under construction): stores and serves interpretation data
Bioinformatics platform components
https://github.com/opencb/opencga
https://github.com/opencb/cellbase
OpenCB family of applications
[Diagram: an Interface Layer (Genome Browser, Variant Analysis, Data Discovery) over OpenCGA Catalog, OpenCGA Storage and CellBase; backends shown: MongoDB, MongoDB, MongoDB, HBase, Posix FS]
CellBase
• Knowledge base management
• Uses Ensembl, UniProt, IntAct, ClinVar, etc.
• Current database engine is MongoDB
• JSON outputs against a well defined model
• Supports annotation against local DBs
• Annotates about 10,000 variants/second per instance
• Python and R APIs
http://nar.oxfordjournals.org/content/40/W1/W609.short
Annotation against CellBase
http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/
https://github.com/opencb/cellbase
CellBase 4.0 - VEP 82: consequence type benchmark (1kG phase 3, 83M variants)
● VEP annotations: 346M
● CellBase annotations: 346M
● Coincidence at SO term level (346M annotations)
  – Annotations provided by VEP and not provided by CellBase: 3,364 (99.999% coincidence)
  – Annotations provided by CellBase and not provided by VEP: 4,918 (99.999% coincidence)
    ● 60% due to differences in miRNA data sources
    ● 39% due to difficulties with VEP output format parsing
● Coincidence at variant level (83M variants)
  – Variants with conflictng annotation: 4,990 (99.994% coincidence)
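The coincidence percentages follow directly from the counts on the slide. The short check below assumes the rounded totals (346M annotations, 83M variants) are close enough to the exact ones; the function name is hypothetical.

```python
def coincidence(total, discordant):
    """Percentage of items on which the two annotators agree."""
    return 100.0 * (1 - discordant / total)

so_term_vep_only = round(coincidence(346e6, 3364), 3)       # VEP-only annotations
so_term_cellbase_only = round(coincidence(346e6, 4918), 3)  # CellBase-only annotations
variant_level = round(coincidence(83e6, 4990), 3)           # conflicting variants
```

With these inputs the three values round to 99.999, 99.999 and 99.994, matching the figures quoted above.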
Annotation for phased MNVs and CNVs
• Support for CNVs new in CellBase 4.5 Beta
  • Main challenge: support imprecise calling by matching against already reported CNVs (population frequencies, clinical variants)
  • Same annotation data as for the rest of variants: consequence type, population frequencies, etc.
  • Example CNV
• Support for MNVs and phased variants from CellBase 4.0
  • Consequence type depends on variants affecting the same codon
  • Variants are assigned a phase set (phased VCFs include the PS tag); all variants on the same phase set shall be processed together
Annotation of MNVs
• Example MNV: 17:270550:AACAG:TGCAA
• Decompose into single phased variants, members of the same phase set:
{"id":"17:270550:A:T","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}
{"id":"17:270551:A:G","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}
{"id":"17:270554:G:A","result":[{"codon":"caG/caA","proteinVariantAnnotation":{"reference":"GLN","alternate":"GLN"},"sequenceOntologyTerms":[{"accession":"SO:0001819","name":"synonymous_variant"}
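The decomposition step can be sketched in a few lines. This is an illustration rather than CellBase's code: the function names and the `phase_set` label are hypothetical, the forward strand is assumed, and the codon start positions (270549 and 270552) are inferred from the codon strings in the output above.

```python
def decompose_mnv(variant_id, phase_set):
    """Split chrom:pos:ref:alt into single-base variants sharing a phase set."""
    chrom, pos, ref, alt = variant_id.split(":")
    if len(ref) != len(alt):
        raise ValueError("an MNV needs reference and alternate of equal length")
    start = int(pos)
    return [
        {"id": f"{chrom}:{start + offset}:{r}:{a}", "phase_set": phase_set}
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # positions where the bases agree are not variants
    ]

def mutate_codon(codon, codon_start, variants):
    """Apply every phased variant that falls inside the codon at once."""
    bases = list(codon)
    for variant in variants:
        _, pos, _, alt = variant["id"].split(":")
        offset = int(pos) - codon_start
        if 0 <= offset < 3:
            bases[offset] = alt
    return "".join(bases)

members = decompose_mnv("17:270550:AACAG:TGCAA", phase_set="PS1")
first_codon = mutate_codon("CAA", 270549, members)   # two variants in one codon
second_codon = mutate_codon("CAG", 270552, members)  # one variant in the next codon
```

Processing the phase set together is what turns CAA into CTG (Gln to Leu) in a single step; annotating 17:270550:A:T and 17:270551:A:G independently would evaluate two different single-base codon changes instead.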
OpenCGA - Catalog
Metadata store and authentication and authorisation (A&A) for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)
6-node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds
OpenCGA - Storage
Extensive capabilities to query across genotype and phenotype relationships
Aspiration to be fully GA4GH compatible from v1.0
Platform for interpretation (under construction)
Key (personal) learnings
• There is great strength in multidisciplinary teams with specialisation, but those individuals who can span both biology/genetics and software engineering are pivotal: they connect the specialists
• Good software engineering practices also apply to bioinformatics; to name a few: designing, documenting, testing, support and service. Skipping them doesn’t really save you time
• I have become a big fan of using well established technologies with rich ecosystems (e.g. Hadoop) rather than inventing new formats, data structures, toolchains
Final thoughts
• The future in human genetics will be underpinned by academic/industrial partnerships; both the task and the benefits are too big to go at it alone
• Genomic medicine is just one of the pilots of a digital revolution in healthcare where artificial intelligence will complement/replace the diagnostic journey
• But genomics is the easy part; clinical data is the real challenge