Pathway/Gene Set Analysis in Genome-Wide Association Studies

Alison Motsinger-Reif, PhD
Associate Professor
Bioinformatics Research Center
Department of Statistics
North Carolina State University


Goals

• Methods for GWAS with SNP chips
  – Integrating expression and SNP information

Many Shared Issues

• Many of the issues/choices/methodological approaches discussed for microarray data are true across all “-omics”
• Many methods have been readily extended for other omic data
• There are several biological and technological issues that may make just “off the shelf” use of pathway analysis tools inappropriate

Genome-Wide Association Studies

Population resources
• trios
• case-control samples

Whole-genome genotyping
• hundreds of thousands or million(s) of markers, typically SNPs

Genome-wide association
• single SNP alleles
• genotypes
• multimarker haplotypes

Advantages of GWAS

• Compared to candidate gene studies
  – unbiased scan of the genome
  – potential to identify totally novel susceptibility factors
• Compared to linkage-based approaches
  – capitalize on all meiotic recombination events in a population
    • Localize small regions of the chromosome
    • enables rapid detection of the causal gene
  – Identifies genes with smaller relative risks

Concerns with GWAS

• Assumes the CDCV (common disease, common variant) hypothesis
• Expense
• Power dependent on:
  – Allele frequency
  – Relative risk
  – Sample size
  – LD between genotyped marker and the risk allele
  – Disease prevalence
  – Multiple testing
  – …….
• Study design
  – Replication
  – Choice of SNPs
• Analysis methods
  – IT support, data management
  – Variable selection
  – Multiple testing

Successes in GWAS Studies

• Over 400 GWAS papers published to date
• Big finds:
  – In 2005, it was learned through GWAS that age-related macular degeneration is associated with variation in the gene for complement factor H, which produces a protein that regulates inflammation (Klein et al. (2005) Science, 308, 385–389)
  – In 2007, the Wellcome Trust Case-Control Consortium (WTCCC) carried out GWAS for the diseases coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension. This study was successful in uncovering many new disease genes underlying these diseases.

More Successes

• Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat Genet. 2007
• Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Wellcome Trust Case Control Consortium. Nature. 2007;447;661-78
• Genomewide association analysis of coronary artery disease. Samani et al. N Engl J Med. 2007;357;443-53
• Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Parkes et al. Nat Genet. 2007;39;830-2
• Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Todd et al. Nat Genet. 2007;39;857-64
• A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Frayling et al. Science. 2007;316;889-94
• Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Zeggini et al. Science. 2007;316;1336-41
• Scott et al. (2007) A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 316, 1341–1345.
• …………

Limitations

• For many diseases, the amount of trait variation explained by even the successes is well below the estimated heritability.
• Recently, GWAS have come under a lot of criticism for relatively few translatable findings given the investment and hype.
• Assumptions underlying GWAS are not true for all diseases.

Figure: Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).

Source: TA Manolio et al. Nature 461, 747-753 (2009) doi:10.1038/nature08494

Reasons GWAS Can Fail, even if well-powered and well-designed….

• Alleles with small effect sizes
• Rare variants
• Population differences
• Epistatic interactions
• Copy number variation
• Epigenetic inheritance
• Disease heterogeneity
• ……….

Missing Heritability

Possible Association Models

1. Each of several genes may have a variant that confers increased risk of disease independent of other genes

2. Several genes contribute additively to the malfunction of the pathway

3. There are several distinct combinations of gene variants that increase relative risk, but only modest increases in risk for any single variant

Hypothetical Disease Mechanism

• For each gene, probability of knockout = 0.2^2 = 0.04
• Probability of disease:
  – Pathway knocked out = 0.4
  – Pathway intact = 0.2
• Sample size = 2000 cases, 2000 controls
• Power:

Linear Pathway
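The arithmetic behind this hypothetical model can be made concrete with a short sketch. It assumes a linear pathway of independent genes in which knocking out any single gene knocks out the whole pathway; the pathway length `n_genes` is a hypothetical parameter that the slide does not state, and the function names are illustrative only.

```python
def pathway_knockout_prob(n_genes, p_knockout=0.04):
    """Probability that at least one gene in the (assumed) linear pathway is knocked out."""
    return 1.0 - (1.0 - p_knockout) ** n_genes


def odds(risk):
    """Convert a disease probability into odds."""
    return risk / (1.0 - risk)


def per_gene_odds_ratio(n_genes, p_knockout=0.04, risk_ko=0.4, risk_intact=0.2):
    """Marginal odds ratio seen when testing ONE gene of the pathway on its own."""
    # Probability that some OTHER gene already knocks the pathway out.
    p_other_ko = 1.0 - (1.0 - p_knockout) ** (n_genes - 1)
    # Carriers of this gene's knockout always have a knocked-out pathway.
    risk_given_ko = risk_ko
    # Non-carriers are a mixture of knocked-out and intact pathways.
    risk_given_intact = p_other_ko * risk_ko + (1.0 - p_other_ko) * risk_intact
    return odds(risk_given_ko) / odds(risk_given_intact)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):  # hypothetical pathway lengths
        print(n, round(pathway_knockout_prob(n), 3), round(per_gene_odds_ratio(n), 3))
```

Under these assumptions the per-gene odds ratio shrinks as the pathway gets longer, which is one way to see why single-marker tests can miss a signal that a pathway-level test might pick up.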

Enrichment Testing in GWAS

• Testing pathway enrichment is possible in GWAS data
  – Many of the same issues that exist in gene expression enrichment testing occur in GWAS enrichment testing (e.g. choice of statistics, competitive vs self-contained)
• Primary difference:
  – In expression data the unit of testing is a gene
  – In GWAS data the unit of testing is a SNP
• Challenges:
  – Identifying the SNP (set) -> gene mapping
  – Summarizing across individual SNP statistics to compute a per-gene measure

Mapping SNPs to Genes

• All SNPs in physical proximity of each gene
  – Pros:
    • All/most genes represented
  – Cons:
    • Varying number of SNPs per gene
    • Many of the SNPs may dilute signal
    • Defining gene proximity can affect results
• eSNPs (expression-associated SNPs)
  – Pros:
    • 1 SNP per gene
    • SNPs functionally associated
  – Cons:
    • Assumes variants affect expression
    • Not all genes have eSNPs
    • eSNPs may be study and tissue dependent
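The proximity-based mapping can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the tuple layouts, the `window` parameter and the SNP and gene identifiers are all hypothetical.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, window=20_000):
    """Assign each SNP to every gene whose body, padded by `window` bp, contains it.

    snps:  list of (snp_id, chrom, position)
    genes: list of (gene_id, chrom, start, end)
    Genes with no SNP in range are simply absent from the result.
    """
    gene_to_snps = defaultdict(list)
    for gene_id, gene_chrom, start, end in genes:
        lo, hi = start - window, end + window
        for snp_id, snp_chrom, pos in snps:
            if snp_chrom == gene_chrom and lo <= pos <= hi:
                gene_to_snps[gene_id].append(snp_id)
    return dict(gene_to_snps)


# Hypothetical toy data.
example_snps = [("rs1", "1", 1_000), ("rs2", "1", 50_000), ("rs3", "2", 1_000)]
example_genes = [("GENE_A", "1", 5_000, 10_000), ("GENE_B", "2", 900, 1_100)]

narrow = map_snps_to_genes(example_snps, example_genes, window=5_000)
wide = map_snps_to_genes(example_snps, example_genes, window=50_000)
```

With the narrow window `GENE_A` receives one SNP and with the wide window it receives two, a small instance of the point that the definition of gene proximity can change the result.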

Gene summaries

• Initial studies proposed different statistics for summarizing the overall gene association prior to enrichment analysis
  – Number/proportion of SNPs with p-value < 0.05
  – Mean(-log10(p-value))
  – Min(p-value)
  – 1 - (1 - Min(p-value))^N
  – 1 - (1 - Min(p-value))^((N+1)/2)
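The summaries listed above are simple to compute side by side. The sketch below assumes N is the number of SNPs mapped to the gene and that all p-values are strictly positive; the function and key names are illustrative.

```python
import math


def gene_summaries(pvalues, alpha=0.05):
    """Per-gene summary statistics from the SNP p-values mapped to one gene."""
    n = len(pvalues)
    p_min = min(pvalues)
    return {
        # Proportion of SNPs with p-value below the threshold.
        "prop_below_alpha": sum(p < alpha for p in pvalues) / n,
        # Mean of -log10(p) across SNPs (requires p > 0).
        "mean_neg_log10_p": sum(-math.log10(p) for p in pvalues) / n,
        # Best single-SNP p-value, biased downward for genes with many SNPs.
        "min_p": p_min,
        # Min p corrected as if all N SNPs were independent tests.
        "min_p_corrected_n": 1.0 - (1.0 - p_min) ** n,
        # Milder correction using (N + 1) / 2 effective tests.
        "min_p_corrected_half": 1.0 - (1.0 - p_min) ** ((n + 1) / 2),
    }


summary = gene_summaries([0.01, 0.5, 0.1, 0.04])
```

The two corrected versions always lie at or above the raw minimum p-value, with the (N+1)/2 form being the less conservative of the two for genes with more than one SNP.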

First approaches: combining p-values

• Compute gene-wise p-value:
  – Select most likely variant: ‘best’ p-value
  – Selected minimum p-value is biased downward
  – Assign ‘gene-wise’ p-value by permutations (Westfall-Young)
    • Permute samples and compute ‘best’ p-value for each permutation
    • Compare candidate SNP p-values to this null distribution of ‘best’ p-values
• Combine p-values by Fisher’s method, across SNPs (biased in the presence of correlation)

$$V = -2 \sum_{g_i \in G} \log(p_i), \qquad p = P\left(\chi^2_{2k} > V\right)$$

where k is the number of p-values being combined.
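Fisher's method is short enough to write out directly. The sketch below uses only the standard library by relying on the closed form of the chi-square upper tail for an even number of degrees of freedom; it assumes independent, strictly positive p-values, which is exactly the assumption that correlation between SNPs violates.

```python
import math


def fisher_combined_p(pvalues):
    """Return (V, combined p) for Fisher's method on k independent p-values."""
    k = len(pvalues)
    v = -2.0 * sum(math.log(p) for p in pvalues)
    # For 2k degrees of freedom:
    #   P(chi2_{2k} > v) = exp(-v/2) * sum_{j=0}^{k-1} (v/2)^j / j!
    half = v / 2.0
    term = 1.0   # (v/2)^0 / 0!
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return v, math.exp(-half) * total


v_stat, p_combined = fisher_combined_p([0.05, 0.05])
```

With a single p-value the combined value equals the input, which is a quick sanity check on the formula.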

Next approaches

• Additive model:

$$\log\left(\frac{p}{1-p}\right) = \sum_{g_i \in G} \beta_i n_i$$

  – Where n_i indexes the number of allele Bs of a SNP in gene i in the gene set G
  – Select subset of most likely SNPs
  – Fit by logistic regression (glm() in R)
• Significance by permutations
  – Permute sample outcomes
  – Select genes and fit logistic regression again
    • Assess goodness of fit each time
  – Compare observed goodness of fit with the permutation distribution

Competitive vs. Self-Contained Tests

• Competitive cutoff tests
  – Require only permuting SNP or gene labels
  – May only allow assessment of relative significance
• Self-contained distribution tests
  – Require permuting phenotype-genotype relationships
  – Resource intensive, may be difficult for large meta-analyses
  – Allow assessment of overall significance

Competitive vs. Self-Contained Tests

• Self-contained null hypothesis
  – no genes in gene set are differentially expressed
• Competitive null hypothesis
  – genes in gene set are at most as often differentially expressed as genes not in gene set

What does this mean for SNP data?
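One way to see the difference in SNP terms is a toy example that runs both permutation schemes on the same data. Everything here is a simplifying assumption made for illustration: each gene is represented by a single SNP's allele counts, the per-gene score is the absolute difference in mean allele count between cases and controls, and the set statistic is the mean score over the gene set.

```python
import random


def gene_scores(genotypes, phenotype):
    """Per-gene score: |mean allele count in cases - mean allele count in controls|."""
    cases = [i for i, y in enumerate(phenotype) if y == 1]
    controls = [i for i, y in enumerate(phenotype) if y == 0]
    scores = {}
    for gene, counts in genotypes.items():
        mean_case = sum(counts[i] for i in cases) / len(cases)
        mean_ctrl = sum(counts[i] for i in controls) / len(controls)
        scores[gene] = abs(mean_case - mean_ctrl)
    return scores


def set_stat(scores, gene_set):
    """Set-level statistic: mean per-gene score."""
    return sum(scores[g] for g in gene_set) / len(gene_set)


def competitive_p(scores, gene_set, n_perm=1000, rng=random):
    """Competitive test: compare the set against random gene sets of the same size.

    Only gene labels are shuffled, so no genotype data is needed."""
    observed = set_stat(scores, gene_set)
    genes = list(scores)
    hits = 0
    for _ in range(n_perm):
        if set_stat(scores, rng.sample(genes, len(gene_set))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def self_contained_p(genotypes, phenotype, gene_set, n_perm=1000, rng=random):
    """Self-contained test: permute phenotypes and recompute every gene score."""
    observed = set_stat(gene_scores(genotypes, phenotype), gene_set)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_stat(gene_scores(genotypes, shuffled), gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The competitive version works from gene-level scores alone, while the self-contained version has to recompute all association statistics for every permutation, which is why it is the resource-intensive option and harder to run on meta-analysis summary data.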

Choice of Pathways/Gene Sets

• Relatively less “signal” in GWAS than in gene expression (GE)
  – GE enrichment typically tests which gene sets/pathways show enrichment
  – GWAS enrichment typically tests if there is enrichment
• Typically want to be conservative about selecting the number of pathways to test, otherwise it will be difficult to overcome the multiple-testing burden
• Prioritized approach:
  – Limited number of specific hypotheses (e.g. gene sets from experiment, co-expression modules, disease-specific pathways/ontologies)
  – Exploratory analyses such as all KEGG/GO sets
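One common way to account for the number of pathways tested is false discovery rate control. The sketch below implements the Benjamini-Hochberg adjustment, a standard procedure that is used here as an assumption since this slide does not name a specific correction.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pathway p-values.
adjusted_pathway_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Because each raw p-value is scaled by the number of tests divided by its rank, testing a handful of prioritized pathways costs far less than an exploratory scan over every KEGG/GO set.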

Some Specific Methods

• SSEA – SNP Set Enrichment Analysis
• i-GSEA4GWAS
• MAGENTA – Meta-Analysis Gene-set Enrichment of variant Associations

SSEA

• Zhong et al. AJHG (2010)
• eSNP analysis to map SNPs to genes
  – More on this later…..
• Pathway statistic = one-sided Kolmogorov-Smirnov test statistic
• Pathway p-value assessed by permuting genotype-phenotype relationship
• FDR used to control error due to the number of pathways tested
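A one-sided Kolmogorov-Smirnov statistic is easy to compute by hand. The sketch below measures how far the empirical distribution of a pathway's SNP p-values rises above the Uniform(0, 1) distribution; comparing against the uniform is an assumption made for illustration, since the slide does not say which reference distribution SSEA uses.

```python
def ks_one_sided_uniform(pvalues):
    """D+ = max_i (i/n - p_(i)) over the sorted p-values p_(1) <= ... <= p_(n).

    Large values mean the p-values are concentrated near zero more than
    a uniform null would predict."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return max((i + 1) / n - p for i, p in enumerate(ordered))


enriched = ks_one_sided_uniform([0.03, 0.01, 0.02])
unremarkable = ks_one_sided_uniform([0.9, 0.95])
```

In a permutation framework the observed statistic would then be compared with the same statistic recomputed after shuffling the genotype-phenotype relationship.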

i-GSEA4GWAS

• Zhang et al. Nucl Acids Res (2010)
• http://gsea4gwas.psych.ac.cn/
• Categorizes genes as significant or not significant
  – Significant: at least 1 SNP in the top 5% of SNPs
  – Does not adjust for gene size
• Pathway score: k/K
  – k = proportion of significant genes in the gene set
  – K = proportion of significant genes in the GWAS
• FDR assessed by permuting SNP labels
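The k/K score as described on this slide can be sketched as follows. This reflects only the slide's description and is not the published tool's implementation; the function name, the input layout and the way the top-5% cutoff is taken are assumptions.

```python
def significance_proportion_score(gene_to_pvalues, gene_set, top_fraction=0.05):
    """Return k / K, where a gene is 'significant' if it has a SNP in the top fraction.

    gene_to_pvalues: dict mapping gene -> list of SNP p-values mapped to that gene
    gene_set:        list of genes in the pathway (all must be keys of the dict)
    """
    all_p = sorted(p for ps in gene_to_pvalues.values() for p in ps)
    n_top = max(1, int(len(all_p) * top_fraction))
    cutoff = all_p[n_top - 1]  # largest p-value still counted as 'top'
    significant = {g for g, ps in gene_to_pvalues.items() if min(ps) <= cutoff}
    k = sum(g in significant for g in gene_set) / len(gene_set)
    K = len(significant) / len(gene_to_pvalues)
    return k / K


# Hypothetical toy GWAS with four genes and two SNPs per gene.
toy_gwas = {
    "A": [0.001, 0.5],
    "B": [0.2, 0.3],
    "C": [0.4, 0.6],
    "D": [0.7, 0.8],
}
score_ab = significance_proportion_score(toy_gwas, ["A", "B"])
score_cd = significance_proportion_score(toy_gwas, ["C", "D"])
```

A score above 1 means the pathway holds a larger share of significant genes than the GWAS as a whole. Since a gene with many SNPs has more chances of landing one in the top 5%, the lack of adjustment for gene size noted above matters here.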

Results

MAGENTA

• Segrè et al. PLoS Genetics (2010)
• Software download:
  – http://www.broadinstitute.org/mpg/magenta/
  – Requires MATLAB!!
  – Less convenient, but more customizable than i-GSEA4GWAS
• Customizable proportion of “significant” genes
• Customizable gene window (upstream & downstream)
• Option for Rank-Sum test
• Gene summary = min(p)
  – Uses stepwise regression to adjust for multiple possible factors: e.g. gene size, SNP density

MAGENTA Results

Adaptations of GSEA

• Order log-odds ratios or linkage p-values for all SNPs
• Map SNPs to genes, and genes to groups
• Use linkage p-values in place of t-scores in GSEA
  – Compare distribution of log-odds ratios for SNPs in group to randomly selected SNPs from the chip

Summary Points for GWAS

• In GWAS, few SNPs typically reach genome-wide significance
• Biological function of those that do can take years of work to unravel
• Incorporating biological information (expression, pathways, etc.) can help interpret and further explore GWAS results
• Enrichment tests can be used to explore biological pathway enrichment
  – Different tests tell you different things
• Annotation choices are very different than in gene expression data, though they still rely on the same resources; this is not necessarily so for other “-omics”

Adding in Gene Expression Data

• Many motivating reasons to combine/integrate data from multiple “-omes”
• Combining expression and SNP data is most commonly done
  – Though methods could be applied to combine other “-omics”
• Generally make assumptions about the central dogma

Genetics of Gene Expression

• Schadt, Monks, et al. (Nature 2003) & Morley, Molony, et al. (Nature 2004) showed that gene expression is a heritable trait under genetic control
• Identifying expression-associated SNPs (eSNPs) can identify SNPs which are associated with biological function
• For significant GWAS “hits”, eSNPs can suggest candidate genes and possibly information about direction of association

Motivation for Integrated Analysis

• Newer approaches will allow you to avoid partitioned/filtered analysis and leverage information across data types
• New technologies allow for more ready integration
  – Ex. RNA-Seq
  – Dropping costs allow for more data types to be collected simultaneously
  – Biobanking efforts are storing more tissues

Motivation for Integrated Analysis

• Naturally allows Bayesian approaches for identifying priors or jointly modeling data
• Several new approaches proposed
  – Methods that were developed for eSNPs are readily extended across data types
  – Other approaches take into account similarities between/within phenotypes
    • Several propose an ontology jointly representing disease risk factors and causal mechanisms based on GWAS results
    • Proposed ontology is disease-specific (nicotine addiction and treatment) and only applicable to very specific research questions
  – More later on “different issues for -omics”

Motivation for Integrated Analysis

• Methods largely rely on central dogma assumptions that do not always hold

Summary

• Pathway and gene set analysis has been extended to SNP and SNV data
• Some annotation resources are readily adapted, but a new series of choices are available
• Software packages for GWAS pathway analysis are maturing
• Advances in approximation for permutation testing will make these tools more computationally tractable
• Many of the same issues with missing annotation, etc. are still a concern

Summary

• Integration of SNP-level and eSNP data has been highly successful, and helps motivate the integration of other “-omes” in analysis
• Such integration will be dependent on the quality of the annotation that it relies on
• Next, we will talk about specific concerns for different data types
• Issues will compound in integrated analysis…

Questions?
