BCHM 6280 Tutorial: Gene specific information using...

Post on 13-Mar-2018

BCHM 6280 2017 NCBI & Ensembl Tutorial Page 1 of 5

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Web resources:
NCBI database: http://www.ncbi.nlm.nih.gov/
Ensembl database: http://useast.ensembl.org/index.html
UCSC Genome browser: http://genome.ucsc.edu/
Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, EBI-Ensembl, and UCSC Genome browsers to find information on specific genes.

NOTE: RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Other designations are described in the PDF file RefseqNomenclature.pdf, available from the Exercise 1 homepage.

Conduct text based searches of NCBI and Ensembl

a) Search the NCBI Gene database using the query term: “p53 AND human”.

The AND tells it to search for both p53 and human in every field.

b) Change the search query to: “p53 AND human[Organism]” or use the Advanced option to create the same query.

This tells the search algorithm that you are searching specifically for the species human in the Organism field of the database.

c) Search the Ensembl database for the human gene encoding p53. Change the dropdown menu to human, type “p53” in the search box and click GO.

The first thing you should note is that there are many matches to the query “p53.” There are several reasons for this:

1. You are searching every field and not just the gene name.

2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name and there are several different aliases for this gene.

3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein and thus the name will appear in the records of many other genes.

So how do you get around this? You can try searching for different aliases. You can look through the first few records and see if you can determine what the official gene symbol is. You can search the literature for other aliases.

In this case, from your search of the NCBI Gene database in either a) or b), the top hit is the gene with the symbol TP53, which is the correct symbol. Read through the summary and you’ll note that the official gene name is Tumor Protein p53 and that it is involved in numerous cellular processes involved in gene regulation. You should also note that p53 is one of the listed aliases.
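The fielded query from steps a) and b) can also be assembled in a script. The sketch below only builds the query string and a request URL for NCBI's E-utilities ESearch service; it does not contact NCBI. The helper names (build_gene_query, build_esearch_url) are made up for this illustration, and the choice of JSON output and a 20-record limit are assumptions, not part of the exercise.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for text searches (ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(term, organism=None):
    """Return an Entrez query, optionally restricted to the Organism field."""
    if organism is None:
        # No field tag: the term is searched in every field, as in step a).
        return term
    # The [Organism] tag limits the species name to that field, as in step b).
    return f"{term} AND {organism}[Organism]"


def build_esearch_url(query, retmax=20):
    """Return an ESearch URL that searches the Gene database for `query`."""
    params = {"db": "gene", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_gene_query("p53", organism="human")
url = build_esearch_url(query)
print(query)
print(url)
```

Pasting the printed URL into a browser should return the matching Gene IDs; the same query string can be typed directly into the Gene search box.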

Search the Ensembl human genome with the query “p53”. How many results? Now, restrict the results to Genes and this should reduce the list to ~443 records. However, I did not find TP53 within the first few pages. Change the search to “TP53” restricted to human and Genes and it should come up as the top record.

Central to this course is dealing with lists of genes. For this reason, we will use the official gene symbols and specific database IDs. If you have to find the official gene symbol for more than about 10 genes, you will quickly see the value of using gene identifiers that are universally recognized. You will also learn to value literature that references genes by their official symbols. Unfortunately, this is not a universal practice.

Finding transcript information about a specific gene using NCBI & Ensembl

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several experimental techniques, including tissue-specific RNAseq, which provides direct support for expression of exons.

The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature identifies those sequences that are considered Reference: NG_ (genomic), NM_ (mRNA) and NP_ (protein). There is a PDF on the Exercise 1 homepage that describes all of the RefSeq nomenclature. Note that some records are listed as XM or XP, which indicates predicted transcripts or proteins with less or no experimental evidence for them.
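When working with a list of accessions, the prefix rules above are easy to apply in code. The following is a minimal sketch that covers only the five prefixes named in this paragraph; the function name classify_refseq is made up for illustration, and the numeric parts of the example accessions are arbitrary placeholders, since only the prefix matters here.

```python
# Prefixes named in this tutorial; the RefSeq nomenclature PDF lists the rest.
REFSEQ_PREFIXES = {
    "NG": ("genomic", "curated"),
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession, or None if unknown."""
    prefix, separator, _ = accession.strip().partition("_")
    if not separator:
        # No underscore, so this does not look like a RefSeq accession.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


# Placeholder accessions: the digits are arbitrary.
for accession in ["NM_0001", "NP_0001", "XM_0001", "ABC123"]:
    print(accession, classify_refseq(accession))
```

Returning None for anything unrecognized keeps the lookup honest: an unfamiliar prefix should be checked against the nomenclature PDF and not guessed.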

Ensembl gene annotation draws on two sources: the automated Ensembl gene-build pipeline and manual curation by the HAVANA group (historically displayed in the Vega browser). When the two are combined, the annotation is known as GENCODE. On the Gene specific pages, the transcripts are identified by whether they are protein coding or not. There is also a visual for splice variants that matches the known domains in the gene with the different transcripts. Ensembl also makes it easy to export an Excel-compatible transcript table and usually identifies which of its transcripts have a corresponding RefSeq transcript match.

a) Within the NCBI gene record for the TP53 gene there are 2 sections that provide transcript/protein information: "Genomic regions, transcripts and products" and "NCBI Reference Set".

Export a PDF from the Genomic regions section. Here, genes are color coded (green for protein coding, blue for non-coding). It also lists gene models (XR or XM). RefSeq transcripts/proteins starting with X represent computational models without experimental verification. An example is provided on the Exercise 1 homepage.

b) Within the Ensembl gene record for TP53, find the transcript table. Here you can export the entire table in CSV format and then import it into Excel. An example is provided on the Exercise 1 homepage.
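An exported CSV can also be read by a script instead of Excel. The sketch below counts transcripts by biotype and lists those with a RefSeq match. The column headers ("Transcript ID", "Biotype", "RefSeq Match"), the "-" marker for a missing match, and every row in the sample are assumptions made up for illustration; check the header row of your own export and adjust the names accordingly.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for an exported transcript table; all values are placeholders.
SAMPLE_CSV = """Transcript ID,Name,bp,Biotype,RefSeq Match
ENST_PLACEHOLDER_1,GENE-201,2500,Protein coding,NM_PLACEHOLDER_1
ENST_PLACEHOLDER_2,GENE-202,1800,Protein coding,-
ENST_PLACEHOLDER_3,GENE-203,900,Retained intron,-
"""


def load_transcripts(handle):
    """Read a transcript table into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))


def count_by_biotype(rows, column="Biotype"):
    """Count how many transcripts fall into each biotype."""
    return Counter(row[column] for row in rows)


def with_refseq_match(rows, column="RefSeq Match", missing="-"):
    """Return the IDs of transcripts whose RefSeq column is filled in."""
    return [
        row["Transcript ID"]
        for row in rows
        if row[column].strip() not in ("", missing)
    ]


rows = load_transcripts(io.StringIO(SAMPLE_CSV))
biotypes = count_by_biotype(rows)
matched = with_refseq_match(rows)
print(biotypes)
print(matched)
```

To use a real export, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="").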

NOTE: The Ensembl site generally makes it easier to deal with lists of genes (both importing and exporting). The NCBI site has better cross-database functionality and is better integrated with the literature.

You should note several things about these transcript searches:

1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene (~25,000) or transcript (~160,000) level.

2. The transcript variants differ between Ensembl and NCBI, though Ensembl kindly lists those that are in common between the two sites.

3. Ensembl makes it easy to distinguish between transcripts that are protein coding or not and also between transcripts with good experimental evidence versus computationally predicted transcripts.
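The gene-versus-transcript choice in point 1 comes down to whether transcript-level values are kept separate or collapsed per gene. As a rough sketch of the collapsing step, the example below sums counts over the transcripts of each gene. The transcript names, gene names, and counts are invented for illustration, and simple summation is only one of several possible ways to summarize.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene map and read counts, for illustration only.
transcript_to_gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}
transcript_counts = {"tx1": 120, "tx2": 30, "tx3": 45}


def summarize_by_gene(counts, tx_to_gene):
    """Sum transcript-level counts into one total per gene."""
    totals = defaultdict(int)
    for transcript, count in counts.items():
        totals[tx_to_gene[transcript]] += count
    return dict(totals)


gene_counts = summarize_by_gene(transcript_counts, transcript_to_gene)
print(gene_counts)
```

Note what is lost: after summing, GENE_A no longer shows that most of its signal came from one isoform.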

Exploring the genomic context of genes using Ensembl and UCSC Genome browser

The genomic context means where on the genome the gene is located. That is:

• Which chromosome
• Where on that chromosome
• What strand
• What genes are upstream/downstream

Genome browsers offer a way to visualize data that can be placed on a chromosome. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:

• Location of repetitive sequences
• Level of homology to other genomes
• SNPs or variants within the genome of interest
• TF binding sites

The data behind a genome browser is enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources. We will use both the UCSC and Ensembl genome browsers for this exercise. Both allow you to export images of the browser window and offer links to download sequence data.

Ensembl genome browser

To access the Ensembl genome browser, click on the Location tab (which should have a title: Location: 17:7,661,779-7,687,550). This indicates that this gene is located on Chromosome 17 between the coordinates 7,661,779-7,687,550. The first section shows a schematic of the chromosome with a red box around the coordinates of the gene (Fig. 1). If you click on the Assembly Exceptions link, you can turn off that track and are left with just the box highlighting the gene.

Figure 1: Chromosome ideogram of chr17 with the region for TP53 shown as a red box
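Location strings like the one in the tab title are convenient to parse when you need the region size or want to compare coordinates between browsers. The sketch below splits such a string into chromosome, start, and end and computes the length, assuming 1-based inclusive coordinates. The names Region and parse_location are made up for this example; an optional "chr" prefix is accepted so that UCSC-style strings parse the same way.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self):
        # Assumes 1-based inclusive coordinates, so both ends are counted.
        return self.end - self.start + 1


# Accepts "17:7,661,779-7,687,550" and "chr17:7661779-7687550" alike.
LOCATION_PATTERN = re.compile(r"^\s*(?:chr)?(\w+):([\d,]+)-([\d,]+)\s*$")


def parse_location(text):
    """Parse 'chromosome:start-end' into a Region, ignoring thousands commas."""
    match = LOCATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a location string: {text!r}")
    chromosome, start, end = match.groups()
    region = Region(chromosome, int(start.replace(",", "")), int(end.replace(",", "")))
    if region.start > region.end:
        raise ValueError(f"start is after end in {text!r}")
    return region


tp53 = parse_location("17:7,661,779-7,687,550")
print(tp53)
print(tp53.length)  # 25772 bases, i.e. roughly 25 Kb
```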

Scroll down to the next section and you’ll see the chromosome region in more detail, with the TP53 gene in the middle. This gives you an idea of the genomic context of the gene of interest.

Scroll down to the next section and this will display the 25 Kb region that encompasses the largest transcript isoform of the gene. You can see all the different splice variants. They are color coded by experimental support and whether they are protein coding or not. Click on one of the transcripts and it will open a pop-up window with additional details about that transcript. You can right-click on the links within the pop-up window to open up the link in a new tab or window. Click on the X to close the window.

Scroll down further and you will see additional tracks of information, such as SNP locations, associated phenotypes and %GC. These tracks can be expanded and turned on and off. It can take a while for the changes to be implemented, depending on how long a chromosomal region you are working with and how much data is in the track. If you scroll back to the top of this section, you can zoom in or out. Sometimes tracks won’t expand because you are viewing a large enough section that there will be too much information to display. If you tried expanding a track and nothing happened, try zooming in such that you are displaying <10 Kb of sequence. That will usually allow any track to be expanded. Figure 2 shows a portion of the TP53 transcript with an expanded track of SNPs.

Figure 2: Part of the TP53 transcript variants with expanded SNPs below.

Using the UCSC Genome browser

Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38. Or click the link and it will open a search window for the latest Human assembly as a default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 3.

The gene size and coordinates of where this gene falls on Chr17 should be very similar if not identical to the coordinates listed for the Ensembl browser. Scroll down through the graphics. Clicking on the graphic or on the name of the track will pop open a window with information about the track. Click on any single transcript to see details about the transcript.

A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):

1) What genes are located near my gene or may share promoters?
2) What SNPs are found in my gene and are they located in introns, promoters or exons?
3) What strand is my gene encoded on?
4) What regulatory elements are located within or near my gene?
5) What clinical variants are associated with my gene?
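Question 2 is, at its core, an interval lookup: given a variant position and the exon coordinates of a transcript, decide where the position falls. The sketch below shows one simple way to do this. All coordinates are invented for illustration and do not describe TP53; the function name classify_position is hypothetical, and promoters are not handled because that would require choosing an upstream window and accounting for strand.

```python
def classify_position(position, gene_start, gene_end, exons):
    """Label a position as 'exon', 'intron', or 'outside gene'.

    Coordinates are assumed to be 1-based and inclusive; `exons` is a list
    of (start, end) pairs that lie within the gene.
    """
    if position < gene_start or position > gene_end:
        return "outside gene"
    for exon_start, exon_end in exons:
        if exon_start <= position <= exon_end:
            return "exon"
    # Inside the gene but not in any exon.
    return "intron"


# Hypothetical gene model: three exons within a gene spanning 1000-5000.
GENE_START, GENE_END = 1000, 5000
EXONS = [(1000, 1200), (2500, 2700), (4800, 5000)]

for snp_position in [1100, 2000, 4800, 6000]:
    label = classify_position(snp_position, GENE_START, GENE_END, EXONS)
    print(snp_position, label)
```

The browser answers the same question visually; a script like this becomes useful once there are more variants than can be checked by eye.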

Spend some time exploring the tracks and looking up what they represent and how the data is presented. You may find some of the information pertinent to your research project.

Figure 3: UCSC view of TP53