GDC API Tutorial (July 2017)

Author Info
Amy Olex
Bioinformatics Specialist
Wright Center for Clinical and Translational Research
alolex@vcu.edu 804-828-1621

What is an API?

An API is an Application Programming Interface. APIs are meant to help programs, including graphical user interfaces, interact with a software application. A web API allows other websites or programs to interact with it or with the data it stores. While APIs are generally meant for programs and software to use, you can manually build your own API commands to interact with the system. You would want to do this if the interface that is already implemented (here, the GDC Data Portal) doesn't offer the query or functionality that you need (here, bulk download of file metadata).

You can submit API queries or commands through your Internet browser or by using terminal commands like "curl" on Linux/Mac platforms. Commands have to be formatted so that the receiving server can interpret them. This format is called JSON (JavaScript Object Notation); a variant suitable for URLs is called percent-encoded JSON. When using terminal commands like "curl", your JSON-formatted query is saved in a "payload" file, which is used by the curl command. If you are submitting your API query through a web browser, you will need to percent-encode your JSON query and paste it into the URL bar. For this workshop we will be using the percent-encoded JSON format.
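As a rough illustration of the two submission styles described above, the sketch below prepares the same query once as a JSON payload (the form curl would read from a payload file) and once as a percent-encoded URL for the browser. It only builds the requests and sends nothing. The variable names are made up for this example, and the assumption that the "files" endpoint accepts a POSTed JSON body with a Content-Type of application/json should be checked against the GDC API Manual.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_URL = "https://api.gdc.cancer.gov/files"

# A small filter: keep records whose data type equals the given value.
query_filter = {
    "op": "=",
    "content": {
        "field": "files.data_type",
        "value": "Gene Expression Quantification",
    },
}

# Style 1: a JSON "payload", as curl would read it from a payload file.
# (Assumption: the endpoint accepts these keys in a POSTed JSON body.)
payload = json.dumps({"filters": query_filter, "format": "tsv", "size": "10"})
post_request = Request(
    API_URL,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Style 2: percent-encoded JSON pasted into the browser's URL bar.
compact_json = json.dumps(query_filter, separators=(",", ":"))
browser_url = f"{API_URL}?filters={quote(compact_json, safe='')}&format=tsv&size=10"

print(post_request.get_method(), post_request.full_url)
print(browser_url)
```

Both forms carry the same filter; only the packaging differs.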

Why use the API? Are there any other ways of getting this data?

Aside from the GDC Data Portal interface, the API, and the Data Transfer Tool, there are a few other ways of interacting with the GDC to extract the newly harmonized data. These include several R packages that interface with the GDC to extract and format data for you. The most up-to-date R packages are TCGABiolinks/TCGABiolinksGUI and GenomicDataCommons. These are currently the only packages that directly interface with the GDC and can extract harmonized data. Other packages like FireBrowseR, cgdsr (cBioPortal), and TCGAAssembler still work and are maintained, but interface with their own copy of the legacy TCGA data. Older packages, including RTCGAToolbox and TCGA2STAT, are not being maintained and may no longer work.

If you know R, then TCGABiolinks is a great package that can download, format, and process GDC data in R. It also has access to the legacy TCGA data. If you are looking for clinical data, expression data, or similar patient-oriented data, I highly recommend using TCGABiolinks. If you have the most up-to-date version of R (3.4.0 at the time of writing) you can also install TCGABiolinksGUI, which provides a user interface and requires no R programming knowledge to download, process, and perform some advanced analyses on TCGA data, including survival analyses and integrating gene expression and variant data. The one issue with TCGABiolinksGUI is that it is buggy, and I have been unable to get it to function correctly at the moment. The developers are working on the issues, so if you run into any, feel free to submit an issue on their GitHub page. If you are looking for technical metadata for your files, for example the sequencing center or library prep kit associated with a sequencing file, then you will not be able to use TCGABiolinks or the GUI.

The R GenomicDataCommons package provides an R interface to the GDC API. While this sounds promising, it is not easy to use. Building the right queries is complex, and the data returned is in a multi-nested list form, which is not easily analyzed or understood. Additionally, it does not seem to be able to download the requested file metadata such as sequencing center and library prep kit (it might be able to do this, but I just wasn't able to get it to work).

To be fair, the API itself is not straightforward to use; however, if you need to download harmonized file metadata in bulk, it is currently your only option. In this tutorial we will use the GDC API plus a few auxiliary tools to obtain a list of VCF files, each with the BAM file it was generated from, the exome capture kit associated with the sample, and some additional metadata.

Why do we need this metadata?

The GDC harmonizes all their files as much as possible to remove batch effects; however, it is impossible to eliminate batch effects arising from differences in sample handling and processing prior to sequencing. These effects remain and need to be considered and accounted for in any global analysis of TCGA data. Thus, it is important to have access to the technical metadata.

Scenario

This tutorial will walk you through identifying metadata for GDC sequencing files with the following scenario. You have the following 3 (or 300!) whole exome sequencing BAM file IDs:

• 0f792b53-5c12-487d-8229-187e5f8c0148
• e3e53764-dd65-4e9b-8e96-5ae043b03401
• 85b9ee03-9e87-4d69-81ca-b1d6b24a8762

For each file you want to know the GDC project ID, the sample type (tumor, normal, etc.), the exome capture kit name, and the names of the downstream analysis VCF files that are available. Using the GDC Data Portal you would need to manually look at each BAM file page and record this information (feel free to poke around and try this). This is not feasible for many files, so you decide to use the API to obtain this information in bulk.

GDC API Help

GDC API Manual: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/
Email the GDC Help Desk: [email protected]
They are very helpful with everything!


Required Auxiliary Tools

The following are the auxiliary tools we will use to help us format our API commands correctly. These are all online tools, so no installation is needed! Note that the GDC API Manual also lists some online JSON editing tools, but I prefer the ones listed below. Open all the tools so they are ready before starting the tutorial.

• Internet browser (Google Chrome recommended)
• URL encoding: http://www.url-encode-decode.com
• JSON editor: http://www.cleancss.com/json-editor
• Notepad or a plain text editor
• Excel worksheet
• The accompanying "copy-paste" tutorial document


Section 1) Anatomy of an API Call

There are different parts to an API call (command). Each part can be modified to change which parts of the GDC database are queried and what data is returned. Here is an example API call from the GDC API Manual. You can copy and paste it into your browser's URL bar (see the copy-paste document distributed with this tutorial, as you can't copy and paste from a PDF). Press ENTER to see the results.

Don't be intimidated! Let's break it down….

https://api.gdc.cancer.gov/files?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.submitter_id%22%2C%22value%22%3A%5B%22TCGA-CK-4948%22%2C%22TCGA-D1-A17N%22%2C%22TCGA-4V-A9QX%22%2C%22TCGA-4V-A9QM%22%5D%7D%7D%2C%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%22Gene%20Expression%20Quantification%22%7D%7D%5D%7D&format=tsv&fields=file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,analysis.workflow_type,cases.project.project_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id&size=10

The same call, split into its parts:

https://api.gdc.cancer.gov/files
    The website to send the API command to (api.gdc.cancer.gov) followed by the GDC endpoint "files" to query. Endpoints are explained in the next section. A "?" separates the website address from the query commands.

filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.submitter_id%22%2C%22value%22%3A%5B%22TCGA-CK-4948%22%2C%22TCGA-D1-A17N%22%2C%22TCGA-4V-A9QX%22%2C%22TCGA-4V-A9QM%22%5D%7D%7D%2C%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%22Gene%20Expression%20Quantification%22%7D%7D%5D%7D
    Specifies how you want the records filtered, in percent-encoded JSON format. We have a tool to do this part! All query command components are separated by an "&".

format=tsv
    The format you want your results returned in. "tsv" is "tab-separated values" and can be copied and pasted into Excel.

fields=file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,analysis.workflow_type,cases.project.project_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id
    A comma-separated list of the GDC fields that you want to see in your results. We will use an API command to find which fields are available.

size=10
    The maximum number of records to return. If you don't know how many there are, you can set this to a really high number. For this assignment we will only return up to the first 10.
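The same breakdown can be done mechanically. The sketch below uses only the Python standard library to split a shortened call of the same shape into its endpoint and query components, undoing the percent-encoding of the filter along the way. The shortened URL (one filter clause, two fields) is an illustrative assumption, not a call taken from the GDC API Manual.

```python
import json
from urllib.parse import urlsplit, parse_qs

# A shortened call with the same anatomy as the example in this section.
url = (
    "https://api.gdc.cancer.gov/files"
    "?filters=%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A%22"
    "files.data_type%22%2C%22value%22%3A%22Gene%20Expression%20Quantification%22%7D%7D"
    "&format=tsv&fields=file_id,file_name&size=10"
)

parts = urlsplit(url)            # separates the address from the query after "?"
params = parse_qs(parts.query)   # splits on "&" and undoes the percent-encoding

endpoint = parts.path.lstrip("/")
filters = json.loads(params["filters"][0])
fields = params["fields"][0].split(",")

print("website :", parts.netloc)
print("endpoint:", endpoint)
print("filters :", json.dumps(filters, indent=2))
print("format  :", params["format"][0])
print("fields  :", fields)
print("size    :", params["size"][0])
```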


2) GDC "Endpoints"

For more information on endpoints see the GDC API Manual: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#api-endpoints

There are 4 GDC endpoints used for searching metadata:

1. files – searches all files at the file level; each record returned represents a single file.
2. cases – searches based on patients; each record returned represents a single patient.
3. projects – searches at the project level; each record represents a project.
4. annotations – searches annotations added to the data after curation.

Endpoints define the viewpoint of the search and the type of information returned. For example, if you query the "files" endpoint and want to return all the file IDs associated with a particular case, you would use the field name "file_id", and one record would be returned for each file. However, if you search from the "cases" endpoint you have to prefix the "file_id" field name with the endpoint it belongs to (i.e. "files.file_id"). This would return one record for each case, listing ALL file IDs associated with that case on one row.

[Figure: the four GDC endpoints: Projects, Cases, Files, Annotations]

Test it out: copy and paste the entire API call from Section 1 into your web browser and press "enter". Copy the results into an Excel workbook and count the number of rows you have. Now use the same query, only replace the "files" endpoint with "cases".

Task 1) How many records were returned for "files" and for "cases"?
Answer: You should have gotten 10 records (rows) for the "files" endpoint and 4 records for the "cases" endpoint.

Task 2) Why are there more records returned when using the "files" endpoint? Hint: compare the columns named "cases_0_case_id" and "file_id" from the "files" query and the column named "case_id" from the "cases" query.
Answer: There are multiple files per case. The "files" endpoint returned more because it listed each file as a record. "cases" returned fewer because it listed each case as a record with multiple associated files.

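Field names as written from each endpoint (labels from the figure):

Querying endpoint    File ID field    Case ID field    Project ID field
files                file_id          cases.case_id    projects.project_id
cases                files.file_id    case_id          projects.project_id
projects             files.file_id    cases.case_id    project_id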
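To make the endpoint swap from Task 1 and the prefixing rule concrete, here is a small sketch. The helper names with_endpoint and field_for are hypothetical, and field_for encodes only the simplified rule described in this section (a field is unprefixed at its own endpoint and prefixed elsewhere); real GDC field paths can be longer, such as cases.project.project_id in Section 3.

```python
from urllib.parse import urlsplit, urlunsplit


def with_endpoint(url: str, endpoint: str) -> str:
    """Return the same API call aimed at a different endpoint."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=f"/{endpoint}"))


def field_for(querying_endpoint: str, owner_endpoint: str, name: str) -> str:
    """Write a field name as seen from the endpoint being queried.

    Simplified rule from this section: no prefix at the field's own
    endpoint, otherwise prefix with the endpoint that owns the field.
    """
    if querying_endpoint == owner_endpoint:
        return name
    return f"{owner_endpoint}.{name}"


files_call = "https://api.gdc.cancer.gov/files?format=tsv&size=10"
print(with_endpoint(files_call, "cases"))
print(field_for("files", "files", "file_id"))   # file_id
print(field_for("cases", "files", "file_id"))   # files.file_id
```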


3) Finding Available Fields

In order to build a filter and designate specific fields to return, you have to know what fields are present in the database and at each endpoint. To get the available fields for an endpoint, enter the following URL into the browser and hit "enter" (copy from the "copy-paste" document):

https://api.gdc.cancer.gov/files/_mapping

You should get a long list of comma-separated entries formatted like the following:

…
"files.analysis.analysis_id": {
    "description": "",
    "doc_type": "files",
    "field": "analysis.analysis_id",
    "full": "files.analysis.analysis_id",
    "type": "string"
},
"files.analysis.analysis_type": {
    "description": "",
    "doc_type": "files",
    "field": "analysis.analysis_type",
    "full": "files.analysis.analysis_type",
    "type": "string"
},
…

If querying the "files" endpoint, then a field is referenced by the "field" name (shown in blue in the original handout). Using the "full" field name will return no results. Only use the full field name if you are querying a different endpoint. Some fields will have a description, which can help you find what you are looking for (see figure from Section 2).

Test it out: copy and paste the mapping URL above into your browser (copy from the "copy-paste" document).

Task 3) Try the mapping URL for "files", "cases", and "projects". What is the result?
Answer: You should notice that each endpoint returns a different set of available fields.

The fields that contain the information we want are under the "files" endpoint. See if you can locate the following:

• file_id
• file_name
• downstream_analyses.output_files.file_name
• analysis.metadata.read_groups.target_capture_kit_name
• cases.samples.sample_type
• cases.project.project_id
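Scrolling through the mapping by eye is tedious, so a keyword search can help. The sketch below searches entries shaped like the excerpt shown in this section and returns their "field" names. It runs on the two excerpt entries only; the function name find_fields is hypothetical, and where exactly such entries sit inside the full _mapping response is an assumption you would need to check against the actual output.

```python
import json

# The two entries from the excerpt in this section, as a JSON object.
mapping_excerpt = json.loads("""
{
  "files.analysis.analysis_id": {
    "description": "", "doc_type": "files",
    "field": "analysis.analysis_id",
    "full": "files.analysis.analysis_id", "type": "string"
  },
  "files.analysis.analysis_type": {
    "description": "", "doc_type": "files",
    "field": "analysis.analysis_type",
    "full": "files.analysis.analysis_type", "type": "string"
  }
}
""")


def find_fields(mapping: dict, keyword: str) -> list:
    """Return the short "field" names whose full name contains keyword."""
    return sorted(
        entry["field"]
        for entry in mapping.values()
        if keyword in entry["full"]
    )


print(find_fields(mapping_excerpt, "analysis_type"))
```

With the full mapping loaded, searching for a keyword such as "capture_kit" or "sample_type" would be one way to locate the fields listed above.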


4) Building a Filter

The filter is built using JSON format. Open the JSON editor (http://www.cleancss.com/json-editor). Then delete everything in the left pane and replace it with the following; all brackets are required (copy from the "copy-paste" document):

{
    "op":"in",
    "content":{
        "field":"entity_id",
        "value":[
            "e0d36cc0-652c",
            "25ebc29a-7598",
            "fe660d7c-2746"
        ]
    }
}

The "op":"in" tells the server what type of filter this is. In this case it says to return any record that has an "entity_id" IN the list given under the "value" line. The "in" could also be changed to other operators like "=", but we won't be doing that in this tutorial.

Note: JSON is very picky about the exact characters used for brackets and quotation marks. Editing the JSON in a text editor can change these characters without your knowledge (for example, by substituting curly "smart" quotes for straight ones) and break an otherwise correct query.

BAM File IDs:
0f792b53-5c12-487d-8229-187e5f8c0148
e3e53764-dd65-4e9b-8e96-5ae043b03401
85b9ee03-9e87-4d69-81ca-b1d6b24a8762
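The note above about characters being changed silently is easy to check. The sketch below parses the same tiny filter twice, once with straight quotes and once with curly quotes such as a word processor might substitute. The helper name is_valid_json is hypothetical, and curly quotes are only one assumed example of the kind of change the note warns about.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


straight = '{"op": "in"}'
# Same text with curly quotes (U+201C and U+201D) in place of straight ones.
curly = "{\u201cop\u201d: \u201cin\u201d}"

print(is_valid_json(straight))
print(is_valid_json(curly))
```

Running a filter through a check like this before encoding it can catch a damaged quote or bracket early.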

Try it out: copy and paste the JSON filter above from the "copy-paste" document into the JSON editor.

Task 4) Edit the filter to use the file ID field ("files.file_id"), and replace the list of entity_id values with the BAM file IDs listed above. You do not need to upload any files or push any buttons on the editor web page. Just edit the text and copy it to your clipboard. The code should look like the following:

{"op":"in","content":{"field":"files.file_id","value":["0f792b53-5c12-487d-8229-187e5f8c0148","e3e53764-dd65-4e9b-8e96-5ae043b03401","85b9ee03-9e87-4d69-81ca-b1d6b24a8762"]}}

Next, open the URL encoder/decoder (http://www.url-encode-decode.com).

Task 5) Copy and paste your JSON filter into the left pane of the URL encoder/decoder, then press the "encode" button. Copy the result to your clipboard. You should see something like this (line breaks and indentation in your JSON are encoded as "%0D%0A" and "+"):

%7B%0D%0A+++%22op%22%3A%22in%22%2C%0D%0A+++%22content%22%3A%7B%0D%0A++++++%22field%22%3A%22files.file_id%22%2C%0D%0A++++++%22value%22%3A%5B%0D%0A++++++++++%220f792b53-5c12-487d-8229-187e5f8c0148%22%2C%0D%0A++++++++++%22e3e53764-dd65-4e9b-8e96-5ae043b03401%22%2C%0D%0A++++++++++%2285b9ee03-9e87-4d69-81ca-b1d6b24a8762%22%0D%0A%0D%0A++++++%5D%0D%0A+++%7D%0D%0A%7D
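The filter from Tasks 4 and 5 can also be built and encoded programmatically, which scales better to 300 IDs than hand editing. The following is a minimal sketch using the Python standard library; in_filter and decode_filter are hypothetical helper names. It produces a compact encoding without the "%0D%0A" and "+" sequences, which differs character for character from the web tool's output but decodes to the same JSON.

```python
import json
from urllib.parse import quote, unquote_plus

bam_file_ids = [
    "0f792b53-5c12-487d-8229-187e5f8c0148",
    "e3e53764-dd65-4e9b-8e96-5ae043b03401",
    "85b9ee03-9e87-4d69-81ca-b1d6b24a8762",
]


def in_filter(field: str, values: list) -> dict:
    """Build an "in" filter: match records whose field is in values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def decode_filter(encoded: str) -> dict:
    """Undo percent-encoding ("+" counts as a space) and parse the JSON."""
    return json.loads(unquote_plus(encoded))


bam_filter = in_filter("files.file_id", bam_file_ids)

# Compact JSON (no spaces or line breaks), then percent-encode everything.
encoded = quote(json.dumps(bam_filter, separators=(",", ":")), safe="")
print(encoded)

# A web-tool style string with encoded line breaks decodes the same way.
tool_style = "%7B%0D%0A+++%22op%22%3A%22in%22%0D%0A%7D"
print(decode_filter(tool_style))
```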


5) Specifying additional options

So far we have talked about filters and fields, which are parts of the API command that come after the URL and the question mark "?". There are additional options you can set, like the format and size of the results. The GDC API Manual lists the available options here: https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#request-parameters

6) Assembling the API Command

You now have all the parts to assemble your custom API call. Copy the template below into a text editor and fill in the missing pieces, indicated by <<missing>> tags, that you obtained in steps 3 and 4. The color coding in the original handout is for ease of reading only.


https://api.gdc.cancer.gov/files?filters=<<missing>>&format=tsv&fields=<<missing>>&size=10

Try it out: copy and paste the above text (from the "copy-paste" document) into a text editor.

Task 6) Copy the fields from Section 3 into the fields=<<missing>> space (orange in the original handout), separated by commas with no spaces. Then copy your encoded filter from Section 4 (Task 5) into the filters=<<missing>> space (green in the original handout). You should get a URL that looks like the following:

https://api.gdc.cancer.gov/files?filters=%7B%0D%0A+++%22op%22%3A%22in%22%2C%0D%0A+++%22content%22%3A%7B%0D%0A++++++%22field%22%3A%22files.file_id%22%2C%0D%0A++++++%22value%22%3A%5B%0D%0A++++++++++%220f792b53-5c12-487d-8229-187e5f8c0148%22%2C%0D%0A++++++++++%22e3e53764-dd65-4e9b-8e96-5ae043b03401%22%2C%0D%0A++++++++++%2285b9ee03-9e87-4d69-81ca-b1d6b24a8762%22%0D%0A%0D%0A++++++%5D%0D%0A+++%7D%0D%0A%7D&format=tsv&fields=file_id,file_name,downstream_analyses.output_files.file_name,analysis.metadata.read_groups.target_capture_kit_name,cases.samples.sample_type,cases.project.project_id&size=10

Copy this URL into your web browser and press "enter".
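Filling in the template can be sketched in a few lines of Python. The function below joins the field list with commas, percent-encodes a compact version of the filter, and drops both into the same template. The function name build_gdc_url and its default arguments are assumptions made for this example; the resulting URL is equivalent to the one above but not identical, because the compact JSON contains no encoded line breaks.

```python
import json
from urllib.parse import quote


def build_gdc_url(endpoint, filters, fields, result_format="tsv", size=10):
    """Fill in the template: endpoint ? filters & format & fields & size."""
    encoded_filter = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return (
        f"https://api.gdc.cancer.gov/{endpoint}"
        f"?filters={encoded_filter}"
        f"&format={result_format}"
        f"&fields={','.join(fields)}"
        f"&size={size}"
    )


bam_filter = {
    "op": "in",
    "content": {
        "field": "files.file_id",
        "value": [
            "0f792b53-5c12-487d-8229-187e5f8c0148",
            "e3e53764-dd65-4e9b-8e96-5ae043b03401",
            "85b9ee03-9e87-4d69-81ca-b1d6b24a8762",
        ],
    },
}

wanted_fields = [
    "file_id",
    "file_name",
    "downstream_analyses.output_files.file_name",
    "analysis.metadata.read_groups.target_capture_kit_name",
    "cases.samples.sample_type",
    "cases.project.project_id",
]

api_call = build_gdc_url("files", bam_filter, wanted_fields)
print(api_call)
```

Keeping the fields in a list makes it harder to introduce the stray spaces that Task 6 warns against.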


Try it out: copy and paste your API call into the URL bar of your browser (make sure the URL bar is completely empty first). Copy the results into an Excel worksheet and answer the following questions.

Warning: The columns are returned in a random order. You can sort them in Excel or just scroll through to look at the content.

Task 7) Which file is from "Primary Tumor"? Provide the file ID.
Task 8) What is the Project ID for the file "e3e53764-dd65-4e9b-8e96-5ae043b03401"?
Task 9) Which exome capture kit was used on the sample that was used to produce the VCF file named "69642ea0-a20d-4ff3-96b6-4a9a3ff76058.vcf"?

Answers:
Task 7) 0f792b53-5c12-487d-8229-187e5f8c0148
Task 8) TCGA-UCS
Task 9) hg18 nimblegen exome version 2
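Because the columns come back in no particular order, it is safer to look values up by column name than by position. The sketch below does this with Python's csv module on a tiny made-up table. Every value in sample_tsv is a placeholder, not real GDC output, and the column names are assumptions patterned after the "cases_0_case_id" column mentioned in Task 2; check the header row of your own results for the actual names.

```python
import csv
import io

# Placeholder rows (NOT real GDC results); note the arbitrary column order.
sample_tsv = (
    "cases_0_samples_0_sample_type\tfile_id\tcases_0_project_project_id\n"
    "Primary Tumor\tfile-A\tPROJECT-X\n"
    "Blood Derived Normal\tfile-B\tPROJECT-Y\n"
)


def rows_where(tsv_text: str, column: str, value: str) -> list:
    """Return the rows (as dicts keyed by column name) where column == value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row.get(column) == value]


# Which file comes from a primary tumor sample?
tumor_rows = rows_where(sample_tsv, "cases_0_samples_0_sample_type", "Primary Tumor")
print([row["file_id"] for row in tumor_rows])

# Which project does a given file belong to?
project_rows = rows_where(sample_tsv, "file_id", "file-B")
print(project_rows[0]["cases_0_project_project_id"])
```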