GDC API Tutorial, July 2017


Author Info
Amy Olex
Bioinformatics Specialist
Wright Center for Clinical and Translational Research
alolex@vcu.edu
804-828-1621

What is an API?

An API is an Application Programming Interface. APIs let programs, including graphical user interfaces, interact with a software application. A web API allows other websites or programs to interact with a site or the data it stores. While APIs are generally meant for programs and software to use, you can manually build your own API commands to interact with the system. You would want to do this if the interface that is already implemented (here, the GDC Data Portal) doesn’t provide the query or functionality that you need (here, bulk download of file metadata).

You can submit API queries or commands through your Internet browser or by using terminal commands like “curl” on Linux/Mac platforms. Commands have to be formatted so that the receiving server can interpret them. The GDC API uses JSON (JavaScript Object Notation) for this; a variant that is safe to place in a URL is called percent-encoded JSON. When using terminal commands like “curl”, your JSON-formatted query is saved in a “payload” file, which the curl command sends to the server. If you are submitting your API query through a web browser, you need to percent-encode your JSON query and paste it into the URL bar. For this workshop we will be using the percent-encoded JSON format.
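To make "percent-encoded JSON" concrete, here is a minimal Python sketch (Python is not part of the original workshop, which uses online tools instead). It encodes one small filter clause taken from the example API call in Section 1 and then decodes it again. The variable names are illustrative only.

```python
import json
from urllib.parse import quote, unquote

# One filter clause, written as ordinary JSON (here as a Python dict).
query = {
    "op": "=",
    "content": {
        "field": "files.data_type",
        "value": "Gene Expression Quantification",
    },
}

# Serialize without extra whitespace so the encoded text stays short.
compact = json.dumps(query, separators=(",", ":"))

# Percent-encode every character that is not safe inside a URL.
encoded = quote(compact, safe="")
print(encoded)

# Decoding gives back exactly the JSON we started with.
assert json.loads(unquote(encoded)) == query
```

Braces, quotation marks, colons, commas, and spaces become %7B/%7D, %22, %3A, %2C, and %20, which is why the encoded form is hard to read but safe to paste into a URL bar.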

Why use the API? Are there any other ways of getting this data?

Aside from the GDC Data Portal interface, API, and Data Transfer Tool, there are a few other ways of interacting with the GDC to extract the newly harmonized data. These include several R packages that interface with the GDC to extract and format data for you. The most up-to-date R packages are TCGABiolinks/TCGABiolinksGUI and GenomicDataCommons. These are currently the only packages that directly interface with the GDC and can extract harmonized data. Other packages like FireBrowseR, cgdsr (cBioPortal), and TCGA Assembler still work and are maintained, but interface with their own copy of the legacy TCGA data. Older packages, including RTCGAToolbox and TCGA2STAT, are not being maintained and may no longer work.

If you know R, then TCGABiolinks is a great package that can download, format, and process GDC data in R. It also has access to the legacy TCGA data. If you are looking for clinical data, expression data, or similar patient-oriented data, I highly recommend using TCGABiolinks. If you have the most up-to-date version of R (3.4.0) you can also install TCGABiolinksGUI, which provides a user interface and requires no R programming knowledge to download, process, and perform some advanced analyses on TCGA data, including survival analyses and integrating gene expression and variant data. The one issue with TCGABiolinksGUI is that it is buggy, and I have been unable to get it to function correctly at the moment. They are working on the issues, so if you run into any, feel free to submit an issue on their GitHub page. If you are looking for technical metadata for your files, for example the sequencing center or library prep kit associated with a sequencing file, then you will not be able to use TCGABiolinks or the GUI.

The R GenomicDataCommons package provides an R interface to the GDC API. While this sounds promising, it is not easy to use. Building the right queries is complex, and the data returned is in a multi-nested list form, which is not easily analyzed or understood. Additionally, it does not seem to be able to download the requested file metadata such as sequencing center and library prep kit (it might be able to do this, but I just wasn’t able to get it to work). To be fair, the API itself is not straightforward to use; however, if you need to download harmonized file metadata in bulk, it is currently your only option.

In this tutorial we will use the GDC API plus a few auxiliary tools to obtain a list of VCF files with the BAM file that each was generated from, the exome capture kit associated with the sample, and some additional metadata information.

Why do we need this metadata?

The GDC harmonizes all their files as much as possible to remove batch effects; however, it is impossible to eliminate batch effects arising from differences in sample handling and processing prior to sequencing. These effects still remain and need to be considered and accounted for in any global analysis of TCGA data. Thus, it is important to have access to the technical metadata.

Scenario

This tutorial will walk you through identifying metadata for GDC sequencing files with the following scenario:

You have the following 3 (or 300!) whole exome sequencing BAM file IDs:

• 0f792b53-5c12-487d-8229-187e5f8c0148
• e3e53764-dd65-4e9b-8e96-5ae043b03401
• 85b9ee03-9e87-4d69-81ca-b1d6b24a8762

For each file you want to know the GDC project ID, sample type (tumor, normal, etc.), exome capture kit name, and the names of the downstream analysis VCF files that are available. Using the GDC Data Portal you would need to manually look at each BAM file page and record this information (feel free to poke around and try this). This is not feasible, so you decide to use the API to obtain this information in bulk.

GDC API Help

GDC API Manual: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/
Email the GDC Help Desk: [email protected]
They are very helpful with everything!

Required Auxiliary Tools

The following are the auxiliary tools we will use to help us format our API commands correctly. These are all online tools, so no installation needed! Note that the GDC API Manual also lists some online JSON editing tools, but I prefer the ones listed below. Open up all tools to have them ready prior to starting the tutorial.

• Internet Browser (Google Chrome recommended)
• URL Encoding: http://www.url-encode-decode.com
• JSON Editor: http://www.cleancss.com/json-editor
• Notepad or a plain text editor.
• Excel Worksheet
• The accompanying “copy-paste” tutorial document.

Section 1) Anatomy of an API Call

There are different parts to an API call (command). Each part can be modified to change which parts of the GDC database are queried and what data is returned. Here is an example API call from the GDC API Manual. You can copy and paste it into your browser’s URL bar (see the copy-paste document distributed with this tutorial, as you can’t copy and paste from a PDF). Press ENTER to see the results.

https://api.gdc.cancer.gov/files?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.submitter_id%22%2C%22value%22%3A%5B%22TCGA-CK-4948%22%2C%22TCGA-D1-A17N%22%2C%22TCGA-4V-A9QX%22%2C%22TCGA-4V-A9QM%22%5D%7D%7D%2C%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%22Gene%20Expression%20Quantification%22%7D%7D%5D%7D&format=tsv&fields=file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,analysis.workflow_type,cases.project.project_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id&size=10

Don’t be intimidated! Let’s break it down….

https://api.gdc.cancer.gov/files ?
The website to send the API command to (api.gdc.cancer.gov) followed by the GDC endpoint “files” to query. Endpoints are explained in the next section. A “?” separates the website address from the query commands.

filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.submitter_id%22%2C%22value%22%3A%5B%22TCGA-CK-4948%22%2C%22TCGA-D1-A17N%22%2C%22TCGA-4V-A9QX%22%2C%22TCGA-4V-A9QM%22%5D%7D%7D%2C%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%22Gene%20Expression%20Quantification%22%7D%7D%5D%7D &
Specifies how you want the records filtered, in percent-encoded JSON format. We have a tool to do this part! All query command components are separated by an “&”.

format=tsv &
The format you want your results returned in. “tsv” is “tab-separated value” and can be copied and pasted into Excel.

fields=file_id,file_name,cases.submitter_id,cases.case_id,data_category,data_type,cases.samples.tumor_descriptor,cases.samples.tissue_type,cases.samples.sample_type,cases.samples.submitter_id,cases.samples.sample_id,analysis.workflow_type,cases.project.project_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.portions.analytes.aliquots.submitter_id &
A list of GDC fields that you want to see in your results, separated by commas. We will use an API command to find which fields are available.

size=10
The maximum number of records to return. If you don’t know how many there are, you can set this to a really high number. For this assignment we will only return up to the first 10.
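The same breakdown can be done by a program. The sketch below uses Python's standard library to split an API call into the parts described above. To keep it short, the URL is a shortened variant of the example (one filter clause and two fields), which is my own simplification rather than a call from the GDC API Manual.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Shortened variant of the Section 1 example (an assumption for illustration).
url = (
    "https://api.gdc.cancer.gov/files"
    "?filters=%7B%22op%22%3A%22%3D%22%2C%22content%22%3A%7B%22field%22%3A"
    "%22files.data_type%22%2C%22value%22%3A"
    "%22Gene%20Expression%20Quantification%22%7D%7D"
    "&format=tsv"
    "&fields=file_id,file_name"
    "&size=10"
)

parts = urlsplit(url)
endpoint = parts.path.lstrip("/")      # text between the host and the "?"
params = parse_qs(parts.query)         # splits on "&" and percent-decodes

filters = json.loads(params["filters"][0])   # decoded filter is plain JSON
fields = params["fields"][0].split(",")      # comma-separated field list
size = int(params["size"][0])

print("host:    ", parts.netloc)
print("endpoint:", endpoint)
print("filters: ", filters)
print("format:  ", params["format"][0])
print("fields:  ", fields)
print("size:    ", size)
```

Decoding an unfamiliar API call this way is a quick check that the filter says what you think it says before you send it.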

2) GDC “Endpoints”

For more information on endpoints see the GDC API Manual: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#api-endpoints

There are 4 GDC endpoints used for searching metadata:

1. files – searches all files at the file level; each record returned represents a single file.
2. cases – searches based on patients; each record returned represents a single patient.
3. projects – searches at the project level; each record represents a project.
4. annotations – searches annotations added to the data after curation.

Endpoints define the viewpoint of the search and the type of information returned. For example, if you query the “files” endpoint and want to return all the file IDs associated with a particular case, you would use the field name “file_id” and one record would be returned for each file. However, if you search from the “cases” endpoint you have to prefix the “file_id” field name with the endpoint (i.e. “files.file_id”). This would return one record for each case that lists ALL file IDs associated with that case on one row.
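The prefixing rule above can be written as a one-line function. This is a simplified sketch of the rule as stated in this tutorial, not a description of how the GDC resolves names: the function name is hypothetical, and real GDC field paths can be nested more deeply (the example call in Section 1 uses “cases.project.project_id”, for instance).

```python
def field_as_seen_from(field: str, owner: str, endpoint: str) -> str:
    """Return the name to use for `field` when querying `endpoint`.

    `owner` is the endpoint the field belongs to. A field is written bare
    from its own endpoint and prefixed with its owner from any other one.
    Simplified: ignores deeper nesting such as cases.project.project_id.
    """
    return field if owner == endpoint else f"{owner}.{field}"


# file_id belongs to "files": bare from files, prefixed from cases.
print(field_as_seen_from("file_id", owner="files", endpoint="files"))
print(field_as_seen_from("file_id", owner="files", endpoint="cases"))
```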

[Figure: the GDC endpoints Projects, Cases, Files, and Annotations]

Test it out: copy and paste the entire API call from Section 1 into your web browser and press “enter”. Copy the results into an Excel workbook and count the number of rows you have. Now use the same query, only replace the “files” endpoint with “cases”.

Task 1) How many records were returned for “files” and for “cases”?
You should have gotten 10 records (rows) for the “files” endpoint and 4 records for the “cases” endpoint.

Task 2) Why are there more records returned when using the “files” endpoint? Hint: compare the columns named “cases_0_case_id” and “file_id” from the “files” query and the column named “case_id” from the “cases” query.
There are multiple files per case. The “files” endpoint returned more because it listed each file as a record. The “cases” endpoint returned fewer because it listed each case as a record with multiple associated files.
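The answer to Task 2 can be illustrated with a few lines of Python. The records below are hypothetical placeholders (not real GDC IDs), chosen only to show how several file-level records collapse into fewer case-level records.

```python
from collections import defaultdict

# Hypothetical file-level records as (file_id, case_id) pairs.
# Placeholder IDs for illustration only, not real GDC data.
file_records = [
    ("file-1", "case-A"),
    ("file-2", "case-A"),
    ("file-3", "case-B"),
]

# Group the files under their case: one entry per case, listing all its files.
files_by_case = defaultdict(list)
for file_id, case_id in file_records:
    files_by_case[case_id].append(file_id)

print(len(file_records), "records when each file is a record")
print(len(files_by_case), "records when each case is a record")
```

Each case-level record carries a list of file IDs, which matches what you see on a single row of the “cases” results.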

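[Figure: how the same fields are named when queried from each endpoint]
files endpoint: file_id, cases.case_id, projects.project_id
cases endpoint: files.file_id, case_id, projects.project_id
projects endpoint: files.file_id, cases.case_id, project_id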

3) Finding Available Fields

In order to build a filter and designate specific fields to return, you have to know what fields are present in the database and at each endpoint. To get the available fields for an endpoint, enter the following URL into the browser and hit “enter” (copy from the “copy-paste” document):

https://api.gdc.cancer.gov/files/_mapping

You should get a long list of comma-separated entries formatted like the following:

    …
    "files.analysis.analysis_id": {
        "description": "",
        "doc_type": "files",
        "field": "analysis.analysis_id",
        "full": "files.analysis.analysis_id",
        "type": "string"
    },
    "files.analysis.analysis_type": {
        "description": "",
        "doc_type": "files",
        "field": "analysis.analysis_type",
        "full": "files.analysis.analysis_type",
        "type": "string"
    },
    …

If querying the “files” endpoint, then a field is referenced by the “field” name (shown in blue in the original). Using the “full” field name will return no results. Only use the full field name if you are querying a different endpoint. Some fields will have a description, which can help you find what you are looking for (see the figure from Section 2).
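If you save the mapping output to a file, a short script can pull out just the field names instead of scrolling through the browser. The sketch below works on the two sample entries shown above, wrapped in braces to make a complete JSON object. How the entries are nested inside the full `_mapping` response is not shown in this tutorial, so treat the top-level layout here as an assumption.

```python
import json

# The two sample entries from above, wrapped in {} to form valid JSON.
# The wrapper is an assumption; the real response contains more around them.
sample = """
{
  "files.analysis.analysis_id": {
    "description": "", "doc_type": "files",
    "field": "analysis.analysis_id",
    "full": "files.analysis.analysis_id", "type": "string"
  },
  "files.analysis.analysis_type": {
    "description": "", "doc_type": "files",
    "field": "analysis.analysis_type",
    "full": "files.analysis.analysis_type", "type": "string"
  }
}
"""

entries = json.loads(sample)

# The "field" value is the name to use when querying the files endpoint.
field_names = sorted(entry["field"] for entry in entries.values())

# Simple keyword search to find a field of interest.
matches = [name for name in field_names if "analysis_type" in name]

print(field_names)
print(matches)
```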

Test it out: copy and paste the mapping URL above into your browser (copy from the “copy-paste” document).

Task 3) Try the mapping URL for “files”, “cases”, and “projects”. What is the result?
You should notice that each endpoint returns a different set of available fields. The fields that contain the information we want are under the “files” endpoint. See if you can locate the following:

• file_id
• file_name
• downstream_analyses.output_files.file_name
• analysis.metadata.read_groups.target_capture_kit_name
• cases.samples.sample_type
• cases.project.project_id

4) Building a Filter

The filter is built using JSON format. Open the JSON editor (http://www.cleancss.com/json-editor). Then delete everything in the left pane and replace it with the following; all brackets are required (copy from the “copy-paste” document):

    {
       "op":"in",
       "content":{
          "field":"entity_id",
          "value":[
             "e0d36cc0-652c",
             "25ebc29a-7598",
             "fe660d7c-2746"
          ]
       }
    }

Note: JSON is very picky about the type of brackets and quotation marks used. Editing the JSON in a word processor or some text editors can change these characters without your knowledge (for example, straight quotes to curly quotes) and break the otherwise correct query.

The "op":"in" tells the server what type of filter this is. In this case it is saying to return any record that has an “entity_id” IN the given list under the “value” line. The “in” could also be changed to other operators like “=”, but we won’t be doing that in this tutorial.

Try it out: copy and paste the JSON filter above from the “copy-paste” document into the JSON Editor.

Task 4) Edit the filter to use the field name “file_id”, and replace the list of entity_id values with the BAM file IDs listed below. You do not need to upload any files or push any buttons on the editor web page. Just edit the text and copy it to your clipboard.

BAM File IDs:
0f792b53-5c12-487d-8229-187e5f8c0148
e3e53764-dd65-4e9b-8e96-5ae043b03401
85b9ee03-9e87-4d69-81ca-b1d6b24a8762

The code should look like the following:

    {"op":"in","content":{"field":"files.file_id","value":["0f792b53-5c12-487d-8229-187e5f8c0148","e3e53764-dd65-4e9b-8e96-5ae043b03401","85b9ee03-9e87-4d69-81ca-b1d6b24a8762"]}}

Next, open the URL encoder/decoder (http://www.url-encode-decode.com).

Task 5) Copy and paste your JSON filter into the left pane of the URL encoder/decoder, then press the “encode” button. Copy the result to your clipboard. You should see something like this:

%7B%0D%0A+++%22op%22%3A%22in%22%2C%0D%0A+++%22content%22%3A%7B%0D%0A++++++%22field%22%3A%22files.file_id%22%2C%0D%0A++++++%22value%22%3A%5B%0D%0A++++++++++%220f792b53-5c12-487d-8229-187e5f8c0148%22%2C%0D%0A++++++++++%22e3e53764-dd65-4e9b-8e96-5ae043b03401%22%2C%0D%0A++++++++++%2285b9ee03-9e87-4d69-81ca-b1d6b24a8762%22%0D%0A%0D%0A++++++%5D%0D%0A+++%7D%0D%0A%7D
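If you have 300 file IDs rather than 3, typing them into a JSON editor by hand gets tedious. As an alternative to the online tools, the sketch below builds the same kind of “in” filter from a Python list and percent-encodes it. The helper name `build_in_filter` is hypothetical. Because the JSON is serialized without line breaks or indentation, the result is shorter than the online encoder's output: the %0D%0A (line break) and + (space) sequences are simply absent.

```python
import json
from urllib.parse import quote, unquote


def build_in_filter(field: str, values: list) -> dict:
    """Build an 'in' filter: match records whose `field` is in `values`."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


bam_file_ids = [
    "0f792b53-5c12-487d-8229-187e5f8c0148",
    "e3e53764-dd65-4e9b-8e96-5ae043b03401",
    "85b9ee03-9e87-4d69-81ca-b1d6b24a8762",
]

file_filter = build_in_filter("files.file_id", bam_file_ids)

# Compact JSON (no whitespace), then percent-encode for use in a URL.
encoded_filter = quote(json.dumps(file_filter, separators=(",", ":")), safe="")
print(encoded_filter)

# Round trip: decoding recovers the original filter.
assert json.loads(unquote(encoded_filter)) == file_filter
```

Generating the JSON with `json.dumps` also sidesteps the curly-quote problem, since the program always emits straight quotation marks.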

5) Specifying additional options

So far we have talked about filters and fields, which are parts of the API command that come after the URL and question mark “?”. There are additional options you can set, like the format and size of the results. The GDC API Manual lists the available options here: https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#request-parameters

6) Assembling the API Command

You now have all the parts to assemble your custom API call. Copy the template below into a text editor to fill in the missing pieces, indicated by <<missing>> tags, that you obtained in Sections 3 and 4. It is color coded for ease of reading only.

https://api.gdc.cancer.gov/files?filters=<<missing>>&format=tsv&fields=<<missing>>&size=10

Try it out: copy and paste the above text (from the “copy-paste” document) into a text file editor.

Task 6) Copy the fields from Section 3 into the <<missing>> space after “fields=” (orange in the original), with each separated by a comma (no spaces). Then copy your encoded filter from Section 4 into the <<missing>> space after “filters=” (green in the original). Copy this URL into your web browser and press “enter”. Your URL should look like the following:

https://api.gdc.cancer.gov/files?filters=%7B%0D%0A+++%22op%22%3A%22in%22%2C%0D%0A+++%22content%22%3A%7B%0D%0A++++++%22field%22%3A%22files.file_id%22%2C%0D%0A++++++%22value%22%3A%5B%0D%0A++++++++++%220f792b53-5c12-487d-8229-187e5f8c0148%22%2C%0D%0A++++++++++%22e3e53764-dd65-4e9b-8e96-5ae043b03401%22%2C%0D%0A++++++++++%2285b9ee03-9e87-4d69-81ca-b1d6b24a8762%22%0D%0A%0D%0A++++++%5D%0D%0A+++%7D%0D%0A%7D&format=tsv&fields=file_id,file_name,downstream_analyses.output_files.file_name,analysis.metadata.read_groups.target_capture_kit_name,cases.samples.sample_type,cases.project.project_id&size=10
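The whole assembly step can also be scripted. The following is a minimal sketch, assuming you prefer Python over a text editor; the function name `build_gdc_url` is hypothetical. It fills in the same template (filters, format, fields, size) and then decodes the result again as a sanity check. Note that `urlencode` also percent-encodes the commas in the field list as %2C, which is a standard URL encoding of the same characters.

```python
import json
from urllib.parse import urlencode, quote, urlsplit, parse_qs


def build_gdc_url(endpoint, filters, fields, fmt="tsv", size=10):
    """Assemble an API call from its parts (same template as above)."""
    params = {
        "filters": json.dumps(filters, separators=(",", ":")),
        "format": fmt,
        "fields": ",".join(fields),
        "size": size,
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"https://api.gdc.cancer.gov/{endpoint}?" + urlencode(params, quote_via=quote)


file_filter = {
    "op": "in",
    "content": {
        "field": "files.file_id",
        "value": [
            "0f792b53-5c12-487d-8229-187e5f8c0148",
            "e3e53764-dd65-4e9b-8e96-5ae043b03401",
            "85b9ee03-9e87-4d69-81ca-b1d6b24a8762",
        ],
    },
}
wanted_fields = [
    "file_id",
    "file_name",
    "downstream_analyses.output_files.file_name",
    "analysis.metadata.read_groups.target_capture_kit_name",
    "cases.samples.sample_type",
    "cases.project.project_id",
]

url = build_gdc_url("files", file_filter, wanted_fields)
print(url)

# Sanity check: decode the URL and confirm the parts survived intact.
decoded = parse_qs(urlsplit(url).query)
assert json.loads(decoded["filters"][0]) == file_filter
assert decoded["fields"][0].split(",") == wanted_fields
```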

Try it out: copy and paste your API call into the URL bar of your browser; make sure it is completely empty first. Copy the results into an Excel worksheet and answer the following questions.

Warning: The columns are returned in a random order. You can sort them in Excel or just scroll through to look at the content.

Task 7) Which file is from “Primary Tumor”? Provide the file ID.
Task 8) What is the Project ID for the file “e3e53764-dd65-4e9b-8e96-5ae043b03401”?
Task 9) Which exome capture kit was used on the sample that was used to produce the VCF file named “69642ea0-a20d-4ff3-96b6-4a9a3ff76058.vcf”?

Answers:
Task 7) 0f792b53-5c12-487d-8229-187e5f8c0148
Task 8) TCGA-UCS
Task 9) hg18 nimblegen exome version 2
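If you would rather not sort columns in Excel, the tab-separated results can be read by a short script that looks columns up by name, so their order does not matter. The sketch below uses a tiny made-up table: the file IDs and rows are hypothetical placeholders, and the flattened column name “cases_0_samples_0_sample_type” is an assumption modeled on the “cases_0_case_id” column mentioned in Section 2.

```python
import csv
import io

# Hypothetical two-row result; placeholder IDs, not real GDC output.
# Column order is deliberately "random" to mimic the warning above.
tsv_text = (
    "cases_0_samples_0_sample_type\tfile_id\n"
    "Primary Tumor\tfile-1\n"
    "Blood Derived Normal\tfile-2\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Columns listed alphabetically, regardless of the order returned.
columns = sorted(rows[0].keys())

# Find the sample-type column by its suffix instead of its position.
sample_col = next(c for c in columns if c.endswith("sample_type"))

# Task 7 style question: which files come from a primary tumor?
tumor_file_ids = [r["file_id"] for r in rows if r[sample_col] == "Primary Tumor"]

print(columns)
print(tumor_file_ids)
```

To use it on real results, replace `tsv_text` with the text returned by your API call and adjust the column names to match what you see.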