Workshop: Finding and accessing data - Fiona, Nadia, Charlotte - Cambridge, April 26 2016

Genome sharing projects around the world and how you find data for your research

Cambridge, April 26 2016

Slides will be made available online

Tweets welcome #CamFindData


We are on Twitter: @glyn_dk, @repositiveio, @DNAdigest, @CamOpenData


Workshop outline

1. What data are you looking for? And why?
2. Data resources from around the world
3. Tips on how to find and access data
4. Hands-on using Repositive
5. Summary and feedback

1. What data are you looking for?

This workshop will focus on finding and accessing human genomic data.

And why would you be looking for genomic data for your research?

Are you researching cancer or genetic diseases?

Because interpretation requires LOTS of data

And although data exists around the world, it is siloed, and even if available, it is not accessible

This is Jenn, a genetic researcher and our target customer, seeking to interpret data from genetic diseases and cancer. She needs data from other patients to compare with and interpret Mabel's DNA.

She also has data available in her own lab, but she cannot share it because of concerns about how to deal with secure access to sensitive data and data governance, e.g. vetting of users.

How much data do you need to publish a paper?

2001: 1 human genome
2012: 1000 Genomes (1,092 genomes, since increased to ~2,500)
2015: UK10K; Icelandic population (2,636 + 100k imputed); The Cancer Genome Atlas (~11,000 genomes); ExAC consortium (65,000 exomes?)

But I am just looking at this one disease

Statistically speaking, you still need tens of thousands of samples for validation.

The more severe the phenotype and the more complete the penetrance, the easier it will be for you to find your variant, but:

"As the genetic complexity of the disease increases (for example, reduced penetrance and increased locus heterogeneity), issues of statistical power quickly become paramount."
http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html
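To get a feel for why sample sizes run into the tens of thousands, here is a rough sketch of a power calculation. It is not taken from the cited paper: it assumes a simple two-sided two-proportion z-test comparing how often a variant is seen in cases versus controls, equal group sizes, and a genome-wide significance threshold of 5e-8. The function name and the example frequencies are made up for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(p_control, p_case, alpha=5e-8, power=0.8):
    """Approximate number of cases (and equally many controls) needed to
    detect a frequency difference with a two-sided two-proportion z-test.

    Assumptions (illustrative only): normal approximation, equal group
    sizes, alpha defaulting to a genome-wide threshold of 5e-8.
    """
    if p_control == p_case:
        raise ValueError("the two frequencies must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_case) / 2
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled = math.sqrt(p_control * (1 - p_control) + p_case * (1 - p_case))
    n = ((z_alpha * pooled + z_beta * unpooled) / (p_case - p_control)) ** 2
    return math.ceil(n)


# Hypothetical variant frequencies: 1% in controls, slightly higher in cases.
for p_case in (0.02, 0.015, 0.012):
    print(p_case, samples_per_group(0.01, p_case))
```

The required sample size grows quickly as the difference between cases and controls shrinks, which is the situation with reduced penetrance and locus heterogeneity. A statistician will choose a test and threshold that fit your actual study design.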

What can I do?

PRO TIP: involve a statistician early on in your study design!

It has been shown that the combination of summary single-variant statistics from multiple data sets, rather than the joint analysis of a combined data set, does not result in an appreciable loss of information, and that taking into account heterogeneity in effect size across studies can improve statistical power.
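As an illustration of combining summary statistics, here is a minimal sketch of one common approach: inverse-variance weighted fixed-effect meta-analysis. The note above does not say which method was used, so this choice is an assumption, and the function name and input numbers are hypothetical. Each study contributes only an effect estimate and its standard error for a variant, so no individual-level data has to be shared.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, standard_errors):
    """Combine per-study effect estimates for one variant using
    inverse-variance weights (fixed-effect model, an assumed choice).

    Returns (combined_beta, combined_se, z_score, two_sided_p).
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value, normal approximation
    return beta, se, z, p


# Hypothetical summary statistics for the same variant from three studies.
print(fixed_effect_meta([0.21, 0.18, 0.25], [0.10, 0.08, 0.12]))
```

A fixed-effect model assumes the same true effect in every study. Accounting for heterogeneity in effect size, as the note mentions, needs a different model (for example a random-effects model), which this sketch does not cover.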

How can I determine significance?

One potentially powerful approach is to assess conservation across and within multiple species as whole-genome sequence data become more abundant.

Look at extreme phenotypes: sampling cases or controls from the extremes of an appropriate quantitative distribution can often increase power.

Look at non-SNP variants: they are more likely to have functional effects.

- How do you account for the technical features of sequencing, such as incomplete sequencing and biased coverage over the genome?

Although they are harder to call and annotate, insertions or deletions, multinucleotide variants and structural variants (including copy-number variants, translocations and inversions) constitute a smaller set of variation (in terms of the number of discrete events an individual is expected to carry) relative to all SNVs and are more likely to have functional effects.

How to account for bias?

Think of how you can provide evidence that your result is not just a local technical variation or sampling bias, e.g. by using data from the same cell type, the same sequencing technology and the same alignment.

PRO TIP: include more reference data in your analysis


How can I access more data for my research?

Know what data is available in your lab, your department and your organisation.

A survey from Qiagen showed that one of the main reasons researchers collaborate is to get access to data!

How can I find collaborators?

PRO TIP: Search for collaborators who have the data you need

PRO TIP: Tell your colleagues and peers what type of data you have in your lab

2. Data resources from around the world

Public repositories

Some require you to apply for access, especially if the data contains clinical info or whole-genome PID.

Some are open access: GEO, SRA, PGP, OpenSNP, GigaDB.

Some are consented for general research use, some have specific consent.


Large amounts of data, but not accessible

80+ PB sequenced every year (exponential growth rate)
0.5 PB of WGS sequence data available in public repositories

Under-utilised data has huge potential for medical research.

Population-scale genome sequencing projects have been launched all over the world. More than 80 PB of human genomic data is being sequenced every year, BUT to date only around 0.5 PB of data is available in public repositories.

DATA is fragmented

The problem is further confounded by the data being highly fragmented: it is siloed in repositories and institutions around the world.

It may be confusing

There are many public repositories, but it can be hugely confusing to know where to look for the right kind of data.

Hundreds of data sources, but they aren't easy to find!

First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418

Data source content

(Charts: Assay types; Dedicated to)

Number of samples in data sources

(Chart axis: sample number, log10)

Top 5:
GEO (1.8M)
PMI Cohort Program (1M)
Auria Biopankki (1M)
EGA (~0.6M)
SRA (~0.5M)

Data accessibility

Open: Can download the data straight away or after logging in.
Restricted: Need to apply for access to the data.
Mixed: Has both Open and Restricted access data within one repository.

Online data source types

University: Affiliated to a university. Often only members of that university can upload to or download from it.
Catalogue: Doesn't have raw data but lists studies/datasets.
Initiative/Consortium: Has a specific purpose/aim. Often focussed on a question or disease.
Repository: Can download from, has data from multiple institutions. Often you can also upload your own data there.
Company: For-profit organisation. Listing data is not their main purpose.
Biobank: Many have sequence data of their biological samples.

Sequenced ethnicities

Aboriginals, African Americans, Africans, Australians, Chinese, Malays, Indians, Danish, Dutch, Estonian, Russian, European Ancestry, Finnish, Icelandic, Japanese, Korean, Latin Americans, Saudi, Swedish

Machines & Data sources

(Figure: counts of machines and data sources, including an International category)

Interesting site to look at: http://omicsmaps.com/stats

Main Repository funders

BGI = 4
EBI = 9
NIH = 10
NCBI = 9
The Broad = 8
Wellcome = 4

EBI total: 104 services, 19 repositories (http://www.ebi.ac.uk/services/all)
NCBI total: 67 databases (http://www.ncbi.nlm.nih.gov/guide/all/#databases_)

Biobanks as data sources

- Biobanks are potential sources of genomic data
- Most biobanks contain large collections of samples (thousands)
- Some biobanks also contain data related to these samples
- A fraction of this data is genomic data (usually genotyping)
- Several biobanks (e.g. ToMMo biobank in Japan, UK Biobank) have sequencing programs
- Many biobanks do not consider sequencing as their priority but are willing to give their samples to researchers who would like to sequence them
- Most biobanks are supposed to share their samples with bona fide researchers (exception: commercial biobanks, e.g. Abcodia)
- In most cases, the best thing is to ask them directly whether they have samples/data that you need!

Name: UK Biobank
Type of data: genotyping
URL: http://biobank.ctsu.ox.ac.uk/crystal/gsearch.cgi

Name: ToMMo Biobank
Type of data: genotyping, WGS
URL: https://ijgvd.megabank.tohoku.ac.jp/

Name: Diabetes Biobank Brussels
Type of data: data (including genomic; not specified) and clinical samples on >20,000 diabetic patients and their first-degree relatives
URL: http://www.diabetesbiobank.org/

Name: Dutch biobanks (dozens of them!)
Type of data: multiple
URL: http://bit.ly/1XxPA6W

Name: Auria Biobank Finland
Type of data: "There are roughly one million human biological samples stored in Auria Biobank, a considerable proportion of which are cancer samples. At the moment, there is only the catalogue of samples, no catalogue of data. In case a researcher needs to know what kind of data we have, he/she needs to contact us."
URL: https://www.auriabiopankki.fi/?lang=en

More information about data sources in our recent paper:

http://tinyurl.com/plos-biology-repositive


3. Tips to find and access data

Case study: DNA data on cancer

Repositories you have heard of:

Repository      Data type            Access
ArrayExpress    Expression           Open
GEO             Expression           Open
EGA             Mixed                Restricted
dbGaP           Mixed                Restricted
ENCODE          Healthy reference    Open
1000 Genomes    Healthy reference    Open

Ask around (word of mouth):

Repository      Data type                        Access
COSMIC          Somatic mutations & WGS          Open
ClinVar         Variant information              Open
ExAC            Allele freq. but not raw data    Open
SRA             Individual sequences             Open
TCGA            Clinical & high-level data       Open
CGHub           Low-level data (DNA data)        Restricted

Case study: DNA data on cancer

We have identified the first 27 cancer-specific data sources, and many more that contain cancer data alongside other data types.

- Abcodia
- AmbryShare
- BRCA Exchange
- Breast Cancer Now Tissue Bank
- Broad Cancer programme datasets
- Cancer Moonshot 2020
- CanGEM
- CGCI
- CGHub
- Chinese cancer genome consortium
- Chinese national human genome centre
- Follicular Lymphoma Genome Data
- G-DOC
- GenoMel
- ICGC
- National Mesothelioma Virtual Bank
- NCIP Hub
- Project GENIE
- TARGET
- TCGA
- Texas cancer research biobank
- NCI-60
- CCLE
- COSMIC
- FANTOM
- Cancer methylome system
- Cancer therapeutics response portal

Registering for CGHub
https://cghub.ucsc.edu/keyfile/newuser.html

1. Register for eRA account
- Principal signing official registers
- Email to verify
- Email to confirm/deny access to website
- Email with temporary password
- Change password
- Electronic signature

2. Request access to specific dataset of interest
- Login
- Fill in contact info
- Complete 424 form (research application form)
- Request reviewed by DAC
- Email to confirm/deny access to data

3. Download data
- Login
- Retrieve personal access token
- Download!

Often a long process

Bottlenecks:
- Finding relevant and usable data
- Getting authorisation to access data
- Formatting data
- Storing and moving data

We studied the problem through qualitative interviews followed by a survey of researchers in human genetics.

Researchers spend months finding and accessing genomic data, and often choose not to access data at all.

T. A. van Schaik et al., "The need to redefine genomic data sharing: a focus on data accessibility", Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013

Why the barrier?

Public repositories: the default is apply for access -> full access.

Benefits: strict governance, review of consent, applicant signs for full responsibility for governance.

Disadvantages: no control of data once access is given; is the barrier for access too high? (Researchers give up, and even patients can't get access to their own data.)

How can I save time?

- Start planning your data needs early in your project
- When you find the data you need, start the application
- Use Open Access data

PRO TIP: If you use human genomic data, apply for the GRU (general research use) datasets in dbGaP: one application gives access to all the GRU datasets.

Not all data is restricted

Some data is Open Access (this requires specific consent):

- OpenSNP.org (Bastian)
- Personal Genome Projects
- Individuals who put their genomes online, e.g. Manuel Corpas and his family, the "Corpasome": http://manuelcorpas.com/about/

Personal Genome Project

PGP Harvard
Host institution: Harvard Medical School, Boston
Principal investigator: George Church
Launch year: 2005
Geographic scope: USA, mainly Boston
Data generated: Whole genome sequencing, upload of additional data possible
Number of genomes: 100s
Data access: http://personalgenomes.org/harvard/data
Project funding: Discretional funds and corporate sponsoring
Website: http://personalgenomes.org/harvard/

PGP Canada
Host institution: SickKids, Toronto
Principal investigator: Stephen Scherer
Launch year: 2012
Geographic scope: Canada
Data generated: Mainly whole genome sequencing
Number of genomes: 10s
Project funding: Institutional startup funds
Website: http://personalgenomes.org/canada/

PGP UK
Host institution: University College London
Principal investigator: Stephan Beck
Launch year: 2013
Geographic scope: United Kingdom
Data generated: Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing
Number of genomes: 10s
Project funding: Discretional funds and corporate sponsoring
Website: http://personalgenomes.org/uk/

Genom Austria
Host institution: CeMM Research Center for Molecular Medicine
Principal investigators: Christoph Bock & Giulio Superti-Furga
Launch year: 2014
Geographic scope: Mainly Austria
Data generated: Mainly whole genome sequencing
Number of genomes: 10s
Data access: http://genomaustria.at/unser-genom/#genome-der-pionierinnen
Project funding: Institutional startup funds
Website: http://genomaustria.at/

All projects
Enrollment eligibility: At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded
Areas of emphasis (as listed across the projects): Integration with phenotypic data, collaboration with other personal omics initiatives; genome donations, synergy with massive-scale clinical genome sequencing projects; genomes and society, genetic literacy, school projects, education

Summary of data access barriers

1. Data is uploaded to repository
2. Data is discovered by potential user
3. Data is accessed by potential user

The barrier of making data available

"Even when researchers are authorised to share data they report reluctance to do so because of the amount of effort required."
http://www.sciencedirect.com/science/article/pii/S2212066114000386

Clinical geneticists cited a lack of time because their main priority is diagnosing patients. Industrial researchers cited a lack of time because of the pressure to meet the deadlines in their job. Researchers in academia cited both a concern about the potential loss of future publications once unpublished data is shared, and the lack of time and incentive to share data as this does not contribute to their publication record. Researchers from all categories felt that they lacked sufficient resources to make their data available.

But I do not want to share my data

If you expect data to be available to you, you have to make your data available too!

Encourage collaborations: power by numbers

Best practices

- Get credit: publish and make your data available
- Give credit: cite data sources
- Understand consent for all uses of clinical data

Best practices: use the tools

Use all available tools to make your life easier:

- Data publications: visibility and citations for your data, e.g. GigaScience and Scientific Data
- Figshare, Zenodo, Dryad for sharing open access data
- PhenomeCentral, Matchmaker Exchange for rare disease research
- Repositive for finding data across repositories and making your own data discoverable

Does data sharing matter at grant proposal evaluation?

Based on: Winning Horizon 2020 with Open Science, http://dx.doi.org/10.5281/zenodo.12247

Best practices: Plan into your grant proposals

ODP trained, EURO-BASIN manager, a boring title, for a diverse job, in an exciting research domain.

DIP into EACH step of the research cycle, from proposal formulation to providing the best return-on-investment to the funders.

So I'd like to share with you some experiences from the last few years of OS advocacy in the Marine Science Community.

Evaluation criteria: Excellence, Impact, Implementation

Weakness: involvement of non-academic beneficiaries is limited
Weakness: highly focused on academic activities, and lacks an advanced communication strategy
Weakness: limited exposure to non-academic partners & infrastructures
Data accessibility is unclear!
Data storage & access not considered

Excellence at your research subject is excellent, but is it ENOUGH?

To be successful, a candidate will be judged on being complete.

MESSAGE: FOCUS only on IF could expose you to risk

Impact:

Strengths:
- extensive dissemination of data to the scientific community (open access, databases)
- outreach activities to a broad audience
- research software is freely available


Best practices: Share in return!

Make the (research) world a better place by sharing in return.

What the future holds

- Digital consent: towards automatic processing of applications
- Dynamic consent and power to the patient, e.g. PatientsKnowBest
- Privacy-preserving access to datasets: preserving control and governance with the data custodian, lower barrier for access

4. Hands-on session using Repositive

What if finding data was as easy as finding a book on Amazon or booking a hotel on Expedia?

Repositive promotes best practices

EASY SEARCH: Discover new data sources
SHARE KNOWLEDGE: Make your data visible
BUILD TRUST: Build a data community

Benefit for both sides of data collaboration

Data consumers:
- Find relevant data faster
- Feedback from other users through ratings and comments to evaluate data quality
- Find collaborators with data

Data producers:
- Make your data visible
- Build credibility as a trusted provider of quality data
- Find collaborators to analyse your data

Live demo http://discover.repositive.io

Use activation code: CamFindData

Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data.

5. Summary and feedback

- Get credit: publish data
- Give credit: cite data
- Understand consent

Tell us your thoughts: @repositiveio, @glyn_dk
And read more on http://repositive.io

Bugs and feedback to: Charlotte at Repositive.io


Thank you!
