ENCODE Data Coordination at UCSC Kate Rosenbloom ENCODE DCC Technical Project Manager UCSC Genome Bioinformatics Group September 2010 Genome Browser SAB Review

ENCODE Data Coordination at UCSC

Kate Rosenbloom

ENCODE DCC Technical Project Manager

UCSC Genome Bioinformatics Group

September 2010 Genome Browser SAB Review

ENCODE in a nutshell: Y3Q3

• 2 species (human, mouse)

• 3 years of production phase (2 years left)

• 17 grants, 27 labs

• 20 experiment types

• 27 browser tracks

• 140 cell & tissue types

• 1559 datasets

Topics:

1. DCC role in ENCODE

2. Progress since last SAB review

3. Challenges

4. Browser impact

DCC role in ENCODE

• Define data formats and submission process

• Load, display, curate, review & release data

• Collect metadata and documentation

• Provide public website, viz, tools

• User outreach & support

• Support consortium communications (wiki)

• Support analysis group

Data submission website: http://encodesubmit.ucsc.edu

2438 submissions as of September 2010

Lifecycle of a data submission

1. Lab uploads sub.tar.gz to submission website. Status: uploaded

2. Pipeline validates and loads into database. Status: loaded (or validate failed, load failed if data fails validation or loading)

3. Wrangler configures browser track. Status: displayed (on test browser)

4. Lab approves track on test browser. Status: approved

5. Q/A reviews and releases. Status: reviewing, then released (to public browser)
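
The submission lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not the DCC's actual pipeline code: the class and function names are hypothetical, and the rule that a failed submission goes back to "uploaded" when the lab re-uploads is an assumption not stated on the slide.

```python
from enum import Enum


class Status(Enum):
    # Status names as they appear on the lifecycle slide.
    UPLOADED = "uploaded"
    VALIDATE_FAILED = "validate failed"
    LOAD_FAILED = "load failed"
    LOADED = "loaded"
    DISPLAYED = "displayed"
    APPROVED = "approved"
    REVIEWING = "reviewing"
    RELEASED = "released"


# Allowed transitions. Returning from a failed state to UPLOADED
# (the lab re-uploads) is an assumption made for this sketch.
TRANSITIONS = {
    Status.UPLOADED: {Status.LOADED, Status.VALIDATE_FAILED, Status.LOAD_FAILED},
    Status.VALIDATE_FAILED: {Status.UPLOADED},
    Status.LOAD_FAILED: {Status.UPLOADED},
    Status.LOADED: {Status.DISPLAYED},      # wrangler configures the track
    Status.DISPLAYED: {Status.APPROVED},    # lab approves on the test browser
    Status.APPROVED: {Status.REVIEWING},    # Q/A picks it up
    Status.REVIEWING: {Status.RELEASED},    # pushed to the public browser
    Status.RELEASED: set(),
}


class Submission:
    """Hypothetical record tracking one uploaded archive through the lifecycle."""

    def __init__(self, name):
        self.name = name
        self.status = Status.UPLOADED
        self.history = [Status.UPLOADED]

    def advance(self, new_status):
        """Move to new_status, rejecting steps the lifecycle does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value!r} to {new_status.value!r}"
            )
        self.status = new_status
        self.history.append(new_status)
```

Keeping the transitions in one table makes it easy to see that a submission cannot reach "released" without passing through lab approval and Q/A review.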

ENCODE track quality review

• Data format checks

• Description and metadata complete & correct

• Configurability

• Display at different zoom levels and visibilities

• Performance

• Does the data make biological sense?

• Usability

Released Data:

27 tracks in hg18, representing 860 experiments as of September 2010

Features planned for Year 2:

• High-resolution wiggle (bigWig) DONE

• RNAseq display enhancements DONE

• NCBI accessioning of seq data IN PROGRESS

• Track search tool IN REVIEW

Progress since last SAB review (Feb ‘09)

Plus:

• Integrated regulatory track

• Hg19/GRCh37 migration

• BAM support (spec, validation, display, c-tracks)

• Mid-course review, 4 data freezes, 2 analysis workshops, DCC site review

• Mouse ENCODE

ENCODE data at NCBI GEO: Caltech RNA-seq

Mouse ENCODE experiment matrix

4 grants funded by the ARRA, 3 are now submitting data

more cell types more factors

Initial tracks of mouse data (test browser)

Finding data in the browser: Simple free-text search

Simple search looks at:

• Track names and labeling

• Track description

• Metadata terms (specifically ENCODE controlled vocabulary)
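
As an illustration of the simple search, here is a minimal sketch of free-text matching over track names, labels, descriptions, and metadata values. It is not the browser's implementation; the function name, record layout, and sample tracks are hypothetical.

```python
def simple_search(tracks, query):
    """Return tracks whose name, label, description, or metadata values
    contain every word of the query (case-insensitive substring match)."""
    words = query.lower().split()
    hits = []
    for track in tracks:
        # Build one searchable string from all the fields the slide lists.
        haystack = " ".join(
            [track["name"], track["label"], track["description"],
             *track["metadata"].values()]
        ).lower()
        if all(word in haystack for word in words):
            hits.append(track)
    return hits


# Hypothetical records; real track names and metadata differ.
TRACKS = [
    {
        "name": "exampleHistone",
        "label": "Histone Mods",
        "description": "ChIP-seq of histone modifications",
        "metadata": {"cell": "H1-hESC", "antibody": "H3K4me3"},
    },
    {
        "name": "exampleRnaSeq",
        "label": "RNA-seq",
        "description": "Transcription levels by RNA-seq",
        "metadata": {"cell": "GM12878"},
    },
]
```

Calling `simple_search(TRACKS, "h3k4me3")` matches only the first record, because the term appears in its metadata and nowhere in the second.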

Finding data in the browser: By metadata terms

Advanced search allows selection by defined metadata terms. (Currently only for ENCODE tracks.)

This search finds histone modification H3K4me3 as seen in H1-hESC cells.

Results from track search

The result of the search on the previous slide is a single track of histone modification H3K4me3 as seen in H1-hESC cells. Clicking ‘View in Browser’ will display this data.
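
Selection by defined metadata terms amounts to exact matching on key/value pairs. The sketch below shows that idea on the H3K4me3 in H1-hESC example; the function name, the metadata keys (`cell`, `antibody`), and the catalog entries are hypothetical and do not reflect the real ENCODE controlled vocabulary.

```python
def search_by_metadata(tracks, **terms):
    """Return tracks whose metadata equals every requested term exactly."""
    return [
        track
        for track in tracks
        if all(track["metadata"].get(key) == value for key, value in terms.items())
    ]


# Hypothetical catalog; keys and values are placeholders for illustration.
CATALOG = [
    {"name": "histoneA", "metadata": {"cell": "H1-hESC", "antibody": "H3K4me3"}},
    {"name": "histoneB", "metadata": {"cell": "H1-hESC", "antibody": "H3K27me3"}},
    {"name": "histoneC", "metadata": {"cell": "GM12878", "antibody": "H3K4me3"}},
]

matches = search_by_metadata(CATALOG, cell="H1-hESC", antibody="H3K4me3")
```

Unlike free-text search, every term must match a specific field, so adding a second term narrows the result to the single track that satisfies both.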

Challenges

• Number of labs, difficulty of some

• Metadata expansion, special handling beyond normal browser data

• Multiple customers: NHGRI, analysis group, labs, user community

• Production vs. research

• Mission expansion: GEO/SRA, standards, ARRA, year 5

• Reporting overhead

• Engineering staff -> hire ‘wranglers’

• Funding delays

DCC site visit recommendations

1. Data accessibility: Track search, Feature supertracks, Tutorial

2. Data usability

3. Data quality: Post standards on website, Flag non-conforming data

4. Long-term repository: Deposit data to GEO

5. Metadata user review

6. Use cases: Session gallery on website

7. Reproducibility in publications

8. Web site: Data snapshot on website, Improve labeling

9. Analysis data sets: Integrated regulatory track, Imports from AWG

10. Metrics for success

Blue items are DCC-specific

Impact on browser

• Expanded data – mostly useful, some not so much

• Pushes development of viz, tools, formats for large datasets

• Competes for staff and mgmt resources

People at the DCC

• PI: Jim Kent

• Technical project manager: Kate Rosenbloom

• Engineering / Wrangling: Tim Dreszer, Venkat Malladi, Brian Raney, Cricket Sloan, Melissa Cline

• Outreach, usability: Melissa, OpenHelix (contractor)

• Submissions website: Galt Barber

• GEO tools: Krishna Roskin

• Quality assurance: Katrina Learned, Vanessa Swing

• Browser management: Donna Karolchik, Bob Kuhn, Ann Zweig

Reporting:

Monthly

Quarterly

Annual

Additional slides

Plans

• ENCODE tutorial

• Portal upgrade

• Complete GEO submissions

• Analysis tracks

• ARRA grants (proteogenomics, epitope-tag)

• Release Mouse data

Browser features developed for ENCODE

• High resolution wiggle (bigWig)

• HTS formats (BAM and bigBed)

• BIG custom tracks

• View-based tracks

• Data selection matrix

• Metadata links

• Coming soon: Track Search
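
For the big custom tracks listed above, the data file stays on a remote web server and the browser is given a one-line track definition that points at it. The helper below is a small sketch that assembles such a line in the `track type=... bigDataUrl=...` custom-track form; the function name is hypothetical, and the label and URL in the example are placeholders, not real ENCODE files.

```python
# Binary, remotely hosted formats named on the slide.
BIG_TYPES = {"bigWig", "bigBed", "bam"}


def custom_track_line(track_type, name, description, big_data_url):
    """Build a custom-track definition line for a remotely hosted file."""
    if track_type not in BIG_TYPES:
        raise ValueError(f"unsupported track type: {track_type!r}")
    return (
        f'track type={track_type} name="{name}" '
        f'description="{description}" bigDataUrl={big_data_url}'
    )


# Placeholder values for illustration only.
line = custom_track_line(
    "bigWig",
    "Example signal",
    "Example high-resolution wiggle",
    "http://example.org/data/example.bw",
)
```

Because only the definition line is uploaded, large files can be displayed without copying them to the browser's servers.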

GEO Submission Pipeline

ENCODE Portal: http://encodeproject.org

ENCODE Outreach 2009-2010

• Publication: NAR 2010 Database issue (2011 update in press)

• Presentations: CSHL Statistical Analysis course June 2010, Stanford Computational Systems Bioinformatics, Aug 2010

• Posters: CSHL Biology of Genomes May 2009, CSB 2010

OpenHelix ENCODE tutorial

Integrated regulatory tracks

UCSC-developed integrative ENCODE track: shows enrichment of histone modifications suggestive of enhancer and promoter activity, DNase clusters indicating open chromatin, regions of transcription factor binding, and transcription levels, derived from ENCODE data collected in multiple cell lines.

Key items from site visit recommendations

• Make a track search tool to make it easy to find all data on one cell line or one transcription factor

• Organize data by biochemical entities rather than by lab.

• Put effort into high-level documentation on website

– “Sessions gallery” to show use cases

– A page that gives an overview of what data is available in ENCODE, including cells, antibodies, and assays

– Put up data summaries, figures, and presentations generated by the AWG onto site

Some user comments from DCC survey

• Linking annotation across all cell types and linking all annotations across one cell type would be quite nice. As it is now, it takes a fair bit of manual manipulation to do this.

• Great job, awesome resource. Thanks to all!

• Need more cell types and conditions. Do data from non-ENCODE consortium groups get incorporated?

• Amazingly there isn't a useful Encode summary, let alone a detailed description of the project and results. There's a nice do-loop with links between UCSC & NHGRI that don't lead anywhere. Is there a publication or link that I'm missing somewhere that informs, educates & is a users' guide? Great project, just difficult to sift through in it's current form.

• Encode only covers 1% of the human genome. Not sufficient coverage