Computational Research IIT@SEMM heiko.muller@iit · Base call qualities • If p is the probability...

Genomic Computing, DEIB, 4-7 March 2013 Architecture of Distributed Annotation Server Heiko Muller Computational Research IIT@SEMM [email protected]


Page 1:

Genomic Computing, DEIB, 4-7 March 2013

Architecture of Distributed Annotation Server Heiko Muller

Computational Research IIT@SEMM [email protected]

Page 2:

The genomic data surge

[Timeline figure, 1980s to 2010s. Labels: PCR, genome draft; Southern, RFLP, SNP chips, SNP beads; Northern, differential display, SAGE, expression chips; sequencer; genome browsers]

IFOM-IEO Campus, since 2009: 1800 samples, 16 TB raw data, 70 TB elaborated data

Page 3:

NGS data flow

The current situation:
1. Biologist fills in request form and sends it to [email protected]
2. Data are inserted into LIMS and request IDs are sent back to the biologist
3. Samples are sequenced and run data are inserted into LIMS
4. LIMS prepares sample sheets that are used for demultiplexing and bcl->fastq conversion
5. FastQC is run for quality control
6. FASTQ data are saved on the IIT Isilon device and hard links are produced in user folders
7. Group bioinformaticians align and analyze data
8. Group bioinformaticians interact with biologists to interpret results

Request LIMS->FASTQ bioinformaticians Elaborated data sets

homogeneous heterogeneous

Page 4:

Request form

Page 5:

Request submission

Page 6:

LIMS: Laboratory Information Management System

Page 7:

[Schema diagram: N:N relations between sample and request, sample and application, sample and run]

LIMS is backed by a MySQL database

Page 8:

LIMS: NGS reagents

Page 9:

Illumina HiSeq

Each lane contains more than one sample (multiplexing)

180 million clusters per lane

Page 10:

LIMS: NGS runs

Page 11:

LIMS: Samplesheet

http://hilt.iit.ieo.eu/data/SampleSheets/

Page 12:

Demultiplexing

Samplesheet

Script generator

Run on IIT blades, manually or via Process proc = Runtime.getRuntime().exec(command);

Page 13:

Output: FASTQ files

FASTQ = “FASTA with Qualities”

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
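The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming one record per four lines with no wrapped sequence lines; `FastqRecord` and `read_fastq` are illustrative names, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' on line 1
    sequence: str    # raw sequence letters from line 2
    qualities: str   # encoded quality string from line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, qualities)


# The example record from this slide.
EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
```

The length check mirrors the rule that line 4 must contain as many symbols as line 2 contains letters.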

“FASTA” is short for “FAST-All”: unlike the FASTP protein aligner described in 1985, it works with all alphabets (DNA:DNA, translated protein:DNA). FASTA is also a universal file format for sequences. Example:

>header1
acgtgatgc
>header2
cgtgatgca
...

Wikipedia

Page 14:

Quality control

Page 15:

Why do quality control?

Colors get blurred with increasing cycle number.

Page 16:

Base call qualities

• If p is the probability that the base call is wrong, the Phred score is: Q = -10 log10(p), or equivalently p = 10^(-Q/10)
• The score is written with the character whose ASCII code is Q+33 (Sanger Institute standard). (ASCII = American Standard Code for Information Interchange)

Phred = classical base-calling program

Q    p (call incorrect)    ASCII    symbol

0 1.000000 33 !

1 0.794328 34 "

2 0.630957 35 #

3 0.501187 36 $

4 0.398107 37 %

5 0.316228 38 &

6 0.251189 39 '

7 0.199526 40 (

8 0.158489 41 )

9 0.125893 42 *

10 0.100000 43 +

11 0.079433 44 ,

12 0.063096 45 -

13 0.050119 46 .

14 0.039811 47 /

15 0.031623 48 0

16 0.025119 49 1

17 0.019953 50 2

18 0.015849 51 3

19 0.012589 52 4

20 0.010000 53 5

21 0.007943 54 6

22 0.006310 55 7

23 0.005012 56 8

24 0.003981 57 9

25 0.003162 58 :

26 0.002512 59 ;

27 0.001995 60 <

28 0.001585 61 =

29 0.001259 62 >

30 0.001000 63 ?

31 0.000794 64 @

32 0.000631 65 A

33 0.000501 66 B

34 0.000398 67 C

35 0.000316 68 D

36 0.000251 69 E

37 0.000200 70 F

38 0.000158 71 G

39 0.000126 72 H

40 0.000100 73 I

P call is wrong: 1 (Q=0), 0.1 (Q=10), 0.01 (Q=20), 0.001 (Q=30)
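The Phred relation and the Q+33 character encoding from this slide can be checked with a short Python sketch. The function names are illustrative, and the offset of 33 is the Sanger convention stated above.

```python
import math

PHRED_OFFSET = 33  # Sanger convention: character code = Q + 33


def error_probability(q: float) -> float:
    """p = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Q = -10 log10(p)."""
    return -10 * math.log10(p)


def encode_quality(q: int) -> str:
    """Character used in the FASTQ quality line for score Q."""
    return chr(q + PHRED_OFFSET)


def decode_quality(symbol: str) -> int:
    """Score Q for one character of the FASTQ quality line."""
    return ord(symbol) - PHRED_OFFSET


# Per-base error probabilities for the start of the example quality string.
example_probabilities = [error_probability(decode_quality(c)) for c in "!''*"]
```

For instance, the symbol `5` decodes to Q = 20, which corresponds to an error probability of 0.01, matching the table row for Q = 20.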

Page 17:

Aligning sequence reads

The first task after obtaining your reads (and any QC) is to determine the site in the reference genome from which each read derives. This is termed genome alignment (also known as ‘mapping’).

Short-read aligners base their algorithm on one of these ideas:
• use spaced-seed indexing
• hash seed words from the reference
• hash seed words from the reads
• sort reference words and reads lexicographically
• use the Burrows-Wheeler transform (BWT)

BWT seems to be the winning idea (very fast, sufficiently accurate), and is used by the newest tools (Bowtie, SOAPv2, BWA).
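To make the last idea concrete, here is a minimal Python sketch of the Burrows-Wheeler transform itself, assuming a `$` sentinel that sorts before all sequence letters. It sorts all rotations of the text, which is fine for a toy string but is not how production aligners build their indexes; the function names are illustrative.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The transform is reversible, so the original sequence can be recovered from it. That property is what allows an index built on the transformed reference to be searched.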

Page 18:

http://samtools.sourceforge.net/SAM1.pdf

SAM example

Page 19:

Viewing BAM files in Integrated Genome Browsers

Page 20:

File sizes

fastq

bam bigWig

Page 21:

[Diagram labels: users, service, application server, Sun Grid Engine (number crunching) on four blades and one GPU, Isilon storage (200 TB)]

hardware                  price    RAM (GB)  processors
application server        7100     16        24
blade system              33221    256       48
GPU Tesla M2050           9888     48        24
Isilon storage (200 TB)   230000

Getting data to the user: IIT@SEMM infrastructure

Mainly FastQ

Illumina HiSeq2000

Page 22:

Application server, blades, GPU

Page 23:

Isilon storage

Page 24:

Visualizing NGS data: Genome Browsers

Page 25:

Visualizing genomic data: What is a “Genome Browser”

• linear representation of a genome

• position-based annotations, each called a track

– continuous annotations: e.g. conservation

– interval annotations: e.g. gene, read alignment

– point annotations: e.g. SNPs

• user specifies a subsection of genome to look at

Page 26:

Comparison of Genome Browsers

            UCSC                      Ensembl                             IGV                                  IGB
Reference   http://genome.ucsc.edu/   http://www.ensembl.org/index.html   http://www.broadinstitute.org/igv/   http://bioviz.org/igb/
Model       Server                    Server                              Client                               Client

Also rated per browser on the slide (no support / some support / good support): Interactive, HTS support, Database of tracks, Plugins

Server model
  Server: central data store, renders images, sends to client
  Client: requests images, displays images

Client model
  Server: stores data
  Client: local HTS store, renders images, displays images

Limitations:
• do not support multiple genomes simultaneously
• do not capture 3-dimensional conformation
• do not capture spatial or temporal information
• do not integrate well with analytics

Page 27:

• Browse many eukaryotic genomes (yeast to human)

• Most annotations are there

• Important evolutionary and variation data representation.

• Very flexible and configurable views

• Graphical and table views

• Upload your data into custom tracks and share with colleagues

• Client/server application with its issues, but a great app!

About UCSC Genome Browser

Page 28:

http://genome.ucsc.edu

Page 29:

http://genome.ucsc.edu

Page 30:

Integrated Genome Browser and IIT DAS2 server

Page 31:

Integrated Genome Browser and published genome annotations

Page 32:

Genome browser view: ChIP-seq

.bam .bed .bigWig

Page 33:

Genome browser view: sequencing errors

Page 34:

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline
• Genome Browsing: Why was DAS developed?
• DAS: history, usage, and specification, reference implementation
• Integrated Genome Browser
• Examples


Page 36:

Frederick Sanger

GenBank

Centralized repository; sequences owned by submitter

Page 37:

LOCUS       NM_053056               4304 bp    mRNA    linear   PRI 27-MAY-2012
DEFINITION  Homo sapiens cyclin D1 (CCND1), mRNA.
ACCESSION   NM_053056 NM_001758
VERSION     NM_053056.2  GI:77628152
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4304)
  AUTHORS   Li,Q., Dong,Q. and Wang,E.
  TITLE     Rsf-1 is overexpressed in non-small cell lung cancers and regulates
            cyclinD1 expression and ERK activity
  JOURNAL   Biochem. Biophys. Res. Commun. 420 (1), 6-10 (2012)
   PUBMED   22387541
  REMARK    GeneRIF: Rsf-1 is overexpressed in non-small cell lung cancers and
            contributes to malignant cell growth by cyclin D1 and ERK
            modulation.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-138               BM796500.1         1-138
            139-1278            BC001501.2         73-1212
            1279-4077           AP001888.4         12952-15750
            4078-4304           X59798.1           4018-4244
FEATURES             Location/Qualifiers
     source          1..4304
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     gene            1..4304
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="cyclin D1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
     exon            1..407
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /inference="alignment:Splign"
                     /number=1
     CDS             210..1097
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="B-cell CLL/lymphoma 1; BCL-1 oncogene; PRAD1
                     oncogene; B-cell lymphoma 1 protein"
                     /codon_start=1
                     /product="G1/S-specific cyclin-D1"
                     /protein_id="NP_444284.1"
                     /db_xref="GI:16950655"
                     /db_xref="CCDS:CCDS8191.1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
                     /translation="MEHQLLCCEVETIRRAYPDANLLNDRVLRAMLKAEETCAPSVSY
                     FKCVQKEVLPSMRKIVATWMLEVCEEQKCEEEVFPLAMNYLDRFLSLEPVKKSRLQLL
                     GATCMFVASKMKETIPLTAEKLCIYTDNSIRPEELLQMELLLVNKLKWNLAAMTPHDF
                     IEHFLSKMPEAEENKQIIRKHAQTFVALCATDVKFISNPPSMVAAGSVVAAVQGLNLR
                     SPNNFLSYYRLTRFLSRVIKCDPDCLRACQEQIEALLESSLRQAQQNMDPKAAEEEEE
                     EEEEVDLACTPTDVRDVDI"
     misc_feature    885..887
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Phosphotyrosine; propagated from
                     UniProtKB/Swiss-Prot (P24385.1); phosphorylation site"
ORIGIN
        1 cacacggact acaggggagt tttgttgaag ttgcaaagtc ctggagcctc cagagggctg
       61 tcggcgcagt agcagcgagc agcagagtcc gcacgctccg gcgaggggca gaagagcgcg
      121 agggagcgcg gggcagcaga agcgagagcc gagcgcggac ccagccagga cccacagccc
      181 tccccagctg cccaggaaga gccccagcca tggaacacca gctcctgtgc tgcgaagtgg
      241 aaaccatccg ccgcgcgtac cccgatgcca acctcctcaa cgaccgggtg ctgcgggcca
      301 tgctgaaggc ggaggagacc tgcgcgccct cggtgtccta cttcaaatgt gtgcagaagg
      361 aggtcctgcc gtccatgcgg aagatcgtcg ccacctggat gctggaggtc tgcgaggaac
      421 agaagtgcga ggaggaggtc ttcccgctgg ccatgaacta cctggaccgc ttcctgtcgc
      481 tggagcccgt gaaaaagagc cgcctgcagc tgctgggggc cacttgcatg ttcgtggcct
      541 ctaagatgaa ggagaccatc cccctgacgg ccgagaagct gtgcatctac accgacaact
      601 ccatccggcc cgaggagctg ctgcaaatgg agctgctcct ggtgaacaag ctcaagtgga
      661 acctggccgc aatgaccccg cacgatttca ttgaacactt cctctccaaa atgccagagg
      721 cggaggagaa caaacagatc atccgcaaac acgcgcagac cttcgttgcc ctctgtgcca
//

A GenBank entry

By design, third-party annotations are nearly impossible to incorporate

Page 38:

Since 1989: centrally curated, with annotations provided by the community -> curation bottleneck

AceDB: A C. elegans database

Page 39:

2001

2002

- To view massive amounts of sequencing data, genome browsers were developed.
- Annotations developed in “Annotation Jamborees”
- Human Genome Project Analysis Group: concept of annotation tracks
- Tracks produced and curated by different groups but stored on a centralized server -> bandwidth bottleneck

HUGO

Page 40:

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline
• Genome Browsing: Why was DAS developed?
• DAS: history, usage, and specification, reference implementation
• Integrated Genome Browser
• Examples

Page 41:

Decentralized curation of annotation tracks
Decentralized storage of annotation tracks

Distributed Annotation System: DAS

The distributed annotation system

Page 42:

Components:
1. Reference genome server (provides coordinates and sequence)
2. Annotation server(s) (provide annotation tracks)
3. Client (views annotations mapped onto the reference)

DAS basics

reference

Client (web or stand alone)

annotations

Dowell et al. 2001

Page 43:

Geodesic: Standalone client by Dowell et al. 2001

Source code: http://www.biodas.org/geodesic/

Page 44:

Glyphs: Graphic elements used for track display

Page 45:

DAS/2 (not listed in registry):
• http://india907.server4you.de:8080/das2/genome (epigenome.at)
• http://www.bioviz.org/das2/genome (Bioviz)
• http://bioserver.hci.utah.edu:8080/DAS2DB/genome (UofUtahBioinfoCore)
• http://netaffxdas.affymetrix.com/das2/genome (NetAffx)

Currently 1600 DAS/1 entries

Clients:

DAS registry (www.dasregistry.org)

Page 46:

http://www.biodas.org/wiki/DAS/2

Main difference: DAS/2 supports non-XML file formats
DAS/2 clients support DAS/1 but not vice versa

DAS/1 != DAS/2

2004-2007: NIH grant for DAS/2 development. Partners: Affymetrix, Cold Spring Harbor Lab, the EBI/Sanger Center, Dalke Scientific

Page 47:

DAS specification (www.biodas.org)

Page 48:

• Sources: lists available genomes
• Segments: lists chromosomes per genome
• Types: lists types of annotation (file format etc.)
• Features: lists annotation details in a specific region

DAS: Basic Query types: sources, segments, types, features

Page 49:

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/" >
  <MAINTAINER email="[email protected]" />
  <SOURCE uri="D_rerio" title="D_rerio" >
    <VERSION uri="danRer7" title="danRer7" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="H_sapiens" title="H_sapiens" >
    <VERSION uri="H_sapiens_Mar_2006" title="H_sapiens_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <COORDINATES uri="http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/" authority="NCBI" taxid="9606" version="36" source="Chromosome" />
      <CAPABILITY type="segments" query_uri="H_sapiens_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="H_sapiens_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="H_sapiens_Mar_2006/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus" >
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
    <VERSION uri="M_musculus_Mar_2006" title="M_musculus_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Mar_2006/features" />
    </VERSION>
  </SOURCE>
</SOURCES>

http://rubidio.ifom-ieo-campus.it:8080/das2/genome
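A client could turn a sources response like the one on this slide into absolute query URLs as sketched below. This is only an illustration using Python's standard library: `SOURCES_XML` is a trimmed copy of the response with the XML declaration dropped, and `list_capabilities` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

DAS2_NS = {"das": "http://biodas.org/documents/das2"}
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"  # the xml:base attribute

# Trimmed copy of the sources response shown on this slide.
SOURCES_XML = """<SOURCES xmlns="http://biodas.org/documents/das2"
    xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/">
  <SOURCE uri="D_rerio" title="D_rerio">
    <VERSION uri="danRer7" title="danRer7">
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus">
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007">
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
    </VERSION>
  </SOURCE>
</SOURCES>"""


def list_capabilities(xml_text: str) -> dict:
    """Map each genome version to {capability type: absolute query URL}."""
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    result = {}
    for version in root.iterfind("das:SOURCE/das:VERSION", DAS2_NS):
        result[version.get("uri")] = {
            cap.get("type"): urljoin(base, cap.get("query_uri"))
            for cap in version.iterfind("das:CAPABILITY", DAS2_NS)
        }
    return result


capabilities = list_capabilities(SOURCES_XML)
```

Resolving each relative `query_uri` against `xml:base` yields the segments, types, and features URLs used on the following slides.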

Page 50:

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/" uri="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments" >
  <SEGMENT uri="chr1" title="chr1" length="197195432" />
  <SEGMENT uri="chr2" title="chr2" length="181748087" />
  <SEGMENT uri="chr3" title="chr3" length="159599783" />
  <SEGMENT uri="chr4" title="chr4" length="155630120" />
  <SEGMENT uri="chr5" title="chr5" length="152537259" />
  <SEGMENT uri="chr6" title="chr6" length="149517037" />
  <SEGMENT uri="chr7" title="chr7" length="152524553" />
  <SEGMENT uri="chr8" title="chr8" length="131738871" />
  <SEGMENT uri="chr9" title="chr9" length="124076172" />
  <SEGMENT uri="chr10" title="chr10" length="129993255" />
  <SEGMENT uri="chr11" title="chr11" length="121843856" />
  <SEGMENT uri="chr12" title="chr12" length="121257530" />
  <SEGMENT uri="chr13" title="chr13" length="120284312" />
  <SEGMENT uri="chr14" title="chr14" length="125194864" />
  <SEGMENT uri="chr15" title="chr15" length="103494974" />
  <SEGMENT uri="chr16" title="chr16" length="98319150" />
  <SEGMENT uri="chr17" title="chr17" length="95272651" />
  <SEGMENT uri="chr18" title="chr18" length="90772031" />
  <SEGMENT uri="chr19" title="chr19" length="61342430" />
  <SEGMENT uri="chrX" title="chrX" length="166650296" />
  <SEGMENT uri="chrY" title="chrY" length="15902555" />
  <SEGMENT uri="chrM" title="chrM" length="16299" />
  <SEGMENT uri="chr1_random" title="chr1_random" length="1231697" />
  <SEGMENT uri="chr3_random" title="chr3_random" length="41899" />
  <SEGMENT uri="chr4_random" title="chr4_random" length="160594" />
  <SEGMENT uri="chr5_random" title="chr5_random" length="357350" />
  <SEGMENT uri="chr7_random" title="chr7_random" length="362490" />
  <SEGMENT uri="chr8_random" title="chr8_random" length="849593" />
  <SEGMENT uri="chr9_random" title="chr9_random" length="449403" />
  <SEGMENT uri="chr13_random" title="chr13_random" length="400311" />
  <SEGMENT uri="chr16_random" title="chr16_random" length="3994" />
  <SEGMENT uri="chr17_random" title="chr17_random" length="628739" />
  <SEGMENT uri="chrUn_random" title="chrUn_random" length="5900358" />
  <SEGMENT uri="chrX_random" title="chrX_random" length="1785075" />
  <SEGMENT uri="chrY_random" title="chrY_random" length="58682461" />
</SEGMENTS>

http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments

Page 51:

<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2" xml:base="http://localhost:8080/genopub/genome/M_musculus_Jul_2007/" >
  <TYPE uri="EML1/PU1_ChIP/Input" title="EML1/PU1_ChIP/Input" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="Input" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=11" />
    <PROP key="visibility" value="Members" />
  </TYPE>
  <TYPE uri="EML1/PU1_ChIP/PU1_A3" title="EML1/PU1_ChIP/PU1_A3" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="PU1_A3" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=7" />
    <PROP key="visibility" value="Members" />
  </TYPE>
</TYPES>

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/types

Page 52:

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2Fchr1;overlaps=79374747%3A81152999;type=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2FEML1%2FPU1_ChIP%2FPU1_B2;format=useq

• Returns a file in useq format, essentially a zip file; the preferred format in IGB
• Contains an archiveReadMe.txt and one or more “slice files”
• Observations can be textual or numerical

http://useq.sourceforge.net/useqArchiveFormat.html

Features
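The features request on this slide is made of four percent-encoded filters joined by semicolons. The Python sketch below rebuilds that URL from its parts; `features_url` and its parameter names are illustrative, and the parameter set is limited to the one visible in the example request.

```python
from urllib.parse import quote


def features_url(version_url: str, segment: str, start: int, end: int,
                 type_path: str, fmt: str) -> str:
    """Build a features query with segment, overlaps, type, and format filters."""
    params = [
        ("segment", f"{version_url}/{segment}"),
        ("overlaps", f"{start}:{end}"),
        ("type", f"{version_url}/{type_path}"),
        ("format", fmt),
    ]
    # Every value is fully percent-encoded (':' -> %3A, '/' -> %2F).
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{version_url}/features?{query}"


url = features_url(
    "http://localhost:8080/genopub/genome/M_musculus_Jul_2007",
    segment="chr1",
    start=79374747,
    end=81152999,
    type_path="EML1/PU1_ChIP/PU1_B2",
    fmt="useq",
)
```

With these arguments the function reproduces the request shown above: the PU1_B2 track on chr1 between positions 79374747 and 81152999, returned in useq format.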

Page 53:

A BED file (.bed) is a tab-delimited text file that defines a feature track. The file extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.

Notes:
Zero-based index: start and end positions are identified using a zero-based index. The end position is excluded. For example, setting start-end to 1-2 describes exactly one base, the second base in the sequence (ACGT).

track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB

Other important file formats: BED (textual)
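The zero-based, end-exclusive convention is the same one Python slicing uses, which makes it easy to check. This is a minimal sketch, assuming tab-delimited lines with an optional name in the fourth column; `BedFeature` and `parse_bed` are illustrative names.

```python
from typing import List, NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # zero-based, included
    end: int    # zero-based, excluded
    name: str

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed(text: str) -> List[BedFeature]:
    """Parse tab-delimited BED lines, skipping track/browser/comment lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        features.append(BedFeature(fields[0], int(fields[1]), int(fields[2]), name))
    return features


BED_EXAMPLE = (
    'track name=pairedReads description="Clone Paired Reads"\n'
    "Chr22\t1000\t5000\tcloneA\n"
    "Chr22\t2000\t6000\tcloneB\n"
)

bed_features = parse_bed(BED_EXAMPLE)
second_base = "ACGT"[1:2]  # start-end 1-2 selects exactly the second base
```

Because the end is excluded, a feature's length is simply end minus start, so cloneA spans 4000 bases.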

Page 54:

The bedGraph format is line-oriented. BedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”.

BedGraph track data values can be integer or real, positive or negative. Chromosome positions are specified as 0-relative: the first chromosome position is 0, and the last position in a chromosome of length N is N - 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order.

The bedGraph format has four columns of data:

track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25

Intervals can be of any length and overlapping.

Other important file formats: BEDGraph (numerical)
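As an illustration of the four-column layout, the sketch below parses the example track and computes a length-weighted mean signal. It assumes whitespace-separated columns and non-overlapping intervals; `parse_bedgraph` and `mean_signal` are illustrative names.

```python
from typing import List, Tuple

BedGraphRow = Tuple[str, int, int, float]  # chrom, start, end, value


def parse_bedgraph(text: str) -> List[BedGraphRow]:
    """Parse bedGraph data lines, skipping track/browser/comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        rows.append((chrom, int(start), int(end), float(value)))
    return rows


def mean_signal(rows: List[BedGraphRow]) -> float:
    """Mean value per covered base, weighting each interval by its length."""
    covered = sum(end - start for _, start, end, _ in rows)
    weighted = sum((end - start) * value for _, start, end, value in rows)
    return weighted / covered


BEDGRAPH_EXAMPLE = """track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
"""

bedgraph_rows = parse_bedgraph(BEDGRAPH_EXAMPLE)
```

Positions outside the listed intervals carry no data, so they are left out of the average rather than counted as zero.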

Page 55:

The wiggle (WIG) format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized. If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead. If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format. Chromosome positions are specified as 1-relative.

variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:

variableStep chrom=chrN [span=windowSize]
StartA dataValueA
StartB dataValueB

For example:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Both versions display a value of 12.5 at positions 300701-300705 on chromosome 2.

Other important file formats: Wig (“wiggle”)
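The equivalence of the two variableStep versions can be checked by expanding each block to one value per base. This is a minimal sketch that handles a single declaration line followed by its data lines; `expand_variable_step` is an illustrative name.

```python
from typing import Dict, List, Tuple


def expand_variable_step(lines: List[str]) -> Dict[Tuple[str, int], float]:
    """Expand one variableStep block to {(chrom, 1-based position): value}."""
    declaration = lines[0].split()
    if declaration[0] != "variableStep":
        raise ValueError("expected a variableStep declaration line")
    options = dict(token.split("=", 1) for token in declaration[1:])
    chrom = options["chrom"]
    span = int(options.get("span", 1))  # span defaults to a single base
    values = {}
    for line in lines[1:]:
        start, value = line.split()
        for position in range(int(start), int(start) + span):
            values[(chrom, position)] = float(value)
    return values


per_base = expand_variable_step([
    "variableStep chrom=chr2",
    "300701 12.5", "300702 12.5", "300703 12.5", "300704 12.5", "300705 12.5",
])
with_span = expand_variable_step([
    "variableStep chrom=chr2 span=5",
    "300701 12.5",
])
```

Both calls return the same five positions on chr2, each with the value 12.5.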

Page 56:

fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. Chromosome positions are specified as 1-relative. It begins with a declaration line, followed by a single column of data values. The declaration line starts with the word fixedStep and specifies the chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

displays the values 11, 22, and 33 as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively. Without span=5, the same values would be displayed as single-base regions at positions 400601, 400701, and 400801. Step and span are fixed for the entire data set.
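A short sketch makes the roles of start, step, and span concrete by expanding a fixedStep block into the regions it describes. The function name `expand_fixed_step` is hypothetical; the region boundaries are reported as 1-relative, inclusive positions.

```python
def expand_fixed_step(lines):
    """Return a list of (chrom, first_base, last_base, value) regions."""
    regions = []
    chrom, pos, step, span = None, None, 1, 1
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "fixedStep":
            # Declaration line: step and span default to 1 when omitted.
            opts = dict(f.split("=", 1) for f in fields[1:])
            chrom = opts["chrom"]
            pos = int(opts["start"])
            step = int(opts.get("step", 1))
            span = int(opts.get("span", 1))
            continue
        # Each value covers `span` bases; the next value starts `step` bases later.
        regions.append((chrom, pos, pos + span - 1, float(fields[0])))
        pos += step
    return regions


regions = expand_fixed_step([
    "fixedStep chrom=chr3 start=400601 step=100 span=5",
    "11",
    "22",
    "33",
])
```

With span=5, each of the three values covers five bases, starting at 400601, 400701, and 400801.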

Other important file formats: Wig (“wiggle”)

Data transfer level: Required: random access. At the lowest layer, we take advantage of the byte-range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web.

URL data cache layer: A cache layer on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache.

Indexing: Based on a single-dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; in fact, by default only every 512th item is indexed.

Compression: Regions between indexed items (containing 512 items by default) are individually compressed (gzip).
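The combination of sparse indexing and per-block compression can be illustrated with a simplified sketch. It is not the bigWig/bigBed file layout: a sorted list searched with bisect stands in for the one-dimensional R tree, everything is held in memory for a single chromosome, intervals are assumed to be sorted and non-overlapping, and the names `build` and `query` are hypothetical.

```python
import bisect
import zlib

BLOCK_ITEMS = 512  # items per compressed block, the default quoted above


def build(items):
    """items: (start, end, value) tuples sorted by start, for one chromosome."""
    index, blocks = [], []
    for i in range(0, len(items), BLOCK_ITEMS):
        chunk = items[i:i + BLOCK_ITEMS]
        text = "\n".join(f"{s}\t{e}\t{v}" for s, e, v in chunk)
        blocks.append(zlib.compress(text.encode()))
        index.append(chunk[0][0])  # only the first item of each block is indexed
    return index, blocks


def query(index, blocks, start, end):
    """Return (items overlapping [start, end), number of blocks decompressed)."""
    first = max(bisect.bisect_right(index, start) - 1, 0)
    last = bisect.bisect_left(index, end)  # exclusive upper bound
    hits, touched = [], 0
    for block in blocks[first:last]:
        touched += 1
        for line in zlib.decompress(block).decode().splitlines():
            s, e, v = line.split("\t")
            s, e = int(s), int(e)
            if s < end and e > start:
                hits.append((s, e, float(v)))
    return hits, touched


# 2000 adjacent 10-base intervals -> 4 blocks, 4 index entries.
items = [(i * 10, i * 10 + 10, float(i)) for i in range(2000)]
index, blocks = build(items)
hits, touched = query(index, blocks, 5000, 5030)
```

A query for a small region decompresses only the block that can contain it, which is what makes remote random access over byte ranges practical.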

BigWig and BigBed

Basic architecture

Object-relational mapping via Hibernate

Flex

Apache Tomcat 6 Glassfish

MySQL

DAS/2 server reference implementation: http://sourceforge.net/projects/genoviz

Database tables

User table

Annotation table

User role table

Message Digest 5 (MD5) hashing from the java.security package

Table views
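The slide names Java's java.security package for the MD5 step. As a language-neutral illustration, the sketch below computes an MD5 digest with Python's standard hashlib instead. The function name `md5_hex` and the sample input are hypothetical, and how the server stores or compares digests is not shown in the source.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text (UTF-8 encoded) as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


digest = md5_hex("abc")
```

The digest is a fixed-length, one-way summary of the input, so the original text cannot be read back from the stored value.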

Each file gets its own folder (with an automatically assigned folder name). File names, which may contain unsupported characters, therefore do not need to be stored in the DB.
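One way to realize this storage scheme is sketched below, under stated assumptions: folder names are generated with uuid, and every file is stored under the same fixed name inside its folder. The names `store_upload` and `STORED_NAME` are hypothetical, and the actual naming scheme used by the server is not given in the source.

```python
import shutil
import tempfile
import uuid
from pathlib import Path

STORED_NAME = "data"  # hypothetical fixed name used inside every folder


def store_upload(source: Path, storage_root: Path) -> str:
    """Copy source into a new, automatically named folder; return the folder name."""
    folder_id = uuid.uuid4().hex  # 32 characters from 0-9 and a-f only
    target_dir = storage_root / folder_id
    target_dir.mkdir(parents=True)
    shutil.copyfile(source, target_dir / STORED_NAME)
    return folder_id


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "storage"
    upload = Path(tmp) / "my track (v2)#.bedgraph"  # awkward user-supplied name
    upload.write_text("chr19 49302000 49302300 10\n")
    folder_id = store_upload(upload, root)
    stored_ok = (root / folder_id / STORED_NAME).read_text().startswith("chr19")
```

Only the generated folder name would need to be recorded in the database; the user's original file name never reaches the storage path.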

Data storage directory

Visibility levels:

DAS2 administration user interface

To access data with restricted visibility, you must have an entry in the user table and belong to a group that is headed by the owner of the data.
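The access rule for restricted data can be written as a small predicate. This is only a sketch of the rule as stated: the `Group` class and the function `can_access_restricted` are hypothetical, the real server keeps this information in database tables, and letting the owner access their own data is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    head: str                       # user who heads the group
    members: set = field(default_factory=set)


def can_access_restricted(user, owner, users, groups):
    """True if user may read restricted data owned by owner."""
    if user not in users:           # must be present in the user table
        return False
    if user == owner:               # assumption: owners can read their own data
        return True
    # must belong to at least one group headed by the data owner
    return any(g.head == owner and user in g.members for g in groups)


users = {"alice", "bob", "carol"}
groups = [Group(head="alice", members={"bob"})]

bob_ok = can_access_restricted("bob", "alice", users, groups)
carol_ok = can_access_restricted("carol", "alice", users, groups)
dave_ok = can_access_restricted("dave", "alice", users, groups)
```

In this example only bob, who is registered and belongs to a group headed by alice, can read alice's restricted data.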

Users and groups setup

Every user, admin or non-admin, can change their password, load new data, add data descriptions, and set visibility levels.

Non-administrator user interface

IGB user identification

jdbcRealm and ldapRealm both work

NetAffx and UCSC hg19 annotations

All these annotations are one click away from the user

Conclusions

DAS2 servers:
• provide distributed genome annotations
• support a fine-grained security model
• perform parsing of data for custom genome views