Computational Research IIT@SEMM heiko.muller@iit · Base call qualities • If p is the probability...

Genomic Computing, DEIB, 4-7 March 2013

Architecture of Distributed Annotation Server Heiko Muller

Computational Research IIT@SEMM [email protected]

The genomic data surge

[Figure: timeline of genomic technologies, 1980s to 2010s. Labels: PCR, Southern, RFLP, Northern, Differential Display, SAGE, SNP chips, SNP beads, Expression chips, Genome draft, Sequencer, Genome browsers]

IFOM-IEO-CAMPUS, since 2009: 1800 samples, 16 TB raw data, 70 TB elaborated data

NGS data flow

The current situation:
1. Biologist fills in request form and sends it to [email protected]
2. Data are inserted into LIMS and request IDs are sent back to the biologist
3. Samples are sequenced and run data are inserted into LIMS
4. LIMS prepares sample sheets that are used for demultiplexing and bcl->fastq conversion
5. FastQC is run for quality control
6. FASTQ data are saved on the IIT Isilon device and hard links are created in user folders
7. Group bioinformaticians align and analyze data
8. Group bioinformaticians interact with biologists to interpret results

Request LIMS->FASTQ bioinformaticians Elaborated data sets

homogeneous heterogeneous



Request form

Request submission

LIMS: Laboratory Information Management System

Many-to-many (N:N) relations: sample-request, sample-application, sample-run

LIMS is backed by a MySQL database
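A many-to-many relation such as sample-run is usually stored in a relational database through a junction table. The sketch below illustrates the idea with Python's built-in sqlite3 module as a stand-in for MySQL; the table and column names (sample, run, sample_run, flowcell, lane) are assumptions for illustration and are not taken from the actual LIMS schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a many-to-many sample/run relation."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE run (id INTEGER PRIMARY KEY, flowcell TEXT NOT NULL);
        -- Junction table: one row per (sample, run, lane) pairing.
        CREATE TABLE sample_run (
            sample_id INTEGER NOT NULL REFERENCES sample(id),
            run_id INTEGER NOT NULL REFERENCES run(id),
            lane INTEGER NOT NULL,
            PRIMARY KEY (sample_id, run_id, lane)
        );
        """
    )
    # Hypothetical demo data: sample S1 is sequenced in two runs.
    con.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "S1"), (2, "S2")])
    con.executemany("INSERT INTO run VALUES (?, ?)", [(10, "FC_A"), (11, "FC_B")])
    con.executemany(
        "INSERT INTO sample_run VALUES (?, ?, ?)",
        [(1, 10, 1), (2, 10, 1), (1, 11, 3)],
    )
    return con


def samples_in_run(con, flowcell):
    """Return the names of all samples sequenced on the given flowcell."""
    rows = con.execute(
        """
        SELECT s.name FROM sample s
        JOIN sample_run sr ON sr.sample_id = s.id
        JOIN run r ON r.id = sr.run_id
        WHERE r.flowcell = ?
        ORDER BY s.name
        """,
        (flowcell,),
    )
    return [name for (name,) in rows]
```

Two samples sharing lane 1 of the same run corresponds to the multiplexing case, where one lane holds more than one sample.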

LIMS: NGS reagents

Illumina HiSeq

Each lane contains more than one sample (multiplexing)

180 million clusters per lane

LIMS: NGS runs

LIMS: Samplesheet

http://hilt.iit.ieo.eu/data/SampleSheets/

Demultiplexing

Samplesheet

Script generator

Run on IIT blades, manually or via Process proc = Runtime.getRuntime().exec(command);
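For readers who script in Python rather than Java, the rough counterpart of Runtime.getRuntime().exec(command) is subprocess.run. This is a minimal sketch only: the real demultiplexing command is not shown on the slide, so a harmless stand-in command (built by the hypothetical helper python_echo_command) is used instead.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments.

    Returns (exit code, standard output, standard error).
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def python_echo_command(text):
    """Hypothetical stand-in for a generated script: just prints `text`."""
    return [sys.executable, "-c", f"print({text!r})"]
```

Passing the arguments as a list avoids shell quoting problems when sample names or paths contain spaces.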

Output: FASTQ files

FASTQ = “FASTA with Qualities”

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
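The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the names FastqRecord and parse_fastq are illustrative and do not refer to an existing library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # raw sequence letters
    quality: str     # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("quality line and sequence differ in length")
        yield FastqRecord(header[1:], sequence, quality)
```

Applied to the example record above, the parser returns one record with identifier SEQ_ID and 60 bases.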

“FASTA” is short for “FAST-All”: unlike the FASTP protein aligner described in 1985, it works with all alphabets (DNA:DNA, translated protein:DNA). FASTA is also a universal file format for sequences. Example:
>header1
acgtgatgc
>header2
cgtgatgca

Wikipedia

http://en.wikipedia.org/wiki/FASTA_format

Quality control

Why do quality control?

Colors get blurred with increasing cycle number.

Base call qualities

• If p is the probability that the base call is wrong, the Phred score is Q = -10 log10(p), or equivalently p = 10^(-Q/10).
• The score is written with the character whose ASCII code is Q+33 (Sanger Institute standard). (ASCII = American Standard Code for Information Interchange)

Phred=classical base calling program
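The two formulas and the Q+33 encoding can be checked with a short Python sketch. The function names are illustrative; only the offset of 33 and the formulas themselves come from the slide.

```python
import math

SANGER_OFFSET = 33  # ASCII code of the character that encodes Q = 0


def phred_from_probability(p):
    """Q = -10 * log10(p), where p is the probability of a wrong call."""
    return -10 * math.log10(p)


def probability_from_phred(q):
    """p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - SANGER_OFFSET for char in quality_string]


def encode_quality(scores):
    """Convert a list of Phred scores into a FASTQ quality string."""
    return "".join(chr(q + SANGER_OFFSET) for q in scores)
```

For instance, the characters !, 5 and I decode to Q = 0, 20 and 40, that is, error probabilities of 1, 0.01 and 0.0001.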

Q    p (call incorrect)   ASCII   symbol
0    1.000000             33      !
1    0.794328             34      "
2    0.630957             35      #
3    0.501187             36      $
4    0.398107             37      %
5    0.316228             38      &
6    0.251189             39      '
7    0.199526             40      (
8    0.158489             41      )
9    0.125893             42      *
10   0.100000             43      +
11   0.079433             44      ,
12   0.063096             45      -
13   0.050119             46      .
14   0.039811             47      /
15   0.031623             48      0
16   0.025119             49      1
17   0.019953             50      2
18   0.015849             51      3
19   0.012589             52      4
20   0.010000             53      5
21   0.007943             54      6
22   0.006310             55      7
23   0.005012             56      8
24   0.003981             57      9
25   0.003162             58      :
26   0.002512             59      ;
27   0.001995             60      <
28   0.001585             61      =
29   0.001259             62      >
30   0.001000             63      ?
31   0.000794             64      @
32   0.000631             65      A
33   0.000501             66      B
34   0.000398             67      C
35   0.000316             68      D
36   0.000251             69      E
37   0.000200             70      F
38   0.000158             71      G
39   0.000126             72      H
40   0.000100             73      I

[Figure: P(call is wrong), axis ticks at 1, 0.1, 0.01, 0.001]

Aligning sequence reads

The first task after obtaining your reads (and any QC) is to determine the site in the reference genome from which each read derives. This is termed genome alignment (also known as ‘mapping’).

Short-read aligners base their algorithm on one of these ideas:
• use spaced-seed indexing
• hash seed words from the reference
• hash seed words from the reads
• sort reference words and reads lexicographically
• use the Burrows-Wheeler transform (BWT)

BWT seems to be the winning idea (very fast, sufficiently accurate) and is used by the newest tools (Bowtie, SOAPv2, BWA).
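To give a feeling for the Burrows-Wheeler transform itself, here is a deliberately naive Python sketch that sorts all rotations of a string. It only illustrates the transform and its reversibility; it is not how Bowtie or BWA are implemented, since real aligners build compressed indexes over the whole genome rather than sorting rotations in memory.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform by sorting all rotations (naive)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]
```

The transform groups identical characters that share the same following context, which is what makes the result compressible and searchable.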

http://samtools.sourceforge.net/SAM1.pdf

SAM example

Viewing BAM files in Integrated Genome Browsers

File sizes

[Figure: file sizes compared for fastq, bam, and bigWig]

[Diagram: users, service, application server, blades and GPU (Sun Grid Engine: number crunching), Isilon storage (200 TB)]

hardware                  price     RAM (GB)   processors
application server        7100      16         24
blade system              33221     256        48
GPU Tesla M2050           9888      48         24
Isilon storage (200 TB)   230000

Getting data to the user: IIT@SEMM infrastructure

Mainly FastQ

Illumina HiSeq2000

Application server, blades, GPU

Isilon storage

Visualizing NGS data: Genome Browsers

Visualizing genomic data: What is a “Genome Browser”

• linear representation of a genome

• position-based annotations, each called a track

– continuous annotations: e.g. conservation

– interval annotations: e.g. gene, read alignment

– point annotations: e.g. SNPs

• user specifies a subsection of genome to look at

Comparison of Genome Browsers

Browser   Model    Reference
UCSC      Server   http://genome.ucsc.edu/
Ensembl   Server   http://www.ensembl.org/index.html
IGV       Client   http://www.broadinstitute.org/igv/
IGB       Client   http://bioviz.org/igb/

Further criteria compared: Interactive, HTS support, Database of tracks, Plugins (each rated as no support, some support, or good support)

Server model
  Server: central data store, renders images, sends to client
  Client: requests images, displays images

Client model
  Server: stores data
  Client: local HTS store, renders images, displays images

Limitations:
• do not support multiple genomes simultaneously
• do not capture 3-dimensional conformation
• do not capture spatial or temporal information
• do not integrate well with analytics

• Browse many eukaryotic genomes (yeast to human)

• Most annotations are there

• Important evolutionary and variation data representation.

• Very flexible and configurable views

• Graphical and table views

• Upload your data into custom tracks and share with colleagues

• Client/server application with its issues, but a great app!

About UCSC Genome Browser

http://genome.ucsc.edu

Integrated Genome Browser and IIT DAS2 server

Integrated Genome Browser and published genome annotations

Genome browser view: ChIP-seq

.bam .bed .bigWig

Genome browser view: sequencing errors

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline
• Genome Browsing: Why was DAS developed?
• DAS: history, usage, and specification, reference implementation
• Integrated Genome Browser
• Examples

Frederick Sanger

GenBank

Centralized repository, sequences owned by submitter

Genbank

LOCUS       NM_053056               4304 bp    mRNA    linear   PRI 27-MAY-2012
DEFINITION  Homo sapiens cyclin D1 (CCND1), mRNA.
ACCESSION   NM_053056 NM_001758
VERSION     NM_053056.2  GI:77628152
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4304)
  AUTHORS   Li,Q., Dong,Q. and Wang,E.
  TITLE     Rsf-1 is overexpressed in non-small cell lung cancers and regulates
            cyclinD1 expression and ERK activity
  JOURNAL   Biochem. Biophys. Res. Commun. 420 (1), 6-10 (2012)
   PUBMED   22387541
  REMARK    GeneRIF: Rsf-1 is overexpressed in non-small cell lung cancers and
            contributes to malignant cell growth by cyclin D1 and ERK
            modulation.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-138               BM796500.1         1-138
            139-1278            BC001501.2         73-1212
            1279-4077           AP001888.4         12952-15750
            4078-4304           X59798.1           4018-4244
FEATURES             Location/Qualifiers
     source          1..4304
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     gene            1..4304
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="cyclin D1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
     exon            1..407
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /inference="alignment:Splign"
                     /number=1
     CDS             210..1097
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="B-cell CLL/lymphoma 1; BCL-1 oncogene; PRAD1
                     oncogene; B-cell lymphoma 1 protein"
                     /codon_start=1
                     /product="G1/S-specific cyclin-D1"
                     /protein_id="NP_444284.1"
                     /db_xref="GI:16950655"
                     /db_xref="CCDS:CCDS8191.1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
                     /translation="MEHQLLCCEVETIRRAYPDANLLNDRVLRAMLKAEETCAPSVSY
                     FKCVQKEVLPSMRKIVATWMLEVCEEQKCEEEVFPLAMNYLDRFLSLEPVKKSRLQLL
                     GATCMFVASKMKETIPLTAEKLCIYTDNSIRPEELLQMELLLVNKLKWNLAAMTPHDF
                     IEHFLSKMPEAEENKQIIRKHAQTFVALCATDVKFISNPPSMVAAGSVVAAVQGLNLR
                     SPNNFLSYYRLTRFLSRVIKCDPDCLRACQEQIEALLESSLRQAQQNMDPKAAEEEEE
                     EEEEVDLACTPTDVRDVDI"
     misc_feature    885..887
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Phosphotyrosine; propagated from
                     UniProtKB/Swiss-Prot (P24385.1); phosphorylation site"
ORIGIN
        1 cacacggact acaggggagt tttgttgaag ttgcaaagtc ctggagcctc cagagggctg
       61 tcggcgcagt agcagcgagc agcagagtcc gcacgctccg gcgaggggca gaagagcgcg
      121 agggagcgcg gggcagcaga agcgagagcc gagcgcggac ccagccagga cccacagccc
      181 tccccagctg cccaggaaga gccccagcca tggaacacca gctcctgtgc tgcgaagtgg
      241 aaaccatccg ccgcgcgtac cccgatgcca acctcctcaa cgaccgggtg ctgcgggcca
      301 tgctgaaggc ggaggagacc tgcgcgccct cggtgtccta cttcaaatgt gtgcagaagg
      361 aggtcctgcc gtccatgcgg aagatcgtcg ccacctggat gctggaggtc tgcgaggaac
      421 agaagtgcga ggaggaggtc ttcccgctgg ccatgaacta cctggaccgc ttcctgtcgc
      481 tggagcccgt gaaaaagagc cgcctgcagc tgctgggggc cacttgcatg ttcgtggcct
      541 ctaagatgaa ggagaccatc cccctgacgg ccgagaagct gtgcatctac accgacaact
      601 ccatccggcc cgaggagctg ctgcaaatgg agctgctcct ggtgaacaag ctcaagtgga
      661 acctggccgc aatgaccccg cacgatttca ttgaacactt cctctccaaa atgccagagg
      721 cggaggagaa caaacagatc atccgcaaac acgcgcagac cttcgttgcc ctctgtgcca
//

A GenBank entry

By design, annotations are nearly impossible to incorporate

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606

http://www.ncbi.nlm.nih.gov/pubmed/22387541

http://www.ncbi.nlm.nih.gov/nuccore/77628152?from=1&to=4304

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=595

http://www.genenames.org/data/hgnc_data.php?hgnc_id=1582

http://www.hprd.org/protein/01346

http://www.ncbi.nlm.nih.gov/omim/168461



http://www.ncbi.nlm.nih.gov/protein/16950655

http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS8191.1

http://www.ncbi.nlm.nih.gov/protein/16950655?from=226&to=226

Since 1989, centrally curated, annotations provided by the community -> curation bottleneck

AceDB: A C.elegans database

2001

2002

- To view massive amounts of sequencing data, genome browsers were developed.
- Annotations developed in “Annotation Jamborees”
- Human Genome Project Analysis Group: concept of annotation tracks
- Tracks produced and curated by different groups but stored on a centralized server -> bandwidth bottleneck

HUGO

Integrated Genome Browser and the Distributed Annotation System (DAS)

Outline
• Genome Browsing: Why was DAS developed?
• DAS: history, usage, and specification, reference implementation
• Integrated Genome Browser
• Examples

Decentralized curation of annotation tracks
Decentralized storage of annotation tracks

Distributed Annotation System: DAS

The distributed annotation system

Components:
1. Reference genome server (provides coordinates and sequence)
2. Annotation server(s) (provide annotation tracks)
3. Client (views annotations mapped onto the reference)

DAS basics

reference

Client (web or stand alone)

annotations

Dowell et al. 2001

Geodesic: Standalone client by Dowell et al. 2001

Source code: http://www.biodas.org/geodesic/

Glyphs: Graphic elements used for track display

DAS/2 servers (not listed in registry):
http://india907.server4you.de:8080/das2/genome (epigenome.at)
http://www.bioviz.org/das2/genome (Bioviz)
http://bioserver.hci.utah.edu:8080/DAS2DB/genome (UofUtahBioinfoCore)
http://netaffxdas.affymetrix.com/das2/genome (NetAffx)

Currently 1600 DAS/1 entries

Clients:

DAS registry (www.dasregistry.org)

http://www.biodas.org/wiki/DAS/2

Main difference: DAS/2 supports non-XML file formats
DAS/2 clients support DAS/1 but not vice versa

DAS/1 != DAS/2

2004-2007, NIH grant for DAS/2 development, partners: Affymetrix, Cold Spring Harbor Lab, the EBI/ Sanger Center, Dalke Scientific

DAS specification (www.biodas.org)

Sources: lists available genomes
Segments: lists chromosomes per genome
Types: lists types of annotation (file format etc.)
Features: lists annotation details in a specific region

DAS: Basic Query types: sources, segments, types, features

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/" >
  <MAINTAINER email="[email protected]" />
  <SOURCE uri="D_rerio" title="D_rerio" >
    <VERSION uri="danRer7" title="danRer7" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="H_sapiens" title="H_sapiens" >
    <VERSION uri="H_sapiens_Mar_2006" title="H_sapiens_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <COORDINATES uri="http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/" authority="NCBI" taxid="9606" version="36" source="Chromosome" />
      <CAPABILITY type="segments" query_uri="H_sapiens_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="H_sapiens_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="H_sapiens_Mar_2006/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus" >
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
    <VERSION uri="M_musculus_Mar_2006" title="M_musculus_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Mar_2006/features" />
    </VERSION>
  </SOURCE>
</SOURCES>

http://rubidio.ifom-ieo-campus.it:8080/das2/genome
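A client can discover which genome versions a DAS/2 server offers by parsing the sources response. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the response shown above; the function name list_versions is illustrative, and a real client would first fetch the document over HTTP.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the DAS/2 sources document.
DAS2_NS = {"das": "http://biodas.org/documents/das2"}

# Trimmed copy of the sources response, kept short for the example.
SOURCES_XML = """
<SOURCES xmlns="http://biodas.org/documents/das2">
  <SOURCE uri="D_rerio" title="D_rerio">
    <VERSION uri="danRer7" title="danRer7">
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus">
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007">
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
""".strip()


def list_versions(xml_text):
    """Map each genome version URI to its {capability type: query URI}."""
    root = ET.fromstring(xml_text)
    versions = {}
    for source in root.findall("das:SOURCE", DAS2_NS):
        for version in source.findall("das:VERSION", DAS2_NS):
            versions[version.get("uri")] = {
                cap.get("type"): cap.get("query_uri")
                for cap in version.findall("das:CAPABILITY", DAS2_NS)
            }
    return versions
```

The query URIs are relative, so a client resolves them against the xml:base of the response before issuing the segments, types, or features request.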

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/" uri="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments" >
  <SEGMENT uri="chr1" title="chr1" length="197195432" />
  <SEGMENT uri="chr2" title="chr2" length="181748087" />
  <SEGMENT uri="chr3" title="chr3" length="159599783" />
  <SEGMENT uri="chr4" title="chr4" length="155630120" />
  <SEGMENT uri="chr5" title="chr5" length="152537259" />
  <SEGMENT uri="chr6" title="chr6" length="149517037" />
  <SEGMENT uri="chr7" title="chr7" length="152524553" />
  <SEGMENT uri="chr8" title="chr8" length="131738871" />
  <SEGMENT uri="chr9" title="chr9" length="124076172" />
  <SEGMENT uri="chr10" title="chr10" length="129993255" />
  <SEGMENT uri="chr11" title="chr11" length="121843856" />
  <SEGMENT uri="chr12" title="chr12" length="121257530" />
  <SEGMENT uri="chr13" title="chr13" length="120284312" />
  <SEGMENT uri="chr14" title="chr14" length="125194864" />
  <SEGMENT uri="chr15" title="chr15" length="103494974" />
  <SEGMENT uri="chr16" title="chr16" length="98319150" />
  <SEGMENT uri="chr17" title="chr17" length="95272651" />
  <SEGMENT uri="chr18" title="chr18" length="90772031" />
  <SEGMENT uri="chr19" title="chr19" length="61342430" />
  <SEGMENT uri="chrX" title="chrX" length="166650296" />
  <SEGMENT uri="chrY" title="chrY" length="15902555" />
  <SEGMENT uri="chrM" title="chrM" length="16299" />
  <SEGMENT uri="chr1_random" title="chr1_random" length="1231697" />
  <SEGMENT uri="chr3_random" title="chr3_random" length="41899" />
  <SEGMENT uri="chr4_random" title="chr4_random" length="160594" />
  <SEGMENT uri="chr5_random" title="chr5_random" length="357350" />
  <SEGMENT uri="chr7_random" title="chr7_random" length="362490" />
  <SEGMENT uri="chr8_random" title="chr8_random" length="849593" />
  <SEGMENT uri="chr9_random" title="chr9_random" length="449403" />
  <SEGMENT uri="chr13_random" title="chr13_random" length="400311" />
  <SEGMENT uri="chr16_random" title="chr16_random" length="3994" />
  <SEGMENT uri="chr17_random" title="chr17_random" length="628739" />
  <SEGMENT uri="chrUn_random" title="chrUn_random" length="5900358" />
  <SEGMENT uri="chrX_random" title="chrX_random" length="1785075" />
  <SEGMENT uri="chrY_random" title="chrY_random" length="58682461" />
</SEGMENTS>

http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments

<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2" xml:base="http://localhost:8080/genopub/genome/M_musculus_Jul_2007/" >
  <TYPE uri="EML1/PU1_ChIP/Input" title="EML1/PU1_ChIP/Input" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="Input" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=11" />
    <PROP key="visibility" value="Members" />
  </TYPE>
  <TYPE uri="EML1/PU1_ChIP/PU1_A3" title="EML1/PU1_ChIP/PU1_A3" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="PU1_A3" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=7" />
    <PROP key="visibility" value="Members" />
  </TYPE>
</TYPES>

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/types

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2Fchr1;overlaps=79374747%3A81152999;type=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2FEML1%2FPU1_ChIP%2FPU1_B2;format=useq
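The features query above follows a regular pattern: four key=value pairs joined by semicolons, with each value percent-encoded. A minimal Python sketch that rebuilds this URL is shown here; the function name features_url and its parameters are illustrative, and the layout is inferred from the example URL rather than from the full DAS/2 specification.

```python
from urllib.parse import quote


def features_url(version_base, segment, start, end, type_path, fmt):
    """Build a DAS/2 features query for one region of one annotation type.

    version_base: URL of the genome version, without a trailing slash
    segment:      segment name, e.g. "chr1"
    start, end:   region of interest, sent as "start:end" in `overlaps`
    type_path:    annotation type URI relative to the version
    fmt:          requested file format, e.g. "useq"
    """
    params = [
        ("segment", f"{version_base}/{segment}"),
        ("overlaps", f"{start}:{end}"),
        ("type", f"{version_base}/{type_path}"),
        ("format", fmt),
    ]
    # safe="" also encodes ':' and '/', as in the example query.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in params)
    return f"{version_base}/features?{query}"
```

Percent-encoding the segment and type values is needed because they are themselves URLs and would otherwise clash with the separators of the outer query.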

Returns a file in useq format, essentially a zip file and the preferred format in IGB.
Contains an archiveReadMe.txt and one or more “slice files”.
Observations can be textual or numerical.

http://useq.sourceforge.net/useqArchiveFormat.html

Features

http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http://localhost:8080/genopub/genome/M_

A BED file (.bed) is a tab-delimited text file that defines a feature track; the file extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.

Note on the zero-based index: start and end positions are zero-based, and the end position is excluded. For example, start=1 and end=2 describe exactly one base, the second base (C) of the sequence ACGT.

Example:
track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB
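BED's zero-based, end-exclusive convention happens to match Python slicing, which makes it easy to check the ACGT example. This is a minimal sketch that reads only the first four BED columns; the function names are illustrative.

```python
def parse_bed_line(line):
    """Parse chrom, start, end and optional name from one BED data line.

    Fields are split on any whitespace, a simplification of the
    tab-delimited format that breaks if a name contains spaces.
    """
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name


def feature_length(start, end):
    """Number of bases covered: the end position is excluded."""
    return end - start


def feature_sequence(sequence, start, end):
    """Bases covered by a BED interval on the given sequence."""
    return sequence[start:end]
```

With start=1 and end=2 on ACGT, feature_sequence returns the single base C, and the cloneA feature above covers 4000 bases.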

Other important file formats: BED (textual)

The bedGraph format is line-oriented. BedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”. BedGraph track data values can be integer or real, positive or negative.

Chromosome positions are specified as 0-relative. The first chromosome position is 0. The last position in a chromosome of length N would be N - 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order.

The bedGraph format has four columns of data:
track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25

Intervals can be of any length and overlapping.
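The four-column layout lends itself to a very small parser. The Python sketch below skips the track definition line and looks up the value at a given 0-relative position, treating the end coordinate as excluded; the function names are illustrative, and the linear scan is only meant for small examples.

```python
def parse_bedgraph(lines):
    """Return (chrom, start, end, value) tuples from bedGraph text lines."""
    records = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser definition lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        records.append((chrom, int(start), int(end), float(value)))
    return records


def value_at(records, chrom, position):
    """Value of the first interval covering `position`, or None if no data."""
    for rec_chrom, start, end, value in records:
        if rec_chrom == chrom and start <= position < end:
            return value
    return None
```

Positions outside every interval return None, mirroring the rule that positions not specified have no data.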

Other important file formats: BEDGraph (numerical)

http://genome.ucsc.edu/goldenPath/help/customTrack.html

The wiggle (WIG) format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized. If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead. If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format. Chromosome positions are specified as 1-relative.

variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:

variableStep chrom=chrN [span=windowSize]
StartA dataValueA
StartB dataValueB

For example:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Both versions display a value of 12.5 at positions 300701-300705 on chromosome 2.
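The effect of the span option can be verified by expanding a variableStep block into one value per base. This is a minimal Python sketch that handles only variableStep declarations and data lines; the function name is illustrative.

```python
def expand_variable_step(lines):
    """Expand variableStep wiggle lines into {(chrom, position): value}."""
    chrom, span = None, 1
    values = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "variableStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))  # span defaults to one base
        else:
            if chrom is None:
                raise ValueError("data line before variableStep declaration")
            start, value = int(fields[0]), float(fields[1])
            # Each data line covers `span` consecutive 1-relative positions.
            for position in range(start, start + span):
                values[(chrom, position)] = value
    return values
```

Expanding the five-line version and the span=5 version of the chr2 example gives the same five positions, each with value 12.5.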

Other important file formats: Wig (“wiggle”)

http://genome.ucsc.edu/goldenPath/help/bedgraph.html

http://genome.ucsc.edu/goldenPath/help/bigWig.html

fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. It begins with a declaration line and is followed by a single column of data values. The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

displays the values 11, 22, and 33 as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively. Without span=5, they would be single-base regions at positions 400601, 400701, and 400801. Step and span are fixed for the entire data set.
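How start, step, and span interact in fixedStep can be made concrete by listing the region each value covers. In the UCSC wiggle documentation, a span of 5 makes every value cover five bases, while omitting span gives single-base regions. The Python sketch below, with an illustrative function name, returns inclusive 1-relative regions.

```python
def expand_fixed_step(lines):
    """Return (chrom, first, last, value) regions from fixedStep lines.

    `first` and `last` are inclusive, 1-relative positions.
    """
    regions = []
    chrom = position = None
    step = span = 1
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "fixedStep":
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            position = int(options["start"])
            step = int(options.get("step", 1))
            span = int(options.get("span", 1))
        else:
            if chrom is None:
                raise ValueError("data line before fixedStep declaration")
            regions.append((chrom, position, position + span - 1, float(fields[0])))
            position += step  # the next value starts `step` bases further on
    return regions
```

Because positions are implied by the declaration line, the data section needs only one column, which is what makes fixedStep the more compact variant.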

Other important file formats: Wig (“wiggle”)

http://genome.ucsc.edu/goldenPath/help/bedgraph.html

http://genome.ucsc.edu/goldenPath/help/bigWig.html

Data transfer level: random access is required. At the lowest layer, we take advantage of the byte-range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web.

URL data cache layer: a cache layer sits on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache.

Indexing: based on a one-dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; by default only every 512th item is indexed.

Compression: regions between indexed items (containing 512 items by default) are individually compressed (gzip).
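The interplay of random access and block caching can be sketched in a few lines of Python. This is an illustration of the idea only, not the UCSC implementation: the class BlockCache and the helper in_memory_fetcher are hypothetical, and the fetch function reads from an in-memory byte string where a real client would send an HTTP request with a Range header.

```python
BLOCK_SIZE = 8 * 1024  # blocks of 8 Kb, as described above


class BlockCache:
    """Serve arbitrary byte ranges from cached fixed-size blocks."""

    def __init__(self, fetch_range, block_size=BLOCK_SIZE):
        # fetch_range(start, end) must return bytes start..end-1 of the file.
        self._fetch_range = fetch_range
        self._block_size = block_size
        self._blocks = {}
        self.fetch_count = 0  # number of blocks actually fetched

    def _block(self, index):
        if index not in self._blocks:
            start = index * self._block_size
            self._blocks[index] = self._fetch_range(start, start + self._block_size)
            self.fetch_count += 1
        return self._blocks[index]

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`."""
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        skip = offset - first * self._block_size
        return data[skip:skip + length]


def in_memory_fetcher(payload):
    """Stand-in for a remote file: slices an in-memory byte string."""
    def fetch_range(start, end):
        return payload[start:end]
    return fetch_range
```

A read that straddles a block boundary triggers two fetches, and later reads inside those blocks are answered from the cache without further transfers.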

BigWig and BigBed

Basic architecture
• Flex
• Apache Tomcat 6 / Glassfish
• Object-relational mapping via Hibernate
• MySQL

DAS/2 server reference implementation: http://sourceforge.net/projects/genoviz

Database tables

User table

Annotation table

User role table

Message Digest 5 (MD5) hashing from the java.security package (MD5 is a one-way hash function rather than encryption)
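In Java, an MD5 digest is obtained through java.security.MessageDigest.getInstance("MD5"). The equivalent computation in Python uses the standard hashlib module, as sketched here; the function name md5_hex is illustrative, and how the server applies the digest is not shown on the slide.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of `text` as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

The digest has a fixed length regardless of the input and cannot be reversed into the original text, which is why it is a hash and not an encryption.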

Table views

Each file gets its own folder (folder names are assigned automatically). File names, which may contain unsupported characters, therefore do not need to be stored in the DB.

Data storage directory

Visibility levels:

DAS2 administration user interface

If you want to access data with restricted visibility, you must be inserted in the user table and be part of a group that is headed by the owner of the data.
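The access rule stated above can be written down as a small predicate. The Python sketch below is an assumption-laden illustration: the Group class, the can_access function, and the "Public" level are hypothetical, and only the "Members" visibility value appears in the server responses shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    owner: str                                 # user who heads the group
    members: set = field(default_factory=set)  # users belonging to the group


def can_access(user, data_owner, visibility, users, groups):
    """Decide whether `user` may see data owned by `data_owner`.

    users:  set of names present in the user table
    groups: list of Group objects
    """
    if visibility == "Public":  # hypothetical unrestricted level
        return True
    if user not in users:       # restricted data requires a user table entry
        return False
    if user == data_owner:
        return True
    # The user must belong to a group headed by the owner of the data.
    return any(g.owner == data_owner and user in g.members for g in groups)
```

Keeping the rule in one function makes it easy to apply the same check to every query before annotation data are returned.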

Users and groups setup

Every user, admin or non-admin, can change their password, load new data, add data descriptions, and set visibility levels.

Non-administrator users interface

IGB user identification

jdbcRealm and ldapRealm: both work

NetAffx and UCSC hg19 annotations

All these annotations are one click away from the user

Conclusions

DAS2 servers:
• provide distributed genome annotations
• support a fine-grained security model
• perform parsing of data for custom genome views
