Genomic Computing, DEIB, 4-7 March 2013
Architecture of Distributed Annotation Server Heiko Muller
Computational Research IIT@SEMM [email protected]
The genomic data surge
[Figure: timeline of genomic technologies, 1980s to 2010s: PCR, Southern blot, RFLP, Northern blot, differential display, SAGE, genome draft, SNP chips, SNP beads, expression chips, sequencers, genome browsers]
IFOM-IEO-CAMPUS, since 2009: 1800 samples, 16 TB of raw data, 70 TB of processed data
NGS data flow
The current situation:
1. Biologist fills in request form and sends it to [email protected]
2. Data are inserted into LIMS and request IDs are sent back to biologist
3. Samples are sequenced and run data are inserted into LIMS
4. LIMS prepares sample sheets that are used for demultiplexing and bcl->fastq conversion
5. FastQC is run for quality control
6. FASTQ data are saved on IIT-Isilon device and hard links are produced in user folders
7. Group bioinformaticians align and analyze data
8. Group bioinformaticians interact with biologists to interpret results
Request LIMS->FASTQ bioinformaticians Elaborated data sets
homogeneous heterogeneous
Request form
Request submission
LIMS: Laboratory Information Management System
N:N (many-to-many) relations: sample and request, sample and application, sample and run
LIMS is backed by a MySQL database
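An N:N relation such as sample/run is usually stored through a link table. The sketch below illustrates the idea only: the table and column names (`sample`, `run`, `sample_run`, `lane`) are hypothetical and are not taken from the actual LIMS schema, and SQLite stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite database; the real LIMS uses MySQL (names below are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE run    (id INTEGER PRIMARY KEY, flowcell TEXT NOT NULL);
-- Link table: one row per (sample, run, lane) combination.
CREATE TABLE sample_run (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    run_id    INTEGER NOT NULL REFERENCES run(id),
    lane      INTEGER NOT NULL,
    PRIMARY KEY (sample_id, run_id, lane)
);
""")
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "S1"), (2, "S2")])
conn.executemany("INSERT INTO run VALUES (?, ?)", [(1, "FC_A"), (2, "FC_B")])
# S1 and S2 share lane 1 of run FC_A (multiplexing); S1 is also on run FC_B.
conn.executemany("INSERT INTO sample_run VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 1), (1, 2, 3)])


def runs_for_sample(name):
    """Return (flowcell, lane) pairs for one sample, ordered by flowcell."""
    rows = conn.execute(
        "SELECT run.flowcell, sample_run.lane FROM sample "
        "JOIN sample_run ON sample_run.sample_id = sample.id "
        "JOIN run ON run.id = sample_run.run_id "
        "WHERE sample.name = ? ORDER BY run.flowcell",
        (name,),
    )
    return rows.fetchall()
```

The same pattern would apply to the sample/request and sample/application relations.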
LIMS: NGS reagents
Illumina HiSeq
Each lane contains more than one sample (multiplexing)
180 million clusters per lane
LIMS: NGS runs
LIMS: Samplesheet
http://hilt.iit.ieo.eu/data/SampleSheets/
Demultiplexing
Samplesheet
Script generator
Run on IIT blades, manually or via Process proc = Runtime.getRuntime().exec(command);
Output: FASTQ files
FASTQ = “FASTA with Qualities”
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
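The four-line record structure described above can be read with a short parser. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence lines); the names `FastqRecord` and `read_fastq` are made up for this example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # header line without the leading '@'
    sequence: str  # raw sequence letters
    quality: str   # one quality symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The example record shown on the slide.
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(example)))
```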
“FASTA” is short for “FAST-All”: unlike the FASTP protein aligner described in 1985, it works with all alphabets (DNA:DNA, translated protein:DNA). FASTA is also a universal file format for sequences. Example:
>header1
acgtgatgc
>header2
cgtgatgca
. .
Wikipedia
Quality control
Why do quality control?
Colors get blurred with increasing cycle number.
Base call qualities
• If p is the probability that the base call is wrong, the Phred score is: Q = -10 log10(p), or equivalently p = 10^(-Q/10)
• The score is written with the character whose ASCII code is Q+33 (Sanger Institute standard). (ASCII = American Standard Code for Information Interchange)
Phred=classical base calling program
Q   p (call incorrect)   ASCII code   symbol
0 1.000000 33 !
1 0.794328 34 "
2 0.630957 35 #
3 0.501187 36 $
4 0.398107 37 %
5 0.316228 38 &
6 0.251189 39 '
7 0.199526 40 (
8 0.158489 41 )
9 0.125893 42 *
10 0.100000 43 +
11 0.079433 44 ,
12 0.063096 45 -
13 0.050119 46 .
14 0.039811 47 /
15 0.031623 48 0
16 0.025119 49 1
17 0.019953 50 2
18 0.015849 51 3
19 0.012589 52 4
20 0.010000 53 5
21 0.007943 54 6
22 0.006310 55 7
23 0.005012 56 8
24 0.003981 57 9
25 0.003162 58 :
26 0.002512 59 ;
27 0.001995 60 <
28 0.001585 61 =
29 0.001259 62 >
30 0.001000 63 ?
31 0.000794 64 @
32 0.000631 65 A
33 0.000501 66 B
34 0.000398 67 C
35 0.000316 68 D
36 0.000251 69 E
37 0.000200 70 F
38 0.000158 71 G
39 0.000126 72 H
40 0.000100 73 I
[Figure: probability that the call is wrong, shown on a logarithmic scale from 1 down to 0.001]
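The relation between Q, p and the ASCII symbol can be checked with a few lines of Python. This is a small sketch assuming the Phred+33 (Sanger) encoding described above; the function names are chosen for this example only.

```python
import math
from typing import List

PHRED_OFFSET = 33  # Sanger standard: symbol = chr(Q + 33)


def error_probability(q: float) -> float:
    """p = 10^(-Q/10)"""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Q = -10 * log10(p)"""
    return -10 * math.log10(p)


def decode_quality(quality: str) -> List[int]:
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(symbol) - PHRED_OFFSET for symbol in quality]


def encode_quality(scores: List[int]) -> str:
    """Turn a list of Phred scores into a FASTQ quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)
```

For example, `decode_quality("!5I")` gives the scores 0, 20 and 40, which correspond to error probabilities of 1, 0.01 and 0.0001 in the table.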
Aligning sequence reads
The first task after obtaining your reads (and any QC) is to determine the site in the reference genome from which each read derives. This is termed genome alignment (also known as ‘mapping’).
Short-read aligners base their algorithm on one of these ideas:
• use spaced-seed indexing
• hash seed words from the reference
• hash seed words from the reads
• sort reference words and reads lexicographically
• use the Burrows-Wheeler transform (BWT)
BWT seems to be the winning idea (very fast, sufficiently accurate), and is used by the newest tools (Bowtie, SOAPv2, BWA).
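To give a feeling for the BWT idea, here is a toy sketch that builds the transform from sorted rotations and counts exact pattern matches by backward search. It is a teaching example only: real aligners such as Bowtie or BWA use compressed indexes with sampled occurrence tables and handle mismatches, none of which is shown here. The `$` end marker is assumed not to occur in the text or the pattern.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short strings)."""
    text += "$"  # end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch: str, i: int) -> int:
        # occurrences of ch in bwt_text[:i]; real tools precompute this table
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The pattern is consumed from its last character to its first, and each step narrows the range of rotations that start with the suffix matched so far; the final range size is the number of hits.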
http://samtools.sourceforge.net/SAM1.pdf
SAM example
Viewing BAM files in Integrated Genome Browsers
File sizes
fastq
bam bigWig
[Diagram: users and service; application server; Sun Grid Engine for number crunching on blades and GPU; Isilon storage (200 TB)]
hardware                  price    RAM (GB)   processors
application server        7100     16         24
blade system              33221    256        48
GPU Tesla M2050           9888     48         24
Isilon storage (200 TB)   230000
Getting data to the user: IIT@SEMM infrastructure
Mainly FastQ
Illumina HiSeq2000
Application server, blades, GPU
Isilon storage
Visualizing NGS data: Genome Browsers
Visualizing genomic data: What is a “Genome Browser”
• linear representation of a genome
• position-based annotations, each called a track
– continuous annotations: e.g. conservation
– interval annotations: e.g. gene, read alignment
– point annotations: e.g. SNPs
• user specifies a subsection of genome to look at
Comparison of Genome Browsers
Browser   Reference                           Model
UCSC      http://genome.ucsc.edu/             Server
Ensembl   http://www.ensembl.org/index.html   Server
IGV       http://www.broadinstitute.org/igv/  Client
IGB       http://bioviz.org/igb/              Client
Compared on: interactivity, HTS support, database of tracks, plugins (each rated as no support, some support, or good support)
Server model:
  Server: central data store, renders images, sends them to the client
  Client: requests images, displays images
Client model:
  Server: stores data
  Client: local HTS store, renders images, displays images
Limitations:
• do not support multiple genomes simultaneously
• do not capture 3-dimensional conformation
• do not capture spatial or temporal information
• do not integrate well with analytics
• Browse many eukaryotic genomes (yeast to human)
• Most annotations are there
• Important evolutionary and variation data representation.
• Very flexible and configurable views
• Graphical and table views
• Upload your data into custom tracks and share with colleagues
• Client/server application with its issues, but a great app!
About UCSC Genome Browser
http://genome.ucsc.edu
Integrated Genome Browser and IIT DAS2 server
Integrated Genome Browser and published genome annotations
Genome browser view: ChIP-seq
.bam .bed .bigWig
Genome browser view: sequencing errors
Integrated Genome Browser and the Distributed Annotation System (DAS)
Outline
• Genome Browsing: Why was DAS developed?
• DAS: history, usage, specification, and reference implementation
• Integrated Genome Browser
• Examples
Frederick Sanger
Genbank
Centralized repository; sequences are owned by the submitter
Genbank
LOCUS       NM_053056               4304 bp    mRNA    linear   PRI 27-MAY-2012
DEFINITION  Homo sapiens cyclin D1 (CCND1), mRNA.
ACCESSION   NM_053056 NM_001758
VERSION     NM_053056.2  GI:77628152
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4304)
  AUTHORS   Li,Q., Dong,Q. and Wang,E.
  TITLE     Rsf-1 is overexpressed in non-small cell lung cancers and regulates
            cyclinD1 expression and ERK activity
  JOURNAL   Biochem. Biophys. Res. Commun. 420 (1), 6-10 (2012)
   PUBMED   22387541
  REMARK    GeneRIF: Rsf-1 is overexpressed in non-small cell lung cancers and
            contributes to malignant cell growth by cyclin D1 and ERK
            modulation.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-138               BM796500.1         1-138
            139-1278            BC001501.2         73-1212
            1279-4077           AP001888.4         12952-15750
            4078-4304           X59798.1           4018-4244
FEATURES             Location/Qualifiers
     source          1..4304
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11q13"
     gene            1..4304
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="cyclin D1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
     exon            1..407
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /inference="alignment:Splign"
                     /number=1
     CDS             210..1097
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /note="B-cell CLL/lymphoma 1; BCL-1 oncogene; PRAD1
                     oncogene; B-cell lymphoma 1 protein"
                     /codon_start=1
                     /product="G1/S-specific cyclin-D1"
                     /protein_id="NP_444284.1"
                     /db_xref="GI:16950655"
                     /db_xref="CCDS:CCDS8191.1"
                     /db_xref="GeneID:595"
                     /db_xref="HGNC:1582"
                     /db_xref="HPRD:01346"
                     /db_xref="MIM:168461"
                     /translation="MEHQLLCCEVETIRRAYPDANLLNDRVLRAMLKAEETCAPSVSY
                     FKCVQKEVLPSMRKIVATWMLEVCEEQKCEEEVFPLAMNYLDRFLSLEPVKKSRLQLL
                     GATCMFVASKMKETIPLTAEKLCIYTDNSIRPEELLQMELLLVNKLKWNLAAMTPHDF
                     IEHFLSKMPEAEENKQIIRKHAQTFVALCATDVKFISNPPSMVAAGSVVAAVQGLNLR
                     SPNNFLSYYRLTRFLSRVIKCDPDCLRACQEQIEALLESSLRQAQQNMDPKAAEEEEE
                     EEEEVDLACTPTDVRDVDI"
     misc_feature    885..887
                     /gene="CCND1"
                     /gene_synonym="BCL1; D11S287E; PRAD1; U21B31"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Phosphotyrosine; propagated from
                     UniProtKB/Swiss-Prot (P24385.1); phosphorylation site"
ORIGIN
        1 cacacggact acaggggagt tttgttgaag ttgcaaagtc ctggagcctc cagagggctg
       61 tcggcgcagt agcagcgagc agcagagtcc gcacgctccg gcgaggggca gaagagcgcg
      121 agggagcgcg gggcagcaga agcgagagcc gagcgcggac ccagccagga cccacagccc
      181 tccccagctg cccaggaaga gccccagcca tggaacacca gctcctgtgc tgcgaagtgg
      241 aaaccatccg ccgcgcgtac cccgatgcca acctcctcaa cgaccgggtg ctgcgggcca
      301 tgctgaaggc ggaggagacc tgcgcgccct cggtgtccta cttcaaatgt gtgcagaagg
      361 aggtcctgcc gtccatgcgg aagatcgtcg ccacctggat gctggaggtc tgcgaggaac
      421 agaagtgcga ggaggaggtc ttcccgctgg ccatgaacta cctggaccgc ttcctgtcgc
      481 tggagcccgt gaaaaagagc cgcctgcagc tgctgggggc cacttgcatg ttcgtggcct
      541 ctaagatgaa ggagaccatc cccctgacgg ccgagaagct gtgcatctac accgacaact
      601 ccatccggcc cgaggagctg ctgcaaatgg agctgctcct ggtgaacaag ctcaagtgga
      661 acctggccgc aatgaccccg cacgatttca ttgaacactt cctctccaaa atgccagagg
      721 cggaggagaa caaacagatc atccgcaaac acgcgcagac cttcgttgcc ctctgtgcca
//
A Genbank entry
By design, annotations are nearly impossible to incorporate
Since 1989, centrally curated, annotations provided by the community -> curation bottleneck
AceDB: A C. elegans database
2001
2002
- To view massive amounts of sequencing data, genome browsers were developed.
- Annotations were developed in “Annotation Jamborees”.
- Human Genome Project Analysis Group: concept of annotation tracks.
- Tracks are produced and curated by different groups but stored on a centralized server -> bandwidth bottleneck
HUGO
Decentralized curation of annotation tracks
Decentralized storage of annotation tracks
Distributed Annotation System: DAS
The distributed annotation system components:
1. Reference genome server (provides coordinates and sequence)
2. Annotation server(s) (provide annotation tracks)
3. Client (views annotations mapped onto the reference)
DAS basics
reference
Client (web or stand alone)
annotations
Dowell et al. 2001
Geodesic: Standalone client by Dowell et al. 2001
Source code: http://www.biodas.org/geodesic/
Glyphs: Graphic elements used for track display
DAS/2 (not listed in registry):
http://india907.server4you.de:8080/das2/genome (epigenome.at)
http://www.bioviz.org/das2/genome (Bioviz)
http://bioserver.hci.utah.edu:8080/DAS2DB/genome (UofUtahBioinfoCore)
http://netaffxdas.affymetrix.com/das2/genome (NetAffx)
Currently 1600 DAS/1 entries Clients:
DAS registry (www.dasregistry.org)
http://www.biodas.org/wiki/DAS/2
Main difference: DAS/2 supports non-XML file formats.
DAS/2 clients support DAS/1, but not vice versa.
DAS/1 != DAS/2
2004-2007: NIH grant for DAS/2 development. Partners: Affymetrix, Cold Spring Harbor Lab, the EBI/Sanger Center, Dalke Scientific
DAS specification (www.biodas.org)
Sources: list available genomes
Segments: list chromosomes per genome
Types: list types of annotation (file format etc.)
Features: list annotation details in a specific region
DAS: Basic Query types: sources, segments, types, features
<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/" >
  <MAINTAINER email="[email protected]" />
  <SOURCE uri="D_rerio" title="D_rerio" >
    <VERSION uri="danRer7" title="danRer7" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="H_sapiens" title="H_sapiens" >
    <VERSION uri="H_sapiens_Mar_2006" title="H_sapiens_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <COORDINATES uri="http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/" authority="NCBI" taxid="9606" version="36" source="Chromosome" />
      <CAPABILITY type="segments" query_uri="H_sapiens_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="H_sapiens_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="H_sapiens_Mar_2006/features" />
    </VERSION>
  </SOURCE>
  <SOURCE uri="M_musculus" title="M_musculus" >
    <VERSION uri="M_musculus_Jul_2007" title="M_musculus_Jul_2007" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Jul_2007/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Jul_2007/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Jul_2007/features" />
    </VERSION>
    <VERSION uri="M_musculus_Mar_2006" title="M_musculus_Mar_2006" created="2012-05-05T16:47:27+0200" >
      <CAPABILITY type="segments" query_uri="M_musculus_Mar_2006/segments" />
      <CAPABILITY type="types" query_uri="M_musculus_Mar_2006/types" />
      <CAPABILITY type="features" query_uri="M_musculus_Mar_2006/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
http://rubidio.ifom-ieo-campus.it:8080/das2/genome
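A client can discover the available genome versions and their query URIs by parsing the sources document. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the response above; the function name `list_versions` is made up for this example, and fetching the document over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the SOURCES root element.
DAS2_NS = {"das": "http://biodas.org/documents/das2"}

# Shortened copy of the sources response (one source, one version).
SAMPLE_SOURCES = """
<SOURCES xmlns="http://biodas.org/documents/das2">
  <SOURCE uri="D_rerio" title="D_rerio">
    <VERSION uri="danRer7" title="danRer7">
      <CAPABILITY type="segments" query_uri="danRer7/segments" />
      <CAPABILITY type="types" query_uri="danRer7/types" />
      <CAPABILITY type="features" query_uri="danRer7/features" />
    </VERSION>
  </SOURCE>
</SOURCES>
"""


def list_versions(sources_xml: str) -> dict:
    """Map each genome version URI to its {capability type: query_uri} table."""
    root = ET.fromstring(sources_xml)
    versions = {}
    for source in root.findall("das:SOURCE", DAS2_NS):
        for version in source.findall("das:VERSION", DAS2_NS):
            versions[version.get("uri")] = {
                cap.get("type"): cap.get("query_uri")
                for cap in version.findall("das:CAPABILITY", DAS2_NS)
            }
    return versions


versions = list_versions(SAMPLE_SOURCES)
```

The `query_uri` values are relative; a client resolves them against the `xml:base` of the response to obtain the segments, types and features URLs.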
<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://biodas.org/documents/das2" xml:base="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/" uri="http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments" >
  <SEGMENT uri="chr1" title="chr1" length="197195432" />
  <SEGMENT uri="chr2" title="chr2" length="181748087" />
  <SEGMENT uri="chr3" title="chr3" length="159599783" />
  <SEGMENT uri="chr4" title="chr4" length="155630120" />
  <SEGMENT uri="chr5" title="chr5" length="152537259" />
  <SEGMENT uri="chr6" title="chr6" length="149517037" />
  <SEGMENT uri="chr7" title="chr7" length="152524553" />
  <SEGMENT uri="chr8" title="chr8" length="131738871" />
  <SEGMENT uri="chr9" title="chr9" length="124076172" />
  <SEGMENT uri="chr10" title="chr10" length="129993255" />
  <SEGMENT uri="chr11" title="chr11" length="121843856" />
  <SEGMENT uri="chr12" title="chr12" length="121257530" />
  <SEGMENT uri="chr13" title="chr13" length="120284312" />
  <SEGMENT uri="chr14" title="chr14" length="125194864" />
  <SEGMENT uri="chr15" title="chr15" length="103494974" />
  <SEGMENT uri="chr16" title="chr16" length="98319150" />
  <SEGMENT uri="chr17" title="chr17" length="95272651" />
  <SEGMENT uri="chr18" title="chr18" length="90772031" />
  <SEGMENT uri="chr19" title="chr19" length="61342430" />
  <SEGMENT uri="chrX" title="chrX" length="166650296" />
  <SEGMENT uri="chrY" title="chrY" length="15902555" />
  <SEGMENT uri="chrM" title="chrM" length="16299" />
  <SEGMENT uri="chr1_random" title="chr1_random" length="1231697" />
  <SEGMENT uri="chr3_random" title="chr3_random" length="41899" />
  <SEGMENT uri="chr4_random" title="chr4_random" length="160594" />
  <SEGMENT uri="chr5_random" title="chr5_random" length="357350" />
  <SEGMENT uri="chr7_random" title="chr7_random" length="362490" />
  <SEGMENT uri="chr8_random" title="chr8_random" length="849593" />
  <SEGMENT uri="chr9_random" title="chr9_random" length="449403" />
  <SEGMENT uri="chr13_random" title="chr13_random" length="400311" />
  <SEGMENT uri="chr16_random" title="chr16_random" length="3994" />
  <SEGMENT uri="chr17_random" title="chr17_random" length="628739" />
  <SEGMENT uri="chrUn_random" title="chrUn_random" length="5900358" />
  <SEGMENT uri="chrX_random" title="chrX_random" length="1785075" />
  <SEGMENT uri="chrY_random" title="chrY_random" length="58682461" />
</SEGMENTS>
http://rubidio.ifom-ieo-campus.it:8080/das2/genome/M_musculus_Jul_2007/segments
<?xml version="1.0" encoding="UTF-8"?>
<TYPES xmlns="http://biodas.org/documents/das2" xml:base="http://localhost:8080/genopub/genome/M_musculus_Jul_2007/" >
  <TYPE uri="EML1/PU1_ChIP/Input" title="EML1/PU1_ChIP/Input" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="Input" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=11" />
    <PROP key="visibility" value="Members" />
  </TYPE>
  <TYPE uri="EML1/PU1_ChIP/PU1_A3" title="EML1/PU1_ChIP/PU1_A3" >
    <FORMAT name="useq" />
    <PROP key="Normalization" value="N" />
    <PROP key="group" value="Alcalay" />
    <PROP key="group_contact" value="Myriam Alcalay" />
    <PROP key="group_email" value="[email protected]" />
    <PROP key="name" value="PU1_A3" />
    <PROP key="owner" value="Alcalay, Myriam" />
    <PROP key="owner_email" value="IEO" />
    <PROP key="owner_institute" value="[email protected]" />
    <PROP key="url" value="http://localhost:8080/genopub/genopub?idAnnotation=7" />
    <PROP key="visibility" value="Members" />
  </TYPE>
</TYPES>
http://localhost:8080/genopub/genome/M_musculus_Jul_2007/types
http://localhost:8080/genopub/genome/M_musculus_Jul_2007/features?segment=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2Fchr1;overlaps=79374747%3A81152999;type=http%3A%2F%2Flocalhost%3A8080%2Fgenopub%2Fgenome%2FM_musculus_Jul_2007%2FEML1%2FPU1_ChIP%2FPU1_B2;format=useq
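The features query above is a base URL followed by percent-encoded filters separated by semicolons. The following sketch rebuilds it with `urllib.parse.quote`; the helper name `features_url` and its parameter names are invented for this example, and the filter order simply mirrors the URL shown on the slide.

```python
from urllib.parse import quote


def features_url(base, version, segment, start, end, type_uri, fmt):
    """Build a DAS/2 features query like the one shown above."""
    root = f"{base}/{version}"
    filters = [
        ("segment", f"{root}/{segment}"),  # full URI of the chromosome
        ("overlaps", f"{start}:{end}"),    # region of interest
        ("type", f"{root}/{type_uri}"),    # full URI of the annotation type
        ("format", fmt),                   # requested file format
    ]
    # safe="" forces ':' and '/' inside the values to be percent-encoded.
    query = ";".join(f"{key}={quote(value, safe='')}" for key, value in filters)
    return f"{root}/features?{query}"


url = features_url(
    "http://localhost:8080/genopub/genome",
    "M_musculus_Jul_2007",
    "chr1",
    79374747,
    81152999,
    "EML1/PU1_ChIP/PU1_B2",
    "useq",
)
```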
Returns a file in useq format, which is essentially a zip file and the preferred format in IGB. It contains an archiveReadMe.txt and one or more “slice files”. Observations can be textual or numerical.
http://useq.sourceforge.net/useqArchiveFormat.html
Features
A BED file (.bed) is a tab-delimited text file that defines a feature track. The file extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.
Notes: Zero-based index: start and end positions are identified using a zero-based index. The end position is excluded. For example, setting start-end to 1-2 describes exactly one base, the second base in the sequence (the C in ACGT).
track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB
Other important file formats: BED (textual)
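BED coordinates are zero-based with the end excluded, which is the same convention as Python slicing. The sketch below shows this on the ACGT example and on one of the BED lines above; the helper names are made up for this example, and only the first four BED columns are handled.

```python
def parse_bed_line(line: str):
    """Parse chrom, start, end and optional name from one BED data line."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name


def feature_length(start: int, end: int) -> int:
    """Zero-based, end-exclusive: the length is simply end - start."""
    return end - start


def to_one_based_inclusive(start: int, end: int):
    """Convert a BED interval to 1-based inclusive coordinates (as used by wiggle)."""
    return start + 1, end


# start-end 1-2 selects exactly the second base, like a Python slice.
second_base = "ACGT"[1:2]

chrom, start, end, name = parse_bed_line("Chr22\t1000\t5000\tcloneA")
```

Here `second_base` is `"C"`, and the cloneA feature covers 4000 bases, corresponding to positions 1001 to 5000 in 1-based inclusive numbering.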
The bedGraph format is line-oriented. BedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”. BedGraph track data values can be integer or real, positive or negative values. Chromosome positions are specified as 0-relative. The first chromosome position is 0. The last position in a chromosome of length N would be N - 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order. The bedGraph format has four columns of data:
track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
Intervals can be of any length and overlapping.
Other important file formats: BEDGraph (numerical)
The wiggle (WIG) format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized. If you need to display continuous data that is sparse or contains elements of varying size, use the bedGraph format instead. If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format. Chromosome positions are specified as 1-relative.
variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. It begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep chrom=chrN [span=windowSize]
StartA dataValueA
StartB dataValueB
For example:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
Both versions display a value of 12.5 at positions 300701-300705 on chromosome 2.
Other important file formats: Wig (“wiggle”)
fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. It begins with a declaration line and is followed by a single column of data values. The declaration line starts with the word fixedStep and includes specifications for chromosome, start coordinate, and step size. The span specification has the same meaning as in variableStep format. For example, this fixedStep specification:
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
displays the values 11, 22, and 33 as 5-base regions on chromosome 3 at positions 400601-400605, 400701-400705, and 400801-400805, respectively. Without span=5, the same values would be displayed as single-base regions at positions 400601, 400701, and 400801. Step and span are fixed for the entire data set.
Other important file formats: Wig (“wiggle”)
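The effect of the declaration line is easiest to see by expanding wiggle data into explicit intervals. This is a simplified sketch, not a complete wiggle parser: the function name `expand_wiggle` is made up for this example, track and comment lines are skipped, and coordinates are returned 1-based and inclusive as in the format itself.

```python
def expand_wiggle(lines):
    """Expand variableStep/fixedStep data into (chrom, start, end, value) tuples."""
    intervals = []
    mode = chrom = None
    span, step, position = 1, 1, None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip blank, track definition and comment lines
        if line.startswith(("variableStep", "fixedStep")):
            fields = line.split()
            mode = fields[0]
            options = dict(field.split("=", 1) for field in fields[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))
            if mode == "fixedStep":
                position = int(options["start"])
                step = int(options.get("step", 1))
            continue
        if mode == "variableStep":
            start_text, value_text = line.split()
            start, value = int(start_text), float(value_text)
        elif mode == "fixedStep":
            start, value = position, float(line)
            position += step  # the next value starts one step further
        else:
            raise ValueError("data line before declaration line")
        intervals.append((chrom, start, start + span - 1, value))
    return intervals


variable = expand_wiggle(["variableStep chrom=chr2 span=5", "300701 12.5"])
fixed = expand_wiggle(
    ["fixedStep chrom=chr3 start=400601 step=100 span=5", "11", "22", "33"]
)
```

With these inputs, `variable` holds one interval covering chr2:300701-300705, and `fixed` holds three 5-base intervals starting at 400601, 400701 and 400801.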
Data transfer layer. Required: random access. At the lowest layer, we take advantage of the byte-range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web.
URL data cache layer. A cache layer sits on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache.
Indexing. The index is based on a single-dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; in fact, by default only every 512th item is indexed.
Compression. Regions between indexed items (containing 512 items by default) are individually compressed (gzip).
BigWig and BigBed
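The cache layer described above can be illustrated with a small block cache. This sketch is not the UCSC implementation: the class name `BlockCache` and the `fetch_range(offset, length)` callable are assumptions made for this example, and an in-memory byte string stands in for the remote file. Over HTTP, `fetch_range` would issue a request with a `Range: bytes=first-last` header.

```python
BLOCK_SIZE = 8 * 1024  # blocks of 8 Kb, as described above


class BlockCache:
    """Serve arbitrary byte ranges from block-aligned, cached fetches."""

    def __init__(self, fetch_range, block_size=BLOCK_SIZE):
        self._fetch_range = fetch_range  # callable(offset, length) -> bytes
        self._block_size = block_size
        self._blocks = {}                # block index -> cached bytes
        self.fetch_count = 0             # number of remote fetches performed

    def _block(self, index):
        if index not in self._blocks:
            offset = index * self._block_size
            self._blocks[index] = self._fetch_range(offset, self._block_size)
            self.fetch_count += 1
        return self._blocks[index]

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`, fetching only missing blocks."""
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        skip = offset - first * self._block_size
        return data[skip:skip + length]


# Stand-in for a remote bigWig/bigBed file: 25,600 bytes held in memory.
remote_file = bytes(range(256)) * 100


def fetch_from_memory(offset, length):
    return remote_file[offset:offset + length]


cache = BlockCache(fetch_from_memory)
```

A read that crosses a block boundary triggers two fetches; later reads that fall inside already cached blocks trigger none, which is what makes browsing a small genomic region of a large remote file cheap.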
Basic architecture Object relational mapping Via Hibernate
Flex
Apache Tomcat 6 Glassfish
MySQL
DAS/2 server reference implementation: http://sourceforge.net/projects/genoviz
Database tables
User table
Annotation table
User role table. Message Digest 5 (MD5) hashing from the java.security package
Table views
Each file gets its own folder (folder names are assigned automatically). No filenames, which may contain unsupported characters, need to be stored in the DB.
Data storage directory
Visibility levels:
DAS2 administration user interface
To access data with restricted visibility, you must be registered in the user table and belong to a group that is headed by the owner of the data.
Users and groups setup
Every user, admin or non-admin, can change their password, load new data, add data descriptions, and set visibility levels.
Non-administrator users interface
IGB user identification
jdbcRealm and ldapRealm: both work
NetAffx and UCSC hg19 annotations
All these annotations are one click away from the user
Conclusions
DAS2 servers:
• provide distributed genome annotations
• support a fine-grained security model
• perform parsing of data for custom genome views