High-Performance In-Memory Genome (HIG) Project
HIG Project Overview
August 31, 2012
Matthieu-P. Schapranow, Hasso Plattner Institute
Chair of Prof. Hasso Plattner
Vision: Real-time Analysis of Genomic Data to Improve Medical Treatment
HIG Project Overview, M. Schapranow, Aug 31, 2012
Build up the Whole Picture out of Layers
■ Data:
□ Combine research findings from int’l scientific databases in single system at HPI
■ Platform:
□ Expose information as a service to be consumed by special purpose applications
■ Applications:
□ Support genome alignment pipeline processing by massively parallel execution of:
  □ Alignment algorithms, e.g. BWA, BT2, etc.
  □ Variant calling
□ Analyze individual patient results (real-time annotations with combined data)
□ Analyze patient cohorts using individual filters
How the Vision Becomes Real
■ Platform:
□ Worker Framework: Enables parallel execution of tasks (alignment, variant calling) across node limits
□ Updating Framework: Retrieves periodic updates from international databases and automatically integrates them into the local store
■ Applications:
□ Alignment Coordinator: Submit alignment tasks and retrieve mutation lists, e.g. CSV
□ Genome Browser: Interactive browsing in reference and specific patient genomes
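The idea behind the Worker Framework, splitting a batch of reads into chunks and processing them in parallel, can be sketched roughly as follows. This is only an illustration under stated assumptions: `align_chunk` and `run_parallel` are hypothetical names, the "aligner" is a toy exact-match search rather than BWA or Bowtie2, and a local thread pool stands in for execution across cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def align_chunk(reference, reads):
    """Toy stand-in for an aligner: exact substring search for each read.

    Returns (read, position) pairs; position is -1 if the read is not found.
    """
    return [(read, reference.find(read)) for read in reads]


def run_parallel(reference, reads, workers=4):
    """Split reads into chunks and process the chunks on a pool of workers."""
    chunk_size = max(1, -(-len(reads) // workers))  # ceiling division
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the submitted chunks
        results = pool.map(align_chunk, [reference] * len(chunks), chunks)
    return [hit for chunk in results for hit in chunk]


if __name__ == "__main__":
    reference = "ACGTACGTTTGA"
    for read, pos in run_parallel(reference, ["ACGT", "TTGA", "GGGG"], workers=2):
        print(read, pos)
```

A real deployment would distribute chunks across machines and call an actual alignment tool per chunk; the chunk-and-merge structure is the part this sketch is meant to show.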
Alignment Coordinator
■ Available Alignment Algorithms (and growing)
□ Bowtie2
□ Bowtie
□ BWA
□ TMAP
□ SNAP
□ MAQ
□ SOAP
Numbers you should know: Alignment Execution Time
■ One cell line: ~600k reads / 110 MB
■ Pipeline: Alignment and variant calling
Property       Traditional     HPI
Full Genome    No              Yes
Cores          2 * 6 cores     25 * 40 cores
Main Memory    48 GB           25 TB
Runtime        ~720            ~40s
Numbers you should know: History of the Human Genome Project
■ 1984: Idea of a global Human Genome (HG) project discussed at Alta Summit: “DNA available on the Internet”
■ 1990: 15-year HG project started in the US (3 billion USD funding)
■ 2000: Rough draft of the HG announced
■ 2003: Complete genome sequenced
■ 2006: Last and longest chromosome (chr1) sequenced
■ … what’s next?
Numbers you should know: Human Genome
Entity                             Cardinality
Different Bases                    4 (A,C,G,T)
Base Pairs                         3.137 Bbp
Chromosomes                        23
Distinct Genes                     20k-25k
Amino Acids (coded as triplets)    21
Proteins                           50k-300k
Taken from http://de.wikipedia.org/wiki/Code-Sonne
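Two of the figures above lend themselves to a quick back-of-the-envelope check: how many base triplets exist compared with the 21 amino acids listed, and how much memory 3.137 Bbp would occupy. The sketch below assumes a plain 2-bit-per-base encoding, which is an illustrative assumption and not something stated in the slides.

```python
from itertools import product

BASES = "ACGT"              # the four bases from the table
GENOME_BASE_PAIRS = 3.137e9  # 3.137 Bbp from the table

# Every possible triplet (codon) of the four bases: 4**3 combinations.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
num_codons = len(codons)

# Assumption: 2 bits are enough to distinguish 4 bases (no quality data,
# no ambiguity codes), so one byte holds 4 bases.
bits_per_base = 2
genome_bytes = GENOME_BASE_PAIRS * bits_per_base / 8
genome_megabytes = genome_bytes / 1e6

print(num_codons)               # 64
print(round(genome_megabytes))  # 784
```

With 64 triplets for 21 amino acids, several triplets necessarily code for the same amino acid. Under the 2-bit assumption, one reference genome needs roughly 784 MB, a small fraction of the 1 TB of main memory per node.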
Numbers you should know: Comparison of Costs
[Figure: Comparison of Costs for Main Memory and Genome Analysis. Logarithmic y-axis: costs in USD (0.01 to 10,000); x-axis: dates from 01.01.01 to 01.01.12; series: costs per megabyte RAM, costs per megabase sequencing.]
Hardware Characteristics
■ 1,000 core cluster, 25 TB main memory
■ Consists of 25 identical nodes:
□ 80 cores
□ 1 TB main memory
□ Intel® Xeon® E7-4870
□ 2.40 GHz
□ 30 MB Cache
Customer Process as of Today
■ Tissue sequencing in context of cancer treatment
■ Complex, time-consuming, media breaks, manual steps
Project Objectives
■ Alignment of DNA reads (FASTQ) against reference genome (FASTA) → mapped reads
■ Real-time analysis of mapped reads
□ Detection of mutations (SNPs, INDELs)
□ Comparison of multiple tissues
□ Detection of similar clusters to identify correlations
■ Analysis of mutations
□ Identify mutations with scientific references (existing knowledge)
□ Detection of similar clusters to identify correlations
□ Identify genes and regulators for certain phenotypic characteristics, e.g. “fast running horses”
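The first two objectives, reading FASTQ input and spotting single-base differences in mapped reads, can be illustrated with a deliberately naive sketch. `parse_fastq` and `find_snps` are hypothetical helper names; the mapping offset is assumed to be already known from an aligner, and base qualities, INDELs, and coverage across multiple reads are ignored.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from FASTQ text.

    Assumes the simple four-lines-per-record layout:
    '@id', sequence, '+', quality string.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def find_snps(reference, read, offset):
    """Compare a mapped read with the reference at a known 0-based offset.

    Returns (position, reference_base, read_base) for every mismatch.
    """
    snps = []
    for i, base in enumerate(read):
        ref_base = reference[offset + i]
        if base != ref_base:
            snps.append((offset + i, ref_base, base))
    return snps


if __name__ == "__main__":
    reference = "ACGTACGTAA"
    fastq = "@r1\nACGTTCGT\n+\nIIIIIIII\n"
    for read_id, seq, _qual in parse_fastq(fastq):
        print(read_id, find_snps(reference, seq, 0))
```

Production variant callers work on many overlapping reads per position and weigh base qualities before reporting a mutation; a single-read comparison like this one only shows the basic principle.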
Thank you for your interest! Keep in contact with us.
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany
Matthieu-P. Schapranow, M.Sc.
[email protected]
http://j.mp/schapranow