ILRI/BECA Bioinformatics Platform Introduction Etienne de Villiers ILRI - Kenya.


Page 1: ILRI/BECA Bioinformatics Platform Introduction

Etienne de Villiers

ILRI - Kenya

Page 2: Outline

• ILRI/BECA Bioinformatics Platform
• Hardware
• Specialized software:
– Database searching
– Assembly software
• CGIAR Bioinformatics Grid

Page 3: International Livestock Research Institute

A lab in Africa at the foot of Kenya’s Ngong Hills

Page 4: ILRI Research Objectives

• Overall mandate is livestock research for poverty alleviation in Africa and Southeast Asia.
• Undertakes a balance of fundamental and applied research with long-, medium- and short-term objectives.
• Livestock health, genetics, and management.

Page 5: ILRI Facilities

• State-of-the-art laboratories (2,500 m²)
• Large and small animal facilities
– Level-2/3 biosafety facility for cattle and sheep
• Bioinformatics unit
– 64 CPU Paracel 64-bit HPC cluster
• Sequencing unit
– ABI 3730 and ABI 3100
• Microarray facility
• Proteomics facility
• Oligonucleotide synthesis unit
• FACS analysis facility
• Tick unit

Page 6: BECA - Biosciences East and Central Africa

• Under NEPAD, several centers of excellence are being established in Africa.
• One center is being established at ILRI: Biosciences East and Central Africa (BECA).
• The center will provide state-of-the-art facilities for scientists in the region.
• Facilities include:
– Genetics and genomics lab with high-throughput sequencers
– Microarray laboratory
– Proteomics laboratory
– Immunology and molecular biology laboratories
– Bioinformatics Platform

Page 7: ILRI/BECA – Bioinformatics Platform

• Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support.
• EMBnet node for East and Central Africa

Page 8: IBBP services

• Access to bioinformatics tools through either:
– web-based bioinformatics tools through the BBP website
– secure shell (ssh) access for registered users
• Facilities for storage of large datasets
• Systems administration and backup of datasets
• Training and support in the use of BBP resources
• Graduate and postgraduate fellowships in bioinformatics

Page 9: IBBP Facilities

• Training room
– 18 computers with MS Windows and Linux
– High-speed internet connection
• Servers
– 66 CPU Beowulf Linux cluster
– High-availability web server

Page 10: IBBP Website

www.becabioinfo.org

Page 11: Selection of available tools on IBBP

• Paracel Blast
• GeneMatcher2
• PTA
• Oligocheck
• EMBOSS: 200+ bioinformatics tools
• ClustalW multiple alignment software
• T-Coffee multiple alignment software
• FastA sequence alignment tool
• HMMER multiple alignment and sequence searching software
• Staden sequence assembly and analysis package
• Primer3 primer design package
• PAUP tree-inference package
• PHYLIP tree-inference package
• Phred/Phrap DNA editing and assembly tools
• R statistical package
• Rosetta: ab initio protein prediction
• SRS: sequence retrieval tool
• Etc.

Page 12: IBBP Hardware Systems

Paracel Blast Machine: parallel NCBI-Blast (20 CPU)
• Blast
• PSI-Blast
• Mega-Blast

GeneMatcher2: 6,144 CPU supercomputer
• HMM
• Smith-Waterman
• GeneWise
• Profile

HPC Linux cluster
• 66 CPUs (AMD 64-bit)
• 72 gigabytes RAM
• 3 terabytes disk storage

Page 13: Linux cluster

• Rocks 4.1 (Red Hat) operating system
• Platform LSF batch queuing
– shares resources equally between users
• MPI libraries
– Parallel computations

Cluster software stack, top to bottom:
• Application Software (e.g. BLAST, EMBOSS, Rosetta)
• Middleware (Platform LSF)
• Operating System (Red Hat - ROCKS)
• Nodes
• Network (GigE)

Turnkey HPC Integration:
• Application Integration
• Batch Queue Setup
• Cluster Build and Configuration
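As a rough illustration of how work reaches a cluster through the batch queue, the sketch below assembles an LSF `bsub` command line in Python. The helper name `bsub_command`, the queue name and the output file pattern are hypothetical and not taken from the slides; only the `bsub` options `-n` (slots), `-q` (queue), `-J` (job name) and `-o` (output file, with `%J` expanding to the job ID) are standard LSF options.

```python
import shlex


def bsub_command(command, slots=1, queue=None, job_name=None,
                 output="job.%J.out"):
    """Build an argv list for LSF's bsub (hypothetical helper).

    The returned list can be passed to subprocess.run() on a host
    where LSF is installed; nothing is executed here.
    """
    argv = ["bsub", "-n", str(slots), "-o", output]
    if queue:
        argv += ["-q", queue]        # queue name is site-specific
    if job_name:
        argv += ["-J", job_name]
    argv += shlex.split(command)     # the job itself follows the options
    return argv


# Example: request 4 slots on an assumed queue called "normal".
print(" ".join(bsub_command("echo hello", slots=4, queue="normal")))
```

Building the argument list separately from running it keeps the sketch usable on machines without LSF and makes the submitted command easy to log.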

Page 14: Database searching

• Heuristic algorithms (FASTA and BLAST)
– Gapped BLAST
– Traditional ungapped BLAST
Are fast but give approximate alignments

• Dynamic programming algorithms
– Global: Needleman-Wunsch
– Local: Smith-Waterman
Give optimal alignment but are very slow
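To make the cost of dynamic programming concrete, here is a minimal Smith-Waterman sketch in Python. It fills a full score matrix, so time and memory grow with the product of the two sequence lengths, which is why the method is slow on large databases. The scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions, not parameters from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Uses a linear gap penalty and assumed scoring values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

Needleman-Wunsch differs mainly in dropping the zero floor and tracing back from the bottom-right corner, which forces the alignment to span both sequences end to end.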

Page 15: Paracel Blast Server

• Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems
• 20 CPU parallel NCBI-Blast
• 20x faster than NCBI-Blast server

Blastn: Paracel Blast vs. NCBI Blast
– Query: Chromosome 81 sequence, 150,000,000 bases
– Database: Human Ref. Seq, 10,300 sequences, 24,300,000 bases
– Paracel Blast: 1h 9m 56s
– NCBI: 6 days 2h 20m 34s
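The two run times quoted for the Blastn comparison can be turned into a ratio with a few lines of arithmetic. The sketch below uses only the timings on this slide; the helper name `to_seconds` is hypothetical. The result is roughly a 126-fold difference for this particular query, which is larger than the 20x figure in the bullet list; the slides do not say how the two numbers relate.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a d/h/m/s duration to seconds (hypothetical helper)."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


paracel = to_seconds(hours=1, minutes=9, seconds=56)
ncbi = to_seconds(days=6, hours=2, minutes=20, seconds=34)
speedup = ncbi / paracel

print(f"Paracel Blast: {paracel} s, NCBI Blast: {ncbi} s")
print(f"Ratio for this benchmark: about {speedup:.0f}x")
```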

Page 16: Paracel Blast Server

BioView Viewer

Page 17: BioView Viewer

Page 18: Gene Structure Determination

• To compare a cDNA or EST database to a genomic database, one must allow introns
• Two approaches:
– Double-affine Smith-Waterman (separate gap penalty for introns)
– GeneWise: protein or HMM versus genomic DNA (models the important features of protein families better)
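One common way to give introns their own gap penalty is a two-piece affine cost: each gap is charged the cheaper of a "short indel" affine cost and a "long gap" affine cost with a high opening but low extension penalty. The sketch below shows that idea only; whether Paracel's double-affine Smith-Waterman uses exactly this form is an assumption, and all penalty values are illustrative.

```python
def affine_gap(length, open_cost, extend_cost):
    """Classic affine gap cost: one opening charge plus a per-base charge."""
    return open_cost + extend_cost * length


def double_affine_gap(length, short=(5, 2), intron=(40, 0)):
    """Cost of a gap under two affine regimes (assumed parameters).

    short  = (open, extend) for ordinary indels
    intron = (open, extend) for long, intron-like gaps
    """
    if length <= 0:
        return 0
    return min(affine_gap(length, *short), affine_gap(length, *intron))


for n in (1, 3, 17, 18, 500):
    print(n, double_affine_gap(n))
```

With these numbers, gaps up to 17 bases are scored as ordinary indels, while anything longer costs a flat 40, so a 500-base intron is not penalized more than an 18-base one.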

Page 19: How to get more distant homologs

• Use dynamic programming algorithms
• Use position-specific or HMM profiles
• Do iterated searches
• Use translated searches

Must be careful in interpretation (statistics)
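A translated search compares sequences at the protein level, which requires translating DNA in all six reading frames first. The sketch below shows that step using the standard genetic code; the function names are hypothetical, and real tools such as TBLASTX handle ambiguity codes and alternative genetic codes that this sketch ignores.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
```

Each query therefore becomes six protein sequences, which is one reason translated searches are markedly more expensive than nucleotide searches and why their statistics need careful interpretation.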

Page 20: GeneMatcher2

• Do things you either can’t or wouldn’t attempt at NCBI (100x faster)
• A computer specialized for executing calculation-intensive methods in bioinformatics:
– Especially fast in performing the very sensitive Smith-Waterman pairwise alignment method
   • compensates for frameshifts
– GeneWise
   • intron- and frameshift-tolerant search method
– Needleman-Wunsch alignments
– HMM searches
• 6,144 parallel processor computer

Page 21: Why GeneMatcher2?

• Comparison of sensitivity and selectivity of various sequence search methods
• Blue denotes a software method
• Yellow denotes a hardware-accelerated method

[Figure: sensitivity versus selectivity plot; axis labels "Fewer false positives" and "More true positives"]

Page 22: GeneMatcher2 - Performance

• Time-to-completion comparison of original methods and methods on GeneMatcher2
• TBLASTX improvement is 20-fold
• Other methods at least 100-fold

Source: Genome Canada Bioinformatics Platform Project

[Figure: bar chart, "Runtime for an average query". x-axis: Method (NCBI TBLASTX, Paracel TBLASTX, Decypher TBLASTX, WUSTL HMM cluster, Decypher HMM, FASTA Smith-Waterman, GeneMatcher2 SW, EBI GeneWise, Paracel GeneWise). y-axis: Seconds, 0 to 1000. Extracted data labels: 376, 140.1, 161316, 270, 1000.]

Page 23: BioView Workbench

BioView Viewer

Page 24: BioView Viewer

Page 25: Assembly Software

• Paracel Transcript Assembler (PTA)
– High-capacity solution for EST-based transcript reconstruction
– Can assemble large numbers of ESTs, allowing for splice variants
– Complete pipeline for sequence cleaning, clustering and assembly
– Detection, alignment and visualization of alternative splice forms
– Visualization through intuitive graphical interfaces

Page 26: Scientific problems for PTA

• Proteomics
• Gene discovery
• Verify gene predictions for genome assembly
• Detecting splice variants
• Patterns of expression, tissue specificity
• SNP detection
• Combinations of all the above...

Page 27: PTA – Contig view

Page 28: PTA – Splice variant alignment

Page 29: Paracel Oligocheck

• Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2
• Searches oligos quickly against a whole genome
• Software used by companies designing and synthesizing oligonucleotides, e.g. MWG

Page 30: Ensembl mirror

• Ensembl is a joint project between EMBL-EBI and the Sanger Institute.
• A software system which produces and maintains automatic annotation on selected eukaryotic genomes.
• Our site provides free access to selected areas of the data and software from the Ensembl project.

Page 31: CGIAR – HPC GRID computing

[Figure: diagram of CGIAR grid sites]
• ILRI, Kenya
• IRRI, Philippines
• ICRISAT, India
• CIP, Peru
• BECA/Partners

Node labels in the figure: 49 nodes, 89 CPUs; 33 nodes; GeneMatcher2; 4 nodes; 8 nodes; 4 nodes.

Page 32: Thank you