01-02-03-05-04-07-06-08-09-10-11-12-13-14 Vacuolar ATP synthase 14-3-3 Ankyrin repeat...

[Figure: map of the 8739 bp contig labelled 01-02-03-05-04-07-06-08-09-10-11-12-13-14, with a scale from 1 to 8500 bp. Genes shown in order: Vacuolar ATP synthase (313..895), 14-3-3 (1750..2488), Ankyrin repeat protein (3175..5658), 6-Phosphofructokinase (6914..7945). Spacers between genes: 855 bp, 687 bp, 1256 bp.]

FAS(GCG) FRAMES(SEQWEB) TRANSLATE BLAST(NCBI) PROTEIN ANALYSIS(EXPASY)
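The spacer lengths on the contig map follow directly from the gene coordinates. The short Python sketch below reproduces that arithmetic. Pairing each gene name with a coordinate range by order of appearance is an assumption, and the function names are illustrative only.

```python
# Gene coordinates as listed on the slide (1-based, inclusive).
# Assumption: names and coordinate ranges are paired in order of appearance.
GENES = [
    ("Vacuolar ATP synthase", 313, 895),
    ("14-3-3", 1750, 2488),
    ("Ankyrin repeat protein", 3175, 5658),
    ("6-Phosphofructokinase", 6914, 7945),
]


def gene_lengths(genes):
    """Length in bp of each annotated region (inclusive coordinates)."""
    return [(name, end - start + 1) for name, start, end in genes]


def intergenic_gaps(genes):
    """Spacer between consecutive genes, computed as next start minus previous end."""
    ordered = sorted(genes, key=lambda g: g[1])
    return [nxt[1] - prev[2] for prev, nxt in zip(ordered, ordered[1:])]


print(gene_lengths(GENES))
print(intergenic_gaps(GENES))
```

Under these assumptions the computed spacers agree with the 855 bp, 687 bp and 1256 bp values shown on the map.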

Global Gene Expression Analysis: Microarray, SAGE and EST

Petrus Tang, Ph.D.
Graduate Institute of Basic Medical Sciences and Bioinformatics Center, Chang Gung University
[email protected]
http://petang.cgu.edu.tw

LECTURE 92-13

Published Complete Genome Projects: 95 (including 3 chromosomes)

Prokaryotic Ongoing Genome Projects: 310

Eukaryotic Ongoing Genome Projects: 211 (including 11 chromosomes)

Last update: 18 July 2002

THE WORLD OF GENOMICS

GenBank Sequences

GenBank® is the National Institutes of Health genetic sequence database, an annotated collection of all publicly available DNA sequences.

Genetic Sequence Data Bank Apr 15 2004, Release 141.0

33,676,218 loci, 38,989,342,565 bases, from 33,676,218 reported sequences

Homo sapiens: 7,714,277 sequences 10,569,756,393 bases

EST: 5,491,557 sequences

Gene Expression & Post-Translational Modification of Proteins

[Figure: expression of Gene A, Gene B and Gene C compared across muscle, skin and nerve cells, across normal and cancer cells, and under cell growth and external stress.]

Gene Expression

A unique set of genes is expressed under different growth conditions and at different stages.

[Figure: from gene to gene products (mRNA, protein). Genome: genome projects. Transcriptome: microarray, SAGE, EST, subtraction library. Proteome: 2D-PAGE.]

Functional Genomics

Global Analysis of Gene Expression

Northern Hybridization
RT-PCR
Differential Display, Subtraction Library
Serial Analysis of Gene Expression (SAGE)
Expressed Sequence Tags (EST)
Real-Time PCR
Microarray

Analysis of 10,000-50,000 messages in a transcriptome will generate a relevant profile of gene expression within a cell, providing a quantitative measurement of transcripts for gene discovery.

Microarray: 10,000 clones per slide

Marilyn Monroe
Marilyn was printed using a MicroGrid Compact with MicroSpot 10K Quill pins onto Corning GAPS2 slides. Cy3 and Cy5 labelled M13FWD oligonucleotides were made to 1 µM in 1 x EZRays spotting solution (Apogent Discoveries), 0.01% N-lauroyl sarcosine. Eight 2-fold dilutions of the oligonucleotides were made into the same spotting buffer. A 20 µl aliquot of each oligonucleotide dilution was placed in the appropriate wells of Greiner 384-well V-bottom microplates.

The Art of Microarray

The image of Marilyn was downloaded from the web, pixelised and grayscaled using image manipulation software. Pixel information was extracted to Excel, from which grid patterns were generated. Ms Monroe was scanned at 10 µm resolution using the ArrayWorX scanner from Applied Precision.

Mona Lisa
Mona was printed using a MicroGrid Compact with MicroSpot 10K Quill pins onto Corning GAPS2 slides. Cy3 labelled M13FWD oligonucleotides were made to 1 µM in 1 x EZRays spotting solution (Apogent Discoveries), 0.01% N-lauroyl sarcosine. Eight 2-fold dilutions of the oligonucleotides were made into the same spotting buffer. A 20 µl aliquot of each oligonucleotide dilution was placed in the appropriate wells of Greiner 384-well V-bottom microplates.

The Art of Microarray

The image of Mona was downloaded from the web, pixelised and grayscaled using image manipulation software. Pixel information was extracted to Excel, from which grid patterns were generated. The Mona Lisa was scanned at 10 µm resolution using the ArrayWorX scanner from Applied Precision.
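The step from grayscale pixels to a spotting grid can be pictured with a short Python sketch. It is only an illustration, not the procedure used for these slides: it assumes an 8-bit gray value per pixel, treats the 1 µM stock as the first of the eight 2-fold dilution levels, and assumes brighter pixels are spotted with more concentrated dye. All function names are hypothetical.

```python
def dilution_series(top_conc_um=1.0, steps=8):
    """Concentrations (µM) of a 2-fold dilution series starting at top_conc_um."""
    return [top_conc_um / (2 ** i) for i in range(steps)]


def pixel_to_dilution(gray, steps=8):
    """Map an 8-bit gray value (0 = black, 255 = white) to a dilution index.

    Assumption: index 0 is the most concentrated dye, used for the brightest pixels.
    """
    if not 0 <= gray <= 255:
        raise ValueError("gray value must be in 0..255")
    return steps - 1 - (gray * steps // 256)


def image_to_grid(pixels, steps=8):
    """Convert a 2D list of gray values into a 2D list of dilution indices."""
    return [[pixel_to_dilution(g, steps) for g in row] for row in pixels]


print(dilution_series())
print(image_to_grid([[0, 128], [255, 64]]))
```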

1. Mix 5 µg total RNA with oligo dT magnetic beads
2. Synthesize double-strand cDNA
3. Digest with NlaIII to form one end of the tag
4. Divide in half and ligate 40 bp adapters (A and B) containing the recognition sequence for the type II restriction enzyme BsmFI
5. Cleave with BsmFI to form a ~50 bp tag (40 bp adapter/13 bp tag)
6. Fill in 5' overhangs and ligate to form a ~100 bp ditag
7. PCR amplify using ditag primers 1 and 2
8. Cut 40 bp adapters with NlaIII to release the 26 bp ditag
9. Ligate ditags to form concatemers
10. Clone and sequence

Serial Analysis of Gene Expression (SAGE)
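Once a concatemer clone has been sequenced, the tags have to be recovered computationally and counted. The Python sketch below shows one simple way to do that. It assumes ditags are separated by the NlaIII site CATG, that each tag is the `tag_len` bases next to the site (the default of 10 is an assumption, not a value from the slide), and that no tag contains CATG internally. The function names are illustrative.

```python
from collections import Counter

ANCHOR = "CATG"  # NlaIII recognition site separating ditags in a concatemer
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_tags(concatemer, tag_len=10):
    """Split a concatemer at CATG and return both tags of every ditag.

    The left tag is read as is; the right tag is reverse-complemented so that
    both tags are reported in the same orientation relative to their anchor.
    Fragments too short to hold two tags are skipped.
    """
    tags = []
    for ditag in concatemer.upper().split(ANCHOR):
        if len(ditag) < 2 * tag_len:
            continue
        tags.append(ditag[:tag_len])
        tags.append(reverse_complement(ditag[-tag_len:]))
    return tags


def count_tags(concatemers, tag_len=10):
    """Tally tag occurrences over many sequenced concatemers."""
    counts = Counter()
    for seq in concatemers:
        counts.update(extract_tags(seq, tag_len))
    return counts


example = "CATG" + "A" * 10 + "C" * 10 + "CATG" + "A" * 10 + "G" * 10 + "CATG"
print(count_tags([example]))
```

The resulting counts serve as the raw measure of transcript abundance: a tag seen more often corresponds to a more abundant message.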

What are ESTs?

Expressed Sequence Tags are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns--the intervening DNA sequences interrupting the protein coding sequence of a gene.

[Figure: an mRNA drawn 5' to 3': 5'-untranslated region (UTR), *START, coding sequence (CDS), *STOP, 3'-UTR and poly(A) tail (AAAAAAAAAAA). The 5'-EST and 3'-EST are marked at the two ends.]

Expressed Sequence Tags (EST)

▲ Relational database (Oracle 8i)
▲ Automatic data validation
▲ Quality score generation
▲ Automatic trimming of low-quality, vector, adaptor, poly-A tail, low-complexity and contaminant sequences
▲ Automatic running of selected BLAST algorithms, with user-defined parameters, user-selected reference databases, and storage of top results (by user-defined cutoffs) in the database
▲ Includes a web interface for viewing the data in the database, according to the permissions allowed to the viewer (by individual, project, lab or institution)
▲ Includes a Java tool for dbEST submission of newly generated ESTs at intervals defined by the users
▲ System can be readily and simply deployed at any of the partner's institutions
▲ Includes methods for defining a Unigene set for a library

Additional functionalities are needed by the members of the current co-development group, including:
▲ Tissue or organism, integration of gene expression data
▲ Annotations: Gene Ontology annotations, functional motif annotation, metabolic pathway annotations, signal transduction pathways

Basic Features and Tools of an Automated EST Analysis Pipeline
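One of the trimming steps listed for the pipeline, removal of poly-A tails, can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code: the minimum run length of 10 and the minimum kept length of 100 are assumed values, and the function names are hypothetical.

```python
import re


def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of at least min_run A's (a 3' poly-A tail)."""
    seq = seq.upper()
    match = re.search(r"A{%d,}$" % min_run, seq)
    return seq[:match.start()] if match else seq


def passes_length_filter(seq, min_len=100):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len


read = "GATTACG" + "A" * 12
print(trim_poly_a(read))
```

Note that any A's directly preceding the tail are removed with it, because they cannot be told apart from the tail itself.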

Data Processing – Raw Nucleotide Sequence

[Figure: data-processing flow. EST or SAGE clones are sequenced on a MegaBACE 1000. On the PC, traces in ABI format are inspected in Chromas (high-quality and poor-quality examples shown) and saved as FASTA-format sequence. On UNIX, the PHRED algorithm removes uncalled/miscalled bases and vector sequence.]

Ewing B et al. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175-85
Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186-94
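Phred assigns each base a quality score Q related to the estimated error probability P by Q = -10 log10(P), so Q20 corresponds to a 1% chance of a miscalled base. The Python sketch below converts scores to probabilities and trims low-quality ends. The end-trimming rule and the cutoff of 20 are assumptions made for illustration; this is not phred's own trimming algorithm.

```python
def phred_to_error_prob(q):
    """Error probability implied by a phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_low_quality(seq, quals, min_q=20):
    """Trim bases from both ends until a base with quality >= min_q is reached.

    Assumption: a simple end-trimming rule, used here only as an example.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(phred_to_error_prob(20))
print(trim_low_quality("ACGTACGT", [5, 10, 30, 40, 40, 30, 8, 3]))
```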

SeqVerter™
SeqVerter™ is a free sequence file format conversion utility by GeneStudio, Inc. SeqVerter encapsulates a small subset of the features offered by the GeneStudio Pro suite of programs. While the standalone SeqVerter is a simple dialog-based utility, the free SeqVerter component of the GeneStudio suite adds sophisticated viewers and sequence formatting functions, including a viewer for automatic DNA sequencer chromatogram files (traces). http://www.genestudio.com/seqverter.htm

Octopus
Octopus is an interactive program designed for the rapid interpretation of BLAST, BLAST-2 and FASTA output text files. It provides an easy-to-use graphical user interface for both experienced and inexperienced users, with sequence comparison analysis based on the widely used BLAST series of programs and FASTA. Octopus can read result files from various BLAST and BLAST2 servers, GCG's BLAST and the original FASTA3 program.

Trace Viewers
To look at an SCF file you first have to choose a program. Commonly used programs for viewing sequencing data are CHROMAS (for PC/Windows), TraceViewer (for Mac) and Trev (contained in the Gap4 Database Viewer, for UNIX).

FREEWARE


EST Analysis : Clustering

CONTIGS
Clusters
Singletons

ALGORITHM    FUNCTION
PHRED        Remove uncalled/miscalled bases & vector sequence
PHRAP        Assemble clones to form contigs
CONSED       Contig viewer & screen for misassemblies
Wu-Blastn    Group contigs to form clusters of related contigs
Blastx       Homology search against self-generated databases
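Grouping contigs into clusters and singletons from pairwise similarity hits can be illustrated with a small union-find routine in Python. This is a sketch under stated assumptions: single-linkage grouping is assumed (any significant hit joins two contigs), the hit list stands in for filtered Wu-Blastn output, and the contig names are invented for the example.

```python
def cluster_contigs(names, hits):
    """Group contigs joined by pairwise hits; return (clusters, singletons).

    names: list of contig identifiers.
    hits: iterable of (contig_a, contig_b) pairs judged to be related.
    Every identifier in hits must also appear in names.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


contigs = ["c1", "c2", "c3", "c4", "c5"]
related = [("c1", "c2"), ("c2", "c3")]
print(cluster_contigs(contigs, related))
```

With single linkage, c1 and c3 end up in the same cluster even without a direct hit, because both are related to c2.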


Similarity Search: Blastx

BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990)

Nucleotide query translated to six reading frames vs protein database
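The six-frame translation that Blastx applies to a nucleotide query can be sketched in plain Python. The sketch uses the standard genetic code, writes stop codons as `*` and codons containing ambiguous bases as `X`. The function names are illustrative, and real BLAST implementations differ in detail.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frame("ATGGCC"))
```

Each of the six protein strings would then be compared against the protein database, which is why a Blastx search can find a coding region regardless of strand or frame.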

[Figure: example Blastx results for clone TV007D02. Searches: Blastx-nr and Blastx-pfam, smart. Interfaces: WWW Blastx and GCG Blastx. Output: Blastx-GCG format and Blastx-Octopus viewer.]

http://www.ncbi.nlm.nih.gov/

InterPro provides an integrated view of the commonly used signature databases, and has an intuitive interface for text- and sequence-based searches.

Bioinformatics infrastructural activities are crucial to modern biological research. Complete and up-to-date databases of biological knowledge are vital for increasingly information-dependent biological and biotechnological research. Secondary protein databases on functional sites and domains, such as PROSITE, PRINTS, SMART, Pfam and ProDom, are vital resources for identifying distant relationships in novel sequences, and hence for predicting protein function and structure. Unfortunately, these signature databases do not share the same formats and nomenclature, and each database has its own strengths and weaknesses. To capitalise on these, the following partners: EBI, SIB, University of Manchester, Sanger Institute, GENE-IT, CNRS/INRA, LION bioscience AG and University of Bergen unified PROSITE, PRINTS, ProDom and Pfam into InterPro (Integrated resource of Protein Families, Domains and Sites). The latest databases to join the project were SMART and, more recently, TIGRFAMs.

Annotation - GO

The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.

GENE ONTOLOGY™ CONSORTIUM http://www.geneontology.org

Molecular Function: the tasks performed by individual gene products; examples are transcription factor and DNA helicase.

Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.

Cellular Component: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

[Figure: GO annotation example for p53.]

http://www.geneontology.org/

The Cancer Genome Anatomy Project (CGAP) http://cgap.nci.nih.gov/

Classification According to Metabolic & Signalling Pathways

Biocarta (http://biocarta.com)

Kyoto Encyclopedia of Genes and Genomes (KEGG) http://www.genome.ad.jp/kegg/

http://www.biocarta.com/index.asp

http://www.ncbi.nlm.nih.gov/

Annotation

ESTs are categorized into the following classes:

ESTs that match known protein sequences exactly

ESTs that show homology to known protein motifs/domains

Unique ESTs with no matches

Top 50 Highly Expressed ESTs

[Figure: bar chart of the top 50 highly expressed ESTs. Axes: number of ESTs and percentage of total ESTs; values shown include 2.31% and 0.17%. Pie chart labels: 998 ESTs, 24.75%, 75.25%.]
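Because an EST library is not pre-selected, the number of ESTs assigned to a gene can be read as a rough measure of its expression level. The Python sketch below shows how a "percentage of total ESTs" ranking could be computed. The EST-to-gene assignments are invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def expression_profile(est_to_gene):
    """Rank genes by EST count and report each as a percentage of all ESTs.

    est_to_gene: dict mapping an EST identifier to the gene or cluster it
    was assigned to. Returns (gene, count, percent) tuples, highest first.
    """
    counts = Counter(est_to_gene.values())
    total = sum(counts.values())
    return [(gene, n, 100.0 * n / total) for gene, n in counts.most_common()]


# Hypothetical assignments, for illustration only.
assignments = {"est1": "actin", "est2": "actin", "est3": "actin", "est4": "tubulin"}
for gene, n, pct in expression_profile(assignments):
    print(f"{gene}\t{n}\t{pct:.2f}%")
```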

GO Classification of ESTs
GENE ONTOLOGY™ CONSORTIUM http://www.geneontology.org/

[Figure panels: CELLULAR COMPONENT, BIOLOGICAL PROCESS, MOLECULAR FUNCTION]

COG Classification of ESTs
Clusters of Orthologous Groups of proteins http://www.ncbi.nlm.nih.gov/COG/

Automated EST Analysis Pipeline

Project Management

Sequence Management

Clustering

Sequence Analysis

Annotation

dbEST: 12,261,869 (Aug 2002)

GenBank® is the National Institutes of Health genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 20,648,748,345 bases in 17,471,130 sequence records as of June 2002, Release 130 (12,055,326 sequences in dbEST, 4,500,000 from Homo sapiens).

EST Databases – dbEST & UniGene

dbEST
dbEST (http://www.ncbi.nlm.nih.gov/dbEST/index.html) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

UniGene
UniGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene) is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

1: BQ640943. TVEST017.H09 Tv30...[gi:21765401]

IDENTIFIERS
dbEST Id: 12791004
EST name: TVEST017.H09
GenBank Acc: BQ640943
GenBank gi: 21765401

CLONE INFO
Clone Id: (5')
DNA type: cDNA

PRIMERS
PCR forward: T7
PCR backward: T3
Sequencing: T3
PolyA Tail: Unknown

SEQUENCE ATTACAGCAATTGCCGATGATTGGCTTGGCATCACTGGCTGGCGTATCGAAAACTTTAAG CTCGTTAAAGTTGCAGAGATGGGCGCCTTCCACACAGGAGATTCTTATTTGTATCTTCAC GCTTACCTTGNTTGGCACAAGCAAGCTCGTCCATCGTGATATTTACTTCTGGCAGGGCTC CACATCCACAACAGATGAGCGCGGTGCTGTTGCTATCAAGGCTGTTGAACTTGATGACAG ATTTGGAGGCTCTCCAAAGCAACACAGAGAAGTCCAGAACCACGAGTCAGACCAGTTCAT TGGACTCTTCGATCAGTTTGGCGGTGTTCGCTACCTCGATGGCGGTGTTGAATCAGGATT CCACAAAGTCACAACATCTGCAAAGGTTGAGATGTACAGAATCAAGGGAAGAAAGCGCCC AATTCTCCAGATCGTTCCAGCTCAGCGCTCCTCCCTCAACCATGGAGATGTTTTCATTAT CCATGC

Entry Created: Jul 8 2002
Last Updated: Jul 15 2002

PUTATIVE ID
Assigned by submitter: ACTIN-BINDING PROTEIN FRAGMIN P.

LIBRARY
Lib Name: Tv30236_PT cDNA Library
Organism: Trichomonas vaginalis
Cell line: ATCC30236
Develop. stage: Trophozoites at mid-log phase
Lab host: XL1 Blue-MRF'
Vector: Lambda ZAP-Express (Stratagene)
R. Site 1: EcoRI
R. Site 2: XhoI

SUBMITTER
Name: Tang, P.
Lab: Molecular Regulation and Bioinformatics Laboratory, College of Medicine
Institution: Chang Gung University
Address: 259 Wenhwa 1st. Road, Kweishan, Taoyuan 333, Taiwan
Tel: +886 3 3283016 EXT5136
Fax: +886 3 3283031
E-mail: [email protected]

CITATIONS
Title: Analysis of Gene Expression Profile in Trichomonas vaginalis by EST Sequencing
Authors: Zhou,Y., Shu,W.M., Huang,S.C.C., Huang,K.Y., Tang,P.
Year: 2003
Status: Unpublished

dbEST Record

NCBI dbEST accession numbers for Trichomonas vaginalis ESTs:
BQ621379~BQ621732; BQ625216~BQ625229; BQ640771~BQ640943
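The accession ranges can be expanded into individual accession numbers with a few lines of Python, for example to build a retrieval list. The sketch assumes that every number inside a range belongs to this submission and that `~` separates the first and last accession of a range. The function names are illustrative.

```python
import re

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accession_range(spec):
    """Expand 'BQ621379~BQ621732' into a list of every accession in the range."""
    first, last = (part.strip() for part in spec.split("~"))
    m1, m2 = ACCESSION.fullmatch(first), ACCESSION.fullmatch(last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError(f"cannot parse accession range: {spec!r}")
    prefix, width = m1.group(1), len(m1.group(2))
    lo, hi = int(m1.group(2)), int(m2.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1)]


def expand_all(text):
    """Expand a semicolon-separated list of accession ranges."""
    accessions = []
    for part in text.split(";"):
        if part.strip():
            accessions.extend(expand_accession_range(part))
    return accessions


RANGES = "BQ621379~BQ621732; BQ625216~BQ625229; BQ640771~BQ640943"
all_accessions = expand_all(RANGES)
print(len(all_accessions), all_accessions[0], all_accessions[-1])
```

Under the contiguity assumption the three ranges cover 354, 14 and 173 accessions.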

Entrez query: trichomonas vaginalis AND gbdiv_est[PROP]
http://www.ncbi.nlm.nih.gov/dbEST/index.html

http://www.ncbi.nlm.nih.gov/SAGE/

EST & SAGE Based Microarray

Bladder Carcinoma-Specific Microarrays

Bladder Tissue, Normal Bladder Tissue, Cancer

Not Pre-selected
Can identify Gene Families
Real Gene Expressed Products
cDNA vs cDNA
Abundance = Expression Level

Normal, U1,U2,U3,U4, Prognosis, Drug Resistant

[Figure: Genes, mRNAs, cDNA, ESTs]
