DNA Sequence Analysis

5.1 Introduction

1. Terms in common use are defined, and the genetic code is reviewed.

2. The Expressed Sequence Tag (EST) is introduced as a unit of sequence data, derived from rapid sequencing of cDNA libraries.

3. Three examples of producers of EST databases are profiled.

5.2 Why analyse DNA?

The most sensitive comparisons between sequences are made at the protein level; detection of distantly related sequences is easier in protein translation, because the redundancy of the genetic code (64 codons) is reduced to 20 distinct amino acids. However, the loss of degeneracy at this level is accompanied by a loss of information about the evolutionary process, because proteins are a functional abstraction of genetic events in DNA.

Table 5.1 The Genetic Code

Box 5.1 Family Analysis at DNA Level

5.3 Gene structure and DNA sequences

1. DNA sequence databases contain genomic sequence data, which includes information at the level of the untranslated sequence, introns and exons, mRNA, cDNA, and translations.

2. Untranslated regions (UTRs) occur in both DNA and RNA; they are portions of the sequence flanking the coding sequence (CDS) that are not translated into protein. UTR sequence is highly specific at the 3’ end, both to the gene and to the species from which the sequence is derived.

Box 5.2 The Central Dogma

3. Six-Frame Translation: There are three forward frames, which are obtained by beginning to translate at the first, second and third bases respectively; the three reverse frames are obtained by taking the reverse complement of the DNA sequence (that is, reading the complementary strand) and again beginning on the first, second and third bases. Thus, for any piece of DNA, the result of a six-frame translation is six potential protein sequences.

Fig. 5.1 Six-Frame Translation
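The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the standard genetic code, writes stop codons as `*`, and writes any codon containing a non-ACGT character as `X`. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the first base; trailing bases are dropped."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

For the toy input `ATGGCCTAA`, frame +1 reads ATG GCC TAA, while frame -1 reads the reverse complement TTAGGCCAT from its first base.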

5.4 Features of DNA sequence analysis

1. Detecting open reading frames (ORFs): Initiation codon: ATG. Stop codons: TGA, TAA, TAG.

2. Several features may be used as indicators of potential protein-coding regions in DNA:

a. Sufficient ORF length

b. Recognition of a flanking Kozak sequence

c. Patterns of codon usage

d. A general preference for G/C over A/T in the third base (wobble) position of a codon

e. Ribosome binding sites

f. Alignment with a homologous protein sequence
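As a rough illustration of ORF detection combined with the "sufficient ORF length" criterion, the sketch below scans the three forward frames for an ATG followed in-frame by one of the three stop codons. The function name and the `min_codons` threshold are hypothetical choices for illustration; a real gene finder would also weigh the other indicators listed above and scan the reverse strand.

```python
START = "ATG"
STOPS = {"TGA", "TAA", "TAG"}


def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in the three forward frames.

    Returns (start, end, frame) tuples: start is the 0-based index of the A
    in ATG, end is exclusive and includes the stop codon, frame is 1, 2 or 3.
    min_codons is an illustrative length cut-off (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first ATG seen since the previous stop
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None  # resume the search after this stop codon
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGA"))
```

To cover all six frames, the same function could be applied to the reverse complement of the sequence.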

Table 5.2 Percentage use of codons for serine in a variety of model organisms
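Percentages of the kind tabulated for serine can be computed from any coding sequence by counting its six serine codons. The sketch below assumes the input is an in-frame CDS starting at its first base; the function name is hypothetical and the example sequence is a toy, not data from Table 5.2.

```python
from collections import Counter

# The six codons that encode serine in the standard genetic code.
SERINE_CODONS = ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")


def serine_codon_usage(cds):
    """Return the percentage use of each serine codon in an in-frame CDS."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codon for codon in codons if codon in SERINE_CODONS)
    total = sum(counts.values())
    if total == 0:
        # No serine codons at all: report zero for every codon.
        return {codon: 0.0 for codon in SERINE_CODONS}
    return {codon: 100.0 * counts[codon] / total for codon in SERINE_CODONS}


if __name__ == "__main__":
    # Toy CDS: ATG TCT TCT AGC TAA
    for codon, percent in serine_codon_usage("ATGTCTTCTAGCTAA").items():
        print(codon, round(percent, 1))
```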

3. DNA sequence assembly: The rapid accumulation of DNA sequence data has been expedited by the introduction of fluorescent sequencing technology. The output consists of a series of color-coded peaks, beneath which is a string of base symbols; the base shown at each position is determined by the highest peak at that position of the trace.

Box 5.3 Fluorescent sequence chromatogram interpretation
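The "highest peak wins" rule for reading a chromatogram can be sketched as follows. The trace representation (one dictionary of four channel intensities per position) and the `min_ratio` ambiguity threshold are assumptions made for illustration; real base-calling software works on the raw signal and uses more elaborate quality models.

```python
def call_bases(trace, min_ratio=2.0):
    """Call one base per trace position from four channel intensities.

    trace is a list of dicts mapping 'A', 'C', 'G', 'T' to a peak height.
    The highest peak is called, unless it is not more than min_ratio times
    the second-highest peak, in which case the position is written as 'N'.
    Both the data layout and min_ratio are illustrative assumptions.
    """
    called = []
    for intensities in trace:
        ranked = sorted("ACGT", key=lambda base: intensities[base], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if intensities[best] <= min_ratio * intensities[runner_up]:
            called.append("N")  # no clearly dominant peak
        else:
            called.append(best)
    return "".join(called)


if __name__ == "__main__":
    toy_trace = [
        {"A": 90, "C": 5, "G": 3, "T": 2},
        {"A": 4, "C": 80, "G": 6, "T": 1},
        {"A": 40, "C": 3, "G": 38, "T": 2},  # two competing peaks
    ]
    print(call_bases(toy_trace))
```

Positions without a clearly dominant peak are one source of the N characters seen in EST data.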

5.5 Issues in the interpretation of EST searches

1. A large part of the currently available DNA data is made up of partial sequences, the majority of which are Expressed Sequence Tags (ESTs).

2. In analyzing ESTs, the following points should be borne in mind:

a. The EST alphabet is five characters: ACGTN.

b. There may be phantom INDELs resulting in translation frameshifts.

c. The EST will often be a sub-sequence of another sequence in the databases.

d. The EST may not represent part of the CDS of any gene.

3. The EST alphabet

4. The existence of splice variants has particular consequences for database searches with EST queries.
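Because the EST alphabet is ACGTN, a simple sanity check before searching is to confirm that a sequence uses only those five characters and to measure how much of it is N. The sketch below is illustrative only; the function name is hypothetical, and any cut-off on the N fraction is left to the user.

```python
EST_ALPHABET = set("ACGTN")


def check_est(seq):
    """Return (invalid_characters, n_fraction) for a candidate EST.

    invalid_characters is a sorted list of symbols outside ACGTN;
    n_fraction is the proportion of positions written as N.
    """
    seq = seq.upper()
    invalid = sorted(set(seq) - EST_ALPHABET)
    n_fraction = seq.count("N") / len(seq) if seq else 0.0
    return invalid, n_fraction


if __name__ == "__main__":
    print(check_est("ACGTNNAC"))
    print(check_est("ACGU"))
```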

5.6 Two approaches to gene hunting

Positional cloning: The chromosome linked to the disease in question is established by analyzing a population of subjects. Once a link to a chromosomal region has been established, a large part of the chromosome in the vicinity of this region (locus) is sequenced, yielding several megabases of DNA. Such a locus can contain many individual genes, only one of which is likely to be involved in the disease.

Ultimately, several genes will need to be expressed, and further experimentation will be required to confirm which gene is actually involved in the disease. Although genes discovered in this way can be illuminating from an academic point of view, they do not necessarily represent good drug targets. The whole process is lengthy and labor-intensive.

RNA transcript analysis: This approach, which requires much less sequencing effort and relies more heavily on the powerful search capabilities of current computer systems, examines the genes that are actually expressed in healthy and diseased tissue. The mRNA is analysed so that the two states can be compared, and a process of reasoning is then applied to arrive at a potential drug target in a more direct way.

The hierarchy of genomic information: The human genome is complex, consisting of about 3 billion base-pairs of DNA. Yet only about 3% of the DNA is coding sequence. Thus, in simple terms, we have three levels of genomic information:

1. The chromosomal genome: the genetic information common to every cell in the organism.

2. The expressed genome: the part of the genome that is expressed in a cell at a specific stage in its development.

3. The proteome: the protein molecules that interact to give the cell its individual character.

5.7 cDNA libraries and ESTs

Obtain a sample of cells → RNA extraction → reverse transcription to cDNA → cDNA library → sequencing

1. The sequences that emerge successfully from this process are called ESTs.

2. Good libraries contain at least 1 million clones, and the actual number of distinct genes expressed in a cell may be a few thousand; the number varies according to cell type.

5.8 Different approaches to EST analysis

There are three major sources of EST information. Much of the publicly available data are collected together into the EST sections of the EMBL Data Library and GenBank (dbEST).

1. Merck/IMAGE: In 1994, Merck & Co. funded a research project to sequence 300,000 ESTs from a variety of normalised libraries. As of May 1997, 484,421 ESTs had been submitted by the project to dbEST (Table 5.4).

2. Incyte: Incyte Pharmaceuticals Inc. produces a database, LifeSeq, emphasizing the quantitative information derived by sequencing standard cDNA libraries. The goal is to provide information on transcribed genes in healthy and diseased tissues, to facilitate the elucidation of potential therapeutic targets. In April 1998, the size of LifeSeq was 2.5 million ESTs, representing 80,000-120,000 different genes.

3. TIGR: The Institute for Genomic Research is a research organization with interests in structural, functional and comparative analysis of genomes and gene products.

TIGR Human Gene Index (HGI)

5.9 EST analysis tools

There are three classes of publicly available tools for the analysis of ESTs:

1. Sequence similarity search tools: The BLAST series of programs has variants that will translate DNA databases (TBLASTN), translate the input sequence (BLASTX), or both (TBLASTX). FastA provides a similar suite of options.

2. Sequence assembly tools: When a search of the databases reveals several ESTs matching a probe sequence, the ESTs must be aligned with each other to reveal the consensus sequence.

3. Sequence clustering tools: Programs that take a large set of sequences and divide them into subsets, or clusters, based on the extent of shared sequence identity in a minimum overlap region. A reliable mechanism for clustering ESTs will reduce redundancy in the dataset, and save search time.
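For the assembly step (item 2 above), once the matching ESTs have been aligned, a consensus can be read off column by column. The sketch below assumes the alignment has already been produced by some other tool and is supplied as equal-length strings padded with `-` where an EST does not cover a position; it uses a simple majority vote, which is only one of several possible consensus rules. The function name is hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus of pre-aligned ESTs.

    aligned is a non-empty list of equal-length strings in which '-' marks
    positions an EST does not cover. Gaps and N are ignored when voting;
    a column with no informative base is written as 'N'.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        votes = Counter(base for base in column if base not in "-N")
        result.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(result)


if __name__ == "__main__":
    toy_alignment = [
        "ACGTAC----",
        "--GTACGG--",
        "----ACGGTT",
    ]
    print(consensus(toy_alignment))
```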

Clustering an EST library
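A much-simplified sketch of EST clustering is given below. It links two ESTs when they share an exact stretch of at least `min_overlap` bases, and then groups linked ESTs transitively (single linkage) with a small union-find structure. This stands in for "shared sequence identity in a minimum overlap region": real clustering tools tolerate mismatches and consider both strands, which this sketch does not. The function names and the default `min_overlap` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, min_overlap=8):
    """Group ESTs that share an exact substring of at least min_overlap bases.

    Returns a sorted list of clusters, each a list of indices into ests.
    Linking is transitive: if A overlaps B and B overlaps C, all three
    fall into the same cluster.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(seq.upper(), min_overlap) for seq in ests]
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if kmer_sets[i] & kmer_sets[j]:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy_ests = ["ACGTACGTTAGC", "CGTTAGCAAGGC", "TTTTTTTTTTTT"]
    print(cluster_ests(toy_ests, min_overlap=6))
```

The all-pairs comparison is quadratic in the number of ESTs, so this is suitable only for small illustrative sets; an index from k-mer to sequence would be the usual next step for a full library.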

5.10 A practical example of EST analysis
