DNA and Protein Sequence Analysis

A Practical Approach

M. J. BISHOP

C. J. RAWLINGS

List of contributors

xxi

Abbreviations

xxiii

1. Molecular biology databases

1

Christian Burks

1. Overview

1

Summary

1

Molecular biology databases

2

Sequence databases

2

What are the uses of databases?

16

2. Contributing data to the databases

16

Community pipelines

16

Direct, electronic submission

17

Timeliness of release of data to databanks

17

Promulgating data revisions and extensions

18

3. Retrieving data from the databases

19

Finding databases of interest

19

Media

22

Mechanisms

22

Which databases should I get?

23

4. Using the data

24

Do I have a current version of the database?

24

How often should I repeat routine queries?

24

How redundant is the database?

25

Are there errors in the database?

25

How did I get that result?

26

5. Queries across multiple databases

26

6. Keeping up and going further

28

Acknowledgements

29

References

29

2. The NCBI software tools

31

J. M. Ostell

1. Introduction

31

2. The software toolkit

31

Portable core library

31

Data encoding in ASN.1

32

3. The NCBI data model

32

Introduction

32

Pub

33

Bioseq

34

4. Technical aspects of the NCBI toolkit

39

ASN.1 libraries

39

Object loader layer

39

Utilities layer

39

Access libraries

40

Vibrant portable graphical interface

40

Network client/server libraries

41

5. NCBI toolkit applications

42

Entrez

42

BLAST

42

BankIt

42

Sequin

42

Others

42

6. Summary

42

3. EBI databases and tools

45

Rainer Fuchs and Graham N. Cameron

1. EBI information products

45

2. Databases and software on the EBI CD-ROM

46

EBI software for DOS computers

48

EBI retrieval software for Macintosh computers

49

Other software

51

3. Network information services

51

EBI database and information servers

51

On-line database access

54

Remote database searches

56

4. Contacting the EBI

56

Acknowledgements

58

References

58

4. Networked services

59

G. Williams

1. Introduction

59

Logging in to the system

59

Computer names

59

2. Electronic mail

60

E-mail

60

E-mail servers

60

3. File transfer protocol (FTP)

61

FTP

61

File formats

63

Archie

64

4. Remote log in

64

Telnet

64

BIDS

65

MEDLARS

65

MSDN

66

5. Mailing lists and network news

66

Mailing lists

67

Usenet/network news

68

6. Information servers

70

Gopher

70

WWW

70

7. Further information

74

References

74

5. DNA sequencing methodology and software

75

William D. Rawlinson and Barclay G. Barrell

1. Introduction

75

2. DNA sequencing methods

76

3. Sequence handling software and sequence project design

78

Conventions

79

Display of trace data from within the database

80

Software created to make design of sequencing reactions easier

80

4. The software for assembling sequence data

82

The database assembly and handling program (xbap)

82

Alternative packages

93

5. Assessment of sequencing projects

94

Recording information about the sequencing templates

94

Assessment of the sequence data during assembly

94

6. Discussion

95

Acknowledgements

97

References

97

6. Molecular biology software for the Apple Macintosh

99

M. Ginsburg and M. P. Mitchell

1. Introduction

99

2. GeneWorks

100

Overview

100

DNA analysis

101

Protein analysis

104

Special analyses

108

3. MacVector suite

111

Overview

111

DNA analysis

112

Protein analysis

115

Special analyses

117

AssemblyLIGN

119

4. DNAStar

119

Overview

119

Sequence editing

120

Pattern analysis

121

Protein analysis

122

Special analyses

123

5. Sequencher

126

Overview

126

Entering sequences

127

Assembling the data

127

Editing the data

128

6. Amplify

128

Overview

128

Running the program

129

7. MacPattern

130

Overview

130

Running the program

130

8. Other programs

132

Suppliers

133

Internet sources

134

References

134

Further reading

135

7. Sequence comparison and alignment

137

Stephen F. Altschul

1. Introduction

137

2. Global sequence alignment

137

Algorithms

137

Substitution and gap scores

139

Statistics

140

3. Global multiple alignment

141

Scores

141

Algorithms

142

4. Local sequence alignment

143

Algorithms

143

Local alignment statistics

144

Local alignment scoring systems

148

5. Database search methods

151

Parallel architectures

151

Heuristic algorithms

152

Vector-based comparison methods

154

6. Local multiple alignment

154

Consensus word methods

155

Template methods

156

Progressive alignment methods

156

Pairwise comparison methods

157

Statistically-based methods

157

General issues

158

7. Sequence motifs

158

Weight matrices

159

Generalizations

162

Acknowledgements

162

References

162

8. Simple sequences of protein and DNA

169

John C. Wootton

1. Introduction

169

2. Some practical guidelines to a complex body of theory

170

Complexity, pattern, and periodicity are distinct properties of simple sequences

170

Terminology

171

Local compositional complexity

171

Low complexity is more clear-cut for proteins than DNA

173

Unbiased inference

173

Sources for mathematical background

173

Visual inspection is complementary to mathematical analysis

174

3. Software and examples of applications

175

Available software

175

Comparison of different algorithms and programs

175

Future software developments

180

4. Masking of low-complexity sequences for searching databases

180

The problem

180

Masking methods

180

5. Complexity definitions and segmentation algorithm

181

Definition 1

181

Definition 2

181

Probabilities of complexity states

182

Segmentation algorithm based on compositional complexity

182

References

182

9. Repetitive sequences in DNA

185

Jörg T. Epplen and Olaf Riess

1. Introduction

185

2. Types of repetitive sequences

187

Satellite DNA

187

Simple repetitive DNA sequences

188

Short and long interspersed nucleotide elements (SINEs and LINEs)

188

Minisatellites

189

3. Repeats in genomic DNA (and protein) databanks

189

Evolutionary aspects

190

Expression of repeats

190

Repeats as tools

191

4. Short consensus motifs for the identification of functional sequences in DNA which appear repetitively in and around genes

191

5. Diseases caused by expansion of simple nucleotide repeats

192

6. Conclusions

193

References

193

10. Isochores and synonymous substitutions in mammalian genes

197

Giorgio Bernardi, Dominique Mouchiroud, and Christian Gautier

1. Introduction

197

2. Methods

198

3. Results

198

4. Discussion

203

The frequencies of synonymous substitutions do not exhibit differences related to regions of the mammalian genome

203

Differences in repair efficiency do not cause differences in the rates of synonymous substitutions of genes located in different isochore families

204

Differences in the process of mutation associated with replication timing do not affect the rates nor the biases of synonymous substitutions of genes located in different isochore families

205

5. Conclusions

206

References

207

11. Identifying genes in genomic DNA sequences

209

Eric E. Snyder and Gary D. Stormo

1. Introduction

209

Low-level motif identification

210

Assembling complete genes using multiple pieces of evidence

211

2. Programs

212

GeneModeler

212

GeneID

213

GRAIL

214

GeneParser

215

3. Performance statistics

215

Test data

216

Comparison of currently available programs

218

Results

219

4. Recommendations for users

220

5. Conclusions

223

References

223

12. Prediction of mRNA sequence function

225

Keith Vass

1. Introduction

225

2. Analysis of sequence data

226

Short sequence patterns

226

Repeated sequences

226

Conserved sequences

227

Database searching

228

Secondary structure

229

Secondary structure searches of sequence databases

230

3. Summary

230

References

230

13. Forecasting protein function

231

T. C. Hodgman

1. Introduction

231

2. Structure/function relationships

232

3. General strategy

233

4. Pairwise domain matches

234

FASTA and BLAST

235

MPSRCH and PROSRCH

237

DFLASH

238

Assessing retained sequences

238

5. Weak domain matches

239

General points

239

SBASE

240

PRODOM

240

PLSEARCH

240

BLASTS

242

6. Motif matches

242

Sources

242

Definitions

242

PROSEARCH

244

BLOCKS

244

BLA

245

LUPES

246

7. URF alignments

246

PROFILESEARCH

247

PIPL

247

PTNSRCH

247

SCRUTINEER

247

8. Assessing candidate matches

248

9. Single sequence analyses

249

Repeats

249

Biased composition

249

Secondary structure

250

10. Software sources

251

Acknowledgements

252

References

253

14. DNA and RNA structure prediction

255

Eric Westhof, Pascal Auffinger, and Christine Gaspin

1. Introduction

255

2. Molecular mechanics and molecular dynamics methods

257

The potential energy function

257

Molecular dynamics simulation protocols

259

Modelling large nucleic acids

261

Analysis of the trajectories

261

3. Fine structure and the search for specific regions in DNA

261

4. RNA secondary structure prediction

262

Representation

263

Data necessary for folding RNA molecules

264

Methods of prediction

266

Limits

272

5. RNA tertiary structure construction

272

6. Conclusions

273

Acknowledgements

273

References

275

15. Phylogenetic estimation

279

Nick Goldman

1. Introduction

279

2. Common ground

281

Trees

281

Data

282

Models of evolutionary change

283

Estimation

286

Heuristics

287

3. Phylogenetic estimation methods based on sequences

287

Maximum likelihood methods

287

Parsimony methods

291

4. Phylogenetic estimation methods based on distances

293

Sequence distances

294

Phylogenetic trees from distance matrices

295

5. Comparison of methods

297

6. Other phylogenetic estimation methods

299

Lake's method of invariants

300

Hein's method of simultaneous alignment and phylogenetic tree estimation

300

Minimum message length coding

301

7. Measuring uncertainty

301

Statistical fluctuation

302

Systematic errors

305

8. The future of phylogenetic estimation

306

Appendix: computer programs

308

PHYLIP

308

MEGA

309

PAUP

309

BASEML and BASEMLG

310

PROTML

310

TREEALIGN

310

Minimum message length encoding

310

FASTDNAML

310

References

311

16. Evolution and relationships of protein families

313

William R. Taylor

1. Introduction

313

2. Sequence similarity

314

Pairwise sequence alignment

314

Multiple sequence alignments

318

Structure biased alignment

319

Sequence threading

321

3. Structural comparison

322

Recent comparison methods

323

Fold classification

324

How many protein folds?

326

4. Molecular evolution

328

Genetic algorithm model

328

Gene duplication and fusion

329

Introns and exons

330

Evolution of function

332

5 . . Theory

335

6. Conclusions

336

References

336

A1. List of suppliers

341

Glossary

343

Index

349