DNA and Protein Sequence Analysis
A Practical Approach
M. J. BISHOP
C. J. RAWLINGS
List of contributors
xxi
Abbreviations
xxiii
1. Molecular biology databases
1
Christian Burks
1. Overview
1
Summary
1
Molecular biology databases
2
Sequence databases
2
What are the uses of databases?
16
2. Contributing data to the databases
16
Community pipelines
16
Direct, electronic submission
17
Timeliness of release of data to databanks
17
Promulgating data revisions and extensions
18
3. Retrieving data from the databases
19
Finding databases of interest
19
Media
22
Mechanisms
22
Which databases should I get?
23
4. Using the data
24
Do I have a current version of the database?
24
How often should I repeat routine queries?
24
How redundant is the database?
25
Are there errors in the database?
25
How did I get that result?
26
5. Queries across multiple databases
26
6. Keeping up and going further
28
Acknowledgements
29
References
29
2. The NCBI software tools
31
J. M. Ostell
1. Introduction
31
2. The software toolkit
31
Portable core library
31
Data encoding in ASN.1
32
3. The NCBI data model
32
Introduction
32
Pub
33
Bioseq
34
4. Technical aspects of the NCBI toolkit
39
ASN.1 libraries
39
Object loader layer
39
Utilities layer
39
Access libraries
40
Vibrant portable graphical interface
40
Network client/server libraries
41
5. NCBI toolkit applications
42
Entrez
42
BLAST
42
BankIt
42
Sequin
42
Others
42
6. Summary
42
3. EBI databases and tools
45
Rainer Fuchs and Graham N. Cameron
1. EBI information products
45
2. Databases and software on the EBI CD-ROM
46
EBI software for DOS computers
48
EBI retrieval software for Macintosh computers
49
Other software
51
3. Network information services
51
EBI database and information servers
51
On-line database access
54
Remote database searches
56
4. Contacting the EBI
56
Acknowledgements
58
References
58
4. Networked services
59
G. Williams
1. Introduction
59
Logging in to the system
59
Computer names
59
2. Electronic mail
60
60
E-mail servers
60
3. File transfer protocol (FTP)
61
FTP
61
File formats
63
Archie
64
4. Remote log in
64
Telnet
64
BIDS
65
MEDLARS
65
MSDN
66
5. Mailing lists and network news
66
Mailing lists
67
Usenet/network news
68
6. Information servers
70
Gopher
70
WWW
70
7. Further information
74
References
74
5. DNA sequencing methodology and software
75
William D. Rawlinson and Barclay G. Barrell
1. Introduction
75
2. DNA sequencing methods
76
3. Sequence handling software and sequence project design
78
Conventions
79
Display of trace data from within the database
80
Software created to make design of sequencing reactions easier
80
4. The software for assembling sequence data
82
The database assembly and handling program (xbap)
82
Alternative packages
93
5. Assessment of sequencing projects
94
Recording information about the sequencing templates
94
Assessment of the sequence data during assembly
94
6. Discussion
95
Acknowledgements
97
References
97
6. Molecular biology software for the Apple Macintosh
99
M. Ginsburg and M. P. Mitchell
1. Introduction
99
2. GeneWorks
100
Overview
100
DNA analysis
101
Protein analysis
104
Special analyses
108
3. MacVector suite
111
Overview
111
DNA analysis
112
Protein analysis
115
Special analyses
117
AssemblyLIGN
119
4. DNAStar
119
Overview
119
Sequence editing
120
Pattern analysis
121
Protein analysis
122
Special analyses
123
5. Sequencher
126
Overview
126
Entering sequences
127
Assembling the data
127
Editing the data
128
6. Amplify
128
Overview
128
Running the program
129
7. MacPattern
130
Overview
130
Running the program
130
8. Other programs
132
Suppliers
133
Internet sources
134
References
134
Further reading
135
7. Sequence comparison and alignment
137
Stephen F. Altschul
1. Introduction
137
2. Global sequence alignment
137
Algorithms
137
Substitution and gap scores
139
Statistics
140
3. Global multiple alignment
141
Scores
141
Algorithms
142
4. Local sequence alignment
143
Algorithms
143
Local alignment statistics
144
Local alignment scoring systems
148
5. Database search methods
151
Parallel architectures
151
Heuristic algorithms
152
Vector-based comparison methods
154
6. Local multiple alignment
154
Consensus word methods
155
Template methods
156
Progressive alignment methods
156
Pairwise comparison methods
157
Statistically-based methods
157
General issues
158
7. Sequence motifs
158
Weight matrices
159
Generalizations
162
Acknowledgements
162
References
162
8. Simple sequences of protein and DNA
169
John C. Wootton
1. Introduction
169
2. Some practical guidelines to a complex body of theory
170
Complexity, pattern, and periodicity are distinct properties of simple sequences
170
Terminology
171
Local compositional complexity
171
Low complexity is more clear-cut for proteins than DNA
173
Unbiased inference
173
Sources for mathematical background
173
Visual inspection is complementary to mathematical analysis
174
3. Software and examples of applications
175
Available software
175
Comparison of different algorithms and programs
175
Future software developments
180
4. Masking of low-complexity sequences for searching databases
180
The problem
180
Masking methods
180
5. Complexity definitions and segmentation algorithm
181
Definition 1
181
Definition 2
181
Probabilities of complexity states
182
Segmentation algorithm based on compositional complexity
182
References
182
9. Repetitive sequences in DNA
185
Jörg T. Epplen and Olaf Riess
1. Introduction
185
2. Types of repetitive sequences
187
Satellite DNA
188
Simple repetitive DNA sequences
188
Short and long interspersed nucleotide elements (SINEs and LINEs)
188
Minisatellites
189
3. Repeats in genomic DNA (and protein) databanks
189
Evolutionary aspects
190
Expression of repeats
190
Repeats as tools
191
4. Short consensus motifs for the identification of functional sequences in DNA which appear repetitively in and around genes
191
5. Diseases caused by expansion of simple nucleotide repeats
192
6. Conclusions
193
References
193
10. Isochores and synonymous substitutions in mammalian genes
197
Giorgio Bernardi, Dominique Mouchiroud, and Christian Gautier
1. Introduction
197
2. Methods
198
3. Results
198
4. Discussion
203
The frequencies of synonymous substitutions do not exhibit differences related to regions of the mammalian genome
203
Differences in repair efficiency do not cause differences in the rates of synonymous substitutions of genes located in different isochore families
204
Differences in the process of mutation associated with replication timing do not affect the rates nor the biases of synonymous substitutions of genes located in different isochore families
205
5. Conclusions
206
References
207
11. Identifying genes in genomic DNA sequences
209
Eric E. Snyder and Gary D. Stormo
1. Introduction
209
Low-level motif identification
210
Assembling complete genes using multiple pieces of evidence
211
2. Programs
212
GeneModeler
212
GeneID
213
GRAIL
214
GeneParser
215
3. Performance statistics
215
Test data
216
Comparison of currently available programs
218
Results
219
4. Recommendations for users
220
5. Conclusions
223
References
223
12. Prediction of mRNA sequence function
225
Keith Vass
1. Introduction
225
2. Analysis of sequence data
226
Short sequence patterns
226
Repeated sequences
226
Conserved sequences
227
Database searching
228
Secondary structure
229
Secondary structure searches of sequence databases
230
3. Summary
230
References
230
13. Forecasting protein function
231
T. C. Hodgman
1. Introduction
231
2. Structure/function relationships
232
3. General strategy
233
4. Pairwise domain matches
234
FASTA and BLAST
235
MPSRCH and PROSRCH
237
DFLASH
238
Assessing retained sequences
238
5. Weak domain matches
239
General points
239
SBASE
240
PRODOM
240
PLSEARCH
240
BLASTS
242
6. Motif matches
242
Sources
242
Definitions
242
PROSEARCH
244
BLOCKS
244
BLA
245
LUPES
246
7. URF alignments
246
PROFILESEARCH
247
PIPL
247
PTNSRCH
247
SCRUTINEER
247
8. Assessing candidate matches
248
9. Single sequence analyses
249
Repeats
249
Biased composition
249
Secondary structure
250
10. Software sources
251
Acknowledgements
252
References
253
14. DNA and RNA structure prediction
255
Eric Westhof, Pascal Auffinger, and Christine Gaspin
1. Introduction
255
2. Molecular mechanics and molecular dynamics methods
257
The potential energy function
257
Molecular dynamics simulation protocols
259
Modelling large nucleic acids
261
Analysis of the trajectories
261
3. Fine structure and the search for specific regions in DNA
261
4. RNA secondary structure prediction
262
Representation
263
Data necessary for folding RNA molecules
264
Methods of prediction
266
Limits
272
5. RNA tertiary structure construction
272
6. Conclusions
273
Acknowledgements
273
References
275
15. Phylogenetic estimation
279
Nick Goldman
1. Introduction
279
2. Common ground
281
Trees
281
Data
282
Models of evolutionary change
283
Estimation
286
Heuristics
287
3. Phylogenetic estimation methods based on sequences
287
Maximum likelihood methods
287
Parsimony methods
291
4. Phylogenetic estimation methods based on distances
293
Sequence distances
294
Phylogenetic trees from distance matrices
295
5. Comparison of methods
297
6. Other phylogenetic estimation methods
299
Lake's method of invariants
300
Hein's method of simultaneous alignment and phylogenetic tree estimation
300
Minimum message length coding
301
7. Measuring uncertainty
301
Statistical fluctuation
302
Systematic errors
305
8. The future of phylogenetic estimation
306
Appendix: computer programs
308
PHYLIP
308
MEGA
309
PAUP
309
BASEML and BASEMLG
310
PROTML
310
TREEALIGN
310
Minimum message length encoding
310
FASTDNAML
310
References
311
16. Evolution and relationships of protein families
313
William R. Taylor
1. Introduction
313
2. Sequence similarity
314
Pairwise sequence alignment
314
Multiple sequence alignments
318
Structure biased alignment
319
Sequence threading
321
3. Structural comparison
322
Recent comparison methods
323
Fold classification
324
How many protein folds?
326
4. Molecular evolution
328
Genetic algorithm model
328
Gene duplication and fusion
329
Introns and exons
330
Evolution of function
332
5. Theory
335
6. Conclusions
336
References
336
A1. List of suppliers
341
Glossary
343
Index
349