Comparison of Variants of Blast
COMPARISON OF VARIANTS OF BLAST (Basic Local Alignment Search Tool)
A Thesis
Submitted in partial fulfillment of the requirement for the award of degree of
Master of Engineering In
Software Engineering
Under the Supervision of Ms. Inderveer Chana
Senior Lecturer Computer Science and Engineering Department
Batch 2003-2005
Submitted By Harpreet Kaur
(8033107)
Computer Science & Engineering Department Thapar Institute of Engineering & Technology
(Deemed University), Patiala-147004 (India).
May 2005
ABSTRACT
Gene sequences from related species of plants, animals and microorganisms show
complex patterns of similarity to one another, and many molecular biologists are
convinced that understanding sequence evolution is the first step towards
understanding evolution itself. This is one of the most fascinating aspects of the
study of evolution. The comparison of gene sequences, or biological sequence
analysis, is one of the processes used to understand sequence evolution. Just as the
ancient Greeks used comparative anatomy to understand the human body and linguists
used the Rosetta Stone to decipher Egyptian hieroglyphs, today we can use
comparative sequence analysis to understand genomes. A variety of tools is available
to perform sequence analysis, including a number of DNA sequence alignment tools.
Software packages of automated tools have improved the efficiency of much
biological research. Fast, economical, flexible, and extensible computing power is
making computational analysis increasingly attractive to scientists in many areas of
research, including biology.
More generally, the open source movement has greatly benefited biological research.
The combination of data availability and free software is revolutionizing this field.
BLAST is an efficient tool for biological sequence searches. Several variants of BLAST
have been developed to overcome the limitations of the main BLAST tool. I studied
the variants BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX and PSI-BLAST. Each
variant has advantages and disadvantages relative to the others. The tools work with
different parameters, and these parameters affect the performance of the algorithm.
I analyzed these variants and compared them on the basis of their algorithms,
parameters, and performance. The thesis describes the conditions under which each
variant is most advantageous, the circumstances in which the different versions
should be used, and how the variants can be improved by eliminating their
deficiencies and adding new features.
DECLARATION
I hereby certify that the work which is being presented in the thesis entitled
“Comparison of Variants of Blast (Basic Local Alignment Search
Tool)” in partial fulfillment of the requirements for the award of the degree of Master
of Engineering in Software Engineering at Computer Science and Engineering
Department of Thapar Institute of Engineering and Technology (Deemed
University), Patiala, is an authentic record of my own work carried out under the
supervision of Ms. Inderveer Chana.
The matter presented in this thesis has not been submitted by me for the award of any
other degree of this or any other University.
Harpreet Kaur
This is to certify that the above statement made by the candidate is correct and true to
the best of my knowledge.
Ms. Inderveer Chana
Senior Lecturer
Computer Science and Engineering Department
Thapar Institute of Engineering and Technology
PATIALA- 147004
Countersigned by
Mr. R.S Salaria
Head
Computer Science and Engineering Department
Thapar Institute of Engineering and
Technology
PATIALA- 147004
Dr. D. S. Bawa
Dean Of Academic Affairs
Thapar Institute of Engineering and
Technology
PATIALA- 147004
ACKNOWLEDGEMENT
I wish to express my deep gratitude to Ms. Inderveer Chana, Senior Lecturer,
Computer Science and Engineering Department, for providing her invaluable guidance
and support throughout the thesis work.
I am also thankful to Mr. R.S. Salaria, Head, Computer Science and Engineering
Department, and Mr. Rajesh Bhatia, P.G. Coordinator, for their excellent guidance and
encouragement right from the beginning of this course.
I would also like to thank all the staff members and my fellow students, who were
always there in the hour of need and provided all the help and facilities that I
required for the completion of the thesis.
I wish to express my indebtedness to my parents, who have been a constant source of
love and encouragement.
Finally, I would like to thank God for not letting me down at the time of crisis and
showing me the silver lining in the dark clouds.
Harpreet Kaur
TABLE OF CONTENTS
Abstract
Declaration
Acknowledgement
List of Figures
List of Tables
Organization of Thesis
CHAPTER 1 DATA MINING
1.1 DATA MINING
1.2 WHY DATA MINING
1.3 STEPS OF KDD PROCESS
1.4 WHAT KIND OF DATA CAN BE MINED?
1.4.1 Relational Databases
1.4.2 Data Warehouses
1.4.3 Transactional Databases
1.4.4 Multimedia Databases
1.4.5 Spatial Databases
1.4.6 World Wide Web
1.4.7 Advanced DB and Information Repositories
1.5 ARCHITECTURE FOR DATA MINING SYSTEM
1.5.1 Database, Data Warehouse, or Other Information Repository
1.5.2 Database or Data Warehouse Server
1.5.3 Knowledge Base
1.5.4 Data Mining Engine
1.5.5 Pattern Evaluation Module
1.5.6 Graphical User Interface
1.6 DATA MINING APPLICATIONS
1.7 THE SCOPE OF DATA MINING
CHAPTER 2 BIOINFORMATICS
2.1 WHY BIOINFORMATICS
2.2 BIOINFORMATICS
2.3 AIMS OF BIOINFORMATICS
2.4 STEPS OF KDD FOR BIOINFORMATICS
2.5 WHAT KIND OF DATA CAN BE MINED?
2.5.1 DNA
2.5.2 RNA
2.5.3 PROTEIN
2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS
2.6.1 Clustering
2.6.2 Classification
2.6.3 Association
2.7 THE CENTRAL DOGMA
2.7.1 Transcription
2.7.2 The Genetic Code
2.8 NEED OF DATA MINING IN BIOINFORMATICS
2.9 BIOINFORMATICS AND ITS SCOPE
2.10 APPLICATIONS OF BIOINFORMATICS
CHAPTER 3 INTRODUCTION TO BLAST
3.1 INTRODUCTION
3.2 DATABASES AVAILABLE FOR BLAST SEARCH INCLUDE
3.2.1 Protein Sequence Databases
3.2.2 Nucleotide Sequence Databases
3.3 BLAST ALGORITHM
3.4 BLAST PARAMETERS
3.5 FEATURES OF BLAST
3.5.1 Heuristic
3.5.2 Substitution Matrix
3.5.3 Local Alignments
3.5.4 Ungapped Alignments
3.5.5 Explicit Statistical Theory
3.5.6 Rapid
3.5.7 Sequence Input
3.5.8 Results Format
3.5.9 BLAST Output
CHAPTER 4 VARIANTS OF BLAST
4.1 BLAST VARIANTS
4.2 PSI-BLAST
4.3 BLASTN
4.4 BLASTX
4.5 BLASTP
4.5.1 BLASTP PARAMETERS
4.6 TBLASTN
4.7 TBLASTX
4.7.1 Limitations of TBlastX
CHAPTER 5 COMPARISON OF VARIANTS OF BLAST
5.1 INTRODUCTION
5.1.1 Comparison On The Basis Of Parameters
5.2 COMPARISON ON THE BASIS OF ALGORITHM
5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word Hits Are Generally Rare With Large Identical Words
5.2.2 Extension in BlastN is different from BlastP and other protein based programs
5.3 COMPARISON ON THE BASIS OF PERFORMANCE
5.3.1 Comparison On The Basis of Varying Expect Values
5.3.2 Comparison On The Basis of Word Size
5.3.3 Comparison on the Basis of Execution Time
CHAPTER 6 CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
6.2 FUTURE SCOPE
REFERENCES
LIST OF PUBLICATIONS
GLOSSARY
LIST OF FIGURES
Figure 1.1 The Process of Knowledge Discovery
Figure 1.2 Architecture Of Typical Data Mining
Figure 2.1 The KDD Process For Bioinformatics
Figure 2.2 DNA Molecule
Figure 2.3 Protein Molecule
Figure 3.1 Protein Database
Figure 3.2 Nucleotide Database
Figure 3.3 List of Words From Query Sequence
Figure 3.4 Exact Matches of Words From Word List
Figure 3.5 Maximal Segment Pairs
Figure 3.6 Figure Shows The Word Size Option
Figure 4.1 Blast Variants
Figure 4.2 Blast Variants
Figure 4.3 PSI-Blast
Figure 4.4 PSI-Blast Step1
Figure 4.5 PSI-Blast Step2
Figure 4.6 PSI-Blast Output
Figure 4.7 PSI-Blast Output
Figure 4.8 PSI Blast
Figure 4.9 BlastN
Figure 4.10 Using BlastN for Comparison
Figure 4.11 BlastN Results
Figure 4.12 BlastX
Figure 4.13 Using BlastX for Comparison
Figure 4.14 BlastX Results
Figure 4.15 BlastP
Figure 4.16 Using BlastP for Comparison
Figure 5.1 Conserved Domain Search For Blastn And Blastp
Figure 5.2 Different Word size for BlastN and BlastP
Figure 5.3 Empirically Estimated Probability That An HSP Is Missed By This Method, As a Function of Its Normalized Score
Figure 5.4 Speeds of The One-Hit And Two-Hit Methods
Figure 5.5 Comparison - Varying Expect Values
Figure 5.6 Comparison - Varying Expect Values
Figure 5.7 Varying Expect Values For Blastn
Figure 5.8 Varying Expect Values Blastn
Figure 5.9 Varying Expect Values For Variants
Figure 5.10 Varying Expect Values For Variants
Figure 5.11 Compares The Performance of BLAST Compiled With 32-Bit And 64-Bit Processor
LIST OF TABLES
Table 2.1 The 20-Amino Acids and their official codes
Table 4.1 Programs Available For Blast
Table 5.1 No of hits for varying expect values
Table 5.2 No of hits for varying expect values BlastN
Table 5.3 No of Hits For Varying Word Size
Table 5.4 Varying Execution Time
ORGANIZATION OF THESIS
The thesis entitled “Comparison of Variants of BLAST (Basic Local Alignment
Search Tool)” is concerned with the comparison of variants of BLAST. All tools are
compared according to defined criteria.
The first chapter briefly introduces data mining technology and the techniques used
in data mining. The process of knowledge discovery in databases is also discussed.
The second chapter covers the field of bioinformatics, the need for bioinformatics,
and the kinds of data to which bioinformatics is applied.
The third chapter explains BLAST, the biological tool used for sequence similarity
searching, along with its algorithm and features.
The fourth chapter explores the variants of BLAST (BlastN, BlastX, BlastP, TBlastN,
TBlastX, PSI-Blast): the algorithm, parameters, and performance criteria of each tool.
The fifth chapter compares the variants of BLAST on the basis of parameters,
algorithms and performance. Deficiencies in the parameters and possible
improvements are also highlighted.
CHAPTER 1 DATA MINING
1.1 DATA MINING
Data Mining is the extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases. It is the process of selecting, exploring,
and modeling large amounts of data to uncover previously unknown patterns or
information for a business advantage [4]. Data Mining can be viewed as an analytical
process designed to explore data (usually large amounts of data, typically business or
market related) in search of consistent patterns and/or systematic relationships
between variables, and then to validate the findings by applying the detected patterns
to new subsets of data. There are many terms carrying a similar or slightly different
meaning to data mining, such as knowledge mining from databases, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging. It is a young
interdisciplinary field, drawing from areas such as database systems, data
warehousing, statistics, machine learning, data visualization, information retrieval,
and high-performance computing. Other contributing areas include neural networks,
pattern recognition, spatial data analysis, image databases, signal processing, and
many application fields, such as business, economics, and bioinformatics.
1.2 WHY DATA MINING
“We are drowning in data, but starving for knowledge!”
Necessity is the mother of invention. Automated data collection tools and mature
database technology have led to tremendous amounts of data being stored in
databases, data warehouses and other information repositories. Every day the world
creates 52,000 terabytes of data. Only 4% of the data is used for any purpose. The
thought arose that something useful could be done with this data, and with this
thought the field of data mining was born. Database technology began with the
development of data collection and database creation mechanisms, which led to the
development of effective mechanisms for data management, including data storage
and retrieval, and query and transaction processing. The large number of database
systems offering query and transaction processing eventually and naturally led to the
need for data analysis and understanding. Hence, data mining began its development
out of this necessity.
1.3 STEPS OF KDD PROCESS
Knowledge discovery is defined as “the non-trivial extraction of implicit, unknown,
and potentially useful information from data”. The knowledge discovery process takes
the raw results from data mining (the process of extracting trends or patterns from
data) and carefully and accurately transforms them into useful and understandable
information [6].
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find
invariant representations for the data.
Figure 1.1 The process of Knowledge Discovery [22]
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of
the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form
or a set of such representations as classification rules or trees,
regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
The terms knowledge discovery and data mining are distinct. KDD refers to the
overall process of discovering useful knowledge from data. It involves the evaluation
and possibly interpretation of the patterns to make the decision of what qualifies as
knowledge. It also includes the choice of encoding schemes, preprocessing, sampling,
and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.
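The steps listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names (clean, select_features, mine, interpret), the toy records and the threshold of 200 are hypothetical and are not taken from any particular KDD tool. The sketch chains data cleaning (step 3), data reduction (step 4), a trivial mining step (step 7) and interpretation (step 8).

```python
from statistics import mean


def clean(records):
    """Step 3: drop records with missing fields (one simple strategy)."""
    return [r for r in records if None not in r.values()]


def select_features(records, fields):
    """Step 4: keep only the variables relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def mine(records, field, threshold):
    """Step 7: a toy pattern search that splits records around a threshold."""
    groups = {"low": [], "high": []}
    for r in records:
        groups["high" if r[field] >= threshold else "low"].append(r)
    return groups


def interpret(groups, field):
    """Step 8: summarize each non-empty group by its mean value."""
    return {name: mean(r[field] for r in rows)
            for name, rows in groups.items() if rows}


# Hypothetical raw data: sequence records with a length and a GC fraction.
raw = [
    {"id": 1, "length": 120, "gc": 0.41},
    {"id": 2, "length": None, "gc": 0.55},  # missing value, removed by clean()
    {"id": 3, "length": 480, "gc": 0.62},
    {"id": 4, "length": 150, "gc": 0.39},
]

target = select_features(clean(raw), ["length", "gc"])
patterns = mine(target, "length", threshold=200)
summary = interpret(patterns, "length")
print(summary)
```

In this sketch only the call to mine corresponds to data mining in the narrow sense; the surrounding calls are the additional steps that make up the KDD process.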
1.4 WHAT KIND OF DATA CAN BE MINED?
Data mining is not specific to one type of media or data. Data mining should be
applicable to any kind of information repository. However, algorithms and approaches
may differ when applied to different types of data. Data mining is being put into use
and studied for databases, including relational databases, object-relational databases
and object-oriented databases, data warehouses, transactional databases, unstructured
and semi-structured repositories such as the World Wide Web, advanced databases
such as spatial databases, multimedia databases, time-series databases and textual
databases, and even flat files.
1.4.1 Relational Databases
A relational database consists of a set of tables containing either values of entity
attributes, or values of attributes from entity relationships. Tables have columns and
rows, where columns represent attributes and rows represent tuples. A tuple in a
relational table corresponds to either an object or a relationship between objects and is
identified by a set of attribute values representing a unique key. The most commonly
used query language for relational databases is SQL, which allows retrieval and
manipulation of the data stored in the tables, as well as the calculation of aggregate
functions such as average, sum, min, max and count.
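The aggregate functions mentioned above can be tried with a short sketch that uses Python's built-in sqlite3 module and an in-memory database. The table name purchase, its columns and the sample rows are assumptions made only for this illustration.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchase (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("alice", 30.0), ("bob", 5.0)],
)

# One row per customer: count, sum, average, minimum and maximum amount.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount), AVG(amount), "
    "MIN(amount), MAX(amount) "
    "FROM purchase GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Each row of purchase is a tuple in the sense described above, and the id column plays the role of the unique key.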
1.4.2 Data Warehouses
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site. Data
warehouses are constructed via a process of data cleaning, data transformation, data
integration, data loading and periodic data refreshing.
1.4.3 Transactional Databases
A transactional database consists of a file where each record represents a transaction.
A transaction typically includes a unique transaction identity number and a list of
items making up the transaction.
1.4.4 Multimedia Databases
Multimedia databases include video, images, audio and text media. They can be
stored on extended object-relational or object-oriented databases, or simply on a file
system. Multimedia is characterized by its high dimensionality, which makes data
mining even more challenging. Data mining from multimedia repositories may require
computer vision, computer graphics, image interpretation, and natural language
processing methodologies.
1.4.5 Spatial Databases
Spatial databases are databases that, in addition to usual data, store geographical
information like maps, and global or regional positioning. Such spatial databases
present new challenges to data mining algorithms.
1.4.6 World Wide Web
The World Wide Web is the most heterogeneous and dynamic repository available. A
very large number of authors and publishers are continuously contributing to its
growth and metamorphosis, and a massive number of users are accessing its resources
daily. Data in the World Wide Web is organized in inter-connected documents. These
documents can be text, audio, video, raw data, and even applications. Conceptually,
the World Wide Web is comprised of three major components: The content of the
Web, which encompasses documents available; the structure of the Web, which
covers the hyperlinks and the relationships between documents; and the usage of the
web, describing how and when the resources are accessed.
1.4.7 Advanced DB and Information Repositories
• Object-oriented databases
Object-oriented databases are based on the object-oriented programming
paradigm, where each entity is considered as an object. Each object has associated
with it a set of variables, a set of messages and a set of methods.
Objects that share a common set of properties can be grouped into an object class.
Each object is an instance of its class. For example, an employee class can contain
variables like name, address and birth date.
• Object-relational databases
The object-relational model extends the basic relational data model by adding the
power to handle complex data types, class hierarchies and object inheritance.
These are becoming more popular in industry and applications.
• Spatial databases
Spatial databases include spatially related information. Such databases include
geographical databases, VLSI chip design databases, and medical and satellite
image databases.
• Temporal databases and Time Series databases
Temporal databases usually store relational data that include time-related
attributes. Time series databases store sequences of values that change with time,
such as data collected regarding the stock exchange.
• Legacy databases
A legacy database is a group of heterogeneous databases that combines different
kinds of data systems, such as relational or object oriented databases, spreadsheets
or file systems.
1.5 ARCHITECTURE FOR DATA MINING SYSTEM
The architecture of a typical data mining system has the following components [11]:
1.5.1 Database, Data Warehouse, or Other Information Repository
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data integration techniques may be
performed on the data.
1.5.2 Database or Data Warehouse Server
The database or data warehouse server is responsible for fetching the relevant data,
based on the user's data mining request [14].
A data warehouse is a repository for long-term storage of data from multiple sources,
organized so as to facilitate management decision making. The data are stored under a
unified schema and are typically summarized. Data warehouse systems provide some
data analysis capabilities, collectively referred to as OLAP (On-Line Analytical
Processing).
1.5.3 Knowledge Base
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based on its unexpectedness, may be
included. Other examples of domain knowledge are additional interestingness
constraints or thresholds, and metadata.
Figure 1.2 Architecture of a typical data mining system
1.5.4 Data Mining Engine
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association, classification, cluster analysis,
and evolution and deviation analysis.
1.5.5 Pattern Evaluation Module
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search towards interesting patterns. It may use
interestingness thresholds to filter out discovered patterns.
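As a rough illustration of threshold-based filtering, the sketch below keeps only those discovered patterns whose measures reach given minimum values. The choice of support and confidence as the interestingness measures, the threshold values and the sample rules are assumptions made for this example; the text above does not prescribe particular measures.

```python
def filter_patterns(patterns, min_support=0.1, min_confidence=0.6):
    """Keep only patterns whose measures reach both thresholds."""
    return [
        p for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]


# Hypothetical output of a data mining engine: rules with two measures each.
candidates = [
    {"rule": "bread -> butter", "support": 0.30, "confidence": 0.80},
    {"rule": "milk -> candles", "support": 0.02, "confidence": 0.90},
    {"rule": "tea -> sugar", "support": 0.25, "confidence": 0.40},
]

interesting = filter_patterns(candidates)
print([p["rule"] for p in interesting])
```

Only the first rule passes both thresholds; the second occurs too rarely and the third is too unreliable under the assumed limits.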
1.5.6 Graphical User Interface
This module communicates between users and the data mining system, allowing the
users to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based
on intermediate data mining results.
1.6 DATA MINING APPLICATIONS
• The Google system uses a mathematical algorithm called PageRank to estimate the
relative importance of individual web pages based on link patterns [19].
• Financial institutions have reduced incidents of credit-card fraud through the
application of neural networks, which feature circuits arranged in a brain-like
configuration that can infer patterns from data.
• The medical sector is also taking advantage of data-mining: One application
involves a collaboration between IBM and the Mayo Clinic to detect patterns
in medical records, while another project uses natural-language processing to
map out the "grammar" of amino acid sequences and match them to specific
protein shapes and functions.
• Government organizations such as the Defense Department and the National
Security Agency are using AI technology for several efforts related to national
security, such as the Echelon telecom monitoring system. The Defense
Advanced Research Projects Agency (DARPA) is a leading AI research
investor, and the breakthroughs that come out of DARPA-funded projects are
more often than not put to civilian rather than military use.
• Marketing: In marketing, the primary application is database marketing
systems, which analyze customer databases to identify different customer
groups and forecast their behavior. Business Week (Berry 1994) estimated that
over half of all retailers are using or planning to use database marketing, and
those who do use it have good results; for example, American Express reports a
10- to 15-percent increase in credit-card use. Another notable marketing
application is market-basket analysis.
• Investment: Numerous companies use data mining for investment, but most
do not describe their systems. One exception is LBS Capital Management. Its
system uses expert systems, neural nets, and genetic algorithms to manage
portfolios totaling $600 million; since its start in 1993, the system has
outperformed the broad stock market (Hall, Mani, and Barr 1996).
• Fraud detection: HNC Falcon and Nestor PRISM systems are used for
monitoring credit card fraud, watching over millions of accounts. The FAIS
system (Senator et al. 1995), from the U.S. Treasury Financial Crimes
Enforcement Network, is used to identify financial transactions that might
indicate money laundering activity.
• Telecommunications: The telecommunications alarm-sequence analyzer
(TASA) was built in cooperation with a manufacturer of telecommunications
equipment and three telephone networks (Mannila, Toivonen, and Verkamo
1995). The system uses a novel framework for locating frequently occurring
alarm episodes from the alarm stream and presenting them as rules. Large sets
of discovered rules can be explored with flexible information-retrieval tools
supporting interactivity and iteration. In this way, TASA offers pruning,
grouping, and ordering tools to refine the results of a basic brute-force search
for rules.
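Among the applications listed above, the idea of ranking web pages by link patterns can be sketched in a few lines of Python. This is a simplified power-iteration version of PageRank written for illustration only; the damping factor of 0.85, the fixed number of iterations, the handling of pages without outgoing links and the four-page example graph are assumptions, and the sketch is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share, independent of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal parts along its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked to by all other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example graph the page with the most incoming links, "c", receives the highest score, and "d", which no page links to, receives the lowest.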
1.7 THE SCOPE OF DATA MINING
Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in
gigabytes of store scanner data — and mining a mountain for a vein of valuable ore.
Both processes require either sifting through an immense amount of material, or
intelligently probing it to find exactly where the value resides. Given databases of
sufficient size and quality, data mining technology can generate new business
opportunities by providing these capabilities [21].
• Automated prediction of trends and behaviors. Data mining automates the
process of finding predictive information in large databases. Questions that
traditionally required extensive hands-on analysis can now be answered
directly from the data — quickly. A typical example of a predictive problem is
targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting bankruptcy and other
forms of default, and identifying segments of a population likely to respond
similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools
sweep through databases and identify previously hidden patterns in one step.
An example of pattern discovery is the analysis of retail sales data to identify
seemingly unrelated products that are often purchased together. Other pattern
discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms
are upgraded and new products developed. When data mining tools are implemented
on high performance parallel processing systems, they can analyze massive databases
in minutes. Faster processing means that users can automatically experiment with
more models to understand complex data. High speed makes it practical for users to
analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
CHAPTER 2 BIOINFORMATICS
2.1 WHY BIOINFORMATICS
The information for the set-up of living organisms is stored in the sequences of
nucleotides in DNA. DNA serves two purposes: to provide the information during the
life cycle of a cell and to pass it on to offspring. The discovery of genes and the
genetic code raised the hope of reading the information stored in our genes,
and today we are able to do so: massive progress in sequencing technology has
delivered entire genomes to the tips of our fingers. The era of genomics and
proteomics has opened up the opportunity to go beyond the analysis of single genes
and proteins, towards understanding the interactions between all components of
genomes and proteomes. After trying to comprehend life by cutting it into smaller and
smaller pieces, we are beginning to unveil it in the same way it has been functioning
since its beginning: as a whole.
Computer scientists are important allies for biologists in the struggle to understand
the information in DNA. On the one hand, the massive amount of sequence data requires
new tools (computers and programs) to generate, proofread, store, and access these data.
On the other hand, deciphering genomes necessitates the development of new
hardware and software that make it possible to detect genes, determine relationships between them,
and study their expression, in order to understand the basis of development and disease.
Bioinformatics provides the tools to understand the information in biological data.
2.2 BIOINFORMATICS
Bioinformatics has evolved into a full-fledged multidisciplinary subject that
integrates developments in Information and Computer Technology as applied to
Biotechnology and Biological Sciences. Bioinformatics uses computer software tools
for database creation, data management, data warehousing, data mining and global
communication networking. Bioinformatics is the recording, annotation, storage,
analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein
sequence and structural information [2]. This includes databases of the sequences and
structural information as well as methods to access, search, visualize and retrieve the
information. Bioinformatics concerns the creation and maintenance of databases of
biological information whereby researchers can both access existing information and
submit new entries. Bioinformatics includes sequence analysis (used by geneticists,
cell biologists and molecular biologists), molecular modeling (used by crystallographers,
cell biologists and biochemists), molecular phylogeny/evolution, ecology and population
studies, and medical informatics. The most pressing tasks in bioinformatics involve the
analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
• Finding the genes in the DNA sequences of various organisms
• Developing methods to predict the structure and/or function of newly
discovered proteins and structural RNA sequences.
• Clustering protein sequences into families of related sequences and the
development of protein models.
• Aligning similar proteins and generating phylogenetic trees to examine
evolutionary relationships.
2.3 AIMS OF BIOINFORMATICS
The aims of bioinformatics are basically three-fold:
• To organize data in such a way that researchers can access existing
information and submit new entries as they are produced. While data creation
is an essential task, the information stored in these databases is useless unless
analyzed. Thus the purpose of bioinformatics extends well beyond mere
volume control.
• To develop tools and resources that help in the analysis of data. For example,
having sequenced a particular protein, it is of interest to compare it with previously characterized
sequences. This requires more than just a straightforward database search. As
such, programs such as FASTA and PSI-BLAST must consider what
constitutes a biologically significant resemblance. Development of such
resources requires extensive knowledge of computational theory, as well as a thorough
understanding of biology.
• To use these tools to analyze individual systems in detail, and frequently
to compare them with a few that are related.
2.4 STEPS OF KDD FOR BIOINFORMATICS
KDD for bioinformatics involves the same steps as KDD on conventional
databases. The only difference is the data on which the data
mining is performed: here the data is biomolecular data
[2], such as DNA and RNA sequences. KDD for bioinformatics is
shown in Figure 2.1.
2.5 WHAT KIND OF DATA CAN BE MINED?
KDD for bioinformatics can be applied to biomolecular data. Biomolecular data consists of the following types:
• DNA (deoxyribonucleic acid)
• RNA (ribonucleic acid)
• Protein sequences (2D and 3D structures)
2.5.1 DNA
In most living organisms (except for viruses), genetic information is stored in the
molecule deoxyribonucleic acid, or DNA. DNA is made and resides in the nucleus of
living cells. DNA gets its name from the sugar molecule contained in its backbone
(deoxyribose); however, it gets its significance from its unique structure. There are
four different nucleotide bases that occur in DNA:
A - adenine
T - thymine
C - cytosine
G - guanine
Figure 2.1 The KDD for Bioinformatics
The versatility of DNA comes from the fact that the molecule is actually double-stranded.
The nucleotide bases of the DNA molecule form complementary pairs: each
nucleotide base forms hydrogen bonds with a nucleotide base in the opposite strand of DNA.
This bonding is specific: adenine always bonds to thymine (and
vice versa) and guanine always bonds to cytosine (and vice versa). This bonding
occurs across the molecule, leading to a double-stranded system as shown in Figure 2.2.
Figure 2.2 DNA Molecule
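The pairing rule described above (A with T, G with C) can be illustrated with a minimal Python sketch. The function names are illustrative, and the convention of reading the opposite strand in reverse is standard molecular biology rather than something stated in the text.

```python
# Watson-Crick pairing as stated in the text: A<->T and G<->C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Because the mapping is its own inverse, complementing a strand twice returns the original strand, which is what makes exact self-replication possible.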
The fundamental chemical building block of deoxyribonucleic acid (DNA) is the
nucleotide. A nucleotide consists of three parts:
(1) a nitrogen-containing pyrimidine or purine base,
(2) a deoxyribose sugar, and
(3) a phosphate group that acts as a bridge between adjacent deoxyribose sugars.
The double-stranded DNA molecule has the unique ability that it can make exact
copies of itself, or self-replicate. When more DNA is required by an organism (such
as during reproduction or cell growth) the hydrogen bonds between the nucleotide
bases break and the two single strands of DNA separate. New complementary bases
are brought in by the cell and paired up with each of the two separate strands thus
forming two new identical, double-stranded DNA molecules.
2.5.2 RNA
RNA stands for ribonucleic acid. It is a long molecule but usually single-stranded,
except when it folds back on itself. RNA differs chemically from DNA by containing
ribose instead of deoxyribose and uracil (U) instead of thymine (T). So the
only important differences between RNA and DNA are that
• RNA differs from DNA in one base (U instead of T).
• RNA is usually single-stranded.
The four bases of RNA are
A - adenine
U - uracil
C - cytosine
G - guanine
Some programs automatically handle the U-instead-of-T conversion and many do not
even distinguish between the two classes of nucleic acids. Do not be surprised if a
database entry displays RNA sequences with a T instead of U. In fact, RNA
sequences are encoded in the DNA.
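The U-instead-of-T conversion mentioned above amounts to a single character substitution. The following is a minimal sketch; the function names are illustrative and do not come from any particular program.

```python
def dna_to_rna(seq):
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return seq.upper().replace("T", "U")


def rna_to_dna(seq):
    """Write an RNA sequence in the DNA alphabet (U becomes T)."""
    return seq.upper().replace("U", "T")


if __name__ == "__main__":
    print(dna_to_rna("ATGGCCTAA"))  # AUGGCCUAA
    print(rna_to_dna("AUGGCCUAA"))  # ATGGCCTAA
```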
2.5.3 PROTEIN
A protein is a polymer constructed from amino acids. The representation
most commonly used by biologists to describe a protein is its sequence. A protein
sequence is made up of 20 amino acids, each represented by a letter. These amino
acids, along with their codes, are shown in Table 2.1.
Table 2.1: The 20 Amino Acids and their Official Codes
#    1-Letter Code   3-Letter Code   Name
1    A               Ala             Alanine
2    R               Arg             Arginine
3    N               Asn             Asparagine
4    D               Asp             Aspartic acid
5    C               Cys             Cysteine
6    Q               Gln             Glutamine
7    E               Glu             Glutamic acid
8    G               Gly             Glycine
9    H               His             Histidine
10   I               Ile             Isoleucine
11   L               Leu             Leucine
12   K               Lys             Lysine
13   M               Met             Methionine
14   F               Phe             Phenylalanine
15   P               Pro             Proline
16   S               Ser             Serine
17   T               Thr             Threonine
18   W               Trp             Tryptophan
19   Y               Tyr             Tyrosine
20   V               Val             Valine
Figure 2.3 Protein Molecule
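As a small illustration of the sequence representation, the codes of Table 2.1 can be held in a lookup table and used to expand a one-letter protein sequence. This is only a sketch; the function name and the hyphen-separated output format are illustrative choices.

```python
# One-letter to three-letter amino acid codes, as listed in Table 2.1.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein):
    """Expand a one-letter protein sequence into hyphen-separated three-letter codes."""
    return "-".join(THREE_LETTER[residue] for residue in protein.upper())


if __name__ == "__main__":
    print(to_three_letter("MKV"))  # Met-Lys-Val
```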
2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS
There are many data mining techniques available which can be applied to
biomolecular data. Clustering, classification and association, which are very useful
for biomolecular data, are discussed below. These techniques are able to discover
previously hidden patterns in biomolecular data [2].
2.6.1 Clustering
The search for protein structure motifs begins with the knowledge that some proteins
with low sequence similarity fold into remarkably similar 3-D conformations. Even
globally different structures may share similar or identical substructures.
Protein motifs can be divided into four categories:
i. Sequence motifs: linear strings of amino acid residues with a topological
ordering.
ii. Sequence-structure motifs.
iii. Structure motifs: 3-D objects that correspond to a protein backbone.
iv. Structure-sequence motifs: structure motifs in which nodes of the graph are
annotated with sequence information.
Predictability
It is the degree to which a motif representing one level or facet of protein structure
or function may be predicted from knowledge of another. For the local structure
motifs designated as secondary structure, predictability is the ability to accurately
predict secondary structure classes from amino acid sequence.
Predictive utility
It is the flip side of the predictability criterion. For example, if one takes the view of
secondary structure as an intermediate-level encoding between primary structure and
tertiary structure, then predictive utility ought to be some measure of the gain in
accuracy in predicting tertiary structure with a particular encoding, as compared with
prediction using other possible encodings. Another, more direct measure might be the
degree to which a particular set of proposed motifs, corresponding to secondary
structure classes, constrains the alpha and gamma angles of the included structure
fragments.
Intelligibility
It refers to the ease with which researchers and practitioners of protein science can
understand a given structure motif and can incorporate its information into their own
work. Many factors affect intelligibility. For example, a discovered structure class that
contains one-third traditional alpha helix, one-third traditional beta sheet and one-third
coil is harder to explain than one that correlates almost perfectly with alpha
helix.
Naturalness
It means the degree to which a motif captures some essential biochemical or
evolutionary properties or some essential class structure in the space of protein
sequence or structure fragments under consideration. Some clustering methods are
infamous for finding ersatz clusters in uniformly distributed data. Other clustering
methods produce results very dependent on their starting point. To avoid such results
it is important to carefully choose appropriate representations and attributes for
classification.
Systematicity
It is the degree to which a motif discovery method is derived from explicitly stated
principles and the degree to which the methods can repeatedly be applied to diverse
data and produce consistent results.
Ease Of Discovery
It refers to the computational complexity and data complexity of the methods required
to discover the motif.
2.6.2 Classification
Pattern discovery is a fundamental operation for finding knowledge. A pattern in a
biosequence can help scientists to analyze the properties of a sequence or predict the
function of a new entity. The pattern may also help to classify an unknown sequence
or to assign the sequence to an existing family.
2.6.3 Association
Some qualities or traits in any species do not come alone; they come associated
with some other fundamental differences. So if one particular
characteristic (pattern) occurs in a sequence, the occurrence of another depends upon the confidence of the
association between the two patterns in that sequence.
Types of association
i. Association can be for a pair or set of similarities in the same sequence.
ii. Association can be for a pair or set of similarities in two sequences.
iii. Association can be for a pair or set of similarities in multiple sequences.
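The notion of confidence can be made concrete with the usual association-rule definitions. This is a sketch under stated assumptions: the text does not define support or confidence, so the standard definitions are used here, the patterns are treated as plain substrings, and the example sequences are made up.

```python
def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(pattern in seq for seq in sequences) / len(sequences)


def confidence(sequences, pattern_a, pattern_b):
    """Fraction of the sequences containing pattern_a that also contain pattern_b."""
    with_a = [seq for seq in sequences if pattern_a in seq]
    if not with_a:
        return 0.0
    return sum(pattern_b in seq for seq in with_a) / len(with_a)


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    seqs = ["ATGCGT", "ATGTTT", "CCCGTA", "ATGCGA"]
    print(support(seqs, "ATG"))           # 0.75
    print(confidence(seqs, "ATG", "CG"))  # two of the three ATG sequences contain CG
```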
2.7 THE CENTRAL DOGMA
The expression of the genetic information stored in DNA involves the translation of a
linear sequence of nucleotides into a co-linear sequence of amino acids in proteins.
The flow is: DNA → mRNA → Protein [2].
2.7.1 Transcription
A segment of DNA is first copied into a complementary strand of RNA. This process,
called transcription, is catalyzed by the enzyme RNA polymerase. Near most
genes there is a special pattern in the DNA called a promoter, located upstream of the
transcription start site, which informs the RNA polymerase where to begin
transcription. This is achieved with the assistance of transcription factors that
recognize the promoter sequence and bind to it. Although ribonucleic acid (RNA) is a
long chain of nucleotides (as is DNA), it has very different properties. First, RNA is
usually single-stranded (denoted ssRNA). Second,
RNA has a ribose sugar, rather than deoxyribose. Third, RNA has the pyrimidine
base uracil (abbreviated U) instead of thymine. Fourth, unlike DNA, which is
located primarily in the nucleus, RNA can also be found in the cellular liquid outside
the nucleus, which is called the cytoplasm.
In Eukaryotic organisms, to produce a protein the entire length of the gene, including
both its introns and its exons, is first transcribed into a very large RNA molecule - the
primary transcript. At the end of the gene the transcription stops, and a few dozen
adenine (A) nucleotides are added to the end of the RNA molecule for protection
(the poly-A tail). The 5' cap plays an important part in the initiation of protein synthesis and
protects the growing RNA transcript from degradation. Before this RNA
molecule leaves the nucleus, a complex of RNA processing enzymes removes all the
intron sequences, in a process called splicing, thereby producing a much shorter RNA
molecule. Typical eukaryotic exons have an average length of 200bp, while the average
length of introns is around 10000bp (these lengths can vary greatly between different
introns and exons). In many cases, the pattern of the splicing can vary depending on
the tissue in which the transcription occurs. For example, an intron that is cut from
mRNAs of a certain gene transcribed in the liver may not be cut from the same
mRNA when transcribed in the brain. This variation is called alternative splicing, and
it contributes to the overall protein diversity in the organism. After this RNA
processing step has been completed, the RNA molecule moves to the cytoplasm as a
messenger RNA molecule (mRNA), in order to undergo translation.
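Splicing can be pictured as keeping the exon intervals of the primary transcript and discarding everything between them. The sketch below assumes that the exon coordinates are already known and given as 0-based, end-exclusive pairs; the sequences are made up for illustration.

```python
def splice(primary_transcript, exons):
    """Join the exon slices (0-based, end-exclusive) in order, dropping the introns between them."""
    return "".join(primary_transcript[start:end] for start, end in exons)


if __name__ == "__main__":
    # Made-up transcript: exon (6 nt), intron (9 nt), exon (6 nt).
    exon1, intron, exon2 = "AUGGCC", "GUAAGUCAG", "GAUUAA"
    pre_mrna = exon1 + intron + exon2
    print(splice(pre_mrna, [(0, 6), (15, 21)]))  # AUGGCCGAUUAA
    # Alternative splicing: a different exon list yields a different mRNA.
    print(splice(pre_mrna, [(0, 21)]))           # intron retained
```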
2.7.2 The Genetic Code
The rules by which the nucleotide sequence of a gene is translated into the amino acid
sequence of the corresponding protein, the so-called genetic code, were deciphered in
the early 1960s. The sequence of nucleotides in the mRNA molecule, which acts as an
intermediate, was found to be read in serial order in groups of three. Each triplet of
nucleotides, called a codon, specifies one amino acid (the basic unit of a protein,
analogous to nucleotides in DNA). Since RNA is a linear polymer of four different
nucleotides, there are 4^3 = 64 possible codon triplets. However, only 20 different
amino acids are commonly found in proteins, so that most amino acids are specified
by several codons. In addition, 3 codons (of the 64) specify the end of translation, and
are called stop codons. The codon specifying the beginning of translation is AUG, which is
also the codon for the amino acid methionine. The code has been highly conserved
during evolution: with a few minor exceptions, it is the same in organisms as diverse
as bacteria, plants, and humans.
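The counts given above (64 codons, 20 amino acids, 3 stop codons) can be checked with a short sketch that builds the standard genetic code and translates an mRNA from its first AUG. The compact 64-letter string is a common way of writing the standard code with bases ordered U, C, A, G; the function name and the simplification of starting at the first AUG are illustrative assumptions.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))                               # 64
    print(sum(aa == "*" for aa in CODON_TABLE.values()))  # 3
    print(translate("GGAUGGCCUAA"))                       # MA
```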
2.8 NEED OF DATA MINING IN BIOINFORMATICS
Data in biology are very diverse and abundant. They can be catalogued and classified,
but often cannot be easily summarized or abstracted using a formula.
With the increase in biological knowledge, computer-based databases have become
essential for this task. Bioinformatics databases include the following types:
• Sequence databases
• Structural databases
• Motif databases
• Genome databases
• Proteome databases
• RNA expression
• Literature
• Populations
• Mutations
• Organisms
Moreover, the data of even a single microorganism is very large. Rickettsia conorii is
one of the small bacteria whose complete genome sequence is known. Its genome is about 1.3
million bp long, and this size is still on the small side for bacteria. Human genome
sequences are several billion bp in length. So with the significant growth of the
amount of biomolecular data, it becomes increasingly important to develop new
techniques for extracting knowledge from the data. Data mining is a fundamental
operation in such a domain.
All data in bioinformatics can be converted into DNA sequences: protein and
RNA sequences can be converted into DNA sequences. So data mining needs to be
applied to the DNA sequences, and later the results can be converted for the other
molecular data.
2.9 BIOINFORMATICS AND ITS SCOPE
Bioinformatics has evolved into a full-fledged scientific discipline over the last
decade. The definition of bioinformatics is not restricted to computational molecular
biology and computational structural biology. It now encompasses fields such as
comparative genomics, structural genomics, transcriptomics, proteomics,
cellomics and metabolic pathway engineering. Developments in these fields have
direct implications for healthcare, medicine, discovery of next-generation drugs,
development of agricultural products, renewable energy, environmental protection, etc.
[23]. Bioinformatics integrates the advances in the areas of Computer Science,
Information Science and Information Technology to solve complex problems in Life
Sciences. The core data comprises the genomes and proteomes of human and other
organisms, 3-D structures and functions of proteins, microarray data, metabolic
pathways, cell lines and hybridoma, biodiversity, etc. The sudden growth in
quantitative data in biology has made data capture, data warehousing and data
mining major issues for biotechnologists and biologists. The availability of enormous
amounts of data has resulted in the realization of the inherent biocomplexity
issues, which call for innovative tools for the synthesis of knowledge. Information
Technology, particularly the internet, is used to collect, distribute and access ever-increasing
data, which are later analyzed with mathematics- and statistics-based tools.
Bioinformatics has a key role to play in cutting-edge Research & Development
areas such as functional genomics, proteomics, protein engineering,
pharmacogenomics, discovery of new drugs and vaccines, molecular diagnostic kits,
agro-biotechnology, etc. This has attracted the attention of several companies and
entrepreneurs. As a result, a large number of bioinformatics-based start-ups have
been launched and the trend is likely to continue. This has created a need
for a large number of formally trained individuals in bioinformatics. A
bioinformatician must acquire expertise in the essential multidisciplinary
fields that comprise the core of this new science. Quality research and education in
bioinformatics are vital not only to meet the existing challenges but also to set and
accomplish new goals in Life Science.
2.10 APPLICATIONS OF BIOINFORMATICS
Molecular medicine
The human genome will have profound effects on the fields of biomedical research
and clinical medicine. Every disease has a genetic component. This may be inherited
or a result of the body's response to an environmental stress which causes alterations
in the genome (e.g. cancers, heart disease, diabetes). The completion of the human
genome means that we can search for the genes directly associated with different
diseases and begin to understand the molecular basis of these diseases more clearly
[27]. This new knowledge of the molecular mechanisms of disease will enable better
treatments, cures and even preventative tests to be developed.
Personalized medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance
affects the body's response to drugs. At present, some drugs fail to make it to the
market because a small percentage of the clinical patient population shows adverse
effects to a drug due to sequence variants in their DNA.
As a result, potentially life-saving drugs never make it to the marketplace. Today,
doctors have to use trial and error to find the best drug to treat a particular patient, as
those with the same clinical symptoms can show a wide range of responses to the
same treatment. In the future, doctors will be able to analyze a patient's genetic profile
and prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unraveled, the
development of diagnostic tests to measure a person's susceptibility to different
diseases may become a distinct reality. Preventative actions such as a change of
lifestyle or having treatment at the earliest possible stages, when they are more likely
to be successful, could result in huge advances in our struggle to conquer disease.
Gene Therapy
In the not too distant future, the potential for using genes themselves to treat disease
may become a reality. Gene therapy is the approach used to treat, cure or even prevent
disease by changing the expression of a person's genes. Currently, this field is in its
infancy, with clinical trials for many different types of cancer and other diseases
ongoing.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved
understanding of disease mechanisms and using computational tools to identify and
validate new drug targets, more specific medicines that act on the cause, not merely
the symptoms, of the disease can be developed. These highly specific drugs promise
to have fewer side effects than many of today's medicines.
CHAPTER 3 INTRODUCTION TO BLAST
3.1 INTRODUCTION
The discovery of sequence homology to a known protein or family of proteins often
provides the first clues about the function of a newly sequenced gene. As the DNA
and amino acid sequence databases continue to grow in size they become increasingly
useful in the analysis of newly sequenced genes and proteins because of the greater
chance of finding such homologies. There are a number of software tools for
searching sequence databases, but all use some measure of similarity between
sequences to distinguish biologically significant relationships from chance
similarities. Perhaps the best studied measures are those used in conjunction with variations
of the dynamic programming algorithm. These methods assign scores to insertions,
deletions and replacements, and compute an alignment of two sequences that
corresponds to the least costly set of such mutations. Such an alignment may be
thought of as minimizing the evolutionary distance or maximizing the similarity
between the two sequences compared. In either case, the cost of this alignment is a
measure of similarity; the algorithm guarantees it is optimal, based on the given
scores. Because of their computational requirements, dynamic programming
algorithms are impractical for searching large databases without the use of a
supercomputer. Rapid heuristic algorithms that attempt to approximate the above
methods have been developed, allowing large databases to be searched on commonly
available computers. In many heuristic methods the measure of similarity is not
explicitly defined as a minimal cost set of mutations, but instead is implicit in the
algorithm itself. For example, the FASTP program first finds locally similar regions
between two sequences based on identities but not gaps, and then rescores these
regions using a measure of similarity between residues, such as a PAM matrix, which allows
conservative replacements as well as identities to increment the similarity score.
Despite their rather indirect approximation of minimal evolution measures, heuristic
tools such as FASTP have been quite popular and have identified many distant but
biologically significant relationships.
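The dynamic programming idea mentioned above, finding the least costly set of insertions, deletions and replacements, can be sketched as follows. The unit costs are an illustrative assumption; real tools use substitution matrices and more elaborate gap penalties.

```python
def alignment_cost(a, b, gap=1, mismatch=1):
    """Minimal total cost of insertions, deletions and replacements turning a into b."""
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            replace = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + gap
            insert = curr[j - 1] + gap
            curr.append(min(replace, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_cost("ACGT", "AGT"))        # 1 (one deletion)
    print(alignment_cost("GATTACA", "GATTACA"))  # 0
```

The nested loops visit every pair of positions, so the work grows with the product of the two sequence lengths. This is why the approach becomes impractical when a query must be compared against every sequence of a large database.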
BLAST (Basic Local Alignment Search Tool) employs a measure based on
well-defined mutation scores. It directly approximates the results that would be
obtained by a dynamic programming algorithm for optimizing this measure. The
method will detect weak but biologically significant sequence similarities, and is more
than an order of magnitude faster than existing heuristic algorithms.
BLAST means:
B (Basic) - Despite the adjective "Basic" in its name, it is a sophisticated software
package that has become the single most important piece of software in the field of
bioinformatics.
LA (Local Alignment) - Local alignment is one of the two kinds of alignment; it
finds the best subsequence alignment. This kind of alignment is needed because functional
regions (such as catalytic sites) are localized in relatively short regions.
ST (Search Tool) - It has introduced a number of refinements to database searching that
improved overall search speed and put database searching on a firm statistical
foundation. It searches using a threshold value.
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs
designed to explore all of the available (DNA and protein) sequence databases
regardless of whether the query is protein or DNA. The BLAST programs have been
designed for speed, with a minimal sacrifice of sensitivity to distant sequence
relationships. BLAST uses the concept of a "segment pair" which is a pair of sub-
sequences of the same length that form an ungapped alignment. The algorithm first
looks for short words that are present in both sequences and then extends these at
either end to find the longest segments present in both. The statistical significance of
these High-scoring Segment Pairs is evaluated to determine whether the matches are
random or not. Thus, the scores assigned in a BLAST search have a well-defined
statistical interpretation, making real matches easier to distinguish from random
background.
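The idea of growing a short word match into a longer ungapped segment pair can be sketched as below. Several points are assumptions made for illustration and are not taken from the text: the +1/-1 scores stand in for a substitution matrix, and the rule of stopping once the running score falls more than x_drop below the best score is one simple way of deciding when to stop extending.

```python
def score_pair(a, b, match=1, mismatch=-1):
    """Toy residue score; a real search would look this up in a substitution matrix."""
    return match if a == b else mismatch


def extend_hit(query, subject, q_pos, s_pos, w, x_drop=2):
    """Grow a word hit of length w into an ungapped segment pair.

    Returns (score, query_start, subject_start, length).
    """
    best = sum(score_pair(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    # Extend to the right, remembering the extension that gave the best score.
    run, right, k = best, 0, w
    while q_pos + k < len(query) and s_pos + k < len(subject):
        run += score_pair(query[q_pos + k], subject[s_pos + k])
        if run > best:
            best, right = run, k - w + 1
        elif best - run > x_drop:
            break
        k += 1
    # Extend to the left, starting again from the best score found so far.
    run, left, k = best, 0, 1
    while q_pos - k >= 0 and s_pos - k >= 0:
        run += score_pair(query[q_pos - k], subject[s_pos - k])
        if run > best:
            best, left = run, k
        elif best - run > x_drop:
            break
        k += 1
    return best, q_pos - left, s_pos - left, left + w + right


if __name__ == "__main__":
    # The word CDE matches at query position 4 and subject position 5.
    print(extend_hit("XXABCDEFYY", "ZZZABCDEFQQ", 4, 5, 3))  # (6, 2, 3, 6)
```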
3.2 DATABASES AVAILABLE FOR BLAST SEARCH
3.2.1 Protein Sequence Databases
We can choose a protein database for blastp or blastx.
• nr: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
• month: All new or revised GenBank CDS
translations+PDB+SwissProt+PIR+PRF released in the last 30 days.
• swissprot: Last major release of the SWISS-PROT protein sequence database
(no updates)
• Drosophila genome: Drosophila genome proteins provided by Celera and
Berkeley Drosophila Genome Project (BDGP) (www.fruitfly.org)
• yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations
• ecoli: Escherichia coli genomic CDS translations
• pdb: Sequences derived from the 3-dimensional structures from Brookhaven
Protein Data Bank (www.pdb.org)
• kabat: Kabat's database of sequences of immunological interest
(http://immuno.bme.nwu.edu)
• alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences
3.2.2 Nucleotide Sequence Databases
We can choose a nucleotide database for blastn, tblastn or tblastx.
• nr: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or
phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
• month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released
in the last 30 days.
• Drosophila genome: Drosophila genome provided by Celera and Berkeley
Drosophila Genome Project
• dbest: Database of GenBank+EMBL+DDBJ sequences from EST Divisions
• dbsts: Database of GenBank+EMBL+DDBJ sequences from STS Divisions
• gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped
sequences, and Alu PCR sequences.
• yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
• ecoli: Escherichia coli genomic nucleotide sequences
• pdb: Sequences derived from the 3-dimensional structures from Brookhaven
Protein Data Bank
• kabat: Kabat's database of sequences of immunological interest
• vector: Vector subset of GenBank(R), NCBI, in
ftp://ncbi.nlm.nih.gov/blast/db/
• mito: Database of mitochondrial sequences
• alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from
query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov
(under the /pub/jmc/alu directory).
• epd: Eukaryotic Promoter Database
Figure 3.1 Protein Databases (BLAST protein databases available through the blastp web interface)
Figure 3.2 Nucleotide Databases (BLAST nucleotide databases available through the blastn web interface)
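The pairing of BLAST programs with database types can be summarized as a small lookup table. The database column follows the text of Section 3.2; the query column reflects the usual behaviour of these programs and is added here as background, not taken from the text. The function name is illustrative.

```python
# program: (query type, database type)
# "translated" means the nucleotide sequence is translated into protein before comparison.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide, translated", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "tblastn": ("protein", "nucleotide, translated"),
    "tblastx": ("nucleotide, translated", "nucleotide, translated"),
}


def database_kind(program):
    """Return 'protein' or 'nucleotide': the kind of database the program searches."""
    return BLAST_PROGRAMS[program][1].split(",")[0]


if __name__ == "__main__":
    for name in BLAST_PROGRAMS:
        print(name, "->", database_kind(name), "database")
```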
3.3 BLAST ALGORITHM
(1) In step 1, BLAST filters low complexity regions removes them from the query
sequence. Low compositional complexity or short-periodicity repeats can yield
extremely large numbers of statistically significant but biologically uninteresting
results. The filtering and removal of these can be controlled with the -F flag of the
stand-alone version of BLAST and with check boxes in the web version. Next,
BLAST generates a list of all of short sequences, or words, that make up the query
(Figure a). The default word lengths are 3 and 11, for amino-acid sequences and
nucleotide sequences, respectively, and are adjustable using the -W flag in the stand-
alone version. Then, BLAST uses a scoring matrix (BLOSUM62, by default, for
amino acids) to determine all high-scoring matching words for each word in the query
sequence. No gaps are allowed. The list of matches is reduced by taking only those
that will score above a given threshold, called the neighborhood word-score threshold.
There is a trade-off at this stage between speed and sensitivity: a higher threshold
gives greater speed but increases the chance of missing relevant pairs [8].
• For the query, find the list of high-scoring words of length w.
• For a given word length w (usually 3 for proteins) and a given score matrix,
create a list of all words (w-mers) that can score at least T when compared to
w-mers from the query.
• A query sequence of length L contains a maximum of L-w+1 words (typically w = 3
for proteins). For each word from the query sequence, find the list of words that
will score at least T when scored using a pair-score matrix (e.g. PAM 250). For
typical parameters there are around 50 words per residue of the query.
[Figure 3.3 shows the query sequence L N K C K T P Q G Q R L V N Q, the query word
P Q G, its neighborhood words (P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13,
P M G 13) and words below the threshold T = 13 (P Q N 12, etc.).]
Figure 3.3 List of Words From Query Sequence
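The word-generation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score table is a hand-entered excerpt of BLOSUM62 rows for P, Q and G, the alphabet is restricted to nine residues to keep the example small, and the function names query_words and neighborhood are hypothetical. A real search would use the full 20-letter alphabet and the complete matrix.

```python
from itertools import product

# Hand-entered excerpt of BLOSUM62 (assumption: rows for P, Q and G only,
# restricted to nine residues so that the example stays small).
SCORES = {
    "P": {"P": 7, "Q": -1, "G": -2, "E": -1, "R": -2, "K": -1, "N": -2, "D": -1, "M": -2},
    "Q": {"P": -1, "Q": 5, "G": -2, "E": 2, "R": 1, "K": 1, "N": 0, "D": 0, "M": 0},
    "G": {"P": -2, "Q": -2, "G": 6, "E": -2, "R": -2, "K": -2, "N": 0, "D": -1, "M": -3},
}
ALPHABET = "PQGERKNDM"


def query_words(query, w=3):
    """Return the L-w+1 overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Return every candidate word scoring at least `threshold` against `word`.

    Only works for words made of P, Q and G because of the excerpted table.
    """
    hits = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(SCORES[q][c] for q, c in zip(word, candidate))
        if total >= threshold:
            hits["".join(candidate)] = total
    return hits


words = query_words("LNKCKTPQGQRLVNQ")   # query sequence from Figure 3.3
nbrs = neighborhood("PQG", threshold=13)  # T = 13 as in Figure 3.3
```

With this restricted alphabet the neighborhood of PQG contains PQG, PEG, PRG, PKG, PNG, PDG and PMG, while PQN (score 12) falls below the threshold.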
(2) In the second step, BLAST searches through the target sequence database for
exact matches to the word list generated. Because BLAST has already pre-processed
and indexed the databases for the occurrence of all words in each sequence in the
database, this search is extremely fast. If a match is found, it is used to seed a possible
alignment between the query and the database sequences.
• Compare the word list to the database and identify exact matches.
• Each neighborhood word gives all positions in the database where it is found
(hit list).
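[Figure 3.4 shows the neighborhood words (P Q G 18, P E G 15, P R G 14, P K G 14,
P N G 13, P D G 13, P M G 13) being compared with the database sequences; an exact
match to PMG is found in the database.]
Figure 3.4 Exact matches of words from word list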
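The lookup described in step 2 can be illustrated with a small word index. The sketch below is not the BLAST implementation; it assumes a toy in-memory database held in a dictionary, and the names build_index and find_hits are hypothetical.

```python
from collections import defaultdict


def build_index(database, w=3):
    """Map every word of length w to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_hits(index, word_list):
    """Return the hit list: exact matches of each neighborhood word in the database."""
    return {word: index[word] for word in word_list if word in index}


# Toy database (assumption: two short made-up sequences).
toy_db = {"s1": "AAPMGKK", "s2": "PQGPEG"}
toy_index = build_index(toy_db)
hits = find_hits(toy_index, ["PQG", "PEG", "PMG", "PKG"])
```

Each hit is a seed position from which an alignment may later be extended; words with no exact match, such as PKG here, produce no seed.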
(3) In the third step, the original BLAST method tried to extend the alignment from
the matching words in both directions as long as the score continued to increase.
For each word match, extend alignment in both directions to find alignments that
score greater than score threshold S. The program tries to extend matching segments
(seeds) out in both directions by adding pairs of residues. Residues will be added until
the incremental score drops below a threshold. The resulting alignment was called a
high-scoring pair, or HSP. Gapped BLAST uses a lower threshold for generating the
list of high-scoring matching words; the algorithm uses short matched regions with no
insertions or deletions between them and within a certain distance of each other as the
starting points for longer ungapped alignments. These joined regions are then
extended using the same method as in the original BLAST.
Figure 3.5 Maximal Segment Pairs (MSPs)
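The extension step can be sketched as follows. This is a simplified illustration and not the BLAST source: it extends only to the right of a seed (the left extension is symmetric), it is ungapped, and it uses an assumed match/mismatch score of +5/-4. The names match_score and extend_right are hypothetical.

```python
def match_score(a, b, match=5, mismatch=-4):
    """Toy scoring scheme (assumption): +5 for a match, -4 for a mismatch."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, seed_score, x_drop):
    """Extend an ungapped alignment to the right of a seed.

    q_start and s_start are the first positions after the seed.
    Returns (best_score, residues_added_at_best_score).
    """
    best = current = seed_score
    best_len = 0
    offset = 0
    while q_start + offset < len(query) and s_start + offset < len(subject):
        current += match_score(query[q_start + offset], subject[s_start + offset])
        offset += 1
        if current > best:
            best, best_len = current, offset
        elif best - current > x_drop:
            break  # score fell more than X below the best value seen so far
    return best, best_len


# Seed ACGT (score 20) followed by four matches and then mismatches.
result = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 4, 4, seed_score=20, x_drop=10)
# A small drop-off stops at the first mismatching patch; a large one extends through it.
short = extend_right("AAAACCAAAA", "AAAAGGAAAA", 0, 0, seed_score=0, x_drop=5)
long_ = extend_right("AAAACCAAAA", "AAAAGGAAAA", 0, 0, seed_score=0, x_drop=100)
```

The second pair of calls shows the trade-off: a small drop-off value ends the extension early, whereas a large value lets the extension cross a short mismatching region and reach a higher final score.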
Next, BLAST determines whether each score found by one of the above methods is
greater in value than a given cutoff score S, determined empirically by examining the
range of scores given by comparing random sequences and then choosing a value that
is significantly greater. The maximal scoring pairs, or MSPs, from the entire database
are identified and listed. Finally, BLAST determines the statistical significance of
each score, initially by calculating the probability that two random sequences, one the
length of the query sequence and the other the length of the database (the sum of the
lengths of all of the database sequences) with the same composition (nucleotide or
amino acid) could produce the calculated score.
3.4 BLAST PARAMETERS
There are various parameters that play a vital role in the output produced by
BLAST. Proper values of these parameters can improve the speed and sensitivity
of BLAST. We have analyzed all the parameters to see which of them can be
adjusted to improve the results of BLAST [8]. The parameters of BLAST
include:
• W, word size
Word size is roughly the minimal length of an identical match an alignment must
contain if it is to be found by the algorithm. It controls the number of word hits. The
query sequence and every database sequence are split up into every possible "word" of
a selected size. The default word size is 11 bp for DNA and 3 aa for proteins (it must
be >=7 for DNA). The task of finding HSPs begins with identifying short words of
length W in the query sequence that either match or satisfy some positive-valued
threshold score T when aligned with a word of the same length in a database
sequence. These initial neighborhood word hits act as seeds for initiating searches to
find longer HSPs containing them. The word hits are extended in both directions
along each sequence for as far as the cumulative alignment score can be increased.
Extension of the word hits in each direction is halted when: the cumulative
alignment score falls off by the quantity X from its maximum achieved value; the
cumulative score goes to zero or below, due to the accumulation of one or more
negative-scoring residue alignments; or the end of either sequence is reached.
If we are interested in longer regions of homology we should increase the word size.
Increasing the word size also speeds up the search, especially with larger query
sequences (>5kb) and large databases. But the high values of W in conjunction with
moderate values of T can lead to immense memory requirements. The probability of a
hit decreases with increase in word size [15]. Smaller word sizes increase
sensitivity and decrease speed. For protein searches the best word size is four.
• T, the threshold parameter.
T is referred to as the neighborhood word score threshold (Altschul et al., 1990). It is
the minimum score that a word pair in the segment pair should have. Actually we can
adjust the value of T to control the size of the neighborhood and therefore the number
of word hits in the search space. A lower value of T increases the chance that a
segment pair with a score of at least S will contain a word pair with a score of at least
T. Thus, a small value for T increases the number of hits. But this in turn increases the
execution time of the algorithm because there will be more words generated by the
query sequence and therefore more hits. On the other hand, higher values of T
progressively reduce word hits and reduce the search space. So the proper value of T
depends on the balance between speed and sensitivity. It also depends on the values in
the scoring matrix.
Figure 3.6 Word size option
If the value for T is not chosen carefully (that is, if T is set just a little bit too
low), a combinatorial explosion in neighborhood words will soon lead to the
depletion of all available memory. Even if the neighborhood word list does fit in
memory, however, its sheer size may produce an adverse effect on speed, due to the
consequent loss of processor cache efficiency.
• X, drop off
This value provides a cutoff threshold for the extension algorithm tree exploration.
When the score of a given branch drops below the current best score minus the X
dropoff, the exploration of this branch stops. This variable represents the recent
alignment history [20]. Specifically, it represents how much the score is allowed to
drop off since the last maximum.
A very large value of X doesn't increase the score and requires more computation. It
is nevertheless generally a good idea to use a fairly large value, which reduces the risk
of premature termination. To control speed, the seeding parameters W, T and 2-hit
are better than X.
X not only depends on the substitution scores, but also on gap initiation and extension
costs. We generally need to adjust this parameter in the following two situations:
• If we align sequences that are nearly identical and we want to prevent the
extensions from straying into nonidentical sequences, we can set the various X
values very low.
• If we try to align very distant sequences and have already adjusted W, T and the
scoring matrix to allow additional sensitivity, it makes sense to also increase the
various X values.
• λ, lambda
λ is a matrix-specific constant required to convert a raw score to a normalized score.
Raw score can be a misleading quantity because scaling factors are arbitrary. A
normalized score, corresponding to the original lod score, is therefore a more useful
measure. Lambda is approximately the inverse of the original scaling factor, but its
value may be slightly different due to integer rounding errors. When target
frequencies are calculated from multiple alignments, the target
frequencies naturally sum to 1.
\[ \sum_{i} \sum_{j} q_{ij} = 1 \qquad (1) \]
The score of two amino acids is the log-odds ratio of the observed and expected
frequencies. The same relationship is presented in Equation 2, but the lod score is replaced
by the product of lambda and the raw score.
\[ \lambda S_{ij} = \ln \frac{q_{ij}}{p_i p_j} \qquad (2) \]
Equation 3 rearranges Equation 2 to solve for the pair-wise frequency.
\[ q_{ij} = p_i p_j e^{\lambda S_{ij}} \qquad (3) \]
From Equation 3, we can see that a pair-wise frequency ($q_{ij}$) is implied from
individual amino acid frequencies ($p_i$ and $p_j$) and a normalized score ($\lambda S_{ij}$). The key
to solving for lambda is to provide the individual amino acid frequencies ($p_i$ and
$p_j$) and find a value for lambda where the sum of the implied target frequencies equals
one. The formulation is given in Equation 4.
\[ \sum_{i} \sum_{j} q_{ij} = \sum_{i} \sum_{j} p_i p_j e^{\lambda S_{ij}} = 1 \qquad (4) \]
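Equation 4 has no closed-form solution in general, so lambda is found numerically. The following is a minimal sketch using bisection, not the routine used inside BLAST. It assumes a scoring scheme with a negative expected score and at least one positive score (otherwise no positive solution exists and the search for an upper bound would not terminate); the function name solve_lambda and the +1/-1 DNA example are illustrative assumptions.

```python
import math


def solve_lambda(freqs, score, iterations=200):
    """Find the positive lambda with sum_ij p_i p_j exp(lambda * S_ij) = 1 by bisection."""
    letters = list(freqs)

    def excess(lam):
        total = sum(
            freqs[a] * freqs[b] * math.exp(lam * score(a, b))
            for a in letters
            for b in letters
        )
        return total - 1.0

    # The sum equals 1 at lambda = 0, dips below 1, then grows without bound.
    hi = 1.0
    while excess(hi) <= 0.0:
        hi *= 2.0
    lo = 0.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if excess(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


# Toy example (assumption): uniform DNA composition, +1 match, -1 mismatch.
dna_freqs = {base: 0.25 for base in "ACGT"}
dna_lambda = solve_lambda(dna_freqs, lambda a, b: 1 if a == b else -1)
```

For this toy scheme the equation reduces to 0.25 e^λ + 0.75 e^{-λ} = 1, whose positive root is λ = ln 3, which the bisection reproduces.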
Normally, once lambda is estimated, it is used to calculate the Expect of every HSP in
the BLAST report. Unfortunately, the residue frequencies of some proteins deviate
widely from the residue frequencies used to construct the original scoring matrix.
Recently, some versions of PSI-BLAST and BLASTP have therefore begun to use the
query and subject sequence amino acid compositions to calculate a composition-based
lambda. These "hit-specific" lambdas have been shown to improve BLAST
sensitivity, so this approach may see wider use in the near future. Lambda is also used
in calculating the Expect by using the equation $E = kmn\,e^{-\lambda S}$. Here lambda may be
thought of as the expected increase in reliability of an alignment associated with a unit
increase in alignment score. Reliability in this case is expressed in units of
information, such as bits or nats, with one nat being equivalent to 1/ln(2) (roughly
1.44) bits.
• k, Adjustment
A small adjustment (k) takes into account the fact that optimal local alignment scores
for alignments that start at different places in the two sequences may be highly
correlated. For example, a high-scoring alignment starting at residues 1,1 implies a
pretty high alignment score for an alignment starting at residues 2,2 as well.
• m, length of query
This appears to be the length of the query that we enter to be matched to the different
databases, but in BLAST it is actually the effective length of the query. It may be
defined as the actual length minus the expected HSP length, where the expected HSP
length is the length of an HSP that has an Expect of 1. The size of the search space
is simply the product of the number of letters in the query (m) and the number of
letters in the database (n). The relationship between the expected number of
alignments (E) and the search space (mn)is linear. If the size of the search space is
doubled, the expected number of alignments with a particular score also doubles.
• n, length of the database
This appears to be the length of the database sequences with which the query is to be
matched, but it is actually the effective length of the database. It may be defined as
the sum of the effective lengths of every sequence within it.
No effective length of the query or database can ever be less than 1/k. Setting an
effective length to 1/k basically amounts to ignoring a short sequence for statistical
purposes; when both m and n are less than 1/k, BLAST searches are ill-advised.
• H, Relative Entropy
The formal name for the average information per symbol is entropy. But what if all
symbols aren't equally probable? To compute the entropy, you need to weigh the
information of each symbol by its probability of occurring. This formulation, known
as Shannon's Entropy (named after Claude Shannon), is shown in the following equation.
\[ H = -\sum_{i=1}^{n} p_i \log_2 p_i \]
Entropy (H) is the negative sum over all the symbols (n) of the probability of a
symbol ($p_i$) multiplied by the log base 2 of the probability of a symbol ($\log_2 p_i$). The
relative entropy of a scoring matrix (H) conveniently summarizes the general
behavior of a scoring matrix. Its formulation is similar to the expected score but is
calculated from normalized scores. Its formulation is shown in the following equation.
\[ H = \sum_{i} \sum_{j} q_{ij} \lambda S_{ij} \]
Because $\lambda S_{ij}$ is the log-odds ratio of Equation 2, H is the average number of bits
(or nats) per position in an alignment and is always positive.
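Both quantities are easy to compute directly. The sketch below is illustrative only: the function names are hypothetical, and the target frequencies in the example are those implied by an assumed +1/-1 DNA scoring scheme with uniform base composition (for which lambda is ln 3), not values taken from BLAST.

```python
import math


def shannon_entropy(probs):
    """Average information per symbol, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_entropy(target, background):
    """Sum of q_ij * ln(q_ij / (p_i * p_j)) over all pairs, in nats."""
    return sum(
        q * math.log(q / (background[a] * background[b]))
        for (a, b), q in target.items()
        if q > 0
    )


uniform_bits = shannon_entropy([0.25] * 4)          # four equally likely bases
biased_bits = shannon_entropy([0.7, 0.1, 0.1, 0.1])  # one base dominates

# Toy target frequencies (assumption): matches 3/16 each, mismatches 1/48 each.
background = {base: 0.25 for base in "ACGT"}
target = {
    (a, b): (3 / 16 if a == b else 1 / 48)
    for a in "ACGT"
    for b in "ACGT"
}
h_nats = relative_entropy(target, background)
```

Four equally likely symbols carry 2 bits each, a biased composition carries less, and the relative entropy of the toy scoring scheme comes out as 0.5 ln 3, a positive number of nats per aligned position.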
• E, Expect
Expect is the number of alignments expected by chance during a sequence database
search and can be represented using the Karlin-Altschul equation.
\[ E = k m n e^{-\lambda S} \]
From the above equation we can see that E is a function of the size of the search space
(m × n), the normalized score (λS), and a minor constant (k). The relationship between
the expected number of alignments and the search space (mn) is linear. If the size of
the search space is doubled, the expected number of alignments with a particular score
also doubles. The relationship between the expected number of alignments and score
is exponential. This means that small changes in score can lead to large differences in
E. An E-value tells you how many alignments with a given score are expected by
chance. It is an expected count rather than a probability, although for small values it
is close to the probability that such a match arises by chance. The lower the E value,
the more specific/significant is the match. Its relation with the P value can be
represented as
\[ E = -\ln(1 - P) \]
E is the statistical significance threshold for reporting matches against database
sequences; the default value is 10, such that 10 matches are expected to be found
merely by chance, according to the stochastic model of Karlin and Altschul (1990). In
the BLAST output report the sequences are listed in order of increasing E (expect)
value, that is, from most to least significant. If the E value
ascribed to a match is greater than the EXPECT threshold, the match will
not be reported. Lower EXPECT thresholds are more stringent, leading to fewer
chance matches being reported.
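The behavior of the Karlin-Altschul equation can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the values of k and lambda are assumed example constants rather than parameters read from a BLAST run.

```python
import math


def expect_value(k, lam, m, n, score):
    """E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e)


def e_from_p(p):
    """E = -ln(1 - P), the inverse of p_value."""
    return -math.log1p(-p)


K, LAM = 0.041, 0.267  # illustrative constants (assumption)
e_small_db = expect_value(K, LAM, m=250, n=1_000_000, score=100)
e_double_db = expect_value(K, LAM, m=250, n=2_000_000, score=100)
e_higher_score = expect_value(K, LAM, m=250, n=1_000_000, score=110)
```

Doubling the database length doubles E, while raising the score by 10 divides E by exp(10 λ). For small values, such as E = 0.001, E and P agree to within about one part in a thousand.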
• S, Score
In the late ’60s and early ’70s, Margaret Dayhoff pioneered quantitative techniques
for measuring amino acid similarity. Using sequences that were available at the time,
she constructed multiple alignments of related proteins and compared the frequencies
of amino acid substitutions. As expected, there is quite a bit of variation in amino acid
substitution frequency, and the patterns are generally what you'd expect from the
chemical properties.
Dayhoff represented the similarity between amino acids as a log2 odds ratio, also
known as a lod score. To derive the lod score of an amino acid pair, take the log2 of the
ratio of the pairing's observed frequency to the pairing's random expected
frequency. If the observed and expected frequencies are equal, the lod score is zero. A
positive score indicates that a pair of letters is common, while a negative score
indicates an unlikely pairing. The general formula for any pair of amino acids is
shown in the following equation.
\[ S_{ij} = \log_2 \frac{q_{ij}}{p_i p_j} \]
The score of two amino acids i and j is $S_{ij}$, their individual probabilities are $p_i$ and $p_j$,
and their frequency of pairing is $q_{ij}$. The relationship between the expected number of
alignments and score is exponential. This means that small changes in score can lead
to large differences in E.
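As a small worked example of the lod score, the sketch below evaluates the formula for made-up frequencies. The function name lod_score and the numbers are illustrative assumptions, chosen as powers of two so that the results are easy to verify by hand.

```python
import math


def lod_score(q_ij, p_i, p_j):
    """log2 of the observed pair frequency over the frequency expected by chance."""
    return math.log2(q_ij / (p_i * p_j))


# With p_i = p_j = 0.25 the chance expectation for the pair is 0.0625.
as_expected = lod_score(0.0625, 0.25, 0.25)   # observed equals expected
twice_as_often = lod_score(0.125, 0.25, 0.25)  # pair seen twice as often
half_as_often = lod_score(0.03125, 0.25, 0.25)  # pair seen half as often
```

A pair observed exactly as often as chance predicts scores 0, one observed twice as often scores +1, and one observed half as often scores -1.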
• P-value
A P-value is the probability of finding at least one alignment with a score this high
by chance.
\[ P = 1 - e^{-E} \]
For values less than 0.001, the E-value and P-value are essentially identical.
The aggregate pair-wise P-value for a sum score can be approximated using the above
equation. Thus, when sum statistics are being employed, BLAST not only uses
a different score, it also uses a different formula to convert that score into a
probability; the standard Karlin-Altschul equation $E = kmn\,e^{-\lambda S}$ isn't used to convert
a sum score to an Expect. In the limit of infinite E, P approaches 1; and in the limit as
E approaches 0, E and P approach equality.
Due to inaccuracy in the statistical methods as they are applied in the BLAST
programs, whenever E and P are less than about 0.05, the two values can be
practically treated as being equal.
• Number of sequences in database
The number of sequences in the database also affects the speed and sensitivity of the
BLAST algorithm. If the number of sequences is small, the speed of
BLAST is enhanced, as there are fewer word hits and fewer sequences to be compared
with the query.
• Percent identity
Percent identity is the percent of exact matches between your query sequence and the
database sequence. The positive value is more relevant to protein alignments. This is
the percent of exact + similar (based on properties) amino acid matches.
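The two percentages can be computed from a pair of aligned strings, as in the following sketch. It is an illustration under stated assumptions: similarity is decided by a toy grouping of residues by property, and the function names are hypothetical. Gap characters ("-") never count as identities or positives.

```python
def percent_identity(query_aln, subject_aln):
    """Percent of aligned columns in which both residues are identical."""
    pairs = list(zip(query_aln, subject_aln))
    same = sum(1 for a, b in pairs if a == b and a != "-")
    return 100.0 * same / len(pairs)


def percent_positives(query_aln, subject_aln, is_similar):
    """Percent of aligned columns that are identical or similar."""
    pairs = list(zip(query_aln, subject_aln))
    positive = sum(
        1
        for a, b in pairs
        if a != "-" and b != "-" and (a == b or is_similar(a, b))
    )
    return 100.0 * positive / len(pairs)


# Toy property groups (assumption): acidic, basic, aliphatic.
SIMILAR_GROUPS = [set("DE"), set("KR"), set("ILVM")]


def toy_similar(a, b):
    return any(a in group and b in group for group in SIMILAR_GROUPS)


identity = percent_identity("ACDEK", "ACEDR")
positives = percent_positives("ACDEK", "ACEDR", toy_similar)
gapped_identity = percent_identity("AC-E", "ACDE")
```

In this example only two of five columns are identical (40%), but every column is either identical or similar under the toy grouping, so the positive value is 100%.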
• Number of Alignments
Restricts database sequences to the number specified for which high-scoring segment
pairs (HSPs) are reported; the default limit is 100. If more database sequences than
this happen to satisfy the statistical significance threshold for reporting, only the
matches ascribed the greatest statistical significance are reported.
• Filter
Low-complexity regions, such as proline- or glycine-rich regions or acidic or basic
regions, can yield tremendous numbers of spurious matches between sequences that
have no other similarity between them. The statistics break down when such
decidedly non-random sequences appear; furthermore, search times may be needlessly
increased. To avoid spurious matching and make the statistics more robust,
low-complexity regions can be filtered from the query sequence. Filtering eliminates
statistically significant but biologically uninteresting reports from the blast output
(e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more
biologically interesting regions of the query sequence available for specific matching
against database sequences. Filtering is only applied to the query sequence (or its
translation products), not to database sequences. Filtering should not be expected to
always yield an effect. Furthermore, in some cases, sequences are masked in their
entirety, indicating that the statistical significance of any matches reported against the
unfiltered query sequence should be suspect.
Sometimes we need to mask the human repeats (LINEs and SINEs). This is especially
useful for human sequences that may contain these repeats.
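The idea of masking low-complexity regions can be sketched with a sliding-window entropy test. This is a deliberately simplified stand-in and not the filtering program used by BLAST: the window length, entropy cutoff and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    size = len(window)
    return -sum(c / size * math.log2(c / size) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


diverse = mask_low_complexity("ACDEFGHIKLMN")              # 12 different residues
proline_run = mask_low_complexity("P" * 12)                # pure proline stretch
mixed = mask_low_complexity("ACDEFGHIKLMN" + "P" * 12)     # diverse, then proline-rich
```

A window of twelve different residues is left untouched, a proline run is masked entirely, and in the mixed sequence the mask also covers the few residues bordering the proline-rich stretch because the windows overlap.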
3.5 FEATURES OF BLAST
3.5.1 Heuristic
BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy, which is expected to
find most matches, but sacrifices complete sensitivity in order to gain speed.
However, in practice few biologically significant matches are missed by BLAST that
can be found with other sequence search programs. BLAST searches the database in
two phases. First it looks for short subsequences, which are likely to produce
significant matches, and then it tries to extend these subsequences [8].
3.5.2 Substitution Matrix
A substitution matrix is used during all phases of protein searches (BLASTP,
BLASTX, and TBLASTN). Both phases of the alignment process (scanning &
extension) use a substitution matrix to score matches. This is in contrast to FASTA
that uses a substitution matrix only for the extension phase. Substitution matrices
greatly improve sensitivity. There are two main types of matrices, PAM and
BLOSUM; we can select the preferred matrix.
PAM (Point Accepted Mutation) matrices: predicted matrices, most sensitive for
alignments of sequences with evolutionarily related homologs. The greater the number
in the matrix name, the greater the expected evolutionary (mutational) distance, i.e.
PAM30 would be used for alignments expected to be more closely related in
evolution than an alignment performed using the PAM250 matrix.
BLOSUM (Blocks Substitution Matrix): calculated matrices, most sensitive for local
alignment of related sequences, ideal when trying to identify an unknown nucleotide
sequence. BLOSUM62 is the default matrix set in the BLAST search tool.
3.5.3 Local Alignments
BLAST uses LOCAL ALIGNMENTS for matching sequences rather than GLOBAL
ALIGNMENTS. BLAST tries to find patches of regional similarity, rather than trying
to find the best alignment between your entire query and an entire database sequence.
3.5.4 Ungapped Alignments
Alignments generated with BLAST do not contain gaps. BLAST's speed and
statistical model depend on this, but in theory it reduces sensitivity. However, BLAST
will report multiple local alignments between your query and a database sequence.
3.5.5 Explicit Statistical Theory
BLAST is based on an explicit statistical theory developed by Samuel Karlin and
Steven Altschul. The original theory was later extended to cover multiple weak
matches between query and database entry: the repetitive nature of many biological
sequences (particularly naive translations of DNA/RNA) violates assumptions made
in the Karlin & Altschul theory. While the P values provided by BLAST are a good
rule-of-thumb for initial identification of promising matches, care should be taken to
ensure that matches are not due simply to biased amino acid composition.
The databases are contaminated with numerous artifacts. The intelligent use of filters
can reduce problems from these sources. Remember that the statistical theory only
covers the likelihood of finding a match by chance under particular assumptions; it
does not guarantee biological importance.
3.5.6 Rapid
BLAST is extremely fast. It does not explore the entire search space between two
sequences as it uses the three layers (seeding, extension, and evaluation) of rules to
sequentially refine potential HSPs (high scoring pairs). This minimization of search
space is the key to its speed but at the cost of a loss in sensitivity. You can either run
the program locally or send queries to an E-mail server maintained by NCBI.
3.5.7 Sequence Input
The BLAST web pages accept input sequences in three formats; FASTA sequence
format, NCBI Accession numbers, or GIs. The preferred query sequence format for
the BLAST program is the FASTA format. Advanced BLAST tolerates both spaces
and numbers and is case insensitive.
3.5.8 Results Format
Results are returned in either text format (default) or HTML format (must supply an
e-mail address and select the HTML results format option). A Request ID number is
given so that the results can be obtained at a later time; if we want the results
immediately, we can click on the "Format Results" button.
Formatting items such as the results format option and the number of descriptions and
alignments in the results output are needed only for formatting; these items may be
specified from the BLAST query form or at the time the results are requested. Most
results are held for up to 24 hours; very large result files are deleted after 30 minutes.
3.5.9 BLAST Output
All BLAST programs produce a similar output. This consists of a program introduction,
a schematic distribution of alignments of the query sequence to those in the databases,
a series of one-line descriptions of the database sequences which have significantly
aligned to the query sequence, the actual sequence alignments, and a list of statistics
specific to the BLAST search. The search method and version number are displayed at
the top of the output. The output consists of:
• A schematic distribution of the ordered alignments of the query sequence to those
in the databases. Colored bars are distributed in a way to reflect the region of
alignment onto the query sequence. The color legend represents the significance
of the alignment scores. Holding the mouse over a given bar will display a
description of that specific alignment sequence in the above window; clicking on a
specific bar will cause the browser to jump down to that particular alignment.
• Sequence alignments and their corresponding one-line descriptions are listed in order
of lowest to highest E value, where the E value (expect value) is the number of
matches with an equal or better score expected to occur by chance; the lower the
E value, the more specific/significant the match.
• Identifiers for the database sequences appear in the first column and are
hyperlinked to the associated GenBank entry.
• The Score for each alignment. The score (bits) is a sum value calculated for
alignments using the scoring matrix; the higher the score value, the better the
alignment.
• The percent identity (called "Identities" and given as a percent) is the percent of
exact matches between your query sequence and the database sequence; this value
also gives the number of nucleotide bases or amino acid residues that are matched
in the database sequence versus the query sequence.
• Gap value is the percent of the alignment that has been gapped in the
particular alignment. Alignments are gapped unless specified otherwise by the user
at the BLAST search submission page.
• A list of statistics specific to the particular BLAST search is displayed at the
bottom of the output; it includes the BLAST version number, the database and
the matrices used for the search.
CHAPTER 4 VARIANTS OF BLAST
The best way to identify an unknown sequence is to see if that sequence already exists
in a public database. If the database sequence is a well-characterized sequence, then
you may have access to a wealth of biological information. BLAST (Basic Local
Alignment Search Tool) is a set of similarity search programs designed to explore all
of the available sequence databases (DNA or protein) regardless of whether the query
is protein or DNA. These programs have been tailored specifically for the purpose of
sequence similarity identification. Each BLAST program performs a different task.
Different flavors of BLAST are covered in the following sections [7].
Figure 4.1 Blast Variants
4.1 BLAST VARIANTS
Programs Available For The Blast Search Include:
Program | Query sequences of type | Database of type | Comparison | Application
BlastN | DNA | DNA | Compares a nucleotide query sequence against a nucleotide sequence database | Find DNA sequences that match the query
BlastP | Protein | Protein | Compares an amino acid query sequence against a protein sequence database | Compares an amino acid query sequence against a protein sequence database
BlastX | DNA | Protein | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database | Find what protein the query sequence codes for
TBlastN | Protein | DNA | Compares a protein query sequence against a nucleotide sequence database translated in all reading frames | Find genes in unknown DNA sequences
TBlastX | DNA | DNA | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database | Discover gene structure (find degree of homology between the coding region of the query sequence and known genes in the database)
Table 4.1 Programs Available For The Blast
Types of BLAST Programs:
• blastp compares an amino acid query sequence against a protein sequence
database
• blastn compares a nucleotide query sequence against a nucleotide sequence
database
• blastx compares a nucleotide query sequence translated in all reading frames
against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames.
• tblastx compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database. Note that the
tblastx program cannot be used with the nr database on the BLAST Web page.
Figure 4.2 Blast Variants
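The translated searches (blastx, tblastn and tblastx) rely on translating a nucleotide sequence in all six reading frames: three on the given strand and three on its reverse complement. The sketch below illustrates that step only. It assumes the standard genetic code and an input made purely of the letters A, C, G and T, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of frames +1, +2, +3 followed by -1, -2, -3."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


frames = six_frames("ATGGCCTAA")
```

Each of the six protein strings can then be compared with protein sequences, which is why the translated programs do roughly six times the work of a direct search per nucleotide sequence.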
4.2 PSI-BLAST
PSI-Blast is the preferred method for searching a protein database with a protein
sequence as the key. If used for only one round, it is identical to BlastP. Its algorithm
is designed to conduct further iterations of the search and to extend the search to
distantly related homologues.
PSI stands for Position Specific Iterated. This search method makes use of a profile,
which is a position-specific accounting of what amino acid residues are found in a
family of aligned homologous proteins. PSI-Blast accepts a protein sequence as input
and first conducts a normal BlastP search to identify homologues in the database. A
profile is constructed from the spectrum of sequences found in the initially identified
homologues. This profile is used as the search key to identify more distant relatives.
The process is then iterated, each time refining the profile based on inclusion of the
new members. Ideally, the process is expected to converge on a unique set of genes.
In practice, the search may at some point begin to include proteins that are related by
chance similarity. The user must use judgement to recognize when proteins of known
and unrelated functions begin to appear in the list of finds [19].
PSI-BLAST is an iterative form of
blastp in which a profile is created from the amino acid query and the nth set of results
(meeting the Psi-Expectation) and resubmitted. PSI-BLAST is a program based on
the BLAST 2.0 algorithm that is designed to detect weak relationships between the
query and members of the database not necessarily detectable by standard BLAST
searches [19]. The added sensitivity of this program over regular BLAST comes from
the use of a profile that is constructed (automatically) from a multiple alignment of
the highest scoring hits in the initial BLAST search. The profile is generated by
calculating position-specific scores for every position in the alignment. A highly
conserved position will receive a high score and weakly conserved positions receive
scores near zero. The profile is then used to perform additional BLAST searches
(called iterations) and the results of each iteration used to refine the profile.
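The construction of position-specific scores can be sketched as a log-odds calculation on each column of a multiple alignment. This is a simplified illustration, not the PSI-BLAST procedure: it assumes an ungapped alignment, a four-letter toy alphabet with uniform background frequencies, a simple pseudocount, and no sequence weighting. The function names are hypothetical.

```python
import math
from collections import Counter


def build_profile(alignment, background, pseudocount=1.0):
    """Return one dict per column mapping each residue to a log2-odds score."""
    alphabet = sorted(background)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background[res])
            for res in alphabet
        })
    return profile


def profile_score(profile, sequence):
    """Score a sequence of the same length against the profile."""
    return sum(column[res] for column, res in zip(profile, sequence))


toy_background = {res: 0.25 for res in "ACDE"}
toy_alignment = ["ACA", "ACC", "AAD", "ADE"]  # column 1 conserved, column 3 variable
toy_profile = build_profile(toy_alignment, toy_background)
```

In this toy alignment the fully conserved first column gives A the highest score, the partly conserved second column gives C a smaller positive score, and the third column, where every residue occurs once, scores zero for all residues.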
PSI-BLAST is designed for more sensitive protein-protein similarity searches.
PSI-BLAST is the most sensitive BLAST program, making it useful for finding very
distantly related proteins. We should use PSI-BLAST when our standard
protein-protein BLAST search either failed to find significant hits, or returned hits with
descriptions such as "hypothetical protein" or "similar to...".
Figure 4.3 PSI Blast
When we use PSI-BLAST to search a database, it generates Position Specific Scoring
Matrices, which can then be built into a database of patterns. Then we can just search
one of these databases with a new sequence. One of the difficulties in doing this is
curating the database. In a regular sequence database, we just keep throwing in new
sequences, whereas with one of these pattern databases, we have to periodically go
back and redo the patterns and try to consolidate them and so forth. It takes a lot of
effort to keep up to date. PSI-BLAST analysis is useful both for identifying
the distant members of a protein family, whose relationship is not recognizable by
straight sequence comparison, and also for deducing the function of hypothetical
proteins that are unannotated in the database.
STEPS OF PSI-BLAST ALGORITHM:
STEP 1:
The data to be entered must be in one of the allowed formats for BLAST search. Once
the query sequence is entered, the database to be searched must be selected from the
appropriate pull down menus. Options include a number of different sequence
databases that can be searched using blastp.
Figure 4.4 PSI Blast-Step1
The default database is nr, which is the collection of all unique sequences. It contains
all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF entries.
STEP 2:
The E-value is the statistical significance threshold for reporting matches against
database sequences. The default expect value for the initial BLAST search is 10. This
EXPECT threshold is fairly lenient, allowing all possibly related sequences to be
reported. Here, the initial (BLAST) E value is set to 1.0.
It is appropriate to filter most queries for low-complexity sequences because they give
spuriously high scores that reflect compositional bias rather than significant
position-by-position alignment. Thus we have chosen to filter low-complexity regions.
The BLOSUM62 substitution matrix (gap existence cost = 11; per residue gap cost = 1;
lambda ratio = 0.85) is used by default in BLAST 2.0. Other supported matrices
include PAM30, PAM70, BLOSUM80 and BLOSUM45. Adjustments to the matrix may be in
order when a search for very distant relatives of the query is being performed. A
BLOSUM matrix assigns a score to each pair of aligned residues, based on the
frequency with which that substitution is known to occur among related proteins.
Figure 4.5 PSI Blast-Step2
Then the word size needs to be set, which is 3 by default. Other advanced options can
specify gap costs, word size, and other parameters not otherwise selectable on the
query form. Here, we have not set any advanced option.
STEP 3:
Checking the NCBI-gi designation facilitates additional searches to investigate the
significance of a given alignment, whereas checking graphical overview gives a
graphical overview of the database sequences aligned to the query sequence. The
score of each alignment is represented by bars of different colors. Multiple
alignments on the same database sequence are connected by a striped line. Mousing
over a hit sequence causes the definition and score to be shown in the window at the
top.
The default number of descriptions and alignments to be listed is 500. Although it
may seem useful to change the default to something smaller to control the magnitude
of the output, these variables affect the search in two important ways: First, if the total
number of hits in which E is less than the threshold exceeds the number (x) of
descriptions requested, only the top x most significant would be listed; additional
possibly significant alignments would not be shown, though these may embody
important information. Second, the number of sequences used in generating the
multiple alignment and the position specific matrix is specified by the larger of the
two variables (descriptions, alignments). If at any point in the iterative PSI-BLAST
process, significant sequences are omitted from the profile, all subsequent output will
be affected. By selecting a large number of descriptions (e.g. 250-500) it is possible to
ensure that the E value and not the description limit will be the determining factor in
generating the profile to be used for additional iterations. Reducing the output can
then be accomplished, if desired, by limiting the number of alignments to be reported.
A variety of different alignment formats are available. The choice of which to use is
based on personal preference. Pairwise alignment gives a good view of the quality of
an individual hit. However, a flat query-anchored alignment (with identities) is a
format in which identities shared by numerous sequences can be easily spotted.
There is a second E value, which is the threshold for inclusion in the position
specific matrix used for PSI-BLAST iterations. Here the PSI-BLAST E value is left at
the default setting of 0.001. Together, the two E values (the reporting threshold set
earlier and this inclusion threshold) allow the user to see, and selectively include
based on prior knowledge, all of the BLAST hits up to E=1, but to automatically
include only those hits that satisfy the relatively rigorous E value threshold of 0.001.
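The interplay of the two thresholds can be sketched as follows. This is only an
illustration: the function name partition_hits, the hit names and the E values are
made up for the example, and treating a hit exactly at a threshold as passing is an
assumption.

```python
def partition_hits(hits, report_e=1.0, inclusion_e=0.001):
    """Split (name, e_value) hits by the two thresholds described above.

    Returns (included, listed_only): hits that enter the position-specific
    matrix automatically, and hits that are reported but not included.
    """
    reported = [h for h in hits if h[1] <= report_e]
    included = [h for h in reported if h[1] <= inclusion_e]
    listed_only = [h for h in reported if h[1] > inclusion_e]
    return included, listed_only


# Hypothetical hits for illustration only.
hits = [("seqA", 1e-40), ("seqB", 5e-4), ("seqC", 0.02),
        ("seqD", 0.8), ("seqE", 4.0)]
included, listed_only = partition_hits(hits)
```

With these values, seqA and seqB would be included automatically, seqC and seqD
would be listed for optional manual inclusion, and seqE would not be reported.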
There are some more options to set, which include layout, formatting options on page
with result and autoformat. All these affect the report format but not the results
produced. In the end we click on the search button to initiate the search. In seconds,
the query sequence has been compared to all of the entries in the specified database.
Each comparison is scored and the top scores are listed in rank order.
PSI-BLAST Output
Output of PSI-BLAST is shown both in graphical format and in detailed format. In
detailed format the hits are divided into two categories. Those that are better than the
E value threshold are listed first. Those with E values worse than the threshold, but
still better than 1 (the value selected on the query page), are listed further down the
page.
Figure 4.6 PSI Blast-Output
Figure 4.7 PSI Blast-Output
PSI-BLAST In summary:
Patterns of conservation such as a PSSM (Position Specific Score Matrix) identified
from the alignment of related sequences can aid the recognition of distant similarities.
This power can be further enhanced through iteration of the search procedure.
Position-Specific Iterated BLAST (PSI-BLAST) was developed for this goal, and
furthermore has advantages in speed, simplicity and automatic operation. The
PSI-BLAST program runs as follows.
(1) A standard BLAST search is performed against a database using a substitution
matrix (e.g. BLOSUM62).
(2) A PSSM (checkpoint) is constructed automatically from a multiple alignment of
the hits of the initial BLAST search or of the last round of homology searching.
Highly conserved positions receive high scores and weakly conserved positions receive
low scores.
(3) The new PSSM replaces the initial matrix (e.g. BLOSUM62) or the last round's
PSSM to perform the next "BLAST" search.
(4) Steps 2 and 3 can be repeated, and the newly found sequences are included to build
a new PSSM.
(5) PSI-BLAST has converged if no new sequences are included.
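The control flow of steps (1) to (5) can be sketched as a loop. This shows only the
iteration and the convergence test; the search itself is passed in as a function, and
both psi_blast_rounds and the toy_search stand-in are hypothetical names introduced
for this example.

```python
def psi_blast_rounds(search, query, max_rounds=5):
    """Iterate until no new sequences pass the inclusion threshold.

    `search(query, members)` is assumed to return the set of sequence ids
    found with the model built from `members` (empty set = plain BLAST).
    Returns (members, rounds_run, converged).
    """
    members = set()  # sequences used to build the current PSSM
    for round_no in range(1, max_rounds + 1):
        found = search(query, frozenset(members))
        if found <= members:      # step (5): nothing new was found
            return members, round_no, True
        members |= found          # steps (2)-(4): rebuild model and repeat
    return members, max_rounds, False


# Toy stand-in for a database search: each sequence "reaches" some others.
RELATED = {"query": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}


def toy_search(query, members):
    found = set(RELATED[query])
    for member in members:
        found |= RELATED[member]
    return found


members, rounds, converged = psi_blast_rounds(toy_search, "query")
```

In the toy data, sequence c is only reachable through a, so it is found in the second
round, and the third round adds nothing new.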
Figure 4.8 PSI Blast
PSI-Blast
The blastpgp program can do an iterative search in which sequences found in one
round of searching are used to build a score model for the next round of searching. In
this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As
explained in the accompanying paper, the BLAST algorithm is not tied to a specific
score matrix. Traditionally, it has been implemented using an AxA substitution matrix,
where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the
length of the query sequence; at each position the cost of a letter depends on the
position with respect to the query and on the letter in the subject sequence. The
position-specific matrix for round i+1 is built from a constrained multiple alignment
among the query and the sequences found with sufficiently low e-value in round i. The
top part of the output for each round divides the sequences into those found previously
and used in the score model, and those not used in the score model. The output
currently includes lots of diagnostics requested by users at NCBI. To skip quickly
from the output of one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere in the output.
PSI-BLAST "converges" and stops if all sequences found at round i+1 below the
e-value threshold were already in the model at the beginning of the round [21].
There are several blastpgp parameters specifically for PSI-BLAST:
-j is the maximum number of rounds (default 1; i.e., regular BLAST)
-h is the e-value threshold for including sequences in the score matrix model (default
0.01)
-c is the "constant" used in the pseudocount formula specified in the paper (default
10)
The -C and -R flags provide a "checkpointing" facility whereby a score model can be
stored and later reused.
-C stores the query and frequency count ratio matrix in a file
-R restarts from a file stored previously.
When using -R, it is required that the query specified on the command line match
exactly the query in the restart file. Users who also develop their own sequence
analysis software may wish to develop their own scoring systems. For this purpose the
code in posit.c that writes out the checkpoint can be easily adapted to write out
scoring systems derived by other algorithms in such a way that PSI-BLAST can read
the files in later.
The checkpoint structure is general in the sense that it can handle any position-
specific matrix that fits in the Karlin-Altschul statistical framework for BLAST
scoring.
4.3 BLASTN
Standard nucleotide BLAST compares a nucleotide query sequence against a
nucleotide sequence database. Compared with MEGABLAST, it is better at finding
sequences that are similar, but not identical, to the query. The BLAST nucleotide
algorithm finds similar sequences by
generating an indexed table or dictionary of short subsequences called words for both
the query and the database. The program can then rapidly find initial exact matches to
the query words by simply looking up a particular word in the database dictionary.
These initial matches serve as starting points for longer alignments that are generated
in several steps, ending with a final gapped alignment [8].
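The word-lookup step described above can be sketched with an ordinary dictionary. This
is a minimal illustration, not the blastn implementation: index_words and find_seeds
are hypothetical names, and the two short sequences are made up for the example.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map every word of length w to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        table[sequence[i:i + w]].append(i)
    return table


def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs of exact word matches."""
    subject_index = index_words(subject, w)
    seeds = []
    for q_pos in range(len(query) - w + 1):
        word = query[q_pos:q_pos + w]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


query = "ACGTACGTTAGC"
subject = "TTTACGTACGTTAGCAAA"
seeds = find_seeds(query, subject, w=11)
```

Each seed is a starting point for extension. Calling find_seeds with a smaller w on
the same sequences yields more seeds, which is the sensitivity and speed trade-off
that the word size controls.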
One of the important parameters governing the sensitivity of BLAST searches is the
length of the initial words (word size). The most important reason that blastn is more
sensitive than MEGABLAST is that it uses a shorter default word size. Because of
this, blastn is better than MEGABLAST at finding alignments to related nucleotide
sequences from other organisms since the initial exact match can be shorter.
Figure 4.9 BLASTN
The word size is adjustable in blastn and can be reduced from the default value of 11
to a minimum of 7 to increase sensitivity. This word size can also be increased to
increase
the search speed and limit the number of database hits. Nucleotide-nucleotide
searches are not the recommended way to find homologous protein coding regions in
other organisms. It is better to perform searches at the protein level, either with
translations of the nucleotide sequences or by direct protein-protein BLAST. This is
because of the degeneracy of the genetic code, the greater information available in
amino acid sequence, and the more sophisticated algorithm in protein-protein BLAST.
Figure 4.10 Using Blastn For Comparison
Figure 4.11 Blastn Results
4.4 BLASTX
Sequence similarity between a translated nucleotide sequence and a known biological
protein can provide strong evidence for the presence of a homologous coding region,
and such similarities can often be identified even between distantly related genes. The
computer program BLASTX performs conceptual translation of a nucleotide query
sequence followed by a protein database search in one programmatic step. The
BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable
selectivity that has been parameterized. BLASTX is appropriate for use in moderate
and large scale sequencing projects at the earliest opportunity, when the data are most
prone to containing errors [9].
Most primary sequence data is obtained as nucleic acid, while much of the biological
interest lies in the encoded protein. Inference of likely protein coding regions is often
based on statistical features, such as codon usage and the locations of putative splice
site signals but significant false positive rates are common. In contrast, similarity
between a conceptually translated nucleotide sequence and a known protein sequence
may be highly significant statistically, which suggested a more discriminating
approach to inferring coding potential. BLASTX is used to probe a nucleotide
sequence directly for the presence of protein coding regions by identifying segments
that encode significant similarity to members of a protein sequence database.
The BLASTX program has been successfully employed to identify likely protein
coding sequences in thousands of partial cDNA sequences from human brain tissue.
BLASTX allowed protein-protein comparisons to be considered when only
uncharacterized nucleotide query sequence was available.
The program conceptually translated query sequences in all six reading frames (three
on each strand) and compared each of these full-length translation products with a
comprehensive protein sequence database in a single pass. The BLAST algorithm
approximates a well defined measure of local sequence similarity based on a matrix of
similarity or substitution scores for all possible pairs of residues. The algorithm
identifies ungapped, aligned pairs of sequence segments with locally maximum scores
which meet or exceed a parameterized cutoff score S. These segments are referred to
as "high-scoring segment pairs" (HSPs), and the highest scoring segment pair
derivable from any two sequences is their maximal-scoring segment pair, or MSP.
Figure 4.12 Blastx
A program, BLASTX, based on this rapid, probabilistic algorithm, was used to find
statistically significant HSPs between a translated nucleotide query sequence and a
target protein sequence database. When an HSP was found, the analysis of Karlin and
Altschul was used to estimate the significance of its score. No prior knowledge of the
reading frame or direction was assumed by BLASTX; all possible reading frames in
both orientations of the query sequence were translated into protein sequence using the
standard genetic code. The PAM (point accepted mutation) amino acid substitution
model was typically used for scoring similarity between peptide sequences. By default,
BLASTX used a PAM120 matrix.
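Six-frame conceptual translation can be sketched in a few lines. This is a minimal
illustration rather than the BLASTX code: the function names are hypothetical, only
the standard genetic code is covered, and codons containing anything other than A, C,
G or T are translated as X by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCAAGTAA")
```

Each of the six translated strings would then be searched against the protein
database as if it were a protein query.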
The expected number of alignments scoring S or greater in a comparison between two
random sequences of lengths m and n is
$E = m n K e^{-\lambda S}$
where K and $\lambda$ are parameters dependent on the amino acid compositions of the
sequences. For values less than about 0.1, E is often an acceptable approximation to
P, the probability of occurrence of one or more matches scoring S or greater. In a true
coding region, one reading frame may have a predicted amino acid composition
typical for biologically occurring proteins, while the other reading frames exhibit
anomalous compositions. For this reason, BLASTX calculated separate K and
$\lambda$ values for each reading frame.
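A small numerical sketch shows why E approximates P for small values. It uses the
Karlin-Altschul form of the expectation, E = K m n exp(-lambda S), and the standard
Poisson relation P = 1 - exp(-E). The values of K, lambda, m, n and S below are
illustrative assumptions, not values taken from BLASTX.

```python
import math


def expect_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring `score` or more."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment (Poisson)."""
    return -math.expm1(-e)  # equals 1 - exp(-e), computed accurately


# Illustrative numbers only: a 250-residue query against 1,000,000 residues.
e = expect_value(score=65, m=250, n=1_000_000, K=0.13, lam=0.32)
p = p_value(e)
```

For these inputs E is well below 0.1 and P differs from it by only a few percent,
whereas for E = 3 the two diverge strongly, since P can never exceed 1.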
Figure 4.13 Using Blastx For Comparison
The BLAST algorithm operates in two successive stages, "neighborhood" word
generation followed by the actual search, with an implicit trade-off in speed versus
sensitivity imparted in the first stage. A list of neighborhood words of length W is
generated from consecutive, overlapping words of length W in the query sequence,
using a specified scoring matrix. The neighborhood list contains all words which
satisfy a threshold scoring parameter, T, when aligned with words in the query
sequence. Raising T decreases the size of the neighborhood and, consequently,
increases the search speed in the algorithm’s second stage, but at the expense of
decreased sensitivity. In BLASTX, the neighborhood word list was built from the
conceptual translations of all six reading frames on both strands of the query sequence
[24].
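The effect of the threshold T on the neighborhood can be sketched with a toy scoring
function. The identity score of 5 and mismatch score of -2 are assumptions standing in
for a real substitution matrix such as PAM120 or BLOSUM62, and the function names are
hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -2


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length scoring at least `threshold` vs `word`."""
    neighbors = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


def query_neighborhoods(query, w=3, threshold=8):
    """Neighborhood list for each overlapping word of length w in the query."""
    return {i: neighborhood(query[i:i + w], threshold)
            for i in range(len(query) - w + 1)}


loose = neighborhood("MKV", threshold=8)    # word plus all single mismatches
strict = neighborhood("MKV", threshold=15)  # only the word itself
```

With the toy scores, T = 8 admits the word itself and every word with one mismatch,
while T = 15 admits only the exact word, so raising T shrinks the neighborhood.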
During the second stage of the BLAST algorithm, the neighborhood words from
the first stage are searched for in the database or "target" sequence; the presence of a
neighborhood word match indicates the possible location of an HSP. Individual
neighborhood word matches (or word "hits") are extended in both directions along the
matrix diagonal until the ends are reached or the cumulative alignment score falls
from its maximum achieved value by a parameterized quantity X.
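The extension rule can be sketched as follows for an ungapped alignment along one
diagonal. This is a simplified sketch: the match and mismatch scores (2 and -3), the
function names and the example sequences are assumptions chosen for illustration.

```python
def nucleotide_score(a, b):
    """Hypothetical match/mismatch scores used only in this sketch."""
    return 2 if a == b else -3


def extend_one_way(query, subject, q, s, step, x_drop, score):
    """Extend from (q, s) in one direction; return (best_score, best_length)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:   # fell more than X below the maximum
            break
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w, x_drop,
               score=nucleotide_score):
    """Return (score, query_start, query_end) of the extended segment pair."""
    seed = sum(score(query[q_start + i], subject[s_start + i])
               for i in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w,
                                  1, x_drop, score)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1,
                                 -1, x_drop, score)
    return seed + left + right, q_start - l_len, q_start + w + r_len


result = extend_hit("GGACGTACGTCC", "TTACGTACGTAA", 2, 2, w=4, x_drop=5)
```

In this example the four-letter seed grows to the right until two consecutive
mismatches push the running score more than X below its maximum, and the extension
to the left stops without adding anything.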
Figure 4.14 Blastx Results
4.5 BLASTP
The BLASTP program is a search tool for databases of protein sequences that is
widely used by biologists as a first step in investigating new genome sequences.
BLASTP finds high-scoring local alignments without gaps between a query sequence
q and sequence s in the database. The score of an alignment is the sum of the scores of
individual alignments between amino acids that make up the protein. These individual
scores come from a scoring matrix modeling the rate of evolutionary mutation.
BLASTP is the most widely used program for determining alignments of protein
sequences against databases such as GenBank. It is a three-step algorithm in which the
database only needs to be scanned for exact matches to a precomputed set of words [14].
The BLASTP algorithm works in three steps:
1. Neighborhood Construction. A set of words of length W, called the neighborhood
N, is computed. Each word scores at least T with some word of equivalent length
in the query sequence Q.
2. Hit Detection. Each subject SB in the database DB is scanned for (exact) matches
to a word in N.
3. Hit Extension. The match, or hit H, is extended into a potentially higher scoring
alignment.
Figure 4.15 Blastp
The first step is to create a neighbourhood for each (short) segment of length W of
the query sequence. The neighbourhood consists of all sequences of W amino acids
that match the query segment with a high score. An automaton is built to recognize
the union of all neighbourhoods. The second step is to scan the database for exact
matches to any neighbour. These matches are called hits. The third step attempts to
extend a hit into a high-scoring pair of segments with approximate matches to the left
and right of the hit. As each pair of aligned residues is included in the alignment, the
score of the aligned pair is looked up in a score matrix and added to a running sum.
Extension of a hit continues until the falloff value, X, is reached.
Figure 4.16 Using Blastp For Comparison
4.5.1 BlastP Parameters
1. [DATABASE] Valid database name
• Default : nr
2. [EXPECT] The statistical significance threshold for reporting matches. If the E
value ascribed to a match is greater than the EXPECT threshold, the match will not be
reported. Lower EXPECT thresholds are more stringent, leading to fewer chance
matches being reported.
• Default : 10.0
3. [ENTREZ_QUERY] Entrez query to limit Blast search
• Value : Entrez query format
• Default : Empty
4. [FILTER] Sequence filter identifier
• L for Low Complexity
• R for Human Repeats
• m for Mask for Lookup
5. [GAP_OPEN_COSTS] Gap open costs
• Value : integer values
• Default : 5 for nuc-nuc, 11 for proteins, non-affine for megablast
6. [GAP_EXTEND_COSTS] Gap extend costs
• Value : space separated float values
• Default : 2 for nuc-nuc, 1 for proteins, non-affine for megablast
7. [MATRIX_NAME] A key element in evaluating the quality of a pairwise sequence
alignment is the "substitution matrix", which assigns a score for aligning any possible
pair of residues.
• Value : Valid matrix name
• Default : BLOSUM62
4.6 TBLASTN
It compares a protein query sequence against a nucleotide sequence database
dynamically translated in all six reading frames (both strands). The "Protein query -
Translated db [tblastn]" search is useful for finding protein homologs in unannotated
nucleotide data. A tblastn search allows you to compare a protein sequence to the
six-frame translations of a nucleotide database. It can be a very productive way of
finding homologous protein coding regions in unannotated nucleotide sequences such as
expressed sequence tags (ESTs) and draft genome records (HTG), located in BLAST
databases est and htgs, respectively. ESTs are short, single-read cDNA sequences.
These comprise the largest pool of sequence data for many organisms and contain
portions of transcripts from many uncharacterized genes. Since ESTs have no
annotated coding sequences, there are no corresponding protein translations in the
BLAST protein databases. Hence a tblastn search is the only way to search for these
potential coding regions at the protein level. The HTG sequences, draft sequences
from various genome projects or large genomic clones, are another large source of
unannotated coding regions [8].
Like all translating searches, the tblastn search is especially suited to working with
error prone data like ESTs and draft genomic sequences from HTG because it
combines BLAST statistics for hits to multiple reading frames and thus is robust to
frame shifts introduced by sequencing error.
4.7 TBLASTX
Tblastx compares the six-frame translations of a nucleotide query sequence against
the six-frame translations of a nucleotide sequence database. The tblastx program
cannot be used with the nr database on the BLAST Web page because it is
computationally intensive. The "Nucleotide query - Translated db [tblastx]" is useful
for identifying novel genes in error prone query sequence. Tblastx takes a nucleotide
query sequence, translates it in all six frames, and compares those translations to the
database sequences dynamically translated in all six frames. This effectively performs
a more sensitive blastp search without doing the manual translation. Tblastx gets
around the potential frame-shift and ambiguities that may prevent certain open
reading frames from being detected. This is very useful in identifying potential
proteins encoded by single pass read ESTs. In addition, it would be a good tool for
identifying novel genes [8].
4.7.1 Limitations Of Tblastx
1. TblastX is computationally intensive.
2. Until recently, there were not many completely sequenced genomes.
3. When a match is found, there is rarely a description of what was found.
CHAPTER 5 COMPARISON OF VARIANTS OF BLAST
5.1 INTRODUCTION
Blast is a successful tool for comparing biological sequences. Nowadays a large amount
of biological data is available, but standalone Blast is not sufficient to handle all types
of queries related to sequence similarity, so different variants (BlastX, BlastP,
BlastN, TBlastN, TBlastX, PSI-Blast) have been developed. Each variant has
limitations and advantages, and each is made to serve a different purpose, so the user
should know which tool to use in which situation. A comparison between these variants
is needed to understand the tools thoroughly [29].
Comparison Between The Variants of Blast on The Basis of:
• Parameters
• Algorithm
• Performance
5.1.1 Comparison On The Basis Of Parameters
All variants of BLAST run on the same algorithm as the main BLAST program, but
some differences between the variants cause their functionality to differ. Most
parameters of the main BLAST program are common to all variants. Still, some
parameters are present only in certain variants, and their presence or absence can give
one tool an advantage over another.
5.1.1.1 Conserved Domain Search Is Not Applied To Blastn, It Is Applicable To
Blastp.
Proteins often contain several domains, each with a distinct function (membrane
binding, signal peptide, etc.). As species evolve, the functional parts of important
proteins remain relatively constant over time, and may even be copied and adapted for
use by other proteins. Such domains have evolved as modules that are combined in
various arrangements to produce proteins of unique function. Conserved domains are
structural modules that have been reused frequently during the process of evolution.
NCBI's new Conserved Domain Search (CD-Search) service can be used to identify
conserved domains in a protein sequence.
Figure 5.1 Conserved Domain For BlastN and BlastP
Influence of absence of CDD Search: Conserved Domain Search is applicable only
to proteins, because it is based on PSSMs (Position Specific Score Matrices), which are
applied only to proteins. By applying PSSMs, specific functional areas within
proteins can be searched. The functional domains found can be used in further
research. Because PSSMs are not applied to nucleotides, no search option is available
for specific functional areas that may exist in nucleotide sequences.
Conserved domain search will not work for nucleotides, as it is based on PSSMs, which
do not apply to nucleotides.
5.1.1.2 The Default Word Size Is 11 Characters For Blastn. The Default Word
Size Is 3 For BLASTP, Due To Which BLASTP Searches Run Slower
Than BLASTN.
Word size (seed length) strongly affects database searching. Search speed increases
with word size: decreasing the word size increases sensitivity but decreases the speed
of the search program. The word size for BlastP is very small compared to BlastN.
The seed in BlastP is 3 residues long, so during the second step of the algorithm a
large number of hits are found in the database, and more time is spent on the search.
In BlastN the seed is 11 nucleotides long. It is difficult to find many exact matches
for such a large seed, so fewer hits are found and results are displayed in less time
than with BlastP, but sensitivity decreases in BlastN.
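A rough calculation illustrates the difference between the two default word sizes. It
assumes, for simplicity, that all letters are equally frequent and independent, which
real sequences are not, and it ignores the additional neighborhood words that BlastP
accepts.

```python
def exact_word_match_probability(alphabet_size, word_size):
    """Chance that a random word equals a given word (uniform letters)."""
    return 1.0 / alphabet_size ** word_size


blastn_words = 4 ** 11    # possible 11-nucleotide words
blastp_words = 20 ** 3    # possible 3-residue words

# How much more likely a random exact 3-residue match is than an 11-mer match.
ratio = (exact_word_match_probability(20, 3)
         / exact_word_match_probability(4, 11))
```

Under these assumptions there are 4,194,304 possible 11-nucleotide words but only
8,000 possible 3-residue words, so a chance exact match at a given position is roughly
500 times more likely for the protein word.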
Figure 5.2 Different Word Sizes For BlastN and BlastP
5.1.1.3 Blastn Is Very Different From The Protein-Based Algorithms. Blastn Seeds Are Identical Words. T Is Never Used In Blastn.
In BlastN a word hit is simply two identical words. T is the threshold parameter for
word scores in the protein-based programs. It matters when no exact match to a query
word is found: it enlarges the set of word seeds by defining a neighborhood for each
word. The neighborhood of a word contains the word itself and all other words whose
score against it, computed with the scoring matrix, is at least T. By adjusting T, it is
possible to control the size of the neighborhood and therefore the number of word hits
in the search space [30]. T is not used in BlastN, because BlastN always looks for
identical matches, so no neighborhood is needed.
Influence of absence of T: Not using T is a big limitation of the BlastN algorithm. If
no identical seed is found in BlastN, there will be no match: the search stops there, no
extension of the seed is performed and no alignment is found.
Improvement: T should be used in BlastN, so that more word hits can be found. A
neighborhood would be created for each word seed, containing the other words that
score at least T against it, and extension would be applied to hits on any of these
words. Extension in both directions continues as long as the alignment score does not
fall more than the drop-off score 'X' below its maximum. The search therefore would
not need to stop when no identical seed is found; more sequence matches could be
found, and there would be less chance of missing alignments.
5.1.1.4 Unlike Nucleotide BLAST, There Is No Comparable MEGABLAST For
Protein Searches.
MegaBlast is optimized for aligning sequences that differ slightly as a result of
sequencing or other similar "errors". When a larger word size is used (see explanation
below), it is up to 10 times faster than more common sequence similarity programs.
MegaBlast is also able to efficiently handle much longer DNA sequences than the
blastn program of the traditional BLAST algorithm.
Influence of absence of Mega Blast:
MegaBlast is an improvement on the existing BlastN algorithm, but no such program
exists for proteins. Batch queries cannot be run for protein sequence searches, and
long sequences cannot be searched as efficiently. To improve the speed of protein
searches and to handle long sequences, a MegaBlast-like program should be developed
for proteins that can run large protein sequences and batches of sequences at a time.
5.1.1.5 Genetic Code Option Is Only Used With Blastx, Genetic Code Option Is
Disabled With Tblastn
The genetic code is the relationship between the sequence of the bases in the DNA
and the sequence of amino acids in proteins. Both DNA and proteins are linear
polymers thus it seems logical to suppose that the sequence of bases in DNA codes for
the sequences of amino acids in proteins. However, there are 20 amino acids found in
proteins and only 4 different bases found in DNA so the coding ratio cannot be 1 to 1
nor can it be 2 bases to 1 amino acid, which would only give 16 different
combinations. At least 3 bases in combination as a triplet are required to code for each
amino acid and this would give 4 to power 3 = 64 possible combinations of triplet
bases or codons. We now know that the genetic code is based on these triplet codons.
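The counting argument above can be checked directly by enumerating all base
combinations of each length. The variable names are chosen only for this illustration.

```python
from itertools import product

BASES = "ACGT"
AMINO_ACID_COUNT = 20

# Number of distinct "words" of 1, 2 and 3 bases.
combinations = {length: len(list(product(BASES, repeat=length)))
                for length in (1, 2, 3)}

# Shortest word length that provides at least one word per amino acid.
minimum_length = min(length for length, count in combinations.items()
                     if count >= AMINO_ACID_COUNT)
```

Only words of three bases provide enough combinations (64) to cover 20 amino acids,
which is why the code is read in triplets.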
Different species may use different genetic codes to encode the same amino acid. You
have to specify the appropriate genetic code (translation table) for your query
sequence based on the organism and source.
BlastX translates the given nucleotide sequence into protein and then compares it with
the protein database. The genetic code is used to translate the nucleotides into
protein; without it, translation is not possible. Different codes are available for
different species. Mostly the standard genetic code is used.
5.2 COMPARISON ON THE BASIS OF ALGORITHM
All variants of BLAST run on the same algorithm as the main BLAST program, but
there are some differences in how they work, due to which their performance varies.
The different features of the algorithm make it possible to use a different tool for each
purpose. On the basis of these differences in functionality, the algorithms can be
optimized to improve performance.
5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word
Hits Are Generally Rare With Large Identical Words.
The two-hit algorithm isn't used in the original version of BLAST or in BLASTN. The
statistically significant alignments found by the main BLAST algorithm depend on the
threshold value 'T' and the drop-off score X. The central idea of the BLAST algorithm
is that a statistically significant alignment is likely to contain a high-scoring pair of
aligned words. BLAST first scans the database for words (typically of length three for
proteins) that score at least T when aligned with some word within the query
sequence. Any aligned word pair satisfying this condition is called a hit. The second
step of the algorithm checks whether each hit lies within an alignment with a score
sufficient to be reported. This is done by extending a hit in both directions, until the
running alignment's score has dropped more than X below the maximum score yet
attained. This extension step is computationally quite costly; with the T and X
parameters necessary to attain reasonable sensitivity to weak alignments, the
extension step typically accounts for >90% of Blast's execution time. It is therefore
desirable to reduce the number of extensions performed. The refined algorithm is
based upon the observation that an HSP of interest is much longer than a single word
pair, and may therefore entail multiple hits on the same diagonal and within a
relatively short distance of one another. Specifically, a window length A is chosen, and
an extension is invoked only when two non-overlapping hits are found within distance
A of one another on the same diagonal. Any hit that overlaps the most recent one is
ignored. The two-hit method will detect an HSP if it contains two non-overlapping
length-W words of score at least T. Comparing the relative speeds of the one-hit and
two-hit methods, the two-hit method generates on average ~3.2 times as many hits, but
only ~0.14 times as many hit extensions.
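The two-hit trigger can be sketched as bookkeeping per diagonal. This is a simplified
sketch: the function name, the measurement of distance between hit start positions
and the example hit coordinates are assumptions made for illustration.

```python
def two_hit_triggers(hits, w, window):
    """Return the hits that would trigger an extension under the two-hit rule.

    `hits` are (query_pos, subject_pos) word hits of length w. A hit
    triggers an extension only if an earlier, non-overlapping hit lies on
    the same diagonal within `window` positions of it.
    """
    last_on_diagonal = {}
    triggers = []
    for q, s in sorted(hits, key=lambda h: h[1]):
        diagonal = s - q
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s - previous
            if distance < w:
                continue              # overlaps the most recent hit: ignored
            if distance <= window:
                triggers.append((q, s))
        last_on_diagonal[diagonal] = s
    return triggers


# Hypothetical word hits: three on diagonal 40, one isolated, two far apart.
hits = [(10, 50), (11, 51), (20, 60), (5, 200), (100, 400), (200, 500)]
triggers = two_hit_triggers(hits, w=3, window=40)
```

Of the six hits only one triggers an extension: the overlapping hit is ignored, the
isolated hit has no partner, and the last pair lies too far apart on its diagonal.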
Influence of absence of two-hit algorithm: The two-hit algorithm is not used for
BlastN, because the word size for BlastN is large (11 nucleotides) and word hits must
be identical words. Word hits are rare and difficult to find with a large word size. It is
easy to find identical matches for one or two nucleotides in a given database.
Figure 5.3 shows the empirically estimated probability that an HSP is missed by this method, as a function of its normalized score (x-axis: normalized HSP score; y-axis: probability of missing an HSP)
It is very rare, however, to find exactly the same nucleotide sequence for a seed of
11 bp, and rarer still to find two such matches close together on the same diagonal.
Therefore the two-hit algorithm is not used for BlastN.
Figure 5.4 Speeds of the one-hit and two-hit methods
Improvement: If the two-hit algorithm were applied to BlastN, the sensitivity of
BlastN would increase and more accurate sequence similarity would be obtained. This
can be done by decreasing the word size of BlastN, because with a large word size it
is difficult to find matches regularly at two positions, whereas with a short word
size it is easy to find exact matches at more than one position.
5.2.2 Extension in BlastN is different from BlastP and other protein-based programs
Extension for BlastN is different from BlastP because one works on nucleotides and
the other on proteins. Different scoring matrices are used to score the neighborhood
during extension, and different scoring matrices yield separate drop-off (X) scores
for BlastN and BlastP. In BlastN the initial word has 11 nucleotides, for which the
whole score has to be evaluated. This takes more time to calculate than in BlastP,
where the word size is small by comparison.
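The drop-off rule used during extension can be sketched as follows. This is a simplified, ungapped, rightward-only illustration and not the BLAST implementation: the function names are hypothetical, and the match/mismatch scores of +1/-3 are assumed values for nucleotides. A protein-based program would pass a function that looks up scores in a substitution matrix instead.

```python
def nucleotide_score(a, b, match=1, mismatch=-3):
    """Score one aligned nucleotide pair (assumed +1/-3 scheme)."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, score_fn, x_drop):
    """Extend an ungapped alignment to the right of (q_start, s_start).

    The extension stops once the running score has dropped more than
    x_drop below the best score seen so far. Returns a tuple
    (best_score, length_of_best_extension).
    """
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score_fn(query[q_start + i], subject[s_start + i])
        i += 1
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break
    return best, best_len


if __name__ == "__main__":
    # Hypothetical sequences: seven matching bases followed by mismatches.
    q = "ACGTACGTTTTT"
    s = "ACGTACGAAAAA"
    print(extend_right(q, s, 0, 0, nucleotide_score, x_drop=5))
```

Extension to the left works the same way with decreasing indices; the two partial scores are then combined with the score of the seed itself.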
5.3 COMPARISON ON THE BASIS OF PERFORMANCE
Every tool is efficient under different conditions and for different input queries.
The performance of the variants of Blast is measured on the basis of the following
criteria:
• Expect Value
• Word Size
• Time
5.3.1 Comparison On The Basis of Varying Expect Values
A BlastN search was performed using the mRNA sequence of PRDX1 against the
non-redundant database. To observe the effect of the "expect value" parameter, values
of 10, 0.1, and 1e-30 were used, keeping the word size (11) and the filter (low
complexity) constant. Table 5.1 shows the results.
An expect value of 10 returned 163 hits, 0.1 returned 157 hits, and 1e-30 returned
only 65 hits. The expect value is the number of hits with an equal or better score
that would be expected to occur by chance in a search of the database. Decreasing
this value makes the search more stringent, so fewer results are returned.
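How the expect threshold filters hits can be illustrated with the standard relation between a bit score S' and the expect value, E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below is only an illustration: it ignores the effective-length corrections that BLAST applies, and the function names and the lengths used in the example are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_len, db_len, expect):
    """True if a hit would be reported at the given expect threshold."""
    return expect_value(bit_score, query_len, db_len) <= expect


if __name__ == "__main__":
    # Hypothetical query of 1,000 letters searched against 1e9 letters.
    for score in (30.0, 50.0, 150.0):
        e = expect_value(score, 1_000, 1_000_000_000)
        kept = [t for t in (10, 0.1, 1e-30) if e <= t]
        print(f"bit score {score}: E = {e:.3g}, reported at thresholds {kept}")
```

A lower threshold keeps only the hits whose scores are high enough to make a chance match very unlikely, which is why the number of hits shrinks as the expect value is decreased.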
Expect value (e)   BlastN   BlastP   BlastX   TBlastN   TBlastX   PSI-Blast
10                 163      100      100      100       101       501
0.1                157      100      100      101       100       501
1e-30              65       80       58       75        98        480
Table 5.1 No. of hits for varying expect values
In the same manner, the protein sequence of PRDX1 was searched against the
non-redundant protein database using BlastP, BlastX, TBlastX, TBlastN and PSI-Blast.
Again, the expect value was varied while keeping the word size (3) constant. Expect
values of 10 and 0.1 both returned about 100 hits for BlastP, BlastX, TBlastN and
TBlastX, and 501 hits for PSI-Blast, meaning that a 100-fold change in stringency
makes almost no difference. However, when an expect value of 1e-30 was used, the
number of hits fell for every program (for example, to 58 for BlastX). The protein
sequences in the database aligned so well with the PRDX1 protein sequence that only
very low expect values altered the output.
Figure 5.5 Comparison - Varying Expect Values
Figure 5.6 Comparison - Varying Expect Values (title: Performance on the Basis of Expect Value; x-axis: Expect Value; y-axis: No. of Hits; series: BlastN, BlastP, BlastX, TBlastN, TBlastX, PSI-Blast)
Lowering the expect value by a factor of 100 does not make much difference to the
number of hits for BlastP, BlastX, TBlastN and TBlastX; variation appears only when
the expect value is reduced by a much larger factor. As can be seen from the graph,
even though the same input parameters were given to all the variants, PSI-BLAST and
BlastN return the largest number of hits.
5.3.2 Comparison On The Basis of Word Size
Similar to the above experiment, a BlastN search was performed using PRDX1 mRNA.
This time, the expect value was held constant at 10 while the word size was changed
(7, 11, 15). The other variables, such as the nr database and the low complexity
filter, were kept the same. The following results were observed.
Word Size (w)   BlastN
7               163
11              163
15              139
Table 5.2 No. of hits for varying word size (BlastN)
Figure 5.7 Varying Word Size for BlastN (title: Performance of BlastN on the Basis of Varying Word Size; x-axis: Word Size; y-axis: No. of Hits)
The results showed that word sizes of 7 and 11 both returned 163 hits, while a word
size of 15 returned only 139 hits. Word size is a measure of how many items,
nucleotides in this case, are taken together and compared to the database. With a
word size of 11, groups of 11 sequential nucleotides are compared with the database.
The larger the word size, the more stringent the analysis, which is why a word size
of 15 returned fewer results.
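The effect of word size on the number of exact word hits can be illustrated with a small sketch. It counts identical words shared by two sequences for a given word size; the function name and the example sequences are hypothetical, and real BlastN adds filtering and extension on top of this seeding step.

```python
def count_word_hits(query, subject, word_size):
    """Count (query_pos, subject_pos) pairs with identical words."""
    # Index every word of the subject by its start positions.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up every word of the query in that index.
    hits = 0
    for i in range(len(query) - word_size + 1):
        hits += len(index.get(query[i:i + word_size], ()))
    return hits


if __name__ == "__main__":
    # Hypothetical sequences that differ at a single position.
    query = "ACGTACGT"
    subject = "ACGTTCGT"
    for w in (3, 5, 8):
        print(f"word size {w}: {count_word_hits(query, subject, w)} hits")
```

With the single mismatch in the example, short words still match on both sides of it, while a word as long as the whole sequence finds no exact match at all.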
Figure 5.8 Varying Word Size for BlastN (title: Performance on the basis of Word Size; x-axis: Word Size, with values 7, 11 and 15; y-axis: No. of Hits, from 125 to 165; series: BlastN)
Word size can also be varied in BlastP, BlastX, TBlastX, TBlastN and PSI-Blast.
In the next comparison, PRDX1 protein was searched against the protein database using
a constant expect value (1e-70), database (nr), and filter (low complexity). Word
size was varied between 2 and 3.
Word size (w)   BlastP   BlastX   TBlastX   TBlastN   PSI-Blast
2               58       100      100       115       501
3               58       100      57        115       501
Table 5.3 No. of hits for varying word size
Figure 5.9 Varying Word Size for variants (title: Performance on the basis of varying word size; x-axis: word size; y-axis: no. of hits, from 0 to 600; series: word size=2, word size=3)
Figure 5.10 Varying Word Size for variants (title: Performance on the Basis of Word Size; x-axis: Word Size; y-axis: No. of Hits; series: BlastP, BlastX, TBlastX, TBlastN, PSI-Blast)
Varying the word size between 2 and 3 does not affect the number of hits for BlastP,
BlastX, TBlastN and PSI-Blast. It only affects TBlastX, whose number of hits declines
from 100 to 57 as the word size increases.
5.3.3 Comparison on the Basis of Execution Time
BlastX, BlastN and TBlastX were executed on 32-bit and 64-bit processors, with one
and two CPUs, and their execution times in seconds were compared, as shown below.
TEST      NUMBER OF CPUs   32-BIT TIME (in seconds)   64-BIT TIME (in seconds)
blastX    1                1516                       1085
blastX    2                751                        550
blastN    1                297                        252
blastN    2                153                        132
tblastX   1                4999                       3545
tblastX   2                2761                       1940
Table 5.4 Varying Execution Time
From Table 5.4 and the graph in Figure 5.11, it is clear that BlastN takes less time
to execute than the other variants. TBlastX is the slowest of all, whether it is
executed on a 32-bit processor or a 64-bit processor. The performance of BlastX lies
between the two.
Figure 5.11 Performance of BLAST compiled for 32-bit and 64-bit processors (y-axis range: 0 to 6000; series: Single CPU 32-bit, Dual CPU 32-bit, Single CPU 64-bit, Dual CPU 64-bit)
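The timings in Table 5.4 can also be expressed as speedup ratios, which makes the effect of the 64-bit build and of the second CPU easier to compare across programs. The sketch below only does arithmetic on the values copied from Table 5.4; the function name and the data layout are illustrative.

```python
# (program, number of CPUs) -> (32-bit seconds, 64-bit seconds), from Table 5.4
TIMINGS = {
    ("blastX", 1): (1516, 1085),
    ("blastX", 2): (751, 550),
    ("blastN", 1): (297, 252),
    ("blastN", 2): (153, 132),
    ("tblastX", 1): (4999, 3545),
    ("tblastX", 2): (2761, 1940),
}


def speedups(timings):
    """Return rows of (program, 64-bit gain, 2-CPU gain 32-bit, 2-CPU gain 64-bit)."""
    rows = []
    for program in sorted({name for name, _ in timings}):
        t32_one, t64_one = timings[(program, 1)]
        t32_two, t64_two = timings[(program, 2)]
        rows.append((
            program,
            t32_one / t64_one,   # gain from 64-bit on a single CPU
            t32_one / t32_two,   # gain from the second CPU, 32-bit
            t64_one / t64_two,   # gain from the second CPU, 64-bit
        ))
    return rows


if __name__ == "__main__":
    for program, bits, cpu32, cpu64 in speedups(TIMINGS):
        print(f"{program:8s} 64-bit: {bits:.2f}x  2 CPUs (32-bit): {cpu32:.2f}x  "
              f"2 CPUs (64-bit): {cpu64:.2f}x")
```

On these figures the 64-bit build is roughly 1.2 to 1.4 times faster on a single CPU, and the second CPU brings a gain of roughly 1.8 to 2.0 times.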
Summary: The variants of Blast (BlastN, BlastP, BlastX, TBlastN, TBlastX and
PSI-Blast) run with different parameters and different algorithms, and each tool has
different performance criteria. Performance differs with parameters such as Word
Size, Expect Value, and the Databases Available. By selecting different values, the
efficiency of each tool can be improved. In this chapter performance was checked on
the basis of execution time, varying parameters and a comparison of algorithms. From
these results, we can decide which tool is to be used in which situation.
CHAPTER 6 CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
In the plethora of tools available for data mining in bioinformatics, Blast was chosen
due to its unmatched speed, sensitivity and accuracy. Although the performance of
BLAST was the best, variants of BLAST are available to suit different conditions.
Various parameters have a contextual relation with areas other than algorithm design
and computer science, so the analysis of parameters was limited to the point of view
of a computer engineer. From that point of view, improvements in some of the
parameters are suggested.
Firstly, to improve the speed of BlastP the word size should be increased, as in
BlastN. Word size strongly affects database searching. The search time of the
algorithm is inversely related to the word size: decreasing the word size increases
the sensitivity but decreases the speed of the search program. The word size for
BlastP is very small compared to BlastN. In BlastP the word size is 3 residues. For
BlastP, during the second step of the algorithm, a large number of hits are found in
the database because of the small size of the seed, so more time is spent on the
search. In BlastN the seed is 11 nucleotides long. It is difficult to find many exact
matches for such a large seed, so results are displayed in less time than for BlastP
and fewer hits are found. But sensitivity decreases in BlastN.
Secondly, an improvement for BlastP, BlastX, TBlastX and PSI-Blast is suggested. For
nucleotide BLAST, MegaBlast is available; there should also be a comparable MegaBlast
for protein searches. MegaBlast is optimized for aligning sequences that differ
slightly as a result of sequencing or other similar "errors". MegaBlast efficiently
handles much longer DNA sequences than the BlastN program of the traditional BLAST
algorithm. When a larger word size is used, it is up to 10 times faster than more
common sequence similarity programs.
Lastly, each of the variants has some advantages and disadvantages. Regular
exploration and improvement are needed for better efficiency of these tools. Some
features are available only in nucleotide-based tools and are absent in protein-based
versions. By continuously evaluating the performance and exploring the features of
each tool, improvements are being made in this area.
6.2 FUTURE SCOPE
Over the past decade many biological tools have been developed, but improvements are
still needed in these tools to improve their speed and accuracy. Research on improving
existing tools is ongoing. Possible directions are:
• Examination of the problems arising from the use of biological tools.
• How the execution of code affects the performance of the tool, and what
modifications can be made to the source code.
By modifying the existing parameters and source code, speed will increase and the
field of bioinformatics will emerge with a more dynamic scope.
“Measurement and Analysis is the key to Development and Improvement”
So with continuous evaluation of existing versions of biological tools, further
improvements will be possible.
LIST OF PUBLICATIONS
1. Ms. Inderveer Chana, Harpreet Kaur, Navjot Kaur, “Issues Of Software
Engineering and Knowledge Engineering In Bioinformatics”, in National
Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th
March – 19th March.
2. Mrs. Rinkle Aggarwal, Navjot Kaur, Harpreet Kaur, “Algorithmic and Non-
Algorithmic Issues In Database Search Of Sequence databases”, in National
Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th
March – 19th March.
GLOSSARY
Algorithm: A fixed procedure embodied in a computer program. The Basic Local
Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI
uses to search sequence databases for optimal local alignments with a query sequence.
FASTA is another type of algorithm used for database similarity searching.
Alignment: The process of lining up two or more sequences to achieve maximal
levels of identity (and conservation, in the case of amino acid sequences) for the
purpose of assessing the degree of similarity and the possibility of homology.
Codon: The sequence of nucleotides, coded in triplets (codons) along the mRNA, that
determines the sequence of amino acids in protein synthesis. A gene's DNA sequence
can be used to predict the mRNA sequence, and the genetic code can in turn be used
to predict the amino acid sequence.
EST (expressed sequence tag): A short strand of DNA that is a part of a cDNA
molecule and can act as identifier of a gene. Used in locating and mapping genes.
Exons: DNA segments of a gene that encode the amino acid sequence of a protein.
Gap: A space introduced into an alignment to compensate for insertions and deletions
in one sequence relative to another. To prevent the accumulation of too many gaps in
an alignment, introduction of a gap causes the deduction of a fixed amount (the gap
score) from the alignment score. Extension of the gap to encompass additional
nucleotides or amino acids is also penalized in the scoring of an alignment.
Global Alignment: The alignment of two nucleic acid or protein sequences over their
entire length.
Homology: Similarity attributed to descent from a common ancestor.
HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of
the top alignment scores in a given search.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Introns: Noncoding DNA sequences that interrupt the sequences containing
instructions for making a protein (exons). Introns are not represented in messenger
RNA; only the exons are translated into protein. The function of introns is still being studied.
Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.
Sensitivity: The ability to detect ‘true positives’, i.e. correct matches. The most
sensitive search finds all true matches, but might have lots of ‘false positives’, i.e.
erroneous matches detected. Sensitivity can be defined as the probability of finding
the matches such that the query and the matched database sequences have at least x%
similarity.
Similarity: The extent to which nucleotide or protein sequences are related. The
extent of similarity between two sequences can be based on percent sequence identity
and/or conservation. In BLAST similarity refers to a positive matrix score.
Specificity: The ability to reject ‘false positives’. The most specific search will return
only true matches, but might have lots of ‘false negatives’, i.e. missed correct matches.