Introduction to NCBI Resources

7/29/2019 Introduction to NCBI Resources


A Basic Introduction to the Science Underlying NCBI Resources

BIOINFORMATICS

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.

The completion of a "working draft" of the human genome--an important milestone in the Human Genome Project--was announced in June 2000 at a press conference at the White House and was published in the February 15, 2001 issue of the journal Nature.

What Is a Biological Database?

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:

easy access to the information

a method for extracting only that information needed to answer a specific biological question
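As a rough illustration of these ideas, the sketch below models a "simple database" as a list of records that all carry the same fields, plus one small query function. The field names, the records, and the `find_by_organism` helper are invented for this example; they do not reflect the layout of any actual NCBI database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record; every record carries the same set of fields."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list = field(default_factory=list)


# Hypothetical records, invented purely for illustration.
RECORDS = [
    SequenceRecord("A. Researcher", "ATGGCCATTG", "DNA", "Homo sapiens",
                   ["Example citation"]),
    SequenceRecord("B. Scientist", "AUGGCUUAA", "mRNA", "Mus musculus"),
    SequenceRecord("C. Student", "ATGAAACCC", "DNA", "Homo sapiens"),
]


def find_by_organism(records, organism):
    """Extract only the records whose source organism matches."""
    return [r for r in records if r.organism == organism]


human_records = find_by_organism(RECORDS, "Homo sapiens")
print([r.contact_name for r in human_records])
```

The query function stands in for the second requirement: it returns only the records relevant to one question instead of the whole file.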

At NCBI, many of our databases are linked through a unique search and retrieval system, called Entrez. Entrez (pronounced ahn' tray) allows a user not only to access and retrieve specific information from a single database but also to access integrated information from many NCBI databases. For example, the Entrez Protein database is cross-linked to the Entrez Taxonomy database. This allows a researcher to find taxonomic information (taxonomy is a division of the natural sciences that deals with the classification of animals and plants) for the species from which a protein sequence was derived.

The data in GenBank are made available in a variety of ways, each tailored to a particular use, such as data submission or sequence searching.
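The cross-link from a protein record to its taxonomy record can also be followed programmatically. The sketch below only builds a request URL for ELink, part of NCBI's E-utilities interface to Entrez. E-utilities is not discussed in this text, so treat it as an assumption about how one might script the lookup. The identifier is a placeholder, not a real record, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption of this sketch;
# the surrounding text only describes the Entrez web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_elink_url(protein_id):
    """Build an ELink URL asking for Taxonomy records linked to a Protein record."""
    params = {"dbfrom": "protein", "db": "taxonomy", "id": protein_id}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# "PLACEHOLDER_ID" is hypothetical; substitute a real protein identifier.
url = build_elink_url("PLACEHOLDER_ID")
print(url)
```

Fetching the URL would return the linked taxonomy identifiers, which is the scripted equivalent of following the cross-link in the web interface.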

What Is Bioinformatics?

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data.

Biology in the 21st century is being transformed from a purely lab-based science to an information science as well.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

the development and implementation of tools that enable efficient access to, and use and management of, various types of information

the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences

Why Is Bioinformatics So Important?

The rationale for applying computational approaches to facilitate the understanding of various biological processes includes:

a more global perspective in experimental design

the ability to capitalize on the emerging technology of database-mining, the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms

Although a human disease may not be found in exactly the same form in animals, there may be sufficient data for an animal model that allow researchers to make inferences about the process in humans.

Evolutionary Biology

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.

Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show a significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time of divergence between two organisms since they last shared a common ancestor.

NCBI's COGs database has been designed to simplify evolutionary studies of complete genomes and to improve functional assignment of individual proteins.

Phylogenetics is the field of biology that deals with identifying and understanding the relationships between the different kinds of life on earth.
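One simple way to put a number on "similar sequences" is percent identity between two sequences that have already been aligned. The sketch below is a minimal illustration of that measure. It is not the scoring method used by any particular NCBI tool, and the short sequences are invented. Gap characters (`-`) are skipped, which is one of several possible conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap positions where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # ignore positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented toy sequences: the second differs from the first at two positions.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 6 of 8 positions match -> 75.0
```

Under the observation in the text, a pair of closely related organisms would be expected to score higher on a measure like this than a distantly related pair.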

Protein Modeling

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. Because the different genome projects are producing more sequences and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.

The Four Steps of Protein Modeling

Identify the proteins with known three-dimensional structures that are related to the target sequence

Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates

Construct a model for the target sequence based on its alignment with the template structure(s)

Evaluate the model against a variety of criteria to determine if it is satisfactory

Genome Mapping

Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to localize a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-quality, genome-wide maps are available to the scientific community for use in their research.

Computerized maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell, scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, to determine a gene's precise location. In light of these advances, a researcher's burden has shifted from mapping a genome or genomic region of interest to navigating a vast number of Web sites and databases.

Map Viewer: A Tool for Visualizing Whole Genomes or Single Chromosomes

NCBI's Map Viewer is a tool that allows a user to view an organism's complete genome, integrated maps for each chromosome (when available), and/or sequence data for a genomic region of interest. When using Map Viewer, a researcher has the option of selecting either a "Whole-Genome View" or a "Chromosome or Map View". The Genome View displays a schematic for all of an organism's chromosomes, whereas the Map View shows one or more detailed maps for a single chromosome. If more than one map exists for a chromosome, Map Viewer allows a display of these maps simultaneously.

Using Map Viewer, researchers can find answers to questions such as:

Where does a particular gene exist within an organism's genome?

Which genes are located on a particular chromosome and in what order?

What is the corresponding sequence data for a gene that exists in a particular chromosomal region?

What is the distance between two genes?

The rapidly emerging field of bioinformatics promises to lead to advances in understanding basic biological processes and, in turn, advances in the diagnosis, treatment, and prevention of many genetic diseases. Bioinformatics has transformed the discipline of biology from a purely lab-based science to an information science as well. Increasingly, biological studies begin with a scientist conducting vast numbers of database and Web site searches to formulate specific hypotheses or to design large-scale experiments. The implications behind this change, for both science and medicine, are staggering.

GENOME MAPPING: A GUIDE TO THE GENETIC HIGHWAY WE CALL THE HUMAN GENOME

Imagine you're in a car driving down the highway to visit an old friend who has just moved to Los Angeles. Your favorite tunes are playing on the radio, and you haven't a care in the world. You stop to check your maps and realize that all you have are interstate highway maps, not a single street map of the area. How will you ever find your friend's house? It's going to be difficult, but eventually, you may stumble across the right house.

This scenario is similar to the situation facing scientists searching for a specific gene somewhere within the vast human genome. They have available to them two broad categories of maps: genetic maps and physical maps. Both genetic and physical maps provide the likely order of items along a chromosome. However, a genetic map, like an interstate highway map, provides an indirect estimate of the distance between two items and is limited to ordering certain items. One could say that genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. On the other hand, physical maps mark an estimate of the true distance, in measurements called base pairs, between items of interest. To continue our analogy, physical maps would then be similar to street maps, where the distance between two sites of interest may be defined more precisely in terms of city blocks or street addresses. Physical maps, therefore, allow a scientist to more easily home in on the location of a gene. An appreciation of how each of these maps is constructed may be helpful in understanding how scientists use these maps to traverse that genetic highway commonly referred to as the "human genome".

Genetic maps serve to guide a scientist toward a gene, just like an interstate map guides a driver from city to city. Physical maps are more similar to street maps and allow a scientist to more easily home in on a gene's location.

PART I: GENETIC MAPS

Types of Landmarks Found on a Genetic Map

Just like interstate maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as one landmark on a map if, in most cases, that stretch of DNA is inherited from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic such as eye color, or a not so noticeable trait such as a disease. DNA-based reagents can also serve as markers. These types of markers are found within the non-coding regions of genes and are used to detect unique regions on a chromosome. DNA markers are especially useful for generating genetic maps when there are occasional, predictable mutations that occur during meiosis (the formation of gametes such as egg and sperm) that, over many generations, lead to a high degree of variability in the DNA content of the marker from individual to individual.

Genetic maps use landmarks called genetic markers to guide researchers on their gene hunt.

Commonly Used DNA Markers

RFLPs, or restriction fragment length polymorphisms, were among the first developed DNA markers. RFLPs are defined by the presence or absence of a specific site, called a restriction site, for a bacterial restriction enzyme. This enzyme breaks apart strands of DNA wherever they contain a certain nucleotide sequence.

VNTRs, or variable number of tandem repeat polymorphisms, occur in non-coding regions of DNA. This type of marker is defined by the presence of a nucleotide sequence that is repeated several times. In each case, the number of times a sequence is repeated may vary.

Microsatellite polymorphisms are defined by a variable number of repetitions of a very small number of base pairs within a sequence. Oftentimes, these repeats consist of the nucleotides, or bases, cytosine and adenine. The number of repeats for a given microsatellite may differ between individuals, hence the term polymorphism--the existence of different forms within a population.

SNPs, or single nucleotide polymorphisms, are individual point mutations, or substitutions of a single nucleotide, that do not change the overall length of the DNA sequence in that region. SNPs occur throughout an individual's genome.
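The marker types above can each be reduced to a simple question about a DNA string. The sketch below is a toy illustration under stated assumptions: the sequences are invented, the helper names are hypothetical, and `GAATTC` (the recognition site of the restriction enzyme EcoRI) is used only as an example of a restriction site. Only one strand is examined.

```python
import re


def has_restriction_site(sequence, site):
    """RFLP-style check: is the restriction site present in this allele?"""
    return site in sequence


def longest_repeat_run(sequence, unit):
    """Microsatellite/VNTR-style check: largest number of back-to-back copies of `unit`."""
    pattern = re.compile("(?:" + re.escape(unit) + ")+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def snp_positions(seq_a, seq_b):
    """SNP-style check: positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Invented alleles: a single substitution destroys the GAATTC site.
allele_1 = "TTGAATTCAA"
allele_2 = "TTGAGTTCAA"
print(has_restriction_site(allele_1, "GAATTC"), has_restriction_site(allele_2, "GAATTC"))
print(snp_positions(allele_1, allele_2))

# Invented microsatellite alleles with 5 and 8 copies of the CA repeat.
person_1 = "GGT" + "CA" * 5 + "TTG"
person_2 = "GGT" + "CA" * 8 + "TTG"
print(longest_repeat_run(person_1, "CA"), longest_repeat_run(person_2, "CA"))
```

In the first pair, the same single-base change is both a SNP and the cause of an RFLP, since it removes the site the enzyme would cut.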



From Linkage Analysis to Genetic Mapping

Early geneticists recognized that genes are located on chromosomes and believed that each individual chromosome was inherited as an intact unit. They hypothesized that if two genes were located on the same chromosome, they were physically linked together and were inherited together. We now know that this is not always the case. Studies conducted around 1910 demonstrated that very few pairs of genes displayed complete linkage. Pairs of genes were either inherited independently or displayed partial linkage; that is, they were inherited together sometimes, but not always.

During meiosis, the process whereby gametes (eggs and sperm) are produced, two copies of each chromosome pair become physically close. The chromosome arms can then undergo breakage and exchange segments of DNA, a process referred to as recombination or crossing-over. If recombination occurs, each chromosome found in the gamete will consist of a "mixture" of material from both members of the chromosome pair. Thus, recombination events directly affect the inheritance pattern of those genes involved.

Because one cannot physically see crossover events, it is difficult to determine with any degree of certainty how many crossovers have actually occurred. But, using the phenomenon of co-segregation of alleles of nearby markers, researchers can reverse-engineer meiosis and identify markers that lie close to each other. Then, using a statistical technique called genetic linkage analysis, researchers can infer a likely crossover pattern, and from that an order of the markers involved. Researchers can also infer an estimate for the probability that a recombination occurs between each pair of markers.

It is the behavior of chromosomes during meiosis that determines whether two genes will remain linked.

An allele is one of the variant forms of a DNA sequence at a particular locus, or location, on a chromosome.

Co-segregation of alleles refers to the movement of each marker during meiosis. If a marker tends to "travel" with the disease status, the markers are said to co-segregate.

If recombination occurs as a random event, then two markers that are close together should be separated less frequently than two markers that are more distant from one another. The recombination probability between two markers, which can range from 0 to 0.5, increases monotonically as the distance between the two markers increases along a chromosome. Therefore, the recombination probability may be used as a surrogate for ordering genetic markers along a chromosome. If you then determine the recombination frequencies for different pairs of markers, you can construct a map of their relative positions on the chromosome.

Monotonic functions are functions that tend to move only in one direction.
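The ordering argument can be sketched for the smallest interesting case, three markers. Because recombination probability rises with distance, the pair with the largest recombination fraction is taken to be the two outer markers, and the remaining marker lies between them. The counts below are invented, the function names are hypothetical, and real linkage analysis uses statistical methods well beyond this toy calculation.

```python
def recombination_fraction(recombinant, total):
    """Fraction of observed offspring in which the two markers were separated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinant / total


def order_three_markers(fractions):
    """Order three markers given pairwise recombination fractions.

    `fractions` maps frozenset({m1, m2}) to the recombination fraction for that pair.
    The pair with the largest fraction is assumed to be the outer pair.
    """
    outer = max(fractions, key=fractions.get)
    markers = set().union(*fractions)
    middle = (markers - outer).pop()
    left, right = sorted(outer)
    return [left, middle, right]


# Hypothetical counts: recombinants observed out of 200 offspring per marker pair.
fractions = {
    frozenset({"A", "B"}): recombination_fraction(10, 200),  # 0.05
    frozenset({"B", "C"}): recombination_fraction(24, 200),  # 0.12
    frozenset({"A", "C"}): recombination_fraction(32, 200),  # 0.16
}
print(order_three_markers(fractions))  # ['A', 'B', 'C']
```

Note that 0.05 + 0.12 is slightly more than 0.16: pairwise fractions are not strictly additive, which is one reason they serve as a surrogate for order rather than as an exact distance.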

But alas, predicting recombination is not so simple. Although crossovers are random, they are not uniformly distributed across the genome or any chromosome. Some chromosomal regions, called recombination hotspots, are more likely to be involved in crossovers than other regions of a chromosome. This means that genetic map distance does not always indicate physical distance between markers. Despite these qualifications, linkage analysis usually correctly deduces marker order, and distance estimates are sufficient to generate genetic maps that can serve as a valuable framework for genome sequencing.

Linkage maps can tell you where markers are in relation to each other on the chromosome, but the actual "mileage" between those markers may not be so well defined.

Linkage Studies in Patient Populations: Genetic Maps and Gene Hunting



In humans, data for calculating recombination frequencies are obtained by examining the genetic makeup of the members of successive generations of existing families, termed human pedigree analysis. Linkage studies begin by obtaining blood samples from a group of related individuals. For relatively rare diseases, scientists find a few large families that have many cases of the disease and obtain samples from as many family members as possible. For more common diseases where the pattern of disease inheritance is unclear, scientists will identify a large number of affected families and will take samples from four to thirty close relatives. DNA is then harvested from all of the blood samples and screened for the presence, or co-inheritance, of two markers. One marker is usually the gene of interest, generally associated with a physically identifiable characteristic. The other is usually one of the various detectable rearrangements mentioned earlier, such as a microsatellite. A computerized analysis is then performed to determine whether the two markers are linked and approximately how far apart those markers are from one another. In this case, the value of the genetic map is that an inherited disease can be located on the map by following the inheritance of a DNA marker present in affected individuals but absent in unaffected individuals, although the molecular basis of the disease may not yet be understood, nor the gene(s) responsible identified.

In humans, genetic diseases are frequently used as gene markers, with the disease state being one allele and the healthy state the second allele.

Genetic Maps as a Framework for Physical Map Construction

Genetic maps are also used to generate the essential backbone, or scaffold, needed for the creation of more detailed human genome maps. These detailed maps, called physical maps, further define the DNA sequence between genetic markers and are essential to the rapid identification of genes.

PART II: PHYSICAL MAPS

Types of Physical Maps and What They Measure

Physical maps can be divided into three general types: chromosomal or cytogenetic maps, radiation hybrid (RH) maps, and sequence maps. The different types of maps vary in their degree of resolution, that is, the ability to measure the separation of elements that are close together. The higher the resolution, the better the picture.

The lowest-resolution physical map is the chromosomal or cytogenetic map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. As with genetic linkage mapping, chromosomal mapping can be used to locate genetic markers defined by traits observable only in whole organisms. Because chromosomal maps are based on estimates of physical distance, they are considered to be physical maps. Yet, the number of base pairs within a band can only be estimated.

RH maps and sequence maps, on the other hand, are more detailed. RH maps are similar to linkage maps in that they show estimates of distance between genetic and physical markers, but that is where the similarity ends. RH maps are able to provide more precise information regarding the distance between markers than can a linkage map.

The physical map that provides the most detail is the sequence map. Sequence maps show genetic markers, as well as the sequence between the markers, measured in base pairs.

How Are Physical Maps Made and Used?

RH Mapping

RH mapping, like linkage mapping, shows an estimated distance between genetic markers. But, rather than relying on natural recombination to separate two markers, scientists use breaks induced by radiation to determine the distance between two markers. In RH mapping, a scientist exposes DNA to measured doses of radiation, and in doing so, controls the average distance between breaks in a chromosome. By varying the degree of radiation exposure to the DNA, a scientist can induce breaks between two markers that are very close together. The ability to separate closely linked markers allows scientists to produce more detailed maps. RH mapping provides a way to localize almost any genetic marker, as well as other genomic fragments, to a defined map position, and RH maps are extremely useful for ordering markers in regions where highly polymorphic genetic markers are scarce.

Polymorphic refers to the existence of two or more forms of the same gene, or genetic marker, with each form being too common in a population to be merely attributable to a new mutation. Polymorphism is a useful genetic marker because it enables researchers to sometimes distinguish which allele was inherited.

Scientists also use RH maps as a bridge between linkage maps and sequence maps. In doing so, they have been able to more easily identify the location(s) of genes involved in diseases such as spinal muscular atrophy and hyperekplexia, more commonly known as "startle disease".

Sequence Mapping

Sequence tagged site (STS) mapping is another physical mapping technique. An STS is a short DNA sequence that has been shown to be unique. To qualify as an STS, the exact location and order of the bases of the sequence must be known, and this sequence may occur only once in the chromosome being studied or in the genome as a whole if the DNA fragment set covers the entire genome.
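The uniqueness requirement can be illustrated with a short check that counts how often a candidate sequence occurs in a reference string. This is a sketch under simplifying assumptions: the "genome" is a short invented string, only the given strand is searched, and the function names are hypothetical. A real uniqueness test would work on full-length sequence and both strands.

```python
def count_occurrences(genome, candidate):
    """Count occurrences of `candidate` in `genome`, including overlapping ones."""
    if not candidate:
        raise ValueError("candidate sequence must not be empty")
    count = 0
    start = genome.find(candidate)
    while start != -1:
        count += 1
        start = genome.find(candidate, start + 1)
    return count


def could_be_sts(genome, candidate):
    """A candidate qualifies only if it occurs exactly once."""
    return count_occurrences(genome, candidate) == 1


# Invented toy sequence standing in for a chromosome.
toy_genome = "ACGTTGCAACGTAGGCTA"
print(could_be_sts(toy_genome, "TTGCA"))  # occurs once
print(could_be_sts(toy_genome, "ACGT"))   # occurs twice, so it cannot tag a single site
```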

Common Sources of STSs

Expressed sequence tags (ESTs) are short sequences obtained by analysis of complementary DNA (cDNA) clones. Complementary DNA is prepared by converting mRNA into double-stranded DNA and is thought to represent the sequences of the genes being expressed. An EST can be used as an STS if it comes from a unique gene and not from a member of a gene family in which all of the genes have the same, or similar, sequences.

Simple sequence length polymorphisms (SSLPs) are arrays of repeat sequences that display length variations. SSLPs that are polymorphic and have already been mapped by linkage analysis are particularly valuable because they provide a connection between genetic and physical maps.

Random genomic sequences are obtained by sequencing random pieces of cloned genomic DNA or by examining sequences already deposited in a database.

To map a set of STSs, a collection of overlapping DNA fragments from a chromosome is digested into smaller fragments using restriction enzymes, agents that cut up DNA molecules at defined target points. The data from which the map will be derived are then obtained by noting which fragments contain which STSs. To accomplish this, scientists copy the DNA fragments using a process known as "molecular cloning". Cloning involves the use of a special technology, called recombinant DNA technology, to copy DNA fragments inside a foreign host.

First, the fragments are united with a carrier, also called a vector. After introduction into a suitable host, the DNA fragments can then be reproduced along with the host cell DNA, providing unlimited material for experimental study. An unordered set of cloned DNA fragments is called a library.

Next, the clones, or copies, are assembled in the order they would be found in the original chromosome by determining which clones contain overlapping DNA fragments. This assembly of overlapping clones is called a clone contig. Once the order of the clones in a chromosome is known, the clones are placed in frozen storage, and the information about the order of the clones is stored in a computer, providing a valuable resource that may be used for further studies. These data are then used as the base material for generating a lengthy, continuous DNA sequence, and the STSs serve to anchor the sequence onto a physical map.
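The step of "noting which fragments contain which STSs" and then ordering the clones can be sketched as a small puzzle: find an STS order in which every clone's STSs sit next to one another, then sort the clones along that order. The brute-force search below is a toy illustration, not the method used by any mapping project. The clone and STS names are invented, and trying every permutation is only practical for a handful of STSs.

```python
from itertools import permutations


def is_consecutive(order, sts_set):
    """True if all STSs of one clone occupy adjacent positions in `order`."""
    positions = sorted(order.index(s) for s in sts_set)
    return positions[-1] - positions[0] + 1 == len(positions)


def consistent_sts_orders(clones):
    """Return every STS order in which each clone covers one unbroken stretch.

    `clones` maps a clone name to the non-empty set of STSs detected in it.
    Mirror-image orders are reported once.
    """
    all_sts = sorted(set().union(*clones.values()))
    orders = []
    for perm in permutations(all_sts):
        if perm[0] > perm[-1]:
            continue  # skip the reversed copy of an order already considered
        if all(is_consecutive(perm, sts) for sts in clones.values()):
            orders.append(list(perm))
    return orders


def order_clones(clones, sts_order):
    """Sort clones by the position of their leftmost STS, giving a contig order."""
    return sorted(clones, key=lambda name: min(sts_order.index(s) for s in clones[name]))


# Hypothetical STS content of three overlapping clones, listed in no particular order.
clones = {
    "cloneB": {"sts2", "sts3"},
    "cloneC": {"sts3", "sts4"},
    "cloneA": {"sts1", "sts2"},
}
orders = consistent_sts_orders(clones)
print(orders)
print(order_clones(clones, orders[0]))
```

Here the shared STSs (sts2 and sts3) are what reveal that the clones overlap. A missing or misassigned STS would leave no consistent order, or several.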

The Need to Integrate Physical and Genetic Maps

As with most complex techniques, STS-based mapping has its limitations. In addition to gaps in clone coverage, DNA fragments may become lost or mistakenly mapped to a wrong position. These errors may occur for a variety of reasons. A DNA fragment may break, resulting in an STS that maps to a different position. DNA fragments may also get deleted from a clone during the replication process, resulting in the absence of an STS that should be present. Sometimes a clone composed of DNA fragments from two distinct genomic regions is replicated, leading to DNA segments that are widely separated in the genome being mistakenly mapped to adjacent positions. Lastly, a DNA fragment may become contaminated with host genetic material, once again leading to an STS that will map to the wrong location. To help overcome these problems, as well as to improve overall mapping accuracy, researchers have begun comparing and integrating STS-based physical maps with genetic, RH, and cytogenetic maps. Cross-referencing different genomic maps enhances the utility of a given map, confirms STS order, and helps order and orient evolving contigs.

NCBI and Map Integration

Comparing the many available genetic and physical maps can be a time-consuming step, especially when trying to pinpoint the location of a new gene. Without the use of computers and special software designed to align the various maps, matching a sequence to a region of a chromosome that corresponds to the gene location would be very difficult. It would be like trying to compare 20 different interstate and street maps to get from a house in Ukiah, California to a house in Beaver Dam, Wisconsin. You could compare the maps yourself and create your own travel itinerary, but it would probably take a long time. Wouldn't it be easier and faster to have the automobile club create an integrated map for you? That is the goal behind NCBI's Human Genome Map Viewer.

NCBI's Map Viewer: A Tool for Integrating Genetic and Physical Maps

The NCBI Map Viewer provides a graphical display of the available human genome sequence data as well as sequence, cytogenetic, genetic linkage, and RH maps. Map Viewer can simultaneously display up to seven maps, selected from a large set of maps, and allows the user access to detailed information for a selected map region. Map Viewer uses a common sequence numbering system to align sequence maps and shared markers as well as gene names to align other maps. You can use NCBI's Map Viewer to search for a gene in a number of genomes, by choosing an organism from the Map Viewer home page.

Map Viewer Getting Started

Need help using the NCBI Map Viewer? Try GETTING STARTED, a quick "how-to guide" on NCBI data mining tools designed for the novice user. GETTING STARTED using Map Viewer provides:

information on how you can use Map Viewer

descriptions of the Map Viewer layout

step-by-step information on using the Map Viewer

shortcuts for getting to where you need to go

MOLECULAR MODELING: A METHOD FOR UNRAVELING PROTEIN STRUCTURE AND FUNCTION

Proteins are fundamental components of all living cells. They exhibit an enormous amount of chemical and structural diversity, enabling them to carry out an extraordinarily diverse range of biological functions. Proteins help us digest our food, fight infections, control body chemistry, and in general, keep our bodies functioning smoothly. Scientists know that the critical feature of a protein is its ability to adopt the right shape for carrying out a particular function. But sometimes a protein twists into the wrong shape or has a missing part, preventing it from doing its job. Many diseases, such as Alzheimer's and "mad cow", are now known to result from proteins that have adopted an incorrect structure.

Identifying a protein's shape, or structure, is key to understanding its biological function and its role in health and disease. Illuminating a protein's structure also paves the way for the development of new agents and devices to treat a disease. Yet solving the structure of a protein is no easy feat. It often takes scientists working in the laboratory months, sometimes years, to experimentally determine a single structure. Therefore, scientists have begun to turn toward computers to help predict the structure of a protein based on its sequence. The challenge lies in developing methods for accurately and reliably understanding this intricate relationship.

Proteins form our bodies and help direct their many systems.

Levels of Protein Structure

To produce proteins, cellular structures called ribosomes join together long chains of subunits. A set of 20 different subunits, called amino acids, can be arranged in any order to form a polypeptide that can be thousands of amino acids long. These chains can then loop about each other, or fold, in a variety of ways, but only one of these ways allows a protein to function properly. The critical feature of a protein is its ability to fold into a conformation that creates structural features, such as surface grooves, ridges, and pockets, which allow it to fulfill its role in a cell. A protein's conformation is usually described in terms of levels of structure. Traditionally, proteins are looked upon as having four distinct levels of structure, with each level of structure dependent on the one below it. In some proteins, functional diversity may be further amplified by the addition of new chemical groups after synthesis is complete.

The stringing together of the amino acid chain to form a polypeptide is referred to as the primary structure. The secondary structure is generated by the folding of the primary sequence and refers to the path that the polypeptide backbone of the protein follows in space. Certain types of secondary structures are relatively common. Two well-described secondary structures are the alpha helix and the beta sheet. In the first case, certain types of bonding between groups located on the same polypeptide chain cause the backbone to twist into a helix, most often in a form known as the alpha helix. Beta sheets are formed when a polypeptide chain bonds with another chain that is running in the opposite direction. Beta sheets may also be formed between two sections of a single polypeptide chain that is arranged such that adjacent regions are in reverse orientation.

The tertiary structure describes the organization in three dimensions of all of the atoms in the polypeptide. If a protein consists of only one polypeptide chain, this level then describes the complete structure.

Multimeric proteins, or proteins that consist of more than one polypeptide chain, require a higher level of organization. The quaternary structure defines the conformation assumed by a multimeric protein. In this case, the individual polypeptide chains that make up a multimeric protein are often referred to as the protein subunits. The four levels of protein structure are hierarchical; that is, each level of the build process is dependent upon the one below it.

Proteins function through their conformation.

Proteins can be divided into two general classes based on their tertiary structure.

Fibrous proteins have elongated structures, with the polypeptide chains arranged in long strands. This class of proteins serves as major structural components of cells, and therefore their role tends to be static in providing a structural framework.

Globular proteins have more compact, often irregular structures. This class of proteins includes most enzymes and most of the proteins involved in gene expression and regulation.



How Do Proteins Acquire Their Correct Conformations?

A protein's primary amino acid sequence is crucial in determining its final structure. In some cases, amino acid sequence is the sole determinant, whereas in other cases, additional interactions may be required before a protein can attain its final conformation. For example, some proteins require the presence of a cofactor, or a second molecule that is part of the active protein, before they can attain their final conformation. Multimeric proteins often require one or more subunits to be present for another subunit to adopt the proper higher order structure. In any case, as we stated earlier, the entire process is cooperative; that is, the formation of one region of secondary structure determines the formation of the next region.

Allosteric Proteins

Under certain conditions, a protein may have a stable alternate conformation, or shape, that enables it to carry out a different biological function. Proteins that exhibit this characteristic are called allosteric. The interaction of an allosteric protein with a specific cofactor, or with another protein, may influence the transition of the protein between shapes. In addition, any change in conformation brought about by an interaction at one site may lead to an alteration in the structure, and thus function, at another site. One should bear in mind, though, that this type of transition affects only the protein's shape, not the primary amino acid sequence. Allosteric proteins play an important role in both metabolic and genetic regulation.

Allosteric proteins can change their shape and function depending on the environmental conditions in which they are found.

Determining Protein Structure

Traditionally, a protein's structure was determined using one of two techniques: X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

X-ray Crystallography

Crystals are a solid form of a substance in which the component molecules are present in an ordered array called a lattice. The basic building block of a crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's components, the smallest possible set that is fully representative of the crystal. Crystals of a complex molecule, like a protein, produce a complex pattern of X-ray diffraction, or scattering of X-rays. When the crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam; therefore, many molecules are in the same orientation with respect to the incoming X-rays. The X-ray beam enters the crystal and a number of smaller beams emerge, each one in a different direction and each one with a different intensity. If an X-ray detector, such as a piece of film, is placed on the opposite side of the crystal from the X-ray source, each diffracted ray, called a reflection, will produce a spot on the film. However, because only a few reflections can be detected with any one orientation of the crystal, an important component of any X-ray diffraction instrument is a device for accurately setting and changing the orientation of the crystal. The set of diffracted, emerging beams contains information about the underlying crystal structure.

If we could use light instead of X-rays, we could set up a system of lenses to recombine the beams emerging from the crystal and thus bring into focus an enlarged image of the unit cell and the molecules therein. But the molecules do not diffract visible light, and X-rays, unlike light, cannot be focused with lenses. However, the scientific laws that lenses obey are well understood, and it is possible to calculate the molecular image with a computer. In effect, the computer mimics the action of a lens.

The major drawback associated with this technique is that crystallization of the proteins is a difficult task. Crystals are formed by slowly precipitating proteins under conditions that maintain their native conformation or structure. These exact conditions can only be discovered by repeated trials that entail varying certain experimental conditions, one at a time. This is a very time-consuming and tedious process. In some cases, the task of crystallizing a protein borders on the impossible.

When performing this technique, the molecule under study must first be crystallized, and the crystals must be singular and of perfect quality, a time-consuming and difficult task.

Nuclear Magnetic Resonance (NMR) Spectroscopy



The basic phenomenon of NMR spectroscopy was discovered in 1945. In this technique, a sample is immersed in a magnetic field and bombarded with radio waves. These radio waves encourage the nuclei of the molecule to resonate, or spin. As the positively charged nucleus spins, the moving charge creates what is called a magnetic moment. The thermal motion of the molecule (the movement of the molecule associated with the temperature of the material) further creates a torque, or twisting force, that makes the magnetic moment "wobble" like a child's top. When the radio waves hit the spinning nuclei, they tilt even more, sometimes flipping over. These resonating nuclei emit a unique signal that is then picked up on a special radio receiver and translated using a decoder. This decoder is called the Fourier Transform algorithm, a complex equation that translates the language of the nuclei into something a scientist can understand. By measuring the frequencies at which different nuclei flip, scientists can determine molecular structure, as well as many other interesting properties of the molecule.
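To make the "decoder" idea concrete, the following is a minimal sketch of a discrete Fourier transform applied to a made-up signal. It is only an illustration: the two frequencies (50 and 120 units), the decay constant, and the function names `dft_magnitudes` and `peak_frequencies` are assumptions chosen for this example, and real NMR processing involves many additional steps.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Naive O(n^2) discrete Fourier transform; returns the magnitude of each bin."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def peak_frequencies(signal, sample_rate, count):
    """Return the `count` strongest local maxima of the spectrum, as frequencies."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    # Only the positive-frequency half is inspected; a peak must exceed both neighbors.
    peaks = [k for k in range(1, n // 2) if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]
    strongest = sorted(peaks, key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in strongest)


# Hypothetical signal: two decaying oscillations, loosely resembling what a
# receiver might record from two kinds of nuclei resonating at different rates.
sample_rate = 1000  # samples per unit time (assumed)
signal = [
    math.exp(-t / 150)
    * (math.cos(2 * math.pi * 50 * t / sample_rate)
       + 0.5 * math.cos(2 * math.pi * 120 * t / sample_rate))
    for t in range(200)
]

print(peak_frequencies(signal, sample_rate, 2))  # [50.0, 120.0]
```

The time-domain signal is a jumble of overlapping oscillations, but the transform separates it into the individual frequencies that were mixed together, which is the sense in which it "translates" the recorded signal.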

In the past 10 years, NMR has proven to be a powerful alternative to X-ray crystallography for the determination of molecular structure. NMR has the advantage over crystallographic techniques in that experiments are performed in solution as opposed to a crystal lattice. However, the principles that make NMR possible tend to make this technique very time-consuming and limit the application to small- and medium-sized molecules.

Solution NMR is performed on a solution of macromolecules while the molecules tumble and vibrate with thermal motion.

The Advent of Computational Modeling

Researchers have been working for decades to develop procedures for predicting protein structure that are not so time-consuming and that are not hindered by size and solubility constraints. To do this, researchers have turned to computers for help in predicting protein structure from gene sequences, a concept called homology modeling. The complete genomes of various organisms, including humans, have now been decoded and allow researchers to approach this goal in a logical and organized fashion.

Before we go any further, it is important to define some common terminology used in this field.

Common Terminology Used in Homology Modeling

Folding motifs are independent folding units, or particular structures, that recur in many molecules.

Domains are the building blocks of a protein and are considered elementary units of molecular function.

Families are groups of proteins that demonstrate sequence homology or have similar sequences.

Superfamilies consist of proteins that have similar folding motifs but do not exhibit sequence similarity.

Some Basic Theory

It is theorized that proteins that share a similar sequence generally share the same basic structure. Therefore, by experimentally determining the structure for one member of a protein family, called a target, researchers have a model on which to base the structure of other proteins within that family. Moving a step further, by selecting a target from each superfamily, researchers can study the universe of protein folds in a systematic fashion and outline a set of sequences associated with each folding motif. Many of these sequences may not demonstrate a resemblance to one another, but their identification and assignment to a particular fold is essential for predicting future protein structures using homology modeling.

The scientific basis for these theories is that a strong conservation of protein three-dimensional shape across large evolutionary distances (within single species, between species, and in spite of sequence variation) has been demonstrated again and again. Although most scientists choose high-priority structures as their targets, this theory provides the option to choose any one of the proteins within a family as the target, rather than trying to achieve experimental results using a protein that is particularly difficult to work with using crystallographic or NMR techniques.

A computer-generated image of a protein's structure shows the relative locations of most, if not all, of the protein's thousands of atoms. The image also reveals the physical, chemical, and electrical properties of the protein and provides clues about its role in the body.



Steps for Maximizing Results

Specific tasks must be carried out to maximize results when determining protein structure using homology modeling.

First, protein sequences must be organized in terms of families, preferably in a searchable database, and a target must be selected. Protein families can be identified and organized by comparing protein sequences derived from completely sequenced genomes. Targets may be selected for families that do not exhibit apparent sequence homology to proteins with a known three-dimensional structure.

Next, researchers must generate a purified protein for analysis of the chosen target and then experimentally determine the target's structure by X-ray crystallography, NMR, or both. Target structures determined experimentally may then be further analyzed to evaluate their similarity to other known protein structures and to determine possible evolutionary relationships that are not identifiable from protein sequence alone. The target structure will also serve as a detailed model for determining the structure of other proteins within that family. In favorable cases, just knowing the structure of a particular protein may also provide considerable insight into its possible function.

PDB: The Protein Data Bank

The PDB was the first "bioinformatics" database ever built and is designed to store complex three-dimensional data. The PDB was originally developed and housed at the Brookhaven National Laboratory but is now managed and maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). The PDB is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR.

PDB is supported by funds from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.

Protein Modeling at NCBI

The Molecular Modeling Database

NCBI's Molecular Modeling DataBase (MMDB), an integral part of our Entrez information retrieval system, is a compilation of all of the PDB three-dimensional structures of biomolecules. The difference between the two databases is that the MMDB records reorganize and validate the information stored in the database in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

MMDB records have value-added information compared to the original PDB entries, including explicit chemical graph information, uniformly derived secondary structure definitions, structure domain information, literature citation matching, and molecule-based assignment of taxonomy to each biologically derived protein or nucleic acid chain.

NCBI has also developed a three-dimensional structure viewer, called Cn3D, for easy interactive visualization of molecular structures from Entrez. Cn3D serves as a visualization tool for sequences and sequence alignments.

What sets Cn3D apart from other software is its ability to correlate structure and sequence information. For example, using Cn3D, a scientist can quickly locate the residues in a crystal structure that correspond to known disease mutations or conserved active site residues from a family of sequence homologs, or sequences that share a common ancestor. Cn3D displays structure-structure alignments along with the corresponding structure-based sequence alignments to emphasize those regions within a group of related proteins that are most conserved in structure and sequence. Cn3D also features custom labeling options, high-quality graphics, and a variety of file exports that together make Cn3D a powerful tool for literature annotation.

PDBeast: Taxonomy in MMDB

Taxonomy is the scientific discipline that seeks to catalog and reconstruct the evolutionary history of life on earth. NCBI's Structure Group, in collaboration with NCBI's taxonomists, has undertaken taxonomy annotation for the structure data stored in MMDB. A semi-automated approach has been implemented in which a human expert checks, corrects, and validates automatic taxonomic assignments.

The PDBeast software tool was developed by NCBI for this purpose. It pulls text descriptions of "Source Organisms" from either the original PDB entries or user-specified information and looks for matches in the NCBI Taxonomy database to record taxonomy assignments.
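The semi-automated idea described above can be sketched in a few lines. This is not PDBeast's actual code; the lookup table, the function name `assign_taxonomy`, and the sample records are assumptions made for illustration. The three taxonomy identifiers shown are the NCBI Taxonomy IDs for those species.

```python
# Tiny stand-in for the NCBI Taxonomy database (name or synonym -> taxonomy ID).
TAXONOMY = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "escherichia coli": 562,
}


def assign_taxonomy(source_descriptions):
    """Split records into automatic assignments and ones needing expert review."""
    assigned, needs_review = {}, []
    for record_id, text in source_descriptions.items():
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in TAXONOMY:
            assigned[record_id] = TAXONOMY[key]
        else:
            needs_review.append(record_id)  # left for a human expert to resolve
    return assigned, needs_review


# Hypothetical free-text "source organism" fields from structure records.
records = {
    "rec1": "Homo  sapiens",
    "rec2": "ESCHERICHIA COLI",
    "rec3": "synthetic construct, see paper",
}
assigned, needs_review = assign_taxonomy(records)
print(assigned)      # {'rec1': 9606, 'rec2': 562}
print(needs_review)  # ['rec3']
```

The point of the split is the division of labor: unambiguous text is matched automatically, and anything the lookup cannot resolve is queued for the human expert.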

The Role of Taxonomy

Taxonomy provides a vivid picture of the existing organic diversity of the earth.

Taxonomy provides much of the information permitting a reconstruction of the phylogeny of life.

Taxonomy reveals numerous, interesting evolutionary phenomena.

Taxonomy supplies classifications that are of great explanatory value in most branches of biology.

COGs: Phylogenetic Classification of Proteins

The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at the phylogenetic classification of proteins (a scheme that indicates the evolutionary relationships between organisms) from complete genomes. Each COG includes proteins that are thought to be orthologous. Orthologs are genes in different species derived from a common ancestor and carried on through evolution. COGs may be used to detect similarities and differences between species for identifying protein families and predicting new protein functions and to point to potential drug targets in disease-causing species. The database is accompanied by the COGnitor program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. A Web page containing additional structural and functional information is now associated with each COG. These hyperlinked information pages include: systematic classification of the COG members under the different classification systems; indications of which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins constituting the COG and the three-dimensional structure of the domains if known or predictable; a succinct summary of the common structural and functional features of the COG members as well as peculiarities of individual members; and key references.

Detecting New Sequence Similarities: BLAST against MMDB

Comparison, whether of structural features or protein sequences, lies at the heart of biology. The introduction of BLAST, or the Basic Local Alignment Search Tool, in 1990 made it easier to rapidly scan huge databases for overt homologies, or sequence similarities, and to statistically evaluate the resulting matches. BLAST works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches. Sequence similarities found by BLAST have been critical in several gene discoveries. Hundreds of major sequencing centers and research institutions around the country use this software to transmit a query sequence from their local computer to a BLAST server at the NCBI via the Internet. In a matter of seconds, the BLAST server compares the user's sequence with up to a million known sequences and determines the closest matches.
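The idea of scoring a query against every database entry and ranking the matches can be illustrated with a small sketch. Note that this uses an exhaustive Smith-Waterman local alignment score, which is a different and much slower technique than the heuristic BLAST itself uses, and it omits the statistical evaluation of matches. The scoring values, function names, and toy sequences are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every entry; the highest-scoring entries come first."""
    scored = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical three-entry "database" of nucleotide sequences.
database = {
    "seq_exact": "TTGACCGTAAGG",      # contains the query exactly
    "seq_close": "TTGACGGTAAGG",      # same region with one substitution
    "seq_unrelated": "CCCCCCCCCCCC",
}
hits = rank_database("GACCGTAA", database)
for score, name in hits:
    print(name, score)
```

Running every query through a full alignment like this would be far too slow for millions of sequences, which is why fast search tools rely on shortcuts to find promising regions before aligning them.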

The journal article describing the original algorithm used in BLAST has since become one of the most frequently cited papers of the decade, with over 10,000 citations.

Not all significant homologies are readily and easily detected, however. Some of the most interesting are subtle similarities that do not always rise to statistical significance during a standard BLAST search. Therefore, NCBI has extended the statistical methodology used in the original BLAST to address the problem of detecting weak, yet significant, sequence similarities. PSI-BLAST, or Position-Specific Iterated BLAST, searches sequence databases with a profile constructed using BLAST alignments, from which it then constructs what is called a position-specific score matrix. For protein analysis, the new Pattern Hit Initiated BLAST, or PHI-BLAST, serves to complement the profile-based searching that was previously introduced with PSI-BLAST. PHI-BLAST further incorporates hypotheses as to the biological function of a query sequence and restricts the analysis to a set of protein sequences that is already known to contain a specific pattern or motif.
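A position-specific score matrix can be illustrated with a toy calculation. The sketch below is an assumption-laden simplification, not PSI-BLAST's actual procedure: it uses a uniform background frequency, a fixed pseudocount, no sequence weighting, and an ungapped four-column alignment invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(alphabet)
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        column = [seq[pos] for seq in aligned_seqs]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(((column.count(letter) + pseudocount) / total) / background)
            for letter in alphabet
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the column-specific scores of each residue in seq."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


# Hypothetical ungapped alignment of four short, related protein segments.
alignment = ["GKSG", "GKTG", "GKSG", "AKSG"]
pssm = build_pssm(alignment)

print(round(score_sequence(pssm, "GKSG"), 2))  # positive: resembles the family
print(round(score_sequence(pssm, "WWWW"), 2))  # negative: unlike the family
```

Unlike a single fixed substitution table, the score for a given residue depends on the column it falls in, so strongly conserved positions contribute more to a match than variable ones.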

BLAST now comes in several varieties in addition to those described above. Specialized BLASTs are also available for human, microbial, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.

Structure Similarity Searching Using VAST

As just noted, a sequence-sequence similarity program provides an alignment of two sets of sequences. A structure-structure similarity program provides a three-dimensional structure superposition. Structure similarity search services are based on the premise that some measure can be computed between two structures to assess their similarities, much the same way a BLAST alignment is predicted.
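As one example of "some measure" computed between two structures, the sketch below calculates a distance-matrix RMSD: it compares every within-structure point-to-point distance in one structure with the corresponding distance in the other. This is a generic measure chosen for illustration, not necessarily what any NCBI service computes, and it assumes the two structures already have their points paired one-to-one. The coordinates are invented.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """RMS difference between the internal distance matrices of two structures.

    Because only within-structure distances are compared, the result does not
    change when either structure is translated or rotated as a rigid body.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two equally sized structures with at least 2 points")
    pairs = list(combinations(range(len(coords_a)), 2))
    total = 0.0
    for i, j in pairs:
        da = math.dist(coords_a[i], coords_a[j])
        db = math.dist(coords_b[i], coords_b[j])
        total += (da - db) ** 2
    return math.sqrt(total / len(pairs))


# Hypothetical backbone trace of four points, plus two variants of it.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 3.0) for x, y, z in original]   # same shape, moved
stretched = [(2.0 * x, y, z) for x, y, z in original]              # different shape

print(distance_rmsd(original, shifted))    # approximately 0: same shape
print(distance_rmsd(original, stretched))  # clearly greater than 0
```

A value near zero means the two shapes are essentially identical regardless of where they sit in space; larger values indicate increasingly different shapes.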

VAST, or the Vector Alignment Search Tool, is a computer algorithm developed at NCBI for use in identifying similar three-dimensional protein structures. VAST is capable of detecting structural similarities between proteins stored in MMDB, even when no sequence similarity is detected.

VAST Search is NCBI's structure-structure similarity search service that compares three-dimensional coordinates of newly determined protein structures to those in the MMDB or PDB databases. VAST Search creates a list of structure neighbors, or related structures, that a user can then browse interactively. VAST Search will retrieve almost all structures with an identical three-dimensional fold, although it may occasionally miss a few structures or report chance similarities.

The detection of structural similarity in the absence of obvious sequence similarity is a powerful tool to study remote homologies and protein evolution.

The Conserved Domain Database

The Conserved Domain Database (CDD) is a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from SMART and Pfam, two popular Web-based tools for studying sequence domains, as well as domains contributed by NCBI researchers. CD-Search, another NCBI search service, can be used to identify conserved domains in a protein query sequence. CD-Search uses RPS-BLAST to compare a query sequence against specific matrices that have been prepared from conserved domain alignments present in CDD. Alignments are also mapped to known three-dimensional structures and can be displayed using Cn3D (see above).

Conserved Domain Architecture Retrieval Tool

NCBI's Conserved Domain Architecture Retrieval Tool (CDART) displays the functional domains that make up a protein and lists other proteins with similar domain architectures. CDART determines the domain architecture of a query protein sequence by comparing it to the CDD, a database of conserved domain alignments, using RPS-BLAST.

The Conserved Domain Architecture Retrieval Tool then compares the protein's domain architecture to that of other proteins in NCBI's non-redundant sequence database. Related sequences are identified as those proteins that share one or more similar domains. CDART displays these sequences using a graphical summary that depicts the types and locations of domains identified within each sequence. Links to the individual sequences as well as to further information on their domain architectures are also provided. Because protein domains may be considered elementary units of molecular function and proteins related by domain architecture may play similar roles in cellular processes, CDART serves as a useful tool in comparative sequence analysis.
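The "share one or more similar domains" rule can be sketched as a simple set comparison. This is only an illustration of the idea, not CDART's implementation; the function name `related_by_domain`, the protein names, and the domain labels are all hypothetical, and real domain assignments would come from a search against CDD.

```python
def related_by_domain(query_domains, database):
    """List database proteins sharing at least one domain with the query.

    Results are ordered by the number of shared domains, then by name.
    """
    query_set = set(query_domains)
    hits = []
    for name, domains in database.items():
        shared = query_set & set(domains)
        if shared:
            hits.append((name, sorted(shared)))
    hits.sort(key=lambda hit: (-len(hit[1]), hit[0]))
    return hits


# Hypothetical domain architectures, listed in order along each protein.
database = {
    "protein_A": ["SH3", "SH2", "kinase"],
    "protein_B": ["SH2", "phosphatase"],
    "protein_C": ["zinc_finger"],
}
hits = related_by_domain(["SH2", "kinase"], database)
print(hits)  # [('protein_A', ['SH2', 'kinase']), ('protein_B', ['SH2'])]
```

A graphical tool would additionally draw each hit with its domains in order along the sequence; this sketch only answers which proteins are related and through which domains.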

RPS-BLAST is a "reverse" version of PSI-BLAST, which is described above. Both RPS-BLAST and PSI-BLAST use similar methods to derive conserved features of a protein family. However, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, whereas PSI-BLAST builds alignments starting from a single protein sequence. The programs also differ in purpose: RPS-BLAST is used to identify conserved domains in a query sequence, whereas PSI-BLAST is used to identify other members of the protein family to which a query sequence belongs.

Application to Biomedicine

Although the information derived from modeling studies is primarily about molecular function, protein structure data also provide a wealth of information on mechanisms linked to the function and the evolutionary history of and relationships between macromolecules. NCBI's goals in adding structure data to its Web site are to make this information easily accessible to the biomedical community worldwide and to facilitate comparative analysis involving three-dimensional structure.

SNPs: VARIATIONS ON A THEME

Wouldn't it be wonderful if you knew exactly what measures you could take to stave off, or even prevent, the onset of disease? Wouldn't it be a relief to know that you are not allergic to the drugs your doctor just prescribed? Wouldn't it be a comfort to know that the treatment regimen you are undergoing has a good chance of success because it was designed just for you? With the availability of millions of SNPs, biomedical researchers now believe that such exciting medical advances are not that far away.

What Are SNPs and How Are They Found?

A Single Nucleotide Polymorphism, or SNP (pronounced "snip"), is a small genetic change, or variation, that can occur within a person's DNA sequence. The genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters: C, G, or T.

An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second "A" in the first snippet is replaced with a "T". On average, SNPs occur in the human population more than 1 percent of the time. Because only about 3 to 5 percent of a person's DNA sequence codes for the production of proteins, most SNPs are found outside of "coding sequences". SNPs found within a coding sequence are of particular interest to researchers because they are more likely to alter the biological function of a protein. Because of the recent advances in technology, coupled with the unique ability of these genetic variations to facilitate gene identification, there has been a recent flurry of SNP discovery and detection.
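The AAGGTTA to ATGGTTA example can be reproduced with a short position-by-position comparison. This is a minimal sketch that assumes the two sequences are already aligned and of equal length; the function name `find_single_base_differences` is hypothetical, and real SNP detection relies on laboratory methods and many individuals rather than a single string comparison.

```python
def find_single_base_differences(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


differences = find_single_base_differences("AAGGTTA", "ATGGTTA")
print(differences)  # [(2, 'A', 'T')]
```

The single reported difference is at position 2, where the second "A" of the first segment has been replaced by a "T".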

Although many SNPs do not produce physical changes in people, scientists believe that other SNPs may predispose people to disease and even influence their response to drug regimens.

Needles in a Haystack

Finding single nucleotide changes in the human genome seems like a daunting prospect, but over the last 20 years, biomedical researchers have developed a number of techniques that make it possible to do just that. Each technique uses a different method to compare selected regions of a DNA sequence obtained from multiple individuals who share a common trait. In each test, the result shows a physical difference in the DNA samples only when a SNP is detected in one individual and not in the other.

As a result of recent advances in SNP research, diagnostics for many diseases may improve.

Many common diseases in humans are not caused by a genetic variation within a single gene but are influenced by complex interactions among multiple genes as well as environmental and lifestyle factors. Although both environmental and lifestyle factors add tremendously to the uncertainty of developing a disease, it is currently difficult to measure and evaluate their overall effect on a disease process. Therefore, we refer here mainly to a person's genetic predisposition, or the potential of an individual to develop a disease based on genes and hereditary factors.

Genetic factors may also confer susceptibility or resistance to a disease and determine the severity or progression of disease. Because we do not yet know all of the factors involved in these intricate pathways, researchers have found it difficult to develop screening tests for most diseases and disorders. By studying stretches of DNA that have been found to harbor a SNP associated with a disease trait, researchers may begin to reveal relevant genes associated with a disease. Defining and understanding the role of genetic factors in disease will also allow researchers to better evaluate the role non-genetic factors (such as behavior, diet, lifestyle, and physical activity) have on disease.

Because genetic factors also affect a person's response to drug therapy, DNA polymorphisms such as SNPs will be useful in helping researchers determine and understand why individuals differ in their abilities to absorb or clear certain drugs, as well as to determine why an individual may experience an adverse side effect to a particular drug. Therefore, the recent discovery of SNPs promises to revolutionize not only the process of disease detection but the practice of preventative and curative medicine.

SNPs and Disease Diagnosis

Each person's genetic material contains a unique SNP pattern that is made up of many different genetic variations. Researchers have found that most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map, because they are usually located near a gene found to be associated with a certain disease. Occasionally, a SNP may actually cause a disease and, therefore, can be used to search for and isolate the disease-causing gene.

To create a genetic test that will screen for a disease in which the disease-causing gene has already been identified, scientists collect blood samples from a group of individuals affected by the disease and analyze their DNA for SNP patterns. Next, researchers compare these patterns to patterns obtained by analyzing the DNA from a group of individuals unaffected by the disease. This type of comparison, called an "association study", can detect differences between the SNP patterns of the two groups, thereby indicating which pattern is most likely associated with the disease-causing gene. Eventually, SNP profiles that are characteristic of a variety of diseases will be established. Then, it will only be a matter of time before physicians can screen individuals for susceptibility to a disease just by analyzing their DNA samples for specific SNP patterns.
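The comparison at the heart of an association study can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotypes are invented, the function names (allele_counts, chi_square_2x2) are hypothetical, and the Pearson chi-square statistic on a 2x2 table of allele counts is one common way to compare two groups, not a method described in this primer.

```python
def allele_counts(genotypes, allele):
    """Count copies of `allele` and of all other alleles in a list of genotype strings."""
    joined = "".join(genotypes)
    hits = joined.count(allele)
    return hits, len(joined) - hits


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no comparison is possible
    return n * (a * d - b * c) ** 2 / denom


# Invented genotypes at a single SNP (two alleles per person).
affected = ["AA", "AG", "AA", "AG", "AA", "AA"]
unaffected = ["GG", "AG", "GG", "GG", "AG", "GG"]

case_a, case_other = allele_counts(affected, "A")
ctrl_a, ctrl_other = allele_counts(unaffected, "A")
statistic = chi_square_2x2(case_a, case_other, ctrl_a, ctrl_other)

print(f"A allele: {case_a}/12 in affected, {ctrl_a}/12 in unaffected")
print(f"chi-square statistic: {statistic:.2f}")
```

A larger statistic means the allele frequencies differ more between the two groups than chance alone would suggest. A real study would use many more individuals and many SNPs, and would correct for multiple testing.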

It will only be a matter of time before physicians can screen patients for susceptibility to a disease by analyzing their DNA for specific SNP profiles.

SNPs and Drug Development

As mentioned earlier, SNPs may also be associated with the absorbance and clearance of therapeutic agents. Currently, there is no simple way to determine how a patient will respond to a particular medication. A treatment proven effective in one patient may be ineffective in others. Worse yet, some patients may experience an adverse immunologic reaction to a particular drug. Today, pharmaceutical companies are limited to developing agents to which the "average" patient will respond. As a result, many drugs that might benefit a small number of patients never make it to market.

In the future, the most appropriate drug for an individual could be determined in advance of treatment by analyzing a patient's SNP profile. The ability to target a drug to those individuals most likely to benefit, referred to as "personalized medicine", would allow pharmaceutical companies to bring many more drugs to market and allow doctors to prescribe individualized therapies specific to a patient's needs.

Using SNPs to study the genetics of drug response will help in the creation of "personalized" medicine.

SNPs and NCBI



Because SNPs occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers. Biological markers are segments of DNA with an identifiable physical location that can be easily tracked and used for constructing a chromosome map that shows the positions of known genes, or other markers, relative to each other. These maps allow researchers to study and pinpoint traits resulting from the interaction of more than one gene. NCBI plays a major role in facilitating the identification and cataloging of SNPs through its creation and maintenance of the public SNP database (dbSNP). This powerful genetic tool may be accessed by the biomedical community worldwide and is intended to stimulate many areas of biological research, including the identification of the genetic components of disease.

Most SNPs are not responsible for a disease state. Instead, they serve as biological markers for pinpointing a disease on the human genome map.

NCBI's "Discovery Space" Facilitating SNP Research

Figure 1. The NCBI Discovery Space.

Records in dbSNP are cross-annotated within other internal information resources such as PubMed, genome project sequences, GenBank records, the Entrez Gene database, and the dbSTS database of sequence tagged sites. Users may query dbSNP directly or start a search in any part of the NCBI discovery space to construct a set of dbSNP records that satisfy their search conditions. Records are also integrated with external information resources through hypertext URLs that dbSNP users can follow to explore the detailed information that is beyond the scope of dbSNP curation.

Reproduced with permission from Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. "dbSNP: the NCBI database of genetic variation." Nucleic Acids Research. 2001; 29:308-311.

To facilitate research efforts, NCBI's dbSNP is included in the Entrez retrieval system, which provides integrated access to a number of software tools and databases that can aid in SNP analysis. For example, each SNP record in the database links to additional resources within NCBI's "Discovery Space", as noted in Figure 1. Resources include: GenBank, NIH's sequence database; Entrez Gene, a focal point for genes and associated information; dbSTS, NCBI's resource containing sequence and mapping data on short genomic landmarks; human genome sequencing data; and PubMed, NCBI's literature search and retrieval system. SNP records also link to various external allied resources.

Providing public access to a site for "one-stop SNP shopping" facilitates scientific research in a variety of fields, ranging from population genetics and evolutionary biology to large-scale disease and drug association studies. The long-term investment in such novel and exciting research promises not only to advance human biology but to revolutionize the practice of modern medicine.

ESTs: GENE DISCOVERY MADE EASIER

Investigators are working diligently to sequence and assemble the genomes of various organisms, including the mouse and human, for a number of important reasons. Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression. Once we begin to understand where and how a gene is expressed under normal circumstances, we can then study what happens in an altered state, such as in disease. To accomplish the latter goal, however, researchers must identify and study the protein, or proteins, coded for by a gene.

As one can imagine, finding a gene that codes for a protein, or proteins, is not easy. Traditionally, scientists would start their search by defining a biological problem and developing a strategy for researching the problem. Oftentimes, a search of the scientific literature provided various clues about how to proceed. For example, other laboratories may have published data that established a link between a particular protein and a disease of interest. Researchers would then work to isolate that protein, determine its function, and locate the gene that coded for the protein. Alternatively, scientists could conduct what are referred to as linkage studies to determine the chromosomal location of a particular gene. Once the chromosomal location was determined, scientists would use biochemical methods to isolate the gene and its corresponding protein. Either way, these methods took a great deal of time (years in some cases) and yielded the location and description of only a small percentage of the genes found in the human genome.

Now, however, the time required to locate and fully describe a gene is rapidly decreasing, thanks to the development of, and access to, a technology used to generate what are called Expressed Sequence Tags, or ESTs. ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps. Today, researchers using ESTs to study the human genome find themselves riding the crest of a wave of scientific discovery the likes of which has never been seen before.

An Expressed Sequence Tag is a tiny portion of an entire gene that can be used to help identify unknown genes and to map their positions within a genome.

What Are ESTs and How Are They Made?

ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.

Separating the Wheat from the Chaff: Using mRNA to Generate cDNA

Gene identification is very difficult in humans, because most of our genome is composed of non-coding DNA, including introns, interspersed with relatively few protein-coding sequences. Genes are expressed as proteins through a complex process composed of two main steps. First, each gene (DNA) must be converted, or transcribed, into messenger RNA (mRNA), RNA that serves as a template for protein synthesis. The resulting mRNA then guides the synthesis of a protein through a process called translation. Interestingly, mRNAs in a cell do not contain sequences from the regions between genes, nor from the non-coding introns that are present within many genes. Therefore, isolating mRNA is key to finding expressed genes in the vast expanse of the human genome.
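The path from gene to protein can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the 23-nucleotide "gene", its exon coordinates, the function names, and the four-entry codon table (which holds only the codons this example needs, not the full genetic code).

```python
# Partial codon table: only the codons used in this toy example ("*" marks a stop).
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "*"}


def transcribe(coding_dna):
    """Write the coding-strand DNA sequence as RNA by replacing T with U."""
    return coding_dna.replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exon segments, given as (start, end) pairs, and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented gene: exon 1, an intron, then exon 2.
gene = "ATGGCT" + "GTAAGTTTCAG" + "TGGTAA"
exons = [(0, 6), (17, 23)]

pre_mrna = transcribe(gene)     # still contains the intron
mrna = splice(pre_mrna, exons)  # intron removed
protein = translate(mrna)

print(pre_mrna)
print(mrna)
print(protein)
```

The spliced mRNA is shorter than the gene it came from, which is why working from mRNA sidesteps the introns and intergenic regions of the genome.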



Figure 1. An overview of the process of protein synthesis.

Protein synthesis is the process whereby DNA codes for the production of amino acids and proteins. The process is divided into two parts: transcription and translation. During transcription, one strand of a DNA double helix is used as a template by RNA polymerase to synthesize an mRNA. During this step, mRNA passes through various phases, including one called splicing, where the non-coding sequences are eliminated. In the next step, translation, the mRNA guides the synthesis of the protein by adding amino acids, one by one, as dictated by the DNA and represented by the mRNA.

The problem, however, is that mRNA is very unstable outside of a cell; therefore, scientists use special enzymes to convert it to complementary DNA (cDNA). cDNA is a much more stable compound and, importantly, because it was generated from an mRNA in which the introns have been removed, cDNA represents only expressed DNA sequence.

cDNA is a form of DNA prepared in the laboratory using an enzyme called reverse transcriptase. cDNA production is the reverse of the usual process of transcription in cells because the procedure uses mRNA as a template rather than DNA. Unlike genomic DNA, cDNA contains only expressed DNA sequences, or exons.
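As a rough illustration of what reverse transcription does to a sequence, the sketch below builds the first cDNA strand from an mRNA string. The function name and the example mRNA are hypothetical; the only biology assumed is standard base pairing (A with U or T, G with C) and that the new strand runs antiparallel to its template.

```python
# Base-pairing rules for copying RNA into DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(mrna):
    """Return the first-strand cDNA, written 5' to 3', complementary to the mRNA."""
    # The new strand is antiparallel to the template, so the mRNA is read backwards.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


mrna = "AUGGCUUGGUAA"  # invented, already-spliced mRNA
cdna = reverse_transcribe(mrna)
print(cdna)
```

Because the template is a spliced mRNA, the resulting cDNA carries no intron sequence.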

From cDNAs to ESTs

Once cDNA representing an expressed gene has been isolated, scientists can then sequence a few hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces what is called a 5' EST. A 5' EST is obtained from the portion of a transcript that usually codes for a protein. These regions tend to be conserved across species and do not change much within a gene family. Sequencing the ending portion of the cDNA molecule produces what is called a 3' EST. Because these ESTs are generated from the 3' end of a transcript, they are likely to fall within non-coding, or untranslated regions (UTRs), and therefore tend to exhibit less cross-species conservation than do coding sequences.
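In sequence terms, the two kinds of ESTs are simply short reads taken from opposite ends of the same cDNA. The sketch below makes that concrete. It is a simplification under stated assumptions: the function name and the read length of 300 are illustrative choices, the cDNA is an invented repeat, and real 3' reads are typically obtained from the opposite strand, which is ignored here.

```python
def make_ests(cdna, read_length=300):
    """Return (5' EST, 3' EST): the first and last `read_length` bases of a cDNA."""
    five_prime_est = cdna[:read_length]
    three_prime_est = cdna[-read_length:]
    return five_prime_est, three_prime_est


# Invented 1,000-nucleotide cDNA, written 5' to 3' in the sense orientation.
cdna = "ACGT" * 250
five_prime_est, three_prime_est = make_ests(cdna)

print(len(five_prime_est), len(three_prime_est))
```

For a cDNA much longer than two reads, the middle of the transcript is never sequenced, which is part of why an EST is described as a tag rather than a full gene sequence.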

A "gene family" is a group of closely related genes that produce similar protein products.



A UTR is that part of a gene that is not translated into protein.

Figure 2. An overview of how ESTs are generated.

ESTs are generated by sequencing cDNA, which itself is synthesized from the mRNA molecules in a cell. The mRNAs in a cell are copies of the genes that are being expressed. mRNA does not contain sequences from the regions between genes, nor from the non-coding introns that are present within many interesting parts of the genome.

ESTs: Tools for Gene Mapping and Discovery

ESTs as Genome Landmarks

Just as a person driving a car may need a map to find a destination, scientists searching for genes also need genome maps to help them to navigate through the billions of nucleotides that make up the human genome. For a map to make navigational sense, it must include reliable landmarks or "markers". Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on Sequence Tagged Site (STS) mapping. An STS is a short DNA sequence that is easily recognizable and occurs only once in a genome (or chromosome). The 3' ESTs serve as a common source of STSs because of their likelihood of being unique to a particular species and provide the additional feature of pointing directly to an expressed gene.
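The defining property of an STS, that it occurs only once in a genome, is easy to state as a check on strings. The following is a minimal sketch with an invented 22-nucleotide "genome" and hypothetical function names; it searches one strand only, whereas a real uniqueness check would also consider the reverse complement and would work at a vastly larger scale.

```python
def count_occurrences(genome, candidate):
    """Count every position, overlapping ones included, where `candidate` occurs."""
    count = 0
    start = genome.find(candidate)
    while start != -1:
        count += 1
        start = genome.find(candidate, start + 1)
    return count


def is_sts_candidate(genome, candidate):
    """A sequence can act as a landmark only if it occurs exactly once."""
    return count_occurrences(genome, candidate) == 1


genome = "ACGTTGCAACGTGGATCCACGT"  # invented sequence

print(count_occurrences(genome, "ACGT"))   # repeated, so not a usable landmark
print(is_sts_candidate(genome, "GGATCC"))  # occurs once
```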

ESTs as Gene Discovery Resources

ESTs are powerful tools in the hunt for known genes because they greatly reduce the time required to locate a gene.

Because ESTs represent a copy of just the interesting part of a genome, that which is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have a number of practical advantages in that their sequences can be generated rapidly and inexpensively, only one sequencing experiment is needed per cDNA generated, and they do not have to be checked for sequencing errors because mistakes do not prevent identification of the gene from which the EST was derived.

Using ESTs, scientists have rapidly isolated some of the genes involved in Alzheimer's disease and colon cancer.

To find a disease gene using this approach, scientists first use observable biological clues to identify ESTs that may correspond to disease gene candidates. Scientists then examine the DNA of disease patients for mutations in one or more of these candidate genes to confirm gene identity. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases. It is easy to see why ESTs will pave the way to new horizons in genetic research.

ESTs and NCBI

Because of their utility, the speed with which they may be generated, and the low cost associated with this technology, many individual scientists as well as large genome sequencing centers have been generating hundreds of thousands of ESTs for public use. Once an EST was generated, scientists were submitting their tags to GenBank, the NIH sequence database operated by NCBI. With the rapid submission of so many ESTs, it became difficult to identify a sequence that had already been deposited in the database. It was becoming increasingly apparent to NCBI investigators that if ESTs were to be easily accessed and useful as gene discovery tools, they needed to be organized in a searchable database that also provided access to other genome data. Therefore, in 1992, scientists at NCBI developed a new database designed to serve as a collection point for ESTs. Once an EST that was submitted to GenBank had been screened and annotated, it was then deposited in this new database, called dbEST.

For ESTs to be easily accessed and useful as gene discovery tools, they must be organized in a searchable database that also provides access to genome data.

dbEST: A Descriptive Catalog of ESTs

Scientists at NCBI created dbEST to organize, store, and provide access to the great mass of public EST data that has already accumulated and that continues to grow daily. Using dbEST, a scientist can access not only data on human ESTs but information on ESTs from over 300 other organisms as well. Whenever possible, NCBI scientists annotate the EST record with any known information. For example, if an EST matches a DNA sequence that codes for a known gene with a known function, that gene's name and function are placed on the EST record. Annotating EST records allows public scientists to use dbEST as an avenue for gene discovery. By using a database search tool, such as NCBI's BLAST, any interested party can conduct sequence similarity searches against dbEST.

Scientists at NCBI annotate EST records with text information regarding DNA and mRNA homologies.

UniGene: A Non-Redundant Set of Gene-oriented Clusters

Because a gene can be expressed as mRNA many, many times, ESTs ultimately derived from this mRNA may be redundant. That is, there may be many identical, or similar, copies of the same EST. Such redundancy and overlap means that when someone searches dbEST for a particular EST, they may retrieve a long list of tags, many of which may represent the same gene. Searching through all of these identical ESTs can be very time consuming. To resolve the redundancy and overlap problem, NCBI investigators developed the UniGene database. UniGene automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters.
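To give a feel for what partitioning redundant sequences into clusters means, here is a toy Python sketch. It is not UniGene's procedure, which this primer does not describe. The rule used here (two ESTs join the same cluster if they share any exact stretch of k nucleotides), the value k=8, the function name, and the EST sequences are all assumptions chosen to keep the example small.

```python
def cluster_ests(ests, k=8):
    """Group ESTs that share at least one exact k-mer; returns a list of sets of names."""
    parent = {name: name for name in ests}

    def find(name):
        # Follow parent links to the cluster representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # k-mer -> name of the first EST that contained it
    for name, sequence in ests.items():
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            if kmer in first_seen:
                parent[find(name)] = find(first_seen[kmer])  # merge the two clusters
            else:
                first_seen[kmer] = name

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Invented ESTs: est1 and est2 overlap, est3 is unrelated.
ests = {
    "est1": "ATGGCTTGGACCGTA",
    "est2": "TTGGACCGTAGGCTA",
    "est3": "CCCCAAAATTTTGGGG",
}
clusters = cluster_ests(ests)
print(sorted(sorted(cluster) for cluster in clusters))
```

A search that returns one entry per cluster, rather than one per EST, is what spares the user from reading through many tags for the same gene.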

Although it is widely recognized that the generation of ESTs constitutes an efficient strategy to identify genes, it is important to acknowledge that despite its advantages, there are several limitations associated with the EST approach. One is that it is very difficult to isolate mRNA from some tissues and cell types. This results in a paucity of data on certain genes that may only be found in these tissues or cell types. Second is that important gene regulatory sequences may be found within an intron. Because ESTs are small segments of cDNA, generated from an mRNA in which the introns have been removed, much valuable information may be lost by focusing only on cDNA sequencing. Despite these limitations, ESTs continue to be invaluable in characterizing the human genome, as well as the genomes of other organisms. They have enabled the mapping of many genes to chromosomal sites and have also assisted in the discovery of many new genes.

MICROARRAYS: CHIPPING AWAY AT THE MYSTERIES OF SCIENCE AND MEDICINE

With only a few exceptions, every cell of the body contains a full set of chromosomes and identical genes. Only a fraction of these genes are turned on, however, and it is the subset that is "expressed" that confers unique properties to each cell type. "Gene expression" is the term used to describe the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA (mRNA) molecules that are then translated into the proteins that perform most of the critical functions of cells. Scientists study the kinds and amounts of mRNA produced by a cell to learn which genes are expressed, which in turn provides insights into how the cell responds to its changing needs. Gene expression is a highly complex and tightly regulated process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs. This mechanism acts as both an "on/off" switch to control which genes are expressed in a cell as well as a "volume control" that increases or decreases the level of expression of particular genes as necessary.

The proper and harmonious expression of a large number of genes is a critical component of normal growth and development and the maintenance of proper health. Disruptions or changes in gene expression are responsible for many diseases.

Enabling Technologies

Biomedical research evolves and advances not only through the compilation of knowledge but also through the development of new technologies. Using traditional methods to assay gene expression, researchers were able to survey a relatively small number of genes at a time. The emergence of new tools enables researchers to address previously intractable problems and to uncover novel potential targets for therapies. Microarrays allow scientists to analyze expression of many genes in a single experiment quickly and efficiently. They represent a major methodological advance and illustrate how the advent of new technologies provides powerful tools for researchers. Scientists are using microarray technology to try to understand fundamental aspects of growth and development as well as to explore the underlying genetic causes of many human diseases.

DNA Microarrays: The Technical Foundations

Two recent complementary advances, one in knowledge and one in technology, are greatly facilitating the study of gene expression and the discovery of the roles played by specific genes in the development of disease. As a result of the Human Genome Project, there has been an explosion in the amount of information available about the DNA sequence of the human genome. Consequently, researchers have identified a large number of novel genes within these previously unknown sequences. The challenge currently facing scientists is to find a way to organize and catalog this vas
