Computer Applications in Biology and Biological Research

    CONTENTS

1) Automated Biological Analysers
   1.1) Routine biochemistry analysers
   1.2) Immuno-based analysers
   1.3) Hematology analysers

2) Biological Databases
   2.1) Nucleic Acids Research Database
   2.2) Species-specific databases
   2.3) Applications
        2.3.1) Mass Spectrometry - MALDI-TOF
        2.3.2) Protein Fingerprinting
        2.3.3) Proteomics

3) The Human Genome Project
   3.1) Sequencing Applications
   3.2) Computational Analysis
   3.3) Computational Engine for Genomic Sequences
   3.4) Data mining and information retrieval
   3.5) Data warehousing
   3.6) Visualization for data and collaboration

4) Bibliography

AUTOMATED BIOLOGICAL ANALYSERS

An automated analyser is a medical laboratory instrument designed to measure different chemicals and other characteristics in a number of biological samples quickly, with minimal human assistance. These measured properties of blood and other fluids may be useful in the diagnosis of disease.

Many methods of introducing samples into the analyser have been invented. This can involve placing test tubes of sample into racks, which can be moved along a track, or inserting tubes into circular carousels that rotate to make the sample available. Some analysers require samples to be transferred to sample cups. However, the effort to protect the health and safety of laboratory staff has prompted many manufacturers to develop analysers that feature closed-tube sampling, preventing workers from direct exposure to samples.

Samples can be processed singly, in batches, or continuously. The automation of laboratory testing does not remove the need for human expertise (results must still be evaluated by medical technologists and other qualified clinical laboratory professionals), but it does reduce errors and address staffing and safety concerns.

    Routine biochemistry analysers

These are machines that process a large portion of the samples going into a hospital or private medical laboratory. Automation of the testing process has reduced testing time for many analytes from days to minutes. The history of discrete sample analysis for the clinical laboratory began with the introduction of the "Robot Chemist", invented by Hans Baruch and introduced commercially in 1959.

The AutoAnalyzer is an automated analyser using a special flow technique named "continuous flow analysis" (CFA), invented in 1957 by Leonard Skeggs, PhD, and first made by the Technicon Corporation.

The first applications were for clinical (medical) analysis. The AutoAnalyzer profoundly changed the character of the chemical testing laboratory by allowing significant increases in the number of samples that could be processed. The design, based on separating a continuously flowing stream with air bubbles, largely replaced slow, clumsy, and error-prone manual methods of analysis.

The types of tests required include enzyme levels (such as many of the liver function tests), ion levels (e.g. sodium and potassium), and other tell-tale chemicals (such as glucose, serum albumin, or creatinine).

Simple ions are often measured with ion-selective electrodes, which let one type of ion through and measure voltage differences. Enzymes may be measured by the rate at which they change one coloured substance to another; in these tests, the results for enzymes are given as an activity, not as a concentration of the enzyme.

Other tests use colorimetric changes to determine the concentration of the chemical in question. Turbidity may also be measured.
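Colorimetric results ultimately reduce to the Beer-Lambert law, A = εlc. As a rough illustration of how an analyser turns raw absorbance readings into reported values, here is a minimal Python sketch; the molar absorptivity used (NADH at 340 nm) and the assay volumes are illustrative assumptions, not the parameters of any particular instrument.

```python
# Minimal sketch of how a colorimetric analyser converts raw optical
# readings into reported values. The molar absorptivity and volumes
# below are illustrative assumptions, not values from a real assay.

EPSILON = 6220.0   # molar absorptivity of NADH at 340 nm, L/(mol*cm)
PATH_CM = 1.0      # cuvette path length in cm

def concentration(absorbance: float) -> float:
    """Beer-Lambert law: A = epsilon * l * c, solved for c (mol/L)."""
    return absorbance / (EPSILON * PATH_CM)

def enzyme_activity(delta_absorbance: float, delta_minutes: float,
                    reaction_ml: float, sample_ml: float) -> float:
    """Enzyme result reported as an activity (U/L), i.e. a rate:
    micromoles of substrate converted per minute per litre of sample."""
    rate = delta_absorbance / delta_minutes          # dA/dt
    return rate / (EPSILON * PATH_CM) * 1e6 * reaction_ml / sample_ml

if __name__ == "__main__":
    print(f"analyte: {concentration(0.45):.2e} mol/L")
    print(f"enzyme:  {enzyme_activity(0.12, 1.0, 3.0, 0.1):.1f} U/L")
```

Note that the enzyme result comes out as a rate (U/L) rather than a concentration, matching the activity-based reporting described above.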

    Immuno-based analysers

Some analysers use antibodies to detect many substances by immunoassay and other methods that exploit antibody-antigen reactions.

When the concentration of these compounds is too low to cause a measurable increase in turbidity when bound to antibody, more specialised methods must be used. Recent developments include automation for the immunohaematology lab, also known as transfusion medicine.


    BIOLOGICAL DATABASES

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis.[2] They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures.

Biological databases can be broadly classified into sequence and structure databases: sequence databases store nucleic acid and protein sequences, while structure databases store protein structures. These databases are important tools in assisting scientists to analyze and explain a host of biological phenomena, from the structure of biomolecules and their interactions to the whole metabolism of organisms and the evolution of species. This knowledge helps in fighting disease, developing medications, predicting certain genetic diseases, and discovering basic relationships among species in the history of life.

Biological knowledge is distributed among many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Integrative bioinformatics is one field attempting to tackle this problem by providing unified access. One approach is for biological databases to cross-reference other databases using accession numbers, linking their related knowledge together.
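As a concrete illustration of accession-number cross-referencing, the sketch below fetches a record from NCBI's public E-utilities service. The accession U49845 is just a well-known example record; a production retrieval agent would also identify itself and rate-limit its requests.

```python
# Hedged sketch: resolving an accession number against NCBI's public
# E-utilities API. "U49845" is an arbitrary example record, and real
# pipelines should add an email/tool parameter and rate limiting.
import urllib.request
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_fasta(accession: str) -> str:
    """Fetch a nucleotide record in FASTA format by accession number."""
    params = urllib.parse.urlencode({
        "db": "nuccore", "id": accession,
        "rettype": "fasta", "retmode": "text",
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print(fetch_fasta("U49845")[:200])  # first lines of the record
```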


Relational database concepts from computer science and information retrieval concepts from digital libraries are important for understanding biological databases. Biological database design, development, and long-term management is a core area of the discipline of bioinformatics.[4] Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data. These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures.
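To make "semi-structured" concrete, the following sketch parses a small key-delimited flat-file record, in the style of EMBL-like entries, into a dictionary. The record content is invented for illustration.

```python
# Illustrative sketch of "semi-structured" biological data: a
# key-delimited flat-file record parsed into a dictionary. The
# record content here is invented for illustration.
RECORD = """\
ID    EXAMPLE_GENE
DE    Hypothetical example gene description
OS    Homo sapiens
SQ    ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA
//"""

def parse_flat_record(text: str) -> dict:
    """Parse key-then-value lines; '//' terminates the record."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):
            break
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value.strip())
    return {k: " ".join(v) for k, v in fields.items()}

print(parse_flat_record(RECORD))  # {'ID': 'EXAMPLE_GENE', ...}
```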

Most biological databases are available through web sites that organise data such that users can browse through the data online. In addition, the underlying data is usually available for download in a variety of formats. Biological data comes in many formats, including text, sequence data, protein structure, and links. Each of these can be found from certain sources, for example:

Text formats are provided by PubMed and OMIM.
Sequence data is provided by GenBank (DNA) and UniProt (protein).
Protein structures are provided by PDB, SCOP, and CATH.

    Nucleic Acids Research Database

An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics. A companion database to the issue, the Online Molecular Biology Database Collection, lists 1,380 online databases. Other collections of databases exist, such as MetaBase and the Bioinformatics Links Collection.


    Species-specific databases

Species-specific databases are available for some species, mainly those that are often used in research. For example, Colibase is an E. coli database. Other popular species-specific databases include Mouse Genome Informatics for the laboratory mouse, Mus musculus; the Rat Genome Database for Rattus; ZFIN for Danio rerio (zebrafish); FlyBase for Drosophila; WormBase for the nematodes Caenorhabditis elegans and Caenorhabditis briggsae; and Xenbase for Xenopus tropicalis and Xenopus laevis frogs.

Mass Spectrometry - MALDI-TOF

The resulting masses measured in MALDI-TOF (matrix-assisted laser desorption/ionisation time-of-flight) mass spectrometry can be used to search the protein sequence databases for matches with theoretical masses.
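The matching step itself is straightforward to sketch: compare each measured peak against precomputed theoretical masses within a tolerance. The peak lists and tolerance below are illustrative.

```python
# Minimal sketch of peptide-mass matching: measured MALDI-TOF peaks are
# compared against theoretical peptide masses precomputed from a sequence
# database entry. The masses and tolerance below are illustrative.

def match_peaks(measured, theoretical, tol_da=0.2):
    """Return (measured, theoretical) pairs within a mass tolerance in Da."""
    hits = []
    for m in measured:
        for t in theoretical:
            if abs(m - t) <= tol_da:
                hits.append((m, t))
    return hits

measured_peaks = [842.51, 1045.56, 1479.80]          # example spectrum peaks
db_masses = [842.509, 1045.564, 1296.685, 1479.796]  # from a protein entry
print(match_peaks(measured_peaks, db_masses))
```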

    Protein Fingerprinting

Following digestion of a protein with reagents, the resulting mixture of peptides can be separated by chromatographic processes. The resulting pattern, or peptide map, is diagnostic of the protein and is hence called its fingerprint. These fingerprints can be stored in online databases for reference by other biologists.

    Proteomics

The proteome is the entire complement of proteins in a cell or organism. Proteomics is the large-scale effort to identify and characterize all the proteins encoded in an organism's genome, including their post-translational modifications. Proteins resolved by two-dimensional electrophoresis are subjected to trypsin digestion, and the extremely accurate molar masses of the peptides produced are used as a fingerprint to identify the protein from databases of real or predicted tryptic peptide sizes.
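The in-silico half of this identification is easy to sketch: digest a sequence using trypsin's cleavage rule (after K or R, but not before P) and compute monoisotopic peptide masses for database lookup. The protein fragment below is illustrative, and real pipelines also handle missed cleavages and modifications.

```python
# Hedged sketch of in-silico tryptic digestion: split a protein sequence
# with trypsin's rule and compute monoisotopic peptide masses for lookup.

import re

# Monoisotopic residue masses (Da); one water is added per peptide.
RESIDUE = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
           'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
           'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
           'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
           'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER = 18.01056

def tryptic_digest(seq: str):
    """Split after K/R except when followed by P (standard trypsin rule)."""
    return re.split(r'(?<=[KR])(?!P)', seq)

def peptide_mass(pep: str) -> float:
    return sum(RESIDUE[aa] for aa in pep) + WATER

protein = "MKWVTFISLLLLFSSAYSRGVFRR"  # illustrative fragment, not a real entry
for pep in tryptic_digest(protein):
    if pep:
        print(f"{pep:12s} {peptide_mass(pep):9.4f} Da")
```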


THE HUMAN GENOME PROJECT

Just a few short years ago, most of us knew very little about our genes and their impact on our lives. But more recently, it has been virtually impossible to escape the popular media's attention to a number of breathtaking discoveries of human genes, especially those related to diseases such as cystic fibrosis, Huntington's chorea, and breast cancer. What has brought this about is the Human Genome Project, an international effort started in 1988 and sponsored in the United States by the Department of Energy (DOE) and the National Institutes of Health (NIH). The goal of this project is to elucidate the information that makes up the genetic blueprint of human beings.

The Human Genome Project's success in sequencing the chemical bases of DNA is virtually revolutionizing biology and biotechnology. It is creating new knowledge about fundamental biological processes. It has increased our ability to analyze, manipulate, and modify genes and to engineer organisms, providing numerous opportunities for applications. Biotechnology in the United States, virtually nonexistent a few years ago, is expected to become a $50 billion industry before 2000, largely because of the Human Genome Project.

COMPUTING THE GENOME

Ed Uberbacher shows a model of transcribing genes. Photograph by Tom Cerniglio.

ORNL is part of a team that is designing and preparing to implement a new computational engine to rapidly analyze large-scale genomic sequences.


Despite this project's impact, the pace of gene discovery has actually been rather slow. The initial phase of the project, called mapping, has been primarily devoted to fragmenting chromosomes into manageable ordered pieces for later high-throughput sequencing. In this period, it has often taken years to locate and characterize individual genes related to human disease. Thus, biologists began to appreciate the value of computing to mapping and sequencing.

A good illustration of the emerging impact of computing on genomics is the search for the gene for adrenoleukodystrophy (related to the disease in the movie Lorenzo's Oil). A team of researchers in Europe spent about two years searching for the gene using standard experimental methods. Then they managed to sequence the region of the chromosome containing the gene. Finally, they sent information on the sequence to the ORNL server containing the ORNL-developed computer program called Gene Recognition and Analysis Internet Link (GRAIL). Within a couple of minutes, GRAIL returned the location of the gene within the sequence.

The Human Genome Project has entered a new phase. Six NIH genome centers were funded recently to begin high-throughput sequencing, and plans are under way for large-scale sequencing efforts at DOE genome centers at Lawrence Berkeley National Laboratory (LBNL), Lawrence Livermore National Laboratory (LLNL), and Los Alamos National Laboratory (LANL); these centers have been integrated to form the Joint Genome Institute. As a result, researchers are now focusing on the challenge of processing and understanding much larger domains of the DNA sequence. It has been estimated that, on average from 1997 to 2003, new sequence of approximately 2 million DNA bases will be produced every day. Each day's sequence will represent approximately 70 new genes and their respective proteins. This information will be made available immediately on the Internet and in central genome databases.


Such information is of immeasurable value to medical researchers, biotechnology firms, the pharmaceutical industry, and researchers in a host of fields ranging from microorganism metabolism to structural biology. Because only a small fraction of genes that cause human genetic disease have been identified, each new gene revealed by genome sequence analysis has the potential to significantly affect human health. Within the human genome is an estimated total of 6000 genes that have a direct impact on the diagnosis and treatment of human genetic diseases. The timely development of diagnostic techniques and treatments for these diseases is worth billions of dollars for the U.S. economy, and computational analysis is a key component that can contribute significantly to the knowledge necessary to effect such developments.

The rate of several megabase pairs per day at which the Human Genome and microorganism sequencing projects will soon be producing data will exceed current sequence analysis capabilities and infrastructure. Sequences are already arriving at a rate and in forms that make analysis very difficult. For example, a recent posting of a large clone (a large DNA sequence fragment) by a major genome center was made in several hundred thousand-base fragments, rather than as one long sequence, because the sequence database was unable to input the whole sequence as a single long entry. Anyone who wishes to analyze this sequence to determine which genes are present must manually "reassemble" the sequence from these many small fragments, an absolutely ridiculous task. The sequences of large genomic clones are being routinely posted on the Internet with virtually no comment, analysis, or interpretation; and mechanisms for their entry into public-domain databases are in many cases inadequately defined. Valuable sequences are going unanalyzed because methods and procedures for handling the data are lacking and because current methods for doing analyses are time-consuming and inconvenient. And in real terms, the flood of data is just beginning.


    Computational Analysis

Computers can be used very effectively to indicate the location of genes and of regions that control the expression of genes, and to discover relationships between each new sequence and other known sequences from many different organisms. This process is referred to as "sequence annotation." Annotation (the elucidation and description of biologically relevant features in the sequence) is the essential prerequisite before genome sequence data can become useful, and the quality with which annotation is done will directly affect the value of the sequence. In addition to considerable organizational issues, significant computational challenges must be addressed if the DNA sequences produced are to be successfully annotated. It is clear that new computational methods and a workable process must be implemented for effective and timely analysis and management of these data.

In considering computing related to the large-scale sequence analysis and annotation process, it is useful to examine previously developed models. Procedures for high-throughput analysis have been most notably applied to several microorganisms (e.g., Haemophilus influenzae and Mycoplasma genitalium) using relatively simple methods designed to facilitate basically a single pass through the data (a pipeline that produces a one-time result or report).

However, this is too simple a model for analyzing genomes as complex as the human genome. For one thing, the analysis of genomic sequence regions needs to be updated continually through the course of the Genome Project; the analysis is never really done. On any given day, new information relevant to a sequenced gene may show up in any one of many databases, and new links to this information need to be discovered and presented. Additionally, our capabilities for analyzing the sequence will change with time.

The analysis of DNA sequences by computer is a relatively immature science, and we in the informatics community will be able to recognize many features (like gene regulatory regions) better in a year than we can now. There will be a significant advantage in reanalyzing sequences and updating our knowledge of them continually as new sequences appear from many organisms, methods improve, and databases with relevant information grow. In this model, sequence annotation is a living thing that will develop richness and improve in quality over the years.


The "single-pass-through pipeline" is simply not the appropriate model for human genome analysis, because the rate at which new and relevant information appears is staggering.
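Whatever the process model, annotation bottoms out in feature-finding primitives. As a toy illustration, the sketch below scans a DNA strand for open reading frames; real annotators such as GRAIL rely on trained statistical models rather than this simple rule.

```python
# Toy illustration of one annotation primitive: scanning a DNA strand for
# open reading frames (start codon to in-frame stop). Only a sketch; real
# gene finders use trained statistical models, not this rule alone.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 10):
    """Yield (start, end, frame) for ATG...stop ORFs on the given strand."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None

sequence = "CCATGGCTGCTAAAGTTCTGGAA" * 4 + "TAAGG"  # synthetic example
for orf in find_orfs(sequence, min_codons=5):
    print(orf)
```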

Computational Engine for Genomic Sequences

Researchers at ORNL, LBNL, Argonne National Laboratory (ANL), and several other genome laboratories are teaming to design and implement a new kind of computational engine for analysis of large-scale genomic sequences. This "sequence analysis engine," which has become a Computational Grand Challenge problem, will integrate a suite of tools on high-performance computing resources and manage the analysis results.

In addition to the need for state-of-the-art computers at several supercomputing centers, this analysis system will require dynamic and seamless management of contiguous distributed high-performance computing processes, efficient parallel implementations of a number of new algorithms, complex distributed data mining operations, and the application of new inferencing and visualization methods. A process of analysis that will be started in this engine will not be completed for seven to ten years.

The data flow in this analysis engine is shown in Fig. 1. Updates of sequence data will be retrieved through the use of Internet retrieval agents and stored in a local data warehouse. Most human genome centers will daily post new sequences on publicly available Internet or World Wide Web sites, and they will establish agreed-upon policies for Internet capture of their data. These data will feed the analysis engine, which will return results to the warehouse for use in later or long-term analysis processes, visualization by researchers, and distribution to community databases and genome sequencing centers. Unlike the pipeline analysis model, the warehouse maintains the sequence data, analysis results, and data links so that continual update processes can operate on the data over many years.
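The retrieval-agent and warehouse pattern can be sketched compactly. The feed URL and single-table schema below are assumptions for illustration, not the actual engine's design.

```python
# Hedged sketch of the retrieval-agent / data-warehouse pattern described
# above: an agent pulls newly posted sequence files and records them in a
# local store so later analysis passes can be re-run. The URL and schema
# are illustrative assumptions, not the actual ORNL design.

import sqlite3
import urllib.request
from datetime import datetime, timezone

FEEDS = ["https://example.org/genome-center/daily.fasta"]  # hypothetical

def warehouse(db_path: str = "warehouse.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS sequences (
        url TEXT, fetched_at TEXT, data BLOB)""")
    return conn

def retrieve(conn: sqlite3.Connection) -> None:
    """Fetch each feed and persist it with a timestamp for later passes."""
    for url in FEEDS:
        with urllib.request.urlopen(url) as resp:
            payload = resp.read()
        conn.execute("INSERT INTO sequences VALUES (?, ?, ?)",
                     (url, datetime.now(timezone.utc).isoformat(), payload))
    conn.commit()

if __name__ == "__main__":
    retrieve(warehouse())
```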


The analysis engine will combine a number of processes into a coherent system running on distributed high-performance computing hardware at ORNL's Center for Computational Sciences (CCS), LBNL's National Energy Research Scientific Computing Center, and ANL's Center for Computational Science and Technology facilities. A schematic of these processes is shown in Fig. 2. A process manager will conditionally determine the necessary analysis steps and direct the flow of tasks to massively parallel process resources at these several locations. These processes will include multiple statistical and artificial-intelligence-based pattern-recognition algorithms (for locating genes and other features in the sequence), computation for statistical characterization of sequence domains, gene modeling algorithms to describe the extent and structure of genes, and sequence comparison programs to search databases for other sequences that may provide insight into a gene's function. The process manager will also initiate multiple distributed information retrieval and data mining processes to access remote databases for information relevant to the genes (or corresponding proteins) discovered in a particular DNA sequence region. Five significant technical challenges must be addressed to implement such a system. A discussion of those challenges follows.

    THE ANALYSIS ENGINE

Fig. 1. The sequence analysis engine will input a genomic DNA sequence from many sites using Internet retrieval agents, maintain it in a data warehouse, and facilitate a long-term analysis process using high-performance computing facilities.


    Seamless high-performance computing

Megabases of DNA sequence being analyzed each day will strain the capacity of existing supercomputing centers. Interoperability between high-performance computing centers will be needed to provide the aggregate computing power, managed through the use of sophisticated resource management tools. The system must be fault-tolerant to machine and network failures so that no data or results are lost.

    Parallel algorithms for sequence analysis

The recognition of important features in a sequence, such as genes, must be highly automated to eliminate the need for time-consuming manual gene model building. Five distinct types of algorithms (pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining) must be combined into a coordinated toolkit to synthesize the complete analysis.
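The coordination itself is easy to miniaturize: below, a stand-in analysis task (GC content) is farmed out to a pool of parallel workers, a toy version of the task distribution the engine would perform across computing centers.

```python
# Minimal sketch of farming sequence chunks out to parallel workers, the
# kind of coordination the analysis engine needs at much larger scale.
# The analysis function is a stand-in (GC content), not a real tool.

from multiprocessing import Pool

def gc_content(chunk: str) -> float:
    """Stand-in 'analysis' task: fraction of G/C bases in a chunk."""
    return sum(base in "GC" for base in chunk) / len(chunk)

def split(seq: str, size: int):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    genome = "ATGCGC" * 100_000                # synthetic input
    with Pool(processes=4) as pool:            # 4 workers; tune per machine
        results = pool.map(gc_content, split(genome, 50_000))
    print(f"{len(results)} chunks, mean GC = {sum(results)/len(results):.3f}")
```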

Fig. 3. Approximately 800 bases of DNA sequence (equivalent to 1/3,800,000 of the human genome), containing the first of four gene coding segments in the human Ras gene. The coding portion of the gene is located between bases 1624 and 1774. The remaining DNA around this does not contain a genetic message and is often referred to as "junk" DNA.


GRAIL's gene recognition is built on a neural network [Fig. 4(a)]. GRAIL is able to determine regions of the sequence that contain genes [Fig. 4(b)], even genes it has never seen before, based on its training from known gene examples.

High-speed sequence comparison represents another important class of algorithms, used to compare one DNA or protein sequence with another in a way that extracts how and where the two sequences are similar. Many organisms share many of the same basic genes and proteins, and information about a gene or protein in one organism provides insight into the function of its "relatives" or "homologs" in other organisms.
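The classic dynamic-programming formulation of such comparison is Smith-Waterman local alignment. The sketch below scores two short sequences with illustrative match, mismatch, and gap values; real database-search tools add tuned substitution matrices and heuristics for speed.

```python
# Hedged sketch of local sequence comparison in the Smith-Waterman style,
# the dynamic-programming idea behind database search tools. The scoring
# values are illustrative; production tools use tuned matrices.

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # small DNA example
```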

To get a sense of the scale, examination of the relationship between 2 megabases of sequence (one day's finished sequence) and a single database of known gene sequence fragments (called ESTs) requires the calculation of 10^15 DNA base comparisons. And there are quite a number of databases to consider.
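As a back-of-envelope check, the comparison count is simply the product of the two lengths; the EST collection size used below is an assumed round number consistent with the article's 10^15 figure.

```python
# Back-of-envelope check of the comparison count quoted above. The EST
# database size is an assumed round figure, not a measured value.
daily_sequence = 2_000_000        # bases finished per day (from the text)
est_database = 500_000_000        # assumed total bases of EST fragments
print(f"{daily_sequence * est_database:.0e} base comparisons")  # ~1e+15
```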

Two examples of sequence comparison for members of the same protein family are shown in Fig. 5. One shows a very similar relative to the human protein sequence query, and the second a much weaker (and evolutionarily more distant) relationship. The sequence databases (which contain sequences used for such comparisons) are growing at an exponential rate, making it necessary to apply ever-increasing computational power to this problem.

Data mining and information retrieval

Methods are needed to locate and retrieve information relevant to newly discovered genes.


    BIBLIOGRAPHY

1) Hyman, E. D. A new method of sequencing DNA. Analytical Biochemistry 174, 423–436 (1988).

2) Metzker, M. L. Emerging technologies in DNA sequencing. Genome Research 15, 1767–1776 (2005).

3) Oak Ridge National Laboratory, http://web.ornl.gov/info/ornlreview/v30n3-4/genome.htm

4) Wikipedia, https://en.wikipedia.org

5) Google, https://google.co.in
