Data-intensive Computing: Case Study Area 1: Bioinformatics

12
Data-intensive Computing: Case Study Area 1: Bioinformatics B. Ramamurthy 07/03/22 1

description

Data-intensive Computing: Case Study Area 1: Bioinformatics. B. Ramamurthy. Human Genetics. Genomics Human Genome project Proteomics Diseasome Tree of life project Phylogenetics. Human cell. Base pair of DNA: CG, AT C – cytosine, G – guanine, A – adenine , T - thymine - PowerPoint PPT Presentation

Transcript of Data-intensive Computing: Case Study Area 1: Bioinformatics

Page 1: Data-intensive Computing: Case Study Area 1: Bioinformatics

Data-intensive Computing: Case Study Area 1: Bioinformatics

B. Ramamurthy

04/20/23 1

Page 2: Data-intensive Computing: Case Study Area 1: Bioinformatics

Human Genetics

• Genomics• Human Genome project• Proteomics• Diseasome• Tree of life project• Phylogenetics

04/20/23 2

Page 3: Data-intensive Computing: Case Study Area 1: Bioinformatics

Human cell• Base pair of DNA: CG, AT

– C – cytosine, G – guanine, A – adenine , T - thymine• Each human cell contains approximately 3 billion base pairs. • The DNA of a single cell contains so much information that if it were represented

in printed words, simply listing the first letter of each base would require over 1.5 million pages of text!

• If laid end-to-end, the DNA strand measures about 2 – 3 meters.• DNA is a single large molecule at the nucleus of cell• It is coiled a double helix• Each strand of the DNA molecule is made of A, C, G and T: example:

AAAGTTCTTAATTA that will be matched on the other strand by the matching base: TTTCAAGAATTAAT

• These string of alphabets contain • Ref: www.ehd.org• Ref text: Bioinformatics: Databases, tools and algorithms, by. O. Bosu and S.K.

Thukral

04/20/23 3

Page 4: Data-intensive Computing: Case Study Area 1: Bioinformatics

More details• Sequence of base pairs are grouped to make sense: genes• When a gene inside needs to be activated, the DNA

molecule at the cell nucleus uncoils and unfurls to the right extent to expose that gene

• From the exposed ends of the DNA a RNA is formed.• mRNA or messenger RNA is formed that carries with it the

“print” of the open DNA section (Map process?)• RNA and DNA differ in one respect: RNA does not contain T

or thymine but it has uracil (U). RNA is short-lived (like intermediate data in MapReduce)

• Once mRNA is formed open sections of the DNA close off.

04/20/23 4

Page 5: Data-intensive Computing: Case Study Area 1: Bioinformatics

Protein formation• mRNA travels to the cytoplasm where it meets the ribosome (rRNA)• Ribosome reads the code in the mRNA (codon) and form the amino

acids.• Twenty amino acids are prevalent in human cells. Ex: codon GCU

GCC GCA correspond to alanine• In effect ribosome is a process control computer that takes in as

input codons and produces amino acids as output.• Amino acids polymerize and form polypeptide chains called

proteins• Proteins fold and form the basic structures such as skin and hair.• Even though brain controls major human functions at the cell level

it the DNA that has the command and control.• DNA is fixed code for a given human. (WORM characteristics)

04/20/23 5

Page 6: Data-intensive Computing: Case Study Area 1: Bioinformatics

Life’s processes

• DNA is “program” that controls functions, operations and structure of a cell and in turn that of our life processes.

• Life processes are in fact dependent of the program in a DNA and the hundreds of millions of ribosomes.

• Life in this context appears as an immense distributed system.

04/20/23 6

Page 7: Data-intensive Computing: Case Study Area 1: Bioinformatics

Bioinformatics• Can we study, understand and analyze the complexity of the

immensely complex system? It structure and programs?• University of Arizona’s tree of life project (ToL): http://tolweb.org• Human Genome project (NIH and DOE): collecting approximately

30,000 genes in human DNA and determining the sequences three billion bases that make up the human DNA.

• Out of the 30000 genes we do not know the functions of more than 50% of them.

• 99.9% of the nucleotide sequence is same for all of us• 0.1% is attributed to individual differences such as race, color of

skin, disposition to diseases• High throughput sequencing is generating ultra scale biological

data: how to analyze this data?• That is a data-intensive problem.

04/20/23 7

Page 8: Data-intensive Computing: Case Study Area 1: Bioinformatics

Existing solutions?

• Traditional databases: store, retrieve, analyze and/or predict huge biological data

• Software tools for implementing algorithms, and developing applications for in-silico experiments

• Visualization tools, user interfaces, web accessibility for search through data

• Machine learning and data mining methodologies.

04/20/23 8

Page 9: Data-intensive Computing: Case Study Area 1: Bioinformatics

Databases• Taxonomy DB• Genomics• Sequence db• Structure db• Proteomic database (PDB)• Micro-array db• Expression db• Enzyme db• Disease db• Molecular biology db

04/20/23 9

Page 10: Data-intensive Computing: Case Study Area 1: Bioinformatics

Tools

• Data analysis tools– MySQL– Perl

• Prediction tools– Clustering

• Modeling tools– Surface prediction, predicting area of interest,

protein-protein interaction• Alignment tools

04/20/23 10

Page 11: Data-intensive Computing: Case Study Area 1: Bioinformatics

How can we help?

• How can we leverage our knowledge of large scale data management to address bioinformatics problems? DC methods.

• Large number of tools and data: how we standardize the efforts so that they are complementary or repetitive? Cloud computing.

04/20/23 11

Page 12: Data-intensive Computing: Case Study Area 1: Bioinformatics

04/20/23 12

Text Mining vs Genetic Sequence Mining (Dot plot)

C O R R E L A T I O N SR E L A T I O N S H I P

A C T C T A G G A G T CG A T A A T T C G A T C