Stat 877(992) Statistical methods in molecular biology.


Course plans

• Team taught: Newton, Larget, Ane, Keles, Kendziorski, Broman, Yandell

• Per instructor homework set (six at 12pts each)

• Final project, poster presentation (28 pts)

National Research Council Report, 2004: Mathematics and 21st Century Biology

“Progress in the biosciences will increasingly depend on deep and broad integration of mathematical analysis into studies at all levels of biological organization…: molecules, cells, organisms, populations, and ecosystems.”

“The committee regards the interface between mathematics and biology as biology-driven.”

Some definitions [first approximations!]

cell: structural/functional unit of all living organisms

protein: organic compound produced and used by cell

amino acid: protein building block

nucleic acid: chainlike molecule involved in preservation, replication, and expression of hereditary information in every living cell

nucleotide: nucleic acid building block

Example function: oxygen transport

2-3 x 10^13 red blood cells/body

2 x 10^6 new cells/second

95% of dry weight is protein hemoglobin
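The two red blood cell figures above imply a rough cell lifespan. The sketch below assumes a steady state (production balances loss), which the slide does not state; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 3600


def implied_lifespan_days(total_cells, new_cells_per_second):
    """Average cell lifespan in days, assuming production exactly balances loss."""
    return total_cells / new_cells_per_second / SECONDS_PER_DAY


# Figures from the slide: 2-3 x 10^13 cells per body, 2 x 10^6 new cells per second.
for total_cells in (2e13, 3e13):
    days = implied_lifespan_days(total_cells, 2e6)
    print(f"{total_cells:.0e} cells -> about {days:.0f} days")
```

Under this assumption the figures correspond to a lifespan on the order of a few months per cell.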

hemoglobin

more about hemoglobin

sequence of amino acids in hemoglobin

• alpha chain (141 amino acids) [2 subunits]

• VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR

• beta chain (146 amino acids) [2 subunits]

• VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH
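As a quick check, the chain lengths quoted above can be recomputed from the one-letter sequences themselves. This is a minimal sketch; the variable names are illustrative, and the sequences are copied from the slide as given.

```python
from collections import Counter

ALPHA = "VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR"
BETA = "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH"

# One character per amino acid, so string length equals chain length.
print("alpha chain:", len(ALPHA), "amino acids")
print("beta chain:", len(BETA), "amino acids")

# Hemoglobin has two alpha and two beta subunits.
tetramer_residues = 2 * len(ALPHA) + 2 * len(BETA)
print("residues in the full tetramer:", tetramer_residues)

# Amino acid composition of each chain (most frequent first).
print(Counter(ALPHA).most_common(3))
print(Counter(BETA).most_common(3))
```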

A few amino acids (among 20 standard)

V = Val = Valine

L = Leu = Leucine

M = Met = Methionine

more about amino acids

Amino acids are concatenated into protein by the translation of information stored in messenger RNA (mRNA)

Ribonucleic acid (RNA)

Nucleotide bases

A = adenine

C = cytosine

U = uracil

G = guanine

single stranded

Met, Thr, Glu, Leu, Arg, Ser, stop

Amino acids are encoded by triples of mRNA nucleotides called codons
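Reading codons can be sketched in a few lines. The example below builds the standard genetic code from a compact 64-letter string and translates an mRNA until the first stop codon. The mRNA sequence is a hypothetical example chosen to spell out Met, Thr, Glu, Leu, Arg, Ser, stop; it is not taken from the slide.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order (first, second, third base); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA (5' to 3', starting in frame) into one-letter amino acids."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical mRNA: AUG ACG GAG CUU CGG AGC UAG
print(translate("AUGACGGAGCUUCGGAGCUAG"))
```

The sketch ignores the search for a start codon and assumes the reading frame begins at the first base.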

more about the genetic code

Translation: mRNA to protein via ribosome & tRNA

video podcast of translation

Base pairing A-U, G-C

mRNA structure

orientation 5’ to 3’

UTR = untranslated region: mRNA stability, mRNA localization, translational efficiency

Mature mRNA may have been processed by splicing a primary transcript (pre-mRNA)

Primary transcripts are produced by the transcription of DNA

Deoxyribonucleic acid (DNA)

double stranded

4 nucleotide bases ATGC

base pairing: A-T, C-G
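Because of base pairing, one DNA strand determines the other. A minimal sketch, with an illustrative function name, of how the partner strand can be computed:

```python
# A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(strand):
    """Return the partner strand, also written 5' to 3'."""
    # Complement each base, then reverse, because the two strands run antiparallel.
    return strand.translate(DNA_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```

Applying the function twice returns the original strand.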

Transcription: DNA to RNA via RNA polymerase

initiate, elongate, terminate
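Transcription can be sketched as base pairing against the template strand, with U taking the place of T in the RNA. The template sequence below is a hypothetical example, and the sketch leaves out initiation and termination signals entirely.

```python
# Template base -> RNA base: A-U, T-A, G-C, C-G.
TEMPLATE_TO_RNA = str.maketrans("ATGC", "UACG")


def transcribe(template_5_to_3):
    """Return the RNA (5' to 3') made from a DNA template strand given 5' to 3'."""
    # The polymerase reads the template 3' to 5', so reverse it before pairing.
    return template_5_to_3[::-1].translate(TEMPLATE_TO_RNA)


# Hypothetical template strand.
print(transcribe("CTAGCTCCGAAGCTCCGTCAT"))
```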

Central dogma of molecular biology

Replication: DNA copies itself during cell division

More on organization of DNA

Chromosomes are organized structures of DNA and proteins that are found in cells. Each chromosome contains a single continuous piece of DNA.

In diploid species, chromosomes are paired.

Human chromosome: total number of base pairs

1     247,200,000
2     242,750,000
3     199,450,000
4     191,260,000
5     180,840,000
6     170,900,000
7     158,820,000
8     146,270,000
9     140,440,000
10    135,370,000
11    134,450,000
12    132,290,000
13    114,130,000
14    106,360,000
15    100,340,000
16     88,820,000
17     78,650,000
18     76,120,000
19     63,810,000
20     62,440,000
21     46,940,000
22     49,530,000
X (sex chromosome)    154,910,000
Y (sex chromosome)     57,740,000

A genome equals the sequence of one full copy

3 Gbp, or 100 yrs at 1 bp/second

Estimates from Sanger’s Vertebrate Genome Annotation (VEGA) database, 7/07
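The "3 Gbp" and "100 yrs" figures can be checked against the chromosome sizes listed above. A minimal sketch, assuming one copy each of chromosomes 1 to 22, X and Y and a 365.25-day year:

```python
CHROMOSOME_BP = {
    "1": 247_200_000, "2": 242_750_000, "3": 199_450_000, "4": 191_260_000,
    "5": 180_840_000, "6": 170_900_000, "7": 158_820_000, "8": 146_270_000,
    "9": 140_440_000, "10": 135_370_000, "11": 134_450_000, "12": 132_290_000,
    "13": 114_130_000, "14": 106_360_000, "15": 100_340_000, "16": 88_820_000,
    "17": 78_650_000, "18": 76_120_000, "19": 63_810_000, "20": 62_440_000,
    "21": 46_940_000, "22": 49_530_000, "X": 154_910_000, "Y": 57_740_000,
}

total_bp = sum(CHROMOSOME_BP.values())

# Time to read the whole sequence aloud at one base pair per second.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years = total_bp / SECONDS_PER_YEAR

print(f"total: {total_bp:,} bp ({total_bp / 1e9:.2f} Gbp)")
print(f"reading time at 1 bp/second: {years:.1f} years")
```

The total comes to about 3.08 Gbp, a little under 98 years of reading, consistent with the rounded figures on the slide.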

1% of bases are in exons

24% of bases are in introns

2001: drafts of the human genome sequence published

2007: pilot phase of ENCODE project completed

Encyclopedia Of DNA Elements

• majority of bases are transcribed

• extensive transcript overlap

• functions poorly understood

Evolving definition of gene

1860s-1900s: a discrete unit of heredity (Mendel)

1910s: a distinct locus (Morgan)

1940s: the blueprint for a protein (Beadle & Tatum)

1960s: a transcribed code (Watson & Crick)

Genome era: a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions

Mark B. Gerstein et al. Genome Res. 2007; 17: 669-681

Figure 5

The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products

Gerstein et al 2007

Post ENCODE

What about Statistics?

Statistics supports the development of genomic resources

• In accommodating sequencing errors for genome assembly

• In rating the significance of sequence matches by alignment algorithms
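One simple way to see what "significance of a sequence match" means is a permutation test: compare the observed score with scores obtained after shuffling one sequence. The sketch below is an illustration only. It uses an ungapped count of identical positions as the score, which is far cruder than a real alignment score, and production alignment tools typically rely on analytical approximations rather than shuffling. All names and the example sequences are hypothetical.

```python
import random


def match_score(a, b):
    """Number of positions at which two equal-length sequences agree (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Empirical p-value: how often a shuffled b matches a at least as well as b does."""
    rng = random.Random(seed)
    observed = match_score(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps composition, destroys order
        if match_score(a, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


seq_a = "VLSAADKTNVKAAWSKVGGH"
seq_b = "VLSAADKTNVKAAWGKVGAH"  # hypothetical near-copy of seq_a
print(permutation_p_value(seq_a, seq_b))
```

Shuffling preserves the amino acid composition of the second sequence, so the test asks whether the order of residues, not just their frequencies, explains the match.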

Statistics supports analyses to determine the function of genes/transcripts/proteins

• Gene regulation

• Gene expression

• Network considerations (many processes/functions)

Example: oxygen transport

According to the Gene Ontology (GO) project, 46 different genes are involved in this biological process

Statistics is critical in analyzing patterns of genomic variation within populations, and in relating this variation to disease states or other phenotypes

• Genomes differ from the reference copy (single nucleotide polymorphisms, structural variants)

• Gene mapping by linkage and association methods
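A basic association analysis asks whether an allele is more common in cases than in controls. The sketch below applies a Pearson chi-square test to a 2x2 table of allele counts. The counts are invented for illustration, the function name is hypothetical, and real association studies involve many further considerations (multiple testing, population structure) that are not shown.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical allele counts: rows are cases and controls, columns are allele A and allele a.
statistic, p_value = chi_square_2x2(30, 70, 20, 80)
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```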

Statistics is critical in analyzing patterns of genomic variation between populations/species

• Phylogenetic analysis
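Many phylogenetic methods start from a measure of divergence between aligned sequences. As one simple illustration, not necessarily the approach taken in the course, the sketch below computes the proportion of differing sites and the Jukes-Cantor corrected distance, d = -(3/4) ln(1 - 4p/3). The example sequences are hypothetical and assumed to be already aligned without gaps.

```python
import math


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned DNA sequences differing at one site out of ten.
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is slightly larger than the raw proportion of differences because it allows for multiple substitutions at the same site.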

“Nothing in biology makes sense except in the light of evolution”

- T. Dobzhansky

Tree of life project

“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with reproduction; Inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less improved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows.”

- Charles Darwin