Molecular Data


Page 1: Molecular Data

Molecular Data

DNA/RNA

Protein

Expression

Interaction

Page 2: Molecular Data

A sequence

A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

Page 3: Molecular Data

Character representation of sequences

DNA and RNA use 1-letter codes (e.g., A, C, G, T)

Proteins use 1-letter codes

Protein codes can be converted to/from 3-letter codes

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
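The conversion between 1-letter and 3-letter amino acid codes can be sketched as a pair of lookup tables. This is a minimal illustration for the 20 standard amino acids; the function names and the "-" separator are assumptions made for the example, not part of the slides.

```python
# Three-letter abbreviations keyed by one-letter code (20 standard amino acids).
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}
# Reverse lookup built from the table above.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein, sep="-"):
    """Convert a 1-letter protein string to separator-joined 3-letter codes."""
    return sep.join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(three_letter, sep="-"):
    """Convert separator-joined 3-letter codes back to a 1-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter.split(sep))


print(to_three_letter("MSEP"))
print(to_one_letter("Met-Ser-Glu-Pro"))
```

Unknown letters raise a KeyError here, which seems preferable to silently dropping residues.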

Page 4: Molecular Data

The I.U.B. Code proposed by International Union of Biochemistry

A, C, G, T, U

R = A, G (puRine)

Y = C, T (pYrimidine)

S = G, C (Strong hydrogen bonds)

W = A, T (Weak hydrogen bonds)

M = A, C (aMino group)

K = G, T (Keto group)

B = C, G, T (not A)

D = A, G, T (not C)

H = A, C, T (not G)

V = A, C, G (not T/U)

N = A, C, G, T/U (iNdeterminate)

X or - are sometimes used

Page 5: Molecular Data

DNA code

Amino Acid   Abbreviation   DNA Codons

Alanine Ala GCA, GCC, GCG, GCT

Cysteine Cys TGC, TGT

Aspartic Acid Asp GAC, GAT

Glutamic Acid Glu GAA, GAG

Phenylalanine Phe TTC, TTT

Glycine Gly GGA, GGC, GGG, GGT

Histidine His CAC, CAT

Isoleucine Ile ATA, ATC, ATT

Lysine Lys AAA, AAG

Leucine Leu TTA, TTG, CTA, CTC, CTG, CTT

Methionine Met ATG

Asparagine Asn AAC, AAT

Proline Pro CCA, CCC, CCG, CCT

Glutamine Gln CAA, CAG

Arginine Arg CGA, CGC, CGG, CGT, AGA, AGG

Serine Ser TCA, TCC, TCG, TCT, AGC, AGT

Threonine Thr ACA, ACC, ACG, ACT

Valine Val GTA, GTC, GTG, GTT

Tryptophan Trp TGG

Tyrosine Tyr TAC, TAT

Stop . TAA, TAG, TGA
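As a rough sketch, the codon table can be turned into a lookup dictionary and used to translate a DNA coding sequence. The table follows the standard genetic code (Arginine includes AGA and AGG), "." marks a stop codon as on the slide, and the function name `translate` is an assumption made for this example.

```python
# One-letter amino acid code -> its DNA codons (standard genetic code).
CODON_GROUPS = {
    "A": "GCA GCC GCG GCT", "C": "TGC TGT", "D": "GAC GAT", "E": "GAA GAG",
    "F": "TTC TTT", "G": "GGA GGC GGG GGT", "H": "CAC CAT", "I": "ATA ATC ATT",
    "K": "AAA AAG", "L": "TTA TTG CTA CTC CTG CTT", "M": "ATG", "N": "AAC AAT",
    "P": "CCA CCC CCG CCT", "Q": "CAA CAG", "R": "CGA CGC CGG CGT AGA AGG",
    "S": "TCA TCC TCG TCT AGC AGT", "T": "ACA ACC ACG ACT",
    "V": "GTA GTC GTG GTT", "W": "TGG", "Y": "TAC TAT",
    ".": "TAA TAG TGA",  # "." stands for Stop
}
# Invert to codon -> amino acid; all 64 codons should be present.
CODON_TABLE = {
    codon: aa for aa, codons in CODON_GROUPS.items() for codon in codons.split()
}


def translate(dna):
    """Translate DNA in frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == ".":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGTCAGAACCGTAA"))
```

Trailing bases that do not fill a complete codon are ignored in this sketch.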

Page 6: Molecular Data

FASTA format

>gi|17978494|ref|NM_078467.1| Homo sapiens cyclin-dependent kinase inhibitor
AGCTGAGGTGTGAGCAGCTGCCGAAGTCAGTTCCTTGTGGAGCCGGAGCTGGGCGCGGATTCGCCGAGGC
ACCGAGGCACTCAGAGGAGGTGAGAGAGCGGCGGCAGACAACAGGGGACCCCGGGCCGGCGGCCCAGAGC
CGAGCCAAGCGTGCCCGCGTGTGTCCCTGCGTGTCCGCGAGGATGCGTGTTCGCGGGTGTGTGCTGCGTT
CACAGGTGTTTCTGCGGCAGGCGCCATGTCAGAACCGGCTGGGGATGTCCGTCAGAACCCATGCGGCAGC
AAGGCCTGCCGCCGCCTCTTCGGCCCAGTGGACAGCGAGCAGCTGAGCCGCGACTGTGATGCGCTAATGG
CGGGCTGCATCCAGGAGGCCCGTGAGCGATGGAACTTCGACTTTGTCACCGAGACACCACTGGAGGGTGA
CTTCGCCTGGGAGCGTGTGCGGGGCCTTGGCCTGCCCAAGCTCTACCTTCCCACGGGGCCCCGGCGAGGC
CGGGATGAGTTGGGAGGAGGCAGGCGGCCTGGCACCTCACCTGCTCTGCTGCAGGGGACAGCAGAGGAAG
ACCATGTGGACCTGTCACTGTCTTGTACCCTTGTGCCTCGCTCAGGGGAGCAGGCTGAAGGGTCCCCAGG
TGGACCTGGAGACTCTCAGGGTCGAAAACGGCGGCAGACCAGCATGACAGATTTCTACCACTCCAAACGC
CGGCTGATCTTCTCCAAGAGGAAGCCCTAATCCGCCCACAGGAAGCCTGCAGTCCTGGAAGCGCGAGGGC
CTCAAAGGCCCGCTCTACATCTTCTGCCTTAGTCTCAGTTTGTGTGTCTTAATTATTATTTGTGTTTTAA
TTTAAACACCTCCTCATGTACATACCCTGGCCGCCCCCTGCCCCCCAGCCTCTGGCATTAGAATTATTTA
AACAAAAACTAGGCGGTTGAATGAGAGGTTCCTAAGAGTGCTGGGCATTTTTATTTTATGAAATACTATT
TAAAGCCTCCTCATCCCGTGTTCTCCTTTTCCTCTCTCCCGGAGGTTGGGTGGGCCGGCTTCATGCCAGC
TACTTCCTCCTCCCCACTTGTCCGCTGGGTGGTACCCTCTGGAGGGGTGTGGCTCCTTCCCATCGCTGTC
ACAGGCGGTTATGAAATTCACCCCCTTTCCTGGACACTCAGACCTGAATTCTTTTTCATTTGAGAAGTAA
ACAGATGGCACTTTGAAGGGGCCTCACCGAGTGGGGGCATCATCAAAAACTTTGGAGTCCCCTCACCTCC
TCTAAGGTTGGGCAGGGTGACCCTGAAGTGAGCACAGCCTAGGGCTGAGCTGGGGACCTGGTACCCTCCT
GGCTCTTGATACCCCCCTCTGTCTTGTGAAGGCAGGGGGAAGGTGGGGTCCTGGAGCAGACCACCCCGCC
TGCCCTCATGGCCCCTCTGACCTGCACTGGGGAGCCCGTCTCAGTGTTGAGCCTTTTCCCTCTTTGGCTC
CCCTGTACCTTTTGAGGAGCCCCAGCTACCCTTCTTCTCCAGCTGGGCTCTGCAATTCCCCTCTGCTGCT
GTCCCTCCCCCTTGTCCTTTCCCTTCAGTACCCTCTCAGCTCCAGGTGGCTCTGAGGTGCCTGTCCCACC
CCCACCCCCAGCTCAATGGACTGGAAGGGGAAGGGACACACAAGAAGAAGGGCACCCTAGTTCTACCTCA
GGCAGCTCAAGCAGCGACCGCCCCCTCCTCTAGCTGTGGGGGTGAGGGTCCCATGTGGTGGCACAGGCCC
CCTTGAGTGGGGTTATCTCTGTGTTAGGGGTATATGATGGGGGAGTAGATCTTTCTAGGAGGGAGACACT
GGCCCCTCAAATCGTCCAGCGACCTTCCTCATCCACCCCATCCCTCCCCAGTTCATTGCACTTTGATTAG
CAGCGGAACAAGGAGTCAGACATTTTAAGATGGTGGCAGTAGAGGCTATGGACAGGGCATGCCACGTGGG
CTCATATGGGGCTGGGAGTAGTTGTCTTTCCTGGCACTAACGTTGAGCCCCTGGAGGCACTGAAGTGCTT
AGTGTACTTGGAGTATTGGGGTCTGACCCCAAACACCTTCCAGCTCCTGTAACATACTGGCCTGGACTGT
TTTCTCTCGGCTCCCCATGTGTCCTGGTTCCCGTTTCTCCACCTAGACTGTAAACCTCTCGAGGGCAGGG
ACCACACCCTGTACTGTTCTGTGTCTTTCACAGCTCCTCCCACAATGCTGAATATACAGCAGGTGCTCAA
TAAATGATTCTTAGTGACTTTAAAAAAAAAAAAAAAAAAAA
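A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text into (header, sequence) pairs; the function name `parse_fasta` is hypothetical, and the example record is truncated to the first 20 bases of the sequence above.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Truncated example record (only the first 20 bases).
example = """>gi|17978494|ref|NM_078467.1| Homo sapiens cyclin-dependent kinase inhibitor
AGCTGAGGTG
TGAGCAGCTG
"""
records = list(parse_fasta(example))
print(records[0][0])
print(records[0][1])
```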

Page 7: Molecular Data

Sequence Content

Mononucleotide frequencies: GC content

Dinucleotide frequencies: CpG islands

Page 8: Molecular Data

Lander et al

GC content is non-random

Page 9: Molecular Data

GC content and expression

Page 10: Molecular Data

Determining mononucleotide frequencies

Alphabet: A T C G

Count how many times each nucleotide appears in sequence

Divide (normalize) by total number of nucleotides

fA: mononucleotide frequency of A (frequency that A is observed)

pA: mononucleotide probability that a nucleotide will be an A

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
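The counting and normalizing steps above can be sketched in a few lines. The function names are assumptions for illustration; characters outside the A/T/C/G alphabet are ignored here, which is one of several reasonable choices.

```python
from collections import Counter


def mononucleotide_frequencies(seq, alphabet="ATCG"):
    """Return {nucleotide: frequency}, normalized by the number of counted bases."""
    counts = Counter(base for base in seq.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no nucleotides from the alphabet")
    return {base: counts[base] / total for base in alphabet}


def gc_content(seq):
    """GC content is the sum of the G and C mononucleotide frequencies."""
    freqs = mononucleotide_frequencies(seq)
    return freqs["G"] + freqs["C"]


freqs = mononucleotide_frequencies("ATTCGACCAGAG")
print(freqs)
print(gc_content("ATTCGACCAGAG"))
```

For the 12-base sequence ATTCGACCAGAG the counts are A = 4, T = 2, C = 3, G = 3, so fA = 4/12 and the GC content is 6/12.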

Page 11: Molecular Data

Determining dinucleotide frequencies

Make 4 x 4 matrix, one element for each ordered pair of nucleotides

Set all elements to zero

Go through sequence linearly, adding one to matrix entry corresponding to the pair of sequence elements observed at that position

Divide by total number of dinucleotides

fAC: dinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides)

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

Page 12: Molecular Data

Dinucleotide counts

A T C G

A 0 0 0 0

T 0 0 0 0

C 0 0 0 0

G 0 0 0 0

ATTCGACCAGAG

Create a 4 x 4 matrix

Set all cells to zero

Use a window of size 2 and add 1 to each cell of the matrix when encountering the specified dinucleotide

Page 13: Molecular Data

Dinucleotide counts

A T C G

A 0 1 1 2

T 0 1 1 0

C 1 0 1 1

G 2 0 0 0

ATTCGACCAGAG
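The sliding-window count can be sketched as follows, using a nested dictionary in place of the 4 x 4 matrix (rows are the first nucleotide, columns the second). The function names are assumptions made for this example.

```python
def dinucleotide_counts(seq, alphabet="ATCG"):
    """Count each ordered pair of adjacent nucleotides with a window of size 2."""
    seq = seq.upper()
    # Start with an all-zero 4 x 4 table.
    counts = {first: {second: 0 for second in alphabet} for first in alphabet}
    for first, second in zip(seq, seq[1:]):
        if first in counts and second in counts:
            counts[first][second] += 1
    return counts


def dinucleotide_frequencies(seq, alphabet="ATCG"):
    """Divide each count by the total number of dinucleotides counted."""
    counts = dinucleotide_counts(seq, alphabet)
    total = sum(sum(row.values()) for row in counts.values())
    if total == 0:
        raise ValueError("sequence contains no dinucleotides from the alphabet")
    return {a: {b: counts[a][b] / total for b in alphabet} for a in alphabet}


counts = dinucleotide_counts("ATTCGACCAGAG")
for first in "ATCG":
    print(first, [counts[first][second] for second in "ATCG"])
```

The 12-base example contains 11 overlapping dinucleotides, so each frequency is the corresponding count divided by 11.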

Page 14: Molecular Data

Observed and expected frequencies

http://www.maths.lth.se/bioinformatics/publications/BasicE_2005.pdf

Page 15: Molecular Data

Observed and expected frequencies

http://www.maths.lth.se/bioinformatics/publications/BasicE_2005.pdf
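One common way to compare observed and expected dinucleotide frequencies, assumed here because the slide itself gives only the figure reference, is to take the expected frequency of a pair XY as the product fX * fY (independent positions) and report the observed/expected ratio. The function name below is hypothetical.

```python
def observed_expected_ratio(seq, pair):
    """Ratio of the observed dinucleotide frequency to fX * fY.

    Assumes independence of adjacent positions for the expected value.
    """
    seq = seq.upper()
    first, second = pair.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least one dinucleotide")
    # Observed: occurrences of the pair among all n - 1 overlapping windows.
    hits = sum(1 for a, b in zip(seq, seq[1:]) if a == first and b == second)
    observed = hits / (n - 1)
    # Expected: product of the two mononucleotide frequencies.
    expected = (seq.count(first) / n) * (seq.count(second) / n)
    if expected == 0:
        raise ValueError("expected frequency is zero for this pair")
    return observed / expected


print(observed_expected_ratio("ATTCGACCAGAG", "AG"))
print(observed_expected_ratio("ATTCGACCAGAG", "CG"))
```

A ratio well below 1 for CG in a long genomic sequence would indicate CpG depletion; the 12-base toy sequence is far too short to support any such conclusion.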

Page 16: Molecular Data

Dinucleotide frequencies in genome

http://www.lapcs.univ-lyon1.fr/~piau/mps/Poster-CpG.pdf

Page 17: Molecular Data

Sequence features

A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

Page 18: Molecular Data

Sequence features

promoters

transcription initiation sites

transcription termination sites

polyadenylation sites

ribosome binding sites

protein features

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

Page 19: Molecular Data

Consensus sequences

A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature

Consensus sequences are regular expressions

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
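To illustrate the link between consensus sequences and regular expressions, the sketch below expands a consensus written in I.U.B. codes into a character-class pattern for Python's `re` module. The dictionary covers the DNA codes only, and the function name is an assumption for this example.

```python
import re

# I.U.B. code -> the DNA bases it stands for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Turn an I.U.B. consensus such as GTMKAC into a regex such as GT[AC][GT]AC."""
    parts = []
    for code in consensus.upper():
        bases = IUB_CODES[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


pattern = consensus_to_regex("GTMKAC")
print(pattern)
print(bool(re.fullmatch(pattern, "GTAGAC")))
print(bool(re.fullmatch(pattern, "GTGGAC")))
```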

Page 20: Molecular Data

Occurrences

Example: recognition site for a restriction enzyme

EcoRI recognizes GAATTC

AccI recognizes GTMKAC

Basic Algorithm

Start with first character of sequence to be searched

See if enzyme site matches starting at that position

Advance to next character of sequence to be searched

Repeat previous two steps until all positions have been tested

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
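The basic algorithm can be sketched directly as a position-by-position scan. Ambiguous site letters such as M and K are matched through an I.U.B. lookup table; the function name `find_sites` and the test sequence are made up for illustration.

```python
# I.U.B. code -> the DNA bases it stands for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(sequence, site):
    """Return 0-based start positions where the site matches the sequence."""
    sequence, site = sequence.upper(), site.upper()
    hits = []
    # Test every start position that leaves room for the whole site.
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        if all(base in IUB_CODES[code] for base, code in zip(window, site)):
            hits.append(start)
    return hits


toy = "AAGAATTCGGGTAGACTT"  # invented sequence containing both sites
print(find_sites(toy, "GAATTC"))
print(find_sites(toy, "GTMKAC"))
```

Because every start position is tested, overlapping occurrences are reported as well.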

Page 21: Molecular Data

Statistics of pattern appearance

Goal: Determine the significance of observing a feature (pattern)

Method: Estimate the probability that a pattern would occur randomly in a given sequence. Three different methods:

Assume all nucleotides are equally frequent

Use measured frequencies of each nucleotide (mononucleotide frequencies)

Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/

Page 22: Molecular Data

Example 1

What is the probability of observing the sequence feature ART (A followed by a purine, either A or G, followed by a T)?

Using observed mononucleotide frequencies:

pART = pA (pA + pG) pT

Using equal mononucleotide frequencies:

pA = pC = pG = pT = 1/4

pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/
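The mononucleotide calculation generalizes to any pattern written in I.U.B. codes: multiply, position by position, the summed probabilities of the bases each code allows. The sketch below uses exact fractions; the function name and the "observed" frequencies are hypothetical values chosen for illustration.

```python
from fractions import Fraction

# I.U.B. code -> the DNA bases it stands for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pattern_probability(pattern, base_probs):
    """Probability of the pattern at one position, assuming independent bases."""
    prob = 1
    for code in pattern.upper():
        prob *= sum(base_probs[base] for base in IUB_CODES[code])
    return prob


# Equal frequencies: pA = pC = pG = pT = 1/4.
uniform = {base: Fraction(1, 4) for base in "ACGT"}
print(pattern_probability("ART", uniform))

# Hypothetical observed frequencies (not taken from the slides).
observed = {"A": Fraction(3, 10), "C": Fraction(1, 5),
            "G": Fraction(1, 5), "T": Fraction(3, 10)}
print(pattern_probability("ART", observed))
```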

Page 23: Molecular Data

Example 1: using mononucleotide frequencies

Using equal mononucleotide frequencies:

pA = pC = pG = pT = 1/4

pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32

Using observed mononucleotide frequencies:

pART = pA (pA + pG) pT

Page 24: Molecular Data

Example 1: using dinucleotide frequencies

pART = pA (p*AA p*AT + p*AG p*GT)
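The dinucleotide version can be sketched as a first-order Markov calculation. This assumes that p*XY denotes the conditional probability that Y immediately follows X; the slide does not define the starred notation, so treat that reading, along with the function names, as assumptions.

```python
from fractions import Fraction
from itertools import product

# I.U.B. code -> the DNA bases it stands for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(pattern):
    """List every concrete sequence matched by an I.U.B. pattern."""
    choices = (IUB_CODES[code] for code in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


def markov_probability(pattern, base_probs, cond_probs):
    """Sum over matching sequences of p(first) times the chained p*(next | prev)."""
    total = 0
    for seq in expand(pattern):
        prob = base_probs[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= cond_probs[prev][cur]
        total += prob
    return total


quarter = Fraction(1, 4)
base_probs = {b: quarter for b in "ACGT"}
# With uniform conditional probabilities the result reduces to the 1/32 case.
cond_probs = {x: {y: quarter for y in "ACGT"} for x in "ACGT"}
print(expand("ART"))
print(markov_probability("ART", base_probs, cond_probs))
```

For ART the expansion gives AAT and AGT, so the sum is pA p*AA p*AT + pA p*AG p*GT, which matches the formula above.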

Page 25: Molecular Data

Example 2:

What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?

Using equal mononucleotide frequencies:

pA = pC = pG = pT = 1/4

pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64

http://www.cmu.edu/bio/education/courses/03310/LectureNotes/