Protein Sequence Amino Acid Composition IEC RP HPLC Ancient Sequencing methods Modern Sequencing...

Protein Sequence

Amino Acid Composition
  IEC
  RP HPLC
Ancient Sequencing methods
Modern Sequencing methods
Sequencing the Gene
Then what?

Amino Acid Composition

1952
  Complete acid hydrolysis
  Ion exchange chromatography with programmed buffer changes (~3 hr)
  Post-column derivatization with
    Ninhydrin
    Fluorescamine

1980
  Complete acid hydrolysis
  Precolumn derivatization to phenylthiohydantoins
  Reversed-phase HPLC (~30 min)

Sequencing

Sanger Endgroup Analysis
  Modify the protein's amines with fluorodinitrobenzene (FDNB, Sanger’s reagent)
    Alternative reagent: dansyl chloride (fluorescent)
  Hydrolyze the protein
  Separate by TLC
  Identify the N-terminal amino acid by Rf
Treat protein with aminopeptidase
  Repeat until the end gets ragged
  Use proteolytic fragments for simplicity

Sequencing

Generate proteolytic fragments
  Use more than one protease in separate experiments
    Trypsin cleaves after Arg and Lys residues
    Chymotrypsin cleaves after Phe, Tyr, Trp
Separate fragments (HV paper electrophoresis/HPLC)
Sequence all peptides independently
Assemble the sequence using overlap info

[Figure: overlapping trypsin and chymotrypsin (Chtr) fragments]
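The cleavage rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming idealized proteases that cut after every listed residue (real trypsin, for instance, cuts poorly before Pro). The `digest` helper and the example peptide are hypothetical and not from the slides.

```python
# Idealized proteolytic digestion: cut after every residue in cleave_after.
TRYPSIN = "RK"        # cleaves after Arg (R) and Lys (K)
CHYMOTRYPSIN = "FYW"  # cleaves after Phe (F), Tyr (Y), Trp (W)

def digest(sequence, cleave_after):
    """Return the fragments produced by cutting after each listed residue."""
    fragments, current = [], ""
    for residue in sequence:
        current += residue
        if residue in cleave_after:
            fragments.append(current)
            current = ""
    if current:  # C-terminal fragment that does not end in a cleavage site
        fragments.append(current)
    return fragments

peptide = "AGFKLMRWTSYKEV"  # hypothetical example peptide
tryptic = digest(peptide, TRYPSIN)
chymotryptic = digest(peptide, CHYMOTRYPSIN)
print(tryptic)       # ['AGFK', 'LMR', 'WTSYK', 'EV']
print(chymotryptic)  # ['AGF', 'KLMRW', 'TSY', 'KEV']
```

The chymotryptic fragment KLMRW spans the junctions between the tryptic fragments AGFK, LMR, and WTSYK. That is the overlap information used to put the tryptic peptides in order.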

Automated Sequencing

Use proteolytic fragments
Sequence each peptide using automated Edman degradation
  Each Edman cycle removes one amino acid
  Converts it to a PTH amino acid for HPLC
Assemble the sequence using overlap info

[Figure: overlapping trypsin and chymotrypsin (Chtr) fragments]

N-Terminal Edman Degradation

[Reaction scheme: the N-terminal amine of the peptide attacks phenylisothiocyanate. Acid (+ H+) releases the anilinothiazolinone amino acid and leaves the peptide one residue shorter (Peptide N-1). Rearrangement gives the PTH-amino acid, which absorbs at 260-275 nm and is RP-HPLC compatible.]

C-Terminal Edman Degradation

[Reaction scheme: activation of the C-terminal carboxyl by acetic anhydride, then attack by thiocyanate. Hydrolysis (+ H2O) releases the TH-amino acid and leaves the peptide one residue shorter (Peptide N-1).]

Alternative Sequencing - MS

Use non-fragmenting ionization
  Electrospray ionization + traditional mass spec
  Matrix-assisted laser desorption-ionization + time-of-flight mass spec (MALDI-TOF)
Measures mass of mature, intact protein and/or complexes

Sequencing the Gene

DNA synthesis in vitro requires
  Template (the DNA you want to sequence)
  Primer (complementary to the region upstream of where you want to sequence)
  Polymerase
  dXTPs, Mg++

Primer pairs with template, free 3’-OH group ready for action

As dXTPs base-pair with the template, the 3’-OH attacks the α-phosphate of the dXTP, displacing PPi, making a phosphodiester, and extending the nascent DNA chain by one base

The Polymerase Reaction

[Figure: elongation of a primer that is base-paired with a template. Requires a free 3’-OH group.]

Di-deoxy Terminators

If 2’,3’-dideoxynucleoside triphosphates were used, the reaction would proceed for only one cycle because there would be no free 3’-OH group to attack the next dXTP

If a fraction of a percent of ONE 2’,3’-dideoxynucleoside triphosphate (say ddTTP) were used
  SOME polymer would be terminated EACH time that base was incorporated, i.e., each time dA occurs in the template
  If 1/1000th of the dTTP were ddTTP, then 1/1000th of the growing polymers would terminate at each dA in the template… the rest would continue
  You would get many polymers of different sizes, each corresponding to the occurrence of a dA in the template

Use four separate reactions, one with ddTTP, one with ddATP, one with ddGTP, and one with ddCTP (and all other components)
  For every base in the sequence, one of the four reaction mixtures would contain a polymer terminated at that base
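The 1/1000 arithmetic can be written out as a short worked equation. This is a sketch under an idealizing assumption that is not stated on the slide: the polymerase incorporates ddTTP and dTTP equally well, so the chance of termination at each dA equals the ddTTP fraction f.

```latex
\begin{aligned}
P(\text{chain stops at the } k\text{-th dA}) &= f\,(1-f)^{k-1}, \qquad f = \tfrac{1}{1000} \\
P(\text{chain still growing after } n \text{ dA's}) &= (1-f)^{n}, \qquad (0.999)^{100} \approx 0.905
\end{aligned}
```

Under this assumption about 90% of the chains are still extending after 100 dA positions, which is why a small ddTTP fraction yields terminated polymers across the whole length of the template.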

Dideoxy Terminators

Use fluorescent or radioactive primer so you can see every polymer

Separate them by size (gel electrophoresis)

Read sequence of polymers from gel

Infer the sequence of the template by Watson-Crick

[Figure: agarose gel with one lane per reaction (ddATP, ddTTP, ddCTP, ddGTP), fragments running from small to large. Labels: base in polymer; sequence of template.]

3’ATGTCACAGGACAGA5’
5’TACAGTCTCCTGTCT3’
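Reading a dideoxy gel can be mimicked in code. The sketch below assumes an idealized experiment in which every position of the newly synthesized strand yields exactly one terminated fragment in the matching ddNTP reaction. The function names and the eight-base strand are hypothetical, chosen only for illustration.

```python
# Idealized dideoxy sequencing: each lane holds the fragment lengths that end
# with that lane's dideoxy base; reading lengths in order gives the new strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_lanes(new_strand):
    """Map each ddNTP reaction to the lengths of its terminated fragments."""
    lanes = {base: [] for base in "ATCG"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes):
    """Read the gel from the smallest fragment to the largest (5' to 3')."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))

def infer_template(read):
    """Watson-Crick partner of each base; the result is written 3' to 5'."""
    return "".join(COMPLEMENT[base] for base in read)

lanes = sanger_lanes("GATCCTGA")  # hypothetical newly synthesized strand, 5'->3'
read = read_gel(lanes)
template_3_to_5 = infer_template(read)
print(lanes)            # {'A': [2, 8], 'T': [3, 6], 'C': [4, 5], 'G': [1, 7]}
print(read)             # GATCCTGA
print(template_3_to_5)  # CTAGGACT
```

The read comes off the gel 5’ to 3’ because the shortest fragment ends closest to the primer. The template is its base-by-base complement in the opposite orientation.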

A, T, G, and C. What are the Amino Acids?

Standard Genetic Code (*** = stop codon)

First base   Second base
             U          C          A          G
U            UUU Phe    UCU Ser    UAU Tyr    UGU Cys
             UUC Phe    UCC Ser    UAC Tyr    UGC Cys
             UUA Leu    UCA Ser    UAA ***    UGA ***
             UUG Leu    UCG Ser    UAG ***    UGG Trp
C            CUU Leu    CCU Pro    CAU His    CGU Arg
             CUC Leu    CCC Pro    CAC His    CGC Arg
             CUA Leu    CCA Pro    CAA Gln    CGA Arg
             CUG Leu    CCG Pro    CAG Gln    CGG Arg
A            AUU Ile    ACU Thr    AAU Asn    AGU Ser
             AUC Ile    ACC Thr    AAC Asn    AGC Ser
             AUA Ile    ACA Thr    AAA Lys    AGA Arg
             AUG Met    ACG Thr    AAG Lys    AGG Arg
G            GUU Val    GCU Ala    GAU Asp    GGU Gly
             GUC Val    GCC Ala    GAC Asp    GGC Gly
             GUA Val    GCA Ala    GAA Glu    GGA Gly
             GUG Val    GCG Ala    GAG Glu    GGG Gly

ORFs - Look for longest uninterrupted sequence
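A minimal sketch of the ORF search follows. It assumes a DNA sequence written 5’ to 3’ (T in place of U) and scans only the three forward reading frames; a fuller search would also cover the reverse complement. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated with bases in the
# order T, C, A, G (third position varying fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame=0):
    """Translate one forward reading frame, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def longest_orf(dna):
    """Longest stretch uninterrupted by a stop codon in the forward frames."""
    best = ""
    for frame in range(3):
        for segment in translate(dna, frame).split("*"):
            if len(segment) > len(best):
                best = segment
    return best

dna = "ATGCTAAATGATGTTTGA"  # hypothetical example sequence
print(translate(dna, 0))  # MLNDV*
print(translate(dna, 1))  # C*MMF
print(translate(dna, 2))  # AK*CL
print(longest_orf(dna))   # MLNDV
```

Frames 1 and 2 are broken up by stop codons, so the longest uninterrupted stretch identifies frame 0 as the likely coding frame.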

So, you’ve got the sequence… So what?

Next topic: Bioinformatics
Inferences based on homology

Questions
1. Has the gene been sequenced before? (Will I be able to publish?)
2. What is the sequence of the protein encoded by the gene?
3. Has the protein been sequenced before?
4. Is the gene similar to one that has been sequenced before?
   1. Did I sequence the right gene?
   2. Will I be able to find structural or functional relatives?
5. Is the protein similar to one that has been sequenced before?
   1. How similar?
   2. What does the similarity mean?
6. Can I predict the function of the gene product, or is the predicted function consistent with what I know about the protein?
7. Can I get information about structural features of the gene product?
   1. Secondary structure
   2. Folding domains or other common patterns
   3. Hydropathy profiles
      1. How might predicted helices and/or sheets pack?
      2. Is it likely to be a membrane protein, a transmembrane protein?

Answers: Sequence Similarities and Similarity Searches

1. Search sequence databases for homologous proteins.
2. Find families of proteins that are similar to your protein.
3. Use information about the structure and properties of the similar protein(s) to establish inferences about your protein. If the exact sequence is in the database, the similarity search routines will find that, too.
4. Determine whether two sequences are related (or identical) by aligning them so that homologous regions are adjacent.
5. For two identical sequences:

   MGKARSMVLKHSTKARS
   MGKARSMVLKHSTKARS

But, what about:

Imperfect homology

   MGKARSMLLKHSTKARS
   MGKARTMVLKHSTRARS

Gaps/insertions

   MGKARSMLLKHSLKARS
   MGRA----LKHSLRART

And, how homologous is homologous?

Need
  Similarity scores for pairs of amino acids
  Method for dealing with gaps
  Algorithms for comparing a sequence with a database
  Ways to assess the degree of homology
  Ways to link structural info with sequence info

Dynamic Programming

Needleman-Wunsch Algorithm

Compares similarity of two proteins a and b at positions i and j:

$$NW_{i,j} = \max\left(NW_{i-1,j-1} + s(a_i, b_j),\; NW_{i-1,j} + g,\; NW_{i,j-1} + g\right)$$

$NW_{i-1,j-1}$ = running total
$s(a_i, b_j)$ = similarity between residue i of protein a and residue j of protein b
$g$ = gap penalty

http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html

Fill a matrix with all possibilities
Simple example: s = 1 (match) or 0 (mismatch) and g = 0
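The matrix fill can be sketched directly from the recurrence. This is a minimal version that returns only the final score, using the simple scheme from the slide (s = 1 or 0, g = 0) as defaults. The function and parameter names are assumptions made for illustration, and traceback of the alignment itself is left out.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment score of sequences a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    nw = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        nw[i][0] = i * gap
    for j in range(1, cols):
        nw[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            nw[i][j] = max(nw[i - 1][j - 1] + s,  # align a_i with b_j
                           nw[i - 1][j] + gap,    # a_i opposite a gap
                           nw[i][j - 1] + gap)    # b_j opposite a gap
    return nw[-1][-1]

# The imperfect-homology pair from the slides: 14 of 17 positions match.
print(needleman_wunsch("MGKARSMLLKHSTKARS", "MGKARTMVLKHSTRARS"))  # 14
# The gaps/insertions pair: free gaps let 10 residues line up.
print(needleman_wunsch("MGKARSMLLKHSLKARS", "MGRALKHSLRART"))      # 10
```

With g = 0 the score is simply the largest number of residues that can be matched in order. A negative gap penalty would make the algorithm trade matches against the cost of opening gaps.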

Smith-Waterman
  Always compare the NW terms to zero so that the score never drops below zero

$$NW_{i,j} = \max\left(NW_{i-1,j-1} + s(a_i, b_j),\; NW_{i-1,j} + g,\; NW_{i,j-1} + g,\; 0\right)$$
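The only change from the global recurrence is the extra zero, as this sketch shows. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions and not values from the slides; the function returns the best local score without a traceback.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    sw = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            sw[i][j] = max(sw[i - 1][j - 1] + s,  # align a_i with b_j
                           sw[i - 1][j] + gap,    # a_i opposite a gap
                           sw[i][j - 1] + gap,    # b_j opposite a gap
                           0)                     # restart the alignment here
            best = max(best, sw[i][j])
    return best

# Unrelated flanks are ignored; only the shared LKHSL core scores.
print(smith_waterman("GGGGLKHSLGGGG", "TTLKHSLTT"))  # 5
# The gaps/insertions pair from the slides.
print(smith_waterman("MGKARSMLLKHSLKARS", "MGRALKHSLRART"))
```

Because a cell can always fall back to zero, a poorly matching region cannot drag down the score of a good region elsewhere. That is what makes the alignment local.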

BLAST & FASTA
FASTA - great, we won’t talk about it
  Much faster and more selective than SW, but less sensitive
Basic Local Alignment Search Tool
  Less selective and more sensitive than FASTA, i.e., you may get more hits, but some of them may be wrong

BLAST
  Divide sequence into “words” of length W (e.g., BLASTp, initial W = 3)
  Compare all W-length words
  Retain only pairs with similarity above a threshold, T
    Call them High-Scoring Pairs (HSPs)
  Increase W, repeat with HSPs
  Keep going while remaining above a minimum similarity, and compare to random probability (E)
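The first steps (words of length W, then a threshold T) can be sketched as follows. This is a toy illustration and not the BLAST implementation: it scores words by simple identity (+1 match, -1 mismatch), whereas BLASTp scores words with a substitution matrix, and it compares every word pair by brute force. All names here are hypothetical.

```python
def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def word_score(x, y, match=1, mismatch=-1):
    """Identity-based similarity of two equal-length words (an assumption)."""
    return sum(match if p == q else mismatch for p, q in zip(x, y))

def seed_hits(query, subject, w=3, threshold=3):
    """Word pairs whose score reaches the threshold T."""
    return [(i, j, q_word)
            for i, q_word in words(query, w)
            for j, s_word in words(subject, w)
            if word_score(q_word, s_word) >= threshold]

hits = seed_hits("LKHSL", "MGRALKHSLRART")
print(hits)  # [(0, 4, 'LKH'), (1, 5, 'KHS'), (2, 6, 'HSL')]
```

All three hits share the same offset (subject position minus query position = 4), so they lie on one diagonal and would merge into a single longer pair when the words are extended.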

Scoring Matrices- Making similarity quantitative

Compare the actual frequency to the frequency expected by chance alone.

Probability that alanine appears at position x in a protein = fraction of Ala in all proteins = $p_{Ala}$

Probability that one protein has Ala at position x, and another protein has Gly? = $p_{Ala}\,p_{Gly}$
  The frequency due to chance alone.

Similarity

$q_{Ala,Gly}$ = ACTUAL frequency that Ala and Gly are at position x in two proteins (in your database)

$$R_{i,j} = \frac{q_{i,j}}{p_i\,p_j}$$

Score: “Log-Odds Scores” (Remember Chou & Fasman?)

$$S_{i,j} = \log_2 R_{i,j} = \log_2\left(\frac{q_{i,j}}{p_i\,p_j}\right)$$

$$S_{i,j} = \left(\frac{1}{\lambda}\right)\log\left(\frac{q_{i,j}}{p_i\,p_j}\right)$$
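A quick numerical illustration of the log-odds score follows. The frequencies used are made-up values chosen for round results, not measurements from any database, and the function name is hypothetical.

```python
import math

def log_odds(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency vs. chance expectation."""
    return math.log2(q_ij / (p_i * p_j))

p_ala, p_gly = 0.08, 0.07  # hypothetical background frequencies
chance = p_ala * p_gly     # expected Ala/Gly pair frequency by chance alone

twice_chance = log_odds(2 * chance, p_ala, p_gly)   # pair seen twice as often
half_chance = log_odds(0.5 * chance, p_ala, p_gly)  # pair seen half as often
print(round(twice_chance, 6))  # 1.0
print(round(half_chance, 6))   # -1.0
```

A positive score means the pair is aligned more often than chance predicts, a negative score means less often, and zero means the two residues pair at the chance rate.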

PAM Matrices
  Margaret Dayhoff assembled the Atlas of Protein Sequence and Structure
  Evolutionarily-accepted mutations
    Calculated $q_{i,j}$ for all aa’s in closely-related proteins
    These were accepted by Nature as similar/close enough
  Generate half matrices: Point Accepted Mutation/Percent Accepted Mutations
  Scale, so PAM1 reflects 1 mutation per 100 residues; PAM50, 50 allowed mutations/100

BLOSUM
  Henikoff and Henikoff
  BLOcks of Amino Acid SUbstitution Matrix
  BLOCKS is a database of related proteins

BLAST Search
  Go to BLAST Website
  Enter Nucleotide or AA sequence
  Choose BLAST type
    Nucleotide-nucleotide: BLASTn
    Protein-protein: BLASTp
    6-frame-translated nucleotide-protein: BLASTx
    others

Then?
  Does it make sense?
  Multisequence Alignment
  Secondary structure prediction
  Domains
  Families

Caveat

It ain't what you don't know that'll kill you,

it's what you know that ain't so.
