PROTEINS FROM SEQUENCES TO INTERACTIONS IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.


Amino Acids and Proteins

Proteins are polymers composed of combinations of 20 different amino acids.

They range in size from about 50 to over 20,000 amino acids.

A single cell may have 10,000 or more different proteins.

About half of the non-water component of a typical cell is protein.

Four levels of protein structure

STRUCTURE : PROCESS

Primary : Assembly

Secondary : Folding

Tertiary : Packing

Quaternary : Interaction

• Occurs at the ribosome

• Involves dehydration synthesis and polymerization of amino acids attached to tRNA

• Yields primary structure

• Non-linear

• 3 dimensional

• Localized to regions of an amino acid chain

• Formed and stabilized by hydrogen bonding, electrostatic and van der Waals interactions

Secondary Structure

Importance and Determinants of Secondary Structure

• Folded proteins have segments of regular conformation

• The arrangement of secondary structure elements provides a convenient way of classifying types of folds

• Steric constraints dictate the possible types of secondary structure

Folding

• The folded structure of a protein is directly determined by its primary structure

Computational prediction of folding is not yet reliable

Tertiary Structure

• The condensing of multiple secondary structural elements leads to tertiary structure

• Tertiary structure is stabilized by efficient packing of atoms in the protein interior

The Protein Domain

• A compact unit of protein structure that is usually capable of folding stably as an independent entity in solution. Domains do not need to comprise a contiguous segment of peptide chain, although this is often the case.

• Proteins whose molecular weights are less than about 20,000 often have a simple globular shape, with an average molecular diameter of 20 to 30 Å, but larger proteins usually fold into two or more independent globules, or structural domains.

• Multidomain proteins probably evolved by the fusion of genes that once coded for separate proteins

The Protein Domain

Identical domains Structurally unrelated

Protein Domains – an alphabet of functional modules

WD40, WW, SH2, SH3, 14-3-3, ANK3, ARM, BH1, C1, C2, CARD, EH, EVH, FYVE, PDZ, Death, DED, EFH, PH, PTB, SAM

The Universe of Protein Structures

• The number of protein folds is large but limited

• Protein structures are modular and proteins can be grouped into families on the basis of the domains they contain

• The modular nature of protein structure allows for sequence insertions and deletions

Schematic diagram of the domain arrangement of a number of signal transduction proteins. The different modules have different functions.

• Why classify proteins:

– The number of solved structures grows rapidly
– Generate overview of structure types
– Detect similarities (evolutionary relationships)
– Set up prediction benchmarks

Protein structure Classification

Classification schemes

• SCOP – Manual classification

• CATH – Semi-manual classification

• FSSP – Automatic classification

CATH

Singh

Class: SSE composition & packing

Architecture: overall shape of domain, ignore SSE connectivity

Topology (Fold): consider connectivity

Homologous superfamily: a common ancestor

CATH - Class

Class 1: Mainly Alpha

Class 2: Mainly Beta

Class 3: Mixed Alpha/Beta

Class 4: Few Secondary Structures

Secondary structure content (automatic)

CATH - Architecture

Examples: Roll, Super Roll, Barrel, 2-Layer Sandwich

Orientation of secondary structures (manual)

CATH - Topology

Examples: L-fucose Isomerase; Serine Protease; Aconitase, domain 4; TIM Barrel

Topological connection and number of secondary structures

CATH - Homology

Examples: Alanine racemase; Dihydropteroate (DHP) synthetase; FMN dependent fluorescent proteins; 7-stranded glycosidases

Superfamily clusters of similar structures & functions

Protein Motifs

• Protein motifs may be defined by their primary sequence or by the arrangement of secondary structure elements

Helix-turn-helix Four-helix bundle TIM-barrel Zinc finger

R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
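A sequence motif written in this pattern style can be searched for mechanically. The sketch below is only an illustration: it assumes the usual PROSITE conventions (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` for repetition), and the function names and the test sequence are hypothetical, not taken from the slides.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    element_re = re.compile(
        r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                # x = any amino acid
            core = "."
        elif core.startswith("{"):     # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        if low is not None:            # (n) or (n,m) = repetition count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

MOTIF = "R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)"
MOTIF_REGEX = prosite_to_regex(MOTIF)

def find_motif(sequence, regex=MOTIF_REGEX):
    """Return (start, end) spans of all non-overlapping motif matches."""
    return [m.span() for m in re.finditer(regex, sequence)]

# Hypothetical sequence that contains the motif once.
print(find_motif("AARYADWALSTPLIVKK"))
```

A match only says that the sequence fits the pattern; short patterns also match by chance, which is one reason motif identification from sequence is not straightforward.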

Protein Motifs

• Identifying motifs from sequence is not straightforward

Quaternary Structure

• Many proteins are composed of more than one polypeptide chain

• All specific intermolecular interactions depend on complementarity

• All types of protein-stabilizing interactions contribute to the formation of intermolecular interfaces

• Inappropriate quaternary interactions can have dramatic functional consequences

Sickle-cell hemoglobin


• Protein assemblies built of identical subunits are usually symmetric


Proteins are the most versatile macromolecules of the cell

“Protein function” may mean the biochemical function of the molecule in isolation, or the cellular function it performs as part of an assemblage or complex with other molecules.

Functions of Protein

Enzymes
• Globular proteins that facilitate chemical reactions.

Defense Proteins
• Antibodies
• Protein toxins

Transport Proteins
• Plasma membrane proteins carry substances through membranes or form channels or pumps for passage of substances
• Oxygen carrier in circulation (hemoglobin)
• Mineral protein carriers (iron, zinc)

Structural/Support Proteins (Fibrous proteins)
• Connective tissue in animals (collagen – the most abundant vertebrate protein)
• Webs, cocoons and other arthropod structures
• Hair, nails, horns, etc. (keratin)
• Fibrins used in blood clotting

Functions of Protein

Contractile Proteins – locomotion and movement
• Muscle
• Cilia and flagella
• Microtubules, microfilaments and intermediate filaments

Regulatory Proteins
• Hormones
• Gene Regulators – transcription factors
• Osmotic regulation

Receptor Proteins
• Membrane surface receptor proteins
• Signal transduction proteins

Recognition Proteins
• Glycoproteins (carbohydrate-protein hybrids) for identification of "self".

Storage Proteins (specialized)
• Examples are casein in milk, ferritin for iron storage, calmodulin for calcium and albumin in eggs

Energy transfer molecules
• Cytochromes

[Figure: 3D STRUCTURE at the center, linked to: fold; evolutionary relationships; biological multimeric states; disease states, mutations; active sites, enzyme clefts; antigenic sites; surface properties; protein-ligand interactions]

Protein Structure-Function Relationships

Overview: Protein Function and Architecture

Binding

Specific recognition of other molecules is central to protein function. The molecule that is bound (the ligand) can be as small as the oxygen molecule that coordinates to the heme group of myoglobin, or as large as a specific DNA sequence.

Catalysis

Essentially every chemical reaction in the living cell is catalyzed, and most of the catalysts are protein enzymes.


Switching

Proteins are flexible molecules and their conformation can change in response to changes in pH or ligand binding. Such changes can be used as molecular switches to control cellular processes.


Structural Proteins

Protein molecules serve as some of the major structural elements of living systems. This function depends on specific association of protein subunits with themselves as well as with other proteins, carbohydrates, and so on.


Protein-Protein Interaction Network

Why Study Networks?

• It is increasingly recognized that complex systems cannot be described from a reductionist viewpoint.

• The actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts.

• Understanding the behavior of such systems starts with understanding the topology of the corresponding network.

• Topological information is fundamental in constructing realistic models for the function of the network

• Create models of networks that can help us to understand the meaning of these properties

• Find statistical properties that characterize the structure and behavior of networked systems

• Predict what the behavior of networked systems will be

Why Study Networks?

Basic notions of networks

Network (graph) – a set of nodes connected via edges.

The degree of a node (connectivity) = total number of connections of a node.

Characteristics of networks

k – degree of a node,

P(k) – degree distribution,

Diameter – max of distances between nodes taken over all node pairs.

Clustering coefficient

[Figure: example graph with node degrees k=2, k=2, k=3, k=1]
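These quantities are easy to compute on a small graph. The sketch below is a minimal illustration using only the standard library; the four-node example graph and all function names are assumptions made for the example, and the clustering coefficient uses the common local definition (fraction of a node's neighbor pairs that are themselves connected).

```python
from collections import Counter, deque
from itertools import combinations

def build_graph(edges):
    """Undirected graph as a dict: node -> set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def degrees(graph):
    return {node: len(nbrs) for node, nbrs in graph.items()}

def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(degrees(graph).values())
    return {k: n / len(graph) for k, n in counts.items()}

def distances_from(graph, source):
    """Shortest path lengths from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def diameter(graph):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(distances_from(graph, n).values()) for n in graph)

def clustering_coefficient(graph, node):
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2 * links / (len(nbrs) * (len(nbrs) - 1))

# Hypothetical toy graph: a triangle A-B-C with a pendant node D on C.
EXAMPLE = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(degrees(EXAMPLE), diameter(EXAMPLE))
```

For real interaction networks a graph library would normally be used; the plain-dictionary version is kept here so each definition stays visible.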

Types of Networks

• Social network: individuals or organizations connected by one or more specific types of interdependency (friendship, common interest, beliefs, etc.)

• Data network: such as articles and citations, World Wide Web, ...

• Technological network: designed networks such as internet, transport, electrical, ...

• Biological networks

Network models

A number of network models have been suggested to characterize networks.

The most widely accepted models are the scale-free and the small-world network models.

An alternative model is the modular network model.

Different network models: Barabási-Albert.

Barabasi & Albert, Science, 1999

Model of preferential attachment.

• At each step, a new node is added to the graph.

• The new node is attached to one of the old nodes with probability proportional to the vertex degree.

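Degree distribution – power law distribution: $P(k) \sim k^{-\gamma}$, where $\gamma$ is the exponent of the power law. On a plot of ln(P(k)) against ln(k) this appears as a straight line.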
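The preferential attachment rule can be simulated in a few lines. This is a minimal sketch under stated assumptions: the process starts from two connected nodes, each new node brings exactly one edge, and the function names and the seed are illustrative choices rather than anything specified in the slides.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one old node
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # assumed starting graph: two connected nodes
    stubs = [0, 1]        # each node appears once per edge end it owns
    for new in range(2, n_nodes):
        target = rng.choice(stubs)   # uniform over stubs = proportional to degree
        edges.append((new, target))
        stubs.extend([new, target])
    return edges

def degree_counts(edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

edges = preferential_attachment(500, seed=1)
deg = degree_counts(edges)
# Histogram of degrees: how many nodes have each degree k.
histogram = Counter(deg.values())
print(sorted(histogram.items())[:5], "max degree:", max(deg.values()))
```

Plotting ln of the histogram counts against ln(k) for a large run is the usual way to check for the heavy tail, with most nodes at low degree and a few highly connected ones.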

Difference between scale-free and random graph models.

Random networks are homogeneous: most nodes have roughly the same number of links.

Scale-free networks have a number of highly connected vertices.

Scale free model

• Interaction networks with scale-free model
– Most proteins interact with a small number of partners
– A few proteins (“hubs”) interact with many partners
– Resistant to random node removal
– Sensitive to targeted hub removal

• Types of Hubs
– Party Hubs
  • Interact with most of their partners simultaneously
  • Perform specific functions inside a module
– Date Hubs
  • Interact with different partners at different times or locations
  • Connect modules (biological processes) together
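The contrast between random node removal and targeted hub removal can be seen on a deliberately extreme toy case. The sketch below assumes a hypothetical star-shaped network (one hub, five partners) and measures the size of the largest connected component after a node is removed; the names are illustrative only.

```python
def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                      # depth-first traversal
            node = stack.pop()
            size += 1
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best

# Hypothetical toy network: one hub bound to five partners.
STAR = {"hub": {"p1", "p2", "p3", "p4", "p5"},
        **{f"p{i}": {"hub"} for i in range(1, 6)}}

print(largest_component_size(STAR))                    # intact network
print(largest_component_size(STAR, removed={"p1"}))    # a non-hub node removed
print(largest_component_size(STAR, removed={"hub"}))   # the hub removed
```

Removing a low-degree node leaves the rest connected, while removing the hub fragments the network completely. Real interaction networks are less extreme, but the same measurement applies.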

Scale free model

Small-world and modular model

The shortest path between any pair of proteins tends to be small, and the network is full of densely connected neighborhoods.

Small-world and scale-free models are not in conflict; rather, they complement each other.

The modular network model suggests that protein interaction networks consist of several densely interconnected functional modules. Most nodes have roughly equal degree, which runs against the scale-free property.

Types of biomolecular networks

• Gene regulatory networks
– Vertices: genes
– Edges: regulatory influences
– Data sources: ChIP-Chip, gene expression data, sequence

• Metabolic networks
– Vertices: metabolites, reactions (catalyzed by enzymes)
– Edges: consumption, production (enzymes, metabolites)
– Data sources: sequence, classical biochemistry, mass spectrometry, isotope labeling

• Protein-protein interaction networks
– Vertices: proteins
– Edges: physical interactions
– Data sources: yeast two-hybrid, mass spectrometry

• Signaling networks
– Vertices: proteins with state information
– Edges: interactions modifying states
– Data sources: measurements of post-translational modifications

• Networks of functional links
– Vertices: genes
– Edges: functional relationships
– Data sources: sequence (of several organisms), expression data, any data type allowing definition of a similarity measure…

Protein Interactions

• Proteins perform a function as a complex rather than as a single protein.
– Protein-protein interactions are of central importance for virtually every process in a living cell (cell growth, cell cycle, metabolic pathway, signal transduction)

• Knowing whether two proteins interact can help us discover unknown proteins’ functions:
– If the function of one protein is known, the function of its binding partners is likely to be related: “guilt by association”.
– Thus, having a good method for detecting interactions can allow us to use a small number of proteins with known function to characterize new proteins.
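The “guilt by association” idea can be sketched as a majority vote over the annotated binding partners of an uncharacterized protein. This is a simplified illustration, not a method described in the slides: the protein names, annotations, and the voting rule are all hypothetical.

```python
from collections import Counter

def predict_function(protein, interactions, annotations):
    """Guess a function from the most common annotation among partners.

    interactions: iterable of (protein_a, protein_b) pairs
    annotations:  dict mapping proteins of known function to that function
    Returns None when no annotated partner exists. Ties go to the
    annotation that was encountered first.
    """
    votes = Counter()
    for a, b in interactions:
        if protein in (a, b):
            partner = b if a == protein else a
            if partner != protein and partner in annotations:
                votes[annotations[partner]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical data: X is uncharacterized, P1..P4 are annotated.
INTERACTIONS = [("X", "P1"), ("P2", "X"), ("X", "P3"), ("P3", "P4")]
ANNOTATIONS = {"P1": "cell cycle", "P2": "cell cycle",
               "P3": "cytoskeleton", "P4": "cytoskeleton"}

print(predict_function("X", INTERACTIONS, ANNOTATIONS))
```

Because interaction data are noisy and incomplete, a prediction of this kind is a hypothesis to be tested rather than an assignment.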

• Studying protein interaction network architecture allows us to:

– Assess the role of individual proteins in the overall pathway

– Identify candidate genes involved in genetic diseases (gene mutation → disrupted protein interaction → disease)

– Set up the framework for mathematical models

Biological networks are very rich networks with very limited, noisy, and incomplete information. Discovering underlying principles is very challenging.

Importance of protein interaction

Genome: 30,000 genes

Transcriptome: 40,000 -100,000 mRNAs

Proteome: 100,000 - 400,000 proteins

Interactome: >1,000,000 interactions

[Figure: relative scale of the human genome, transcripts, human proteome and protein interactions, with axis marks at 10^5 and 10^6]


S. cerevisiae (Yeast)
• 4389 proteins
• 14319 interactions

C. elegans (Worm)
• 2718 proteins
• 3926 interactions

D. melanogaster (Fly)
• 7038 proteins
• 20720 interactions


Yeast Protein Interaction Network

Nodes: proteins

Links: physical interactions

The two representations (interaction list and network layout) differ in localization (a protein occurs multiple times in the list but exactly once in the layout), in neighborhood (in the layout, the neighbors of a protein are easily identified and studied), and in mental image (the network layout allows proteins to be memorized by position).

In positioning the nodes, secondary information can be employed to guide the layout; for example, proteins can be spatially grouped by localization or function. In this way, a particular arrangement of the proteins can even increase the information content.

PPIs are more often represented graphically as two-dimensional networks.

Types of PPI Network

Methods to investigate protein-protein interactions

• There is a multitude of methods to detect protein-protein interactions

• Each of the approaches has its own strengths and weaknesses, especially with regard to the sensitivity and specificity of the method.

• A high sensitivity means that many of the interactions that occur in reality are detected by the screen. A high specificity indicates that most of the interactions detected by the screen are also occurring in reality.
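Both quantities can be computed directly once a screen is compared against a reference set of interactions. The sketch below follows the definitions given above; note that “specificity” in this sense (fraction of detected interactions that are real) corresponds to what is usually called precision in statistics. The reference set, the detected pairs, and the function names are hypothetical, and both inputs are assumed to be non-empty.

```python
def normalize(pairs):
    """Treat interactions as unordered pairs: (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}

def screen_quality(detected, true_interactions):
    """Return (sensitivity, specificity) as defined in the text above."""
    detected = normalize(detected)
    truth = normalize(true_interactions)
    true_positives = detected & truth
    sensitivity = len(true_positives) / len(truth)      # real ones that were found
    specificity = len(true_positives) / len(detected)   # found ones that are real
    return sensitivity, specificity

# Hypothetical reference set and screen result.
TRUE_SET = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
DETECTED = [("B", "A"), ("C", "B"), ("A", "E")]

print(screen_quality(DETECTED, TRUE_SET))
```

In practice the true set is never fully known, so both values are estimates against a curated benchmark.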

Methods to investigate protein-protein interactions

• Experimental methods

• Computational methods

Experimental methods

• Co-immunoprecipitation
• GST-pull down assays
• Protein arrays
• Far-western analysis
• TAP-MS
• Yeast two-hybrid system
• Phage display

Bait – Prey model: does X (bait) bind with a protein Y (prey)?

Assays are performed in vitro or in vivo.

Physical interaction between protein binding domains

Co-immunoprecipitation

• Co-immunoprecipitation is considered to be the gold standard assay for protein-protein interactions

• Immunoprecipitation (IP) experiment - immune response & precipitation

• Affinity purify a bait protein antigen together with its binding partner using a specific antibody

• Capturing of immune complex by solid support

• Elution from the support and analysis by SDS-PAGE and detection by western blot

• It is not a screening approach

Co-immunoprecipitation

GST-pull down assays

• Affinity chromatography method

• Using a tagged or labeled bait by binding a specific affinity matrix

• Purification of a prey protein from a lysate sample or other protein-containing mixture

• GTH(glutathione)-GST(glutathione S-transferase) binding


[Figure: GST pull-down workflow. Express the GST-fusion protein (GST-gene X in pGEX) in E. coli; prepare protein extract from tissue; mix and incubate with Sepharose bead-GTH (glutathione). Gel lanes: GST-fusion protein, GST alone.]

Protein arrays

• Antibody-based or bait-based arrays

• High-throughput assays: screening and detection of specific interactions of proteins from complex mixtures

• Protein expression profiling, protein-protein interaction and enzyme activity

• Binding between the capture proteins immobilized on a surface and the target proteins in the sample solution.

Protein arrays

Mass spectrometry

• Ionization (e.g. electrospray ionization) produces peptide ions in the gas phase;

• Detection and recording of sample ions: mass-to-charge ratios are assigned to different peaks of spectra;

• Analysis of MS spectra and protein identification: search a sequence database with the mass fingerprint and find correlations between theoretical and experimental spectra.

Protein Identification by MS

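[Figure: protein identification by MS. Experimental path: spot removed from gel, fragmented using trypsin, spectrum of fragments generated. Library path: database of sequences (e.g. SwissProt), artificially trypsinated, artificial spectra built. The two spectra are then compared (MATCH).]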
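The matching step can be illustrated with a much simplified peptide mass fingerprint. The sketch below makes several assumptions that are not in the slides: trypsin is modeled as cutting after K or R unless the next residue is P, no missed cleavages or modifications are considered, peptide masses are neutral monoisotopic masses, and the two-protein database and the tolerance are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def theoretical_masses(sequence):
    return [peptide_mass(p) for p in tryptic_digest(sequence)]

def count_matches(observed, theoretical, tol=0.01):
    """Number of observed masses explained by some theoretical mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

def identify(observed, database, tol=0.01):
    """Return the database entry whose in silico digest matches best."""
    return max(database,
               key=lambda name: count_matches(
                   observed, theoretical_masses(database[name]), tol))

# Hypothetical two-entry sequence database and an "observed" spectrum.
DATABASE = {"P1": "AAKGGRPLLKWW", "P2": "MMMKCCC"}
observed = theoretical_masses(DATABASE["P1"])
print(identify(observed, DATABASE))
```

Real search engines score matches statistically and handle charge states, missed cleavages, and modifications; the sketch only shows the digest, compute, and compare logic of the figure.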

Tandem affinity purification method (TAP)

• The target protein ORF is fused with the DNA sequence encoding the TAP tag;

• Tagged ORFs are expressed in yeast cells and form native complexes;

• The complexes are purified by the TAP method;

• Components of each complex are found by gel electrophoresis, MS and bioinformatics methods.

TAP-MS (Tandem Affinity Purification-Mass Spectrometry)

Yeast two-hybrid experiments

• Many transcription factors have two domains; one that binds to a promoter DNA sequence (BD) and another that activates transcription (AD).

• A transcription factor cannot activate transcription unless its DNA-binding domain is physically associated with an activating domain

Yeast two-hybrid system

• Detecting protein-protein interactions in yeast

• Transcriptional regulator system

• “Prey”-“bait” model: fusion proteins with a transcriptional activating domain (AD, prey) and a DNA-binding domain (BD, bait)

• The term “two-hybrid” derives from these two chimeric proteins.

• Most commonly used method for large scale, high-throughput identification of potential protein-protein interactions

[Figure: gene construction in yeast expression vectors. The two hybrid proteins (X and Y) bind, forming a functional transcription activator; expression of the reporter gene indicates that the proteins bind.]

Gal4/LacZ Y2H system

• Target proteins are fused with the BD and AD of the GAL4 protein, which activates the LacZ gene.

• If there is no galactose, GAL80 binds to GAL4 and blocks transcription.

• If galactose is present, GAL4 can activate transcription of beta-galactosidase.

High-throughput Y2H screening

Principle of two-hybrid library and array screens(Peter Uetz, et al. 2001)

Genome-wide analysis by Y2H

• Matrix approach: a matrix of prey clones is added to the matrix of bait clones. Diploids where X and Y interact are selected based on the expression of a reporter gene.

• Library approach: one bait X is screened against an entire library. Positives are selected based on their ability to grow on specific substrates.

Drawback of Y2H

• The interactions cannot be tested if a target protein can itself initiate transcription.

• Fusion of a protein into chimeras can change the structure of a target.

• Protein interactions can be different in yeast and other organisms.

• Proteins which can interact in two-hybrid experiments may never interact in a cell.

Advantage of Y2H

• An in vivo technique; a good approximation of processes which occur in higher eukaryotes.

• Transient interactions can be detected, and the affinity of an interaction can be predicted.

• Fast and efficient.

Differences and similarities between Y2H and MS-TAP

• Both methods generate a lot of false positives; only ~50% of interactions are biologically significant.

• Y2H produces binary interactions and lacks information about protein complexes, but can detect transient interactions.

• Y2H is an in vivo technique.

• MS can detect large stable complexes and networks of interactions.

Phage display

(William G.T. Willats. 2002)

Comparison of methods

Method: Co-immunoprecipitation
Advantage: Independent of cloning and ectopic gene expression; rapid procedures
Disadvantage: Cross-reactivity of antibody; antibody bleeding from column

Method: TAP-MS
Advantage: Generically applicable approach; ability to purify low abundant proteins/protein complexes; high throughput identification
Disadvantage: Protein tag might influence protein function; requires two successive steps of protein purification and cannot readily detect transient interactions

Method: GST-pull down assays
Advantage: Applicable to very weak protein interactions
Disadvantage: Complex formation in vitro; competition with in vivo pre-assembled complex

Method: Protein arrays
Advantage: High-throughput assay; disease diagnosis
Disadvantage: Difficulty of protein chip production

Method: Yeast two-hybrid system
Advantage: Highly sensitive detection; applicable to a wide range of protein interactions; no biochemical purification
Disadvantage: Stability of folding and activity in yeast; no post-translational modification

Method: Phage display
Advantage: Random library screening of many cDNAs through panning cycle
Disadvantage: Size limitation of protein sequence; incorrect folding or modification

Other methods

• Fluorescence resonance energy transfer (FRET) is a common technique when observing the interactions of only two different proteins.

• Label transfer can be used for screening or confirmation of protein interactions and can provide information about the interface where the interaction takes place.

• Chemical crosslinking is often used to "fix" protein interactions in place before trying to isolate/identify interacting proteins.

• Protein-protein docking, the prediction of protein-protein interactions based only on three-dimensional protein structures, is not yet satisfactory.


Computational methods for prediction of functional association and protein interactions

• Phylogenetic profile method.

• Rosetta Stone approach.

• Gene neighborhood method.

• Gene cluster method.

• Co-evolution methods.

• Classification methods.

Computational methods (Genomic context-based methods)

Genomic context-based methods are based on the assumption that functionally related proteins are encoded by genes that are co-regulated or co-evolved.

These methods seek to predict protein functional associations. Such functional associations may or may not result from physical binding.

− Phylogenetic profile

− Gene neighbors method

− Rosetta stone method (gene fusion)

Phylogenetic profile method

Idea: pairs of proteins that are always both present or both absent in a genome suggest functional dependence and possible interaction

Profile of a protein: A vector of 0/1 where each position corresponds to one genome. 1=protein present, 0=protein absent


• Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors.
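The profile comparison can be sketched with plain tuples of 0/1 values. In this minimal example the four proteins, the four genomes, and the function names are hypothetical; proteins with identical profiles are grouped, and groups whose profiles differ by exactly one bit are reported as neighbors.

```python
from itertools import combinations

def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def group_by_profile(profiles):
    """Map each distinct profile to the proteins that share it."""
    groups = {}
    for protein, profile in profiles.items():
        groups.setdefault(tuple(profile), []).append(protein)
    return groups

def neighbor_groups(groups):
    """Pairs of profiles that differ by exactly one bit."""
    return [(p, q) for p, q in combinations(groups, 2) if hamming(p, q) == 1]

# Hypothetical profiles over four genomes (1 = present, 0 = absent).
PROFILES = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (1, 0, 0, 1),
    "P4": (0, 1, 0, 0),
}

groups = group_by_profile(PROFILES)
print(groups)
print(neighbor_groups(groups))
```

With only a handful of genomes many unrelated proteins share a profile by chance, so the approach gains power as the number of sequenced genomes grows.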

Gene neighbors method

The basic assumption is that genes which interact or are functionally associated tend to be located in physical proximity to each other on the genome.

Despite the effect of neutral evolution, which tends to shuffle gene order between distantly related organisms, gene clusters or operons encoding co-regulated genes are usually conserved.

Gene neighbors method

Gene cluster method

• Bacterial genes of related function are often transcribed simultaneously – operon.

• Identification of operons is based on intergenic distances.
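A distance-based operon call can be sketched as follows. The rule used here is an assumption made for illustration: adjacent genes are placed in the same operon when they lie on the same strand and the gap between them is at most `max_gap` base pairs. The threshold of 50 and the gene coordinates are hypothetical.

```python
def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by at most max_gap bases.

    genes: list of (name, start, end, strand), sorted by start position.
    Returns a list of operons, each a list of gene names.
    """
    operons = []
    for gene in genes:
        name, start, end, strand = gene
        if operons:
            _, _, prev_end, prev_strand = operons[-1][-1]
            if strand == prev_strand and start - prev_end <= max_gap:
                operons[-1].append(gene)   # extend the current operon
                continue
        operons.append([gene])             # start a new operon
    return [[g[0] for g in operon] for operon in operons]

# Hypothetical gene coordinates.
GENES = [
    ("a", 100, 400, "+"),
    ("b", 420, 900, "+"),    # 20 bp after a, same strand
    ("c", 1500, 2000, "+"),  # 600 bp after b
    ("d", 2010, 2500, "-"),  # close to c but on the opposite strand
]

print(predict_operons(GENES))
```

Genes predicted to share an operon become candidates for a functional link, which may or may not involve physical binding.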

Rosetta stone method (gene fusion)

Assumption (based on observation):

Gene fusion examines pairs of genes that exist individually in some organisms, but as a fused gene in other organisms

Proteins that are fused in one genome are likely to interact, physically or at least functionally, in other genomes.
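The gene fusion test can be sketched on domain compositions. The example assumes each protein has already been reduced to a set of domain (or sequence family) labels; the genomes, names, and labels are hypothetical. A pair of separate proteins in the query genome is predicted to be linked when a single protein in the reference genome carries one domain from each of them.

```python
from itertools import combinations

def rosetta_stone_pairs(query_genome, reference_genome):
    """Find query protein pairs whose domains occur fused in the reference.

    Both arguments map protein name -> set of domain labels.
    Returns a sorted list of ((protein_a, protein_b), fused_protein).
    """
    predictions = set()
    for fused_name, fused_domains in reference_genome.items():
        for d1, d2 in combinations(sorted(fused_domains), 2):
            for p, p_domains in query_genome.items():
                for q, q_domains in query_genome.items():
                    if p == q:
                        continue
                    # p carries only d1 and q carries only d2 of the fused pair
                    if (d1 in p_domains and d2 in q_domains
                            and d2 not in p_domains and d1 not in q_domains):
                        predictions.add((tuple(sorted((p, q))), fused_name))
    return sorted(predictions)

# Hypothetical genomes described by domain content.
QUERY = {"geneA": {"DomA"}, "geneB": {"DomB"}, "geneC": {"DomC"}}
REFERENCE = {"fusedAB": {"DomA", "DomB"}, "single": {"DomC"}}

print(rosetta_stone_pairs(QUERY, REFERENCE))
```

Promiscuous domains that fuse with many partners produce spurious links, so real analyses filter them out before trusting a prediction.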

Five examples of pairs of E. coli proteins predicted to interact by the domain fusion analysis. Each protein is shown schematically with boxes representing domains.

Rosetta stone method (gene fusion)

Correlation between gene expression and protein interactions

• The gene expression levels of the subunits in a complex should be related; if so, protein-protein interactions can be verified from coexpression data.

• Methods are tested on protein complexes: ribosome, proteasome, RNA Polymerase II Holoenzyme and replication complexes.

Expression profiles were taken from: cell cycle experiments and expression ratios for overall yeast genome for 300 cell states.

• Difference between absolute expression levels can be calculated as

$$D = \frac{|E_i - E_j|}{(E_i + E_j)}$$

where $E_i$ and $E_j$ are the expression levels of genes $i$ and $j$.
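Both kinds of expression comparison are simple to compute. The sketch below reads the normalized difference as D = |Ei - Ej| / (Ei + Ej) and, as an assumption for the profile comparison, uses the Pearson correlation coefficient; the expression values are hypothetical.

```python
from math import sqrt

def normalized_difference(e_i, e_j):
    """D = |Ei - Ej| / (Ei + Ej) for two absolute expression levels."""
    return abs(e_i - e_j) / (e_i + e_j)

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical absolute levels and expression profiles.
print(normalized_difference(10, 30))
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # co-expressed
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # anti-correlated
```

D is close to 0 when two genes are expressed at similar levels and approaches 1 when the levels are very different; the correlation instead compares the shapes of two profiles across conditions.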

Correlation between gene expression and protein interactions

Results of gene coexpression analysis.

• Subunits from the same complex show coexpression.

• Expression correlation is strong for permanent complexes.

• Transient complexes have weaker correlation.

Coevolution of interacting proteins – “mirrortree” methods

• Interacting proteins may co-evolve and their phylogenetic trees show similarity.

• Similarity between phylogenetic trees can be quantified by correlation coefficient between distance matrices used to construct trees.

Tree of life (TOL) assists in prediction of protein interactions

• There is “background” similarity between trees of any proteins, no matter if they interact or not.

• “Background” tree is constructed from 16S rRNA sequences.

• rRNA-based distances are subtracted from distances of original phylogenetic tree.
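The tree comparison and the background correction can be sketched on small distance matrices. The example assumes that all matrices cover the same species in the same order; the matrices and function names are hypothetical, and the correlation is the Pearson coefficient over the upper triangle of each matrix.

```python
from math import sqrt

def upper_triangle(matrix):
    """Flatten the pairwise distances above the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mirrortree_score(dist_a, dist_b, background=None):
    """Correlate two protein distance matrices over the same species.

    If a background matrix (e.g. rRNA-based distances) is given, it is
    subtracted from both protein matrices before the correlation.
    """
    a, b = upper_triangle(dist_a), upper_triangle(dist_b)
    if background is not None:
        bg = upper_triangle(background)
        a = [d - r for d, r in zip(a, bg)]
        b = [d - r for d, r in zip(b, bg)]
    return pearson(a, b)

# Hypothetical distance matrices for three species.
DIST_A = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
DIST_B = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
RRNA = [[0, 0.5, 0.5], [0.5, 0, 0.5], [0.5, 0.5, 0]]

print(mirrortree_score(DIST_A, DIST_B))
print(mirrortree_score(DIST_A, DIST_B, background=RRNA))
```

A score near 1 means the two families have diverged in parallel across the species; the background subtraction is meant to remove the part of that similarity that every pair of proteins shares because of speciation.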

Verification of experimental protein-protein interactions

• Protein localization method.

• Expression profile reliability method.

• Paralogous verification method.

Protein localization method

True positives:

- Proteins which are localized in the same cellular compartment
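As a sketch of this filter, reported interactions can be split into those whose partners share at least one annotated compartment and those that do not. The localization annotations and names below are hypothetical, and a protein without annotation is treated as sharing no compartment, which is a simplifying assumption.

```python
def filter_by_localization(interactions, localization):
    """Split interactions by whether both partners share a compartment.

    interactions: list of (protein_a, protein_b)
    localization: dict mapping protein -> set of compartments
    Returns (supported, unsupported) lists.
    """
    supported, unsupported = [], []
    for a, b in interactions:
        shared = localization.get(a, set()) & localization.get(b, set())
        (supported if shared else unsupported).append((a, b))
    return supported, unsupported

# Hypothetical screen output and localization annotations.
REPORTED = [("A", "B"), ("A", "C"), ("B", "D")]
LOCALIZATION = {
    "A": {"nucleus"},
    "B": {"nucleus", "cytoplasm"},
    "C": {"mitochondrion"},
}

supported, unsupported = filter_by_localization(REPORTED, LOCALIZATION)
print(supported, unsupported)
```

Interactions in the unsupported list are not necessarily false; they are simply the ones this criterion cannot confirm.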

Expression profile reliability method.

Paralogous verification method.

The PVM method is based on the observation that if two proteins interact, their paralogs are also likely to interact. It calculates the number of interactions between two families of paralogous proteins.

Aligning protein interaction networks.

• The method searches for high-scoring pathway alignments between two networks, where proteins are paired based on their sequence similarity.

[Figure: pathway A, B, C, D, E in one network aligned with pathway a, b, d, e in the other network]

Aligning protein interaction networks.

• The network alignment between worm, yeast and fly detected 71 network regions that were conserved between all three species.

Interaction databases

• Experiment (E)

• Structure detail (S)

• Predicted
– Physical (P)
– Functional (F)

• Curated (C)

• Homology modeling (H)

Protein interaction databases

• Protein-protein interaction databases

• Domain-domain interaction databases

DIP database

• Documents protein-protein interactions from experiment
– Y2H, protein microarrays, TAP/MS, PDB

• 55,733 interactions between 19,053 proteins from 110 organisms.

Organism      # proteins   # interactions
Fruit fly     7052         20,988
H. pylori     710          1425
Human         916          1407
E. coli       1831         7408
C. elegans    2638         4030
Yeast         4921         18,225
Others        985          401

DIP/Prolinks database

• Records functional association using prediction methods:
– Gene neighbors
– Rosetta Stone
– Phylogenetic profiles
– Gene clusters

Other functional association databases

• Phydbac2 (Claverie)

• Predictome (DeLisi)

• ArrayProspector (Bork)

BIND database

• Records experimental interaction data

• 83,517 protein-protein interactions

• 204,468 total interactions include small molecules, NAs, complexes

MPact/MIPS database

• Records yeast protein-protein interactions

• Curates interactions:
– 4,300 PPI
– 1,500 proteins

STRING database

• Records experimental and predicted protein-protein interactions using methods:
– Genomic context
– High-throughput
– Coexpression
– Database/literature mining

More interaction databases

• IntAct (Valencia)
– Open source interaction database and analysis
– 68,165 interactions from literature or user submissions

• MINT (Cesareni)
– 71,854 experimental interactions mined from literature by curators
– Uses IntAct data model

• BioGRID (Tyers)
– 116,000 protein and genetic interactions

Protein interaction databases

• Protein-protein interaction databases

• Domain-domain interaction databases

InterDom database

• Predicts domain interactions (~30000) from PPIs

• Data sources:
– Domain fusions
– PPI from DIP
– Protein complexes
– Literature

PIBASE

• Query by PDB, domain, interface

• 1,946 interacting SCOP domains

• 2,387 unique interaction types

PIBASE/ModBase

• Protein structure models

• Predict interfaces with Pibase

3did database

• Defines domains using Pfam

• Data source: Protein structure data

• 3,304 unique interaction types

• 2,247 interacting domains

• Display linkages and chain locations

iPfam database

• View Pfam interactions on PDB structures

• View individual structures and sequence plots

DIMA database

• Phylogenetic profiles of Pfam domain pairs

• Uses structural info from iPfam

• Works well for moderate information content

Perspective

To further expand our knowledge about protein interaction networks, we need to improve our data-gathering capabilities.

Highly sensitive and accurate methods need to be developed to allow data collection under various cellular functional and temporal states.

Novel computational approaches need to be developed to transfer as many interactions as possible from model organisms to human.
