PROTEINS FROM SEQUENCES TO INTERACTIONS IPM-NUS Workshop on Computational Biology Mehdi Sadeghi.
PROTEINS FROM SEQUENCES TO
INTERACTIONS
IPM-NUS Workshop on Computational Biology
Mehdi Sadeghi
Amino Acids and Proteins
polymers composed of combinations of 20 different amino acids
range in size from about 50 to over 20000 amino acids
A single cell may have 10,000 or more different proteins.
About half of the non-water component of a typical cell is protein.
Four levels of protein structure
Structure → Process
Primary → Assembly
Secondary → Folding
Tertiary → Packing
Quaternary → Interaction
Assembly
• Occurs at the ribosome
• Involves dehydration synthesis and polymerization of amino acids attached to tRNA
• Yields primary structure
Secondary Structure
• Non-linear
• 3-dimensional
• Localized to regions of an amino acid chain
• Formed and stabilized by hydrogen bonding, electrostatic and van der Waals interactions
Importance and Determinants of Secondary Structure
• Folded proteins have segments of regular conformation
• The arrangement of secondary structure elements provides a convenient way of classifying types of
folds
• Steric constraints dictate the possible types of secondary structure
Folding
• The folded structure of a protein is directly determined by its primary structure
Computational prediction of folding is not yet reliable
Tertiary Structure
• The condensing of multiple secondary structural elements leads to tertiary structure
• Tertiary structure is stabilized by efficient packing of atoms in the protein interior
The Protein Domain
• A compact unit of protein structure that is usually capable of folding stably as an independent entity in solution. Domains do not need to comprise a contiguous segment of peptide chain, although this is often the case.
• Proteins whose molecular weights are less than about 20,000 often have a simple globular shape, with an average molecular diameter of 20 to 30 Å, but larger proteins usually fold into two or more independent globules, or structural domains.
• Multidomain proteins probably evolved by the fusion of genes that once coded for separate proteins
The Protein Domain
Identical domains Structurally unrelated
Protein Domains – an alphabet of functional modules
WD40 WW SH2 SH3
14-3-3 ANK3 ARM BH1 C1 C2 CARD
EH EVH FYVE PDZ Death DED EFH
PH PTB SAM
The Universe of Protein Structures
• The number of protein folds is large but limited
• Protein structures are modular and proteins can be grouped into families on the basis of the domains they contain
• The modular nature of protein structure allows for sequence insertions and deletions
Schematic diagram of the domain arrangement of a number of signal transduction proteins. The different modules have different functions.
• Why classify proteins:
– The number of solved structures grows rapidly
– Generate an overview of structure types
– Detect similarities (evolutionary relationships)
– Set up prediction benchmarks
Protein structure Classification
Classification schemes
• SCOP
– Manual classification
• CATH
– Semi-manual classification
• FSSP
– Automatic classification
SCOP
CATH
Singh
Class: SSE composition & packing
Architecture: overall shape of domain, ignore SSE connectivity
Topology (Fold): consider connectivity
Homologous superfamily: a common ancestor
CATH - Class
Class 1: Mainly Alpha
Class 2: Mainly Beta
Class 3: Mixed Alpha/Beta
Class 4: Few Secondary Structures
Secondary structure content (automatic)
CATH - Architecture
Examples: Roll, Super Roll, Barrel, 2-Layer Sandwich
Orientation of secondary structures (manual)
CATH - Topology
Examples: L-fucose Isomerase; Serine Protease; Aconitase, domain 4; TIM Barrel
Topological connection and number of secondary structures
CATH - Homology
Examples: Alanine racemase; Dihydropteroate (DHP) synthetase; FMN dependent fluorescent proteins; 7-stranded glycosidases
Superfamily clusters of similar structures & functions
Protein Motifs
• Protein motifs may be defined by their primary sequence or by the arrangement of secondary structure elements
Helix-turn-helix Four-helix bundle TIM-barrel Zinc finger
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
Protein Motifs
•Identifying motifs from sequence is not straightforward
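A sequence-defined motif such as the pattern above can be searched for with a regular expression. The following is a minimal sketch, assuming the pattern uses PROSITE-style syntax (`x` for any residue, `[...]` for alternatives, `{...}` for exclusions, `(n)` for repeats). The helper name `prosite_to_regex` and the test sequence are hypothetical and only serve as illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an optional repeat count such as "(3)" or "(2,4)" off the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # x = any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {P} = any residue except P
        # [DT]-style alternatives are already valid regex character classes.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

motif = "R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)"
regex = prosite_to_regex(motif)

# Hypothetical sequence constructed to contain one match.
sequence = "AARYGDWKLSTPLIVGG"
hit = re.search(regex, sequence)
if hit:
    print(hit.start(), hit.group())
```

A plain regular-expression hit only says that the pattern occurs; short patterns also match by chance, which is one reason identifying motifs from sequence is not straightforward.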
Quaternary Structure
• Many proteins are composed of more than one polypeptide chain
• All specific intermolecular interactions depend on complementarity
• All types of protein-stabilizing interactions contribute to the formation of intermolecular interfaces
• Inappropriate quaternary interactions can have dramatic functional consequences
Sickle-cell hemoglobin
Quaternary Structure
• Protein assemblies built of identical subunits are usually symmetric
Quaternary Structure
Proteins are the most versatile macromolecules of the cell
“Protein function” may mean the biochemical function of the molecule in isolation, or the
cellular function it performs as part of an assemblage or complex with other
molecules
Functions of Protein
Enzymes
• Globular proteins that facilitate chemical reactions.
Defense Proteins
• Antibodies
• Protein toxins
Transport Proteins
• Plasma membrane proteins carry substances through membranes or form channels or pumps for passage of substances
• Oxygen carrier in circulation (hemoglobin)
• Mineral protein carriers (iron, zinc)
Structural/Support Proteins (Fibrous proteins)
• Connective tissue in animals (collagen – the most abundant vertebrate protein)
• Webs, cocoons and other arthropod structures
• Hair, nails, horns, etc. (keratin)
• Fibrins used in blood clotting
Functions of Protein
Contractile Proteins – locomotion and movement
• Muscle
• Cilia and flagella
• Microtubules, microfilaments and intermediate filaments
Regulatory Proteins
• Hormones
• Gene regulators – transcription factors
• Osmotic regulation
Receptor Proteins
• Membrane surface receptor proteins
• Signal transduction proteins
Recognition Proteins
• Glycoproteins (carbohydrate-protein hybrids) for identification of "self".
Storage Proteins (specialized)
• Examples are casein in milk, ferritin for iron storage, calmodulin for calcium and albumin in eggs
Energy transfer molecules
• Cytochromes
(Figure: 3D STRUCTURE, surrounded by: fold; evolutionary relationships; biological multimeric states; disease states, mutations; active sites, enzyme clefts; antigenic sites; surface properties.)
Protein-Ligand Interactions
Protein Structure-Function Relationships
Overview: Protein Function and Architecture
Binding
Specific recognition of other molecules is central to protein function. The molecule that is bound (the ligand) can be as small as the oxygen molecule that coordinates to the heme group of myoglobin, or as large as the specific DNA sequence
Catalysis
Essentially every chemical reaction in the living cell is catalyzed, and most of the catalysts are protein enzymes.
Overview: Protein Function and Architecture
Switching
Proteins are flexible molecules and their conformation can change in response to changes in pH or ligand binding. Such changes can be used as molecular switches to control cellular processes.
Overview: Protein Function and Architecture
Structural Proteins
Protein molecules serve as some of the major structural elements of living systems. This function depends on specific association of protein subunits with themselves as well as with other proteins, carbohydrates, and so on.
Overview: Protein Function and Architecture
Protein-Protein Interaction Network
Why Study Networks?
• It is increasingly recognized that complex systems cannot be described from a purely reductionist view.
• The actual output may not be predictable by looking only at individual components: the whole is greater than the sum of its parts.
• Understanding the behavior of such systems starts with understanding the topology of the corresponding network.
• Topological information is fundamental in constructing realistic models for the function of the network
• Create models of networks that can help us to understand the meaning of these properties
• Find statistical properties that characterize the structure and behavior of networked systems
• Predict what the behavior of networked systems will be
Why Study Networks?
Basic notions of networks
Network (graph) – a set of nodes connected via edges.
The degree of a node (connectivity) = total number of connections of a node.
Characteristics of networks
k – degree of a node,
P(k) – degree distribution,
Diameter – max of distances between nodes taken over all node pairs.
Clustering coefficient
(Figure: example graph whose nodes have degrees k = 1, 2, 2 and 3.)
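The network characteristics listed above can be computed directly from an edge list. The sketch below is illustrative only: it uses a small hypothetical four-node graph (a triangle with one pendant node, giving degrees 1, 2, 2 and 3), and it assumes an undirected, connected graph. All function names are assumptions made for this example.

```python
from collections import Counter, deque
from itertools import combinations

def build_graph(edges):
    """Return an adjacency map {node: set of neighbours} for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def diameter(adj):
    """Largest shortest-path distance over all node pairs (connected graph assumed)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# Hypothetical example: triangle A-B-C plus a pendant node D attached to C.
graph = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(degree_distribution(graph), diameter(graph), clustering_coefficient(graph, "C"))
```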
Types of Networks
• Social network
Individuals or organizations connected by one or more specific types of interdependency (friendship, common interest, beliefs, etc.)
• Data network
Such as articles and citations, the World Wide Web, ….
• Technological network
Designed networks such as the internet, transport, electrical,….
• Biological networks
Network models
A number of network models have been suggested to characterize networks.
The most widely accepted models are the scale-free and the small-world network models.
An alternative model is the modular network model.
Different network models: Barabási-Albert.
Barabasi & Albert, Science, 1999
Model of preferential attachment.
• At each step, a new node is added to the graph.
• The new node is attached to one of the old nodes with probability proportional to the degree of that node.
Degree distribution – power law distribution: $P(k) \sim k^{-\gamma}$, which appears as a straight line on a plot of $\ln P(k)$ versus $\ln k$.
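The preferential attachment rule can be simulated in a few lines. This is a minimal sketch, assuming each new node attaches to exactly one existing node and that the process starts from two connected nodes; the function name, the seed and the network size are arbitrary choices for illustration.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph node by node; each new node links to one old node,
    chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per edge it touches, so a uniform
    # draw from the list is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        old = rng.choice(endpoints)
        edges.append((new, old))
        endpoints.extend([new, old])
    return edges

edges = preferential_attachment(2000)
degree = Counter(node for edge in edges for node in edge)

# Tabulate how many nodes have each degree k (the unnormalised P(k)).
histogram = Counter(degree.values())
for k in sorted(histogram)[:5]:
    print(k, histogram[k])
print("largest degree:", max(degree.values()))
```

In such a run most nodes keep a single link while a few early nodes accumulate many, which is the heavy-tailed behaviour the power law describes.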
Difference between scale-free and random graph models.
Random networks are homogeneous: most nodes have roughly the same number of links.
Scale-free networks have a small number of highly connected vertices.
Scale free model
• Interaction networks with scale-free model
– Most proteins interact with a small number of partners
– A few proteins ("hubs") interact with many partners
– Resistant to random node removal
– Sensitive to targeted hub removal
• Types of Hubs
– Party Hubs
   • Interact with most of their partners simultaneously
   • Perform specific functions inside a module
– Date Hubs
   • Interact with different partners at different times or locations
   • Connect modules (biological processes) together
Scale free model
Small-world and modular model
The shortest path between any pair of proteins tends to be small, and the network is full of densely connected neighborhoods.
Small-world and scale-free models are not in conflict; rather, they complement each other.
The modular network model suggests that protein interaction networks consist of several densely interconnected functional modules. In this model most nodes have roughly equal degree, which contradicts the scale-free model.
Types of biomolecular networks
• Gene regulatory networks
– Vertices: genes
– Edges: regulatory influences
– Data sources: ChIP-Chip, gene expression data, sequence
• Metabolic networks
– Vertices: metabolites, reactions (catalyzed by enzymes)
– Edges: consumption, production; enzymes, metabolites
– Data sources: sequence, classical biochemistry, mass spectrometry, isotope labeling
• Protein-protein interaction networks
– Vertices: proteins
– Edges: physical interactions
– Data sources: yeast two-hybrid, mass spectrometry
• Signaling networks
– Vertices: proteins with state information
– Edges: interactions modifying states
– Data sources: measurements of post-translational modifications
• Networks of functional links
– Vertices: genes
– Edges: functional relationships
– Data sources: sequence (of several organisms), expression data, any data type allowing definition of a similarity measure…
Protein Interactions
• Proteins usually perform a function as part of a complex rather than as single proteins.
– Protein-protein interactions are of central importance for virtually every process in a living cell (cell growth, cell cycle, metabolic pathways, signal transduction).
• Knowing whether two proteins interact can help us discover unknown proteins' functions:
– If the function of one protein is known, the functions of its binding partners are likely to be related: "guilt by association".
– Thus, a good method for detecting interactions allows us to use a small number of proteins with known function to characterize new proteins.
• Studying protein interaction network architecture allows us to:
– Assess the role of individual proteins in the overall pathway
– Identify candidate genes involved in genetic diseases (gene mutation → perturbed protein interaction → disease)
– Set up the framework for mathematical models
Biological networks are very rich networks with very limited, noisy, and incomplete information. Discovering underlying principles is very challenging.
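The "guilt by association" idea can be sketched as a simple majority vote: an uncharacterized protein is assigned the function most common among its annotated interaction partners. This is only one possible reading of the principle; the function name, protein identifiers and annotations below are hypothetical.

```python
from collections import Counter

def predict_function(protein, interactions, annotations):
    """Return the most frequent function among the annotated partners of
    `protein`, or None if no partner has a known function."""
    votes = Counter()
    for a, b in interactions:
        if protein in (a, b):
            partner = b if a == protein else a
            if partner in annotations:
                votes[annotations[partner]] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical interaction list and known annotations.
interactions = [("X", "P1"), ("X", "P2"), ("P3", "X"), ("P1", "P2")]
annotations = {"P1": "DNA repair", "P2": "DNA repair", "P3": "transport"}

print(predict_function("X", interactions, annotations))
```

Because interaction data are noisy and incomplete, a vote like this is a hypothesis generator rather than a proof of function.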
Importance of protein interaction
Genome: 30,000 genes
Transcriptome: 40,000 -100,000 mRNAs
Proteome: 100,000 - 400,000 proteins
Interactome: >1,000,000 interactions
(Figure: from the human genome through transcripts and the human proteome to protein interactions, on a scale of 10^5 to 10^6.)
Importance of protein interaction
S. cerevisiae (Yeast)
• 4389 proteins
• 14319 interactions
C. elegans (Worm)
• 2718 proteins
• 3926 interactions
D. melanogaster (Fly)
• 7038 proteins
• 20720 interactions
Importance of protein interaction
Yeast Protein Interaction Network
Nodes: proteins
Links: physical interactions
PPIs are more often represented graphically as two-dimensional networks
The two representations (an interaction list and a network layout) differ in localization (a protein occurs multiple times in the list but exactly once in the layout); in neighborhood (in the layout, the neighbors of a protein are easily identified and studied); and in mental image (the network layout allows proteins to be memorized by position).
In positioning the nodes, secondary information can be employed to guide the layout; for example, proteins can be spatially grouped by localization or function. In this way, a particular arrangement of the proteins can even increase the information content.
Types of PPI Network
Methods to investigate protein-protein interactions
• There is a multitude of methods to detect protein-protein interactions.
• Each of the approaches has its own strengths and weaknesses, especially with regard to the sensitivity and specificity of the method.
• A high sensitivity means that many of the interactions that occur in reality are detected by the screen. A high specificity indicates that most of the interactions detected by the screen are also occurring in reality.
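The two quality measures can be made concrete by comparing the interactions reported by a screen with a reference set of real interactions. In the sketch below, sensitivity is the fraction of real interactions that the screen detects. The second measure, described above as specificity, is computed as the fraction of detected interactions that are real; in the statistics literature this quantity is usually called precision or positive predictive value. The function name and the example data are hypothetical.

```python
def screen_quality(detected, true_interactions):
    """Return (sensitivity, fraction_of_detected_that_are_real).
    Interactions are unordered pairs, so (A, B) equals (B, A)."""
    detected = {frozenset(pair) for pair in detected}
    true_interactions = {frozenset(pair) for pair in true_interactions}
    true_positives = len(detected & true_interactions)
    sensitivity = true_positives / len(true_interactions)
    fraction_real = true_positives / len(detected)
    return sensitivity, fraction_real

# Hypothetical reference set and screen result.
real = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
screen = [("B", "A"), ("C", "D"), ("A", "D")]

sensitivity, fraction_real = screen_quality(screen, real)
print(sensitivity, fraction_real)
```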
Methods to investigate protein-protein interactions
• Experimental methods
• Computational methods
Experimental methods
Bait – Prey model: does X (bait) bind with a protein Y (prey)?
In vitro
• Co-immunoprecipitation
• GST-pull down assays
• Protein arrays
• Far-western analysis
• TAP-MS
In vivo
• Yeast two-hybrid system
• Phage display
Physical interaction between protein binding domains
Co-immunoprecipitation
• Co-immunoprecipitation is considered to be the gold standard assay for protein-protein interactions
• Immunoprecipitation (IP) experiment - immune response & precipitation
• Affinity purify a bait protein antigen together with its binding partner using a specific antibody
• Capturing of the immune complex by a solid support
• Elution from the support and analysis by SDS-PAGE and detection by western blot
• It is not a screening approach
Co-immunoprecipitation
GST-pull down assays
• Affinity chromatography method
• Using a tagged or labeled bait by binding a specific affinity matrix
• Purification of a prey protein from a lysate sample or other protein-containing mixture
• GTH(glutathione)-GST(glutathione S-transferase) binding
GST-pull down assays
(Figure: express the GST-fusion protein (GST–gene X in pGEX) in E. coli; prepare protein extract from tissue; mix and incubate.)
GST-pull down assays
(Figure: GST-fusion protein versus GST alone, bound to Sepharose bead-GTH (glutathione).)
Protein arrays
• Antibody-based or bait-based arrays
• High-throughput assays ; screening and detection of specific interactions of proteins from complex mixtures
• Protein expression profiling, protein-protein interaction and enzyme activity
• Binding between the capture proteins immobilized on a surface and the target proteins in the sample solution.
Protein arrays
Mass spectroscopy
• Ionization (e.g. electrospray ionization) produces peptide ions in the gas phase;
• Detection and recording of sample ions: mass-to-charge ratios are assigned to different peaks of the spectra;
• Analysis of MS spectra and protein identification: search a sequence database with the mass fingerprint and find correlations between theoretical and experimental spectra.
Protein Identification by MS
(Figure: two parallel routes ending in a MATCH step.
Experiment: spot removed from gel → fragmented using trypsin → spectrum of fragments generated.
Library: database of sequences (e.g. SwissProt) → artificially trypsinated → artificial spectra built.)
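The library side of this workflow, digesting database sequences in silico and comparing peptide masses with observed ones, can be sketched as follows. This is a simplified illustration, not a real search engine: it assumes trypsin cleaves after K or R unless the next residue is P, allows no missed cleavages or modifications, and uses monoisotopic residue masses. The two-entry database and the tolerance are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def trypsin_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER

def count_matches(observed_masses, sequence, tol=0.01):
    """Number of observed masses explained by a tryptic peptide of `sequence`."""
    theoretical = [peptide_mass(p) for p in trypsin_digest(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed_masses)

# Hypothetical sequence database and "observed" fingerprint.
database = {"protA": "GKAAR", "protB": "SSKTT"}
observed = [peptide_mass("GK"), peptide_mass("AAR")]

best = max(database, key=lambda name: count_matches(observed, database[name]))
print(best)
```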
Tandem affinity purification method (TAP)
• Target protein ORF is fused with the DNA sequences encoding TAP tag
• tagged ORFs are expressed in yeast cells and form native complexes;
• the complexes are purified by TAP method;
• components of each complex are found by gel
electrophoresis, MS and bioinformatics methods.
TAP-MS (Tandem Affinity Purification-Mass Spectrometry)
Yeast two-hybrid experiments
• Many transcription factors have two domains; one that binds to a promoter DNA sequence (BD) and another that activates transcription (AD).
• Transcription factor can not activate transcription unless DNA-binding domain is physically associated with an activating domain
Yeast two-hybrid system
• Detecting protein-protein interactions in yeast
• Transcriptional regulator system
• "Prey"-"bait" model: fusion proteins with a transcriptional activating domain (AD, prey) and a DNA-binding domain (BD, bait)
• The term "two-hybrid" derives from these two chimeric proteins.
• Most commonly used method for large-scale, high-throughput identification of potential protein-protein interactions
(Figure: gene construction in yeast expression vectors; the two hybrid proteins X and Y bind, forming a functional transcription activator; expression of the reporter gene indicates that the proteins bind.)
Gal4/LacZ Y2H system
• Target proteins are fused with BD and AD of GAL4 protein which activate LacZ gene.
• If there is no galactose, GAL80 binds to GAL4 and blocks transcription.
• If galactose is present, GAL4 can activate transcription of beta-galactosidase.
High-throughput Y2H screening
Principle of two-hybrid library and array screens(Peter Uetz, et al. 2001)
Genome-wide analysis by Y2H
• Matrix approach: a matrix of prey clones is added to the matrix of bait clones. Diploids where X and Y interact are selected based on the expression of a reporter gene.
• Library approach: one bait X is screened against an entire library. Positives are selected based on their ability to grow on specific substrates.
Drawback of Y2H
• The interactions can not be tested if a target protein can initiate transcription.
• Fusion of a protein into chimeras can change the structure of a target.
• Protein interactions can be different in yeast and other organisms.
• Proteins which can interact in two-hybrid experiments, may never interact in a cell.
Advantage of Y2H
• in vivo technique, good approximation of processes which occur in higher eukaryotes.
• Transient interactions can be determined, can predict the affinity of an interaction.
• Fast and efficient.
Differences and similarities between Y2H and MS-TAP
• Both methods generate a lot of false positives, only ~50% interactions are biologically significant.
• Y2H produces binary interactions, lack of information about protein complexes, but can detect transient interactions.
• Y2H is in vivo technique.
• MS can detect large stable complexes and networks of interactions.
Phage display
(William G.T. Willats. 2002)
Comparison of methods
Method: Co-immunoprecipitation
– Advantages: independent of cloning and ectopic gene expression; rapid procedures
– Disadvantages: cross-reactivity of antibody; antibody bleeding from column
Method: TAP-MS
– Advantages: generically applicable approach; ability to purify low-abundance proteins and protein complexes; high-throughput identification
– Disadvantages: protein tag might influence protein function; requires two successive steps of protein purification and cannot readily detect transient interactions
Method: GST-pull down assays
– Advantages: applicable to very weak protein interactions; complex formation in vitro
– Disadvantages: competition with in vivo pre-assembled complexes
Method: Protein arrays
– Advantages: high-throughput assay; disease diagnosis
– Disadvantages: difficulty of protein chip production
Method: Yeast two-hybrid system
– Advantages: highly sensitive detection; applicable to a wide range of protein interactions; no biochemical purification
– Disadvantages: stability of folding and activity in yeast; no post-translational modification
Method: Phage display
– Advantages: random library screening of many cDNAs through panning cycles
– Disadvantages: size limitation of protein sequence; incorrect folding or modification
Other methods
• Fluorescence resonance energy transfer (FRET) is a common technique when observing the interactions of only two different proteins.
• Label transfer can be used for screening or confirmation of protein interactions and can provide information about the interface where the interaction takes place.
• Chemical crosslinking is often used to "fix" protein interactions in place before trying to isolate/identify interacting proteins.
• Protein-protein docking, the prediction of protein-protein interactions based only on three-dimensional protein structures, is not yet satisfactory.
Computational methods for prediction of functional association and protein interactions
• Phylogenetic profile method.
• Rosetta Stone approach.
• Gene neighborhood method.
• Gene cluster method.
• Co-evolution methods.
• Classification methods.
Computational methods (Genomic context-based methods)
Genomic context-based methods are based on the assumption that functionally related proteins are encoded by genes that are co-regulated or co-evolved.
These methods seek to predict protein functional associations. Such functional associations may or may not result from physical binding.
− Phylogenetic profile
− Gene neighbors method
− Rosetta stone method (gene fusion)
Phylogenetic profile method
Idea: pairs of proteins that are always both present or both absent across genomes are likely to be functionally dependent and may interact.
Profile of a protein: A vector of 0/1 where each position corresponds to one genome. 1=protein present, 0=protein absent
Phylogenetic profile method
Phylogenetic profile method
• Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors.
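Comparing phylogenetic profiles amounts to comparing 0/1 vectors. The sketch below groups proteins whose profiles are identical and also lists "neighbors" whose profiles differ by one bit. The four proteins and their profiles over four genomes are hypothetical.

```python
from itertools import combinations

def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_diff=0):
    """Protein pairs whose profiles differ in at most `max_diff` positions."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if hamming(profiles[a], profiles[b]) <= max_diff
    ]

# Hypothetical profiles: 1 = protein present in that genome, 0 = absent.
profiles = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 0, 1),
    "P3": (1, 0, 0, 1),
    "P4": (0, 1, 1, 0),
}

print(linked_pairs(profiles))              # identical profiles
print(linked_pairs(profiles, max_diff=1))  # identical or one-bit neighbors
```

With only a few genomes many unrelated proteins share a profile by chance, so the approach becomes more informative as the number of sequenced genomes grows.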
Gene neighbors method
The basic assumption is that genes which interact or are functionally associated tend to be located in physical proximity to each other on the genome.
Despite the effect of neutral evolution which tends to shuffle gene order between distantly related organisms, gene clusters or operons encoding for co-regulated genes are usually conserved
Gene neighbors method
Gene cluster method
• Bacterial genes of related function are often transcribed simultaneously – operon.
• Identification of operons is based on intergenic distances.
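Operon prediction from intergenic distances can be sketched as grouping consecutive genes on the same strand whose gap is below a threshold. The 50 bp cutoff, the function name and the gene coordinates below are assumptions made for illustration, not values taken from the slides.

```python
def predict_operons(genes, max_gap=50):
    """Group genes into putative operons.

    genes: list of (name, start, end, strand) tuples.
    Consecutive genes are merged when they lie on the same strand and the
    intergenic distance (next start minus previous end) is at most max_gap.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        _, start, _, strand = gene
        if operons:
            previous = operons[-1][-1]
            if strand == previous[3] and start - previous[2] <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]

# Hypothetical gene coordinates.
genes = [
    ("a", 100, 400, "+"),
    ("b", 420, 900, "+"),
    ("c", 1500, 2000, "+"),
    ("d", 2010, 2500, "-"),
]
print(predict_operons(genes))
```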
Rosetta stone method (gene fusion)
Assumption (based on observation):
Gene fusion examines pairs of genes that exist individually in some organisms, but as a fused gene in other organisms
Proteins that are fused in one genome are likely to interact, physically or at least functionally, in other genomes.
Five examples of pairs of E. coli proteins predicted to interact by
the domain fusion analysis. Each protein is shown schematically
with boxes representing domains
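The domain fusion analysis can be sketched as a lookup: two separate proteins in a query genome are predicted to be linked if some protein in another genome contains both of their domain families. The sketch assumes each query protein has already been assigned to a single domain family (for example by sequence similarity); all identifiers are hypothetical.

```python
from itertools import combinations

def rosetta_stone_pairs(query_families, reference_proteins):
    """Predict linked protein pairs in the query genome.

    query_families: {protein: domain family} for the query genome.
    reference_proteins: {protein: list of domain families} for other genomes.
    Returns (protein_a, protein_b, fused_protein) triples.
    """
    predictions = []
    for a, b in combinations(sorted(query_families), 2):
        fam_a, fam_b = query_families[a], query_families[b]
        if fam_a == fam_b:
            continue  # the same family twice is not evidence of a fusion
        for fused, domains in sorted(reference_proteins.items()):
            if fam_a in domains and fam_b in domains:
                predictions.append((a, b, fused))
    return predictions

# Hypothetical data: Q1 and Q2 are separate in the query genome, while the
# reference protein R1 carries both of their domain families.
query_families = {"Q1": "famA", "Q2": "famB", "Q3": "famC"}
reference_proteins = {"R1": ["famB", "famA"], "R2": ["famC"]}

print(rosetta_stone_pairs(query_families, reference_proteins))
```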
Rosetta stone method (gene fusion)
Correlation between gene expression and protein interactions
• If the gene expression levels of the subunits of a complex are related, protein-protein interactions can be verified from coexpression data.
• Methods are tested on protein complexes: ribosome, proteasome, RNA Polymerase II Holoenzyme and replication complexes.
Expression profiles were taken from: cell cycle experiments and expression ratios for overall yeast genome for 300 cell states.
• Difference between absolute expression levels can be calculated as $D = \dfrac{|E_i - E_j|}{E_i + E_j}$, where $E_i$ and $E_j$ are the expression levels of genes $i$ and $j$.
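Both coexpression measures are easy to compute: the normalized difference D for absolute expression levels, and a correlation coefficient for expression profiles measured over many conditions. The sketch below assumes Pearson correlation for the profile comparison, which is a common choice but is not specified in the slides; the expression values are hypothetical.

```python
from math import sqrt

def normalized_difference(e_i, e_j):
    """D = |E_i - E_j| / (E_i + E_j); 0 means equal absolute expression."""
    return abs(e_i - e_j) / (e_i + e_j)

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical absolute levels and expression ratios over five conditions.
print(normalized_difference(30.0, 10.0))
subunit_1 = [0.2, 1.1, 2.3, 1.0, 0.1]
subunit_2 = [0.3, 1.0, 2.0, 1.2, 0.2]
print(pearson(subunit_1, subunit_2))
```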
Correlation between gene expression and protein interactions
Results of gene coexpression analysis.
• Subunits from the same complex show coexpression.
• Expression correlation is strong for permanent complexes.
• Transient complexes have weaker correlation.
Coevolution of interacting proteins – “mirrortree” methods
• Interacting proteins may co-evolve and their phylogenetic trees show similarity.
• Similarity between phylogenetic trees can be quantified by correlation coefficient between distance matrices used to construct trees.
Tree of life (TOL) assists in prediction of protein interactions
• There is “background” similarity between trees of any proteins, no matter if they interact or not.
• “Background” tree is constructed from 16S rRNA sequences.
• rRNA-based distances are subtracted from distances of original phylogenetic tree.
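The mirrortree score and the background correction can be sketched together. Each protein family is represented by its pairwise evolutionary distances between species, and the score is the correlation between the two sets of distances, optionally after subtracting a background distance for each species pair. Pearson correlation and a plain subtraction are assumed here; published variants may rescale the rRNA distances first. All distances below are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mirrortree_score(dist_a, dist_b, background=None):
    """Correlate two distance matrices given as {(species1, species2): distance}.

    If `background` is given (e.g. 16S rRNA distances), it is subtracted
    from both matrices before the correlation is computed.
    """
    pairs = sorted(set(dist_a) & set(dist_b))
    if background is not None:
        pairs = [p for p in pairs if p in background]
        xs = [dist_a[p] - background[p] for p in pairs]
        ys = [dist_b[p] - background[p] for p in pairs]
    else:
        xs = [dist_a[p] for p in pairs]
        ys = [dist_b[p] for p in pairs]
    return pearson(xs, ys)

# Hypothetical distances between three species for two protein families.
family_a = {("s1", "s2"): 0.10, ("s1", "s3"): 0.40, ("s2", "s3"): 0.35}
family_b = {("s1", "s2"): 0.20, ("s1", "s3"): 0.55, ("s2", "s3"): 0.60}
rrna = {("s1", "s2"): 0.05, ("s1", "s3"): 0.30, ("s2", "s3"): 0.30}

print(mirrortree_score(family_a, family_b))
print(mirrortree_score(family_a, family_b, background=rrna))
```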
Verification of experimental protein-protein interactions
• Protein localization method.
• Expression profile reliability method.
• Paralogous verification method.
Protein localization method
True positives:
- Proteins which are localized in the same cellular compartment
Expression profile reliability method.
Paralogous verification method.
The paralogous verification method (PVM) is based on the observation that if two proteins interact, their paralogs are likely to interact as well. It counts the number of interactions between the two families of paralogous proteins.
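The counting step can be sketched as follows: for a candidate interaction, count how many other known interactions connect a member of the first protein's paralog family with a member of the second's. How families are defined and what count is considered sufficient are not given in the slides, so the data and the function name below are hypothetical.

```python
def paralog_support(pair, families, interactions):
    """Count known interactions between the paralog families of the two
    proteins in `pair`, excluding the candidate interaction itself.

    families: {protein: set of its paralogs, including the protein itself}.
    interactions: iterable of (protein, protein) pairs from experiments.
    """
    a, b = pair
    candidate = frozenset(pair)
    known = {frozenset(p) for p in interactions}
    supporting = {
        frozenset((x, y))
        for x in families[a]
        for y in families[b]
        if frozenset((x, y)) in known and frozenset((x, y)) != candidate
    }
    return len(supporting)

# Hypothetical families and interaction data.
families = {"A1": {"A1", "A2"}, "B1": {"B1", "B2"}}
interactions = [("A1", "B1"), ("A2", "B2"), ("A2", "B1"), ("A1", "C1")]

print(paralog_support(("A1", "B1"), families, interactions))
```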
Aligning protein interaction networks.
• The method searches for high-scoring pathway alignments between two networks, where proteins are paired based on their sequence similarity.
(Figure: pathway A, B, C, D, E in one network aligned with pathway a, b, d, e in the other.)
Aligning protein interaction networks.
• The network alignment between worm, yeast and fly detected 71 network regions that were conserved between all three species.
Interaction databases
• Experiment (E)
• Structure detail (S)
• Predicted
– Physical (P)
– Functional (F)
• Curated (C)
• Homology modeling (H)
Protein interaction databases
• Protein-protein interaction databases
• Domain-domain interaction databases
DIP database
• Documents protein-protein interactions from experiment
– Y2H, protein microarrays, TAP/MS, PDB
• 55,733 interactions between 19,053 proteins from 110 organisms.
Organism      # proteins   # interactions
Fruit fly     7052         20,988
H. pylori     710          1425
Human         916          1407
E. coli       1831         7408
C. elegans    2638         4030
Yeast         4921         18,225
Others        985          401
DIP/Prolinks database
• Records functional association using prediction methods:
– Gene neighbors
– Rosetta Stone
– Phylogenetic profiles
– Gene clusters
Other functional association databases
• Phydbac2 (Claverie)
• Predictome (DeLisi)
• ArrayProspector (Bork)
BIND database
• Records experimental interaction data
• 83,517 protein-protein interactions
• 204,468 total interactions include small molecules, NAs, complexes
MPact/MIPS database
• Records yeast protein-protein interactions
• Curates interactions:
– 4,300 PPI
– 1,500 proteins
STRING database
• Records experimental and predicted protein-protein interactions using methods:
– Genomic context
– High-throughput
– Coexpression
– Database/literature mining
More interaction databases
• IntAct (Valencia)
– Open source interaction database and analysis
– 68,165 interactions from literature or user submissions
• MINT (Cesareni)
– 71,854 experimental interactions mined from literature by curators
– Uses IntAct data model
• BioGRID (Tyers)
– 116,000 protein and genetic interactions
Protein interaction databases
• Protein-protein interaction databases
• Domain-domain interaction databases
InterDom database
• Predicts domain interactions (~30000) from PPIs
• Data sources:
– Domain fusions
– PPI from DIP
– Protein complexes
– Literature
PIBASE
• Query by PDB, domain, interface
• 1,946 interacting SCOP domains
• 2,387 unique interaction types
PIBASE/ModBase
• Protein structure models
• Predict interfaces with Pibase
3did database
• Defines domains using Pfam
• Data source: Protein structure data
• 3,304 unique interaction types
• 2,247 interacting domains
• Display linkages and chain locations
iPfam database
• View Pfam interactions on PDB structures
• View individual structures and sequence plots
DIMA database
• Phylogenetic profiles of Pfam domain pairs
• Uses structural info from iPfam
• Works well for moderate information content
Perspective
• To further expand our knowledge about protein interaction networks, we need to improve our data-gathering capabilities.
• Highly sensitive and accurate methods need to be developed to allow data collection under various cellular functional and temporal states.
• Novel computational approaches need to be developed to transfer as many interactions as possible from model organisms to human.