Deconvolution of PPI Networks: Approximation Algorithms and Optimization Techniques
Dong Hyun Kim
School of Computer Science
McGill University
Montréal
May 2012
A thesis submitted to McGill University in partial fulfilment of the requirements for the degree of Doctor of Philosophy
© Dong Hyun Kim, 2012
Contents
Abstract
Résumé
Declaration
Acknowledgements
List of Figures
1 Introduction
1.1 Experimental Techniques
1.1.1 Yeast Two-Hybrid (Y2H) Method
1.1.2 Affinity Purification / Mass Spectrometry (AP-MS)
1.1.3 Indirect Approaches
1.2 Computational Analyses of PPI Data
1.2.1 Graph Models for PPI Networks
1.2.2 Identification of Protein-Protein Interactions
1.2.3 Identification of Protein Complexes
1.3 AP-MS based PPI Networks
1.3.1 Protein Complexes in Yeast
1.3.2 Modularity in Yeast Protein Complexes
1.4 Contributions and Thesis Outline
2 Approximation Algorithms and Optimization Techniques
2.1 NP-hard Optimization Problems
2.2 Graph Parameters
2.3 Approximation Algorithms
2.3.1 Lower-bounding the Optimum
2.3.2 Polynomial Time Approximation Schemes
2.3.3 LP-based Algorithms
2.4 Genetic Algorithms
3 Protein Quantification with Shared Peptides
3.1 Problem Formulation for Protein Quantification
3.2 Hardness of Protein Quantification Problems
3.3 Approximation Algorithms for Protein Quantification
3.3.1 An Algorithm for Multicover
3.3.2 An Algorithm for Minimum Protein Types
3.3.3 An Algorithm for Minimum Uniform Error
3.3.4 An Algorithm for Minimum Error Sum
3.4 Experimental Results
3.4.1 Performance on Simulated Data
3.4.2 Protein Quantification from AP-MS Data
3.5 Discussions
3.6 Bibliographic Notes
4 Direct PPI Networks from AP-MS Data
4.1 Background
4.2 Mathematical Modelling and Problem Formulation
4.2.1 A Probabilistic Model for AP-MS Data
4.2.2 Problem Formulation
4.3 Algorithm for A-DIGCOM
4.3.1 Identification of Weakly Connected Regions
4.3.2 Identification of Densely Connected Regions
4.3.3 Cut-based Genetic Algorithm
4.3.4 Restricting the Solution Space for GA
4.4 Experimental Results
4.4.1 Randomized Hill-climbing Algorithm
4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors
4.4.3 Generation of Scale-free Networks
4.4.4 Calculation of Connectivity Matrix from Peptide Counts
4.4.5 Model Validation
4.4.6 Accuracy of the Algorithm
4.4.7 Inferring Direct Interactions from AP-MS Experimental Data
4.5 Discussions
4.6 Bibliographic Notes
5 Hypergraph Modelling of PPI Data
5.1 Preliminaries
5.2 Clique Cover for Graphs with Bounded Treewidth
5.2.1 Running Time of Treewidth-based Algorithm
5.2.2 Modifications for Clique Partition
5.3 Planar Clique Cover
5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth
5.3.2 Modifications for Clique Partition
5.3.3 Baker’s Technique on Planar Graphs
5.4 Experimental Results
5.4.1 Simulated and Biological Networks
5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms
5.4.3 Performance of PTAS for Planar Graphs
5.4.4 Clique Cover in Biological Networks
5.5 Discussions
5.6 Bibliographic Notes
6 Conclusion and Future Directions
Bibliography
Abstract
Understanding the organization of protein-protein interactions (PPIs) as a complex network is one of the main pursuits in proteomics today. With the help of high-throughput experimental techniques, a large amount of PPI data has become available, providing us with a rough picture of how proteins interact in biological systems. One of the leading technologies for identifying protein interactions is affinity purification followed by mass spectrometry (AP-MS). While the AP-MS method can detect protein interactions at biologically reasonable expression levels, the technique still suffers from poor accuracy and lacks a sound approach to interpreting the obtained interaction data.
In this thesis, we look for sources of systematic errors and limitations of the data from AP-MS experiments, and propose several approaches for improvement. In particular, we identify various problems that arise within the experimental pipeline, and propose combinatorial algorithms developed using the theory of approximation algorithms and discrete mathematics. The first part of the thesis deals with the quantification of proteins from MS-based experiments. Existing approaches for protein quantification often use each detected peptide as an indicator for the originating protein. These approaches ignore peptides that belong to more than one protein family. We attack this problem of protein quantification by taking these shared peptides into consideration, and propose a framework for estimating protein abundance via linear programming.
In AP-MS data, the identified protein interactions contain a large number of indirect interactions as artifacts from a series of simultaneous physical interactions. The second part of this thesis studies the problem of distinguishing direct physical interactions from indirect interactions. To do this, we first propose a probabilistic graph model for the PPI data, and design a combinatorial algorithm suited for graphs with the underlying structure that is evident in PPI networks.
While the traditional model for PPI networks is a binary graph representing pairwise interactions, a large number of interactions involve more than two interaction partners. Such collections of proteins interacting concertedly are known as protein complexes, and various approaches have been proposed to identify the complexes in the network. When these complexes overlap, however, the existing complex detection methods often fail to identify the constituent complexes. Taking one step further on this line of research, the last part of this thesis discusses the problem of modelling PPI networks as hypergraphs by studying the clique cover problem on sparse networks.
For each problem discussed throughout the thesis, we obtain either an exact algorithm, an algorithm with a provably good guarantee on the output quality, or a heuristic with efficient running time. Furthermore, each of the proposed algorithms is empirically tested against biological data as well as simulated data, in order to validate both computational efficiency and biological soundness.
Résumé
Understanding the organization of protein-protein interactions (PPIs) as a complex network is one of the greatest problems of modern proteomics. With the help of high-throughput experimental techniques, a large quantity of PPI data has become available, giving us an approximate picture of how interactions between proteins work in biological systems. One of the leading-edge technologies for identifying protein interactions is affinity purification followed by mass spectrometry (AP-MS). Although the AP-MS method allows us to detect protein interactions at biologically reasonable expression levels, this technique still suffers from deficient accuracy and from the lack of a sound approach for interpreting the interaction results it produces.
In this thesis, we look for sources of systematic errors and limitations of the data from AP-MS experiments, and we propose several approaches that bring improvements. In particular, we identify various problems present in the experimental procedure and propose combinatorial algorithms developed using the theory of approximation algorithms and discrete mathematics. The first part of the thesis studies the quantification of proteins from an experiment based on mass spectrometry. Existing approaches for protein quantification often use each detected peptide as an indicator of the originating protein. These approaches ignore peptides that belong to more than one protein. We attack this protein quantification problem by taking these shared peptides into consideration, and propose a framework for estimating protein abundance via linear programming.
The protein interactions identified in AP-MS data contain a large number of indirect interactions that are artifacts of a series of simultaneous physical interactions. The second part of this thesis studies the problem of distinguishing direct physical interactions from indirect ones. To do this, we first propose a probabilistic graph model for PPI data, and design a combinatorial algorithm suited to graphs with properties observed in real PPI networks.
Although the traditional model for PPI networks is a binary graph representing pairwise interactions, a large number of interactions involve more than two interaction partners. Such groups of proteins interacting in concert are called protein complexes. Several approaches have already been proposed to identify the complexes in a network. However, when these complexes overlap, the existing methods fail. The last part of this thesis discusses the problem of modelling PPI networks as hypergraphs by studying the clique cover problem on sparse networks.
For each problem discussed in this thesis, we obtain an exact algorithm, an algorithm with provably good guarantees on the output quality, or a heuristic with an efficient running time. Furthermore, each proposed algorithm is tested empirically on biological and simulated data in order to validate its computational efficiency and biological significance.
Declaration
This thesis contains no material which has been accepted in whole, or in part, for any other degree or diploma. Except for results whose authors are mentioned, the material presented in Chapters 3, 4, and 5 of this thesis is an original contribution to knowledge.
In Section 3.4 of Chapter 3, the biological data for experimental studies was generously provided by Benoit Coulombe’s lab at the Institut de recherches cliniques de Montréal (IRCM).
In Section 4.3.1 of Chapter 4, the work presented in Theorems 4.1 and 4.3 was done jointly with Ashish Sabharwal and my thesis advisors.
All other parts were done independently under the supervision of my thesis advisors, Adrian Vetta and Mathieu Blanchette.
Acknowledgements
During my first visit to Montréal in February 2005, my old advisor Sue Whitesides told me a number of reasons why McGill is one of the best places to study computer science and discrete maths. Seven years later, I can acknowledge every one of those reasons, and possibly double the list with my own.
First and foremost, I express my deepest gratitude to my thesis advisors, Mathieu Blanchette and Adrian Vetta, for their tireless efforts to guide my research, and for making my life as a graduate student ever so enjoyable. Mathieu introduced me to the field of protein-protein interactions, invited me to his Barbados workshop on the subject, was always ready to suggest the next steps in research, and taught me all about what makes good science. Adrian introduced me to the field of approximation algorithms, spent countless hours discussing various lines of attack on problems, and provided me with a seemingly infinite amount of resources in discrete mathematics. Without their guidance, this thesis, or my own scientific becoming, would have been impossible.
I thank my collaborators and colleagues: Ashish Sabharwal’s visit to Montréal resulted in the first part of the work presented in Chapter 4; Mathieu L.-A., Javier, Mathieu R. and Javad taught me a thing or two about proteomics and machine learning; the human PPI data used in Chapter 3 was generously provided by Benoit Coulombe’s lab at the Institut de recherches cliniques de Montréal (IRCM).
I acknowledge NSERC, CIHR, the Walter C. Sumner Fellowship, and my advisors’ generous funding for financial support.
Special thanks to the “family” for all the laughter over pints: js, mk, sh, jm, sc, ij, sy, sj, jy, the list goes on. And finally, to my parents, for their love and patience over the years. Thank you.
List of Figures
1.1 The yeast PPI network
1.2 A schematic of Y2H experiment
1.3 Pipeline of TAP experiment
1.4 Protein interactions from the perspective of Y2H and AP-MS
1.5 Degree distribution of yeast PPI networks
2.1 Tree decomposition of a graph
2.2 Branch decomposition of a graph
2.3 A characteristics matrix and its perfect phylogenetic tree
3.1 Bipartite graph model from peptide-protein relationships
3.2 Robustness of the algorithms for protein quantification
3.3 Peptide-protein graph from an AP-MS experiment (CDK9)
3.4 Peptide-protein graph from an AP-MS experiment (POLR2A)
3.5 Peptide-protein graph from an AP-MS experiment (RPAP3)
3.6 Result of our algorithm on CDK9 dataset
3.7 Result of our algorithm on RPAP3 dataset
4.1 A schematic for indirect interactions in AP-MS data
4.2 A direct interaction network and its connectivity matrix
4.3 The outcome of our 3 phase algorithm for direct PPI network
4.4 Comparison of results from our direct PPI data against Y2H interaction network from Yu et al.
5.1 Node types in nice tree decomposition
5.2 Sphere-cut decomposition
5.3 Planar clique cover using Baker’s technique
5.4 Performance of the clique cover algorithm for graphs with bounded treewidth
5.5 Performance comparison of treewidth-based algorithm vs. branchwidth-based algorithm
Chapter 1
Introduction
Upon completion of the Human Genome Project¹, one of the compelling topics in the biological sciences is the large-scale study of proteins encoded by the genes, a field referred to as proteomics. Indeed, it is the proteins that execute the functions programmed in the genes, and analyzing the proteome will thus lead us to a better understanding of how cellular processes work in living organisms.
¹ An international scientific research program whose goal was to sequence the DNA and identify the genes of the human genome; it began in 1990 and was completed in 2003. (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml)
While the genome is considered as a long sequence of DNA, the proteome is often regarded as a much more complex system. Because cellular processes are caused by proteins interacting concertedly within the cell, the proteins in the proteome form a large network called the protein-protein interaction (PPI) network, sometimes referred to as the interactome. As a result, both structural and functional analyses of the proteome often require detailed examinations of the PPI network. Efforts to construct the PPI network began with the development of various experimental techniques to identify the interactions. Using these technologies, Saccharomyces cerevisiae (the yeast) was shown to contain approximately 6000 proteins, and over 78,000 interactions have now been identified. To give an example, Figure 1.1 depicts a small portion of the yeast PPI network, using only the highest confidence interaction data from Krogan et al. [93]. Humans, being more complex organisms, are expected to have approximately 25,000 proteins sharing as many as 650,000 interactions [130]. The sheer volume of this data makes manual analysis intractable, and thus computational analyses have become essential to a deeper understanding of the proteome.
Figure 1.1: The yeast PPI network constructed using Cytoscape software (http://www.cytoscape.org) with high confidence PPI data from Krogan et al. [93].
More recently, it has been revealed that many of these interactions may occur only when three or more specified proteins are all present at the site of an interaction. Such a collection of proteins interacting concertedly is called a protein complex, and the theory around protein complexes has opened a new set of computational problems, such as the identification and functional annotation of protein complexes. To make things even more interesting, many of these interactions depend on the phase of the cell development cycle during which the interaction takes place, as well as on the localization within the cell [80, 97]. Consequently, such temporal and spatial dependence makes the structure of PPI networks much more dynamic than originally expected.
Due to the natural structure of the proteome as a network, graph theory has played a crucial role in computational analyses of PPI networks. As shall be demonstrated later, however, it is often difficult to formulate an optimization problem that correctly captures all the properties of PPI networks, and even when a clean optimization problem is formulated, the problem often turns out to be computationally intractable. In many of these cases, heuristic algorithms targeting local optima have been popular, and very few algorithms with a provable guarantee on output quality have been proposed. As such, clean mathematical models and sound algorithms remain in high demand in the field of protein-protein interactions.
This thesis and its contributions can be divided into three parts; each part addresses a particular challenge related to the analysis of PPI networks, and provides a novel approach to reduce errors and noise present in the data. The principal results of this thesis will be outlined in Section 1.4. Prior to that, however, we first give an introduction to the field of protein-protein interactions from the perspective of bioinformatics and discrete mathematics. This introductory chapter consists of three parts. In Section 1.1, we give a brief survey of various experimental techniques for identifying protein interactions. Section 1.2 then discusses computational approaches to analyze the datasets produced by those experimental techniques; one of the leading technologies that is prevalent in the literature is affinity purification followed by mass spectrometry (AP-MS). In Section 1.3, we review studies on PPI networks produced by AP-MS experiments, and point out the limitations of existing computational approaches, which will then lead us to the topics of this thesis, namely: (1) protein quantification; (2) predicting the direct interaction network; and (3) hypergraph modelling of PPI networks.
1.1 Experimental Techniques
In this section, we introduce several medium- to high-throughput experimental methods that have been proposed to identify protein interactions, and discuss the nature and the quality of the data obtained by each technique. Each technique makes use of different mechanisms to identify the interactions, and bears its own advantages and disadvantages. For example, some techniques identify direct, physical interactions while others are designed to find indirect, functional relationships between proteins. Therefore, understanding the intrinsic differences between the datasets from various experimental techniques will allow us to realistically model the given data, and devise highly customized algorithms for the formulated problems.
1.1.1 Yeast Two-Hybrid (Y2H) Method
Yeast two-hybrid (Y2H) is an in vivo² technique to detect PPIs by testing physical interactions between two proteins [49, 75, 137]. It was discovered that transcription factors found in eukaryotic organisms have two distinct domains: (1) a binding domain (BD), and (2) an activating domain (AD) [49]. The binding domain is a module responsible for binding to a promoter DNA sequence, while the activating domain activates the transcription. The transcription is inactivated while the two domains are far from each other, but it can be restored when a binding domain is in close proximity to an activating domain.
A Y2H experiment starts by fusing the binding domain to a protein X (known as the bait), and the activating domain to a protein Y (known as the prey). If the bait X and the prey Y interact physically, the activating domain is in close proximity to the binding domain. The binding domain then binds to the upstream activating sequence (UAS) of a promoter, and thus activates the transcription of the reporter gene³. Figure 1.2 shows a schematic representation of a Y2H system. To do this experiment in a high-throughput manner, either a matrix of prey clones or a library of random cDNA fragments is used [11, 50, 142]. In the matrix approach, a matrix of prey clones is mated with baits, and the interacting prey proteins are identified by the expression of a reporter gene and the position on the matrix. In the library approach, each bait is screened against a library of random cDNA fragments, and the interaction partners are identified by DNA sequencing.
² An experiment is said to be in vivo (Latin for “within the living”) when the subject is a living organism, as opposed to in vitro, a controlled environment.
³ In order to determine whether a gene is expressed, a reporter gene (which is easily identifiable when expressed) is attached to a regulatory sequence of the gene of interest. Commonly used reporter genes often induce visually identifiable characteristics, e.g. LacZ, GFP, etc.
Figure 1.2: A Y2H experiment. Protein X (the bait) is fused with a BD that binds to the upstream activating sequence (UAS) of a promoter. The bait interacts with Protein Y (the prey) that is fused with an AD, activating the transcription of a reporter gene via the mediator complex.
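The readout of the matrix approach described above can be illustrated with a minimal sketch. It assumes a deliberately idealized screen: the prey layout, the boolean reporter signal, and the function name `read_matrix_screen` are all hypothetical and introduced only for illustration; real screens involve replicates and noisy signals that this sketch ignores.

```python
def read_matrix_screen(prey_layout, reporter_on, bait):
    """Return (bait, prey) pairs for matrix positions where the reporter is expressed.

    prey_layout: 2D list naming the prey clone at each matrix position (hypothetical).
    reporter_on: 2D list of booleans of the same shape; True means reporter expression.
    """
    hits = []
    for row_names, row_signal in zip(prey_layout, reporter_on):
        for prey, expressed in zip(row_names, row_signal):
            if expressed:
                # The matrix position identifies which prey interacted with the bait.
                hits.append((bait, prey))
    return hits


# Toy 2 x 3 matrix of prey clones screened against a single bait X.
prey_layout = [["P1", "P2", "P3"],
               ["P4", "P5", "P6"]]
reporter_on = [[False, True, False],
               [False, False, True]]

print(read_matrix_screen(prey_layout, reporter_on, bait="X"))
```

In the library approach, the position lookup would instead be replaced by sequencing the cDNA fragment recovered from each positive colony.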
One advantage of a Y2H experiment is that it allows us to identify physical (rather than functional) associations between interaction partners. Furthermore, being an in vivo high-throughput technique, Y2H produces large-scale PPI data that potentially includes even the weaker, transient interactions. As such, the Y2H PPI data provide us with a comprehensive snapshot of the proteome. However, because each interaction must occur between the proteins fused with the bait-prey pair, the interactions identified by Y2H experiments are restricted to binary interactions. There are other limitations of Y2H systems: proteins that initiate transcription on their own cannot be targeted since they would result in expression of the reporter gene regardless of the presence of bait proteins; the protein fusion may cause changes in the structure of the protein, and the use of such bait proteins may result in incorrect interaction partners. Such a non-native environment causes high false positive ratios for Y2H experiments. To rectify these issues, many in silico⁴ methods have been proposed to post-process Y2H experimental data (see Section 1.2.2). With the help of these computational analysis tools, Y2H experiments have been the most widely used approach to obtain physical interaction data.
⁴ Computational analyses are often called in silico methods, as opposed to in vivo or in vitro.
Protein-fragment Complementation Assay (PCA). A similar technique has been proposed by Michnick [54, 103, 134]. In a protein-fragment complementation assay (PCA), the targeted proteins (bait and prey) are each fused with an incomplete fragment of a third protein (the reporter), and are separately expressed in vivo. The interaction between the bait and the prey brings the fragments of the reporter into close proximity, allowing the reporter to reform and then be identified. While similar to Y2H experiments, PCA experiments can test protein interactions at various subcellular compartments within a pathway of interest in a quantitative manner, making the technique desirable for drug discovery purposes. Typically, PCA detects many interactions for membrane proteins whereas Y2H identifies many interactions for nuclear proteins [78].
1.1.2 Affinity Purification / Mass Spectrometry (AP-MS)
Affinity Purification followed by Mass Spectrometry (AP-MS) [36, 58, 93, 119], sometimes referred to as tandem affinity purification (TAP), is a technique that allows us to determine the co-complex membership of interacting proteins. This method is a two-step process consisting of the purification of protein complexes, and the identification of the interaction partners via mass spectrometry.
Affinity purification. A TAP tag consists of a calmodulin binding peptide (CBP) followed by a tobacco etch virus (TEV) protease cleavage site and the IgG binding domains of Staphylococcus protein A [118, 119]. The open reading frame of the target protein (known as the bait) is fused with the TAP tag at the end, and is expressed in vivo within the cells of the host organism (e.g. yeast or human). Once the AP-tagged bait protein is expressed, it can interact with other proteins (known as preys), and form protein complexes. These protein complexes go through two purification processes. Initially, protein A binds tightly to an IgG matrix. After the contaminants are washed out, the link between protein A and the IgG matrix is released by the protease. The eluate of this step is then incubated with calmodulin-coated beads, and after a second wash, the targeted protein complexes are released. See Figure 1.3 for a schematic representation of the AP process.
Figure 1.3: Pipeline of TAP experiments. At the first affinity column, the bait protein is cut at the TEV protease cleavage site after washing. At the second affinity column, the bait-prey pairs are pulled down by calmodulin beads, and the resulting eluate forms native complexes which can then be identified using gel electrophoresis and mass spectrometry.
Mass spectrometry. The result of the purification processes is the target protein complexes formed by the interaction partners of the bait protein. These protein complexes go through gel electrophoresis, resulting in component peptide fragments that can be identified using mass spectrometers.
Mass spectrometers then read in the peptide fragments, and produce ions with charges based on their masses. Polypeptide sequences are then identified using their mass-to-charge ratios. Various methods have been proposed to convert the peptide molecules into ions in the gas phase, such as electrospray ionization (ESI) [144] and matrix-assisted laser desorption/ionization (MALDI) [85, 116]. Furthermore, there are several algorithms to analyze the mass spectra produced by MS and either identify the peptides present, or even quantify the peptide abundance [59, 115, 135].
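To make the notion of a mass-to-charge ratio concrete, the following minimal sketch computes the m/z value of a protonated peptide ion from its amino acid sequence. It is an illustration rather than any of the identification algorithms cited above: the function names are hypothetical, only a handful of residues are tabulated, and the constants are the commonly quoted monoisotopic masses in daltons.

```python
# Monoisotopic residue masses in daltons (commonly quoted values; partial table).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056    # one water molecule is added for the peptide termini
PROTON = 1.007276   # mass of the proton that carries each unit of charge


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a peptide: sum of residues plus one water."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def mass_to_charge(sequence, charge):
    """m/z of the peptide ion carrying `charge` extra protons (charge >= 1)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


# The same toy peptide appears at different m/z values depending on its charge state.
for z in (1, 2):
    print(z, round(mass_to_charge("GAS", z), 4))
```

Peptide identification tools essentially invert this calculation, matching observed m/z values against those predicted for candidate sequences.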
Due to its ability to perform at biologically reasonable expression levels in human cells, as well as its ability to detect protein complexes with fewer false positives [36], the AP-MS approach has become increasingly popular. One caveat of this approach, however, is that a significant number of the co-purified prey proteins are in fact indirect interaction partners of the bait protein. This is an artifact of chains of binary interactions that occur simultaneously, and AP-MS cannot distinguish such indirect interactions from direct physical interactions. For example, Figure 1.4 shows how the same complex can be perceived differently by different experimental techniques. In Chapter 4 of this thesis, we discuss the problem of identifying direct physical interactions from AP-MS data.
In addition to the difficulty of interpreting indirect interactions, AP-MS carries a
high chance of capturing contaminants at the AP level: gel contamination, nonspecific
binding to TAP columns, etc. To reduce the false positives from these contaminants,
several computational approaches have been proposed [124, 128]. Some of these ap-
proaches make use of the network topology to assign a purification score to each hit [35].
On the other hand, Lavallée-Adam et al. [95] recently provided a Bayesian model specif-
ically for contaminants to directly identify false positive interactions.
[Figure 1.4: diagrams over three proteins X, Y, and Z, in panels labelled "Actual protein interaction", "Binary interaction methods (Y2H, PCA)", and "Co-complex membership methods (AP-MS)".]
Figure 1.4: Different types of protein interactions among three proteins, and how they are viewed in different experimental techniques (assuming all three proteins have been tagged as a bait). Note that binary testing methods (e.g. Y2H and PCA) cannot distinguish the last two types of interactions, while co-complex membership testing approaches (e.g. AP-MS) identify the first two types of interactions as the same topology.
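The difference between the two kinds of observation can be illustrated with a small sketch. It assumes an idealized setting that is not part of the experimental protocols above: a binary method reports exactly the direct contacts, and a co-complex method reports every protein reachable from the bait through a chain of direct contacts. The function names `binary_view` and `co_complex_view` are hypothetical.

```python
def binary_view(direct_edges):
    """Idealized binary assay (assumption): reports exactly the direct contacts."""
    return {frozenset(e) for e in direct_edges}


def co_complex_view(proteins, direct_edges):
    """Idealized co-complex assay (assumption): each bait pulls down every
    protein connected to it by a chain of direct contacts."""
    adj = {p: set() for p in proteins}
    for u, v in direct_edges:
        adj[u].add(v)
        adj[v].add(u)
    reported = set()
    for bait in proteins:
        # Depth-first search from the bait collects the whole connected group.
        seen, stack = {bait}, [bait]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reported |= {frozenset((bait, prey)) for prey in seen - {bait}}
    return reported


proteins = ["X", "Y", "Z"]
chain = [("X", "Y"), ("Y", "Z")]                 # X and Z never touch
triangle = [("X", "Y"), ("Y", "Z"), ("X", "Z")]  # all pairs touch

print(binary_view(chain) == binary_view(triangle))
print(co_complex_view(proteins, chain) == co_complex_view(proteins, triangle))
```

Under these assumptions the chain and the triangle differ in the binary view but coincide in the co-complex view, which is the source of the indirect X-Z interaction.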
1.1.3 Indirect Approaches
Aside from the methods described above, there are indirect approaches to associate pro-
teins with functional interactions.
Correlated mRNA expression (Synexpression). Genes can be partitioned into distinct
groups depending on the mRNA levels measured under a variety of different cellular
conditions. Then, the genes in the same partition are enriched to encode proteins that
may interact with each other [44]. Though this technique gives a broad coverage5, the
accuracy of the data is poor as it is difficult to infer direct physical (or even functional)
interactions solely from gene expression levels.
5 Coverage refers to (correctly identified interactions) / (total number of true interactions). Accuracy, on the other hand, refers to (correctly identified interactions) / (number of identified interactions).
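The two ratios in the footnote can be computed directly from a set of predicted interactions and a reference set of true interactions. The following is a minimal sketch; the function name and the toy data are illustrative assumptions, and interactions are treated as unordered pairs.

```python
def coverage_and_accuracy(predicted, true):
    """Return (coverage, accuracy) for unordered protein pairs."""
    predicted = {frozenset(p) for p in predicted}
    true = {frozenset(t) for t in true}
    correct = predicted & true
    coverage = len(correct) / len(true) if true else 0.0
    accuracy = len(correct) / len(predicted) if predicted else 0.0
    return coverage, accuracy


# Toy data (hypothetical): four true interactions, three predictions.
true_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
predicted_pairs = [("B", "A"), ("B", "C"), ("A", "E")]
cov, acc = coverage_and_accuracy(predicted_pairs, true_pairs)
print(cov, acc)
```

Here two of the three predictions are correct, and they recover two of the four true interactions.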
Method                  Interaction Type
Y2H                     physical, binary
PCA                     physical, binary
AP-MS                   physical, complex
Synexpression           functional
Genetic interactions    functional
Genome analysis         functional
Table 1.1: Different techniques to identify protein-protein interactions, and types of interactions capturedby each technique.
Genetic interactions. If two nonessential genes cause lethality when simultaneously
knocked out, they are often functionally associated, and thus their encoded proteins
may interact [12, 148]. This technique allows pairwise detection of functionally associ-
ated interactions.
Genome analysis. Some protein interactions can be detected via computational meth-
ods [112]. For example, (1) interacting proteins often show similar phylogenetic pro-
files,6 and (2) seemingly unrelated genes are sometimes fused together into one polypep-
tide chain. Though fast and inexpensive, these computational methods rely heavily on
the existing data such as phylogenetic trees, and orthology7 between proteins which are
prone to biases towards well-studied proteins.
Observe that different techniques have different goals in detecting interactions. While
Y2H, PCA, and AP-MS methods are designed to discover physical bindings between proteins,
the others seek to predict functional associations between entities such as
transcriptional regulators and their associated pathways. The differences in coverage and
accuracy cause each technique to produce a unique distribution of interactions [141].
Consequently, when comparing or merging PPI data from different experimental
techniques, one must carefully take these intrinsic differences into account.
Table 1.1 summarizes different aspects of the methods discussed in this section.
6 Across a number of species, each gene can be checked for presence, resulting in a binary vector called a phylogenetic profile. If two proteins interact in the same biological pathway, the corresponding genes would show similar phylogenetic profiles.
7 Homologous sequences are orthologous if they were separated by a speciation event, as opposed toparalogous if caused by a gene duplication event.
1.2 Computational Analyses of PPI Data
The traditional way to model PPI networks is to construct a combinatorial graph where
each node represents a protein, and two nodes are joined by an edge if and only if there is
an interaction between the corresponding proteins. While this classical model captures
the most basic information about the interactome, it does not accommodate various
sources of errors that are common in many experimental datasets.
On the other hand, protein complexes are hidden away within the constructed (bi-
nary) PPI network. Discovering these complexes would allow us to model the inter-
actome as a more complex system. Therefore, further post-experimental analyses are
imperative to obtain a proper view of PPI networks. In this section, we review compu-
tational methods for post-processing experimental PPI data, and we discuss how such
approaches can help characterize unknown properties of the interactome.
1.2.1 Graph Models for PPI Networks
One approach to the analysis of PPI networks is looking at topological properties of the
networks. While studying the topological structure of PPI networks may not immedi-
ately lead us to biological discoveries, understanding the characteristics of the networks
is typically a crucial step toward modelling real world networks.
Local Structure Models
Network motifs. Network motifs are patterns of interconnections (induced subgraphs)
that occur in a given biological network much more often than they would in random
networks. Shen-Orr et al. [127] have shown that the transcriptional regulatory network
of Escherichia coli contains three highly frequent motifs, each with a distinct function in
gene expression. This suggests that identifying different motifs in a PPI network would
help us describe different functional protein groups within the network and, furthermore,
determine the function of uncharacterized proteins.
Figure 1.5: Degree distribution of yeast PPI network of 1210 proteins with 2357 interactions [93]. On a log-log plot, the distribution fits a power law y = a x^{−γ}, where a = 874.87 and γ = 1.809.
Graphlets. A similar notion of local structure called graphlets has been proposed by
Przulj et al. [117]. Graphlets are all possible simple graphs over 3 to 5 vertices (up to
isomorphism). In order to analyze the global structure of PPI networks, Przulj et al.
looked at the frequency distribution of each graphlet appearing in the PPI networks, and
compared it against other global model networks such as Erdős-Rényi random graphs,
scale-free networks, and geometric random networks (see below). While network mo-
tifs for a given network capture the most significant local structures within the network,
this bottom-up approach of graphlet distribution provides a metric to compare different
networks. More specifically, the distribution of each graphlet measured by the relative
frequency of graphlets allows us to analyze networks of different sizes. This is particu-
larly useful when PPI networks are compared to various graph models (see next section)
whose network sizes may vary.
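As an illustration of a relative graphlet frequency, the sketch below restricts attention to the two connected graphlets on 3 vertices (the path and the triangle) and counts them by brute force over all vertex triples. This restriction and the function name are simplifying assumptions; the analysis described above uses all graphlets on 3 to 5 vertices.

```python
from itertools import combinations


def three_node_graphlet_frequencies(edges):
    """Relative frequency of the two connected 3-vertex graphlets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):
        # Number of edges in the subgraph induced by {a, b, c}.
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 3:
            counts["triangle"] += 1
        elif k == 2:
            counts["path"] += 1
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# A triangle A-B-C with a pendant vertex D attached to C.
freqs = three_node_graphlet_frequencies(
    [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
)
print(freqs)
```

Because the output is a relative frequency, two networks of different sizes can be compared on the same scale.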
Global Structure Models
Scale-free networks. In many real world networks, including PPI networks, the degree
distribution of the nodes follows the power law P(k) ≈ k^{−γ}; that is, there are many more
nodes with a small number of neighbours than nodes with a very large number of neigh-
bours. Networks that exhibit such a degree distribution are called scale-free [10], and this
property has profound effects on real world networks. For example, scale-free networks
are resistant to random failures on the nodes due to the small number of high-degree
hubs – there are many more low-degree nodes which are equally likely to fail. This ro-
bustness can also be observed in PPI networks, and the connectivity of a node can be
correlated to the essentiality of the corresponding protein. For example, studies on the
yeast PPI network [79] have shown that:
1. 93% of the yeast proteins have degree at most 5, but only 21% of these cause
lethality when deleted;
2. only 0.7% of the proteins have degree at least 15, but 62% of these are essential.
These results suggest that non-essential proteins, whose disruption is non-lethal,
tend to have lower degree than essential proteins. Moreover, proteins in the same func-
tional group tend to show similar number of interacting partners. Such structure-function
relationships can be exploited for the prediction of protein interactions or complexes
that are yet unknown, as we discussed in Section 1.1.3.
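To make the power-law degree distribution concrete, the following sketch computes a degree histogram from an edge list and estimates a and γ in y = a k^{−γ} by ordinary least squares on the log-log values. Least squares on a log-log plot is a crude estimator chosen here only for simplicity; it is an assumption of this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_histogram(edges):
    """Map each degree k to the number of nodes having degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def fit_power_law(hist):
    """Fit y = a * k**(-gamma) by least squares on (log k, log y).

    Requires at least two distinct degrees with positive counts.
    """
    xs = [math.log(k) for k in hist]
    ys = [math.log(hist[k]) for k in hist]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic histogram that follows y = 1000 * k**-2 exactly.
synthetic = {k: 1000 * k ** -2 for k in (1, 2, 4, 8)}
a, gamma = fit_power_law(synthetic)
print(a, gamma)
```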
Geometric random networks. A geometric random network [113] is a graph model
where the nodes correspond to randomly distributed points in a metric space, and nearby
nodes are joined by an edge. Przulj et al. [117] used the graphlet distributions to charac-
terize PPI networks as a geometric random network. In particular, upon comparing the
yeast and fruitfly PPI networks against various random graph models (including scale-
free graphs), the graphlet distributions of these PPI networks were closer to that of ge-
ometric random networks than any other model. This result suggests that the geomet-
ric random network may be a better model for PPI networks than the widely accepted
scale-free networks.
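A geometric random network is simple to generate. The sketch below assumes points drawn uniformly from the unit cube with Euclidean distance, and joins two nodes when their distance is at most a given radius; the choice of space, metric, and parameter names are assumptions made for illustration.

```python
import math
import random


def geometric_random_graph(n, radius, dim=2, seed=0):
    """Sample n uniform points in [0, 1]^dim and join pairs within `radius`."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= radius
    ]
    return points, edges


points, edges = geometric_random_graph(30, 0.3, seed=1)
print(len(points), len(edges))
```

Graphlet frequencies of such a sample could then be compared against those of an observed PPI network.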
1.2.2 Identification of Protein-Protein Interactions
The PPI data from high throughput experiments such as Y2H, PCA or AP-MS contain a
significant amount of noise, and such high false positive/negative ratios deteriorate the
overall confidence level of identified interactions. These high throughput methods have
all been used to perform large-scale studies of the yeast proteome [58, 70, 93, 134, 149],
but comparative assessments of those datasets continuously reveal that a significant
portion of identified interactions in one study is unique to that dataset and not observed
in others [78, 141]. Therefore, in silico approaches to improve the confidence level of PPI
data are an attractive topic of research.
Topology-based PPI detection. Inevitably, the PPI data from high throughput
experiments is used as the initial input. Since the global view of the currently known
interactome may be heavily biased due to incomplete experiments, authors often resort
to examining the local topological structure. For example, Saito et al. [122, 123] devel-
oped a measure called interaction generalities (IG1, IG2) which considers the restricted
neighbourhood for each protein in the network. The main idea behind this measure
is that if a protein interacts with many interaction partners but these interaction part-
ners exhibit no further interactions among themselves then these interactions are more
likely to be false positives. This idea works particularly well for PPI data from Y2H exper-
iments, since some “sticky” proteins in the Y2H assays have a tendency to turn on the
positive signals by themselves, irrespective of their interaction partners.
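The intuition behind this measure can be sketched as follows. The score below is a simplified proxy in the spirit of the idea just described, not the published definition of IG1 or IG2: for each interaction it counts the neighbours of either endpoint that have no other interactions, plus one. A high score flags interactions around a protein whose partners do not interact further.

```python
def interaction_generality(edges):
    """Simplified proxy (assumption): 1 + number of degree-1 neighbours
    attached to either endpoint of each interaction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in edges:
        neighbours = (adj[u] | adj[v]) - {u, v}
        dangling = {w for w in neighbours if len(adj[w]) == 1}
        scores[frozenset((u, v))] = 1 + len(dangling)
    return scores


# A "sticky" hub H with four isolated partners, next to a triangle P-Q-R.
edges = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"),
         ("P", "Q"), ("Q", "R"), ("P", "R")]
scores = interaction_generality(edges)
print(scores[frozenset(("H", "a"))], scores[frozenset(("P", "Q"))])
```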
More recently, two other measures have been proposed by Chen et al. [26, 27], and
Pei and Zhang [111]. The method of Chen et al., called Interaction Reliability by Alter-
native Path (IRAP) [27], considers alternative paths between two interacting partners.
If a pair of proteins exhibit a direct interaction, it is likely that there are alternative se-
quences of interactions that join the two proteins. Hence, IRAP looks for the shortest
(measured by distance) such alternative path for each pair of proteins, and the final
confidence level is determined by the strength of the discovered alternative path.
On the other hand, Pei and Zhang [111] have shown that, by considering all alterna-
tive paths of different lengths, a better result can be achieved. For each pair of vertices,
they consider all paths of length k > 1. Then, the proposed measure called PathRatio
is computed as a weighted sum of all the paths between two vertices. Since enumerat-
ing all paths between two vertices is computationally hard, a rough estimate is obtained
by only computing for small values of k . Using the PPI data from various sources, the
PathRatio method achieved better results in functional homogeneity, localization ho-
mogeneity, and gene expression distance when compared against a simulation using
IRAP. On the other hand, the iterative hill-climbing version of IRAP, called IRAP∗ [26],
showed similar performance to PathRatio.
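The idea of scoring an interaction by its alternative paths can be sketched by enumerating simple paths of bounded length. The scoring rule below, the sum over alternative paths of the product of edge reliabilities, is a hypothetical stand-in chosen for illustration; it is neither the IRAP nor the PathRatio formula, whose details are not reproduced in this section.

```python
def alternative_path_score(weights, source, target, max_len=3):
    """Sum, over simple paths with 2..max_len edges from source to target,
    of the product of edge reliabilities (illustrative scoring rule)."""
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    total = 0.0

    def extend(node, visited, strength, length):
        nonlocal total
        if length == max_len:
            return
        for nxt, w in adj.get(node, {}).items():
            if nxt == target:
                # Skip the direct edge itself (a path with a single edge).
                if length + 1 >= 2:
                    total += strength * w
            elif nxt not in visited:
                extend(nxt, visited | {nxt}, strength * w, length + 1)

    extend(source, {source}, 1.0, 0)
    return total


weights = {
    ("A", "B"): 0.9,                                        # interaction being assessed
    ("A", "C"): 0.5, ("C", "B"): 0.5,                       # alternative path, 2 edges
    ("A", "D"): 1.0, ("D", "E"): 1.0, ("E", "B"): 0.5,      # alternative path, 3 edges
}
print(alternative_path_score(weights, "A", "B", max_len=3))
```

Bounding `max_len` mirrors the restriction to small values of k described above, since the number of simple paths grows quickly with their length.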
While these topology-based methods attempt to assign confidence levels to inter-
actions, the resulting PPI networks still contain high ratios of false positives and false
negatives. This may be due to the fact that the models used in these methods do not
truly reflect the nature of the experimental PPI data – for example, the input PPI data
may contain a large number of indirect interactions from AP-MS experiments, but no
known algorithm addresses this issue explicitly (the algorithms discussed in this section
are evaluated only using the PPI data from Y2H experiments).
Phylogeny-based PPI detection. The poor accuracy and coverage of topology-based
methods may also be due to the lack of information contained in the PPI data itself.
To address this, several approaches have been proposed using phylogenetic informa-
tion. For example, a protein interaction in one species may be inferred from Rosetta
Stone proteins (two genes fused into a single one) in another organism (Gene Fusion
Method [131]). Or, in the case of bacterial genomes, the conservation of the order of
genes may be a good indication of interactions (the Gene Order Conservation Method [37]).
A potentially more powerful approach is using a set of reference organisms’ genomes.
A phylogenetic profile of a protein A is a vector whose i -th entry represents the presence
or the absence of protein A in organism i . Under the assumption that physically in-
teracting proteins coevolve, the interaction can be inferred from the similarity of two
phylogenetic profiles (the Phylogenetic Profile Method [112]). Moreover, the phyloge-
netic trees for two proteins could also be used in comparative analyses. In this method,
a multiple sequence alignment (MSA) is first obtained from a set of reference organisms,
and the phylogenetic trees for proteins A and B are constructed using the position
of orthologs in the MSA. Then the similarity score can be computed by comparing the
distance matrices of the two phylogenetic trees. For a more comprehensive exposition
of these methods see Jothi and Przytycka [82].
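As a small illustration of the phylogenetic profile idea, the sketch below represents each profile as a binary vector over the same ordered list of reference organisms and scores two proteins by the fraction of organisms on which their profiles agree. This agreement score is one simple choice of similarity, assumed here for illustration; the toy profiles are hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of reference organisms on which two presence/absence
    profiles agree (1 = present, 0 = absent)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference organisms")
    agree = sum(a == b for a, b in zip(profile_a, profile_b))
    return agree / len(profile_a)


# Toy profiles over six reference organisms.
protein_a = [1, 0, 1, 1, 0, 1]
protein_b = [1, 0, 1, 1, 0, 0]
protein_c = [0, 1, 0, 0, 1, 1]
print(profile_similarity(protein_a, protein_b))
print(profile_similarity(protein_a, protein_c))
```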
Protein-protein docking. Another approach for detecting protein interactions is done
by structural analysis of protein molecules: namely, protein-protein docking. Given a
pair, or a small set of proteins, accurate and efficient protein docking algorithms pro-
vide a useful tool for understanding both the tendency for protein interactions and the
structure of protein complexes. Various docking algorithms have been proposed, and
often work in two phases: (1) a population of candidate conformations is generated;
and (2) each of the candidates is scored in order to find the most likely structural confor-
mations (see, for example, Azé et al. [6] and Choi [31]). To verify the correctness of these
methods, they are often compared against a database of NMR and X-ray structures [77].
1.2.3 Identification of Protein Complexes
While the PPI networks represent binary interactions between two proteins, protein
complexes with three or more interacting partners are difficult to identify immediately
from these networks. However, the protein complexes are often highly connected sub-
graphs in PPI networks, and thus graph clustering algorithms are often used to detect
these complexes.
Molecular Complex Detection (MCODE) algorithm. Bader and Hogue [7] proposed
the three-stage algorithm, MCODE, to identify protein complexes. First, the vertices are
given weights according to a local density measure, called the core-clustering coefficient.
Let N [v ] denote the subgraph induced by a vertex v and its neighbours. Then, a k -core
of v is a maximal subgraph of N[v] in which every vertex has degree at least k. The highest k-core
refers to a nonempty k -core with the highest k , and the core-clustering coefficient of a
vertex v is then calculated as the edge density of the highest k -core of v .
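The first stage can be sketched as follows. The code finds the highest k-core of N[v] by repeatedly removing vertices of degree less than k, and then takes the edge density of that core. Density is computed here as the number of edges divided by the number of vertex pairs, which is an assumption of this sketch; the function names are hypothetical and the graph is assumed to have no self-loops.

```python
def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def highest_k_core(adj):
    """Return (k, vertices) for the nonempty k-core with the largest k."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    best_k, best_core = 0, set(remaining)
    k = 1
    while True:
        # Peel vertices whose degree inside the current subgraph is below k.
        low = [v for v, ns in remaining.items() if len(ns) < k]
        while low:
            for v in low:
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
            low = [v for v, ns in remaining.items() if len(ns) < k]
        if not remaining:
            return best_k, best_core
        best_k, best_core = k, set(remaining)
        k += 1


def core_clustering_coefficient(graph, v):
    """Edge density of the highest k-core of the closed neighbourhood N[v]."""
    closed = graph[v] | {v}
    local = {u: graph[u] & closed for u in closed}
    _, core = highest_k_core(local)
    n = len(core)
    if n < 2:
        return 0.0
    m = sum(len(local[u] & core) for u in core) // 2
    return m / (n * (n - 1) / 2)


# v is adjacent to a, b, c, d; a, b, c form a triangle; d hangs off v.
dense = build_adjacency([("v", "a"), ("v", "b"), ("v", "c"), ("v", "d"),
                         ("a", "b"), ("a", "c"), ("b", "c")])
star = build_adjacency([("s", "x"), ("s", "y"), ("s", "z")])
print(core_clustering_coefficient(dense, "v"))
print(core_clustering_coefficient(star, "s"))
```

In the first graph the pendant vertex d is peeled away and the remaining 3-core is a clique, so v receives the maximum weight; the centre of a star receives a lower weight because its neighbours share no edges.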
After each vertex is assigned a weight, the second stage of the algorithm recursively
starts to expand the clusters. The vertex with the highest weight is initially set to form
a cluster, and all of its neighbours are joined to the cluster if (i) the weight of the neigh-
bour is above a given threshold; and (ii) the neighbour has not been previously explored.
When no further expansion is possible, the next heaviest unexplored vertex is set to form
a singleton cluster, and the process is repeated. Note that the clusters found so far are
disjoint at this stage of the algorithm.
Finally, the third phase of the algorithm is a post-processing step, where clusters
that are not sufficiently dense are discarded from the set of clusters. Furthermore, any
unexplored vertices are joined to nearby clusters. At this stage, these unexplored vertices
may be joined to multiple clusters, thereby creating overlapping clusters.
Markov Cluster algorithm. The Markov Cluster (MCL) algorithm, proposed by Van
Dongen [139], simulates random walks on the graph by alternating two operations, ex-
pansion and inflation, to transform one set of probabilities into another. First, a weight
matrix of G is normalized into a column-stochastic matrix M, so that each column of M sums to 1. Then each
column of M represents a stochastic process, where each entry M(i, j) corresponds to
the probability of going from vertex j to vertex i. The inflation ∆_r of M for a power
coefficient r is defined as:
\[ \Delta_r(M(p,q)) = \frac{M(p,q)^r}{\sum_{i=1}^{n} M(i,q)^r} \]
Hence, higher values of r would yield tighter clusters by favouring more probable walks
on the graph.
On the other hand, the expansion operation creates longer walks on the graph by
computing the e-th power of the associated matrix M, where e is the expansion parameter.
This operation spreads the information flow across the network. The MCL algorithm
alternates the expansion and inflation operations until M stabilizes.
When the algorithm reaches its equilibrium, the resulting matrix would correspond to a
collection of star-like components, each of which is declared a protein complex.
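The alternation of expansion and inflation can be sketched in a few lines of plain Python on dense matrices. This is a minimal illustration under several assumptions that are not stated in the description above: self-loops are added before normalization, convergence is tested with a fixed tolerance, and clusters are read off as the nonzero entries of each row of the final matrix. It is not intended to match any particular MCL implementation.

```python
def normalize_columns(m):
    """Rescale so that every column sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expand(m, e):
    """Expansion: the e-th matrix power (longer random walks)."""
    result = m
    for _ in range(e - 1):
        result = matmul(result, m)
    return result


def inflate(m, r):
    """Inflation: entrywise r-th power followed by column normalization."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, e=2, r=2.0, max_iter=100, tol=1e-9):
    n = len(weights)
    # Assumption: add self-loops so that no column sums to zero.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        nxt = inflate(expand(m, e), r)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each nonzero row lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > 1e-6) for row in m}
    clusters.discard(frozenset())
    return clusters


# Two disjoint triangles: vertices 0-2 and vertices 3-5.
w = [[0.0] * 6 for _ in range(6)]
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    w[u][v] = w[v][u] = 1.0
print(mcl(w))
```

Raising the inflation parameter r sharpens each column toward its largest entries, which is why larger values tend to produce tighter clusters.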
Min-cut based algorithms. Though not applied directly to the protein complex detec-
tion problem, several approaches have been proposed to use minimum cut algorithms
when finding clusters in biological networks (e.g. [69, 126]). These approaches first start
with a similarity graph as an input, and repeatedly partition the set of vertices into clus-
ters until a partition satisfies a particular stopping criterion. For example, in Hartuv et
al. [69], the notion of highly connected subgraph (HCS) is used as a stopping criterion.
A subgraph H is an HCS if the global minimum s−t cut of H is at least |V(H)|/2. The initial
PPI network is then recursively partitioned into two clusters by minimum s−t cuts until
each connected component is a HCS.
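The stopping criterion can be checked directly on small graphs. The sketch below computes the global minimum cut by brute force over all bipartitions, which is exponential in the number of vertices and therefore only suitable for toy examples; a practical implementation would use a polynomial-time minimum cut algorithm instead. The function names are hypothetical.

```python
from itertools import combinations


def global_min_cut(vertices, edges):
    """Smallest number of edges crossing any bipartition (brute force).

    Requires at least two vertices.
    """
    vertices = list(vertices)
    first, rest = vertices[0], vertices[1:]
    best = None
    # Fix `first` on one side so each bipartition is considered once;
    # sizes stop short of len(rest) so the other side is never empty.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            side = {first, *chosen}
            crossing = sum((u in side) != (v in side) for u, v in edges)
            if best is None or crossing < best:
                best = crossing
    return best


def is_highly_connected(vertices, edges):
    """Stopping criterion: minimum cut of at least |V| / 2."""
    return global_min_cut(vertices, edges) >= len(vertices) / 2


clique = [(a, b) for a, b in combinations("ABCD", 2)]
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(is_highly_connected("ABCD", clique), is_highly_connected("ABCD", path))
```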
Simultaneous clustering algorithm. More recently, Narayanan et al. [106] proposed
a framework, called JointCluster, that finds a clustering of multiple networks that inte-
grates large-scale datasets including PPI data and gene expression data. Extending from
a previous study on clustering a single graph by Kannan et al. [84], JointCluster finds
sparse cuts to generate clusters that allow theoretical approximation guarantees on the
quality of the detected clustering relative to the optimal clustering. The biological sig-
nificance of the resulting clusters preserved across physical and coexpression networks
is verified by known protein complexes in the literature or enrichment of functionally
coherent pathways.
In a recent study by Brohee and van Helden [20], four clustering algorithms – MCL,
Restricted Neighbourhood Search Clustering (RNSC) by King et al. [90], Super Paramagnetic
Clustering (SPC) by Blatt et al. [15], and MCODE – were evaluated against PPI
networks built from the annotated MIPS database. In particular, to test for robustness of the
algorithms against false positives and false negatives, edges were randomly added and
removed to generate 41 altered graphs. After comparing against the hand-curated pro-
tein complexes from the MIPS database, the MCL algorithm clearly outperformed the
other methods under most conditions, while RNSC and MCL show similar behaviours
in some tests such as altered graphs with edge removal but no edge addition.
Note that these complex detection methods use similarity matrices that are often
populated using the number of times each pairwise PPI is observed. However, the sim-
ilarity between two proteins can be interpreted in various ways, e.g., (1) the confidence
level of a direct interaction, (2) the strength of an interaction, or (3) the timing of an
interaction to distinguish those that are transient. The existing complex detection
methods, however, do not account for how these similarity matrices are interpreted. In
particular, if the PPI data contains many indirect interactions, clusters that highly overlap will
be difficult to separate using the existing PPI data. Furthermore, because the
clustering algorithms tend to first create disjoint clusters, and then join the ones that are
close to each other, they often do not distinguish overlapping protein complexes well.
As such, methods to discover a collection of overlapping clusters are still in demand in
order to properly model PPI networks as hypergraphs8.
1.3 AP-MS based PPI Networks
In the previous section, we introduced the types of post-processing and analysis that
need to be done computationally, and reviewed different approaches proposed in the
literature. In general, these approaches do not take into account the experimental
technique from which the dataset originates, and such generic models often fail to capture
the inherent nature of the produced data.
8 A hypergraph is a set system where, in the context of PPI networks, each set corresponds to a protein complex.
To give an example, the methods introduced in Sections 1.2.1 and 1.2.2 assume that
the dataset contains only direct, binary interactions. This is a reasonable assumption
when the dataset is from Y2H experiments, but when analyzing AP-MS data, indirect
interactions must be filtered out prior to applying these methods. On the other hand, the
complex detection methods discussed in Section 1.2.3 are designed to find individual
clusters within the network, and as is the case for most clustering algorithms, finding
overlapping complexes remains a difficult problem in general.
In this section, we take a brief look at how the analyses are carried out for PPI net-
works constructed from AP-MS experiments. In particular, we focus on two of the most
recent large scale studies on AP-MS based PPI networks from a bioinformatics point of
view.
1.3.1 Protein Complexes in Yeast
In Krogan et al. [93], the protein-protein interactions in the yeast were detected by the
AP-MS method. In order to improve the accuracy as well as the coverage, two MS pu-
rifications were done independently. Furthermore, both interacting partners for each
protein-protein interaction were tagged and purified in order to ensure greater data con-
sistency and reproducibility. After 2357 successful purifications of proteins, 4087 yeast
proteins were identified as preys with high confidence scores from the mass spectrom-
etry.
After the interactome with confidence scores had been established, protein com-
plexes were detected using the Markov Clustering algorithm (described in Section 1.2.3),
where the expansion and inflation operators were appropriately chosen to optimize the
overlap against the MIPS database. The Markov Clustering algorithm identified 547 dis-
joint protein complexes, half of which were not present in the MIPS database. Further-
more, new proteins were identified for most complexes that had been known previously.
For the purpose of the graph theoretic analysis of the protein complexes, a soft-
ware plug-in (http://genepro.ccb.sickkids.ca) for the Cytoscape environment
was written to model the interactome in two different ways: (1) the traditional PPI net-
work model, and (2) the protein complex network model, where each protein complex is
represented as a node, and two nodes are joined by an edge if and only if the correspond-
ing complexes share common subunits. Using these two models, the authors were able
to verify (or re-verify) several hypotheses and findings from the previous literature; for
example, (1) proteins in the same complex should have similar function and co-localize
to the same subcellular compartment, and (2) highly connected proteins within the net-
work tend to be highly conserved, and conversely, highly conserved proteins tend to be
more highly connected and central (measured by betweenness9) to the network.
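The protein complex network model is straightforward to construct once the complexes are known. In the sketch below, each complex is given as a set of protein identifiers, and two complexes are joined whenever they share at least one subunit. The function name and the toy complexes are illustrative assumptions.

```python
from itertools import combinations


def complex_network(complexes):
    """Edges between complexes that share at least one subunit.

    `complexes` maps a complex name to its set of protein identifiers.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(complexes, 2)
        if complexes[a] & complexes[b]
    }


complexes = {
    "C1": {"P1", "P2", "P3"},
    "C2": {"P3", "P4"},        # shares P3 with C1
    "C3": {"P5", "P6"},        # shares nothing
}
print(complex_network(complexes))
```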
On the other hand, as discussed in Section 1.2.3, the Markov Clustering algorithm
does not necessarily separate two or more complexes that share subunits, and the au-
thors point out that the identified clusters may contain multiple complexes. Identifying
such overlapping complexes remains to be explored, and better suited complex detec-
tion algorithms will provide a finer view of the interactome.
1.3.2 Modularity in Yeast Protein Complexes
Gavin et al. [58] also studied the protein-protein interactions in the yeast via the AP-
MS method, and successfully purified 1993 distinct proteins, for which 2760 interacting
partners were identified via mass spectrometry. In order to measure the reproducibil-
ity, 138 purifications were done repeatedly, and 69% of the repeated purifications were
common, giving rough estimates of the false-positive and false-negative ratios.
9 Betweenness is a measure of graph centrality of a node. For a given node v, Betweenness(v) is typically defined as the number of shortest s−t paths that pass through v (sometimes normalized by the total number of s−t paths), where s ≠ v ≠ t.
As acknowledged by the authors, the current graph clustering approaches for identifying
protein complexes are inappropriate for the PPI data from AP-MS experiments
since they do not explicitly capture the nature of the purification output. To avoid this
problem, the socio-affinity index measure was devised to quantify the propensity of pro-
teins forming an interaction. This index measures the log-odds of the number of times
two proteins are observed together, relative to what would be expected from their fre-
quency in the data set. In particular, the socio-affinity index A(i , j ) is defined as
\[ A(i,j) = S_{i,j \mid i=\mathrm{bait}} + S_{i,j \mid j=\mathrm{bait}} + M_{i,j}, \]
where S_{i,j|i=bait} measures the tendency for protein j to be pulled down when protein i is
used as a bait (known as the spoke model), and M_{i,j} measures the tendency for two
proteins to co-purify when other proteins are tagged as a bait (known as the matrix model).
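The structure of the index, two spoke terms plus one matrix term, can be sketched as a sum of log-odds values. The expected counts used below are generic placeholders derived from marginal frequencies; they are assumptions of this sketch and should not be read as the exact terms used by Gavin et al., which are not given in this section. A term with no observations is set to zero, which is also an assumption.

```python
import math


def socio_affinity(purifications, i, j):
    """Illustrative A(i, j) = spoke(i -> j) + spoke(j -> i) + matrix(i, j).

    `purifications` is a list of (bait, set_of_preys) pairs.
    """
    n = len(purifications)

    def prey_freq(p):
        return sum(p in preys for _, preys in purifications) / n

    def log_odds(observed, expected):
        # Assumption: unobserved pairs contribute nothing.
        if observed > 0 and expected > 0:
            return math.log(observed / expected)
        return 0.0

    def spoke(bait, prey):
        runs = [preys for b, preys in purifications if b == bait]
        observed = sum(prey in preys for preys in runs)
        expected = len(runs) * prey_freq(prey)      # placeholder expectation
        return log_odds(observed, expected)

    def matrix(a, b):
        observed = sum(a in preys and b in preys for _, preys in purifications)
        expected = n * (prey_freq(a) * prey_freq(b))  # placeholder expectation
        return log_odds(observed, expected)

    return spoke(i, j) + spoke(j, i) + matrix(i, j)


purifications = [
    ("A", {"B", "C"}), ("A", {"B"}), ("B", {"A"}),
    ("D", {"E"}), ("E", {"D"}), ("F", {"B", "C"}),
]
print(socio_affinity(purifications, "A", "B"))
print(socio_affinity(purifications, "A", "D"))
```

In this toy data set A and B retrieve each other as baits, so their score is positive, while A and D are never observed together.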
The socio-affinity model was the first attempt to quantify physical measurements
of the interactions purely from the AP-MS data. These computed values can populate
an n ×n matrix representing a PPI network, and a graph clustering algorithm (not de-
scribed) is then used in order to define protein complexes within this network. However,
since each protein can belong to multiple protein complexes, a disjoint clustering from
a single run of the clustering algorithm is unlikely to form the appropriate protein com-
plexes. To avoid this pitfall, several runs of the clustering algorithm are carried out using
different parameters.
Using this approach, 1784 different sets of complexes were generated, and were com-
pared against hand-curated known complexes. Consequently, the sets of complexes
that score highly (> 70%) in coverage and accuracy show similar clusterings. Thus
the top scoring set of complexes is merged with other high scoring sets of complexes
to form variants of complex formulations called complex isoforms. Interestingly, these
complex isoforms showed that the proteins within each complex can be partitioned into
two types: (1) core components, which are the subunits of a complex that are present in
most isoforms, and (2) modules, which are present in only some of the isoforms, but al-
ways together. Furthermore, each module (which can be regarded as a building block
of a complex) appeared to be present in multiple complexes, where the core compo-
nents differ. The authors verified that this organization of complexes is due to biological
phenomena by looking at the co-expression level during the cell cycle, as well as co-
localization in the cell. Furthermore, the cores and modules that were detected within
the same complex tend to belong to the same functional category, or otherwise, the
core–module cross-talk between functional categories show many known connections
such as between protein synthesis, transcription, and the cell cycle.
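The core/module distinction can be sketched for a single complex given its isoforms. Two assumptions are made here that the description above leaves open: "most isoforms" is taken to mean strictly more than a chosen fraction (one half by default), and a module is taken to be a group of at least two non-core proteins that appear in exactly the same isoforms. The function name is hypothetical.

```python
from collections import defaultdict


def cores_and_modules(isoforms, core_fraction=0.5):
    """Split the proteins of one complex into a core and modules.

    `isoforms` is a list of protein sets, one per complex isoform.
    """
    n = len(isoforms)
    proteins = set().union(*isoforms)
    # For each protein, the set of isoform indices in which it appears.
    presence = {
        p: frozenset(k for k, iso in enumerate(isoforms) if p in iso)
        for p in proteins
    }
    core = {p for p, where in presence.items() if len(where) / n > core_fraction}
    groups = defaultdict(set)
    for p, where in presence.items():
        if p not in core:
            groups[where].add(p)
    # Proteins that always occur together form a module.
    modules = [group for group in groups.values() if len(group) >= 2]
    return core, modules


isoforms = [
    {"a", "b", "m1", "m2"},
    {"a", "b", "x"},
    {"a", "b", "m1", "m2"},
    {"a", "b"},
]
core, modules = cores_and_modules(isoforms)
print(core, modules)
```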
This modularity of protein complexes gives a new insight towards the structure of
PPI networks in that each protein complex can be further partitioned into functional
building blocks – cores and modules, and this structural property of protein complexes
may be beneficial to designing new complex detection algorithms. On the other hand,
while the socio-affinity index attempts to quantify the probability measure for direct
protein interactions, the spoke model and the matrix model described above only con-
sider 1- or 2-hop neighbours of each protein from the raw PPI data. From the bioin-
formatics standpoint, devising a better suited measure that fully models the stochastic
nature of the AP-MS PPI data would certainly improve our understanding of the pro-
teome.
1.4 Contributions and Thesis Outline
Due to its ability to test for protein co-complex memberships, AP-MS has been the
method of choice for various recent studies in high-throughput proteomics. However,
the AP-MS method has its own drawbacks that require systematic curation as well as
reinterpretation of the experimental data. In this thesis, we (1) investigate three such
limitations from a mathematical viewpoint, (2) formulate combinatorial optimization
problems for each identified limitation, and (3) study algorithmic approaches with the
appropriate applications in mind.
Chapter 2. We give an introduction to the mathematical tools that are used extensively
throughout this thesis. This includes techniques from approximation algorithms
and combinatorial optimization as well as graph parameters that are useful for
designing efficient algorithms for computationally intractable problems.
Chapter 3. Protein Quantification. When testing the existence and the abundance of
proteins, mass spectrometry can only look at short fragments of proteins (known
as peptides), and then identify which protein the peptide belongs to. However,
there are a number of “ambiguous” peptides that belong to several different pro-
teins, which makes the identification of constituent proteins a difficult task. We
study the problem of quantifying the protein abundance despite the presence of
these ambiguous peptides: given a set S of peptide sequences and a set W of protein
sequences, together with the peptide abundance σ, find the protein abundance X for
each protein. We formulate this problem as various mathematical programs, show their
hardness, and devise approximation algorithms that can predict absolute abundance
for all constituent proteins. This manuscript
is to be submitted [89].
• Ethan Kim, Adrian Vetta, Mathieu Blanchette, “Protein quantification with
shared peptides using multi cover algorithms”, in preparation.
Chapter 4. Direct PPI network. Once the PPI data from AP-MS experiments are quan-
tified, one needs to be sure that the identified interactions are accurate. However,
as discussed in this chapter, a significant fraction of the detected interactions are
artifacts of indirect interactions, i.e. detected via chains of simultaneous interac-
tions. We study the problem of separating the direct interactions from indirect
ones, in order to minimize the false positives in the AP-MS data. Here, we intro-
duce a probabilistic graph model that captures the direct and indirect interactions
within the AP-MS data. Then, under this model, the problem of distinguishing di-
rect interactions from indirect ones is formulated, and we provide an algorithm
that is highly customized for the AP-MS data. This work is published in [88].
• Ethan Kim, Ashish Sabharwal, Adrian Vetta, Mathieu Blanchette, “Predict-
ing Direct Protein Interactions from Affinity Purification Mass Spectrometry
Data”, Algorithms for Molecular Biology, 5:34, (2010).
Chapter 5. Hypergraph modelling of PPI data. PPI networks are traditionally represented
as combinatorial graphs where each interaction is assumed to occur between two
proteins. Recent studies, however, revealed interactions involving several proteins
working simultaneously. Such discoveries of protein complexes as building blocks
of the proteome led to much interest in modelling PPI networks as hypergraphs.
We thus study the problem of constructing hypergraphs from binary PPI data us-
ing algorithms for the (edge) clique cover problem. This work resulted in various
efficient algorithms for both PPI networks and other restricted classes of graphs
such as graphs with bounded parameters, and an approximation algorithm in the
case of planar graphs, as published in [14].
• Mathieu Blanchette, Ethan Kim, Adrian Vetta, “Clique Cover on Sparse Net-
works”, The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX)
93-102, Kyoto, Japan (2012)
Chapter 6. We conclude and give a list of open problems.
Chapter 2
Approximation Algorithms and
Optimization Techniques
This chapter introduces topics from discrete mathematics that are preliminary to dis-
cussions throughout the thesis. In particular, we focus on algorithmic techniques that
are frequently used to tackle problems in computational biology. As such, during this
pedagogical exposition, we shall discuss problems with applications in computational
biology in order to motivate the reader. We shall first define NP-hard optimization
problems in Section 2.1. Then, Section 2.2 discusses graph parameters that often al-
low us to design efficient algorithms for computationally hard problems. Various tech-
niques to design approximation algorithms are discussed in Section 2.3, including α-
approximation algorithms, polynomial time approximation schemes, and LP-based meth-
ods. Finally we introduce genetic algorithms in Section 2.4.
2.1 NP-hard Optimization Problems
Combinatorial optimization problems ask to find an optimal object from a finite set
of objects [125]. Typical problems include shortest path, minimum spanning tree, and
the travelling salesman problem, and these problems can often be stated as: given A, find
B of size k, where k is the quantity that the problem seeks to either minimize or
maximize. If k is indeed the minimum (maximum, resp.) possible value admitting a
solution, we say k is optimum.
From the perspective of computational complexity, we say that a problem Π is in
P if there is an algorithm to solve Π efficiently.1 On the other hand, if a problem Π is
such that its solution can be verified efficiently, we say Π is in NP. By definition, this
implies that every problem in P belongs to NP. The hardest problems in NP deserve special
attention: if every problem in NP can be reduced to a problem Π in polynomial time
(without requiring that Π belongs to NP), we say that Π is NP-hard. Furthermore, if Π
also belongs to NP itself, we say that Π is NP-complete. Therefore, if an NP-complete
problem (one of the hardest problems in NP) can be solved in polynomial time,
then every problem in NP can be solved in polynomial time. In this sense, NP-complete
problems lie at the core of the class NP.
Consequently, it is of great interest in the theoretical computer science community
to decide whether NP \ P = ∅, that is, whether P = NP. Since it is widely believed that
this is not true, algorithm designers face difficulties in finding efficient solutions to
NP-hard problems. Indeed, as is the case for problems that are discussed in this thesis,
optimization problems in computational biology often turn out to be NP-hard. As a result,
one must either apply various techniques to find plausible solutions to these problems, or
restrict oneself to special cases of the problem that allow efficient algorithms. We shall
introduce some of these techniques in the following sections.
1 An algorithm is said to be efficient if its running time can be bounded by a polynomial in the input length. We avoid more rigorous definitions for these classes of problems; precise definitions involving Turing machines and certificates can be found in various complexity theory texts, for example [4, 56].
2.2 Graph Parameters
Throughout this thesis, let G = (V, E ) be a finite graph with vertices V and edges E .
Unless stated otherwise, we assume G is simple and undirected. The number of vertices
and edges are |V |= n and |E |=m , respectively.
A graph parameter Γ is a function that assigns a non-negative number to every graph
G. For a class of graphs 𝒢, Γ(𝒢) denotes the maximum of Γ(G) over all graphs G ∈ 𝒢, and
we say a graph class 𝒢 has bounded Γ if Γ(𝒢) ∈ O(1).
Graph parameters are useful when designing efficient algorithms with parameterized
running time. In particular, a problem is fixed-parameter tractable (FPT) if it can
be solved in $f(k) \cdot |I|^{O(1)}$ time, where f is a computable function that depends only
on some parameter k and is independent of the input size |I|.
A good example is treewidth, which measures how “tree-like” a given graph is. To
define treewidth, we need the notion of tree decompositions.
Definition 2.1. [120] A tree decomposition of a graph G = (V, E) is a pair
$(X = \{X_i \mid i \in N\},\ T = (N, L))$ where each tree node i ∈ N is associated with a set of
vertices $X_i \subseteq V$, such that
(1) $\bigcup_{i \in N} X_i = V$.
(2) For each (v, w) ∈ E, there is an i ∈ N with $v, w \in X_i$.
(3) For each v ∈ V, the set of tree nodes $\{i \in N \mid v \in X_i\}$ induces a subtree of T.
For clarity, we shall refer to (tree) nodes of the tree decomposition T, and vertices of
the original graph G. The width of a tree decomposition (X, T) is defined as
$\max_{i \in N} |X_i| - 1$, and the treewidth of a graph G, denoted tw(G), is the minimum width
over all tree decompositions of G. Following this definition, it is easy to see that every
tree with at least one edge has treewidth 1. See Figure 2.1 for an example of a tree
decomposition of width 3. In general, it is NP-complete to decide whether the treewidth of
a graph is at most a given value [3]. However, for a fixed k, graphs with treewidth at most
k can be recognized, and a width k tree decomposition can be constructed in linear time [17].
Figure 2.1: A graph G and its tree decomposition T of width 3.
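To make the three conditions of Definition 2.1 concrete, the following is a minimal sketch of a checker for a candidate tree decomposition. The function names `is_tree_decomposition` and `width`, and the dictionary-based input format, are assumptions made for illustration and do not come from the thesis; the sketch also assumes that the supplied `tree_edges` really form a tree.

```python
from collections import defaultdict


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (1)-(3) of a tree decomposition.

    bags:       dict mapping each tree node i to its vertex set X_i
    tree_edges: list of pairs of tree nodes (assumed to form a tree)
    """
    # (1) The bags together contain exactly the vertices of G.
    if set().union(*bags.values()) != set(vertices):
        return False
    # (2) Both endpoints of every edge appear together in some bag.
    for v, w in edges:
        if not any(v in X and w in X for X in bags.values()):
            return False
    # (3) For every vertex, the tree nodes containing it are connected in T.
    adj = defaultdict(set)
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    for v in vertices:
        nodes = {i for i, X in bags.items() if v in X}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adj[i]:
                if j in nodes and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(X) for X in bags.values()) - 1


# A path a-b-c has a decomposition with two bags and width 1.
path_vertices = ["a", "b", "c"]
path_edges = [("a", "b"), ("b", "c")]
path_bags = {0: {"a", "b"}, 1: {"b", "c"}}
print(is_tree_decomposition(path_vertices, path_edges, path_bags, [(0, 1)]))
print(width(path_bags))
```

The checker only verifies a given decomposition; finding one of minimum width is the hard part discussed above.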
A tree decomposition of a graph is especially useful for designing divide and con-
quer algorithms: given a decomposition tree T , one can pick an arbitrary tree node r
as root of T . Then, for each tree node X , the problem can be solved for the subgraph
induced by the vertices associated with X , and all its descendant nodes. Building up the
solution from leaf nodes to r often provides a polynomial time algorithm (parameter-
ized by the treewidth) if solutions to the subproblems can be combined using dynamic
programming2.
Another related parameter for graph sparsity is branchwidth.
Definition 2.2. [121] A branch decomposition (T, φ) of a graph G is characterized by a
ternary tree3 T, and a bijection φ from the leaves of T to the edges of G.
2 Dynamic programming is a method for solving complex problems by subdividing the problem into simpler subproblems. This technique is useful when there are overlapping subproblems, so that one can solve the problem in a bottom-up fashion.
3 A tree T is a ternary tree if every non-leaf node has degree 3.
Figure 2.2: A graph G and its branch decomposition T. The e-separation shown on the decomposition tree T corresponds to a middle-set of {5, 6, 8}.
Let e be a tree edge in T. Removing e from T partitions T into two subtrees T1 and T2,
and this partition induces a partition of the edges of G, called an e-separation, into
the edge sets associated with the leaves of T1 and T2. Let G1 and G2 denote the subgraphs
of G formed by these two edge sets. The set of vertices in G that are shared by both G1
and G2 is called the middle-set of e, and the width of this separation is the number of
vertices in the middle-set.
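As a small illustration of the middle-set, the sketch below computes it for one split of the edges of a 3×3 grid graph on vertices 1 to 9. The helper name `middle_set` is hypothetical, and the particular split is an assumption chosen for illustration; it is not read off the decomposition tree of the figure.

```python
def middle_set(edges_1, edges_2):
    """Vertices incident to at least one edge on each side of an e-separation."""
    side_1 = {v for e in edges_1 for v in e}
    side_2 = {v for e in edges_2 for v in e}
    return side_1 & side_2


# The 3x3 grid graph on vertices 1..9, numbered row by row.
grid_edges = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9),
              (1, 4), (4, 7), (2, 5), (5, 8), (3, 6), (6, 9)]

# Hypothetical separation: the four edges of the bottom-right square on one side.
part_1 = [(5, 6), (6, 9), (5, 8), (8, 9)]
part_2 = [e for e in grid_edges if e not in part_1]

print(sorted(middle_set(part_1, part_2)))  # [5, 6, 8]
```

The width of this separation is the size of the returned set, here 3.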
Given a branch decomposition (T,φ), the width of this branch decomposition is the
maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ),
is the minimum width over all branch decompositions. Figure 2.2 gives an example of
a graph and a branch decomposition. It is well known that the branchwidth is closely
related to the treewidth of a graph by the sandwich theorem [121]:
$$\mathrm{bw}(G) \le \mathrm{tw}(G) + 1 \le \tfrac{3}{2}\,\mathrm{bw}(G).$$
These graph parameters often allow us to design efficient algorithms to solve oth-
erwise NP-hard problems, and such problems come in different flavours: The input
graphs sometimes have bounded graph parameters due to the nature of the problem.
In some other cases, different techniques can be used when the graph has unbounded
parameters, so we can restrict ourselves to the bounded case. Sometimes, the graph
decomposition gives rise to efficient algorithms to find near-optimal solutions.
All these techniques will be exploited extensively for modelling PPI networks as hy-
pergraphs in Chapter 5. Indeed, graph decomposition finds applications in a number
of different areas within computational biology. The perfect phylogeny problem, which
will be discussed in the next section, is polynomially equivalent to the problem of trian-
gulating coloured graphs, and it can be solved in polynomial time when the treewidth is
bounded [16, 92]. Within the field of proteomics, several studies have shown that pro-
tein interaction networks often exhibit low treewidth (see, e.g. Yamaguchi et al. [147]
and Cheng et al. [28]). This opens the door to designing efficient algorithms for many
problems on PPI networks. For example, Dost et al. [43] study the problem of querying
the existence of a pathway Q within a base network G . While the problem of deciding if a
graph is a subgraph of another is an NP-hard problem in general [56], they focus on the
restricted class of graphs with bounded treewidth, and provide an efficient algorithm to
query such pathways in a database of molecular networks.4
2.3 Approximation Algorithms
Many optimization problems in bioinformatics are NP-hard, and often researchers re-
sort to heuristics for finding solutions that are close to optimal. A useful line of attack is
the theory of approximation algorithms, which offers compelling techniques for solving
NP-hard optimization problems with provable guarantees. For a minimization prob-
lem, an algorithm is an α-approximation algorithm if it can find in polynomial time
a solution of value at most α·OPT, where OPT is the value of an optimal solution. The
4 In Dost et al.’s work, similarity scores between vertices are computed using their sequence similarity, which is then used when matching a query network to the base network. The classical problem of subgraph isomorphism on graphs with bounded parameters has been studied by Eppstein [45, 46].
usage of approximation algorithms in bioinformatics includes genome assembly (short-
est superstring problem), multiple sequence alignment problem, and phylogenetic tree
construction (Steiner tree problem), just to name a few.
Despite the diverse application of approximation algorithms in bioinformatics, the
area of PPI networks has seen limited applications. In particular, to the best of our
knowledge, no combinatorial algorithm has been proposed so far to identify direct pro-
tein interactions (Chapter 4) or protein complexes (Chapter 5) purely from the topolog-
ical structure of the given PPI network. Such scarcity of applications may be due to the
fact that even defining a clean objective function is a nontrivial task. In fact, most ex-
isting approaches propose different objective functions, for which heuristic algorithms
are used. Moreover, it is often unclear whether these approaches directly capture the
experimental processes that produced the PPI data.
In this section, we present a glimpse of approximation algorithms by discussing a
few problems as examples. Since the focus of this section is to introduce techniques
for devising approximation algorithms, we refrain from discussing the latest results on
each problem with the best approximation guarantee. For more extensive surveys,
Vazirani [140], and Williamson and Shmoys [145] give excellent tutorials on the field.
2.3.1 Lower-bounding the Optimum
When designing an approximation algorithm, one needs to compare the cost of the ob-
tained solution with the cost of an optimal solution in order to establish the approxima-
tion guarantee. This is typically done by finding a good lower bound (for minimization
problems) on the cost of an optimal solution. For example, consider the vertex cover
problem:
CARDINALITY VERTEX COVER. Given an undirected graph G = (V, E), find a set C ⊆ V
such that every edge has at least one endpoint in C, and |C| is minimized.
A lower bound on the size of an optimal vertex cover can be found using a maximal
matching5 of G. With respect to a maximal matching M, any vertex cover has to pick at
least one endpoint of each edge in M, and these endpoints are distinct because no two
edges of M share a vertex. Therefore, the size of M is at most the size of an
optimal vertex cover. This gives rise to the following simple algorithm:
1. Find a maximal matching M in G.
2. Output both endpoints of edges in M.
Observe that the chosen vertices cover all the edges of G, as otherwise any
uncovered edge could have been added to the matching, contradicting its maximality.
Furthermore, the size of the cover is exactly 2|M|. Since |M| ≤ OPT, the size of the
cover is at most 2·OPT, and thus the algorithm is a 2-approximation algorithm.
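The two steps above can be sketched as follows. This is a minimal illustration that builds the maximal matching greedily in a single pass over the edge list; the function name `vertex_cover_2approx` and the edge-list input format are assumptions, and the graph is assumed to have no self-loops.

```python
def vertex_cover_2approx(edges):
    """Return (cover, matching) where cover has size exactly 2 * len(matching)."""
    matching = []
    cover = set()
    for u, v in edges:
        # An edge joins the matching only if neither endpoint is matched yet,
        # so the final matching is maximal.
        if u not in cover and v not in cover:
            matching.append((u, v))
            cover.update((u, v))
    return cover, matching


# Path 1-2-3-4: an optimal cover is {2, 3}; the sketch returns all four vertices.
path = [(1, 2), (2, 3), (3, 4)]
cover, matching = vertex_cover_2approx(path)
print(sorted(cover), matching)  # [1, 2, 3, 4] [(1, 2), (3, 4)]
```

The example shows that the factor of 2 can be attained: the optimum has size 2, while the returned cover has size 4.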
Application: Phylogenetic Trees [18] Consider a set of m species with n possible
characteristics. We can represent this by a 0-1 matrix M where $M_{i,j} = 1$ if species i has
characteristic j. Then, we say M has a perfect phylogenetic tree if there exists a tree T
such that: (1) there are m leaves, one for each species; (2) each characteristic label
corresponds to one edge in T (there may be edges with blank labels); (3) for each leaf $s_i$,
the path from the root to $s_i$ contains exactly those characteristics possessed by $s_i$. See
Figure 2.3 for an example of M and a possible perfect phylogenetic tree.
Let $S_j$ be the set of species that have characteristic $c_j$. Then the following
characterization holds.
Theorem 2.3. M has a perfect phylogenetic tree if and only if {S1, . . . , Sn} form a laminar
family.6
Thus, we say that two characteristics $c_i$ and $c_j$ have a conflict if $S_i$ and $S_j$ intersect, while
$S_i \setminus S_j \neq \emptyset$ and $S_j \setminus S_i \neq \emptyset$. If we remove characteristics so that there are no conflicts, then
5 Given a graph G = (V, E), a subset of edges M ⊆ E is a matching if no two edges in M share a vertex. We say a matching is maximal if no more edges of E \ M can be added to M.
6 A set system is a laminar family if, for each pair of sets $S_i$ and $S_j$, either $S_i \subseteq S_j$, or $S_j \subseteq S_i$, or $S_i \cap S_j = \emptyset$.
      c1  c2  c3  c4  c5
s1     1   1   0   0   0
s2     0   0   1   0   0
s3     1   1   0   0   1
s4     0   0   1   1   0
s5     0   1   0   0   0
Figure 2.3: Matrix M indicating the characteristics of a set of species (left) and a perfect phylogenetic tree for M (right). Tree edges with ∅ indicate unlabelled edges.
there is a perfect phylogenetic tree for the remaining characteristics. Thus, we want
to find the smallest set of characteristics to remove. We can turn this problem into an
instance of vertex cover: define a conflict graph whose vertices correspond to character-
istics c1, . . . , cn , and vertices c i and c j share an edge if and only if they conflict. Then,
a minimum cardinality vertex cover in the conflict graph consists of a minimum set of
conflicting characteristics to remove.
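The conflict graph described above can be built directly from the matrix M. The following is a minimal sketch; the function name `conflict_graph` and the row-list representation of M are assumptions made for illustration. Characteristics are indexed from 0 in the code, so index 0 corresponds to c1.

```python
from itertools import combinations


def conflict_graph(M):
    """Return the edges (a, b) of the conflict graph of a 0-1 matrix M.

    M is a list of rows, one per species; columns are characteristics.
    """
    n = len(M[0])
    # S[j] is the set of species (row indices) having characteristic j.
    S = [{i for i, row in enumerate(M) if row[j] == 1} for j in range(n)]
    edges = []
    for a, b in combinations(range(n), 2):
        # Conflict: the sets intersect, but neither contains the other.
        if S[a] & S[b] and S[a] - S[b] and S[b] - S[a]:
            edges.append((a, b))
    return edges


# The matrix of Figure 2.3: the sets form a laminar family, so no conflicts.
M_fig = [[1, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [1, 1, 0, 0, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 0, 0, 0]]
print(conflict_graph(M_fig))  # []

# A small matrix with one conflict between its two characteristics.
print(conflict_graph([[1, 0], [1, 1], [0, 1]]))  # [(0, 1)]
```

Any vertex cover algorithm can then be run on the returned edge list to choose the characteristics to remove.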
2.3.2 Polynomial Time Approximation Schemes
Some NP-hard optimization problems are approximable to an arbitrary degree defined
by the error parameter ε > 0. For an NP-hard optimization problem Π with an objective
function f , we say that algorithm A is an approximation scheme for Π if on input (I ,ε),
A outputs a solution s such that:
• f (I , s )≤ (1+ε)·OPT if Π is a minimization problem.
• f (I , s )≥ (1−ε)·OPT if Π is a maximization problem.
The algorithm A is called a polynomial time approximation scheme (PTAS), if for each
fixed ε> 0, the running time of A is bounded by a polynomial in the size of the instance
I. Note that the running time of a PTAS may depend arbitrarily on ε. If, in addition, the
running time is bounded by a polynomial in both |I| and 1/ε, we say the algorithm is a
fully polynomial time approximation scheme (FPTAS). Here we give an example of an
FPTAS for the knapsack problem.
KNAPSACK PROBLEM. Given a set $S = \{a_1, \ldots, a_n\}$ of objects with $\mathrm{cost}(a_i) \in \mathbb{Z}^+$,
$\mathrm{profit}(a_i) \in \mathbb{Z}^+$, and $B \in \mathbb{Z}^+$, find a subset of S whose total cost is bounded by B, and
whose total profit is maximized.
The design of a PTAS often proceeds by first mapping the problem instance to a coarser
instance, which is then solved exactly by dynamic programming. Before describing the
FPTAS for the knapsack problem, we introduce the notion of pseudo-polynomial time
algorithms. An algorithm is efficient if its running time
is bounded by a polynomial in |I |, the size of the instance. However, some problems
contain numbers as part of the problem instance. For example, a knapsack problem
instance contains cost and profit for each object a i . Let |Iu | denote the size of a prob-
lem instance I , where all the numbers in the instance are written in unary. Then, an
algorithm is a pseudo-polynomial time algorithm if its running time is bounded by a
polynomial in |Iu |. Because the knapsack problem is NP-hard, we do not know whether
an efficient algorithm exists; however, there is a pseudo-polynomial time algorithm for
this problem.
Let P be the profit of the most profitable object in S. Then nP is an upper bound
on the profit of an optimal solution. For every 1 ≤ i ≤ n and 1 ≤ p ≤ nP, let S(i, p)
denote a subset of $\{a_1, \ldots, a_i\}$ whose total profit is precisely p and whose total cost is
minimized. Further, define A(i, p) to be the cost of the set S(i, p). If no such set exists,
set A(i, p) = ∞. Then A(1, p) is known for each value of p, where 1 ≤ p ≤ nP, and the
following recurrence holds.
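$$A(i+1, p) = \begin{cases} \min\{A(i, p),\ \mathrm{cost}(a_{i+1}) + A(i, p - \mathrm{profit}(a_{i+1}))\}, & \text{if } \mathrm{profit}(a_{i+1}) \le p \\ A(i, p), & \text{otherwise} \end{cases}$$
Here we use the convention A(i, 0) = 0 for all i, so that the case $\mathrm{profit}(a_{i+1}) = p$ accounts for the set consisting of $a_{i+1}$ alone.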
A dynamic programming algorithm using this recurrence will compute all values of
A(i, p) in $O(n^2 P)$ time. Finally, the optimal solution is found in time O(nP) by looking for
$\max\{p \mid A(n, p) \le B\}$. Thus, this is a pseudo-polynomial time algorithm for the knapsack
problem.
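The dynamic program can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `knapsack_exact` and the list-based inputs are assumptions, and the sketch uses the convention A(i, 0) = 0 and applies the first case of the recurrence whenever profit(a_{i+1}) ≤ p, so that a set consisting of a single object is also considered.

```python
import math


def knapsack_exact(costs, profits, B):
    """Return (best_profit, chosen_indices) via the table A(i, p)."""
    n, P = len(costs), max(profits)
    top = n * P  # upper bound on the profit of any solution
    # A[i][p] = minimum cost of a subset of the first i objects with profit exactly p.
    A = [[math.inf] * (top + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        A[i][0] = 0  # the empty set has profit 0 and cost 0
    for i in range(n):
        for p in range(1, top + 1):
            A[i + 1][p] = A[i][p]
            if profits[i] <= p:
                A[i + 1][p] = min(A[i][p], costs[i] + A[i][p - profits[i]])
    best = max(p for p in range(top + 1) if A[n][p] <= B)
    # Trace back through the table to recover one optimal set.
    chosen, p = [], best
    for i in range(n, 0, -1):
        if A[i][p] != A[i - 1][p]:
            chosen.append(i - 1)
            p -= profits[i - 1]
    return best, sorted(chosen)


print(knapsack_exact([3, 4, 5], [4, 5, 6], 7))  # (9, [0, 1])
```

The table has (n + 1)(nP + 1) entries, which matches the $O(n^2 P)$ bound stated above.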
Observe that, if the profit of each object is bounded by a polynomial in n , the above
algorithm would run in polynomial time. This observation leads us to an FPTAS for the
knapsack problem. We first scale the profit down to small numbers so that the profit
of each object is polynomially bounded by n . Then we can run the above algorithm to
compute the optimal solution with respect to the scaled profit function. By carefully
scaling with respect to the error parameter ε, we shall obtain a solution that is at least
(1−ε)OPT in time polynomial in |I | and 1/ε.
Here is the FPTAS for the knapsack problem:
1. Let K = εP/n .
2. For each object $a_i$, define $\mathrm{profit}'(a_i) = \lfloor \mathrm{profit}(a_i) / K \rfloor$.
3. Run the dynamic programming algorithm with the scaled profit function, profit′(a i )
to obtain the most profitable set S′.
4. Return S′.
We now show that profit(S′) ≥ (1−ε)·OPT. Let S∗ denote the optimal set with respect
to the original profit function. For each object $a_i$, due to the rounding down at Step 2,
$K \cdot \mathrm{profit}'(a_i)$ can be smaller than $\mathrm{profit}(a_i)$, but by at most K. Since S∗ contains at
most n objects,
$$\mathrm{profit}(S^*) - K \cdot \mathrm{profit}'(S^*) \le nK.$$
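Furthermore, S′ must be at least as good as S∗ under the scaled profits. By scaling the
profits back up by K we have:
$$\mathrm{profit}(S') \ge K \cdot \mathrm{profit}'(S') \ge K \cdot \mathrm{profit}'(S^*) \ge \mathrm{profit}(S^*) - nK = \mathrm{OPT} - \varepsilon P \ge (1 - \varepsilon) \cdot \mathrm{OPT},$$
since OPT ≥ P (assuming, without loss of generality, that every object fits in the knapsack on its own).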
The running time of the algorithm is $O(n^2 \lfloor P/K \rfloor) = O(n^3 \cdot \frac{1}{\varepsilon})$, which is polynomial in
both n and 1/ε.
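The four steps of the scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `knapsack_fptas` is hypothetical, objects that do not fit in the knapsack on their own are discarded first, the scaled instance is solved with a dictionary-based variant of the dynamic program (indexed by scaled profit), and floating-point division is used for the scaling, so boundary cases of the floor may differ from exact arithmetic.

```python
import math


def knapsack_fptas(costs, profits, B, eps):
    """Return indices of a set whose profit is at least (1 - eps) * OPT."""
    n = len(costs)
    items = [i for i in range(n) if costs[i] <= B]  # objects that fit alone
    if not items:
        return []
    P = max(profits[i] for i in items)
    K = eps * P / n                                   # Step 1
    scaled = {i: math.floor(profits[i] / K) for i in items}  # Step 2
    # Step 3: table[p] = (min cost, chosen objects) with scaled profit exactly p.
    table = {0: (0, [])}
    for i in items:
        # Iterate over a snapshot so that each object is used at most once.
        for p, (c, chosen) in list(table.items()):
            q, new_cost = p + scaled[i], c + costs[i]
            if new_cost <= B and new_cost < table.get(q, (math.inf,))[0]:
                table[q] = (new_cost, chosen + [i])
    return table[max(table)][1]                       # Step 4


costs, profits = [3, 4, 5], [4, 5, 6]
chosen = knapsack_fptas(costs, profits, B=7, eps=0.1)
print(chosen, sum(profits[i] for i in chosen))
```

Smaller values of eps give a finer scaling and therefore a larger table, which is where the 1/ε factor in the running time comes from.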
Application: Target Protein Set. Due to limited resources, structural biologists often
face the problem of deciding which proteins to study. Because of the diverse nature of
protein structures, certain protein molecules are more difficult to crystallize than
others. It would be beneficial, therefore, if the chosen set of proteins is not too
expensive to investigate. Furthermore, certain proteins are more important than others
due to biological relevance, homology against existing results on similar proteins, etc.
As such, researchers want to find the set of target proteins that maximizes the
importance while staying within their budget.
We can model this simple problem of task selection as follows: for each protein a i ,
define the cost of investigating it to be cost(a i ), and the importance of the particular
protein to be profit($a_i$). If we let B be the total resources that we have at our
disposal, an optimum solution to this knapsack problem gives us the most profitable set of
proteins we can study.
2.3.3 LP-based Algorithms
Linear programming gives a profound framework in which approximation algorithms
can be designed. Many NP-hard optimization problems can be stated as integer programs.
While solving an integer program is itself NP-hard in general, a (fractional) linear
program can be solved in polynomial time, and thus the linear relaxation of the integer
program provides a natural way of bounding the cost of the optimal solution.
A widely used approach for designing LP-based approximation algorithms is LP-
rounding. In this approach, a fractional solution to the LP-relaxation is first obtained,
and it is then converted to an integral solution by a rounding scheme. The approximation
guarantee is established by comparing the cost of the fractional and integral solutions.
For a more detailed discussion of this technique, see [140]. Here we present
a simple example of the LP-rounding technique via a weighted version of the set cover
problem.
WEIGHTED SET COVER. Given a universe U of n elements, a collection of subsets of U,
$\mathcal{S} = \{S_1, \ldots, S_k\}$, and a cost function $c : \mathcal{S} \to \mathbb{Z}^+$, find a minimum cost subcollection of
$\mathcal{S}$ that covers all elements of U.
We can formulate this problem as an integer program by assigning a binary variable
$x_S$ for each set $S \in \mathcal{S}$, whose value is set to 1 if S is picked in the set cover, and 0
otherwise. Furthermore, for each element e ∈ U, at least one of the sets containing e must
be picked. So we have:
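$$\begin{aligned}
\text{minimize} \quad & \sum_{S \in \mathcal{S}} c(S) \cdot x_S \\
\text{subject to} \quad & \sum_{S :\, e \in S} x_S \ge 1 && \forall e \in U \\
& x_S \in \{0, 1\} && \forall S \in \mathcal{S}
\end{aligned}$$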
Let OPT denote the value of an optimum solution of this integer program. The LP-relaxation
of this integer program is obtained by replacing the domain of variables $x_S$ with
$0 \le x_S \le 1$ for all $S \in \mathcal{S}$. The derived linear program can then be solved optimally to
obtain a fractional solution of value $\mathrm{OPT}_f$. By construction, $\mathrm{OPT}_f \le \mathrm{OPT}$ since the
optimum integral solution is feasible for the LP.
Then, we can apply the rounding scheme as follows: pick all sets S for which $x_S \ge 1/\alpha$
in the optimal fractional solution, where α is the frequency of the most frequent element
(i.e. the maximum number of sets containing an element). Let C denote the resulting
subcollection of sets chosen by the rounding scheme. To see that C is a valid set cover,
consider an arbitrary element e. Since e is in at most α sets and the values $x_S$ of these
sets sum to at least 1, one of them must have been assigned at least 1/α by the fractional
solution. Thus, e is covered by C.
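To check the approximation ratio, observe that the rounding scheme increases each $x_S$ by
a factor of at most α. Therefore, the cost of C is at most $\alpha \cdot \mathrm{OPT}_f \le \alpha \cdot \mathrm{OPT}$, and the
algorithm is an α-approximation algorithm.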
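The rounding step itself can be sketched in a few lines. The sketch below assumes that an optimal fractional solution has already been computed by some LP solver and is passed in as a dictionary; the function name `round_set_cover`, the input format, and the small numerical tolerance are assumptions made for illustration.

```python
def round_set_cover(universe, sets, x_frac, tol=1e-9):
    """Pick every set whose fractional value is at least 1/alpha.

    sets:   dict mapping a set name to its elements
    x_frac: dict mapping a set name to its value in a fractional LP solution
    """
    # alpha = maximum number of sets containing any single element.
    alpha = max(sum(1 for S in sets.values() if e in S) for e in universe)
    chosen = [name for name in sets if x_frac[name] >= 1 / alpha - tol]
    return chosen, alpha


# Three unit-cost sets; x = 1/2 for each set is an optimal fractional solution
# of value 1.5 (adding the three constraints gives 2 * (sum of x) >= 3).
universe = {1, 2, 3}
sets = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}}
x_frac = {"A": 0.5, "B": 0.5, "C": 0.5}
chosen, alpha = round_set_cover(universe, sets, x_frac)
print(chosen, alpha)  # ['A', 'B', 'C'] 2
```

In this instance the rounded cover has cost 3, which equals α times the fractional optimum of 1.5, while an optimal integral cover uses two sets.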
Application: Representative Protein Set. In Section 2.3.2, we introduced the problem
of choosing a set of target proteins to investigate. Here we modify this problem slightly.
Instead of maximizing the profit we gain by studying a set of proteins, we consider the
notion of representative protein set: from a large collection of protein molecules, we
would like to study the ones that somehow represent the entire collection of proteins.
Such a representative profile would provide us with a succinct, yet global picture of the given
protein sample.
We can model this problem as follows: for each protein x , define the cost function
w (x ) ≥ 0 measured by the difficulty of crystallizing x . Then, for each pair (x , y ) of pro-
teins, define a distance function d (x , y )≥ 0 measured by the sequence alignment of the
two proteins. We can then construct a graph G = (V, E ) where each vertex corresponds
to a protein, and two vertices are joined by an edge if and only if the corresponding
proteins are within some prescribed distance ∆. We say a subset V′ ⊆ V of vertices is
representative if every vertex in V \ V′ is adjacent to at least one vertex in V′. Then the
problem of finding a minimum cost representative set is precisely the dominating set problem.
WEIGHTED DOMINATING SET. Given a graph G = (V, E ) and a cost function c : V → Z+,
find a minimum cost subset D ⊆ V such that every vertex not in D is joined to at least
one vertex in D.
It is well known that dominating set is equivalent to set cover [83]. To reduce domi-
nating set to set cover, construct a set cover instance as follows: the universe U is V . For
each vertex v ∈ V (G ), create a set Sv containing the vertex v and all vertices adjacent to
v . For each set Sv , define its cost to be c (v ). Now, if D is a dominating set for G , then
C = {Sv : v ∈D} is a feasible set cover with the same cost. Conversely, if C = {Sv : v ∈D}
is a set cover, then D is a dominating set for G with the same cost.
Therefore, we can solve the representative protein set problem using the LP-rounding
scheme for set cover.
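The reduction from dominating set to set cover described above can be sketched as follows. The function name `dominating_to_set_cover` and the input format are assumptions made for illustration; each set is named after the vertex that generates it.

```python
def dominating_to_set_cover(vertices, edges, cost):
    """Build the set cover instance (universe, sets, set_cost) for a graph."""
    # S_v contains v itself and all vertices adjacent to v.
    sets = {v: {v} for v in vertices}
    for u, v in edges:
        sets[u].add(v)
        sets[v].add(u)
    set_cost = {v: cost[v] for v in vertices}
    return set(vertices), sets, set_cost


# A star with centre 0 and leaves 1, 2, 3: the centre alone dominates the graph.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (0, 3)]
cost = {0: 5, 1: 1, 2: 1, 3: 1}
universe, sets, set_cost = dominating_to_set_cover(vertices, edges, cost)
print(sets[0] == universe)  # True
```

Any cover {S_v : v ∈ D} of the resulting instance translates back to the dominating set D with the same cost.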
2.4 Genetic Algorithms
In order to solve optimization problems, a search heuristic called a genetic algorithm
(GA) often offers useful solutions. Inspired by natural evolution, genetic algorithms are
especially useful when a large number of candidate solutions can be efficiently gener-
ated, and pairs of candidates can be somehow “merged” to obtain a better offspring
of the two parent solutions. Each candidate solution is typically encoded as a string of
characters. A genetic algorithm can then be designed via the following four main phases:
I. Initialization: Many individual solutions are randomly generated to form an initial
population that spans the entire search space.
II. Fitness function: Each candidate solution is evaluated by the fitness function that
gives a score based on the optimization criteria.
III. Crossover: Given two parent solutions, the crossover operation generates an off-
spring based on its parents. Various methods have been proposed for the crossover
operation, with one point “cut-and-swap” being the simplest form, while it can be
fully customized in a problem-specific way.
IV. Mutation: In order to maintain genetic diversity from one generation to the next,
a prescribed portion of the population goes through a mutation operation. The
mutation operation is typically done by randomly switching characters in the can-
didate string, which can be fine-tuned depending on the application.
Using these operators, a large pool of candidate solutions is maintained from one
generation to another. Since a genetic algorithm is a search heuristic, one typically
cannot provide a theoretical bound on the output quality. Further, the fitness function
only tells us whether one candidate solution is better than other existing candidates, and
thus it is never clear when to stop the algorithm. However, when a large initial popula-
tion can be easily generated, and the fitness function can be evaluated quickly, carefully
designed GAs provide an efficient approach to find good solutions. In this thesis, we
shall present a highly customized genetic algorithm to find the direct physical PPI network
in Chapter 4. For more detailed discussions on this optimization technique, see [64, 94].
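The four phases can be sketched on a toy problem in which candidates are bit strings. This is a generic illustration and not the customized algorithm of Chapter 4: the function name `genetic_algorithm`, all parameter values, the choice to keep the better half of the population as parents, and the choice to mutate only offspring are assumptions. The string length is assumed to be at least 2 and the population size at least 4.

```python
import random


def genetic_algorithm(fitness, length, pop_size=30, generations=50,
                      mutation_rate=0.1, seed=0):
    """Return the best bit string found; larger fitness is better."""
    rng = random.Random(seed)
    # I. Initialization: a random population of bit strings.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # II. Fitness: rank candidates and keep the better half as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # III. Crossover: one-point cut-and-swap.
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            # IV. Mutation: flip one random position with a small probability.
            if rng.random() < mutation_rate:
                k = rng.randrange(length)
                child[k] = 1 - child[k]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Toy objective: maximize the number of ones in the string.
best = genetic_algorithm(sum, length=20)
print(best, sum(best))
```

Because the parents survive into the next generation, the best fitness in the population never decreases, but, as noted above, nothing guarantees that the optimum is reached after a fixed number of generations.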
Chapter 3
Protein Quantification with Shared
Peptides
With the advance of mass spectrometry (MS) based techniques, biologists in proteomics
research are not only interested in determining the presence of proteins within a mix-
ture, but also the quantitative nature of the constituent proteins, i.e. their abundance.
MS-based shotgun proteomics approaches this problem by first shattering the (unknown)
protein molecules into shorter fragments known as peptides, and then reading the pep-
tides present in the mixture using a mass spectrometer.
Mass spectrometers typically work by first producing ions with charges based on
the masses of the peptides, and then separating the ions by their mass-to-charge ratios.
Various methods have been proposed to convert the peptide molecules into ions in the
gas phase, including electrospray ionization (ESI) [144] and matrix-assisted laser
desorption ionization (MALDI) [85, 116]. The produced ions can then be detected and processed
into mass spectra. There are several algorithms to analyze the mass spectra and either
identify the peptides present or quantify the peptide abundance [59, 115, 135].
Among these quantification methods, approaches known as label-free methods have
become popular due to the low cost and simplicity of the experiments. These methods
work under the premise that more abundant peptides would produce a larger number
of spectra. By counting the number of spectra, we can obtain measures such as spectral
counting [30, 109] and peptide counting [87, 143]. Peptide counting measures how well a
given protein is represented in a sample by counting the number of unique
peptides. On the other hand, spectral counting simply counts the number of spectra
for each peptide. As such, while peptide counting measures the coverage of the parent
protein sequence, spectral counting provides a concrete measure of peptide abundance
that can later be translated into protein abundance.
This chapter deals with the next step in the protein quantification problem: translat-
ing the peptide abundance into protein abundance. In order to identify (and quantify)
constituent proteins in a mixture, the detected peptides often serve as indicators for
the parent proteins. For example, one of the most widely used approaches for MS protein
identification is the Mascot score¹ [114]. Given a set of peptide mass spectra, the
Mascot score measures the probability that the match between the mass spectra and a
particular protein sequence is a random event (thus a lower probability yielding a higher
score), and is calculated for each protein in the protein sequence database. To compute
this measure, each protein sequence in the database is digested and fragmented in sil-
ico, and the resulting “theoretical spectrum” is compared against the experimental spec-
trum observed by mass spectrometers. Consequently, the computation of Mascot scores
involves considering all possible digested peptides for each protein in the database.
However, such an approach creates a problem when there is a peptide that belongs
to more than one parent protein; in other words, the mapping of peptides to proteins
is no longer one-to-one. These ambiguous peptides are called shared peptides, and
Nesvizhskii and Aebersold [108] recently highlighted the complications in protein
identification in the presence of shared peptides. Unfortunately, shared peptides are
prevalent in several mass spectrometry datasets. For example, proteins in a large protein
family have similar sequences that may create identical peptides when trypsinized, and
alternative splicing may also create nearly identical protein isoforms. However, in most
studies of protein identification and quantification, shared peptides are simply
ignored [100, 151].
¹ Mascot score is the in silico method to identify proteins from a sequence database, used by the Mascot software (http://www.matrixscience.com).
In this chapter, we study the problem of identifying and quantifying the constituent
proteins from MS spectra, without having to exclude shared peptides.
Depending on the mixture of proteins under investigation, the shared peptides can con-
stitute as much as 50% of the peptides in the mixture [42, 81]. Consequently, incorpo-
rating the shared peptides into analyses would certainly lead to a more accurate quan-
tification of constituent proteins.
Recently, shared peptides have been a useful tool for protein quantification in sev-
eral studies. For example, Jin et al. [81] proposed a statistical model that finds groups of
proteins sharing a significant number of peptides, namely peptide-sharing closure groups
(PSCG). In their work, proteins grouped together in the same PSCG were shown to be bi-
ologically related, and when applied to data from differential expression analysis, all the
peptides within each group showed consistent abundance relative to controls, justifying
the inclusion of shared peptides in protein quantification studies.
On the other hand, Dost et al. [42] recently proposed a clean combinatorial model
for quantifying relative protein abundance. Their model is built around peptide-protein
relationships and, using the relative peptide abundance data from multiple samples,
they set up a linear program to obtain relative protein abundance. As their input data
is relative (e.g. peptide $s_i$ is expressed 5 times more in sample $x$ than in sample $y$),
their output is also in relative ratios across multiple samples. As such, solving the linear
program optimally (which gives a fractional value to the abundance of each protein)
sufficed to give a viable solution.
As described, existing studies of protein quantification using shared peptides focus
primarily on relative abundance measures across different samples. This may be because
the quality of peptide abundance measures is relatively poor, and thus most studies
only consider the differential profiling aspect of protein abundance. However, various
methods for absolute peptide quantification have been recently proposed [9, 61, 91],
and the quality of peptide abundance measures is improving rapidly. In this chapter,
therefore, we consider the problem of absolute protein quantification under the as-
sumption that absolute peptide abundance is available. In Section 3.1, we modify the
model of Dost et al. to suit our needs for absolute protein quantification, and formu-
late various optimization problems. Then, in Section 3.2, we establish the computa-
tional hardness of these problems, and design approximation algorithms in Section 3.3.
In Section 3.4, we evaluate the performance of our approaches using both simulated
and real MS data. Finally, Section 3.5 discusses how our approaches can be modified
to handle degenerate cases, and incorporate additional biological data such as peptide
detectability.
3.1 Problem Formulation for Protein Quantification
In this section, we formulate optimization problems for protein quantification. Let
$W = \{w_1, \ldots, w_m\}$ denote the set of all known protein sequences, and let $S = \{s_1, \ldots, s_n\}$ denote
the set of peptides we observe from a given mass spectrometry experiment². We assume
that we are given the set of protein sequences and peptide sequences, together with
the abundance of each observed peptide sequence. Then we want to find the protein
abundance that can be constructed from the given peptide abundance. For simplicity,
we make the following assumptions.
1. For each protein $w_j$, a peptide $s_i$ may appear in $w_j$ at most once.
2. The absolute abundance for each observed peptide is given as a positive integer,
and thus protein abundance should be a nonnegative integer.
Lifting the first assumption can be done easily as shall be discussed in Section 3.5.
The absolute peptide abundance can be estimated by the spectral counting method, as
it measures the number of corresponding mass spectra for each peptide.
² We use w for words and s for syllables as an analogy to proteins and peptides.
[Figure 3.1: bipartite graph with peptide nodes s1 = QYUA (σ1 = 4), s2 = TUAIL (σ2 = 5), s3 = QWTA (σ3 = 8), s4 = LTAU (σ4 = 7) and protein nodes w1 = QYUATUAIL (x1 = 4), w2 = LTAUQWTA (x2 = 7), w3 = QWTATUAIL (x3 = 1).]
Figure 3.1: An example of the bipartite graph from peptides ($s_1, \ldots, s_4$) and the parent proteins ($w_1, \ldots, w_3$). Peptides $s_2$ and $s_3$ are shared peptides. The abundances of the peptides are indicated by $\sigma_i$, while the $x_j$'s denote the protein abundances.
Let us construct a bipartite graph $G = (S, W; E)$ with $|S| = n$, $|W| = m$, where $(i, j) \in E$
if and only if peptide $s_i$ is a substring of protein $w_j$. Let $A$ be an $n \times m$ matrix where
$A_{i,j} = 1$ if $(i, j) \in E$, and 0 otherwise. Let $\sigma_i$ be a positive integer that denotes the multiplicity
of a peptide $s_i$, for $1 \le i \le n$. Then, let $x_j$ denote the unknown variable for the multiplicity
of protein $w_j$. Figure 3.1 gives an example of the formulation. Then, we can encode our
problem into a system of linear equations as follows:
Problem 3.1.
\[
\begin{aligned}
\sum_{j=1}^{m} A_{i,j} \cdot x_j &= \sigma_i, & i &= 1, \ldots, n \\
x_j &\in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\]
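As a small illustration of Problem 3.1, the following Python sketch builds the matrix $A$ from the peptide and protein strings of Figure 3.1 and checks that the abundances shown there satisfy the system. The function names `incidence_matrix` and `coverage` are our own, chosen only for this example.

```python
# Data transcribed from Figure 3.1; helper names are illustrative only.
peptides = ["QYUA", "TUAIL", "QWTA", "LTAU"]        # s1..s4
proteins = ["QYUATUAIL", "LTAUQWTA", "QWTATUAIL"]   # w1..w3
sigma = [4, 5, 8, 7]                                # peptide abundances
x = [4, 7, 1]                                       # protein abundances


def incidence_matrix(peptides, proteins):
    """A[i][j] = 1 iff peptide i is a substring of protein j."""
    return [[1 if s in w else 0 for w in proteins] for s in peptides]


def coverage(A, x):
    """Row sums (A x)_i: total abundance of proteins containing peptide i."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


A = incidence_matrix(peptides, proteins)
print(A)
print(coverage(A, x) == sigma)
```

Rows 2 and 3 of `A` each contain two ones, which reflects that $s_2$ and $s_3$ are the shared peptides in this instance.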
This model captures the dependencies of peptides on proteins, and our problem re-
duces to solving a system of linear equations. However, there remain a few practical
issues. The multiplicity of peptides $\sigma_i$ from MS contains a significant amount of noise,
which would lead to inaccurate solutions or an unsatisfiable system of equations. Further,
even after assuming the peptide multiplicities are accurate, the system may not
have a unique integral solution. We thus need a relaxed set of constraints and an
optimization criterion to select the most plausible solution. We propose to modify this
model in four different ways to progressively capture the data being analyzed in an in-
creasingly realistic manner.
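MULTICOVER. We can relax the equality constraints: rather than requiring the $x_j$'s to sum up
to precisely $\sigma_i$ for each peptide $s_i$, we turn them into covering inequalities, and minimize
the sum of the $x_j$'s.
\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} x_j \ge \sigma_i, & i &= 1, \ldots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\]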
Observe that this problem is closely related to the set cover problem. In fact, our for-
mulation above is a generalization of set cover known as multicover: instead of requiring
each element (peptide, in our case) to be covered at least once, each element needs to be
covered at least the prescribed number of times. If each set (protein, in our case) can be
picked at most once, the problem is called constrained multicover, while if a set can be
picked arbitrarily many times, the problem is called unconstrained multicover, which
is precisely the formulation above.
MINIMUMPROTEINTYPES. One further refinement from above is to look for a solution
with the smallest number of different proteins, that is, to minimize the number of $x_j$'s with
positive values.
\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} \chi(x_j) \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} x_j \ge \sigma_i, & i &= 1, \ldots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\]
where $\chi(x_j) = 1$ if $x_j > 0$, and $\chi(x_j) = 0$ otherwise.
In this problem, we want to find a solution that uses the smallest possible number of
distinct protein types. Since the objective function is nonlinear, the resulting problem is
no longer an integer linear program. While nonlinear integer programs are a much harder
class of problems in general, as we shall see later, we can still solve the above program
via a reduction to set cover.
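MINIMUMUNIFORMERROR. Instead of simply looking for covers on the $\sigma_i$'s, we can allow an
error term $\varepsilon$ such that the $x_j$'s sum to a value within $\sigma_i \pm \varepsilon$.
\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j} x_j \le \sigma_i + \varepsilon, & i &= 1, \ldots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\]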
This formulation looks for a solution such that, for each peptide $s_i$, the proposed
multiplicities for proteins containing $s_i$ sum up to a value within $\varepsilon$ of $\sigma_i$. This
introduction of an additive error term helps us incorporate noise in the MS data while adhering to
the peptide abundance $\sigma_i$ by minimizing the error term. One issue with this approach
is that some peptides are significantly harder to detect than others, and thus the error
for those peptide multiplicities is much larger than for peptides that are easier to identify.
We can resolve this issue of non-uniform errors using the following formulation.
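MINIMUMERRORSUM. While MINIMUMUNIFORMERROR looks for a solution where the uniform
error term $\varepsilon$ is minimized, here we introduce an individual error term $\varepsilon_i$ for each
$\sigma_i$, and minimize the sum $\sum_i \varepsilon_i$.
\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \varepsilon_i \\
\text{subject to:} \quad & \sigma_i - \varepsilon_i \le \sum_{j=1}^{m} A_{i,j} x_j \le \sigma_i + \varepsilon_i, & i &= 1, \ldots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\]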
Using these error terms allows a different error for each peptide. We note that the error
terms $\varepsilon_i$ can be weighted by “detectability”, which would penalize discrepancies
differently for different peptides. We discuss such extensions in Section 3.5.
Each of these four problems captures a different aspect of protein quantification,
and requires a different algorithmic technique. Although some problems clearly model
the mass spectrometry data more realistically than others, the theoretical results build
up progressively, and thus we shall discuss each problem in this chapter.
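To make the four objectives concrete, the following Python sketch evaluates each of them for a candidate abundance vector on the instance of Figure 3.1. It is only an evaluator, not a solver, and the function name `objectives` and the dictionary keys are our own labels.

```python
def objectives(A, sigma, x):
    """Evaluate the four objective values of Section 3.1 for a candidate x."""
    cov = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    errors = [abs(c - s) for c, s in zip(cov, sigma)]
    return {
        "multicover": sum(x),                           # sum of x_j
        "protein_types": sum(1 for xj in x if xj > 0),  # sum of chi(x_j)
        "uniform_error": max(errors),                   # smallest feasible epsilon
        "error_sum": sum(errors),                       # sum of smallest epsilon_i
        "is_cover": all(c >= s for c, s in zip(cov, sigma)),
    }


# Instance of Figure 3.1: rows are peptides s1..s4, columns are proteins w1..w3.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]

print(objectives(A, sigma, [4, 7, 1]))  # the abundances shown in Figure 3.1
print(objectives(A, sigma, [5, 8, 0]))  # a cover that uses only w1 and w2
```

On this instance, the vector $(5, 8, 0)$ is a valid cover with two protein types instead of three, but it has a larger total abundance and nonzero error, which illustrates how the objectives can disagree.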
3.2 Hardness of Protein Quantification Problems
In this section, we study the computational hardness of the problems posed in Sec-
tion 3.1. To show the hardness of a given problem, we need to find a problem that is
known to be hard, and reduce its instances to our problem of interest. For that purpose,
we consider the following two NP-hard problems³.
SETCOVER. [56, 86] Given a set $S$ of elements and a collection $W$ of sets over the ground
set $S$, choose a minimum cardinality sub-collection of $W$ that covers all elements in $S$.
EXACTCOVER. [56, 86] Given a collection $W$ of sets over elements $S$, find a subset
$W' \subseteq W$ such that each element in $S$ belongs to exactly one set in $W'$.
Before showing the hardness of our four problems, we first consider the original for-
mulation; Problem 3.1 is simply a system of linear equations, but we can define an inte-
ger program by adding an objective function as in MULTICOVER.
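Problem 3.2.
\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} \cdot x_j = \sigma_i, & i &= 1, \ldots, n \\
& x_j \ge 0, & j &= 1, \ldots, m \\
& x_j \in \mathbb{Z}
\end{aligned}
\]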
The hardness of this problem is evident from EXACTCOVER. Due to unavoidable er-
rors in MS measurements, this problem is of less practical interest, and we refrain from
further discussing algorithmic approaches.
Theorem 3.3. Problem 3.2 is NP-hard.
Proof. Take an instance $\Pi$ of EXACTCOVER, and transform it into an instance $\phi(\Pi)$ of
Problem 3.2 by assigning $\sigma_i = 1$ for all $1 \le i \le n$. If $\phi(\Pi)$ has a feasible solution of any size, then $\Pi$
has an exact cover. Otherwise, $\Pi$ has no exact cover.
³ See Section 3.6 for inapproximability results on these problems.
Next, the hardness of the multicover problem is well known [140].
Theorem 3.4. [140]MULTICOVER is NP-hard.
Proof. Consider the decision problem MULTICOVER($k$). Take an instance $\Pi$ of
SETCOVER($k$), and cast it into an instance $\phi(\Pi)$ of MULTICOVER($k$), where $\sigma_i = 1$ for all $i$. If
$\Pi$ contains a solution of size $k$, then the corresponding chosen sets form a solution for
$\phi(\Pi)$ of size $k$. Conversely, if $\phi(\Pi)$ has a multicover of size $k$, then there are at most $k$
$x_j$'s with positive values. These $x_j$'s correspond to a set cover of size at most $k$ for $\Pi$.
The hardness of MINIMUMPROTEINTYPES comes from a reduction from SETCOVER.
Theorem 3.5. MINIMUMPROTEINTYPES is NP-hard.
Proof. Take an instance of set cover $\Pi$, and cast it into an instance $\phi(\Pi)$ of
MINIMUMPROTEINTYPES by assigning the coverage requirement $\sigma_i = 1$ for each element. If $\Pi$ admits a
set cover of size $k$, then the corresponding chosen proteins form a solution for $\phi(\Pi)$ of
size $k$. Conversely, if $\phi(\Pi)$ admits a solution of size $k$, the proteins chosen with positive
multiplicity correspond to a set cover of size $k$ in $\Pi$.
The error minimization problems can be shown to be hard by reductions from EX-
ACTCOVER.
Theorem 3.6. MINIMUMUNIFORMERROR is NP-hard.
Proof. Take an instance $\Pi$ of EXACTCOVER. Then, cast it into an instance $\phi(\Pi)$ of
MINIMUMUNIFORMERROR by assigning $\sigma_i = 1$ for $1 \le i \le n$. If $\phi(\Pi)$ has a solution with $\varepsilon = 0$,
then $\Pi$ contains an exact cover. Otherwise, $\Pi$ contains no exact cover.
A similar proof yields hardness for MINIMUMERRORSUM.
Theorem 3.7. MINIMUMERRORSUM is NP-hard.
3.3 Approximation Algorithms for Protein Quantification
In this section, we design approximation algorithms for the formulated problems. While
MULTICOVER and MINIMUMPROTEINTYPES are both covering problems, MINIMUMUNIFORMER-
ROR and MINIMUMERRORSUM share a common theme of minimizing the error terms. For
the covering problems, we obtain O(log n ) approximation guarantees, and we design
randomized algorithms for the error minimization problems.
3.3.1 An Algorithm for Multicover
The MULTICOVER problem formulated in Section 3.1 is an unconstrained version: the
solution vector x may contain any nonnegative integers. In other words, each protein
may be picked multiple times in order to meet the coverage requirement σ. For the
purpose of approximation algorithms, however, we can instead focus on the constrained
0−1 problem, where each protein can be either picked once, or not picked at all.
Lemma 3.8. If there is an α-approximation algorithm for constrained multicover, then
there is an α-approximation algorithm for MULTICOVER.
Proof. Consider the unconstrained multicover problem. We first take the LP relaxation
of the integer program, and solve the LP in polynomial time to obtain the fractional optimum
solution, denoted by $\mathrm{OPT}_f$. From the fractional optimum $\mathrm{OPT}_f : x = (x_1, x_2, \ldots, x_m)$,
each $x_j$ can be split into an integral part $I_j$ and a fractional part $F_j$, where $0 \le F_j < 1$.
Then, $\mathrm{OPT}_f = I + F$. Now consider the “residual” IP:
\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x'_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} \cdot x'_j \ge \sigma'_i, & i &= 1, \ldots, n \\
& 0 \le x'_j \le 1, & j &= 1, \ldots, m \\
& x'_j \in \mathbb{Z}
\end{aligned}
\tag{3.1}
\]
where
\[
\sigma'_i = \max\Big\{ \sigma_i - \sum_{j=1}^{m} A_{i,j} I_j,\; 0 \Big\}
\]
to make sure $\sigma'_i$ remains nonnegative. Then $F = (F_1, F_2, \ldots)$ forms an optimal solution
for the LP relaxation of the above residual IP (since otherwise we could replace it with a
better solution for the residual LP relaxation and improve $\mathrm{OPT}_f$).
Now, suppose we have an approximate solution $A$ that is at most $\alpha \cdot F$ for some $\alpha > 1$.
Then our overall solution is:
\[
I + A \le I + \alpha F \le \alpha (I + F) = \alpha \cdot \mathrm{OPT}_f \le \alpha \cdot \mathrm{OPT}
\]
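The splitting step in this proof can be sketched in Python as follows. The sketch assumes that a fractional LP solution is already available (solving the LP is not shown), and the vector used in the example is a hypothetical feasible fractional point rather than a computed LP optimum. The function name `split_and_residual` is ours.

```python
import math


def split_and_residual(A, sigma, x_frac):
    """Split x_frac into I + F and compute the residual demands sigma'.

    I holds the integral parts, F the fractional parts (0 <= F_j < 1), and
    sigma'_i = max(sigma_i - sum_j A[i][j] * I_j, 0).
    """
    I = [math.floor(v) for v in x_frac]
    F = [v - i for v, i in zip(x_frac, I)]
    covered = [sum(a * i for a, i in zip(row, I)) for row in A]
    sigma_res = [max(s - c, 0) for s, c in zip(sigma, covered)]
    return I, F, sigma_res


# Instance of Figure 3.1 with a hypothetical fractional solution.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]
I, F, sigma_res = split_and_residual(A, sigma, [4.5, 7.5, 0.5])
print(I, F, sigma_res)
```

The residual demands are what the constrained 0−1 problem must still cover, which is why an approximation for the constrained problem suffices.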
Therefore, from the standpoint of approximation algorithms, we can focus on the
constrained multicover problem. Due to their resemblance, standard techniques for
set cover can be applied with similar approximation guarantees, including the simple
greedy algorithm (which iteratively picks the set containing the largest number of un-
covered elements), or the LP rounding technique using the residual solution F .
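A minimal sketch of the greedy approach for the constrained problem is given below. It follows the common adaptation of greedy set cover to multicover, in which each unpicked set is scored by the number of elements it contains that still have unmet demand. The function name and the tie-breaking rule (lowest index first) are our own choices.

```python
def greedy_constrained_multicover(A, demand):
    """Greedy heuristic for constrained (0-1) multicover.

    A[i][j] = 1 iff set j contains element i; demand[i] is how many times
    element i must be covered. Each set may be picked at most once.
    Returns the picked set indices in order, or None if the demands
    cannot be met.
    """
    n, m = len(A), len(A[0])
    residual = list(demand)
    unused = list(range(m))
    chosen = []

    def gain(j):
        # Number of elements in set j that still need coverage.
        return sum(1 for i in range(n) if A[i][j] and residual[i] > 0)

    while any(r > 0 for r in residual):
        best = max(unused, key=gain, default=None)
        if best is None or gain(best) == 0:
            return None  # no remaining set helps: infeasible
        unused.remove(best)
        chosen.append(best)
        for i in range(n):
            if A[i][best] and residual[i] > 0:
                residual[i] -= 1
    return chosen


A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
print(greedy_constrained_multicover(A, [0, 1, 1, 0]))
print(greedy_constrained_multicover(A, [1, 1, 1, 1]))
```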
3.3.2 An Algorithm for Minimum Protein Types
In Section 3.2, we showed that MINIMUMPROTEINTYPES is NP-hard via a reduction from
set cover. While MINIMUMPROTEINTYPES is formulated as a nonlinear integer program,
a harder class of problems in general, our problem is in fact equivalent to the set cover
problem.
Theorem 3.9. Any algorithm for SETCOVER can be used to find a solution for MINIMUMPRO-
TEINTYPES of the same size.
Proof. Let $\Pi$ denote an instance of MINIMUMPROTEINTYPES. Now consider the corresponding
set cover instance $\phi(\Pi)$ constructed by replacing the coverage requirement
with $\sigma_i = 1$ for each element $i$. Suppose $\phi(\Pi)$ contains a set cover of size $k$. Take the chosen
sets in $\phi(\Pi)$, and assign the corresponding $x_j$'s with arbitrarily large values so that
each coverage requirement $\sigma_i$ is satisfied. Since the value of the optimization function
is the number of $x_j$'s with positive values, we have a solution for MINIMUMPROTEINTYPES
of value $k$.
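The lifting step of this proof can be sketched as follows. As one concrete choice of a "large enough" value, the sketch assigns $\max_i \sigma_i$ to every chosen protein; this particular value and the function name `lift_cover_to_abundance` are assumptions made for illustration.

```python
def lift_cover_to_abundance(A, sigma, cover):
    """Turn a set cover (list of protein indices) into a feasible x.

    Every chosen protein gets abundance max(sigma), which satisfies each
    covering constraint because each peptide lies in a chosen protein.
    """
    chosen = set(cover)
    for i, row in enumerate(A):
        if not any(row[j] for j in chosen):
            raise ValueError(f"peptide {i} is not covered")
    big = max(sigma)
    return [big if j in chosen else 0 for j in range(len(A[0]))]


A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]
print(lift_cover_to_abundance(A, sigma, [0, 1]))
```

The number of positive entries equals the size of the cover, so the objective value of MINIMUMPROTEINTYPES is preserved, although the abundances themselves are far from realistic.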
This implies that any approximation algorithm for SETCOVER can be used to compute
a solution for MINIMUMPROTEINTYPES with the same approximation guarantee, and the
inapproximability results for SETCOVER also hold for MINIMUMPROTEINTYPES. See Sec-
tion 3.6 for inapproximability results on SETCOVER.
Finding the protein abundance $x$ for Minimum Protein Types. Observe that, in the
proof of Theorem 3.9, our algorithm assigns arbitrarily large values to the $x_j$'s chosen by
the SETCOVER algorithm. In practice, however, such arbitrarily large values would result
in solutions undesirable for biological purposes. To resolve this, we first run the
SETCOVER algorithm to choose a small set of proteins (which would be assigned positive
abundances), and then remove the unchosen proteins from the bipartite graph $G$.
Then, we apply the algorithm for MULTICOVER on the residual bipartite graph. This way,
we choose a small number of proteins and, further, assign small abundances to the chosen
proteins, in each case up to the approximation guarantee of the algorithm used.
3.3.3 An Algorithm for Minimum Uniform Error
We reduce the problem to a {0, 1} problem.
Lemma 3.10. If there is an $\alpha$-approximation algorithm for constrained
MINIMUMUNIFORMERROR, then there is an $\alpha$-approximation algorithm for unconstrained
MINIMUMUNIFORMERROR.
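Proof. We first spell out the original problem, which we shall call ILP1:
\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j} x_j \le \sigma_i + \varepsilon, & i &= 1, \ldots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, & j &= 1, \ldots, m
\end{aligned}
\tag{ILP1}
\]
Now, consider the LP relaxation of ILP1, namely LP1. Then, by
definition, $\mathrm{OPT}(\mathrm{LP1}) \le \mathrm{OPT}(\mathrm{ILP1})$. Each value assigned to $x_j$ in OPT(LP1) may be
fractional, and can be split into an integral part and a fractional part:
\[
\mathrm{OPT}(\mathrm{LP1}) = I + F
\]
where $0 \le F_j < 1$ for all $j$. We can then keep the solution given by $I$, and consider the
residual problem given by the following integer program, called ILP2:
\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma'_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j} x_j \le \sigma'_i + \varepsilon, & i &= 1, \ldots, n \\
& 0 \le x_j \le 1, & j &= 1, \ldots, m \\
& x_j \in \mathbb{Z}, & j &= 1, \ldots, m
\end{aligned}
\tag{ILP2}
\]
where $\sigma'_i = \max\{\sigma_i - \sum_{j=1}^{m} A_{i,j} I_j,\; 0\}$.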
Now, consider the optimum solution for the residual integer program, namely OPT(ILP2).
By definition, we have the linear relaxation LP2 such that $\mathrm{OPT}(\mathrm{LP2}) \le \mathrm{OPT}(\mathrm{ILP2})$. For
OPT(LP2), we claim that $F \le \mathrm{OPT}(\mathrm{LP2})$, and $F$ is a feasible solution. To show $F \le \mathrm{OPT}(\mathrm{LP2})$,
suppose otherwise. Then, OPT(LP2) can be pasted onto $I$ to form:
\[
I + \mathrm{OPT}(\mathrm{LP2}) < I + F \le \mathrm{OPT}(\mathrm{LP1})
\]
contradicting the optimality of OPT(LP1).
Finally, suppose there is an approximate solution $A$ for OPT(LP2) that is at most $\alpha \cdot F$
for some $\alpha > 1$. Then our overall solution is:
\[
I + A \le I + \alpha \cdot F \le \alpha (I + F) = \alpha \cdot \mathrm{OPT}(\mathrm{LP1}) \le \alpha \cdot \mathrm{OPT}(\mathrm{ILP1})
\]
Therefore, the overall approach would yield an $\alpha$-approximation algorithm.
We thus focus on the constrained 0−1 problem. We can design a randomized round-
ing scheme where the fractional solution F is used as a probability, as shown in Algo-
rithm 3.3.1.
Algorithm 3.3.1: Randomized Rounding Scheme for MINIMUMUNIFORMERROR
Input: Fractional solution vector F from OPT(LP1).
Output: Binary solution vector x.
for j = 1, . . . , m do
    p_j ← F_j (each 0 < F_j < 1 is treated as a probability);
    Flip a p_j-biased coin;
    if heads then
        x_j ← 1;
    else
        x_j ← 0;
    end
end
Return x.
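Algorithm 3.3.1 can be sketched in Python as below. The optional `rng` parameter is our addition, so that runs can be made reproducible; it is not part of the algorithm as stated.

```python
import random


def randomized_rounding(F, rng=None):
    """Round each fractional value F_j to 1 with probability F_j, else 0."""
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so the test succeeds with probability p.
    return [1 if rng.random() < p else 0 for p in F]


rng = random.Random(0)  # fixed seed, only for reproducibility of the example
x = randomized_rounding([0.5, 0.5, 0.5], rng)
print(x)
```

Each coordinate is rounded independently, which is the property used in the tail-bound analysis that follows.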
To analyze the rounding scheme, we define a random variable $Y_i = \sum_{(i,j) \in E} x_j$ where
$x_j \in \{0, 1\}$ is the result of tossing a coin for each $j$ with probability $p_j$ (a Poisson trial). By
linearity of expectation, we have:
\[
E[Y_i] = \sum_{(i,j) \in E} E[x_j] = \sum_{(i,j) \in E} p_j = \tau_i,
\]
where $\tau_i = \sum_{(i,j) \in E} F_j$, the coverage achieved by $F$. Since the fractional solution $F$ has at
most the optimum error from the integer program, the expected solution $E[Y_i]$ is optimal.
We now show that Algorithm 3.3.1 gives a small tail probability.
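Lemma 3.11. Algorithm 3.3.1 returns a solution such that
\[
\Pr\left[ Y_i > (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \right] < \frac{1}{n^2}
\]
and
\[
\Pr\left[ Y_i < (1-\delta)\tau_i - \frac{4 \log n}{\delta^2} \right] < \frac{1}{n^2}
\]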
Proof. To estimate the tail probability, we apply the Chernoff bound [29, 104] on $Y_i$. For
the bound on the upper tail, we have:
\[
\Pr[Y_i > (1+\delta)\tau_i] < e^{-\tau_i \delta^2 / 3}
\]
and for the bound on the lower tail:
\[
\Pr[Y_i < (1-\delta)\tau_i] < e^{-\tau_i \delta^2 / 2}
\]
We show each case separately.
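Case I (Upper tail): Suppose $\tau_i \ge \frac{6 \log n}{\delta^2}$. Then it follows that:
\[
\tau_i \ge \frac{6 \log n}{\delta^2}
\;\Rightarrow\; \frac{\tau_i \delta^2}{3} \ge 2 \log n
\;\Rightarrow\; \log e^{\tau_i \delta^2 / 3} \ge \log(n^2)
\;\Rightarrow\; e^{-\tau_i \delta^2 / 3} \le \frac{1}{n^2}
\]
and it follows that
\[
\Pr[Y_i > (1+\delta)\tau_i] < \frac{1}{n^2} \quad \text{if } \tau_i \ge \frac{6 \log n}{\delta^2} \tag{1}
\]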
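Now consider the case $\tau_i < \frac{6 \log n}{\delta^2}$. We introduce an absolute error $D$ such that
$(1+\delta)\tau_i = \tau_i + D$. Then we can express it as $D = \delta \tau_i$, and rewrite the assumption with the
additive error $D$:
\[
\tau_i < \frac{6 \log n}{\delta^2}
\;\Rightarrow\; \tau_i < \frac{6 \log n}{(D/\tau_i)^2}
\;\Rightarrow\; D < \sqrt{\tau_i \cdot 6 \log n}
\]
Moreover, the Chernoff bound for the upper tail now becomes:
\[
\begin{aligned}
\Pr[Y_i > \tau_i + D] &\le e^{-\frac{D^2}{3\tau_i}} \\
&< e^{-\frac{(\sqrt{\tau_i \cdot 6 \log n})^2}{3\tau_i}} && \text{(by assumption)} \\
&= e^{-2 \log n} = \frac{1}{n^2}
\end{aligned}
\]
Rewriting $D$ in $\delta$ (using $\tau_i < \frac{6 \log n}{\delta^2}$ and $0 < \delta \le 1$) gives
\[
D < \sqrt{\tau_i \cdot 6 \log n} < \frac{6 \log n}{\delta} \le \frac{6 \log n}{\delta^2},
\]
and thus we obtain
\[
\Pr\left[ Y_i > \tau_i + \frac{6 \log n}{\delta^2} \right] < \frac{1}{n^2} \quad \text{if } \tau_i < \frac{6 \log n}{\delta^2} \tag{2}
\]
Combining (1) and (2), we have
\[
\Pr\left[ Y_i > (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \right] < \frac{1}{n^2}.
\]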
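Case II (Lower tail): Consider the lower tail as below.
\[
\Pr[Y_i < (1-\delta)\tau_i] < e^{-\tau_i \delta^2 / 2}
\]
First, suppose $\tau_i \ge \frac{4 \log n}{\delta^2}$. Rearranging this gives $e^{-\tau_i \delta^2 / 2} \le \frac{1}{n^2}$, and it follows that
\[
\Pr[Y_i < (1-\delta)\tau_i] < \frac{1}{n^2} \quad \text{if } \tau_i \ge \frac{4 \log n}{\delta^2} \tag{3}
\]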
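Now consider the case $\tau_i < \frac{4 \log n}{\delta^2}$. As before, let $D$ be an additive error such that
$(1-\delta)\tau_i = \tau_i - D$. Then we have $D = \delta \tau_i$, and it follows that
\[
\tau_i < \frac{4 \log n}{\delta^2}
\;\Rightarrow\; \tau_i < \frac{4 \log n}{(D/\tau_i)^2}
\;\Rightarrow\; \frac{D^2}{\tau_i} < 4 \log n
\;\Rightarrow\; D < \sqrt{\tau_i \cdot 4 \log n}
\]
Now, the Chernoff bound can be rewritten:
\[
\begin{aligned}
\Pr[Y_i < \tau_i - D] &< e^{-\tau_i \delta^2 / 2} \\
&= e^{-\frac{D^2}{2\tau_i}} \\
&< e^{-\frac{(\sqrt{\tau_i \cdot 4 \log n})^2}{2\tau_i}} \\
&= e^{-2 \log n} = \frac{1}{n^2}
\end{aligned}
\]
Rewriting $D$ in $\delta$ (using $\tau_i < \frac{4 \log n}{\delta^2}$ and $0 < \delta \le 1$) gives
\[
D < \sqrt{\tau_i \cdot 4 \log n} < \frac{4 \log n}{\delta} \le \frac{4 \log n}{\delta^2},
\]
and thus we obtain
\[
\Pr\left[ Y_i < \tau_i - \frac{4 \log n}{\delta^2} \right] < \frac{1}{n^2} \quad \text{if } \tau_i < \frac{4 \log n}{\delta^2} \tag{4}
\]
Combining (3) and (4), we have
\[
\Pr\left[ Y_i < (1-\delta)\tau_i - \frac{4 \log n}{\delta^2} \right] < \frac{1}{n^2}
\]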
As a result, for each $s_i$, flipping a coin independently with probability $p_j$ for each
neighbour $w_j$ results in a value within the error bound of
\[
\left[ (1-\delta)\tau_i - \frac{4 \log n}{\delta^2},\; (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \right]
\]
with high probability for both the upper and lower tail. We can then put all the $s_i$'s
together to build a bound on the randomized algorithm.
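To get a feel for the size of this interval, the following sketch evaluates its two endpoints for given $\tau_i$, $\delta$ and $n$. It assumes that $\log$ denotes the natural logarithm, which is consistent with the step $e^{-2 \log n} = 1/n^2$ in the proof; the function name `rounding_interval` is ours.

```python
import math


def rounding_interval(tau, delta, n):
    """Endpoints of the interval from Lemma 3.11 (natural log assumed)."""
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    lower = (1 - delta) * tau - 4 * math.log(n) / delta**2
    upper = (1 + delta) * tau + 6 * math.log(n) / delta**2
    return lower, upper


print(rounding_interval(100.0, 0.5, 100))
```

The additive terms do not depend on $\tau_i$, so the interval is comparatively tight for peptides with large fractional coverage and loose for peptides with small coverage.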
Theorem 3.12. The randomized rounding scheme finds a solution with optimal expected
value for MINIMUMUNIFORMERROR with high probability.
Proof. We use the union bound to put all $s_i$'s together. Consider the upper tail: for
each $i = 1, \ldots, n$, let $A_i$ denote the event that $Y_i$ is greater than the specified bound. By
Lemma 3.11, $\Pr[A_i] \le \frac{1}{n^2}$ for each $A_i$. By the union bound, we have:
\[
\Pr\Big[ \bigcup_i A_i \Big] \le \sum_i \Pr[A_i] \le n \cdot \frac{1}{n^2} = \frac{1}{n}
\]
Hence, the probability that no $s_i$ falls above the bound is at least $1 - \frac{1}{n}$.
Similarly, for the lower tail, the probability that any $s_i$ falls below the specified bound
is at most $\frac{1}{n}$. Hence, the probability that no $s_i$ falls below the bound is at least $1 - \frac{1}{n}$.
3.3.4 An Algorithm for Minimum Error Sum
Here we discuss the MINIMUMERRORSUM problem where we allow an independent error
term $\varepsilon_i$ for each peptide abundance $\sigma_i$. As before, we focus on the constrained 0−1
problem via LP relaxation.
Lemma 3.13. If there is an $\alpha$-approximation algorithm for constrained MINIMUMERRORSUM,
then there is an $\alpha$-approximation algorithm for unconstrained MINIMUMERRORSUM.
Proof. Similar to the proof of Lemma 3.10.
We can thus focus on the constrained, 0− 1 problem. The randomized rounding
scheme for MINIMUMUNIFORMERROR shown in Algorithm 3.3.1 can be applied here, too.
Theorem 3.14. There is a randomized rounding scheme that finds a solution with opti-
mal expected value for MINIMUMERRORSUM with high probability.
Proof. Similar to the proofs of Lemma 3.11 and Theorem 3.12.
3.4 Experimental Results
In this section, we discuss how our algorithms for the protein quantification problem
have been tested empirically against simulated data as well as biological data from AP-
MS experiments.
3.4.1 Performance on Simulated Data
String Trypsinization. To generate our simulated test data, we took a collection of
1000 random strings of length 150 over an alphabet of size 4. The abundance of each
string $w_j$ was then set to a random value between 1 and 100. We call the resulting
multiset of strings $x$. Then, we cut the strings at random into approximately 30 fragments
to simulate the trypsinization of proteins. The string fragments are counted to form the
peptide abundance $\sigma$ for this input string set.
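The data generation described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than a description of the actual experiment: the alphabet letters, uniform sampling of abundances, cutting each string into exactly `fragments` pieces at uniformly random positions, and the fixed seed. The function name `simulate` is ours.

```python
import random
from collections import Counter


def simulate(num_strings=1000, length=150, alphabet="ACGT",
             fragments=30, max_abundance=100, seed=0):
    """Generate random strings, their abundances x, and fragment counts sigma."""
    rng = random.Random(seed)
    strings, x, sigma = [], [], Counter()
    for _ in range(num_strings):
        w = "".join(rng.choice(alphabet) for _ in range(length))
        abundance = rng.randint(1, max_abundance)
        # Pick fragments - 1 distinct cut positions strictly inside the string.
        cuts = sorted(rng.sample(range(1, length), fragments - 1))
        pieces = [w[a:b] for a, b in zip([0] + cuts, cuts + [length])]
        for p in pieces:
            sigma[p] += abundance  # each copy of w contributes one copy of p
        strings.append(w)
        x.append(abundance)
    return strings, x, sigma


strings, x, sigma = simulate(num_strings=5, length=20, fragments=4, seed=1)
print(len(sigma), sum(x))
```

Because the alphabet is small and the fragments are short, identical fragments arise in several strings, which is what produces shared peptides in the simulated data.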
We acknowledge that the parameters chosen for the generation of protein / peptide
data are not based on a realistic model of protein sequences (e.g. alphabet of size 20,
realistic length for protein sequences, realistic trypsinization process, etc.). However,
note that the performance of our approaches depends directly on the edge density of
the bipartite graph constructed from the strings. The chosen parameters ensure that
each protein consists of up to 30 peptides, and that the edge density of the resulting
bipartite graph is similar to that seen in the real AP-MS data.
We first ran our algorithms on the synthetic data, and compared the results against
the protein abundance of the original set of strings x . Table 3.1 summarizes the results
of these comparisons. However, because each problem has its own distinct objective
function, comparing the output of our algorithms based on any one objective function
is not very meaningful. Furthermore, the abundance $x$ of the original set of strings may
not be “optimal” for the input vector $\sigma$ depending on the objective. In fact, as shown in
Table 3.1, the original abundance vector $x$ does not achieve the optimum objective value
for MULTICOVER and MINIMUMPROTEINTYPES, while our approximate algorithms provide
slightly better solutions than $x$.
Problem                 Objective (original x)    Objective (our result)
MULTICOVER              6439                      5371
MINIMUMPROTEINTYPES     1000                      887
MINIMUMUNIFORMERROR     0                         74
MINIMUMERRORSUM         0                         782
Table 3.1: Output quality of the algorithms compared to the original abundance vector $x$ (Objective indicates the value from the corresponding objective function). Note that in MULTICOVER and MINIMUMPROTEINTYPES, our solution achieves better objective values than the original abundance of strings, while in the error minimizing problems (i.e. MINIMUMUNIFORMERROR, MINIMUMERRORSUM) the original string abundances produce no error.
Consequently, we instead base our performance analyses on the distance between
the proposed solutions and the original abundance $x$. Let $x, x'$ denote two solution
vectors. Then, we define the distance between $x$ and $x'$ as the $L_1$-distance:
\[
d(x, x') = \sum_i |x_i - x'_i|
\]
where $x_i$ denotes the $i$-th component of $x$. We can interpret this $L_1$-distance between
our proposed solution $x'$ and the original abundance $x$ as the error present in our
prediction.
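The distance is straightforward to compute; a minimal Python sketch, with a function name of our choosing, is:

```python
def l1_distance(x, x_prime):
    """L1 distance between two abundance vectors of equal length."""
    if len(x) != len(x_prime):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, x_prime))


# Abundances of Figure 3.1 against an alternative cover of the same instance.
print(l1_distance([4, 7, 1], [5, 8, 0]))
```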
Robustness Analysis. When Gentleman and Huber [60] discussed errors that are prevalent
in proteomics, they classified the measurement errors into two distinct types: stochastic
errors and systematic errors. Stochastic errors have random variability, and thus can
be reduced by simply repeating the experiment. On the other hand, systematic errors
are recurrent in the measurements, which makes them difficult to identify and remove.
From the perspective of the protein quantification problem, stochastic errors are
embedded into the peptide abundance $\sigma$. In our formulation of the problems,
MINIMUMUNIFORMERROR and MINIMUMERRORSUM in particular, we attempt to remove the
stochastic errors by looking for a solution with the smallest error $\varepsilon$.
[Figure 3.2: plot of Errors (vertical axis, 0 to 5500) against Coverage (%) (horizontal axis, 20 to 100), with one series each for MUE, MES, MC and MPT.]
Figure 3.2: Robustness of the algorithms for MPT (MINIMUMPROTEINTYPES), MC (MULTICOVER), MES (MINIMUMERRORSUM) and MUE (MINIMUMUNIFORMERROR). Errors indicate the $L_1$ distance from the original protein abundances $x$ (averaged over 5 distinct partial datasets).
On the other hand, because some peptides are notoriously difficult to identify, and
the sensitivity of mass spectrometers is relatively poor, the mass spectrometry data is far
from comprehensive. As a result, certain peptides (and their abundances) are missing from
the data, and form systematic errors that are recurrent throughout the dataset. We test our
algorithms against such systematic errors by analyzing the robustness of the algorithms.
In order to test the robustness of our algorithms, we randomly selected and removed⁴
a specified proportion of constraints before running the algorithms. For example, if we
remove 25% of the constraints (the peptide abundance σ) from the system, solving the
problem with the remaining 75% of the constraints can simulate the problem of quanti-
fying the protein abundance with 75% sensitivity of the MS data.
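The removal of constraints can be sketched as follows. The sketch keeps a random subset of rows of $A$ together with the matching entries of $\sigma$; rounding the number of kept rows to the nearest integer and the fixed seed are assumptions of this sketch, and the function name `drop_constraints` is ours.

```python
import random


def drop_constraints(A, sigma, coverage, seed=0):
    """Keep a random fraction `coverage` of the peptide constraints.

    Dropped peptides are removed from the system entirely (their rows
    disappear); they are not set to sigma_i = 0.
    """
    if not 0 <= coverage <= 1:
        raise ValueError("coverage must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(sigma)
    kept = sorted(rng.sample(range(n), round(coverage * n)))
    return [A[i] for i in kept], [sigma[i] for i in kept], kept


A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]
A_sub, sigma_sub, kept = drop_constraints(A, sigma, 0.75)
print(kept, sigma_sub)
```

The reduced system can then be passed to any of the algorithms unchanged, since the number of protein columns is preserved.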
After removing a fixed proportion of the input data (peptide abundance σ), we ran
our algorithm for each problem and compared the results in terms of the errors produced,
as estimated by the metric d between the original abundances x and the computed
abundances. Figure 3.2 shows the result of this comparison for different levels of input
coverage (i.e., the percentage of the data that remains). Among the four problem
formulations, MINIMUMPROTEINTYPES showed the closest fit to the original abundance x
across the different levels of input coverage, while MINIMUMERRORSUM and MULTICOVER
showed similar performance until the input coverage fell below 50%. Furthermore, when
comparing the errors produced at 50% and at 100% input coverage, the algorithm for
MINIMUMPROTEINTYPES proved to be the most robust among the four problem formulations.
4 We note that, by removing a particular peptide abundance σi, we simply remove the corresponding constraint in the system rather than setting σi = 0 for that peptide.
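As an illustration of how the plotted errors could be computed, the sketch below averages the L1 distance between the original abundances and the abundances recovered from several partial datasets. The dictionary representation, the function names, and the convention that a protein missing from a solution has abundance 0 are assumptions made for this example only.

```python
def l1_distance(x_true, x_est):
    """Sum of absolute abundance differences over all proteins.

    Proteins absent from one of the two dictionaries are treated as
    having abundance 0 (an assumption of this sketch).
    """
    proteins = set(x_true) | set(x_est)
    return sum(abs(x_true.get(p, 0.0) - x_est.get(p, 0.0)) for p in proteins)

def mean_error(x_true, estimates):
    """Average L1 error over the solutions from several partial datasets."""
    return sum(l1_distance(x_true, e) for e in estimates) / len(estimates)

# Hypothetical abundances for three proteins and two partial-dataset runs.
x_true = {"P1": 4.0, "P2": 1.0, "P3": 0.0}
runs = [
    {"P1": 4.0, "P2": 1.0},  # exact recovery: error 0
    {"P1": 3.0, "P3": 2.0},  # |4-3| + |1-0| + |0-2| = 4
]
avg = mean_error(x_true, runs)  # (0 + 4) / 2 = 2.0
```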
3.4.2 Protein Quantification from AP-MS Data
To test our algorithms on real biological data, we acquired several datasets from
AP-MS experiments performed by our collaborators⁵. Spectral count is used to estimate the
abundance of peptides, as it measures peptide concentrations in protein mixtures based on
the number of tandem mass spectra obtained for each peptide [22, 30, 100]. Recall that
each AP-MS experiment consists of a bait protein, together with peptides from its
interacting partners, the prey proteins. We obtained datasets from three separate mass
spectrometry runs, one for each of three bait proteins: RPAP3, CDK9, and POLR2A. Since
each of these datasets is produced by a distinct mass spectrometry experiment, we treat
them as different samples.
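Spectral counting as described above amounts to tallying identified spectra per peptide. The sketch below assumes the identifications arrive as a flat list with one peptide sequence per tandem mass spectrum; that input layout, the function name, and the peptide strings are placeholders for illustration and are not taken from the datasets.

```python
from collections import Counter

def spectral_counts(identified_spectra):
    """Return the number of tandem mass spectra assigned to each peptide.

    identified_spectra : iterable of peptide sequences, one entry per
                         identified spectrum (assumed input layout).
    """
    return Counter(identified_spectra)

# Hypothetical identifications from one run: four spectra, two peptides.
spectra = ["PEPTIDEA", "PEPTIDEB", "PEPTIDEA", "PEPTIDEA"]
sigma = spectral_counts(spectra)
```

The resulting counts would play the role of the peptide abundances σ in the problem formulations.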
As shown in Table 3.2, each of the three datasets from AP-MS experiments contains
roughly 2000 peptides that belong to roughly 1000 proteins. However, when restricted to
only the connected components (in the bipartite graph) containing shared peptides, each
dataset contains roughly 500 peptides over roughly 100 proteins.
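The restriction to components containing shared peptides can be sketched with a breadth-first search over the bipartite peptide-protein graph. This is an illustrative sketch only: the input is assumed to be a dictionary mapping each peptide to the set of proteins that contain it, and the function name is hypothetical. A component is reported when at least one of its peptides belongs to two or more proteins.

```python
from collections import defaultdict, deque

def nontrivial_components(peptide_to_proteins):
    """Return (peptides, proteins) for each component with a shared peptide."""
    # Tag nodes so that peptide and protein names cannot collide.
    adj = defaultdict(set)
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            adj[("pep", pep)].add(("prot", prot))
            adj[("prot", prot)].add(("pep", pep))

    seen = set()
    result = []
    for start in sorted(adj):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        peptides = {name for kind, name in component if kind == "pep"}
        proteins = {name for kind, name in component if kind == "prot"}
        # Keep the component only if some peptide is shared by 2+ proteins.
        if any(len(peptide_to_proteins[p]) > 1 for p in peptides):
            result.append((peptides, proteins))
    return result

# Hypothetical toy mapping: only pepB is shared between proteins.
mapping = {
    "pepA": {"P1"},
    "pepB": {"P2", "P3"},
    "pepC": {"P3"},
    "pepD": {"P4"},
}
components = nontrivial_components(mapping)
```

In this toy input, the components around P1 and P4 are trivial because each peptide maps to a single protein, so only the component containing pepB and pepC is returned.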
While the nontrivial portion of the problem is smaller in terms of the number of
peptides and proteins, the corresponding bipartite graphs are much denser due to the
peptides shared by many proteins. In fact, peptides in our datasets are shared by as
many as 19 proteins. For example, Figures 3.3 through 3.5 show the bipartite graphs for
the datasets with CDK9, POLR2A, and RPAP3 as the bait, respectively.
5 Our test data was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM), and consists of human PPI data from AP-MS experiments.

Number                    CDK9    POLR2A  RPAP3
Overall      proteins     1593    1386    901
             peptides     2592    2861    1689
             components   1320    1211    703
Non-trivial  proteins     125     70      125
             peptides     672     525     589
             components   21      19      26

Table 3.2: Network analysis of prey proteins from AP-MS experiments. Overall indicates the total number of proteins, peptides, and connected components for each dataset, while non-trivial indicates only the connected components containing shared peptides.
We executed our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM on
the nontrivial components (containing shared peptides) of each dataset. Furthermore,
for smaller connected components with fewer than 20 proteins, we also computed exact
solutions to the corresponding integer programs⁶, and found the results to be promising.
In the case of MINIMUMPROTEINTYPES, the integral optimum solution and our
approximate solution often picked the same set of proteins (FP ∼ 10%, FN ∼ 15%), with
only marginal differences in the abundance. In the case of MINIMUMERRORSUM, the sum
of errors ($\sum_i \varepsilon_i$) in our approximate solution differs from that of the
integral optimum solution by at most the maximum peptide abundance value ($\sigma_i$) in
that component. While a pleasant result, this is no surprise, since we obtain the
significant part of our protein abundance from the optimum of the LP relaxation.
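The comparison between the approximate and the exact protein selections can be illustrated as below. The text does not spell out how FP and FN are normalized, so this sketch assumes one plausible convention: FP is the fraction of proteins selected by the approximate solution that the exact solution does not select, and FN is the fraction of proteins selected by the exact solution that the approximate solution misses. The function name and the toy abundances are hypothetical.

```python
def selection_agreement(approx, exact):
    """Compare which proteins receive positive abundance in two solutions.

    approx, exact : dicts mapping protein -> abundance.
    Returns (fp_rate, fn_rate) under the convention assumed in this sketch.
    """
    approx_set = {p for p, a in approx.items() if a > 0}
    exact_set = {p for p, a in exact.items() if a > 0}
    false_pos = approx_set - exact_set  # picked only by the approximation
    false_neg = exact_set - approx_set  # picked only by the exact solution
    fp_rate = len(false_pos) / len(approx_set) if approx_set else 0.0
    fn_rate = len(false_neg) / len(exact_set) if exact_set else 0.0
    return fp_rate, fn_rate

# Hypothetical solutions for one component with four candidate proteins.
approx = {"P1": 2.0, "P2": 1.5, "P3": 0.0, "P4": 0.5}
exact = {"P1": 2.0, "P2": 1.0, "P3": 1.0, "P4": 0.0}
fp_rate, fn_rate = selection_agreement(approx, exact)
```

Here each solution selects three proteins and they disagree on one, so both rates are one third.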
Within our dataset, there were various instances where a group of peptides is shared
by a large number of proteins. In such instances, our algorithm for MINIMUMPROTEINTYPES
picked out only a few proteins as constituent proteins. For example, Figure 3.6
shows a connected component from the CDK9 dataset, where all the peptides are shared
by variants of the histone family. Among the 14 proteins that share the peptides in this
component, only 3 proteins were selected with positive abundance values. This partic-
6 Gurobi Optimizer (http://www.gurobi.com/) was used to solve the integer programs as well astheir LP-relaxations.
IMNTFSVMPSPK,
MASTFIGNSTAIQELFK,
SGPFGQLFRPDNFIFGQTGAGNNWAK, SGAFGHLFRPDNFIFGQSGAGNNWAK,
INVYYNESSSQK,
AILVDLEPGTMDSVR,
HNSVEDSVTK,MPIEGSENPERPFLEK,
TYSLSSSFSSSSSTR,
WLSSQPSFKLEPTQGHR,
AASSKPEEIKMR,
EQLENSPSR,
ISDHSSVKQEYTHK, WLSSQPSFK,
RIWNWR,
KLEHVIK,SILEQLAEKNFELVINLSMR,
KLETLDLDVR,
CGVEADKELSCR,
HHGPISTTPGIIPQK,
LFVEALGQIGPAPPLK,
YVAPPSLR,SSSLNFSFPSLPTMGQMPGHSSDTSGLSFSQPSCK,
WYVDGVEVHNAK, SCDTPPPCPR,
TPLGDTTHTCPR,
IGHG3,
VVSVLTVLHQDWLNGKEYK,
TPEVTCVVVDVSHEDPEVQFK,
TPEVTCVVVDVSHEDPDVQFKWYVDGVEVHNAK,
NQVSLTCLVK,
IGHG4,
FWEVISDEHGIDPAGGYVGDSALQLER,
GFYPSDIAVEWESNGQPEDNYK,
GFYPSDIAVEWESNGQPENNYK,
TTPPVLDSDGSFFLYSK,
GHYTEGAELVDAVLDVVR,
EAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK,
EIVHIQAGQCGNQIGTK, FPGQLNADLR,
IREEYPDR,KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
FWEVISDEHGIDPTGTYHGDSDLQLDR,
LHFFMPGFAPLTSR,
MASTFIGNSTAIQELFKR,
FPGQLNADLRK,
TUBB,
TUBB8,
GHYTEGAELVDSVLDVVRK, GHYTEGAELVDSVLDVVR,
NSSYFVEWIPNNVK,
FWEVISDEHGIDPSGNYVGDSDLQLER,
LAVNMVPFPR,
IMNTFSVVPSPK,
ISEQFTAMFR,
NMMAACDPR,
TUBB2A,
YLTVAAIFR,
INVYYNEAAGNKYVPR, EQLENTPSRR,
VAHACLHPLEPLLDTK,
NIISSTALFLAAKVEEQAR,
CCNT1,
CCNT2,
EQLENTPSR,NIISSTALFLAAK,LNVSQLTINTAIVYMHR,
QYISSHNSVFNHPLPPPPPVTYQVGYGHLSTLVK,
CTQLVR,QETSLSGSQYNINFQQGPSISLHSGLHHRPDK,
FNKNIISSTALFLAAK, FYMIQSFTQFPGNSVAPAALFLAAK,
GPSEETGGAVFDHPAK,
HSSQTSNLAHK,
AKHAEELAAQK,
RWYFTR,
DITDPLSLNTCTDEGHVVLASPLK,
IWNWR,
FGVDPDKELSYR,
QLENMEANVK,
SCFPASLTASR,
HSHSQLPVGTGNKRPGDPK,
SGNTDKPRPPPLPSEPPPPLPPLPK,
GGGPQAQSHGEAR,
FQYGNYCK,
HHHHHPLPAAGFKK,
TPHVLVLGSGVYR,
LSLDDLLQR,
VPQFSFSR,
FWEVISDEHGIDPTGSYHGDSDLQLER,
MSATFIGNSTAIQELFK,
RPVINAGDGVGEHPTQALLDIFTIR,
QGQSQAASSSSVTSPIK,
ILALDCGLK,
HSADGIPPTVLR,
TLHEWLQQHGIPGLQGVDTR,
CAD,
GHYTEGAELVDAVLDVVRK, MAVTFIGNSTAIQELFK,
EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, STSESTAALGCLVK,
TTPPVLDSDGSFFLYSR, TAVCDIPPR,
CCVECPPCPAPPVAGPSVFLFPPKPK,
IGHG2,
VVSVLTVVHQDWLNGKEYK,
TPEVTCVVVDVSHEDPEVQFNWYVDGVEVHNAK, KCCVECPPCPAPPVAGPSVFLFPPKPK,
VVSVLTVVHQDWLNGK,
GPSVFPLAPCSR,
IGH@,
FWEVISDEHGIDPTGTYHGDSDLQLER,
MAATFIGNSTAIQELFK,
TUBB4A,
TUBB4B,
EIVHLQAGQCGNQIGAK, MREIVHLQAGQCGNQIGAK,
INVYYNEATGGK,
SGPFGQIFRPDNFVFGQSGAGNNWAK,
MAVTFIGNSTAIQELFKR,
YLTVAAVFR,
LTTPTYGDLNHLVSATMSGVTTCLR,
VLGTPVETIELTEDRR,
EVDEQMLNVQNK, LAADFSVPLIIDIK,
VYFLPITPHYVTQVIR,
MSATFIGNSTAIQELFKR,
SQYAYAAQNLLSHHDSHSSVILK, HAEELAAQK,
EQLENSPSRR,
IPVAGGDKAASSKPEEIK,
VSLKEYR,
TSENLALTGVDHSLPQDGSNAFISQK, AASSKPEEIK,
TYSLSSSFSSSSSTRK, MALLATVLGRF,
EIEYEVVR,
VTAVDWHFEEAVDGECPPQR,
KVLILGSGGLSIGQAGEFDYSGSQAIK,
LLESLGYSLYASLGTADFYTEHGVK,
HPQPGAVELAAK,LALGIPLPELR,
LVQNGTEPSSLPFLDPNARPLVPEVSIK,
LLDTIGISQPQWR,
VAHTCLHPQESLPDTR,
VIECNVR,RGPSEETGGAVFDHPAK,
WFESSGIHVAALVVGECCPTPSHWSATR,
HAEELAAQKR,HSHSQLPVGTGNK,
LPPQTLEGDPGAEGEEGTTTVRK,
ESPGAAATSSSGPQAQQHR,
QQAANLLQDMGQR,
GRDVLDLGCNVGHLTLSIACK,
HHHHHPLPAAGFK,
AAPPDVGEER,
MRIPVAGGDK,
NKPIPALR,
GFAFVEFETKEQAAK, FLREQIEK,KEYLALQK,EDNIQAKEENMDTSNTSISK,
LARP7, MEPCE,
RGGGGTELGPPAPPRPR,
TLNAETPK,
GQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR,
GGGGTELGPPAPPRPR,
NPSCEDGRLR,
GFQRPVYLFHK,DAPQPYELNTAINCRDEVVSPLPSALQGPSGSLSAPPAASVISAPPSSSSR,
FKTPEDAQAVINAYTEINK,
LEILSGDHEQR,
EENMDTSNTSISK,
QVLADIAK,
DTLAAISEVLYVDLLEGDTECHAR,
QVDFWFGDANLHK,
SRDGYVDISLLVSFNK,
KLTTDGK,
HCWKLEILSGDHEQR,
DPVEILIPK,
KFQYGNYCK,
GRDPVEILIPK,
HYLSEELR,
AAPPDVGEERR,
CAPSAGSPAAAVGR,
IQLKPEQFSSYLTSPDVGFSSYELVATPHNTSK,
ITPSYVAFTPEGER,
DNHLLGTFDLTGIPPAPR,
NQLTSNPENTVFDAK,
NELESYAYSLK,
TWNDPSVQQDIK,
TKPYIQVDIGGGQTK,
IDTRNELESYAYSLK,
LTPEEIER,
HSPA5,
RSRPTSEGSDIESTEPQK,
NVNHSWIER,EYLALQK,
HKMGEEVIPLR,
IISTEPLPGR,SRPTSEGSDIESTEPQKQCSK,
VNATGPQFVSGVIVKIISTEPLPGR, CGNVVYISIPHYK,
IISTEPLPGRK,
HYLSEELRLPPQTLEGDPGAEGEEGTTTVR,
HLRPGGILVLEPQPWSSYGK, WVHLNWGDEGLK, MVGLDIDSR,
LPPQTLEGDPGAEGEEGTTTVR, HRGQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR,
WVHLNWGDEGLKR, DVLDLGCNVGHLTLSIACK,
VTHAVVTVPAYFNDAQR,
IEIESFYEGEDFSETLTR,
IINEPTAAAIAYGLDKR,
ITITNDQNR,
SQIFSTASDNQPTVTIK,
IEWLESHQDADIEDFKAK, VKQVLADIAK,
LDKSQIHDIVLVGGSTR,
SQIHDIVLVGGSTR,
SINPDEAVAYGAAVQAAILSGDK,
MKEIAEAYLGK,
FEELNADLFR,TVYVELLPK,
TQEKVNATGPQFVSGVIVK,
IIQKDIIK,
DAPQPYELNTAINCR,
NSCNVGGGGGGFKHPAFK,
KTLTETIYK,
SSAVVELDLEGTR,
STAGDTHLGGEDFDNR, TTPSYVAFTDTER,
LLQDFFNGKELNK,
GTLDPVEK,
DAGTIAGLNVLR,ITITNDK, GPAVGIDLGTTYSCVGVFQHGK, QTQTFTTYSDNQPGVLIQVYEGER,
NSTIPTK,
HWPFQVINDGDKPK,
MKEIAEAYLGYPVTNAVITVPAYFNDSQR, QTQIFTTYSDNQPGVLIQVYEGER,
LVNHFVEEFKR,
LLQDFFDGRDLNK,
CQEVISWLDANTLAEKDEFEHK,
LLQDFFNGR,
AAAIGIDLGTTYSCVGVFQHGK,
HSPA1B,
LVNHFVEEFK,
NQVALNPQNTVFDAK,
AQIHDLVLVGGSTR, LDKAQIHDLVLVGGSTR,
LSKEEIER,
ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, HSPA1A,
YKAEDEVQR,
NQVALNPQNTVFDAKR,
DAGVIAGLNVLR,
QVDFWFGDANLHKDR,
TPEDAQAVINAYTEINKK,
VEASSLPEVR,
GFAFVEFETK,
VNATGPQFVSGVIVK,
KPGIFPK,
SSSEDAESLAPR,
AIEFLNNPPEEAPR,
SRPTSEGSDIESTEPQK,
DRVEASSLPEVR,
AIEFLNNPPEEAPRKPGIFPK,
IINEPTAAAIAYGLDKK,
VQVEYKGETK,
ARFEELNADLFR,
TVTNAVVTVPAYFNDSQR,
MVNHFIAEFK,
VEIIANDQGNR,
VCNPIITK,
HSPA8,
VQVEYK,
ILDKCNEIINWLDK,
LLQDFFNGK,
DNNLLGR,
ATAGDTHLGGEDFDNR,
STLEPVEK,
VEILANDQGNR,
IINEPTAAAIAYGLDR, HSPA6,
ARFEELCSDLFR,
AQLGVQAFADALLIIPK,
TKVHAELADVLTEAVVDSILAIK, AKLEEAILGSIGAR,
TTSGDDACNLTSFRPATLTVTNFFK,
NSLLASYIHYVFR, TVEFLNEK,ELQHDVQK,IEHYASR,
TPM3P4,
MELQEIQLKEAK, EVPDADSLQR,
MILIQDGSQNTNVDKPLR, DKVQAGDVITIDK, IPDEFDNDPILVQQLR, AINQQTGAFVEISR, TQGFLALFSGDTGEIKSEVR, DVYVPNTTYR, QSRLEQEEQQR,LTPEEEEILNK,
TLGILGLGR,
VLSIGDGIAR,
IIDVVYNASNNELVR, EVAAFAQFGSDLDAATQQLLSR, LAATNALLNSLEFTK,
VLANPGNSQVAR,QADVNLVNAK,
LLIVSTTPYSEKDTK, IGGGIDVPVPR,RAPDQAAEIGSR,YAIQLITAASLVCR, IGGDAATTVNNSTPDFGFGGQKR, TVLEHYALEDDPLAAFK, CDY2B,GSTA1,
GSTA3,NFIA,CDY1,
NFIX, KRTAP2-4,CCDC72,RPS27,EXOG, KRTAP2-1,
AQFEGIVTDLIR,AQFEGIVTDLIRR,
QAVTNPNNTFYATK,
LQQELDDLLVDLDHQR,
KLEGDSTDLSDQIAELQAQIAELK, RYDDPEVQK,
VISGVLQLGNIVFKK, RPS27P17, EKPYFPIPEEYTFIQNVPLEDR,
GYFEYIEENKYSR,
NVVHQLSVTLEDLYNGATR,
NVVHQLSVTLEDLYNGATRK,
TIVITSHPGQIVK,NQSQGYNQWQQGQFWGQKPWSQHYHQGYY, AHQANQLYPFAISLIESVR,
ITFTGEADQAPGVEPGDIVLLLQEK, GIFEALRPLETLPVEGLIR, SLSALGNVISALAEGSTYVPYR,
LYDILGVPPGASENELKK, ILQDSLGGNCR,
FQGEDTVVIASKPYAFDR, YGEQGLR,
AFHNEAQVNPERK,
GALQNIIPASTGAAK,
VFEVSLADLQNDEVAFRK,
YLSLANNK,VLCELADLQDKEVGDGTTSVVIIAAELLK,
LVINGNPITIFQERDPSK,
ACQSIYPLHDVFVR,
ASQVECTGAR,
LVVPELSSR, DVIVRAADCK, KLFCPQK,KDGAEIDK, LTPESTR, IFVDIEK,CRVVSR, RDGADIHSDLFISIAQALLGGTAR, VLVDSLVEDDRTLQSLLTPQPPLLK, ILPVNLKAVK, LLEEQLQHEISNK,AFAALPSSRPVYDIQSPDFAEELR, SESLESPRGER, EATSVPR,KQNELK, VQDFLQR,KEVDENK, QDSSLGGRAR,KCLPIR, AQTELDPPR,LREATEAK, QITEEQIK,NTRVYPCVWCK, NREMVDVRPR, EILRER, ARQEGGYR,LKNELLK, TDMIQALGGVEGILEHTLFK, SSNNNGSVRTA, LDLRSCR, LLLSYGASR,RYEEILDAR, KLQDLIK, HLPSTEPDPHVVR, LVSRCGAVR,SFAVGTLAETIQGLGAASAQFVSR, AAQVFALLFVTEYLTK, VYFGGFFIR, SKFLDALISLLS,APFQASSPQDLR, LINFIRLK,IIPGFMCQGGDFTR, LASYVEK,SFKPDFGAESIYGGFLLGVR, KFDVNTSAVQVLIEHIGNLDR, QVVNIPSFIVR,LVSQIDNTK,KSNVKPNSGELDPLYVVEVLLR, NVVHQLSVTLEDLYNGVTKK, FSDHVALLSVFQAWDDAR, VGRPSNIGQAQPIIDQLAEEAR, YQLLQLVEPFGVISNHLILNK, QLREEQR,LFHFLGIFLAK,LDLAGPLLAFLFR, AELATEEFLPVTPILEGFVILR, AIAYLFPSGLFEK, SSSRSPR, LALFPLLPK, GIAYVEFVDVSSVPLAIGLTGQR, NFIAVSAANR,VVELSYWEWLPLLK, VLSEAAISASLEKFEIPVK, KLDSLTTSFGFPVGAATLVDEVGVDVAK, ICFELLELLK,IFVEFTSVFDCQK, ILGTPDYLAPELLLGR, LFAQLAGEDAEISAFELQTILR, TSPY2,NT5DC4,IQCJ-SCHIP1, TSPY3,GFGFVLFK, TYQQSCVSSCRR,VLNSIISSLDLLPYGLR, HDADGQATLLNLLLR, VADLVHILTHLQLLR, SDVWSFGILLTELVTK, FVNLGIEPPKGVLLFGPPGTGK, IDLRPVLGEGVPILASFLR, DFVSEQLTSLLVNGVQLPALGENKK, FGAQNESLLPSILVLLQR, LVAIVDVIDQNR,GYYSLETFTYLLALK, NIVEAAAVR, GSAITGPVAK, WVGGPEIELIAIATGGR, TEALSVIELLLK, SVGDGETVEFDVVEGEKGAEAANVTGPGGVPVQGSK, ICHQIEYYFGDFNLPR,
ACSL3, MYLK2,HADHA, U2AF2,RBM39, SF3B1,GIGYF2, FASTKD5,ZBED1, MASTL,KRT23, MRPS9,DNAJA4, DHX9, PUF60,RPS9, HECTD1, MATR3,POLR2B, PSMD2, SREK1, IQGAP2,HNRNPD, IRAK1,PSMD3,CAPN2, FYN,KRTAP11-1, PSMC2, CAND1, KIAA0368, RPS26, PPP6C, RPL14, COPG2, YBX1, RARS, RPL23, VPERLR,CCT5, LEELKR,SSB,CHSY1, FBXL19, GOLGA4,CLCC1,IPO9, TNN,MYBPC1, DNAJA3, OR6K4P,TDRD6, CCDC22, TTN, CDC42BPB, COQ6,PRPF8,OFD1, DHX29, XPOT,BARD1, PANK4,FCHO1, ULK1, IPO4,ARHGEF2,RAD50, PPP1R15A,FSIP2, NUP205,PYGL, SCRIB, GTF3C4, ZMAT1, CASKIN1,UBR4,ZMYM3,KTN1,NET1, SKI, CLTC,PPIA,PRMT3, SKIV2L2,CDC45, KRT39, COPB2,HNRNPUL1,
KIVDQIRPDR,
ZFP57,
IEELEEELEAERTAR, MKPEFNVR,EAVSILK,
MYH6,USP17L4, DDX17,
MEKENQK,
TBC1D4,
KNAVVMDQEPAGNR, QCALAALR,
PFKFB4, MAPK6,ACAA2,KIR3DL3,
SSWPNSEK, HALREIK,AVASWSR,DRLHLR,
EDNRB,NOA1, BIVM-ERCC5,RAB11A,
IREFCQR,SDLRHLR,LDGSRLIK,TILEMLLGFLK, HGSGSGHSSSYGQHGSGSGWSSSSGR, QLATVNEQPLQNGFEELIQWTK, LVLDAFALPLTNLFK, ESLNASIVDAINQAADCWGIR, EWAAQYR, VELSEQQLQLWPSDVDKLSPTDNLPR, TNTPADVFIVFTDNETFAGGVHPAIALR, FAGGDYTTTIEAFISASGR, TDYNASVSVPDSSGPER, NKEEAAEYAK, TQLRNEFIGLQLLDVLAR, LPHLPGLEDLGIQATPLELK, LVEGILHAPDAGWGNLVYVVNYPK, IVSRPEELREDDVGTGAGLLEIK, INALTAASEAACLIVSVDETIK,
IINDLLQSLR,GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, TIALNGVEDVR,AVLIAGQPGTGK,FVQCPDGELQK,SYTEDWAIVIR,SAIFSITYPSQDVFLVIK, AQGPQQQPGSEGPSYAK,
EEIKIK,MELQELQLKEAK,KLASQGDSISSQLGPIHPPPR, KPFSVSSTPTMSR,KNLEDGINNLK,
VQISPDSGGLPER,AIECLEK,
LLVKCLSLK, GLGLDDALEPR,VLIPIHEANR, GTSYQSPHGIPIDLLDR, AQGPQQQPGSEGPSYAKK, VLLDLSAFLK,
VATAQDDITGDGTTSNVLIIGELLK, STVAQLVK,
QNLSKEELIAELQDCEGLIVR, AAVENLPTFLVELSR, AVDSLVPIGR,
EHIKNPDWR,
GIDPFSLDALSK,
AGTGVDNVDLEAATR,
HVVFIAQR, LFNLSKEDDVR,FLESVEGNQNYPLLLLTLLEK, GPYESGSGHSSGLGHR, SGNYTVLQVVEALGSSLENPEPR, IEPLSPELVAAASAVADSLPFDK, APVPGTPDSLSSGSSR, VPNSNPPEYEFFWGLR, DKPELQFPFLQDEDTVATLLECK,
TLIQNCGASTIR,TARDBP, NEFIGLQLLDVLAR, TROVE2, INF2,CCT7, AFAFVTFADDQIAQSLCGEDLIIK, CCT3, TRDELEVIHLIEEHR, INALTAASEAACLIVSVDETIKNPR, STOML2, AHITLGCAADVEAVQTGLDLLEILR, RPS6,HRNR, GCIVDANLSVLNLVIVK, IIIPEIQK,CSE1L, ILEPGLNILIPVLDR, QSLGHGQHGSGSGQSPSPSR, CNP, EPRS,HNRNPK,NFDFEDVFVKIPQAIAQLSK, GSYGDLGGPIITTQVTIPK, LQAILEDIQVTLFTR, NDUFA9,SELLSQLQQHEEESR, TLTAVHDAILEDLVFPSEIVGKR, MRPS31, MMS19, EIDKNDHLYILLSTLEPTDAGILGTTK, RPS7,ACDSVTSNVLPLLLEQFHK, MAGED2,
RAB11B,IPTHLFTFIQFK, ILSISADIETIGEILKK, TSDLIVLGLPWK, YNVYPTYDFACPIVDSIEGVTHALR, SSVSGIVATVFGATGFLGR, SSQLLWEALESLVNR, TAVETAVLLLR,VHTVEDYQAIVDAEWNILYDKLEK, MYH7, PFKFB3,MYO5B,KIR2DL3, TBC1D1,MAPK4,DDX5,ERCC5, USP17L8,EDNRA,SNURF, PRKCE,
FAYKDEYEK,EATQELIEDLR,
CD209, AP1M1,RAB11FIP4,
AAVGELPEKSK, VFLSGMPELR,
TMEM120B,PFKM,SENP7,
EQWWLK,EELKLK,KNEIQK,
PAPD4,
RLLESLR, QEEQRR,
PNMA5,ROCK2, CDRT1,
AGARLDVR, ENLESRLK,IRSLLER,
CDKN2B,
APMTTVR,
KLHL9, TAOK1,
ITWEEK,EALASHLR,
ANGPT1,
GPSYSLR,
LOC338579,
LTIMLEK,
THEX1,
ELEKIK,
BAT2L,LOC100505868,MYH4,
AFLVRSR,KNMEQTVK, KSLFMLK,
LOC729562, LOC100509661,
QREVLR,
CGN,
EKSIFLVAHR,ELKENLDR,
NPHP3,
YEQAEHFR,
SHROOM3, LOC653566,
GSTA2,CDY1B, CDY2A, GSTA5, NFIB,NFIC, LOC729973,
ADGYVLEGKELEFYLR, LYDNPWR,
VIHDNFGIVEGLMTTVHAITATQK, NCIVLIDSTPYR,
VPTANVSVVDLTCR,
CDSDILPLR,PPP2R5A, KRTAP2-3, RPS27L,RPS27P9,KRTAP2-2, VQVALEELQDLKGVWSELSK,
VFEVSLADLQNDEVAFR, YPVNSVNILK,
TTDGYLLR,FATEAAITILR,
ILDDDTIITTLENLKR,
QAVTNPNNTFYATKR,
LLGQFTLIGIPPAPR,
QAQQERDELADEIANSSGK, TFHIFYYLLSGAGEHLK, VLENAEGAR,
SSGPTSLFAVTVAPPGAR, VSHLLGINVTDFTR,
AVVVCPK,DLPEHAVLK,
STNGDTFLGGEDFDQALLR,
VNFPENGFLSPDKLSLLEK, ISFLENNLEQLTK,
ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR, NTIQWLENELNR,ITFHGEGDQEPGLEPGDIIIVLDQK,
SLSALGNVISALAEGSTYVPYRDSK, IGLVEALCGFQFTFK,
VIEPGCVR,QISQAYEVLSDAK,
RUVBL2,EIVTNFLAGFEA, CCT6A,GALQYLVPILTQTLTK, ATP5A1, KPNB1, TPM3,VHAELADVLTEAVVDSILAIK, DOCK7, KHSRP,TIMM50, DMD, LVEPQISELNHR, SYNE2, AATEELLANK,TMF1, XIRP2,ELMQLEK, GEGLEYENIK,LTEGCSFR,TRISNLPTVK,ASANEMLIAGR, QADKVWR, KLEELK, CTRPICEPCR, RPS8, LLACIASRPGQCGR, DLPLLLFR,GAPDH, LISWYDNEFGYSNR, PHGDH,SLLVIPNTLAVNAAQDSTDLVAK, RPS3A,DYNC1H1, TCP1, LTLFGNSLK,LRRC15,LITEDVQGK,ENFIPTIVNFSAEEISDAIR, NFILDQTNVSAAAQR, MYH9,TTPSVVAFTADGER, QNKELK, HNRNPU,HSPA9, KIF5B,DNAJA2, TNLSVHEDKNR,NVLCSACSGQGGK, DNAJA1, LIIEFK,
TUFM,RCN2, MRPS34,AVQYLSSQDEK, FASN, ARHGEF1, PTCD3,EEF2,SFN,
U2AF1,ATAD3C,
WFNGQPIHAELSPVTDFR, AVIDLNNR,
RPS2,
GTGIVSAPVPK, TYSYLTPDLWK,
GLLLFVDEADAFLR,
SPNQNVQQAAAGALR,
YVEPIEDVPCGNIVGLVGVDQFLVK, AIGIEPSLATYHHIIR, DLEKPFLLPVEAVYSVPGR, GTVVTGTLER,ALLELQLEPEELYQTFQR, VSNSPSQAIEVVELASAFSLPICEGLTQR, HEEEAFTAFTPAPEDSLASVPYPPLLR, LPLFGLGR,SDYDREALLGVQEDVDEYVK, LSEEEILENPDLFLTSEATDYGR, LAEQAER, LLLKSHSR,SAYQEAMDISKK,ALGLGVEQLPVVFEDVVLHQATILPK, HLKAEVDAEKPGATDR,
LGLALNFSVFHYEIANSPEEAISLAK, DLVEAVAHILGIR,QRCPPGVVPACHNSK, PKP1, EMPPTNPIR, QESGYLIEEIGDVLLAR, EALLGVQEDVDEYVK, STAISLFYELSENDLNFIK, SAYESQPIR, VAVLQALASTVNR, LLQLLGRLPLFGLGR, AWGILTFK,QLHDDYFYHDEL, LLDAVDTYIPVPAR, TFCQLILDPIFK,HYAHTDCPGHADYVK, RQESGYLIEEIGDVLLAR,
RAB11FIP3,ANGPT4,PRRC2B, NPHP3-ACAD11,TMEM120A, SHROOM2,CLEC4M, AP1M2, SPCS2,
LOC100509022,ATAD3A,
TLLEGSGLESIISIIHSSLAEPR, RPS2P40, LOC652798,
ISALNIVGDLLR, IASLEVENQSLR,
SNRPB, GFPT2,EEF1DP1,
VLGLVLLR,
NDEL1,
VIQQLEGAFALVFK,
05-Sep, STRCP1,
RMQEMLQR, ASQPSAHISPR,MREENMER, YELLEEK,
BEX2,
IQDALSTVLQYAEDVLSGK,
EIF3FP3, TRPC7,
TAOK2,KLHL13,CDKN2A,USHBP1,
ARF3,
VELSDVQNPAISITENVLHFK, SALSGHLETVILGLLK,
ANXA2P2, LOC100291405,
MLAEDELRDAVLLVFANK, NQSFCPTVNLDKLWTLVSEQTR,
LOC643751, EIF4A1,RPL27AP6,
NVFDEAILAALEPPEPK, IPDWFLNR,
RPS18,
VLITTDLLAR,
ERI1,ANKRD30B,FBXW10, ZFHX2,MYH2, MOAP1,HERPUD1, OVOL3, SOS1, MTMR6, PFKP,NOS2,
TIGD6,RPP30,SVSTTGLQTAGR, YTISSALNLMQICKGK, HNF1B,
BEX1, STRC,SEPT5-GP1BB, EIF3F, TRPC6,
UGGT2, RNF167,LEGKMMNGLR, VQIDYK,PLXNA1,SSLKGLAR,GDSGGPLVTR,LTALFCCNASGTEK, TMPRSS11A, MAN2B2,
NR3C1, KCLQAGMNLEAR, KCNV1, SHPRH,KENPGGTGLEK, KELSETLDFK,BTNL8, TSGGVNPEPCPICAR, TVGATALPR,
HNRNPA1,
SESPKEPEQLR, NQGGYGGSSSSSSYGSGR, GFVENSYLAGLTPTEFFFHAMGGR, SIAANMTFAEIVTPFNIDRLQELVR,
CCT8,POLR2A,
LFVTNDAATILR, AIADTGANVVVTGGK,
SSGPYGGGGQYFAKPR, LSGEAFDWLLGEIESK, TTSNDIVEIFTVLGIEAVR, LFIGGLSFETTDESLR, NLRDIDEVSSLLR,
VITLSLAGR, RCCHSACGR,CVDPWLTQTR, MLNKQQSEDDVR, CPNE2,EAEKQR, DENND2D,WFDC5,CELF4,NUMA1, TVLRGSGNLLR, RPL7P34, AGNCYILAEPK,DYNC2H1,MPLQDLVTAEKPIPLFVEK, ARHGAP5, NMGYVQQR, DKC1,NPPAGYFQQK, BUB3, DDX12P,ILVESQNR, CCNL1,FSGVCPDSR, SERPINA10,NDSLRTNVFVR,LIELVDR,MBLAC2, PHLPP1, LRP2,IMLPGVLR, FEEIDLSGNK,QLNLTIR,VPS13D,
C9orf142,SYAPELGR, CPGESLINPGFKSK, PDGFRB, ISLIILIMLWQK, LPIN3, RSFHEVR,CMNCGKVFSR, RIT1, CDR2L, RGMSILR,ZNF83,AESHMQWAWGRLPK, APAF1, LOC442132,SGLLMKLINLCTR, GTF3C5,LLKFESCGK, EIEAAIERK, REQQER, LRDQEK,RELB, TCHH,IESLGLR, ZFP37,STCGLLETQIKK, WDSUB1,CCDC39,
EEF1D,ARF1, SNRPN,ANXA2, NDE1,CDC42, RPS18P5, RPL27A, EIF4A2, PTPLAD1, GFPT1,
KLEEATAAR,DUOX2, C14orf133,KEIEDIR, CSPP1,TMEM57, VCEKDPALK,CNTFR,KVEEELR, RYLTQGITQGK, ELLRER,SDLGVHLR, ILK,LSTGLMITSVAVELCKNVK, ST8SIA6, MLLTRK,PLSCR1,SLALETVQNDLR, QGKNVLAK,BCAR1,RVETAMEACSLPSSR, PLCB1, CCHCR1,ARLPGEPLVLLPGR, CDK11B, NEQESAVHPR,STK32C,VSMAPPR, DBNL,VSGQEGGLR,MSH5, ANKRD20A8P, TLALLDEEPK, SWTAADTAAQISKR, RLEEEVR, RFC1, C11orf51,MCC, HLA-G,RLQQTER, AFGGPVPTSALRTVNLGTPNQDA, HCRP1, SLRQSTIAK, CTCF,PHF3,QLCPESIK,
ITENAANKK,AKNAD1, EVQPAPELEIK, MRPL38,CCNO,NLQVSRVLR, QSPGSLKAVLK, ZSCAN29, LLTQSAPK, C17orf66, LTIKSEMR, HADH, IRDSEAQLQVLLR, ZNF148, LNTCAMKGGLLR,ARRDC5, NTFTPGEK,EDVKLHR, BCOR, EGGHPATK, FAM47B, KLEDAWAR, C9orf96,UCN, QSDQDITK,CHRDL2,CCT6P4,EYFGEK, ALLLLLAER, ESIAALHR, ICT1, CHTMAFVTR, SPVSAPKLPK,KLF10,LRDQTR,KIAA1217, JAKMIP3, LPCHPNDGTR,LOC100131193,
TGLYSKR,
LGAPAAGGEEEWGQQQR,
AENLQLLTENELHR,
QELIKEYLELEK,
VRELELELDR,RLGGDDAR,
HEXIM1,
HWRPYLELSWAEK,
IRAEMFAK,
LTWEEKK,
VATWVDETLSSVASKLEHTAQTYSELQGER,
LTWEEK,LTYQDAVNLQNYVEEK,
SGTISTSAAAAAAALEASNASSYLTSASSLAR,
AFPQLGGRPGPEGEGSLESQPPPLQTQACPESSCLR,
AASTAPSSTSTPAASSAGLIYIDPSNLR,
UBR5,
NDEELNKLLGK,
HIST2H2AC,
HIST1H2AJ,HIST1H2AH,HIST1H2AD,
HIST2H2AA3,HIST1H2AG,
NDEELNK,
RHWRPYLELSWAEK, TQSPGGCSAEAVLAR,
HIST1H2AI,
GQPVAPYNTTQFLMDDHDQEEPDLKTGLYSK,
LQQLQACTGQQSCR,
HLQLAIR,H2AFJ,
VTIAQGGVLPNIQAVLLPK, LSQAEEETRR,
ACTG1,ACTB,
VAPEEHPVLLTEAPLNPK,
GYSFTTTAER,
AGFAGDDAPR,
DSYVGDEAQSK,
IWHHTFYNELR,
VLQDWNALK,CTADILLLDTLLGTLVK,
WSESEPYR,LSAVEAIANAISVVSSNGPGNR,
SFYTAIAQAFLSNEK,
LOC728379,
USP17,USP17L3,
LOC100288520,
LOC100287144,USP17L2,
LOC100287327,
RAB35,
RAB3B,
RAB30,
RAB39B,RAB3C,
RAB8B,
LOC100287404,LOC100287441,
LOC100287478,LOC100287205,
LOC728369,
LOC100287364,LOC100287513,
USP17L1P,SYELPDGQVITIGNER,
LOC100287178,
USP17L5,
FLQEQNK,
IIAPPER,LOC100287238,
DSYVGDEAQSKR,
VPDCFQR,
ELELELDR,
ELELELDRLR,WSSGVGGSGGGSSGR,
LAGNTLGSR,
HWKPYYK,
HIST1H2AA, HIST3H2A,
HIST1H2AB,QELVRDYLELEK,
HIST1H2AE,
HLQLAIRNDEELNK, AGLQFPVGR,
LSQAEEETR,
TSGAPGSPQTPPER,
HIST1H2AC,
VAPDEHPILLTEAPLNPK,
ACTBL2,NDEELNKLLGR,H2AFX,
NDEELNKLLGGVTIAQGGVLPNIQAVLLPK,
DYLELEKR,
HVAYVFQALIYWIK,
NVAIFTAGQESPIILR,
TSDSPWFLSGSETLGR, NNPLYHAGAVAFSISAGIPK,
VDGAYVAVK,ILGLCLLQNELCPITLNR,
GEHLLLFLVQTVAR,
LRQENQMWNR,
HEXIM2,DFSETYER,
YHTESLQNMSK,
EGEKGQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK, QVEELAAEVQR,
EYLELEK,
FHTESLQGR,
KDFSETYER,
GQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK,
MEDENNRLR,GQPVAPYNTTQFLMDDHDQEEPDLK, DYLELEK,LRAENLQLLTENELHR,
RLQQLQACTGQQSCR, YFPTQALNFAFK,
GNLANVIR,
YFAGNLASGGAAGATSLCFVYPLDFAR,
SLC25A4,
SLC25A6,
DFLAGGIAAAISK,
TAVAPIER,
SLC25A5,DFLAGAVAAAVSK,
AAYFGIYDTAK,SFNRGEC,
VYACEVTHQGLSSPVTK,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK,
VDNALQSGNSQESVTEQDSK,
DSTYSLSSTLTLSK,
LLIYWASTR,
IGKC,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK,
LYACEVTHQGLSSPVTK,
AASQSTQVPTITEGVAAALLLLK,
EQGVLSFWR,
LLIYGASSR,IGK@,
PSSLSASVGDRVTITCR,
SSGSAALLALTWTCLLVR,
RAB4A,MIA-RAB4B,
RAB3A,
RAB8A,
RAB1B,
RAB37,
RAB1C,
LQIWDTAGQER,
RAB3D,
RAB10,
SWEQKLEEMR, GCN1L1,
QIGSVIRNPEILAIAPVLLDALTDPSR,
CDC37,
HFGMLR,
KDSSSVVEWTQAPK, KPLVIIAEDVDGEALSTLVLNR,
INQQEEPGFEVITR,
VGLQVVAVK,
IDVEDILCKAEAISLQMVK,
TLLVNCQNK,
LVQDVANNTNEEAGDGTTTATVLAR, WSFLFSLTDLK,
LLIESLEK,
TBC1D15,
LQAEAQQLR,
CALMEQVAHQTIVMQFILELAK,
TEEDSEEVREQK,
LGPGGLDPVEVYESLPEELQK,
LQAEAQQLRK,
MEQFQK, TFEQLHSTIGHQALEDILPFLLK,
TGDEKDVSV,
NPEILAIAPVLLDALTDPSR,
TFVEKYEK,
LVLPSLLAALEEESWR,
VGKGEPGAAPLSAPAFSLVFPFLK, LSVADSQAEAK,
VLPLEALVTDAGEVTEAGK,
AALLETLSLLLAK,
RAB12,
RAB43,
GANPVEIRR,
HSPD1,
ALMLQGVDLLADAVAVTMGPK,
VTDALNATR,
RAB1A,
CEFQDAYVLLSEKK,
VGGTSDVEVNEKK, EFSFLDILR,TALLDAAGVASLLTTAEVVVTEIPKEEK,
VTNYIFDSLR,AAVEEGIVLGGGCALLR, GYISPYFINTSK, NDSPTQIPVSSDVCR,
TVAAPSVFIFPPSDEQLK, DFLAGGVAAAISK,
SGTASVVCLLNNFYPR,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSK,
HKVYACEVTHQGLSSPVTK, DSTYSLSSTLTISK,
GAWSNVLR,GLGDCLVK,
LFNDSSPVVLEESWDALNAITK,
FSSVQLLGDLLFHISGVTGK,
ILPEIIPILEEGLR, VLAFLSSVAGDALTR,
VLPQLISTITASVQNPALR,
ENVNSLLPVFEEFLK, ASEAKEGEEAGPGDPLLEAVPK, RLGPGGLDPVEVYESLPEELQK,
LKELEVAEGGK,
EGEEAGPGDPLLEAVPK,
RWDDSQK,
SMPWNVDTLSK,
ELEVAEGGKAELER,
SLSQSFENLLDEPAYGLIQK,
HYGFNEILK,SKWSFLFSLTDLK,
FLLGYFPWDSTKEER,
VLEKDAEVIVDWRPLDDALDSSSILYAR,
GANPVEIR,
KISSIQSIVPALEIANAHR,
IQEIIEQLDVTTSEYEKEK,
GVMLAVDAVIAELKK,
LKVGLQVVAVK,LSDGVAVLK,
ISSIQSIVPALEIANAHR, RAB4B,
HSQFIGYPITLFVEK, SIYYITGESK,
HSP90AA1,
RLSELLR,
HLEINPDHPIVETLR, FICIGALYSELLAVSSK,
HSP90AB1,
HSQFLGYPITLYLEKER,
HSP90AB2P,
ALLFVPR, LGIHEDSTNR, NTPVQSPVSLGEDLQWWPDKDGTK,
AKFENLCK,
DRGSGLLGSQPQPVIPASVIPEELISQAQVVLQGK, HGLEVIYMIEPIDEYCVQQLK,
RVFIMDSCDELIPEYLNFIR,
HSQFLGYPITLYLEK,
RVFIMDNCEELIPEYLNFIR, ELISNSSDALDK,ELISNASDALDK,
TLTIVDTGIGMTK,
HIYYITGETKDQVANSAFVER, TLTLVDTGIGMTK,
LISQIVSSITASLR, VVVITKHNDDEQYAWESSAGGSFTVR,
LVSSPCCIVTSTYGWTANMER,
TKPIWTRNPDDITNEEYGEFYK,
FENLCK, RAPFDLFENR, IVLLSANSIR,EKYIDQEELNK,HSQFIGYPITLYLEKER,
LGIHEDSTNRR, FYEQFSK,YIDQEELNK,
NPDDITQEEYGEFYK,
IRYESLTDPSK,EQVANSAFVER,GVVDSEDLPLNISR,
TKPIWTRNPDDITQEEYGEFYK,
SLTNDWEDHLAVK,
KHSQFIGYPITLYLEK, HLEINPDHSIIETLR,
HSQFIGYPITLYLEK, EDQTEYLEER,KHLEINPDHSIIETLR,
APFDLFENR,
HFSVEGQLEFR,
VMQMLLNGLYYIHR,
TKASPYNR,
FTLSEIKR,DPYALDLIDK,
FTLSEIK,
HENVVNLIEICR,
GSQITQQSTNQSRNPATTNQTEFER, IGQGTFGEVFK,
IGQGTFGEVFKAR,
LPAPPIGAAASSGGGGGGGSGGGGGGASAAPAPPGLSGTTSPR,
QYDSVECPFCDEVSKYEK,
EGFPITALR,
LLVLDPAQR,
VLMENEKEGFPITALR,
AANVLITR,CDK9,
GSQITQQSTNQSR,
ILQLLKHENVVNLIEICR,
IDSDDALNHDFFWSDPMPSDLK,
AYHEQLTVAEITNACFEPANQMVK, RSIQFVDWCPTGFK,
AVCMLSNTTAIAEAWAR,
TIGGGDDSFNTFFSETGAGK, TUBA1A,
RNLDIERPTYTNLNR,
FDGALNVDLTEFQTNLVPYPR, AYHEQLSVAEITNACFEPANQMVK,
SRSPYSSR,
AVFVDLEPTVIDEVR,
QLFHPEQLITGKEDAANNYAR, SIQFVDWCPTGFK,
TUBA1C,TUBA1B,
QLFHPEQLITGK,GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,
LSVDYGK,
NLDIERPTYTNLNR,
TUBA4B,
IHFPLATYAPVISAEK,
QIFHPEQLITGK,
YMACCLLYR,VGINYQPPTVVPGGDLAK, LIGQIVSSITASLR,
Figure 3.3: Peptide-protein graph constructed from an AP-MS experiment with CDK9 as a bait. Proteinsare drawn as red triangular vertices while peptides are drawn as blue round vertices. Note that there arevarious peptides shared by several different proteins.
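The structure described in the caption can be illustrated with a minimal sketch, assuming the graph is stored as a mapping from each protein to the set of peptides matched to it. The function names and the toy identifiers (`PROT_A`, `PEPTIDEONE`, and so on) are hypothetical and are not taken from the CDK9 data set; the sketch only shows how peptides shared by several proteins can be read off the bipartite graph.

```python
from collections import defaultdict


def build_peptide_protein_graph(protein_to_peptides):
    """Invert a protein -> peptides mapping into peptide -> proteins.

    Together the two mappings describe the bipartite peptide-protein
    graph: one vertex class for proteins, one for peptides, and an edge
    whenever a peptide is matched to a protein.
    """
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)
    return dict(peptide_to_proteins)


def shared_peptides(peptide_to_proteins):
    """Return the peptides adjacent to more than one protein vertex."""
    return {
        peptide: proteins
        for peptide, proteins in peptide_to_proteins.items()
        if len(proteins) > 1
    }


# Hypothetical toy input (not real AP-MS identifications).
toy = {
    "PROT_A": {"PEPTIDEONE", "PEPTIDETWO"},
    "PROT_B": {"PEPTIDETWO", "PEPTIDETHREE"},
    "PROT_C": {"PEPTIDEFOUR"},
}

graph = build_peptide_protein_graph(toy)
shared = shared_peptides(graph)
print(sorted(shared))
```

In this toy input only `PEPTIDETWO` is adjacent to two protein vertices, which is the kind of ambiguity the caption points out.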
EIGDIAGVADVTIR,
GTF2B,
VLNSIISSLDLLPYGLR,
SSTSNANDIIPECADKYYDALVK,
EIEDIIEEVTVGYIR,
SLIINTNPVEVYK,
FQATSSGPILREEFEAR,
IFYPETTDVYDRK,
IQGAP2,
NPNAVLTLVDDNLAPEYQK,
VLWLDEIQQAVDEANVDEDR,
VNAELQAR,GVLLDIDDLQTNQFK,
VLWLDEIQQAVDEANVDEDRAK,
LNDGSQITYEK,IYDVEQTR,
KYDYYYNTDSK,
CLIATGGTPR,SNIWVAGDAACFYDIK,
SITIIGGGFLGSELACALGR,
AIFM1,
DKVVVGIVLWNIFNR,
AALSASEGEEVPQDKAPSHVPFLLIGGGTAAFAAAR,
ALGTEVIQLFPEK,
ISGLGLTPEQK,
AIVFRPFKGEVVDAVVTQVNK,
HSIPSEMEFDPNSNPPCYK,
YFGPNLLNTVK,
VDKNDIFAIGSLMDDYLGLVS,
TMDEDIVIQQDDEIR, GFVLYPVK,
GEVVDAVVTQVNK,
POLR2G,
YGFVIAVTTIDNIGAGVIQPGR,
LFTEVEGTCTGK,
VGLFTEIGPMSCFISR, NLLIFENLIDLK,
SLGPPQGEEDSVPR,
DVLIQGLIDENPGLQLIIR,
LQSVQALTEIQEFISFISK,
GQAVTLLPFFTSLTGGSLEELRR,
AQNVTLEAILQNATSDNPVVQLSAVQAAR,
KPNA4,DAQVVQVVLDGLSNILK,
PRKDC,LOC731751,
KPNA3,
GALQYLVPILTQTLTK, FLEFQYLTGGLVDPEVHGR,
TDYNASVSVPDSSGPER,
HNRNPK,
HNRPDL,
PPP1CA,
AFHNEAQVNPERK,
PALM2-AKAP2,
AAVENLPTFLVELSR,
QLDAGEEATTTK,
SQVLDYLSYAVFQLGDLHR,
INTS5,
NNIFYILLR,
MEPLATVESLEQYLLK,
HECTD1,
ILQQIEEPLALASGALPDWCEQLTSK,
GQVKPSTSSQPILSAPGPTK,
RAD21L1,
NQGGYGGSSSSSSYGSGR,
YLVEVEELAEEVLADKR, TYEEGLKHEANNPQLK, GQGSVSASVTEGQQNEQ,
STIP1,
MSLLQLVEILQSK,
VKEEIIEAFVQELR,
URI1,
CBWD2,
INISEGNCPER,NDEELNKLLGGVTIAQGGVLPNIQAVLLPK, HIST2H2AB,
SQQVIVQGVHELYDLEETPVSWKDDTER,
SISCEEATCSDTSESILEEEPQENQKK,
LOC100291405,IPDWFLNR,
RPS18,
CDKL4,
LASYLDR,
BAT2L,
SESPKEPEQLR,
TDRD3, SF3B1,
ETEGDVTSVK,
TQTSDPAMLPTMIGLLAEAGVR,
AIFTGHSAVVEDVAWHLLHESLFGSVADDQK,
ELEKIK,
SLESSQTDLK,
POLR2D,
SNPEATNQPVTEQEILNIFQGVIGGDNIR,
TALGVAELTVTDLFR, ILALVPPWTR,
BCKDHA,
KFDVNTSAVQVLIEHIGNLDR,
ENAPAIIFIDEIDAIATKR,
RPS2,
TLNYTAR,
DSQVVQVVLDGLK, ILLGQNR,
ILDELDKDDSTHESLSQESESEEDGIHVDSQK,
DFLAGAVAAAVSK,
FAM75A1,
FAM75A6,
RGPD4,
NCDYQQEADNSCIYVNK,
QLDDKDEEINQQSQLVEK,
EISFAYEVLSNPEKR,
IGLVEALCGFQFTFK,
RQESGYLIEEIGDVLLAR, VIM,
SUPT5H,
SSVGETVYGGSDELSDDITQQQLLPGVKDPNLWTVK,
PKP2,
DIKPSNLLVGEDGHIK, TILEMLLGFLK,EVATAIR,
SPTAN1,MMS19,
SFTTTIHK,
MAD2L1,EFCAB7,DDX1,
LPLQDVYK,
WNPTAGVAFEYDPDNALR,
KLASQGDSISSQLGPIHPPPR,
HSVGVVIGR,
ELWFSDDPNVTK,
SEYSELDEDESQAPYDPNGKPER,
IPDEFDNDPILVQQLR,
MFYHISLEHEILLHPR,
TMDEDIVIQQDDEIRLK,
KIAA1598,
LLLQGEADQSLLTFIDK,
KELELK,
QVPLSAAEAAEVILR,
LVPGGGATEIELAK,
FEDEELQQILDDIQTKR,
AIQDEIR,
LATNAAVTVLR,
SQALEFPDQPDIAEDLKDLITR,
LLGRCSGTR,
AVDNILR,
SATEQSGTGIR,
VEAVLSR,LPNGEGTR, ISQKAER,
EMILIN1,COL16A1,
IDNSVLVLIVGLSTVGAGAYAYK,
LLPLHHMPTQLLSIEESLALQK,
SQDSEEHDSTFPLIDSSSQNQIR,
SVSSSVQVCPEVGKR,
SFCSNFCYQASK,
KAELEAAVR,
YVLGEETTK,
RPAP2,
SEITLVGISKK,
ESQNSLDESLPFR,
TNKVYDITER,
VHAELADVLTEAVVDSILAIK,
ALQFLEEVK,
ALHIVEQLLEENITEEFLMECGR, GIDPFSLDALSK,
GLQDVLR,
VATAQDDITGDGTTSNVLIIGELLK,
AQLGVQAFADALLIIPK, LTFLYLANDVIQNSKR,
TKVHAELADVLTEAVVDSILAIK, CCT6A,
HKSETDTSLIR,
SVTLLIK,
SVYGGEFIQQLK,
MLADFLR,
ALQDLENAASGDATVR,
RPRD1A,
DFAPVIVEAFK,SVYENDVLEQLK,
EFESVLVDAFSHVAR,
KLTFLYLANDVIQNSK,
RPRD1B,
IASLPVEVQEVSLLDK,
LTFLYLANDVIQNSK, QALYGDKKPR,
TYEQIKVDENENCSSLGSPSEPPQTLDLVR,
IQSLPDLSR,
KLSELSNSQQSVQTLSLWLIHHR,
HSQFIGYPITLYLEK,
FYEAFSK,
ALLFVPR,
HSP90AB1,
HNDDEQYAWESSAGGSFTVR,
KHLEINPDHPIVETLR, EQVANSAFVER,NPDDITNEEYGEFYK,
HSP90AA1,
ALLFIPR,
HFSVEGQLEFR,
ADLINNLGTIAK,
SLTNDWEDHLAVK,
YESLTDPSK,
HIYYITGETK,
HLEINPDHPIVETLR,
HSQFIGYPITLYLEKER,
HSQFLGYPITLYLEK, YHTSQSGDEMTSLSEYVSR,
QNTGVWLVK,
YLLQEQLK,
WAIIGWLLTTCTSNVAASNAK,
GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,
DLVDITKQPVVYLK, TLTAEEAEEEWERR, AIECYTR, LVELYR,GMGGAFVLVLYDEIK, GAWSNVLR,
RGPD5,FAM75A2,
TAVAPIER,QDVCQSYSEK,
YDEAIDCYTK,
VLKIEEVSDTSSLQPQASLK, VLLDLSAFLK,AQGPQQQPGSEGPSYAK, NLDPDVFNQIVK,GHWDDVFLDSTQR,
KEQESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
LSVNMVPFPR,
TIGGGDDSFNTFFSETGAGK,
GHYTIGKEIIDLVLDR, NLDIERPTYTNLNR,
EHYHATALGAK,
YMACCLLYR,
AVCMLSNTTAVAEAWAR,
TLGAPASSER,
EIEEAPDIRK,
NRDNDPNDYVEQDDILIVK, GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, SVSLTGAPESVQK, TTPNSGDVQVTEDAVR,
TGLSSEQTVNVLAQILK,
FIIENTDLAVANSIR, TGLSSEQTVNVLAQILKR,
EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
TUBB8,
GHYTEGAELVDSVLDVVR,
EEESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
FPGQLNADLR,
TUBB6,
SCDTPPPCPGCPAPELLGGPSVFLFPPKPK,
SCDTPPPCPICPAPELLGGPSVFLFPPKPK,
TTPPVLDSDGSFFLYSK,
VVSVLTVVHQDWLNGKEYK,
RISEQFTAMFR,
TUBB3,
YLTVAAVFR,
STSGGTAALGCLVK,
YDDPEVQKDIK,
NTTIPTKK,
MKETAENYLGHTAK,
QAVTNPNNTFYATK,
QAASSLQQASLK,ERVEAVNMAEGIIHDTETK,
QAVTNPNNTFYATKR,
VNVTHKEEIILTPIEVAIEDMQK,
EQQIVIQSSGGLSKDDIENMVK,
TTSGDDACNLTSFRPATLTVTNFFK,
AQFEGIVTDLIRR,
SLSNSNPDISGTPTSPDDEVR, STNGDTFLGGEDFDQALLR,
DAGQISGLNVLR,NAVITVPAYFNDSQR,
TTPSVVAFTADGER,
LYSPSQIGAFVLMK,
GPSVFPLAPSSK,
HSPA9,
SAVRPASLNLNR,
SQVFSTAADGQTQVEIK, KDSETGENIR,
AQFEGIVTDLIR, TPNEEIDRQNDDQR,
GSWACSIFDLK,
IDISPAPENPHYCLTPELLQVK,
VQQTVQDLFGR,
YDDPEVQK,
DKVQAGDVITIDK,
LLIVSTTPYSEK,
SAIFSITYPSQDVFLVIK,
LQEAFSK,
KTQELAFATHQDPADPK,
VQAGDVITIDKATGK, TQELAFATHQDPADPK,
GYQTSPDLR,
FMDDIAALVSTIASDIVSR, LKEALQPLINR,
DOCK7,
AKLEEAILGSIGAR,
YSDPQIK,
KQISGQYSGSPQLLK, NLLYIYPQSLNFANR,
DSTEVEISTGER,
NSLPDALLPNLLDR,
GLLRPHVPPAAITTLAR,
SNSWVNTGGPK,SYTEDWAIVIR,
NSLLASYIHYVFR,
NAEKYAEEDR,
RYDDPEVQK,
AHGELHEQFK,
TILTYAEEDLELR,LLGQFTLIGIPPAPR,
GIDPFSLDSLAK,
CCT6B,
NAIDDGCVVPGAGAVEVAMAEALIK,
VLAQNSGFDLQETLVK,
YVLGEETTKSQDSEEHDSTFPLIDSSSQNQIR,
EEQSGHSGEEVQLCSK,
KSFCSNFCYQASK,
SSQTSQNQGLGRPTLEGDEETSEVEYTVNKGPASSNR,
KIEFER,VYDITER,
SSQTSQNQGLGRPTLEGDEETSEVEYTVNK,
IFDSFAK,
AAIAECEEVRR,
AELEAAVR,
AAIAECEEVR,
IREFYR,
LDSQEKDATCELPLQK,
AAIAECEEVGR,
LKASENSESEYSR,
TNVLPFR,
QNYEEMQAK,
FVQCPDGELQK,
AVLIAGQPGTGK,SGSPISSEER,
GTSYQSPHGIPIDLLDR, NLTGLSSGTEK,
EYQDAFLFNELKGETMDTS, SELFNPVSLDCK, GFEPQAPEDLAQR, AIAEVDVGTDKAQNSDPILDTSSLVPGCSSVDNIK,
YAIQLITAASLVCR,
GLGLDDALEPR,
TQGFLALFSGDTGEIKSEVR,
SYNPEGESSGR,
ITIADQGEQQSEENASTK,
GTEVQVDDIKR,KSELFNPVSLDCK,
HRVSSQAEDTSSSFDNLFIDR,
YAIQLITAASLVCRK, FVQCPDGELQKR,TTEMETIYDLGTK,
DYDAMGSQTK,
IKEETEIIEGEVVEIQIDRPATGTGSK,
MIESLTK,
DKQHLDDITAAR,SGSPISSEERR,
VSSQAEDTSSSFDNLFIDR,
POLR2M,
DRVPPSSEASEHHPR,
RUVBL2,
IRCEEEDVEMSEDAYTVLTR,
ARDYDAMGSQTK,
KEVVHTVSLHEIDVINSR, IRGTSYQSPHGIPIDLLDR, TQGFLALFSGDTGEIK, QASQGMVGQLAAR,
TEALTQAFR,
VYSLFLDESR,
LLIVSTTPYSEKDTK, ETLIEWK,
FLDTLLEELHLK,
VNTQSSSNSTLPER,
ASENSESEYSR,
TSDIDNPSHFEK,
LCGYPLCQK,
FITPAHYSDVVDER,
VLPGLLVPLQITLGDIYTQLK,
CBWD5, VEFTEDLQK, TPADIYR,
CBWD3,
SIYYITGESK,
HSQFLGYPITLYLEKER,
HSP90AB2P,NPDDITQEEYGEFYK, IDIIPNPQER,YIDQEELNK,
ELISNASDALDK,
FENLCK,
RGPD1,FAM75A7,
EEERHPDFQLLK,
FFEAQIPK,
LOC100509836,
LOC100509749,
LPNVTGSHMHLPFAGDIYSED,
LAAEIDDR,
ALQDLENAASGDAAVHQR,
SPATA5L1,ASTPAILFLDEIDSILGAR, MSH6, IIDFLSALEGFK, PSMA2,YNEDLELEDAIHTAILTLK, NDUFA10, NUBPL,XPO1,LDINLLDNVVNCLYHGEGAQQR, LLQYSDALEHLLTTGQGVVLER, DLIDDATNLVQLYHVLHPDGQSAQGAK, LAQTLGLEVLGDIPLHLNIR, YBX1, EWSR1, MRPS22, C20orf4,LDHAL6B,LIIVSNPVDILTYVAWK, NYQQNYQNSESGEKNEGSESAPEGQAQQR, GDATVSYEDPPTAK, WISLTNFISEATVEK, AKAP8L, TVEDLDGLIQQIYR, ISYNA1, SVLVDFLIGSGLK, MAP1B, TPDTSTYCYETAEK, HEATR6, AVAPSIIFFDELDALAVER, SLFN11, NUBP2,STESLQANVQR,RPL13,ELIQSSDLQAFFETK, MKEIAEAYLGYPVTNAVITVPAYFNDSQR, KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, FLDPDMYSLLEDSTSDLR, FQGSGAATETAESR, SETD1A,TVAQSQQLETNSQR, RFC3,TLEEGHDFIQEFPGSPAFAALTSIAQK, GPN3,SPATA5,TVPFLPLLGGCIDDTILSR, MRPL10,ACALQVLSAILEGSK,
SSGVPVDGFYTEEVR, NTPCR,
GFGFILFK,HNRNPAB,LANFGGLAVGLGFGALAEVAKK,
MKSLESQLEK,LOC643371,CISEQFTAMFRCK,
IHSDCAANQQVTYR, DSG1,
ADCK3,PMF1,
AVPPTYADLGKSAR, ASGR1,
LYTENR,
ASITPGTILIILTGR,
LONRF3, LRPMGFK, VDAC1,
MED1, VPLILNLIR,
TRMT11, SPFWILSIPSEDIAR, RPL6,
CEP170,VAVQGDVVR,NVIAVLEEFMK,C17orf75,VHSYNASETSQLLSVSEGELILHK, KPLTTSGFHHSEEGTSSSGSKR, EPRS,FASTKD5,
RBM39, DLEEFFSTVGK, KIAA0355,SQSAAVTPSSTTSSTR, ADRM1, ASTFLTDLFSTVFR, RPRD2, VEITPESILSALSK,
MMAA,
QTARGVK,
IPQERK,
POLR1A, ELL2,EDSTCISLPK,
PDE4D,SYSSGGEDGYVR,GISLNPEQWSQLKEQISDIDDAVR, NCL,LDSLTYK,
AFAP1, LQTKEEELLK, CNNM1, VEVEVGK, HAX1, ITKPDGIVEER, MUM1, LSQMQGAVR,
LPDFLQR, PRSS12,
ENO4,
MAST4,
MFHAS1,LDLDKR,AFAP1L1,TPGRPTSSQSYEQNIK, TAQLNISK,MRLSWFR,LOC100509073,SISDARAR, EIF4E2,TVPLDALR,BAI1,
SLCO1B3,ADIFALALTVVCAAGAEPLPR, WEE1,GPR112, IIPTVDR,DTLVISVK, RNF25, TKIDAIR,ABHD1, ZNF318,USP9Y, PIK3CA, VRAEGELEPR, NQSFCNKI,DMQLLSISR,IIEDCSNSEDTIK, LRBA,
CRTC2,MAGLTVR,CSTF2,IBTK,VLLGVGDPK,NOP56,LLLMGHK, QKCGATPK, LPSALNR,
LQIPGLTLDCR, C12orf34, ASSTSTPEPTR, LOC151760, KLDPLLK, PTPRU,
EKFGSFLCHK, MUC6,FGAACAPTCQMLATGVACVPTK, VAVDGNVIR, RGS3, ACVAAACTVAAR, PFKL, SEWGSLLEELVAEGK, ZEB1,
QAQPTSR,VIFGEENR,IYDQEK,
ACGTSIFQNR,
ASUN,
LTSDYLK,
SEMA4D,QQQEELEAEHGTGDKPAAPR,
MAGEB17,
C22orf28, AKR1C3,
DLEC1,
C9orf50, MVGGGGKR,
EILSAAEHFSMIR,
CASP8AP2,
DRLQYR,
MDVLQTDVSQNFGLELDTKR,
ANK2, GAS2L2,
PLXNA1,
XRN2,
EAQLQSR,
KACQACR,
RIYVGEK,ZNF429,
ARHGAP32,
MYLIP, GTPGTRGPR,
FXR2,
RQMPQGITSCMLHLDGK, EKDTLR,
CEVTVLDALPR,
EKDTIR,
ASB8,
IGLL3P,
QSISNSESGPR,
BYSL,
VIQQLEGAFALVFK,
DHGLLHLK,
GTQTKAEGPTIK,
DGAT2L6,
GDFITIGSRK,AGK,
MYCN,
QDVQMKPVK,
SEC23A,
VSGSNGLAAAR,
SQVISNAK,
SLIRP,
DOCK10,
NTS,
QSWVRAR,TLK2,
SSLPWLPK,PABPC1L,
ITGA7, TNFRSF25,
NFAT5,
C3orf15,
GMKDTQR,
NCAIISDVKVR,
RIVTGNPR,
LAPVPFFSLLQYE,
CDC42BPA,TATALADRAIR,
EEECSPVLR,
PRPF8,
EMB,
DGKI,
ATAAFILANEHNVALFK,
LPEAAAAK,
FSAFLDK,ALG13,
CD320,
IQKVEPAGR,
IPO5,
LACLAGELR,
LACLQR,
GSLQHCCCLLPK,
AVEDEATK,
SLIT2,
SINEMLLSR,
ENAH,
SKOR1,
CLSSYTSVKENFDK, KDM3B,
TPRN,
TMEM79, ABCE1,
C12orf51,
EIEKLK,
FEM1C, GADVNRK,
LCP2,
RAB23,
CNNM4,
WARS2, ACBD4, CALSLPR, CSF2RB, C8orf74,MQLCLEK,RMPPPPR, EALRLER,SRRM1, LEPRE1,
ISQERGK,FAAH2,TEGRLQQK,SMARCAL1,EXOSC9,GLYADTR,C17orf85, MICU1,ETPLSNCER, VMAGPGIKR,
GPI, NLVTEDVMRMLVDLAK, SPAHPTPLSR, LSQEINK, RARS, SCARF1, DRPARDGAAVSR,STIIGESISR,
FISHKK,ZNF687,TYKGGSVR,NEEAEALAKR,RBFA,
GASWLGRR,RNF146,TDMIQALGGVEGILEHTLFK,
ZNF85,DVLETPVDLAGFPVLLSDTAGLR, LQEIRLEVLK,
APRT,LRRC55,
LECLVHR,
FSTLTTHK,
C19orf45,
C8orf85, VEATLAR,
EAPPHGAPPR,
GTPBP3,
DOCK4,
MTUS1,
IALDCNLPR,
PHF1,LMDSVALK,C1orf123,
SCN4A,
QIAAPKAK,
NAALADL1,
DRFISGR,KNESLGQ,
TAGSVILR,
MDC1, KLAMQEFMEINER, OXCT1, MLGAMQVSK, CARS,TLN1, ADSSGQQAPDYR,
MARCH11, FAM120B, KSLHSCSVK,
LAPAMIR,
ACTR2,
LOC100133315,
BBS9,
EPB41L3,
SUV420H2,
C8orf82,
FADS6,
PIP4K2C,
MYBL1,
PVRL2,
FLDFITNIFA,
EPARSAHR,
MARAAALLPSR,
LELLVGCIAELR,
ITIDTNK,
KDGGGLYEK,
ERNFLR,
LSRSPLK,
QLYLER,
KPNPNTSKVVK,
PIGO,
MRS2,
KCNMA1,
SYNE2,
INCENP,
ARHGEF2,
FXYD3,
NRG1,
SKA3, MTAP,
CLCN5,
C3P1,
LSM14A,
MOAP1,
PREX2,
HEATR5A,
VSRPENEQLR,
MOBP,
FAM53B,
CARKD,
MASRFSR,
KTLLADR,
FIAANDK,
KNLEDGINNLK,
DASLPVPGQR,
CALPPRLK,
ATYKEMIR,
SLASTLDCETARLQR,
KEEQQR,
SESLESPRGER,
LAMTOR3, SPAG17, C6,
APARCFSAR, SLSLLLK,
VEILYR,
AEGHRVLIFTQMTR, CAGE1,TFDRSTK,GAD1,LSSLGCHLK,ARG2,LLQPGDK,
ERN1,RGMSILR, NVEDLSGGELQR,
ERBB2IP, MKL2,
MTEQETLALLEVK, DAALFPGCER,
LNPADIK,
KNLNLSK, HSD11B1L, DRASAAEAVR, SDHD, LSAVCGALGGR, DNAAF2, GIT1,WDHD1, TSLDGARR,
INALTAASEAACLIVSVDETIKNPR,
SQDAEVGDGTTSVTLLAAEFLK,
TEX264,
LGRAGGKPGGR,
CCT7,
PNMA1,
PRRC1,
QRSQIK,
QVKPYVEEGLHPQIIIR, LLSDDYEQVR,WVGGPEIELIAIATGGR,
QMETMGKR,
VYSLQHLDPQGAQELLEFTIR,
LSTLIEFLLHR, SPALQVLR, DCQLNAHKDHQYQFLEDAVR, EQAELTGLR, ATSPADALQYLLQFAR, AQETEAAPSQAPADEPEPESAAAQSQENQDTRPK, NVESGEEELASK, EIDKNDHLYILLSTLEPTDAGILGTTK, LSPPYSSPQEFAQDVGR,
CCT5,MAGED2,PGAM5,TRIM28,PTCD3, INTS4,
SPATA16,MAAAGPMTGAAASR, FAM99A,QVTQAGR,TLGSAVSPQQR,HOPX,SLLDNLRNK, PROCA1,
YEEAIKQTR, AJUBA, LSGLGGPRK, SEZ6L, CRYYSNLR, CD248, ACRELGGDLATPR,
C7orf59, SSSHSCFPGGR, IQSEC3, AACTIQTAFR, CCNB3, ZNHIT6,TNFKEDSLVK,
TASVLWPWINR, LELSNLEIPHQGVQVEGDGFSHAIR, EILSEVER, GIPEFWLTVFK,
VGAAAARAR,PENK,EPGPFVPR,BRPF3,TVDLQDAEEAVELVQYAYFK, TPIDYAR, MCM3,
VPNSNPPEYEFFWGLR, GPIAFWAR,VSTDLLR,HSQYHVDGSLEKDR, MNEAFGDTK,VLVNDAQKVTEGQQER, SAYESQPIR,VSNSPSQAIEVVELASAFSLPICEGLTQR, IAILTCPFEPPKPK,
NPPGFAFVEFEDPRDAADAVR,
VKEERPLFPQILSSIELLQHSLPK,
QSKDAPK,
CILPFDK,
LQLEGQVTELR,
IETYGGR,
ELEKLK,
QSDLDTLAK,
STDLTTHK,
ISSLEGR,
DYNC1LI2,
RAF1,
KIF1C,
ZNF619,
ZNF253,
AKAP17A,
OR2AT4,
PLEKHG6,
SRSF3,
CALCOCO1,
ELKILNLSK,
ISSAIDR,
IQDPQLR,
YLSQLLMR,
QLTLPEEDK,
LRRK2,
MTR,
LHX2,
PRAMEF10,
SLC9A5,
ALLARGTK,
ALEADFLTNMHTSKISK,
EQLLLDELVALVNKR,
TPESTKPGPVCQPPVSQSR,
LTPEELER,
KVDLGYISDVDK,
GRVPGVDR,
SNTQAPRLAPSHR,
KLCEEDLEK,
ARHGEF10L,
LRHPVSTK,
LLTELESPAWWPFSSK,
LONRF2,
UTS2,
SLFVQELLLSTLVR,
SMELYR,
MEIQEIQLKEAK,
PARP11,
EHBP1,
ISIMYSTLR,TAF1A,VSFQPKLSPDLTR,
RTWLESMAK,LOC388906,DGDEEALKTMIK, AGGCPSSLASSR,
DADPTWTLR,
DKDVTLSPVK,BOD1L,
DLLAKSLYSR,
IVTCAHR,
SSYLESPR,
MYO16,ISSLNKR, UTF1,
IMMT,
WAC,
OTC,
TTC17,
CDR2L,
GNPTG,
C10orf103,
RDPSPVSDR,
WGPTSNNPR,
VSRGPFSPR,
MAREPR,
SEMA6C,
EHALLAYTLGVK,
EVSTYIK,
THINIVVIGHVDSGK,
DNDPNDYVEQDDILIVK,
IGGIGTVPVGR,
YEEIVKEVSTYIK,
DCTCEEFCPECSVEFTLDVR,
YYVTIIDAPGHR,
EEF1A1,
VADWTGATYQDKR,
LDGLVETPTGYIESLPR,
SUCLG1,
ADAM15,
CDK16,
TLATDILMGVLK,
MED14,
KIPIIIR,
KAIDCLR,
QLKACHR,LRRC53, IKGQNLK,
QVSAINR,
CXCL2, ACLNPASPMVK,
ZNF285,
YLPM1,
ENVNSLLPVFEEFLK,
VLVVHDGFEGLAK,
TTN,
LVPLLLEDGGEAPAALEAALEEK,
FAISLPKAR,
PRX,MRPS31,
POLR2F,
RYLPDGSYEDWGVDELIITD, TQGNVFATDAILATLMSCTR,
DNAJA3,
QLLQANPILEAFGNAK,
RPS27L,
IIGLDQVAGMSETALPGAFKTR,
DGADIHSDLFISIAQALLGGTAR,
ALMLQGVDLLADAVAVTMGPK,
IQDKEGIPPDQQR,
TALLDAAGVASLLTTAEVVVTEIPKEEK,
ESTLHLVLR,
GINSSNVENQLQATQAAR,
VTEPSAPCQALVSIGDLQATFHGIR,
MQLFVK,UBC,
MQIFVK,
TLSDYNIQK,
TITLEVEPSDTIENVK,
QIGSVIRNPEILAIAPVLLDALTDPSR,
QDQIQQVVNHGLVPFLVSVLSK,
SHGIDLDHNR,
SMC4,
AAVEEGIVLGGGCALLR,
CEFQDAYVLLSEKK,
NPEILAIAPVLLDALTDPSR,
PFKFB3,
PESGAAAARPRR,
PPP2R5A,
VITSQGTVVK,
LVLGQEELR,
GTPBP5,
SLELLPIILTALATK,
POC1A,
JUP,
LNTIPLFVQLLYSSVENIQR,
EXOG,
MELQEIQLKEAK,
CAKFSPDGR,
CSPG4,
ACLGEPLAR,
QLSQSLLPAIVELAEDAKWR,
YFAQEALTVLSLA,
TFDP1,
KIF27,
SNAP47,
FKBP2,
C16orf7,
TPM1,
RTQSDLTIEK,
EVRDLK,
RCPSILGGLAPEK,
LQELEAR,
TVLWNPEDLIPLPIPK,
IYQTDR,
LHHQNQQQIQQQQQQLQR,
IAIYELLFK,
DGMLYLGSR,
HSPA4,
ZC3H4,
SRCAP,
TRIM22,
QDMLAEKVLAGK,
SSNNNGSVRTA,
ACADVL,
ASB2,
CHSY1,
C14orf49,
MED15,
RPS10,
SPAG5,
GNAL,
ZNRF1,
LRRIQ1,
NSSIASDIHGDDLIVTPFAQVLASLR,
MMHVQQLSLASLCVKK,
KEDMLLK,
PPTDSPLER,
FLLTGLLSGLPAPQFAIR,
MUC16,
GRIK1,
SMTN,
SALL4,
UNC5A,
ERC1,
PIGN,
ZBTB11,
FVSSLLTALFR,
TRIM6-TRIM34,
PSMA3,
VADLVHILTHLQLLR,
ESPNL,
AVENSSTAIGIR,
NSGLTEEEAK, SSPO,
RAE1,
AGSLAALEKR,
NFSSASALQIHERTHTGEKPFVCNICGR,
VTWRGEGR,
KIAELESH,
LQEEIAR,
VSAQQFDDAFLK,
YLTNER,
QLENQNPQGR,
IQTEPTSSLTLGLRK,
WDR11,
UACA,
DBF4B,
URB2,
LSLEGIVVQR,
WLNCPR,
YDSQVAEENR,
WKPPSLNSVDFR,
RNGTT,
TEVSFTLNEDLANIHDIGGKPASVSAPR,
DDX26B,NLQASGLTTLGQALR,
LKLGAIFLEGVTVK,
IIRPSETAGR,
ALIVVQQGMTPSAK,
TDLTVLVAHNDDPTDQMFVFFPEEPK,
LRENQLPR,
AECRPAASENYMR,
LQIEESSKPVR,
QHVLDMLFSAFEK,
HQYYNLK,
TIRDIIALNPLYR,
AQFGDKPSEGRPR,
EEVTELLAR,YFGIKR,
RTDLTVLVAHNDDPTDQMFVFFPEEPK,
POLR2E,
DIIALNPLYR,
FLDDLGNAK,TLQQNAESRFN,
BAG2,
IIDEVVNKFLDDLGNAK, LLESLDQLELRVEALR,
TLQQNAESR,
GLANPEPAPEPK,
KPVAIFK,
RECQL5,
NDRDQVSFLIR,
ANLFYDVQFK,
EEEENRLR,
FAM50A,
LOC100294412,
QLAVLTELPFVVPFEER, DKPELQFPFLQDEDTVATLLECK,
SEC16A,
VPHPLEHK,
POLR2J,
FENNSWVFMR,
NKPFFDICTSR,
GELDLTGAK,
UBE3C,MSH2,MYLK2,
KNFIAVSAANR,
LNLVEAFVEDAELR,
CNP,
VELSEQQLQLWPSDVDKLSPTDNLPR,
MEX3A,LASVPEYR,
ATGSANMTK,DQEEIEDQSPPCPR,
NBPF6, SPIN2A,
SPIN2B,NBPF4,
FOXK1,
HNRNPH1,
STGEAFVQFASQEIAEK,
LAQNILSYLDLKSDEWEY,
BAG5,
QLFEIDSVDTEGKGDIQQAR,
HTGPNSPDTANDGFVR,
IQELGDLYTPAPGR,
AVIEVQTLITYIDLK,
FEELLLQASK,
IRS4,
VLGNTSGLLQLLFNR,
ELQQAQTTRPESTQIQPQPGFCIK,
PIH1D1,
GQGCTAYDVAVNSDFYRR,
EALDVLGAVLK,ALEEAANSGGLNLSAR, FDDGAGGDNEVQR,
LNTEVASVVVQLLSIR,
LLQEEEQER,
DVWQVIVKPR,
ACLIFFDEIDAIGGAR,
ELYDRYGEQGLR,CFLSWLVK,
LRCH1,
GNTSGSHIVNHLVR,
DNAJA2,
PSMC2,
THYSNIEANESEEVR,
CAPNS1,
FSATEVTNK,
HDLPPYR,
VIEPGCVR,
ELFLTAGGAGGLHLWK,
LVATSLEGK,
EIINAIDGIGGLGIGEGAPEIVTGSR,
YLWNNIK,
ITFTGEADQAPGVEPGDIVLLLQEK,
VHLTPYTVDSPICDFLELQR,
DAQGQPGLER,
YVDVLNPSGTQR,
DDTHKVDVINFAQNK,
SSLSSHSHQSQIYR, SWSSYFSLPNPFR,
LNTEVASVVVQLLSIRR,
LLIVSNPVDILTYVAWK,
WFQPVANAADAEAVR,
IEEFVPPDENCPLKEASSR,
GSVSASEQGTLNPTAQDPFQLSAPGVSLK,
GTGVIQLYEIQHGDLK,
YLATGDFGGNLHIWNLEAPEMPVYSVK,
VVCAGYDNGDIKLFDLR,
WDR92,
STVWQVR,YEYPIQR,
EWNLFYQK,
YTGEEDGAGGHSPAPPQTEECLR,
UPF1,
NVFLLGFIPAK,
ISEPEADVESVLGVSNLLQVLQKPK,
SQLLKDPQVLFAGYK,
LLMEELVNFIER,
LTN1,LDHA,
LLGHEVEDVIIK,
KTIMQLCHDR, TENPLILIDEVDKIGR,
LLQNVAR,
LONP1, INTS3,
IVSGEAESVEVTPENLQDFVGKPVFTVER, DIIHNPQALSPQFTGILQLLQSR,
ILCFYGPPGVGK,
TQLVWLVR,
GHKEIINAIDGIGGLGIGEGAPEIVTGSR,
DCWTVAFGNAYNQEER,
AAYNPGQAVPWNAVK, LAEAEETAR,
KIAA1967,
QEGLDGGLPEEVLFGNLDLLPPPGK, ALVSHNGSLINVGSLLQR,
SPAPPLLHVAALGQK,
ILQDSLGGNCR,
ANLEAFTVDKDITLTNDKPATAIGVIGNFTDAER,
VLLLSSPGLEELYR,
QLEESVDALSEELVQLR,
KIF5B,
SLSALGNVISALAEGSTYVPYR,
LYLVDLAGSEK,
SAEIDSDDTGGSAAQK,
HVAVTNMNEHSSR,
TAAPGHDLSDTVQADLSK, IESEGLLSLTTQLVK,
MRPS27,
MQNSDFLR,
GQGCTAYDVAVNSDFYR, NRPFMGSISQQNIR,
ELVITIAR,
VVSRGNK,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK,
RTVAAPSVFIFPPSDEQLK,
VDNALQSGNSQESVTEQDSK,
IGKC,
TVAAPSVFIFPPSDEQLK,
DSTYSLSSTLTISK,
PCDHB8,
SGTASVVCLLNNFYPR,
VDNALQSGNSQESVTEQDSKDSTYSLSSTLTDSK,
NFIC,FAM90A9,SYSDPPLK,
HSGPNSADSANDGFVR, HNRNPF,
IEGDETSTEAATR,
ADQFEYVMYGK,
LSAYVSYGGLLMR,
LQGDANNLHGFEVDSR, LHCESESFK,
VYLLMK,
MELQELQLKEAK,TELQGLIGQLDEVSLEKNPCIR,
AVWNVLGNLSEIQGEVLSFDGNR,
YVELFLNSTAGASGGAYEHR,
ITGEAFVQFASQELAEK,
YVEVFK,
YLDLEEEADTTK,
ATENDIYNFFSPLNPVR,
CCT4,
IDDVVNTR,
LFAQLAGDDMEVSATELMNILNK, THYSNIEANESEEVRQFR,
TDGFGIDTCR,
ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR,
AFADAMEVIPSTLAENAGLNPISTVTELR,
LGFEEFK,
GKGAAAAAAASGAAGGGGGGAGAGAPGGGR, YSNENLDLAR,
VLFICTANVTDTIPEPLRDR,
NYLDWLTSIPWGK, LAQPYVGVFLK,
FSDLFSLAEEYEDSSTKPPK, QLEVEPEEPEAENKHKPR,
NKNPAPPIDAVEQILPTLVR,
LLHHDDPEVLADTCWAISYLTDGPNER,
LLGASELPIVTPALR,
FQGEDTVVIASKPYAFDR,
EAVFFQSHSAR,
ITHEVDELTQIIADVSQDPTLPR,
FAM90A19,
KPNA2,
PRPS1,
PRPS2,
KIAA0415,
CAMK2B,
DTNA,
USP13,
VNMSNQVPR,
LAD1,
HTR3B,
PRPF40A,
IEKMELALVPVSAHGNFYEGDCYVILSTR,
MKEASVNPSGAR,
SRESPGGDLR,
TQGGVSPR,
MEX3B,
CALM3,
QEKQER,
YLLDKDEK,
CALM2,
ARL9,
MOSPD2,
NCOA1,
EAFSLFDKDGDGTITTK,
CHADL,
DENND2D,
SIRPD,
ZMYM5,
C19orf55,CD36, VFNGKDNISK,
GQVCVMIHSGSRGLGHQVATDALVAMEK,
QSVLWSR,
ALHLSDLFSPLPILELTR,
WASGVDGGLYEHKTFMYPVAK,
QNLQEEKER,
LKEAGQGR,
WPFVPEK,
IDMPRK,
RVLSGATLPDK,
QAPESRK,
EERPLFPQILATIELLQR,
TPEEHELETGIK,
GHGEGRLGCR,
LPEITDLK,
APDSTLR,
ADPECCSILHGLVAAVETLCK,
C19orf24,
PDCD11,
PEG3,
ANKRD27,
EIF3L,
PLAGL2,
SNED1,
RPS24,
DNAH12,
SCRIB,
AKAGWAGDPR,
RGPGLGAR,
TTPDVIFVFGFR,
MPPLIADSPKAR,
LFSQRKPCK,
LSVAGNCR,
QEKEAGR,
AYDITNVLDGIDLVEK,
MQEEEARK,
HIASGNQK,
ALLLLLAER,
GRM7,
NPM1,
CEP44,
ERVFC1-1,
MTFR1,
AMVVDLVKR,
LLCGATLIAPR,
ZNF223,
SLWVINSALR,
ASB15,
KLCTHK,
TARS2,
RTL1,
AVIL,
HAPLN1,
SLC43A3,
IL4I1,
GCET2,
COL9A2,
ATP8B1,
KLK11,
PPIL4,
OSMR,
LLGAGRAR,
EQYDEKR,
KQAELDK,
AREMLLK,
NDWHGGAIVSALSQTGSLFKPR,
TAMEFQRR,
LOC729040,
RPL19,
DAK,
BBX,
ZNF250,
NAA16,
RVDYAAR,
LVEPQISELNHR,
LTQLAMLLAEISSVAHQK,
QLDSLRER,
DMD,
MRPL55,
TRIM37,
KANSL2,
FBXL13,
MRPS28,
ANGEL2,
CECR2,
EEF2,
FIGN,
KEISVNGK,
CVRLSDASVMK,
GTGKTLLGR,
CSPGEKSQLVR,
CSSQSLPMTR,
IADGSVMR,
LSSYSTTIR,
QEETPVLTR,
ALPALPSAR,
NPPAGYFQQK,
PKSQVSEEEGK,
LLGLTGLLSK,
GAAVLGVDHRVR,
LYENEK,
GVEITGFPEAQALGLEVFHAGTALK,
NQAIQALR,
LLVRPTSSETPSAAELVSAIEELVK,
ILKCYEQK,
RDFVEAPPPK,
SPGWSEEVVR,
RLASSVLR,
IKMATADEIVK,
KCVPVPR,
TCTTTRVSK,
SSESPGDQVMAAAR,
KLVEAIK,
LOC100131673,
LNNSSPCDIR,
RAD51AP2,
AIAYLFPSGLFEK,
C4orf27,
SALTVHCK,
EPB41,
SCGTTRELQK,
STK40,
PARN,
ACRV1,
PCLO,
MREG,
DAP3,
MITF,
SMARCA1,
TRIO, TLVLSNLSYSATEETLQEVFEK, SUB1,
RCOR3,
LRRC38,
KRT78,
ESLSSVCGLR,
C4orf29,
FLGATTDTTVLEANAVLLGIQESK, CLCC1,
YEEIAR, YEELAR, ARL1,
QVNSALK, DLSYTLK, POLR2K, LVVFDAR,
GTGLDEAMEWLVETLK, MGLL,
LELLEEVLALGLR,
MAAQRLGK,
RABGGTB,
LYSEQR,
LLKFESCGK, LRRTM2,
VWF,
RP1,
TEDAGLELLACPVLR,
PPP1R14A,SKFLDALISLLS,
ATXN7L1,NVGCLQEALQLATSFAQLR,
STX18,EFTRSR,
ZFP37,
TYGEPESAGPSR,EMD,
FFEVILIDPFHK,
DHRPPCAQEAPR,
TDVKNTLSEIK, LQREQQR, RP1L1, VQIDYK, MLPALGMACPPK,
LYATDGR,
TCQSLHINEMCQER,
MSEYLR,
ATR,
DFEPILEQQIHQDDFGESK,
MEPCE,ISVCPAMR,
NTN1,
EPHA8,
LEGVVTR,
EAYLLLQLFK,
TLLPLLLESSTESVAEISSNSLER, QAACTIVEALATIPSR, DASGFWSSLVNPATGLAEVEGENAQR,
EGMEIPNLR,VPS41,
QQLVKLEMNTLNVMLGTLNLGNLLLTGDK,
PCDHB13,
ELPQLVASVIESESEILHHEK,
MKIHSPSPHK,
COL6A1,
C16orf70,
FREM3,
FHL2,
C7orf34,
IQGAP3,
NAV2,QMLNDKK,BRCA2,NUP214,
YVFDIK,
RC3H1,
SSVLPSPSGR,
RERE,
AFMID,
CLPX,
GOLGA4,
GTF2H5,
GHRLOS2,
WDR82,
NCAM2,
AISHEHSPSDLEAHFVPLVK,
QLSQSLLPAIVELAEDAK,
AINTAMYHIMMFQLVKDHIR,
VCCELELR,
VGELADQNAFSLTQK,
KTEALLEAK,
LGAEEALR,
IEATNNPR,
EALNPETIEIK,
DCEGLSREK,
SMARCA2,
RELB,
CIITA,
MDLAGLLK,
CVHIGEMVDPVVELAAKR,
CADPS,
FNTA,
EVFGRGGQPGPK,
RAMELLK,
VAILAELDKEK,
ZMAT1, TMCO6,
TNFAIP8L1,
RPL36,
QSGPRTK,
CDC45,C9orf80,
SGCE, QANNFIISR,
ESRRA,
THOC2, KEENGTMGVSK,
AGPAT4,
TOB2,
PABPC4, EAAQKDSK,
TEQGTVLIGSEHGLIEVDPVSR,
EEASSLLK,
EEGMVPFIFVGTR,
EVEVTACLVWK,
LQGGPGLHLK,
SLGDLEK,
RPCAPTR,
VHELQGNAPSDPDAVSAEEALK,
MQEEEEVVDK,
NICEGGEEMDNKFGVEQPEGDEDLTK, MHGGGPPSGDSACPLR, NSINQVVQLR,LSGEAFDWLLGEIESK, CIESNMLTDMTLQGIEQISK, SMVVSGAK,
RSGLELYAEWK,
TLPHFIKDDYGPESR,
AEIQELAMVPR,
IKDILAK,
LTFASTLSHLRR,
RLDLAGPLLAFLFR,
FEQIYLSKPTHWER,
LRNLTYSAPLYVDITK, IVATLPYIKQEVPIIIVFR,
TAETGYIQRR,
RVQFGVLSPDELK,
SLSEYNNFK,
IVEDAPPIDLQAEAQHASGEVEEPPR, IPQIGDKFASR,
NLTYSAPLYVDITK,
ATAGDTHLGGEDFDNRLVNHFVEEFK,
QIALAVLSTELAVRK, LVQNGTEPSSLPFLDPNARPLVPEVSIK, WFESSGIHVAALVVGECCPTPSHWSATR, AIVHAVGQELQVTGPFNLQLIAKDDQLK,
CDFALFLGASSENAGTLGTVAGSAAGLK,
TLHEWLQQHGIPGLQGVDTR, VFNTGGAPR,
RPVINAGDGVGEHPTQALLDIFTIR,
ASDPGLPAEEPK,QTQTFTTYSDNQPGVLIQVYEGER,
LLDTIGISQPQWR,
SVGEVMGIGR,
RIIAHAQLLEQHR,AKFEELNMDLFR,
CAD,MVNHFIAEFKR,
VCNPIITK,NQLTSNPENTVFDAKR,
FDDAVVQSDMK,ITITNDK,
VEIIANDQGNR,
IEIESFYEGEDFSETLTR,
LSKEEIER,
AFYPEEISSMVLTK, NALESYAFNMK,
YKAEDEVQR,QATKDAGVIAGLNVLR, QAMGVYITNFHVR, KGFDQEEVFEKPTR,
VSANKGEIGDATPFNDAVNVQK, ARGPIQILNR,DCQIAHGAAQFLR, ICRPLLIVEK, LVNHFVEEFK,ITSQIFIGPTYYQR,
GEIGDATPFNDAVNVQK,
LLQFHVATMVDNELPGLPR,
QEDMPFTCEGITPDIIINPHAIPSR, TFAPEEISAMVLTK,
CQEVISWLDANTLAEKDEFEHK,
MKEIAEAYLGYPVTNAVITVPAYFNDSQR,
AQIHDLVLVGGSTR, LVNHFVEEFKR,
ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK,
AAAIGIDLGTTYSCVGVFQHGK, KELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, SAVEDEGLKGK,
SAVEDEGLK,
VQVSYKGETK,
NQVALNPQNTVFDAKR, HSPA1B,
QTQIFTTYSDNQPGVLIQVYEGER,
LLQDFFNGR,
HSPA1A,
EIAEAYLGYPVTNAVITVPAYFNDSQR,
FKEIAEAYLGYPVTNAVITVPAYFNDSQR,
HWPFQVINDGDKPK, LLQDFFDGRDLNK,
LLQDFFDGR,
IINEPTAAAIAYGLDR, STLEPVEK,
NQVALNPQNTVFDAK,
FEELCSDLFR,
VEILANDQGNR,
ARFEELCSDLFR,
VTHAVVTVPAYFNDAQR,
NSTIPTK,
HSPA1L,
SINPDEAVAYGAAVQAAILMGDK,
DNNLLGR,
DAGVIAGLNVLR,
RFDDAVVQSDMK, MVNHFIAEFK,
ATAGDTHLGGEDFDNR,
TKPYIQVDIGGGQTK,
TTPSYVAFTDTER,
EIAEAYLGK,ITITNDKGR,
IINEPTAAAIAYGLDK,
CNEIINWLDK,
IEWLESHQDADIEDFKAK,
KVTHAVVTVPAYFNDAQR,
AVEEKIEWLESHQDADIEDFK,
IINEPTAAAIAYGLDKR,
TWNDPSVQQDIK,
VYEGERPLTK,
SFYPEEVSSMVLTK,
HSPA5,
IDTRNELESYAYSLK,
MKEIAEAYLGK,
LLQDFFNGK,DAGTIAGLNVLR,
QTQTFTTYSDNQPGVFTQVYEGER, STAGDTHLGGEDFDNR,
HSPA6,
QATKDAGTIAGLNVLR,
LSKEDIER,
NQVAMNPTNTVFDAKR,
MVQEAEK,
TVTNAVVTVPAYFNDSQR,
LSKEEVER,
ELEEIVQPIISK,LSSEDKETMEK,DNHLLGTFDLTGIPPAPR,
GTLDPVEK,
ITITNDQNR,
DAGTIAGLNVMR,
ITPSYVAFTPEGER,
NKITITNDQNR,
LYGSAGPPPTGEEDTAEKDEL,
LLESLGYSLYASLGTADFYTEHGVK,
FHLPPR,
GQPLPPDLLQQAK,
MAEIGEHVAPSEAANSLEQAQAAAER,
HPQPGAVELAAK,
LALGIPLPELR,
QIALAVLSTELAVR,
VNEISVEVDSDPR,
GHNQPCLLVGSGR,
SQIFSTASDNQPTVTIK, NQLTSNPENTVFDAK, IEWLESHQDADIEDFK,
LTPEEIER,SDIDEIVLVGGSTR,
QISNLQQSISDAEQR,
GSSSGGGYSSGSSSYGSGGR,
LQAEIEGLK,
LQAEIEGLKGQR,
RSTSSFSCLSR,AQYEDIANR,
ISSSSFSR,
NSKIEISELNR,
QNLEPLFEQYINNLRR,
WTLLQEQGTK,
QNLEPLFEQYINNLR, AQYEEIANR,
QLDSIVGER,
KRT6C,KRT6A,
KRT5,
KRT75,
GFSSGSAVVSGGSR,
FASFIDK,
KRT79,
YEDEINKR,
SISISVAGGGGGFGAAGGFGGR,
NKLNDLEEALQQAK,
LEGLTDEINFLR,
SISISVAR,VDPEIQNVK,
KRT6B,
KLLEGEECR,GSGGGSSGGSIGGR,
TSQNSELNNMQDLVEDYKK,
SLDLDSIIAEVKAQYEDIAQK,
SLNNQFASFIDK,
LNDLEDALQQAKEDLAR,
AEAESLYQSKYEELQITAGR,
LQGEIAHVK,
AQYEDIAQK,
NLDLDSIIAEVK,
YEELQQTAGR,GRLDSELR,
VSLAGACGVGGYGSR,
DVDAAYMSK,
KRT7,
YEELQVTAGK,
TTSGYAGGLSSAYGGLTSPGLSYSLGSSFGSGAGSSSFSR,
ASLEAAIADAEQRGELAIK,
YEELQSLAGK,
KRT8,
LDSELKNMQDMVEDYR,
YEELQITAGR,DYQELMNTK,
DVDNAYMIK,TNAENEFVTIK,
AQYEEIAQR,VLYDAEISQIHQSVTDTNVILSMDNSR,
NERPDGVLLTFGGQTALNCGVELTK,
FLSSAAAVSKEHPVVISK,
NSVTGGTAAFEPSVDYCVVK,
EIEYEVVR,
RLAADFSVPLIIDIK,
LCPPGIPTPGSGLPPPR,
EELSALVAPAFAHTSQVLVDK,
GEVAYIDGQVLVPPGYGQDVR,
RLSSFVTK,
ELSDLESAR,
LSLDDLLQR,
LREQGSLLGK,
TPHVLVLGSGVYR,
VTAVDWHFEEAVDGECPPQR,
VYFLPITPHYVTQVIR,
IIAHAQLLEQHR,
LSSFVTK,
SLNNQFASFIDKVR,
DVDGAYMTK,
THNLEPYFESFINNLRR,
SKAEAESLYQSK,
FLEQQNQVLQTKWELLQQVDTSTR,
MSGECAPNVSVSVSTSHTTISGGGSR,
NKLNDLEDALQQAK,
SLVNLGGSK,
IEISELNR,EIKIEISELNR,
FLEQQNQVLQTK,TAAENDFVTLKK,
GSYGSGGSSYGSGGGSYGSGGGGGGHGSYGSGSSSGGYR,
FSSCGGGGGSFGAGGGFGSR, VDLLNQEIEFLK,
YEELQVTVGR,
GGSGGGGSISGGGYGSGGGSGGR,
GGGGGGYGSGGSSYGSGGGSYGSGGGGGGGR,
THNLEPYFESFINNLR, SLDLDSIIAEVK,
LLRDYQELMNTK,AEAESLYQSK,
SDQSRLDSELK,
SKEEAEALYHSK,
TLLEGEESR,
HGGGGGGFGGGGFGSR,
FGGFGGPGGVGGLGGPGGFGPGGYPGGIHEVSVNQSLLQPLNVK,
GFSSGSAVVSGGSRR,
TNAENEFVTIKK,
KRT2,
WELLQQVDTSTR,
GGSISGGGYGSGGGK,
KRT1,YLDGLTAER,GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR,
TAAENDFVTLK,
MALLATVLGRF,ATGYPLAYVAAK,
VLGTSPEAIDSAENR,
VLSEPNPRPVFGICLGHQLLALAIGAK,
KVLILGSGGLSIGQAGEFDYSGSQAIK,
VLILGSGGLSIGQAGEFDYSGSQAIK,
AIVHAVGQELQVTGPFNLQLIAK,
TLGVDLVALATR,
VIECNVR,ILALDCGLK,
TSSSFAAAMAR,LAADFSVPLIIDIK,
EHPVVISK,
QDVEALWENMAVIDCFASDHAPHTLEEK,
ALKEENIQTLLINPNIATVQTSQGLADK,
KNILLTIGSYK,
EATAGNPGGQTVR,
AAFALGGLGSGFASNR, LFVEALGQIGPAPPLK,
VQVEYKGETK,
NSLESYAFNMK,
ARFEELNADLFR,
VQVEYK,
FEELNADLFR,VNHFIAEFK,
ILPWSTFR,
SINPDEAVAYGAAVQAAILSGDK,
SQIHDIVLVGGSTR,
TVIKEGEEQLQTQHQK,
HISPGDTK,
ASDPGLPAEEPKEK,
SFEEAFQK,
TLSSSTQASIEIDSLYEGIDFYTSITR,
NQTAEKEEFEHQQK, HSPA8,
QTQTFITYSDNQPGVLIQVYEGER,
DNNLLGK,
HWPFMVVNDAGRPK, GPAVGIDLGTTYSCVGVFQHGK, NTTIPTKQTQTFTTYSDNQPGVLIQVYEGER, NQVAMNPTNTVFDAK,
LOC100287551,
ELDDRDHYGNK,
ATVEDEKLQGK, LDKSQIHDIVLVGGSTR,
LTTPTYGDLNHLVSATMSGVTTCLR,
FWEVISDEHGIDPTGTYHGDSDLQLER, MREIVHLQAGQCGNQIGAK,
SGPFGQIFRPDNFVFGQSGAGNNWAK,
FWEVISDEHGIDPTGTYHGDSDLQLDR, INVYYNEATGGK,
KEPESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
MRENVHLQAGQCGNQIGAK, MAVTFIGNSTAIQELFKR, MSATFIGNSTAIQELFKR,
IMNTFSVVPSPK,TUBB4B,
MAVTFIGNSTAIQELFK,
MAATFIGNSTAIQELFKR, EVDEQMLNVQNK,
INVYYNEATGGKYVPR, KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK,
TUBB,TUBB4A,
MREIVHIQAGQCGNQIGAK,
MSATFIGNSTAIQELFK,
MSMKEVDEQMLNVQNK,
AVLVDLEPGTMDSVR, NSSYFVEWIPNNVK,
VREEYPDR,
MAATFIGNSTAIQELFK,
TAVCDIPPR,TAVCDIPPRGLK,
GHYTEGAELVDAVLDVVR,
IREEYPDR,GHYTEGAELVDSVLDVVRK,
AILVDLEPGTMDSVR,
KLAVNMVPFPR,
MASTFIGNSTAIQELFKR,
FWEVISDEHGIDPSGNYVGDSDLQLER, FPGQLNADLRK,
NMMAACDPR,
EIVHLQAGQCGNQIGAK,
TTPPMLDSDGSFFLYSK,
VVSVLTVVHQDWLNGK,
ALTVPELTQQMFDAK,
GHYTEGAELVDAVLDVVRK,
MSSTFIGNSTAIQELFKR,
ISEQFTAMFR,
TLKLTTPTYGDLNHLVSATMSGVTTCLR,
LHFFMPGFAPLTAR,
EIVHIQAGQCGNQIGAK,
KEAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK,
FWEVISDEHGIDPAGGYVGDSALQLER, INVYYNESSSQK,
EIVHIQAGQCGNQIGTK,
MASTFIGNSTAIQELFK,
LAVNMVPFPR,
LTTPTYGDLNHLVSATMSGVTTSLR,
YLTVATVFR,
TPLGDTTHTCPR,
IGHG3,
WQEGNVFSCSVMHEALHNHYTQK,
DTLMISR,
GFYPSDIAVEWESNGQPENNYK,
IGHG2,
IGH@,
SCDTPPPCPR,
GPSVFPLAPCSR,
CCVECPPCPAPPVAGPSVFLFPPKPK,
STSESTAALGCLVK,
GYLVTQDELDQTLEEFK, QSLVDMAPK,
MQEENITR,
RNLDIERPTYTNLNR,
GFASVSEK,
ADSWVEKEEPAPSN,
GYLVTQDELDQTLEEFKAQFGDKPSEGRPR,
IHFPLATYAPVISAEK,
DVNAAIATIK,
YMACCLLYRGDVVPK,
FDGALNVDLTEFQTNLVPYPR,
TIQFVDWCPTGFK,
DYEEVGVDSVEGEGEEEGEEY,
LDHKFDLMYAK,AVCMLSNTTAIAEAWAR,
FDLMYAK,
LIGQIVSSITASLR,
SIQFVDWCPTGFK,
QLFHPEQLITGKEDAANNYAR,
EDMAALEKDYEEVGADSADGEDEGEEY,
TUBA1A,
TUBA1B,
TUBA1C,
EDAANNYAR,
VGINYQPPTVVPGGDLAK,
AYHEQLSVAEITNACFEPANQMVK,
AFVHWYVGEGMEEGEFSEAR,
AVFVDLEPTVIDEVRTGTYR,
QIFHPEQLITGK,
TUBA4B,
QLFHPEQLITGK,
LISQIVSSITASLR,
RP11-631M21.2,LHFFMPGFAPLTSR,
EAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK,
LSVDYGK,
GAAAAAAASGAAGGGGGGAGAGAPGGGR,
LLCHLTPSIYTEFPDETLR,
EIFDIAFPDEQAEALAVER, LSSDVLTLLIK,
IQAGDPVAR,
DGPSAGCTIVTALLSLAMGRPVR,
IQVSSEKEAAPDAGAEPITADSDPAYSSK,
AVFVDLEPTVIDEVR,
KLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,
TIGGGDDSFNTFFSETGAGKHVPR,
GHYTIGK,
EIIDLVLDR,
EIIDLVLDRIR,
AYHEQLTVAEITNACFEPANQMVK,
LADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,
VELGDSDLK,
ITTPYMTK,
LLQCAEPYK,
YAVLYQPLFDKR,
DSWLAELAGER,RGP1,
VTPENAGQWKPDELQVLEK,
DNAH17,
MAELYR,ASLYVIR,QCALAALR,YATDSETVEPIISQLVTVLLK, LOC651921, PFKFB4, LOC642846,ZCCHC18,
LTEGCSFR,KLEELK,
SCTLGLQPHRGPSVCPR, QLVGTNK,NNALEKEK,
RPS27,RPS27P17,
HHEASSRADSSR,
NFIB,
AQDIEAGDGTTSVVIIAGSLLDSCTK,
GLPWSCSADEVQR,
SVYSWDIVVQR,RDGADIHSDLFISIAQALLGGTAR, TLVTQNSGVEALIHAILR,
EIF3D,FSQLAEAYEVLSDEVKR,
DDX11,ZCCHC12,POC1B,
TRIM6,
CCDC162,
NVDLLSDMVQEHDEPILK,
NAP1L1,FANCI,
CDK17,
CCT3,
ELGIWEPLAVK,
IRAK1,
EALLLVTVLTSLSK,
TLIQNCGASTIR,TIADIIR,SLELLPIILTALATKK,
CYP2D6,PRDX1,
IKBKAP,
TLSALLLPAAGLDDVSLPVAPR,
QENIASGISAK,NHS,SSPLPAKGR,
PPP2R1A,
FITVGWGR,
VFDKDGNGYISAAELR,
EIISAAEHFSMIR, GLFIIDDKGILR,EPHA10, PRDX4, CYP2D7P1,
ELISDLER, LSLPADIR, AIARQMGK,
IEIQNIFEEAQSLVR,
SVEQQVIGFSGLSDDKNYK,
LVIASTLYEDGTLDDGEYNPTDDRPSR,
VYRIEGDETSTEAATR,
POLR2H, TPM3,
TPM3P4,
TCTDIKPEWLVK,
DHX15,
FAHIDGDHLTLLNVYHAFK,
LRTVQLNVCSSEEVEK,
IRVESLLVTAISK,
LQLPVWEYKDR,
HQSFVLVGETGSGK,
VIQYLAYVASSHK,
MYH9,
DFSALESQLQDTQELLQEENRQK,
HLESFPK,
IAEFTTNLTEEEEKSK,
VISGVLQLGNIVFKK,
LQQELDDLLVDLDHQR,
NFRKB,
VSHLLGINVTDFTR, RPS27P9,
VTSLNRQR,
QADKVWR,
CCDC72,
LOC729973,
IAQLEEQLDNETKER,
NFIX,FAM90A17P, NFIA,
Figure 3.4: Peptide-protein graph constructed from an AP-MS experiment with POLR2A as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices.
VVSVLTVLHQDWLNGKEYK,
TTPPMLDSDGSFFLYSK,
IGHG4,
VVSVLTVLHQDWLNGK,
YGPPCPSCPAPEFLGGPSVFLFPPKPK,
SDDTVVYYCAR,
TPLGDTTHTCPR,
SCDTPPPCPR,
STSGGTAALGCLVK,
NTLYLQMNSLR,
DNSKNTLYLQMNSLR,
SLRSDDTAVYYCAR,
GTAVTVSSASTK,
ADDTAVYYCAR,
CPAPELLGGPSVFLFPPKPK,
GRFTISR,
LSSVTAADTAVYYCAR,
GFYPSDIAVEWESNGQPENNYK,
WQEGNVFSCSVMHEALHNHYTQK,
WQQGNVFSCSVMHEALHNHYTQK,
GTTVIVSSASTK,
LTVDKSR,
GDDTAVYYCAR,
VVSVLTVVHQDWLNGK,
EPQVYTLPPSR,LOC100290146,
GRFTVSR,
SEDSALYYCAR,SEDTAVYYCAR,
GTLVTVSSASTK,
FNWYVDGVEVHNAK, SEDTAVYFCAR,
GPSVFPLAPSSKSTSGGTAALGCLVK,
GLVWVSR,
FTISRDDSK,
GLEWVGFIR,
VTISLDTSK, IGH@,
IGHA1,
TPLTATLSK,
NQFSLR,
DASGVTFTWTPSSGK,
VEDTAVYYCAR,
TFTCTAAYPESK,
LSISIDTSK,
TLVTQNSGVEALIHAILR,
ILVNQLSVDDVNVLTCATGTLSNLTCNNSK,
TSFSPCVPQCQTQGSYGSFTEQHR,
KPRP,
CPVEIPPIR,
RLDQCPESPLQR,
GRPAVCQPQGR,
VSVELTNSLFK,
LTTPTYGDLNHLVSATMSGVTTCLR,
TUBB2B,
QDLMIEDNLLK, IDEVPSR,
LOC100289196,
HNRNPL,
NDQDTWDYTNPNLSGQGDPGSNPNKR, CSGEEQSLEQCQHR,
FSAFLDK,
EATLQDCPSGPWGK, QINVSTDDSEVK,
SASPQRAARPR,
TDNAGDQHGGGGGGGGGAGAAGGGGGGENYDDPHKTPASPVVHIR,
YYGGGSEGGRAPK,
HQNQWYTVCQTGWSLR,
KFVIHPESNNLIIIETDHNAYTEATK,
GOLGA6D,
EKADGTEQVER, PRKDC, LAAVVSACK,
LWIANYSLPR,
SLGTIQQCCDAIDHLCR, DFSAFINLVEFCR,
QLFSSLFSGILK,
RBBP6,
KSNSSPSR,
EVDDLGPEVGDIK,
SNLGSVVLQLK,
TLATDILMGVLK,
HIANYISGIQTIGHR,
H3F3B,
RVTIMPK,UBA52,
DAALATALGDKK,UBB,
TITLEVEPSDTIENVK,
EATADDLIKVVEELTR,
DDB1,
KTEPATGFIDGDLIESFLDISRPK,
RHPN2,VPS13D,MAGI2, CCDC39,
QENKMR,
GFAFVTFDDHDSVDK,
SHFEQWGTLTDCVVMR,
IEVIEIMTDR,DYFEQYGKIEVIEIMTDR,
SRGFGFVTYATVEEVDAAMNARPHK,
SSGPYGGGGQYFAKPR, LFIGGLSFETTDDSLR,
YHTINGHNCEVKK,
RGFAFVTFDDHDSVDK, IFVGGIKEDTEEYNLR,
LFIGGLSFETTDDSLREHFEK,
GFGFVTYSCVEEVDAAMCARPHK,
YGKIETIEVMEDR,
SDVEAIFSK,
VDSLLENLEK,
MIASQVVDINLAAEPK,
GDDLQAIKK,
HNRNPCL1,
VPPPPPIAR,
MYSYPAR,
HNRNPC,
STAGDTHLGGEDFDNR,
QTQTFTTYSDNQPGVLIQVYEGER,
TTPSYVAFTDTER,
NQVAMNPTNTVFDAK, DAGTIAGLNVLR,
VQVEYKGETK,
VEIIANDQGNR,
FEELNADLFR,
IINEPTAAAIAYGLDK,
VSYARPSSEVIKDANLYISGLPR,
VAPEEHPVLLTEAPLNPK,
GYSFTTTAER,
ACTB,
TTGIVMDSGDGVTHTVPIYEGYALPHAILR,
AGFAGDDAPR,
IIAPPER,
ISGLIYEETR,VFLENVIRDAVTYTEHAK,
HIST1H4E,DNIQGITKPAIR,
DAVTYTEHAK,
HIST2H4A,TVTAMDVVYALK,
VLRDNIQGITKPAIR,
EIQTAVR,HIST1H2BO,
HIST1H2BB,
ISGLIYEETRGVLK,
KLFIGGLSFETTDDSLR, SRGFGFVTYSCVEEVDAAMCARPHK,
EDSVKPGAHLTVKK,
HNRNPA3,
YHTINGHNCEVK,
GFAFVTFDDHDTVDK,
WGTLTDCVVMRDPQTK,
VAPDEHPILLTEAPLNPK,
ACTBL2,
GFSSGSAVVSGGSR,
DVDAAYMNKVELEAK,
MSGECAPNVSVSVSTSHTTISGGGSR,
SLVNLGGSK,
NKLNDLEEALQQAK, LDNLQQEIDFLTALYQAELSQMQTQISETNVILSMDNNR,
TSFTSVSR,
KRT6B,KRT6A,
KRT5, SGFSSVSVSR,
YEELQVTAGR,
QNLEPLFEQYINNLRR,
SFSTASAITPSVSR,
QSSVSFRSGGSR,ATGGGLSSVGGGSSTIK,
VSLAGACGVGGYGSR,
SRGSGGLGGACGGAGFGSR,
NTKQEIAEINR,
WTLLQEQGTK,FSGSGSGTDFTLK,
SGTASVVCLLNNFYPR,
QNLEPLFEQYINNLR,
TLNNQFASFIDKVR,
LQGEIAHVK,
DVDNAYMIK,KRT73,
YLDFSSIITEVR,
VIM,
KRT78,
LLEGEECR,
NVQDAIADAEQRGEHALK,
FLEQQNK,FASFIDKVR,
SLDLDSIIAEVR,
FASFIDK,
KLLEGEECR,
GGSSSGGGYGSGGGGSSSVK,
SLVGLGGTKSISISVAGGGGGFGAAGGFGGR,
KRT3,
DVDAAYMSK,
KDVDAAYMSK,
KRT4,
LALDIDIATYRK,
KRT7,
LLRDYQELMNTK,
GSSSGGGYSSGSSSYGSGGR,
SKAEAESLYQSK,
GGGGGGYGSGGSSYGSGGGSYGSGGGGGGGR,
SGGGGGRFSSCGGGGGSFGAGGGFGSR,
GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR,
DSGVPDRFSGSGSGTDFTLK,
HKVYACEVTHQGLSSPVTK,
DSTYSLSSTLTLSK,
VQWKVDNALQSGNSQESVTEQDSK,
RTVAAPSVFIFPPSDEQLK, VDNALQSGNSQESVTEQDSK,
TVAAPSVFIFPPSDEQLK,
SISISVAGGGGGFGAAGGFGGR, TEAESWYQTKYEELQQTAGR, LALDVEIATYRK,
YEELQQTAGR,
SRAEAESWYQTK,GLGVGFGSGGGSSSSVK,
KIAA1598, ENO1, TCEB3C, CUL4A, ZNF20, CHD5, RBFOX3,
LSGEAFDWLLGEIESK, RLPDAHSDYAR,FRHENIIGINDIIR,FSGVPDR,MPCQLHQVIVAR,DTSSVAVNLPVPSGVAFPAVFLR,
RBFOX2,CHD3,ZNF571,TCEB3CL,ENO2, CAPN3,
ALYREF,SRRM2,DDX1,NCOA5,C19orf43,KRT23,ZNF326,POLR2A, RBM3,ANXA6,
RBFOX1,CUL3,ENO3,AKAP9, LOC100506888, CHD4,ZNF461,
YSGGNYRDNYDN, QQLSAEELDAQLDAYNAR, ALIEILATR,NQGGSSWEAPYSR, GGHPPAIQSLINLLADNR, QRQEEPPPGPQRPDQSAAAAGPGDPK, AAFGISDSYVDGSSFDPQRR, LASYVEKVR, FLICTDVAAR,
HNRNPUL2, DSC1, SERPINB12, GAPDH, CHERP,
VTIFTVPENCR, VIHDNFGIVEGLMTTVHAITATQK, LLAAVEAFYSPPSHDRPR, ALDVMVSTFHK,DIVQFVPYRR,TLLADQGEIR, LLIYETEAK,
ILDVEIIFNER,HHYEQQQEDLAR, QCGKAFIR,LEGMFR,TEKDNAALAR,SGETEDTFIADLVVGLCTGQIK,
RBM14,MAPK1,IGKV2-24,CAPN2,RPAP1,RBM25,MTA2,
FAANPNQNK,
RGPPPPPR,
GFYPSDIAVEWESNGQPEDNYK,
VVSVLTVVHQDWLNGKEYK,
LTIHGDLYYEGKEFETR,
VLVDQTTGLSR,
TFGQGTKVEIK,
ASQSVSSSYLAWYQQKPGQAPR,
LLIYGASSR,
ATGIPDRFSGSGSGTDFTLTISR,
AEDTAVYYCAK,
LYACEVTHQGLSSPVTK,
HKLYACEVTHQGLSSPVTK,
ATLFCR,
ATGVPDRFSGSGSATDFTLTISR,
LLIYVASSR,
IGKV3D-20,
FSGSGSGTDFTLTISR,
FSGSGSGTDFTLTITR,
SCDTPPPCPYCPAPELLGGPSVFLFPPKPK,
SCDTPPPCPNCPAPELLGGPSVFLFPPKPK,
TPEVTCVVVDVSHEDPDVQFK,
SCDTPPPCPGCPAPELLGGPSVFLFPPKPK,
SCDTPPPCPICPAPELLGGPSVFLFPPKPK,
IGHG3,
WQQGNIFSCSVMHEALHNR,
FDPWGQGTLVTVSSASTK,
IGHG2,
TPEVTCVVVDVSHEDPEVQFK,
GLPAPIEK,
WYVDGVEVHNAK,
IRVESLLVTAISK,
DVLIQGLIDENPGLQLIIR, LQETLSAADR,
GOLGA6B,
GMKPTEFFQSLGGDGER, QNVNILIDTEK,
TKAPDDLVAPVVK,
ILIGTVFMK, STLDINEMVR,
AITHLNNNFMFGQK,
RFEPCSSSYLPLRPSEGFPNYCTPPR,
ELGCGAASGTPSGILYEPPAEK,
LCFSTAQHAS,
IWSLDSDEPVADIEGHTVR, PRPF4,
GQWGTVCDDGWDIKDVAVLCR,
FEPIHGNFLLTGAYDNTAK,
MKRN4P,GIAYVEFVDVSSVPLAIGLTGQR, RHPYFYAPELLFFAK, AFGYYGPLR, QYGLGPNGGIVTSLNLFATR, KIAA1949,GVDEVTIVNILTNR, ASAISVTVLNVIEGPVFRPGSK,
WTLNSR,
LVQLLVK,
WLLDNVR,
VTVPLVR,
DLYEDELVPLFEK,
SNIDALLSRLEQIAAEQK,
HNRNPR,
LCDSYEIRPGK,
MDLRQACR, SALEQMR,
SKGIAYVEFVDVSSVPLAIGLTGQR,
LRPEPCISLEPRPRPLPR,
YLPM1,SUPT16H, DLEEFFSTVGK,ELAAQLNEEAK,
IWLDNVR,LVGGDNLCSGRLEVLHK,
PPP1R18,
LTGHLHAQGTPPYVINLDPAVHEVPFPANIDIR,
SMSLVLDEFYSSLR, TNQELQEINR,
ANXA2,
IHSDCAANQQVTYR,
SALSGHLETVILGLLK, DSG1, GPN1,
PABPC3,
ESSNVVVTER,
HPYFYAPELLFFAK, SRAEAALEEESR,
SMARCE1,
DSCPLDCK,
NPPGFAFVEFEDPRDAADAVR, SRSF3,
BRD4,
QLSLDINKLPGEK,
CSNK2A1,
GGPNIITLADIVKDPVSR,
SHCIAEVENDEMPADLPSLAADFVESK,
EQVANSAFVER,
MAVAEVSR,PLA2G4B,AGGGQALSR,
HSP90AB1,
CSPP1,KNETGVLR, NOS2, PTPRG,
ALAEFEEK,
ABHD11,DGGLAMLPILVSK, C6orf99,TSVEAAVAPR,PPP1R3F, TAMLLALQR,
ABCC13, SAAKGSDSR, SRCAP, IFCFILSTR, CMA1, LLLFLLCSR,
LOC100506191,
LIICVFLEK,
AK3,
MACROD1,
C1orf144, ITQKESR, LEQLASR, MLLT4, ELQALSR,MVSMMEGVIQVR, PLEKHJ1,
EDDKPETVINR,TVTEDFPK,NAP1L2,KSNTLLR,YIPF6,FAM105B,
RFASLSR,
EPEKEAGAGALPR,
VARS,
EPLHVVATK,
SSLKGLAR,
HVEDVPAFQALGSLNDLQFFR,
MPGTVATLR,
HTATSF1,
LLLSFWK,
LREAMAALR,
PLD5,
JMJD5,
KLTGCSR,
SETDB1,
MANLTELELIR, CRH,
QLKGLEEK, CLVALKER,
ESLTLLR,
DACT2, MYOM3,
GPLD1,
LMEIIGK, C1orf87,
C10orf47,
SVLNNSEVK,
ITLGEEKDR,
TSDLIVLGLPWK,
DRWTAAGALPR,
SLAEKQNLEK,
PABPC1,
ESRP2,
TVLDQAR,
FLJ22184, KLHL34,
PSD4,
TRIM37,ANSLVTLGSHR,
ZKSCAN3,
RBBP4,FAM133B,
KIF26B,
RTWLESMAK,
SMARCC2,
RBBP7,
NSIVVPSER,
HSP90AB3P,
KVDCLESTLEK,
RASAL2,
CDH11,
VFQTEAELQEVISDLQSK,
KLSEVGR,
MESRGLR, ANKRD26,
TXLNG2P,
OR1Q1, KPNA2,
MAANSSGQGFQNK,
SMARCC1,
NLDVGANIFIGNLDPEIDEK,
LEQRDR,
YSFLRGTK,
SF3B4,
TQDECILHFLR,
VTGQHQGYGFVEFLSEEDADYAIK,
THEMIS,
IENLSYDAK, FNKDAQR,
USP24,
CEP350, LEGQLEAGEPK, SSSHSGREVVMR,
DTX4,
RPAVVYIGSAGKPHER, GVPTPSGDR, DFPTISR, MYH10,SLEAEILQLQEELASSER, LASYTVR,RAD51AP2, SNED1,C10orf114, MSSVYLK,APBA3,
LILEHQEK,SCLT1,SLIMEAPR,SGCZ,ETFA,AAELEKK, SPDTFVR,DIAPH1,NVEAVRNAK,
LRHLPSR,GLOD5,
ELKSIPR,
NIETIINTFHQYSVK,
EAAVLQEIR,
YEDLLK,TAQEYGGIRHGAK,
GLG1,
CCDC87,
JMJD1C,
VKLAQLAEK,
DEIIER, IRGC,
QCGKAFLR,
AQHQQALSSLELLNVLFR,
S100A9,LPFETFR,
PCCB,
AAVENLPTFLVELSR,
YGPIVDVYVPLDFYTR,
LAQAYHR,
NWGVMKMILMK,
RPQDKVAVSDMK,
TLQCLGLR,
CAIGVTHFQLVQMK,
LAGSKDPR,
TKILATGGASHNR,
NIPAL3,
XYLB,
FFEVILIDPFHK,
VFIGNLNTAIVK,
SDSVLLGMEGSVK, GBE1,EMENKETLK, IPO7,LAMAIPDK,
ZMAT1,
IL18RAP,
MYBBP1A,
PTPRZ1,
GBP1,
PABPN1,
RPUSD3,DAEWTTVFK,ARVCF,
ACO1,MSAPSLR,LRIG3,
SPGSAQSLGK,
PITRM1,
MAST4,
C14orf166,
EFCAB7,
LOC646214,
ACAT2,
S100G,
PRPF40A,
HDAC9,
ZNF687,
TRIM66,
FBLN7,
FISHKK, PTPRB,
NBPF8,
ARFGEF2,
MYLIP,
SSIVPEVK,
YVFDIK,
TLKINDEK,
IGEEARR,
ISFLLGASR,
FUT8,
DLKAENLLLDANLNIK,
VLDEEGSER, KLRAP1,
FBXO18,NGHTDCVR,
EQILTNKTLK,RMAESLR,
RBBP8, TMEM198, VGSAAQTR,
IFEIFGPVAHPFYVLR,
KCNEVK,FBXO22,GWALYCAGK,
RRP1B,
SPR,LMDAAELVK,
DWKEVLR,
KLEEMR,
QGLCVLRR,
PPP2R3C,
WNT8A,
PRPF31,
MAP3K1,
PDZD2,AEMNILQINEK,
ATF2,
DDX23,
SPMDKVLR,
SPEELKR,
NSKLMEPNLIK,
SLDWNSLLRQK,
QLDAGLAR,
ILLGDDSQK,
VAVLSQNR,
GBP2, MLEEIQK,
BRP44L, MAP1A,
TLR7,
VAIDRSR,
TMESESLRTLEFR,
LKQHFNNALPK,
VSTMLDRLHSTR,
ZHX3,
ANKRD12,
GKNLSKPK,
LESGDRIR,
PARPBP,
C14orf38,
RVEGIQYEDISK,
VPSLREAMEK,
LAALQNSVGR,
RTLSLPR,
TRIO,
BEGAIN,
LAMA4,MLVAENGKK,
PYLGSEDVVKELK,
SVLVDFLIGSGLK, CSE1L,
YNEDLELEDAIHTAILTLK,
LQDAGIAR,
ARHGEF1,
RPL15,NLEQENQNLR,POF1B,TSLALDESLFR,BBS9,
C1orf141,
HILGFDTGDAVLNEAAQILR, AGQAVSSGGILR, YJEFN3,
MRCLTTPMLLR,
YPEEQLK, PCM1,
LYST,
PIK3R6,
KPNB1,
EAGEAPKPGEEVK,
GLAAALLLCQNK,
VLGTEAVQDPTKVEAHVR,
VAPPERR,
SRSF10,
YBX1, RALYL,RPS6,GCIVDANLSVLNLVIVK, NYQQNYQNSESGEKNEGSESAPEGQAQQR, SART1,
PRPF3,
HIVEP3,
ETSLLLR,RAPH1,
ZNF14,
IMKNEIQDLQTK, SWAP70,
DQNEAQTAKEFIK, CAPZA1, EHD2,
MGSLKEELLK,
TSFIQYLLEQEVPGSR,
LTVLAVNR,
LPEMEPLVPR,
CEP250,
09-Mar,
ELSEARIR,
TCP10,
QKLGGASQGR,HDAC6, MYRIP,
ATP13A5, ELKEAAR,
MDTLAVALR,
KNVLQGGESTK,
LQATEVR,
MORF4L1,
ANPEP,
AKAP13,
BCL2L10, PRKCE,
TSGLQQKNVEVK,
TRMT112,
YESLLR,
GLDIESTSK,
QGHYESLKER,
LKLLEEER,
GLQTVHINENFAK,
MAPAYLPR,
EGECLTLCK,
LLLVAWDR,
SH3TC2,
ITGLCPFGPR,
NCAPD3,
SCEL,
GSVMPSVIR,
MNKSLNR,
SLC6A10P,MLANTGRK, EGTPDTAPTSR, TRIM56,
MAAPTLGR,
VVDVSVPR,
KIAA0825,
KVHGDVVK,
LALTDAIK,AQIEDTLR, ASXL3,
ITPR3, PCDH17,
VLQVPMLAHGGCCREDAVVASR, CEP97,
MQVAMAR,
SERPINA12,
CCDC124,
DAGFPFSQDINSHLASLSMAR,
LNASDLRLPSR,
RYR3,
KLDQQYK,
SLC52A1,
FPSNIIVTNGAAR,
SSAWKTPR,
LEKAEAR,
VLVDMSR,
PRPS1L1, SRSPISAK,
ATRDPFAPSR,
MKTTSTK,
EEPRVPPLK,
ZNF284,
SNW1,
ABCF1,
MTR,
PLXNA3,
DPYGFGDSR,FLVLDEADGLLSQGYSDFINR,
HVINFDLPSDIEEYVHR,
SLSYSPVER,
SSPVEFECINEK,
SEIDLLNIRR, YYDSRPGGYGYGYGR,
C9orf174,LOC100508206,
CAMK2B,
KIAA1529,
SNRPN,
SRSF8,
VGDVYIPR,
CEL,MYT1L,
SNRPB,SNRPD2P1,
SGQSSYGQHSSGSSQSSGYGQHGSR, EWTLEAGALVLADR,
DNNELLLFILK,
LASYVEK,
SLSGCPRAK,
MYT1,
LGLLGDSVDIFKGIPFAAPTK,
FSNSSSSNEFSK,
VADPDHDHTGFLTEYVATR, FSGVPDRFSGSGAGTDFTLK,
LTAIDILTTCAADIQR,
DILCGAADEVLAVLK,
GFGFVYFQNHDAADK,
HNRNPA0,
TRIP10,
MKRN1,
QLEMEK,
MXRA5,
CCT2,
TVALWDLR,
REEPQR,
EEA1,
SGGGGGGGGSSWGGR, EDIYSGGGGGGSR,
IIERDTSSVAVNLPVPSGVAFPAVFLR, EYSSELNAPSQESDSHPR,
HSSWSEGEEHGYSSGHSR,
LFIGGLNVQTSESGLR,
TEDPKLR,
MLSAYLR,MLSSFLR,
LEKEAAR,
NNTQVLINCR,
SNRPD2, SRSF2,
RBM15,
LTQYIDGQGRPR,
DPP7,
SVAAWLR,
LNYTLSQGHR,
CAMK2G,
DLKPENLLLASK,
KLGELQEK,GOLGA2,
WASGILR,
VLGLVLLR,
LOC100507381,
ARMCX6,SSKVNDVEAIR,
HERC1,FLQQTGGR,
SUDS3,SCYL1,
CCDC112,
ADCK5,
ANSTITQLTANITK, LAMB4,GGLSGPQGGR,ARL6IP4,QLGLPTALKR,MRTO4,KNSASAER,ALSLWPR,
NQDATVYVGGLDEK,
DOCK10,
CST8,
FAM133A,
RYLTQGITQGK,
ABCF3,
LOC729175,
TNK2,
INAEPSESDTAR, IPDVAALSMGFSVK, FANCM,
SHEGETAYIR,
TCLQASR,
SFVEFILEPLYK,
EAMYNQLGLTGCAGVASNK,
VVQGDIGEANEDVTQIVEILHSGPSK, FYVPPTQEDGVDPVEAFAQNVLSK, MSVQPTVSLGGFEITPPVVLR,
SSRP1,
GLAEDIENEVVQITWNR,
ESRP1,
EAGDVCYADVYRDGTGVVEFVR, DSNFAGDLVR,KSEYTQPTPIQCQGVPVALSGR, QVTITGSAASISLAQYLINVR, INISEGNCPER, LLYDTFSAFGVILQTPK, LVEGILHAPDAGWGNLVYVVNYPK, ELQCLTPR,
Figure 3.5: Peptide-protein graph constructed from an AP-MS experiment with RPAP3 as a bait. Proteinsare drawn as red triangular vertices while peptides are drawn as blue round vertices.
Figure 3.6: A connected component of the bipartite graph from the CDK9 dataset, where all peptides in the component are shared peptides that belong to histones. Our algorithm for MINIMUMPROTEINTYPES chooses only 3 proteins (HIST1H2AH, HIST3H2A and H2AFX) to be in the solution. Blue round vertices indicate peptides while proteins are drawn as red triangular vertices.
ular selection of proteins may not be very accurate due to the relatively small number of
peptides. However, the existing approach of simply ignoring shared peptides within MS
data would lead us to miss this instance entirely. On the other hand, using the naive approach
of assigning each protein the average value of its constituent peptide abundances
would lead us to select all the histones in this component, resulting in poor quantitative
data.
Such an effect is more prominent in the example shown in Figure 3.7. Here, the peptide
abundances are taken from an AP-MS run with RPAP3 as the bait protein. While the
peptides are shared by four distinct proteins, our algorithms for MINIMUMPROTEINTYPES
and MINIMUMERRORSUM both picked HNRNPH1 and HNRNPH2 (heterogeneous ribonucleoprotein
particles) with positive abundances. This is indeed a reasonable identification,
as the remaining candidate proteins are supported by only 1 or 2 constituent peptides.
Figure 3.7: A connected component of the bipartite graph from the RPAP3 dataset. Our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM both select HNRNPH1 and HNRNPH2 in the solution while discarding the less supported proteins.
3.5 Discussions
In this chapter, we studied the protein quantification problem by first modelling the
mass spectrometry data using mathematical programs. While the four problems formulated
are all expressed as integer programs, we studied their hardness from a combinatorial
standpoint by reductions from SETCOVER and EXACTCOVER. We then devised
approximation algorithms for these problems, which also showed promising
empirical performance when tested on synthetic data as well as biological data.
There remain various extensions to the models discussed here. For example, all
four problems (MULTICOVER, MINIMUMPROTEINTYPES, MINIMUMUNIFORMERROR, MINIMUMERRORSUM)
model the mass spectrometry data under the assumption that each
peptide may occur in a protein at most once. Lifting this assumption can be achieved
easily by modifying the adjacency matrix $A$ so that $A_{i,j}$ is the number of times peptide
$s_i$ appears in protein $w_j$, and the resulting problems can be attacked using the same
approaches.
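As a minimal sketch of this modification, the following builds a peptide-by-protein matrix whose entries are occurrence counts rather than 0/1 indicators. The sequences below are made up for illustration, and `count_occurrences` and `multiplicity_matrix` are hypothetical helper names, not part of the implementation evaluated in this chapter.

```python
def count_occurrences(peptide: str, protein: str) -> int:
    """Count (possibly overlapping) occurrences of peptide in protein."""
    count, start = 0, protein.find(peptide)
    while start != -1:
        count += 1
        start = protein.find(peptide, start + 1)
    return count


def multiplicity_matrix(peptides, proteins):
    """A[i][j] = number of times peptide i appears in protein j."""
    return [[count_occurrences(s, w) for w in proteins] for s in peptides]


# Toy sequences (hypothetical): the first peptide occurs twice in the first protein.
peptides = ["GFGFVK", "LLEGEECR"]
proteins = ["MAGFGFVKTTGFGFVKR", "MKLLEGEECRGFGFVK"]
A = multiplicity_matrix(peptides, proteins)
```

With a 0/1 adjacency matrix the first entry would be 1; the count matrix records that the first protein contributes two copies of that peptide.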
Moreover, one may wish to incorporate into our models the “detectability” $c(i)$ of
each peptide (footnote 7), indicating the difficulty of correctly detecting a particular peptide. Recent
studies have shown that peptide detectability can be estimated reliably [2, 102].
For example, Tang et al. proposed a machine learning approach to estimate the
detectability of a peptide using its sequence length and its neighbouring regions in the parent
protein sequence [133]. To incorporate this into our model of MINIMUMERRORSUM, we
can minimize the objective function $\sum_i c(i) \cdot \varepsilon_i$. The algorithm for the modified objective
function need not change, as our randomized rounding scheme simply rounds the
fractional solution from the LP relaxation.
In the case of MULTICOVER and MINIMUMPROTEINTYPES, we can integrate this notion
of peptide detectability into each protein $w_j$ by setting $p(j) = \sum_{(i,j) \in E} c(i)$. Then, given this
penalty function for each protein, we can ask to minimize the total cost of the chosen proteins:
$\sum_j p(j) x_j$ for MULTICOVER, and $\sum_j p(j) \chi(x_j)$ for MINIMUMPROTEINTYPES. The resulting
problems can be solved under the same framework, using algorithms for the
corresponding weighted set cover problems.
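To make the penalty $p(j)$ concrete, the sketch below computes it from a list of peptide-protein edges and then runs the textbook greedy heuristic for weighted set cover (repeatedly pick the set with the smallest cost per newly covered element). This is an illustration only: the greedy rule is the standard one from the literature and not necessarily the algorithm used in this chapter, it covers each peptide once rather than solving the multicover variant, and the edges and detectability values are hypothetical.

```python
def protein_penalties(edges, detectability):
    """p(j) = sum of c(i) over all peptides i adjacent to protein j."""
    penalty = {}
    for peptide, protein in edges:
        penalty[protein] = penalty.get(protein, 0.0) + detectability[peptide]
    return penalty


def greedy_weighted_cover(edges, cost):
    """Textbook greedy for weighted set cover: repeatedly pick the protein
    with the smallest cost per newly covered peptide."""
    covers = {}
    for peptide, protein in edges:
        covers.setdefault(protein, set()).add(peptide)
    uncovered = {peptide for peptide, _ in edges}
    chosen = []
    while uncovered:
        best = min(
            (w for w in covers if covers[w] & uncovered),
            key=lambda w: cost[w] / len(covers[w] & uncovered),
        )
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical peptide-protein edges (i, j) and detectability values c(i).
edges = [("s1", "w1"), ("s2", "w1"), ("s2", "w2"), ("s3", "w2"), ("s3", "w3")]
c = {"s1": 1.0, "s2": 2.0, "s3": 1.0}
p = protein_penalties(edges, c)
solution = greedy_weighted_cover(edges, p)
```

In this toy instance the cheap protein `w3` is picked first, after which `w1` covers the two remaining peptides.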
Another factor that can be incorporated into our model is the set of peptides that
were not observed in the MS data. For every unobserved peptide $s_k$, we can introduce
its absence into our model by adding $s_k$ to the bipartite graph with the abundance value
$\sigma_k = 0$. As these are additional constraints on the integer programs, they would help us
obtain more accurate solutions, albeit with slower running times. However, as discussed
in Section 3.4.1, the coverage of a protein by the identified peptides is typically low due
to the low sensitivity of mass spectrometers. As a result, we cannot simply interpret undetected
peptides as having zero abundance, and these additional constraints
should be avoided until the sensitivity of MS data improves.
7 The detectability $c(i)$ of peptide $s_i$ can, for instance, be set higher when the correct abundance of $s_i$ is less likely to be observed.
3.6 Bibliographic Notes
The set cover problem and its generalization into the multicover problem have been
studied extensively in the approximation algorithms community. Lund and Yannakakis [101]
showed that SETCOVER cannot be approximated within a factor of $o(\log n)$ unless P = NP.
Moreover, Feige [48] showed that, unless NP ⊂ DTIME($n^{O(\log \log n)}$), there is no
$(1 - o(1)) \ln n$ approximation. Thus the approximability of the set cover problem is essentially
resolved if P ≠ NP. When the input set system admits a special underlying
structure, the approximation ratio achievable for general set cover (and
multicover) can be improved upon: for example, Brönnimann and Goodrich [21] showed an
improved O(1) approximation ratio when the elements are points in the plane and the
sets correspond to disks in the plane. Further, there are various algorithms specialized
for different restricted cases of set systems [25, 47] that may provide better approximation
ratios. Unfortunately, however, the set systems constructed from our PPI dataset do
not admit these underlying structures.
The results discussed in this chapter are to be submitted [89].
Chapter 4
Direct PPI Networks from AP-MS Data
With the help of high-throughput experimental techniques, a large amount of PPI data
has recently become available, providing us with a rough picture of how proteins inter-
act in biological systems. However, interaction data from high-throughput experiments
often suffers from relatively high error rates and protocol-specific biases. Therefore, in-
ferring the physical PPI network from high-throughput data remains a challenge in sys-
tems biology.
As discussed in Chapter 1, affinity purification followed by mass spectrometry iden-
tification (AP-MS) is an increasingly popular approach to observe protein-protein inter-
actions (PPI) in vivo. One drawback of AP-MS, however, is that it is prone to detecting
indirect interactions mixed with direct physical interactions. Therefore, the ability to
distinguish direct interactions from indirect ones is of much interest. In this chapter,
we propose a simple probabilistic model for the interactions captured by AP-MS exper-
iments, under which the problem of separating direct interactions from indirect ones is
formulated. Then, given idealized quantitative AP-MS data, we propose an approach to
identify the most likely set of direct interactions that produced the observed data.
This chapter is organized as follows. In Section 4.1, we give a brief overview of related
research around the AP-MS technology. Then, Section 4.2 describes the mathematical
modelling of an AP-MS experiment, and formulates an optimization problem. In Sec-
tion 4.3, we describe the overall algorithm, which is based on a collection of graph theo-
retic approaches that succeed at inferring a large fraction of the network nearly exactly,
followed by a genetic algorithm that infers the remainder of the network. Finally, in Sec-
tion 4.4, we test the accuracy of our method using both biological and simulated PPI
networks. Here, we apply our algorithm to the prediction of direct interactions based on
a large set of AP-MS PPI data in yeast [93].
4.1 Background
In an AP-MS experiment, a protein of interest (the bait) is first tagged and expressed
in vivo. The bait is then immuno-precipitated (IP), together with all of its interacting
partners (the preys), and finally, preys are identified using mass spectrometry. Like Y2H
and other high-throughput experimental methods, however, AP-MS suffers from experi-
mental noise. A number of approaches have been proposed to separate true interactions
from false-positives. These approaches mostly focus on reducing false-positives due to
protein misidentification from MS data [63, 107, 150], on detecting contaminants [19],
or a combination of both [26, 33, 35, 111, 122, 123]. These methods often make use of
the guilt-by-association principle, and quantify the confidence level of an interaction
by considering alternative paths between two protein molecules. In this context, an
interaction between bait b and prey p is considered a true positive if, at some point in
the set of cells considered, there exists a complex that contains both b and p . One con-
cern with this line of reasoning is that, as the sensitivity of the AP-MS methods improves
(and thus the stability of the complexes that can be detected decreases), the transience
of detectable interactions will increase to a point where, eventually, every protein may
be shown to marginally interact with every other protein.
A key property of AP-MS approaches is that a significant number of the co-purified
prey proteins are in fact indirect interaction partners of the bait protein, in the sense that
they do not interact physically and directly with the bait, but interact with it through a
chain of physical interactions involving other proteins in the complex. Therefore, it is
critical, when interpreting AP-MS-derived PPI networks, to correctly interpret the meaning
of the term “interaction”. Although AP-MS was not designed to identify direct physical interactions, the
increasing amount of AP-MS data available justifies the need for separating direct physical
interactions from indirect ones. This is the problem we consider in this chapter:
“Given the results of a set of AP-MS experiments, filtered for protein misiden-
tifications and contaminants, how can we distinguish direct (physical) from
indirect interactions?”
Note that since the false-positive filtering methods listed above consider indirect in-
teractions as true-positives, they cannot be used to address this problem. Gordân et
al. [65] study the related problem of distinguishing direct vs. indirect interactions be-
tween transcription factors (TF) and DNA. While the objective of their study is similar
to ours, their method makes use of information specific to TF-DNA interactions (e.g. TF
binding data, motifs from protein binding microarrays), and thus is not immediately ap-
plicable to the problem on general PPI networks. In fact, to our knowledge, no existing
approach seems directly applicable.
4.2 Mathematical Modelling and Problem Formulation
Throughout this chapter, we make the assumption that appropriate methods have been
used to reduce as much as possible protein misidentifications and contaminants, in
such a way that all interactions detected are either direct or indirect interactions. Our
task is to separate the former from the latter. To avoid confusion, we therefore note
that false positives (resp. false negatives) henceforth refer to falsely detected (resp. undetected)
direct interactions inferred by our algorithm.
4.2.1 A Probabilistic Model for AP-MS Data
We first describe a simple model of the AP-MS PPI data that shall be used throughout
this chapter. Though admittedly rather simplistic, our model has the benefit of allowing
the formulation of a well-defined computational problem.
Let $G_{direct} = (V, E_{direct})$ be the graph of direct (physical) interactions over the set $V$
of proteins considered, and let $N(b) = \{p \in V : (b, p) \in E_{direct}\}$ be the set of direct interaction
partners of protein $b$. We model the physical process through which PPIs are
identified in an AP-MS experiment as follows. If a bait protein $b$ is in contact with a direct
interaction partner $p \in N(b)$, the affinity purification (AP) process (see Section 1.1.2
in Chapter 1) on $b$ will pull down $p$, which will then be identified through mass spectrometry.
In addition, if $p$ interacts with $p' \in N(p)$ at the same time as it interacts with
$b$, protein $p'$ may also be pulled down, even though the two proteins $b$ and $p'$ only
interact indirectly. In general, any protein $x$ that is connected to $b$ by a series of simultaneous
direct interactions may be pulled down by $b$. As a result, all interaction partners
of $b$ (direct or indirect) will be identified together. Figure 4.1 depicts an example of this
effect.
In order to distinguish direct physical interactions from indirect ones, the availabil-
ity of quantitative AP-MS data is helpful. Although quantitative AP-MS remains in its in-
fancy, prey abundance can be estimated fairly accurately using various approaches, e.g.
the peptide count [55], spectral count [100], sequence coverage [51], protein abundance
index [73], and our own approach of quantifying absolute protein abundance discussed
in Chapter 3. Combined with the increasing accuracy and sensitivity of mass spectrom-
eters, these methods are becoming more reliable. Throughout the discussions in this
chapter, we thus assume that this quantitative data is available to us.
Figure 4.1: A schematic for indirect interactions in AP-MS data. (a) In a cell, multiple copies of a bait protein $b$ are expressed, and interact (directly or indirectly) with other prey proteins $p_1, \ldots, p_8$. (b) After the pull-down on the bait $b$, MS detects all prey proteins, including indirect interaction partners $p_4$ and $p_8$. (c) The direct interaction network should, however, contain only edges between direct interaction partners.
To model the quantitative AP-MS data, we first consider the strength of a direct interaction,
which can be measured by the energy required to break it. To give an example,
suppose the PPI network consists of only two proteins, $b$ and $p$. Then, let $A(b, p)$ denote
the observed abundance of a prey protein $p$ obtained by the bait $b$, and let $\chi(b, p)$ denote
the true abundance of interactions between $b$ and $p$. Since protein interactions may be
disrupted by the AP process, we expect that $A(b, p)$ is correlated with the strength of
the interaction between $b$ and $p$, as well as the true abundance $\chi(b, p)$. We model the
strength of an interaction using the probability $p(b, p)$ that the interaction between $b$
and $p$ survives the purification process, and assume that the interaction breaks with
probability $1 - p(b, p)$. Then, the observed abundance of protein $p$ obtained from the
pull-down on $b$ would be
$A(b, p) \propto \chi(b, p) \cdot p(b, p).$
As another example, consider now a system of three proteins $b$, $p_1$, $p_2$, where $(b, p_1)$
and $(p_1, p_2)$ form direct interactions, but $b$ and $p_2$ do not interact directly. As before,
let $\chi(b, p_1, p_2)$ denote the true abundance of protein interactions that occur simultaneously,
i.e., the number of times both $(b, p_1)$ and $(p_1, p_2)$ occur together. Then,
$A(b, p_2) \propto \chi(b, p_1, p_2) \cdot p(b, p_1) \cdot p(p_1, p_2)$.
In general, the observed amount of protein $p$ obtained
upon pull-down of $b$ will be proportional to the probability that $b$ and $p$ remain connected
after each edge $(u, v) \in E(G_{direct})$ is broken with probability $1 - p(u, v)$. Our goal
is then to infer $G_{direct}$ from the set of observed abundances $A(u, v)$. In this chapter, we
make the following simplifying assumptions:
1. All direct interactions $(u, v) \in E_{direct}$ survive with the same probability $p$, and
fail independently with probability $1 - p$.
2. All possible direct interactions take place at the same time, irrespective of the pres-
ence of other interactions, and with the same frequency.
Albeit rather strong, these assumptions provide a useful starting point for separating
direct interactions from indirect ones (see Section 4.5 for possible relaxation of these
assumptions). Despite its simplicity, our mathematical modelling of AP-MS does fit ex-
isting biological data reasonably well (see Section 4.4.5). We note that Asthana et al. [5]
have proposed an approach similar to our probabilistic graph model to identify novel
members of known protein complexes. However, their goal is not to identify indirect
interactions, so their approach is not applicable to our problem.
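The model of this section can be simulated directly. The sketch below is an illustration under the two simplifying assumptions above: every direct interaction survives independently with the same probability $p$, and a prey is counted whenever it is still connected to the bait. The function names, the toy chain network and the number of simulated copies are all hypothetical choices made for this example.

```python
import random


def connected_to(bait, surviving_edges):
    """Return the set of vertices reachable from bait over surviving edges."""
    adjacency = {}
    for u, v in surviving_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, stack = {bait}, [bait]
    while stack:
        u = stack.pop()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def simulate_pulldown(edges, bait, p, copies, rng):
    """Simulate `copies` independent pull-downs on `bait`; each direct
    interaction survives with probability p. Returns prey -> count."""
    abundance = {}
    for _ in range(copies):
        surviving = [e for e in edges if rng.random() < p]
        for prey in connected_to(bait, surviving) - {bait}:
            abundance[prey] = abundance.get(prey, 0) + 1
    return abundance


# Toy chain b - p1 - p2 (hypothetical): p2 is only an indirect partner of b.
rng = random.Random(0)
counts = simulate_pulldown([("b", "p1"), ("p1", "p2")], "b", 0.5, 10000, rng)
```

On this chain, the direct partner `p1` should be recovered in roughly half of the simulated pull-downs and the indirect partner `p2` in roughly a quarter, in line with the product form $p(b, p_1) \cdot p(p_1, p_2)$ derived above.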
4.2.2 Problem Formulation
We are now ready to formulate the algorithmic problem addressed in this chapter. We
henceforth consider the (unknown) direct interaction network $G_{direct}$ as a probabilistic
graph, where each edge in $G_{direct}$ survives the AP-MS process with probability $p$, and
fails otherwise. Let $G$ denote a random graph obtained from $G_{direct}$ by removing edges
in $E_{direct}$ independently with probability $1 - p$. Then, define $P_{G_{direct}}(u, v)$ to be the probability
that vertices $u$ and $v$ remain connected (directly or indirectly) in $G$:
$P_{G_{direct}}(u, v) = \Pr[\text{there exists at least one path from } u \text{ to } v \text{ in } G].$
Connectivity matrix $P_{G_{direct}}$ from Figure 4.2(b) (upper triangle shown; vertices a to f):
      a     b     c     d      e      f
a     1    5/8   5/8   5/16   5/32   5/32
b           1    5/8   5/16   5/32   5/32
c                 1    1/2    1/4    1/4
d                       1     1/2    1/2
e                              1     1/4
f                                     1
Figure 4.2: An example of (a) a direct interaction network $G_{direct}$ and (b) its connectivity matrix $P_{G_{direct}}$ calculated with $p = 1/2$. Assuming each edge of $G_{direct}$ survives with probability $p$, the probability of connectivity between each pair of proteins can be estimated via sampling of the probabilistic network.
We call $P_{G_{\mathrm{direct}}}$ the connectivity matrix of $G_{\mathrm{direct}}$. See Figure 4.2 for an example of a direct
interaction network and its connectivity matrix. In Figure 4.2(b), the entry $P_{G_{\mathrm{direct}}}(b, d)$
is $5/16$ because, out of all 64 subgraphs of $G_{\mathrm{direct}}$ (each equally likely when $p = 1/2$), only 20 keep
the vertices $b$ and $d$ in the same component. In general, $P_{G_{\mathrm{direct}}}(u, v)$ can be estimated
from $G_{\mathrm{direct}}$ by straightforward Monte Carlo sampling. However, its exact computation
(known as the two-terminal network reliability problem [34, 138]) is #P-complete [138].
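The 20-out-of-64 count can be checked by brute force on a graph this small. The sketch below enumerates every edge subset and accumulates exact probabilities. The edge list `EDGES` is an assumption: it is not printed in the text and was inferred here from the matrix in Figure 4.2(b) (a triangle on a, b, c, the cut edge (c, d), and two pendant edges at d). The function name is illustrative only.

```python
from fractions import Fraction

# Assumed edge set, inferred from the connectivity matrix of Figure 4.2(b).
VERTICES = ["a", "b", "c", "d", "e", "f"]
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def connectivity_matrix_exact(vertices, edges, p):
    """Exact P_G(u, v) by enumerating all 2^|E| edge subsets (tiny graphs only)."""
    index = {v: i for i, v in enumerate(vertices)}
    n, m = len(vertices), len(edges)
    P = [[Fraction(0)] * n for _ in range(n)]
    for mask in range(1 << m):
        kept = [edges[i] for i in range(m) if (mask >> i) & 1]
        # Probability of observing exactly this set of surviving edges.
        weight = p ** len(kept) * (1 - p) ** (m - len(kept))
        parent = list(range(n))
        for a, b in kept:
            parent[find(parent, index[a])] = find(parent, index[b])
        roots = [find(parent, i) for i in range(n)]
        for i in range(n):
            for j in range(n):
                if roots[i] == roots[j]:
                    P[i][j] += weight
    return P
```

The running time is exponential in the number of edges, which is consistent with the hardness result cited above; for realistic networks only sampling is practical.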
A set of AP-MS experiments where all proteins have been tagged and used as baits
yields an approximation of $A(x, y)$ for all pairs of proteins $(x, y)$, which can be
transformed into an estimate $M(x, y)$ of $P_{G_{\mathrm{direct}}}(x, y)$ through an appropriate normalization.
We are thus interested in inferring $G_{\mathrm{direct}}$ from $M$:
EXACT DIRECT INTERACTION GRAPH FROM CONNECTIVITY MATRIX (E-DIGCOM)
Given: An $n \times n$ connectivity matrix $M$.
Find: A graph $G = (V, E)$ such that $P_G(u, v) = M(u, v)$ for each $u, v \in V$.
82 Chapter 4. Direct PPI Networks from AP-MS Data
In a more realistic setting, the connectivity matrix $P_G$ would not be observed precisely,
and the E-DIGCOM problem may not admit a solution. We are thus interested in
an approximate, optimization version of the problem:
APPROXIMATE DIRECT INTERACTION GRAPH FROM CONNECTIVITY MATRIX (A-DIGCOM)
Given: An $n \times n$ connectivity matrix $M$ and a tolerance level $0 \le \delta \le 1$.
Find: A graph $G = (V, E)$ that maximizes the number of pairs $(u, v) \in V \times V$ with
$|P_G(u, v) - M(u, v)| \le \delta$.
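As a minimal sketch of the A-DIGCOM objective, the function below counts the ordered pairs whose connectivity lies within the tolerance, given an already computed (or estimated) matrix `P` for a candidate graph. The function name and the list-of-lists matrix representation are assumptions made for illustration.

```python
def digcom_objective(P, M, delta):
    """Number of ordered pairs (u, v) in V x V with |P[u][v] - M[u][v]| <= delta."""
    n = len(M)
    return sum(
        1
        for u in range(n)
        for v in range(n)
        if abs(P[u][v] - M[u][v]) <= delta
    )
```

Diagonal pairs are counted as well, following the literal statement over $V \times V$; since they always match, this only shifts the objective by a constant.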
Note that although the complexity of the DIGCOM problems is currently unknown,
the reverse problem of finding the connectivity matrix $M$ given an input graph $G$ (the
network reliability problem) is well studied. Furthermore, related network design problems
have been studied extensively in the computer networking community. For example,
the reliable network design problem is to choose a minimal set of edges over a set of
nodes so that the resulting network has at least the prescribed all-pairs terminal
reliability; various algorithms including branch-and-bound heuristics [76] and genetic
algorithms [38, 39] have been proposed.
4.3 Algorithm for A-DIGCOM
Our algorithm for the A-DIGCOM problem has three main phases, outlined below.
PHASE I. We start by identifying, based on the connectivity matrix $M$, vertices from $G_{\mathrm{direct}}$
with low degree, together with edges incident to them. As most PPI networks
exhibit the properties of scale-free networks [10], this resolves the edges incident to
a significant portion of the vertices (∼ 75% in our networks).
PHASE II. At the other end of the spectrum, $G_{\mathrm{direct}}$ contains dense clusters that often
correspond to protein complexes. We use a quasi-clique finding heuristic to identify
such dense clusters from the connectivity matrix $M$.
4.3. Algorithm for A-DIGCOM 83
Figure 4.3: The outcome of our direct PPI detection algorithm after each phase. (a) The original solution space; (b) after detecting weakly connected regions; (c) after detecting dense clusters; (d) the genetic algorithm detects the remaining interconnecting regions.
PHASE III. To infer the remainder of the network, we use a novel genetic algorithm (see Sec-
tion 2.4). This highly customized genetic algorithm makes use of the findings from
the previous two steps in order to dramatically reduce the dimension of the prob-
lem space, and to guide the mating process between parent candidates to create
good offspring solutions.
Figure 4.3 gives an example of the output after each phase of our algorithm. In what
follows, we describe each phase of the algorithm in detail.
4.3.1 Identification of Weakly Connected Regions
I-a. Finding cut edges
A cut edge in a graph $G$ is an edge $(u, v)$ whose removal would result in $u$ and $v$ belonging
to two distinct connected components (e.g. edge $(c, d)$ in Figure 4.2(a)). The following
theorem allows the identification of all cut edges based on the connectivity matrix $P_G$.
Theorem 4.1. A pair of vertices $u$ and $v$ from $V$ forms a cut edge in $G$ if and only if the
following two conditions hold:
(i) $P_G(u, v) = p$
(ii) $V$ can be partitioned into $V = V_u \cup V_v$, where $V_u = \{x \in V : P_G(x, u) \ge P_G(x, v)\}$
and $V_v = \{x \in V : P_G(x, u) < P_G(x, v)\}$, such that $\forall s \in V_u$ and $\forall t \in V_v$,
$P_G(s, t) = P_G(s, u) \cdot p \cdot P_G(v, t)$.
Proof. Necessity is trivial. For sufficiency, suppose that conditions (i) and (ii) hold, and
$(u, v)$ is not a cut edge. Then, to keep the graph connected, there must be an edge $(s, t) \ne (u, v)$
joining $V_u$ and $V_v$. Since $(s, t)$ is an edge, $P_G(s, t) \ge p$. However, by assumption, we
have $P_G(s, t) = P_G(s, u) \cdot p \cdot P_G(v, t) < p$, which is a contradiction.
The above theorem immediately provides an efficient algorithm to test whether a
pair of vertices forms a cut edge in time $O(|V|^2)$: recursively find edges satisfying the
conditions in Theorem 4.1. Observe that removing a cut edge $(u, v)$ allows us to decompose
the graph into two connected components (the subgraphs induced by $V_u$ and $V_v$,
respectively), and the probability of connectivity between every pair of vertices in $V_u$ ($V_v$, resp.)
remains the same after removing $(u, v)$. Therefore, the submatrices that correspond to
$V_u$ and $V_v$ can be treated as independent subproblems, and one can recursively detect
cut edges in the remaining subproblems.
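The test of Theorem 4.1 can be sketched directly on a connectivity matrix. The code below is an illustrative, non-recursive version that simply checks every pair against conditions (i) and (ii); the function names and the optional tolerance `delta` are assumptions, and the example matrix is the one from Figure 4.2(b) with vertices a to f mapped to indices 0 to 5.

```python
from fractions import Fraction as F


def symmetric_from_upper(rows):
    """Build a full symmetric matrix from its upper-triangular rows."""
    n = len(rows)
    M = [[F(0)] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            M[i][i + offset] = M[i + offset][i] = F(value)
    return M


# Connectivity matrix of Figure 4.2(b), computed with p = 1/2.
FIG_4_2 = symmetric_from_upper([
    [1, "5/8", "5/8", "5/16", "5/32", "5/32"],
    [1, "5/8", "5/16", "5/32", "5/32"],
    [1, "1/2", "1/4", "1/4"],
    [1, "1/2", "1/2"],
    [1, "1/4"],
    [1],
])


def is_cut_edge(M, u, v, p, delta=0):
    """Check conditions (i) and (ii) of Theorem 4.1 for the pair (u, v)."""
    n = len(M)
    # Condition (i): u and v are connected with probability p (up to delta).
    if abs(M[u][v] - p) > delta:
        return False
    # Condition (ii): partition V by the endpoint each vertex is closer to.
    side_u = [x for x in range(n) if M[x][u] >= M[x][v]]
    side_v = [x for x in range(n) if M[x][u] < M[x][v]]
    return all(
        abs(M[s][t] - M[s][u] * p * M[v][t]) <= delta
        for s in side_u
        for t in side_v
    )


def cut_edges(M, p, delta=0):
    """All pairs (u, v) with u < v that pass the cut-edge test."""
    n = len(M)
    return [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if is_cut_edge(M, u, v, p, delta)
    ]
```

On the Figure 4.2 matrix, `cut_edges(FIG_4_2, F(1, 2))` reports the pairs (c, d), (d, e) and (d, f), in line with the cut edge (c, d) mentioned in the text.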
Note that a special case of cut edges is an edge whose endpoint is a degree-1 vertex.
Because removing a cut edge does not change the connectivity between any two nodes
on the same side of the cut, this allows us to repeatedly identify and remove degree-1 ver-
tices, i.e., the corresponding rows and columns in the input matrix. Hence, if the entire
graph is assumed to be a tree, this algorithm can reconstruct the graph completely. On
the other hand, PPI networks tend to contain many cut edges due to their sparsity. Con-
sequently, this simple characterization for cut edges allows a significant simplification
of our problem by identifying all cut edges.
Theorem 4.2. If G is a tree, E-DIGCOM can be solved efficiently.
I-b. Finding degree-2 vertices
We now consider the problem of identifying degree-2 vertices from the connectivity ma-
trix M . After degree-1 vertices, which are identified in the previous step, they constitute
the next most frequent vertices in the biological networks we studied. While we do not
have a full characterization of these vertices, the following theorem gives a set of neces-
sary conditions that lead to a heuristic to predict degree-2 vertices.
Theorem 4.3. Let $s$ be a degree-2 vertex in $G$ such that $N(s) = \{u, v\}$. Then, the following
four conditions must hold.
(i) Low connectivity: for each $t \in V$, $P_G(s, t) < 2p - p^2$.
(ii) Symmetry: $P_G(s, u) = P_G(s, v)$.
(iii) Neighbourhood: for each $t \in V - \{s, u, v\}$, $P_G(s, t) < P_G(s, u)$.
(iv) “Triangle” inequality: for each $t \in V - \{s, u, v\}$, $P_G(s, t) < \max\{P_G(u, t), P_G(v, t)\}$.
Before we prove this, consider the following results on graph composition. Let $G_1 = (V_1, E_1)$
and $G_2 = (V_2, E_2)$ be two graphs. Then the following hold.
Fact 4.4. (Series composition) Suppose $V_1 \cap V_2 = \{c\}$, and a new graph $G$ is constructed
by joining $G_1$ and $G_2$ at $c$. Then, for any $s \in V_1 - \{c\}$ and for any $t \in V_2 - \{c\}$,
$P_G(s, t) = P_{G_1}(s, c) \cdot P_{G_2}(c, t)$.
Proof. Since $c$ is a cut vertex, $P_G(s, c) = P_{G_1}(s, c)$, and $P_G(c, t) = P_{G_2}(c, t)$. For the vertices $s$
and $t$ to be connected, all of $s, c, t$ must be in the same connected component. Because
every edge removal is independent, we have $P_G(s, t) = P_{G_1}(s, c) \cdot P_{G_2}(c, t)$.
Fact 4.5. (Parallel composition) Suppose $V_1 \cap V_2 = \{s, t\}$, and a new graph $G$ is constructed
by joining $G_1$ and $G_2$ at $s$ and $t$ (possibly leading to parallel edges between $s$ and $t$). Then,
$P_G(s, t) = P_{G_1}(s, t) + P_{G_2}(s, t) - P_{G_1}(s, t) \cdot P_{G_2}(s, t)$.
Proof. The vertices $s$ and $t$ are disconnected in $G$ exactly when they are disconnected in
both $G_1$ and $G_2$. Thus $1 - P_G(s, t) = (1 - P_{G_1}(s, t)) \cdot (1 - P_{G_2}(s, t)) = 1 - P_{G_1}(s, t) - P_{G_2}(s, t) + P_{G_1}(s, t) \cdot P_{G_2}(s, t)$.
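As a quick worked check of both facts, consider the entries of Figure 4.2(b) with $p = 1/2$. The computation assumes, as the matrix suggests but the text does not state explicitly, that a, b, c form a triangle attached to d through the cut edge (c, d).

```latex
% Edge (a,b) lies in parallel with the two-edge path a - c - b.
\begin{align*}
P_{a\text{-}c\text{-}b}(a, b) &= p \cdot p = \tfrac{1}{4}
  && \text{(Fact 4.4, series at } c\text{)} \\
P_G(a, b) &= p + \tfrac{1}{4} - p \cdot \tfrac{1}{4} = \tfrac{5}{8}
  && \text{(Fact 4.5, parallel)} \\
P_G(b, d) &= P_G(b, c) \cdot P_G(c, d) = \tfrac{5}{8} \cdot \tfrac{1}{2} = \tfrac{5}{16}
  && \text{(Fact 4.4, series at } c\text{)}
\end{align*}
```

Both values agree with the corresponding entries of the connectivity matrix in Figure 4.2(b).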
Now we are ready to prove Theorem 4.3.
Proof. We prove each condition separately.
Condition (i) Low connectivity. Since $s$ has degree 2, it becomes disconnected from
the rest of the graph with probability $(1 - p)^2$. Thus, $P_G(s, t) \le 1 - (1 - p)^2 = 2p - p^2$. The
equality can only hold if $P_G(s, u) = P_G(s, v) = 1$, which is impossible.
Condition (ii) Symmetry. Let $P_{G-\{(s,u)\}}(s, u)$ denote the connectivity of $s$ and $u$ when
edge $(s, u)$ is removed from $E$. Then, we have:
$$\begin{aligned}
P_G(s, u) &= p + P_{G-\{(s,u)\}}(s, u) - p \cdot P_{G-\{(s,u)\}}(s, u) && \text{(by Fact 4.5)} \\
&= p + p \cdot P_{G-\{(s,u),(s,v)\}}(v, u) - p^2 \cdot P_{G-\{(s,u),(s,v)\}}(v, u) && \text{(by Fact 4.4)} \\
&= p + p \cdot P_{G-\{(s,u),(s,v)\}}(u, v) - p^2 \cdot P_{G-\{(s,u),(s,v)\}}(u, v) \\
&= p + P_{G-\{(s,v)\}}(s, v) - p \cdot P_{G-\{(s,v)\}}(s, v) && \text{(by Fact 4.4)} \\
&= P_G(s, v) && \text{(by Fact 4.5)}
\end{aligned}$$
Condition (iii) Neighbourhood. Next, we show that for any $t \in V - \{s, u, v\}$, $P_G(s, t) < P_G(s, u)$.
For a subgraph $H \subseteq G$, let $p(H)$ denote the probability of observing $H$ from a
probabilistic graph $G$. We can write this probability as
$$p(H) = p^{|E(H)|} \cdot (1 - p)^{|E(G)| - |E(H)|}$$
Thus, for any two subgraphs $H_i, H_j \subseteq G$ such that $|E(H_i)| = |E(H_j)|$, $p(H_i) = p(H_j)$. Let
$\mathcal{H}(s, t)$ be the set of subgraphs of $G$ where $s$ and $t$ are connected. We can then write the
probabilities $P_G(s, t)$ and $P_G(s, u)$ as the sum of the probabilities of these subgraphs.
$$P_G(s, t) = \sum_{H_i \in \mathcal{H}(s,t)} p(H_i)$$
$$P_G(s, u) = \sum_{H_j \in \mathcal{H}(s,u)} p(H_j)$$
Observe that these two sets of subgraphs may overlap:
$$\mathcal{H}(s, t) = \mathcal{H}(s, t, u) \cup \mathcal{H}(s, t, \bar{u})$$
$$\mathcal{H}(s, u) = \mathcal{H}(s, t, u) \cup \mathcal{H}(s, \bar{t}, u)$$
where $\mathcal{H}(s, t, u)$ is the set of subgraphs such that $s$, $t$, and $u$ are all connected, while
$\mathcal{H}(s, t, \bar{u})$ is the set of subgraphs where $s$ and $t$ belong to the same connected component,
which does not contain $u$; the set $\mathcal{H}(s, \bar{t}, u)$ is defined analogously, with the roles of $t$ and $u$
exchanged. Thus, we now focus on $\mathcal{H}(s, \bar{t}, u)$ and $\mathcal{H}(s, t, \bar{u})$. The subgraphs in
$\mathcal{H}(s, \bar{t}, u)$ can be partitioned as follows, depending on which component $v$ belongs to:
(i) $\{s, u, v\}, \{t\}$: Vertices $s, u, v$ belong to the same component, and $t$ belongs to another
component.
(ii) $\{s, u\}, \{v\}, \{t\}$: Vertices $s, u$ belong to the same component, and $v$ and $t$ each
belong to a distinct component.
(iii) $\{s, u\}, \{v, t\}$: Vertices $s, u$ belong to the same component, and $v, t$ belong to another
component.
It is easy to see that the cases (i) and (ii) are nonempty, since $(s, u), (s, v) \in E(G)$.
Finally, consider the set of subgraphs in $\mathcal{H}(s, t, \bar{u})$. Let $H$ be a subgraph in this set.
Since $s$ and $u$ belong to distinct components, $(s, u) \notin E(H)$, and thus $(s, v) \in E(H)$ in
order to have $s$ and $t$ connected. We then perform the following operation on $H$ to construct
$H'$: (1) remove $(s, v)$ from $H$, and (2) insert $(s, u)$. Then, $H'$ has $s$ and $u$ in the same
component, and $v$ and $t$ in another component. Therefore, $H'$ belongs to Case (iii) of
$\mathcal{H}(s, \bar{t}, u)$. Furthermore, note that $H$ and $H'$ have the same number of edges. Therefore,
there is a one-to-one mapping from subgraphs in $\mathcal{H}(s, t, \bar{u})$ to subgraphs in Case (iii) of
$\mathcal{H}(s, \bar{t}, u)$ with an equal number of edges. Since the cases (i) and (ii) are nonempty, it follows that
$P_G(s, t) < P_G(s, u)$.
Condition (iv) “Triangle” inequality. Given the probabilistic graph $G$, we partition the
subgraphs of $G$ into four cases depending on the existence of the two edges $(s, u)$ and
$(s, v)$.
• $(s, u), (s, v) \in E$: occurs with probability $p^2$.
• $(s, u) \in E$, $(s, v) \notin E$: occurs with probability $p(1 - p)$.
• $(s, u) \notin E$, $(s, v) \in E$: occurs with probability $p(1 - p)$.
• $(s, u), (s, v) \notin E$: occurs with probability $(1 - p)^2$.
We can thus rewrite the probability $P_G(s, t)$ as follows.
$$P_G(s, t) = p(1 - p) \cdot P_{G-\{s\}}(u, t) + p(1 - p) \cdot P_{G-\{s\}}(v, t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \qquad (4.1)$$
where $P_{G-\{s\}}(u \text{ or } v, t)$ denotes the probability that $u$ or $v$ is connected to $t$ in the graph
$G - \{s\}$. We write $P_G(u, t)$ and $P_G(v, t)$ similarly:
$$P_G(u, t) = (1 - p^2) \cdot P_{G-\{s\}}(u, t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \qquad (4.2)$$
$$P_G(v, t) = (1 - p^2) \cdot P_{G-\{s\}}(v, t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \qquad (4.3)$$
Subtracting (4.1) from (4.2), and (4.1) from (4.3) gives:
$$P_G(u, t) - P_G(s, t) = (1 - p) \cdot P_{G-\{s\}}(u, t) - p(1 - p) \cdot P_{G-\{s\}}(v, t)$$
$$P_G(v, t) - P_G(s, t) = (1 - p) \cdot P_{G-\{s\}}(v, t) - p(1 - p) \cdot P_{G-\{s\}}(u, t)$$
If $P_{G-\{s\}}(u, t) > p \cdot P_{G-\{s\}}(v, t)$, it follows that $P_G(u, t) > P_G(s, t)$. Otherwise,
$P_{G-\{s\}}(u, t) \le p \cdot P_{G-\{s\}}(v, t) < P_{G-\{s\}}(v, t)$, so that $p \cdot P_{G-\{s\}}(u, t) < P_{G-\{s\}}(v, t)$,
and it follows that $P_G(v, t) > P_G(s, t)$.
This completes the proof.
These necessary conditions allow us to rule out vertices that cannot have degree 2.
As a result, Theorem 4.3 provides a heuristic (Algorithm 4.3.1) for
predicting the set of degree-2 vertices as well as the edges incident to them. It is easy to see
that Algorithm 4.3.1 runs in time $O(|V|^2)$, and in practice, our studies have shown that
vertices satisfying these conditions while having degree higher than two are rare (see
Table 4.1).
Unlike the case of degree-1 vertices, we cannot simply remove degree-2 vertices
without affecting the remaining entries in the matrix. Therefore, as shown in
Algorithm 4.3.1, degree-2 vertices and their incident edges are simply marked as such in the
solution, but are not removed from the input matrix.
Algorithm 4.3.1: Prediction of degree-2 vertices
Input: Probability matrix $M$ and vertex set $V$
Output: Degree-2 vertices $V^2$ and their incident edges $E^2$
foreach vertex $s \in V$ do
    if $s$ satisfies conditions in Theorem 4.3 then
        $V^2 \leftarrow V^2 \cup \{s\}$;
        $u, v \leftarrow$ the two vertices with $M_{s,u} = M_{s,v} = \max\{M_{s,t} : \forall t \in V - \{s\}\}$;
        $E^2 \leftarrow E^2 \cup \{(s, u), (s, v)\}$;
    end
end
Return $V^2$ and $E^2$
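A minimal Python sketch of Algorithm 4.3.1 follows. It takes the two best-connected vertices as the candidate neighbours $u, v$ and then checks the four conditions of Theorem 4.3. The function name, the index-based matrix representation, and the way the tolerance `delta` enters conditions (i) and (ii) are assumptions for illustration; conditions (iii) and (iv) are kept strict here.

```python
def predict_degree2(M, p, delta=0):
    """Heuristic prediction of degree-2 vertices and their incident edges."""
    n = len(M)
    V2, E2 = set(), set()
    for s in range(n):
        # Rank the other vertices by their connectivity to s, best first.
        others = sorted(
            (t for t in range(n) if t != s),
            key=lambda t: M[s][t],
            reverse=True,
        )
        if len(others) < 2:
            continue
        u, v, rest = others[0], others[1], others[2:]
        low = all(M[s][t] < 2 * p - p * p + delta for t in others)      # (i)
        symmetric = abs(M[s][u] - M[s][v]) <= delta                       # (ii)
        neighbourhood = all(M[s][t] < M[s][u] for t in rest)              # (iii)
        triangle = all(M[s][t] < max(M[u][t], M[v][t]) for t in rest)     # (iv)
        if low and symmetric and neighbourhood and triangle:
            V2.add(s)
            E2.add(tuple(sorted((s, u))))
            E2.add(tuple(sorted((s, v))))
    return V2, E2
```

Applied to the matrix of Figure 4.2(b) with $p = 1/2$, this sketch marks a and b as degree-2 vertices and predicts the edges (a, b), (a, c) and (b, c); vertex c is rejected by condition (iv) and vertex d by condition (iii).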
4.3.2 Identification of Densely Connected Regions
We now turn to the problem of finding densely connected regions in the network. These
regions may correspond to protein complexes, where tagging any one of the members
of the complex results in the identification of all other members with high probability.
While correctly predicting the physical interactions within each complex is a difficult
task, separating these dense regions from the remainder of the network is key to im-
proving the accuracy of the genetic algorithm (Phase III).
In order to build an algorithm for detecting dense regions, we consider the connec-
tivity between nodes in an n-clique Kn , which we denote as CliqueConn(n). First, let us
recall a classical result in graph enumeration.
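Lemma 4.6. [62, 67] Let $\gamma_n$ denote the number of connected simple graphs over $n$ labeled
vertices. We can count this number by the recurrence:
$$\gamma_n = \begin{cases} 1 & \text{if } n = 1 \\ 2^{\binom{n}{2}} - \dfrac{1}{n} \displaystyle\sum_{i=1}^{n-1} \left( i \binom{n}{i} \cdot 2^{\binom{n-i}{2}} \cdot \gamma_i \right) & \text{otherwise} \end{cases}$$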
Using this combinatorial result, we can construct the formula of CliqueConn($n$) when
$p = 1/2$, which is the value we used for our analyses. The formula for other values of $p$ can
be formulated using a similar proof.
Lemma 4.7. Let $G$ be a complete, probabilistic graph on $n$ vertices with $p = 1/2$. Then, for
any two vertices in $G$, the probability that they are connected is:
$$\mathrm{CliqueConn}(n) = \sum_{i=2}^{n} \frac{\binom{n-2}{i-2}\, \gamma_i}{2^{\binom{n}{2} - \binom{n-i}{2}}}$$
Proof. We write the probability as follows.
$$\mathrm{CliqueConn}(n) = \frac{\sigma_2 + \sigma_3 + \ldots + \sigma_n}{2^{\binom{n}{2}}}$$
where $\sigma_i$ denotes the number of subgraphs of $K_n$ in which the two given
vertices $u$ and $v$ lie in a connected component with exactly $i$ vertices. To compute $\sigma_i$, there are
$\binom{n-2}{i-2}$ ways to pick $i - 2$ vertices (in addition to $u$ and $v$) for the connected component,
$\gamma_i$ ways to keep them all connected, and finally, $2^{\binom{n-i}{2}}$ ways to assign edges among
the remaining vertices. So we have
$$\sigma_i = \binom{n-2}{i-2} \cdot \gamma_i \cdot 2^{\binom{n-i}{2}},$$
and finally, we have
$$\mathrm{CliqueConn}(n) = \sum_{i=2}^{n} \frac{\binom{n-2}{i-2}\, \gamma_i}{2^{\binom{n}{2} - \binom{n-i}{2}}}$$

Algorithm 4.3.2: Detecting dense clusters in the network
Input: Probability matrix $M$, and minimum cluster size $k$
Output: Set of possible $k$-cliques in $G_{\mathrm{direct}}$
$t \leftarrow$ CliqueConn($k$)
$G_t \leftarrow (V, E_t)$, where $E_t = \{(u, v) : u, v \in V \text{ and } M(u, v) \ge t\}$
foreach connected component $S \subseteq V$ do
    Find cliques of size $k$ in $S$
end
Return cliques discovered.
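The recurrence of Lemma 4.6 and the clique connectivity for $p = 1/2$ are easy to evaluate exactly. In the sketch below, the $i - 2$ additional members of the component are chosen from the $n - 2$ vertices other than the two endpoints; the function names `gamma` and `clique_conn` are illustrative, not taken from the text.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def gamma(n):
    """Number of connected simple graphs on n labeled vertices (Lemma 4.6)."""
    if n == 1:
        return 1
    total = sum(
        i * comb(n, i) * 2 ** comb(n - i, 2) * gamma(i)
        for i in range(1, n)
    )
    # The sum is always divisible by n, so integer division is exact.
    return 2 ** comb(n, 2) - total // n


def clique_conn(n):
    """Probability that two given vertices of K_n stay connected when p = 1/2."""
    favourable = sum(
        comb(n - 2, i - 2) * gamma(i) * 2 ** comb(n - i, 2)
        for i in range(2, n + 1)
    )
    return Fraction(favourable, 2 ** comb(n, 2))
```

For small cases the values can be confirmed by hand: `clique_conn(2)` is 1/2 (a single edge), and `clique_conn(3)` is 5/8, which matches the triangle entries in Figure 4.2(b).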
Lemma 4.7 gives rise to a heuristic (see Algorithm 4.3.2) that finds a set of dense
regions covering the network. While the algorithm is guaranteed to identify all cliques of size
$k' \ge k$ contained within $G_{\mathrm{direct}}$, sets of vertices that do not form a $k$-clique may also be
reported, provided that they are sufficiently connected among themselves, possibly via
vertices outside the dense cluster. However, for sufficiently large values of $k$, we found
this to be a very rare occurrence. While finding cliques in a graph is a computationally
intensive task in general, the construction of $G_t$ (in Algorithm 4.3.2) for large values
of $k$ creates only a few small connected components and leaves the remaining vertices
isolated. Therefore, in practice, Algorithm 4.3.2 can be implemented to run in a reasonable
amount of time (see Table 4.5(ii)).
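The thresholding step of Algorithm 4.3.2 can be sketched as follows. The threshold `t` is passed in directly (in the algorithm it is CliqueConn($k$)), and cliques are found by brute-force enumeration within each connected component of $G_t$, which is only reasonable because those components are expected to be small. The function name and the exhaustive clique search are assumptions for illustration, not the implementation used in this work.

```python
from itertools import combinations


def dense_clusters(M, k, t):
    """Return all k-cliques of the threshold graph G_t = {(u, v): M[u][v] >= t}."""
    n = len(M)
    adj = {
        u: {v for v in range(n) if v != u and M[u][v] >= t}
        for u in range(n)
    }
    seen, cliques = set(), []
    for root in range(n):
        if root in seen:
            continue
        # Collect the connected component of G_t containing root.
        component, stack = [], [root]
        seen.add(root)
        while stack:
            x = stack.pop()
            component.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        # Brute-force search for k-cliques inside this component.
        for candidate in combinations(sorted(component), k):
            if all(b in adj[a] for a, b in combinations(candidate, 2)):
                cliques.append(candidate)
    return cliques
```

On the matrix of Figure 4.2(b) with $k = 3$ and $t = 5/8$, the threshold graph keeps only the pairs among a, b and c, so the single reported cluster is {a, b, c}.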
Based on the connectivity matrix $M$, our algorithm identifies (possibly overlapping)
clusters of proteins of size at least $k$ such that, for every pair $u, v$ in each cluster, $M(u, v) \ge t_k$
for some threshold $t_k$. For appropriately chosen values of $k$ and $t_k$, the discovered
clusters correspond to cliques in $G_{\mathrm{direct}}$ with high accuracy (see Section 4.4.6).
The dense regions discovered at this phase provide us with (1) the set of edges within
each dense region; and (2) sparse cuts between disjoint dense regions. The edge set
within each cluster will be used for initial candidates in the genetic algorithm, whereas
the cuts defined by the clusters will be used as crossover points during the crossover
operation in the genetic algorithm.
4.3.3 Cut-based Genetic Algorithm
To predict the remaining section of the network, we use a customized genetic algorithm
that aims at finding an optimal solution to the A-DIGCOM problem. We first devise a
solution to a generalization of the A-DIGCOM problem, and then show how the results
of Phase I and Phase II of the algorithm are used to improve the performance.
Genetic algorithms have been shown to be an effective family of heuristics for a
wide variety of optimization problems [64], including network design under connec-
tivity constraints [38, 39]. A genetic algorithm models a set of candidate solutions as in-
dividuals of a population. From this population, pairs of promising candidate solutions
are mated, and their offspring solutions inherit properties of the parents with some ran-
dom mutations. Over generations, this process of natural selection improves the fitness
of the population.
The A-DIGCOM problem is a hard optimization problem, because (i) the size of the
search space is huge ($2^{\binom{n}{2}}$ for a graph of size $n$), and (ii) there is no known polynomial-time
algorithm to evaluate a proposed candidate solution (i.e. compute $P_G$ from $G$). For
these reasons, a straightforward genetic algorithm implementation failed to produce
satisfactory results (data not shown). Instead, we use a more sophisticated approach
by making use of the results obtained in previous sections in order to reduce the search
space and to guide the mating operations for more effective search.
The genetic algorithm aims to solve a generalization of A-DIGCOM. First, we allow
each edge $(u, v)$ in the network to survive with a non-uniform probability $p(u, v)$,
instead of one constant probability $p$ over all edges. Secondly, we assume that we are
given two sets of edges $E_{YES}$ and $E_{NO}$ that indicate the set of edges that are guaranteed
to be in the solution, and guaranteed not to be in the solution, respectively. This will
later allow us to factor in the outcomes of the previous sections. Therefore, the edges
whose presence remains to be determined are those in neither set: $E_{MAYBE} = \overline{E_{YES} \cup E_{NO}}$.
Encoding of candidate solutions. To represent a candidate solution, we first create a
hash table that maps each putative edge in $E_{MAYBE}$ to an integer. Each candidate is then
encoded as a list of integers (edges). Edges in $E_{YES}$, which are part of all solutions, are
not explicitly listed, in order to save space. Since the networks we consider are sparse
($|E| = O(|V|)$), such an encoding technique significantly reduces the space requirements.
Initial population. The initial population of candidates is generated using a preferential
attachment model [10], based on the following observations: (i) the average connectivity
of vertex $u$, $\mathrm{avgCon}(u) = \frac{1}{|V|} \sum_{v \in V - \{u\}} M(u, v)$, is strongly positively correlated with
the degree of $u$ in $G_{\mathrm{direct}}$; (ii) the age of a vertex, measured by when the vertex was
introduced to the graph, is positively correlated with the degree of the vertex. Therefore,
during the generation of each candidate, we choose the next vertex to be added with
probability proportional to its average connectivity. This results in a candidate solution
where the degree of most vertices is likely to be close to their true degree in $G_{\mathrm{direct}}$.
Furthermore, in order to create candidates that are clustered similarly to the true direct
interaction graph, we include the set of edges predicted by Algorithm 4.3.2 in each initial
candidate.
Fitness function. The fitness of a candidate solution $G$, fitness($G$), is obtained by first
estimating the probability matrix $P_G$ using 500 Monte Carlo samples. To compute these
samples, we break each edge independently with probability $1 - p$ (so that it survives with
probability $p$), and count the number of times each pair of vertices remains connected.
Given 500 samples, we then count the number of vertex pairs $(u, v)$ whose estimated
connectivity $P_G(u, v)$ is within the tolerance level $\delta$, i.e., within $M(u, v) \pm \delta$ (see
Section 4.4.2 on how to choose $\delta$).
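A minimal sketch of this fitness evaluation is given below, assuming a uniform survival probability `p`, vertices labeled 0 to n-1, and a union-find structure to track components in each sample. The function names and signatures are illustrative only.

```python
import random


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def estimate_connectivity(n, edges, p, samples=500, rng=None):
    """Monte Carlo estimate of P_G: fraction of samples connecting each pair."""
    rng = rng or random.Random()
    counts = [[0] * n for _ in range(n)]
    for _ in range(samples):
        parent = list(range(n))
        for a, b in edges:
            if rng.random() < p:  # the edge survives with probability p
                parent[find(parent, a)] = find(parent, b)
        roots = [find(parent, i) for i in range(n)]
        for i in range(n):
            for j in range(n):
                if roots[i] == roots[j]:
                    counts[i][j] += 1
    return [[c / samples for c in row] for row in counts]


def fitness(n, edges, M, p, delta, samples=500, rng=None):
    """Number of pairs (u, v) whose estimated connectivity is within delta of M."""
    P = estimate_connectivity(n, edges, p, samples, rng)
    return sum(
        1
        for u in range(n)
        for v in range(n)
        if abs(P[u][v] - M[u][v]) <= delta
    )
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing candidates during debugging.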
Crossover. The crossover operation needs to hybridize two parent candidates to produce
offspring preserving the good properties of the parents. This operation will be
guided by a randomly chosen balanced cut $V = V_1 \cup V_2$. Let $G_1$ and $G_2$ denote the two
parent networks and let $E_1(G_i)$ and $E_2(G_i)$ denote the edges of $G_i$ such that both
endpoints lie in $V_1$ and $V_2$, respectively. Furthermore, let $E_{1,2}(G_i)$ denote the edges of $G_i$
that cross between $V_1$ and $V_2$. Mating $G_1$ and $G_2$ results in two children $G' = (V, E')$ and
$G'' = (V, E'')$ such that:
$$E' = E_1(G_1) \cup E_2(G_2) \cup (E_{1,2}(G_1) \cap E_{1,2}(G_2))$$
$$E'' = E_2(G_1) \cup E_1(G_2) \cup (E_{1,2}(G_1) \cap E_{1,2}(G_2))$$
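The two child edge sets can be written down almost verbatim with Python sets. In this sketch, edges are represented as tuples of vertex labels and the cut is given by the vertex set `V1` (everything else is taken to be $V_2$); the function name and representation are assumptions for illustration.

```python
def crossover(edges_1, edges_2, V1):
    """Mate two parent edge sets along the cut (V1, V \\ V1); returns (E', E'')."""

    def split(edges):
        inside_1 = {e for e in edges if e[0] in V1 and e[1] in V1}
        inside_2 = {e for e in edges if e[0] not in V1 and e[1] not in V1}
        crossing = set(edges) - inside_1 - inside_2
        return inside_1, inside_2, crossing

    a1, a2, a_cross = split(edges_1)
    b1, b2, b_cross = split(edges_2)
    shared = a_cross & b_cross  # crossing edges present in both parents
    child_1 = a1 | b2 | shared  # E'  = E1(G1) ∪ E2(G2) ∪ shared crossing edges
    child_2 = a2 | b1 | shared  # E'' = E2(G1) ∪ E1(G2) ∪ shared crossing edges
    return child_1, child_2
```

Only crossing edges that both parents agree on are inherited, so a child can end up disconnected.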
While choosing a random cut as the crossover point is a reasonable strategy to construct
a new pair of offspring, our studies have shown that a planned strategy for choosing
the crossover points results in better performance with less chance of premature
convergence. In particular, if the crossover point is chosen as a dense cut in the parent
networks, then the connectivity among vertices within each partition deteriorates
significantly. This results in offspring with much poorer fitness than their parents.
On the other hand, if the parents are hybridized at a sparse cut, the connectivity among
vertices within each partition is disrupted much less. Therefore, crossover operations
are best done by selecting sparse balanced cuts ($|V_1| \approx |V_2|$). Finding sparse balanced cuts
is a well-studied problem in combinatorial optimization, for which various approximation
algorithms exist [96, 140]. However, these algorithms assume that the graph itself,
not the connectivity matrix $M$, is given as input. We therefore use a simple heuristic that
avoids cutting through the dense regions of the network. To generate these sparse cuts,
we contract each dense region identified in Algorithm 4.3.2 to a single vertex, and then
generate a weighted (by the number of vertices in each dense region) balanced partition
of the vertices at random.
Mutation. In order to introduce variability to the population of candidates, a small
number of edges (5–10%) are randomly inserted or deleted. Moreover, observe that the
child network constructed as above may not remain connected. Aside from the random
mutation, therefore, we employ a simple local search that greedily adds edges to keep
the network connected.
Genetic algorithm parameter selection. The various parameters of the genetic algorithm
were selected based on the resulting performance on the Yu et al. data set. The two
main parameters that significantly affect performance are the population size and
the selection criterion. We tested several selection criteria, each defined by
the probability of choosing a candidate as a parent. The best compromise
between running time and accuracy was obtained using a population size of 500 and a
selection probability for a parent proportional to fitness($c_i$) − minFit, where minFit is the
fitness of the worst candidate in the population.
4.3.4 Restricting the Solution Space for GA
While our genetic algorithm offers a plausible method for the A-DIGCOM problem, one
can reduce the size of the solution space, which typically results in faster convergence to
better solutions, using the results in Theorems 4.1 and 4.3. First, recall that finding all cut
edges decomposes the problem into independent subproblems on 2-edge-connected
components. Second, the identification of degree-2 vertices defines two sets of edges
$E_{YES}$ and $E_{NO}$ that constitute all putative edges incident to the identified degree-2
vertices. In other words, $E_{MAYBE}$ forms the subgraph of $G$ induced by the set $V^{3+}$ of vertices
with degree $\ge 3$.
Furthermore, observe that the edges in $E_{YES}$ form parallel paths between vertices
in $V^{3+}$. A classical result in network reliability (see Fact 4.5) suggests that these parallel
paths can be merged into a single meta-edge whose reliability can be efficiently computed.
To be more formal, consider the set
$$\{P_1(u, v), P_2(u, v), \ldots, P_k(u, v)\}$$
which is the set of paths between $u$ and $v$ in $E_{YES}$. These paths can then be replaced by
a single edge $(u, v)$ with survival probability $p(u, v) = 1 - \prod_{i=1}^{k} \left(1 - p^{|P_i(u,v)|}\right)$. By merging
every set of parallel paths, we obtain a compact network over $V^{3+}$ that efficiently
encodes the edges in $E_{YES}$. Since our genetic algorithm handles the case where the edge
survival probability is non-uniform, this compact encoding results in substantial gains
in the running time for estimating the fitness of the candidates, as well as in time and
space requirements for handling large population sizes. In our applications, this allows
us to remove approximately 70–75% of the original set of vertices.
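The survival probability of a merged meta-edge follows directly from the path lengths. The short sketch below assumes the paths are internally vertex-disjoint and described only by their number of edges; the function name is illustrative.

```python
from fractions import Fraction


def merged_survival(path_lengths, p):
    """Survival probability of the meta-edge replacing parallel u-v paths.

    Each path of length L survives with probability p**L (series composition);
    the meta-edge fails only if every path fails (parallel composition).
    """
    all_fail = 1
    for length in path_lengths:
        all_fail *= 1 - p ** length
    return 1 - all_fail
```

For instance, a direct edge in parallel with a two-edge path at $p = 1/2$ gives $1 - (1/2)(3/4) = 5/8$, the same value as the triangle entries in Figure 4.2(b).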
4.4 Experimental Results
In this section, we provide the details on how our algorithm is tested on various datasets.
Then, the accuracy of the predicted direct PPI network is evaluated using simulated
data. Finally, our algorithm is tested on a well-known yeast AP-MS dataset to compare
against PPI datasets from other sources.
4.4. Experimental Results 97
4.4.1 Randomized Hill-climbing Algorithm
For performance comparisons, we tested our algorithm against a simple randomized
hill-climbing approach. In this approach, we start with a randomly chosen spanning
tree $G^1$ of the vertex set $V$. At the $i$-th iteration, we first sample the connectivity
probability $P_{G^i}$ of $G^i$, using the Monte Carlo simulation. Then, we randomly pick a vertex pair
$u, v$ with probability proportional to
$$\frac{D(u, v)}{\sum_{i, j \in V} D(i, j)},$$
where $D(u, v) = |M(u, v) - P_{G^i}(u, v)|$. If the selected pair $u, v$ is joined by an edge
in $G^i$, but $M(u, v) < P_{G^i}(u, v)$, then we remove $(u, v)$ from $G^i$. On the other hand, if $u$
and $v$ are not joined by an edge, but $M(u, v) > P_{G^i}(u, v)$, then we add $(u, v)$ to $G^i$.
We repeat this local optimization heuristic while making sure the candidate solution
remains connected.
4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors
In order to deal with numerical errors from the Monte Carlo sampling, we use an
additive tolerance level $\delta$. Note that the sampling process for estimating the probability
matrix $P_G$ is a binomial process, which, by the central limit theorem, is closely
approximated by a normal distribution. The confidence interval is largest when the estimated
probability is equal to 0.5, in which case we obtain a confidence interval of
$$p \pm \delta = p \pm \frac{z_{1-\alpha/2}}{2\sqrt{n}},$$
where $p$ denotes the fraction of samples where the two vertices are connected after $n$
samples, and $z_{1-\alpha/2}$ is the z-value for the desired level of confidence. Using this formula, we
conclude:
1. When $n = 20000$ (used to obtain the input matrix $M$ by sampling the test networks), we obtain a
95% confidence interval of size at most $2 \cdot \delta = 2 \cdot 0.007 = 0.014$.
2. When $n = 500$ (used to obtain the connectivity matrix by sampling each candidate solution in our
genetic algorithm), the 95% confidence interval is of size at most $2 \cdot \delta = 2 \cdot 0.04 = 0.08$.
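The two tolerance values can be reproduced from the formula above. The sketch below uses the standard library's normal distribution to obtain the z-value; the function name is an assumption, and the computed half-widths are roughly 0.0069 for 20000 samples and 0.044 for 500 samples, consistent with the rounded figures in the list.

```python
from math import sqrt
from statistics import NormalDist


def tolerance(n, confidence=0.95):
    """Worst-case half-width of the normal-approximation confidence interval.

    The binomial standard error sqrt(p(1 - p) / n) is largest at p = 0.5,
    which gives a half-width of z / (2 * sqrt(n)).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z / (2 * sqrt(n))
```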
With the chosen tolerance level $\delta$, we modify our algorithm as appropriate each time
we compare two connectivity probabilities. For example, in Theorem 4.1, the first
condition $P_G(u, v) = p$ is modified to $P_G(u, v) \in [p - \delta, p + \delta]$; and in Theorem 4.3, we modify
the first condition $P_G(s, t) < 2p - p^2$ to $P_G(s, t) < 2p - p^2 + \delta$.
4.4.3 Generation of Scale-free Networks
In order to generate artificial scale-free networks, we used two generation models: the
preferential attachment model, and the duplication model. In the preferential attachment
model, we evaluated the degree distribution of the two biological networks ($G_{Yu}$
and $G_{DIP}$) and used the Barabási-Albert algorithm to construct a scale-free network with
attachment factor 1.5 (each iteration adds a new vertex with 1–2 edges attached to
existing vertices). In the duplication model, at each iteration, we randomly pick a vertex to
duplicate with probability proportional to its degree and randomly drop the duplicated
edges with probability 0.5 in order to fit the degree distributions and sparsity of
biological networks. While there are various other random graph models proposed for PPI
networks, these scale-free network models provided a sufficiently diverse initial
population for our genetic algorithm.
4.4.4 Calculation of Connectivity Matrix from Peptide Counts
The peptide count of a prey protein in an AP-MS experiment is the number of different
peptides that have been observed by mass spectrometry for that protein. We note that
the peptide counts are biased towards preys with longer protein sequences, and to rec-
tify this propensity, we normalized the abundance data by the protein sequence lengths
to obtain the abundance ratios $R(i, j)$. In order to turn the normalized abundance ratios
into the connectivity matrix for our probabilistic graph model, we used a simple logistic
function
$$M(i, j) = \frac{1}{1 + e^{-\frac{R(i,j) - \alpha}{\beta}}}$$
where the parameters $\alpha, \beta$ are chosen so that the computed distribution of $p$ fits the
simulated connectivity distribution of $G_{Yu}$, using a $\chi^2$ test ($\alpha = 2.8921$, $\beta = -0.6318$). In
the cases where $R(i, j)$ differs from $R(j, i)$, we choose the average of the two entries to
symmetrize the matrix.
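A small sketch of this transformation is shown below. It assumes the logistic function has the form $1 / (1 + e^{-(R - \alpha)/\beta})$, which is one reading of the formula, and it symmetrizes by averaging the two transformed entries, which is likewise an assumption since the text does not say at which stage the averaging is applied. Function names are illustrative.

```python
import math


def abundance_to_connectivity(r, alpha, beta):
    """Logistic map from a normalized abundance ratio to a connectivity value."""
    return 1.0 / (1.0 + math.exp(-(r - alpha) / beta))


def symmetrize(M):
    """Replace M[i][j] and M[j][i] by their average."""
    n = len(M)
    return [[(M[i][j] + M[j][i]) / 2 for j in range(n)] for i in range(n)]
```

Under this reading, a ratio equal to $\alpha$ maps to a connectivity of exactly 0.5, so $\alpha$ acts as the midpoint of the curve and $\beta$ controls its steepness and direction.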
4.4.5 Model Validation
To test our approach, we first sought to validate our model of AP-MS data. To this end,
we used one of the most comprehensive yeast AP-MS networks, obtained by Krogan et
al. [93]. The dataset reports the Mascot score [114] and the number of peptides detected
for each bait-prey pair. The complete set of interactions reported contains 2186 proteins
and 5496 interactions (Krogan et al. Table S6); we call the resulting network GK ro g a n Fu l l .
The authors identified a subset of these interactions as high-confidence, based on their
Mascot scores (Krogan et al. Table S5). We call this set of high-confidence interactions
GK ro g a nHi g h ; this network consists of 1210 proteins and 2357 interactions. We expect
that GK ro g a nHi g h is relatively rich in direct interactions, whereas the complete set of in-
teractions GK ro g a n Fu l l consists in part of indirect interactions.
Considering G_KroganHigh as a direct interaction network, we used Monte Carlo
sampling to estimate P_{G_KroganHigh}, using p = 0.5 and 50,000 samples, which yields a 95%
confidence interval of size at most 0.007 on each P_{G_KroganHigh}(u, v) entry. Next, we
normalized the peptide counts of the interactions in G_KroganFull using protein lengths. We then
compared P_{G_KroganHigh} to the normalized peptide counts of the interactions in G_KroganFull.
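The Monte Carlo estimate of a connectivity matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis code: `p` is taken to be the probability that an edge survives in a sample, edges fail independently, and P(u, v) is estimated as the fraction of samples in which u and v lie in the same connected component. The function names are hypothetical.

```python
import random


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def estimate_connectivity(n, edges, p=0.5, samples=1000, seed=0):
    """Estimate P[u][v], the probability that u and v stay connected when
    each edge is kept independently with probability p (an assumption)."""
    rng = random.Random(seed)
    counts = [[0] * n for _ in range(n)]
    for _ in range(samples):
        parent = list(range(n))
        for u, v in edges:
            if rng.random() < p:          # this edge survives in the sample
                parent[find(parent, u)] = find(parent, v)
        roots = [find(parent, x) for x in range(n)]
        for u in range(n):
            for v in range(u + 1, n):
                if roots[u] == roots[v]:
                    counts[u][v] += 1
                    counts[v][u] += 1
    return [[c / samples for c in row] for row in counts]
```

On a path 0-1-2 with p = 0.5 the estimates should approach 0.5 for adjacent vertices and 0.25 for the two endpoints, since both edges must survive for the endpoints to remain connected.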
We expect that a significant fraction of the low-confidence interactions in G_KroganFull −
G_KroganHigh are indirect interactions. If our model is correct, their peptide
counts should then be correlated with the corresponding entries in P_{G_KroganHigh}. Indeed,
the positive linear correlation between the predicted connectivity P_{G_KroganHigh} and the
observed normalized peptide counts is highly significant (regression p-value of 8.17 × 10^{-11},
Student t-test). Furthermore, this correlation is strongest when p ≈ 0.5, as compared to
p = 0.3 or 0.7, justifying the use of this value in our subsequent analyses.
4.4.6 Accuracy of the Algorithm
The ideal validation of the accuracy of our algorithm would involve (i) constructing a
connectivity matrix M using actual quantitative AP-MS data; (ii) predicting direct inter-
actions based on M using our algorithm; and then (iii) comparing our predictions to
experimentally generated direct interaction data. Yeast 2-Hybrid (Y2H) experiments are
less prone to detect indirect interactions than are AP-MS methods, and several large-
scale efforts have been reported [53, 74, 137]. Unfortunately, for a number of techni-
cal reasons, the overlap between AP-MS PPI networks and Y2H networks remains very
small [149]. As a consequence, Y2H data cannot be used directly to validate predictions
made on AP-MS data. Instead, we had to rely on a partially synthetic data set, where an
actual network of high-quality Y2H interactions is assumed to form the direct interaction
graph, and a connectivity matrix is generated from it using Monte Carlo sampling,
under our model. Two sets of Y2H interactions were used: (i) G_Yu is the network
constructed from the gold standard dataset of Yu et al. [149]. This network consists of 1090
proteins and 1318 interactions with high confidence of being direct; (ii) G_DIP is the
core, high-quality, physical interaction network of yeast, available at the DIP database,
version 20090126CR [146], consisting of 1406 proteins and 1967 interactions.
These biological networks were complemented with two artificial 1000-vertex networks.
The first was generated using the preferential attachment model (PAM) [10]. For
the second, we used the duplication model (DM) [13], which, in contrast to the PAM,
generates graphs containing several dense clusters. The resulting artificial “direct”
interaction graphs are called G_PAM and G_DM and contain 1500 to 2000 interactions each.
We then used the Monte Carlo sampling approach described above to estimate the
connectivity matrices P_{G_Yu}, P_{G_DIP}, P_{G_PAM}, and P_{G_DM}. These will form the input to our inference
algorithm, whose output will then be compared to the corresponding direct interaction
graph. It is important to note that these input matrices are not perfectly accurate and
may contain sampling errors. However, it is easy to bound the size of the errors with
high probability and use this as a tolerance level within our algorithm. We also note that
the results presented in this section only aim at evaluating the performance of the infer-
ence algorithm on input data that was generated exactly according to our probabilistic
model. As such, the error rates reported may be considered as lower bounds for those on
actual biological data. An assumption-free evaluation is provided later in this section.
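One standard way to bound the sampling error of each estimated entry is Hoeffding's inequality, P(|p̂ − p| ≥ ε) ≤ 2 exp(−2Nε²). The sketch below solves this for ε; it is offered as an illustration of how a tolerance level could be derived, and the text does not state which bound was actually used. The function name and the optional union bound over several entries are assumptions.

```python
import math


def hoeffding_tolerance(samples, delta=0.05, entries=1):
    """Half-width eps such that, with probability at least 1 - delta,
    every one of `entries` Monte Carlo estimates is within eps of its
    true value (Hoeffding's inequality plus a union bound)."""
    return math.sqrt(math.log(2.0 * entries / delta) / (2.0 * samples))


# With 50,000 samples and a 95% confidence level for a single entry,
# the tolerance comes out just above 0.006.
single_entry_tolerance = hoeffding_tolerance(50000, delta=0.05)
```

Asking for the bound to hold simultaneously for all entries of a matrix (via `entries`) only increases the tolerance logarithmically in the number of entries.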
Identification of weakly connected vertices
Theorem 4.1 provides an efficient algorithm that guarantees the identification of all cut
edges, provided that the given connectivity matrix is precise. We say that a vertex v is a
1-cut vertex if all edges incident on v are cut edges. By applying Theorem 4.1 recursively
to detect cut edges and decomposing the graph into two connected components, we
can detect and remove all 1-cut vertices from the input connectivity matrix. Table 4.1 (i)
reports the number of 1-cut vertices that are detected by the recursive algorithm from
Theorem 4.1. In both the Yu and the DIP network, 1-cut vertices constitute approxi-
mately 50% of the network, and identifying them allows a significant reduction in the
problem size. We note that the inaccuracies in the input connectivity matrices could, in
principle, have introduced errors in the detection of cut edges. However, this rare event
was never observed on any of our networks.
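For reference, the "real" set of 1-cut vertices of a known graph can be computed directly from its cut edges (bridges). The sketch below does this with a standard iterative depth-first search; it works on an explicit edge list, so it is not the connectivity-matrix procedure of Theorem 4.1, and the function names are hypothetical. Isolated vertices are not counted as 1-cut vertices here, which is an assumption.

```python
def find_bridges(n, edges):
    """Return the set of cut edges of an undirected simple graph on
    vertices 0..n-1, using an iterative low-link depth-first search."""
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    disc = [-1] * n   # discovery times
    low = [0] * n     # lowest discovery time reachable from the subtree
    bridges = set()
    timer = 0
    for start in range(n):
        if disc[start] != -1:
            continue
        disc[start] = low[start] = timer
        timer += 1
        stack = [(start, -1, 0)]  # (vertex, entering edge, next neighbour index)
        while stack:
            u, in_edge, i = stack.pop()
            if i < len(adj[u]):
                stack.append((u, in_edge, i + 1))
                v, idx = adj[u][i]
                if idx == in_edge:
                    continue
                if disc[v] == -1:
                    disc[v] = low[v] = timer
                    timer += 1
                    stack.append((v, idx, 0))
                else:
                    low[u] = min(low[u], disc[v])
            elif stack:
                parent = stack[-1][0]
                low[parent] = min(low[parent], low[u])
                if low[u] > disc[parent]:
                    bridges.add(edges[in_edge])
    return bridges


def one_cut_vertices(n, edges):
    """Vertices with at least one edge, all of whose edges are cut edges."""
    bridges = find_bridges(n, edges)
    incident = [[] for _ in range(n)]
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    return {v for v in range(n)
            if incident[v] and all(e in bridges for e in incident[v])}
```

On a triangle 0-1-2 with a pendant path 2-3-4, the cut edges are (2, 3) and (3, 4), and the 1-cut vertices are 3 and 4; vertex 2 is excluded because it also lies on the triangle.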
Algorithm 4.3.1 (see Methods) is guaranteed to efficiently identify all degree-2 vertices
(again, provided that the connectivity matrix is known), but may also incorrectly flag
some higher-degree vertices. As seen in Table 4.1 (ii), nearly all degree-2 vertices were
identified, with a low false-discovery rate ranging from roughly 4% to 12%. Moreover, the
false positives incorrectly detected as degree-2 vertices indeed had small degrees, and their
predicted neighbours were mostly correct (but incomplete) predictions. Flagging degree-2
vertices reduces the problem size further by roughly 13% to 36%.

| Network | Total | (i) real | (i) pred. | (i) FDR(%) | (i) FNR(%) | (ii) real | (ii) pred. | (ii) FDR(%) | (ii) FNR(%) | Remaining |
|---|---|---|---|---|---|---|---|---|---|---|
| Yu | 1090 | 552 | 552 | 0 | 0 | 195 | 207 | 7.7 | 2.05 | 331 |
| DIP | 1406 | 656 | 656 | 0 | 0 | 309 | 326 | 5.82 | 0.64 | 424 |
| PAM | 1000 | 457 | 457 | 0 | 0 | 351 | 363 | 3.58 | 0.28 | 180 |
| DM | 1000 | 323 | 323 | 0 | 0 | 117 | 126 | 11.9 | 5.12 | 551 |

Table 4.1: Performance of detecting weakly connected vertices. Number of vertices detected as (i) 1-cut vertices, and (ii) degree-2 vertices in the real network and the predicted network. False discovery (FDR) and false negative (FNR) ratios are given in percentage. Remaining: the number of vertices remaining after identifying 1-cut vertices and degree-2 vertices.
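The two error measures reported in the tables of this section can be computed from a real set and a predicted set as sketched below. The definitions used (FDR as the share of predictions that are wrong, FNR as the share of real items that are missed) match the table captions; the function name and the example sets are illustrative only.

```python
def error_rates(real, predicted):
    """Return (FDR, FNR) in percent.

    FDR: percentage of predicted items that are not in the real set.
    FNR: percentage of real items that are not in the predicted set.
    """
    real, predicted = set(real), set(predicted)
    true_pos = len(real & predicted)
    fdr = 100.0 * (len(predicted) - true_pos) / len(predicted) if predicted else 0.0
    fnr = 100.0 * (len(real) - true_pos) / len(real) if real else 0.0
    return fdr, fnr


# Illustrative sets sized like the Yu degree-2 row (195 real, 207 predicted).
# The overlap of 191 is chosen here for illustration; it is not taken from
# the source data.
fdr, fnr = error_rates(range(195), range(4, 211))
```

When the items are undirected edges, each edge should be stored in a canonical orientation, for example as `(min(u, v), max(u, v))`, before the sets are compared.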
After repeatedly detecting and removing 1-cut vertices and degree-2 vertices from
the problem space, the edges adjacent to approximately 70% of the vertices are detected
with a very low error rate. The remaining vertices constitute only approximately 30% of
the original network. We call this remaining subset the hard core of the connectivity
matrix. Because it is more densely connected than the rest of the network, the topology
of the hard core is more difficult to reconstruct.
Running our algorithm on the PAM simulated data yields similar resolution and error
rates as on the Y2H networks. However, our DM network is found to be less amenable to
these strategies, leaving 55% of vertices unresolved and resulting in an error rate approx-
imately twice that seen for other networks. This is simply due to the fact that networks
generated by the duplication model do not contain as many 1-cut vertices or degree 2
vertices when compared to other networks, including biological Y2H networks.
Identification of dense regions
Our dense region detection algorithm aims at identifying all edges that belong to a
k-clique in G_Direct, for a given value of k. Table 4.2 reports the accuracy of the algorithm.
| Network | k=7 real | k=7 pred. | k=7 FD(%) | k=7 FN(%) | k=6 real | k=6 pred. | k=6 FD(%) | k=6 FN(%) | k=5 real | k=5 pred. | k=5 FD(%) | k=5 FN(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yu | 42 | 54 | 22.22 | 0 | 66 | 104 | 36.53 | 0 | 146 | 308 | 55.19 | 5.48 |
| DIP | 0 | 42 | 100 | 0 | 86 | 112 | 26.79 | 4.65 | 184 | 266 | 35.34 | 6.52 |
| PAM | 0 | 31 | 100 | 0 | 0 | 45 | 100 | 0 | 0 | 96 | 100 | 0 |
| DM | 254 | 346 | 28.32 | 2.36 | 194 | 267 | 29.21 | 2.57 | 488 | 718 | 34.96 | 4.30 |

Table 4.2: Performance of quasi-clique predictions. Number of edges that belong to maximal cliques of size k. Real: actual number of edges that belong to maximal cliques of size k; pred: predicted number of maximal k-clique edges; FD (false discovery ratio): percentage of false positives in the predicted set; FN (false negative ratio): percentage of false negatives in the real set.
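The "real" column of Table 4.2 counts edges lying in maximal cliques of size k in a known graph. A minimal sketch of that count is given below, using the Bron-Kerbosch algorithm with pivoting to enumerate maximal cliques. This is a reference computation on an explicit graph, not the dense-region detection algorithm of the thesis, and the function names are hypothetical.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of the graph as a frozenset.

    adj maps each vertex to the set of its neighbours
    (Bron-Kerbosch with pivoting).
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def edges_in_maximal_k_cliques(adj, k):
    """Edges (u, v) with u < v that lie in some maximal clique of size k."""
    result = set()
    for clique in maximal_cliques(adj):
        if len(clique) == k:
            members = sorted(clique)
            for i, u in enumerate(members):
                for v in members[i + 1:]:
                    result.add((u, v))
    return result
```

For a 4-clique on {0, 1, 2, 3} with an extra pendant edge (3, 4), the call with k = 4 returns the six clique edges, k = 2 returns only (3, 4), and k = 3 returns nothing, because no triangle is maximal.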
As expected, our algorithm achieves extremely high sensitivity for clique edges. How-
ever, the false-discovery rate is quite high, especially for the case of DIP and PAM, k = 7.
This is due to the fact that distinguishing a 5-clique from, say, a quasi-clique of size 7
is extremely difficult, causing false-positive predictions. We note however that these er-
roneous predictions are mostly inconsequential, as the intra-cluster topology of each
dense region shall only be used in generating the initial candidate solutions for the ge-
netic algorithm.
Cut-based genetic algorithm
The various parameters of the genetic algorithm (population size, mate selection
probability, mutation rate, etc.) were optimized for the running time and accuracy of the
solution based on G_Yu. Although our genetic algorithm could in principle be used on any
connectivity matrix, running it on the full matrix of > 1000 proteins is impractical: the
search space is huge, and the amount of time required to evaluate the fitness of a given
candidate solution is too large. However, as discussed previously, first applying the
1-cut and degree-2 vertex detection algorithms significantly reduces the problem size and
makes it accessible to our genetic algorithm. Table 4.3 (i) reports the accuracy of the
genetic algorithm predictions on the hard core of each connectivity matrix. We note
that since the network to be inferred is relatively highly connected, the problem is
significantly more difficult than the identification of 1-cut and degree-2 vertices. Indeed,
the false-discovery and false-negative rates range from 35% to 55% for most datasets.
| Network | (i) real | (i) pred. | (i) FDR(%) | (i) FNR(%) | (ii) real | (ii) pred. | (ii) FDR(%) | (ii) FNR(%) |
|---|---|---|---|---|---|---|---|---|
| Yu | 563 | 552 | 43.65 | 44.76 | 1318 | 1390 | 14.96 | 10.31 |
| DIP | 931 | 890 | 35.50 | 38.34 | 1967 | 2041 | 17.34 | 14.23 |
| PAM | 473 | 421 | 49.88 | 55.39 | 1538 | 1462 | 16.14 | 20.28 |
| DM | 1138 | 1295 | 43.39 | 35.58 | 1869 | 1804 | 32.81 | 35.15 |

Table 4.3: Performance of genetic algorithm and overall algorithm. (i) Performance of the genetic algorithm on the reduced network obtained from removing 1-cut vertices and degree-2 vertices. (ii) Overall performance of the combined prediction pipeline on the complete connectivity matrix. Real and predicted describe the number of edges in the real and predicted networks, respectively.
For comparison, an algorithm that picks edges at random would achieve 98.75%
false-discovery and false-negative rates.
Combining the three phases of the algorithm, the overall error rate obtained on each
data set ranges from 10 to 20% false-discovery and false-negative rates, except for the
DM data set, which fares considerably worse, for the reasons explained earlier.
To the best of our knowledge, there have been no other efforts to solve the DIGCOM
problems (neither the exact nor the approximate version). We thus compared our approach
to a simple hill-climbing search algorithm on the Yu et al. data set (see Methods). We
let this algorithm run over several days (as opposed to a few hours spent using our
approach), with multiple restarts, and found that it provides very poor sensitivity and
specificity (see Table 4.4 for the best results obtained). This is not surprising since the
hill-climbing method is highly dependent on the initial solution (in this case, a spanning
tree chosen randomly based on the connectivity matrix) and the search space is simply
too large to search exhaustively for a good initial solution. We also tested the
hill-climbing approach in the same setting as the genetic algorithm, i.e. combining it with
the 1-cut and degree-2 vertex detection algorithms. Here, the hill-climbing approach
showed better sensitivity and specificity than the pure hill-climbing approach,
but still performed much worse than our genetic algorithm (Table 4.4). Furthermore,
the improvement over the pure hill-climbing approach was mostly due to the high
sensitivity and specificity of our algorithm for detecting weakly connected vertices.

| Method | Real | Predicted | FDR(%) | FNR(%) |
|---|---|---|---|---|
| (i) Hill-climbing | 1318 | 2108 | 86.67 | 78.68 |
| (ii) Hill-climbing + weakly conn. nodes | 1318 | 1517 | 43.57 | 35.05 |
| (iii) Our approach (GA) | 1318 | 1390 | 14.96 | 10.31 |

Table 4.4: Comparison of our method to simple hill-climbing approach. (i) Accuracy of the hill-climbing approach used over the complete network; (ii) Accuracy of the hill-climbing approach after fixing the weakly connected nodes using our algorithm; (iii) Accuracy of our combined pipeline using the genetic algorithm.

| Network | (i) 1-cut & degree-2 vertices | (ii) Quasi-clique predictions | (iii) Genetic algorithm |
|---|---|---|---|
| Yu | 0:00:29 | 0:31:02 | 15:00:00 |
| DIP | 0:00:41 | 0:48:14 | 15:00:00 |
| PAM | 0:00:19 | 0:29:49 | 15:00:00 |
| DM | 0:00:24 | 1:18:03 | 15:00:00 |

Table 4.5: Running times for each phase of our algorithm. (i) Detecting 1-cut vertices and degree-2 vertices; (ii) Predicting quasi-clique clusters; and (iii) Running the genetic algorithm. We report the average run time over three runs on each network. The implementation was tested on a Power Mac G5 (2 GHz) with 4 GB of RAM. Note that the genetic algorithm was run for a fixed amount of time, and the top-scoring candidates achieved the quality shown in Table 4.3. The times are shown in hh:mm:ss format.
To provide an idea of the running times, Table 4.5 gives the empirical data from our
experiments. Detecting weakly connected vertices took well under a minute on every
network and recognizing dense regions took between roughly 30 and 80 minutes, while the
genetic algorithm was run for a fixed amount of time (15 hours).
4.4.7 Inferring Direct Interactions from AP-MS Experimental Data
In order to apply our algorithm to biological data from AP-MS experiments, we used the
raw data reported by Krogan et al. [93] for the putative interactions among the 2186
proteins of G_KroganFull.
We only considered the subnetwork of tagged proteins, and further focussed our efforts
on the analysis of 77 proteins that are well separated in the tag-induced subnetwork.
Quantitative abundance estimates were derived from the peptide counts reported for
each prey, and an experimentally derived connectivity matrix M was obtained after
normalization (see Section 4.4.4). Our full prediction algorithm was then run on the
estimated connectivity matrix, resulting in a direct interaction graph prediction we call
G_Kim that consists of 164 interactions (see Table 4.6). The network G_Kim was compared
to G_KroganHigh, the set of high-confidence interactions reported by Krogan et al., and
to G_TopKroganHigh, a subset of G_KroganHigh consisting of the 164 most confident
interactions they reported (164 being chosen to match the size of G_Kim).
Both G_KroganFull and G_KroganHigh overlap G_Kim quite substantially. These three sets
of predictions were then compared against a set of high-quality binary interactions from
G_Yu. In Y2H experiments, the interaction partners are separately screened using a
genetic readout. Therefore, interactions from G_Yu are believed to be direct, and are thus
used to test the predictions from AP-MS data. On the other hand, these interactions
may reflect only a subset of all direct interactions among the 77 proteins.
As shown in Figure 4.4, the high-confidence AP-MS data G_KroganHigh
exhibited very little overlap with the direct binary interaction set G_Yu: roughly 73% of the
interactions in G_KroganHigh are absent from G_Yu, and roughly a quarter of G_Yu remains
undetected by G_KroganHigh. Furthermore, even the top-scoring set of interactions
G_TopKroganHigh showed high discrepancy ratios against G_Yu. In contrast, G_Kim produced by
our algorithm agrees with G_Yu with better sensitivity and specificity. Given the crudeness
of the method used to translate the AP-MS data into a connectivity matrix, our algorithm
has thus performed relatively well in predicting direct interactions from real AP-MS data.
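The discrepancy ratios discussed above follow directly from the three region counts of each Venn diagram in Figure 4.4. The sketch below recomputes them; the assignment of each count triple to (G_Yu only, shared, AP-MS network only) is inferred from the totals printed in the figure (121, 332 and 164), and the function and dictionary names are hypothetical.

```python
def overlap_stats(only_ref, shared, only_pred):
    """Discrepancy ratios (in percent) between a reference set and a
    prediction set, given the three region counts of their Venn diagram."""
    ref_total = only_ref + shared
    pred_total = shared + only_pred
    return {
        "pred_not_in_ref_pct": 100.0 * only_pred / pred_total,
        "ref_missed_pct": 100.0 * only_ref / ref_total,
    }


# (G_Yu only, shared, AP-MS network only), as read from Figure 4.4.
VENN = {
    "G_KroganHigh": (31, 90, 242),
    "G_TopKroganHigh": (42, 79, 85),
    "G_Kim": (15, 106, 58),
}

stats = {name: overlap_stats(*counts) for name, counts in VENN.items()}

# Ratio of shared interactions: G_Kim versus the 164 top-scoring ones.
shared_ratio = VENN["G_Kim"][1] / VENN["G_TopKroganHigh"][1]
```

Under these counts, `shared_ratio` is about 1.34, which is the basis for the statement that the predictions overlap the Y2H data roughly 35% more often than the same number of top-scoring interactions.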
4.5 Discussions
Approaches for determining bait-prey abundance remain in their infancy, and to date,
no large-scale PPI networks have this type of quantitative data. As these approaches
improve in accuracy, so will the results of our method. Furthermore, as the sensitiv-
ity of AP-MS pipelines improves, the fraction of indirect interactions detected will also
increase, thereby making the ability to distinguish them even more critical.
| Protein A | Protein B | Protein A | Protein B | Protein A | Protein B |
|---|---|---|---|---|---|
| YOL012C | YKR048C | YNL031C | YJL115W | YLR418C | YIL035C |
| YBL003C | YGL241W | YNL031C | YBL003C | YDR054C | YGL173C |
| YBL003C | YKR048C | YNL031C | YLL022C | YDR054C | YLR418C |
| YBL003C | YGL207W | YNL031C | YIL035C | YER112W | YCR077C |
| YBL003C | YIL035C | YNL031C | YGL207W | YER112W | YDR432W |
| YGR103W | YIL035C | YNL031C | YER164W | YMR229C | YDR432W |
| YGR103W | YMR229C | YOR039W | YER164W | YOR206W | YMR229C |
| YGR103W | YOR272W | YER164W | YGL207W | YHL034C | YOR310C |
| YGR103W | YKR048C | YBL002W | YOL012C | YHL034C | YDR432W |
| YGR103W | YNL189W | YBL002W | YDR224W | YHL034C | YLR175W |
| YGR103W | YLR347C | YBL002W | YKR048C | YHL034C | YMR229C |
| YMR304W | YGR159C | YBL002W | YBR245C | YHL034C | YGL173C |
| YMR304W | YDR432W | YBL002W | YGL207W | YOL076W | YGR159C |
| YMR304W | YLR197W | YBL002W | YIL035C | YOL076W | YDR432W |
| YMR075W | YBL002W | YLL022C | YPL001W | YNL088W | YGR159C |
| YMR075W | YDR432W | YBR010W | YBR245C | YCR077C | YDR432W |
| YMR075W | YNL189W | YBR010W | YGL207W | YER146W | YCR077C |
| YOL004W | YAL034C | YBR010W | YDR432W | YHR064C | YGL173C |
| YOL004W | YMR075W | YBR010W | YKR048C | YOR048C | YDR432W |
| YOL004W | YDR432W | YBR010W | YPL001W | YGL241W | YKR048C |
| YOL004W | YIL035C | YBR010W | YBL002W | YDR225W | YGL207W |
| YOR310C | YGL120C | YBR010W | YLL022C | YDR225W | YKR048C |
| YOR310C | YNL189W | YCR060W | YGR159C | YDR225W | YGL241W |
| YLR455W | YOR310C | YJR041C | YNL189W | YGL207W | YOR039W |
| YLR455W | YOR206W | YJR041C | YDR432W | YGL207W | YGL244W |
| YLR455W | YNL088W | YJR041C | YMR125W | YGL207W | YIL035C |
| YLR455W | YMR229C | YJR041C | YPL178W | YGL207W | YBR279W |
| YLR455W | YDR496C | YNL209W | YDR432W | YGL207W | YOL145C |
| YBR009C | YPL001W | YNL209W | YER146W | YGL207W | YNL088W |
| YBR009C | YGL207W | YNL209W | YKR048C | YGL207W | YGL173C |
| YNL030W | YPL001W | YGL244W | YLR418C | YGL207W | YOR061W |
| YNL030W | YBR009C | YGL244W | YOL145C | YDR224C | YGL207W |
| YNL030W | YLL022C | YOR123C | YGL207W | YDR224C | YKR048C |
| YNL030W | YGL207W | YOR123C | YGL244W | YGL120C | YGR159C |
| YKR024C | YMR229C | YDR234W | YHR064C | YGL120C | YDR432W |
| YKR024C | YGL173C | YOR061W | YOR039W | YMR128W | YGL120C |
| YBR103W | YDR432W | YIL035C | YBR009C | YGR272C | YHL034C |
| YDL229W | YBR169C | YIL035C | YDR172W | YGR272C | YDR432W |
| YDL229W | YMR229C | YIL035C | YOR061W | YMR014W | YDR432W |
| YDL229W | YGL207W | YIL035C | YER164W | YMR014W | YGR272C |
| YDL229W | YDR432W | YPL153C | YGR159C | YNL147W | YGL173C |
| YDL229W | YKR048C | YPL153C | YNL189W | YDR432W | YDL060W |
| YDL229W | YGR159C | YJL115W | YOR061W | YGL147C | YDR432W |
| YDL229W | YHR064C | YJL115W | YIL035C | YPL178W | YLR347C |
| YOL145C | YBR279W | YJL115W | YPL153C | YPL178W | YMR125W |
| YOL145C | YDR172W | YLR410W | YGR159C | YGR159C | YOR310C |
| YOL145C | YOR123C | YLR410W | YKR048C | YML062C | YNL189W |
| YBR279W | YLR418C | YLR197W | YOR310C | YML062C | YGR159C |
| YBR279W | YIL035C | YLR197W | YLR347C | YMR125W | YNL189W |
| YBR279W | YGL244W | YDL040C | YMR229C | YNL139C | YLR347C |
| YBR279W | YOR123C | YDL040C | YLR175W | YNL139C | YNL189W |
| YDR365C | YMR229C | YDL040C | YLR197W | YNL189W | YLR347C |
| YDR365C | YGR159C | YLR418C | YGL207W | YDR289C | YNL189W |
| YNL031C | YPL001W | YLR418C | YOR123C | YDR289C | YLR347C |
| YNL031C | YBR009C | YLR418C | YNL189W | | |

Table 4.6: The set of direct interactions among 77 yeast proteins, predicted by our algorithm. The connectivity matrix is generated from normalized peptide counts, and then our algorithm is run to predict the direct interactions.
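[Figure 4.4: three Venn diagrams, each comparing G_Yu (121 interactions) with an AP-MS-based network. (a) G_KroganHigh (332 interactions): 31 in G_Yu only, 90 shared, 242 in G_KroganHigh only. (b) G_TopKroganHigh (164 interactions): 42 in G_Yu only, 79 shared, 85 in G_TopKroganHigh only. (c) G_Kim (164 interactions): 15 in G_Yu only, 106 shared, 58 in G_Kim only.]
Figure 4.4: Inferring direct interactions from actual AP-MS dataset. Overlap between the Y2H interaction network of Yu et al. and various AP-MS-based networks: (a) High-confidence set of interactions from Krogan et al. (b) The set of 164 highest scoring interactions from Krogan et al. (c) The set of 164 interactions predicted as direct interactions by our algorithms, based on the AP-MS data from Krogan et al.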
In this chapter, we lay the groundwork for modelling indirect interactions in AP-
MS experiments. We formulate the DIGCOM problems, which aim at distinguishing
direct interactions from indirect ones, and provide a set of theoretical and heuristic
approaches that are shown to be highly accurate on both biological PPI networks and
simulated networks. Despite the unrealistic assumptions that should be relaxed, our re-
sults show that the predicted set of interactions fits the experimental data reasonably
well. In addition, applying our algorithms to a large-scale AP-MS data set from Krogan
et al. results in predictions that overlap Y2H data approximately 35% more often than
the equivalent number of top-scoring interactions reported by these authors.
The DIGCOM problems raise a number of challenging, yet fascinating computa-
tional and mathematical problems. Is the solution to the exact DIGCOM problem, if
it exists, always unique? We suspect it is. What is the computational complexity of the
exact and approximate DIGCOM problems? We believe they are NP-hard, and possi-
bly not even in NP. Are there types of graph substructures, other than those discussed
here, that can be unambiguously inferred from P_G? Are there special properties of PPI
networks, other than the power-law degree distribution, of which an algorithm can take
advantage to make more accurate predictions and/or provide approximation or proba-
bilistic guarantees?
The model and algorithm proposed here are only a first step toward an accurate de-
tection of direct interactions from AP-MS data. Several generalizations and improve-
ments are worth investigating. First, the abundance of an interaction is not constant
and needs to be modelled more accurately. Second, the strength of all physical inter-
actions is non-uniform, and some interactions may be more prone to disruption by the
affinity purification process than others. Given sufficient quantitative AP-MS data, one
may study a generalization of the DIGCOM problems that aims at identifying not only
the set of direct interactions, but also their individual strengths and abundances. While
modelling these aspects is possible in theory, experimental data of the required quantity
and quality are currently unavailable, and the computational complexity of the resulting
problems is likely to be daunting.
Perhaps a more significant limitation of our model is that all direct interactions are
assumed to occur simultaneously, though it is clear that certain interactions are either
mutually exclusive, or restricted to specific subcellular compartments or conditions.
One can investigate approaches to decompose the observed network into a family of
simultaneously occurring interactions. In order to do so, complementary experimental
data, such as comprehensive protein localization assays or cell cycle expression data,
would be required to reduce the space of possible solutions in a biologically meaningful
manner.
An additional assumption that may need to be relaxed is the independence of the
edge failures, which may not hold in cases where the loss of an interaction between two
proteins causes a significant destabilization of the larger complex they belong to. Un-
fortunately, in the presence of strong dependencies between edge failures, it becomes
almost impossible to distinguish direct from indirect interactions. Nonetheless, it may
be possible to at least identify complexes where such dependencies hold, by studying
subsets of proteins for which the AP-MS data differs significantly from our model.
4.6 Bibliographic Notes
The results discussed in this chapter have been published in [88].
Chapter 5
Hypergraph Modelling of PPI Data
PPI networks are traditionally modelled as graphs whose vertices represent proteins,
and two vertices are joined by an edge if and only if the two corresponding proteins
interact with each other. However, many recent discoveries of protein complexes as
building blocks of the proteome led us to believe that protein-protein interactions may
involve any number of interaction partners [7, 58, 93]. Consequently, various methods
have been proposed for detecting protein complexes (e.g. [7, 90, 129]). We discussed
some of these methods in detail in Chapter 1.
One of the limitations of the existing approaches is that the identified protein com-
plexes are often disjoint, and the sparse inter-cluster bridges remain as binary interac-
tions. Some recently proposed methods do allow finding overlapping clusters [1, 7, 98],
but most of these approaches are designed to first identify dense clusters, and then try
to merge the nearby clusters. While intuitive, such a strategy often fails to find a globally
optimum structure if, for example, the optimization criterion is to minimize the number
of clusters. In fact, none of these methods guarantees to find the optimum solution for
the problem they are designed for. As a result, finding a comprehensive model for PPI
networks as a collection of protein complexes remains unsolved to date.
This chapter is concerned with the problem of modelling binary PPI data as a hyper-
graph, i.e., a set system where each set represents a protein complex. In order to model
this problem, let G = (V, E ) denote the combinatorial graph constructed from the widely
available binary PPI data (e.g. Y2H interaction data, or the direct AP-MS data produced
from Chapter 4). Then, each protein complex in the PPI network corresponds to a set S
of vertices in G , where each pair of vertices in S share an edge between them. In other
words, each protein complex corresponds to a clique in G . Our problem of finding a
hypergraph (with possibly a small number of sets, or hyperedges) can then be stated as
finding a clique cover of minimum cardinality.
(EDGE) CLIQUE COVER. Given a graph G = (V, E ), find a smallest collection Q of cliques
in G such that every edge in E belongs to at least one clique in Q .
A similar problem can be defined as a partitioning problem:
(EDGE) CLIQUE PARTITION. Given a graph G = (V, E ), find a smallest collection Q of
cliques in G such that every edge in E belongs to precisely one clique in Q .
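The difference between the two problem statements can be made concrete with a small checker. The sketch below only verifies a candidate collection of cliques against the two definitions; it does not search for a smallest collection, and the function names are hypothetical.

```python
from itertools import combinations


def covered_edge_counts(edges, cliques):
    """Count, for every edge of the graph, how many of the given vertex
    sets contain it. Raises ValueError if a set is not a clique."""
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    counts = {e: 0 for e in edge_set}
    for q in cliques:
        for u, v in combinations(sorted(q), 2):
            if (u, v) not in edge_set:
                raise ValueError(f"{sorted(q)} is not a clique: no edge {(u, v)}")
            counts[(u, v)] += 1
    return counts


def is_clique_cover(edges, cliques):
    """Every edge belongs to at least one clique."""
    return all(c >= 1 for c in covered_edge_counts(edges, cliques).values())


def is_clique_partition(edges, cliques):
    """Every edge belongs to precisely one clique."""
    return all(c == 1 for c in covered_edge_counts(edges, cliques).values())
```

For two triangles {0, 1, 2} and {1, 2, 3} sharing the edge (1, 2), the two triangles form a clique cover of size 2 but not a clique partition, since the shared edge is covered twice; a partition such as {0, 1, 2}, {1, 3}, {2, 3} needs three cliques.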
We therefore turn our attention to the clique cover problem, with an application
to hypergraph modelling of PPI data. In fact, our theoretical pursuit on the clique cover
problem gives rise to efficient algorithms for sparse networks (including PPI networks),
where the graph sparsity is measured by various graph parameters.
Related Work. In addition to graph theoretic aspects, the clique cover problem has
been studied extensively from the standpoint of computational complexity. In particu-
lar, there are numerous studies concerning approximability and fixed-parameter tractabil-
ity. Clique cover is NP-hard in general [110], even when the input graph is planar [24]
or has bounded degree [71]. Furthermore, Lund and Yannakakis have shown that clique
cover is not approximable within a factor of |V|^ε for some ε > 0 unless P = NP [101],
thereby removing the hope of good approximation algorithms in the general case. The
clique partition problem has also been shown to be NP-complete
for various restricted classes of graphs [23, 56, 57, 72].
A parameterized problem is fixed-parameter tractable (FPT) if it can be solved in
f(k) · |I|^{O(1)} time, where f is a computable function depending on some parameter k,
independent of the input size |I|. Recently, Gramm et al. [66] showed that the clique
cover problem is FPT when the size of the cover is chosen for the parameter k . Similarly,
Mujuni and Rosamond [105] have shown that the clique partition problem is FPT with
the output size chosen as the parameter. These algorithms run in polynomial time in
the input size but exponential time in the number of cliques in the solution. As a result,
these algorithms are well suited for dense graphs where a few cliques can cover the entire
graph, but they are not suitable for sparse graphs that require a large number of small
cliques in the solution.
In this chapter, we fill the gap by designing efficient algorithms for such sparse graphs.
We first introduce some definitions and related results in Section 5.1. Then, in Sec-
tion 5.2, we present an exact algorithm for clique cover when the input graph has bounded
treewidth. Section 5.3 discusses the problem restricted to planar graphs, and we provide
a polynomial time approximation scheme. Finally, we show the performance of our al-
gorithm from experimental studies in Section 5.4. In particular, our algorithm shows
efficient and practical running time for both real and simulated biological networks.
Furthermore, our PTAS for planar graphs shows a clear trade-off between the approxi-
mation ratio and the running time when tested against random planar graphs.
5.1 Preliminaries
Throughout this chapter, we focus on the clique cover problem for sparse networks.
Where possible, we shall discuss how the algorithms for the clique cover problem can
be modified to solve the clique partition problem. Various measures of network sparsity
have been proposed in the past, with possibly the best known being treewidth. We recall
the definition of tree decomposition from Section 2.2.
Definition 5.1. [120] A tree decomposition of a graph G = (V, E) is a pair (X = {X_i | i ∈
I}, T = (I, F)) where each node i ∈ I is associated with a set of vertices X_i ⊆ V, such that
(1) Each vertex belongs to at least one node: ⋃_{i∈I} X_i = V.
(2) Each edge is induced by at least one node: ∀(v, w) ∈ E, there is an i ∈ I with v, w ∈ X_i.
(3) For each v ∈ V, the set of nodes {i ∈ I | v ∈ X_i} induces a subtree of T.
Then, the width of a tree decomposition (X, T) is defined as max_{i∈I} |X_i| − 1, and the
treewidth of a graph G, denoted tw(G), is the minimum width over all tree decompositions
of G.
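The three conditions of Definition 5.1 translate directly into a checker. The sketch below is illustrative only: bags are given as a dictionary from tree node to vertex set, the tree as a list of node pairs, and all names are hypothetical. Condition (3) is tested by counting, which is valid because any subgraph of a tree is a forest and a forest is connected exactly when it has one fewer edge than nodes.

```python
def is_tree(nodes, tree_edges):
    """True if (nodes, tree_edges) is a connected acyclic graph."""
    nodes = set(nodes)
    if not nodes or len(tree_edges) != len(nodes) - 1:
        return False
    adj = {i: [] for i in nodes}
    for i, j in tree_edges:
        adj[i].append(j)
        adj[j].append(i)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for j in adj[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(nodes)


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (1)-(3) of a tree decomposition."""
    if not is_tree(bags.keys(), tree_edges):
        return False
    # (1) every vertex appears in some bag
    if set().union(*bags.values()) != set(vertices):
        return False
    # (2) every edge has both endpoints in some common bag
    for v, w in edges:
        if not any(v in bag and w in bag for bag in bags.values()):
            return False
    # (3) the bags containing v induce a connected subtree
    for v in vertices:
        nodes_with_v = sum(1 for bag in bags.values() if v in bag)
        links_with_v = sum(1 for i, j in tree_edges
                           if v in bags[i] and v in bags[j])
        if links_with_v != nodes_with_v - 1:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For the 4-cycle 0-1-2-3-0, the two bags {0, 1, 3} and {1, 2, 3} joined by one tree edge form a valid decomposition of width 2, whereas a path of the four bags {0, 1}, {1, 2}, {2, 3}, {0, 3} fails condition (3) for vertex 0.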
An alternative definition of treewidth can be given using k -trees.
Definition 5.2. [3] A k -tree is defined recursively as follows:
(i) The complete graph on k vertices is a k -tree, and
(ii) A k -tree G with n + 1 vertices (n ≥ k ) can be constructed from a k -tree H with n
vertices by adding a vertex adjacent to exactly k vertices that form a k -clique in H.
Definition 5.3. A graph is a partial k -tree if it is a subgraph of a k -tree.
Then, the class of partial k-trees is equivalent to the class of graphs with treewidth
at most k. Note that this recursive definition provides a simple algorithm to construct a graph
of treewidth k ; we shall use this approach to generate a set of synthetic test data in
Section 5.4.1. In general, it is NP-complete to determine the treewidth of a graph [3].
However, when k is fixed, graphs with treewidth k can be recognized, and width k tree
decompositions can be constructed, in linear time [17].
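The recursive definition of a k-tree can be turned into a small random generator, sketched below. It follows Definition 5.2 as stated (starting from the complete graph on k vertices) and obtains a partial k-tree by keeping each edge with a fixed probability. The function names, the `keep` parameter and the uniform choice of the base clique are assumptions made for illustration; the text does not describe the exact sampling scheme used for the synthetic data.

```python
import random
from itertools import combinations


def random_k_tree(n, k, seed=0):
    """Edge set of a random k-tree on vertices 0..n-1 (requires n >= k >= 1).

    Starts from the complete graph on k vertices, then attaches each new
    vertex to all k vertices of a uniformly chosen existing k-clique.
    """
    if k < 1 or n < k:
        raise ValueError("need n >= k >= 1")
    rng = random.Random(seed)
    edges = set(combinations(range(k), 2))
    k_cliques = [tuple(range(k))]
    for v in range(k, n):
        base = rng.choice(k_cliques)
        for u in base:
            edges.add((u, v))           # u < v always holds here
        # Each (k-1)-subset of the base clique plus v is a new k-clique.
        for subset in combinations(base, k - 1):
            k_cliques.append(subset + (v,))
    return edges


def random_partial_k_tree(n, k, keep=0.7, seed=0):
    """Subgraph of a random k-tree: keep each edge with probability keep."""
    full = random_k_tree(n, k, seed)
    rng = random.Random(seed + 1)
    return {e for e in sorted(full) if rng.random() < keep}
```

A k-tree built this way has C(k, 2) + k(n − k) edges, and any subgraph of it has treewidth at most k, so `random_partial_k_tree` only guarantees an upper bound on the treewidth of its output.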
From the empirical studies of our input PPI data (shown in Section 5.4, Table 5.1), the
treewidth of PPI networks is often small compared to the network sizes. We thus assume,
where applicable, that the input graph has a bounded treewidth, and its optimum tree
decomposition can be constructed efficiently. Furthermore, for ease of exposition, we
assume that the decomposition tree T admits a nice structure as defined below.
Definition 5.4. [92] A tree decomposition (X, T) is called nice if the tree T is rooted, and
for each node i ∈ I, one of the following holds:
1. LEAF: node i is a leaf of T, and |X_i| = 1.
2. JOIN: node i has exactly two children j_1 and j_2 such that X_i = X_{j_1} = X_{j_2}.
3. INTRODUCE: node i has exactly one child j, and X_i = X_j ∪ {v}.
4. FORGET: node i has exactly one child j, and X_j = X_i ∪ {v}.
Figure 5.1 illustrates an example of these node types. It is easy to see that if tw(G) ≤ k,
then G also admits a nice tree decomposition of width ≤ k, with O(n) tree nodes: given
an arbitrary decomposition tree T, one can repeatedly split each node X_i until all nodes
satisfy the conditions above.
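The four node types of a nice tree decomposition can be recognized from a node's bag and the bags of its children, as sketched below. The sketch assumes the introduced or forgotten vertex v is a single vertex not already in the smaller bag, and it represents the rooted tree by a dictionary from each node to its list of children; these representation choices and the function names are hypothetical.

```python
def classify_node(bag, child_bags):
    """Return 'LEAF', 'JOIN', 'INTRODUCE' or 'FORGET' for a node of a
    rooted tree decomposition, or None if it matches none of the types."""
    bag = set(bag)
    children = [set(c) for c in child_bags]
    if not children:
        return "LEAF" if len(bag) == 1 else None
    if len(children) == 2:
        return "JOIN" if children[0] == bag == children[1] else None
    if len(children) == 1:
        child = children[0]
        if child < bag and len(bag - child) == 1:
            return "INTRODUCE"   # bag = child bag plus one new vertex
        if bag < child and len(child - bag) == 1:
            return "FORGET"      # child bag = bag plus one extra vertex
    return None


def is_nice(bags, children):
    """True if every node of the rooted decomposition has a valid type.

    bags: dict node -> vertex set; children: dict node -> list of children
    (nodes without an entry are treated as leaves).
    """
    return all(
        classify_node(bags[i], [bags[c] for c in children.get(i, [])]) is not None
        for i in bags
    )
```

A dynamic program over a nice decomposition then needs only four update rules, one per node type, which is the reason for assuming this structure.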
Another closely related graph parameter is branchwidth. We recall its definition from
Section 2.2.
Definition 5.5. [121] A branch decomposition (T, φ) of a graph G is characterized by a
ternary tree¹ T , and a bijection φ from the leaves of T onto the edges of G .
Let e be a tree edge in T . Removing e from T partitions T into two subtrees T1 and T2, and this
partition induces a partition of the edges of G , called an e -separation, into the edge sets G1 and G2
associated with the leaves of T1 and T2. The set of vertices in G that are incident to edges of both
G1 and G2 is called the middle-set of e , and the width of this separation is the number of vertices
in the middle-set.
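As a small illustration of the middle-set, assume the two sides of an e -separation are given as lists of edges (pairs of vertices). The helper below is a sketch with a hypothetical name; it simply collects the vertices that touch edges on both sides.

```python
def middle_set(edges_1, edges_2):
    """Vertices incident to at least one edge on each side of an e-separation."""
    side_1 = {v for e in edges_1 for v in e}
    side_2 = {v for e in edges_2 for v in e}
    return side_1 & side_2


# A 4-cycle a-b-c-d-a split into two paths of two edges each.
left = [("a", "b"), ("b", "c")]
right = [("c", "d"), ("d", "a")]
width = len(middle_set(left, right))  # the two paths meet in vertices a and c
```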
Given a branch decomposition (T,φ), the width of this branch decomposition is the
maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ),
1 A tree T is a ternary tree if every non-leaf node has degree 3.
Figure 5.1: Node types in a nice tree decomposition: (1) introduce node: X i = X j ∪ {v }; (2) forget node: X j = X i ∪ {v }; (3) join node: X i = X j1 = X j2 .
is the minimum width over all branch decompositions. It is well known that the branchwidth
is closely related to the treewidth of a graph [121]: $\mathrm{bw}(G) \le \mathrm{tw}(G) + 1 \le \tfrac{3}{2}\,\mathrm{bw}(G)$. For
planar graphs, Fomin and Thilikos gave an upper bound on the branchwidth:
Theorem 5.6. [52] For any planar graph G , $\mathrm{bw}(G) \le \sqrt{4.5\,n} \approx 2.122\sqrt{n}$.
5.2 Clique Cover for Graphs with Bounded Treewidth
In this section, we design a dynamic programming algorithm for finding a minimum
clique cover for a graph G where a nice tree decomposition (X , T ) is given. Let k denote
the width of that decomposition. First, let us define E (X i ) to be the set of edges in the
subgraph induced by the vertices in X i . Furthermore, we let Vi denote the union of all
vertices in X i and its descendent nodes. Similarly, let G i denote the subgraph of G in-
duced by the vertices Vi . Finally, for some v ∈ V (G ), let δ(v ) denote the set of edges that
are incident to v in G .
We shall in fact design an algorithm for a generalization of the clique cover problem,
where we are given a subset S of edges that are already covered, i.e., our solution need not
cover S, but may use these edges in the cliques. Then, the original clique cover problem
is a special case where S = ∅. Since our dynamic programming is formulated around the
decomposition tree T , we often speak of a subgraph G i where a certain subset S of edges
is already covered, denoted by G i (S). Now we can define a cost function:
C i (S) = minimum size of clique cover for G i where S is already covered.
Then, our final solution is precisely Cr (∅) where r is the root of T . Our dynamic
programming algorithm will proceed from the leaves of T up to its root, computing, for
each node X i , the value of C i (S) for every possible subset S of E (X i ). Depending on the
type of node, C i (S) is computed differently.
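For very small graphs, the cost function C i (S) can be checked against a brute-force reference. The sketch below is not the dynamic program of this section; it is an exponential-time baseline, with a hypothetical function name, that returns the fewest cliques needed to cover every edge outside a given set of already covered edges.

```python
from itertools import combinations


def min_clique_cover_size(vertices, edges, covered=frozenset()):
    """Brute-force C(S): fewest cliques covering every edge not in `covered`."""
    edge_set = {frozenset(e) for e in edges}
    todo = edge_set - {frozenset(e) for e in covered}
    # Enumerate every clique on at least two vertices, stored as its set of edges.
    cliques = []
    for size in range(2, len(vertices) + 1):
        for cand in combinations(sorted(vertices), size):
            cand_edges = {frozenset(p) for p in combinations(cand, 2)}
            if cand_edges <= edge_set:
                cliques.append(cand_edges)
    # Try cover sizes 0, 1, 2, ... until some family covers all required edges.
    for t in range(len(todo) + 1):
        for family in combinations(cliques, t):
            if todo <= set().union(*family):
                return t
    return None  # not reached: each single edge is itself a clique


# Triangle a-b-c with a pendant edge c-d.
V = "abcd"
E = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
```

On this example the triangle and the pendant edge need one clique each, while marking the pendant edge as already covered leaves only the triangle to cover.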
LEAF NODE. Suppose i is a leaf node. Then, by the definition of nice tree decomposition,
|X i |= 1, and thus C i (S) = 0 for all S ⊆ E (X i ), trivially.
FORGET NODE. Suppose i is a forget node. Then, it has one child node j with X j = X i ∪ {v } for
some unique vertex v .
Lemma 5.7. If i is a Forget node, then for any subset S ⊆ E (X i ), C i (S) = C j (S).
Proof. Note that since X i ⊂ X j , the corresponding graphs G i and G j are the same. Furthermore,
since v is not in X i , S ∩ δ(v ) = ∅. Therefore, S ∩ E (X i ) = S ∩ E (X j ) and we have
C i (S) = C j (S).
INTRODUCE NODE. Suppose i is an introduce node. Then, it has one child node j
such that X i = X j ∪ {v } for some unique vertex v . Consider an arbitrary clique cover W
for G i (S). The cliques in W can be partitioned into $Q_v$, the cliques containing v , and
$\bar{Q}_v = W - Q_v$, so the following recurrence relation holds for each introduce node.
Lemma 5.8. If i is an Introduce node, then for any subset S ⊆ E (X i ):
$C_i(S) = \min\{\, |Q_v| + C_j(S \cup E(Q_v)) : Q_v \text{ is a clique cover for } \delta(v) - S \,\}$
Proof. Let W be a clique cover for G i (S). We partition the cover as $W = Q_v \cup \bar{Q}_v$, where $Q_v$
consists of the cliques containing v and $\bar{Q}_v = W - Q_v$, and consider the edge set $E(Q_v)$ of the
cliques in $Q_v$. Since these edges are covered by $Q_v$, $\bar{Q}_v$ needs only cover $G_i - (S \cup E(Q_v))$.
Moreover, since the cliques in $\bar{Q}_v$ do not contain v , $\bar{Q}_v$ is a cover for $G_j - (S \cup E(Q_v))$.
Therefore, $|\bar{Q}_v| \ge C_j(S \cup E(Q_v))$, and the lemma follows.
To compute C i (S), we need to consider all possible clique covers, $Q_v$, for δ(v ) − S.
Since |X i | ≤ k , the number of such covers is bounded by the number of ways to partition k vertices,
which is given by the k th Bell number $B(k) \le k!\,2^k$.
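The Bell numbers that bound the work at an Introduce node grow quickly but are easy to tabulate. The sketch below uses the standard Bell triangle; the function name is an assumption made for this example.

```python
def bell_numbers(k):
    """Return [B(0), B(1), ..., B(k)] computed with the Bell triangle."""
    row = [1]
    bells = [1]  # B(0) = 1
    for _ in range(k):
        # Each new row starts with the last entry of the previous row,
        # and every further entry adds the entry above to its left neighbour.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells
```

For instance, `bell_numbers(5)` yields the values 1, 1, 2, 5, 15, 52, so even modest bag sizes lead to a large number of candidate local covers.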
JOIN NODE. Finally, suppose i is a join node. Then, it has two child nodes j1 and j2
such that X i = X j1 = X j2 , and G i = G j1 ∪ G j2 . Therefore, a clique cover for G i contains
cliques that belong to G j1 or G j2 . We must take care not to double count cliques that
belong to both G j1 and G j2 .
Lemma 5.9. Let S ⊆ E (X i ) be the set of already covered edges, and let R = E (X i )−S be the
edges to be covered. Then,
C i (S) =min{ C j1(S ∪R2)+C j2(S ∪R1) : ∀R1 ⊆R and R2 =R −R1}
Proof. Assuming S is already covered, let W be a minimum clique cover for G i (S) with
cost C i (S). By definition, any clique that belongs to both G j1 and G j2 must also belong to
X i . Thus, the cliques in W can be partitioned as W = Q1 ∪ Q2 ∪ Q3, where
Q1 = {q ∈ W | q ⊆ Vj1 and q ⊈ X i }
Q2 = {q ∈ W | q ⊆ Vj2 and q ⊈ X i }
Q3 = {q ∈ W | q ⊆ X i }.
Thus, |W | = |Q1| + |Q2| + |Q3|. Now, the edge set R can be partitioned as R = R1 ∪ R2 such
that:
R1 = {e ∈ R | e covered by Q1 or Q3 }
R2 = {e ∈ R | e covered only by Q2 } = R − R1
By definition, Q1 ∪ Q3 needs to cover the edges in R1. Furthermore, Q1 needs to cover the
edges in G j1 − E (X j1), and thus |Q1 ∪ Q3| ≥ C j1(S ∪ R2). On the other hand, the cliques in Q2
only need to cover the edges in R2 together with G j2 − E (X j2), and thus |Q2| ≥ C j2(S ∪ R1),
and the result follows.
Therefore, we can compute C i (S) for any given subset S ⊆ E (X i ). Observe that the
recurrence relation looks at all possible bipartitions of R . Since the number of edges in
E (X i ) is at most $\binom{k}{2}$, we need to check at most $2^{\binom{k}{2}}$ different partitions of R . Once the
bipartition of R is fixed, it takes constant time to look up the values from C j1 and C j2 .
Note that, while these recurrence relations calculate the size of clique covers, they
are also constructive: a little bookkeeping at each node will allow us to construct the
optimal cover at the root node.
5.2.1 Running Time of Treewidth-based Algorithm
For each node i ∈ I , we compute C i (S) for every S ⊆ E (X i ). Since tw(G ) = k , each node
contains at most k vertices. Let ρ denote the maximum number of edges induced in
any node. Then we need to consider $2^{\rho}$ cases. Then, for each fixed S ⊆ E (X i ), we carry
out one of the four recurrence relations. Leaf nodes and Forget nodes can be computed
in constant time. An Introduce node can be computed in O(B (k ) · k ) time, where B (k )
is the k th Bell number. Finally, a Join node takes $O(2^{\rho})$ time to compute. Therefore, the dynamic
programming algorithm takes $2^{\rho} \cdot \max\{B(k) \cdot k,\ 2^{\rho}\} \cdot O(n) = O(4^{\rho} n)$ time overall. While ρ
can be as large as $\binom{k}{2}$ in theory, this is rarely the case as shown in our experimental tests
(see Section 5.4, Table 5.1).
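As a rough worked illustration of this gap, take the values reported for the Krogan network in Table 5.1 (tw = 11, ρ = 21) and compare the number of subsets S per node with the theoretical worst case for k = 11:

```latex
\[
  2^{\rho} = 2^{21} = 2{,}097{,}152
  \qquad\text{versus}\qquad
  2^{\binom{11}{2}} = 2^{55} \approx 3.6 \times 10^{16}.
\]
```

The bound in terms of ρ is therefore many orders of magnitude smaller than the bound in terms of k alone on this input.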
Theorem 5.10. There is a linear time algorithm for computing a minimum clique cover for
graphs with fixed treewidth k .
5.2.2 Modifications for Clique Partition
It is straightforward to modify the above algorithm to solve the clique partition problem:
instead of assuming that edges already covered can be re-used to form other cliques,
we simply delete those edges and solve for the remaining edge set. If we redefine the cost
function C i (S) to be the size of the minimum clique partition for the graph G i −S, the
same recurrence holds for each node type. The only difference is when, at an Introduce
node, finding a local solution for δ(v ), we look for a clique partition rather than a cover.
5.3 Planar Clique Cover
In this section, we study the clique cover problem restricted to planar graphs, and present
a PTAS for planar graphs. While planar graphs are possibly the most restricted class of
interest for the clique cover problem (the largest clique being just K4), the problem re-
mains NP-hard [136]. Furthermore, as treewidth is unbounded in planar graphs [68,
120], simply applying our algorithm from Section 5.2 would result in an exponential
running time.
Instead, we shall design an exact polynomial time algorithm for planar graphs with
bounded branchwidth. As we shall see, the algorithm runs in $O(2^{k} n)$ time with k being
the branchwidth, and since $\mathrm{bw}(G) \le \sqrt{4.5\,n}$ when G is planar, this would be the first
subexponential algorithm for the clique cover problem on planar graphs.
Then, we will use our exact algorithm to construct a polynomial time approximation
scheme. Baker [8] has proposed a divide-and-conquer technique to design approxima-
tion schemes for various optimization problems on planar graphs. We will show that her
technique can be applied to the planar clique cover problem, using our exact algorithm
as a subroutine, resulting in a (1+ε) approximation algorithm.
5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth
As with its counterpart, treewidth, it is NP-complete to determine if a graph has a branch
decomposition of width at most k in general, but this decomposition can be found in
linear time when k is fixed. We thus assume that the input graph G is given together
with a branch decomposition (T,φ) of width at most k . Now pick an arbitrary edge e of
T , and subdivide it to create a root node r . Then each tree node X is associated with a
subset of edges in E (G ), namely the leaf nodes of the subtree rooted at X . We let E (S)
denote the edges in the subgraph induced by a subset of vertices S.
Define the middle-set of X , denoted by mid(X ), to be the middle-set of the edge between
X and its parent node. Since the root node r has no parent, set mid(r ) = ∅. Then,
we create a table WX [·] indexed by a subset F of edges as follows:
WX [F ] = size of a minimum clique cover for the edges in X with F already covered.
Since mid(X ) is a cutset in G , we can paste together solutions from each subproblem
by computing only the entries WX [F ] where F is a subset of edges in E (mid(X )).
Before describing the recurrence relation for this table, we study the middle-set of
three adjacent edges. Consider the sphere-cut branch decomposition for planar graphs,
as studied by Dorn et al. [41]; here, each middle-set defines a closed curve (namely, a
noose) on the planar embedding of the input graph that intersects only the vertices in
the middle-set. Let X be a tree node with two children nodes XL and XR , and a parent
node XP . The three edges adjacent to X define 3 middle-sets which we denote by OP , OL ,
OR for parent edge, left and right child edge, respectively. Since OR − (OP ∪ OL) = ∅ and
OL − (OP ∪ OR ) = ∅, the vertices of OP ∪ OL ∪ OR can be partitioned as follows:
Figure 5.2: A sphere cut decomposition at a node X and its planar embedding. (a) A tree node X has three adjacent edges, each of which defines a middle-set that forms a simple curve called a noose; (b) The middle-set of OP is drawn as solid lines. Inside the noose OP are OL and OR . Here, portal vertices P are drawn as red circles, intersection vertices are drawn as blue squares, and symmetric difference vertices are drawn as green diamonds.
• Portal vertices P =OL ∩OR ∩OP
• Intersection vertices I =OL ∩OR −P
• Symmetric Difference vertices D =OP − (P ∪ I )
Figure 5.2 gives an example of a sphere cut decomposition at a tree node X .
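The three vertex classes above are plain set expressions over the middle-sets. The sketch below assumes the middle-sets are given as Python sets; the function name and the small example are illustrative only.

```python
def partition_middle_sets(O_P, O_L, O_R):
    """Split the vertices of O_P, O_L, O_R into portal, intersection and
    symmetric difference vertices, following the three definitions above."""
    O_P, O_L, O_R = set(O_P), set(O_L), set(O_R)
    portal = O_L & O_R & O_P
    intersection = (O_L & O_R) - portal
    sym_diff = O_P - (portal | intersection)
    return portal, intersection, sym_diff


# Hypothetical middle-sets in which every vertex lies in at least two of the sets.
P, I, D = partition_middle_sets(O_P={1, 2, 3, 4, 6}, O_L={1, 2, 3, 5}, O_R={3, 4, 5, 6})
```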
Lemma 5.11. The table WX [F ] can be calculated as
WX [F ] =min{WX1[F ∪ F2]+WX2[F ∪ F1] : F1 ∪ F2 is a partition of E (I ∪P) }
Proof. For an arbitrary clique cover for X with F already covered, consider the cliques
covering the edges E (I ∪P). Observe that, by planarity of G , any clique intersecting with
I ∪ P only contains either vertices of X1 or vertices of X2. Therefore, we can partition
the edges in E (I ∪ P) into F1 and F2, where F1 is to be covered by cliques in X1, and
F2 is covered by cliques in X2. For each partition, the solution from the subproblem
WX1[F ∪ F2] together with the solution from WX2[F ∪ F1] gives the solution for WX [F ].
Since we consider all possible partitions of E (I ∪P), the lemma follows.
The algorithm runs in a bottom-up manner. Observe that to compute each state
F of a node, we need to consider all bipartitions of E (I ∪ P). Since I ∪ P ⊆ mid(X1),
|E (I ∪ P)| = O(k ), and thus there are $2^{O(k)}$ such partitions. Moreover, to compute for all
states F , we consider $2^{O(k)}$ subsets of edges in E (mid(X )). Altogether, the above dynamic
programming algorithm runs in $2^{O(k)}\,O(n)$ time.
Lemma 5.12. There is a linear time algorithm to compute a minimum clique cover for
planar graphs with bounded branchwidth.
5.3.2 Modifications for Clique Partition
As with our treewidth-based algorithm, the above algorithm can be easily modified to
solve the clique partition problem: rather than assuming F is already covered, one can
simply delete those edges and solve for the remaining edges.
5.3.3 Baker’s Technique on Planar Graphs
Baker [8] has proposed a general approach to design approximation algorithms for var-
ious NP-hard problems on planar graphs. Here, we show how this technique, together
with our exact algorithm in Section 5.3.1, results in a (1+ε) approximation algorithm.
Baker’s technique is a divide-and-conquer approach, where the input graph is de-
composed into layers of subgraphs defined by the distance from a chosen vertex. Ap-
plying this technique to the planar clique cover problem, we obtain Algorithm 5.3.1. In
short, we pick an arbitrary vertex r , and define the level of each vertex of G −{r } as dis-
tance from r . Then, we slice up the graph into layers of subgraphs, where each subgraph
is solved independently. The clique covers for the layers are then merged to construct a
solution for the entire graph G . The edges on the boundary of each layer are covered by
cliques from adjacent layers, which can be bounded to obtain an approximation ratio.
Algorithm 5.3.1 gives the details, and Figure 5.3 shows a schematic of the approach.
Algorithm 5.3.1: PTAS for Planar Clique Cover
Input: A planar graph G = (V, E ), 0 < ε < 1
Output: A clique cover C of G .
r ← an arbitrary vertex of G ;
foreach vertex v ∈ V \ {r } do
    level(v ) ← distance from r ;
end
k ← ⌈2/ε⌉;
for i = 0, 1, . . . , k − 1 do
    for j = 0, 1, 2, . . . do
        $G_{ij}$ ← subgraph of G induced by the vertices at levels j k + i through (j + 1)k + i ;
        $C_{ij}$ ← minimum clique cover for $G_{ij}$;
    end
end
for i = 0, 1, . . . , k − 1 do
    $C_i$ ← $\bigcup_j C_{ij}$;
end
Return a clique cover from {$C_0$, . . . , $C_{k-1}$} with the minimum size.
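The layering step of Algorithm 5.3.1 can be sketched as a breadth-first search followed by slicing the levels into overlapping windows. The code below is an illustration only: the function names are hypothetical, the graph is assumed to be an adjacency dictionary, and, as a simplifying assumption of this sketch, the first window is started one step early so that levels below i are not dropped.

```python
from collections import deque


def bfs_levels(adj, root):
    """Distance from root for every reachable vertex; adj maps vertex -> neighbours."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def slices(level, k, i):
    """Vertex sets of the windows of k + 1 consecutive levels for shift i."""
    max_level = max(level.values())
    parts = []
    lo = i - k  # start one window early so that levels below i are kept
    while lo <= max_level:
        hi = lo + k
        part = {v for v, d in level.items() if lo <= d <= hi}
        if part:
            parts.append(part)
        lo = hi  # consecutive windows share exactly one level
    return parts


# Path 0-1-...-6 rooted at 0, so that the level of vertex v is v itself.
adj = {v: [u for u in (v - 1, v + 1) if 0 <= u <= 6] for v in range(7)}
levels = bfs_levels(adj, 0)
```

Because consecutive windows share a level, every edge of the graph falls inside at least one window, which is what allows the per-window clique covers to be merged into a cover of G .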
Lemma 5.13. Algorithm 5.3.1 finds a clique cover of size at most (1+ε)OPT.
Proof. Let Q∗ denote the optimum solution, and given some congruence class i mod k ,
let us assume that the above approximation scheme divides the problem into m pieces.
Since each piece is solved exactly, we have
$|C_i| = \sum_{j=1}^{m} |C_{ij}| \le \sum_{j=1}^{m} |Q^*_{ij}|,$
where $Q^*_{ij}$ denotes the optimum solution restricted to the graph $G_{ij}$. We therefore need
to compare the two values $\sum_{j=1}^{m} |Q^*_{ij}|$ and $|Q^*|$. To compare these, consider the number
of cliques in Q∗ that contain vertices at levels congruent to i mod k . For planar graphs, each clique
belongs to at most 2 consecutive levels. Therefore, for at least one value of i , 0 ≤ i ≤ k − 1,
there are at most $\tfrac{2}{k}|Q^*|$ cliques containing a vertex at a level congruent to i mod k . Since these cliques
Figure 5.3: Planar clique cover using Baker’s technique. (a) For each value of 0 ≤ i ≤ k − 1, the graph issliced into layers; (b) Each layer is treated as an independent subproblem; (c) Each layer is solved usingour clique cover algorithm on planar graphs with bounded branchwidth; (d) The solutions from the layersare pasted together. The number of edges doubly-covered by cliques from adjacent layers can be boundedto give the approximation ratio.
are double counted in the sum $\sum_{j=1}^{m} |Q^*_{ij}|$, we have
$\sum_{j=1}^{m} |Q^*_{ij}| \le \left(1 + \tfrac{2}{k}\right) |Q^*|$
and it follows that $|C_i| \le \left(1 + \tfrac{2}{k}\right) |Q^*|$.
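The final step from this bound to the ratio stated in Lemma 5.13 uses only the choice of k in Algorithm 5.3.1. Spelled out:

```latex
\[
  k = \left\lceil \frac{2}{\varepsilon} \right\rceil \ge \frac{2}{\varepsilon}
  \;\Longrightarrow\;
  \frac{2}{k} \le \varepsilon
  \;\Longrightarrow\;
  |C_i| \le \left(1 + \frac{2}{k}\right)|Q^*| \le (1 + \varepsilon)\,|Q^*| .
\]
```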
What remains now is an algorithm to compute the clique cover for each G i j . Note
that Lemma 5.12 provides an algorithm for planar graphs with bounded branchwidth,
and thus it suffices to show that each G i j also has bounded branchwidth. Tamaki’s
theorem [132] directly provides an answer.
Definition 5.14. Given a plane-embedded graph G , the face-incidence graph $\bar{G} = (\bar{V}, \bar{E})$
of G has one vertex in $\bar{V}$ for each face of G , and two vertices in $\bar{V}$ are joined by an edge if
and only if the corresponding faces in G share a vertex.
Theorem 5.15. [132] Given a planar graph G , there is a linear time algorithm that finds
a branch decomposition with width at most the radius of the face-incidence graph of G .
Since the face-incidence graph of G i j has a bounded radius, each G i j is a planar
graph with bounded branchwidth. Therefore, our algorithm from Lemma 5.12 can com-
pute the minimum clique cover for the subgraph in each G i j .
Recall that our dynamic programming algorithm for graphs with branchwidth ≤ k
runs in time $2^{O(k)}\,O(n)$. Due to Theorem 5.6, directly applying this algorithm to planar
graphs gives an exact solution in time $2^{O(\sqrt{n})}\,O(n)$. On the other hand, the PTAS using
Baker’s technique involves decomposing the graph into n/k layers, and solving each
piece exactly in time $2^{O(k)}\,O(n/k)$. Trying every congruence class between 0 and k − 1,
the overall algorithm takes $k \cdot \lceil n/k \rceil \cdot 2^{O(k)}\,O(n/k) \approx 2^{O(k)}\,O(n^2/k)$ time to obtain a solution of value
at most $(1 + \tfrac{2}{k})\,\mathrm{OPT}$. Therefore, while our PTAS provides an approximation with a tunable
approximation ratio, if one wants a solution any closer than $(1 + \tfrac{1}{\sqrt{n}})\,\mathrm{OPT}$,
one may be better off running our dynamic programming algorithm directly on the graph
to obtain an exact solution. This trade-off is clearly shown empirically in Table 5.2.
Theorem 5.16. There is a PTAS for planar clique cover.
5.4 Experimental Results
We have implemented the described algorithms and tested them against both real bi-
ological PPI data and simulated data. As a comparison, we considered Gramm et al.’s
algorithm [66] for the decision problem version of clique cover: given the size of clique
cover k as an input parameter, their algorithm works by first applying a set of reduc-
tion rules to reduce the problem instance, and then using a search tree algorithm on the
reduced instance, in time exponential in k . While their algorithm works well in cases
where the solution size is small, it performed poorly against our test data which con-
tains several hundreds of cliques. This is mainly because their reduction rules did not
significantly decrease the size of our input graphs (especially biological networks): the
solutions for the reduced instances would still contain a large number of cliques, result-
ing in inefficient running time for the search tree algorithm. As their input parameter is
the size of the minimum clique cover, it motivates our development of approaches using
edge sparsity for these networks.
5.4.1 Simulated and Biological Networks
Each of our algorithms was tested on both simulated and actual biological networks.
First, Krogan et al. [93] obtained an extensive dataset on yeast protein interactions. Taking
the largest connected component from the dataset, with 323 vertices and 742 edges,
we formed a model network, $G_{\mathrm{Krogan}}$. Various studies have shown that PPI networks exhibit
the properties of scale-free networks [10]. Many generative models for scale-free
networks have also been proposed, and we used the two most frequently used models
[10, 32] to create test graphs for our algorithm: (1) preferential attachment model
(denoted by $G_{\mathrm{PAM}}$), and (2) duplication model (denoted by $G_{\mathrm{DM}}$). In both cases, we set
the parameters of the generative models so that the resulting networks show characteristics
similar to those of real PPI data: density of |E | ≈ 2|V |, and degree distribution $P(k) \sim k^{-\gamma}$
where γ ≈ 1.7.
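For readers unfamiliar with preferential attachment, the following is a minimal sketch of a generic generator of that kind. It is not the generator or the parameter setting used for our test graphs: the function name, the seed clique, and the choice of m = 2 edges per new vertex (picked only to give |E | ≈ 2|V |) are assumptions, and the sketch makes no attempt to reproduce the exponent γ ≈ 1.7.

```python
import random


def preferential_attachment(n, m=2, seed=None):
    """Grow a graph on vertices 0..n-1; each new vertex attaches to m distinct
    existing vertices chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = set()
    # Seed graph: a clique on m + 1 vertices, so every vertex has positive degree.
    for u in range(m + 1):
        for w in range(u + 1, m + 1):
            edges.add((u, w))
    # Each vertex appears in `stubs` once per incident edge, so a uniform
    # draw from `stubs` is a degree-proportional draw of a vertex.
    stubs = [x for e in sorted(edges) for x in e]
    for v in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.add((t, v))
            stubs.extend((t, v))
    return edges
```

With n = 300 and m = 2 this produces 3 + 297 · 2 = 597 edges, close to the target density of two edges per vertex.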
To investigate the behaviour of our algorithms on denser graphs, we generated a set
of partial k -trees. Recall that a k -tree is a maximal graph with treewidth k such that no
edge can be inserted without increasing its treewidth, and a graph is a partial k -tree if it
is a subgraph of a k -tree. The partial k -trees of a given treewidth were generated by
first generating a k -tree, and then randomly removing edges to obtain the desired edge density.
Network   n     m     tw   ρ    tree decomp. (hh:mm)   clique cover algo. (hh:mm)   # of cliques
Krogan    323   742   11   21   2:13                   6:08                         482
PAM       300   634   14   29   2:41                   9:21                         421
DM        300   794   21   43   4:13                   17:47                        583
Table 5.1: Performance of treewidth-based exact clique cover algorithm; we show the treewidth of each graph, maximum # of edges per tree node (ρ), time taken to compute an optimal tree decomposition and a clique cover, and size of the solution.
Since the generation process of k -trees is similar to that of the preferential attachment
model, this allows us to create graphs with higher density than GPAM while preserving
low treewidth.
5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms
Table 5.1 reports results obtained on real and simulated biological networks. Both Kro-
gan’s PPI network and simulated networks exhibit relatively low treewidths for their size.
Figure 5.4 gives a more comprehensive view of the running times from empirical testing.
As expected, the running time increases linearly with n for graphs with fixed treewidths
(Figure 5.4(a)), but exponentially with treewidth k for graphs with fixed n (Figure 5.4(b)).
Running times for partial k -trees (Figures 5.4(c) and (d)) follow the same trends, al-
though they are somewhat higher due to the higher edge density.
While our branchwidth-based exact algorithm was designed for planar graphs, it
is easy to modify the algorithm to handle non-planar graphs with bounded branch-
width (but at the expense of higher time complexity due to non-planarity). Figure 5.5
shows that, in practice, the running time of the treewidth-based algorithm grows slower
than that of the branchwidth-based algorithm, allowing it to handle graphs with larger
treewidths.
[Figure 5.4 plots: y-axis is time (seconds) in all panels; x-axis is # of vertices in panels (a) and (c) with one curve per treewidth tw = 8, ..., 12, and treewidth in panels (b) and (d) with one curve per graph size (n = 50, ..., 300 in (b); n = 50, ..., 200 in (d)).]
Figure 5.4: Performance of the treewidth-based algorithm on simulated networks: (a) scale-free networks (PAM) with fixed treewidth; (b) scale-free networks (PAM) with fixed graph size; (c) partial k -trees with fixed treewidth; (d) partial k -trees with fixed graph size.
[Figure 5.5 plots: x-axis is treewidth (8 to 12); y-axis is log2(time); series are tw-based and bw-based. Linear fits printed in panel (a): y = 1.4219x − 0.2377 (R² = 0.9914) and y = 1.0604x + 0.8255 (R² = 0.9789); in panel (b): y = 1.3651x + 1.3206 (R² = 0.9838) and y = 0.8246x + 4.5411 (R² = 0.9959).]
Figure 5.5: Performance comparison of treewidth-based algorithm vs. branchwidth-based algorithm on scale-free networks; y -axes are shown in log scale to exemplify the difference in exponents in the running time. (a) scale-free networks with 50 vertices; (b) scale-free networks with 100 vertices; data points with the same treewidth values in each chart are taken from the same network.
5.4.3 Performance of PTAS for Planar Graphs
To test our PTAS on planar graphs, we generated a set of random planar graphs us-
ing the simple algorithm by Denise and Vasconcellos [40]. Then, we ran both our ex-
act branchwidth-based algorithm and the PTAS with varying values of ε to explore the
trade-off between running time and quality of the solution. As shown in Table 5.2, the
promised approximation ratios are almost exactly realized and substantial speed-ups
are obtained for relatively large values of ε, compared to the exact branch-decomposition
based algorithm. The running time increases exponentially with 1/ε, and the exact al-
gorithm starts to become faster than the PTAS when ε becomes sufficiently small.
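The claim about approximation ratios can be checked directly from the numbers reported in Table 5.2. The short script below is only a sketch of that arithmetic; the variable and function names are ours, and the values are copied from the table.

```python
OPT = 109  # size of the optimal cover reported in Table 5.2
results = {0.5: 159, 0.2: 124, 0.1: 116, 0.05: 113}  # epsilon -> # of cliques


def realized_ratios(results, opt):
    """Ratio of each PTAS solution size to the optimum."""
    return {eps: size / opt for eps, size in results.items()}


ratios = realized_ratios(results, OPT)
for eps, ratio in sorted(ratios.items(), reverse=True):
    # Each realized ratio should not exceed the guarantee 1 + epsilon.
    print(f"eps = {eps}: realized {ratio:.3f}, guaranteed {1 + eps:.2f}")
```

Every realized ratio stays below its guarantee of 1 + ε, and the gap to the guarantee is small for each tested value of ε.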
5.4.4 Clique Cover in Biological Networks
When executed on the yeast protein-protein interaction network of Krogan et al. [93],
our treewidth-based algorithm finds a clique cover that includes 93 cliques of size 5 or
ε      # of cliques   time (hh:mm:ss)
0.5    159            00:12:08
0.2    124            00:39:17
0.1    116            02:36:18
0.05   113            03:53:42
OPT    109            02:46:31
Table 5.2: Performance of PTAS for planar graph with n = 200, m = 407, tw = 10. OPT was obtained using the branch-decomposition based algorithm.
more. While the PPI data may not admit a unique clique cover on the network, we man-
ually verified that most discovered cliques correspond to known complexes, such as the
RNA polymerase II, the RSC complex, the mediator complex, and the 20S proteasome.
In addition, three highly overlapping complexes, SWR1 (a chromatin remodelling com-
plex), NuA4 (a histone acetyltransferase complex), and INO80 (another chromatin re-
modelling complex) are correctly identified, despite the fact that SWR1 and NuA4 share
three subunits (ARP4, GOD1, and YAF9) and SWR1 and INO80 share four (ARP4, GOD1,
RVB1, RVB2). This suggests that our algorithm is capable of identifying biologically rel-
evant protein complexes, even those that share a significant number of subunits.
5.5 Discussions
In this chapter, we studied the clique cover problem on sparse networks as measured
by treewidth and branchwidth, with an application to protein-complex discovery in PPI
networks. We gave exact polynomial-time algorithms for graphs with bounded treewidth
and bounded branchwidth, and built on the latter using Baker’s technique to obtain a
polynomial time approximation scheme for planar graphs.
Our empirical studies show that the biological networks as well as synthetic net-
works with similar characteristics (e.g. edge density, degree distribution) indeed exhibit
low treewidth, and our algorithms showed practical running times on these test networks.
Moreover, our branchwidth-based PTAS shows practical running time
for computing solutions close to the optimal.
In proteomics research, existing experimental methods for detecting binary interactions
often suffer from false negatives, i.e., some edges are not detected in the experiments.
Therefore, in the direction towards hypergraph modelling of PPI networks, one
may wish to cover the edges with quasi-cliques. Here, the optimization function needs
to be modified slightly. One possible formulation may be to find a minimum cardinality
quasi-clique cover, where a quasi-clique is defined by some lower bound on the edge
density. On the other hand, there may be classes of graphs other than the ones discussed
here that admit polynomial time exact algorithms, for example, graphs with bounded
genus or bounded degree.
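One way to make the quasi-clique condition concrete is a density test on a candidate vertex set. The sketch below assumes the density is measured as the fraction of vertex pairs joined by an edge and that the threshold is a parameter gamma; both the function names and this particular definition are illustrative choices rather than a fixed formulation.

```python
from itertools import combinations


def edge_density(vertices, edges):
    """Fraction of vertex pairs in `vertices` that are joined by an edge."""
    vs = set(vertices)
    if len(vs) < 2:
        return 1.0
    present = {frozenset(e) for e in edges}
    pairs = [frozenset(p) for p in combinations(vs, 2)]
    return sum(p in present for p in pairs) / len(pairs)


def is_quasi_clique(vertices, edges, gamma):
    """True if the vertex set has edge density at least gamma."""
    return edge_density(vertices, edges) >= gamma


# K4 with the edge c-d missing: 5 of the 6 possible pairs are present.
E = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
```

Under this definition a complex with one undetected interaction among four subunits still qualifies for any threshold up to 5/6.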
5.6 Bibliographic Notes
The results discussed in this chapter have been published in [14].
Chapter 6
Conclusion and Future Directions
Researchers in systems biology strive to uncover complex interactions in biological sys-
tems, and protein-protein interactions lie at the heart of their efforts. As opposed to the
classical reductionist paradigm, systems biologists consider biological phenomena as a
complex system, and thus high throughput experimental technologies are vital to their
studies. Indeed, the recent development of high throughput techniques has provided a
huge momentum – the amount of PPI data available is rapidly growing, and large-scale
studies of the human proteome are also underway.
Unfortunately, high throughput technologies remain in their infancy, and produce a
large amount of data with relatively poor accuracy. As a result, the accumulating data in
the literature presents us with both an opportunity and a challenge. While the analysis
of the PPI networks provides invaluable information on the inner workings of the cell,
the significant amount of noise within the data makes it difficult to handle. AP-MS is a
good example of such a technology that suffers from noisy output data.
Noise in proteomics data can often be classified into two distinct types [60]: stochastic
errors and systematic errors. Stochastic errors are measurement errors with random
variability, which can be reduced by simply repeating the experiment. Systematic er-
rors, on the other hand, are recurrent in the measurements, and thus require a careful
modelling of every step of the experiment in order to correctly interpret the data.
In this thesis, we considered the overall pipeline of the AP-MS method, and looked
for sources of systematic errors. Here we highlight the novel contributions put forward
in this thesis, and discuss future research directions.
Protein Quantification with Shared Peptides. Chapter 3 is concerned with protein
quantification from quantitative MS data. While significant progress has been made, the
problem of protein identification (and quantification) still remains unresolved due to
shared peptides prevalent in many MS datasets [108]. We proposed several approaches
to protein quantification in the presence of shared peptides by defining a set of opti-
mization problems based on a clean combinatorial model.
One of the limitations in our approach comes from the assumption that every pep-
tide can be detected by MS equally well – and thus the errors in the peptide abundances
are equally likely. While we have empirically tested the robustness of our algorithms by
varying the coverage of the data (see Section 3.4.1), an approach that directly models the
noisy MS data would likely yield more practical results.
Open Problem 6.1. Does there exist an approach for protein quantification with shared
peptides that incorporates peptide detectability?
Peptide detectability may be estimated either using a suitably large MS/MS dataset [2,
133] or using the physical properties of each peptide [99]. In Section 3.5, we discussed
possible ways to extend our combinatorial approach by incorporating peptide detectabil-
ity as weighted errors. Alternatively, we may also attack this problem using a probabilis-
tic framework where observed peptide abundances are thought to be random variables
that incorporate detectability. Then, we can look for protein abundances maximizing
the probability of observing the given peptide abundances using various Bayesian ap-
proaches.
Predicting Direct PPI Network. In Chapter 4, we studied the problem of distinguish-
ing direct interactions from indirect ones within the PPI data from AP-MS experiments.
We tackled this problem by proposing a probabilistic graph model, and formulated the
DIGCOM problems to identify the direct interaction network from quantitative PPI data.
Our algorithm consists of three main phases: (1) Identify weakly connected vertices (and
their neighbours); (2) Discover dense clusters in the network; (3) Identify remaining di-
rect interactions via a genetic algorithm.
As discussed in Section 4.5, the DIGCOM problems raise a number of challenging
computational questions for further investigation. From the standpoint of computational
complexity, the hardness of these problems is not yet known, and we conjecture that they
are NP-hard.
Open Problem 6.2. What is the computational complexity of E-DIGCOM and A-DIGCOM?
On the other hand, the probabilistic graph model is built around several simplifying
assumptions that need not hold in practice. For example, the abundance of protein complexes
is not constant, and the strength of physical interactions is non-uniform, as some
interactions may be more prone to disruption by the affinity purification process than others.
Open Problem 6.3. Given sufficient AP-MS data, can we generalize the DIGCOM prob-
lems to handle individual interaction strengths and abundances?
In fact, if we were given the individual interaction strength for each pair of proteins
(possibly from a complementary dataset such as protein co-crystallization or physical
models), the second and third phases of our algorithm could easily be modified to handle
non-uniform interaction probabilities. The first phase of the algorithm would require
more careful modifications to handle an arbitrary survival probability for each edge.
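To make the notion of per-edge survival probabilities concrete, the following Python
sketch estimates, by Monte Carlo sampling, the probability that two proteins remain
connected when every edge survives independently with its own probability. It is not
the algorithm of Chapter 4: the function name, the dictionary input format, and the
independence of edge failures are assumptions made for illustration.

```python
import random


def connection_probability(edge_probs, source, target, trials=10000, seed=0):
    """Estimate the probability that source and target stay connected.

    edge_probs maps an undirected edge (u, v) to its survival probability.
    Each edge is assumed to survive independently of the others.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(trials):
        # Sample one realisation of the network: keep each edge with its own probability.
        adjacency = {}
        for (u, v), p in edge_probs.items():
            if rng.random() < p:
                adjacency.setdefault(u, set()).add(v)
                adjacency.setdefault(v, set()).add(u)
        # Depth-first search from source in the sampled network.
        seen = {source}
        stack = [source]
        while stack:
            node = stack.pop()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if target in seen:
            hits += 1
    return hits / trials


# Hypothetical three-protein path A - B - C with non-uniform edge strengths.
example = {("A", "B"): 0.9, ("B", "C"): 0.5}
estimate = connection_probability(example, "A", "C")  # true value is 0.9 * 0.5 = 0.45
```

Sampling is used here because computing such connection probabilities exactly is hard
in general (see [138]); the estimate only approaches the true value as the number of
trials grows.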
PPI Networks as Hypergraphs. In Chapter 5, we considered the problem of modelling
PPI networks as hypergraphs via the (edge) clique cover problem on the binary PPI net-
work. In particular, we devised an exact algorithm for graphs with bounded treewidth
which showed promising performance on both simulated data and real PPI data. Fur-
thermore, our theoretical pursuit on clique cover for planar graphs with bounded branch-
width resulted in a PTAS for planar graphs.
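The correspondence between an edge clique cover and a hypergraph can be illustrated
with a small validity check. The Python sketch below is not the exact algorithm of
Chapter 5; it only verifies that a candidate family of vertex sets, each read as one
hyperedge (a putative complex), is an edge clique cover of a binary network. The
function name and the input format are assumptions.

```python
from itertools import combinations


def is_edge_clique_cover(edges, cliques):
    """Return True if every set in cliques is a clique and together they cover all edges.

    edges   : iterable of 2-element tuples (undirected edges of the binary network)
    cliques : iterable of vertex sets, each interpreted as one hyperedge
    """
    edge_set = {frozenset(e) for e in edges}
    covered = set()
    for clique in cliques:
        for pair in combinations(clique, 2):
            edge = frozenset(pair)
            if edge not in edge_set:
                return False  # two members do not interact, so this is not a clique
            covered.add(edge)
    return covered == edge_set  # every edge must lie inside at least one clique


# Hypothetical network: a triangle A, B, C plus a pendant edge C - D.
network = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
valid = is_edge_clique_cover(network, [{"A", "B", "C"}, {"C", "D"}])
```

Checking a cover is straightforward; the difficulty addressed in Chapter 5 lies in
finding a cover with few cliques.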
However, existing experimental techniques for detecting binary interactions often
suffer from false negatives, i.e., some edges are missing in the PPI network. Conse-
quently, protein complexes that should ideally form cliques instead appear as quasi-
cliques or simply dense subgraphs. In the direction towards modelling PPI networks as
hypergraphs, therefore, we may wish to relax the assumption that each protein complex
forms a clique.
Open Problem 6.4. Does there exist a quasi-clique cover algorithm?
Quasi-cliques can be defined in various ways: one possible definition of a quasi-clique
is a dense subgraph with a lower bound on the edge density. Using this definition, one
may generalize our algorithm for graphs with bounded treewidth, and the same type of
dynamic programming algorithm can be conceived. However, our approach of finding cliques
in the local neighbourhood of a vertex would then have to be extended to multi-hop
neighbours, depending on the edge density, resulting in an increase in the running time.
Alternatively, one may devise an algorithm specialized for graphs with other properties
of PPI networks, for example the scale-freeness of networks.
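The density-based definition can be stated as a short Python sketch. The threshold
name gamma, the function names, and the adjacency-set representation are assumptions
introduced here; other definitions of a quasi-clique (for example, degree-based ones)
would need a different test.

```python
from itertools import combinations


def edge_density(adjacency, vertices):
    """Fraction of vertex pairs within `vertices` that are joined by an edge."""
    members = list(vertices)
    if len(members) < 2:
        return 1.0  # convention: a single vertex is trivially dense
    possible = len(members) * (len(members) - 1) // 2
    present = sum(
        1 for u, v in combinations(members, 2) if v in adjacency.get(u, set())
    )
    return present / possible


def is_quasi_clique(adjacency, vertices, gamma):
    """True if the induced subgraph has edge density at least gamma (0 < gamma <= 1)."""
    return edge_density(adjacency, vertices) >= gamma


# Hypothetical complex of four proteins with one missing interaction (A - D).
ppi = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
density = edge_density(ppi, ["A", "B", "C", "D"])  # 5 of the 6 possible edges
```

With gamma = 1 the test reduces to the ordinary clique condition, so a clique cover is
the special case of a quasi-clique cover with the strictest threshold.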
Many problems in PPI networks remain open for further study, and the work presented in
this thesis opens the door to several opportunities with important implications in
systems biology. One of the main pursuits in the field is a comprehensive understanding
of the proteome as a complex dynamic system, and any improvement in tackling the
problems discussed here would take us one step closer to this goal.
Bibliography
[1] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics (Oxford, England), 22(8):1021–1023, Apr. 2006.
[2] P. Alves, R. J. Arnold, M. V. Novotny, P. Radivojac, J. P. Reilly, and H. Tang. Advancement in protein inference from shotgun proteomics using peptide detectability. Pacific Symposium on Biocomputing, pages 409–420, 2007.
[3] S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. Society for Industrial and Applied Mathematics. Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.
[4] S. Arora. Computational Complexity: A Modern Approach. Cambridge University Press, 1 edition, 2009.
[5] S. Asthana. Predicting protein complex membership using probabilistic network reliability. Genome Research, 14(6):1170–1175, 2004.
[6] J. Azé, T. Bourquard, S. Hamel, A. Poupon, and D. Ritchie. Using Kendall-Tau Meta-Bagging to Improve Protein-Protein Docking Predictions. In Lecture Notes in Bioinformatics 7036, PRIB 2011, pages 284–295, 2011.
[7] G. D. Bader and C. W. V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics, 4:2, Jan. 2003.
[8] B. Baker. Approximation algorithms for NP-complete problems on planar graphs. Journal of the ACM, 41(1), 1994.
[9] M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster. Quantitative mass spectrometry in proteomics: a critical review. Analytical and bioanalytical chemistry, 389(4):1017–1031, Oct. 2007.
[10] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science (New York, N.Y.), 286(5439):509–512, Oct. 1999.
[11] P. L. Bartel, J. A. Roecklein, D. SenGupta, and S. Fields. A protein linkage map of Escherichia coli bacteriophage T7. Nature Genetics, 12(1):72–77, Jan. 1996.
[12] H. M. Berman, T. N. Bhat, P. E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. Westbrook. The protein data bank and the challenge of structural genomics. Nature structural biology, 7 Suppl:957–959, Nov. 2000.
[13] A. Bhan, D. Galas, and T. Dewey. A duplication growth model of gene expression networks. Bioinformatics, 18:1486–1493, 2002.
[14] M. Blanchette, E. Kim, and A. Vetta. Clique Cover on Sparse Networks. In The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX), Kyoto, Japan, 2012.
[15] M. Blatt, S. Wiseman, and E. Domany. Superparamagnetic Clustering of Data. Phys. Rev. Lett., 76:3251–3254, Apr 1996.
[16] H. Bodlaender and T. Kloks. A simple linear time algorithm for triangulating three-colored graphs. Journal of Algorithms, 15(1):160–172, 1993.
[17] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on Computing, 25(6):1305–1317, 1996.
[18] H. L. Bodlaender, M. R. Fellows, M. T. Hallett, H. T. Wareham, and T. J. Warnow. The hardness of perfect phylogeny, feasible register assignment and other problems on thin colored graphs. Theoretical Computer Science, 244(1-2):167–188, 2000.
[19] A. Breitkreutz, H. Choi, J. Sharom, L. Boucher, V. Neduva, B. Larsen, Z. Y. Lin, B. Breitkreutz, C. Stark, G. Liu, J. Ahn, D. Dewar-Darch, Z. Qin, T. Pawson, A. Gingras, A. I. Nesvizhskii, and M. Tyers. Global architecture of the yeast kinome interaction network. Science, 2010.
[20] S. Brohée and J. van Helden. Evaluation of clustering algorithms for protein-protein interaction networks. BMC bioinformatics, 7:488, 2006.
[21] H. Brönnimann and M. Goodrich. Almost optimal set covers in finite VC-dimension. Discrete & Computational Geometry, 14:463–479, 1995. doi:10.1007/BF02570718.
[22] P. C. Carvalho, J. Hewel, V. C. Barbosa, and J. R. Yates. Identifying differences in protein expression levels by spectral counting and feature selection. Genetics and molecular research: GMR, 7(2):342–356, 2008.
[23] M. R. Cerioli, L. Faria, T. O. Ferreira, C. A. J. Martinhon, F. Protti, and B. Reed. Partition into cliques for cubic graphs: planar case, complexity and approximation. Discrete Applied Mathematics, 156(12):2270–2278, 2008.
[24] M.-S. Chang and H. Müller. On the tree-degree of graphs. In Graph-theoretic Concepts in Computer Science (Boltenhagen, 2001), pages 44–54, 2001.
[25] C. Chekuri, K. L. Clarkson, and S. Har-Peled. On the set multi-cover problem in geometric settings. In Proceedings of the 25th Annual Symposium on Computational geometry, SCG ’09, pages 341–350, New York, NY, USA, 2009. ACM.
[26] J. Chen, W. Hsu, M. L. Lee, and S. Ng. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22(16):1998–2004, Aug. 2006.
[27] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng. Systematic assessment of high-throughput experimental data for reliable protein interactions using network topology. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI ’04, pages 368–372, Washington, DC, USA, 2004. IEEE Computer Society.
[28] Q. Cheng, P. Berman, R. Harrison, and A. Zelikovsky. Efficient alignments of metabolic networks with bounded treewidth. In IEEE International Conference on Data Mining Workshops (ICDMW), pages 687–694, 2010.
[29] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
[30] H. Choi, D. Fermin, and A. I. Nesvizhskii. Significance analysis of spectral count data in label-free shotgun proteomics. Molecular & Cellular Proteomics, 7(12):2373–2385, December 2008.
[31] V. Choi. Yucca: an efficient algorithm for small-molecule docking. Chemistry & biodiversity, 2(11):1517–1524, Nov. 2005.
[32] F. Chung, L. Lu, T. G. Dewey, and D. J. Galas. Duplication models for biological networks. Journal of Computational Biology, 10(5):677–687, Oct. 2003.
[33] P. Cloutier, R. Al-Khoury, M. Lavallée-Adam, D. Faubert, H. Jiang, C. Poitras, A. Bouchard, D. Forget, M. Blanchette, and B. Coulombe. High-resolution mapping of the protein interaction network for the human transcription machinery and affinity purification of RNA polymerase II-associated complexes. Methods (San Diego, Calif.), 48(4):381–386, Aug. 2009.
[34] C. J. Colbourn. The Combinatorics of Network Reliability. Oxford University Press, Inc., 1987.
[35] S. R. Collins, P. Kemmeren, X.-C. Zhao, J. F. Greenblatt, F. Spencer, F. C. P. Holstege, J. S. Weissman, and N. J. Krogan. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & cellular proteomics: MCP, 6(3):439–450, Mar. 2007.
[36] B. Coulombe, M. Blanchette, and C. Jeronimo. Steps towards a repertoire of comprehensive maps of human protein interaction networks: the Human Proteotheque Initiative (HuPI). Biochemistry and Cell Biology, 86(2):149–156, Apr. 2008.
[37] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324–328, 1998.
[38] B. Dengiz, F. Altiparmak, and A. Smith. Efficient optimization of all-terminal reliable networks, using an evolutionary approach. Reliability, IEEE Transactions on, 46(1):18–26, 1997.
[39] B. Dengiz, F. Altiparmak, and A. Smith. Local search genetic algorithm for optimal design of reliable networks. Evolutionary Computation, IEEE Transactions on, 1(3):179–188, 1997.
[40] A. Denise and M. Vasconcellos. The random planar graph. Congressus Numerantium, 113:61–79, 1996.
[41] F. Dorn, E. Penninkx, H. Bodlaender, and F. Fomin. Efficient exact algorithms on planar graphs: exploiting sphere cut branch decompositions. Algorithms–ESA 2005, 3669:95–106, 2005.
[42] B. Dost, N. Bandeira, X. Li, Z. Shen, S. P. Briggs, and V. Bafna. Accurate mass spectrometry based protein quantification via shared peptides. Journal of Computational Biology, 19, 2012.
[43] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. QNet: a tool for querying protein interaction networks. Journal of Computational Biology, 15(7):913–925, 2008.
[44] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 95(25):14863–14868, Dec. 1998.
[45] D. Eppstein. Subgraph isomorphism in planar graphs and related problems. J. Graph Algorithms and Applications, 3(3):1–27, 1999.
[46] D. Eppstein. Diameter and treewidth in minor-closed graph families. Algorithmica, 27:275–291, 2000.
[47] G. Even, D. Rawitz, and S. Shahar. Hitting sets when the VC-dimension is small. Inf. Process. Lett., 95(2):358–362, July 2005.
[48] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.
[49] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–6, 1989.
[50] R. L. Finley and R. Brent. Interaction mating reveals binary and ternary connections between Drosophila cell cycle regulators. Proceedings of the National Academy of Sciences of the United States of America, 91(26):12980–12984, Dec. 1994.
[51] L. Florens, M. Washburn, J. Raine, R. Anthony, M. Grainger, J. Haynes, J. Moch, N. Muster, J. Sacci, D. Tabb, et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419(6906):520–526, 2002.
[52] F. V. Fomin and D. M. Thilikos. New upper bounds on the decomposability of planar graphs. Journal of Graph Theory, 51(1):53–81, 2006.
[53] M. Fromont-Racine, J. Rain, and P. Legrain. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet, 16(3):277–282, July 1997.
[54] A. Galarneau, M. Primeau, L.-E. Trudeau, and S. W. Michnick. β-Lactamase protein fragment complementation assays as in vivo and in vitro sensors of protein-protein interactions. Nature Biotechnology, 20(6):619–622, June 2002.
[55] J. Gao, G. Opiteck, M. Friedrichs, A. Dongre, and S. Hefta. Changes in the protein expression of yeast as a function of carbon source. J Proteome Res, 2(6):643–649, 2003.
[56] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[57] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976.
[58] A. Gavin, M. Bösche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. Michon, C. Cruciat, M. Remor, C. Höfert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.
[59] L. Y. Geer, S. P. Markey, J. A. Kowalak, L. Wagner, M. Xu, D. M. Maynard, X. Yang, W. Shi, and S. H. Bryant. Open mass spectrometry search algorithm. Journal of proteome research, 3(5):958–964, Aug. 2004.
[60] R. Gentleman and W. Huber. Making the most of high-throughput protein-interaction data. Genome Biology, 8(10):112, 2007.
[61] S. A. Gerber, J. Rush, O. Stemman, M. W. Kirschner, and S. P. Gygi. Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proceedings of the National Academy of Sciences, 100(12):6940–6945, 2003.
[62] E. Gilbert. Enumeration of labeled graphs. Canad. J. Math, 8:405–411, 1956.
[63] J. Gilmore, D. Auberry, J. Sharp, A. White, K. Anderson, and D. Daly. A Bayesian estimator of protein-protein association probabilities. Bioinformatics, 24(13):1554–5, 2008.
[64] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, January 1989.
[65] R. Gordân, A. J. Hartemink, and M. L. Bulyk. Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research, 19(11):2090–2100, Nov. 2009.
[66] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Data reduction and exact algorithms for clique cover. ACM Journal of Experimental Algorithmics, 13, 2009.
[67] J. Gross and J. Yellen. Handbook of Graph Theory (Discrete Mathematics and its Applications). CRC, 2003.
[68] R. Halin. S-functions for graphs. Journal of Geometry, 8:171–186, 1976. doi:10.1007/BF01917434.
[69] E. Hartuv, A. O. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. An algorithm for clustering cDNA fingerprints. Genomics, 66(3):249–256, 2000.
[70] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sørensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.
[71] D. N. Hoover. Complexity of graph covering problems for graphs of low degree. Journal of Combinatorial Mathematics and Combinatorial Computing, 11:187–208, 1992.
[72] H. B. Hunt III, M. V. Marathe, V. Radhakrishnan, and R. E. Stearns. The complexity of planar counting problems. SIAM Journal on Computing, 27(4):1142–1167 (electronic), 1998.
[73] Y. Ishihama, Y. Oda, T. Tabata, T. Sato, T. Nagasu, J. Rappsilber, and M. Mann. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol Cell Proteomics, 4(9):1265–1272, 2005.
[74] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4569–4574, 2001.
[75] T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences of the United States of America, 97(3):1143–1147, Feb. 2000.
[76] R.-H. Jan, F.-J. Hwang, and S.-T. Chen. Topological optimization of a communication network subject to a reliability constraint. Reliability, IEEE Transactions on, 42(1):63–70, 1993.
[77] J. Janin, K. Henrick, J. Moult, L. T. Eyck, M. J. E. Sternberg, S. Vajda, I. Vakser, S. J. Wodak, and Critical Assessment of PRedicted Interactions. CAPRI: a Critical Assessment of PRedicted Interactions. Proteins, 52(1):2–9, July 2003.
[78] L. J. Jensen and P. Bork. Not comparable, but complementary. Science, 322(5898):56–57, 2008.
[79] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 05 2001.
[80] R. Jin, S. Mccallen, C.-C. Liu, Y. Xiang, E. Almaas, and X. J. Zhou. Identifying dynamic network modules with temporal and spatial constraints. In Pacific Symposium on Biocomputing, pages 203–214, 2009.
[81] S. Jin, D. Daly, D. Springer, and J. Miller. The effects of shared peptides on protein quantitation in label-free proteomics by LC/MS/MS. Journal of Proteome Research, 7(1):164–169, 2008.
[82] R. Jothi and P. T. M. Computational approaches to predict protein-protein and domain-domain interactions. In M. I. and Z. A., editors, Bioinformatics Algorithms: Techniques and Applications. Wiley Press, 2008.
[83] V. Kann. On the Approximability of NP-complete Optimization Problems. PhD thesis, Department of Numerical Analysis and Computing Science, Royal Institute of Technology, Stockholm, 1992.
[84] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. Journal of the ACM, 51(3):497–515, 2004.
[85] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Analytical Chemistry, 60(20):2299–2301, 1988.
[86] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, 40(4):85–103, 1972.
[87] S. M. Khan, B. Franke-Fayard, G. R. Mair, E. Lasonder, C. J. Janse, M. Mann, and A. P. Waters. Proteome analysis of separated male and female gametocytes reveals novel sex-specific Plasmodium biology. Cell, 121(5):675–687, June 2005.
[88] E. Kim, A. Sabharwal, A. Vetta, and M. Blanchette. Predicting direct protein interactions from affinity purification mass spectrometry data. Algorithms for Molecular Biology, 5(1):34, 2010.
[89] E. Kim, A. Vetta, and M. Blanchette. Protein quantification with shared peptides using multi cover algorithms. In preparation, 2011.
[90] A. D. King, N. Przulj, and I. Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics (Oxford, England), 20(17):3013–3020, Nov. 2004.
[91] D. S. Kirkpatrick, S. A. Gerber, and S. P. Gygi. The absolute quantification strategy: a general procedure for the quantification of proteins and post-translational modifications. Methods, 35(3):265–273, 2005.
[92] T. Kloks. Treewidth: Computations and Approximations (Lecture Notes in Computer Science). Springer, 1 edition, Sept. 1994.
[93] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, T. Punna, J. M. Peregrín-Alvarez, M. Shales, X. Zhang, M. Davey, M. D. Robinson, A. Paccanaro, J. E. Bray, A. Sheung, B. Beattie, D. P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. R. Collins, S. Chandran, R. Haw, J. J. Rilstone, K. Gandi, N. J. Thompson, G. Musso, P. St Onge, S. Ghanny, M. H. Y. Lam, G. Butland, A. M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O’Shea, J. S. Weissman, C. J. Ingles, T. R. Hughes, J. Parkinson, M. Gerstein, S. J. Wodak, A. Emili, and J. F. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084):637–643, Mar. 2006.
[94] W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer, Dec. 2010.
[95] M. Lavallée-Adam, P. Cloutier, B. Coulombe, and M. Blanchette. Modeling Contaminants in AP-MS/MS Experiments. Journal of Proteome Research, 10(2):886–895, 2011.
[96] T. Leighton and S. Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 48:787–832, 1999.
[97] E. D. Levy and J. B. Pereira-Leal. Evolution and dynamics of protein interactions and networks. Current Opinion in Structural Biology, 18(3):349–357, 2008. Nucleic acids / Sequences and topology.
[98] M. Li, J. Wang, J. Chen, Z. Cai, and G. Chen. Identifying the overlapping complexes in protein interaction networks. International Journal of Data Mining and Bioinformatics, 4(1):91–108, 2010.
[99] Y. F. Li, R. J. Arnold, H. Tang, and P. Radivojac. The Importance of Peptide Detectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics. Journal of Proteome Research, 9(12):6288–6297, 2010.
[100] H. Liu, R. G. Sadygov, and J. R. Yates. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14):4193–4201, July 2004.
[101] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Journal of the ACM, 41(5):960–981, 1994.
[102] P. Mallick, M. Schirle, S. S. Chen, M. R. Flory, H. Lee, D. Martin, J. Ranish, B. Raught, R. Schmitt, T. Werner, B. Kuster, and R. Aebersold. Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotechnology, 25(1):125–131, Jan. 2007.
[103] S. W. Michnick. Protein fragment complementation strategies for biochemical network mapping. Current Opinion in Biotechnology, 14(6):610–617, 2003.
[104] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Jan. 2005.
[105] E. Mujuni and F. Rosamond. Parameterized complexity of the clique partition problem. In Proc. Fourteenth Computing: The Australasian Theory Symposium (CATS 2008), 2008.
[106] M. Narayanan, A. Vetta, E. E. Schadt, and J. Zhu. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Computational Biology, 6(4):e1000742, 13, 2010.
[107] A. Nesvizhskii. Protein identification by tandem mass spectrometry and sequence database searching. Methods Mol Biol., 367:87–119, 2007.
[108] A. I. Nesvizhskii and R. Aebersold. Interpretation of shotgun proteomic data: the protein inference problem. Molecular & cellular proteomics: MCP, 4(10):1419–1440, Oct. 2005.
[109] W. M. Old, K. Meyer-Arendt, L. Aveline-Wolf, K. G. Pierce, A. Mendoza, J. R. Sevinsky, K. A. Resing, and N. G. Ahn. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Molecular & Cellular Proteomics, 4(10):1487–1502, 2005.
[110] J. Orlin. Contentment in graph theory: covering graphs with cliques. Indagationes Mathematicae (Proceedings), 80(5):406–424, 1977.
[111] P. Pei and A. Zhang. A topological measurement for weighted protein interaction network. In Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE, pages 268–278, 2005.
[112] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America, 96(8):4285–4288, Apr. 1999.
[113] M. Penrose. Random Geometric Graphs (Oxford Studies in Probability). Oxford University Press, USA, July 2003.
[114] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.
[115] P. A. Pevzner, V. Dancík, and C. L. Tang. Mutation-tolerant protein identification by mass spectrometry. Journal of computational biology: a journal of computational molecular cell biology, 7(6):777–787, 2000.
[116] U. Pieles, W. Zürcher, M. Schär, and H. Moser. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry: a powerful tool for the mass and sequence analysis of natural and modified oligonucleotides. Nucleic Acids Research, 21(14):3191–3196, 1993.
[117] N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515, 2004.
[118] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm, and B. Séraphin. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods (San Diego, Calif.), 24(3):218–229, July 2001.
[119] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Séraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnology, 17(10):1030–1032, 1999.
[120] N. Robertson and P. D. Seymour. Graph minors. II. Algorithmic aspects of tree-width. Journal of algorithms, 7(3):309–322, 1986.
[121] N. Robertson and P. D. Seymour. Graph minors. X. Obstructions to tree-decomposition. Journal of Combinatorial Theory. Series B, 52(2):153–190, 1991.
[122] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucl. Acids Res., 30(5):1163–1168, Mar. 2002.
[123] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19(6):756–763, Apr. 2003.
[124] M. E. Sardiu, Y. Cai, J. Jin, S. K. Swanson, R. C. Conaway, J. W. Conaway, L. Florens, and M. P. Washburn. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proceedings of the National Academy of Sciences of the United States of America, 105(5):1454–1459, Feb. 2008.
[125] A. Schrijver. Combinatorial Optimization (3 volume, A, B, & C). Springer, 1 edition, Feb. 2003.
[126] R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Aug. 2000.
[127] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet, 31(1):64–68, 05 2002.
[128] M. E. Sowa, E. J. Bennett, S. P. Gygi, and J. W. Harper. Defining the human deubiquitinating enzyme interaction landscape. Cell, 138(2):389–403, July 2009.
[129] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America, 100(21):12123–12128, Oct. 2003.
[130] M. P. H. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe, and C. Wiuf. Estimating the size of the human interactome. Proceedings of the National Academy of Sciences, 105(19):6959–6964, 2008.
[131] K. Suhre. Inference of gene function based on gene fusion events. Springer Protocols, 396, November 2007.
[132] H. Tamaki. A linear time heuristic for the branch-decomposition of planar graphs. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832 of Lecture Notes in Computer Science, pages 765–775. Springer, 2003.
[133] H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, and P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (Oxford, England), 22(14):e481–8, 2006.
[134] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W. Michnick. An in vivo map of the yeast protein interactome. Science, 320(5882):1465–1470, 2008.
[135] J. A. Taylor and R. S. Johnson. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 11(9):1067–1075, 1997.
[136] R. Uehara. NP-complete problems on a 3-connected cubic planar graph and their application. Technical report, Tokyo Woman’s Christian University, Tokyo, Sept. 1996.
[137] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000.
[138] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
[139] S. M. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, The Netherlands, 2000.
[140] V. Vazirani. Approximation Algorithms. Springer, 2004.
[141] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 05 2002.
[142] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch, N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science (New York, N.Y.), 287(5450):116–122, Jan. 2000.
[143] X. Wang, J. Venable, P. LaPointe, D. M. Hutt, A. V. Koulov, J. Coppinger, C. Gurkan, W. Kellner, J. Matteson, H. Plutner, J. R. Riordan, J. W. Kelly, J. R. Yates, and W. E. Balch. Hsp90 cochaperone Aha1 downregulation rescues misfolding of CFTR in cystic fibrosis. Cell, 127(4):803–815, Nov. 2006.
[144] C. M. Whitehouse, R. N. Dreyer, M. Yamashita, and J. B. Fenn. Electrospray interface for liquid chromatographs and mass spectrometers. Analytical Chemistry, 57(3):675–679, 1985. PMID: 2581476.
[145] D. P. Williamson and D. B. Shmoys. The design of approximation algorithms. Cambridge University Press, 2011.
[146] I. Xenarios, L. Salwínski, X. J. Duan, P. Higney, S. M. Kim, and D. Eisenberg. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, 30(1):303–305, January 2002.
[147] A. Yamaguchi, K. F. Aoki, and H. Mamitsuka. Graph complexity of chemical compounds in biological pathways. In Genome Informatics Vol. 14, pages 376–377, 2003.
[148] P. Ye, B. D. Peyser, X. Pan, J. D. Boeke, F. A. Spencer, and J. S. Bader. Gene function prediction from congruent synthetic lethal interactions in yeast. Molecular systems biology, 1:2005.0026, 2005.
[149] H. Yu, P. Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. P. Roth, A. Barabasi, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104–110, Oct. 2008.
[150] B. Zhang, B. Park, T. Karpinets, and N. Samatova. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics, 24(7):979–86, 2008.
[151] B. Zhang, N. C. VerBerkmoes, M. A. Langston, E. Uberbacher, R. L. Hettich, and N. F. Samatova. Detecting differential and correlated protein expression in label-free shotgun proteomics. Journal of Proteome Research, 5(11):2909–2918, Nov. 2006.