Post on 14-Dec-2015
Analyzing Local Properties• Many local properties are important for the function of
your protein– Hydrophobic regions are potential transmembrane domains
– Coiled-coiled regions are potential protein-interaction domains
– Hydrophilic stretches are potential loops
• You can discover these regions– Using sliding-widow techniques (easy)
– Using prediction methods such as hidden Markov Models (more sophisticated)
Sliding-window Techniques• Ideal for identifying strong
signals• Very simple methods
– Few artifacts– Not very sensitive
• Use ProtScale on www.expasy.org
• Make the window the same size as the feature you’re looking for
www.expasy.org/cgi-bin/protscale.pl
www.expasy.org/cgi-bin/protscale.pl
Hphob. / Eisenberg
Using TMHMM
• TMHMM is the best method for predicting transmembrane domains
• TMHMM uses an HMM
• Its principle is very different from that of ProtScale
• TMHMM output is a prediction
Searching for PROSITE Patterns
• Search your protein against PROSITE on ExPAsy– www.expasy.org/tools/scanprosite
• PROSITE motifs are written as patterns– Short patterns are not very informative by themselves
– They only indicate a possibility
– Combine them with other information to draw a conclusion
• Remember: Not everything is in PROSITE !
www.expasy.org/tools/scanprosite
P12259
www.expasy.org/tools/scanprosite
Protein Domains
• Proteins are usually made of domains
• A domain is an autonomous folding unit
• Domains are more than 50 amino acids long
• It’s common to find these together:– A regulatory domain
– A binding domain
– A catalytic domain
www.ebi.ac.uk/InterProScan
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Secondary Structures
• Helix– Amino acid that twists like a spring
• Beta strand or extended– Amino acid forms a line without
twisting
• Random coils– Amino acid with a structure neither
helical nor extended
– Amino-acid loops are usually coils
bioinf.cs.ucl.ac.uk/psipred//?program=psipred
bioinf.cs.ucl.ac.uk/psipred//?program=psipred
bioinf.cs.ucl.ac.uk/psipred//?program=psipred
Servers
• www.predictprotein.org
• cubic.bioc.columbia.edu/predictprotein
• www.sdsc.edu/predicprotein
• www.cbi.pku.edu.cn/predictprotein
www.rcsb.org
www.rcsb.org
ncbi.nlm.nih.gov/BLAST
zhanglab.ccmb.med.umich.edu/I-TASSER/
zhanglab.ccmb.med.umich.edu/I-TASSER/
http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S102840/
R-programming
Introduction•R is:
– a suite of operators for calculations on arrays, in particular matrices,
– a large, coherent, integrated collection of intermediate tools for interactive data analysis,
– graphical facilities for data analysis and display either directly at the computer or on hardcopy
– a well developed programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
•The core of R is an interpreted computer language.– It allows branching and looping as well as modular programming using
functions. – Most of the user-visible functions in R are written in R, calling upon a smaller
set of internal primitives. – It is possible for the user to interface to procedures written in C, C++ or
FORTRAN languages for efficiency, and also to write additional primitives.
R and statisticso Packaging: a crucial infrastructure to efficiently produce, load
and keep consistent software libraries from (many) different sources / authors
o Statistics: most packages deal with statistics and data analysis
o State of the art: many statistical researchers provide their methods as R packages
Data Analysis and Presentation
• The R distribution contains functionality for large number of statistical procedures. – linear and generalized linear models– nonlinear regression models– time series analysis– classical parametric and nonparametric tests– clustering – smoothing
• R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.
R as a calculator
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> plot(sin(seq(0, 2*pi, length=100)))
0 20 40 60 80 100
-1.0
-0.5
0.0
0.5
1.0
Index
sin
(se
q(0
, 2 *
pi,
len
gth
= 1
00
))
Object orientation
primitive (or: atomic) data types in R are:
• numeric (integer, double, complex)• character• logical• function
out of these, vectors, arrays, lists can be built.
Object orientation
• Object: a collection of atomic variables and/or other objects that belong together
• Example: a microarray experiment• probe intensities• patient data (tissue location, diagnosis, follow-up)• gene data (sequence, IDs, annotation)
Parlance:• class: the “abstract” definition of it• object: a concrete instance• method: other word for ‘function’• slot: a component of an object
Object orientation
Advantages:
Encapsulation (can use the objects and methods someone else has written without having to care about the internals)
Generic functions (e.g. plot, print)
Inheritance (hierarchical organization of complexity)
Caveat:Overcomplicated, baroque program architecture…
variables
> a = 24
> b<-25> sqrt(a+b)[1] 7
> a = "The dog ate my homework"> sub("dog","cat",a)[1] "The cat ate my homework"
> a = (1+1==3)> a[1] FALSE
numeric
character string
logical
variables> paste("X", "Y")
> paste("X", "Y", sep = " + ")
> paste("Fig", 1:4)
> paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")
x<-2.17y<-as.character(x)z<-as.numeric(y)
Help(as)
vectors, matrices and arrays• vector: an ordered collection of data of the same type> a = c(1,2,3)> a*2[1] 2 4 6
• Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers
• In R, a single number is the special case of a vector with 1 element.
• Other vector types: character strings, logical
vectors, matrices and arrays
• matrix: a rectangular table of data of the same typeexample: the expression values for 10000 genes for 30
tissue biopsies: a matrix with 10000 rows and 30 columns.
• array: 3-,4-,..dimensional matrixexample: the red and green foreground and background
values for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.
Lists• vector: an ordered collection of data of the same type. > a = c(7,5,1)> a[2][1] 5
• list: an ordered collection of data of arbitrary types. > doe = list(name="john",age=28,married=F)> doe$name[1] "john "> doe$age[1] 28• Typically, vector elements are accessed by their index (an integer),
list elements by their name (a character string). But both types support both access methods.
Data frames
data frame: is like a spreadsheet.
It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:> a localisation tumorsize progressXX348 proximal 6.3 FALSEXX234 distal 8.0 TRUEXX987 proximal 10.0 FALSE
id<-c("xx348", "xx234", "xx987")locallization<-c("proximal", "distal", "proximal")progress<-c(F, T, F)tumorsize<-c(6.3, 8.0, 10.0)results<-data.frame(id, locallization , tumorsize, progress)
> results id locallization tumorsize progress1 xx348 proximal 6.3 FALSE2 xx234 distal 8.0 TRUE3 xx987 proximal 10.0 FALSE
results<-edit(results)
> summary(results) id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00
>x<-summary(results)>x id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00
SubsettingIndividual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name> results localisation tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0> results[3, 2][1] 10> a["XX987", "tumorsize"][1] 10> results["XX987",] localisation tumorsize progressXX987 proximal 10 0
SubsettingSubsetting> results localisation tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0> results[c(1,3),] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0> results[c(T,F,T),] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0> results$localisation[1] "proximal" "distal" "proximal"> results $localisation=="proximal"[1] TRUE FALSE TRUE> results[ results$localisation=="proximal", ] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0
subset rows by a vector of indices
subset rows by a logical vector
subset a column
comparison resulting in logical vector
subset the selected rows
results[2,]
results[2,2]
results[1:3,]
results[c(1,3),]
results[c(T,F,T),]
x<-summary(results)xX[2,2]
x = c(1, 1, 2, 3, 5, 8)
x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]
x[c(TRUE, FALSE)]
x == 1
x[x == 1]
x[x%%2 == 0]
y = c(1, 2, 3)
y[]=3
y
Matrix
• a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows
• only one mode (numeric, character, complex, or logical) allowed
• can be created using matrix()x<-matrix(data=0,nr=2,nc=2) orx<-matrix(0,2,2)
Data Frame
• several modes allowed within a single data frame
• can be created using data.frame()L<-LETTERS[1:4] #A B C Dx<-1:4 #1 2 3 4data.frame(x,L) #create data frame
• attach() and detach()– the database is attached to the R search path so that the database is searched by R
when it is evaluating a variable.– objects in the database can be accessed by simply giving their names
a=matrix(1:9, ncol = 3, nrow = 3)a
b=matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3)b
x=1:10y=11:20z=matrix(c(x,y))z
z=matrix(c(x,y),nrow=2)z
z=matrix(c(x,y),nrow=4)z
R code
max(z) min(z) length(z) mean(z) sd(z) sum(z)
index=c(15,27,34,10,9)welcome=c(13,26,30,10,7)paper=c(2,1,3,0,1)days=c("mon", "tues", "wed", "thurs", "fri")filenames=c("index.html", "welcom.png", "paper.pdf")downloads=matrix(c(index,welcome,paper), nrow=5, dimnames=list(days,filenames))downloads
filesizes = c(1624, 23172, 1234065)downloads%*%filesizes
image(as.matrix(downloads))
Factors
A character string can contain arbitrary text. Sometimes it is useful to use a limited vocabulary, with a small number of allowed words. A factor is a variable that can only take such a limited number of values, which are called levels.
expression<-factor(c("over","under","over","unchanged","under","under"))
levels(expression)
protein<-list("glucose oxidas", "1CF3", 63355)protein
protein<-list(name="glucose oxidas", accession="1CF3", weight=63355)
x<-c(16614, 50660, 6066, 6118)protein$GOIDs<-x
protein
class(protein)
length(protein)
attributes(protein)
Working directorysetwd("D:/data")
x<-read.table("profiles.csv", sep=",", header=TRUE)
x<-read.table("http://www.bixsolutions.net/profiles.csv", sep=",", header=TRUE)
matplot(x, type="l")
matplot(x, type="l", xlab="fraction", ylab="quantity", col=1:6, lty=1:5, lwd=2)
lty: line stylelwd: line width
xmax<-apply(x, 2, max)xmax
ymax<-apply(x, 1, max)ymax
Apply the max function on columns (2) or rows (1) of matrix x
cummean = function(x){n = length(x)y = numeric(n)z = c(1:n)y = cumsum(x)y = y/zreturn(y)
}
n = 10000z = rnorm(n)x = seq(1,n,1)y = cummean(z)X11()plot(x,y,type= 'l',main= 'Convergence Plot')
Apply the max function on columns (2) or rows (1) of matrix x