Analyzing Local Properties Many local properties are important for the function of your protein...

Post on 14-Dec-2015

214 views 0 download

Tags:

Transcript of Analyzing Local Properties Many local properties are important for the function of your protein...

Analyzing Local Properties• Many local properties are important for the function of

your protein– Hydrophobic regions are potential transmembrane domains

– Coiled-coiled regions are potential protein-interaction domains

– Hydrophilic stretches are potential loops

• You can discover these regions– Using sliding-widow techniques (easy)

– Using prediction methods such as hidden Markov Models (more sophisticated)

Sliding-window Techniques• Ideal for identifying strong

signals• Very simple methods

– Few artifacts– Not very sensitive

• Use ProtScale on www.expasy.org

• Make the window the same size as the feature you’re looking for

www.expasy.org/cgi-bin/protscale.pl

www.expasy.org/cgi-bin/protscale.pl

Hphob. / Eisenberg

Using TMHMM

• TMHMM is the best method for predicting transmembrane domains

• TMHMM uses an HMM

• Its principle is very different from that of ProtScale

• TMHMM output is a prediction

Searching for PROSITE Patterns

• Search your protein against PROSITE on ExPAsy– www.expasy.org/tools/scanprosite

• PROSITE motifs are written as patterns– Short patterns are not very informative by themselves

– They only indicate a possibility

– Combine them with other information to draw a conclusion

• Remember: Not everything is in PROSITE !

www.expasy.org/tools/scanprosite

P12259

www.expasy.org/tools/scanprosite

Protein Domains

• Proteins are usually made of domains

• A domain is an autonomous folding unit

• Domains are more than 50 amino acids long

• It’s common to find these together:– A regulatory domain

– A binding domain

– A catalytic domain

www.ebi.ac.uk/InterProScan

www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Secondary Structures

• Helix– Amino acid that twists like a spring

• Beta strand or extended– Amino acid forms a line without

twisting

• Random coils– Amino acid with a structure neither

helical nor extended

– Amino-acid loops are usually coils

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Servers

• www.predictprotein.org

• cubic.bioc.columbia.edu/predictprotein

• www.sdsc.edu/predicprotein

• www.cbi.pku.edu.cn/predictprotein

www.rcsb.org

www.rcsb.org

ncbi.nlm.nih.gov/BLAST

zhanglab.ccmb.med.umich.edu/I-TASSER/

zhanglab.ccmb.med.umich.edu/I-TASSER/

http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S102840/

R-programming

Introduction•R is:

– a suite of operators for calculations on arrays, in particular matrices,

– a large, coherent, integrated collection of intermediate tools for interactive data analysis,

– graphical facilities for data analysis and display either directly at the computer or on hardcopy

– a well developed programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.

•The core of R is an interpreted computer language.– It allows branching and looping as well as modular programming using

functions. – Most of the user-visible functions in R are written in R, calling upon a smaller

set of internal primitives. – It is possible for the user to interface to procedures written in C, C++ or

FORTRAN languages for efficiency, and also to write additional primitives.

R and statisticso Packaging: a crucial infrastructure to efficiently produce, load

and keep consistent software libraries from (many) different sources / authors

o Statistics: most packages deal with statistics and data analysis

o State of the art: many statistical researchers provide their methods as R packages

Data Analysis and Presentation

• The R distribution contains functionality for large number of statistical procedures. – linear and generalized linear models– nonlinear regression models– time series analysis– classical parametric and nonparametric tests– clustering – smoothing

• R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.

R as a calculator

> log2(32)

[1] 5

> sqrt(2)

[1] 1.414214

> seq(0, 5, length=6)

[1] 0 1 2 3 4 5

> plot(sin(seq(0, 2*pi, length=100)))

0 20 40 60 80 100

-1.0

-0.5

0.0

0.5

1.0

Index

sin

(se

q(0

, 2 *

pi,

len

gth

= 1

00

))

Object orientation

primitive (or: atomic) data types in R are:

• numeric (integer, double, complex)• character• logical• function

out of these, vectors, arrays, lists can be built.

Object orientation

• Object: a collection of atomic variables and/or other objects that belong together

• Example: a microarray experiment• probe intensities• patient data (tissue location, diagnosis, follow-up)• gene data (sequence, IDs, annotation)

Parlance:• class: the “abstract” definition of it• object: a concrete instance• method: other word for ‘function’• slot: a component of an object

Object orientation

Advantages:

Encapsulation (can use the objects and methods someone else has written without having to care about the internals)

Generic functions (e.g. plot, print)

Inheritance (hierarchical organization of complexity)

Caveat:Overcomplicated, baroque program architecture…

variables

> a = 24

> b<-25> sqrt(a+b)[1] 7

> a = "The dog ate my homework"> sub("dog","cat",a)[1] "The cat ate my homework"

> a = (1+1==3)> a[1] FALSE

numeric

character string

logical

variables> paste("X", "Y")

> paste("X", "Y", sep = " + ")

> paste("Fig", 1:4)

> paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")

x<-2.17y<-as.character(x)z<-as.numeric(y)

Help(as)

vectors, matrices and arrays• vector: an ordered collection of data of the same type> a = c(1,2,3)> a*2[1] 2 4 6

• Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers

• In R, a single number is the special case of a vector with 1 element.

• Other vector types: character strings, logical

vectors, matrices and arrays

• matrix: a rectangular table of data of the same typeexample: the expression values for 10000 genes for 30

tissue biopsies: a matrix with 10000 rows and 30 columns.

• array: 3-,4-,..dimensional matrixexample: the red and green foreground and background

values for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.

Lists• vector: an ordered collection of data of the same type. > a = c(7,5,1)> a[2][1] 5

• list: an ordered collection of data of arbitrary types. > doe = list(name="john",age=28,married=F)> doe$name[1] "john "> doe$age[1] 28• Typically, vector elements are accessed by their index (an integer),

list elements by their name (a character string). But both types support both access methods.

Data frames

data frame: is like a spreadsheet.

It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.

Example:> a localisation tumorsize progressXX348 proximal 6.3 FALSEXX234 distal 8.0 TRUEXX987 proximal 10.0 FALSE

id<-c("xx348", "xx234", "xx987")locallization<-c("proximal", "distal", "proximal")progress<-c(F, T, F)tumorsize<-c(6.3, 8.0, 10.0)results<-data.frame(id, locallization , tumorsize, progress)

> results id locallization tumorsize progress1 xx348 proximal 6.3 FALSE2 xx234 distal 8.0 TRUE3 xx987 proximal 10.0 FALSE

results<-edit(results)

> summary(results) id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00

>x<-summary(results)>x id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00

SubsettingIndividual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name> results localisation tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0> results[3, 2][1] 10> a["XX987", "tumorsize"][1] 10> results["XX987",] localisation tumorsize progressXX987 proximal 10 0

SubsettingSubsetting> results localisation tumorsize progressXX348 proximal 6.3 0XX234 distal 8.0 1XX987 proximal 10.0 0> results[c(1,3),] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0> results[c(T,F,T),] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0> results$localisation[1] "proximal" "distal" "proximal"> results $localisation=="proximal"[1] TRUE FALSE TRUE> results[ results$localisation=="proximal", ] localisation tumorsize progressXX348 proximal 6.3 0XX987 proximal 10.0 0

subset rows by a vector of indices

subset rows by a logical vector

subset a column

comparison resulting in logical vector

subset the selected rows

results[2,]

results[2,2]

results[1:3,]

results[c(1,3),]

results[c(T,F,T),]

x<-summary(results)xX[2,2]

x = c(1, 1, 2, 3, 5, 8)

x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]

x[c(TRUE, FALSE)]

x == 1

x[x == 1]

x[x%%2 == 0]

y = c(1, 2, 3)

y[]=3

y

Matrix

• a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows

• only one mode (numeric, character, complex, or logical) allowed

• can be created using matrix()x<-matrix(data=0,nr=2,nc=2) orx<-matrix(0,2,2)

Data Frame

• several modes allowed within a single data frame

• can be created using data.frame()L<-LETTERS[1:4] #A B C Dx<-1:4 #1 2 3 4data.frame(x,L) #create data frame

• attach() and detach()– the database is attached to the R search path so that the database is searched by R

when it is evaluating a variable.– objects in the database can be accessed by simply giving their names

a=matrix(1:9, ncol = 3, nrow = 3)a

b=matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3)b

x=1:10y=11:20z=matrix(c(x,y))z

z=matrix(c(x,y),nrow=2)z

z=matrix(c(x,y),nrow=4)z

R code

max(z) min(z) length(z) mean(z) sd(z) sum(z)

index=c(15,27,34,10,9)welcome=c(13,26,30,10,7)paper=c(2,1,3,0,1)days=c("mon", "tues", "wed", "thurs", "fri")filenames=c("index.html", "welcom.png", "paper.pdf")downloads=matrix(c(index,welcome,paper), nrow=5, dimnames=list(days,filenames))downloads

filesizes = c(1624, 23172, 1234065)downloads%*%filesizes

image(as.matrix(downloads))

Factors

A character string can contain arbitrary text. Sometimes it is useful to use a limited vocabulary, with a small number of allowed words. A factor is a variable that can only take such a limited number of values, which are called levels.

expression<-factor(c("over","under","over","unchanged","under","under"))

levels(expression)

protein<-list("glucose oxidas", "1CF3", 63355)protein

protein<-list(name="glucose oxidas", accession="1CF3", weight=63355)

x<-c(16614, 50660, 6066, 6118)protein$GOIDs<-x

protein

class(protein)

length(protein)

attributes(protein)

Working directorysetwd("D:/data")

x<-read.table("profiles.csv", sep=",", header=TRUE)

x<-read.table("http://www.bixsolutions.net/profiles.csv", sep=",", header=TRUE)

matplot(x, type="l")

matplot(x, type="l", xlab="fraction", ylab="quantity", col=1:6, lty=1:5, lwd=2)

lty: line stylelwd: line width

xmax<-apply(x, 2, max)xmax

ymax<-apply(x, 1, max)ymax

Apply the max function on columns (2) or rows (1) of matrix x

cummean = function(x){n = length(x)y = numeric(n)z = c(1:n)y = cumsum(x)y = y/zreturn(y)

}

n = 10000z = rnorm(n)x = seq(1,n,1)y = cummean(z)X11()plot(x,y,type= 'l',main= 'Convergence Plot')

Apply the max function on columns (2) or rows (1) of matrix x