THE BIOINFORMATICS AND MATHEMATICAL BIOSCIENCE LAB (BMBL) BIWEEKLY SCIENCE REPORT

Reporting Time: Due at 5pm on Feb. 26th

Institution: South Dakota State University

Report prepared by: Juan Xie    Advisor: Qin Ma

Project Name: DAVID enrichment analysis based on QUBIC results

SIGNIFICANT SCIENCE ACCOMPLISHMENTS: (Examples: major achievement in meeting a milestone, new

collaborations, publication in high impact journal.)

1. Summary of Science Activities (<1 page):

Selected candidate biclusters for DAVID enrichment analysis based on the SAG overlap ratio, the number of conditions, the QUBIC p-value, and the hypergeometric p-value, then determined the suitable candidate biclusters and completed the DAVID enrichment analysis. Replaced the DAVID IDs in the result files with the original probeset IDs.

2. Future Work Plans (Brief summary of the tasks/milestones working on next month):

3. Issues to Resolve (Issues that need input from another partner, FA Lead, Science Coordinator or BESC Director for

resolution, etc.):

4. Publications:

5. Presentations:

6. News / Awards:

7. Personnel changes (New, reassigned, or departed):

8. Intellectual Property:

9. Quality Assurance:

10. Environment, Safety and Health:

Please complete and return to Qin Ma ([email protected]). If no activity, please indicate “N/A”.

Supplementary Tables and Figures (to support above accomplishments)

Background information about the work

Jiading provided three Excel files: Original-expression data-total probesets, UP-gene list-LSCR-4 organs, and conserved-SAG-list.

The Original-expression data-total probesets file contains IDs (i.e., probeset id, Arabidopsis hit, rice hit, etc.) and the expression values of different organs at different time points. It consists of 122974 rows (including the header row).

The UP-gene list-LSCR-4 organs file contains 4 sheets, one per organ. Each sheet contains two columns: probeset id and PviUT sequence ID. The number of rows per sheet ranges from about 1800 to 3000. This file is the target of the analysis.

The conserved-SAG-list file also contains 4 sheets, each with the probeset id and PviUT sequence ID columns. Each sheet has far fewer rows, about 400 to 600. This file contains the probesets that are already known to have certain functions.

Our task is to bicluster the expression values of the genes of the four organs (the ones from the UP-gene list-LSCR-4 organs file), then select candidate biclusters and run DAVID enrichment analysis on them.

1. Select candidate biclusters for DAVID enrichment analysis.

1) Prepare data to be used

Extract the first column (the probeset id column) and all the condition columns from the original expression data and save them as “raw data.csv”. (Note: the original expression file is too large; loading it directly into R would be very slow.)

Extract the first column and the Arabidopsis.hit column from the original expression data (save them as “Original-expression data_Ahit.csv”), keep only the rows with a valid Arabidopsis hit (that is, exclude the “/” rows and the blank rows), and save the result as “allgene_hit1.csv”:

allgene <- read.table("C:/Users/qibao/Desktop/project/Original-expression data_A hit.csv",header=T,sep=",")
hit <- allgene[allgene$Arabidopsis.hit!="/",] ## exclude the "/" rows
hit1 <- hit[hit$Arabidopsis.hit !="",] ## exclude the blank rows and keep the rest
write.csv(hit1,"C:/Users/qibao/Desktop/project/allgene_hit1.csv",row.names=F) ## write the remaining rows to a csv file (write.csv always uses a comma separator)
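The same filter can be sketched in Python as a cross-check. This is only an illustration, not part of the original workflow: the function name keep_valid_hits and the sample rows are hypothetical, and the column name Arabidopsis.hit is taken from the R code above.

```python
import csv
import io


def keep_valid_hits(rows, hit_column="Arabidopsis.hit"):
    """Return only the rows whose hit column is neither "/" nor blank."""
    return [row for row in rows if row[hit_column] not in ("", "/")]


# Hypothetical sample standing in for the real two-column hit file.
sample_csv = io.StringIO(
    "probeset_id,Arabidopsis.hit\n"
    "p1,AT1G01010/abc\n"
    "p2,/\n"
    "p3,\n"
    "p4,AT2G02020/def\n"
)
rows = list(csv.DictReader(sample_csv))
valid_rows = keep_valid_hits(rows)
valid_ids = [row["probeset_id"] for row in valid_rows]
print(valid_ids)
```

To run it on the real file, the StringIO object would be replaced by an opened file handle.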

Extract the up gene list and the SAG list of each organ and save them as csv files. Taking the leaf data as an example, save the up gene list of leaf as leaf_sm and the SAG list of leaf as SAG-list_leaf.

Select the expression data of the genes listed in the up gene list from the “raw data.csv” file using R:

leaf_sub <- rawdata[(rawdata$probeset_id %in% leaf_sm$probeset_id),] ## rawdata: contents of "raw data.csv"; leaf_sm: up gene list of leaf
write.csv(leaf_sub,file="C:/Users/qibao/Desktop/project/leaf/leaf_sub.csv",row.names=F)

Save the resulting leaf_sub file as a tab-delimited txt file using Notepad++ and use it as the input file for QUBIC.
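The subsetting and the conversion to a tab-delimited file could also be done in one scripted step instead of by hand. The sketch below is a minimal Python illustration under stated assumptions: the function names, column names other than probeset_id, and the sample values are hypothetical.

```python
import csv
import io


def subset_rows(expression_rows, wanted_ids, id_column="probeset_id"):
    """Keep the expression rows whose probeset id appears in the up gene list."""
    wanted = set(wanted_ids)
    return [row for row in expression_rows if row[id_column] in wanted]


def to_tab_delimited(rows, fieldnames):
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical expression values and up gene list.
expression = [
    {"probeset_id": "p1", "cond1": "1.5", "cond2": "2.0"},
    {"probeset_id": "p2", "cond1": "0.3", "cond2": "0.4"},
    {"probeset_id": "p3", "cond1": "5.1", "cond2": "4.8"},
]
up_gene_ids = ["p3", "p1"]

leaf_sub = subset_rows(expression, up_gene_ids)
leaf_sub_text = to_tab_delimited(leaf_sub, ["probeset_id", "cond1", "cond2"])
print(leaf_sub_text)
```

The rows keep the order of the expression table, not the order of the up gene list, which matches the behaviour of %in% in the R code.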

2) Run QUBIC

QUBIC is a qualitative biclustering algorithm. The C version is available at http://csbl.bmb.uga.edu/~maqin/bicluster/doc/build/. An R package version was recently developed and accepted by Bioconductor: https://www.bioconductor.org/packages/devel/bioc/html/QUBIC.html. This work was done using the C version of QUBIC.

The parameters used were: -k 3 -f 0.25 -c 0.95 -o 300 -q 0.10 -r 1. The program found 76 biclusters. In fact, three sets of parameters were tested: a) -k 3 -f 0.25 -c 0.95 -o 300 -q 0.10 -r 1, b) -k 3 -f 0.25 -c 1.0 -o 300 -q 0.10 -r 1, and c) -k 2 -f 0.25 -c 0.95 -o 300 -q 0.06 -r 1. Option a) yielded larger biclusters (in terms of the number of genes per bicluster), so I chose option a).

3) Extract the genes, p-values, and BC numbers of the biclusters with grep under Linux and save them as sub3BCgenes, sub3BCPvalue, and sub3BC#, respectively:

grep Genes leaf_sub.blocks | cut -d '[' -f2 > sub3BCgenes
grep Pvalue leaf_sub.blocks | cut -d '[' -f2 > sub3BCPvalue
grep BC leaf_sub.blocks | cut -d 'S' -f1 > sub3BC#
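For readers without a Linux shell, the grep and cut pipeline can be imitated in Python. The sketch below reproduces only the generic behaviour of the two commands (substring match, then field selection). The function name grep_cut is hypothetical, and the sample lines are invented for illustration; they are not claimed to be the real layout of a QUBIC .blocks file.

```python
def grep_cut(lines, pattern, delimiter, field):
    """Mimic `grep pattern | cut -d delimiter -f field` on an iterable of lines.

    As with cut, a matching line that lacks the delimiter is passed through
    whole, and a line with too few fields yields an empty string.
    """
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if pattern not in line:
            continue
        if delimiter not in line:
            selected.append(line)
            continue
        parts = line.split(delimiter)
        selected.append(parts[field - 1] if field <= len(parts) else "")
    return selected


# Invented sample lines, used only to exercise the function.
sample_lines = [
    "BC000 S=12 note\n",
    " Genes [3]: g1 g2 g3\n",
    " Conds [2]: c1 c2\n",
]
gene_fields = grep_cut(sample_lines, "Genes", "[", 2)
name_fields = grep_cut(sample_lines, "BC", "S", 1)
print(gene_fields)
print(name_fields)
```

Unlike grep, this version matches plain substrings rather than regular expressions, which is sufficient for the literal patterns used above.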

4) Match the SAG genes and the Arabidopsis.hit for the genes of every bicluster:

(Note: this step is needed because we want to select biclusters that contain a larger proportion of SAG genes; after that, we need to find the corresponding Arabidopsis.hit of each probeset.)

leafgenelist <- read.csv("C:/Users/qibao/Desktop/project/leaf/leaf_sm.csv",header=T,sep=",") ## read the up gene list of leaf

leaf_SAG <-read.csv("C:/Users/qibao/Desktop/project/leaf/SAG-list_leaf.csv",header=T,sep=",") ## read the SAG list of leaf

dim(leafgenelist)

dim(leaf_SAG)

head(leafgenelist)

head(leaf_SAG)

SAGreptall <-leaf_SAG$Probe.Set.ID[which(leaf_SAG$Probe.Set.ID %in% leafgenelist$Probe.Set.ID)] ## match the probe id of leaf up gene list with the corresponding SAG list

length(SAGreptall) ##number of matched gene in the whole SAG, in this case,571

dim(leafgenelist)[1] ## total gene # in up-list, in this case,3022

# I/O

fSAG <- "C:/Users/qibao/Desktop/project/leaf/SAG-list_leaf.csv" ## define the saving path and names, and the same below. This is the SAG list of leaf

fGene <- "C:/Users/qibao/Desktop/project/leaf/sub3BCgenes" ## this file contains all the genes of biclusters

fBCName <- "C:/Users/qibao/Desktop/project/leaf/sub3BC#" ## this file contains the bicluster name, i.e, BC010

fOut <- "C:/Users/qibao/Desktop/project/leaf/SAGMatchResultAll.csv"

fPvalue<-"C:/Users/qibao/Desktop/project/leaf/sub3BCPvalue" ## this file contains the pvalue of bicluster

fhit <- "C:/Users/qibao/Desktop/project/allgene_hit1.csv" ## this is the file contains all Arabidopsis.hit and also probeset id

hit <- read.csv(fhit,header=T,sep=",")

leafSAG <-read.csv(fSAG,header=T,sep=",")

gene <- read.table(fGene ,header = FALSE,sep=" ",fill=T)

BCName <- read.table(fBCName, header=FALSE)

Pvalue<-read.table(fPvalue,header=F)

# declare variables

numGene <- nrow(gene) # number of rows

rep_ratio <- rep(-1,numGene) # ratio of number of repeat gene to that of all

hyperp <- rep(-1,numGene) # hypergeometric distribution pvalue

gene_all <- rep(-1,numGene) # all gene in each BC (here is each row)

gene_rep <- rep(-1,numGene) # matched number of repeat gene

Totalgene <-2451 ##total gene in up-list(minus total repeated)

TotalSAG <-571 ## number of repeated gene in the whole SAG

hitmatchednumber <- rep(-1,numGene) # the number of hit matched gene

# Match the repeat

for(i in 1:numGene){
  BC <- gene[i,]
  # extract the "Probe.Set.ID" column of the SAG list and convert it to character vector
  leaf_geneid <- as.character(leafSAG$Probe.Set.ID)
  class(leaf_geneid)
  # convert to character vector
  tBC <- t(BC)
  tBC <- as.character(tBC[,1])
  tBC[tBC==""] <- NA # replace the filling value (blank, "") with "NA"
  tBC <- tBC[!is.na(tBC)]
  # Match the repeat
  rep <- tBC[which(tBC %in% leaf_geneid)]
  repNum <- length(rep)
  gene_rep[i] <- repNum
  gene_all[i] <- length(tBC)
  hitmatch <- hit[hit$probeset_id %in% tBC,] # find the matched hits
  Arahit <- hitmatch[,2] # select the column of Arabidopsis.hit
  hit_number <- dim(hitmatch)[1] # the number of hits found (equals the row# of hitmatch)
  hitmatchednumber[i] <- hit_number
  hyper_pvalue <- dhyper(repNum,TotalSAG,Totalgene,length(tBC)) ## hypergeometric probability of exactly repNum SAG genes in the bicluster (dhyper returns the point probability)
  hyperp[i] <- sprintf("%.3f",hyper_pvalue) ## keep three decimals
  ratio <- repNum/length(tBC)
  rep_ratio[i] <- sprintf("%.3f",ratio)
}

## write result

rst <- data.frame(BC_Name=BCName[,1], Pvalue=Pvalue[,1],GeneCount=gene_all, RepeatCount=gene_rep, repeatratio=rep_ratio,hyperpvalue=hyperp,hitCount=hitmatchednumber)

write.csv(rst, file=fOut,row.names=F)
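A note on the hypergeometric value: R's dhyper returns the probability of observing exactly the given number of SAG genes, whereas an enrichment p-value is often taken as the upper tail, that is, the probability of observing at least that many (in R, phyper(x - 1, m, n, k, lower.tail = FALSE)). The report does not state which was intended, so the sketch below only illustrates the difference. It is written in Python with hypothetical function names and a hypothetical bicluster of 40 genes, 15 of them SAG; the totals 571 and 2451 come from the script above.

```python
from math import comb


def hypergeom_pmf(x, m, n, k):
    """P(X = x): x marked genes in a draw of k from m marked and n unmarked."""
    if x < 0 or x > m or k - x < 0 or k - x > n:
        return 0.0
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)


def hypergeom_upper_tail(x, m, n, k):
    """P(X >= x): probability of at least x marked genes in the draw."""
    return sum(hypergeom_pmf(j, m, n, k) for j in range(x, min(k, m) + 1))


total_sag = 571        # SAG genes in the up gene list (from the report)
total_non_sag = 2451   # remaining genes in the up gene list (from the report)
bicluster_size = 40    # hypothetical
sag_in_bicluster = 15  # hypothetical

point_probability = hypergeom_pmf(
    sag_in_bicluster, total_sag, total_non_sag, bicluster_size
)
tail_probability = hypergeom_upper_tail(
    sag_in_bicluster, total_sag, total_non_sag, bicluster_size
)
print("point probability:", point_probability)
print("upper-tail probability:", tail_probability)
```

The upper-tail value is never smaller than the point probability, so a threshold such as 0.05 may select fewer biclusters when the tail is used.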

5) Select candidate biclusters based on the p-value of the hypergeometric distribution (<= 0.05) and the number of matched Arabidopsis.hit (>= 10):

rst <-read.csv("C:/Users/qibao/Desktop/project/leaf/SAGMatchResultAll.csv",header=T)

foutDir <- "C:/Users/qibao/Desktop/project/leaf/"

hyperP_threshold <- 0.05 ## filter condition1 hyper pvalue

hitCount_threshold <- 10 ##filter condition2 matched hit number

rst_cp <- rst

rst_cp <- rst_cp[rst_cp$hyperpvalue<=hyperP_threshold,]

rst_cp <- rst_cp[rst_cp$hitCount>=hitCount_threshold,]

BCorder<- 1+as.numeric(substr(rst_cp[,1],3,5)) ## corresponding row# of candidate bicluster in total qubic gene

length(BCorder)

BChit <-rep(-1,length(BCorder))

for(i in 1:length(BCorder)){
  BC <- gene[BCorder[i],]
  tBC <- t(BC)
  tBC <- as.character(tBC[,1])
  tBC[tBC==""] <- NA # replace the filling value (blank, "") with "NA"
  tBC <- tBC[!is.na(tBC)]
  hitmatch <- hit[hit$probeset_id %in% tBC,]
  Arahit <- hitmatch[,2] ##select the column of Arahit
  BChit <- substr(as.character(Arahit),1,9) ##select the string before /
  fout <- paste(foutDir,i,sep="")
  write.csv(BChit,fout,row.names=F)
}

We obtained 16 biclusters, each with a separate gene list.
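The two filter conditions and the mapping from a bicluster name to its row in the gene table can be summarized in a short Python sketch. It is an illustration only: the function names and the sample rows are hypothetical, while the thresholds (0.05 and 10) and the column names follow the R script above.

```python
def bicluster_row(bc_name):
    """Map a name such as "BC010" to its 1-based row in the gene table (11)."""
    return 1 + int(bc_name[2:5])


def select_candidates(results, max_hyper_p=0.05, min_hit_count=10):
    """Keep the biclusters that pass both filter conditions."""
    return [
        row
        for row in results
        if row["hyperpvalue"] <= max_hyper_p and row["hitCount"] >= min_hit_count
    ]


# Hypothetical rows in the shape of SAGMatchResultAll.csv.
results = [
    {"BC_Name": "BC000", "hyperpvalue": 0.001, "hitCount": 25},
    {"BC_Name": "BC003", "hyperpvalue": 0.200, "hitCount": 40},
    {"BC_Name": "BC010", "hyperpvalue": 0.049, "hitCount": 9},
    {"BC_Name": "BC012", "hyperpvalue": 0.050, "hitCount": 10},
]

candidates = select_candidates(results)
candidate_rows = [bicluster_row(row["BC_Name"]) for row in candidates]
print([row["BC_Name"] for row in candidates], candidate_rows)
```

The offset of 1 reflects that bicluster names start at BC000 while table rows are counted from 1.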

6) Upload the obtained gene lists to DAVID and run the enrichment analysis.

2. Conduct DAVID enrichment analysis

Upload the gene lists obtained in step 6) of section 1 to the DAVID website (https://david.ncifcrf.gov/home.jsp) to conduct the enrichment analysis. Details are as follows:

Step 1: Click “Start Analysis” in the top bar of the home page.

Step 2: Upload or paste the gene list, select “Not Sure” in the drop-down box, select “Gene List” as the list type, and click the “Submit List” button.

Step 3: Click “Submit to Conversion Tool”.

Step 4: Click “Convert All” to convert the IDs.

Step 5: Right-click “Download File” to save the converted list file, then click “Submit Converted List to DAVID as a Gene List”. You may rename the gene list.

Step 6: Return to the tools page and click “Functional Annotation Clustering”.

Step 7: Click “Functional Annotation Clustering”.

Step 8: Right-click “Download File” and save the annotation clustering results.

Step 9 (optional): Return to the annotation summary results page, uncheck the “Check Defaults” box, click “Pathways”, select “KEGG_PATHWAY”, then click “Functional Annotation Clustering” and download the results.

Note: this step is optional; we did it because Jiading was interested in the KEGG-specific annotation clustering result.

3. Replace the DAVID ID of the obtained results with corresponding probeset id.

Following the above steps, we get three txt files for each bicluster (downloaded from the DAVID website): a) the ID conversion record file (Arabidopsis hit ID to DAVID ID); b) the enrichment analysis result file; and c) the KEGG result file. In b) and c), the genes are represented by their DAVID IDs. For the users' convenience, we replace the DAVID IDs with probeset IDs.

1) Prepare the hash table

Select the Arabidopsis hit column of allgene_hit1.csv (which contains two columns: Arabidopsis hit and probeset id), upload it to DAVID as a gene list, submit it, convert it to DAVID IDs, and download the resulting file. Then save it as a comma-delimited csv file using Notepad++, named DAVID ID.csv. The DAVID ID.csv file contains four columns: From (Arabidopsis id), To (DAVID id), Species, and Gene names. The R script follows:

AhitID <- read.csv("C:/Users/qibao/Desktop/project/allgene_hit1.csv",header=T,sep=",")

davidid<-read.csv("C:/Users/qibao/Desktop/project/DAVID ID.csv",header=T,sep=",")

From <-as.character(davidid$From) ## extract the From column and convert it to character

head(AhitID)

head(davidid)

length <-nrow(davidid)

probeset_id_matched <- rep("",length) # record matched corresponding "probeset_id"

for(i in 1:length){

tmp <- AhitID[substr(as.character(AhitID$Arabidopsis.hit),1,9)==From[i],]

probeset_id_matched[i] <- as.character(tmp$probeset_id[1])

}

# write result out

fOut <- "C:/Users/qibao/Desktop/project/davidid_with_probid_1.csv"

df <- data.frame(ProbesetID=probeset_id_matched,davidid)

write.csv(df,fOut,row.names=F)

Following the above steps, we obtained a csv file (davidid_with_probid_1.csv) that contains five columns: Probeset ID, From, To, Species, and Gene names.
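The lookup above scans the whole hit table once per DAVID record. A dictionary keyed by the first nine characters of the Arabidopsis hit gives the same first-match result in a single pass. The sketch below is a minimal Python illustration: the function name and the sample rows are hypothetical, and None stands in for the NA that R returns when nothing matches.

```python
def first_probeset_by_hit(hit_rows, prefix_length=9):
    """Index probeset ids by the leading characters of their Arabidopsis hit.

    Only the first probeset seen for each prefix is kept, mirroring the
    `tmp$probeset_id[1]` selection in the R loop.
    """
    index = {}
    for probeset_id, arabidopsis_hit in hit_rows:
        index.setdefault(arabidopsis_hit[:prefix_length], probeset_id)
    return index


# Hypothetical rows of (probeset id, Arabidopsis hit).
hit_rows = [
    ("p1", "AT1G01010/abc"),
    ("p2", "AT1G01010/def"),
    ("p3", "AT2G02020/ghi"),
]
# Hypothetical values of the "From" column of DAVID ID.csv.
from_ids = ["AT2G02020", "AT1G01010", "AT5G99999"]

index = first_probeset_by_hit(hit_rows)
probeset_id_matched = [index.get(from_id) for from_id in from_ids]
print(probeset_id_matched)
```

When several probesets share one Arabidopsis hit, both versions keep only the first one, so the others are not recoverable from the mapping.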

2) Replace the DAVID ID with probeset id using Python:

import csv
import glob
import os
import re


## function defined to perform string replacement
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))

    def one_xlat(match):
        return adict[match.group(0)]

    return rx.sub(one_xlat, text)


## main function
if __name__ == '__main__':
    # read old and new strings from the csv file
    fileMap = 'C:\\Users\\qibao\\Desktop\\project\\davidid_with_probid_1.csv'
    mapDict = {}  # old string: DAVID id (third column, "To"); new string: probeset id (first column)
    with open(fileMap, newline='') as fCSV:
        reader = csv.reader(fCSV)
        next(reader, None)  # the first row holds the column titles
        for row in reader:
            mapDict[row[2]] = row[0]

    # read text for replacing
    dirIn = 'C:\\Users\\qibao\\Desktop\\project\\stem\\enrichment\\'
    dirOut = 'C:\\Users\\qibao\\Desktop\\project\\stem\\enrichment\\enrichment result\\'
    os.chdir(dirIn)  # change directory to predefined directory
    for fileTXT in glob.glob("*.txt"):
        # only process files whose name contains 'enrichment' or 'kegg'
        if 'enrichment' in fileTXT or 'kegg' in fileTXT:
            print("Processing file: " + fileTXT + "...")
            fileInPath = dirIn + fileTXT  # full path of the input file
            fileOutPath = dirOut + fileTXT.replace('.txt', '_new.txt')  # full path of the output file
            # both files are closed automatically when the with block ends
            with open(fileInPath, 'r') as fInTXT, open(fileOutPath, 'w') as fOutTXT:
                for line in fInTXT:
                    newT = multiple_replace(line, mapDict)  # call replace function
                    fOutTXT.write(newT)  # write result to file
    # Done
    print("Done, all files were processed")
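One caveat about plain substring replacement: if one ID is a prefix or a fragment of another (for example 123 and 1234), a regular-expression alternation can replace part of a longer ID. Whether this occurs in the actual DAVID result files is not known. The sketch below shows one possible safeguard, matching whole tokens only. The function name replace_whole_ids and the sample IDs are hypothetical.

```python
import re


def replace_whole_ids(text, mapping):
    """Replace each key of `mapping` in `text`, matching whole tokens only."""
    if not mapping:
        return text
    # Longer keys first, so a short key never shadows a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical DAVID ids mapped to hypothetical probeset ids.
id_map = {"123": "probeA", "1234": "probeB"}
line = "Genes\t1234, 123, 51239\n"
replaced = replace_whole_ids(line, id_map)
print(replaced)
```

The word-boundary anchors assume that the IDs start and end with a letter, digit, or underscore, which holds for purely numeric IDs.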
