THE BIOINFORMATICS AND MATHEMATICAL BIOSCIENCE LAB (BMBL) BIWEEKLY SCIENCE REPORT
Reporting Time: Due at 5pm on Feb. 26th
Institution: South Dakota State University
Report prepared by: Juan Xie    Advisor: Qin Ma
Project Name: Do DAVID enrichment analysis based on QUBIC result
SIGNIFICANT SCIENCE ACCOMPLISHMENTS: (Examples: major achievement in meeting a milestone, new
collaborations, publication in high impact journal.)
1. Summary of Science Activities (<1 page):
Selected candidate biclusters for DAVID enrichment analysis based on the SAG overlap ratio, the number of
conditions, the p-value reported by QUBIC and the hypergeometric p-value; determined the final set of
candidate biclusters and completed the DAVID enrichment analysis. Replaced the DAVID IDs in the result
files with the original probeset ids.
2. Future Work Plans (Brief summary of the tasks/milestones working on next month):
3. Issues to Resolve (Issues that need input from another partner, FA Lead, Science Coordinator or BESC Director for
resolution, etc.):
4. Publications:
5. Presentations:
6. News / Awards:
7. Personnel changes (New, reassigned, or departed):
8. Intellectual Property:
9. Quality Assurance:
10. Environment, Safety and Health:
Please complete and return to Qin Ma ([email protected]). If no activity, please indicate “N/A”.
Supplementary Tables and Figures (to support above accomplishments)
Background information about the work
Jiading provided three Excel files: Original-expression data-total probesets, UP-gene list-LSCR-4 organs and
conserved-SAG-list.
The Original-expression data-total probesets file contains ids (probeset id, Arabidopsis hit, rice hit, etc.)
and the expression values of different organs at different time points. It has 122974 rows (including the header
row).
The UP-gene list-LSCR-4 organs file contains 4 sheets, one per organ. Each sheet
has two columns: probeset id and PviUT sequence ID. Each sheet has roughly
1800 to 3000 rows. This file is the target of the analysis.
The conserved-SAG-list file also contains 4 sheets, each with the probeset id and PviUT
sequence ID columns. These sheets are much shorter, about 400 to 600 rows each. This file
lists the probesets that are already known to have certain functions.
Our task is to bicluster the expression values of the genes of the four organs (those in the UP-gene
list-LSCR-4 organs file), then select candidate biclusters and run DAVID enrichment analysis on them.
1. Select candidate biclusters for DAVID enrichment analysis.
1) Prepare data to be used
Extract the first column (the probeset id column) and all the condition columns from the original expression data
and save them as “raw data.csv”. (Note: the original expression file is too large; loading it directly in R would be very
slow.)
Extract the first column and the Arabidopsis.hit column from the original expression data and save them as
“Original-expression data_Ahit.csv”. Keep only the rows with a valid Arabidopsis hit (that is, exclude the rows whose
hit is “/” or blank) and save the result as “allgene_hit1.csv”:
allgene <- read.table("C:/Users/qibao/Desktop/project/Original-expression data_Ahit.csv", header=T, sep=",")
hit <- allgene[allgene$Arabidopsis.hit != "/", ] ## exclude the "/" rows
hit1 <- hit[hit$Arabidopsis.hit != "", ] ## exclude the blank rows and keep the rest
write.csv(hit1, "C:/Users/qibao/Desktop/project/allgene_hit1.csv", row.names=F) ## write the remaining rows as a csv file
Extract the up gene list and the SAG list of each organ and save them as csv files. Taking leaf as the example, save
the up gene list of leaf as leaf_sm.csv and the SAG list of leaf as SAG-list_leaf.csv.
Select the expression data of the genes in the up gene list from the “raw data.csv” file using R:
## rawdata holds the contents of "raw data.csv"; leaf_sm holds the contents of leaf_sm.csv
leaf_sub <- rawdata[rawdata$probeset_id %in% leaf_sm$probeset_id, ]
write.csv(leaf_sub, file="C:/Users/qibao/Desktop/project/leaf/leaf_sub.csv", row.names=F)
Save the resulting leaf_sub file as a tab-delimited txt file using Notepad++ and use it as the input file for QUBIC.
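The subsetting and the conversion to a tab-delimited file can also be scripted in one step, which avoids the manual re-save in Notepad++. The following is a minimal Python sketch, assuming the first field of every row is the probeset id; the function names subset_rows and to_tab_delimited and the sample ids are hypothetical and only serve as illustration.

```python
import csv
import io

def subset_rows(rows, wanted_ids):
    """Keep the header plus every row whose first field is in wanted_ids."""
    rows = iter(rows)
    header = next(rows)
    kept = [header]
    kept.extend(row for row in rows if row and row[0] in wanted_ids)
    return kept

def to_tab_delimited(rows):
    """Serialize rows as tab-delimited text, one row per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Made-up expression table: probeset id followed by two condition columns.
raw = [
    ["probeset_id", "cond1", "cond2"],
    ["ps_0001", "1.2", "3.4"],
    ["ps_0002", "5.6", "7.8"],
    ["ps_0003", "9.0", "1.1"],
]
subset = subset_rows(raw, {"ps_0001", "ps_0003"})
text = to_tab_delimited(subset)
```

In practice the rows would come from csv.reader over “raw data.csv” and the wanted ids from the up gene list of the organ.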
2) Run QUBIC
QUBIC is a qualitative biclustering algorithm. The C version is available at
http://csbl.bmb.uga.edu/~maqin/bicluster/doc/build/. An R package version was recently developed and
accepted by Bioconductor: https://www.bioconductor.org/packages/devel/bioc/html/QUBIC.html. This work
was done using the C version of QUBIC.
Three sets of parameters were tested: a) -k 3 -f 0.25 -c 0.95 -o 300 -q 0.10 -r 1, b) -k 3 -f
0.25 -c 1.0 -o 300 -q 0.10 -r 1, and c) -k 2 -f 0.25 -c 0.95 -o 300 -q 0.06 -r 1. Option a) gave larger
biclusters (in terms of the number of genes per bicluster), so option a) was chosen. With these parameters
the program found 76 biclusters.
3) Under Linux, use grep to extract the genes, p-values and BC numbers of the biclusters, and save them as sub3BCgenes,
sub3BCPvalue and sub3BC#, respectively:
grep Genes leaf_sub.blocks | cut -d '[' -f2 > sub3BCgenes
grep Pvalue leaf_sub.blocks | cut -d '[' -f2 > sub3BCPvalue
grep BC leaf_sub.blocks | cut -d 'S' -f1 > sub3BC#
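The three grep commands can be replaced by a single parsing pass that keeps the name, p-value and gene list of each bicluster together. The sketch below is only an illustration: the layout of the .blocks file (a header line such as "BC000 S=... Pvalue:..." followed by a "Genes [n]: ..." line) is an assumption inferred from the grep and cut patterns above, and the sample text, the ids and the function name parse_blocks are made up.

```python
import re

def parse_blocks(text):
    """Collect name, p-value and genes for each bicluster in the assumed layout."""
    biclusters = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = re.match(r"(BC\d+)\s+S=\S+\s+Pvalue:(\S+)", stripped)
        if header:
            current = {"name": header.group(1),
                       "pvalue": float(header.group(2)),
                       "genes": []}
            biclusters.append(current)
        elif stripped.startswith("Genes") and current is not None:
            # Everything after the first ':' is the space-separated gene list.
            current["genes"] = stripped.split(":", 1)[1].split()
    return biclusters

# Made-up sample in the assumed layout (not real QUBIC output).
sample = (
    "BC000\tS=12\tPvalue:1.5e-05\n"
    " Genes [3]: ps_0001 ps_0002 ps_0003\n"
    " Conds [4]: c1 c2 c3 c4\n"
    "BC001\tS=6\tPvalue:0.002\n"
    " Genes [2]: ps_0004 ps_0005\n"
    " Conds [3]: c1 c2 c3\n"
)
parsed = parse_blocks(sample)
```

If the real file differs from this assumed layout, the regular expression for the header line is the part to adjust.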
4) Match the SAG and Arabidopsis.hit entries for the genes of every bicluster:
(Note: this step is needed because we want to select the biclusters that contain a larger proportion of SAG genes,
and afterwards we need to find the Arabidopsis.hit corresponding to each probeset.)
leafgenelist <- read.csv("C:/Users/qibao/Desktop/project/leaf/leaf_sm.csv",header=T,sep=",") ## read the up gene list of leaf
leaf_SAG <-read.csv("C:/Users/qibao/Desktop/project/leaf/SAG-list_leaf.csv",header=T,sep=",") ## read the SAG list of leaf
dim(leafgenelist)
dim(leaf_SAG)
head(leafgenelist)
head(leaf_SAG)
SAGreptall <-leaf_SAG$Probe.Set.ID[which(leaf_SAG$Probe.Set.ID %in% leafgenelist$Probe.Set.ID)] ## match the probe id of leaf up gene list with the corresponding SAG list
length(SAGreptall) ##number of matched gene in the whole SAG, in this case,571
dim(leafgenelist)[1] ## total gene # in up-list, in this case,3022
# I/O
fSAG <- "C:/Users/qibao/Desktop/project/leaf/SAG-list_leaf.csv" ## define the saving path and names, and the same below. This is the SAG list of leaf
fGene <- "C:/Users/qibao/Desktop/project/leaf/sub3BCgenes" ## this file contains all the genes of biclusters
fBCName <- "C:/Users/qibao/Desktop/project/leaf/sub3BC#" ## this file contains the bicluster name, i.e, BC010
fOut <- "C:/Users/qibao/Desktop/project/leaf/SAGMatchResultAll.csv"
fPvalue<-"C:/Users/qibao/Desktop/project/leaf/sub3BCPvalue" ## this file contains the pvalue of bicluster
fhit <- "C:/Users/qibao/Desktop/project/allgene_hit1.csv" ## this is the file contains all Arabidopsis.hit and also probeset id
hit <- read.csv(fhit,header=T,sep=",")
leafSAG <-read.csv(fSAG,header=T,sep=",")
gene <- read.table(fGene ,header = FALSE,sep=" ",fill=T)
BCName <- read.table(fBCName, header=FALSE)
Pvalue<-read.table(fPvalue,header=F)
# declare variables
numGene <- nrow(gene) # number of rows
rep_ratio <- rep(-1,numGene) # ratio of number of repeat gene to that of all
hyperp <- rep(-1,numGene) # hypergeometric distribution pvalue
gene_all <- rep(-1,numGene) # all gene in each BC (here is each row)
gene_rep <- rep(-1,numGene) # matched number of repeat gene
Totalgene <-2451 ##total gene in up-list(minus total repeated)
TotalSAG <-571 ## number of repeated gene in the whole SAG
hitmatchednumber <- rep(-1,numGene) # the number of hit matched gene
# Match the repeat
# extract the "Probe.Set.ID" column of the SAG list once and convert it to a character vector
leaf_geneid <- as.character(leafSAG$Probe.Set.ID)
for(i in 1:numGene){
  BC <- gene[i,]
  # convert the row to a character vector
  tBC <- t(BC)
  tBC <- as.character(tBC[,1])
  tBC[tBC==""] <- NA # replace the filling value (blank, "") with NA
  tBC <- tBC[!is.na(tBC)]
  # genes of this bicluster that also appear in the SAG list
  repGenes <- tBC[tBC %in% leaf_geneid] # named repGenes so that base::rep is not masked
  repNum <- length(repGenes)
  gene_rep[i] <- repNum
  gene_all[i] <- length(tBC)
  hitmatch <- hit[hit$probeset_id %in% tBC,] # rows with a matched Arabidopsis hit
  hitmatchednumber[i] <- nrow(hitmatch) # number of matched hits
  ## hypergeometric probability; dhyper() returns the point probability P(X = repNum),
  ## whereas phyper(repNum-1, TotalSAG, Totalgene, length(tBC), lower.tail=FALSE) would give the upper tail P(X >= repNum)
  hyper_pvalue <- dhyper(repNum,TotalSAG,Totalgene,length(tBC))
  hyperp[i] <- sprintf("%.3f",hyper_pvalue) ## keep three decimals
  rep_ratio[i] <- sprintf("%.3f",repNum/length(tBC))
}
## write result
rst <- data.frame(BC_Name=BCName[,1], Pvalue=Pvalue[,1], GeneCount=gene_all, RepeatCount=gene_rep, repeatratio=rep_ratio, hyperpvalue=hyperp, hitCount=hitmatchednumber)
write.csv(rst, file=fOut, row.names=F) ## write.csv always writes comma-separated output, so no sep argument is needed
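As a cross-check of the hypergeometric step, the sketch below reproduces in Python what dhyper(x, m, n, k) computes and contrasts it with the upper-tail probability P(X >= x), which is what phyper(x-1, m, n, k, lower.tail=FALSE) returns in R. The function names are hypothetical and the toy numbers are made up; m, n and k follow R's argument order (m marked genes, n unmarked genes, k genes drawn).

```python
from math import comb

def hyper_pmf(x, m, n, k):
    """P(X = x): exactly x marked genes in a draw of k from m marked and n unmarked."""
    if x < max(0, k - n) or x > min(k, m):
        return 0.0
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

def hyper_upper_tail(x, m, n, k):
    """P(X >= x): probability of seeing at least x marked genes in the draw."""
    return sum(hyper_pmf(j, m, n, k) for j in range(x, min(k, m) + 1))

# Made-up toy case: 2 marked genes, 2 unmarked genes, 2 genes drawn.
point = hyper_pmf(1, 2, 2, 2)        # C(2,1)*C(2,1)/C(4,2) = 4/6
tail = hyper_upper_tail(1, 2, 2, 2)  # 4/6 + 1/6 = 5/6
```

The point probability and the tail can differ considerably for large biclusters, so the choice between them affects which biclusters pass the 0.05 threshold used in the next step.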
5) Select candidate biclusters based on the p-value of the hypergeometric distribution (<= 0.05) and the number of
matched Arabidopsis.hit entries (>= 10):
rst <-read.csv("C:/Users/qibao/Desktop/project/leaf/SAGMatchResultAll.csv",header=T)
foutDir <- "C:/Users/qibao/Desktop/project/leaf/"
hyperP_threshold <- 0.05 ## filter condition1 hyper pvalue
hitCount_threshold <- 10 ##filter condition2 matched hit number
rst_cp <- rst
rst_cp <- rst_cp[rst_cp$hyperpvalue<=hyperP_threshold,]
rst_cp <- rst_cp[rst_cp$hitCount>=hitCount_threshold,]
BCorder<- 1+as.numeric(substr(rst_cp[,1],3,5)) ## corresponding row# of candidate bicluster in total qubic gene
length(BCorder)
for(i in seq_along(BCorder)){
  BC <- gene[BCorder[i],]
  tBC <- t(BC)
  tBC <- as.character(tBC[,1])
  tBC[tBC==""] <- NA # replace the filling value (blank, "") with NA
  tBC <- tBC[!is.na(tBC)]
  hitmatch <- hit[hit$probeset_id %in% tBC,]
  Arahit <- hitmatch[,2] ## select the Arabidopsis.hit column
  BChit <- substr(as.character(Arahit),1,9) ## keep the first 9 characters (the string before "/")
  fout <- paste(foutDir,i,sep="") ## one output file per candidate bicluster, named 1, 2, ...
  write.csv(BChit,fout,row.names=F)
}
We obtained 16 biclusters, each with a separate gene list.
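The selection logic of this step can be summarized in a few lines. The following Python sketch mirrors the two filters, the mapping from a bicluster name to its row in the gene table, and the 9-character truncation of the hit string. The rows and the hit string are made up for illustration, the values are strings as they would be when read from a csv file, and the function names are hypothetical.

```python
def select_candidates(rows, max_p=0.05, min_hits=10):
    """Names of biclusters with hyperpvalue <= max_p and hitCount >= min_hits."""
    return [row["BC_Name"] for row in rows
            if float(row["hyperpvalue"]) <= max_p
            and int(row["hitCount"]) >= min_hits]

def row_index(bc_name):
    """1-based row of a bicluster in the gene table: BC000 -> 1, BC010 -> 11."""
    return 1 + int(bc_name[2:])

def hit_prefix(hit, width=9):
    """First `width` characters of a hit string, mirroring substr(hit, 1, 9)."""
    return hit[:width]

# Made-up summary rows with the column names used in SAGMatchResultAll.csv.
rows = [
    {"BC_Name": "BC000", "hyperpvalue": "0.012", "hitCount": "25"},
    {"BC_Name": "BC001", "hyperpvalue": "0.300", "hitCount": "40"},
    {"BC_Name": "BC010", "hyperpvalue": "0.050", "hitCount": "10"},
    {"BC_Name": "BC011", "hyperpvalue": "0.001", "hitCount": "4"},
]
candidates = select_candidates(rows)
```

Both thresholds are inclusive, matching the <= and >= comparisons in the R script.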
6) Upload the obtained gene lists to DAVID and run the enrichment analysis.
2. Conduct DAVID enrichment analysis
Upload the gene lists obtained in step 1-5) to the DAVID website (https://david.ncifcrf.gov/home.jsp) to conduct
enrichment analysis. The details are as follows:
Step 1: Click “Start Analysis” in the top bar of the home page.
Step 2: Upload or paste the gene list, select “Not Sure” in the drop-down box, select “Gene List” as the list type, and
click the “Submit List” button.
Step 3: Click “Submit to Conversion Tool”.
Step 4: Click “Convert All” to convert the ids.
Step 5: Right-click “Download File” to save the converted list file, then click “Submit Converted List to DAVID as a
Gene List”. You may rename the gene list.
Step 6: Return to the tools page and click “Functional Annotation Clustering”.
Step 7: Click “Functional Annotation Clustering”.
Step 8: Right-click “Download File” and save the annotation clustering results.
Step 9 (optional): Return to the annotation summary results page, uncheck the “Check Defaults” box, click “Pathways”, then
select “KEGG_PATHWAY”, click “Functional Annotation Clustering”, and download the results.
Note: we did this step because Jiading was interested in the KEGG-specific annotation clustering
result.
3. Replace the DAVID IDs in the obtained results with the corresponding probeset ids.
Following the above steps, we get three txt files for each bicluster (downloaded from the DAVID website):
a) the id conversion record file (which converts the Arabidopsis hit ids to DAVID IDs); b) the enrichment analysis result
file; and c) the KEGG result file. In b) and c), the genes are represented by their DAVID IDs. For the convenience of
users, we replace the DAVID IDs with probeset ids.
1) Prepare the hash table
Select the Arabidopsis hit column of allgene_hit1.csv (which contains two columns: Arabidopsis hit and
probeset id), upload it to DAVID as a gene list, submit it, convert it to DAVID IDs and download the
resulting file. Save it as a comma-delimited csv file using Notepad++, named
DAVID ID.csv. The DAVID ID.csv file contains four columns: From (Arabidopsis id), To (DAVID id),
species and gene names. The R script is as follows:
AhitID <- read.csv("C:/Users/qibao/Desktop/project/allgene_hit1.csv",header=T,sep=",")
davidid<-read.csv("C:/Users/qibao/Desktop/project/DAVID ID.csv",header=T,sep=",")
From <-as.character(davidid$From) ## extract the From column and convert it to character
head(AhitID)
head(davidid)
n <- nrow(davidid) ## named n so that base::length is not masked
probeset_id_matched <- rep("",n) # record the matched "probeset_id" for each row
for(i in 1:n){
  tmp <- AhitID[substr(as.character(AhitID$Arabidopsis.hit),1,9)==From[i],]
  probeset_id_matched[i] <- as.character(tmp$probeset_id[1]) # keep the first match
}
# write result out
fOut <- "C:/Users/qibao/Desktop/project/davidid_with_probid_1.csv"
df <- data.frame(ProbesetID=probeset_id_matched,davidid)
write.csv(df,fOut,row.names=F)
Following the above steps, we obtained a csv file (davidid_with_probid_1.csv) that contains five columns: Probeset ID, From,
To, Species and Gene names.
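The R loop scans the whole hit table once per DAVID row. A dictionary keyed by the 9-character hit prefix gives the same first-match result with a single pass over each table. The sketch below is a hedged Python illustration of that idea: the function names and all sample ids are made up, and unmatched rows get an empty string here, whereas the R script would store NA.

```python
def build_lookup(hit_rows, width=9):
    """Map each hit prefix to the first probeset id that carries it."""
    lookup = {}
    for probeset_id, hit in hit_rows:
        # setdefault keeps the first probeset seen, like tmp$probeset_id[1] in R
        lookup.setdefault(hit[:width], probeset_id)
    return lookup

def attach_probesets(david_rows, lookup):
    """Prepend the matched probeset id ('' when there is none) to each DAVID row."""
    return [[lookup.get(row[0], "")] + list(row) for row in david_rows]

# Made-up rows: (probeset id, Arabidopsis hit) and (From, To, species, gene name).
hit_rows = [
    ("ps_0001", "AT1G01010/x"),
    ("ps_0002", "AT1G01010/y"),
    ("ps_0003", "AT2G02020/z"),
]
david_rows = [
    ["AT1G01010", "9001", "Arabidopsis thaliana", "made-up gene A"],
    ["AT3G03030", "9002", "Arabidopsis thaliana", "made-up gene B"],
]
table = attach_probesets(david_rows, build_lookup(hit_rows))
```

Note that, as in the R script, only the first probeset is kept when several probesets share the same Arabidopsis hit.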
2) Replace the DAVID IDs with probeset ids using Python:
import re
import csv
import glob
import os

## function defined to perform string replacement
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

## main function
if __name__ == '__main__':
    # read old and new strings from a csv file
    fileMap = 'C:\\Users\\qibao\\Desktop\\project\\davidid_with_probid_1.csv'
    mapDict = {}  # old -> new strings (key: third column "To", the DAVID ID; value: first column, the probeset id)
    with open(fileMap, newline='') as fCSV:
        reader = csv.reader(fCSV)
        next(reader, None)  # the first row holds the column titles
        for row in reader:
            mapDict[row[2]] = row[0]
    # read text for replacing
    dirIn = 'C:\\Users\\qibao\\Desktop\\project\\stem\\enrichment\\'
    dirOut = 'C:\\Users\\qibao\\Desktop\\project\\stem\\enrichment\\enrichment result\\'
    os.chdir(dirIn)  # change directory to predefined directory
    for fileTXT in glob.glob("*.txt"):
        strConst1 = 'enrichment'
        strConst2 = 'kegg'
        # only process files whose name contains 'enrichment' or 'kegg'
        if strConst1 in fileTXT or strConst2 in fileTXT:
            print("Processing file: " + fileTXT + "...")
            fileInPath = dirIn + fileTXT  # construct full path of an input file
            fileOutPath = dirOut + fileTXT.replace('.txt', '_new.txt')  # construct full path of an output file
            # the with statement closes both files when the block ends
            with open(fileInPath, 'r') as fInTXT, open(fileOutPath, 'w') as fOutTXT:
                for line in fInTXT:
                    newT = multiple_replace(line, mapDict)  # call replace function
                    fOutTXT.write(newT)  # write result to file
    # Done
    print("Done, all files were processed")
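One caveat of the replacement function above is that it replaces a key wherever it occurs, including inside a longer id or number on the same line. A possible refinement, sketched below under the assumption that DAVID IDs appear in the result files as separate tokens (for example in a comma-separated gene column), is to replace only whole tokens. The function name replace_whole_ids and the ids in the example are hypothetical.

```python
import re

def replace_whole_ids(text, mapping):
    """Replace each key of mapping only where it stands alone as a token."""
    if not mapping:
        return text
    # Longest keys first so a longer id wins over a shorter id it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])(" + "|".join(map(re.escape, keys)) + r")(?![\w.])"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Made-up ids: "100" must not be replaced inside "21001" or "100.5".
mapping = {"1001": "ps_A", "100": "ps_B"}
line = "genes: 1001, 100, 21001, 100.5"
replaced = replace_whole_ids(line, mapping)
```

The lookbehind and lookahead reject a match that is directly preceded or followed by a word character or a period, so counts and decimal values elsewhere on the line are left untouched.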