Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD.

Bioinformatics for Stem Cell Lecture 2

Debashis Sahoo, PhD

Outline

• Lecture 1 Recap
• Multivariate analysis
• Microarray data analysis
• Boolean analysis
• Sequencing data analysis

MULTIVARIATE ANALYSIS

Identify Markers of Human Colon Cancer and Normal Colon

Piero Dalerba Tomer Kalisky

Single Cell Analysis of Normal Human Colon Epithelium

Hierarchical Clustering

• Cluster 3.0
– http://bonsai.hgc.jp/~mdehoon/software/cluster/

• Distance metric
– Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson's correlation

• Linkage
– Single, complete, average, median, centroid
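The distance and linkage choices above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Cluster 3.0 is implemented: the function names (`euclidean`, `manhattan`, `pearson_distance`, `agglomerate`) are hypothetical, only three of the listed linkages are covered, and the brute-force search is only suitable for small inputs.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation, so perfectly correlated profiles have distance 0."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

# How to turn all pairwise distances between two clusters into one number.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(points, dist=euclidean, linkage="average"):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [dist(points[p], points[q])
                      for p in clusters[i] for q in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Four made-up 2-D profiles forming two obvious pairs.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
for a, b, d in agglomerate(points, dist=euclidean, linkage="average"):
    print(a, b, round(d, 3))
```

Swapping `dist` or `linkage` changes only how cluster-to-cluster distance is measured; the merge loop stays the same.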

Multivariate Analysis - PCA

X = data matrix
V = loading matrix
U = scores matrix

Principal Component Analysis

Fundamentals of PCA

• Reduces dimensions of the data

• PCA uses orthogonal linear transformation

• First principal component has the largest possible variance.

• Exploratory tool to uncover unknown trends in the data
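As a rough illustration of the points above, the first principal component can be found by power iteration on the covariance matrix of the centered data. This is a sketch under assumptions: the function names are hypothetical, rows of `X` are taken to be samples, and real analyses would normally use a linear algebra library (for example an SVD) rather than this loop.

```python
import math

def center(X):
    """Subtract each column's mean (rows = samples, columns = variables)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(X, iters=500):
    """Return (loading vector, scores, variance) of the first principal component."""
    Xc = center(X)
    C = covariance(Xc)
    p = len(C)
    # Arbitrary non-uniform start vector; power iteration fails only if the
    # start happens to be orthogonal to the leading eigenvector.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in Xc]  # projection
    variance = sum(s * s for s in scores) / (len(X) - 1)
    return v, scores, variance

# Made-up data lying exactly on the line y = x.
X = [[1, 1], [2, 2], [3, 3], [4, 4]]
loading, scores, variance = first_component(X)
print(loading, variance)
```

Here the loading vector is one column of the loading matrix and the scores are the corresponding column of the scores matrix; the variance of the scores is the largest obtainable from any unit-length linear combination of the variables.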

PCA Analysis

HIGH-THROUGHPUT DATA ANALYSIS

MICROARRAY ANALYSIS

Microarray

• Spotted vs. in situ
• Two channel vs. one channel
• Probe vs. probeset vs.

Quantile Normalization

[Figure: arrays #1, #2, #3 are each sorted, then averaged across arrays rank by rank to give SortedAvg]

Val(Probe_i) = SortedAvg[Rank(Probe_i)]
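The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched as follows. The function name `quantile_normalize` is hypothetical, and ties are simply broken by probe order, which is a simplification (many implementations average tied ranks instead).

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of probe values per array.
    Returns new lists in which every array has the same value distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # SortedAvg[r] = mean of the r-th smallest value across all arrays.
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probes from lowest to highest
        normalized = [0.0] * n
        for rank, probe in enumerate(order):
            normalized[probe] = sorted_avg[rank]      # Val(probe) = SortedAvg[Rank(probe)]
        result.append(normalized)
    return result

# Three made-up arrays with three probes each.
arrays = [[2, 4, 6], [3, 1, 8], [10, 7, 4]]
for row in quantile_normalize(arrays):
    print(row)
```

After normalization every array contains exactly the values in SortedAvg; only their assignment to probes differs.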

Invariant Set Normalization

Before Normalization

After Normalization

Invariant set

Good to Check the Image

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

[Expression matrix: rows Gene 1 to Gene 6, columns Exp 1 to Exp 6]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

[Expression matrix: rows Gene 1 to Gene 6, columns Exp 1 to Exp 6, each column marked as Group A or Group B]

SAM Two-Class Unpaired

Permutation tests

i) For each gene, compute the d-value (analogous to a t-statistic). This is the observed d-value for that gene.

ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene.

[Figure: Gene 1 across six experiments, split into Group A and Group B. Original grouping: Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6. Randomized grouping: Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6.]

iv) Rank the permuted d-values of the genes in ascending order.

v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.

vi) Plot the observed d-values vs. the expected d-values.

Significant positive genes (i.e., mean expression of group B > mean expression of group A)

Significant negative genes (i.e., mean expression of group A > mean expression of group B)

"Observed d = expected d" line

The more a gene deviates from the "observed = expected" line, the more likely it is to be significant. Moving along the x-axis in the positive or negative direction, find the first gene whose observed d-value differs from its expected d-value by at least delta. That gene and every gene beyond it in the same direction are considered significant.
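Steps i) to vi) and the delta rule can be sketched in plain Python. Several details here are assumptions and not taken from the slides: the d-value is written as the difference of group means divided by a pooled standard error plus a small constant `s0` (fixed at 0.1 for illustration; SAM itself chooses this constant from the data), the scan for significant genes starts at the rank whose expected d-value is closest to zero, and the names `d_value` and `sam` are hypothetical.

```python
import math
import random

def d_value(values, group_b, s0=0.1):
    """SAM-style relative difference for one gene.
    values: expression per experiment; group_b: set of column indices in group B."""
    a = [v for i, v in enumerate(values) if i not in group_b]
    b = [v for i, v in enumerate(values) if i in group_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)  # s0 is an assumed constant

def sam(matrix, group_b, delta, n_perm=100, seed=0):
    rng = random.Random(seed)
    n_genes, n_exp = len(matrix), len(matrix[0])
    observed = [d_value(row, group_b) for row in matrix]           # step i
    order = sorted(range(n_genes), key=lambda g: observed[g])      # step ii
    sums = [0.0] * n_genes
    for _ in range(n_perm):                                        # steps iii to v
        shuffled_b = set(rng.sample(range(n_exp), len(group_b)))
        permuted = sorted(d_value(row, shuffled_b) for row in matrix)
        sums = [s + d for s, d in zip(sums, permuted)]
    expected = [s / n_perm for s in sums]
    obs = [observed[g] for g in order]
    # Delta rule: scan outward from the rank where expected d is closest to 0.
    start = min(range(n_genes), key=lambda r: abs(expected[r]))
    up = next((r for r in range(start, n_genes)
               if obs[r] - expected[r] >= delta), n_genes)
    down = next((r for r in range(start, -1, -1)
                 if expected[r] - obs[r] >= delta), -1)
    return {
        "observed": obs,                                   # sorted ascending
        "expected": expected,                              # same rank order
        "positive": [order[r] for r in range(up, n_genes)],
        "negative": [order[r] for r in range(down + 1)],
    }

# Made-up data: 30 genes, 6 experiments, group B = Exp 3, 4 and 6 (indices 2, 3, 5).
rng = random.Random(1)
matrix = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(30)]
result = sam(matrix, group_b={2, 3, 5}, delta=0.5)
print(result["positive"], result["negative"])
```

Plotting `result["observed"]` against `result["expected"]` gives the scatter described in step vi).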

GenePattern: http://genepattern.broadinstitute.org/

AutoSOME: http://jimcooperlab.mcdb.ucsb.edu/autosome/

Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117

Aaron Newman

Gene Set Analysis

Cell Cycle

Transcription factor

TGF-beta Signaling Pathway

Wnt-signaling Pathway

Protein-protein interaction network

Your Gene Set

Compute enrichment in pathways and networks

Tools: GSEA, DAVID, Toppfun, MSigDB, and STRING
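One common way to compute enrichment of a gene set in a pathway is an over-representation test based on the hypergeometric distribution. The sketch below is only an illustration of that idea: the function name and the gene labels are hypothetical, and the tools listed above each use their own statistics (GSEA, for example, is rank-based rather than overlap-based).

```python
from math import comb

def enrichment_p_value(gene_set, pathway, background_size):
    """P(overlap >= observed) if gene_set were drawn at random from the background."""
    gene_set, pathway = set(gene_set), set(pathway)
    k = len(gene_set & pathway)          # observed overlap
    n, K, N = len(gene_set), len(pathway), background_size
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

# Hypothetical labels: 3 genes of interest, a 4-gene pathway, 10 genes measured.
my_genes = {"G1", "G2", "G3"}
pathway = {"G1", "G2", "G7", "G8"}
print(enrichment_p_value(my_genes, pathway, background_size=10))
```

When many pathways are tested, the resulting p-values need a multiple-testing correction before being interpreted.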

BOOLEAN ANALYSIS

Boolean Implication

• Analyze pairs of genes.
• Analyze the four different quadrants.
• Identify sparse quadrants.
• Record the Boolean relationships.
– If ACPP high, then GABRB1 low
– If GABRB1 high, then ACPP low

[Sahoo et al. Genome Biology 08]

45,000 Affymetrix microarrays

Threshold Calculation

• A threshold is determined for each gene.

• The arrays are sorted by gene expression

• StepMiner is used to determine the threshold

[Figure: expression of one gene across the sorted arrays, with the threshold and the intermediate region marked] [Sahoo et al. 07]
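The idea of fitting a step to the sorted expression values can be sketched as below. This is not the StepMiner code: it is a minimal one-step fit that minimizes squared error, and placing the threshold midway between the two fitted levels, as well as the `margin` of 0.5 used for the intermediate region, are assumptions made for illustration. The function names are hypothetical.

```python
def step_threshold(values):
    """Fit a one-step function to the sorted values and return a threshold."""
    xs = sorted(values)
    n = len(xs)
    best_sse, best_k = None, None
    for k in range(1, n):                  # step placed between xs[k-1] and xs[k]
        low, high = xs[:k], xs[k:]
        m_low, m_high = sum(low) / k, sum(high) / (n - k)
        sse = (sum((v - m_low) ** 2 for v in low)
               + sum((v - m_high) ** 2 for v in high))
        if best_sse is None or sse < best_sse:
            best_sse, best_k = sse, k
    low, high = xs[:best_k], xs[best_k:]
    # Assumed convention: threshold halfway between the two fitted levels.
    return (sum(low) / len(low) + sum(high) / len(high)) / 2

def discretize(value, threshold, margin=0.5):
    """Label a value as low, high, or intermediate (margin is an assumption)."""
    if value < threshold - margin:
        return "low"
    if value > threshold + margin:
        return "high"
    return "intermediate"

# Made-up expression values for one gene across six arrays.
values = [1, 1, 2, 9, 10, 10]
t = step_threshold(values)
print(t, [discretize(v, t) for v in values])
```

Values labeled intermediate would typically be left out when counting the four quadrants for a gene pair.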

BooleanNet Statistics

nAlow = (a00 + a01), nBlow = (a00 + a10)

total = a00 + a01 + a10 + a11, observed = a00

expected = (nAlow / total * nBlow / total) * total

statistic = (expected - observed) / sqrt(expected)

error rate = 1/2 * ( a00 / (a00 + a01) + a00 / (a00 + a10) )

Boolean Implication = (statistic > 3, error rate < 0.1)
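The test for a sparse low-low quadrant can be written out directly from these definitions. In this sketch a00, a01, a10 and a11 are taken to be the counts of arrays in the four quadrants, with the first index for gene A and the second for gene B (0 = low, 1 = high), which is what nAlow = a00 + a01 implies. The error rate is taken as the average fraction of the low-low count within the A-low and B-low totals; that reading, and the function names, are assumptions.

```python
import math

def low_low_statistics(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the quadrant A low, B low."""
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    if n_a_low == 0 or n_b_low == 0:
        raise ValueError("no arrays with A low or with B low; test undefined")
    total = a00 + a01 + a10 + a11
    observed = a00
    expected = (n_a_low / total * n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate

def is_sparse(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """Apply the cutoffs statistic > 3 and error rate < 0.1."""
    statistic, error_rate = low_low_statistics(a00, a01, a10, a11)
    return statistic > min_statistic and error_rate < max_error

# Made-up counts: almost no arrays have both genes low.
print(low_low_statistics(1, 99, 99, 100))
print(is_sparse(1, 99, 99, 100))      # sparse low-low quadrant
print(is_sparse(50, 50, 50, 50))      # evenly filled quadrants
```

A sparse low-low quadrant corresponds to the implication "if A low, then B high"; the other quadrants can be tested by relabeling the counts.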

Six Boolean Implications

MiDReG Algorithm

[Sahoo et al. PNAS 2010]

MiDReG = (Mining Developmentally Regulated Genes)

MiDReG Algorithm

B Cell Genes

Boolean Implications

Jun Seita

[Seita, Sahoo et al. PLoS ONE, 2012]

http://gexc.stanford.edu

SEQUENCING DATA ANALYSIS

Sequencing Data Format

@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

S - Sanger: Phred+33, (0, 40)
X - Solexa: Solexa+64, (-5, 40)
I - Illumina 1.3+: Phred+64, (0, 40)
J - Illumina 1.5+: Phred+64, (3, 40)
L - Illumina 1.8+: Phred+33, (0, 41)
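A FASTQ record is four lines: a header starting with "@", the sequence, a separator starting with "+", and one quality character per base. The sketch below parses the read shown above and decodes its quality string. The offset of 64 is an inference, not something stated on the slide: characters such as "e" and "f" lie above the range a Phred+33 encoding would use. The function names are hypothetical.

```python
def parse_fastq(text):
    """Return a list of (name, sequence, quality string) from FASTQ text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        records.append((header[1:], seq, qual))
    return records

def phred_scores(qual, offset):
    """Decode quality characters: score = ASCII code - offset (33 or 64)."""
    return [ord(c) - offset for c in qual]

fastq_text = """@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba
"""

name, seq, qual = parse_fastq(fastq_text)[0]
scores = phred_scores(qual, offset=64)   # assumed Phred+64 encoding
print(name, len(seq), min(scores), max(scores))
```

Using the wrong offset shifts every score by 31, so the encoding should be checked before any quality filtering.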

Mapping

Mapping Software

• Long reads
– BLAST, HMMER, SSEARCH

• Short reads
– BLAT
– Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

Visualizations

• UCSC Genome Browser
• GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2
• Integrative Genomics Viewer (IGV)

Quantification

• Peak calling
– QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT

• Expression quantification
– Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ

• SNP calling
– samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

Peak Discovery

[Pepke et al. Nature Methods 2009]

Transcript Quantification

[Pepke et al. Nature Methods 2009]

RPKM, FPKM
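RPKM (reads per kilobase of transcript per million mapped reads) scales a raw read count by transcript length and sequencing depth; FPKM is the same calculation with fragments (read pairs) counted instead of reads. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Because both length and depth are divided out, values are comparable across transcripts within a sample and, more roughly, across samples.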

SNP Calling

Typical RNA-seq Workflow

[Trapnell et al. Nature Biotech 2010]
