CSAMA 2013: Computational visualization in genomic data analysis

Vince Carey PhD

Harvard Medical School

Road map of the talk

•  Motivations for computational visualization

–  Concrete accompaniment to more abstract statistical tabulations/inferences

–  Exploring an observational frontier

•  Plotting and basic statistics (design, univariate, multivariate, multiple comparisons)

•  Grammar of graphics concepts, deployment against Yeast Cell Cycle expression archive

•  ggbio applications to GWAS

•  Object designs for CCLE and a cancer regulatory network

Three principles

•  Visualizations should be obtained using a program: no Photoshop

–  Steps by which snapshots of interactive visualizations are obtained should have a textual representation

•  The process by which a given visualization was selected (from among relevant variations) should be disclosed

•  Uncertainty and variability can be hard to depict but attempts should be made to indicate:

–  Roles of modeling assumptions

–  Scope and sources of measurement variation

Report to an academy

“looks like there is something there”

Honest pairwise comparisons
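The figure for this slide did not survive extraction. One reading of "honest" is Tukey's honest significant difference procedure, which adjusts all pairwise comparisons simultaneously; that reading, and the choice of the iris data and of Sepal.Length as the response, are assumptions made for illustration. A minimal sketch in base R:

```r
# Sketch only: all pairwise species comparisons for one iris measurement,
# with family-wise (simultaneous) 95% confidence intervals.
fit <- aov(Sepal.Length ~ Species, data = iris)
hsd <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair: estimated difference, interval bounds, adjusted p-value
print(hsd$Species)

# In batch runs, send graphics to a null device instead of Rplots.pdf
if (!interactive()) pdf(NULL)

# An interval that crosses 0 is not a convincing difference after adjustment
plot(hsd)
```

The plot shows the adjusted intervals rather than raw p-values, so the viewer sees both the size of each difference and its uncertainty.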

A basic dilemma

•  Use visualization to discover/communicate relationships in data

•  Avoid choosing visualizations that overstate the strength of relationship

–  Are there trustworthy practices?

–  Can we guide the viewer to results of a principled statistical analysis? Relationships that are highly likely to “independently replicate”?

Communicate about the experimental design

•  Objectives

•  Demonstration of validity and satisfactory power of proposed test procedure

•  Illustration of development of the dataset from screening to analysis

R/Bioconductor and scientific visualization

•  Integrative, self-describing containers

–  ExpressionSet, SummarizedExperiment, GRanges

•  Packages that define workflow components

–  affy, oligo, limma, ShortRead, DESeq, GSEAlm

–  these often include their own visualization functions

•  Packages that specialize in visualization

–  geneplotter, ggbio (focused ggplot2), Gviz, Rgraphviz, HilbertVis

By the end of the course

•  Understand the rationale for the container designs, and the ones you will actually use

•  Understand why functions have been packaged up as they have been

•  Get a sense of the discipline required to use the containers and packages vs. files and scripts

•  Sharpen your statistical acumen … recognize the good arguments, improve the flawed ones

A reproducible experiment?

Four labs assay a sample of blinded origin on two quantities. As regulator, do you declare them consistent?

When we look at the data….

Moral of the Anscombe data?

•  “High-quality” summary statistics (e.g., estimators that are unbiased, efficient under mild regularity conditions …) can hide a lot of scientifically relevant information

–  Sometimes a good visualization will reveal important structures

•  Always create an opportunity to look

•  Fold visualizations at various scales into your reports
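The point can be checked directly with the `anscombe` data frame that ships with base R. The sketch below is an illustration, not the code behind the original slides: it computes the usual summaries for each of the four x/y pairs and then draws the four scatterplots with their fitted lines.

```r
# anscombe has columns x1..x4 and y1..y4 (datasets package, loaded by default)
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
})
colnames(stats) <- paste0("set", 1:4)

# The four columns agree to about two decimal places
print(round(stats, 2))

# In batch runs, send graphics to a null device instead of Rplots.pdf
if (!interactive()) pdf(NULL)

# The plots, by contrast, look nothing alike
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red")
}
par(op)
```

A regulator who saw only the table would call the four sets consistent; the panels show a linear cloud, a curve, and two sets dominated by a single point.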

Displaying 150x4 iris measurements with pairs()
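The figure itself is missing from this transcript. A display of this kind can be produced roughly as follows; the colours and point settings are arbitrary choices, not taken from the original slide.

```r
# Scatterplot matrix of the four numeric iris measurements (150 rows),
# with points coloured by species.
species_col <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]

# In batch runs, send graphics to a null device instead of Rplots.pdf
if (!interactive()) pdf(NULL)

pairs(iris[, 1:4],
      col = species_col,
      pch = 19, cex = 0.6,
      main = "iris: 150 flowers x 4 measurements")
```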

A unified view with ggplot2

Marginal distributions for species discrimination

A detour: large-scale use of boxplots

•  The ReportingTools package is an important new contribution that allows developers to synthesize approaches to analysis and visualization for high-level communications

•  Target medium is a modern browser with significant client-side capabilities permitting search, sorting, etc.

Summary thus far

•  Visuals are to help with interpretation

–  Interpretation is difficult without clear sense of objectives of underlying experiment

–  Experiments are often messy … data filtering diagram like CONSORT should be standard practice

•  Marginal and pairwise joint distributions are easy to visualize

–  Marginal and pairwise statistics/tests can obscure important patterns

–  Efficient enhancements to tables (ReportingTools)

Principal components reexpression of a multivariate dataset

PC1 is the linear combination of the original variables (with a unit-length coefficient vector) that has maximum variance among all such linear combinations; PC2 is the maximum-variance linear combination among all those uncorrelated with PC1, and so on for later components.
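The definition can be verified numerically. The sketch below uses `prcomp()` on the four iris measurements (the dataset choice is an assumption carried over from the neighbouring slides) and checks the two defining properties: component scores are uncorrelated, and no randomly drawn unit-length linear combination has more variance than PC1.

```r
X <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Loadings: each column holds the unit-length coefficients of one component
print(round(pca$rotation, 3))

# Component variances, largest first
pc_var <- pca$sdev^2
print(round(pc_var, 4))

# Scores on different components are uncorrelated (off-diagonals near 0)
score_cor <- cor(pca$x)
print(round(score_cor, 10))

# Compare PC1 with 1000 random unit-length linear combinations
set.seed(1)
rand_var <- replicate(1000, {
  a <- rnorm(ncol(X))
  a <- a / sqrt(sum(a^2))   # normalise to unit length
  var(as.vector(X %*% a))
})
print(c(best_random = max(rand_var), pc1 = pc_var[1]))
```

The random search is only a sanity check, not a proof; the maximality of PC1 follows from the eigendecomposition of the covariance matrix.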

Interpreting the “loading” on PCs

Heatmaps: distances, clustering criteria, annotation
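Each of the three ingredients named in this slide title corresponds to an argument of base R's `heatmap()`. The sketch below is illustrative only: the iris data, Manhattan distance, and average linkage are arbitrary choices, picked to show that none of the defaults is forced.

```r
# Standardise columns so that no single measurement dominates the distances
x <- scale(as.matrix(iris[, 1:4]))

# Annotation: one colour per row, encoding the known species
species_col <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]

# In batch runs, send graphics to a null device instead of Rplots.pdf
if (!interactive()) pdf(NULL)

hm <- heatmap(x,
              distfun = function(m) dist(m, method = "manhattan"),    # distance
              hclustfun = function(d) hclust(d, method = "average"),  # clustering criterion
              RowSideColors = species_col,                            # annotation
              scale = "none",        # data were already standardised above
              labRow = NA,           # 150 row labels would be unreadable
              margins = c(8, 2))

# The return value records the row and column orderings that were drawn
str(hm)
```

Swapping the distance or the linkage method can rearrange the rows noticeably, which is a reason to report which choices were made.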

Under the hood with dendrograms: c1 = hclust(dist(iris[,1:4]))
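To look inside the object created above, the following sketch inspects its components and cuts the tree into groups. The choice of three groups and the comparison with average linkage are assumptions added for illustration.

```r
c1 <- hclust(dist(iris[, 1:4]))   # defaults: Euclidean distance, complete linkage

# The fitted object is a plain list: merge history, merge heights, leaf order
str(c1[c("merge", "height", "order")])

# Cut the tree into 3 groups and compare with the known species
groups <- cutree(c1, k = 3)
print(table(groups, iris$Species))

# A different linkage gives a different tree and a different partition
c2 <- hclust(dist(iris[, 1:4]), method = "average")
print(table(complete = groups, average = cutree(c2, k = 3)))

# In batch runs, send graphics to a null device instead of Rplots.pdf
if (!interactive()) pdf(NULL)

plot(as.dendrogram(c1), leaflab = "none",
     main = "Complete linkage, Euclidean distance")
```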

Summary

•  PCA is widely used for “dimension reduction”, “eigengene” reexpression, QA, removal of extraneous variation

•  Cluster analyses are commonly used and underlie many prominent displays/analyses

–  Highly tunable

–  R and other tools hide complexity: “don’t believe in magic”, know how to open the box

“Grammar of graphics”

Deploying grammar of graphics directly on an expression archive

In the brixvis2013 lab (optional)

•  Function clquad() reexpresses a piece of a cluster analysis of cell-cycle expression trajectories

•  Function clpts() breaks up a cluster into its constituent trajectories

•  Neither is written for general use, but both help to illustrate exposure of variability at different scales for a “holistic” workflow step

Some examples with ggbio package: autoplot method sensitive to input class

ggbio comments

•  Containers for genomic annotation and experimental results have autoplot methods

•  ggplot2 factorization of visualization tasks allows programmatically efficient embellishments

•  You can go your own way

Lightweight containers for substantial experimental data

Simple syntax for CCLE surveys

Genetics of topotecan sensitivity

Summary of CCLE visualization support

•  Regularities in data structure across cell lines identified for retention in hierarchical object structure

•  plot() methods defined for inputs at different levels of the hierarchy

•  Use of ggplot2 infrastructure permits immediate use of tunable statistical visualization patterns, factorization of embellishments

Two levels of encyclopedia structure

Envoi: Variant-TF-Phenotype network structures

Conclusions

•  R/Bioconductor provide substantial infrastructure for flexible approaches to visualization

•  Emphasis: statistical integrity, reproducibility, acknowledgment of variability, uncertainty

•  Productive approach: separately conceptualize the underlying data structure, visualization objectives, and code to render the information in a tunable, extensible way
