Archetype Analysis: a framework for selecting...

52
Archetype Analysis: a framework for selecting representative objects Volker Roth, Department of Mathematics and Computer Science, University of Basel IPAM MPSWS1 2016, V Roth 1

Transcript of Archetype Analysis: a framework for selecting...

Page 1: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis: a framework forselecting representative objects

Volker Roth, Department of Mathematics and Computer Science,

University of Basel

IPAM MPSWS1 2016, V Roth 1

Page 2: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Machine Learning: Supervised Setting

Learning Machine

Supervisor

Letter −> Class Label

Statistical Learning Theory

spam

Important

©20th

Cen

tury

Fox F

ilm

Corp

ora

tion

IPAM MPSWS1 2016, V Roth 2

Page 3: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Machine Learning: Uncertain Labels

Sometimes even the best experts make errors.

Labels might be uncertain or missing...

IPAM MPSWS1 2016, V Roth 3

Page 4: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Machine Learning: Unsupervised Setting

Learning Machine

Supervisor

©20th

Cen

tury

Fox F

ilm

Corp

ora

tion

(Bayesian) Model Selection

Maybeclustering

is a good idea

Data −> Structure

How sparse is the graph?

How many clusters?

IPAM MPSWS1 2016, V Roth 4

Page 5: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetypes (Cluster−)Prototypes

"an attempt at something""most perfect possible form"

©20th

Cen

tury

Fox F

ilm

Corp

ora

tion

©20th

Cen

tury

Fox F

ilm

Corp

ora

tion

A Repeated Pattern: Search for RepresentativeObservations

IPAM MPSWS1 2016, V Roth 5

Page 6: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis:Biological Motivation

Is there a theoretical foundation of the “archetype concept”?

; O. Shoval, H. Sheftel, G. Shinar, Y. Hart, O. Ramote, A. Mayo,

E. Dekel, K. Kavanagh, U. Alon: Evolutionary Trade-Offs, ParetoOptimality, and the Geometry of Phenotype Space. Science, 2012

IPAM MPSWS1 2016, V Roth 6

Page 7: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetypes and Evolutionary Trade-offs

IPAM MPSWS1 2016, V Roth Shoval et al.: Evolutionary Trade-Offs, Pareto Optimality, and the Geometry of Phenotype Space. Science, 2012 7

Page 8: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

IPAM MPSWS1 2016, V Roth Shoval et al.: Evolutionary Trade-Offs, Pareto Optimality, and the Geometry of Phenotype Space. Science, 2012 8

Page 9: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Gene Expression Space

Human colon crypt cells fall in a tetrahedron in gene expression space.

The four vertices of this tetrahedron are each enriched with genes for a

specific task related to stemness and early differentiation.IPAM MPSWS1 2016, V Roth Korem et al. 2015 9

Page 10: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Computational Archetype Selection

Cutler &Breiman, Archetypal Analysis, Technometrics 1994.

• n observations {x1, . . . ,xn} ∈ Rp, as rows of data matrix X ∈ Rn×p

• Aim: find K archetypes ⇒ Z ∈ RK×p; K � n fixed.

• Observations are convex mixtures of archetypes

xi = Ztai + εi, aij ≥ 0 and∑K

j=1 aij = 1.

IPAM MPSWS1 2016, V Roth 10

Page 11: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Computational Archetype Selection

• Archetypes themselves are convex mixtures of observations:

zi =n∑

j=1

bijxj, where bij ≥ 0 and∑n

j=1 bij = 1

Archetypes approximatethe convex hull!

• Constrained optimization problem involving two sets of coefficients

{aij} and {bij} : iteratively minimize sum of squares

(A, B) = argminA,B‖X −AZ‖2 = ‖X −ABX‖2.

IPAM MPSWS1 2016, V Roth 11

Page 12: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Basic Algorithm

1. Given K archetypes ZK×p, update the compositions ai

of the i-th object (a QP):

ai = arg mina∈Rp

+:at1=1‖xi − Zta‖2.

2. Given compositions An×K, update archetypes by solving

the least-squares problem

Z = arg minZ∈RK×p

‖X −AZ‖2.

3. Move ATs back to the convex hull (also a QP):

bk = arg minb∈Rn

+:bt1n=1‖zk −Xtb‖2.

IPAM MPSWS1 2016, V Roth 12

Page 13: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Computational Archetype Selection

Problems:

• High computational complexity,

• Solution heavily depends on initialization of archetypes,

• Have to fix the number of archetypes K a priori.

• Assume that we have a suitable representation such that it makes

sense to search for “triangles”...

???IPAM MPSWS1 2016, V Roth 13

Page 14: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis:Model Selection

How many archetypes?

IPAM MPSWS1 2016, V Roth 14

Page 15: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Model selection: how many archetypes?

• Solve for all numbers and choose “best”

; Bayesian model comparison.

• A clever way for looking at all nubers?

• Initialize all observations as archetypes, apply sparse regressionmethods to shrink most archetypes to zero.

Sparsity constraint

IPAM MPSWS1 2016, V Roth 15

Page 16: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Sparse Regression via the Group-Lasso

Least-squares: minβ ‖Xβ − y‖2.

Treat regression coefficients βi as resources needed for optimization.

Idea: limit resources ; model must concentrate on important features.

00

1

2

3

1

2ββ

Lasso GroupLasso

β

β β

β

β

‖β‖1 ≤ κ∑K

j=1 ‖βj‖2 ≤ κ

IPAM MPSWS1 2016, V Roth 16

Page 17: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Automatic detection of archetypes

Rewrite ‖X − AZ‖ as ‖~x − (A ⊗ Ip)~z‖, where ~x means column-wise

vectorization of X and use `1,2 block-norm constraint:K∑j=1

‖zj‖ ≤ κ.

1

n

2

x

x

x

2

a11

a11

0

0

a12

a12

a21

a21

0

0

a22

a22

0

0

an1

an1

an2

an2

a1K

a1K

0

0

anK

anK

a2K

a2K

0

0z1

0

0

0

0

0

0

0

0

z2

Kz Z

Z

21

22

Z11

GroupLasso constraint will shrink some archetypes zj to zero, depending

on constraint value κ.IPAM MPSWS1 2016, V Roth 17

Page 18: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

A solution for the model selection problem

• Use GL algorithm that approximates the whole solution path on a fine

grid of κ values

Sparsity constraint

• For every κ-value, use BIC score

BIC(µ) =‖~x− µ‖2

nσ2+

log(n)

n· df(µ),

and select the best scoring model.

IPAM MPSWS1 2016, V Roth 18

Page 19: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis:Algorithms

Will it work for huge datasets?

IPAM MPSWS1 2016, V Roth 19

Page 20: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Efficient algorithms(Bauckhage et al., 2015):

Solve step 1 and combined (2,3)-step with Frank-Wolfe algorithm

Idea: use linear optimization oracle over constraint set.

• Linear minimization oracle

∆(x) = argminz〈x, z〉,

s.t. g(z) ≤ κ

• Update x← (1−γ)x+γ∆(∇f(x))

• Decrease γ M. Jaggi, 2015

Pros: Highly efficient for one fixed value of κ

Cons: Not efficient for the whole solution path

; model selction trick does not work well.

IPAM MPSWS1 2016, V Roth 20

Page 21: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Alternative: Forward stagewise

Technically similar to Frank-Wolfe:• Linear minimization oracle (LMO)

∆(x) = argminz〈x, z〉,

s.t. g(z) ≤ ε� κ

• Update x← x+ ∆(∇f(x))∆

f (x)

g(z) <

contours of LMO

contours of f(x)

ε∆

...but very different behaviour:

• incremental path following behaviour

; efficient for computing the whole solution path

• built-in monotonicity “regularization” ; very stable,

can be extended to non-convex “norms” for increased sparsity.

IPAM MPSWS1 2016, V Roth 21

Page 22: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Forward stagewise

Consider step 1 in AT analysis:

Given K archetypes ZK×p, update the compositions ai of the i-th object

xi under convexity constraints:

ai = arg mina∈Rp

+:at1=1‖xi − Zta‖2.

This is a non-negative lasso estimate

mina

‖xi − Zta‖2

s.t. ‖a‖1 = 1, aj ≥ 0.

Use projected gradient:

a← a+ ∆(∇+f (a))

f (x)

∆ ∆

(x)+

f

contours of LMO

g(z) < ε

IPAM MPSWS1 2016, V Roth 22

Page 23: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Forward stagewise

• Useful if LMO can be computed easily: all norms, block-norms etc.

• NN lasso: LMO is intersection of simplex and linear function

; find best vertex l, update l-th component of a as al ← al + ε

; a is monotone increasing in every component.

• Simple and efficient: iterate until ‖a‖1 = 1 ; 1/ε iterations needed.

• Conceptually the same behaviour for group-lasso estimate in step 2.

R. Tibshirani, 2015

IPAM MPSWS1 2016, V Roth 23

Page 24: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Further algorithmic tricks

• Pre-select candidate points on convex hull:

– Points on convex hull in any linear projection are also on the

“full” convex hull.

– Convex hull computation very efficient in 2D: O(n log n)

– Randomly project data to planes, compute points on convex hull,

aggregate.

• Alternative to random projections: use pairwise PCA projections.

– PCA is a good preprocessing step anyway, since convex polygon with

K vertices can be embedded in a space with < K dimensions.

• We can easily solve AT problems with > 100 millions of objects.

IPAM MPSWS1 2016, V Roth 24

Page 25: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis:Data representation

How to encode the data such that we can see “triangles”?

IPAM MPSWS1 2016, V Roth 25

Page 26: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Representation issues

Representations

• different domains

(p-values, body size, weight)

• monotone transformations (e.g. log)

Problem: archetypal analysis is

sensitive to choice of representation

Solution: Copula extension

• representation independence

• robust against outliers

• mixed data & missing values

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

● ●

●●

● ● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●● ●

●●

●●

●●

●●

IPAM MPSWS1 2016, V Roth 26

Page 27: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Copula

Copula density

• p-dim pdf on [0, 1]p

• uniform marginals

• defines dependency

structure

Property

• construct arbitrary

multivariate distribution

yj = F−1j (uj)

(Sklar 1959)

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

copula

0.0

0.2

0.4

0.6

0.8

1.0

U1

0.0 0.2 0.4 0.6 0.8 1.0

U2

norm cdf

3 4 5 6 7 8 9

−0.

20.

00.

20.

40.

60.

8

beta cdf

●●

●●

● ● ● ●

●●

●● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

● ●

●● ●

●●

●●

●●● ●

●●

● ●

●●

●●

●●●

● ●●

●●

●●

●●●

● ●

● ●

●● ●

●●

●●

●●

●●

● ●●

●●

● ●

●●●

●●

●● ●●

●●

●●

●●

●●

● ● ●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●●

● ●● ●

●●

●●

● ●

● ●

● ●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●● ●

●●

●●

● ● ● ●

●●

●● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

● ●

●● ●

●●

●●

●●● ●

●●

● ●

●●

●●

●●●

● ●●

●●

●●

●●●

● ●

● ●

●● ●

●●

●●

●●

●●

● ●●

●●

● ●

●●●

●●

●● ●●

●●

●●

●●

●●

● ● ●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●●

● ●● ●

●●

●●

● ●

● ●

● ●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●● ●

Y1

Y2

measurement space

Independence Copula

IPAM MPSWS1 2016, V Roth 27

Page 28: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Copula

Copula density

• p-dim pdf on [0, 1]p

• uniform marginals

• defines dependency

structure

Property

• construct arbitrary

multivariate distribution

yj = F−1j (uj)

(Sklar 1959)

●●

●●

● ●

● ●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●● ●

●●

●●

● ●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●● ●

●●

●●

● ●

●●

copula

0.0

0.2

0.4

0.6

0.8

1.0

U1

0.0 0.2 0.4 0.6 0.8 1.0

U2

norm cdf

3 4 5 6 7 8 9

−0.

20.

00.

20.

40.

60.

8

beta cdf

●● ●

●●

●●

●●

●●

●●

● ●

●●

●● ●● ●

●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

●●●

●●

●●

●●●

●● ●

●●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

● ●

●●

● ● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●●

● ●

●●

● ●

● ●

●●

● ●

● ●●●●

●●

●● ●

●●

●●

●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

● ●

●●

●● ●● ●

●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

●●●

●●

●●

●●●

●● ●

●●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

● ●

●●

● ● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●●

● ●

●●

● ●

● ●

●●

● ●

● ●●●●

●●

●● ●

●●

●●

●●

●●

● ●

Y1

Y2

measurement space

Gaussian Copula

IPAM MPSWS1 2016, V Roth 28

Page 29: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Semi-parametric Gaussian Copula

Hierarchical model:

R

x ∼ N (0, R)

yi = fi(xi)

• ATs invariant against monotone

marginal transformations

• ATs depend only on ranks

; insensitive to outliers.

• Idea: reconstruct latent space

xj = (Φ−1 ◦ Fj)(yj)

• Use empirical cdf’s

Fj =ranks(Y [·,j])

n+1

−1.5 −1.0 −0.5 0.0 0.5 1.0

0.0

0.2

0.4

0.6

0.8

1.0

ecdf(y)

x

Fn(

x)

●●

●●●

●●●●

●●●

●●

●●●

●●●●

●●●

IPAM MPSWS1 2016, V Roth 29

Page 30: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Mixed Continous / Discrete Data

Problem with discretedata: ties, empirical copula

no longer uniform

Extended rank likelihood(Hoff, 2007):

stochastic association-

preserving mapping

Algorithm: Gibbs sampler

1. sample latent variables

x, R|x

2. compute archetypes z

copula

0.0

0.2

0.4

0.6

0.8

1.0

U1

0.0 0.2 0.4 0.6 0.8 1.0

U2

discrete cdf

5 10 15 20 25

Y1

Y2

−0.

20.

00.

20.

40.

60.

8

beta cdf

IPAM MPSWS1 2016, V Roth 30

Page 31: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Artificial Data

• Monotone transformation with beta marginals

• Variables quantised to 5 levels

IPAM MPSWS1 2016, V Roth 31

Page 32: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Why use the Gaussian copula?

Generative model

ai ∼ DirK(α)

xi|Z,ai ∼ N (Ztai, ηIp)

X|Z,A ∼MN

(M︷︸︸︷AZ , I, ηI

)XtX ∼ Wnc

(n, nI,M tM

)≈ Wc

(n,

1

nM tM + ηI

)(Steyn and Roux, 1972)

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

● ●

●●

● ● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

Under the assumed generative model,

a Gaussian covariance structure is plausible.

IPAM MPSWS1 2016, V Roth 32

Page 33: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetype Analysis:Applications

What is it good for?

IPAM MPSWS1 2016, V Roth 33

Page 34: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Analysis of E.coli Data

Identified ArchetypesShoval et al., Science, 2012

Tra

it 2

Trait 1

©2

0th

Cen

tury

Fo

x F

ilm

Co

rpo

rati

on

IPAM MPSWS1 2016, V Roth 34

Page 35: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetypical Chemical CompoundsIdea: find ATs in list of compounds identified in an AIDS antiviral screen

performed by the Developmental Therapeutics Program of the NCI/NIH,

enriched with all available anti-HIV drugs

Tanimoto similarity scores

Objects: chemical compounds SMILES strings

Rep.: Similarity matrix

PCA embedding

Archetype analysis

IPAM MPSWS1 2016, V Roth 35

Page 36: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetypical Chemical Compounds

1

2

34

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

IPAM MPSWS1 2016, V Roth 36

Page 37: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Archetypical Chemical Compounds

1

2

34

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

5

10

7

18

IPAM MPSWS1 2016, V Roth 37

Page 38: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Compounds explained by AT 5

0.99 AZT : 0.67

AZT It is of the nucleoside reverse-transcriptase inhibitor (NRTI) class.

It inhibits the enzyme (reverse transcriptase) that HIV uses to synthesize

DNA.

IPAM MPSWS1 2016, V Roth 38

Page 39: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Compounds explained by AT 7

Fosamprenavir : 0.94 0.99 0.99

amprenavir : 0.99 Darunavir;Prezista(TM) : 0.9 Aluviran;Lopinavir : 0.7

HIV protease inhibitors (they block a peptide cleaving enzyme).

IPAM MPSWS1 2016, V Roth 39

Page 40: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Compounds explained by AT 10

0.92 Calanolide A : 0.92 0.92

0.92 0.89 0.89

Calanolide A is an experimental non-nucleoside reverse transcriptase

inhibitor (NNRTI).IPAM MPSWS1 2016, V Roth 40

Page 41: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Compounds explained by AT 18

0.81 1−Naphthalenesulfonic acid, 3,3'−[1,2−ethenediylbis [(3−sulfo−4,1−phenylene)azo]]bis[4−hydroxy−, tetrasodium salt : 0.8

IPAM MPSWS1 2016, V Roth 41

Page 42: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

• Reuters Corpus Volume 1 (RCV1): an archive of news documents

• 4 categories: Economics, Government, Corporate, Markets

• 23149 documents, vocabulary of 57180 words.

• Documents represented by word frequencies:

Term frequency (TF) times Inverse Document Frequency (IDF).

• Automatically detect “pure” or archetypical documents

(which might represent “pure” topics)

au

cti

on

yie

ld

do

llar

hu

rric

an

e

sto

rm

mp

h

au

cti

on

yie

ld

do

llar

hu

rric

an

e

sto

rm

mp

h

+ 0.5 0.5 =

IPAM MPSWS1 2016, V Roth 42

Page 43: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

IPAM MPSWS1 2016, V Roth Prabhakaran et al., Springer LNCS 7476, 2012 43

Page 44: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

150 200 250 300 350 400

1.9

2.0

2.1

2.2

2.3

kappa

BIC

sco

re

9010

011

0

num

ber

of a

rche

type

s

BIC

RSS

#(Archetypes)

IPAM MPSWS1 2016, V Roth Prabhakaran, 2012 44

Page 45: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

Root

CORPORATE/INDUSTRIAL ECONOMICS GOVERNMENT/SOCIAL MARKETS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93

Archetype rank

IPAM MPSWS1 2016, V Roth 45

Page 46: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

auctionbi

llyi

eldtreasur

averpoundgko

egyptaccept

roub

l

bid

august

cair

pct

trillday

ringg

it

prevw

orth

mth

percent

avg

bank

disc

ount

russiarussian

cent

issu

million

rang

maturmalays

lumpur

kual

month

wednesdaynovoffer

off

bracket

newsroom

low

numb

high

ago

valu

rbi

rupee allot

week

daxindex

pointchip

blu

bundesbank

ftscac

germ

durabl

wallsentbours

unexpect

ibi

stre

et

nikk

ei

cut

pari

clos

fran

kfur

t

gainshar

stock

weigh

intr

gold

recov

reco

rd

sengdata

london

end

rate

finis

h

retreat

high

sess

ion

take

strong

rise

repo

good

zuric

h

hang

interest

profit

soft

spi

key

IPAM MPSWS1 2016, V Roth 46

Page 47: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

hurricanm

phtr

opic

dolly

edouard

kmh

storm

westward

windla

titud

coast westnortheast

edtmotion

mile

km

strength

pesc

outward

mexicverd

tampic warnveracruz

atlant

longitud

movehour

cape

north

extend

island

nort

hw

track

categ

continu

ocea

n

cent

southeast la

post

east

dist

sust

ain

forc

maxim

gain

bring

weakpalestinjerusalem

israel

neta

nyah

u

arafatplo

arab

jewish

gazahebron

west peac

east

settl

redivid

yass redeploy

talk

author

benj

amin

strik

levy

elect

tell

aug

carav declar

city

offic

bulldoz

demand

accord

bank

promis

war

lift legi

sl

troop

resum

chron

clos

freez

bow

min

ist

wing

anger

affair built

convimpos

IPAM MPSWS1 2016, V Roth 47

Page 48: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Text categorization

dist

illgasolin

crudbarrelbrent ip

eapi

oil ashbyinvent

racestock

oct

doldrum

gas

tom

drag

produc

brokdata

heat

low

bull

touch

petroleumdemand

fall surg

responsibl

refinrun

referround

rise

thing

holidayseason

com

e

weekend

sept

strength

coun

t

labor

grea

t

shar

p

day

institut

week

early

gmt

pukiraq ku

rd

kdpkurdist

artil

l

baghdad

arbilnorthern

rivalattack

iran

fight irani

an

gulf

forc tehr

patr

iot

ceasefir

casualttalaban

clash

washington

irna

accus

heavy

shell

zone

barzan

milit

guerrill

turk

ey

faction

troop

immed

split

bar

protect

party

radi

o

confirm

war

back

enclavreact

agent

demo

massoud

bahr

ain

invitat

IPAM MPSWS1 2016, V Roth 48

Page 49: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Pose analysis

IPAM MPSWS1 2016, V Roth Bauckhage et al., 2009 49

Page 50: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Image Encoding

IPAM MPSWS1 2016, V Roth Bauckhage et al., 2015 50

Page 51: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Conclusion

• Archetype analysis is a powerful tool identifying representative objectsin large data collections.

• Many technical challenges:

– data types, representation

; invariances due to semi-parametric copula construction

– computational/memory complexity

; stagewise forward algorithms, convex hull approximations, etc.

• Many application areas: objects can be bacteria, genes, documents,

chemical compounds, images, etc. etc.

• Open questions: AT analysis essentially is an auto-encoding technique

; neural implementations,

; use as building-blocks in deep belief networks, etc.

IPAM MPSWS1 2016, V Roth 51

Page 52: Archetype Analysis: a framework for selecting ...helper.ipam.ucla.edu/publications/mpsws1/mpsws1_13590.pdf · Archetype Analysis: a framework for selecting representative objects

Acknowledgments

Department of Mathematics and Computer Science, University of Basel:

Dinu Kaufmann, Sebastian Keller, Damian Murezzan, Sonali Parb-hoo, Sandhya Prabhakaran, Melanie Rey, Aleksander Wieczorek,Mario Wieser

University Hospital Zurich: Francesca Di Giallonardo, Yannick Duport,Christine Leemann, Stefan Schmutz, Nottania K. Campbell, BedaJoos, Osvaldo Zagordi, Huldrych F. Gunthard, Karin J. Metzner

Department of Biosystems Science and Engineering, ETH Zurich:

Armin Topfer, Christian Beisel, Niko Beerenwinkel

Functional Genomics Center Zurich: Maria R.Lecca, Andrea Patrignani

Inst. Medical Virology, U Zurich: Peter Rusert, Alexandra Trkola

IPAM MPSWS1 2016, V Roth 52