Archetype Analysis: a framework for selecting...
Transcript of Archetype Analysis: a framework for selecting...
Archetype Analysis: a framework forselecting representative objects
Volker Roth, Department of Mathematics and Computer Science,
University of Basel
IPAM MPSWS1 2016, V Roth 1
Machine Learning: Supervised Setting
Learning Machine
Supervisor
Letter −> Class Label
Statistical Learning Theory
spam
Important
©20th
Cen
tury
Fox F
ilm
Corp
ora
tion
IPAM MPSWS1 2016, V Roth 2
Machine Learning: Uncertain Labels
Sometimes even the best experts make errors.
Labels might be uncertain or missing...
IPAM MPSWS1 2016, V Roth 3
Machine Learning: Unsupervised Setting
Learning Machine
Supervisor
©20th
Cen
tury
Fox F
ilm
Corp
ora
tion
(Bayesian) Model Selection
Maybeclustering
is a good idea
Data −> Structure
How sparse is the graph?
How many clusters?
IPAM MPSWS1 2016, V Roth 4
Archetypes (Cluster−)Prototypes
"an attempt at something""most perfect possible form"
©20th
Cen
tury
Fox F
ilm
Corp
ora
tion
©20th
Cen
tury
Fox F
ilm
Corp
ora
tion
A Repeated Pattern: Search for RepresentativeObservations
IPAM MPSWS1 2016, V Roth 5
Archetype Analysis:Biological Motivation
Is there a theoretical foundation of the “archetype concept”?
; O. Shoval, H. Sheftel, G. Shinar, Y. Hart, O. Ramote, A. Mayo,
E. Dekel, K. Kavanagh, U. Alon: Evolutionary Trade-Offs, ParetoOptimality, and the Geometry of Phenotype Space. Science, 2012
IPAM MPSWS1 2016, V Roth 6
Archetypes and Evolutionary Trade-offs
IPAM MPSWS1 2016, V Roth Shoval et al.: Evolutionary Trade-Offs, Pareto Optimality, and the Geometry of Phenotype Space. Science, 2012 7
IPAM MPSWS1 2016, V Roth Shoval et al.: Evolutionary Trade-Offs, Pareto Optimality, and the Geometry of Phenotype Space. Science, 2012 8
Gene Expression Space
Human colon crypt cells fall in a tetrahedron in gene expression space.
The four vertices of this tetrahedron are each enriched with genes for a
specific task related to stemness and early differentiation.IPAM MPSWS1 2016, V Roth Korem et al. 2015 9
Computational Archetype Selection
Cutler &Breiman, Archetypal Analysis, Technometrics 1994.
• n observations {x1, . . . ,xn} ∈ Rp, as rows of data matrix X ∈ Rn×p
• Aim: find K archetypes ⇒ Z ∈ RK×p; K � n fixed.
• Observations are convex mixtures of archetypes
xi = Ztai + εi, aij ≥ 0 and∑K
j=1 aij = 1.
IPAM MPSWS1 2016, V Roth 10
Computational Archetype Selection
• Archetypes themselves are convex mixtures of observations:
zi =n∑
j=1
bijxj, where bij ≥ 0 and∑n
j=1 bij = 1
Archetypes approximatethe convex hull!
• Constrained optimization problem involving two sets of coefficients
{aij} and {bij} : iteratively minimize sum of squares
(A, B) = argminA,B‖X −AZ‖2 = ‖X −ABX‖2.
IPAM MPSWS1 2016, V Roth 11
Basic Algorithm
1. Given K archetypes ZK×p, update the compositions ai
of the i-th object (a QP):
ai = arg mina∈Rp
+:at1=1‖xi − Zta‖2.
2. Given compositions An×K, update archetypes by solving
the least-squares problem
Z = arg minZ∈RK×p
‖X −AZ‖2.
3. Move ATs back to the convex hull (also a QP):
bk = arg minb∈Rn
+:bt1n=1‖zk −Xtb‖2.
IPAM MPSWS1 2016, V Roth 12
Computational Archetype Selection
Problems:
• High computational complexity,
• Solution heavily depends on initialization of archetypes,
• Have to fix the number of archetypes K a priori.
• Assume that we have a suitable representation such that it makes
sense to search for “triangles”...
???IPAM MPSWS1 2016, V Roth 13
Archetype Analysis:Model Selection
How many archetypes?
IPAM MPSWS1 2016, V Roth 14
Model selection: how many archetypes?
• Solve for all numbers and choose “best”
; Bayesian model comparison.
• A clever way for looking at all nubers?
• Initialize all observations as archetypes, apply sparse regressionmethods to shrink most archetypes to zero.
Sparsity constraint
IPAM MPSWS1 2016, V Roth 15
Sparse Regression via the Group-Lasso
Least-squares: minβ ‖Xβ − y‖2.
Treat regression coefficients βi as resources needed for optimization.
Idea: limit resources ; model must concentrate on important features.
00
1
2
3
1
2ββ
Lasso GroupLasso
β
β β
β
β
‖β‖1 ≤ κ∑K
j=1 ‖βj‖2 ≤ κ
IPAM MPSWS1 2016, V Roth 16
Automatic detection of archetypes
Rewrite ‖X − AZ‖ as ‖~x − (A ⊗ Ip)~z‖, where ~x means column-wise
vectorization of X and use `1,2 block-norm constraint:K∑j=1
‖zj‖ ≤ κ.
1
n
2
x
x
x
2
a11
a11
0
0
a12
a12
a21
a21
0
0
a22
a22
0
0
an1
an1
an2
an2
a1K
a1K
0
0
anK
anK
a2K
a2K
0
0z1
0
0
0
0
0
0
0
0
z2
Kz Z
Z
21
22
Z11
GroupLasso constraint will shrink some archetypes zj to zero, depending
on constraint value κ.IPAM MPSWS1 2016, V Roth 17
A solution for the model selection problem
• Use GL algorithm that approximates the whole solution path on a fine
grid of κ values
Sparsity constraint
• For every κ-value, use BIC score
BIC(µ) =‖~x− µ‖2
nσ2+
log(n)
n· df(µ),
and select the best scoring model.
IPAM MPSWS1 2016, V Roth 18
Archetype Analysis:Algorithms
Will it work for huge datasets?
IPAM MPSWS1 2016, V Roth 19
Efficient algorithms(Bauckhage et al., 2015):
Solve step 1 and combined (2,3)-step with Frank-Wolfe algorithm
Idea: use linear optimization oracle over constraint set.
• Linear minimization oracle
∆(x) = argminz〈x, z〉,
s.t. g(z) ≤ κ
• Update x← (1−γ)x+γ∆(∇f(x))
• Decrease γ M. Jaggi, 2015
Pros: Highly efficient for one fixed value of κ
Cons: Not efficient for the whole solution path
; model selction trick does not work well.
IPAM MPSWS1 2016, V Roth 20
Alternative: Forward stagewise
Technically similar to Frank-Wolfe:• Linear minimization oracle (LMO)
∆(x) = argminz〈x, z〉,
s.t. g(z) ≤ ε� κ
• Update x← x+ ∆(∇f(x))∆
f (x)
g(z) <
contours of LMO
contours of f(x)
ε∆
...but very different behaviour:
• incremental path following behaviour
; efficient for computing the whole solution path
• built-in monotonicity “regularization” ; very stable,
can be extended to non-convex “norms” for increased sparsity.
IPAM MPSWS1 2016, V Roth 21
Forward stagewise
Consider step 1 in AT analysis:
Given K archetypes ZK×p, update the compositions ai of the i-th object
xi under convexity constraints:
ai = arg mina∈Rp
+:at1=1‖xi − Zta‖2.
This is a non-negative lasso estimate
mina
‖xi − Zta‖2
s.t. ‖a‖1 = 1, aj ≥ 0.
Use projected gradient:
a← a+ ∆(∇+f (a))
∆
f (x)
∆ ∆
(x)+
f
contours of LMO
g(z) < ε
IPAM MPSWS1 2016, V Roth 22
Forward stagewise
• Useful if LMO can be computed easily: all norms, block-norms etc.
• NN lasso: LMO is intersection of simplex and linear function
; find best vertex l, update l-th component of a as al ← al + ε
; a is monotone increasing in every component.
• Simple and efficient: iterate until ‖a‖1 = 1 ; 1/ε iterations needed.
• Conceptually the same behaviour for group-lasso estimate in step 2.
R. Tibshirani, 2015
IPAM MPSWS1 2016, V Roth 23
Further algorithmic tricks
• Pre-select candidate points on convex hull:
– Points on convex hull in any linear projection are also on the
“full” convex hull.
– Convex hull computation very efficient in 2D: O(n log n)
– Randomly project data to planes, compute points on convex hull,
aggregate.
• Alternative to random projections: use pairwise PCA projections.
– PCA is a good preprocessing step anyway, since convex polygon with
K vertices can be embedded in a space with < K dimensions.
• We can easily solve AT problems with > 100 millions of objects.
IPAM MPSWS1 2016, V Roth 24
Archetype Analysis:Data representation
How to encode the data such that we can see “triangles”?
IPAM MPSWS1 2016, V Roth 25
Representation issues
Representations
• different domains
(p-values, body size, weight)
• monotone transformations (e.g. log)
Problem: archetypal analysis is
sensitive to choice of representation
Solution: Copula extension
• representation independence
• robust against outliers
• mixed data & missing values
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
● ● ●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
IPAM MPSWS1 2016, V Roth 26
Copula
Copula density
• p-dim pdf on [0, 1]p
• uniform marginals
• defines dependency
structure
Property
• construct arbitrary
multivariate distribution
yj = F−1j (uj)
(Sklar 1959)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
copula
0.0
0.2
0.4
0.6
0.8
1.0
U1
0.0 0.2 0.4 0.6 0.8 1.0
U2
norm cdf
3 4 5 6 7 8 9
−0.
20.
00.
20.
40.
60.
8
beta cdf
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
● ● ● ●
●
●
●
●●
●● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
● ●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●● ●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●●
●
●●
●●●
●
●
●
●
●
●
●
●
● ●●
●●
●●
●●●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●
●
●
●
●●
●●
●
●●
●
● ●●
●
●
●●
●
● ●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●● ●●
●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
● ●● ●
●●
●
●●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
● ●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●●
●
●●
●
●●
●●
●
● ●●
●
●●
●●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
● ● ● ●
●
●
●
●●
●● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
● ●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●● ●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●●
●
●●
●●●
●
●
●
●
●
●
●
●
● ●●
●●
●●
●●●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●
●
●
●
●●
●●
●
●●
●
● ●●
●
●
●●
●
● ●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●● ●●
●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●● ●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●●
● ●● ●
●●
●
●●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
● ●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●●
●
●●
●
●●
●●
●
● ●●
●
●●
●●
●
●● ●
●
Y1
Y2
measurement space
Independence Copula
IPAM MPSWS1 2016, V Roth 27
Copula
Copula density
• p-dim pdf on [0, 1]p
• uniform marginals
• defines dependency
structure
Property
• construct arbitrary
multivariate distribution
yj = F−1j (uj)
(Sklar 1959)
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
copula
0.0
0.2
0.4
0.6
0.8
1.0
U1
0.0 0.2 0.4 0.6 0.8 1.0
U2
norm cdf
3 4 5 6 7 8 9
−0.
20.
00.
20.
40.
60.
8
beta cdf
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●●
●●
●●
●●
●●
● ●
●
●
●
●
●
●●
●
●
●
●● ●● ●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
● ●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●● ●
●
●
●
●
●●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
● ●●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●●●
●●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●●
●
●●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
● ●
●
●
●●
●
●
● ●
● ●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
● ●●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
● ●●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●●
●●
●●
●●
●●
● ●
●
●
●
●
●
●●
●
●
●
●● ●● ●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
● ●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●● ●
●
●
●
●
●●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
● ●●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●●●
●●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●●
●
●●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
● ●
●
●
●●
●
●
● ●
● ●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
● ●●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
● ●
Y1
Y2
measurement space
Gaussian Copula
IPAM MPSWS1 2016, V Roth 28
Semi-parametric Gaussian Copula
Hierarchical model:
R
x ∼ N (0, R)
yi = fi(xi)
• ATs invariant against monotone
marginal transformations
• ATs depend only on ranks
; insensitive to outliers.
• Idea: reconstruct latent space
xj = (Φ−1 ◦ Fj)(yj)
• Use empirical cdf’s
Fj =ranks(Y [·,j])
n+1
−1.5 −1.0 −0.5 0.0 0.5 1.0
0.0
0.2
0.4
0.6
0.8
1.0
ecdf(y)
x
Fn(
x)
●●
●●●
●●●●
●●●
●●
●●●
●●●●
●●●
IPAM MPSWS1 2016, V Roth 29
Mixed Continous / Discrete Data
Problem with discretedata: ties, empirical copula
no longer uniform
Extended rank likelihood(Hoff, 2007):
stochastic association-
preserving mapping
Algorithm: Gibbs sampler
1. sample latent variables
x, R|x
2. compute archetypes z
copula
0.0
0.2
0.4
0.6
0.8
1.0
U1
0.0 0.2 0.4 0.6 0.8 1.0
U2
discrete cdf
5 10 15 20 25
Y1
Y2
−0.
20.
00.
20.
40.
60.
8
beta cdf
IPAM MPSWS1 2016, V Roth 30
Artificial Data
• Monotone transformation with beta marginals
• Variables quantised to 5 levels
IPAM MPSWS1 2016, V Roth 31
Why use the Gaussian copula?
Generative model
ai ∼ DirK(α)
xi|Z,ai ∼ N (Ztai, ηIp)
X|Z,A ∼MN
(M︷︸︸︷AZ , I, ηI
)XtX ∼ Wnc
(n, nI,M tM
)≈ Wc
(n,
1
nM tM + ηI
)(Steyn and Roux, 1972)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
● ● ●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
Under the assumed generative model,
a Gaussian covariance structure is plausible.
IPAM MPSWS1 2016, V Roth 32
Archetype Analysis:Applications
What is it good for?
IPAM MPSWS1 2016, V Roth 33
Analysis of E.coli Data
Identified ArchetypesShoval et al., Science, 2012
Tra
it 2
Trait 1
©2
0th
Cen
tury
Fo
x F
ilm
Co
rpo
rati
on
IPAM MPSWS1 2016, V Roth 34
Archetypical Chemical CompoundsIdea: find ATs in list of compounds identified in an AIDS antiviral screen
performed by the Developmental Therapeutics Program of the NCI/NIH,
enriched with all available anti-HIV drugs
Tanimoto similarity scores
Objects: chemical compounds SMILES strings
Rep.: Similarity matrix
PCA embedding
Archetype analysis
IPAM MPSWS1 2016, V Roth 35
Archetypical Chemical Compounds
1
2
34
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
IPAM MPSWS1 2016, V Roth 36
Archetypical Chemical Compounds
1
2
34
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
5
10
7
18
IPAM MPSWS1 2016, V Roth 37
Compounds explained by AT 5
0.99 AZT : 0.67
AZT It is of the nucleoside reverse-transcriptase inhibitor (NRTI) class.
It inhibits the enzyme (reverse transcriptase) that HIV uses to synthesize
DNA.
IPAM MPSWS1 2016, V Roth 38
Compounds explained by AT 7
Fosamprenavir : 0.94 0.99 0.99
amprenavir : 0.99 Darunavir;Prezista(TM) : 0.9 Aluviran;Lopinavir : 0.7
HIV protease inhibitors (they block a peptide cleaving enzyme).
IPAM MPSWS1 2016, V Roth 39
Compounds explained by AT 10
0.92 Calanolide A : 0.92 0.92
0.92 0.89 0.89
Calanolide A is an experimental non-nucleoside reverse transcriptase
inhibitor (NNRTI).IPAM MPSWS1 2016, V Roth 40
Compounds explained by AT 18
0.81 1−Naphthalenesulfonic acid, 3,3'−[1,2−ethenediylbis [(3−sulfo−4,1−phenylene)azo]]bis[4−hydroxy−, tetrasodium salt : 0.8
IPAM MPSWS1 2016, V Roth 41
Text categorization
• Reuters Corpus Volume 1 (RCV1): an archive of news documents
• 4 categories: Economics, Government, Corporate, Markets
• 23149 documents, vocabulary of 57180 words.
• Documents represented by word frequencies:
Term frequency (TF) times Inverse Document Frequency (IDF).
• Automatically detect “pure” or archetypical documents
(which might represent “pure” topics)
au
cti
on
yie
ld
do
llar
hu
rric
an
e
sto
rm
mp
h
au
cti
on
yie
ld
do
llar
hu
rric
an
e
sto
rm
mp
h
+ 0.5 0.5 =
IPAM MPSWS1 2016, V Roth 42
Text categorization
IPAM MPSWS1 2016, V Roth Prabhakaran et al., Springer LNCS 7476, 2012 43
Text categorization
150 200 250 300 350 400
1.9
2.0
2.1
2.2
2.3
kappa
BIC
sco
re
9010
011
0
num
ber
of a
rche
type
s
BIC
RSS
#(Archetypes)
IPAM MPSWS1 2016, V Roth Prabhakaran, 2012 44
Text categorization
Root
CORPORATE/INDUSTRIAL ECONOMICS GOVERNMENT/SOCIAL MARKETS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
Archetype rank
IPAM MPSWS1 2016, V Roth 45
Text categorization
auctionbi
llyi
eldtreasur
averpoundgko
egyptaccept
roub
l
bid
august
cair
pct
trillday
ringg
it
prevw
orth
mth
percent
avg
bank
disc
ount
russiarussian
cent
issu
million
rang
maturmalays
lumpur
kual
month
wednesdaynovoffer
off
bracket
newsroom
low
numb
high
ago
valu
rbi
rupee allot
week
daxindex
pointchip
blu
bundesbank
ftscac
germ
durabl
wallsentbours
unexpect
ibi
stre
et
nikk
ei
cut
pari
clos
fran
kfur
t
gainshar
stock
weigh
intr
gold
recov
reco
rd
sengdata
london
end
rate
finis
h
retreat
high
sess
ion
take
strong
rise
repo
good
zuric
h
hang
interest
profit
soft
spi
key
IPAM MPSWS1 2016, V Roth 46
Text categorization
hurricanm
phtr
opic
dolly
edouard
kmh
storm
westward
windla
titud
coast westnortheast
edtmotion
mile
km
strength
pesc
outward
mexicverd
tampic warnveracruz
atlant
longitud
movehour
cape
north
extend
island
nort
hw
track
categ
continu
ocea
n
cent
southeast la
post
east
dist
sust
ain
forc
maxim
gain
bring
weakpalestinjerusalem
israel
neta
nyah
u
arafatplo
arab
jewish
gazahebron
west peac
east
settl
redivid
yass redeploy
talk
author
benj
amin
strik
levy
elect
tell
aug
carav declar
city
offic
bulldoz
demand
accord
bank
promis
war
lift legi
sl
troop
resum
chron
clos
freez
bow
min
ist
wing
anger
affair built
convimpos
IPAM MPSWS1 2016, V Roth 47
Text categorization
dist
illgasolin
crudbarrelbrent ip
eapi
oil ashbyinvent
racestock
oct
doldrum
gas
tom
drag
produc
brokdata
heat
low
bull
touch
petroleumdemand
fall surg
responsibl
refinrun
referround
rise
thing
holidayseason
com
e
weekend
sept
strength
coun
t
labor
grea
t
shar
p
day
institut
week
early
gmt
pukiraq ku
rd
kdpkurdist
artil
l
baghdad
arbilnorthern
rivalattack
iran
fight irani
an
gulf
forc tehr
patr
iot
ceasefir
casualttalaban
clash
washington
irna
accus
heavy
shell
zone
barzan
milit
guerrill
turk
ey
faction
troop
immed
split
bar
protect
party
radi
o
confirm
war
back
enclavreact
agent
demo
massoud
bahr
ain
invitat
IPAM MPSWS1 2016, V Roth 48
Pose analysis
IPAM MPSWS1 2016, V Roth Bauckhage et al., 2009 49
Image Encoding
IPAM MPSWS1 2016, V Roth Bauckhage et al., 2015 50
Conclusion
• Archetype analysis is a powerful tool identifying representative objectsin large data collections.
• Many technical challenges:
– data types, representation
; invariances due to semi-parametric copula construction
– computational/memory complexity
; stagewise forward algorithms, convex hull approximations, etc.
• Many application areas: objects can be bacteria, genes, documents,
chemical compounds, images, etc. etc.
• Open questions: AT analysis essentially is an auto-encoding technique
; neural implementations,
; use as building-blocks in deep belief networks, etc.
IPAM MPSWS1 2016, V Roth 51
Acknowledgments
Department of Mathematics and Computer Science, University of Basel:
Dinu Kaufmann, Sebastian Keller, Damian Murezzan, Sonali Parb-hoo, Sandhya Prabhakaran, Melanie Rey, Aleksander Wieczorek,Mario Wieser
University Hospital Zurich: Francesca Di Giallonardo, Yannick Duport,Christine Leemann, Stefan Schmutz, Nottania K. Campbell, BedaJoos, Osvaldo Zagordi, Huldrych F. Gunthard, Karin J. Metzner
Department of Biosystems Science and Engineering, ETH Zurich:
Armin Topfer, Christian Beisel, Niko Beerenwinkel
Functional Genomics Center Zurich: Maria R.Lecca, Andrea Patrignani
Inst. Medical Virology, U Zurich: Peter Rusert, Alexandra Trkola
IPAM MPSWS1 2016, V Roth 52