Detecting Path Anomalies in Sequential Data on Networks
Tim LaRocktlarock.github.io
[email protected] collaboration with
Tina Eliassi-RadGiona CasiraghiVahan Nanumyan
1
Ingo Scholtes Frank Schweitzer
This TalkMotivation: Understanding mechanisms behind sequential data on networks
Today:Motivate the study of path anomalies
Introduce de Bruijn graph representation of sequential data
Develop tractable null model to measure deviation of path data from expectation
Validate null model in synthetic data + compare with naïve baseline method
Application of methodology to a real system
2
Intuitive Example: Passenger Flight Data
3
4
Itinerary 1
Itinerary 2
…
Itinerary 3
…
…
…
BOS JFK LAX
ATL JFK BTV
BOS JFK BTV
Intuitive Example: Passenger Flight Data
5
x 10
x 1
x 0
Itinerary 1
Itinerary 2
…
Itinerary 3
BOS JFK LAX
ATL JFK BTV
BOS JFK BTV
Intuitive Example: Passenger Flight Data
6
x 10
x 1
x 0
Itinerary 1
Itinerary 2
…
Itinerary 3
BOS JFK LAX
ATL JFK BTV
BOS JFK BTV
Intuitive Example: Passenger Flight Data
Underrepresented?
7
x 10
x 1
x 0
Itinerary 1
Itinerary 2
…
Itinerary 3
BOS JFK LAX
ATL JFK BTV
BOS JFK BTV
Intuitive Example: Passenger Flight Data
Overrepresented?
is significantly overrepresented?
8
Research Question
is significantly underepresented?
x 10
x 1
x 0
Itinerary 1
Itinerary 2
Itinerary 3
BOS JFK LAX
ATL JFK BTV
JFK BTVBOS
Given this pathway dataset, can we determine whether…
is significantly overrepresented?
9
Research Question
is significantly underepresented?
x 10
x 1
x 0
Itinerary 1
Itinerary 2
Itinerary 3
BOS JFK LAX
ATL JFK BTV
JFK BTVBOS
Given this pathway dataset, can we determine whether…
In other words: Which paths are anomalous?
Problem: Path anomaly detection
10
For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.
Problem: Path anomaly detection
11
For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.
When k=2, this corresponds to comparing a random walk with a single step of memory to a memoryless (Markovian) random walk on G.
Toy Example
12
Three Goals:
1. Introduce de Bruijn graphs as representations of sequential data
2. Show how path anomalies emerge in a simple setting
3. Show how path anomalies can be detected through a random walk simulation approach
(Spoiler: Simulation approach is infeasible for real world datasets!)
Toy Example
13
Itinerary 1
Itinerary 2
…
Itinerary 3
…
…
…
BOS JFK LAX
ATL JFK BTV
BOS JFK BTV
Toy Example
14
Pathway 1
Pathway 2
…
Pathway 3
…
…
…
XA C
B DX
XA C
Toy Example: Data
15
XA C
B DX
XA C
…
…
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
…
Toy Example: Data
16
Total number of paths N = |S| = 235
<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
XA C
B DX
XA C
…
…
…
Toy Example: Data to (first-order) graph
17
edge counts
30
XA
205
XB
135
CX
100
DX
XA C
B DX
XA C
…
…
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
…
Total number of paths N = |S| = 235
<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>
Toy Example: Data to (first-order) graph
18
edge counts
30
XA
205
XB
135
CX
100
DX A
BX
C
D
XA C
B DX
XA C
…
…
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
…
Total number of paths N = |S| = 235
<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>
Toy Example: Data to 2nd order de Bruijn graph
19
Path counts
30
XA X B
105
CX
100
DXC BA D
XA C
B DX
XA C
…
…
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
…
Total number of paths N = |S| = 235
<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>
Toy Example: Data to 2nd order de Bruijn graph
20
XA C
B DX
XA C
…
…
…
p1<latexit sha1_base64="32piOB4m0s8TFvrrE//lPH6QV48=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq0nfq/bLFbfmLkDWiZeTCuRo9stfvUHM0gilYYJq3fXcxAQZVYYzgbNSL9WYUDahI+xaKmmEOsgWx87IhVUGZBgrW9KQhfp7IqOR1tMotJ0RNWO96s3F/7xuaoY3QcZlkhqUbLlomApiYjL/nAy4QmbE1BLKFLe3EjamijJj8ynZELzVl9dJq17zrmr1h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDuLCN9g==</latexit>
p2<latexit sha1_base64="7XYqazwj7X+GHe48AUv8nkePnnc=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRNp60CPRi0dMLJBAQ7bLFjZsd5vdrQlp+A1ePGiMV3+QN/+NC/Sg4EsmeXlvJjPzopQzbVz32yltbG5t75R3K3v7B4dH1eOTtpaZIjQgkkvVjbCmnAkaGGY47aaK4iTitBNN7uZ+54kqzaR4NNOUhgkeCRYzgo2Vgno68OuDas1tuAugdeIVpAYFWoPqV38oSZZQYQjHWvc8NzVhjpVhhNNZpZ9pmmIywSPas1TghOowXxw7QxdWGaJYKlvCoIX6eyLHidbTJLKdCTZjverNxf+8XmbimzBnIs0MFWS5KM44MhLNP0dDpigxfGoJJorZWxEZY4WJsflUbAje6svrpO03vKuG/+DXmrdFHGU4g3O4BA+uoQn30IIACDB4hld4c4Tz4rw7H8vWklPMnMIfOJ8/ujWN9w==</latexit>
p3<latexit sha1_base64="Z6AU2fag2QeM9tg6J4H+2085QRQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsotCTaWGLiIQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS1nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAx3ByO/cfn1BpHssHM00wiOhI8iFn1FjJryb9RrVfrrg1dwGyTrycVCBHq1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8DjIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxO2vWa16jV7+uV5k0eRxHO4BwuwYMraMIdtMAHBhye4RXeHOm8OO/Ox7K14OQzp/AHzucPu7qN+A==</latexit>
S = {<latexit sha1_base64="jR9Gx72XqiPCM3LHdS3SOjkze5A=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstDEh2lhiFCQBQvaWPdiwt3fuzpmQC3/CxkJjbP07dv4bF7hCwZdM8vLeTGbm+bEUBl3328mtrK6tb+Q3C1vbO7t7xf2DpokSzXiDRTLSLZ8aLoXiDRQoeSvWnIa+5A/+6HrqPzxxbUSk7nEc825IB0oEglG0Uqt8Ry5JJy33iiW34s5AlomXkRJkqPeKX51+xJKQK2SSGtP23Bi7KdUomOSTQicxPKZsRAe8bamiITfddHbvhJxYpU+CSNtSSGbq74mUhsaMQ992hhSHZtGbiv957QSDi24qVJwgV2y+KEgkwYhMnyd9oTlDObaEMi3srYQNqaYMbUQFG4K3+PIyaVYr3lmlelst1a6yOPJwBMdwCh6cQw1uoA4NYCDhGV7hzXl0Xpx352PemnOymUP4A+fzByf6jrs=</latexit>
pN<latexit sha1_base64="66lAMOV8OdCg9bnkdFLAB9TDMvI=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsJ+3SzSbsboQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWjm6nfekKleSwfzThBP6IDyUPOqLHSQ9K765XKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n81OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwis/4zJJDUo2XxSmgpiYTP8mfa6QGTG2hDLF7a2EDamizNh0ijYEb/HlZdKsVrzzSvX+oly7zuMowDGcwBl4cAk1uIU6NIDBAJ7hFd4c4bw4787HvHXFyWeO4A+czx8swo25</latexit>
…
Total number of paths N = |S| = 235
<latexit sha1_base64="4z0D7GzwoVnotHiFLJOXG6+EZnY=">AAACDnicbVDLSgMxFM34rPU16tJNsC24KjMtohuh6MaVVOwL2qFk0kwbmkmGJCOUab/Ajb/ixoUibl27829M21lo64GEwzn3cu89fsSo0o7zba2srq1vbGa2sts7u3v79sFhQ4lYYlLHggnZ8pEijHJS11Qz0ookQaHPSNMfXk/95gORigpe06OIeCHqcxpQjLSRunahJjRikMehTyQUAYyQHiiYv4WXcHw/Nn+pfJbv2jmn6MwAl4mbkhxIUe3aX52ewHFIuMYMKdV2nUh7CZKaYkYm2U6sSITwEPVJ21COQqK8ZHbOBBaM0oOBkOZxDWfq744EhUqNQt9UhtNtF72p+J/XjnVw4SWUR7EmHM8HBTGDWsBpNrBHJcGajQxBWFKzK8QDJBHWJsGsCcFdPHmZNEpFt1ws3ZVylas0jgw4BifgFLjgHFTADaiCOsDgETyDV/BmPVkv1rv1MS9dsdKeI/AH1ucPKe6Zlg==</latexit>
30
Path counts
XA X B
105
CX
100
DXC BA D
XB
XA
DX
C30
X
100
21
Toy Example: Path Anomalies via Simulations
For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.
Toy Example: Path Anomalies via Simulations
22
Simulate many random walk datasetsCompute expected frequency of each pathwayand subtract from observed value
A
BX
C
D XB
XA
DX
C30 - E[ ]
XXA C
100 - E[ ]
For a given pathway dataset S, graph G, and integer k, identify paths of length k through G whose observed frequencies in S deviate significantly from random expectation in a (k-1)-order model of paths through G.
Toy Example: Path Anomalies via Simulations
23
Path counts
30
XA X
105
B CX
100
DXC BA D +12.57
+11.69
-12.57
-11.69
XA C
XA D
DXB
B CX
Deviation from Randomized Paths
XB
XA
DX
C30 - E[ ]
XXA C
100 - E[ ]
Challenges
24
Path Anomaly Detection: Challenges
Detecting path anomalies via simulations à computationally intensive
Result is expected value, no concrete notion of significance
Alternative: detect path anomalies analytically by developing a tractable null model
25
Null Model: Challenges
26
Traditional null models (e.g. configuration model) cannot be applied directly
Null Model: Challenges
27
Traditional null models (e.g. configuration model) cannot be applied directly
Edges between higher-order nodes can not be randomized by stub-matching
XBXA DCCX✅ 🚫XA
Null Model: Challenges
28
Traditional null models (e.g. configuration model) cannot be applied directly
Edges between higher-order nodes can not be randomized by stub-matching
Need to randomize edge weight distribution in de Bruijn graph models, since connectivity structure is fixed by 1st-order topology
XBXA DCCX✅ 🚫XA
HYPA: Efficient Detection of Path Anomalies
29
Generalized Hypergeometric Ensemble
30Casiraghi et al. (2016, 2018)
Generalized Hypergeometric Ensemble
31
Generalization of the configuration model to weighted, directed networks.
Generalized Hypergeometric Ensemble
32
Generalization of the configuration model to weighted, directed networks.
Fixes the expected weight of every node, rather than the exact degree sequence.
Generalized Hypergeometric Ensemble
33
Generalization of the configuration model to weighted, directed networks.
Fixes the expected weight of every node, rather than the exact degree sequence.
Urn Problem Intuition:
○ Each pair of nodes that can possibly connect is assigned a color
DXXA
DXXB
XB CX
X CX
Urn
A
Generalized Hypergeometric Ensemble
34
Generalization of the configuration model to weighted, directed networks.
Fixes the expected weight of every node, rather than the exact degree sequence.
Urn Problem Intuition:
○ Each pair of nodes that can possibly connect is assigned a color
○ Add balls, where Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
DXXA
DXXB
XB CX
X CX
Urn
A 30x135
205x100
205x135
30x100
Generalized Hypergeometric Ensemble
35
Generalization of the configuration model to weighted, directed networks.
Fixes the expected weight of every node, rather than the exact degree sequence.
Urn Problem Intuition:
○ Each pair of nodes that can possibly connect is assigned a color
○ Add balls, where
○ Draw m edges to sample a network from the urn, where
Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
DXXA
DXXB
XB CX
X CX
Urn
A 30x135
205x100
205x135
30x100
Hypergeometric Ensemble
36
Pr(Xvw = f(v, w)) /✓
Kvw
f(v, w)
◆✓Pij Kij �Kvw
m� f(v, w)
◆
<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>
Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
Hypergeometric Ensemble
37
{Pr(Xvw = f(v, w)) /
✓Kvw
f(v, w)
◆✓Pij Kij �Kvw
m� f(v, w)
◆
<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>
Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
Probability of observing frequency f(v,w) given the entire weighted network structure.
Hypergeometric Ensemble
38
{
Pr(Xvw = f(v, w)) /✓
Kvw
f(v, w)
◆✓Pij Kij �Kvw
m� f(v, w)
◆
<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>
Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
Number of ways to pick f(v,w) multiedges from Kvw possible.
Hypergeometric Ensemble
39
{
Kij = kouti kinj<latexit sha1_base64="m9/JBMqpKEyarcEuT1mV5zfrvcQ=">AAACFHicbVC7SgNBFJ31GeNr1dJmMAiCEHajoI0QtBFsIpgHJOsyO5lNJpl9MHNXDMt+hI2/YmOhiK2FnX/j5FHExAMDh3Pu5c45Xiy4Asv6MRYWl5ZXVnNr+fWNza1tc2e3pqJEUlalkYhkwyOKCR6yKnAQrBFLRgJPsLrXvxr69QcmFY/COxjEzAlIJ+Q+pwS05JrHN27Ke9lF/z5tAXsEGaRRAlnmcjwl8VArPdcsWEVrBDxP7AkpoAkqrvndakc0CVgIVBClmrYVg5MSCZwKluVbiWIxoX3SYU1NQxIw5aSjUBk+1Eob+5HULwQ8Uqc3UhIoNQg8PRkQ6KpZbyj+5zUT8M8dHSlOgIV0fMhPBIYIDxvCbS4ZBTHQhFDJ9V8x7RJJKOge87oEezbyPKmVivZJsXR7WihfTurIoX10gI6Qjc5QGV2jCqoiip7QC3pD78az8Wp8GJ/j0QVjsrOH/sD4+gWFdaBh</latexit>
Pr(Xvw = f(v, w)) /✓
Kvw
f(v, w)
◆✓Pij Kij �Kvw
m� f(v, w)
◆
<latexit sha1_base64="NCkO6i3fdAOka+krxJwECKzqz/A=">AAACTXicbVFLSwMxGMzWd31VPXoJFqEFLbsq6EUQvQheKlgtdMuSzWY1No8lybaUZf+gF8Gb/8KLB0XE9KH4GggZZuYjySRMGNXGdR+dwsTk1PTM7FxxfmFxabm0snqpZaowaWDJpGqGSBNGBWkYahhpJoogHjJyFXZOBv5VlyhNpbgw/YS0OboWNKYYGSsFpaiuKs0g6/ZyeAjjSnerV61CP1EyMRL6UUiF5NnZMJBnAx/2qvmX4euUBxm9zeHZaNuGn1lu+Wc+KJXdmjsE/Eu8MSmDMepB6cGPJE45EQYzpHXLcxPTzpAyFDOSF/1UkwThDromLUsF4kS3s2EbOdy0SgRjqewSBg7V7xMZ4lr3eWiTHJkb/dsbiP95rdTEB+2MiiQ1RODRQXHKoG1qUC2MqCLYsL4lCCtq7wrxDVIIG/sBRVuC9/vJf8nlTs3bre2c75WPjsd1zIJ1sAEqwAP74AicgjpoAAzuwBN4Aa/OvfPsvDnvo2jBGc+sgR8ozHwAAFyzMw==</latexit>
Number of ways to pick everything else.
Putting it all together: HYPA scores
40
Putting it all together: HYPA scores
41
If “close” to 0, then the pathway is underrepresented.
If “close” to 1, then pathway is overrepresented.
Putting it all together: HYPA scores
42
XB
XA
DX
CX0.99
0.97XB
XA
DX
C30 X
100
If “close” to 0, then the pathway is underrepresented.
If “close” to 1, then pathway is overrepresented.
+12.57
+11.69
-12.57
-11.69
XA C
XA D
DXB
B CX
Deviation from Randomized Paths
Observed Data Simulation Result HYPA Scores
Validation
43
Noise via Path Randomization
44
Synthetic Anomalies: Setup
45
Synthetic Anomalies: Setup
46
A
BX
C
D XB
XA
DX
CX
Start with an arbitrary first order topology, then construct the kth-order de Bruijn graph
k = 2
Synthetic Anomalies: Setup
47
A
BX
C
D XB
XA
DX
CX
Randomly choose some edges to label over-represented
k = 2
Synthetic Anomalies: Setup
48
A
BX
C
D XB
XA
DX
CX30
100
Assign heterogeneous weights based on label
k = 2
Synthetic Anomalies: Setup
49
A
BX
C
D XB
XA
DX
CX
Generate paths via random walks on this model, then evaluate ability of HYPA to detect injected anomalies (binary classifier).
30
100
k = 2
Synthetic Anomalies: ROC Example
50
Synthetic Anomalies: ROC Example
51
Injected Anomalies of Length 2
Synthetic Anomalies: AUC Results
52
Application to Flight Data
53
Airlines
54
5% sample of all US domestic flights in 2018
Data from: http://www.transtats.bts.gov/Tables.asp?DB_ID=125
AirlinesHypotheses:
1. Return flights should be over-represented, since people most often travel round trip.
55
Airlines: Return trips are over-represented
56
Fraction of over-represented return/non-return flights for various discrimination thresholds.
Airlines: Return trips are over-represented
57
Fraction of over-represented return/non-return flights for various discrimination thresholds.
AirlinesHypotheses:
1. Return flights should be over-represented, since people most often travel round trip.
2. Over-represented non-return flights are due to regional/national hubs, since people need to fly from small airports → regional hub → large airport.
58
Airlines: Trip Balance
59
AirlinesHypotheses:
1. Return flights should be over-represented, since people most often travel round trip.
2. Over-represented non-return flights are due to regional/national hubs, since people need to fly from small airports → regional hub → large airport.
3. “Efficient” paths are more likely to be over-represented.
60
Airlines: Efficiency
61
References
tlarock.github.iohttps://arxiv.org/abs/1905.10580
Thanks!
62
Scholtes, Ingo. "When is a network a network?: Multi-order graphical model selection in pathways and temporal networks." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.
Casiraghi et al. "Generalized hypergeometric ensembles: Statistical hypothesis testing in complex networks." arXiv:1607.02441 (2016).
Casiraghi & Nanumyan. “Generalised hypergeometric ensembles of random graphs: the configuration model as an urn problem.” arXiv:1810.06495 (2018)
R. TransStat. Origin and destination survey database. http://www.transtats.bts.gov/Tables.asp?DB_ID=125, 2018.
63
https://arxiv.org/abs/1905.10580
Definition: kth-order de Bruijn Graph
64
For a given graph G = (V,E) and positive integer k we define a k-th order
De Bruijn graph of paths in G as a graph Gk= (V k, Ek
), where (i) each node
~v := ~v0v1 . . . vk�1 2 V kis a path of length k � 1 in G, and (ii) (~v, ~w) 2 Ek
i↵
vi+1 = wi for i = 0, . . . , k � 2.
<latexit sha1_base64="nPo7/5HIQzxKitn0S+xtps/GiRw=">AAADPnicbVJNb9NAELXNVylfLRy5jEiQbOFGdjiAkCJVhVKOReqXVKfRZj1OFtu71u46URX5l3HhN3DjyIUDCHHlyK6TStAyl32aN/PezO6Oq4IpHUVfXO/a9Rs3b63dXr9z9979BxubD4+UqCXFQyoKIU/GRGHBOB5qpgs8qSSSclzg8Th/bfnjGUrFBD/Q5xUOSzLhLGOUaJMabboHb4UEAhM2Qw4TSaopJP7ewD8Kd4MkAMJTqIRi2vDAuMYJSujmXZgjpJgZV9Oc+HkSbOkpCJka+g3CjqzZhws9kUFF9FSZfqttVZW1bMnu3llu3M7yEHbP8qAbwnyKEsFnASChU+AiRdOWzJAuZs2rwRKMotkoTopUaAWz0SLfipsGEmNglIwBswbW1JoXyCfarmWqLLeaImyX85kxSvwL/bA9503Qiu0uxbLMVBgX9ixuYABzgxqTz8zNJT4bRCEsJwkh3+onQW99tNGJelEbcBXEK9BxVrE/2vicpILWJXJNC6LUaRxVerggUjNaYLOe1AorQnMywVMDOSlRDRft8zfw1GTSdppMcA1t9u+OBSmVOi/HprK073CZs8n/cae1zl4OF4xXtUZOl0ZZXYAWYP8SpEwi1cW5AYRK80co0CmRhGrz4+wlxJdXvgqO+r34ea//vt/Z3lldx5rz2Hni+E7svHC2nXfOvnPoUPej+9X97v7wPnnfvJ/er2Wp5656Hjn/hPf7D9a6/RE=</latexit>
Pseudocode
65
Naïve Baseline ComparisonFrequency-Based Anomaly Detection (FBAD)
Compute mean, µ, and standard deviation, σ, of kth order edge weights
Given scaling factor α, label edges as- Overrepresented if frequency is larger than µ + σα - Underrepresented if frequency is smaller than µ - σα
66
Synthetic Anomalies
67
Real Data
68
Exploring Motifs
69
Case Study: London TubeData:
- Origin → destination statistics between London Tube stations- (origin, destination, #observations)
- Shortest paths between stations- Assume people follow shortest paths
70
London TubeHypothesis:
- People typically use public transportation to travel large geographic distances- Overrepresented pathways should cover larger distances
Test:
- Measure distance between every station- For 2nd order transitions A-B-C, compute distance between nodes A and C- Analyze distributions of distance in over vs. under represented transitions
- Expect to see distribution shifted towards higher values for over-represented transitions
71
London Tube
72
London Tube
73
Median distance between source and destination nodes in under/over represented transitions.
Constructing Ground TruthConstruct ground truth based on the method discussed earlier:
○ Randomize path data using k-1st order random walks
○ Compute kth-order path statistics
○ Repeat m times, noting the frequency of each path
○ Estimate multinomial distribution and its CDF from these statistics
○ If CDF(path) > threshold, label over-represented
74
Tube Data - Ground Truth
75
Computational complexity
76
O(N + |V |2�k1)
<latexit sha1_base64="ab5tZyX/wq/L2KgZL0VZSB19Lxg=">AAACAHicbVDLSsNAFJ3UV62vqAsXbgaLUBFKUgVdFt240gr2AW0aJpNJO3QyCTMToaTd+CtuXCji1s9w5984bbPQ1gMDh3Pu4c49XsyoVJb1beSWlldW1/LrhY3Nre0dc3evIaNEYFLHEYtEy0OSMMpJXVHFSCsWBIUeI01vcD3xm49ESBrxBzWMiROiHqcBxUhpyTUP7kq38BSOGqNupcN0zkeu3R2cuGbRKltTwEViZ6QIMtRc86vjRzgJCVeYISnbthUrJ0VCUczIuNBJJIkRHqAeaWvKUUikk04PGMNjrfgwiIR+XMGp+juRolDKYejpyRCpvpz3JuJ/XjtRwaWTUh4ninA8WxQkDKoITtqAPhUEKzbUBGFB9V8h7iOBsNKdFXQJ9vzJi6RRKdtn5cr9ebF6ldWRB4fgCJSADS5AFdyAGqgDDMbgGbyCN+PJeDHejY/ZaM7IMvvgD4zPHzTNlOI=</latexit>
Scalability
77
Top Related