Modeling massive dynamic graphs Chris Volinsky AT&T Research Statistics Research Department Along...
-
Upload
barrie-norman-york -
Category
Documents
-
view
213 -
download
1
Transcript of Modeling massive dynamic graphs Chris Volinsky AT&T Research Statistics Research Department Along...
Modeling massive Modeling massive dynamic graphsdynamic graphs
Chris Volinsky
AT&T Research
Statistics Research Department
Along with:
Deepak Agarwal, Bob Bell, Corinna Cortes, Shawndra Hill, Daryl Pregibon,
OutlineOutline• Defining a Dynamic Graph, and our objectives• A motivating example: Repetitive fraud in telecommunications• Our approach: representation and approximation of dynamic graphs• Revisiting Repetitive Fraud • Parameter setting and applications to other domains• Fraud revisited – applying our methods• Other applications, conclusions, further work
Summary
• Analyzing large dynamic graphs of transactional data is hard!• By storing and analyzing a massive graph as an indexed set of small
local graphs, we have developed a generic framework for dynamic graph representation, which facilitates fast and accurate analysis.
Defining a Dynamic Graph, and our objectives
Defining Dynamic Graphs Defining Dynamic Graphs
• Dynamic Graphs represent transactional data – – Credit card data– Web connectivity data– Web logs– Telecommunications network traffic– Online auction data
Transactional data can be represented as a directed graph…
Kathleen
Chris Daryl
JenFred
Corinna
John ZachDebbyAnne
Defining Dynamic GraphsDefining Dynamic Graphs
• Dynamic Graphs– Nodes represent transactors– Edges are directed transactions– All edges have a time stamp– All edges have a weight (?)– May contain
• Other attributes on nodes
• Other attributes on edges
Analysis of dynamic graphsAnalysis of dynamic graphsWhy is it hard?
• Scale– Often tens or hundreds of millions of nodes and edges – Doesn’t fit in main memory
• Dynamic – Large numbers of nodes coming and going
continuously– Accounting for temporal component of changing
graphs is a challenge
• Heterogeneous nodes– Superactive nodes dominate visualization and analysis– Zipf’s law holds: most users have few connections
Overview: Our ObjectivesOverview: Our Objectives• Define a representation for a dynamic graph
– What is the graph at time t: Gt
– How does one account for addition and attrition of nodes
• Graphs can be used for– Global properties - learning about the graph properties
• Clustering coefficient
• Power law coefficient
• Graph depth
– Local represenation – learning about individual nodes in the graph
• Entities – the local subgraph of a node of interest
• Fraud signatures, anomaly (intrusion) detection, matching, etc.
• Our Goal: Methods for entity based analysis in large dynamic graphs
A motivating example: Repetitive fraud in telecommunications
Motivating Example: Repetitive Motivating Example: Repetitive FraudFraud• Lots of people cant pay their bill, but they want phone service
anyway:
Name Ted Hanley
Address 14 Pearl Dr
St Peters, MN
Balance $208.00
Disconnected 2/19/04 (nonpayment)
Name Debra Handley
Address 14 Pearl Dr
St Peters, MN
Balance $142.00
Connected 2/22/04
Name Elizabeth Harmon
Address APT 1045
4301 ST JOHN RD
SCOTTSDALE, AZ
Balance $149.00
Disconnected 2/19/04 (nonpayment)
Name Elizabeth Harmon
Address 180 N 40TH PL
APT 40
PHOENIX, AZ
Balance $72.00
Connected 1/31/04
Motivating Example: Repetitive FraudMotivating Example: Repetitive FraudHow can we identify that it is the same person behind
both accounts?
Old Account:
67855232344 New Account:
4215554597
Old Date: 2003-02-25 New Date: 2003-02-13
Old Name: DAVID ATKINS New Name: DAVID WATKINS
Old Address:
10 NIGHT WAY APT 114
New Address:
10 HATSWORTH DR
Old City: FAYVILLE New City: BONDALE
Old State: AL New State: AL
Old Zip: 302141798 New Zip: 300021530
Old II Code: 5512127609901 New II Code: 5312074639501
Old Balance: 284.62 New Balance: 5.83
Motivating Example: ChallengesMotivating Example: Challenges• This is a problem of record linkage and graph matching,
but because of obfuscation, we can only count on the graph matching!
• But the problem is huge:
• Q: How can we do efficient matching to put a dent in this problem?
• A: Efficient representation of entities
Connect pool
TRestrict pool
10 K/day10 K/day300K/300K/monthmonth
5 K/day5 K/day150 150
K/monthK/month45 billion comparisons
Motivating Example: Our Motivating Example: Our datadata– Our graph is large
• 350M Telephone numbers (TNs) currently active on our Long Distance network, 300M calls/day
– Our graph is dynamic 4 Million TNs appear per
week
4 Million TNs disappear per
week
Motivating Example: Our dataMotivating Example: Our data
Our graph is sparse:
For one year of long distance data:
Median = 34Median = 34
95% = 17195% = 171
• Our Approach to Dynamic Graphs– Definition– Representation– Approximation
Our Approach: Defining dynamic Our Approach: Defining dynamic graphs graphs • Q: for transactional data, what does the graph at time t (Gt)mean?
- let gt be the collection of nodes and edges during the time period t
• We could use: tt gG Too narrow!
• We could use the union of all time periods:
i
t
itt ggggG
1
21
Too broad!
• We could use a moving average of the most recent time periods:
i
t
ntitntntt ggggG
1
Too many!
Our Approach: Defining dynamic Our Approach: Defining dynamic graphsgraphs
We adopt an Exponentially Weighted Moving Average (EWMA):
t1tt gθ)(1θGG
Alternatively, this is:
igωgωgωgωG i
t
1itt2211t
θ)(1θω where iti
• Advantages:- recent data has most influence- only one most recent graph need be stored
i.e. today’s graph is defined recursively as a convex combination of yesterday’s graph and today’s data
Through time, edge weights decay with decay rate
Our Approach: Defining dynamic Our Approach: Defining dynamic graphsgraphs
closer to 1• calls decay slower• more historical data included• smoother
closer to 0• faster decay• recent calls count more• more power to detect changes• less smooth
Selecting Selecting
= 1/(1-n) means weight reduces to 1/e times its original weight in n days
Our Approach: RepresentationOur Approach: Representation• Since we are interested in entities, we represent the graph as a union of
entity graphs
• These entity graphs are the atomic units of analysis, a signature of the behavior of the node
• As it turns out, storing hundreds of millions of small graphs is much more efficient than storing one massive graph. We use Hancock, a language developed at AT&T for signatures, which allows for easy indexed storage of
• In short, we are approximating a hugh graph with many little graphs, which are easily accessible (via indexed storage) for analysis.
2222222222 100.32222222222 100.31111111111 90.11111111111 90.13213232423 27.03213232423 27.09098765453 11.39098765453 11.388764573268876457326 5.4 5.42122121212 3.02122121212 3.09908989898 0.99908989898 0.98887878787 0.18887878787 0.1
Our Approach: RepresentationOur Approach: Representation
Update the graph by updating all of the atomic units daily – so any time we access the data we have the most recent representation.
1111111111 20.01111111111 20.02122121212 10.02122121212 10.09991119999 5.09991119999 5.0
2222222222 100.32222222222 100.31111111111 90.11111111111 90.13213232423 27.03213232423 27.09098765453 11.39098765453 11.388764573268876457326 5.4 5.42122121212 3.02122121212 3.09908989898 0.99908989898 0.98887878787 0.18887878787 0.1
++ ==1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.79991119999 0.59991119999 0.5 3990898989 0.83990898989 0.88887878787 0.098887878787 0.09
Yesterday’s graphYesterday’s graph Today’s dataToday’s data Today’s Today’s graphgraph
Our Approach: ApproximationOur Approach: Approximation• We also implore two types of approximation of the graph,
by pruning.
• Local pruning of edges – designate a maximal in and out degree (k) for each node, and assign an overflow bin
• Global pruning of edges – overall threshold () below which edges are removed from the graph
1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.7OtherOther 1.4 1.4
1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.79991119999 0.5 9991119999 0.5 3990898989 0.83990898989 0.88887878787 0.098887878787 0.09
==
Removes stale edges
Reduces effect of supernodes
Our Approach: ApproximationOur Approach: Approximation• Defending k
– Most entities have the vast majority of their weight in a fraction of their nodes
Our Approach: Parameter SelectionOur Approach: Parameter Selection• We have now defined a representation of a dynamic graph
by three parameters: controls the decay of edges and edge weights global pruning parameter k – local pruning parameter
• For a given application, we choose the parameters by optimizing predictive performance, selecting the parameters which optimize a loss function
– Two loss functions we have used:• Weighted Dice• Hellinger Distance
Our Approach: Parameter SelectionOur Approach: Parameter Selection• Let A and B be two entities.
• Weighted Dice:
• Hellinger Distance:
• For each value– Set to be a low tolerance value
– For a range of k, optimize – Look at the plot to select parameters
jA
BABAj
jp
jpjpIBAWD
)(1
))()((),(
)(
)()(),(BAj
BA jpjpBAHD
Our Approach: Parameter SelectionOur Approach: Parameter Selection
Our Approach: SummaryOur Approach: Summary• This method allows us, for all 350 million phone numbers
we see on the network to have an up-to-date representation of the entity associated with each number. These entities are stored in an indexed data base for easy storage and retrieval
• Parameters are set to maximize the self-similarity of a node with itself through time, to allow us to match entities.
Fraud Revisited: Applying our methods
Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods
• Fraudsters don’t get caught, so they go elsewhere
• Can we find them, based on their calling patterns?
• Intuition: fraudster may change network identity, but calling pattern will be similar
• Find overlap in calling pattern between recent connected lines and lines shut down for fraud.
Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods
• Repetitive Fraud Process
Connect pool
TRestrict pool
10 K/day10 K/day300K/300K/monthmonth
5 K/day5 K/day150 150
K/monthK/month
• Anchor on a few restrict dates in the pool
• Compare to all connects in the connect pool
• For all of those that have at least one overlap, collect more information
Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods• For each potential matched pair
– Calculate a matching score, based on the characteristics of the overlap:
– Incorporate other information, including • Name, address
• Payment history
• Tenure
Restrict
TN-1
TN-2
TN-3
TN-4
TN-5
Connect
Connect
Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods• Results:
– We identify 50-100 of these cases a day– 95% match rate– 85% block rate
– Credited with saving AT&T $5M in reduced costs and uncollectables in 2002-2003
– By far the most reliable matching criteria is the entity based matching
Other applications, conclusions, further Other applications, conclusions, further workwork• We have a representation of a dynamic graph which can be
applied to any problem where entity modeling over time is of interest
• Other fraud: GBA
• Language models
• Web pages
• Terrorism
• Viral Marketing– Targeting– Understanding
Other slidesOther slides
Matching AlgorithmMatching Algorithm
• What cases will we present to the reps? • A combination of:
– COI Overlap measures• At least two, and strength determined by uniqueness of overlap
TNs
– Name/address overlap• Edit distance no more than 50% of the longest name or address
– $$ owed• Most interested in the ones that will generate the most $$
• 500-1000 cases a day become 100-150 that we present to the reps
Motivating Example: Repetitive FraudMotivating Example: Repetitive Fraud
• When we catch a fraudster, we rarely catch the person, we simply shut down the line
• They will likely move on to another attempt at defrauding us, from a different network location
• Idea: record linkage - network identity has changed, but network behavior is the same
• We can use network behavior to indicate that the new line has the same “owner” as an old line
COI Signatures to COICOI Signatures to COI• To construct a COI from a COI signature:
– Often the signature contains things we don’t want:• Businesses• High weight nodes
– Often the signature doesn’t contain things we do want:• Local calls• Other carrier calls
• To combat this, create a COI by:– Recursively expanding the COI signature– Adding edges– Pruning edges
here’s an example…
COI COI signaturesignature
me
other
other
Extended Extended COICOI
me
other
other
Enhanced Enhanced COICOI
me
other
other
Pruned COIPruned COI
me
other
other
A likely case of the same A likely case of the same fraudster showing up as a new fraudster showing up as a new numbernumber
Pink nodes exist in both COI
Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods
Where:
wao = weight of edge from a to o
wob = weight of edge from o to b
wo = sum weight of edges to o
dao, dob are the graph distances from a and b to o
obaoo o
obao
ddw
ww 1b)overlap(a,
overlap}in {
• Calculate the “informative overlap” score:
ZA O
Bwobwao
wo