Giga-Mining
Corinna Cortes and Daryl Pregibon
AT&T Labs-Research
Presented by:
Kevin R. Gee
28 October 1999
Case Study
Statistical modeling
Processing of multi-GB databases
Data warehousing
Prediction and classification
User interfaces
Three Goals
Perform meaningful mining on multiple gigabytes of data every day
Classify telephone numbers as business or residential (pattern deviation, etc.)
Maintain operational data for each phone number.
Quantity of data
1997: 275 million phone calls per weekday -- a total of 76 billion for the whole year
65M unique TNs per weekday
350M unique TNs over a 40-day period
“Universe list”: Set of all TNs observed on the network, each with a 7-byte profile
Contents of each profile
Inactivity -- number of days since TN used
Minutes of use -- average daily minutes TN is observed on network
Frequency -- estimated number of days between observing a TN
“Bizocity” -- Business-like behavior of TN
Stored for inbound/outbound, toll/toll-free
Calculation of each variable
Inactivity: Reset to 0 on a day the TN is observed; otherwise incremented by one (Inactivity++).
Other variables are calculated via an exponentially weighted average:
X(TN)_new = λ X(TN)_today + (1-λ) X(TN)_old, 0 < λ < 1
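The two update rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Profile` class, its field names, and the value λ = 0.15 are hypothetical, and the sketch assumes that a day on which the TN is not observed counts as zero minutes (the slides do not say how such days are handled for this variable).

```python
from dataclasses import dataclass

LAMBDA = 0.15  # assumed aging factor; the slides only require 0 < lambda < 1


@dataclass
class Profile:
    """Hypothetical per-TN profile holding two of the tracked variables."""
    inactivity: int = 0          # days since the TN was last observed
    minutes_of_use: float = 0.0  # exponentially weighted average daily minutes


def blend(old: float, today: float, lam: float = LAMBDA) -> float:
    """X_new = lam * X_today + (1 - lam) * X_old."""
    return lam * today + (1.0 - lam) * old


def daily_update(profile: Profile, observed: bool, minutes_today: float = 0.0) -> None:
    """Apply one day's update to a profile in place."""
    if observed:
        profile.inactivity = 0
    else:
        profile.inactivity += 1
        minutes_today = 0.0  # assumption: an unobserved day counts as zero minutes
    profile.minutes_of_use = blend(profile.minutes_of_use, minutes_today)
```

For example, starting from an empty profile, a day with 20 minutes of use moves the average to 0.15 × 20 = 3.0, and a following unobserved day sets Inactivity to 1 and ages the average to 0.85 × 3.0 = 2.55.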
Aging factor λ
Provides an estimate that is a weighted sum of all previous daily values, with weights that decrease smoothly over time.
The most recent day’s activity is weighted more heavily than activity from 2 weeks ago.
Weight of a call k days ago is w_k = λ(1-λ)^k
Old data is “aged out” as new data is “blended in”
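As a quick check on the weight formula, the following Python sketch compares the recursive daily update with an explicit weighted sum over past days. The function names and sample values are illustrative assumptions, and the starting value is taken to be 0 so that the two forms agree exactly.

```python
def ewma(values, lam, start=0.0):
    """Run the recursive update over daily values, oldest first."""
    x = start
    for v in values:
        x = lam * v + (1.0 - lam) * x
    return x


def weight(k, lam):
    """Weight given to the value observed k days ago (k = 0 is today)."""
    return lam * (1.0 - lam) ** k


def weighted_sum(values, lam):
    """Explicit weighted sum over the same daily values, oldest first."""
    newest_first = list(reversed(values))
    return sum(weight(k, lam) * v for k, v in enumerate(newest_first))


daily_minutes = [10, 0, 5, 20, 15]  # made-up sample, oldest day first
lam = 0.2
recursive = ewma(daily_minutes, lam)
explicit = weighted_sum(daily_minutes, lam)
```

Because the weights form a geometric series, they sum to 1 - (1-λ)^n over n days, which approaches 1 as the history grows. This is why the estimate stays on the same scale as the daily values.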
“Bizocity”
Concerns whether a TN is residential or business.
Different operations for residences and businesses for customer care, billing, collections, fraud detection, etc.
“Bizocity” continued
AT&T has confirmed residential/business status for 30% of 350M TNs.
Incomplete data is due to lack of communication with local companies, additional lines, and out-of-date information.
Behavioral estimate is generated by observing behavior of all 350M TNs, generating a bizocity score, and combining it with previous days’ totals.
Generating “Bizocity”
When a call completes, data such as originating TN, dialed TN, connect time, and call duration are recorded (note that callers are not identified, just phone numbers).
Those with known biz/res status are flagged, and training sets are generated.
Noise and outliers are usually eliminated by the volume of data.
Generating “Bizocity” -- examples
Example: Long calls originating at night are usually residential, not business.
Example: Residential calls peak in the evening; business calls peak between 9am and 5pm.
Example: Business calls are generally shorter, call other businesses, or call 800 services.
Processed every 24 hours
Provides better aggregate data for each TN
Reduces I/O by 75%
Have to store all call details and sort them.
Each call is reduced to a 32-byte binary record, resulting in 8GB daily.
Sorting takes 30 min. (3GB RAM, 1 processor)
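To make the record size concrete, here is a minimal Python sketch of packing a call into a fixed 32-byte binary record and sorting a batch. The field layout, the sort key (originating TN), and the function names are assumptions for illustration; the slides give only the record size, not its contents or the sort order.

```python
import struct

# Assumed layout (not given in the slides): two 64-bit TNs, 32-bit connect
# time, 32-bit duration in seconds, 32-bit flags, 4 padding bytes = 32 bytes.
CALL_RECORD = struct.Struct("<QQIII4x")
assert CALL_RECORD.size == 32


def pack_call(orig_tn, dialed_tn, connect_time, duration_s, flags=0):
    """Reduce one call to a 32-byte binary record."""
    return CALL_RECORD.pack(orig_tn, dialed_tn, connect_time, duration_s, flags)


def sort_by_originating_tn(records):
    """Sort packed records so that all calls from one TN are adjacent."""
    return sorted(records, key=lambda r: CALL_RECORD.unpack(r)[0])


def daily_volume_bytes(calls_per_day, record_size=CALL_RECORD.size):
    """Raw storage needed for one day's call records."""
    return calls_per_day * record_size
```

At 275 million calls per weekday, `daily_volume_bytes(275_000_000)` gives 8.8 billion bytes, in line with the roughly 8GB per day quoted above.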
Processing -- continued
4D data cube is generated
Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7x6x5x3)
Have previously developed logistic regression models for scoring TNs based on each profile (to estimate “Bizocity”)
Biz(TN)_new = λ Biz(TN)_today + (1-λ) Biz(TN)_old, 0 < λ < 1
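The cube, the scoring step, and the blending step can be sketched together in Python. Much of this is assumed rather than taken from the slides: the 4-hour time-of-day bins, the duration bin edges, the reading of the status dimension as the status of the dialed number, the use of one TN's daily cell counts as regression inputs, and all function names are hypothetical. Only the 7x6x5x3 shape, the use of logistic regression, and the update formula come from the source.

```python
import math

DAYS, TIME_BINS, DURATION_BINS, STATUSES = 7, 6, 5, 3  # 7x6x5x3 = 630 cells
STATUS_INDEX = {"biz": 0, "res": 1, "800": 2}
DURATION_EDGES_S = (60, 300, 900, 3600)  # assumed bin edges, in seconds


def cell_index(day_of_week, hour, duration_s, dialed_status):
    """Map one call to a flat index into the 4-D cube."""
    t = hour // 4  # assumed: six 4-hour time-of-day bins
    d = sum(duration_s >= edge for edge in DURATION_EDGES_S)  # 0..4
    s = STATUS_INDEX[dialed_status]
    return ((day_of_week * TIME_BINS + t) * DURATION_BINS + d) * STATUSES + s


def build_cube(calls):
    """Count calls into a flat list of 630 cells."""
    cube = [0] * (DAYS * TIME_BINS * DURATION_BINS * STATUSES)
    for call in calls:
        cube[cell_index(*call)] += 1
    return cube


def score_today(cube, weights, intercept=0.0):
    """Logistic-regression score in (0, 1) from the cube's cell counts."""
    z = intercept + sum(w * c for w, c in zip(weights, cube))
    return 1.0 / (1.0 + math.exp(-z))


def update_bizocity(old, today, lam=0.15):
    """Biz_new = lam * Biz_today + (1 - lam) * Biz_old."""
    return lam * today + (1.0 - lam) * old
```

With all regression weights set to zero the score is 0.5, that is, no evidence either way. Fitted weights would push it toward 1 for business-like calling patterns and toward 0 for residential ones.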
Processing -- continued
Training set is used to classify TNs with unknown status based on probabilities
Inactive TNs are not updated
“Bizocity” scores for unknown TNs are generated using probabilities
Accuracy
Accuracy of prediction of status is 75%
Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones, etc.)
Data Structures
Exploit the “exchange” concept (1st 6 digits form an exchange)
Only about 150,000 of 1M exchanges are in use
All 10,000 TNs for each exchange are stored sequentially, whether used or not
Each data structure is 2GB for each variable (lower bound is 1.5GB)
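The exchange-based layout can be illustrated with a small Python sketch. It assumes one byte per TN for each variable and a dictionary that maps each in-use exchange to its block; neither detail is stated in the slides, and the class and method names are hypothetical.

```python
TNS_PER_EXCHANGE = 10_000  # last four digits: 0000-9999


class ExchangeTable:
    """One byte per TN for a single profile variable, grouped by exchange."""

    def __init__(self, active_exchanges):
        # Map each in-use 6-digit exchange to the position of its block.
        self.block_of = {ex: i for i, ex in enumerate(sorted(active_exchanges))}
        # Every TN in an active exchange gets a slot, whether used or not.
        self.data = bytearray(len(self.block_of) * TNS_PER_EXCHANGE)

    def _offset(self, tn):
        exchange, line = divmod(tn, TNS_PER_EXCHANGE)
        return self.block_of[exchange] * TNS_PER_EXCHANGE + line

    def get(self, tn):
        return self.data[self._offset(tn)]

    def set(self, tn, value):
        self.data[self._offset(tn)] = value  # value must fit in one byte (0-255)


def table_size_bytes(num_exchanges, bytes_per_tn=1):
    """Storage for one variable across the given number of exchanges."""
    return num_exchanges * TNS_PER_EXCHANGE * bytes_per_tn
```

Under the one-byte assumption, `table_size_bytes(150_000)` is 1.5 billion bytes, which matches the 1.5GB lower bound quoted above. Storing unused TNs wastes some space but lets a lookup compute the position directly from the number, with no per-TN index.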
Interface
Variety of visualization tools (start at top, drill-down)
Web interface with password protection
Images are computed on the fly
C code directly computes images in GIF format
Toll Fraud Detection
Same methodology, but event-driven
Only have to track about 15M TNs.
Profiles are about 512 bytes each (7.5GB)