Giga-Mining

Giga-Mining Corinna Cortes and Daryl Pregibon AT&T Labs-Research Presented by: Kevin R. Gee 28 October 1999


Transcript of Giga-Mining

Page 1: Giga-Mining

Giga-Mining

Corinna Cortes and Daryl Pregibon

AT&T Labs-Research

Presented by:

Kevin R. Gee

28 October 1999

Page 2: Giga-Mining

Case Study

Statistical modeling

Processing of multi-GB databases

Data warehousing

Prediction and classification

User interfaces

Page 3: Giga-Mining

Three Goals

Perform meaningful mining on multiple gigabytes of data every day

Classify telephone numbers (TNs) as business or residential (pattern deviation, etc.)

Maintain operational data for each phone number.

Page 4: Giga-Mining

Quantity of data

1997: 275 million phone calls per weekday, 76 billion in total for the year

65M unique TNs per weekday

350M unique TNs over a 40-day period

“Universe list”: the set of all TNs observed on the network, each with a 7-byte profile

Page 5: Giga-Mining

Contents of each profile

Inactivity -- number of days since the TN was last used

Minutes of use -- average daily minutes the TN is observed on the network

Frequency -- estimated number of days between observations of a TN

“Bizocity” -- business-like behavior of the TN

Stored for inbound/outbound, toll/toll-free

Page 6: Giga-Mining

Calculation of each variable

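Inactivity: set to 0 if the TN is observed that day; otherwise incremented by 1.

Other variables are calculated via an exponentially weighted average:

$X(\mathrm{TN})_{\mathrm{new}} = \lambda\, X(\mathrm{TN})_{\mathrm{today}} + (1 - \lambda)\, X(\mathrm{TN})_{\mathrm{old}}, \quad 0 < \lambda < 1$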
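The two update rules on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the `Profile` class, the `daily_update` function, the value of `LAMBDA`, and the choice to treat an unobserved day as zero minutes are all assumptions made for the example.

```python
from dataclasses import dataclass

LAMBDA = 0.15  # assumed aging factor; the slides only require 0 < lambda < 1


@dataclass
class Profile:
    """Hypothetical two-field profile for one TN."""
    inactivity: int = 0   # days since the TN was last observed
    minutes: float = 0.0  # smoothed average daily minutes of use


def daily_update(profile, observed, minutes_today=0.0, lam=LAMBDA):
    """Apply one day's update to a profile, in place."""
    # Inactivity: reset when the TN is seen, otherwise count one more day.
    if observed:
        profile.inactivity = 0
    else:
        profile.inactivity += 1
    # X_new = lam * X_today + (1 - lam) * X_old
    # Assumption: an unobserved day contributes minutes_today = 0.
    profile.minutes = lam * minutes_today + (1 - lam) * profile.minutes
    return profile


p = Profile()
daily_update(p, observed=True, minutes_today=20.0)  # seen: inactivity stays 0
daily_update(p, observed=False)                     # not seen: inactivity is 1
```

Only the previous value and today's value are needed for each update, so no per-day history has to be kept for a TN.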

Page 7: Giga-Mining

Aging factor λ

Yields an estimate that is a weighted sum of all previous daily values, with weights that decrease smoothly over time.

The most recent day’s activity is weighted more heavily than activity from 2 weeks ago.

Weight of a call k days ago is $w_k = \lambda (1 - \lambda)^k$

Old data is “aged out” as new data is “blended in”
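As a quick check of the weighting claim, the sketch below compares the recursive update with the explicit weighted sum using weights λ(1-λ)^k. The function names, the value of `lam`, and the daily values are made up for illustration.

```python
def ewma(values, lam, start=0.0):
    """Run the recursive update over daily values, oldest first."""
    x = start
    for v in values:
        x = lam * v + (1 - lam) * x
    return x


def weights(n_days, lam):
    """w_k = lam * (1 - lam)**k for k = 0 .. n_days - 1 (k = 0 is today)."""
    return [lam * (1 - lam) ** k for k in range(n_days)]


lam = 0.1                             # assumed aging factor
daily = [12.0, 0.0, 30.0, 5.0, 8.0]   # hypothetical daily minutes, oldest first

w = weights(len(daily), lam)
# Pair weight k with the value observed k days ago (newest value gets w[0]).
explicit = sum(wk * v for wk, v in zip(w, reversed(daily)))
recursive = ewma(daily, lam)

# The two agree up to floating-point rounding when the starting value is 0.
print(abs(explicit - recursive) < 1e-12)
```

Over n days the weights sum to 1 - (1-λ)^n, which approaches 1 as n grows, so the contribution of old data shrinks geometrically.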

Page 8: Giga-Mining

“Bizocity”

Addresses whether a TN is residential or business.

Residences and businesses are handled differently in customer care, billing, collections, fraud detection, etc.

Page 9: Giga-Mining

“Bizocity” continued

AT&T has confirmed residential/business status for 30% of 350M TNs.

The data is incomplete because of lack of communication with local companies, additional lines, and out-of-date information.

Behavioral estimate is generated by observing behavior of all 350M TNs, generating a bizocity score, and combining it with previous days’ totals.

Page 10: Giga-Mining

Generating “Bizocity”

When a call completes, data such as originating TN, dialed TN, connect time, and call duration is recorded (note that callers are not identified, just phone numbers).

Those with known biz/res status are flagged, and training sets are generated.

Noise and outliers are usually eliminated by the volume of data.

Page 11: Giga-Mining

Generating “Bizocity” -- examples

Example: Long calls originating at night are usually residential, not business.

Example: Residential calls peak in the evening; business calls peak between 9 am and 5 pm.

Example: Business calls are generally shorter, call other businesses, or call 800 services.

Page 12: Giga-Mining

Processed every 24 hours

Provides better aggregate data for each TN

Reduces I/O by 75%

Have to store all call details and sort them.

Each call is reduced to a 32-byte binary record, resulting in 8GB daily.

Sorting takes 30 min. (3GB RAM, 1 processor)
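A fixed-width record of this kind might look like the sketch below. The slides give only the record size, so the field layout, the `pack_call` and `unpack_call` helpers, and the sample calls are hypothetical.

```python
import struct

# Hypothetical layout (big-endian): originating TN, dialed TN, connect time
# (epoch seconds), duration (seconds), then 8 unused pad bytes to reach 32.
RECORD = struct.Struct(">QQII8x")


def pack_call(orig_tn, dialed_tn, connect_time, duration):
    """Encode one call as a 32-byte binary record."""
    return RECORD.pack(orig_tn, dialed_tn, connect_time, duration)


def unpack_call(raw):
    """Decode a record into (orig_tn, dialed_tn, connect_time, duration)."""
    return RECORD.unpack(raw)


# Made-up calls using fictional 555 numbers.
calls = [
    pack_call(9085550123, 8005550199, 909500000, 45),
    pack_call(2125550100, 9085550123, 909503600, 1800),
    pack_call(9085550123, 2125550100, 909400000, 60),
]

# Big-endian unsigned fields compare bytewise in numeric order, so sorting
# the raw records groups all calls from the same originating TN together.
calls.sort()

# Rough daily volume at 275 million calls per weekday.
daily_bytes = 275_000_000 * RECORD.size  # 8,800,000,000 bytes
```

With records grouped by TN, each profile can be read, updated, and written once per day instead of once per call.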

Page 13: Giga-Mining

Processing -- continued

4D data cube is generated

Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7x6x5x3)

Have previously developed logistic regression models for scoring TNs based on each profile (to estimate “Bizocity”)

$\mathrm{Biz}(\mathrm{TN})_{\mathrm{new}} = \lambda\, \mathrm{Biz}(\mathrm{TN})_{\mathrm{today}} + (1 - \lambda)\, \mathrm{Biz}(\mathrm{TN})_{\mathrm{old}}, \quad 0 < \lambda < 1$
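The cube and the scoring step could be sketched roughly as below. Only the 7x6x5x3 shape, the use of logistic regression, and the blending formula come from the slides; the bin boundaries, the function names, the use of cell proportions as features, and the model weights are assumptions.

```python
import math

DAYS, TIME_BINS, DUR_BINS, STATUSES = 7, 6, 5, 3
STATUS_INDEX = {"biz": 0, "res": 1, "800": 2}
DURATION_EDGES = [30, 120, 600, 1800]  # seconds; assumed bin boundaries


def cell_index(day_of_week, hour, duration_s, status):
    """Map one call to its flat index in the 7x6x5x3 cube."""
    t = hour // 4  # six 4-hour time-of-day bins (assumed)
    d = sum(duration_s >= edge for edge in DURATION_EDGES)  # 0..4
    s = STATUS_INDEX[status]
    return ((day_of_week * TIME_BINS + t) * DUR_BINS + d) * STATUSES + s


def build_cube(calls):
    """Count one TN's calls for the day into a flat list of 630 cells."""
    cube = [0] * (DAYS * TIME_BINS * DUR_BINS * STATUSES)
    for call in calls:
        cube[cell_index(*call)] += 1
    return cube


def bizocity_today(cube, weights, bias=0.0):
    """Logistic score on cell proportions; None if the TN made no calls."""
    total = sum(cube)
    if total == 0:
        return None
    z = bias + sum(w * c / total for w, c in zip(weights, cube))
    return 1.0 / (1.0 + math.exp(-z))


def blend(biz_old, biz_today, lam):
    """Biz_new = lam * Biz_today + (1 - lam) * Biz_old."""
    return lam * biz_today + (1 - lam) * biz_old


# Two made-up calls: (day_of_week, hour, duration in seconds, status).
cube = build_cube([(0, 10, 45, "biz"), (0, 21, 2400, "res")])
```

A real model would use weights fitted on the TNs with known status; here `weights` is just a placeholder list of 630 numbers.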

Page 14: Giga-Mining

Processing -- continued

Training set is used to classify TNs with unknown status based on probabilities

Inactive TNs are not updated

“Bizocity” scores for unknown TNs are generated using probabilities

Page 15: Giga-Mining

Accuracy

Accuracy of status prediction is 75%

Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones)

Page 16: Giga-Mining

Data Structures

Exploit the “exchange” concept (the first 6 digits of a TN form an exchange)

Only about 150,000 of 1M exchanges are in use

All 10,000 TNs for each exchange are stored sequentially, whether used or not

Each data structure is 2GB for each variable (lower bound is 1.5GB)
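One way to picture this layout is a flat byte array indexed by exchange slot and line number. The sketch assumes one byte per TN per variable, which is consistent with the 1.5GB lower bound (150,000 exchanges x 10,000 TNs) but is not stated on the slide; the `ExchangeStore` class and its methods are hypothetical.

```python
class ExchangeStore:
    """One byte per TN for a single profile variable, grouped by exchange."""

    TNS_PER_EXCHANGE = 10_000

    def __init__(self, exchanges_in_use):
        # Map each in-use 6-digit exchange to a dense slot number.
        self.slot = {ex: i for i, ex in enumerate(sorted(exchanges_in_use))}
        # All 10,000 lines of every in-use exchange are allocated, used or not.
        self.data = bytearray(len(self.slot) * self.TNS_PER_EXCHANGE)

    def _offset(self, tn):
        # Split a 10-digit TN into its exchange (first 6 digits) and line.
        exchange, line = divmod(tn, self.TNS_PER_EXCHANGE)
        return self.slot[exchange] * self.TNS_PER_EXCHANGE + line

    def get(self, tn):
        return self.data[self._offset(tn)]

    def set(self, tn, value):
        self.data[self._offset(tn)] = value  # value must fit in one byte


store = ExchangeStore([908555, 212555])  # fictional exchanges
store.set(9085550123, 42)

# Size if all 150,000 in-use exchanges were allocated at one byte per TN.
full_size = 150_000 * ExchangeStore.TNS_PER_EXCHANGE  # 1,500,000,000 bytes
```

Because the position of a TN is computed from its digits, the TN itself never needs to be stored, and a lookup is a single array access.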

Page 17: Giga-Mining

Interface

Variety of visualization tools (start at top, drill-down)

Web interface with password protection

Images are computed on the fly

C code directly computes images in GIF format

Page 18: Giga-Mining

Toll Fraud Detection

Same methodology, but event-driven

Only have to track about 15M TNs.

Profiles are about 512 bytes each (7.5GB)