Transcript of CS294-1 Behavioral Data Mining, Lecture 1 (Spring 2013)

  • CS294-1 Behavioral Data Mining

    Prof: John Canny

    GSI: Huasha Zhao

  • A brief history of behavioral data

  • 1997

  • Pagerank: The web as a behavioral dataset

  • DB size = 50 billion sites

    Google server farms

    2 million machines (est ?)
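    As a pointer to what the PageRank computation looks like, here is a minimal power-iteration sketch; the toy 3-page graph, damping factor 0.85, and iteration count are illustrative choices, not from the lecture.

    // Toy PageRank by power iteration. Graph, damping factor, and
    // iteration count are illustrative, not from the lecture.
    public class PageRankSketch {
      public static void main(String[] args) {
        // outLinks[i] = pages that page i links to (3-page toy graph)
        int[][] outLinks = { {1, 2}, {2}, {0} };
        int n = outLinks.length;
        double d = 0.85;                          // damping factor
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);
        for (int iter = 0; iter < 50; iter++) {
          double[] next = new double[n];
          java.util.Arrays.fill(next, (1 - d) / n);
          for (int i = 0; i < n; i++)
            for (int j : outLinks[i])             // spread i's score over its links
              next[j] += d * pr[i] / outLinks[i].length;
          pr = next;
        }
        System.out.println(java.util.Arrays.toString(pr));
      }
    }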

  • 1998 sponsored search

    Overture

    2002

  • Sponsored search

    Google's revenue last year was around $28 bn from marketing, 97% of the company's revenue.

    Sponsored search uses an auction: a pure competition for marketers trying to win access to consumers.

    In other words, a competition between models of consumers: their likelihood of responding to the ad, and the right bid for the item.

    There are around 20 billion search requests a month, and perhaps a trillion events of history between search providers.

    Features that matter: source URL, time of day, geo, demographics, and user history. Increasingly real-time.
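    To make the "competition between models" concrete, here is a hedged sketch of a single-slot keyword auction: rank advertisers by bid times estimated click probability, and charge a generalized second price. The advertiser names, bids, and CTR estimates below are hypothetical, and real systems are far more elaborate.

    // Hypothetical single-slot keyword auction: rank by bid * pCTR,
    // charge the runner-up's score divided by the winner's pCTR
    // (a generalized second-price rule). All numbers are made up.
    public class AuctionSketch {
      public static void main(String[] args) {
        String[] ads  = { "A", "B", "C" };
        double[] bid  = { 2.00, 1.50, 3.00 };   // $ per click
        double[] pctr = { 0.05, 0.10, 0.02 };   // modeled click probability
        int win = 0, second = -1;
        for (int i = 1; i < ads.length; i++)
          if (bid[i] * pctr[i] > bid[win] * pctr[win]) win = i;
        for (int i = 0; i < ads.length; i++)
          if (i != win && (second < 0 ||
              bid[i] * pctr[i] > bid[second] * pctr[second])) second = i;
        double price = bid[second] * pctr[second] / pctr[win];
        System.out.printf("winner=%s, price per click=$%.2f%n", ads[win], price);
      }
    }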

  • Sponsored search

  • Bluefin Technologies

    Social Media and Sentiment

  • Bluefin Technologies

    Microtargeting

  • Recommenders

  • Tag data + Implicit tagging

    Based on Luis von Ahn's ESP game

    Flickr hosts around 6 billion images

  • John Kelly, Berkman Center, Harvard

    Computational Social Science

  • Sentiment Analysis

    Alan Mislove et al., Northeastern

    http://www.youtube.com/v/ujcrJZRSGkg&hl=en_US

  • Social Media and Crises

    Virginia earthquake 2011: Mislove et al. 2011

    http://www.youtube.com/watch?v=XJ1EQbmJ_LQ

  • Leskovec, Backstrom, Kleinberg

  • Modeling Spread of Disease from Social Interactions

    Adam Sadilek, Henry Kautz, Vincent Silenzio

    Epidemiology

  • Behavioral Data Landscape

    Dataset sizes are dictated by the number of people: behavioral data is big, but not outrageously so.

    [Figure: size scale from 10 GB through 1 TB to 100 TB, placing the web graph, portal logs, Twitter archives, Spinn3r archives, and the Human Genome]

    NOT in this range: FB w/ images, Youtube, MRI data, climate models, the Web.

  • Course Components

    Algorithms for mining behavioral data:
      Machine learning (natural language)
      Causal analysis
      Network algorithms

    Tools:
      Algorithm level (BIDMach, Mahout)
      Matrix level (BIDMat, Matlab)
      Data movement (Hadoop, Spark)

    Techniques:
      GPU programming
      Performance patterns
      Optimization
      Communication-minimizing algorithms

  • Course Contents

    Introduction, example problems, open research questions

    Basic statistics, bias-variance tradeoffs, Naïve Bayes classifier

    Regression and generalized linear models (GLMs)

    Causal analysis: Matching, propensity scores

    About people - power laws, traits, social network structure

    Performance measurement, significance tests, resampling

    Map-Reduce: Hadoop, Spark

    Machine biology: bandwidth hierarchy, caching, communication

    Scientific computing: communication-minimizing algorithms

    GPU architecture and programming

    Text and metadata indexing and retrieval, XML content-bases

    Natural language processing

  • Course Contents

    Natural language processing

    Classifiers

    Excavating - crawling, web services, processing streams

    Query languages

    Factor Models - Naive Bayes, LDA, GaP

    Optimizers: ADAGRAD, LBFGS, MCMC

    Clustering - k-means, PAM, generative models

    Causal analysis: Targeted Maximum-Likelihood Estimation

    Causal graphical models

    Prediction - kNN, kd-trees, kNC

    Network algorithms - HITS and Pagerank

    Network algorithms - diffusion

    Visualizing large datasets

  • Instructor: John Canny (Ph.D. MIT 1987), research in computer vision and robotics

    Data mining projects:

    MarkLogic co-founder 2003

    BigTribe/Overstock 2004-2005

    Yahoo 2007-2008

    Ebay 2009-2010

    Quantcast 2010-2011

    About 40% of time in industry since 2003.

    Research in behavior change, health care, education, computational social science: using data mining techniques on social media.

  • Desiderata for BDM Toolkits

    Desiderata:

    Interactivity (Scala REPL)

    Natural syntax (+, -, *, etc.) and high-level expressivity

    CPU and GPU acceleration, custom GPU kernels

    Easy threading (Actors)

    Java VM + Java codebase

    C accelerators

    Hence BIDMat (matrix algebra) and BIDMach (stats + machine learning), written in Scala (on github).

  • Toolkit Roadmap

    [Diagram: the toolkit stack by layer]

    Algorithm layer: BIDMach (cf. Weka, MFE, CRAN, Mahout)

    Matrix layer: BIDMat (cf. Jama, Matlab, R)

    CPU and GPU acceleration: Intel MKL, JCUDA

    Cluster mapping: Hadoop, Spark, Hyracks, DryadLINQ, Pig, Hive, Shark

  • GraphLab (CS colloquium talk today, 4pm)

    [Diagram: the GraphLab stack]

    Algorithm layer: GraphLab

    Graph layer and cluster mapping: Powergraph

    CPU acceleration: Intel MKL

  • Innovations

    [Diagram: the same stack (BIDMach / BIDMat / Intel MKL, JCUDA / cluster mapping), annotated with the innovations:]

    Communication-minimizing algorithms

    Optimizers: SGD, MCMC

  • A conflict

    [Diagram: the same stack; the communication-minimizing algorithms and the optimizers (SGD, MCMC) are in tension: communication bloat]

  • Resolved

    [Diagram: the same stack; the tension between communication-minimizing algorithms and the optimizers (SGD, MCMC) is resolved by Butterfly Mixing]
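    Butterfly mixing interleaves model exchanges with gradient updates over a butterfly network, so no step has to wait for a full cluster-wide allreduce. A minimal sketch of just the exchange schedule, assuming a power-of-two node count (the simulation below is mine, not BIDMach code):

    // Butterfly exchange schedule: at step s, node r pairs with r XOR 2^s.
    // After log2(n) averaging steps, every node's value reflects all n inputs.
    // Real butterfly mixing interleaves an SGD update between exchange steps.
    public class ButterflySketch {
      public static void main(String[] args) {
        int n = 8;                                  // nodes, assumed power of two
        double[] model = new double[n];
        for (int r = 0; r < n; r++) model[r] = r;   // each node's local value
        for (int s = 0; (1 << s) < n; s++) {
          double[] next = new double[n];
          for (int r = 0; r < n; r++) {
            int partner = r ^ (1 << s);
            next[r] = 0.5 * (model[r] + model[partner]);  // average with partner
          }
          model = next;
        }
        System.out.println(java.util.Arrays.toString(model)); // all converge to 3.5
      }
    }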

  • Course Mechanics

    The class wiki is up at:

    http://bid.berkeley.edu/cs294-1-spring13/index.php/

    There is no textbook for the course, but several recommended texts are up on the wiki.

    There will be required and recommended readings posted before each class. Those are the text for the class.

    There will be a Piazza forum for Berkeley CS294-1. Please sign up at http://piazza.com


  • TA, Office Hours, Section

    Teaching Assistants

    Huasha Zhao, Par Lab (5th floor)

    Office Hours

    John Canny: TuTh 4-5, in 637 Soda Hall for now

    Sections TBD

  • Prerequisites

    A desire to learn a bunch of new topics and the time to do it.

    Background in:

    Basic math literacy:

    Linear algebra, statistics

    Programming experience, ideally in Java

    Basic understanding of computers and networks

  • Course Enrollment

    Waitlist + current enrollment is larger than the course limit.

    We will process the waitlist by next Monday.

    If you're on the waitlist, please fill out a petition and hand it in to 637 Soda.

  • What is data mining?

  • Data Mining

    The process of exploiting large amounts of data? YES.

    The art of discovering novel structure and value in data? YES! And this is where innovation happens.

    Agile visualization and analysis of raw data

    Quick hypothesis testing with sketch models

    Ability to test against samples of big datasets

    Ability to easily migrate to large scale

    Ability to get performance data on the scaled-up model

  • A Few Words on Performance

    Accuracy: as scientists, we care most about accuracy: drawing the most accurate conclusions we can from the data.

    Resolution: we also want to be able to tell whether putative structure is there at all, i.e. can we resolve it?

    For exploring data we want answers quickly!

    Generally speaking, both accuracy and resolution improve with the amount of data processed. So there are many reasons to improve speed and scale, i.e. algorithm and system performance really matter.

  • Simple Performance Rules

    Performance is a small but very important part of the course. You don't need to be a systems specialist to be able to apply these principles.

    Understanding asymptotics: O(n^2) vs O(n log n).

    Understanding memory: sequential vs. random access, cache hierarchy.

    Having a feel for constants: hashing, string comparison, array access, etc.

    Often, the solution is to use an efficient library (e.g. Intel MKL or Atlas for matrix operations, scientific functions, and random numbers).
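    To get a feel for the sequential-vs-random rule, a rough micro-benchmark like the following works; the array size and the use of wall-clock timing are arbitrary choices of mine.

    // Rough micro-benchmark: sum an array sequentially vs. in a
    // shuffled order. The gap is mostly memory grain and caching.
    import java.util.Random;

    public class AccessPattern {
      public static void main(String[] args) {
        int n = 1 << 24;                      // 16M doubles (~128 MB)
        double[] a = new double[n];
        int[] perm = new int[n];
        Random rng = new Random(42);
        for (int i = 0; i < n; i++) { a[i] = i; perm[i] = i; }
        for (int i = n - 1; i > 0; i--) {     // Fisher-Yates shuffle
          int j = rng.nextInt(i + 1);
          int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        long t0 = System.nanoTime();
        double s1 = 0;
        for (int i = 0; i < n; i++) s1 += a[i];            // sequential
        long t1 = System.nanoTime();
        double s2 = 0;
        for (int i = 0; i < n; i++) s2 += a[perm[i]];      // random order
        long t2 = System.nanoTime();
        System.out.printf("seq %.0f ms, random %.0f ms (sums %.0f %.0f)%n",
            (t1 - t0) / 1e6, (t2 - t1) / 1e6, s1, s2);
      }
    }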

  • Case Study: Matrix Multiply

    In Java: multiply C = A * B, where A is an m x p matrix and B is a p x n matrix. Each matrix is stored in a 1-D array in column-major order, mapping [i, j] to [i + j*nrows]:

    // Naive triple loop over the column-major 1-D layout.
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        double sum = 0;
        for (int k = 0; k < p; k++) {
          sum += A[i + k*m] * B[k + j*p];
        }
        C[i + j*m] = sum;
      }
    }

  • Case Study: Matrix Multiply

    Performance: Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.

    Matlab performance (i.e. Intel MKL), same machine, same problem: ??

  • Case Study: Matrix Multiply

    Performance: Mahout (1000x1000 multiply, Intel i5): 12 secs, 0.16 Gflops.

    Matlab performance (i.e. Intel MKL), same machine, same problem: 14 msecs, 156 Gflops, about 1000x faster.

    Where does the difference come from?

    Matlab's implementation is multi-threaded: 4x

    Overhead in array access calculations (getQuick(i,j)): 2x

    Memory grain: about 10x

    Memory hierarchy (L1, L2, L3 caches), Intel SSE4 instructions, loop unrolling, hand assembly-coding, ...

  • Case Study: Matrix Multiply

    [Figure: C = A * B with indices i, j, k; column-major order, memory grain]

  • DRAM

    Reading one location really reads a whole row (16 kB for a 4 GB DRAM), so sequential access is much faster than random access.

    [Figure: a 16 kB DRAM row]

  • Fixing Matrix Multiply

    [Figure: C = A * B with indices i, j, k, reordering the traversal to follow column-major order and the memory grain]

  • Case Study: Matrix Multiply

    (Assume C is initialized to zero.)

    // Reordered loops: the inner loop has unit stride in the
    // column-major layout, matching the memory grain.
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < p; k++) {
        double bval = B[k + j*p];
        for (int i = 0; i < m; i++) {
          C[i + j*m] += A[i + k*m] * bval;
        }
      }
    }

    About 10x faster on 1000x1000 matrices (i7 processor). Caveats: a 100x100 multiply is also much faster with Grandma's algorithm! But you can get similar gains with large dense/sparse operations by paying attention to grain.
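    Much of the remaining gap to MKL-class libraries comes from cache blocking: computing C in tiles whose working set fits in cache. A minimal blocked variant of the column-major multiply (the block size 64 is an illustrative guess, not a tuned value):

    // Cache-blocked column-major multiply, C += A*B. Assumes C is zeroed;
    // m, n, p need not be multiples of the block size.
    static void blockedMultiply(double[] A, double[] B, double[] C,
                                int m, int n, int p) {
      final int bs = 64;                       // illustrative block size
      for (int j0 = 0; j0 < n; j0 += bs)
        for (int k0 = 0; k0 < p; k0 += bs)
          for (int i0 = 0; i0 < m; i0 += bs)
            // standard loops, restricted to the current tile
            for (int j = j0; j < Math.min(j0 + bs, n); j++)
              for (int k = k0; k < Math.min(k0 + bs, p); k++) {
                double bval = B[k + j*p];
                for (int i = i0; i < Math.min(i0 + bs, m); i++)
                  C[i + j*m] += A[i + k*m] * bval;
              }
    }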

  • Constants

    Approximate throughput by operation and implementation:

    Operation                   Naive Java       Linear-memory    Blocked, native     GPU (4G)          Naive-to-GPU
                                (Mahout, GL)     Java (Weka)      (Intel MKL, 8C)                       ratio
    Dense matrix algebra        0.1-1 Gflop/s    10 Gflop/s       300 Gflop/s         4 Tflop/s         ~10^5
    Sparse matrix algebra       0.2 Gfev/s       -                4 Gfev/s            160 Gfev/s        ~10^4
    Transcendental functions    0.1-1 Gflop/s    -                5-7 Gflop/s         80-160 Gflop/s    ~10^4
    Random numbers              0.09 GRN/s       -                1 GRN/s             160 GRN/s         ~10^4

  • Benchmarks

    Latent Dirichlet Allocation (1000 dims, 20 million docs):

    System        Type           Nodes x cores   Time       Perf
    Smola et al.  Gibbs sampler  100 x 8         12 hours   20 Gflops
    Powergraph    Gibbs sampler  64 x 16         16 hours?  15 Gflops
    BIDMach       VB             1 x 8 x 4G      30 mins    110 Gflops

  • Benchmarks

    Alternating Least Squares (Netflix synthetic data) on a single node:

    Dimension     20            50           100          1000
    Mahout        0.13 Gflops   0.1 Gflops   *            *
    Powergraph    0.5 Gflops    1.1 Gflops   1 Gflops     **
    BIDMach       5 Gflops      100 Gflops   300 Gflops   500 Gflops

    But the basic ALS algorithm can be improved by an order of magnitude by replacing matrix inversion with iterative solving.
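    On the "iterative solving" point: each ALS update solves a small ridge-regression system M x = b with M = A^T A + lambda*I, and a few conjugate-gradient iterations approximate that solve without forming an inverse. A generic CG sketch (the helper names and the dense double[][] storage are mine):

    // Conjugate gradient for M x = b, M symmetric positive definite
    // (e.g. M = A^T A + lambda*I in an ALS update). A handful of
    // iterations usually suffices, which beats an explicit inversion.
    static double[] conjGrad(double[][] M, double[] b, int iters) {
      int n = b.length;
      double[] x = new double[n], r = b.clone(), p = b.clone();
      double rs = dot(r, r);
      for (int it = 0; it < iters; it++) {
        double[] Mp = matVec(M, p);
        double alpha = rs / dot(p, Mp);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Mp[i]; }
        double rsNew = dot(r, r);
        if (rsNew < 1e-12) break;                  // residual small enough
        for (int i = 0; i < n; i++) p[i] = r[i] + (rsNew / rs) * p[i];
        rs = rsNew;
      }
      return x;
    }
    static double dot(double[] u, double[] v) {
      double s = 0; for (int i = 0; i < u.length; i++) s += u[i] * v[i]; return s;
    }
    static double[] matVec(double[][] M, double[] v) {
      double[] y = new double[v.length];
      for (int i = 0; i < M.length; i++) y[i] = dot(M[i], v);
      return y;
    }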

  • Why Causal Analysis?

    Causal analysis is pervasive in economics, public health, and biology, but so far not in data mining.

    A motivating example from Mechanical Turk:

  • Mechanical Turker Demographics

    [Figure: charts of Mechanical Turker demographics]

  • Causal Analysis

    If we know the Turkers' country of origin, this all makes sense. But if not, we could easily conclude:

    On average, women earn about 10x what men do.

    University education cuts expected income by about 5x.

    * The households that men live in have on average 2x as many people in them as the households that women live in.

    The nation of origin is a confounder in all these relationships. Causal analysis allows us to separate its effects from the genuine influences of these variables.

    * Graph data not provided for this
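    A tiny numeric illustration of the confounding (all numbers hypothetical): suppose rates within each country are identical across genders, but gender and country are correlated; the pooled averages then show a large spurious gender gap.

    // Hypothetical illustration of confounding by country: within each
    // country the rates are identical across gender, yet pooled averages
    // differ because gender and country are correlated. All numbers fake.
    public class ConfounderSketch {
      public static void main(String[] args) {
        // rows are {hourly rate, worker count} per country stratum
        double[][] women = { {8.0, 900}, {2.0, 100} };   // mostly country 1
        double[][] men   = { {8.0, 100}, {2.0, 900} };   // mostly country 2
        System.out.printf("women avg $%.2f, men avg $%.2f%n",
            pooled(women), pooled(men));   // $7.40 vs $2.60: spurious gap
      }
      static double pooled(double[][] g) {
        double sum = 0, cnt = 0;
        for (double[] r : g) { sum += r[0] * r[1]; cnt += r[1]; }
        return sum / cnt;
      }
    }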

  • Why Natural Language?

    Weaknesses of bag-of-words and n-gram features:

    No attachment of actors, sentiments and targets

    No sense of agency: who did what to whom?

    Conflation of voices in a text

    How similar are two texts? Did one derive from the other?

    Generally low-quality as features

    Parsing is one approach to addressing these challenges, but high-quality parsing is extremely slow: 10s of sentences per second (constituency or PCFG parsing).
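    The "who did what to whom" weakness is easy to demonstrate: two sentences with opposite meanings can yield identical bag-of-words features. The example sentences below are mine.

    // Two sentences with opposite agency produce the same bag-of-words.
    import java.util.HashMap;
    import java.util.Map;

    public class BagOfWords {
      static Map<String, Integer> bag(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+"))
          counts.merge(w, 1, Integer::sum);     // count each token
        return counts;
      }
      public static void main(String[] args) {
        Map<String, Integer> a = bag("the dog bit the man");
        Map<String, Integer> b = bag("the man bit the dog");
        System.out.println(a.equals(b));        // true: agency is lost
      }
    }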

  • Natural Language Parsing

    Parsing is a surprisingly good match for GPU computation. High-quality parsing requires around 1 billion PCFG rule evaluations per sentence, which is only practical on CPUs with heavy pruning (99.7%).

    Using new techniques we achieve ~1 rule evaluation per GPU instruction, 3 Teraflops (4G), and 1000 sentences per second without any pruning. This is similar to GPU performance on dense matrix multiply, and close to the device's max speed.
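    To see where the billions of rule evaluations come from, consider the inner loop of a CKY-style inside pass: every span, split point, and binary rule gets scored. The sketch below is a generic Viterbi inside pass, not the lecture's GPU implementation; the grammar encoding and the initialization of length-1 spans from lexical scores are assumed.

    // Skeletal CKY inside pass over a binarized PCFG. For an n-word
    // sentence the split/rule loops perform O(n^3 * R) rule evaluations,
    // which is the work being mapped onto the GPU. Assumed encoding:
    // rules[r] = {parent, leftChild, rightChild}, prob[r] = rule prob;
    // score[start][end][symbol] holds the best inside score, with
    // length-1 spans already filled in from the lexicon.
    static void inside(double[][][] score, int[][] rules, double[] prob, int n) {
      for (int len = 2; len <= n; len++)
        for (int start = 0; start + len <= n; start++) {
          int end = start + len;
          for (int split = start + 1; split < end; split++)
            for (int r = 0; r < rules.length; r++) {
              int a = rules[r][0], b = rules[r][1], c = rules[r][2];
              double s = prob[r] * score[start][split][b] * score[split][end][c];
              if (s > score[start][end][a]) score[start][end][a] = s;  // Viterbi max
            }
        }
    }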

  • Natural Language Alignment


    Tracking stories in social media

    Tracking influence of scientific ideas and theories

    Tracking literary plots and themes

    DNA sequence alignment

  • Summary


    Behavioral data history

    Some example problems

    Course components

    Why single-machine performance matters

    Causal analysis

    Natural language