1 Electronic Commerce 2 High-Level Overview uThe course presents basic algorithmic techniques,...


Electronic Commerce


High-Level Overview

The course presents basic algorithmic techniques considered fundamental to state-of-the-art e-commerce, as perceived by executives at the CEO/CTO/VP Business Development level in B2B and B2C companies:

An introduction to the science behind Google, Amazon, and eBay.


High-Level Overview

Background required:
• Algorithms, and basic principles of computer science.
• Basic mathematical background in algebra and probability.
• Exposure to the Internet.


High-Level Overview

Discovering buyers and sellers
• Buyers finding sellers: search engines
• Sellers finding buyers: data mining, recommender systems

Making a deal
• Auctions

Executing the deal
• Payments, security


Searching for sellers


Finding Sellers

A major use of search engines is finding pages that offer an item for sale.

How do search engines find the right pages?

We’ll study:
• Google’s PageRank technique and other “tricks”
• “Hubs and authorities.”


PageRank

Intuition: solve the recursive equation: “a page is important if important pages link to it.”

In technical terms: compute the principal eigenvector of the stochastic matrix of the Web. A few fixups needed.


Stochastic Matrix of the Web

Enumerate pages. Page i corresponds to row and column i.

M[i,j] = 1/n if page j links to n pages, including page i; 0 if j does not link to i.

Seems backwards, but allows multiplication by M on the left to represent “follow a link.”
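The definition of M can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `stochastic_matrix` and the small `links` dictionary are assumptions made for the example, and each page is assumed to list every target at most once.

```python
from fractions import Fraction

def stochastic_matrix(links, pages):
    """Build M with M[i][j] = 1/n if page j links to n pages, including page i."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for source, targets in links.items():
        j = index[source]                      # column = page we link FROM
        for target in targets:
            M[index[target]][j] = Fraction(1, len(targets))  # row = page we link TO
    return M

# Hypothetical link graph: y links to y and a, a links to y and m, m links to a.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(links, ["y", "a", "m"])
```

Every column of M sums to 1 as long as the corresponding page has at least one out-link, which is what makes M stochastic.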


Example

Suppose page j links to 3 pages, including i. Then M[i,j] = 1/3.


Random Walks on the Web

Suppose v is a vector whose i-th component is the probability that we are at page i at a certain time.

If we follow a link from the current page at random, the probability distribution of the page we are then at is given by the vector Mv.


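The multiplication

\[
\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}
\times
\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}
\]

If the probability that we are at page i is p_i, and p_{ij} is the probability of moving from page j to page i, then after the next step the probability of being at page 1 is the probability that we are at page 1 and stay there, plus the probability that we are at page 2 times the probability of moving from 2 to 1, plus the probability that we are at page 3 times the probability of moving from 3 to 1:

\[
p_{11} p_1 + p_{12} p_2 + p_{13} p_3
\]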


Random Walks 2

Starting from any vector v, the limit M(M(…M(Mv)…)) is the distribution of page visits during a random walk.

Intuition: pages are important in proportion to how often a random walker would visit them.

The math: limiting distribution = principal eigenvector of M = PageRank.


Example: The Web in 1839

[Figure: link graph over Yahoo (y), Amazon (a), and M’soft (m)]

M =
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0


Simulating a Random Walk

Start with the vector v = [1,1,…,1] representing the idea that each Web page is given one unit of “importance.”

Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.

Limit exists, but about 50 iterations is sufficient to estimate final distribution.
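The simulation described above can be sketched as repeated matrix-vector multiplication. This is a minimal sketch in plain Python; the helper names `step` and `simulate` and the small 3-page matrix are assumptions chosen for illustration (the matrix must have columns that sum to 1).

```python
def step(M, v):
    """One random-walk step: return the matrix-vector product Mv."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def simulate(M, iterations=50):
    """Start with one unit of importance per page and apply M repeatedly."""
    v = [1.0] * len(M)
    for _ in range(iterations):
        v = step(M, v)
    return v

# Illustrative 3-page matrix; entry [i][j] is the probability of moving from j to i.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
v = simulate(M)
```

For this matrix the vector settles close to [6/5, 6/5, 3/5], and the total importance stays at 3 because every column sums to 1.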


Example

Equations v = Mv:
y = y/2 + a/2
a = y/2 + m
m = a/2

Successive iterations, starting from [1,1,1]:

y       1     1      5/4    9/8     . . .   6/5
a   =   1     3/2    1      11/8    . . .   6/5
m       1     1/2    3/4    1/2     . . .   3/5


Solving The Equations

These 3 equations in 3 unknowns do not have a unique solution.

Add in the fact that y+a+m=3 to solve.

In Web-sized examples, we cannot solve by Gaussian elimination; we need to use another method (relaxation = iterative solution).


Real-World Problems

Some pages are “dead ends” (have no links out).
• Such a page causes importance to leak out.

Other (groups of) pages are spider traps (all out-links are within the group).
• Eventually spider traps absorb all importance.


Microsoft Becomes Dead End

[Figure: link graph over Yahoo (y), Amazon (a), and M’soft (m); M’soft has no out-links]

M =
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    0


Example

Equations v = Mv:
y = y/2 + a/2
a = y/2
m = a/2

Successive iterations, starting from [1,1,1]:

y       1     1      3/4    5/8    . . .   0
a   =   1     1/2    1/2    3/8    . . .   0
m       1     1/2    1/4    1/4    . . .   0


M’soft Becomes Spider Trap

[Figure: link graph over Yahoo (y), Amazon (a), and M’soft (m); M’soft links only to itself]

M =
         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    1


Example

Equations v = Mv:
y = y/2 + a/2
a = y/2
m = a/2 + m

Successive iterations, starting from [1,1,1]:

y       1     1      3/4    5/8    . . .   0
a   =   1     1/2    1/2    3/8    . . .   0
m       1     3/2    7/4    2      . . .   3


Google Solution to Traps, Etc.

“Tax” each page a fixed percentage at each iteration.
• The fraction that remains after the tax is commonly called the “damping factor.”

Add the same constant to all pages.

Models a random walk in which the surfer has a fixed probability of abandoning the search and going to a random page next.


Ex: Previous with 20% Tax

Equations v = 0.8(Mv) + 0.2:
y = 0.8(y/2 + a/2) + 0.2
a = 0.8(y/2) + 0.2
m = 0.8(a/2 + m) + 0.2

Successive iterations, starting from [1,1,1]:

y       1     1.00    0.84    0.776    . . .   7/11
a   =   1     0.60    0.60    0.536    . . .   5/11
m       1     1.40    1.56    1.688    . . .   21/11
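The taxed iteration v = 0.8(Mv) + 0.2 can be reproduced with a short loop. This is a sketch under stated assumptions: the function name `taxed_step`, the `tax` parameter, and the choice of 200 iterations are illustrative, and the matrix is the spider-trap matrix from the earlier slide.

```python
def taxed_step(M, v, tax=0.2):
    """Return (1 - tax) * Mv + tax, applied to every component."""
    return [(1 - tax) * sum(m * x for m, x in zip(row, v)) + tax for row in M]

# Spider-trap matrix: M'soft (third row/column) links only to itself.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

v = [1.0, 1.0, 1.0]
history = [v]                 # history[k] is the vector after k iterations
for _ in range(200):          # far more than needed; chosen only for illustration
    v = taxed_step(M, v)
    history.append(v)
```

With the tax in place the spider trap no longer absorbs all importance: the vector settles near [7/11, 5/11, 21/11] instead of [0, 0, 3].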


Solving the Equations

We can expect to solve small examples by Gaussian elimination.

Web-sized examples still need to be solved by more complex (relaxation) methods.


Search-Engine Architecture

All search engines, including Google, select pages that have the words of your query.

Give more weight to the word appearing in the title, header, etc.

Inverted indexes speed the discovery of pages with given words.
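An inverted index maps each word to the pages that contain it, so a query only touches the entries for its own words. The sketch below is illustrative only: the function names, the whitespace tokenization, and the three sample pages are assumptions, and it ignores the extra weighting for titles and headers.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def pages_with_all(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical page texts, keyed by page id.
pages = {1: "cheap used books", 2: "used cars for sale", 3: "rare books for sale"}
index = build_inverted_index(pages)
hits = pages_with_all(index, "books sale")
```

Here `hits` contains only page 3, the single page with both query words.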


Google Anti-Spam Devices

Early search engines relied on the words on a page to tell what it is about.
• Led to “tricks” in which pages attracted attention by placing false words in the background color on their page.

Google trusts the words in anchor text.
• Relies on others telling the truth about your page, rather than relying on you.


Use of PageRank

Pages are ordered by many criteria, including the PageRank and the appearance of query words.
• “Important” pages are more likely to be what you want.

PageRank is also an antispam device.
• Creating bogus links to yourself doesn’t help if you are not an important page.


Discussion

• Dealing with incentives
• Several types of links
• Page ranking as voting


Hubs and Authorities

Distinguishing Two Roles for Pages


Hubs and Authorities

Mutually recursive definition:
• A hub links to many authorities;
• An authority is linked to by many hubs.

Authorities turn out to be places where information can be found.
• Example: information about how to use a programming language.

Hubs tell who the authorities are.
• Example: a catalogue of sources about programming languages.


Transition Matrix A

H&A uses a matrix A[i,j] = 1 if page i links to page j, 0 if not.

A’, the transpose of A, is similar to the PageRank matrix M, but A’ has 1’s where M has fractions.


Example

[Figure: link graph over Yahoo (y), Amazon (a), and M’soft (m)]

A =
         y    a    m
    y    1    1    1
    a    1    0    1
    m    0    1    0


Using Matrix A for H&A

Let h and a be vectors measuring the “hubbiness” and authority of each page.

Equations: h = Aa; a = A’ h.
• Hubbiness = scaled sum of authorities of linked pages.
• Authority = scaled sum of hubbiness of linked predecessors.


Consequences of Basic Equations

From h = Aa; a = A’ h we can derive:
• h = AA’ h
• a = A’Aa

Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.

Normalization can be done in different ways: scale the vectors after each iteration, or iterate without scaling and normalize once at the end.


The multiplication

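\[
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\times
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
=
\begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix}
\]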

In order to know the hubbiness of page 2, h2, we need to add up the level of authority of the pages it points to (1 and 3).


The multiplication

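\[
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\times
\begin{pmatrix} h_1 \\ h_2 \\ h_3 \end{pmatrix}
=
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
\]

In order to know the level of authority of page 3, a3, we need to add up the amount of hubbiness of the pages that point to it (1 and 2).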


Example

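\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\qquad
A' = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]

\[
AA' = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}
\qquad
A'A = \begin{pmatrix} 2 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}
\]

Successive iterations of a = A’Aa:

a(yahoo)   =   1    5    24    114    . . .   1+sqrt(3)
a(amazon)  =   1    4    18    84     . . .   2
a(m’soft)  =   1    5    24    114    . . .   1+sqrt(3)

Successive iterations of h = AA’ h:

h(yahoo)   =   1    6    28    132    . . .   1.000
h(amazon)  =   1    4    20    96     . . .   0.732
h(m’soft)  =   1    2    8     36     . . .   0.268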
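The hub and authority iterations for this example can be checked with a short script. This is a minimal sketch in plain Python; the helper names `matvec`, `transpose`, and `scale` are assumptions, and scaling by the largest component is just one of the possible normalizations.

```python
def matvec(M, v):
    """Matrix-vector product Mv."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(column) for column in zip(*M)]

def scale(v):
    """Normalize so that the largest component is 1."""
    top = max(v)
    return [x / top for x in v]

A = [[1, 1, 1],   # Yahoo links to Yahoo, Amazon, M'soft
     [1, 0, 1],   # Amazon links to Yahoo, M'soft
     [0, 1, 0]]   # M'soft links to Amazon
At = transpose(A)

h = [1, 1, 1]
a = [1, 1, 1]
hubs, auths = [h], [a]
for _ in range(3):
    a = matvec(At, matvec(A, a))   # a = A'A a
    h = matvec(A, matvec(At, h))   # h = AA' h
    auths.append(a)
    hubs.append(h)
```

The lists `auths` and `hubs` hold the unscaled vectors after 0, 1, 2, and 3 iterations; applying `scale` to a late iterate gives the limiting proportions.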


Solving the Equations

Solution of even small examples is tricky.

As for PageRank, we need to solve big examples by relaxation.


Approaching potential buyers and algorithmic data mining


Data Mining: Associations

• Frequent itemsets, market baskets
• A-priori algorithm
• Hash-based improvements
• One- or two-pass approximations
• High-correlation mining


Purpose

If people tend to buy A and B together, then a buyer of A is a good target for an advertisement for B.

The same technology has other uses, such as detecting plagiarism and organizing the Web.


The Market-Basket Model

A large set of items, e.g., things sold in a supermarket.

A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.


Support

Simplest question: find sets of items that appear “frequently” in the baskets.

Support for itemset I = the number of baskets containing all items in I.

Given a support threshold s, sets of items that appear in >= s baskets are called frequent itemsets.


Example

Items = {milk, coke, pepsi, beer, juice}.

Support threshold = 3 baskets.

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
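The list of frequent itemsets can be checked by brute force on the eight baskets. This is only an illustrative sketch: the function names are assumptions, and testing every subset of the items is feasible only for tiny examples like this one.

```python
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def frequent_itemsets(baskets, s):
    """Brute force: test every nonempty subset of the item universe."""
    items = sorted(set().union(*baskets))
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            if support(set(combo), baskets) >= s:
                result.append(set(combo))
    return result

frequent = frequent_itemsets(baskets, 3)
```

With threshold 3, `frequent` holds the four frequent single items and the three frequent pairs; no triple reaches the threshold.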


Applications 1

Real market baskets: chain stores keep terabytes of information about what customers buy together.
• Tells how typical customers navigate stores, lets them position tempting items.
• Suggests tie-in “tricks,” e.g., run sale on hamburger and raise the price of ketchup.

High support needed, or no $$’s.


Applications 2

“Baskets” = documents; “items” = words in those documents.
• Lets us find words that appear together unusually frequently, i.e., linked concepts.

“Baskets” = sentences; “items” = documents containing those sentences.
• Items that appear together too often could represent plagiarism.


Applications 3

“Baskets” = Web pages; “items” = linked pages.
• Pairs of pages with many common references may be about the same topic.

“Baskets” = Web pages p; “items” = pages that link to p.
• Pages with many of the same links may be mirrors or about the same topic.


Scale of Problem

WalMart sells 100,000 items and can store hundreds of millions of baskets.

The Web has 100,000,000 words and several billion pages.


Association Rules

If-then rules about the contents of baskets.

{i1, i2,…, ik} -> j means: “if a basket contains all of i1,…,ik, then it is likely to contain j.”

Confidence of this association rule is the probability of j given i1,…,ik.


Example

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

An association rule: {m, b} -> c.

Confidence = 2/4 = 50%.
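The confidence of the rule {m, b} -> c can be computed directly from the baskets. A minimal sketch; the function name `confidence` and the convention of returning 0.0 when no basket contains the left-hand side are assumptions.

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def confidence(lhs, rhs_item, baskets):
    """Fraction of the baskets containing lhs that also contain rhs_item."""
    with_lhs = [basket for basket in baskets if lhs <= basket]
    if not with_lhs:
        return 0.0  # assumed convention when the left-hand side never occurs
    return sum(1 for basket in with_lhs if rhs_item in basket) / len(with_lhs)

conf = confidence({"m", "b"}, "c", baskets)
```

Four baskets (B1, B3, B5, B6) contain both m and b, and two of them (B1, B6) also contain c, so `conf` is 0.5.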


Finding Association Rules

A typical question is “find all association rules with support >= s and confidence >= c.”

The hard part is finding the high-support itemsets.
• Once you have those, checking the confidence of association rules involving those sets is relatively easy.


Computation Model

Typically, data is kept in a “flat file” rather than a database system.
• Stored on disk.
• Stored basket-by-basket.
• Expand baskets into pairs, triples, etc. as you read baskets.

True cost = # of disk I/O’s.
• Count # of passes through the data.


Main-Memory Bottleneck

In many algorithms to find frequent itemsets we need to worry about how main memory is used.
• As we read baskets, we need to count something, e.g., occurrences of pairs.
• The number of different things we can count is limited by main memory.
• Swapping counts in/out is a disaster.


Finding Frequent Pairs

The hardest problem often turns out to be finding the frequent pairs.

We’ll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.


Naïve Algorithm

A simple way to find frequent pairs is:
• Read file once, counting in main memory the occurrences of each pair.
• Expand each basket of n items into its n(n-1)/2 pairs.

Fails if #items-squared exceeds main memory.


A-Priori Algorithm 1

A two-pass approach called a-priori limits the need for main memory.

Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
• Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.


A-Priori Algorithm 2

Pass 1: Read baskets and count in main memory the occurrences of each item.
• Requires only memory proportional to #items.

Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to have occurred at least s times.
• Requires memory proportional to the square of the number of frequent items only.
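The two passes can be sketched as follows. This is an illustration under simplifying assumptions: the baskets live in an in-memory list rather than in a file read twice from disk, and the function name `apriori_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Return the pairs that occur in at least s baskets, with their counts."""
    # Pass 1: count the occurrences of each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
frequent_pairs = apriori_pairs(baskets, 3)
```

On these baskets with s = 3, item p is dropped after Pass 1, so no pair containing p is ever counted in Pass 2.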


Picture of A-Priori

[Figure: main-memory layout in A-Priori. Pass 1 holds the item counts; Pass 2 holds the frequent items and the counts of candidate pairs.]


PCY Algorithm 1

Hash-based improvement to A-Priori.

During Pass 1 of A-priori, most memory is idle.

Use that memory to keep counts of buckets into which pairs of items are hashed.
• Just the count, not the pairs themselves.

Gives extra condition that candidate pairs must satisfy on Pass 2.


Picture of PCY

[Figure: main-memory layout in PCY. Pass 1 holds the item counts and the hash table; Pass 2 holds the frequent items, the bitmap, and the counts of candidate pairs.]


PCY Algorithm 2

PCY Pass 1:
• Count items.
• Hash each pair to a bucket and increment its count by 1.

PCY Pass 2:
• Summarize buckets by a bitmap: 1 = frequent (count >= s); 0 = not.
• Count only those pairs that (a) consist of two frequent items and (b) hash to a frequent bucket.
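The two PCY passes can be sketched like this. It is an illustration only: the baskets are an in-memory list rather than a disk file, and the function name `pcy_pairs`, the bucket count of 11, and the toy hash function are all assumptions.

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(baskets, s, num_buckets=11):
    """Return the pairs that occur in at least s baskets, with their counts."""
    def bucket(pair):
        # Hypothetical deterministic hash; any hash function would do.
        return sum(ord(ch) for item in pair for ch in item) % num_buckets

    # Pass 1: count items, and count how many pairs hash to each bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket(pair)] += 1
    frequent_items = {item for item, n in item_counts.items() if n >= s}
    bitmap = [n >= s for n in bucket_counts]   # one bit per bucket

    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket(pair)]:
                pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
frequent_pairs = pcy_pairs(baskets, 3)
```

The bucket test never discards a truly frequent pair, because a bucket's count is at least the count of any pair hashed into it; the result therefore matches plain A-Priori, with fewer pairs counted on Pass 2.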


Multistage Algorithm

Key idea: After Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.

On the middle pass, fewer pairs contribute to buckets, so there are fewer false drops: buckets that have count >= s, yet no pair that hashes to that bucket has count >= s.


Multistage Picture

[Figure: main-memory layout in Multistage. Pass 1: item counts and first hash table. Pass 2: frequent items, Bitmap 1, and second hash table. Pass 3: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]


Finding Larger Itemsets

We may proceed beyond frequent pairs to find frequent triples, quadruples, . . .

Key a-priori idea: a set of items S can only be frequent if S - {a} is frequent for all a in S.

The k-th pass through the file counts the candidate sets of size k: those whose every immediate subset (subset of size k - 1) is frequent.

Cost is proportional to the maximum size of a frequent itemset.


Low-Support, High-Correlation

Finding rare, but very similar items


Assumptions

1. Number of items allows a small amount of main-memory/item.

2. Too many items to store anything in main-memory for each pair of items.

3. Too many baskets to store anything in main memory for each basket.

4. Data is very sparse: it is rare for an item to be in a basket.


Applications

While marketing may require high support, or there’s no money to be made, mining customer behavior is often based on correlation rather than support.
• Example: Few customers buy Handel’s Watermusick, but of those who do, 20% buy Bach’s Brandenburg Concertos.


Matrix Representation

Columns = items.

Rows = baskets.

Entry (r, c) = 1 if item c is in basket r; = 0 if not.

Assume matrix is almost all 0’s.


In Matrix Form

               m   c   p   b   j
{m,c,b}        1   1   0   1   0
{m,p,b}        1   0   1   1   0
{m,b}          1   0   0   1   0
{c,j}          0   1   0   0   1
{m,p,j}        1   0   1   0   1
{m,c,b,j}      1   1   0   1   1
{c,b,j}        0   1   0   1   1
{c,b}          0   1   0   1   0


Similarity of Columns

Think of a column as the set of rows in which it has 1.

The similarity of columns C1 and C2, sim (C1,C2), is the ratio of the sizes of the intersection and union of C1 and C2. (Jaccard measure)

Goal of finding correlated columns becomes finding similar columns.


Finding similar columns

Non-trivial algorithms (e.g., minhash) are needed because of the storage constraints mentioned before.


Summary

Finding frequent pairs:
• A-priori --> PCY (hashing) --> multistage.

Finding all frequent itemsets

Finding similar pairs:
• Minhash


Clustering


The Problem of Clustering

Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as nearby as possible.


Example

[Figure: scatter plot of points marked x, falling into several visible groups]


Applications

E-Business-related applications of clustering tend to involve very high-dimensional spaces.
• The problem looks deceptively easy in a 2-dimensional, Euclidean space.


Example: Clustering CD’s

Intuitively, music divides into categories, and customers prefer one or a few categories.
• But who’s to say what the categories really are?

Represent a CD by the customers who bought it.

Similar CD’s have similar sets of customers, and vice-versa.


The Space of CD’s

Think of a space with one dimension for each customer.
• Values 0 or 1 only in each dimension.

A CD’s point in this space is (x1,x2,…,xk), where xi = 1 iff the i-th customer bought the CD.
• Compare with the “correlated items” matrix: rows = customers; cols. = CD’s.


Distance Measures

Two kinds of spaces:

Euclidean: points have a location in space, and dist(x,y) = sqrt(sum of squares of differences in each dimension).
• Some alternatives, e.g., Manhattan distance = sum of magnitudes of differences.

Non-Euclidean: there is a distance measure giving dist(x,y), but no “point location.”
• Obeys triangle inequality: d(x,y) <= d(x,z) + d(z,y).
• Also, d(x,x) = 0; d(x,y) >= 0; d(x,y) = d(y,x).


Examples of Euclidean Distances

x = (5,5), y = (9,8); the coordinate differences are 4 and 3.

L2-norm: dist(x,y) = sqrt(4^2 + 3^2) = 5

L1-norm: dist(x,y) = 4 + 3 = 7
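Both distances in this example can be computed with two small functions. A minimal sketch; the function names `euclidean` and `manhattan` are chosen here for illustration.

```python
import math

def euclidean(x, y):
    """L2-norm distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1-norm distance: sum of the magnitudes of the differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
d2 = euclidean(x, y)
d1 = manhattan(x, y)
```

For these two points `d2` is 5.0 and `d1` is 7, matching the 3-4-5 triangle in the example.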


Non-Euclidean Distances

Jaccard measure for binary vectors = ratio of intersection (of components with 1) to union.

Cosine measure = angle between vectors from the origin to the points in question.


Jaccard Measure

Example: p1 = 00111; p2 = 10011.
• Size of intersection = 2; union = 4, J.M. = 1/2.

Need to make a distance function satisfying triangle inequality and other laws.
• dist(p1,p2) = 1 - J.M. works.
• dist(x,x) = 0, etc.
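The Jaccard distance for the two bit strings can be sketched as follows. The function name `jaccard_distance` and the convention of returning 0.0 for two all-zero vectors are assumptions made for this example.

```python
def jaccard_distance(p, q):
    """1 - |intersection| / |union|, taken over the positions that hold a 1."""
    ones_p = {i for i, bit in enumerate(p) if bit == "1"}
    ones_q = {i for i, bit in enumerate(q) if bit == "1"}
    union = ones_p | ones_q
    if not union:
        return 0.0  # assumed convention: two all-zero vectors are identical
    return 1 - len(ones_p & ones_q) / len(union)

d = jaccard_distance("00111", "10011")
```

The two strings share 1's in two positions and have a 1 in four positions overall, so `d` is 1 - 2/4 = 0.5.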


Cosine Measure

Think of a point as a vector from the origin (0,0,…,0) to its location.

Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors.
• Example: p1 = 00111; p2 = 10011.
• p1.p2 = 2; |p1| = |p2| = sqrt(3).
• cos(p1,p2) = 2/3.
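The cosine for this example can be verified numerically. A minimal sketch; the function name `cosine` is an assumption, and it expects neither vector to be all zeros.

```python
import math

def cosine(p, q):
    """Normalized dot product of two vectors (neither may be all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

p1 = [0, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
c = cosine(p1, p2)
angle = math.acos(c)   # the angle itself, in radians, serves as the distance
```

The dot product is 2 and both vectors have length sqrt(3), so `c` equals 2/3 up to floating-point rounding.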


Example

[Figure: diagram of the three-bit binary points 010, 011, 110, 101, 100 and 001.]

Methods of Clustering

Hierarchical: Initially, each point is in a cluster by itself. Repeatedly combine the two “closest” clusters into one.

Centroid-based: Estimate the number of clusters and their centroids. Place points into the closest cluster.

Hierarchical Clustering

Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?

Euclidean case: each cluster has a centroid = average of its points. Measure intercluster distances by distances of centroids.

Example

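Points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)

Centroids (x) as clusters are merged:

• (1.5,1.5) = centroid of (1,2) and (2,1)
• (4.5,0.5) = centroid of (4,1) and (5,0)
• (1,1) = centroid after adding (0,0) to the first cluster
• (4.7,1.3) = centroid after adding (5,3) to the second cluster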
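
The following Python sketch runs centroid-based hierarchical clustering on the six points of this example, stopping at two clusters. The function names and the stopping rule (a fixed target number of clusters) are assumptions for illustration; ties between equally close pairs are broken by whichever pair is found first.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Average of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def hierarchical(points, target_clusters):
    """Start with one cluster per point; repeatedly merge the two clusters
    whose centroids are closest until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

points = [(5, 3), (1, 2), (2, 1), (4, 1), (0, 0), (5, 0)]
for cluster in hierarchical(points, 2):
    print(sorted(cluster), centroid(cluster))
```

On this data the two final clusters have centroids (1,1) and roughly (4.7,1.3), matching the slide.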

Comments

In a typical implementation the number of clusters to be reached is determined in advance (other implementations exist).

And in the Non-Euclidean Case?

The only “locations” we can talk about are the points themselves.

Approach 1: Pick a point from a cluster to be the clustroid = point with minimum maximum distance to other points. Treat the clustroid as if it were the centroid when computing intercluster distances.
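
A small Python sketch of clustroid selection. It works with any distance function; here a Jaccard distance on sets is used as the non-Euclidean measure, and the sample cluster is made up for the illustration.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)

def clustroid(cluster, dist):
    """The point whose maximum distance to the other points is smallest."""
    return min(cluster, key=lambda p: max(dist(p, q) for q in cluster))

# Hypothetical cluster of four sets (not from the slides).
cluster = [frozenset({1, 2, 3}), frozenset({1, 2}),
           frozenset({2, 3}), frozenset({1, 2, 3, 4})]
print(sorted(clustroid(cluster, jaccard_distance)))  # [1, 2, 3]
```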

Example

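[Figure: six points, numbered 1 to 6, in two clusters; each cluster's clustroid is marked, and the intercluster distance is drawn between the two clustroids.]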

Other Approaches

Approach 2: let the intercluster distance be the minimum of the distances between any pair of points, one from each cluster.
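
Approach 2 can be sketched in a few lines of Python. The Hamming distance on bit strings and the two sample clusters are assumptions chosen only to have a concrete non-Euclidean measure.

```python
def hamming(p, q):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(p, q))

def min_pair_distance(c1, c2, dist):
    """Intercluster distance = smallest distance over all pairs, one point per cluster."""
    return min(dist(p, q) for p in c1 for q in c2)

# Hypothetical clusters (not from the slides).
print(min_pair_distance(["00111", "00011"], ["10011", "11000"], hamming))  # 1
```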

k-Means

Assumes Euclidean space.

Starts by picking k, the number of clusters.

Initialize clusters by picking one point per cluster. For instance, pick one point at random, then k-1 other points, each as far away as possible from the previous points.

Populating Clusters

For each point, place it in the cluster whose centroid it is nearest.

After all points are assigned, fix the centroids of the k clusters.

Reassign all points to their closest centroid. This sometimes moves points between clusters.
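
A minimal k-means sketch in Python, assuming the initialization described earlier (one random point, then points as far as possible from those already chosen) and repeating the assign/recompute step until the centroids stop moving. The function names, the `rounds` cap and the `seed` parameter are illustrative choices, not part of the slides.

```python
import math
import random

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def init_centroids(points, k, rng):
    """One random point, then k-1 points each as far as possible from the chosen ones."""
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen

def k_means(points, k, rounds=10, seed=0):
    centroids = init_centroids(points, k, random.Random(seed))
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:  # place each point in the cluster with the nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # no centroid moved: assignments are stable
            break
        centroids = updated
    return clusters, centroids

points = [(5, 3), (1, 2), (2, 1), (4, 1), (0, 0), (5, 0)]
clusters, centroids = k_means(points, k=2)
print(sorted(sorted(c) for c in clusters))
```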

Example

[Figure: eight points, numbered 1 to 8, with two centroids marked x.]

Comments

In a typical implementation, the centroid of each cluster is updated as points are added, and the reassignment of points to other clusters, based on the final locations of the centroids, is applied only once (other implementations exist).

Decision Trees

Example Decision Tree

Married?
  y: Own dog?
    y: Bad
    n: Own home?
      y: Bad
      n: ??
  n: Own home?
    y: Good
    n: Own dog?
      y: Good
      n: Bad

Constructing Decision Trees

Typically, we are given data consisting of a number of records, perhaps representing individuals.

Each record has a value for each of several attributes. Often binary attributes, e.g., “has dog.” Sometimes numeric, e.g. “age”, or discrete, multiway, like “school attended.”

Making a Decision

Records are classified into “good” or “bad.” More generally: some number of outcomes.

The goal is to make a small number of tests involving attributes to decide as best we can whether a record is good or bad.

Using the Decision Tree

Given a record to classify, start at the root, and answer the question at the root for that record. E.g., is the record for a married person?

Move next to the indicated child.

Recursively, apply the DT rooted at that child, until we reach a decision.
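
The lookup procedure can be sketched in Python. The tree below encodes the example decision tree from the earlier slide as nested tuples of the form (attribute, yes-subtree, no-subtree); this encoding and the name `classify` are assumptions for the illustration.

```python
# (attribute, subtree if yes, subtree if no); a plain string is a decision.
TREE = ("married",
        ("dog", "Bad", ("home", "Bad", "??")),
        ("home", "Good", ("dog", "Good", "Bad")))

def classify(tree, record):
    """Answer the question at the root, move to the indicated child, and recurse."""
    if isinstance(tree, str):
        return tree  # reached a decision
    attribute, yes_branch, no_branch = tree
    return classify(yes_branch if record[attribute] else no_branch, record)

print(classify(TREE, {"married": 0, "home": 1, "dog": 0}))  # Good
print(classify(TREE, {"married": 1, "home": 0, "dog": 1}))  # Bad
```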

Training Sets

Decision-tree construction is today considered a type of “machine learning.”

We are given a training set of example records, properly classified, with which to construct our decision tree.

Applications

Credit-card companies and banks develop DT’s to decide whether to grant a card or loan.

Medical apps, e.g., given information about patients, decide which will benefit from a new drug.

Many others.

Example

Here is the data on which our example DT was based:

Married?  Home?  Dog?  Rating
0         1      0     G
0         0      1     G
0         1      1     G
1         0      0     G
1         0      0     B
0         0      0     B
1         0      1     B
1         1      0     B

Selecting Attributes

We can pick an attribute to place at the root by considering how nonrandom are the sets of records that go to each side.

Branches correspond to the value of the chosen attribute.

Entropy: A Measure of Goodness

Consider the pools of records on the “yes” and “no” sides.

If a fraction p of the records on a side are “good,” the entropy of that branch is

-(p log2 p + (1-p) log2 (1-p)) = p log2 (1/p) + (1-p) log2 (1/(1-p))

Pick the attribute that minimizes the maximum entropy of the branches.

Another (more common) alternative: pick an attribute that minimizes the weighted entropy over all branches.
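
The branch entropy as a Python sketch. The convention that the entropy is 0 when p is 0 or 1 (where the formula would otherwise divide by zero) is the usual limit and is stated here as an assumption.

```python
import math

def entropy(p):
    """Entropy of a branch in which a fraction p of the records are "good"."""
    if p in (0.0, 1.0):
        return 0.0  # limit value; avoids dividing by zero
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

print(entropy(0.5))              # 1.0
print(round(entropy(0.25), 2))   # 0.81
```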

Shape of Entropy Function

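[Figure: entropy as a function of p, rising from 0 at p = 0 to a maximum of 1 at p = 1/2 and falling back to 0 at p = 1.]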

Intuition

Entropy 1 = random behavior, no useful information.

Low entropy = significant information. At entropy = 0, we know exactly.

Ideally, we find an attribute such that most of the “good’s” are on one side, and most of the “bad’s” are on the other.

Example

Our Married, Home, Dog, Rating data: 010G, 001G, 011G, 100G, 100B, 000B, 101B, 110B.

Married: 1/4 of Y is G; 1/4 of N is B.

Entropy = ((1/4) log 4 + (3/4) log (4/3)) = .81 on both sides.

The average is 4/8 x 0.81 + 4/8 x 0.81 = 0.81

Example, Continued

010G, 001G, 011G, 100G, 100B, 000B, 101B, 110B.

Dog: 1/3 of Y is B; 2/5 of N is G.

Entropy is (1/3) log 3 + (2/3) log (3/2) = .92 on Y side.

Entropy is (2/5) log (5/2) + (3/5) log (5/3) = .97 on N side.

The average is 3/8 x .92 + 5/8 x .97 = .95, greater than for Married.

Home is similar, so Married “wins.”
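
The comparison of the three attributes can be reproduced with a short Python sketch over the eight training records. The record encoding (three bits followed by the rating letter) follows the slide; the function and variable names are illustrative.

```python
import math

RECORDS = ["010G", "001G", "011G", "100G", "100B", "000B", "101B", "110B"]
ATTRIBUTES = {"married": 0, "home": 1, "dog": 2}  # column of each attribute

def entropy(labels):
    """Entropy of a pool of "G"/"B" labels."""
    if not labels:
        return 0.0
    p = labels.count("G") / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def weighted_entropy(records, column):
    """Average of the branch entropies, weighted by the share of records in each branch."""
    total = 0.0
    for value in "10":
        side = [r[3] for r in records if r[column] == value]
        total += len(side) / len(records) * entropy(side)
    return total

for name, column in ATTRIBUTES.items():
    print(name, round(weighted_entropy(RECORDS, column), 2))
# married 0.81, home 0.95, dog 0.95
```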

Example (Cont.)

Married?  Home?  Dog?  Rating
0         1      0     G
0         0      1     G
0         1      1     G
0         0      0     B

Example (Cont.)

Entropy for home (in the branch of not married) is 0 for Y and 1 for N, so the average entropy is 2/4 x 0 + 2/4 x 1= 0.5.

Entropy for dog (in the branch of not married) is also 0.5. We should pick the attribute with the minimum entropy; since the two are tied, we can choose either one arbitrarily.

Example (Cont.)

Married?  Home?  Dog?  Rating
1         0      0     G
1         0      0     B
1         0      1     B
1         1      0     B

Example (Cont.)

The computation is now applied to the branch of married.

Notice that, in principle, a different attribute may be selected there!

We continue until all input examples (training set) are classified.
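
Putting the steps together, here is a hedged Python sketch of recursive tree construction by minimum weighted entropy. The tree encoding, the "??" leaf for records that cannot be separated, and the tie-breaking rule (first attribute in dictionary order) are assumptions of this sketch, so the tree it builds may pick Home where the slides pick Dog when the two are tied.

```python
import math
from collections import Counter

RECORDS = ["010G", "001G", "011G", "100G", "100B", "000B", "101B", "110B"]
ATTRIBUTES = {"married": 0, "home": 1, "dog": 2}

def entropy(records):
    counts = Counter(r[3] for r in records)
    n = len(records)
    return sum(c / n * math.log2(n / c) for c in counts.values())

def weighted_entropy(records, column):
    sides = [[r for r in records if r[column] == v] for v in "10"]
    return sum(len(s) / len(records) * entropy(s) for s in sides if s)

def build(records, attributes):
    """Return a label, "??", or (attribute, {"1": subtree, "0": subtree})."""
    labels = {r[3] for r in records}
    if len(labels) == 1:
        return labels.pop()          # pure pool: decision reached
    if not attributes:
        return "??"                  # mixed pool, nothing left to test
    best = min(attributes, key=lambda a: weighted_entropy(records, attributes[a]))
    rest = {a: c for a, c in attributes.items() if a != best}
    branches = {}
    for value in "10":
        side = [r for r in records if r[attributes[best]] == value]
        branches[value] = build(side, rest) if side else "??"
    return (best, branches)

def classify(tree, bits):
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[bits[ATTRIBUTES[attribute]]]
    return tree

tree = build(RECORDS, ATTRIBUTES)
print(tree[0])                  # married
print(classify(tree, "011"))    # G
print(classify(tree, "100"))    # ?? (100G and 100B cannot be told apart)
```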

The “Training” Process

Married?
  y (100G, 100B, 101B, 110B): Dog?
    y (101B): Bad
    n (100G, 100B, 110B): Home?
      y (110B): Bad
      n (100B, 100G): ??
  n (010G, 001G, 011G, 000B): Home?
    y (010G, 011G): Good
    n (001G, 000B): Dog?
      y (001G): Good
      n (000B): Bad

Handling Numeric Data

While complicated tests at a node are permissible, e.g., “age = 30 or age <= 50 and age >= 42,” the simplest thing is to pick one breakpoint, and divide records by value <= breakpoint and value > breakpoint.

Rate an attribute and breakpoint by min-max or average entropy of the sides.

Overfitting

A major problem in designing decision trees is that one tends to create too many levels. The number of records reaching a node is small, so significance is lost.

Possible Solutions

1. Limit depth of tree so that each decision is based on a sufficiently large pool of training data.

2. Create several trees independently (needs randomness in choice of attribute). Decision based on vote of D.T.’s.

Selecting Products

Problem Statement

Select a multi-set (a set in which each product carries a count) of products, subject to certain constraints, that maximizes profit.

Essence of Selling

What products do I stock in my stores? Constraint: capital tied up in keeping products in stores (inventory)

What products do I keep in my end-caps (checkout counters)? Constraint: shelf-space

What paid-listings do I show first in a search? Constraint: online real-estate

For a given customer, what’s the best product to advertise? Constraint: online real-estate

Two Scenarios

Focus on aggregate customer behavior

Problem definition

• E.g. what products do I stock in my stores?

No information available about individual customers

Focus on individual customer personalization

General Framework

[Figure: bipartite graph linking persons X1, X2, …, Xi, …, Xn to products P1 (M1), P2 (M2), P3 (M3), …, Pj (Mj), …, Pm (Mm); the edge from Xi to Pj is labeled E(Xi,Pj).]

Xi: person i; Pj: product j.

E(Xi, Pj): expected number of Pj that Xi buys (clicks through, etc.)

Mj: profit margin on Pj

Aggregate User Case

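[Figure: the person nodes X1, X2, …, Xi, …, Xn, each linked to product Pj (Mj) by an edge E(Xi,Pj), are collapsed into a single node X linked to Pj (Mj) by one edge labeled Dj.]

Demand: Dj = Σi E(Xi,Pj)

Collapse all the Xi’s to one node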

Problem Statement

Maximize: Σj kj*Mj, where kj = 0,1,2,… is the turns (number of Pj selected)

Subject to: Σj kj*cj <= C, where cj is the cost associated with Pj, and kj <= Dj (not to exceed demand)

Profit on Pj: $j = kj*Mj

Example

     Margin  Demand  Cost  Margin/Cost
P1   3       12      25    12%
P2   9       3       40    22.5%
P3   10      1       55    18.2%

Constraint: total cost <= 100 (C)

Greedy (pick maximal margin/cost at each step): {P2 x 2} (cost 80, profit 18)

LP: {P3, P2} (cost 95, profit 19)
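
The two answers can be reproduced with a Python sketch. The greedy routine follows the rule stated on the slide; the exact answer is found here by brute-force enumeration of all feasible counts rather than by an LP solver, which is an assumption made to keep the example self-contained. All names are illustrative.

```python
from itertools import product

PRODUCTS = {
    "P1": {"margin": 3, "demand": 12, "cost": 25},
    "P2": {"margin": 9, "demand": 3, "cost": 40},
    "P3": {"margin": 10, "demand": 1, "cost": 55},
}
BUDGET = 100

def profit(counts):
    return sum(k * PRODUCTS[name]["margin"] for name, k in counts.items())

def greedy(products, budget):
    """Repeatedly add the affordable, still-demanded product with the best margin/cost."""
    chosen = {name: 0 for name in products}
    remaining = budget
    while True:
        candidates = [n for n, p in products.items()
                      if p["cost"] <= remaining and chosen[n] < p["demand"]]
        if not candidates:
            return chosen
        best = max(candidates, key=lambda n: products[n]["margin"] / products[n]["cost"])
        chosen[best] += 1
        remaining -= products[best]["cost"]

def exhaustive(products, budget):
    """Try every combination of counts within demand and budget; keep the most profitable."""
    names = list(products)
    ranges = [range(min(products[n]["demand"], budget // products[n]["cost"]) + 1)
              for n in names]
    best = None
    for counts in product(*ranges):
        cost = sum(k * products[n]["cost"] for n, k in zip(names, counts))
        candidate = dict(zip(names, counts))
        if cost <= budget and (best is None or profit(candidate) > profit(best)):
            best = candidate
    return best

print(greedy(PRODUCTS, BUDGET))      # {'P1': 0, 'P2': 2, 'P3': 0}
print(exhaustive(PRODUCTS, BUDGET))  # {'P1': 0, 'P2': 1, 'P3': 1}
```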

Retailers and LP

In general product selection can be set up as a linear/integer program (LP)

Retailers are giant multi-stage LP execution engines!

In real life…

Space of products may be too large
• E.g. Wal-mart has millions of products to consider

All information may not be available

Implementation complexity and performance impact
• Problems too large to run in real-time

Intractability

Buyers do the job of product selection
• More in line with greedy algorithm

Product Selection in Retailers

If all retailers solve the same equations, why don’t they all have the same products?

Product Selection defines Retailer (brand)

Brand constraint: maximize profits in the future

• E.g. Wal-mart brand constraint: select only products that will be bought by 80% of population

• E.g. Gucci brand constraint: select only high-value (margin) products

Example

     Margin  Demand  Cost  Margin/Cost
P1   3       12      25    12%
P2   9       3       40    22.5%
P3   10      1       55    18.2%

Constraint: total cost <= 100 (C)

Wal-mart brand constraint: maximize turns: {P1 x 4}

Gucci brand constraint: no low-margin products: {P3, P2}

Classifying Retailers

[Figure: retailers plotted on Margin and Turns axes: Wal-mart, Costco, JC Penney’s, Gucci and Newco, with the efficient frontier drawn.]

Online Search

Overture Amazon Google

Personalization

Given customer Xi, what products do I recommend to her?

Xi is a loyal customer – purchase history available
• Collaborative-Filtering based Recommender Systems

Xi is a new customer – has done certain operations on the site like search, view products, etc…
• Assortment of techniques

Xi is a new customer – know nothing about her
• Mass merchandizing as in offline retailers, bestsellers,…

In practice, combination of all of the above

Personalization

Offline retail: merchandizers pick products to advertise
• One size fits all – no personalization

Millions of customers, cannot have human merchandizing to each customer

Algorithms that look only at a single customer’s data do not work well

Heuristic: customers help each other
• Algorithms enable this to happen!

Recommender Systems

[Figure: customer Xi linked to products P1, P2, P3, …, Pj, …, Pm by edges labeled E(Xi,P1), …, E(Xi,Pj).]

Purchase history of Xi available

What new products to advertise to Xi?

Given set of products that Xi has bought B = { Pi1, Pi2,… Pin}

Find Pj, such that E(Xi,Pj) is maximum

Recommender Systems

Intuition: ask your friends what products they like

Friends = people who have similar behavior to you

Collaborative Filtering

Representation of Customer and Product data

Neighborhood formation (find my friends)

Recommendation Generation from neighborhood

Representation

M*N customer-product matrix R: rij = 1 if Xi has bought Pj, 0 otherwise

Issues:

Sparsity
• Mostly 0’s. E.g. Amazon.com 2 million books, less than 0.1% is 1

Scalability
• Very large data sets

Authority
• Take into account similarity between products

– E.g. paperback “Cold Mountain” is same as hardcover “Cold Mountain”

Finding Neighbors

Similar to clustering: cluster around a given customer

First compute similarity between customers Xa, Xb

Xa^ -- product vector corresponding to Xa

Cosine measure
• Cosine of angle between vectors gives similarity
• Sim(Xa, Xb) = (Xa^ . Xb^) / (|Xa^| |Xb^|)
• See class on Clustering for examples, more info

Neighborhood

Now compute neighborhood of Xa

Center-based
• Select k closest neighbors to Xa

Centroid-based
• Assume j closest neighbors selected
• Select j+1st neighbor by picking customer closest to centroid of first j neighbors
• Repeat 1..k

Generating Recommendations

From the neighborhood, among products Xa has not bought yet, pick by one of:
• Most frequently occurring
• Weighted average based on similarity
• Based on association rules

See Sarwar et al (sections 1-3) (http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/ec00.pdf)

Example

Columns: Shrek, Star Wars, MIB, Harry Potter, X-files

John 1 1 1

Jane 1 1 1 1

Pete 1 1

Jeff 1 1

Ellen 1 ? 1 ? ?

What new movie should we recommend to Ellen?

Similarity Function

Columns: Shrek, Star Wars, MIB, Harry Potter, X-files, Similarity to Ellen

John 1 1 1 1/sqrt(6) = 0.41

Jane 1 1 1 1 1/sqrt(2) = 0.71

Pete 1 1 1/2

Jeff 1 1 1/2

Ellen 1 ? 1 ? ?

Use Cosine measure for similarity

Neighbors

Columns: Shrek, Star Wars, MIB, Harry Potter, X-files, Similarity to Ellen

John 1 1 1 0.41

Jane 1 1 1 1 0.71

Pete 1 1 0.5

Jeff 1 1 0.5

Ellen 1 ? 1 ? ?

Use Center-based approach and pick 3 closest neighbors

Recommendation

Columns: Shrek, Star Wars, MIB, Harry Potter, X-files

Jane 1 1 1 1

Pete 1 1

Jeff 1 1

Ellen 1 2 1 1 1

Recommend Star Wars
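
The whole pipeline (cosine similarity, center-based neighborhood with k = 3, most-frequent-item recommendation) can be sketched in Python. The column positions of the 1 entries are lost in this transcript, so the purchase matrix below is a hypothetical one, chosen only so that it reproduces the similarity values and neighbor counts shown on the slides; the function names are also illustrative.

```python
import math

MOVIES = ["Shrek", "Star Wars", "MIB", "Harry Potter", "X-files"]

# Hypothetical purchase matrix (assumption: exact cells are not recoverable).
OTHERS = {
    "John": [1, 0, 0, 1, 1],
    "Jane": [1, 1, 1, 1, 0],
    "Pete": [1, 1, 0, 0, 0],
    "Jeff": [0, 0, 1, 0, 1],
}
ELLEN = [1, 0, 1, 0, 0]  # Ellen has bought Shrek and MIB

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def recommend(target, others, k):
    """Center-based neighborhood: the k customers most similar to the target.
    Recommend the not-yet-bought product that occurs most often among them."""
    neighbors = sorted(others, key=lambda name: cosine(target, others[name]),
                       reverse=True)[:k]
    counts = {movie: sum(others[n][j] for n in neighbors)
              for j, movie in enumerate(MOVIES) if target[j] == 0}
    return neighbors, max(counts, key=counts.get)

for name, row in OTHERS.items():
    print(name, round(cosine(ELLEN, row), 2))
print(recommend(ELLEN, OTHERS, k=3))  # (['Jane', 'Pete', 'Jeff'], 'Star Wars')
```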

Implementation Issues

Serious application
• Large data sizes: millions of users * millions of products
• CPU cycles

Scalability key
• Partition the data set and the processing

Real-time vs Batch
• Real-time can lead to poor response times
• Real-time preferable – recommend immediately after a customer purchase!
• Incremental solution key for real time

Summary

Product Selection is the essence of retailing

Personalization is unique to online retailing
• Every customer can have their own store

Most successful personalization techniques get customers to help one another
• Algorithms, like CF, enable this interaction

In real life, algorithms are complex monsters due to scaling issues, repeated tweaking, etc…

Public-Key Cryptosystems

Public-Key Cryptosystems

M – message (treated as a number)
E – Encryption procedure
D – Decryption procedure

Required properties:
1. D(E(M)) = M
2. E and D are easy to compute
3. Revealing E does not reveal an easy way to compute D
4. E(D(M)) = M

Public-Key Cryptosystems

Two users, A(lice) and B(ob)

A and B publicly announce EA, EB respectively.

B sends a private message M to A as EA(M)

A deciphers the message by computing DA(EA(M))

Signature by B on message M to be sent to A:
• B computes S = DB(M) (can add its name, for example, to M)
• B sends EA(S) to A

Public-Key Cryptosystems

Given the signed message S, A can find the original message M by computing EB(S)

B cannot deny sending the message M to A, because no one else could generate S.

A cannot change M to M’ and claim it has been sent, since that would require A to generate the corresponding signature S’ = DB(M’)

RSA

The public key is a pair (e,n) of positive integers.

A message M is treated as an integer between 0 and n-1.

C = E(M) = M^e (mod n)

D(C) = C^d (mod n)

We need to get an appropriate decryption key.

1. Choose n = pq, where p and q are very large random primes.
2. Pick an integer d that is relatively prime to (p-1)(q-1), i.e. satisfies gcd(d, (p-1)(q-1)) = 1.
3. Pick e such that ed = 1 (mod (p-1)(q-1)).
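
A toy Python sketch of the three steps, using deliberately tiny primes (61 and 53) that are far too small for real use; the specific numbers and function names are assumptions for the illustration. It relies on Python 3.8+ for the modular inverse via `pow(d, -1, phi)`.

```python
import math

# Step 1: n = p*q (toy primes; real keys use very large random primes).
p, q = 61, 53
n = p * q
phi = (p - 1) * (q - 1)

# Step 2: pick d relatively prime to (p-1)(q-1).
d = 2753
assert math.gcd(d, phi) == 1

# Step 3: pick e with e*d = 1 (mod (p-1)(q-1)).
e = pow(d, -1, phi)

def encrypt(m):
    """E(M) = M^e mod n"""
    return pow(m, e, n)

def decrypt(c):
    """D(C) = C^d mod n"""
    return pow(c, d, n)

message = 65
print(e)                                      # 17
print(decrypt(encrypt(message)) == message)   # True: D(E(M)) = M
print(encrypt(decrypt(message)) == message)   # True: E(D(M)) = M, used for signatures
```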

Digital Cash

Digital Cash

Players: Bank (B), Vendor (V), User (A, Alice)

Four protocols:
• withdrawal (user from bank)
• spend (user at vendor)
• deposit (vendor at bank)
• transfer (user to user; will skip)

Goal of basic schemes: avoid obvious attacks, such as “double spending”.

A basic scheme

Withdrawal:
1. A → B: give $1
2. B → A: Coin: SigB[$1, “Alice”, seq#] (seq# is unique for every coin)
3. B deducts $1 from Alice’s account

A basic scheme

Spend:
1. A → V: buy something for $1
2. V → A: choose random r ∈ {0,1}^128
3. A → V: SigA[r, coin] = Vcoin
4. V → A: verify Alice’s signature and release the good

A basic scheme

Deposit:

V → B: deposit $1, Vcoin = [r, coin]

The bank stores all seq# that have been previously spent, and checks whether the sequence number has already been spent.

If not – everything is fine.

If it has been spent: who is to blame? If V’coin = [r’, coin] is already spent (in B’s database), then if r = r’, V is to blame with overwhelming probability; otherwise Alice is guilty with overwhelming probability.
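
The bank's bookkeeping on deposit can be sketched in Python. This is a simplification under stated assumptions: signatures are not modeled at all, a coin is reduced to its sequence number, and the class and method names are hypothetical.

```python
import secrets

class Bank:
    """Toy bank: remembers, per coin sequence number, the challenge r it was deposited with."""

    def __init__(self):
        self.spent = {}  # seq# -> challenge r of the first deposit

    def deposit(self, seq, r):
        if seq not in self.spent:
            self.spent[seq] = r
            return "accepted"
        if self.spent[seq] == r:
            # Same challenge twice: the vendor is re-depositing the same Vcoin.
            return "vendor to blame"
        # Different challenges for one coin: Alice answered two spend requests.
        return "Alice to blame"

bank = Bank()
r1 = secrets.randbits(128)  # vendor's random 128-bit challenge
r2 = secrets.randbits(128)  # a second, independent challenge
print(bank.deposit(seq=1, r=r1))  # accepted
print(bank.deposit(seq=1, r=r1))  # vendor to blame
print(bank.deposit(seq=1, r=r2))  # Alice to blame (unless r2 == r1, probability 2^-128)
```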

Escrow services

When one user has a good and another has a different good (or money), they may wish to make an exchange.

Escrow service will take the goods and exchange them.

One can show that without escrow services simple fair exchanges are impossible.

The more general problem: contract signing by two parties.

Off-line Escrow services

We do not wish the escrow service to deal with each instance of contract signing.

Off-line escrow service: will be used only if there is a problem.

We now describe such a service.

Fair exchange

E – escrow service with public key pe and private key se

A and B need to sign a contract M.

The basic idea -- Verifiable Escrow: user A signs M, producing SA(M), creates CA = Epe[SA(M) + condition], and PROVES (without revealing information) that CA has been built correctly.

Fair exchange

1. A → B: verifiable escrow CA, where the condition is e.g. that B reveals SB = SigB(M)

2. B → A: B verifies the validity of CA and sends SB = SigB(M)

3. A → B: A verifies SB (to be a valid signature), and if fine sends SA(M)

4. B → A: B verifies SA(M)

What happens if A aborts the protocol before sending the signature to B?

Fair exchange

5. If in step 4 B claims it has been cheated, then it sends CA, SB to E, who verifies that SB = SigB(M), recovers SA(M) from CA, and sends it to B.

E also sends SB to A.

Micropayments

Micropayment schemes

Payment schemes that emphasize the ability to make payments of small amounts are called micropayment schemes.

Applications of micropayments include paying for each web page visited, and for each minute of music or video as it is streamed to the user.

The problem: the cost of transactions is much higher than the worth of each transaction.

Micropayment schemes try to aggregate many small payments into fewer, larger payments, whose processing costs are relatively small.

Observations

Hash functions are about 100 times faster than RSA signature verification, and about 10,000 times faster than RSA signature generation.

On a typical workstation, one can sign two messages per second, verify 200 signatures per second, and compute 20,000 hash function values per second.
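A small calculation makes these ratios concrete. The sketch below uses only the three rates quoted above. The function names and the choice of 100 payments are illustrative assumptions. It compares a payer who signs every payment with one who signs once and then does one hash evaluation per payment.

```python
# Rates quoted on the slide (operations per second on a typical workstation).
SIGNS_PER_SEC = 2
VERIFIES_PER_SEC = 200
HASHES_PER_SEC = 20_000

def cost_signed_payments(n: int) -> float:
    """Seconds spent if each of n payments carries its own signature."""
    return n / SIGNS_PER_SEC

def cost_one_signature_plus_hashes(n: int) -> float:
    """Seconds spent on a single signature plus n hash evaluations."""
    return 1 / SIGNS_PER_SEC + n / HASHES_PER_SEC

hash_vs_verify = HASHES_PER_SEC / VERIFIES_PER_SEC   # 100.0
hash_vs_sign = HASHES_PER_SEC / SIGNS_PER_SEC        # 10000.0
per_payment_signing = cost_signed_payments(100)      # 50.0 seconds
sign_once = cost_one_signature_plus_hashes(100)      # roughly half a second
```

Under these figures, 100 individually signed payments cost the payer about 50 seconds of computation, while one signature plus 100 hashes costs about half a second.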


170

Notation

B – broker/bank
U – user
V – vendor
PK – public key
SK – secret key
h – cryptographically strong hash function (such as MD5): a very large search is required to produce a single input producing a given output, or to find two inputs producing the same output.


171

PayWord

U computes an h-chain x0, x1, …, xn, where xi = h(xi+1).

U commits to the entire chain by sending her signature on x0 to V.

Each successive payment is made by releasing the next consecutive value in the chain, which can be verified by checking that it hashes to the previous element.
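The chain construction can be sketched as follows. This is a minimal illustration, not a reference implementation: SHA-256 stands in for h (the slides mention MD5), the signature on x0 is omitted, and make_chain and verify_next are hypothetical helper names.

```python
import hashlib
import os
from typing import List, Optional

def h(data: bytes) -> bytes:
    # Stand-in for the hash function h; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()

def make_chain(n: int, seed: Optional[bytes] = None) -> List[bytes]:
    """Return [x0, x1, ..., xn] with x_i = h(x_{i+1})."""
    x = seed if seed is not None else os.urandom(32)
    chain = [x]                 # start from the secret end, x_n
    for _ in range(n):
        x = h(x)                # x_{i} = h(x_{i+1}), walking down to x_0
        chain.append(x)
    chain.reverse()             # chain[0] is now the root x_0
    return chain

def verify_next(previous: bytes, released: bytes) -> bool:
    """A payment is valid if the released value hashes to the previous one."""
    return h(released) == previous

chain = make_chain(5, b"demo seed")
first_payment_ok = verify_next(chain[0], chain[1])   # x1 hashes to x0
skipped_value_ok = verify_next(chain[0], chain[2])   # x2 does not hash to x0 in one step
```

The chain is generated backwards from a random x_n, so releasing x1, x2, … in order never reveals a value that has not yet been spent.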


172

PayWord

If, after i micropayments, V wishes to make a deposit, it can deposit i cents by giving B the value xi and the user's signature on x0.

B can verify the signature and iterate h i times on xi to check that the result equals x0.


173

User-Bank relationship

U requests an account and gives B, over a secure channel, her credit card number, her public key PKU, her delivery address AU, etc.

U’s certificate will have an expiration date E, and may include further information IU.

The user’s certificate has the form CU = {B, U, AU, PKU, E, IU}SKB, where {…}SKB denotes a message signed with B’s secret key.


174

User-Vendor Relationship

A U-V relationship arises when, e.g., a user visits a web site, uses or purchases 10 pages, and then moves elsewhere.

Commitments: When U contacts a new vendor V, U computes a fresh payword chain w1, …, wn with root w0, where n is chosen to be “convenient”.

U then computes her commitment to the chain:

M = {V, CU, w0, D, IM}SKU

where D is the current date and IM is additional information (such as the value of n).
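The certificate and the commitment are both signed records, and their layout can be sketched as plain data structures. The field names below, the JSON encoding, and the helpers toy_sign, seal and check are illustrative assumptions. In particular, an HMAC stands in for the signature {…}SK, so unlike a real public-key signature it can only be checked by someone holding the same secret.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace

def toy_sign(secret_key: bytes, fields: dict) -> str:
    # Stand-in for {...}SK: an HMAC over a canonical JSON encoding.
    payload = json.dumps(fields, sort_keys=True).encode()
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()

@dataclass(frozen=True)
class Certificate:          # CU = {B, U, AU, PKU, E, IU}SKB
    broker: str
    user: str
    address: str
    user_public_key: str
    expires: str
    info: str
    signature: str = ""

@dataclass(frozen=True)
class Commitment:           # M = {V, CU, w0, D, IM}SKU
    vendor: str
    certificate: Certificate
    root: str               # w0, hex-encoded
    date: str               # D
    chain_length: int       # IM, here just the value of n
    signature: str = ""

def seal(record, secret_key: bytes):
    """Return a copy of the record with its signature field filled in."""
    fields = asdict(record)
    del fields["signature"]
    return replace(record, signature=toy_sign(secret_key, fields))

def check(record, secret_key: bytes) -> bool:
    fields = asdict(record)
    claimed = fields.pop("signature")
    return hmac.compare_digest(claimed, toy_sign(secret_key, fields))

sk_b, sk_u = b"broker-secret", b"user-secret"
c_u = seal(Certificate("B", "U", "1 Main St", "PK_U", "2030-01-01", ""), sk_b)
m = seal(Commitment("V", c_u, "ab12", "2029-06-01", 100), sk_u)
```

Because M embeds CU, a vendor who receives M holds everything needed to identify the broker and the user's key in a single message.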


175

User-Vendor Relationship

The commitment authorizes B to pay V for any of the paywords w1, …, wn that V redeems with B before date D (plus perhaps an additional day, assuming a micropayment by the end of the day).

Payments: assuming an agreed value for each payment (e.g. 1 cent), a payment P from U to V consists of a payword and its index, P = (wi, i).

Notice that the payment need not be signed, and it is short.

The user spends her paywords in order, starting from w1.


176

User-Vendor Relationship

Payment policy: for each commitment, a vendor V is paid l cents, where (wl, l) is the payment received with the largest index.

V needs to store only the payment with the highest index. Once a user spends wi, she cannot spend wj for j < i.
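The vendor's bookkeeping under this policy can be sketched as below. VendorAccount is a hypothetical name and SHA-256 stands in for h. Accepting a payment whose index jumps ahead by more than one (by hashing the corresponding number of times) is an assumption added for illustration.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in for the hash function h; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()

class VendorAccount:
    """Per-commitment state kept by V: only the highest-index payment."""

    def __init__(self, w0: bytes) -> None:
        self.last_word = w0     # root taken from the user's commitment
        self.last_index = 0

    def accept(self, word: bytes, index: int) -> bool:
        steps = index - self.last_index
        if steps <= 0:
            return False        # index not above the stored one: already spent
        value = word
        for _ in range(steps):  # hash back down to the stored payword
            value = h(value)
        if value != self.last_word:
            return False
        self.last_word, self.last_index = word, index
        return True

    def cents_owed(self) -> int:
        return self.last_index  # V is paid l cents for the payment (wl, l)

# Build a small chain w0..w5 for the demonstration.
words = [b"demo seed"]
for _ in range(5):
    words.append(h(words[-1]))
words.reverse()                 # words[i] is w_i

account = VendorAccount(words[0])
ok_first = account.accept(words[1], 1)
ok_jump = account.accept(words[3], 3)
ok_replay = account.accept(words[2], 2)   # lower index than the stored one
```

After these calls the vendor holds only (w3, 3) and is owed 3 cents. The attempt to spend w2 afterwards is rejected.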


177

Vendor-Bank relationships

V needs to obtain PKB.

V needs to establish a way for B to pay V.

At the end of each period (e.g. a day), V sends B a redemption message for each of B’s users who paid V during that period.

B needs to verify the user signatures and verify each payment (wl, l) by l applications of h.
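The bank's end-of-period check can be sketched as a batch routine. This is an illustration under stated assumptions: the Redemption record and the settle function are hypothetical names, SHA-256 stands in for h, the commitment signature is assumed to have been verified already, and the date rule is read as "redeemed no later than one day after D".

```python
import hashlib
from datetime import date, timedelta
from typing import Dict, List, NamedTuple

def h(data: bytes) -> bytes:
    # Stand-in for the hash function h; SHA-256 is an assumption.
    return hashlib.sha256(data).digest()

class Redemption(NamedTuple):
    user: str
    w0: bytes               # root from the user's (already verified) commitment
    commitment_date: date   # D
    word: bytes             # wl
    index: int              # l

def settle(batch: List[Redemption], today: date) -> Dict[str, int]:
    """Return the cents to debit per user; invalid or late entries are skipped."""
    debits: Dict[str, int] = {}
    for r in batch:
        if today > r.commitment_date + timedelta(days=1):
            continue                    # redeemed too late
        value = r.word
        for _ in range(r.index):        # l applications of h
            value = h(value)
        if value == r.w0:
            debits[r.user] = debits.get(r.user, 0) + r.index
    return debits

# Demonstration chain w0..w3.
words = [b"demo seed"]
for _ in range(3):
    words.append(h(words[-1]))
words.reverse()

d = date(2024, 1, 10)
batch = [
    Redemption("alice", words[0], d, words[3], 3),                      # valid
    Redemption("bob", words[0], d, b"forged", 2),                       # bad payword
    Redemption("carol", words[0], d - timedelta(days=5), words[1], 1),  # late
]
result = settle(batch, today=d)
```

Only the first redemption passes both checks, so the sketch debits alice 3 cents and ignores the other two entries.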