DiscoRank: optimizing discoverability on SoundCloud

DiscoRank: Optimizing Discoverability on SoundCloud Amélie Anglade

description

These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013. The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/

Transcript of DiscoRank: optimizing discoverability on SoundCloud

Page 1: DiscoRank: optimizing discoverability on SoundCloud

DiscoRank: Optimizing Discoverability on SoundCloud

Amélie Anglade

Page 2: DiscoRank: optimizing discoverability on SoundCloud

• Developer at SoundCloud

• SoundCloud is the world’s largest social sound platform

• Academic background in Music Information Retrieval (MIR)

• Design, prototype and implement Machine Learning algorithms for music discovery

Page 3: DiscoRank: optimizing discoverability on SoundCloud

DISCOVERABILITY?

Page 4: DiscoRank: optimizing discoverability on SoundCloud
Page 5: DiscoRank: optimizing discoverability on SoundCloud
Page 6: DiscoRank: optimizing discoverability on SoundCloud
Page 7: DiscoRank: optimizing discoverability on SoundCloud

PAGERANK

Page 8: DiscoRank: optimizing discoverability on SoundCloud

WEB AND PAGERANK

• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks

• The (Page)rank of a node depends on the link structure of the graph

Page 9: DiscoRank: optimizing discoverability on SoundCloud

RANDOM SURFER

Page 10: DiscoRank: optimizing discoverability on SoundCloud

RANDOM SURFER

[Figure: graph with nodes A, B, C and D; three edges are each labeled with probability 1/3]

Page 12: DiscoRank: optimizing discoverability on SoundCloud

RANDOM SURFER

Nodes visited more often:
  • Nodes with many incoming links
  • Nodes whose incoming links come from frequently visited nodes

[Figure: graph with nodes A, B, C, D and E]
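The random surfer idea can be sketched as a short simulation. This is a minimal illustration, not code from the talk: the link structure `LINKS`, the step count and the seed are all made up for the example.

```python
import random
from collections import Counter

# Hypothetical link structure (not the graph shown on the slides).
LINKS = {
    "A": ["B", "C", "D"],  # from A, each link is followed with probability 1/3
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["A", "C"],       # nothing links to E
}

def random_surf(links, start, steps, seed=0):
    """Follow uniformly random outgoing links and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(links[node])
    return {n: visits[n] / steps for n in links}

freq = random_surf(LINKS, "A", 20000)
for node in sorted(freq, key=freq.get, reverse=True):
    print(node, round(freq[node], 3))
```

In this toy graph C receives links from A, B and D, so it ends up visited far more often than B or D, and E (which has no incoming links) is never visited once the surfer starts at A.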

Page 13: DiscoRank: optimizing discoverability on SoundCloud

COMPUTING THE PAGERANK

Adjacency matrix A

Transition probability matrix M

Probability distribution of surfer’s position

[Figure: example graph with nodes A, B, C, D and E]
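The three objects named on this slide can be sketched in a few lines. The matrices from the slide are not preserved in the transcript, so the 5-node graph in `ADJ` below is a hypothetical stand-in, and the helper names `transition_matrix` and `step` are illustrative.

```python
# Hypothetical 5-node graph; rows are "from" nodes, columns are "to" nodes.
NODES = ["A", "B", "C", "D", "E"]
ADJ = [
    [0, 1, 1, 1, 0],  # A -> B, C, D
    [0, 0, 1, 0, 0],  # B -> C
    [1, 0, 0, 0, 0],  # C -> A
    [0, 0, 1, 0, 1],  # D -> C, E
    [1, 0, 0, 0, 0],  # E -> A
]

def transition_matrix(adj):
    """Row-normalize the adjacency matrix: M[i][j] = A[i][j] / out-degree(i)."""
    m = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("dead-end node: no outgoing links to normalize")
        m.append([x / out_degree for x in row])
    return m

def step(dist, m):
    """One move of the surfer: new_dist[j] = sum_i dist[i] * M[i][j]."""
    n = len(m)
    return [sum(dist[i] * m[i][j] for i in range(n)) for j in range(n)]

M = transition_matrix(ADJ)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at A with certainty
x1 = step(x0, M)                # one click later
x2 = step(x1, M)                # two clicks later
print(dict(zip(NODES, x1)))
print(dict(zip(NODES, x2)))
```

After one step the surfer is at B, C or D with probability 1/3 each; repeating `step` gives the distribution at any later time.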

Page 19: DiscoRank: optimizing discoverability on SoundCloud

TELEPORT

[Figure: graph with nodes A, B, C, D and E]

Page 22: DiscoRank: optimizing discoverability on SoundCloud

TELEPORT

If there are N nodes in the graph, the probability of teleporting to any given node (including the current one) is 1/N.

[Figure: graph with nodes A, B, C, D and E; teleport edges are each labeled 1/N]

Page 23: DiscoRank: optimizing discoverability on SoundCloud

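TELEPORT

At a regular node: invoke the teleport operation with probability α and the standard random walk with probability (1 - α).

[Figure: graph with nodes A, B, C, D and E; the surfer teleports with probability α (each target with probability 1/N) or follows a link with probability 1 - α]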
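The teleport rule can be folded into the transition matrix. The sketch below makes two assumptions that are not on the slide: α = 0.15 is an arbitrary example value, and a dead-end node (no outgoing links) is treated as always teleporting, which is a common convention. The graph in `ADJ` is hypothetical.

```python
# Hypothetical 5-node graph with one dead end (B has no outgoing links).
ADJ = [
    [0, 1, 1, 1, 0],  # A -> B, C, D
    [0, 0, 0, 0, 0],  # B is a dead end
    [1, 0, 0, 0, 0],  # C -> A
    [0, 0, 1, 0, 1],  # D -> C, E
    [1, 0, 0, 0, 0],  # E -> A
]

def teleport_matrix(adj, alpha):
    """Transition probabilities of a random walk with teleportation.

    Regular node: teleport with probability alpha (uniformly to one of the
    N nodes), otherwise follow a uniformly chosen outgoing link.
    Dead end (assumption): always teleport.
    """
    n = len(adj)
    g = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            g.append([1.0 / n] * n)
        else:
            g.append([alpha / n + (1 - alpha) * x / out_degree for x in row])
    return g

ALPHA = 0.15  # example value only; the talk does not state the one used
G = teleport_matrix(ADJ, ALPHA)
for row in G:
    print([round(p, 3) for p in row])
```

Every row of `G` sums to 1, and every entry is strictly positive, so the surfer can reach any node from any other node.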

Page 24: DiscoRank: optimizing discoverability on SoundCloud

COMPUTING THE PAGERANK

The probability distribution of the surfer’s position at any time is a vector.

That vector converges to a steady state: the PageRank vector.
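Convergence to a steady state can be illustrated with power iteration: keep applying the transition matrix until the vector stops changing. The 3-node matrix `P` below is a hypothetical example standing in for a transition matrix with teleportation already applied; the tolerance and the function name are illustrative choices.

```python
def power_iterate(p, x0, tol=1e-10, max_iter=10_000):
    """Repeat x <- x * P until the L1 change between iterations is below tol.

    Returns the final vector and the number of iterations performed.
    """
    n = len(p)
    x = list(x0)
    for iteration in range(1, max_iter + 1):
        nxt = [sum(x[i] * p[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            return x, iteration
    return x, max_iter

# Hypothetical row-stochastic matrix (each row sums to 1).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

rank, iterations = power_iterate(P, [1.0, 0.0, 0.0])
print([round(r, 6) for r in rank], "after", iterations, "iterations")
```

For this matrix the steady state is (0.25, 0.5, 0.25), and starting from a different initial vector gives the same result.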

Page 25: DiscoRank: optimizing discoverability on SoundCloud

PAGERANK EQUATION
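The equation shown on this slide is not preserved in the transcript. One standard way to write the steady state described on the previous slides, assuming a row vector x for the surfer's probability distribution, M for the transition probability matrix, α for the teleport probability, N for the number of nodes, and no dead-end nodes, is:

```latex
\vec{x} \;=\; \vec{x}\,\Bigl[(1-\alpha)\,M \;+\; \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{\top}\Bigr],
\qquad \sum_{i=1}^{N} x_i = 1
```

Here \mathbf{1} is the column vector of N ones, so the second term gives every node the same teleport probability α/N. This is a reconstruction under the stated assumptions and may differ in notation from the original slide.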

Page 26: DiscoRank: optimizing discoverability on SoundCloud

SOUNDCLOUD DISCORANK

Page 27: DiscoRank: optimizing discoverability on SoundCloud
Page 28: DiscoRank: optimizing discoverability on SoundCloud

DISCORANK

[Figure: graph with nodes A, B, C, D and E; node types: User, Track, Playlist; edge types: favorite, follow, featured in]

Page 29: DiscoRank: optimizing discoverability on SoundCloud

UNIVERSAL SEARCH

• Search across People, Sounds, Sets, Groups

• One unique rank vector that contains all entities

• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist
  • ...

• New big (but sparse) adjacency matrix: [matrix shown on slide]
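A weighted, sparse adjacency structure over mixed entity types could look like the sketch below. The weights in `EVENT_WEIGHTS`, the entity ids and the event names are all hypothetical; the talk does not give the actual values or storage format.

```python
from collections import defaultdict

# Hypothetical weight per event type (the real values are not in the talk).
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 1.5}

# (source entity, event type, target entity); the ids are made up.
EVENTS = [
    ("user:1", "follow", "user:2"),
    ("user:1", "favorite", "track:10"),
    ("user:2", "favorite", "track:10"),
    ("track:10", "featured_in", "playlist:5"),
]

def build_sparse_adjacency(events, weights):
    """Store only the non-zero entries: adjacency[src][dst] = summed weight."""
    adjacency = defaultdict(lambda: defaultdict(float))
    for src, kind, dst in events:
        adjacency[src][dst] += weights[kind]
    # Convert to plain dicts so lookups do not create empty entries.
    return {src: dict(targets) for src, targets in adjacency.items()}

adjacency = build_sparse_adjacency(EVENTS, EVENT_WEIGHTS)
for src, targets in adjacency.items():
    print(src, "->", targets)
```

Users, tracks and playlists share one node space here, which is what allows a single rank vector to cover all entity types.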

Page 30: DiscoRank: optimizing discoverability on SoundCloud
Page 31: DiscoRank: optimizing discoverability on SoundCloud

BACK TO EXPLORE

• How do we identify content that is trending?

• The more recent an event (listen, favorite, etc.), the higher its weight

• Multiply each event (= edge) by a time decay: [formula shown on slide]

• New adjacency matrix: [matrix shown on slide]
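The decay formula from the slide is not preserved in the transcript. As one possible illustration, the sketch below assumes an exponential decay with a half-life; both the functional form and the 7-day value are assumptions made for the example, not the talk's actual choice.

```python
HALF_LIFE_DAYS = 7.0  # hypothetical: an event loses half its weight every 7 days

def decayed_weight(base_weight, age_days, half_life_days=HALF_LIFE_DAYS):
    """Scale an edge weight by an exponential time decay (assumed form)."""
    return base_weight * 0.5 ** (age_days / half_life_days)

# A favorite with base weight 2.0 at different ages.
for age in (0, 7, 14, 30):
    print(age, "days old ->", round(decayed_weight(2.0, age), 4))
```

With any decay of this kind, two otherwise identical events contribute differently to the adjacency matrix depending on how recent they are, which pushes recently active content up the ranking.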

Page 32: DiscoRank: optimizing discoverability on SoundCloud

PERFORMANCE OPTIMIZATION

Page 33: DiscoRank: optimizing discoverability on SoundCloud

A VERY LARGE GRAPH

• Millions of entities (= nodes) and events (= edges)

• First DiscoRank: several hours of computation

• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank

• So technically we could compute the DiscoRank in real time

Page 34: DiscoRank: optimizing discoverability on SoundCloud

USING SPARSITY

• Re-mapping entity ids

• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place

• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids
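Id re-mapping and delta encoding can be sketched as follows. This is a simplified illustration in Python rather than the byte[]-based storage described on the slide; the function names and sample ids are hypothetical.

```python
def remap_ids(edges):
    """Map arbitrary entity ids to dense integers 0..N-1 (first-seen order)."""
    mapping = {}
    dense_edges = []
    for src, dst in edges:
        for entity in (src, dst):
            if entity not in mapping:
                mapping[entity] = len(mapping)
        dense_edges.append((mapping[src], mapping[dst]))
    return mapping, dense_edges

def delta_encode(sorted_ids):
    """Store the first id, then only the gap to the previous id."""
    previous = 0
    deltas = []
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas

def delta_decode(deltas):
    """Rebuild the original ids by accumulating the gaps."""
    total = 0
    ids = []
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids

mapping, dense_edges = remap_ids([("user:9001", "track:42"), ("user:7", "track:42")])
print(mapping, dense_edges)

# One "from" node with several "to" nodes, sorted before encoding.
to_nodes = sorted([1000003, 1000001, 1000010, 1000004])
encoded = delta_encode(to_nodes)
print(encoded)
```

Sorted neighbor ids tend to produce small gaps, and small numbers need fewer bytes under a variable-length integer encoding, which is where the space saving would come from.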

Page 35: DiscoRank: optimizing discoverability on SoundCloud

VERSIONED DISCORANK

• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph

• We rebuild the entire DiscoRank graph from scratch once a week

• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as a prior for the DiscoRank computation

• Side effect:
  • Also allows for experimentation
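Using the previous result as a prior amounts to a warm start of the iteration. The sketch below compares a cold start (uniform vector) with a warm start on a slightly changed graph. The matrices are hypothetical, and the node set is kept fixed for simplicity, whereas the real system also adds new entities between runs.

```python
def power_iterate(p, x0, tol=1e-10, max_iter=10_000):
    """Repeat x <- x * P until the L1 change is below tol; return (x, iterations)."""
    n = len(p)
    x = list(x0)
    for iteration in range(1, max_iter + 1):
        nxt = [sum(x[i] * p[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if change < tol:
            return x, iteration
    return x, max_iter

# Hypothetical transition matrices for two consecutive runs.
P_OLD = [
    [0.9, 0.10, 0.00],
    [0.3, 0.60, 0.10],
    [0.0, 0.30, 0.70],
]
P_NEW = [
    [0.9, 0.10, 0.00],
    [0.3, 0.60, 0.10],
    [0.0, 0.28, 0.72],  # a few new events slightly changed this row
]

uniform = [1 / 3] * 3
previous_rank, _ = power_iterate(P_OLD, uniform)

cold_rank, cold_iters = power_iterate(P_NEW, uniform)        # start from scratch
warm_rank, warm_iters = power_iterate(P_NEW, previous_rank)  # start from the prior

print("cold start:", cold_iters, "iterations")
print("warm start:", warm_iters, "iterations")
```

Both runs reach the same vector; the warm start simply begins closer to it, so it needs fewer iterations when the graph has changed only a little.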

Page 36: DiscoRank: optimizing discoverability on SoundCloud

INTEGRATION IN OUR INFRASTRUCTURE

• MySQL batch jobs

• DiscoRank results stored in HDFS

• At the end of every DiscoRank run we re-load it in ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank

Page 37: DiscoRank: optimizing discoverability on SoundCloud

Amélie Anglade
Sound/Music Information Retrieval Engineer

about.me/utstikkar
@utstikkar

We’re hiring!

www.soundcloud.com