DiscoRank: optimizing discoverability on SoundCloud

Posted on 25-May-2015

description

These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013. The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/

Transcript of DiscoRank: optimizing discoverability on SoundCloud

DiscoRank: Optimizing Discoverability on SoundCloud

Amélie Anglade

• Developer at SoundCloud

• SoundCloud is the world’s largest social sound platform

• Academic background in Music Information Retrieval (MIR)

• Design, prototype and implement Machine Learning algorithms for music discovery

DISCOVERABILITY?

PAGERANK

• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks

• The (Page)rank of a node depends on the link structure of the graph

WEB AND PAGERANK

RANDOM SURFER

[Figure: graph with nodes A, B, C, D; three links labeled with probability 1/3 each]

Nodes visited more often:
  • nodes with many incoming links
  • whose links come from frequently visited nodes
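The random surfer idea can be tried out with a short simulation. This is only a sketch: the four-node graph and the function name `simulate_random_surfer` are assumptions made for illustration, not taken from the slides. Node A has three outgoing links, so each is followed with probability 1/3.

```python
import random


def simulate_random_surfer(links, start, steps, seed=0):
    """Follow a uniformly random outgoing link at every step and count visits."""
    rng = random.Random(seed)
    visits = {node: 0 for node in links}
    current = start
    for _ in range(steps):
        current = rng.choice(links[current])  # each outgoing link is equally likely
        visits[current] += 1
    return visits


# Hypothetical graph: A links to B, C and D; each of them links back to A.
links = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
visits = simulate_random_surfer(links, start="A", steps=10_000)
most_visited = max(visits, key=visits.get)
```

In this toy graph every second step lands on A, because A receives links from all other nodes; B, C and D share the remaining visits.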

COMPUTING THE PAGERANK

[Figure: example graph with nodes A, B, C, D, E]

Adjacency matrix A

Transition probability matrix M

Probability distribution of surfer’s position
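The three objects named on these slides can be sketched as follows. The five-node adjacency matrix is a made-up example (the matrix shown on the slides is not part of the transcript), and the helper names `transition_matrix` and `step` are assumptions.

```python
def transition_matrix(adjacency):
    """Divide every row of the adjacency matrix by its out-degree."""
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("node without outgoing links")
        matrix.append([value / out_degree for value in row])
    return matrix


def step(distribution, matrix):
    """Advance the surfer's position distribution by one click (row vector times M)."""
    n = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical graph over the nodes A, B, C, D, E (rows are "from", columns are "to").
adjacency = [
    [0, 1, 1, 1, 0],  # A -> B, C, D
    [0, 0, 1, 0, 0],  # B -> C
    [1, 0, 0, 0, 1],  # C -> A, E
    [0, 0, 0, 0, 1],  # D -> E
    [1, 0, 0, 0, 0],  # E -> A
]
M = transition_matrix(adjacency)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at A
x1 = step(x0, M)                # after one click: 1/3 each on B, C and D
```

Each row of `M` sums to 1, so `step` always maps a probability distribution to another probability distribution.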

TELEPORT

[Figure: graph with nodes A, B, C, D, E]

If the graph has N nodes, the probability of teleporting to any given node (including the current one) is 1/N

[Figure: nodes A, B, C, D, E; each teleport edge labeled 1/N, with branch labels α (teleport) and 1-α (random walk)]

At a regular node: invoke the teleport operation with probability α and the standard random walk with probability (1 - α)

Probability distribution of the surfer at any time is a vector.

COMPUTING THE PAGERANK

That vector converges to a steady state: the PageRank vector.
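The convergence to a steady state can be sketched with a plain power iteration. This is a minimal illustration, not the production code: the value α = 0.15, the tolerance, the function name `pagerank` and the four-node example are all assumptions, and every node is assumed to have at least one outgoing link.

```python
def pagerank(matrix, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate the surfer's distribution until it stops changing.

    matrix is a row-stochastic transition matrix M; alpha is the
    teleport probability (0.15 is an assumed value).
    """
    n = len(matrix)
    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = []
        for j in range(n):
            walk = sum(rank[i] * matrix[i][j] for i in range(n))
            # teleport lands on node j with probability alpha / n
            new_rank.append(alpha / n + (1 - alpha) * walk)
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph: A links to B, C and D; each of them links back to A.
M = [
    [0.0, 1 / 3, 1 / 3, 1 / 3],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
rank = pagerank(M)
```

For this graph the steady state can be checked by hand: the rank of A satisfies r = α/4 + (1 - α)(1 - r), which gives r = 0.8875 / 1.85 for α = 0.15.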

PAGERANK EQUATION
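One standard way to write the equation, using the teleport probability α, the transition probability matrix M and the number of nodes N from the slides above. The symbols x_t (the surfer's distribution at step t, as a row vector), π (the steady state), the all-ones row vector and the all-ones N×N matrix J are notation assumed here, and every node is assumed to have at least one outgoing link.

```latex
x_{t+1} = \frac{\alpha}{N}\,\mathbf{1}^{\top} + (1 - \alpha)\, x_t M ,
\qquad
\pi = \pi \left[ (1 - \alpha)\, M + \frac{\alpha}{N}\, J \right],
\qquad
\sum_{i=1}^{N} \pi_i = 1 .
```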

SOUNDCLOUD DISCORANK

DISCORANK

[Figure: graph with nodes A, B, C, D, E of type User, Track or Playlist; edges labeled favorite, follow, featured in]

• Search across People, Sounds, Sets, Groups

• One unique rank vector that contains all entities

• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist

...

• New big (but sparse) adjacency matrix:

UNIVERSAL SEARCH
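A single rank vector over all entity types needs one shared index space and one weight per event type. The sketch below shows one way to build such a weighted sparse adjacency structure. The weight values, the event names used as keys and the `(type, id)` entity tuples are illustrative assumptions; the slides do not give them.

```python
from collections import defaultdict

# Hypothetical weights per event type; the real values are not in the slides.
EVENT_WEIGHTS = {"favorite": 1.0, "follow": 2.0, "featured_in": 1.5}


def build_weighted_graph(events):
    """Map every entity to a dense index and sum the event weights per edge."""
    index = {}                  # (entity type, entity id) -> row/column in the matrix
    edges = defaultdict(float)  # (from index, to index) -> weight; absent pairs are 0

    def node_id(entity):
        if entity not in index:
            index[entity] = len(index)
        return index[entity]

    for source, event_type, target in events:
        edges[(node_id(source), node_id(target))] += EVENT_WEIGHTS[event_type]
    return index, dict(edges)


events = [
    (("user", 7), "follow", ("user", 42)),
    (("user", 7), "favorite", ("track", 1001)),
    (("track", 1001), "featured_in", ("playlist", 5)),
    (("user", 42), "favorite", ("track", 1001)),
]
index, edges = build_weighted_graph(events)
```

Only the edges that actually exist are stored, which is what keeps a big but sparse matrix manageable.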

• How do we identify content that is trending?

• The more recent a listen, favorite, etc. (event) the higher the weight

• Multiply each event (=edge) by a time decay:

• New adjacency matrix:

BACK TO EXPLORE
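The time decay formula is not part of the transcript. As an illustration only, the sketch below assumes an exponential decay with a seven-day half-life; both the decay shape and the half-life are assumptions, as are the function names.

```python
SECONDS_PER_DAY = 86_400


def time_decay(age_days, half_life_days=7.0):
    """Assumed exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def decayed_edge_weight(event_times, now, base_weight=1.0, half_life_days=7.0):
    """Sum the decayed weights of all events that make up one edge."""
    total = 0.0
    for event_time in event_times:
        age_days = (now - event_time) / SECONDS_PER_DAY
        total += base_weight * time_decay(age_days, half_life_days)
    return total


now = 1_000_000_000  # arbitrary reference timestamp in seconds
fresh = decayed_edge_weight([now], now)                      # 1.0
week_old = decayed_edge_weight([now - 7 * SECONDS_PER_DAY], now)   # 0.5
two_weeks = decayed_edge_weight([now - 14 * SECONDS_PER_DAY], now)  # 0.25
```

With this weighting a track that was favorited today contributes more to the graph than one favorited two weeks ago, which pushes trending content up.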

PERFORMANCE OPTIMIZATION

• Millions of entities (= nodes) and events (= edges)

• First DiscoRank: several hours of computation

• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank

• So technically we could compute the DiscoRank in real time

A VERY LARGE GRAPH
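The sparse matrix point above matters because one iteration then costs time proportional to the number of edges rather than to N². A minimal sketch of one such iteration over weighted adjacency lists follows; the data layout, the function name and α = 0.15 are assumptions, and every node is assumed to have at least one outgoing edge.

```python
def sparse_pagerank_step(rank, out_edges, alpha=0.15):
    """One power-iteration step that only touches existing edges.

    out_edges maps a node index to a list of (target index, weight) pairs.
    """
    n = len(rank)
    new_rank = [alpha / n] * n  # teleport share for every node
    for source, targets in out_edges.items():
        total = sum(weight for _, weight in targets)
        for target, weight in targets:
            # pass on the source's rank in proportion to the edge weight
            new_rank[target] += (1 - alpha) * rank[source] * weight / total
    return new_rank


# Hypothetical three-node graph with weighted edges.
out_edges = {
    0: [(1, 1.0), (2, 1.0)],
    1: [(2, 1.0)],
    2: [(0, 2.0)],
}
rank = [1 / 3, 1 / 3, 1 / 3]
new_rank = sparse_pagerank_step(rank, out_edges)
```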

• Re-mapping entity ids

• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place

• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids

USING SPARSITY
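Delta encoding an ordered adjacency list can be sketched as below. The slides do not say how the deltas are serialized; the variable-length byte encoding (`to_varint_bytes`) is an assumption added to show why small deltas save space, and all names and ids are illustrative.

```python
def delta_encode(sorted_ids):
    """Store each id as the difference to the previous id in the sorted list."""
    deltas = []
    previous = 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original ids by accumulating the differences."""
    ids = []
    current = 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids


def to_varint_bytes(values):
    """Assumed serialization: 7 bits per byte, high bit set while more bytes follow."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value)
    return bytes(out)


# One "from" node with several "to" nodes (hypothetical ids).
to_nodes = sorted([1000005, 1000002, 1000017, 1000009])
encoded = delta_encode(to_nodes)              # [1000002, 3, 4, 8]
raw_size = len(to_varint_bytes(to_nodes))     # 12 bytes
delta_size = len(to_varint_bytes(encoded))    # 6 bytes
```

Sorting the "to" ids first is what keeps the deltas small; re-mapping entity ids to a dense range helps for the same reason.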

• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph

• We rebuild the entire DiscoRank graph from scratch once a week

• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as prior for the DiscoRank computation

• Side effect:
  • Also allows for experimentation

VERSIONED DISCORANK
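Using the previous run as a prior amounts to warm-starting the iteration. The sketch below illustrates the idea on a tiny graph; the matrices, α = 0.15, the tolerance and the function name are assumptions, and new entities (which would need an initial value in the prior) are left out for brevity.

```python
def power_iteration(matrix, prior, alpha=0.15, tol=1e-10, max_iter=1000):
    """Run the rank iteration from a given prior; return (rank, iterations used)."""
    n = len(matrix)
    rank = list(prior)
    for iteration in range(1, max_iter + 1):
        new_rank = [
            alpha / n + (1 - alpha) * sum(rank[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            return rank, iteration
    return rank, max_iter


uniform = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical graph of the previous run: A -> B, B -> C, C -> A and B.
old_graph = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
old_rank, cold_iterations = power_iteration(old_graph, uniform)

# Unchanged graph: restarting from the stored result converges immediately.
_, warm_iterations = power_iteration(old_graph, old_rank)

# Updated graph (A gains a link to C): both starting points reach the same ranks.
new_graph = [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
rank_from_uniform, _ = power_iteration(new_graph, uniform)
rank_from_prior, _ = power_iteration(new_graph, old_rank)
```

The starting point does not change the result, only the number of iterations needed; when the graph changes little between runs, the previous vector is typically a much closer starting point than the uniform one.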

• MySQL batch jobs

• DiscoRank results stored in HDFS

• At the end of every DiscoRank run we reload it into ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank

INTEGRATION IN OUR INFRASTRUCTURE
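The slides do not say how the Lucene score and the DiscoRank are combined. Purely as an illustration, the sketch below assumes a simple product of the two and re-sorts a list of hits by it; the function name, the hit tuples and all numbers are hypothetical.

```python
def combined_score(lucene_score, discorank):
    """Assumed combination: text relevance multiplied by the item's DiscoRank."""
    return lucene_score * discorank


# Hypothetical search hits: (item id, Lucene score, DiscoRank).
hits = [
    ("track-a", 2.0, 0.001),
    ("track-b", 1.8, 0.004),
    ("track-c", 0.5, 0.010),
]
reranked = sorted(hits, key=lambda hit: combined_score(hit[1], hit[2]), reverse=True)
order = [hit[0] for hit in reranked]
```

Under this assumption a slightly less relevant but much better ranked item (track-b) moves ahead of the best text match (track-a).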

Amélie Anglade

Sound/Music Information Retrieval Engineer

about.me/utstikkar

@utstikkar

We’re hiring!

www.soundcloud.com