From Power Chord to the Power of Models - Oredev

Post on 07-Apr-2017


From Power Chords to the Power of Models

Ali Kheyrollahi (@aliostad)

> stackoverflow> £1.5 bln

global fashion destination

> 35% every year

Local pop music

Local pop music “Cheelee pom!”

Boney M “Rasputin”

Blondie “Heart of Glass”

Infobox

Free text

Links

Data Acquisition

Data Source - Wiki

4,990,279 English Articles

37,583,879 Articles

Data Source - Wiki vs Britannica

Feng Zhu (assistant professor at Harvard):

“There has been lots of research on the accuracy of Wikipedia, and the results are mixed—some studies show it is just as good as the experts, others show [that] Wikipedia is not accurate at all.”

“… the editors [of Britannica] are still not found to be more objective than the crowd in articles that are sufficiently revised.”

Data Source - Wikipedia in scholar papers

[Chart: Wikipedia in scholarly papers per year, 2005 to 2014; vertical axis from 0 to 180,000. Source: Google Scholar]

Data Acquisition - Wiki

[Diagram: List of Rock Genres -> Rock Genres -> Rock Artists. At each step: store HTML, capture links. Tools: Python scripts, Postgres]
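The "capture links" step could look roughly like the sketch below. It is not the speaker's script: the class name, the sample HTML and the rule for skipping namespaced pages are all assumptions made for illustration, and it works on HTML that has already been stored.

```python
from html.parser import HTMLParser


class WikiLinkParser(HTMLParser):
    """Collects internal article links (href="/wiki/...") from a stored page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep article links only; skip namespaced pages such as "File:" or "Help:".
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append(href)


def capture_links(html):
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical snippet standing in for the stored HTML of a genre list page.
sample_html = """
<ul>
  <li><a href="/wiki/Krautrock">Krautrock</a></li>
  <li><a href="/wiki/Shoegazing">Shoegazing</a></li>
  <li><a href="/wiki/File:Guitar.jpg">image</a></li>
  <li><a href="https://example.org/">external</a></li>
</ul>
"""
links = capture_links(sample_html)
print(links)
```

The captured links would then be followed to the next level (genres, then artists) and each page's HTML stored before any parsing.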

Data Source - Content vs. Data

Hyphen U+002D

figure dash U+2012

horizontal bar U+2015 (the true minus sign is U+2212)

em dash U+2014

en dash U+2013
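One way to cope with this variety when turning content into data is to map every dash variant onto the plain hyphen before parsing. A minimal sketch, assuming that losing the typographic distinction is acceptable (the function name is illustrative):

```python
# Dash-like code points: figure dash, en dash, em dash, horizontal bar,
# plus the true minus sign (U+2212).
DASHES = "\u2012\u2013\u2014\u2015\u2212"
TO_HYPHEN = str.maketrans({ch: "-" for ch in DASHES})


def normalize_dashes(text):
    """Replace every dash variant with the plain hyphen-minus (U+002D)."""
    return text.translate(TO_HYPHEN)


print(normalize_dashes("1940\u20131964"))  # en dash between the years
```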

Data Exploration

“I personally … literally just look at the screen, just like the matrix”

Claudia Perlich, multi-award-winning data scientist

Data Exploration

“… the dirty little secret that I have won all of them because I have found something wrong with the data… I would like to play around with dataset and get intimately familiar with dataset and its properties.”

Claudia Perlich

Album Genre

http://wiki-rock.azurewebsites.net/top10-album-genres.html

Data Models

Data Models Model?!

Data Models Model

Mathematical representation of a concept based on parameters that impact that concept

• Rating of a native app
• Stackoverflow score
• Credit score
• Fraud check

“All models are wrong… but some are useful.”

George Box

Data Models Graph 101

Social Network Analysis and Graph Theory

• Nodes/vertices and edges/lines
• Directedness:
  • Directed
  • Undirected
• Degree, InDegree/OutDegree
• Weight

A B

Data Models Centrality

[Figure: example graph with each node labelled by its degree; nodes with the same degree can have different betweenness]

Graph Codez

import networkx as nx

g = nx.Graph()
g.add_edge('a', 'b')
g.add_edge('b', 'c')
# …
print(len(g['b']))  # degree of node 'b'
c = nx.betweenness_centrality(g, normalized=True)
# c -> dictionary mapping node names to their scores

DiGraph()
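To make "same degree, different betweenness" concrete, here is a small sketch using networkx (it must be installed). The toy graph is hypothetical and not the one drawn on the slide.

```python
import networkx as nx

# Hypothetical toy graph: a triangle (a, b, c) with a tail c - d - e.
g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])

degree = dict(g.degree())
# normalized=False returns raw counts of shortest paths through each node.
betweenness = nx.betweenness_centrality(g, normalized=False)

# 'a' and 'd' both have two neighbours, but only 'd' lies on shortest paths
# between other nodes (every path that reaches 'e' goes through it).
print(degree["a"], degree["d"])
print(betweenness["a"], betweenness["d"])
```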

Modelling Influence using Wiki

Data Models Cited Influence

Howlin’ Wolf

Captain Beefheart

1940 1964

Data Models Cited Influence

Most influential Rock Artists Based on out-degree

The Beatles => 188
Black Sabbath => 127
Led Zeppelin => 118
Jimi Hendrix => 114
Bob Dylan => 94
Pink Floyd => 86
Iron Maiden => 77
Metallica => 77
The Rolling Stones => 66
The Beach Boys => 65
Neil Young => 63
Nirvana => 62
Slayer => 60
Queen => 59
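A ranking like this can be produced by counting outgoing edges. The sketch below assumes that each edge points from the influencer to the artist who cites them. Only the Howlin' Wolf and Captain Beefheart pair comes from the slides; the other names are placeholders.

```python
from collections import Counter

# (influencer, influenced) pairs. Only the first pair comes from the slides;
# "Artist X" and "Artist Y" are placeholders.
edges = [
    ("Howlin' Wolf", "Captain Beefheart"),
    ("Howlin' Wolf", "Artist X"),
    ("Artist X", "Artist Y"),
]

# Out-degree = number of artists who cite this artist as an influence.
out_degree = Counter(influencer for influencer, _ in edges)

for artist, score in out_degree.most_common():
    print(f"{artist} => {score}")
```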

Data Models Cited Influence

Most influential Rock Artists Based on Betweenness Centrality

Jimi Hendrix => 53476.2014921
The Beatles => 47511.7957531
Bob Dylan => 38107.0298185
Led Zeppelin => 32701.7223273
Nirvana => 29733.9066836
Metallica => 29356.6009213
Queen => 28989.2844223
Robert Smith => 28880.670718
Elvis Presley => 28463.2891497
Slade => 27656.487307
Iron Maiden => 22449.6697023
Ramones => 22437.6112965
Rush => 21125.9481602
Neil Young => 19913.887522

Data Models Cited Influence

Most influential Artists Based on Betweenness Centrality

Heavy Metal

Metallica => 566.06
Iron Maiden => 419.21
Corey Taylor => 146.0
Led Zeppelin => 122.73
Slipknot => 116.58
King Diamond => 94.7
Machine Head => 85.12
Rush => 70.41
Black Sabbath => 68.0
Van Halen => 54.56
Deep Purple => 53.5
Megadeth => 42.63
Guns N' Roses => 24.25

Alternative Rock

Nirvana => 490.08
Muse => 114.5
Weezer => 97.33
Pixies => 94.17
Sonic Youth => 78.5
Rivers Cuomo => 69.5
Siouxsie and the Banshees => 51.67
The Smiths => 51.5
Jeff Buckley => 46.17
The Offspring => 43.0
Placebo => 42.0
My Chemical Romance => 34.0
The Smashing Pumpkins => 32.33

Progressive Rock

Rush => 54.0
Marillion => 34.0
Pink Floyd => 33.0
Yes => 20.0
Porcupine Tree => 19.5
Dream Theater => 19.0
Chris Squire => 16.5
Primus => 15.0
Tool => 12.0
Mahavishnu Orchestra => 8.0
Geddy Lee => 7.0
Neil Peart => 5.0
Keith Emerson => 5.0

Data Models PageRank

Data Models Page Rank

The Beatles => 0.00837723421839
Blind Lemon Jefferson => 0.00837369035189
Josh White => 0.00824945015047
Bessie Smith => 0.00717743996144
Louis Armstrong => 0.00692897940193
James P. Johnson => 0.00628676810257
Little Richard => 0.00584677302727
Muddy Waters => 0.005773172933
Tampa Red => 0.00572032424174
Robert Johnson => 0.00523579252974
Big Bill Broonzy => 0.00516075834679
Moon Mullican => 0.0050657751593
Black Sabbath => 0.00498789229732
Elvis Presley => 0.00497932058047
Duke Ellington => 0.00465800760107
Bo Diddley => 0.0044496675634
Jimmy Page => 0.00437658472459
Frank Zappa => 0.00431978608953
Miles Davis => 0.00396303890974
Jimi Hendrix => 0.00391117233916
Sister Rosetta Tharpe => 0.00390833570401
Bing Crosby => 0.00385435213525
Bob Dylan => 0.00358608821536
James Brown => 0.00349870931123
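For readers unfamiliar with PageRank, the sketch below shows the idea as a plain power iteration. It is not the code behind the scores above. The damping factor of 0.85, the iteration count and the edge direction (from an artist to the artists they cite) are assumptions, and the node names are placeholders.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical edges pointing from an artist to an artist they cite.
edges = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]
scores = pagerank(edges)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name} => {score:.4f}")
```

Unlike out-degree, a node here gains score from being cited by nodes that are themselves well cited, which may explain why early blues artists appear so high in the list.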

Other Models

Weighted graph Album Genres

[Figure: weighted graph of the genres Krautrock, Psychedelic Rock and Experimental Rock, with three edges of weight 1]
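A weighted genre graph of this kind can be built by counting how often two genres are listed on the same album. This is a sketch under that assumption. The album data is hypothetical apart from the Krautrock, Psychedelic Rock and Experimental Rock triple.

```python
from collections import Counter
from itertools import combinations

# Hypothetical albums; only the first genre triple appears on the slide.
album_genres = [
    ["Krautrock", "Psychedelic Rock", "Experimental Rock"],
    ["Psychedelic Rock", "Experimental Rock"],
    ["Krautrock"],
]

weights = Counter()
for genres in album_genres:
    # Sort so that (a, b) and (b, a) count as the same undirected edge.
    for pair in combinations(sorted(set(genres)), 2):
        weights[pair] += 1

for (a, b), w in weights.most_common():
    print(f"{a} - {b}: {w}")
```

A single album with three genres adds weight 1 to each of the three pairs. Higher weights accumulate as more albums share the same pair.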

Genre Affinity

[Figure: genre affinity graph linking Indie Rock, Shoegazing, Alternative Rock, Dream Pop and Post-rock, with edge weights 22, 25, 24 and 12]

Genre Affinity

[Figure: genre affinity graph linking Gothic Metal, Doom Metal, Black Metal, Heavy Metal and Stoner Metal, with edge weights 13, 34, 27 and 12]

Clustering in Networks

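Adjacency Matrix (Similarity Matrix)

     u1  u2  u3  u4  u5
u1    0   1   0   0   1
u2    1   0   1   1   0
u3    0   1   0   0   1
u4    0   1   0   0   1
u5    1   0   1   1   0

Degree Matrix

     u1  u2  u3  u4  u5
u1    2   0   0   0   0
u2    0   3   0   0   0
u3    0   0   2   0   0
u4    0   0   0   2   0
u5    0   0   0   0   3

[Figure: example graph with five nodes numbered 1 to 5]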

Clustering in Networks

Spectral Clustering: Using Eigenvectors of the Laplacian Matrix

Degree Matrix - Adjacency Matrix (Similarity Matrix) = Laplacian Matrix

Degree Matrix

     u1  u2  u3  u4  u5
u1    2   0   0   0   0
u2    0   3   0   0   0
u3    0   0   2   0   0
u4    0   0   0   2   0
u5    0   0   0   0   3

Adjacency Matrix (Similarity Matrix)

     u1  u2  u3  u4  u5
u1    0   1   0   0   1
u2    1   0   1   1   0
u3    0   1   0   0   1
u4    0   1   0   0   1
u5    1   0   1   1   0

Laplacian Matrix

     u1  u2  u3  u4  u5
u1    2  -1   0   0  -1
u2   -1   3  -1  -1   0
u3    0  -1   2   0  -1
u4    0  -1   0   2  -1
u5   -1   0  -1  -1   3
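The subtraction can be checked in a few lines of plain Python. The adjacency values below are read off the slide's example with 0 on the diagonal; the variable names are illustrative.

```python
# Adjacency matrix for u1..u5 (0 on the diagonal).
A = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]
n = len(A)

# Degree matrix: each node's degree (row sum of A) on the diagonal.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]

# Laplacian: L = D - A, element by element.
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]

for row in L:
    print(row)
```

Every row of the Laplacian sums to zero, which is a quick sanity check when building it by hand.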

Clustering in Networks

Eigenvector: a non-zero vector v whose direction does not change when it is multiplied by matrix A; the result is the same as multiplying v by a scalar λ (Av = λv)

u1 u2 u3 u4 u5

-0.7 0.3 -0.2 -0.1 0.7
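The defining property Av = λv can be verified numerically. A minimal sketch using NumPy (assumed to be installed) on the Laplacian of the five-node example. The eigenvector values it prints are not claimed to match the ones shown on the slide.

```python
import numpy as np

# Laplacian matrix of the five-node example (u1..u5).
L = np.array([
    [ 2, -1,  0,  0, -1],
    [-1,  3, -1, -1,  0],
    [ 0, -1,  2,  0, -1],
    [ 0, -1,  0,  2, -1],
    [-1,  0, -1, -1,  3],
], dtype=float)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Column k of `eigenvectors` pairs with eigenvalues[k]; take the largest one.
lam = eigenvalues[-1]
v = eigenvectors[:, -1]

# Multiplying by L only rescales v by lam; its direction is unchanged.
print(np.allclose(L @ v, lam * v))
print(np.round(eigenvalues, 6))
```

Spectral clustering then clusters the nodes using the entries of a few of these eigenvectors as coordinates.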

Spectral Clustering Codez

from sklearn.cluster import spectral_clustering
import numpy as np

A = np.zeros((n, n))  # n = number of nodes
# … build the symmetric adjacency (affinity) matrix
res = spectral_clustering(A, n_clusters=n_clusters)
# res -> array of cluster indices, one per node, e.g. [1, 1, 0, 5, …]

Spectral Clustering Results

Folk Rock Country Rock

Blues Folk

Country Americana Roots Rock Blues Rock

Southern Rock

Power Metal Progressive Metal Symphonic Metal Black Metal Melodic Death Metal Groove Metal Nu Metal Thrash Metal

Death Metal Metalcore Industrial Metal Gothic Metal Christian Metal Doom Metal Speed Metal

Alternative Rock Indie Rock

New Wave Synthpop

Electronica

Rock R&B Pop

Pop Rock Funk Soul

Heavy Metal Hard Rock

Alternative Metal

Intelligent Models

word2vec Model

Skip-gram: a proximity-based probability model trained using a (shallow) neural network

Pink Floyd were an English rock band formed in London
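"Proximity-based" means the model is trained on pairs of a word and the words near it. The sketch below generates such pairs for the example sentence. The window size of 2 and the choice to treat "Pink Floyd" as a single token are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


# "Pink Floyd" is kept as one token (an assumption about the tokenisation).
tokens = ["Pink Floyd", "were", "an", "English", "rock", "band", "formed", "in", "London"]
pairs = skipgram_pairs(tokens, window=2)
print([context for centre, context in pairs if centre == "rock"])
```

The skip-gram network is then trained to predict each context word from its centre word.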

word2vec Representation

[Figure: the words Pink Floyd, rock, band, formed and London shown as one-hot vectors such as 0000000100000; words such as rock and pop are mapped to similar dense vectors, e.g. (0.9, 0.1, 0.2, 0.4, 0.1, 0.1) and (0.8, 0.1, 0.1, 0.4, 0.1, 0.2)]
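The point of the dense representation is that related words end up close together, which one-hot vectors cannot express. A small sketch comparing the two with cosine similarity. Pairing the two dense vectors with "rock" and "pop" is an assumption about the figure, and the one-hot positions are illustrative.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors of two different words never overlap, so similarity is 0.
rock_one_hot = one_hot(7, 13)
pop_one_hot = one_hot(2, 13)
print(cosine(rock_one_hot, pop_one_hot))

# Dense vectors from the slide; pairing them with "rock" and "pop" is assumed.
rock_dense = [0.9, 0.1, 0.2, 0.4, 0.1, 0.1]
pop_dense = [0.8, 0.1, 0.1, 0.4, 0.1, 0.2]
print(round(cosine(rock_dense, pop_dense), 3))
```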

word2vec Demo

Album Genre Model

Fun Happy Saturday We Are Friends Electronic Frozen Blood In My Veins Redneck Dance Chaos and Mayhem Basement Dub

Sentiment Analysis in text

Predicting the genre based on name of the album

Deep Learning Basics

0) A method of supervised learning
1) Traditional Neural Networks with many layers
2) Often uses convolution as the node function
3) Training on Big Data can take weeks even on GPU
4) Huge success attributed to improved training, powerful computation and above all Big Data
5) Pooling, Dropout and local connections important

Deep Learning Topology

Deep Learning TensorFlow

“Wish you were here”

=> [123, 101, 42, 1969]

=> [123, 101, 42, 1969, 0, 0, 0, … 0]

=> [[100000000000], [000000010000], … ]

Rock

=> [0, 0, 0, 1, 0, 0, 0, 0]
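The encoding steps can be sketched in plain Python as below. This is not the speaker's TensorFlow code. The vocabulary, the genre list (arranged so that Rock sits at index 3), the fixed length of 10 and the handling of unknown words are all assumptions chosen to reproduce the numbers on the slide.

```python
# Hypothetical vocabulary; the indices match the example on the slide.
vocab = {"wish": 123, "you": 101, "were": 42, "here": 1969}
# Hypothetical genre list with "Rock" placed at index 3, as in the slide's label.
genres = ["Blues", "Folk", "Pop", "Rock", "Metal", "Punk", "Soul", "Country"]
MAX_WORDS = 10  # assumed fixed input length


def encode_title(title, max_words=MAX_WORDS):
    """Map words to indices, then pad with zeros to a fixed length."""
    # Unknown words map to 0, the same value as padding (an assumption).
    ids = [vocab.get(word, 0) for word in title.lower().split()][:max_words]
    return ids + [0] * (max_words - len(ids))


def encode_genre(genre):
    """One-hot label vector with a 1 at the genre's index."""
    return [1 if g == genre else 0 for g in genres]


print(encode_title("Wish you were here"))
print(encode_genre("Rock"))
```

Padding gives every album title the same length, so that titles can be batched into a single input tensor.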

Deep Learning Demo

Wrap-up

References

• All pictures from wikipedia.org used under Creative Commons
• Source of all data is from wikipedia.org collected online using a single call and then stored and processed
• Efficient Estimation of Word Representations in Vector Space. Mikolov et al. http://arxiv.org/abs/1301.3781
• Gensim's word2vec
• networkx lib
• word2vec blog post (500K docs): Five crazy abstractions my Deep Learning word2vec model just did
• word2vec on Rock music blog: Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus
• code for word2vec on wiki data
• Highcharts: highcharts
• word2vec paper: PDF
• Automatic real-time road marking recognition using a feature-driven approach PDF
• Video of the road marking recognition: here and here and here
• Future of Programming - Rise of the Scientific Programmer (and fall of the craftsman)
• Deep Learning articles
• code for Deep Learning genre analysis
• …