From Power Chord to the Power of Models - Oredev
From Power Chords to the Power of Models
Ali Kheyrollahi (@aliostad)
> stackoverflow
> £1.5 bln
global fashion destination
> 35% every year
Local pop music “Cheelee pom!”
Boney M “Rasputin”
Blondie “Heart of Glass”
Infobox
Free text
Links
Data Acquisition
Data Source - Wiki
4,990,279 English Articles
37,583,879 Articles
Data Source - Wiki vs Britannica
Feng Zhu (assistant prof at Harvard):
“There has been lots of research on the accuracy of Wikipedia, and the results are mixed—some studies show it is just as good as the experts, others show [that] Wikipedia is not accurate at all.”
“… the editors [of Britannica] are still not found to be more objective than the crowd in articles that are sufficiently revised.”
Data Source - Wikipedia in scholar papers
[Chart: Wikipedia in scholarly papers per year, 2005 to 2014; y-axis from 0 to 180,000. Source: Google Scholar]
Data Acquisition - Wiki
[Diagram: List of Rock Genres → Rock Genres → Rock Artists. At each step Python scripts store the HTML and capture links; storage is Postgres.]
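The store-and-capture step can be sketched roughly as follows. This is only an illustration under assumptions: `sqlite3` stands in for Postgres so the example runs without a server, the page body is a made-up inline snippet rather than a downloaded article, and `LinkCollector` and `store_page` are hypothetical names, not the scripts used for the talk.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that point at wiki articles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep article links only; skip in-page anchors and external links.
        if href and href.startswith("/wiki/"):
            self.links.append(href)


def store_page(conn, url, html):
    """Store the raw HTML, then return the wiki links captured from it."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
    )
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Assumption: sqlite3 replaces Postgres to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")

# Hypothetical page body; a real run would download it first.
sample_html = (
    '<ul><li><a href="/wiki/Krautrock">Krautrock</a></li>'
    '<li><a href="#top">top</a></li></ul>'
)
links = store_page(conn, "/wiki/List_of_rock_genres", sample_html)
stored = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(links, stored)
```

Storing the raw HTML before parsing means the parsing rules can be changed later without downloading the pages again.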
Data Source - Content vs. Data
Hyphen U+002D
figure dash U+2012
minus sign U+2212
em dash U+2014
en dash U+2013
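Look-alike dash characters make otherwise identical strings (genre names, for example) compare as different. One possible clean-up, shown as a sketch, maps them all to the plain hyphen. The function name `normalize_dashes` and the sample string are illustrative assumptions.

```python
# Map dash-like code points to the plain hyphen-minus (U+002D).
DASHES = {
    0x2012: "-",  # figure dash
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
    0x2015: "-",  # horizontal bar
    0x2212: "-",  # minus sign
}


def normalize_dashes(text):
    """Return text with every dash variant replaced by U+002D."""
    return text.translate(DASHES)


# Hypothetical input mixing three different dash characters.
raw = "Post\u2013rock, Art\u2014rock, Folk\u2012rock"
clean = normalize_dashes(raw)
print(clean)
```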
Data Exploration
“I personally … literally just look at the screen, just like the matrix”
Claudia Perlich, multi-award winner Data Scientist
“… the dirty little secret that I have won all of them because I have found something wrong with the data… I would like to play around with dataset and get intimately familiar with dataset and its properties.”
Claudia Perlich
Album Genre
http://wiki-rock.azurewebsites.net/top10-album-genres.html
Data Models
Data Models Model?!
Data Models Model
Mathematical representation of a concept based on parameters that impact that concept
• Rating of a native app
• Stackoverflow score
• Credit score
• Fraud check
“All models are wrong… but some are useful.”
George Box
Data Models Graph 101
Social Network Analysis and Graph Theory
• Nodes/vertices and edges/lines
• Directedness:
  • Directed
  • Undirected
• Degree, InDegree/OutDegree
• Weight
[Diagram: two nodes, A and B]
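The terms above can be made concrete with a tiny directed, weighted graph in plain Python. The edges and weights are hypothetical and only serve to show how degree, in-degree and out-degree relate.

```python
from collections import defaultdict

# Hypothetical directed, weighted edges: (source, target, weight).
edges = [("A", "B", 1.0), ("A", "C", 2.0), ("C", "B", 1.0)]

out_degree = defaultdict(int)
in_degree = defaultdict(int)
for src, dst, _weight in edges:
    out_degree[src] += 1  # edges leaving the node
    in_degree[dst] += 1   # edges arriving at the node

nodes = {n for src, dst, _ in edges for n in (src, dst)}
# Degree counts edges regardless of direction.
degree = {n: in_degree[n] + out_degree[n] for n in nodes}
print(degree)
```

In an undirected graph there is no in/out distinction, so only the plain degree is defined.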
Data Models Centrality
[Figure: example graph; node values 12, 4, 2, 2, 1 (Degree). Same degree, different betweenness.]
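To illustrate "same degree, different betweenness", here is a sketch of Brandes' algorithm for an unweighted, undirected graph in plain Python. The five-node graph is a made-up example, not the one on the slide; networkx offers the same measure as `betweenness_centrality`.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm, unweighted undirected graph, unnormalized."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
result = betweenness(adj)
print(len(adj["a"]), len(adj["d"]), result["a"], result["d"])
```

Nodes `a` and `d` both have degree 2, yet only `d` sits on shortest paths between other nodes.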
Graph Codez
import networkx as nx

g = nx.Graph()  # undirected; use nx.DiGraph() for a directed graph
g.add_edge('a', 'b')
g.add_edge('b', 'c')
# … more edges
print(len(g['b']))  # degree of node 'b'
c = nx.betweenness_centrality(g, normalized=True)
# c -> dictionary of node names and their score
Modelling Influence using Wiki
Data Models Cited Influence
[Diagram: Howlin’ Wolf (1940) → Captain Beefheart (1964)]
Most influential Rock Artists Based on out-degree
The Beatles => 188
Black Sabbath => 127
Led Zeppelin => 118
Jimi Hendrix => 114
Bob Dylan => 94
Pink Floyd => 86
Iron Maiden => 77
Metallica => 77
The Rolling Stones => 66
The Beach Boys => 65
Neil Young => 63
Nirvana => 62
Slayer => 60
Queen => 59
Most influential Rock Artists Based on Betweenness Centrality
Jimi Hendrix => 53476.2014921
The Beatles => 47511.7957531
Bob Dylan => 38107.0298185
Led Zeppelin => 32701.7223273
Nirvana => 29733.9066836
Metallica => 29356.6009213
Queen => 28989.2844223
Robert Smith => 28880.670718
Elvis Presley => 28463.2891497
Slade => 27656.487307
Iron Maiden => 22449.6697023
Ramones => 22437.6112965
Rush => 21125.9481602
Neil Young => 19913.887522
Most influential Artists Based on Betweenness Centrality

Heavy Metal
Metallica => 566.06
Iron Maiden => 419.21
Corey Taylor => 146.0
Led Zeppelin => 122.73
Slipknot => 116.58
King Diamond => 94.7
Machine Head => 85.12
Rush => 70.41
Black Sabbath => 68.0
Van Halen => 54.56
Deep Purple => 53.5
Megadeth => 42.63
Guns N' Roses => 24.25

Alternative Rock
Nirvana => 490.08
Muse => 114.5
Weezer => 97.33
Pixies => 94.17
Sonic Youth => 78.5
Rivers Cuomo => 69.5
Siouxsie and the Banshees => 51.67
The Smiths => 51.5
Jeff Buckley => 46.17
The Offspring => 43.0
Placebo => 42.0
My Chemical Romance => 34.0
The Smashing Pumpkins => 32.33

Progressive Rock
Rush => 54.0
Marillion => 34.0
Pink Floyd => 33.0
Yes => 20.0
Porcupine Tree => 19.5
Dream Theater => 19.0
Chris Squire => 16.5
Primus => 15.0
Tool => 12.0
Mahavishnu Orchestra => 8.0
Geddy Lee => 7.0
Neil Peart => 5.0
Keith Emerson => 5.0
Data Models PageRank
The Beatles => 0.00837723421839
Blind Lemon Jefferson => 0.00837369035189
Josh White => 0.00824945015047
Bessie Smith => 0.00717743996144
Louis Armstrong => 0.00692897940193
James P. Johnson => 0.00628676810257
Little Richard => 0.00584677302727
Muddy Waters => 0.005773172933
Tampa Red => 0.00572032424174
Robert Johnson => 0.00523579252974
Big Bill Broonzy => 0.00516075834679
Moon Mullican => 0.0050657751593
Black Sabbath => 0.00498789229732
Elvis Presley => 0.00497932058047
Duke Ellington => 0.00465800760107
Bo Diddley => 0.0044496675634
Jimmy Page => 0.00437658472459
Frank Zappa => 0.00431978608953
Miles Davis => 0.00396303890974
Jimi Hendrix => 0.00391117233916
Sister Rosetta Tharpe => 0.00390833570401
Bing Crosby => 0.00385435213525
Bob Dylan => 0.00358608821536
James Brown => 0.00349870931123
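How scores like these might be produced can be sketched with a plain power-iteration PageRank. Everything here is illustrative: the artist names are placeholders, the damping factor of 0.85 is the commonly used default rather than a value from the talk, and the edge direction (an artist points at the artists cited as influences) is an assumption.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; out_links maps node -> list of targets."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: each artist points at the artists cited as influences.
influences = {
    "band_x": ["pioneer"],
    "band_y": ["pioneer", "band_x"],
    "pioneer": [],
}
ranks = pagerank(influences)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Unlike out-degree, this rewards being cited by artists who are themselves highly cited, which may explain why early blues artists rank so high in the list above.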
Other Models
Weighted graph - Album Genres
[Diagram: Krautrock, Psychedelic Rock and Experimental Rock joined by edges, each with weight 1]
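One way to build such a weighted genre graph is to add one unit of weight to every pair of genres that appear on the same album. This is a sketch under that assumption; the album identifiers and the second album are hypothetical, and only the first genre triple comes from the slide.

```python
from collections import Counter
from itertools import combinations

# Hypothetical album -> genres data; the first entry mirrors the slide.
album_genres = {
    "album_1": ["Krautrock", "Psychedelic Rock", "Experimental Rock"],
    "album_2": ["Krautrock", "Experimental Rock"],
}

weights = Counter()
for genres in album_genres.values():
    # Every pair of genres on the same album gains one unit of weight.
    # Sorting makes (a, b) and (b, a) land on the same key.
    for pair in combinations(sorted(set(genres)), 2):
        weights[pair] += 1

for (a, b), w in weights.most_common():
    print(a, "-", b, "=>", w)
```

Summed over many albums, these weights give a genre-to-genre affinity measure that can feed clustering.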
Genre Affinity
[Diagram: genre affinity graph linking Indie Rock, Shoegazing, Alternative Rock, Dream Pop and Post-rock; edge weights 22, 25, 24, 12]
Genre Affinity
[Diagram: genre affinity graph linking Gothic Metal, Doom Metal, Black Metal, Heavy Metal and Stoner Metal; edge weights 13, 34, 27, 12]
Clustering in Networks
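Adjacency Matrix (Similarity Matrix)
      u1  u2  u3  u4  u5
u1     0   1   0   0   1
u2     1   0   1   1   0
u3     0   1   0   0   1
u4     0   1   0   0   1
u5     1   0   1   1   0

Degree Matrix
      u1  u2  u3  u4  u5
u1     2   0   0   0   0
u2     0   3   0   0   0
u3     0   0   2   0   0
u4     0   0   0   2   0
u5     0   0   0   0   3

[Diagram: graph with five nodes labelled 1 to 5]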
Spectral Clustering: Using Eigenvectors of the Laplacian Matrix
Laplacian Matrix = Degree Matrix − Adjacency Matrix (Similarity Matrix)

Laplacian Matrix
      u1  u2  u3  u4  u5
u1     2  -1   0   0  -1
u2    -1   3  -1  -1   0
u3     0  -1   2   0  -1
u4     0  -1   0   2  -1
u5    -1   0  -1  -1   3
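The subtraction can be checked in a few lines of plain Python. The adjacency matrix is the five-node example from the slide with the diagonal zeros written out; the variable names are illustrative.

```python
# Adjacency matrix for nodes u1..u5 (diagonal is zero: no self-loops).
A = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]
n = len(A)

# The degree of each node is the sum of its adjacency row.
degrees = [sum(row) for row in A]
D = [[degrees[i] if i == j else 0 for j in range(n)] for i in range(n)]

# Laplacian = Degree matrix - Adjacency matrix, element by element.
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
for row in L:
    print(row)
```

Every row of the Laplacian sums to zero, which is a quick sanity check after building it.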
Eigenvector: a nonzero vector v whose direction does not change when it is multiplied by matrix A; the result is the same as multiplying v by a scalar λ (Av = λv)
  u1    u2    u3    u4    u5
-0.7   0.3  -0.2  -0.1   0.7
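The definition Av = λv can be verified directly. The sketch below uses the Laplacian of the five-node example as the matrix; the two test vectors were picked for illustration and are not the values shown on the slide.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def is_eigenvector(M, v, lam, tol=1e-9):
    """True if M v equals lam * v component by component."""
    return all(abs(a - lam * x) <= tol for a, x in zip(matvec(M, v), v))


# Laplacian of the five-node example graph.
L = [
    [2, -1, 0, 0, -1],
    [-1, 3, -1, -1, 0],
    [0, -1, 2, 0, -1],
    [0, -1, 0, 2, -1],
    [-1, 0, -1, -1, 3],
]

ones = [1, 1, 1, 1, 1]  # rows of a Laplacian sum to zero, so lambda = 0
v = [0, 1, 0, 0, -1]    # illustrative vector separating u2 from u5

print(is_eigenvector(L, ones, 0), is_eigenvector(L, v, 3))
```

Spectral clustering groups nodes by the signs and magnitudes of such eigenvector entries, usually those belonging to the smallest nonzero eigenvalues.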
Spectral Clustering Codez
from sklearn.cluster import spectral_clustering
import numpy as np

# n: sequence of nodes; n_clusters: number of clusters to find
A = [[0.0 for _ in n] for _ in n]
# … build adjacency matrix
res = spectral_clustering(np.array(A), n_clusters=n_clusters)
# res -> array of cluster indices, e.g. [1, 1, 0, 5, …]
Spectral Clustering Results
Folk Rock Country Rock
Blues Folk
Country Americana Roots Rock Blues Rock
Southern Rock
Power Metal Progressive Metal Symphonic Metal Black Metal Melodic Death Metal Groove Metal Nu Metal Thrash Metal
Death Metal Metalcore Industrial Metal Gothic Metal Christian Metal Doom Metal Speed Metal
Alternative Rock Indie Rock
New Wave Synthpop
Electronica
Rock R&B Pop
Pop Rock Funk Soul
Heavy Metal Hard Rock
Alternative Metal
Intelligent Models
word2vec Model
Skip-gram: a proximity-based probability model trained using Neural Networks (Deep Learning)
Pink Floyd were an English rock band formed in London
[Slide overlay: X marks over three of the words]
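"Proximity-based" means the model is trained on (target, context) word pairs taken from a sliding window. The pair generation can be sketched as follows; the token list assumes that stop words were dropped and that "Pink Floyd" is treated as a single token, and the window size of 1 is an arbitrary choice for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Assumed tokenisation of the example sentence.
tokens = ["Pink Floyd", "rock", "band", "formed", "London"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

The network is then trained to predict the context word from the target word, and the learned hidden-layer weights become the word vectors.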
word2vec Representation
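[Figure: one-hot input vectors (for example 0000000100000, 0010000000000, 0000000010000, 0000000000010, 1000000000000) for the words rock, Pink Floyd, band, formed and London, next to two dense vectors, (0.9, 0.1, 0.2, 0.4, 0.1, 0.1) and (0.8, 0.1, 0.1, 0.4, 0.1, 0.2), and the word pop]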
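The point of moving from one-hot to dense vectors can be shown with cosine similarity. In this sketch the vocabulary order is arbitrary, and pairing the two dense vectors from the slide with particular words is left open, since the extracted text does not say which words they belong to.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative vocabulary; the index order is arbitrary.
vocab = ["rock", "Pink Floyd", "band", "formed", "London", "pop"]
rock_1h = one_hot(vocab.index("rock"), len(vocab))
pop_1h = one_hot(vocab.index("pop"), len(vocab))

# The two dense vectors as they appear on the slide.
dense_a = [0.9, 0.1, 0.2, 0.4, 0.1, 0.1]
dense_b = [0.8, 0.1, 0.1, 0.4, 0.1, 0.2]

print(cosine(rock_1h, pop_1h), round(cosine(dense_a, dense_b), 3))
```

One-hot vectors of different words are always orthogonal, so they carry no notion of similarity; dense vectors can place related words close together.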
word2vec Demo
Album Genre Model
Fun Happy Saturday We Are Friends Electronic Frozen Blood In My Veins Redneck Dance Chaos and Mayhem Basement Dub
Sentiment Analysis in text
Predicting the genre based on name of the album
Deep Learning Basics
0) A method of supervised learning
1) Traditional Neural Networks with many layers
2) Often uses convolution as the node function
3) Training on Big Data can take weeks even on GPU
4) Huge success attributed to improved training, powerful computation and above all Big Data
5) Pooling, Dropout and local connections important
Deep Learning Topology
Deep Learning TensorFlow
“Wish you were here”
=> [123, 101, 42, 1969]
=> [123, 101, 42, 1969, 0, 0, 0, … 0]
Rock
=> [0, 0, 0, 1, 0, 0, 0, 0]
=> [[100000000000], [000000010000], … ]
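The encoding steps on this slide can be sketched in plain Python, without TensorFlow. The word ids are the ones shown on the slide, but their assignment to individual words, the fixed length of 8, and the genre list (apart from Rock at index 3) are assumptions made for the example.

```python
def encode_title(title, vocab, max_len):
    """Map words to integer ids, then pad with zeros to a fixed length."""
    # Assumption: 0 doubles as the padding id and the unknown-word id.
    ids = [vocab.get(word, 0) for word in title.lower().split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


def one_hot_label(label, labels):
    """Encode a class label as a one-hot vector over `labels`."""
    return [1 if name == label else 0 for name in labels]


# Ids from the slide; which id belongs to which word is assumed.
vocab = {"wish": 123, "you": 101, "were": 42, "here": 1969}
# Hypothetical genre list with Rock at index 3, as in the slide's vector.
genres = ["Blues", "Folk", "Metal", "Rock", "Pop", "Punk", "Soul", "Jazz"]

x = encode_title("Wish you were here", vocab, max_len=8)
y = one_hot_label("Rock", genres)
print(x, y)
```

Padding to a fixed length lets titles of different lengths be stacked into one input tensor for training.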
Deep Learning Demo
Wrap-up
References
• All pictures from wikipedia.org used under Creative Commons
• All data is sourced from wikipedia.org, collected online using a single call and then stored and processed
• Efficient Estimation of Word Representations in Vector Space. Mikolov et al. http://arxiv.org/abs/1301.3781
• Gensim's word2vec
• networkx lib
• word2vec blog post (500K docs): Five crazy abstractions my Deep Learning word2vec model just did
• word2vec on Rock music blog: Daft Punk+Tool=Muse: word2vec model trained on a small Rock music corpus
• code for word2vec on wiki data
• Highcharts: highcharts
• word2vec paper: PDF
• Automatic real-time road marking recognition using a feature-driven approach PDF
• Video of the road marking recognition: here and here and here
• Future of Programming - Rise of the Scientific Programmer (and fall of the craftsman)
• Deep Learning articles
• code for Deep Learning genre analysis
• …