Apache Spark GraphX highlights.
Uploaded by doug-needham
Introduction
@dougneedham
Data Guy - Started as a DBA in the Marine Corps, evolved to Architect, now aspiring Data Scientist.
Oracle, SQL Server, Cassandra, Hadoop, MySQL, Spark.
I have a strong relational/traditional background.
Perpetual Student
Learning new things challenges our assumptions and forces us to take a new perspective on “old” problems. Eventually it may even show us that there is a better way to solve a problem.
Graphs: What problems do they solve?
Solving Crime
Customers/Products
Some examples: Introduction to Graph_Theory
There are many ways of constructing networks, and how exactly you construct them depends on the questions you are posing.
Economics: You don’t participate in an economy by yourself; you make purchases from others. Record enough transactions and you have a graph.
Almost anything can be modeled as a graph. However, it does require a slight shift in thinking.
One of the most used examples is a citation network for academic publications.
I publish a paper, then you cite my paper in your publication.
Following the citations back through the tree shows which paper ultimately had the largest influence.
A little History
The 7 Bridges of Königsberg
Every tome on graph theory or network analysis devotes a small portion of its time to the 7 Bridges of Königsberg.
If I don’t cover this with you, the gods of mathematics will strike me down, and never allow me to do analysis again in the future.
The Problem
Folks enjoyed their Sunday afternoon strolls across the bridges, and people would wonder whether there was a route that crossed every bridge exactly once.
Eventually Leonhard Euler was brought into the debate about the problem.
Euler used vertices to represent the land masses and edges (or arcs, at the time) to represent bridges. He realized that because every land mass touched an odd number of bridges, no such route could exist.
Sarada Herke provides one of the best explanations of the solution: Solution to Königsberg
And here is the cool thing about mathematicians. If we tell you something is impossible, we have to tell you why in a way you can understand it. In doing so, Euler also founded the branch of mathematics we now call Graph Theory.
http://en.wikipedia.org/wiki/Leonhard_Euler
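Euler's degree argument can be checked in a few lines. The following is a minimal Scala sketch, not anything from the talk: the labels A to D for the land masses and the object name `Konigsberg` are assumptions made for the example. It counts how many bridges touch each land mass and applies the rule that a connected graph has a walk using every edge exactly once only when zero or two vertices have odd degree.

```scala
object Konigsberg {
  // Four land masses (vertices) and seven bridges (edges).
  // Labels are illustrative: A = central island, B = north bank,
  // C = south bank, D = eastern land mass.
  val bridges: Seq[(String, String)] = Seq(
    "A" -> "B", "A" -> "B",
    "A" -> "C", "A" -> "C",
    "A" -> "D", "B" -> "D", "C" -> "D"
  )

  // Degree of a vertex = number of bridge endpoints that touch it.
  def degrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.flatMap { case (a, b) => Seq(a, b) }
      .groupBy(identity)
      .map { case (v, occurrences) => v -> occurrences.size }

  // Degree test only; it assumes the graph is connected and does not check that.
  def hasEulerWalk(edges: Seq[(String, String)]): Boolean = {
    val odd = degrees(edges).values.count(_ % 2 == 1)
    odd == 0 || odd == 2
  }

  def main(args: Array[String]): Unit = {
    println(degrees(bridges))
    println(hasEulerWalk(bridges))
  }
}
```

All four land masses come out with an odd degree, which is why the test fails for Königsberg.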
A few terms
Stand back, we are going to talk about math!
Basically we are talking about a bunch of dots joined together by lines
Vertex – Dot on a graph
Edge – Line connecting two vertices
Edge_Label – this is a term I coined originally related to Data Structure Graphs that helps trace a path. If you label your edges, and you have multiple edges with the same label in a Graph you can quite easily identify walks, paths, and cycles through your graph.
Triangle – 3 Vertices, 3 Edges
Square – 4 Vertices, 4 edges
Open Triangle – 3 Vertices, 2 edges
A lot of things are networks if you look at them the right way.
Mark Newman has done a number of really cool presentations about network analysis, available on YouTube.
https://www.youtube.com/watch?v=lETt7IcDWLI
More terms
Shortest path – How are two vertices connected?
Longest Path – Tracing the flow of an interesting item through a large collection of applications.
What is a path?
Centrality – Hub and Authority
This is almost a whole topic by itself, since there are different types of Centrality:
Degree Centrality, Eigenvector Centrality, PageRank, etc…
Transitivity
Homophily – how things are similar
Directed Graphs – or Digraphs
Contagion – How do things “spread” through a network?
Let’s rearrange things, how does the layout affect understanding?
Order of a graph – number of vertices
Size of the graph – number of edges
This is not just data visualization, it can also be used for prediction. https://www.youtube.com/watch?v=rwA-y-XwjuU
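To make a few of these terms concrete, here is a small Scala sketch on a made-up four-vertex graph. The object name `GraphTerms` and the edge list are assumptions for illustration only. It computes the order, the size, and the degree of every vertex (the raw count behind degree centrality).

```scala
object GraphTerms {
  // A tiny undirected graph: a triangle (1, 2, 3) plus vertex 4 hanging off vertex 3.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (1, 3), (3, 4))

  val vertices: Set[Int] = edges.flatMap { case (a, b) => Seq(a, b) }.toSet

  val order: Int = vertices.size // order of the graph: number of vertices
  val size: Int = edges.size     // size of the graph: number of edges

  // Degree of each vertex: how many edges touch it.
  val degree: Map[Int, Int] =
    vertices.map(v => v -> edges.count { case (a, b) => a == v || b == v }).toMap

  def main(args: Array[String]): Unit = {
    println(s"order = $order, size = $size")
    println(degree)
  }
}
```

Vertex 3 ends up with the highest degree because it belongs to the triangle and also carries the extra edge to vertex 4.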
Samples
Some Samples from Wiki.
On the right, a basic graph; on the left, the languages used in Wikipedia.
Little sidebar - Paths
Now that we have some terms under our belt:
What is the difference between shortest path and longest path?
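For the shortest-path half of that question, a breadth-first search is the usual starting point on an unweighted graph. The sketch below is one possible plain-Scala version, with a hypothetical five-vertex graph and an assumed helper name `hops`; it is not taken from the talk's code.

```scala
import scala.collection.immutable.Queue

object ShortestPath {
  // Hypothetical undirected graph stored as an adjacency list.
  val neighbours: Map[String, Set[String]] = Map(
    "a" -> Set("b", "c"),
    "b" -> Set("a", "d"),
    "c" -> Set("a", "d"),
    "d" -> Set("b", "c", "e"),
    "e" -> Set("d")
  )

  // Breadth-first search: number of edges on a shortest path from `source`
  // to every vertex reachable from it. Unreachable vertices are absent.
  def hops(graph: Map[String, Set[String]], source: String): Map[String, Int] = {
    @annotation.tailrec
    def loop(queue: Queue[String], dist: Map[String, Int]): Map[String, Int] =
      queue.dequeueOption match {
        case None => dist
        case Some((v, rest)) =>
          // Neighbours not seen before are one hop further away than v.
          val fresh = graph.getOrElse(v, Set.empty[String]).filterNot(dist.contains)
          loop(rest ++ fresh, dist ++ fresh.map(_ -> (dist(v) + 1)))
      }
    loop(Queue(source), Map(source -> 0))
  }

  def main(args: Array[String]): Unit =
    println(hops(neighbours, "a"))
}
```

The same function works unchanged on a graph of any size that fits in memory, which is the point of the next slide.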
The Math doesn’t change.
One thing I like about Graphs –
The Math does not change.
The math behind Graph theory can be a little intense, but it does not change regardless of the scale of the graph.
Once you understand how to “do the math” on a small graph, the same math applies whether it is a graph of the people in this room or a graph of the people on this planet.
Small Graphs
What is a small graph?
Friends on Facebook or LinkedIn.
Usually this can be displayed and analyzed rather easily.
If the Graph continues to grow, you need better tools.
Let’s do a quick demo of a small graph visualization.
Gephi http://gephi.github.io/
From the website: “Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.”
To get this yourself, go into Facebook and search for Netvizz. (You have to authorize it. You can de-authorize it later.)
Click the application.
Click “personal network”
Click Start
Download your gdf file
Quick Demo – ( Vote time: If everyone is comfortable with general graphs we can come back to this.)
Large Graphs
What is a large graph?
To me a large graph is one that cannot be easily visualized by software such as Gephi.
You have to use large tools to calculate the important statistics, such as centrality, diameter, average degree, etc…
Breaking a large graph down to a small graph is actually not as simple as it sounds.
However, it can be done reasonably easily with tools such as GraphX.
Now what we all came for:
GraphX
GraphX is Apache Spark's API for graphs and graph-parallel computation.
https://spark.apache.org/graphx/
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
While GraphX is “just a library”, it is a library that exists within the Spark environment, which provides a whole host of benefits like scaling, clustering, storage, and other things that you don’t have to dwell on.
As of right now, GraphX is Scala only.
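To give a feel for the API, here is a minimal sketch of building a small directed graph in GraphX and asking it for a few statistics. It assumes Spark and GraphX are on the classpath and runs Spark in local mode; the user names, the object name `FollowGraph`, and the PageRank tolerance of 0.001 are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object FollowGraph {
  // Builds a small directed "who follows whom" graph; the data is invented.
  def build(sc: SparkContext): Graph[String, String] = {
    val users: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "richie"), (2L, "fonzie"), (3L, "potsie"), (4L, "ralph")))
    val follows: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows"),
      Edge(4L, 2L, "follows"), Edge(3L, 1L, "follows")))
    Graph(users, follows)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FollowGraph").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      println(s"order = ${graph.numVertices}, size = ${graph.numEdges}")
      // inDegrees only lists vertices that have at least one incoming edge.
      graph.inDegrees.collect().sortBy(_._1).foreach(println)
      // PageRank, iterated until ranks change by less than the tolerance.
      graph.pageRank(0.001).vertices.collect().sortBy(-_._2).foreach(println)
    } finally sc.stop()
  }
}
```

The vertex and edge collections are ordinary RDDs, so the same code scales from this toy graph to one spread across a cluster.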
Data Science Challenge
Who should Follow whom?
Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.
Problem 3 of the data science challenge was a graph analysis problem.
Derive the top 70,000 connections that should be recommended.
My approach
Type of problem: Graph Analysis
Create a Master Graph.
Run PageRank to identify centrality.
Create many small graphs for individual users.
Mask the Master Graph, and PageRank Graph.
Multiply together the centrality, the number of in-degrees of a candidate to be followed, and the inverse of the length of the path from this particular user to that candidate vertex.
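The scoring in the last step can be sketched without Spark. The following is a plain-Scala illustration under stated assumptions: the `Candidate` case class, its field names, and the sample numbers are hypothetical, and the filter on path length is a simplification rather than the exact logic of the challenge code.

```scala
object FollowScore {
  // Hypothetical candidate record; field names are illustrative.
  final case class Candidate(id: Long, pageRank: Double, inDegree: Int, pathLength: Double)

  // Score = centrality * in-degree * (1 / path length).
  def score(c: Candidate): Double = c.pageRank * c.inDegree * (1.0 / c.pathLength)

  // Keep only candidates at a finite, non-zero distance, best score first.
  def rank(cs: Seq[Candidate]): Seq[Long] =
    cs.filter(c => c.pathLength > 0 && !c.pathLength.isInfinity)
      .sortBy(c => -score(c))
      .map(_.id)

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      Candidate(1L, 0.5, 10, 2.0),
      Candidate(2L, 0.2, 40, 4.0),
      Candidate(3L, 0.9, 5, Double.PositiveInfinity) // no path to this one
    )
    println(rank(candidates))
  }
}
```

A popular but distant candidate can lose to a less popular one that is closer, which is the trade-off the product is meant to capture.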
This code takes over 48 hours to run.
Code: Problem3.sh, and AnalyzeGraph.scala
Now we will review GitHub: https://github.com/dougneedham/Cloudera-Data-Scientist-Challenge-3/tree/master/problem3
Snapshot of code:
// Restrict the shortest-path graph to the vertices and edges that also appear in ClickPairGraph
val PathGraph = ShortestPathFromSource.mask(ClickPairGraph)
// Weight each vertex: in-degree divided by its path length from the source user
val Influence = PathGraph.joinVertices(MasterGraph.inDegrees)((id, pathlength, indeg) => (1 / pathlength) * indeg)
// Multiply in the PageRank centrality of each vertex
val central_influence = Influence.joinVertices(MaskedMasterGraphPR.vertices)((id, dist, pagerank) => dist * pagerank)
//
// We want to eliminate the infinite values and follow someone that there is in fact a path to
//
val finite_influence = central_influence.vertices.filter(_._2 < Double.PositiveInfinity)
println("Processing " + finite_influence.count())
//finite_influence.collect().foreach(record_to_list(SourceID, _))
val save_file_name = base_path + "/problem3/OutGraph/" + SourceID.trim() + ".data"
finite_influence.saveAsTextFile(save_file_name)
Expectations
This is where we tie together the “small graphs” versus “big graphs”.
Creating a Sub-graph of a larger graph is not obvious.
I was expecting to see one big clump of nodes tightly connected. This would be the “Target” to follow.
I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”. As we recommend that they follow the more popular node, they will become more closely connected to this user.
Here is the output from Gephi that shows whether the code worked or not.
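For readers wondering what carving a sub-graph out of a larger graph looks like in GraphX, here is a minimal sketch using `subgraph` and `mask`. It assumes Spark and GraphX are on the classpath. The object name `SubGraphSketch`, the helper `neighbourhood`, the four-vertex data, and the hand-written hop distances are all assumptions for the example; the real challenge code computes its distances rather than hard-coding them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SubGraphSketch {
  // `distances` carries, per vertex, the hop distance from some source user
  // (Double.PositiveInfinity when unreachable). Keep the vertices within
  // `maxHops`, then restrict `master` to that neighbourhood.
  def neighbourhood(master: Graph[String, Int],
                    distances: Graph[Double, Int],
                    maxHops: Double): Graph[String, Int] = {
    val near = distances.subgraph(vpred = (_, dist) => dist <= maxHops)
    master.mask(near)
  }

  // Builds a four-vertex chain 1 -> 2 -> 3 -> 4 and keeps everything within 2 hops of vertex 1.
  def demo(sc: SparkContext): Graph[String, Int] = {
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    val master = Graph(sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"))), edges)
    // Hop distances from vertex 1, written by hand for this sketch.
    val distances = Graph(sc.parallelize(Seq((1L, 0.0), (2L, 1.0), (3L, 2.0), (4L, 3.0))), edges)
    neighbourhood(master, distances, 2.0)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SubGraphSketch").setMaster("local[*]"))
    try demo(sc).vertices.collect().sortBy(_._1).foreach(println)
    finally sc.stop()
  }
}
```

A sub-graph of this size is small enough to export and inspect in a tool such as Gephi.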
Where do I get data?
How you construct the network depends on the question(s) you are posing.
Chances are you have lots of data already, it is simply a matter of perspective.
Apply graphs to your own company's architecture
Public social network data
The example mentioned from Gephi (netvizz)
Data Structure Graphs
A DSG Level 1 can show you where you are going to have the most interesting query performance of your tables.
A DSG Level 2 can show you where the most work is going on in your Enterprise.
Data Structure Graph Level 1 – This is roughly like an Entity Relationship Diagram (ERD) Tables are Vertices, Foreign Keys are Edges.
Data Structure Graph Level 2 – Each Vertex in this graph is an application. Each Edge is data transfer. Roughly equivalent to what we used to call Data Flow diagrams.
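A Level 1 Data Structure Graph can be sketched from nothing more than a list of foreign keys. The example below is a plain-Scala illustration with an invented schema; the table names and the object name `DataStructureGraph` are assumptions. It treats tables as vertices and foreign keys as directed edges, then counts how many foreign keys point at each table.

```scala
object DataStructureGraph {
  // Hypothetical schema: each pair is (child table, parent table) for one foreign key.
  val foreignKeys: Seq[(String, String)] = Seq(
    "order_line" -> "orders",
    "order_line" -> "product",
    "orders" -> "customer",
    "invoice" -> "orders",
    "invoice" -> "customer"
  )

  // In-degree of each table: how many foreign keys point at it.
  val referencedBy: Map[String, Int] =
    foreignKeys.groupBy(_._2).map { case (table, fks) => table -> fks.size }

  // Tables sorted from most to least referenced (ties broken by name).
  val hotspots: Seq[(String, Int)] =
    referencedBy.toSeq.sortBy { case (table, n) => (-n, table) }

  def main(args: Array[String]): Unit =
    hotspots.foreach { case (table, n) => println(s"$table is referenced by $n foreign key(s)") }
}
```

The most-referenced tables are natural places to start looking when query performance gets interesting.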
SNAP
SNAP – Stanford Network Analysis Project.
If you want to learn about how to do Network Analysis and you can’t find any data, go here.
Consider the following:
Network/Graph Analysis is cool.
It can show you some interesting things about your data that you may not have considered.
Due thought should be put towards a network analysis project.
Organizing the data requires a bit of thought. (From -> To vertices is just a start).
Directed graph, undirected, bigraph? Some up-front setup work needs to be done.
Tools help with the detailed calculations, and show the paths, walks, etc.
If you need assistance, send a message to the group, or contact me directly (I am easy to find @dougneedham)