Subgraph Isomorphism Graph Challenge - draft - (2017/02/09)

Siddharth Samsi, Vijay Gadepally, Jeremy Kepner, Albert Reuther

http://GraphChallenge.org


Outline

•  Introduction

•  Data Sets

•  Static Graph Isomorphism Challenge

•  Metrics

•  Summary


Introduction

•  Previous challenges in machine learning, High Performance Computing and visual analytics include –  YOHO, MNIST, HPC Challenge, ImageNet, VAST

•  GraphChallenge encourages community approaches, such as DARPA HIVE, to develop new solutions for analyzing graphs derived from social media, sensor feeds, and scientific data to enable relationships between events to be discovered as they unfold in the field

•  GraphChallenge organizers will provide specifications, data sets, data generators, and serial implementations in various languages

•  GraphChallenge participants are encouraged to apply innovative hardware, software, and algorithm techniques to push the envelope of power efficiency, computational efficiency, and scale

•  Submissions will be in the form of full conference write-ups to IEEE HPEC, which will allow participants to be evaluated on their complete solution


Graph Challenge

•  GraphChallenge seeks input from diverse communities to develop graph challenges that take the best of what has been learned from groundbreaking efforts such as GraphAnalysis, Graph500, FireHose, MiniTri, and GraphBLAS to create a new set of challenges to move the community forward

•  Initial Graph Challenges
    –  Static Graph Challenge: Sub-Graph Isomorphism
        •  This challenge seeks to identify a given sub-graph in a larger graph
    –  Streaming Graph Challenge: Stochastic Block Optimization
        •  This challenge seeks to identify optimal blocks (or clusters) in a larger graph


Static versus Streaming Mode

•  Static graph processing
    –  Given a large graph G
    –  Evaluate ƒ(G)

•  Two classes of streaming: stateless and stateful
    –  Stateless: process data g as it goes by, ƒ(g)
    –  Stateful: add data g to a larger corpus G and then process the corpus, ƒ(G + g)

•  Graph processing is often in the stateful category
    –  Easy to simulate by partitioning G into streaming pieces g
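
The three processing modes above can be illustrated with a small sketch. It is only an illustration under stated assumptions: ƒ is taken to be triangle counting, the edge list `G` is a toy graph read off the adjacency array shown on the miniTri slide, and the helper names (`count_triangles`, `chunks`) and the chunk size of 2 are hypothetical choices, not part of the challenge specification.

```python
def count_triangles(edges):
    """f(.): count triangles in an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    unique_edges = {tuple(sorted(e)) for e in edges}
    # Every triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in unique_edges) // 3


def chunks(edges, size):
    """Partition G into streaming pieces g (hypothetical helper)."""
    for start in range(0, len(edges), size):
        yield edges[start:start + size]


G = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

# Static: evaluate f(G) once on the whole graph.
static_result = count_triangles(G)

# Stateless streaming: evaluate f(g) on each piece in isolation.
stateless_results = [count_triangles(g) for g in chunks(G, 2)]

# Stateful streaming: add each piece to the corpus, then evaluate f(G + g).
corpus = []
stateful_results = []
for g in chunks(G, 2):
    corpus.extend(g)
    stateful_results.append(count_triangles(corpus))

print(static_result, stateless_results, stateful_results)
```

In this toy run the stateless pieces never contain a whole triangle, while the final stateful result agrees with the static one, which is why triangle counting falls in the stateful category.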


Labels and Filtering

•  Filtering on edge/vertex labels is often used when available to reduce the search space
    –  Filtering can be applied at initialization, during intermediate steps, or at the vertex level
    –  Very problem dependent

•  Some Graph Challenge data sets have labels and some data sets are unlabeled
    –  Some participants will want to filter on labels

•  Initial example implementations will work without labels
    –  Participants can use labels if they choose

•  GraphChallenge is judged by a panel
    –  Panel will value the variations that are submitted


Publicly Available Datasets Initial Datasets in Blue

•  Social networks (10 data sets up to 4.8M vertices & 69M edges) –  Online social networks, edges represent interactions between people

•  Networks with ground-truth (6 data sets up to 66M vertices & 1.8B edges, e.g. Friendster) –  Ground-truth network communities in social and information networks

•  Communication networks (3 data sets up to 2.3M vertices & 5M edges) –  Email communication networks with edges representing communication

•  Citation networks (3 data sets up to 3.7M vertices & 16M edges) –  Nodes represent papers, edges represent citations

•  Collaboration networks (5 data sets up to 23K vertices & 198K edges) –  Nodes represent scientists, edges represent collaborations

•  Web graphs (4 data sets up to 875K vertices & 5.1M edges) –  Nodes represent webpages and edges are hyperlinks

•  Amazon networks (5 data sets up to 548K vertices & 3.4M edges) –  Nodes represent products and edges link commonly co-purchased products

•  Internet networks (9 data sets up to 62K vertices & 147K edges) –  Nodes represent computers and edges communication

•  Road networks (3 data sets up to 1.9M vertices & 2.8M edges) –  Nodes represent intersections and edges roads connecting the intersections

•  Autonomous systems (5 data sets up to 26K vertices & 106K edges) –  Graphs of the internet

•  Signed networks (10 data sets up to 4.8M vertices & 69M edges) –  Networks with positive and negative edges (friend/foe, trust/distrust)

•  Location-based online social networks (2 data sets up to 198K vertices & 950K edges) –  Social networks with geographic check-ins

•  Wikipedia networks, articles, and metadata (7 data sets up to 3.5M vertices & 250M edges) –  Talk, editing, voting, and article data from Wikipedia

•  Twitter and Memetracker (4 data sets up to 96M vertices & 476M edges) –  Memetracker phrases, links and Tweets

•  Online communities (3 data sets up to 2.3M images) –  Data from online communities such as Reddit and Flickr

•  Online reviews (6 data sets up to 34M product reviews) –  Data from online review systems such as BeerAdvocate and Amazon

Stanford Large Network Dataset Collection snap.stanford.edu/data


Publicly Available Datasets Initial Datasets in Blue

AWS Public Data Sets aws.amazon.com/public-data-sets

•  Astronomy (1 data set 180 GB)
    –  Sloan Digital Sky Survey SQL MDF files

•  Biology (4 data sets up to 200 TB)
    –  Genome sequence data

•  Climate (3 data sets, size growing daily)
    –  Satellite imagery data

•  Economics (10 data sets up to 220 GB)
    –  Census, transaction, and transportation data

•  Encyclopedic (10 data sets up to 541 TB - Common Crawl Corpus)
    –  Various online encyclopedia data

•  Geographic (3 data sets up to 125 GB)
    –  Street maps

•  Mathematics (1 data set 160 GB)
    –  University of Florida Sparse Matrix Collection

Yahoo! Webscope Datasets webscope.sandbox.yahoo.com

•  Advertising and Market Data (4 data sets up to 3.7 GB)
    –  Yahoo!'s auction-based platform for selling advertising space

•  Competition Data (3 data sets up to 1.5 GB)
    –  Data challenges run by Yahoo

•  Computing Systems Data (5 data sets up to 8.8 GB)
    –  Computer systems log data

•  Graph and Social Data (3 data sets up to 5 GB)
    –  Graph data from search, groups, and webpages

•  Image Data (3 data sets up to 14 GB)
    –  Flickr imagery and metadata

•  Language Data (29 data sets up to 166 GB - Answers browsing behavior)
    –  Wide range of question/answer data sets

•  Ratings and Classification Data (10 data sets up to 83 GB; 1.5 TB dataset withdrawn)
    –  Community preferences and data

Graph500.org Data Generator

•  Used to generate world’s largest power law graphs

•  Can be modeled with Kronecker Product of a Recursive MATrix (R-MAT): $G^{\otimes k} = G^{\otimes (k-1)} \otimes G$
    –  Where “⊗” denotes the Kronecker product of two matrices

•  R-MAT: A Recursive Model for Graph Mining, Chakrabarti, Zhan & Faloutsos (2004)

•  Mathematically Tractable Graph Generation and Evolution, Leskovec, Chakrabarti, Kleinberg & Faloutsos (2005)

Tools to generate data sets (at various scales and parameters) will be provided as part of the challenge
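
The Kronecker recursion used by the generator can be sketched in a few lines. This is a minimal illustration, not the Graph500 generator itself: the 2×2 initiator values (0.57, 0.19, 0.19, 0.05) are assumed Graph500-style R-MAT parameters, the function names are hypothetical, and the dense representation only works for very small k.

```python
import random


def kronecker_product(X, Y):
    """Kronecker product of two dense matrices stored as lists of lists."""
    return [
        [x * y for x in x_row for y in y_row]
        for x_row in X
        for y_row in Y
    ]


def kronecker_power(G, k):
    """k-th Kronecker power, built as G^(k) = G^(k-1) (x) G."""
    result = G
    for _ in range(k - 1):
        result = kronecker_product(result, G)
    return result


# Assumed R-MAT initiator: probabilities of the four quadrants.
initiator = [[0.57, 0.19],
             [0.19, 0.05]]

# Edge-probability matrix for a graph with 2**3 = 8 vertices.
P = kronecker_power(initiator, 3)
n = len(P)

# Sample directed edges in proportion to the entries of P (seeded for repeatability).
cells = [(i, j) for i in range(n) for j in range(n)]
weights = [P[i][j] for i, j in cells]
sampled_edges = random.Random(0).choices(cells, weights=weights, k=16)

print(n, len(sampled_edges))
```

Because the initiator entries sum to 1, every Kronecker power is again a probability distribution over vertex pairs, and the skew of the initiator is what produces the power-law degree structure.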


GraphChallenge Data Formats

•  Public data is available in a variety of formats
    –  Linked list, tab separated, labeled/unlabeled

•  Requires parsing and standardization

•  Proposed formats
    –  Tab separated triples in ASCII file with labels removed
    –  MMIO ASCII format: math.nist.gov/MatrixMarket

The Matrix Market Exchange Formats: Initial Design, Boisvert, Pozo & Remington (1996)
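
As a rough sketch of how the two proposed formats relate, the snippet below parses tab-separated triples and re-emits them in Matrix Market coordinate form. It assumes, purely for illustration, that each line holds `row<TAB>column<TAB>value` with 1-based integer vertex indices and that the matrix is square; the function names are hypothetical and the official challenge files may differ in detail.

```python
def parse_triples(text):
    """Parse tab-separated (row, column, value) triples, one per line."""
    triples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row, col, val = line.split("\t")
        triples.append((int(row), int(col), int(val)))
    return triples


def to_matrix_market(triples):
    """Render triples as a Matrix Market coordinate file (integer, general)."""
    n = max(max(r, c) for r, c, _ in triples)  # assumed square, 1-based
    lines = [
        "%%MatrixMarket matrix coordinate integer general",
        f"{n} {n} {len(triples)}",  # rows, columns, stored entries
    ]
    lines += [f"{r} {c} {v}" for r, c, v in triples]
    return "\n".join(lines) + "\n"


tsv = "1\t5\t1\n2\t4\t1\n4\t5\t1\n"
triples = parse_triples(tsv)
mm_text = to_matrix_market(triples)
print(mm_text)
```

Both formats carry the same (row, column, value) information; the Matrix Market form only adds a header banner and a size line.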

[Figure: example graph visualizations of four data sets. Panels: Wikipedia Voting, Wikipedia Adminship, Road Network, Twitter Data]

Amazon cloud services will be used to host particularly large data sets


Static Subgraph Isomorphism Problem

Vertex-Based

•  Given a sub-graph H and a larger graph G

•  Is there a 1-to-1 mapping of the vertices in H to vertices in G such that every edge in H is also in G ?

Array-Based

•  Given sub-graph adjacency array B and a larger graph adjacency array A

•  A subgraph isomorphism exists iff there is a permutation array P where $B = P^T A P$

Bold uppercase indicates a matrix, Bold lowercase indicates a vector, lowercase indicates scalars, sets, ...
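
The vertex-based definition can be made concrete with a brute-force search. The sketch below is an illustration only, assuming undirected graphs given as vertex and edge lists; the function name is hypothetical, and the search tries every 1-to-1 assignment, so it is exponential in the size of H and usable only on toy inputs.

```python
from itertools import permutations


def find_subgraph_mapping(h_vertices, h_edges, g_vertices, g_edges):
    """Return a 1-to-1 map from H's vertices into G's vertices such that
    every edge of H lands on an edge of G, or None if no such map exists."""
    g_edge_set = {frozenset(e) for e in g_edges}
    # permutations(...) yields ordered selections without repetition,
    # so each candidate assignment is automatically 1-to-1.
    for image in permutations(g_vertices, len(h_vertices)):
        mapping = dict(zip(h_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set
               for u, v in h_edges):
            return mapping
    return None


# Toy larger graph G on five vertices.
G_vertices = [1, 2, 3, 4, 5]
G_edges = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

# H is a triangle: present in G.
triangle = find_subgraph_mapping(
    ["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")], G_vertices, G_edges)

# H is a 4-clique: absent from G.
clique4 = find_subgraph_mapping(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")],
    G_vertices, G_edges)

print(triangle, clique4)
```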


Scaling Observations

•  Two key parameters govern the complexity of subgraph isomorphism
    –  Size of G and H
    –  Nominal complexity is $|G|^{|H|}$

•  Large G and H may be too challenging

•  Large G or H is more practical

•  Small G, or G and H of the same order, is less interesting
    –  Focus of program is large graphs

•  Large G and small H (G is much larger than H) is more interesting
    –  Example: H is a triangle (i.e., triangle finding algorithm)
    –  Example: H is a k-truss (i.e., k-truss algorithm)


Triangle Counting

•  Triangles are the most basic non-trivial subgraph
    –  Set of three mutually adjacent vertices in a graph

•  Number of triangles in a graph is an important metric used in applications such as
    –  Social network mining (e.g., clustering coefficient calculation)
    –  Link classification and recommendation
    –  Cyber security
    –  Intelligence
    –  Functional biology
    –  Spam detection

•  Example of triangles in a Graph
    –  Triangle 1 : Nodes 3,4,5
    –  Triangle 2 : Nodes 2,4,5

[Figure: example graph on nodes 1 to 5 with its two triangles labeled 1 and 2]

Counting and Sampling Triangles from a Graph Stream, A. Pavan, K. Tangwongsan, S. Tirthapura & Kun-Lung Wu, VLDB 2013
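
The example can be reproduced with a short neighbor-intersection sketch. The edge list is an assumption: it is read off the adjacency array A shown on the miniTri slide, which appears to describe the same five-node graph. The function name is hypothetical, and this is a simple serial approach rather than any reference implementation.

```python
def enumerate_triangles(edges):
    """List each triangle once as a sorted tuple (u, v, w) with u < v < w."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = []
    for u in sorted(adj):
        for v in sorted(adj[u]):
            if v <= u:
                continue  # visit each edge once, with u < v
            # Any common neighbor w of u and v closes a triangle.
            for w in sorted(adj[u] & adj[v]):
                if w > v:  # keep only the ordering u < v < w
                    triangles.append((u, v, w))
    return triangles


example_edges = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
triangles = enumerate_triangles(example_edges)
print(triangles)  # the two triangles named on the slide
```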


Triangle Counting and Enumeration - Example Algorithm -

•  Lower complexity array approach

•  Triangle enumeration has big I/O

•  Triangle counting has minimal I/O

Article

Information Visualization 1–10. © The Author(s) 2016. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1473871616666393. ivi.sagepub.com

Graphing trillions of triangles

Paul Burkhardt

Abstract

The increasing size of Big Data is often heralded but how data are transformed and represented is also profoundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has been placed on the scale of the input graph but the product of a graph algorithm can be many times larger than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scalable graph exploration for Big Graphs requires new approaches to algorithms, architectures, and visual analytics. A brief tutorial is given to aid the argument for thoughtful representation of data in the context of graph analysis. Then a new algebraic method to reduce the arithmetic operations in counting and listing triangles in graphs is introduced. Additionally, a scalable triangle listing algorithm in the MapReduce model will be presented followed by a description of the experiments with that algorithm that led to the current largest and fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph exploration technologies is proposed.

Keywords

Graph, scalable algorithms, triangle counting, visual analytics, parallel programming, MapReduce

Introduction

Data, data, data, and more data! Obtaining data is just the beginning of analysis, but heaping piles of raw data are rarely useful. How data are structured or represented facilitates analysis. Data must undergo transformation, translation, transmuting and other structuring, categorization, and organization operations to enable intelligent analysis. This includes making data more concise, more relevant, and structured to improve the performance of a query or algorithm. Ultimately, data form the raw material we need to pave the roads to knowledge, and visualization is the interface to insight. The entanglement between data representation and analysis is best exemplified in the graph-theoretic study of complex networks.

With the era of Big Data [1] comes the emergence of Big Graphs [2], and while there has been attention to improving the capacity of graph primitives, such as Breadth-First Search [3,4], it should not be overlooked that the answers to even simple questions, the algorithm output itself, can be tremendously large and complex. Consider the task of identifying all mutually connected triplets in a graph or network of entities, often depicted as a triangle where each vertex of the triangle represents an entity that is connected to the other entities by the edges of the triangle. Finding all such triangles in a graph requires a lot of computation and the number of triangles can be many times more than the number of entities. Despite a triangle being a simple graphic, visualizing all triangles in a graph is very difficult because the triangles overlap and are easily obscured.

Computing triangles in graphs is a fundamental operation in graph theory and has wide application in the analysis of social networks [5], identifying dense subgraphs [6] and network motifs [7], spam detection [8], and uncovering hidden thematic layers in the World Wide Web [9]. Yet, given the attention to finding, counting,

US National Security Agency, USA

Corresponding author: Paul Burkhardt, National Security Agency, 9800 Savage Road, Fort Meade, MD 20755, USA. Email: [email protected]

Graph500 (www.graph500.org) benchmark parameters. Table 2 lists the sizes of the graphs and the triangle counts computed by the algorithm. The results of this benchmark are given in Figure 8. The top performance was listing 1.6 billion triangles per minute! This was achieved on the RMAT scale 30 graph with $\Delta(G) > 200$ billion, which was 1.8 TB in size; to the best of this author’s knowledge it is the largest and fastest triangle listing experiment. The RMAT 28 and larger graphs have a maximum degree on the order of $10^6$ but because of the aforementioned load-balancing technique, the neighbor pairing for high-degree vertices was distributed across many reduce tasks, demonstrating the efficacy of the approach. Moreover, the number of 2-paths output from the RMAT 30 graph was nearly two trillion and for the RMAT 32 graph exceeded eleven trillion. The experiment on the RMAT 32 graph did not complete owing to time constraints and not failure.

Visualizing trillions of triangles

Having established an algorithm capable of listing potentially trillions of triangles in a graph, the next step is to identify how to visualize the triangles. To begin, we need a scalable visual analytics platform but there are very few tools designed for large-scale graph exploration. The GreenHornet [26] visual analytics tool is

Figure 7. Triangle listing results for SNAP datasets.

Figure 8. Triangle listing results for Graph500 datasets.

Table 2. Graph500 RMAT graphs (A = 0.57, B = C = 0.19, D = 0.05).

            n (vertices)      2m (edges)        Δ(G)
RMAT 24     16,777,216        268,435,456       2,127,854,845
RMAT 26     67,108,864        1,073,741,824     9,838,965,401
RMAT 28     268,435,456       4,294,967,296     44,970,850,296
RMAT 30     1,073,741,824     17,179,869,184    203,333,933,183
RMAT 32     4,294,967,296     68,719,476,736    < 11,068,947,204,458

which every vertex is connected to every other vertex, and a cycle, i.e. a path with identical endpoints. Triangles in graphs are useful for complex network analysis based solely on structural, i.e. adjacency, information. Let $\Delta(v)$ be the local count of triangles involving vertex $v$ and $\Delta(G)$ be the global count of triangles in $G$. There can be many more triangles than vertices and edges, and if the graph is completely connected, where $G$ itself is a clique, the total number of triangles, $\Delta(G)$, is

$$\Delta(G) = \binom{n}{3} \in \Theta(n^3)$$

The triangles in a graph can be easily identified by testing all possible triples, $\{u, v, w\}$, for $\{u, v\}, \{v, w\}, \{u, w\} \in E$, but this takes $O\left(\binom{n}{3}\right) \in O(n^3)$ time. If most of the triples were triangles and the task was to enumerate all triangles, then the approach is optimal. But generally this brute force method is highly inefficient. We will briefly review the standard method for counting triangles leading to a new matrix approach, then transition into the description of a new algorithm that can potentially enumerate trillions of triangles.

Counting triangles

Since cycles of length $n$ are diagonal elements in $A^n$ and a triangle is a cycle of length 3, then it is well known that the local and global triangle counts can be calculated by

$$\Delta(v) = \frac{1}{2} A^3_{vv}$$

$$\Delta(G) = \frac{1}{6} \sum_v A^3_{vv} = \frac{1}{6} \mathrm{Tr}\left(A^3\right)$$

where the symbol Tr will refer to the trace operator of a matrix. The $\frac{1}{2}$ factor in $\Delta(v)$ is to account for the cycles from $\{u, w\}$ incident to $v$, and the factor $\frac{1}{6} = \frac{1}{2} \times \frac{1}{3}$ in $\Delta(G)$ is because each triangle is counted thrice.

Computing triangles by $A^3$ by fast matrix multiplication admits the best runtime of $n^{\omega + o(1)}$, where $\omega \le 2.3728639$ is the fast matrix product exponent [14]. But the classic matrix product is used in practice, therefore practical improvements would be beneficial. A new matrix approach developed in 2013 by this author reduces the total number of arithmetic operations [15]. Here the element-wise multiplication operator for matrices is denoted by the $\circ$ operator.

Theorem 1. Given $G$ and the Hadamard product $(A^2 \circ A)$, then $\Delta(v) = \frac{1}{2} \sum_v (A^2 \circ A)_v$ and $\Delta(G) = \frac{1}{6} \sum_{ij} (A^2 \circ A)_{ij}$.

Proof. The paths of length two, i.e. 2-paths, are identified by $A^2$; then element-wise multiplication $A \circ A^2$ retains only those $\{v, u\}$ endpoints of 2-paths that are also the $\{v, u\} \in E$ endpoints of edges. Then the sum of elements in the $v$th column vector of $A^2 \circ A$ is the number of 3-cycles involving $v$. Using $\Delta(v) = \frac{1}{2} A^3_{vv}$ leads to

$$\Delta(v) = \frac{1}{2} A^3_{vv} = \frac{1}{2} \sum_v \left(A^2 \circ A\right)_v \qquad (1)$$

Now recall $\mathrm{Tr}(XY^T) = \sum_{ij} (X \circ Y)_{ij}$; then $\mathrm{Tr}(A^2 A^T) = \sum_{ij} (A^2 \circ A)_{ij}$. Finally, since $\Delta(G) = \frac{1}{6} \mathrm{Tr}(A^3)$, then

$$\Delta(G) = \frac{1}{6} \mathrm{Tr}\left(A^3\right) = \frac{1}{6} \mathrm{Tr}\left(A^2 A^T\right) = \frac{1}{6} \sum_{ij} \left(A^2 \circ A\right)_{ij} \qquad (2)$$

The corollary to this is that the number of arithmetic operations is cut in half when using classic matrix multiplication.

Corollary 1. The number of arithmetic operations for counting all triangles in $G$ by Theorem 1 using sparse matrix representation is $2N(n + 1) - 1 = O(mn)$, where $N = 2m$ is the number of non-zeros, which is approximately half the number of arithmetic operations if counting by $\mathrm{Tr}(A^3)$ using sparse matrix products.

Proof. There are a total of $2N$ multiplication and addition operations for a sparse matrix–vector product. Then $A^2$ is possible by direct application of sparse matrix–vector multiplication of $A$ with all $n$ column vectors in $A$ for a total of $2Nn$ arithmetic operations. Then $A \circ A^2$ takes $N$ operations for the element-wise multiplication and, finally, summing all non-zero elements from this is an additional $(N - 1)$ operations. Thus, $\sum_{ij} (A^2 \circ A)_{ij}$ requires $2N(n + 1) - 1 = O(mn)$ arithmetic operations, since $N = 2m$. In contrast, computing $A^3$ takes $4Nn$ operations and $\mathrm{Tr}(A^3)$ adds an extra $(N - 1)$ operations totaling $4Nn + N - 1$, which is of the order of twice the number of arithmetic operations, as

$$\frac{2Nn + 2N - 1}{4Nn + N - 1} \approx \frac{2Nn}{4Nn} = \frac{1}{2}$$

Using similar logic, it is easy to see that this result also holds for dense graphs and the local triangle count, $\Delta(v)$.
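
Theorem 1 can be checked numerically on a small graph. The sketch below is only a sanity check under stated assumptions: the 5×5 adjacency array is the one shown on the miniTri slide, the matrices are dense Python lists (a real implementation would use sparse arrays), and the variable names are illustrative.

```python
def matmul(X, Y):
    """Classic dense matrix product of two lists of lists."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


A = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]
n = len(A)

A2 = matmul(A, A)

# Hadamard (element-wise) product A^2 o A keeps 2-paths whose endpoints are adjacent.
hadamard = [[A2[i][j] * A[i][j] for j in range(n)] for i in range(n)]

# Global count via Theorem 1: one sixth of the sum of all entries.
global_hadamard = sum(map(sum, hadamard)) // 6

# Global count via the trace of A^3, which needs a second matrix product.
A3 = matmul(A2, A)
global_trace = sum(A3[v][v] for v in range(n)) // 6

# Local counts: half the sum of the v-th column of A^2 o A.
local = [sum(hadamard[i][v] for i in range(n)) // 2 for v in range(n)]

print(global_hadamard, global_trace, local)
```

Both global counts agree, and the Hadamard route needs only one matrix product, which is the source of the roughly halved operation count in Corollary 1.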


Triangle Counting using miniTri - Example Algorithm -

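•  Array based algorithm
    –  Given a graph adjacency array A and incidence array E

$$C = A E^T, \qquad n_T = \mathrm{nnz}(C)/3$$

[Figure: example graph on vertices 1 to 5]

miniTri represented in compact linear algebra-based operations

Kokkos/Qthreads Task-Parallel Approach to Linear Algebra Based Graph Analytics, Wolf, Edwards & Olivier, IEEE HPEC 2016

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix}$$

$$B = E^T = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}$$

$$C = A E^T = \begin{pmatrix} \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \emptyset \\ \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \{2, 4, 5\} \\ \emptyset & \emptyset & \emptyset & \emptyset & \emptyset & \{3, 4, 5\} \\ \emptyset & \{4, 2, 5\} & \{4, 3, 5\} & \emptyset & \emptyset & \emptyset \\ \emptyset & \emptyset & \emptyset & \{5, 3, 4\} & \{5, 2, 4\} & \emptyset \end{pmatrix}$$

•  Multiplication is overloaded such that
    –  C(i,j) = {i,x,y} iff A(i,x) = A(i,y) = 1 and $E^T$(x,j) = $E^T$(y,j) = 1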
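
The overloaded product can be imitated directly from its definition. The following is a small sketch, not the miniTri code: it assumes the 5×5 adjacency array and the six incidence columns shown on this slide, stores each incidence column as its endpoint pair (x, y), and uses a hypothetical function name.

```python
def overloaded_product(A, edges):
    """C(i, j) = (i, x, y) iff vertex i is adjacent to both endpoints
    x and y of edge j. Vertices and edges are numbered from 1."""
    C = {}
    for i in range(1, len(A) + 1):
        for j, (x, y) in enumerate(edges, start=1):
            if A[i - 1][x - 1] == 1 and A[i - 1][y - 1] == 1:
                C[(i, j)] = (i, x, y)
    return C


A = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]

# Column j of the incidence array, written as the endpoints of edge j.
edges = [(1, 5), (2, 5), (3, 5), (3, 4), (2, 4), (4, 5)]

C = overloaded_product(A, edges)
n_triangles = len(C) // 3  # each triangle appears once per vertex

print(sorted(C.items()), n_triangles)
```

Each triangle shows up three times in C, once for each of its vertices paired with the opposite edge, hence the division of nnz(C) by 3.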


K-Truss

•  A graph is a k-truss if each edge is part of at least k-2 triangles

•  A generalization of a clique (a k-clique is a k-truss), ensuring a minimum level of connectivity within the graph

•  Traditional technique to find a k-truss subgraph:
    –  Compute the support for every edge
    –  Remove any edges with support less than k-2 and update the list of edges
    –  When all edges have support of at least k-2, we have a k-truss

Example 3-truss

Graphulo: Linear algebra graph kernels for NoSQL databases Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015
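
The traditional edge-peeling technique can be sketched as follows. This is a minimal serial illustration, not the Graphulo implementation: the function name is hypothetical, the toy edge list is assumed from the adjacency array on the miniTri slide, and support is recomputed from scratch on every pass for clarity rather than speed.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss: repeatedly drop edges whose
    support (number of triangles containing the edge) is below k - 2."""
    remaining = {frozenset(e) for e in edges}
    while True:
        # Rebuild neighbor sets from the edges that are still present.
        adj = {}
        for e in remaining:
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

        def support(e):
            u, v = tuple(e)
            return len(adj[u] & adj[v])  # common neighbors close triangles

        weak = {e for e in remaining if support(e) < k - 2}
        if not weak:
            return remaining  # every surviving edge has support >= k - 2
        remaining -= weak


example_edges = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
truss3 = k_truss(example_edges, 3)
truss4 = k_truss(example_edges, 4)
print(sorted(tuple(sorted(e)) for e in truss3), truss4)
```

On this toy graph the pendant edge (1, 5) is dropped for k = 3, and nothing survives for k = 4 because removing the weakly supported edges also destroys the triangles that supported edge (4, 5).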


K Truss: Array Formulation - Example Algorithm -

•  Let E be the unoriented incidence array (rows are edges and columns are vertices) of graph G, and A the adjacency array

•  If G is a k-truss, the following must be satisfied
    –  $(E A == 2)\,\mathbf{1} \ge k - 2$

Graphulo: Linear algebra graph kernels for NoSQL databases, Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015

Algorithm

    R = E A
    x = find( (R == 2) 1 < k − 2 )
    while x is not empty
        E_x = E(x,:)
        E = E(x^c,:)
        R = R(x^c,:)
        R = R − E ( E_x^T E_x − diag(E_x^T E_x) )
        x = find( (R == 2) 1 < k − 2 )
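
The array formulation can be tried out on a toy graph. The sketch below is an assumption-laden illustration: it uses dense Python lists, a hypothetical function name, and the five-vertex example graph from the miniTri slide. For simplicity it rebuilds A and R from the reduced incidence array on every pass instead of applying the incremental update to R, so it demonstrates the support test (R == 2) 1 rather than the optimized update.

```python
def matmul(X, Y):
    """Classic dense matrix product of two lists of lists."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def k_truss_array(edges, n, k):
    """k-truss via incidence array E (edges x vertices) and adjacency array A."""
    E = [[1 if v in edge else 0 for v in range(1, n + 1)] for edge in edges]
    kept = list(edges)
    while E:
        # A = E^T E with the diagonal (vertex degrees) zeroed out.
        Et = [list(col) for col in zip(*E)]
        A = matmul(Et, E)
        for v in range(n):
            A[v][v] = 0
        # R(e, v) == 2 exactly when v is adjacent to both endpoints of edge e.
        R = matmul(E, A)
        support = [sum(1 for val in row if val == 2) for row in R]
        x = {e for e, s in enumerate(support) if s < k - 2}
        if not x:
            break  # all remaining edges satisfy the k-truss condition
        E = [row for e, row in enumerate(E) if e not in x]
        kept = [edge for e, edge in enumerate(kept) if e not in x]
    return kept


example_edges = [(1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
array_truss3 = k_truss_array(example_edges, 5, 3)
array_truss4 = k_truss_array(example_edges, 5, 4)
print(array_truss3, array_truss4)
```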


Metrics

•  Correctness
    –  Number of triangles is exact for a given graph
    –  Enumerating all k-trusses is exact for a given graph

•  Performance
    –  Total number of edges in graph (edges)
    –  Execution time (seconds)
    –  Rate (edges/second)
    –  Power consumption (Watts)
    –  Rate per unit power (edges/second/Watt)
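
The derived performance metrics follow from the measured ones by simple division. The sketch below uses made-up measurements (one million edges, 2 seconds, 250 Watts average draw) purely to show the arithmetic; the function and field names are hypothetical and are not a required reporting format.

```python
def performance_metrics(num_edges, seconds, watts):
    """Derive rate and rate-per-Watt from the three measured quantities."""
    rate = num_edges / seconds   # edges/second
    rate_per_watt = rate / watts  # edges/second/Watt
    return {
        "edges": num_edges,
        "seconds": seconds,
        "edges_per_second": rate,
        "watts": watts,
        "edges_per_second_per_watt": rate_per_watt,
    }


# Hypothetical run: 1,000,000 edges processed in 2 s at 250 W.
metrics = performance_metrics(1_000_000, 2.0, 250.0)
print(metrics)
```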


Summary

•  Static sub-graph isomorphism graph challenge consists of
    –  Triangle finding
    –  K-Truss

•  An implementation will perform these algorithms on provided data sets and report the given metrics

•  A submission to the IEEE HPEC Graph Challenge consists of a conference-style paper describing the approach, implementation, innovations, and results

•  Both hardware and software should be described in the solution
    –  Innovative hardware solutions are of interest (in addition to best algorithm for hardware)
    –  Special interest in performance for very large scale data sets (1-100 trillion edges)