Graph-based Learning and Discovery
Diane J. Cook
University of Texas at Arlington
[email protected]
http://www-cse.uta.edu/~cook
Data Mining
  “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 92]
  Increasing ability to generate data
  Increasing ability to store data
KDD Process
Approaches to Data Mining
  Pattern extraction
  Prediction / classification
  Clustering
[Figure: loan example for the three approaches, using Debt and Income attributes and Loan / No Loan classes, including a decision tree that tests Debt < 50 and Income ranges (<50, 50-100, >100).]
Substructure Discovery
  Most data mining algorithms deal with linear attribute-value data
  Need to represent and learn relationships between attributes
  Discovers repetitive substructure patterns in graph databases
  Pattern extraction, classification, clustering
  Serial and parallel / distributed versions
  Applied to CAD circuits, telecom, DNA, and more
  http://cygnus.uta.edu/subdue
Graph Representation
  Input is a labeled graph
  A substructure is a connected subgraph
  An instance of a substructure is a subgraph that is isomorphic to the substructure definition
[Figure: input database (R1, C1, T1-T4, S1-S4), substructure S1 in graph form (object and shape vertices, "on" and "shape" edges), and the compressed database (R1, C1, and four S1 vertices).]
MDL Principle
  Best theory minimizes description length of data
  Evaluate a substructure based on its ability to compress the description length (DL) of the graph
  Description length = DL(S) + DL(G|S)
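The formula DL(S) + DL(G|S) can be illustrated with a rough sketch. The bit-counting function `description_length` below is a deliberately crude, hypothetical encoding (it is not the encoding Subdue uses), and the graph sizes are made-up numbers chosen only to show that a substructure with several instances can shorten the total description while one with a single instance may not.

```python
import math

def description_length(num_vertices, num_edges, num_labels):
    """Crude DL in bits (hypothetical encoding): each vertex stores a label,
    each edge stores a label plus two endpoint indices."""
    label_bits = math.log2(max(num_labels, 2))
    index_bits = math.log2(max(num_vertices, 2))
    return num_vertices * label_bits + num_edges * (label_bits + 2 * index_bits)

def total_description_length(graph, sub, instances, num_labels):
    """DL(S) + DL(G|S), where G|S replaces every (non-overlapping)
    instance of S by a single new vertex. Sizes are (vertices, edges)."""
    graph_v, graph_e = graph
    sub_v, sub_e = sub
    dl_s = description_length(sub_v, sub_e, num_labels)
    compressed_v = graph_v - instances * (sub_v - 1)
    compressed_e = graph_e - instances * sub_e
    # One extra label is needed for the new "S" vertex.
    dl_g_given_s = description_length(compressed_v, compressed_e, num_labels + 1)
    return dl_s + dl_g_given_s

# Assumed toy sizes: 20 vertices, 30 edges, 5 labels; S has 2 vertices, 1 edge.
baseline = description_length(20, 30, 5)
with_four_instances = total_description_length((20, 30), (2, 1), 4, 5)
with_one_instance = total_description_length((20, 30), (2, 1), 1, 5)
print(baseline, with_four_instances, with_one_instance)
```

Under this toy encoding, a smaller total means a better substructure, which is the sense in which the slide says the best theory minimizes description length.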
Algorithm
  1. Create substructure for each unique vertex label
[Figure: example graph with triangle, square, circle, and rectangle vertices connected by "on" and "left" edges.]
  Substructures: triangle (4), square (4), circle (1), rectangle (1)
Algorithm
  2. Expand best substructure by an edge or edge+neighboring vertex
[Figure: the same example graph with triangle, square, circle, and rectangle vertices connected by "on" and "left" edges.]
  Substructures (vertex and edge labels): triangle, square, on; circle, left, square; rectangle, square, on; rectangle, triangle, on
Algorithm
  3. Keep only best substructures on queue (specified by beam width)
  4. Terminate when queue is empty or #discovered substructures >= limit
  5. Compress graph and repeat to generate hierarchical description
  Note: polynomially constrained [IEEE Exp96]
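Steps 1 to 3 can be sketched in a heavily simplified form. The sketch below assumes candidates are scored by instance count rather than by MDL, performs only one round of expansion (single-edge patterns), and uses a made-up toy graph; the function name `discover` and the data layout are illustrative, not part of Subdue.

```python
from collections import Counter

def discover(vertices, edges, beam_width=2):
    """vertices: {id: label}; edges: [(src, dst, label)].
    Returns (vertex-label counts, best single-edge patterns)."""
    # Step 1: one candidate substructure per unique vertex label.
    label_counts = Counter(vertices.values())
    # Step 2: expand by one edge plus its neighboring vertex.
    patterns = Counter((vertices[u], lbl, vertices[v]) for u, v, lbl in edges)
    # Step 3: keep only the best candidates (beam width).
    beam = patterns.most_common(beam_width)
    return label_counts, beam

# Toy graph (assumed): four triangles on four squares, one circle, one rectangle.
VERTICES = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
            5: "triangle", 6: "square", 7: "triangle", 8: "square",
            9: "circle", 10: "rectangle"}
EDGES = [(1, 2, "on"), (3, 4, "on"), (5, 6, "on"), (7, 8, "on"),
         (2, 10, "on"), (9, 2, "left")]

label_counts, beam = discover(VERTICES, EDGES)
print(dict(label_counts))
print(beam)
```

A fuller version would repeat the expansion on the surviving candidates, stop when the queue is empty or the limit is reached, and then compress the graph with the best substructure before the next pass.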
Examples [Jair94]
Inexact Graph Match [JIIS95]
  Some variations may occur between instances
  Want to abstract over minor differences
  Difference = cost of transforming one graph to make it isomorphic to another
  Match if cost/size < threshold
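Inexact Graph Match
[Figure: two labeled graphs with edges labeled a and b. The first has vertices 1 (A) and 2 (B); the second has vertices 3 (B), 4 (A), and 5 (B).]
  Search over partial vertex mappings, with cost:
    (1,3) 1: (2,4) 7, (2,5) 6, (2,) 10
    (1,4) 0: (2,3) 3, (2,5) 6, (2,) 9
    (1,5) 1: (2,3) 7, (2,4) 7, (2,) 10
    (1,) 1: (2,3) 9, (2,4) 10, (2,5) 9, (2,) 11
  Least-cost match is {(1,4), (2,3)}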
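The idea of a least-cost mapping can be sketched with a brute-force search over vertex assignments. The sketch assumes unit costs for every transformation (label substitution, edge deletion or relabeling, vertex or edge insertion) and takes "size" to be the vertex-plus-edge count of the larger graph; both are assumptions, so the costs it produces are not the ones shown in the slide's search tree, and the two toy graphs are made up.

```python
from itertools import permutations

def match_cost(g1, g2):
    """Least cost of mapping g1's vertices onto distinct g2 vertices.
    A graph is (labels, edges) with labels = {vertex: label} and
    edges = {(src, dst): label}. Assumes g1 has no more vertices than g2."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    v1 = list(labels1)
    best = None
    for image in permutations(labels2, len(v1)):
        mapping = dict(zip(v1, image))
        # Vertex label substitutions.
        cost = sum(labels1[v] != labels2[mapping[v]] for v in v1)
        # g1 edges that are missing or labeled differently in g2.
        covered = set()
        for (u, v), lbl in edges1.items():
            key = (mapping[u], mapping[v])
            covered.add(key)
            if edges2.get(key) != lbl:
                cost += 1
        # g2 vertices and edges with no counterpart in g1 must be inserted.
        cost += len(labels2) - len(v1)
        cost += sum(1 for e in edges2 if e not in covered)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best

def is_match(g1, g2, threshold):
    """Match if cost/size < threshold (size = larger graph, an assumption)."""
    cost, _ = match_cost(g1, g2)
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return cost / size < threshold

# Toy graphs (assumed edge structure).
G1 = ({1: "A", 2: "B"}, {(1, 2): "a"})
G2 = ({3: "B", 4: "A", 5: "B"}, {(4, 3): "a", (4, 5): "b"})
print(match_cost(G1, G2))
```

Enumerating all assignments is exponential in the number of vertices, so this only suits very small graphs.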
Background Knowledge [IEEE TKDE96]
  Some substructures not relevant
  Background knowledge can bias search
  Two types
    Model knowledge
    Graph match rules
Parallel/distributed Subdue [JPDC00]
  Scalability issues
  Three approaches
    Dynamic partitioning
    Functional parallel
    Static partitioning
Static Partitioning
  Divide graph into P partitions, distribute to P processors
  Each processor performs serial Subdue on local partition
  Broadcast best substructures, evaluate on other processors
  Master processor stores best global substructures
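The four steps above can be mimicked in a single process. In this sketch the "processors" are simulated by a plain loop, edges are dealt out round-robin instead of using a real graph partitioner, and `local_best` is only a stand-in for serial Subdue that ranks single-edge patterns by count; all names and the toy data are illustrative.

```python
from collections import Counter

def edge_patterns(vertices, edges):
    """Count (source label, edge label, target label) patterns."""
    return Counter((vertices[u], lbl, vertices[v]) for u, v, lbl in edges)

def local_best(vertices, edges, k):
    """Stand-in for serial Subdue on one partition: top-k patterns by count."""
    return [p for p, _ in edge_patterns(vertices, edges).most_common(k)]

def static_partition_discover(vertices, edges, num_parts, k=2):
    # Divide the graph into partitions (round-robin over edges, an assumption).
    parts = [edges[i::num_parts] for i in range(num_parts)]
    # Each simulated processor finds its local best substructures.
    candidates = set()
    for part in parts:
        candidates.update(local_best(vertices, part, k))
    # "Broadcast": every candidate is evaluated on every partition,
    # and the master sums the per-partition counts.
    totals = Counter()
    for part in parts:
        local = edge_patterns(vertices, part)
        for cand in candidates:
            totals[cand] += local[cand]
    return totals.most_common(k)

PART_VERTICES = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
                 5: "triangle", 6: "square", 7: "triangle", 8: "square",
                 9: "circle", 10: "rectangle"}
PART_EDGES = [(1, 2, "on"), (3, 4, "on"), (5, 6, "on"), (7, 8, "on"),
              (2, 10, "on"), (9, 2, "left")]
print(static_partition_discover(PART_VERTICES, PART_EDGES, num_parts=2))
```

Because single-edge patterns never span two partitions here, nothing is lost by splitting; with larger substructures, instances that cross a partition boundary would be missed.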
Static Partitioning Results
  Close to linear speedup
  Continue until #processors > #vertices
Issues
  Partitioning the graph loses information
  Metis graph partitioning system
  Quality of resulting substructures?
  Recapture by overlap, multiple partitions
  Evaluating more substructures globally
Compression Results
AutoClass
  Linear representation
  Fit possible probabilistic models to data
  Satellite data, DNA data, Landsat data
SUBDUE/AutoClass Combined
[Figure: pipeline combining Subdue and AutoClass. Labels: Data, structural features, structural patterns, linear features, Classes. "+" = combination of linear data or addition of linear features.]
Example - 30 2-color squares
  AutoClass
    Rep - tuple for each line (x1, y1, x2, y2, angle, length, color)
    Add structure (neighboring edge information)
  Subdue
    Rep - each line is node in graph, edges between connecting lines
    Attributes from nodes
Results
  AutoClass (12 classes)
    Class 0 (20): Color=green, LineNo=Line1=Line2=98 +/- 10
    Class 1 (20): Color=red, LineNo=Line1=Line2=99 +/- 10
    …
    Class 11 (3): Line2=1 +/- 13, Color=green
  Subdue (top substructure)
Combined Results
  Combine 4 entries for each square into one
  30 tuples (one for each square)
  Discover
    Class 0 (10): Color1=red, Color2=red, Color3=green, Color4=green
    Class 1 (10): Color1=green, Color2=green, Color3=blue, Color4=blue
    Class 2 (10): Color1=blue, Color2=blue, Color3=red, Color4=red
More Results
Supervised SUBDUE [IEEE IS00]
  One graph stores positive examples
  One graph stores negative examples
  Find substructure that compresses positive graph but not negative graph
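The supervised criterion can be sketched with a simple proxy. Instead of measuring compression, the sketch assumes a score equal to the number of instances in the positive graph minus the number in the negative graph, and it only considers single-edge patterns; the function names and toy graphs are hypothetical.

```python
from collections import Counter

def edge_patterns(vertices, edges):
    """Count (source label, edge label, target label) patterns."""
    return Counter((vertices[u], lbl, vertices[v]) for u, v, lbl in edges)

def best_discriminating_pattern(positive, negative):
    """Pick the pattern frequent in the positive graph but rare in the
    negative graph. Score = positive count - negative count (a proxy
    for 'compresses positive but not negative', not the real measure)."""
    pos = edge_patterns(*positive)
    neg = edge_patterns(*negative)
    return max(pos, key=lambda p: pos[p] - neg[p])

# Toy graphs (assumed): each is (vertices, edges).
POSITIVE = ({1: "triangle", 2: "square", 3: "triangle", 4: "square", 5: "rectangle"},
            [(1, 2, "on"), (3, 4, "on"), (2, 5, "on")])
NEGATIVE = ({1: "square", 2: "rectangle", 3: "square", 4: "rectangle", 5: "triangle"},
            [(1, 2, "on"), (3, 4, "on"), (1, 5, "on")])
print(best_discriminating_pattern(POSITIVE, NEGATIVE))
```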
Example
[Figure: example substructure with object vertices connected by "on" edges, and "shape" edges to triangle and square.]
Results
  Chess endgames (19,257 examples), BK is (+) or is not (-) in check
  99.8% FOIL, 99.77% C4.5, 99.21% Subdue
More Results
  Tic Tac Toe endgames
    + is win for X (958 examples)
    100% Subdue, 92.35% FOIL, 96.03% C4.5
  Bach chorales
    Musical sequences (20 sequences)
    100% Subdue, 85.71% FOIL, 82.00% C4.5
Clustering Using SUBDUE
  Iterate Subdue until single vertex
  Each cluster (substructure) inserted into a classification lattice
  Early results similar to COBWEB [Fisher87]
[Figure: classification lattice with Root at the top.]
Discovery Application Domains
  Biochemical domains
    Protein data [PSB99, IDA99]
    Human Genome DNA data
    Toxicology (cancer) data
  Spatial-temporal domains
    Earthquake data
    Aircraft Safety and Reporting System
  Telecommunications data
  Program source code
Structured Web Search [AAAI-AIWS00]
  Existing search engines use linear feature match
  Subdue searches based on structure
  Incorporation of WordNet allows for inexact feature match through synset path length
  Technique
    Breadth-first search through domain to generate graph
    Nodes represent pages / documents
    Edges represent hyperlinks
    Additional nodes used to represent document keywords
    Pose query as graph
    Search for query match within domain graph
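The technique can be sketched end to end on in-memory data. The dictionaries `links` and `keywords` are hypothetical stand-ins for fetched pages (no crawling is done), the vertex and edge labels `_page_`, `_hyperlink_`, and `_word_` follow the query file shown later in the slides, and the final function hard-codes one query shape (pages that link to a page containing a word) rather than doing general subgraph matching.

```python
from collections import deque

def build_domain_graph(start_url, links, keywords):
    """Breadth-first walk from start_url.
    links: {url: [linked urls]}, keywords: {url: [words]} (assumed inputs).
    Returns (vertices {id: label}, edges [(src, dst, label)], ids {url: id})."""
    vertices, edges, ids = {}, [], {}

    def add_vertex(label):
        vid = len(vertices) + 1
        vertices[vid] = label
        return vid

    ids[start_url] = add_vertex("_page_")
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        # Additional vertices represent the document's keywords.
        for word in keywords.get(url, []):
            edges.append((ids[url], add_vertex(word), "_word_"))
        # Edges between page vertices represent hyperlinks.
        for target in links.get(url, []):
            if target not in ids:
                ids[target] = add_vertex("_page_")
                queue.append(target)
            edges.append((ids[url], ids[target], "_hyperlink_"))
    return vertices, edges, ids

def pages_linking_to_word(vertices, edges, ids, word):
    """Pages with a hyperlink to a page that has a word edge to `word`."""
    has_word = {u for u, v, lbl in edges if lbl == "_word_" and vertices[v] == word}
    hits = {u for u, v, lbl in edges if lbl == "_hyperlink_" and v in has_word}
    url_of = {vid: url for url, vid in ids.items()}
    return sorted(url_of[u] for u in hits)

# Hypothetical three-page site.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": []}
KEYWORDS = {"c": ["subdue"]}
graph = build_domain_graph("a", LINKS, KEYWORDS)
print(pages_linking_to_word(*graph, "subdue"))
```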
Sample Search
[Figure: sample page graph with labels Instructor, Teaching, Robotics, Research, Publication, http, Postscript | PDF.]
Query: Find all pages which link to a page containing term ‘subdue’
Subgraph vertices:
  1 _page_ URL: http://cygnus.uta.edu
  7 _page_ URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_
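A query file in this layout can be read with a few lines of code. The parser below is a minimal sketch that handles only what appears in the slide: `v` lines (vertex id and label), `d` lines (directed edge with two vertex ids and a label), and `/* ... */` comment lines. Any other line, including the lone `s`, is skipped, and labels containing spaces are not supported; the function name is hypothetical.

```python
def parse_query_graph(text):
    """Parse 'v <id> <label>' and 'd <src> <dst> <label>' lines.
    Returns (vertices {id: label}, edges [(src, dst, label)])."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("/*"):
            continue  # blank line or comment
        if parts[0] == "v" and len(parts) >= 3:
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d" and len(parts) >= 4:
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
        # Any other line type is ignored in this sketch.
    return vertices, edges

QUERY = """\
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_
"""
print(parse_query_graph(QUERY))
```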
Search for Presentation Pages
  Subdue
    22 instances
  AltaVista
    Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
    12 instances
[Figure: query graph of page vertices connected by hyperlink edges.]
Search for Reference Pages
  Search for page with at least 35 in-links
    5 pages in www-cse
  AltaVista cannot perform this type of search
[Figure: query graph of one page vertex with hyperlink edges from many page vertices.]
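The reference-page query amounts to a threshold on in-degree. The sketch below assumes the domain graph is available as a list of `(source, target, label)` edges with hyperlinks labeled `_hyperlink_`, and it counts each linking page once; it uses a plain degree count rather than Subdue's graph match, and the toy edges are made up.

```python
from collections import Counter

def pages_with_min_inlinks(edges, min_inlinks):
    """Return target vertices that at least `min_inlinks` distinct
    pages link to. edges: [(src, dst, label)] (assumed layout)."""
    distinct_links = {(u, v) for u, v, lbl in edges if lbl == "_hyperlink_"}
    in_degree = Counter(v for _, v in distinct_links)
    return sorted(v for v, n in in_degree.items() if n >= min_inlinks)

# Toy edges: vertex 4 has three in-links, vertex 2 has one.
TOY_EDGES = [(1, 4, "_hyperlink_"), (2, 4, "_hyperlink_"), (3, 4, "_hyperlink_"),
             (1, 2, "_hyperlink_"), (1, 4, "_hyperlink_"), (4, 9, "_word_")]
print(pages_with_min_inlinks(TOY_EDGES, 3))
```

With the slide's setting, the call would use `min_inlinks=35` on the graph built for the www-cse domain.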
Search for pages on ‘jobs in computer science’
  Inexact match: allow one level of synonyms
  Subdue found 33 matches
    Words include employment, work, job, problem, task
  AltaVista found 2 matches
[Figure: query graph of a page vertex with word edges to jobs, computer, and science.]
Search for ‘authority’ hub and authority pages
  Subdue found 3 hub (and 3 authority) pages
  AltaVista cannot perform this type of search
  Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
  Subdue found 13 matches
[Figure: query graph of page vertices grouped as HUBS and AUTHORITIES, connected by hyperlink edges, with word edges to "algorithms".]
Subdue Learning from Web Data
  Distinguish professors’ and students’ web pages
    Learned concept (professors have “box” in address field)
  Distinguish online stores and professors’ web pages
    Learned concept (stores have more levels in graph)
[Figure: learned substructures: a page vertex with a word edge to "box", and a multi-level graph of page vertices.]