Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu...


Page 1: Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc. cyu@elance.com.

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach

Charles Yu
Database Architect
Elance Inc.
cyu@elance.com

[email protected]@acm.org

2005-08

Page 2: Background

• Node-Uniform Hierarchical (NUH) data can be visualized as a tree or forest in which every node has the same set of attributes.

• NUH data is naturally represented in an RDBMS by a recursive table, where the parent-child relationship is implemented so that if record x is a child of record y, the value of x's parent_id column equals the value of y's id column.

• Standard SQL does not support general queries on NUH data stored in basic recursive tables.

• Oracle provides a native mechanism for general queries on NUH data and beyond, known as connect-by. For all its elegance and usefulness, it falls short on two counts: it can be slow, because the parent-child relationship cannot be directly indexed; and it is Oracle-dependent, so SQL queries that use connect-by cannot easily be adapted to the RDBMS of other vendors.

Page 3: Basic Recursive Table Design

Basic columns:

    xid           -- system-assigned unique id
    parent_xid    -- xid of the parent of this entry
    entry_code    -- content-unique identifier of the entry
    normal_stuff  -- one or more such columns for content values

A variant:

Use a separate table to store the hierarchical relationship, consisting essentially of two columns, xid (or child_xid) and parent_xid, and use a foreign key to link that table to the main data table.
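A minimal sketch of the basic recursive table, assuming SQLite (through Python's sqlite3 module) as a stand-in engine; the deck itself targets Oracle. The table name t, the sample rows, and the index name t_parent are illustrative choices, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (
    xid          INTEGER PRIMARY KEY,        -- system-assigned unique id
    parent_xid   INTEGER REFERENCES t(xid),  -- NULL for root entries
    entry_code   TEXT NOT NULL,              -- content identifier of the entry
    normal_stuff TEXT                        -- stand-in for the content columns
);
CREATE INDEX t_parent ON t(parent_xid);      -- hypothetical index name
INSERT INTO t VALUES (1, NULL, 'a', 'root'), (3, 1, 'b1', NULL), (2, 1, 'b2', NULL);
""")

# One level of the hierarchy is a plain self join on parent_xid = xid.
children_of_a = [row[0] for row in conn.execute(
    "SELECT c.entry_code FROM t AS c JOIN t AS p ON c.parent_xid = p.xid "
    "WHERE p.entry_code = ? ORDER BY c.entry_code", ("a",))]
```

A single join reaches only direct children; going deeper needs one more join per level, which is the limitation the later slides address.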

Page 4: Basic Recursive Table Query Mechanisms

• Oracle-native connect-by
• k-way self outer join (for depth up to level k)
• Other??
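The k-way self outer join can be sketched as follows for k = 3, again assuming SQLite as a stand-in engine and an illustrative table t with made-up rows. Each extra level of depth costs one more LEFT JOIN, so the query text has to be rewritten whenever the tree gets deeper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT);
INSERT INTO t VALUES (1, NULL, 'a'), (3, 1, 'b1'), (2, 1, 'b2'),
                     (6, 2, 'c1'), (5, 2, 'c2'), (4, 2, 'c3');
""")

# Three aliases of the same table, one per level; outer joins keep
# branches that stop before the third level (b1 has no children).
paths = conn.execute("""
    SELECT l0.entry_code, l1.entry_code, l2.entry_code
    FROM t AS l0
    LEFT JOIN t AS l1 ON l1.parent_xid = l0.xid
    LEFT JOIN t AS l2 ON l2.parent_xid = l1.xid
    WHERE l0.parent_xid IS NULL
    ORDER BY l1.entry_code, l2.entry_code
""").fetchall()
```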

Page 5: Basic Idea of Path Code Approach

• A node of a tree is fully determined by the path from the root to itself.

• A path code, as a full representation of that path, can be very compact: its length is on the order of the logarithm of the total size of the tree.

• Path codes can feasibly be maintained dynamically.

• Path codes permit direct indexing.

Page 6: Path-code enhanced recursive table design

Basic columns:

    xid
    parent_xid
    path_code     -- code of the path for the node (details later)
    entry_level   -- level of the entry in the tree it belongs to
    sibling_no    -- sequence number of the child entry with respect to its parent
    is_leaf       -- 1/0 for being a leaf/not a leaf
    entry_code    -- content-unique identifier of the entry
    normal_stuff  -- one or more such columns for content values
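A sketch of the enhanced table with an index on path_code, assuming SQLite as a stand-in engine. The index name t_path_code and the CHECK constraint are illustrative additions. The query plan is inspected for a range predicate on path_code, which is one engine-neutral way to express a prefix match; whether a LIKE 'prefix%' predicate uses the index depends on the engine and its settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (
    xid          INTEGER PRIMARY KEY,
    parent_xid   INTEGER REFERENCES t(xid),
    path_code    TEXT,
    entry_level  INTEGER,
    sibling_no   INTEGER,
    is_leaf      INTEGER CHECK (is_leaf IN (0, 1)),
    entry_code   TEXT,
    normal_stuff TEXT
);
CREATE INDEX t_path_code ON t(path_code);   -- hypothetical index name
""")

# Ask the engine how it would find every row whose path_code starts with '0102':
# all such codes fall in the half-open range ['0102', '0103').
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE path_code >= ? AND path_code < ?",
    ("0102", "0103")).fetchall()
plan_detail = " ".join(row[3] for row in plan)
```

The detail column of the plan names the index when the engine chooses an index search over a full table scan.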

Page 7: Value Setting for H columns (I)

• parent_xid is set as usual.

• sibling_no can be set according to any ordering, e.g. by entry_code, starting at 1 for each parent; the sibling_no of root entries is set as if those roots were children of a super root.

• entry_level can be set top down: entry_level = 0 for all root entries, and X.entry_level = k+1 if X has parent Y and Y.entry_level = k.

• is_leaf = 0 if the node has a child, 1 otherwise.
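The rules above can be sketched in plain Python as a top-down walk. The function name set_h_values and the sample rows are hypothetical; siblings are ordered by entry_code, which is only one of the orderings the slide allows.

```python
# (xid, parent_xid, entry_code) rows of a basic recursive table.
ROWS = [(1, None, "a"), (3, 1, "b1"), (2, 1, "b2"),
        (6, 2, "c1"), (5, 2, "c2"), (4, 2, "c3")]

def set_h_values(rows):
    """Return {xid: {sibling_no, entry_level, is_leaf}} for every row."""
    children = {}
    for xid, parent_xid, entry_code in rows:
        children.setdefault(parent_xid, []).append((entry_code, xid))

    h_values = {}

    def visit(parent_xid, level):
        # Siblings are numbered from 1 in entry_code order; roots are
        # treated as children of a "super root" whose key is None.
        ordered = sorted(children.get(parent_xid, []))
        for sibling_no, (_, xid) in enumerate(ordered, start=1):
            h_values[xid] = {
                "sibling_no": sibling_no,
                "entry_level": level,
                "is_leaf": 0 if xid in children else 1,
            }
            visit(xid, level + 1)

    visit(None, 0)
    return h_values

H = set_h_values(ROWS)
```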

Page 8: Value Setting for H columns (II)

path_code:

– for root entries X: X.path_code = to_char(X.sibling_no, 'FM00')

– for non-root entries X with X.parent_xid = Y.xid: X.path_code = Y.path_code || to_char(X.sibling_no, 'FM00')

(In Oracle, to_char with the plain format '00' reserves a leading blank for the sign, so to_char(1, '00') yields ' 01'; the FM modifier suppresses that blank and yields '01'.)

• The path_code of a node N at level k has k+1 sections; the level-j section is the left-zero-padded string conversion of the sibling_no of N's ancestor at level j (or of N itself when j = k).

• For convenience, the last (rightmost) section is called the base section, and the concatenation of all the other sections is called the ancestor section.
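A small Python sketch of the same construction, assuming a fixed section length of 2. The helper names (section, path_code, split_sections, base_section, ancestor_section) are hypothetical and only mirror the terms defined above.

```python
SECTION_LEN = 2  # assumed fixed section length

def section(sibling_no):
    """Left-zero-padded string form of one sibling_no."""
    if not 1 <= sibling_no < 10 ** SECTION_LEN:
        raise ValueError("sibling_no does not fit in one section")
    return f"{sibling_no:0{SECTION_LEN}d}"

def path_code(parent_path_code, sibling_no):
    """Path code of a child; pass '' as the parent code for a root entry."""
    return parent_path_code + section(sibling_no)

def split_sections(code):
    """All sections of a path code, from level 0 down to the node's level."""
    return [code[i:i + SECTION_LEN] for i in range(0, len(code), SECTION_LEN)]

def base_section(code):
    return code[-SECTION_LEN:]

def ancestor_section(code):
    return code[:-SECTION_LEN]
```

For example, path_code("0102", 3) gives "010203", whose base section is "03" and whose ancestor section "0102" is the parent's path code.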

Page 9: Example of H-value setting

ORG tree (EC for entry_code, EL for entry_level, XID for xid, PC for path_code). By their path codes, b1 and b2 are children of a, and c1, c2, c3 are children of b2.

    EC   EL   XID   PC
    a    0    1     01
    b1   1    3     0101
    b2   1    2     0102
    c1   2    6     010201
    c2   2    5     010202
    c3   2    4     010203

Explanation

• path_code is in the uniform format.

• path_code order is based on entry_code order, not on xid order. It could be otherwise.

• The path_code of a child is the path_code of its parent plus its own base section code.

• sibling_no is not shown but is assumed to follow entry_code order.

• entry_code and xid values can be set independently of each other.

• parent_xid, sibling_no, is_leaf and other fields are not shown.

Format assumptions (the same assumptions apply to the later code examples):

• Node uniform (see next page)
• Section length = 2
• Path codes expressed as strings
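The example values can be reproduced in SQL, one level at a time from the top. This is a sketch assuming SQLite as a stand-in engine, where printf('%02d', n) plays the role of the zero-padded to_char conversion; sibling_no and entry_level are loaded directly rather than computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT,
                entry_level INTEGER, sibling_no INTEGER, path_code TEXT);
INSERT INTO t (xid, parent_xid, entry_code, entry_level, sibling_no) VALUES
  (1, NULL, 'a',  0, 1),
  (3, 1,    'b1', 1, 1),
  (2, 1,    'b2', 1, 2),
  (6, 2,    'c1', 2, 1),
  (5, 2,    'c2', 2, 2),
  (4, 2,    'c3', 2, 3);
""")

# Parents must be coded before their children, so go level by level.
max_level = conn.execute("SELECT MAX(entry_level) FROM t").fetchone()[0]
for level in range(max_level + 1):
    conn.execute("""
        UPDATE t SET path_code =
            COALESCE((SELECT p.path_code FROM t AS p WHERE p.xid = t.parent_xid), '')
            || printf('%02d', sibling_no)
        WHERE entry_level = ?""", (level,))

codes = dict(conn.execute("SELECT entry_code, path_code FROM t"))
```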

Page 10: Variants of path_code pattern (advanced topic)

• Node uniform: every section of every path_code has equal length (the simplest variant, used in the previous example).

• Level uniform: all sections at the same level have equal length across all path_codes.

• Parent uniform: all child nodes of any given parent node have path_codes of equal length.

• Dot (or delimiter) uniform: the same delimiter character (e.g. a dot) separates all sections of all path_codes.

• Min uniform: the length of the base section of the path_code is always kept to the minimum.

• String/binary/hex choices in expression and interpretation, relevant to sorting, etc.

• Sparse uniform: each path_code section allows more values than are currently needed, to ease subsequent node insertions.
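Three of these variants can be contrasted with small Python encoders. This is one possible reading of the slide, not the author's definition: the function names, the width values, and the gap-of-10 numbering used for the sparse variant are all assumptions made for illustration.

```python
def node_uniform(sibling_nos, width=2):
    """Every section has the same fixed width."""
    return "".join(f"{n:0{width}d}" for n in sibling_nos)

def dot_uniform(sibling_nos):
    """Sections of any length, separated by a dot."""
    return ".".join(str(n) for n in sibling_nos)

def sparse_uniform(sibling_nos, width=3, gap=10):
    """Fixed-width sections numbered 10, 20, 30, ... so that a new
    sibling can take an unused value in between (assumed scheme)."""
    return "".join(f"{n * gap:0{width}d}" for n in sibling_nos)

def is_descendant_dot(code, ancestor_code):
    """Prefix test for dot-delimited codes; the delimiter must be part
    of the prefix, otherwise '1.10' would look like a child of '1.1'."""
    return code.startswith(ancestor_code + ".")
```

With unpadded dot-delimited sections, plain string order no longer matches sibling order ("1.10" sorts before "1.2"), which is one reason the choice of expression is relevant to sorting.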

Page 11: Query Patterns

• Get all descendants of a node P (the result includes P itself):
    select * from T where path_code like P.path_code||'%'

• Get all ancestors of a node C (the result includes C itself):
    select * from T where C.path_code like path_code||'%'

• Get all siblings of a node N (the result includes N itself):
    select * from T where parent_xid = N.parent_xid
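The query patterns can be tried against the example tree. This sketch assumes SQLite as a stand-in engine; the helper codes() is hypothetical, and the direct-children query (a prefix match restricted to the next entry_level) is an added illustration rather than a pattern from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT,
                entry_level INTEGER, path_code TEXT);
CREATE INDEX t_path_code ON t(path_code);
INSERT INTO t VALUES
  (1, NULL, 'a',  0, '01'),
  (3, 1,    'b1', 1, '0101'),
  (2, 1,    'b2', 1, '0102'),
  (6, 2,    'c1', 2, '010201'),
  (5, 2,    'c2', 2, '010202'),
  (4, 2,    'c3', 2, '010203');
""")

def codes(sql, *params):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]

# Subtree of b2 (path_code '0102'), including b2 itself.
descendants = codes(
    "SELECT entry_code FROM t WHERE path_code LIKE ? || '%' ORDER BY path_code", "0102")

# Ancestors of c2 (path_code '010202'), including c2 itself.
ancestors = codes(
    "SELECT entry_code FROM t WHERE ? LIKE path_code || '%' ORDER BY path_code", "010202")

# Siblings of c2, including c2 itself.
siblings = codes(
    "SELECT entry_code FROM t WHERE parent_xid = "
    "(SELECT parent_xid FROM t WHERE entry_code = ?) ORDER BY path_code", "c2")

# Direct children of a only: prefix match plus the next entry_level.
children = codes(
    "SELECT entry_code FROM t WHERE path_code LIKE ? || '%' AND entry_level = ? "
    "ORDER BY path_code", "01", 1)
```

Ordering by path_code returns rows in depth-first tree order, which falls out of the encoding at no extra cost.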

Page 12: DML Patterns (insert at end)

(Insert the record with path_code and sibling_no as null)
    insert into T(xid, parent_xid, entry_level, entry_code, normal_stuff) values (c.xid, p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff);

(Update sibling_no; coalesce covers the case where c is the first child of p, when max returns null)
    update T set sibling_no = (select coalesce(max(sibling_no), 0) + 1 from T where parent_xid = p.xid) where xid = c.xid;

(Update path_code)
    update T set path_code = p.path_code || to_char(sibling_no, 'FM00') where xid = c.xid;

(Reset is_leaf for p; details omitted)
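The steps above can be sketched as one transaction, assuming SQLite as a stand-in engine with printf('%02d', n) in place of to_char. The function name insert_at_end and the sample rows are hypothetical; the sketch also sets is_leaf on both rows, a detail the slide omits, and does not handle inserting a new root.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT,
                entry_level INTEGER, sibling_no INTEGER, is_leaf INTEGER,
                path_code TEXT);
INSERT INTO t VALUES
  (1, NULL, 'a',  0, 1, 0, '01'),
  (3, 1,    'b1', 1, 1, 1, '0101'),
  (2, 1,    'b2', 1, 2, 1, '0102');
""")

def insert_at_end(conn, xid, parent_xid, entry_code):
    """Append a new last child under parent_xid (non-root inserts only)."""
    with conn:  # commit on success, roll back on error
        p_level, p_code = conn.execute(
            "SELECT entry_level, path_code FROM t WHERE xid = ?",
            (parent_xid,)).fetchone()
        # Step 1: insert with sibling_no and path_code still NULL.
        conn.execute(
            "INSERT INTO t (xid, parent_xid, entry_code, entry_level, is_leaf) "
            "VALUES (?, ?, ?, ?, 1)", (xid, parent_xid, entry_code, p_level + 1))
        # Step 2: next sibling_no; MAX ignores the new row's NULL.
        conn.execute(
            "UPDATE t SET sibling_no = (SELECT COALESCE(MAX(sibling_no), 0) + 1 "
            "FROM t WHERE parent_xid = ?) WHERE xid = ?", (parent_xid, xid))
        # Step 3: parent's code plus the new base section.
        conn.execute(
            "UPDATE t SET path_code = ? || printf('%02d', sibling_no) WHERE xid = ?",
            (p_code, xid))
        # Step 4: the parent is no longer a leaf.
        conn.execute("UPDATE t SET is_leaf = 0 WHERE xid = ?", (parent_xid,))

insert_at_end(conn, 7, 3, "c1")   # first child of b1
insert_at_end(conn, 8, 1, "b3")   # third child of a
result = dict(conn.execute("SELECT entry_code, path_code FROM t"))
```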

Page 13: DML Patterns (insert in middle)

(Insert the record with path_code and sibling_no as null)
    insert into T(xid, parent_xid, entry_level, entry_code, normal_stuff) values (c.xid, p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff);

(Update sibling_no for those siblings elder than c, i.e. those that sort after c)
    update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and entry_code > c.entry_code;

(Update sibling_no for c; coalesce covers the case where no sibling sorts before c)
    update T set sibling_no = (select coalesce(max(sibling_no), 0) + 1 from T where parent_xid = p.xid and entry_code < c.entry_code) where xid = c.xid;

(Update path_code for c)
    update T set path_code = p.path_code || to_char(sibling_no, 'FM00') where xid = c.xid;

(Update path_code for those siblings elder than c and all descendants of those elder siblings; pcs_length stands for path_code section length. The section at c's level is incremented from its old value rather than rebuilt from sibling_no, because a descendant row's own sibling_no belongs to a deeper section. The elder sibling whose old code equals c's new code must be included, hence >= together with the exclusion of c itself.)
    update T set path_code = substr(path_code, 1, pcs_length*c.entry_level) || to_char(to_number(substr(path_code, pcs_length*c.entry_level + 1, pcs_length)) + 1, 'FM00') || substr(path_code, pcs_length*(c.entry_level+1) + 1) where path_code like p.path_code||'%' and path_code >= c.path_code and xid <> c.xid;
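A runnable sketch of the insert-in-middle pattern, assuming SQLite as a stand-in engine. The function name insert_in_middle and the sample rows are hypothetical. One deliberate deviation from the slide order: the later siblings and their subtrees are shifted before the new row receives its path_code, so that no two rows share a code at any point; the new row is skipped by the shift because its path_code is still NULL.

```python
import sqlite3

PCS_LENGTH = 2  # path_code section length

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT,
                entry_level INTEGER, sibling_no INTEGER, path_code TEXT);
INSERT INTO t VALUES
  (1, NULL, 'a',  0, 1, '01'),
  (3, 1,    'b1', 1, 1, '0101'),
  (2, 1,    'b3', 1, 2, '0102'),
  (6, 2,    'c1', 2, 1, '010201');
""")

def insert_in_middle(conn, xid, parent_xid, entry_code):
    """Insert a child at its entry_code position (non-root inserts only)."""
    with conn:
        p_level, p_code = conn.execute(
            "SELECT entry_level, path_code FROM t WHERE xid = ?",
            (parent_xid,)).fetchone()
        level = p_level + 1
        conn.execute(
            "INSERT INTO t (xid, parent_xid, entry_code, entry_level) "
            "VALUES (?, ?, ?, ?)", (xid, parent_xid, entry_code, level))
        # Make room: siblings that sort after the new entry move up by one.
        conn.execute(
            "UPDATE t SET sibling_no = sibling_no + 1 "
            "WHERE parent_xid = ? AND entry_code > ?", (parent_xid, entry_code))
        conn.execute(
            "UPDATE t SET sibling_no = (SELECT COALESCE(MAX(sibling_no), 0) + 1 "
            "FROM t WHERE parent_xid = ? AND entry_code < ?) WHERE xid = ?",
            (parent_xid, entry_code, xid))
        sibling_no = conn.execute(
            "SELECT sibling_no FROM t WHERE xid = ?", (xid,)).fetchone()[0]
        new_code = f"{p_code}{sibling_no:0{PCS_LENGTH}d}"
        # Shift the section at the new row's level for later siblings and
        # their subtrees; '%02d' matches PCS_LENGTH = 2.
        conn.execute("""
            UPDATE t SET path_code =
                substr(path_code, 1, :start)
                || printf('%02d', CAST(substr(path_code, :start + 1, :len) AS INTEGER) + 1)
                || substr(path_code, :start + :len + 1)
            WHERE path_code LIKE :parent || '%' AND path_code >= :new_code""",
            {"start": PCS_LENGTH * level, "len": PCS_LENGTH,
             "parent": p_code, "new_code": new_code})
        conn.execute("UPDATE t SET path_code = ? WHERE xid = ?", (new_code, xid))

insert_in_middle(conn, 7, 1, "b2")
result = dict(conn.execute("SELECT entry_code, path_code FROM t"))
```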

Page 14: DML patterns (delete)

(Delete node C and all its descendants)
    delete from T where path_code like C.path_code||'%';

(Shift sibling_no for those siblings elder than C; P is the parent of C)
    update T set sibling_no = sibling_no - 1 where parent_xid = P.xid and sibling_no > C.sibling_no;

(Shift path_code for those siblings elder than C and their descendants; the section at C's level is decremented from its old value, since a descendant row's own sibling_no belongs to a deeper section)
    update T set path_code = substr(path_code, 1, pcs_length*C.entry_level) || to_char(to_number(substr(path_code, pcs_length*C.entry_level + 1, pcs_length)) - 1, 'FM00') || substr(path_code, pcs_length*(C.entry_level+1) + 1) where path_code like P.path_code||'%' and path_code > C.path_code;

(Reset is_leaf of the parent P of C according to whether P has other children)
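The delete pattern can be sketched the same way, assuming SQLite as a stand-in engine. The function name delete_subtree is hypothetical. The values of C (parent, level, sibling_no, path_code) are read before the delete, since the later statements still need them.

```python
import sqlite3

PCS_LENGTH = 2  # path_code section length

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (xid INTEGER PRIMARY KEY, parent_xid INTEGER, entry_code TEXT,
                entry_level INTEGER, sibling_no INTEGER, is_leaf INTEGER,
                path_code TEXT);
INSERT INTO t VALUES
  (1, NULL, 'a',  0, 1, 0, '01'),
  (3, 1,    'b1', 1, 1, 1, '0101'),
  (2, 1,    'b2', 1, 2, 0, '0102'),
  (6, 2,    'c1', 2, 1, 1, '010201'),
  (5, 2,    'c2', 2, 2, 1, '010202'),
  (4, 2,    'c3', 2, 3, 1, '010203');
""")

def delete_subtree(conn, xid):
    """Delete a node with its subtree and close the gap it leaves."""
    with conn:
        parent_xid, level, sibling_no, code = conn.execute(
            "SELECT parent_xid, entry_level, sibling_no, path_code "
            "FROM t WHERE xid = ?", (xid,)).fetchone()
        conn.execute("DELETE FROM t WHERE path_code LIKE ? || '%'", (code,))
        # IS (rather than =) also matches NULL, so deleting a root works too.
        conn.execute(
            "UPDATE t SET sibling_no = sibling_no - 1 "
            "WHERE parent_xid IS ? AND sibling_no > ?", (parent_xid, sibling_no))
        # Decrement the section at the deleted node's level for later
        # siblings and their subtrees; '%02d' matches PCS_LENGTH = 2.
        conn.execute("""
            UPDATE t SET path_code =
                substr(path_code, 1, :start)
                || printf('%02d', CAST(substr(path_code, :start + 1, :len) AS INTEGER) - 1)
                || substr(path_code, :start + :len + 1)
            WHERE path_code LIKE :parent || '%' AND path_code > :code""",
            {"start": PCS_LENGTH * level, "len": PCS_LENGTH,
             "parent": code[:-PCS_LENGTH], "code": code})
        # The parent becomes a leaf if no child is left.
        conn.execute(
            "UPDATE t SET is_leaf = NOT EXISTS "
            "(SELECT 1 FROM t AS c WHERE c.parent_xid = t.xid) WHERE xid = ?",
            (parent_xid,))

delete_subtree(conn, 3)   # remove b1; b2 and its subtree shift down
after_first = dict(conn.execute("SELECT entry_code, path_code FROM t"))
delete_subtree(conn, 2)   # remove b2 with c1..c3; a becomes a leaf
after_second = conn.execute("SELECT entry_code, path_code, is_leaf FROM t").fetchall()
```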

Page 15: Complexity Analysis

Space

– The length of path_code grows on the order of the logarithm of the total number of rows in the table (for non-degenerate hierarchical data).

– e.g. a length of about c*20 for 1M rows, for some constant c (log2 of 1M is about 20).

Time

– Queries run largely as index range scans, usually the fastest access path available.

– Inserts and deletes may involve sub-tree processing (deleting or updating a whole sub-tree).
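The space estimate can be checked with a short calculation. This sketch assumes a complete tree in which every node has the same fan-out, and a section just wide enough to hold the largest sibling_no; both function names are hypothetical.

```python
def levels_needed(n_rows, fan_out):
    """Number of levels a complete fan_out-ary tree needs to hold n_rows."""
    levels, capacity, nodes_at_level = 0, 0, 1
    while capacity < n_rows:
        capacity += nodes_at_level
        nodes_at_level *= fan_out
        levels += 1
    return levels

def path_code_length(n_rows, fan_out):
    """Length of the longest path_code: one section per level."""
    section_len = len(str(fan_out))  # digits needed for sibling_no 1..fan_out
    return levels_needed(n_rows, fan_out) * section_len

binary_1m = path_code_length(1_000_000, 2)     # 20 levels, 1 digit each
decimal_1m = path_code_length(1_000_000, 10)   # 7 levels, 2 digits each
chain_1k = path_code_length(1_000, 1)          # degenerate case: a single chain
```

The degenerate chain shows why the logarithmic bound is stated only for non-degenerate data: with fan-out 1 the code grows linearly with the number of rows.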

Page 16: Comparison with Connect-By

    Category                     | Connect-By                               | Path code approach
    Data pattern                 | General directed graph                   | Strictly hierarchical
    Table design                 | Recursive basic                          | Recursive basic + extra H-columns
    Disk space cost              | Minimum                                  | Overhead of up to log n
    Time efficiency*             | Incapable of direct indexing, or unknown | Capable of direct indexing
    RDBMS vendor independence*   | No                                       | Yes
    Query complexity*            | It depends                               | It depends
    DML complexity               | Minimum                                  | Substantial, but in a small set of patterns

Page 17: A stretched idea on RDBMS design

• Make the entry_id / parent_entry_id relationship declarative.

• Enforce a hierarchy constraint to the effect that each node can have only zero or one parent node.

• Have the RDBMS create and maintain path_code, entry_level, etc., the way it creates and maintains function-based indexes.

• Add syntax to SQL similar to Oracle's connect-by, with the added benefit of taking advantage of the hidden indexes.

Page 18: Additional Questions and References

• Whether and how to generalize the design to node non-uniform hierarchical data?

• For the latest alternative approaches, see e.g. http://www.inconcept.com/JCM/May2005/David.html (Using ANSI SQL as a Conceptual Hierarchical Data Modeling and Processing Language for XML, by Michael M David)