A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate...

16
A In-Memory Compressed XML A In-Memory Compressed XML Representation of Representation of Astronomical Data Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student O’Neil Delpratt – PhD Student University of Leicester University of Leicester Supervisors: Supervisors: Rajeev Raman, Clive Page, Tony Linde & Rajeev Raman, Clive Page, Tony Linde & Mike Watson – University of Leicester Mike Watson – University of Leicester

Transcript of A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate...

Page 1: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

A In-Memory Compressed XML A In-Memory Compressed XML Representation of Representation of Astronomical DataAstronomical Data

A In-Memory Compressed XML A In-Memory Compressed XML Representation of Representation of Astronomical DataAstronomical Data

PPARC UK e-Science Postgraduate School ’05

O’Neil Delpratt – PhD StudentO’Neil Delpratt – PhD Student

University of LeicesterUniversity of Leicester

O’Neil Delpratt – PhD StudentO’Neil Delpratt – PhD Student

University of LeicesterUniversity of Leicester

Supervisors:Supervisors:

Rajeev Raman, Clive Page, Tony Linde & Rajeev Raman, Clive Page, Tony Linde &

Mike Watson – University of LeicesterMike Watson – University of Leicester

Supervisors:Supervisors:

Rajeev Raman, Clive Page, Tony Linde & Rajeev Raman, Clive Page, Tony Linde &

Mike Watson – University of LeicesterMike Watson – University of Leicester

Page 2: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

• Topics:– Outline of Project– Role in Project

– Current Work

– Conclusion

• Topics:– Outline of Project– Role in Project

– Current Work

– Conclusion

IntroductionIntroductionIntroductionIntroduction

Page 3: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Outline of ProjectOutline of ProjectOutline of ProjectOutline of Project

Building a data-grid for UK astronomy.Building a data-grid for UK astronomy.

Consist of a wide variety of web/grid services, Consist of a wide variety of web/grid services, serving up data to be combined and processed, serving up data to be combined and processed, and ultimately displayed to the user.and ultimately displayed to the user.

•Astronomers Previously used FITS binary.Astronomers Previously used FITS binary.•International Virtual Observatory AllianceInternational Virtual Observatory Alliance ( (IVOA) IVOA) Introduced VOTable xml format.Introduced VOTable xml format.

Building a data-grid for UK astronomy.Building a data-grid for UK astronomy.

Consist of a wide variety of web/grid services, Consist of a wide variety of web/grid services, serving up data to be combined and processed, serving up data to be combined and processed, and ultimately displayed to the user.and ultimately displayed to the user.

•Astronomers Previously used FITS binary.Astronomers Previously used FITS binary.•International Virtual Observatory AllianceInternational Virtual Observatory Alliance ( (IVOA) IVOA) Introduced VOTable xml format.Introduced VOTable xml format.

AstroGrid ProjectAstroGrid ProjectAstroGrid ProjectAstroGrid Project

Page 4: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Main ProblemMain ProblemMain ProblemMain Problem- XML-based files are larger than equivalent formats, - Network bandwidth is limited.- Adhoc approach- Slow searching

- XML-based files are larger than equivalent formats, - Network bandwidth is limited.- Adhoc approach- Slow searching

Concern:Concern:

Using the VOTable XML-based astronomical data format we know that it:• Is Basic (XML representation of a 2d binary table. ).• Has Static Data (features for big-data and grid computing)• Allows data to be stored separately, with the remote data link

Using the VOTable XML-based astronomical data format we know that it:• Is Basic (XML representation of a 2d binary table. ).• Has Static Data (features for big-data and grid computing)• Allows data to be stored separately, with the remote data link

VOTable is designed as a flexible storage and exchange format for tabular data, with emphasis on astronomical tables.

VOTable is designed as a flexible storage and exchange format for tabular data, with emphasis on astronomical tables.

Page 5: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Role in ProjectRole in Project

Compressed fileRepresentation supporting sequential searching

Compression tool

In-memory XML representation supporting queries

•Provide software tool for the Astrogrid Group.

•Remove the adhoc approach of accessing astronomical data in XML documents.

•Remove the limitation of parsing large XML documents, allowing us to make ‘proper’ flexible XML files rather than representation of 2D tables.

Page 6: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

An existing XML-specific compression tools are:

•XMILL [H. Liefke and D. Suciu, 2000]

Files are inaccessible until uncompressed.

Therefore we need to develop a software tool that provides both:

•Compression •Powerful operations on the compressed XML documents.

Page 7: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

RequirementsRequirements

•Produce a software tool for efficient storage and processing of large XML files that arise in the context of IVOA.

•Remove adhoc way of accessing XML data.

•Provide efficient searching facilities of VOTable files.

•Rapid querying operations.

•Reduce bandwidth.

•Allow for merging of XML files.

Page 8: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

• Topics:• Succinct DOM• Using PrefixSum Data Structures in DOM• Experimental Results

• Topics:• Succinct DOM• Using PrefixSum Data Structures in DOM• Experimental Results

Current WorkCurrent WorkCurrent WorkCurrent Work

Page 9: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Succinct DOMSuccinct DOM• DOM is a specification for representing XML documents in main memory.

• XML file is parsed and represented using our succinct DOM implementation

What is Succinctness?What is Succinctness?

A succinct data structure is one that uses space approaching the information theoretic minimum for the required structure yet ideally supports the desired operations in constant time, unconstrained by the size of the structure.

Page 10: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Parenthesis Data Structure:

Components of Succinct DOMComponents of Succinct DOM

<Directory><Directory> <Name>Basic</Name><Name>Basic</Name> <File type=<File type=“XML”“XML”>> <FName>Basic</FName><FName>Basic</FName> <Created><Created> </Time></Time> <Hour>15</Hour><Hour>15</Hour> <Minute>00</Minute><Minute>00</Minute> </Time></Time> <Date>10/05/05<Date>10/05/05 </Date></Date> </Created> </Created> <Note>Files details</Note><Note>Files details</Note> </File></File> <Note>XML files</Note><Note>XML files</Note></Directory></Directory>

Tree RepresentationTree Representation

9

7 8

6

4 105

2 3 11

1Directory

Name File Note

FName

Created

Note

Time Date

Hour Minutes

• Currently using pointers• Parenthesis we have 2 bits per

node.( ( ) ( ( ) ( ( ( ) ( ) ) ( ) ) ( ) ) ( ) )1 2 3 4 5 6 7 8 9 10 11

As a BitVector:0010010001011011011011

[Geary, Rahman, R. Raman, V Raman. 2004]

Page 11: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Components of Succinct DOMComponents of Succinct DOM

PrefixSum Data Structure Motivation:PrefixSum Data Structure Motivation:PrefixSum Data Structure Motivation:PrefixSum Data Structure Motivation:

We can store the text node data from XML files as a concatenated string.

Store the offsets for each text node data in an integer array.

Problem Here:

• The text nodes often take most of the XML tree, sometimes 50%.

•Therefore, the offsets as 32-bit integers in memory would not be space efficient.

Text node1 … … ..... “ “

offset1 offset2 offset3 offset4 offset5 offset6

Page 12: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Components of Succinct DOMComponents of Succinct DOM

PrefixSum Data StructurePrefixSum Data StructurePrefixSum Data StructurePrefixSum Data Structure

Given a sequence of k numbers, knn ,...,1 , where 0in , for all i, the prefixsums for

the sequence are kdd ,...,1 , where kk nndnndnd ...,...,, 121211 .

ADT PrefixSum is ABSTRACT DATA a non-negative integer k and k positive integers n_1,...,n_k OPERATIONS Constructor(n_1,...,n_k) initialises k and n_1,...,n_k Sum(i) returns the value n_1+...+n_i, where 1<=i<=k END

The requirements for the PrefixSum Data structure to The requirements for the PrefixSum Data structure to represent with an Abstract Data Type:represent with an Abstract Data Type:The requirements for the PrefixSum Data structure to The requirements for the PrefixSum Data structure to represent with an Abstract Data Type:represent with an Abstract Data Type:

Page 13: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

PrefixSum Data Structure Continued...PrefixSum Data Structure Continued...

We consider three integer-coding methods that are used in inverted file compression:

• Gamma Code – requires bits.

• Delta Code –requires bits.

• Golomb – requires bits.

xlog21

xx log2loglog21

bbx log/)1(

There are existing integer-coding methods that we can use to encode the number 1,…,x compactly.

Page 14: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Components of Succinct DOMComponents of Succinct DOM

Some Of the DOM methods we will implement:

•childNodes

•parentNode

•Attributes

•nextSibling

•previousSibling

•nodeType

•nodeName

•nodeValue

Page 15: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Experimental ResultsExperimental Results

Files

Original S

ize

Bzip2

Su

ccinct D

OM

UNSPSC-2.04.XML 1.02MB 107KB 294KB Mondial-3.0.xml 1.70MB 107KB 324KB Orders.xml 5.12MB 285KB 1,159KB xCRL.xml 8.29MB 256KB 1,029KB Votable2.xml 15.10MB 1,622KB 4,335KB Nasa.xml 23.80MB 2,688KB 8,279KB XPath.xml 49.80MB 1,323KB 7,437KB

Page 16: A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Challenges that remainChallenges that remain

• Improve the results of the succinct DOM and that of the Prefix-sum data structures.

• Challenges to bring space usage closer to bzip2.

• Develop a compressed file and in-memory representation.

• Relate the XML representation data structures to VOTable files for further compression improvements.