![Page 1: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/1.jpg)
Supporting High-Performance Data Processing on Flat-Files
Xuan Zhang
Gagan Agrawal
Ohio State University
![Page 2: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/2.jpg)
Motivation
• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential
Figure provided by PDB
![Page 3: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/3.jpg)
Existing Solutions
– (Relational) Databases
  • Support for indexing and high-level queries
  • Not suitable for biological data
– Flat Files with Scripts
  • Compact, Perl scripts available
  • Lack indexing and high-level query processing
– Web-services
  • Significant overhead
![Page 4: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/4.jpg)
Our Approach
• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data processing
  – Usability
    • Declarative interface
    • Low programming requirement
  – Performance
    • Incorporate indexing support
![Page 5: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/5.jpg)
Approach Summary
• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module
![Page 6: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/6.jpg)
System Overview
[Figure: system overview diagram with two phases, "Understand Data" and "Process Data". Labeled elements: Data File, Layout Miner, Schema Miner, Metadata Description (paired Layout Descriptor and Schema Descriptor), Information Integration System, Code Generation, Request Processor, User Request, Answer.]
![Page 7: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/7.jpg)
Advantages
• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• Reasonable performance
  – Linear scaling guaranteed
  – Can be improved by using indexing
![Page 8: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/8.jpg)
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query processing
  – Query processing with indices
![Page 9: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/9.jpg)
Data Process Overview
• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module
![Page 10: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/10.jpg)
Metadata Description
• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy to write and interpret
![Page 11: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/11.jpg)
Schema Descriptors
• Follow XML DTD standard for semi-structured data

    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>

• Simple attribute list for relational data

    [FASTA]                //Schema Name
    ID = string            //Data type definitions
    DESCRIPTION = string
    SEQ = string
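As a rough illustration of how little machinery the attribute-list form needs, the sketch below parses such a descriptor into a schema name and an attribute-to-type mapping. The function name `parse_schema` and the handling of `//` comments are assumptions made for this example; the slides do not show the system's actual parser.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (name, {attr: type})."""
    name = None
    attrs = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()          # e.g. [FASTA]
        elif "=" in line:
            attr, dtype = (part.strip() for part in line.split("=", 1))
            attrs[attr] = dtype                # e.g. ID = string
    return name, attrs


descriptor = """
[FASTA]                //Schema Name
ID = string            //Data type definitions
DESCRIPTION = string
SEQ = string
"""
schema_name, schema_attrs = parse_schema(descriptor)
print(schema_name, schema_attrs)
```

The resulting mapping is the kind of logical view a layout descriptor can then refer to by schema name.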
![Page 12: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/12.jpg)
Layout Descriptors
• Overall structure (FASTA example)

    DATASET “FASTAData” {        //Dataset name
      DATATYPE {FASTA}           //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}           //File location
    }
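To make the physical side concrete, here is a minimal sketch of the kind of reading logic a layout description for FASTA has to capture. It assumes the common FASTA convention (a `>` header line whose first token is the ID and whose remainder is the description, followed by wrapped sequence lines); the elided layout details in the descriptor above are not reproduced, and `read_fasta` is a name invented for this example.

```python
import io


def read_fasta(handle):
    """Yield (ID, DESCRIPTION, SEQ) tuples from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _entry(header, chunks)   # previous record is complete
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)                # one wrapped line of sequence
    if header is not None:
        yield _entry(header, chunks)


def _entry(header, chunks):
    """Split the header into ID and description and join the sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


sample = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(sample))
print(records)
```

Each tuple lines up with the three attributes of the FASTA schema, which is the link between the layout and schema descriptors.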
![Page 13: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/13.jpg)
Wrapper Generation: System Overview
[Figure: architecture diagram of the wrapper generation system. Labeled elements: Schema Descriptors, Layout Descriptor, Mapping File, Mapping Generator, Schema Mapping, Mapping Parser, Layout Parser, Data Entry Representation, Application Analyzer, WRAPINFO, wrapper, DataReader, Synchronizer, DataWriter, Source Dataset, Target Dataset.]
![Page 14: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/14.jpg)
Query With Indices: Motivation
• Goal
  – Improve the performance of query-proc program
    • Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming
![Page 15: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/15.jpg)
Challenges & Approaches
• Various indexing algorithms for various biological data
  – User defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
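The "byte offset as pointer" idea can be sketched in a few lines: scan the flat file once, record where each entry starts, and later seek directly to that position. The functions `build_offset_index` and `fetch_record`, the in-memory dictionary, and the FASTA-style sample data are all assumptions for illustration; the system's real index functions are user defined and are not shown in the slides.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()                  # position before reading the line
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path, offset):
    """Seek to a stored offset and return the raw text of one record."""
    with open(path, "rb") as f:
        f.seek(offset)
        lines = [f.readline()]                 # the header line itself
        for line in iter(f.readline, b""):
            if line.startswith(b">"):          # next record begins here
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Made-up sample data written to a temporary flat file.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">P1 alpha\nACGT\n>P2 beta\nGGCC\nTTAA\n")
    demo_path = tmp.name

offsets = build_offset_index(demo_path)
record = fetch_record(demo_path, offsets["P2"])
os.remove(demo_path)
print(offsets, record)
```

Because the pointer is just a byte position, the data stays in its original flat file and no reformatting or loading step is needed.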
![Page 16: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/16.jpg)
System Revisited
[Figure: query processing architecture extended with indexing, divided into "Query analysis" and "Query execution". Labeled elements: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, Source/target names, Schema & Layout information mappings, QUERYINFOR, DataReader, Synchronizer, DataWriter, Source data files, Target data file, Index file, Index functions.]
![Page 17: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/17.jpg)
Language Enhancement
• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
  – Maintain query format

    DATASET “name” {
      …
      INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
             [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
    }

    AUTOWRAP GNAMES
    FROM CHIPDATA, YEASTGENOME
    BY CHIPDATA.GENE = YEASTGENOME.ID
    WHERE …

New meaning of “=”:
  If index available, use index retrieving function
  Else, compare values directly
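The two pieces above, the colon-separated INDEX entry and the index-aware meaning of “=”, can be sketched as follows. This is only an illustration under stated assumptions: `parse_index_entry`, `join_equal`, the sample entry string, the dictionary standing in for an index file, and the gene rows are all hypothetical and not taken from the system.

```python
def parse_index_entry(entry):
    """Split one INDEX entry into its five colon-separated fields."""
    keys = ("attribute", "index_file_loc", "index_gen_fun",
            "index_retr_fun", "fun_loc")
    values = entry.split(":")
    if len(values) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(values)}")
    return dict(zip(keys, values))


def join_equal(left_rows, left_attr, right_rows, right_attr, retrieve=None):
    """Evaluate left.attr = right.attr, using an index retriever if given."""
    for left in left_rows:
        key = left[left_attr]
        if retrieve is not None:
            matches = retrieve(key)            # index available: look the key up
        else:
            matches = [r for r in right_rows if r[right_attr] == key]  # full scan
        for right in matches:
            yield left, right


# Hypothetical entry; paths and function names are placeholders.
info = parse_index_entry("ID:idx/genome.idx:gen_id_index:get_by_id:lib/index_funcs")

# Made-up rows standing in for CHIPDATA and YEASTGENOME.
chip = [{"GENE": "G1"}, {"GENE": "G9"}]
genome = [{"ID": "G1", "NAME": "name-1"}, {"ID": "G2", "NAME": "name-2"}]

# In-memory stand-in for an index file keyed on YEASTGENOME.ID.
lookup = {}
for row in genome:
    lookup.setdefault(row["ID"], []).append(row)

with_index = list(join_equal(chip, "GENE", genome, "ID",
                             retrieve=lambda k: lookup.get(k, [])))
without_index = list(join_equal(chip, "GENE", genome, "ID"))
print(with_index == without_index)
```

Both paths return the same pairs, which is why the query text can stay unchanged: only the evaluation strategy for “=” differs when an index is declared.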
![Page 18: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/18.jpg)
System Enhancement
• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing
![Page 19: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/19.jpg)
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Figure: bar chart of performance (sec), axis 0 to 90, with bars for query analysis, index generation, query with indices, and query w/o indices. Data labels in the extracted text: 0.01, 0.72, 20.89, 81.59.]
![Page 20: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/20.jpg)
BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Figure: chart of performance (sec), axis 0 to 1200, with series for index generation, query w/ indices, and query w/o indices. The labels 3, 5, 12 also appear in the extracted text.]
![Page 21: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/21.jpg)
OMIM-PLUS Query
• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Figure: chart of performance (sec) on a logarithmic axis from 1 to 10000000, with series for index generation, query w/ indices, and query w/o indices.]
![Page 22: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/22.jpg)
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
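A simplified sketch of the general idea, turning sub-string composition into n-D numerical values that a spatial index can store, is given below. It is not the algorithm used in the experiments that follow (those use wavelet coefficients); the window size, the plain frequency vectors, and the function names are assumptions chosen to keep the example short.

```python
def composition_vectors(seq, window, alphabet="ACGT"):
    """Return one symbol-frequency vector per sliding window over seq."""
    vectors = []
    for start in range(len(seq) - window + 1):
        piece = seq[start:start + window]
        vectors.append(tuple(piece.count(ch) / window for ch in alphabet))
    return vectors


def bounding_rectangle(vectors):
    """Return (lower, upper) corners of the box enclosing all vectors."""
    lower = tuple(min(col) for col in zip(*vectors))
    upper = tuple(max(col) for col in zip(*vectors))
    return lower, upper


def may_match(query_vec, rect, radius):
    """Cheap filter: is query_vec within radius of the box on every axis?"""
    lower, upper = rect
    return all(lo - radius <= q <= hi + radius
               for q, lo, hi in zip(query_vec, lower, upper))


vecs = composition_vectors("AACCGGTT", window=4)
rect = bounding_rectangle(vecs)
print(rect)
print(may_match((0.25, 0.25, 0.25, 0.25), rect, radius=0.0))
print(may_match((1.0, 0.0, 0.0, 0.0), rect, radius=0.1))
```

The rectangle acts as a coarse summary of a stretch of sequence: a query vector that falls outside it (plus a tolerance) can be discarded without comparing the underlying strings.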
![Page 23: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/23.jpg)
Homology Search (1)
• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles
[Figure: performance (sec), axis 0 to 350, versus database size (9.8MB), x-axis ticks 1 to 5. Legend entries: Index generation, 10, 20, 40.]
![Page 24: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/24.jpg)
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree
[Figure: performance (sec), axis 0 to 30, versus database size (250MB), x-axis ticks 1 to 5. Legend entries: 10, 20, 40.]
![Page 25: Supporting High- Performance Data Processing on Flat-Files Xuan Zhang Gagan Agrawal Ohio State University.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baadd3/html5/thumbnails/25.jpg)
Conclusions
• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
  – Support for indexing incorporated flexibly